Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification

Bibliographic Details
Main Author: Yazdani, Sepideh Foroozan
Format: Thesis
Language: English
Published: 2017
Subjects:
Online Access: http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf
Description
Summary: This thesis addresses the sentiment classification of financial news, a task within the field of artificial intelligence, using a combination of machine learning, linguistic, and statistical methods. The motivation for this approach comes from the human emotion and vital information contained in financial news, such as news reports, and their impact on the market. In recent years, a huge amount of this information has become accessible in text format for investment and research analysis, and investors and researchers can easily obtain the desired information through a variety of channels on the Internet. Despite the studies conducted on automated sentiment classification of financial news, challenges remain in the parts of text mining and financial news classification that concern feature extraction, feature selection, and classification. Most existing literature on financial news sentiment relies on very simple linguistic features, such as Bag-of-Words (BOW), in which each piece of news is represented by its distinct words with their frequencies as the feature type, and only a few studies have employed more complex approaches. Not all words are needed to represent a given text. The primary downside of BOW, or unigrams, is the huge number of linguistic features it produces. The second downside is that the linguistic features carry too much information to serve directly as features, and it is not clear which ones matter for the sentiment classification of financial news. Furthermore, since words are extracted on the basis of high frequency, low-frequency linguistic features, which can be valuable, are typically ignored. This research proposes two feature process models, the Ngram-based and the NgramPOS-based models, for the sentiment classification of financial news. The Ngram-based model uses statistical approaches for feature processing in order to classify financial news.
This high frequency-based model combines unigrams and bigrams with Term Frequency-Inverse Document Frequency (TF-IDF), an unsupervised feature weighting method, and applies the Document Frequency (DF) method with a fixed threshold for dimensionality reduction, since DF is suitable for high-dimensional feature spaces. The NgramPOS-based model enhances the feature processing of the Ngram-based model. It employs a combination of statistical and linguistic approaches to extract sentiment information as features in order to classify financial news. This low frequency-based model extracts combinations of sentiment-rich words and phrases as unigrams and bigrams using predefined POS-based fixed patterns together with the binary weighting method, and applies Principal Component Analysis (PCA) as an unsupervised method to reduce the dimension of the extracted feature space. Both models use an RBF Support Vector Machine (SVM) with optimized parameters (C, γ) to classify financial news as positive or negative. Experiments showed that combining unigram and bigram features with the TF-IDF and binary feature weighting methods in the two models leads to the best financial news classification results among diverse feature spaces, with accuracies of 97.34% and 67.19% for the two models, respectively.
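
The two feature processing steps described in the summary can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `min_df` threshold value, the textbook TF-IDF formula (term frequency times log(N/DF)), the toy sentences, and the POS tag patterns are all assumptions made for illustration, since the summary does not specify them. The PCA and RBF SVM stages are omitted, and the second example uses hand-tagged input rather than a real POS tagger.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return unigrams and bigrams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_with_df_threshold(docs, min_df=2):
    """Ngram-style features: keep terms with document frequency >= min_df
    (assumed threshold), then weight them with a textbook TF-IDF."""
    doc_grams = [Counter(ngrams(d.lower().split())) for d in docs]
    # Iterating a Counter yields each distinct term once, so this counts documents.
    df = Counter(term for grams in doc_grams for term in grams)
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        total = sum(grams.values()) or 1  # avoid division by zero on empty docs
        vectors.append(
            [(grams[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical tag patterns; the summary does not list the actual ones.
UNIGRAM_TAGS = {"JJ"}                       # adjectives
BIGRAM_TAGS = {("JJ", "NN"), ("RB", "VB")}  # adjective+noun, adverb+verb


def pos_pattern_terms(tagged_doc):
    """Return words and word pairs whose POS tags match a pattern."""
    terms = {word for word, tag in tagged_doc if tag in UNIGRAM_TAGS}
    for (w1, t1), (w2, t2) in zip(tagged_doc, tagged_doc[1:]):
        if (t1, t2) in BIGRAM_TAGS:
            terms.add(f"{w1} {w2}")
    return terms


def binary_vectors(tagged_docs):
    """NgramPOS-style features: 1 if the term occurs in the document, else 0."""
    doc_terms = [pos_pattern_terms(d) for d in tagged_docs]
    vocab = sorted(set().union(*doc_terms))
    return vocab, [[1 if t in terms else 0 for t in vocab] for terms in doc_terms]


# Toy data, invented for this example.
docs = [
    "profit rises sharply",
    "profit falls sharply",
    "shares rise as profit rises",
]
vocab, weights = tfidf_with_df_threshold(docs, min_df=2)

tagged_docs = [
    [("strong", "JJ"), ("growth", "NN"), ("reported", "VB")],
    [("weak", "JJ"), ("demand", "NN"), ("sharply", "RB"), ("fell", "VB")],
]
pos_vocab, pos_vectors = binary_vectors(tagged_docs)
```

In this sketch the DF threshold discards every term that appears in only one document, which shrinks the feature space before weighting, while the POS patterns keep rare but sentiment-bearing phrases such as "weak demand" regardless of their frequency. With this particular IDF formula, a term that occurs in every document receives a weight of zero.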