Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2017 |
Subjects: | |
Online Access: | http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf |
Summary: | This thesis addresses the task of sentiment classification for financial news, within the
field of artificial intelligence, using a combination of machine learning, linguistics, and
statistical methods. The motivation for this approach comes from the human emotion and
vital information contained in financial news, such as news reports, and from its impact on the
market. In recent years, a huge amount of this information has become accessible in text format
for investment and research analysis, and investors and researchers can easily obtain
the desired information through a variety of channels on the Internet.
Despite the studies conducted on automated sentiment classification of financial news,
challenges remain in text mining and financial news classification, particularly in
the feature extraction, feature selection, and classification processes. Most
existing literature on financial news sentiment relies on very simple linguistic
features, such as Bag-of-Words (BOW), in which each piece of news is represented
by its distinct words with their frequencies as the feature type, and only a few
studies have employed more sophisticated approaches. Not all words are needed to
represent a given text. The primary downside of BOW, or unigrams, is the huge number
of linguistic features that it produces. The second downside is that these linguistic features
carry too much information to all serve as features, while it is not clear which ones are
important for the sentiment classification of financial news. Furthermore, since
words are extracted on the basis of high frequency, low-frequency
linguistic features that may be worthwhile are typically ignored. This research proposes two
feature process models, the Ngram-based and the NgramPOS-based models, for the sentiment
classification of financial news.
The Ngram-based model uses statistical approaches for feature processing in order
to classify financial news. This high frequency-based model combines unigrams and
bigrams with Term Frequency-Inverse Document Frequency (TF-IDF), an unsupervised feature
weighting method, and applies the Document Frequency (DF) method
with a certain threshold for dimensionality reduction, since DF is suitable for
high-dimensional feature spaces.
The NgramPOS-based model can enhance the feature processing performance of the
Ngram-based model. It employs a combination of statistical and
linguistic approaches to extract sentiment information as features in order to classify
financial news. This low frequency-based model extracts combinations of sentiment-rich
words and phrases as unigrams and bigrams using defined fixed POS-based
patterns together with the binary weighting method, and applies Principal Component
Analysis (PCA) as an unsupervised method to reduce the dimension of the extracted
feature space.
Both models use an RBF Support Vector Machine (SVM) with optimized parameters
(C, γ) to classify the financial news as positive or negative. Experiments showed
that combining unigram and bigram features with the TF-IDF and binary
feature weighting methods gave the best result in financial news
classification among the diverse feature spaces tested in both models, with accuracies of
97.34% and 67.19% for the two models, respectively. |
---|
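The feature processing of the Ngram-based model described in the summary (unigrams and bigrams, a Document Frequency cutoff, then TF-IDF weighting) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the whitespace tokenizer, the `min_df` threshold, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration, and the toy sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Return the unigrams plus the adjacent-word bigrams of a token list."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_features(docs, min_df=2):
    """Weight unigram+bigram features by TF-IDF after a DF cutoff.

    min_df is a hypothetical threshold; the record does not state the value used.
    """
    term_lists = [ngrams(doc.lower().split()) for doc in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in term_lists for term in set(terms))

    # DF-based dimensionality reduction: keep terms found in >= min_df documents.
    vocab = sorted(term for term, count in df.items() if count >= min_df)

    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        # Plain tf * log(N / df); other TF-IDF variants add smoothing or normalization.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Invented toy headlines, for illustration only.
docs = [
    "profit rose sharply",
    "profit fell sharply",
    "profit rose again",
]
vocab, vectors = tfidf_features(docs, min_df=2)
print(vocab)
```

Note that a term occurring in every document ("profit" here) receives weight zero under this unsmoothed variant, which is one reason library implementations often smooth the IDF term.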
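The NgramPOS-based model is described as extracting unigrams and bigrams that match fixed POS-based patterns and weighting them in binary form. The sketch below shows one way such a step could look. The tag patterns are hypothetical (the record does not list the thesis's patterns), the input is assumed to be already POS-tagged with Penn Treebank style tags, and the tagged examples are invented. The PCA step is omitted; the binary matrix produced here would be its input.

```python
# Hypothetical POS patterns (Penn Treebank style tags); the thesis's own
# patterns are not given in this record.
UNIGRAM_TAGS = {"JJ", "RB"}
BIGRAM_TAGS = {("JJ", "NN"), ("RB", "JJ"), ("RB", "VBD")}


def pos_features(tagged):
    """Extract words and adjacent word pairs whose tags match the fixed patterns."""
    feats = {word for word, tag in tagged if tag in UNIGRAM_TAGS}
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in BIGRAM_TAGS:
            feats.add(f"{w1} {w2}")
    return feats


def binary_matrix(tagged_docs):
    """Binary weighting: 1 if a feature occurs in the document, else 0."""
    doc_feats = [pos_features(doc) for doc in tagged_docs]
    vocab = sorted(set().union(*doc_feats))
    rows = [[1 if f in feats else 0 for f in vocab] for feats in doc_feats]
    return vocab, rows


# Invented, hand-tagged examples for illustration only.
tagged_docs = [
    [("strong", "JJ"), ("earnings", "NN"), ("rose", "VBD"), ("sharply", "RB")],
    [("shares", "NN"), ("sharply", "RB"), ("fell", "VBD")],
]
vocab, rows = binary_matrix(tagged_docs)
print(vocab)
print(rows)
```

Because the patterns select features by grammatical role and not by count, a phrase that appears in only one document can still enter the feature space, which matches the summary's description of this model as low frequency-based.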
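For readers unfamiliar with the two SVM parameters named in the summary, the standard textbook formulation is given below. It is the usual soft-margin SVM with an RBF kernel, not necessarily the exact setup used in the thesis: γ controls the width of the kernel, and C sets the penalty on margin violations.

```latex
% RBF kernel between feature vectors x and x'
K(x, x') = \exp\!\left(-\gamma \,\lVert x - x' \rVert^{2}\right), \qquad \gamma > 0

% Soft-margin SVM objective, with slack variables \xi_i and penalty C
\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```

Here the labels y_i ∈ {−1, +1} would correspond to negative and positive news, and φ is the feature map implied by the kernel K.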