Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification

This thesis applies sentiment classification, a task within artificial intelligence, to financial news using a combination of machine learning, linguistic, and statistical methods. The motivation for this approach comes from human emotion and vital information that lies in the finan...

Bibliographic Details
Main Author: Yazdani, Sepideh Foroozan
Format: Thesis
Language: English
Published: 2017
Subjects:
Online Access: http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf
id my-upm-ir.113985
record_format uketd_dc
spelling my-upm-ir.113985 2024-12-04T08:26:32Z 2017-10 Thesis http://psasir.upm.edu.my/id/eprint/113985/ http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf text en public http://ethesis.upm.edu.my/id/eprint/18043 doctoral
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
advisor Azmi Murad, Masrah Azrifah
topic Time-series analysis
Finance-Mathematical models-Computer programs
Eimeria tenella
description This thesis applies sentiment classification, a task within artificial intelligence, to financial news using a combination of machine learning, linguistic, and statistical methods. The motivation comes from the human emotion and vital information contained in financial news, such as news reports, and their impact on the market. In recent years, a huge amount of this information has become accessible in text form for investment and research analysis, and investors and researchers can easily obtain it through a variety of channels on the Internet. Despite existing studies on automated sentiment classification of financial news, challenges remain in text mining and financial news classification concerning feature extraction, feature selection, and classification. Most existing literature on financial news sentiment relies on very simple linguistic features, such as Bag-of-Words (BOW), in which each piece of news is represented by its distinct words with their frequencies as the feature type; only a few studies have employed more complex approaches. Not all words are needed to represent a given text. The primary downside of BOW, or unigrams, is the huge number of linguistic features it produces. The secondary downside is that the linguistic features carry too much information for all of them to serve as features, while it is unclear which ones matter for the sentiment classification of financial news. Furthermore, since words are extracted on the basis of high frequency, low-frequency linguistic features, which can be valuable, are typically ignored.
This research proposes two feature process models, the Ngram-based and the NgramPOS-based models, for the sentiment classification of financial news. The Ngram-based model uses statistical approaches for feature processing. This high-frequency-based model combines unigrams and bigrams with Term Frequency-Inverse Document Frequency (TF-IDF), an unsupervised feature weighting method, and applies the Document Frequency (DF) method with a threshold for dimensionality reduction, since DF is suitable for high-dimensional feature spaces. The NgramPOS-based model enhances the feature processing of the Ngram-based model. It employs a combination of statistical and linguistic approaches to extract sentiment information as features. This low-frequency-based model extracts sentiment-rich words and phrases as unigrams and bigrams using defined fixed POS-based patterns with binary weighting, and applies Principal Component Analysis (PCA) as an unsupervised method to reduce the dimension of the extracted feature space. Both models use an RBF Support Vector Machine (SVM) with optimized parameters (C, γ) to classify financial news as positive or negative. Experiments showed that combining unigram and bigram features with the TF-IDF and binary feature weighting methods gives the best result in financial news classification among the diverse feature spaces tested, with accuracies of 97.34% and 67.19% for the two models, respectively.
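The feature processing described for the Ngram-based model (unigrams plus bigrams, a DF threshold, TF-IDF weighting) can be illustrated with a minimal sketch. This is not the thesis's implementation: the whitespace tokenizer, the `min_df=2` threshold, the `idf = log(N / df)` variant, the function names, and the three toy headlines are all assumptions chosen for illustration. The classification stage (RBF SVM with tuned C and γ) is not shown.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_features(docs, min_df=2):
    """Return (vocabulary, vectors) after DF thresholding and TF-IDF weighting."""
    term_counts = [Counter(ngrams(d)) for d in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(term for counts in term_counts for term in counts)
    # DF thresholding: keep only n-grams seen in at least min_df documents.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        # Weight = raw term count * log(N / df); one common TF-IDF variant.
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical toy headlines, not data from the thesis.
docs = [
    "profit rises sharply",
    "profit falls sharply",
    "shares fall as profit falls",
]
vocab, vectors = tfidf_features(docs, min_df=2)
print(vocab)
print(vectors[0])
```

With this idf variant, an n-gram that occurs in every document (here "profit") receives weight zero, so the DF threshold removes rare n-grams while the weighting discounts ubiquitous ones. The resulting vectors are the kind of input a downstream classifier would consume.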
format Thesis
qualification_level Doctorate
author Yazdani, Sepideh Foroozan
title Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification
granting_institution Universiti Putra Malaysia
publishDate 2017
url http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf
_version_ 1818586175795888128