Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification

This thesis addresses the sentiment classification task for financial news within the field of artificial intelligence, using a combination of machine learning, linguistic, and statistical methods. The motivation for this approach comes from human emotion and vital information that lies in the finan...

Bibliographic Details
Main Author:	Yazdani, Sepideh Foroozan
Format:	Thesis
Language:	English
Published:	2017
Subjects:	Time-series analysis; Finance - Mathematical models - Computer programs; Eimeria tenella
Online Access:	http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf

id	my-upm-ir.113985
record_format	uketd_dc
spelling	my-upm-ir.113985 2024-12-04T08:26:32Z 2017-10 Thesis http://psasir.upm.edu.my/id/eprint/113985/ http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf text en public http://ethesis.upm.edu.my/id/eprint/18043 doctoral Universiti Putra Malaysia Azmi Murad, Masrah Azrifah
institution	Universiti Putra Malaysia
collection	PSAS Institutional Repository
language	English
advisor	Azmi Murad, Masrah Azrifah
topic	Time-series analysis; Finance - Mathematical models - Computer programs; Eimeria tenella
description	This thesis addresses the sentiment classification task for financial news within the field of artificial intelligence, using a combination of machine learning, linguistic, and statistical methods. The motivation comes from the human emotion and vital information contained in financial news, such as news reports, and their impact on the market. In recent years, a huge amount of this information has become accessible in text format for investment and research analysis, and investors and researchers can easily obtain it through a variety of channels on the Internet. Despite the studies conducted on automated sentiment classification of financial news, challenges remain in the parts of text mining and financial news classification that concern feature extraction, feature selection, and classification. Most existing literature on financial news sentiment relies on very simple linguistic features, such as Bag-of-Words (BOW), in which each piece of news is represented by its distinct words with their frequencies as the feature type, and only a few studies have employed more sophisticated approaches. Not all words are needed to represent a given text. The primary downside of BOW, or unigrams, is the huge number of linguistic features it produces. The secondary downside is that the linguistic features carry too much information for all of them to serve as features, while it is not clear which ones matter for the sentiment classification of financial news. Furthermore, since words are extracted on the basis of high frequency, low-frequency linguistic features that may be worthwhile are typically ignored.
This research proposes two feature process models, the Ngram-based model and the NgramPOS-based model, for the sentiment classification of financial news. The Ngram-based model uses statistical approaches for feature processing. This high-frequency-based model combines unigrams and bigrams with Term Frequency-Inverse Document Frequency (TF-IDF), an unsupervised feature weighting method, and applies the Document Frequency (DF) method with a threshold for dimensionality reduction, since DF is suitable for high-dimensional feature spaces. The NgramPOS-based model is intended to enhance the feature processing of the Ngram-based model. It employs a combination of statistical and linguistic approaches to extract sentiment information as features. This low-frequency-based model extracts sentiment-rich words and phrases as unigrams and bigrams using predefined fixed POS-based patterns together with binary weighting, and applies Principal Component Analysis (PCA), an unsupervised method, to reduce the dimension of the extracted feature space. Both models use a Radial Basis Function (RBF) Support Vector Machine (SVM) with optimized parameters (C, γ) to classify financial news as positive or negative. Experiments showed that combining unigram and bigram features with the TF-IDF and binary feature weighting methods gives the best result in each model among the diverse feature spaces tested, with accuracies for the two models of 97.34% and 67.19%, respectively.
format	Thesis
qualification_level	Doctorate
author	Yazdani, Sepideh Foroozan
title	Automated frequency-based statistical and linguistic feature process models for financial news sentiment classification
granting_institution	Universiti Putra Malaysia
publishDate	2017
url	http://psasir.upm.edu.my/id/eprint/113985/1/113985.pdf
_version_	1818586175795888128
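
The feature processing outlined in the description (unigrams and bigrams, a Document Frequency threshold, TF-IDF or binary weighting, and an RBF kernel) can be illustrated with a minimal plain-Python sketch. This is not the thesis author's implementation. The whitespace tokenizer, the `min_df` and `gamma` values, the idf variant `log(N / df)`, and the three toy headlines are all assumptions made for illustration; the POS-pattern extraction, PCA, and SVM training steps are omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n_values=(1, 2)):
    """Return unigrams and bigrams (space-joined) for one tokenized document."""
    grams = []
    for n in n_values:
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def build_features(docs, min_df=2, weighting="tfidf"):
    """Build a DF-filtered vocabulary and one weighted vector per document.

    min_df is a hypothetical threshold; the record does not state the value used.
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    doc_grams = [ngrams(d.lower().split()) for d in docs]

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))

    # DF thresholding as dimensionality reduction: drop rare n-grams.
    vocab = sorted(g for g, count in df.items() if count >= min_df)

    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        tf = Counter(grams)
        if weighting == "binary":
            # Binary weighting: 1 if the n-gram occurs in the document, else 0.
            vec = [1.0 if tf[g] > 0 else 0.0 for g in vocab]
        else:
            # TF-IDF with idf = log(N / df); an n-gram present in every
            # document gets weight 0 under this variant.
            vec = [tf[g] * math.log(n_docs / df[g]) for g in vocab]
        vectors.append(vec)
    return vocab, vectors


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma here is an arbitrary example value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy headlines, invented for illustration only.
docs = [
    "profit rises sharply",
    "profit falls sharply",
    "shares rise as profit rises",
]
vocab, tfidf = build_features(docs, min_df=2)
_, binary = build_features(docs, min_df=2, weighting="binary")
print(vocab)
print(rbf_kernel(tfidf[0], tfidf[1]))
```

In practice, a library such as scikit-learn would likely replace these helpers, and the SVM parameters C and γ would be tuned by a search procedure rather than fixed by hand.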

Similar Items