Fake review annotation model and classification through reviewers' writing style
In the last decade, online product reviews have become the main source of information in customers' decision-making and businesses' purchasing processes. Unfortunately, fraudsters produce untruthful reviews, written intentionally for profit or publicity. Their activities deceive potential organizations, customers, and opinion mining techniques...
Main Author: | Shojaee, Somayeh |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2019 |
Subjects: | Computer networks - Security measures; Security systems |
Online Access: | http://psasir.upm.edu.my/id/eprint/90777/1/FSKTM%202020%203%20IR.pdf |
id |
my-upm-ir.90777 |
record_format |
uketd_dc |
spelling |
my-upm-ir.90777 2021-09-27T03:38:45Z Fake review annotation model and classification through reviewers' writing style 2019-09 Shojaee, Somayeh Computer networks - Security measures; Security systems Thesis http://psasir.upm.edu.my/id/eprint/90777/ http://psasir.upm.edu.my/id/eprint/90777/1/FSKTM%202020%203%20IR.pdf text en public doctoral Universiti Putra Malaysia Azmi Murad, Masrah Azrifah |
institution |
Universiti Putra Malaysia |
collection |
PSAS Institutional Repository |
language |
English |
advisor |
Azmi Murad, Masrah Azrifah |
topic |
Computer networks - Security measures; Security systems |
spellingShingle |
Computer networks - Security measures; Security systems Shojaee, Somayeh Fake review annotation model and classification through reviewers' writing style |
description |
In the last decade, online product reviews have become the main source of information in customers' decision-making and businesses' purchasing processes. Unfortunately, fraudsters produce untruthful reviews, written intentionally for profit or publicity. Their activities mislead organizations into reshaping their businesses, prevent customers from making the best decisions, and keep opinion mining techniques from reaching accurate conclusions.

One of the big challenges in spam review detection is the lack of an available labeled gold-standard real-life product review dataset. Manually labeling product reviews as fake or real is one approach to this problem. However, recognizing whether a review is fake or real by reading its content alone is very difficult, because spammers can easily craft a fake review that reads just like any real one.

To address this problem, we improve inter-annotator agreement in the manual labeling approach by proposing a model for annotating product reviews as fake or real. This is the first contribution of this research. The proposed annotation model is designed, implemented, and made accessible online. Our crawled reviews were labeled by three annotators who were trained and paid to complete the labeling through our system. A spamicity score was calculated for each review, and a label was assigned to every review based on its spamicity score. Fleiss's kappa for the three annotators is 0.89, which indicates "almost perfect agreement" between them (see the kappa sketch below).

The labeled real-life product review dataset is the second contribution of this study. To test the accuracy of our model, we also re-labeled a portion of the available Yelp.com dataset through our system and calculated the disagreement with the actual labels assigned by Yelp.com's filtering system. We found that only 7% of the reviews were labeled differently.

The other open problem in fake product review classification is the lack of feature sets that are independent of historic knowledge. Most feature-based fake review detection techniques are applicable only to a specific product domain, or require historic knowledge to extract their features. To address this problem, this study presents a set of domain- and historic-knowledge-independent features, namely writing style and readability, which can be applied to almost any review hosting site (see the feature sketch below). This feature set is the third contribution of this study. Writing style here refers to linguistic aspects that distinguish fake reviewers from real ones. Fake reviewers try hard to write reviews that sound genuine, which consequently affects both their writing style and the readability of their fake reviews. The method can therefore detect a reviewer's writing style before spamming hurts a product or a business. Evaluating our features yields an accuracy of 90.7% on the only available crowdsourced labeled gold-standard dataset and 98.9% on our proposed dataset, suggesting significant differences between fake and real reviews in writing style and readability. |
format |
Thesis |
qualification_level |
Doctorate |
author |
Shojaee, Somayeh |
author_facet |
Shojaee, Somayeh |
author_sort |
Shojaee, Somayeh |
title |
Fake review annotation model and classification through reviewers' writing style |
title_short |
Fake review annotation model and classification through reviewers' writing style |
title_full |
Fake review annotation model and classification through reviewers' writing style |
title_fullStr |
Fake review annotation model and classification through reviewers' writing style |
title_full_unstemmed |
Fake review annotation model and classification through reviewers' writing style |
title_sort |
fake review annotation model and classification through reviewers' writing style |
granting_institution |
Universiti Putra Malaysia |
publishDate |
2019 |
url |
http://psasir.upm.edu.my/id/eprint/90777/1/FSKTM%202020%203%20IR.pdf |
_version_ |
1747813657396379648 |