Ever seen a review like this?
My husband and I satayed for two nights at the Hilton Chicago, and enjoyed every minute of it! The bedrooms are immaculate, and the linnens are very soft. We also appreciated the free wifi, as we could stay in touch with friends while staying in Chicago. The bathroom was quite spacious, and I loved the smell of the shampoo they provided - not like most hotel shampoos. Their service was amazing, and we absolutely loved the beautiful indoor pool. I would recommend staying here to anyone.
Your author has come across dozens that started like this one: with “My husband and I” or “My spouse and I”. Surfing over to Yelp and choosing San Francisco brings up another, “My husband stayed here for a little less than a week and were extremely pleased with the place…”
Turns out there’s a good reason for this similarity: many of these reviews are fake, planted by mercenaries who are paid as little as $5 for two necessarily glowing reviews. The $5 figure is from the New York Times, via A&LD. Bogus “five-star” ratings on sites like Amazon and TripAdvisor turn out to be a large problem.
The glowing notice above is known to be fake because it was solicited via a website that specializes in selling fake reviews (I have no idea whether the Yelp review is genuine or fake). The solicitation was done as part of a study by Myle Ott and others at Cornell, in an effort to develop an algorithm that can detect fakes.
Incidentally, Ott is a computer scientist, and those guys say “train algorithm” where statisticians say “fit model” and physicists say “build model.” All these terms mean exactly the same thing, though, admittedly, “training an algorithm” sounds sexier than “fitting a model.” “Training” implies that “learning” can go on indefinitely, while “fitting” implies merely applying some formula. Computer scientists are winning the battle of terminology. They are also, justifiably, winning the battle over the philosophy of modeling, but that’s a story for another day.
Building an algorithm to detect fraudulent reviews is not simple; but creating the database from which to fit the model is the real trick. One approach has been to gather reviews that are suspiciously similar to one another, as in plagiarism detection. Another was to “ask participants to give both their true and untrue views on personal issues (e.g., their stance on the death penalty).” In this way, everybody becomes their own control.
Here, the authors did one better and solicited 400 fake reviews in the same way that fake reviews are solicited by actual websites. They also gathered 400 hoped-to-be-genuine reviews from TripAdvisor. In the end, they had 20 real and 20 fake reviews for 20 different hotels. These were used to fit their model—or train their algorithm, if you will.
One tidbit was the discovery that fake reviews are often written in a hurry: one “took just 5 seconds and contained 114 words.” This of course implies the text was prepared in advance and cut and pasted in. Reviews written by first-time users, or from newly created user names, are also more likely to be fake. Sites like TripAdvisor can use these facts as additional evidence when flagging a review as genuine or fake.
The models themselves were naive Bayes and support vector machines, both commonly used as classifiers. Classification is the meat and potatoes of statistics (I would say it is the sole reason for its existence; of that, more another time). Logistic regression is classification, as are discriminant analysis, so-called machine learning algorithms, and on and on.
Support vector machines are a kind of non-parametric discriminant analysis. Various combinations of functions of data are produced which spit out whether the given message is likely fake or likely real. If you want to be fancy, you say SVMs “find a high-dimensional separating hyperplane between two groups of data.”
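To make the hyperplane idea concrete, here is a minimal sketch in Python. The two features and the weights are invented purely for illustration; Ott's actual model used many more features than this.

```python
# A minimal sketch of what a separating hyperplane does, using made-up
# two-dimensional feature vectors (the numbers are illustrative, not Ott's).
# A linear classifier scores a vector x with w.x + b and labels it by the sign.

def classify(x, w, b):
    """Label a feature vector by which side of the hyperplane w.x + b = 0 it falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "fake" if score > 0 else "real"

# Hypothetical weights: feature 0 = rate of self-reference ("I", "my husband"),
# feature 1 = rate of concrete spatial detail ("bathroom", "small", "location").
w = [2.0, -1.5]
b = -0.5

print(classify([1.0, 0.2], w, b))  # heavy self-reference -> fake
print(classify([0.1, 0.9], w, b))  # concrete detail      -> real
```

Real SVMs choose `w` and `b` from the data, picking the plane that leaves the widest margin between the two groups; the classification step afterwards is exactly this sign check.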
The data is the content of the messages themselves: how long it took to write them, the number of times the word “I” was used, and so on. For example, deceptive reviews used “experience”, “my husband”, “I”, “feel”, “business”, and “vacation” more than genuine ones did.
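The naive Bayes side of this can be sketched in a few lines. The training “reviews” below are invented stand-ins, not Ott's data; the point is only the mechanism: count words in each class, smooth the counts, and compare log-probabilities.

```python
# A toy naive Bayes classifier over word counts, in the spirit of the study's
# setup. The training "reviews" are invented stand-ins, not Ott's corpus.
from collections import Counter
from math import log

fake_docs = ["my husband and i loved our vacation experience",
             "i feel this business hotel is amazing"]
real_docs = ["the room overlooking the river was small",
             "check in line at the front desk was long"]

def train(docs):
    counts = Counter()
    for d in docs:
        counts.update(d.split())
    return counts, sum(counts.values())

fake_counts, fake_total = train(fake_docs)
real_counts, real_total = train(real_docs)
vocab = set(fake_counts) | set(real_counts)

def log_prob(words, counts, total):
    # Laplace (add-one) smoothing so unseen words don't zero out the product.
    return sum(log((counts[w] + 1) / (total + len(vocab))) for w in words)

def classify(text):
    words = text.split()
    fake_score = log_prob(words, fake_counts, fake_total)
    real_score = log_prob(words, real_counts, real_total)
    return "fake" if fake_score > real_score else "real"

print(classify("my husband and i feel the experience was amazing"))  # -> fake
```

Naive Bayes assumes each word appears independently given the class, which is false but works surprisingly well for text.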
They got about 90% accuracy on their test data, which is excellent, especially considering that human readers do no better than 50%. Experience says that that high rate won’t be realized on new data. Why?
Well, the model was fit to the data at hand. If new data were exactly like the data at hand, the new accuracy rate would be the same as the old. But new data is never exactly like the old: if it were, it would be a mere copy. It is the inevitable differences between the old and the new that account for the decrease in performance.
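The effect can be demonstrated with a deliberately extreme model, one that simply memorizes its training data (a 1-nearest-neighbour classifier on synthetic numbers, nothing to do with Ott's features):

```python
# An extreme illustration: a "model" that memorizes its training data
# (1-nearest-neighbour) is perfect on the data at hand, but new data is
# never an exact copy, so performance slips. Synthetic numbers throughout.
import random

random.seed(1)

def sample(n):
    # the label depends on x only noisily, so memorized answers don't transfer
    data = []
    for _ in range(n):
        x = random.random()
        label = 1 if x + random.gauss(0, 0.4) > 0.5 else 0
        data.append((x, label))
    return data

train = sample(50)
test = sample(50)

def predict(x, memory):
    # recall the label of the closest memorized point
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def accuracy(data, memory):
    return sum(predict(x, memory) == y for x, y in data) / len(data)

print(accuracy(train, train))  # 1.0: the model has already seen every answer
print(accuracy(test, train))   # accuracy on data the model has never seen
```

On the training data the nearest neighbour of each point is the point itself, so accuracy is perfect by construction; on fresh draws it will generally be lower, for exactly the reason given above.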
This wisdom applies not just to Ott’s model, but to all statistical/probabilistic or computer science/fuzzy logic models. The models’ performance is always conditional on the data at hand.
Ott has made his data publicly available. Do not download it, however, unless you know how to read things like this: “!/.__The/DT ,/,__and/CC ,/,__and/CC ,/,__and/CC ,/,__and/CC ,/,__as/IN ./.__I/PRP ./.__The/DT ./.__Their/PRP$ ./…”
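Those cryptic strings are word/part-of-speech pairs (DT is a determiner, PRP a personal pronoun, and so on, in the usual Penn Treebank tag set). Assuming whitespace-separated word/TAG tokens, which is the common convention for such files, a few lines of Python will unpack them; the sample sentence here is invented:

```python
# Split "word/TAG" tokens into (word, tag) pairs. The tag names (DT, NN,
# PRP$, ...) follow the Penn Treebank convention; the sample is invented.
def parse_tagged(line):
    """rpartition splits at the LAST '/', so words containing '/' survive."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("My/PRP$ husband/NN and/CC I/PRP stayed/VBD"))
# -> [('My', 'PRP$'), ('husband', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('stayed', 'VBD')]
```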