Research is needed to build a usable natural language parsing engine for the Dutch language. A lot of linguistics research has been done for English, French and German, which has resulted in fairly impressive analysis engines such as IBM's Watson. Natural language parsing technology has unfortunately not yet reached the level of full interpretation and comprehension of natural language, but less ambitious engines have had some success. Because of its small market, the Dutch language has not been researched extensively, and as a result no open source analysis engines exist for it.
One of the most marketable applications of automated linguistic analysis is "Sentiment Analysis": determining whether the author of a piece of text wished to convey a negative or a positive sentiment. There are a few publications on the topic, such as Good News or Bad News?, which reports an accuracy of 63%. Roughly speaking, this means that the algorithm used in the publication correctly classifies 63% of the sentences it is given. Unfortunately this application is not open source.
We have received a number of requests from our clients for this functionality in our solutions. Incentro has therefore started the open source project Dutch sentiment analysis engine, which our preliminary results seem to indicate has an accuracy of 66%. Our approach follows the same concept as the paper but differs in one respect: we assume that multi-word patterns are, in some cases, better at determining sentiment than single-word matching.
This article presents our findings and describes our approach in more detail. Our findings are based on product and film reviews that we collected from various sources. Reviews were chosen as our corpus because they offer a dataset of texts that have already been classified (by sentiment) by their authors.
If we plot the average confidence against the text sentiment we get the graph shown in figure 1. The horizontal axis shows the sentiment classification the author gave the text, 0.5 being the most negative and 5 being the most positive. The vertical axis shows the average sentiment that the engine assigns to the texts in each sentiment class (0.5 - 5). A negative confidence indicates a negative sentiment and a positive confidence a positive sentiment.
Figure 1: The average confidence of sentiment over review rating
These results show a clear and steady rise in sentiment as we move from 0.5 to 5. This is exactly what one would expect to see from a true random sample of texts.
If we plot the percentage of positive (blue line), negative (green line) and neutral (red line) matches that the engine returns over our complete collection of texts, we get the following graph. The horizontal axis shows the sentiment classification the author gave the text, 0.5 being the most negative and 5 being the most positive. Texts that return a confidence value between -0.003 and 0.003 are considered to have a neutral sentiment.
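The neutral band described above can be sketched as a small function. This is only an illustration: the name `classify` and the choice to treat the boundaries as exclusive are assumptions, since the text does not say how values of exactly -0.003 or 0.003 are handled.

```python
NEUTRAL_BAND = 0.003  # threshold taken from the text


def classify(confidence: float) -> str:
    """Map an engine confidence score to a sentiment label.

    Assumption: the band boundaries are exclusive, so a confidence of
    exactly +/-0.003 is no longer treated as neutral.
    """
    if -NEUTRAL_BAND < confidence < NEUTRAL_BAND:
        return "neutral"
    return "positive" if confidence > 0 else "negative"
```

Counting these labels per review rating and dividing by the number of reviews with that rating would give the three percentages that are plotted.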
Figure 2: The matching performance over review ratings
At first glance this seems like an intuitive result, considering that the positive and negative percentage-of-matches lines cross each other just below 2.5. One would expect the number of texts that could be interpreted as somewhat positive to decline the closer one gets to 0.5, and vice versa for negative results. Unfortunately this also points out our shortcomings. At the extremes, 0.5 and 5, only roughly 66% of the texts are correctly classified. I say "correctly" since we assume that texts given the rating 0.5 are almost always meant to be negative, and vice versa for reviews with a rating of 5. We also see that this collection may not be suited for testing the engine's accuracy with neutral sentences, since there is no significant deviation in neutral matches over the various ratings. The argument can be made that this corpus will never contain neutral text, since the object of a review is to convey sentiment of an exclusively positive or negative nature. These findings do however follow the characteristics of some state-of-the-art sentiment analysis engines, including those for the English language, which suggests that the engine performs at least as well as other available engines, if not better.
Our approach can be described as follows:
- Find all patterns (word patterns) in a collection of texts that are definitely intended to convey a sentiment.
- Separate the patterns that are indicative of positive sentiment from those that are indicative of negative sentiment.
- Order these patterns to obtain the highest accuracy when classifying text.
Collecting pre-classified pieces of text
We needed pieces of text that had already been classified into groups, such that we could be fairly sure of isolating a group in which the author intended the text to convey a negative sentiment. The same was needed for text that conveys positive sentiment. The classical approach used for the English language is to use product or service reviews, e.g. from Amazon.com. A set of reviews was collected from various Dutch sites, totalling over 793000 reviews. This collection, or corpus, was divided into a training set and a test set using an 80/20 ratio.
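A split like this could look roughly as follows. It is a minimal sketch: the function name `split_corpus`, the shuffling step and the fixed seed are assumptions for illustration, as the article does not describe how reviews were assigned to either set.

```python
import random


def split_corpus(reviews, train_fraction=0.8, seed=42):
    """Shuffle the reviews and split them into a training and a test set.

    Assumption: a random shuffle with a fixed seed, so the split is
    reproducible between runs.
    """
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Shuffling before cutting avoids a test set that contains reviews from only one site or one rating, which could happen if the corpus is stored in collection order.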
Finding nGrams in the text
The nGrams we were looking for could also be termed word sequences. We were looking for word sequences that are often used when an author wants to convey a negative or positive sentiment. The nGrams that were selected were sequences that were used at least twice in the training set. This resulted in 941388 nGrams with lengths ranging from 2 to 7 words. This set was then supplemented with 337390 Dutch words from the Opentaal word set.
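The selection of word sequences can be sketched as below. The tokenizer (lowercasing and splitting on whitespace) and the function names are assumptions, and "used at least twice" is read here as two occurrences anywhere in the training set, including within the same review.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def count_ngrams(texts, min_len=2, max_len=7):
    """Count every word sequence of min_len to max_len words."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def frequent_ngrams(texts, min_count=2):
    """Keep only the nGrams that occur at least min_count times."""
    return {g: c for g, c in count_ngrams(texts).items() if c >= min_count}
```

For two reviews such as "dit is de slechtste film" and "dit is de beste film", only the shared sequences "dit is", "is de" and "dit is de" would be kept.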
Converting nGrams to patterns
The nGrams collected were too specific to their domains to be used efficiently on general Dutch text. Take for example the nGram "Dit is de slechtste film alle tijden" (translation: This is the worst film of all time): the word film is specific to its domain. The pattern "Dit is de slechtste subjectClass alle tijden" would correctly match a more general set of negative sentiment. Since Dutch uses what were once masculine or feminine articles (e.g. de, het, dit, die) for nouns, generalizing these words also allows the pattern to match a more general set of text. The standard way of matching text using wildcards is to use regular expressions. The result would be "(?:dit|deze) is (?:de|het) slechtste %subjectClass% alle tijden". The subject class was not replaced by a regular expression at this stage because this construction makes it easy to specialize the patterns to a particular domain. A simple find/replace of "%subjectClass%" with "auto" would make this pattern specific to cars.
Taking statistics on how many times each nGram occurs in a class
If a pattern occurs in positive texts more often than in negative texts, we assume it is a pattern that recognises positive sentiment, and the same goes for negative sentiment. However, because a pattern such as "(?:dit|deze) is" occurs often in our corpus, it and other patterns that we would say do not convey sentiment also made it into the pattern set. To filter out these types of patterns we also recorded how often they match each of the sentiment classes (negative, positive).
Ordering and filtering the list of patterns
If the margin between the percentage of positive matches and incorrect matches is too small, the pattern is marked to be ignored and will not be placed in the final set of patterns that guides the analysis engine. The final list is then ordered by the difference between positive matches and incorrect matches, in descending order.
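The find/replace specialization can be illustrated with Python's `re` module. This sketch assumes the alternations are grouped as `(?:dit|deze)` and `(?:de|het)`, that matching is case-insensitive, and that an unspecialized placeholder matches any single word; the function name `compile_pattern` is hypothetical.

```python
import re

TEMPLATE = "(?:dit|deze) is (?:de|het) slechtste %subjectClass% alle tijden"


def compile_pattern(template, subject=None):
    """Turn a pattern template into a compiled regular expression.

    Assumption: without a subject, the placeholder matches any single
    word; with a subject, it matches only that literal word.
    """
    replacement = re.escape(subject) if subject is not None else r"\w+"
    regex = template.replace("%subjectClass%", replacement)
    return re.compile(regex, re.IGNORECASE)
```

Calling `compile_pattern(TEMPLATE, "auto")` gives the car-specific variant, while `compile_pattern(TEMPLATE)` keeps the pattern general.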
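Recording the matches per class might look like the sketch below. The function name, the `(text, label)` input format and the decision to count each text at most once per pattern are assumptions made for illustration.

```python
import re
from collections import Counter


def class_statistics(patterns, labelled_texts):
    """Count, per pattern, how many texts of each class it matches.

    labelled_texts is an iterable of (text, label) pairs, where label is
    "positive" or "negative". Assumption: a text counts once per pattern,
    however many times the pattern matches inside it.
    """
    compiled = {p: re.compile(p, re.IGNORECASE) for p in patterns}
    stats = {p: Counter() for p in patterns}
    for text, label in labelled_texts:
        for pattern, regex in compiled.items():
            if regex.search(text):
                stats[pattern][label] += 1
    return stats
```

A pattern without sentiment, such as one matching "dit is" or "deze is", would show similar counts in both classes, which is what makes it possible to filter it out afterwards.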
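One possible reading of this filtering and ordering step is sketched below. The margin is computed here as the difference between a pattern's share of positive and negative matches; the threshold `min_margin=0.2` is a hypothetical value, as the article does not state the one that was used, and the function name is likewise an assumption.

```python
def rank_patterns(stats, min_margin=0.2):
    """Drop patterns with a small margin and order the rest by margin.

    stats maps each pattern to a (positive_matches, negative_matches)
    tuple. Assumption: the margin is the absolute difference between the
    two counts divided by their sum.
    """
    ranked = []
    for pattern, (pos, neg) in stats.items():
        total = pos + neg
        if total == 0:
            continue  # pattern never matched, nothing to decide on
        margin = abs(pos - neg) / total
        if margin < min_margin:
            continue  # too ambiguous, marked to be ignored
        label = "positive" if pos > neg else "negative"
        ranked.append((pattern, label, margin))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked
```

If the positive and negative classes differ much in size, the raw counts would first need to be converted to per-class percentages so that the larger class does not dominate the margin.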