Apr 12, 2010

More terrible than terrific: do we have more negative words than positive ones in English?

One of the common approaches to the problem of sentiment analysis (a field under text mining & natural language processing (NLP), where programs try to detect opinion in natural language texts) is to build a dictionary of 'opinion' words. The words are classified as negative or positive. Given the words of a sentence, a program can look each one up to see if it appears in the dictionary, and then use its positive or negative category as an input in detecting the sentiment of that sentence. (Of course, this is a simplified explanation of what actually happens.)
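To make the dictionary lookup concrete, here is a minimal sketch in Python. The tiny lexicon, the function names and the majority-vote rule are all assumptions made for illustration; they are not the lexicon or the method we actually use.

```python
import re

# Hypothetical toy lexicon for illustration only; a real one is far larger.
LEXICON = {
    "terrific": "positive",
    "pleasant": "positive",
    "awesome": "positive",
    "terrible": "negative",
    "horrible": "negative",
    "bad": "negative",
}


def opinion_words(sentence):
    """Return (word, category) pairs for each lexicon word in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [(tok, LEXICON[tok]) for tok in tokens if tok in LEXICON]


def naive_sentiment(sentence):
    """Label a sentence by counting positive and negative lexicon hits."""
    hits = opinion_words(sentence)
    pos = sum(1 for _, category in hits if category == "positive")
    neg = sum(1 for _, category in hits if category == "negative")
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(naive_sentiment("The queue was terrible but the book was awesome and pleasant"))
```

A real system would treat these counts as just one input among several, since context (negation, sarcasm and so on) can flip the meaning of a word.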

We work in this field and so, in one of our approaches, have built such a lexicon. Ours is a small list and hence not comprehensive, but sufficient for our purposes. Now, I noticed that I had a lot more words tagged as negative than as positive. Stated in numbers, there were 434 words marked positive, and 1348 marked negative. I had initially built a much smaller list by hand, and then expanded the lexicon automatically by (partially) using an approach (pdf) described by Italian researchers Andrea Esuli and Fabrizio Sebastiani.

They had also created SentiWordNet. This extends WordNet, which is a popular language resource used in natural language processing and, in essence, is a dictionary-thesaurus on steroids (the good kind :-)). WordNet contains over 150,000 words and arranges them 'conceptually', by grouping together synonyms that make up unique 'senses'; these groups are called 'synsets' (it may be obvious why I didn't use the word 'sensually' to describe the arrangement). SentiWordNet augments this by attaching a positive and a negative score to each synset. (Here, I won't discuss why a synset can have both a positive & negative score.) Words like 'horrible' or 'bad' have a high negative score, while 'awesome' and 'pleasant' are very positive.

Coming back to our question. Seeing the difference in my list, I wondered if this was a valid observation, or if my lexicon was just poorly constructed, or if it was a consequence of applying the expansion technique only in part. So I counted the number of positive & negative synsets in SentiWordNet (again, not going into details here). I found 14134 negative synsets and 12720 positive ones. Perhaps not a significant difference, but still the negative side is a little greater in number, by roughly 11% (and I haven't actually counted words, only sense groups). That is far smaller than the roughly three-to-one gap in my own list, so it could just be that I chose or generated more negative words.
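For the curious, the counting can be sketched roughly as follows. This assumes the scores have already been read into (positive, negative) pairs, one per synset; the sample values below are made up, and the helper names are hypothetical rather than part of any SentiWordNet tooling.

```python
from collections import Counter


def label_synset(pos, neg):
    """Label a synset by whichever of its two scores is higher."""
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tied"  # equal scores, including synsets where both are zero


def count_labels(scores):
    """Count labels over an iterable of (pos, neg) score pairs."""
    return Counter(label_synset(pos, neg) for pos, neg in scores)


# Made-up scores for illustration, not taken from SentiWordNet.
sample = [(0.75, 0.0), (0.0, 0.625), (0.125, 0.5), (0.25, 0.25), (0.0, 0.0)]
counts = count_labels(sample)

if __name__ == "__main__":
    print(counts["positive"], counts["negative"], counts["tied"])
```

Note that this counts sense groups, not words, and that picking the higher score hides any nuance inside a synset.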

This is all anecdotal and perhaps some fun for language geeks to talk about when they're stuck in a long queue and haven't brought a book along :-)


Anonymous said...

Thank you sir, for reminding me of countless cack sessions in IIT, spent on discussing topics that had nothing to do with my own research! *If* we were sitting at Badlu's with a cutting in hand, the question I would have asked is: Are the positive/negative scores in SentiWordNet just free integers, or are they fractions that sum to 1?

Unknown said...

and thank you sir, for questioning things as usual :-0

Irrespective of location, let me answer the question: each synset is scored under 3 parts - objectivity, -ve, +ve (the last two being subjectivity, I guess). These are fractions that sum up to 1. I am not completely happy with the nature and method of the scoring that I have seen, which seems counter-intuitive to me.

SameerDS said...

I had expected my comment to show up in your buzz! Perhaps you have to tweak some setting somewhere to make that happen. About the normalised positive/negative scores, I can see two things: one is that it allows a comparison of two words to see which is more positive. But if tomorrow I insert a really really bad word in SentiWordNet, bad enough to make Darth Vader and his Death Star look like a 5-year old with a popsicle, it will completely swamp the positive/negative scores for all the words!

Unknown said...

Sameer: I don't think Buzz takes in comments from blogs (something I have been wishing they could do). In fact, I can add any rss feed to my buzz - so there is no implied ownership, I think.

Back to SWN: no, the normalisation isn't across all synsets, i.e. a synset's score isn't relative to the top one. It's just that within a synset, the scores for obj, -ve, +ve add up to 1. Sorry if I gave you the wrong impression.

Unknown said...

(there was a considerable discussion on my buzz profile - pasting the thread here, as it may be useful to readers of this post)

Apr 13 Anupam Goyal: I know next to nothing about NLP but there might be cases where a negative becomes a positive in conjunction with other words around it. Or, the negative semantics might change due to the context.

Apr 13 Anupam Goyal: Should the frequency of use of all negative words dilute the score that is given to them? For example "good" being used in a sentence might carry a higher score than "terrible" since on average there would be generally less "terrible" things in the world. Or, should it be the other way round? Random thoughts.

Apr 13 Ramanand J: @Anupam Goyal context: yes, that is handled in sentences/documents - here, I'm only discussing the meaning of words outside of context. Am not suggesting we use more -ve words (though that may be true on the web where people crib a lot :-) - an interesting point.

freq: um, not sure. you are saying the rarer something is, the more imp. it should be scored. however, here, the scoring is limited to the concept irrespective of use, i.e. something like 'harridan' is negative even if it is rare.

thanks for the comments - see, you don't need to know any NLP :-)

Apr 13 Sudarshan Purohit: (me in the don't-know-NLP camp too). One reason I can think of for this is the human tendency to justify, quantify, nuance, or otherwise specialize negative comments to others, leading to a larger vocabulary for those sentiments. Positive words, well, they're more likely to be taken as is with no hard feelings, so less nuance required there.

Apr 14 siddharth dani: @Sudarshan, my thoughts exactly.
@Ramanand, Am I correct in understanding the following: if a word's positive score is greater than its negative score then you are saying that it is a positive word, and vice versa?

Apr 14 Ramanand J: @Sudarshan Purohit interesting thought!

@siddharth dani you are right. I simply chose to award the synset winner as the one with the higher score (which does mask any nuances inside). of the remaining, there were some where the neg score==pos score. Any comments on this naive approach?

Apr 14 siddharth dani: I think the naive approach should work pretty well while deciding whether to label a particular synset positive or negative, since the whole goal of this task is to find an overall tendency in the dictionary of the English language and not go too much into individual cases. Would it make sense to give some sort of a tolerance in labeling borderline cases? If the negative and positive scores for a synset are within this tolerance, do not count that synset at all; or something to this tune. It would be great to see if this tendency is found in other languages as well. I guess Marathi would have a similar if not greater bias than English, since being able to be negative in various different ways actually makes the speaker more typically Marathi and is hence conducive to resulting in a larger negative dictionary :).

8:19 am Ramanand J: @siddharth dani ha ha - there is a Marathi wordnet, but no Marathi sentiwordnet. We'll wait till then to check your hypothesis :-). Your point about borderline cases is quite valid.
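The tolerance idea for borderline cases raised in the thread above might look something like this in Python. It is only a sketch: the function name and the default tolerance of 0.1 are arbitrary assumptions, not values anyone in the discussion proposed.

```python
def label_with_tolerance(pos, neg, tolerance=0.1):
    """Label a synset, skipping it when the two scores are too close.

    Returns "positive", "negative", or None for a borderline synset
    that should not be counted at all.
    """
    if abs(pos - neg) <= tolerance:
        return None
    return "positive" if pos > neg else "negative"


if __name__ == "__main__":
    # Made-up score pairs for illustration.
    for pos, neg in [(0.5, 0.375), (0.3, 0.25), (0.125, 0.75)]:
        print(pos, neg, label_with_tolerance(pos, neg))
```

Setting the tolerance to 0 still skips exact ties, so every remaining synset goes to whichever score is higher.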