Stemming | Robert Gimeno's Adventures in Data Science

I’ve previously described why an organisation would want to analyse Twitter and described my initial architecture for achieving a targeted analysis. The targeting relies on identifying tweeters who are relevant to the organisation’s aims in order to keep the size of the network manageable and remove the ‘noise’ of irrelevant tweeters. To date identifying ‘relevant’ tweeters has relied on scoring each tweet against a list of keywords; this is somewhat crude and I’ve been looking at simple ways to improve it. I’ve been aware of stemming for a while and have have now found a C# implementation, along with some others. There is a good explanation of stemming on Wikipedia so I won’t try and repeat that. To apply it first run all the keywords through the stemming algorithm and then run all the words in the tweet through the algorithm before comparing. This should produce more matches, or eliminates the need to try and capture all the variations (plurals, tenses, etc.) of the word you are interested in. It’s definitely not perfect and I am aware there are more sophisticated approaches which take context into account, however think it will be better than just keywords. I’ve not tried comparing results against a simple keyword list, before I do does anyone have any further guidance?

Robert Gimeno's Adventures in Data Science

Data everywhere but what can it tell us?

Tag Archives: Stemming