Mining Twitter from Windows Azure (Part 1)

Organisations need to know what is being said about them, and Twitter is one of the obvious places to find out. I’m not trying to trawl all of Twitter; instead I’m targeting tweets about the organisation or its competitors, along with what the organisation and its competitors are themselves tweeting.

Because Twitter data is mostly already public, and no private data is being retrieved, this investigation was also a good opportunity to explore Microsoft’s Azure cloud computing platform.

I have built the architecture shown in the following diagram in order to collect and analyse Twitter data:

[Diagram: Twitter Monitor architecture]

Everything here is well documented elsewhere, but I’ll run through the key components:

Having built the above, I can say it all works very well. My key learning points are:

  • The Twitter API provides a search, which is a great place to start exploring who is saying what about an organisation
  • Thinking in terms of relational database tables is probably a mistake (one that I fell into). For example, I have a table of followers for a given Twitter account and a separate table of information about Twitter accounts; to get the information for all the followers of a given Twitter account it is necessary to join these tables. Doh! Can’t do that with Azure Table Storage. Because I’m not dealing with large volumes, this can be accomplished by ‘joining’ the tables in memory (I used a hash table but I’m sure LINQ can do it). The correct solution is either to use SQL Azure, or to move the information about the Twitter accounts into the table of followers and accept some duplication…or maybe a graph database 🙂
  • This is probably a bug: TweetSharp (or the underlying JSON and REST libraries) causes a stack overflow (!) when retrieving a list of follower IDs for a Twitter user who has protected tweets; you can avoid this by checking the user’s protected status first.
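
To make the last two points concrete, here is a minimal sketch in Python (not the C# and TweetSharp code I actually used). The row layouts, field names and the fetch_ids callable are all hypothetical stand-ins for the entities read from the two Azure tables and for whatever call retrieves follower IDs. It shows the in-memory hash join and a guard on the protected flag, assuming the volumes are small enough to hold both tables in memory.

```python
# Hypothetical rows, standing in for entities read from two Azure tables.
followers = [
    {"account_id": 1, "follower_id": 10},
    {"account_id": 1, "follower_id": 11},
    {"account_id": 2, "follower_id": 10},
]
accounts = [
    {"id": 10, "screen_name": "alice", "protected": False},
    {"id": 11, "screen_name": "bob", "protected": True},
]


def join_followers(account_id, follower_rows, account_rows):
    """Hash join: index accounts by id, then probe once per follower row."""
    accounts_by_id = {a["id"]: a for a in account_rows}
    joined = []
    for row in follower_rows:
        if row["account_id"] != account_id:
            continue
        info = accounts_by_id.get(row["follower_id"])
        if info is not None:  # skip followers we hold no account info for
            joined.append(info)
    return joined


def follower_ids_if_public(user, fetch_ids):
    """Call the (hypothetical) fetch_ids function only for unprotected users."""
    if user.get("protected"):
        return None  # never ask for the followers of a protected account
    return fetch_ids(user["id"])


if __name__ == "__main__":
    for account in join_followers(1, followers, accounts):
        print(account["screen_name"])
```

Building the dictionary once keeps the join to a single pass over each table, which is fine at my volumes; the denormalised alternative would copy fields such as screen_name into each follower row so that no join is needed at all.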

