Organisations need to know what is being said about them, and Twitter is one of the obvious places to find this out. I’m not looking to trawl all of Twitter, just to target tweets about the organisation or its competitors, along with what the organisation and its competitors are tweeting themselves.
Because Twitter data is mostly already public, and no private data is being retrieved, this investigation was also a good opportunity to explore Microsoft’s Azure cloud computing platform.
I have built the architecture shown in the following diagram in order to collect and analyse Twitter data:
Everything here is well documented elsewhere, but I’ll run through the key components:
- Twitter exposes a REST API
- TweetSharp is a .NET Library that wraps the Twitter REST API
- An Azure Cloud Service Worker Role is a background processing container
- The MS Azure Libraries are provided in the Azure SDK
- Table Storage is a non-relational managed store for tabular data
- Windows Azure Websites are effectively IIS websites running in Azure
Having built the above I can say it all works very well. My key learning points are:
- The Twitter API provides a search, which is a great place to start exploring who is saying what about an organisation
- Thinking in terms of relational database tables is probably a mistake (one that I have fallen into). For example, I have a table of followers for a given Twitter account and then a table of information about Twitter accounts; to get the information for all the followers of a given Twitter account it is necessary to join these tables. Doh! Can’t do that with Azure Table Storage. Because I’m not dealing with large volumes, this can be accomplished by ‘joining’ the tables in memory (I used a hash table but I’m sure LINQ can do it). The correct solution is either to use SQL Azure or to move the information about the Twitter accounts into the table of followers and accept some duplication…or maybe a graph database 🙂
- Probably this is a bug: TweetSharp (or the underlying JSON and REST libraries) will cause a stack overflow (!) when you try to retrieve a list of follower IDs for a Twitter user who has protected tweets; you can ensure this does not happen by checking the user’s protected status first.
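
To make the in-memory ‘join’ from the second point concrete, here is a minimal sketch. It is written in Python purely for illustration (my actual code is .NET), and the row shapes and field names (`account_id`, `follower_id`, `user_id`, `screen_name`, `protected`) are assumptions made up for the example, not the real table schema. It also shows roughly what the denormalised alternative would look like, where the account details are copied into each follower row.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def join_followers_with_accounts(
    follower_rows: Iterable[Row], account_rows: Iterable[Row]
) -> List[Row]:
    """Hash join: index the account rows by user id, then probe the index
    once per follower row. Followers with no matching account are skipped."""
    # Build phase: one dictionary lookup table keyed on the (assumed) user_id.
    accounts_by_id = {row["user_id"]: row for row in account_rows}

    joined = []
    for follower in follower_rows:
        # Probe phase: constant-time lookup instead of scanning the table.
        account = accounts_by_id.get(follower["follower_id"])
        if account is None:
            continue
        # Merging the two rows is also what a denormalised follower row
        # would contain if the account details were stored alongside it.
        joined.append({**follower, **account})
    return joined


# Hypothetical sample data standing in for the two Table Storage tables.
followers = [
    {"account_id": 1, "follower_id": 10},
    {"account_id": 1, "follower_id": 11},
    {"account_id": 1, "follower_id": 12},  # no account details stored yet
]
accounts = [
    {"user_id": 10, "screen_name": "alice", "protected": False},
    {"user_id": 11, "screen_name": "bob", "protected": True},
]

for row in join_followers_with_accounts(followers, accounts):
    print(row["follower_id"], row["screen_name"], row["protected"])
```

This only works because both tables fit comfortably in memory; at larger volumes the duplicated (denormalised) rows or a relational store would be the better fit.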