An interseting couple of days.
Google are targeting enterprises as a PaaS provider. Their view is Digital Natives will bring consumer technologies into the workplace and concentrate on systems of collaboration rather than systems of record . They put a lot of value in the Gartner Nexus of Forces.
Performance Management of Cloud Platforms (PaaS): with elastic computing sloppy development of inefficient code can be masked by infrastructure but at what cost? Application Performance Monitoring may be more important, than in on-prem, to help stop money leaks.
The panel discussion of Big Data Skills started with a great quote: “do you have a [Big Data] Problem or is it a Big [Data Problem]?”, well I found it amusing. The discussion concluded that:
- Many organisations are struggling to get Big Data out of R&D
- We need to be careful that it does not become over-intrusive in people’s lives
As we know there is rarely anything genuinely new: I met the guys from elasticsearch who explained it uses Lucene (as does Neo4j) to index text and that this was something called an inverted index. Well that bought back a few memories from the early 90s when I worked for Dataware Technologies integrating BRS into customer solutions.
The Open Data Institute promote the use of government data and offer help (not financial) to start-ups who want to take that data and add value to it.
Talend echoed sentiment from last week’s BDA conference: why Extract – Transform – Load when you can Extract – Load – Transform using Hadoop.
The Cloud Security Alliance called for more transparency and honesty which should apply to corporations as well as governments. It’s something Enterprise Architects need to consider when they examine proposals that will impact individuals’ anonymity.
The question of whether recent revelations about PRISM will see people move away from Facebook was raise during a panel discussing protection of sensitive data in the cloud. I suspect not but time will tell.
Digital Innovation Group reinforced the message about the importance of context and trying to understand the language of peoples social media output. You need to be able to deal with slang and differentiate between not agreeing with an opinion versus that being their opinion.
Organisations need to know what is being said about them, and Twitter is one of the obvious places to find this out. I’m not looking at trawling all of Twitter but targeting tweets about the organisation or its competitors and what the organisation and its competitors are tweeting.
Because Twitter data is mostly already public, and no private data is being retrieved, this investigation was also a good opportunity to explore Microsoft’s Azure cloud computing platform.
I have built the architecture shown in the following diagram in order to collect and analyse Twitter data:
Nothing here is not well documented elsewhere but I’ll run through the key components:
Having built the above I can say it all works very well, my key learning point are:
- The Twitter API provides a search which is a great place to start exploring who is saying what about an organisations
- Thinking in terms of relational database tables is probably a mistake (that I have fallen into). For example I have a table of followers for a given Twitter account and then a table of information about Twitter accounts; to get the information for all the followers of a given Twitter account it is necessary to join these table. Doh! Can’t do that with Azure Table Storage. Now because I’m not dealing with large volumes this can be accomplished by ‘joining’ the tables in memory (I used a hash table but I’m sure LINQ can do it). The correct solution is either to use SQL Azure or move the information about the Twitter accounts into the table of followers and accept some duplication…or maybe a graph database 🙂
- Probably this is a bug: TweetSharp (or the underlying JSON and REST libraries) will cause a stack overflow (!) if trying to retrieve a list of follower IDs when the Twitter user has protected tweets; you can ensure this does not happen by checking the user’s protected status first.
If you want to perform analysis comparing sub-sets of a graph derived from your email logs you will need to bring in some additional data, for example department. You might be lucky in that the organisations addressing scheme provides some clues e.g. ‘email@example.com’ or ‘firstname.lastname@example.org’ but if not you will have to find another source. Now you might have an efficient HR department who could provide an extract of employees with department, role and any other data they don’t mind you having. Or they might not be so helpful. If HR won’t help there are some other sources: the directory service (e.g. Microsoft Active Directory) or maybe an intranet site that acts as a phonebook and/or organisational chart. Which one you select will depend on content and accuracy. In my case the intranet site provided both the most accurate and richest source of data. Each intranet is going to be different so maybe the following is not going to work for yours but I hope the approach provides some inspiration.
The intranet I have worked with is composed of two page types:
- A page describing the employee with phone numbers, job role, location, department and also some free-text fields allowing people to input experience, skills, what they work on and languages spoken.
- A Hierarchy page: given an employee it displays their manager and direct reports.
Luckily each page type could be bought up by knowing the employee and manager ID and constructing a fairly simple url.
I thought it would be useful to understand where an employee sits in the overall structure so I wrote a program to traverse the directory top-down. Given a known starting point (the ID of the most senior employee and their manager, effectively blank) the algorithm worked like this:
- Find all the direct reports of the current employee
- Recursively call step 1 for each direct report (if any) and keep a count of how many there are
- Get the directory information for the current employee
- Store the directory information and the count of direct reports plus any other derived information (e.g. the employee’s level in the hierarchy and a list of all the managers above them)
Now unfortunately this does not find all the employees we have email records for; this turned out to be because the intranet directory is incomplete so as a second exercise:
- Find all the employees for which there is an email record but no directory entry recorded from the first phase
- Query the intranet directory by email address (fortunately this was a feature of the one I was using)
- Store the directory information
This second step could be used to populate the whole list but it does not provide such rich hierarchical information.
The directory information is best retained in its own table. This is because it changes over time and should be obtained periodically (e.g. monthly). However a refresh of the directory data SHOULD NOT overwrite, instead add a field that contains the date the data was fetched. This makes it possible to use directory data that corresponds to the period of data being analysed or to still derive a tie between people: if A used to manage B, even if they no longer do.
Example directory table. Note: RunDate is the date the directory was traversed (there will be multiple runs in 1 table); ID and manager are internal IDs for the intranet directory; rank is the position in the hierarchy (0 highest); Span is the number of direct reports.
Graph analysis relies on getting data into Nodes (sometimes called Vertices) and Edges. If you have, as I suggested, first built a relational database then a second extract-transform-load step is necessary to get the data into the graph structure. I’ll assume the simplest scenario to describe these steps, which is simply to take all the data. Extract is fairly simple; there are two queries that can be run against the relational database: the first to get the nodes is just a straightforward select against the nodes the second query is to get the edges. The query to get the edges is not so simple and depends how you want to analyse the data; I’ll assume almost the simplest approach: you need to join the email table to the recipients table, select the Sender ID , the Recipient ID and a count(*) this is then grouped by the Sender ID and Recipient ID. This gives an edge for each sender-receiver plus a count of the number of emails; note this is directional, A sends B x emails and B sends A y emails. The simplest possible query could have omitted the count(*) but this is very useful as it gives an indication of the strength of the connection between two nodes (often called the Edge weight). The Transform step could be omitted if the desired analysis can be performed with the nodes and directed edges, however this is not always what’s wanted. If you want to understand the strength of a connection between nodes but don’t care about direction then a transform may be necessary. Now this can be achieved in other ways but it’s useful to understand how to do this in an intermediary because this is useful when combining data from more than one source. Now bear in mind this works for 10,000 nodes: I like to use a hash table (C#) to build the edges. For each edge first re-order the nodes by ID and then create a key by combining the IDs then test the hash table to see if this key exists; if the key is not found create a new entry using the key and store the count as the associated value; if the key already existed retrieve the associated value, add on the count and save it back. The hash table will now contain undirected Edges and the associated combined message count; you can see it would be easy to add in additional counts from other sources to create a combined weight for the relationship. The Load step is going to vary depending on the target but it’s basically taking the nodes as selected from the relational table and the edges from the hash table and getting them into the target. I’ll briefly explore three targets:
- Node XL (remember good for small sets of data, < 1000): if you are using .NET then it’s very simple to drop the NodeXL control onto a Windows form and get a reference to the underlying object model. Then whip through the nodes (or vertices as NodeXL calls them) and add them followed by the Edges; for each Edge you need to get the Node IDs which just requires a quick lookup directly from the NodeXL model.
- Gephi (can handle larger sets, easily 10,000 nodes): my favourite approach is to write out a .GEXF file following very much the method above but there is no need to look-up any internal references for the Edges, you just need the two Node IDs.
- Neo4j (can handle larger sets, easily 10,000 nodes): if you are writing in Java then I believe it’s possible to directly manipulate the object model (very much like you can with .NET and NodeXL) but I used the REST API which is definitely fast enough when running on the same computer. There are some .NET libraries available the wrap the REST API but the level of maturity varies. I have found a problem specific to Neo4j, which is it wants to create its own node IDs which can’t be changed. When you create a new Node Neo4j returns the ID which you will need when adding the Edges, so I suggest recording these in a hash-table to map them to the Node ID from the relational database otherwise you will need to query Neo4j to get the IDs every time you add an Edge.
I hope this has given an overview of how to get email logs into a graph-aware database or application by using and ETL step to first store an intermediate relational database and a second ETL step to move from a relational to graph structure. Keep an eye out for future postings that I intend to drill into some of the detail and expansion of the architecture described here.
If you want to study an organisation email is a rich source, but what have we learnt about the architecture of a solution that allows this analysis to be conducted? First let me caveat the solution I’ll describe: it was built for an enterprise that manages all of its own email with approximately 2000 users (nodes); now this is part of a larger organisation where we have some interest in connections between the enterprise and larger organisation, for this analysis there are approximately 10,000 nodes; if all emails available are included (including those from outside the larger organisation) then there are approximately 100,000 nodes, we have not performed analysis at this scale. Not all technologies suit all scales of analysis (for example NodeXL is really only effective on graphs with under 2000 nodes) so please bear this in mind for your own domain.
Step 1: Getting hold of the email: our example organisation uses Microsoft Exchange Server which allows the administrator to enable Message Tracking logs; these logs include the sender and recipient list for each email, the date and time is was sent and some other pieces of information. The logs will never include the content of the message but can be configured to include the subject of the message. Depending on your organisations security and/or privacy policies this could be contentious but useful if you can obtain it; I’ll be posting a follow-up entitled “Dealing with Sensitive Data” which describes how the message subject can be used whilst maintaining privacy.
Step 2: Somewhere to put it: ultimately a graph database (like Neo4j), or graph visualisation tools (like Gephi) are going to be the best way to analyse many aspects of the data. However I would recommend first loading the email data into a good old-fashioned relational database (I’ve used SQL Express 2012 for everything I describe here). Reasons for doing this are: (1) familiarity, it’s easy to reason about the data if you have a relational database background; (2) you can perform some very useful analysis directly from the database; (3) it allows you to incrementally load data (I’ve not found this particularly simple in Neo4j); (4) it’s easy to merge with other relational data sources. The structure I’ve discovered works best is as follows:
• “node” table: 1 row per email address consisting of an integer ID (primary key), the email address
• “email” table: 1 row per email with an integer email ID, the ‘node’ ID of the sender, the date and time.
• “recipient” table: an email can be sent to multiple recipients; this table has no unique ID of its own but instead has the “email” ID and the “node” ID of each recipient
The tables are shown below. Note some additional fields that I won’t go into now, the important ones are the first 3 in “email” and the first two in “node” and “recipient”
Step 3: Extract from Email Logs, Transform and Load into the relational database: Firstly you’ll want to open and read each email log-file; the Exchange message tracking logs repeat the information we are concerned with several times as they record he progress of each message at a number of points. I found the best way to get at the data I wanted was to find a line that contained an event-id of “RECEIVE” and source-context is not “Journaling” (you may not have Journaling in your files but probably worth excluding in case it gets enabled one day). The rest is pretty simple: create a “node” record for the sender-address if one does not already exist, create a new “email” record (the sender_ID is the “node” record just created/found) and then for each address in recipient-address create a “recipient” record using the email ID just created and then a new or existing “node” ID for each recipient.
Email logs can add-up to a lot of data so I’d advise loading them incrementally rather than clearing down the database and re-loading every time you get a new batch. This requires you consider how to take account of existing node records: you could do this with a bit of stored procedure logic but my approach was to load all the existing nodes into an in-memory hash-table and keep track that way.
In part 2 I’ll explore how to get the relational representation into a graph.