Graph analysis relies on getting data into Nodes (sometimes called Vertices) and Edges. If you have, as I suggested, first built a relational database then a second extract-transform-load step is necessary to get the data into the graph structure. I’ll assume the simplest scenario to describe these steps, which is simply to take all the data. Extract is fairly simple; there are two queries that can be run against the relational database: the first to get the nodes is just a straightforward select against the nodes the second query is to get the edges. The query to get the edges is not so simple and depends how you want to analyse the data; I’ll assume almost the simplest approach: you need to join the email table to the recipients table, select the Sender ID , the Recipient ID and a count(*) this is then grouped by the Sender ID and Recipient ID. This gives an edge for each sender-receiver plus a count of the number of emails; note this is directional, A sends B x emails and B sends A y emails. The simplest possible query could have omitted the count(*) but this is very useful as it gives an indication of the strength of the connection between two nodes (often called the Edge weight). The Transform step could be omitted if the desired analysis can be performed with the nodes and directed edges, however this is not always what’s wanted. If you want to understand the strength of a connection between nodes but don’t care about direction then a transform may be necessary. Now this can be achieved in other ways but it’s useful to understand how to do this in an intermediary because this is useful when combining data from more than one source. Now bear in mind this works for 10,000 nodes: I like to use a hash table (C#) to build the edges. For each edge first re-order the nodes by ID and then create a key by combining the IDs then test the hash table to see if this key exists; if the key is not found create a new entry using the key and store the count as the associated value; if the key already existed retrieve the associated value, add on the count and save it back. The hash table will now contain undirected Edges and the associated combined message count; you can see it would be easy to add in additional counts from other sources to create a combined weight for the relationship. The Load step is going to vary depending on the target but it’s basically taking the nodes as selected from the relational table and the edges from the hash table and getting them into the target. I’ll briefly explore three targets:
- Node XL (remember good for small sets of data, < 1000): if you are using .NET then it’s very simple to drop the NodeXL control onto a Windows form and get a reference to the underlying object model. Then whip through the nodes (or vertices as NodeXL calls them) and add them followed by the Edges; for each Edge you need to get the Node IDs which just requires a quick lookup directly from the NodeXL model.
- Gephi (can handle larger sets, easily 10,000 nodes): my favourite approach is to write out a .GEXF file following very much the method above but there is no need to look-up any internal references for the Edges, you just need the two Node IDs.
- Neo4j (can handle larger sets, easily 10,000 nodes): if you are writing in Java then I believe it’s possible to directly manipulate the object model (very much like you can with .NET and NodeXL) but I used the REST API which is definitely fast enough when running on the same computer. There are some .NET libraries available the wrap the REST API but the level of maturity varies. I have found a problem specific to Neo4j, which is it wants to create its own node IDs which can’t be changed. When you create a new Node Neo4j returns the ID which you will need when adding the Edges, so I suggest recording these in a hash-table to map them to the Node ID from the relational database otherwise you will need to query Neo4j to get the IDs every time you add an Edge.
I hope this has given an overview of how to get email logs into a graph-aware database or application by using and ETL step to first store an intermediate relational database and a second ETL step to move from a relational to graph structure. Keep an eye out for future postings that I intend to drill into some of the detail and expansion of the architecture described here.