Tag Archives: Social Network Analysis

Is there a difference in the direction of email for different ranks?

After looking at the overall direction of email, the next question is: does this vary by rank? The graph below shows the direction of email by rank. As might be expected, rank 0 (the most senior) can only send to lower ranks (there is only one person at rank 0) and rank 8 cannot send downwards. In between, the shape of the curve is remarkably well behaved; I would say this does not show much bias at any rank, considering their position:

Email_to_rank_by_rank

Does email get directed down, up or sideways?

Having asked who is sending all the email and who’s receiving it, another simple statistic is the percentage of email directed upwards, downwards or sideways in the hierarchy. The following pie chart shows the breakdown obtained by comparing the rank of the sender to the rank of the recipient (for each recipient of the email):

Email_to_rank

Sample size: 10 million
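The comparison behind the chart can be sketched in a few lines of Python. This is only an illustration, not the code I used: the function names and the sample pairs are made up, and it assumes a lower rank number means more senior.

```python
from collections import Counter

def direction(sender_rank, recipient_rank):
    """Classify one sender/recipient pair (assumes lower rank number = more senior)."""
    if recipient_rank < sender_rank:
        return "up"
    if recipient_rank > sender_rank:
        return "down"
    return "sideways"

def direction_breakdown(pairs):
    """Return the percentage of pairs going up, down and sideways."""
    counts = Counter(direction(s, r) for s, r in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"up": 0.0, "down": 0.0, "sideways": 0.0}
    return {d: 100.0 * counts[d] / total for d in ("up", "down", "sideways")}

# Hypothetical (sender_rank, recipient_rank) pairs, one per recipient of each email.
sample_pairs = [(3, 4), (3, 4), (4, 3), (5, 5)]
breakdown = direction_breakdown(sample_pairs)
print(breakdown)
```

An email with several recipients contributes one pair per recipient, which matches the way the chart counts.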

Who’s receiving all that email?

I previously asked who was sending all the email; well, the next question is who receives it all? Once again middle management features highly, but then so do the senior managers. This is probably what I would expect, thinking about the roles these groups perform, but is it typical of most organisations?

Email Received

Average emails received per person, grouped by rank.

Who’s sending all that email…it’s the middle management!

This is a pretty simple metric to look at but might be quite revealing for your organisation. I’ve matched the sender of each email to their rank, derived from the corporate directory. The rank is their position from the top of the directory, e.g. A manages B manages C – if A is at the top of the directory then A=1, B=2 and C=3. This approximates their grade: roughly, those ranked 1-2 are senior executives, 3-4 are middle management and 5+ get to do all the work. The chart shows the average number of emails sent at each rank over a number of months. Ranks 1 and 2 are combined because rank 1 is a very small sample.

Email Sent
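For anyone wanting to try this, here is a rough Python sketch of deriving the rank and averaging by rank. It assumes the directory has been reduced to a simple employee-to-manager mapping; the names `manager_of`, `rank_of` and `average_sent_by_rank`, and the sample counts, are hypothetical.

```python
from collections import defaultdict

def rank_of(employee, manager_of):
    """Count steps up the management chain; the person at the top has rank 1."""
    rank = 1
    seen = {employee}
    while manager_of.get(employee) is not None:
        employee = manager_of[employee]
        if employee in seen:
            raise ValueError("cycle in management chain")
        seen.add(employee)
        rank += 1
    return rank

def average_sent_by_rank(sent_counts, ranks):
    """Average emails sent per person, grouped by rank."""
    totals = defaultdict(int)
    people = defaultdict(int)
    for person, rank in ranks.items():
        totals[rank] += sent_counts.get(person, 0)
        people[rank] += 1
    return {rank: totals[rank] / people[rank] for rank in totals}

# A manages B manages C, as in the example above.
manager_of = {"A": None, "B": "A", "C": "B"}
ranks = {person: rank_of(person, manager_of) for person in manager_of}

# Made-up counts of emails sent, purely for illustration.
sent_counts = {"A": 10, "B": 40, "C": 20}
averages = average_sent_by_rank(sent_counts, ranks)
print(ranks, averages)
```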

An individual’s view of their network

In my previous post I described searching from a number of target nodes (e.g. people that speak German) back to the querying user (node). It’s very simple (probably simpler) to show a person what their network looks like, pretty much in the way any social networking site can. I’ve combined the results from the following two Cypher queries: the first finds friends and the second finds friends of friends. I’m sure it could all be done in one query (help me Cypher experts):

START n=node:node_auto_index(email = '[email protected]') MATCH p = (n)--(x) RETURN DISTINCT length(p) AS len, EXTRACT( n IN NODES(p):n.email) AS email, EXTRACT( r IN RELATIONSHIPS(p):r.score) AS score

START n=node:node_auto_index(email = '[email protected]') MATCH p = (n)--(x)--(y) RETURN DISTINCT length(p) AS len, EXTRACT( n IN NODES(p):n.email) AS email, EXTRACT( r IN RELATIONSHIPS(p):r.score) AS score

Again the results from these queries are loaded into memory and ordered by the lowest (weakest) edge score in each path, and the highest-rated paths are visualised using NodeXL:

Network

The central person (node), whose network is displayed, is coloured black and their connections are blue; red nodes are people who have left the organisation – probably not that useful here.

Cutting it down to size – an individual’s view of the graph

Showing someone a graph of where they fit into an organisation along with over 2000 colleagues is simply not practical: not only can’t it all be fitted on a screen at a meaningful level, but it’s just going to be a mess of edges (connections). I previously described loading SNA data first into a relational database and then into Neo4j. The structure I have built in Neo4j is quite simple: a node (person) has a number of attributes including email address and languages, the latter being a list of the languages a person speaks (people provide this information to the corporate directory). The edges in Neo4j have a single attribute, score, which attempts to represent the relative strength of the relationship between two people (nodes). The score is derived from a number of data sources I have previously described.

So here’s the scenario I’m looking at: let’s say you need to find someone who speaks German. The corporate directory can be searched on a number of attributes including language, but what if you don’t know any of the people who speak German and you don’t really want to approach people you don’t know? Well, this is where your friends of friends might be able to help, but how do you know which of your friends may know one of the people who speak German? This is perfect territory for graph databases.

To query Neo4j firstly I’ve enabled auto-indexing for the email and languages attributes (along with others); this is done by editing the neo4j.properties file in the conf folder:

neo4j-auto-index

Then, using Neo4j’s Cypher query language, execute the following (I’m rather proud of this Cypher query but I’m sure a Cypher expert might be able to put me right):

START
n=node(*), m=node:node_auto_index(email = '[email protected]')
MATCH
p = allShortestPaths( (m)<-[r*..6]->(n) )
WHERE
n.languages =~ '(?i).*german.*'
RETURN DISTINCT
length(p) AS len,
EXTRACT( n IN NODES(p):n.email) AS email,
EXTRACT( r IN RELATIONSHIPS(p):r.score) AS score
ORDER BY
length(p)

Let’s take that apart:

  • START: defines two sets of nodes: n is all or any node, m uses the query to locate the node that matches the email address of the user making the query
  • MATCH: allShortestPaths is a built-in function that does exactly what it says; its argument restricts the search to paths of at most six hops between the two sets identified in the START clause, and it returns a list of paths, referred to as p. (A path contains both the nodes and the edges, e.g. you get Node1–Edge1->Node2–Edge2->Node3.) Every path will connect a node n (the German speaker) to node m (the user making the query) and include up to 5 intermediary nodes, since six hops join seven nodes.
  • WHERE: this reduces the size of set n (remember this is node(*)) to only those nodes that have a languages attribute matching the supplied regular expression (the fiddly (?i) bit just means ignore case)
  • RETURN: there will be three things returned:
    • the length of the path,
    • a list of all the nodes in the path
    • the ‘score’ values between each pair of nodes
  • ORDER BY: shortest paths are best (usually; more on that next), so the ordering is by path length

Now shortest paths are best, right? Well, maybe, but from the work I’ve previously described there is a score for every relationship (edge): is a short path with low scores better than a long path with high scores? I don’t know, and I’d love to hear from anyone who’s read a paper on the subject or has any views. The following illustrates this; the capital letters represent nodes (people) and the number in brackets is the score of the relationship between the two nodes. A is the user making the query and X speaks German, and there are the following paths:

Path 1: A–(350)–B–(193)–X

Path 2: A–(5)–C–(9)–X

Path 3: A–(150)–D–(210)–E–(105)–X

Clearly the first is pretty good: the path is short and the relationships are relatively strong. But what about the next one? The path is short, but A does not know C that well and, likewise, C does not know X well. It may be better to go via D and E: although the path is longer, everyone on it probably has a strong relationship.

Having experimented empirically, I found a good result is obtained by ranking the paths on the lowest relationship score in the path (think of this as the ‘weakest link’). The above three paths are ordered thus, with the ‘weakest link’ of each noted alongside:

Path 1: A–(350)–B–(193)–X (weakest link: 193)

Path 3: A–(150)–D–(210)–E–(105)–X (weakest link: 105)

Path 2: A–(5)–C–(9)–X (weakest link: 5)

Results with this ranking appear to work very well; in general shortest paths tend to feature and longer paths with high scores are rarer because, of course, it only takes one weak link to demote it in the rankings.
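The ‘weakest link’ ranking is easy to express in code. Here is a minimal Python sketch using the three example paths; the dictionary layout and the `weakest_link` name are just for illustration, and it is not the code I actually run.

```python
# Relationship scores along each example path, in order from A to X.
paths = {
    "Path 1": [350, 193],
    "Path 2": [5, 9],
    "Path 3": [150, 210, 105],
}

def weakest_link(scores):
    """The lowest relationship score on a path."""
    return min(scores)

# Rank paths so that the one with the strongest weakest link comes first.
ranked = sorted(paths, key=lambda name: weakest_link(paths[name]), reverse=True)
print(ranked)
```

Python’s sort is stable, so paths that tie on their weakest link keep their original order; breaking ties by path length would be a reasonable refinement, but that is my assumption rather than something I have tested.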

The ranked paths can be presented to the user but there is often a lot of redundancy; it could be that Path 3 is repeated but going through node F instead of E. Therefore I’ve found it better to turn the list of paths back into a graph. I’ve used a visualisation component from the excellent NodeXL library to do this. I’ve also limited the number of paths displayed as too many produce a somewhat unreadable result; this is why ranking the paths was important: we want to display the best ones if we can’t display them all.
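Turning the ranked paths back into a graph amounts to collecting the distinct edges from the best few paths. A rough Python sketch follows; the `paths_to_edges` name is hypothetical, and so are the scores on the extra path through F.

```python
def paths_to_edges(ranked_paths, limit):
    """Merge the first `limit` ranked paths into a dict of undirected edges."""
    edges = {}
    for nodes, scores in ranked_paths[:limit]:
        # Pair each node with its successor and the score of the edge between them.
        for a, b, score in zip(nodes, nodes[1:], scores):
            edges[frozenset((a, b))] = score
    return edges

ranked_paths = [
    (["A", "B", "X"], [350, 193]),
    (["A", "D", "E", "X"], [150, 210, 105]),
    (["A", "D", "F", "X"], [150, 180, 105]),  # hypothetical: Path 3 via F instead of E
]
edges = paths_to_edges(ranked_paths, limit=3)
print(len(edges))
```

The shared edge between A and D appears only once in the result, which is where the redundancy disappears.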

knows-languages

The black node in the centre is the user making the query; blue nodes are people who can speak German; white nodes are intermediaries. The red nodes represent people who have left the organisation; they aren’t really that useful here, I just haven’t got round to adding an option to filter them out. Each relationship (line) is labelled with the score (strength).

Cutting it down to size – summarising the graph

Showing someone a view of their organisation of over 2000 people rendered in Gephi might produce an initial ‘wow’ but they can then struggle to make sense of it. A better approach might be to summarise the information by amalgamating the nodes into departments (or whatever organisational unit is meaningful). The following graph shows the organisation I have illustrated before but broken down into ‘departments’. The sizes of the nodes are proportional to the number of people in the department, and the widths of the edges are the totals of all the weights from each underlying edge between the two groups of underlying nodes. To amalgamate the data I created a query that assigns each node ID (from the underlying SQL database) to a group, which is then used to summarise the data using in-memory hash tables; this has the advantage over a SQL ‘group by’ clause of allowing arbitrary groups, independent of the fields in the select clause.

Departments

Looking at the graph you can immediately see strong ties between IT and Ops but BU 2 looks a bit isolated given its size.
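The amalgamation described above can be sketched with Python dictionaries standing in for the in-memory hash tables. The node IDs, group assignments and weights below are invented, and dropping edges within a single department is my assumption for the sketch.

```python
from collections import Counter

def summarise(group_of, edges):
    """Collapse person-level edges into department-level sizes and edge widths."""
    sizes = Counter(group_of.values())  # number of people per department
    widths = Counter()                  # total weight between each pair of departments
    for a, b, weight in edges:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:  # assumption: edges inside one department are not drawn
            widths[frozenset((ga, gb))] += weight
    return sizes, widths

# Hypothetical node ID -> department mapping and (node, node, weight) edges.
group_of = {1: "IT", 2: "IT", 3: "Ops", 4: "BU 2"}
edges = [(1, 3, 120), (2, 3, 80), (1, 2, 300), (3, 4, 5)]
sizes, widths = summarise(group_of, edges)
print(sizes, widths)
```

Because the grouping is just a lookup table, any grouping can be swapped in without touching the query that produced the edges.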

Combining Data to Weight Social Connections in an Organisation

SNA Data Sources

In the above diagram red nodes are from the division of the organisation under study; green and blue are from two other divisions and grey nodes are uncategorised or central functions.

I’ve previously described determining a ‘score’ for social connections taking data from email, meetings, directory and timesheets. The question is how to combine them to produce as complete a picture as possible from the data at hand. I’m fairly sure the best answer is not simply to add the scores together, but I’ve not found any guidance suggesting anything better, so that’s exactly what I have done…add them up. To summarise, the score (or, more correctly, weight) given to each edge is made up of:

  • 1 point for each email exchanged (where there are a maximum of 10 recipients)
  • 1 point for each minute in a one-to-one meeting, reducing rapidly as the number of attendees in the meeting increases
  • 300 points for being in a manages/managed by relationship
  • 1 point for every hour spent on a project divided by the number of people on the project
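Adding the sources up can be sketched as follows in Python. The per-source points are taken as already calculated (the meeting formula in particular is not reproduced here), and the function name, argument names and sample numbers are all hypothetical.

```python
from collections import defaultdict

MANAGER_POINTS = 300  # points for a manages/managed-by relationship

def combine(email_points, meeting_points, manager_pairs, project_points):
    """Sum the per-source points for each pair of people into one edge weight."""
    weights = defaultdict(float)
    for source in (email_points, meeting_points, project_points):
        for pair, points in source.items():
            # frozenset makes ("A", "B") and ("B", "A") the same edge.
            weights[frozenset(pair)] += points
    for pair in manager_pairs:
        weights[frozenset(pair)] += MANAGER_POINTS
    return dict(weights)

# Made-up inputs for two edges.
email_points = {("A", "B"): 40}
meeting_points = {("B", "A"): 30.0, ("B", "C"): 12.5}
manager_pairs = [("A", "B")]
project_points = {("B", "C"): 4.0}

weights = combine(email_points, meeting_points, manager_pairs, project_points)
print(weights)
```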

I’ve pulled this data together over the following periods:

  • Email: 6 months
  • Meetings: 2 years, 3 months
  • Corporate Directory: 6 months (but this is very slow to change so probably reflects the vast majority of the last 2 years)
  • Projects: 1 year

The coverage of the data also varies:

  • Email: for the core of the organisation being studied this is excellent as the data comes from Exchange Server Logs, for the periphery there is limited coverage as only emails being exchanged with the core are captured
  • Meetings: probably less than 50% as not all rooms are visible and not all meetings are booked in rooms; also teleconference information is not captured
  • Corporate Directory: very good, 90%+ but data is limited to corporate hierarchy
  • Timesheets: good but the system is not used universally as not everyone works on projects.

Some Observations using this approach:

  • Email dominates the structure of the network, the others add very little for those in the core; however for those outside the core the others provide additional insight into the structure.
  • There is overlap in these sources: for example, we expect a manager will share emails with their reports and that people on a project will have meetings together. But as the coverage of each source is not complete, this is a small price to pay for seeing the whole network.

Despite the rather simplistic approach the results appear to work quite well but I’d love to hear from anyone who has implemented, or read about, a smarter way to combine these types of SNA sources.

Beyond Email: Timesheets

The final source of social network data I have actively exploited is from a time recording system. These are going to vary widely between organisations and the one I’ve worked with has been quite fiddly to manipulate so I won’t go into too much detail. What I’d like to describe is how I’ve derived the ‘score’:

  1. Extract a list of projects
  2. For each project find the number of people who have booked time to it
  3. Find every pair of people who have booked time to each project
  4. From the pair of people take the lowest number of hours booked to the project
  5. Divide the lowest number of hours (from step 4) by the number of people on the project (from step 2)

The reasoning here is that the number of hours booked to a project by any given person has to be shared out amongst all the other people on the project. Unlike meetings, where I used minutes as part of the formula, here I use hours because working on a project does not necessarily imply spending lots of time interacting.
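The five steps can be sketched in Python like this. The input layout (project, then person, then hours booked) and the function name are hypothetical, and summing a pair’s score across all the projects they share is my assumption; the steps above only define the score for a single project.

```python
from itertools import combinations

def timesheet_scores(hours_by_project):
    """Score each pair of people from the hours they booked to shared projects."""
    scores = {}
    for project, hours in hours_by_project.items():     # step 1: each project
        headcount = len(hours)                          # step 2: people on the project
        for a, b in combinations(sorted(hours), 2):     # step 3: every pair
            lowest = min(hours[a], hours[b])            # step 4: lowest hours of the pair
            pair = frozenset((a, b))
            # step 5: divide by headcount; summing over projects is an assumption
            scores[pair] = scores.get(pair, 0.0) + lowest / headcount
    return scores

# Made-up bookings: hours per person on each project.
hours_by_project = {
    "P1": {"A": 30, "B": 12, "C": 60},
    "P2": {"A": 8, "B": 6},
}
scores = timesheet_scores(hours_by_project)
print(scores)
```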

Beyond Email: Corporate Directory

I’ve already discussed the corporate directory when it comes to enriching email data. It also contains information that can help build a picture of the social network. Specifically the line manager relationship is held in a typical corporate directory.  If you have imported the directory like I did then there is no more to do as the employee ID and manager ID are already there. I’ve also kept historical management structure relationships on the basis that a line management period will build a relationship that tends to persist.

Martin House pointed me to an interesting paper which contains a diagram overlaying the corporate structure (from the directory) with the social network revealed by emails, looks pretty neat, must have a go at creating one.