Category Archives: Social Network Analysis

Email vs. Instant Messaging for Social Network Analysis, Round 3

Making the most of the data available within an organisation means balancing the effort of obtaining and analysing the information against the value derived from it. I’ve been looking to see whether IM data is worth collecting when an organisation has already collected email data. The following graphic shows social networks based on Email, IM and then the combined data; each colour represents a department in the organisation being studied:

Visually the following can be observed:

Email is much more heavily used than IM and gives a more complete picture of the network
The result of combining Email and IM shows that the email structure dominates but, as one can see, the department coloured magenta has moved from neighbouring the red department to neighbouring the green department. At this time I don’t have an explanation for such a dramatic change but I will be investigating.

But what do the numbers look like? The following graph metrics were calculated using Gephi:

	Email	IM	Combined
Average Degree	71.672	15.682	74.263
Avg. Weighted Degree	1722.546	705.802	2331.18
Network Diameter	4	7	4
Graph Density	0.054	0.014	0.056
Modularity	0.661	0.689	0.667
Avg. Clustering Coef.	0.395	0.261	0.391
Avg. Path Length	2.281	3.115	2.252

Adding in the IM data has increased Average Degree from 71.672 to 74.263 and Graph Density from 0.054 to 0.056; this shows that IM has identified relationships that email did not and, therefore, does enhance the pure email graph.
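As a rough illustration of why both metrics rise when IM edges are merged in, here is a minimal Python sketch. It assumes undirected graphs held as sets of person pairs; the function names and the toy edges are hypothetical, and Gephi's own calculation may differ in detail (for example for directed graphs).

```python
def edge(a, b):
    """Store an undirected edge as an order-independent pair."""
    return frozenset((a, b))

def nodes_of(edges):
    """Every person who appears on at least one edge."""
    return set().union(*edges)

def average_degree(edges):
    """Each undirected edge contributes to the degree of two nodes."""
    n = len(nodes_of(edges))
    return 2 * len(edges) / n if n else 0.0

def density(edges):
    """Fraction of all possible undirected pairs that are connected."""
    n = len(nodes_of(edges))
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

# Toy data (hypothetical): IM repeats one email relationship and adds a new one.
email = {edge("a", "b"), edge("b", "c"), edge("c", "d")}
im = {edge("a", "b"), edge("a", "d")}
combined = email | im  # set union: duplicate relationships collapse

print(average_degree(email), average_degree(combined))
print(density(email), density(combined))
```

Only the IM edge that email did not already contain changes the result, which is the effect visible in the table above.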

If the graph is used to identify the strength of relationships or influence, the next question is what weight to assign to an IM compared to an email. My initial thought was that an IM contains much less than an email and so would be worth of the order of one-tenth of an email. However, looking at the average degree of the IM graph (about 16, against about 72 for email), it seems that people are more reserved in who they communicate with using IM and, presumably, have a closer relationship with those people. Therefore I have equated one IM message with one email message.
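The weighting choice can be made explicit with a small Python sketch. It assumes message counts are held per pair of people; combine_weights, im_weight and the sample counts are hypothetical names and values, not taken from the data set.

```python
from collections import Counter

def combine_weights(email_counts, im_counts, im_weight=1.0):
    """Merge per-pair message counts into one edge weight.

    im_weight=1.0 treats one IM as equal to one email (the choice made
    in the post); im_weight=0.1 would be the 'one-tenth' alternative.
    """
    combined = Counter()
    for pair, count in email_counts.items():
        combined[pair] += count
    for pair, count in im_counts.items():
        combined[pair] += im_weight * count
    return dict(combined)

# Hypothetical counts for illustration only.
email_counts = {("alice", "bob"): 40, ("bob", "carol"): 10}
im_counts = {("alice", "bob"): 20, ("alice", "carol"): 5}

equal = combine_weights(email_counts, im_counts)          # IM == email
tenth = combine_weights(email_counts, im_counts, 0.1)     # IM == email / 10
print(equal)
print(tenth)
```

Keeping the weight as a parameter makes it easy to rerun the graph metrics with a different assumption later.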

Email vs. Instant Messaging for Social Network Analysis, Round 1

Email has long been studied to understand Social Networks. In more recent years organisations have embraced internal Instant Messaging, such as Microsoft Lync. Any organisation that wants to understand its Social Networks, and the other insights communication data might reveal, will want to know whether analysis of Instant Messaging logs can enhance the insight gained from other sources such as Email. Depending on an organisation’s culture, Email and Instant Messaging (IM) may represent different levels of trust or formality; revealing the cultural meaning of Email versus IM is beyond this discussion, but anyone who has spent a reasonable amount of time in an organisation will probably be aware of that organisation’s cultural tendencies in communications.

My initial question is: does IM show us anything Email does not? Starting at a macro view, I revisited the inter-departmental communications: there was not much difference between the IM traffic and the Email traffic (when viewed as percentages of overall traffic), but there did seem to be some potentially significant differences in the volumes of communications internal to a department (I did not previously examine these numbers).

To further explore differences between IM and Email, at a macro level, I have looked at the percentage of internal departmental communication compared to total communication emanating from the department. Here is a chart of the results:

For the purposes of looking at differences between IM and Email, this chart shows there does not seem to be much difference in the ratio of use between intra- and inter-departmental communications. My conclusion, so far, is that there do not seem to be significant differences in the use of IM versus Email at this level.
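For readers who want to reproduce the measure, here is a minimal Python sketch of the percentage of a department's outgoing communication that stays inside the department. It assumes each message is reduced to a (sender, recipient) pair and that a lookup from person to department is available; all names and sample data are hypothetical.

```python
from collections import defaultdict

def internal_percentage(messages, department_of):
    """Return {department: % of its outgoing messages sent to itself}."""
    total = defaultdict(int)
    internal = defaultdict(int)
    for sender, recipient in messages:
        source = department_of[sender]
        total[source] += 1
        if department_of[recipient] == source:
            internal[source] += 1
    return {dept: 100.0 * internal[dept] / total[dept] for dept in total}

# Hypothetical people, departments and messages.
department_of = {"ann": "Sales", "bob": "Sales", "cat": "HR"}
messages = [
    ("ann", "bob"), ("ann", "bob"), ("ann", "cat"), ("bob", "ann"),
    ("cat", "ann"), ("cat", "bob"),
]

result = internal_percentage(messages, department_of)
print(result)
```

Running the same function once over the Email log and once over the IM log gives the two series that the chart compares.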

As an aside, the chart above does not show a consistent relationship between department size and the percentage of communication that is internal. Looking at the purpose of each department, they can be categorised as either ‘Product’ (focussed on selling and servicing a product line) or ‘Shared’ (e.g. HR, Accounts), and these seem to fall neatly into those with an internal communications level below 70% (Shared) and above 80% (Product); here is a table of the data:

As this moves away from my discussion of the differences between Email and IM (of which this shows very few), I’ll explain more in another post.

Chord Diagrams

Circular layouts work well for individual-to-individual visualisations but for group-to-group visualisations (such as departments in an organisation), where the graph is close to fully connected, chord diagrams seem to offer a clearer layout. The following diagrams show the same data for the volume of communication between departments. They don’t quite show the same thing: the circular layout’s nodes are proportional to the size of the department whereas the chord diagram is based purely on the volume of communication.

The circular diagram was created with the circular layout plugin in Gephi; the chord diagram is adapted from this article and uses the D3.js library, which means it’s interactive in the browser.

Circular Layouts, Binning and Edge Bundling

Time Well Spent at Cloud World Forum and Big Data World Congress

Mining Twitter from Windows Azure (Part 1)

Organisations need to know what is being said about them, and Twitter is one of the obvious places to find this out. I’m not looking at trawling all of Twitter but targeting tweets about the organisation or its competitors and what the organisation and its competitors are tweeting.

Because Twitter data is mostly already public, and no private data is being retrieved, this investigation was also a good opportunity to explore Microsoft’s Azure cloud computing platform.

I have built the architecture shown in the following diagram in order to collect and analyse Twitter data:

All of this is well documented elsewhere but I’ll run through the key components:

Twitter exposes a REST API
TweetSharp is a .NET Library that wraps the Twitter REST API
An Azure Cloud Service Worker Role is a background processing container
The MS Azure Libraries are provided in the Azure SDK
Table Storage is a non-relational managed store for tabular data
Windows Azure Websites are effectively IIS websites running in Azure

Having built the above I can say it all works very well. My key learning points are:

The Twitter API provides a search which is a great place to start exploring who is saying what about an organisation
Thinking in terms of relational database tables is probably a mistake (one that I have fallen into). For example, I have a table of followers for a given Twitter account and then a table of information about Twitter accounts; to get the information for all the followers of a given Twitter account it is necessary to join these tables. Doh! Can’t do that with Azure Table Storage. Because I’m not dealing with large volumes, this can be accomplished by ‘joining’ the tables in memory (I used a hash table but I’m sure LINQ can do it). The correct solution is either to use SQL Azure or to move the information about the Twitter accounts into the table of followers and accept some duplication…or maybe a graph database 🙂
This is probably a bug: TweetSharp (or the underlying JSON and REST libraries) will cause a stack overflow (!) when trying to retrieve a list of follower IDs if the Twitter user has protected tweets; you can ensure this does not happen by checking the user’s protected status first.
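The in-memory ‘join’ mentioned above can be sketched as follows. This is an illustration in Python rather than the .NET code used in the project, and the field names (follower_id, id, screen_name) are hypothetical stand-ins for whatever the two tables actually hold.

```python
def join_followers(follower_rows, account_rows):
    """Hash join: index accounts by id, then look up each follower."""
    accounts_by_id = {row["id"]: row for row in account_rows}
    joined = []
    for follower in follower_rows:
        account = accounts_by_id.get(follower["follower_id"])
        if account is not None:  # skip followers with no stored account info
            joined.append({**follower, **account})
    return joined

# Hypothetical rows as they might come back from two separate tables.
followers = [
    {"account": "acme", "follower_id": 1},
    {"account": "acme", "follower_id": 2},
    {"account": "acme", "follower_id": 99},  # no matching account row
]
accounts = [
    {"id": 1, "screen_name": "alice"},
    {"id": 2, "screen_name": "bob"},
]

rows = join_followers(followers, accounts)
print(rows)
```

Building the dictionary once keeps the join linear in the number of rows, which is fine at small volumes but still means both tables have to be read into memory.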

Calculating Influence

If influence can roughly be equated with how much an individual communicates, and with whom, then an ‘influencing score’ can be calculated. I’m not looking here at measures of centrality, although I do plan to incorporate these at a later date. Within what appears to be a command-and-control organisation, rank is very important and is known for each individual. I propose that the influencing score between any two individuals is made from the following three factors:

The rank of the target to which the subject is connected
The strength of the connection (dictated by the lowest scoring edge)
The distance (number of edges) from the subject to the target

It would also be essential to track changes over time if this is to be a useful measure.

How did I go about creating this?

First I created a scored relationship, as described before (but reducing the value of being a line manager to 30), between each pair of people (nodes) on a per-month basis. Each relationship (link) was typed consistently so that, for example, the relationship for May 2013 is called MONTH_2013_05
Secondly I used a dynamically generated Cypher query to obtain the graph (network) of relationships to the subject for each month: START n=node:node_auto_index(email = '[email protected]') MATCH p = (n)-[:MONTH_2013_05*1..2]-(x) RETURN DISTINCT '~' AS begin, length(p) AS len, EXTRACT( n IN NODES(p):n.email + '<' + n.rank + '>') AS email, EXTRACT( r IN RELATIONSHIPS(p):r.score) AS score. I won’t pick this query apart here but if anyone wants an explanation please get in touch. I’m also not using any Neo4j client libraries and am simply parsing the result, which is why there is a '~' to mark the start of each record.
The score for each path (the ‘p’ in the Cypher query) is then calculated and all the scores are added together. Here is an example: the path A -(67)- B -(50)- C<rank 4>, where A is the subject, has a score of 50 (the weakest link) * 6 (the multiplier for rank 4, more later) * 0.1 (the distance factor) = 30
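One way the per-month query could be generated is sketched below in Python. The function names and the example.com address are hypothetical; the query text simply mirrors the one quoted above, which uses the Cypher syntax of the Neo4j release available at the time and would probably need rewriting for a current release. The address is interpolated directly into the string only to keep the sketch short.

```python
from datetime import date

def month_relationship_type(month):
    """Relationship type for a month, e.g. May 2013 -> MONTH_2013_05."""
    return f"MONTH_{month.year:04d}_{month.month:02d}"

def build_query(email, month, max_hops=2):
    """Build the path query for one subject and one month."""
    rel = month_relationship_type(month)
    return (
        f"START n=node:node_auto_index(email = '{email}') "
        f"MATCH p = (n)-[:{rel}*1..{max_hops}]-(x) "
        "RETURN DISTINCT '~' AS begin, length(p) AS len, "
        "EXTRACT( n IN NODES(p):n.email + '<' + n.rank + '>') AS email, "
        "EXTRACT( r IN RELATIONSHIPS(p):r.score) AS score"
    )

# Hypothetical subject address, used for illustration only.
query = build_query("subject@example.com", date(2013, 5, 1))
print(query)
```

Looping over a list of months and calling build_query for each one gives the month-by-month series needed to track changes over time.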

Because rank is considered so important, the highest ranks are given much higher multipliers, listed here as rank:multiplier: 0:100, 1:50, 2:25, 3:12, 4:6, 5:3, 6:2, rank 7 or more junior:1

Indirect relationships are reduced to 10% of the score for a direct relationship. The Cypher query returns a maximum of 1 intermediate node (the 1..2 in the query).
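Putting the three factors together, the calculation can be sketched in Python as below. The function and constant names are hypothetical, and treating every rank of 7 or more junior as a multiplier of 1 is an assumption about how the last entry in the list is meant to be read.

```python
# Multipliers as listed in the post; ranks not in the table fall back to 1
# (assumption: that covers rank 7 and anything more junior).
RANK_MULTIPLIER = {0: 100, 1: 50, 2: 25, 3: 12, 4: 6, 5: 3, 6: 2}
INDIRECT_FACTOR = 0.1  # indirect paths count for 10% of a direct one

def rank_multiplier(rank):
    return RANK_MULTIPLIER.get(rank, 1)

def path_score(edge_scores, target_rank):
    """Weakest edge * rank multiplier of the target * distance factor."""
    distance_factor = 1.0 if len(edge_scores) == 1 else INDIRECT_FACTOR
    return min(edge_scores) * rank_multiplier(target_rank) * distance_factor

def influencing_score(paths):
    """Sum the scores of every (edge_scores, target_rank) path."""
    return sum(path_score(scores, rank) for scores, rank in paths)

# The worked example from the post: A -(67)- B -(50)- C<rank 4>
example = path_score([67, 50], 4)
print(round(example, 6))
```

Each record returned by the query supplies one (edge scores, target rank) pair, so the monthly score for a subject is influencing_score over that month's records.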

The following chart shows a plot for three individuals who are in a direct line of management; as expected the influencing score drops as the rank drops. The relative scores are also reasonably consistent.

The next plot shows another direct line management relationship; the senior manager is the same as before. This time it shows a distinct rise in the influence of the mid-ranked individual.

The measurement of influence I have described is fairly crude: for example, it bounces around based on when people are on holiday (this can be fixed by using a value averaged over active days) and there is a degree of double-counting (which can be removed by pruning indirect connections when a direct connection exists). However, empirically it produces results that reflect the reality of individuals known to the author.
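The pruning idea could look something like the following Python sketch. It assumes each path is a tuple of node names starting at the subject, and it reads ‘pruning’ as dropping any two-hop path whose target is also reached directly; that reading, and the names used, are assumptions rather than the author's implementation.

```python
def prune_double_counted(paths):
    """Drop indirect paths to targets that also have a direct path.

    paths: node tuples starting at the subject, e.g. ("A", "B") for a
    direct connection or ("A", "B", "C") for an indirect one.
    """
    direct_targets = {path[-1] for path in paths if len(path) == 2}
    return [
        path for path in paths
        if len(path) == 2 or path[-1] not in direct_targets
    ]

# Hypothetical paths: C is reached both directly and via B.
paths = [("A", "B"), ("A", "C"), ("A", "B", "C"), ("A", "B", "D")]
pruned = prune_double_counted(paths)
print(pruned)
```

Applying this before summing the path scores means a target who is already counted at full weight does not also contribute a 10% indirect score.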

Robert Gimeno's Adventures in Data Science

Data everywhere but what can it tell us?
