Having recently been seeking a new position, I’ve noticed that people often struggle to pull basic facts from my CV, for example “I see you did x for n years”. I can understand why: they have a lot to look at and only limited time to wade through the detail describing a long and varied career. I’m not the first to suggest a data-driven, or infographic, CV, but has anyone got experience of using one when applying for software architect roles, and how well was it received?
In the organisation I’ve studied, email is the undisputed winner at exposing the organisational network. I’ve previously described a number of other sources of data for organisational network analysis, but are there others?
Some of the more obvious ones are telephone records, both desk and mobile phones. These would be relatively simple to analyse, provided they are easily matched to employees, and I expect the results would be quite good.
But there are some others I’ve noticed:
- Conference call dialled-in numbers: these capture employees joining meetings by phone. There would be some overlap with meeting room bookings, but as the two happen at the same time de-duplication is possible.
- Entry control systems: people arriving or leaving a particular building or floor at the same time are likely to have shared some conversation.
- Train, Air and Hotel bookings: employees making the same trips are fairly likely to be spending time together discussing both business and other interests and maybe even having a drink in the hotel bar!
- Electronic payments for vending machines and catering: employees who have used a particular electronic payment point at the same time might have had a discussion over a coffee.
- Car pools: if the organisation manages a car pool scheme it is fairly certain those sharing a car will get to know each other.
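Several of the sources above (entry control, vending payments, conference calls) boil down to the same idea: two people showing up at the same place at roughly the same time. As a rough sketch only, and not something I have run against real data, the counting could look like this. The event records, the names and the 60-second window are all made-up assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical events: (employee, location, time in seconds). Invented data.
events = [
    ("alice", "floor-3", 0),
    ("bob", "floor-3", 20),
    ("carol", "floor-3", 400),
    ("alice", "lobby", 3600),
    ("bob", "lobby", 3610),
]

def co_occurrence_edges(events, window_seconds=60):
    """Count how often two people use the same location within the window."""
    by_location = {}
    for person, location, ts in events:
        by_location.setdefault(location, []).append((ts, person))

    edges = Counter()
    for swipes in by_location.values():
        swipes.sort()  # earliest first, so t2 >= t1 in every pair below
        for (t1, p1), (t2, p2) in combinations(swipes, 2):
            if p1 != p2 and t2 - t1 <= window_seconds:
                edges[frozenset((p1, p2))] += 1
    return edges

edges = co_occurrence_edges(events)
```

Each key in `edges` is an unordered pair of people and the count is a crude tie strength. The window size matters a lot: too wide and everyone in a busy lobby appears connected.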
I’ve not had the opportunity to explore any of these data sets, so I’d like to hear from anyone who has. But is this going too far? Your thoughts?
In the pre-print book Twitter Data Analytics, the section ‘Retweet Network Fallacy’ describes the problem that with a retweet you can only tell who the original tweeter is, not the network through which it propagated.
I’ve been doing some work on using retweets to help understand the strength of connections in a social network (you can see some early results here), and this fallacy is an issue for that work. I believe it can, at least partially, be resolved by using follower data.
I’ve previously discussed Twitter Followers and I don’t think this concept takes too much explanation. Here is a simple example of a followers network: Peter is followed by John who is, in turn, followed by Alice and Bob, etc.
Now let’s look at the problem described in Twitter Data Analytics, retweets: in this simple example Peter is retweeted by John who is, in turn, retweeted by Alice and Bob.
But if you look purely at retweet data returned by the Twitter API you will see this:
It appears that John, Alice and Bob retweeted Peter; the structure of the network is lost.
Let’s look at this again with the retweets overlaid on the followers network:
How does this help? Well, first load the follower network into a graph database and then take a look at each retweeter. Graph databases are excellent at searching relationships and finding paths, so starting with John and Alice:
In this simple case there is a single path from John back to Peter with no intermediate nodes, so it’s quite likely John directly retweeted Peter. The only path from Alice back to Peter is the one through John (and we know John retweeted), so it is quite likely that Alice saw the retweet from John and retweeted that.
Now humans, being complicated things, make complicated social networks. Let’s take a closer look at Bob:
There are multiple paths from Bob back to Peter and we could take any of the following strategies: look at only the shortest paths, look at all paths, or set a limit on path lengths (with a limit of 2, which is also the shortest path length here, we will only identify the paths through John and Mary but not the one through Jim). I would suggest, if possible, looking at all paths, but this will have performance consequences for very large graphs.
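To make the three strategies concrete, here is a minimal sketch in plain Python rather than a graph database. It assumes the follower network is held as a dictionary from each account to the accounts it follows. The intermediate account "Kate" on Jim's path is invented for illustration; all the example tells us is that the path through Jim is longer than 2.

```python
# follows[x] = accounts that x follows (edges point from follower to followed).
follows = {
    "John": {"Peter"},
    "Mary": {"Peter"},
    "Alice": {"John"},
    "Bob": {"John", "Mary", "Jim"},
    "Jim": {"Kate"},   # "Kate" is a hypothetical intermediate account
    "Kate": {"Peter"},
}

def paths_to_source(follows, retweeter, source, max_len=None):
    """All simple paths retweeter -> ... -> source, optionally capped at max_len edges."""
    found = []
    stack = [[retweeter]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == source:
            found.append(path)
            continue
        if max_len is not None and len(path) - 1 >= max_len:
            continue  # already at the edge limit, do not extend further
        for nxt in sorted(follows.get(node, ())):
            if nxt not in path:  # keep paths simple (no repeated accounts)
                stack.append(path + [nxt])
    return sorted(found, key=lambda p: (len(p), p))

def shortest_paths(follows, retweeter, source):
    """Only the paths that share the minimum length."""
    paths = paths_to_source(follows, retweeter, source)
    if not paths:
        return []
    shortest = len(paths[0])
    return [p for p in paths if len(p) == shortest]

all_paths = paths_to_source(follows, "Bob", "Peter")
limited = paths_to_source(follows, "Bob", "Peter", max_len=2)
shortest = shortest_paths(follows, "Bob", "Peter")
```

With this data, `all_paths` includes the route through Jim, while `limited` and `shortest` both keep only the routes through John and Mary. Enumerating every simple path grows very quickly on a dense graph, which is the performance cost mentioned above.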
The next question is do we want to look at all the tweets from Mary and Jim to see if they also retweeted the original (this can be an expensive operation – in API time or if buying tweets)? If not then we don’t know the actual path but can estimate the likelihood of each path having been taken perhaps with the shortest being more likely (I’m not exactly sure how the maths has to work here…help me out), or maybe there is some historical data we have about existing paths.
If examining each person’s tweets then consider the following possibilities, from the above example:
- Only John retweeted: the path is most likely via John
- John and Mary retweeted: it’s equally likely to have come via John or Mary, unless there is some historic data suggesting stronger existing ties to one of them
- John, Mary and Jim retweeted: well, I’m not sure. Are the shorter paths more likely? I guess they are, as they would have placed the retweet on the user’s timeline first, but the path through Jim cannot be completely dismissed. I’m wondering if Infer.NET could help here?
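One very rough way to put numbers on the cases above is sketched below. It keeps only the paths whose intermediate accounts are all known to have retweeted, then shares the probability out with a penalty for each extra hop. The `decay` factor of 0.5 is an arbitrary assumption, not derived from anything, and "Kate" is again a made-up intermediate on Jim's path. Treat it as a placeholder for proper maths or historical tie data, not as the answer.

```python
# Candidate paths from Bob back to Peter ("Kate" is hypothetical).
candidate_paths = [
    ["Bob", "John", "Peter"],
    ["Bob", "Mary", "Peter"],
    ["Bob", "Jim", "Kate", "Peter"],
]

def path_likelihoods(paths, retweeters, decay=0.5):
    """Share probability across feasible paths; weight = decay ** intermediates."""
    weights = {}
    for path in paths:
        intermediates = path[1:-1]
        # A path is only feasible if everyone along it passed the tweet on.
        if all(person in retweeters for person in intermediates):
            weights[tuple(path)] = decay ** len(intermediates)
    total = sum(weights.values())
    if not total:
        return {}
    return {path: weight / total for path, weight in weights.items()}

only_john = path_likelihoods(candidate_paths, {"John"})
john_and_mary = path_likelihoods(candidate_paths, {"John", "Mary"})
everyone = path_likelihoods(candidate_paths, {"John", "Mary", "Jim", "Kate"})
```

This reproduces the intuition in the list: one feasible path gets all the weight, two equal-length paths split it evenly, and a longer path gets a smaller but non-zero share. Historical data about existing ties could replace the fixed `decay` with per-edge weights.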
Finally consider this:
Well who knows? Maybe there are some protected Twitter accounts or maybe Steve picked this up because of a hashtag, which is a whole other topic.
Has anyone out there already answered these questions? If so, please get in touch.
A compelling presentation showing how easy it is to take off-the-shelf software (COTS in the old jargon) to go right from extracting social data to sorting, querying and presenting it.
For those of us from an IT Architecture background I’ve illustrated the ETL/Data Warehouse type steps that these three offerings bring together (they have already built the integrations), it really did look very straightforward.
What was missing, for me, was the ability to explore or analyse the social network. I spoke to Datasift a few weeks ago about Twitter data and they explained that they did not provide follower data so, at this stage, those of us wanting to look into the networks are still going to have to write a bit of code.