I recently finished an excellent book on how to deal with information presented as text: Taming Text, from Manning. The authors do a good job of introducing each topic and explaining how a number of open source tools can be applied to the problems it presents. I’ve not studied the tools themselves, but the topic introductions alone are well worth the read.
I have summarised each topic below:
- It’s hard to get an algorithm to understand text in the way humans can. Language is complex and an area of much academic study. Text is everywhere and contains plenty of potentially useful information.
- The first step in dealing with text is to break it into parts. The simplest aim of this step is to extract individual words, but there are a number of approaches, and the more sophisticated ones also need to handle punctuation. This process of splitting text up is called tokenisation. Individual words may then be put through a stemming algorithm so that plurals and different tenses of the same stem can be equated. A stem might be a recognisable word, but not necessarily.
- In order to search content it must first be indexed, which requires tokenisation and stemming, and perhaps also stop-word removal and synonym expansion. For subsequent ranking it also helps to use an index that records word positions, so that the distance between matched search terms can be calculated for each document. There are a number of algorithmic approaches to ranking results, the simplest of which are based on the vector space model. Ranking is, of course, an evolving area, and the big internet search engines refine it constantly. Another refinement that can be applied to search is the key constituent of spell-checking: fuzzy matching.
- Fuzzy matching is another area of academic research, with established algorithms based on character overlap, edit distance and n-gram edit distance, any of which may be combined with prefix matching using a trie (prefix tree). The most important thing to understand about fuzzy matching is that different algorithms are more or less effective depending on the sort of information being matched: movie titles, for example, are best matched on Jaro-Winkler distance, whereas actors’ names are best matched with a more exact algorithm, given that they are used like brand names.
- It can be useful to be able to extract people, places and things (including monetary amounts, dates, etc.). Again there are a number of algorithms for achieving this, including open source implementations from the OpenNLP project. Machine learning can play a part provided there are plenty of tagged (training) examples available, which is especially useful where domain-specific text needs to be ‘understood’.
- Given a large set of documents, there is often a requirement to group similar ones together. This process is called clustering and can be observed in operation on news aggregation sites. Note that clustering does not assign meaning to each cluster. There are a number of established algorithms, many of which are shared with other clustering problems. Given the large volumes and algorithmic complexity, a real-world clustering task is quite likely to need parallel processing, which is where the Carrot2 and Apache Mahout projects come in, Mahout building on top of Apache Hadoop.
- Another activity for sets of documents is classification. It is similar to clustering but starts with documents that have already been assigned to a pre-determined category by a human or some other mechanism, for example by asking users to tag articles. An example classification task is sentiment analysis, such as rating reviews as positive or negative. There are, of course, a number of algorithms and implementations to choose from, each with its own trade-offs in accuracy and performance.
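As a rough illustration of the tokenisation and stemming described above, here is a minimal Python sketch. The regex tokeniser and the toy suffix list are my own simplifications, nothing like a production stemmer such as Porter’s, but they show how ‘chasing’ can end up as the unrecognisable stem ‘chas’:

```python
import re

def tokenise(text):
    # Lowercase, then split on anything that is not a letter or digit;
    # a crude way of discarding punctuation.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(word):
    # Toy suffix stripping: equates simple plurals and tenses by
    # removing a known suffix when enough of the word remains.
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

tokens = [stem(t) for t in tokenise("The cats were chasing mice.")]
# ['the', 'cat', 'were', 'chas', 'mice'] — note 'chas' is not a word.
```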
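To make the indexing and ranking ideas concrete, here is a toy inverted index in Python. Storing token positions per document is what later allows term proximity to be computed; the scoring below simply counts matched query terms, a crude stand-in for real vector-space scoring, and the document ids and text are invented for the example:

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index mapping term -> {doc_id: [positions]}.
    # Keeping positions allows term distance to be computed later.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    # Score each document by how many query terms it contains;
    # a real engine would use TF-IDF / vector-space scoring here.
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, {}):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {"doc1": "taming text is fun", "doc2": "text mining with lucene"}
index = build_index(docs)
```

A search for "text fun" then ranks doc1 (two matched terms) ahead of doc2 (one).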
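The edit-distance and n-gram ideas behind fuzzy matching can be sketched in a few lines of Python; this is the classic dynamic-programming Levenshtein distance plus a simple bigram-overlap (Jaccard) score, not any particular library’s implementation:

```python
def edit_distance(a, b):
    # Levenshtein distance: the minimum number of single-character
    # insertions, deletions and substitutions needed to turn a into b,
    # computed row by row to keep memory small.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def bigram_overlap(a, b):
    # Jaccard overlap of character bigrams: 1.0 for identical strings,
    # 0.0 for strings with no two-character sequence in common.
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga | gb), 1)
```

For example, `edit_distance("kitten", "sitting")` is the textbook 3.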
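For a feel of what a clustering algorithm actually does, here is a plain k-means sketch in Python, run on small numeric vectors rather than real document vectors; frameworks such as Mahout implement the same idea at scale:

```python
import math
import random

def dist(p, q):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Component-wise mean of a non-empty list of vectors.
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, iterations=20, seed=0):
    # Plain k-means: repeatedly assign each point to its nearest
    # centroid, then move each centroid to the mean of its cluster.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters
```

Note, as in the bullet above, that the algorithm groups the points but attaches no meaning to either group.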
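Finally, classification from pre-labelled examples can be illustrated with a tiny multinomial naive Bayes classifier, one common point on the accuracy/performance trade-off (simple and fast, rarely the most accurate); the training examples below are invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    # Multinomial naive Bayes with add-one (Laplace) smoothing.
    def fit(self, examples):
        # examples: list of (list_of_words, label) pairs.
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, words):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.label_counts[label])
            for w in words:
                score += math.log((counts[w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.label_counts, key=log_score)

nb = NaiveBayes().fit([
    ("great fantastic loved it".split(), "pos"),
    ("terrible awful hated it".split(), "neg"),
])
```

With this training set, `nb.predict("loved it".split())` comes out as "pos".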
A compelling presentation showed how easy it is to use off-the-shelf software (COTS, in the old jargon) to go right from extracting social data to sorting, querying and presenting it.
For those of us from an IT Architecture background, I’ve illustrated the ETL/Data Warehouse-type steps that these three offerings bring together (they have already built the integrations); it really did look very straightforward.
What was missing, for me, was the ability to explore or analyse the social network. I spoke to Datasift a few weeks ago about Twitter data and they explained that they did not provide follower data so, at this stage, those of us wanting to look into the networks are still going to have to write a bit of code.
An interesting couple of days.
Google are targeting enterprises as a PaaS provider. Their view is that Digital Natives will bring consumer technologies into the workplace and will concentrate on systems of collaboration rather than systems of record. They put a lot of value in the Gartner Nexus of Forces.
Performance Management of Cloud Platforms (PaaS): with elastic computing, sloppy development of inefficient code can be masked by the infrastructure, but at what cost? Application Performance Monitoring may be even more important than it is on-premise, to help stop the money leaks.
The panel discussion on Big Data skills started with a great quote: “do you have a [Big Data] Problem or is it a Big [Data Problem]?” Well, I found it amusing. The discussion concluded that:
- Many organisations are struggling to get Big Data out of R&D
- We need to be careful that it does not become over-intrusive in people’s lives
As we know, there is rarely anything genuinely new: I met the guys from elasticsearch, who explained that it uses Lucene (as does Neo4j) to index text, and that this is something called an inverted index. Well, that brought back a few memories of the early 90s, when I worked for Dataware Technologies integrating BRS into customer solutions.
The Open Data Institute promote the use of government data and offer help (not financial) to start-ups who want to take that data and add value to it.
Talend echoed sentiment from last week’s BDA conference: why Extract – Transform – Load when you can Extract – Load – Transform using Hadoop.
The Cloud Security Alliance called for more transparency and honesty which should apply to corporations as well as governments. It’s something Enterprise Architects need to consider when they examine proposals that will impact individuals’ anonymity.
The question of whether recent revelations about PRISM will see people move away from Facebook was raised during a panel discussion on protecting sensitive data in the cloud. I suspect not, but time will tell.
The Digital Innovation Group reinforced the message about the importance of context, and of trying to understand the language of people’s social media output. You need to be able to deal with slang, and to differentiate between someone disagreeing with an opinion and that actually being their opinion.
A very interesting day. In the introduction the chairman asked for a show of hands as to who worked in IT or other parts of organisations: the split was about 50/50.
My main observations are:
- Hadoop is the standard; think of it as ETL on steroids: you will probably still want to feed the results into traditional databases and analytical tools. Hive provides a SQL-like language on top. You can use Hadoop to make your archives ‘active’
- Organisations need to know what is being said about them, too often people find out what is happening in their own organisation on social media first.
- Think of the value in the data. For example car manufacturers are increasing the number of sensors in cars and collecting the data: they understand how you drive, maybe they could offer you insurance?
- Context is very important when looking at a piece of unstructured data.
- Decision makers need to be given a relevant subset of data.
- Organisations need to monitor global mega-trends. Take a look at http://www.news-spectrum.com/
- If you are analysing email content, the disclaimers often placed at the end of messages can lead to a lot of misleading conclusions
- “See Lots – Know Little – Do Less” (David Ackroyd, Telefonica); in other words too much information is not useful
- When you have a lot of data you can start looking for hidden patterns
- Prediction: can you spot customers who are about to depart?
- A Big Data initiative needs to offer value. Look for the sweet spot: a conjunction of revenue, cost and risk.
- Make sure Big Data thinking includes an outside-in perspective
- Data Art is the next big paradigm?