Category Archives: Weka

Prediction – Don’t Give Up

I previously described a dataset that was stubbornly refusing to yield any predictability. Using a larger set of the same data, I realised that rather than trying to predict the outcome of an individual opportunity to sell, what was actually of interest was whether a particular lead is likely to become a customer (most people don’t buy multiple products of this type, and there are often a number of interactions with each customer before a sale). By focussing on the customer, it became clear there were some noticeable distributions amongst the parameters. The dataset was very nicely split 50/50 between successful and non-successful opportunities, and the best predictors were able to correctly classify almost 70% of cases. The most accurate predictor (highlighted) was the time between the first and last opportunity with that customer. This is probably not the most useful, as it could be argued that a customer coming back after a reasonable period has thought about the product and decided to proceed.
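The shift from opportunities to customers can be sketched in a few lines of Python. This is only an illustration of the idea, not the preparation actually used for the Weka dataset: the field names (`customer_id`, `date`, `won`) and the sample rows are assumptions made up for the example.

```python
from datetime import date


def customer_features(opportunities):
    """Collapse opportunity rows into one summary row per customer."""
    by_customer = {}
    for opp in opportunities:
        entry = by_customer.setdefault(
            opp["customer_id"],
            {"first": opp["date"], "last": opp["date"], "count": 0, "converted": False},
        )
        entry["first"] = min(entry["first"], opp["date"])
        entry["last"] = max(entry["last"], opp["date"])
        entry["count"] += 1
        # A customer counts as converted if any of their opportunities was won.
        entry["converted"] = entry["converted"] or opp["won"]
    return {
        cid: {
            "days_first_to_last": (e["last"] - e["first"]).days,
            "opportunities": e["count"],
            "converted": e["converted"],
        }
        for cid, e in by_customer.items()
    }


# Hypothetical sample data, invented for illustration only.
opportunities = [
    {"customer_id": "A", "date": date(2012, 1, 10), "won": False},
    {"customer_id": "A", "date": date(2012, 3, 10), "won": True},
    {"customer_id": "B", "date": date(2012, 2, 1), "won": False},
]
features = customer_features(opportunities)
```

Each customer then contributes a single instance, with the gap between first and last opportunity available as one of its parameters.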

Weka Parameters 2012 Set

Whilst the mathematically best predictor might not be that useful in a business setting, there are some other parameters that show noticeably useful predictive capabilities.
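One simple way to compare parameters one at a time is a one-rule style score: within each category of a parameter, predict whichever outcome is more common, then count how many cases that gets right. The sketch below is in the spirit of Weka’s OneR classifier but is not a claim about how the figures above were produced. The parameter names and rows are hypothetical, and scoring on the same data used to build the rule gives an optimistic estimate.

```python
from collections import Counter, defaultdict


def one_rule_accuracy(rows, parameter, outcome="converted"):
    """Accuracy of predicting the majority outcome within each category."""
    outcomes_by_category = defaultdict(Counter)
    for row in rows:
        outcomes_by_category[row[parameter]][row[outcome]] += 1
    # In each category the majority outcome is the prediction, so the
    # number of correct cases is the size of the larger group.
    correct = sum(max(counts.values()) for counts in outcomes_by_category.values())
    return correct / len(rows)


def rank_parameters(rows, parameters, outcome="converted"):
    """Return (parameter, accuracy) pairs, best first."""
    scores = {p: one_rule_accuracy(rows, p, outcome) for p in parameters}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data, invented for illustration only.
rows = [
    {"age_band": "18-30", "channel": "web", "converted": True},
    {"age_band": "18-30", "channel": "phone", "converted": True},
    {"age_band": "31-50", "channel": "web", "converted": False},
    {"age_band": "31-50", "channel": "phone", "converted": False},
    {"age_band": "31-50", "channel": "web", "converted": True},
    {"age_band": "18-30", "channel": "web", "converted": False},
]
ranking = rank_parameters(rows, ["age_band", "channel"])
```

With a 50/50 class split, any parameter scoring well above 0.5 on held-out data is worth a closer look.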

Prediction: not all that easy?

Predicting whether a potential sale is converted into an actual sale, based on information about the potential sale, is a classification problem. We are trying to classify each potential sale into a ‘converted’ or ‘not converted’ class. However, it’s not always going to be possible. The following set of graphs shows Weka’s pre-processing parameters plot:

Weka Parameters

The final graph (bottom right) shows the outcome from the training data that has been captured: blue indicates ‘not converted’ and red indicates ‘converted’. The ratio is approximately 40% converted and 60% not converted. The preceding eleven graphs show various parameters (e.g. customer’s age category or product variant); each category within a graph is a stacked bar showing the converted and non-converted ratio for that subset. Visually, it can be observed that the ratios of converted to non-converted are fairly consistent across all the parameters; there are a few exceptions, but these are all in the much smaller subsets. This is not a promising start. Could a combination of two parameters be the answer? Weka allows us to visualise each pair of parameters with a plot matrix:

Weka Plot Matrix

If any combination of two parameters were a good classifier, then we would start to see asymmetrical distributions of the colours in the plots at the intersection of those parameters. Note that the top row and right-hand column are the parameter being predicted, which is why the colours clearly separate into a row or column; effectively these are what was shown in the previous group of charts.
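The same check can be made numerically by tabulating the conversion rate for every combination of values of two parameters. The following is a minimal sketch with hypothetical parameter names (`product`, `region`) and made-up rows, deliberately built so that neither parameter helps alone but the pair does.

```python
from collections import defaultdict


def conversion_rates(rows, parameters, outcome="converted"):
    """Map each combination of parameter values to (conversion rate, case count)."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in parameters)
        totals[key] += 1
        if row[outcome]:
            wins[key] += 1
    return {key: (wins[key] / totals[key], totals[key]) for key in totals}


# Hypothetical sample data, invented for illustration only.
rows = 2 * [
    {"product": "A", "region": "x", "converted": True},
    {"product": "A", "region": "y", "converted": False},
    {"product": "B", "region": "x", "converted": False},
    {"product": "B", "region": "y", "converted": True},
]

single = conversion_rates(rows, ["product"])
pair = conversion_rates(rows, ["product", "region"])
```

Here `single` shows a flat 50% rate for both products, while `pair` separates the outcomes completely. The case count is returned alongside each rate because, as noted earlier, extreme ratios tend to appear in the smallest subsets.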

Maybe three or more parameters would produce a good classifier. Inspecting this visually would need more dimensions, and that is not going to work beyond three at most. Fortunately, there are numerous classification algorithms, many of which have implementations in Weka, that will handle multiple parameters. Unfortunately, for the dataset in question, no classifier produced a reasonable result, so it would appear the parameters selected (because they were readily available) do not have a measurable influence on the outcome.

Prediction

Leaving email data behind for a while, I have turned to another set of data my organisation collects, which represents potential business. Using Weka, I have started to investigate whether conversion of the potential business to actual business can be predicted. I have a sample of 11,000 recent prior cases and know whether they have led to business (‘converted’) or not: 41% converted and 59% did not. Using some summary-level data, I have tried a number of classification algorithms. Many of these reported approximately 60% correctly classified instances (which is not very good, considering that always predicting ‘No’ is 59% correct). The best gave 70%-75%, but on further investigation (of those where enough information was provided) this was due to over-fitting. It was particularly noticeable for rule-based algorithms, which were generating up to a few hundred rules.
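The comparison against always predicting ‘No’ is worth making explicit, since a raw accuracy figure means little without it. A small sketch follows, using a made-up label list with the same 41%/59% split rather than the real 11,000 cases; the function names are illustrative only.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


def lift_over_baseline(accuracy, baseline):
    """How many accuracy points a classifier adds over the baseline."""
    return accuracy - baseline


# Hypothetical labels reproducing the 41% / 59% split described above.
labels = ["converted"] * 41 + ["not converted"] * 59
baseline = majority_baseline_accuracy(labels)

# A classifier reporting 60% accuracy adds about one point over the baseline.
lift = lift_over_baseline(0.60, baseline)
```

A classifier is only interesting if its accuracy on data it has not seen clears this baseline by a meaningful margin.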