Tag Archives: Weka

Prediction: not all that easy?

Predicting whether a potential sale is converted into an actual sale, based on information about the potential sale, is a classification problem. We are trying to classify each potential sale into a ‘converted’ or ‘not converted’ class. However, this is not always going to be possible. The following set of graphs shows Weka’s pre-processing parameter plots:

Weka Parameters

The final graph (bottom right) shows the outcome from the training data that has been captured: blue indicates ‘not converted’ and red indicates ‘converted’. The ratio is approximately 40% converted and 60% not converted. The preceding eleven graphs show various parameters (e.g. customer’s age category or product variant); each category within a graph is a stacked bar showing the converted and non-converted ratio for that subset. Visually, the ratios of converted to non-converted are fairly consistent across all the parameters; there are a few exceptions, but these are all in the much smaller subsets. This is not a promising start. Could a combination of two parameters be the answer? Weka allows us to visualise each pair of parameters with a plot matrix:
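The per-category comparison that the stacked bars give visually can also be done numerically. The following is a minimal Python sketch, not anything Weka produces: the field names (`age_band`, `converted`) and the toy records are hypothetical and stand in for the real data. It reports the converted fraction and the subset size for each category, so that small subsets with unusual ratios are easy to spot.

```python
from collections import defaultdict


def conversion_rates(records, parameter):
    """Return {category: (converted_fraction, count)} for one parameter."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for record in records:
        category = record[parameter]
        totals[category] += 1
        if record["converted"]:
            converted[category] += 1
    return {c: (converted[c] / totals[c], totals[c]) for c in totals}


# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"age_band": "18-30", "converted": True},
    {"age_band": "18-30", "converted": False},
    {"age_band": "18-30", "converted": False},
    {"age_band": "31-50", "converted": True},
    {"age_band": "31-50", "converted": True},
    {"age_band": "31-50", "converted": False},
    {"age_band": "31-50", "converted": False},
    {"age_band": "31-50", "converted": False},
]

rates = conversion_rates(records, "age_band")
for category, (rate, count) in sorted(rates.items()):
    print(f"{category}: {rate:.0%} converted (n={count})")
```

Printing the count next to each rate matters here, because a category with only a handful of cases can show an extreme ratio purely by chance.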

Weka Plot Matrix

If any combination of two parameters were a good classifier, then we would start to see asymmetrical distributions of the colours in the plots at the intersection of those parameters. Note that the top row and the right-hand column are the parameter being predicted, which is why the colours there clearly separate into a row or column; effectively, these plots are what was shown in the previous group of charts.
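To make "asymmetrical distribution" concrete, here is a small Python sketch under stated assumptions: the function name `pair_rate_spread`, the field names and the toy records are all hypothetical, and the data is deliberately constructed so that neither parameter separates the classes alone while the pair does. It is not the dataset discussed here. For each pair of parameters it measures the largest gap between a cell's conversion rate and the overall rate, skipping cells with too few cases.

```python
from collections import defaultdict
from itertools import combinations


def pair_rate_spread(records, parameters, target="converted", min_count=2):
    """For each pair of parameters, return the largest gap between a cell's
    conversion rate and the overall rate, ignoring cells below min_count."""
    overall = sum(1 for r in records if r[target]) / len(records)
    spreads = {}
    for a, b in combinations(parameters, 2):
        totals = defaultdict(int)
        hits = defaultdict(int)
        for r in records:
            cell = (r[a], r[b])
            totals[cell] += 1
            if r[target]:
                hits[cell] += 1
        gaps = [
            abs(hits[cell] / totals[cell] - overall)
            for cell in totals
            if totals[cell] >= min_count
        ]
        spreads[(a, b)] = max(gaps, default=0.0)
    return spreads


# Hypothetical toy data: each parameter alone converts at 50%,
# but the combination of the two separates the classes completely.
rows = [
    ("young", "A", True), ("young", "A", True),
    ("young", "B", False), ("young", "B", False),
    ("old", "A", False), ("old", "A", False),
    ("old", "B", True), ("old", "B", True),
]
records = [
    {"age_band": age, "variant": variant, "converted": outcome}
    for age, variant, outcome in rows
]

spreads = pair_rate_spread(records, ["age_band", "variant"])
for pair, gap in spreads.items():
    print(pair, f"max gap from overall rate: {gap:.0%}")
```

A gap close to zero for every pair corresponds to plots in which the two colours are mixed evenly everywhere; a large gap in a well-populated cell is the numerical counterpart of a visibly one-coloured region.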

Maybe three or more parameters would produce a good classifier. To check this visually, more dimensions would be needed, and that is not going to work beyond three at most. Fortunately there are numerous classification algorithms, many of which have implementations in Weka, that will handle multiple parameters. Unfortunately, for the dataset in question no classifier produced a reasonable result, so it would appear that the parameters selected (because they were readily available) do not have a measurable influence on the outcome.


Leaving email data behind for a while, I have turned to another set of data my organisation collects, which represents potential business. Using Weka, I have started to investigate whether conversion of the potential business to actual business can be predicted. I have a sample of 11,000 recent prior cases and know whether they led to business (‘converted’) or not: 41% converted and 59% did not. Using some summary-level data, I have tried a number of classification algorithms. Many of these reported approximately 60% correctly classified instances, which is not very good considering that always predicting ‘No’ is 59% correct. The best gave 70%-75%, but on further investigation (of those where enough information was provided) this was due to over-fitting. This was particularly noticeable for rule-based algorithms, which were generating up to a few hundred rules.
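The comparison against "always predict ‘No’" is a majority-class baseline, and it is worth computing explicitly before judging any classifier. Below is a minimal Python sketch; the function name is hypothetical, and the label list is a stand-in built only from the 41%/59% split quoted above rather than from the real cases. Weka offers a comparable baseline as its ZeroR classifier.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Stand-in labels reproducing the 41% / 59% split (100 cases for simplicity).
labels = ["converted"] * 41 + ["not converted"] * 59

baseline = majority_baseline_accuracy(labels)
reported = 0.60  # approximate accuracy reported by many of the classifiers
gain = reported - baseline

print(f"baseline accuracy: {baseline:.0%}")
print(f"gain over baseline: {gain:+.1%}")
```

Seen this way, a 60% classifier adds roughly one percentage point over guessing the majority class, which is why that figure is not encouraging.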