In the first installment of this blog series, we described how the quest for artificial intelligence (AI) gave us the discipline of machine learning – the study of how to enable an intelligent agent to learn from data to improve its performance. But what has any of that got to do with commercial analytics?
Learning and predicting from data pre-dates the study of AI – and dusting off centuries-old, tried and tested mathematical techniques like linear regression and Bayesian statistics turned out to be (much) easier than solving some of the “hard” problems in AI.
Not only that, but as more and more business processes and systems were computerized during the 70s and 80s – and as commercial databases began to proliferate as a result – applying these methods to the data in those databases also turned out to have very valuable commercial applications, such as forecasting demand for perishable products in grocery retail or identifying potentially fraudulent transactions in retail finance.
“Knowledge discovery in databases” started as an off-shoot of machine learning, with the first Knowledge Discovery and Data Mining workshop taking place at an AI conference in 1989 and helping to coin the term “data mining” in the process – a term that we will come back to a little later in this blog. And so, AI gave rise to the study of machine learning – which led in turn to data mining.
Supervised and unsupervised methods
Machine learning is often concerned with making so-called “supervised predictions”, i.e. with learning from a training set of historical data in which objects or outcomes are known and labelled, so that the intelligent agent can differentiate between, say, a cat and a mat, or so that it can learn to identify the signals in petabytes of sensor data that characterize the imminent failure of a train, a jet engine or a paper mill. The objective, in both cases, is to produce a model that can predict a target variable – whether an object is a cat or a mat, or whether a train will fail or not within the next 36 hours – from input data – images harvested from the Internet, or the readings from the temperature, pressure and vibration sensors on the train.
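To make the idea of learning from labelled data a little more tangible, here is a minimal sketch of a supervised model in Python. Everything in it is an assumption for illustration: the sensor readings are made up, the function names `fit_centroids` and `predict` are our own, and the nearest-centroid method is simply one of the easiest supervised techniques to show in a few lines, not a recommendation for real predictive maintenance.

```python
from statistics import mean

# Hypothetical labelled training set (invented for illustration):
# (temperature in C, vibration in mm/s) -> known outcome
training = [
    ((61.0, 2.1), "ok"),
    ((63.5, 2.4), "ok"),
    ((60.2, 1.9), "ok"),
    ((88.0, 7.8), "fail"),
    ((91.5, 8.3), "fail"),
    ((86.4, 7.1), "fail"),
]

def fit_centroids(examples):
    """Learn one mean feature vector (centroid) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new reading."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

model = fit_centroids(training)       # "training": learn from labelled history
print(predict(model, (89.0, 7.5)))    # a new, unlabelled reading
print(predict(model, (62.0, 2.0)))
```

The labels do the supervising: the model only knows what “fail” looks like because the historical data says so. A real project would also rescale the features so that no single sensor dominates the distance calculation.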
By contrast, data mining is often also concerned with the discovery of previously unknown patterns or structures in data. Retailers, for example, have long been interested in finding groups of customers who behave in similar ways and in “clustering” shopping missions, to understand consumer behavior and how stores are shopped. These are examples of the applications of “unsupervised methods”; we are still feeding the clustering algorithms historical data, but the data aren’t labelled – because we don’t know exactly which outcomes we are looking for. When one of us undertook a first customer behavioural segmentation project using an unsupervised approach, for example, we were not expecting to find a large group of consumers shopping in our stores between 5pm and 9pm whose baskets almost exclusively contained breath mints, flowers and chocolates – nor another, buying almost exclusively frozen products, apparently for immediate consumption. But there they were!
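For contrast, here is an equally minimal sketch of an unsupervised method. The data are invented, the two features (hour of day and share of the basket that is frozen) are our own assumptions, and k-means is just one common clustering algorithm among many; it is not the method used in the project described above. Note that there are no labels anywhere in the input.

```python
from statistics import mean

# Hypothetical unlabelled shopping trips (invented for illustration):
# (hour of day, share of basket that is frozen)
trips = [
    (10.0, 0.10), (11.0, 0.15), (9.5, 0.05), (10.5, 0.12),
    (18.0, 0.80), (19.0, 0.90), (20.0, 0.85), (18.5, 0.95),
]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(mean(column) for column in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Start from two arbitrary trips so the result is repeatable.
centroids, clusters = k_means(trips, [trips[0], trips[-1]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, len(cluster))
```

The algorithm will always return groups; deciding whether “evening shoppers with mostly frozen baskets” is a meaningful segment is still a job for a human who knows the business.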
Four things to remember
We don’t want you to get too hung up on history or terminology, but we do want you to understand four things.
Firstly, you simply can’t “machine learn everything” – not least because supervised methods presuppose that you have a relevant, labelled training dataset to learn from and because the results of an unsupervised analysis may be hard to interpret, or even irrelevant. But also because in many cases there are better routes to the goal anyway. By and large the big web properties don’t try to “machine learn” how big or which colour to make the “buy it now” button – they mostly run multiple, concurrent A/B tests instead. It’s quicker and it’s easier. And the output is not a prediction that may – or may not – prove to be accurate; it is instead a measurement of whether treatment A is more effective than treatment B for a particular customer segment right now, one that can easily be compared with other similar measurements.
Secondly, that “data mining” was once the cool new term – popularised, in part, by vendors and marketing departments who thought that “knowledge discovery in databases” wasn’t catchy enough – and who wanted to distinguish the application of these methods to commercial data captured in databases from dry, dusty and apparently far-off academic concerns about machine learning and AI. Fast-forward three decades – and now vendor marketing departments are in many cases attempting to differentiate their offers from existing data mining technologies by applying the label “machine learning” to them, apparently without realising that the term pre-dates the term “data mining”. In a very real sense, the marketing hype has come full circle.
Thirdly, that data mining started as an off-shoot of machine learning – itself a product of the pursuit of AI – and that the fields remain closely linked and continue to share multiple techniques, algorithms, and researchers. So closely linked, in fact, that the two expressions are often used interchangeably – and in many situations are practically synonymous. When a mobile telecommunications company builds a model to predict which customers are likely to churn based on historical data that describes customers who have already recently cancelled their service, we can – and probably should – call that “machine learning” (because we are using a computer to build a model from labelled historical data), even if we use a mathematical method, like linear regression, that pre-dates Turing, digital computers and the Dartmouth Conference. In practice, you will find plenty of practitioners describing the same activity as “data mining”, “data science” or just plain old “analytics”. You say tom-ay-to, and I say tom-ah-to.
Lastly, whilst you should absolutely embrace some of the newer machine learning techniques and technologies – as we’ll see later in this series of blogs, the deep learning family of methods in particular has already become the de facto solution for a whole range of high-value business problems – you would be unwise to throw out the more established methods and techniques in the process. Because, as we’ll also see later, in many cases we may prefer a simple solution that is sufficiently accurate to a more complex one.
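To make the first of those four points concrete: the output of an A/B test is a measurement, not a prediction. Here is a minimal sketch of that measurement, using a standard two-proportion z-test. The visitor and conversion counts are made up, and `ab_test` is our own hypothetical helper, not something any particular web property is known to use.

```python
from math import sqrt
from statistics import NormalDist

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Compare two conversion rates with a two-sided, two-proportion z-test."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that A and B perform identically.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, z, p_value

# Hypothetical counts for two versions of a "buy it now" button.
rate_a, rate_b, z, p_value = ab_test(
    conversions_a=200, visitors_a=10_000,
    conversions_b=260, visitors_b=10_000,
)
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  z = {z:.2f}  p = {p_value:.4f}")
```

No training set and no model are involved: we simply observe what customers did and ask whether the difference between the two treatments is larger than chance alone would explain.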