Data Science is an emerging field that is plagued by lurid, often inconsequential reports of success. The press has been all too happy to predict the future demise of the human race.
But sifting through the chaff, we do see some genuinely interesting reports of work that affects both bottom-line profit and top-line revenue.
Location discovery service Foursquare, using a range of data science techniques, predicted that Q1 2016 revenue at Chipotle, a US fast-food outlet, would drop by 30% compared to the previous year.
Financial results showed a drop of 29.7%. This was an accurate prediction that Chipotle could have used to pre-emptively mitigate an impending revenue drop. This is one example of data science being used to help business. But what is data science?
Data Science is the act of engineering value from data. I prefer the name data science because it more accurately describes the multidisciplinary approach, as opposed to a more popular, but acutely specific sub-discipline like deep learning, for example.
Data science has emerged not because of a sudden proliferation of observable data, which has existed for decades in some domains, but because technology was, until recently, not flexible or powerful enough to handle all of that data. Furthermore, the media has played a significant role in altering perceptions, which has caused some parts of data science to hit the mainstream.
This has resulted in the overuse of some of the terminology: names that properly refer to one specific component of data science are used to refer to the field as a whole. Machine learning has had this problem.
Machine learning is a field that attempts to teach a machine (not necessarily a computer) to perform some task that a human is able to perform well. Examples include sorting cucumbers by size, shape and colour, or having large industrial robots perform highly repetitive tasks.
More generally, machine learning is the practice of automating some data-driven process. We would use machine learning to compute the steps required to automate a typically human-derived business process. You can see how this can generalise. One task where you wouldn’t be looking to automate a process is providing insight. There, you would analyse the data to provide better information than you had before; this is called analytics.
Analytics has become more business-focused than the other data science disciplines, because its aim is to produce information rather than a desired result, and companies require information to make good decisions.
At their heart, organisations are information processors. Whoever can accept, parse and act upon the best data has an inherent advantage. Data-driven decision making, directed by analytics, has pushed some companies towards becoming more profitable and more productive than their competitors.
Hence, analytics is the process of turning data into actionable insights. The outcome could be as simple as a new process or policy, such as one for hiring, or could be an entirely new business vertical or spin-off.
But to generate quality insights, we need quality data. And one thing analytics isn’t good at is deciding what data is important. Often, that only becomes obvious after the fact. Hence, businesses are vacuuming up data, essentially capturing everything, in case it may be useful in the future. The result is large amounts of data.
What is big data? This is the easiest one; it’s lots and lots of data. But “lots of data” doesn’t have the same ring to it.
There have been significant changes to the tooling. Traditionally, data was captured in a batch-like process, in daily chunks, for example, and stored in large relational databases that were incredibly slow.
Technologies have progressed to the point where we now stream data, like a garden hose, straight into a variety of sinks like decision engines, analytics tools and fast distributed databases.
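To make the difference between batch and streaming concrete, here is a minimal sketch that computes an average both ways. The function names and the order values are made up for illustration; real streaming systems are far more involved, but the principle of updating an answer as each event arrives is the same.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch style: wait until the whole chunk has arrived, then compute."""
    return sum(values) / len(values)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the answer as each event arrives."""
    count = 0
    mean = 0.0
    for value in events:
        count += 1
        # Incremental update; earlier events do not need to be kept in memory.
        mean += (value - mean) / count
        yield mean


# Hypothetical order values, made up for illustration.
orders = [12.0, 8.0, 10.0, 14.0]

print(batch_mean(orders))            # one answer, once the batch is complete
print(list(streaming_mean(orders)))  # an up-to-date answer after every event
```

Both approaches agree once all the data has arrived; the streaming version simply has a usable answer at every step along the way.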
And with this monumental volume of data, some older techniques, like deep learning, now make a lot of sense.
Deep learning started out as a purely academic field. Researchers used the idea that neurons, the processing units in your brain, could serve as a model for decision functions. These artificial neurons were, and continue to be, simple mathematical functions bound by inputs and outputs.
Their beauty lies in the fact that you can build up structures of neurons to make complex decisions. The issue was that they were hard to “tune”; given an input and an expected output, it was difficult to alter the parameters of the mathematical functions to produce the best result.
But that all changed when 1) we started using fast CPUs (processing units inside computers) and GPUs (processing units inside gamers’ graphics cards) and, most importantly, 2) a large amount of data became available to “train” the networks of neurons.
The results of deep learning are impressive, and in some tasks they surpass human performance. But this does not mean that computers are dangerous or, indeed, able to think.
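To show what a neuron and its “tuning” look like in practice, here is a minimal sketch of a single artificial neuron whose two parameters are adjusted by gradient descent. The sigmoid function, the squared-error measure, the learning rate and the toy data are all assumptions chosen for illustration; real deep learning stacks many such neurons and uses dedicated libraries.

```python
import math


def neuron(x: float, weight: float, bias: float) -> float:
    """A single neuron: a weighted input passed through a sigmoid function."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def mean_squared_error(examples, weight: float, bias: float) -> float:
    """Average squared difference between the neuron's output and the target."""
    errors = [(neuron(x, weight, bias) - target) ** 2 for x, target in examples]
    return sum(errors) / len(errors)


def train(examples, epochs: int = 2000, learning_rate: float = 0.5):
    """Tune weight and bias by gradient descent, one example at a time."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in examples:
            output = neuron(x, weight, bias)
            # Gradient of 0.5 * (output - target)^2 with respect to (weight * x + bias).
            delta = (output - target) * output * (1.0 - output)
            weight -= learning_rate * delta * x
            bias -= learning_rate * delta
    return weight, bias


# Toy data, made up for illustration: small inputs should give 0, large inputs 1.
examples = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 1.0), (0.8, 1.0), (1.0, 1.0)]

weight, bias = train(examples)
print("error before tuning:", mean_squared_error(examples, 0.0, 0.0))
print("error after tuning: ", mean_squared_error(examples, weight, bias))
```

With only six examples, the neuron can do little more than memorise a threshold. The more examples that are available, the more parameters can be tuned reliably, which is why large volumes of data matter so much.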
Artificial intelligence, or AI, is probably the winner of the hype crown. This is the much-novelised discipline of trying to synthesise intelligence. The hardest part is defining what intelligence is.
If a computer can perform a task better than a human, does that make it intelligent? Is an ATM intelligent? What about a car that can drive itself, a machine that controls acceleration and direction using a bunch of cameras: is that intelligent? A chat-bot that a human cannot distinguish from a real operator: is that intelligent?
As you can see, most intelligent actors, when you understand the technology behind them, are a selection of algorithms put together in novel packages. Hence, AI is used more as a marketing term than as a technical one.
Conversely, data mining has had industrial applications for decades. Data science, by definition, is only possible if you have the data, and finding the right data takes the majority of the time in a data science application.
For example, when businesses first begin to develop a data-driven solution for their problem, they often find that the data they have is lacking in quantity or quality, or doesn’t exist at all. Data mining is the collective name for collecting, cleaning and prioritising data.
Finally, statistics forms a broad, fuzzy intersection through all the other sub-disciplines. The data we collect is only a sample that represents the full population of data, so we must be careful when interpreting the conclusions derived from it.
Furthermore, valuable insights, decisions and predictions can arise from pure statistics alone. For businesses, decisions driven by statistical insights can provide dramatic results.
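The caution about samples and populations can be illustrated with a short sketch. It builds a made-up population of customer spend values, draws samples of different sizes, and reports each sample mean with an approximate 95% margin of error. The population, its parameters and the `summarise` helper are hypothetical; the 1.96 multiplier assumes the sample mean is roughly normally distributed.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# A made-up "full population": 100,000 hypothetical customer spend values.
population = [rng.gauss(50.0, 15.0) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def summarise(sample):
    """Return the sample mean and an approximate 95% margin of error."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, 1.96 * standard_error


for size in (25, 400, 10_000):
    sample = rng.sample(population, size)
    mean, margin = summarise(sample)
    print(f"n={size:>6}: {mean:.2f} +/- {margin:.2f} "
          f"(population mean {population_mean:.2f})")
```

Small samples give a wide margin of error, so a conclusion drawn from them may not hold for the population as a whole; larger samples narrow the margin.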
You can see that data science comprises a set of tools that aim to improve business processes, decisions and products through the use of data. The application of these tools often happens in tandem; deep learning relies entirely on the data, so data mining and data integrity are possibly even more important than deep learning itself.
I like the term data science because it comfortably acknowledges that real data-driven projects are not driven by hyperbole. Opportunities require approaches that differ depending on the circumstances, and care should be taken when evaluating solutions.