Now that we have a firm understanding of how business problems map to solutions, we need to learn the techniques to deliver those solutions. This section introduces the basic terminology and concepts used in data science.
First, let’s discuss the goal. What is the goal?
Based upon what?
How can we improve the quality of the decision or prediction?
Think about this for a moment. It’s a key insight. Think about your projects. Your research. The decisions you make. They are all based upon some information. And you can make better decisions when you have more good quality information.
Claude Shannon defined exactly what information is. He defined a measure of the level of information contained within a variable and called it entropy.
Entropy measures how much information, on average, is contained in the outcome of an event. A coin toss has lower entropy (less information) than the roll of a die, because a fair coin has only two equally likely outcomes, whereas a fair die has six.
A random variable with a wide spread has more entropy than one with a narrow spread.
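To make the coin and die comparison concrete, here is a minimal sketch using Shannon's formula for a discrete variable, H = -sum(p * log2(p)), measured in bits. The function name `shannon_entropy` and the biased coin are illustrative assumptions, not part of the course material.

```python
import math


def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = [1 / 2] * 2    # two equally likely outcomes
fair_die = [1 / 6] * 6     # six equally likely outcomes
biased_coin = [0.9, 0.1]   # hypothetical coin that nearly always lands heads

coin_bits = shannon_entropy(fair_coin)
die_bits = shannon_entropy(fair_die)
biased_bits = shannon_entropy(biased_coin)

print(f"fair coin:   {coin_bits:.3f} bits")
print(f"fair die:    {die_bits:.3f} bits")
print(f"biased coin: {biased_bits:.3f} bits")
```

The fair coin carries exactly 1 bit and the fair die log2(6), roughly 2.585 bits. The biased coin carries less than 1 bit because its outcome is easier to guess, which is the "narrow spread" case.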
Key Point: High entropy problems are harder to solve.
Uncertainty represents the effects of entropy.
High entropy problems are highly uncertain.
I.e. we cannot be certain about a solution if the data has high entropy.
Key Point: We want to be certain about the result.
Hence, the whole point of any modelling process is to reduce the amount of uncertainty in a decision or estimate that we make.
If we can make good decisions we can build good products and good businesses.
Reducing uncertainty by reducing the entropy measure of the data is the topic of the next section.
Key Point: We need to reduce the uncertainty to improve our decision. The question of the entire course is: how?
Before we move any further, we need to talk terminology.
Data science is notorious for having lots of different, complex words that all mean the same thing.
The reason for this is that data science has emerged from a number of different disciplines.
For example, the L1 norm from statistics, applied to the difference between two points, is the same as machine learning’s Manhattan distance.
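A short sketch of that equivalence, assuming two illustrative three-dimensional points: taking the L1 norm of the difference between the points gives the same number as the Manhattan distance between them. The function names and values are made up for the example.

```python
def l1_norm(vector):
    """Sum of the absolute values of the components."""
    return sum(abs(component) for component in vector)


def manhattan_distance(a, b):
    """Distance travelled along axis-aligned 'city blocks' from a to b."""
    return sum(abs(x - y) for x, y in zip(a, b))


point_a = [1.0, 5.0, -2.0]  # illustrative values only
point_b = [4.0, 1.0, 0.0]

difference = [x - y for x, y in zip(point_a, point_b)]

norm_of_difference = l1_norm(difference)
distance = manhattan_distance(point_a, point_b)

print(norm_of_difference)  # 9.0
print(distance)            # 9.0
```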
Also, terminology is quite personal to an individual’s experience. Some of this terminology is my own.
An observation is a single measurement. It is often (even by me) referred to as an individual sample or data point. Words don’t matter so long as you and your audience understand that you mean a single instance.
A sample is a chosen set of observations. But this term isn’t generally used, because it is easily confused with an individual sample.
Instead, data scientists often use the word dataset to refer to a collection of observations.
How you choose a sample is very, very important. More about that later.
Other than observations, the next most important word is feature.
A feature is one dimension of the measurement. For example, a finance dataset might have a loan_amount feature. A marketing dataset might have an
An attribute is another word for a feature.
Labels represent the answers to the problem, if there are any.
Labels are required for supervised machine learning tasks.
For example, in a classification problem, labels represent the correct class for an observation.
Labels are also often called targets.
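As a rough illustration of how these words fit together, the sketch below holds a tiny dataset in plain Python. The `loan_amount` feature comes from the text above; the `annual_income` feature, the `defaulted` label, and all of the values are hypothetical.

```python
# Each dict is one observation (also called a sample or data point).
# All values are made up purely for illustration.
dataset = [
    {"loan_amount": 5000, "annual_income": 32000, "defaulted": False},
    {"loan_amount": 12000, "annual_income": 41000, "defaulted": True},
    {"loan_amount": 3000, "annual_income": 58000, "defaulted": False},
]

feature_names = ["loan_amount", "annual_income"]  # features (attributes)
label_name = "defaulted"                          # label (target)

# Split every observation into its features and its label.
features = [[row[name] for name in feature_names] for row in dataset]
labels = [row[label_name] for row in dataset]

print(features)  # one inner list of feature values per observation
print(labels)    # one label per observation
```

Most libraries expect a similar layout: one row per observation, one column per feature, and a separate sequence of labels.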
We can also generalise and abstract solutions into different types. We’ve already mentioned models but haven’t defined what a model is.
Models are a simplified version of reality. We create them to be able to understand and act upon the underlying process.
Reality is messy. We use imprecise tools and equipment to sample a chaotic natural process.
The mess that is included in our measurements is called noise.
Noise masks the data that we are interested in.
Models attempt to simplify the measurement and ignore the noise.
Simpler models are easier to understand and easier to act upon.
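The sketch below illustrates the idea under stated assumptions: a hypothetical straight-line process is measured with added Gaussian noise, and an ordinary least-squares line is fitted to the measurements. The "true" slope and intercept, the noise level, and the helper `fit_line` are all invented for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical underlying process: y = 2x + 1.
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0

xs = [i / 10 for i in range(100)]
# Each measurement is the true value plus random noise.
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + random.gauss(0, 1.0) for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
print(f"estimated slope: {slope:.2f}, estimated intercept: {intercept:.2f}")
```

The fitted line has only two parameters, yet it should land close to the process that generated the data while ignoring the point-to-point noise.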
The creation of models from data is called induction.
Contrast this with deduction, which is the process of formulating a model or theory from logical assertions.
Traditional science emphasised the importance of deductive reasoning, and deduction is still preferred in many of the more traditional disciplines.
And in data science it still holds an important place in sanity checking. If the results are not what you expect, then either you don’t understand what is going on, or you’ve made the wrong deductions.
Prediction is often used for “predicting the future”.
But in data science, prediction means estimating the most probable output.
So when our algorithm decides that this instance belongs to a class, we’re making a prediction.
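In code, "estimate the most probable output" often comes down to picking the class with the highest estimated probability. A minimal sketch, assuming some classifier has already produced the hypothetical probabilities below:

```python
# Hypothetical class probabilities for a single observation.
class_probabilities = {"cat": 0.2, "dog": 0.7, "rabbit": 0.1}


def predict(probabilities):
    """Return the class with the highest estimated probability."""
    return max(probabilities, key=probabilities.get)


predicted_class = predict(class_probabilities)
print(predicted_class)  # dog
```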
When talking about learning from data, there are two distinct forms of learning: supervised and unsupervised.
The type of learning is usually defined by whether the data has labels.
Supervised machine learning occurs when an algorithm is provided with data that is labelled with a known outcome.
Supervised learning is then split by the type of label that is used. Some algorithms work with categorical labels, some with continuous labels, and some with both.
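As one illustration of an approach that can handle both kinds of label, here is a sketch of a one-nearest-neighbour predictor on a single feature. It is chosen only because it is short; the data, the label names, and the function are hypothetical.

```python
def nearest_neighbour_predict(train_features, train_labels, query):
    """Return the label of the training observation closest to `query`."""
    distances = [abs(x - query) for x in train_features]
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]


# Hypothetical one-feature training data.
loan_amounts = [1000, 2000, 8000, 9000]
risk_class = ["low", "low", "high", "high"]  # categorical labels
interest_rate = [3.0, 3.5, 7.5, 8.0]         # continuous labels

# Same algorithm, two kinds of label.
predicted_risk = nearest_neighbour_predict(loan_amounts, risk_class, 7500)
predicted_rate = nearest_neighbour_predict(loan_amounts, interest_rate, 7500)

print(predicted_risk)  # high
print(predicted_rate)  # 7.5
```

With categorical labels the task is usually called classification; with continuous labels it is usually called regression.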
Examples of supervised questions:
Unsupervised problems arise when there are no labels indicated in the data.
Unsupervised problems often require some form of grouping or clustering to find similar types of data.
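A deliberately simple sketch of grouping without labels: sort one-dimensional values and start a new group wherever the gap to the previous value is too large. This is a toy stand-in for a real clustering algorithm, and the `max_gap` parameter and the data are assumptions made for the example.

```python
def group_by_gaps(values, max_gap):
    """Group sorted values, starting a new group when a gap exceeds max_gap."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)  # close enough: same group
        else:
            groups.append([value])    # too far away: new group
    return groups


measurements = [1.0, 9.5, 1.3, 10.0, 0.8, 9.8]  # no labels provided
groups = group_by_gaps(measurements, max_gap=1.0)
print(groups)  # [[0.8, 1.0, 1.3], [9.5, 9.8, 10.0]]
```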
Examples of unsupervised questions:
Fairly often, problems come with data that has only some labels. For example, experts may have manually labelled some instances but cannot possibly label all of the data.
In this case there is a subset of special techniques called semi-supervised algorithms.
These are often simply a combination of clustering followed by classification.
I.e. similar instances can be given the same label. The difficulty is deciding where to draw the line.
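A minimal sketch of that idea, assuming one-dimensional data and a simple nearest-labelled-neighbour rule rather than a full clustering step: each unlabelled instance copies the label of its closest labelled instance, but only if it lies within a chosen `threshold`. The function, the data, and the threshold values are hypothetical.

```python
def propagate_labels(points, labels, threshold):
    """Copy known labels to unlabelled points within `threshold` of them.

    `labels[i]` is None when observation i has no label.
    """
    labelled = [(p, l) for p, l in zip(points, labels) if l is not None]
    result = []
    for point, label in zip(points, labels):
        if label is None:
            # Find the closest observation that already has a label.
            nearest_point, nearest_label = min(
                labelled, key=lambda pair: abs(pair[0] - point)
            )
            if abs(nearest_point - point) <= threshold:
                label = nearest_label
        result.append(label)
    return result


points = [1.0, 1.2, 1.4, 5.0, 5.1, 3.5]
labels = ["a", None, None, "b", None, None]  # experts labelled only two

tight = propagate_labels(points, labels, threshold=0.5)
loose = propagate_labels(points, labels, threshold=2.0)

print(tight)  # ['a', 'a', 'a', 'b', 'b', None]
print(loose)  # ['a', 'a', 'a', 'b', 'b', 'b']
```

The point at 3.5 shows the difficulty: with a tight threshold it stays unlabelled, and with a loose one it is pulled into class "b". Where to draw that line is a judgement call.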