# Discover

## CloudNativeX Interview: Reinforcement Learning

Apr 2021

Join Lee Razo and Phil Winder for this comprehensive introduction to Reinforcement Learning, an area of machine learning in which problems are tackled with intelligent agents which take actions to maximize a specified reward. Phil (quite literally) wrote the book on this topic and he takes us through the fundamentals of RL, some common use cases as well as tips on how even a small or mid-sized company can get started with and benefit from RL.

## The Future of Transportation Infrastructure: Reinforcement Learning

Mar 2021

The lock-downs endured during the coronavirus pandemic have given many the opportunity to work from home, potentially for the first time. Along with the guilt of failing at home-schooling and the struggle of working around noisy babies or animals, the lock-down has entirely changed the way in which we travel. When I speak to people about the pandemic, the lack of a commute is one of the few positives they take away from the experience, and it has led some to question why they are paying for accommodation in some of the most expensive areas in the UK.

## InfoQ Podcast: Phil Winder on the History, Practical Application, and Ethics of Reinforcement Learning

Mar 2021

Charles Humble, friend and editor of InfoQ, was kind enough to ask me for an interview to talk more about my new book, in podcast format. From the blurb: In this episode of the InfoQ podcast, Dr Phil Winder, CEO of Winder Research, sits down with InfoQ podcast co-host Charles Humble. They discuss: the history of Reinforcement Learning (RL); the application of RL in fields such as robotics and content discovery; scaling RL models and running them in production; and ethical considerations for RL.

## Solving Three Common Manufacturing Problems with Reinforcement Learning

Feb 2021

Like many industries, manufacturing is experiencing an explosion in both the growth of and access to data. The data is complex and multi-faceted; for example, it may originate from the production line, the environment, usage, or even users. When viewed in this light, the explosion is often called “big data” and the effect is called smart manufacturing (USA) or Industrie 4.0 (Germany). The data must be acted upon to be useful.

## Inventory Control and Supply Chain Optimization with Reinforcement Learning

Feb 2021

Inventory control is the problem of attempting to optimize product or stock levels given the unique constraints and requirements of a business. It is an important problem because every goods-based business has to spend resources on maintaining stock levels so that it can deliver the products that customers want. Every improvement to inventory control directly improves the delivery of the business. Beginners study tactics, experts study logistics, so they say.

## DataTalksClub - Industrial Applications of Reinforcement Learning

Feb 2021

Reinforcement learning (RL), a sub-discipline of machine learning, has been gaining academic and media notoriety after hyped marketing “reveals” of agents playing various games. But these hide the fact that RL is immensely useful in many practical, industrial situations where hand-coding strategies or policies would be impractical or sub-optimal. Following the theme of my new book (https://rl-book.com), I present a rebuttal to the hyperbole by analysing five different industrial case studies from a variety of sectors.

## GOTO Book Club: How to Leverage Reinforcement Learning

Feb 2021

In this episode of GOTO’s book club I speak to Rebecca Nugent, Feinberg Professor of Statistics and Data Science at Carnegie Mellon University. We talk, at length, about the application of reinforcement learning, specifically how it could be a way of creating truly personalised teaching curricula. It’s a really interesting discussion and it’s great to get someone of Rebecca’s calibre to bounce ideas off.

## Free RL Book Competition and Merry Christmas

Dec 2020

2020 has been quite a year. I think back over all the so-called “1-in-100 year” events. The statistician in me suggests that this isn’t as surprising as it sounds, given the size of the world and globalized communication. But the fact that COVID-19 has affected so many is almost unprecedented. To help provide some semblance of festive cheer I’m happy to offer one lucky person a free copy of my new book, Reinforcement Learning: Industrial Applications of Intelligent Agents.

## A Code-Driven Introduction to Reinforcement Learning

Nov 2020

Reinforcement learning (RL) is lined up to become the hottest new artificial intelligence paradigm in the next few years. Building upon machine learning, reinforcement learning has the potential to automate strategic-level thinking in industry. In this presentation I present a code-driven introduction to RL, where you will explore a fundamental framework called the Markov decision process (MDP) and learn how to build an RL algorithm to solve it.

## 5 Productivity Tips for Data Scientists

Aug 2020

Many articles talk about how professionals can make their workdays extra productive. However, for people like data scientists, whose jobs are extremely demanding, some tips are more valuable than others. For instance, it is important that you analyse how you spend your time. In the same breath, it would be in your best interest to organise your time into blocks, as these can help you focus on tasks – one at a time and without any interruption – and automate any process that you repeat.

## Improving Data Science Strategy at Neste

Aug 2020

Winder Research helped Neste develop their data science strategy to nudge their data scientists to produce more secure, more robust, production-ready products. The results of this work were:

- A unified company-wide data science strategy
- Simplified product development: “just follow the process”
- More robust, more secure products
- Decreased time to market

Our client: Neste is an energy company that focuses on renewables. The efficiency and optimization savings that machine learning, artificial intelligence and data science can provide play a key role in their strategy.

## Building an Enterprise NLP Platform

Jun 2020

Winder Research has built a state-of-the-art natural language processing (NLP) platform for a large oil and gas enterprise. This work leveraged a range of cloud-native technologies and sophisticated deep learning-based (DL) machine learning (ML) techniques to deliver a range of applications. Key successes are:

- New NLP workflows developed in hours, not weeks.
- Hugely scalable, from zero (to minimise cost) to tens of thousands of concurrent connections.
- Enforced corporate governance and unification, without burdening the developer.

## Developing a Real-Life Project

Jun 2020

I’m often asked questions in the vein of “how did you figure that out?”. Other times, and I’m less of a fan of these, I get questions like “you estimated X, why did it take 2*X?”, to which I respond with a definition of the word estimate. Both of these types of questions are about the research and development process. Non-developers, and especially non-engineers, are often never exposed to the process of research and development.

## A Simple Docker-Based Workflow for Deploying a Machine Learning Model

Apr 2020

In software engineering, the famous quote by Phil Karlton, extended by Martin Fowler, goes something like: “There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.” In data science, there’s one hard thing that towers over all other hard things: deployment.

## COVID-19 Hierarchical Bayesian Logistic Model with pymc3

Apr 2020

I have two outstanding tasks from the previous notebooks. The first is that I haven’t iterated over all countries.

## COVID-19 Logistic Bayesian Model

Apr 2020

This post builds upon the exponential model created in a previous post. The main issue was that an exponential model does not include a limit. A logistic model introduces this limit. I also perform some very basic backtesting and future prediction.

## COVID-19 Exponential Bayesian Model Backtesting

Apr 2020

This notebook builds upon the exponential Bayesian model to implement simple backtesting. The idea here is to hold out data, train a model, and see how well the model is able to predict those results.

## COVID-19 Exponential Bayesian Model

Apr 2020

The purpose of this notebook is to provide initial experience with the pymc3 library for modeling and forecasting COVID-19 virus summary statistics. This model is very simple, and therefore not very accurate, but serves as a good introduction to the topic.

## COVID-19 Response: Athena Project and an Introduction to Bayesian Analysis

Apr 2020

Over the next couple of weeks I will be using Bayesian analysis to model the spread of COVID-19. Inspired by Alex Stage who started the Athena Project, I have committed Winder Research to helping Athena reach its goals.

## How to Start a Data Science Project With No or Little Data

Feb 2020

Data is an essential asset of modern business. It empowers companies by surfacing unique insights about their customers and creates actionable products. The more data you possess, the better you meet and exceed your customers' expectations.

## Keep it Clean: Why Bad Data Ruins Projects and How to Fix it

Jan 2020

The Internet is full of examples of how to train models. But the reality is that industrial projects spend the majority of the time working with data. The largest improvements in performance can often be found through improving the underlying data. Bad data is costing the US economy an estimated 3.1 trillion dollars and approximately 27% of data is flawed in the world’s top companies. Bad data also contributes to the failure of many Data Science projects.

## Fast Time-Series Filters in Python

Oct 2019

Time-series (TS) filters are often used in digital signal processing for distributed acoustic sensing (DAS). The goal is to remove a subset of frequencies from a digitised TS signal. To filter a signal you must touch all of the data and perform a convolution. This is a slow process when you have a large amount of data. The purpose of this post is to investigate which filters are fastest in Python.
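To make the cost of "touching all of the data" concrete, here is a minimal sketch of a direct convolution with a moving-average kernel, written in pure Python. The function names and the toy signal are illustrative assumptions, not the filters benchmarked in the post.

```python
def moving_average_kernel(width):
    """Return a simple low-pass FIR kernel whose taps sum to one."""
    return [1.0 / width] * width


def convolve_valid(signal, kernel):
    """Directly convolve signal with kernel, keeping only fully overlapping samples.

    Every output sample reads len(kernel) input samples, so the cost is
    proportional to len(signal) * len(kernel).
    """
    flipped = kernel[::-1]
    taps = len(flipped)
    return [
        sum(signal[i + j] * flipped[j] for j in range(taps))
        for i in range(len(signal) - taps + 1)
    ]


# Hypothetical signal: a slow ramp plus a term that alternates between +1 and -1.
signal = [0.1 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]

# Averaging neighbouring pairs cancels the alternating term and keeps the ramp.
smoothed = convolve_valid(signal, moving_average_kernel(2))
```

The nested loop is exactly why filtering large datasets is slow, and why the choice of implementation matters.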

## A Comparison of Reinforcement Learning Frameworks: Dopamine, RLLib, Keras-RL, Coach, TRFL, Tensorforce and more

Jul 2019

Reinforcement Learning (RL) frameworks help engineers by creating higher level abstractions of the core components of an RL algorithm. This makes code easier to develop, easier to read and improves efficiency.

But choosing a framework introduces some amount of lock-in. An investment in learning and using a framework can make it hard to break away. This is just like when you decide which pub to visit. It’s very difficult not to buy a beer, no matter how bad the place is.

## Announcement: New Reinforcement Learning Book with O'Reilly

Jun 2019

I’m excited to announce that I have agreed with O’Reilly Media to write a new book on Reinforcement Learning. The contracts have just been signed and I’ve started the writing process. It is likely to take around a year to be released so I’m hoping that it will be ready around Summer 2020.

## Keep it Clean: Why Bad Data Ruins Projects and How to Fix it

Apr 2019

The Internet is full of examples of how to train models. But the reality is that industrial projects spend the majority of the time working with data. The largest improvements in performance can often be found through improving the underlying data. Bad data is costing the US economy an estimated 3.1 trillion dollars and approximately 27% of data is flawed in the world’s top companies. Bad data also contributes to the failure of many Data Science projects.

## Google Releases AI Platform with help from Winder Research

Apr 2019

At their Cloud Next ’19 conference, Google announced the launch of an expanded AI platform. For a number of years Google has been expanding its portfolio to compete with AI products from Azure and AWS. But this is the first time that the platform can be considered “end-to-end”.

## DevOps and Data Science: DataDevOps?

Mar 2019

I’ve seen a few posts recently about the emergence of a new field that is kind of like DevOps, but not quite, because it involves too much data. Verbally, about two years ago and in blog form about a year ago, I used the word DataDevOps, because that’s what I did. I develop and operate Data Science platforms, products and services. But more recently I have read of the emergence of DataOps.

## Local Jenkins Development Environment on Minikube on OSX

Mar 2019

Developing Jenkinsfile pipelines is hard. I think my world record for the number of attempts to get a working Jenkinsfile is around 20. When you have to continually push and run your pipeline on a managed Jenkins instance, the feedback cycle is long. And the primary bottleneck to developer productivity is the length of the feedback cycle.

## Scikit Learn to Pandas: Data types shouldn't be this hard

Feb 2019

Nearly everyone using Python for Data Science has used or is using the Pandas Data Analysis/Preprocessing library. It is as much of a mainstay as Scikit-Learn. Despite this, one continuing bugbear is the different core data types used by each: pandas.DataFrame and np.array. Wouldn’t it be great if we didn’t have to worry about converting DataFrames to numpy types and back again? Yes, it would. Step forward Scikit Pandas. Sklearn Pandas, part of the Scikit Contrib package, adds some syntactic sugar to use DataFrames in sklearn pipelines and back again.

## 7 Reasons Why You Shouldn't Use Helm in Production

Jan 2019

Helm is billed as “the package manager for Kubernetes”. The goal was to provide a high-level package management-like experience for Kubernetes. This was a goal for all the major containerisation platforms. For example, Apache Mesos has Mesos Frameworks. And given the standardisation on package management at an OS level (yum, apt-get, brew, choco, etc.) and an application level (npm, pip, gem, etc.), this makes total sense, right?

## Using Data Science to block hackers

Oct 2018

Executive summary: Winder Research was engaged by Bitsensor to research and implement Data Science algorithms that could automate the detection and classification of web attackers. After gathering data, researching a Machine Learning solution and implementing Cloud-Native software, we delivered three new features:

- Tool classification - detect which automated tools were being used to perform the attack
- Attacker grouping - provide the capability of detecting distributed attacks by the same attacker
- Killchain classification - establish the phase of an attack (e.

## Building a Cloud-Native PaaS

Oct 2018

Executive summary: Winder Research worked with its partner, Container Solutions, to deliver core components of the Weave Cloud Platform-as-a-Service (PaaS).

- Kubernetes and Terraform implementations on Google Cloud Platform
- Delivered crucial billing components to track and bill for per-second usage
- Helped initiate, architect and deliver Weave Flux, a Git-Ops CI/CD enabler

Client: Weaveworks makes it fast and simple for developers and DevOps teams to build and operate powerful containerized applications.

## How Winder Research Made Enterprise Cloud Migration Possible

Oct 2018

Executive summary:

- Truly global company, tens of thousands of staff across tens of regions.
- Problem: colossal amounts of data, lacking the computational flexibility to remain competitive.
- Solution: cloud data platform leveraging microservices, serverless object storage and database technologies.
- Benefits: 4x faster, with more memory and more GPUs compared to the best on-premise hardware. 10x quicker time to market. 10 petabytes of data.

A very large enterprise in the oil and gas industry asked Winder Research to help them migrate mission critical workflows to the cloud and create competitive differentiators through the application of Data Science (a.

## A Comparison of Serverless Frameworks for Kubernetes: OpenFaas, OpenWhisk, Fission, Kubeless and more

Sep 2018

The term Serverless has become synonymous with AWS Lambda. Decoupling from AWS has two benefits: it avoids lock-in and improves flexibility.

Serverless, a misnomer, is a set of techniques and technologies that abstract away the underlying hardware completely. Obviously these functions still run on “servers” somewhere, but the point is we don’t care. Developers only need to provide code as a function. Functions are then used or consumed via an API, usually REST, but also through message bus technologies (Kafka, Kinesis, Nats, SQS, etc.).

This post provides a comparison of, and a recommendation for, Serverless frameworks for the Kubernetes platform.

## How to Test Terraform Infrastructure Code

Aug 2018

Infrastructure as code has become a paradigm, but infrastructure scripts are often written and run only once. This works for simplistic infrastructure requirements (e.g. k8s deployments). But when there is a requirement for more varied infrastructure or greater resiliency then testing infrastructure code becomes a requirement. This blog post introduces a current project that has found tools and patterns to deal with this problem.

## Cloud Native Data Science: Best Practices

May 2018

Following the Cloud Native best practices of immutability, automation and provenance will serve you well in a CNDS project. But working with data brings its own subtle challenges around these themes.

## Cloud Native Data Science: Technology

May 2018

Technology choices in data-driven products are, as you would expect, largely directed by the type and amount of data. The first and most crucial decision to make is whether the data will be processed in a batch or streaming fashion.

## Cloud Native Data Science: Strategy

May 2018

Data Science has become an important part of any business because it provides a competitive advantage. Very early on, Amazon’s data on book purchases allowed them to deliver personalised recommendations whilst customers were browsing their site. Their main competitor in the US at the time was Borders, who mainly operated in physical stores. This physicality prevented them from seamlessly providing customers with personalised recommendations [1]. This example highlights how strategic business decisions and data science are inextricably linked.

## Life and Death Decisions: Testing Data Science

Apr 2018

We live in a world where decisions are being made by software. From mortgage applications to driverless vehicles, the results can be life-changing. But the benefits of automation are clear. If businesses use data science to automate decisions they will become more productive and more profitable. So the question becomes: how can we be sure that these algorithms make the best decisions? How can we prove that an autonomous vehicle will make the right decision when life depends on it?

## How to List all AMIs for each region in AWS

Apr 2018

A current project required a list of Amazon Machine Images (AMIs) for all regions for use in Terraform. I couldn’t find a script to do this for me, so here you will find one that uses the AWS CLI, jq and a bit of Bash.

## AI Panel of Experts

Mar 2018

Join the track speakers and invited guests as they discuss where AI is heading and how it’s affecting software today.

“Enjoyed fielding questions about #DataScience and #AI today at #QConLondon. Great questions and expert speakers, but SMEs are underrepresented in data science. We need more SMEs speaking!” (Phil Winder, @DrPhilWinder, March 6, 2018)

## Why do we use Standard Deviation?

Jan 2018

Why do we use Standard Deviation and is it right? It’s a fundamental question and it has knock-on effects for all algorithms used within data science. But what is interesting is that there is a history. People haven’t always used variance and standard deviation as the de facto measure of spread. But first, what is it? The Standard Deviation is used throughout statistics and data science as a measure of “spread” or “dispersion” of a feature.
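As a minimal sketch of what that measure of spread computes, the snippet below calculates the population standard deviation by hand. For contrast it also computes the mean absolute deviation, which is chosen here purely as an illustrative alternative measure of spread. The feature values are made up.

```python
import math


def standard_deviation(values):
    """Population standard deviation: root of the mean squared deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Hypothetical feature values with a mean of 5.
feature = [2, 4, 4, 4, 5, 5, 7, 9]

sd = standard_deviation(feature)        # 2.0
mad = mean_absolute_deviation(feature)  # 1.5
```

Squaring the deviations weights large deviations more heavily, which is why the standard deviation is larger than the mean absolute deviation for this data.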

## Root Cause Analysis: The 5-Whys

Jan 2018

Deciding what problem you should try to solve is one of the hardest steps to get right in Data Science. If you get it wrong, then you’ll spend significant amounts of time freewheeling around the rest of the data science process and end up with something that nobody wants or cares about. There is nothing worse than someone suggesting that your work has no value.

## Distance Measures with Large Datasets

Jan 2018

Today I had an interesting question from a client who was using a distance metric for similarity matching. The problem is this: given one vector v and a list of vectors X, how do I calculate the Euclidean distance between v and each vector in X as efficiently as possible in order to get the top matching vectors?
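The problem statement can be sketched in a few lines of standard-library Python. This is a straightforward baseline under assumed names (`top_matches`, the toy vectors), not the optimised approach the post goes on to investigate.

```python
import heapq
import math


def top_matches(v, X, k):
    """Return (distance, index) pairs for the k vectors in X closest to v."""
    # A generator avoids materialising every distance before selecting the top k.
    distances = ((math.dist(v, x), i) for i, x in enumerate(X))
    return heapq.nsmallest(k, distances)


# Hypothetical query vector and candidate list.
v = (0.0, 0.0)
X = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0), (6.0, 8.0)]

nearest = top_matches(v, X, k=2)  # [(1.0, 1), (2.0, 2)]
```

`heapq.nsmallest` keeps only k candidates in memory at a time, which helps when X is large but still requires one distance calculation per vector.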

## 603: Nearest Neighbour Tips and Tricks

Jan 2018

Dimensionality and domain knowledge: is it right to use the same distance measure for all features? E.g. height and sex? CPU and disk space? Some features will have more of an effect than others due to their scales.

In this version of the algorithm all features are used in the distance calculation. This treats all features the same. So a measure of height has the same effect as the measure of sex.
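One common way to stop a large-scale feature from dominating the distance is to standardise each feature first. The sketch below assumes z-score standardisation and made-up height and sex columns; it is one option among several, not necessarily the one the workshop recommends.

```python
import math
import statistics


def standardise(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    spread = statistics.pstdev(column)
    return [(value - mean) / spread for value in column]


# Hypothetical features on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0]
sex = [0.0, 1.0, 0.0, 1.0]

# Distance between the first two observations before and after rescaling.
raw = math.dist((heights_cm[0], sex[0]), (heights_cm[1], sex[1]))
scaled_h, scaled_s = standardise(heights_cm), standardise(sex)
scaled = math.dist((scaled_h[0], scaled_s[0]), (scaled_h[1], scaled_s[1]))
```

In the raw distance the 10 cm height difference swamps the difference in sex. After rescaling, both features contribute on comparable terms.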

## 602: Nearest Neighbour Classification and Regression

Jan 2018

More than just similarities:

- Classification: predict the same class as the nearest observations
- Regression: predict the same value as the nearest observations

Remember that for classification tasks we want to predict a class for a new observation. What we could do is predict a class that is the same as the nearest neighbour. Simple! For regression tasks, we need to predict a value. Again, we could use the value of the nearest neighbour!

## 601: Similarity and Nearest Neighbours

Jan 2018

This section introduces the idea of “similarity”. Why?

- Simplicity
- Many business tasks require a measure of “similarity”
- Works well

Business reasoning: why would businesses want to use a measure of similarity? What business problems map well to similarity classifiers?

- Find similar companies on a CRM
- Find similar people in an online dating app
- Find similar configurations of machines in a data centre
- Find pictures of cats that look like this cat
- Recommend products to buy from similar customers
- Find similar wines

Similarity: what is similarity?

## 503: Visualising Overfitting in High Dimensional Problems

Jan 2018

Validation curve: one simple method of visualising overfitting is with a validation curve (a.k.a. fitting curve). This is a plot of a score (e.g. accuracy) versus some parameter in the model. Let’s compare the make_circles dataset again and vary the SVM->RBF->gamma value.

Performance of the SVM->RBF algorithm when altering the parameters of the RBF. We can see that we are underfitting at low values of $$\gamma$$. So we can make the model more complex by allowing the SVM to fit smaller and smaller kernels.

## 502: Preventing Overfitting with Holdout

Jan 2018

Holdout: we have been using training data, which is not representative of production. We want to pretend like we are seeing new data, so we hold back some data.

When we train the model, we do so on some data. This is called training data. Up to now, we have been using the same training data to measure our accuracy. If we create a lookup table, our accuracy will be 100%.
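The lookup-table point can be sketched directly. The data, the split fraction and the helper names below are assumptions made for illustration: a "model" that simply memorises the training rows scores perfectly on them and fails on the held-back rows.

```python
import random


def split_holdout(rows, holdout_fraction, seed):
    """Shuffle the rows and hold back a fraction that the model never sees."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, rows):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Hypothetical data: the label is 1 when the feature is greater than 50.
rows = [(x, int(x > 50)) for x in range(100)]
train, holdout = split_holdout(rows, holdout_fraction=0.25, seed=0)

# A lookup table memorises the training data and returns None for anything new.
lookup = dict(train)
train_accuracy = accuracy(lookup.get, train)      # 1.0
holdout_accuracy = accuracy(lookup.get, holdout)  # 0.0
```

Measuring accuracy on the training rows alone would suggest a perfect model. The held-back rows reveal that it has learned nothing that generalises.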

## 501: Over and Underfitting

Jan 2018

Generalisation and overfitting: “enough rope to hang yourself with”. We can create classifiers that have a decision boundary of any shape, so it is very easy to overfit the data. This section is all about what overfitting is and why it is bad.

Speaking generally, we can create classifiers that correspond to any shape. We have so much flexibility that we could end up overfitting the data. This is where chance data, data that is noise, is considered a valid part of the model.

## 404: Nonlinear, Linear Classification

Jan 2018

Nonlinear functions: sometimes data cannot be separated by a simple threshold or linear boundary. We can also use nonlinear functions as a decision boundary.

To represent more complex data, we can introduce nonlinearities. Before we do, bear in mind:

- More complex interactions between features yield solutions that overfit data; to compensate we will need more data.
- More complex solutions take a greater amount of computational power.
- Anti-KISS

The simplest way of adding nonlinearities is to add various permutations of the original features.
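As a rough sketch of that idea, the snippet below expands a row of features into all products of the original features up to a chosen degree. Reading "permutations" as polynomial products is an assumption, as are the function name and the example values.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree):
    """Return every product of the original features up to the given degree."""
    features = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features


# For x1 = 2 and x2 = 3 this yields x1, x2, x1*x1, x1*x2, x2*x2.
expanded = polynomial_features([2.0, 3.0], degree=2)  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

A linear classifier trained on the expanded features can draw a curved boundary in the original feature space, at the cost of more parameters to fit.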

## 403: Linear Classification

Jan 2018

Classification via a model: decision trees created a one-dimensional decision boundary. We could easily imagine using a linear model to define a decision boundary.

Previously we used fixed decision boundaries to segment the data based upon how informative the segmentation would be. The decision boundary represents a one-dimensional rule that separates the data. We could easily increase the number or complexity of the parameters used to define the boundary.

## 402: Optimisation and Gradient Descent

Jan 2018

Optimisation: when discussing regression we found that it has a closed-form solution, i.e. a solution that can be calculated directly. For many other algorithms there is no closed-form solution available. In these cases we need to use an optimisation algorithm. The goal of these algorithms is to iteratively step towards the correct result.

Gradient descent: given a cost function, the gradient descent algorithm calculates the gradient at the last step and moves in the direction opposite to that gradient.
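A minimal sketch of that loop, assuming a one-dimensional quadratic cost and a fixed learning rate (both chosen purely for illustration). Each step moves against the gradient so that the cost decreases.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient of a cost function."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Hypothetical cost (x - 3)^2, whose gradient is 2 * (x - 3) and whose minimum is at x = 3.
minimum = gradient_descent(
    gradient=lambda x: 2 * (x - 3),
    start=0.0,
    learning_rate=0.1,
    steps=100,
)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 on every step. A rate that is too large would overshoot and diverge instead.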

## 401: Linear Regression

Jan 2018

Regression and Linear Classifiers: traditional linear regression (a.k.a. Ordinary Least Squares) is the simplest and classic form of regression. Given a linear model in the form of:

\begin{align} f(\mathbf{x}) & = w_0 + w_1x_1 + w_2x_2 + \dots \\ & = \mathbf{w}^T \cdot \mathbf{x} \end{align}

Linear regression finds the parameters $$\mathbf{w}$$ that minimise the mean squared error (MSE)… The MSE is the mean of the squared differences between the predicted value and the actual value.
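For a single feature the least-squares parameters have a well-known closed form, which the sketch below implements by hand. The function names and the noise-free toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    w0 = mean_y - w1 * mean_x
    return w0, w1


def mean_squared_error(xs, ys, w0, w1):
    """Mean of the squared differences between predictions and actual values."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated from y = 1 + 2x with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w0, w1 = fit_line(xs, ys)                # intercept 1, slope 2
mse = mean_squared_error(xs, ys, w0, w1) # 0 for noise-free data
```

Any other choice of intercept and slope gives a larger MSE on this data, which is what "minimise" means here.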

## 302: How to Engineer Features

Jan 2018

Engineering features: you want to do this because it:

- Reduces the number of features without losing information
- Produces better features than the original
- Makes data more suitable for training

Another part of the data wrangling challenge is to create better features from current ones.

Distribution/model-specific rescaling: most models expect normally distributed data. If you can, transform the data to be normal. Infer the distribution from the histogram (and confirm by fitting distributions).

## 301: Data Engineering

Jan 2018

Your job depends on your data. The goal of this section is to:

- Talk about what data is and the context provided by your domain
- Discover how to massage data to produce the best results
- Find out how and where we can discover new data

If you have inadequate data you will not be able to succeed in any data science task. More generally, I want you to focus on your data.

## 203: Examples and Decision Trees

Jan 2018

Example: Segmentation via Information Gain. There’s a fairly famous dataset called the “mushroom dataset”. It describes whether mushrooms are edible or not, depending on an array of features. The nice thing about this dataset is that the features are all categorical. So we can go through and segment the data for each value in a feature. This is some example data:

| poisonous | cap-shape | cap-surface | cap-color | bruises? |
|-----------|-----------|-------------|-----------|----------|
| p         | x         | s           | n         | t        |
| e         | x         | s           | y         | t        |
| e         | b         | s           | w         | t        |
| p         | x         | y           | w         | t        |
| e         | x         | s           | g         | f        |

etc.

## 202: Segmentation For Classification

Jan 2018

Segmentation: let’s walk through a very visual, intuitive example to help describe what all data science algorithms are trying to do. This will seem quite complicated if you’ve never done anything like this before. That’s ok! I want to do this to show you that all algorithms that you’ve ever heard of have some very basic assumption about what they are trying to do. At the end of this, we will have completely derived one very important type of classifier.

## 201: Basics and Terminology

Jan 2018

The ultimate goal: first let’s discuss what the goal is. What is the goal? The goal is to make a decision or a prediction. Based upon what? Information. How can we improve the quality of the decision or prediction? The quality of the solution is defined by the certainty represented by the information. Think about this for a moment. It’s a key insight. Think about your projects.

## 102: How to do a Data Science Project

Jan 2018

Problems in Data Science. Understanding the problem: “the five-whys”. Different questions dramatically affect the tools and techniques used to solve the problem. Data Science as a Process: More Science than Engineering. Research, Problem, Model. High risk, high reward, difficult, unpredictable. CRISP-DM Process diagram by Kenneth Jensen, CC BY-SA 3.0, via Wikimedia Commons.

## 101: Why Data Science?

Jan 2018

What is Data Science? Software Engineering, Maths, Automation, Data. A.k.a. Machine Learning, AI, Big Data, etc. Its current rise in popularity is due to more data and more computing power. For more information: https://winderresearch.com/what-is-data-science/

Examples, US supermarket giants:

- Target: optimising marketing using customer spending data.
- Walmart: predicting demand ahead of a natural disaster.

Discovery: most projects are “Discovery Projects”.

## K-NN For Classification

Jan 2018

Welcome! This workshop is from WinderResearch.com. Sign up to receive more free workshops, training and videos. In a previous workshop we investigated how the nearest neighbour algorithm uses the concept of distance as a similarity measure. We can also use this concept of similarity as a classification metric, i.e. a new observation will be classified the same as its neighbours. This is accomplished by finding the most similar observations and setting the predicted classification as some combination of the k-nearest neighbours.
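One simple "combination" is a majority vote, sketched below with Euclidean distance. The vote rule, the function name and the toy observations are assumptions for illustration rather than the workshop's own code.

```python
import math
from collections import Counter


def knn_classify(query, observations, labels, k):
    """Predict the most common label among the k observations nearest to query."""
    by_distance = sorted(
        range(len(observations)),
        key=lambda i: math.dist(query, observations[i]),
    )
    votes = Counter(labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical observations forming two well-separated groups.
observations = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0)]
labels = ["a", "a", "a", "b", "b"]

near_a = knn_classify((1.5, 1.5), observations, labels, k=3)  # "a"
near_b = knn_classify((8.5, 8.0), observations, labels, k=3)  # "b"
```

In the second query one of the three neighbours belongs to class "a", but it is outvoted by the two closer "b" observations.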

## Nearest Neighbour Algorithms

Jan 2018

Nearest neighbour algorithms are a class of algorithms that use some measure of similarity. They rely on the premise that observations which are close to each other (when comparing all of the features) are similar to each other. Making this assumption, we can do some interesting things like:

- Recommendations
- Find similar stuff

But more crucially, they provide an insight into the character of the data.

## Entropy Based Feature Selection

Jan 2018

One simple way to evaluate the importance of features (something we will deal with later) is to calculate the entropy for prospective splits. In this example, we will look at a real dataset called the “mushroom dataset”. It is a large collection of data about poisonous and edible mushrooms. Attribute Information: (classes: edible=e, poisonous=p) 1.
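To show what "entropy for prospective splits" means in practice, here is a small sketch that computes Shannon entropy and the information gain of splitting on a categorical feature. The four toy rows and the feature names are invented for illustration and are not taken from the mushroom dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels, feature_values):
    """Reduction in entropy after splitting the labels by a categorical feature."""
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy rows: classes are edible (e) or poisonous (p).
labels = ["p", "e", "e", "p"]
informative = ["u", "v", "v", "u"]    # separates the classes perfectly
uninformative = ["x", "x", "y", "y"]  # each value contains both classes

gain_informative = information_gain(labels, informative)      # 1 bit
gain_uninformative = information_gain(labels, uninformative)  # 0 bits
```

A feature with a higher information gain produces purer segments, so ranking features by gain is one way to judge their importance.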

## Information and Entropy

Jan 2018

Remember the goal of data science. The goal is to make a decision based upon some data. The quality of that decision depends on our information. If we have good, clear information then we can make well informed decisions. If we have bad, messy data then our decisions will be poor.

## Testing Model Robustness with Jitter

Jan 2018

To test whether your models are robust to changes, one simple test is to add some noise to the test data. When we alter the magnitude of the noise, we can infer how well the model will perform with new data and different sources of noise. In this example we’re going to add some random, normally-distributed noise, but it doesn’t have to be normally distributed!
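
The jitter test described above might be sketched as follows. This is a toy under loud assumptions: `threshold_model` is a hypothetical stand-in for a trained classifier, the test labels are generated from that same rule, and the noise levels are arbitrary.

```python
import random


def threshold_model(x):
    """Hypothetical stand-in for a trained classifier: predicts 1 when x > 0.5."""
    return 1 if x > 0.5 else 0


def accuracy_under_jitter(model, xs, ys, sigma, seed=0):
    """Accuracy of `model` after adding Gaussian noise (std `sigma`) to the inputs."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    noisy = [x + rng.gauss(0.0, sigma) for x in xs]
    correct = sum(model(x) == y for x, y in zip(noisy, ys))
    return correct / len(xs)


# Toy test set: 100 evenly spaced points, labelled by the same rule.
test_x = [i / 100 for i in range(100)]
test_y = [threshold_model(x) for x in test_x]

# Sweep the noise magnitude and watch how accuracy changes.
for sigma in (0.0, 0.05, 0.2, 0.5):
    print(sigma, accuracy_under_jitter(threshold_model, test_x, test_y, sigma))
```

A model whose accuracy collapses at small noise levels is likely to be fragile on new data.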

## Qualitative Model Evaluation - Visualising Performance

Jan 2018

Being able to evaluate models numerically is really important for optimisation tasks. However, performing a visual evaluation provides two main benefits: it is easier to spot mistakes and easier to explain to other people. It is so easy to miss a gross error when looking at summary statistics alone. Always visualise your data/results!

## Detrending Seasonal Data

Jan 2018

statsmodels is a comprehensive library for time series data analysis, and it has a really neat set of functions to detrend data. So if you see that your features have any time-dependent trends, then give this a try. It’s essentially fitting the multiplicative model: $y(t) = \text{Level} \times \text{Trend} \times \text{Seasonality} \times \text{Noise}$

## Quantitative Model Evaluation

Jan 2018

We need to be able to compare models for a range of tasks. The most common use case is to decide whether changes to your model improve performance. Typically we want to visualise this, and we will in another workshop, but first we need to establish some quantitative measures of performance.

## Visualising Underfitting and Overfitting in High Dimensional Data

Jan 2018

In the previous workshop we plotted the decision boundary for under and overfitting classifiers. This is great, but very often it is impossible to visualise the data, usually because there are too many dimensions in the dataset. In this case we need to visualise performance in another way.

## Principal Component Analysis

Jan 2018

Sometimes data has redundant dimensions. For example, when predicting weight from height data you would expect that information about a person’s eye colour provides no predictive power. In this simple case we can simply remove that feature from the data. With more complex data it is usual to have combinations of features that provide predictive power.

## Hierarchical Clustering - Agglomerative

Jan 2018

Clustering is an unsupervised task. In other words, we don’t have any labels or targets. This is common when you receive questions like “what can we do with this data?” or “can you tell me the characteristics of this data?”. There are quite a few different ways of performing clustering, but one way is to form clusters hierarchically.

## Evidence, Probabilities and Naive Bayes

Jan 2018

Bayes’ rule is one of the most useful parts of statistics. It allows us to estimate probabilities that would otherwise be impossible. In this worksheet we look at Bayes’ rule at a basic level, then try a naive classifier. For more intuition about Bayes’ rule, make sure you check out the training.
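
As a small worked example of Bayes' rule (not the naive classifier itself), the sketch below computes a posterior probability for a binary hypothesis. The function name and all of the probabilities are assumptions chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of seeing the evidence: P(E).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E).
    return likelihood * prior / evidence


# Made-up numbers: a rare condition (1%) and a fairly accurate test.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a test that is right 90% of the time, the posterior stays low because the prior is so small, which is the kind of estimate that is hard to reach by intuition alone.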

## Support Vector Machines

Jan 2018

If you remember from the video training, SVMs are classifiers that attempt to maximise the separation between classes, no matter what the distribution of the data. This means that they can sometimes fit noise more than they fit the data. But because they are aiming to separate classes, they do a really good job at optimising for accuracy.

## Data Cleaning Example - Loan Data

Jan 2018

A huge amount of time is spent cleaning, removing and scaling data, all in an effort to squeeze a bit more performance out of the model. The data we are using is from Kaggle, where it is available in raw form. You will need to sign into Kaggle if you want to download the full data.

## Why Correlating Data is Bad and What to do About it

Jan 2018

Correlations between features are bad because you are effectively telling the model that this information is twice as important as everything else. You’re feeding the model the same data twice. Technically this is known as multicollinearity, which generalises the idea to any number of features that could be correlated. Generally, correlated features will decrease the performance of your model, so we need to find them and remove them.
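
One simple way to find correlated features is to compute the Pearson correlation for every pair and flag the pairs above a threshold. The sketch below assumes that approach; the helper names, the 0.9 threshold and the toy feature table are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation exceeds `threshold`."""
    names = list(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) > threshold
    ]


# Toy table: the same height recorded in two units, plus an unrelated feature.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [7, 5, 9, 6, 8],
}

print(correlated_pairs(features))
```

Having flagged a pair, you would typically keep one feature and drop the other. Pairwise correlation only catches two-feature relationships; dependencies among three or more features need other checks.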

## Regression: Dealing With Outliers

Jan 2018

Outliers are observations that are spurious. You can usually spot outliers visually; they are often far away from the rest of the observations. Sometimes they are caused by a measurement error, sometimes noise and occasionally they can be observations of interest (e.g. fraud detection). But outliers skew the estimates of the mean and standard deviation and therefore affect linear models that use error measures that assume normality (e.

## Probability Distributions

Jan 2018

This workshop is about another way of presenting data. We can plot how frequent observations are to better characterise the data. Imagine you had some data. For the sake of example, imagine that it is a measure of people’s heights. If you measured 10 people, then you would see 10 different heights. The heights are said to be distributed along the height axis.

## Overfitting and Underfitting

Jan 2018

Imagine you had developed a model that predicts some output. The goal of any model is to generate a correct prediction and avoid incorrect predictions. But how can we be sure that predictions are as good as they can possibly be? Now constrain your imagining to a classification task (other tasks have similar properties but I find classification easiest to reason about).

## Mean and Standard Deviation

Jan 2018

This workshop is about two fundamental measures of data. I want you to start thinking about how you can best describe or summarise data. How can we best take a set of data and describe that data in as few variables as possible? These are called summary statistics because they summarise statistical data.

## Logistic Regression

Jan 2018

I find the name logistic regression annoying. We don’t normally use logistic regression for anything other than classification; but statistics coined the name long ago. Despite the name, logistic regression is incredibly useful. Instead of optimising the error of the distance like we did in standard linear regression, we can frame the problem probabilistically.

## Linear Regression

Jan 2018

Regression is a traditional task from statistics that attempts to fit a model to some input data to predict the numerical value of an output. The data is assumed to be continuous. The goal is to be able to take a new observation and predict the output with minimal error. Some examples might be “what will next quarter’s profits be?

## Linear Classification

Jan 2018

We learnt that we can use a linear model (and possibly gradient descent) to fit a straight line to some data. To do this we minimised the mean-squared-error (often known as the optimisation/loss/cost function) between our prediction and the data. It’s also possible to slightly change the optimisation function to fit the line to separate classes.

## Introduction to Python and Jupyter Notebooks

Jan 2018

This workshop is a quick introduction to using Python and Jupyter Notebooks. For most Data Science tasks there are two competing Open Source languages. R is favoured more by those with a mathematical background. Python is preferred by those with a programming background; all of my workshops are currently in Python.

## Introduction to Gradient Descent

Jan 2018

An analytical solution exists for only a few algorithms. For example, we can use the Normal Equation to solve a linear regression problem directly. However, for most algorithms we cannot solve the problem analytically, usually because it’s impossible to solve the equation. So instead we have to try something else.
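
Gradient descent is one such alternative: rather than solving for the parameters directly, it repeatedly nudges them in the direction that reduces the error. The sketch below applies it to a one-feature linear regression with a mean-squared-error loss. The learning rate, step count and toy data are assumptions picked so that this small example converges, not recommended defaults.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimising mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every observation under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step downhill, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small the fit needs many more steps.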

## Histograms and Skewed Data

Jan 2018

When we first receive some data, it can be in a mess. If we tried to force that data into a model it is more than likely that the results will be useless. So we need to spend a significant amount of time cleaning the data. This workshop is all about bad data.

## Introduction to Monitoring Microservices with Prometheus

Dec 2017

Prometheus (https://prometheus.io) is an open source time series database that focuses on capturing measurements and exposing them via an API. I love Prometheus because it is so simple; its minimalism is its greatest feature. It achieves this by pulling metrics from instrumented applications, rather than having them pushed like many of its competitors. In other words Prometheus “scrapes” the metrics from the application.

This means that it works very well in a distributed, cloud-native environment. All of the services are unburdened by load on the monitoring system. This has knock-on effects: high availability (HA) is supported through simple duplication, and scaling is supported through segmentation.

## Logging vs Tracing vs Monitoring

Nov 2017

What do you mean by monitoring? Why do you need it? What are the real needs and are you monitoring them? Ask yourself these questions. Can you answer them? If not, you’re probably doing monitoring wrong.

This post asks the basic question. What is monitoring? How does it compare to logging and tracing? Let’s find out.

## The Meaning of (Artificial) Life: A Prelude to What is Data Science?

Nov 2017

Abstract The Hitchhiker’s Guide says the meaning of life is 42. Considering that the field of Data Science is going through a period of exponential growth it too could soon find that the meaning of an artificial life is also 42. But if you are not involved on a day-to-day basis, the expansion can seem bewildering. The story of how disparate disciplines have combined to produce Data Science is fascinating.

## Research-Driven Development: Improve the Software You Love While Staying Productive

Oct 2017

Abstract Have you ever wondered which parts of your job you love or hate? Chances are that like most developers you love learning and new problems to solve. You hate monotony and bureaucracy. You’ve probably put strategies in place to mitigate the things you don’t like. An anarchic development process like Agile, to reduce the amount of time in meetings. But have you ever thought about the way in which you approach learning and problem solving?

## What is Artificial Intelligence?

Oct 2017

If you ask anyone what they think AI is, they’re probably going to talk about sci-fi. Science fiction has been greatly influenced by the field of artificial intelligence, or A.I.

Probably the two most famous books about A.I. are I, Robot, released in 1950 by Isaac Asimov and 2001: A Space Odyssey, released in 1968 by Arthur C. Clarke.

I, Robot introduced the three laws of robotics. 1) A robot must not injure a human being, 2) a robot must obey orders, except where the orders would conflict with the First Law and 3) a robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

2001: A Space Odyssey is a story about a psychopathic A.I. called HAL 9000 that intentionally tries to kill the humans on board a spacecraft to save its own skin, in a sense.

But the history of AI stems back much further…

## The Meaning of (Artificial) Life: A Prelude to What is Data Science?

Oct 2017

Abstract The Hitchhiker’s Guide says the meaning of life is 42. Considering that the field of Data Science is going through a period of exponential growth it too could soon find that the meaning of an artificial life is also 42. But if you are not involved on a day-to-day basis, the expansion can seem bewildering. The story of how disparate disciplines have combined to produce Data Science is fascinating.

## What Is Data Science?

Jul 2017

Data Science is an emerging field that is plagued by lurid, often inconsequential reports of success. The press has been all too happy to predict the future demise of the human race.

But sifting through the chaff, we do see some genuinely interesting reports of work that affects both bottom-line profit and top-line revenue.

## Secure my Socks: Exploring Microservice Security in an Open-Source Sock Shop - AOTB

Jul 2017

Abstract In this talk, you will discover a reference microservices architecture – the sock shop – which we will abuse in order to investigate microservice security on the Kubernetes orchestrator and Weave Net, a software-defined network. Despite covering a range of topics, it will focus on the demonstration of two key areas: network policy and secure containers. Objective: You will learn how to secure containers and improve network security through the use of a software defined network.

## What is Cloud-Native?

Jun 2017

Cloud-Native, a collection of tools and best practices, disrupts the ideas behind traditional software development. I am a firm believer in the core concepts, which include visibility, repeatability, resiliency and robustness.

The idea began in 2015 when the Linux Foundation formed the Cloud Native Computing Foundation. The aim was to collect the tools and processes that are often employed to develop cloud-based software.

However, the result was a collection of best practices which extend well beyond the realms of the cloud. This post introduces the essential components: DevOps, continuous delivery, microservices and containers.

## Cloud-Native Data Science: Turning Data-Oriented Business Problems Into Scalable Solutions

Jun 2017

Abstract The proliferation of Data Science is largely due to: ubiquitous data, increasing computational power and industry acceptance that solutions are an asset. Data Science applications are no longer a simple dataset on a single laptop. In a recent project, we helped develop a novel cloud-native machine learning service. It is unique in that problems are packaged as containers and submitted to the cloud for processing. This enables users to distribute and scale their models easily.

## Secure my Socks: Exploring Microservice Security in an Open-Source Sock Shop - CL

May 2017

Abstract In this talk, you will discover a reference microservices architecture – the sock shop – which we will abuse in order to investigate microservice security on the Kubernetes orchestrator and Weave Net, a software-defined network. Despite covering a range of topics, it will focus on the demonstration of two key areas: network policy and secure containers. Objective: You will learn how to secure containers and improve network security through the use of a software defined network.

## Developers _are_ Researchers - Improve the work you love with Research Driven Development

May 2017

Abstract Have you ever wondered which parts of your job you love or hate? Chances are that like most developers you love learning and new problems to solve. You hate monotony and bureaucracy. You’ve probably put strategies in place to mitigate the things you don’t like. An anarchic development process like Agile, to reduce the amount of time in meetings. But have you ever thought about the way in which you approach learning and problem solving?

## Monitor My Socks: Using Prometheus in a Polyglot Open Source Microservices Reference Architecture

Apr 2017

Abstract This presentation describes how Prometheus was integrated into a polyglot microservices application. We will use the “Sock Shop”, a cloud-native reference microservices architecture to demonstrate some of the best practices and pitfalls of attempting to unify monitoring in real life. Attendees will be able to use this application as a reference point, or as a real life starting point for their own applications. Specifically, we will cover:

## How to use Javascript Promises to lazily update data

Apr 2017

Last week I was working on a simple implementation for updating a site’s shopping cart; the frontend was written in HTML/JavaScript. The brief: when the quantity of an item in the cart was modified, the client could press an update cart button, which would update the cart database. After that, it was necessary to recalculate the total values of the order and refresh the page with the new totals.

## What is the Cloud?

Mar 2017

The terms “Cloud” or “Cloud Services” have become so laden with buzz that they would be happy to compete with Apollo 11 or Toy Story. But the hype often hides the most important aspects that you need to know. Like how it works, or what you can do with it. This is the first of several introductory pieces that focus on the very basics of modern applications.

## Surprise at CPU Hogging in Golang

Jan 2017

In one of my applications, for various reasons, we now have a batch-like process and an HTTP-based REST application running inside the same binary. Today I came up against an issue where HTTP latencies were around 10 seconds when the batch process was running.

After some debugging, I found the reason: although the two are running in separate goroutines, the batch process does not allow the scheduler to schedule the HTTP request until the batch process has finished.

## Secure my Socks: Exploring Microservice Security in an Open Source Sock Shop

Nov 2016

Abstract Microservices are often lamented as “providing enough rope to hang yourself”, which gives the impression that microservices are inherently insecure. But if we do microservices right, we can improve security with a range of measures all designed to prevent further intrusion and disruption. In this talk, you will discover a reference microservices architecture - the sock shop - which we will abuse in order to investigate microservice security on the Kubernetes orchestrator and Weave Net, a software-defined networking product from Weaveworks.

## Gocoverage - Simplifying Go Code Coverage

Oct 2016

Go introduced vendoring in version 1.5 of the language. The vendor folder is used as a dependency cache for a project. Because of the unique way Go handles dependencies, the cache holds the full code of each dependency’s repository, warts and all. Go will search the vendor folder for its dependencies before it searches the global GOPATH. Tools have emerged to corral the vendor folder and one of my favourites is glide.

## How to Test in a Microservices Architecture

Sep 2016

The testing of microservices is inherently more difficult than testing monoliths due to the distributed nature of the code under test. But distributed applications are worth pursuing because by definition they are decoupled and scalable.

With planning, the result is a pipeline that automatically ensures quality. The automated assurance of quality becomes increasingly important in larger projects, because no one person wants to or is able to ensure the quality of the application as a whole.

This article provides some guidelines that I have developed whilst working on a range of software with microservice architectures. I attempt to align the concepts with best practice (references at the end of this article), but some of the terminology is my own.

## Go-Micro - Opinions and Examples

Jul 2016

I recently undertook a time-boxed four hour spike to investigate another Go microservices framework. Go-Micro is a “RPC framework for microservices”. It aims to provide common components that are often used in microservice deployments. It advertises itself as providing a pluggable architecture and boasts a long list of compatibilities.

## Cohesive Microservices with GoKit

Jul 2016

Whilst working on a cross-orchestration reference microservices application this week, my colleagues and I from Container-Solutions and WeaveWorks ported all of our simple Go microservices to the GoKit framework.

## An Overview of Mesos' New Unified Containerizer

Jul 2016

This week I was lucky enough to spend some time with Mesos 1.0.0-RC1 and specifically, the new unified containerizer. But first, let’s discuss what has existed for the last few years.

## The Benefits of Outsourcing Research and Development

Apr 2016

When I think about my business, I often end up at one fundamental business truth. The goal of any business is to make profit. This is achieved by either making more money or reducing costs. Often, business activities are outsourced to experts in order to achieve one of these. For example, I outsource my accounting to an expert, in order to a) ensure that I’m being tax efficient (save costs) and b) because I lack the expertise and the time/desire to obtain it (save costs).

Most traditional outsourcing routes (e.g. accountancy, recruiting, etc.) affect either making more money (e.g. marketing) or reducing costs (e.g. recruiting). Outsourcing research and development is different, in that it can benefit your business in both of these ways.

## Reinforcement Learning

Reinforcement learning (RL) is a new-ish paradigm for using machines to solve problems that involve sequential decisions. This makes RL particularly exciting in long-sighted applications of artificial intelligence (AI), where the one-shot decisions of machine learning (ML) quickly become stale and are unscalable. Winder Research provides reinforcement learning development and consultancy services to business customers. If you want to learn more about reinforcement learning, check out our new book.