Cloud-Native Data Science Blog

Have you ever wondered what cloud-native actually is? Are you confused about how to use data science to improve your business? Check out these informative blogs to help you get started.

Free RL Book Competition and Merry Christmas

Dec 2020

2020 has been quite a year. I think back over all the so called “1-in-100 year” events. The statistician in me suggests that this isn’t as surprising as it sounds, given the size of the world and globalized communication. But the fact that COVID-19 has affected so many is almost unprecedented. To help provide some semblance of festive cheer I’m happy to offer one lucky person a free copy of my new book, Reinforcement Learning: Industrial Applications of Intelligent Agents.

5 Productivity Tips for Data Scientists

Aug 2020

Many articles talk about how professionals can make their workdays extra productive. However, for people like data scientists, whose jobs are extremely demanding, some tips are more valuable than others. For instance, it is important that you analyse how you spend your time. In the same breath, it would be in your best interest to organise your time into blocks, as these can help you focus on tasks – one at a time and without any interruption – and automate any process that you repeat.

Unit Testing Data: What is it and how do you do it?

Aug 2020

Data Testing plays an indispensable role in data projects. When businesses fail to test their data, it becomes difficult to understand the error and where it occurred, which makes solving the problem even harder. If data testing is performed correctly, it will improve business decisions, minimize losses, and increase revenues. This article presents common questions about unit testing raw data. If your question isn’t listed, please contact us, and we will be happy to help.

Developing a Real-Life Project

Jun 2020

I’m often asked questions in the vain of “how did you figure that out?”. Other times, and I’m less of a fan of these, I get questions like “you estimated X, why did it take 2*X?”, which I respond with a definition of the word estimate. Both of these types of questions are about the research and development process. Non-developers, and especially non-engineers, are often never exposed to the process of research and development.

COVID-19 Exponential Bayesian Model

Apr 2020

The purposes of this notebook is to provide initial experience with the pymc3 library for the purpose of modeling and forecasting COVID-19 virus summary statistics. This model is very simple, and therefore not very accurate, but serves as a good introduction to the topic.

Fast Time-Series Filters in Python

Oct 2019

Time-series (TS) filters are often used in digital signal processing for distributed acoustic sensing (DAS). The goal is to remove a subset of frequencies from a digitised TS signal. To filter a signal you must touch all of the data and perform a convolution. This is a slow process when you have a large amount of data. The purpose of this post is to investigate which filters are fastest in Python.

A Comparison of Reinforcement Learning Frameworks: Dopamine, RLLib, Keras-RL, Coach, TRFL, Tensorforce, Coach and more

Jul 2019

Reinforcement Learning (RL) frameworks help engineers by creating higher level abstractions of the core components of an RL algorithm. This makes code easier to develop, easier to read and improves efficiency.

But choosing a framework introduces some amount of lock in. An investment in learning and using a framework can make it hard to break away. This is just like when you decide which pub to visit. It’s very difficult not to buy a beer, no matter how bad the place is.

DevOps and Data Science: DataDevOps?

Mar 2019

I’ve seen a few posts recently about the emergence of a new field that is kind of like DevOps, but not quite, because it involves too much data. Verbally, about two years ago and in blog form about a year ago, I used the word DataDevOps, because that’s what I did. I develop and operate Data Science platforms, products and services. But more recently I have read of the emergence of DataOps.

Scikit Learn to Pandas: Data types shouldn't be this hard

Feb 2019

Nearly everyone using Python for Data Science has used or is using the Pandas Data Analysis/Preprocessing library. It is as much of a mainstay as Scikit-Learn. Despite this, one continuing bugbear is the different core data types used by each: pandas.DataFrame and np.array. Wouldn’t it be great if we didn’t have to worry about converting DataFrames to numpy types and back again? Yes, it would. Step forward Scikit Pandas. Sklearn Pandas Sklearn Pandas, part of the Scikit Contrib package, adds some syntactic sugar to use Dataframes in sklearn pipelines and back again.

7 Reasons Why You Shouldn't Use Helm in Production

Jan 2019

Helm is billed as “the package manager for Kubernetes”. The goal was to provide a high-level package management-like experience for Kubernetes. This was a goal for all the major containerisation platforms. For example, Apache Mesos has Mesos Frameworks. And given the standardisation on package management at an OS level (yum, apt-get, brew, choco, etc.) and an application level (npm, pip, gem, etc.), this makes total sense, right?

A Comparison of Serverless Frameworks for Kubernetes: OpenFaas, OpenWhisk, Fission, Kubeless and more

Sep 2018

The term Serverless has become synonymous with AWS Lambda. Decoupling from AWS has two benefits; it avoids lock in and improves flexibility.

The misnomer Serverless, is a set of techniques and technologies that abstract away the underlying hardware completely. Obviously these functions still run on “servers” somewhere, but the point is we don’t care. Developers only need to provide code as a function. Functions are then used or consumed via an API, usually REST, but also through message bus technologies (Kafka, Kinesis, Nats, SQS, etc.).

This provides a comparison and recommendation for a Serverless framework for the Kubernetes platform.

How to Test Terraform Infrastructure Code

Aug 2018

Infrastructure as code has become a paradigm, but infrastructure scripts are often written and run only once. This works for simplistic infrastructure requirements (e.g. k8s deployments). But when there is a requirement for more varied infrastructure or greater resiliency then testing infrastructure code becomes a requirement. This blog post introduces a current project that has found tools and patterns to deal with this problem.

Cloud Native Data Science: Strategy

May 2018

Data Science has become an important part of any business because it provides a competitive advantage. Very early on, Amazon’s data on book purchases allowed them to deliver personalised recommendations whilst customers were browsing their site. Their main competitor in the US at the time was Borders, who mainly operated in physical stores. This physicality prevented them from seamlessly providing customers with personalised recommendations [1]. This example highlights how strategic business decisions and data science are inextricably linked.

How to List all AMIs for each region in AWS

Apr 2018

A current project required a list of Amazon Machine Images (AMIs) for all regions for use in terraform. I couldn’t find a script to do this for me, so here you will find one that uses the aws cli, jq and a bit of Bash.

Introduction to Monitoring Microservices with Prometheus

Dec 2017

https://prometheus.io is an open source time series database that focuses on capturing measurements and exposing them via an API. I love Prometheus because it it so simple; it’s minimalism is its greatest feature. It achieves this by pulling metrics from instrumented applications, not pulling like many of its competitors. In other words Prometheus “scrapes” the metrics from the application.

This means that it works very well in a distributed, cloud-native environment. All of the services are unburdened by load on the monitoring system. This has knock on effects meaning that HA is supported through simple duplication and scaling is supported through segmentation.

Logging vs Tracing vs Monitoring

Nov 2017

What do you mean by monitoring? Why do you need it? What are the real needs and are you monitoring them? Ask yourself these questions. Can you answer them? If not, you’re probably doing monitoring wrong.

This post asks the basic question. What is monitoring? How does it compare to logging and tracing? Let’s find out.

What is Artificial Intelligence?

Oct 2017

If you ask anyone what they think AI is, they’re probably going to talk about sci-fi. Science fiction has been greatly influenced by the field of artificial intelligence, or A.I.

Probably the two most famous books about A.I. are I, Robot, released in 1950 by Isaac Asimov and 2001: A Space Odyssy, released in 1968 by Arthur C. Clarke.

I, Robot introduced the three laws of robotics. 1) A robot must not injure a human being, 2) a robot must obay the orders, except where the orders would conflict with the First Law and 3) a robot must protect its own existance as long as such protection does not conflict with the First or Second Laws.

2001: A Space Odyssey is a story about a psychopathic A.I. called HAL 9000 that intentionally tries to kill the humans on board a space station to save it’s own skin, in a sense.

But the history of AI stems back much further…

What Is Data Science?

Jul 2017

Data Science is an emerging field that is plagued by lurid, often inconsequential reports of success. The press has been all too happy to predict the future demise of the human race.

But sifting through chaff, we do see some genuinely interesting reports of work that affects both bottom-line profit and top-line revenue.

What is Cloud-Native?

Jun 2017

Cloud-Native, a collection of tools and best practices, disrupts the ideas behind traditional software development. I am a firm believer of the core concepts, which include visibility, repeatability, resiliency and robustness.

The idea begins in 2015 when the Linux Foundation formed the Cloud-Native Computing Foundation. The idea was to collect the tools and processes that are often employed to develop cloud-based software.

However, the result was a collection of best practices which extend well beyond the realms of the cloud. This post introduces the essential components: DevOps, continuous delivery, microservices and containers.

How to use Javascript Promises to lazily update data

Apr 2017

Last week I was working on a simple implementation updating a shopping cart for a site, the frontend was written in html/javascript. The brief - when the quantity of an item in the cart was modified the client could press an update cart button which would update the cart database, after which it was necessary to recalculate the total values of the order and refresh the page with the new totals.

What is the Cloud?

Mar 2017

The terms “Cloud” or “Cloud Services” have become so laden with buzz that they would be happy to compete with Apollo 11 or Toy Story. But the hype often hides the most important aspects that you need to know. Like how it works, or what you can do with it. This is the first of several introductory pieces that focus on the very basics of modern applications.

Surprise at CPU Hogging in Golang

Jan 2017

In one of my applications, for various reasons, we now have a batch like process and a HTTP based REST application running inside the same binary. Today I came up against an issue where HTTP latencies were around 10 seconds when the batch process was running.

After some debugging, the reason for this is that although the two are running in separate Go routines, the batch process is not allowing the scheduler to schedule the HTTP request until the batch process has finished.

Gocoverage - Simplifying Go Code Coverage

Oct 2016

Go introduced vendoring into version 1.5 of the language. The vendor folder is used as a dependency cache for a project. Because of the unique way Go handles dependencies, the cache is full code from an entire repository; worts and all. Go will search the vendor folder for its dependencies before it searches the global GOPATH. Tools have emerged to corral the vendor folder and one of my favourites is glide.

How to Test in a Microservices Architecture

Sep 2016

The testing of microservices is inherently more difficult than testing monoliths due to the distributed nature of the code under test. But distributed applications are worth pursuing because by definition they are decoupled and scalable.

With planning, the result is a pipeline that automatically ensures quality. The automated assurance of quality becomes increasingly important in larger projects, because no one person wants to or is able to ensure the quality of the application as a whole.

This article provides some guidelines that I have developed whilst working on a range of software with microservice architectures. I attempt to align the concepts with best practice (references at the end of this article), but some of the terminology is my own.

The Benefits of Outsourcing Research and Development

Apr 2016

When I think about my business, I often end up at one fundamental business truth. The goal of any business is to make profit. This is achieved by either making more money or reducing costs. Often, business activities are outsourced to experts in order to achieve one of these. For example, I outsource my accounting to an expert, in order to a) ensure that I’m being tax efficient (save costs) and b) because I lack the expertise and the time/desire to obtain it (save costs).

Most traditional outsourcing routes (e.g. accountancy, recruiting, etc.) affect either making more money (e.g. marketing) or reducing costs (e.g. recruiting). Outsourcing research and development is different, in that it can benefit your business in both of these ways.