Data testing plays an indispensable role in data projects. When businesses fail to test their data, it becomes difficult to understand what went wrong and where, which makes solving the problem even harder. Performed correctly, data testing improves business decisions, minimizes losses, and increases revenue.
This article presents common questions about unit testing raw data. If your question isn’t listed, please contact us, and we will be happy to help.
In software development, unit testing is a verification and validation technique in which a developer tests if individual methods and functions, components, or modules used by your software are fit for use. Unit tests are very low-level, close to the source of your application. They are often cheap to automate and can be run very quickly by a continuous integration server.
Data unit testing examines the quality of the data instead of software. It can be done through various means, including:
- Checking whether each value falls within the min-max range of that data element (based on all previously registered data),
- Defining validation rules and surfacing the data units that violate them,
- Analyzing data sets, i.e., examining gaps in the data, missing values, existing trends, and so forth.
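As a minimal sketch, the first two of these checks could look like the following in plain Python (the record structure, field names, bounds, and rules are hypothetical examples, not a prescribed API):

```python
# Minimal range and validation-rule checks on a list of dict records.
# Field names ("age", "email") and bounds are hypothetical.

def check_ranges(records, field, lo, hi):
    """Return the records whose value for `field` falls outside [lo, hi]."""
    return [r for r in records if not (lo <= r[field] <= hi)]

def check_rules(records, rules):
    """Return (record, rule_name) pairs for every violated validation rule."""
    violations = []
    for r in records:
        for name, rule in rules.items():
            if not rule(r):
                violations.append((r, name))
    return violations

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "bad-address"},
]
rules = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "email_has_at": lambda r: "@" in r["email"],
}

out_of_range = check_ranges(records, "age", 0, 120)
violated = check_rules(records, rules)
```

In a real pipeline, the min-max bounds would typically be derived from historical data rather than hard-coded.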
Ensuring data quality is vital for building an effective data product and improving the accuracy and reliability of data.
Research has shown that poor data costs businesses, on average, 30% of their revenue. According to Gartner, poor data quality costs organizations between $9.7 million and $14.2 million annually. Not performing unit tests on data can lead to several issues, such as:
- Missing values can lead to failures in production systems that require non-null values.
- Changes in the distribution of data can lead to unexpected outputs of machine learning models.
- Aggregations of incorrect data can lead to wrong business decisions.
Poor data quality incurs not only high financial costs but also reputational damage. The impact on customer satisfaction threatens a company’s reputation, as customers can use social media to share their negative experiences.
For example, a Hawaiian airline booking application accidentally charged air mile purchases in dollar amounts, which meant users didn’t pay 200,000 miles, but $200,000. This could have been caught and prevented using in-flight data testing.
Unit testing data has various advantages; to name a few:
- Increased data confidence: Data unit testing establishes trust in and understanding of the data, spreading confidence throughout the organization so teams can rely on the insights obtained from it.
- Scalability: By implementing data unit testing practices, organizations can scale up their business operations while still ensuring that the data remains accurate and reliable.
- Saving costs and time: Unit testing data allows companies to detect errors and inaccurate data early. Thus, business decisions are based only on high-quality data, which saves cost and time.
- More Informed Decision-Making: Enhanced data quality allows organizations to reach better decision-making. The higher your data quality is, the less risk and the more confidence you can have in your business decisions.
- Improved productivity: With no more time wasted on backtracking errors and double-checking results, organizations can invest their resources in achieving further results that align with their business goals.
There are two main strategies for testing data:
- Batch Data Test: The Batch Data Test strategy involves testing a high volume of data whose transactions are collected over a given period of time. Batch testing is ideal for large data sets and for projects that require deeper analysis; it is not recommended for projects that need speed or real-time results.
- Real-Time Data Test: Real-Time Data Test is testing data as soon as it becomes available. In other words, you can get insights or can draw conclusions on data quality immediately after the data enters the system. Real-time testing allows businesses to react and fix the quality of their data without delay.
So what are the common forms of unit tests you can perform on your data? Below are a few that we use in our projects:
- Sanity Check: Sanity checks are a crucial step in the data wrangling process. The final analysis is only as accurate as the underlying data; failing to understand your data, or to check it for inconsistent and duplicate records, will skew the analysis. The following are some practices performed in sanity checks:
- Taking a random sample of the data.
- Checking for datatype mismatches, variations in how values are entered, and missing values.
- Looking for duplicate records and outliers.
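A hedged sketch of these sanity checks in plain Python (the records and the `price` field are made-up examples):

```python
import random
from collections import Counter

# Hypothetical records with typical sanity-check problems baked in.
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": "9.99"},   # datatype mismatch: string instead of float
    {"id": 3, "price": None},     # missing value
    {"id": 1, "price": 9.99},     # exact duplicate of the first record
]

# Take a random sample to eyeball.
sample = random.sample(records, k=2)

# Datatype mismatches: non-null prices that are not floats.
type_mismatches = [r for r in records
                   if r["price"] is not None and not isinstance(r["price"], float)]

# Missing values.
missing = [r for r in records if r["price"] is None]

# Duplicate records (hashable key built from the sorted field/value pairs).
dupes = [rec for rec, n in
         Counter(tuple(sorted(r.items())) for r in records).items() if n > 1]
```

Outlier detection on numeric columns can be layered on top (see the anomaly techniques later in this article).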
- Distribution Check: Distribution checks evaluate how the data is distributed, for example by testing it for normality. Sometimes data may look good on the surface, but when you examine its distribution, you notice gaps or values distributed in a way that doesn’t make logical sense. An abnormal data distribution may indicate a larger data quality issue that requires further investigation.
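One lightweight way to automate a distribution check is to compute sample skewness, which is near zero for symmetric distributions such as the normal. A sketch in plain Python (both data sets are made up for illustration; dedicated normality tests such as those in `scipy.stats` are a more rigorous option):

```python
import statistics

def skewness(xs):
    """Sample skewness: near 0 for symmetric data, large for lopsided data."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # symmetric around 5
skewed = [1, 1, 1, 1, 2, 2, 3, 10, 50]      # long right tail
```

A unit test on a pipeline might then assert that the skewness of a column stays within an expected band.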
- Correlation test: When investigating the relationship between two continuous quantitative variables, Pearson’s correlation coefficient is a good measure of the strength of the association. To study the relationship, first draw a scatter plot of the variables to check for linearity; the coefficient should not be calculated if the relationship is not linear. The closer the scatter of points is to a straight line, the stronger the association between the variables. You can automate this with correlation tests, too.
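A minimal sketch of an automated correlation check in plain Python (the `hours`/`score` data are hypothetical; `scipy.stats.pearsonr` is a common off-the-shelf alternative):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]   # perfectly linear relationship
r = pearson_r(hours, score)
```

A data unit test could then assert that a known relationship (e.g. `r > 0.9`) still holds in each new batch.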
- Cycles/Trends: Modeling cycles and trends is a fundamental part of data analysis. If you have multiple years of data, and that data varies on a regularly recurring cycle (annually, monthly, daily, or any other period), knowing how to model existing cycles is important for understanding the processes that affect the measured data and ultimately making forecasts.
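One simple way to detect a recurring cycle is to look for the lag with the highest autocorrelation. A sketch in plain Python, using a made-up series that repeats every 4 points:

```python
def autocorr(xs, lag):
    """Autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return num / denom

# Hypothetical series with a cycle of length 4 (e.g. quarterly seasonality).
series = [10, 20, 30, 20] * 6

# The lag with the highest autocorrelation reveals the cycle length.
best_lag = max(range(1, 9), key=lambda lag: autocorr(series, lag))
```

Real series also carry noise and trend, so in practice you would detrend first and use a dedicated library for decomposition and forecasting.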
- Anomalies: Anomalies are unusual points or patterns in a given dataset. Anomaly detection provides a way to reduce the raw quantity of data that needs to be considered by further analysis. The following techniques are often used for anomaly detection:
- Standard deviation
- Isolation forest
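The standard-deviation technique above can be sketched in a few lines of plain Python (the data and the 2.5σ threshold are illustrative; 3σ is also common with larger samples, and for isolation forests scikit-learn’s `IsolationForest` is a typical choice):

```python
import statistics

def zscore_outliers(xs, threshold=2.5):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [x for x in xs if abs(x - mean) > threshold * sd]

# Hypothetical sensor readings with one obvious anomaly.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
outliers = zscore_outliers(data)
```

Note that a single extreme point inflates the standard deviation itself, which is why robust variants (median absolute deviation) or model-based detectors are often preferred on small samples.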
There are many tools for data unit testing; the following are a few of the most used tools to check your data quality:
- Open Source
- Apache Griffin: Apache Griffin is a data quality assertion framework. It works by letting you define your own quality criteria, measuring them on batch or streaming data, and then producing reports on the results. The downside is that the project is relatively old yet has never gained much traction. It also offers a very limited set of measurements; for example, there are no statistical tests for raw data.
- JSON Schema: Using a schema for unit testing might sound strange, but many schema libraries allow you to enforce data requirements in the specification. For example, JSON Schema allows you to set minimum and maximum values, so you can unit test the data in the API layer. This helps to prevent bad data before it affects your models or pipelines.
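For illustration, a JSON Schema along these lines (the field names and bounds are hypothetical) would reject records with a non-positive price or a missing quantity before they enter the pipeline:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["price", "quantity"],
  "properties": {
    "price":    { "type": "number", "minimum": 0.01, "maximum": 100000 },
    "quantity": { "type": "integer", "minimum": 1 }
  }
}
```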
- Amazon deequ: Deequ is a data unit testing library to find errors before the data gets fed to machine learning algorithms. It computes data quality metrics regularly, checks constraints set by the user, and publishes data in case of success. In error cases, dataset publication can be stopped, and users receive a notification in order to take action.
- great_expectations: a great library of data assertions that you can use with any pipeline tool. Insert these into your pipelines to make them far more robust.
- DBT: is a data transformation tool (the T in ETL) that has a natural way of defining tests as part of the transformation pipeline.
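For example, dbt’s built-in schema tests (`unique`, `not_null`, `accepted_values`) can be declared in a model’s YAML file; the model and column names here are hypothetical:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` then executes each of these checks against the warehouse and fails the run on any violation.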
- Talend: Talend offers many software solutions, such as data quality, data management, and big data, with a separate product for each. Talend’s Data Quality solution profiles, cleans, and masks data of any format or size to produce high-quality data. The downside is that many users complain about its steep learning curve.
- Xplenty: is a cloud-based ETL platform for streamlining data processing, with an intuitive graphical interface for implementing data transformations. Despite the easy-to-navigate interface, some users report that Xplenty’s error messages are not always descriptive enough.
- RightData: is a data quality testing tool designed for automating data quality assurance. It identifies issues related to data consistency, quality, completeness, and gaps. Note that Talend, Xplenty, and RightData are proprietary, and their usage might incur extra expenses depending on each vendor’s offering.
- Trifacta: is a proprietary platform for data pipelining and has limited testing capabilities.
Testing data and controlling its quality in real time makes it easier to monitor data across multiple sources such as cloud, web, and mobile applications. Continuous monitoring and alerting help detect new opportunities and anomalies that may impact the business, giving organizations ongoing oversight of their data and online insight into what needs attention.
Implementing a monitoring strategy allows organizations to be proactive and more productive by identifying issues that are indeed anomalies and others that are repetitive and require attention once and for all.
Tracking data quality and implementing monitoring systems allows you to parse, standardize, and match the data in real-time.
Poor data quality can lead to increasingly inaccurate outputs from machine learning algorithms, costing businesses opportunities to monetize their data and reach their goals.
For data to be beneficial, it needs to be of high quality, which can only be achieved through regular data testing. Performing unit tests as the first stage of your testing strategy lets organizations reap the full benefits of their data. The success of new technologies depends heavily on data quality: the better your data, the faster your algorithms can produce results, and the better those results are.
Thanks to Larysa Visengeriyeva for the Trifacta and great_expectations recommendations. Thanks to Oliver Laslett for the dbt recommendation.