Data is an essential asset of modern business. It empowers companies by surfacing unique insights about their customers and by turning those insights into actionable products. The more data you possess, the better you can meet and exceed your customers’ expectations.
Often though, initial enthusiasm for a machine learning project is tempered by a lack of available data. When Winder Research is asked to help a company develop a data-driven project, the most common question is: “how does the availability of data affect my project?”
This post presents seven common situations and their solutions. If your question isn’t listed, please contact us and we will be happy to help.
Q1. I don’t have any data yet. How do I start collecting it?

The first step in any data science project is obtaining data. You will need to:
- Identify the problem
You need to fully understand the problem. What are the goals? How are they measured? What or who are the entities? Determining goals and targets will provide insights about what to collect and how much you need.
- Determine a timeframe
Next you should plan how to obtain the required data. Is the data available immediately or do you need to collect it? How long will that take? How much will it cost? This aspect could impact your delivery schedule.
- Choose an appropriate collection method
The source of your data will depend on your business goal and the project domain. For example, you could collect data from:
- interviews or surveys
- product metrics
- documents or records
- transactional or procedural data
- customer behaviour
You can now start collecting data. Create a schedule to monitor your data collection process. Regularly checking progress will allow you to adapt as conditions change. You might find that you need more data than you initially realised. If so, jump back to step 1 and iterate. In the early stages, always err towards more data; it is much easier to ignore than it is to find more.
Q2. I have a small number of positive examples. I have tried training a classification model, but it doesn’t work very well. How can I use this to help me find more data?
One common problem is an unbalanced distribution of classes within a dataset. For example, in fraud detection datasets, most transactions are not fraudulent and only a few are. Here are a few mitigation techniques:
- Undersampling is the process of randomly deleting some observations from the majority class to match the numbers with the minority class.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Separate input features and target
X = df.drop('targetClass', axis=1)
y = df.targetClass

# Set up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
majorityClass = X[X.targetClass == 0]
minorityClass = X[X.targetClass == 1]

majorityClass_downsampled = resample(majorityClass,
                                     replace=False,                 # sample without replacement
                                     n_samples=len(minorityClass),  # match minority n
                                     random_state=27)               # reproducible results

# Combine minority and downsampled majority
downsampled = pd.concat([majorityClass_downsampled, minorityClass])

# Check counts
downsampled.targetClass.value_counts()
```
- Oversampling is the process of adding observations to the minority class, either by duplicating existing observations or by producing synthetic ones. A commonly used technique is the Synthetic Minority Over-sampling Technique (SMOTE), in which new observations are generated by interpolating between existing minority-class points in feature space.
```python
# Still using the same majorityClass and minorityClass from above
# Upsample minority
minorityClass_upsampled = resample(minorityClass,
                                   replace=True,                  # sample with replacement
                                   n_samples=len(majorityClass),  # match number in majority class
                                   random_state=27)               # reproducible results

# Combine majority and upsampled minority
upsampled = pd.concat([majorityClass, minorityClass_upsampled])

# Check new class counts
upsampled.targetClass.value_counts()
```
```python
# Upsampling the minority class using SMOTE
from imblearn.over_sampling import SMOTE

smoteTechnique = SMOTE(sampling_strategy='minority', random_state=27)
X_train, y_train = smoteTechnique.fit_resample(X_train, y_train)
```
Note that you should always split your dataset into train and test sets before applying these techniques. Resampling only the training data keeps the test set representative of the real class distribution and prevents duplicated observations from leaking into it, so your performance estimates stay honest and overfitting is easier to spot.
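To make that ordering concrete, here is a minimal sketch on toy data (the 90/10 split and column names are illustrative): the dataset is split first, and only the training portion is rebalanced, so the test set keeps its natural imbalance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced dataset: 90 majority rows, 10 minority rows
df = pd.DataFrame({
    'feature': range(100),
    'targetClass': [0] * 90 + [1] * 10,
})

# 1. Split first, stratifying so both classes appear in each set
train, test = train_test_split(df, test_size=0.25,
                               stratify=df.targetClass, random_state=27)

# 2. Oversample the minority class in the *training* portion only
minority = train[train.targetClass == 1]
majority = train[train.targetClass == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=27)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced.targetClass.value_counts().to_dict())  # balanced classes
print(test.targetClass.value_counts().to_dict())            # untouched imbalance
```

If the resampling had been done before the split, copies of the same minority rows would appear on both sides of the split, inflating test scores.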
Q3. I have a small amount of data because it is challenging to acquire new data. But I want to create a classification model. How can I do this robustly?
Your lack of data is critical, since data lies at the heart of any artificial intelligence project. Production model performance correlates with the size of training data. But how much is good enough?
As a rough guide, you need about ten times as many observations as there are parameters in your model. For example, a linear regression model with two features has three parameters (two weights and one intercept), so you need at least 30 observations. With fewer than this, your model is likely to overfit. That said, less data may be workable depending on the use case.
The following factors affect how much data you need:
- the number of parameters in your model
- the expected model performance
- the output of your model
With any amount of data, make your model more robust by following these recommendations:
- Choose simple models: smaller models require less data.
- If you’re training a classifier, start with logistic regression.
- For tree-based models, limit the maximum depth.
- If you’re predicting categories, start with a simple linear model with a limited number of features.
- Apply regularization methods to make a model more conservative.
```python
# Implementing a simple classifier with L1 regularization
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(solver='liblinear',
                            penalty='l1',
                            C=0.1,
                            class_weight='balanced')

# Fit the model
logReg.fit(X_train, y_train)

# Get predictions
logReg.predict(X_test)
```
- Remove outliers from data. Outliers can significantly impact your model when trained on small datasets. You can remove them or use more robust techniques such as quantile regression.
```python
# Detect outliers and remove them from the training set
import numpy as np
from sklearn.ensemble import IsolationForest

isoforest = IsolationForest(n_jobs=-1, random_state=1)
isoforest.fit(X_train)
outliersPred = isoforest.predict(X_train)  # 1 = inlier, -1 = outlier

mask = outliersPred == 1
X_train = X_train[mask]
y_train = y_train[mask]
```
- Select relevant features. This can be done with several techniques, such as recursive feature elimination, analysing correlation with the target variable, and feature-importance analysis. Feature selection requires familiarity with the subject area, so consulting a domain expert is beneficial.
```python
# Select relevant features with recursive feature elimination,
# based on the initial model, and drop the eliminated ones
from sklearn.feature_selection import RFE

rfe = RFE(logReg)
rfe.fit(X_train, y_train)
X_train.drop(X_train.columns[~rfe.support_], axis=1, inplace=True)
```
- Ensemble several models. Combining results from many models can provide consensus and make solutions more robust.
```python
# Ensembling the initial model with XGBoost
import pandas as pd
from mlxtend.classifier import StackingClassifier
from xgboost import XGBClassifier

preds = pd.DataFrame()  # predictions DataFrame

stackedModel = StackingClassifier(classifiers=[logReg,
                                               XGBClassifier(max_depth=2)],
                                  meta_classifier=logReg)

stackedModel.fit(X_train, y_train)
preds['stack_pred'] = stackedModel.predict(X_test)
```
Q4. I have created a model based upon a small amount of data. But when I put it in production, performance drops. How can I improve the robustness of the model?
There are several strategies you can use to improve model performance:
If your model is underfitting, try increasing the number of input features or using a more complex model. If it is overfitting, do the opposite.
Cross-validation is an important preventative step against overfitting. You use your initial training data to create multiple mini train-test splits, then use them to tune your model. In k-fold cross-validation, the data is divided into k subsets, called folds. The algorithm is then iteratively trained on k-1 folds while the remaining fold (the “holdout fold”) is used as the test set.
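A minimal sketch of k-fold cross-validation with scikit-learn, using synthetic data for illustration:

```python
# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, rotate
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=27)
model = LogisticRegression(solver='liblinear')

scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
print(scores.mean(), scores.std())
```

The spread of the fold scores is itself useful: a large standard deviation suggests the model is sensitive to which data it sees, a warning sign on small datasets.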
Regularization helps by forcing your model to be simpler. How you apply it depends on the type of model: dropout for neural networks, pruning for decision trees, and so on.
Ensembling combines predictions from separate models. There are multiple methods for ensembling, but the two most common are:
Bagging tries to reduce the chance of overfitting complex models by training a large number of “strong” learners in parallel, then combining them to “smooth out” their predictions.
Boosting tries to increase the predictive flexibility of simple models by training a large number of “weak” learners in sequence, then combining them into a single strong learner.
You can also ensemble different models together with a simple majority vote.
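All three approaches can be sketched with scikit-learn’s built-in estimators (synthetic data; the model choices and hyperparameters here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=27)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)

# Bagging: many deep ("strong") trees trained in parallel on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=27)

# Boosting: many shallow ("weak") trees trained in sequence
boosting = GradientBoostingClassifier(max_depth=2, random_state=27)

# Simple majority vote over heterogeneous models
vote = VotingClassifier([('lr', LogisticRegression()),
                         ('bag', bagging),
                         ('boost', boosting)], voting='hard')
vote.fit(X_train, y_train)
print(vote.score(X_test, y_test))
```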
Q5. I have created a model based upon a small amount of data, and I’m using a complex model (e.g. deep learning). But it doesn’t work well in real life. Why?
Deep neural networks need large datasets to achieve high performance. The more data you acquire, the better your model performs. For smaller datasets, simpler machine learning models such as regressions, random forests, and SVMs often outperform deep networks. Consider applying classical models or obtaining more data. A linear algorithm can attain good performance with hundreds of examples per class, but a nonlinear algorithm like an artificial neural network may need thousands of samples per class.
Q6. I have data, but how do I know whether it can solve my business problem?

The most common issue hampering data science projects is that the business problem isn’t clearly understood. Determining a project goal begins by asking a lot of questions, ones that are specific, relevant, and unambiguous.
When the right questions are asked, the data starts providing comprehensive perspectives and relevant predictions.
Different types of data enable you to address different business initiatives:
- Transactional data enables more granular, more detailed decisions (localization, seasonality, multi-dimensionality).
- Unstructured data enables more complete, more accurate decisions (new metrics, dimensions, and dimensional attributes).
- Data velocity enables more frequent, more timely decisions (hourly versus weekly; on-demand analytic model updates).
- Predictive analytics enables more actionable, predictive decisions (optimize, recommend, predict, score, forecast).
Q7. My data contains sensitive information. How can I use it safely?

Sensitive data is useful information that can only be used once it has been obfuscated. This includes:
- Personal data: names, identification numbers, location data from mobile phones or GPS, physical characteristics, economic characteristics, …
- Confidential data: financial information, passwords, national security information
- Business critical data: if compromised, could be harmful to the business (e.g. trade secrets)
- Ethical: sometimes there is no legal requirement, but ethically, data should be anonymised.
There are several security strategies to store sensitive data:
Anonymization: the irreversible destruction of identifiable data. Anonymized personal data can no longer identify an individual and is no longer considered personal data.
Pseudonymization: a method that substitutes identifiable data with a reversible, consistent value. Unlike anonymization, person-related data that could allow backtracking is not purged. Although pseudonymized data is still legally considered sensitive, pseudonymization is regarded as a secure approach.
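One common way to pseudonymize an identifier is keyed hashing: each value is replaced by an HMAC computed with a secret key stored separately, so the mapping is consistent (joins still work) but cannot be reversed without the key. A minimal sketch, where the key and field values are placeholders:

```python
# Pseudonymization via keyed hashing (HMAC-SHA256)
import hashlib
import hmac

SECRET_KEY = b'store-this-key-somewhere-secure'  # placeholder secret

def pseudonymize(value: str) -> str:
    """Replace an identifier with a consistent, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode('utf-8'),
                    hashlib.sha256).hexdigest()

email = 'alice@example.com'
token = pseudonymize(email)

# The same input always maps to the same token; different inputs do not
assert pseudonymize(email) == token
assert pseudonymize('bob@example.com') != token
print(token[:16])
```

A plain (unkeyed) hash would be weaker here: an attacker could hash candidate values and compare, which the secret key prevents.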
Encryption: the process of converting plaintext to ciphertext. Encryption takes readable data and alters it so that it appears random. A good encryption strategy uses reliable encryption and convenient key management.
```python
# Import libraries
from cryptography.fernet import Fernet

# Generate an encryption key
key = Fernet.generate_key()

# Save the encryption key
with open('key.key', 'wb') as f:  # wb = write bytes
    f.write(key)

# Open the file to encrypt
with open('sensitiveData.csv', 'rb') as f:
    data = f.read()

# Encrypt the data using the key
fernet = Fernet(key)
encrypted = fernet.encrypt(data)

# Write the encrypted file
with open('sensitiveData.csv.encrypted', 'wb') as f:
    f.write(encrypted)
```
You should balance the utility of the data against the level of risk:
Identify sensitive data with a high level of confidence. The common scenarios of sensitive data are:
- Sensitive data in columns: specific columns in structured datasets, such as a user’s first name, last name, or mailing address.
- Sensitive data in unstructured text-based datasets: can often be detected using known patterns.
- Sensitive data in free-form unstructured data: text reports, audio recordings, photographs, or scanned receipts.
- Sensitive data in a combination of fields.
- Sensitive data in unstructured content: embedded contextual information in unstructured content.
Create a data governance plan and best-practices documentation. This will help you make suitable decisions when sensitive data cannot be masked or removed. These are the common concepts to consider when establishing a governance policy framework:
- Establish a secure location for governance documentation.
- Omit encryption keys, hash functions, or other tools from your documentation.
- Document all sources of sensitive data, where they are stored, and the types of data present. Include the remediation steps taken to protect them.
- Document the locations where remediation steps are complicated, inconsistent, or impossible.
- Set up a process to continually scan for and identify new sources of sensitive data.
- Describe the roles and (possibly) the individual employees who have temporary or permanent access to sensitive data, and why they required that access.
- Determine where employees can access sensitive data; if, how, and where they can copy it; and any other constraints associated with access.
- Regularly review who can access sensitive data and whether that access is still required.
- Communicate, enforce, and regularly review your policies.
To secure sensitive data without significantly impacting the project, you can protect it by:
- Removing sensitive data: before building your machine learning model, delete user-specific information from your dataset if it is not necessary for your project. In some cases, though, this can significantly reduce the value of your dataset.
- Masking sensitive data: when removing sensitive information is not possible, you can still train effective models on data in a masked format. Masking techniques include:
  - Apply a substitution cipher, replacing all occurrences of a plain-text identifier with its hashed or encrypted value.
  - Tokenize, replacing the real value stored in each sensitive field with an unrelated dummy value. The mapping is encrypted or hashed in a separate, more secure database. This approach works only if the same token value is reused for identical values.
  - Use dimensionality-reduction techniques such as Principal Components Analysis (PCA) to blend various features and train your model only on the resulting PCA vectors.
- Coarsening sensitive data: lower the precision or granularity of the data to make sensitive values harder to identify within the dataset, while retaining benefits similar to training on the pre-coarsened data.
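The PCA-based masking idea can be sketched as follows (synthetic data; in practice the input columns would be your sensitive features): the model never sees raw feature values, only blended components.

```python
# Train on PCA-blended components instead of raw (sensitive) features
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=12, random_state=27)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)

# Fit PCA on the training data only, keeping 5 blended components
pca = PCA(n_components=5, random_state=27).fit(X_train)

model = LogisticRegression().fit(pca.transform(X_train), y_train)
print(model.score(pca.transform(X_test), y_test))
```

Note that PCA masking trades some information away by design; how many components to keep is a utility-versus-risk decision like the others above.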
Starting a data science project does not necessarily require gathering billions of samples. The amount of data needed depends heavily on the type of business problem and the technologies you are using. That being said, launching your data science journey with a small amount of data is possible, provided you constrain the problem and limit your choice of model.