5 Effective Ways to Handle Imbalanced Data in Machine Learning



Introduction

Here’s something that new machine learning practitioners figure out almost immediately: not all datasets are created equal.

It may seem obvious now, but had you considered this before undertaking a machine learning project on a real-world dataset? As an example of a single class vastly outnumbering the rest, consider a rare disease that only 1% of the population has. Would a predictive model that only ever predicts “no disease” still be thought of as beneficial, even though it is 99% accurate? Of course not.

In machine learning, imbalanced datasets can be stubborn obstacles to model performance. Many common machine learning algorithms implicitly assume that the classes in the data are similarly represented. Models trained on imbalanced datasets tend to be biased toward the majority class, under-serving the minority class, which is often the one that matters most when the prediction calls for action.

These heavily skewed datasets are found everywhere in the wild, from rare medical disorders, where positive cases are scarce and hard to come by, to fraud detection in finance, where the vast majority of payments are legitimate. The aim of this article is to introduce 5 reliable strategies for managing class-imbalanced data.

1. Resampling Techniques

Resampling can add samples from the minority class or remove samples from the majority class in an effort to balance the classes.

Oversampling techniques create new samples of the under-represented class. Random oversampling is the simplest method: it creates new samples for the less common class by duplicating existing ones. However, anyone familiar with the basics of machine learning will immediately note the risk of overfitting. More sophisticated approaches include the Synthetic Minority Over-sampling Technique (SMOTE), which constructs new samples by interpolating between existing minority-class samples.

Perhaps unsurprisingly, techniques for undersampling the more common class involve removing samples from it. Random under-sampling, for example, discards samples from the majority class at random. However, this kind of under-sampling can cause information loss. To mitigate this, more sophisticated undersampling methods like Tomek links or the Neighborhood Cleaning Rule (NCR) can be employed. These aim to remove majority samples that are close to or overlapping with minority samples, which creates a more distinct boundary between the classes and can reduce noise while preserving important information.

Let’s look at a very basic implementation example of both SMOTE and random under-sampling using the imbalanced-learn library.
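The snippet below is a minimal sketch on a synthetic dataset; the class ratio, feature count, and random seeds are illustrative assumptions rather than recommendations.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Build a toy binary dataset with a roughly 9:1 class imbalance
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
print("Original distribution:", Counter(y))

# Oversample the minority class by interpolating between its neighbors
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_smote))

# Alternatively, randomly discard majority-class samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After random under-sampling:", Counter(y_under))
```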

Each approach has its pros and cons. To underscore the point: oversampling can lead to overfitting, especially with simple duplication, while undersampling may discard potentially useful information. Often a combination of techniques yields the best results.

2. Algorithmic Ensemble Methods

Ensemble methods combine a number of models to produce a stronger overall model; this strategy can be useful for problems with imbalanced data, especially when the minority class is the one you are particularly interested in.

One form of ensemble learning is known as bagging (bootstrap aggregating). The concept behind bagging is to randomly create a series of subsets from your data, train a model on each of them, and then combine the predictions of those models. The random forest algorithm is a particular implementation of bagging often used for imbalanced data. Random forests build individual decision trees on random subsets of the data, effectively introducing multiple “copies” of the data in question, and combine their outputs in a way that helps prevent overfitting and improves the model’s overall generalization.

Boosting is another technique, where models are trained on the data sequentially, with each new model trying to correct the errors of those that came before it. Boosting can be a powerful tool for dealing with imbalanced classes. For example, gradient boosting can learn to be particularly sensitive to misclassifications of the minority class and adjust accordingly.

These techniques can all be implemented in Python using common libraries. Here is an example of random forest and gradient boosting in code.
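The following is a minimal sketch with scikit-learn; the synthetic dataset and hyperparameter values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Random forest with class weights scaled inversely to class frequency
rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))

# Gradient boosting: each stage focuses on the errors of the previous ones
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
print(classification_report(y_test, gb.predict(X_test)))
```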

In the above excerpt, n_estimators sets the number of trees (or boosting stages), while class_weight="balanced" in the RandomForestClassifier counteracts imbalance by weighting each class inversely to its frequency.

These methods handle imbalanced data inherently, by combining multiple models or by focusing on hard-to-classify instances. They often perform well without explicit resampling, though combining them with resampling techniques can yield even better results.

3. Adjust Class Weights

Class weighting is exactly what it sounds like: a technique where we assign higher weights to the minority class during model training in order to make the model pay more attention to the underrepresented class.

Some machine learning libraries, scikit-learn among them, implement class weight adjustment directly. When one class occurs far more frequently in a dataset than another, misclassifications of the minority class can be given an increased penalty.

In logistic regression, for example, class weighting can be set as follows.
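Here is a minimal sketch, assuming a binary problem with labels 0 and 1; the explicit weights shown are illustrative, not tuned values.

```python
from sklearn.linear_model import LogisticRegression

# Option 1: weight classes inversely proportional to their frequency
model = LogisticRegression(class_weight="balanced")

# Option 2: set explicit weights. Here, errors on the minority class
# (label 1) are penalized ten times more heavily than majority-class errors
model = LogisticRegression(class_weight={0: 1, 1: 10})
```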

By adjusting class weights, we change how heavily the model is penalized for misclassifying each class. But fret not: these weights do not change how the model goes about making each individual prediction, only how it updates its parameters during optimization. In other words, class weight adjustments scale the loss function used in training, so errors on the minority class contribute more to each update. One consideration is to ensure that minority classes are not overly discounted, as it is possible for a class to essentially be trained away.

4. Use Appropriate Evaluation Metrics

When dealing with imbalanced data, accuracy can be a misleading metric. A model that always predicts the majority class might have high accuracy but fail completely at identifying the minority class.

Instead, consider metrics like precision, recall, F1-score, and Area Under the Receiver Operating Characteristic curve (AUC-ROC). As a reminder:

  • Precision measures the proportion of positive identifications that were actually correct
  • Recall measures the proportion of actual positives that were identified correctly
  • The F1-score is the harmonic mean of precision and recall, providing a balanced measure

AUC-ROC is particularly useful for imbalanced data, as it is relatively insensitive to the class distribution. It measures the model’s ability to distinguish between classes across all threshold settings.

Confusion matrices are also invaluable. They provide a tabular summary of the model’s performance, showing true positives, false positives, true negatives, and false negatives.

Here’s how to calculate these metrics; it should serve as a reminder that many of our existing tools still come in handy in our special case of imbalanced classes.
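The sketch below uses small hand-made arrays for illustration; y_scores stands in for the predicted probabilities a fitted model would produce.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Ground truth, hard predictions, and predicted probabilities (illustrative)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
y_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_scores))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```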

Remember to choose metrics based on your specific problem. If false positives are costly, focus on precision. If missing any positive cases is problematic, prioritize recall. The F1-score and AUC-ROC provide good overall measures of performance.

5. Generate Synthetic Samples

Synthetic sample generation is an advanced technique to balance datasets by creating new, artificial samples of the minority class.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular algorithm for generating synthetic samples. It works by selecting a minority class sample and finding its k-nearest neighbors. New samples are then created by interpolating between the chosen sample and these neighbors.

Here’s a simple practical example of implementing SMOTE using the imbalanced-learn library.
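The sketch below applies SMOTE to a synthetic dataset; the class ratio and the k_neighbors value are illustrative assumptions.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A toy dataset with a severe (roughly 19:1) class imbalance
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("Before SMOTE:", Counter(y))

# k_neighbors sets how many nearby minority samples are considered
# when interpolating each new synthetic point
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))
```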

Advanced variants like ADASYN (Adaptive Synthetic) and BorderlineSMOTE focus on generating samples in areas where the minority class is most likely to be misclassified.
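Both variants are available in imbalanced-learn with the same fit_resample interface; in this sketch the synthetic dataset and default parameters are illustrative.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, BorderlineSMOTE

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# ADASYN generates more synthetic points where the minority class
# is hardest to learn
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X, y)

# BorderlineSMOTE restricts synthesis to minority samples that sit
# near the class boundary
X_bl, y_bl = BorderlineSMOTE(random_state=42).fit_resample(X, y)
```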

While effective, synthetic sample generation is not without risk: it can introduce noise or create unrealistic samples if not used carefully. It’s important to validate that the synthetic samples make sense in the context of your problem domain.

Summary

Handling imbalanced data is a crucial step in many machine learning workflows. In this article, we have taken a look at five different ways of going about this: resampling methods, ensemble strategies, class weighting, correct evaluation measures, and generating artificial samples.

Remember that, as with all things machine learning, there is no universal solution to the problem of imbalanced data. Aside from testing a variety of approaches on your project, it can also be worthwhile to combine several of these methods and to try different configurations. The optimal methodology will be specific to the dataset at hand, the business problem, and the evaluation metrics that matter for it.

Developing the tools to deal with imbalanced datasets in your machine learning projects is but one more way in which you will be ready to create maximally effective machine learning models.


