Machine Learning on Imbalanced Datasets

SBI – Department of Systems Biology and Bioinformatics
Faculty of Computer Science and Electrical Engineering
University of Rostock
Ulmenstrasse 69 | 18057 Rostock
Germany
+49 381 498-7571
olaf.wolkenhauer@uni-rostock.de

In real-world scenarios, datasets are often imbalanced. That is, a dataset meant for supervised learning is divided into classes, and some classes contain far more instances than others. Training machine learning algorithms on such data is challenging. We have developed an algorithm that overcomes limitations of widely used oversampling methods.
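To make the difficulty concrete, the following minimal sketch scores a trivial classifier that always predicts the majority class. The data (990 majority and 10 minority samples) and the helper name `evaluate` are hypothetical and chosen only for illustration. Plain accuracy looks excellent, while balanced accuracy (the mean of the per-class recalls) and the minority-class F1-score reveal that the minority class is never detected.

```python
def evaluate(y_true, y_pred, minority=1):
    """Return accuracy, balanced accuracy and minority-class F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == minority and p == minority for t, p in pairs)
    fn = sum(t == minority and p != minority for t, p in pairs)
    fp = sum(t != minority and p == minority for t, p in pairs)
    tn = sum(t != minority and p != minority for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    recall_minority = tp / (tp + fn) if tp + fn else 0.0
    recall_majority = tn / (tn + fp) if tn + fp else 0.0
    # Balanced accuracy: mean of the recalls of the two classes.
    balanced_accuracy = (recall_minority + recall_majority) / 2
    # F1 = 2TP / (2TP + FP + FN); taken as 0 when there are no true positives.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, balanced_accuracy, f1


# Hypothetical data: 990 majority samples (label 0), 10 minority samples (label 1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a classifier that always predicts the majority class

accuracy, balanced_accuracy, f1 = evaluate(y_true, y_pred)
print(accuracy, balanced_accuracy, f1)  # 0.99 0.5 0.0
```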

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications of the majority class and affecting the overall balance of the model. We present an approach that overcomes this limitation of SMOTE and its extensions, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm on 12 publicly available datasets, some with very high imbalance and some with a large number of features, and can show an improved approximation of the data manifold for a given class in those datasets. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique, since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions and observed that, for imbalanced datasets, LoRAS on average generates better predictive Machine Learning (ML) models in terms of F1-score and Balanced Accuracy. A visual summary of our algorithm is available via the linked PDF.
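The idea behind the name can be sketched in a few lines of Python. This is an illustrative reading of "localized random affine shadowsampling", not the reference implementation: the function name `loras_sketch`, all parameter names and defaults, the Gaussian noise model, and the way the random weights are drawn are assumptions made for this example. For each minority point, the sketch builds a local neighbourhood, perturbs the neighbourhood points with small noise to obtain "shadowsamples", and then forms synthetic points as random combinations of shadowsamples with positive weights that sum to one. The publications listed below give the actual algorithm and its analysis.

```python
import math
import random


def loras_sketch(minority, k=3, n_shadow=10, sigma=0.05, n_affine=2,
                 n_new=4, seed=0):
    """Generate synthetic minority points (illustrative sketch only).

    minority : list of feature vectors (lists of floats) of the minority class
    k        : number of nearest minority neighbours per neighbourhood
    n_shadow : noisy copies ("shadowsamples") drawn per neighbourhood point
    sigma    : standard deviation of the Gaussian noise (assumed noise model)
    n_affine : number of shadowsamples combined into one synthetic point
    n_new    : synthetic points generated per neighbourhood
    """
    rng = random.Random(seed)
    synthetic = []
    for p in minority:
        # Localized: p together with its k nearest minority neighbours.
        hood = sorted(minority, key=lambda q: math.dist(p, q))[:k + 1]
        # Shadowsamples: each neighbourhood point plus small Gaussian noise.
        shadows = [[x + rng.gauss(0.0, sigma) for x in q]
                   for q in hood for _ in range(n_shadow)]
        for _ in range(n_new):
            chosen = rng.sample(shadows, n_affine)
            # Random positive weights, normalised so that they sum to 1.
            raw = [rng.random() + 1e-12 for _ in chosen]
            weights = [w / sum(raw) for w in raw]
            # Weighted combination of the chosen shadowsamples, per feature.
            point = [sum(w * s[d] for w, s in zip(weights, chosen))
                     for d in range(len(p))]
            synthetic.append(point)
    return synthetic


# Hypothetical two-dimensional minority class with five points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_points = loras_sketch(minority)
print(len(new_points))  # 20 (five neighbourhoods, four synthetic points each)
```

Because every synthetic point is a weighted average of noisy copies of nearby minority points, it stays close to the local neighbourhood rather than lying on a line between two individual samples, which is the intuition behind sampling from an approximated manifold.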

Related publications

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Bej S, Galow AM, David R, Wolfien M, Wolkenhauer O

BMC Bioinformatics 2021

DOI: https://doi.org/10.1186/s12859-021-04469-x

URL: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04469-x

A multi-schematic classifier-independent oversampling approach for imbalanced datasets

Bej S, Schultz K, Srivastava P, Wolfien M, Wolkenhauer O

IEEE Access 2021

DOI: 10.1109/ACCESS.2021.3108450

URL: https://doi.org/10.1109/ACCESS.2021.3108450

LoRAS: An oversampling approach for imbalanced datasets

Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O

Machine Learning 2020

DOI: https://doi.org/10.1007/s10994-020-05913-4

URL: https://link.springer.com/article/10.1007/s10994-020-05913-4

Combining uniform manifold approximation with localized affine shadowsampling improves classification of imbalanced datasets

Bej S, Srivastava P, Wolfien M, Wolkenhauer O

2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1-8

DOI: https://doi.org/10.1109/IJCNN52387.2021.9534072

Improved imbalanced classification through convex space learning

Saptarshi Bej

Imbalanced datasets for classification problems, characterised by an unequal distribution of samples across classes, are abundant in practical scenarios. Oversampling algorithms generate synthetic data to improve classification performance on such datasets. In this thesis, I discuss two algorithms, LoRAS and ProWRAS, which improve on the state of the art, as shown through rigorous benchmarking on publicly available datasets. A biological application, the detection of rare cell types from single-cell transcriptomics data, is also discussed. The thesis also provides a better theoretical understanding of oversampling.

Defense: 16 Dec. 2021

DOI: https://doi.org/10.18453/rosdok_id00003503