SBI – Department of Systems Biology and Bioinformatics
Faculty of Computer Science and Electrical Engineering
University of Rostock
Ulmenstrasse 69 | 18057 Rostock
Germany
+49 381 498-7571
olaf.wolkenhauer@uni-rostock.de

Dr. Saptarshi Bej

Stays hungry, stays foolish, seeks to learn!

PhD Students

Research interest

Currently, I am pursuing several research problems in SBI Rostock.

1) Machine learning for finding a needle in a haystack and its relevance in the Systems Medicine context: The promise of personalized medicine is that diagnosis, prognosis and therapeutic decisions are more specific to the individual patient. One example of more personalized diagnostics is combining conventional routine data with multiple omics data. Increasing the types of data or the number of features inherently increases the number of subgroups that represent patient subpopulations relevant to clinical decision-making. From a machine learning perspective, the group we target for characterization and classification will then be much smaller than the rest of the population. If an algorithm sees numerous examples of the “regular” or “usual” case but only a few examples of the case we aim to classify or predict, the dataset is referred to as an “imbalanced dataset”.

In real-world scenarios, datasets are often imbalanced. That is, a dataset meant for supervised learning divides into classes, some of which contain far more instances than the others. Training machine learning algorithms on such data is challenging. We have developed several algorithms that overcome problems of widely used algorithms. We are looking for biological and clinical applications of our methods related to personalized treatment. Furthermore, we have already developed an application of these algorithms to single-cell technology.

Synthetic oversampling based on the SMOTE algorithm has been an important cornerstone in improving imbalanced learning. We addressed the limitations of SMOTE-based oversampling algorithms through the novel idea of convex space learning. In an analytical explanation behind the idea, we show that SMOTE-based oversampling algorithms generate synthetic samples with high variance in a minority class data neighborhood. We developed the LoRAS algorithm that can model the convex space of the minority class using multiple convex combinations of shadowsamples in a minority class neighborhood.
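
The idea of combining shadowsamples convexly can be illustrated with a minimal sketch. This is not the published LoRAS implementation: the function names, the parameter names (k, n_shadow, n_affine, sigma, n_per_point) and their default values are assumptions chosen for illustration, and the convex weights are drawn by normalizing exponential random numbers.

```python
import math
import random


def nearest_neighbours(point, points, k):
    """Return the k points closest to `point` (Euclidean), including itself."""
    return sorted(points, key=lambda q: math.dist(point, q))[:k]


def random_convex_weights(n, rng):
    """Draw n non-negative weights that sum to 1."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def loras_like_oversample(minority, k=3, n_shadow=10, n_affine=4,
                          sigma=0.005, n_per_point=2, seed=0):
    """Generate synthetic minority samples (illustrative sketch only)."""
    rng = random.Random(seed)
    synthetic = []
    for p in minority:
        neighbourhood = nearest_neighbours(p, minority, k)
        # Shadowsamples: noisy copies of every point in the neighbourhood.
        shadows = [
            [x + rng.gauss(0.0, sigma) for x in q]
            for q in neighbourhood
            for _ in range(n_shadow)
        ]
        for _ in range(n_per_point):
            # Each synthetic sample is a convex combination of a few
            # randomly chosen shadowsamples.
            chosen = rng.sample(shadows, n_affine)
            weights = random_convex_weights(n_affine, rng)
            synthetic.append([
                sum(w * s[d] for w, s in zip(weights, chosen))
                for d in range(len(p))
            ])
    return synthetic


# Hypothetical toy minority class in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
samples = loras_like_oversample(minority)
```

Because every synthetic sample is an average of several shadowsamples, it stays inside the convex hull of the noisy neighbourhood, which is the intuition behind the lower variance compared to interpolating between only two points.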

Moreover, to address the classifier dependence of SMOTE-based oversampling algorithms, we proposed the ProWRAS algorithm, an improvement over the previously proposed LoRAS algorithm. By controlling the variance of the synthetic samples and using a proximity-weighted clustering system for the minority class data, ProWRAS improves performance compared to algorithms that generate synthetic samples by modelling high-dimensional convex spaces of the minority class. Most importantly, the performance of ProWRAS, with a proper choice of oversampling scheme, is independent of the classifier used. We demonstrate through rigorous benchmarking studies that ProWRAS, with a proper choice of parameters, can adapt to classifier-specific oversampling schemes and thereby perform in a classifier-independent way. ProWRAS has been benchmarked against leading oversampling algorithms on multiple datasets, demonstrating superior performance over the state of the art.

2) Effective patient stratification from epidemiological data: One focus of our research is the stratification of T2DM populations from epidemiological data. We analyze the National Family Health Survey-4 (NFHS-4) dataset from India, which contains a wide spectrum of features, including medical history, dietary and addiction habits, and socio-economic and lifestyle patterns of 10,125 T2DM patients.

Usually, manifold learning algorithms such as t-SNE or UMAP are used to reduce data to lower dimensions for visualization and thereby find clusters in the data. However, we have noticed a challenge that arises from the diverse feature types typically present in clinical and epidemiological data. We found that even when a dataset contains only a small number of continuous features, they have an overpowering effect when UMAP is used for dimension reduction. We provided a solution in the form of a feature-type-distributed clustering framework that uses different distance measures for different data types. However, the workflow was specific to the NFHS-4 dataset, and we could not yet conduct enough research to generalize it to tabular clinical datasets with diverse feature types.

From a methodological perspective, we show that for the diverse data types frequent in epidemiological datasets, feature-type-distributed clustering using UMAP is more effective than the conventional use of the UMAP algorithm. The application of a UMAP-based clustering workflow to this type of dataset is novel in itself. Our clustering paradigm applies UMAP separately to continuous, nominal and ordinal features. For each of these feature categories, we create a lower-dimensional embedding of the dataset. Finally, we integrate the lower-dimensional embeddings and extract clusters from them using DBSCAN, a density-based clustering algorithm. Our findings demonstrate heterogeneity among Indian T2DM patients with regard to sociodemographic and dietary patterns. From our analysis, we conclude that the existence of significant non-obese T2DM subpopulations, characterized by a younger age group and economic disadvantage, raises the need for different screening criteria for T2DM among rural Indian residents.
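
The first step of such a workflow, treating each feature type with its own distance measure, can be sketched as follows. This is only an illustration under stated assumptions: the column layout, the toy records and the choice of metrics (Euclidean for continuous, Hamming for nominal, Manhattan for ordinal features) are hypothetical and are not taken from the NFHS-4 analysis. The UMAP embedding and DBSCAN clustering steps are omitted.

```python
import math

# Hypothetical column layout: (age, BMI) continuous, (diet, residence)
# nominal, (education level, wealth quintile) ordinal.
FEATURE_TYPES = {
    "continuous": [0, 1],
    "nominal": [2, 3],
    "ordinal": [4, 5],
}


def euclidean(a, b):
    """Straight-line distance for continuous features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Fraction of nominal features on which two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def manhattan(a, b):
    """Sum of absolute rank differences for ordinal features."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Assumed metric per feature type (an illustrative choice).
METRICS = {"continuous": euclidean, "nominal": hamming, "ordinal": manhattan}


def split_by_type(record):
    """Split one record into its continuous, nominal and ordinal parts."""
    return {t: [record[i] for i in cols] for t, cols in FEATURE_TYPES.items()}


def per_type_distances(records):
    """Return one pairwise distance matrix per feature type."""
    parts = [split_by_type(r) for r in records]
    n = len(records)
    return {
        t: [[metric(parts[i][t], parts[j][t]) for j in range(n)]
            for i in range(n)]
        for t, metric in METRICS.items()
    }


# Hypothetical toy records, not real survey data.
records = [
    [52.0, 23.1, "veg", "rural", 1, 2],
    [47.0, 31.4, "veg", "urban", 3, 4],
    [35.0, 21.0, "non-veg", "rural", 1, 1],
]
distances = per_type_distances(records)
```

Keeping the three matrices separate means that a handful of continuous columns cannot dominate the nominal and ordinal information. In the workflow described above, each feature group would then receive its own lower-dimensional embedding before the embeddings are integrated and clustered.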

3) Relationship extraction from biomedical texts: Natural Language Processing (NLP) has contributed to extracting relationships among biological entities, including genes, their mutations, proteins, diseases, processes, phenotypes, and drugs, for a comprehensive and concise understanding of information in the literature. Self-attention-based models for Relationship Extraction (RE) have played an increasingly important role in NLP. However, RE with self-attention models is usually framed as a classification problem, which limits the practical usability of these models.

We have developed a novel approach for RE, referred to as Attention Retrieval Model (ARM), that can resolve the aforementioned limitations of the regular classification approach for RE. ARM learns the linguistic context between two related entities or between an interaction word and a related entity in a text from training data, rather than attempting to classify the text based on predefined annotations.

Our experiments show that ARM provides a flexible framework for modelers to customize their models, with the opportunity to integrate expert knowledge on interaction keywords. ARM can learn from integrated data with diverse entity types and contextual nuances of the language, which facilitates data integration across datasets. Furthermore, unlike its classification-based counterpart, ARM can extract relationships that are unannotated in the training data, analogous to zero-shot learning. ARM provides a unique self-attention-based deep learning framework for RE that can capture directed entity relationships.

4) Graph and Network theory and analysis: Graph theory is one of my passions. I have loved learning about the subject since my Master's degree. I am especially fascinated by Barnette's conjecture (unsolved since 1969). I also like to work on network analysis strategies for protein interaction networks.

Projects

Research Projects

Preserving Logical and Functional Dependencies in Synthetic Clinical Datasets

DFG Project Number: 576429337

This project develops dependency-aware methods for generating synthetic clinical tabular data, focusing on preserving both logical and functional relationships among attributes while maintaining data utility and fidelity.

Convex space learning for synthetic data generation in clinical research

This project develops NextConvGeN, a deep learning framework for generating realistic and privacy-preserving synthetic clinical data. By extending the principle of convex space learning beyond imbalanced datasets, NextConvGeN enables the creation of representative tabular data that can improve machine learning applications in clinical decision support and patient stratification.

Machine Learning on Imbalanced datasets

iRhythmics: Programming pacemaker cells for in vitro drug testing

The project addresses the generation and establishment of programmed pacemaker cells for in vitro drug testing, making predictive tests possible. This may lead to improved treatment of cardiac arrhythmias or an accurate identification of potential drug molecules at an early stage of development. Important benefits will arise in verifying the safety of a wide variety of medicines while reducing animal testing.

The TOTO Project: Towards a Theory of Tissue Organisation

~ In biology, the exception is the rule. ~

~ With our work, we are not really interested in the unique, but in what is general in the unique. ~

With this project, we want to address a biological and a methodological challenge. First, we wish to clarify how the functioning of cells, and the functioning of a tissue relate to each other. Do cells exercise a degree of autonomy, or is their behavior completely determined by the functioning of the tissue? Such questions are important in understanding the emergence and progression of diseases. For example, it remains unclear whether the causative origin of colon cancer is a cell, or a consequence of tissue organization.

GB-XMap: Assessing the risk of gut-brain cross-diseases

Investigating the gut-brain-axis

The gut–brain axis (GBA) provides bidirectional homeostatic communication between the gastrointestinal tract and the central nervous system. The interdisciplinary collaboration will explore a first comprehensive GBA cross-disease map of genetic, expression and regulatory changes associated with ulcerative colitis and schizophrenia.

2018-present	Research Assistant and PhD student, SBI, Universität Rostock, Rostock
2016-2017	Research Assistant, Universität Paderborn
2009-2014	Integrated BS-MS degree (major in Mathematics and specialization in Graph Theory), Indian Institute of Science Education and Research, Kolkata

Selected publications

LoRAS: An oversampling approach for imbalanced datasets

Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O

Machine Learning 2020

DOI: https://doi.org/10.1007/s10994-020-05913-4

URL: https://link.springer.com/article/10.1007/s10994-020-05913-4

Convex space learning for tabular synthetic data generation

Manjunath Mahendra, Chaithra Umesh, Kristian Schultz, Olaf Wolkenhauer, Saptarshi Bej

Neurocomputing, Volume 659, 2026, 131722, ISSN 0925-2312

DOI: https://doi.org/10.1016/j.neucom.2025.131722

URL: https://www.sciencedirect.com/science/article/abs/pii/S092523122502394X?via%3Dihub

Identification of key factors for malnutrition diagnosis in chronic gastrointestinal diseases using machine learning underscores the importance of GLIM criteria as well as additional parameters

Karen Rischmüller, Vanessa Caton, Markus Wolfien, Luise Ehlers, Matti van Welzen, David Leon Brauer, ..., Robert Jaster, Olaf Wolkenhauer, Georg Lamprecht, Saptarshi Bej

DOI: https://doi.org/10.3389/fnut.2024.1479501

URL: https://www.frontiersin.org/journals/nutrition/articles/10.3389/fnut.2024.1479501/

Multivariate functional linear discriminant analysis for partially-observed time series

Bordoloi R, Réda C, Trautmann O, Bej S, Wolkenhauer O

Machine Learning

DOI: https://doi.org/10.1007/s10994-025-06741-0

URL: https://link.springer.com/article/10.1007/s10994-025-06741-0

Accounting for diverse feature-types improves patient stratification on tabular clinical datasets

Bej S, Umesh C, Mahendra M, Schultz K, Sarkar J, Wolkenhauer O

Machine Learning with Applications

(Volume 14, 2023)

DOI: https://doi.org/10.1016/j.mlwa.2023.100490

Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care

Hahn W, Schütte K, Schultz K, Wolkenhauer O, Sedlmayr M, Schuler U, Eichler M, Bej S, Wolfien M

JPM (2022)

DOI: https://doi.org/10.3390/jpm12081278

URL: https://www.mdpi.com/2075-4426/12/8/1278

Identification and epidemiological characterization of Type-2 Diabetes sub-population using an unsupervised machine learning approach

Bej S, Sarkar J, Biswas S, Mitra P, Chakrabarti P, Wolkenhauer O

Nutrition and Diabetes (Springer Nature) (2022)

DOI: https://doi.org/10.1038/s41387-022-00206-2

URL: https://www.nature.com/articles/s41387-022-00206-2

Attention retrieval model for entity relation extraction from biological literature

Srivastava P, Bej S, Schultz K, Yordanova K, Wolkenhauer O

IEEE Access (2022)

DOI: https://doi.org/10.1109/ACCESS.2022.3154820

URL: https://ieeexplore.ieee.org/document/9721887

Cross-tissue transcriptome-wide association studies identify susceptibility genes shared between schizophrenia and inflammatory bowel disease

Uellendahl-Werth F, Maj C, Borisov O, Wacker EM, Bej S, Wolkenhauer O, Degenhardt F, Ellinghaus D et al.

Communications Biology (2022)

DOI: https://doi.org/10.1038/s42003-022-03031-6

URL: https://www.nature.com/articles/s42003-022-03031-6

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Bej S, Galow AM, David R, Wolfien M, Wolkenhauer O

BMC Bioinformatics 2021

DOI: https://doi.org/10.1186/s12859-021-04469-x

URL: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04469-x

Self-attention based models for the extraction of molecular interactions from biological texts

Srivastava P, Bej S, Yordanova K, Wolkenhauer O

Biomolecules 2021

DOI: https://doi.org/10.3390/biom11111591

URL: https://www.mdpi.com/2218-273X/11/11/1591

Comprehensive Characterization of Multitissue Expression Landscape, Co-Expression Networks and Positive Selection in Pikeperch

Nguinkal JA, Verleih M, de los Ríos-Pérez L, Brunner RM, Sahm A, Bej S, Rebl A, Goldammer T

Cells 2021

DOI: https://doi.org/10.3390/cells10092289

URL: https://www.mdpi.com/2073-4409/10/9/2289

A multi-schematic classifier-independent oversampling approach for imbalanced datasets

Bej S, Schultz K, Srivastava P, Wolfien M, Wolkenhauer O

IEEE Access 2021

DOI: 10.1109/ACCESS.2021.3108450

URL: https://doi.org/10.1109/ACCESS.2021.3108450

Hamiltonian cycles in annular decomposable Barnette graphs

Bej S

JDMSC 2020. Full text in arXiv

DOI: https://doi.org/10.1080/09720529.2021.1961893

Protein-coding variants contribute to the risk of atopic dermatitis and skin-specific gene expression

Mucha S, ... Bej S, ..., Wolfien M, ..., Wolkenhauer O, ..., Ellinghaus D

The Journal of Allergy and Clinical Immunology 2019

DOI: https://doi.org/10.1016/j.jaci.2019.10.030

URL: https://www.sciencedirect.com/science/article/abs/pii/S0091674919314800

On extension of regular graphs

Banerjee A, Bej S

DOI: https://doi.org/10.1080/09720529.2015.1085740

Coloring sums of extensions of certain graphs

Kok J, Bej S

DOI: 10.13069/jacodesmath.349383

Factors of edge-chromatic critical graphs: a brief survey and some equivalences

Bej S, Steffen E

URL: http://dimie.unibas.it/site/home/in-evidenza/articolo3004582.html

Combining uniform manifold approximation with localized affine shadowsampling improves classification of imbalanced datasets

Bej S, Srivastava P, Wolfien M, Wolkenhauer O

2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1-8,

DOI: 10.1109/IJCNN52387.2021.9534072

Improved imbalanced classification through convex space learning

Saptarshi Bej

Imbalanced datasets for classification problems, characterised by an unequal distribution of samples across classes, are abundant in practical scenarios. Oversampling algorithms generate synthetic data to improve classification performance on such datasets. In this thesis, I discuss two algorithms, LoRAS and ProWRAS, that improve on the state of the art, as shown through rigorous benchmarking on publicly available datasets. A biological application for the detection of rare cell types from single-cell transcriptomics data is also discussed. The thesis also provides a better theoretical understanding of oversampling.

Defense: 16 Dec. 2021

DOI: https://doi.org/10.18453/rosdok_id00003503

Skills

Graph and Network Theory
Boolean modelling
Python
Machine learning
Deep Learning
RNA seq data analysis

Awards and Distinctions

DAAD Prize 2020 for outstanding achievements of international students (University of Rostock)

Teaching Experience

Tutor in the 'Biosystems modelling and simulation' course offered at the University of Rostock from 2019 to 2020. My teaching covers an introduction to machine learning and deep learning and their applicability in biomedical fields.
Tutor in the 'Data Science with Python' undergraduate seminar course offered at the University of Rostock since 2020.
