Who Am I?
A data scientist with a background in biotechnology and hands-on experience building fraud detection pipelines at a Fortune 500 healthcare company. I work across the full data stack, from writing Pandas on AWS SageMaker to designing anomaly triggers and outlier checks that surface suspicious billing patterns at scale.
My research side runs in parallel: 12+ publications, 2 best paper and presentation awards across IEEE and Springer conferences, and independent bioinformatics work on immune checkpoint gene discovery using PCA and network analysis in R. I move between research and engineering without losing the thread of either, and I am always more interested in what I have not figured out yet than in what I already know.
What Do I Specialise In?
Machine Learning & AI: Anomaly Detection, Fraud Detection Pipelines, Supervised & Unsupervised Learning, Feature Engineering, Model Evaluation, Deep Learning (CNNs, Sequence Models), NLP & Transformers (Foundations)
Data Science & Analytics: Exploratory Data Analysis (EDA), Statistical Analysis, Hypothesis Testing, Data Cleaning & Preprocessing, Dimensionality Reduction (PCA), Clustering (Mclust, K-Means), KPI Tracking, Data Visualization & Reporting
Programming & Core Tools: Python (Pandas, NumPy, Scikit-learn), SQL (Athena, BigQuery, PostgreSQL, MySQL), R (ggplot2, mclust), Jupyter Notebooks, Microsoft Excel
Cloud & ML Systems: AWS (S3, Athena, Glue, Redshift Serverless, Lambda, SageMaker), ETL Pipelines, End-to-End ML Pipelines, Model Training & Deployment Workflows, Experimentation
Bioinformatics: Gene Expression Analysis, PCA-based Analysis, Mclust Clustering, Pathway Enrichment (Enrichr, KEGG), Network Analysis (STRING, Cytoscape)
Tools & Workflow: Tableau, Power BI, Looker Studio, Domo, Git, GitHub, Docker (Basics)
What Have I Learned and Contributed?
- Built 12+ anomaly detection triggers across healthcare claims and provider data; designed a composite risk scoring framework using log-scaled transformations and percentile-based normalization (CUME_DIST), improving detection coverage by 20% and contributing to an estimated $400K–$700K in potential savings
- Applied BIRCH clustering and LOF to segment providers; engineered 15+ statistical features that identified 18% of providers as high-risk and improved anomaly lift by 11%
- Collaborated with fraud investigators and business stakeholders to translate model outputs into prioritized investigation leads
- Engineered 40+ features from 10M+ claim records; improved model precision by 13% and deployed scoring pipelines on AWS (S3, Athena, SageMaker), flagging 17% of claims as high-risk
- Built an unsupervised ML pipeline on a ~49,000-feature gene expression dataset (GEO: GSE57329), applying PCA + GMM (mclust, G=9) and achieving a silhouette score of 0.81
- Reduced feature space by 98% (49,000 to ~1,000 genes) using sequential statistical filtering
- Identified clusters enriched for immune-related pathways; highlighted hub genes (CD4, CXCL10, FMO3) via PPI analysis and pathway enrichment (KEGG, Reactome)
- Constructed a Fully Residual CNN for brain tumor segmentation; integrated MDRNNs to enhance NLP model performance for malware classification and family prediction
- Developed and evaluated ML models for classification, clustering, and anomaly detection; built 3+ ETL pipelines
- Conducted statistical analysis across RL, Distributed ML, and NLP research tracks supporting 5+ projects
- Contributed to peer-reviewed papers for Computers in Biology and Medicine, Springer, and a patent filing
- Contributed to NLP pipelines processing 100K+ customer records using TF-IDF/Word2Vec
- Supported BERT-based text classification achieving 85% accuracy on internal benchmarks
- Applied PCA and t-SNE for dimensionality reduction; developed sentiment and topic trend visualizations
- Synthesized EPS-based nanocomposites (AgNO₃ + SDS) from Bacillus amyloliquefaciens to inhibit oral biofilm
- Validated antimicrobial efficacy via SEM imaging and UV–Vis spectrophotometry; demonstrated 65% reduction in bacterial adhesion
- Contributed findings to a published book chapter on nanomaterial-based biofilm interventions
- Assisted in building a deep learning pipeline for forensic speech signal segmentation across 3 benchmark datasets (~2,900 audio samples)
- Contributed to experimental documentation supporting a paper submission on AI-based speech segmentation
- MTH101 Mathematics for Computer Science
- CSE101 Introduction to Computer Science & Programming
- CSE201 Data Structures and Algorithms
- CSE202 Database Systems
- CSE301 Principles of Computer System Design & Architecture
- CSE302 Capstone Project
- MA19153 Applied Calculus
- MA19251 Differential Equations & Vector Calculus
- MA19353 Transforms and Numerical Methods
- MA19453 Probability and Statistics
- GE19211 Problem Solving & Programming in Python
- CS19411 Python Programming for Machine Learning
- BT19702 Bioinformatics
- BT19201 Biochemistry
- BT19301 Microbiology
- BT19502 Molecular Biology
- BT19303 Cell Biology
- BT19602 Genetic Engineering
- BT19504 Immunology
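
The composite risk scoring framework mentioned earlier in this section combines log-scaled transformations with percentile-based normalization (CUME_DIST). A minimal Python sketch of that general idea is shown here. The column names, the equal blend of the two components, and the helper `composite_risk_score` are hypothetical illustrations, not the production framework.

```python
import numpy as np
import pandas as pd


def composite_risk_score(df, feature_cols):
    """Blend a log-scaled magnitude with a CUME_DIST-style percentile per feature.

    Hypothetical helper: the 50/50 blend and equal feature weights are assumptions.
    """
    score = pd.Series(0.0, index=df.index)
    for col in feature_cols:
        # Log scaling dampens heavy tails (e.g. a few very large billed amounts).
        logged = np.log1p(df[col].clip(lower=0))
        span = logged.max() - logged.min()
        scaled = (logged - logged.min()) / span if span > 0 else logged * 0.0
        # rank(method="max", pct=True) mirrors SQL CUME_DIST():
        # the share of rows whose value is <= the current row's value.
        percentile = df[col].rank(method="max", pct=True)
        score += 0.5 * scaled + 0.5 * percentile
    return score / len(feature_cols)


# Made-up example rows, for illustration only.
claims = pd.DataFrame({
    "billed_amount": [120.0, 95.0, 15000.0, 300.0],
    "claims_per_month": [4, 3, 60, 4],
})
claims["risk_score"] = composite_risk_score(
    claims, ["billed_amount", "claims_per_month"]
)
print(claims.sort_values("risk_score", ascending=False))
```

Because the log transform is monotone, it does not change the percentile ranks on its own, so this sketch keeps it as a separate magnitude component. The resulting score lies between 0 and 1, with higher values marking rows that are extreme on more features.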
What Have I Built?
- 99.93% accuracy (Random Forest)
- 97.94% accuracy (XGBoost)
- Trained and tested on 43,400 patient records
- 49,000+ genes processed
- 9-cluster VVV model (BIC/ICL validated)
- 3 hub genes identified (Cd4, Cxcl10, Fmo3)
- 1M+ claims processed
- ~3% of claims flagged by fraud rules
- Top 1% (~128K claims) priority-queued
- 5M+ log records processed
- ~90% detection accuracy
- 2 Glue ETL jobs (parse + aggregate)
- 10K+ research papers
- Literature retrieval efficiency improved by 40%
- Multi-language preprocessing tested
- 900 multi-channel ad records
- 5 marketing channels modelled
- Shift 10–15% of budget away from low-ROI channels
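
The 9-cluster VVV model listed above was fitted with mclust in R and validated with BIC/ICL. As a rough Python analogue, and assuming scikit-learn is available, the sketch below runs PCA followed by a full-covariance Gaussian mixture (the closest counterpart to mclust's VVV) and picks the number of clusters by BIC. The data is synthetic, generated with `make_blobs`, and is not GSE57329. The component counts and candidate range are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for an expression matrix (samples x features); not real data.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Reduce dimensionality first, as in a PCA + GMM workflow.
X_pca = PCA(n_components=5, random_state=0).fit_transform(X)

# Fit full-covariance mixtures over a range of cluster counts and record BIC.
bic_by_k = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_pca)
    bic_by_k[k] = gmm.bic(X_pca)

# scikit-learn's BIC is "lower is better" (mclust reports it with the opposite sign).
best_k = min(bic_by_k, key=bic_by_k.get)
best_model = GaussianMixture(
    n_components=best_k, covariance_type="full", random_state=0
).fit(X_pca)
labels = best_model.predict(X_pca)
print(best_k, bic_by_k[best_k])
```

ICL is not built into scikit-learn, so this sketch covers only the BIC half of the validation.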



