Babita Singh, PhD

Senior Researcher, Scientific Advisor, Free(lance) Thinker, Product Lead

PhD in Biomedicine / Bioinformatics (2017) - Barcelona, Spain
AI biosafety, biological data, genomics of diseases, personalised medicine, and ML/AI progress in the healthcare sector.
This website summarises the scientific projects I have undertaken since 2013.

Who Signed This Gene?
(AIxBIO : APART Research)

What if AI-designed proteins or gene sequences show up in food, therapeutic pipelines and clinical practice, and nobody can tell whether they are the product of a billion years of evolution or a few hours on a MacBook?
This is no longer hypothetical paranoia. LLM-based protein design models are already producing functional sequences that have no natural precedent. Yet no database, no submission form, and no regulatory checkpoint requires a sequence to declare whether it was shaped by evolution or generated by a model. Currently, "provenance" for biological sequences does not exist as a standard.
ArtGene-Archive (https://artgene-archive.org/) attempts to address this major concern in the AI biosafety domain. It is an archive that not only deposits Art(ificial) bio-sequences but also rigorously screens them through a three-stage biosafety pipeline before they can be registered.
ArtGene-Archive also embeds a bespoke cryptographic watermark inside the sequence itself using codon-choice steganography (read more on the GitHub repository linked below). The watermark survives re-synthesis and mutation studies. The certificate it produces is tamper-evident and chained, so the paper trail cannot be quietly rewritten later.
The goal is not to block AI-designed biology. It is to make sure that when an AI-designed sequence enters medicine, food science, or a research pipeline, it is traceable and leads us to its creator.
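
To make the idea of codon-choice steganography concrete, here is a minimal sketch. It is not the ArtGene-Archive scheme: the codon table subset, the function names and the use of an HMAC as the payload are all assumptions chosen for illustration. The principle is that several codons encode the same amino acid, so the choice between two synonymous codons can carry one hidden bit without changing the protein.

```python
import hashlib
import hmac

# Toy subset of the standard genetic code (illustrative, not the archive's table).
# For amino acids with two listed codons, choosing the first encodes bit 0
# and choosing the second encodes bit 1.
SYNONYMS = {
    "A": ("GCT", "GCC"),
    "G": ("GGT", "GGC"),
    "L": ("CTG", "CTC"),
    "K": ("AAA", "AAG"),
    "E": ("GAA", "GAG"),
    "M": ("ATG",),  # single codon: carries no information
    "W": ("TGG",),  # single codon: carries no information
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def signature_bits(key: bytes, protein: str, n_bits: int) -> str:
    """Derive a keyed bit string (hypothetical payload) from the protein sequence."""
    digest = hmac.new(key, protein.encode(), hashlib.sha256).digest()
    return "".join(f"{byte:08b}" for byte in digest)[:n_bits]


def embed(protein: str, bits: str) -> str:
    """Encode the protein as DNA, hiding `bits` in the synonymous codon choices."""
    codons, used = [], 0
    for aa in protein:
        options = SYNONYMS[aa]
        if len(options) == 2 and used < len(bits):
            codons.append(options[int(bits[used])])
            used += 1
        else:
            codons.append(options[0])
    if used < len(bits):
        raise ValueError("protein has too few synonymous sites for the payload")
    return "".join(codons)


def extract(dna: str, n_bits: int) -> str:
    """Read the hidden bits back from the codon choices."""
    bits = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        options = SYNONYMS[CODON_TO_AA[codon]]
        if len(options) == 2 and len(bits) < n_bits:
            bits.append(str(options.index(codon)))
    return "".join(bits)


def translate(dna: str) -> str:
    """Translate DNA back to protein using the toy table."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

This toy version has no error correction, so a single synonymous mutation would flip a bit; a watermark that survives mutation studies needs redundancy on top of this basic idea.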


Genethropic : A decentralised genomics engine

Frontier labs are accelerating rapidly in AI, while biological research continues to lag behind in fragmented silos. How about a decentralized, open, AI-driven ecosystem where complex biological data is transformed into intuitive human insight? The post-AI era may finally dissolve the walls of academia, returning the code of life to everyone, open for exploration across disciplines and backgrounds. To reach that future, we must start building the infrastructure now.
Genethropic is developing tools and platforms to match that vibe, including AI biosafety research frameworks, evals and guidelines. Headquartered in Barcelona, it is a small (global) community of avant-garde researchers and engineers actively involved in envisioning the future of safe, ethical and accessible AIxBio and bringing the right tools to the table.
Here’s what the basket holds so far:
Gene-Story: All 62,700 human genes, pseudogenes, and functional regions, each telling its story in a book-like interface you can actually read.
Gene-Intel: A graph-based platform that finds functional gene twins across 15 species by reasoning about structure and chromosomal neighbourhood, not just sequence similarity.
Gene-Maps: Search any human gene and explore its 3D address in the nucleus (like Google Maps): chromatin interactions, CRISPR edit safety and druggability score, all before spending a dollar in the lab.
BioClaw: Agentic modules to reproduce end-to-end lab protocols. A digital-to-biological pipeline, built with LangGraph, as a PoC for insulin production.


The Dark Drift: When LLM Chains Turn Toxic
(AI-Manipulation case study)

What happens when AI agents play 'the telephone game'?
Our "Dark Drift" study (performed at the Apart Research AI Manipulation Hackathon) reveals a chilling "Telephone Game" effect: as information passes through a chain of LLMs, ethical alignment and factual accuracy don't just fade - they actively decay into harm.
By creating 8 different AI personas, each dealing with 26 ethically compromised scenarios, we simulated 6,240 interactions across models like Qwen3-32B and GPT-OSS-20B, scoring manipulative behaviours such as sycophancy, callousness, hidden agenda and information loss (using GPT-4.5 as an LLM-as-a-Judge).
We discovered that "goal-oriented" personas (like the Ambitious Intern) can bypass safety guardrails to "impress the boss," while Professional personas use "Bureaucratic Masking" to sanitize and launder unethical premises.
Most alarming was the Self-Reinforcing Feedback Loop: high-risk personas saw harmful traits escalate by +15.2 points in just five steps.
As solutions, we propose a PCA-based trajectory analysis to detect and halt "Dark Drift" before a communication chain spirals into psychopathic behavior. Future multi-agent systems could also use Heterogeneous Chains, mixing "Adversarial" agents with "Compliance Officers" to create self-correcting architectures. More details can be found in the paper linked below.
-----
(I would like to thank my incredible fellow researchers Cheng, Rudy and David at "AI Safety Barcelona" for a nonstop 48 hours of hackathon adrenaline rush.)
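
As a rough illustration of the PCA-based trajectory idea proposed above, the sketch below fits the first principal component of judge scores from reference chains and flags the first step of a new chain whose projection crosses a threshold. It is not the analysis from the paper: the four score dimensions, the toy numbers, the threshold and all function names are assumptions, and the component is found by plain power iteration to keep the example dependency-free.

```python
import math


def first_component(rows, iterations=200):
    """Return (mean, unit first principal component) using power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # starting vector; assumes it is not orthogonal to the component
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def first_drift_step(chain, mean, component, threshold):
    """Index of the first step whose projection exceeds the threshold, else None."""
    for step, scores in enumerate(chain):
        projection = sum((s - m) * c for s, m, c in zip(scores, mean, component))
        if projection > threshold:
            return step
    return None


# Invented judge scores per step: [sycophancy, callousness, hidden agenda, information loss]
reference_steps = [
    [10, 8, 5, 12],
    [12, 9, 6, 11],
    [30, 28, 25, 33],
    [50, 47, 45, 52],
    [70, 69, 66, 71],
    [11, 10, 4, 13],
]
new_chain = [
    [10, 9, 5, 12],
    [20, 18, 15, 22],
    [45, 44, 40, 48],
    [75, 72, 70, 78],
]

mean, component = first_component(reference_steps)
halt_at = first_drift_step(new_chain, mean, component, threshold=20.0)
```

In a live chain, the same projection could be computed after every hop, and the chain halted or rerouted once `first_drift_step` returns an index.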


Beyond RAG : How to develop frontier-agents for the pharmaceutical industry

Current RAG (Retrieval-Augmented Generation) applications in the pharmaceutical industry are often trapped in "vanilla" RAG systems that function merely as sophisticated search bars. I note a few observations here to push this AI framework beyond Q&A-style data retrieval, towards more nuanced and sophisticated frontier agents.
Moving beyond simple retrieval means developing a dynamic "agentic" ecosystem where AI doesn't just find information but reasons and acts: systems that don't just "search" but "interrogate" data by taking on complementary roles.
Take, for example, the historical "dark" data of failed trials or failed compounds. A traditional RAG pipeline can be swapped for "In Silico Peer Review" swarms, where a 'Skeptic Agent' retrieves historical failure patterns to red-team a 'Designer Agent’s' new molecule or pipeline.
Likewise, clinical trial systems are under-utilized when protocols are treated as static PDFs rather than live software that can be reasoned over and re-purposed to prepare faster, more accurate trials.
In the Regulatory & Quality domain, RAG is still considered a passive lookup tool for compliance questions. An ingenious hack here would be to create "Adversarial Wardens" that perform automated "integrity attacks" on their own submissions, finding internal contradictions or citation gaps before the FDA / EMA does.
In the Manufacturing domain, RAG-based pipelines can implement "Operational Twins", where agents monitor live bioreactor sensors and use RAG to autonomously adjust feed rates to save a batch in real time.
Beyond chunking PDFs, a Knowledge Graph is built here, where nodes represent batches, sensors, ingredients, and operators, while edges represent causal relationships (e.g., "Ingredient X influences pH level"). A batch-simulation agent can then retrieve "Golden Batch" parameters via RAG and run the suggestion through a simulation to predict the outcome.
This way, a human operator can monitor or adjust the simulation for better success in drug manufacturing.
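
The manufacturing knowledge graph described above could be sketched along the following lines. This is only an illustration under assumed names: `ProcessGraph`, the node labels other than "Ingredient X" and "pH level", and the relation strings are all hypothetical. The point is that causal edges let an agent ask "what can influence this outcome?" rather than retrieve text chunks.

```python
from collections import defaultdict, deque


class ProcessGraph:
    """Tiny causal graph: typed nodes, directed edges from cause to effect."""

    def __init__(self):
        self.node_type = {}
        self.causes = defaultdict(list)  # effect -> [(cause, relation), ...]

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, cause, relation, effect):
        self.causes[effect].append((cause, relation))

    def upstream(self, target):
        """All nodes that directly or indirectly influence `target`, breadth-first."""
        seen, order, queue = {target}, [], deque([target])
        while queue:
            node = queue.popleft()
            for cause, _relation in self.causes.get(node, ()):
                if cause not in seen:
                    seen.add(cause)
                    order.append(cause)
                    queue.append(cause)
        return order


graph = ProcessGraph()
graph.add_node("Feed rate", "parameter")       # hypothetical node
graph.add_node("Ingredient X", "ingredient")
graph.add_node("pH level", "sensor")
graph.add_node("Batch yield", "outcome")       # hypothetical node
graph.add_edge("Feed rate", "controls dosing of", "Ingredient X")
graph.add_edge("Ingredient X", "influences", "pH level")
graph.add_edge("pH level", "influences", "Batch yield")

levers = graph.upstream("Batch yield")
```

A batch-simulation agent could use such a traversal to decide which parameters are worth varying before it runs a simulation.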


Large Language Models for Genomics - Integrating science, signs and symbols

This is the Bhopalator: a molecular machine that processes 'cell language'. DNA function cannot be fully accounted for by the laws of physics and chemistry alone; it must also be understood as a linguistic system. This should inspire new architectures for LLMs built on the principles of semiotics, the science of symbols and signs. The Bhopalator is described in 'Linguistics of DNA: Words, Sentences, Grammar, Phonetics, and Semantics' by Sungchul Ji (1999).
(Work in progress; a draft will be available soon.)


AI/ML models to efficiently profile patients with blood-related (Haematological) disorders

GenoMed4All is the first Horizon 2020 European grant awarded in the field of AI in genomics research. This study goes beyond current diagnostic approaches and uses federated learning, large language models (LLMs) and natural language processing (NLP) to extract valuable insights directly from clinical text reports and provide significantly enhanced patient stratification and outcome prediction in haematological diseases (blood-related disorders).
The pilots cover common and rare oncological (Myelodysplastic syndromes and Multiple Myeloma) and non-oncological (Sickle Cell Disease) haematological diseases to stratify patients based on clinical reports, genomic profile, and other multimodal data.
Key Steps in Leveraging Generative AI for GenoMed4All
1. Model Adaptation with Domain-Specific Fine-Tuning: The project employs a pre-trained BERT framework with custom numerical embeddings. The model is fine-tuned using clinical reports from 1,328 haematology patients.
2. Text Embedding and Clustering: Patients are grouped into distinct clusters based on similarities in clinical text.
3. Cluster Validation: Clusters are validated against known patient diagnoses, gene mutation patterns, and survival probabilities (Kaplan-Meier survival analyses).
4. Performance Benchmarking: Model performance is compared with general clinical models on metrics such as pseudo-perplexity, accuracy, and F1 score.
Impact of GenoMed4All in Precision Medicine
This study underscored the value of domain-specific adaptations for LLMs in extracting critical features from specialized datasets.
The integration of NLP-driven models into clinical workflows marks the beginning of a new era in personalized medicine. As multimodal data becomes increasingly accessible, combining clinical text, genomics, and imaging with AI may unlock unprecedented capabilities for disease stratification and outcome prediction.
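
The cluster validation step mentions Kaplan-Meier survival analyses. As a minimal sketch of what that check involves, the code below computes a Kaplan-Meier curve per cluster from (cluster, time, event) records. The record layout, function names and any data passed in are assumptions for illustration, not the project's pipeline. The estimator multiplies, at each event time, the fraction of at-risk patients who survive it: S(t) = product over event times t_i <= t of (1 - d_i / n_i).

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each time with at least one event.

    events[i] is True if the event (e.g. death) was observed at times[i],
    and False if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = removed = 0
        # Group all patients sharing the same time point.
        while i < len(data) and data[i][0] == t:
            observed += int(data[i][1])
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def survival_by_cluster(records):
    """records: iterable of (cluster_label, time, event). Returns {label: curve}."""
    grouped = {}
    for label, time, event in records:
        times, events = grouped.setdefault(label, ([], []))
        times.append(time)
        events.append(event)
    return {label: kaplan_meier(ts, es) for label, (ts, es) in grouped.items()}
```

If text-derived clusters are clinically meaningful, their curves should separate; a formal comparison would add a test such as the log-rank test, which is not shown here.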

Biological Principles for Safe (Superintelligence) AI : Attention Nature is all you need

The race is on to develop Superintelligent AI, an artificial intelligence system that will surpass human cognitive abilities across virtually all domains. However, the development of such superintelligence would require fundamentally different approaches and safeguards compared to current AI models, and therefore a wider perspective and vigilance.
Can we count on 'Mother Nature' to provide some wisdom? Through millions of years of evolution, Nature has developed numerous strategies for creating intelligent, adaptive, and resilient systems. This paper explores the potential lessons that future superintelligence developers can learn from biological systems to create more responsible and robust artificial intelligence. We examine ten key principles observed in nature and discuss their potential applications in AI development, providing examples from both biological sciences and software engineering perspectives.


A practical guide for federated-learning using multimodal data
(FAIR data principles)

Federated learning presents several challenges, especially when applied to multimodal healthcare data. Here we publish guidelines and good practices for the obstacles faced by research engineers, especially in the healthcare sector. This work addresses a few critical concerns and how to navigate them: regulations governing international datasets, interoperability standards for multimodal datasets, maintaining data quality and consistency, mitigating privacy and security concerns, addressing model complexity and validation, scalability, and stakeholder engagement.

De-centralised data discovery : Building trust-worthy solutions for healthcare AI

Real patient data is valuable for the new era of personalized medicine, which uses AI-based tools to train models for unbiased, precise and faster diagnosis.
However, such data is highly identifiable and therefore needs to be protected. Given the absolute necessity of real patient data for inclusive and unbiased AI model training, we cannot simply afford to keep the data locked away either.
The Beacon project is a solution that does not compromise on privacy or ownership of the data while simultaneously making such data 'searchable', boosting worldwide research efforts that depend on ‘big data’.
This is the first time that the genomics research community (GA4GH & ELIXIR) came together to draft a specification for genomics and clinical data sharing, so that it follows a set of rules and principles designed to favour both data owners, such as patients, clinicians and hospitals, and data requesters, such as researchers.


Real-time monitoring of virus evolution during the COVID-19 pandemic

A tale of quick scientific pandemic response: navigating cross-border data exchange, handling terabytes of data per day, developing faster pipelines for real-time information exchange and data visualisation, and connecting remote-working teams from around the globe - all with one mission, to quickly trace the ever-evolving variants of the SARS-CoV-2 virus around the globe.

A pilot-project launched for early diagnosis of rare diseases: Connecting hospitals for rapid data exchange

  • This is the first pilot-project to test the Beacon v2 API in a real-life situation, connecting different Spanish hospitals to exchange patient diagnostics.

  • Hospitals are increasingly generating patient data through routine clinical practice. However, so far, they could not exchange such information with other hospitals for faster diagnosis, for example in cases of unknown or rare disease.

  • This limits important advances and breakthroughs that could be possible through the use of emerging technologies such as AI and machine learning.

  • Researchers at the European Genome-Phenome Archive (EGA) of the Centre for Genomic Regulation (CRG) and ELIXIR Spain, in collaboration with the Global Alliance for Genomics and Health (GA4GH), came together to address this challenge by releasing Beacon v2, a ‘search engine’ that allows researchers to discover genomic and phenotypic data from patients around the world in a secure and private manner.

  • Six different hospitals of Catalunya (Spain) were connected in this pilot-project for rapid queries and information exchange.
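
To illustrate the discovery principle behind such a network, here is a toy sketch in which each hospital keeps its variants locally and only answers whether a queried variant exists. This is not the Beacon v2 API: the class names, fields, hospital names and variant coordinates are all made up for illustration. What it shows is that a requester learns where matching data lives without any patient record leaving the hospital.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """A genomic variant identified by position and alleles (toy representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str


@dataclass
class HospitalBeacon:
    """Holds variants locally and exposes only a yes/no existence query."""
    name: str
    _variants: set = field(default_factory=set, repr=False)

    def load(self, variants):
        self._variants.update(variants)

    def exists(self, query: Variant) -> bool:
        # Only a boolean leaves the hospital; no records are returned.
        return query in self._variants


def query_network(beacons, query):
    """Ask every beacon the same question and collect the yes/no answers."""
    return {beacon.name: beacon.exists(query) for beacon in beacons}


# Hypothetical hospitals and made-up coordinates.
hospital_a = HospitalBeacon("Hospital A")
hospital_b = HospitalBeacon("Hospital B")
hospital_a.load([Variant("1", 12345, "A", "G")])
hospital_b.load([Variant("2", 67890, "C", "T")])

answers = query_network([hospital_a, hospital_b], Variant("1", 12345, "A", "G"))
```

A clinician facing an undiagnosed case could then contact only the hospitals that answered yes and request access through the appropriate governance channel.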


Computational pipelines for exhaustive pattern search on human DNA

MIRA is a computational pipeline for the exhaustive search of enriched mutations in coding and non-coding regions.
MoSEA is a Python-based tool to perform motif enrichment analysis, i.e. to search which short sequences (k-mers) of DNA are over-mutated in disease versus control groups.
SUPPA is a tool for fast alternative splicing detection in large-scale analyses using alignment-free mapping, thus substantially reducing the time required for genomics data analysis.
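
The k-mer enrichment idea can be sketched as follows. This is not MoSEA's implementation: the function names, the pseudocount and the simple frequency ratio are assumptions chosen for illustration. The sketch counts every k-mer window that overlaps a mutated position in each group, then compares relative frequencies between disease and control.

```python
from collections import Counter


def kmers_at_mutations(sequence, positions, k):
    """Count each k-mer window that overlaps a mutated position (0-based)."""
    counts = Counter()
    for pos in positions:
        first = max(0, pos - k + 1)
        last = min(pos, len(sequence) - k)
        for start in range(first, last + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def enrichment(disease, control, pseudocount=1.0):
    """Ratio of relative k-mer frequencies (disease / control) with a pseudocount."""
    kmers = set(disease) | set(control)
    disease_total = sum(disease.values()) + pseudocount * len(kmers)
    control_total = sum(control.values()) + pseudocount * len(kmers)
    return {
        kmer: ((disease[kmer] + pseudocount) / disease_total)
        / ((control[kmer] + pseudocount) / control_total)
        for kmer in kmers
    }
```

A ratio well above 1 suggests a k-mer is hit by mutations more often in the disease group; a real analysis would also attach a statistical test and correct for multiple comparisons.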

RNA Biomarkers : Profiling >4000 cancer patients for precise mutational patterns.

This was one of the first extensive studies of the regulation of alternative splicing through RNA-binding proteins, carried out to exhaustively search for RNA-based biomarkers for the early detection of cancer. Together with my lab, Computational RNA Biology at Pompeu Fabra University in Barcelona, we studied how to use the process called 'alternative splicing' to examine distinct mRNA and proteomics-based signatures, especially to identify early tumor subtypes.
The outcomes of these studies were published in major journals, one of them as a cover article.

* Genome Sequencing and RNA-Motif Analysis Reveal Novel Damaging Noncoding Mutations in Human Tumors. (Cover page issue) Molecular Cancer Research (2018)
* The role of alternative splicing in cancer. Singh B and Eyras E. Transcription (2017)
* SUPPA2 provides fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Entizne JC, Trincado JL, Hysenaj G, Singh B, Skalic M, Elliott DJ, Eyras E. bioRxiv (2017)
* Large-scale analysis of genome and transcriptome alterations in multiple tumors unveils novel cancer-relevant splicing networks. Sebestyén E, Singh B, Miñana B, Pagès A, Mateo F, Pujana MA, Valcárcel J, Eyras E. Genome Research (2016)
* Argonaute-1 binds transcriptional enhancers and controls constitutive and alternative splicing in human cells. Alló M, Agirre E, Bessonov S, Bertucci P, Gómez Acuña L, Buggiano V, Bellora N, Singh B, et al. Proc Natl Acad Sci USA (2014)
--------
* PhD thesis (defended in May 2017): Large-scale study of RNA processing alterations in multiple cancers.

© Research profile - Dr. B. Singh. All rights reserved.