Research
Advancing Medical AI Science
Our research spans large language model training, biomedical NLP, genomic AI, and safe clinical reasoning. We publish openly and collaborate with leading institutions.
Focus Areas
Where we push boundaries
Biomedical NLP
Named entity recognition, relation extraction, and semantic reasoning across clinical and biomedical text corpora.
LLM Alignment
Medical-domain RLHF and DPO pipelines that ensure safe, grounded, and factually accurate model outputs.
Genomic AI
Deep learning on genomic sequences, structural variant calling, and gene-phenotype association modeling.
Drug Discovery AI
Generative and predictive AI for molecular property optimization, ADMET modeling, and target identification.
Clinical Reasoning
Chain-of-thought clinical decision support, differential diagnosis generation, and evidence-based reasoning.
Medical Vision
Multimodal models for pathology slide analysis, radiology interpretation, and visual clinical grounding.
Methodology
How we build trustworthy models
Every DeepCog model follows a rigorous four-stage pipeline, from data curation to clinical validation.
01 //
Data Curation
Multi-source biomedical corpus assembly with quality filtering, deduplication, and expert annotation.
02 //
Pre-training
Domain-specific continued pre-training on curated corpora with medical tokenizer optimization.
03 //
DPO Alignment
Direct Preference Optimization using expert clinician preference data for safe, accurate outputs.
04 //
Clinical Eval
Benchmark evaluation on MedQA, USMLE, PubMedQA, and internal clinical validation sets.
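The Direct Preference Optimization step in stage 03 trains the model to prefer clinician-approved responses over rejected ones. A minimal sketch of the per-pair DPO loss, in pure Python for illustration: the function name and arguments are assumptions, and the inputs are the summed log-probabilities of each response under the policy and a frozen reference model (in practice these come from the language models themselves).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy (pi_*) or the frozen
    reference model (ref_*). beta scales the implicit reward.
    """
    # Implicit reward: log-ratio of policy to reference, chosen minus rejected
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: small when the policy favors the chosen response
    return math.log(1.0 + math.exp(-logits))
```

When the policy and reference agree exactly, the loss is log 2; it shrinks as the policy assigns relatively more probability to the clinician-preferred response.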
Publications
Recent research papers
OpenBioLLM: Advancing Open-Source Biomedical Large Language Models with Expert-Curated Preference Data
We introduce OpenBioLLM-70B, a state-of-the-art open-source biomedical LLM achieving 91.2% on MedQA. Our novel DPO pipeline leverages expert medical preference annotations across 120K instruction pairs, surpassing GPT-4 on 7 of 9 medical benchmarks.
MedQA 91.2%
USMLE 89.4%
Open Source
GenomicLLM: A Domain-Specific Language Model for Variant Interpretation and Gene-Disease Association
We present GenomicLLM-7B, trained on 180M genomic sequences from NCBI and Ensembl. The model achieves 87% accuracy on clinical variant classification tasks, enabling automated interpretation of VCF files and generation of clinical genomics reports.
Genomics
Variant Calling
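Automated interpretation of VCF files, as described for GenomicLLM, presumes a structured view of each variant record. A minimal sketch of parsing one VCF 4.x data line into its core columns; this helper is hypothetical and not the paper's code:

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict (sketch).

    The first eight columns follow the VCF 4.x spec:
    CHROM POS ID REF ALT QUAL FILTER INFO.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list multiple alternate alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
        "info": dict(
            kv.split("=", 1) if "=" in kv else (kv, True)
            for kv in info.split(";")
        ),
    }
```

A real pipeline would additionally handle header lines, per-sample genotype columns, and multi-allelic normalization before any model sees the record.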
MedDPO: Scaling Direct Preference Optimization for Clinical Safety in Medical Language Models
We introduce MedDPO, a clinical-safety-focused DPO framework that reduces medical hallucination by 68% while maintaining benchmark performance. Our 120K preference dataset is annotated by board-certified physicians across 24 specialties.
Safety
Alignment
Hallucination ↓68%
MolLLM: Unified Molecular Language Modeling for ADMET Prediction and Lead Optimization
MolLLM bridges natural language and molecular representations using a unified tokenizer for SMILES, InChI, and IUPAC names. It achieves top performance on 14 ADMET benchmarks from TDC, with a 3x speedup over existing graph neural network approaches.
Drug Discovery
ADMET
Molecules
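A unified molecular tokenizer must, at minimum, split SMILES strings into atom- and bond-level tokens rather than raw characters. A deliberately minimal regex sketch of that idea (illustrative only, and far simpler than MolLLM's actual tokenizer, which is not public in this form):

```python
import re

# Minimal SMILES token pattern (sketch): bracket atoms, two-letter
# organic-subset atoms, one-letter and aromatic atoms, then bonds,
# branches, ring-closure digits, and related punctuation.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms, e.g. [NH3+]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|[BCNOPSFI]"         # one-letter organic-subset atoms
    r"|[bcnops]"           # aromatic atoms
    r"|[=#\-+\\/()%.\d@]"  # bonds, branches, ring closures, charges
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Lossless check: tokens must reassemble the input exactly
    assert "".join(tokens) == smiles, "unrecognized character in SMILES"
    return tokens
```

For example, tokenizing aspirin, `CC(=O)Oc1ccccc1C(=O)O`, keeps `Cl`-style two-letter atoms intact elsewhere while treating ring-closure digits and branch parentheses as their own tokens.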