HLA Peptide Discovery — Scientific Data Platform Modernisation

From fragmented Excel workbooks and email-driven handoffs to a scalable, query-driven scientific intelligence platform — enabling Immuno-Oncology researchers to discover, filter, and interrogate HLA peptide targets across 283,000+ peptides with multi-dimensional precision.


Phases: 3-Phase Delivery
Team: Proteomics · Bioinformatics · TFAs
Scale: 283K+ Peptides
Industry: Immuno-Oncology · Data Engineering
HLA Peptide Discovery Platform — Immuno-Oncology Research

A scalability crisis hidden inside spreadsheets

Immuno-Oncology peptide discovery generates an enormous and ever-growing volume of data. MS/MS experiments produce hundreds of thousands of peptide candidates that must be cross-referenced against HLA typing, RNA-Seq expression, NetMHC binding predictions, tumour expression data, and off-target risk algorithms before any target can be confidently nominated.

The legacy workflow managed this complexity through per-experiment Excel workbooks, manual script execution, and email-based coordination between Proteomics, Bioinformatics, and Therapeutic Focus Area teams. With 283,000+ peptides already accumulated and millions projected, the system was heading toward a data integrity and scalability crisis.

Scientists could not easily answer fundamental discovery questions — Was this peptide ever observed before? In how many human samples? Is it tumour-enriched? What are its off-target risks? The answers existed in the data, but the architecture made them inaccessible.

Key Challenges

No centralised database — experiments tracked in per-experiment Excel workbooks and CSV exports

30–40 MS/MS samples processed monthly, adding to 283,000+ accumulated peptides, with no structured query capability

Manual PEAKS export, script execution, and NetMHC queries, all dependent on a few key individuals

No linkage between summarised results and raw experimental evidence

Email-based coordination between Proteomics, Bioinformatics, and Therapeutic Focus Areas

Projected exponential peptide growth to millions — existing architecture could not scale

Key Requirements

Centralised scientific database at patient, experiment, and peptide level

Automated ingestion from PEAKS MS/MS exports with HLA typing and RNA-Seq integration

Complex multi-parameter query engine with threshold-based filtering

Drill-down from summarised results to raw experimental MS evidence

Automated annotation and off-target script triggering

Scalable architecture supporting millions of peptides without performance degradation

From raw MS data to actionable discovery intelligence

Every experiment flows through an automated pipeline — ingested, integrated with genomic data, indexed, and made instantly queryable with drill-down to raw spectral evidence.

MS/MS (PEAKS): Raw spectral data
Ingestion Engine: Parse & normalise
Scientific DB: Patient · Experiment · Peptide
HLA + RNA-Seq: Genomic integration
Query Engine: Multi-parameter discovery
Researcher Dashboard: Drill-down to raw evidence

A purpose-built scientific discovery engine

The platform was delivered in three structured phases — each building on the last to progressively replace manual workflows, integrate genomic data sources, and unlock increasingly sophisticated discovery capabilities.

Automated Data Ingestion

Event-driven ingestion pipeline processes raw PEAKS MS/MS exports automatically — normalising, deduplicating, and structuring peptide data at patient, experiment, and peptide level without manual intervention
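The normalise-and-deduplicate step described above might look roughly like the following minimal sketch. The column names `Peptide` and `IonScore`, the `PeptideObservation` record, and the "keep the highest-scoring row per sequence" policy are all assumptions made for illustration; the actual PEAKS export layout and the platform's deduplication rules are not given in this case study.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideObservation:
    """One peptide seen in one experiment (hypothetical record shape)."""
    patient_id: str
    experiment_id: str
    sequence: str
    ion_score: float


def normalise_sequence(raw: str) -> str:
    """Trim whitespace and upper-case the amino-acid sequence."""
    return raw.strip().upper()


def ingest_export(csv_text: str, patient_id: str, experiment_id: str) -> list[PeptideObservation]:
    """Parse one export and return one observation per unique peptide.

    Assumed dedup policy: keep the highest-scoring row per sequence.
    """
    best: dict[str, PeptideObservation] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sequence = normalise_sequence(row["Peptide"])  # hypothetical column name
        if not sequence:
            continue  # skip rows with no peptide sequence
        score = float(row["IonScore"])  # hypothetical column name
        current = best.get(sequence)
        if current is None or score > current.ion_score:
            best[sequence] = PeptideObservation(patient_id, experiment_id, sequence, score)
    # Sort for a deterministic result regardless of input row order.
    return sorted(best.values(), key=lambda obs: obs.sequence)
```

Tagging every record with patient and experiment identifiers at parse time is what lets later stages store and query the data at all three levels.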

Scientific Database Architecture

Purpose-built relational database stores peptide records across hundreds of cell lines and patient samples, with HLA typing, RNA-Seq metadata, and NetMHC binding predictions all co-located for unified querying
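As an illustration only, a relational layout at sample, experiment, and peptide level could be sketched as below. SQLite is used purely so the example is self-contained; the case study does not name the database engine, and every table and column name here is hypothetical.

```python
import sqlite3

# Hypothetical schema: none of these names come from the actual platform.
SCHEMA = """
CREATE TABLE sample (
    sample_id   TEXT PRIMARY KEY,
    sample_type TEXT NOT NULL CHECK (sample_type IN ('patient', 'cell_line')),
    hla_alleles TEXT NOT NULL              -- HLA typing result stored with the sample
);
CREATE TABLE experiment (
    experiment_id TEXT PRIMARY KEY,
    sample_id     TEXT NOT NULL REFERENCES sample(sample_id)
);
CREATE TABLE peptide (
    peptide_id INTEGER PRIMARY KEY,
    sequence   TEXT NOT NULL UNIQUE        -- one row per distinct peptide
);
CREATE TABLE observation (
    experiment_id TEXT    NOT NULL REFERENCES experiment(experiment_id),
    peptide_id    INTEGER NOT NULL REFERENCES peptide(peptide_id),
    ion_score     REAL    NOT NULL,
    PRIMARY KEY (experiment_id, peptide_id)
);
-- Supports "in which experiments was this peptide seen?" lookups.
CREATE INDEX idx_observation_peptide ON observation(peptide_id);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn
```

Separating `peptide` from `observation` means a peptide is stored once however many experiments detect it, so questions such as "was this peptide ever observed before?" become a single indexed lookup.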

Multi-Dimensional Query Engine

Researchers can filter across expression thresholds, binding affinity, off-target risk scores, sample frequency, and tissue type simultaneously — enabling complex discovery queries that previously required days of manual consolidation
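A simultaneous multi-threshold filter of this kind can be sketched as follows. The `PeptideSummary` fields, the `DiscoveryQuery` class, and the convention that a lower binding rank means a stronger predicted binder are assumptions for illustration, not the platform's actual data model or query API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeptideSummary:
    """Hypothetical per-peptide summary row."""
    sequence: str
    tumour_expression: float
    binding_rank: float      # assumed: lower rank = stronger predicted binder
    off_target_score: float  # assumed: lower score = lower risk
    sample_count: int
    tissue_type: str         # e.g. "tumour", "normal", "cell_line"


@dataclass(frozen=True)
class DiscoveryQuery:
    """Each threshold is optional; None means 'do not filter on this'."""
    min_tumour_expression: Optional[float] = None
    max_binding_rank: Optional[float] = None
    max_off_target_score: Optional[float] = None
    min_sample_count: Optional[int] = None
    tissue_type: Optional[str] = None

    def matches(self, p: PeptideSummary) -> bool:
        # A peptide must pass every threshold that has been set.
        return all([
            self.min_tumour_expression is None or p.tumour_expression >= self.min_tumour_expression,
            self.max_binding_rank is None or p.binding_rank <= self.max_binding_rank,
            self.max_off_target_score is None or p.off_target_score <= self.max_off_target_score,
            self.min_sample_count is None or p.sample_count >= self.min_sample_count,
            self.tissue_type is None or p.tissue_type == self.tissue_type,
        ])


def run_query(peptides: list[PeptideSummary], query: DiscoveryQuery) -> list[PeptideSummary]:
    """Return the peptides that satisfy all active thresholds."""
    return [p for p in peptides if query.matches(p)]
```

Making every threshold optional lets the same query object express both simple lookups and complex multi-dimensional filters.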

Drill-Down to Raw Evidence

Every summarised result is traceable to its source MS/MS experimental data — scientists can navigate from a peptide summary down to the raw spectral evidence in seconds
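One way to keep summaries traceable is to compute them from the stored spectrum-level records rather than storing them separately. The sketch below assumes a hypothetical `SpectrumMatch` record with a `scan_id` pointing at the raw spectrum; the real platform's evidence model is not described in this case study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpectrumMatch:
    """Hypothetical link between a peptide and one raw MS/MS scan."""
    sequence: str
    experiment_id: str
    scan_id: str
    ion_score: float


class EvidenceIndex:
    """Builds peptide summaries while retaining every underlying match."""

    def __init__(self, matches: list[SpectrumMatch]) -> None:
        self._by_sequence: dict[str, list[SpectrumMatch]] = defaultdict(list)
        for match in matches:
            self._by_sequence[match.sequence].append(match)

    def summary(self, sequence: str) -> dict:
        """Summarised view: experiment count and best score for a peptide."""
        matches = self._by_sequence.get(sequence, [])
        return {
            "sequence": sequence,
            "experiments": len({m.experiment_id for m in matches}),
            "best_ion_score": max((m.ion_score for m in matches), default=None),
        }

    def drill_down(self, sequence: str) -> list[SpectrumMatch]:
        """Raw evidence behind the summary, strongest match first."""
        matches = self._by_sequence.get(sequence, [])
        return sorted(matches, key=lambda m: m.ion_score, reverse=True)
```

Because the summary is derived from the same records that `drill_down` returns, the two views cannot drift apart.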

Annotation & Off-Target Automation

Annotation and off-target prediction scripts are triggered automatically on ingestion — eliminating manual execution steps and reducing the risk of missed processing or version inconsistencies
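The trigger-on-ingestion idea can be sketched as a small hook registry that runs every registered script and records which version produced each result. `IngestionHooks`, `ScriptRun`, and the example script are hypothetical names for illustration; the platform's actual triggering mechanism is not specified here.

```python
from dataclasses import dataclass
from typing import Callable

# A script takes the ingested peptide sequences and returns a result mapping.
Script = Callable[[list[str]], dict]


@dataclass(frozen=True)
class ScriptRun:
    """Audit record: which script and version ran for which experiment."""
    experiment_id: str
    script_name: str
    script_version: str
    result: dict


class IngestionHooks:
    """Runs every registered script whenever an experiment is ingested."""

    def __init__(self) -> None:
        self._scripts: list[tuple[str, str, Script]] = []

    def register(self, name: str, version: str, func: Script) -> None:
        self._scripts.append((name, version, func))

    def on_ingested(self, experiment_id: str, sequences: list[str]) -> list[ScriptRun]:
        # Every script runs for every ingestion, in registration order,
        # so no processing step can be forgotten.
        return [
            ScriptRun(experiment_id, name, version, func(sequences))
            for name, version, func in self._scripts
        ]
```

Storing the script version alongside each result is one simple way to make version inconsistencies visible and to decide which experiments need a re-run after a script changes.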

Ecosystem Integration Roadmap

Platform architected for integration with Benchling (research registry), BSI (specimen inventory), and external genomic datasets — TCGA tumour expression and GTEx normal tissue expression already incorporated

Built in three progressive phases

Each phase delivered immediate value while building the foundations for the next capability layer.

01

Data Ingestion, Storage & Query Foundation

Centralised scientific database design at patient and experiment level

Raw PEAKS MS data and HLA typing ingestion

Simple and complex multi-threshold query engines

Historical dataset migration (283K+ peptides)

Export interfaces for annotation and off-target pipelines

02

Advanced Querying & System Integration

Patient vs Cell Line frequency field split

Ion Score added as searchable metric

NetMHC predictions integrated directly

TCGA and GTEx expression datasets ingested

Drill-down from summary results to the underlying experimental data

Tissue-type filtering: Tumour vs Normal vs Cell Line

03

Enhancements & Ecosystem Integration

Script version control and re-run capability

Automated notification workflows

Integration roadmap: Benchling, BSI specimen inventory, Research Data Lake API

Cross-therapeutic reuse positioning

From weeks of manual effort to seconds of structured discovery

The platform fundamentally changed how the organisation interacts with its peptide data. Scientists can now ask complex multi-dimensional questions — combining expression thresholds, binding affinity, off-target risk, and sample frequency — and receive results instantly, with full drill-down to raw spectral evidence.

By eliminating Excel-based restructuring, manual script execution, and email-driven coordination, the platform recovered an estimated 3–6 FTE of effort annually, while accelerating target nomination cycles and reducing the risk posed by key-person dependency.

Estimated Value Delivered
3–6 FTE equivalent effort recovered annually
Across manual data restructuring, script execution, and cross-team coordination
80% Less Manual Processing
90% Faster Peptide Search
3–5× Query Complexity
3–6 FTE Effort Saved Annually

Beyond efficiency — changing what's scientifically possible

🎯

Tumour-Specific Prioritisation

Scientists can now identify and prioritise peptides enriched in tumour tissue vs normal tissue — a query impossible in the legacy architecture
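The case study does not state how tumour enrichment is scored. One common convention, shown here purely as an assumed example, is a log2 fold change between tumour and normal expression, with a pseudocount to avoid division by zero. The function names, the pseudocount of 1.0, and the default threshold are all illustrative choices.

```python
import math


def tumour_enrichment(tumour_expr: float, normal_expr: float, pseudocount: float = 1.0) -> float:
    """Log2 fold change of tumour over normal expression (assumed metric)."""
    if tumour_expr < 0 or normal_expr < 0:
        raise ValueError("expression values must be non-negative")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return math.log2((tumour_expr + pseudocount) / (normal_expr + pseudocount))


def is_tumour_enriched(tumour_expr: float, normal_expr: float, min_log2_fold: float = 1.0) -> bool:
    """True when tumour expression exceeds normal by the chosen log2 threshold."""
    return tumour_enrichment(tumour_expr, normal_expr) >= min_log2_fold
```

With the default pseudocount, a tumour expression of 7.0 against a normal expression of 1.0 gives log2(8 / 2) = 2.0, which passes the default threshold of 1.0.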

🛡️

Off-Target Risk Reduction

Automated off-target prediction at ingestion reduces the risk of nominating candidates with high normal-tissue cross-reactivity

📡

HLA Binding Confidence

Direct NetMHC integration within the platform improves confidence in binding affinity predictions and enables rank-based filtering

🔬

Raw Evidence Access

Any summarised result can be drilled to its source MS/MS spectral data — giving scientists full transparency from discovery to evidence

🌐

Cross-Therapeutic Applicability

Platform architecture is not limited to Oncology — designed for reuse across multiple Therapeutic Focus Areas as the organisation expands

📈

Millions-Scale Readiness

Structured database architecture handles exponential peptide growth without performance degradation — ready for millions of records
