Who it is for
Research teams that need a transcriptomics pipeline they can defend: deterministic species and platform gates, audit-grade provenance, and a clean handoff to downstream analysis or manuscript figures.
Run the first four ingestion gates in your browser right now: parse, QC, species gate, and trim. Production alignment, RO-Crate provenance, and the rest of the pipeline run on our Nextflow / Snakemake backend.
What we do
We run a two-branch transcriptomics factory (microarray vs RNA-seq) behind a fail-closed ingestion gate. The LLM proposes candidate datasets and metadata; a Pydantic-validated gate decides what enters the pipeline.
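As a rough illustration of what "fail-closed" means here, the sketch below validates one LLM-proposed candidate. It uses standard-library dataclasses as a stand-in for a Pydantic model. The field names, the allow-lists, and the `gate` function are hypothetical and are not taken from the production validator.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    MANUAL_REVIEW = "manual_review"
    REJECT = "reject"


# Hypothetical allow-lists, for illustration only.
ALLOWED_SPECIES = {"Homo sapiens", "Mus musculus"}
ALLOWED_ASSAYS = {"microarray", "rna-seq"}


@dataclass(frozen=True)
class Candidate:
    """One LLM-proposed dataset; every field is untrusted until validated."""
    accession: str
    species: str              # species from repository metadata
    assay: str                # platform/assay field from the repository
    publication_species: str  # species stated in the linked publication


def gate(candidate: Candidate) -> Decision:
    """Fail closed: anything not positively verified is never accepted."""
    if not candidate.accession.strip():
        return Decision.REJECT
    if candidate.assay.lower() not in ALLOWED_ASSAYS:
        return Decision.REJECT
    if candidate.species not in ALLOWED_SPECIES:
        return Decision.REJECT
    # Repository and publication disagree: route to a human, never auto-accept.
    if candidate.species != candidate.publication_species:
        return Decision.MANUAL_REVIEW
    return Decision.ACCEPT
```

The design point is that the function has no "probably fine" path: a candidate is accepted only when every check passes, and the LLM's proposal has no way to widen the allow-lists.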
What you get
A versioned run manifest, an RO-Crate provenance bundle, MultiQC-aggregated QC, a species-verified sample list, container-pinned preprocessing outputs, and a decision-ready brief you can act on.
The workbench above covers steps 1, 2, and 5. Steps 3 and 4 (alignment and containerised production QC) run on the backend.
01
Ingestion gate
Repository metadata, platform/assay fields, and a publication cross-check. Fail-closed: conflicting evidence moves to manual review or rejection. The browser workbench above runs the first four gates locally.
02
Empirical species check
FastQ Screen / Kraken2 for RNA-seq; verifyBamID2 + CrosscheckFingerprints for human samples. The browser workbench uses a k-mer index and fails closed if the dominant species is not on the allow-list.
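A toy version of a k-mer species gate might look like the sketch below. It is not the workbench's implementation. The k-mer size, the `min_fraction` threshold, the function names, and the tiny reference strings in the usage example are all assumptions chosen to keep the example readable.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple

K = 5  # toy k-mer size; a real index would use a much larger k


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of k-mers in a sequence (empty if shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: Dict[str, str]) -> Dict[str, str]:
    """Map each k-mer to one species, dropping k-mers shared across species."""
    owner: Dict[str, str] = {}
    ambiguous: Set[str] = set()
    for species, seq in references.items():
        for kmer in kmers(seq):
            if kmer in owner and owner[kmer] != species:
                ambiguous.add(kmer)
            owner[kmer] = species
    return {kmer: sp for kmer, sp in owner.items() if kmer not in ambiguous}


def species_gate(
    reads: Iterable[str],
    index: Dict[str, str],
    allow_list: Set[str],
    min_fraction: float = 0.8,  # assumed threshold for "dominant"
) -> Tuple[bool, Optional[str]]:
    """Return (passed, dominant_species); fail closed when nothing matches."""
    hits: Counter = Counter()
    for read in reads:
        for kmer in kmers(read):
            if kmer in index:
                hits[index[kmer]] += 1
    total = sum(hits.values())
    if total == 0:
        return False, None  # no evidence at all is a failure, not a pass
    species, count = hits.most_common(1)[0]
    passed = species in allow_list and count / total >= min_fraction
    return passed, species
```

With made-up references such as `{"human": "ACGTACGTTTGACCA", "mouse": "GGCATTCCGGAATTC"}`, reads drawn only from the first string pass a `{"human"}` allow-list. A mixed sample fails even if its dominant species is allowed, because no species reaches the threshold.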
03
Technology-specific branch
Microarray (RLE, NUSE, percent present) and RNA-seq (FastQC, RNA-SeQC 2, RSeQC) are never mixed. Batch effects are detected before they are corrected. Production runs only.
04
Containerised QC
Pinned Nextflow / Snakemake runs, MultiQC aggregation, batch-aware thresholds. ENCODE-style read depth and replicate standards where applicable. Production runs only.
05
Provenance & handoff
DataLad-versioned data, RO-Crate workflow-run provenance, Git-versioned code, and a decision-ready brief that links every output back to a study accession. Production runs only.
Every component is selected for portability, auditability, and the ability to rerun a clean pipeline and reproduce the output.
Nextflow
Production orchestration for HPC, cloud, and workstation.
Snakemake
Leaner alternative, especially for R-heavy custom workflows.
nf-core conventions
Style guide and quality floor for reusable pipelines.
MultiQC
Aggregated QC report across all modules.
DataLad
Git-annex versioning of large raw and derived datasets.
Workflow Run RO-Crate
Captures execution provenance with inputs and outputs.
Pydantic + XML prompts
Local schema validation so the LLM cannot relax scientific constraints.
FastQ Screen / Kraken2
Empirical species verification before alignment.
verifyBamID2
Human-sample contamination and identity check.
Repository-native metadata, minimum-information standards, and a bilingual controlled vocabulary. We use these as hard gates, not as guidelines.
MIAME / MINSEQE
Minimum information standards that pin what a usable study must report.
MAGE-TAB / SOFT
Repository-native formats for ArrayExpress, BioStudies, and GEO sample and platform metadata.
ENCODE bulk RNA-seq
Read length >= 50 bp, two or more replicates, ~30M aligned reads, and a replicate Spearman correlation >= 0.9 (isogenic) or >= 0.8 (anisogenic).
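Treated as a hard gate, those thresholds could be encoded roughly as follows. The `RnaSeqQC` record and the `encode_failures` function are hypothetical names. Reading "~30M" as a strict 30,000,000 cutoff is an assumption made only for this sketch.

```python
from dataclasses import dataclass
from typing import List

MIN_READ_LENGTH_BP = 50
MIN_REPLICATES = 2
MIN_ALIGNED_READS = 30_000_000  # assumption: "~30M" read as a strict floor
MIN_SPEARMAN_ISOGENIC = 0.9
MIN_SPEARMAN_ANISOGENIC = 0.8


@dataclass(frozen=True)
class RnaSeqQC:
    """Summary metrics for one bulk RNA-seq experiment (illustrative)."""
    read_length_bp: int
    replicates: int
    aligned_reads: int
    spearman: float  # replicate concordance, computed upstream
    isogenic: bool


def encode_failures(qc: RnaSeqQC) -> List[str]:
    """Return every failed threshold; an empty list means the gate passes."""
    failures: List[str] = []
    if qc.read_length_bp < MIN_READ_LENGTH_BP:
        failures.append("read_length")
    if qc.replicates < MIN_REPLICATES:
        failures.append("replicates")
    if qc.aligned_reads < MIN_ALIGNED_READS:
        failures.append("aligned_reads")
    floor = MIN_SPEARMAN_ISOGENIC if qc.isogenic else MIN_SPEARMAN_ANISOGENIC
    if qc.spearman < floor:
        failures.append("spearman")
    return failures
```

Returning the full list of failures, instead of stopping at the first one, keeps the rejection reason auditable.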
Bilingual controlled vocabulary
Canonical English ontology terms for tissue, disease, strain, and perturbation; Korean mirror for ops.
The LLM proposes candidate inclusions and never relaxes scientific constraints. Final inclusion is a deterministic validator decision.
Microarray and RNA-seq never share a preprocessing branch. They differ in metadata, raw files, QC, and batch behaviour.
Container digests, reference builds, and annotation releases are pinned in the run manifest. A clean rerun reproduces the output or the pipeline is not yet production-ready.
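One way to picture the rerun check is the sketch below. It assumes a manifest that stores an image reference pinned by `sha256` digest plus a hash for each output file. The manifest keys, the function names, and the image name in the usage note are illustrative and do not describe the actual manifest schema.

```python
import hashlib
import re
from typing import Dict

# An image reference counts as pinned only if it carries a sha256 digest,
# not a mutable tag such as ":latest".
DIGEST_RE = re.compile(r"[\w./:-]+@sha256:[0-9a-f]{64}")


def is_pinned(image_ref: str) -> bool:
    return DIGEST_RE.fullmatch(image_ref) is not None


def output_fingerprint(outputs: Dict[str, bytes]) -> Dict[str, str]:
    """Hash every output so two runs can be compared file by file."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}


def reproduces(manifest: Dict[str, object], rerun_outputs: Dict[str, bytes]) -> bool:
    """A rerun counts only if the container is pinned and every hash matches."""
    if not is_pinned(str(manifest["container"])):
        return False  # fail closed on unpinned containers
    return manifest["outputs"] == output_fingerprint(rerun_outputs)
```

For example, a manifest whose `container` is `example.org/rnaseq@sha256:` followed by 64 hex characters and whose `outputs` were recorded with `output_fingerprint` reproduces only when the rerun yields byte-identical files. The same manifest with `example.org/rnaseq:latest` is rejected before any hashes are compared.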
Send a short note and we will return a route preview, an owner, and a fit score. Project-tier engagements start at ₩8M.
Submissions are routed into the Brown Biotech Notion intake hub.
Share just enough context to route the request well. You'll see the route, owner, approval gate, and next action after submit.