strict-omics · project lane

LLM proposes. Deterministic gates decide.

Run the first four ingestion gates in your browser right now — parse, QC, species gate, trim. Production alignment, RO-Crate provenance, and the full pipeline run on our Nextflow / Snakemake backend.

Fail-closed ingestion gate · Browser-side QC + species · Container-pinned production · RO-Crate provenance

Up to 200K reads processed locally per run. No upload. No server pipeline call.

Have a CSV instead? Try the multi-omics tool →

Who it is for

Research teams that need a transcriptomics pipeline they can defend: deterministic species and platform gates, audit-grade provenance, and a clean handoff to downstream analysis or manuscript figures.

What we do

We run a two-branch transcriptomics factory (microarray vs RNA-seq) behind a fail-closed ingestion gate. The LLM proposes candidate datasets and metadata; a Pydantic-validated gate decides what enters the pipeline.
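As a rough illustration of the propose-then-decide split described above, here is a minimal sketch of a fail-closed gate. The page names Pydantic for validation; this sketch swaps in standard-library dataclasses so it runs without extra dependencies. The field names, the allow-lists, and the `gate` function are hypothetical and are not the production schema.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    MANUAL_REVIEW = "manual_review"
    REJECT = "reject"


@dataclass(frozen=True)
class Candidate:
    """One LLM-proposed dataset. All field names are hypothetical."""
    accession: str
    repository_species: str   # species stated in repository metadata
    publication_species: str  # species stated in the linked publication
    assay: str                # e.g. "rna-seq" or "microarray"


# Assumed allow-lists, for illustration only.
ALLOWED_SPECIES = {"homo sapiens", "mus musculus"}
ALLOWED_ASSAYS = {"rna-seq", "microarray"}


def gate(candidate: Candidate) -> Decision:
    """Fail closed: ACCEPT is returned only when every check passes."""
    repo = candidate.repository_species.strip().lower()
    pub = candidate.publication_species.strip().lower()
    assay = candidate.assay.strip().lower()

    # A missing accession or an unknown assay cannot be defended: reject.
    if not candidate.accession or assay not in ALLOWED_ASSAYS:
        return Decision.REJECT
    # Repository and publication disagree: conflicting evidence goes to a human.
    if repo != pub:
        return Decision.MANUAL_REVIEW
    # Consistent evidence, but the species is not on the allow-list.
    if repo not in ALLOWED_SPECIES:
        return Decision.REJECT
    return Decision.ACCEPT
```

The point of the design is that the proposer never sees a code path that returns `ACCEPT` by default: every branch that is not an explicit pass ends in review or rejection.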

What you get

A versioned run manifest, an RO-Crate provenance bundle, MultiQC-aggregated QC, a species-verified sample list, container-pinned preprocessing outputs, and a decision-ready brief you can act on.

Operating steps

The workbench above covers steps 1 and 2. Steps 3, 4, and 5 (alignment, containerised production QC, and provenance handoff) run on the backend.

Ingestion gate

Repository metadata, platform/assay fields, publication cross-check. Fail-closed: conflicting evidence moves to manual review or rejection. The browser workbench above runs the first 4 gates locally.

Empirical species check

FastQ Screen / Kraken2 for RNA-seq; verifyBamID2 + CrosscheckFingerprints for human samples. The browser workbench uses a k-mer index and fails closed if the dominant species is not on the allow-list.

Technology-specific branch

Microarray (RLE, NUSE, percent present) and RNA-seq (FastQC, RNA-SeQC 2, RSeQC) are never mixed. Batch effects are detected before they are corrected. Production runs only.

Containerised QC

Pinned Nextflow / Snakemake runs, MultiQC aggregation, batch-aware thresholds. ENCODE-style read depth and replicate standards where applicable. Production runs only.

Provenance & handoff

DataLad-versioned data, RO-Crate workflow-run provenance, Git-versioned code, and a decision-ready brief that links every output back to a study accession. Production runs only.
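The browser-side species check in step 2 can be pictured with a toy k-mer classifier. This is a sketch under stated assumptions, not the workbench's implementation: the k-mer size, the `min_fraction` threshold, and the helper names `build_index` and `species_gate` are all hypothetical, and a real index would be built from reference genomes or transcriptomes with a much larger k.

```python
from collections import Counter

K = 4  # toy k-mer size for illustration; real indexes use a much larger k


def kmers(seq: str, k: int = K):
    """Yield every overlapping k-mer of an upper-cased sequence."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references: dict[str, str], k: int = K) -> dict[str, str]:
    """Map each k-mer that occurs in exactly one species to that species."""
    owner: dict[str, str] = {}
    ambiguous: set[str] = set()
    for species, ref in references.items():
        for km in set(kmers(ref, k)):
            if km in ambiguous:
                continue
            if km in owner and owner[km] != species:
                # Shared between species: not diagnostic, drop it.
                del owner[km]
                ambiguous.add(km)
            else:
                owner[km] = species
    return owner


def species_gate(reads, index, allow_list, min_fraction=0.8, k=K) -> bool:
    """Pass only if the dominant species is allowed and clearly dominant."""
    votes = Counter()
    for read in reads:
        hits = Counter(index[km] for km in kmers(read, k) if km in index)
        if hits:
            votes[hits.most_common(1)[0][0]] += 1
    if not votes:
        return False  # no classifiable reads: fail closed
    species, count = votes.most_common(1)[0]
    return species in allow_list and count / sum(votes.values()) >= min_fraction
```

Reads with no diagnostic k-mers cast no vote, and a run with no votes at all is rejected instead of passed, which is what "fail-closed" means for this gate.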

Reference

Wong MYH et al. Neuro-symbolic artificial intelligence in medicine. Nature Biomedical Engineering (2026). doi.org/10.1038/s41551-026-01728-1

Why strict-omics is the canonical composite NeSy example: the browser workbench is the neural layer (LLM proposes candidate datasets and metadata); the Pydantic-validated ingestion gate is the symbolic layer (fail-closed on conflicting evidence). This is the same architecture that the paper's Fig. 1 names “composite NeSy.”

Stack

Production-grade tools, pinned by digest.

Every component is selected for portability, auditability, and the ability to rerun a clean pipeline and reproduce the output.

Nextflow

Production orchestration for HPC, cloud, and workstation.

Snakemake

Leaner alternative, especially for R-heavy custom workflows.

nf-core conventions

Style guide and quality floor for reusable pipelines.

MultiQC

Aggregated QC report across all modules.

DataLad

Git-annex versioning of large raw and derived datasets.

Workflow Run RO-Crate

Captures execution provenance with inputs and outputs.

Pydantic + XML prompts

Local schema validation so the LLM cannot relax scientific constraints.

FastQ Screen / Kraken2

Empirical species verification before alignment.

verifyBamID2

Human-sample contamination and identity check.

Standards we enforce

Repository-native metadata, minimum-information standards, and a bilingual controlled vocabulary. We use these as hard gates, not as guidelines.

MIAME / MINSEQE

Minimum information standards that pin what a usable study must report.

MAGE-TAB / SOFT

Repository-native formats for ArrayExpress, BioStudies, and GEO sample and platform metadata.

ENCODE bulk RNA-seq

Read length >= 50 bp, two or more replicates, ~30M aligned reads, Spearman >= 0.9 isogenic / >= 0.8 anisogenic.

Bilingual controlled vocabulary

Canonical English ontology terms for tissue, disease, strain, and perturbation; Korean mirror for ops.
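To show what "hard gates, not guidelines" could look like for the ENCODE bulk RNA-seq thresholds listed above, here is a minimal sketch. The numeric floors come from that entry; the `BulkRnaSeqSummary` type, its field names, and the choice to treat "~30M aligned reads" as a strict minimum are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_READ_LENGTH_BP = 50
MIN_REPLICATES = 2
MIN_ALIGNED_READS = 30_000_000   # "~30M" treated as a hard floor (assumption)
MIN_SPEARMAN_ISOGENIC = 0.9
MIN_SPEARMAN_ANISOGENIC = 0.8


@dataclass(frozen=True)
class BulkRnaSeqSummary:
    """Per-experiment QC summary. Field names are hypothetical."""
    read_length_bp: int
    replicates: int
    aligned_reads: int
    replicate_spearman: float
    isogenic: bool


def encode_failures(s: BulkRnaSeqSummary) -> list[str]:
    """Return the name of every threshold the experiment misses."""
    failures = []
    if s.read_length_bp < MIN_READ_LENGTH_BP:
        failures.append("read_length")
    if s.replicates < MIN_REPLICATES:
        failures.append("replicates")
    if s.aligned_reads < MIN_ALIGNED_READS:
        failures.append("aligned_reads")
    # Replicate concordance floor depends on whether replicates are isogenic.
    floor = MIN_SPEARMAN_ISOGENIC if s.isogenic else MIN_SPEARMAN_ANISOGENIC
    if s.replicate_spearman < floor:
        failures.append("replicate_concordance")
    return failures
```

An empty list means the experiment clears every floor; any non-empty list would block it, and the returned names can be written straight into the QC report.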

Boundaries

The LLM proposes candidate inclusions and never relaxes scientific constraints. Final inclusion is a deterministic validator decision.

Microarray and RNA-seq never share a preprocessing branch. Different metadata, raw files, QC, and batch behaviour.

Container digests, reference builds, and annotation releases are pinned in the run manifest. A clean rerun reproduces the output or the pipeline is not yet production.
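A pinning rule like the one above is easy to check mechanically. The sketch below assumes a hypothetical manifest layout (a `containers` mapping plus `reference_build` and `annotation_release` keys); the real run manifest may be structured differently. It flags any container referenced by tag instead of by `sha256` digest, and any missing reference or annotation entry.

```python
import re

# An image reference pinned by digest, e.g. registry/name@sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"^[\w.\-/:]+@sha256:[0-9a-f]{64}$")

# Hypothetical manifest keys that must be present and non-empty.
REQUIRED_KEYS = ("reference_build", "annotation_release")


def unpinned_entries(manifest: dict) -> list[str]:
    """Return the manifest entries that are missing or not pinned."""
    problems = []
    for name, image in manifest.get("containers", {}).items():
        if not DIGEST_RE.match(image):
            problems.append(f"containers.{name}")
    for key in REQUIRED_KEYS:
        if not manifest.get(key):
            problems.append(key)
    return problems
```

A tag such as `:latest` can point at different images over time, so only a digest reference lets a clean rerun pull byte-identical containers.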

Route preview

Scope a strict, audit-grade transcriptomics pipeline

Send a short note and we will return a route preview, an owner, and a fit score. Project-tier engagements start at ₩8M.

Which organism and assay type are you working with? (Homo sapiens, Mus musculus, microarray, bulk RNA-seq, single-cell, spatial, etc.)

Do you have raw data, and in what format? (FASTQ / SRA / BAM for RNA-seq, CEL / IDAT for microarray, or processed matrices only.)

What decision does this pipeline need to support? (target ID, cohort selection, manuscript figure, audit-grade dataset, regulatory submission, etc.)

24h response target · Short intake, clear next step

Submissions are routed into the Brown Biotech Notion intake hub.

Triage preview

Send a concise project brief

Share just enough context to route the request well. You'll see the route, owner, approval gate, and next action after submit.