gapsmith
gapsmith is a Rust reimplementation of gapseq. It reconstructs genome-scale metabolic models from bacterial proteomes: pathway detection, transporter detection, draft model assembly, growth-medium inference, and iterative gap-filling — all as a single static binary.
What it does
Given a protein FASTA, gapsmith:
- Finds pathways — aligns proteins against per-reaction reference
FASTAs and scores pathway completeness (
gapsmith find). - Finds transporters — matches against a TCDB-derived substrate
table (
gapsmith find-transport). - Assembles a draft model — combines the above into a
stoichiometrically consistent metabolic network (
gapsmith draft). - Infers a growth medium — applies boolean rules over detected
reactions to predict minimal-medium compounds (
gapsmith medium). - Gap-fills — an iterative pFBA + knock-out loop adds the minimal
set of extra reactions needed for biomass growth (
gapsmith fill).
Each step is independently callable, or chain them end-to-end with
gapsmith doall.
One-liner
gapsmith doall genome.faa.gz -f output/
This produces a gap-filled SBML model that loads directly in COBRApy, COBRAToolbox, or any SBML-compatible tool.
Why a rewrite?
Upstream gapseq is R + bash (~9k LOC). gapsmith replaces the R stack —
including the cobrar wrapper and external GLPK/CPLEX — with:
- In-process HiGHS via
good_lpfor FBA/pFBA - Native CBOR model format (replaces R's RDS)
- Single static binary, no R install
--aligner precomputedmode for re-using one alignment across many genomesbatch-alignfor clustering N genomes with mmseqs2 and amortising the per-reaction alignment
Benchmarks show ~3× faster pathway detection on real bacterial proteomes and 35–40% lower peak memory. See Comparison with upstream gapseq for details.
Scaling out: metagenomes and multi-genome runs
gapsmith also ships optional, additive features for community and multi-genome workloads:
--gspa-run— consume protein clusters + alignments produced upstream by gspa so thousands of genomes share one alignment.doall-batch— rayon-parallel driver with a--shard i/Nselector for SLURM array jobs (scales from one machine to 1 M genomes).community per-mag— per-MAG FBA under a shared (union) medium. Linear in N MAGs.community cfba— compose N drafts into one community model with a shared_e0pool and a weighted-sum biomass objective; optional balanced-growth constraint for classical cFBA semantics.
See Multi-genome & metagenome workflows for the full recipe.
Where to start
- Install & first run: User guide
- Every CLI flag: CLI reference
- Metagenomes / multi-genome runs: Multi-genome workflows
- How the crates fit together: Architecture
- R → Rust module mapping: Feature matrix
- Intentional deviations from upstream: Porting notes
gapsmith user guide
A Rust reimplementation of gapseq — informed prediction and analysis of bacterial metabolic pathways and genome-scale networks. Single static binary, in-process LP solver, faster on every stage of the pipeline (see benchmarks in the README).
This guide walks a new user through installing gapsmith, reconstructing a genome-scale metabolic model end-to-end, and understanding each stage of the output.
1. Install
Binary prerequisites
gapsmith shells out to external aligners and a couple of HMM tools. Install whichever aligner you want to use — you only need one:
| Tool | Required for | apt / bioconda |
|---|---|---|
blastp + makeblastdb | --aligner blastp (default) | ncbi-blast+ |
diamond | --aligner diamond | diamond |
mmseqs | --aligner mmseqs2 | mmseqs2 |
bedtools | none (legacy; only needed for upstream gapsmith find shell pipeline) | bedtools |
--aligner precomputed skips the aligner entirely and reads a user-
supplied TSV; see Precomputed alignment mode
below.
Reference data
gapsmith needs the standard gapseq dat/ directory (SEED reactions +
metabolites, biomass templates, medium rules, pathway tables). Get it by
cloning the upstream repo:
git clone --depth 1 https://github.com/jotech/gapseq.git
export GAPSMITH_DATA_DIR=$PWD/gapseq/dat
Or point to it explicitly on every invocation with --data-dir.
Reference sequences
Sync the genome-specific BLAST/diamond reference FASTA database from Zenodo:
gapsmith update-sequences -t Bacteria # pinned release
gapsmith update-sequences -t Bacteria -Z latest # newest Zenodo version
gapsmith update-sequences -c # check-only, no download
Without this, find / find-transport / doall will fail — the FASTA
files they align against live under <data-dir>/seq/<Taxonomy>/.
Build
git clone <this-repo>
cd gapsmith
cargo build --release --workspace
# Binary is target/release/gapsmith
The HiGHS LP solver is bundled via good_lp → highs-sys (CMake'd from
source on first build; ~30 s). An optional cbc feature adds CBC as a
fallback solver — requires coinor-libcbc-dev on the build host:
cargo build --release --workspace --features gapsmith-fill/cbc
2. Quick start — reconstruct E. coli K-12 in one command
# Download an E. coli reference proteome
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/all_assembly_versions/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
# Reconstruct: find→find-transport→draft→medium (auto)→fill
gapseq --data-dir $GAPSMITH_DATA_DIR doall \
GCF_000005845.2_ASM584v2_protein.faa.gz \
-f ecoli_out/ \
-A diamond \
-b 200 \
-k 0.01
# Inspect the result
gapsmith fba ecoli_out/GCF_000005845.2_ASM584v2_protein-filled.gmod.cbor
Outputs in ecoli_out/:
<genome>-all-Reactions.tbl,<genome>-all-Pathways.tbl— find results<genome>-Transporter.tbl— find-transport result<genome>-draft.gmod.cbor,<genome>-draft.xml— draft model<genome>-medium.csv— inferred growth medium<genome>-filled.gmod.cbor,<genome>-filled.xml— gap-filled model<genome>-filled-added.tsv— list of reactions added by each fill step
The filled CBOR or SBML loads directly in COBRApy, COBRAToolbox, cobrar, or any SBML-aware tool.
3. Individual stages
3.1 gapsmith find — pathway and reaction detection
Aligns the genome's proteins against gapseq's per-reaction reference
FASTAs, scores each pathway for completeness, and writes a
*-Reactions.tbl + *-Pathways.tbl.
# One pathway by MetaCyc id
gapsmith find -p PWY-6587 genome.faa
# A category shorthand — expands to gapseq's internal hierarchy match
gapsmith find -p amino genome.faa
# ↑ one of: amino | nucl | cofactor | carbo | polyamine |
# fatty | energy | terpenoid | degradation |
# core | min | kegg | all
# Every active pathway in metacyc + custom (gapseq's default)
gapsmith find -p all -l metacyc,custom genome.faa
# Alternative aligner
gapsmith find -A diamond -p all genome.faa
gapsmith find -A mmseqs2 -p all genome.faa
Key options:
-b/--bitcutoff— BLAST bitscore cutoff for "high-evidence" hits (default 200).-c/--coverage— minimum query coverage (default 75%).-a/--completeness-main— pathway-completeness cutoff (default 80%).-k/--completeness-hints— relaxed cutoff when every key reaction is present (default 66%).-n/--include-superpathways— by default superpathways are filtered out (matches upstream gapseq).-t/--taxonomy—Bacteria(default) orArchaea.
The output *-Pathways.tbl is byte-identical to real gapseq's output
on tested cases (-p PWY-6587 and -p amino on ecoli). See the
porting-notes for the handful of intentional
deviations.
3.2 gapsmith find-transport — transporter detection
Iterates gapseq's transporter substrate list (dat/subex.tbl) and aligns
the genome against both TCDB and the curated SEED transporter FASTAs.
gapsmith find-transport -A diamond -b 50 genome.faa
# → <stem>-Transporter.tbl
Row count and the TC-identifier set match real gapseq on the ecoli toy genome (550 distinct TC ids, 1808 rows).
3.3 gapsmith draft — build a draft model
Combines the *-Reactions.tbl + *-Transporter.tbl with the SEED
reaction DB into a metabolic model. Adds biomass, exchanges, diffusion
reactions, conditional transporters, dedups by stoichiometric hash.
gapsmith draft \
-r ecoli-all-Reactions.tbl \
-t ecoli-Transporter.tbl \
-b neg \
-o drafts/
# → drafts/ecoli-draft.gmod.cbor (native format)
# drafts/ecoli-draft.xml (SBML L3V1+FBC2+groups, validates
# against libSBML with 0 errors)
The -b flag picks the biomass template:
auto— readgram=from the Reactions.tbl header (gapseq defaults).pos/neg— Gram+ / Gram-.archaea— Archaea.<path/to/custom.json>— user biomass in the same JSON format asdat/biomass/biomass_Gram_neg.json.
3.4 gapsmith medium — rule-based medium inference
Evaluates 82 boolean rules in dat/medium_prediction_rules.tsv against
the draft + Pathways.tbl to infer a growth medium.
gapsmith medium \
-m drafts/ecoli-draft.gmod.cbor \
-p ecoli-all-Pathways.tbl \
-o ecoli-medium.csv
# Manual overrides — force oxygen uptake 0 to predict anaerobic
gapsmith medium ... -c "cpd00007:0;cpd00011:0"
The inferred medium CSV is compatible with gapsmith fill -n. On the
ecoli toy genome the output is byte-identical to upstream gapseq's
toy/ecoli-medium.csv.
3.5 gapsmith fill — iterative gap-filling
4-phase pFBA + KO-essentiality pipeline. Steps 1 + 2 + 2b are the
default "quick" path; Steps 3 + 4 run only with --full-suite.
gapsmith fill \
drafts/ecoli-draft.gmod.cbor \
-n ecoli-medium.csv \
-r ecoli-all-Reactions.tbl \
-k 0.01 \
-o filled/
# Full 4-phase suite (adds energy-source + fermentation screens;
# 10-30 minutes on a whole bacterium)
gapsmith fill ... --full-suite
# Opt into the futile-cycle prune — drops candidate reactions that
# saturate at 0.99·max_flux with all boundaries closed. Expensive on
# large candidate pools.
gapsmith fill ... --prune-futile
Phase breakdown:
- Step 1 — user medium + biomass target. Bulk of the fills.
- Step 2 — MM_glu + available carbon sources; iterate every biomass substrate, attach a sink per substrate, gap-fill if infeasible.
- Step 2b — aerobic / anaerobic variant, depending on whether the user medium has O₂.
- Step 3 (
--full-suite) — energy-source screen with ESP1–5 pseudo-reactions (menaquinone / ubiquinone / NADH / ferredoxin / plastoquinone redox couples). - Step 4 (
--full-suite) — fermentation-product screen: maximise outflow on each exchange compound.
Output: <genome>-filled.gmod.cbor + .xml + -added.tsv (rxn id +
step number for every addition).
3.6 gapsmith fba — FBA / pFBA on a model
Utility to solve FBA or parsimonious FBA on a CBOR/JSON model. Useful for validating a draft before gap-filling, or for post-hoc analysis.
gapsmith fba drafts/ecoli-draft.gmod.cbor
gapsmith fba --pfba --min-growth 0.01 --top 20 filled/ecoli-filled.gmod.cbor
gapsmith fba -r EX_cpd00027_e0 filled/ecoli-filled.gmod.cbor
Split-flux LP via good_lp with HiGHS backend. Solves the 2378×2953
ecoli draft in under 100 ms.
3.7 gapsmith adapt — edit reactions + force growth
# Add specific SEED rxns
gapsmith adapt -m model.gmod.cbor -a rxn00011,rxn00102
# Add every rxn in a MetaCyc pathway
gapsmith adapt -m model.gmod.cbor -a "PWY-6587"
# Remove
gapsmith adapt -m model.gmod.cbor -r rxn00011
# Force growth on D-glucose (runs gap-fill)
gapsmith adapt -m model.gmod.cbor -w cpd00027:TRUE -b Reactions.tbl
# Force NO growth on glucose (closes uptake)
gapsmith adapt -m model.gmod.cbor -w cpd00027:FALSE
3.8 gapsmith pan — pan-draft from N drafts
# Glob a directory of drafts
gapsmith pan -m drafts/ -t 0.06 -f pan_out/
# Explicit list
gapsmith pan -m a-draft.gmod.cbor,b-draft.gmod.cbor,c-draft.gmod.cbor -t 0.5
# Only emit the binary presence/absence table
gapsmith pan -m drafts/ -b -f pan_out/
-t sets the minimum across-draft reaction frequency for inclusion in
the pan-draft. Default 0.06 matches upstream gapseq.
3.9 gapsmith doall — full pipeline in one command
gapsmith doall genome.faa.gz -f out/ -A diamond -b 200 -k 0.01
# Chains: find -p all → find-transport → draft → medium (auto) → fill
Add --full-suite to include fill Steps 3/4. Use -m <medium.csv> to
skip the medium-inference step and supply your own medium.
4. Precomputed alignment mode
For many-genome batch runs, gapsmith can skip the per-genome aligner call entirely and read a pre-computed alignment TSV:
# Pre-run BLAST / diamond / mmseqs2 yourself (or use gapsmith batch-align)
diamond makedb --in reference.faa -d ref.dmnd
diamond blastp -q genome.faa -d ref.dmnd -o hits.tsv \
--outfmt 6 qseqid sseqid pident qcovs evalue bitscore stitle sstart send
# Feed to find / find-transport
gapsmith find -A precomputed -P hits.tsv -p all genome.faa
gapsmith find-transport -A precomputed -P hits.tsv genome.faa
gapsmith batch-align — cluster N genomes + single alignment pass
When you're annotating many genomes with the same reference, run one cluster + alignment over all of them and expand per-genome hits:
gapsmith batch-align \
-q reference.faa \
-g genomes_dir/ \
-o aligned/ \
--aligner diamond \
--cluster-identity 0.5 \
--cluster-coverage 0.8
# → aligned/<genome_id>.tsv per genome in genomes_dir/
Amortises the alignment cost over many genomes — especially effective when the proteomes share a lot of sequence homology.
4b. Multi-genome, metagenome, and gspa integration
When annotating 100+ genomes — or when the input is a metagenome —
reach for these additional subcommands instead of looping doall:
--gspa-run — consume precomputed clusters from gspa
If you have a gspa run directory (mmseqs2-clustered proteins + alignment against cluster reps), point any per-genome subcommand at it:
gapsmith doall genome.faa -f out/ \
--gspa-run /path/to/gspa-run/ \
--gspa-genome-id my-genome
gapsmith skips the aligner entirely, reads the rep-level hits, and fans them out to this genome's cluster members.
gapsmith doall-batch — parallel N-genome driver
# 100 genomes on one machine, reusing a gspa cluster, 32 rayon workers
gapsmith doall-batch --gspa-run gspa-run/ -f out/ --jobs 32
# 1 M genomes on SLURM: 1024 array tasks, each processes ~1000 genomes
sbatch --array=0-1023 slurm_script.sh # script passes --shard ${SLURM_ARRAY_TASK_ID}/1024
Key flags: --genomes-dir / --genomes-list / --gspa-run, --jobs,
--shard i/N, --continue-on-error.
gapsmith community per-mag — shared-medium per-MAG FBA
For metagenomes with 50+ MAGs where a full cFBA isn't tractable:
gapsmith community per-mag \
--gspa-run gspa-run/ \
--drafts-root out/ \
-o community/
Outputs community-medium.csv (union of per-MAG media) and
per-mag-growth.tsv with abundance-weighted community growth.
gapsmith community cfba — full community LP
For small, curated communities (≤ ~50 organisms) where cross-feeding matters:
gapsmith community cfba \
--drafts-dir out/drafts/ \
--abundance abundances.tsv \
--balanced-growth \
-o community/
Composes N drafts into one block-diagonal model with a shared _e0
extracellular pool, builds a weighted-sum biomass objective, and
optionally constrains every organism to the community's growth rate.
→ Full recipe: Multi-genome & metagenome workflows.
5. Solver, format, and file-layout notes
5.1 Model format
- Native:
.gmod.cbor— CBOR-encodedModel(fast load, compact). - JSON:
.json— human-readable, viagapsmith convert. - SBML:
.xml— SBML L3V1 + FBC2 + groups. 0 libSBML errors on every emitted model. Loads cleanly in COBRApy / COBRAToolbox.
Conversion between native formats:
gapsmith convert model.gmod.cbor model.json --pretty
gapsmith convert model.json model.gmod.cbor
gapsmith export-sbml model.gmod.cbor model.xml
5.2 LP solver
good_lp 1.15 + HiGHS (bundled). The HiGHS crate statically links its
C++ solver via CMake; no system HiGHS needed.
Optional CBC fallback: compile with --features gapsmith-fill/cbc and
pass --cbc-fallback in the heuristic options. Requires
coinor-libcbc-dev + coinor-libclp-dev on the build host.
5.3 Sequence database layout
<seq-dir>/<Taxonomy>/
├── rev/<reaction>.fasta ← curated reviewed sequences
├── unrev/<reaction>.fasta ← unreviewed / broader
├── rxn/<rxn_id>.fasta ← per-SEED-reaction FASTAs
├── tcdb.fasta ← transporter DB
├── transporter.fasta ← curated transporter hits
├── user/<reaction>.fasta ← your own references (optional)
└── version_seqDB.json ← Zenodo version stamp
gapsmith find / find-transport resolve reference FASTAs in this
priority order: user/ → rxn/ → rev/<EC> → unrev/<EC> →
rev/<md5(name)> → unrev/<md5(name)>. Matching real gapseq's
prepare_batch_alignments.R:150-234.
6. Troubleshooting
"external tool makeblastdb exited with status 1"
The tool was fed a gzipped FASTA. gapsmith doall auto-decompresses
.gz; for standalone find / find-transport invocations, ungzip
first.
"full candidate pool is infeasible"
The draft + every SEED reaction together can't produce biomass. Either:
- The biomass template asks for a metabolite no SEED reaction produces (peptidoglycan cpd15665 on some drafts — see porting-notes for the known cycle gap), or
- Your medium is missing a critical nutrient (run
gapsmith medium --manual-fluxto check what the rule-based inference assigns).
Use gapsmith fba --pfba --min-growth 0.001 on the draft to narrow the
biomass-component that's blocking.
pFBA heuristic "max iterations reached"
HiGHS exhausted the tolerance + pFBA-coefficient ladder. Options:
- Widen the medium (add more carbon sources).
- Lower
-k(minimum growth rate). - Rebuild with
--features cbcand opt-in to the CBC fallback.
Draft has 0 reactions / stays empty
The upstream *-Reactions.tbl has dbhit= empty on every row. That
means none of the reference FASTAs were found for the pathways you
asked about. Verify <seq-dir>/<Taxonomy>/ is populated
(gapsmith update-sequences -c).
7. Performance tips
- Release build:
cargo build --release --workspace. Debug builds are ~50× slower for the LP-heavy stages. - Use diamond over BLASTp:
-A diamond. Comparable accuracy, 5–20× faster on large proteomes. - Parallelism:
-K <threads>on every aligner-invoking subcommand. gapsmith is rayon-parallel in the LP-free stages (find, transport, medium) too. - Batch mode: if you're annotating 10+ genomes,
gapsmith batch-align--aligner precomputedamortises the alignment cost.
- Multi-genome scale-out: for 100+ genomes or 1k+ metagenomes, use
gapsmith doall-batchwith--gspa-runso the expensive alignment is done exactly once across the whole collection. Shard across SLURM array tasks with--shard i/N. See Multi-genome workflows.
8. Where to look next
porting-notes.md— every intentional deviation from upstream R gapseq.feature-matrix.md— exhaustive feature list with R source → Rust module pointers.- README.md — status + benchmarks.
- Per-crate rustdoc:
cargo doc --workspace --no-deps --open.
CLI reference
Every gapseq subcommand, exhaustive flag list. Generated from
--help, annotated per-section. Run any command with -h to see
these locally.
Global options
These apply to every subcommand:
| Flag | Description |
|---|---|
--data-dir <PATH> | Override gapseq reference-data directory. Resolution order when absent: GAPSMITH_DATA_DIR env → $XDG_DATA_HOME/gapsmith → <exe-dir>/../dat → ./dat. |
--seq-dir <PATH> | Override the sequence database directory. Defaults to <data-dir>/seq. |
-K, --threads <N> | Worker threads. Default: all available cores. |
-v, -vv, -vvv | Logging verbosity (info / debug / trace). |
-h, --help | Print help. |
gapsmith doall
End-to-end pipeline: find → find-transport → draft → medium → fill.
| Flag | Default | Description |
|---|---|---|
<GENOME> | — | Required. Protein FASTA (.faa / .faa.gz). |
-A, --aligner | blastp | blastp, diamond, mmseqs2, or precomputed. |
-b, --bitcutoff | 200.0 | Bitscore cutoff for the find stages. |
-c, --coverage | 75 | Coverage cutoff (0–100 %). |
-l, --min-bs-core | 50.0 | Core-reaction bitscore cutoff. |
-t, --taxonomy | auto | Bacteria, Archaea, or auto (currently maps to Bacteria). |
-m, --medium | auto | Medium CSV, or auto to infer via gapsmith medium. |
-f, --out-dir | . | Output directory. |
-K, --threads | all cores | Thread count. |
--full-suite | off | Include fill Steps 3 + 4 (slow). |
-k, --min-growth | 0.01 | Minimum growth for gap-filling. |
--gspa-run <DIR> | — | Read precomputed cluster-rep hits from a gspa manifest instead of running the aligner. |
--gspa-genome-id <ID> | FASTA stem | Row id in the manifest's genomes.tsv matching this run. |
Gzipped inputs are auto-decompressed into a tempfile::tempdir.
gapsmith find
Pathway and reaction detection.
| Flag | Default | Description |
|---|---|---|
<GENOME> | — | Protein FASTA. |
-p, --pathways | all | Pathway keyword or id (or all). |
-l, --pathway-db | metacyc,custom | Comma-separated: metacyc, kegg, seed, custom, all. |
-t, --taxonomy | Bacteria | Reference subfolder under seq-dir. |
-A, --aligner | diamond | Aligner backend. |
-P, --precomputed | — | Precomputed alignment TSV (skip aligner). |
--gspa-run <DIR> | — | gspa manifest — expands rep-level hits onto this genome's cluster members. |
--gspa-genome-id <ID> | FASTA stem | Genome id in the manifest matching this invocation. |
--gspa-coverage-fraction | off | Treat the alignment TSV's qcov as 0–1 (mmseqs2 native). |
-b, --bitcutoff | 200.0 | Bitscore cutoff. |
-i, --identcutoff | 0.0 | Identity cutoff (%). |
-c, --coverage | 75 | Coverage cutoff. |
-a, --completeness-main | 80.0 | Pathway completeness cutoff. |
-k, --completeness-hints | 66.0 | Relaxed cutoff when every key reaction is present. |
-s, --strict | off | Disable key-reaction heuristic. |
-n, --include-superpathways | off | Default is to filter them out (matches upstream). |
-o, --out-dir | . | Output directory. |
-u, --suffix | — | Filename suffix <stem>-<suffix>-{Reactions,Pathways}.tbl. |
gapsmith find-transport
Transporter detection.
| Flag | Default | Description |
|---|---|---|
<GENOME> | — | Protein FASTA. |
-A, --aligner | blastp | Aligner backend. |
-P, --precomputed | — | Precomputed TSV. |
-b, --bitcutoff | 50.0 | Transport hits use a lower threshold than biosynthetic. |
-i, --identcutoff | 0.0 | Identity cutoff. |
-c, --coverage | 75 | Coverage cutoff. |
-a, --nouse-alternatives | off | Disable the alt-transporter fallback. |
-m, --only-met | — | Restrict to one substrate keyword. |
-o, --out-dir | . | Output directory. |
gapsmith draft
Build a draft metabolic model from find + find-transport output.
| Flag | Default | Description |
|---|---|---|
-r, --reactions | required | *-Reactions.tbl. |
-t, --transporter | required | *-Transporter.tbl. |
-b, --biomass | auto | auto / pos / neg / archaea / <path.json>. |
-n, --name | inferred | Model id. |
-u, --high-evi-rxn-bs | 200.0 | Bitscore threshold for core reactions. |
-l, --min-bs-for-core | 50.0 | Lower bound for candidate pool. |
-o, --out-dir | . | Output directory. |
--no-sbml | off | Skip SBML emission. |
gapsmith medium
Rule-based medium inference.
| Flag | Default | Description |
|---|---|---|
-m, --model | required | Draft CBOR / JSON. |
-p, --pathways | required | *-Pathways.tbl from find. |
-c, --manual-flux | — | cpdXXXXX:val;cpdYYYYY:val overrides. |
-o, --output | inferred | Output CSV path. |
-f, --out-dir | . | Output directory (when -o is absent). |
gapsmith fill
Iterative gap-filling.
| Flag | Default | Description |
|---|---|---|
<MODEL> | required | Draft CBOR / JSON. |
-n, --media | required | Medium CSV. |
-r, --reactions | — | *-Reactions.tbl for bitscore-weighted pFBA. |
-t, --target | cpd11416 | Biomass pseudo-metabolite id. |
-k, --min-growth | 0.01 | Minimum growth rate floor. |
-b, --bcore | 50.0 | Minimum bitscore for "core" classification. |
--high-evi | 200.0 | Bitscore-to-weight upper calibration point. |
--dummy-weight | 100.0 | Weight for reactions with no hit. |
-o, --out-dir | . | Output directory. |
--no-sbml | off | Skip SBML. |
--step1-only | off | Run only Step 1 (debugging). |
--full-suite | off | Include fill Steps 3 + 4. |
--prune-futile | off | Thermodynamic futile-cycle prune (opt-in; slow). |
gapsmith fba
FBA / pFBA on an existing model.
| Flag | Default | Description |
|---|---|---|
<MODEL> | required | CBOR / JSON model. |
-r, --objective | model's own | Objective reaction id. |
--pfba | off | Parsimonious FBA. |
--pfba-coef | 0.001 | pFBA biomass trade-off coefficient. |
--min-growth | 0.0 | Biomass floor (pFBA only). |
--top | 20 | Print the top-N highest-absolute-flux reactions. |
--minimise | off | Minimise instead of maximise. |
gapsmith adapt
Edit reactions or force growth on a compound.
| Flag | Default | Description |
|---|---|---|
-m, --model | required | Model file. |
-a, --add | — | Comma-separated rxnNNNNN or MetaCyc pathway ids. |
-r, --remove | — | Same format; reactions are removed. |
-w, --growth | — | cpdNNNNN:TRUE / cpdNNNNN:FALSE. |
-b, --reactions | — | *-Reactions.tbl (needed for -w ...:TRUE gap-filling). |
-k, --min-growth | 0.01 | Min growth floor for the -w ...:TRUE path. |
-f, --out-dir | . | Output directory. |
--no-sbml | off | Skip SBML. |
gapsmith pan
Build a pan-draft from N drafts.
| Flag | Default | Description |
|---|---|---|
-m, --models | required | Comma-separated / directory / single path. |
-t, --min-freq | 0.06 | Minimum across-draft reaction frequency. |
-b, --only-binary | off | Skip the pan-draft, emit only the rxn × model presence TSV. |
-f, --out-dir | . | Output directory. |
--no-sbml | off | Skip SBML. |
gapsmith update-sequences
Zenodo seqdb sync.
| Flag | Default | Description |
|---|---|---|
-t, --taxonomy | Bacteria | Which seqdb to sync. |
-D, --seq-dir | <data-dir>/seq | Seqdb location. |
-Z, --record | pinned | Zenodo record id, or latest. |
-c, --check-only | off | Report version, no download. |
-q, --quiet | off | Suppress progress messages. |
gapsmith convert
CBOR ↔ JSON round-trip.
| Flag | Default | Description |
|---|---|---|
<INPUT> | required | Path to source file. |
<OUTPUT> | required | Destination path. |
--to | from extension | cbor / json (overrides extension). |
--pretty | off | Pretty-print JSON. |
gapsmith export-sbml
Serialise a CBOR / JSON model as SBML L3V1 + FBC2 + groups.
| Flag | Default | Description |
|---|---|---|
<INPUT> | required | CBOR / JSON model. |
<OUTPUT> | required | Destination .xml path. |
--compact | off | Omit pretty-print / nested indentation. |
gapsmith align
Run a single aligner standalone. Useful for debugging reference-FASTA
issues or building a precomputed TSV to feed into find --aligner precomputed.
| Flag | Default | Description |
|---|---|---|
-A, --aligner | required | blastp / tblastn / diamond / mmseqs2 / precomputed. |
-q, --query | required | Query FASTA. |
-t, --target | required | Target FASTA. |
-P, --precomputed | — | When -A precomputed, path to the TSV. |
-b, --bitcutoff | 0.0 | Filter hits below this bitscore. |
-c, --coverage | 0 | Filter below this coverage %. |
-e, --evalue | 1e-5 | E-value cutoff. |
--extra | — | Passthrough flags to the aligner binary. |
-o, --out | stdout | Output TSV path. |
gapsmith batch-align
Cluster N genomes + single alignment + per-genome TSV expansion.
| Flag | Default | Description |
|---|---|---|
-q, --query | required | Reference query FASTA. |
-g, --genomes | required | Directory containing .faa(.gz) files. |
-o, --out-dir | required | Output <genome>.tsv directory. |
-A, --aligner | diamond | Aligner backend (blastp / diamond / mmseqs2). |
--cluster-identity | 0.5 | mmseqs2 cluster identity. |
--cluster-coverage | 0.8 | mmseqs2 cluster coverage. |
-b, --bitcutoff | 0.0 | Per-hit bitscore filter. |
-c, --coverage | 0 | Per-hit coverage filter. |
gapsmith doall-batch
Run doall across many genomes in parallel. See
multi-genome.md for the full recipe.
| Flag | Default | Description |
|---|---|---|
-g, --genomes-dir <DIR> | — | Directory of protein FASTAs. |
--genomes-list <TSV> | — | Explicit TSV list (id<TAB>path[<TAB>abundance]). |
--gspa-run <DIR> | — | Pulls the genome list from a gspa manifest. |
-f, --out-dir <DIR> | required | One <genome_id>/ subdir is written per input. |
-j, --jobs <N> | all cores | Rayon pool size. |
--shard <i/N> | — | Select genomes where index mod N == i. |
--continue-on-error | off | Log failed genomes to doall-batch-errors.tsv instead of aborting. |
(plus every passthrough flag from doall) |
gapsmith community
Community-level optimisation. Two subcommands.
gapsmith community per-mag
Per-MAG FBA under a shared (union) medium. Linear in N.
| Flag | Default | Description |
|---|---|---|
--drafts-dir <DIR> | — | Directory of <id>-{draft,filled}.gmod.cbor. |
--drafts-list <TSV> | — | <id><TAB><cbor>[<TAB><medium>]. |
--gspa-run <DIR> | — | Use manifest's genome list. |
--drafts-root <DIR> | . | Where per-genome doall outputs live (paired with --gspa-run). |
-m, --medium <CSV> | auto | Override the inferred shared medium. |
-o, --out-dir <DIR> | . | Writes community-medium.csv, per-mag-growth.tsv. |
gapsmith community cfba
Compose N drafts into one community model; solve weighted-sum biomass.
| Flag | Default | Description |
|---|---|---|
--drafts-dir <DIR> / --drafts-list <TSV> / --gspa-run <DIR> | — | Exactly one (same semantics as per-mag). |
--drafts-root <DIR> | . | Where doall outputs live. |
--abundance <TSV> | — | <id><TAB><abundance> overrides. |
--biomass-rxn <ID> | bio1 | Per-organism biomass reaction id. |
--balanced-growth | off | Add v(bio_k) == v(bio_community) constraints. |
-m, --medium <CSV> | — | Optional shared medium CSV. |
-o, --out-dir <DIR> | . | Writes community.gmod.cbor, community-fba.tsv. |
gapsmith db inspect
Smoke-test the --data-dir. Loads every reference table and prints row
counts. No flags beyond --data-dir.
gapsmith test
Print resolved paths + which external tool binaries are on PATH.
gapsmith example-model
Emit a small hand-built toy model (3 metabolites / 2 reactions) as CBOR. Useful for smoke-testing format round-trips and downstream tools.
| Flag | Default | Description |
|---|---|---|
<OUTPUT> | required | Destination CBOR path. |
--complex | off | Emit a slightly bigger "complex" variant. |
Multi-genome & metagenome workflows
Gapsmith's single-genome pipeline (doall) stays the default. This chapter
covers three additional, opt-in capabilities for running at scale:
--gspa-run— consume precomputed protein clusters + alignments from the upstream gspa pipeline instead of re-running DIAMOND/BLASTp per genome.doall-batch— rayon- and SLURM-friendly driver that fansdoallacross N genomes (targets 1 000 metagenomes; scales to ~1 M genomes via--shard).community— community-level objectives for metagenomes: a cheap per-MAG mode with a shared union medium, and a full cFBA mode that composes N drafts into one model.
Nothing in here breaks single-genome usage. If you never pass --gspa-run
or use the new subcommands, behaviour is identical to earlier releases.
1. Consuming a gspa manifest (--gspa-run)
Why
Annotating N genomes with a single mmseqs2/linclust cluster + one
alignment against cluster representatives is dramatically cheaper than
N independent BLASTp runs. gspa's Phase 11 already produces that
cluster; --gspa-run plugs the output into gapsmith find /
find-transport / doall.
Manifest layout (v1)
gspa writes (or you assemble by hand) a directory with this shape:
<gspa-run>/
clusters.tsv # 3 cols: rep_id<TAB>member_id<TAB>genome_id
alignment/<ref>.tsv # 8-col gapsmith TSV keyed on rep_id
genomes.tsv # 2–3 cols: genome_id<TAB>faa_path[<TAB>abundance]
gspa-manifest.toml # optional — version/provenance marker
clusters.tsv: every (rep, member, genome) triple emitted by the cross-genome mmseqs2 cluster. The rep appears on its own line as its own member.alignment/*.tsv: one or more 8-column files in gapsmith's canonical format (qseqid pident evalue bitscore qcov stitle sstart send). All.tsv/.tab/.txtfiles in that directory are concatenated transparently — useful when you run DIAMOND and mmseqs2 separately and want both sources of evidence.genomes.tsv: the authoritative list of organism ids + FASTA paths. The optional 3rd column carries a relative abundance (any positive float — RPKM / TPM / raw counts / …) that propagates intocommunity cfba.
Usage
# Single genome, alignment reused from an already-built gspa manifest:
gapsmith doall /path/to/genome.faa -f out/ \
--gspa-run /path/to/gspa-run/ \
--gspa-genome-id genome
# If the FASTA stem already matches the genome_id in genomes.tsv,
# --gspa-genome-id can be omitted.
--gspa-coverage-fraction re-scales mmseqs2-native coverage columns
(0–1) up to the gapsmith convention (0–100). Diamond/BLASTp TSVs are
already 0–100 and need no flag.
What happens under the hood
gapsmith_align::GspaRunAligner implements the Aligner trait. On
align() it:
- Reads
alignment/*.tsvlazily (once, cached). - Looks up each rep hit in
clusters.tsvfiltered totarget_genome_id. - Emits one
Hitper member, rewritingqseqidto the member's original protein id so downstream find / find-transport see a normal per-genome hit table.
2. Batch reconstruction (doall-batch)
When to reach for it
| Scale | Recommended tool |
|---|---|
| 1 genome | gapsmith doall |
| 10 – 200 genomes, 1 box | gapsmith doall-batch --jobs <cores> |
| 1 k – 1 M genomes, cluster | doall-batch --shard i/N inside a SLURM array |
Flags that matter
--genomes-dir <dir>/--genomes-list <tsv>/--gspa-run <dir>— exactly one; the last option pulls the genome list straight out of the gspa manifest.--jobs N— rayon pool size. Each worker processes one genome at a time; we pindoall's own--threadsto 1 so rayon gets clean parallelism.--shard i/N— SLURM-array-friendly selector. After sorting the genome list alphabetically, only genomes whose 0-based indexksatisfiesk mod N == iare processed. Every shard is independent — no shared state, no cross-task coordination.--continue-on-error— log-and-skip instead of aborting the batch when a single genome'sdoallfails. Failures are written to<out-dir>/doall-batch-errors.tsv.
Recipes
100 genomes on one box, reusing a gspa cluster
gapsmith doall-batch \
--gspa-run /data/gspa-run/ \
-f /out/reconstructions/ \
--jobs 32 \
--continue-on-error
Each <genome_id>/ subdirectory under /out/reconstructions/ gets the
standard doall outputs.
1 M genomes on a SLURM cluster
#!/bin/bash
#SBATCH --array=0-1023
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00
gapsmith doall-batch \
--gspa-run /scratch/gspa-run/ \
-f /scratch/recon/ \
--jobs $SLURM_CPUS_PER_TASK \
--shard ${SLURM_ARRAY_TASK_ID}/1024 \
--continue-on-error
- One gspa run on the full 1 M genome set covers the expensive alignment step globally.
- 1 024 SLURM tasks each process ~1 000 genomes in parallel; add tasks or add cores per task to change the balance.
3. Community optimisation (community)
Two subcommands; pick the one that matches your accuracy / scale trade-off. You can run both on the same input and compare.
community per-mag — scalable default
Each MAG keeps its own GEM and its own FBA. We (a) build a shared
medium that is the union of every per-MAG medium (largest maxFlux per
compound wins), (b) run FBA on every draft under that shared medium,
and (c) report the abundance-weighted mean growth rate.
gapsmith community per-mag \
--gspa-run /data/gspa-run/ \
--drafts-root /out/reconstructions/ \
-o /out/community/
Outputs:
community-medium.csv— the shared medium.per-mag-growth.tsv—genome_id, abundance, growth_rate, statusper MAG.
Complexity: linear in N; no community-level LP. This is the mode for
≥ 100 MAGs where a full cFBA isn't tractable.
community cfba — full community LP
Composes N drafts into one block-diagonal model. Private mets/rxns are
suffixed __<genome_id>; extracellular mets (_e0) and exchange
reactions (EX_*_e0) are shared so organisms cross-feed through a
single pool. Bounds on shared exchanges are widened to the most
permissive input.
A synthetic bio_community reaction is added whose flux equals
Σ w_k · bio_k (weights come from --abundance or genomes.tsv).
gapsmith community cfba \
--gspa-run /data/gspa-run/ \
--drafts-root /out/reconstructions/ \
--abundance /data/abundances.tsv \
--balanced-growth \
-m /data/community-medium.csv \
-o /out/community/
Outputs:
community.gmod.cbor— composed community model.community-fba.tsv— per-organism biomass flux plus the community-level objective.
--balanced-growth adds one linear constraint per organism forcing
v(bio_k) == v(bio_community), matching classical cFBA semantics. Drop
the flag for an abundance-weighted average without the equal-growth
constraint (faster, more realistic when abundances are very uneven).
Scaling: the community LP grows roughly N × (rxns + mets). Expect
comfortable behaviour up to ~50 organisms; larger communities are
better served by per-mag first, with cFBA applied to selected
sub-communities.
4. Decision tree
"metagenome / multi-MAG run"
│
┌────────────────┴────────────────┐
│ │
small curated set thousands of MAGs
(≤ ~50 organisms) or fast iteration
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ need obligate │ │ need obligate │
│ cross-feeding?│ │ cross-feeding?│
└───────┬───────┘ └───────┬───────┘
yes │ │ no yes │ │ no
│ │ │ │
community cfba community per-mag community cfba on community per-mag
--balanced-growth selected cohorts
Rules of thumb:
- 1 — 50 organisms, synthetic or curated community:
cfba. - 50 — 1 000 organisms, real-world metagenome:
per-mag. -
1 000 organisms or iterating over many assemblies: always start with
doall-batch+per-mag; drop tocfbaonly for sub-cohorts.
5. Producing a gspa run from outside gspa
Any tool that can emit the three files above works. A minimal hand-rolled example:
# 1. Concatenate proteomes, tagging each header with its genome id.
mkdir -p run/
python tools/concat_with_prefix.py --in genomes/ --out run/all.faa
# 2. Cluster across genomes.
mmseqs easy-cluster run/all.faa run/cluster run/tmp \
--min-seq-id 0.5 -c 0.8 --threads 32
# 3. Pivot the 2-col `run/cluster_cluster.tsv` into 3-col clusters.tsv
# by peeling each member's `<genome_id>|` prefix.
awk -F$'\t' '{
member=$2; rep=$1;
n = split(member, a, "|"); gid=a[1];
orig=substr(member, length(gid)+2);
print rep "\t" orig "\t" gid
}' run/cluster_cluster.tsv > run/clusters.tsv
# 4. Run alignment ONCE against cluster reps.
mkdir -p run/alignment
diamond makedb --in reference.faa -d ref
diamond blastp -q run/cluster_rep_seq.fasta -d ref \
--outfmt 6 qseqid pident evalue bitscore qcovhsp stitle sstart send \
--more-sensitive -k 10 -e 1e-5 > run/alignment/reference.tsv
# 5. genomes.tsv (optional 3rd column = abundance).
for f in genomes/*.faa; do
id=$(basename "$f" .faa)
echo -e "$id\t$(realpath "$f")"
done > run/genomes.tsv
After that, gapsmith doall-batch --gspa-run run/ --jobs 32 -f out/ is
all you need.
Architecture
A bird's-eye view of how gapsmith is organised, how data flows through it, and what each crate is responsible for.
1. Crate dependency graph
gapsmith-core ← Types only, no I/O / DB / solver
↑ ↑ ↑
│ │ │
┌─────────┘ │ └────────────────┐
│ │ │
gapsmith-io gapsmith-db gapsmith-sbml
│ │ ↑
└─┬──────────┘ │
│ │
▼ │
gapsmith-align ──┐ │
│ │ │
▼ │ │
gapsmith-find │ │
│ │ │
├─▶ gapsmith-transport │
│ │
▼ │
gapsmith-draft ──────────────────────┤
│ │
▼ │
gapsmith-fill ◀─── gapsmith-medium │
│ │
└──────────────▶ gapsmith-cli ◀──┘
Arrows point from "depended on" toward "depends on". gapsmith-core is
the leaf: every other crate builds on its Model / Reaction /
Metabolite / StoichMatrix types.
2. Data flow for gapsmith doall
genome.faa.gz
│
▼
┌──────────────┐ ┌──────────────────────┐
│ gapsmith find │ ──── aligner (shell) │ dat/seq/<tax>/ │
│ -p all │ ─ BLAST/ │ rev/ unrev/ rxn/ │
└──────┬───────┘ diamond/ └──────────────────────┘
│ mmseqs2
│ *-Reactions.tbl + *-Pathways.tbl
▼
┌──────────────────────┐
│ gapsmith find-transport│ ─── same aligner, subex.tbl + tcdb.fasta
└──────┬───────────────┘
│ *-Transporter.tbl
▼
┌────────────────────┐
│ gapsmith draft │ ─── seed_reactions_corrected.tsv, biomass JSON
└──────┬─────────────┘
│ *-draft.gmod.cbor (CBOR) + *-draft.xml (SBML)
▼
┌──────────────────┐
│ gapsmith medium │ ─── medium_prediction_rules.tsv + *-Pathways.tbl
│ (auto) │
└──────┬───────────┘
│ *-medium.csv
▼
┌────────────────────┐
│ gapsmith fill │ ─── HiGHS LP (in-process via good_lp)
│ 4-phase suite │
└──────┬─────────────┘
│ *-filled.gmod.cbor + *-filled.xml + *-filled-added.tsv
▼
(COBRApy, COBRAToolbox, cobrar, …)
Every arrow between stages is plain filesystem I/O. No daemon, no IPC.
Each subcommand is individually re-runnable; if you change the medium
you only re-run fill.
3. Model representation
[gapsmith_core::Model] is the single source of truth:
#![allow(unused)] fn main() { pub struct Model { pub annot: ModelAnnot, // provenance metadata pub compartments: Vec<Compartment>, pub mets: Vec<Metabolite>, pub rxns: Vec<Reaction>, pub genes: Vec<GeneId>, pub s: StoichMatrix, // sprs CsMat in CSC } }
Serialised via serde as:
- CBOR (
.gmod.cbor) — native, compact, fast load. The gapsmith internal format; replaces upstream's.RDS. - JSON (
.json) — human-readable;gapsmith convertconverts both ways. Same semantic content as CBOR. - SBML (
.xml) — Level 3 Version 1 + FBC 2 + groups 1. Written by [gapsmith_sbml::write_sbml]; loads cleanly in COBRApy / COBRAToolbox / cobrar.
Metabolite ids use the cpd00001_c0 convention (compartment baked into
the id) for SBML SId compliance. Reaction ids likewise — rxn00001_c0,
EX_cpd00001_e0, bio1, etc.
4. LP plumbing
The gap-filler does a lot of LPs — dozens per Step 1 / 2 / 2b / 3 / 4
run, sometimes thousands in a --full-suite. Key design decisions:
Split-flux encoding
Every reaction r becomes two variables vp_r, vn_r ≥ 0. The net
flux is v_r = vp_r − vn_r. This makes |v_r| = vp_r + vn_r linear —
essential for pFBA.
Bound translation
model [lb, ub] | vp_r upper | vn_r upper |
|---|---|---|
lb ≥ 0, ub ≥ 0 | ub | 0 |
lb ≤ 0, ub ≥ 0 | ub | −lb |
lb ≤ 0, ub ≤ 0 | 0 | −lb |
Strict-direction reactions additionally get vp_r ≥ lb (forward-forced)
or vn_r ≥ −ub (backward-forced).
Mass balance
Walk the CSC matrix once column-by-column (O(nnz) total). For each
non-zero (row, col, coef), accumulate coef · (vp[col] − vn[col])
onto row row's expression. The naive per-row scan would be
O(m · n) — ~45 B iterations on a 3k×5k model. The O(nnz) walk
solves it in under 100 ms on ecoli.
Solver
good_lp 1.15 with HiGHS backend. HiGHS is statically linked via
CMake at build time — no runtime HiGHS dependency. Optional CBC
fallback behind --features cbc: pfba_heuristic retries once with
CBC after HiGHS exhausts the tolerance / coefficient ladder.
5. Alignment layer
gapsmith-align exposes a single [Aligner] trait:
#![allow(unused)] fn main() { pub trait Aligner { fn run(&self, query: &Path, targets: &[&Path], opts: &AlignOpts) -> Result<Vec<Hit>, AlignError>; } }
Implementors — one struct each — shell out to the matching binary and
normalise its output to a [Hit] with the gapseq-canonical column
set (qseqid / pident / evalue / bitscore / qcovs / stitle / sstart /
send).
- [
BlastpAligner] builds an indexed BLAST DB in a temp dir per call. - [
DiamondAligner] mirrors gapseq's--more-sensitivedefault. - [
Mmseqs2Aligner] uses the explicitcreatedb → search → convertalis4-command pipeline (noteasy-search— see porting-notes for why). - [
PrecomputedTsvAligner] skips all shelling out and reads a user-supplied TSV. - [
BatchClusterAligner] is the Rust-only feature: mmseqs2-clusters N input genomes, runs one alignment, expands cluster membership back to per-genome hits. - [
GspaRunAligner] is the multi-genome / metagenome entry point: given a gspa manifest (clusters.tsv+alignment/*.tsv+genomes.tsv) and a target genome id, it fans every rep-level hit onto that genome's cluster members. No alignment is re-run — gspa clustered across N genomes once, upstream.
6. Candidate-pool / gap-fill
The 4-phase gap-fill suite (gapsmith-fill::run_suite) is the heaviest
computation in the pipeline. Sketch of one fill call:
draft model + medium CSV + Reactions.tbl (bitscore weights)
│
▼
┌──────────────────────────────┐
│ apply_medium(draft) │
│ attach EX_<target>_c0 obj │
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ build_full_model │ ← every approved SEED rxn
│ draft + candidates │ not already present,
│ deduped by stoich hash │ sorted by core status
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ (optional) detect_futile_ │ ← parallel LP probe,
│ cycles + drop │ `--prune-futile`
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ pfba_heuristic │ ← tolerance ladder,
│ min_growth ≥ k │ `1e-6 → 1e-9`
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ extract utilized candidates │
│ add to draft │
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ KO essentiality loop │ ← zero bounds, recheck FBA,
│ serial, core-first │ drop non-essential
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ Step 2: biomass components │ ← per substrate, MM_glu
├──────────────────────────────┤
│ Step 2b: aerobic/anaerobic │
├──────────────────────────────┤
│ Step 3: ESP1-5 energy screen │ (`--full-suite`)
├──────────────────────────────┤
│ Step 4: fermentation screen │ (`--full-suite`)
└──────────────────────────────┘
│
▼
filled model
6b. Multi-genome and community paths
The single-genome flow above is unchanged. Three additional, opt-in paths sit alongside it.
6b.1 Alignment reuse via gspa
N proteomes ───▶ [gspa, upstream] ──▶ gspa-run directory
(mmseqs2 linclust, clusters.tsv
diamond vs. ref) alignment/*.tsv
genomes.tsv
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
gapsmith find find-transport doall-batch
--gspa-run … --gspa-run … --gspa-run …
│ │ │
└─ reads clusters.tsv + alignment/*.tsv, expands rep
hits onto each target genome's cluster members
(gapsmith_align::GspaRunAligner).
No alignment is run inside gapsmith when --gspa-run is set. Per-genome
find still applies its bitscore / coverage / completeness gates on the
fanned-out hits, so the downstream pipeline is unchanged.
6b.2 doall-batch parallel driver
genomes.tsv ──▶ sort + shard (`--shard i/N`) ──▶ rayon pool
│
├─▶ doall genome_1
├─▶ doall genome_2
├─▶ ...
└─▶ doall genome_k
out-dir/
<genome_1>/
<genome_2>/
...
doall-batch-errors.tsv ← present only with --continue-on-error
Each worker calls doall::run_cli in-process with its --threads
pinned to 1 so rayon owns the parallelism. SLURM array tasks get a
disjoint --shard i/N — no inter-task coordination, no shared state.
6b.3 Community modes
N drafts (gmod.cbor) ──┐
│
┌───────────┴───────────┐
▼ ▼
community per-mag community cfba
(gapsmith-fill:: (gapsmith-fill::
community:: community::
union_medium, compose_models,
per_mag_weights) add_community_biomass)
│ │
▼ ▼
per-mag-growth.tsv + community.gmod.cbor +
community-medium.csv community-fba.tsv
(N independent FBAs) (1 big LP across the
block-diagonal model)
compose_models suffixes every private metabolite and reaction with
__<organism>, leaves extracellular (_e0) ids unsuffixed so they
collapse into one shared pool, and takes the most permissive
EX_*_e0 bounds across organisms. add_community_biomass appends a
community_biomass pseudo-metabolite plus a draining bio_community
reaction (Σ w_k · bio_k), optionally adding one equality per organism
(v(bio_k) == v(bio_community)) for balanced-growth cFBA. The regular
[fba::fba] solves the resulting LP with no solver-side changes.
7. Testing surface
Every algorithmic component has unit tests (~170 total, cargo test --workspace). On top, three integration layers:
- Shell-parity tests (
crates/gapsmith-align/tests/*_parity.rs) — run the real BLAST / diamond / mmseqs2 binaries and diff against our wrappers' output. - R-parity tests (
crates/gapsmith-find/tests/complex_parity.rs,pipeline_parity.rs,crates/gapsmith-transport/tests/parity.rs) — run the actual R gapseq viaRscriptand diff column-by-column. - SBML validator (
tools/validate_sbml.py) — uv-managed venv withpython-libsbml+cobra; runs libSBML consistency checks + COBRApy round-trip on emitted SBML. 0 errors on every model.
8. File-system layout
gapsmith/
crates/
gapsmith-core/ # types
gapsmith-io/ # CBOR/JSON + path resolver
gapsmith-db/ # dat/*.tsv loaders
gapsmith-sbml/ # SBML writer
gapsmith-align/ # aligner trait + 6 backends (incl. gspa-run)
gapsmith-find/ # pathway + reaction detection
gapsmith-transport/ # transporter detection
gapsmith-draft/ # draft model builder
gapsmith-medium/ # rule-based medium inference
gapsmith-fill/ # FBA / pFBA / gap-filler / suite / community
gapsmith-cli/ # clap dispatch + every subcommand
tools/
bench/ # R-vs-rs benchmark runner
validate_sbml.py # libSBML + COBRApy validator
r_complex_detection.R # R parity harness
.sbml-validate/ # uv venv (python-libsbml + cobra)
docs/
user-guide.md # end-user walk-through
cli-reference.md # exhaustive flag list
multi-genome.md # metagenome, gspa integration, cFBA
feature-matrix.md # R source → Rust module pointers
porting-notes.md # intentional deviations from upstream
architecture.md # this file
Cargo.toml # workspace manifest
README.md # status + benchmarks
Feature matrix
Exhaustive list of everything gapsmith implements, one row per feature, with pointers to both the upstream R source and the Rust module.
Status legend:
- ✅ — implemented and tested against real gapseq where feasible
- 🆕 — Rust-only feature (no upstream equivalent)
- ⚠️ — shipped but intentionally deviates from upstream (see porting-notes.md)
- ❌ — deferred / intentionally not ported
1. Subcommands
| Subcommand | R source | Rust module | Status |
|---|---|---|---|
gapsmith test | — | gapsmith-cli/src/commands/test.rs | ✅ |
gapsmith find | src/gapseq_find.sh + src/*.R | gapsmith-find/ + gapsmith-cli/src/commands/find.rs | ✅ byte-identical on PWY-6587 & amino |
gapsmith find-transport | src/transporter.sh + src/analyse_alignments_transport.R | gapsmith-transport/ + gapsmith-cli/src/commands/find_transport.rs | ✅ TC-set + row-count identical |
gapsmith draft | src/generate_GSdraft.R + src/prepare_candidate_reaction_tables.R | gapsmith-draft/ + gapsmith-cli/src/commands/draft.rs | ✅ SBML validates (0 libSBML errors) |
gapsmith medium | src/predict_medium.R | gapsmith-medium/ + gapsmith-cli/src/commands/medium.rs | ✅ byte-identical on ecoli |
gapsmith fill | src/gf.suite.R + src/gapfill4.R | gapsmith-fill/ + gapsmith-cli/src/commands/fill.rs | ✅ 4-phase suite + KO loop |
gapsmith adapt | src/adapt.R + src/gf.adapt.R | gapsmith-cli/src/commands/adapt.rs | ⚠️ EC / KEGG / name resolution deferred |
gapsmith pan | src/pan-draft.R | gapsmith-cli/src/commands/pan.rs | ✅ union + binary table |
gapsmith doall | src/doall.sh | gapsmith-cli/src/commands/doall.rs | ✅ end-to-end on ecore in 2m47s |
gapsmith update-sequences | src/update_sequences.sh | gapsmith-cli/src/commands/update_sequences.rs | ✅ Zenodo sync + md5 diff |
gapsmith convert | — | gapsmith-cli/src/commands/convert.rs | 🆕 CBOR ↔ JSON round-trip |
gapsmith example-model | — | gapsmith-cli/src/commands/example_model.rs | 🆕 toy model fixture |
gapsmith db inspect | — | gapsmith-cli/src/commands/db.rs | 🆕 reference-data row-count dump |
gapsmith export-sbml | cobrar::writeSBMLmod | gapsmith-cli/src/commands/export_sbml.rs | 🆕 CBOR → SBML |
gapsmith align | — | gapsmith-cli/src/commands/align.rs | 🆕 debug-wrap for a single aligner |
gapsmith batch-align | — | gapsmith-cli/src/commands/batch_align.rs | 🆕 cluster N genomes + single alignment |
gapsmith doall-batch | — | gapsmith-cli/src/commands/doall_batch.rs | 🆕 rayon + SLURM-shard parallel doall across N genomes |
gapsmith community per-mag | — | gapsmith-cli/src/commands/community.rs | 🆕 per-MAG FBA under a shared (union) medium |
gapsmith community cfba | — | gapsmith-cli/src/commands/community.rs | 🆕 compose N drafts, weighted-sum biomass objective |
gapsmith fba | — | gapsmith-cli/src/commands/fba.rs | 🆕 FBA / pFBA standalone |
2. Core algorithms
2.1 Alignment layer (gapsmith-align)
| Feature | R source | Rust module |
|---|---|---|
| BLASTp wrapper | gapseq_find.sh blastp block | blast.rs::BlastpAligner |
| tBLASTn wrapper | same, for -n nucl | blast.rs::TblastnAligner |
| DIAMOND wrapper | gapseq_find.sh diamond block | diamond.rs::DiamondAligner |
| mmseqs2 wrapper (full pipeline) | gapseq_find.sh mmseqs block | mmseqs2.rs::Mmseqs2Aligner |
| Precomputed TSV input | — | precomputed.rs::PrecomputedTsvAligner 🆕 |
| Batch-cluster (N genomes → 1 alignment) | — | batch.rs::BatchClusterAligner 🆕 |
| gspa-run manifest reader (cluster-aware hit expansion) | — | gspa.rs::{GspaManifest, GspaRunAligner} 🆕 |
| 2-decimal scientific e-value format | BLAST -outfmt 6 native | tsv.rs |
2.2 find pipeline (gapsmith-find)
| Feature | R source | Rust module |
|---|---|---|
| Pathway table loader (meta / kegg / seed / custom) | gapseq_find.sh:520-532 | gapsmith-db::PathwayTable |
| metacyc + custom merge (custom-wins-on-id) | same | same |
Keyword-shorthand resolution (amino, carbo, ...) | gapseq_find.sh:40-60 | pathways.rs::MatchMode::Hierarchy |
Reference FASTA resolver (user/ → rxn/ → rev/EC → unrev/EC → md5) | prepare_batch_alignments.R:150-234 | seqfile.rs |
| Complex-subunit detection | complex_detection.R | complex.rs (R-parity on 9 cases) |
| Hit classification with exception table | analyse_alignments.R:108-189 | classify.rs |
Pathway completeness scoring (f64 precision) | filter_pathways.R:10-34 | pathways.rs::score |
dbhit lookup (EC + altEC + MetaCyc id + enzyme name) | getDBhit.R:60-130 | dbhit.rs |
noSuperpathways=true default | gapseq_find.sh:20 | find::FindOptions |
Word-boundary-less header filter (matches shell grep -Fivf) | gapseq_find.sh | seqfile.rs ⚠️ intentional |
Output writers (Reactions.tbl, Pathways.tbl) | same | output.rs |
2.3 find-transport pipeline (gapsmith-transport)
| Feature | R source | Rust module |
|---|---|---|
subex.tbl substrate filter | transporter.sh:140-280 | filter.rs |
| TC-id parsing + type canonicalisation | analyse_alignments_transport.R:1-188 | tc.rs |
| Substrate resolution (tcdb_all + FASTA header fallback) | same | runner.rs |
Alt-transporter reaction assignment (gated by --nouse-alternatives) | analyse_alignments_transport.R:110-130 | runner.rs |
Substrate-case preservation (gapseq emits sub=Potassium) | shell behaviour | data.rs |
2.4 Draft model builder (gapsmith-draft)
| Feature | R source | Rust module |
|---|---|---|
| Candidate selection (bitscore ≥ cutoff OR pathway support) | prepare_candidate_reaction_tables.R + generate_GSdraft.R:55-100 | candidate.rs |
| Stoichiometric hash dedup | generate_rxn_stoich_hash.R | stoich_hash.rs |
Best-status-across-rows (OR is_complex, max complex_status, highest-rank pathway_status) | implicit in R's data.table merges | candidate.rs::build_candidates ⚠️ explicit |
| Biomass JSON parser (single + pipe-separated multi-link) | parse_BMjson.R:1-107 | biomass.rs + gapsmith-db::BiomassComponent::links |
| Biomass cofactor mass-rescaling | generate_GSdraft.R:281-292 | biomass.rs ⚠️ menaquinone-8 auto-removal deferred |
| GPR composition (and / or tree, "subunit undefined" edge cases) | get_gene_logic_string.R | gpr.rs |
| Diffusion + exchange expansion | add_missing_exRxns.R:1-156 | exchanges.rs |
| Conditional transporter additions (butyrate, IPA, PPA, phloretate) | generate_GSdraft.R | runner.rs::add_conditional_transporters |
SBML ID sanitiser (-/./:/space → _) | — | builder.rs 🆕 |
Cytosolic met-id format (cpd00001_c0 not cpd00001[c0]) | — | builder.rs ⚠️ for SBML SId compliance |
2.5 FBA / pFBA solver (gapsmith-fill)
| Feature | R source | Rust module |
|---|---|---|
Split-flux LP encoding (vp, vn ≥ 0) | implicit in cobrar::pfbaHeuristic | lp.rs::SplitFluxLp |
| FBA | cobrar::fba | fba.rs::fba |
| pFBA (single call) | cobrar::pfbaHeuristic | pfba.rs::pfba |
pFBA-heuristic tolerance ladder (15 iters, 1e-6 → 1e-9, pFBA-coef relaxation) | gapfill4.R:95-137 | pfba.rs::pfba_heuristic |
| HiGHS solver | — (R uses glpk/cplex) | good_lp 1.15 + highs-sys |
| CBC fallback | — | pfba.rs::pfba_cbc (feature-gated cbc) 🆕 |
Row-expression builder (O(nnz)) | implicit | fba.rs::build_row_exprs 🆕 performance |
2.6 Gap-filling (gapsmith-fill)
| Feature | R source | Rust module |
|---|---|---|
gapfill4 single-iteration driver | gapfill4.R:1-303 | gapfill.rs::gapfill4 |
Candidate pool (draft + all approved SEED, stoich-hash deduped) | construct_full_model.R + gapfill4.R:12-56 | pool.rs::build_full_model |
rxnWeights derivation from bitscores | prepare_candidate_reaction_tables.R:222-228 | pool.rs::rxn_weight |
| KO essentiality loop (serial, core-first, highest-weight-first) | gapfill4.R:247-280 | gapfill.rs::gapfill4 |
| Medium application (close all EX, open per-medium, add missing EX) | constrain.model.R | medium.rs::apply_medium |
Environment overrides (env_highH2.tsv) | adjust_model_env.R | medium.rs::apply_environment_file |
| Step 1 (user medium + biomass target) | gf.suite.R:244-258 | suite.rs::run_suite |
| Step 2 (per-biomass-component on MM_glu + carbon sources) | gf.suite.R:285-372 | suite.rs::step2 |
| Step 2b (aerobic / anaerobic variant) | gf.suite.R:377-464 | suite.rs::run_suite |
| Step 3 (energy-source screen with ESP1-5) | gf.suite.R:480-581 | suite.rs::step3 |
| Step 4 (fermentation-product screen) | gf.suite.R:585-683 | suite.rs::step4 |
| Target-met sink as objective | add_met_sink in add_missing_exRxns.R:56-72 | suite.rs::add_target_sink_obj |
| Futile-cycle detector (parallel pairwise LP probe) | recent upstream cccbb6f0 | futile.rs::detect_futile_cycles (opt-in --prune-futile) |
Community model composition (block-diagonal, shared _e0) | — | community.rs::compose_models 🆕 |
| Weighted-sum community biomass + optional balanced-growth | — | community.rs::add_community_biomass 🆕 |
Union-medium + per-MAG weights (community per-mag mode) | — | community.rs::{union_medium, per_mag_weights, weighted_growth} 🆕 |
2.7 Medium inference (gapsmith-medium)
| Feature | R source | Rust module |
|---|---|---|
| Rules-table loader | predict_medium.R:46 | rules.rs::load_rules |
Boolean-expression evaluator (| & ! < > == <= >=) | eval(parse(text=)) | boolexpr.rs::eval |
Counting-rule support (a + b + c < 3) | same (R int arithmetic) | boolexpr.rs::parse_sum |
| Cross-rule dedup + mean flux | predict_medium.R:84-86 | predict.rs::predict_medium |
| Saccharides / Organic acids category dedup | predict_medium.R:88-92 | predict.rs::predict_medium |
| Manual flux overrides | predict_medium.R:94-114 | predict.rs::parse_manual_flux |
| Proton balancer | predict_medium.R:121-132 | predict.rs::predict_medium |
2.8 Serialisation (gapsmith-sbml, gapsmith-io)
| Feature | R source | Rust module |
|---|---|---|
| CBOR round-trip | — | gapsmith-io::{read,write}_model_cbor 🆕 |
| JSON round-trip | — | gapsmith-io::{read,write}_model_json 🆕 |
| SBML L3V1 + FBC2 + groups writer | cobrar::writeSBMLmod | gapsmith-sbml::write_sbml |
| SBML SId idempotent on mets with compartment suffix | — | writer.rs::species_id 🆕 bugfix |
Streaming via quick-xml | — | writer.rs 🆕 no libSBML dep |
| SBML consistency validation | libSBML native | tools/validate_sbml.py (libSBML + COBRApy) |
2.9 Reference-data loaders (gapsmith-db)
| Feature | R source | Rust module |
|---|---|---|
seed_reactions_corrected.tsv | data.table::fread | seed.rs::load_seed_reactions |
seed_metabolites_edited.tsv | same | seed.rs::load_seed_metabolites |
MNXref cross-refs (mnxref_*.tsv) | same | mnxref.rs |
meta_pwy.tbl / kegg_pwy.tbl / seed_pwy.tbl / custom_pwy.tbl | same | pathway.rs |
subex.tbl | same | subex.rs |
tcdb.tsv | same | tcdb.rs |
exception.tbl | same | exception.rs |
medium_prediction_rules.tsv | same | gapsmith-medium::rules |
complex_subunit_dict.tsv | same | complex.rs |
| Biomass JSON (Gram+, Gram-, archaea, user custom) | parse_BMjson.R | biomass.rs |
SEED stoichiometry parser (-1:cpd00001:0:0:"H2O";...) | parse_BMjson.R:21-29 | stoich_parse.rs |
3. New Rust-only features
| Feature | Rust module | Motivation |
|---|---|---|
Precomputed alignment input (--aligner precomputed -P <tsv>) | gapsmith-align::PrecomputedTsvAligner | Skip per-genome BLAST when the user pre-runs diamond / mmseqs2 at batch scale |
BatchClusterAligner (gapsmith batch-align) | gapsmith-align::BatchClusterAligner | Amortise alignment cost over N genomes via one mmseqs2 cluster + single alignment |
gspa-run manifest reader (--gspa-run <dir>) | gapsmith-align::GspaRunAligner | Consume precomputed cluster-rep hits from the upstream gspa pipeline; fans rep hits onto per-genome members |
gapsmith doall-batch | gapsmith-cli::commands::doall_batch | Rayon + SLURM-array-friendly driver for reconstructing 100 → 1 M genomes in one batch |
gapsmith community per-mag | gapsmith-fill::community + CLI | Shared-medium per-MAG FBA for metagenomes with 50+ MAGs |
gapsmith community cfba | gapsmith-fill::community + CLI | Full community LP (block-diagonal compose, weighted-sum biomass, optional balanced-growth) |
| In-process LP (HiGHS via good_lp) | gapsmith-fill | Replaces R cobrar's shelled-out glpk / cplex; faster warm-starts |
| Optional CBC fallback backend | gapsmith-fill::pfba_cbc (--features cbc) | When HiGHS exhausts the tolerance ladder on pathological LPs |
| CBOR native format | gapsmith-io | Fast, compact, stdlib-free; replaces R's RDS |
gapsmith fba subcommand | gapsmith-cli | Standalone FBA / pFBA without shelling into R |
gapsmith convert subcommand | gapsmith-cli | CBOR ↔ JSON round-trip for inspection |
gapsmith db inspect subcommand | gapsmith-cli | Smoke-test the reference data directory |
gapsmith export-sbml subcommand | gapsmith-cli | Write an arbitrary CBOR model as SBML |
4. Known gaps (deferred)
| Gap | Upstream location | Workaround / plan |
|---|---|---|
| EC / TC conflict resolution (IRanges overlap math) | prepare_candidate_reaction_tables.R::resolve_common_{EC,TC}_conflicts | Affects <1 % of multi-EC annotations. Plan: port when a user case needs it. |
| MIRIAM cross-ref annotations (KEGG / BiGG / MetaNetX / HMDB / ChEBI) | addReactAttr.R + addMetAttr.R | SBML emits ModelSEED id only; round-trip in COBRApy still works. |
| HMM-based taxonomy / gram prediction | predict_domain.R, predict_gramstaining.R | CLI requires explicit `--taxonomy Bacteria |
| Gene-name MD5 fallback in seqfile resolver | uniprot.sh:179 | Common-case MD5 fallback is ported; gene-name branch rarely fires. |
| Menaquinone-8 auto-removal (gated on MENAQUINONESYN-PWY / PWY-5852 / PWY-5837) | generate_GSdraft.R:281-292 | Bio1 includes cpd15500 regardless; affects anaerobic predictions marginally. |
gram_by_network.R (predict gram by metabolic-network similarity) | same | Requires explicit `-b pos |
adapt EC / KEGG / enzyme-name resolution | adapt.R::ids2seed strategies 3–7 | Direct SEED + pathway id resolution works; user can pre-resolve via gapsmith find. |
pan weight medianing (custom_median) | pan-draft_functions.R | Pan-model emits without merged rxnWeights metadata; gapsmith fill on a pan-draft needs the source Reactions.tbl. |
| CPLEX solver support | — | Plan explicitly calls for HiGHS + optional CBC; no CPLEX path. |
MetaCyc DB updaters (meta2pwy.py, meta2genes.py, meta2rea.py) | upstream Python helpers | Run once per year by maintainers; kept in Python. |
5. Testing surface
| Test suite | Tests | What it asserts |
|---|---|---|
gapsmith-core unit | 18 | Type invariants, serde round-trips, stoichiometric matrix construction. |
gapsmith-io unit | 5 | CBOR / JSON round-trip, data-dir auto-detect. |
gapsmith-db unit | 18 | Every reference-data parser on realistic inputs. |
gapsmith-sbml unit + integration | 2 + 1 | SBML writer emits every FBC2 / groups element; libSBML validates cleanly. |
gapsmith-align unit + smoke + parity | 18 + 4 + 3 | Aligner trait, precomputed TSV, gspa-run manifest + fan-out, BLAST / diamond / mmseqs2 shell parity. |
gapsmith-find unit + smoke + parity | 36 + 1 + 2 | Pathway scoring, complex detection (R-parity on 9 cases), find -p PWY-6587 and -p amino byte-identical against real gapseq. |
gapsmith-transport unit + parity | 7 + 1 | TC parsing, substrate resolution, end-to-end row+TC-id parity against real gapseq. |
gapsmith-draft unit + smoke | 10 + 1 | Biomass rescaling, GPR composition, stoich dedup, conditional transporters. |
gapsmith-fill unit + textbook + smoke | 20 + 5 + 1 | FBA / pFBA / pFBA-heuristic on toys; community compose + cFBA on 2-organism toy; gapfill4 end-to-end on ecoli draft. |
gapsmith-medium unit | 14 | Boolean-expression evaluator (incl. counting rules), rule loader, cross-rule dedup, proton balance. |
gapsmith-cli integration | 6 + 1 | CBOR↔JSON round-trip end-to-end via the binary; SLURM-shard parser. |
| Total | ~170 |
Run the full suite:
cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings
6. LOC breakdown
| Crate | src/ LOC | tests/ LOC |
|---|---|---|
gapsmith-core | ~1 000 | — |
gapsmith-io | ~330 | — |
gapsmith-db | ~1 600 | — |
gapsmith-sbml | ~870 | ~220 |
gapsmith-align | ~1 250 | ~560 |
gapsmith-find | ~2 700 | ~380 |
gapsmith-transport | ~1 040 | ~115 |
gapsmith-draft | ~1 670 | ~80 |
gapsmith-fill | ~2 150 | ~400 |
gapsmith-medium | ~550 | — |
gapsmith-cli | ~2 400 | ~60 |
| Total | ~17 000 | ~2 100 |
gapsmith porting notes
Intentional and unintentional deviations from the upstream R/bash gapseq codebase. Each entry says what differs, why, and whether downstream output parity holds.
This document is the authoritative list of "known non-R behavior". If you find something in gapsmith that differs from real gapseq and it's not listed here, please file an issue — we treat those as bugs.
1. Storage format
- RDS → CBOR. R's
saveRDS(mod, ...)is replaced byciborium-encoded CBOR at<id>.gmod.cbor. SBML is emitted alongside as the portable interchange format. An Rscript-driven CBOR↔RDS bridge can be added later if downstream R consumers need it. - Metabolite IDs:
cpd00001_c0instead ofcpd00001[c0]. SBML SIds must match[A-Za-z_][A-Za-z0-9_]*— brackets break that. The SBML writer derives the species id frommet.id + met.compartmentand only presents<cpd>[comp]to the user via thenameattribute. - Model IDs are sanitized to valid SBML SIds:
-,.,:, space →_. Original id preserved in thenameattribute if different.
2. Alignment layer
- mmseqs2 invocation. gapseq's R pipeline calls explicit
mmseqs createdb+search+convertalis; we replicate this 4-command flow exactly rather than the simplereasy-search, because the latter uses alignment-mode 3 (full-alignment identity) while gapseq calibrates against the k-mer prefilter identity thatsearchreports. - mmseqs2
qseqidnormalization. We post-process mmseqs'sqheaderoutput to the first whitespace-delimited token, matching what blastp/diamond emit natively. gapseq does this via a sed call in the shell script (gapseq_find.shat the tail of each mmseqs block). - Diamond
--more-sensitiveis on by default, matching gapseq'saliArgs="--more-sensitive"default for diamond. - E-value formatting in TSV output: BLAST's exact 2-decimal
scientific notation (
2.52e-41). Initial wrapper used{:.3e}which emitted2.520e-41; fixed to match.
3. find pipeline (M5)
noSuperpathways=trueis always the default. Matches gapseq'sgapseq_find.sh:20. Pass-n/--include-superpathwaysto disable.-pkeyword shorthand.-p amino|nucl|cofactor|carbo|polyamine|fatty|energy|terpenoid|degradation|core|min|kegg|allresolves to the same hierarchy-category name gapseq uses. For-p <literal>we fall back to\b(<literal>)\bregex against id / name / altname / hierarchy — same as gapseq's defaultstypebranch.- Pathway table merge semantics.
-l metacyc,customis default (matching gapseq); on ID collision, thecustom_pwy.tblrow wins. This is how several "GS"-suffixed pathways (e.g. ARGSYN-PWY) have theirsuperpathway=TRUEflag shadowed by the custom row withsuperpathway=FALSE, so they survive the filter. user/reference fastas always resolve against the built-in<gapseq>/dat/seq/<tax>/user/, even when-D/--seq-diroverrides the rest. Matchesprepare_batch_alignments.R:217.- Reaction-name MD5 fallback. Uses
md5sum-compatible hex digest of the raw bytes (no trailing newline), same asuniprot.sh:155. - Header filter uses substring, not word-boundary, match. This is a
deliberate match of gapseq's shell behavior (
grep -Fivfwithout-w). Has the known quirk of accidentally matching4.A.2.1.1against a line containing4.A.2.1.11. Reproducing the quirk is required for byte-identical output.
4. find-transport pipeline (M6)
- Substrate case is preserved from subex.tbl (
sub=Potassium, notsub=potassium). Matches gapseq. - Alt-transporter reaction assignment kicks in when a TC-id hit
exists but no SEED reaction matches the exact TC type. Same rule as
analyse_alignments_transport.R:110-130. Disable with-a.
5. draft pipeline (M7) — known gaps
- EC/TC conflict resolution (
prepare_candidate_reaction_tables.Rresolve_common_EC_conflicts/resolve_common_TC_conflicts, ~130 lines of IRanges overlap math) is not yet ported. Affects <1% of reactions; the mitigations are listed downstream in the draft — a gene with two EC annotations stays attached to both reactions rather than being resolved. - MIRIAM cross-ref annotations. SBML writer emits the ModelSEED id
only. KEGG / BiGG / MetaNetX / HMDB / ChEBI cross-refs are available
in
addReactAttr.R+addMetAttr.Rbut not threaded through yet. Round-trip load in COBRApy works; MIRIAM-dependent downstream tools may need to re-query gapseq'sdat/mnxref_*.tsv. gram_by_network.R(predict gram-staining by metabolic-network similarity against reference genomes) is not ported. The CLI requires--biomass pos|neg|archaeaor reads thegram=field from the Reactions.tbl header (which find writes when gapseq has HMM-predicted the gram).- Biomass rescaling: the menaquinone-8 auto-removal rule
(
generate_GSdraft.R:281-292) is not ported. Gram+/Gram- biomass retains menaquinone-8 regardless of whether the MENAQUINONESYN-PWY / PWY-5852 / PWY-5837 pathways are present. Impact is small (a single metabolite in the biomass reaction) but affects anaerobic-growth predictions.
6. Not yet implemented
gapsmith medium— rule-based medium inference. Planned for M10.gapsmith fill— 4-phase gap-filling driver. Planned for M9 (FBA/pFBA primitives already shipped in M8 undergapsmith-fill).gapsmith adapt— add/remove reactions on an existing model. Planned for M10.gapsmith pan— pan-model union. Planned for M10.gapsmith doall— chain find→find-transport→draft→medium→fill. Planned for M10.gapsmith update-sequences— seqdb download from upstream. Planned for M10.- HMM-based
predict_domain/predict_gramstaining. Planned alongside M10.
6a. fill (M8–M9) — LP primitives + single-iteration gap-fill (shipped)
- Split-flux encoding: each reaction contributes
vp_r, vn_r ≥ 0with netv_r = vp_r − vn_r.|v_r| = vp_r + vn_ris linear, which keeps pFBA a pure LP. The bound translation is documented incrates/gapsmith-fill/src/lp.rs—[lb, ub]on the net flux becomes a per-variable ub pair(max(ub,0), max(−lb,0)), with additional lower bounds when the direction is strictly forced. - FBA (
gapsmith-fill::fba) — maximise / minimiseΣ c_r · (vp_r − vn_r)s.t.S · (vp − vn) = 0. HiGHS backend viagood_lp1.15. - pFBA (
gapsmith-fill::pfba) —minimise Σ w_r·(vp_r+vn_r) − pfba_coef · Σ c_r·(vp_r−vn_r)subject tov_bio ≥ min_growth. Matches cobrar'spfbaHeuristic. - pFBA-heuristic ladder (
gapsmith-fill::pfba_heuristic) — port ofgapfill4.R:95–137. Retries up to 15 times, halving the feasibility tolerance then the pFBA coefficient on solver failure; validates each candidate solution by running FBA on the reduced model after zero-flux reactions are removed. gapsmith fbasubcommand — FBA or--pfbaon a CBOR/JSON model; prints solver status, objective, biomass flux, and the top-N |flux| reactions. Useful for sanity-checking a draft model beforefillis applied.- Candidate pool (
pool.rs::build_full_model) — clone draft, append every approved/corrected SEED reaction not already present, deduped by stoichiometric hash with core-preference on ties. Port ofgapfill4.R:12–56. - Medium application (
medium.rs::apply_medium) — close allEX_*lower bounds then re-open per medium CSV; adds missing EX reactions for compounds not yet in the model. Matchesconstrain.model.R. - Single-iteration gap-fill (
gapfill.rs::gapfill4) — pFBA-heuristic on the full model with bitscore-derived weights, extract utilized candidates, add to draft, run KO essentiality loop. Port ofgapfill4.R:1–303. gapsmith fillsubcommand — end-to-enddraft → filledpipeline. Ecoli onMM_glu.csv: 20 reactions added, final growth 0.53.
Known divergences / outstanding work
- Biomass link-multi parsing —
biomass.jsonsupports"link": "cpd01997:-1|cpd03422:-1"(pipe-separated coupled metabolites). The M7 biomass parser only read the first link. M9 fixed this (BiomassComponent::links()). Without the fix, every biomass reaction was missing two cofactor metabolites, pegging growth to zero. Re-rundrafton any model built before M9 — the old CBOR has a structurally broken biomass. - SBML species-id double suffix — the M7
species_idhelper blindly appended_<comp>to the cpd id. When gapsmith stores metabolites ascpd00001_c0(the actual shipped convention), the SBML emittedM_cpd00001_c0_c0. COBRApy treated these as distinct mets, so SEED reactions and biomass didn't link. Fixed in M9: idempotent when the id already ends with the compartment suffix. - Objective must be
EX_<target>_c0sink, notbio1directly —gf.suite.R:239–242zeros everyobj_coefand then callsadd_met_sink(mod, target.met, obj=1)to attach the objective to a freshly-created sink reaction. Without it, steady-state balance pins the biomass flux to zero because the biomass pseudo-metabolite has no consumer. The CLI wires this up automatically. - 4-phase suite (Steps 2 / 2b / 3 / 4) shipped in the M9 follow-up.
Default CLI runs Steps 1 + 2 + 2b;
--full-suiteadds 3 + 4. Steps 3 and 4 iterate ~200 exchange compounds each with a gapfill4 call — budget 10–30 minutes on a whole-bacterium genome. - Futile-cycle prune shipped (
detect_futile_cycles) but opt-in via--prune-futile. Pairwise LP probe on large candidate pools (~8k rxns) takes ~40 minutes even with rayon parallelism; use selectively. - Candidate-dedup fix (draft-stage latent bug caught during M9
follow-up): my build_candidates accumulator copied
is_complex/complex_status/pathway_statusonly from the highest-bitscore row. That row may have empty status fields (gapseq fills those only on rows where complex detection ran), causing reactions likerxn00011(bs=1797, pyruvate decarboxylase) to be rejected in the runner's complex filter. Fixed: ORis_complex, max-across-rows forcomplex_status, highest-rankpathway_status. Re-rungapseq drafton any model built before this fix. - CBC fallback wired behind
--features cbc. Path:pfba_heuristicretries once withgood_lp::solvers::coin_cbcafter HiGHS exhausts its tolerance/coef ladder. Requirescoinor-libcbc-dev+coinor-libclp-devon the build host.
7. Unimplemented but probably won't be
- The R helpers that depend on
pythoncyc(meta2pwy.py,meta2genes.py,meta2rea.py) are one-off MetaCyc DB updaters run ≤ twice per year by maintainers. They stay in Python; no Rust port planned. - CPLEX solver support — plan stated "no CPLEX path"; gapsmith-fill will
use HiGHS via
good_lpwith CBC/Clp as backups.
8. New features not in gapseq
BatchClusterAligner(M4.5) — cluster N genomes' proteomes with mmseqs2, align once against the reference query FASTA, expand per-genome hit sets. Amortizes alignment cost over many genomes with shared proteins. Not in upstream gapseq.- Precomputed alignment input —
--aligner precomputed -P <tsv>onfind/find-transportskips the alignment step entirely. User must pre-run blastp/diamond/mmseqs2 with the expected 8-column layout.
9. Output format differences
- Pathways.tbl completeness precision. Ported to
f64to match R's default 15-significant-digit output (66.6666666666667). - Reactions.tbl column order matches gapseq exactly after M5 fixes.
- Transporter.tbl matches gapseq exactly after M6 fixes, modulo the
cosmetic
subcolumn case when a substrate has both capitalized and lowercased SUBkey entries (gapseq picks the alphabetically-first post-dedup; we pick insertion-order-first). evalueformatting matches BLAST's-outfmt 6conventions:0for exact zero, 2-decimal scientific notation below 1e-3.
10. Testing surface
- R-vs-Rust parity tests run the actual R implementation via
Rscriptwhere feasible. This lives in three places:crates/gapsmith-find/tests/complex_parity.rsdrivestools/r_complex_detection.Rwhich sources the realsrc/complex_detection.R.crates/gapsmith-find/tests/pipeline_parity.rsruns realgapsmith findand gapsmithfindside-by-side with a synthesized minimal seqdb.crates/gapsmith-transport/tests/parity.rsdoes the same forfind-transport.
- SBML validation runs
python-libsbml+cobravia a uv-managed venv attools/.sbml-validate/. Not part ofcargo test(no libsbml on crates.io) but a manual gate viatools/validate_sbml.py.
Performance
gapsmith shipped with a series of optimizations relative to its initial Rust port. This page catalogues what's in, where wins actually show up, and the methodology behind the numbers.
Shipped optimizations
All items below are on by default unless marked (opt-in) — those are library-level knobs exposed as struct fields, not CLI flags yet.
C1 — SEED stoich-hash cache at load time
crates/gapsmith-db/src/seed.rs. SeedRxnRow gains a #[serde(skip)]
stoich_hash: Option<String> field populated once during
load_seed_reactions (sequential, ~150 ms for 35k rows). Gap-fill and
draft dedup paths read the cached hash instead of re-parsing +
re-sorting the stoichiometry on every access.
C3 — O(1) reaction/metabolite lookups in apply_medium
crates/gapsmith-fill/src/medium.rs. Replaces rxns.iter().find(..)
per medium entry with a one-shot HashMap<rxn_id, index>. Same for
the metabolite-index lookup in the new-exchange fallback path.
O(k·n) → O(n + k) on a 500-rxn draft with 20 medium entries.
A2 — FASTA header cache per find run
crates/gapsmith-find/src/runner.rs. Multiple reactions frequently
resolve to the same reference FASTA (two reactions sharing an EC both
land on rev/<EC>.fasta). A HashMap<PathBuf, Arc<Vec<(String, String)>>>
populated lazily during the unique-reaction loop avoids redundant
file reads + header parses.
E1 — Structural blocked-reaction pre-pass in futile-cycle prune
crates/gapsmith-fill/src/futile.rs::structural_blocked_rxns.
Iterative topological dead-reaction detection on the closed-boundary
model: reactions whose metabolites have no producer or no consumer
(in any alive direction) can't carry flux, so they can't participate
in cycles. Drops 30–70% of the LP-probed candidate pool before any
LP runs. O(iters × nnz), typically 3–8 iterations. Runs inside the
existing --prune-futile flag.
B2 — Skip KO probes for zero-flux reactions
crates/gapsmith-fill/src/gapfill.rs. Any non-core candidate carrying
zero flux in the post-fill max-biomass FBA optimum is provably
non-essential: the constraint x[r] = 0 leaves the current optimum
feasible, so the KO probe would succeed deterministically. Skip the
LP probe; drop the reaction directly. Cuts ~20–40% of KO-loop LP
solves on typical gap-fills.
B1 (opt-in) — HiGHS hot-start via with_initial_solution
crates/gapsmith-fill/src/fba.rs. FbaOptions gains
hot_start: Option<Vec<f64>>. When Some, the caller supplies a
previous net-flux vector; fba() decomposes it into split-flux
(vp, vn) = (max(v,0), max(-v,0)) seeds and threads them through
good_lp's with_initial_solution. HiGHS warm-starts the simplex
from the seed basis. Can nudge HiGHS toward a different alternative
optimum, so off by default for byte-parity.
B3 (opt-in) — pFBA ladder memo across calls
crates/gapsmith-fill/src/pfba.rs. PfbaHeuristicOptions gains
ladder_memo: Option<Arc<Mutex<Option<(f64, f64)>>>>. When Some,
the ladder seeds from the last successful (tolerance, coefficient)
instead of the defaults and writes back on success. Clamped to
[min_tol, start_tol] to prevent a stale memo from pushing the
solver outside the user's configured range. Same byte-parity caveat
as B1. Off by default.
Benchmark results
Hardware: server ws, 64 cores, 128 GB RAM, NVMe SSD, Debian.
Aligner: DIAMOND 2.1.9 (-A diamond -K 8).
Input: E. coli K-12 MG1655 (GCF_000005845.2, 4300 proteins).
Reference DB: upstream gapseq's dat/ at commit-time HEAD.
Comparison: commit a42d172 (pre-Phase-1 baseline) vs 474a1e7
(all Phase 1 / 2 / 3 defaults shipped).
gapsmith find -p all
| Metric | Phase-0 | Phase-3 | Δ |
|---|---|---|---|
| Wall | 18.36 s | 17.96 s | −2.2% |
| User | 115.8 s | 114.1 s | −1.5% |
| Peak RSS | 279 MB | 286 MB | +2.5% |
Pathways.tbl byte-identical. Reactions.tbl varies run-to-run on the
same binary — DIAMOND parallel search with -K > 1 has inherent
non-determinism at the per-hit level. Not a regression; not in our
code.
gapsmith doall (full default pipeline)
| Metric | Phase-0 | Phase-3 | Δ |
|---|---|---|---|
| Wall | 77.6 s | 79.2 s | +2.1% |
| User | 244 s | 245 s | +0.4% |
| Peak RSS | 276 MB | 297 MB | +7.7% |
Semantic parity (via COBRApy):
- 2776 reactions, 2193 metabolites (both)
- Biomass FBA growth: 0.7422696154309283 (identical to 16 decimals)
- Reaction Jaccard: 1.0000
- Metabolite Jaccard: 1.0000
gapsmith fill --full-suite (4-phase, deep KO loops)
| Metric | Phase-0 | Phase-3 | Δ |
|---|---|---|---|
| Wall | 11:51.36 | 11:51.55 | +0.03% |
| User | 1209 s | 1217 s | +0.7% |
| Peak RSS | 107 MB | 123 MB | +15.5% |
Semantic parity:
- 2859 reactions, 2207 metabolites (both)
- Biomass FBA growth: 0.7433163537454 (identical to 13 decimals)
- Reaction Jaccard: 1.0000
- Metabolite Jaccard: 1.0000
Added reactions differ only in step attribution (one reaction
rxn12933_c0 added in Step 1 by P0 vs Step 2 by P3 — B2's zero-flux
skip dropped it from Step 1's add-list because its flux at the
biomass optimum was zero; Step 2 needed it and re-added it). Final
model is identical.
gapsmith fill --full-suite --prune-futile (E1 shines)
Both binaries hit a pre-existing pfba_heuristic: max iterations
failure on this workload (unrelated to Phase 1–3 changes). Timing
to the failure point:
| Metric | Phase-0 | Phase-3 | Δ |
|---|---|---|---|
| Wall | 5:56 | 4:20 | −27% |
| User | 18963 s | 13728 s | −28% |
| Peak RSS | 1354 MB | 1322 MB | −2.4% |
E1 cuts the time needed to exhaust the LP-prune pool substantially.
When --prune-futile actually completes on a well-conditioned input,
this speedup translates directly.
Honest assessment
The measured wins on the default path are small:
- DIAMOND dominates
find/doallwall time (~60 s of 80 s total). fill(Step 1 default) runs few LPs; B2 saves fractions of a second.- C1 / C3 savings are milliseconds, below noise.
- A2 cache adds ~15 MB RSS because per-FASTA reads were already buffered.
Real wins show up in:
--prune-futileenabled (E1: ~25–30% wall reduction)- Deep KO loops in full-suite runs with large candidate pools (B2)
- Batch runs reusing
ladder_memoacross pFBA calls (B3, opt-in) - KO-loop-dominated fills with
hot_startconfigured (B1, opt-in)
Semantic parity is the firm guarantee: identical reaction sets, identical biomass to machine precision, regardless of which optimization path fires.
Scaling to many genomes
The numbers above are all single-genome. For 100+ genomes, the dominant cost moves out of FBA and into per-genome alignment. Gapsmith's multi-genome path routes around that entirely:
--gspa-runconsumes a pre-clustered, pre-aligned gspa manifest so a 1000-genome collection pays for one cross-genome alignment, not 1000. Aligner calls insidefind/find-transportbecome map-lookups.doall-batch --jobs Nrunsdoallacross genomes with rayon; on a 32-core box, speedup is ~min(N, 32)vs. loopingdoallsequentially.doall-batch --shard i/Nmakes SLURM array jobs the unit of parallelism. At 1 024 tasks of 16 cores each, throughput is ~1–2 s of wallclock per genome including I/O — limited by per-task doall-internal work, not by coordination.
Community modes are small LPs (≤ 1000 organisms in per-mag, which is
linear) or a single big LP (cfba). cfba is feasible up to ~50
organisms before HiGHS becomes the bottleneck; for larger communities
prefer per-mag.
See Multi-genome workflows for recipes.
Deferred — where the big wins still are
B1-full — HiGHS basis warm-start across KO probes (highs-sys FFI)
The currently-shipped B1 uses good_lp's with_initial_solution for
a simplex hot-start guess, but each solve still constructs a fresh
highs::Model. True warm-start needs a persistent highs::Model
across the KO loop, mutating bounds via Highs_changeColBounds
and re-solving in place. The highs crate 2.0 doesn't expose that
API — requires ~200 LOC of highs-sys FFI. Estimated 30–50% wall
reduction on KO-dominated fills.
Parallel KO probe
The current KO loop is serial because each commit affects the next iteration's baseline. A valid parallelisation: probe every candidate concurrently against the same post-fill model (all probes are independent at that snapshot), collect a set of provably-non-essential candidates, then re-probe the survivors serially. Bounded to a single pass, correctness-preserving. Estimated 4–8× on multi-core for the Step 4 fermentation-product loop.
Step-4 restructuring
Step 4 probes ~100 fermentation products one at a time, each doing a full gap-fill cycle. Amortising the shared preprocessing across probes (same starting model, same weights) would cut ~5 min off a full-suite run.
Reproducing
# 1. Clone + build both commits
git clone https://github.com/bio-ontology-research-group/gapsmith.git
cd gapsmith
git worktree add -f /tmp/gapsmith-p0 a42d172
cd /tmp/gapsmith-p0 && cargo build --release --bin gapsmith
cd - && cargo build --release --bin gapsmith
# 2. Get reference data
gapsmith update-data -o /path/to/dat
gapsmith --data-dir /path/to/dat update-sequences -t Bacteria
# 3. Run both binaries, compare
for bin in /tmp/gapsmith-p0/target/release/gapsmith ./target/release/gapsmith; do
out=$(mktemp -d)
/usr/bin/time -v "$bin" --data-dir /path/to/dat doall \
-A diamond -K 8 -f "$out" genome.faa.gz 2>&1 | tee "$out.time"
done
# 4. Semantic diff in COBRApy
uv run --with cobra python tools/bench/diff_models.py \
out1/genome-filled.xml out2/genome-filled.xml
tools/bench/run_bench.sh bundles the above end-to-end.
Comparison with upstream gapseq
gapsmith is a Rust reimplementation of gapseq (R/bash, ~9k LOC). This document covers performance benchmarks and feature differences.
Benchmarks
Wall-clock comparison on four real bacterial proteomes. Hardware:
56-core Xeon, 128 GB RAM, Debian 13, NVMe SSD. Same aligner binary
(NCBI blastp 2.17.0 via bioconda). Same reference sequence database
(Zenodo v1.4, record 16908828).
Test genomes
| Organism | Accession | Proteins |
|---|---|---|
| Candidatus Blochmannia floridanus | GCF_000007725.1 | 517 |
| Bacillus subtilis 168 | GCF_000009045.1 | 4,237 |
| Escherichia coli K-12 MG1655 | GCF_000005845.2 | 4,300 |
| Salmonella enterica Typhimurium LT2 | GCF_000006945.2 | 4,554 |
Results
| Genome | Stage | gapseq (R) | gapsmith | Speedup |
|---|---|---|---|---|
| B. floridanus (517) | find -p all | 117 s | 34 s | 3.5× |
| B. floridanus (517) | find-transport | 9 s | 8 s | 1.2× |
| B. subtilis (4,237) | find -p all | 205 s | 73 s | 2.8× |
| B. subtilis (4,237) | find-transport | 25 s | 14 s | 1.8× |
| E. coli (4,300) | find -p all | 218 s | 76 s | 2.9× |
| E. coli (4,300) | find-transport | 27 s | 14 s | 1.9× |
| S. Typhimurium (4,554) | find -p all | 211 s | 76 s | 2.8× |
| S. Typhimurium (4,554) | find-transport | 27 s | 14 s | 1.9× |
gapsmith also uses 35–40% less peak memory (e.g. 498 MB vs 786 MB for
E. coli find).
Stages without R baseline
The draft, medium, and fill stages could not be benchmarked
against upstream R gapseq because the cobrar R package (which replaced
the archived sybil) failed to install on the benchmark host due to a
libsbml/libxml2 ABI conflict in the conda R 4.5 environment. This is a
packaging issue on the test rig, not a limitation of either tool.
gapsmith timings for these stages (E. coli K-12):
| Stage | Wall-time | Peak RSS |
|---|---|---|
draft | 0.6 s | 152 MB |
medium | 0.1 s | 47 MB |
fill (Steps 1 + 2 + 2b) | 52 s | 103 MB |
These stages are fast in gapsmith because the LP solver (HiGHS) runs
in-process via good_lp, rather than shelling out through R's cobrar
wrapper to an external GLPK/CPLEX binary.
Reproducing the benchmarks
# Requires: blastp, Rscript (with data.table, stringr, Biostrings),
# and gapsmith binary on PATH.
bash tools/bench/run_bench.sh <genomes_dir> <results_dir>
python3 tools/bench/aggregate.py <results_dir>
Feature comparison
| Feature | gapseq (R) | gapsmith |
|---|---|---|
find (pathway detection) | ✅ | ✅ byte-identical output |
find-transport | ✅ | ✅ row-count parity |
draft (model assembly) | ✅ | ✅ 0 libSBML errors |
medium (rule-based inference) | ✅ | ✅ byte-identical output |
fill (4-phase gap-filling) | ✅ | ✅ Steps 1–4 |
adapt (add/remove reactions) | ✅ | ✅ (EC/KEGG resolution deferred) |
pan (pan-draft union) | ✅ | ✅ |
doall (end-to-end pipeline) | ✅ | ✅ |
update-sequences (Zenodo sync) | ✅ | ✅ |
| EC/TC conflict resolution | ✅ | ❌ (affects <1% of reactions) |
| MIRIAM cross-ref annotations | ✅ | ❌ (SBML loads without them) |
| HMM-based taxonomy prediction | ✅ | ❌ (use --taxonomy flag) |
| Precomputed alignment input | ❌ | ✅ --aligner precomputed |
| Batch-cluster alignment | ❌ | ✅ batch-align |
| In-process LP solver | ❌ (external GLPK) | ✅ (HiGHS, bundled) |
| FBA/pFBA subcommand | ❌ | ✅ fba |
| Native CBOR model format | ❌ (RDS) | ✅ (replaces RDS) |
| Single static binary | ❌ | ✅ |
Key differences
Solver
gapseq uses R's cobrar package (or the older sybil) which wraps
GLPK or CPLEX. gapsmith uses HiGHS via
good_lp, statically linked at
build time. No runtime solver dependency.
Model format
gapseq stores models as R .RDS files. gapsmith uses CBOR (.gmod.cbor)
as its native format — compact, fast to load, language-agnostic. Both
tools emit SBML for interchange.
Alignment
Both tools shell out to the same external aligner binaries (blastp,
diamond, mmseqs2). gapsmith additionally supports --aligner precomputed
(skip the aligner, read a user-supplied TSV) and batch-align (cluster
N genomes with mmseqs2, align once, expand per-genome).
Known deferred items
See docs/porting-notes.md for the full list. The main gaps:
- EC/TC conflict resolution — affects <1% of multi-EC-annotated genes.
- MIRIAM cross-refs — SBML emits ModelSEED id only; COBRApy round-trip still works.
- HMM-based taxonomy — CLI requires
--taxonomy Bacteria|Archaeainstead of auto-detecting. adaptEC/KEGG resolution — use direct SEED reaction ids or MetaCyc pathway ids.