gapsmith

gapsmith is a Rust reimplementation of gapseq. It reconstructs genome-scale metabolic models from bacterial proteomes: pathway detection, transporter detection, draft model assembly, growth-medium inference, and iterative gap-filling — all as a single static binary.

What it does

Given a protein FASTA, gapsmith:

  1. Finds pathways — aligns proteins against per-reaction reference FASTAs and scores pathway completeness (gapsmith find).
  2. Finds transporters — matches against a TCDB-derived substrate table (gapsmith find-transport).
  3. Assembles a draft model — combines the above into a stoichiometrically consistent metabolic network (gapsmith draft).
  4. Infers a growth medium — applies boolean rules over detected reactions to predict minimal-medium compounds (gapsmith medium).
  5. Gap-fills — an iterative pFBA + knock-out loop adds the minimal set of extra reactions needed for biomass growth (gapsmith fill).

Each step is independently callable, or chain them end-to-end with gapsmith doall.

One-liner

gapsmith doall genome.faa.gz -f output/

This produces a gap-filled SBML model that loads directly in COBRApy, COBRAToolbox, or any SBML-compatible tool.

Why a rewrite?

Upstream gapseq is R + bash (~9k LOC). gapsmith replaces the R stack — including the cobrar wrapper and external GLPK/CPLEX — with:

  • In-process HiGHS via good_lp for FBA/pFBA
  • Native CBOR model format (replaces R's RDS)
  • Single static binary, no R install
  • --aligner precomputed mode for re-using one alignment across many genomes
  • batch-align for clustering N genomes with mmseqs2 and amortising the per-reaction alignment

Benchmarks show ~3× faster pathway detection on real bacterial proteomes and 35–40% lower peak memory. See Comparison with upstream gapseq for details.

Scaling out: metagenomes and multi-genome runs

gapsmith also ships optional, additive features for community and multi-genome workloads:

  • --gspa-run — consume protein clusters + alignments produced upstream by gspa so thousands of genomes share one alignment.
  • doall-batch — rayon-parallel driver with a --shard i/N selector for SLURM array jobs (scales from one machine to 1 M genomes).
  • community per-mag — per-MAG FBA under a shared (union) medium. Linear in N MAGs.
  • community cfba — compose N drafts into one community model with a shared _e0 pool and a weighted-sum biomass objective; optional balanced-growth constraint for classical cFBA semantics.

See Multi-genome & metagenome workflows for the full recipe.

Where to start

gapsmith user guide

A Rust reimplementation of gapseq — informed prediction and analysis of bacterial metabolic pathways and genome-scale networks. Single static binary, in-process LP solver, faster on every stage of the pipeline (see benchmarks in the README).

This guide walks a new user through installing gapsmith, reconstructing a genome-scale metabolic model end-to-end, and understanding each stage of the output.


1. Install

Binary prerequisites

gapsmith shells out to external aligners and a couple of HMM tools. Install whichever aligner you want to use — you only need one:

ToolRequired forapt / bioconda
blastp + makeblastdb--aligner blastp (default)ncbi-blast+
diamond--aligner diamonddiamond
mmseqs--aligner mmseqs2mmseqs2
bedtoolsnone (legacy; only needed for upstream gapsmith find shell pipeline)bedtools

--aligner precomputed skips the aligner entirely and reads a user- supplied TSV; see Precomputed alignment mode below.

Reference data

gapsmith needs the standard gapseq dat/ directory (SEED reactions + metabolites, biomass templates, medium rules, pathway tables). Get it by cloning the upstream repo:

git clone --depth 1 https://github.com/jotech/gapseq.git
export GAPSMITH_DATA_DIR=$PWD/gapseq/dat

Or point to it explicitly on every invocation with --data-dir.

Reference sequences

Sync the genome-specific BLAST/diamond reference FASTA database from Zenodo:

gapsmith update-sequences -t Bacteria             # pinned release
gapsmith update-sequences -t Bacteria -Z latest   # newest Zenodo version
gapsmith update-sequences -c                       # check-only, no download

Without this, find / find-transport / doall will fail — the FASTA files they align against live under <data-dir>/seq/<Taxonomy>/.

Build

git clone <this-repo>
cd gapsmith
cargo build --release --workspace
# Binary is target/release/gapsmith

The HiGHS LP solver is bundled via good_lphighs-sys (CMake'd from source on first build; ~30 s). An optional cbc feature adds CBC as a fallback solver — requires coinor-libcbc-dev on the build host:

cargo build --release --workspace --features gapsmith-fill/cbc

2. Quick start — reconstruct E. coli K-12 in one command

# Download an E. coli reference proteome
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/all_assembly_versions/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz

# Reconstruct: find→find-transport→draft→medium (auto)→fill
gapseq --data-dir $GAPSMITH_DATA_DIR doall \
    GCF_000005845.2_ASM584v2_protein.faa.gz \
    -f ecoli_out/ \
    -A diamond \
    -b 200 \
    -k 0.01

# Inspect the result
gapsmith fba ecoli_out/GCF_000005845.2_ASM584v2_protein-filled.gmod.cbor

Outputs in ecoli_out/:

  • <genome>-all-Reactions.tbl, <genome>-all-Pathways.tbl — find results
  • <genome>-Transporter.tbl — find-transport result
  • <genome>-draft.gmod.cbor, <genome>-draft.xml — draft model
  • <genome>-medium.csv — inferred growth medium
  • <genome>-filled.gmod.cbor, <genome>-filled.xml — gap-filled model
  • <genome>-filled-added.tsv — list of reactions added by each fill step

The filled CBOR or SBML loads directly in COBRApy, COBRAToolbox, cobrar, or any SBML-aware tool.


3. Individual stages

3.1 gapsmith find — pathway and reaction detection

Aligns the genome's proteins against gapseq's per-reaction reference FASTAs, scores each pathway for completeness, and writes a *-Reactions.tbl + *-Pathways.tbl.

# One pathway by MetaCyc id
gapsmith find -p PWY-6587 genome.faa

# A category shorthand — expands to gapseq's internal hierarchy match
gapsmith find -p amino genome.faa
#        ↑ one of: amino | nucl | cofactor | carbo | polyamine |
#                  fatty | energy | terpenoid | degradation |
#                  core | min | kegg | all

# Every active pathway in metacyc + custom (gapseq's default)
gapsmith find -p all -l metacyc,custom genome.faa

# Alternative aligner
gapsmith find -A diamond -p all genome.faa
gapsmith find -A mmseqs2 -p all genome.faa

Key options:

  • -b / --bitcutoff — BLAST bitscore cutoff for "high-evidence" hits (default 200).
  • -c / --coverage — minimum query coverage (default 75%).
  • -a / --completeness-main — pathway-completeness cutoff (default 80%).
  • -k / --completeness-hints — relaxed cutoff when every key reaction is present (default 66%).
  • -n / --include-superpathways — by default superpathways are filtered out (matches upstream gapseq).
  • -t / --taxonomyBacteria (default) or Archaea.

The output *-Pathways.tbl is byte-identical to real gapseq's output on tested cases (-p PWY-6587 and -p amino on ecoli). See the porting-notes for the handful of intentional deviations.

3.2 gapsmith find-transport — transporter detection

Iterates gapseq's transporter substrate list (dat/subex.tbl) and aligns the genome against both TCDB and the curated SEED transporter FASTAs.

gapsmith find-transport -A diamond -b 50 genome.faa
# → <stem>-Transporter.tbl

Row count and the TC-identifier set match real gapseq on the ecoli toy genome (550 distinct TC ids, 1808 rows).

3.3 gapsmith draft — build a draft model

Combines the *-Reactions.tbl + *-Transporter.tbl with the SEED reaction DB into a metabolic model. Adds biomass, exchanges, diffusion reactions, conditional transporters, dedups by stoichiometric hash.

gapsmith draft \
    -r ecoli-all-Reactions.tbl \
    -t ecoli-Transporter.tbl \
    -b neg \
    -o drafts/
# → drafts/ecoli-draft.gmod.cbor   (native format)
#   drafts/ecoli-draft.xml         (SBML L3V1+FBC2+groups, validates
#                                   against libSBML with 0 errors)

The -b flag picks the biomass template:

  • auto — read gram= from the Reactions.tbl header (gapseq defaults).
  • pos / neg — Gram+ / Gram-.
  • archaea — Archaea.
  • <path/to/custom.json> — user biomass in the same JSON format as dat/biomass/biomass_Gram_neg.json.

3.4 gapsmith medium — rule-based medium inference

Evaluates 82 boolean rules in dat/medium_prediction_rules.tsv against the draft + Pathways.tbl to infer a growth medium.

gapsmith medium \
    -m drafts/ecoli-draft.gmod.cbor \
    -p ecoli-all-Pathways.tbl \
    -o ecoli-medium.csv

# Manual overrides — force oxygen uptake 0 to predict anaerobic
gapsmith medium ... -c "cpd00007:0;cpd00011:0"

The inferred medium CSV is compatible with gapsmith fill -n. On the ecoli toy genome the output is byte-identical to upstream gapseq's toy/ecoli-medium.csv.

3.5 gapsmith fill — iterative gap-filling

4-phase pFBA + KO-essentiality pipeline. Steps 1 + 2 + 2b are the default "quick" path; Steps 3 + 4 run only with --full-suite.

gapsmith fill \
    drafts/ecoli-draft.gmod.cbor \
    -n ecoli-medium.csv \
    -r ecoli-all-Reactions.tbl \
    -k 0.01 \
    -o filled/

# Full 4-phase suite (adds energy-source + fermentation screens;
# 10-30 minutes on a whole bacterium)
gapsmith fill ... --full-suite

# Opt into the futile-cycle prune — drops candidate reactions that
# saturate at 0.99·max_flux with all boundaries closed. Expensive on
# large candidate pools.
gapsmith fill ... --prune-futile

Phase breakdown:

  1. Step 1 — user medium + biomass target. Bulk of the fills.
  2. Step 2 — MM_glu + available carbon sources; iterate every biomass substrate, attach a sink per substrate, gap-fill if infeasible.
  3. Step 2b — aerobic / anaerobic variant, depending on whether the user medium has O₂.
  4. Step 3 (--full-suite) — energy-source screen with ESP1–5 pseudo-reactions (menaquinone / ubiquinone / NADH / ferredoxin / plastoquinone redox couples).
  5. Step 4 (--full-suite) — fermentation-product screen: maximise outflow on each exchange compound.

Output: <genome>-filled.gmod.cbor + .xml + -added.tsv (rxn id + step number for every addition).

3.6 gapsmith fba — FBA / pFBA on a model

Utility to solve FBA or parsimonious FBA on a CBOR/JSON model. Useful for validating a draft before gap-filling, or for post-hoc analysis.

gapsmith fba drafts/ecoli-draft.gmod.cbor
gapsmith fba --pfba --min-growth 0.01 --top 20 filled/ecoli-filled.gmod.cbor
gapsmith fba -r EX_cpd00027_e0 filled/ecoli-filled.gmod.cbor

Split-flux LP via good_lp with HiGHS backend. Solves the 2378×2953 ecoli draft in under 100 ms.

3.7 gapsmith adapt — edit reactions + force growth

# Add specific SEED rxns
gapsmith adapt -m model.gmod.cbor -a rxn00011,rxn00102

# Add every rxn in a MetaCyc pathway
gapsmith adapt -m model.gmod.cbor -a "PWY-6587"

# Remove
gapsmith adapt -m model.gmod.cbor -r rxn00011

# Force growth on D-glucose (runs gap-fill)
gapsmith adapt -m model.gmod.cbor -w cpd00027:TRUE -b Reactions.tbl

# Force NO growth on glucose (closes uptake)
gapsmith adapt -m model.gmod.cbor -w cpd00027:FALSE

3.8 gapsmith pan — pan-draft from N drafts

# Glob a directory of drafts
gapsmith pan -m drafts/ -t 0.06 -f pan_out/

# Explicit list
gapsmith pan -m a-draft.gmod.cbor,b-draft.gmod.cbor,c-draft.gmod.cbor -t 0.5

# Only emit the binary presence/absence table
gapsmith pan -m drafts/ -b -f pan_out/

-t sets the minimum across-draft reaction frequency for inclusion in the pan-draft. Default 0.06 matches upstream gapseq.

3.9 gapsmith doall — full pipeline in one command

gapsmith doall genome.faa.gz -f out/ -A diamond -b 200 -k 0.01
# Chains: find -p all → find-transport → draft → medium (auto) → fill

Add --full-suite to include fill Steps 3/4. Use -m <medium.csv> to skip the medium-inference step and supply your own medium.


4. Precomputed alignment mode

For many-genome batch runs, gapsmith can skip the per-genome aligner call entirely and read a pre-computed alignment TSV:

# Pre-run BLAST / diamond / mmseqs2 yourself (or use gapsmith batch-align)
diamond makedb --in reference.faa -d ref.dmnd
diamond blastp -q genome.faa -d ref.dmnd -o hits.tsv \
    --outfmt 6 qseqid sseqid pident qcovs evalue bitscore stitle sstart send

# Feed to find / find-transport
gapsmith find -A precomputed -P hits.tsv -p all genome.faa
gapsmith find-transport -A precomputed -P hits.tsv genome.faa

gapsmith batch-align — cluster N genomes + single alignment pass

When you're annotating many genomes with the same reference, run one cluster + alignment over all of them and expand per-genome hits:

gapsmith batch-align \
    -q reference.faa \
    -g genomes_dir/ \
    -o aligned/ \
    --aligner diamond \
    --cluster-identity 0.5 \
    --cluster-coverage 0.8
# → aligned/<genome_id>.tsv per genome in genomes_dir/

Amortises the alignment cost over many genomes — especially effective when the proteomes share a lot of sequence homology.


4b. Multi-genome, metagenome, and gspa integration

When annotating 100+ genomes — or when the input is a metagenome — reach for these additional subcommands instead of looping doall:

--gspa-run — consume precomputed clusters from gspa

If you have a gspa run directory (mmseqs2-clustered proteins + alignment against cluster reps), point any per-genome subcommand at it:

gapsmith doall genome.faa -f out/ \
    --gspa-run /path/to/gspa-run/ \
    --gspa-genome-id my-genome

gapsmith skips the aligner entirely, reads the rep-level hits, and fans them out to this genome's cluster members.

gapsmith doall-batch — parallel N-genome driver

# 100 genomes on one machine, reusing a gspa cluster, 32 rayon workers
gapsmith doall-batch --gspa-run gspa-run/ -f out/ --jobs 32

# 1 M genomes on SLURM: 1024 array tasks, each processes ~1000 genomes
sbatch --array=0-1023 slurm_script.sh   # script passes --shard ${SLURM_ARRAY_TASK_ID}/1024

Key flags: --genomes-dir / --genomes-list / --gspa-run, --jobs, --shard i/N, --continue-on-error.

gapsmith community per-mag — shared-medium per-MAG FBA

For metagenomes with 50+ MAGs where a full cFBA isn't tractable:

gapsmith community per-mag \
    --gspa-run gspa-run/ \
    --drafts-root out/ \
    -o community/

Outputs community-medium.csv (union of per-MAG media) and per-mag-growth.tsv with abundance-weighted community growth.

gapsmith community cfba — full community LP

For small, curated communities (≤ ~50 organisms) where cross-feeding matters:

gapsmith community cfba \
    --drafts-dir out/drafts/ \
    --abundance abundances.tsv \
    --balanced-growth \
    -o community/

Composes N drafts into one block-diagonal model with a shared _e0 extracellular pool, builds a weighted-sum biomass objective, and optionally constrains every organism to the community's growth rate.

→ Full recipe: Multi-genome & metagenome workflows.


5. Solver, format, and file-layout notes

5.1 Model format

  • Native: .gmod.cbor — CBOR-encoded Model (fast load, compact).
  • JSON: .json — human-readable, via gapsmith convert.
  • SBML: .xml — SBML L3V1 + FBC2 + groups. 0 libSBML errors on every emitted model. Loads cleanly in COBRApy / COBRAToolbox.

Conversion between native formats:

gapsmith convert model.gmod.cbor model.json --pretty
gapsmith convert model.json model.gmod.cbor
gapsmith export-sbml model.gmod.cbor model.xml

5.2 LP solver

good_lp 1.15 + HiGHS (bundled). The HiGHS crate statically links its C++ solver via CMake; no system HiGHS needed.

Optional CBC fallback: compile with --features gapsmith-fill/cbc and pass --cbc-fallback in the heuristic options. Requires coinor-libcbc-dev + coinor-libclp-dev on the build host.

5.3 Sequence database layout

<seq-dir>/<Taxonomy>/
  ├── rev/<reaction>.fasta         ← curated reviewed sequences
  ├── unrev/<reaction>.fasta       ← unreviewed / broader
  ├── rxn/<rxn_id>.fasta           ← per-SEED-reaction FASTAs
  ├── tcdb.fasta                   ← transporter DB
  ├── transporter.fasta            ← curated transporter hits
  ├── user/<reaction>.fasta        ← your own references (optional)
  └── version_seqDB.json           ← Zenodo version stamp

gapsmith find / find-transport resolve reference FASTAs in this priority order: user/rxn/rev/<EC>unrev/<EC>rev/<md5(name)>unrev/<md5(name)>. Matching real gapseq's prepare_batch_alignments.R:150-234.


6. Troubleshooting

"external tool makeblastdb exited with status 1"

The tool was fed a gzipped FASTA. gapsmith doall auto-decompresses .gz; for standalone find / find-transport invocations, ungzip first.

"full candidate pool is infeasible"

The draft + every SEED reaction together can't produce biomass. Either:

  • The biomass template asks for a metabolite no SEED reaction produces (peptidoglycan cpd15665 on some drafts — see porting-notes for the known cycle gap), or
  • Your medium is missing a critical nutrient (run gapsmith medium --manual-flux to check what the rule-based inference assigns).

Use gapsmith fba --pfba --min-growth 0.001 on the draft to narrow the biomass-component that's blocking.

pFBA heuristic "max iterations reached"

HiGHS exhausted the tolerance + pFBA-coefficient ladder. Options:

  1. Widen the medium (add more carbon sources).
  2. Lower -k (minimum growth rate).
  3. Rebuild with --features cbc and opt-in to the CBC fallback.

Draft has 0 reactions / stays empty

The upstream *-Reactions.tbl has dbhit= empty on every row. That means none of the reference FASTAs were found for the pathways you asked about. Verify <seq-dir>/<Taxonomy>/ is populated (gapsmith update-sequences -c).


7. Performance tips

  • Release build: cargo build --release --workspace. Debug builds are ~50× slower for the LP-heavy stages.
  • Use diamond over BLASTp: -A diamond. Comparable accuracy, 5–20× faster on large proteomes.
  • Parallelism: -K <threads> on every aligner-invoking subcommand. gapsmith is rayon-parallel in the LP-free stages (find, transport, medium) too.
  • Batch mode: if you're annotating 10+ genomes, gapsmith batch-align
    • --aligner precomputed amortises the alignment cost.
  • Multi-genome scale-out: for 100+ genomes or 1k+ metagenomes, use gapsmith doall-batch with --gspa-run so the expensive alignment is done exactly once across the whole collection. Shard across SLURM array tasks with --shard i/N. See Multi-genome workflows.

8. Where to look next

  • porting-notes.md — every intentional deviation from upstream R gapseq.
  • feature-matrix.md — exhaustive feature list with R source → Rust module pointers.
  • README.md — status + benchmarks.
  • Per-crate rustdoc: cargo doc --workspace --no-deps --open.

CLI reference

Every gapseq subcommand, exhaustive flag list. Generated from --help, annotated per-section. Run any command with -h to see these locally.


Global options

These apply to every subcommand:

FlagDescription
--data-dir <PATH>Override gapseq reference-data directory. Resolution order when absent: GAPSMITH_DATA_DIR env → $XDG_DATA_HOME/gapsmith<exe-dir>/../dat./dat.
--seq-dir <PATH>Override the sequence database directory. Defaults to <data-dir>/seq.
-K, --threads <N>Worker threads. Default: all available cores.
-v, -vv, -vvvLogging verbosity (info / debug / trace).
-h, --helpPrint help.

gapsmith doall

End-to-end pipeline: find → find-transport → draft → medium → fill.

FlagDefaultDescription
<GENOME>Required. Protein FASTA (.faa / .faa.gz).
-A, --alignerblastpblastp, diamond, mmseqs2, or precomputed.
-b, --bitcutoff200.0Bitscore cutoff for the find stages.
-c, --coverage75Coverage cutoff (0–100 %).
-l, --min-bs-core50.0Core-reaction bitscore cutoff.
-t, --taxonomyautoBacteria, Archaea, or auto (currently maps to Bacteria).
-m, --mediumautoMedium CSV, or auto to infer via gapsmith medium.
-f, --out-dir.Output directory.
-K, --threadsall coresThread count.
--full-suiteoffInclude fill Steps 3 + 4 (slow).
-k, --min-growth0.01Minimum growth for gap-filling.
--gspa-run <DIR>Read precomputed cluster-rep hits from a gspa manifest instead of running the aligner.
--gspa-genome-id <ID>FASTA stemRow id in the manifest's genomes.tsv matching this run.

Gzipped inputs are auto-decompressed into a tempfile::tempdir.


gapsmith find

Pathway and reaction detection.

FlagDefaultDescription
<GENOME>Protein FASTA.
-p, --pathwaysallPathway keyword or id (or all).
-l, --pathway-dbmetacyc,customComma-separated: metacyc, kegg, seed, custom, all.
-t, --taxonomyBacteriaReference subfolder under seq-dir.
-A, --alignerdiamondAligner backend.
-P, --precomputedPrecomputed alignment TSV (skip aligner).
--gspa-run <DIR>gspa manifest — expands rep-level hits onto this genome's cluster members.
--gspa-genome-id <ID>FASTA stemGenome id in the manifest matching this invocation.
--gspa-coverage-fractionoffTreat the alignment TSV's qcov as 0–1 (mmseqs2 native).
-b, --bitcutoff200.0Bitscore cutoff.
-i, --identcutoff0.0Identity cutoff (%).
-c, --coverage75Coverage cutoff.
-a, --completeness-main80.0Pathway completeness cutoff.
-k, --completeness-hints66.0Relaxed cutoff when every key reaction is present.
-s, --strictoffDisable key-reaction heuristic.
-n, --include-superpathwaysoffDefault is to filter them out (matches upstream).
-o, --out-dir.Output directory.
-u, --suffixFilename suffix <stem>-<suffix>-{Reactions,Pathways}.tbl.

gapsmith find-transport

Transporter detection.

FlagDefaultDescription
<GENOME>Protein FASTA.
-A, --alignerblastpAligner backend.
-P, --precomputedPrecomputed TSV.
-b, --bitcutoff50.0Transport hits use a lower threshold than biosynthetic.
-i, --identcutoff0.0Identity cutoff.
-c, --coverage75Coverage cutoff.
-a, --nouse-alternativesoffDisable the alt-transporter fallback.
-m, --only-metRestrict to one substrate keyword.
-o, --out-dir.Output directory.

gapsmith draft

Build a draft metabolic model from find + find-transport output.

FlagDefaultDescription
-r, --reactionsrequired*-Reactions.tbl.
-t, --transporterrequired*-Transporter.tbl.
-b, --biomassautoauto / pos / neg / archaea / <path.json>.
-n, --nameinferredModel id.
-u, --high-evi-rxn-bs200.0Bitscore threshold for core reactions.
-l, --min-bs-for-core50.0Lower bound for candidate pool.
-o, --out-dir.Output directory.
--no-sbmloffSkip SBML emission.

gapsmith medium

Rule-based medium inference.

FlagDefaultDescription
-m, --modelrequiredDraft CBOR / JSON.
-p, --pathwaysrequired*-Pathways.tbl from find.
-c, --manual-fluxcpdXXXXX:val;cpdYYYYY:val overrides.
-o, --outputinferredOutput CSV path.
-f, --out-dir.Output directory (when -o is absent).

gapsmith fill

Iterative gap-filling.

FlagDefaultDescription
<MODEL>requiredDraft CBOR / JSON.
-n, --mediarequiredMedium CSV.
-r, --reactions*-Reactions.tbl for bitscore-weighted pFBA.
-t, --targetcpd11416Biomass pseudo-metabolite id.
-k, --min-growth0.01Minimum growth rate floor.
-b, --bcore50.0Minimum bitscore for "core" classification.
--high-evi200.0Bitscore-to-weight upper calibration point.
--dummy-weight100.0Weight for reactions with no hit.
-o, --out-dir.Output directory.
--no-sbmloffSkip SBML.
--step1-onlyoffRun only Step 1 (debugging).
--full-suiteoffInclude fill Steps 3 + 4.
--prune-futileoffThermodynamic futile-cycle prune (opt-in; slow).

gapsmith fba

FBA / pFBA on an existing model.

FlagDefaultDescription
<MODEL>requiredCBOR / JSON model.
-r, --objectivemodel's ownObjective reaction id.
--pfbaoffParsimonious FBA.
--pfba-coef0.001pFBA biomass trade-off coefficient.
--min-growth0.0Biomass floor (pFBA only).
--top20Print the top-N highest-absolute-flux reactions.
--minimiseoffMinimise instead of maximise.

gapsmith adapt

Edit reactions or force growth on a compound.

FlagDefaultDescription
-m, --modelrequiredModel file.
-a, --addComma-separated rxnNNNNN or MetaCyc pathway ids.
-r, --removeSame format; reactions are removed.
-w, --growthcpdNNNNN:TRUE / cpdNNNNN:FALSE.
-b, --reactions*-Reactions.tbl (needed for -w ...:TRUE gap-filling).
-k, --min-growth0.01Min growth floor for the -w ...:TRUE path.
-f, --out-dir.Output directory.
--no-sbmloffSkip SBML.

gapsmith pan

Build a pan-draft from N drafts.

FlagDefaultDescription
-m, --modelsrequiredComma-separated / directory / single path.
-t, --min-freq0.06Minimum across-draft reaction frequency.
-b, --only-binaryoffSkip the pan-draft, emit only the rxn × model presence TSV.
-f, --out-dir.Output directory.
--no-sbmloffSkip SBML.

gapsmith update-sequences

Zenodo seqdb sync.

FlagDefaultDescription
-t, --taxonomyBacteriaWhich seqdb to sync.
-D, --seq-dir<data-dir>/seqSeqdb location.
-Z, --recordpinnedZenodo record id, or latest.
-c, --check-onlyoffReport version, no download.
-q, --quietoffSuppress progress messages.

gapsmith convert

CBOR ↔ JSON round-trip.

FlagDefaultDescription
<INPUT>requiredPath to source file.
<OUTPUT>requiredDestination path.
--tofrom extensioncbor / json (overrides extension).
--prettyoffPretty-print JSON.

gapsmith export-sbml

Serialise a CBOR / JSON model as SBML L3V1 + FBC2 + groups.

FlagDefaultDescription
<INPUT>requiredCBOR / JSON model.
<OUTPUT>requiredDestination .xml path.
--compactoffOmit pretty-print / nested indentation.

gapsmith align

Run a single aligner standalone. Useful for debugging reference-FASTA issues or building a precomputed TSV to feed into find --aligner precomputed.

FlagDefaultDescription
-A, --alignerrequiredblastp / tblastn / diamond / mmseqs2 / precomputed.
-q, --queryrequiredQuery FASTA.
-t, --targetrequiredTarget FASTA.
-P, --precomputedWhen -A precomputed, path to the TSV.
-b, --bitcutoff0.0Filter hits below this bitscore.
-c, --coverage0Filter below this coverage %.
-e, --evalue1e-5E-value cutoff.
--extraPassthrough flags to the aligner binary.
-o, --outstdoutOutput TSV path.

gapsmith batch-align

Cluster N genomes + single alignment + per-genome TSV expansion.

FlagDefaultDescription
-q, --queryrequiredReference query FASTA.
-g, --genomesrequiredDirectory containing .faa(.gz) files.
-o, --out-dirrequiredOutput <genome>.tsv directory.
-A, --alignerdiamondAligner backend (blastp / diamond / mmseqs2).
--cluster-identity0.5mmseqs2 cluster identity.
--cluster-coverage0.8mmseqs2 cluster coverage.
-b, --bitcutoff0.0Per-hit bitscore filter.
-c, --coverage0Per-hit coverage filter.

gapsmith doall-batch

Run doall across many genomes in parallel. See multi-genome.md for the full recipe.

FlagDefaultDescription
-g, --genomes-dir <DIR>Directory of protein FASTAs.
--genomes-list <TSV>Explicit TSV list (id<TAB>path[<TAB>abundance]).
--gspa-run <DIR>Pulls the genome list from a gspa manifest.
-f, --out-dir <DIR>requiredOne <genome_id>/ subdir is written per input.
-j, --jobs <N>all coresRayon pool size.
--shard <i/N>Select genomes where index mod N == i.
--continue-on-erroroffLog failed genomes to doall-batch-errors.tsv instead of aborting.
(plus every passthrough flag from doall)

gapsmith community

Community-level optimisation. Two subcommands.

gapsmith community per-mag

Per-MAG FBA under a shared (union) medium. Linear in N.

FlagDefaultDescription
--drafts-dir <DIR>Directory of <id>-{draft,filled}.gmod.cbor.
--drafts-list <TSV><id><TAB><cbor>[<TAB><medium>].
--gspa-run <DIR>Use manifest's genome list.
--drafts-root <DIR>.Where per-genome doall outputs live (paired with --gspa-run).
-m, --medium <CSV>autoOverride the inferred shared medium.
-o, --out-dir <DIR>.Writes community-medium.csv, per-mag-growth.tsv.

gapsmith community cfba

Compose N drafts into one community model; solve weighted-sum biomass.

FlagDefaultDescription
--drafts-dir <DIR> / --drafts-list <TSV> / --gspa-run <DIR>Exactly one (same semantics as per-mag).
--drafts-root <DIR>.Where doall outputs live.
--abundance <TSV><id><TAB><abundance> overrides.
--biomass-rxn <ID>bio1Per-organism biomass reaction id.
--balanced-growthoffAdd v(bio_k) == v(bio_community) constraints.
-m, --medium <CSV>Optional shared medium CSV.
-o, --out-dir <DIR>.Writes community.gmod.cbor, community-fba.tsv.

gapsmith db inspect

Smoke-test the --data-dir. Loads every reference table and prints row counts. No flags beyond --data-dir.


gapsmith test

Print resolved paths + which external tool binaries are on PATH.


gapsmith example-model

Emit a small hand-built toy model (3 metabolites / 2 reactions) as CBOR. Useful for smoke-testing format round-trips and downstream tools.

FlagDefaultDescription
<OUTPUT>requiredDestination CBOR path.
--complexoffEmit a slightly bigger "complex" variant.

Multi-genome & metagenome workflows

Gapsmith's single-genome pipeline (doall) stays the default. This chapter covers three additional, opt-in capabilities for running at scale:

  1. --gspa-run — consume precomputed protein clusters + alignments from the upstream gspa pipeline instead of re-running DIAMOND/BLASTp per genome.
  2. doall-batch — rayon- and SLURM-friendly driver that fans doall across N genomes (targets 1 000 metagenomes; scales to ~1 M genomes via --shard).
  3. community — community-level objectives for metagenomes: a cheap per-MAG mode with a shared union medium, and a full cFBA mode that composes N drafts into one model.

Nothing in here breaks single-genome usage. If you never pass --gspa-run or use the new subcommands, behaviour is identical to earlier releases.


1. Consuming a gspa manifest (--gspa-run)

Why

Annotating N genomes with a single mmseqs2/linclust cluster + one alignment against cluster representatives is dramatically cheaper than N independent BLASTp runs. gspa's Phase 11 already produces that cluster; --gspa-run plugs the output into gapsmith find / find-transport / doall.

Manifest layout (v1)

gspa writes (or you assemble by hand) a directory with this shape:

<gspa-run>/
  clusters.tsv            # 3 cols: rep_id<TAB>member_id<TAB>genome_id
  alignment/<ref>.tsv     # 8-col gapsmith TSV keyed on rep_id
  genomes.tsv             # 2–3 cols: genome_id<TAB>faa_path[<TAB>abundance]
  gspa-manifest.toml      # optional — version/provenance marker
  • clusters.tsv: every (rep, member, genome) triple emitted by the cross-genome mmseqs2 cluster. The rep appears on its own line as its own member.
  • alignment/*.tsv: one or more 8-column files in gapsmith's canonical format (qseqid pident evalue bitscore qcov stitle sstart send). All .tsv / .tab / .txt files in that directory are concatenated transparently — useful when you run DIAMOND and mmseqs2 separately and want both sources of evidence.
  • genomes.tsv: the authoritative list of organism ids + FASTA paths. The optional 3rd column carries a relative abundance (any positive float — RPKM / TPM / raw counts / …) that propagates into community cfba.

Usage

# Single genome, alignment reused from an already-built gspa manifest:
gapsmith doall /path/to/genome.faa -f out/ \
    --gspa-run /path/to/gspa-run/ \
    --gspa-genome-id genome

# If the FASTA stem already matches the genome_id in genomes.tsv,
# --gspa-genome-id can be omitted.

--gspa-coverage-fraction re-scales mmseqs2-native coverage columns (0–1) up to the gapsmith convention (0–100). Diamond/BLASTp TSVs are already 0–100 and need no flag.

What happens under the hood

gapsmith_align::GspaRunAligner implements the Aligner trait. On align() it:

  1. Reads alignment/*.tsv lazily (once, cached).
  2. Looks up each rep hit in clusters.tsv filtered to target_genome_id.
  3. Emits one Hit per member, rewriting qseqid to the member's original protein id so downstream find / find-transport see a normal per-genome hit table.

2. Batch reconstruction (doall-batch)

When to reach for it

ScaleRecommended tool
1 genomegapsmith doall
10 – 200 genomes, 1 boxgapsmith doall-batch --jobs <cores>
1 k – 1 M genomes, clusterdoall-batch --shard i/N inside a SLURM array

Flags that matter

  • --genomes-dir <dir> / --genomes-list <tsv> / --gspa-run <dir> — exactly one; the last option pulls the genome list straight out of the gspa manifest.
  • --jobs N — rayon pool size. Each worker processes one genome at a time; we pin doall's own --threads to 1 so rayon gets clean parallelism.
  • --shard i/N — SLURM-array-friendly selector. After sorting the genome list alphabetically, only genomes whose 0-based index k satisfies k mod N == i are processed. Every shard is independent — no shared state, no cross-task coordination.
  • --continue-on-error — log-and-skip instead of aborting the batch when a single genome's doall fails. Failures are written to <out-dir>/doall-batch-errors.tsv.

Recipes

100 genomes on one box, reusing a gspa cluster

gapsmith doall-batch \
    --gspa-run /data/gspa-run/ \
    -f /out/reconstructions/ \
    --jobs 32 \
    --continue-on-error

Each <genome_id>/ subdirectory under /out/reconstructions/ gets the standard doall outputs.

1 M genomes on a SLURM cluster

#!/bin/bash
#SBATCH --array=0-1023
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00

gapsmith doall-batch \
    --gspa-run /scratch/gspa-run/ \
    -f /scratch/recon/ \
    --jobs $SLURM_CPUS_PER_TASK \
    --shard ${SLURM_ARRAY_TASK_ID}/1024 \
    --continue-on-error
  • One gspa run on the full 1 M genome set covers the expensive alignment step globally.
  • 1 024 SLURM tasks each process ~1 000 genomes in parallel; add tasks or add cores per task to change the balance.

3. Community optimisation (community)

Two subcommands; pick the one that matches your accuracy / scale trade-off. You can run both on the same input and compare.

community per-mag — scalable default

Each MAG keeps its own GEM and its own FBA. We (a) build a shared medium that is the union of every per-MAG medium (largest maxFlux per compound wins), (b) run FBA on every draft under that shared medium, and (c) report the abundance-weighted mean growth rate.

gapsmith community per-mag \
    --gspa-run /data/gspa-run/ \
    --drafts-root /out/reconstructions/ \
    -o /out/community/

Outputs:

  • community-medium.csv — the shared medium.
  • per-mag-growth.tsvgenome_id, abundance, growth_rate, status per MAG.

Complexity: linear in N; no community-level LP. This is the mode for ≥ 100 MAGs where a full cFBA isn't tractable.

community cfba — full community LP

Composes N drafts into one block-diagonal model. Private mets/rxns are suffixed __<genome_id>; extracellular mets (_e0) and exchange reactions (EX_*_e0) are shared so organisms cross-feed through a single pool. Bounds on shared exchanges are widened to the most permissive input.

A synthetic bio_community reaction is added whose flux equals Σ w_k · bio_k (weights come from --abundance or genomes.tsv).

gapsmith community cfba \
    --gspa-run /data/gspa-run/ \
    --drafts-root /out/reconstructions/ \
    --abundance /data/abundances.tsv \
    --balanced-growth \
    -m /data/community-medium.csv \
    -o /out/community/

Outputs:

  • community.gmod.cbor — composed community model.
  • community-fba.tsv — per-organism biomass flux plus the community-level objective.

--balanced-growth adds one linear constraint per organism forcing v(bio_k) == v(bio_community), matching classical cFBA semantics. Drop the flag for an abundance-weighted average without the equal-growth constraint (faster, more realistic when abundances are very uneven).

Scaling: the community LP grows roughly N × (rxns + mets). Expect comfortable behaviour up to ~50 organisms; larger communities are better served by per-mag first, with cFBA applied to selected sub-communities.


4. Decision tree

                 "metagenome / multi-MAG run"
                             │
            ┌────────────────┴────────────────┐
            │                                 │
     small curated set                 thousands of MAGs
      (≤ ~50 organisms)                  or fast iteration
            │                                 │
    ┌───────┴───────┐                 ┌───────┴───────┐
    │ need obligate │                 │ need obligate │
    │ cross-feeding?│                 │ cross-feeding?│
    └───────┬───────┘                 └───────┬───────┘
      yes  │    │ no                    yes  │    │ no
           │    │                            │    │
  community cfba  community per-mag  community cfba on  community per-mag
  --balanced-growth                  selected cohorts

Rules of thumb:

  • 1 — 50 organisms, synthetic or curated community: cfba.
  • 50 — 1 000 organisms, real-world metagenome: per-mag.
  • 1 000 organisms or iterating over many assemblies: always start with doall-batch + per-mag; drop to cfba only for sub-cohorts.


5. Producing a gspa run from outside gspa

Any tool that can emit the three files above works. A minimal hand-rolled example:

# 1. Concatenate proteomes, tagging each header with its genome id.
mkdir -p run/
python tools/concat_with_prefix.py --in genomes/ --out run/all.faa

# 2. Cluster across genomes.
mmseqs easy-cluster run/all.faa run/cluster run/tmp \
    --min-seq-id 0.5 -c 0.8 --threads 32

# 3. Pivot the 2-col `run/cluster_cluster.tsv` into 3-col clusters.tsv
#    by peeling each member's `<genome_id>|` prefix.
awk -F$'\t' '{
  member=$2; rep=$1;
  n = split(member, a, "|"); gid=a[1];
  orig=substr(member, length(gid)+2);
  print rep "\t" orig "\t" gid
}' run/cluster_cluster.tsv > run/clusters.tsv

# 4. Run alignment ONCE against cluster reps.
mkdir -p run/alignment
diamond makedb --in reference.faa -d ref
diamond blastp -q run/cluster_rep_seq.fasta -d ref \
    --outfmt 6 qseqid pident evalue bitscore qcovhsp stitle sstart send \
    --more-sensitive -k 10 -e 1e-5 > run/alignment/reference.tsv

# 5. genomes.tsv (optional 3rd column = abundance).
for f in genomes/*.faa; do
  id=$(basename "$f" .faa)
  echo -e "$id\t$(realpath "$f")"
done > run/genomes.tsv

After that, gapsmith doall-batch --gspa-run run/ --jobs 32 -f out/ is all you need.

Architecture

A bird's-eye view of how gapsmith is organised, how data flows through it, and what each crate is responsible for.


1. Crate dependency graph

             gapsmith-core            ← Types only, no I/O / DB / solver
              ↑  ↑  ↑
              │  │  │
    ┌─────────┘  │  └────────────────┐
    │            │                   │
  gapsmith-io   gapsmith-db           gapsmith-sbml
    │            │                   ↑
    └─┬──────────┘                   │
      │                              │
      ▼                              │
  gapsmith-align ──┐                   │
      │          │                   │
      ▼          │                   │
  gapsmith-find    │                   │
      │          │                   │
      ├─▶ gapsmith-transport           │
      │                              │
      ▼                              │
  gapsmith-draft ──────────────────────┤
      │                              │
      ▼                              │
  gapsmith-fill ◀─── gapsmith-medium    │
      │                              │
      └──────────────▶ gapsmith-cli ◀──┘

Arrows point from "depended on" toward "depends on". gapsmith-core is the leaf: every other crate builds on its Model / Reaction / Metabolite / StoichMatrix types.


2. Data flow for gapsmith doall

  genome.faa.gz
        │
        ▼
  ┌──────────────┐                      ┌──────────────────────┐
  │ gapsmith find  │ ──── aligner (shell) │  dat/seq/<tax>/      │
  │ -p all       │          ─ BLAST/    │  rev/ unrev/ rxn/    │
  └──────┬───────┘            diamond/  └──────────────────────┘
         │                    mmseqs2
         │   *-Reactions.tbl  + *-Pathways.tbl
         ▼
  ┌──────────────────────┐
  │ gapsmith find-transport│ ─── same aligner, subex.tbl + tcdb.fasta
  └──────┬───────────────┘
         │   *-Transporter.tbl
         ▼
  ┌────────────────────┐
  │ gapsmith draft       │ ─── seed_reactions_corrected.tsv, biomass JSON
  └──────┬─────────────┘
         │   *-draft.gmod.cbor  (CBOR) + *-draft.xml  (SBML)
         ▼
  ┌──────────────────┐
  │ gapsmith medium    │ ─── medium_prediction_rules.tsv + *-Pathways.tbl
  │ (auto)           │
  └──────┬───────────┘
         │   *-medium.csv
         ▼
  ┌────────────────────┐
  │ gapsmith fill        │ ─── HiGHS LP (in-process via good_lp)
  │ 4-phase suite      │
  └──────┬─────────────┘
         │   *-filled.gmod.cbor  +  *-filled.xml  +  *-filled-added.tsv
         ▼
  (COBRApy, COBRAToolbox, cobrar, …)

Every arrow between stages is plain filesystem I/O. No daemon, no IPC. Each subcommand is individually re-runnable; if you change the medium you only re-run fill.


3. Model representation

[gapsmith_core::Model] is the single source of truth:

#![allow(unused)]
fn main() {
pub struct Model {
    pub annot: ModelAnnot,           // provenance metadata
    pub compartments: Vec<Compartment>,
    pub mets: Vec<Metabolite>,
    pub rxns: Vec<Reaction>,
    pub genes: Vec<GeneId>,
    pub s: StoichMatrix,              // sprs CsMat in CSC
}
}

Serialised via serde as:

  • CBOR (.gmod.cbor) — native, compact, fast load. The gapsmith internal format; replaces upstream's .RDS.
  • JSON (.json) — human-readable; gapsmith convert converts both ways. Same semantic content as CBOR.
  • SBML (.xml) — Level 3 Version 1 + FBC 2 + groups 1. Written by [gapsmith_sbml::write_sbml]; loads cleanly in COBRApy / COBRAToolbox / cobrar.

Metabolite ids use the cpd00001_c0 convention (compartment baked into the id) for SBML SId compliance. Reaction ids likewise — rxn00001_c0, EX_cpd00001_e0, bio1, etc.


4. LP plumbing

The gap-filler does a lot of LPs — dozens per Step 1 / 2 / 2b / 3 / 4 run, sometimes thousands in a --full-suite. Key design decisions:

Split-flux encoding

Every reaction r becomes two variables vp_r, vn_r ≥ 0. The net flux is v_r = vp_r − vn_r. This makes |v_r| = vp_r + vn_r linear — essential for pFBA.

Bound translation

model [lb, ub]vp_r uppervn_r upper
lb ≥ 0, ub ≥ 0ub0
lb ≤ 0, ub ≥ 0ub−lb
lb ≤ 0, ub ≤ 00−lb

Strict-direction reactions additionally get vp_r ≥ lb (forward-forced) or vn_r ≥ −ub (backward-forced).

Mass balance

Walk the CSC matrix once column-by-column (O(nnz) total). For each non-zero (row, col, coef), accumulate coef · (vp[col] − vn[col]) onto row row's expression. The naive per-row scan would be O(m · n) — ~45 B iterations on a 3k×5k model. The O(nnz) walk solves it in under 100 ms on ecoli.

Solver

good_lp 1.15 with HiGHS backend. HiGHS is statically linked via CMake at build time — no runtime HiGHS dependency. Optional CBC fallback behind --features cbc: pfba_heuristic retries once with CBC after HiGHS exhausts the tolerance / coefficient ladder.


5. Alignment layer

gapsmith-align exposes a single [Aligner] trait:

#![allow(unused)]
fn main() {
pub trait Aligner {
    fn run(&self, query: &Path, targets: &[&Path], opts: &AlignOpts)
        -> Result<Vec<Hit>, AlignError>;
}
}

Implementors — one struct each — shell out to the matching binary and normalise its output to a [Hit] with the gapseq-canonical column set (qseqid / pident / evalue / bitscore / qcovs / stitle / sstart / send).

  • [BlastpAligner] builds an indexed BLAST DB in a temp dir per call.
  • [DiamondAligner] mirrors gapseq's --more-sensitive default.
  • [Mmseqs2Aligner] uses the explicit createdb → search → convertalis 4-command pipeline (not easy-search — see porting-notes for why).
  • [PrecomputedTsvAligner] skips all shelling out and reads a user-supplied TSV.
  • [BatchClusterAligner] is the Rust-only feature: mmseqs2-clusters N input genomes, runs one alignment, expands cluster membership back to per-genome hits.
  • [GspaRunAligner] is the multi-genome / metagenome entry point: given a gspa manifest (clusters.tsv + alignment/*.tsv + genomes.tsv) and a target genome id, it fans every rep-level hit onto that genome's cluster members. No alignment is re-run — gspa clustered across N genomes once, upstream.

6. Candidate-pool / gap-fill

The 4-phase gap-fill suite (gapsmith-fill::run_suite) is the heaviest computation in the pipeline. Sketch of one fill call:

  draft model  +  medium CSV  +  Reactions.tbl (bitscore weights)
                         │
                         ▼
          ┌──────────────────────────────┐
          │ apply_medium(draft)          │
          │ attach EX_<target>_c0 obj    │
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ build_full_model             │  ← every approved SEED rxn
          │   draft + candidates         │    not already present,
          │   deduped by stoich hash     │    sorted by core status
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ (optional) detect_futile_    │  ← parallel LP probe,
          │   cycles  +  drop            │    `--prune-futile`
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ pfba_heuristic               │  ← tolerance ladder,
          │   min_growth ≥ k             │    `1e-6 → 1e-9`
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ extract utilized candidates  │
          │ add to draft                 │
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ KO essentiality loop         │  ← zero bounds, recheck FBA,
          │   serial, core-first         │    drop non-essential
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ Step 2: biomass components   │  ← per substrate, MM_glu
          ├──────────────────────────────┤
          │ Step 2b: aerobic/anaerobic   │
          ├──────────────────────────────┤
          │ Step 3: ESP1-5 energy screen │  (`--full-suite`)
          ├──────────────────────────────┤
          │ Step 4: fermentation screen  │  (`--full-suite`)
          └──────────────────────────────┘
                         │
                         ▼
                    filled model

6b. Multi-genome and community paths

The single-genome flow above is unchanged. Three additional, opt-in paths sit alongside it.

6b.1 Alignment reuse via gspa

   N proteomes ───▶  [gspa, upstream]  ──▶  gspa-run directory
                     (mmseqs2 linclust,      clusters.tsv
                      diamond vs. ref)       alignment/*.tsv
                                             genomes.tsv
                           │
     ┌─────────────────────┼─────────────────────┐
     ▼                     ▼                     ▼
  gapsmith find         find-transport         doall-batch
   --gspa-run …          --gspa-run …           --gspa-run …
     │                     │                     │
     └─ reads clusters.tsv + alignment/*.tsv, expands rep
        hits onto each target genome's cluster members
        (gapsmith_align::GspaRunAligner).

No alignment is run inside gapsmith when --gspa-run is set. Per-genome find still applies its bitscore / coverage / completeness gates on the fanned-out hits, so the downstream pipeline is unchanged.

6b.2 doall-batch parallel driver

   genomes.tsv  ──▶ sort + shard (`--shard i/N`) ──▶ rayon pool
                                                       │
                                                       ├─▶ doall genome_1
                                                       ├─▶ doall genome_2
                                                       ├─▶ ...
                                                       └─▶ doall genome_k
   out-dir/
     <genome_1>/
     <genome_2>/
     ...
     doall-batch-errors.tsv  ← present only with --continue-on-error

Each worker calls doall::run_cli in-process with its --threads pinned to 1 so rayon owns the parallelism. SLURM array tasks get a disjoint --shard i/N — no inter-task coordination, no shared state.

6b.3 Community modes

   N drafts (gmod.cbor) ──┐
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
   community per-mag           community cfba
   (gapsmith-fill::             (gapsmith-fill::
    community::                  community::
     union_medium,                compose_models,
     per_mag_weights)             add_community_biomass)
              │                       │
              ▼                       ▼
   per-mag-growth.tsv +        community.gmod.cbor +
   community-medium.csv        community-fba.tsv
   (N independent FBAs)        (1 big LP across the
                                block-diagonal model)

compose_models suffixes every private metabolite and reaction with __<organism>, leaves extracellular (_e0) ids unsuffixed so they collapse into one shared pool, and takes the most permissive EX_*_e0 bounds across organisms. add_community_biomass appends a community_biomass pseudo-metabolite plus a draining bio_community reaction (Σ w_k · bio_k), optionally adding one equality per organism (v(bio_k) == v(bio_community)) for balanced-growth cFBA. The regular [fba::fba] solves the resulting LP with no solver-side changes.


7. Testing surface

Every algorithmic component has unit tests (~170 total, cargo test --workspace). On top, three integration layers:

  1. Shell-parity tests (crates/gapsmith-align/tests/*_parity.rs) — run the real BLAST / diamond / mmseqs2 binaries and diff against our wrappers' output.
  2. R-parity tests (crates/gapsmith-find/tests/complex_parity.rs, pipeline_parity.rs, crates/gapsmith-transport/tests/parity.rs) — run the actual R gapseq via Rscript and diff column-by-column.
  3. SBML validator (tools/validate_sbml.py) — uv-managed venv with python-libsbml + cobra; runs libSBML consistency checks + COBRApy round-trip on emitted SBML. 0 errors on every model.

8. File-system layout

gapsmith/
  crates/
    gapsmith-core/               # types
    gapsmith-io/                 # CBOR/JSON + path resolver
    gapsmith-db/                 # dat/*.tsv loaders
    gapsmith-sbml/               # SBML writer
    gapsmith-align/              # aligner trait + 6 backends (incl. gspa-run)
    gapsmith-find/               # pathway + reaction detection
    gapsmith-transport/          # transporter detection
    gapsmith-draft/              # draft model builder
    gapsmith-medium/             # rule-based medium inference
    gapsmith-fill/               # FBA / pFBA / gap-filler / suite / community
    gapsmith-cli/                # clap dispatch + every subcommand
  tools/
    bench/                     # R-vs-rs benchmark runner
    validate_sbml.py           # libSBML + COBRApy validator
    r_complex_detection.R      # R parity harness
    .sbml-validate/            # uv venv (python-libsbml + cobra)
  docs/
    user-guide.md              # end-user walk-through
    cli-reference.md           # exhaustive flag list
    multi-genome.md            # metagenome, gspa integration, cFBA
    feature-matrix.md          # R source → Rust module pointers
    porting-notes.md           # intentional deviations from upstream
    architecture.md            # this file
  Cargo.toml                   # workspace manifest
  README.md                    # status + benchmarks

Feature matrix

Exhaustive list of everything gapsmith implements, one row per feature, with pointers to both the upstream R source and the Rust module.

Status legend:

  • ✅ — implemented and tested against real gapseq where feasible
  • 🆕 — Rust-only feature (no upstream equivalent)
  • ⚠️ — shipped but intentionally deviates from upstream (see porting-notes.md)
  • ❌ — deferred / intentionally not ported

1. Subcommands

SubcommandR sourceRust moduleStatus
gapsmith testgapsmith-cli/src/commands/test.rs
gapsmith findsrc/gapseq_find.sh + src/*.Rgapsmith-find/ + gapsmith-cli/src/commands/find.rs✅ byte-identical on PWY-6587 & amino
gapsmith find-transportsrc/transporter.sh + src/analyse_alignments_transport.Rgapsmith-transport/ + gapsmith-cli/src/commands/find_transport.rs✅ TC-set + row-count identical
gapsmith draftsrc/generate_GSdraft.R + src/prepare_candidate_reaction_tables.Rgapsmith-draft/ + gapsmith-cli/src/commands/draft.rs✅ SBML validates (0 libSBML errors)
gapsmith mediumsrc/predict_medium.Rgapsmith-medium/ + gapsmith-cli/src/commands/medium.rs✅ byte-identical on ecoli
gapsmith fillsrc/gf.suite.R + src/gapfill4.Rgapsmith-fill/ + gapsmith-cli/src/commands/fill.rs✅ 4-phase suite + KO loop
gapsmith adaptsrc/adapt.R + src/gf.adapt.Rgapsmith-cli/src/commands/adapt.rs⚠️ EC / KEGG / name resolution deferred
gapsmith pansrc/pan-draft.Rgapsmith-cli/src/commands/pan.rs✅ union + binary table
gapsmith doallsrc/doall.shgapsmith-cli/src/commands/doall.rs✅ end-to-end on ecore in 2m47s
gapsmith update-sequencessrc/update_sequences.shgapsmith-cli/src/commands/update_sequences.rs✅ Zenodo sync + md5 diff
gapsmith convertgapsmith-cli/src/commands/convert.rs🆕 CBOR ↔ JSON round-trip
gapsmith example-modelgapsmith-cli/src/commands/example_model.rs🆕 toy model fixture
gapsmith db inspectgapsmith-cli/src/commands/db.rs🆕 reference-data row-count dump
gapsmith export-sbmlcobrar::writeSBMLmodgapsmith-cli/src/commands/export_sbml.rs🆕 CBOR → SBML
gapsmith aligngapsmith-cli/src/commands/align.rs🆕 debug-wrap for a single aligner
gapsmith batch-aligngapsmith-cli/src/commands/batch_align.rs🆕 cluster N genomes + single alignment
gapsmith doall-batchgapsmith-cli/src/commands/doall_batch.rs🆕 rayon + SLURM-shard parallel doall across N genomes
gapsmith community per-maggapsmith-cli/src/commands/community.rs🆕 per-MAG FBA under a shared (union) medium
gapsmith community cfbagapsmith-cli/src/commands/community.rs🆕 compose N drafts, weighted-sum biomass objective
gapsmith fbagapsmith-cli/src/commands/fba.rs🆕 FBA / pFBA standalone

2. Core algorithms

2.1 Alignment layer (gapsmith-align)

FeatureR sourceRust module
BLASTp wrappergapseq_find.sh blastp blockblast.rs::BlastpAligner
tBLASTn wrappersame, for -n nuclblast.rs::TblastnAligner
DIAMOND wrappergapseq_find.sh diamond blockdiamond.rs::DiamondAligner
mmseqs2 wrapper (full pipeline)gapseq_find.sh mmseqs blockmmseqs2.rs::Mmseqs2Aligner
Precomputed TSV inputprecomputed.rs::PrecomputedTsvAligner 🆕
Batch-cluster (N genomes → 1 alignment)batch.rs::BatchClusterAligner 🆕
gspa-run manifest reader (cluster-aware hit expansion)gspa.rs::{GspaManifest, GspaRunAligner} 🆕
2-decimal scientific e-value formatBLAST -outfmt 6 nativetsv.rs

2.2 find pipeline (gapsmith-find)

FeatureR sourceRust module
Pathway table loader (meta / kegg / seed / custom)gapseq_find.sh:520-532gapsmith-db::PathwayTable
metacyc + custom merge (custom-wins-on-id)samesame
Keyword-shorthand resolution (amino, carbo, ...)gapseq_find.sh:40-60pathways.rs::MatchMode::Hierarchy
Reference FASTA resolver (user/rxn/rev/ECunrev/EC → md5)prepare_batch_alignments.R:150-234seqfile.rs
Complex-subunit detectioncomplex_detection.Rcomplex.rs (R-parity on 9 cases)
Hit classification with exception tableanalyse_alignments.R:108-189classify.rs
Pathway completeness scoring (f64 precision)filter_pathways.R:10-34pathways.rs::score
dbhit lookup (EC + altEC + MetaCyc id + enzyme name)getDBhit.R:60-130dbhit.rs
noSuperpathways=true defaultgapseq_find.sh:20find::FindOptions
Word-boundary-less header filter (matches shell grep -Fivf)gapseq_find.shseqfile.rs ⚠️ intentional
Output writers (Reactions.tbl, Pathways.tbl)sameoutput.rs

2.3 find-transport pipeline (gapsmith-transport)

FeatureR sourceRust module
subex.tbl substrate filtertransporter.sh:140-280filter.rs
TC-id parsing + type canonicalisationanalyse_alignments_transport.R:1-188tc.rs
Substrate resolution (tcdb_all + FASTA header fallback)samerunner.rs
Alt-transporter reaction assignment (gated by --nouse-alternatives)analyse_alignments_transport.R:110-130runner.rs
Substrate-case preservation (gapseq emits sub=Potassium)shell behaviourdata.rs

2.4 Draft model builder (gapsmith-draft)

FeatureR sourceRust module
Candidate selection (bitscore ≥ cutoff OR pathway support)prepare_candidate_reaction_tables.R + generate_GSdraft.R:55-100candidate.rs
Stoichiometric hash dedupgenerate_rxn_stoich_hash.Rstoich_hash.rs
Best-status-across-rows (OR is_complex, max complex_status, highest-rank pathway_status)implicit in R's data.table mergescandidate.rs::build_candidates ⚠️ explicit
Biomass JSON parser (single + pipe-separated multi-link)parse_BMjson.R:1-107biomass.rs + gapsmith-db::BiomassComponent::links
Biomass cofactor mass-rescalinggenerate_GSdraft.R:281-292biomass.rs ⚠️ menaquinone-8 auto-removal deferred
GPR composition (and / or tree, "subunit undefined" edge cases)get_gene_logic_string.Rgpr.rs
Diffusion + exchange expansionadd_missing_exRxns.R:1-156exchanges.rs
Conditional transporter additions (butyrate, IPA, PPA, phloretate)generate_GSdraft.Rrunner.rs::add_conditional_transporters
SBML ID sanitiser (-/./:/space → _)builder.rs 🆕
Cytosolic met-id format (cpd00001_c0 not cpd00001[c0])builder.rs ⚠️ for SBML SId compliance

2.5 FBA / pFBA solver (gapsmith-fill)

FeatureR sourceRust module
Split-flux LP encoding (vp, vn ≥ 0)implicit in cobrar::pfbaHeuristiclp.rs::SplitFluxLp
FBAcobrar::fbafba.rs::fba
pFBA (single call)cobrar::pfbaHeuristicpfba.rs::pfba
pFBA-heuristic tolerance ladder (15 iters, 1e-6 → 1e-9, pFBA-coef relaxation)gapfill4.R:95-137pfba.rs::pfba_heuristic
HiGHS solver— (R uses glpk/cplex)good_lp 1.15 + highs-sys
CBC fallbackpfba.rs::pfba_cbc (feature-gated cbc) 🆕
Row-expression builder (O(nnz))implicitfba.rs::build_row_exprs 🆕 performance

2.6 Gap-filling (gapsmith-fill)

FeatureR sourceRust module
gapfill4 single-iteration drivergapfill4.R:1-303gapfill.rs::gapfill4
Candidate pool (draft + all approved SEED, stoich-hash deduped)construct_full_model.R + gapfill4.R:12-56pool.rs::build_full_model
rxnWeights derivation from bitscoresprepare_candidate_reaction_tables.R:222-228pool.rs::rxn_weight
KO essentiality loop (serial, core-first, highest-weight-first)gapfill4.R:247-280gapfill.rs::gapfill4
Medium application (close all EX, open per-medium, add missing EX)constrain.model.Rmedium.rs::apply_medium
Environment overrides (env_highH2.tsv)adjust_model_env.Rmedium.rs::apply_environment_file
Step 1 (user medium + biomass target)gf.suite.R:244-258suite.rs::run_suite
Step 2 (per-biomass-component on MM_glu + carbon sources)gf.suite.R:285-372suite.rs::step2
Step 2b (aerobic / anaerobic variant)gf.suite.R:377-464suite.rs::run_suite
Step 3 (energy-source screen with ESP1-5)gf.suite.R:480-581suite.rs::step3
Step 4 (fermentation-product screen)gf.suite.R:585-683suite.rs::step4
Target-met sink as objectiveadd_met_sink in add_missing_exRxns.R:56-72suite.rs::add_target_sink_obj
Futile-cycle detector (parallel pairwise LP probe)recent upstream cccbb6f0futile.rs::detect_futile_cycles (opt-in --prune-futile)
Community model composition (block-diagonal, shared _e0)community.rs::compose_models 🆕
Weighted-sum community biomass + optional balanced-growthcommunity.rs::add_community_biomass 🆕
Union-medium + per-MAG weights (community per-mag mode)community.rs::{union_medium, per_mag_weights, weighted_growth} 🆕

2.7 Medium inference (gapsmith-medium)

FeatureR sourceRust module
Rules-table loaderpredict_medium.R:46rules.rs::load_rules
Boolean-expression evaluator (| & ! < > == <= >=)eval(parse(text=))boolexpr.rs::eval
Counting-rule support (a + b + c < 3)same (R int arithmetic)boolexpr.rs::parse_sum
Cross-rule dedup + mean fluxpredict_medium.R:84-86predict.rs::predict_medium
Saccharides / Organic acids category deduppredict_medium.R:88-92predict.rs::predict_medium
Manual flux overridespredict_medium.R:94-114predict.rs::parse_manual_flux
Proton balancerpredict_medium.R:121-132predict.rs::predict_medium

2.8 Serialisation (gapsmith-sbml, gapsmith-io)

FeatureR sourceRust module
CBOR round-tripgapsmith-io::{read,write}_model_cbor 🆕
JSON round-tripgapsmith-io::{read,write}_model_json 🆕
SBML L3V1 + FBC2 + groups writercobrar::writeSBMLmodgapsmith-sbml::write_sbml
SBML SId idempotent on mets with compartment suffixwriter.rs::species_id 🆕 bugfix
Streaming via quick-xmlwriter.rs 🆕 no libSBML dep
SBML consistency validationlibSBML nativetools/validate_sbml.py (libSBML + COBRApy)

2.9 Reference-data loaders (gapsmith-db)

FeatureR sourceRust module
seed_reactions_corrected.tsvdata.table::freadseed.rs::load_seed_reactions
seed_metabolites_edited.tsvsameseed.rs::load_seed_metabolites
MNXref cross-refs (mnxref_*.tsv)samemnxref.rs
meta_pwy.tbl / kegg_pwy.tbl / seed_pwy.tbl / custom_pwy.tblsamepathway.rs
subex.tblsamesubex.rs
tcdb.tsvsametcdb.rs
exception.tblsameexception.rs
medium_prediction_rules.tsvsamegapsmith-medium::rules
complex_subunit_dict.tsvsamecomplex.rs
Biomass JSON (Gram+, Gram-, archaea, user custom)parse_BMjson.Rbiomass.rs
SEED stoichiometry parser (-1:cpd00001:0:0:"H2O";...)parse_BMjson.R:21-29stoich_parse.rs

3. New Rust-only features

FeatureRust moduleMotivation
Precomputed alignment input (--aligner precomputed -P <tsv>)gapsmith-align::PrecomputedTsvAlignerSkip per-genome BLAST when the user pre-runs diamond / mmseqs2 at batch scale
BatchClusterAligner (gapsmith batch-align)gapsmith-align::BatchClusterAlignerAmortise alignment cost over N genomes via one mmseqs2 cluster + single alignment
gspa-run manifest reader (--gspa-run <dir>)gapsmith-align::GspaRunAlignerConsume precomputed cluster-rep hits from the upstream gspa pipeline; fans rep hits onto per-genome members
gapsmith doall-batchgapsmith-cli::commands::doall_batchRayon + SLURM-array-friendly driver for reconstructing 100 → 1 M genomes in one batch
gapsmith community per-maggapsmith-fill::community + CLIShared-medium per-MAG FBA for metagenomes with 50+ MAGs
gapsmith community cfbagapsmith-fill::community + CLIFull community LP (block-diagonal compose, weighted-sum biomass, optional balanced-growth)
In-process LP (HiGHS via good_lp)gapsmith-fillReplaces R cobrar's shelled-out glpk / cplex; faster warm-starts
Optional CBC fallback backendgapsmith-fill::pfba_cbc (--features cbc)When HiGHS exhausts the tolerance ladder on pathological LPs
CBOR native formatgapsmith-ioFast, compact, stdlib-free; replaces R's RDS
gapsmith fba subcommandgapsmith-cliStandalone FBA / pFBA without shelling into R
gapsmith convert subcommandgapsmith-cliCBOR ↔ JSON round-trip for inspection
gapsmith db inspect subcommandgapsmith-cliSmoke-test the reference data directory
gapsmith export-sbml subcommandgapsmith-cliWrite an arbitrary CBOR model as SBML

4. Known gaps (deferred)

GapUpstream locationWorkaround / plan
EC / TC conflict resolution (IRanges overlap math)prepare_candidate_reaction_tables.R::resolve_common_{EC,TC}_conflictsAffects <1 % of multi-EC annotations. Plan: port when a user case needs it.
MIRIAM cross-ref annotations (KEGG / BiGG / MetaNetX / HMDB / ChEBI)addReactAttr.R + addMetAttr.RSBML emits ModelSEED id only; round-trip in COBRApy still works.
HMM-based taxonomy / gram predictionpredict_domain.R, predict_gramstaining.RCLI requires explicit `--taxonomy Bacteria
Gene-name MD5 fallback in seqfile resolveruniprot.sh:179Common-case MD5 fallback is ported; gene-name branch rarely fires.
Menaquinone-8 auto-removal (gated on MENAQUINONESYN-PWY / PWY-5852 / PWY-5837)generate_GSdraft.R:281-292Bio1 includes cpd15500 regardless; affects anaerobic predictions marginally.
gram_by_network.R (predict gram by metabolic-network similarity)sameRequires explicit `-b pos
adapt EC / KEGG / enzyme-name resolutionadapt.R::ids2seed strategies 3–7Direct SEED + pathway id resolution works; user can pre-resolve via gapsmith find.
pan weight medianing (custom_median)pan-draft_functions.RPan-model emits without merged rxnWeights metadata; gapsmith fill on a pan-draft needs the source Reactions.tbl.
CPLEX solver supportPlan explicitly calls for HiGHS + optional CBC; no CPLEX path.
MetaCyc DB updaters (meta2pwy.py, meta2genes.py, meta2rea.py)upstream Python helpersRun once per year by maintainers; kept in Python.

5. Testing surface

Test suiteTestsWhat it asserts
gapsmith-core unit18Type invariants, serde round-trips, stoichiometric matrix construction.
gapsmith-io unit5CBOR / JSON round-trip, data-dir auto-detect.
gapsmith-db unit18Every reference-data parser on realistic inputs.
gapsmith-sbml unit + integration2 + 1SBML writer emits every FBC2 / groups element; libSBML validates cleanly.
gapsmith-align unit + smoke + parity18 + 4 + 3Aligner trait, precomputed TSV, gspa-run manifest + fan-out, BLAST / diamond / mmseqs2 shell parity.
gapsmith-find unit + smoke + parity36 + 1 + 2Pathway scoring, complex detection (R-parity on 9 cases), find -p PWY-6587 and -p amino byte-identical against real gapseq.
gapsmith-transport unit + parity7 + 1TC parsing, substrate resolution, end-to-end row+TC-id parity against real gapseq.
gapsmith-draft unit + smoke10 + 1Biomass rescaling, GPR composition, stoich dedup, conditional transporters.
gapsmith-fill unit + textbook + smoke20 + 5 + 1FBA / pFBA / pFBA-heuristic on toys; community compose + cFBA on 2-organism toy; gapfill4 end-to-end on ecoli draft.
gapsmith-medium unit14Boolean-expression evaluator (incl. counting rules), rule loader, cross-rule dedup, proton balance.
gapsmith-cli integration6 + 1CBOR↔JSON round-trip end-to-end via the binary; SLURM-shard parser.
Total~170

Run the full suite:

cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings

6. LOC breakdown

Cratesrc/ LOCtests/ LOC
gapsmith-core~1 000
gapsmith-io~330
gapsmith-db~1 600
gapsmith-sbml~870~220
gapsmith-align~1 250~560
gapsmith-find~2 700~380
gapsmith-transport~1 040~115
gapsmith-draft~1 670~80
gapsmith-fill~2 150~400
gapsmith-medium~550
gapsmith-cli~2 400~60
Total~17 000~2 100

gapsmith porting notes

Intentional and unintentional deviations from the upstream R/bash gapseq codebase. Each entry says what differs, why, and whether downstream output parity holds.

This document is the authoritative list of "known non-R behavior". If you find something in gapsmith that differs from real gapseq and it's not listed here, please file an issue — we treat those as bugs.


1. Storage format

  • RDS → CBOR. R's saveRDS(mod, ...) is replaced by ciborium-encoded CBOR at <id>.gmod.cbor. SBML is emitted alongside as the portable interchange format. An Rscript-driven CBOR↔RDS bridge can be added later if downstream R consumers need it.
  • Metabolite IDs: cpd00001_c0 instead of cpd00001[c0]. SBML SIds must match [A-Za-z_][A-Za-z0-9_]* — brackets break that. The SBML writer derives the species id from met.id + met.compartment and only presents <cpd>[comp] to the user via the name attribute.
  • Model IDs are sanitized to valid SBML SIds: -, ., :, space → _. Original id preserved in the name attribute if different.

2. Alignment layer

  • mmseqs2 invocation. gapseq's R pipeline calls explicit mmseqs createdb + search + convertalis; we replicate this 4-command flow exactly rather than the simpler easy-search, because the latter uses alignment-mode 3 (full-alignment identity) while gapseq calibrates against the k-mer prefilter identity that search reports.
  • mmseqs2 qseqid normalization. We post-process mmseqs's qheader output to the first whitespace-delimited token, matching what blastp/diamond emit natively. gapseq does this via a sed call in the shell script (gapseq_find.sh at the tail of each mmseqs block).
  • Diamond --more-sensitive is on by default, matching gapseq's aliArgs="--more-sensitive" default for diamond.
  • E-value formatting in TSV output: BLAST's exact 2-decimal scientific notation (2.52e-41). Initial wrapper used {:.3e} which emitted 2.520e-41; fixed to match.

3. find pipeline (M5)

  • noSuperpathways=true is always the default. Matches gapseq's gapseq_find.sh:20. Pass -n / --include-superpathways to disable.
  • -p keyword shorthand. -p amino|nucl|cofactor|carbo|polyamine|fatty|energy|terpenoid|degradation|core|min|kegg|all resolves to the same hierarchy-category name gapseq uses. For -p <literal> we fall back to \b(<literal>)\b regex against id / name / altname / hierarchy — same as gapseq's default stype branch.
  • Pathway table merge semantics. -l metacyc,custom is default (matching gapseq); on ID collision, the custom_pwy.tbl row wins. This is how several "GS"-suffixed pathways (e.g. ARGSYN-PWY) have their superpathway=TRUE flag shadowed by the custom row with superpathway=FALSE, so they survive the filter.
  • user/ reference fastas always resolve against the built-in <gapseq>/dat/seq/<tax>/user/, even when -D / --seq-dir overrides the rest. Matches prepare_batch_alignments.R:217.
  • Reaction-name MD5 fallback. Uses md5sum-compatible hex digest of the raw bytes (no trailing newline), same as uniprot.sh:155.
  • Header filter uses substring, not word-boundary, match. This is a deliberate match of gapseq's shell behavior (grep -Fivf without -w). Has the known quirk of accidentally matching 4.A.2.1.1 against a line containing 4.A.2.1.11. Reproducing the quirk is required for byte-identical output.

4. find-transport pipeline (M6)

  • Substrate case is preserved from subex.tbl (sub=Potassium, not sub=potassium). Matches gapseq.
  • Alt-transporter reaction assignment kicks in when a TC-id hit exists but no SEED reaction matches the exact TC type. Same rule as analyse_alignments_transport.R:110-130. Disable with -a.

5. draft pipeline (M7) — known gaps

  • EC/TC conflict resolution (prepare_candidate_reaction_tables.R resolve_common_EC_conflicts / resolve_common_TC_conflicts, ~130 lines of IRanges overlap math) is not yet ported. Affects <1% of reactions; the mitigations are listed downstream in the draft — a gene with two EC annotations stays attached to both reactions rather than being resolved.
  • MIRIAM cross-ref annotations. SBML writer emits the ModelSEED id only. KEGG / BiGG / MetaNetX / HMDB / ChEBI cross-refs are available in addReactAttr.R + addMetAttr.R but not threaded through yet. Round-trip load in COBRApy works; MIRIAM-dependent downstream tools may need to re-query gapseq's dat/mnxref_*.tsv.
  • gram_by_network.R (predict gram-staining by metabolic-network similarity against reference genomes) is not ported. The CLI requires --biomass pos|neg|archaea or reads the gram= field from the Reactions.tbl header (which find writes when gapseq has HMM-predicted the gram).
  • Biomass rescaling: the menaquinone-8 auto-removal rule (generate_GSdraft.R:281-292) is not ported. Gram+/Gram- biomass retains menaquinone-8 regardless of whether the MENAQUINONESYN-PWY / PWY-5852 / PWY-5837 pathways are present. Impact is small (a single metabolite in the biomass reaction) but affects anaerobic-growth predictions.

6. Not yet implemented

  • gapsmith medium — rule-based medium inference. Planned for M10.
  • gapsmith fill — 4-phase gap-filling driver. Planned for M9 (FBA/pFBA primitives already shipped in M8 under gapsmith-fill).
  • gapsmith adapt — add/remove reactions on an existing model. Planned for M10.
  • gapsmith pan — pan-model union. Planned for M10.
  • gapsmith doall — chain find→find-transport→draft→medium→fill. Planned for M10.
  • gapsmith update-sequences — seqdb download from upstream. Planned for M10.
  • HMM-based predict_domain / predict_gramstaining. Planned alongside M10.

6a. fill (M8–M9) — LP primitives + single-iteration gap-fill (shipped)

  • Split-flux encoding: each reaction contributes vp_r, vn_r ≥ 0 with net v_r = vp_r − vn_r. |v_r| = vp_r + vn_r is linear, which keeps pFBA a pure LP. The bound translation is documented in crates/gapsmith-fill/src/lp.rs[lb, ub] on the net flux becomes a per-variable ub pair (max(ub,0), max(−lb,0)), with additional lower bounds when the direction is strictly forced.
  • FBA (gapsmith-fill::fba) — maximise / minimise Σ c_r · (vp_r − vn_r) s.t. S · (vp − vn) = 0. HiGHS backend via good_lp 1.15.
  • pFBA (gapsmith-fill::pfba) — minimise Σ w_r·(vp_r+vn_r) − pfba_coef · Σ c_r·(vp_r−vn_r) subject to v_bio ≥ min_growth. Matches cobrar's pfbaHeuristic.
  • pFBA-heuristic ladder (gapsmith-fill::pfba_heuristic) — port of gapfill4.R:95–137. Retries up to 15 times, halving the feasibility tolerance then the pFBA coefficient on solver failure; validates each candidate solution by running FBA on the reduced model after zero-flux reactions are removed.
  • gapsmith fba subcommand — FBA or --pfba on a CBOR/JSON model; prints solver status, objective, biomass flux, and the top-N |flux| reactions. Useful for sanity-checking a draft model before fill is applied.
  • Candidate pool (pool.rs::build_full_model) — clone draft, append every approved/corrected SEED reaction not already present, deduped by stoichiometric hash with core-preference on ties. Port of gapfill4.R:12–56.
  • Medium application (medium.rs::apply_medium) — close all EX_* lower bounds then re-open per medium CSV; adds missing EX reactions for compounds not yet in the model. Matches constrain.model.R.
  • Single-iteration gap-fill (gapfill.rs::gapfill4) — pFBA-heuristic on the full model with bitscore-derived weights, extract utilized candidates, add to draft, run KO essentiality loop. Port of gapfill4.R:1–303.
  • gapsmith fill subcommand — end-to-end draft → filled pipeline. Ecoli on MM_glu.csv: 20 reactions added, final growth 0.53.

Known divergences / outstanding work

  • Biomass link-multi parsingbiomass.json supports "link": "cpd01997:-1|cpd03422:-1" (pipe-separated coupled metabolites). The M7 biomass parser only read the first link. M9 fixed this (BiomassComponent::links()). Without the fix, every biomass reaction was missing two cofactor metabolites, pegging growth to zero. Re-run draft on any model built before M9 — the old CBOR has a structurally broken biomass.
  • SBML species-id double suffix — the M7 species_id helper blindly appended _<comp> to the cpd id. When gapsmith stores metabolites as cpd00001_c0 (the actual shipped convention), the SBML emitted M_cpd00001_c0_c0. COBRApy treated these as distinct mets, so SEED reactions and biomass didn't link. Fixed in M9: idempotent when the id already ends with the compartment suffix.
  • Objective must be EX_<target>_c0 sink, not bio1 directlygf.suite.R:239–242 zeros every obj_coef and then calls add_met_sink(mod, target.met, obj=1) to attach the objective to a freshly-created sink reaction. Without it, steady-state balance pins the biomass flux to zero because the biomass pseudo-metabolite has no consumer. The CLI wires this up automatically.
  • 4-phase suite (Steps 2 / 2b / 3 / 4) shipped in the M9 follow-up. Default CLI runs Steps 1 + 2 + 2b; --full-suite adds 3 + 4. Steps 3 and 4 iterate ~200 exchange compounds each with a gapfill4 call — budget 10–30 minutes on a whole-bacterium genome.
  • Futile-cycle prune shipped (detect_futile_cycles) but opt-in via --prune-futile. Pairwise LP probe on large candidate pools (~8k rxns) takes ~40 minutes even with rayon parallelism; use selectively.
  • Candidate-dedup fix (draft-stage latent bug caught during M9 follow-up): my build_candidates accumulator copied is_complex / complex_status / pathway_status only from the highest-bitscore row. That row may have empty status fields (gapseq fills those only on rows where complex detection ran), causing reactions like rxn00011 (bs=1797, pyruvate decarboxylase) to be rejected in the runner's complex filter. Fixed: OR is_complex, max-across-rows for complex_status, highest-rank pathway_status. Re-run gapseq draft on any model built before this fix.
  • CBC fallback wired behind --features cbc. Path: pfba_heuristic retries once with good_lp::solvers::coin_cbc after HiGHS exhausts its tolerance/coef ladder. Requires coinor-libcbc-dev + coinor-libclp-dev on the build host.

7. Unimplemented but probably won't be

  • The R helpers that depend on pythoncyc (meta2pwy.py, meta2genes.py, meta2rea.py) are one-off MetaCyc DB updaters run ≤ twice per year by maintainers. They stay in Python; no Rust port planned.
  • CPLEX solver support — plan stated "no CPLEX path"; gapsmith-fill will use HiGHS via good_lp with CBC/Clp as backups.

8. New features not in gapseq

  • BatchClusterAligner (M4.5) — cluster N genomes' proteomes with mmseqs2, align once against the reference query FASTA, expand per-genome hit sets. Amortizes alignment cost over many genomes with shared proteins. Not in upstream gapseq.
  • Precomputed alignment input--aligner precomputed -P <tsv> on find / find-transport skips the alignment step entirely. User must pre-run blastp/diamond/mmseqs2 with the expected 8-column layout.

9. Output format differences

  • Pathways.tbl completeness precision. Ported to f64 to match R's default 15-significant-digit output (66.6666666666667).
  • Reactions.tbl column order matches gapseq exactly after M5 fixes.
  • Transporter.tbl matches gapseq exactly after M6 fixes, modulo the cosmetic sub column case when a substrate has both capitalized and lowercased SUBkey entries (gapseq picks the alphabetically-first post-dedup; we pick insertion-order-first).
  • evalue formatting matches BLAST's -outfmt 6 conventions: 0 for exact zero, 2-decimal scientific notation below 1e-3.

10. Testing surface

  • R-vs-Rust parity tests run the actual R implementation via Rscript where feasible. This lives in three places:
    • crates/gapsmith-find/tests/complex_parity.rs drives tools/r_complex_detection.R which sources the real src/complex_detection.R.
    • crates/gapsmith-find/tests/pipeline_parity.rs runs real gapsmith find and gapsmith find side-by-side with a synthesized minimal seqdb.
    • crates/gapsmith-transport/tests/parity.rs does the same for find-transport.
  • SBML validation runs python-libsbml + cobra via a uv-managed venv at tools/.sbml-validate/. Not part of cargo test (no libsbml on crates.io) but a manual gate via tools/validate_sbml.py.

Performance

gapsmith shipped with a series of optimizations relative to its initial Rust port. This page catalogues what's in, where wins actually show up, and the methodology behind the numbers.

Shipped optimizations

All items below are on by default unless marked (opt-in) — those are library-level knobs exposed as struct fields, not CLI flags yet.

C1 — SEED stoich-hash cache at load time

crates/gapsmith-db/src/seed.rs. SeedRxnRow gains a #[serde(skip)] stoich_hash: Option<String> field populated once during load_seed_reactions (sequential, ~150 ms for 35k rows). Gap-fill and draft dedup paths read the cached hash instead of re-parsing + re-sorting the stoichiometry on every access.

C3 — O(1) reaction/metabolite lookups in apply_medium

crates/gapsmith-fill/src/medium.rs. Replaces rxns.iter().find(..) per medium entry with a one-shot HashMap<rxn_id, index>. Same for the metabolite-index lookup in the new-exchange fallback path. O(k·n)O(n + k) on a 500-rxn draft with 20 medium entries.

A2 — FASTA header cache per find run

crates/gapsmith-find/src/runner.rs. Multiple reactions frequently resolve to the same reference FASTA (two reactions sharing an EC both land on rev/<EC>.fasta). A HashMap<PathBuf, Arc<Vec<(String, String)>>> populated lazily during the unique-reaction loop avoids redundant file reads + header parses.

E1 — Structural blocked-reaction pre-pass in futile-cycle prune

crates/gapsmith-fill/src/futile.rs::structural_blocked_rxns. Iterative topological dead-reaction detection on the closed-boundary model: reactions whose metabolites have no producer or no consumer (in any alive direction) can't carry flux, so they can't participate in cycles. Drops 30–70% of the LP-probed candidate pool before any LP runs. O(iters × nnz), typically 3–8 iterations. Runs inside the existing --prune-futile flag.

B2 — Skip KO probes for zero-flux reactions

crates/gapsmith-fill/src/gapfill.rs. Any non-core candidate carrying zero flux in the post-fill max-biomass FBA optimum is provably non-essential: the constraint x[r] = 0 leaves the current optimum feasible, so the KO probe would succeed deterministically. Skip the LP probe; drop the reaction directly. Cuts ~20–40% of KO-loop LP solves on typical gap-fills.

B1 (opt-in) — HiGHS hot-start via with_initial_solution

crates/gapsmith-fill/src/fba.rs. FbaOptions gains hot_start: Option<Vec<f64>>. When Some, the caller supplies a previous net-flux vector; fba() decomposes it into split-flux (vp, vn) = (max(v,0), max(-v,0)) seeds and threads them through good_lp's with_initial_solution. HiGHS warm-starts the simplex from the seed basis. Can nudge HiGHS toward a different alternative optimum, so off by default for byte-parity.

B3 (opt-in) — pFBA ladder memo across calls

crates/gapsmith-fill/src/pfba.rs. PfbaHeuristicOptions gains ladder_memo: Option<Arc<Mutex<Option<(f64, f64)>>>>. When Some, the ladder seeds from the last successful (tolerance, coefficient) instead of the defaults and writes back on success. Clamped to [min_tol, start_tol] to prevent a stale memo from pushing the solver outside the user's configured range. Same byte-parity caveat as B1. Off by default.

Benchmark results

Hardware: server ws, 64 cores, 128 GB RAM, NVMe SSD, Debian. Aligner: DIAMOND 2.1.9 (-A diamond -K 8). Input: E. coli K-12 MG1655 (GCF_000005845.2, 4300 proteins). Reference DB: upstream gapseq's dat/ at commit-time HEAD. Comparison: commit a42d172 (pre-Phase-1 baseline) vs 474a1e7 (all Phase 1 / 2 / 3 defaults shipped).

gapsmith find -p all

MetricPhase-0Phase-3Δ
Wall18.36 s17.96 s−2.2%
User115.8 s114.1 s−1.5%
Peak RSS279 MB286 MB+2.5%

Pathways.tbl byte-identical. Reactions.tbl varies run-to-run on the same binary — DIAMOND parallel search with -K > 1 has inherent non-determinism at the per-hit level. Not a regression; not in our code.

gapsmith doall (full default pipeline)

MetricPhase-0Phase-3Δ
Wall77.6 s79.2 s+2.1%
User244 s245 s+0.4%
Peak RSS276 MB297 MB+7.7%

Semantic parity (via COBRApy):

  • 2776 reactions, 2193 metabolites (both)
  • Biomass FBA growth: 0.7422696154309283 (identical to 16 decimals)
  • Reaction Jaccard: 1.0000
  • Metabolite Jaccard: 1.0000

gapsmith fill --full-suite (4-phase, deep KO loops)

MetricPhase-0Phase-3Δ
Wall11:51.3611:51.55+0.03%
User1209 s1217 s+0.7%
Peak RSS107 MB123 MB+15.5%

Semantic parity:

  • 2859 reactions, 2207 metabolites (both)
  • Biomass FBA growth: 0.7433163537454 (identical to 13 decimals)
  • Reaction Jaccard: 1.0000
  • Metabolite Jaccard: 1.0000

Added reactions differ only in step attribution (one reaction rxn12933_c0 added in Step 1 by P0 vs Step 2 by P3 — B2's zero-flux skip dropped it from Step 1's add-list because its flux at the biomass optimum was zero; Step 2 needed it and re-added it). Final model is identical.

gapsmith fill --full-suite --prune-futile (E1 shines)

Both binaries hit a pre-existing pfba_heuristic: max iterations failure on this workload (unrelated to Phase 1–3 changes). Timing to the failure point:

MetricPhase-0Phase-3Δ
Wall5:564:20−27%
User18963 s13728 s−28%
Peak RSS1354 MB1322 MB−2.4%

E1 cuts the time needed to exhaust the LP-prune pool substantially. When --prune-futile actually completes on a well-conditioned input, this speedup translates directly.

Honest assessment

The measured wins on the default path are small:

  • DIAMOND dominates find / doall wall time (~60 s of 80 s total).
  • fill (Step 1 default) runs few LPs; B2 saves fractions of a second.
  • C1 / C3 savings are milliseconds, below noise.
  • A2 cache adds ~15 MB RSS because per-FASTA reads were already buffered.

Real wins show up in:

  • --prune-futile enabled (E1: ~25–30% wall reduction)
  • Deep KO loops in full-suite runs with large candidate pools (B2)
  • Batch runs reusing ladder_memo across pFBA calls (B3, opt-in)
  • KO-loop-dominated fills with hot_start configured (B1, opt-in)

Semantic parity is the firm guarantee: identical reaction sets, identical biomass to machine precision, regardless of which optimization path fires.

Scaling to many genomes

The numbers above are all single-genome. For 100+ genomes, the dominant cost moves out of FBA and into per-genome alignment. Gapsmith's multi-genome path routes around that entirely:

  • --gspa-run consumes a pre-clustered, pre-aligned gspa manifest so a 1000-genome collection pays for one cross-genome alignment, not 1000. Aligner calls inside find / find-transport become map-lookups.
  • doall-batch --jobs N runs doall across genomes with rayon; on a 32-core box, speedup is ~min(N, 32) vs. looping doall sequentially.
  • doall-batch --shard i/N makes SLURM array jobs the unit of parallelism. At 1 024 tasks of 16 cores each, throughput is ~1–2 s of wallclock per genome including I/O — limited by per-task doall-internal work, not by coordination.

Community modes are small LPs (≤ 1000 organisms in per-mag, which is linear) or a single big LP (cfba). cfba is feasible up to ~50 organisms before HiGHS becomes the bottleneck; for larger communities prefer per-mag.

See Multi-genome workflows for recipes.

Deferred — where the big wins still are

B1-full — HiGHS basis warm-start across KO probes (highs-sys FFI)

The currently-shipped B1 uses good_lp's with_initial_solution for a simplex hot-start guess, but each solve still constructs a fresh highs::Model. True warm-start needs a persistent highs::Model across the KO loop, mutating bounds via Highs_changeColBounds and re-solving in place. The highs crate 2.0 doesn't expose that API — requires ~200 LOC of highs-sys FFI. Estimated 30–50% wall reduction on KO-dominated fills.

Parallel KO probe

The current KO loop is serial because each commit affects the next iteration's baseline. A valid parallelisation: probe every candidate concurrently against the same post-fill model (all probes are independent at that snapshot), collect a set of provably-non-essential candidates, then re-probe the survivors serially. Bounded to a single pass, correctness-preserving. Estimated 4–8× on multi-core for the Step 4 fermentation-product loop.

Step-4 restructuring

Step 4 probes ~100 fermentation products one at a time, each doing a full gap-fill cycle. Amortising the shared preprocessing across probes (same starting model, same weights) would cut ~5 min off a full-suite run.

Reproducing

# 1. Clone + build both commits
git clone https://github.com/bio-ontology-research-group/gapsmith.git
cd gapsmith
git worktree add -f /tmp/gapsmith-p0 a42d172
cd /tmp/gapsmith-p0 && cargo build --release --bin gapsmith
cd - && cargo build --release --bin gapsmith

# 2. Get reference data
gapsmith update-data -o /path/to/dat
gapsmith --data-dir /path/to/dat update-sequences -t Bacteria

# 3. Run both binaries, compare
for bin in /tmp/gapsmith-p0/target/release/gapsmith ./target/release/gapsmith; do
  out=$(mktemp -d)
  /usr/bin/time -v "$bin" --data-dir /path/to/dat doall \
    -A diamond -K 8 -f "$out" genome.faa.gz 2>&1 | tee "$out.time"
done

# 4. Semantic diff in COBRApy
uv run --with cobra python tools/bench/diff_models.py \
  out1/genome-filled.xml out2/genome-filled.xml

tools/bench/run_bench.sh bundles the above end-to-end.

Comparison with upstream gapseq

gapsmith is a Rust reimplementation of gapseq (R/bash, ~9k LOC). This document covers performance benchmarks and feature differences.

Benchmarks

Wall-clock comparison on four real bacterial proteomes. Hardware: 56-core Xeon, 128 GB RAM, Debian 13, NVMe SSD. Same aligner binary (NCBI blastp 2.17.0 via bioconda). Same reference sequence database (Zenodo v1.4, record 16908828).

Test genomes

OrganismAccessionProteins
Candidatus Blochmannia floridanusGCF_000007725.1517
Bacillus subtilis 168GCF_000009045.14,237
Escherichia coli K-12 MG1655GCF_000005845.24,300
Salmonella enterica Typhimurium LT2GCF_000006945.24,554

Results

GenomeStagegapseq (R)gapsmithSpeedup
B. floridanus (517)find -p all117 s34 s3.5×
B. floridanus (517)find-transport9 s8 s1.2×
B. subtilis (4,237)find -p all205 s73 s2.8×
B. subtilis (4,237)find-transport25 s14 s1.8×
E. coli (4,300)find -p all218 s76 s2.9×
E. coli (4,300)find-transport27 s14 s1.9×
S. Typhimurium (4,554)find -p all211 s76 s2.8×
S. Typhimurium (4,554)find-transport27 s14 s1.9×

gapsmith also uses 35–40% less peak memory (e.g. 498 MB vs 786 MB for E. coli find).

Stages without R baseline

The draft, medium, and fill stages could not be benchmarked against upstream R gapseq because the cobrar R package (which replaced the archived sybil) failed to install on the benchmark host due to a libsbml/libxml2 ABI conflict in the conda R 4.5 environment. This is a packaging issue on the test rig, not a limitation of either tool.

gapsmith timings for these stages (E. coli K-12):

StageWall-timePeak RSS
draft0.6 s152 MB
medium0.1 s47 MB
fill (Steps 1 + 2 + 2b)52 s103 MB

These stages are fast in gapsmith because the LP solver (HiGHS) runs in-process via good_lp, rather than shelling out through R's cobrar wrapper to an external GLPK/CPLEX binary.

Reproducing the benchmarks

# Requires: blastp, Rscript (with data.table, stringr, Biostrings),
# and gapsmith binary on PATH.
bash tools/bench/run_bench.sh <genomes_dir> <results_dir>
python3 tools/bench/aggregate.py <results_dir>

Feature comparison

Featuregapseq (R)gapsmith
find (pathway detection)✅ byte-identical output
find-transport✅ row-count parity
draft (model assembly)✅ 0 libSBML errors
medium (rule-based inference)✅ byte-identical output
fill (4-phase gap-filling)✅ Steps 1–4
adapt (add/remove reactions)✅ (EC/KEGG resolution deferred)
pan (pan-draft union)
doall (end-to-end pipeline)
update-sequences (Zenodo sync)
EC/TC conflict resolution❌ (affects <1% of reactions)
MIRIAM cross-ref annotations❌ (SBML loads without them)
HMM-based taxonomy prediction❌ (use --taxonomy flag)
Precomputed alignment input--aligner precomputed
Batch-cluster alignmentbatch-align
In-process LP solver❌ (external GLPK)✅ (HiGHS, bundled)
FBA/pFBA subcommandfba
Native CBOR model format❌ (RDS)✅ (replaces RDS)
Single static binary

Key differences

Solver

gapseq uses R's cobrar package (or the older sybil) which wraps GLPK or CPLEX. gapsmith uses HiGHS via good_lp, statically linked at build time. No runtime solver dependency.

Model format

gapseq stores models as R .RDS files. gapsmith uses CBOR (.gmod.cbor) as its native format — compact, fast to load, language-agnostic. Both tools emit SBML for interchange.

Alignment

Both tools shell out to the same external aligner binaries (blastp, diamond, mmseqs2). gapsmith additionally supports --aligner precomputed (skip the aligner, read a user-supplied TSV) and batch-align (cluster N genomes with mmseqs2, align once, expand per-genome).

Known deferred items

See docs/porting-notes.md for the full list. The main gaps:

  • EC/TC conflict resolution — affects <1% of multi-EC-annotated genes.
  • MIRIAM cross-refs — SBML emits ModelSEED id only; COBRApy round-trip still works.
  • HMM-based taxonomy — CLI requires --taxonomy Bacteria|Archaea instead of auto-detecting.
  • adapt EC/KEGG resolution — use direct SEED reaction ids or MetaCyc pathway ids.