gapsmith

gapsmith is a Rust reimplementation of gapseq. It reconstructs genome-scale metabolic models from bacterial proteomes: pathway detection, transporter detection, draft model assembly, growth-medium inference, and iterative gap-filling — all as a single static binary.

What it does

Given a protein FASTA, gapsmith:

Finds pathways — aligns proteins against per-reaction reference FASTAs and scores pathway completeness (gapsmith find).
Finds transporters — matches against a TCDB-derived substrate table (gapsmith find-transport).
Assembles a draft model — combines the above into a stoichiometrically consistent metabolic network (gapsmith draft).
Infers a growth medium — applies boolean rules over detected reactions to predict minimal-medium compounds (gapsmith medium).
Gap-fills — an iterative pFBA + knock-out loop adds the minimal set of extra reactions needed for biomass growth (gapsmith fill).

Each step is independently callable, or chain them end-to-end with gapsmith doall.

One-liner

gapsmith doall genome.faa.gz -f output/

This produces a gap-filled SBML model that loads directly in COBRApy, COBRAToolbox, or any SBML-compatible tool.

Why a rewrite?

Upstream gapseq is R + bash (~9k LOC). gapsmith replaces the R stack — including the cobrar wrapper and external GLPK/CPLEX — with:

In-process HiGHS via good_lp for FBA/pFBA
Native CBOR model format (replaces R's RDS)
Single static binary, no R install
--aligner precomputed mode for re-using one alignment across many genomes
batch-align for clustering N genomes with mmseqs2 and amortising the per-reaction alignment

Benchmarks show ~3× faster pathway detection on real bacterial proteomes and 35–40% lower peak memory. See Comparison with upstream gapseq for details.

Scaling out: metagenomes and multi-genome runs

gapsmith also ships optional, additive features for community and multi-genome workloads:

--gspa-run — consume protein clusters + alignments produced upstream by gspa so thousands of genomes share one alignment.
doall-batch — rayon-parallel driver with a --shard i/N selector for SLURM array jobs (scales from one machine to 1 M genomes).
community per-mag — per-MAG FBA under a shared (union) medium. Linear in N MAGs.
community cfba — compose N drafts into one community model with a shared _e0 pool and a weighted-sum biomass objective; optional balanced-growth constraint for classical cFBA semantics.

See Multi-genome & metagenome workflows for the full recipe.

Where to start

Install & first run: User guide
Every CLI flag: CLI reference
Metagenomes / multi-genome runs: Multi-genome workflows
How the crates fit together: Architecture
R → Rust module mapping: Feature matrix
Intentional deviations from upstream: Porting notes

gapsmith user guide

A Rust reimplementation of gapseq — informed prediction and analysis of bacterial metabolic pathways and genome-scale networks. Single static binary, in-process LP solver, faster on every stage of the pipeline (see benchmarks in the README).

This guide walks a new user through installing gapsmith, reconstructing a genome-scale metabolic model end-to-end, and understanding each stage of the output.

1. Install

Binary prerequisites

gapsmith shells out to external aligners and a couple of HMM tools. Install whichever aligner you want to use — you only need one:

Tool	Required for	apt / bioconda
`blastp` + `makeblastdb`	`--aligner blastp` (default)	`ncbi-blast+`
`diamond`	`--aligner diamond`	`diamond`
`mmseqs`	`--aligner mmseqs2`	`mmseqs2`
`bedtools`	none (legacy; only needed for upstream `gapsmith find` shell pipeline)	`bedtools`

--aligner precomputed skips the aligner entirely and reads a user- supplied TSV; see Precomputed alignment mode below.

Reference data

gapsmith needs the standard gapseq dat/ directory (SEED reactions + metabolites, biomass templates, medium rules, pathway tables). Get it by cloning the upstream repo:

git clone --depth 1 https://github.com/jotech/gapseq.git
export GAPSMITH_DATA_DIR=$PWD/gapseq/dat

Or point to it explicitly on every invocation with --data-dir.

Reference sequences

Sync the genome-specific BLAST/diamond reference FASTA database from Zenodo:

gapsmith update-sequences -t Bacteria             # pinned release
gapsmith update-sequences -t Bacteria -Z latest   # newest Zenodo version
gapsmith update-sequences -c                       # check-only, no download

Without this, find / find-transport / doall will fail — the FASTA files they align against live under <data-dir>/seq/<Taxonomy>/.

Build

git clone <this-repo>
cd gapsmith
cargo build --release --workspace
# Binary is target/release/gapsmith

The HiGHS LP solver is bundled via good_lp → highs-sys (CMake'd from source on first build; ~30 s). An optional cbc feature adds CBC as a fallback solver — requires coinor-libcbc-dev on the build host:

cargo build --release --workspace --features gapsmith-fill/cbc

2. Quick start — reconstruct E. coli K-12 in one command

# Download an E. coli reference proteome
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/all_assembly_versions/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz

# Reconstruct: find→find-transport→draft→medium (auto)→fill
gapseq --data-dir $GAPSMITH_DATA_DIR doall \
    GCF_000005845.2_ASM584v2_protein.faa.gz \
    -f ecoli_out/ \
    -A diamond \
    -b 200 \
    -k 0.01

# Inspect the result
gapsmith fba ecoli_out/GCF_000005845.2_ASM584v2_protein-filled.gmod.cbor

Outputs in ecoli_out/:

<genome>-all-Reactions.tbl, <genome>-all-Pathways.tbl — find results
<genome>-Transporter.tbl — find-transport result
<genome>-draft.gmod.cbor, <genome>-draft.xml — draft model
<genome>-medium.csv — inferred growth medium
<genome>-filled.gmod.cbor, <genome>-filled.xml — gap-filled model
<genome>-filled-added.tsv — list of reactions added by each fill step

The filled CBOR or SBML loads directly in COBRApy, COBRAToolbox, cobrar, or any SBML-aware tool.

3. Individual stages

3.1 `gapsmith find` — pathway and reaction detection

Aligns the genome's proteins against gapseq's per-reaction reference FASTAs, scores each pathway for completeness, and writes a *-Reactions.tbl + *-Pathways.tbl.

# One pathway by MetaCyc id
gapsmith find -p PWY-6587 genome.faa

# A category shorthand — expands to gapseq's internal hierarchy match
gapsmith find -p amino genome.faa
#        ↑ one of: amino | nucl | cofactor | carbo | polyamine |
#                  fatty | energy | terpenoid | degradation |
#                  core | min | kegg | all

# Every active pathway in metacyc + custom (gapseq's default)
gapsmith find -p all -l metacyc,custom genome.faa

# Alternative aligner
gapsmith find -A diamond -p all genome.faa
gapsmith find -A mmseqs2 -p all genome.faa

Key options:

-b / --bitcutoff — BLAST bitscore cutoff for "high-evidence" hits (default 200).
-c / --coverage — minimum query coverage (default 75%).
-a / --completeness-main — pathway-completeness cutoff (default 80%).
-k / --completeness-hints — relaxed cutoff when every key reaction is present (default 66%).
-n / --include-superpathways — by default superpathways are filtered out (matches upstream gapseq).
-t / --taxonomy — Bacteria (default) or Archaea.

The output *-Pathways.tbl is byte-identical to real gapseq's output on tested cases (-p PWY-6587 and -p amino on ecoli). See the porting-notes for the handful of intentional deviations.

3.2 `gapsmith find-transport` — transporter detection

Iterates gapseq's transporter substrate list (dat/subex.tbl) and aligns the genome against both TCDB and the curated SEED transporter FASTAs.

gapsmith find-transport -A diamond -b 50 genome.faa
# → <stem>-Transporter.tbl

Row count and the TC-identifier set match real gapseq on the ecoli toy genome (550 distinct TC ids, 1808 rows).

3.3 `gapsmith draft` — build a draft model

Combines the *-Reactions.tbl + *-Transporter.tbl with the SEED reaction DB into a metabolic model. Adds biomass, exchanges, diffusion reactions, conditional transporters, dedups by stoichiometric hash.

gapsmith draft \
    -r ecoli-all-Reactions.tbl \
    -t ecoli-Transporter.tbl \
    -b neg \
    -o drafts/
# → drafts/ecoli-draft.gmod.cbor   (native format)
#   drafts/ecoli-draft.xml         (SBML L3V1+FBC2+groups, validates
#                                   against libSBML with 0 errors)

The -b flag picks the biomass template:

auto — read gram= from the Reactions.tbl header (gapseq defaults).
pos / neg — Gram+ / Gram-.
archaea — Archaea.
<path/to/custom.json> — user biomass in the same JSON format as dat/biomass/biomass_Gram_neg.json.

3.4 `gapsmith medium` — rule-based medium inference

Evaluates 82 boolean rules in dat/medium_prediction_rules.tsv against the draft + Pathways.tbl to infer a growth medium.

gapsmith medium \
    -m drafts/ecoli-draft.gmod.cbor \
    -p ecoli-all-Pathways.tbl \
    -o ecoli-medium.csv

# Manual overrides — force oxygen uptake 0 to predict anaerobic
gapsmith medium ... -c "cpd00007:0;cpd00011:0"

The inferred medium CSV is compatible with gapsmith fill -n. On the ecoli toy genome the output is byte-identical to upstream gapseq's toy/ecoli-medium.csv.

3.5 `gapsmith fill` — iterative gap-filling

4-phase pFBA + KO-essentiality pipeline. Steps 1 + 2 + 2b are the default "quick" path; Steps 3 + 4 run only with --full-suite.

gapsmith fill \
    drafts/ecoli-draft.gmod.cbor \
    -n ecoli-medium.csv \
    -r ecoli-all-Reactions.tbl \
    -k 0.01 \
    -o filled/

# Full 4-phase suite (adds energy-source + fermentation screens;
# 10-30 minutes on a whole bacterium)
gapsmith fill ... --full-suite

# Opt into the futile-cycle prune — drops candidate reactions that
# saturate at 0.99·max_flux with all boundaries closed. Expensive on
# large candidate pools.
gapsmith fill ... --prune-futile

Phase breakdown:

Step 1 — user medium + biomass target. Bulk of the fills.
Step 2 — MM_glu + available carbon sources; iterate every biomass substrate, attach a sink per substrate, gap-fill if infeasible.
Step 2b — aerobic / anaerobic variant, depending on whether the user medium has O₂.
Step 3 (--full-suite) — energy-source screen with ESP1–5 pseudo-reactions (menaquinone / ubiquinone / NADH / ferredoxin / plastoquinone redox couples).
Step 4 (--full-suite) — fermentation-product screen: maximise outflow on each exchange compound.

Output: <genome>-filled.gmod.cbor + .xml + -added.tsv (rxn id + step number for every addition).

3.6 `gapsmith fba` — FBA / pFBA on a model

Utility to solve FBA or parsimonious FBA on a CBOR/JSON model. Useful for validating a draft before gap-filling, or for post-hoc analysis.

gapsmith fba drafts/ecoli-draft.gmod.cbor
gapsmith fba --pfba --min-growth 0.01 --top 20 filled/ecoli-filled.gmod.cbor
gapsmith fba -r EX_cpd00027_e0 filled/ecoli-filled.gmod.cbor

Split-flux LP via good_lp with HiGHS backend. Solves the 2378×2953 ecoli draft in under 100 ms.

3.7 `gapsmith adapt` — edit reactions + force growth

# Add specific SEED rxns
gapsmith adapt -m model.gmod.cbor -a rxn00011,rxn00102

# Add every rxn in a MetaCyc pathway
gapsmith adapt -m model.gmod.cbor -a "PWY-6587"

# Remove
gapsmith adapt -m model.gmod.cbor -r rxn00011

# Force growth on D-glucose (runs gap-fill)
gapsmith adapt -m model.gmod.cbor -w cpd00027:TRUE -b Reactions.tbl

# Force NO growth on glucose (closes uptake)
gapsmith adapt -m model.gmod.cbor -w cpd00027:FALSE

3.8 `gapsmith pan` — pan-draft from N drafts

# Glob a directory of drafts
gapsmith pan -m drafts/ -t 0.06 -f pan_out/

# Explicit list
gapsmith pan -m a-draft.gmod.cbor,b-draft.gmod.cbor,c-draft.gmod.cbor -t 0.5

# Only emit the binary presence/absence table
gapsmith pan -m drafts/ -b -f pan_out/

-t sets the minimum across-draft reaction frequency for inclusion in the pan-draft. Default 0.06 matches upstream gapseq.

3.9 `gapsmith doall` — full pipeline in one command

gapsmith doall genome.faa.gz -f out/ -A diamond -b 200 -k 0.01
# Chains: find -p all → find-transport → draft → medium (auto) → fill

Add --full-suite to include fill Steps 3/4. Use -m <medium.csv> to skip the medium-inference step and supply your own medium.

4. Precomputed alignment mode

For many-genome batch runs, gapsmith can skip the per-genome aligner call entirely and read a pre-computed alignment TSV:

# Pre-run BLAST / diamond / mmseqs2 yourself (or use gapsmith batch-align)
diamond makedb --in reference.faa -d ref.dmnd
diamond blastp -q genome.faa -d ref.dmnd -o hits.tsv \
    --outfmt 6 qseqid sseqid pident qcovs evalue bitscore stitle sstart send

# Feed to find / find-transport
gapsmith find -A precomputed -P hits.tsv -p all genome.faa
gapsmith find-transport -A precomputed -P hits.tsv genome.faa

`gapsmith batch-align` — cluster N genomes + single alignment pass

When you're annotating many genomes with the same reference, run one cluster + alignment over all of them and expand per-genome hits:

gapsmith batch-align \
    -q reference.faa \
    -g genomes_dir/ \
    -o aligned/ \
    --aligner diamond \
    --cluster-identity 0.5 \
    --cluster-coverage 0.8
# → aligned/<genome_id>.tsv per genome in genomes_dir/

Amortises the alignment cost over many genomes — especially effective when the proteomes share a lot of sequence homology.

4b. Multi-genome, metagenome, and gspa integration

When annotating 100+ genomes — or when the input is a metagenome — reach for these additional subcommands instead of looping doall:

`--gspa-run` — consume precomputed clusters from gspa

If you have a gspa run directory (mmseqs2-clustered proteins + alignment against cluster reps), point any per-genome subcommand at it:

gapsmith doall genome.faa -f out/ \
    --gspa-run /path/to/gspa-run/ \
    --gspa-genome-id my-genome

gapsmith skips the aligner entirely, reads the rep-level hits, and fans them out to this genome's cluster members.

`gapsmith doall-batch` — parallel N-genome driver

# 100 genomes on one machine, reusing a gspa cluster, 32 rayon workers
gapsmith doall-batch --gspa-run gspa-run/ -f out/ --jobs 32

# 1 M genomes on SLURM: 1024 array tasks, each processes ~1000 genomes
sbatch --array=0-1023 slurm_script.sh   # script passes --shard ${SLURM_ARRAY_TASK_ID}/1024

Key flags: --genomes-dir / --genomes-list / --gspa-run, --jobs, --shard i/N, --continue-on-error.

`gapsmith community per-mag` — shared-medium per-MAG FBA

For metagenomes with 50+ MAGs where a full cFBA isn't tractable:

gapsmith community per-mag \
    --gspa-run gspa-run/ \
    --drafts-root out/ \
    -o community/

Outputs community-medium.csv (union of per-MAG media) and per-mag-growth.tsv with abundance-weighted community growth.

`gapsmith community cfba` — full community LP

For small, curated communities (≤ ~50 organisms) where cross-feeding matters:

gapsmith community cfba \
    --drafts-dir out/drafts/ \
    --abundance abundances.tsv \
    --balanced-growth \
    -o community/

Composes N drafts into one block-diagonal model with a shared _e0 extracellular pool, builds a weighted-sum biomass objective, and optionally constrains every organism to the community's growth rate.

→ Full recipe: Multi-genome & metagenome workflows.

5. Solver, format, and file-layout notes

5.1 Model format

Native: .gmod.cbor — CBOR-encoded Model (fast load, compact).
JSON: .json — human-readable, via gapsmith convert.
SBML: .xml — SBML L3V1 + FBC2 + groups. 0 libSBML errors on every emitted model. Loads cleanly in COBRApy / COBRAToolbox.

Conversion between native formats:

gapsmith convert model.gmod.cbor model.json --pretty
gapsmith convert model.json model.gmod.cbor
gapsmith export-sbml model.gmod.cbor model.xml

5.2 LP solver

good_lp 1.15 + HiGHS (bundled). The HiGHS crate statically links its C++ solver via CMake; no system HiGHS needed.

Optional CBC fallback: compile with --features gapsmith-fill/cbc and pass --cbc-fallback in the heuristic options. Requires coinor-libcbc-dev + coinor-libclp-dev on the build host.

5.3 Sequence database layout

<seq-dir>/<Taxonomy>/
  ├── rev/<reaction>.fasta         ← curated reviewed sequences
  ├── unrev/<reaction>.fasta       ← unreviewed / broader
  ├── rxn/<rxn_id>.fasta           ← per-SEED-reaction FASTAs
  ├── tcdb.fasta                   ← transporter DB
  ├── transporter.fasta            ← curated transporter hits
  ├── user/<reaction>.fasta        ← your own references (optional)
  └── version_seqDB.json           ← Zenodo version stamp

gapsmith find / find-transport resolve reference FASTAs in this priority order: user/ → rxn/ → rev/<EC> → unrev/<EC> → rev/<md5(name)> → unrev/<md5(name)>. Matching real gapseq's prepare_batch_alignments.R:150-234.

6. Troubleshooting

"external tool `makeblastdb` exited with status 1"

The tool was fed a gzipped FASTA. gapsmith doall auto-decompresses .gz; for standalone find / find-transport invocations, ungzip first.

"full candidate pool is infeasible"

The draft + every SEED reaction together can't produce biomass. Either:

The biomass template asks for a metabolite no SEED reaction produces (peptidoglycan cpd15665 on some drafts — see porting-notes for the known cycle gap), or
Your medium is missing a critical nutrient (run gapsmith medium --manual-flux to check what the rule-based inference assigns).

Use gapsmith fba --pfba --min-growth 0.001 on the draft to narrow the biomass-component that's blocking.

pFBA heuristic "max iterations reached"

HiGHS exhausted the tolerance + pFBA-coefficient ladder. Options:

Widen the medium (add more carbon sources).
Lower -k (minimum growth rate).
Rebuild with --features cbc and opt-in to the CBC fallback.

Draft has 0 reactions / stays empty

The upstream *-Reactions.tbl has dbhit= empty on every row. That means none of the reference FASTAs were found for the pathways you asked about. Verify <seq-dir>/<Taxonomy>/ is populated (gapsmith update-sequences -c).

7. Performance tips

Release build: cargo build --release --workspace. Debug builds are ~50× slower for the LP-heavy stages.
Use diamond over BLASTp: -A diamond. Comparable accuracy, 5–20× faster on large proteomes.
Parallelism: -K <threads> on every aligner-invoking subcommand. gapsmith is rayon-parallel in the LP-free stages (find, transport, medium) too.
Batch mode: if you're annotating 10+ genomes, gapsmith batch-align
- --aligner precomputed amortises the alignment cost.
Multi-genome scale-out: for 100+ genomes or 1k+ metagenomes, use gapsmith doall-batch with --gspa-run so the expensive alignment is done exactly once across the whole collection. Shard across SLURM array tasks with --shard i/N. See Multi-genome workflows.

8. Where to look next

porting-notes.md — every intentional deviation from upstream R gapseq.
feature-matrix.md — exhaustive feature list with R source → Rust module pointers.
README.md — status + benchmarks.
Per-crate rustdoc: cargo doc --workspace --no-deps --open.

CLI reference

Every gapseq subcommand, exhaustive flag list. Generated from --help, annotated per-section. Run any command with -h to see these locally.

Global options

These apply to every subcommand:

Flag	Description
`--data-dir <PATH>`	Override gapseq reference-data directory. Resolution order when absent: `GAPSMITH_DATA_DIR` env → `$XDG_DATA_HOME/gapsmith` → `<exe-dir>/../dat` → `./dat`.
`--seq-dir <PATH>`	Override the sequence database directory. Defaults to `<data-dir>/seq`.
`-K, --threads <N>`	Worker threads. Default: all available cores.
`-v, -vv, -vvv`	Logging verbosity (info / debug / trace).
`-h, --help`	Print help.

`gapsmith doall`

End-to-end pipeline: find → find-transport → draft → medium → fill.

Flag	Default	Description
`<GENOME>`	—	Required. Protein FASTA (`.faa` / `.faa.gz`).
`-A, --aligner`	`blastp`	`blastp`, `diamond`, `mmseqs2`, or `precomputed`.
`-b, --bitcutoff`	`200.0`	Bitscore cutoff for the find stages.
`-c, --coverage`	`75`	Coverage cutoff (0–100 %).
`-l, --min-bs-core`	`50.0`	Core-reaction bitscore cutoff.
`-t, --taxonomy`	`auto`	`Bacteria`, `Archaea`, or `auto` (currently maps to Bacteria).
`-m, --medium`	`auto`	Medium CSV, or `auto` to infer via `gapsmith medium`.
`-f, --out-dir`	`.`	Output directory.
`-K, --threads`	all cores	Thread count.
`--full-suite`	off	Include fill Steps 3 + 4 (slow).
`-k, --min-growth`	`0.01`	Minimum growth for gap-filling.
`--gspa-run <DIR>`	—	Read precomputed cluster-rep hits from a gspa manifest instead of running the aligner.
`--gspa-genome-id <ID>`	FASTA stem	Row id in the manifest's `genomes.tsv` matching this run.

Gzipped inputs are auto-decompressed into a tempfile::tempdir.

`gapsmith find`

Pathway and reaction detection.

Flag	Default	Description
`<GENOME>`	—	Protein FASTA.
`-p, --pathways`	`all`	Pathway keyword or id (or `all`).
`-l, --pathway-db`	`metacyc,custom`	Comma-separated: `metacyc`, `kegg`, `seed`, `custom`, `all`.
`-t, --taxonomy`	`Bacteria`	Reference subfolder under `seq-dir`.
`-A, --aligner`	`diamond`	Aligner backend.
`-P, --precomputed`	—	Precomputed alignment TSV (skip aligner).
`--gspa-run <DIR>`	—	gspa manifest — expands rep-level hits onto this genome's cluster members.
`--gspa-genome-id <ID>`	FASTA stem	Genome id in the manifest matching this invocation.
`--gspa-coverage-fraction`	off	Treat the alignment TSV's qcov as 0–1 (mmseqs2 native).
`-b, --bitcutoff`	`200.0`	Bitscore cutoff.
`-i, --identcutoff`	`0.0`	Identity cutoff (%).
`-c, --coverage`	`75`	Coverage cutoff.
`-a, --completeness-main`	`80.0`	Pathway completeness cutoff.
`-k, --completeness-hints`	`66.0`	Relaxed cutoff when every key reaction is present.
`-s, --strict`	off	Disable key-reaction heuristic.
`-n, --include-superpathways`	off	Default is to filter them out (matches upstream).
`-o, --out-dir`	`.`	Output directory.
`-u, --suffix`	—	Filename suffix `<stem>-<suffix>-{Reactions,Pathways}.tbl`.

`gapsmith find-transport`

Transporter detection.

Flag	Default	Description
`<GENOME>`	—	Protein FASTA.
`-A, --aligner`	`blastp`	Aligner backend.
`-P, --precomputed`	—	Precomputed TSV.
`-b, --bitcutoff`	`50.0`	Transport hits use a lower threshold than biosynthetic.
`-i, --identcutoff`	`0.0`	Identity cutoff.
`-c, --coverage`	`75`	Coverage cutoff.
`-a, --nouse-alternatives`	off	Disable the alt-transporter fallback.
`-m, --only-met`	—	Restrict to one substrate keyword.
`-o, --out-dir`	`.`	Output directory.

`gapsmith draft`

Build a draft metabolic model from find + find-transport output.

Flag	Default	Description
`-r, --reactions`	required	`*-Reactions.tbl`.
`-t, --transporter`	required	`*-Transporter.tbl`.
`-b, --biomass`	`auto`	`auto` / `pos` / `neg` / `archaea` / `<path.json>`.
`-n, --name`	inferred	Model id.
`-u, --high-evi-rxn-bs`	`200.0`	Bitscore threshold for core reactions.
`-l, --min-bs-for-core`	`50.0`	Lower bound for candidate pool.
`-o, --out-dir`	`.`	Output directory.
`--no-sbml`	off	Skip SBML emission.

`gapsmith medium`

Rule-based medium inference.

Flag	Default	Description
`-m, --model`	required	Draft CBOR / JSON.
`-p, --pathways`	required	`*-Pathways.tbl` from `find`.
`-c, --manual-flux`	—	`cpdXXXXX:val;cpdYYYYY:val` overrides.
`-o, --output`	inferred	Output CSV path.
`-f, --out-dir`	`.`	Output directory (when `-o` is absent).

`gapsmith fill`

Iterative gap-filling.

Flag	Default	Description
`<MODEL>`	required	Draft CBOR / JSON.
`-n, --media`	required	Medium CSV.
`-r, --reactions`	—	`*-Reactions.tbl` for bitscore-weighted pFBA.
`-t, --target`	`cpd11416`	Biomass pseudo-metabolite id.
`-k, --min-growth`	`0.01`	Minimum growth rate floor.
`-b, --bcore`	`50.0`	Minimum bitscore for "core" classification.
`--high-evi`	`200.0`	Bitscore-to-weight upper calibration point.
`--dummy-weight`	`100.0`	Weight for reactions with no hit.
`-o, --out-dir`	`.`	Output directory.
`--no-sbml`	off	Skip SBML.
`--step1-only`	off	Run only Step 1 (debugging).
`--full-suite`	off	Include fill Steps 3 + 4.
`--prune-futile`	off	Thermodynamic futile-cycle prune (opt-in; slow).

`gapsmith fba`

FBA / pFBA on an existing model.

Flag	Default	Description
`<MODEL>`	required	CBOR / JSON model.
`-r, --objective`	model's own	Objective reaction id.
`--pfba`	off	Parsimonious FBA.
`--pfba-coef`	`0.001`	pFBA biomass trade-off coefficient.
`--min-growth`	`0.0`	Biomass floor (pFBA only).
`--top`	`20`	Print the top-N highest-absolute-flux reactions.
`--minimise`	off	Minimise instead of maximise.

`gapsmith adapt`

Edit reactions or force growth on a compound.

Flag	Default	Description
`-m, --model`	required	Model file.
`-a, --add`	—	Comma-separated `rxnNNNNN` or MetaCyc pathway ids.
`-r, --remove`	—	Same format; reactions are removed.
`-w, --growth`	—	`cpdNNNNN:TRUE` / `cpdNNNNN:FALSE`.
`-b, --reactions`	—	`*-Reactions.tbl` (needed for `-w ...:TRUE` gap-filling).
`-k, --min-growth`	`0.01`	Min growth floor for the `-w ...:TRUE` path.
`-f, --out-dir`	`.`	Output directory.
`--no-sbml`	off	Skip SBML.

`gapsmith pan`

Build a pan-draft from N drafts.

Flag	Default	Description
`-m, --models`	required	Comma-separated / directory / single path.
`-t, --min-freq`	`0.06`	Minimum across-draft reaction frequency.
`-b, --only-binary`	off	Skip the pan-draft, emit only the rxn × model presence TSV.
`-f, --out-dir`	`.`	Output directory.
`--no-sbml`	off	Skip SBML.

`gapsmith update-sequences`

Zenodo seqdb sync.

Flag	Default	Description
`-t, --taxonomy`	`Bacteria`	Which seqdb to sync.
`-D, --seq-dir`	`<data-dir>/seq`	Seqdb location.
`-Z, --record`	pinned	Zenodo record id, or `latest`.
`-c, --check-only`	off	Report version, no download.
`-q, --quiet`	off	Suppress progress messages.

`gapsmith convert`

CBOR ↔ JSON round-trip.

Flag	Default	Description
`<INPUT>`	required	Path to source file.
`<OUTPUT>`	required	Destination path.
`--to`	from extension	`cbor` / `json` (overrides extension).
`--pretty`	off	Pretty-print JSON.

`gapsmith export-sbml`

Serialise a CBOR / JSON model as SBML L3V1 + FBC2 + groups.

Flag	Default	Description
`<INPUT>`	required	CBOR / JSON model.
`<OUTPUT>`	required	Destination `.xml` path.
`--compact`	off	Omit pretty-print / nested indentation.

`gapsmith align`

Run a single aligner standalone. Useful for debugging reference-FASTA issues or building a precomputed TSV to feed into find --aligner precomputed.

Flag	Default	Description
`-A, --aligner`	required	`blastp` / `tblastn` / `diamond` / `mmseqs2` / `precomputed`.
`-q, --query`	required	Query FASTA.
`-t, --target`	required	Target FASTA.
`-P, --precomputed`	—	When `-A precomputed`, path to the TSV.
`-b, --bitcutoff`	`0.0`	Filter hits below this bitscore.
`-c, --coverage`	`0`	Filter below this coverage %.
`-e, --evalue`	`1e-5`	E-value cutoff.
`--extra`	—	Passthrough flags to the aligner binary.
`-o, --out`	stdout	Output TSV path.

`gapsmith batch-align`

Cluster N genomes + single alignment + per-genome TSV expansion.

Flag	Default	Description
`-q, --query`	required	Reference query FASTA.
`-g, --genomes`	required	Directory containing `.faa(.gz)` files.
`-o, --out-dir`	required	Output `<genome>.tsv` directory.
`-A, --aligner`	`diamond`	Aligner backend (`blastp` / `diamond` / `mmseqs2`).
`--cluster-identity`	`0.5`	mmseqs2 cluster identity.
`--cluster-coverage`	`0.8`	mmseqs2 cluster coverage.
`-b, --bitcutoff`	`0.0`	Per-hit bitscore filter.
`-c, --coverage`	`0`	Per-hit coverage filter.

`gapsmith doall-batch`

Run doall across many genomes in parallel. See multi-genome.md for the full recipe.

Flag	Default	Description
`-g, --genomes-dir <DIR>`	—	Directory of protein FASTAs.
`--genomes-list <TSV>`	—	Explicit TSV list (`id<TAB>path[<TAB>abundance]`).
`--gspa-run <DIR>`	—	Pulls the genome list from a gspa manifest.
`-f, --out-dir <DIR>`	required	One `<genome_id>/` subdir is written per input.
`-j, --jobs <N>`	all cores	Rayon pool size.
`--shard <i/N>`	—	Select genomes where `index mod N == i`.
`--continue-on-error`	off	Log failed genomes to `doall-batch-errors.tsv` instead of aborting.
(plus every passthrough flag from `doall`)

`gapsmith community`

Community-level optimisation. Two subcommands.

`gapsmith community per-mag`

Per-MAG FBA under a shared (union) medium. Linear in N.

Flag	Default	Description
`--drafts-dir <DIR>`	—	Directory of `<id>-{draft,filled}.gmod.cbor`.
`--drafts-list <TSV>`	—	`<id><TAB><cbor>[<TAB><medium>]`.
`--gspa-run <DIR>`	—	Use manifest's genome list.
`--drafts-root <DIR>`	`.`	Where per-genome `doall` outputs live (paired with `--gspa-run`).
`-m, --medium <CSV>`	auto	Override the inferred shared medium.
`-o, --out-dir <DIR>`	`.`	Writes `community-medium.csv`, `per-mag-growth.tsv`.

`gapsmith community cfba`

Compose N drafts into one community model; solve weighted-sum biomass.

Flag	Default	Description
`--drafts-dir <DIR>` / `--drafts-list <TSV>` / `--gspa-run <DIR>`	—	Exactly one (same semantics as `per-mag`).
`--drafts-root <DIR>`	`.`	Where `doall` outputs live.
`--abundance <TSV>`	—	`<id><TAB><abundance>` overrides.
`--biomass-rxn <ID>`	`bio1`	Per-organism biomass reaction id.
`--balanced-growth`	off	Add `v(bio_k) == v(bio_community)` constraints.
`-m, --medium <CSV>`	—	Optional shared medium CSV.
`-o, --out-dir <DIR>`	`.`	Writes `community.gmod.cbor`, `community-fba.tsv`.

`gapsmith db inspect`

Smoke-test the --data-dir. Loads every reference table and prints row counts. No flags beyond --data-dir.

`gapsmith test`

Print resolved paths + which external tool binaries are on PATH.

`gapsmith example-model`

Emit a small hand-built toy model (3 metabolites / 2 reactions) as CBOR. Useful for smoke-testing format round-trips and downstream tools.

Flag	Default	Description
`<OUTPUT>`	required	Destination CBOR path.
`--complex`	off	Emit a slightly bigger "complex" variant.

Multi-genome & metagenome workflows

Gapsmith's single-genome pipeline (doall) stays the default. This chapter covers three additional, opt-in capabilities for running at scale:

--gspa-run — consume precomputed protein clusters + alignments from the upstream gspa pipeline instead of re-running DIAMOND/BLASTp per genome.
doall-batch — rayon- and SLURM-friendly driver that fans doall across N genomes (targets 1 000 metagenomes; scales to ~1 M genomes via --shard).
community — community-level objectives for metagenomes: a cheap per-MAG mode with a shared union medium, and a full cFBA mode that composes N drafts into one model.

Nothing in here breaks single-genome usage. If you never pass --gspa-run or use the new subcommands, behaviour is identical to earlier releases.

1. Consuming a gspa manifest (`--gspa-run`)

Why

Annotating N genomes with a single mmseqs2/linclust cluster + one alignment against cluster representatives is dramatically cheaper than N independent BLASTp runs. gspa's Phase 11 already produces that cluster; --gspa-run plugs the output into gapsmith find / find-transport / doall.

Manifest layout (v1)

gspa writes (or you assemble by hand) a directory with this shape:

<gspa-run>/
  clusters.tsv            # 3 cols: rep_id<TAB>member_id<TAB>genome_id
  alignment/<ref>.tsv     # 8-col gapsmith TSV keyed on rep_id
  genomes.tsv             # 2–3 cols: genome_id<TAB>faa_path[<TAB>abundance]
  gspa-manifest.toml      # optional — version/provenance marker

clusters.tsv: every (rep, member, genome) triple emitted by the cross-genome mmseqs2 cluster. The rep appears on its own line as its own member.
alignment/*.tsv: one or more 8-column files in gapsmith's canonical format (qseqid pident evalue bitscore qcov stitle sstart send). All .tsv / .tab / .txt files in that directory are concatenated transparently — useful when you run DIAMOND and mmseqs2 separately and want both sources of evidence.
genomes.tsv: the authoritative list of organism ids + FASTA paths. The optional 3rd column carries a relative abundance (any positive float — RPKM / TPM / raw counts / …) that propagates into community cfba.

Usage

# Single genome, alignment reused from an already-built gspa manifest:
gapsmith doall /path/to/genome.faa -f out/ \
    --gspa-run /path/to/gspa-run/ \
    --gspa-genome-id genome

# If the FASTA stem already matches the genome_id in genomes.tsv,
# --gspa-genome-id can be omitted.

--gspa-coverage-fraction re-scales mmseqs2-native coverage columns (0–1) up to the gapsmith convention (0–100). Diamond/BLASTp TSVs are already 0–100 and need no flag.

What happens under the hood

gapsmith_align::GspaRunAligner implements the Aligner trait. On align() it:

Reads alignment/*.tsv lazily (once, cached).
Looks up each rep hit in clusters.tsv filtered to target_genome_id.
Emits one Hit per member, rewriting qseqid to the member's original protein id so downstream find / find-transport see a normal per-genome hit table.

2. Batch reconstruction (`doall-batch`)

When to reach for it

Scale	Recommended tool
1 genome	`gapsmith doall`
10 – 200 genomes, 1 box	`gapsmith doall-batch --jobs <cores>`
1 k – 1 M genomes, cluster	`doall-batch --shard i/N` inside a SLURM array

Flags that matter

--genomes-dir <dir> / --genomes-list <tsv> / --gspa-run <dir> — exactly one; the last option pulls the genome list straight out of the gspa manifest.
--jobs N — rayon pool size. Each worker processes one genome at a time; we pin doall's own --threads to 1 so rayon gets clean parallelism.
--shard i/N — SLURM-array-friendly selector. After sorting the genome list alphabetically, only genomes whose 0-based index k satisfies k mod N == i are processed. Every shard is independent — no shared state, no cross-task coordination.
--continue-on-error — log-and-skip instead of aborting the batch when a single genome's doall fails. Failures are written to <out-dir>/doall-batch-errors.tsv.

Recipes

100 genomes on one box, reusing a gspa cluster

gapsmith doall-batch \
    --gspa-run /data/gspa-run/ \
    -f /out/reconstructions/ \
    --jobs 32 \
    --continue-on-error

Each <genome_id>/ subdirectory under /out/reconstructions/ gets the standard doall outputs.

1 M genomes on a SLURM cluster

#!/bin/bash
#SBATCH --array=0-1023
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=24:00:00

gapsmith doall-batch \
    --gspa-run /scratch/gspa-run/ \
    -f /scratch/recon/ \
    --jobs $SLURM_CPUS_PER_TASK \
    --shard ${SLURM_ARRAY_TASK_ID}/1024 \
    --continue-on-error

One gspa run on the full 1 M genome set covers the expensive alignment step globally.
1 024 SLURM tasks each process ~1 000 genomes in parallel; add tasks or add cores per task to change the balance.

3. Community optimisation (`community`)

Two subcommands; pick the one that matches your accuracy / scale trade-off. You can run both on the same input and compare.

`community per-mag` — scalable default

Each MAG keeps its own GEM and its own FBA. We (a) build a shared medium that is the union of every per-MAG medium (largest maxFlux per compound wins), (b) run FBA on every draft under that shared medium, and (c) report the abundance-weighted mean growth rate.

gapsmith community per-mag \
    --gspa-run /data/gspa-run/ \
    --drafts-root /out/reconstructions/ \
    -o /out/community/

Outputs:

community-medium.csv — the shared medium.
per-mag-growth.tsv — genome_id, abundance, growth_rate, status per MAG.

Complexity: linear in N; no community-level LP. This is the mode for ≥ 100 MAGs where a full cFBA isn't tractable.

`community cfba` — full community LP

Composes N drafts into one block-diagonal model. Private mets/rxns are suffixed __<genome_id>; extracellular mets (_e0) and exchange reactions (EX_*_e0) are shared so organisms cross-feed through a single pool. Bounds on shared exchanges are widened to the most permissive input.

A synthetic bio_community reaction is added whose flux equals Σ w_k · bio_k (weights come from --abundance or genomes.tsv).

gapsmith community cfba \
    --gspa-run /data/gspa-run/ \
    --drafts-root /out/reconstructions/ \
    --abundance /data/abundances.tsv \
    --balanced-growth \
    -m /data/community-medium.csv \
    -o /out/community/

Outputs:

community.gmod.cbor — composed community model.
community-fba.tsv — per-organism biomass flux plus the community-level objective.

--balanced-growth adds one linear constraint per organism forcing v(bio_k) == v(bio_community), matching classical cFBA semantics. Drop the flag for an abundance-weighted average without the equal-growth constraint (faster, more realistic when abundances are very uneven).

Scaling: the community LP grows roughly N × (rxns + mets). Expect comfortable behaviour up to ~50 organisms; larger communities are better served by per-mag first, with cFBA applied to selected sub-communities.

4. Decision tree

                 "metagenome / multi-MAG run"
                             │
            ┌────────────────┴────────────────┐
            │                                 │
     small curated set                 thousands of MAGs
      (≤ ~50 organisms)                  or fast iteration
            │                                 │
    ┌───────┴───────┐                 ┌───────┴───────┐
    │ need obligate │                 │ need obligate │
    │ cross-feeding?│                 │ cross-feeding?│
    └───────┬───────┘                 └───────┬───────┘
      yes  │    │ no                    yes  │    │ no
           │    │                            │    │
  community cfba  community per-mag  community cfba on  community per-mag
  --balanced-growth                  selected cohorts

Rules of thumb:

1 — 50 organisms, synthetic or curated community: cfba.
50 — 1 000 organisms, real-world metagenome: per-mag.
1 000 organisms or iterating over many assemblies: always start with doall-batch + per-mag; drop to cfba only for sub-cohorts.

5. Producing a gspa run from outside gspa

Any tool that can emit the three files above works. A minimal hand-rolled example:

# 1. Concatenate proteomes, tagging each header with its genome id.
mkdir -p run/
python tools/concat_with_prefix.py --in genomes/ --out run/all.faa

# 2. Cluster across genomes.
mmseqs easy-cluster run/all.faa run/cluster run/tmp \
    --min-seq-id 0.5 -c 0.8 --threads 32

# 3. Pivot the 2-col `run/cluster_cluster.tsv` into 3-col clusters.tsv
#    by peeling each member's `<genome_id>|` prefix.
awk -F$'\t' '{
  member=$2; rep=$1;
  n = split(member, a, "|"); gid=a[1];
  orig=substr(member, length(gid)+2);
  print rep "\t" orig "\t" gid
}' run/cluster_cluster.tsv > run/clusters.tsv

# 4. Run alignment ONCE against cluster reps.
mkdir -p run/alignment
diamond makedb --in reference.faa -d ref
diamond blastp -q run/cluster_rep_seq.fasta -d ref \
    --outfmt 6 qseqid pident evalue bitscore qcovhsp stitle sstart send \
    --more-sensitive -k 10 -e 1e-5 > run/alignment/reference.tsv

# 5. genomes.tsv (optional 3rd column = abundance).
for f in genomes/*.faa; do
  id=$(basename "$f" .faa)
  echo -e "$id\t$(realpath "$f")"
done > run/genomes.tsv

After that, gapsmith doall-batch --gspa-run run/ --jobs 32 -f out/ is all you need.

Architecture

A bird's-eye view of how gapsmith is organised, how data flows through it, and what each crate is responsible for.

1. Crate dependency graph

             gapsmith-core            ← Types only, no I/O / DB / solver
              ↑  ↑  ↑
              │  │  │
    ┌─────────┘  │  └────────────────┐
    │            │                   │
  gapsmith-io   gapsmith-db           gapsmith-sbml
    │            │                   ↑
    └─┬──────────┘                   │
      │                              │
      ▼                              │
  gapsmith-align ──┐                   │
      │          │                   │
      ▼          │                   │
  gapsmith-find    │                   │
      │          │                   │
      ├─▶ gapsmith-transport           │
      │                              │
      ▼                              │
  gapsmith-draft ──────────────────────┤
      │                              │
      ▼                              │
  gapsmith-fill ◀─── gapsmith-medium    │
      │                              │
      └──────────────▶ gapsmith-cli ◀──┘

Arrows point from "depended on" toward "depends on". gapsmith-core is the leaf: every other crate builds on its Model / Reaction / Metabolite / StoichMatrix types.

2. Data flow for `gapsmith doall`

  genome.faa.gz
        │
        ▼
  ┌──────────────┐                      ┌──────────────────────┐
  │ gapsmith find  │ ──── aligner (shell) │  dat/seq/<tax>/      │
  │ -p all       │          ─ BLAST/    │  rev/ unrev/ rxn/    │
  └──────┬───────┘            diamond/  └──────────────────────┘
         │                    mmseqs2
         │   *-Reactions.tbl  + *-Pathways.tbl
         ▼
  ┌──────────────────────┐
  │ gapsmith find-transport│ ─── same aligner, subex.tbl + tcdb.fasta
  └──────┬───────────────┘
         │   *-Transporter.tbl
         ▼
  ┌────────────────────┐
  │ gapsmith draft       │ ─── seed_reactions_corrected.tsv, biomass JSON
  └──────┬─────────────┘
         │   *-draft.gmod.cbor  (CBOR) + *-draft.xml  (SBML)
         ▼
  ┌──────────────────┐
  │ gapsmith medium    │ ─── medium_prediction_rules.tsv + *-Pathways.tbl
  │ (auto)           │
  └──────┬───────────┘
         │   *-medium.csv
         ▼
  ┌────────────────────┐
  │ gapsmith fill        │ ─── HiGHS LP (in-process via good_lp)
  │ 4-phase suite      │
  └──────┬─────────────┘
         │   *-filled.gmod.cbor  +  *-filled.xml  +  *-filled-added.tsv
         ▼
  (COBRApy, COBRAToolbox, cobrar, …)

Every arrow between stages is plain filesystem I/O. No daemon, no IPC. Each subcommand is individually re-runnable; if you change the medium you only re-run fill.

3. Model representation

[gapsmith_core::Model] is the single source of truth:

#![allow(unused)]
fn main() {
pub struct Model {
    pub annot: ModelAnnot,           // provenance metadata
    pub compartments: Vec<Compartment>,
    pub mets: Vec<Metabolite>,
    pub rxns: Vec<Reaction>,
    pub genes: Vec<GeneId>,
    pub s: StoichMatrix,              // sprs CsMat in CSC
}
}

Serialised via serde as:

CBOR (.gmod.cbor) — native, compact, fast load. The gapsmith internal format; replaces upstream's .RDS.
JSON (.json) — human-readable; gapsmith convert converts both ways. Same semantic content as CBOR.
SBML (.xml) — Level 3 Version 1 + FBC 2 + groups 1. Written by [gapsmith_sbml::write_sbml]; loads cleanly in COBRApy / COBRAToolbox / cobrar.

Metabolite ids use the cpd00001_c0 convention (compartment baked into the id) for SBML SId compliance. Reaction ids likewise — rxn00001_c0, EX_cpd00001_e0, bio1, etc.

4. LP plumbing

The gap-filler does a lot of LPs — dozens per Step 1 / 2 / 2b / 3 / 4 run, sometimes thousands in a --full-suite. Key design decisions:

Split-flux encoding

Every reaction r becomes two variables vp_r, vn_r ≥ 0. The net flux is v_r = vp_r − vn_r. This makes |v_r| = vp_r + vn_r linear — essential for pFBA.

Bound translation

model `[lb, ub]`	`vp_r` upper	`vn_r` upper
`lb ≥ 0, ub ≥ 0`	`ub`	`0`
`lb ≤ 0, ub ≥ 0`	`ub`	`−lb`
`lb ≤ 0, ub ≤ 0`	`0`	`−lb`

Strict-direction reactions additionally get vp_r ≥ lb (forward-forced) or vn_r ≥ −ub (backward-forced).

Mass balance

Walk the CSC matrix once column-by-column (O(nnz) total). For each non-zero (row, col, coef), accumulate coef · (vp[col] − vn[col]) onto row row's expression. The naive per-row scan would be O(m · n) — ~45 B iterations on a 3k×5k model. The O(nnz) walk solves it in under 100 ms on ecoli.

Solver

good_lp 1.15 with HiGHS backend. HiGHS is statically linked via CMake at build time — no runtime HiGHS dependency. Optional CBC fallback behind --features cbc: pfba_heuristic retries once with CBC after HiGHS exhausts the tolerance / coefficient ladder.

5. Alignment layer

gapsmith-align exposes a single [Aligner] trait:

#![allow(unused)]
fn main() {
pub trait Aligner {
    fn run(&self, query: &Path, targets: &[&Path], opts: &AlignOpts)
        -> Result<Vec<Hit>, AlignError>;
}
}

Implementors — one struct each — shell out to the matching binary and normalise its output to a [Hit] with the gapseq-canonical column set (qseqid / pident / evalue / bitscore / qcovs / stitle / sstart / send).

[BlastpAligner] builds an indexed BLAST DB in a temp dir per call.
[DiamondAligner] mirrors gapseq's --more-sensitive default.
[Mmseqs2Aligner] uses the explicit createdb → search → convertalis 4-command pipeline (not easy-search — see porting-notes for why).
[PrecomputedTsvAligner] skips all shelling out and reads a user-supplied TSV.
[BatchClusterAligner] is the Rust-only feature: mmseqs2-clusters N input genomes, runs one alignment, expands cluster membership back to per-genome hits.
[GspaRunAligner] is the multi-genome / metagenome entry point: given a gspa manifest (clusters.tsv + alignment/*.tsv + genomes.tsv) and a target genome id, it fans every rep-level hit onto that genome's cluster members. No alignment is re-run — gspa clustered across N genomes once, upstream.

6. Candidate-pool / gap-fill

The 4-phase gap-fill suite (gapsmith-fill::run_suite) is the heaviest computation in the pipeline. Sketch of one fill call:

  draft model  +  medium CSV  +  Reactions.tbl (bitscore weights)
                         │
                         ▼
          ┌──────────────────────────────┐
          │ apply_medium(draft)          │
          │ attach EX_<target>_c0 obj    │
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ build_full_model             │  ← every approved SEED rxn
          │   draft + candidates         │    not already present,
          │   deduped by stoich hash     │    sorted by core status
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ (optional) detect_futile_    │  ← parallel LP probe,
          │   cycles  +  drop            │    `--prune-futile`
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ pfba_heuristic               │  ← tolerance ladder,
          │   min_growth ≥ k             │    `1e-6 → 1e-9`
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ extract utilized candidates  │
          │ add to draft                 │
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ KO essentiality loop         │  ← zero bounds, recheck FBA,
          │   serial, core-first         │    drop non-essential
          └──────────────┬───────────────┘
                         ▼
          ┌──────────────────────────────┐
          │ Step 2: biomass components   │  ← per substrate, MM_glu
          ├──────────────────────────────┤
          │ Step 2b: aerobic/anaerobic   │
          ├──────────────────────────────┤
          │ Step 3: ESP1-5 energy screen │  (`--full-suite`)
          ├──────────────────────────────┤
          │ Step 4: fermentation screen  │  (`--full-suite`)
          └──────────────────────────────┘
                         │
                         ▼
                    filled model

6b. Multi-genome and community paths

The single-genome flow above is unchanged. Three additional, opt-in paths sit alongside it.

6b.1 Alignment reuse via gspa

   N proteomes ───▶  [gspa, upstream]  ──▶  gspa-run directory
                     (mmseqs2 linclust,      clusters.tsv
                      diamond vs. ref)       alignment/*.tsv
                                             genomes.tsv
                           │
     ┌─────────────────────┼─────────────────────┐
     ▼                     ▼                     ▼
  gapsmith find         find-transport         doall-batch
   --gspa-run …          --gspa-run …           --gspa-run …
     │                     │                     │
     └─ reads clusters.tsv + alignment/*.tsv, expands rep
        hits onto each target genome's cluster members
        (gapsmith_align::GspaRunAligner).

No alignment is run inside gapsmith when --gspa-run is set. Per-genome find still applies its bitscore / coverage / completeness gates on the fanned-out hits, so the downstream pipeline is unchanged.

6b.2 `doall-batch` parallel driver

   genomes.tsv  ──▶ sort + shard (`--shard i/N`) ──▶ rayon pool
                                                       │
                                                       ├─▶ doall genome_1
                                                       ├─▶ doall genome_2
                                                       ├─▶ ...
                                                       └─▶ doall genome_k
   out-dir/
     <genome_1>/
     <genome_2>/
     ...
     doall-batch-errors.tsv  ← present only with --continue-on-error

Each worker calls doall::run_cli in-process with its --threads pinned to 1 so rayon owns the parallelism. SLURM array tasks get a disjoint --shard i/N — no inter-task coordination, no shared state.

6b.3 Community modes

   N drafts (gmod.cbor) ──┐
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
   community per-mag           community cfba
   (gapsmith-fill::             (gapsmith-fill::
    community::                  community::
     union_medium,                compose_models,
     per_mag_weights)             add_community_biomass)
              │                       │
              ▼                       ▼
   per-mag-growth.tsv +        community.gmod.cbor +
   community-medium.csv        community-fba.tsv
   (N independent FBAs)        (1 big LP across the
                                block-diagonal model)

compose_models suffixes every private metabolite and reaction with __<organism>, leaves extracellular (_e0) ids unsuffixed so they collapse into one shared pool, and takes the most permissive EX_*_e0 bounds across organisms. add_community_biomass appends a community_biomass pseudo-metabolite plus a draining bio_community reaction (Σ w_k · bio_k), optionally adding one equality per organism (v(bio_k) == v(bio_community)) for balanced-growth cFBA. The regular [fba::fba] solves the resulting LP with no solver-side changes.

7. Testing surface

Every algorithmic component has unit tests (~170 total, cargo test --workspace). On top, three integration layers:

Shell-parity tests (crates/gapsmith-align/tests/*_parity.rs) — run the real BLAST / diamond / mmseqs2 binaries and diff against our wrappers' output.
R-parity tests (crates/gapsmith-find/tests/complex_parity.rs, pipeline_parity.rs, crates/gapsmith-transport/tests/parity.rs) — run the actual R gapseq via Rscript and diff column-by-column.
SBML validator (tools/validate_sbml.py) — uv-managed venv with python-libsbml + cobra; runs libSBML consistency checks + COBRApy round-trip on emitted SBML. 0 errors on every model.

8. File-system layout

gapsmith/
  crates/
    gapsmith-core/               # types
    gapsmith-io/                 # CBOR/JSON + path resolver
    gapsmith-db/                 # dat/*.tsv loaders
    gapsmith-sbml/               # SBML writer
    gapsmith-align/              # aligner trait + 6 backends (incl. gspa-run)
    gapsmith-find/               # pathway + reaction detection
    gapsmith-transport/          # transporter detection
    gapsmith-draft/              # draft model builder
    gapsmith-medium/             # rule-based medium inference
    gapsmith-fill/               # FBA / pFBA / gap-filler / suite / community
    gapsmith-cli/                # clap dispatch + every subcommand
  tools/
    bench/                     # R-vs-rs benchmark runner
    validate_sbml.py           # libSBML + COBRApy validator
    r_complex_detection.R      # R parity harness
    .sbml-validate/            # uv venv (python-libsbml + cobra)
  docs/
    user-guide.md              # end-user walk-through
    cli-reference.md           # exhaustive flag list
    multi-genome.md            # metagenome, gspa integration, cFBA
    feature-matrix.md          # R source → Rust module pointers
    porting-notes.md           # intentional deviations from upstream
    architecture.md            # this file
  Cargo.toml                   # workspace manifest
  README.md                    # status + benchmarks

Feature matrix

Exhaustive list of everything gapsmith implements, one row per feature, with pointers to both the upstream R source and the Rust module.

Status legend:

✅ — implemented and tested against real gapseq where feasible
🆕 — Rust-only feature (no upstream equivalent)
⚠️ — shipped but intentionally deviates from upstream (see porting-notes.md)
❌ — deferred / intentionally not ported

1. Subcommands

Subcommand	R source	Rust module	Status
`gapsmith test`	—	`gapsmith-cli/src/commands/test.rs`	✅
`gapsmith find`	`src/gapseq_find.sh` + `src/*.R`	`gapsmith-find/` + `gapsmith-cli/src/commands/find.rs`	✅ byte-identical on `PWY-6587` & `amino`
`gapsmith find-transport`	`src/transporter.sh` + `src/analyse_alignments_transport.R`	`gapsmith-transport/` + `gapsmith-cli/src/commands/find_transport.rs`	✅ TC-set + row-count identical
`gapsmith draft`	`src/generate_GSdraft.R` + `src/prepare_candidate_reaction_tables.R`	`gapsmith-draft/` + `gapsmith-cli/src/commands/draft.rs`	✅ SBML validates (0 libSBML errors)
`gapsmith medium`	`src/predict_medium.R`	`gapsmith-medium/` + `gapsmith-cli/src/commands/medium.rs`	✅ byte-identical on ecoli
`gapsmith fill`	`src/gf.suite.R` + `src/gapfill4.R`	`gapsmith-fill/` + `gapsmith-cli/src/commands/fill.rs`	✅ 4-phase suite + KO loop
`gapsmith adapt`	`src/adapt.R` + `src/gf.adapt.R`	`gapsmith-cli/src/commands/adapt.rs`	⚠️ EC / KEGG / name resolution deferred
`gapsmith pan`	`src/pan-draft.R`	`gapsmith-cli/src/commands/pan.rs`	✅ union + binary table
`gapsmith doall`	`src/doall.sh`	`gapsmith-cli/src/commands/doall.rs`	✅ end-to-end on ecore in 2m47s
`gapsmith update-sequences`	`src/update_sequences.sh`	`gapsmith-cli/src/commands/update_sequences.rs`	✅ Zenodo sync + md5 diff
`gapsmith convert`	—	`gapsmith-cli/src/commands/convert.rs`	🆕 CBOR ↔ JSON round-trip
`gapsmith example-model`	—	`gapsmith-cli/src/commands/example_model.rs`	🆕 toy model fixture
`gapsmith db inspect`	—	`gapsmith-cli/src/commands/db.rs`	🆕 reference-data row-count dump
`gapsmith export-sbml`	`cobrar::writeSBMLmod`	`gapsmith-cli/src/commands/export_sbml.rs`	🆕 CBOR → SBML
`gapsmith align`	—	`gapsmith-cli/src/commands/align.rs`	🆕 debug-wrap for a single aligner
`gapsmith batch-align`	—	`gapsmith-cli/src/commands/batch_align.rs`	🆕 cluster N genomes + single alignment
`gapsmith doall-batch`	—	`gapsmith-cli/src/commands/doall_batch.rs`	🆕 rayon + SLURM-shard parallel `doall` across N genomes
`gapsmith community per-mag`	—	`gapsmith-cli/src/commands/community.rs`	🆕 per-MAG FBA under a shared (union) medium
`gapsmith community cfba`	—	`gapsmith-cli/src/commands/community.rs`	🆕 compose N drafts, weighted-sum biomass objective
`gapsmith fba`	—	`gapsmith-cli/src/commands/fba.rs`	🆕 FBA / pFBA standalone

2. Core algorithms

2.1 Alignment layer (`gapsmith-align`)

Feature	R source	Rust module
BLASTp wrapper	`gapseq_find.sh` blastp block	`blast.rs::BlastpAligner`
tBLASTn wrapper	same, for `-n nucl`	`blast.rs::TblastnAligner`
DIAMOND wrapper	`gapseq_find.sh` diamond block	`diamond.rs::DiamondAligner`
mmseqs2 wrapper (full pipeline)	`gapseq_find.sh` mmseqs block	`mmseqs2.rs::Mmseqs2Aligner`
Precomputed TSV input	—	`precomputed.rs::PrecomputedTsvAligner` 🆕
Batch-cluster (N genomes → 1 alignment)	—	`batch.rs::BatchClusterAligner` 🆕
gspa-run manifest reader (cluster-aware hit expansion)	—	`gspa.rs::{GspaManifest, GspaRunAligner}` 🆕
2-decimal scientific e-value format	BLAST `-outfmt 6` native	`tsv.rs`

2.2 `find` pipeline (`gapsmith-find`)

Feature	R source	Rust module
Pathway table loader (meta / kegg / seed / custom)	`gapseq_find.sh:520-532`	`gapsmith-db::PathwayTable`
metacyc + custom merge (custom-wins-on-id)	same	same
Keyword-shorthand resolution (`amino`, `carbo`, ...)	`gapseq_find.sh:40-60`	`pathways.rs::MatchMode::Hierarchy`
Reference FASTA resolver (`user/` → `rxn/` → `rev/EC` → `unrev/EC` → md5)	`prepare_batch_alignments.R:150-234`	`seqfile.rs`
Complex-subunit detection	`complex_detection.R`	`complex.rs` (R-parity on 9 cases)
Hit classification with exception table	`analyse_alignments.R:108-189`	`classify.rs`
Pathway completeness scoring (`f64` precision)	`filter_pathways.R:10-34`	`pathways.rs::score`
`dbhit` lookup (EC + altEC + MetaCyc id + enzyme name)	`getDBhit.R:60-130`	`dbhit.rs`
`noSuperpathways=true` default	`gapseq_find.sh:20`	`find::FindOptions`
Word-boundary-less header filter (matches shell `grep -Fivf`)	`gapseq_find.sh`	`seqfile.rs` ⚠️ intentional
Output writers (`Reactions.tbl`, `Pathways.tbl`)	same	`output.rs`

2.3 `find-transport` pipeline (`gapsmith-transport`)

Feature	R source	Rust module
`subex.tbl` substrate filter	`transporter.sh:140-280`	`filter.rs`
TC-id parsing + type canonicalisation	`analyse_alignments_transport.R:1-188`	`tc.rs`
Substrate resolution (tcdb_all + FASTA header fallback)	same	`runner.rs`
Alt-transporter reaction assignment (gated by `--nouse-alternatives`)	`analyse_alignments_transport.R:110-130`	`runner.rs`
Substrate-case preservation (gapseq emits `sub=Potassium`)	shell behaviour	`data.rs`

2.4 Draft model builder (`gapsmith-draft`)

Feature	R source	Rust module
Candidate selection (bitscore ≥ cutoff OR pathway support)	`prepare_candidate_reaction_tables.R` + `generate_GSdraft.R:55-100`	`candidate.rs`
Stoichiometric hash dedup	`generate_rxn_stoich_hash.R`	`stoich_hash.rs`
Best-status-across-rows (OR `is_complex`, max `complex_status`, highest-rank `pathway_status`)	implicit in R's data.table merges	`candidate.rs::build_candidates` ⚠️ explicit
Biomass JSON parser (single + pipe-separated multi-link)	`parse_BMjson.R:1-107`	`biomass.rs` + `gapsmith-db::BiomassComponent::links`
Biomass cofactor mass-rescaling	`generate_GSdraft.R:281-292`	`biomass.rs` ⚠️ menaquinone-8 auto-removal deferred
GPR composition (and / or tree, "subunit undefined" edge cases)	`get_gene_logic_string.R`	`gpr.rs`
Diffusion + exchange expansion	`add_missing_exRxns.R:1-156`	`exchanges.rs`
Conditional transporter additions (butyrate, IPA, PPA, phloretate)	`generate_GSdraft.R`	`runner.rs::add_conditional_transporters`
SBML ID sanitiser (`-/./:/space → _`)	—	`builder.rs` 🆕
Cytosolic met-id format (`cpd00001_c0` not `cpd00001[c0]`)	—	`builder.rs` ⚠️ for SBML SId compliance

2.5 FBA / pFBA solver (`gapsmith-fill`)

Feature	R source	Rust module
Split-flux LP encoding (`vp, vn ≥ 0`)	implicit in `cobrar::pfbaHeuristic`	`lp.rs::SplitFluxLp`
FBA	`cobrar::fba`	`fba.rs::fba`
pFBA (single call)	`cobrar::pfbaHeuristic`	`pfba.rs::pfba`
pFBA-heuristic tolerance ladder (15 iters, `1e-6 → 1e-9`, pFBA-coef relaxation)	`gapfill4.R:95-137`	`pfba.rs::pfba_heuristic`
HiGHS solver	— (R uses glpk/cplex)	`good_lp` 1.15 + `highs-sys`
CBC fallback	—	`pfba.rs::pfba_cbc` (feature-gated `cbc`) 🆕
Row-expression builder (`O(nnz)`)	implicit	`fba.rs::build_row_exprs` 🆕 performance

2.6 Gap-filling (`gapsmith-fill`)

Feature	R source	Rust module
`gapfill4` single-iteration driver	`gapfill4.R:1-303`	`gapfill.rs::gapfill4`
Candidate pool (`draft + all approved SEED, stoich-hash deduped`)	`construct_full_model.R` + `gapfill4.R:12-56`	`pool.rs::build_full_model`
`rxnWeights` derivation from bitscores	`prepare_candidate_reaction_tables.R:222-228`	`pool.rs::rxn_weight`
KO essentiality loop (serial, core-first, highest-weight-first)	`gapfill4.R:247-280`	`gapfill.rs::gapfill4`
Medium application (close all EX, open per-medium, add missing EX)	`constrain.model.R`	`medium.rs::apply_medium`
Environment overrides (`env_highH2.tsv`)	`adjust_model_env.R`	`medium.rs::apply_environment_file`
Step 1 (user medium + biomass target)	`gf.suite.R:244-258`	`suite.rs::run_suite`
Step 2 (per-biomass-component on MM_glu + carbon sources)	`gf.suite.R:285-372`	`suite.rs::step2`
Step 2b (aerobic / anaerobic variant)	`gf.suite.R:377-464`	`suite.rs::run_suite`
Step 3 (energy-source screen with ESP1-5)	`gf.suite.R:480-581`	`suite.rs::step3`
Step 4 (fermentation-product screen)	`gf.suite.R:585-683`	`suite.rs::step4`
Target-met sink as objective	`add_met_sink` in `add_missing_exRxns.R:56-72`	`suite.rs::add_target_sink_obj`
Futile-cycle detector (parallel pairwise LP probe)	recent upstream `cccbb6f0`	`futile.rs::detect_futile_cycles` (opt-in `--prune-futile`)
Community model composition (block-diagonal, shared `_e0`)	—	`community.rs::compose_models` 🆕
Weighted-sum community biomass + optional balanced-growth	—	`community.rs::add_community_biomass` 🆕
Union-medium + per-MAG weights (community `per-mag` mode)	—	`community.rs::{union_medium, per_mag_weights, weighted_growth}` 🆕

2.7 Medium inference (`gapsmith-medium`)

Feature	R source	Rust module
Rules-table loader	`predict_medium.R:46`	`rules.rs::load_rules`
Boolean-expression evaluator (`\| & ! < > == <= >=`)	`eval(parse(text=))`	`boolexpr.rs::eval`
Counting-rule support (`a + b + c < 3`)	same (R int arithmetic)	`boolexpr.rs::parse_sum`
Cross-rule dedup + mean flux	`predict_medium.R:84-86`	`predict.rs::predict_medium`
Saccharides / Organic acids category dedup	`predict_medium.R:88-92`	`predict.rs::predict_medium`
Manual flux overrides	`predict_medium.R:94-114`	`predict.rs::parse_manual_flux`
Proton balancer	`predict_medium.R:121-132`	`predict.rs::predict_medium`

2.8 Serialisation (`gapsmith-sbml`, `gapsmith-io`)

Feature	R source	Rust module
CBOR round-trip	—	`gapsmith-io::{read,write}_model_cbor` 🆕
JSON round-trip	—	`gapsmith-io::{read,write}_model_json` 🆕
SBML L3V1 + FBC2 + groups writer	`cobrar::writeSBMLmod`	`gapsmith-sbml::write_sbml`
SBML SId idempotent on mets with compartment suffix	—	`writer.rs::species_id` 🆕 bugfix
Streaming via `quick-xml`	—	`writer.rs` 🆕 no libSBML dep
SBML consistency validation	libSBML native	`tools/validate_sbml.py` (libSBML + COBRApy)

2.9 Reference-data loaders (`gapsmith-db`)

Feature	R source	Rust module
`seed_reactions_corrected.tsv`	`data.table::fread`	`seed.rs::load_seed_reactions`
`seed_metabolites_edited.tsv`	same	`seed.rs::load_seed_metabolites`
MNXref cross-refs (`mnxref_*.tsv`)	same	`mnxref.rs`
`meta_pwy.tbl` / `kegg_pwy.tbl` / `seed_pwy.tbl` / `custom_pwy.tbl`	same	`pathway.rs`
`subex.tbl`	same	`subex.rs`
`tcdb.tsv`	same	`tcdb.rs`
`exception.tbl`	same	`exception.rs`
`medium_prediction_rules.tsv`	same	`gapsmith-medium::rules`
`complex_subunit_dict.tsv`	same	`complex.rs`
Biomass JSON (Gram+, Gram-, archaea, user custom)	`parse_BMjson.R`	`biomass.rs`
SEED stoichiometry parser (`-1:cpd00001:0:0:"H2O";...`)	`parse_BMjson.R:21-29`	`stoich_parse.rs`

3. New Rust-only features

Feature	Rust module	Motivation
Precomputed alignment input (`--aligner precomputed -P <tsv>`)	`gapsmith-align::PrecomputedTsvAligner`	Skip per-genome BLAST when the user pre-runs diamond / mmseqs2 at batch scale
BatchClusterAligner (`gapsmith batch-align`)	`gapsmith-align::BatchClusterAligner`	Amortise alignment cost over N genomes via one mmseqs2 cluster + single alignment
gspa-run manifest reader (`--gspa-run <dir>`)	`gapsmith-align::GspaRunAligner`	Consume precomputed cluster-rep hits from the upstream gspa pipeline; fans rep hits onto per-genome members
`gapsmith doall-batch`	`gapsmith-cli::commands::doall_batch`	Rayon + SLURM-array-friendly driver for reconstructing 100 → 1 M genomes in one batch
`gapsmith community per-mag`	`gapsmith-fill::community` + CLI	Shared-medium per-MAG FBA for metagenomes with 50+ MAGs
`gapsmith community cfba`	`gapsmith-fill::community` + CLI	Full community LP (block-diagonal compose, weighted-sum biomass, optional balanced-growth)
In-process LP (HiGHS via good_lp)	`gapsmith-fill`	Replaces R cobrar's shelled-out glpk / cplex; faster warm-starts
Optional CBC fallback backend	`gapsmith-fill::pfba_cbc` (`--features cbc`)	When HiGHS exhausts the tolerance ladder on pathological LPs
CBOR native format	`gapsmith-io`	Fast, compact, stdlib-free; replaces R's RDS
`gapsmith fba` subcommand	`gapsmith-cli`	Standalone FBA / pFBA without shelling into R
`gapsmith convert` subcommand	`gapsmith-cli`	CBOR ↔ JSON round-trip for inspection
`gapsmith db inspect` subcommand	`gapsmith-cli`	Smoke-test the reference data directory
`gapsmith export-sbml` subcommand	`gapsmith-cli`	Write an arbitrary CBOR model as SBML

4. Known gaps (deferred)

Gap	Upstream location	Workaround / plan
EC / TC conflict resolution (IRanges overlap math)	`prepare_candidate_reaction_tables.R::resolve_common_{EC,TC}_conflicts`	Affects <1 % of multi-EC annotations. Plan: port when a user case needs it.
MIRIAM cross-ref annotations (KEGG / BiGG / MetaNetX / HMDB / ChEBI)	`addReactAttr.R` + `addMetAttr.R`	SBML emits ModelSEED id only; round-trip in COBRApy still works.
HMM-based taxonomy / gram prediction	`predict_domain.R`, `predict_gramstaining.R`	CLI requires explicit `--taxonomy Bacteria
Gene-name MD5 fallback in seqfile resolver	`uniprot.sh:179`	Common-case MD5 fallback is ported; gene-name branch rarely fires.
Menaquinone-8 auto-removal (gated on MENAQUINONESYN-PWY / PWY-5852 / PWY-5837)	`generate_GSdraft.R:281-292`	Bio1 includes cpd15500 regardless; affects anaerobic predictions marginally.
`gram_by_network.R` (predict gram by metabolic-network similarity)	same	Requires explicit `-b pos
`adapt` EC / KEGG / enzyme-name resolution	`adapt.R::ids2seed` strategies 3–7	Direct SEED + pathway id resolution works; user can pre-resolve via `gapsmith find`.
`pan` weight medianing (`custom_median`)	`pan-draft_functions.R`	Pan-model emits without merged `rxnWeights` metadata; `gapsmith fill` on a pan-draft needs the source Reactions.tbl.
CPLEX solver support	—	Plan explicitly calls for HiGHS + optional CBC; no CPLEX path.
MetaCyc DB updaters (`meta2pwy.py`, `meta2genes.py`, `meta2rea.py`)	upstream Python helpers	Run once per year by maintainers; kept in Python.

5. Testing surface

Test suite	Tests	What it asserts
`gapsmith-core` unit	18	Type invariants, serde round-trips, stoichiometric matrix construction.
`gapsmith-io` unit	5	CBOR / JSON round-trip, data-dir auto-detect.
`gapsmith-db` unit	18	Every reference-data parser on realistic inputs.
`gapsmith-sbml` unit + integration	2 + 1	SBML writer emits every FBC2 / groups element; libSBML validates cleanly.
`gapsmith-align` unit + smoke + parity	18 + 4 + 3	Aligner trait, precomputed TSV, gspa-run manifest + fan-out, BLAST / diamond / mmseqs2 shell parity.
`gapsmith-find` unit + smoke + parity	36 + 1 + 2	Pathway scoring, complex detection (R-parity on 9 cases), `find -p PWY-6587` and `-p amino` byte-identical against real gapseq.
`gapsmith-transport` unit + parity	7 + 1	TC parsing, substrate resolution, end-to-end row+TC-id parity against real gapseq.
`gapsmith-draft` unit + smoke	10 + 1	Biomass rescaling, GPR composition, stoich dedup, conditional transporters.
`gapsmith-fill` unit + textbook + smoke	20 + 5 + 1	FBA / pFBA / pFBA-heuristic on toys; community compose + cFBA on 2-organism toy; `gapfill4` end-to-end on ecoli draft.
`gapsmith-medium` unit	14	Boolean-expression evaluator (incl. counting rules), rule loader, cross-rule dedup, proton balance.
`gapsmith-cli` integration	6 + 1	CBOR↔JSON round-trip end-to-end via the binary; SLURM-shard parser.
Total	~170

Run the full suite:

cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings

6. LOC breakdown

Crate	`src/` LOC	`tests/` LOC
`gapsmith-core`	~1 000	—
`gapsmith-io`	~330	—
`gapsmith-db`	~1 600	—
`gapsmith-sbml`	~870	~220
`gapsmith-align`	~1 250	~560
`gapsmith-find`	~2 700	~380
`gapsmith-transport`	~1 040	~115
`gapsmith-draft`	~1 670	~80
`gapsmith-fill`	~2 150	~400
`gapsmith-medium`	~550	—
`gapsmith-cli`	~2 400	~60
Total	~17 000	~2 100

gapsmith porting notes

Intentional and unintentional deviations from the upstream R/bash gapseq codebase. Each entry says what differs, why, and whether downstream output parity holds.

This document is the authoritative list of "known non-R behavior". If you find something in gapsmith that differs from real gapseq and it's not listed here, please file an issue — we treat those as bugs.

1. Storage format

RDS → CBOR. R's saveRDS(mod, ...) is replaced by ciborium-encoded CBOR at <id>.gmod.cbor. SBML is emitted alongside as the portable interchange format. An Rscript-driven CBOR↔RDS bridge can be added later if downstream R consumers need it.
Metabolite IDs: cpd00001_c0 instead of cpd00001[c0]. SBML SIds must match [A-Za-z_][A-Za-z0-9_]* — brackets break that. The SBML writer derives the species id from met.id + met.compartment and only presents <cpd>[comp] to the user via the name attribute.
Model IDs are sanitized to valid SBML SIds: -, ., :, space → _. Original id preserved in the name attribute if different.

2. Alignment layer

mmseqs2 invocation. gapseq's R pipeline calls explicit mmseqs createdb + search + convertalis; we replicate this 4-command flow exactly rather than the simpler easy-search, because the latter uses alignment-mode 3 (full-alignment identity) while gapseq calibrates against the k-mer prefilter identity that search reports.
mmseqs2 qseqid normalization. We post-process mmseqs's qheader output to the first whitespace-delimited token, matching what blastp/diamond emit natively. gapseq does this via a sed call in the shell script (gapseq_find.sh at the tail of each mmseqs block).
Diamond --more-sensitive is on by default, matching gapseq's aliArgs="--more-sensitive" default for diamond.
E-value formatting in TSV output: BLAST's exact 2-decimal scientific notation (2.52e-41). Initial wrapper used {:.3e} which emitted 2.520e-41; fixed to match.

3. find pipeline (M5)

noSuperpathways=true is always the default. Matches gapseq's gapseq_find.sh:20. Pass -n / --include-superpathways to disable.
-p keyword shorthand. -p amino|nucl|cofactor|carbo|polyamine|fatty|energy|terpenoid|degradation|core|min|kegg|all resolves to the same hierarchy-category name gapseq uses. For -p <literal> we fall back to \b(<literal>)\b regex against id / name / altname / hierarchy — same as gapseq's default stype branch.
Pathway table merge semantics. -l metacyc,custom is default (matching gapseq); on ID collision, the custom_pwy.tbl row wins. This is how several "GS"-suffixed pathways (e.g. ARGSYN-PWY) have their superpathway=TRUE flag shadowed by the custom row with superpathway=FALSE, so they survive the filter.
user/ reference fastas always resolve against the built-in <gapseq>/dat/seq/<tax>/user/, even when -D / --seq-dir overrides the rest. Matches prepare_batch_alignments.R:217.
Reaction-name MD5 fallback. Uses md5sum-compatible hex digest of the raw bytes (no trailing newline), same as uniprot.sh:155.
Header filter uses substring, not word-boundary, match. This is a deliberate match of gapseq's shell behavior (grep -Fivf without -w). Has the known quirk of accidentally matching 4.A.2.1.1 against a line containing 4.A.2.1.11. Reproducing the quirk is required for byte-identical output.

4. find-transport pipeline (M6)

Substrate case is preserved from subex.tbl (sub=Potassium, not sub=potassium). Matches gapseq.
Alt-transporter reaction assignment kicks in when a TC-id hit exists but no SEED reaction matches the exact TC type. Same rule as analyse_alignments_transport.R:110-130. Disable with -a.

5. draft pipeline (M7) — known gaps

EC/TC conflict resolution (prepare_candidate_reaction_tables.R resolve_common_EC_conflicts / resolve_common_TC_conflicts, ~130 lines of IRanges overlap math) is not yet ported. Affects <1% of reactions; the mitigations are listed downstream in the draft — a gene with two EC annotations stays attached to both reactions rather than being resolved.
MIRIAM cross-ref annotations. SBML writer emits the ModelSEED id only. KEGG / BiGG / MetaNetX / HMDB / ChEBI cross-refs are available in addReactAttr.R + addMetAttr.R but not threaded through yet. Round-trip load in COBRApy works; MIRIAM-dependent downstream tools may need to re-query gapseq's dat/mnxref_*.tsv.
gram_by_network.R (predict gram-staining by metabolic-network similarity against reference genomes) is not ported. The CLI requires --biomass pos|neg|archaea or reads the gram= field from the Reactions.tbl header (which find writes when gapseq has HMM-predicted the gram).
Biomass rescaling: the menaquinone-8 auto-removal rule (generate_GSdraft.R:281-292) is not ported. Gram+/Gram- biomass retains menaquinone-8 regardless of whether the MENAQUINONESYN-PWY / PWY-5852 / PWY-5837 pathways are present. Impact is small (a single metabolite in the biomass reaction) but affects anaerobic-growth predictions.

6. Not yet implemented

gapsmith medium — rule-based medium inference. Planned for M10.
gapsmith fill — 4-phase gap-filling driver. Planned for M9 (FBA/pFBA primitives already shipped in M8 under gapsmith-fill).
gapsmith adapt — add/remove reactions on an existing model. Planned for M10.
gapsmith pan — pan-model union. Planned for M10.
gapsmith doall — chain find→find-transport→draft→medium→fill. Planned for M10.
gapsmith update-sequences — seqdb download from upstream. Planned for M10.
HMM-based predict_domain / predict_gramstaining. Planned alongside M10.

6a. fill (M8–M9) — LP primitives + single-iteration gap-fill (shipped)

Split-flux encoding: each reaction contributes vp_r, vn_r ≥ 0 with net v_r = vp_r − vn_r. |v_r| = vp_r + vn_r is linear, which keeps pFBA a pure LP. The bound translation is documented in crates/gapsmith-fill/src/lp.rs — [lb, ub] on the net flux becomes a per-variable ub pair (max(ub,0), max(−lb,0)), with additional lower bounds when the direction is strictly forced.
FBA (gapsmith-fill::fba) — maximise / minimise Σ c_r · (vp_r − vn_r) s.t. S · (vp − vn) = 0. HiGHS backend via good_lp 1.15.
pFBA (gapsmith-fill::pfba) — minimise Σ w_r·(vp_r+vn_r) − pfba_coef · Σ c_r·(vp_r−vn_r) subject to v_bio ≥ min_growth. Matches cobrar's pfbaHeuristic.
pFBA-heuristic ladder (gapsmith-fill::pfba_heuristic) — port of gapfill4.R:95–137. Retries up to 15 times, halving the feasibility tolerance then the pFBA coefficient on solver failure; validates each candidate solution by running FBA on the reduced model after zero-flux reactions are removed.
gapsmith fba subcommand — FBA or --pfba on a CBOR/JSON model; prints solver status, objective, biomass flux, and the top-N |flux| reactions. Useful for sanity-checking a draft model before fill is applied.
Candidate pool (pool.rs::build_full_model) — clone draft, append every approved/corrected SEED reaction not already present, deduped by stoichiometric hash with core-preference on ties. Port of gapfill4.R:12–56.
Medium application (medium.rs::apply_medium) — close all EX_* lower bounds then re-open per medium CSV; adds missing EX reactions for compounds not yet in the model. Matches constrain.model.R.
Single-iteration gap-fill (gapfill.rs::gapfill4) — pFBA-heuristic on the full model with bitscore-derived weights, extract utilized candidates, add to draft, run KO essentiality loop. Port of gapfill4.R:1–303.
gapsmith fill subcommand — end-to-end draft → filled pipeline. Ecoli on MM_glu.csv: 20 reactions added, final growth 0.53.

Known divergences / outstanding work

Biomass link-multi parsing — biomass.json supports "link": "cpd01997:-1|cpd03422:-1" (pipe-separated coupled metabolites). The M7 biomass parser only read the first link. M9 fixed this (BiomassComponent::links()). Without the fix, every biomass reaction was missing two cofactor metabolites, pegging growth to zero. Re-run draft on any model built before M9 — the old CBOR has a structurally broken biomass.
SBML species-id double suffix — the M7 species_id helper blindly appended _<comp> to the cpd id. When gapsmith stores metabolites as cpd00001_c0 (the actual shipped convention), the SBML emitted M_cpd00001_c0_c0. COBRApy treated these as distinct mets, so SEED reactions and biomass didn't link. Fixed in M9: idempotent when the id already ends with the compartment suffix.
Objective must be EX_<target>_c0 sink, not bio1 directly — gf.suite.R:239–242 zeros every obj_coef and then calls add_met_sink(mod, target.met, obj=1) to attach the objective to a freshly-created sink reaction. Without it, steady-state balance pins the biomass flux to zero because the biomass pseudo-metabolite has no consumer. The CLI wires this up automatically.
4-phase suite (Steps 2 / 2b / 3 / 4) shipped in the M9 follow-up. Default CLI runs Steps 1 + 2 + 2b; --full-suite adds 3 + 4. Steps 3 and 4 iterate ~200 exchange compounds each with a gapfill4 call — budget 10–30 minutes on a whole-bacterium genome.
Futile-cycle prune shipped (detect_futile_cycles) but opt-in via --prune-futile. Pairwise LP probe on large candidate pools (~8k rxns) takes ~40 minutes even with rayon parallelism; use selectively.
Candidate-dedup fix (draft-stage latent bug caught during M9 follow-up): my build_candidates accumulator copied is_complex / complex_status / pathway_status only from the highest-bitscore row. That row may have empty status fields (gapseq fills those only on rows where complex detection ran), causing reactions like rxn00011 (bs=1797, pyruvate decarboxylase) to be rejected in the runner's complex filter. Fixed: OR is_complex, max-across-rows for complex_status, highest-rank pathway_status. Re-run gapseq draft on any model built before this fix.
CBC fallback wired behind --features cbc. Path: pfba_heuristic retries once with good_lp::solvers::coin_cbc after HiGHS exhausts its tolerance/coef ladder. Requires coinor-libcbc-dev + coinor-libclp-dev on the build host.

7. Unimplemented but probably won't be

The R helpers that depend on pythoncyc (meta2pwy.py, meta2genes.py, meta2rea.py) are one-off MetaCyc DB updaters run ≤ twice per year by maintainers. They stay in Python; no Rust port planned.
CPLEX solver support — plan stated "no CPLEX path"; gapsmith-fill will use HiGHS via good_lp with CBC/Clp as backups.

8. New features not in gapseq

BatchClusterAligner (M4.5) — cluster N genomes' proteomes with mmseqs2, align once against the reference query FASTA, expand per-genome hit sets. Amortizes alignment cost over many genomes with shared proteins. Not in upstream gapseq.
Precomputed alignment input — --aligner precomputed -P <tsv> on find / find-transport skips the alignment step entirely. User must pre-run blastp/diamond/mmseqs2 with the expected 8-column layout.

9. Output format differences

Pathways.tbl completeness precision. Ported to f64 to match R's default 15-significant-digit output (66.6666666666667).
Reactions.tbl column order matches gapseq exactly after M5 fixes.
Transporter.tbl matches gapseq exactly after M6 fixes, modulo the cosmetic sub column case when a substrate has both capitalized and lowercased SUBkey entries (gapseq picks the alphabetically-first post-dedup; we pick insertion-order-first).
evalue formatting matches BLAST's -outfmt 6 conventions: 0 for exact zero, 2-decimal scientific notation below 1e-3.

10. Testing surface

R-vs-Rust parity tests run the actual R implementation via Rscript where feasible. This lives in three places:
- crates/gapsmith-find/tests/complex_parity.rs drives tools/r_complex_detection.R which sources the real src/complex_detection.R.
- crates/gapsmith-find/tests/pipeline_parity.rs runs real gapsmith find and gapsmith find side-by-side with a synthesized minimal seqdb.
- crates/gapsmith-transport/tests/parity.rs does the same for find-transport.
SBML validation runs python-libsbml + cobra via a uv-managed venv at tools/.sbml-validate/. Not part of cargo test (no libsbml on crates.io) but a manual gate via tools/validate_sbml.py.

Performance

gapsmith shipped with a series of optimizations relative to its initial Rust port. This page catalogues what's in, where wins actually show up, and the methodology behind the numbers.

Shipped optimizations

All items below are on by default unless marked (opt-in) — those are library-level knobs exposed as struct fields, not CLI flags yet.

C1 — SEED stoich-hash cache at load time

crates/gapsmith-db/src/seed.rs. SeedRxnRow gains a #[serde(skip)] stoich_hash: Option<String> field populated once during load_seed_reactions (sequential, ~150 ms for 35k rows). Gap-fill and draft dedup paths read the cached hash instead of re-parsing + re-sorting the stoichiometry on every access.

C3 — O(1) reaction/metabolite lookups in `apply_medium`

crates/gapsmith-fill/src/medium.rs. Replaces rxns.iter().find(..) per medium entry with a one-shot HashMap<rxn_id, index>. Same for the metabolite-index lookup in the new-exchange fallback path. O(k·n) → O(n + k) on a 500-rxn draft with 20 medium entries.

A2 — FASTA header cache per `find` run

crates/gapsmith-find/src/runner.rs. Multiple reactions frequently resolve to the same reference FASTA (two reactions sharing an EC both land on rev/<EC>.fasta). A HashMap<PathBuf, Arc<Vec<(String, String)>>> populated lazily during the unique-reaction loop avoids redundant file reads + header parses.

E1 — Structural blocked-reaction pre-pass in futile-cycle prune

crates/gapsmith-fill/src/futile.rs::structural_blocked_rxns. Iterative topological dead-reaction detection on the closed-boundary model: reactions whose metabolites have no producer or no consumer (in any alive direction) can't carry flux, so they can't participate in cycles. Drops 30–70% of the LP-probed candidate pool before any LP runs. O(iters × nnz), typically 3–8 iterations. Runs inside the existing --prune-futile flag.

B2 — Skip KO probes for zero-flux reactions

crates/gapsmith-fill/src/gapfill.rs. Any non-core candidate carrying zero flux in the post-fill max-biomass FBA optimum is provably non-essential: the constraint x[r] = 0 leaves the current optimum feasible, so the KO probe would succeed deterministically. Skip the LP probe; drop the reaction directly. Cuts ~20–40% of KO-loop LP solves on typical gap-fills.

B1 (opt-in) — HiGHS hot-start via `with_initial_solution`

crates/gapsmith-fill/src/fba.rs. FbaOptions gains hot_start: Option<Vec<f64>>. When Some, the caller supplies a previous net-flux vector; fba() decomposes it into split-flux (vp, vn) = (max(v,0), max(-v,0)) seeds and threads them through good_lp's with_initial_solution. HiGHS warm-starts the simplex from the seed basis. Can nudge HiGHS toward a different alternative optimum, so off by default for byte-parity.

B3 (opt-in) — pFBA ladder memo across calls

crates/gapsmith-fill/src/pfba.rs. PfbaHeuristicOptions gains ladder_memo: Option<Arc<Mutex<Option<(f64, f64)>>>>. When Some, the ladder seeds from the last successful (tolerance, coefficient) instead of the defaults and writes back on success. Clamped to [min_tol, start_tol] to prevent a stale memo from pushing the solver outside the user's configured range. Same byte-parity caveat as B1. Off by default.

Benchmark results

Hardware: server ws, 64 cores, 128 GB RAM, NVMe SSD, Debian. Aligner: DIAMOND 2.1.9 (-A diamond -K 8). Input: E. coli K-12 MG1655 (GCF_000005845.2, 4300 proteins). Reference DB: upstream gapseq's dat/ at commit-time HEAD. Comparison: commit a42d172 (pre-Phase-1 baseline) vs 474a1e7 (all Phase 1 / 2 / 3 defaults shipped).

`gapsmith find -p all`

Metric	Phase-0	Phase-3	Δ
Wall	18.36 s	17.96 s	−2.2%
User	115.8 s	114.1 s	−1.5%
Peak RSS	279 MB	286 MB	+2.5%

Pathways.tbl byte-identical. Reactions.tbl varies run-to-run on the same binary — DIAMOND parallel search with -K > 1 has inherent non-determinism at the per-hit level. Not a regression; not in our code.

`gapsmith doall` (full default pipeline)

Metric	Phase-0	Phase-3	Δ
Wall	77.6 s	79.2 s	+2.1%
User	244 s	245 s	+0.4%
Peak RSS	276 MB	297 MB	+7.7%

Semantic parity (via COBRApy):

2776 reactions, 2193 metabolites (both)
Biomass FBA growth: 0.7422696154309283 (identical to 16 decimals)
Reaction Jaccard: 1.0000
Metabolite Jaccard: 1.0000

`gapsmith fill --full-suite` (4-phase, deep KO loops)

Metric	Phase-0	Phase-3	Δ
Wall	11:51.36	11:51.55	+0.03%
User	1209 s	1217 s	+0.7%
Peak RSS	107 MB	123 MB	+15.5%

Semantic parity:

2859 reactions, 2207 metabolites (both)
Biomass FBA growth: 0.7433163537454 (identical to 13 decimals)
Reaction Jaccard: 1.0000
Metabolite Jaccard: 1.0000

Added reactions differ only in step attribution (one reaction rxn12933_c0 added in Step 1 by P0 vs Step 2 by P3 — B2's zero-flux skip dropped it from Step 1's add-list because its flux at the biomass optimum was zero; Step 2 needed it and re-added it). Final model is identical.

`gapsmith fill --full-suite --prune-futile` (E1 shines)

Both binaries hit a pre-existing pfba_heuristic: max iterations failure on this workload (unrelated to Phase 1–3 changes). Timing to the failure point:

Metric	Phase-0	Phase-3	Δ
Wall	5:56	4:20	−27%
User	18963 s	13728 s	−28%
Peak RSS	1354 MB	1322 MB	−2.4%

E1 cuts the time needed to exhaust the LP-prune pool substantially. When --prune-futile actually completes on a well-conditioned input, this speedup translates directly.

Honest assessment

The measured wins on the default path are small:

DIAMOND dominates find / doall wall time (~60 s of 80 s total).
fill (Step 1 default) runs few LPs; B2 saves fractions of a second.
C1 / C3 savings are milliseconds, below noise.
A2 cache adds ~15 MB RSS because per-FASTA reads were already buffered.

Real wins show up in:

--prune-futile enabled (E1: ~25–30% wall reduction)
Deep KO loops in full-suite runs with large candidate pools (B2)
Batch runs reusing ladder_memo across pFBA calls (B3, opt-in)
KO-loop-dominated fills with hot_start configured (B1, opt-in)

Semantic parity is the firm guarantee: identical reaction sets, identical biomass to machine precision, regardless of which optimization path fires.

Scaling to many genomes

The numbers above are all single-genome. For 100+ genomes, the dominant cost moves out of FBA and into per-genome alignment. Gapsmith's multi-genome path routes around that entirely:

--gspa-run consumes a pre-clustered, pre-aligned gspa manifest so a 1000-genome collection pays for one cross-genome alignment, not 1000. Aligner calls inside find / find-transport become map-lookups.
doall-batch --jobs N runs doall across genomes with rayon; on a 32-core box, speedup is ~min(N, 32) vs. looping doall sequentially.
doall-batch --shard i/N makes SLURM array jobs the unit of parallelism. At 1 024 tasks of 16 cores each, throughput is ~1–2 s of wallclock per genome including I/O — limited by per-task doall-internal work, not by coordination.

Community modes are small LPs (≤ 1000 organisms in per-mag, which is linear) or a single big LP (cfba). cfba is feasible up to ~50 organisms before HiGHS becomes the bottleneck; for larger communities prefer per-mag.

See Multi-genome workflows for recipes.

Deferred — where the big wins still are

B1-full — HiGHS basis warm-start across KO probes (`highs-sys` FFI)

The currently-shipped B1 uses good_lp's with_initial_solution for a simplex hot-start guess, but each solve still constructs a fresh highs::Model. True warm-start needs a persistent highs::Model across the KO loop, mutating bounds via Highs_changeColBounds and re-solving in place. The highs crate 2.0 doesn't expose that API — requires ~200 LOC of highs-sys FFI. Estimated 30–50% wall reduction on KO-dominated fills.

Parallel KO probe

The current KO loop is serial because each commit affects the next iteration's baseline. A valid parallelisation: probe every candidate concurrently against the same post-fill model (all probes are independent at that snapshot), collect a set of provably-non-essential candidates, then re-probe the survivors serially. Bounded to a single pass, correctness-preserving. Estimated 4–8× on multi-core for the Step 4 fermentation-product loop.

Step-4 restructuring

Step 4 probes ~100 fermentation products one at a time, each doing a full gap-fill cycle. Amortising the shared preprocessing across probes (same starting model, same weights) would cut ~5 min off a full-suite run.

Reproducing

# 1. Clone + build both commits
git clone https://github.com/bio-ontology-research-group/gapsmith.git
cd gapsmith
git worktree add -f /tmp/gapsmith-p0 a42d172
cd /tmp/gapsmith-p0 && cargo build --release --bin gapsmith
cd - && cargo build --release --bin gapsmith

# 2. Get reference data
gapsmith update-data -o /path/to/dat
gapsmith --data-dir /path/to/dat update-sequences -t Bacteria

# 3. Run both binaries, compare
for bin in /tmp/gapsmith-p0/target/release/gapsmith ./target/release/gapsmith; do
  out=$(mktemp -d)
  /usr/bin/time -v "$bin" --data-dir /path/to/dat doall \
    -A diamond -K 8 -f "$out" genome.faa.gz 2>&1 | tee "$out.time"
done

# 4. Semantic diff in COBRApy
uv run --with cobra python tools/bench/diff_models.py \
  out1/genome-filled.xml out2/genome-filled.xml

tools/bench/run_bench.sh bundles the above end-to-end.

Comparison with upstream gapseq

gapsmith is a Rust reimplementation of gapseq (R/bash, ~9k LOC). This document covers performance benchmarks and feature differences.

Benchmarks

Wall-clock comparison on four real bacterial proteomes. Hardware: 56-core Xeon, 128 GB RAM, Debian 13, NVMe SSD. Same aligner binary (NCBI blastp 2.17.0 via bioconda). Same reference sequence database (Zenodo v1.4, record 16908828).

Test genomes

Organism	Accession	Proteins
Candidatus Blochmannia floridanus	GCF_000007725.1	517
Bacillus subtilis 168	GCF_000009045.1	4,237
Escherichia coli K-12 MG1655	GCF_000005845.2	4,300
Salmonella enterica Typhimurium LT2	GCF_000006945.2	4,554

Results

Genome	Stage	gapseq (R)	gapsmith	Speedup
B. floridanus (517)	`find -p all`	117 s	34 s	3.5×
B. floridanus (517)	`find-transport`	9 s	8 s	1.2×
B. subtilis (4,237)	`find -p all`	205 s	73 s	2.8×
B. subtilis (4,237)	`find-transport`	25 s	14 s	1.8×
E. coli (4,300)	`find -p all`	218 s	76 s	2.9×
E. coli (4,300)	`find-transport`	27 s	14 s	1.9×
S. Typhimurium (4,554)	`find -p all`	211 s	76 s	2.8×
S. Typhimurium (4,554)	`find-transport`	27 s	14 s	1.9×

gapsmith also uses 35–40% less peak memory (e.g. 498 MB vs 786 MB for E. coli find).

Stages without R baseline

The draft, medium, and fill stages could not be benchmarked against upstream R gapseq because the cobrar R package (which replaced the archived sybil) failed to install on the benchmark host due to a libsbml/libxml2 ABI conflict in the conda R 4.5 environment. This is a packaging issue on the test rig, not a limitation of either tool.

gapsmith timings for these stages (E. coli K-12):

Stage	Wall-time	Peak RSS
`draft`	0.6 s	152 MB
`medium`	0.1 s	47 MB
`fill` (Steps 1 + 2 + 2b)	52 s	103 MB

These stages are fast in gapsmith because the LP solver (HiGHS) runs in-process via good_lp, rather than shelling out through R's cobrar wrapper to an external GLPK/CPLEX binary.

Reproducing the benchmarks

# Requires: blastp, Rscript (with data.table, stringr, Biostrings),
# and gapsmith binary on PATH.
bash tools/bench/run_bench.sh <genomes_dir> <results_dir>
python3 tools/bench/aggregate.py <results_dir>

Feature comparison

Feature	gapseq (R)	gapsmith
`find` (pathway detection)	✅	✅ byte-identical output
`find-transport`	✅	✅ row-count parity
`draft` (model assembly)	✅	✅ 0 libSBML errors
`medium` (rule-based inference)	✅	✅ byte-identical output
`fill` (4-phase gap-filling)	✅	✅ Steps 1–4
`adapt` (add/remove reactions)	✅	✅ (EC/KEGG resolution deferred)
`pan` (pan-draft union)	✅	✅
`doall` (end-to-end pipeline)	✅	✅
`update-sequences` (Zenodo sync)	✅	✅
EC/TC conflict resolution	✅	❌ (affects <1% of reactions)
MIRIAM cross-ref annotations	✅	❌ (SBML loads without them)
HMM-based taxonomy prediction	✅	❌ (use `--taxonomy` flag)
Precomputed alignment input	❌	✅ `--aligner precomputed`
Batch-cluster alignment	❌	✅ `batch-align`
In-process LP solver	❌ (external GLPK)	✅ (HiGHS, bundled)
FBA/pFBA subcommand	❌	✅ `fba`
Native CBOR model format	❌ (RDS)	✅ (replaces RDS)
Single static binary	❌	✅

Key differences

Solver

gapseq uses R's cobrar package (or the older sybil) which wraps GLPK or CPLEX. gapsmith uses HiGHS via good_lp, statically linked at build time. No runtime solver dependency.

Model format

gapseq stores models as R .RDS files. gapsmith uses CBOR (.gmod.cbor) as its native format — compact, fast to load, language-agnostic. Both tools emit SBML for interchange.

Alignment

Both tools shell out to the same external aligner binaries (blastp, diamond, mmseqs2). gapsmith additionally supports --aligner precomputed (skip the aligner, read a user-supplied TSV) and batch-align (cluster N genomes with mmseqs2, align once, expand per-genome).

Known deferred items

See docs/porting-notes.md for the full list. The main gaps:

EC/TC conflict resolution — affects <1% of multi-EC-annotated genes.
MIRIAM cross-refs — SBML emits ModelSEED id only; COBRApy round-trip still works.
HMM-based taxonomy — CLI requires --taxonomy Bacteria|Archaea instead of auto-detecting.
adapt EC/KEGG resolution — use direct SEED reaction ids or MetaCyc pathway ids.