gapsmith porting notes
Intentional and unintentional deviations from the upstream R/bash gapseq codebase. Each entry says what differs, why, and whether downstream output parity holds.
This document is the authoritative list of "known non-R behavior". If you find something in gapsmith that differs from real gapseq and it's not listed here, please file an issue — we treat those as bugs.
1. Storage format
- RDS → CBOR. R's
saveRDS(mod, ...)is replaced byciborium-encoded CBOR at<id>.gmod.cbor. SBML is emitted alongside as the portable interchange format. An Rscript-driven CBOR↔RDS bridge can be added later if downstream R consumers need it. - Metabolite IDs:
cpd00001_c0instead ofcpd00001[c0]. SBML SIds must match[A-Za-z_][A-Za-z0-9_]*— brackets break that. The SBML writer derives the species id frommet.id + met.compartmentand only presents<cpd>[comp]to the user via thenameattribute. - Model IDs are sanitized to valid SBML SIds:
-,.,:, space →_. Original id preserved in thenameattribute if different.
2. Alignment layer
- mmseqs2 invocation. gapseq's R pipeline calls explicit
mmseqs createdb+search+convertalis; we replicate this 4-command flow exactly rather than the simplereasy-search, because the latter uses alignment-mode 3 (full-alignment identity) while gapseq calibrates against the k-mer prefilter identity thatsearchreports. - mmseqs2
qseqidnormalization. We post-process mmseqs'sqheaderoutput to the first whitespace-delimited token, matching what blastp/diamond emit natively. gapseq does this via a sed call in the shell script (gapseq_find.shat the tail of each mmseqs block). - Diamond
--more-sensitiveis on by default, matching gapseq'saliArgs="--more-sensitive"default for diamond. - E-value formatting in TSV output: BLAST's exact 2-decimal
scientific notation (
2.52e-41). Initial wrapper used{:.3e}which emitted2.520e-41; fixed to match.
3. find pipeline (M5)
noSuperpathways=trueis always the default. Matches gapseq'sgapseq_find.sh:20. Pass-n/--include-superpathwaysto disable.-pkeyword shorthand.-p amino|nucl|cofactor|carbo|polyamine|fatty|energy|terpenoid|degradation|core|min|kegg|allresolves to the same hierarchy-category name gapseq uses. For-p <literal>we fall back to\b(<literal>)\bregex against id / name / altname / hierarchy — same as gapseq's defaultstypebranch.- Pathway table merge semantics.
-l metacyc,customis default (matching gapseq); on ID collision, thecustom_pwy.tblrow wins. This is how several "GS"-suffixed pathways (e.g. ARGSYN-PWY) have theirsuperpathway=TRUEflag shadowed by the custom row withsuperpathway=FALSE, so they survive the filter. user/reference fastas always resolve against the built-in<gapseq>/dat/seq/<tax>/user/, even when-D/--seq-diroverrides the rest. Matchesprepare_batch_alignments.R:217.- Reaction-name MD5 fallback. Uses
md5sum-compatible hex digest of the raw bytes (no trailing newline), same asuniprot.sh:155. - Header filter uses substring, not word-boundary, match. This is a
deliberate match of gapseq's shell behavior (
grep -Fivfwithout-w). Has the known quirk of accidentally matching4.A.2.1.1against a line containing4.A.2.1.11. Reproducing the quirk is required for byte-identical output.
4. find-transport pipeline (M6)
- Substrate case is preserved from subex.tbl (
sub=Potassium, notsub=potassium). Matches gapseq. - Alt-transporter reaction assignment kicks in when a TC-id hit
exists but no SEED reaction matches the exact TC type. Same rule as
analyse_alignments_transport.R:110-130. Disable with-a.
5. draft pipeline (M7) — known gaps
- EC/TC conflict resolution (
prepare_candidate_reaction_tables.Rresolve_common_EC_conflicts/resolve_common_TC_conflicts, ~130 lines of IRanges overlap math) is not yet ported. Affects <1% of reactions; the mitigations are listed downstream in the draft — a gene with two EC annotations stays attached to both reactions rather than being resolved. - MIRIAM cross-ref annotations. SBML writer emits the ModelSEED id
only. KEGG / BiGG / MetaNetX / HMDB / ChEBI cross-refs are available
in
addReactAttr.R+addMetAttr.Rbut not threaded through yet. Round-trip load in COBRApy works; MIRIAM-dependent downstream tools may need to re-query gapseq'sdat/mnxref_*.tsv. gram_by_network.R(predict gram-staining by metabolic-network similarity against reference genomes) is not ported. The CLI requires--biomass pos|neg|archaeaor reads thegram=field from the Reactions.tbl header (which find writes when gapseq has HMM-predicted the gram).- Biomass rescaling: the menaquinone-8 auto-removal rule
(
generate_GSdraft.R:281-292) is not ported. Gram+/Gram- biomass retains menaquinone-8 regardless of whether the MENAQUINONESYN-PWY / PWY-5852 / PWY-5837 pathways are present. Impact is small (a single metabolite in the biomass reaction) but affects anaerobic-growth predictions.
6. Not yet implemented
gapsmith medium— rule-based medium inference. Planned for M10.gapsmith fill— 4-phase gap-filling driver. Planned for M9 (FBA/pFBA primitives already shipped in M8 undergapsmith-fill).gapsmith adapt— add/remove reactions on an existing model. Planned for M10.gapsmith pan— pan-model union. Planned for M10.gapsmith doall— chain find→find-transport→draft→medium→fill. Planned for M10.gapsmith update-sequences— seqdb download from upstream. Planned for M10.- HMM-based
predict_domain/predict_gramstaining. Planned alongside M10.
6a. fill (M8–M9) — LP primitives + single-iteration gap-fill (shipped)
- Split-flux encoding: each reaction contributes
vp_r, vn_r ≥ 0with netv_r = vp_r − vn_r.|v_r| = vp_r + vn_ris linear, which keeps pFBA a pure LP. The bound translation is documented incrates/gapsmith-fill/src/lp.rs—[lb, ub]on the net flux becomes a per-variable ub pair(max(ub,0), max(−lb,0)), with additional lower bounds when the direction is strictly forced. - FBA (
gapsmith-fill::fba) — maximise / minimiseΣ c_r · (vp_r − vn_r)s.t.S · (vp − vn) = 0. HiGHS backend viagood_lp1.15. - pFBA (
gapsmith-fill::pfba) —minimise Σ w_r·(vp_r+vn_r) − pfba_coef · Σ c_r·(vp_r−vn_r)subject tov_bio ≥ min_growth. Matches cobrar'spfbaHeuristic. - pFBA-heuristic ladder (
gapsmith-fill::pfba_heuristic) — port ofgapfill4.R:95–137. Retries up to 15 times, halving the feasibility tolerance then the pFBA coefficient on solver failure; validates each candidate solution by running FBA on the reduced model after zero-flux reactions are removed. gapsmith fbasubcommand — FBA or--pfbaon a CBOR/JSON model; prints solver status, objective, biomass flux, and the top-N |flux| reactions. Useful for sanity-checking a draft model beforefillis applied.- Candidate pool (
pool.rs::build_full_model) — clone draft, append every approved/corrected SEED reaction not already present, deduped by stoichiometric hash with core-preference on ties. Port ofgapfill4.R:12–56. - Medium application (
medium.rs::apply_medium) — close allEX_*lower bounds then re-open per medium CSV; adds missing EX reactions for compounds not yet in the model. Matchesconstrain.model.R. - Single-iteration gap-fill (
gapfill.rs::gapfill4) — pFBA-heuristic on the full model with bitscore-derived weights, extract utilized candidates, add to draft, run KO essentiality loop. Port ofgapfill4.R:1–303. gapsmith fillsubcommand — end-to-enddraft → filledpipeline. Ecoli onMM_glu.csv: 20 reactions added, final growth 0.53.
Known divergences / outstanding work
- Biomass link-multi parsing —
biomass.jsonsupports"link": "cpd01997:-1|cpd03422:-1"(pipe-separated coupled metabolites). The M7 biomass parser only read the first link. M9 fixed this (BiomassComponent::links()). Without the fix, every biomass reaction was missing two cofactor metabolites, pegging growth to zero. Re-rundrafton any model built before M9 — the old CBOR has a structurally broken biomass. - SBML species-id double suffix — the M7
species_idhelper blindly appended_<comp>to the cpd id. When gapsmith stores metabolites ascpd00001_c0(the actual shipped convention), the SBML emittedM_cpd00001_c0_c0. COBRApy treated these as distinct mets, so SEED reactions and biomass didn't link. Fixed in M9: idempotent when the id already ends with the compartment suffix. - Objective must be
EX_<target>_c0sink, notbio1directly —gf.suite.R:239–242zeros everyobj_coefand then callsadd_met_sink(mod, target.met, obj=1)to attach the objective to a freshly-created sink reaction. Without it, steady-state balance pins the biomass flux to zero because the biomass pseudo-metabolite has no consumer. The CLI wires this up automatically. - 4-phase suite (Steps 2 / 2b / 3 / 4) shipped in the M9 follow-up.
Default CLI runs Steps 1 + 2 + 2b;
--full-suiteadds 3 + 4. Steps 3 and 4 iterate ~200 exchange compounds each with a gapfill4 call — budget 10–30 minutes on a whole-bacterium genome. - Futile-cycle prune shipped (
detect_futile_cycles) but opt-in via--prune-futile. Pairwise LP probe on large candidate pools (~8k rxns) takes ~40 minutes even with rayon parallelism; use selectively. - Candidate-dedup fix (draft-stage latent bug caught during M9
follow-up): my build_candidates accumulator copied
is_complex/complex_status/pathway_statusonly from the highest-bitscore row. That row may have empty status fields (gapseq fills those only on rows where complex detection ran), causing reactions likerxn00011(bs=1797, pyruvate decarboxylase) to be rejected in the runner's complex filter. Fixed: ORis_complex, max-across-rows forcomplex_status, highest-rankpathway_status. Re-rungapseq drafton any model built before this fix. - CBC fallback wired behind
--features cbc. Path:pfba_heuristicretries once withgood_lp::solvers::coin_cbcafter HiGHS exhausts its tolerance/coef ladder. Requirescoinor-libcbc-dev+coinor-libclp-devon the build host.
7. Unimplemented but probably won't be
- The R helpers that depend on
pythoncyc(meta2pwy.py,meta2genes.py,meta2rea.py) are one-off MetaCyc DB updaters run ≤ twice per year by maintainers. They stay in Python; no Rust port planned. - CPLEX solver support — plan stated "no CPLEX path"; gapsmith-fill will
use HiGHS via
good_lpwith CBC/Clp as backups.
8. New features not in gapseq
BatchClusterAligner(M4.5) — cluster N genomes' proteomes with mmseqs2, align once against the reference query FASTA, expand per-genome hit sets. Amortizes alignment cost over many genomes with shared proteins. Not in upstream gapseq.- Precomputed alignment input —
--aligner precomputed -P <tsv>onfind/find-transportskips the alignment step entirely. User must pre-run blastp/diamond/mmseqs2 with the expected 8-column layout.
9. Output format differences
- Pathways.tbl completeness precision. Ported to
f64to match R's default 15-significant-digit output (66.6666666666667). - Reactions.tbl column order matches gapseq exactly after M5 fixes.
- Transporter.tbl matches gapseq exactly after M6 fixes, modulo the
cosmetic
subcolumn case when a substrate has both capitalized and lowercased SUBkey entries (gapseq picks the alphabetically-first post-dedup; we pick insertion-order-first). evalueformatting matches BLAST's-outfmt 6conventions:0for exact zero, 2-decimal scientific notation below 1e-3.
10. Testing surface
- R-vs-Rust parity tests run the actual R implementation via
Rscriptwhere feasible. This lives in three places:crates/gapsmith-find/tests/complex_parity.rsdrivestools/r_complex_detection.Rwhich sources the realsrc/complex_detection.R.crates/gapsmith-find/tests/pipeline_parity.rsruns realgapsmith findand gapsmithfindside-by-side with a synthesized minimal seqdb.crates/gapsmith-transport/tests/parity.rsdoes the same forfind-transport.
- SBML validation runs
python-libsbml+cobravia a uv-managed venv attools/.sbml-validate/. Not part ofcargo test(no libsbml on crates.io) but a manual gate viatools/validate_sbml.py.