Comparison with upstream gapseq

gapsmith is a Rust reimplementation of gapseq (R/bash, ~9k LOC). This document covers performance benchmarks and feature differences.

Benchmarks

Wall-clock comparison on four real bacterial proteomes. Hardware: 56-core Xeon, 128 GB RAM, Debian 13, NVMe SSD. Same aligner binary (NCBI blastp 2.17.0 via bioconda). Same reference sequence database (Zenodo v1.4, record 16908828).

Test genomes

OrganismAccessionProteins
Candidatus Blochmannia floridanusGCF_000007725.1517
Bacillus subtilis 168GCF_000009045.14,237
Escherichia coli K-12 MG1655GCF_000005845.24,300
Salmonella enterica Typhimurium LT2GCF_000006945.24,554

Results

GenomeStagegapseq (R)gapsmithSpeedup
B. floridanus (517)find -p all117 s34 s3.5×
B. floridanus (517)find-transport9 s8 s1.2×
B. subtilis (4,237)find -p all205 s73 s2.8×
B. subtilis (4,237)find-transport25 s14 s1.8×
E. coli (4,300)find -p all218 s76 s2.9×
E. coli (4,300)find-transport27 s14 s1.9×
S. Typhimurium (4,554)find -p all211 s76 s2.8×
S. Typhimurium (4,554)find-transport27 s14 s1.9×

gapsmith also uses 35–40% less peak memory (e.g. 498 MB vs 786 MB for E. coli find).

Stages without R baseline

The draft, medium, and fill stages could not be benchmarked against upstream R gapseq because the cobrar R package (which replaced the archived sybil) failed to install on the benchmark host due to a libsbml/libxml2 ABI conflict in the conda R 4.5 environment. This is a packaging issue on the test rig, not a limitation of either tool.

gapsmith timings for these stages (E. coli K-12):

StageWall-timePeak RSS
draft0.6 s152 MB
medium0.1 s47 MB
fill (Steps 1 + 2 + 2b)52 s103 MB

These stages are fast in gapsmith because the LP solver (HiGHS) runs in-process via good_lp, rather than shelling out through R's cobrar wrapper to an external GLPK/CPLEX binary.

Reproducing the benchmarks

# Requires: blastp, Rscript (with data.table, stringr, Biostrings),
# and gapsmith binary on PATH.
bash tools/bench/run_bench.sh <genomes_dir> <results_dir>
python3 tools/bench/aggregate.py <results_dir>

Feature comparison

Featuregapseq (R)gapsmith
find (pathway detection)✅ byte-identical output
find-transport✅ row-count parity
draft (model assembly)✅ 0 libSBML errors
medium (rule-based inference)✅ byte-identical output
fill (4-phase gap-filling)✅ Steps 1–4
adapt (add/remove reactions)✅ (EC/KEGG resolution deferred)
pan (pan-draft union)
doall (end-to-end pipeline)
update-sequences (Zenodo sync)
EC/TC conflict resolution❌ (affects <1% of reactions)
MIRIAM cross-ref annotations❌ (SBML loads without them)
HMM-based taxonomy prediction❌ (use --taxonomy flag)
Precomputed alignment input--aligner precomputed
Batch-cluster alignmentbatch-align
In-process LP solver❌ (external GLPK)✅ (HiGHS, bundled)
FBA/pFBA subcommandfba
Native CBOR model format❌ (RDS)✅ (replaces RDS)
Single static binary

Key differences

Solver

gapseq uses R's cobrar package (or the older sybil) which wraps GLPK or CPLEX. gapsmith uses HiGHS via good_lp, statically linked at build time. No runtime solver dependency.

Model format

gapseq stores models as R .RDS files. gapsmith uses CBOR (.gmod.cbor) as its native format — compact, fast to load, language-agnostic. Both tools emit SBML for interchange.

Alignment

Both tools shell out to the same external aligner binaries (blastp, diamond, mmseqs2). gapsmith additionally supports --aligner precomputed (skip the aligner, read a user-supplied TSV) and batch-align (cluster N genomes with mmseqs2, align once, expand per-genome).

Known deferred items

See docs/porting-notes.md for the full list. The main gaps:

  • EC/TC conflict resolution — affects <1% of multi-EC-annotated genes.
  • MIRIAM cross-refs — SBML emits ModelSEED id only; COBRApy round-trip still works.
  • HMM-based taxonomy — CLI requires --taxonomy Bacteria|Archaea instead of auto-detecting.
  • adapt EC/KEGG resolution — use direct SEED reaction ids or MetaCyc pathway ids.