Skip to content

64-bit build to allow 4.3B+ sequences#1111

Merged
martin-steinegger merged 13 commits into
soedinglab:masterfrom
bbuschkaemper:master
Jun 19, 2026
Merged

64-bit build to allow 4.3B+ sequences#1111
martin-steinegger merged 13 commits into
soedinglab:masterfrom
bbuschkaemper:master

Conversation

@bbuschkaemper

Copy link
Copy Markdown
Contributor

Summary

This PR adds a build variableMMSEQS_INT64_IDS=1 that replaces 32-bit with 64-bit ints to handle more than 4.3B+ sequences.

With this build variable, MMseqs2 remains backwards-compatible and can use 32-bits so that existing behavior remains unchanged.

64-bit MMseqs2 databases are not compatible with 32-bit MMseqs2 databases. A 32/64-bit metadata information on any MMseqs2 db files might be a useful addition to make sure users don't mix up versions.

What changed

Central DBKeyType

Added DBKeyType datatype, giving 32/64bit ints depending on the build variable. Every previous 32-bit int variable that is limiting the number of possible sequences now uses DBKeyType as datatype.

Removed stale duplicate source file

Deleted src/alignment/Align2clust copy.cpp, not included in any build.

Fixed race-condition in align2clust

During a test linclust clustering of 7B sequences, align2clust got stuck and build up massive iowait without progress.

Before the fix, worker threads worked like this:

  1. Compute one ClusterResult
  2. Lock clusterMutex
  3. Check whether clusterResult.sequenceIdx == currentProcessPosition
  4. Push into clusterResultQueue
  5. Unlock
  6. If it looked like "next expected" result, call clusterCondition.notify_one()

The clustering thread can only consume results in sequenceIdx order, but with schedule(dynamic, 1), workers can finished ahead of current position, resulting in the queue filling up unbounded with out-of-order results.

The old code now decides to notify the cluster thread by comparing against currentProcessPosition before pushing, then notifying after unlocking. Therefore, notification is based on transient state. It was possible for currentProcessPosition to change between checking and pushing. This "check under lock, act after unlock" can cause hang-up in timing-sensitive behavior.

It is fixed with a helper function `pushAlign2clustClusterResult" that centralizes queue insertion:

  1. Lock clusterMutex
  2. If result is not next needed one and queue is above new size threshold, wait
  3. Recompute whether this result matches currentProcessPosition
  4. Push result
  5. Unlock
  6. Notify cluster thread if it was expected item.

We also added logging with align2clust progress state, since the align2clust can take up significant amount of total linclust time without any feedback.

Added CI coverage for the 64-bit build

Added an additional GitHub Actions build step for -DMMSEQS_INT64_IDS=1.

Important decisions

No compatibility metadata was added

This PR intentionally does not add DB-format compatibility metadata or cross-build compatibility markers.
The assumption is that users are responsible for using the correct build consistently and for not mixing 32-bit and 64-bit builds or artifacts.

It might be useful to add this metadata to any db files later on, however I will leave this design decision up to the maintainers.

Validation

  1. Verified with CMake builds of the mmseqs target for both:
  • default build
  • -DMMSEQS_INT64_IDS=1
  1. Ran a successful linclust clustering of 7B sequences (90% seqid, 80% cov) on a 4TB memory VM in 6 days.
  2. Double-checked git diff with GPT 5.5

--

Since this PR is massive and I'm not well familiar with this codebase, feedback would be greatly appreciated.

Related to #1100 #1039

bbuschkaemper and others added 4 commits June 10, 2026 14:22
…ild memory

Prefilter split-local ids are reconstructed to global keys via + dbFrom, so they
only need to span one memory-bounded split. Keep IndexEntryLocal::seqId and
CounterResult::id as 32-bit (instead of widening to DBKeyType/DBLocalId), and
guarantee each TARGET split holds < 2^32 sequences via a cap in
Prefiltering::setupSplit plus a guard in IndexBuilder::fillDatabase. This keeps
the prefilter index and counting bins at their 32-bit footprint in the 64-bit
build (query-db split is rejected for >= 2^32 targets).

Also fix unconditional unsigned int -> size_t widenings that inflated the
DEFAULT (32-bit) build: id2local/local2id, AlignmentSymmetry tmpSize, DBReader
sortedIndices, ClusteringAlgorithms clusterid_to_arrayposition and
setsize_abundance now use DBLocalId (4 bytes by default, 8 under the flag).
Genuine offsets (borders_of_set, contigSizes) stay size_t.

assignedCluster (align2clust) is deliberately left size_t: narrowing it is
semantically identical but exposes a latent timing-dependent data race in
align2clust (concurrent reads/writes of assignedCluster) that makes cluster
counts nondeterministic. That race should be fixed separately before narrowing.

Validated: default and -DMMSEQS_INT64_IDS=1 builds both pass the regression
suite 41/41, deterministically.
TmpResult is CounterResult's sibling temp buffer (tmpElementBuffer[binSize]) and
holds the same split-local id, so it should match: narrow TmpResult.id from
DBLocalId back to unsigned int. No effect on the default build (DBLocalId is
uint32 there); in the 64-bit build this keeps tmpElementBuffer at 6-byte entries
instead of 10, consistent with IndexEntryLocal/CounterResult.

int64 build passes regression 41/41.
Drop util/compare_mmseqs_linclust.py and util/.gitignore (added in f122bd3);
the .gitignore only existed to ignore that script's __pycache__.
martin-steinegger added a commit to bbuschkaemper/MMseqs2 that referenced this pull request Jun 19, 2026
clustersizes uses a -1 deleted-sentinel so it must stay signed. soedinglab#1111 widened it
from int to int64_t unconditionally, doubling this O(dbSize) array in the default
build. Use DBLocalIdSigned (int32_t default, int64_t under MMSEQS_INT64_IDS): 4
bytes/entry in the default build (matches pre-1111) and 8 bytes in the 64-bit
build (correct for dbSize > 2^31). No logic change; signedness preserved.

cluster-version-1 regression tests pass in both builds.
clustersizes uses a -1 deleted-sentinel, so it must stay signed. soedinglab#1111 widened it
from int to int64_t unconditionally, doubling this O(dbSize) array in the default
build. Gate it with an inline #ifdef: int32_t by default (4 bytes/entry, matching
pre-1111) and int64_t under MMSEQS_INT64_IDS (correct for dbSize > 2^31). No new
type alias; signedness preserved, so the -1 sentinel and all comparisons are
unchanged.

cluster-version-1 regression tests pass in both builds.
@martin-steinegger martin-steinegger marked this pull request as ready for review June 19, 2026 18:43
@martin-steinegger martin-steinegger merged commit 3a6d9a5 into soedinglab:master Jun 19, 2026
0 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants