64-bit build to allow 4.3B+ sequences by bbuschkaemper · Pull Request #1111 · soedinglab/MMseqs2

bbuschkaemper · 2026-06-01T10:51:45Z

Summary

This PR adds a build variable, MMSEQS_INT64_IDS=1, that replaces 32-bit ints with 64-bit ints so that MMseqs2 can handle more than 4.3B sequences.

Because the change is gated behind this build variable, MMseqs2 remains backwards-compatible: the default build still uses 32-bit ints, so existing behavior is unchanged.

Databases written by a 64-bit MMseqs2 build are not compatible with databases written by a 32-bit build. Storing 32/64-bit metadata in MMseqs2 db files might be a useful addition to make sure users don't mix up versions.

What changed

Central `DBKeyType`

Added a DBKeyType datatype that resolves to a 32-bit or 64-bit int depending on the build variable. Every 32-bit int variable that previously limited the number of possible sequences now uses DBKeyType as its datatype.
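As a rough illustration of the idea, a build-flag-dependent key type could look like the sketch below. This is not the code from the PR: the exact underlying types and the helper `fitsInKey` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch only: the real definition in the PR may differ.
// MMSEQS_INT64_IDS is the build variable named in this PR.
#ifdef MMSEQS_INT64_IDS
typedef uint64_t DBKeyType;   // 64-bit keys, more than 4.3B sequences
#else
typedef uint32_t DBKeyType;   // default build: 32-bit keys, as before
#endif

// Hypothetical helper (not part of MMseqs2): checks whether a given
// number of sequences can be indexed with the active key type.
inline bool fitsInKey(uint64_t numSequences) {
    return numSequences <= std::numeric_limits<DBKeyType>::max();
}
```

In a default build (flag not set), `fitsInKey(7000000000ULL)` would return false, which is the situation the 64-bit build is meant to address.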

Removed stale duplicate source file

Deleted src/alignment/Align2clust copy.cpp, which was not included in any build.

Fixed race-condition in align2clust

During a test linclust clustering of 7B sequences, align2clust got stuck and built up massive iowait without making progress.

Before the fix, worker threads worked like this:

Compute one ClusterResult
Lock clusterMutex
Check whether clusterResult.sequenceIdx == currentProcessPosition
Push into clusterResultQueue
Unlock
If it looked like the "next expected" result, call clusterCondition.notify_one()

The clustering thread can only consume results in sequenceIdx order, but with schedule(dynamic, 1), workers can finish ahead of the current position, so the queue fills up without bound with out-of-order results.

The old code decided whether to notify the cluster thread by comparing against currentProcessPosition before pushing, and then notified after unlocking. The notification was therefore based on transient state: it was possible for currentProcessPosition to change between checking and pushing. This "check under lock, act after unlock" pattern can cause timing-sensitive hangs.

It is fixed with a helper function `pushAlign2clustClusterResult` that centralizes queue insertion:

Lock clusterMutex
If the result is not the next one needed and the queue is above the new size threshold, wait
Recompute whether this result matches currentProcessPosition
Push the result
Unlock
Notify the cluster thread if it was the expected item
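The steps above can be sketched with standard C++ primitives. This is a minimal illustration under stated assumptions, not the PR's implementation: the `OrderedQueue` struct, `spaceCondition`, `maxQueued`, `pushResult`, `popNext`, and `runDemo` are hypothetical names, and MMseqs2 itself uses OpenMP worker threads rather than `std::thread`. Only `clusterMutex`, `clusterCondition`, `clusterResultQueue`, `currentProcessPosition`, and `sequenceIdx` come from the description.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct ClusterResult { size_t sequenceIdx; };

struct ByIdx {  // min-heap order: smallest sequenceIdx on top
    bool operator()(const ClusterResult &a, const ClusterResult &b) const {
        return a.sequenceIdx > b.sequenceIdx;
    }
};

struct OrderedQueue {
    std::mutex clusterMutex;
    std::condition_variable clusterCondition;  // wakes the consumer
    std::condition_variable spaceCondition;    // wakes waiting producers (assumed)
    std::priority_queue<ClusterResult, std::vector<ClusterResult>, ByIdx> clusterResultQueue;
    size_t currentProcessPosition = 0;
    size_t maxQueued = 4;                      // assumed size threshold

    void pushResult(const ClusterResult &r) {
        bool isNext;
        {
            std::unique_lock<std::mutex> lock(clusterMutex);
            // Out-of-order results wait while the queue is full; the next
            // expected result is always admitted, so progress is guaranteed.
            spaceCondition.wait(lock, [&] {
                return r.sequenceIdx == currentProcessPosition ||
                       clusterResultQueue.size() < maxQueued;
            });
            // Recompute under the lock, immediately before pushing.
            isNext = (r.sequenceIdx == currentProcessPosition);
            clusterResultQueue.push(r);
        }
        if (isNext) clusterCondition.notify_one();
    }

    ClusterResult popNext() {
        std::unique_lock<std::mutex> lock(clusterMutex);
        clusterCondition.wait(lock, [&] {
            return !clusterResultQueue.empty() &&
                   clusterResultQueue.top().sequenceIdx == currentProcessPosition;
        });
        ClusterResult r = clusterResultQueue.top();
        clusterResultQueue.pop();
        ++currentProcessPosition;
        lock.unlock();
        spaceCondition.notify_all();  // position moved: let producers re-check
        return r;
    }
};

// Workers take indices in increasing order (similar to dynamic scheduling);
// the calling thread consumes results strictly in sequenceIdx order.
std::vector<size_t> runDemo(size_t nWorkers, size_t nItems) {
    OrderedQueue q;
    std::atomic<size_t> nextIdx(0);
    std::vector<std::thread> workers;
    for (size_t w = 0; w < nWorkers; ++w) {
        workers.emplace_back([&] {
            for (size_t i = nextIdx.fetch_add(1); i < nItems; i = nextIdx.fetch_add(1)) {
                q.pushResult(ClusterResult{i});
            }
        });
    }
    std::vector<size_t> consumed;
    for (size_t i = 0; i < nItems; ++i) consumed.push_back(q.popNext().sequenceIdx);
    for (std::thread &t : workers) t.join();
    return consumed;
}
```

Because the predicate is evaluated while holding the lock and the consumer re-checks its own predicate on every wake-up, a notification sent after unlocking cannot be lost in this sketch, and the queue never holds more than `maxQueued` out-of-order results plus the one expected result.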

We also added logging of the align2clust progress state, since align2clust can take up a significant share of the total linclust time without giving any feedback.

Added CI coverage for the 64-bit build

Added an additional GitHub Actions build step for -DMMSEQS_INT64_IDS=1.

Important decisions

No compatibility metadata was added

This PR intentionally does not add DB-format compatibility metadata or cross-build compatibility markers.
The assumption is that users are responsible for using the correct build consistently and for not mixing 32-bit and 64-bit builds or artifacts.

It might be useful to add this metadata to db files later on; however, I will leave this design decision to the maintainers.

Validation

Verified with CMake builds of the mmseqs target for both:

default build
-DMMSEQS_INT64_IDS=1

Ran a successful linclust clustering of 7B sequences (90% seqid, 80% cov) on a 4TB memory VM in 6 days.
Double-checked git diff with GPT 5.5

--

Since this PR is massive and I'm not very familiar with this codebase, feedback would be greatly appreciated.

Related to #1100 #1039

…ild memory

Prefilter split-local ids are reconstructed to global keys via + dbFrom, so they only need to span one memory-bounded split. Keep IndexEntryLocal::seqId and CounterResult::id as 32-bit (instead of widening to DBKeyType/DBLocalId), and guarantee each TARGET split holds < 2^32 sequences via a cap in Prefiltering::setupSplit plus a guard in IndexBuilder::fillDatabase. This keeps the prefilter index and counting bins at their 32-bit footprint in the 64-bit build (query-db split is rejected for >= 2^32 targets).

Also fix unconditional unsigned int -> size_t widenings that inflated the DEFAULT (32-bit) build: id2local/local2id, AlignmentSymmetry tmpSize, DBReader sortedIndices, ClusteringAlgorithms clusterid_to_arrayposition and setsize_abundance now use DBLocalId (4 bytes by default, 8 under the flag). Genuine offsets (borders_of_set, contigSizes) stay size_t.

assignedCluster (align2clust) is deliberately left size_t: narrowing it is semantically identical but exposes a latent timing-dependent data race in align2clust (concurrent reads/writes of assignedCluster) that makes cluster counts nondeterministic. That race should be fixed separately before narrowing.

Validated: default and -DMMSEQS_INT64_IDS=1 builds both pass the regression suite 41/41, deterministically.
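The split-local id scheme described in this commit message can be illustrated with a small sketch. The function names `minTargetSplits` and `toGlobalKey` and the constant `MAX_SEQS_PER_SPLIT` are hypothetical and are not taken from Prefiltering::setupSplit or IndexBuilder::fillDatabase; the sketch only shows the arithmetic of keeping each split below 2^32 sequences and rebuilding global keys by adding dbFrom.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cap: each target split must hold fewer than 2^32 sequences,
// so that a split-local id always fits in 32 bits.
static const uint64_t MAX_SEQS_PER_SPLIT = UINT32_MAX;  // 2^32 - 1

// Hypothetical helper: smallest number of target splits that keeps
// every split under the cap.
inline uint64_t minTargetSplits(uint64_t dbSize) {
    if (dbSize == 0) {
        return 1;
    }
    return dbSize / MAX_SEQS_PER_SPLIT + (dbSize % MAX_SEQS_PER_SPLIT != 0 ? 1 : 0);
}

// Hypothetical helper: reconstruct a global key from a 32-bit
// split-local id and the split's start offset (dbFrom).
inline uint64_t toGlobalKey(uint32_t localId, uint64_t dbFrom) {
    return dbFrom + static_cast<uint64_t>(localId);
}
```

With 7B target sequences, this sketch would require at least two splits, and a local id of 5 in a split starting at 4294967296 maps back to global key 4294967301.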

TmpResult is CounterResult's sibling temp buffer (tmpElementBuffer[binSize]) and holds the same split-local id, so it should match: narrow TmpResult.id from DBLocalId back to unsigned int. No effect on the default build (DBLocalId is uint32 there); in the 64-bit build this keeps tmpElementBuffer at 6-byte entries instead of 10, consistent with IndexEntryLocal/CounterResult. int64 build passes regression 41/41.

Drop util/compare_mmseqs_linclust.py and util/.gitignore (added in f122bd3); the .gitignore only existed to ignore that script's __pycache__.

clustersizes uses a -1 deleted-sentinel so it must stay signed. soedinglab#1111 widened it from int to int64_t unconditionally, doubling this O(dbSize) array in the default build. Use DBLocalIdSigned (int32_t default, int64_t under MMSEQS_INT64_IDS): 4 bytes/entry in the default build (matches pre-1111) and 8 bytes in the 64-bit build (correct for dbSize > 2^31). No logic change; signedness preserved. cluster-version-1 regression tests pass in both builds.

clustersizes uses a -1 deleted-sentinel, so it must stay signed. soedinglab#1111 widened it from int to int64_t unconditionally, doubling this O(dbSize) array in the default build. Gate it with an inline #ifdef: int32_t by default (4 bytes/entry, matching pre-1111) and int64_t under MMSEQS_INT64_IDS (correct for dbSize > 2^31). No new type alias; signedness preserved, so the -1 sentinel and all comparisons are unchanged. cluster-version-1 regression tests pass in both builds.
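To show why the signedness matters, here is a minimal sketch of an inline `#ifdef` gate around a signed array that uses -1 as a deleted marker. The function `countLiveClusters` and its parameters are hypothetical and only loosely modeled on the description; the actual ClusteringAlgorithms code is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical function: marks some entries as deleted with -1 and
// counts the remaining ones.
inline size_t countLiveClusters(size_t dbSize, const std::vector<size_t> &deleted) {
    // Inline gate, no new type alias: 4 bytes/entry by default,
    // 8 bytes/entry under MMSEQS_INT64_IDS.
#ifdef MMSEQS_INT64_IDS
    std::vector<int64_t> clustersizes(dbSize, 1);
#else
    std::vector<int32_t> clustersizes(dbSize, 1);
#endif
    for (size_t i = 0; i < deleted.size(); ++i) {
        clustersizes[deleted[i]] = -1;  // deleted sentinel
    }
    size_t live = 0;
    for (size_t i = 0; i < dbSize; ++i) {
        // With an unsigned element type this comparison would always be
        // true, and deleted entries would be counted as live.
        if (clustersizes[i] >= 0) {
            ++live;
        }
    }
    return live;
}
```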

bbuschkaemper added 8 commits May 3, 2026 13:17

Add build flag to support 64-bit ints for more sequences.

d53779c

Fix all remaining 32-bit ints to 64-bit.

5bf2b41

Merge branch 'soedinglab:master' into master

b19ded1

Add progress to align2clust.

73ccba0

Fix possible race-condition in align2clust.

83d0d19

Fix additional remaining 32-bit ints that might overflow.

1da8b1e

Merge branch 'soedinglab:master' into master

818b149

Fix 32-bit usage in utility files (pairaln, offsetalignment, makepadd…

c1516b5

…edsqdb, etc.). Add 64-bit build to CI.

bbuschkaemper mentioned this pull request Jun 1, 2026

How to cluster almost 6 billion protein sequences? #1100

Open

bbuschkaemper and others added 4 commits June 10, 2026 14:22

Add mmseqs linclust comparison test.

f122bd3

Remove linclust comparison test script and its .gitignore

5e48f3d

martin-steinegger force-pushed the master branch from be7d92c to 5e48f3d Compare June 19, 2026 14:40

martin-steinegger force-pushed the master branch from c1d9448 to 1deddd4 Compare June 19, 2026 17:29

martin-steinegger marked this pull request as ready for review June 19, 2026 18:43

martin-steinegger merged commit 3a6d9a5 into soedinglab:master Jun 19, 2026
0 of 13 checks passed

martin-steinegger merged 13 commits into soedinglab:master from bbuschkaemper:master
