64-bit build to allow 4.3B+ sequences#1111
Merged
Merged
Conversation
…edsqdb, etc.). Add 64-bit build to CI.
…ild memory Prefilter split-local ids are reconstructed to global keys via + dbFrom, so they only need to span one memory-bounded split. Keep IndexEntryLocal::seqId and CounterResult::id as 32-bit (instead of widening to DBKeyType/DBLocalId), and guarantee each TARGET split holds < 2^32 sequences via a cap in Prefiltering::setupSplit plus a guard in IndexBuilder::fillDatabase. This keeps the prefilter index and counting bins at their 32-bit footprint in the 64-bit build (query-db split is rejected for >= 2^32 targets). Also fix unconditional unsigned int -> size_t widenings that inflated the DEFAULT (32-bit) build: id2local/local2id, AlignmentSymmetry tmpSize, DBReader sortedIndices, ClusteringAlgorithms clusterid_to_arrayposition and setsize_abundance now use DBLocalId (4 bytes by default, 8 under the flag). Genuine offsets (borders_of_set, contigSizes) stay size_t. assignedCluster (align2clust) is deliberately left size_t: narrowing it is semantically identical but exposes a latent timing-dependent data race in align2clust (concurrent reads/writes of assignedCluster) that makes cluster counts nondeterministic. That race should be fixed separately before narrowing. Validated: default and -DMMSEQS_INT64_IDS=1 builds both pass the regression suite 41/41, deterministically.
TmpResult is CounterResult's sibling temp buffer (tmpElementBuffer[binSize]) and holds the same split-local id, so it should match: narrow TmpResult.id from DBLocalId back to unsigned int. No effect on the default build (DBLocalId is uint32 there); in the 64-bit build this keeps tmpElementBuffer at 6-byte entries instead of 10, consistent with IndexEntryLocal/CounterResult. int64 build passes regression 41/41.
Drop util/compare_mmseqs_linclust.py and util/.gitignore (added in f122bd3); the .gitignore only existed to ignore that script's __pycache__.
be7d92c to
5e48f3d
Compare
martin-steinegger
added a commit
to bbuschkaemper/MMseqs2
that referenced
this pull request
Jun 19, 2026
clustersizes uses a -1 deleted-sentinel so it must stay signed. soedinglab#1111 widened it from int to int64_t unconditionally, doubling this O(dbSize) array in the default build. Use DBLocalIdSigned (int32_t default, int64_t under MMSEQS_INT64_IDS): 4 bytes/entry in the default build (matches pre-1111) and 8 bytes in the 64-bit build (correct for dbSize > 2^31). No logic change; signedness preserved. cluster-version-1 regression tests pass in both builds.
clustersizes uses a -1 deleted-sentinel, so it must stay signed. soedinglab#1111 widened it from int to int64_t unconditionally, doubling this O(dbSize) array in the default build. Gate it with an inline #ifdef: int32_t by default (4 bytes/entry, matching pre-1111) and int64_t under MMSEQS_INT64_IDS (correct for dbSize > 2^31). No new type alias; signedness preserved, so the -1 sentinel and all comparisons are unchanged. cluster-version-1 regression tests pass in both builds.
c1d9448 to
1deddd4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds a build variable
MMSEQS_INT64_IDS=1that replaces 32-bit with 64-bit ints to handle more than 4.3B+ sequences.With this build variable, MMseqs2 remains backwards-compatible and can use 32-bits so that existing behavior remains unchanged.
64-bit MMseqs2 databases are not compatible with 32-bit MMseqs2 databases. A 32/64-bit metadata information on any MMseqs2 db files might be a useful addition to make sure users don't mix up versions.
What changed
Central
DBKeyTypeAdded
DBKeyTypedatatype, giving 32/64bit ints depending on the build variable. Every previous 32-bit int variable that is limiting the number of possible sequences now usesDBKeyTypeas datatype.Removed stale duplicate source file
Deleted
src/alignment/Align2clust copy.cpp, not included in any build.Fixed race-condition in align2clust
During a test linclust clustering of 7B sequences, align2clust got stuck and build up massive iowait without progress.
Before the fix, worker threads worked like this:
ClusterResultclusterMutexclusterResult.sequenceIdx == currentProcessPositionclusterResultQueueclusterCondition.notify_one()The clustering thread can only consume results in
sequenceIdxorder, but withschedule(dynamic, 1), workers can finished ahead of current position, resulting in the queue filling up unbounded with out-of-order results.The old code now decides to notify the cluster thread by comparing against
currentProcessPositionbefore pushing, then notifying after unlocking. Therefore, notification is based on transient state. It was possible forcurrentProcessPositionto change between checking and pushing. This "check under lock, act after unlock" can cause hang-up in timing-sensitive behavior.It is fixed with a helper function `pushAlign2clustClusterResult" that centralizes queue insertion:
clusterMutexcurrentProcessPositionWe also added logging with align2clust progress state, since the align2clust can take up significant amount of total linclust time without any feedback.
Added CI coverage for the 64-bit build
Added an additional GitHub Actions build step for
-DMMSEQS_INT64_IDS=1.Important decisions
No compatibility metadata was added
This PR intentionally does not add DB-format compatibility metadata or cross-build compatibility markers.
The assumption is that users are responsible for using the correct build consistently and for not mixing 32-bit and 64-bit builds or artifacts.
It might be useful to add this metadata to any db files later on, however I will leave this design decision up to the maintainers.
Validation
mmseqstarget for both:-DMMSEQS_INT64_IDS=1--
Since this PR is massive and I'm not well familiar with this codebase, feedback would be greatly appreciated.
Related to #1100 #1039