A simple C library for compressing lists of integers using binary packing and SIMD instructions. The assumption is either that you have a list of 32-bit integers where most of them are small, or a list of 32-bit integers where differences between successive integers are small. No software is able to reliably compress an array of 32-bit random numbers.
This library can decode at least 4 billions of compressed integers per second on most desktop or laptop processors. That is, it can decompress data at a rate of 15 GB/s. This is significantly faster than generic codecs like gzip, LZO, Snappy or LZ4.
On a Skylake Intel processor, it can decode integers at a rate 0.3 cycles per integer, which can easily translate into more than 8 decoded billions integers per second.
It runs on both x86/x64 (SSE/AVX) and 64-bit ARM (NEON, e.g. Apple Silicon). See Platforms below.
This library is part of the Awesome C list of C resources.
Contributors: Daniel Lemire, Nathan Kurz, Christoph Rupp, Anatol Belski, Nick White and others
This is a low-level library for fast integer compression. By design it does not define a compressed format. It is up to the (sophisticated) user to create a compressed format.
It is used by:
- On x86/x64: your processor should support SSE4.1 (supported by most Intel and AMD processors released since 2008). The core bit-packing functions only require SSE2 (Pentium4 or better).
- On ARM: an AArch64/ARM processor with NEON (e.g. Apple Silicon). The SSE intrinsics are mapped to NEON by our own self-contained shim (
include/neon128.h); no third-party translation library is pulled in. - A C99 (or better) compiler, plus a C++17 compiler if you build the benchmarks.
- CMake 3.14 or better.
For a plain C version that does not use SIMD instructions, see https://github.com/lemire/LittleIntPacker
The library supports two SIMD backends behind the same API:
- x86 / x64 — Intel/AMD SSE (with optional AVX2 and AVX-512 code paths,
enabled automatically when you build with
-march=nativeon a capable host). - 64-bit ARM (AArch64) with NEON, such as Apple Silicon. The SSE intrinsics
used by the 128-bit kernels are mapped onto ARM NEON by a small, self-contained
shim that ships with the library (
include/neon128.h). This is our own code written directly against<arm_neon.h>; no third-party translation layer (such as sse2neon) is pulled in. The wider AVX2/AVX-512 paths are x86-only and are simply inactive on ARM.
The public API is identical on both: it is selected automatically at compile
time, so the same source (including the __m128i-based entry points) builds on
either architecture.
Compression works over blocks of 128 integers.
For a complete working example, see example/example.c (after building, run it with "./build/example").
- Lists of integers in random order.
const uint32_t b = maxbits(datain);// computes bit width
simdpackwithoutmask(datain, buffer, b);//compressed to buffer, compressing 128 32-bit integers down to b*32 bytes
simdunpack(buffer, backbuffer, b);//uncompressed to backbufferWhile 128 32-bit integers are read, only b 128-bit words are written. Thus, the compression ratio is 32/b.
- Sorted lists of integers.
We used differential coding: we store the difference between successive integers. For this purpose, we need an initial value (called offset).
uint32_t offset = 0;
uint32_t b1 = simdmaxbitsd1(offset,datain); // bit width
simdpackwithoutmaskd1(offset, datain, buffer, b1);//compressing 128 32-bit integers down to b1*32 bytes
simdunpackd1(offset, buffer, backbuffer, b1);//uncompressedGeneral example for arrays of arbitrary length:
int compress_decompress_demo() {
size_t k, N = 9999;
__m128i * endofbuf;
uint32_t * datain = malloc(N * sizeof(uint32_t));
uint8_t * buffer;
uint32_t * backbuffer = malloc(N * sizeof(uint32_t));
uint32_t b;
for (k = 0; k < N; ++k){ /* start with k=0, not k=1! */
datain[k] = k;
}
b = maxbits_length(datain, N);
buffer = malloc(simdpack_compressedbytes(N,b)); // allocate just enough memory
endofbuf = simdpack_length(datain, N, (__m128i *)buffer, b);
/* compressed data is stored between buffer and endofbuf using (endofbuf-buffer)*sizeof(__m128i) bytes */
/* would be safe to do : buffer = realloc(buffer,(endofbuf-(__m128i *)buffer)*sizeof(__m128i)); */
simdunpack_length((const __m128i *)buffer, N, backbuffer, b);
for (k = 0; k < N; ++k){
if(datain[k] != backbuffer[k]) {
printf("bug\n");
return -1;
}
}
return 0;
}- Frame-of-Reference
We also have frame-of-reference (FOR) functions (see simdfor.h header). They work like the bit packing routines, but do not use differential coding so they allow faster search in some cases, at the expense of compression.
The library builds with CMake:
cmake -B build
cmake --build build
ctest --test-dir build # run the unit tests
To install the library, headers, and a CMake package configuration:
cmake --install build --prefix /your/install/prefix
Once installed, another CMake project can use it with:
find_package(simdcomp REQUIRED)
target_link_libraries(myapp PRIVATE simdcomp::simdcomp)Or vendor it directly via FetchContent (no install needed):
include(FetchContent)
FetchContent_Declare(simdcomp
GIT_REPOSITORY https://github.com/lemire/simdcomp.git
GIT_TAG master)
FetchContent_MakeAvailable(simdcomp)
target_link_libraries(myapp PRIVATE simdcomp::simdcomp)Useful options: -DSIMDCOMP_BUILD_BENCHMARKS=OFF (the benchmarks pull in
lemire/counters for cycle-accurate timing
and require C++17), -DSIMDCOMP_BUILD_TESTS=OFF, and -DSIMDCOMP_NATIVE=OFF
(disable -march=native).
The benchmark reports CPU cycles per integer using hardware performance counters when they are available (otherwise it falls back to wall-clock time and throughput):
./build/bitpackingbenchmark # on Apple Silicon/Linux, run with sudo for cycle counts
If you are a go user, there is a "go" folder where you will find a simple demo.
- Fast integer compression in Go: https://github.com/ronanh/intcomp
- Fast Bitpacking algorithms: Rust port of simdcomp https://github.com/quickwit-oss/bitpacking
- SIMDCompressionAndIntersection: A C++ library to compress and intersect sorted lists of integers using SIMD instructions https://github.com/lemire/SIMDCompressionAndIntersection
- The FastPFOR C++ library : Fast integer compression https://github.com/lemire/FastPFor
- High-performance dictionary coding https://github.com/lemire/dictionary
- LittleIntPacker: C library to pack and unpack short arrays of integers as fast as possible https://github.com/lemire/LittleIntPacker
- StreamVByte: Fast integer compression in C using the StreamVByte codec https://github.com/lemire/streamvbyte
- MaskedVByte: Fast decoder for VByte-compressed integers https://github.com/lemire/MaskedVByte
- CSharpFastPFOR: A C# integer compression library https://github.com/Genbox/CSharpFastPFOR
- JavaFastPFOR: A java integer compression library https://github.com/lemire/JavaFastPFOR
- Encoding: Integer Compression Libraries for Go https://github.com/zhenjl/encoding
- FrameOfReference is a C++ library dedicated to frame-of-reference (FOR) compression: https://github.com/lemire/FrameOfReference
- libvbyte: A fast implementation for varbyte 32bit/64bit integer compression https://github.com/cruppstahl/libvbyte
- TurboPFor is a C library that offers lots of interesting optimizations. Well worth checking! (GPL license) https://github.com/powturbo/TurboPFor
- Oroch is a C++ library that offers a usable API (MIT license) https://github.com/ademakov/Oroch
- Daniel Lemire, Nathan Kurz, Christoph Rupp, Stream VByte: Faster Byte-Oriented Integer Compression, Information Processing Letters, Information Processing Letters 130, February 2018, Pages 1-6https://arxiv.org/abs/1709.08990
- Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, Steven Swanson, An Experimental Study of Bitmap Compression vs. Inverted List Compression, SIGMOD 2017 http://db.ucsd.edu/wp-content/uploads/2017/03/sidm338-wangA.pdf
- P. Damme, D. Habich, J. Hildebrandt, W. Lehner, Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses), EDBT 2017 http://openproceedings.org/2017/conf/edbt/paper-146.pdf
- P. Damme, D. Habich, J. Hildebrandt, W. Lehner, Insights into the Comparative Evaluation of Lightweight Data Compression Algorithms, EDBT 2017 http://openproceedings.org/2017/conf/edbt/paper-414.pdf
- Daniel Lemire, Leonid Boytsov, Nathan Kurz, SIMD Compression and the Intersection of Sorted Integers, Software Practice & Experience 46 (6) 2016. http://arxiv.org/abs/1401.6399
- Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137 http://onlinelibrary.wiley.com/doi/10.1002/spe.2203/abstract
- Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015. http://arxiv.org/abs/1503.07387
- Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen, A General SIMD-based Approach to Accelerating Compression Algorithms, ACM Transactions on Information Systems 33 (3), 2015. http://arxiv.org/abs/1502.01916
- T. D. Wu, Bitpacking techniques for indexing genomes: I. Hash tables, Algorithms for Molecular Biology 11 (5), 2016. http://almob.biomedcentral.com/articles/10.1186/s13015-016-0069-5