Skip to content

[VL]:shuffle: extend TypeAwareCompress to INT128 (DECIMAL HUGEINT)#12299

Open
hhr293 wants to merge 1 commit into
apache:mainfrom
hhr293:hhr/tac-int128
Open

[VL]:shuffle: extend TypeAwareCompress to INT128 (DECIMAL HUGEINT)#12299
hhr293 wants to merge 1 commit into
apache:mainfrom
hhr293:hhr/tac-int128

Conversation

@hhr293

@hhr293 hhr293 commented Jun 15, 2026

Copy link
Copy Markdown

What changes are proposed in this pull request?

[VL] Extend TypeAwareCompress (TAC) to INT128 (DECIMAL HUGEINT).

The TAC codec from #11894 only covered INT64. DECIMAL(p>18) is stored as
128-bit HugeInt and previously fell through to LZ4, missing a structural
compression opportunity: in OLAP workloads (TPC-H prices, taxes, etc.)
DECIMAL values usually fit in INT64, so the high 64 bits are 0 and the
low 64 bits are narrow — exactly the pattern FFOR exploits.

Implementation: Split each 16B value into lo/hi uint64 sub-streams via
stride-2 gather into stack-allocated scratch buffers, then run the existing
64-bit FFOR encoder on each. Wire format reuses the 64-bit per-block
(bw, count, base) header twice — the stream is self-describing, no hi/lo
length prefix needed. hi sub-streams that are all equal to base degenerate
to just the 16B header (bw=0). Velox HUGEINT is mapped to tac::kUInt128;
shuffle writer and frame format unchanged.

How was this patch tested?

  • New unit tests in FForCodecTest.cc: 128-bit compress/decompress round-trips,
    alignment combinations, corrupted-input rejection, and tail handling.
  • Existing 64-bit tests continue to pass.
  • End-to-end validation on TPC-H SF=6000 (no regression on other queries).

Performance

Results vs LZ4 on the same int128 input:

  • Compression ratio: 0.116 (TAC) vs 0.193 (LZ4) — 40% smaller

End-to-end on TPC-H SF=6000, the wins concentrate on queries with heavy
decimal-keyed shuffles:

  • q15: shuffle size −15%, latency −8%
  • q17: shuffle size −8%, latency −3%
  • q18: shuffle size −8%, latency −3%

Was this patch authored or co-authored using generative AI tooling?

Reviewed-by: Claude claude-opus-4-7

Copilot AI review requested due to automatic review settings June 15, 2026 08:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR extends the type-aware compression (TAC) pipeline to support Velox HUGEINT / DECIMAL128 by introducing a 128-bit FFOR-based codec (implemented as two 64-bit sub-streams).

Changes:

  • Add TAC type mapping and codec support for 128-bit values (kUInt128 / Velox HUGEINT).
  • Refactor 64-bit FFOR block encode/decode into shared helpers and implement a new 128-bit codec on top.
  • Add unit tests covering 128-bit round-trips, tails, misalignment, and truncated-input handling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cpp/velox/shuffle/VeloxTypeAwareCompress.h Map Velox HUGEINT to TAC kUInt128.
cpp/core/utils/tac/ffor.hpp Add shared encode/decode helpers and implement 128-bit FFOR codec.
cpp/core/utils/tac/TypeAwareCompressCodec.h Add kUInt128 enum and document 128-bit support.
cpp/core/utils/tac/TypeAwareCompressCodec.cc Extend TAC codec dispatch for kUInt128 in compress/decompress and maxCompressedLen.
cpp/core/utils/tac/FForCodec.h Add 128-bit compress/decompress API surface.
cpp/core/utils/tac/FForCodec.cc Implement 128-bit compress/decompress wrappers and bounds.
cpp/core/tests/FForCodecTest.cc Add tests for TAC kUInt128 behavior and edge cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/core/utils/tac/ffor.hpp
auto* in = input;
auto codecId = static_cast<CodecId>(*in++);
[[maybe_unused]] auto tacType = *in++;
auto tacType = (*in++);
Comment thread cpp/core/utils/tac/ffor.hpp
Comment on lines +79 to +81
if (outputSize < maxLen) {
return arrow::Status::Invalid("FForCodec: output buffer too small for 128-bit compression.");
}
@github-actions github-actions Bot added the VELOX label Jun 15, 2026
@hhr293 hhr293 requested a review from Copilot June 15, 2026 09:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment on lines +70 to +88
FForCodec::compress128(const uint8_t* input, int64_t inputSize, uint8_t* output, int64_t outputSize) {
if (inputSize == 0) {
return 0;
}
if (inputSize % sizeof(__int128_t) != 0) {
return arrow::Status::Invalid("FForCodec: input size ", inputSize, " is not a multiple of ", sizeof(__int128_t), ".");
}

size_t numValues = inputSize / sizeof(__int128_t);
auto maxLen = static_cast<int64_t>(ffor::compress128Bound(numValues));
if (outputSize < maxLen) {
return arrow::Status::Invalid(
"FForCodec: output buffer too small for 128-bit compression (need ",
maxLen, " bytes, got ", outputSize, ").");
}

auto written = ffor::compress128(input, numValues, output);
return static_cast<int64_t>(written);
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inputSize == 0 early return is correct. When inputSize is 0,
TypeAwareCompressCodec::compress() never calls FForCodec::compress128() in the
first place — the caller checks inputLen before invoking the codec. This guard
is purely defensive (same pattern as the 64-bit compress()).

Even if it were reached, returning 0 bytes written from the codec layer signals
"nothing compressed" to the caller, which is the correct semantic for empty
input. No malformed stream is produced because no stream is produced at all.

Comment on lines +91 to +101
FForCodec::decompress128(const uint8_t* input, int64_t inputSize, uint8_t* output, int64_t outputSize) {
if (outputSize == 0) {
return 0;
}
if (outputSize % sizeof(__int128_t) != 0) {
return arrow::Status::Invalid("FForCodec: output size ", outputSize, " is not a multiple of ", sizeof(__int128_t), ".");
}

auto nDecoded = ffor::decompress128(input, inputSize, output, static_cast<size_t>(outputSize));
return static_cast<int64_t>(nDecoded);
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When outputSize == 0, the caller is saying "I expect zero values." Returning 0
decoded values is the only correct answer regardless of input content.
Validating the input stream format in this case would be wasted work — even if
the stream is corrupt, producing 0 values into a 0-sized buffer is harmless and
matches caller expectations.

The caller already knows the expected output size from shuffle metadata. If the
stream is actually corrupt, the mismatch will be caught at a higher level when
the actual data (non-zero) fails to decode correctly in subsequent calls.

Comment thread cpp/core/utils/tac/TypeAwareCompressCodec.cc
Comment thread cpp/core/utils/tac/ffor.hpp
The TAC codec from apache#11894 only covered INT64.  DECIMAL(p>18) is stored as
128-bit HugeInt and previously fell through to LZ4, missing a structural
compression opportunity: in OLAP workloads (TPC-H prices, taxes, ...)
DECIMAL values usually fit in INT64, so the high 64 bits are 0 and the
low 64 bits are narrow -- exactly the pattern FFOR exploits.

Implementation: split each 16B value into lo/hi uint64 sub-streams via
stride-2 gather into stack-allocated scratch buffers, then run the
existing 64-bit FFOR encoder on each.  Wire format reuses the 64-bit
per-block (bw, count, base) header twice -- the stream is self-
describing, no hi/lo length prefix needed.  hi sub-streams that are all
equal to base degenerate to just the 16B header (bw=0).  Velox HUGEINT
is mapped to tac::kUInt128; shuffle writer and frame format unchanged.

Results vs LZ4 on the same int128 input:
  compression ratio  0.116 (TAC)  vs  0.193 (LZ4)  -- 40% smaller

End-to-end on TPC-H SF=6000, the wins concentrate on the queries with
heavy decimal-keyed shuffles:
  q15: shuffle size -15%, latency -8%
  q17: shuffle size -8%,  latency -3%
  q18: shuffle size -8%,  latency -3%
Generated-by: Claude claude-opus-4-7

Co-authored-by: Guo Wangyang <wangyang.guo@intel.com>
Co-authored-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: huhengrui <hengrui.hu@intel.com>
@hhr293

hhr293 commented Jun 16, 2026

Copy link
Copy Markdown
Author

@marin-ma @FelixYBW Could you please help review this patch? This is a follow-up evolution to PR . While the previous PR successfully introduced TypeAwareCompress (TAC) support for int64, this patch extends the implementation to fully support int128.
Please let me know if there are any concerns or suggestions. Thanks!

ARROW_ASSIGN_OR_RAISE(auto nDecoded, FForCodec::decompress(in, dataLen, output, outputLen));
(void)nDecoded;
return inputLen;
if (codecId != CodecId::kFFor) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why narrowing this condition? Is kFFor the only tac type that will be supported?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants