Skip to content

fix(layout/dict): probe with the configured compressor instead of a hardcoded default#8406

Open
tomsanbear wants to merge 3 commits into
vortex-data:developfrom
tomsanbear:fix/dict-strategy-probe-compressor
Open

fix(layout/dict): probe with the configured compressor instead of a hardcoded default#8406
tomsanbear wants to merge 3 commits into
vortex-data:developfrom
tomsanbear:fix/dict-strategy-probe-compressor

Conversation

@tomsanbear

@tomsanbear tomsanbear commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes: #8405

DictStrategy's dict-fit probe was hardcoded to BtrBlocksCompressor::default(), ignoring the caller's configured compressor. The probe now takes stats_compressor as a parameter to DictStrategy::new, so caller scheme exclusions are honoured.

stats_compressor rather than data_compressor because the probe needs all dictionary schemes to detect eligibility. data_compressor excludes IntDictScheme.

Testing

New test dict_probe_honours_configured_compressor: asserts dict layout with the default builder, asserts no dict layout with StringDictScheme excluded.

AI use disclosure: fix authored with assistance from Claude Code.

…ardcoded default

DictStrategy decides whether to apply a dictionary layout by probe-compressing
the first chunk and checking whether the cascade chose Dict. The probe was
hardcoded to BtrBlocksCompressor::default(), ignoring the compressor configured
through WriteStrategyBuilder, so a caller's scheme choices did not influence the
dict-fit decision.

Add a probe_compressor field to DictStrategy (defaulting to
BtrBlocksCompressor::default(), leaving existing callers unchanged) with a
with_probe_compressor setter, and have WriteStrategyBuilder::build pass the
stats_compressor. stats_compressor is used rather than data_compressor because
data_compressor excludes IntDictScheme to avoid re-encoding the dict codes; the
probe needs every dict scheme to detect eligibility. For the default builder
stats_compressor equals BtrBlocksCompressor::default(), so the default path is
unchanged.

Signed-off-by: Thomas Santerre <thomas@santerre.xyz>
@tomsanbear tomsanbear requested a review from a team June 14, 2026 17:26
@tomsanbear

Copy link
Copy Markdown
Contributor Author

Open design question:

With this change, the probe runs the caller's opaque compressor instead of the stock default. This is empirically a no-op for every current opaque caller (all pass btrblocks-based compressors that emit Dict, or non-dict-favourable vector data). The only behaviour change would be for a hypothetical opaque compressor that never emits a top-level Dict while being fed low-cardinality data... it would stop triggering the dict layout. I chose this uniform behaviour because it matches the approach of "the probe reflects the caller's compressor" open to alternatives as i'm still new to the codebase.

@tomsanbear tomsanbear changed the title fix(layout/dict): probe with the configured compressor instead of a h… fix(layout/dict): probe with the configured compressor instead of a hardcoded default Jun 14, 2026
@onursatici onursatici added the action/benchmark Trigger full benchmarks to run on this PR label Jun 15, 2026
@joseph-isaacs joseph-isaacs requested a review from onursatici June 15, 2026 10:53
@onursatici onursatici added the changelog/fix A bug fix label Jun 15, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 15, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 19.8%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 5 improved benchmarks
❌ 104 regressed benchmarks
✅ 1436 untouched benchmarks
⏩ 10 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation compare[48] 213 µs 300.6 µs -29.15%
Simulation compare[50] 227.7 µs 319.2 µs -28.65%
Simulation compare[49] 228.2 µs 317.7 µs -28.18%
Simulation compare[44] 207.5 µs 287.7 µs -27.88%
Simulation compare[46] 218.5 µs 302.5 µs -27.76%
Simulation compare[47] 223.5 µs 309.3 µs -27.74%
Simulation compare[40] 190.7 µs 263.4 µs -27.62%
Simulation compare[44] 212.2 µs 292.4 µs -27.43%
Simulation compare[45] 218.9 µs 300.9 µs -27.26%
Simulation compare[43] 209.2 µs 287.7 µs -27.26%
Simulation compare[42] 204.6 µs 281 µs -27.21%
Simulation compare[40] 195.6 µs 268.4 µs -27.1%
Simulation compare[43] 214.2 µs 292.5 µs -26.77%
Simulation compare[42] 209.4 µs 285.9 µs -26.76%
Simulation compare[41] 204.5 µs 279.2 µs -26.74%
Simulation compare[41] 209.3 µs 284 µs -26.27%
Simulation compare[31] 157.7 µs 213.7 µs -26.2%
Simulation compare[39] 199.9 µs 270.8 µs -26.19%
Simulation compare[38] 195.5 µs 264.6 µs -26.1%
Simulation compare[36] 187.3 µs 252.6 µs -25.85%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing tomsanbear:fix/dict-strategy-probe-compressor (0581c1f) with develop (9444d20)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Comment thread vortex-file/src/strategy.rs Outdated
Comment on lines +271 to +284
let compress_then_flat = CompressingStrategy::new(flat, Arc::clone(&stats_compressor));

// 3. apply dict encoding or fallback
// 3. apply dict encoding or fallback.
// The dict-fit probe shares `stats_compressor` (the full configured cascade), not
// `data_compressor`: `data_compressor` drops `IntDictScheme` to avoid re-encoding the
// codes produced in step 5, but the probe needs every dict scheme to detect eligibility.
// The full cascade also honours caller scheme exclusions, unlike a hardcoded default.
let dict = DictStrategy::new(
coalescing.clone(),
compress_then_flat.clone(),
coalescing,
Default::default(),
);
)
.with_probe_compressor(stats_compressor);

@joseph-isaacs joseph-isaacs Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this compressor?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The probe compresses the first chunk and only checks is dict, the result is discarded. stats_compressor seems like the right fit since it has all dictionary schemes; the data compressor excludes intdictscheme to prevent double dict-encoding the codes that DictStrategy produces downstream

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can have a new compressor argument thar defaults to btrblocks instead of reusing the stats ine. I agree using data compressor doesn't make sense but maybe callers want a separate compressor for dict selection than the one used for stats

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's the tradeoff that callers have to explicitly choose which compressor probes dict eligibility, but that seems better than a default that silently ignores configured scheme exclusions, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea if users want to customise the probe compressor they should be able to change only that while constructing their write strategies. For the default case having the stats compressor works but there would probably be a need to change the probe compressor separately from the stats compressor

@onursatici onursatici left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me

Comment thread vortex-file/src/strategy.rs Outdated
Comment on lines +274 to +277
// The dict-fit probe shares `stats_compressor` (the full configured cascade), not
// `data_compressor`: `data_compressor` drops `IntDictScheme` to avoid re-encoding the
// codes produced in step 5, but the probe needs every dict scheme to detect eligibility.
// The full cascade also honours caller scheme exclusions, unlike a hardcoded default.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this comment, I think it is a bit noisy given the declarations of stats and data compressors are self explanatory

Comment thread vortex-file/src/strategy.rs Outdated
Default::default(),
);
)
.with_probe_compressor(stats_compressor);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we get this as an argument to new with the default being btrblocks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, updated the branch with this change and removed the with_... method

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark Trigger full benchmarks to run on this PR changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DictStrategy dict-fit probe ignores the configured compressor

3 participants