Skip to content

Add unified parse_json runtime type checker (#3285) [draft / proof-of-direction]#4063

Draft
vup903 wants to merge 7 commits into
zarr-developers:mainfrom
vup903:fix-issue-3285
Draft

Add unified parse_json runtime type checker (#3285) [draft / proof-of-direction]#4063
vup903 wants to merge 7 commits into
zarr-developers:mainfrom
vup903:fix-issue-3285

Conversation

@vup903

@vup903 vup903 commented Jun 15, 2026

Copy link
Copy Markdown

Summary

This is a draft / proof-of-direction PR for #3285, opened to show the intended approach and get early feedback before migrating all call sites — per @d-v-b's request in the issue thread.

It introduces a single unified runtime type checker, parse_json(value, type_annotation), that consolidates the scattered per-field parse_* helpers (there are currently ~42 of them across src/zarr). The function validates a JSON-decoded value against a type annotation and returns data assignable to that annotation (coercing where sensible, e.g. parse_json([1, 2, 3], tuple[int, int, int]) -> (1, 2, 3)), or raises a clear error.

What's in this draft

  • src/zarr/core/json_parse.pyparse_json plus sub-parsers, scoped to the JSON-relevant categories named in the issue:
    • primitives: None, str, int, float, bool (with a bool-vs-int-safe check: True is not accepted for int, and 1 is not accepted for bool)
    • Literal[...]
    • unions / Optional
    • fixed and variadic tuple
    • Sequence[T] / list[T] → returned as a tuple
    • Mapping[str, T] / dict[str, T]
    • TypedDict
  • tests/test_json_parse.py — 94 tests covering every category, incl. the bool/int edge cases, coercion, unions, nested TypedDict.
  • Pilot migration of three representative helpers to delegate to parse_json, with public signatures and behavior unchanged:
    • core/common.py::parse_orderparse_json(data, Literal["C", "F"])
    • core/common.py::parse_boolparse_json(data, bool)
    • core/metadata/v3.py::parse_zarr_formatparse_json(data, Literal[3]), re-wrapping the error as MetadataValidationError to preserve the original exception type and message.

Focused suites pass locally (Python 3.12, test.py3.12-optional env): test_common, test_config, test_metadata, and test_json_parse430 passed.

Open questions for maintainers (direction feedback)

  1. Module name/locationsrc/zarr/core/json_parse.py is a placeholder; happy to move it (e.g. into a typing/metadata utility module).
  2. Exceptionsparse_json raises plain ValueError / TypeError since it's field-agnostic; field-specific callers (like parse_zarr_format) re-wrap into their domain error. Does that split match what you'd want?
  3. TypedDict + NotRequired — the current _parse_typeddict reads __required_keys__ / __optional_keys__ directly. Under from __future__ import annotations with class-syntax TypedDicts, typing_extensions can't see the NotRequired wrapper at class-creation time, so an optional key can be mis-classified as required. The robust fix is to resolve hints via typing_extensions.get_type_hints(..., include_extras=True) and inspect for Required/NotRequired. I've left this for the follow-up once the overall direction is confirmed, since none of the three pilot helpers are TypedDicts.

If the direction looks right, I'll migrate the remaining parse_* call sites incrementally in follow-up commits.

Closes #3285 once fully migrated (draft for now).

vup903 and others added 6 commits June 14, 2026 22:51
Introduce zarr.core.json_parse.parse_json, a single type-annotation-driven validator that consolidates the scattered per-field parse_* helpers. Handles primitives, Literal, unions/Optional, fixed and variadic tuples, Sequence/list (coerced to tuple), Mapping/dict, and TypedDict, with a bool-vs-int safe primitive check. Adds tests/test_json_parse.py (94 tests) and a changelog fragment. No existing call sites migrated yet; this is the proof-of-direction module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r-developers#3285)

Pilot migration delegating three representative helpers to the unified parse_json validator. Public signatures and return types are unchanged. parse_zarr_format re-wraps parse_json's ValueError/TypeError as MetadataValidationError with the original message to preserve observable behavior. parse_json is imported function-locally to avoid circular imports. Focused suites green: test_common/test_config/test_metadata/test_json_parse = 430 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ns (zarr-developers#3285)

Apply ruff check/format to satisfy the Lint CI hook. Keep typing.Union/Optional spellings in tests (with noqa) to cover that origin path alongside X | Y. Rework _parse_typeddict to derive required/optional from get_type_hints(include_extras=True) + __total__ instead of __required_keys__, so class-syntax NotRequired is detected even when 'from __future__ import annotations' stringizes the hints (as in zarr's metadata modules). test_json_parse + test_common + test_metadata = 393 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lopers#3285)

get_origin(hint) is Required/NotRequired tripped mypy's comparison-overlap and unreachable checks; typing origin as Any keeps the runtime check while satisfying mypy. Full 'uv run --frozen mypy' is clean (190 files); ruff check/format clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Delegate the literal/type check of parse_indexing_order, parse_node_type, parse_node_type_array, parse_separator, two parse_zarr_format variants (group, v2), and parse_name to the unified parse_json. Public signatures and return types unchanged; each preserves its original exception type and message (wrapping parse_json's ValueError/TypeError into MetadataValidationError / NodeTypeValidationError / the original ValueError/TypeError where tests assert on them). parse_json imported function-locally to avoid circular imports. Focused suites + full mypy green (1196 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s#3285)

Delegate the int/bool type check of parse_checksum, parse_clevel, parse_blocksize, parse_typesize, parse_gzip_level, and parse_zstd_level to parse_json, keeping each helper's range/bound check and exact error messages. Original exception types are preserved by wrapping parse_json's ValueError/TypeError. Note: parse_json rejects bool where the old isinstance(data, int) accepted it; verified no caller/test passes a bool to these, so this is a deliberate, more-correct strictening. parse_json imported function-locally. Codec suite + full mypy green (794 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.73239% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.49%. Comparing base (4862895) to head (a4a3c6e).

Files with missing lines Patch % Lines
src/zarr/codecs/blosc.py 65.00% 7 Missing ⚠️
src/zarr/codecs/zstd.py 64.28% 5 Missing ⚠️
src/zarr/core/json_parse.py 95.93% 5 Missing ⚠️
src/zarr/core/group.py 75.00% 3 Missing ⚠️
src/zarr/codecs/gzip.py 71.42% 2 Missing ⚠️
src/zarr/core/chunk_key_encodings.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4063      +/-   ##
==========================================
- Coverage   93.50%   93.49%   -0.01%     
==========================================
  Files          90       91       +1     
  Lines       11979    12132     +153     
==========================================
+ Hits        11201    11343     +142     
- Misses        778      789      +11     
Files with missing lines Coverage Δ
src/zarr/core/common.py 89.94% <100.00%> (+1.25%) ⬆️
src/zarr/core/config.py 100.00% <100.00%> (ø)
src/zarr/core/metadata/v2.py 89.61% <100.00%> (+0.17%) ⬆️
src/zarr/core/metadata/v3.py 94.01% <100.00%> (+0.06%) ⬆️
src/zarr/codecs/gzip.py 90.90% <71.42%> (-1.78%) ⬇️
src/zarr/core/chunk_key_encodings.py 90.00% <60.00%> (-1.18%) ⬇️
src/zarr/core/group.py 95.11% <75.00%> (-0.09%) ⬇️
src/zarr/codecs/zstd.py 88.33% <64.28%> (-2.41%) ⬇️
src/zarr/core/json_parse.py 95.93% <95.93%> (ø)
src/zarr/codecs/blosc.py 93.43% <65.00%> (-1.89%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vup903

vup903 commented Jun 22, 2026

Copy link
Copy Markdown
Author

Hi @d-v-b, I put up a draft so you can see where I'm heading before I take it further.

So far it adds src/zarr/core/json_parse.py with a single parse_json(value, type_annotation) function. It validates a JSON-decoded value against a type annotation and returns assignable data (coercing sequences to tuples), or raises a clear error. I kept the scope to the JSON types listed in the issue: primitives, Literal, unions/Optional, tuple, Sequence/list, Mapping/dict, and TypedDict. There's a test file (tests/test_json_parse.py) with 94 cases covering each category, including the bool/int edge case and NotRequired under from __future__ import annotations.

On the de-dup side, I've moved 16 of the scattered parse_* helpers over to parse_json already. I split it into three small batches to keep each one easy to review: the pilot (parse_order, parse_bool, parse_zarr_format), the literal ones (parse_indexing_order, parse_node_type, parse_node_type_array, parse_separator, the zarr_format variants, parse_name), and the codec primitives (parse_clevel, parse_blocksize, parse_typesize, parse_gzip_level, parse_zstd_level, parse_checksum). Public signatures and the observable behavior (exception type and message) stay the same in every case.

Before I migrate the rest, a few things I'd like your read on:

  1. Is src/zarr/core/json_parse.py a reasonable home for this, or would you rather it live somewhere else?
  2. parse_json raises plain ValueError/TypeError, and the field-specific callers re-wrap into their own error (like MetadataValidationError). Does that split feel right to you?
  3. For TypedDict, I resolve hints with get_type_hints(..., include_extras=True) and read Required/NotRequired plus __total__ instead of __required_keys__, so a class-syntax NotRequired still works when annotations are stringized. Open to doing it differently if you have a preference.
  4. parse_json(x, int) rejects bool (since bool is an int subclass), where the old isinstance check let it through. I went with the stricter behavior because nothing relied on passing a bool there, but happy to revert if you'd rather keep it loose.

I'm holding off on the sequence/mapping/union helpers that dispatch to registries or factory methods until I hear back on the above.

One heads up on CI: the failing Test jobs are a pre-existing collection error in tests/test_group.py (duplicate parametrization of 'store') and tests/test_docs.py under the newer pytest. It's already on main and isn't from this change. Lint and pre-commit.ci are passing.

Let me know what you think on the four questions and I'll keep going.

@chuckwondo

Copy link
Copy Markdown
Contributor

@vup903, if we are willing to add msgspec as a (very small) dependency, I believe we can eliminate the vast majority of this code, as msgspec.convert handles nearly all (if not all) of this logic. I might be off, as I have not thoroughly looked through the entirety of the affected code, but I'd like to hear from @d-v-b on this point.

@d-v-b

d-v-b commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

nearly all of the types we need are now defined in zarr-metadata, so we should investigate if msgspec can handle all of those types correctly. zarr-metadata itself uses pydantic for its test suite, and I think i ended up choosing that after tryng msgspec and running into some issues, but I don't recall what they were. Maybe something to do with tuples? If its helpful, we could widen the JSON collection types in zarr-metadata to Sequence instead of tuple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

unified runtime type checking for our json data

3 participants