Skip to content

zarrs bindings [do not merge]#4064

Draft
d-v-b wants to merge 46 commits into
zarr-developers:mainfrom
d-v-b:zarrs-bindings
Draft

zarrs bindings [do not merge]#4064
d-v-b wants to merge 46 commits into
zarr-developers:mainfrom
d-v-b:zarrs-bindings

Conversation

@d-v-b

@d-v-b d-v-b commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

vibe-coded zarrs bindings. I am not done with this, but I figured it's in a good state for signposting / discussion.

strategy

I wanted to keep the contact surface between zarr-python and zarrs minimal and low-state, so I defined a functional crud API that expresses the core of zarr IO. (that API is not wired up to the top-level zarr.Array class!). The idea is that we can express the stuff users want to do to their data -- create new arrays / groups, write chunks , read chunks, as f(metadata, storage, *parameters).

The crud API supports multiple registered backends, e.g. a default python backend (based on repurposing our existing routines) and the rust backend, when zarrs is available.

The statelessness of the functional API is also a downside if you call read_chunk(metadata, store, ...) repeatedly, because the rust code will re-construct the same chunk decoding machinery each time. I address this with an LRU cache on the rust side. I think "metadata + store + options" is a good cache key but we need to discuss this design further.

An alternative strategy would be to write a zarrs-based expression for the many methods defined on the Array and AsyncArray classes, while ensuring that we avoid crossing the FFI boundary excessively. I avoided this because I imagined it would require covering a huge code surface area and raise tough questions about whether python or rust was owning the life cycle of the object. If people really want a full rust-backed Array class, we can explore that direction.

caveats:

  • I have only implemented simple indexing right now. I'm going to add full numpy indexing semantics down the road, using the data structures defined in https://github.com/zarr-developers/ndsel as an FFI-friendly data model.
  • Rust handles local file system storage itself. other stores have to cross FFI for every store operation, and I haven't tested pathologies like deleting the store on the python side while rust is working on it. Maybe I am naive but I would really like pure functions here, which means the best solution is for stores to move across the language boundary as plain data, not live python objects. URL pipeline syntax would be a great addition here.

performance

the zarrs backend is faster! here's a benchmark script you can run yourself. It requires the rust toolchain for building the bindings.

I'm seeing ~15x throughput improvement, looks good.

array:       shape=(2048, 2048) dtype=uint16 chunks=(64, 64) shards=(512, 512) compressor=zstd
size:        8.4 MB logical, 10 iterations, LocalStore
correctness: reference and zarrs both match the source data

backend        best (ms)   median (ms)   median MB/s
reference         147.05        157.85            53
zarrs               6.94          8.97           935

impact

these changes require internal changes in the zarr package, as well as a new subpackage for the rust bindings. it adds the rust toolchain to the developer dependencies of the project. it exposes us to changes in the zarrs package, which is outside the zarr-developers org. We definitely need a design plan to limit complexity if we want to pursue this further.

d-v-b and others added 30 commits June 11, 2026 22:57
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Create zarrs-bindings/ with Cargo.toml, pyproject.toml (maturin build
backend), and src/lib.rs (exceptions + version function). Wire into the
root pyproject.toml via a new `zarrs` dependency group and [tool.uv.sources].
Add zarrs-bindings/target/ to .gitignore.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add StoreShim, a synchronous adapter over zarr's async Store ABC, and
resolve_store, which maps a Store to either a native config dict (for
LocalStore) or a StoreShim for Rust to call back into. Also convert the
store fixture in tests/zarrs/conftest.py to an async generator with
teardown so stores are properly closed after each test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds `create_array` and `read_metadata` pyfunctions to the zarrs-bindings
crate, and exposes them as `create_new_array`, `create_overwrite_array`, and
`read_metadata` in the `zarr.zarrs` subpackage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add `delete_node` and `list_children` to the zarrs-bindings Rust crate
and the `zarr.zarrs` Python subpackage. `delete_node` erases the node
prefix via `erase_prefix`, raising `NodeNotFoundError` when the node is
absent. `list_children` opens the target as a `Group` and returns direct
children as `(path, metadata_document)` pairs. Both are covered by 4
new tests (× 2 stores = 8 parametrized cases); total suite: 34 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Add --ignore=src/zarr/zarrs to pytest addopts so --doctest-modules
  doesn't attempt to import the bindings module when zarrs-bindings is
  not installed, preventing a collection ERROR in jobs that don't use
  the zarrs dependency group.
- Add dtolnay/rust-toolchain step (SHA-pinned, stable) to zarrs.yml CI
  so the build is not reliant on whatever Rust version the runner image
  ships; ensures rust-version = "1.91" in the crate is satisfied.
- Fix spec: abi3-py311 -> abi3-py312 to match zarr's requires-python
  >=3.12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Move the pytest.importorskip("_zarrs_bindings") guard from conftest.py
(module-level) to the top of each test module. When the bindings are
absent, xdist workers would previously raise Skipped while importing the
conftest, causing "Different tests were collected between gw0 and gwX"
failures. Per-module guards are the standard xdist-safe pattern.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add `retrieve_array_subset` Rust binding and `decode_region` Python API
function. Selection normaliser maps integers/slices/Ellipsis to a step-1
bounding box fetched in one zarrs call; strides, reversals, and integer-axis
removal are applied as numpy views on the result.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Add docstring note to decode_region warning that zarrs fetches the
  step-1 bounding box, so strided selections read O(span) bytes.
- Remove dead isinstance(sel, EllipsisType) raise in _normalize_selection
  (Ellipsis is expanded to slice(None) before the per-dimension loop);
  replace with an assert to preserve mypy type narrowing.
- Guard non-integral float shape elements in _array_shape so shape=[1.5]
  raises TypeError instead of silently truncating to 1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add a process-wide LRU cache (capacity 128) in chunk.rs keyed on
(filesystem root, node path, metadata JSON) that memoises the result of
Array::new_with_metadata for the native FilesystemStore path.  Generic
Python-backed stores (MemoryStore, ZipStore, custom) are not cached.
The cache key encodes root + path + metadata so an Array is never
reused for a different store or different codec chain; chunk data
continues to flow through the store on every call, so the cache cannot
return stale data.  Two test-hook pyfunctions (array_cache_len /
clear_array_cache) are exposed on _zarrs_bindings; five correctness
tests in tests/zarrs/test_cache.py cover population, non-caching of
MemoryStore, distinct-metadata entries, root-keying, and write
visibility.  All 117 tests pass; cargo clippy -D warnings clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace all four `array_cache().lock().unwrap()` call sites with a
`lock_cache()` helper that uses `.unwrap_or_else(|e| e.into_inner())`.
If a thread panics while holding the mutex the lock is now recovered
(worst case: a stale cache entry) rather than poisoning every
subsequent lock call and wedging all array I/O permanently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
read_chunk stays parameter-free (chunk-grid addressed, whole chunk);
all selection-based reads route through read_region/read_subset
(array-coordinate, spans chunks). Drops the deferred chunk-subset
selection parameter.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d method

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
d-v-b and others added 12 commits June 15, 2026 13:52
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- wrap the lazy `import zarr.zarrs` in `get_backend` with a try/except
  that raises a helpful ImportError when the zarrs-bindings extension is
  not installed
- add a thread-safety comment above the `_BACKENDS` dict explaining that
  CPython's import lock + GIL make it safe without additional locking
- document the `runtime_checkable` limitation in the `CrudBackend`
  docstring: isinstance only checks method names, not signatures or
  async-ness; mypy is the authoritative conformance check

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Implements `ReferenceBackend` in `src/zarr/crud/_reference.py`, registers
it as the default `"reference"` backend at `zarr.crud` import time.
Drives chunk encode/decode via `create_codec_pipeline` + the abstract
`CodecPipeline.encode`/`decode` methods (not `encode_batch`/`decode_batch`,
which live only on `BatchedCodecPipeline` and are not in the abstract
Protocol), and multi-chunk subset reads via `AsyncArray.getitem`.

Fixes two previously-failing registry tests; adds four new backend tests
covering round-trip chunk I/O, multi-chunk subset reads, duplicate-create
error, and missing-metadata error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`_array_spec` was hardcoding `ArrayConfig.from_dict({})`, which always
defaults to C order, causing silent data corruption for v2 arrays created
with `order="F"`. `read_chunk` returned transposed bytes and `write_chunk`
stored bytes that zarr-python read back incorrectly.

Fix: detect `ArrayV2Metadata` and pass its `.order` attribute into the
`ArrayConfig`, so the codec pipeline uses the correct memory layout.
v3 arrays are unaffected (order there is a transpose codec, not memory
order).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds the public `zarr.crud` facade dispatching to a CrudBackend, plus the
differential test suite exercising it against the reference backend. The
zarrs param in the backend fixture is skipped until Task 4 registers the
zarrs backend.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replaces the old flat-function API in `zarr.zarrs._api` with
`ZarrsBackend` (a `CrudBackend` implementation) in
`zarr.zarrs._backend`. Importing `zarr.zarrs` now registers the backend
under the key `"zarrs"` via `zarr.crud.register_backend`, enabling
`backend="zarrs"` in all crud facade functions and activating the
previously-skipped zarrs params in `tests/crud/test_crud.py`.

Old tests that exercised the removed flat API are deleted; cache tests
are migrated to use the `zarr.crud` facade.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…drops

- Add ZGROUP_JSON + a _node_exists() helper probing all three meta-key
  variants (zarr.json / .zarray / .zgroup) so v2 groups are visible to every
  CRUD operation; rewrite _create/read_metadata/delete_node/list_children to
  use it; remove dead _meta_key().
- Add _is_all_fill_value() and drop all-fill chunks in write_chunk to match
  zarrs's sparse-storage convention.
- Add differential tests for v2 groups, v2 array read_metadata, and all-fill
  chunk writes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ment it

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Matches the existing packages/zarr-metadata subpackage layout. Updates the
uv source path, sdist exclude, gitignore, and design doc accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d-v-b

d-v-b commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@clbarnes related to your interests

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 20.68127% with 326 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.09%. Comparing base (f12c1dc) to head (004ce2e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/crud/_api.py 22.96% 104 Missing ⚠️
src/zarr/crud/_reference.py 23.66% 100 Missing ⚠️
src/zarr/zarrs/_bridge.py 0.00% 54 Missing ⚠️
src/zarr/zarrs/_backend.py 0.00% 45 Missing ⚠️
src/zarr/crud/_registry.py 41.17% 10 Missing ⚠️
src/zarr/zarrs/__init__.py 0.00% 9 Missing ⚠️
src/zarr/crud/_common.py 55.55% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4064      +/-   ##
==========================================
- Coverage   93.52%   91.09%   -2.43%     
==========================================
  Files          90       99       +9     
  Lines       11926    12337     +411     
==========================================
+ Hits        11154    11239      +85     
- Misses        772     1098     +326     
Files with missing lines Coverage Δ
src/zarr/core/config.py 100.00% <ø> (ø)
src/zarr/crud/__init__.py 100.00% <100.00%> (ø)
src/zarr/crud/_backend.py 100.00% <100.00%> (ø)
src/zarr/crud/_common.py 55.55% <55.55%> (ø)
src/zarr/zarrs/__init__.py 0.00% <0.00%> (ø)
src/zarr/crud/_registry.py 41.17% <41.17%> (ø)
src/zarr/zarrs/_backend.py 0.00% <0.00%> (ø)
src/zarr/zarrs/_bridge.py 0.00% <0.00%> (ø)
src/zarr/crud/_reference.py 23.66% <23.66%> (ø)
src/zarr/crud/_api.py 22.96% <22.96%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

async def list_children(store, path) -> list[tuple[str, dict]] # (path, metadata)

# chunk-level I/O
async def decode_chunk(metadata, store, path, chunk_coords, *, selection=None) -> np.ndarray

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an ndarray will force a copy out of rust memory. Longer term, I'd perhaps suggest instead making a wrapper class on the Rust side that will expose an ndarray. At least whenever the data type is supported by the buffer protocol that should remove a memory copy. See for example async_tiff.Array

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have very adjacent prior art in https://github.com/zarrs/zarrs-python, I should see what they do!

@clbarnes clbarnes Jun 17, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can instantiate numpy arrays on the rust side, either from a vec directly or with an ndarray: https://docs.rs/numpy/latest/numpy/convert/trait.IntoPyArray.html#tymethod.into_pyarray

I imagine it would get more complicated when custom buffer prototypes are passed, because they need to be instantiated on the python side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am kind of hoping to put the custom buffer prototype thing in the rear view window. It has not worked well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to all ideas here, but "every routine that allocates memory takes a python callable" is probably not going to work for cross-language interop. Another complication is that "contiguous stream of bytes" is a lower-level abstraction than "N-dimensional array".

My rough feeling is that we should partition the chunk encoding / decoding routines by the memory access pattern needed for the array side of the array -> bytes transformation. For dense arrays, rust can hand back a contiguous buffer, or write into a pre-allocated buffer. For sparse arrays, we will need to settle on an memory layout, and we probably can't support writing into a pre-allocated sparse array.

Comment on lines +120 to +122
3. **Anything else** → generic `PyStore`: a Rust struct implementing
`ReadableStorageTraits` / `WritableStorageTraits` /
`ListableStorageTraits` over a Python callback object.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: as you might expect from our zulip discussion I'm hesitant to lump all of these together. I'm interested in spending a little time with this thinking about how to represent these traits through a Python API

Comment on lines +124 to +131
The callback path: the async API function wraps the user's `Store` in a small
sync Python shim whose methods submit coroutines to zarr-python's existing
sync event-loop thread (`zarr.core.sync`,
`asyncio.run_coroutine_threadsafe(...)` + blocking result). Rust calls the
shim while holding no locks of its own. This makes any conformant `Store`
(Memory, Zip, Logging, Wrapper, user-defined) work without Rust knowing its
type. Deadlock safety relies on the existing invariant that code running on
the zarr sync loop never blocks on these Rust entry points.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know I don't love this 😅. I'd like to specialize Sync and Async stores on the rust side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha I'm not a fan either! ideal scenario is that a store connection has a declarative form that rust can use to reconstruct it (like a URL). Things tied to python memory will remain hard.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw that's how obstore works when passed into a pyo3 binding. It recreates the store on the rust side so no requests need to go through Python

Comment on lines +145 to +148
µs of actual chunk I/O on a warm filesystem. To amortize it across the common
"open one array, then do many chunk operations" pattern, the chunk/region
routines memoize the constructed `Array` in a process-global LRU cache
(capacity 128) keyed on `(filesystem root, node path, metadata JSON)`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty messy to do this in a global LRU cache. There's no reason to have a public LRU cache when you could just cache the metadata in... a class instance 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in... a class instance 😄

storing the metadata in a class instance isn't the same as globally caching it. The point here is to avoid performing the same computation that is keyed by exactly the metadata document. Even if we were using OOP, I think it would make sense for the procedure that takes metadata and emits chunk encoding / decoding machinery to use a cache.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point here is to avoid performing the same computation that is keyed by exactly the metadata document

You're keying on filesystem - node path - metadata JSON... how would that key ever be shared across more than one Array? Seems like an Array class is a logical place to put that metadata

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible that we aren't disagreeing! I want a functional API because I think that's actually a very natural way to interact with zarr data -- for a given operation, you specify the metadata document you want to use, the location, and any other parameters.

It also definitely makes sense to provide users with objects that persistently bind the metadata and the storage backend. That's how zarr-python works today. But if you start with the object oriented API, it can be hard to add the functional API later. I think the functional API is actually really important for doing a lot of cool things with Zarr! so i want to see if we can bake it in foundationally, and add the object-oriented API on top.

Comment on lines +41 to +45
zarr-python already contains everything a pure-Python backend needs:
`BatchedCodecPipeline` (`src/zarr/core/codec_pipeline.py`), `BasicIndexer`
(`src/zarr/core/indexing.py`), `save_metadata` (`src/zarr/core/metadata/io.py`),
metadata parsing (`ArrayV3Metadata.from_dict` / `ArrayV2Metadata.from_dict`),
and chunk-key encoding (`src/zarr/core/chunk_key_encodings.py`).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to learn more about codec pipelines to understand how they integrate with zarrs

but carries only defaults (fields become meaningful in Phase 3).

```python
# node lifecycle

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the functional API because it means that you're reparsing metadata that makes sense to just store in a class instance

@kylebarron

Copy link
Copy Markdown
Contributor

In case it's useful for evaluation, I started writing a direct, standalone binding of zarrs to Python: https://github.com/developmentseed/zarrista

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants