Skip to content

libnvmf: Add the NVMe-oF exclusion list#3523

Open
martin-belanger wants to merge 9 commits into
linux-nvme:masterfrom
martin-belanger:exclusion-list
Open

libnvmf: Add the NVMe-oF exclusion list#3523
martin-belanger wants to merge 9 commits into
linux-nvme:masterfrom
martin-belanger:exclusion-list

Conversation

@martin-belanger

Copy link
Copy Markdown

Summary

This series adds a system-wide NVMe-oF exclusion list: a persistent, administrator-authored set of files under /etc/nvme/exclusions/ naming controllers that no orchestrator may auto-connect. It is a small, cooperative coordination layer — the administrator's opt-out — and the natural companion to the ownership registry that landed earlier.

The two are mirror images: the registry records who owns a controller and guards against accidental disconnect (so disconnect-all does not tear down a connection some component depends on); the exclusion list guards against accidental connect (so an orchestrator does not auto-connect a controller the administrator wants left alone). Together with the transport ID introduced here, they are the building blocks that let independent NVMe-oF orchestrators — the initramfs (NBFT/FC boot), a human running connect-all, nvme-discoverd, nvme-stas — coexist on one host.

Why an exclusion list (and not just editing a config file)

The motivating cases are connections whose intent comes from a source the administrator cannot edit:

  • NBFT controllers are described by a firmware boot table configured in UEFI — there is no host-side entry to comment out. Excluding the controller is the only way to keep an orchestrator off an NBFT path, e.g. to take it out of service during testing.
  • Discovered controllers — a target returned as a DC's discovery log page entry (DLPE), or a controller announced over mDNS — are advertised by the fabric at runtime and listed in no file on the host.

For all of these the exclusion list is the only local lever: the one place an administrator can say "not this one", regardless of where the connection request originated. Like the registry, it is cooperative, not enforcement — every orchestrator runs as root and could connect anything; the list simply lets well-behaved tools refrain by agreement.

What's in the series

  1. libnvme: add shared fabrics helpers in preparation for exclusion list — a preparation refactor: extract the private mkdir -p, atomic-cloexec-tempfile, and test-directory-validation helpers from registry.c into util-fabrics.c as internal libnvmf_ helpers, and convert registry.c to use them. No functional change.
  2. libnvme: add the NVMe-oF exclusion list — the feature: the libnvmf_exclusion_* API in exclusion.c, built on a new transport-ID type (libnvmf_tid_* in tid.c) that canonicalizes a connection's identifying tuple and derives a stable hash from it; an nvme exclusion plugin (create/delete/edit/list/add/remove); and SWIG bindings exposing lists/entries/match to Python.
  3. nvme-cli/fabrics: add --exclude to nvme disconnect — a -x/--exclude flag on nvme disconnect that records a matching exclusion entry before tearing a controller (or subsystem) down, so an orchestrator sees the exclusion in place before the removal event fires and does not immediately reconnect. The entry is built inside libnvme (libnvmf_exclusion_add_ctrl for a device, libnvmf_exclusion_add_nqn for an NQN); the CLI keeps no knowledge of the on-disk entry format.
  4. doc: add NVMe-oF exclusion list documentation — an overview in libnvme (EXCLUSIONS.md, modelled on REGISTRY.md) covering the on-disk format, the field keys, the minimal-match semantics, and the relationship to the registry; man pages for the six nvme exclusion subcommands; and the nvme disconnect --exclude flag.

Design notes

  • On-disk format: one or more named .conf files, each holding exclusion = key=value;key=value lines (# comments). By convention user.conf is what nvme disconnect --exclude appends to. World-readable, root-writable; writes are atomic (mkstempfsyncrename).
  • Minimal match: a controller is excluded if it matches any entry in any list; only the fields present in an entry are compared. A NULL connection field never matches a present entry field; an unknown key makes the entry never match (fail-safe); an empty/field-less entry never matches. If the directory can't be read, matching is fail-open (never blocks connectivity).
  • Address-aware comparison: equivalent IP spellings match (e.g. compressed vs. expanded IPv6); FC addresses compare case-insensitively. The identity used for matching is the transport ID (tid.h), the same building block the registry and discoverd use.
  • Scope: the list governs the orchestrating paths that connect on their own. It is deliberately not applied to nvme connect <args> or nvme disconnect <device>, which are single, targeted human actions.

Testing

  • New unit tests: libnvme - tid and libnvme - exclusion (run under meson test, using the NVME_EXCLUSION_DIR /tmp override).
  • The full suite passes (meson test: all green, 2 expected-fail unchanged).
  • Each commit was verified to build and pass the unit tests on its own.

Follow-ups (not in this series)

  • nvme connect-all should honor the exclusion list. connect-all is an orchestration command (it auto-connects DLP/NBFT/config entries), so it should skip excluded controllers. The intended placement is libnvme's discovery-connect path (covering all of connect-all's sources at once, while leaving the targeted nvme connect path unaffected). Deferred to its own PR after this series lands.
  • Key-alias canonicalization. The matcher currently accepts both hyphen and underscore spellings (host-traddr/host_traddr, host-iface/host_iface) and both nqn/subsysnqn. Whether to settle on one canonical form is still under discussion; the docs describe current behavior.

Comment thread libnvme/design/EXCLUSIONS.md
Comment thread libnvme/src/nvme/private-fabrics.h Outdated
Comment thread libnvme/src/nvme/registry.c Outdated
Comment thread libnvme/src/nvme/private-fabrics.h
Comment thread plugins/exclusion/exclusion-nvme.h
Comment thread libnvme/src/nvme/exclusion.h
Comment thread libnvme/design/EXCLUSIONS.md Outdated
Comment thread libnvme/EXCLUSIONS.md Outdated
Comment thread libnvme/design/EXCLUSIONS.md
@martin-belanger

Copy link
Copy Markdown
Author

@igaw — Thanks for the review

I've addressed all the comments and force-pushed. Note the history was rewritten (commits re-split per your suggestions), so the series is now 8 commits; here's what changed:

  • Split the big "add the exclusion list" commit into reviewable pieces, in dependency order: libnvmf_tid is now its own commit (libnvme: add the libnvmf_tid transport identity), the exclusion plugin is its own commit (plugins/exclusion: add the nvme exclusion management plugin), and the core libnvme: add the NVMe-oF exclusion list commit is correspondingly slimmer.
  • Renamed libnvmf_mkstemp_cloexec()libnvmf_mkstemp() (close-on-exec is the implicit default).
  • Reworked the write_attr_atomic() comment in registry.c to explain why the temp-file + rename(2) + dir-fsync protocol is correct, not just the steps.
  • nvme connect now notes an excluded target under --verbose (it still connects — the list governs orchestrators, not explicit human actions).
  • Moved the markdown design docs (REGISTRY.md, EXCLUSIONS.md) into libnvme/design/ so they stop cluttering the top-level ls, kept distinct from libnvme/doc/ (the sphinx/RTD source).
  • Admin vs. runtime exclusions: resolved in-thread — because the exclusion calls take the global ctx (which already carries state like owner), a /run tier is purely additive whenever it's needed (a new setter, no signature change), so there's no API to rework later. I've left it out for now rather than ship an unimplemented stub.

Every commit builds on its own and the unit tests pass.

Comment thread libnvme/src/nvme/registry.c Outdated
Comment thread libnvme/src/nvme/private-fabrics.h Outdated
Comment thread libnvme/src/nvme/private-fabrics.h Outdated
Comment thread libnvme/src/nvme/private-fabrics.h Outdated
Comment thread libnvme/src/nvme/private-fabrics.h
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.c
Comment thread libnvme/src/nvme/tid.c
Comment thread libnvme/src/nvme/tid.c Outdated
Comment thread libnvme/src/nvme/tid.h Outdated
Comment thread libnvme/src/nvme/tid.h
Comment thread libnvme/src/nvme/exclusion.c Outdated
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c Outdated
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c Outdated
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.c
Comment thread libnvme/src/nvme/exclusion.h Outdated
Comment thread libnvme/src/nvme/exclusion.h Outdated
Comment thread libnvme/src/nvme/exclusion.h
Comment thread fabrics.c Outdated
Comment thread fabrics.c
Comment thread libnvme/design/EXCLUSIONS.md
@igaw

igaw commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

alright, I am done with this round of review.

@martin-belanger

Copy link
Copy Markdown
Author

@igaw - Hey Daniel. It was a long day...

I've pushed an update that addresses the last round of reviews. Everything is folded into the existing 8-commit series (fixups squashed into the commit that introduced each piece), and every commit still builds on its own. Here's a summary of the substantive changes and the decisions behind them.

HostID is now part of the TID

After re-reading the NVMe Base Specification (revision 2.3, section 6.3, "Connect Command"), the transport ID now carries hostid alongside hostnqn. A single Host NQN may present multiple Host Identifiers as independent "elements" of a host, each a separate association — so the pair, not the Host NQN alone, is the host identity. The Linux kernel agrees: nvmf_ctlr_matches_baseopts() compares subsysnqn, host NQN and host ID. Linux currently enforces a 1:1 Host NQN ↔ Host Identifier mapping (nvmf_host_add() rejects a mismatch), so the multi-hostid case is unreachable there today, but that is kernel policy rather than a spec guarantee — carrying hostid keeps the TID correct regardless.

Config keys match the nvme connect options

The string keys used in the exclusion files (and in the shared parser) are the nvme connect option names — transport, traddr, trsvcid, nqn, host-traddr, host-iface, hostnqn, hostid — hyphenated exactly where the CLI hyphenates them. The reasoning: an administrator sees command-line options and config files, never the code, so those two surfaces should line up. Three of the keys map onto struct fields whose names use underscores, because C identifiers can't contain hyphens (nqnsubsysnqn, host-traddrhost_traddr, host-ifacehost_iface); libnvmf_tid_parse() does the mapping. Per the greenfield point, I dropped the old key synonyms — one spelling per key, and it's the option name.

Parse the entry once

Entry matching and validation now go through the TID parser instead of a second hand-rolled parser. I added libnvmf_tid_parse_strict(), which rejects unknown or malformed keys (fail-safe against a hand-edit typo silently weakening an entry), and matching runs on the parsed TID via a field-by-field subset comparison — no key-string dispatch once it's parsed. The structured callers (add_ctrl, the subsystem-NQN helper) build a TID and compare on it rather than rendering to a string that add/remove reparse. I promoted libnvmf_tid_is_empty() to the public API; the subset-match helper stays local until a second consumer needs it.

Test sandbox moved into the global context

I dropped the environment-variable override in favor of a setter on the global context, libnvme_set_test_base_dir() (confined to /tmp), which is the direction of #3530. It's applied to both the exclusion list and the registry, so there's no new LIBNVME_* back-door env var to deprecate later. The testing note is gone from the public header.

On-disk layout, dedup, and helper reuse

  • Layout is now /etc/nvme/exclusions.conf + exclusions.conf.d/ drop-ins, mirroring the connection-config format so both hand-edited files look the same to an administrator.
  • A single write_all() and a single fnv1a_64() in util, replacing the duplicate copies (three of write_all, two FNV-1a implementations).
  • libnvmf_exclusion_add_nqnlibnvmf_exclusion_add_subsysnqn; the iteration callback parameter is renamed to callback to match the registry API; SYSCONFDIR prefix instead of a hard-coded /etc; xstrdup()/asprintf() where I was open-coding them.
  • slurp() gains a 1 MiB size cap. I looked at replacing it with mmap but kept the read: the parser needs a NUL-terminated buffer, and a file whose size is an exact page multiple would fault on the terminating byte. These files are tiny, so there's no performance argument either.
  • Comment/style/whitespace pass throughout (early returns, no ?: abuse, blank line before final returns, two-tab argument indentation, consistent strtok_r style).

One comment I couldn't take literally

I'd agreed to reuse trim_white_space() and drop libnvmf_trim, but trim_white_space() lives in the CLI (nvme-print-json.c) and isn't reachable from libnvme across the library boundary, so libnvmf_trim stays as the libnvme-side helper. If a single shared trimmer is wanted, we'd first have to lift trim_white_space() into a util that libnvme can link — happy to do that as a separate cleanup if you'd prefer it.

Two notes on commit placement

Both were forced by patch context — a cleaner per-commit split wasn't possible — and both are harmless:

  • The removal of libnvmf_validate_test_dir landed in the libnvmf_tid commit rather than the prep commit. Its declaration sits between prep-commit and tid-commit helper decls, so it can only be removed cleanly once the tid commit exists. The prep commit leaves it as an unused (non-static, so no warning) helper; the tid commit deletes it.
  • The EXCLUSIONS.md "re-reads the files" wording change landed in the --verbose commit, because it shares a paragraph with the courtesy note that commit adds.

Deferred (as discussed)

  • The nqn local variable in fabrics.csubsysnqn: I'll pick that up in fabrics: rename nqn variable to subsysnqn #3531.
  • Making nvme connect-all honor the list: its own PR once this lands, since the check belongs in libnvme's discovery-connect path.
  • SWIG type hints / annotations: a separate follow-up — feasible since our bindings use proxy classes, but it needs a SWIG ≥ 4.2.0 version gate.
  • Address canonicalization (inet_ntop): done upstream at connect time, where hostname resolution already happens, so a single canonical address flows to the kernel, sysfs, and the TID alike. The TID stays a verbatim container; the kdoc records the caller contract. Wiring it into the connect path is separate work.

Martin Belanger added 9 commits July 1, 2026 18:55
When the global context became mandatory, three libnvme_msg() call sites
in the no-netdb / no-getifaddrs fallback stubs kept passing NULL.
__libnvme_msg() dereferences it (struct libnvme_log *l = &ctx->log), so
each stub crashes in any build without netdb (e.g. musl built with
-U_GNU_SOURCE). No test exercised those stubs until the exclusion-list
unit tests started calling libnvme_ipaddrs_eq().

Spin up a throwaway context for the diagnostic instead of passing NULL.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
This is a preparation patch for the NVMe-oF exclusion list added in a
following commit.  It introduces utility helpers shared between the
registry and the exclusion list.

registry.c keeps private helpers to create directories (mkdir -p),
write a file atomically via a cloexec temp file, and validate a
test-only directory override.  The exclusion list needs the same
primitives, so move them into util-fabrics.c as internal libnvmf_
helpers and convert registry.c to use them.

No functional change: the registry behaviour and unit tests are
unchanged.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Add libnvmf_tid, the owned, hashable transport-ID that uniquely
identifies a full host-to-controller path (an NVMe-oF association). It is
the identity the exclusion list matches on, and the basis other fabrics
features build their per-connection identity on. Split out from the
exclusion-list commit so it can be reviewed on its own.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Add a system-wide NVMe-oF exclusion list (/etc/nvme/exclusions/*.conf)
naming controllers that orchestrators must not auto-connect.
Enforcement is cooperative; the match is minimal -- only the fields
present in an entry are compared.

The library API (libnvmf_exclusion_*) is built on a new transport-ID
type (libnvmf_tid_*) that canonicalizes a connection's identifying
tuple and derives a stable hash from it.  An "nvme exclusion" plugin
provides create/delete/list/add/remove/edit, and SWIG bindings expose
the lists, entries and match query to Python.

The exclusion list and the transport ID join the ownership registry as
the building blocks that let independent NVMe-oF orchestrators (e.g.
nvme-discoverd, nvme-stas) coexist: the registry records who owns a
controller, the exclusion list lets an admin keep controllers out of
every orchestrator's reach, and the TID gives them all one stable
identity for a connection.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Add the `nvme exclusion` command group (list/add/remove/create/delete/
edit) as the CRUD front-end to the exclusion API. Split into its own
patch so the library and the management tooling land separately.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Add a -x/--exclude flag to "nvme disconnect" that records a matching
exclusion entry before tearing the controller (or subsystem) down, so
an orchestrator sees the exclusion before the removal event fires and
does not immediately reconnect.

The entry is built inside libnvme (libnvmf_exclusion_add_ctrl for a
device, libnvmf_exclusion_add_nqn for an NQN); the CLI keeps no
knowledge of the on-disk entry format.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Add an overview of the NVMe-oF exclusion list in libnvme (EXCLUSIONS.md)
-- the on-disk format, the field keys, the minimal-match semantics, and
how it complements the ownership registry as a cooperative layer for
orchestrator coexistence -- together with man pages for the six
"nvme exclusion" subcommands and the new "nvme disconnect --exclude"
flag.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
The hand-written markdown design docs were cluttering the top-level
libnvme/ listing. Move REGISTRY.md into a libnvme/design/ subdirectory,
kept distinct from libnvme/doc/ (the sphinx / read-the-docs source), to
be read directly on GitHub. EXCLUSIONS.md lands there in the same series.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
The exclusion list governs auto-connecting orchestrators, not an explicit
"nvme connect", so connect must never be blocked by it. But an operator
who hand-connects a controller they previously excluded is overriding
their own opt-out, often unknowingly. Under --verbose, consult the list
and print a note when the target matches, making the override visible
without changing behaviour. Document the courtesy in EXCLUSIONS.md.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
@martin-belanger

Copy link
Copy Markdown
Author

@igaw - While working on this PR, I uncovered an old regression affecting musl and muon builds when NVME_HAVE_NETDB is undefined.

The three stub functions in util.c (libnvme_ipaddrs_eq, libnvme_iface_matching_addr, and libnvme_iface_primary_addr_matches) call libnvme_msg(ctx, ...) with ctx = NULL. That is no longer valid, as ctx must not be NULL.

This issue had gone unnoticed because we didn't have any tests exercising those stubs. With the new TID tests I added, those code paths are now being hit.

So, long story short, I had to add a preliminary fix for this regression ahead of the rest of my changes. As a result, the PR now contains 9 commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants