Skip to content

DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442

Draft
martin-belanger wants to merge 12 commits into
linux-nvme:masterfrom
martin-belanger:discoverd-rfcs
Draft

DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442
martin-belanger wants to merge 12 commits into
linux-nvme:masterfrom
martin-belanger:discoverd-rfcs

Conversation

@martin-belanger

@martin-belanger martin-belanger commented Jun 11, 2026

Copy link
Copy Markdown

This PR is for review only and will not be merged. It contains four RFC documents proposing a coordinated set of features for NVMe-oF orchestrator coexistence in nvme-cli 3.0. Please use inline comments to provide feedback on specific sections.

Tip: Markdown files are shown as raw text by default in the diff view. Click the "Display the rich diff" button (the document icon at the top-right of each file) to render them as formatted text — much easier to read.


Background

As NVMe-oF deployments grow, hosts increasingly run multiple tools that manage NVMe-oF connections: nvme-stas, dracut/initramfs scripts, and soon nvme-discoverd. Without coordination, these orchestrators can conflict — one may disconnect a controller that another is actively managing, or connect to a controller that another has deliberately excluded.

This RFC set proposes two libnvme building blocks to solve the coexistence problem (ownership registry and exclusion list), and a new daemon orchestrator (nvme-discoverd) that puts them to use.


The four RFCs

rfc-nvme-orchestrator-coexistence.md — Start here

The top-level document. Frames the two distinct conflict scenarios (accidental disconnect, accidental connect), introduces the two prevention mechanisms (ownership registry and exclusion list), defines a three-tier orchestrator hierarchy (raw commands → manual orchestrators → daemon orchestrators), and shows how nvme-discoverd and nvme-stas naturally partition work without requiring IPC coupling between them.

rfc-nvme-registry.md — Ownership registry

A lightweight, cooperative registry under /run/nvme/registry/ that lets orchestrators declare ownership of connected controllers. nvme disconnect-all consults the registry before acting so it never disconnects a controller managed by a running daemon. The registry is a libnvme building block: a C API (libnvmf_registry_*), a Python binding, and a nvme registry CLI command family. An implementation PR already exists: #3425.

rfc-nvme-exclusion.md — Exclusion list

A human-administered exclusion list at /etc/nvme/exclusions/ that prevents orchestrators from auto-connecting to controllers the administrator wants excluded. Its design center is auto-discovered controllers (mDNS, CDC DLP) where there is no configuration entry to remove. Managed via nvme exclusion CRUD commands. Enforcement is cooperative — each orchestrator reads the list and skips matching controllers; libnvme does not enforce it.

rfc-nvme-discoverd.md — nvme-discoverd

A new daemon proposed for nvme-cli 3.0. It manages NVMe-oF connections for statically configured controllers, NBFT boot controllers, FC-discovered controllers, and (in a future release) mDNS-discovered controllers. Key design choices:

  • Connect-only by design — never issues disconnects; eliminates the entire connect/disconnect ordering complexity (CtrlTerminator problem). TP8010 fabric zoning is out of scope; that belongs to nvme-stas.
  • One systemd transient unit per controller — nvme-discoverd never calls libnvmf_connect() directly; blocking /dev/nvme-fabrics writes happen in child processes managed by systemd. Achieves Daniel's no-threading goal.
  • Registers ownership via --owner discoverd on each generated unit; NBFT reconnects use --owner nbft to preserve the lifetime invariant set at boot.
  • Respects the exclusion list before each connect and reconnect; NBFT controllers bypass the check unconditionally.
  • Coexists with nvme-stas through natural discovery partitioning: nvme-stas owns mDNS/TCP, nvme-discoverd owns NBFT/FC/RDMA and manual config. The registry-check rule in nvme-stas eliminates bounce loops without any IPC coupling.

What feedback is sought

  • Architecture: Does the three-tier orchestrator model match how you see these tools being used in practice?
  • Ownership registry: Any concerns with the directory layout, API shape, or cooperative-only enforcement model?
  • Exclusion list: File format, use cases, nvme exclusion command design?
  • nvme-discoverd: Transient unit design, state file layout, startup reconciliation logic, retry policy, NBFT handling?
  • Coexistence rules: The owner=nbft lifetime invariant, NBFT exclusion bypass, unowned-connections-are-fair-game semantics — do these cover the cases you care about?

Related

Martin Belanger added 2 commits June 8, 2026 14:13
RFCs submitted for review only. Not intended to be merged.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Martin Belanger added 6 commits June 11, 2026 18:17
Registry:
- Show only the guarded udev rule in §4; remove the naive unguarded
  version that preceded it — a skimming reader would copy the wrong one
- Aperiodic audit is now bidirectional: re-assert ownership for live
  controllers with missing entries, not only remove stale ones
- Document the residual TOCTOU race (dangerous direction) explicitly

Exclusion list:
- Fix Python match semantics: None/NULL caller params skip the
  comparison (same rule as C); they do not block a match
- Clarify host-iface= scope: only matches interface-pinned connections;
  manual controller= entries without host-iface= are not matched
- Add address normalization note (inet_pton/inet_ntop, not strcmp)
- connect-all --nbft is exempt from the exclusion list

Orchestrator coexistence:
- Scope "no D-Bus signal" to between orchestrators
- "No special-case logic required" scoped to disconnect logic in nvme-stas
- connect-all --nbft exemption noted in Tier 2 summary
- Add versioning note: these behaviors co-ship in nvme-cli 3.0 /
  nvme-stas 3.0; earlier pairings do not provide the guarantees

nvme-discoverd:
- Add RuntimeDirectoryPreserve=yes; without it systemd removes
  /run/nvme/discoverd on every stop/crash, destroying devid files
  needed by active ExecStop= lines and state files for crash recovery
- Add registry ownership check before adopting pre-existing connections:
  skip any controller owned by neither discoverd nor nbft
- Correct varlink -> D-Bus throughout: StartTransientUnit, StopUnit,
  RestartUnit, and JobRemoved are all org.freedesktop.systemd1.Manager
- Scope "never issues disconnects" to steady-state operation
- Clarify referral DC fizzle-out is intentional; add open questions:
  stale-cache aging (#2), shutdown vs. mounts (#3), NBFT root (#4)
- Minor: "three lines" -> "four lines" for --devid-file specifier;
  GPL-2.0-only (not -or-later); inline unit comments moved to their
  own lines (systemd unit syntax has no end-of-line comments)

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Add §3.9 explaining the varlink/D-Bus situation for the systemd
interface. Not depending on D-Bus was a design wish — systemd is
migrating to varlink and new code should follow. However, varlink
is not yet sufficient: io.systemd.Unit.StartTransient was only added
in systemd v260 (March 2026), and StopUnit, RestartUnit, and
ResetFailedUnit have no varlink equivalent at all.

Since discoverd already depends on libsystemd for sd_event, using
sd-bus (systemd's own D-Bus implementation, part of libsystemd)
adds no new library dependency. The D-Bus usage is scoped to the
systemd interface only; discoverd's own client-facing socket (§3.7)
remains varlink-only.

Add open question #5 to track the future migration once the varlink
unit lifecycle API is complete.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Content fixes, verified against the kernel, systemd, libnvme, and
nvme-stas sources where applicable:

- io.systemd.Unit.StartTransient is in systemd v261 (still unreleased),
  not v260 as previously stated; v261 still lacks StopUnit, RestartUnit,
  and ResetFailedUnit, strengthening the D-Bus rationale (§3.9)
- a unique-NQN DC gets the kernel's 5 s NVME_DEFAULT_KATO, not a dead
  session; --keep-alive-tmo=30 normalizes the DC kato regardless of NQN
- FC WWN traddr comparison is case-insensitive string equality; nothing
  strips colon separators
- TID hash field order now matches nvme-stas staslib/trid.py exactly
- atomic write protocol documented as mkstemp() with random suffix,
  matching the implementation
- disconnect-all confirmation: prompt on TTY for both --force and
  --owner; non-interactive invocations proceed without prompting

Design additions from review:

- transient units carry Before=nvme-discoverd.service so discoverd
  stops before shutdown-time disconnects, preventing reconnect attempts
  (and FC kickstart re-issue) against a shutting-down system
- NBFT entries can be Discovery Controllers, not just IOCs; the
  --owner nbft substitution applies to both unit types
- host-traddr= exclusion entries provide interface exclusion for RDMA
  and FC, where host-iface does not apply
- registry §4.4 covers both initramfs boot paths: NBFT and the FC
  kickstart (unowned until discoverd adopts them at startup)
- the recycled-devid stale-entry edge case requires the udev rule race
  and a libnvme bypass to coincide; normally the rule has already
  cleaned up

Readability: split large sections into topical subsections (registry
§1/§4, exclusion §3/§7, discoverd §3.2, coexistence §5) and break up
long paragraphs throughout. No content changes from restructuring.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-fable-5 [Claude Code]
Add §11 "Future Release: DC Retention Policy" documenting how
discovery-derived configuration is retained after a discovery source
becomes unavailable.

Introduce `discovery-retention-time` (replacing the mDNS-only
`zeroconf-stale-timeout`) as a unified parameter covering all dynamic
discovery sources: mDNS, referral DCs, and FC kickstart DCs. Statically
configured and NBFT-derived DCs are retried indefinitely — they represent
explicit intent and are not subject to the retention timer.

Introduce `fc-kickstart-interval-minutes` for periodic FC fabric probing,
motivated by the equipment replacement scenario where a new DC may have a
different address/NQN and can only be discovered via an active kickstart.

Update §4 and §7.1 to note that FC kickstart has no TID cache in the
initial release, but the retention policy will require one. Update §5
retry policy to reflect the static vs. dynamic DC distinction. Align
§11.1 timer-start semantics with §11.2 (timer starts on source
disappearance regardless of connection state).

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Periodic FC kickstart is opt-in, not on by default. nvme-discoverd ships
on all Linux systems — enabling periodic fabric probing by default on
laptops and desktops with no FC infrastructure would be wasteful and
surprising.

Use 0 to disable (consistent with systemd's convention for interval
parameters, e.g. WatchdogSec=0). 0 was previously invalid; infinity was
the disable value. Any value >= 1 is a valid interval in minutes.

Fix the consistency issue introduced by the default change: soften "not
acceptable in a production environment" to "may not be acceptable in an
FC production environment", and reframe the orthogonality paragraph to
read naturally with the default-is-0 framing.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
The entry-ID hash was unnecessary complexity: exclusion list management
is human-only, so a sequential number scoped to a single interactive
`nvme exclusion remove` invocation suffices, and `libnvmf_exclusion_remove()`
now matches by exact entry-string content instead. The Python bindings
no longer expose exclusion_add/remove/create/delete at all — providing a
programmatic management API would contradict the human-administered
design; only the read-only exclusion_match/lists/entries() remain.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
@igaw igaw added the rfc For tracking discussions new features etc. label Jun 16, 2026
@igaw igaw marked this pull request as draft June 16, 2026 18:54
Martin Belanger added 3 commits June 16, 2026 22:05
The RFC referenced the legacy NVMe-oF autoconnect components only in
passing (§1, §7.1), with no single place describing what nvme-cli
installs today, when each piece fires, and how nvme-discoverd subsumes
it. Add §12 with a full inventory and a clear split between components
that establish connections (replaced) and those doing orthogonal tuning,
key provisioning, or interface naming (kept).

Document the NBFT late-connect path in particular: nvmf-connect-nbft.service
has no [Install] section and is driven only by a NetworkManager dispatcher
script, so it never fires under systemd-networkd or other managers.
discoverd's retry loop covers the same case manager-agnostically, making
the late-connect a latency optimization rather than a correctness
requirement.

Renumber Open Questions to §13 and Glossary to §14.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Expand the RFC with reviewer-driven clarifications:

- §6.1: explain why discoverd uses its own discoverd.conf rather than
  reusing config.json (json-c dependency; JSON has no comments) or
  discovery.conf (DC-only; cannot express IOCs or global toggles).
- §12.1/§12.2: nvmf-connect.target was only a collective handle for the
  daemon-less design; the discoverd daemon is that coordinator, so no
  target equivalent is needed.
- §12.2: scope the autoconnect-rule replacement to the running system.
  Initramfs (Phase 1) connect is dracut's job (74nvmf, formerly 95nvmf),
  which has never used 70-nvmf-autoconnect.rules. Note that
  StartTransientUnit works over systemd's private socket without the
  D-Bus daemon, so discoverd is kept out of Phase 1 by design, not
  impossibility. Recommend removing the dead 70-nvmf-autoconnect.conf
  snippet from nvme-cli.
- §12.5: emphasize systemd-networkd as the common NetworkManager
  alternative that the legacy NBFT dispatcher path does not cover.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Reconcile the exclusion RFC with what was actually built (review item L13):

- §6.1: libnvmf_exclusion_match() takes a struct libnvmf_tid * rather than
  seven positional string arguments — fewer call-site mistakes.  Document the
  new libnvmf_exclusion_entry_valid() helper used to pre-validate hand-edited
  files without filesystem side effects.
- §6.3: mark the proposed key_value_list_parse()/_free() utilities as
  intentionally not provided; libnvmf_tid_parse() already covers parsing a
  semicolon-separated key=value string, and the matcher parses entries
  internally.
- §5: show the actual --name/--entry option form instead of positional
  <name>/<entry> arguments.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-orchestrator-coexistence.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md
Comment thread RFCs/rfc-nvme-discoverd.md
Comment thread RFCs/rfc-nvme-discoverd.md Outdated

@tbzatek tbzatek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well written design documents, I appreciate the level of detail!

Question about the legacy nvme connect-all --nbft: still being used by dracut, so I guess it stays, yet discoverd reimplements some logic. Would be nice to unify the codebase to avoid duplication. My concern is interpretation of various NBFT flags and processes... although certain best practices got added in the Boot Spec, there are still gaps. There have always been disparity between various UEFI implementations and the OS, even though we generally tend to avoid adding quirks in the nvme-cli NBFT code for firmware bugs.

Comment thread RFCs/rfc-nvme-orchestrator-coexistence.md Outdated
Comment thread RFCs/rfc-nvme-orchestrator-coexistence.md Outdated
Comment thread RFCs/rfc-nvme-orchestrator-coexistence.md
Comment thread RFCs/rfc-nvme-registry.md
Comment thread RFCs/rfc-nvme-registry.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
| `nvmf-connect-nbft.service` | systemd oneshot | On demand, started by the NM dispatcher on NBFT-interface up | **Replaced** — discoverd adopts/reconnects NBFT controllers from its NBFT cache (§7.1); see §12.5 |
| `80-nvmf-connect-nbft.sh` | NM dispatcher | NetworkManager interface-up for `nbft*`/HFI connections | **Replaced** — discoverd's retry loop, manager-agnostic (§12.5) |
| `70-nvmf-autoconnect.conf` | dracut conf | Build-time `install_items+=` snippet that copies `70-nvmf-autoconnect.rules` into the initramfs | **Remove outright** (pre-existing cruft). dracut has never used `70-nvmf-autoconnect.rules` and ships its own initramfs mechanism, so this snippet only copies an inert rule into the initrd. Safe to delete from nvme-cli independently of discoverd; early-boot connect is dracut's job (§7.1, §12.2) |
| `65-persistent-net-nbft.rules` | udev rule | Naming of `nbft*` interfaces | **Kept** — interface naming, not connect logic |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether this ever served its purpose... @mwilck ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you're right that it never really served a purpose, and the reason fits the rest of the picture. This rule does one thing — pin nbft* interface names so udev doesn't rename them — and the only thing that ever consumed that name was the NM dispatcher (80-nvmf-connect-nbft.sh), which matches nbft* to trigger the late-NBFT connect — i.e. the very path you just described as never really working. So with no working consumer, the naming rule has nothing to serve. I currently list it as "Kept" (orthogonal naming), but I'm inclined to move it to "Remove" alongside the NM-dispatcher path: if the late-NBFT machinery goes, this goes with it. I'll defer the definitive history to @mwilck — he'd know if anything else ever relied on the nbft* name — and tie the final disposition to the broader question of whether discoverd carries NBFT handling at all.

Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Comment thread RFCs/rfc-nvme-discoverd.md
Comment thread RFCs/rfc-nvme-discoverd.md
Comment thread RFCs/rfc-nvme-discoverd.md Outdated
Updates forllowing following peer reviews

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Comment thread RFCs/rfc-nvme-discoverd.md

`nvme connect` is used instead of `nvme discover --persistent` because `nvme discover` always issues a Get Log Page immediately — nvme-discoverd fetches the DLP separately via ioctl after the DC device appears, so it would be redundant.

`--keep-alive-tmo=30` sets the keep-alive timeout explicitly: `nvme connect`'s automatic 30 s discovery KATO only fires for the well-known discovery NQN (`nqn.2014-08.org.nvmexpress.discovery`). NVMe Base Spec 2.x allows DCs to use a unique NQN; for those, neither nvme-cli nor the kernel recognizes the connection as a discovery session, and the kernel falls back to its 5 s I/O-controller default (`NVME_DEFAULT_KATO`). Setting the option explicitly gives every DC session the DC-appropriate 30 s value regardless of which NQN it uses.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything special about KATO and DC? Just wondering why this is highlighted here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discoverd uses nvme connect to establish every session — Discovery Controller (DC) or I/O Controller (IOC) — rather than nvme discover --persistent (the command usually used for DCs). Because nvme connect is the command normally used for IOCs, its keep-alive defaults to the 5 s I/O value (NVME_DEFAULT_KATO); but a DC session is persistent and otherwise idle — it exists only to keep the session up and receive AENs — so I want the 30 s discovery keep-alive there instead.

That is why --keep-alive-tmo=30 is set explicitly, and it's a footgun worth highlighting: nvme connect only applies the 30 s automatically when it recognizes the well-known discovery NQN (nqn.2014-08.org.nvmexpress.discovery). NVMe Base Spec 2.x lets a DC use a unique NQN (CDCs do), and for those neither nvme-cli nor the kernel classifies the session as discovery, so the keep-alive silently stays at the 5 s I/O default. Setting it explicitly gives every DC session 30 s regardless of which NQN it uses.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I do remember now. Should we add additional knobs to the cli, --io-keep-alive-tmo, --dc-keep-alive-tmo?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth separating the two commands here, because the answer differs.

For nvme connect — which is the only one discoverd uses — I don't think the knobs help. nvme connect brings up exactly one controller, and the user already knows what they're connecting to: that's why we have nvme connect (typically used for I/O Controllers - IOC) and nvme discover (used for Discovery Controlers - DC) as separate commands. So the existing --keep-alive-tmo / -k already expresses everything — one controller, one value the admin chooses. A per-class split adds nothing here: with a single controller you'd just set -k to the value you want, and --dc-keep-alive-tmo on a connection you already know is an IOC (or vice-versa) is just a redundant spelling of -k. discoverd sets -k explicitly on every connection, so it needs nothing new.

nvme connect-all is the one command with the double duty — it loops over a discovery log and brings up both DCs (referral entries) and IOCs in a single shot, so one -k genuinely can't express a different keep-alive for the two classes. That's where --dc-keep-alive-tmo / --io-keep-alive-tmo do make sense, letting it set them independently (with the 30 s / 5 s defaults). So if we want the knobs, connect-all is their natural home — but that's a standalone connect-all enhancement, independent of this RFC, and irrelevant to discoverd, which never calls connect-all.

It's worth being clear about where the original footgun came from, because it's a library classification gap, not a user one. For a long time libnvme used the well-known discovery NQN (nqn.2014-08.org.nvmexpress.discovery) to tell a DC from an IOC, so nvme connect only applies the 30 s discovery default when it sees that NQN. TP8010 (CDCs) then allowed a DC to use a unique NQN, and libnvme had to catch up — a DC with a unique NQN still silently falls back to the 5 s I/O default, because the library can't recognize it as a DC from the NQN alone. The user knew it was a DC all along; the library didn't. discoverd sidesteps the whole thing by always setting -k explicitly.


DC units whose TID is present in the NBFT cache carry `--owner nbft` instead of `--owner discoverd` — the NBFT records the Discovery Controller the firmware used to find the boot targets. The substitution rule is shared with IOC units and described in §3.4.

`nvme connect` establishes the connection and exits; the kernel maintains the session and delivers future AENs on it. nvme-discoverd monitors for the resulting device-add event; after a ~1 s sysfs soak (sysfs attributes are not fully populated at the moment the `add` event fires), it checks the `cntrltype` attribute to confirm the device is a Discovery Controller. Once the device name is known (e.g., `/dev/nvme0`), libnvme is called directly (`libnvme_scan_ctrl` + `libnvmf_get_discovery_log`) to fetch the DLP via Get Log Page — a short ioctl on the already-connected device, not a `/dev/nvme-fabrics` write.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which sysfs entries are missing? This sounds like a bug in the kernel. When the udev is fired, all relevant entries need to be present IMO. Just recently, @maurizio-lombardi fixed the ordering of sysfs entries adding and the udev event fired.

@martin-belanger martin-belanger Jun 29, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came out of testing on a real system about five years ago. sysfs exposes two kinds of controller attribute: (A) the local parameters used to make the connection (transport, address, host NQN, …), which appear immediately; and (B) attributes retrieved from the remote controller, such as cntrltype, which can lag. What I observed was that the kernel emitted the add nvmeX event as soon as the connection was established, but the type-(B) remote attributes were not necessarily in sysfs yet — adding a ~1 s delay after receiving the udev event before reading cntrltype reliably worked around it.

That was a long time ago, so it may well be fixed by now. If @maurizio-lombardi's recent add-event ordering fix guarantees the remote attributes are populated before the add event fires, the soak is unnecessary and I'll drop it. I'll re-test on a current kernel — connect a DC and read cntrltype at the add event — and remove the soak if it reads correctly, noting the minimum kernel version in the RFC.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you are talking about this: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37953cec775ed34e59cf9a7d7bb9b0610daa3f3e

This patch doesn't have any relevant effect outside of those sysfs attributes that check for the NVME_CTRL_STARTED_ONCE flag; so I am pretty sure it won't fix the problem you are seeing

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed was just the NVME_EVENT=connected event. But this means we should at least review the relevant kernel code and figure out if this is not something we need to fix. I am allergic to random sleep sprinkled around.


Each DLPE in the DLP is processed based on its `subtype`:
- `NVME_NQN_NVME` — I/O Controller: create a transient IOC unit.
- `NVME_NQN_DISC` — Referral to another DC: create a transient DC unit for the referred DC. When that DC's device appears, its DLP is fetched and processed identically. Referral chains of arbitrary depth are followed naturally through the event loop — no explicit recursion is needed. Note: `nvme connect-all` traverses referrals the same way; `nvme discover` does not (it only fetches one level). If a referral DC becomes permanently unreachable, its per-DC DLP cache entry is eventually evicted; IOCs reachable exclusively through that referral chain fizzle out naturally as their connections drop and are not reconnected. This is intentional — the desired set is derived from current cache state. See §11 (DC Retention Policy) for the planned design.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add an option to nvme discover to support recursive lookups as well. Or is this stupid?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not stupid — I think it's genuinely useful. nvme discover today fetches one level and prints it; a recursive variant would follow referral DCs and print the whole reachable tree without connecting anything — the read-only mirror of what connect-all already does, and a handy "show me everything this fabric exposes" diagnostic.

I'd keep it as its own nvme-cli enhancement rather than fold it into this RFC, since it is independent of the daemon. Two things it would need: referral-loop detection (a misconfigured fabric can have cyclic referrals) and a way to present the tree (indent per level, or flatten with a "via" column). I'm happy to file a separate issue/RFC for it; it doesn't block discoverd either way.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's split this task out. I would like to avoid adding discoverd to the pre OS environment if possible. Thus nvme-cli should have at least the same feature set as one shot operation. (still not read through the whole RFC, I suppose there is something on this topic).

CollectMode=inactive-or-failed
```

**NBFT-sourced units use `--owner nbft`.** The NBFT can describe Discovery Controllers as well as I/O subsystems — the firmware records the DC it used to discover the boot targets — so this rule applies to both unit types. When discoverd creates a transient unit, DC or IOC, for a controller present in its NBFT cache, it substitutes `--owner nbft` for `--owner discoverd` in the `ExecStart=` line — the only difference between an NBFT unit and a regular unit. Because `RestartUnit` replays the original `ExecStart=` verbatim, all subsequent reconnects also carry `--owner nbft`, preserving the `owner=nbft` lifetime invariant without any reconnect-time special-casing. At unit-creation time, discoverd selects the template variant based on a cache lookup: TID present in the NBFT cache → `--owner nbft`; otherwise → `--owner discoverd`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand this correctly, you want to run discoverd in pre-OS env? I assume just a single pass, right? Shoudn't this part then single nvme-cli command?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rfc For tracking discussions new features etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants