Add generalized `(relations …)` CSV loading construct by hbarthels · Pull Request #259 · RelationalAI/logical-query-protocol

hbarthels · 2026-06-12T22:45:23Z

Summary

Adds a new (relations …) construct on CSVData, alongside the existing (columns …) (GNF) form, to support more general CSV loading: a shared set of key columns (or the special METADATA$KEY) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC grouping into (inserts …)/(deletes …).

This lets a load:

put several columns into a single relation, and
choose its own key column(s) instead of the implicit row id.

The legacy (columns …) form is untouched and remains fully supported (the two are mutually exclusive on a given CSVData).

Some examples:

;; No CDC. Produces a binary `edge` relation with keys `(src, dst)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

;; No CDC. Produces an arity-4 `edge` relation with weights and labels. Keys: `(src, dst)`. Values: `(weight, label)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge (column "weight" FLOAT) (column "label" STRING))))

;; CDC. Produces two output relations:
;; - `edge_insertions`, keys `(src, dst)`, values `(weight, label)`. Contains only insertions.
;; - `edge_deletions`, keys `(src, dst)`, values `()`. Contains only deletions.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

;; No CDC. Produces two output relations:
;; - `weights`, keys `(src, dst)`, values `(weight)`
;; - `labels`, keys `(src, dst)`, values `(label)`
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

;; CDC GNF data load.
(relations
  (keys
    (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))
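
As a quick reading aid for the examples above: each output relation's arity is the number of shared key columns plus the number of its own value columns. The following is a minimal Python sketch of that rule only; the `arities` helper and the plain-dict layout are illustrative and not part of this PR or of any LQP SDK.

```python
# Illustrative only: a plain-dict stand-in for a (relations ...) config.
def arities(keys, outputs):
    """Map each output relation name to len(keys) + len(its value columns)."""
    return {name: len(keys) + len(values) for name, values in outputs.items()}


keys = ["src", "dst"]
outputs = {"edge_insertions": ["weight", "label"], "edge_deletions": []}
print(arities(keys, outputs))  # {'edge_insertions': 4, 'edge_deletions': 2}
```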

Changes

proto (logic.proto): NamedColumn, OutputRelation, Relations messages; optional Relations relations on CSVData.
grammar (grammar.y): relations / relation_keys / output_relation / named_column rules. relation_body returns a concrete Relations (not a tuple) so the Go parser stays type-stable.
Regenerated Python / Julia / Go parsers, pretty-printers, and protobuf bindings.
Julia SDK: global_ids and ==/hash/isequal extended for the new messages.
Fixtures: relations_edge_binary, relations_edge_arity4, relations_split, relations_cdc (+ regenerated bin/pretty/pretty_debug snapshots).

🤖 Generated with Claude Code

Adds a new `(relations …)` construct on `CSVData` alongside the legacy `(columns …)` form: a shared set of key columns (or the special `METADATA$KEY`) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC `(inserts …)`/`(deletes …)` grouping.

- proto: `NamedColumn` / `OutputRelation` / `Relations` messages + an optional `Relations relations` field on `CSVData` (mutually exclusive with `columns`)
- grammar: `relations` / `relation_keys` / `output_relation` / `named_column` rules; `relation_body` returns a concrete `Relations` to keep the Go parser type-stable
- regenerated Python / Julia / Go parsers, pretty-printers, and protobuf bindings; `global_ids` + equality extended for the new messages (Julia SDK)
- `.lqp` fixtures (binary edge, arity-4 edge, two-relation split, CDC) + regenerated bin / pretty / pretty_debug snapshots

`make test` green across Python, Julia, and Go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Align the CDC relations fixture with the rest of the edge-themed suite: use s3://bucket/edges.csv (matching relations_edge_binary / relations_edge_arity4) and rename the delta relations to :weight_ins / :weight_del, since they carry a weight column. Regenerated the .bin and pretty/pretty_debug snapshot goldens (name-derived relation hashes updated accordingly). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#   sdks/go/src/parser.go
#   sdks/julia/LogicalQueryProtocol.jl/src/parser.jl
#   sdks/python/src/lqp/gen/parser.py

Regenerate parsers, protobuf bindings, pretty printers, and test fixtures (.bin + pretty/pretty_debug snapshots) from the post-merge grammar and protos. Parsers changed; protobuf bindings and printers were already in sync. Test artifacts pick up master's new :csv_compression default ("auto" -> ""). All Python/Go/Julia tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

comnik · 2026-06-15T22:30:14Z

+// A single named CSV column with its type. Used to describe both shared key columns and
+// per-relation value columns in the generalized `Relations` loading construct.
+message NamedColumn {
+  string name = 1; // CSV column name (e.g. "src"); special name "METADATA$KEY" => derived hash
+  Type type = 2;   // Column type
+}
+
+// One output relation: the shared keys plus this relation's own (possibly empty) value columns.
+message OutputRelation {
+  RelationId target_id = 1;        // Output relation path
+  repeated NamedColumn values = 2; // Value columns for this relation (may be empty)
+}
+
+// Generalized CSV loading: a shared set of key columns and one or more output relations.
+// CDC vs non-CDC is implied by which group is populated:
+//   - `relations` populated => non-CDC outputs
+//   - `inserts`/`deletes`   => CDC insert/delete groups
+message Relations {
+  repeated NamedColumn keys = 1;         // Shared key columns (name "METADATA$KEY" => derived hash)
+  repeated OutputRelation relations = 2; // Non-CDC outputs
+  repeated OutputRelation inserts = 3;   // CDC insert group
+  repeated OutputRelation deletes = 4;   // CDC delete group
+}
+


I was wondering if relations was maybe too generic to burn on this use case. Seems like you arrived at a similar conclusion, given that you went with OutputRelation. But Output also has a different meaning in LQP already. How about TargetRelations and TargetRelation?

Let's also keep this generic and not tied to CSV specifically, I assume we can reuse this for other types of external data. I don't think anything you have above is specific to CSV, except for the comments.

+1 for Target

Sure.

Yeah, the (relations ...) construct is not specific to CSV. I was planning to use the same for Iceberg. I will update the comments.

Not sure how I missed that, but I just noticed that the outputs keyword that I used in the examples above is missing from the grammar. 🤦‍♂️ So at the moment it's

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (relation :edge))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

and

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (inserts
    (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
  (deletes
    (relation :edge_deletions)))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

Do you have a preference between leaving it as it is, adding (outputs ...), or maybe using (targets ...)?

davidwzhao · 2026-06-16T01:13:03Z

Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

Rename the generalized CSV-loading proto messages Relations -> TargetRelations and OutputRelation -> TargetRelation, plus the matching grammar nonterminals (relations -> target_relations, output_relation -> target_relation) and all comments. The LQP s-expression syntax is unchanged: the (relations …)/(relation …)/(inserts …)/(deletes …) keywords, proto field names, and wire format are all preserved (no .bin or pretty-snapshot changes). Regenerated parsers, printers, and protobuf bindings for all three SDKs; updated hand-written Julia equality. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The generalized loading construct (NamedColumn / TargetRelation / TargetRelations) is not CSV-specific — it will be used for other input types too. Remove "CSV" from those messages' comments; CSV wording stays on the CSV-specific CSVData message. Regenerated the Go binding (the only generated SDK that embeds proto comments). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The generalized-loading SDK types added in this branch had no coverage in equality_tests.jl. Add testitems for NamedColumn, TargetRelation, and TargetRelations following the existing equality/inequality/hash/ reflexivity/symmetry/transitivity pattern, including the non-CDC (relations) and CDC (inserts/deletes) groupings of TargetRelations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

minsungc · 2026-06-16T16:44:53Z

+//   - `relations` populated => non-CDC outputs
+//   - `inserts`/`deletes`   => CDC insert/delete groups


Does this mean that the relations and the inserts/deletes portions of TargetRelations are mutually exclusive, i.e. either one or the other is populated? If so, maybe we can express this as a `oneof` of CDC/non-CDC?

Yeah, they should be mutually exclusive. I will look into that.

I changed it to a OneOf.

hbarthels · 2026-06-16T18:27:05Z

> Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

Sure. I asked Claude to generate a few examples and added some comments:

1. Binary `edge` (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge)))

CSV

src,dst
1,2
1,3
4,5

Resulting relations

:edge — key (src::Int, dst::Int), no values (arity 2)

Missing values are not allowed in this case, and we assume that the keys are unique.
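
To illustrate the two constraints stated for this keys-only case, here is a minimal Python sketch that reads the CSV above into a set of `(src, dst)` tuples. It is not the loader from this PR: `load_keys_only` is a hypothetical helper, and raising an error on a missing value or a duplicate key is an assumption about how a violation might be handled.

```python
import csv
import io


def load_keys_only(text, key_cols):
    """Return the set of key tuples; reject missing values and duplicate keys."""
    tuples = set()
    for row in csv.DictReader(io.StringIO(text)):
        if any(row[c] in (None, "") for c in key_cols):
            raise ValueError(f"missing key value in row {row}")
        key = tuple(int(row[c]) for c in key_cols)
        if key in tuples:
            # Assumption: the thread only says keys are assumed unique.
            raise ValueError(f"duplicate key {key}")
        tuples.add(key)
    return tuples


CSV_TEXT = "src,dst\n1,2\n1,3\n4,5\n"
edge = load_keys_only(CSV_TEXT, ["src", "dst"])
print(sorted(edge))  # [(1, 2), (1, 3), (4, 5)]
```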

2. Arity-4 `edge` with values (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge (column "weight" FLOAT) (column "label" STRING))))

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

:edge — key (src::Int, dst::Int) → value (weight::Float64, label::String) (arity 4)

In this case, the value columns of each row must be either all present or all missing. Anything in between is not allowed.
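
The all-or-nothing rule for value columns can be sketched as a small check. This is a hypothetical helper (`classify_values` is not a name from the PR); it only classifies a row and deliberately leaves open what a loader does with a row whose values are all missing, since the thread does not say.

```python
def classify_values(row, value_cols):
    """Return "present" or "missing"; raise if only some values are set."""
    present = [row.get(c) not in (None, "") for c in value_cols]
    if all(present):
        return "present"
    if not any(present):
        return "missing"
    raise ValueError(f"partially populated values: {row}")


cols = ["weight", "label"]
print(classify_values({"src": "1", "dst": "2", "weight": "0.5", "label": "road"}, cols))  # present
print(classify_values({"src": "1", "dst": "3", "weight": "", "label": ""}, cols))  # missing
```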

3. CDC with a custom key — insertions + keys-only deletions

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (inserts (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes (relation :edge_deletions))))

A CDC CSV carries the metadata columns METADATA$ACTION / METADATA$ISUPDATE / METADATA$ROW_ID. METADATA$ACTION routes each row to the insert group or the delete group. Because the deletes relation declares no value columns, DELETE rows only need their key columns populated (the value columns are ignored / may be empty):

CSV

src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
1,2,0.5,road,INSERT,false,00000000000000000000000000000001
1,3,1.5,rail,INSERT,false,00000000000000000000000000000002
4,5,,,DELETE,false,00000000000000000000000000000003

Resulting relations

:edge_insertions — key (src, dst) → value (weight, label), only INSERT rows:
edge_insertions(1, 2, 0.5, "road"), edge_insertions(1, 3, 1.5, "rail")
:edge_deletions — key (src, dst), no values, only DELETE rows (the key identifies the row to delete):
edge_deletions(4, 5)
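
The routing described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `route_cdc` is a hypothetical helper, the converters stand in for the declared column types, and rejecting any `METADATA$ACTION` value other than `INSERT` or `DELETE` is an assumption.

```python
import csv
import io


def route_cdc(text, key_cols, insert_values):
    """Split rows by METADATA$ACTION into (inserts, deletes) tuple lists."""
    inserts, deletes = [], []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(int(row[c]) for c in key_cols)
        action = row["METADATA$ACTION"]
        if action == "INSERT":
            values = tuple(conv(row[name]) for name, conv in insert_values)
            inserts.append(key + values)
        elif action == "DELETE":
            # The deletes relation declares no value columns: keep keys only.
            deletes.append(key)
        else:
            raise ValueError(f"unexpected METADATA$ACTION: {action!r}")
    return inserts, deletes


CSV_TEXT = (
    "src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID\n"
    "1,2,0.5,road,INSERT,false,00000000000000000000000000000001\n"
    "1,3,1.5,rail,INSERT,false,00000000000000000000000000000002\n"
    "4,5,,,DELETE,false,00000000000000000000000000000003\n"
)
ins, dels = route_cdc(CSV_TEXT, ["src", "dst"], [("weight", float), ("label", str)])
print(ins)   # [(1, 2, 0.5, 'road'), (1, 3, 1.5, 'rail')]
print(dels)  # [(4, 5)]
```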

4. Split into two relations sharing a key (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

One CSV row populates both relations; each picks out its own value column under the shared key.

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

:weights — key (src, dst) → value (weight):
weights(1, 2, 0.5), weights(1, 3, 1.5)
:labels — key (src, dst) → value (label):
labels(1, 2, "road"), labels(1, 3, "rail")

In this case, it's fine for weight or label to be missing.
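
A minimal Python sketch of the split, for illustration only. `split_relations` is a hypothetical helper, and the handling of a missing value (the affected relation simply gets no tuple for that key, while the other relation is unaffected) is an assumption about what "fine to be missing" means here.

```python
import csv
import io


def split_relations(text, key_cols, outputs):
    """outputs maps relation name -> (column, converter); returns name -> tuples."""
    result = {name: [] for name in outputs}
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(int(row[c]) for c in key_cols)
        for name, (col, conv) in outputs.items():
            if row[col] not in (None, ""):  # assumption: skip only this relation
                result[name].append(key + (conv(row[col]),))
    return result


# Second row has no weight, so only `labels` receives a tuple for (1, 3).
CSV_TEXT = "src,dst,weight,label\n1,2,0.5,road\n1,3,,rail\n"
out = split_relations(
    CSV_TEXT,
    ["src", "dst"],
    {"weights": ("weight", float), "labels": ("label", str)},
)
print(out["weights"])  # [(1, 2, 0.5)]
print(out["labels"])   # [(1, 2, 'road'), (1, 3, 'rail')]
```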

5. CDC, row-hash key (`METADATA$KEY`) — the key-set pattern

(relations
  (keys (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))

METADATA$KEY is not a literal CSV column — it's a special key name meaning "derive a UINT128 row hash from the CSV's METADATA$ROW_ID column." That hash h is the shared key for every output relation. INSERT rows feed all three insert relations; DELETE rows feed the delete relation. :meta_key_insert / :meta_key_delete declare no value columns, so they are value-less key sets (just the row hash).

CSV

aaa,bbb,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
10,0.5,INSERT,false,00000000000000000000000000000001
20,1.5,INSERT,false,00000000000000000000000000000002
,,DELETE,false,00000000000000000000000000000003

Let h1, h2, h3 be the UINT128 hashes of row-ids …0001, …0002, …0003.

Resulting relations (all keyed by the row hash h::UInt128)

:aaa — key (h) → value (aaa::Int), INSERT rows only:
aaa(h1, 10), aaa(h2, 20)
:bbb — key (h) → value (bbb::Float64), INSERT rows only:
bbb(h1, 0.5), bbb(h2, 1.5)
:meta_key_insert — key (h), no values; the set of inserted row hashes:
meta_key_insert(h1), meta_key_insert(h2)
:meta_key_delete — key (h), no values; the set of deleted row hashes:
meta_key_delete(h3)
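
The key-set pattern can be sketched as follows. The thread does not specify how the UINT128 hash is derived from `METADATA$ROW_ID`, so `derive_key` below is a purely hypothetical stand-in (a 16-byte BLAKE2b digest) chosen only to produce a 128-bit integer; `key_sets` is likewise an illustrative name.

```python
import hashlib


def derive_key(row_id: str) -> int:
    """Hypothetical stand-in: 128-bit integer derived from METADATA$ROW_ID."""
    digest = hashlib.blake2b(row_id.encode("utf-8"), digest_size=16).digest()
    return int.from_bytes(digest, "big")


def key_sets(rows):
    """rows: iterable of (action, row_id). Returns (inserted, deleted) key sets."""
    inserted, deleted = set(), set()
    for action, row_id in rows:
        if action == "INSERT":
            inserted.add(derive_key(row_id))
        elif action == "DELETE":
            deleted.add(derive_key(row_id))
        else:
            raise ValueError(f"unexpected METADATA$ACTION: {action!r}")
    return inserted, deleted


rows = [
    ("INSERT", "00000000000000000000000000000001"),
    ("INSERT", "00000000000000000000000000000002"),
    ("DELETE", "00000000000000000000000000000003"),
]
meta_key_insert, meta_key_delete = key_sets(rows)
print(len(meta_key_insert), len(meta_key_delete))  # 2 1
```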

Replace the flat relations/inserts/deletes fields on TargetRelations with a `oneof body { PlainTargets plain; CdcTargets cdc; }`, making the mutually-exclusive plain (non-CDC) and CDC modes explicit instead of inferred from which repeated field happens to be populated. PlainTargets wraps `targets`; CdcTargets wraps `inserts`/`deletes` (oneof can't hold repeated fields directly, hence the wrapper messages). The LQP s-expression syntax is unchanged — only the grammar's construct/deconstruct helpers and the deconstruct guard (now switching on the oneof case via has_proto_field) change, so pretty/pretty_debug snapshots are untouched; only .bin wire encodings shift. Regenerated all three SDKs; updated hand-written Julia equality (PlainTargets/CdcTargets + oneof-based TargetRelations), properties global_ids, and equality tests. All Python/Go/Julia suites pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
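
For readers less familiar with protobuf `oneof`, the shape described in this commit can be approximated in Python as a union of two wrapper types. This is a sketch for illustration, not the generated binding: the message and field names follow the commit message, while representing `type` and `target_id` as plain strings and the `is_cdc` helper are simplifications introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class NamedColumn:
    name: str
    type: str  # simplified: the proto uses a Type message


@dataclass
class TargetRelation:
    target_id: str  # simplified: the proto uses a RelationId
    values: List[NamedColumn] = field(default_factory=list)


@dataclass
class PlainTargets:
    targets: List[TargetRelation]


@dataclass
class CdcTargets:
    inserts: List[TargetRelation]
    deletes: List[TargetRelation]


@dataclass
class TargetRelations:
    keys: List[NamedColumn]
    body: Union[PlainTargets, CdcTargets]  # exactly one mode, like the oneof


def is_cdc(tr: TargetRelations) -> bool:
    """The mode is read from the variant, not inferred from empty lists."""
    return isinstance(tr.body, CdcTargets)


keys = [NamedColumn("src", "INT"), NamedColumn("dst", "INT")]
plain = TargetRelations(keys, PlainTargets([TargetRelation("edge")]))
cdc = TargetRelations(
    keys,
    CdcTargets([TargetRelation("edge_insertions")], [TargetRelation("edge_deletions")]),
)
print(is_cdc(plain), is_cdc(cdc))  # False True
```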

hbarthels and others added 4 commits June 13, 2026 00:45

Merge branch 'main' into hb-generalize-csv-loading

220987b

hbarthels marked this pull request as ready for review June 15, 2026 15:24

hbarthels requested a review from comnik June 15, 2026 15:24

comnik reviewed Jun 15, 2026

Comment thread sdks/julia/LogicalQueryProtocol.jl/src/equality.jl

hbarthels and others added 3 commits June 16, 2026 17:15

minsungc reviewed Jun 16, 2026

Add generalized `(relations …)` CSV loading construct #259
hbarthels wants to merge 8 commits into main from hb-generalize-csv-loading

4 participants
