Add generalized (relations …) CSV loading construct#259

Open
hbarthels wants to merge 6 commits into
mainfrom
hb-generalize-csv-loading

Conversation

@hbarthels

@hbarthels hbarthels commented Jun 12, 2026

Contributor

Summary

Adds a new (relations …) construct on CSVData, alongside the existing (columns …) (GNF) form, to support more general CSV loading: a shared set of key columns (or the special METADATA$KEY) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC grouping into (inserts …)/(deletes …).

This lets a load:

  • put several columns into a single relation, and
  • choose its own key column(s) instead of the implicit row id.

The legacy (columns …) form is untouched and remains fully supported (the two are mutually exclusive on a given CSVData).

Some examples:

;; No CDC. Produces a binary `edge` relation with keys `(src, dst)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

;; No CDC. Produces an arity-4 `edge` relation with weights and labels. Keys: `(src, dst)`. Values: `(weight, label)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge (column "weight" FLOAT) (column "label" STRING))))

;; CDC. Produces two output relations:
;; - `edge_insertions`, keys `(src, dst)`, values `(weight, label)`. Contains only insertions.
;; - `edge_deletions`, keys `(src, dst)`, values `()`. Contains only deletions.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

;; No CDC. Produces two output relations:
;; - `weights`, keys `(src, dst)`, values `(weight)`
;; - `labels`, keys `(src, dst)`, values `(label)`
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

;; CDC GNF data load.
(relations
  (keys
    (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))

Changes

  • proto (logic.proto): NamedColumn, OutputRelation, Relations messages; optional Relations relations on CSVData.
  • grammar (grammar.y): relations / relation_keys / output_relation / named_column rules. relation_body returns a concrete Relations (not a tuple) so the Go parser stays type-stable.
  • Regenerated Python / Julia / Go parsers, pretty-printers, and protobuf bindings.
  • Julia SDK: global_ids and ==/hash/isequal extended for the new messages.
  • Fixtures: relations_edge_binary, relations_edge_arity4, relations_split, relations_cdc (+ regenerated bin/pretty/pretty_debug snapshots).

🤖 Generated with Claude Code

hbarthels and others added 4 commits June 13, 2026 00:45
Adds a new `(relations …)` construct on `CSVData` alongside the legacy
`(columns …)` form: a shared set of key columns (or the special
`METADATA$KEY`) plus one or more output relations, each with its own
(possibly empty) value columns, with optional CDC `(inserts …)`/`(deletes …)`
grouping.

- proto: `NamedColumn` / `OutputRelation` / `Relations` messages + an optional
  `Relations relations` field on `CSVData` (mutually exclusive with `columns`)
- grammar: `relations` / `relation_keys` / `output_relation` / `named_column`
  rules; `relation_body` returns a concrete `Relations` to keep the Go parser
  type-stable
- regenerated Python / Julia / Go parsers, pretty-printers, and protobuf
  bindings; `global_ids` + equality extended for the new messages (Julia SDK)
- `.lqp` fixtures (binary edge, arity-4 edge, two-relation split, CDC) +
  regenerated bin / pretty / pretty_debug snapshots

`make test` green across Python, Julia, and Go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Align the CDC relations fixture with the rest of the edge-themed
suite: use s3://bucket/edges.csv (matching relations_edge_binary /
relations_edge_arity4) and rename the delta relations to :weight_ins /
:weight_del, since they carry a weight column. Regenerated the .bin
and pretty/pretty_debug snapshot goldens (name-derived relation hashes
updated accordingly).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	sdks/go/src/parser.go
#	sdks/julia/LogicalQueryProtocol.jl/src/parser.jl
#	sdks/python/src/lqp/gen/parser.py
Regenerate parsers, protobuf bindings, pretty printers, and test
fixtures (.bin + pretty/pretty_debug snapshots) from the post-merge
grammar and protos. Parsers changed; protobuf bindings and printers
were already in sync. Test artifacts pick up master's new
:csv_compression default ("auto" -> ""). All Python/Go/Julia tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hbarthels hbarthels marked this pull request as ready for review June 15, 2026 15:24
@hbarthels hbarthels requested a review from comnik June 15, 2026 15:24
Comment thread sdks/julia/LogicalQueryProtocol.jl/src/equality.jl
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
Comment on lines +291 to +314
// A single named CSV column with its type. Used to describe both shared key columns and
// per-relation value columns in the generalized `Relations` loading construct.
message NamedColumn {
string name = 1; // CSV column name (e.g. "src"); special name "METADATA$KEY" => derived hash
Type type = 2; // Column type
}

// One output relation: the shared keys plus this relation's own (possibly empty) value columns.
message OutputRelation {
RelationId target_id = 1; // Output relation path
repeated NamedColumn values = 2; // Value columns for this relation (may be empty)
}

// Generalized CSV loading: a shared set of key columns and one or more output relations.
// CDC vs non-CDC is implied by which group is populated:
// - `relations` populated => non-CDC outputs
// - `inserts`/`deletes` => CDC insert/delete groups
message Relations {
repeated NamedColumn keys = 1; // Shared key columns (name "METADATA$KEY" => derived hash)
repeated OutputRelation relations = 2; // Non-CDC outputs
repeated OutputRelation inserts = 3; // CDC insert group
repeated OutputRelation deletes = 4; // CDC delete group
}

Collaborator

I was wondering if relations was maybe too generic to burn on this use case. Seems like you arrived at a similar conclusion, given that you went with OutputRelation. But Output also has a different meaning in LQP already. How about TargetRelations and TargetRelation?

Let's also keep this generic and not tied to CSV specifically, I assume we can reuse this for other types of external data. I don't think anything you have above is specific to CSV, except for the comments.

Contributor

+1 for Target

Contributor Author

Sure.

Yeah, the (relations ...) construct is not specific to CSV. I was planning to use the same for Iceberg. I will update the comments.

Contributor Author

Not sure how I missed that, but I just noticed that the outputs keyword that I used in the examples above is missing from the grammar. 🤦‍♂️ So at the moment it's

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (relation :edge))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

and

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (inserts
    (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
  (deletes
    (relation :edge_deletions)))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

Do you have a preference between leaving it as it is, adding (outputs ...), or maybe using (targets ...)?

@davidwzhao

Contributor

Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

hbarthels and others added 2 commits June 16, 2026 17:15
Rename the generalized CSV-loading proto messages Relations ->
TargetRelations and OutputRelation -> TargetRelation, plus the matching
grammar nonterminals (relations -> target_relations, output_relation ->
target_relation) and all comments. The LQP s-expression syntax is
unchanged: the (relations …)/(relation …)/(inserts …)/(deletes …)
keywords, proto field names, and wire format are all preserved (no .bin
or pretty-snapshot changes). Regenerated parsers, printers, and protobuf
bindings for all three SDKs; updated hand-written Julia equality.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generalized loading construct (NamedColumn / TargetRelation /
TargetRelations) is not CSV-specific — it will be used for other input
types too. Remove "CSV" from those messages' comments; CSV wording stays
on the CSV-specific CSVData message. Regenerated the Go binding (the
only generated SDK that embeds proto comments).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment on lines +306 to +307
// - `relations` populated => non-CDC outputs
// - `inserts`/`deletes` => CDC insert/delete groups

Contributor

Does this mean that the relations and the inserts/deletes portion of TargetRelations are mutually exclusive? e.g. either one or the other is populated? If so, maybe we can express this in a OneOf of CDC/non-CDC?

Contributor Author

Yeah, they should be mutually exclusive. I will look into that.
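
For reference, the mutual-exclusivity rule discussed here can be pictured as a small validation step, independent of whether it ends up encoded as a proto oneof. The sketch below is only an illustration in Python: the dataclasses mirror the renamed proto messages (NamedColumn, TargetRelation, TargetRelations) by hand rather than using the generated bindings, and `load_mode` is a hypothetical helper name that does not exist in the SDKs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NamedColumn:
    name: str  # column name, e.g. "src" or the special "METADATA$KEY"
    type: str  # type name as written in the examples, e.g. "INT"


@dataclass
class TargetRelation:
    target_id: str                                   # stand-in for RelationId
    values: List[NamedColumn] = field(default_factory=list)  # may be empty


@dataclass
class TargetRelations:
    keys: List[NamedColumn]
    relations: List[TargetRelation] = field(default_factory=list)  # non-CDC
    inserts: List[TargetRelation] = field(default_factory=list)    # CDC inserts
    deletes: List[TargetRelation] = field(default_factory=list)    # CDC deletes


def load_mode(cfg: TargetRelations) -> str:
    """Hypothetical check: return "cdc" or "non-cdc", rejecting mixed configs."""
    has_plain = bool(cfg.relations)
    has_cdc = bool(cfg.inserts or cfg.deletes)
    if has_plain and has_cdc:
        raise ValueError("relations and inserts/deletes are mutually exclusive")
    if not has_plain and not has_cdc:
        raise ValueError("at least one output relation is required")
    return "cdc" if has_cdc else "non-cdc"
```

A oneof in the proto would make the mixed case unrepresentable on the wire; a runtime check like this one would only matter for configs built by hand.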

@hbarthels

Contributor Author

Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

Sure. I asked Claude to generate a few examples and added some comments:

1. Binary edge (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge)))

CSV

src,dst
1,2
1,3
4,5

Resulting relations

  • :edge — key (src::Int, dst::Int), no values (arity 2)

Missing values are not allowed in this case, and we assume that the keys are unique.

2. Arity-4 edge with values (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge (column "weight" FLOAT) (column "label" STRING))))

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

  • :edge — key (src::Int, dst::Int) → value (weight::Float64, label::String) (arity 4)

In this case, the value columns of a row must be either all present or all missing; a row with only some of them populated is not allowed.

3. CDC with a custom key — insertions + keys-only deletions

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (inserts (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes (relation :edge_deletions))))

A CDC CSV carries the metadata columns METADATA$ACTION / METADATA$ISUPDATE / METADATA$ROW_ID. METADATA$ACTION routes each row to the insert group or the delete group. Because the deletes relation declares no value columns, DELETE rows only need their key columns populated (the value columns are ignored / may be empty):

CSV

src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
1,2,0.5,road,INSERT,false,00000000000000000000000000000001
1,3,1.5,rail,INSERT,false,00000000000000000000000000000002
4,5,,,DELETE,false,00000000000000000000000000000003

Resulting relations

  • :edge_insertions — key (src, dst) → value (weight, label), only INSERT rows:
    edge_insertions(1, 2, 0.5, "road"), edge_insertions(1, 3, 1.5, "rail")
  • :edge_deletions — key (src, dst), no values, only DELETE rows (the key identifies the row to delete):
    edge_deletions(4, 5)
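
The routing described in this example can be sketched in a few lines of Python. This is not the engine's loader, just an illustration under the assumption that METADATA$ACTION takes the values INSERT and DELETE as in the CSV above; `load_cdc` is a hypothetical name.

```python
import csv
import io

CSV_TEXT = """\
src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
1,2,0.5,road,INSERT,false,00000000000000000000000000000001
1,3,1.5,rail,INSERT,false,00000000000000000000000000000002
4,5,,,DELETE,false,00000000000000000000000000000003
"""


def load_cdc(text):
    """Route each row to the insert or delete relation by METADATA$ACTION."""
    edge_insertions = []
    edge_deletions = []
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["src"]), int(row["dst"]))
        action = row["METADATA$ACTION"]
        if action == "INSERT":
            # Insert relation: shared key plus its declared value columns.
            edge_insertions.append(key + (float(row["weight"]), row["label"]))
        elif action == "DELETE":
            # Delete relation declares no value columns: the key alone is kept.
            edge_deletions.append(key)
        else:
            raise ValueError(f"unexpected METADATA$ACTION: {action!r}")
    return edge_insertions, edge_deletions


edge_insertions, edge_deletions = load_cdc(CSV_TEXT)
# edge_insertions == [(1, 2, 0.5, "road"), (1, 3, 1.5, "rail")]
# edge_deletions == [(4, 5)]
```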

4. Split into two relations sharing a key (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

One CSV row populates both relations; each picks out its own value column under the shared key.

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

  • :weights — key (src, dst) → value (weight):
    weights(1, 2, 0.5), weights(1, 3, 1.5)
  • :labels — key (src, dst) → value (label):
    labels(1, 2, "road"), labels(1, 3, "rail")

In this case, it's fine for weight or label to be missing.
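
To make the missing-value rules from examples 1, 2 and 4 concrete, here is a rough Python sketch of a non-CDC load. It rests on assumptions not stated in this thread: an empty CSV field counts as "missing", and a relation whose value columns are all missing for a row simply gets no tuple for that row. `load_relations` and its argument shapes are hypothetical.

```python
import csv
import io


def load_relations(text, keys, outputs):
    """keys: list of (column, parser); outputs: {relation: [(column, parser)]}."""
    result = {name: [] for name in outputs}
    for row in csv.DictReader(io.StringIO(text)):
        if any(row[col] == "" for col, _ in keys):
            raise ValueError("key columns must not be missing")
        key = tuple(parse(row[col]) for col, parse in keys)
        for name, cols in outputs.items():
            present = [row[col] != "" for col, _ in cols]
            if all(present):
                # Also covers relations with no value columns (all([]) is True).
                values = tuple(parse(row[col]) for col, parse in cols)
                result[name].append(key + values)
            elif any(present):
                raise ValueError(f"{name}: values must be all present or all missing")
            # Otherwise all values are missing: assumed to yield no tuple.
    return result


KEYS = [("src", int), ("dst", int)]
SPLIT = {"weights": [("weight", float)], "labels": [("label", str)]}
TEXT = "src,dst,weight,label\n1,2,0.5,road\n1,3,1.5,\n"

split_result = load_relations(TEXT, KEYS, SPLIT)
# split_result["weights"] == [(1, 2, 0.5), (1, 3, 1.5)]
# split_result["labels"] == [(1, 2, "road")]
```

With the split config, the second row's missing label only affects :labels. The same CSV loaded with the single arity-4 :edge relation from example 2 would be rejected, because that row has weight but no label.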

5. CDC, row-hash key (METADATA$KEY) — the key-set pattern

(relations
  (keys (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))

METADATA$KEY is not a literal CSV column — it's a special key name meaning "derive a UINT128 row hash from the CSV's METADATA$ROW_ID column." That hash h is the shared key for every output relation. INSERT rows feed all three insert relations; DELETE rows feed the delete relation. :meta_key_insert / :meta_key_delete declare no value columns, so they are value-less key sets (just the row hash).

CSV

aaa,bbb,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
10,0.5,INSERT,false,00000000000000000000000000000001
20,1.5,INSERT,false,00000000000000000000000000000002
,,DELETE,false,00000000000000000000000000000003

Let h1, h2, h3 be the UINT128 hashes of row-ids …0001, …0002, …0003.

Resulting relations (all keyed by the row hash h::UInt128)

  • :aaa — key (h) → value (aaa::Int), INSERT rows only:
    aaa(h1, 10), aaa(h2, 20)
  • :bbb — key (h) → value (bbb::Float64), INSERT rows only:
    bbb(h1, 0.5), bbb(h2, 1.5)
  • :meta_key_insert — key (h), no values; the set of inserted row hashes:
    meta_key_insert(h1), meta_key_insert(h2)
  • :meta_key_delete — key (h), no values; the set of deleted row hashes:
    meta_key_delete(h3)
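
A small Python sketch of the key-set pattern, for illustration only. The actual hash function used to derive the UINT128 key from METADATA$ROW_ID is not specified in this thread, so `row_hash` below is a stand-in (the first 16 bytes of SHA-256) and will not match the real hashes. `load_gnf_cdc` is a hypothetical name, and skipping an empty value field is an assumption.

```python
import csv
import hashlib
import io

CSV_TEXT = """\
aaa,bbb,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
10,0.5,INSERT,false,00000000000000000000000000000001
20,1.5,INSERT,false,00000000000000000000000000000002
,,DELETE,false,00000000000000000000000000000003
"""


def row_hash(row_id: str) -> int:
    # STAND-IN hash, assumed for illustration: not the engine's real derivation.
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:16], "big")  # fits in 128 bits


def load_gnf_cdc(text):
    out = {"aaa": [], "bbb": [], "meta_key_insert": [], "meta_key_delete": []}
    for row in csv.DictReader(io.StringIO(text)):
        h = row_hash(row["METADATA$ROW_ID"])  # shared key for every relation
        action = row["METADATA$ACTION"]
        if action == "INSERT":
            if row["aaa"] != "":
                out["aaa"].append((h, int(row["aaa"])))
            if row["bbb"] != "":
                out["bbb"].append((h, float(row["bbb"])))
            out["meta_key_insert"].append((h,))  # value-less key set
        elif action == "DELETE":
            out["meta_key_delete"].append((h,))  # value-less key set
        else:
            raise ValueError(f"unexpected METADATA$ACTION: {action!r}")
    return out


result = load_gnf_cdc(CSV_TEXT)
```

Each INSERT row contributes one tuple to every insert relation under the same key, which is what lets the value relations :aaa and :bbb be joined back together through :meta_key_insert.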

4 participants