Optimizer should bias to prefer `RightSemi` over `LeftSemi` #22931

### Is your feature request related to a problem or challenge?

To evaluate a semijoin, we support two orientations: `LeftSemi` or `RightSemi` (analogously for anti and mark joins; I'll just refer to semijoins here to simplify the discussion). Under `RightSemi`, we build the non-preserved ("filter") input and stream the preserved input; under `LeftSemi`, the roles are swapped. While these two orientations might seem symmetrical, there are actually significant differences in evaluation behavior between them:

* The build-side hash table has to be resident in memory; all else being equal, building the smaller join input is a good general rule, and that's the main rule we follow today.
* `RightSemi` only needs to store the join keys for the build side; `LeftSemi` needs to store wider rows, because it builds on the preserved side and has to keep the columns the join's consumer reads. By definition, the consumer of a semijoin can't be interested in any values from the filter side of the join. So even if the filter side has more rows than the preserved side, building the hash table on the filter side might still require less memory.
* `RightSemi` preserves the partitioning of the preserved input, whereas `LeftSemi` + `CollectLeft` emits with `UnknownPartitioning`.
* `RightSemi` works better with dynamic filter pushdown: I don't know the dynamic filter code super well, but I'd imagine that since `RightSemi` builds the filter side before streaming the preserved side, that gives us more information we can use to push down filters into the preserved-side scan.
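
To make the memory point concrete, here is a back-of-the-envelope sketch. All numbers and names (`build_side_bytes`, the row widths, the row counts) are made up for illustration, and hash-table overhead is ignored; it only shows how a filter side with more rows can still be the cheaper side to build.

```python
def build_side_bytes(rows: int, bytes_per_row: int) -> int:
    """Rough payload size of the build side, ignoring hash-table overhead."""
    return rows * bytes_per_row


# Hypothetical inputs: the filter side has 4x as many rows as the preserved
# side, but only its 8-byte join key has to be stored.
preserved_rows, preserved_row_bytes = 1_000_000, 200  # key + output columns
filter_rows, key_bytes = 4_000_000, 8                 # join key only

# LeftSemi builds the preserved side (wide rows).
left_semi_bytes = build_side_bytes(preserved_rows, preserved_row_bytes)
# RightSemi builds the filter side (keys only).
right_semi_bytes = build_side_bytes(filter_rows, key_bytes)

print(left_semi_bytes, right_semi_bytes)
```

Under these assumed widths, the `RightSemi` build is several times smaller even though it has four times the rows. A rule that compares row counts alone would pick the other orientation.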

Two additional factors that might change:

* `RightSemi` only needs to build on distinct values from the non-preserved side. We don't discard duplicate build-side rows today, but we could optimize `RightSemi` to do so in the future (#22930).
* `RightSemi` allows emitting join results incrementally: as we see each probe row, we can immediately determine if it should be output or not. `LeftSemi`, by contrast, consumes the _entire_ non-preserved side, marking which of the preserved-side rows matched, and only at the end of the non-preserved input stream can we do a pass over the matched bitmap to determine which preserved-side rows to emit. This is not fundamental though; it is probably worth fixing `LeftSemi` to emit incrementally (#22929).
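
The difference in emission behavior can be sketched with a toy in-memory model. This is not the actual hash join implementation: the function names, the tuple-based rows, and the use of a Python `set` and `dict` are all assumptions made for illustration. Note that the `set` drops duplicate keys implicitly, which corresponds to the possible future optimization in #22930 rather than to current behavior.

```python
from typing import Hashable, Iterable, Iterator, Sequence

Row = Sequence[object]


def right_semi(filter_keys: Iterable[Hashable],
               preserved: Iterable[Row],
               key_idx: int) -> Iterator[Row]:
    """Build on the filter side (keys only), stream the preserved side."""
    build = set(filter_keys)  # keys only; the set also removes duplicates
    for row in preserved:
        if row[key_idx] in build:
            yield row  # decided as soon as the probe row is seen


def left_semi(preserved: Iterable[Row],
              filter_keys: Iterable[Hashable],
              key_idx: int) -> Iterator[Row]:
    """Build on the preserved side (full rows), stream the filter side."""
    rows = list(preserved)  # full rows must be kept for output
    by_key: dict[Hashable, list[int]] = {}
    for i, row in enumerate(rows):
        by_key.setdefault(row[key_idx], []).append(i)
    matched = [False] * len(rows)  # the "matched bitmap"
    for key in filter_keys:
        for i in by_key.get(key, ()):
            matched[i] = True
    # Rows are emitted only after the filter input is exhausted.
    for row, hit in zip(rows, matched):
        if hit:
            yield row


# Hypothetical sample data: (key, payload) rows and a filter-side key column.
preserved = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
filter_keys = [2, 2, 3, 9]

print(list(right_semi(filter_keys, preserved, 0)))
print(list(left_semi(preserved, filter_keys, 0)))
```

Both functions return the same rows for this input. The difference is when they can produce them: `right_semi` yields while it is still reading the preserved input, whereas `left_semi` yields nothing until the whole filter input has been read.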

The current optimizer rules don't reflect this:

* `LeftSemi` and `RightSemi` are considered symmetrically; whichever semijoin input is predicted to be smaller is placed on the build side
* If stats are absent, `LeftSemi` is chosen

I think revising these rules as follows would make more sense:

* Prefer `RightSemi` over `LeftSemi`, _unless_ the non-preserved input is `k` times larger than the preserved input. Choosing `k` is a bit arbitrary, but a value in the range of 2-4 seems reasonable.
* If stats are absent, prefer `RightSemi`
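
As a rough sketch, the two rule sets could be written as follows. This paraphrases the rules as described in this issue and is not taken from the optimizer code. The names `Orientation`, `current_rule`, and `proposed_rule` are hypothetical, the unit of "size" (rows or bytes) is left open, and tie-breaking and the exact boundary at `k` are assumptions.

```python
from enum import Enum
from typing import Optional


class Orientation(Enum):
    LEFT_SEMI = "LeftSemi"    # build the preserved input, stream the filter input
    RIGHT_SEMI = "RightSemi"  # build the filter input, stream the preserved input


def current_rule(preserved_size: Optional[int],
                 filter_size: Optional[int]) -> Orientation:
    """Symmetric rule: build whichever input is predicted to be smaller."""
    if preserved_size is None or filter_size is None:
        return Orientation.LEFT_SEMI  # absent stats: LeftSemi
    if preserved_size < filter_size:
        return Orientation.LEFT_SEMI
    return Orientation.RIGHT_SEMI  # assumption: ties go to RightSemi


def proposed_rule(preserved_size: Optional[int],
                  filter_size: Optional[int],
                  k: float = 3.0) -> Orientation:
    """Biased rule: RightSemi unless the filter input is k times larger."""
    if preserved_size is None or filter_size is None:
        return Orientation.RIGHT_SEMI  # absent stats: RightSemi
    if filter_size > k * preserved_size:  # assumption: strict comparison
        return Orientation.LEFT_SEMI
    return Orientation.RIGHT_SEMI


# A filter input 1.5x the size of the preserved input flips between the rules.
print(current_rule(100, 150), proposed_rule(100, 150))
```

With the default `k = 3.0` (an arbitrary pick from the suggested 2-4 range), the two rules disagree only when the filter input is larger than the preserved input by less than a factor of `k`, or when stats are missing.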

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
