[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+ by lyne7-sc · Pull Request #2280 · apache/auron

lyne7-sc · 2026-05-19T07:19:36Z

Which issue does this PR close?

Rationale for this change

Spark 4+ introduced an optional threshold parameter for Levenshtein. The previous native implementation delegated to DataFusion's built-in levenshtein() which only supports 2 arguments.

What changes are included in this PR?

Added a custom spark_levenshtein native function that supports both 2-arg and 3-arg (with threshold) forms
Changed NativeConverters.scala to route Levenshtein through buildExtScalarFunction("Spark_Levenshtein") instead of the old buildScalarFunction path
Removed .exclude("string Levenshtein distance") from Spark 4.0 and 4.1 test settings

Are there any user-facing changes?

Yes. The native levenshtein() function now supports the optional third threshold parameter.

How was this patch tested?

Rust unit tests
AuronStringFunctionsSuite and AuronFunctionSuite including "test function Levenshtein"

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…spark_levenshtein # Conflicts: # native-engine/auron-planner/src/planner.rs

ShreyeshArangath · 2026-05-20T19:26:12Z

    Ok(ColumnarValue::Array(Arc::new(splitted_builder.finish())))
 }

+pub fn spark_levenshtein(args: &[ColumnarValue]) -> Result<ColumnarValue> {


Correct me if I'm wrong but, in spark_levenshtein, a null threshold is coerced to Some(0), which then returns -1 for any non-zero distance and 0 for equal strings:

Some(ColumnarValue::Scalar(scalar)) if scalar.is_null() => Some(0), ... Some(array) if array.data_type() == &DataType::Null => Some(0), Some(_) => thresholds.map(|array| if array.is_valid(i) { array.value(i) } else { 0 }),

I think that doesn't match Spark. In Spark, Levenshtein.eval only null-checks left/right..a null threshold at runtime causes an NPE during v.asInstanceOf[Int]. The defensible options are (a) propagate null when threshold is null, or (b) mirror Spark and error

Thanks for catching this! You're correct. I double-confirmed this and have fixed it by propagating null. Updated tests accordingly.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

lyne7-sc · 2026-05-26T13:45:19Z

+    let distance = if left == right {
+        0
+    } else {
+        let left_chars = left.chars().collect::<Vec<_>>();
+        let right_chars = right.chars().collect::<Vec<_>>();
+        if left_chars.is_empty() {
+            right_chars.len() as i32
+        } else if right_chars.is_empty() {
+            left_chars.len() as i32
+        } else {
+            let mut previous = (0..=right_chars.len()).collect::<Vec<_>>();
+            let mut current = vec![0; right_chars.len() + 1];
+
+            for (i, left_char) in left_chars.iter().enumerate() {
+                current[0] = i + 1;
+                for (j, right_char) in right_chars.iter().enumerate() {
+                    let substitution_cost = usize::from(left_char != right_char);
+                    current[j + 1] = (current[j] + 1)
+                        .min(previous[j + 1] + 1)
+                        .min(previous[j] + substitution_cost);
+                }
+                std::mem::swap(&mut previous, &mut current);
+            }
+            previous[right_chars.len()] as i32
+        }
+    };
+    Some(match threshold {
+        Some(threshold) if distance > threshold => -1,
+        _ => distance,
+    })


Makes sense. I’ve refactored this to align with Spark’s thresholded Levenshtein behavior:

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1989-L2137

weiqingy

Thanks for pushing this through — the banded early-exit Levenshtein is a genuinely careful piece of work, and the implementation matches Spark's UTF8String.levenshteinDistance semantics exactly across the tricky edges (negative threshold, null inputs, multibyte, left-longer-than-right without the swap Spark does). A few small questions inline.

weiqingy · 2026-06-12T06:33:30Z

  Power=67;
  IsNaN=69;
-  Levenshtein=80;
+  // Levenshtein=80;


Commenting out the line frees number 80 without a reserved 80; guard, so a future enum value could reclaim it. The practical risk is small — Auron plans are built and consumed in-process rather than persisted across versions, and the enum already has unguarded gaps (68, 70–79), so this matches existing house style.

Is leaving a commented-out enum line the deliberate convention here, or worth tightening to the following while the change is fresh?

reserved 80; reserved "Levenshtein";

That makes the intent explicit and lets the compiler reject any accidental reuse of 80 or the name.

Good suggestion. I agree reserved is safer than a comment, and it's the protobuf-recommended way to prevent field number reuse.

weiqingy · 2026-06-12T06:33:30Z

+        ])?;
+        let s = r.into_array(6)?;
+        assert_eq!(
+            as_int32_array(&s)?.into_iter().collect::<Vec<_>>(),


Every threshold case here runs on short strings (≤7 chars), so none of them exercises the band on a string long enough that a wrong min-1 / max+1 sentinel would diverge from the naive DP — the band stays wide enough to cover the whole row. The boundary-poisoning logic is the part most exposed to a future edit, and the current cases wouldn't catch a regression in it.

Would you add a long-string case with a tight band — something like levenshtein("abcdefghij", "abXdefghij", 1) (true distance 1, just inside the band) paired with the same inputs at threshold 0 (→ -1)? That pair specifically stresses the out-of-band sentinel handling that the short cases don't reach.

Good call. Added test cases that exercise the banded DP path with longer strings

lyne7-sc added 3 commits May 18, 2026 21:33

implement levenshtein with threshold parameter for Spark 4.x

67fdebd

Refactor spark levenshtein implementation

7963af0

Simplify levenshtein array threshold handling

8fcc669

github-actions Bot added spark native spark-tests labels May 19, 2026

lyne7-sc changed the title ~~[Aruon #2279] Support Levenshtein func with threshold parameter for Spark 4+~~ [AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+ May 19, 2026

cxzl25 requested a review from Copilot May 19, 2026 12:40

Copilot started reviewing on behalf of cxzl25 May 19, 2026 12:42 View session

Copilot AI reviewed May 19, 2026

lyne7-sc added 4 commits May 19, 2026 21:09

Comment out levenshtein proto enum

1b7439c

Clean up levenshtein planner fallback

f426f03

Clean up levenshtein planner fallback

be74b5f

Merge remote-tracking branch 'origin/fix/spark_levenshtein' into fix/…

c7bdafc

…spark_levenshtein # Conflicts: # native-engine/auron-planner/src/planner.rs

ShreyeshArangath reviewed May 20, 2026

View reviewed changes

lyne7-sc added 2 commits May 21, 2026 20:11

Fix null handling in levenshtein function

b1b35c5

lint

96e0c6e

slfan1989 requested a review from Copilot May 23, 2026 03:54

Copilot started reviewing on behalf of slfan1989 May 23, 2026 03:54 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

lyne7-sc and others added 2 commits May 26, 2026 21:44

optimize with bounded threshold

bdfaa16

Merge branch 'master' into fix/spark_levenshtein

ae2065e

weiqingy reviewed Jun 12, 2026

View reviewed changes

lyne7-sc added 3 commits June 15, 2026 11:08

Merge branch 'master' into fix/spark_levenshtein

210fa79

add test cases

53aee2e

reserved levenshtein identifier

71794ec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280
lyne7-sc wants to merge 14 commits into
apache:masterfrom
lyne7-sc:fix/spark_levenshtein

lyne7-sc commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

ShreyeshArangath May 20, 2026 •

edited

Loading

Uh oh!

lyne7-sc May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

lyne7-sc May 26, 2026

Uh oh!

weiqingy left a comment

Uh oh!

weiqingy Jun 12, 2026

Uh oh!

lyne7-sc Jun 15, 2026

Uh oh!

weiqingy Jun 12, 2026

Uh oh!

lyne7-sc Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lyne7-sc commented May 19, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

ShreyeshArangath May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lyne7-sc May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

lyne7-sc May 26, 2026

Choose a reason for hiding this comment

Uh oh!

weiqingy left a comment

Choose a reason for hiding this comment

Uh oh!

weiqingy Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

lyne7-sc Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

weiqingy Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

lyne7-sc Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ShreyeshArangath May 20, 2026 •

edited

Loading