Skip to content

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280

Open
lyne7-sc wants to merge 14 commits into
apache:masterfrom
lyne7-sc:fix/spark_levenshtein
Open

[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280
lyne7-sc wants to merge 14 commits into
apache:masterfrom
lyne7-sc:fix/spark_levenshtein

Conversation

@lyne7-sc

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2279

Rationale for this change

Spark 4+ introduced an optional threshold parameter for Levenshtein. The previous native implementation delegated to DataFusion's built-in levenshtein() which only supports 2 arguments.

What changes are included in this PR?

  • Added a custom spark_levenshtein native function that supports both 2-arg and 3-arg (with threshold) forms
  • Changed NativeConverters.scala to route Levenshtein through buildExtScalarFunction("Spark_Levenshtein") instead of the old buildScalarFunction path
  • Removed .exclude("string Levenshtein distance") from Spark 4.0 and 4.1 test settings

Are there any user-facing changes?

Yes. The native levenshtein() function now supports the optional third threshold parameter.

How was this patch tested?

  • Rust unit tests
  • AuronStringFunctionsSuite and AuronFunctionSuite including "test function Levenshtein"

@lyne7-sc lyne7-sc changed the title [Aruon #2279] Support Levenshtein func with threshold parameter for Spark 4+ [AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+ May 19, 2026
@cxzl25 cxzl25 requested a review from Copilot May 19, 2026 12:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Ok(ColumnarValue::Array(Arc::new(splitted_builder.finish())))
}

pub fn spark_levenshtein(args: &[ColumnarValue]) -> Result<ColumnarValue> {

@ShreyeshArangath ShreyeshArangath May 20, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong but, in spark_levenshtein, a null threshold is coerced to Some(0), which then returns -1 for any non-zero distance and 0 for equal strings:

  Some(ColumnarValue::Scalar(scalar)) if scalar.is_null() => Some(0),
  ...
  Some(array) if array.data_type() == &DataType::Null => Some(0),
  Some(_) => thresholds.map(|array| if array.is_valid(i) { array.value(i) } else { 0 }),

I think that doesn't match Spark. In Spark, Levenshtein.eval only null-checks left/right..a null threshold at runtime causes an NPE during v.asInstanceOf[Int]. The defensible options are (a) propagate null when threshold is null, or (b) mirror Spark and error

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! You're correct. I double-confirmed this and have fixed it by propagating null. Updated tests accordingly.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment on lines +199 to +228
let distance = if left == right {
0
} else {
let left_chars = left.chars().collect::<Vec<_>>();
let right_chars = right.chars().collect::<Vec<_>>();
if left_chars.is_empty() {
right_chars.len() as i32
} else if right_chars.is_empty() {
left_chars.len() as i32
} else {
let mut previous = (0..=right_chars.len()).collect::<Vec<_>>();
let mut current = vec![0; right_chars.len() + 1];

for (i, left_char) in left_chars.iter().enumerate() {
current[0] = i + 1;
for (j, right_char) in right_chars.iter().enumerate() {
let substitution_cost = usize::from(left_char != right_char);
current[j + 1] = (current[j] + 1)
.min(previous[j + 1] + 1)
.min(previous[j] + substitution_cost);
}
std::mem::swap(&mut previous, &mut current);
}
previous[right_chars.len()] as i32
}
};
Some(match threshold {
Some(threshold) if distance > threshold => -1,
_ => distance,
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I’ve refactored this to align with Spark’s thresholded Levenshtein behavior:

https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1989-L2137

@weiqingy weiqingy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this through — the banded early-exit Levenshtein is a genuinely careful piece of work, and the implementation matches Spark's UTF8String.levenshteinDistance semantics exactly across the tricky edges (negative threshold, null inputs, multibyte, left-longer-than-right without the swap Spark does). A few small questions inline.

Power=67;
IsNaN=69;
Levenshtein=80;
// Levenshtein=80;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting out the line frees number 80 without a reserved 80; guard, so a future enum value could reclaim it. The practical risk is small — Auron plans are built and consumed in-process rather than persisted across versions, and the enum already has unguarded gaps (68, 70–79), so this matches existing house style.

Is leaving a commented-out enum line the deliberate convention here, or worth tightening to the following while the change is fresh?

  reserved 80;
  reserved "Levenshtein";

That makes the intent explicit and lets the compiler reject any accidental reuse of 80 or the name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I agree reserved is safer than a comment, and it's the protobuf-recommended way to prevent field number reuse.

])?;
let s = r.into_array(6)?;
assert_eq!(
as_int32_array(&s)?.into_iter().collect::<Vec<_>>(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every threshold case here runs on short strings (≤7 chars), so none of them exercises the band on a string long enough that a wrong min-1 / max+1 sentinel would diverge from the naive DP — the band stays wide enough to cover the whole row. The boundary-poisoning logic is the part most exposed to a future edit, and the current cases wouldn't catch a regression in it.

Would you add a long-string case with a tight band — something like levenshtein("abcdefghij", "abXdefghij", 1) (true distance 1, just inside the band) paired with the same inputs at threshold 0 (→ -1)? That pair specifically stresses the out-of-band sentinel handling that the short cases don't reach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Added test cases that exercise the banded DP path with longer strings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Levenshtein func with threshold parameter for Spark 4+

4 participants