[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280
[AURON #2279] Implement native Levenshtein with threshold support for Spark 4.0+#2280lyne7-sc wants to merge 14 commits into
Conversation
…spark_levenshtein # Conflicts: # native-engine/auron-planner/src/planner.rs
| Ok(ColumnarValue::Array(Arc::new(splitted_builder.finish()))) | ||
| } | ||
|
|
||
| pub fn spark_levenshtein(args: &[ColumnarValue]) -> Result<ColumnarValue> { |
There was a problem hiding this comment.
Correct me if I'm wrong but, in spark_levenshtein, a null threshold is coerced to Some(0), which then returns -1 for any non-zero distance and 0 for equal strings:
Some(ColumnarValue::Scalar(scalar)) if scalar.is_null() => Some(0),
...
Some(array) if array.data_type() == &DataType::Null => Some(0),
Some(_) => thresholds.map(|array| if array.is_valid(i) { array.value(i) } else { 0 }),
I think that doesn't match Spark. In Spark, Levenshtein.eval only null-checks left/right..a null threshold at runtime causes an NPE during v.asInstanceOf[Int]. The defensible options are (a) propagate null when threshold is null, or (b) mirror Spark and error
There was a problem hiding this comment.
Thanks for catching this! You're correct. I double-confirmed this and have fixed it by propagating null. Updated tests accordingly.
| let distance = if left == right { | ||
| 0 | ||
| } else { | ||
| let left_chars = left.chars().collect::<Vec<_>>(); | ||
| let right_chars = right.chars().collect::<Vec<_>>(); | ||
| if left_chars.is_empty() { | ||
| right_chars.len() as i32 | ||
| } else if right_chars.is_empty() { | ||
| left_chars.len() as i32 | ||
| } else { | ||
| let mut previous = (0..=right_chars.len()).collect::<Vec<_>>(); | ||
| let mut current = vec![0; right_chars.len() + 1]; | ||
|
|
||
| for (i, left_char) in left_chars.iter().enumerate() { | ||
| current[0] = i + 1; | ||
| for (j, right_char) in right_chars.iter().enumerate() { | ||
| let substitution_cost = usize::from(left_char != right_char); | ||
| current[j + 1] = (current[j] + 1) | ||
| .min(previous[j + 1] + 1) | ||
| .min(previous[j] + substitution_cost); | ||
| } | ||
| std::mem::swap(&mut previous, &mut current); | ||
| } | ||
| previous[right_chars.len()] as i32 | ||
| } | ||
| }; | ||
| Some(match threshold { | ||
| Some(threshold) if distance > threshold => -1, | ||
| _ => distance, | ||
| }) |
There was a problem hiding this comment.
Makes sense. I’ve refactored this to align with Spark’s thresholded Levenshtein behavior:
weiqingy
left a comment
There was a problem hiding this comment.
Thanks for pushing this through — the banded early-exit Levenshtein is a genuinely careful piece of work, and the implementation matches Spark's UTF8String.levenshteinDistance semantics exactly across the tricky edges (negative threshold, null inputs, multibyte, left-longer-than-right without the swap Spark does). A few small questions inline.
| Power=67; | ||
| IsNaN=69; | ||
| Levenshtein=80; | ||
| // Levenshtein=80; |
There was a problem hiding this comment.
Commenting out the line frees number 80 without a reserved 80; guard, so a future enum value could reclaim it. The practical risk is small — Auron plans are built and consumed in-process rather than persisted across versions, and the enum already has unguarded gaps (68, 70–79), so this matches existing house style.
Is leaving a commented-out enum line the deliberate convention here, or worth tightening to the following while the change is fresh?
reserved 80;
reserved "Levenshtein";That makes the intent explicit and lets the compiler reject any accidental reuse of 80 or the name.
There was a problem hiding this comment.
Good suggestion. I agree reserved is safer than a comment, and it's the protobuf-recommended way to prevent field number reuse.
| ])?; | ||
| let s = r.into_array(6)?; | ||
| assert_eq!( | ||
| as_int32_array(&s)?.into_iter().collect::<Vec<_>>(), |
There was a problem hiding this comment.
Every threshold case here runs on short strings (≤7 chars), so none of them exercises the band on a string long enough that a wrong min-1 / max+1 sentinel would diverge from the naive DP — the band stays wide enough to cover the whole row. The boundary-poisoning logic is the part most exposed to a future edit, and the current cases wouldn't catch a regression in it.
Would you add a long-string case with a tight band — something like levenshtein("abcdefghij", "abXdefghij", 1) (true distance 1, just inside the band) paired with the same inputs at threshold 0 (→ -1)? That pair specifically stresses the out-of-band sentinel handling that the short cases don't reach.
There was a problem hiding this comment.
Good call. Added test cases that exercise the banded DP path with longer strings
Which issue does this PR close?
Closes #2279
Rationale for this change
Spark 4+ introduced an optional
thresholdparameter forLevenshtein. The previous native implementation delegated to DataFusion's built-inlevenshtein()which only supports 2 arguments.What changes are included in this PR?
spark_levenshteinnative function that supports both 2-arg and 3-arg (with threshold) formsNativeConverters.scalato routeLevenshteinthroughbuildExtScalarFunction("Spark_Levenshtein")instead of the oldbuildScalarFunctionpath.exclude("string Levenshtein distance")from Spark 4.0 and 4.1 test settingsAre there any user-facing changes?
Yes. The native
levenshtein()function now supports the optional thirdthresholdparameter.How was this patch tested?
AuronStringFunctionsSuiteandAuronFunctionSuiteincluding "test function Levenshtein"