hi_itn_electronic by mayuris-00 · Pull Request #437 · NVIDIA/NeMo-text-processing

mayuris-00 · 2026-06-08T06:12:29Z

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

PR Type:

New Feature
Bugfix
Documentation
Test

If you haven't finished some of the above items, you can still open a "Draft" PR.

Signed-off-by: Mayuri S <mayuris@nvidia.com>

Signed-off-by: mayuris-00 <mayuris@nvidia.com>


mgrafu · 2026-06-15T22:06:17Z

+sharda	शारदा
+universities	यूनिवर्सिटीज़
+mcdonald	मैक्डॉनल्ड
+southmountaincc	साउथ माउन्टेन सी सी


let's trim this list to only the most common cases

Signed-off-by: Mayuri S <mayuris@nvidia.com>

mgrafu · 2026-06-24T19:34:04Z

@@ -0,0 +1,167 @@
+Users	यूज़र्स


let's add everything lowercased and accept several kinds of inputs if needed
https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/inverse_text_normalization/en/utils.py#L50
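A minimal sketch of the idea in plain Python, assuming the data file stores only lowercased entries and any other casings are derived in code. The helper name `casing_variants` is hypothetical and is not taken from the linked file.

```python
from typing import List


def casing_variants(text: str) -> List[str]:
    """Return the lowercased entry plus its upper-case and capitalized forms."""
    variants = [text.lower(), text.upper(), text.capitalize()]
    # dict.fromkeys drops duplicates while keeping order (e.g. digit-only entries).
    return list(dict.fromkeys(variants))


# The TSV would then hold "users" once instead of "Users", "USERS", ...
print(casing_variants("Users"))
```

This keeps the whitelist short and moves the casing decision into one place.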

mgrafu · 2026-06-24T19:38:13Z

            graph_hundred_component_at_least_one_none_zero_digit
        )

-        # Transducer for eleven hundred -> 1100 or twenty one hundred eleven -> 2111


let's not modify any scripts unnecessarily

mgrafu · 2026-06-24T19:40:19Z

+        common_map_lower = (common_map @ to_lower).optimize()
+
+        latin_run = pynini.closure(
+            pynini.union(*[pynini.accep(chr(c)) for c in range(ord('A'), ord('Z') + 1)])


is this equivalent to NEMO_ALPHA?
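A quick way to check the character set that the quoted range enumerates, in plain Python. This assumes `NEMO_ALPHA` covers both upper- and lowercase ASCII letters, which should be confirmed against graph_utils; if so, the range above would match only its uppercase half.

```python
import string

# Same enumeration as in the diff: code points 'A' through 'Z' inclusive.
latin_run_chars = {chr(c) for c in range(ord("A"), ord("Z") + 1)}

# Compare against the standard-library letter sets.
matches_uppercase_only = latin_run_chars == set(string.ascii_uppercase)
matches_all_letters = latin_run_chars == set(string.ascii_letters)

print(matches_uppercase_only, matches_all_letters)
```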

mgrafu · 2026-06-24T19:41:24Z

+        )
+        latin_run_lower = (latin_run @ to_lower).optimize()
+
+        drive_letter = letter_map_upper


this leaves room for the entire alphabet instead of the previously restricted set. is that on purpose?

mgrafu · 2026-06-24T19:43:24Z

+
+        url_slash = delete_space + sym["forwardslash"]
+
+        lit_slash_seg = pynini.cross(" / ", "/")


again, let's not hardcode the symbol, and let's use delete_space

mgrafu · 2026-06-24T19:44:47Z

+            + pynini.closure(
+                pynutil.add_weight(
+                    (delete_space + path_atom)
+                    | win_hyphen


it looks like a lot of these follow the same pattern. can we create one rule for all symbols instead of individual rules?
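A plain-Python analogue of what a single table-driven rule could look like, using `re` rather than pynini. The `SYMBOLS` dict is a hypothetical stand-in for the contents of `data/electronic/symbols.tsv`; the entry for the slash is illustrative only.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-in for symbols.tsv: spoken form -> written symbol.
SYMBOLS: Dict[str, str] = {"हाइफन": "-", "स्लैश": "/"}


def build_symbol_rule(symbols: Dict[str, str]) -> Callable[[str], str]:
    """Build one rule that replaces any spoken symbol and drops the spaces around it."""
    # Longest keys first so a longer spoken form wins over a shorter prefix.
    keys = sorted(symbols, key=len, reverse=True)
    pattern = re.compile(r"\s*(" + "|".join(re.escape(k) for k in keys) + r")\s*")
    return lambda text: pattern.sub(lambda m: symbols[m.group(1)], text)


apply_symbols = build_symbol_rule(SYMBOLS)
```

Adding a symbol then means adding a row to the table, not another hand-written rule.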

mgrafu · 2026-06-24T19:52:48Z

+    """
+    Finite state transducer for classifying serial strings, whose segments are
+    joined by a hyphen (a literal "-" or the spoken word "हाइफ़न").
+        e.g. कोविड-उन्नीस -> tokens { serial { name: "कोविड-19" } }


'serial' is not a semiotic class in the semiotic classes proto

mgrafu · 2026-06-24T19:53:57Z

+    def __init__(self, cardinal: GraphFst):
+        super().__init__(name="serial", kind="classify")
+
+        not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1)


use https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/en/graph_utils.py#L42

mgrafu · 2026-06-24T19:54:38Z

+        super().__init__(name="serial", kind="classify")
+
+        not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1)
+        strip_cardinal_tags = pynutil.delete('cardinal { integer: "') + not_quote + pynutil.delete('" }')


let's define and use the necessary graphs before adding cardinal tags so we don't need to strip them

mgrafu · 2026-06-24T19:55:44Z

+        number_words = pynini.arcmap(pynini.project(number, "input"), map_type="rmweight").optimize()
+
+        devanagari_letter = pynini.union(
+            *[chr(c) for c in range(0x0900, 0x0966)],


can we define this in graph_utils so we can use it as needed in other classes (like electronic)
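A sketch of how the visible range could be defined once and reused, in plain Python. The constant and helper names are hypothetical, and only the range shown in the diff is covered; whatever follows the trailing comma in the original union is not reproduced here.

```python
# Code points U+0900..U+0965, as in the quoted range (the end of range() is exclusive,
# so the Devanagari digits starting at U+0966 are not included).
DEVANAGARI_BLOCK_CHARS = frozenset(chr(c) for c in range(0x0900, 0x0966))


def is_devanagari_word(text: str) -> bool:
    """True if text is non-empty and every character falls in the range above."""
    return bool(text) and all(ch in DEVANAGARI_BLOCK_CHARS for ch in text)


print(is_devanagari_word("कोविड"))
```

In the grammar itself the same list of characters would feed a `pynini.union`, defined in one shared module so that both the serial and electronic classes import it.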

mgrafu · 2026-06-24T19:56:28Z

+        segment = word | number
+
+        word_hyphen = (
+            delete_space + (pynutil.delete("हाइफ़न") | pynutil.delete("हाइफन")) + delete_space + pynutil.insert("-")


let's use the symbols as defined in electronics instead of hardcoding here

mgrafu · 2026-06-24T20:11:42Z

+        letter_map_upper = (letter_map_lower @ TO_UPPER).optimize()
+
+        sym = load_symbols(get_abs_path("data/electronic/symbols.tsv"))
+        lit_open_paren = delete_space + pynutil.delete("(") + pynutil.insert("(") + delete_zero_or_one_space


why are these needed?

mgrafu · 2026-06-24T20:23:55Z

+            | pynutil.add_weight(digit_words, 0.10)
+            | pynutil.add_weight(letter_map_upper, 0.84)
+        )
+        alnum_run = alnum_token + delete_space + alnum_token + pynini.closure(delete_space + alnum_token, 0)


alnum_run = alnum_token + pynini.closure(delete_space + alnum_token, 1)
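The two forms accept the same sequences: "token, separator, token, then zero or more (separator, token)" is the same as "token, then one or more (separator, token)". A small regex analogue in plain Python, with a toy token and separator standing in for `alnum_token` and `delete_space`:

```python
import re

TOKEN = r"[A-Z0-9]"  # toy stand-in for alnum_token
SEP = r" +"          # toy stand-in for delete_space

# Original shape: a + sep + a + closure(sep + a, 0)
original = re.compile(f"{TOKEN}{SEP}{TOKEN}(?:{SEP}{TOKEN})*")
# Suggested shape: a + closure(sep + a, 1)
simplified = re.compile(f"{TOKEN}(?:{SEP}{TOKEN})+")


def same_language(samples):
    """Check that both patterns accept or reject each sample identically."""
    return all(
        bool(original.fullmatch(s)) == bool(simplified.fullmatch(s)) for s in samples
    )


print(same_language(["A", "A B", "A B 1", "A  B", "AB", ""]))
```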

mgrafu · 2026-06-24T20:27:33Z

@@ -0,0 +1,37 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.


incorrect copyright

mgrafu · 2026-06-24T20:28:17Z

+class ElectronicFst(GraphFst):
+    """
+    ITN verbalizer for electronic expressions.
+    All fields pass content through unchanged.


let's follow the same descriptions as for other languages / classes

mgrafu · 2026-06-24T20:29:38Z

+
+class SerialFst(GraphFst):
+    """
+    Verbalizer for serial expressions.


let's follow the same descriptions as for other languages / classes, and make sure we're only adding tagger / verbalizer for classes that exist in the proto

mgrafu · 2026-06-24T20:31:34Z

        )
        graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space
-        self.fst = graph
+        self.fst = (graph @ PostProcessor().fst).optimize()


let's follow other languages in their use of the postprocessor. it should be optional and passed as a flag

mayuris-00 added 7 commits June 5, 2026 17:11

feat(hi): add Hindi ITN electronic class

56f129c

Signed-off-by: Mayuri S <mayuris@nvidia.com>

chore: remove laptop setup guide and temp diag scripts

5bd93ca

Signed-off-by: Mayuri S <mayuris@nvidia.com>

chore: remove scratch helper script

933155b

Signed-off-by: Mayuri S <mayuris@nvidia.com>

chore(hi): remove percentage class (out of scope for electronic PR)

12e49f0

Signed-off-by: Mayuri S <mayuris@nvidia.com>

chore(hi): remove leftover percentage tagger and verbalizer

d4c9cf6

Signed-off-by: Mayuri S <mayuris@nvidia.com>

Merge branch 'staging/hi_itn_v3' into hi-itn-electronic-clean

c470cc9

Signed-off-by: mayuris-00 <mayuris@nvidia.com>

style: apply black and isort formatting

01b2fc4

Signed-off-by: Mayuri S <mayuris@nvidia.com>

mayuris-00 force-pushed the hi-itn-electronic-clean branch from 3e004de to 01b2fc4 · June 10, 2026 04:35