hi_itn_electronic #437

Open
mayuris-00 wants to merge 9 commits into NVIDIA:staging/hi_itn_v3 from mayuris-00:hi-itn-electronic-clean
mayuris-00 commented Jun 8, 2026

What does this PR do?

Add a one-line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unit tests finish successfully before sending the PR?
    1. Run pytest or (if your machine does not have a GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Run the Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: have you added test cases for both pytest and Sparrowhawk here?
  • Have you added __init__.py for every folder and subfolder, including the data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py, your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature, please update the NeMo documentation (lives in a different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items, you can still open a "Draft" PR.

Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: mayuris-00 <mayuris@nvidia.com>
mayuris-00 force-pushed the hi-itn-electronic-clean branch from 3e004de to 01b2fc4 on June 10, 2026 04:35
Comment thread nemo_text_processing/inverse_text_normalization/hi/data/electronic/domain.tsv Outdated
sharda शारदा
universities यूनिवर्सिटीज़
mcdonald मैक्डॉनल्ड
southmountaincc साउथ माउन्टेन सी सी

Collaborator

let's trim this list to only be the most common cases

Comment thread nemo_text_processing/inverse_text_normalization/hi/taggers/electronic.py Outdated
Comment thread nemo_text_processing/inverse_text_normalization/hi/taggers/electronic.py Outdated
Comment thread nemo_text_processing/inverse_text_normalization/hi/taggers/electronic.py Outdated
Comment thread nemo_text_processing/inverse_text_normalization/hi/taggers/electronic.py Outdated
mayuris-00 force-pushed the hi-itn-electronic-clean branch from 812333f to 99a6eef on June 23, 2026 04:30
@@ -0,0 +1,167 @@
Users यूज़र्स

Collaborator

graph_hundred_component_at_least_one_none_zero_digit
)

# Transducer for eleven hundred -> 1100 or twenty one hundred eleven -> 2111

Collaborator

let's not modify any scripts unnecessarily

common_map_lower = (common_map @ to_lower).optimize()

latin_run = pynini.closure(
    pynini.union(*[pynini.accep(chr(c)) for c in range(ord('A'), ord('Z') + 1)])

Collaborator

is this equivalent to NEMO_ALPHA?
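
A quick way to reason about this question without building any FSTs is to compare character sets. The sketch below uses only the standard library and assumes that NEMO_ALPHA accepts a single ASCII letter of either case; that assumption should be checked against graph_utils. Note also that latin_run is a closure (a run of letters), so it would not be a drop-in match for a single-character acceptor either way.

```python
import string

# Characters accepted by one step of the snippet's union: 'A'..'Z' only.
snippet_chars = {chr(c) for c in range(ord("A"), ord("Z") + 1)}

# ASSUMPTION (not confirmed by this PR page): NEMO_ALPHA covers ASCII
# lowercase and uppercase letters, one character at a time.
assumed_nemo_alpha_chars = set(string.ascii_letters)

# Letters the assumed NEMO_ALPHA accepts that the snippet's union does not.
missing = "".join(sorted(assumed_nemo_alpha_chars - snippet_chars))

print("snippet is uppercase-only:", snippet_chars == set(string.ascii_uppercase))
print("not covered by snippet:", missing)
```

Under that assumption the two are not equivalent: the union in the snippet is uppercase-only, which is presumably why the code composes with to_lower afterwards.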

)
latin_run_lower = (latin_run @ to_lower).optimize()

drive_letter = letter_map_upper

Collaborator

this leaves room for the entire alphabet instead of the previously restricted set. is that on purpose?


url_slash = delete_space + sym["forwardslash"]

lit_slash_seg = pynini.cross(" / ", "/")

Collaborator

again, let's not hardcode the symbol, and let's use delete_space

+ pynini.closure(
    pynutil.add_weight(
        (delete_space + path_atom)
        | win_hyphen

Collaborator

it looks like a lot of these follow the same pattern. can we create one rule for all symbols instead of individual rules?

"""
Finite state transducer for classifying serial strings, whose segments are
joined by a hyphen (a literal "-" or the spoken word "हाइफ़न").
e.g. कोविड-उन्नीस -> tokens { serial { name: "कोविड-19" } }

Collaborator

'serial' is not a semiotic class in the semiotic classes proto

def __init__(self, cardinal: GraphFst):
    super().__init__(name="serial", kind="classify")

    not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1)

Collaborator

super().__init__(name="serial", kind="classify")

not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1)
strip_cardinal_tags = pynutil.delete('cardinal { integer: "') + not_quote + pynutil.delete('" }')

Collaborator

let's define and use the necessary graphs before adding cardinal tags so we don't need to strip them

number_words = pynini.arcmap(pynini.project(number, "input"), map_type="rmweight").optimize()

devanagari_letter = pynini.union(
    *[chr(c) for c in range(0x0900, 0x0966)],

Collaborator

can we define this in graph_utils so we can use it as needed in other classes (like electronic)
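
If this definition moves to graph_utils, it may help to record what the code point range actually contains. The standard-library sketch below inspects the same bounds as the snippet; it does not use pynini, and it only describes the range, not how the PR's grammar uses it.

```python
import unicodedata
from collections import Counter

# Same bounds as the snippet. range() excludes the stop value,
# so the last code point included is U+0965.
codepoints = range(0x0900, 0x0966)

# Count Unicode general categories (letters, marks, punctuation, ...).
categories = Counter(unicodedata.category(chr(c)) for c in codepoints)

# The first code point left out of the range.
first_excluded = unicodedata.name(chr(0x0966))

# Punctuation characters that the range includes alongside letters and signs.
punctuation = [
    unicodedata.name(chr(c))
    for c in codepoints
    if unicodedata.category(chr(c)) == "Po"
]

print(categories)
print("first excluded:", first_excluded)
print("punctuation in range:", punctuation)
```

The range stops just before the Devanagari digits but does include the danda and double danda, so a shared constant named "letter" might deserve a comment or a narrower range, depending on what the author intends.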

segment = word | number

word_hyphen = (
    delete_space + (pynutil.delete("हाइफ़न") | pynutil.delete("हाइफन")) + delete_space + pynutil.insert("-")

Collaborator

let's use the symbols as defined in electronics instead of hardcoding here

letter_map_upper = (letter_map_lower @ TO_UPPER).optimize()

sym = load_symbols(get_abs_path("data/electronic/symbols.tsv"))
lit_open_paren = delete_space + pynutil.delete("(") + pynutil.insert("(") + delete_zero_or_one_space

Collaborator

why are these needed?

    | pynutil.add_weight(digit_words, 0.10)
    | pynutil.add_weight(letter_map_upper, 0.84)
)
alnum_run = alnum_token + delete_space + alnum_token + pynini.closure(delete_space + alnum_token, 0)

Collaborator

alnum_run = alnum_token + pynini.closure(delete_space + alnum_token, 1)
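
The suggested form should accept the same strings as the original, since "one token, a separator and a token, then zero or more repeats" is the same language as "one token, then one or more repeats of separator and token". The sketch below checks this on a bounded set of strings, using Python regular expressions as a stand-in for the FSTs: T plays the role of alnum_token and S the role of delete_space. It ignores weights and output labels and is not a pynini test.

```python
import itertools
import re

# T + S + T + closure(S + T, 0): the original form.
original = re.compile(r"T(?:ST)(?:ST)*")
# T + closure(S + T, 1): the suggested form.
suggested = re.compile(r"T(?:ST)+")


def accepted(pattern, max_len=9):
    """Return every string over {T, S} up to max_len that the pattern fully matches."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product("TS", repeat=n)
        if pattern.fullmatch("".join(chars))
    }


# Bounded check: both patterns accept exactly the same strings up to length 9.
same = accepted(original) == accepted(suggested)
# The shortest accepted string needs two tokens, matching the original intent.
shortest = min(accepted(suggested), key=len)

print(same, shortest)
```

Both forms still require at least two tokens, so the rewrite is a simplification rather than a behavior change, at least as far as the accepted input language goes.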

@@ -0,0 +1,37 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Collaborator

incorrect copyright

class ElectronicFst(GraphFst):
    """
    ITN verbalizer for electronic expressions.
    All fields pass content through unchanged.

Collaborator

let's follow the same descriptions as for other languages / classes


class SerialFst(GraphFst):
    """
    Verbalizer for serial expressions.

@mgrafu mgrafu Jun 24, 2026

Collaborator

let's follow the same descriptions as for other languages / classes, and make sure we're only adding tagger / verbalizer for classes that exist in the proto

)
graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space
self.fst = graph
self.fst = (graph @ PostProcessor().fst).optimize()

Collaborator

let's follow other languages in their use of the postprocessor. it should be optional and passed as a flag


2 participants