hi_itn_electronic #437
Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: mayuris-00 <mayuris@nvidia.com>
3e004de to 01b2fc4
| sharda शारदा | ||
| universities यूनिवर्सिटीज़ | ||
| mcdonald मैक्डॉनल्ड | ||
| southmountaincc साउथ माउन्टेन सी सी |
let's trim this list to only the most common cases
Signed-off-by: Mayuri S <mayuris@nvidia.com>
812333f to 99a6eef
| @@ -0,0 +1,167 @@ | |||
| Users यूज़र्स | |||
let's add everything lowercased and accept several kinds of inputs if needed
https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/inverse_text_normalization/en/utils.py#L50
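One possible reading of this comment, sketched in plain Python rather than pynini: lowercase the Latin column of each whitelist row and drop rows that become duplicates. The helper name `lowercase_latin_column` and the extra `users` row are hypothetical and only for illustration; the sketch covers the lowercasing part of the request, not the "several kinds of inputs" part.

```python
def lowercase_latin_column(rows):
    """Lowercase the Latin column of (latin, devanagari) pairs, dropping duplicates.

    Hypothetical helper for illustration; not part of the PR or of NeMo.
    """
    seen = set()
    out = []
    for latin, devanagari in rows:
        pair = (latin.lower(), devanagari)
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out


# "Users" and "sharda" come from the diff; the lowercase "users" row is made up
# to show that duplicates collapse after lowercasing.
rows = [("Users", "यूज़र्स"), ("users", "यूज़र्स"), ("sharda", "शारदा")]
normalized = lowercase_latin_column(rows)
print(normalized)
```

Doing this once when the data file is loaded keeps the .tsv entries in a single canonical form.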
| graph_hundred_component_at_least_one_none_zero_digit | ||
| ) | ||
|
|
||
| # Transducer for eleven hundred -> 1100 or twenty one hundred eleven -> 2111 |
let's not modify any scripts unnecessarily
| common_map_lower = (common_map @ to_lower).optimize() | ||
|
|
||
| latin_run = pynini.closure( | ||
| pynini.union(*[pynini.accep(chr(c)) for c in range(ord('A'), ord('Z') + 1)]) |
is this equivalent to NEMO_ALPHA?
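A quick way to reason about this question without compiling any FST is to compare the character sets in plain Python. The sketch below assumes NEMO_ALPHA is the union of ASCII lowercase and uppercase letters, as in the English graph_utils; that is an assumption here, and the Hindi graph_utils may define it differently.

```python
import string

# Characters accepted by one step of latin_run in the diff: A..Z only.
latin_run_chars = {chr(c) for c in range(ord('A'), ord('Z') + 1)}

# Assumption: NEMO_ALPHA covers ASCII lowercase plus uppercase letters,
# and an uppercase-only set would be the ASCII uppercase letters.
assumed_nemo_alpha_chars = set(string.ascii_lowercase) | set(string.ascii_uppercase)
assumed_nemo_upper_chars = set(string.ascii_uppercase)

same_as_alpha = latin_run_chars == assumed_nemo_alpha_chars
same_as_upper = latin_run_chars == assumed_nemo_upper_chars
print(same_as_alpha, same_as_upper)
```

Under that assumption the union in the diff matches the uppercase-only set, not NEMO_ALPHA, so reusing an existing graph_utils constant would need the uppercase variant.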
| ) | ||
| latin_run_lower = (latin_run @ to_lower).optimize() | ||
|
|
||
| drive_letter = letter_map_upper |
this leaves room for the entire alphabet instead of the previously restricted set. is that on purpose?
|
|
||
| url_slash = delete_space + sym["forwardslash"] | ||
|
|
||
| lit_slash_seg = pynini.cross(" / ", "/") |
again, let's not hardcode the symbol, and let's use delete_space
| + pynini.closure( | ||
| pynutil.add_weight( | ||
| (delete_space + path_atom) | ||
| | win_hyphen |
it looks like a lot of these follow the same pattern. can we create one rule for all symbols instead of individual rules?
| """ | ||
| Finite state transducer for classifying serial strings, whose segments are | ||
| joined by a hyphen (a literal "-" or the spoken word "हाइफ़न"). | ||
| e.g. कोविड-उन्नीस -> tokens { serial { name: "कोविड-19" } } |
'serial' is not a semiotic class in the semiotic classes proto
| def __init__(self, cardinal: GraphFst): | ||
| super().__init__(name="serial", kind="classify") | ||
|
|
||
| not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1) |
| super().__init__(name="serial", kind="classify") | ||
|
|
||
| not_quote = pynini.closure(pynini.difference(NEMO_SIGMA, pynini.accep('"')), 1) | ||
| strip_cardinal_tags = pynutil.delete('cardinal { integer: "') + not_quote + pynutil.delete('" }') |
let's define and use the necessary graphs before adding cardinal tags so we don't need to strip them
| number_words = pynini.arcmap(pynini.project(number, "input"), map_type="rmweight").optimize() | ||
|
|
||
| devanagari_letter = pynini.union( | ||
| *[chr(c) for c in range(0x0900, 0x0966)], |
can we define this in graph_utils so we can use it as needed in other classes (like electronic)
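If this range does move into graph_utils, it may be worth checking what it actually covers first. The sketch below uses only the standard library's `unicodedata` to inspect the same code point range as the diff; the names `DEVANAGARI_RANGE` and `devanagari_chars` are hypothetical and not taken from the PR.

```python
import unicodedata

# Same code point range as the diff: U+0900 up to, but not including, U+0966.
DEVANAGARI_RANGE = range(0x0900, 0x0966)
devanagari_chars = [chr(c) for c in DEVANAGARI_RANGE]

# Code points in the range whose Unicode general category is punctuation.
punctuation_in_range = [
    unicodedata.name(ch)
    for ch in devanagari_chars
    if unicodedata.category(ch).startswith("P")
]

# The first code point that the range leaves out.
first_excluded = unicodedata.name(chr(0x0966))
print(len(devanagari_chars), punctuation_in_range, first_excluded)
```

The range stops just before the Devanagari digits, but it does include the danda and double danda, which are punctuation rather than letters. Whether a shared constant named for letters should keep them is a judgment call for the authors.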
| segment = word | number | ||
|
|
||
| word_hyphen = ( | ||
| delete_space + (pynutil.delete("हाइफ़न") | pynutil.delete("हाइफन")) + delete_space + pynutil.insert("-") |
let's use the symbols as defined in electronics instead of hardcoding here
| letter_map_upper = (letter_map_lower @ TO_UPPER).optimize() | ||
|
|
||
| sym = load_symbols(get_abs_path("data/electronic/symbols.tsv")) | ||
| lit_open_paren = delete_space + pynutil.delete("(") + pynutil.insert("(") + delete_zero_or_one_space |
| | pynutil.add_weight(digit_words, 0.10) | ||
| | pynutil.add_weight(letter_map_upper, 0.84) | ||
| ) | ||
| alnum_run = alnum_token + delete_space + alnum_token + pynini.closure(delete_space + alnum_token, 0) |
alnum_run = alnum_token + pynini.closure(delete_space + alnum_token, 1)
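The suggested form should accept the same strings as the original: both require at least two tokens. A rough regex analogue can be used to sanity-check that, assuming `delete_space` matches zero or more spaces (an assumption about its definition) and using the letter `T` as a stand-in for `alnum_token`.

```python
import re

# Regex analogue of the two definitions. "T" stands in for alnum_token and
# " *" stands in for delete_space (assumed to match zero or more spaces).
original = re.compile(r"T *T( *T)*")  # token + space + token + closure(..., 0)
suggested = re.compile(r"T( *T)+")    # token + closure(..., 1)

samples = ["", "T", "TT", "T T", "T  T T", "T T T T"]
results = [
    (s, bool(original.fullmatch(s)), bool(suggested.fullmatch(s)))
    for s in samples
]

# True when both patterns accept and reject exactly the same samples.
agree = all(a == b for _, a, b in results)
print(agree)
```

This only checks acceptance on a handful of samples; it says nothing about weights or output strings in the actual transducer.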
| @@ -0,0 +1,37 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |||
| class ElectronicFst(GraphFst): | ||
| """ | ||
| ITN verbalizer for electronic expressions. | ||
| All fields pass content through unchanged. |
let's follow the same descriptions as for other languages / classes
|
|
||
| class SerialFst(GraphFst): | ||
| """ | ||
| Verbalizer for serial expressions. |
let's follow the same descriptions as for other languages / classes, and make sure we're only adding tagger / verbalizer for classes that exist in the proto
| ) | ||
| graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space | ||
| self.fst = graph | ||
| self.fst = (graph @ PostProcessor().fst).optimize() |
let's follow other languages in their use of the postprocessor. it should be optional and passed as a flag
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Before your PR is "Ready for review"
Pre checks:
git commit -s to sign.
pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest and Sparrowhawk here.
__init__.py for every folder and subfolder, including data folder which has .TSV files?
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
Copyright 2015 and onwards Google, Inc.. See an example here.
(try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.