1-800-BAD-CODE/punct_cap_seg_47_language

Text2Text Generation

ONNX

generic

punctuation

sentence-boundary-detection

truecasing

1-800-BAD-CODE committed on Feb 22, 2023

Commit

651e333

1 Parent(s): 6d889b7

Update README.md

Files changed (1)

README.md +8 -5

@@ -57,6 +57,9 @@ language:
 # Model Overview
 This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
 # Model Details
 This model generally follows the graph shown below, with brief descriptions for each step following.
@@ -87,14 +90,14 @@ In practice, this means the inverted question mark for Spanish and Asturian, `¿
 Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
 5. **Sentence boundary detection**
-Parallel to the "pre" punctuation, another classification predicts from the re-encoded text sentence boundaries.
 In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
 6. **Shift and concat sentence boundaries**
 In many languages, the first character of each sentence should be upper-cased.
-Thus, we want to feed the sentence boundary information to the true-case classification network.
 Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
-Therefore, we shift right by one the binary sentence boundary decisions.
 Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
 7. **True-case prediction**
@@ -151,12 +154,12 @@ This model was trained with News Crawl data from WMT.
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
-# Bias, Risks, and Limitation
 This model was trained on news data, and may not perform well on conversational or informal data.
 This is also a base-sized model with many languages and many tasks, so capacity may be limited.
-This model also predicts punctuation only once per subword.
 This implies that some acronyms, e.g., 'U.S.', cannot be punctuated properly.
 This concession was accepted on two grounds:
 1. Such acronyms are rare, especially in the context of multi-lingual models

 # Model Overview
 This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
+All languages are processed with the same algorithm with no need for language tags or language-specific branches in the graph.
+This includes continuous-script and non-continuous script languages, predicting language-specific punctuation, etc.
 # Model Details
 This model generally follows the graph shown below, with brief descriptions for each step following.
 Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
 5. **Sentence boundary detection**
+Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text.
 In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
 6. **Shift and concat sentence boundaries**
 In many languages, the first character of each sentence should be upper-cased.
+Thus, we should feed the sentence boundary information to the true-case classification network.
 Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
+Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
 Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
 7. **True-case prediction**
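The shift-and-concatenate operation described in step 6 can be sketched in plain Python. This is a minimal illustration, not the model's actual implementation: the function names, the toy encodings, and the choice to pad the first position with 0 are all assumptions made for the example.

```python
from typing import List


def shift_right(boundaries: List[int], pad: int = 0) -> List[int]:
    """Shift binary sentence-boundary decisions right by one position.

    After the shift, position N holds the decision made for token N-1,
    so a 1 at position N means "token N starts a new sentence".
    The pad value for position 0 is an assumption of this sketch.
    """
    if not boundaries:
        return []
    return [pad] + boundaries[:-1]


def concat_boundary_feature(
    encodings: List[List[float]], boundaries: List[int]
) -> List[List[float]]:
    """Append the shifted boundary flag to each time step's encoding."""
    if len(encodings) != len(boundaries):
        raise ValueError("encodings and boundaries must have equal length")
    shifted = shift_right(boundaries)
    return [enc + [float(flag)] for enc, flag in zip(encodings, shifted)]


# Toy input: four tokens with 2-dimensional encodings (hypothetical values).
encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Hypothetical SBD output: tokens 1 and 3 end a sentence.
boundaries = [0, 1, 0, 1]

features = concat_boundary_feature(encodings, boundaries)
```

In this toy input, token 2 receives the flag `1.0` because token 1 was predicted as a sentence boundary, which is the information a context-free true-case classifier would need in order to upper-case the first character of a sentence.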
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
+# Limitations
 This model was trained on news data, and may not perform well on conversational or informal data.
 This is also a base-sized model with many languages and many tasks, so capacity may be limited.
+This model predicts punctuation only once per subword.
 This implies that some acronyms, e.g., 'U.S.', cannot be punctuated properly.
 This concession was accepted on two grounds:
 1. Such acronyms are rare, especially in the context of multi-lingual models
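The once-per-subword limitation can be illustrated with a small sketch. The helper name and both subword segmentations below are hypothetical and chosen only for illustration; they are not output from the model's tokenizer.

```python
from typing import List, Optional


def punctuate_word(subwords: List[str], marks: List[Optional[str]]) -> str:
    """Rebuild one word from its subwords, appending at most one mark per subword.

    `marks[i]` is the punctuation predicted after `subwords[i]`, or None.
    """
    if len(subwords) != len(marks):
        raise ValueError("subwords and marks must have equal length")
    return "".join(tok + (mark or "") for tok, mark in zip(subwords, marks))


# If "us" is a single subword, only one mark can be attached to it,
# so the best available output is "us." rather than "u.s.".
single_piece = punctuate_word(["us"], ["."])

# Only a segmentation with one subword per letter could yield "u.s.".
per_letter = punctuate_word(["u", "s"], [".", "."])
```

Under these assumptions, whether an acronym such as 'U.S.' can be recovered depends entirely on how the tokenizer happens to split it, which is the concession the limitation describes.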