1-800-BAD-CODE committed · Commit 0c649bf · Parent: e3f629a

Update README.md

README.md CHANGED

# Model Details

This model generally follows the graph shown below; a brief description of each step follows.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1677025540482-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post punctuation tokens are those that may appear after a word: essentially, most common punctuation.
Post punctuation is predicted once per subword; further discussion is below.

3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on "post" punctuation.
Therefore, we must condition all further predictions on the post punctuation tokens.
For this task, predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks.

4. **Pre-punctuation**
After the re-encoding, another classification network predicts "pre" punctuation, i.e., punctuation tokens that may appear before a word.
In practice, this means the inverted question mark used in Spanish and Asturian, `¿`.
Note that a `¿` can only appear where a `?` is predicted, hence the conditioning.

5. **Sentence boundary detection**
In parallel with the "pre" punctuation head, another classification network predicts sentence boundaries from the re-encoded text.
In all languages, sentence boundaries can occur only where a potential full stop is predicted, hence the conditioning.

6. **Shift and concat sentence boundaries**
In many languages, the first character of each sentence should be upper-cased.
Thus, we want to feed the sentence boundary information into the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions right by one.
Concatenating this with the re-encoded text, each time step then encodes whether it is the first word of a sentence as predicted by the SBD head.

7. **True-case prediction**
Armed with knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald"; see the sketches after this list.
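
To make the flow above concrete, here is a minimal PyTorch sketch of the prediction graph. Only the figures stated in this card (a 64k `SentencePiece` vocabulary, a 6-layer, 512-dimensional encoder, 4-dimensional punctuation embeddings) come from the model description; every module name, head size, and remaining hyperparameter is an illustrative assumption, not the model's actual implementation.

```python
# Illustrative sketch only: dimensions noted "from the card" follow the text
# above; all other choices (head counts, label-set sizes) are assumptions.
import torch
import torch.nn as nn

class PunctCapSegSketch(nn.Module):
    def __init__(self, vocab_size=64_000, d_model=512, n_layers=6,
                 n_post_punct=15, d_punct_emb=4, max_subword_len=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)  # step 1
        self.post_head = nn.Linear(d_model, n_post_punct)                     # step 2
        self.punct_emb = nn.Embedding(n_post_punct, d_punct_emb)              # step 3
        d_joint = d_model + d_punct_emb                                       # 512 + 4
        re_layer = nn.TransformerEncoderLayer(d_joint, nhead=4, batch_first=True)
        self.re_encoder = nn.TransformerEncoder(re_layer, num_layers=2)
        self.pre_head = nn.Linear(d_joint, 2)   # step 4: does `¿` precede this word?
        self.sbd_head = nn.Linear(d_joint, 2)   # step 5: sentence boundary here?
        # Step 7: one upper-case logit per character slot of each subword.
        self.case_head = nn.Linear(d_joint + 1, max_subword_len)

    def forward(self, token_ids):
        enc = self.encoder(self.embed(token_ids))                    # [B, T, 512]
        post_logits = self.post_head(enc)                            # step 2
        # Inference-style hard decisions; training would use teacher forcing.
        post_ids = post_logits.argmax(dim=-1)                        # [B, T]
        joint = torch.cat([enc, self.punct_emb(post_ids)], dim=-1)   # [B, T, 516]
        re_enc = self.re_encoder(joint)                              # step 3
        pre_logits = self.pre_head(re_enc)                           # step 4
        sbd_logits = self.sbd_head(re_enc)                           # step 5
        # Step 6: shift boundary decisions right by one, so each time step
        # knows whether it *starts* a sentence; position 0 always starts one.
        starts = torch.roll(sbd_logits.argmax(dim=-1).float(), shifts=1, dims=1)
        starts[:, 0] = 1.0
        case_in = torch.cat([re_enc, starts.unsqueeze(-1)], dim=-1)
        case_logits = self.case_head(case_in)                        # step 7: [B, T, N]
        return post_logits, pre_logits, sbd_logits, case_logits

model = PunctCapSegSketch()
outputs = model(torch.randint(0, 64_000, (1, 12)))  # smoke test on random ids
```
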
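
As a small worked example of the per-character true-casing scheme, consider "macdonald". The subword split below is hypothetical, purely for illustration:

```python
# Hypothetical subword split of "macdonald" with per-character case targets
# (1 = upper-case this character, 0 = keep it lower).
subwords = ["mac", "donald"]
char_case_targets = [
    [1, 0, 0],           # "mac"    -> "Mac"
    [1, 0, 0, 0, 0, 0],  # "donald" -> "Donald"
]
# An acronym is all ones: "nato" -> "NATO" would be [1, 1, 1, 1].
```
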
## Post-Punctuation Tokens

This model predicts the following set of "post" punctuation tokens:

| Token | Description | Relevant Languages |
| ------: | :------------- | :------- |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ? | Full-width question mark | Chinese, Japanese |
| , | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |

# Usage

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used less.

Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.

# Bias, Risks, and Limitations
This model was trained on news data and may not perform well on conversational or informal data.

This is also a base-sized model with many languages and many tasks, so capacity may be limited.

# Evaluation