1-800-BAD-CODE committed · Commit 0c649bf · Parent: e3f629a

Update README.md

README.md CHANGED

# Model Details

This model generally follows the graph shown below; a brief description of each step follows.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1677025540482-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post punctuation tokens are those that may appear after a word: essentially, most common punctuation.
Post punctuation is predicted once per subword; further discussion is below.

3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on "post" punctuation.
Therefore, we must condition all further predictions on the post punctuation tokens.
For this task, predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks.

4. **Pre-punctuation**
After the re-encoding, another classification network predicts "pre" punctuation, i.e., punctuation tokens that may appear before a word.
In practice, this means the inverted question mark used in Spanish and Asturian, `¿`.
Note that a `¿` can only appear where a `?` is predicted, hence the conditioning.

5. **Sentence boundary detection**
In parallel with the "pre" punctuation head, another classification network predicts sentence boundaries from the re-encoded text.
In all languages, sentence boundaries can occur only where a potential full stop is predicted, hence the conditioning.

6. **Shift and concat sentence boundaries**
In many languages, the first character of each sentence should be upper-cased.
Thus, we want to feed the sentence boundary information into the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions right by one.
Concatenating this with the re-encoded text, each time step then encodes whether it is the first word of a sentence as predicted by the SBD head.

7. **True-case prediction**
Armed with knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald"; see the sketches after this list.
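
To make the flow above concrete, here is a minimal PyTorch sketch of the prediction graph. Only the figures stated in this card (a 64k `SentencePiece` vocabulary, a 6-layer, 512-dimensional encoder, 4-dimensional punctuation embeddings) come from the model description; every module name, head size, and remaining hyperparameter is an illustrative assumption, not the model's actual implementation.

```python
# Illustrative sketch only: dimensions noted "from the card" follow the text
# above; all other choices (head counts, label-set sizes) are assumptions.
import torch
import torch.nn as nn

class PunctCapSegSketch(nn.Module):
    def __init__(self, vocab_size=64_000, d_model=512, n_layers=6,
                 n_post_punct=15, d_punct_emb=4, max_subword_len=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)  # step 1
        self.post_head = nn.Linear(d_model, n_post_punct)                     # step 2
        self.punct_emb = nn.Embedding(n_post_punct, d_punct_emb)              # step 3
        d_joint = d_model + d_punct_emb                                       # 512 + 4
        re_layer = nn.TransformerEncoderLayer(d_joint, nhead=4, batch_first=True)
        self.re_encoder = nn.TransformerEncoder(re_layer, num_layers=2)
        self.pre_head = nn.Linear(d_joint, 2)   # step 4: does `¿` precede this word?
        self.sbd_head = nn.Linear(d_joint, 2)   # step 5: sentence boundary here?
        # Step 7: one upper-case logit per character slot of each subword.
        self.case_head = nn.Linear(d_joint + 1, max_subword_len)

    def forward(self, token_ids):
        enc = self.encoder(self.embed(token_ids))                    # [B, T, 512]
        post_logits = self.post_head(enc)                            # step 2
        # Inference-style hard decisions; training would use teacher forcing.
        post_ids = post_logits.argmax(dim=-1)                        # [B, T]
        joint = torch.cat([enc, self.punct_emb(post_ids)], dim=-1)   # [B, T, 516]
        re_enc = self.re_encoder(joint)                              # step 3
        pre_logits = self.pre_head(re_enc)                           # step 4
        sbd_logits = self.sbd_head(re_enc)                           # step 5
        # Step 6: shift boundary decisions right by one, so each time step
        # knows whether it *starts* a sentence; position 0 always starts one.
        starts = torch.roll(sbd_logits.argmax(dim=-1).float(), shifts=1, dims=1)
        starts[:, 0] = 1.0
        case_in = torch.cat([re_enc, starts.unsqueeze(-1)], dim=-1)
        case_logits = self.case_head(case_in)                        # step 7: [B, T, N]
        return post_logits, pre_logits, sbd_logits, case_logits

model = PunctCapSegSketch()
outputs = model(torch.randint(0, 64_000, (1, 12)))  # smoke test on random ids
```
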
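
As a small worked example of the per-character true-casing scheme, consider "macdonald". The subword split below is hypothetical, purely for illustration:

```python
# Hypothetical subword split of "macdonald" with per-character case targets
# (1 = upper-case this character, 0 = keep it lower).
subwords = ["mac", "donald"]
char_case_targets = [
    [1, 0, 0],           # "mac"    -> "Mac"
    [1, 0, 0, 0, 0, 0],  # "donald" -> "Donald"
]
# An acronym is all ones: "nato" -> "NATO" would be [1, 1, 1, 1].
```
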
## Post-Punctuation Tokens

This model predicts the following set of "post" punctuation tokens:

| Token | Description | Relevant Languages |
| ------: | :------------- | :------- |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ? | Full-width question mark | Chinese, Japanese |
| , | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |

# Usage

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used less.

Languages were chosen based on whether the News Crawl corpus contained enough data of reliable quality, as judged by the author.

# Bias, Risks, and Limitations
This model was trained on news data and may not perform well on conversational or informal data.

This is also a base-sized model with many languages and many tasks, so capacity may be limited.

# Evaluation