1-800-BAD-CODE committed on
Commit
0c649bf
·
1 Parent(s): e3f629a

Update README.md

  1. README.md +72 -19
README.md CHANGED
@@ -59,35 +59,88 @@ This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 la
59
 
60
  # Model Details
61
 
62
 
63
- # Usage
64
-
65
-
66
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
67
 
68
- [More Information Needed]
69
-
70
- ## Recommendations
71
-
72
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
73
 
74
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
75
 
76
  # Training Details
 
77
 
78
  ## Training Data
 
79
 
80
- [More Information Needed]
81
-
82
- ## Training Procedure
83
-
84
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
85
-
86
- ### Preprocessing [optional]
87
-
88
- [More Information Needed]
89
 
 
90
 
91
  # Bias, Risks, and Limitation
 
 
 
92
 
93
  # Evaluation
 
59
 
60
  # Model Details
61
 
62
+ This model generally follows the graph shown below, with brief descriptions for each step following.
63
+
64
+ ![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1677025540482-62d34c813eebd640a4f97587.png)
65
+
66
+
67
+ 1. **Encoding**:
68
+ The model begins by tokenizing the text with a subword tokenizer.
69
+ The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
70
+ Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
71
+
72
+ 2. **Post-punctuation**:
73
+ The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
74
+ Post-punctuation tokens are punctuation marks that may appear after a word, which covers most ordinary punctuation.
75
+ Post punctuation is predicted once per subword - further discussion is below.
76
+
77
+ 3. **Re-encoding**
78
+ All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on "post" punctuation.
79
+ Therefore, we must condition all further predictions on the post-punctuation tokens.
80
+ For this task, predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
81
+ Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
82
+ The concatenated joint representation is re-encoded to confer global context to each time step, so that the punctuation predictions are incorporated into subsequent tasks.
83
+
84
+ 4. **Pre-punctuation**
85
+ After the re-encoding, another classification network predicts "pre" punctuation, or punctuation tokens that may appear before a word.
86
+ In practice, this means the inverted question mark for Spanish and Asturian, `¿`.
87
+ Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
88
+
89
+ 5. **Sentence boundary detection**
90
+ Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text.
91
+ In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
92
+
93
+ 6. **Shift and concat sentence boundaries**
94
+ In many languages, the first character of each sentence should be upper-cased.
95
+ Thus, we want to feed the sentence boundary information to the true-case classification network.
96
+ Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
97
+ Therefore, we shift the binary sentence boundary decisions right by one.
98
+ Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
99
+
100
+ 7. **True-case prediction**
101
+ Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
102
+ Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subword.
103
+ (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
104
+ This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
105
+
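The shift-and-concat step (step 6) can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the function names, the toy 3-dimensional encodings, and the choice to mark the very first time step as a sentence start (`first_value=1`) are all assumptions made for the example.

```python
from typing import List, Sequence


def shift_right(boundaries: Sequence[int], first_value: int = 1) -> List[int]:
    """Shift binary sentence-boundary decisions right by one time step.

    boundaries[i] == 1 means "a sentence ends after time step i".
    After shifting, position i tells us whether step i starts a sentence.
    first_value fills position 0 (assumption: the input starts a sentence).
    """
    if not boundaries:
        return []
    return [first_value] + list(boundaries[:-1])


def concat_sentence_starts(
    encodings: Sequence[Sequence[float]], boundaries: Sequence[int]
) -> List[List[float]]:
    """Append the shifted boundary flag to each time step's encoding."""
    starts = shift_right(boundaries)
    return [list(enc) + [float(flag)] for enc, flag in zip(encodings, starts)]


# Toy input: 5 time steps with 3-dimensional encodings (hypothetical sizes).
encodings = [[0.0, 0.0, 0.0] for _ in range(5)]
boundaries = [0, 0, 1, 0, 0]  # a sentence boundary after the third step

sentence_starts = shift_right(boundaries)
features = concat_sentence_starts(encodings, boundaries)
print(sentence_starts)
print(len(features[0]))
```

With this layout, a feed-forward true-case head can read "am I the first word of a sentence?" directly from the last feature of each time step, without needing any context from neighboring steps.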
106
+
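The per-character true-casing scheme (step 7) can be illustrated as follows. This is a hedged sketch: `apply_true_case`, `pad_flags`, and `MAX_SUBWORD_LEN = 16` are hypothetical names and values chosen for the example, and the flags are written by hand rather than produced by a model.

```python
from typing import List, Sequence

# Assumption: the longest subword in the vocabulary has 16 characters.
MAX_SUBWORD_LEN = 16


def apply_true_case(subword: str, upper_flags: Sequence[int]) -> str:
    """Upper-case each character whose flag is 1.

    zip() stops at the end of the subword, so any extra predictions
    beyond the subword's length are ignored.
    """
    return "".join(
        ch.upper() if flag else ch for ch, flag in zip(subword, upper_flags)
    )


def pad_flags(flags: Sequence[int]) -> List[int]:
    """Pad hand-written flags with zeros up to MAX_SUBWORD_LEN."""
    return list(flags) + [0] * (MAX_SUBWORD_LEN - len(flags))


# An acronym: every character is upper-cased.
nato = apply_true_case("nato", pad_flags([1, 1, 1, 1]))
# A bi-capitalized word: characters 0 and 3 are upper-cased.
macdonald = apply_true_case("macdonald", pad_flags([1, 0, 0, 1]))
print(nato, macdonald)
```

Predicting a fixed number of flags per subword keeps the output shape constant, at the cost of some unused predictions for short subwords.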
107
+ ## Post-Punctuation Tokens
108
+ This model predicts the following set of "post" punctuation tokens:
109
+
110
+ | Token | Description | Relevant Languages |
111
+ | ------: | :------------- | :------- |
112
+ | . | Latin full stop | Many |
113
+ | , | Latin comma | Many |
114
+ | ? | Latin question mark | Many |
115
+ | ？ | Full-width question mark | Chinese, Japanese |
116
+ | ， | Full-width comma | Chinese, Japanese |
117
+ | 。 | Full-width full stop | Chinese, Japanese |
118
+ | 、 | Ideographic comma | Chinese, Japanese |
119
+ | ・ | Middle dot | Japanese |
120
+ | । | Danda | Hindi |
121
+ | ؟ | Arabic question mark | Arabic |
122
+ | ; | Greek question mark | Greek |
123
+ | ። | Ethiopic full stop | Amharic |
124
+ | ፣ | Ethiopic comma | Amharic |
125
+ | ፧ | Ethiopic question mark | Amharic |
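
Because post punctuation is predicted once per subword, the predictions can be attached to the subword sequence before detokenizing. The sketch below assumes a hypothetical `<NULL>` label for "no punctuation" and hand-written predictions; it relies only on the SentencePiece convention that `▁` (U+2581) marks the start of a word.

```python
from typing import List, Sequence

NULL = "<NULL>"          # hypothetical label meaning "no punctuation here"
WORD_START = "\u2581"    # SentencePiece whitespace marker


def attach_post_punctuation(pieces: Sequence[str], post_preds: Sequence[str]) -> str:
    """Append each predicted post-punctuation token after its subword,
    then turn the SentencePiece word-start markers back into spaces."""
    out: List[str] = []
    for piece, punct in zip(pieces, post_preds):
        out.append(piece)
        if punct != NULL:
            out.append(punct)
    return "".join(out).replace(WORD_START, " ").strip()


# Hand-written subwords and predictions for illustration only.
pieces = [WORD_START + w for w in ["hello", "friend", "how", "are", "you"]]
post_preds = [NULL, ",", NULL, NULL, "?"]

text = attach_post_punctuation(pieces, post_preds)
print(text)
```

For languages written without spaces, such as Chinese and Japanese, the same per-subword scheme applies; only the tokens from the table differ.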
126
 
 
 
 
 
127
 
128
+ # Usage
 
 
 
 
129
 
 
130
 
131
  # Training Details
132
+ This model was trained in the NeMo framework.
133
 
134
  ## Training Data
135
+ This model was trained with News Crawl data from WMT.
136
 
137
+ 1M lines of text were used for each language, except for a few low-resource languages, which may have used fewer.
 
 
 
 
 
 
 
 
138
 
139
+ Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
140
 
141
  # Bias, Risks, and Limitation
142
+ This model was trained on news data, and may not perform well on conversational or informal data.
143
+
144
+ This is also a base-sized model with many languages and many tasks, so capacity may be limited.
145
 
146
  # Evaluation