LuisAVasquez committed
Commit 266d481
1 Parent(s): 108965a

filling model card

Files changed (1)
  1. README.md +123 -1
README.md CHANGED
@@ -21,4 +21,126 @@ widget:
  example_title: "Ubi est Roma?"
  ---

- Test


# Model Card for Simple Latin BERT

<!-- Provide a quick summary of what the model is/does. [Optional] -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.
This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** all input.

Check the `notebooks` folder for the preprocessing and training scripts.

# Table of Contents

- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on Latin corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.
This model's performance is poor, and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** all input, as shown in the sketch after the list below.

Check the `notebooks` folder for the preprocessing and training scripts.

- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** la
- **License:** mit

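
As noted above, the tokenizer lowercases its input. Below is a minimal, hypothetical check of that behaviour; the repository id `LuisAVasquez/simple-latin-bert` is a placeholder, so substitute this model's actual id on the Hugging Face Hub.

```python
from transformers import AutoTokenizer

# Placeholder repository id; replace with this model's actual Hub id.
MODEL_ID = "LuisAVasquez/simple-latin-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Mixed-case Latin input: the tokenizer is expected to lowercase it.
print(tokenizer.tokenize("Ubi est Roma?"))
```
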


# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

This model can be used directly for Masked Language Modelling.

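For example, a `fill-mask` pipeline can query the model for the most likely fillers of a masked token. This is a minimal sketch; the repository id `LuisAVasquez/simple-latin-bert` is again a placeholder for this model's actual Hub id.

```python
from transformers import pipeline

# Placeholder repository id; replace with this model's actual Hub id.
MODEL_ID = "LuisAVasquez/simple-latin-bert"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Build the prompt with the tokenizer's own mask token rather than hard-coding "[MASK]".
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"Ubi est {mask}?"):
    print(prediction["token_str"], prediction["score"])
```
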

## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

This model could be used as a base model for other NLP tasks, for example Text Classification (that is, using transformers' `BertForSequenceClassification`).

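A hedged sketch of that setup is shown below: it only loads the checkpoint with a freshly initialised classification head. The repository id and the two-label setup are placeholders, and the head would still need to be fine-tuned on labelled Latin data before it produces meaningful predictions.

```python
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder repository id; replace with this model's actual Hub id.
MODEL_ID = "LuisAVasquez/simple-latin-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# num_labels=2 is an illustrative placeholder; the classification head is
# randomly initialised and must be fine-tuned before use.
model = BertForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

inputs = tokenizer("Gallia est omnis divisa in partes tres.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one logit per placeholder label
```
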

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comes from the corpora freely available from the [Classical Language Toolkit](http://cltk.org/):

- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)


## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

For preprocessing, the raw text of each corpus was extracted by parsing, **lowercased**, and written to `.txt` files, ideally with one sentence per line.

Other data from the corpora, such as entity tags and POS tags, were discarded.

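The actual preprocessing lives in the `notebooks` folder; the snippet below is only a rough sketch of the step described above (lowercasing and writing roughly one sentence per line to a text file), using a naive period-based split rather than a proper sentence tokenizer.

```python
from pathlib import Path

def to_sentences(raw_text: str) -> list[str]:
    """Lowercase raw corpus text and split it into rough sentences."""
    return [s.strip() for s in raw_text.lower().split(".") if s.strip()]

def write_corpus_file(raw_text: str, out_path: Path) -> None:
    # One (approximate) sentence per line, as described above.
    out_path.write_text("\n".join(to_sentences(raw_text)) + "\n", encoding="utf-8")

write_corpus_file("Gallia est omnis divisa in partes tres. Ubi est Roma?", Path("latin_corpus.txt"))
```
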

Training hyperparameters (a configuration sketch based on these values follows the list):

- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens

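The card does not state the remaining architecture details (hidden size, vocabulary size, etc.), so the following is only a hedged sketch of a BERT configuration matching the listed values; the unlisted values are illustrative placeholders, not this model's actual settings.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    # Values taken from the hyperparameter list above.
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=512,  # max input size of 512 tokens
    # The following are NOT stated in this card; placeholders only.
    vocab_size=30_000,
    hidden_size=768,
)

model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```
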

### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

With the dataset prepared, training this model on a 16 GB NVIDIA graphics card took around 10 hours.

# Evaluation

No evaluation has been performed on this model.