gsalunke committed
Commit c0bd30c · verified
1 Parent(s): 8990cd4

Updated the Model Card with details from the paper.

Files changed (1)
  1. README.md +174 -100
README.md CHANGED
@@ -1,199 +1,273 @@
1
- ---
2
- library_name: transformers
3
- tags: []
4
- ---
5
-
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
10
 
 
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
  - **Funded by [optional]:** [More Information Needed]
22
  - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
  - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
45
 
46
- ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
 
58
- ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
 
70
  ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
78
  ### Training Data
 
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
- ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
- #### Preprocessing [optional]
89
 
90
- [More Information Needed]
91
 
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
 
101
- [More Information Needed]
102
 
103
- ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
- ### Testing Data, Factors & Metrics
108
 
109
- #### Testing Data
110
 
111
- <!-- This should link to a Dataset Card if possible. -->
112
 
113
- [More Information Needed]
114
 
115
- #### Factors
116
 
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
 
119
- [More Information Needed]
120
 
121
- #### Metrics
122
 
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
 
125
- [More Information Needed]
126
 
127
- ### Results
128
 
129
- [More Information Needed]
130
 
131
- #### Summary
132
 
133
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
140
 
141
- ## Environmental Impact
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
- ### Model Architecture and Objective
156
 
157
- [More Information Needed]
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
 
 
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
 
 
170
 
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
 
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
1
+ ---
2
+ library_name: transformers
3
+ language:
4
+ - en
5
+ ---
 
 
 
6
 
7
+ # Model Card for vit-gender
8
 
9
+ Finetuned Vision Transformer (ViT-16) model for classifying the gender of figures in Mixtec codices.
10
 
11
  ## Model Details
12
 
13
  ### Model Description
14
 
15
+ This model classifies the gender (man/woman) of figures depicted in the Mixtec codices. The codices depict historical and mythological scenes using structured pictorial representations. The model, a Vision Transformer (ViT-16), was finetuned on a custom-labeled dataset of 1,300 figures extracted from three historical Mixtec codices.
16
 
 
17
 
18
+ - **Developed by:** [ufdatastudio.com](https://ufdatastudio.com/)
19
  - **Funded by [optional]:** [More Information Needed]
20
  - **Shared by [optional]:** [More Information Needed]
21
+ - **Model type:** Image Classification
22
+ - **Language:** Python
23
  - **License:** [More Information Needed]
24
+ - **Finetuned from model [optional]:** Vision Transformer (ViT-16)
 
 
25
 
26
+ ### Model Sources
27
 
28
+ - **Repository:** https://github.com/ufdatastudio/mixteclabeling
29
+ - **Paper:** [Analyzing Finetuned Vision Models for Mixtec Codex Interpretation](https://ufdatastudio.com/papers/webber2024analyzing.pdf)
30
+ - **Poster:** https://ufdatastudio.com/papers/webber2024analyzing-poster.pdf
31
 
32
  ## Uses
33
 
 
 
34
  ### Direct Use
35
 
36
+ This model is intended for the classification of figures in historical Mixtec codices. The classification of gender assists in the interpretation of ancient Mixtec manuscripts, contributing to historical and anthropological research.
37
 
38
+ ### Downstream Use
39
 
40
+ This model may be used for more advanced tasks such as relationship extraction between figures within a codex scene, potentially helping to reconstruct the narratives depicted in the codices.
41
 
42
+ ### Out-of-Scope Use
43
 
44
+ Using the model for classification on datasets unrelated to Mixtec codices or datasets not following similar pictographic systems could yield inaccurate results. The model may not generalize well to modern or non-Mesoamerican artistic depictions.
45
 
46
+ ## Bias, Risks, and Limitations
47
 
48
+ + The model builds on a pretrained classifier that was trained on data not specific to our domain.
49
 
50
+ + The model inherits all biases previously encoded in that pretrained classifier. We have not investigated how these biases may affect downstream tasks.
51
 
52
+ + The finetuned model generated few errors in our investigation; however, we are unaware of how these biases may result in unintended effects.
53
 
54
+ + This work is an initial investigation into Mixtec and low-resource, semasiographic languages. We are prohibited from deeper explorations until we align our research direction with present communal, cultural, and anthropological needs. Support from Mixtec domain experts and native Mixtec speakers is essential for continued development.
55
 
 
56
 
57
  ### Recommendations
58
 
59
+ Given that the model can reliably classify figures from a low-resource dataset, this research opens the door for further processing and analysis of Mixtec Codices. The codices themselves are highly structured and carry a narrative woven through each scene. Finetuned state-of-the-art models could be combined to classify segmented figures within a scene, as well as classify the relationship between figures. These relationships would then be used to extract the narrative from a codex, as defined by domain experts.
 
 
60
 
61
  ## How to Get Started with the Model
62
 
63
+ ```python
64
+
65
+ from transformers import ViTFeatureExtractor, ViTForImageClassification
+ from PIL import Image
+ import torch
70
+
71
+ # Load the feature extractor and model
72
+ model_name = "ufdatastudio/vit-gender"
73
+ feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
74
+ model = ViTForImageClassification.from_pretrained(model_name)
75
+
76
+ img = Image.open("<path_to_image>").convert("RGB")  # replace with the path to your image
77
+
78
+ # Preprocess the image
79
+ inputs = feature_extractor(images=img, return_tensors="pt")
80
+
81
+ # Run inference (classify the image)
82
+ with torch.no_grad():
83
+ outputs = model(**inputs)
84
+
85
+ # Get predicted class
86
+ predicted_class_idx = outputs.logits.argmax(-1).item()
87
+ labels = model.config.id2label # get labels
88
+ predicted_label = labels[predicted_class_idx]
89
+
90
+ # Print the result
91
+ print(f"Predicted Label: {predicted_label}")
92
+ ```
93
 
94
  ## Training Details
95
 
96
  ### Training Data
97
+ The dataset used for the training of this model can be found at: https://huggingface.co/datasets/ufdatastudio/mixtec-figures
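+
+ If the dataset repository follows the standard Hub image-dataset layout, it can likely be loaded with the `datasets` library as sketched below; the exact splits and column names are not specified on this card and should be checked against the dataset card itself.
+
+ ```python
+ from datasets import load_dataset
+
+ # Hedged sketch: assumes the Hub repository is loadable with the `datasets` library;
+ # check the dataset card for the actual split and column names.
+ ds = load_dataset("ufdatastudio/mixtec-figures")
+ print(ds)  # inspect the available splits and features
+ ```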
98
 
99
+ #### **Dataset Generation**
100
 
101
+ + Extracted labeled data from three codices:
102
 
103
+ 1. **Vindobonensis Mexicanus (65 pages)**: Describes both the mythological and historical founding of the first Mixtec kingdoms.
104
 
105
+ 2. **Selden (20 pages)**: Follows the founding of the kingdom of Jaltepec and its ruler, Lady 6 Monkey.
106
 
107
+ 3. **Zouche-Nuttall, facsimile edition (40 pages)**: Illustrates the life and conquests of Lord 8 Deer Jaguar Claw, but also details the histories of his ancestors.
108
 
109
+ > Note: Other Mixtec codices are extant, but their condition is degraded and not amenable to our current machine-learning pipeline. Each codex is made of deerskin folios, and each folio comprises two pages.
110
 
111
+ + **Extraction Method**: We used the [Segment Anything Model (SAM) from Facebook AI Research](https://segment-anything.com/) to extract individual figures from the three source codices.
112
 
113
+ + Each figure was annotated according to the page on which it was found, its quality (a, b, or c), and its order within the page.
114
 
115
+ **a**. This rating indicated the entire figure was intact, regardless of minor blemishes or cracking, and could be classified by a human annotator as man or woman, standing or not.
116
 
117
+ **b**. This rating means that while the previous characteristics of the figure could be determined, significant portions of the figure were missing or damaged.
118
 
119
+ **c**. Figures with this rating were missing most of the definable characteristics humans could use to classify the sample.
120
 
121
+ + **Data Labeling**: After figure segmentation and grading, we added a classification label (man/woman) to each figure.
122
 
123
+ + **Literature used for evaluation of figures**: Boone, 2000; Smith, 1973; Jansen, 1988; Williams, 2013; Lopez, 2021.
124
 
125
+ + **Criteria used to determine gender class membership**: loincloths and anklets for men, and dresses and braided hair for women.
126
 
127
+ + Two team members independently tagged the images for both categories and then verified their agreement with each other (inter-rater reliability), as illustrated below.
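+
+ The card does not state which agreement statistic was used; the sketch below shows one common way to quantify inter-rater agreement (Cohen's kappa) on two hypothetical annotation lists.
+
+ ```python
+ from sklearn.metrics import cohen_kappa_score
+
+ # Hypothetical annotations from the two team members (placeholder values only)
+ annotator_1 = ["man", "woman", "woman", "man", "woman"]
+ annotator_2 = ["man", "woman", "man", "man", "woman"]
+
+ # Cohen's kappa measures agreement beyond chance; 1.0 is perfect agreement
+ print("Cohen's kappa:", cohen_kappa_score(annotator_1, annotator_2))
+ ```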
128
 
129
+ ### Training Procedure
130
 
131
+ #### **Preprocessing**
132
 
133
+ + Figures are converted to tensors and resized to 224x224 pixels.
134
 
135
+ + The loss function counteracts class imbalance by weighting each class by its inverse frequency.
136
 
137
+ + Due to the overall limited number of figures, and to prevent overfitting, the entire dataset was augmented using random flips and blocking to increase the number of training samples (see the sketch after this list).
138
 
139
+ + The dataset is split into training, testing, and validation sets (60%, 20%, and 20%, respectively).
140
 
141
+ + Eight reference images were set aside to monitor which features of gender are prevalent in activation and attention maps throughout training.
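+
+ A minimal sketch of the preprocessing and augmentation steps described above, using torchvision; the exact transform parameters, the class counts, and the form of the "blocking" augmentation (shown here as random erasing) are assumptions, not values taken from the paper.
+
+ ```python
+ import torch
+ from torchvision import transforms
+
+ # Resize to 224x224, convert to tensors, and augment with random flips
+ # and random erasing as a stand-in for the "blocking" augmentation.
+ train_transform = transforms.Compose([
+     transforms.Resize((224, 224)),
+     transforms.RandomHorizontalFlip(p=0.5),
+     transforms.ToTensor(),
+     transforms.RandomErasing(p=0.25),
+ ])
+
+ # Inverse-frequency class weights for the loss; the counts below are placeholders.
+ class_counts = torch.tensor([700.0, 600.0])          # e.g. [man, woman]
+ class_weights = class_counts.sum() / class_counts
+ criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
+
+ # 60/20/20 split of a torch Dataset object (float fractions need PyTorch >= 1.13):
+ # train_set, test_set, val_set = torch.utils.data.random_split(dataset, [0.6, 0.2, 0.2])
+ ```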
142
 
143
+ #### **Model Training**
144
 
145
+ + We finetuned the popular vision model ViT-16 to perform the classification task and to improve computational efficiency.
146
 
147
+ + Imported the model and its pre-trained weights from the PyTorch library, then unfroze the last four layers and heads of the model for training, as they are responsible for learning complex features specific to our classification tasks.
148
 
149
+ + Replaced the fully connected layer with one matching our binary classification task (see the sketch after this list).
150
 
151
+ + Before the first and after the last epoch of training, an attention map is output for each reference image.
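+
+ A sketch of the layer-freezing and head-replacement steps described above, using the torchvision ViT-B/16 implementation; reading "the last four layers" as the last four encoder blocks is an assumption.
+
+ ```python
+ import torch.nn as nn
+ from torchvision.models import vit_b_16, ViT_B_16_Weights
+
+ # Load ViT-B/16 with pretrained weights from torchvision (the "PyTorch library")
+ model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
+
+ # Freeze all parameters, then unfreeze the last encoder blocks
+ for p in model.parameters():
+     p.requires_grad = False
+ for block in list(model.encoder.layers)[-4:]:   # assumed reading of "last four layers"
+     for p in block.parameters():
+         p.requires_grad = True
+
+ # Replace the classification head with a two-class (man/woman) layer;
+ # the new layer's parameters are trainable by default
+ model.heads.head = nn.Linear(model.hidden_dim, 2)
+ ```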
152
 
153
 
154
+ ### **Hyperparameter Tuning**
155
 
156
+ + Experimented with different batch sizes, ranging from 32 to 128, and opted for an intermediate value of 64, as no size significantly outperformed the others.
157
 
158
+ + Selected the loss function and optimizer according to the best practices associated with ViT.
159
 
160
+ + Hyperparameter investigations revealed that training and validation accuracy converged around 100 epochs and that the ideal learning rate was 0.00025 (a minimal training-configuration sketch follows this list).
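+
+ A minimal training-configuration sketch using the values reported above; it reuses `model` and `criterion` from the earlier sketches, and the choice of Adam is an assumption (the card only says the loss and optimizer follow common ViT practice).
+
+ ```python
+ import torch
+
+ BATCH_SIZE = 64          # chosen after comparing sizes from 32 to 128
+ LEARNING_RATE = 0.00025  # reported ideal learning rate
+ EPOCHS = 100             # accuracy converged around 100 epochs
+
+ # Optimize only the unfrozen parameters (last encoder blocks + new head)
+ optimizer = torch.optim.Adam(
+     (p for p in model.parameters() if p.requires_grad),
+     lr=LEARNING_RATE,
+ )
+
+ # train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
+ # for epoch in range(EPOCHS):
+ #     for images, labels in train_loader:
+ #         optimizer.zero_grad()
+ #         loss = criterion(model(images), labels)
+ #         loss.backward()
+ #         optimizer.step()
+ ```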
161
 
162
+ ### **Model Evaluation**
163
 
164
+ + For each training and validation run, we collected metrics such as accuracy, F1, recall, loss, and precision (a small example of computing these metrics follows this list).
165
 
166
+ + The testing accuracy was around 98% with a standard deviation of 1%.
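+
+ For reference, these metrics can be computed from test-set predictions as sketched below; the labels and predictions shown are placeholders, and the 0/1 label mapping is an assumption.
+
+ ```python
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ # Placeholder values: in practice y_true/y_pred come from running the
+ # finetuned model over the held-out test split.
+ y_true = [0, 1, 1, 0, 1]   # assumed mapping, e.g. 0 = man, 1 = woman
+ y_pred = [0, 1, 0, 0, 1]
+
+ accuracy = accuracy_score(y_true, y_pred)
+ precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
+ print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
+ ```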
168
 
169
+ ### Testing Data, Factors & Metrics
170
 
171
+ #### Testing Data
172
 
173
+ The test set was 20% of the overall dataset, comprising 260 figures from all three codices.
174
 
 
175
 
176
+ #### Factors
177
+
178
+ There is an imbalance in the number of images that belong to a particular gender in each codex. This can be attributed to the fact that each codex is centered on a different figure.
179
 
180
+ #### Metrics
181
 
182
+ The model's performance was evaluated using accuracy, precision, recall, and F1 scores. The model achieved around 98% accuracy.
183
 
184
+ ### Results
185
 
186
+ The purpose of building the model was to answer the following questions:
187
+
188
+ 1. **Can transformer-based models be finetuned to classify figures from a Mixtec Codices dataset?**
189
 
190
+ Yes! The model achieved strong results across the training, validation, and testing phases when using an appropriate learning rate.
191
 
192
+ 2. **Does the model identify the same features experts do?**
193
 
194
+ + We assigned reference images for each class (man and woman, and standing/not standing) to understand which features each model learned, as well as to compare these learned features to those highlighted by experts.
195
 
196
+ + During training, we generated visualizations of activation and attention per pixel to view how the models learned important features over time (a minimal attention-extraction sketch follows this list).
197
 
198
+ <!-- ![Alt text](image-1.png) -->
199
 
200
+ + The ViT model assigned higher attention to areas corresponding to loincloths on men and showed increased attention to the poncho area on women.
201
 
202
+ + To verify that the model is indeed identifying the same features noted in the literature, we masked attributes on the reference images.
203
 
204
+ + We extended our reference image set by adding three variations to each image: either blocked hair, blocked skirt, or both for women. This process was replicated for the two features indicative of men.
205
 
206
+ + ViT correctly predicted 100% of the unblocked reference images, 79% of the singly blocked images, and 63% of the doubly blocked images.
207
 
208
+ + For the doubly blocked images, the model fails to find defined areas of attention. This supports the conclusion that the model is learning the features defined in the literature.
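+
+ A sketch of how per-patch attention can be pulled from the released model for a reference image; averaging the final layer's CLS-token attention over heads is a simplification of the visualizations described above, not the authors' exact procedure.
+
+ ```python
+ import torch
+ from transformers import ViTFeatureExtractor, ViTForImageClassification
+ from PIL import Image
+
+ model_name = "ufdatastudio/vit-gender"
+ feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
+ model = ViTForImageClassification.from_pretrained(model_name)
+
+ img = Image.open("<path_to_reference_image>").convert("RGB")
+ inputs = feature_extractor(images=img, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs, output_attentions=True)
+
+ # outputs.attentions: one tensor per layer of shape (batch, heads, tokens, tokens)
+ last_layer = outputs.attentions[-1]
+ cls_to_patches = last_layer[0, :, 0, 1:].mean(dim=0)  # CLS attention to the 196 image patches
+ attention_map = cls_to_patches.reshape(14, 14)        # coarse 14x14 map; upsample to overlay on the image
+ ```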
209
 
 
210
 
211
+ #### Summary
212
 
213
+ We presented a low-resource dataset of figures from three Mixtec codices: Zouche-Nuttall, Selden, and Vindobonensis Mexicanus I. We extracted the figures using the Segment Anything Model and labeled them according to gender and pose, two critical features used to understand Mixtec codices. Using this novel dataset, we finetuned the last few layers of the foundation models VGG-16 and ViT-16 to classify figures as either man or woman and standing or not standing. We found that both models achieve high accuracy on this task, but that ViT-16 may be more reliable across varying learning rates. We confirmed that the models are learning the features said to be relevant by experts using class activation maps and targeted blocking of those features. Given that these models can reliably classify figures from a low-resource dataset, this research opens the door for further processing and analysis of Mixtec codices. The codices themselves are highly structured and carry a narrative woven through each scene. Finetuned state-of-the-art models could be combined to classify segmented figures within a scene, as well as to classify the relationships between figures. These relationships could then be used to extract the narrative from a codex, as defined by domain experts.
215
 
216
+ ## Environmental Impact
217
 
218
+ We have not yet explored more environmentally efficient models. The environmental impact is comparable to that of finetuning a standard Vision Transformer (ViT-16) model.
219
+
220
+ ## Technical Specifications
221
+
222
+ ### Compute Infrastructure
223
+
224
+ #### Hardware
225
+
226
+ Model training and inference were performed on an Nvidia A100 on the HiPerGator cluster using PyTorch 2.1 and CUDA 11.
227
+
228
+ #### Software
229
+
230
+ PyTorch 2.1 framework with CUDA 11 (see Hardware above).
231
+
232
+ ## Citation
233
+
234
+ **BibTeX:**
235
+ ```BibTeX
236
+ @inproceedings{webber-etal-2024-analyzing,
237
+ title = "Analyzing Finetuned Vision Models for {M}ixtec Codex Interpretation",
238
+ author = "Webber, Alexander and
239
+ Sayers, Zachary and
240
+ Wu, Amy and
241
+ Thorner, Elizabeth and
242
+ Witter, Justin and
243
+ Ayoubi, Gabriel and
244
+ Grant, Christan",
245
+ editor = "Mager, Manuel and
246
+ Ebrahimi, Abteen and
247
+ Rijhwani, Shruti and
248
+ Oncevay, Arturo and
249
+ Chiruzzo, Luis and
250
+ Pugh, Robert and
251
+ von der Wense, Katharina",
252
+ booktitle = "Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)",
253
+ month = jun,
254
+ year = "2024",
255
+ address = "Mexico City, Mexico",
256
+ publisher = "Association for Computational Linguistics",
257
+ url = "https://aclanthology.org/2024.americasnlp-1.6",
258
+ doi = "10.18653/v1/2024.americasnlp-1.6",
259
+ pages = "42--49",
260
+ abstract = "Throughout history, pictorial record-keeping has been used to document events, stories, and concepts. A popular example of this is the Tzolk{'}in Maya Calendar. The pre-Columbian Mixtec society also recorded many works through graphical media called codices that depict both stories and real events. Mixtec codices are unique because the depicted scenes are highly structured within and across documents. As a first effort toward translation, we created two binary classification tasks over Mixtec codices, namely, gender and pose. The composition of figures within a codex is essential for understanding the codex{'}s narrative. We labeled a dataset with around 1300 figures drawn from three codices of varying qualities. We finetuned the Visual Geometry Group 16 (VGG-16) and Vision Transformer 16 (ViT-16) models, measured their performance, and compared learned features with expert opinions found in literature. The results show that when finetuned, both VGG and ViT perform well, with the transformer-based architecture (ViT) outperforming the CNN-based architecture (VGG) at higher learning rates. We are releasing this work to allow collaboration with the Mixtec community and domain scientists.",
261
+ }
262
+ ```
263
+
264
+ ## Glossary
265
+
266
+ **Figures**: Representations of people or gods in Mixtec mythology, composed of different outfits, tools, and positions. Their names are represented by icons placed near their position on a page.
267
+
268
+ <!-- ## Model Card Authors [optional]
269
+
270
+ [More Information Needed] -->
271
+
272
+ ## Model Card Contact
273
+ https://ufdatastudio.com/contact/