vit-pose / README.md
metadata
library_name: transformers
pipeline_tag: keypoint-detection

Model Card for vit-pose

Finetuned Vision Transformer (ViT-16) model for classifying the pose of figures in Mixtec codices.

Model Details

Model Description

This model classifies the pose of figures depicted in the Mixtec codices (standing/not standing). The codices depict historical and mythological scenes using structured pictorial representations. The model, a Vision Transformer (ViT-16), was finetuned on a custom-labeled dataset of 1,300 figures extracted from three historical Mixtec codices.

  • Developed by: ufdatastudio.com
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: Image Classification
  • Language: Python
  • License: [More Information Needed]
  • Finetuned from model [optional]: Vision Transformer (ViT-16)

Model Sources

Uses

Direct Use

This model is intended for the classification of figures in historical Mixtec codices. The classification of pose assists in the interpretation of ancient Mixtec manuscripts, contributing to historical and anthropological research.

Downstream Use

This model may be used for more advanced tasks such as relationship extraction between figures within a codex scene, potentially helping to reconstruct the narratives depicted in the codices.

Out-of-Scope Use

Using the model for classification on datasets unrelated to Mixtec codices or datasets not following similar pictographic systems could yield inaccurate results. The model may not generalize well to modern or non-Mesoamerican artistic depictions.

Bias, Risks, and Limitations

  • The model builds on a pretrained classifier that was trained on data not specific to our domain.

  • The model inherits all biases previously encoded in the pretrained model. We have not investigated how these biases may affect downstream tasks.

  • The finetuned model produced few errors in our investigation; however, we do not know how the inherited biases may result in unintended effects.

  • This work is an initial investigation into Mixtec and low-resource, semasiographic languages. We are prohibited from deeper explorations until we align our research direction with present communal, cultural, and anthropological needs. Support from Mixtec domain experts and native Mixtec speakers is essential for continued development.

Recommendations

Given that the model can reliably classify figures from a low-resource dataset, this research opens the door for further processing and analysis of Mixtec Codices. The codices themselves are highly structured and carry a narrative woven through each scene. Finetuned state-of-the-art models could be combined to classify segmented figures within a scene, as well as classify the relationship between figures. These relationships would then be used to extract the narrative from a codex, as defined by domain experts.

How to Get Started with the Model


from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Load the image processor and model
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
model_name = "ufdatastudio/vit-pose"
image_processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

# Download the image and convert it to RGB
response = requests.get("<link_to_image>", timeout=30)
response.raise_for_status()
img = Image.open(BytesIO(response.content)).convert("RGB")

# Preprocess the image
inputs = image_processor(images=img, return_tensors="pt")

# Run inference (classify the image)
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted class
predicted_class_idx = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]

# Print the result
print(f"Predicted Label: {predicted_label}")

Training Details

Training Data

The dataset used for the training of this model can be found at: https://huggingface.co/datasets/ufdatastudio/mixtec-figures

Dataset Generation

  • Extracted labelled data from 3 Codices:

    1. Vindobonensis Mexicanus (65 pages): Describes both the mythological and historical founding of the first Mixtec kingdoms.

    2. Selden (20 pages): Follows the founding of the kingdom of Jaltepec and its ruler, Lady 6 Monkey.

    3. Zouche-Nuttall (facsimile edition, 40 pages): Illustrates the life and conquests of Lord 8 Deer Jaguar Claw, but also details the histories of his ancestors.

Note: Other Mixtec codices are extant, but their degraded condition makes them unsuitable for our current machine-learning pipeline. Each codex is made of deerskin folios, and each folio comprises two pages.

  • Extraction Method: We used the Segment Anything Model (SAM) from Facebook AI Research to extract individual figures from the three source codices.

    • Each figure was annotated according to the page on which it was found, its quality rating (a, b, or c), and its order within the page.

      a. An "a" rating indicates that the entire figure was intact, regardless of minor blemishes or cracking, and could be classified by a human annotator as man or woman, standing or not.

      b. A "b" rating means that, while the previous characteristics of the figure could be determined, significant portions of the figure were missing or damaged.

      c. A "c" rating means the figure was missing most of the definable characteristics humans could use to classify the sample.

  • Data Labelling: After figure segmentation and grading, we added classification labels to each figure (standing/not standing).

    • Literature used for evaluation of figures: Boone, 2000; Smith, 1973; Jansen, 1988; Williams, 2013; Lopez, 2021.

    • Criteria used to determine standing and not standing: If the figure is clearly on two feet and in an upright position, it is labeled standing, and any other position is labeled not standing.

    • Two team members independently tagged the images for both categories and then verified the results with each other through an inter-rater reliability check.
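
The card does not say which agreement statistic was used for the inter-rater check. As one common option, Cohen's kappa for two annotators can be sketched as below. The function name and the example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators (not real data).
annotator_1 = ["standing", "standing", "not standing", "not standing", "standing", "not standing"]
annotator_2 = ["standing", "not standing", "not standing", "not standing", "standing", "not standing"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa near 1 indicates agreement well beyond chance; disagreements flagged this way would then be resolved by discussion, as the bullet above describes.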

Training Procedure

Preprocessing

  • Figures are converted to tensors and resized to 224x224 pixels.

  • To offset class imbalance, the loss function weights each class by its inverse frequency.

  • Due to the overall limited number of figures, and to prevent overfitting, the entire dataset was augmented by using random flips and blocking to increase the number of samples for training.

  • The dataset is split into training, testing, and validation sets, 60%, 20%, and 20% respectively.

  • Eight reference images were set aside to monitor which features of pose are prevalent in activation and attention maps throughout training.
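
The split and the class weighting described in this list can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the helper names, the random seed, the 900/400 label counts, and the normalization of the inverse-frequency weights by the number of classes are all hypothetical.

```python
import random
from collections import Counter


def split_indices(n_items, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle item indices and split them into train/test/validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_frac)
    n_test = round(n_items * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # the remainder, about 20%
    return train, test, val


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * count) for cls, count in counts.items()}


# Hypothetical label counts; the real class balance is not stated in this card.
labels = ["not standing"] * 900 + ["standing"] * 400
train_idx, test_idx, val_idx = split_indices(len(labels))
weights = inverse_frequency_weights([labels[i] for i in train_idx])
print(len(train_idx), len(test_idx), len(val_idx))
print(weights)
```

The resulting weights could be passed, in class-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`, so that errors on the rarer class cost more.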

Model Training

  • We finetuned the popular vision model ViT-16 to perform the classification task and improve computational efficiency.

  • Imported the model and its pre-trained weights from the PyTorch library, then unfroze the last four layers and heads of the model for training, as they are responsible for learning complex features specific to our classification tasks.

  • Replaced the fully connected layer with one matching our binary classification task.

  • Before the first and after the last epoch of training, an attention map is output for each reference image.

Hyperparameter Tuning

  • Experimented with different batch sizes, ranging from 32 to 128, and opted for a middle value of 64 as no size significantly outperformed the others.

  • Selected the loss function and optimizer according to the best practices associated with ViT.

  • Hyperparameter investigations revealed that the accuracy for training and validation converged around 100 epochs and the ideal learning rate was 0.00025.

Model Evaluation

  • For each training and validation run, we collected metrics such as accuracy, F1, recall, loss, and precision.

  • The testing accuracy was around 98% with a standard deviation of 1%.

Testing Data, Factors & Metrics

Testing Data

The test set was 20% of the overall dataset, comprising 260 figures from all three codices.

Factors

In the dataset, images labelled 'not standing' outnumber images labelled 'standing'. The reason for this is unclear, although the imbalance intuitively makes sense given the number of ceremonies each codex describes, which entail a seated or kneeling position.

Metrics

The model’s performance was evaluated using accuracy, precision, recall, and F1 scores. In the accompanying study, both finetuned models (ViT-16 and VGG-16) reached around 98% accuracy, with ViT-16 outperforming VGG-16 in some configurations.
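
For reference, the four metrics named above can be computed from predicted and true labels as in the sketch below. Treating "standing" as the positive class is an assumption, and the example predictions are hypothetical, not model output.

```python
def binary_metrics(y_true, y_pred, positive="standing"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("At least one labeled example is required.")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical predictions, for illustration only.
y_true = ["standing", "standing", "not standing", "not standing", "standing"]
y_pred = ["standing", "not standing", "not standing", "standing", "standing"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because the classes are imbalanced, precision, recall, and F1 are more informative than accuracy alone.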

Results

The purpose of building the model was to answer two questions:

  1. Can transformer-based models be finetuned to classify figures from a Mixtec codices dataset?

Yes. The model achieved strong results across the training, validation, and testing phases when using an appropriate learning rate.

  2. Does the model identify the same features experts do?
  • We assigned reference images for each class (man and woman, and standing/not standing) to understand which features each model learned, as well as to compare these learned features to those highlighted by experts.

  • During training, we generated visualizations of activation and attention per pixel to view how the models learned important features over time.

  • The ViT model assigned higher attention to areas corresponding to loincloths on men and showed increased attention to the poncho area on women.

  • To verify that the model is indeed identifying the same features noted in literature, we masked attributes on the reference images.

  • We extended our reference image set by adding three variations of each image: for women, blocked hair, blocked skirt, or both. This process was replicated for the two features indicative of men.

  • ViT correctly predicted 100% of the unblocked reference images, 79% of the singly blocked images, and 63% of the doubly blocked images.

  • For the doubly blocked images, the model fails to find defined areas of attention. This supports the conclusion that the model is learning the features defined in the literature.
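
The blocking of features on reference images can be illustrated with Pillow. This is a minimal sketch under assumptions: the card does not say how regions were blocked, so the solid black fill, the helper name, and the box coordinates are hypothetical, and a blank image stands in for a real reference figure.

```python
from PIL import Image, ImageDraw


def block_region(image, box, fill=(0, 0, 0)):
    """Return a copy of the image with the (left, top, right, bottom) box filled."""
    blocked = image.copy()
    ImageDraw.Draw(blocked).rectangle(box, fill=fill)
    return blocked


# Stand-in for a 224x224 reference figure; the box coordinates are hypothetical.
reference = Image.new("RGB", (224, 224), color=(255, 255, 255))
feature_box = (60, 20, 160, 80)
blocked = block_region(reference, feature_box)
print(blocked.getpixel((100, 50)), reference.getpixel((100, 50)))
```

Applying the helper once per feature gives the singly blocked variants, and applying it twice with different boxes gives the doubly blocked variant.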

Summary

We presented a low-resource dataset of figures from three Mixtec codices: Zouche-Nuttall, Selden, and Vindobonensis Mexicanus I. We extracted the figures using the Segment Anything Model and labeled them according to pose, a critical feature used to understand Mixtec codices. Using this novel dataset, we finetuned the last few layers of the transformer-based foundation model ViT-16 to classify figures as standing or not standing. Using class activation maps and targeted blocking of features, we confirmed that the model is learning the features experts consider relevant.

Environmental Impact

We have not yet explored more environmentally efficient models. The environmental impact is the same as that of the Vision Transformer models.

Technical Specifications

Compute Infrastructure

Hardware

Model training and inference were performed on an Nvidia A100 on the HiPerGator cluster using PyTorch 2.1 and CUDA 11.

Software

PyTorch framework

Citation

BibTeX:

@inproceedings{webber-etal-2024-analyzing,
    title = "Analyzing Finetuned Vision Models for {M}ixtec Codex Interpretation",
    author = "Webber, Alexander  and
      Sayers, Zachary  and
      Wu, Amy  and
      Thorner, Elizabeth  and
      Witter, Justin  and
      Ayoubi, Gabriel  and
      Grant, Christan",
    editor = "Mager, Manuel  and
      Ebrahimi, Abteen  and
      Rijhwani, Shruti  and
      Oncevay, Arturo  and
      Chiruzzo, Luis  and
      Pugh, Robert  and
      von der Wense, Katharina",
    booktitle = "Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.americasnlp-1.6",
    doi = "10.18653/v1/2024.americasnlp-1.6",
    pages = "42--49",
    abstract = "Throughout history, pictorial record-keeping has been used to document events, stories, and concepts. A popular example of this is the Tzolk{'}in Maya Calendar. The pre-Columbian Mixtec society also recorded many works through graphical media called codices that depict both stories and real events. Mixtec codices are unique because the depicted scenes are highly structured within and across documents. As a first effort toward translation, we created two binary classification tasks over Mixtec codices, namely, gender and pose. The composition of figures within a codex is essential for understanding the codex{'}s narrative. We labeled a dataset with around 1300 figures drawn from three codices of varying qualities. We finetuned the Visual Geometry Group 16 (VGG-16) and Vision Transformer 16 (ViT-16) models, measured their performance, and compared learned features with expert opinions found in literature. The results show that when finetuned, both VGG and ViT perform well, with the transformer-based architecture (ViT) outperforming the CNN-based architecture (VGG) at higher learning rates. We are releasing this work to allow collaboration with the Mixtec community and domain scientists.",
}

Glossary

Figures: Representations of people or gods in Mixtec mythology, composed of different outfits, tools, and positions. Their names are represented by icons placed near their position on a page.

Model Card Contact

https://ufdatastudio.com/contact/