---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Benchmark using different techniques

ML model for Climate Disinformation Classification

## Model Description

### Intended Use

- Primary intended uses: Baseline comparison for climate disinformation classification models
- Primary intended users: Researchers and developers participating in the Frugal AI Challenge
- Out-of-scope use cases: Not intended for production use or real-world classification tasks

### Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:

- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims
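
A minimal loading sketch with the Hugging Face `datasets` library, assuming the column names are `quote` and `label` (they may differ in the actual dataset):

```python
# Minimal sketch: load the challenge dataset and create the train/test split.
# The column names ("quote", "label") are assumptions and may differ.
from datasets import load_dataset

dataset = load_dataset("QuotaClimat/frugalaichallenge-text-train", split="train")
splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80% train / 20% test

quotes = splits["train"]["quote"]
labels = splits["train"]["label"]
print(len(splits["train"]), "train examples,", len(splits["test"]), "test examples")
```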

### Labels

  1. No relevant claim detected
  2. Global warming is not happening
  3. Not caused by humans
  4. Not bad or beneficial
  5. Solutions harmful/unnecessary
  6. Science is unreliable
  7. Proponents are biased
  8. Fossil fuels are needed
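
For reference, the same categories as a Python mapping; the integer ids are an assumption about how the labels are encoded:

```python
# Hypothetical id -> category mapping for the eight labels above;
# the exact encoding used in the dataset may differ.
LABELS = {
    0: "No relevant claim detected",
    1: "Global warming is not happening",
    2: "Not caused by humans",
    3: "Not bad or beneficial",
    4: "Solutions harmful/unnecessary",
    5: "Science is unreliable",
    6: "Proponents are biased",
    7: "Fossil fuels are needed",
}
```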

## Performance

### Metrics (measured on an NVIDIA T4 small GPU)

- Accuracy: ~69-72%
- Environmental impact:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)

## Model Architecture

ML models prefer numeric inputs, so the quotes first need to be embedded. I used the MTEB Leaderboard on Hugging Face to find the embedding model with the best trade-off between performance and number of parameters.

I chose the "dunzhang/stella_en_400M_v5" model as the embedder: it has the 7th best performance score with only 400M parameters.

Once the quotes are embedded, the data is a matrix of 6091 examples x 1024 features, which is then split into train and test sets (70% / 30%).
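
A sketch of the embedding and split steps with `sentence-transformers` and scikit-learn; the `quotes`/`labels` variables follow the loading sketch above, and `trust_remote_code=True` is typically required by the stella models:

```python
# Sketch: embed the quotes with stella_en_400M_v5, then do a 70/30 split.
# Assumes `quotes` (list of str) and `labels` (list of int) are already loaded.
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split

embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
X = embedder.encode(quotes, batch_size=32, show_progress_bar=True)  # shape: (6091, 1024)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)
```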

Using the TPOT classifier (AutoML), the best model found on this data was a logistic regression.
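
An illustrative sketch of the TPOT search and the logistic-regression model it converged to; the hyperparameters shown are placeholders, not the settings actually used:

```python
# Illustrative AutoML search with TPOT (hyperparameters are placeholders).
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print("TPOT test score:", tpot.score(X_test, y_test))

# TPOT converged to logistic regression, which can be refit directly:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```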

The resulting confusion matrix is shown below:

*(confusion matrix image)*

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.
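
A minimal sketch of how CodeCarbon can wrap the inference step; the `clf`, `embedder`, and `test_quotes` names are assumptions carried over from the sketches above:

```python
# Sketch: track emissions and energy around inference with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes energy/emissions data to emissions.csv by default
tracker.start()
try:
    predictions = clf.predict(embedder.encode(test_quotes))  # illustrative inference step
finally:
    emissions_kg = tracker.stop()  # returns estimated emissions in kg CO2eq

print(f"emissions: {emissions_kg * 1000:.2f} gCO2eq")
```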

## Limitations

- The embedding phase takes ~30 seconds for 1800 quotes. It can be optimised (see the sketch after this list), which would have a real influence on carbon emissions.
- It is hard to exceed 70% accuracy with "simple" ML models.
- Textual data carries nuances of interpretation that small models cannot capture.
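
A hedged sketch of the batching/GPU optimisation hinted at in the first bullet; the batch size and device handling are illustrative assumptions:

```python
# Illustrative speed-up of the embedding phase: larger batches on GPU.
# Parameter values are assumptions, not benchmarked settings.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True, device=device)

X = embedder.encode(
    quotes,                 # list of str, assumed already loaded
    batch_size=128,         # larger batches amortise per-batch overhead on GPU
    convert_to_numpy=True,
    show_progress_bar=True,
)
```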

## Ethical Considerations

- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint