---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Benchmark using different techniques

ML model for Climate Disinformation Classification

## Model Description

### Intended Use

- Primary intended uses: Baseline comparison for climate disinformation classification models
- Primary intended users: Researchers and developers participating in the Frugal AI Challenge
- Out-of-scope use cases: Not intended for production use or real-world classification tasks

### Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:

- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims
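
A minimal loading sketch with the Hugging Face `datasets` library, assuming the column names are `quote` and `label` (they may differ in the actual dataset):

```python
# Minimal sketch: load the challenge dataset and create the train/test split.
# The column names ("quote", "label") are assumptions and may differ.
from datasets import load_dataset

dataset = load_dataset("QuotaClimat/frugalaichallenge-text-train", split="train")
splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80% train / 20% test

quotes = splits["train"]["quote"]
labels = splits["train"]["label"]
print(len(splits["train"]), "train examples,", len(splits["test"]), "test examples")
```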

### Labels

  1. No relevant claim detected
  2. Global warming is not happening
  3. Not caused by humans
  4. Not bad or beneficial
  5. Solutions harmful/unnecessary
  6. Science is unreliable
  7. Proponents are biased
  8. Fossil fuels are needed
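
For reference, the same categories as a Python mapping; the integer ids are an assumption about how the labels are encoded:

```python
# Hypothetical id -> category mapping for the eight labels above;
# the exact encoding used in the dataset may differ.
LABELS = {
    0: "No relevant claim detected",
    1: "Global warming is not happening",
    2: "Not caused by humans",
    3: "Not bad or beneficial",
    4: "Solutions harmful/unnecessary",
    5: "Science is unreliable",
    6: "Proponents are biased",
    7: "Fossil fuels are needed",
}
```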

## Performance

### Metrics (measured on an NVIDIA T4 small GPU)

- Accuracy: ~69-72%
- Environmental impact:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)

## Model Architecture

ML models prefer numeric inputs, so the quotes first need to be embedded. I used the MTEB Leaderboard on Hugging Face to find the embedding model with the best trade-off between performance and number of parameters.

I chose the "dunzhang/stella_en_400M_v5" model as the embedder: it has the 7th best performance score with only 400M parameters.

Once the quotes are embedded, the data is a matrix of 6091 examples x 1024 features, which is then split into train and test sets (70% / 30%).
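
A sketch of the embedding and split steps with `sentence-transformers` and scikit-learn; the `quotes`/`labels` variables follow the loading sketch above, and `trust_remote_code=True` is typically required by the stella models:

```python
# Sketch: embed the quotes with stella_en_400M_v5, then do a 70/30 split.
# Assumes `quotes` (list of str) and `labels` (list of int) are already loaded.
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split

embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
X = embedder.encode(quotes, batch_size=32, show_progress_bar=True)  # shape: (6091, 1024)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)
```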

Using the TPOT classifier (AutoML), the best model found on this data was a logistic regression.
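
An illustrative sketch of the TPOT search and the logistic-regression model it converged to; the hyperparameters shown are placeholders, not the settings actually used:

```python
# Illustrative AutoML search with TPOT (hyperparameters are placeholders).
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print("TPOT test score:", tpot.score(X_test, y_test))

# TPOT converged to logistic regression, which can be refit directly:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```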

The resulting confusion matrix is shown below:

*(confusion matrix image)*

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.
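
A minimal sketch of how CodeCarbon can wrap the inference step; the `clf`, `embedder`, and `test_quotes` names are assumptions carried over from the sketches above:

```python
# Sketch: track emissions and energy around inference with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes energy/emissions data to emissions.csv by default
tracker.start()
try:
    predictions = clf.predict(embedder.encode(test_quotes))  # illustrative inference step
finally:
    emissions_kg = tracker.stop()  # returns estimated emissions in kg CO2eq

print(f"emissions: {emissions_kg * 1000:.2f} gCO2eq")
```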

## Limitations

- The embedding phase takes ~30 seconds for 1800 quotes. It can be optimised (see the sketch after this list), which would have a real influence on carbon emissions.
- It is hard to exceed 70% accuracy with "simple" ML models.
- Textual data carries nuances of interpretation that small models cannot capture.
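
A hedged sketch of the batching/GPU optimisation hinted at in the first bullet; the batch size and device handling are illustrative assumptions:

```python
# Illustrative speed-up of the embedding phase: larger batches on GPU.
# Parameter values are assumptions, not benchmarked settings.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True, device=device)

X = embedder.encode(
    quotes,                 # list of str, assumed already loaded
    batch_size=128,         # larger batches amortise per-batch overhead on GPU
    convert_to_numpy=True,
    show_progress_bar=True,
)
```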

## Ethical Considerations

- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint