---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Benchmark using different techniques

## ML model for Climate Disinformation Classification

### Model Description

#### Intended Use

- **Primary intended uses**: Baseline comparison for climate disinformation classification models
- **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases**: Not intended for production use or real-world classification tasks

### Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims

#### Labels
0. No relevant claim detected
1. Global warming is not happening
2. Not caused by humans

…

6. Proponents are biased
7. Fossil fuels are needed

### Performance

#### Metrics (I used an NVIDIA T4 small GPU)
- **Accuracy**: ~69-72%
- **Environmental Impact**:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)
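
As a back-of-the-envelope check (assuming the ~1.8 Wh and ~0.7 g totals cover the ~1800-quote run mentioned under Limitations), the per-quote footprint works out to roughly 1 mWh and 0.4 mg CO2eq:

```python
# Per-quote footprint, assuming the reported totals (~1.8 Wh, ~0.7 gCO2eq)
# cover the ~1800 quotes mentioned in the Limitations section.
N_QUOTES = 1800
ENERGY_WH = 1.8      # total energy reported by CodeCarbon
EMISSIONS_G = 0.7    # total emissions reported by CodeCarbon, in gCO2eq

energy_per_quote_wh = ENERGY_WH / N_QUOTES              # 0.001 Wh = 1 mWh
emissions_per_quote_mg = EMISSIONS_G * 1000 / N_QUOTES  # ~0.39 mg CO2eq

print(f"{energy_per_quote_wh * 1000:.2f} mWh/quote, "
      f"{emissions_per_quote_mg:.2f} mg CO2eq/quote")
```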

#### Model Architecture
ML models prefer numeric inputs, so the quotes need to be embedded first. I used the *MTEB Leaderboard* on HuggingFace to find the model with the best trade-off between performance and number of parameters.

I then chose the "dunzhang/stella_en_400M_v5" model as the embedder. It has the 7th-best performance score with only 400M parameters.

Once the quotes are embedded, I have 6091 examples x 1024 features. After that, a train-test split (70% / 30%).

Using the TPOT classifier search, I found that the best model on my data was a logistic regression.
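
The pipeline can be sketched as follows. The random vectors stand in for the stella_en_400M_v5 embeddings (producing real ones requires downloading the 400M-parameter model), and scikit-learn's `LogisticRegression` stands in for the model that the TPOT search selected; the 70/30 split matches the one described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for the embedding step: stella_en_400M_v5 maps each quote to a
# 1024-dimensional vector (6091 quotes x 1024 features in the real run).
n_quotes, dim, n_labels = 600, 1024, 8
X = rng.normal(size=(n_quotes, dim)).astype(np.float32)
y = rng.integers(0, n_labels, size=n_quotes)  # labels 0-7

# 70% train / 30% test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Logistic regression was the best model found by the TPOT search.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 8x8 confusion matrix over the held-out quotes.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=list(range(n_labels)))
print(cm.shape)
```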

Here is the confusion matrix:

![Confusion matrix](my_conf_matrix.png)

### Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.

### Limitations
- The embedding phase takes ~30 seconds for 1800 quotes. It could be optimized, which would have a real influence on carbon emissions.
- It is hard to get above 70% accuracy with "simple" ML.
- Textual data carries nuances of interpretation that small models cannot capture.

### Ethical Considerations

- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint