---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Benchmark using different techniques

## ML model for Climate Disinformation Classification

### Model Description

#### Intended Use

- **Primary intended uses**: Baseline comparison for climate disinformation classification models
- **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases**: Not intended for production use or real-world classification tasks

### Training Data

The model uses the QuotaClimat/frugalaichallenge-text-train dataset:
- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims

#### Labels
0. No relevant claim detected
1. Global warming is not happening
2. Not caused by humans

…

6. Proponents are biased
7. Fossil fuels are needed

### Performance

#### Metrics (I used an NVIDIA T4 small GPU)
- **Accuracy**: ~69-72%
- **Environmental Impact**:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)
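
As a back-of-the-envelope check (assuming the ~1.8 Wh and ~0.7 g totals cover the ~1800-quote run mentioned under Limitations), the per-quote footprint works out to roughly 1 mWh and 0.4 mg CO2eq:

```python
# Per-quote footprint, assuming the reported totals (~1.8 Wh, ~0.7 gCO2eq)
# cover the ~1800 quotes mentioned in the Limitations section.
N_QUOTES = 1800
ENERGY_WH = 1.8      # total energy reported by CodeCarbon
EMISSIONS_G = 0.7    # total emissions reported by CodeCarbon, in gCO2eq

energy_per_quote_wh = ENERGY_WH / N_QUOTES              # 0.001 Wh = 1 mWh
emissions_per_quote_mg = EMISSIONS_G * 1000 / N_QUOTES  # ~0.39 mg CO2eq

print(f"{energy_per_quote_wh * 1000:.2f} mWh/quote, "
      f"{emissions_per_quote_mg:.2f} mg CO2eq/quote")
```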

#### Model Architecture
ML models prefer numeric inputs, so the quotes need to be embedded first. I used the *MTEB Leaderboard* on HuggingFace to find the model with the best trade-off between performance and number of parameters.

I then chose the "dunzhang/stella_en_400M_v5" model as the embedder. It has the 7th-best performance score with only 400M parameters.

Once the quotes are embedded, I have 6091 examples x 1024 features. After that, a train-test split (70% / 30%).

Using the TPOT classifier search, I found that the best model on my data was a logistic regression.
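
The pipeline can be sketched as follows. The random vectors stand in for the stella_en_400M_v5 embeddings (producing real ones requires downloading the 400M-parameter model), and scikit-learn's `LogisticRegression` stands in for the model that the TPOT search selected; the 70/30 split matches the one described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for the embedding step: stella_en_400M_v5 maps each quote to a
# 1024-dimensional vector (6091 quotes x 1024 features in the real run).
n_quotes, dim, n_labels = 600, 1024, 8
X = rng.normal(size=(n_quotes, dim)).astype(np.float32)
y = rng.integers(0, n_labels, size=n_quotes)  # labels 0-7

# 70% train / 30% test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Logistic regression was the best model found by the TPOT search.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 8x8 confusion matrix over the held-out quotes.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=list(range(n_labels)))
print(cm.shape)
```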

Here is the confusion matrix:

![Confusion matrix](my_conf_matrix.png)

### Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.

### Limitations
- The embedding phase takes ~30 seconds for 1800 quotes. It could be optimized, which would have a real influence on carbon emissions.
- It is hard to get above 70% accuracy with "simple" ML.
- Textual data carries nuances of interpretation that small models cannot capture.

### Ethical Considerations

- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint