CatastroBERT: a model for extreme weather event detection in French text

This model aims to facilitate the detection of paragraphs or articles relevant to extreme weather events in French text. It is based on the camembert-base model and was trained on manually annotated data (article summaries) from the Gazette de Lausanne archives collected by the impresso project.

Model Description

  • Developed by: Lucas Nicolas

  • Language(s) (NLP): French

  • Fine-tuned from model: camembert-base (RoBERTa checkpoint)

  • Repository: Check the CatastroBERT GitHub page for more usage examples and information.

Usage

In Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


model_name = "epfl-dhlab/CatastroBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
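model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move the model to the GPU when one is available, so it matches the inputs
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)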

def predict(text):
    # Prepare the text data
    inputs = tokenizer.encode_plus(
        text,
        None,
        add_special_tokens=True,
        return_token_type_ids=True,
        padding=True,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )

    # Send the inputs to whichever device the model lives on
    ids = inputs['input_ids'].to(model.device)
    mask = inputs['attention_mask'].to(model.device)

    # Get predictions
    with torch.no_grad():
        outputs = model(ids, mask)
        logits = outputs.logits

    # Apply sigmoid function to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()

    # Return the probability of the class (1)
    return probs[0][0]

# Example usage
text = "Un violent ouragan du sud-ouest est passé cette nuit sur Lausanne."
print(f"Prediction: {predict(text)}")
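
Because the model targets paragraphs as well as whole articles, one possible way to apply it to a longer article is to score each paragraph separately and keep those above a cut-off. The sketch below is only an illustration and is not taken from the CatastroBERT repository: the helper name `flag_paragraphs`, the blank-line splitting rule, and the 0.5 threshold are all assumptions, and `score_fn` stands for any callable that returns a probability, such as the `predict` function above. A keyword-based stub is used so the sketch runs without downloading the model.

```python
import re
from typing import Callable, List, Tuple


def flag_paragraphs(
    article: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, to be tuned on validation data
) -> List[Tuple[str, float]]:
    """Return (paragraph, score) pairs whose score reaches the threshold."""
    # Split on blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", article) if p.strip()]
    flagged = []
    for paragraph in paragraphs:
        score = float(score_fn(paragraph))
        if score >= threshold:
            flagged.append((paragraph, score))
    return flagged


def keyword_stub(text: str) -> float:
    """Stand-in scorer (not the real model) so the sketch is self-contained."""
    return 0.9 if "ouragan" in text.lower() else 0.1


article = (
    "Un violent ouragan du sud-ouest est passé cette nuit sur Lausanne.\n\n"
    "Le conseil communal s'est réuni hier soir."
)
for paragraph, score in flag_paragraphs(article, keyword_stub):
    print(f"{score:.2f}  {paragraph}")
```

To use the real model, pass `predict` in place of `keyword_stub`. Scoring paragraphs one at a time also keeps each input well under the 512-token limit used above.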

Training Data

This model was trained on a manually annotated dataset of article summaries curated from the Gazette de Lausanne archives collected by the impresso project. The dataset comprises 4500 article summaries, of which 3500 were used for training and 1000 for validation.
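
The card does not say how the 3500/1000 split was drawn. As a rough sketch only, a split of the same sizes could be reproduced with a seeded shuffle; the helper `split_dataset`, the seed, and the placeholder identifiers below are hypothetical and do not reflect the actual procedure or data.

```python
import random


def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of items with a fixed seed and split off n_train items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 4500 annotated summaries.
summaries = [f"summary_{i}" for i in range(4500)]
train, validation = split_dataset(summaries, n_train=3500)

# Prints the two set sizes and the training share (about 77.8%).
print(len(train), len(validation), f"{len(train) / len(summaries):.1%}")
```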

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: RTX 3090
  • Hours used: 26
  • Carbon Emitted: 0.07 kg CO2
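
For context, estimates of this kind are typically the product of energy used and the carbon intensity of the local grid. The back-of-the-envelope sketch below assumes the GPU drew a constant 350 W (the RTX 3090's rated board power) for all 26 hours; that power figure and the resulting implied intensity are assumptions made for illustration, not values stated in this card.

```python
# Values reported in the model card.
hours_used = 26
reported_emissions_kg = 0.07

# Assumption: the GPU draws its rated board power of 350 W for the whole run.
gpu_power_kw = 0.350

# Energy = power x time; implied intensity = emissions / energy.
energy_kwh = gpu_power_kw * hours_used
implied_intensity = reported_emissions_kg / energy_kwh  # kg CO2 per kWh

print(f"Energy: {energy_kwh:.1f} kWh")
print(f"Implied carbon intensity: {implied_intensity * 1000:.1f} g CO2/kWh")
```

Under these assumptions the run used about 9.1 kWh, which implies a grid intensity of roughly 7.7 g CO2 per kWh.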

Model size: 111M params (Safetensors, tensor type F32)