
Estienne is a text-segmentation model based on DeBERTa.

In contrast with most text-segmentation approaches, Estienne is based on token classification. Editorial structures are identified in much the same way as entities in named-entity recognition.

Estienne was trained on 2,000 manually annotated text examples, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex).

Given the diversity of the training corpus, Estienne should handle a wide range of document formats in European languages.

The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still used in scholarly editions today.

Use

Because the DeBERTa tokenizer removes newlines by default and has no support for them, newlines should be replaced by pilcrows (¶) before the text is passed to the model.
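The replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the helper names are hypothetical, and it assumes one pilcrow per line break with no added spaces, which the card does not specify.

```python
PILCROW = "\u00b6"  # the pilcrow character


def newlines_to_pilcrows(text: str) -> str:
    """Replace every line break with a pilcrow so the tokenizer keeps it."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings first,
    # so each line break becomes exactly one pilcrow.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", PILCROW)


def pilcrows_to_newlines(text: str) -> str:
    """Inverse step: restore line breaks in text that came back from the model."""
    return text.replace(PILCROW, "\n")
```

Note that a source text that already contains literal pilcrows would be altered by the inverse step; if that matters, such characters may need to be escaped beforehand.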

Estienne recognizes the following segment types:

  • Text
  • Separator - a boundary between segments. Separators generally fall on newlines (encoded as ¶), with some variation depending on how the model interprets the structure of the text.
  • Title
  • Table
  • Dialog - any kind of speaker-attributed utterance.
  • Bibliography - statement of a specific bibliographic reference, either in a bibliography section or a footnote.
  • Contact - personal information; especially useful in the context of PII removal.
  • Paratext - any non-meaningful text included in standard documents, such as headers, page numbers, running section titles, etc.
  • Author - author names and signatures.
  • Date - statement of date and time, common in letters and newspaper articles.
  • Keyword - list of keywords, especially common in scientific publications.
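As an illustration of how these labels might be consumed downstream, the sketch below merges labelled character spans into segments, treating Separator spans as boundaries, and then drops unwanted segment types such as Contact or Paratext. The input format (dicts with "label", "start" and "end") and both function names are assumptions made for the example; the exact label strings stored in the model configuration may differ from the names in the list above.

```python
from typing import Dict, List, Set


def split_on_separators(text: str, spans: List[Dict]) -> List[Dict]:
    """Merge labelled spans into segments, cutting at every Separator span.

    `spans` is a hypothetical input format: dicts holding "label" plus
    "start"/"end" character offsets into `text`.
    """
    segments: List[Dict] = []
    current = None
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["label"] == "Separator":
            # A separator closes the segment in progress and is not kept.
            if current is not None:
                segments.append(current)
                current = None
            continue
        if current is not None and current["label"] == span["label"]:
            # Same label as the open segment: extend it.
            current["end"] = span["end"]
        else:
            # Label changed without a separator: start a new segment.
            if current is not None:
                segments.append(current)
            current = {"label": span["label"], "start": span["start"], "end": span["end"]}
    if current is not None:
        segments.append(current)
    for segment in segments:
        segment["text"] = text[segment["start"]:segment["end"]]
    return segments


def drop_labels(segments: List[Dict], unwanted: Set[str]) -> List[Dict]:
    """Keep only segments whose label is not in `unwanted`."""
    return [s for s in segments if s["label"] not in unwanted]
```

For PII removal or boilerplate stripping, something like `drop_labels(segments, {"Contact", "Paratext"})` would keep only the remaining segment types.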

Example
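A minimal inference sketch with the Hugging Face `transformers` token-classification pipeline follows. It assumes the checkpoint loads through the standard pipeline under the repository identifier `PleIAs/Estienne` and that the fast tokenizer returns character offsets; neither is confirmed by this card. The helper names `load_classifier` and `label_spans` are hypothetical.

```python
from typing import Callable, Dict, List

MODEL_ID = "PleIAs/Estienne"  # assumed Hub repository identifier


def load_classifier():
    """Build a token-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so label_spans can be used with any compatible callable.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge sub-word tokens into labelled spans
    )


def label_spans(text: str, classifier: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run the classifier on pilcrow-encoded text and return simplified spans."""
    prepared = text.replace("\n", "\u00b6")  # newlines become pilcrows
    return [
        {
            "label": result["entity_group"],
            "start": result["start"],
            "end": result["end"],
            "text": prepared[result["start"]:result["end"]],
        }
        for result in classifier(prepared)
    ]


# Usage (requires network access and the transformers package):
# classifier = load_classifier()
# for span in label_spans("A Short Title\nFirst paragraph of the text.", classifier):
#     print(span)
```

Long documents will likely exceed the model's maximum sequence length, so in practice the text would need to be processed in chunks.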

Model size: 278M params (Safetensors, tensor type F32)