pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384

pierreguillou committed on Feb 10, 2023

Commit a8bcd8f

1 Parent(s): 8985bf0

Update README.md

Files changed (1)

README.md +24 -2

@@ -33,7 +33,14 @@ metrics:
 - accuracy
 model-index:
 - name: lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384
-  results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -43,21 +50,36 @@ should probably proofread and complete it, then remove this comment. -->
 This model is a fine-tuned version of [nielsr/lilt-xlm-roberta-base](https://huggingface.co/nielsr/lilt-xlm-roberta-base) with the [DocLayNet base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base) dataset.
 It achieves the following results on the evaluation set:
 - Loss: 1.0003
 - Precision: 0.8584
 - Recall: 0.8584
 - F1: 0.8584
 - Accuracy: 0.8584
 ## Model description
 The model was fine-tuned at **line level on chunks of 384 tokens with an overlap of 128 tokens**. Thus, the model was trained on all layout and text data from all pages of the dataset.
 At inference time, a calculation of the best probabilities assigns a label to each line bounding box.
 ## Training and evaluation data
-More information needed
 ## Training procedure

 - accuracy
 model-index:
 - name: lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384
+  results:
+  - task:
+      name: Token Classification
+      type: token-classification
+    metrics:
+    - name: f1
+      type: f1
+      value: 0.8584
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 This model is a fine-tuned version of [nielsr/lilt-xlm-roberta-base](https://huggingface.co/nielsr/lilt-xlm-roberta-base) with the [DocLayNet base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base) dataset.
 It achieves the following results on the evaluation set:
 - Loss: 1.0003
 - Precision: 0.8584
 - Recall: 0.8584
 - F1: 0.8584
 - Accuracy: 0.8584
+### DocLayNet dataset
+[DocLayNet dataset](https://github.com/DS4SD/DocLayNet) (IBM) provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories.
+The dataset can currently be downloaded through direct links or as a dataset from Hugging Face Datasets:
+- direct links: [doclaynet_core.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) (28 GiB), [doclaynet_extra.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip) (7.5 GiB)
+- Hugging Face dataset library: [dataset DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
+Paper: [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis](https://arxiv.org/abs/2206.01062) (06/02/2022)
 ## Model description
 The model was fine-tuned at **line level on chunks of 384 tokens with an overlap of 128 tokens**. Thus, the model was trained on all layout and text data from all pages of the dataset.
 At inference time, a calculation of the best probabilities assigns a label to each line bounding box.
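
The chunking and per-line labelling described in the model description can be illustrated with a minimal sketch. It is not taken from the author's notebooks: the function names `chunk_with_overlap` and `label_per_line` are hypothetical, and summing class probabilities per line before taking the argmax is only one plausible reading of "a calculation of best probabilities".

```python
def chunk_with_overlap(token_ids, max_length=384, overlap=128):
    """Split a token sequence into windows of at most `max_length` tokens,
    where consecutive windows share `overlap` tokens (hypothetical helper)."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    stride = max_length - overlap  # 256 new tokens per window with the defaults
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the page
        start += stride
    return chunks


def label_per_line(line_ids, token_probs):
    """Assign one label per line by summing the class probabilities of all
    token predictions belonging to that line, then taking the argmax.
    ASSUMPTION: the aggregation rule is illustrative, not the author's."""
    sums = {}
    for line_id, probs in zip(line_ids, token_probs):
        if line_id not in sums:
            sums[line_id] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[line_id][k] += p
    return {
        line_id: max(range(len(total)), key=lambda k: total[k])
        for line_id, total in sums.items()
    }


# Usage: a page of 1000 tokens yields windows starting at 0, 256, 512 and 768.
chunks = chunk_with_overlap(list(range(1000)))
labels = label_per_line([0, 0, 1], [[0.6, 0.4], [0.1, 0.9], [0.8, 0.2]])
```

Because tokens in the overlapping regions are predicted more than once, aggregating over every prediction for a line lets each line bounding box draw on all chunks that contain it.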
+## Inference
+See notebook: [inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levellines_ml384.ipynb]()
 ## Training and evaluation data
+See notebook: [Fine_tune_LiLT_on_DocLayNet_base_in_any_language_at_linelevel_ml_384.ipynb]()
 ## Training procedure