pierreguillou committed on
Commit a8bcd8f · 1 Parent(s): 8985bf0

Update README.md

Files changed (1)
  1. README.md +24 -2
README.md CHANGED
@@ -33,7 +33,14 @@ metrics:
  - accuracy
  model-index:
  - name: lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384
-   results: []
+   results:
+   - task:
+       name: Token Classification
+       type: token-classification
+     metrics:
+     - name: f1
+       type: f1
+       value: 0.8584
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -43,21 +50,36 @@ should probably proofread and complete it, then remove this comment. -->

  This model is a fine-tuned version of [nielsr/lilt-xlm-roberta-base](https://huggingface.co/nielsr/lilt-xlm-roberta-base) with the [DocLayNet base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base) dataset.
  It achieves the following results on the evaluation set:
+
  - Loss: 1.0003
  - Precision: 0.8584
  - Recall: 0.8584
  - F1: 0.8584
  - Accuracy: 0.8584

+ ### DocLayNet dataset
+
+ [DocLayNet dataset](https://github.com/DS4SD/DocLayNet) (IBM) provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories.
+
+ Until today, the dataset can be downloaded through direct links or as a dataset from Hugging Face datasets:
+ - direct links: [doclaynet_core.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) (28 GiB), [doclaynet_extra.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip) (7.5 GiB)
+ - Hugging Face dataset library: [dataset DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)
+
+ Paper: [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis](https://arxiv.org/abs/2206.01062) (06/02/2022)
+
  ## Model description

  The model was finetuned at **line level on chunk of 384 tokens with overlap of 128 tokens**. Thus, the model was trained with all layout and text data of all pages of the dataset.

  At inference time, a calculation of best probabilities give the label to each line bounding boxes.

+ ## Inference
+
+ See notebook: [inference_on_LiLT_model_finetuned_on_DocLayNet_base_in_any_language_at_levellines_ml384.ipynb]()
+
  ## Training and evaluation data

- More information needed
+ See notebook: [Fine_tune_LiLT_on_DocLayNet_base_in_any_language_at_linelevel_ml_384.ipynb]()

  ## Training procedure

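The "Model description" text in this diff mentions chunks of 384 tokens with an overlap of 128 tokens, and a "calculation of best probabilities" that assigns a label to each line bounding box. The following is a minimal sketch of how those two steps could work; it is not taken from the author's notebooks. The function names `chunk_spans` and `label_lines` are hypothetical, and averaging the class probabilities of all tokens that share a line bounding box (across overlapping chunks) before taking the argmax is an assumption about what "best probabilities" means.

```python
from collections import defaultdict


def chunk_spans(n_tokens, max_length=384, overlap=128):
    """Return (start, end) token spans of at most `max_length` tokens.

    Consecutive spans share `overlap` tokens, so the window advances by
    `max_length - overlap` tokens each step (256 with the README's values).
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if n_tokens <= 0:
        return []
    stride = max_length - overlap
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def label_lines(chunk_probs, chunk_line_ids):
    """Assign one class index to each line (ASSUMED aggregation rule).

    chunk_probs:    per chunk, per token, a list of class probabilities.
    chunk_line_ids: per chunk, per token, a hashable line id
                    (for example the line's bounding-box tuple).
    Tokens seen in several overlapping chunks contribute once per chunk.
    """
    sums = {}
    counts = defaultdict(int)
    for probs, line_ids in zip(chunk_probs, chunk_line_ids):
        for token_probs, line_id in zip(probs, line_ids):
            if line_id not in sums:
                sums[line_id] = [0.0] * len(token_probs)
            sums[line_id] = [s + p for s, p in zip(sums[line_id], token_probs)]
            counts[line_id] += 1

    labels = {}
    for line_id, total in sums.items():
        mean = [s / counts[line_id] for s in total]
        # Index of the class with the highest mean probability.
        labels[line_id] = max(range(len(mean)), key=mean.__getitem__)
    return labels
```

With the README's values, a page of 1000 tokens would be covered by four windows starting at tokens 0, 256, 512 and 768. In practice the windows would more likely come from the tokenizer itself rather than from a helper like this one.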