example_title: "Long-s piano ad"
---

# Swedish OCR correction

<!-- Provide a quick summary of what the model is/does. -->

This model corrects OCR errors in Swedish text.

## Try it!

- On short texts in the inference widget to the right ->
- On files or longer texts in the [demo](https://huggingface.co/spaces/viklofg/swedish-ocr-correction-demo)

## Model Description

This model is a fine-tuned version of [byt5-small](https://huggingface.co/google/byt5-small), a character-level multilingual transformer.
The fine-tuning data consists of OCR samples from Swedish newspapers and historical documents.
The model works on texts up to 128 UTF-8 bytes (see [Length limit](#length-limit)).

<!-- ### Model Description-->

- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]-->

## Training Data

The base model byt5 is pre-trained on [mc4](https://huggingface.co/datasets/mc4). This fine-tuned version is further trained on:

- Swedish newspapers from 1818 to 2018. Parts of the dataset are available from Språkbanken Text: [Swedish newspapers 1818-1870](https://spraakbanken.gu.se/en/resources/svenska-tidningar-1818-1870) and [Swedish newspapers 1871-1906](https://spraakbanken.gu.se/resurser/svenska-tidningar-1871-1906).
- Swedish blackletter documents from 1626 to 1816, available from Språkbanken Text: [Swedish fraktur 1626-1816](https://spraakbanken.gu.se/resurser/svensk-fraktur-1626-1816).

This data includes characters no longer used in Swedish, such as the long s (ſ) and the eszett ligature (ß), which means the model should be able to handle texts containing these characters.
See, for instance, the _Long-s piano ad_ example in the inference widget to the right.

## Usage

Use the code below to get started with the model.

```python
from transformers import pipeline, T5ForConditionalGeneration, AutoTokenizer

# Load the fine-tuned model and the base byt5-small tokenizer
model = T5ForConditionalGeneration.from_pretrained('viklofg/swedish-ocr-correction')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)

# OCR output with typical recognition errors, to be corrected by the model
ocr = 'Den i HandelstidniDgens g&rdagsnnmmer omtalade hvalfisken, sorn fångats i Frölnndaviken'
output = pipe(ocr)
print(output)
```

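The pipeline only sees the first 128 UTF-8 bytes of its input (see [Length limit](#length-limit)), so longer texts need to be split before correction. This is a minimal sketch of one way to do that, assuming whitespace is an acceptable split point; the `chunk_text` helper is illustrative and not part of the model or the `transformers` API:

```python
def chunk_text(text: str, max_bytes: int = 128) -> list[str]:
    """Greedily pack whitespace-separated words into chunks whose
    UTF-8 encoding stays within max_bytes (a single word longer
    than max_bytes is kept whole)."""
    chunks, current = [], ''
    for word in text.split():
        candidate = f'{current} {word}'.strip()
        if len(candidate.encode('utf-8')) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word  # start a new chunk with this word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be corrected individually with `pipe(chunk)` and the results joined back together.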
### Length limit

The model accepts input sequences of at most 128 UTF-8 bytes; longer sequences are truncated to this limit. 128 UTF-8 bytes corresponds to slightly fewer than 128 characters of Swedish text, since most characters are encoded as one byte, but non-ASCII characters such as Å, Ä, and Ö are encoded as two (or more) bytes.
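The character/byte distinction can be checked directly in Python, since `str.encode` returns the UTF-8 byte string:

```python
# Plain ASCII text: one byte per character
print(len('hvalfisken'.encode('utf-8')))     # 10 characters, 10 bytes

# Å, Ä, Ö (and ß, ſ) each take two bytes in UTF-8
print(len('Frölundaviken'.encode('utf-8')))  # 13 characters, 14 bytes
```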