---
title: README
emoji: 🌍
colorFrom: blue
colorTo: yellow
sdk: static
pinned: false
---
# **1.1 Fine-tuning Small LLMs**

Exploring the potential of small LLMs for cleaning raw handwritten text recognition (HTR) output from machine-transcribed English Admiralty depositions.

### Fine-Tuned Models

- mT5-small (300M parameters)
- GPT-2 Small (124M parameters)
- Llama 3.2 (1B parameters)

### Current Training Data

- 100 pages: 40,000 lines (~0.4M words)
- 200 pages: 80,000 lines (~0.8M words)
- 400 pages: 160,000 lines (~1.6M words)

### Objectives

- **Word Correction**: Identify and correct errors using contextual and grammatical cues.
- **Language Identification**: Distinguish English from Latin text.
- **Artefact Removal**: Eliminate HTR-generated artefacts.
- **Structural Recognition**: Detect depositions' components (e.g., front matter, headings, articles).
- **Insertion Logic**: Handle missing text at marked positions.
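
One way to train a single text-to-text model such as mT5-small on several of these objectives is to prefix each input with a task tag. The sketch below shows that idea only. The prefixes, field names, and sample lines are all invented for illustration and are not taken from the MarineLives training data.

```python
import json

# Hypothetical task prefixes, one per objective above (assumed, not the project's own).
TASK_PREFIXES = {
    "word_correction": "correct: ",
    "language_id": "identify language: ",
    "artefact_removal": "remove artefacts: ",
    "structure": "label structure: ",
    "insertion": "insert: ",
}


def make_example(task, raw_htr, target):
    """Build one source/target pair for a text-to-text model."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return {
        "task": task,
        "source": TASK_PREFIXES[task] + raw_htr.strip(),
        "target": target.strip(),
    }


def to_jsonl(examples):
    """Serialise examples as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


# Invented lines, for illustration only.
examples = [
    make_example(
        "word_correction",
        "To the first arlate of the sayd libell he saith",
        "To the first article of the sayd libell he saith",
    ),
    make_example("language_id", "Quo die comparuit", "Latin"),
    make_example(
        "artefact_removal",
        "the sayd shipp ¬ was laden",
        "the sayd shipp was laden",
    ),
]

print(to_jsonl(examples))
```

Each line of the resulting JSONL file holds one training pair, so the same file can mix objectives and still be filtered by the `task` field.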
# **1.2 Integration with RAG Pipeline**

### Components:

- **Retriever**: BM25 or Sentence-BERT
- **LLM**: mT5-small
- **Corpus**: Curated historical texts or JSON/SQLite databases

### Deployment Highlights:

- **Scalable**: Easily runs on platforms like Hugging Face Spaces with lightweight GPU instances.
- **API-Friendly**: Supports integrations via Hugging Face Inference API for retrieval-augmented tasks.
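
The retrieval step of such a pipeline can be sketched with a small, dependency-free BM25 scorer. This is a minimal illustration under stated assumptions: the corpus passages, the prompt layout, and the `k1`/`b` defaults are invented here, and the final generation call to mT5-small is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Okapi BM25 with a non-negative (Lucene-style) IDF."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f == 0:
                continue  # term absent from this document
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


def build_prompt(query, passages):
    """Assemble retrieved passages and the question into one model input (assumed layout)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"context:\n{context}\nquestion: {query}"


# Invented passages standing in for a curated historical corpus.
corpus = [
    "The ship was laden with sugar at Lisbon and bound for London.",
    "The deponent is a mariner of Wapping aged thirty years.",
    "The goods were seized by a Dunkirk man of war near Dover.",
]
retriever = BM25(corpus)
query = "where was the ship laden"
hits = retriever.top_k(query, k=1)
prompt = build_prompt(query, [corpus[i] for i in hits])
print(prompt)
```

Swapping BM25 for Sentence-BERT would change only the retriever class; the prompt-building step stays the same.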
# 📚 **2.0 Datasets**

## **2.1 Published Datasets**

### **ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS**

1. [MarineLives/English-Expansions](https://huggingface.co/datasets/MarineLives/English-Expansions)
2. [MarineLives/Latin-Expansions](https://huggingface.co/datasets/MarineLives/Latin-Expansions)
3. [MarineLives/Line-Insertions](https://huggingface.co/datasets/MarineLives/Line-Insertions)
4. [MarineLives/HCA-1358-Errors-In-Phrases](https://huggingface.co/datasets/MarineLives/HCA-1358-Errors-In-Phrases)
5. [MarineLives/HCA-13-58-TEXT](https://huggingface.co/datasets/MarineLives/HCA-13-58-TEXT)

### **YIDDISH LETTERS**

1. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HT_and_groundtruth_lines)
2. [MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs](https://huggingface.co/datasets/MarineLives/Gavin_yiddish_raw_HTR_and_groundtruth_paragraphs)

## **2.2 Unpublished Datasets**

- **Dataset 1**: 420K tokens, full diplomatic transcription (1627–1660)
- **Dataset 2**: 4.5M tokens, semi-diplomatic transcription (1607–1660)
- **Dataset 3**: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)
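
The published Admiralty datasets in 2.1 can probably be pulled with the Hugging Face `datasets` library. The sketch below assumes the repositories are public and stored in a format `load_dataset` can read. The helper names `load_published` and `hub_url` are hypothetical, and split and column names are not documented here, so inspect the returned object before relying on them.

```python
# Repository IDs copied from the dataset links in section 2.1.
ADMIRALTY_DATASETS = [
    "MarineLives/English-Expansions",
    "MarineLives/Latin-Expansions",
    "MarineLives/Line-Insertions",
    "MarineLives/HCA-1358-Errors-In-Phrases",
    "MarineLives/HCA-13-58-TEXT",
]


def hub_url(repo_id):
    """Return the dataset page URL for a repository ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_published(repo_id):
    """Download one dataset from the Hub.

    Requires `pip install datasets` and network access. The import is
    done lazily so the ID list above stays usable without the library.
    """
    from datasets import load_dataset

    return load_dataset(repo_id)
```

For example, `print(load_published(ADMIRALTY_DATASETS[0]))` should list whatever splits and columns that repository exposes.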
# ๐ŸŒ **Explore MarineLives** Join us in unlocking Early Modern history by exploring our [Hugging Face organization](https://huggingface.co/MarineLives) and datasets! You can follow us on BlueSky at [@marinelives.bsky.social](https://bsky.app/profile/marinelives.bsky.social) You can explore our content on our [MarineLives wiki](http://www.marinelives.org/wiki/MarineLives) and on our [ai-and-history-collaboratory GitHub repository](https://github.com/Addaci/marinelives-collaboratory/wiki).