DHIVEHI NOUGAT SMALL (IMAGE-TO-TEXT)

This model is a fine-tuned version of facebook/nougat-small on a Dhivehi text-image dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0300

Model description

Fine-tuned for Dhivehi on a text-image dataset, using the dv-01-01 config only.
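
For reference, a dataset config such as dv-01-01 is selected by passing the config name to datasets.load_dataset. This is a minimal sketch; the repo id below is a hypothetical placeholder, not necessarily the dataset actually used:

from datasets import load_dataset

# Hypothetical repo id; substitute the actual Dhivehi text-image dataset
dataset = load_dataset("your-namespace/dhivehi-text-image", "dv-01-01")
print(dataset)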

Usage

from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel
from pathlib import Path

# Load the model and processor
processor = NougatProcessor.from_pretrained("alakxender/dhivehi-nougat-small-dv01-01")
model = VisionEncoderDecoderModel.from_pretrained(
    "alakxender/dhivehi-nougat-small-dv01-01",  
    torch_dtype=torch.bfloat16,                 # Optional: Load the model with BF16 data type for faster inference and lower memory usage
    attn_implementation={                       # Optional: Specify the attention kernel implementations for different parts of the model
        "decoder": "flash_attention_2",         # Use FlashAttention-2 for the decoder for improved performance
        "encoder": "eager"                      # Use the default ("eager") attention implementation for the encoder
    }
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

context_length = 128

def predict(img_path):
    # Ensure image is in RGB format
    image = Image.open(img_path).convert("RGB")  
    # Cast pixel values to match the model's BF16 weights
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(torch.bfloat16)

    # Generate the prediction
    outputs = model.generate(
        pixel_values.to(device),
        min_length=1,
        max_new_tokens=context_length,
        repetition_penalty=1.5,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        eos_token_id=processor.tokenizer.eos_token_id,
    )

    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence

print(predict("DV01-04_31.jpg"))
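
The Path import above also makes it easy to run the model over a whole directory of page images; a minimal sketch, assuming the scans sit in a local images/ folder:

# Hypothetical folder of page images; PIL accepts Path objects directly
for img_path in sorted(Path("images").glob("*.jpg")):
    print(img_path.name, predict(img_path))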

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent training arguments follows the list):

  • learning_rate: 0.0001
  • train_batch_size: 3
  • eval_batch_size: 3
  • seed: 42
  • gradient_accumulation_steps: 6
  • total_train_batch_size: 18
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 100
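
For reference, these hyperparameters map onto transformers Seq2SeqTrainingArguments roughly as follows. This is a reconstruction from the list above, not the original training script; output_dir and bf16 are assumptions:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="dhivehi-nougat-small-dv01-01",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    seed=42,
    gradient_accumulation_steps=6,  # effective train batch size: 3 * 6 = 18
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=100,
    bf16=True,  # assumption, consistent with the published BF16 weights
)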

Training results

Training Loss    Epoch     Step    Validation Loss
7.1462           0.0567    100     1.1326
6.5572           0.1135    200     1.0543
6.1831           0.1702    300     0.9868
6.0022           0.2269    400     0.9323
5.6527           0.2837    500     0.8896
5.5004           0.3404    600     0.8478
5.2741           0.3971    700     0.8168
4.9927           0.4539    800     0.7466
4.3776           0.5106    900     0.6724
2.816            0.5673    1000    0.4038
1.8526           0.6241    1100    0.2720
1.5099           0.6808    1200    0.2064
1.3084           0.7375    1300    0.1696
1.1449           0.7943    1400    0.1516
0.8819           0.8510    1500    0.1331
0.7947           0.9077    1600    0.1194
0.9857           0.9644    1700    0.1091
0.7097           1.0210    1800    0.1023
0.5212           1.0777    1900    0.0953
0.6396           1.1345    2000    0.0882
0.6073           1.1912    2100    0.0863
0.5683           1.2479    2200    0.0815
0.5399           1.3047    2300    0.0770
0.5433           1.3614    2400    0.0740
0.5824           1.4181    2500    0.0688
0.447            1.4748    2600    0.0665
0.4875           1.5316    2700    0.0633
0.4694           1.5883    2800    0.0616
0.4001           1.6450    2900    0.0580
0.3971           1.7018    3000    0.0585
0.3889           1.7585    3100    0.0556
0.3088           1.8152    3200    0.0546
0.3476           1.8720    3300    0.0522
0.4569           1.9287    3400    0.0513
0.3979           1.9854    3500    0.0502
0.2847           2.0420    3600    0.0486
0.4332           2.0987    3700    0.0465
0.3647           2.1554    3800    0.0469
0.3791           2.2122    3900    0.0459
0.2982           2.2689    4000    0.0450
0.3294           2.3256    4100    0.0447
0.2839           2.3824    4200    0.0434
0.3094           2.4391    4300    0.0433
0.3062           2.4958    4400    0.0422
0.2723           2.5526    4500    0.0412
0.2348           2.6093    4600    0.0406
0.2125           2.6660    4700    0.0403
0.3172           2.7228    4800    0.0385
0.2315           2.7795    4900    0.0382
0.2707           2.8362    5000    0.0385
0.2391           2.8930    5100    0.0373
0.2979           2.9497    5200    0.0372
0.2933           3.0062    5300    0.0362
0.2388           3.0630    5400    0.0357
0.2525           3.1197    5500    0.0364
0.2563           3.1764    5600    0.0359
0.2534           3.2332    5700    0.0354
0.2401           3.2899    5800    0.0344
0.2116           3.3466    5900    0.0340
0.2713           3.4034    6000    0.0340
0.2351           3.4601    6100    0.0333
0.1471           3.5168    6200    0.0335
0.2209           3.5736    6300    0.0326
0.2206           3.6303    6400    0.0324
0.2208           3.6870    6500    0.0316
0.2329           3.7438    6600    0.0316
0.1439           3.8005    6700    0.0312
0.2335           3.8572    6800    0.0315
0.1582           3.9140    6900    0.0312
0.2298           3.9707    7000    0.0305
0.1649           4.0272    7100    0.0309
0.1489           4.0840    7200    0.0304
0.1729           4.1407    7300    0.0304
0.1907           4.1974    7400    0.0297
0.2              4.2542    7500    0.0298
0.1776           4.3109    7600    0.0296
0.1955           4.3676    7700    0.0292
0.1838           4.4244    7800    0.0295
0.1685           4.4811    7900    0.0292
0.161            4.5378    8000    0.0300

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0