DHIVEHI NOUGAT SMALL (IMAGE-TO-TEXT)

This model is a fine-tuned version of facebook/nougat-small on a Dhivehi text-image dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0300

Model description

Fine-tuned for Dhivehi on a text-image dataset, using the dv-01-01 config only.
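
For reference, a dataset config such as dv-01-01 is selected by passing the config name to datasets.load_dataset. This is a minimal sketch; the repo id below is a hypothetical placeholder, not necessarily the dataset actually used:

from datasets import load_dataset

# Hypothetical repo id; substitute the actual Dhivehi text-image dataset
dataset = load_dataset("your-namespace/dhivehi-text-image", "dv-01-01")
print(dataset)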

Usage

from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel
from pathlib import Path

# Load the model and processor
processor = NougatProcessor.from_pretrained("alakxender/dhivehi-nougat-small-dv01-01")
model = VisionEncoderDecoderModel.from_pretrained(
    "alakxender/dhivehi-nougat-small-dv01-01",  
    torch_dtype=torch.bfloat16,                 # Optional: Load the model with BF16 data type for faster inference and lower memory usage
    attn_implementation={                       # Optional: Specify the attention kernel implementations for different parts of the model
        "decoder": "flash_attention_2",         # Use FlashAttention-2 for the decoder for improved performance
        "encoder": "eager"                      # Use the default ("eager") attention implementation for the encoder
    }
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

context_length = 128

def predict(img_path):
    # Ensure image is in RGB format
    image = Image.open(img_path).convert("RGB")  
    # Cast pixel values to match the model's BF16 weights
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(torch.bfloat16)

    # Generate the prediction
    outputs = model.generate(
        pixel_values.to(device),
        min_length=1,
        max_new_tokens=context_length,
        repetition_penalty=1.5,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        eos_token_id=processor.tokenizer.eos_token_id,
    )

    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence

print(predict("DV01-04_31.jpg"))
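
The Path import above also makes it easy to run the model over a whole directory of page images; a minimal sketch, assuming the scans sit in a local images/ folder:

# Hypothetical folder of page images; PIL accepts Path objects directly
for img_path in sorted(Path("images").glob("*.jpg")):
    print(img_path.name, predict(img_path))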

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent training arguments follows the list):

  • learning_rate: 0.0001
  • train_batch_size: 3
  • eval_batch_size: 3
  • seed: 42
  • gradient_accumulation_steps: 6
  • total_train_batch_size: 18
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 100
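
For reference, these hyperparameters map onto transformers Seq2SeqTrainingArguments roughly as follows. This is a reconstruction from the list above, not the original training script; output_dir and bf16 are assumptions:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="dhivehi-nougat-small-dv01-01",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    seed=42,
    gradient_accumulation_steps=6,  # effective train batch size: 3 * 6 = 18
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=100,
    bf16=True,  # assumption, consistent with the published BF16 weights
)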

Training results

Training Loss    Epoch     Step    Validation Loss
7.1462           0.0567    100     1.1326
6.5572           0.1135    200     1.0543
6.1831           0.1702    300     0.9868
6.0022           0.2269    400     0.9323
5.6527           0.2837    500     0.8896
5.5004           0.3404    600     0.8478
5.2741           0.3971    700     0.8168
4.9927           0.4539    800     0.7466
4.3776           0.5106    900     0.6724
2.816            0.5673    1000    0.4038
1.8526           0.6241    1100    0.2720
1.5099           0.6808    1200    0.2064
1.3084           0.7375    1300    0.1696
1.1449           0.7943    1400    0.1516
0.8819           0.8510    1500    0.1331
0.7947           0.9077    1600    0.1194
0.9857           0.9644    1700    0.1091
0.7097           1.0210    1800    0.1023
0.5212           1.0777    1900    0.0953
0.6396           1.1345    2000    0.0882
0.6073           1.1912    2100    0.0863
0.5683           1.2479    2200    0.0815
0.5399           1.3047    2300    0.0770
0.5433           1.3614    2400    0.0740
0.5824           1.4181    2500    0.0688
0.447            1.4748    2600    0.0665
0.4875           1.5316    2700    0.0633
0.4694           1.5883    2800    0.0616
0.4001           1.6450    2900    0.0580
0.3971           1.7018    3000    0.0585
0.3889           1.7585    3100    0.0556
0.3088           1.8152    3200    0.0546
0.3476           1.8720    3300    0.0522
0.4569           1.9287    3400    0.0513
0.3979           1.9854    3500    0.0502
0.2847           2.0420    3600    0.0486
0.4332           2.0987    3700    0.0465
0.3647           2.1554    3800    0.0469
0.3791           2.2122    3900    0.0459
0.2982           2.2689    4000    0.0450
0.3294           2.3256    4100    0.0447
0.2839           2.3824    4200    0.0434
0.3094           2.4391    4300    0.0433
0.3062           2.4958    4400    0.0422
0.2723           2.5526    4500    0.0412
0.2348           2.6093    4600    0.0406
0.2125           2.6660    4700    0.0403
0.3172           2.7228    4800    0.0385
0.2315           2.7795    4900    0.0382
0.2707           2.8362    5000    0.0385
0.2391           2.8930    5100    0.0373
0.2979           2.9497    5200    0.0372
0.2933           3.0062    5300    0.0362
0.2388           3.0630    5400    0.0357
0.2525           3.1197    5500    0.0364
0.2563           3.1764    5600    0.0359
0.2534           3.2332    5700    0.0354
0.2401           3.2899    5800    0.0344
0.2116           3.3466    5900    0.0340
0.2713           3.4034    6000    0.0340
0.2351           3.4601    6100    0.0333
0.1471           3.5168    6200    0.0335
0.2209           3.5736    6300    0.0326
0.2206           3.6303    6400    0.0324
0.2208           3.6870    6500    0.0316
0.2329           3.7438    6600    0.0316
0.1439           3.8005    6700    0.0312
0.2335           3.8572    6800    0.0315
0.1582           3.9140    6900    0.0312
0.2298           3.9707    7000    0.0305
0.1649           4.0272    7100    0.0309
0.1489           4.0840    7200    0.0304
0.1729           4.1407    7300    0.0304
0.1907           4.1974    7400    0.0297
0.2              4.2542    7500    0.0298
0.1776           4.3109    7600    0.0296
0.1955           4.3676    7700    0.0292
0.1838           4.4244    7800    0.0295
0.1685           4.4811    7900    0.0292
0.161            4.5378    8000    0.0300

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0