---
license: apache-2.0
datasets:
- Ericu950/Papyri_1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: transformers
tags:
- papyrology
- epigraphy
- philology
---
# Papy_1_Llama-3.1-8B-Instruct_date
This is a fine-tuned version of the Llama-3.1-8B-Instruct model, specialized in assigning a date to Greek documentary papyri. On a test set of 2,295 unseen papyri its predictions were, on average, 21.7 years away from the actual date spans.
See https://arxiv.org/abs/2409.13870.
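
To make the reported figure easier to interpret, here is a minimal sketch of one way a distance between a predicted year and a date span could be computed: zero when the prediction falls inside the span, otherwise the number of years to the nearest edge. This is an assumption for illustration only; the paper linked above defines the metric actually used. The function names and the sample values are hypothetical, and BCE years are assumed to be written as negative integers.

```python
def span_distance(predicted_year: int, span_start: int, span_end: int) -> int:
    """Years between a prediction and the nearest edge of a date span (0 if inside)."""
    if span_start > span_end:
        raise ValueError("span_start must not exceed span_end")
    if predicted_year < span_start:
        return span_start - predicted_year
    if predicted_year > span_end:
        return predicted_year - span_end
    return 0


def mean_span_distance(predictions, spans):
    """Average span distance over paired predictions and (start, end) spans."""
    distances = [span_distance(p, s, e) for p, (s, e) in zip(predictions, spans)]
    return sum(distances) / len(distances)


# Hypothetical data, not taken from the test set.
predictions = [71, 150, -30]
spans = [(71, 72), (100, 125), (-50, -25)]
example_mean = mean_span_distance(predictions, spans)
print(f"Mean distance: {example_mean:.2f} years")
```

Under this definition a papyrus dated to a span of several decades is counted as correctly dated by any prediction inside that span.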
## Dataset
This model was fine-tuned on the Ericu950/Papyri_1 dataset, which consists of editions of Greek documentary papyri together with their dates and geographical attributions, sourced from the amazing Papyri.info.
## Usage
To run the model on a GPU with large memory capacity, follow these steps:
### 1. Download and load the model
```python
import json
from transformers import pipeline, AutoTokenizer, LlamaForCausalLM
import torch

model_id = "Ericu950/Papy_1_Llama-3.1-8B-Instruct_date"

model = LlamaForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
```
### 2. Run inference on a papyrus fragment of your choice
```python
# This is a rough transcription of Pap.Ups. 106
papyrus_edition = """
ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------
ομολογει παυσιριων απολλωνιου του παuσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι
εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ -------------------------
απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ----------------------------------
--------------------σ αυτωι εξ ησ συνεστιν ------------------------------------
----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------
------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ --------------------------
--------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------
---------- και προ κατενγεγυηται τα δικαια --------------------------------------
νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη ---------------------------------------
υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του
ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην
---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε
------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε
----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα
παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην
εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημοσιον τασ ισασ και μηθεν
ησσον· δ -----ιων ομολογιαν συνεχωρησεν·
"""
system_prompt = "Date this papyrus fragment to an exact year!"

input_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": papyrus_edition},
]

# End-of-sequence and end-of-turn token IDs.
# Note: this list is defined but not passed to the pipeline call below.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = generation_pipeline(
    input_messages,
    max_new_tokens=4,
    num_beams=45,  # Set this as high as your memory will allow!
    num_return_sequences=1,
    early_stopping=True,
)

# Collect the assistant's reply from each returned sequence
beam_contents = []
for output in outputs:
    generated_text = output.get('generated_text', [])
    for item in generated_text:
        if item.get('role') == 'assistant':
            beam_contents.append(item.get('content'))

# Date assigned to this papyrus, printed for comparison
real_response = "71 or 72 AD"

print(f"Year: {real_response}")
for i, content in enumerate(beam_contents, start=1):
    print(f"Suggestion {i}: {content}")
```
### Expected Output:
```
Year: 71 or 72 AD
Suggestion 1: 71
```
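
If you want to compare suggestions with a known date programmatically, the reply-collecting loop above can be wrapped in small helpers. The following is a minimal sketch that runs without the model: `mock_outputs` is a hand-written stand-in shaped like the pipeline's chat output, and `assistant_replies` and `parse_year` are hypothetical helper names. It assumes each reply contains a single integer year, with BCE years written as negative numbers.

```python
import re


def assistant_replies(outputs):
    """Collect the assistant messages from chat-style pipeline outputs."""
    replies = []
    for output in outputs:
        for message in output.get("generated_text", []):
            if message.get("role") == "assistant":
                replies.append(message.get("content", ""))
    return replies


def parse_year(reply):
    """Return the first integer found in a reply, or None if there is none."""
    match = re.search(r"-?\d+", reply)
    return int(match.group()) if match else None


# Hand-written stand-in for the pipeline's return value (no model needed).
mock_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "Date this papyrus fragment to an exact year!"},
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "71"},
        ]
    }
]

years = [parse_year(reply) for reply in assistant_replies(mock_outputs)]
# Check each parsed year against the known span of 71 to 72 AD.
in_span = [71 <= year <= 72 for year in years if year is not None]
print(years, in_span)
```

With real output, pass `outputs` from the inference step in place of `mock_outputs`.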
## Usage on free tier in Google Colab
If you don’t have access to a larger GPU but want to try the model out, you can run it in a quantized format in Google Colab. **The quality of the responses might deteriorate significantly.** Follow these steps:
### Step 1: Connect to free GPU
1. Click the dropdown arrow next to Connect, near the top right of the notebook.
2. Select Change runtime type.
3. In the modal window, select T4 GPU as your hardware accelerator.
4. Click Save.
5. Click the Connect button to connect to your runtime. After some time, the button will present a green checkmark, along with RAM and disk usage graphs. This indicates that a server has successfully been created with your required hardware.
### Step 2: Install Dependencies
```python
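# Install (or upgrade) bitsandbytes, which provides the 4-bit quantization
!pip install -U bitsandbytes

# Stop the current process so the Colab runtime restarts and picks up the new package
import os
os._exit(0)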
```
### Step 3: Download and quantize the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Ericu950/Papy_1_Llama-3.1-8B-Instruct_date",
    device_map="auto",
    quantization_config=quant_config,
)

tokenizer = AutoTokenizer.from_pretrained("Ericu950/Papy_1_Llama-3.1-8B-Instruct_date")

generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
```
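
As a rough illustration of why 4-bit quantization helps on a T4 with 16 GB of memory, the sketch below estimates the memory taken by the weights alone at 16 and 4 bits per parameter. It is a back-of-the-envelope calculation, not a measurement: the parameter count of about 8 billion is an approximation, `weight_memory_gib` is a hypothetical helper, and the estimate ignores quantization metadata, any layers kept in higher precision, activations, and the extra memory used by beam search.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in GiB, ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"bfloat16 weights: ~{bf16_gib:.1f} GiB")
print(f"4-bit weights:    ~{nf4_gib:.1f} GiB")
```

Actual usage will be higher than the 4-bit estimate, which is one reason to keep `num_beams` modest on this hardware.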
### Step 4: Run inference on a papyrus fragment of your choice
```python
# This is a rough transcription of Pap.Ups. 106
papyrus_edition = """
ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------
ομολογει παυσιριων απολλωνιου του παuσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι
εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ -------------------------
απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ----------------------------------
--------------------σ αυτωι εξ ησ συνεστιν ------------------------------------
----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------
------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ --------------------------
--------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------
---------- και προ κατενγεγυηται τα δικαια --------------------------------------
νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη ---------------------------------------
υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του
ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην
---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε
------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε
----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα
παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην
εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημοσιον τασ ισασ και μηθεν
ησσον· δ -----ιων ομολογιαν συνεχωρησεν·"""
system_prompt = "Date this papyrus fragment to an exact year!"

input_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": papyrus_edition},
]

outputs = generation_pipeline(
    input_messages,
    max_new_tokens=4,
    num_beams=10,
    num_return_sequences=1,
    early_stopping=True,
)

beam_contents = []
for output in outputs:
    generated_text = output.get('generated_text', [])
    for item in generated_text:
        if item.get('role') == 'assistant':
            beam_contents.append(item.get('content'))

real_response = "71 or 72 AD"

print(f"Year: {real_response}")
for i, content in enumerate(beam_contents, start=1):
    print(f"Suggestion {i}: {content}")
```
### Expected Output:
```
Year: 71 or 72 AD
Suggestion 1: 71
```