---
library_name: transformers
license: apache-2.0
tags:
- tokenizer
- claude3
- t5
---

# claude3 tokenizer: for T5

Vocabulary size: 65,103

- the special tokens relevant for T5 training (e.g. `</s>`) have been added
- the post processor has been updated to follow T5's tokenizer, so `</s>` is appended automatically

usage:


```py
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained('BEE-spoke-data/claude-tokenizer-forT5')
inputs = tk("here are some words", return_tensors="pt")
```

## post processor

The relevant excerpt from the tokenizer config is below: a single sequence gets `</s>` (id 65001) appended at the end, and a pair gets `</s>` appended after each of the two sequences.

```json
"post_processor": {
    "type": "TemplateProcessing",
    "single": [
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "</s>",
          "type_id": 0
        }
      }
    ],
    "pair": [
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "</s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "B",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "</s>",
          "type_id": 0
        }
      }
    ],
    "special_tokens": {
      "</s>": {
        "id": "</s>",
        "ids": [
          65001
        ],
        "tokens": [
          "</s>"
        ]
      }
    }
  },
```
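
For illustration, the template above can be reproduced standalone with the `tokenizers` library. The tiny word-level vocab here is a made-up toy for demonstration, not this tokenizer's actual vocab; only the `TemplateProcessing` setup mirrors the JSON:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# toy vocab for demonstration only -- not the claude3 vocab
tok = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# same template as the JSON above: append </s> (id 65001) after each sequence
tok.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", 65001)],
)

enc = tok.encode("hello world")
print(enc.tokens)  # ['hello', 'world', '</s>']
print(enc.ids)     # [0, 1, 65001]
```

Pair inputs behave the same way: `tok.encode("hello", "world")` yields ids `[0, 65001, 1, 65001]`, with `</s>` closing each segment.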