---
license: mit
---

# Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

## Model description
This model is a fine-tuned version of the pre-trained model James-WYang/BigTranslate, adapted for the slot translation task. The fine-tuning procedure and model adjustments follow the methodology described in our publication https://arxiv.org/pdf/2404.02588.pdf. The model translates sentences while preserving annotated NLU (Natural Language Understanding) slots, which are marked with simple HTML-like tags.

The input to the model should be a sentence in which every NLU slot is wrapped in an HTML-like tag named with consecutive alphabetical letters (e.g., \<a\>, \<b\>, \<c\>); the same tag marks both the start and the end of a slot. The model outputs the translated sentence with these annotations preserved.

Example: "Set the temperature on my \<a\>thermostat\<a\> to \<b\>29 degrees\<b\>."
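
For reference, this annotation can be produced from character-level slot spans with a few lines of Python. The `annotate_slots` helper below is an illustrative sketch (the function name and the `(start, end)` span format are assumptions, not part of the released code):

```python
from string import ascii_lowercase

def annotate_slots(sentence, slots):
    # `slots` is a list of (start, end) character spans from an NLU annotation;
    # slots are tagged <a>, <b>, ... in order of appearance.
    parts, last = [], 0
    for tag, (start, end) in zip(ascii_lowercase, sorted(slots)):
        parts.append(sentence[last:start])
        # The same single-letter tag marks both the start and the end of a slot.
        parts.append(f"<{tag}>{sentence[start:end]}<{tag}>")
        last = end
    parts.append(sentence[last:])
    return "".join(parts)

annotate_slots("Set the temperature on my thermostat to 29 degrees.",
               [(26, 36), (40, 50)])
# 'Set the temperature on my <a>thermostat<a> to <b>29 degrees<b>.'
```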

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Source/target language names in Chinese, as expected by the BigTranslate
# prompt template.
BIGTRANSLATE_LANG_TABLE = {
    "zh": "汉语",      # Chinese
    "es": "西班牙语",  # Spanish
    "fr": "法语",      # French
    "de": "德语",      # German
    "hi": "印地语",    # Hindi
    "pt": "葡萄牙语",  # Portuguese
    "tr": "土耳其语",  # Turkish
    "en": "英语",      # English
    "ja": "日语"       # Japanese
}


def get_prompt(src_lang, tgt_lang, src_sentence):
    # Chinese instruction: "Please translate the following <src_lang> sentence
    # into <tgt_lang>: <src_sentence>"
    translate_instruct = f"请将以下{BIGTRANSLATE_LANG_TABLE[src_lang]}句子翻译成{BIGTRANSLATE_LANG_TABLE[tgt_lang]}:{src_sentence}"
    # Alpaca-style wrapper: "Below is an instruction that describes a task.
    # Write a response that appropriately completes the request.", followed by
    # "### Instruction:" and "### Response:" headers.
    return (
        "以下是一个描述任务的指令,请写一个完成该指令的适当回复。\n\n"
        f"### 指令:\n{translate_instruct}\n\n### 回复:"
    )


def translate(input_text, src_lang, trg_lang):
    prompt = get_prompt(src_lang, trg_lang, input_text)
    inputs = tokenizer(prompt, return_tensors="pt")
    generated_tokens = model.generate(**inputs, max_new_tokens=256)[0]
    # The model echoes the prompt, so strip it from the decoded output.
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)[len(prompt):]


model = AutoModelForCausalLM.from_pretrained("Samsung/BigTranslateSlotTranslator")
tokenizer = AutoTokenizer.from_pretrained("Samsung/BigTranslateSlotTranslator")

translation = translate("set the temperature on my <a>thermostat<a> to <b> 29 degrees <b>", "en", "de")
# translation: stell die temperatur auf meinem <a> thermostat <a> auf <b> 29 grad <b>
```
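
Because the tags survive translation, the translated slot values can be recovered with a regular expression. The `extract_slots` helper below is an illustrative sketch, not part of the released code:

```python
import re

def extract_slots(tagged_sentence):
    # Each slot is delimited by a repeated single-letter tag, e.g. <a>...<a>.
    return {tag: value.strip()
            for tag, value in re.findall(r"<([a-z])>(.*?)<\1>", tagged_sentence)}

extract_slots("stell die temperatur auf meinem <a> thermostat <a> auf <b> 29 grad <b>")
# {'a': 'thermostat', 'b': '29 grad'}
```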

## Model fine-tuning code
https://github.com/Samsung/MT-LLM-NLU/tree/main/BigTranslateFineTuning