
1. KoRnDAlpaca-Polyglot-12.8B (v1.3)

  • KoRnDAlpaca is a Korean language model fine-tuned on 1 million instruction examples (R&D Instruction Dataset v1.3) generated from Korean national research reports.
  • The base model of KoRnDAlpaca is EleutherAI/polyglot-ko-12.8b.
  • For more information about the training procedure and the model, please contact gsjang@kisti.re.kr.

2. How to use the model

from transformers import pipeline, AutoModelForCausalLM
import torch

LLM_MODEL = "NTIS/KoRnDAlpaca-Polyglot-12.8B"
query = "지능형 영상감시 기술의 대표적인 국내 기업은?"  # "What is a representative Korean company in intelligent video surveillance technology?"

# Load the model in fp16; device_map="auto" spreads the weights across available GPUs.
llm_model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    # load_in_8bit=True,   # optional: load with 8-bit weights to reduce GPU memory
    # revision="8bit"
)
pipe = pipeline(
    "text-generation",
    model=llm_model,
    tokenizer=LLM_MODEL,
    # device=2,  # optional: pin the pipeline to a specific GPU
)

# The prompt follows the "### 질문: (question) / ### 답변: (answer)" template.
ans = pipe(
    f"### 질문: {query}\n\n### 답변:",
    do_sample=True,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.9,
    return_full_text=False,
    eos_token_id=2,
)
msg = ans[0]["generated_text"]

# Keep only the text before the first "###" marker; if nothing was generated,
# fall back to an apology message ("Sorry, we could not provide an answer.").
if len(msg.split('###')[0]) > 0:
    output = msg.split('###')[0]
else:
    output = '답변을 드리지 못하여 죄송합니다.'

print(output)
# 국내 지능형 영상감시 기술의 대표적인 기업으로는 한화 테크윈이 있다.
# ("A representative Korean company in intelligent video surveillance is Hanwha Techwin.")

3. R&D Instruction Dataset v1.3

  • The dataset is built from 30,000 original research reports from the last 5 years, provided by KISTI (curation.kisti.re.kr).
  • The dataset cannot be released at this time due to licensing issues (a future release is under discussion).
  • The process of building the dataset is as follows:
    • A. Extract important technology-related texts, such as technology trends and technology definitions, from the research reports.
    • B. Preprocess the extracted text.
    • C. Generate question-answer pairs (1.5 million in total) from the extracted text using the ChatGPT API (temporarily), which is scheduled to be replaced with our own question-and-answer generation model ('23.11).
    • D. Reformat the dataset into (Instruction, Output, Source) records, where 'Instruction' is the user's question, 'Output' is the answer, and 'Source' is the identification code of the research report the answer is based on (see the sketch after this list).
    • E. Remove low-quality data with a data-quality evaluation module, keeping only high-quality Q&A pairs (1 million) for training.
      • ※ In KoRnDAlpaca v2 (planned for '23.10), instruction data for generating long-form technology trend text will be added alongside the Q&A data.

4. Future plans

  • '23.10: Release KoRnDAlpaca v2 (adds the ability to generate long-form technology trend information in Markdown format)
  • '23.12: Release NTIS-searchGPT module v1 (Retriever + KoRnDAlpaca v3)
    • ※ An R&D-specific open-domain question-answering module with a "Retriever + Generator" structure (see the sketch after this list)
    • ※ NTIS-searchGPT v1 is an early edition; performance improvements are scheduled for 2024.
  • '23.12: KoRnDAlpaca v2 will be applied to the chatbot of NTIS (www.ntis.go.kr)
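
The NTIS-searchGPT module is not yet released; as a rough sketch of the "Retriever + Generator" pattern it describes, a retriever would first select relevant report passages and KoRnDAlpaca would then answer conditioned on them. The retrieve helper and the prompt layout below are assumptions, not the released design.

def answer_with_retrieval(query, retrieve, pipe):
    # `retrieve` is a hypothetical helper returning report passages for the query
    # (e.g., BM25 or a dense retriever); `pipe` is the pipeline from section 2.
    passages = retrieve(query, top_k=3)
    context = "\n".join(passages)
    # Prepend the retrieved context to the question/answer prompt template.
    prompt = f"{context}\n\n### 질문: {query}\n\n### 답변:"
    ans = pipe(prompt, max_new_tokens=512, return_full_text=False, eos_token_id=2)
    return ans[0]["generated_text"].split('###')[0]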

5. Date of last update

  • 2023.08.31
