metadata

language:
  - en
  - ko
license: llama3
library_name: transformers
tags:
  - llama-cpp
  - gguf-my-repo
base_model:
  - meta-llama/Meta-Llama-3-70B
  - jeiku/Average_Test_v1
  - Bllossom/llama-3-Korean-Bllossom-70B

Bllossom | Demo | Homepage | Github | Colab-tutorial |

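Our Bllossom team has released Bllossom, a Korean-English bilingual language model!
With support from the Seoultech supercomputing center, this Korean-enhanced bilingual model was fully fine-tuned (all parameters) on more than 100GB of Korean data!
Have you been looking for a model that is good at Korean?
 - A first for Korean! Vocabulary expansion with more than 30,000 Korean tokens
 - Handles Korean context roughly 25% longer than Llama3
 - Korean-English knowledge linking using a Korean-English parallel corpus (pre-training)
 - Fine-tuning on data created by linguists with Korean culture and language in mind
 - Reinforcement learning
All of the above are applied together and commercial use is permitted, so use Bllossom to build your own model!
This is a quantized model that runs on a GPU with at least 42GB of memory, or on a CPU with at least 42GB of RAM!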
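
The 42GB figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below is only illustrative: the parameter count and the average bits per weight for Q4_K_M are assumptions, not values stated in this model card, and the estimate ignores the KV cache and runtime overhead.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values for illustration, not taken from the model card:
N_PARAMS = 70.6e9        # approximate parameter count of a Llama 3 70B model
BITS_PER_WEIGHT = 4.8    # rough average for Q4_K_M mixed quantization

weights_gb = estimate_gguf_size_gb(N_PARAMS, BITS_PER_WEIGHT)
print(f"Estimated weight size: {weights_gb:.1f} GB")
```

Under these assumptions the weights alone land close to the stated requirement, so some extra headroom is advisable for longer contexts.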

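1. Bllossom-8B is a practicality-oriented language model built in collaboration with linguists from Seoultech, Teddysum, and the Yonsei University Language Resources Lab! We will keep maintaining it through continuous updates, so please make good use of it 🙂
2. We also have the very powerful Advanced-Bllossom 8B and 70B models and a vision-language model! (Contact us individually if you are interested!!)
3. Bllossom has been accepted for presentation at NAACL2024 and LREC-COLING2024 (oral).
4. We will keep releasing good language models!! Anyone who wants to collaborate on research to strengthen Korean (especially on papers) is always welcome!!
   In particular, teams that can lend even a small number of GPUs are welcome to contact us anytime! We will help you build what you want.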

The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama3. It strengthens the connection between Korean and English knowledge and has the following features:

 - Knowledge Linking: Linking Korean and English knowledge through additional training
 - Vocabulary Expansion: Expansion of Korean vocabulary to enhance Korean expressiveness
 - Instruction Tuning: Tuning using custom-made instruction-following data specialized for Korean language and Korean culture
 - Human Feedback: DPO has been applied
 - Vision-Language Alignment: Aligning the vision transformer with this language model
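
One plausible reading of the "roughly 25% longer Korean context" claim is that it follows from vocabulary expansion: if the same Korean text needs fewer tokens, more text fits in a fixed context window. The sketch below only works through that arithmetic. The token counts are hypothetical, and `context_gain` is an illustrative helper, not part of any library.

```python
def context_gain(tokens_before: int, tokens_after: int) -> float:
    """Relative increase in text that fits a fixed token window."""
    if tokens_before <= 0 or tokens_after <= 0:
        raise ValueError("token counts must be positive")
    return tokens_before / tokens_after - 1.0


# Hypothetical token counts for the same Korean passage (illustrative only).
tokens_llama3 = 1000     # base tokenizer
tokens_expanded = 800    # tokenizer with added Korean vocabulary

gain = context_gain(tokens_llama3, tokens_expanded)
print(f"{gain:.0%} more Korean text fits in the same window")
```

In other words, a 25% gain in usable context corresponds to about 20% fewer tokens for the same text.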

This model was developed by MLPLab at Seoultech, Teddysum, and Yonsei Univ. It was converted to GGUF format from Bllossom/llama-3-Korean-Bllossom-70B using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.
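
After downloading, a quick way to confirm that a file really is GGUF is to read its fixed-size header. The following is a minimal sketch, assuming GGUF version 2 or later (a 4-byte `GGUF` magic, then a little-endian uint32 version and two uint64 counts). `read_gguf_header` is a hypothetical helper, and the demo writes a tiny synthetic header rather than touching the real model file.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    return struct.unpack("<IQQ", header[4:24])


# Demo on a synthetic header so the sketch runs without the 42GB file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gguf")
    with open(demo_path, "wb") as f:
        f.write(b"GGUF" + struct.pack("<IQQ", 3, 2, 5))
    print(read_gguf_header(demo_path))
```

For the real model, pass the path of the downloaded `.gguf` file instead of the synthetic one.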

Demo Video

Bllossom-V Demo

Bllossom Demo (Kakao)

NEWS

[2024.05.08] Vocab Expansion Model Update
[2024.04.25] We released Bllossom v2.0, based on llama-3
[2023/12] We released Bllossom-Vision v1.0, based on Bllossom
[2023/08] We released Bllossom v1.0, based on llama-2.
[2023/07] We released Bllossom v0.7, based on polyglot-ko.

Example code

!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
!huggingface-cli download Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M --local-dir='YOUR-LOCAL-FOLDER-PATH'

from llama_cpp import Llama
from transformers import AutoTokenizer

model_id = 'Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama(
    model_path='YOUR-LOCAL-FOLDER-PATH/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M.gguf',
    n_ctx=512,             # Context window in tokens (prompt + generated output)
    n_gpu_layers=-1        # Number of model layers to offload to GPU; -1 offloads all
)

PROMPT = \
'''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''

instruction = 'Your Instruction'

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
    ]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

generation_kwargs = {
    "max_tokens": 512,  # Prompt tokens + generated tokens must fit within n_ctx
    "stop": ["<|eot_id|>"],
    "echo": True,  # Echo the prompt in the output
    "top_p": 0.9,
    "temperature": 0.6,
}

response_msg = model(prompt, **generation_kwargs)
# With echo=True the returned text starts with the prompt, so slice it off
print(response_msg['choices'][0]['text'][len(prompt):])
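
For readers who want to see what the prompt string looks like, or who prefer not to load the tokenizer from `transformers`, the sketch below renders the same messages by hand. It assumes the model uses the standard Llama 3 instruct chat format, which this card does not state explicitly (the `<|eot_id|>` stop token above is consistent with it). `build_llama3_prompt` is a hypothetical helper, not part of llama-cpp-python or transformers.

```python
def build_llama3_prompt(messages):
    """Render chat messages in the Llama 3 instruct format (assumed template)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Your Instruction"},
]
demo_prompt = build_llama3_prompt(demo_messages)
print(demo_prompt)
```

If the output of `tokenizer.apply_chat_template` differs from this string, prefer the tokenizer's version, since it reflects the template actually shipped with the model.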

Citation

Language Model

@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
}

Vision-Language Model

@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
}

Contact

임경태(KyungTae Lim), Professor at Seoultech. ktlim@seoultech.ac.kr
함영균(Younggyun Hahm), CEO of Teddysum. hahmyg@teddysum.ai
김한샘(Hansaem Kim), Professor at Yonsei. khss@yonsei.ac.kr

Contributor

최창수(Changsu Choi), choics2623@seoultech.ac.kr
김상민(Sangmin Kim), sangmin9708@naver.com
원인호(Inho Won), wih1226@seoultech.ac.kr
김민준(Minjun Kim), mjkmain@seoultech.ac.kr
송승우(Seungwoo Song), sswoo@seoultech.ac.kr
신동재(Dongjae Shin), dylan1998@seoultech.ac.kr
임현석(Hyeonseok Lim), gustjrantk@seoultech.ac.kr
육정훈(Jeonghun Yuk), usually670@gmail.com
유한결(Hangyeol Yoo), 21102372@seoultech.ac.kr
송서현(Seohyun Song), alexalex225225@gmail.com