# InternLM
👋 Join us on Discord and WeChat
## Introduction

InternLM3 has open-sourced an 8-billion-parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. The model has the following characteristics:

- **Enhanced performance at reduced cost**: State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.
- **Deep thinking capability**: InternLM3 supports both a deep thinking mode for solving complicated reasoning tasks via long chains of thought and a normal response mode for fluent user interactions.

## InternLM3-8B-Instruct

### Performance Evaluation

We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results; you can visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.

| Benchmark    |                                  | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
| ------------ | -------------------------------- | --------------------- | ------------------- | -------------------- | --------------------------- |
| General      | CMMLU(0-shot)                    | **83.1**              | 75.8                | 53.9                 | 66.0                        |
|              | MMLU(0-shot)                     | 76.6                  | **76.8**            | 71.8                 | 82.7                        |
|              | MMLU-Pro(0-shot)                 | **57.6**              | 56.2                | 48.1                 | 64.1                        |
| Reasoning    | GPQA-Diamond(0-shot)             | **37.4**              | 33.3                | 24.2                 | 42.9                        |
|              | DROP(0-shot)                     | **83.1**              | 80.4                | 81.6                 | 85.2                        |
|              | HellaSwag(10-shot)               | **91.2**              | 85.3                | 76.7                 | 89.5                        |
|              | KOR-Bench(0-shot)                | **56.4**              | 44.6                | 47.7                 | 58.2                        |
| MATH         | MATH-500(0-shot)                 | **83.0\***            | 72.4                | 48.4                 | 74.0                        |
|              | AIME2024(0-shot)                 | **20.0\***            | 16.7                | 6.7                  | 13.3                        |
| Coding       | LiveCodeBench(2407-2409 Pass@1)  | **17.8**              | 16.8                | 12.9                 | 21.8                        |
|              | HumanEval(Pass@1)                | 82.3                  | **85.4**            | 72.0                 | 86.6                        |
| Instruction  | IFEval(Prompt-Strict)            | **79.3**              | 71.7                | 75.2                 | 79.7                        |
| Long Context | RULER(4-128K Average)            | 87.9                  | 81.4                | **88.5**             | 90.7                        |
| Chat         | AlpacaEval 2.0(LC WinRate)       | **51.1**              | 30.3                | 25.0                 | 50.7                        |
|              | WildBench(Raw Score)             | **33.1**              | 23.3                | 1.5                  | 40.3                        |
|              | MT-Bench-101(Score 1-10)         | **8.59**              | 8.49                | 8.37                 | 8.87                        |

- The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (values marked with \* were evaluated in Thinking Mode), and the evaluation configurations can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- The numbers may vary as [OpenCompass](https://github.com/internLM/OpenCompass/) iterates between versions, so please refer to its latest evaluation results.

**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content.
We are not responsible for any consequences resulting from the dissemination of harmful information.

### Requirements

```
transformers >= 4.48
```

### Conversation Mode

#### Transformers inference

To load the InternLM3 8B Instruct model using Transformers, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it will be loaded in float32 and might cause OOM errors.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
# InternLM3 8B in 4-bit will cost nearly 8GB of GPU memory.
#   pip install -U bitsandbytes
#   8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
#   4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
prompt = tokenizer.batch_decode(tokenized_chat)[0]
print(prompt)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```

#### LMDeploy inference

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

```bash
pip install lmdeploy
```

You can run batch inference locally with the following Python code:

```python
import lmdeploy

model_dir = "internlm/internlm3-8b-instruct"
pipe = lmdeploy.pipeline(model_dir)
response = pipe("Please tell me five scenic spots in Shanghai")
print(response)
```

Or you can launch an OpenAI-compatible server with the following command:

```bash
lmdeploy serve api_server internlm/internlm3-8b-instruct --model-name internlm3-8b-instruct --server-port 23333
```

Then you can send a chat request to the server:

```bash
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm3-8b-instruct",
    "messages": [
        {"role": "user", "content": "Please tell me five scenic spots in Shanghai"}
    ]
    }'
```

Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/).

#### Ollama inference

TODO

#### vLLM inference

We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install it manually from the PR branch:

```bash
git clone -b support-internlm3 https://github.com/RunningLeon/vllm.git
cd vllm
pip install -e .
```
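Before moving on, note that an OpenAI-compatible endpoint such as the LMDeploy `api_server` above (or a vLLM server, once this build is installed) can also be queried from Python with the official `openai` client instead of `curl`; the offline vLLM example that follows does not need a running server at all. The snippet below is a minimal sketch, assuming the LMDeploy default port 23333 from the example above and a local server that does not enforce an API key:

```python
# pip install openai
from openai import OpenAI

# Base URL, port, and api_key are placeholders; a local server that does not
# check keys accepts any non-empty string as api_key.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="not-needed")
completion = client.chat.completions.create(
    model="internlm3-8b-instruct",
    messages=[
        {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
    ],
)
print(completion.choices[0].message.content)
```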
Inference code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm3-8b-instruct")
sampling_params = SamplingParams(temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""

prompts = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": "Please tell me five scenic spots in Shanghai"
    },
]
outputs = llm.chat(prompts,
                   sampling_params=sampling_params,
                   use_tqdm=False)
print(outputs)
```

### Thinking Mode

#### Thinking Demo

#### Thinking system prompt

```python
thinking_system_prompt = """You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:

## Deep Understanding
Take time to fully comprehend the problem before attempting a solution. Consider:
- What is the real question being asked?
- What are the given conditions and what do they tell us?
- Are there any special restrictions or assumptions?
- Which information is crucial and which is supplementary?

## Multi-angle Analysis
Before solving, conduct thorough analysis:
- What mathematical concepts and properties are involved?
- Can you recall similar classic problems or solution methods?
- Would diagrams or tables help visualize the problem?
- Are there special cases that need separate consideration?

## Systematic Thinking
Plan your solution path:
- Propose multiple possible approaches
- Analyze the feasibility and merits of each method
- Choose the most appropriate method and explain why
- Break complex problems into smaller, manageable steps

## Rigorous Proof
During the solution process:
- Provide solid justification for each step
- Include detailed proofs for key conclusions
- Pay attention to logical connections
- Be vigilant about potential oversights

## Repeated Verification
After completing your solution:
- Verify your results satisfy all conditions
- Check for overlooked special cases
- Consider if the solution can be optimized or simplified
- Review your reasoning process

Remember:
1. Take time to think thoroughly rather than rushing to an answer
2. Rigorously prove each key conclusion
3. Keep an open mind and try different approaches
4. Summarize valuable problem-solving methods
5. Maintain healthy skepticism and verify multiple times

Your response should reflect deep mathematical understanding and precise logical thinking, making your solution path and reasoning clear to others.
When you're ready, present your complete solution with:
- Clear problem understanding
- Detailed solution process
- Key insights
- Thorough verification

Focus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning.
Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer.
"""
```
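Because the prompt above asks the model to repeat its final answer inside `\boxed{}`, you can recover just the answer from the decoded `response` produced by any of the examples below. The helper here is a hypothetical convenience function, not part of InternLM3 or its tooling, and its simple regex only handles answers without nested braces:

```python
import re
from typing import Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, or None if absent."""
    # Hypothetical helper; the regex does not handle nested braces such as \boxed{\frac{1}{2}}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None
```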
""" ``` #### Transformers inference ```python import torch from transformers import AutoTokenizer, AutoModelForCausalLM model_dir = "internlm/internlm3-8b-instruct" tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error. model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16) # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes. # InternLM3 8B in 4bit will cost nearly 8GB GPU memory. # pip install -U bitsandbytes # 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True) # 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True) model = model.eval() messages = [ {"role": "system", "content": thinking_system_prompt}, {"role": "user", "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."}, ] tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") generated_ids = model.generate(tokenized_chat, max_new_tokens=8192) generated_ids = [ output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids) ] prompt = tokenizer.batch_decode(tokenized_chat)[0] print(prompt) response = tokenizer.batch_decode(generated_ids)[0] print(response) ``` #### LMDeploy inference LMDeploy is a toolkit for compressing, deploying, and serving LLM. ```bash pip install lmdeploy ``` You can run batch inference locally with the following python code: ```python from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig model_dir = "internlm/internlm3-8b-instruct" chat_template_config = ChatTemplateConfig(model_name='internlm3') pipe = pipeline(model_dir, chat_template_config=chat_template_config) messages = [ {"role": "system", "content": thinking_system_prompt}, {"role": "user", "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."}, ] response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048)) print(response) ``` #### Ollama inference TODO #### vLLM inference We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually. ```python git clone https://github.com/RunningLeon/vllm.git pip install -e . 
Inference code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm3-8b-instruct")
sampling_params = SamplingParams(temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8, max_tokens=8192)

prompts = [
    {
        "role": "system",
        "content": thinking_system_prompt,
    },
    {
        "role": "user",
        "content": "Given the function \\(f(x)=\\mathrm{e}^{x}-ax - a^{3}\\),\n(1) When \\(a = 1\\), find the equation of the tangent line to the curve \\(y = f(x)\\) at the point \\((1,f(1))\\).\n(2) If \\(f(x)\\) has a local minimum and the minimum value is less than \\(0\\), determine the range of values for \\(a\\)."
    },
]
outputs = llm.chat(prompts,
                   sampling_params=sampling_params,
                   use_tqdm=False)
print(outputs)
```

## Open Source License

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/) / [application form (Chinese)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact