---
license: mit
---
# Model Card for SciLitLLM-7B

SciLitLLM-7B adapts a general large language model for effective scientific literature understanding. Starting from the Qwen2-7B model, SciLitLLM-7B is trained with a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT) to simultaneously infuse scientific domain knowledge and strengthen instruction-following on domain-specific tasks.

In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address them with a meticulous pipeline that includes PDF text extraction, correction of parsing errors, quality filtering, and synthetic instruction creation.

Applying this strategy, we present SciLitLLM-7B, a model specialized in scientific literature understanding that demonstrates promising performance on related benchmarks: an average improvement of 3.6% on SciAssess and 10.1% on SciRIFF compared to leading LLMs with fewer than 15B parameters.

See the [paper](https://arxiv.org/abs/2408.15545) for more details.

## Requirements
Since SciLitLLM is based on Qwen2, we recommend installing `transformers>=4.37.0`; otherwise you may encounter the following error:
```
KeyError: 'qwen2'
```
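
For example, with pip (adjust to your environment or package manager):

```
pip install "transformers>=4.37.0"
```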

## Quickstart

Here is a code snippet with `apply_chat_template` showing how to load the tokenizer and model, and how to generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Uni-SMART/SciLitLLM",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Uni-SMART/SciLitLLM")

prompt = "Can you summarize this article for me?\n <ARTICLE>"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Keep only the newly generated tokens, dropping the prompt
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
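
If you want tokens printed to stdout as they are generated, you can pass a `TextStreamer` to `generate`. This is a minimal sketch reusing `model`, `tokenizer`, and `model_inputs` from above; it is not part of the original card:

```python
from transformers import TextStreamer

# Stream decoded tokens as they are produced,
# skipping the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    streamer=streamer
)
```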

## Citation

If you find our work helpful, please consider citing it:

```
@misc{li2024scilitllmadaptllmsscientific,