---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- llama-2
- astronomy
- astrophysics
- arxiv
inference: false
base_model:
- meta-llama/Llama-2-70b-hf
---

# AstroLLaMA-2-70B-Chat_AIC

AstroLLaMA-2-70B-Chat_AIC is a specialized chat model for astronomy, created by fine-tuning the AstroLLaMA-2-70B-Base_AIC model. Developed by the AstroMLab team, it is, to the best of our knowledge, one of the first astronomy-specialized LLMs at the 70B-parameter scale designed for instruction following and chat-based interactions.

## Model Details

- **Base Architecture**: LLaMA-2-70b
- **Base Model**: AstroLLaMA-2-70B-Base_AIC (trained on the Abstract, Introduction, and Conclusion sections of papers from arXiv's astro-ph category)
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)
- **SFT Dataset**:
  - 10,356 astronomy-centered conversations generated from arXiv abstracts by GPT-4
  - Full content of the LIMA dataset
  - 10,000 samples from the Open Orca dataset
  - 10,000 samples from the UltraChat dataset
- **Training Details**:
  - Learning rate: 3 × 10⁻⁷
  - Training epochs: 1
  - Total batch size: 48
  - Maximum token length: 2048
  - Warmup ratio: 0.03
  - Cosine decay schedule for learning rate reduction (see the sketch after this list)
- **Primary Use**: Instruction-following and chat-based interactions for astronomy-related queries
- **Reference**: Pan et al. 2024 [Link to be added]
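
For concreteness, here is a minimal sketch of the learning-rate curve implied by the hyperparameters above (peak LR 3 × 10⁻⁷, warmup ratio 0.03, cosine decay). The card does not specify the trainer's exact schedule function, so the linear-warmup choice and the `lr_at` helper below are illustrative assumptions, not the confirmed implementation:

```python
import math

# Assumed reconstruction of the stated schedule: linear warmup over the first
# 3% of optimizer steps, then cosine decay of the learning rate to zero.
PEAK_LR = 3e-7
WARMUP_RATIO = 0.03

def lr_at(step: int, total_steps: int) -> float:
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps  # linear warmup to the peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Under these assumptions the peak is reached about 3% of the way through training and the rate falls to half the peak at roughly the 51.5% mark.
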
## Using the model for chat

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic")
model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic", device_map="auto")

# Function to generate a response
def generate_response(prompt, max_length=512):
    full_prompt = f"###Human: {prompt}\n\n###Assistant:"
    inputs = tokenizer(full_prompt, return_tensors="pt", truncation=True, max_length=max_length)
    inputs = inputs.to(model.device)

    # Generate a response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.encode("###Human:", add_special_tokens=False)[0],
        )

    # Decode and return the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract only the Assistant's response
    assistant_response = response.split("###Assistant:")[-1].strip()
    return assistant_response

# Example usage
user_input = "What are the main components of a galaxy?"
response = generate_response(user_input)
print(f"Human: {user_input}")
print(f"Assistant: {response}")
```
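
One caveat about the stopping condition above: `tokenizer.encode("###Human:", add_special_tokens=False)[0]` keeps only the first sub-token of the marker, so generation halts whenever that single token appears and may truncate replies early. A substring-based stopping criterion is a more precise alternative; the sketch below is an untested illustration (the `StopOnMarker` class is ours, not part of this model's card, and depending on your `transformers` version `__call__` may need to return a per-batch boolean tensor rather than a plain `bool`):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnMarker(StoppingCriteria):
    """Stop once the generated continuation contains a marker string."""

    def __init__(self, tokenizer, marker: str, prompt_len: int):
        self.tokenizer = tokenizer
        self.marker = marker
        self.prompt_len = prompt_len  # skip the marker already in the prompt

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Decode only the newly generated tokens, then test for the marker.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:], skip_special_tokens=False)
        return self.marker in new_text

# Illustrative use with the generate_response() helper above:
#   prompt_len = inputs["input_ids"].shape[1]
#   stop = StoppingCriteriaList([StopOnMarker(tokenizer, "###Human:", prompt_len)])
#   outputs = model.generate(**inputs, max_new_tokens=512, stopping_criteria=stop)
```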
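
A note on hardware: a 70B-parameter model in 16-bit precision needs roughly 140 GB of GPU memory for the weights alone. If that is out of reach, 4-bit quantized loading through the `bitsandbytes` integration in `transformers` is one option; the settings below are common illustrative defaults, not values validated against this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit load (requires the bitsandbytes package); quantization
# settings are common defaults, not tuned or tested for this model.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute for quality
)

tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic")
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrollama-2-70b-chat_aic",
    quantization_config=quant_config,
    device_map="auto",
)
```
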
## Model Performance and Limitations

While the AstroLLaMA-2-70B-Base_AIC model demonstrated significant improvements over the baseline LLaMA-2-70B model, the chat version (AstroLLaMA-2-70B-Chat_AIC) suffers a performance drop caused by limitations in the SFT process, summarized below.

Key limitations:

1. **SFT Dataset Limitations**: The current SFT dataset, with only about 30,000 Q&A pairs (many of them not astronomy-focused), has proven inadequate for maintaining the base model's performance.
2. **Performance Degradation**: The chat model's performance (64.7%) is significantly lower than the base model's (76.0%), an 11.3-point drop attributable to the SFT process.
3. **General vs. Specialized Knowledge**: The current SFT process appears to steer the model toward generic answers, potentially at the cost of specialized astronomical knowledge.

These limitations underscore the challenges in developing specialized chat models and the critical importance of both the quantity and quality of training data, especially for the SFT process.

This model is released primarily for reproducibility purposes, allowing researchers to track the development process and compare different iterations of AstroLLaMA models.

For optimal performance and the most up-to-date capabilities in astronomy-related tasks, we recommend using AstroSage-8B, where these limitations have been addressed through expanded training data and refined fine-tuning processes.
## Ethical Considerations

While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.
## Citation

If you use this model in your research, please cite:

```
[Citation for Pan et al. 2024 to be added]
```