---
language:
- en
pipeline_tag: text-generation
tags:
- llama-3.1
- astronomy
- astrophysics
- cosmology
- arxiv
inference: false
base_model:
- meta-llama/Meta-Llama-3.1-8B
---

# AstroSage-Llama-3.1-8B

Paper: https://arxiv.org/abs/2411.09012

AstroSage-Llama-3.1-8B is a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, and cosmology. Trained on the complete collection of astronomy-related arXiv papers from 2007-2024, millions of synthetically generated question-answer pairs, and other astronomical literature, AstroSage-Llama-3.1-8B demonstrates strong proficiency across a wide range of astronomy questions. This result highlights the potential of domain specialization in AI: focused training can yield capabilities that exceed those of much larger, general-purpose models.

## Model Details

- **Base Architecture**: Llama 3.1
- **Base Model**: Meta-Llama-3.1-8B
- **Parameters**: 8 billion
- **Training Focus**: Astronomy, Astrophysics, Cosmology, and Astronomical Instrumentation
- **License**: Llama 3.1 Community License
- **Development Process**:
  1. Continued Pre-training (CPT) on astronomical literature
  2. Supervised Fine-tuning (SFT) on QA pairs and instruction sets
  3. Model merging with Meta-Llama-3.1-8B-Instruct (75% CPT+SFT / 25% Meta-Instruct); see the sketch below
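
The merging step is a plain linear interpolation of the two checkpoints' weights. Below is a minimal sketch of that step, assuming a hypothetical intermediate checkpoint name (`AstroMLab/astrosage-cpt-sft`); the released AstroSage-Llama-3.1-8B weights already include this merge, and dedicated tooling such as mergekit is commonly used for merges like this in practice.

```python
# Minimal sketch of step 3: 75/25 linear interpolation of parameters.
# "AstroMLab/astrosage-cpt-sft" is a hypothetical path for the intermediate
# CPT+SFT checkpoint; the published model already has this merge applied.
import torch
from transformers import AutoModelForCausalLM

cpt_sft = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrosage-cpt-sft", torch_dtype=torch.bfloat16
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

# Weighted average of every parameter tensor: 75% domain model, 25% Instruct.
instruct_state = instruct.state_dict()
merged_state = {
    name: 0.75 * tensor + 0.25 * instruct_state[name]
    for name, tensor in cpt_sft.state_dict().items()
}

cpt_sft.load_state_dict(merged_state)
cpt_sft.save_pretrained("astrosage-merged")
```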

## Using the model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (bfloat16 halves memory use versus float32)
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/AstroSage-8b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/AstroSage-8b")

# Function to generate a response
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens (strip the echoed prompt)
    response = outputs[0][inputs['input_ids'].shape[-1]:]
    decoded = tokenizer.decode(response, skip_special_tokens=True)

    return decoded

# Example usage
prompt = """
You are an expert in general astrophysics. Your task is to answer the following question:
What are the main components of a galaxy?
"""
response = generate_response(prompt)
print(response)
```
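
Because the final model is merged with Meta-Llama-3.1-8B-Instruct, chat-style prompting through the tokenizer's chat template may also work. The sketch below assumes the repository ships a Llama-3.1 chat template and reuses the `model` and `tokenizer` objects loaded above.

```python
# Chat-style prompting via the tokenizer's chat template (assumes one is provided).
messages = [
    {"role": "system", "content": "You are an expert in general astrophysics."},
    {"role": "user", "content": "What are the main components of a galaxy?"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

chat_outputs = model.generate(
    chat_inputs,
    max_new_tokens=256,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the tokens generated after the prompt.
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```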


## Model Improvements and Performance

AstroSage-Llama-3.1-8B achieves the following scores on the astronomy multiple-choice benchmark reported in the AstroMLab 3 paper:

| Model | Score (%) |
|-------|-----------|
| **<span style="color:green">AstroSage-Llama-3.1-8B</span>** | **<span style="color:green">80.9</span>** |
| GPT-4o | 80.4 |
| LLaMA-3.1-8B | 73.7 |
| Gemma-2-9B | 71.5 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| InternLM-2.5-7B | 64.5 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |

The model demonstrates:
- Better performance than all other ~8B-parameter models evaluated
- Performance comparable to GPT-4o (80.9% vs. 80.4%)
- ~1000x greater cost-effectiveness than proprietary models
- A roughly 7 percentage-point improvement over the base Llama-3.1-8B model


## Training Data

- **Continued Pre-training**:
  - ~250,000 arXiv preprints (2007-2024) from astro-ph and gr-qc
  - Astronomy-related Wikipedia articles
  - Selected astronomy textbooks
  - Total: 3.3 billion tokens, 19.9 GB plaintext

- **Supervised Fine-tuning**:
  - 8.8 million curated QA pairs
  - Filtered Infinity-Instruct-7M dataset
  - Paper summaries and metadata
  - Total: 2.0 billion tokens, 9.8 GB plaintext

## Intended Use
- Curiosity-driven question answering
- Brainstorming new ideas
- Astronomical research assistance
- Educational support in astronomy
- Literature review and summarization
- Scientific explanation of concepts (see the example prompt below)
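
For instance, the `generate_response` helper defined above can serve several of these use cases directly; the prompt below is purely illustrative.

```python
# Illustrative concept-explanation prompt, reusing generate_response() from the usage section.
prompt = """
You are an expert in observational cosmology. Explain, at the level of a graduate
student, why the cosmic microwave background has a near-perfect blackbody spectrum.
"""
print(generate_response(prompt))
```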

## Limitations
- Training data cutoff: January 2024
- As with all LLMs, hallucinations are possible
- Limited by 8B parameter size for complex reasoning
- Paper metadata (e.g., exact author lists and bibliographic details) is not perfectly memorized
- Performance primarily validated on multiple-choice questions
- Primarily trained for use in English

## Technical Specifications
- Architecture: Based on Meta-Llama 3.1
- Training Infrastructure: ORNL OLCF Frontier
- Hosting: Hugging Face Hub (AstroMLab/AstroSage-8B)

## Ethical Considerations

While this model is designed for scientific use:
- It should not be used as the sole source for critical research decisions
- Its output should be verified against primary sources
- It may reflect biases present in the astronomical literature

## Citation and Contact

- Corresponding author: Tijmen de Haan (tijmen dot dehaan at gmail dot com)
- AstroMLab: astromachinelearninglab at gmail dot com
- Please cite the AstroMLab 3 paper when referencing this model:
```
@misc{dehaan2024astromlab3,
      title={AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model}, 
      author={Tijmen de Haan and Yuan-Sen Ting and Tirthankar Ghosal and Tuan Dung Nguyen and Alberto Accomazzi and Azton Wells and Nesar Ramachandra and Rui Pan and Zechang Sun},
      year={2024},
      eprint={2411.09012},
      archivePrefix={arXiv},
      primaryClass={astro-ph.IM},
      url={https://arxiv.org/abs/2411.09012}, 
}
```