--- |
|
license: mit |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
tags: |
|
- llama-2 |
|
- astronomy |
|
- astrophysics |
|
- arxiv |
|
inference: false |
|
base_model: |
|
- meta-llama/Llama-2-70b-hf |
|
--- |
|
|
|
# AstroLLaMA-2-70B-Chat_AIC |
|
|
|
AstroLLaMA-2-70B-Chat_AIC is a specialized chat model for astronomy, created by fine-tuning the AstroLLaMA-2-70B-Base_AIC model. Developed by the AstroMLab team, it is, to the best of our knowledge, one of the first specialized 70B-scale LLMs in astronomy designed for instruction-following and chat-based interactions.
|
|
|
## Model Details |
|
|
|
- **Base Architecture**: LLaMA-2-70B

- **Base Model**: AstroLLaMA-2-70B-Base_AIC (trained on the Abstract, Introduction, and Conclusion sections of papers in arXiv's astro-ph category)

- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)

- **SFT Dataset**:
  - 10,356 astronomy-centered conversations generated from arXiv abstracts by GPT-4
  - Full content of the LIMA dataset
  - 10,000 samples from the Open Orca dataset
  - 10,000 samples from the UltraChat dataset

- **Training Details** (see the configuration sketch after this list):
  - Learning rate: 3 × 10⁻⁷
  - Training epochs: 1
  - Total batch size: 48
  - Maximum token length: 2048
  - Warmup ratio: 0.03
  - Learning rate schedule: cosine decay

- **Primary Use**: Instruction-following and chat-based interactions for astronomy-related queries

- **Reference**: [Pan et al. 2024](https://arxiv.org/abs/2409.19750)
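
The sketch below illustrates how the SFT data mixture and the hyperparameters listed above could be expressed with the Hugging Face `datasets` and `transformers` libraries. It is an illustrative reconstruction, not the AstroMLab training code: the public dataset identifiers, the local JSONL file for the GPT-4-generated astronomy conversations, the per-device batch split, and the use of `TrainingArguments` are assumptions.

```python
from datasets import load_dataset
from transformers import TrainingArguments

# SFT data mixture (sizes follow the model card; identifiers are assumptions)
astro = load_dataset("json", data_files="astro_gpt4_conversations.jsonl", split="train")  # 10,356 GPT-4 conversations (hypothetical local file)
lima = load_dataset("GAIR/lima", split="train")                                           # full LIMA dataset
orca = load_dataset("Open-Orca/OpenOrca", split="train").shuffle(seed=42).select(range(10_000))
ultra = load_dataset("stingning/ultrachat", split="train").shuffle(seed=42).select(range(10_000))
# In practice, each source would be mapped to a common
# "###Human: ... ###Assistant: ..." text field before being concatenated.

# Hyperparameters mirroring the Training Details list
training_args = TrainingArguments(
    output_dir="astrollama-2-70b-chat-sft",
    learning_rate=3e-7,
    num_train_epochs=1,
    per_device_train_batch_size=6,  # e.g. 6 per device x 8 GPUs = 48 total (assumed split)
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,                      # precision choice is an assumption
)
# The 2048-token maximum length is enforced when the conversations are tokenized.
```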
|
|
|
## Using the Model for Chat
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic")
model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic", device_map="auto")

# Function to generate a response
def generate_response(prompt, max_new_tokens=512):
    # Prompt format used by this chat model: "###Human:" / "###Assistant:" turns
    full_prompt = f"###Human: {prompt}\n\n###Assistant:"
    # Truncate to the model's 2048-token training context
    inputs = tokenizer(full_prompt, return_tensors="pt", truncation=True, max_length=2048)
    inputs = inputs.to(model.device)

    # Generate a response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            # Crude stopping heuristic: halt when the first token of "###Human:"
            # appears, i.e. before the model begins a new user turn
            eos_token_id=tokenizer.encode("###Human:", add_special_tokens=False)[0],
        )

    # Decode the full sequence (prompt + completion)
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract only the Assistant's response
    assistant_response = response.split("###Assistant:")[-1].strip()
    return assistant_response

# Example usage
user_input = "What are the main components of a galaxy?"
response = generate_response(user_input)
print(f"Human: {user_input}")
print(f"Assistant: {response}")
|
``` |
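
A 70B-parameter model in 16-bit precision needs roughly 140 GB of accelerator memory, so the `device_map="auto"` call above typically requires multiple GPUs. As a minimal sketch, assuming the `bitsandbytes` package is installed, the model can instead be loaded in 4-bit NF4 precision to fit on less hardware; quantization is optional, is not part of the released model, and may slightly degrade output quality.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Optional 4-bit quantized loading (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-70b-chat_aic")
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrollama-2-70b-chat_aic",
    quantization_config=bnb_config,
    device_map="auto",
)
# The generate_response() helper defined above works unchanged with this model object.
```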
|
|
|
## Model Performance and Limitations |
|
|
|
While the AstroLLaMA-2-70B-Base_AIC model demonstrates a significant improvement over the baseline LLaMA-2-70B model, the chat version (AstroLLaMA-2-70B-Chat_AIC) suffers a performance regression caused by limitations in the SFT process. Here is a performance comparison:
|
|
|
| Model | Score (%) |
|-------|-----------|
| **AstroSage-LLaMA-3.1-8B (AstroMLab)** | **80.9** |
| **<span style="color:green">AstroLLaMA-2-70B-Base (AstroMLab)</span>** | **<span style="color:green">76.0</span>** |
| LLaMA-3.1-8B | 73.7 |
| Gemma-2-9B | 71.5 |
| LLaMA-2-70B | 70.7 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| **<span style="color:green">AstroLLaMA-2-70B-Chat (AstroMLab)</span>** | **<span style="color:green">64.7</span>** |
| InternLM-2.5-7B | 64.5 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |
|
|
|
Key limitations: |
|
|
|
1. **SFT Dataset Limitations**: The current SFT dataset of roughly 30,000 conversations, many of which are not astronomy-focused, proved inadequate for preserving the base model's domain performance.

2. **Performance Degradation**: The chat model scores 64.7%, well below the base model's 76.0%, an 11.3-point drop attributable to the SFT process.

3. **General vs. Specialized Knowledge**: The current SFT process appears to steer the model toward generic answers, potentially at the cost of specialized astronomical knowledge.
|
|
|
These limitations underscore the challenges in developing specialized chat models and the critical importance of both the quantity and quality of training data, especially for the SFT process. |
|
|
|
This model is released primarily for reproducibility purposes, allowing researchers to track the development process and compare different iterations of AstroLLaMA models. |
|
|
|
For optimal performance and the most up-to-date capabilities in astronomy-related tasks, we recommend using AstroSage-LLaMA-3.1-8B, where these limitations have been addressed through expanded training data and refined fine-tuning processes. |
|
|
|
## Ethical Considerations |
|
|
|
While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications. |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
``` |
|
@ARTICLE{2024arXiv240919750P, |
|
author = {{Pan}, Rui and {Dung Nguyen}, Tuan and {Arora}, Hardik and {Accomazzi}, Alberto and {Ghosal}, Tirthankar and {Ting}, Yuan-Sen}, |
|
title = "{AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy}", |
|
journal = {arXiv e-prints}, |
|
keywords = {Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Computation and Language}, |
|
year = 2024, |
|
month = sep, |
|
eid = {arXiv:2409.19750}, |
|
pages = {arXiv:2409.19750}, |
|
doi = {10.48550/arXiv.2409.19750}, |
|
archivePrefix = {arXiv}, |
|
eprint = {2409.19750}, |
|
primaryClass = {astro-ph.IM}, |
|
adsurl = {https://ui.adsabs.harvard.edu/abs/2024arXiv240919750P}, |
|
adsnote = {Provided by the SAO/NASA Astrophysics Data System} |
|
} |
|
|
|
``` |