Create README.md
README.md
ADDED
@@ -0,0 +1,78 @@
1 |
+
---
|
2 |
+
datasets:
|
3 |
+
- homebrewltd/instruction-speech-whispervq-v2
|
4 |
+
language:
|
5 |
+
- en
|
6 |
+
license: apache-2.0
|
7 |
+
tags:
|
8 |
+
- sound language model
|
9 |
+
---
|
10 |
+
|
11 |
+
## Model Details
|
12 |
+
|
13 |
+
We have developed and released the [llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) model family. Models in this family natively understand audio and text input.
|
14 |
+
|
15 |
+
We perform continual pretraining on the expanded-vocabulary model [homebrewltd/llama3.1-s-whispervq-init](https://huggingface.co/homebrewltd/llama3.1-s-whispervq-init) using 900M tokens from the [homebrewltd/raw-speech-whispervq-v1](https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v1) dataset.
|
16 |
+
|
17 |
+
**Model developers** Homebrew Research.
|
18 |
+
|
19 |
+
**Input** Text and sound.
|
20 |
+
|
21 |
+
**Output** Text.
|
22 |
+
|
23 |
+
**Model Architecture** Llama-3.
|
24 |
+
|
25 |
+
**Language(s)** English.
|
26 |
+
|
27 |
+
## Intended Use
|
28 |
+
|
29 |
+
**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
|
30 |
+
|
31 |
+
**Out-of-scope** The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
|
32 |
+
|
33 |
+
## Training Process
|
34 |
+
**Training Metrics Image**: Below is a snapshot of the training loss curve.
|
35 |
+
|
36 |
+
![train_log](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/iAbaP7SCoyZ8tz2hyK8k0.png)
|
37 |
+
|
38 |
+
### Hardware
|
39 |
+
|
40 |
+
**GPU Configuration**: Cluster of 10x NVIDIA A6000-48GB.
|
41 |
+
|
42 |
+
**GPU Usage**:
|
43 |
+
- **Continual Training**: 30 hours.
|
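As a rough sanity check on how this hardware relates to the values in the Training Arguments table, the sketch below computes the per-GPU share of the global batch and an upper bound on tokens per optimizer step. It assumes the global batch of 480 sequences is split evenly across the 10 GPUs and that every sequence fills the maximum length of 512; neither assumption is stated in this card, and the split between micro-batch size and gradient accumulation is not given.

```python
# Values taken from this model card.
NUM_GPUS = 10            # Cluster of 10x NVIDIA A6000-48GB
GLOBAL_BATCH_SIZE = 480  # sequences per optimizer step
MAX_SEQ_LEN = 512        # tokens per sequence (upper bound)

# Assumption: the global batch is divided evenly across GPUs.
sequences_per_gpu = GLOBAL_BATCH_SIZE // NUM_GPUS

# Assumption: every sequence is packed or padded to MAX_SEQ_LEN,
# so this is an upper bound rather than a measured figure.
max_tokens_per_step = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN

print(f"Sequences per GPU per optimizer step: {sequences_per_gpu}")
print(f"Upper bound on tokens per optimizer step: {max_tokens_per_step}")
```

How the 48 sequences per GPU are divided between micro-batches and gradient accumulation steps would depend on memory limits and is left open here.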
44 |
+
|
45 |
+
### Training Arguments
|
46 |
+
|
47 |
+
We use the [torchtune](https://github.com/pytorch/torchtune) library for its up-to-date FSDP2 training implementation.
|
48 |
+
|
49 |
+
| Parameter | Continual Training |
|
50 |
+
|----------------------------|-------------------------|
|
51 |
+
| **Epoch** | 1 |
|
52 |
+
| **Global batch size** | 480 |
|
53 |
+
| **Learning Rate** | 2e-4 |
|
54 |
+
| **Learning Rate Scheduler** | Cosine with warmup |
|
55 |
+
| **Optimizer** | AdamW fused |
|
56 |
+
| **Warmup Steps** | 50 |
|
57 |
+
| **Weight Decay** | 0.01 |
|
58 |
+
| **Max Sequence Length** | 512 |
|
59 |
+
| **Max Training Steps** | 2000 |
|
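To make the "Cosine with warmup" row concrete, here is a minimal pure-Python sketch of such a schedule using the learning rate, warmup steps, and max training steps from the table. It is not the torchtune scheduler itself: the linear warmup from zero and the decay to zero at the final step are assumptions, and the function name `lr_at` is hypothetical.

```python
import math

# Values taken from the Training Arguments table.
PEAK_LR = 2e-4
WARMUP_STEPS = 50
MAX_STEPS = 2000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumptions: linear warmup from 0 to PEAK_LR over WARMUP_STEPS,
    then a half-cosine decay from PEAK_LR to 0 at MAX_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few representative steps.
    for step in (0, 25, 50, 1025, 2000):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 50 and falls to half the peak midway through the decay phase, at step 1025.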
60 |
+
|
61 |
+
## Citation Information
|
62 |
+
|
63 |
+
**BibTeX:**
|
64 |
+
|
65 |
+
```
|
66 |
+
@article{Llama3-S-Sound-Instruction-Language-Model-2024,
|
67 |
+
title={Llama3-S},
|
68 |
+
author={Homebrew Research},
|
69 |
+
year=2024,
|
70 |
+
month={August},
|
71 |
+
url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}}
|
72 |
+
```
|
73 |
+
|
74 |
+
## Acknowledgements
|
75 |
+
|
76 |
+
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
|
77 |
+
|
78 |
+
- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**
|