Update README.md
README.md CHANGED
@@ -48,7 +48,8 @@ Pragna-1B is a decoder-only transformer model inspired by TinyLlama, featuring t
- Hidden Dimension: 2048
- Expansion Dimension: 5632
- Vocabulary Size: 69632

+This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, utilising a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which enhances training and inference speed while reducing memory bandwidth and also supports inference on lower-compute devices.

Pragna-1B is trained on our proprietary platform, GenAI Studio, a modular AI Developer Platform designed to support any GenAI model architecture. It is capable of scaling across thousands of GPUs or accelerators and is built to be fault-tolerant. The development of this model leveraged Triton, an open-source language from OpenAI, for crafting high-performance custom fused CUDA kernels for various operations. Furthermore, the model uses Fully Sharded Data Parallel (FSDP) for distributed and parallel training and incorporates the state-of-the-art FlashAttention2 to accelerate training and inference.
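
For quick reference, here is a minimal sketch of how the figures above map onto a Llama-style configuration. The field names assume Hugging Face `transformers`' `LlamaConfig` (a reasonable fit for a TinyLlama-inspired decoder); the layer and head counts are illustrative placeholders that are not stated in this excerpt.

```python
# Minimal sketch only: maps the README's listed hyperparameters onto a Llama-style
# config. Field names assume Hugging Face transformers' LlamaConfig; values marked
# "placeholder" are NOT stated in the README excerpt and merely follow TinyLlama.
from transformers import LlamaConfig

pragna_like_config = LlamaConfig(
    hidden_size=2048,            # Hidden Dimension: 2048
    intermediate_size=5632,      # Expansion Dimension: 5632 (MLP width)
    vocab_size=69632,            # Vocabulary Size: 69632
    rope_theta=10000.0,          # Rotary Positional Encoding base of 10,000
    rms_norm_eps=1e-5,           # RMSNorm epsilon of 1e-5
    hidden_act="silu",           # SiLU activation
    num_attention_heads=32,      # placeholder (not given in this excerpt)
    num_key_value_heads=4,       # fewer KV heads than query heads -> Grouped Query Attention
    num_hidden_layers=22,        # placeholder (not given in this excerpt)
)
print(pragna_like_config)
```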
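
The training paragraph names two public components, FSDP and FlashAttention2. The sketch below shows one common way to combine them in PyTorch with the Hugging Face `transformers` loader; the checkpoint id is an assumption, and this is not the GenAI Studio pipeline or its Triton kernels.

```python
# Illustrative sketch, not the GenAI Studio training code: combines FSDP sharding
# with FlashAttention2 via Hugging Face transformers. The checkpoint id is assumed.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")            # one process per GPU, e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "soketlabs/pragna-1b",                    # assumed model id; replace with the actual repo
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Shard parameters, gradients, and optimizer state across all ranks.
model = FSDP(model, device_id=torch.cuda.current_device())
```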