OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Paper • 2404.14619
Note: Table 8. Normalization layers are a bottleneck. The throughput of both OLMo-1.18B and OpenELM-1.08B drops significantly when RMSNorm is implemented naively in PyTorch, compared with the highly optimized LayerNorm [2]. Apex’s [33] RMSNorm implementation improves throughput notably over the naive implementation, but a considerable performance gap relative to LayerNorm remains.
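To make the term "naive implementation" concrete, here is a minimal sketch of what an RMSNorm written with plain PyTorch ops might look like, next to the built-in `torch.nn.LayerNorm`. The class name `NaiveRMSNorm`, the `eps` value, and the tensor sizes are illustrative assumptions, not taken from the OpenELM or OLMo code, and the sketch does not reproduce the throughput numbers in Table 8.

```python
import torch
import torch.nn as nn


class NaiveRMSNorm(nn.Module):
    """Hypothetical RMSNorm built from elementwise PyTorch ops (no fused kernel)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # assumed value, for numerical stability only
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pow, mean, add, rsqrt and the two multiplies are separate ops in
        # eager mode, which is one plausible reason an unfused version is slow.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


torch.manual_seed(0)
x = torch.randn(2, 4, 8)          # (batch, sequence, features), sizes are arbitrary
rms_out = NaiveRMSNorm(8)(x)      # rescales by the root mean square, no mean subtraction
ln_out = nn.LayerNorm(8)(x)       # built-in LayerNorm: subtracts the mean, then rescales
```

RMSNorm does less arithmetic than LayerNorm because it skips the mean subtraction, so the gap reported in the table is presumably a matter of kernel optimization rather than of the operation itself.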