Published on 2023-04-30 08:00 by Vitor Sousa
Exploring OpenELM: The Intersection of Open Source and High Efficiency in AI
Introduction
OpenELM is a state-of-the-art open language model launched by Apple.
The researchers introduced OpenELM, a family of Open-source Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. The OpenELM models were pretrained using the CoreNet library. They release both pretrained and instruction-tuned models with 270M, 450M, 1.1B and 3B parameters.
They also release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research.
OpenELM outperforms existing open LLMs that are pre-trained using publicly available datasets. OpenELM with 1.1 billion parameters outperforms OLMo, which has 1.2 billion parameters, by 2.36% while requiring 2× fewer pre-training tokens.
Innovations in Architecture
OpenELM distinguishes itself with a layer-wise scaling strategy within its transformer architecture, enabling more efficient parameter allocation across layers. Traditional transformers allocate parameters uniformly across all layers; the models are isotropic. In contrast, OpenELM adjusts the configuration of each layer to use parameters more effectively. This approach not only enhances the model’s learning efficiency but also significantly boosts its performance with less training data.
Key architectural points (a short code sketch of a few of them follows this list):
- They don’t use learnable bias parameters in any fully connected (linear) layers.
- Apply pre-normalization using RMSNorm.
- Use Rotary Positional Embedding (RoPE): Encodes relative token positions directly in the attention computation, which helps the model produce coherent and contextually aware outputs.
- Use Grouped Query Attention (GQA): An alternative to standard multi-head attention in which groups of query heads share key and value heads, reducing memory and compute with little loss in quality.
- Use a SwiGLU FFN: Replaces the standard feed-forward network with a gated SwiGLU variant, leading to faster convergence and improved accuracy.
- Use Flash Attention to compute the scaled dot-product attention.
- Use the same tokenizer as LLaMA.
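To make a few of these points concrete, here is a minimal PyTorch sketch of a pre-normalized block fragment that uses RMSNorm and a bias-free SwiGLU FFN. This is an illustrative approximation of the ideas above, not the CoreNet implementation; the dimensions and module names are made up for the example.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the RMS of the features, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block (SwiGLU); all linear layers are bias-free."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

# Pre-normalization: normalize the input before the FFN, then add the residual.
dim, hidden = 512, 1408
norm, ffn = RMSNorm(dim), SwiGLUFFN(dim, hidden)
x = torch.randn(2, 16, dim)   # (batch, sequence, features)
y = x + ffn(norm(x))          # residual connection around the pre-normed FFN
```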
Layer-wise scaling
To allocate parameters non-uniformly, they adjust the number of attention heads and the FFN multiplier in each transformer layer.
Assume the standard transformer model with uniform parameter allocation has $N$ transformer layers and that the dimensionality of the input to each layer is $d_{model}$. The MHA has $n_h$ heads and each head has dimension $d_h = d_{model} / n_h$; the hidden dimension of the FFN is $m \cdot d_{model}$, where $m$ is the FFN multiplier.

With layer-wise scaling, the number of attention heads $n_h^i$ and the FFN multiplier $m^i$ of the $i$-th layer are:

$$n_h^i = \frac{\alpha^i \cdot d_{model}}{d_h}, \qquad m^i = \beta^i$$

Where:

$$\alpha^i = \alpha_{min} + \frac{(\alpha_{max} - \alpha_{min}) \cdot i}{N-1} \quad \text{and} \quad \beta^i = \beta_{min} + \frac{(\beta_{max} - \beta_{min}) \cdot i}{N-1}, \qquad 0 \le i < N.$$

The hyperparameters $\alpha_{min}$ and $\alpha_{max}$ allow them to scale the attention heads. Similarly, $\beta_{min}$ and $\beta_{max}$ allow them to vary the width of the FFN layers. Note that setting $\alpha_{min} = \alpha_{max} = 1.0$ and $m^i = m$ produces the standard uniform transformer model.
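To make the formulas concrete, here is a small Python sketch that computes per-layer head counts and FFN widths from the interpolation above. The hyperparameter values and model dimensions are illustrative placeholders, not the configurations used in the paper.

```python
# Sketch of layer-wise scaling: interpolate alpha and beta across layers and
# derive each layer's attention-head count and FFN hidden size.
# All hyperparameter values below are illustrative placeholders.

def layerwise_config(num_layers, d_model, d_head,
                     alpha_min=0.5, alpha_max=1.0,
                     beta_min=0.5, beta_max=4.0):
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)       # 0 at the first layer, 1 at the last
        alpha = alpha_min + (alpha_max - alpha_min) * t
        beta = beta_min + (beta_max - beta_min) * t
        n_heads = max(1, round(alpha * d_model / d_head))  # heads in layer i
        ffn_dim = round(beta * d_model)                     # FFN hidden size in layer i
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

for cfg in layerwise_config(num_layers=4, d_model=1024, d_head=64):
    print(cfg)
```

Setting alpha_min = alpha_max = 1.0 and beta_min = beta_max recovers the uniform configuration, matching the note above.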
Pretraining
The pre-training dataset combines RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totalling approximately 1.8 trillion tokens.
They filter and tokenize data on the fly. This facilitates experimentation with different tokenizers.
To filter out low-length sequences, they apply two filtering methods. The first operates at the character level, checking whether the number of characters in the sequence is below a specified threshold. The second operates at the token level, checking whether the sequence contains fewer tokens than a specified threshold. Sequences shorter than either threshold are skipped. In their experiments, they used 200 characters and 256 tokens as the character- and token-level filtering thresholds.
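A minimal sketch of this kind of on-the-fly length filtering might look like the following; the tokenizer and the exact pipeline are stand-ins for whatever the training code actually uses, only the two thresholds come from the paper.

```python
# Sketch of on-the-fly length filtering: drop sequences that are too short
# either in raw characters or in tokens. The tokenize function is a stand-in.

def filter_and_tokenize(text, tokenize, min_chars=200, min_tokens=256):
    """Return the token list, or None if the sequence is too short."""
    if len(text) < min_chars:      # character-level check (cheap, done first)
        return None
    tokens = tokenize(text)        # token-level check
    return tokens if len(tokens) >= min_tokens else None

def filtered_stream(texts, tokenize):
    # Filter and tokenize lazily, yielding only sequences that pass both checks.
    for text in texts:
        tokens = filter_and_tokenize(text, tokenize)
        if tokens is not None:
            yield tokens

# Example with a trivial whitespace "tokenizer" as a placeholder:
docs = ["too short", "word " * 300]
print(sum(1 for _ in filtered_stream(docs, str.split)))   # prints 1
```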
Performance and Benchmarking
OpenELM’s performance is outstanding, especially when benchmarked against similar-sized models like OLMo. With 1.1 billion parameters, OpenELM achieves a 2.36% higher accuracy while requiring only half the pre-training tokens. This efficiency is illustrated through various benchmarks on tasks such as ARC-c, BoolQ, and HellaSwag, where OpenELM consistently outperforms its peers.
Evaluation
They evaluate performance across different tasks using the LM Evaluation Harness (a minimal usage sketch follows the task lists and result tables below):
- Standard zero-shot tasks (7):
- ARC easy and challenge,
- BoolQ,
- HellaSwag,
- PIQA,
- SciQ, and
- WinoGrande
| Model | ARC-c | ARC-e | BoolQ | HellaSwag | PIQA | SciQ | WinoGrande | Average |
|---|---|---|---|---|---|---|---|---|
| OpenELM-270M | 26.45 | 45.08 | 53.98 | 46.71 | 69.75 | 84.70 | 53.91 | 54.37 |
| OpenELM-270M-Instruct | 30.55 | 46.68 | 48.56 | 52.07 | 70.78 | 84.40 | 52.72 | 55.11 |
| OpenELM-450M | 27.56 | 48.06 | 55.78 | 53.97 | 72.31 | 87.20 | 58.01 | 57.56 |
| OpenELM-450M-Instruct | 30.38 | 50.00 | 60.37 | 59.34 | 72.63 | 88.00 | 58.96 | 59.95 |
| OpenELM-1_1B | 32.34 | 55.43 | 63.58 | 64.81 | 75.57 | 90.60 | 61.72 | 63.44 |
| OpenELM-1_1B-Instruct | 37.97 | 52.23 | 70.00 | 71.20 | 75.03 | 89.30 | 62.75 | 65.50 |
| OpenELM-3B | 35.58 | 59.89 | 67.40 | 72.44 | 78.24 | 92.70 | 65.51 | 67.39 |
| OpenELM-3B-Instruct | 39.42 | 61.74 | 68.17 | 76.36 | 79.00 | 92.50 | 66.85 | 69.15 |
OpenLLM (5)
- ARC challenge,
- HellaSwag,
- MMLU,
- TruthfulQA, and
- WinoGrande.
| Model | ARC-c | HellaSwag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|
| OpenELM-270M | 27.65 | 47.15 | 25.72 | 39.24 | 53.83 | 38.72 |
| OpenELM-270M-Instruct | 32.51 | 51.58 | 26.70 | 38.72 | 53.20 | 40.54 |
| OpenELM-450M | 30.20 | 53.86 | 26.01 | 40.18 | 57.22 | 41.50 |
| OpenELM-450M-Instruct | 33.53 | 59.31 | 25.41 | 40.48 | 58.33 | 43.41 |
| OpenELM-1_1B | 36.69 | 65.71 | 27.05 | 36.98 | 63.22 | 45.93 |
| OpenELM-1_1B-Instruct | 41.55 | 71.83 | 25.65 | 45.95 | 64.72 | 49.94 |
| OpenELM-3B | 42.24 | 73.28 | 26.76 | 34.98 | 67.25 | 48.90 |
| OpenELM-3B-Instruct | 47.70 | 76.87 | 24.80 | 38.76 | 67.96 | 51.22 |
- LLM360 leaderboard tasks (7):
- ARC challenge,
- CrowS-Pairs (English version),
- HellaSwag,
- WinoGrande,
- MMLU,
- PIQA, and
- RACE
| Model | ARC-c | CrowS-Pairs | HellaSwag | MMLU | PIQA | RACE | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|---|---|
| OpenELM-270M | 27.65 | 66.79 | 47.15 | 25.72 | 69.75 | 30.91 | 39.24 | 53.83 | 45.13 |
| OpenELM-270M-Instruct | 32.51 | 66.01 | 51.58 | 26.70 | 70.78 | 33.78 | 38.72 | 53.20 | 46.66 |
| OpenELM-450M | 30.20 | 68.63 | 53.86 | 26.01 | 72.31 | 33.11 | 40.18 | 57.22 | 47.69 |
| OpenELM-450M-Instruct | 33.53 | 67.44 | 59.31 | 25.41 | 72.63 | 36.84 | 40.48 | 58.33 | 49.25 |
| OpenELM-1_1B | 36.69 | 71.74 | 65.71 | 27.05 | 75.57 | 36.46 | 36.98 | 63.22 | 51.68 |
| OpenELM-1_1B-Instruct | 41.55 | 71.02 | 71.83 | 25.65 | 75.03 | 39.43 | 45.95 | 64.72 | 54.40 |
| OpenELM-3B | 42.24 | 73.29 | 73.28 | 26.76 | 78.24 | 38.76 | 34.98 | 67.25 | 54.35 |
| OpenELM-3B-Instruct | 47.70 | 72.33 | 76.87 | 24.80 | 79.00 | 38.47 | 38.76 | 67.96 | 55.73 |
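For readers who want to reproduce this kind of evaluation, the sketch below shows how a few of the zero-shot tasks could be run with the lm-evaluation-harness Python API. It assumes the checkpoints are available on the Hugging Face Hub (e.g. apple/OpenELM-270M) and that a compatible tokenizer is configured; the harness version, task names, and arguments here are assumptions, not the exact setup used in the paper.

```python
# Sketch: zero-shot evaluation with EleutherAI's lm-evaluation-harness (v0.4-style API).
# The checkpoint name and arguments are assumptions; the OpenELM checkpoints may
# additionally require a separate LLaMA-compatible tokenizer to be specified.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=apple/OpenELM-270M,trust_remote_code=True",
    tasks=["arc_easy", "boolq", "hellaswag", "winogrande"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```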
Instruction tuning results
They also examine the impact of instruction tuning, which consistently improves OpenELM’s average accuracy by 1-2% across the different evaluation frameworks.
Parameter-efficient fine-tuning (PEFT) results
They used the CommonSense reasoning training and evaluation setup, which provides 170k training samples across 8 multiple-choice datasets, for PEFT studies with different methods, including LoRA and DoRA.
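As a rough idea of what a LoRA-style PEFT setup looks like in practice, here is a sketch using the Hugging Face peft library; the checkpoint name, target modules, and hyperparameters are assumptions for illustration, not the configuration used in the paper.

```python
# Sketch of LoRA fine-tuning with Hugging Face peft. Model name, target modules,
# and hyperparameters are illustrative assumptions, not the paper's setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True
)

lora_config = LoraConfig(
    r=8,                     # low-rank adapter dimension
    lora_alpha=16,           # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj"],  # assumed projection-layer names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```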
Decrease in throughput
In one of the last sections, they show an analysis revealing that OpenELM is a bit slower than OLMo. They attribute a significant portion of OpenELM’s processing time to their naive implementation of RMSNorm: it results in many individual kernel launches, each processing a small input, rather than a single fused kernel launch, as would be the case with, e.g., LayerNorm. By replacing the naive RMSNorm with Apex’s RMSNorm, they observed a notable increase in OpenELM’s throughput.
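To see why the naive version is costly, the sketch below spells out the elementwise operations a straightforward RMSNorm performs; in eager PyTorch each of them can become its own small kernel launch, which is exactly the overhead the authors describe. The Apex import shown in the comment is included as an assumption about how one would swap in the fused implementation.

```python
# Naive RMSNorm, written out as individual tensor ops. In eager mode each op can
# trigger its own small CUDA kernel launch instead of one fused kernel.
import torch

def naive_rms_norm(x, weight, eps=1e-6):
    squares = x.pow(2)                        # elementwise square
    mean_sq = squares.mean(-1, keepdim=True)  # reduction over the feature dim
    inv_rms = torch.rsqrt(mean_sq + eps)      # add + reciprocal square root
    return x * inv_rms * weight               # two elementwise multiplies

# A fused alternative (assumed import path from NVIDIA Apex):
#   from apex.normalization import FusedRMSNorm
#   norm = FusedRMSNorm(hidden_size)
# which performs the whole normalization in a single fused kernel.
```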
Conclusion
The major point is the use of a layer-wise scaling method for efficient parameter allocation within the transformer model, resulting in improved accuracy compared to existing open models.
OpenELM is more than just another entry in the growing list of language models. Its innovative architecture, superior performance, and unprecedented commitment to openness set a new standard in the field. It’s really interesting to see a company like Apple release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs; this could really help further developments in research.
In future work, one of the points identified is to improve the inference efficiency of OpenELM.
Explore More
For those interested in diving deeper into OpenELM or even experimenting with the model, check out the resources provided by the researchers.
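If you want to try a checkpoint yourself, a minimal generation sketch with Hugging Face transformers might look like the following. The model and tokenizer names are assumptions for illustration (the paper only states that the LLaMA tokenizer is used), and the hosted checkpoints may require trust_remote_code and gated tokenizer access.

```python
# Minimal generation sketch. Model and tokenizer identifiers are assumptions;
# check the official release for the exact loading instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Open language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```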
Written by Vitor Sousa