Published on 2023-04-30 08:00 by Vitor Sousa

Exploring OpenELM The Intersection of Open Source and High Efficiency in AI

arq

Introduction

OpenELM is a state-of-the-art open language model launched by Apple.
The researchers introduced OpenELM, a family of Open-source Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. The OpenELM models were pretrained using the CoreNet library. They release both pretrained and instruction-tuned models with 270M, 450M, 1.1B and 3B parameters.

They also release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research.

OpenELM outperforms existing open LLMs that are pre-trained using publicly available datasets. OpenELM with 1.1 billion parameters outperforms OLMo, which has 1.2 billion parameters, by 2.36% while requiring 2× fewer pre-training tokens.

Innovations in Architecture

OpenELM distinguishes itself with a layer-wise scaling strategy within its transformer architecture, enabling more efficient parameter allocation across layers. Traditional transformers allocate parameters uniformly across all layers, the models are isotropic. In contrast, OpenELM adjusts the configuration of each layer to optimize parameter usage effectively. This approach not only enhances the model’s learning efficiency but also significantly boosts its performance with less training data.

Key architectural points:

They don’t use learnable bias parameters in any fully connected(linear) layers.
Apply pre-normalization using RMSNorm
Use Rotatory Positional Embedding (ROPE): Enhances the model’s understanding of word order, critical for generating coherent and contextually aware language outputs.
Use Grouped Query Attention (GQA): An alternative to the typical multi-head attention that allows more nuanced attention mechanisms, improving the model’s interpretative capabilities.
Use SwiGLU Activation: A novel activation function that provides a more dynamic range for model training, leading to faster convergence and improved accuracy (replacing the FFN).
Use Flash Attention to compute the dot-product attention
Use the same tokenizer as LLama.
- They filter and tokenize data on the fly. This facilitates experimentation with different tokenizers.

Layer-wise scaling

For the non-uniform allocation of parameters in the transformer layer, they adjust the number of attention heads and the FFN multiplier in each transformer layer.

Assuming that the standard transformer model with uniform parameter allocation has $N$ transformer layers and the dimensionality of the input of each layer is $d_{model}$ . The MHA has $n_h$ heads and a dimension o

n_{i_h} = \frac{\alpha_i \cdot d_{\text{model}}}{d_h}, \quad m_i = \beta_i

Where:

\alpha_i = \alpha_{\min} + \frac{(\alpha_{\max} - \alpha_{\min}) \cdot i}{N-1},

and

\beta_i = \beta_{\min} + \frac{(\beta_{\max} - \beta_{\min}) \cdot i}{N-1}, \quad 0 \leq i < N.

The $a_{min}, a_{max}$ are the hyperparameters that allow them to scal teh attention heads. Similarly $b_{min},b_{max}$ allow them to vary the with of the FFN layers. Note, setting $a_{min}= a_{max}=1$ and $m_i = m$ produces the standard uniform transformer model.

Pretraining

The used pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totalling approximately 1.8 trillion tokens.

They filter and tokenize data on the fly. This facilitates experimentation with different tokenizers.

To filter out low-length sequences, they apply two filtering methods. The first method operates at the characterlevel, checking if the number of characters in the sequence is below a specified threshold. The second method operates at the token-level, where it examines whether the sequence contains fewer tokens than a specified threshold. Sequences that are shorter than either of these thresholds are skipped. In experiments, were used 200 characters and 256 tokens as character and token-level filtering thresholds.

Performance and Benchmarking

OpenELM’s performance is outstanding, especially when benchmarked against similar-sized models like OLMo. With 1.1 billion parameters, OpenELM achieves a 2.36% higher accuracy while requiring only half the pre-training tokens. This efficiency is illustrated through various benchmarks on tasks such as ARC-c, BoolQ, and HellaSwag, where OpenELM consistently outperforms its peers.

Evaluation

They evaluate the performance across different tasks using LM Evaluation Harness:

Standard zero-shot tasks(7):

ARC easy and challenge,
BoolQ,
HellaSwag,
PIQA,
SciQ, and
WinoGrande

Model Size	ARC-c	ARC-e	BoolQ	HellaSwag	PIQA	SciQ	WinoGrande	Average
OpenELM-270M	26.45	45.08	53.98	46.71	69.75	84.70	53.91	54.37
OpenELM-270M-Instruct	30.55	46.68	48.56	52.07	70.78	84.40	52.72	55.11
OpenELM-450M	27.56	48.06	55.78	53.97	72.31	87.20	58.01	57.56
OpenELM-450M-Instruct	30.38	50.00	60.37	59.34	72.63	88.00	58.96	59.95
OpenELM-1_1B	32.34	55.43	63.58	64.81	75.57	90.60	61.72	63.44
OpenELM-1_1B-Instruct	37.97	52.23	70.00	71.20	75.03	89.30	62.75	65.50
OpenELM-3B	35.58	59.89	67.40	72.44	78.24	92.70	65.51	67.39
OpenELM-3B-Instruct	39.42	61.74	68.17	76.36	79.00	92.50	66.85	69.15

OpenLLM (5)

ARC challenge,
HellaSwag,
MMLU,
TruthfulQA, and
WinoGrande.

Model Size	ARC-c	HellaSwag	MMLU	TruthfulQA	WinoGrande	Average
OpenELM-270M	27.65	47.15	25.72	39.24	53.83	38.72
OpenELM-270M-Instruct	32.51	51.58	26.70	38.72	53.20	40.54
OpenELM-450M	30.20	53.86	26.01	40.18	57.22	41.50
OpenELM-450M-Instruct	33.53	59.31	25.41	40.48	58.33	43.41
OpenELM-1_1B	36.69	65.71	27.05	36.98	63.22	45.93
OpenELM-1_1B-Instruct	41.55	71.83	25.65	45.95	64.72	49.94
OpenELM-3B	42.24	73.28	26.76	34.98	67.25	48.90
OpenELM-3B-Instruct	47.70	76.87	24.80	38.76	67.96	51.22

LLM360 leaderboard tasks (7):

ARC challenge,
CrowS-Pairs (English version),
HellaSwag,
WinoGrande,
MMLU,
PIQA, and
RACE

Model Size	ARC-c	CrowS-Pairs	HellaSwag	MMLU	PIQA	RACE	TruthfulQA	WinoGrande	Average
OpenELM-270M	27.65	66.79	47.15	25.72	69.75	30.91	39.24	53.83	45.13
OpenELM-270M-Instruct	32.51	66.01	51.58	26.70	70.78	33.78	38.72	53.20	46.66
OpenELM-450M	30.20	68.63	53.86	26.01	72.31	33.11	40.18	57.22	47.69
OpenELM-450M-Instruct	33.53	67.44	59.31	25.41	72.63	36.84	40.48	58.33	49.25
OpenELM-1_1B	36.69	71.74	65.71	27.05	75.57	36.46	36.98	63.22	51.68
OpenELM-1_1B-Instruct	41.55	71.02	71.83	25.65	75.03	39.43	45.95	64.72	54.40
OpenELM-3B	42.24	73.29	73.28	26.76	78.24	38.76	34.98	67.25	54.35
OpenELM-3B-Instruct	47.70	72.33	76.87	24.80	79.00	38.47	38.76	67.96	55.73

OpenELM outperforms existing open LLMs that are pre-trained using publicly available datasets. OpenELM with 1.1 billion parameters outperforms OLMo, which has 1.2 billion parameters, by 2.36% while requiring 2× fewer pre-training tokens.

Instruction tuning results

They also check the impact of inctruct tuning, that consistently improves OpenELM’s average accuracy by 1-2% across different evaluation frameworks.

Parameter-efficient fine-tuning (PEFT) results

They used CommonSense reasoning training and evaluation setup that provides 170k training samples across 8 multiple-choice datasets for PEFT studies with different methods, including LoRA and DoRA

Decrease on throughput

In one of the lastest section they show an analysis that reveals tha OpenELM is a bit slower thatn OLMo. They attribute significant portion of OpenELM’s processing time to the naive implementation of RMSNorm. Naive RMSNorm implementation results in many individual kernel launches each of which processes a small input, rather than a launch of a single, fused kernel, as would be the case with e.g. LayerNorm. By replacing the naive RMSNorm with Apex’s RMSNorm we observe a notable increase in OpenELM’s throughput.

Conclusion

The major point was the usage of a layer-wise scaling method for efficient parameter allocation within the transformer model, resulting in improved accuracy compared to existing models.

OpenELM is more than just another entry in the growing list of language models. Its innovative architecture, superior performance, and unprecedented commitment to openness set a new standard in the field. It’s really interesting see a company like Apple to release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, this could really help in further developments in research.

In future work, one of the point identified to improve the inference efficiency of OpenELM.