Vitor Sousa


Published on 2023-04-30 08:00 by Vitor Sousa

Exploring OpenELM: The Intersection of Open Source and High Efficiency in AI


Introduction

OpenELM is a state-of-the-art open language model released by Apple.
The researchers introduce OpenELM as a family of Open-source Efficient Language Models. It uses a layer-wise scaling strategy to allocate parameters efficiently within each layer of the transformer model, which leads to enhanced accuracy. The models were pretrained using the CoreNet library, and both pretrained and instruction-tuned models are released with 270M, 450M, 1.1B and 3B parameters.

They also release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research.

OpenELM outperforms existing open LLMs that are pre-trained using publicly available datasets. OpenELM with 1.1 billion parameters outperforms OLMo, which has 1.2 billion parameters, by 2.36% while requiring 2× fewer pre-training tokens.

Innovations in Architecture

OpenELM distinguishes itself with a layer-wise scaling strategy within its transformer architecture, enabling more efficient parameter allocation across layers. Traditional transformers are isotropic: every layer has the same configuration, so parameters are allocated uniformly across layers. In contrast, OpenELM adjusts the configuration of each layer to use its parameter budget more effectively. According to the authors, this improves accuracy while requiring less training data.

Key architectural points:

  • They don’t use learnable bias parameters in any fully connected (linear) layers.
  • Apply pre-normalization using RMSNorm.
  • Use Rotary Positional Embedding (RoPE): encodes positional information so the model can account for word order, which is critical for generating coherent and contextually aware outputs.
  • Use Grouped Query Attention (GQA): an alternative to the typical multi-head attention in which groups of query heads share the same key and value heads, reducing the number of key/value projections.
  • Use a SwiGLU FFN: a gated feed-forward network with a SiLU (Swish) gate, used in place of the standard FFN.
  • Use Flash Attention to compute the scaled dot-product attention.
  • Use the same tokenizer as LLaMA.
    • They filter and tokenize data on the fly. This facilitates experimentation with different tokenizers.
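
To make two of the points above more concrete, here is a minimal pure-Python sketch of how grouped query attention maps query heads onto shared key/value heads, and of a bias-free SwiGLU feed-forward layer. It is not OpenELM's code: the function names, the head counts, and the toy weights are all assumptions chosen for illustration.

```python
import math


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head shared by a given query head (GQA)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per key/value head
    return q_head // group_size


def silu(x: float) -> float:
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector. No bias term is added."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU FFN: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Illustrative (assumed) sizes: 8 query heads sharing 2 key/value heads.
mapping = [kv_head_for_query_head(h, n_q_heads=8, n_kv_heads=2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With 8 query heads and 2 key/value heads, each key/value head serves a group of 4 query heads, so only 2 key/value projections are needed instead of 8.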

Layer-wise scaling

For the non-uniform allocation of parameters in the transformer layer, they adjust the number of attention heads and the FFN multiplier in each transformer layer.

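Assume that the standard transformer model with uniform parameter allocation has $N$ transformer layers and that the input to each layer has dimensionality $d_{model}$. The MHA has $n_h$ heads, each of dimension $d_h = d_{model} / n_h$, and the FFN hidden dimension is $d_{FFN} = m \cdot d_{model}$, where $m$ is a scalar FFN multiplier. With layer-wise scaling, the number of heads $n_{i_h}$ and the FFN multiplier $m_i$ in the $i$-th layer are: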

$$n_{i_h} = \frac{\alpha_i \cdot d_{\text{model}}}{d_h}, \quad m_i = \beta_i$$

Where:

$$\alpha_i = \alpha_{\min} + \frac{(\alpha_{\max} - \alpha_{\min}) \cdot i}{N-1},$$

and

$$\beta_i = \beta_{\min} + \frac{(\beta_{\max} - \beta_{\min}) \cdot i}{N-1}, \quad 0 \leq i < N.$$

Here, $\alpha_{\min}$ and $\alpha_{\max}$ are the hyperparameters that allow them to scale the attention heads. Similarly, $\beta_{\min}$ and $\beta_{\max}$ allow them to vary the width of the FFN layers. Note that setting $\alpha_{\min} = \alpha_{\max} = 1$ and $m_i = m$ produces the standard uniform transformer model.
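
The interpolation is easy to check numerically. The sketch below computes $\alpha_i$, $\beta_i$, the head count and the FFN width for every layer. The function name, the example hyperparameter values, and the rounding of heads and FFN width to integers are my own assumptions; the formulas themselves produce real numbers and the actual implementation may round differently.

```python
def layerwise_scaling(num_layers, d_model, d_head,
                      alpha_min, alpha_max, beta_min, beta_max):
    """Per-layer head counts and FFN multipliers from linear interpolation.

    Implements alpha_i and beta_i for 0 <= i < N, then
    n_heads_i = alpha_i * d_model / d_head and m_i = beta_i.
    """
    if num_layers < 2:
        raise ValueError("need at least 2 layers (the formula divides by N - 1)")
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        alpha_i = alpha_min + (alpha_max - alpha_min) * t
        beta_i = beta_min + (beta_max - beta_min) * t
        configs.append({
            "layer": i,
            # Assumption: round to the nearest integer, at least one head.
            "n_heads": max(1, round(alpha_i * d_model / d_head)),
            "ffn_multiplier": beta_i,
            # Assumption: FFN hidden width is m_i * d_model, rounded.
            "ffn_dim": round(beta_i * d_model),
        })
    return configs


# Illustrative values only, not the settings of any released model.
for cfg in layerwise_scaling(num_layers=4, d_model=1024, d_head=64,
                             alpha_min=0.5, alpha_max=1.0,
                             beta_min=0.5, beta_max=4.0):
    print(cfg)
```

With these example values the head counts grow from 8 in the first layer to 16 in the last, and the FFN multiplier grows from 0.5 to 4.0, so later layers receive more parameters than earlier ones.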

Pretraining

The pre-training dataset combines RefinedWeb, the deduplicated Pile, a subset of RedPajama, and a subset of Dolma v1.6, totalling approximately 1.8 trillion tokens.

They filter and tokenize data on the fly. This facilitates experimentation with different tokenizers.

To filter out short sequences, they apply two filtering methods. The first operates at the character level and checks whether the number of characters in the sequence is below a specified threshold. The second operates at the token level and checks whether the sequence contains fewer tokens than a specified threshold. Sequences that are shorter than either of these thresholds are skipped. In their experiments, they used 200 characters and 256 tokens as the character-level and token-level filtering thresholds.
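
The two-stage filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are mine, and the whitespace tokenizer is a stand-in for the real tokenizer, so token counts here will not match those of the actual pipeline.

```python
from typing import Callable, Iterable, Iterator, List

MIN_CHARS = 200   # character-level threshold quoted in the text
MIN_TOKENS = 256  # token-level threshold quoted in the text


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); a real pipeline would use the model's."""
    return text.split()


def keep_sequence(text: str,
                  tokenize: Callable[[str], List[str]] = whitespace_tokenize,
                  min_chars: int = MIN_CHARS,
                  min_tokens: int = MIN_TOKENS) -> bool:
    """Return False if the sequence is shorter than either threshold."""
    # Character check first: it is cheap and avoids tokenizing very short text.
    if len(text) < min_chars:
        return False
    return len(tokenize(text)) >= min_tokens


def filter_stream(texts: Iterable[str],
                  tokenize: Callable[[str], List[str]] = whitespace_tokenize
                  ) -> Iterator[str]:
    """Lazily yield only the sequences that pass both checks (on the fly)."""
    for text in texts:
        if keep_sequence(text, tokenize):
            yield text


samples = ["too short", "word " * 300, "x" * 1000]
kept = list(filter_stream(samples))
print(len(kept))  # 1
```

Because the tokenizer is passed in as a function, swapping it for a different one does not require re-processing the dataset ahead of time, which is the benefit the authors point to for on-the-fly tokenization.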

Performance and Benchmarking

OpenELM performs strongly when benchmarked against similar-sized models like OLMo. With 1.1 billion parameters, OpenELM achieves 2.36% higher accuracy while requiring only half the pre-training tokens. This efficiency is illustrated through various benchmarks on tasks such as ARC-c, BoolQ, and HellaSwag, where OpenELM consistently outperforms its peers.

Evaluation

They evaluate performance across different tasks using the LM Evaluation Harness.

Instruction tuning results

They also check the impact of instruction tuning, which consistently improves OpenELM’s average accuracy by 1-2% across different evaluation frameworks.

Parameter-efficient fine-tuning (PEFT) results

For the PEFT studies, they used the CommonSense reasoning training and evaluation setup, which provides 170k training samples across 8 multiple-choice datasets, and applied different methods, including LoRA and DoRA.
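
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it applies to a frozen weight matrix: $W' = W + \frac{\alpha}{r} B A$. This is the general LoRA formulation, not the specific configuration used in the OpenELM experiments; the function names, matrix sizes and values are assumptions for illustration, and DoRA is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r.

    W stays frozen during fine-tuning; only A and B would be trained.
    """
    r = len(a)  # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight (toy example)
a = [[1.0, 2.0]]              # rank-1 factor A: 1 x 2
b = [[1.0], [0.0]]            # rank-1 factor B: 2 x 1
print(lora_effective_weight(w, a, b, alpha=2.0))  # [[3.0, 4.0], [0.0, 1.0]]
```

Here the trainable factors hold 4 values in total, the same as the toy 2 x 2 matrix, but for realistic layer sizes a small rank $r$ makes the two factors far smaller than the full weight, which is what makes the method parameter-efficient.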

Decrease in throughput

In one of the last sections, they present an analysis showing that OpenELM is somewhat slower than OLMo. They attribute a significant portion of OpenELM’s processing time to the naive implementation of RMSNorm, which results in many individual kernel launches, each processing a small input, rather than a single fused kernel launch, as would be the case with, e.g., LayerNorm. By replacing the naive RMSNorm with Apex’s RMSNorm, they observe a notable increase in OpenELM’s throughput.
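
To see why a naive RMSNorm can be slow, it helps to write it as the sequence of small operations it consists of. The pure-Python sketch below is conceptual only: it is neither OpenELM's nor Apex's implementation, and the function names are assumptions. In a GPU framework, each separate step of the naive version would typically be dispatched as its own kernel, whereas a fused kernel computes the same result in one launch.

```python
import math


def rmsnorm_naive(x, weight, eps=1e-6):
    """RMSNorm written as separate elementwise and reduction steps."""
    squared = [v * v for v in x]                     # step 1: square
    mean_sq = sum(squared) / len(x)                  # step 2: mean reduction
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)         # step 3: add eps, rsqrt
    normed = [v * inv_rms for v in x]                # step 4: normalize
    return [n * w for n, w in zip(normed, weight)]   # step 5: apply gain


def rmsnorm_single_pass(x, weight, eps=1e-6):
    """Same computation without intermediate lists, standing in for a fused kernel."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
gain = [1.0, 1.0]
print(rmsnorm_naive(x, gain))
print(rmsnorm_single_pass(x, gain))
```

Both functions return the same values; the difference that matters on accelerators is how many separate launches and intermediate buffers the computation needs.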

Conclusion

The major point is the use of a layer-wise scaling method for efficient parameter allocation within the transformer model, resulting in improved accuracy compared to existing models.

OpenELM is more than just another entry in the growing list of language models. Its architecture, its performance, and its commitment to openness set a high standard in the field. It’s really interesting to see a company like Apple release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs. This could really help further developments in research.

For future work, one of the points the authors identify is improving the inference efficiency of OpenELM.

Explore More

For those interested in diving deeper into OpenELM or even experimenting with the model, check out the resources provided by the researchers: github and date.

Written by Vitor Sousa
