German artificial intelligence startup Aleph Alpha has introduced a new foundation model family consisting of Pharia-1-LLM-7B-control and its variant Pharia-1-LLM-7B-control-aligned, both large language models (LLMs) with 7 billion (7B) parameters. These models are designed to deliver concise, length-controlled responses and are optimized for German, French, and Spanish.
They are particularly suited for domain-specific applications in the automotive and engineering industries. The Pharia-1-LLM-7B-control-aligned version includes additional safety measures, and both models are available under the Open Aleph License for non-commercial research and educational use.
Model Architecture and Hyperparameters
Several ablations conducted on a 1B-parameter model guided the architecture and hyperparameter choices, with evaluations on benchmarks such as Lambada, TriviaQA, HellaSwag, and others. The initial hyperparameter search was run on a small proxy model and scaled up to 1B parameters using maximal update parametrization (MuP). Although MuP was initially intended for the 7B scale as well, training instabilities were encountered there, so MuP was abandoned in favor of heuristics similar to those used for Llama 2.
In a comparison between the classical generative pre-trained transformer (GPT) architecture and the Llama 2 architecture, both performed similarly, though the GPT architecture showed an edge on TriviaQA, which led to its selection for the Pharia-1-LLM-7B models. Grouped-query attention (GQA) was introduced to improve inference-time performance, with a key-value to query (kv-q) head ratio of 1/9 providing significant memory and throughput benefits without degradation. A larger base for rotary embeddings and a Unigram tokenizer with a vocabulary size of 128,000 were selected based on better downstream performance.
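To make the memory benefit of a 1/9 kv-q ratio concrete, the following minimal sketch compares the key-value cache size of standard multi-head attention with that of GQA. The layer count, head count, head dimension, and batch size are hypothetical values chosen for illustration; they are not taken from the article and should not be read as the actual Pharia-1 configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Size of the key/value cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical shapes for illustration only (assumptions, not article values).
N_LAYERS = 32
N_QUERY_HEADS = 36
HEAD_DIM = 128
SEQ_LEN = 8192       # matches the pre-training sequence length in the article
BATCH_SIZE = 1
KV_TO_Q_RATIO = 9    # one key-value head per nine query heads (1/9 ratio)

# Multi-head attention keeps one key-value head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN, BATCH_SIZE)
# Grouped-query attention shares each key-value head across nine query heads.
gqa = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS // KV_TO_Q_RATIO, HEAD_DIM,
                     SEQ_LEN, BATCH_SIZE)

print(f"MHA cache: {mha / 2**30:.2f} GiB")
print(f"GQA cache: {gqa / 2**30:.2f} GiB")
print(f"Reduction: {mha / gqa:.0f}x")
```

Whatever the real shapes are, the cache shrinks by the same factor as the ratio of key-value heads to query heads, which is where the memory and throughput gains at inference time come from.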
Weight decay and learning rate decay were also tuned: a weight decay of 1e-1, combined with decaying the learning rate to zero, yielded the best results.
Pre-training the Model
The Pharia-1-LLM-7B base model was trained using the Scaling codebase, known for its parallelization capabilities and performance optimizations. Training employed the bfloat16 format with a standard mixed-precision strategy, maintaining master copies of weights and optimizer states in full precision and sharding these full-precision tensors across data-parallel workers using zero redundancy optimizer (ZeRO) stage 1.
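A rough back-of-the-envelope sketch of what this strategy means for per-GPU memory is shown below. It assumes an Adam-style optimizer with two moment tensors per parameter, bfloat16 gradients, and a purely illustrative data-parallel degree; none of these details are stated in the article. Activation memory is not included.

```python
def per_gpu_state_bytes(n_params, dp_degree, zero_stage1=True):
    """Approximate weight, gradient, and optimizer memory per GPU in bytes."""
    half_weights = 2 * n_params   # bfloat16 working copy of the weights
    half_grads = 2 * n_params     # bfloat16 gradients (assumption)
    # Full-precision tensors: fp32 master weights plus two fp32 Adam moments
    # (the Adam-style optimizer is an assumption).
    full_states = 12 * n_params
    if zero_stage1:
        # ZeRO stage 1 partitions the full-precision tensors across
        # the data-parallel workers instead of replicating them.
        full_states = full_states / dp_degree
    return half_weights + half_grads + full_states


N_PARAMS = 7e9     # 7B parameters
DP_DEGREE = 64     # hypothetical data-parallel degree, for illustration only

replicated = per_gpu_state_bytes(N_PARAMS, DP_DEGREE, zero_stage1=False)
sharded = per_gpu_state_bytes(N_PARAMS, DP_DEGREE, zero_stage1=True)

print(f"Without sharding: {replicated / 1e9:.1f} GB per GPU")
print(f"With ZeRO stage 1: {sharded / 1e9:.1f} GB per GPU")
```

Under these assumptions the full-precision tensors dominate the unsharded footprint, so partitioning them is what frees most of the memory.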
Pre-training was conducted with a sequence length of 8192 tokens to establish baseline long-context abilities. To counter early instabilities observed when scaling sequence length, a warm-up strategy was implemented that gradually increased the sequence length from 512 to 2048 and finally to 8192 tokens over several thousand steps. Training ran with a global batch size of 1024 over 4.7 trillion tokens, covering a single epoch of the initial pre-training dataset.
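The staged sequence-length warm-up can be sketched as a simple lookup from training step to sequence length. The article only says the increase happened "over several thousand steps", so the step boundaries below are hypothetical, as is the reading of the global batch size as 1024 sequences per step.

```python
# (start_step, sequence_length) pairs. The lengths come from the article;
# the step boundaries are hypothetical placeholders.
WARMUP_STAGES = ((0, 512), (2000, 2048), (4000, 8192))


def sequence_length_at(step, stages=WARMUP_STAGES):
    """Return the sequence length in use at a given training step."""
    current = stages[0][1]
    for start_step, seq_len in stages:
        if step >= start_step:
            current = seq_len
    return current


GLOBAL_BATCH_SIZE = 1024   # assumed to count sequences, not tokens
FULL_SEQ_LEN = 8192

# Tokens consumed per optimizer step once the warm-up has finished.
tokens_per_step = GLOBAL_BATCH_SIZE * FULL_SEQ_LEN

for step in (0, 1999, 2000, 4000, 100_000):
    print(step, sequence_length_at(step))
print("tokens per step at full length:", tokens_per_step)
```

If the batch size does count sequences, each full-length step processes about 8.4 million tokens, which puts the 4.7 trillion token first phase on the order of 560,000 steps.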
Subsequently, a second epoch was trained on a different data mix, incorporating recently accessible high-quality English data while retaining the model's multilingual capabilities. This phase covered an additional 3 trillion tokens. The learning rate, initially decayed to zero after the first pre-training phase, was warmed up for 2000 iterations to 3e-5 and gradually decayed to 3e-6 following a cosine schedule.
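The second-phase learning rate schedule described above can be sketched as follows. The warm-up length, peak, and final values come from the article; the linear shape of the warm-up and the total number of steps are assumptions made for illustration.

```python
import math


def learning_rate(step, total_steps, warmup_steps=2000,
                  peak_lr=3e-5, final_lr=3e-6):
    """Warm-up from zero to peak_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        # Linear warm-up (the exact warm-up shape is an assumption).
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


TOTAL_STEPS = 10_000   # hypothetical value, not given in the article

for step in (0, 1000, 2000, 6000, 10_000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

Starting the warm-up at zero matches the state at the end of the first phase, where the learning rate had already been decayed to zero.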
In total, the Pharia-1-LLM-7B base model was trained on 7.7 trillion tokens, utilizing 256 A100 graphics processing units (GPUs) for the first phase and 256 H100 GPUs for the second. Memory reduction techniques optimized throughput without the need for activation checkpointing, resulting in efficient step durations and high model throughput.
Fine-Tuning and Model Variants
Pharia-1-LLM-7B-control was optimized for instruction following using full model fine-tuning and a curriculum strategy that involved training on a blend of instruction datasets, including proprietary and multilingual data in English, German, Spanish, and French. The model was fine-tuned with a focus on minimal and anonymized data.
Two variants were developed: Pharia-1-LLM-7B-control and Pharia-1-LLM-7B-control-aligned. The latter includes preference alignment and safety training, making it well suited to conversational applications where clarity and safety are important. This version uses a Kahneman-Tversky Optimization (KTO) alignment process, though it sometimes results in more verbose, generic responses. The control-aligned model is suited for chatbots and virtual assistants, while the non-aligned control model excels in tasks requiring direct, concise outputs, such as extraction and summarization.
Performance Evaluation
Evaluating generative AI (GenAI) models is challenging due to the inherent ambiguity in language, which complicates the creation of standardized metrics. Unlike other AI domains with clear metrics, language models can produce outputs subject to multiple interpretations, making evaluation particularly complex. Human annotators often prioritize assertiveness and length over factuality, affecting evaluation outcomes.
Additionally, evaluation scores can be unstable because they depend on details of the evaluation setup, such as prompt composition and metric choice. Evaluation data might also leak into training datasets, leading to overfitting and skewed results. Many GenAI evaluation tasks do not align well with real-world scenarios, leading to discrepancies between benchmark scores and practical performance.
For instance, tasks in benchmarks like massive multi-task language understanding (MMLU) and Alpaca Eval may not accurately reflect real-world use cases, complicating the assessment of model usefulness. Evaluations of Pharia-1-LLM-7B-control and Pharia-1-LLM-7B-control-aligned against other multilingual models highlight these challenges.
Conclusion
In conclusion, the Pharia-1-LLM-7B models, including the control and control-aligned variants, are advanced language models optimized for multilingual instruction and domain-specific applications. The control-aligned version incorporates safety and preference alignment, making it suitable for conversational tasks. Despite challenges in evaluating generative AI due to language ambiguity and benchmark misalignments, these models offer significant advancements for research and educational purposes under the Open Aleph License.