In an article recently posted to the Meta Research website, researchers introduced Llama 3, a new family of foundation language models whose flagship is a 405B parameter transformer with a 128K token context window. The study demonstrated that Llama 3 matches the performance of leading models such as generative pre-trained transformer 4 (GPT-4) across a wide range of tasks. The researchers also released pre-trained and post-trained versions of the models, along with Llama Guard 3 for safety. Experiments integrating image, video, and speech capabilities showed competitive results, but these features are still under development and have not yet been widely released.
Llama 3 Architecture
Llama 3 features a dense transformer architecture with enhancements for efficiency and performance, such as grouped query attention (GQA) and an attention mask that prevents self-attention from crossing document boundaries within long sequences. It uses a vocabulary of 128,000 tokens, combining the tiktoken tokenizer with additional tokens for better non-English support. The model's rotary positional encoding (RoPE) base frequency is raised to 500,000 for longer contexts, and the 405B parameter version comprises 126 layers, a model dimension of 16,384, and 128 attention heads.
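For intuition on how these numbers fit together, the sketch below assembles a hypothetical configuration and computes the per-dimension rotation frequencies that RoPE derives from the raised base of 500,000. The key/value head count is an illustrative assumption, since the summary does not state the GQA grouping.

```python
import torch

# Hypothetical configuration echoing the 405B hyperparameters quoted above.
# n_kv_heads is an assumption for illustration (GQA uses far fewer key/value
# heads than query heads); it is not stated in the summary.
config = {
    "n_layers": 126,
    "d_model": 16_384,
    "n_heads": 128,
    "n_kv_heads": 8,            # assumed GQA grouping
    "vocab_size": 128_000,
    "rope_base": 500_000.0,     # raised base frequency for long contexts
    "context_window": 131_072,  # 128K tokens
}

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-pair rotation frequencies used by rotary positional encoding."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)

head_dim = config["d_model"] // config["n_heads"]   # 128 dimensions per head
freqs = rope_frequencies(head_dim, config["rope_base"])
print(head_dim, freqs.shape)  # 128, torch.Size([64])
```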
Scaling laws guided the choice of model size based on benchmark performance and the training compute budget, measured in floating point operations (FLOPs), leading to the 405B parameter flagship model. Training uses Meta's infrastructure with up to 16,000 H100 graphics processing units (GPUs), optimized for flexible batch sizing, reduced memory usage, and improved communication.
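To see how a compute budget and model size relate, the following sketch applies the common back-of-the-envelope estimate that training cost is roughly six times the parameter count times the token count. The token count here is an illustrative assumption, not a figure reported in the article.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard rough estimate of training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

# Illustrative numbers only: a 405B-parameter model and an assumed token count
# chosen to show the scale of the flagship run; the real budget follows from
# the scaling-law analysis described above.
n_params = 405e9
n_tokens = 15e12
print(f"{training_flops(n_params, n_tokens):.2e} FLOPs")  # ~3.6e25
```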
Post-Training Refinement
The Llama 3 models undergo extensive post-training to align with human feedback through several iterative rounds. This process includes supervised finetuning (SFT) and direct preference optimization (DPO), with each stage building on a pre-trained checkpoint. Post-training involves creating a reward model using human-annotated data and finetuning with both human and synthetic data.
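As a minimal sketch of the DPO objective used in this stage, the code below computes the preference loss from policy and reference log-probabilities of chosen and rejected responses. The beta value and toy inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal direct preference optimization loss over per-example log-probs.

    beta controls how strongly the policy is pushed away from the reference
    model toward the human-preferred responses.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example with made-up log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())
```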
New capabilities, such as multi-message chat protocols and enhanced formatting, are incorporated to refine the model's behavior. Preference data is annotated and processed to rate responses and guide their improvement. In addition, the training data is carefully curated and quality-controlled using topic classification, quality and difficulty scoring, and semantic deduplication. Rejection sampling, accelerated with the PagedAttention technique, improves efficiency and data quality throughout this iterative alignment process.
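The snippet below is a simplified sketch of the idea behind semantic deduplication: embed each example, then greedily drop any example that is too similar to one already kept. The greedy procedure and the threshold are assumptions for illustration, not the exact pipeline used for Llama 3.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate filtering over example embeddings.

    Keeps an example only if its cosine similarity to every already-kept
    example stays below the threshold (an illustrative choice).
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy usage: three embeddings, the second nearly duplicating the first.
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(semantic_dedup(emb))  # [0, 2]
```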
Llama 3 Evaluation
The evaluation of Llama 3 covered pre-trained and post-trained performance, including safety aspects. The model excelled in standard benchmarks like reading comprehension and coding, showing improved robustness and stability across various question formats. Adversarial benchmarks revealed Llama 3's strengths in handling complex tasks, although performance varied between adversarial and non-adversarial scenarios. Contamination analysis highlighted the impact of training data overlap on evaluation scores, showing varied effects across benchmarks. Overall, Llama 3 set new standards in model performance and safety.
FP8 Optimization
To enhance the inference efficiency of the Llama 3 405B model, pipeline parallelism and 8-bit floating point (FP8) quantization are utilized. The brain floating point 16 (BF16) representation of the model exceeds the GPU memory capacity of a single machine with 8 Nvidia H100 GPUs, necessitating parallelization across 16 GPUs on two machines. Tensor parallelism is employed within each machine due to high bandwidth, while pipeline parallelism is used across machines to manage lower inter-machine connectivity.
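A schematic sketch of that layout is shown below, mapping the 16 GPUs onto two pipeline stages of eight tensor-parallel ranks each; the device names are placeholders for illustration.

```python
# Hypothetical layout of the 16 GPUs used for 405B inference: tensor
# parallelism spans the 8 GPUs inside each machine (high intra-node
# bandwidth), while pipeline parallelism spans the 2 machines, each owning
# a contiguous block of layers.
TENSOR_PARALLEL = 8    # GPUs per machine sharing each layer's weights
PIPELINE_PARALLEL = 2  # machines, one pipeline stage each

layout = {
    (stage, rank): f"machine{stage}:gpu{rank}"
    for stage in range(PIPELINE_PARALLEL)
    for rank in range(TENSOR_PARALLEL)
}
for key, device in sorted(layout.items()):
    print(key, device)
```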
Micro-batching further improves throughput at a modest cost in latency. FP8 quantization leverages the native FP8 support of H100 GPUs, quantizing parameters and activations in the feedforward layers but not in the self-attention layers, with adjustments such as upper-bounding the dynamic scaling factors and using row-wise quantization to mitigate quantization errors. FP8 quantization shows benchmark performance comparable to BF16 while delivering significant throughput improvements of up to 50% during the pre-fill stage and a favorable trade-off between throughput and latency during decoding.
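The sketch below illustrates row-wise quantization with an upper-bounded scaling factor, assuming the e4m3 FP8 format supported on H100 GPUs; the specific scale bound is an illustrative choice rather than the value used in production.

```python
import torch  # requires a recent PyTorch with float8 dtypes

FP8_MAX = 448.0  # largest representable value in the e4m3 format

def quantize_rowwise_fp8(w: torch.Tensor, scale_ub: float = 1200.0):
    """Row-wise FP8 quantization with an upper-bounded dynamic scale.

    One scale per row limits the blast radius of outliers; clamping the scale
    (scale_ub is an illustrative bound) mirrors the idea of upper-bounding the
    dynamic scaling factors mentioned above.
    """
    row_max = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = (row_max / FP8_MAX).clamp(max=scale_ub)
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, s = quantize_rowwise_fp8(w)
print((dequantize(q, s) - w).abs().max())  # small quantization error
```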
Visual-Text Integration
Incorporating visual recognition into Llama 3 involves a two-stage compositional approach: first, integrating a pre-trained image encoder with the language model using cross-attention layers, and second, enhancing temporal understanding with video cross-attention layers. This method allows the vision and language components to be developed in parallel, circumvents the challenges of joint pre-training, leaves performance on text-only tasks unaffected, and improves inference efficiency; the multimodal models remain under active development and experimentation.
The image encoder is initialized with pre-trained weights, trained on large sets of image-text pairs, and refined with higher-resolution data; video support adds dedicated video cross-attention layers trained on video data. Post-training then finetunes on curated data, employs preference data for reward modeling and DPO, uses rejection sampling for reasoning tasks, and applies quality tuning to enhance performance.
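A toy version of such a cross-attention adapter is sketched below: text hidden states attend to image-encoder features through a gated cross-attention block inserted between language-model layers. The gating scheme and dimensions are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative cross-attention adapter placed between language-model layers.

    Text hidden states attend to image-encoder features; a zero-initialized
    tanh gate blends visual information in gradually, so the frozen text-only
    behavior is preserved at initialization.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, text_h: torch.Tensor, image_h: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(self.norm(text_h), image_h, image_h)
        return text_h + torch.tanh(self.gate) * attended

# Toy usage: 16 text tokens attending to 64 image patch embeddings.
block = GatedCrossAttentionBlock(d_model=512, n_heads=8)
out = block(torch.randn(1, 16, 512), torch.randn(1, 64, 512))
print(out.shape)  # torch.Size([1, 16, 512])
```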
Speech Integration
Speech capabilities are integrated into Llama 3 using a compositional approach, combining a speech encoder with an adapter for understanding and a text-to-speech system for generation. The model supports 34 languages, leveraging system prompts for speech recognition and translation tasks. A large, diverse dataset is used for training, and evaluations show Llama 3 excels in speech translation and the naturalness of generated speech, outperforming other state-of-the-art models.
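The sketch below shows one plausible shape for the speech adapter in such a compositional design: encoder frames are downsampled and projected into the language model's embedding space so they can be interleaved with text token embeddings. All sizes and the downsampling factor are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Hypothetical adapter mapping speech-encoder frames into the LLM embedding space.

    A strided convolution shortens the frame sequence and a small projection
    matches the language model's hidden size; the resulting embeddings can be
    fed to the model alongside text token embeddings.
    """

    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        self.downsample = nn.Conv1d(speech_dim, speech_dim, kernel_size=stride, stride=stride)
        self.project = nn.Sequential(
            nn.Linear(speech_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        x = self.downsample(frames.transpose(1, 2)).transpose(1, 2)
        return self.project(x)

adapter = SpeechAdapter()
speech_tokens = adapter(torch.randn(1, 100, 1280))  # 100 encoder frames
print(speech_tokens.shape)  # torch.Size([1, 25, 4096])
```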
Conclusion
To sum up, the development of Llama 3 demonstrated that prioritizing high-quality data, scale, and simplicity led to optimal results despite initial experiments with more complex approaches. The process also underscored the importance of organizational decisions, such as preventing benchmark contamination and ensuring trustworthy evaluations. Sharing the development process and preliminary multimodal experiments aimed to foster informed research and accelerate advancements.