Discover how CALDERA’s cutting-edge compression algorithm is revolutionizing AI by making powerful language models accessible, efficient, and edge-ready.
Research: Compressing Large Language Models using Low Rank and Low Precision Decomposition.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Large language models (LLMs) are increasingly automating tasks like translation, text classification, and customer service. But tapping into an LLM's power typically requires users to send their requests to a centralized server—a process that's expensive, energy-intensive, and often slow.
Now, researchers have introduced a technique for compressing an LLM's reams of data, which could increase privacy, save energy, and lower costs.
The new algorithm, CALDERA (Calibration Aware Low precision DEcomposition with low-Rank Adaptation), developed by engineers at Princeton and Stanford Engineering, works by trimming redundancies and reducing the precision of an LLM's layers of information. This type of leaner LLM could be stored and accessed locally on a device like a phone or laptop and could provide performance nearly as accurate and nuanced as an uncompressed version.
"Any time you can reduce the computational complexity, storage, and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn't handle such compute- and memory-intensive tasks," said study coauthor Andrea Goldsmith, dean of Princeton's School of Engineering and Applied Science and Arthur LeGrand Doty Professor of Electrical and Computer Engineering.
"When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive," said coauthor Rajarshi Saha, a Stanford Engineering Ph.D. student. "So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs." Saha's graduate work is coadvised by Goldsmith and coauthor Mert Pilanci, an assistant professor at Stanford Engineering.
The researchers will present their new algorithm, which provides state-of-the-art results in low-bit precision regimes, at the Conference on Neural Information Processing Systems (NeurIPS) in December.
Saha and colleagues began this compression research not with LLMs themselves but with the large collections of information used to train LLMs and other complex AI models, such as those used for image classification. That earlier technique for compressing training data, a forerunner to the new LLM compression approach, was published in 2023.
Training data sets and AI models are both composed of matrices, or grids of numbers, used to store data. In the case of LLMs, these are called weight matrices, which are numerical representations of word patterns learned from large swaths of text.
"We proposed a generic algorithm for compressing large data sets or large matrices," said Saha. "And then we realized that nowadays, it's not just the data sets that are large, but the models being deployed are also getting large. So, we could also use our algorithm to compress these models."
While the team's algorithm is not the first to compress LLMs, its novelty lies in combining low-precision representation with low-rank approximation. As digital computers store and process information as bits (zeros and ones), "low-precision" representation reduces the number of bits, speeding up storage and processing while improving energy efficiency. On the other hand, "low-rank" refers to reducing redundancies in the LLM weight matrices.
"Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually," said Saha.
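The combination Saha describes can be illustrated with a small numerical sketch. What follows is not the CALDERA implementation: it is a toy example, assuming NumPy, a synthetic matrix standing in for an LLM weight matrix, a simple uniform quantizer, and a truncated SVD for the low-rank part. The function names (quantize_uniform, low_rank), the 2-bit setting, and the rank of 4 are all hypothetical choices made for illustration, and the sketch ignores the calibration data that the algorithm's name refers to.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Round x onto a uniform grid of 2**bits values spanning [x.min(), x.max()]."""
    steps = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return np.full_like(x, lo)
    scale = (hi - lo) / steps
    return np.round((x - lo) / scale) * scale + lo

def low_rank(x, rank):
    """Best rank-`rank` approximation of x (truncated SVD), as factors (L, R)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

rng = np.random.default_rng(0)
n = 256

# Synthetic stand-in for a weight matrix (an assumption of this sketch):
# a few dominant directions plus small dense noise.
W = rng.standard_normal((n, 8)) @ rng.standard_normal((8, n))
W += 0.1 * rng.standard_normal((n, n))

def rel_err(approx):
    """Relative Frobenius-norm error of an approximation to W."""
    return np.linalg.norm(W - approx) / np.linalg.norm(W)

# Low precision only: every entry stored with 2 bits.
Q = quantize_uniform(W, bits=2)
err_quant_only = rel_err(Q)

# Low precision plus low rank: fit a rank-4 correction to what quantization lost.
L, R = low_rank(W - Q, rank=4)
err_combined = rel_err(Q + L @ R)

# Rough storage cost in bits, assuming the factors L and R are kept at 16 bits.
bits_fp16 = n * n * 16
bits_combined = n * n * 2 + (L.size + R.size) * 16

print(f"quantization only : error {err_quant_only:.3f}")
print(f"quantized + rank-4: error {err_combined:.3f}")
print(f"storage: {bits_combined} bits vs {bits_fp16} bits at 16-bit precision")
```

In this toy setup the combined approximation can never do worse than quantization alone, because the low-rank term is fitted to the quantization residual. How much it helps on real weight matrices, and at what bit budget, is what the paper's experiments measure.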
The team tested their technique using Meta AI's open-source LLaMA 2 and LLaMA 3 models and found that their method, which used low-rank and low-precision components in tandem, outperformed existing techniques such as QuIP# on metrics like perplexity. The improvement can be as much as 5%, a significant margin for metrics that measure uncertainty in predicting word sequences.
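For readers unfamiliar with the metric, perplexity can be computed as the exponential of the average negative log-probability a model assigns to each correct next token; lower is better. The sketch below shows that calculation and how a relative improvement between two models might be expressed. The probability lists are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-probability) over the probabilities a model
    assigned to the tokens that actually occurred."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def relative_improvement(baseline_ppl, new_ppl):
    """Fractional reduction in perplexity relative to a baseline."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Hypothetical per-token probabilities from two compressed models
# evaluated on the same text (made-up numbers, not from the study).
probs_model_a = [0.20, 0.10, 0.40, 0.25]
probs_model_b = [0.22, 0.11, 0.41, 0.26]

ppl_a = perplexity(probs_model_a)
ppl_b = perplexity(probs_model_b)

print(f"model A perplexity: {ppl_a:.3f}")
print(f"model B perplexity: {ppl_b:.3f}")
print(f"relative improvement: {relative_improvement(ppl_a, ppl_b):.1%}")
```

A model that spreads its guesses evenly over four candidate tokens has a perplexity of about 4, which is why the metric is often read as the effective number of choices the model is weighing at each step.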
They evaluated the compressed language models' performance using several sets of benchmark tasks for LLMs. The tasks included determining the logical order of two statements, answering questions involving physical reasoning, and solving commonsense reasoning problems such as those in the Winogrande and RTE datasets.
"I think it's encouraging and a bit surprising that we were able to get such good performance in this compression scheme," said Goldsmith, who moved to Princeton from Stanford Engineering in 2020. "By taking advantage of the weight matrix rather than just using a generic compression algorithm for the bits that are representing the weight matrix, we were able to do much better."
Using an LLM compressed in this way could be suitable for situations that don't require the highest possible precision. Moreover, the ability to fine-tune compressed LLMs on edge devices like a smartphone or laptop enhances privacy by allowing organizations and individuals to adapt models to their specific needs without sharing sensitive data with third-party providers. This reduces the risk of data breaches or unauthorized access to confidential information during the training process. However, this approach still requires careful consideration of memory and energy constraints, as noted by the authors.
Saha also cautioned that running LLMs on a smartphone or laptop could hog the device's memory for a period of time. "You won't be happy if you are running an LLM and your phone drains out of charge in an hour," said Saha. Low-precision computation can help reduce power consumption, he added. "But I wouldn't say that there's one single technique that solves all the problems. Our work on CALDERA complements prior approaches and provides a flexible framework that balances compression with performance."
The paper, "Compressing Large Language Models using Low Rank and Low Precision Decomposition," will be presented at the Conference on Neural Information Processing Systems (NeurIPS) in December 2024. In addition to Goldsmith, Saha, and Pilanci, coauthors include Stanford Engineering researchers Naomi Sagan and Varun Srivastava. This work was supported in part by the U.S. National Science Foundation, the U.S. Army Research Office, and the Office of Naval Research.
Source:
- Princeton University, Engineering School
Journal reference:
- Preliminary scientific report.
Saha, R., Sagan, N., Srivastava, V., Goldsmith, A. J., & Pilanci, M. (2024). Compressing Large Language Models using Low Rank and Low Precision Decomposition. ArXiv. https://arxiv.org/abs/2405.18886