Artificial intelligence (AI) models with large numbers of parameters have demonstrated remarkable accuracy but pose energy-efficiency challenges for conventional processors. Analog in-memory computing (analog-AI) has emerged as a solution by enabling energy-efficient parallel matrix computations. In a recent publication in the journal Nature, researchers presented a chip comprising 35 million memory devices distributed across 34 tiles, achieving a performance of up to 12.4 tera-operations per second per watt (TOPS/W).
Background
Over the last decade, AI techniques have found applications in diverse domains, encompassing tasks such as image recognition, speech transcription, and text generation. These advancements hinge on ever-expanding deep neural networks (DNNs) with increasing numbers of parameters. Models such as transformers and recurrent neural-network transducers (RNNTs) with billions of parameters have notably improved word error rates (WERs) for speech transcription on the Librispeech and Switchboard datasets.
However, hardware progress has lagged, resulting in prolonged training, inference times, and higher energy consumption. Analog in-memory computing (analog-AI) emerges as a solution, leveraging non-volatile memory arrays to perform computation directly in memory. This promises efficiency for large DNNs with fully connected layers. An experimental chip with phase-change memory arrays and analog components demonstrates accurate and energy-efficient natural language processing (NLP) inference, even for substantial models such as RNNTs. This innovation addresses energy inefficiencies associated with data movement, offering a potential leap in performance.
Chip architecture
The chip's architecture features a grid of 34 analog tiles, each housing a 512 × 2,048 phase-change memory (PCM) crossbar array. These tiles are organized into six power domains labeled as north, center, or south and further categorized as east or west. Within each power domain, there is an input landing pad (ILP) and an output landing pad (OLP) connected to sizable static random-access memory (SRAM). The ILP receives digital input vectors (each with 8-bit unsigned integer (UINT8) entries) from external sources, converting them into pulse-width-modulated (PWM) durations transmitted via parallel wires on the 2D mesh. Conversely, the OLP acquires PWM durations and reverses the process, transforming them back into UINT8 values for transport off the chip.
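As a rough illustration of the ILP/OLP conversion described above, the sketch below encodes UINT8 entries as pulse durations and decodes them back. The function names and the time unit are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of the ILP/OLP conversion: a UINT8 activation is
# sent as a pulse whose duration is proportional to its value, then
# recovered on the receiving side. The 1-ns unit is an assumption.

def uint8_to_pwm_duration(value, unit_ns=1.0):
    """Encode an 8-bit unsigned value as a pulse duration (ns)."""
    assert 0 <= value <= 255
    return value * unit_ns  # duration proportional to the digital value

def pwm_duration_to_uint8(duration_ns, unit_ns=1.0):
    """Decode a pulse duration back into a UINT8 value."""
    value = round(duration_ns / unit_ns)
    return max(0, min(255, value))  # clamp to the UINT8 range

vector = [0, 17, 200, 255]
durations = [uint8_to_pwm_duration(v) for v in vector]
recovered = [pwm_duration_to_uint8(d) for d in durations]
assert recovered == vector
```

Because tiles exchange these durations directly, the round trip through digital UINT8 form is only needed at the chip boundary.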
Communication between analog tiles occurs directly as durations, avoiding analog-to-digital conversion at the tile periphery. PCM devices encode analog conductance states by adjusting the ratio of crystalline to amorphous material. Variable PCM configurations enable flexible weight encoding. Local controllers on each tile define weight setups, MAC operations, and routing schemes within the 512 × 512 wire mesh. Complex routing patterns are managed by 'borderguard' circuits and tri-state buffers.
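The article does not spell out how conductances map to weights. A common analog-AI convention, sketched here with assumed names and an assumed maximum conductance, represents each signed weight as a differential pair of conductances:

```python
# Illustrative differential weight encoding for a PCM pair.
# G_MAX and the unipolar (one-sided) encoding are assumptions;
# real chips may split weights across several devices.
G_MAX = 25.0  # assumed maximum device conductance, in microsiemens

def encode_weight(w, w_max=1.0):
    """Map a signed weight onto a (G_plus, G_minus) conductance pair."""
    g = abs(w) / w_max * G_MAX
    return (g, 0.0) if w >= 0 else (0.0, g)

def decode_weight(g_plus, g_minus, w_max=1.0):
    """Recover the signed weight from the differential conductances."""
    return (g_plus - g_minus) / G_MAX * w_max

w = -0.4
g_plus, g_minus = encode_weight(w)
assert abs(decode_weight(g_plus, g_minus) - w) < 1e-9
```

Reading the current difference between the two devices then yields a signed analog multiply without any digital sign logic.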
From keyword spotting to speech-to-text transcription
The chip's performance was first demonstrated on a multi-class keyword spotting (KWS) task. While the MLPerf reference model for KWS typically employs a convolutional neural network, the researchers chose a fully connected (FC) network architecture. Both network variants require upstream digital preprocessing to prepare incoming audio waveforms for input. Although the convolutional network achieves higher classification accuracy, the FC model offers a simpler architecture and faster execution.
To execute an end-to-end implementation on the chip, the researchers adjusted the audio-spectrum preprocessing to generate 1,960 inputs and expanded the hidden layers to 512 units per tile. To make the network robust to analog noise, they incorporated weight and activation noise injection, weight clipping, L2 regularization, and bias removal during training. A pruned version of this network, sized to fit the chip's capacity, was adopted for implementation. For KWS, the researchers employed four tiles in total: two for the first weight layer and two for the subsequent two layers.
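The hardware-aware training tricks listed above can be sketched as a single noisy forward pass. The noise scale and clipping range below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, W, weight_noise=0.02, clip=1.0):
    """One hardware-aware forward pass: clip weights to the range the
    analog devices can represent, then perturb them with Gaussian noise
    to mimic programming/read variability (noise scale is illustrative)."""
    W_clipped = np.clip(W, -clip, clip)
    W_noisy = W_clipped + rng.normal(0.0, weight_noise, W.shape)
    return x @ W_noisy  # no bias term, matching the bias removal above

x = rng.normal(size=(1, 8))
W = rng.normal(size=(8, 4))
out = noisy_forward(x, W)
assert out.shape == (1, 4)
```

Training against such perturbed forward passes encourages the network to settle on weights whose outputs are insensitive to the conductance errors seen on real hardware.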
To enhance accuracy and compensate for peripheral circuit asymmetries, the researchers introduced the multiply-accumulate (MAC) asymmetry balance (AB) method, which cancels out circuitry asymmetries to ensure accurate computation. Each audio frame took 2.4 microseconds to process, significantly faster than the best-case latency reported by MLPerf. The KWS implementation reached 86.14% accuracy, exceeding MLPerf's software-equivalent accuracy threshold of 85.88%.
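The article does not detail how the AB method works internally. The toy model below only illustrates the cancellation principle: if positive and negative partial sums pass through mismatched gains, running the MAC twice with flipped signs and combining the results cancels a symmetric mismatch. The gain values are made up.

```python
def asymmetric_mac(x, w, gain_pos=1.02, gain_neg=0.98):
    """Toy model of a MAC whose positive and negative partial sums
    pass through mismatched circuit gains (values are illustrative)."""
    pos = sum(xi * wi for xi, wi in zip(x, w) if xi * wi >= 0)
    neg = sum(xi * wi for xi, wi in zip(x, w) if xi * wi < 0)
    return gain_pos * pos + gain_neg * neg

def balanced_mac(x, w, **gains):
    """Balance idea: run the MAC on (x, w) and on the sign-flipped
    weights, then combine so the gain mismatch cancels."""
    return 0.5 * (asymmetric_mac(x, w, **gains)
                  - asymmetric_mac(x, [-wi for wi in w], **gains))

x, w = [1.0, 2.0, -3.0], [0.5, -0.25, 0.1]
true_mac = sum(a * b for a, b in zip(x, w))
assert abs(balanced_mac(x, w) - true_mac) < 1e-9
```

In this toy model the combined result equals the true MAC scaled by the average gain, so a symmetric mismatch (here 1.02 vs 0.98) disappears entirely.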
For a more complex task, the chip demonstrated speech-to-text transcription using the RNNT model. The chip mapped the network's components, and although digital preprocessing remained vital, the chip adeptly handled vector-vector products and activation functions. This capability extended to multiple chips, with one chip's output feeding into another. Remarkably, the chip maintained resilience even after more than a week of PCM drift, resulting in a mere 0.4% increase in the RNNT's word error rate (WER).
Analyzing power consumption and efficiency
An analysis of power consumption highlighted the dominant contribution of the 1.5-volt (V) and 0.8-V power supplies. Sustained TOPS/W values were recorded, with chip four demonstrating the highest performance. A system-level evaluation of energy efficiency indicated that incorporating the required digital computation would yield similar efficiencies, and that such combined analog-digital processing remains significantly more efficient than purely digital processing.
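The TOPS/W metric itself is simple arithmetic. In the sketch below, the 130-ns integration time and 1.3-W tile power are assumptions chosen purely to show how a peak figure like the 12.4 TOPS/W quoted earlier could arise from a 512 × 2,048 crossbar.

```python
def tops_per_watt(macs_per_second, power_watts):
    """Each multiply-accumulate counts as two operations
    (one multiply plus one add), the usual TOPS convention."""
    ops_per_second = 2 * macs_per_second
    return ops_per_second / power_watts / 1e12

# Illustrative numbers only (not measurements from the paper):
# one 512 x 2,048 tile finishing all its MACs every 130 ns at 1.3 W.
macs = 512 * 2048 / 130e-9
print(round(tops_per_watt(macs, 1.3), 1))  # prints 12.4
```

The key point is that the crossbar performs all of its multiply-accumulates in parallel per integration window, which is what drives the operations-per-second numerator so high.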
Conclusion
In summary, the researchers showcased successful industry-relevant applications on analog-AI chips, focusing on speech recognition and transcription in the field of NLP. Using an all-analog setup with the novel AB technique, the 14-nm analog inference chip demonstrates software-equivalent accuracy on end-to-end KWS. The study extends to the MLPerf RNNT on Librispeech, achieving a 9.258% WER with a weight-expansion approach. This pioneering work establishes the first instance of commercially significant accuracy using over 140 analog-AI tiles together with efficient communication of neural-network activations. The findings suggest that, combined with efficient on-chip auxiliary computation, analog-AI systems can deliver sustained energy efficiency and throughput at impressive levels.