In a paper published in the journal Scientific Reports, researchers introduced an efficient network, the Lightweight Hybrid Vision Transformer (LH-ViT), for radar-based Human Activity Recognition (HAR). LH-ViT combines convolution operations with self-attention to enhance feature extraction from micro-Doppler maps, and it employs a Residual Squeeze-and-Excitation (RES-SE) block to reduce computational load. Experimental results on two human activity datasets demonstrated the method's advantages in expressiveness and computing efficiency over traditional approaches.
Background
HAR has diverse applications in healthcare, smart homes, security, and autonomous driving. HAR approaches fall into two categories: visual-based methods, which use optical cameras, and non-visual methods, which employ sensors such as radar. Radar-based HAR, leveraging micro-Doppler features, has garnered attention for its adaptability and privacy protection. Researchers have explored both traditional methods and deep learning approaches to address the challenges of embedded applications. However, there is a growing consensus that lightweight solutions are crucial for further improving performance.
Previous HAR research categorized data sources into visual-based and non-visual sensor-based methods, with radar-based HAR gaining attention. Traditional HAR approaches had limitations in dealing with complex human activities, leading to the adoption of deep learning techniques such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and transformers. Hybrid networks and attention mechanisms improved recognition precision and accuracy, while self-attention helped networks cope with variation in radar images.
Methodology
Radar-based HAR with the LH-ViT Framework: The process begins with a millimeter-wave radar that collects echoes from a moving human body, yielding multi-channel intermediate-frequency signals after dechirp processing. These signals then undergo a Two-Dimensional Fast Fourier Transform (2D FFT), which compresses the signal energy within the range-angle plane. Static clutter is suppressed using a phase-average cancellation method.
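The intuition behind clutter cancellation can be illustrated with a minimal sketch: a stationary reflector contributes a constant term along slow time, so subtracting the slow-time average removes it. This is a simplified stand-in for the paper's phase-average cancellation method, and the data shapes below are invented for illustration.

```python
def remove_static_clutter(frames):
    """Static-clutter removal sketch: subtract the slow-time average
    from each range bin. frames[r][t] is the complex echo of range
    bin r at frame t (hypothetical layout, not the paper's exact one)."""
    cleaned = []
    for range_bin in frames:
        mean = sum(range_bin) / len(range_bin)  # static (DC) component
        cleaned.append([s - mean for s in range_bin])
    return cleaned

# A stationary reflector gives a constant echo across frames; after
# cancellation only the time-varying (moving-target) part remains.
cleaned = remove_static_clutter([[2 + 1j] * 4])
print(max(abs(s) for s in cleaned[0]))  # → 0.0
```

A moving target adds a phase-rotating term on top of the constant clutter, which survives the subtraction and carries the micro-Doppler signature.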
Target detection uses the two-dimensional constant false alarm rate (2D-CFAR) method. After the target bins are detected, data from different frames are combined into a slow-time vector, which is then subjected to a short-time Fourier transform (STFT) to generate the Micro-Doppler Map (MDM). The normalized MDM serves as the input to the LH-ViT network for efficient HAR.
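The STFT step above can be sketched with a naive windowed DFT over the slow-time vector; each window yields one Doppler column of the MDM. The window length, hop size, and rectangular window here are illustrative choices, not the paper's parameters.

```python
import cmath

def stft_mdm(slow_time, win_len, hop):
    """Micro-Doppler map sketch: magnitude of a short-time Fourier
    transform of the slow-time vector. Naive DFT with a rectangular
    window -- illustration only, not the paper's implementation."""
    mdm = []
    for start in range(0, len(slow_time) - win_len + 1, hop):
        win = slow_time[start:start + win_len]
        spectrum = []
        for k in range(win_len):  # Doppler (frequency) axis
            acc = sum(x * cmath.exp(-2j * cmath.pi * k * n / win_len)
                      for n, x in enumerate(win))
            spectrum.append(abs(acc))
        mdm.append(spectrum)      # one time column of the map
    return mdm

# A constant-Doppler tone concentrates energy in a single Doppler bin.
tone = [cmath.exp(2j * cmath.pi * 2 * n / 8) for n in range(16)]
col = stft_mdm(tone, win_len=8, hop=4)[0]
print(col.index(max(col)))  # → 2
```

In practice the map is computed with a windowed FFT (e.g. a Hann window with overlapping segments) and then normalized before being fed to the network.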
Feature Extraction Network: The feature extraction network uses a pyramid structure to capture multi-scale micro-Doppler features from the MDM. Each pyramid level employs a pair of RES-SE modules: the first extracts micro-Doppler features at the current scale, while the second handles downsampling by adjusting the stride value. These modules use a residual structure with 1x1 convolution, Batch Normalization, and Depthwise Separable Convolution (DSC) to extract features efficiently. An SE block based on a lightweight channel attention mechanism processes the DSC output, enhancing feature sensitivity in the channel dimension: it emphasizes channels carrying more separable information while suppressing less valuable ones.
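The squeeze-and-excitation mechanism at the heart of the RES-SE block can be sketched as follows. This is a generic SE block (squeeze by global average pooling, excitation by two fully connected layers with a sigmoid gate); the layer sizes and weights below are hypothetical, and the paper's RES-SE module additionally wraps this in a residual DSC structure.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(feature_maps, w1, w2):
    """Squeeze-and-Excitation sketch. feature_maps[c][i][j] is channel
    c of a 2-D feature map; w1, w2 are the weights of the reduction and
    expansion fully connected layers (learned in a real network,
    passed in here for illustration)."""
    C = len(feature_maps)
    # Squeeze: global average pooling per channel.
    z = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
         for fm in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives per-channel gates.
    hidden = [max(0.0, sum(w1[h][c] * z[c] for c in range(C)))
              for h in range(len(w1))]
    gates = [sigmoid(sum(w2[c][h] * hidden[h] for h in range(len(hidden))))
             for c in range(C)]
    # Scale: reweight each channel by its gate.
    return [[[g * v for v in row] for row in fm]
            for g, fm in zip(gates, feature_maps)]

fmaps = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
         [[2.0, 2.0], [2.0, 2.0]]]   # channel 1
w1 = [[1.0, 0.0]]                    # 2 -> 1 reduction (made-up weights)
w2 = [[1.0], [1.0]]                  # 1 -> 2 expansion
out = se_block(fmaps, w1, w2)
```

The gates are data-dependent, so the same block can boost different channels for different inputs, which is what makes the attention "channel-wise".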
Feature Enhancement Network: The feature enhancement network eliminates background noise interference and emphasizes micro-Doppler features related to human behavior through cross-stacked Radar-ViT and RES-SE modules. This hybrid structure simplifies local representation and fusion modules, creating a shallow, narrow, lightweight network. The stacked global representation modules with multi-head attention allow the network to capture rich feature information from different representation subspaces. Radar-ViT divides the feature map into non-overlapping cells and applies multi-head attention to capture the global micro-Doppler features. The combination of Radar-ViT and RES-SE modules ensures effective feature enhancement.
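The attention computation inside Radar-ViT can be illustrated with single-head scaled dot-product self-attention over feature cells. This sketch simplifies heavily: the Q/K/V projections are taken as identity, there is one head instead of several, and the cell-partitioning of the feature map is assumed to have already produced the input vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(cells):
    """Single-head scaled dot-product self-attention over feature
    cells (each cell is a feature vector from a non-overlapping
    region of the map). Identity Q/K/V projections for brevity."""
    d = len(cells[0])
    out = []
    for q in cells:
        # Similarity of this cell to every cell, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in cells]
        w = softmax(scores)
        # Each output is a weighted mix of all cells -> global context.
        out.append([sum(w[j] * cells[j][i] for j in range(len(cells)))
                    for i in range(d)])
    return out
```

Because every output cell mixes information from every other cell, stacking such modules lets the network capture global micro-Doppler structure that local convolutions alone would miss; multi-head attention repeats this in several projected subspaces.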
The output of this stage passes through a point-wise convolution and is combined with the network's input through concatenation. This fusion enhances information propagation, accelerates training, and improves recognition accuracy. The concatenated features are further refined in subsequent RES-SE modules, making LH-ViT a comprehensive and efficient solution for radar-based HAR.
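The point-wise convolution and concatenation steps can be sketched as below. A 1x1 (point-wise) convolution is a per-pixel linear combination of input channels, and the fusion appends the original input channels to the convolved output; the channel counts and weights are hypothetical.

```python
def pointwise_conv(channels, weights):
    """1x1 convolution sketch: each output channel is a per-pixel
    linear combination of the input channels. channels[c][i][j] is a
    2-D map; weights[o][c] mixes input channel c into output o."""
    H, W = len(channels[0]), len(channels[0][0])
    return [[[sum(w[c] * channels[c][i][j] for c in range(len(channels)))
              for j in range(W)] for i in range(H)]
            for w in weights]

def fuse_with_input(features, net_input, weights):
    """Point-wise conv, then channel-wise concatenation with the
    original network input (the fusion step described above)."""
    return pointwise_conv(features, weights) + net_input

fused = fuse_with_input([[[1.0, 2.0], [3.0, 4.0]]],  # features, 1 channel
                        [[[0.5, 0.5], [0.5, 0.5]]],  # original input
                        [[2.0]])                     # 1 output channel
print(len(fused))  # → 2
```

Concatenation (rather than addition) preserves both the enhanced features and the raw input intact, which is the property credited with improving information propagation.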
Findings
The research utilized two distinct radar datasets, one acquired from a C-band radar and the other from mmWave radar, to evaluate the LH-ViT network's performance in HAR. These datasets encompassed a range of human activities, and LH-ViT outperformed various state-of-the-art networks in terms of accuracy, parameter efficiency, and inference times.
LH-ViT excelled at recognizing individual activities and performed strongly under subject-independent splits, underlining its adaptability and efficiency for radar-based HAR through precise micro-Doppler feature extraction. It offered robust recognition even in scenarios with individual variation, demonstrating its potential for a wide range of applications in fields such as intelligent healthcare, smart homes, security systems, and autonomous driving.
Summary
To sum up, this study introduced the LH-ViT network designed for HAR using radar-based micro-Doppler features. Following preprocessing, the LH-ViT network exhibited remarkable recognition accuracy, achieving 99.7% in the self-established dataset and 92.1% in the public dataset. Extensive investigations into the network's architecture led to the identification of an optimal structure, which consistently demonstrated superior performance compared to other widely used networks and existing literature on HAR networks.
The LH-ViT network meets the stringent accuracy and real-time requirements of HAR and holds significant promise for embedded applications. Notably, the study concentrated on recognizing single actions under relatively ideal data-collection scenarios.
Future directions include enhancing and diversifying datasets, refining radar signal processing algorithms, and optimizing deep learning network structures to improve radar-based HAR performance in the context of complex and continuous human activities.