Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer

In a paper published in the journal Scientific Reports, researchers introduced an efficient network called the Lightweight Hybrid Vision Transformer (LH-ViT) for radar-based Human Activity Recognition (HAR). LH-ViT combines convolution operations with self-attention to enhance feature extraction from micro-Doppler maps, and it employs a Residual Squeeze-and-Excitation (RES-SE) block to reduce the computational load. Experimental results on two human activity datasets demonstrated the method's advantages in expressiveness and computing efficiency over traditional approaches.

Study: Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer. Image credit: Generated using DALL·E 3

Background

HAR has diverse applications in healthcare, smart homes, security, and autonomous driving. HAR approaches fall into two categories: visual-based approaches, which rely on optical cameras, and non-visual sensor-based approaches, which employ sensors such as radar. Radar-based HAR, which exploits micro-Doppler features, has attracted attention for its adaptability and privacy protection. Researchers have explored both traditional methods and deep learning approaches to address the challenges of embedded applications, and there is a growing consensus that lightweight solutions are crucial for good performance on such resource-constrained platforms.

Previous HAR research divided data sources into visual-based and non-visual sensor-based methods, with radar-based HAR gaining increasing attention. Traditional HAR approaches struggled with complex human activities, which led to the adoption of deep learning techniques such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers. Hybrid networks and attention mechanisms improved recognition precision and accuracy, while self-attention helped address variation in radar images.

Methodology

Radar-based HAR with LH-ViT Framework: The process begins with a millimeter-wave radar that collects echoes from a moving human body; after dechirp processing, these echoes yield multi-channel intermediate frequency (IF) signals. The signals then undergo a Two-Dimensional Fast Fourier Transform (2D FFT), which efficiently concentrates the signal energy in the range-angle plane. Static clutter is then suppressed using a phase average cancellation method.
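The two steps above can be illustrated with a minimal NumPy sketch. It assumes a data layout of (frames, receive channels, fast-time samples) with one chirp per frame, and it models the clutter suppression as subtracting the across-frame mean of every complex bin; the paper's exact "phase average cancellation" may differ. The function names are hypothetical.

```python
import numpy as np

# Illustrative sketch only: helper names and the data layout are hypothetical.


def range_angle_maps(iq: np.ndarray, n_angle_bins: int = 64) -> np.ndarray:
    """2D FFT of dechirped IF samples, computed frame by frame.

    iq has the assumed layout (frames, channels, samples). The FFT over
    fast-time samples yields range bins; the zero-padded FFT across receive
    channels yields angle bins.
    Returns complex maps of shape (frames, n_angle_bins, samples).
    """
    range_fft = np.fft.fft(iq, axis=2)                          # fast time -> range
    angle_fft = np.fft.fft(range_fft, n=n_angle_bins, axis=1)   # channels -> angle
    return np.fft.fftshift(angle_fft, axes=1)                   # boresight in the middle


def suppress_static_clutter(maps: np.ndarray) -> np.ndarray:
    """Subtract the across-frame (slow-time) mean of every complex bin.

    A stationary reflector has the same amplitude and phase in every frame,
    so it equals its own mean and cancels; a moving target's phase rotates
    from frame to frame, so most of its energy survives.
    """
    return maps - maps.mean(axis=0, keepdims=True)


# Example: one static scene, then the same scene with a per-frame phase rotation.
rng = np.random.default_rng(0)
scene = rng.standard_normal((1, 4, 32)) + 1j * rng.standard_normal((1, 4, 32))
static_iq = np.repeat(scene, 16, axis=0)                                     # 16 identical frames
moving_iq = scene * np.exp(1j * 0.5 * np.pi * np.arange(16))[:, None, None]  # +90 degrees per frame
static_out = suppress_static_clutter(range_angle_maps(static_iq))
moving_out = suppress_static_clutter(range_angle_maps(moving_iq))
```

In this toy case the static scene cancels completely, while the rotating one passes through unchanged because its phasors average to zero over the 16 frames.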

Targets are detected with a two-dimensional constant false alarm rate (2D-CFAR) method. Once the target bin is detected, its data from successive frames are combined into a slow-time vector. This vector is then processed with a short-time Fourier transform (STFT) to generate the Micro-Doppler Map (MDM). The normalized MDM serves as the input to the LH-ViT network for efficient HAR.
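A rough sketch of this detection-to-MDM chain follows. It uses cell-averaging CFAR, which is only one common 2D-CFAR variant; the summary does not state which variant or which guard/training sizes and threshold the authors chose, so those values are assumptions. The STFT window, hop, dB scaling, and min-max normalization are likewise illustrative choices, and the function names are hypothetical.

```python
import numpy as np

# Illustrative sketch only: CFAR variant, parameters and names are assumptions.


def ca_cfar_2d(power, guard=1, train=4, scale=8.0):
    """Cell-averaging 2D CFAR on a power map (e.g. |range-angle map|**2).

    For each cell under test, the noise level is the mean power of the
    training cells in a square window, excluding the guard cells and the cell
    itself. A detection is declared when power > scale * noise. Cells too
    close to the border to have a full window are skipped.
    """
    rows, cols = power.shape
    outer = guard + train
    detections = np.zeros_like(power, dtype=bool)
    for r in range(outer, rows - outer):
        for c in range(outer, cols - outer):
            window = power[r - outer:r + outer + 1, c - outer:c + outer + 1]
            inner = power[r - guard:r + guard + 1, c - guard:c + guard + 1]
            noise = (window.sum() - inner.sum()) / (window.size - inner.size)
            detections[r, c] = power[r, c] > scale * noise
    return detections


def micro_doppler_map(slow_time, win_len=32, hop=8):
    """STFT of the target bin's slow-time vector -> normalized MDM.

    Returns an array of shape (win_len, n_windows): Doppler bins (zero
    Doppler centred) by time, in dB, min-max scaled to [0, 1].
    """
    window = np.hanning(win_len)
    starts = range(0, len(slow_time) - win_len + 1, hop)
    frames = np.stack([slow_time[s:s + win_len] * window for s in starts], axis=1)
    spec = np.fft.fftshift(np.fft.fft(frames, axis=0), axes=0)
    db = 20 * np.log10(np.abs(spec) + 1e-12)            # avoid log(0)
    return (db - db.min()) / (db.max() - db.min() + 1e-12)


# Example: one bright cell on a flat background, and a constant Doppler tone.
power = np.ones((20, 20))
power[10, 12] = 100.0
det = ca_cfar_2d(power)
tone = np.exp(1j * 2 * np.pi * 0.125 * np.arange(256))  # 0.125 cycles per frame
mdm = micro_doppler_map(tone)
```

Here only the bright cell is detected, and the tone appears as a horizontal line in the MDM at Doppler bin 20 (bin 4 above zero Doppler after the shift). A real human target would instead trace time-varying micro-Doppler signatures from limb motion.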

Feature Extraction Network: The feature extraction network uses a pyramid structure to capture multi-scale micro-Doppler features in the MDM. Each pyramid level employs a pair of RES-SE modules: the first extracts micro-Doppler features at the current scale, while the second downsamples the feature map by using a larger stride. These modules use a residual structure built from 1x1 convolution, Batch Normalization, and Depthwise Separable Convolution (DSC) to extract features efficiently. An SE block, a lightweight channel attention mechanism, processes the DSC output and increases the network's sensitivity along the channel dimension, emphasizing channels that carry more separable information while suppressing less useful ones.
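To make the RES-SE idea concrete, here is a hedged NumPy forward pass with random weights, following the ordering described above (1x1 convolution, then DSC, then SE on the DSC output, plus a residual shortcut). The expansion factor, SE reduction ratio, ReLU placement, and the rule of using the shortcut only when stride is 1 and shapes match are assumptions borrowed from common MobileNet-style blocks, not details confirmed by the paper. Batch Normalization is omitted because at inference time it can be folded into the preceding convolution's weights.

```python
import numpy as np

# Illustrative sketch only: a hypothetical RES-SE forward pass, not the authors' code.
rng = np.random.default_rng(0)


def pointwise(x, w):
    """1x1 convolution: x (C_in, H, W), w (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def depthwise3x3(x, k, stride=1):
    """3x3 depthwise convolution (one filter per channel), zero padding 1.

    Computed at stride 1 and then subsampled, which gives the same result
    as a strided convolution with this padding.
    """
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for i in range(3):
        for j in range(3):
            out += k[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out[:, ::stride, ::stride]


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(x, w1, w2):
    """Squeeze (global average pool) and excite (FC, ReLU, FC, sigmoid)."""
    squeeze = x.mean(axis=(1, 2))                   # (C,) channel descriptor
    excite = sigmoid(w2 @ relu(w1 @ squeeze))       # (C,) weights in (0, 1)
    return x * excite[:, None, None], excite        # rescale each channel


def res_se(x, p, stride=1):
    h = relu(pointwise(x, p["expand"]))             # 1x1 conv (BN folded in)
    h = relu(depthwise3x3(h, p["dw"], stride))      # DSC, depthwise step
    h = pointwise(h, p["project"])                  # DSC, pointwise step
    h, _ = se_block(h, p["se1"], p["se2"])          # channel attention on DSC output
    if stride == 1 and h.shape == x.shape:
        h = h + x                                   # residual shortcut
    return h


def init_params(c_in, c_out, expand=2, reduction=4):
    """Random, untrained weights; useful only for checking shapes."""
    c_mid, c_se = c_in * expand, max(c_out // reduction, 1)
    return {
        "expand": rng.standard_normal((c_mid, c_in)) * 0.1,
        "dw": rng.standard_normal((c_mid, 3, 3)) * 0.1,
        "project": rng.standard_normal((c_out, c_mid)) * 0.1,
        "se1": rng.standard_normal((c_se, c_out)) * 0.1,
        "se2": rng.standard_normal((c_out, c_se)) * 0.1,
    }
```

One pyramid level would then be `res_se(x, init_params(8, 8), stride=1)` followed by `res_se(..., init_params(8, 16), stride=2)`: the stride-2 module halves the height and width, which is how a pyramid moves to the next, coarser scale.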

Feature Enhancement Network: The feature enhancement network eliminates background noise interference and emphasizes micro-Doppler features related to human behavior through cross-stacked Radar-ViT and RES-SE modules. This hybrid structure simplifies local representation and fusion modules, creating a shallow, narrow, lightweight network. The stacked global representation modules with multi-head attention allow the network to capture rich feature information from different representation subspaces. Radar-ViT divides the feature map into non-overlapping cells and applies multi-head attention to capture the global micro-Doppler features. The combination of Radar-ViT and RES-SE modules ensures effective feature enhancement.
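The global step of Radar-ViT can be sketched as follows: split the feature map into non-overlapping cells, treat each flattened cell as one token, run multi-head self-attention across tokens, and fold the result back into a map. The assumption that each cell becomes a single token, the residual connection, and the cell size and head count are illustrative; the actual Radar-ViT token layout and its local representation and normalization layers may differ. All names are hypothetical.

```python
import numpy as np

# Illustrative sketch only: token layout, sizes and names are assumptions.


def unfold(x, p):
    """Split x (C, H, W) into non-overlapping p x p cells -> tokens (N, C*p*p)."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be multiples of the cell size"
    t = x.reshape(c, h // p, p, w // p, p).transpose(1, 3, 0, 2, 4)
    return t.reshape((h // p) * (w // p), c * p * p)


def fold(tokens, c, h, w, p):
    """Inverse of unfold: tokens (N, C*p*p) -> feature map (C, H, W)."""
    t = tokens.reshape(h // p, w // p, c, p, p).transpose(2, 0, 3, 1, 4)
    return t.reshape(c, h, w)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(tokens, wq, wk, wv, wo, heads):
    """Scaled dot-product multi-head self-attention over N tokens of size D."""
    n, d = tokens.shape
    dh = d // heads

    def split(m):                                   # (N, D) -> (heads, N, dh)
        return m.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)        # concatenate heads
    return out @ wo


def global_representation(x, p, weights, heads):
    """Cells -> tokens -> multi-head attention (with residual) -> feature map."""
    c, h, w = x.shape
    tokens = unfold(x, p)
    tokens = tokens + multi_head_attention(tokens, *weights, heads=heads)
    return fold(tokens, c, h, w, p)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))                  # C=8, H=W=8, cells of 2x2
weights = [rng.standard_normal((32, 32)) * 0.1 for _ in range(4)]  # D = 8*2*2
y = global_representation(x, 2, weights, heads=4)
```

Because every token can attend to every other token, each cell's output depends on the whole map, which is what lets this stage capture global micro-Doppler structure that small convolution kernels miss.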

The Radar-ViT output passes through a point-wise convolution and is then concatenated with the module's input feature map. This fusion approach enhances information propagation, accelerates training, and improves recognition accuracy. The concatenated features are further refined by subsequent RES-SE modules, making the LH-ViT model a comprehensive and efficient solution for radar-based HAR.
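A minimal sketch of this fusion step, assuming (as in MobileViT-style hybrids) that the concatenation is with the feature map entering the Radar-ViT module and that it happens along the channel axis. The function names and shapes are hypothetical.

```python
import numpy as np

# Illustrative sketch only: names and shapes are hypothetical.


def pointwise(x, w):
    """1x1 (point-wise) convolution: x (C_in, H, W), w (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def fuse(block_input, vit_output, w_pw):
    """Point-wise conv on the Radar-ViT output, then channel-wise concatenation
    with the module input. The result has C_in + C_pw channels and would be
    handed to the next RES-SE module for further refinement."""
    projected = pointwise(vit_output, w_pw)
    return np.concatenate([block_input, projected], axis=0)


rng = np.random.default_rng(0)
block_in = rng.standard_normal((8, 16, 16))
vit_out = rng.standard_normal((8, 16, 16))
fused = fuse(block_in, vit_out, rng.standard_normal((8, 8)))
```

Concatenating rather than adding keeps the original local features intact next to the globally enhanced ones, leaving it to the following RES-SE module's 1x1 convolution to learn how to mix them.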

Findings

The research used two distinct radar datasets, one acquired with a C-band radar and the other with a mmWave radar, to evaluate the LH-ViT network's performance in HAR. These datasets covered a range of human activities, and LH-ViT outperformed various state-of-the-art networks in accuracy, parameter efficiency, and inference time.

LH-ViT excelled in recognizing individual activities and displayed strong performance in subject-independent splits, underlining its adaptability and efficiency for radar-based HAR through precise Micro-Doppler feature extraction. The LH-ViT network emerged as a promising and efficient solution for radar-based HAR, offering robust recognition capabilities even in scenarios with individual variations and demonstrating its potential to contribute to a wide range of applications in fields such as intelligent healthcare, smart homes, security systems, and autonomous driving.

Summary

To sum up, this study introduced the LH-ViT network for HAR based on radar micro-Doppler features. After preprocessing, the LH-ViT network achieved recognition accuracies of 99.7% on the authors' self-built dataset and 92.1% on the public dataset. Extensive investigation of the network's architecture identified an optimal structure, which consistently outperformed other widely used networks and HAR networks reported in the literature.

The LH-ViT network meets stringent requirements for accuracy and real-time performance in HAR and holds significant promise for embedded applications. Notably, however, the study focuses on recognizing single actions under relatively ideal data collection conditions.

Future directions include enhancing and diversifying datasets, refining radar signal processing algorithms, and optimizing deep learning network structures to improve radar-based HAR performance in the context of complex and continuous human activities.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2023, October 25). Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer. AZoAi. Retrieved on November 21, 2024 from https://www.azoai.com/news/20231025/Efficient-Radar-Based-Human-Activity-Recognition-with-Lightweight-Hybrid-Vision-Transformer.aspx.

  • MLA

    Chandrasekar, Silpaja. "Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer". AZoAi. 21 November 2024. <https://www.azoai.com/news/20231025/Efficient-Radar-Based-Human-Activity-Recognition-with-Lightweight-Hybrid-Vision-Transformer.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer". AZoAi. https://www.azoai.com/news/20231025/Efficient-Radar-Based-Human-Activity-Recognition-with-Lightweight-Hybrid-Vision-Transformer.aspx. (accessed November 21, 2024).

  • Harvard

    Chandrasekar, Silpaja. 2023. Efficient Radar-Based Human Activity Recognition with Lightweight Hybrid Vision Transformer. AZoAi, viewed 21 November 2024, https://www.azoai.com/news/20231025/Efficient-Radar-Based-Human-Activity-Recognition-with-Lightweight-Hybrid-Vision-Transformer.aspx.
