TY - GEN
T1 - A Novel Multimodal LLM-Driven RF Sensing Method for Human Activity Recognition
AU - Khan, Muhammad Zakir
AU - Bilal, Muhammad
AU - Abbas, Hasan
AU - Imran, Muhammad
AU - Abbasi, Qammer H.
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Human activity recognition (HAR) using radio frequency (RF) sensing has attracted significant attention due to its unobtrusive and privacy-preserving nature. Traditional HAR methods rely on task-specific deep neural networks trained on large labeled datasets, which can be time-consuming and resource-intensive. To address these challenges, we propose a novel approach that leverages multimodal large language models (MLLMs) for RF-based HAR. Specifically, we fine-tune Florence-2, a pre-trained vision-language model (VLM), on RF spectrogram data from the open-source Xethru Radar dataset. Our approach frames activity detection as a question-answering task, allowing the model to associate radar spectrogram features with specific activity classes through prompt-based interactions. Testing on three distinct activities (sitting, bending, and crawling), our fine-tuned model achieves 98% classification accuracy with minimal misclassifications. This work demonstrates the effectiveness of integrating VLMs with RF sensing data for scalable and adaptive HAR applications, opening new research directions for unified, prompt-based models in complex multimodal sensing tasks.
KW - Multimodal Vision-Language Models
KW - Radio Frequency Sensing
KW - Visual Signal Processing
UR - https://www.scopus.com/pages/publications/105007423345
U2 - 10.1109/ICMAC64768.2025.11003262
DO - 10.1109/ICMAC64768.2025.11003262
M3 - Conference contribution
AN - SCOPUS:105007423345
T3 - 2025 2nd International Conference on Microwave, Antennas and Circuits, ICMAC 2025
BT - 2025 2nd International Conference on Microwave, Antennas and Circuits, ICMAC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd International Conference on Microwave, Antennas and Circuits, ICMAC 2025
Y2 - 17 April 2025 through 18 April 2025
ER -