Advancing sEMG Hand Signal Classification
Surface Electromyography (sEMG) captures the tiny electrical signals generated by our muscles when they contract. These signals are at the heart of many prosthetic control systems and human–computer interfaces. In this report, we explore two main strategies:
- Windowed FFT-based feature extraction: breaking the signal into fixed-length segments and analyzing frequency components
- Non-windowed modeling: feeding raw signal samples directly into machine learning models
Our goal is to understand how each approach balances accuracy and processing speed when classifying hand gestures.
Accurate and fast gesture recognition is crucial for applications like real-time prosthetic devices, where even small delays or misclassifications can impact the user experience. By comparing a traditional signal-processing pipeline against newer, end-to-end learning methods, we aim to find the sweet spot between performance and complexity.
Contents
- Problem Statement
- Data Source
- Methodology
- Evaluation Technique
- Results
- Conclusion & Future Work
- References
Problem Statement
Surface Electromyography measures the electrical activity produced by muscle contractions. Windowing segments signals into fixed intervals for robust feature extraction but introduces computational overhead and fixed temporal boundaries. Non-windowed methods model each timestep directly, potentially reducing latency. This study evaluates both approaches to determine their impact on classification accuracy and inference time.
Data Source
The dataset utilized in this research is the “EMG Data for Gestures,” available through the UCI Machine Learning Repository. It consists of raw electromyographic (EMG) signals recorded from 36 subjects using a MYO Thalmic bracelet worn on the forearm. The bracelet’s eight evenly distributed sensors capture myographic signals transmitted via Bluetooth. Each subject performed two series of six or seven static hand gestures, each held for 3 seconds followed by a 3-second rest period.
The data files contain ten columns: time (in milliseconds), eight sEMG channels, and a class label indicating the gesture performed. Class labels are defined as:
- 0 — unmarked data
- 1 — hand at rest
- 2 — hand clenched in a fist
- 3 — wrist flexion
- 4 — wrist extension
- 5 — radial deviation
- 6 — ulnar deviation
- 7 — extended palm (performed by a subset of subjects)
For simplicity, data from all subjects were concatenated across recordings, focusing on a generalized classification problem rather than individual differences.
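As an illustration, per-file loading and concatenation could look like the following minimal pandas sketch. The column names and the tab-separated layout are assumptions based on the ten-column description above, and the file contents here are synthetic stand-ins for the real recordings:

```python
import io
import pandas as pd

# Hypothetical column layout matching the dataset description:
# time (ms), eight sEMG channels, and a gesture class label.
COLUMNS = ["time"] + [f"ch{i}" for i in range(1, 9)] + ["class"]

def load_recording(source) -> pd.DataFrame:
    """Read one tab-separated recording into a DataFrame."""
    return pd.read_csv(source, sep="\t", header=None, names=COLUMNS)

# Two tiny synthetic recordings standing in for real files on disk.
raw = ("1\t0.1\t0.2\t0.3\t0.4\t0.5\t0.6\t0.7\t0.8\t1\n"
       "2\t0.2\t0.1\t0.3\t0.4\t0.5\t0.6\t0.7\t0.8\t1\n")
df = pd.concat(
    [load_recording(io.StringIO(raw)), load_recording(io.StringIO(raw))],
    ignore_index=True,
)
print(df.shape)  # (4, 10): four samples, ten columns
```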
Methodology
This section details both pipelines and their mathematical formulations. We compare a windowing-based signal-processing pipeline against a non-windowing strategy that models raw samples directly.
1. Experimental Setup
All models are trained and evaluated on a MacBook M3 Pro using Metal Performance Shaders (MPS) for GPU acceleration. Data are standardized and random seeds fixed to ensure reproducibility.
2. Windowing Approach
a. Segmentation into Windows
Let the raw dataset be $\{(t_i, x_i, y_i)\}_{i=1}^n$, where
- $t_i$ is the timestamp (ms),
- $x_i\in\mathbb{R}^M$ is the $M$-channel sEMG vector, and
- $y_i$ is the gesture label.
Applying a sliding window of length $W$ yields:
\[X^{(j)} = [\,x_i, x_{i+1}, \dots, x_{i+W-1}\,], \quad Y^{(j)} = [\,y_i, y_{i+1}, \dots, y_{i+W-1}\,],\] for $j=1,\dots,N$, where $N$ is the number of windows.
(Figure: sliding-window segmentation. [Image source](https://www.danorlandoblog.com/use-the-sliding-window-pattern-to-solve-problems-in-javascript/))
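The segmentation above can be sketched with NumPy's `sliding_window_view`. The stride of $W/2$ (50% overlap) is an assumption for illustration; the report does not fix a stride:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy signal: n samples of an M-channel sEMG stream (synthetic values).
n, M, W = 100, 8, 16
x = np.random.randn(n, M)

# Take every (W/2)-th window for 50% overlap (an illustrative choice).
step = W // 2
windows = sliding_window_view(x, window_shape=W, axis=0)[::step]
print(windows.shape)  # (N, M, W) = (11, 8, 16)
```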
b. Frequency-Domain Transformation (FFT)
For each channel $m=1,\dots,M$, perform a Discrete Fourier Transform on each windowed signal $x_m^{(j)}(n)$:
\[X_m^{(j)}(k) = \sum_{n=0}^{W-1} x_m^{(j)}(n)\, e^{-i 2\pi k n / W}, \quad k = 0, \dots, K.\]
Record magnitude and phase of each bin:
\[A_m^{(j)}(k) = \bigl|X_m^{(j)}(k)\bigr|, \qquad \phi_m^{(j)}(k) = \arg X_m^{(j)}(k).\]
Stacking across $M$ channels yields a feature vector $H^{(j)}\in\mathbb{R}^{2M(K+1)}$, where $K=\lfloor W/2\rfloor$.
(Figure: in-phase, quadrature, and phase components of a complex number. [Image source](https://www.researchgate.net/figure/Three-Components-of-a-Complex-Number-In-Phase-Quadrature-and-Phase-Incoming-radar-wave_fig2_332511933))
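A minimal NumPy sketch of this feature construction, using the real-valued FFT so that exactly $K+1 = \lfloor W/2\rfloor + 1$ bins are kept per channel:

```python
import numpy as np

# One window: W samples across M channels (synthetic values).
W, M = 16, 8
window = np.random.randn(W, M)

K = W // 2                               # highest retained frequency bin
spectrum = np.fft.rfft(window, axis=0)   # complex array of shape (K + 1, M)
magnitude = np.abs(spectrum)
phase = np.angle(spectrum)

# Stack magnitude and phase across all channels into one feature vector.
features = np.concatenate([magnitude.ravel(), phase.ravel()])
print(features.shape)  # 2 * M * (K + 1) = 144
```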
c. CNN for Frequency-Domain Feature Extraction
Treating frequency bins as a “spatial” axis and channels as input planes, a 1D convolution with filters $W^{(\ell)}\in\mathbb{R}^{F\times C_{\mathrm{in}}\times C_{\mathrm{out}}}$ and bias $b^{(\ell)}$ computes
\[h^{(\ell)}_{c_{\mathrm{out}}}(k) = f\Bigl(\sum_{c_{\mathrm{in}}=1}^{C_{\mathrm{in}}} \sum_{u=1}^{F} W^{(\ell)}_{u,\,c_{\mathrm{in}},\,c_{\mathrm{out}}}\, h^{(\ell-1)}_{c_{\mathrm{in}}}(k+u-1) + b^{(\ell)}_{c_{\mathrm{out}}}\Bigr),\]
where $f$ is LeakyReLU:
\(f(x)=\max(0,x)+\alpha\,\min(0,x).\)
Each block also includes BatchNorm, MaxPool and Dropout.
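One such block could be sketched in PyTorch as follows. The channel counts, kernel size, pooling width, and dropout rate are illustrative choices, not values from the report:

```python
import torch
import torch.nn as nn

# One convolutional block: Conv1d -> LeakyReLU -> BatchNorm -> MaxPool -> Dropout.
block = nn.Sequential(
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
    nn.LeakyReLU(negative_slope=0.01),   # f(x) = max(0, x) + alpha * min(0, x)
    nn.BatchNorm1d(32),
    nn.MaxPool1d(kernel_size=2),
    nn.Dropout(p=0.3),
)

# Batch of 4 inputs: 16 input planes over 9 frequency bins.
x = torch.randn(4, 16, 9)
y = block(x)
print(y.shape)  # torch.Size([4, 32, 4]) after pooling halves the bin axis
```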
d. Deep Cross Network (DCN)
Given embedding $x^{(0)}$, each cross layer $\ell$ updates
\[x^{(\ell+1)} = x^{(0)} \circ \bigl(W^{(\ell)} x^{(\ell)} + b^{(\ell)}\bigr) + x^{(\ell)},\]
where $\circ$ is the Hadamard (elementwise) product.
(Figure: Deep Cross Network architecture. [Image source](https://arxiv.org/pdf/2008.13535))
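The cross-layer update can be written as a small PyTorch module. This is a sketch of the DCN-v2 form; the embedding dimension is illustrative:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One cross layer: x_{l+1} = x0 * (W x_l + b) + x_l (elementwise product)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl  # Hadamard product plus residual

layer = CrossLayer(dim=64)
x0 = torch.randn(4, 64)
out = layer(x0, x0)  # the first layer uses x^(0) for both arguments
print(out.shape)     # torch.Size([4, 64])
```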
e. Multi-Layer Perceptron (MLP)
The final DCN output $x^{(L)}$ is passed through fully connected layers to produce class probabilities:
\[\hat y = \mathrm{softmax}\bigl(W_2\, f(W_1\, x^{(L)} + b_1) + b_2\bigr).\]
f. Model Architecture Overview
The complete architecture for the windowing approach is: windowed sEMG → FFT magnitude/phase features $H^{(j)}$ → stacked 1D CNN blocks (Conv, LeakyReLU, BatchNorm, MaxPool, Dropout) → Deep Cross Network → MLP → softmax over gesture classes.
3. Non-Windowing Approach
a. Raw Feature Input
Each instantaneous vector $x\in\mathbb{R}^M$ is fed to the model directly, without any segmentation.
b. Random Forest Classifier
As a baseline, average class probabilities from $T$ trees:
\[P(y=c\mid x) = \frac{1}{T} \sum_{t=1}^T P_t(y=c\mid x).\]
(Figure: Random Forest architecture. [Image source](https://www.researchgate.net/figure/Architecture-of-the-Random-Forest-algorithm_fig1_337407116))
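A short scikit-learn sketch showing that `RandomForestClassifier.predict_proba` is exactly this per-tree average. The data here are synthetic stand-ins for instantaneous 8-channel sEMG samples:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 200 instantaneous 8-channel samples, 3 gesture labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 3, size=200)

clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average per-tree class probabilities, matching the formula above.
manual = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
print(np.allclose(manual, clf.predict_proba(X)))
```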
c. Outer-Product Neural Network (OPNN)
Compute the outer product $O = x\,x^\top\in\mathbb{R}^{M\times M}$, flatten to $z\in\mathbb{R}^{M^2}$, then
\[\hat y = \mathrm{softmax}\bigl(W_2\,f(W_1\,z + b_1) + b_2\bigr).\]
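A sketch of the OPNN in PyTorch; the hidden width and class count are illustrative, not values from the report:

```python
import torch
import torch.nn as nn

M, HIDDEN, CLASSES = 8, 32, 7  # illustrative sizes

class OPNN(nn.Module):
    """Outer-product network sketch: O = x x^T, flattened, then a 2-layer MLP."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(M * M, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, CLASSES)
        self.act = nn.LeakyReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outer = torch.einsum("bi,bj->bij", x, x)  # batched outer product (B, M, M)
        z = outer.flatten(start_dim=1)            # (B, M^2)
        return torch.softmax(self.fc2(self.act(self.fc1(z))), dim=1)

probs = OPNN()(torch.randn(4, M))
print(probs.shape)  # (4, 7); each row is a distribution summing to 1
```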
4. Loss & Optimization
All deep models use cross-entropy loss:
\[\mathcal{L}(\theta) = -\sum_{c} y_c \,\log \hat y_c,\] optimized with Adam.
(See the [PyTorch Adam optimizer documentation](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).)
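A minimal training-loop sketch of this loss and optimizer in PyTorch. The toy data, model size, learning rate, and step count are all illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy separable task: the label is the sign of the first feature.
X = torch.randn(256, 8)
y = (X[:, 0] > 0).long()

model = nn.Linear(8, 2)                  # stand-in for the deep models above
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()          # cross-entropy, as in the report

first_loss = None
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    if first_loss is None:
        first_loss = loss.item()
    loss.backward()
    opt.step()

print(first_loss, loss.item())  # loss drops on this separable toy task
```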
Evaluation Technique
Accuracy and inference time (ms/sample) are measured on an M3 Pro with Metal Performance Shaders. Accuracy is defined as:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.\]
Results
Conclusion & Future Work
Non-windowed deep learning offers higher accuracy and lower latency, suggesting that time-frequency segmentation may not be necessary for static gesture recognition. Future work should explore dynamic gestures and cross-user generalization.
References
- Olmo & Domingo (2020). EMG Characterization. Materials, 13(24), 5815.
- Raez et al. (2006). EMG Signal Analysis. Biological Procedures Online, 8, 11–35.
- Rani et al. (2023). sEMG & AI. IEEE Access, 11, 105140–105169.
- Asogbon et al. (2018). Window Conditioning in EMG. IEEE CBS.
- Krilova et al. (2018). EMG Data for Gestures. UCI Repository.