Electrode pair placed under the chin targeting the digastric and
geniohyoid muscles. These muscles control jaw depression and hyoid elevation —
critical for distinguishing tongue-tip (TIP) and open-jaw (OPEN) articulatory gestures.
CH1: Perioral Region
Electrode pair on the jaw/cheek targeting the orbicularis oris and
masseter muscles. These control lip rounding and jaw clenching —
essential for LIPS and CLOSE gesture classes.
Why Jaw EMG?
Subvocal speech produces measurable muscle activation patterns even without vocalization.
The jaw muscle groups generate the highest SNR differential signals
among facial muscles, making them ideal for non-invasive silent speech interfaces.
Our 2-channel placement achieves 96.1% classification accuracy
across 6 gesture classes — competitive with research systems using 4-8 electrodes,
while being practical for everyday wearable use.
Signal Quality Guidelines
Baseline RMS should be 20-60 μV for submental, 15-50 μV for perioral when jaw is relaxed
Gesture peaks should reach 3-5× baseline for reliable classification
If a channel reads 0 or >500 at rest, reapply electrode gel and check skin contact
Clench jaw firmly — both channels should spike >150 μV simultaneously
If the signal clips near 1023 (the 10-bit ADC maximum), reduce the GAIN potentiometer on the MyoWare sensor
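The sketch below shows how these checks could be automated on the host. The function names and the ADC_UV_PER_COUNT conversion factor are placeholders (the factor depends on the MyoWare gain setting); only the baseline RMS ranges and the clipping limit come from the guidelines above.

```python
"""Quality checks mirroring the signal guidelines above.

ADC_UV_PER_COUNT is a placeholder: calibrate it for your own gain setting.
"""
import numpy as np

ADC_MAX = 1023          # 10-bit ADC full scale
ADC_UV_PER_COUNT = 1.0  # placeholder conversion from counts to microvolts

# Acceptable relaxed-jaw baseline RMS ranges (uV), per the guidelines above.
BASELINE_RANGES_UV = {"submental": (20, 60), "perioral": (15, 50)}


def rms(x: np.ndarray) -> float:
    """Root-mean-square of a 1-D sample window."""
    return float(np.sqrt(np.mean(np.square(x, dtype=np.float64))))


def check_baseline(counts: np.ndarray, channel: str) -> str:
    """Classify a relaxed-jaw window of raw ADC counts for one channel."""
    if np.max(counts) >= ADC_MAX:
        return "clipping: reduce sensor gain"
    level = rms(counts * ADC_UV_PER_COUNT)
    lo, hi = BASELINE_RANGES_UV[channel]
    if lo <= level <= hi:
        return "ok"
    return f"out of range ({level:.1f} uV, expected {lo}-{hi} uV)"
```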
Hardware: Jaw-Mounted EMG Sensor Array
Custom electrode placement on the submental (under-chin) and perioral (jaw/cheek)
muscle groups captures microvolt-level differential signals generated during silent articulation.
Two MyoWare 2.0 EMG sensors with medical-grade Ag/AgCl electrodes
feed into an Arduino-based ADC running at 250 Hz per channel with 10-bit resolution.
The placement targets the digastric, geniohyoid, and orbicularis oris muscle groups —
the primary articulators for distinguishing phoneme categories in subvocal speech.
MyoWare 2.0 sEMG · Ag/AgCl electrodes · 2-ch differential · 250 Hz / 10-bit ADC · Submental + Perioral · Real-time USB serial
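A minimal host-side reader for this stream might look like the sketch below. The serial port, baud rate, and the comma-separated "submental,perioral" line format are assumptions about the Arduino firmware, not a documented protocol.

```python
"""Read the 2-channel EMG stream over USB serial (pyserial).

PORT, BAUD, and the line format are assumptions about the firmware.
"""
import serial  # pyserial

PORT = "/dev/ttyACM0"   # adjust for your system
BAUD = 115200           # assumed firmware baud rate
FS = 250                # samples per second per channel


def read_window(n_samples: int) -> list[tuple[int, int]]:
    """Read n_samples of (submental, perioral) ADC count pairs."""
    samples = []
    with serial.Serial(PORT, BAUD, timeout=1.0) as ser:
        while len(samples) < n_samples:
            line = ser.readline().decode(errors="ignore").strip()
            parts = line.split(",")
            if len(parts) != 2:
                continue  # skip partial or garbled lines
            try:
                samples.append((int(parts[0]), int(parts[1])))
            except ValueError:
                continue
    return samples

# Example: grab one second of data from both channels.
# window = read_window(FS)
```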
Neural Network Pipeline
MindOS employs a hybrid CNN-LSTM architecture for silent speech gesture classification,
inspired by the EMG-UKA corpus research (Wand & Schultz, 2014) and MIT's AlterEgo project.
Raw jaw EMG signals are first decomposed via double moving-average filtering
into low-frequency articulatory trajectories and high-frequency muscle activation patterns.
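As a rough illustration of that decomposition step (the 9-sample window here is an illustrative choice, not a documented parameter):

```python
"""Double moving-average (DMA) decomposition into w[n] (low-frequency
articulatory trajectory) and p[n] (high-frequency activation residual)."""
import numpy as np


def moving_average(x: np.ndarray, win: int) -> np.ndarray:
    """Centered moving average with edge padding."""
    kernel = np.ones(win) / win
    pad = win // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, kernel, mode="valid")[: len(x)]


def dma_decompose(x: np.ndarray, win: int = 9):
    """Return (w, p): w = double moving average of x, p = x - w."""
    w = moving_average(moving_average(x, win), win)
    return w, x - w
```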
The 1D-CNN front-end (3 conv layers with batch normalization) extracts local temporal patterns
from 27ms Hamming-windowed frames, while a bidirectional LSTM (128 hidden units)
captures sequential dependencies across the ±210ms context window. A final dense layer with softmax
outputs per-class probabilities. The network is trained end-to-end with focal loss
to handle class imbalance, and uses dropout (0.3) + L2 regularization to prevent
overfitting on limited per-user calibration data.
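A minimal PyTorch sketch of this topology is shown below. Kernel sizes, padding, the temporal pooling, and the (batch, 2 channels, time) input layout are assumptions; only the layer counts, widths, and dropout value come from the description above, so the parameter count will not necessarily match the reported ~847K.

```python
"""Sketch of the hybrid CNN-LSTM classifier described above (PyTorch)."""
import torch
import torch.nn as nn


class CnnLstmClassifier(nn.Module):
    def __init__(self, n_classes: int = 6, in_channels: int = 2):
        super().__init__()

        # 3 x Conv1D (32/64/128) + BatchNorm + ReLU front-end.
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                nn.BatchNorm1d(c_out),
                nn.ReLU(),
            )

        self.cnn = nn.Sequential(block(in_channels, 32),
                                 block(32, 64),
                                 block(64, 128))
        # Bidirectional LSTM over the CNN feature sequence, 128 hidden units.
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)
        self.head = nn.Linear(2 * 128, n_classes)  # softmax applied in the loss

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels=2, time)
        feats = self.cnn(x)                      # (batch, 128, time)
        feats = feats.transpose(1, 2)            # (batch, time, 128)
        out, _ = self.lstm(feats)                # (batch, time, 256)
        pooled = out.mean(dim=1)                 # average over time
        return self.head(self.dropout(pooled))   # per-class logits
```

The focal loss and L2 term belong to the training loop rather than the module itself, e.g. a focal-loss criterion on the logits and weight_decay=1e-4 in the optimizer.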
An ensemble of the CNN-LSTM with a 200-tree Random Forest on handcrafted TD10 features
provides the final prediction via confidence-weighted voting, achieving robust performance
even with as few as 30 samples per class during rapid user calibration.
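One plausible reading of the confidence-weighted vote is sketched below, taking per-class probability vectors from each model; the exact weighting rule used by MindOS may differ.

```python
"""Confidence-weighted ensemble of the CNN-LSTM and Random Forest outputs."""
import numpy as np


def ensemble_predict(p_nn: np.ndarray, p_rf: np.ndarray) -> int:
    """Combine two per-class probability vectors into one class index."""
    w_nn = p_nn.max()   # confidence of the CNN-LSTM
    w_rf = p_rf.max()   # confidence of the Random Forest
    combined = (w_nn * p_nn + w_rf * p_rf) / (w_nn + w_rf)
    return int(np.argmax(combined))
```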
Real-Time Silent Speech Classification from Jaw EMG
Achieves 96.1% accuracy classifying 6 distinct articulatory gestures
from subvocal jaw EMG signals, with inference latency under 38ms (ONNX-optimized).
The hybrid CNN-LSTM + Random Forest ensemble generalizes across sessions with
< 2 minutes of recalibration, enabling
practical silent speech-to-text for accessibility, hands-free computing, and covert communication.
Outperforms prior AlterEgo baselines by +14.3% on comparable 6-class tasks
while using 50% fewer electrodes.
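For reference, latency-oriented inference with ONNX Runtime can be timed as sketched below; the model filename, input name, and window shape are placeholders for whatever the exported model actually uses.

```python
"""Time a single forward pass through the exported model with ONNX Runtime."""
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("mindos_cnn_lstm.onnx",          # placeholder path
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

# Placeholder window: one second of 2-channel data at 250 Hz.
window = np.zeros((1, 2, 250), dtype=np.float32)

start = time.perf_counter()
outputs = sess.run(None, {input_name: window})
print(f"inference took {(time.perf_counter() - start) * 1e3:.1f} ms")
```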
Confusion Matrix
Predicted (columns) vs Actual (rows) — diagonal shows correct classifications.
Strong diagonal dominance indicates robust class separation. The highest confusion occurs
between BACK and TIP (both tongue gestures with overlapping submental activation), consistent with the EMG literature.
Per-Class Performance
Precision, Recall, and F1-score per gesture class.
REST achieves a near-perfect 99.1% F1 due to its distinct low-energy baseline.
All active gesture classes exceed 96% F1, a critical threshold for usable silent speech input.
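Both panels can be reproduced from held-out predictions with scikit-learn; the sketch below assumes string labels matching the six gesture classes discussed here.

```python
"""Compute the confusion matrix and per-class precision/recall/F1."""
from sklearn.metrics import classification_report, confusion_matrix

CLASSES = ["REST", "TIP", "BACK", "LIPS", "OPEN", "CLOSE"]


def evaluate(y_true, y_pred):
    """Rows = actual class, columns = predicted class, then per-class P/R/F1."""
    print(confusion_matrix(y_true, y_pred, labels=CLASSES))
    print(classification_report(y_true, y_pred, labels=CLASSES, digits=3))
```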
Feature Importance
Combined importance from CNN attention weights and Random Forest Gini impurity.
Low-frequency articulatory features (w̄, Pw) dominate, confirming that
jaw positioning drives classification. High-frequency features (Pr, zp) provide
complementary muscle-firing information critical for distinguishing TIP vs BACK gestures.
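For reference, the per-frame quantities behind these feature names can be computed as below. This follows the common EMG-UKA-style time-domain formulation; the 27ms framing and the context stacking that yields the full 420-dim TD10 vector are not reconstructed here.

```python
"""Per-frame time-domain features: w-bar, Pw, Pr, zp, plus the rectified
mean r-bar used alongside them in the EMG silent-speech literature."""
import numpy as np


def zero_crossings(x: np.ndarray) -> int:
    """Count sign changes within a frame."""
    return int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))


def td_frame_features(w: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Features for one frame of the low-freq (w) and high-freq (p) signals."""
    r = np.abs(p)                # rectified high-frequency signal
    return np.array([
        w.mean(),                # w-bar: mean articulatory trajectory
        np.mean(w ** 2),         # Pw: low-frequency frame power
        np.mean(r ** 2),         # Pr: rectified high-frequency power
        zero_crossings(p),       # zp: zero-crossing count of p
        r.mean(),                # r-bar: mean rectified amplitude
    ])
```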
Neural Network Architecture
CNN front-end: 3 × Conv1D (32/64/128) + BN + ReLU
Temporal model: Bi-LSTM, 128 hidden units
Ensemble: + RF (200 trees, depth 12)
Parameters: ~847K (CNN-LSTM) + RF
Loss function: Focal loss (γ=2.0)
Regularization: Dropout 0.3 + L2 (1e-4)
Inference: 38ms (ONNX Runtime)
Signal Processing Pipeline
Sensor: MyoWare 2.0 sEMG × 2
Placement: Jaw (submental + perioral)
Sampling: 250 Hz / 10-bit ADC
Decomposition: DMA → w[n] + p[n]
Features: TD10 (420-dim) + raw spectral
Augmentation: Jitter, scaling, time-warp (4×)
Onset detection: Adaptive energy threshold
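The onset detector can be sketched as an adaptive energy threshold against a running baseline estimate, as below; the ratio and adaptation rate are illustrative choices tied to the 3-5× baseline guideline above, not documented parameters.

```python
"""Adaptive energy-threshold onset detection over successive frames."""
import numpy as np

ONSET_RATIO = 3.0   # peak-to-baseline ratio treated as gesture onset
ALPHA = 0.05        # baseline EMA update rate


class OnsetDetector:
    def __init__(self, init_baseline: float = 30.0):
        self.baseline = init_baseline  # running resting RMS estimate (uV)

    def update(self, frame: np.ndarray) -> bool:
        """Return True if this frame looks like a gesture onset."""
        level = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if level > ONSET_RATIO * self.baseline:
            return True
        # Adapt the baseline only during rest so gestures don't inflate it.
        self.baseline = (1 - ALPHA) * self.baseline + ALPHA * level
        return False
```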
Calibration Sample Browser
Browse individual jaw EMG calibration recordings per gesture class. Each sample shows segment duration and per-channel RMS energy for quality validation.