PhysioTwin — Hybrid Stress Detection

Introduction

Why stress detection needs a new approach.

Smartwatches such as the Apple Watch track physiological signals like heart rate (HR) and heart rate variability (HRV), but these signals are noisy and highly context-dependent: a fast heartbeat during a jog looks identical to one caused by a stressful meeting. Current tools either ignore this ambiguity or rely on standard ML models that assume homogeneous, consistently distributed data. That assumption fails here, because stress responses are highly individualized and vary substantially across people.

PhysioTwin takes a different approach — a 3-stage hybrid pipeline that filters out physical activity, adjusts for sleep quality, and computes a personalized HRV stress score based on your own baseline. The result is a "Body Battery": a simple 0–100 score that drains when you're stressed and recovers when you're calm, grounded in peer-reviewed science.

The Problem

Wearable stress features conflate exercise with stress and use one-size-fits-all thresholds that don't account for individual differences or sleep quality.

Our Approach

A 3-stage hybrid pipeline: ML for activity gating, rule-based sleep adjustment, and signal-processing for personalized HRV stress scoring.

The Output

A Body Battery energy score (0–100) that anyone can understand, with recovery recommendations that are checked against follow-up HealthKit measurements.

Self-reported stress levels across WISE protocol phases — Baseline, Cognitive, Aerobic, Anaerobic

Figure 1 — Self-reported stress across WISE dataset protocol phases. Cognitive tasks (Stroop/TMCT) show significantly elevated stress vs. baseline, confirming label validity of our training data.

Methods

How we built it.

Our pipeline processes wearable sensor data in three sequential stages. Each stage solves a specific problem that the next stage depends on — you can't score stress accurately without first removing exercise artifacts and accounting for sleep.

Built vs. reused: We built the full 3-stage pipeline, the iOS Digital Twin app, and all evaluation code from scratch. The Stage 1 Random Forest classifier, the sleep-modulation rule engine, and the DC/AC stress-scoring module are original implementations. We adapted established physiological formulas (PRSA, SDNN, RMSSD) from the literature and trained on the publicly available WISE dataset from PhysioNet. The CoreML export pipeline and HealthKit integration are also our own work.

Stage 1: Activity Gate. PHYSICAL vs COGNITIVE. ML · 93.6%

Stage 2: Sleep Quality. Threshold adjustment. Rules

Stage 3: Stress Score. Deceleration / Acceleration Capacity. Formula

PhysioTwin 3-stage hybrid pipeline: Activity Classifier → Sleep Adjustment → DC/AC Stress Scoring → Body Battery

Full architecture — Apple Watch → Stage 1 ML → Stage 2 Rules → Stage 3 Formula → Body Battery
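The stage ordering above can be sketched as a single function. This is a minimal illustration rather than the project's code: `run_pipeline`, the `Window` fields, and the three stage callables are hypothetical names, and each stage is passed in as a plain function so the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Window:
    """One analysis window of wearable data (field names are hypothetical)."""
    acc_magnitude: Sequence[float]    # accelerometer magnitude samples
    heart_rate: Sequence[float]       # heart rate samples in bpm
    rr_intervals_ms: Sequence[float]  # RR intervals in milliseconds


def run_pipeline(
    window: Window,
    is_physical: Callable[[Window], bool],    # Stage 1: activity gate
    sleep_threshold: Callable[[], float],     # Stage 2: today's threshold
    stress_score: Callable[[Window], float],  # Stage 3: 0-100 stress score
) -> Optional[dict]:
    """Run the three stages in order; return None when the gate fires."""
    # Stage 1: skip stress scoring entirely during physical activity.
    if is_physical(window):
        return None
    # Stage 2: last night's sleep sets how sensitive detection is today.
    threshold = sleep_threshold()
    # Stage 3: score stress, then compare against the adjusted threshold.
    score = stress_score(window)
    return {"score": score, "threshold": threshold, "stressed": score >= threshold}
```

Returning `None` for gated windows keeps "no score" distinct from "low stress", so downstream code cannot mistake a workout for a calm period.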

Activity Gate

Machine Learning

The first stage detects whether you're physically active. If you're exercising, we skip stress scoring entirely — your elevated heart rate during a run isn't stress, and treating it as such would produce false alarms.

Technical details

We trained a Random Forest classifier on accelerometer and heart rate data from 22 subjects in the WISE dataset. The model uses 4 engineered features and achieves 93.6% Leave-One-Subject-Out (LOSO) accuracy. Because each subject is held out in turn, this figure measures performance on users the model has not seen during training.
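As a rough sketch of the two ideas in this paragraph, the snippet below computes the four named features for one window and generates LOSO folds. It uses only the standard library. Two assumptions are ours and not stated in the source: the ACC features are taken over the magnitude of the 3-axis vector, and the standard deviations are population standard deviations. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def window_features(acc_xyz, hr):
    """Return the four Stage 1 features for one window.

    acc_xyz: list of (x, y, z) accelerometer samples.
    hr: list of heart rate samples in bpm.
    Assumption: ACC features are computed on the vector magnitude.
    """
    magnitude = [math.sqrt(x * x + y * y + z * z) for x, y, z in acc_xyz]
    return {
        "ACC_mean": mean(magnitude),
        "ACC_std": pstdev(magnitude),
        "HR_mean": mean(hr),
        "HR_std": pstdev(hr),
    }


def loso_folds(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx
```

With 22 subjects this yields 22 folds. No window from the held-out subject appears in the training indices, which is what separates LOSO from a random split over windows.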

The classifier distinguishes between cognitive (sitting) and physical (treadmill/sprint) activity states. It's exported to CoreML for on-device inference on Apple Watch.

This gating step is critical: Bonneval et al. (2025) showed that HRV measurements have up to 93% error during physical movement, making any stress score computed during exercise meaningless.

Feature Importance

ACC_mean: 45.7%

ACC_std: 34.2%

HR_mean: 12.8%

HR_std: 7.3%

2-class RF model (4 features) — movement dominates at 79.9% combined importance

Sleep Modulation

Rule-Based

Last night's sleep quality adjusts how sensitive your stress detection is today. If you slept poorly, your body is more vulnerable to stress — we lower the threshold to reflect that.

Technical details

We use rule-based logic rather than ML for this stage because sleep-stress relationships are well-established in the literature and don't require learning from data.

Sleep duration and quality scores (from HealthKit or Fitbit) are mapped to a threshold modifier that shifts the stress sensitivity for the day. Short or fragmented sleep results in a lower threshold — meaning milder physiological signals are flagged as stress.

This is supported by Apple's 2024 research showing that behavioral sleep data improves physiological predictions.
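A rule of this kind might look like the sketch below. Every number in it (the 6-hour cutoff, the quality cutoff of 50, the offsets, the base threshold of 60, and the clamping band) is a placeholder chosen for illustration, not a value from the project. The function name and the 0–100 quality scale are also assumptions.

```python
def sleep_threshold(hours_slept, quality, base_threshold=60.0):
    """Return today's stress threshold on the 0-100 score scale.

    All cutoffs and offsets are placeholder values for illustration.
    quality is assumed to be a 0-100 sleep quality score.
    """
    modifier = 0.0
    if hours_slept < 6.0:   # short sleep: make detection more sensitive
        modifier -= 10.0
    if quality < 50.0:      # poor or fragmented sleep: more sensitive again
        modifier -= 5.0
    # Clamp so the threshold never collapses to 0 or saturates at 100.
    return max(20.0, min(90.0, base_threshold + modifier))
```

A lower return value means milder physiological signals cross the threshold, matching the behavior described above for short or fragmented sleep.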

Stress Scoring

Signal Processing

The final stage computes your actual stress score by analyzing heart rate variability patterns and comparing them against your own personal baseline — not a population average.

Technical details

We compute three groups of HRV metrics from RR intervals (the time between heartbeats):

DC/AC — Deceleration/Acceleration Capacity via Phase-Rectified Signal Averaging (Bauer 2006). Velmovitsky et al. (2022) found these among the most informative HRV features for stress detection on Apple Watch data (N=33, 55-64% accuracy). Our own evaluation on WISE (N=22) showed trends in the expected direction (DC p=0.075) but did not reach statistical significance — motivating our hybrid approach with personal baselines rather than fixed thresholds.

SDNN — Standard deviation of RR intervals. Robust to the data gaps typical of wrist-worn sensors (Hernando 2018).

RMSSD — Root mean square of successive RR differences. Captures short-term parasympathetic activity.
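The metrics listed above can be computed from a list of RR intervals roughly as follows. This is a simplified sketch, not the project's module. The PRSA part uses the basic form of the Bauer 2006 method: anchors are beats that lengthen (DC) or shorten (AC) relative to the previous beat, and capacity is (X(0) + X(1) − X(−1) − X(−2)) / 4 over the averaged segments. The 5% artifact filter, the use of the sample standard deviation for SDNN, and all function names are assumptions.

```python
import math
from statistics import mean, stdev


def sdnn(rr):
    """Standard deviation of RR intervals (ms), sample form."""
    return stdev(rr)


def rmssd(rr):
    """Root mean square of successive RR differences (ms)."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return math.sqrt(mean(d * d for d in diffs))


def prsa_capacity(rr, decelerating=True, max_change=0.05):
    """DC (decelerating=True) or AC (False) via a basic PRSA.

    Anchors are beats whose RR interval is longer (DC) or shorter (AC) than
    the previous one; changes above max_change are skipped as likely artifacts.
    Returns None when no anchor is found.
    """
    segments = []
    for i in range(2, len(rr) - 1):
        prev, cur = rr[i - 1], rr[i]
        if abs(cur - prev) > max_change * prev:
            continue
        if (cur > prev) if decelerating else (cur < prev):
            segments.append((rr[i - 2], rr[i - 1], rr[i], rr[i + 1]))
    if not segments:
        return None
    # Average the aligned samples: X(-2), X(-1), X(0), X(1).
    x_m2, x_m1, x_0, x_1 = (mean(col) for col in zip(*segments))
    return (x_0 + x_1 - x_m1 - x_m2) / 4.0
```

With this sign convention DC comes out positive and AC negative, since the segments are centered on lengthening and shortening beats respectively.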

These metrics are compared against a rolling personal baseline built from each user's own calm-state data, producing a 0–100 stress score that feeds the Body Battery.
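One way to turn "compared against a rolling personal baseline" into a number is sketched below for a single HRV metric. The class name, the buffer size of 200 samples, and the logistic mapping to 0–100 are all assumptions made for illustration; the source does not specify how the comparison is mapped to a score.

```python
import math
from collections import deque
from statistics import mean, stdev


class PersonalBaseline:
    """Rolling store of one user's calm-state values for a single HRV metric."""

    def __init__(self, max_samples=200):
        self.values = deque(maxlen=max_samples)  # oldest samples drop out

    def add_calm_sample(self, value):
        self.values.append(value)

    def stress_score(self, value):
        """Map the drop below baseline to 0-100 (50 means 'at baseline')."""
        if len(self.values) < 2:
            return None  # not enough history to define a baseline
        mu, sigma = mean(self.values), stdev(self.values)
        if sigma == 0:
            return None  # a flat baseline gives no scale to compare against
        z = (mu - value) / sigma        # positive when HRV is below baseline
        z = max(-20.0, min(20.0, z))    # clamp to keep exp() well-behaved
        return 100.0 / (1.0 + math.exp(-z))  # logistic squash (an assumption)
```

Because the score depends on the user's own mean and spread, the same raw RMSSD value can be scored as calm for one person and stressed for another.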

Training Data

WISE Dataset

We trained and validated our activity classifier on the WISE dataset (Hongn et al., Wearable Device Dataset from Induced Stress and Structured Exercise Sessions): 22 subjects performing controlled activities (cognitive, aerobic, anaerobic) while wearing wristband sensors. We segmented the physiological signals with sliding windows and built a model that separates physical activity from cognitive states with 93.6% LOSO accuracy.

Cognitive: 413 windows

Aerobic: 1,968 windows

Anaerobic: 68 windows
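The window counts are heavily skewed toward aerobic activity, so it helps to know what a trivial classifier would score. The arithmetic below assumes that these counts are the windows used for the 2-class model and that aerobic and anaerobic windows together form the physical class. Neither assumption is stated outright in the source.

```python
# Window counts as reported for the WISE-derived training data.
windows = {"cognitive": 413, "aerobic": 1968, "anaerobic": 68}

total = sum(windows.values())                         # 2449 windows
physical = windows["aerobic"] + windows["anaerobic"]  # 2036 windows
majority_baseline = physical / total                  # share of the larger class

print(f"physical share: {majority_baseline:.1%}")     # physical share: 83.1%
```

Under these assumptions, always predicting "physical" would be right on about 83% of windows, so the 93.6% LOSO accuracy is best read against that baseline rather than against 50%.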

Feature correlation matrix for WISE dataset signals — HR, ACC, TEMP, EDA, HRV

Figure 2 — Feature correlation matrix (WISE dataset). ACC features show low correlation with HR and HRV, suggesting they carry complementary information and justifying their combined use in Stage 1.

Dataset details

The WISE dataset contains recordings from 22 participants wearing Empatica E4 wristbands through lab protocols: Baseline (REST), Stroop/TMCT cognitive tasks (STRESS), Treadmill walking (AEROBIC), and Sprint intervals (ANAEROBIC).

Signals include heart rate, 3-axis accelerometer, electrodermal activity (EDA), skin temperature, and inter-beat intervals (IBI) for HRV computation. Sourced from PhysioNet.

Results & Conclusion

What we found — and what it means.

Body Battery Output

The end result of our pipeline is the Body Battery — a single energy score that drains throughout the day as stress accumulates and recovers during calm periods. Exercise drains it too, but is tracked separately and isn't mistaken for stress.
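The drain-and-recover behavior can be sketched as a per-step update. The rates below are placeholder points per time step, not values from the project, and `update_battery` is a hypothetical name. The returned cause label is one possible way to keep exercise drain separate from stress drain.

```python
def update_battery(level, stress_score, threshold, physical=False,
                   drain_rate=2.0, recover_rate=1.0, exercise_rate=3.0):
    """Return (new_level, cause) after one time step.

    Rates are placeholder points per step, not values from the project.
    stress_score is None when the activity gate skipped scoring.
    """
    if physical:
        delta, cause = -exercise_rate, "exercise"  # logged apart from stress
    elif stress_score is not None and stress_score >= threshold:
        delta, cause = -drain_rate, "stress"
    else:
        delta, cause = recover_rate, "recovery"
    # Keep the battery on its 0-100 scale.
    return max(0.0, min(100.0, level + delta)), cause
```

Calling this once per window and summing the drain by cause would give separate daily totals for stress and exercise.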

Validation Results

93.6%: LOSO Accuracy (Stage 1 · 22-fold CV)

r = 0.793: Sleep Correlation (Stage 2 · PMData vs Fitbit)

p = 0.075: DC/AC Significance, not significant (Stage 3 · Motivates hybrid)

Key Findings

Why hybrid? Our initial approach tried a pure machine-learning model: an end-to-end XGBoost model trained directly on HRV features to predict stress labels. It achieved R² = −0.035 on WISE data. A negative R² means the predictions were slightly worse than always predicting the mean label, so the model had effectively learned nothing. That failure motivated our pivot to the current hybrid architecture: use ML only where it clearly excels (activity classification, 93.6%) and rely on established physiological formulas combined with personal baselines for the stress scoring itself.
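To see why a slightly negative R² signals a failed model, the sketch below computes R² from its definition, 1 − SS_res / SS_tot, on made-up numbers. The labels and predictions are invented for illustration and have nothing to do with the WISE results.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


labels = [0, 0, 1, 1]                     # made-up stress labels
mean_only = [mean(labels)] * len(labels)  # always predict the mean label
slightly_off = [0.6, 0.5, 0.5, 0.4]       # made-up predictions, anti-correlated

print(r_squared(labels, mean_only))       # 0.0
print(r_squared(labels, slightly_off) < 0)  # True
```

Always predicting the mean scores exactly 0, so any value below 0 means the model did worse than that constant guess.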

Our activity classifier achieved 93.6% LOSO accuracy, confirming that exercise can be reliably separated from cognitive states using wrist-worn sensor data alone. This is the foundation that makes the rest of the pipeline possible.

For stress scoring, we found that DC/AC values were not statistically significant on the WISE dataset for single-snapshot REST vs. STRESS classification (DC p = 0.075). This is consistent with published findings: Bahameish et al. achieved only F1=56% for Stress vs. Neutral using similar one-shot approaches. A non-significant result cannot validate a method by itself, but it does motivate our hybrid design: rather than trying to classify stress from a single reading, we track HRV changes over time relative to each person's own baseline.

✓ Activity classifier: 93.6% LOSO accuracy — exercise reliably separated from cognitive states using 4 wrist-worn features

✓ Stage 2 validation: Formula sleep score vs Fitbit r = 0.793 (PMData, N=16 subjects) — confirms sleep quality proxy is meaningful

⚠ DC/AC on WISE (p = 0.075, not significant): motivates the hybrid approach over pure single-snapshot ML classification

ℹ Sleep efficiency → stress: ρ = +0.042 (n.s., N=1,556). No population-level relationship was detected, which is consistent with preferring personalized baselines over fixed thresholds

References & citations

Hongn et al.: Wearable Device Dataset from Induced Stress and Structured Exercise Sessions

Bauer 2006: Phase-Rectified Signal Averaging (PRSA) method for DC/AC computation

Velmovitsky 2022: DC/AC among top stress-informative HRV features on Apple Watch (N=33, 55–64% accuracy)

Bonneval 2025: 93% HRV error during movement; validates activity gating

Hernando 2018: SDNN robustness to Apple Watch data gaps

Bahameish et al.: Stress vs Neutral F1=56%; confirms difficulty of single-snapshot classification

Apple 2024: Sleep behavioral data improves physiological predictions

Conclusion

PhysioTwin suggests that hybrid pipelines, which combine ML for what it does best with established physiological formulas, can outperform pure data-driven approaches for stress detection, especially with limited training data. By gating on activity, adjusting for sleep, and personalizing to each user's baseline, we avoid the pitfalls that have limited previous work.

The Body Battery gives users an intuitive energy score they can act on immediately, while the system closes the loop by measuring recovery through HealthKit — moving from detection to intervention to validation. Future work includes longitudinal field studies with real Apple Watch users and expanding the sleep modulation stage with more granular sleep architecture data.

Limitations

Sensor gap. Our model was trained on Empatica E4 data (EDA, skin temperature, BVP, ACC) but deploys on Apple Watch, which has no EDA sensor and, on recent models, measures wrist temperature only during sleep. Stage 1 uses only HR and ACC features, so it does not depend on the missing sensors (the ACC features carry their own caveat, described under ACC-to-steps proxy). Stage 3 stress scoring, however, cannot leverage EDA, a signal strongly associated with sympathetic arousal.

Small sample size. The WISE dataset contains only 22 subjects. LOSO cross-validation gives an honest estimate of performance on unseen subjects, but it cannot make up for a small cohort: N=22 limits the generalizability of our findings and is likely a major reason the HRV metrics did not reach statistical significance at the group level (DC p=0.075, SDNN p=0.735).

ACC-to-steps proxy. The activity classifier was trained on features from raw 3-axis accelerometer data (ACC_mean, ACC_std), but our deployment reads step counts rather than raw accelerometer streams. We use steps as a proxy, but this mapping has not been formally validated.

Lab vs. real world. WISE protocols are controlled lab sessions (Stroop tasks, treadmill walking). Real-world stress is more ambiguous, lower intensity, and interleaved with daily activities — a domain shift that may reduce classifier performance in deployment.

DC/AC not significant at group level. Our DC/AC analysis on WISE showed trends in the expected direction but did not reach statistical significance. This is why our pipeline uses personalized baselines rather than fixed population thresholds — individual HRV patterns vary too widely for a one-size-fits-all cutoff with this sample size.

The Team

Camille Tran

Dhyay Thakrar

Selina Zhang

Essie Cheng

Levy Sahoo

Tauhidur Rahman

Lucas Venetoulias
