What's the Most Accurate Wearable Data? A 2024-2025 Study Breakdown by Device
- Ryan - Kygo Health
- Jan 27
Last Updated: March 22, 2026

The most accurate wearable depends on what you’re tracking. We analyzed studies, most of them peer-reviewed and published in 2024–2025, comparing Oura Ring, Apple Watch, WHOOP, Garmin, Fitbit, and others against gold-standard medical measurements across sleep staging, HRV, heart rate, SpO2, step counting, VO2 max, and more. Below is everything we found, organized by metric, with study funding flagged so you can evaluate the data for yourself.
We also built a free interactive comparison tool based on this research that lets you pick your devices and the metrics you care about to see them side by side: http://kygo.app/tools/wearable-accuracy
Most Accurate Wearable: Master Summary by Metric
This table compiles findings across all the studies analyzed. Each metric section below includes the full data, study details, and funding disclosures.
Biometric | 🥇 Winner | 🥈 Second | 🥉 Third | Worst |
Sleep Staging (Oura-funded) | Oura (κ=0.65) | Apple Watch (κ=0.60) | Fitbit (κ=0.55) | — |
Sleep Staging (Independent) | Apple Watch (κ=0.53) | Fitbit Sense (κ=0.42) | Fitbit Charge 5 (κ=0.41) | Garmin (κ=0.21) |
Deep Sleep Detection (Independent) | WHOOP (69.6%) | Apple Watch (50.7%) | Fitbit Sense (48.3%) | Withings (29.8%) |
REM Detection (Independent) | Apple Watch (68.6%) | WHOOP (62.0%) | Fitbit Sense (55.5%) | Garmin (28.7%) |
Wake Detection (Independent) | Apple Watch (52.2%) | Fitbit Charge 5 (42.7%) | Fitbit Sense (39.2%) | Garmin (27.6%) |
Nocturnal HRV | Oura Gen 4 (MAPE 5.96%) | WHOOP (8.17%) | Garmin (10.52%) | Polar (16.32%) |
Resting Heart Rate | Oura Gen 4 (CCC 0.98) | Oura Gen 3 (0.97) | WHOOP (0.91) | Polar (0.86) |
Active Heart Rate | Apple Watch (86.3%) | Fitbit (73.6%) | Garmin (67.7%) | — |
HR Correlation vs ECG | Polar Chest Strap (r=0.99) | Apple Watch (r=0.80) | Garmin (r=0.52) | — |
SpO2 | Apple Watch (MAE 2.2%) | Garmin Fenix (~4.5%) | Withings (~4.8%) | Garmin Venu (5.8%) |
Step Count | Garmin (82.6%) | Apple Watch (81.1%) | Fitbit (77.3%) | Oura (poor) |
Calories/Energy | Apple Watch (71%) | Fitbit (65.6%) | — | Garmin (48%) |
VO2 Max | Garmin (MAPE 5.7–7.05%) | Apple Watch (MAPE 13–16%) | — | — |
Skin Temperature | Oura (r²>0.99 lab) | — | — | — |
Sleep Staging Accuracy (4-Stage Classification)
Sleep staging is the most studied—and most contested—metric in wearable accuracy research. Three major studies from 2023–2025 produced different rankings, and study funding is a factor worth noting.
Brigham and Women’s Hospital Study (2024) — Oura-Funded
Robbins et al. compared Oura Ring Gen 3, Fitbit Sense 2, and Apple Watch Series 8 against polysomnography (PSG) across 36 participants over multiple nights.
Device | Overall (κ) | Deep Sleep Sensitivity | Deep Sleep Bias |
Oura Ring Gen 3 | 0.65 (Substantial) | 79.5% | No significant bias |
Apple Watch Series 8 | 0.60 (Moderate) | 50.5% | -43 min (underestimates) |
Fitbit Sense 2 | 0.55 (Moderate) | 61.7% | -15 min (underestimates) |
⚠️ Funding: This study was funded by Oura Ring Inc. Lead author Dr. Rebecca Robbins is an Oura scientific advisor.
University of Antwerp Study (2025) — Independent
Schyvens et al. tested six devices against PSG in 62 adults. Funded by VLAIO (Flanders Innovation & Entrepreneurship)—no device manufacturer funding. Oura was not included in this study.
Device | κ | Total Sleep Time (TST) Bias | Deep Sleep Sensitivity | REM Sensitivity | Wake Sensitivity | Light Sleep Sensitivity | Notes |
Apple Watch 8 | 0.53 | +19.6 min | 50.7% | 68.6% | 52.2% | 84.5% | Best κ |
Fitbit Sense | 0.42 | +6.3 min | 48.3% | 55.5% | 39.2% | 76.2% | Lowest bias |
Fitbit Charge 5 | 0.41 | +11.1 min | 43.3% | 47.5% | 42.7% | 73.8% | |
WHOOP 4.0 | 0.37 | +24.5 min | 69.6% | 62.0% | 32.5% | 60.9% | Best deep |
Withings Scanwatch | 0.22 | +39.9 min | 29.8% | 36.5% | 29.4% | 73.5% | |
Garmin Vivosmart 4 | 0.21 | +38.4 min | 32.1% | 28.7% | 27.6% | 72.2% | Oldest HW |
Note: All six devices tended to misclassify wake, deep sleep, and REM epochs as light sleep, a conservative algorithmic pattern shared by every device tested. All devices significantly underestimated Wake After Sleep Onset by 12–48 minutes.
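To make the κ and per-stage percentages above more concrete, here is a minimal sketch of how both can be derived from an epoch-by-epoch confusion matrix. The counts, the `CONFUSION` name, and both helper functions are hypothetical and chosen for illustration; they are not data or code from any of the studies.

```python
# Hypothetical confusion matrix: rows = PSG stage, columns = device stage.
# The counts are invented for illustration and do not come from any study.
STAGES = ["wake", "light", "deep", "rem"]
CONFUSION = [
    [40, 55, 0, 5],     # PSG wake
    [20, 420, 30, 30],  # PSG light
    [0, 70, 80, 0],     # PSG deep
    [5, 60, 0, 135],    # PSG REM
]


def sensitivity(matrix, index):
    """Share of PSG epochs of one stage that the device labeled as that stage."""
    row = matrix[index]
    return row[index] / sum(row)


def cohens_kappa(matrix):
    """Chance-corrected agreement between PSG and device over the same epochs."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


for i, stage in enumerate(STAGES):
    print(f"{stage:>5} sensitivity: {sensitivity(CONFUSION, i):.1%}")
print(f"Cohen's kappa: {cohens_kappa(CONFUSION):.2f}")
```

With these invented counts, raw agreement is about 71% while κ is roughly 0.52, and most of the wake, deep, and REM errors land in the light sleep column. That is the same pattern the note above describes.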
Korean Multicenter Study (2023) — Independent
Park et al. tested 11 devices in 75 participants across 2 centers (349,114 epochs analyzed). No industry funding disclosed.
Device | Cohen’s Kappa (κ) |
Google Pixel Watch | 0.4–0.6 (Moderate) |
Galaxy Watch 5 | 0.4–0.6 (Moderate) |
Fitbit Sense 2 | 0.4–0.6 (Moderate) |
Apple Watch 8 | 0.2–0.4 (Fair) |
Oura Ring 3 | 0.2–0.4 (Fair) |
Note: This study produced different rankings than the Brigham study. Oura scored lower here. Different study populations, methodologies, and PSG protocols can affect results.
Deep Sleep Detection Sensitivity
Deep sleep sensitivity data comes from two studies with different device lineups:
From Robbins et al. (2024) — Oura-funded:
Oura Ring Gen 3: 79.5%, Fitbit Sense 2: 61.7%, Apple Watch Series 8: 50.5%.
From Schyvens et al. (2025) — Independent:
WHOOP 4.0: 69.6%, Apple Watch: 50.7%, Fitbit Sense: 48.3%, Fitbit Charge 5: 43.3%, Garmin Vivosmart 4: 32.1%, Withings: 29.8%. Oura was not tested in this study.
Nocturnal HRV (Heart Rate Variability) Accuracy
An Ohio State University / Air Force Research Lab study (Dial et al., 2025) validated nocturnal HRV across 13 participants and 536 nights using a Polar H10 ECG chest strap as reference. No industry funding disclosed.
Device | CCC | MAPE | Rating |
Oura Gen 4 | 0.99 | 5.96% ± 5.12% | Nearly Perfect |
Oura Gen 3 | 0.97 | 7.15% ± 5.48% | Substantial |
WHOOP 4.0 | 0.94 | 8.17% ± 10.49% | Moderate |
Garmin Fenix 6 | 0.87 | 10.52% ± 8.63% | Poor |
Polar Grit X Pro | 0.82 | 16.32% ± 24.39% | Poor |
CCC (concordance correlation coefficient) scale: >0.99 = Nearly Perfect, 0.95–0.99 = Substantial, 0.90–0.95 = Moderate, <0.90 = Poor. MAPE is mean absolute percentage error; lower is better.
Note: Garmin Fenix 6 is 2+ generations behind current hardware. The study authors acknowledged this limitation—current Garmin devices may perform differently. Sample size was 13 participants, though 536 total nights of data were collected.
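For readers who want to see what CCC and MAPE measure, the sketch below computes both for paired nightly readings. It uses the standard definition of Lin's concordance correlation coefficient. The sample values and the function names `mape` and `lins_ccc` are assumptions for illustration, not the analysis code from Dial et al.

```python
from statistics import fmean


def mape(reference, device):
    """Mean absolute percentage error of device readings against the reference."""
    return 100 * fmean(abs(d - r) / r for r, d in zip(reference, device))


def lins_ccc(reference, device):
    """Lin's concordance correlation coefficient (population-variance form).

    Unlike Pearson's r, it is penalized when the device is consistently
    offset from the reference, not only when the two fail to track together.
    """
    mean_r, mean_d = fmean(reference), fmean(device)
    var_r = fmean((r - mean_r) ** 2 for r in reference)
    var_d = fmean((d - mean_d) ** 2 for d in device)
    cov = fmean((r - mean_r) * (d - mean_d) for r, d in zip(reference, device))
    return 2 * cov / (var_r + var_d + (mean_r - mean_d) ** 2)


# Hypothetical nightly HRV values in ms (chest strap vs. wearable), illustration only.
reference_hrv = [42.0, 55.0, 61.0, 38.0, 70.0, 49.0]
device_hrv = [44.0, 53.0, 64.0, 40.0, 66.0, 50.0]

print(f"MAPE: {mape(reference_hrv, device_hrv):.2f}%")
print(f"CCC:  {lins_ccc(reference_hrv, device_hrv):.3f}")
```

A device that reads every night exactly 1 ms too high would still have a perfect Pearson correlation, but its CCC would fall below 1. This is why CCC is the stricter measure of agreement.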
Resting Heart Rate Accuracy
From the same Ohio State study (Dial et al., 2025):
Device | CCC | MAPE | Rating |
Oura Gen 4 | 0.98 | 1.94% ± 2.51% | Substantial |
Oura Gen 3 | 0.97 | 1.67% ± 1.54% | Substantial |
WHOOP 4.0 | 0.91 | 3.00% ± 2.15% | Moderate |
Polar Grit X Pro | 0.86 | 2.71% ± 2.75% | Poor |
Note: Garmin Fenix 6 was excluded from RHR analysis due to timestamp reporting issues that prevented alignment with the Polar H10 reference data.
Active Heart Rate Accuracy
Active heart rate data comes from the WellnessPulse Meta-Analysis (2025) and aggregate PubMed Central studies:
Device | Accuracy | Correlation vs ECG |
Polar Chest Strap | — | r = 0.99 |
Apple Watch | 86.31% | r = 0.80 |
Fitbit | 73.56% | — |
Garmin | 67.73% | r = 0.52 |
TomTom | 67.63% | — |
Blood Oxygen (SpO2) Accuracy
Garmin Venu 2s underestimated SpO2 in 67.4% of readings. None of these SpO2 features are FDA-cleared for medical use—they are classified as wellness features.
Device | MAE (mean absolute error) | MDE (mean directional error) | Within Range | Missing Data |
Apple Watch Series 7 | 2.2% | -0.4% | 58.3% | 11% |
Garmin Fenix 6 Pro | ~4.5% | — | ~44% | 28% |
Withings ScanWatch | ~4.8% | — | ~38% | 31% |
Garmin Venu 2s | 5.8% | 5.5% | 18.5% | 14% |
Sources: PLOS, Nature, various validation studies.
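The difference between MAE and MDE in the table above is easiest to see with a small worked example. The readings below are hypothetical. The sign convention (device minus reference, so negative means the device reads low) is an assumption, because the table does not state which convention its sources used.

```python
from statistics import fmean


def mean_absolute_error(reference, device):
    """Average size of the error, ignoring direction."""
    return fmean(abs(d - r) for r, d in zip(reference, device))


def mean_directional_error(reference, device):
    """Average signed error; assumed convention: device minus reference."""
    return fmean(d - r for r, d in zip(reference, device))


# Hypothetical SpO2 readings in percent (clinical oximeter vs. watch), illustration only.
reference_spo2 = [98.0, 97.0, 95.0, 92.0, 90.0]
device_spo2 = [96.0, 98.0, 93.0, 93.0, 87.0]

print(f"MAE: {mean_absolute_error(reference_spo2, device_spo2):.1f} points")
print(f"MDE: {mean_directional_error(reference_spo2, device_spo2):+.1f} points")
```

Over- and under-readings cancel in MDE but not in MAE. A device can therefore show a small MDE while still being off by a couple of points on a typical reading.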
Step Count Accuracy
Device | Accuracy | MAPE (where available) |
Garmin | 82.58% | Vivoactive 4: <2% |
Apple Watch | 81.07% | — |
Fitbit | 77.29% | Sense: ~8% |
Jawbone | 57.91% | — |
Polar | 53.21% | — |
Oura Ring | Poor (50.3% error real-world, 4.8% controlled) | — |
Source: WellnessPulse Meta-Analysis (2025)
Energy Expenditure (Calories) Accuracy
All wearables are weak at calorie estimation. Accuracy decreases during high-intensity or multi-modal exercise.
Device | Accuracy |
Oura Ring | ~87% (13% avg error) |
Apple Watch | 71.02% |
Fitbit | 65.57% |
Polar | ~50–65% |
Garmin | 48.05% |
Source: WellnessPulse Meta-Analysis (2025).
Note: None should be treated as precise calorie counters.
VO2 Max Estimation Accuracy
All devices tend to underestimate VO2 max in highly fit individuals and overestimate in sedentary/lower fitness populations.
Device | MAPE | MAE | Notes |
Garmin Forerunner 245 | 5.7% | — | Acceptable for runners |
Garmin Fenix 6 | 7.05% | — | CCC=0.73 for 30s avg |
Apple Watch Series 7 | 15.79% | 6.07 ml/kg/min | Underestimates |
Apple Watch (2025 study) | 13.31% | 6.92 ml/kg/min | Mixed bias |
Sources: Caserman et al. (2024), Lambe et al. (2025), Garmin validation (2025).
Skin Temperature Accuracy
Oura’s internal validation study (2024) tested temperature sensing across 16 individuals over 1 week (93,571 data points):
r² > 0.99 in lab conditions, r² > 0.92 in real-world conditions, with precision of ±0.13°C per minute.
⚠️ Funding: This is Oura’s own study, not independently peer-reviewed. However, Oura’s temperature data has been validated in independent menstrual cycle tracking studies (Maijala et al., 2019). Apple Watch, Garmin, WHOOP, and Samsung all track skin temperature, but limited independent comparative data exists.
FDA-Cleared Features
Most wearable metrics are wellness estimates. A few features have earned FDA authorization:
Feature | Device | Status |
ECG / Atrial Fibrillation Detection | Apple Watch (Series 4+) | FDA Cleared |
ECG / Atrial Fibrillation Detection | Samsung Galaxy Watch (4+) | FDA Cleared |
Sleep Apnea Notification | Apple Watch (Series 9+, Ultra 2) | FDA Authorized |
Sleep Apnea Detection | Samsung Galaxy Watch | FDA Authorized (Feb 2024) |
Blood Oxygen (SpO2) | Apple Watch | Wellness feature (NOT FDA cleared) |
Irregular Rhythm Notification | Fitbit | FDA Cleared |
Important Caveats
Before drawing conclusions from any of this data, keep these limitations in mind:
No single device wins everywhere. The best device depends on which metric matters most to you.
Study funding matters. The primary sleep study (Robbins et al.) was Oura-funded. Independent studies (Park, Schyvens) found different rankings. We’ve flagged funding sources throughout so you can decide for yourself.
Device generations matter. Studies often test older hardware. Garmin Fenix 6 and Vivosmart 4 are 2+ generations behind current devices. Results may not apply to current models.
Small sample sizes. The HRV/RHR study had 13 participants (536 nights). Antwerp had 62 participants, 1 night each. Brigham had 36 participants over multiple nights.
All wearables are estimates. None are medical devices (except specific FDA-cleared features listed above). Data should inform, not diagnose.
Individual variation. Accuracy can vary based on skin tone, tattoos, BMI, wrist fit, and activity level.
Skin tone bias. PPG sensor accuracy is affected by skin pigmentation. Most validation studies have predominantly Caucasian participants—a critical research gap.
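PSG is imperfect too. The “gold standard” polysomnography has interrater reliability of κ≈0.75, so even human experts scoring the same recording disagree on a meaningful share of epochs. Because κ is chance-corrected, it is not the same as percent agreement, and κ≈0.75 does not translate to exactly 25% disagreement.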
Common device failure mode. All consumer devices tend to misclassify wake, deep sleep, and REM as light sleep—a conservative algorithmic approach that inflates light sleep totals.
Why Accuracy Matters for Understanding Food-Biometric Patterns
If you’re trying to understand how nutrition affects your sleep, recovery, or energy levels, the accuracy of your wearable data is the foundation everything else builds on. When measurement error is high, real patterns between what you eat and how your body responds get harder to detect. When accuracy is high, the data can surface connections—like how meal timing affects your overnight heart rate, or whether a supplement is actually changing your HRV—that you’d never spot manually.
This is part of the reason we built Kygo Health to integrate with multiple wearable platforms. Different devices bring different strengths. Connecting them to nutrition data in one place gives you a more complete picture to work with.
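The point above, that measurement error makes real patterns harder to detect, can be illustrated with a small simulation. Everything in it is an assumption made up for the sketch: the 3 ms drop in HRV per hour of late eating, the noise levels, and the `simulate` helper are not findings from any study or part of any Kygo feature.

```python
import random
from statistics import fmean


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(noise_sd, nights=2000, seed=7):
    """Correlate a made-up meal-timing variable with HRV read by a noisy sensor."""
    rng = random.Random(seed)
    late_meal_hours = [rng.uniform(0, 4) for _ in range(nights)]
    # Assumed "true" effect: HRV drops 3 ms per hour, plus natural nightly variation.
    true_hrv = [60 - 3 * h + rng.gauss(0, 5) for h in late_meal_hours]
    # The wearable adds its own measurement error on top of the true value.
    measured_hrv = [v + rng.gauss(0, noise_sd) for v in true_hrv]
    return pearson_r(late_meal_hours, measured_hrv)


print(f"Accurate sensor (2 ms noise): r = {simulate(2.0):.2f}")
print(f"Noisy sensor (15 ms noise):   r = {simulate(15.0):.2f}")
```

The underlying relationship is identical in both runs. Only the sensor noise changes, and the correlation measured through the noisier sensor comes out weaker.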
Using Multiple Wearables Together
Many people in the biohacking and quantified self communities wear multiple devices simultaneously to capture different metrics from different strengths—Oura Ring for sleep plus Apple Watch for workouts, or WHOOP plus Garmin for different contexts.
The challenge is getting that data to talk to each other. We wrote a detailed guide on this: How to Centralize Health Data from Multiple Devices.
If you’re specifically using Oura for sleep and want to connect that with food tracking, check out: How to Combine Oura Ring with Food Tracking.
Want to compare devices yourself? Explore all the data from these studies in our free interactive comparison tool: http://kygo.app/tools/wearable-accuracy
Ready to see how your nutrition connects to the biometric data your wearable tracks? Join Kygo Health on iOS or Android and start exploring the patterns in your own data.
Sources
Robbins R, et al. (2024). “Accuracy of Three Commercial Wearable Devices for Sleep Tracking in Healthy Adults.” Sensors, 24(20), 6532. DOI: 10.3390/s24206532 — Funded by Oura Ring Inc.
Schyvens AM, et al. (2025). “Performance of six consumer sleep trackers in comparison with polysomnography in healthy adults.” Sleep Advances, 6(1), zpaf016. DOI: 10.1093/sleepadvances/zpaf016 — Independent (VLAIO-funded)
Dial MB, et al. (2025). “Validation of nocturnal resting heart rate and heart rate variability in consumer wearables.” Physiological Reports, 13(16), e70527. DOI: 10.14814/phy2.70527 — Independent
Park et al. (2023). “Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers.” JMIR mHealth and uHealth, 11, e50983. DOI: 10.2196/50983 — Independent
WellnessPulse Meta-Analysis (2025). Accuracy of Fitness Trackers — Aggregate data
Caserman P, et al. (2024). “Validity of Apple Watch Series 7 VO2 Max Estimation.” JMIR Biomedical Engineering, 9, e54023.
Lambe RF, et al. (2025). “Validation of Apple Watch VO2 max estimates.” PLOS One, 20(2), e0318498.
Christakis et al. (2025). “A guide to consumer-grade wearables in cardiovascular clinical care.” npj Cardiovascular Health, 2, 82.
Khodr R, et al. (2024). “Accuracy, Utility and Applicability of the WHOOP Wearable Monitoring Device.” medRxiv. DOI: 10.1101/2024.01.04.24300784
Oura Internal Validation (2024). Temperature sensor validation study. 16 participants, 93,571 data points.
Maijala et al. (2019). “Nocturnal finger skin temperature in menstrual cycle tracking.” BMC Women’s Health, 19, 150.
Lanfranchi et al. (2024). Samsung Galaxy Watch SpO2 validation. Journal of Clinical Sleep Medicine.
FAQ: Wearable Accuracy Questions
Which wearable is the most accurate for sleep tracking?
It depends on the study. In the Oura-funded Brigham study (2024), Oura led with κ=0.65 and 79.5% deep sleep sensitivity. In the independent Antwerp study (2025), Apple Watch led overall (κ=0.53) while WHOOP led deep sleep detection (69.6%). The independent Korean study (2023) ranked Google Pixel Watch and Galaxy Watch highest. Study funding, population, and methodology all affect results.
How accurate is Oura Ring HRV compared to medical devices?
Oura Gen 4 achieved a 0.99 concordance correlation coefficient with Polar H10 ECG in an independent 536-night study (Dial et al., 2025). This is the highest HRV accuracy among consumer wearables tested in that study.
Is WHOOP accurate for HRV tracking?
WHOOP 4.0 showed a CCC of 0.94 and MAPE of 8.17% in the Dial et al. (2025) study—rated “Moderate” on the concordance scale.
Does skin tone affect wearable accuracy?
PPG sensor accuracy is affected by skin pigmentation. Most validation studies have predominantly Caucasian participants, which is a known research gap. Accuracy data may not generalize equally across all skin tones.
Can I use multiple wearables together?
Yes. Many people wear multiple devices to capture different metrics from each device’s strengths. The challenge is correlating data across platforms, which typically requires a third-party platform or manual comparison.
Which wearable is best for tracking how food affects sleep?
For nutrition-sleep correlation analysis, you need accurate sleep and HRV data paired with consistent food logging. The studies above show which devices perform strongest for each metric—the best choice depends on which specific metrics you prioritize.
Are wearable calorie estimates reliable?
No wearable tracks calories with high precision. Among the wrist-worn devices compared above, the highest reported accuracy is Apple Watch at 71%; the Oura Ring figure of ~87% still implies a 13% average error. Treat all devices as rough estimates. Accuracy decreases further during high-intensity or multi-modal exercise.
Why do different studies show different accuracy rankings?
Study funding, sample size, population demographics, device firmware version, number of nights tested, and PSG scoring protocols all affect results. This is why we include multiple studies and flag funding sources throughout this article.
Disclaimer: Kygo Health LLC is a personal data aggregation and insights platform designed for informational purposes only. The information provided does not constitute medical advice, diagnosis, or treatment. Always consult a licensed healthcare provider with any questions regarding medical conditions.
Have questions about wearable accuracy or data you think should be included? Reach out directly at Ryan@kygo.app. If you have sources or credible data that isn’t listed here, share it and we’ll review and update accordingly.