
What's the Most Accurate Wearable Data? A 2024-2025 Study Breakdown by Device

  • Writer: Ryan - Kygo Health
  • Jan 27
  • 10 min read


Last Updated: March 22, 2026

[Image: a smiling smartwatch surrounded by icons for heart rate, sleep, a bar chart, and a metrics ring, summarizing the different data health wearables provide. The Kygo Health app connects multiple wearable devices to use the best metrics from each.]

The most accurate wearable depends on what you’re tracking. We analyzed peer-reviewed studies from 2024–2025 comparing Oura Ring, Apple Watch, WHOOP, Garmin, Fitbit, and others against gold-standard medical measurements across sleep staging, HRV, heart rate, SpO2, step counting, VO2 max, and more. Below is everything we found—organized by metric, with study funding flagged so you can evaluate the data for yourself.


We also built a free interactive comparison tool based on this research that lets you pick your devices and the metrics you care about to see them side by side: http://kygo.app/tools/wearable-accuracy


Most Accurate Wearable: Master Summary by Metric

This table compiles findings across all peer-reviewed studies analyzed. Each metric section below includes the full data, study details, and funding disclosures.

| Biometric | 🥇 Winner | 🥈 Second | 🥉 Third | Worst |
| --- | --- | --- | --- | --- |
| Sleep Staging (Oura-funded) | Oura (κ=0.65) | Apple Watch (κ=0.60) | Fitbit (κ=0.55) | |
| Sleep Staging (Independent) | Apple Watch (κ=0.53) | Fitbit Sense (κ=0.42) | Fitbit Charge 5 (κ=0.41) | Garmin (κ=0.21) |
| Deep Sleep Detection (Independent) | WHOOP (69.6%) | Apple Watch (50.7%) | Fitbit Sense (48.3%) | Withings (29.8%) |
| REM Detection (Independent) | Apple Watch (68.6%) | WHOOP (62.0%) | Fitbit Sense (55.5%) | Garmin (28.7%) |
| Wake Detection (Independent) | Apple Watch (52.2%) | Fitbit Charge 5 (42.7%) | Fitbit Sense (39.2%) | Garmin (27.6%) |
| Nocturnal HRV | Oura Gen 4 (MAPE 5.96%) | WHOOP (8.17%) | Garmin (10.52%) | Polar (16.32%) |
| Resting Heart Rate | Oura Gen 4 (CCC 0.98) | Oura Gen 3 (0.97) | WHOOP (0.91) | Polar (0.86) |
| Active Heart Rate | Apple Watch (86.3%) | Fitbit (73.6%) | Garmin (67.7%) | |
| HR Correlation vs ECG | Polar Chest Strap (r=0.99) | Apple Watch (r=0.80) | Garmin (r=0.52) | |
| SpO2 | Apple Watch (MAE 2.2%) | Garmin Fenix (~4.5%) | Withings (~4.8%) | Garmin Venu (5.8%) |
| Step Count | Garmin (82.6%) | Apple Watch (81.1%) | Fitbit (77.3%) | Oura (poor) |
| Calories/Energy | Apple Watch (71%) | Fitbit (65.6%) | Garmin (48%) | |
| VO2 Max | Garmin Fenix 6 (7.05%) | Apple Watch (13–16%) | | |
| Skin Temperature | Oura (r²>0.99 lab) | | | |


Sleep Staging Accuracy (4-Stage Classification)

Sleep staging is the most studied—and most contested—metric in wearable accuracy research. Three major studies from 2023–2025 produced different rankings, and study funding is a factor worth noting.


Brigham and Women’s Hospital Study (2024) — Oura-Funded

Robbins et al. compared Oura Ring Gen 3, Fitbit Sense 2, and Apple Watch Series 8 against polysomnography (PSG) across 36 participants over multiple nights.

| Device | Overall (κ) | Deep Sleep Sensitivity | Deep Sleep Bias |
| --- | --- | --- | --- |
| Oura Ring Gen 3 | 0.65 (Substantial) | 79.5% | No significant bias |
| Apple Watch Series 8 | 0.60 (Moderate) | 50.5% | -43 min (underestimates) |
| Fitbit Sense 2 | 0.55 (Moderate) | 61.7% | -15 min (underestimates) |

⚠️ Funding: This study was funded by Oura Ring Inc. Lead author Dr. Rebecca Robbins is an Oura scientific advisor.


University of Antwerp Study (2025) — Independent

Schyvens et al. tested six devices against PSG in 62 adults. Funded by VLAIO (Flanders Innovation & Entrepreneurship)—no device manufacturer funding. Oura was not included in this study.

| Device | κ | TST Bias | Deep Sleep | REM | Wake | Light Sleep | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple Watch 8 | 0.53 | +19.6 min | 50.7% | 68.6% | 52.2% | 84.5% | Best κ |
| Fitbit Sense | 0.42 | +6.3 min | 48.3% | 55.5% | 39.2% | 76.2% | Lowest bias |
| Fitbit Charge 5 | 0.41 | +11.1 min | 43.3% | 47.5% | 42.7% | 73.8% | |
| WHOOP 4.0 | 0.37 | +24.5 min | 69.6% | 62.0% | 32.5% | 60.9% | Best deep |
| Withings Scanwatch | 0.22 | +39.9 min | 29.8% | 36.5% | 29.4% | 73.5% | |
| Garmin Vivosmart 4 | 0.21 | +38.4 min | 32.1% | 28.7% | 27.6% | 72.2% | Oldest hardware |

Note: All six devices tended to misclassify wake, deep sleep, and REM epochs as light sleep, a conservative algorithmic pattern shared across the devices tested. All devices also significantly underestimated Wake After Sleep Onset (WASO), by 12–48 minutes.
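To make the "TST Bias" column concrete: validation studies commonly report bias as the average of (device minus reference), often alongside 95% limits of agreement. The sketch below is a minimal illustration of that calculation, not the study's own code. The function name and the four nights of numbers are hypothetical.

```python
from statistics import fmean, stdev


def bias_and_limits(reference_minutes, device_minutes):
    """Return (mean bias, lower limit, upper limit) for device minus reference."""
    differences = [d - r for r, d in zip(reference_minutes, device_minutes)]
    bias = fmean(differences)
    # 95% limits of agreement: bias +/- 1.96 sample standard deviations.
    spread = 1.96 * stdev(differences)
    return bias, bias - spread, bias + spread


# Hypothetical total sleep time (minutes) for four nights; not study data.
psg_tst = [400, 420, 380, 450]
device_tst = [420, 445, 395, 470]

bias, lower, upper = bias_and_limits(psg_tst, device_tst)
print(f"bias = {bias:+.1f} min, limits of agreement = {lower:+.1f} to {upper:+.1f} min")
```

A positive bias means the device reports more sleep than PSG, which is the direction every device in the table above leans.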


Korean Multicenter Study (2023) — Independent

Park et al. tested 11 devices in 75 participants across 2 centers (349,114 epochs analyzed). No industry funding disclosed.

| Device | Cohen’s Kappa (κ) |
| --- | --- |
| Google Pixel Watch | 0.4–0.6 (Moderate) |
| Galaxy Watch 5 | 0.4–0.6 (Moderate) |
| Fitbit Sense 2 | 0.4–0.6 (Moderate) |
| Apple Watch 8 | 0.2–0.4 (Fair) |
| Oura Ring 3 | 0.2–0.4 (Fair) |

Note: This study produced different rankings than the Brigham study. Oura scored lower here. Different study populations, methodologies, and PSG protocols can affect results.


Deep Sleep Detection Sensitivity

Deep sleep sensitivity data comes from two studies with different device lineups:

  • From Robbins et al. (2024) — Oura-funded:

    • Oura Ring Gen 3: 79.5%, Fitbit Sense 2: 61.7%, Apple Watch Series 8: 50.5%.

  • From Schyvens et al. (2025) — Independent:

    • WHOOP 4.0: 69.6%, Apple Watch: 50.7%, Fitbit Sense: 48.3%, Fitbit Charge 5: 43.3%, Garmin Vivosmart 4: 32.1%, Withings: 29.8%. Oura was not tested in this study.
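For readers unfamiliar with the term, sensitivity here means: of the epochs that PSG scored as deep sleep, what share did the device also score as deep sleep? A minimal sketch of that calculation follows. The function name, stage codes, and the 20-epoch example are hypothetical, and the studies' exact epoch-matching procedures may differ.

```python
def stage_sensitivity(psg_labels, device_labels, stage):
    """Share of PSG epochs scored as `stage` that the device also scored as `stage`."""
    device_for_stage = [d for p, d in zip(psg_labels, device_labels) if p == stage]
    if not device_for_stage:
        raise ValueError(f"PSG contains no epochs scored as {stage!r}")
    return sum(d == stage for d in device_for_stage) / len(device_for_stage)


# Hypothetical night: 10 deep (D) epochs followed by 10 light (L) epochs per PSG.
psg = ["D"] * 10 + ["L"] * 10
# The device catches 7 of the 10 deep epochs and calls the other 3 light sleep.
device = ["D"] * 7 + ["L"] * 3 + ["L"] * 10

print(f"deep sleep sensitivity = {stage_sensitivity(psg, device, 'D'):.0%}")
print(f"light sleep sensitivity = {stage_sensitivity(psg, device, 'L'):.0%}")
```

In this toy example the missed deep epochs all land in light sleep, which mirrors the misclassification pattern the Antwerp study reported.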


Nocturnal HRV (Heart Rate Variability) Accuracy

An Ohio State University / Air Force Research Lab study (Dial et al., 2025) validated nocturnal HRV across 13 participants and 536 nights using a Polar H10 ECG chest strap as reference. No industry funding disclosed.

| Device | CCC | MAPE | Rating |
| --- | --- | --- | --- |
| Oura Gen 4 | 0.99 | 5.96% ± 5.12% | Nearly Perfect |
| Oura Gen 3 | 0.97 | 7.15% ± 5.48% | Substantial |
| WHOOP 4.0 | 0.94 | 8.17% ± 10.49% | Moderate |
| Garmin Fenix 6 | 0.87 | 10.52% ± 8.63% | Poor |
| Polar Grit X Pro | 0.82 | 16.32% ± 24.39% | Poor |

CCC Scale: >0.99 = Nearly Perfect, 0.95–0.99 = Substantial, 0.90–0.95 = Moderate, <0.90 = Poor

Note: Garmin Fenix 6 is 2+ generations behind current hardware. The study authors acknowledged this limitation—current Garmin devices may perform differently. Sample size was 13 participants, though 536 total nights of data were collected.
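The two statistics in this table answer different questions: MAPE is the average size of the error as a percentage of the reference value, while the concordance correlation coefficient (CCC) rewards readings that track the reference without a systematic offset. Below is a minimal sketch of both using the standard formulas, plus the rating scale quoted above. It is not the study's analysis code. The nightly values are hypothetical, and how the rating function treats the exact boundary values (0.90, 0.95, 0.99) is our assumption, since the published ranges overlap at their edges.

```python
from statistics import fmean


def mape(reference, device):
    """Mean absolute percentage error of device readings vs. the reference, in percent."""
    return 100 * fmean(abs(d - r) / abs(r) for r, d in zip(reference, device))


def lins_ccc(reference, device):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(reference)
    mean_r, mean_d = fmean(reference), fmean(device)
    var_r = sum((r - mean_r) ** 2 for r in reference) / n
    var_d = sum((d - mean_d) ** 2 for d in device) / n
    cov = sum((r - mean_r) * (d - mean_d) for r, d in zip(reference, device)) / n
    # The squared mean difference penalizes a constant offset between methods.
    return 2 * cov / (var_r + var_d + (mean_r - mean_d) ** 2)


def rate_ccc(ccc):
    """Map a CCC value to the rating scale quoted in this article."""
    if ccc > 0.99:
        return "Nearly Perfect"
    if ccc >= 0.95:
        return "Substantial"
    if ccc >= 0.90:
        return "Moderate"
    return "Poor"


# Hypothetical nightly HRV values in milliseconds; not study data.
reference_hrv = [40.0, 55.0, 62.0, 48.0, 70.0]
device_hrv = [42.0, 53.0, 65.0, 47.0, 66.0]

ccc = lins_ccc(reference_hrv, device_hrv)
print(f"MAPE = {mape(reference_hrv, device_hrv):.2f}%, CCC = {ccc:.3f} ({rate_ccc(ccc)})")
```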


Resting Heart Rate Accuracy

From the same Ohio State study (Dial et al., 2025):

| Device | CCC | MAPE | Rating |
| --- | --- | --- | --- |
| Oura Gen 4 | 0.98 | 1.94% ± 2.51% | Substantial |
| Oura Gen 3 | 0.97 | 1.67% ± 1.54% | Substantial |
| WHOOP 4.0 | 0.91 | 3.00% ± 2.15% | Moderate |
| Polar Grit X Pro | 0.86 | 2.71% ± 2.75% | Poor |

Note: Garmin Fenix 6 was excluded from RHR analysis due to timestamp reporting issues that prevented alignment with the Polar H10 reference data.


Active Heart Rate Accuracy

Active heart rate data comes from the WellnessPulse Meta-Analysis (2025) and aggregate PubMed Central studies:

| Device | Accuracy | Correlation vs ECG |
| --- | --- | --- |
| Polar Chest Strap | | r = 0.99 |
| Apple Watch | 86.31% | r = 0.80 |
| Fitbit | 73.56% | |
| Garmin | 67.73% | r = 0.52 |
| TomTom | 67.63% | |


Blood Oxygen (SpO2) Accuracy

None of the SpO2 features covered here is FDA-cleared for medical use; they are classified as wellness features. Among the devices tested, the Garmin Venu 2s underestimated SpO2 in 67.4% of readings.

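| Device | MAE (mean absolute error) | MDE (mean directional error) | Within Range | Missing Data |
| --- | --- | --- | --- | --- |
| Apple Watch Series 7 | 2.2% | -0.4% | 58.3% | 11% |
| Garmin Fenix 6 Pro | ~4.5% | | ~44% | 28% |
| Withings ScanWatch | ~4.8% | | ~38% | 31% |
| Garmin Venu 2s | 5.8% | 5.5% | 18.5% | 14% |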

Sources: PLOS, Nature, various validation studies.


Step Count Accuracy

| Device | Accuracy | MAPE (where available) |
| --- | --- | --- |
| Garmin | 82.58% | Vivoactive 4: <2% |
| Apple Watch | 81.07% | |
| Fitbit | 77.29% | Sense: ~8% |
| Jawbone | 57.91% | |
| Polar | 53.21% | |
| Oura Ring | Poor (50.3% error real-world, 4.8% controlled) | |

Source: WellnessPulse Meta-Analysis (2025)


Energy Expenditure (Calories) Accuracy

All wearables are weak at calorie estimation. Accuracy decreases during high-intensity or multi-modal exercise.

| Device | Accuracy |
| --- | --- |
| Oura Ring | ~87% (13% avg error) |
| Apple Watch | 71.02% |
| Fitbit | 65.57% |
| Polar | ~50–65% |
| Garmin | 48.05% |

Source: WellnessPulse Meta-Analysis (2025).

Note: None should be treated as precise calorie counters.


VO2 Max Estimation Accuracy

All devices tend to underestimate VO2 max in highly fit individuals and overestimate in sedentary/lower fitness populations.

| Device | MAPE | MAE | Notes |
| --- | --- | --- | --- |
| Garmin Forerunner 245 | 5.7% | | Acceptable for runners |
| Garmin Fenix 6 | 7.05% | | CCC=0.73 for 30s avg |
| Apple Watch Series 7 | 15.79% | 6.07 ml/kg/min | Underestimates |
| Apple Watch (2025 study) | 13.31% | 6.92 ml/kg/min | Mixed bias |

Sources: Caserman et al. (2024), Lambe et al. (2025), Garmin validation (2025).



Skin Temperature Accuracy

Oura’s internal validation study (2024) tested temperature sensing across 16 individuals over 1 week (93,571 data points):

  • r² > 0.99 in lab conditions, r² > 0.92 in real-world conditions, with precision of ±0.13°C per minute.


⚠️ Funding: This is Oura’s own study, not independently peer-reviewed. However, Oura’s temperature data has been validated in independent menstrual cycle tracking studies (Maijala et al., 2019). Apple Watch, Garmin, WHOOP, and Samsung all track skin temperature, but limited independent comparative data exists.


FDA-Cleared Features

Most wearable metrics are wellness estimates. A few features have earned FDA authorization:

| Feature | Device | Status |
| --- | --- | --- |
| ECG / Atrial Fibrillation Detection | Apple Watch (Series 4+) | FDA Cleared |
| ECG / Atrial Fibrillation Detection | Samsung Galaxy Watch (4+) | FDA Cleared |
| Sleep Apnea Notification | Apple Watch (Series 9+, Ultra 2) | FDA Authorized |
| Sleep Apnea Detection | Samsung Galaxy Watch | FDA Authorized (Feb 2024) |
| Blood Oxygen (SpO2) | Apple Watch | Wellness feature (NOT FDA cleared) |
| Irregular Rhythm Notification | Fitbit | FDA Cleared |


Important Caveats

Before drawing conclusions from any of this data, keep these limitations in mind:


  1. No single device wins everywhere. The best device depends on which metric matters most to you.

  2. Study funding matters. The primary sleep study (Robbins et al.) was Oura-funded. Independent studies (Park, Schyvens) found different rankings. We’ve flagged funding sources throughout so you can decide for yourself.

  3. Device generations matter. Studies often test older hardware. Garmin Fenix 6 and Vivosmart 4 are 2+ generations behind current devices. Results may not apply to current models.

  4. Small sample sizes. The HRV/RHR study had 13 participants (536 nights). Antwerp had 62 participants, 1 night each. Brigham had 36 participants over multiple nights.

  5. All wearables are estimates. None are medical devices (except specific FDA-cleared features listed above). Data should inform, not diagnose.

  6. Individual variation. Accuracy can vary based on skin tone, tattoos, BMI, wrist fit, and activity level.

  7. Skin tone bias. PPG sensor accuracy is affected by skin pigmentation. Most validation studies have predominantly Caucasian participants—a critical research gap.

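  8. PSG is imperfect too. The “gold standard” polysomnography has interrater reliability of κ≈0.75. Because κ is corrected for chance agreement, that is not the same as experts disagreeing 25% of the time, but it does mean trained human scorers disagree on a meaningful share of epochs.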

  9. Common device failure mode. All consumer devices tend to misclassify wake, deep sleep, and REM as light sleep—a conservative algorithmic approach that inflates light sleep totals.
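Because Cohen's kappa (κ) appears throughout this article, here is a minimal sketch of how it is computed and why it reads lower than raw percent agreement. The function uses the standard two-rater formula. The 10-epoch hypnogram is hypothetical and far shorter than a real night of 30-second epochs.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of epochs where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, the product of the two raters' marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical 10-epoch night (W = wake, L = light, D = deep, R = REM).
psg = ["W", "L", "L", "D", "D", "R", "R", "L", "L", "W"]
device = ["L", "L", "L", "D", "L", "R", "L", "L", "L", "W"]

raw_agreement = sum(a == b for a, b in zip(psg, device)) / len(psg)
kappa = cohens_kappa(psg, device)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy example the device agrees with PSG on 7 of 10 epochs, yet κ comes out near 0.55. A device that labels most epochs as light sleep would match PSG fairly often by chance alone, and κ subtracts that out.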


Why Accuracy Matters for Understanding Food-Biometric Patterns

If you’re trying to understand how nutrition affects your sleep, recovery, or energy levels, the accuracy of your wearable data is the foundation everything else builds on. When measurement error is high, real patterns between what you eat and how your body responds get harder to detect. When accuracy is high, the data can surface connections—like how meal timing affects your overnight heart rate, or whether a supplement is actually changing your HRV—that you’d never spot manually.


This is part of the reason we built Kygo Health to integrate with multiple wearable platforms. Different devices bring different strengths. Connecting them to nutrition data in one place gives you a more complete picture to work with.
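The point about measurement error hiding real patterns can be shown with a small simulation. This is a toy sketch under stated assumptions: the variable names, the 0.6 effect size, and both noise levels are invented for illustration and are not drawn from any study or from Kygo's own analysis.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_nights = 5000

# Hypothetical standardized scores: a nutrition input and the body's true response.
meal_timing = [rng.gauss(0, 1) for _ in range(n_nights)]
true_hrv = [0.6 * x + rng.gauss(0, 0.8) for x in meal_timing]

# Two simulated wearables reading the same nights with different amounts of noise.
precise_device = [y + rng.gauss(0, 0.2) for y in true_hrv]
noisy_device = [y + rng.gauss(0, 1.0) for y in true_hrv]

r_true = pearson_r(meal_timing, true_hrv)
r_precise = pearson_r(meal_timing, precise_device)
r_noisy = pearson_r(meal_timing, noisy_device)
print(f"true r = {r_true:.2f}, precise device r = {r_precise:.2f}, noisy device r = {r_noisy:.2f}")
```

The underlying relationship is identical in both cases. Only the sensor noise differs, and the noisier device shows a visibly weaker correlation, so more nights of data are needed to detect the same pattern.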


Using Multiple Wearables Together

Many people in the biohacking and quantified self communities wear multiple devices simultaneously to capture different metrics from different strengths—Oura Ring for sleep plus Apple Watch for workouts, or WHOOP plus Garmin for different contexts.
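One simple way to think about "the best metric from each device" is a lookup: for each metric, pick whichever device you own that scored best in the studies. The sketch below illustrates that idea using a few figures copied from the summary table in this article. It is a hypothetical illustration, not how the Kygo Health app works, and the `SUMMARY` structure and `preferred_device` function are names invented for this example.

```python
# Figures copied from this article's summary table ("Oura" is the Oura Gen 4 row).
# lower_is_better is True for error metrics (MAPE) and False for accuracy percentages.
SUMMARY = {
    "nocturnal_hrv": {
        "lower_is_better": True,  # MAPE, %
        "scores": {"Oura": 5.96, "WHOOP": 8.17, "Garmin": 10.52, "Polar": 16.32},
    },
    "active_heart_rate": {
        "lower_is_better": False,  # accuracy, %
        "scores": {"Apple Watch": 86.3, "Fitbit": 73.6, "Garmin": 67.7},
    },
    "step_count": {
        "lower_is_better": False,  # accuracy, %
        "scores": {"Garmin": 82.6, "Apple Watch": 81.1, "Fitbit": 77.3},
    },
}


def preferred_device(metric, owned_devices):
    """Return the owned device with the best score for `metric`, or None if none was tested."""
    entry = SUMMARY[metric]
    candidates = {d: s for d, s in entry["scores"].items() if d in owned_devices}
    if not candidates:
        return None
    pick = min if entry["lower_is_better"] else max
    return pick(candidates, key=candidates.get)


owned = {"Oura", "Apple Watch"}
for metric in SUMMARY:
    print(f"{metric}: {preferred_device(metric, owned)}")
```

A real setup would also need to weigh the caveats above, such as device generation and study funding, rather than relying on a single number per metric.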


The challenge is getting that data to talk to each other. We wrote a detailed guide on this: How to Centralize Health Data from Multiple Devices.


If you’re specifically using Oura for sleep and want to connect that with food tracking, check out: How to Combine Oura Ring with Food Tracking.


Want to compare devices yourself? Explore all the data from these studies in our free interactive comparison tool: http://kygo.app/tools/wearable-accuracy


Ready to see how your nutrition connects to the biometric data your wearable tracks? Download the Kygo Health app for iOS or Android and start exploring the patterns in your own data.




Sources

  1. Robbins R, et al. (2024). “Accuracy of Three Commercial Wearable Devices for Sleep Tracking in Healthy Adults.” Sensors, 24(20), 6532. DOI: 10.3390/s24206532 — Funded by Oura Ring Inc.

  2. Schyvens AM, et al. (2025). “Performance of six consumer sleep trackers in comparison with polysomnography in healthy adults.” Sleep Advances, 6(1), zpaf016. DOI: 10.1093/sleepadvances/zpaf016 — Independent (VLAIO-funded)

  3. Dial MB, et al. (2025). “Validation of nocturnal resting heart rate and heart rate variability in consumer wearables.” Physiological Reports, 13(16), e70527. DOI: 10.14814/phy2.70527 — Independent

  4. Park et al. (2023). “Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers.” JMIR mHealth and uHealth, 11, e50983. DOI: 10.2196/50983 — Independent

  5. WellnessPulse Meta-Analysis (2025). Accuracy of Fitness Trackers — Aggregate data

  6. Caserman P, et al. (2024). “Validity of Apple Watch Series 7 VO2 Max Estimation.” JMIR Biomedical Engineering, 9, e54023.

  7. Lambe RF, et al. (2025). “Validation of Apple Watch VO2 max estimates.” PLOS One, 20(2), e0318498.

  8. Christakis et al. (2025). “A guide to consumer-grade wearables in cardiovascular clinical care.” npj Cardiovascular Health, 2, 82.

  9. Khodr R, et al. (2024). “Accuracy, Utility and Applicability of the WHOOP Wearable Monitoring Device.” medRxiv. DOI: 10.1101/2024.01.04.24300784

  10. Oura Internal Validation (2024). Temperature sensor validation study. 16 participants, 93,571 data points.

  11. Maijala et al. (2019). “Nocturnal finger skin temperature in menstrual cycle tracking.” BMC Women’s Health, 19, 150.

  12. Lanfranchi et al. (2024). Samsung Galaxy Watch SpO2 validation. Journal of Clinical Sleep Medicine.



FAQ: Wearable Accuracy Questions


Which wearable is the most accurate for sleep tracking?

It depends on the study. In the Oura-funded Brigham study (2024), Oura led with κ=0.65 and 79.5% deep sleep sensitivity. In the independent Antwerp study (2025), Apple Watch led overall (κ=0.53) while WHOOP led deep sleep detection (69.6%). The independent Korean study (2023) ranked Google Pixel Watch and Galaxy Watch highest. Study funding, population, and methodology all affect results.


How accurate is Oura Ring HRV compared to medical devices?

Oura Gen 4 achieved a 0.99 concordance correlation coefficient with Polar H10 ECG in an independent 536-night study (Dial et al., 2025). This is the highest HRV accuracy among consumer wearables tested in that study.


Is WHOOP accurate for HRV tracking?

WHOOP 4.0 showed a CCC of 0.94 and MAPE of 8.17% in the Dial et al. (2025) study—rated “Moderate” on the concordance scale.


Does skin tone affect wearable accuracy?

PPG sensor accuracy is affected by skin pigmentation. Most validation studies have predominantly Caucasian participants, which is a known research gap. Accuracy data may not generalize equally across all skin tones.


Can I use multiple wearables together?

Yes. Many people wear multiple devices to capture different metrics from each device’s strengths. The challenge is correlating data across platforms, which typically requires a third-party platform or manual comparison.


Which wearable is best for tracking how food affects sleep?

For nutrition-sleep correlation analysis, you need accurate sleep and HRV data paired with consistent food logging. The studies above show which devices perform strongest for each metric—the best choice depends on which specific metrics you prioritize.


Are wearable calorie estimates reliable?

No wearable tracks calories with high precision. The highest reported accuracy is Apple Watch at 71%. All devices should be treated as rough estimates. Accuracy decreases further during high-intensity or multi-modal exercise.


Why do different studies show different accuracy rankings?

Study funding, sample size, population demographics, device firmware version, number of nights tested, and PSG scoring protocols all affect results. This is why we include multiple studies and flag funding sources throughout this article.



Disclaimer: Kygo Health LLC is a personal data aggregation and insights platform designed for informational purposes only. The information provided does not constitute medical advice, diagnosis, or treatment. Always consult a licensed healthcare provider with any questions regarding medical conditions.


Have questions about wearable accuracy or data you think should be included? Reach out directly at Ryan@kygo.app. If you have sources or credible data that isn’t listed here, share it and we’ll review and update accordingly.

New York, NY​

© 2025 by KYGO Health LLC. Kygo Health LLC is not intended to diagnose, treat, cure, or prevent any disease. The information provided is for educational purposes only and is not a substitute for professional medical advice. Consult your physician before making any health decisions.
