Factor Analyzing Karpathy's Sleep Trackers

In his post reviewing four sleep trackers (Oura, Whoop, AutoSleep, and 8Sleep), Karpathy left out something very important: any serious attempt to evaluate accuracy. He covers the normality of the sleep-score distribution for each tracker, the range of scores given by each tracker, which ones are prone to ceiling effects, and even how strongly the trackers’ scores correlate with one another. But the closest he comes to evaluating accuracy is noting that the Oura and Whoop scores seem to correlate well with how he feels in the morning. That kind of validation is intuitive, but it is also unreliable. We all know how fallible human judgment is. There should be an objective way to measure how accurate each sleep tracker is.

At first glance, evaluating accuracy looks ill-posed. Without a ground-truth measure of sleep quality, there’s no external criterion against which sleep-tracker scores can be compared, so “accuracy” appears to be unidentifiable. However, the problem becomes tractable under a single modeling assumption: that a latent sleep-quality variable causes the observed correlations among the sleep-tracker scores¹. Under this assumption, standard factor-analytic methods can estimate how strongly each device’s scores correlate with the latent variable. Using a one-factor model² fit to Karpathy’s reported correlation matrix (calculation code here), the estimated loadings on the latent sleep-quality factor are:

Oura: 0.94 (± 0.046)
Whoop: 0.64 (± 0.069)
8Sleep: 0.60 (± 0.076)
AutoSleep: 0.46 (± 0.093)

It seems that Oura is very informative, loading far more strongly onto the latent sleep-quality factor than any of the other trackers. Whoop and 8Sleep perform similarly: both are substantially weaker than Oura and not clearly distinguishable from each other. AutoSleep is the least informative by a fair margin. It would be inaccurate to describe it as “basically a random number generator,” but I wouldn’t recommend anyone use it.
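To make the one-factor assumption concrete: if a single latent variable drives all the scores, the correlation between any two trackers is the product of their loadings, and each loading can be recovered from the correlations through the classic triad identity (squared loading of i = r_ij × r_ik / r_jk). The sketch below is not the post's calculation code, and the triad method is a stand-in for whatever estimator that code uses. Because Karpathy's correlation matrix is not reproduced here, the example builds the correlations the model implies from the loadings reported above and checks that the identity recovers them. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt

# Sleep-quality loadings as reported in the post.
reported = {"Oura": 0.94, "Whoop": 0.64, "8Sleep": 0.60, "AutoSleep": 0.46}


def implied_correlations(loadings):
    """Correlations implied by a one-factor model: r_ij = loading_i * loading_j."""
    return {
        frozenset((a, b)): loadings[a] * loadings[b]
        for a, b in combinations(loadings, 2)
    }


def triad_loadings(names, corr):
    """Recover loadings via loading_i^2 = r_ij * r_ik / r_jk,
    averaged over every pair (j, k) of the other variables."""
    estimates = {}
    for i in names:
        others = [n for n in names if n != i]
        squares = [
            corr[frozenset((i, j))] * corr[frozenset((i, k))] / corr[frozenset((j, k))]
            for j, k in combinations(others, 2)
        ]
        estimates[i] = sqrt(sum(squares) / len(squares))
    return estimates


# Model-implied correlations (NOT Karpathy's observed matrix).
corr = implied_correlations(reported)
recovered = triad_loadings(list(reported), corr)
for name, value in recovered.items():
    print(f"{name}: {value:.2f}")
```

On model-implied correlations the triads agree exactly, so the reported loadings come back unchanged. On a real, noisy correlation matrix the triads would disagree with one another, which is why a fitted estimator with standard errors is preferable to this identity.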

Karpathy also reports a correlation matrix for resting heart rate. This provides a useful comparison case, since resting heart rate is generally treated as a quantity that wearable devices capture with reasonable accuracy. Applying the same factor-analytic approach to the resting-heart-rate correlations (calculation code here) yields the following loadings:

AutoSleep: 0.97 (± 0.009)
8Sleep: 0.97 (± 0.010)
Oura: 0.95 (± 0.013)
Whoop: 0.94 (± 0.015)

In this case, all devices exhibit loadings close to one, suggesting that they all measure resting heart rate well. Notably, the lowest loading in this analysis (Whoop, 0.94) is the same, to two decimal places, as the highest loading on the sleep-quality factor (Oura, 0.94). So it seems that trackers are capable of producing accurate measurements, just not when it comes to sleep quality (with the exception of Oura).
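One way to read these loadings: in a standardized one-factor model, the squared loading is the share of a tracker's score variance attributable to the latent factor, and the remainder is tracker-specific noise. The short sketch below applies that arithmetic to the two sets of reported loadings. The helper name is illustrative.

```python
# Loadings as reported in the post.
sleep = {"Oura": 0.94, "Whoop": 0.64, "8Sleep": 0.60, "AutoSleep": 0.46}
heart = {"AutoSleep": 0.97, "8Sleep": 0.97, "Oura": 0.95, "Whoop": 0.94}


def shared_variance(loadings):
    """Squared loading = fraction of variance explained by the latent factor."""
    return {name: value ** 2 for name, value in loadings.items()}


for label, loadings in (("sleep score", sleep), ("resting heart rate", heart)):
    for name, share in shared_variance(loadings).items():
        print(f"{label:>18} | {name:<9} | shared {share:.0%} | unique {1 - share:.0%}")
```

Under this reading, about 88% of the variance in Oura's sleep score tracks the latent factor, against roughly 41% for Whoop, 36% for 8Sleep, and 21% for AutoSleep. Every resting-heart-rate share is 88% or higher.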

Recommendation: If you’re looking for a sleep tracker, get the Oura ring.

¹ Technically, we don’t know that the latent factor explaining the shared covariance corresponds to actual sleep quality. But (1) Karpathy notes that Oura and Whoop correlate well with how he feels in the morning, which, while not conclusive on its own, is suggestive, and (2) making assumptions of this kind has worked well for me in the past, so it is reasonable to expect them to work well in the present and future. Additionally, parallel analysis supported a one-factor model, and the first principal component accounted for 59% of the variance.
² There are other models that could explain the data. For example, we might expect sleep trackers in the form of watches to be more strongly correlated with one another than a single latent sleep-quality factor would predict. In that case, Whoop and AutoSleep could be modeled as loading on a “watch sleep quality” factor that itself loads on a latent sleep-quality factor. Ultimately, I chose to stick with the one-factor model because of its simplicity, minimal assumptions, and robustness. Additionally, parallel analysis supported a one-factor solution for both the sleep-quality scores and the resting heart-rate data. The first principal component accounted for 59% of the variance in sleep scores and 94% of the variance in heart rates. Thus, the use of a one-factor model is supported by the data.