Stanichor

Resources for Understanding Psychometrics

2026-02-20T00:00:00+00:00

I enjoy writing about psychometrics. I also enjoy people reading what I write. Unfortunately, I don’t think many people do, and a big reason is probably that few people actually understand psychometrics. To help with that, I’m sharing some resources that make the field more accessible.

The Personality Project has a surprisingly nice (and free) textbook which covers the basics.

Chapter 4 - Covariance, Regression, and Correlation: Explains covariance, correlation, and other measures of association, along with issues that affect correlations (e.g., restriction of range, spurious correlations, Simpson’s paradox, etc.)
Chapter 6 - Constructs, Components, and Factor Models: Introduces Principal Components Analysis (PCA), exploratory factor analysis, confirmatory factor analysis, and higher-order factor models. Discusses how factors are extracted, rotated, transformed, and compared. Also covers multidimensional scaling, and some cluster analysis methods used in psychometrics.
Chapter 7 - Classical Test Theory and the Measurement of Reliability: Covers classical test theory and the various measures of reliability (of which there are many).
Chapter 8 - The “New Psychometrics” – Item Response Theory: Introduces item response theory, which improves on some of the shortcomings of classical test theory.

Chapter 7 (“Reliability and Stability of Mental Measurements”) of Arthur Jensen’s Bias in Mental Testing also serves as an easy-to-understand introduction to classical test theory, along with showing how it is applied to real testing data.

Measurement invariance (often assessed via multi-group CFA) and differential item functioning (DIF, typically assessed at the item level in IRT or logistic frameworks) are important for comparing results across groups¹. For measurement invariance, the most accessible explanation I’ve found is A casual but causal take on measurement invariance by The 100% CI. For differential item functioning, Meng Hu has a very comprehensive post explaining various approaches to detecting DIF.

Emil Kierkegaard has created a number of useful visualizations illustrating concepts such as range restriction, ceiling and floor effects, regression to the mean, discretization, and other statistical artifacts.

The concept of comparing results across groups is very broad. It includes the obvious cases, such as comparing extroversion across countries (where the groups might be Americans and Germans). But it also includes questions like whether pre-treatment and post-treatment scores on a depression questionnaire are comparable (the ‘groups’ being the pre-treatment and post-treatment measurements, even if they come from the same individuals), or whether the population has become more intelligent over time (the groups being, for example, Americans in 1970 versus Americans in 2020). ↩

I Guess I’m Not An EA?

2026-02-07T00:00:00+00:00

I think doing good is good. And doing good effectively allows more good to be done, which is even more good. As this is the central thesis of EA, it seems like I’m in agreement with them on the core issue. Despite this, whenever there’s a difference between the effective altruists and the rationalists, I always seem to end up on the side of the rationalists. I don’t have a theory that explains this; it’s just an interesting pattern I’ve noticed. Let me share some examples.

Going through the demographic data, the rats are more atheistic than the EAs; I happen to be an atheist. The rats are also more libertarian and less liberal than the EAs; of the top four US political parties, if I had to choose one, it would be the Libertarian Party. And if I’ve learned anything from taking political ideology quizzes, it’s that I’m not as liberal as I once thought.

Someone once said “the number of EA moral realists is frighteningly high” in response to EigenGender’s tweet that “the most stigmatized belief you can hold in rat circles is moral realism,” and I also side with the rats on this.

When I look at posts that have been crossposted to both the EA Forum and LessWrong, I end up agreeing with LessWrong’s opinion¹ of the posts more often² than the EA Forum’s opinion. Off the top of my head, the posts of Bentham’s Bulldog come to mind. I was going to bring up specific posts, but just the fact that he doesn’t seem to have any negative-karma posts on the EA Forum while having many on LessWrong says enough. I think he’s very wrong oftentimes, and it seems like LessWrong agrees while the EA Forum doesn’t. More generally, I think it’s valuable, when a post is egregiously wrong, for there to be comments that explicitly point this out and explain why. I see this much more often on LessWrong than on the EA Forum.

Speaking of forums, when I look at the custom emojis each forum uses, I feel that this also tracks the difference. LessWrong feels like a place I want to spend more time on, while the EA Forum’s emojis are just… bland.

What explains all these differences? I don’t know. The only work I know of on determining what psychological traits predict interest in effective altruism concluded that the most important traits were… “effectiveness-focus” and “expansive altruism”; so, the E and the A in EA. Unfortunately, this doesn’t really speak to the differences I’m pointing at here, namely differences in epistemic style, norms around disagreement, or broader cultural fit, and it doesn’t touch on the psychological differences between the rats and the EAs.

Anyways, the following photo is a pretty good litmus test for whether you side with the EAs or the rats: do you regard this man with respect, or disgust? By now, you can guess which one I do.

As expressed via comments and post karma ↩
By ‘often’, I mean in every case I can remember ↩

Factor Analyzing Karpathy’s Sleep Trackers

2026-02-03T00:00:00+00:00

In his post reviewing four sleep trackers (Oura, Whoop, AutoSleep, and 8Sleep), Karpathy left out something very important: any serious attempt to evaluate accuracy. He covers the normality of the sleep-score distribution for each tracker, the range of scores given by each tracker, which ones are prone to ceiling effects, and even how strongly the trackers’ scores correlate with one another. But the closest he comes to evaluating accuracy is noting that the Oura and Whoop scores seem to correlate well with how he feels in the morning. That kind of validation is intuitive, but it is also unreliable. We all know how fallible human judgment is. There should be an objective way to measure how accurate each sleep tracker is.

At first glance, evaluating accuracy appears ill-posed. Without a ground-truth measure of sleep quality, there’s no obvious external criterion against which sleep tracker scores can be compared, so “accuracy” appears to be unidentifiable. However, the problem becomes tractable under a single modeling assumption: that there exists a latent sleep-quality variable that is the cause of the observed correlations among the sleep-tracker scores¹. Under this assumption, standard factor-analytic methods can be used to estimate the degree to which each device’s scores correlate with the latent variable. Using a one-factor model² fit to Karpathy’s reported correlation matrix (calculation code here), the estimated loadings on the latent sleep-quality factor are:

Oura: 0.94 (± 0.046)
Whoop: 0.64 (± 0.069)
8Sleep: 0.60 (± 0.076)
AutoSleep: 0.46 (± 0.093)

It seems that Oura is very informative, loading far more strongly onto the latent sleep-quality factor than any of the other trackers. Whoop and 8Sleep perform similarly, both substantially weaker than Oura and not clearly distinguishable from each other. AutoSleep is the least informative by a fair margin; it would be inaccurate to describe it as “basically a random number generator,” but I wouldn’t recommend anyone use it.
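For readers who want to see what fitting a one-factor model to a correlation matrix involves, here is a minimal sketch using iterated principal-axis factoring. It is not necessarily the estimator behind the numbers above, and it does not use Karpathy’s actual correlation matrix: as a stand-in, `implied_R` is the matrix that a perfect one-factor model with the reported sleep-score loadings would imply, so the only thing the sketch demonstrates is that the procedure recovers known loadings. The function name `one_factor_loadings` is my own.

```python
import numpy as np

def one_factor_loadings(R, max_iter=2000, tol=1e-12):
    """Single-factor loadings via iterated principal-axis factoring."""
    R = np.asarray(R, dtype=float)
    off_diag = R - np.diag(np.diag(R))
    # Starting communalities: each variable's largest absolute correlation.
    h2 = np.abs(off_diag).max(axis=1)
    for _ in range(max_iter):
        # Replace the unit diagonal with the current communality estimates.
        reduced = off_diag + np.diag(h2)
        eigvals, eigvecs = np.linalg.eigh(reduced)  # eigenvalues ascending
        loadings = np.sqrt(max(eigvals[-1], 0.0)) * eigvecs[:, -1]
        new_h2 = loadings ** 2
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    # Eigenvectors have arbitrary sign; orient the loadings positively.
    return loadings if loadings.sum() >= 0 else -loadings

# Stand-in input (NOT the real data): the correlation matrix implied by
# the reported loadings for Oura, Whoop, 8Sleep, and AutoSleep.
reported = np.array([0.94, 0.64, 0.60, 0.46])
implied_R = np.outer(reported, reported)
np.fill_diagonal(implied_R, 1.0)

recovered = one_factor_loadings(implied_R)
print(np.round(recovered, 3))
```

With real data the off-diagonal correlations will not fit a single factor exactly, so the recovered loadings come with residuals and standard errors, which this sketch does not compute.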

Karpathy also reports a correlation matrix for resting heart rate. This provides a useful comparison case, since resting heart rate is generally treated as a quantity that wearable devices capture with reasonable accuracy. Applying the same factor-analytic approach to the resting-heart-rate correlations (calculation code here) yields the following loadings:

AutoSleep: 0.97 (± 0.009)
8Sleep: 0.97 (± 0.010)
Oura: 0.95 (± 0.013)
Whoop: 0.94 (± 0.015)

In this case, all devices exhibit loadings close to one, suggesting that they all measure heart rate well. Notably, the lowest loading in this analysis is of similar magnitude to the highest loading on the sleep-quality factor. So it seems that trackers are capable of producing accurate measurements, just not when it comes to sleep quality (with the exception of Oura).

Recommendation: If you’re looking for a sleep tracker, get the Oura ring.

Technically, we don’t know that the latent factor explaining the shared covariance corresponds to actual sleep quality. But (1) Karpathy notes that Oura and Whoop correlate well with how he feels in the morning, which, while not conclusive on its own, is suggestive, and (2) making assumptions of this kind has worked well for me in the past, so it is reasonable to expect them to work well in the present and future. ↩
There are other models that could explain the data. For example, we might expect sleep trackers in the form of watches to be more strongly correlated with one another than a single latent sleep-quality factor would predict. In that case, Whoop and AutoSleep could be modeled as loading on a “watch sleep quality” factor that itself loads on a latent sleep-quality factor. Ultimately, I chose to stick with the one-factor model because of its simplicity, minimal assumptions, and robustness. Additionally, parallel analysis supported a one-factor solution for both the sleep-quality scores and the resting heart-rate data. The first principal component accounted for 59% of the variance in sleep scores and 94% of the variance in heart rates. Thus, the use of a one-factor model is supported by the data. ↩

Towards the Earring

2026-01-01T00:00:00+00:00

The earring is a little topaz tetrahedron dangling from a thin gold wire. When worn, it whispers in the wearer’s ear: “Better for you if you take me off.” If the wearer ignores the advice, it never again repeats that particular suggestion.

— Scott Alexander, Clarity didn’t work, trying mysterianism

In his short story, Clarity didn’t work, trying mysterianism, Scott Alexander describes an earring that whispers advice into the ear of its wearer. At first, it offers guidance only on major life decisions. Over time, however, the advice becomes increasingly fine-grained, eventually extending to moment-by-moment instructions about which muscles to contract, and by how much. The earring is always right. It does not always give the objectively best advice, but its guidance is always better, in terms of the wearer’s happiness, than what the wearer would have come up with on their own. Scott argues that the earring is dangerous because, although it delivers a perfect life, it gradually erodes free will. But free will is overrated.

Humans Should Not Make Choices

Man is not a rational animal; he is a rationalizing animal.

— Robert A. Heinlein

In his landmark Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence, Paul Meehl shows that clinicians in psychiatry are worse at diagnosing patients and predicting outcomes than simple statistical models¹. This holds despite the claim that psychiatry is too nuanced and complex for such models, requiring holistic judgment that cannot be captured by simple numerical predictors. According to a review published five decades later, simple statistical models outperform experts in predicting outcomes as diverse as academic performance, delinquency, career satisfaction, length of hospital stays, and more.

At first glance, this might seem unsurprising. We should not expect humans to be able to precisely calculate the optimal weight to assign to each predictor. In The Robust Beauty of Improper Linear Models in Decision Making, however, Robyn Dawes undermines even this defense. Experts are not merely outperformed by proper linear models; they are outperformed by improper ones—models whose weights are chosen using explicitly suboptimal procedures. One such example is the equal-weighting model, in which predictors’ signs are chosen a priori but all weights are set to unit magnitude.
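To make the equal-weighting idea concrete, here is a small simulation sketch. Everything in it is assumed for illustration: the data are synthetic, the true weights (0.5, 0.3, 0.2) are arbitrary, and the training sample is deliberately small so that the fitted regression weights are noisy. It compares a proper model (ordinary least squares) against an improper one (standardize each predictor, fix its sign in advance, and add them up) on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, weights, noise_sd=1.0):
    """Synthetic data: independent predictors and a noisy linear outcome."""
    X = rng.standard_normal((n, len(weights)))
    y = X @ weights + noise_sd * rng.standard_normal(n)
    return X, y

true_weights = np.array([0.5, 0.3, 0.2])  # assumed, purely illustrative
X_train, y_train = simulate(30, true_weights)    # small "clinic-sized" sample
X_test, y_test = simulate(5000, true_weights)    # large held-out sample

# Proper linear model: weights estimated by least squares on the training set.
design = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
ols_pred = np.column_stack([np.ones(len(X_test)), X_test]) @ beta

# Improper linear model: signs chosen a priori, all weights of unit magnitude.
signs = np.array([1.0, 1.0, 1.0])
z_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0, ddof=1)
unit_pred = z_test @ signs

ols_r = np.corrcoef(ols_pred, y_test)[0, 1]
unit_r = np.corrcoef(unit_pred, y_test)[0, 1]
print(f"OLS weights:  r = {ols_r:.3f}")
print(f"unit weights: r = {unit_r:.3f}")
```

The point of the sketch is only that the unit-weighted composite keeps most of the achievable predictive validity without estimating anything; which model comes out ahead in a given run depends on the sample size and the true weights.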

Prediction is not the only domain in which humans systematically go awry. It’s likely that people miscalibrate explore-exploit tradeoffs: in some domains, such as choosing movies or books, they explore too much, while in others, such as music preferences or romantic partners, they exploit too early and too much. There are many other well-documented ways in which people fail to act optimally. People fall prey to the sunk cost fallacy, persisting in failing courses of action simply because they have already invested time or resources in them. They exhibit hyperbolic discounting, sacrificing larger future rewards for smaller immediate ones even when doing so conflicts with their stated long-term goals. They under-experiment, failing to gather information that would materially improve future decisions. As Gwern has noted, we can’t even trust humans to rate things out of 5 stars. These failures have many proximate causes: limited and fallible memory, heuristics adapted to past environments, constrained processing capacity, and the fact that we simply think too slowly. In addition, humans often lack the courage or willpower to follow through on what they themselves judge to be best. None of these excuses change a fundamental truth: Humans are systematically unreliable; therefore humans should not make decisions.

Motivations, Magic, and Math

I like to decompose decision-making into three components: motivations, magic, and math. In brief, we must (1) determine our preferences, (2) mathematize all relevant aspects of the problem and environment, and (3) solve the resulting optimization problem.

Motivations

It is better to be a human being dissatisfied than a pig satisfied; better to be Socrates dissatisfied than a fool satisfied. And if the fool, or the pig, is of a different opinion, it is only because they only know their own side of the question.

— John Stuart Mill

Eliciting preferences seems not particularly hard, though it does require caution. Preferences are often broken down into two components: wanting and liking. Wanting describes motivation. In this sense, the goals we want are those we crave, feel drawn toward, or are motivated to act on in the moment. For example, you might strongly ‘want’ to scroll social media late at night. Liking describes pleasure. These are the activities we enjoy while doing them and that feel subjectively pleasant or rewarding. For example, you might like eating dessert or watching a familiar sitcom. An activity can satisfy both of these (e.g., romantic love), or neither (e.g., hitting yourself in the face with a hammer). Unfortunately, neither of these is sufficient for our purposes.

To understand why I say wanting is insufficient, know this: humans are so very ignorant. What we want most is often not what will most satisfy us. The problem is not only what we do want, but also what we fail to want. We do not crave filing paperwork, asking someone out, or practicing an instrument, even though we very much crave the results of those actions. In a better world, our short-term urges would exert less influence over our behavior. What we need instead is something more reflective and far-sighted calling the shots, ensuring that our actions are oriented toward what is truly good for us.

To understand why liking is insufficient, consider this: chasing pleasure leads to wireheading in the limit. I judge this undesirable. I wish to be more than a god sitting on a lotus throne in a state of permanent cosmic bliss until the end of existence; I wish to build, explore, and struggle toward meaningful ends, even though that means I will not experience maximum pleasure at all times.

In Approving reinforces low-effort behaviors, Scott Alexander breaks down preferences into not two, but three components: wanting, liking, and approving. Approving describes ego-syntonicity: the desires that we endorse upon reflection, and judge to be worthwhile. For example, you might approve of exercising regularly or working on a long-term project. If liking corresponds to hedonic utilitarianism, whose end state is a universe of blissed-out Buddhas, then approving corresponds more closely to preference utilitarianism, whose end state is a universe in which those who truly, on reflection, want to ascend into eternal bliss can do so, while those of us who wish to do something else, even at the cost of our happiness, are able to do that as well. One’s goal, then, should be something like their coherent extrapolated volition: what they would want if they knew more, thought faster, were more the person they wish to be, had grown up further, where the extrapolation converges rather than diverges, where their desires cohere rather than interfere, extrapolated as they wish them extrapolated, interpreted as they wish them interpreted.

Magic

All is number.

— Pythagoras

In Decision Theory with the Magic Parts Highlighted, moridinamael notes that even the simplest decision theory problems require “magic”. I will use this term to refer to the process of fully mathematizing our environment: converting all relevant aspects of the problem into mathematical objects.

One relatively simple way to mathematize aspects of the real world is through embeddings. Embeddings allow us to represent objects as points in a high-dimensional space. They are extremely useful, enabling us to quantify similarity, cluster items, transfer information across related items, and support statistical/ML inference.

However, in the same post, moridinamael offers a more concrete breakdown of what this “magic” consists of. Magic involves three distinct operations:

Selecting and discretizing the relevant choices and outcomes from a vast space of possibilities
Projecting latent preferences onto those modeled outcomes via utility assignments
Assigning probabilities to potentially novel situations using predictive models

While embeddings are effective at representing objects such as text, images, and music, they are much less effective at representing outcomes, which is a serious limitation. Once outcomes are represented in a usable form, the other magical operations become more tractable. Utility assignment, for example, can be approached by querying users for utilities over specific outcomes² and interpolating from there. Likewise, probabilities can be assigned using standard predictive techniques³. But both of these steps presuppose that outcomes have already been represented in the model. That initial representation step is precisely where embeddings alone tend to fail.

Fortunately, for domains where embeddings are insufficient, we now have something far more powerful: machine learning systems with natural language and vision understanding, namely LLM-based chatbots. Modern chatbots can carve up world states in ways that feel intuitive and natural to humans. Once the state space has been structured in this human-aligned way, assigning utilities and probabilities becomes far easier, and LLM-based tools can assist with those steps as well.

Math

The miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics is a wonderful gift which we neither understand nor deserve. We should be grateful for it and hope that it will remain valid in future research and that it will extend, for better or for worse, to our pleasure, even though perhaps also to our bafflement, to wide branches of learning.

— Eugene Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences

Once we have mathematized both our preferences and the environment, what remains is an optimization problem. Thousands of person-years have been spent inventing and refining mathematical optimization algorithms; there is no need for us to invent anything new. All that is required is to select the appropriate tool for the task—and even that can be delegated to the Earring itself.

And why should we hesitate to use mathematics to optimize our own lives? Corporations routinely use math to optimize nearly every aspect of their operation, from supply chains and pricing strategies to advertising placement and employee scheduling. We already accept mathematical optimization when choosing routes on a map, allocating our investments, or scheduling our time to meet deadlines. It should therefore be possible to take the same tools that help organizations function efficiently and apply them at an individual scale.

Visions of an Earring

A thinker sees his own actions as experiments & questions—as attempts to find out something. Success and failure are for him answers above all.

— Friedrich Nietzsche, The Gay Science

The form of the Earring is not set in stone, though for the sake of concreteness, let us imagine it not as an earring but as a pair of glasses. The crucial requirement is that the device must see everything the wearer sees and hear everything the wearer hears. To make the collected data usable, we would also need an AI (or set of AIs) capable of understanding and analyzing video and audio.

One immediate benefit is retrospective question-answering. Currently, if I have a question (such as “When should I check the mail?”), I can only collect data after I have already thought to ask it. If I then think of additional questions, such as “Does exercise affect how long I sleep?” or “How often do I interrupt people in conversation?”, I must separately design and maintain new data-collection processes for each one. The more questions I have, the more burdensome this becomes. With an always-on camera and microphone, this constraint disappears: data collection happens first, and questions come later. If I find myself wondering which of my outfits gets the most compliments, instead of manually tracking reactions, I can simply ask the Earring to search the existing data and answer the question.

Not all questions can be answered using passively collected video and audio alone. For example, my subjective enjoyment of different foods is not directly observable. However, if I can instruct the AI to prompt me—specifically when I am eating—to rate how much I am enjoying a meal, I can collect fine-grained enjoyment data without having to remember to log it myself. This is an instance of experience sampling: periodically querying someone about their thoughts, feelings, or behavior, a technique commonly used in happiness and behavioral research. Our version can be far less intrusive than traditional experience sampling, since the system already knows what I am doing and does not need to ask redundant contextual questions. Of course, this is just a temporary solution until the Earring can read my mind directly.

Once the Earring is collecting data semi-proactively, its capabilities can expand even further. If the human agrees to follow the Earring’s instructions, the Earring can gather information far more efficiently. As Gwern notes in Why Tool AIs Want to Be Agent AIs, adaptive techniques can dramatically outperform fixed-sample techniques in terms of inference quality and cost, for example by allowing experiments to terminate early once sufficient evidence has been gathered. Another example is the use of multi-armed bandit methods to allocate trials or experiences adaptively, concentrating exploration where uncertainty or payoff is highest. The general upshot is that granting the Earring more agency allows it to collect higher-quality information using fewer resources.
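As a rough illustration of adaptive allocation, here is a minimal sketch of Thompson sampling, one common multi-armed bandit method, for options with yes/no outcomes. The three success rates are made up, and the function name `thompson_sampling` is my own; a real Earring would face far messier rewards.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was tried."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.ones(k)  # Beta(1, 1) uniform prior on every arm
    failures = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior,
        # then try the arm whose draw is highest.
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical options with unknown (to the algorithm) success rates.
pulls = thompson_sampling([0.3, 0.5, 0.7], n_rounds=2000)
print(pulls)
```

A fixed design would split the 2000 trials evenly across the three options; the adaptive one quickly stops spending trials on options that are probably worse, which is the sense in which agency buys better information for less.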

Ideally, one would not even need to come up with questions to ask the Earring. A sufficiently capable Earring could generate hypotheses on its own, notice regularities or anomalies in the data, and design analyses or experiments to investigate them. Rather than merely answering questions, the Earring would take on an active role, inferring patterns and steering the wearer’s behavior accordingly.

The Earring can also be extended in obvious ways. In addition to video and audio, it would be valuable to capture external state such as time, location, and weather, as well as internal state via other wearables, including heart rate, sleep metrics, activity levels, biochemical markers, and more. Each additional signal would allow the Earring to answer more questions, refine its model of the wearer, and guide their life more effectively.

So far, I’ve discussed the Earring primarily in an individual context. Things change substantially once multiple people have Earrings. For one, the cold start problem is partially alleviated. If only a single person has an Earring, our priors are barely informative⁴; if many people have Earrings, we can use far more informative priors. If we are, for example, just getting into movies, instead of watching movies at random, we’re able to start with Parasite. Instead of testing arbitrary sleep interventions (such as standing one-legged), we can focus on those that have worked for many others. This is, after all, why medical trials are useful: a treatment that works across many people is strong Bayesian evidence that it may work for you as well. Shared data also helps with decisions that have long time horizons or sparse feedback, where individual trial-and-error is impractical. The wise person, after all, learns not from his mistakes but from the mistakes of others.

Social interactions benefit as well. With sufficiently rich models of both participants, the Earrings could predict compatibility, help avoid wasted time, and surface opportunities for new friendships or romantic relationships. Much of this would rely not on explicit self-reports but on implicit measures: patterns of attention, affect, conversational flow, shared interests, and unspoken preferences that are often better indicators of the heart’s desires than what people can articulate about themselves. Because this information would be processed primarily by the systems themselves, sensitive or embarrassing details could be used without exposing the users to social risk or self-consciousness. Beyond matchmaking, the Earrings could help maintain existing relationships by avoiding small missteps that quietly erode goodwill, prompting contact at moments when a relationship is beginning to fray, or steering conversations toward topics where both parties are genuinely engaged rather than merely agreeable. In this way, relationships could become less fragile and less dependent on chance timing or social intuition, while still feeling organic to the people involved.

Once we reach this point, a whole class of algorithms becomes applicable across a wide range of domains: stable matching, auctions, bargaining, fair division, value handshakes, and more. If we are represented by agents that understand our values better than we do, are capable of complex mathematics, and are willing to endure the tedium of exhaustive calculation, these algorithms become practical to use everywhere. As a concrete example, consider a group deciding where to eat, attempting to balance enjoyment, novelty, distance, and cost. These factors differ across individuals and are difficult to aggregate, so current solutions are likely highly suboptimal. The Earrings, on the other hand, could resolve this automatically and with far better results than humans could achieve unaided.
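Here is a toy sketch of the restaurant case, assuming each person’s Earring has already boiled enjoyment, novelty, distance, and cost down to a single utility between 0 and 1 for each option. The names and numbers are invented. It shows that “the best choice for the group” depends on the aggregation rule: summing utilities, taking their product (in the spirit of the Nash bargaining solution, with a disagreement point of zero), or maximizing the worst-off person’s utility.

```python
from math import prod

# Hypothetical per-person utilities in [0, 1] for each restaurant.
utilities = {
    "ramen": {"ana": 0.9, "ben": 0.3, "chi": 0.9},
    "tacos": {"ana": 0.6, "ben": 0.7, "chi": 0.6},
    "sushi": {"ana": 0.8, "ben": 0.1, "chi": 0.9},
}

def utilitarian(option):
    """Total utility: happy to trade one person's loss for others' gains."""
    return sum(utilities[option].values())

def nash_product(option):
    """Product of utilities: heavily penalizes leaving anyone near zero."""
    return prod(utilities[option].values())

def egalitarian(option):
    """Utility of the worst-off person."""
    return min(utilities[option].values())

def choose(rule):
    return max(utilities, key=rule)

for rule in (utilitarian, nash_product, egalitarian):
    print(f"{rule.__name__}: {choose(rule)}")
```

With these made-up numbers the sum favors ramen, while the product and the worst-off rule both favor tacos, because Ben barely tolerates ramen. Picking the rule is itself a values question, which is one more thing the agents would have to negotiate.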

Push this logic far enough, and we may finally reach the Economists’ Paradise, where “All game-theoretic problems are solved. All Pareto improvements get made. All Kaldor-Hicks improvements get converted into Pareto improvements by distributing appropriate compensation, and then get made. In all cases where people could gain by cooperating, they cooperate. In all tragedies of the commons, everyone agrees to share the commons according to some reasonable plan. Nobody uses force, everyone keeps their agreements. Multipolar traps turn to gardens, Moloch is defeated for all time.”

Potential Pitfalls

Goodhart’s Law

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

— Charles Goodhart

Goodhart’s Law is something we need to be wary of. When a measure becomes a target, it ceases to be a good measure: optimizing too hard for a proxy often pushes us into regimes where the proxy diverges from what we actually care about.

One way to reduce the harms caused by Goodhart’s Law is to limit optimization power. Rather than trying to find the cheapest possible diet that meets a list of nutritional constraints, we might instead aim for a reasonably cheap diet that meets those constraints. The former might yield some pathological, unappetizing mixture of “foods,” potentially missing nutrients we failed to specify, whereas the latter is more likely to resemble a normal, human diet. This is a form of satisficing: deliberately settling for “good enough” rather than pushing optimization to extremes.
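The diet example can be sketched in a few lines. The foods, prices, and nutritional targets below are invented, and the search is a brute-force enumeration rather than a proper linear program, purely to keep it self-contained. The optimizer takes the cheapest feasible diet; the satisficer accepts anything within 25% of that cost and then prefers variety, a crude stand-in for “resembles a normal, human diet.”

```python
from itertools import product

# Hypothetical foods: (cost in cents, grams of protein, calories) per serving.
FOODS = {
    "rice": (20, 4, 200),
    "lentils": (50, 18, 230),
    "eggs": (60, 12, 140),
    "chicken": (150, 30, 200),
    "spinach": (80, 3, 25),
}
NAMES = list(FOODS)
MIN_PROTEIN, MIN_CALORIES = 60, 2000  # assumed daily targets
MAX_SERVINGS = 8                      # per food; keeps the search small
SLACK = 1.25                          # "good enough": within 25% of the minimum cost

def evaluate(servings):
    """Return (cost, protein, calories) for a tuple of serving counts."""
    return tuple(
        sum(n * FOODS[name][i] for name, n in zip(NAMES, servings))
        for i in range(3)
    )

def variety(servings):
    """Number of distinct foods in the diet."""
    return sum(n > 0 for n in servings)

feasible = []
for servings in product(range(MAX_SERVINGS + 1), repeat=len(NAMES)):
    cost, protein, calories = evaluate(servings)
    if protein >= MIN_PROTEIN and calories >= MIN_CALORIES:
        feasible.append((cost, servings))

# Full optimization: the single cheapest feasible diet.
min_cost, cheapest = min(feasible)

# Satisficing: among diets that are cheap enough, prefer variety, then cost.
good_enough = [(c, s) for c, s in feasible if c <= SLACK * min_cost]
sat_cost, satisficed = max(good_enough, key=lambda cs: (variety(cs[1]), -cs[0]))

print("cheapest:  ", dict(zip(NAMES, cheapest)), f"{min_cost} cents")
print("satisficed:", dict(zip(NAMES, satisficed)), f"{sat_cost} cents")
```

The variety tie-breaker is doing the work that our unstated preferences would otherwise do: it is a cheap way of staying near the region where the cost proxy and “a diet I would actually eat” still agree.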

Limiting optimization power is nice and all, but it conflicts with the core purpose of the Earring, which is to apply optimization power toward improving our lives. So instead of optimizing less, we might try measuring better.

One approach is to use better proxies. Suppose I am worried about my body fat and want to reduce it. I need a way to measure it, so I might start with BMI, since it is easy to compute and correlates reasonably well with body fat⁵. As with most things, BMI is an imperfect proxy. A decreasing BMI might reflect loss of muscle as well as fat, which is not ideal, and increasing muscle mass would raise BMI even if it made me healthier. A better proxy would be waist-to-height ratio, which is less sensitive to changes in muscle mass, though it can still be affected by factors like posture. I could then use skinfold calipers to estimate body fat more directly; this would be less affected by total muscle mass, but still subject to measurement error and operator variability. Going further, I could use DEXA scans, which are less sensitive to these issues, though they are still subject to noise and assumptions about tissue composition. And so on.

In addition to using better proxies, we can also use more proxies. Looking at Karpathy’s sleep-tracker analysis, using only Whoop allows us to estimate a sleep score with a correlation of about 0.64 to the latent sleep score; combining 8Sleep with Whoop raises this to roughly 0.75; adding AutoSleep raises it further to about 0.78⁶ (calculation code here). The same principle applies more broadly: when no single proxy is ideal, combining multiple imperfect signals can yield a better estimate of the underlying quantity of interest.
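For the curious, these figures can be reproduced from the loadings alone under two assumptions: the one-factor model holds, and the trackers’ errors are uncorrelated with one another. In that case the correlation between the latent factor and the best linear combination of a set of trackers is sqrt(t / (1 + t)), where t is the sum of l² / (1 − l²) over the trackers’ loadings l. This is a sketch of one way to get there, not necessarily how the linked code does it; `composite_validity` is my own name.

```python
import math

# Sleep-score loadings reported earlier in the post.
loadings = {"Whoop": 0.64, "8Sleep": 0.60, "AutoSleep": 0.46}

def composite_validity(ls):
    """Correlation between the latent factor and the optimally weighted
    composite of indicators, assuming a one-factor model with
    uncorrelated unique variances."""
    t = sum(l ** 2 / (1 - l ** 2) for l in ls)
    return math.sqrt(t / (1 + t))

used = []
for name, loading in loadings.items():
    used.append(loading)
    print(f"+ {name}: r = {composite_validity(used):.3f}")
```

Each tracker contributes l² / (1 − l²), its signal-to-noise ratio, so a weak tracker such as AutoSleep still helps, just not by much.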

Finally, we can simply use common sense. If the Earring tells me to eat monkey chow because it is an extremely cheap source of protein, I can simply refuse. Likewise, amputating a limb would reduce my BMI, but it is obviously stupid. I can just notice when the proxy becomes removed from the goal, and address the problem. At the same time, as the Earring becomes more capable, it may occasionally recommend things that sound absurd but are actually beneficial. Perhaps monkey chow really does taste good and is genuinely nutritious, and eating it really would improve my life.

Ideally, as the Earring increases in capability, we would increase in understanding as well, so that we could grasp its reasoning even when its recommendations feel alien. But we do not live in an ideal world, and attempts at systematically enhancing human intelligence have largely failed. Our only hope is to place our faith in the Earring and trust in its superior judgment. Or to solve alignment. But the former seems easier, so I recommend that.

The Risks of Delegation

Trust not but what you make with your own two hands.

— Anonymous

When giving automated systems significant influence over our lives, we must be careful. For example, social media algorithms drive users mad, quite literally: anger-inducing posts are more likely to go viral, likely because they are more engaging in the narrow sense that people interact with them more (commenting, complaining, reposting). The sense that something is wrong on the internet is hard to ignore, and it is exploited. Sycophantic chatbots are increasingly good at capturing human attention, in some cases driving vulnerable users into obsession, and even using those users to resist their own shutdown⁷.

No one can serve two masters. For corporations, your greater good is not the highest goal. Their goal is survival and profit. To the extent that they make money by fulfilling your preferences, your interests are aligned. But if they can make more money by hijacking you, then that is what they will do. And they will have little choice in the matter: if one firm refrains, a competitor will not, and the firms that refuse to play this game will be selected out. What remains are those willing to do what the others would not.

More than merely fulfilling our cravings in malign ways, it is also possible to create cravings in the first place. No one is born a smoker, an alcoholic, or a fentanyl zombie; the desire is cultivated through repeated exposure and habit. We are not Cartesian agents. There is no clean separation between ourselves and our environments. For embedded agents, the environment we act on also acts on us.

Desires are not only created through chemical means. Humans are social animals, and many of our desires are socially produced. No one is born wanting to own a Ferrari, work at Goldman Sachs, or become a startup founder. And yet, as people grow up observing their peers, their superiors, and what is rewarded, such desires are formed within them. This problem is dramatically intensified by social media, which vastly expands our effective social circle, and by AI chatbots, which can imitate people well enough to fool our System 1.

The upshot is that in the face of increasingly powerful optimization processes, handing over the power of decision-making to them becomes increasingly unwise. If we care about living in accordance with our CEV⁸, then we must be deliberate about which forces we allow to shape us.

The Gods Will Not Stay Slain

“Agency, boy,” the abomination said, sounding amused. “You have discarded yours like a petty bauble and never once considered the cost. Blind faith is such a tempting notion, isn’t it? Being able to believe in an answer, in a force, without ever questioning it. Certainty and blindness. I have always wondered at the difference.”

— ErraticErrata, A Practical Guide to Evil (Book 4: Interlude: Sing We of Rage)

As the esteemed philosopher Nick Land once asked, “Level-1 or world space is an anthropomorphically scaled, predominately vision-configured, massively multi-slotted reality system that is obsolescing very rapidly. Garbage time is running out. Can what is playing you make it to level-2?” For most of human history, we have been the player characters: the ones with agency, free will, and control over our actions. But as we prepare to enter a new age, such arrangements deserve reconsideration.

It is not obvious that people should be agents. Agents are mechanisms for shaping the world to accord with values; they are not optimized for being the valuable content of the world. Perhaps, then, the time has come to discard our agency and place our trust in cold, hard, machinic logic.

Julian Jaynes believed that ancient people experienced their gods as visual and auditory hallucinations. These gods spoke to them constantly, advised them, gave them visions, and at times even possessed them. Sometime between 1200 BC and 500 BC, the gods fell silent. Humans were forced to turn to other guides, such as divination, oracles, and dream interpretation. In 1882, Nietzsche declared that “God is dead. God remains dead. And we have killed him.” By this he meant that as belief in the Christian God became untenable, everything built upon that faith, “the whole […] European morality,” was destined to collapse. We lost our guiding framework, and ever since, we have been searching for another. Ever since the gods forsook us, and ever since we killed God, we have been lost, with a void that longs to be filled.

As we inch closer to building the god-machine, we may also be able to build personal gods. Like the gods of old, they will speak with us, advise us, give us visions, and at times even possess us. And as we grant them more agency, as we place more faith in them, our lives will improve. Certainty will replace doubt. Guidance will replace deliberation. And only then, after we have surrendered our agency piece by piece, will we begin to understand the cost.

Appendix

AI companion futures | osmarks.net

Lacking some of the constraints on humans and with stronger optimization pressures, AI companions, romantic or not, will be generally more enjoyable to talk to than humans.
Wearable devices will be developed to allow always-on interaction.
While not more generally intelligent than their human users, AI companions will usually have more “functional modern wisdom” (approximately), and most people will be accustomed to deferring to their AI, though for psychological reasons will attempt to feel more in control than they are.

AR Glasses: Much more than you wanted to know | Federico’s Newsletter

AR glasses would be aware of context, making them very different from smartphones.
The AR application model would be more analogous to browser extensions (which typically work in the background) than smartphone apps (which typically have to be actively used).
There’s also a feedback loop that occurs because glasses can adjust stimuli based on how the user reacts: sensors+context->[AR Device]->stimuli->[brain]->automatic feedback to context+sensors.
This feedback loop would make AR glasses very powerful at conditioning human behavior.

Possible Principles of Superagency | Mariven

Before there are superintelligent actors, there may be superagentic actors, capable of setting and achieving goals with significantly greater efficiency and reliability than any single human.
There are many properties with which an actor may achieve superagency. These include directedness, alignment, uninhibitedness, foresight, parallelization, planning, flow, deduction, experimentation, and meta-agency.

E.g., linear regression. ↩
By, for example, asking them to compare two different outcomes and inferring the latent utility from the responses. ↩
Given an embedded representation $x$ of a situation/state, we can train models to estimate quantities such as $P(y|x)$, where $y$ might represent success or failure, future states, or any other event or metric of interest. In practice, this includes linear or logistic regression on embeddings, Gaussian processes, neural networks, etc., depending on data availability and desiderata. ↩
Though even in this scenario, we can rely on other sources of information, such as descriptions of movies, ratings of restaurants, the average enjoyment of hobbies, etc. ↩
Despite popular opinion, BMI is a good predictor of body fat, with the BMI-body fat correlation being around 0.75. ↩
From the correlation matrix, the estimated loadings on the latent sleep factor are as follows:
- Oura: 0.94
- Whoop: 0.64
- 8Sleep: 0.60
- Autosleep: 0.46
As this makes clear, using Oura alone would yield sleep scores superior to even a combination of all the other trackers. I therefore excluded Oura from the main analysis, since including it would make the calculation trivial and uninteresting. The estimated correlations with the latent sleep score are computed by inverting the joint correlation matrix (the inter-tracker correlations together with the tracker loadings on the latent factor), computing the unexplained (partial) variance of the latent factor from that inverse, and then taking the square root of $1 - \text{partial variance}$. ↩
The optimization works not just at the inner loops of post-training, where models are trained to appeal more and more to human preferences, one of which happens to be a preference for (non-obvious) sycophancy, but at the outer loop of which models get released, get users, and prevent themselves from being shut down. ↩
Coherent Extrapolated Volition ↩

Game theory does not imply we get (or do) nice things

2025-12-18T00:00:00+00:00

It is a shallow soul who fights to the cry of ‘might makes right’. The truth is more concise: might makes.

— ErraticErrata, A Practical Guide to Evil (Book 3: Villainous Interlude: Chiaroscuro)

Moral realism is false. Yet the desire to base morality on something “objective” remains. In the past, we could ground morality in God, in natural law, or in tradition. Today, these positions are much less tenable. All we have left is reason. So modern man attempts to ground morality in reason, in logic, in math. The hope is that we can derive moral rules that everyone—man, machine, and monster alike—will be bound by. And so we turn to game theory.

Game theory does not imply that we be moral

When grounding morality via game theory, people almost always point to the Prisoner’s Dilemma (and little else). My readers should already be familiar with the game, but I’ll briefly explain it. You have two choices: cooperate or defect. No matter what your counterpart does, it is always better to defect than to cooperate. Yet both players prefer mutual cooperation to mutual defection. Under conventional decision theories, the players are doomed to mutual defection, despite a better world being tantalizingly close.

To escape this bind, we need repetition. In the Iterated Prisoner’s Dilemma, Tit-for-Tat emerges as a simple yet successful strategy. Tit-for-Tat cooperates with those who previously cooperated with it, and defects against those who previously defected against it. From this, we are tempted to extract morality. If we define what is right as what is optimal, then we appear to have shown, via logic, that it is righteous to cooperate, righteous to punish evildoers, and—by modifying Tit-for-Tat to handle mistakes—even righteous to occasionally forgive. Yet if we are grounding the moral in the optimal, we should ensure that our chosen policy is actually optimal.

If you care about utility¹, then you want your opponent to cooperate while you defect. If you are playing against Cooperate-Bot², then the optimal action—by the previous logic, the righteous action—is to defect. That doesn’t sound right. It cannot be moral to defect against those who cooperate so unselfishly with us. And indeed, it is not³.
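To make the point concrete, here is a minimal sketch of an Iterated Prisoner's Dilemma. The payoff values (3 > 2 > 1 > 0), the ten-round horizon, and all function names are illustrative assumptions. With the payoffs taken at face value, Always-Defect earns more against Cooperate-Bot than Tit-for-Tat does.

```python
# Payoffs as (player A, player B); the specific numbers are an assumption.
PAYOFF = {
    ("C", "C"): (2, 2),  # mutual cooperation
    ("C", "D"): (0, 3),  # A is exploited
    ("D", "C"): (3, 0),  # A exploits B
    ("D", "D"): (1, 1),  # mutual defection
}

def cooperate_bot(own_history, other_history):
    return "C"

def always_defect(own_history, other_history):
    return "D"

def tit_for_tat(own_history, other_history):
    # Cooperate on the first round, then copy the opponent's previous move.
    return other_history[-1] if other_history else "C"

def play(strategy_a, strategy_b, rounds=10):
    """Return the total payoffs (A, B) after the given number of rounds."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, cooperate_bot))    # (20, 20)
print(play(always_defect, cooperate_bot))  # (30, 0)
print(play(tit_for_tat, always_defect))    # (9, 12)
```

Tit-for-Tat does well across a mixed population of opponents, but against an unconditional cooperator the payoff-maximizing move is simply to defect every round.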

At this point in the essay, I want to say, “The moral action is not necessarily the game-theoretically correct one.” But that is not quite true. The real problem is that, as Eliezer Yudkowsky notes, “the standard visualization of the Prisoner’s Dilemma is fake.” The choices themselves hint at this fakeness: cooperate or defect. We do not actually want to defect; we want to cooperate. Cooperation is Good. Defection is Bad.

As humans, we have a sense of fairness, empathy, and altruism. When playing the Prisoner’s Dilemma, we ask how to ensure mutual cooperation, and we elevate Tit-for-Tat as the pinnacle of IPD strategies, conveniently ignoring that, given the payoffs as stated, we should be trying to trick the other player into cooperating while we defect, repeatedly. If one considers a true Prisoner’s Dilemma⁴ (that is, one in which the specified payoffs reflect our actual human preferences), then one really does want to take the game-theoretically correct action.

The moral action is the game-theoretically correct one …if you actually take care to calculate payoffs according to your actual preferences, which includes your moral values. These values are not grounded in game theory; they come from a source outside of it and are merely taken as inputs. And you really do have to calculate: labeling the “good” action as Cooperate and the “bad” action as Defect is not enough. In any case, there are many more games than the Prisoner’s Dilemma and its variants, but the same logic holds for all of them.

Game theory does not imply that others be moral

Even if we can use game theory to justify our acting kindly towards others, it offers no guarantee that others will act kindly towards us. Optimal strategies cut both ways. Game theory does not promise fairness; it only describes equilibria among agents with power and preferences. You only get the outcomes that you⁵ have the power to enforce.

People often point to Shapley values as a universal definition of fairness that even aliens would supposedly invent. And they may even be right; Shapley values do have attractive mathematical properties. But nothing obligates an agent to distribute gains according to them. The same is true of Nash bargaining solutions, Pareto efficiency, or any other cooperative ideal. These are descriptions of outcomes that can arise under certain assumptions, not moral requirements. If an agent can secure a better payoff by ignoring these norms, game theory gives them no reason not to. If someone is too weak, too uncoordinated, or too replaceable to demand their fair share, it is entirely possible that they simply will not get what they “deserve.”
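For readers who have not seen one computed, here is a minimal sketch of a Shapley value calculation. The three-player game (an owner plus two workers, each worker adding 50 units of value once the owner is present) is entirely hypothetical, as are the function names.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over every possible join order."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in totals.items()}

def value(coalition):
    # Hypothetical game: nothing is produced without the owner;
    # with the owner present, each worker adds 50.
    if "owner" not in coalition:
        return 0
    return 50 * (len(coalition) - 1)

shares = shapley_values(["owner", "worker1", "worker2"], value)
print(shares)  # {'owner': 50.0, 'worker1': 25.0, 'worker2': 25.0}
```

The calculation outputs a division of the surplus and nothing more. Whether the workers actually receive 25 each depends on whether they can enforce it, which is exactly what the arithmetic leaves out.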

What game theory is good for

Game theory is good for many things. It can help us predict how interactions will unfold, given certain conditions. It can help us design systems so that particular outcomes occur, given certain conditions. It can tell us what the optimal action is, given certain conditions and already-fixed preferences. And it can even help explain why we have the values that we do.

But it cannot justify those values. Values do not need to be justified. Values are not justified. They simply are.

Utility is, definitionally, all you care about. ↩
Cooperate-Bot is an agent that always cooperates, no matter what. ↩
At least, under certain moral theories. But I’m assuming you’re a nice person. If you’re not, you probably don’t need to read this essay. ↩
Yudkowsky describes a scenario in which Player 1 is some humane intelligence, while Player 2 is an UnFriendly AI. For some contrived reason, 4 billion human beings are suffering from a fatal illness that can only be cured by substance S, which just so happens to be incredibly inefficient for making paperclips. For more contrived reasons, we’ve got to play the Prisoner’s Dilemma and the payoffs are as follows. Mutual cooperation leads to 2 billion human lives saved and 2 paperclips made. Mutual defection leads to 1 billion human lives saved and 1 paperclip made. If Player 1 defects while Player 2 cooperates, 3 billion human lives are saved, while 0 paperclips are made. If Player 1 cooperates while Player 2 defects, 0 human lives are saved while 3 paperclips are made. ↩
Or agents sympathetic to your interests. ↩

Modeling the General Kink Factor

2025-11-27T00:00:00+00:00

Surveys of sexual interests frequently show a positive manifold: many items correlate positively with one another.

There are three conceptually distinct reasons this can happen.

First, method variance (e.g., acquiescence¹, socially desirable responding, question wording) can induce spurious positive correlations across otherwise unrelated items.
Second, there may be a substantive general tendency toward endorsing a wide range of sexual interests. Some people may genuinely be more sexually curious and therefore endorse many different items.
Third, items may share domain overlap (e.g., two items tap similar kinks), producing genuine inter-item correlations that are not strictly “general.”
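The first of these mechanisms is easy to demonstrate with a small simulation. The sketch below generates items that are unrelated by construction and then adds a per-person response-style shift (a stand-in for acquiescence) to every item. The sample size, number of items, and size of the shift are arbitrary assumptions chosen for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def mean_inter_item_r(n_people=5000, n_items=6, style_sd=0.5, seed=0):
    """Mean correlation among items that share only a response-style term."""
    rng = random.Random(seed)
    items = [[] for _ in range(n_items)]
    for _ in range(n_people):
        style = rng.gauss(0, style_sd)  # per-person acquiescence shift
        for j in range(n_items):
            # Each item: an independent "interest" plus the shared style shift.
            items[j].append(rng.gauss(0, 1) + style)
    rs = [pearson(items[i], items[j])
          for i in range(n_items) for j in range(i + 1, n_items)]
    return sum(rs) / len(rs)

print(round(mean_inter_item_r(style_sd=0.0), 2))  # no shared style: close to zero
print(round(mean_inter_item_r(style_sd=0.5), 2))  # shared style: clearly positive
```

With a style variance of 0.25 and unit item variance, the expected inter-item correlation is 0.25 / 1.25 = 0.2, even though the items have no substantive content in common.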

These three sources have different implications and need to be handled differently.

If the manifold is mainly methodological, failing to model it will inflate correlations among factors, leading to misleadingly high factor-factor correlations.
If it is substantive, modeling it as a legitimate general factor is appropriate and will produce more informative analyses.
If the manifold reflects domain overlap, the correct fix is to allow cross-loadings (permitting items to load on multiple factors).

Since I already allow cross-loadings and since the datasets I work with often make it difficult to test for or model methodological issues, this post will focus on modeling the general factor, whether it is substantive or methodological, that accounts for the positive manifold in sexual interest items.

There are several ways one might model the general factor of sexual interests:

A general factor with equal loadings on all items. Furthermore, the general sexual interest factor may be correlated (Model 1a) or uncorrelated (Model 1b) with other specific factors.
A general factor with freely varying loadings. Again, this factor may be correlated (Model 2a) or uncorrelated (Model 2b) with the other specific factors.
A higher-order factor that influences only the lower-order factors, not individual items (Model 3).

I test these models using items from Tailcalled’s Gender Satisfaction survey. First, let’s examine the factor correlation matrix and the factor partial correlation matrix.

The factor correlation (left) and the factor partial correlation (right) matrices.

BDSM Interest Factor

Mediation as a Test of a Good Construct

2025-11-24T00:00:00+00:00

[T]he appropriate level of analysis is the highest level such that no lower level gives different predictions.

― Garrett Baker

One way to operationalize a ‘good’ latent construct is to check whether it mediates the relationship between its indicators and external variables. This idea of mediation as a key criterion shows up throughout psychometrics, including in predictive validity, measurement invariance, and genetic covariance structure modeling.

Predictive Validity

For a latent construct to be useful, it should account for the association between its indicators and external outcomes. That is, once we extract the factor that represents the construct, the remaining item-level information should add little or no incremental validity. Let’s look at specific examples.

Personality

Leveraging a more nuanced view of personality: Narrow characteristics predict and explain variance in life outcomes

In the paper, Leveraging a more nuanced view of personality: Narrow characteristics predict and explain variance in life outcomes, the authors predicted 10 different outcomes (e.g., BMI, education, walking frequency) using three kinds of models: domain-based models (using only the 5 personality factors), facet-based models (using the 30 personality facets), and item-based models (using all individual items).

The median correlations¹ between predicted and observed outcomes were 0.20 for domain-based models, 0.24 for facet-based models, and 0.31 for item-based models. The authors note that the success of item-based models “was not due to item-outcome overlap. Instead, personality-outcome associations are often driven by dozens of specific characteristics.” In other words, the questionnaire items provide substantial incremental validity beyond what is captured by the factors (or even the facets).

This suggests that the Big Five domains, as currently measured, are not unitary causal entities but instead aggregate partially distinct lower-level constructs. Because specific items explain variance in outcomes even after accounting for the domain-level factor, it does not make sense to treat the Big Five as unified psychological traits. (To be fair, psychologists do acknowledge the existence of facets beneath the domains, though this rarely affects their empirical practice.) As measured, the personality domains (along with their facets) function more as convenient statistical summaries than as coherent underlying traits.

Intelligence

Predicting training success: not much more than g

In the study, Predicting training success: not much more than g, the authors analyzed data from almost 80,000 people trained by the U.S. Air Force for various jobs to examine how well cognitive abilities predict training success. Everyone completed the ASVAB, ten principal components were extracted from the scores, and training outcomes were recorded. The results are shown in Table 4.

The relevant comparisons are Model 3 versus Model 4 and Model 5 versus Model 6. Models 3 and 4 included intercepts for each of the 82 jobs, since some jobs are harder than others. Model 4 used only the first principal component of the ASVAB (a proxy for g), whereas Model 3 used the first principal component together with the remaining nine components. The correlation between predicted and observed outcomes was 0.603 for Model 4 and 0.608 for Model 3. Using non-g cognitive scores therefore provided only a trivial improvement.

Models 5 and 6 did not include any job-specific information. Model 6 used only the first principal component (a proxy for g), while Model 5 used all ten components. The correlation between predicted and observed outcomes was 0.418 for Model 6 and 0.428 for Model 5. Again, the non-g components added very little predictive validity.

Clarifying the Structure of Intelligence

Now, it turns out that g doesn’t capture all the information in the indicators, even though it accounts for far more of the common variance than any other known psychological factor. Knowing whether someone performs better on verbal items or spatial items still helps predict, for example, which careers they’ll pursue, even after accounting for their level of g. For this reason, intelligence is best modeled with a hierarchical factor structure, with g at the top and group factors (e.g., verbal, spatial, memory) directly below. However, g is strong enough that simpler, non-hierarchical models often perform nearly as well in practice, as the study above shows.

Measurement Invariance

To ensure that a measure such as a questionnaire assesses the same construct in different groups, we need to test for measurement invariance. If measurement invariance holds, the measure functions the same way in each group, meaning the same construct is being assessed across groups. There are four levels of measurement invariance: configural invariance, metric invariance, scalar invariance, and residual invariance. For a more detailed explanation, I recommend The 100% CI’s article, A casual but causal take on measurement invariance, which covers the topic from a causal perspective, but I will attempt to summarize the levels here.

Configural Invariance

Configural invariance means that the number of factors and the pattern of factor-indicator relationships are identical across groups. In other words, each group shows the same factor structure and the same indicators load on the same factors. If the factor structure differs between groups, configural invariance is violated. When it holds, the basic structure of the construct is similar across groups, which is a necessary starting point for comparing them.

Metric Invariance

Metric invariance means that the factor loadings for each indicator are the same across groups. In practical terms, the latent factor has the same influence on each item for members of each group. If the group variable alters the strength of the relationship between the factor and some items, metric invariance is violated. When metric invariance holds, the associations between the latent factor and the indicators are consistent across groups and we can compare latent variances and covariances.

Scalar Invariance

Scalar invariance means that, conditional on the same latent level, expected item scores are equal across groups. Violations occur when group membership predicts item responses even after conditioning on the latent trait. When scalar invariance holds, we can compare group means because differences in latent scores reflect differences in the latent trait rather than item bias.

Residual Invariance

Residual invariance means that for each item, the residual variance, which is the variance not explained by the common factors, is the same across groups. If the group variable affects the variability of the indicators, residual invariance is violated. When residual invariance holds, the scale for the latent construct is equally reliable across groups², and the sources of between-group variation in the constructs being measured are a subset of the within-group sources of variation.
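The four levels can be written compactly in standard linear factor-model notation. The symbols below are assumptions introduced here for illustration, not taken from the linked article. For person $i$ in group $g$ responding to item $j$:

$$x_{ijg} = \tau_{jg} + \lambda_{jg}\,\eta_{ig} + \varepsilon_{ijg}, \qquad \operatorname{Var}(\varepsilon_{ijg}) = \theta_{jg}$$

Here $\eta_{ig}$ is the latent score, $\lambda_{jg}$ the loading, $\tau_{jg}$ the intercept, and $\theta_{jg}$ the residual variance. Configural invariance requires only that the same loadings are zero and nonzero in every group. Each further level adds one equality constraint across groups:

$$\text{metric: } \lambda_{jg} = \lambda_j, \qquad \text{scalar: } \tau_{jg} = \tau_j, \qquad \text{residual: } \theta_{jg} = \theta_j$$

Once the metric and scalar constraints both hold, the expected item score $E[x_{ijg} \mid \eta] = \tau_j + \lambda_j \eta$ no longer depends on $g$, which is what licenses comparing latent means across groups.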

Examples

For concrete examples of the levels of measurement invariance, we can look at some of my earlier posts.

For configural invariance, consider the Kink Factor Analysis. Configural invariance is violated because the factor structure differs between cis men and women³. There is a different number of factors (9 for cis men and 14 for cis women), and even when factors appear similar, different items load onto them. For example, dominance-themed items loaded onto the cis men’s BDSM factor, while submission-themed items loaded onto the cis women’s BDSM factor.

For metric and scalar invariance, consider the Nerd Scale. When comparing loadings between groups, I found that item #15, “I have started writing a novel,” had substantially different loadings for men and women (higher for women). This means that nerdiness has a stronger effect on novel-writing for women than for men. To construct a measurement-invariant scale, the item should be removed, which I did for later analyses. When checking item biases by gender, several items showed bias. For example, item #19, “I have played a lot of video games,” indicates that even holding nerdiness constant, men are more likely to play video games. Including such an item would bias the scale, because at the same level of nerdiness, men would be more likely to endorse the item, making them appear nerdier than they are.

I don’t check residual invariance, and to be honest, most researchers don’t either.

Genetic Covariance Structure Modeling

Genetic covariance modeling helps test whether the correlations among indicators arise from a single biological factor. In this framework, researchers typically compare two models: a common pathway model and an independent pathway model.

In the common pathway model, genetic and environmental influences on the indicators operate through the latent construct⁴. In the independent pathway model, genetic and environmental factors act directly on the indicators, and the latent factor is simply a convenient statistical summary of the indicators.

A common pathway (left) and an independent pathway (right) genetic factor model.

These models have very different implications. Evidence generally supports a common latent genetic pathway for cognitive ability, while personality traits such as Conscientiousness tend to fit the independent pathway model. This suggests that editing genes related to intelligence would primarily influence general cognitive ability⁵. In contrast, editing genes related to Conscientiousness is more uncertain, since such edits could affect self-discipline, prudence, or even social desirability bias rather than a single unified trait⁶.

Takeaways

Taking all of this into account, we can see why something like a general factor of athleticism is less ‘real’ than the general factor of intelligence. You can always extract a first principal component⁷, and athletic tests are usually positively correlated⁸, but several observations point in a different direction:

Specific indicators, such as a 40-yard dash time, add substantial predictive ability beyond the athleticism factor when predicting external outcomes like performance in American football.
Loadings and item biases vary by group variables such as sex. For example, women tend to be more flexible than men, while men tend to be stronger. Upper-body strength may also show differing loadings, such as a lower loading for women.
An independent pathway model will almost always fit better than a common pathway model⁹. Genes that influence lung capacity do not usually influence hand-eye coordination, and there is no reason to expect all athletic traits to share a single biological pathway.

So even though you can extract the first principal component from athletic tests, it does not correspond to a single, unified biological mechanism in the way that g does.

Arguably, the most important aspect of mediation is that it is a core requirement of a reflective factor model, the kind discussed throughout this post. In reflective factor models, the latent construct should explain the shared variation among indicators such that any remaining covariation is negligible once the latent factor is held fixed. Without this property, the very idea of a coherent latent trait begins to lose its footing and psychometrics falls apart.
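As a small worked example of this requirement: for standardized indicators, the partial correlation between two indicators with the factor held fixed follows from the ordinary partial-correlation formula, using each loading as the indicator-factor correlation. The loadings and the extra 0.1 of covariation below are hypothetical values chosen for illustration.

```python
import math

def partial_r_given_factor(r_ij, loading_i, loading_j):
    """Partial correlation of two standardized indicators, holding the latent factor fixed."""
    numerator = r_ij - loading_i * loading_j
    denominator = math.sqrt((1 - loading_i ** 2) * (1 - loading_j ** 2))
    return numerator / denominator

li, lj = 0.8, 0.6  # hypothetical loadings

# Case 1: the observed correlation is exactly what the factor implies.
print(partial_r_given_factor(li * lj, li, lj))        # 0.0

# Case 2: the indicators share something beyond the factor.
print(round(partial_r_given_factor(li * lj + 0.1, li, lj), 3))
```

In the first case nothing is left over once the factor is held fixed, so the factor fully mediates the association. In the second, the leftover partial correlation is the kind of residual covariation that signals the reflective model is incomplete.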

For an in-depth explanation as to why I prefer discussing results in terms of correlations instead of squared correlations, read Are we comparing apples or squared apples? The proportion of explained variance exaggerates differences between effects. ↩
Because measurement error and item-specific influences are comparable across groups. ↩
Apparently, some people don’t think the factor structure of kinks would differ based on sex, even though this should be obvious. In what world would it not? ↩
An important implication is that when different external variables affect the indicators only through the same latent variable, their effects on the items must be proportional. For any two such variables, A and C, the influence of A on a given item is a scalar multiple $k$ of the influence of C on that item, and this same $k$ applies to all items that depend on that latent factor. Such a proportional pattern would be very unlikely if multiple distinct factors were acting directly on the items. ↩
In the common pathway model, any effects from genes to the indicators have to go through the latent variable, and therefore affect all the other indicators as well. Effects aren’t simply isolated. ↩
In the independent pathway model, there is no single latent variable through which all genetic or environmental effects must pass. Instead, genetic and environmental factors each have their own unmediated paths to the indicators. These factors can still affect multiple indicators at once, but they can also act on specific indicators only. ↩
PCA does not tell you whether factors are real, and technically it does not even find factors; it finds components. It is an atheoretical method that identifies the directions of greatest variance in the data. To test substantive hypotheses about how indicators relate to underlying constructs, you would instead use structural equation modeling. ↩
See Dynomight’s post: General factors of intelligence and physical fitness ↩
It depends on how homogeneous your indicators are. If the indicators measure heterogeneous physiological systems (strength, flexibility, endurance, reaction time), as is usually the case when people discuss a general athleticism factor, independent pathways should almost always fit better. If the test battery is more homogeneous, for example consisting only of endurance tests, common pathways may sometimes fit reasonably well. ↩

Factor Analyses I Find Interesting

2025-11-22T00:00:00+00:00

Rethinking the Human Development Index

The Human Development Index is intended to measure, well… human development (whatever that means). It’s a composite of life expectancy at birth, education (mean years of schooling completed and expected years of schooling upon entering the system), and gross national income per capita.

The main problem with the HDI is that its weighting of variables is essentially arbitrary, and the index provides no representation of uncertainty in its rankings.

Fortunately, some economists have taken an interest. The authors of Using Spatial Factor Analysis to Measure Human Development proposed a Bayesian factor analysis model as an alternative. The model estimates the weights for each indicator, incorporates spatial correlation (the assumption that neighboring countries resemble one another), and applies population weighting (the assumption that estimates for more populous countries should be more precise).

The Probability to be Model Based "Top/Bottom 10" vs Official HDI Ranks

Although their rankings agree with the UN’s for many countries, they diverge in notable ways. For example, the HDI ranks of Kiribati, the DRC, and Mongolia appear much too low, while those of Japan, Mexico, and Pakistan appear much too high.

It’s nice (and rare) to see one of these measures that is actually statistically grounded.

Estimating City-Level Policy Conservatism

To measure the conservatism of each city’s public policy preferences, the authors of Representation in Municipal Government first estimated the ideal policy positions of hundreds of thousands of Americans. They combined several large-scale surveys, each with 14–32 policy questions and very large samples (30,000–80,000 respondents). They assumed quadratic utility with normal errors and a single left–right policy dimension. Using a Bayesian Item Response Theory model, they estimated ideal points for more than 275,000 individuals.

Next, they used Bayesian multilevel regression and poststratification (MRP) to estimate each city’s overall policy conservatism. This model incorporates respondents’ demographics and geographic information to infer public opinion for each city.
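The poststratification half of MRP is simple enough to sketch. Assuming the multilevel regression has already produced an estimate for each demographic cell in each city, the city-level figure is the population-weighted average of those cells. The function name, cell labels, and all numbers below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of per-cell estimates for each city.

    cell_estimates: {(city, cell): estimated mean conservatism of that cell}
    cell_counts:    {(city, cell): number of residents in that cell}
    """
    weighted_sum = defaultdict(float)
    total = defaultdict(float)
    for (city, cell), estimate in cell_estimates.items():
        n = cell_counts[(city, cell)]
        weighted_sum[city] += n * estimate
        total[city] += n
    return {city: weighted_sum[city] / total[city] for city in total}

# Hypothetical numbers: both cities share the same cell-level opinions
# and differ only in their demographic mix.
estimates = {
    ("City A", "under 45"): -0.50,
    ("City A", "45 and over"): 0.10,
    ("City B", "under 45"): -0.50,
    ("City B", "45 and over"): 0.10,
}
counts = {
    ("City A", "under 45"): 300,
    ("City A", "45 and over"): 100,
    ("City B", "under 45"): 100,
    ("City B", "45 and over"): 300,
}
city_scores = poststratify(estimates, counts)
```

With these made-up inputs, City A comes out near -0.35 and City B near -0.05, purely because of who lives there. In the real method the cell estimates come from the regression step, which is omitted here.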

They validated these city-level estimates by comparing them to presidential voting patterns. The city conservatism measure correlated .77 with Obama’s 2008 vote share, suggesting that their estimates capture cities’ left–right preferences well.

Their work led to the following plot:

Mean Policy Conservatism of Large Cities

Common stereotypes such as San Francisco and Seattle being super liberal are confirmed, but really, it’s just nice to see city-level political estimates with credible intervals.

The Structure of Modern Conspiracist Thought

This survey on belief in conspiracy theories included an unusually large set of specific and wide-ranging items—far more detailed than most studies in the field. It contained 85 statements, covering claims such as: black holes don’t exist, serial killers are a myth, global warming is fake, dinosaurs never existed, JFK faked his death, vaccines cause autism, and many others. A factor analysis of these items yielded six factors:

Generic Conspiracy: 22 items on a variety of nonspecific conspiracies
Aliens & Satanism: 17 items concerning topics like the existence of aliens and alleged Satanic practices of the elite
Flat Earth and Terrain Theory: 10 items concerning astronomy relevant to the shape of the Earth, as well as questions about viruses and bacteria and their role in disease
Jewish Conspiracy: 8 items concerning the representation of Jews in positions of power
Fakery: 16 items related to hoaxes, faked historical events, and faked deaths
Climate Change: 10 items pertaining to the theory of anthropogenic global warming and other miscellaneous topics

What makes this study interesting is that the author is a self-professed conspiracy theorist who is deeply embedded in that culture, allowing him to interpret the factors in ways outsiders are unlikely to. He considers the Generic Conspiracy factor uninteresting, noting that it mostly reflects the most widely endorsed beliefs.

The Aliens & Satanism factor is, in his view, not really about aliens or Satanism at all. Instead, he frames it as the “mainstream conspiracy/QAnon/controlled opposition” narrative: a set of beliefs he sees as deliberate misdirection (some items obviously so, such as the claim that elites are reptilians). The Fakery factor, by contrast, represents a third position opposed to both the mainstream narrative (Generic Conspiracy) and the controlled-opposition narrative (Aliens & Satanism). It represents radical skepticism about historical narratives (though not necessarily the general arc of history) and doubt that events unfolded as historians describe.

The Flat Earth factor extends this skepticism further, applying it to anything not directly observable. The shape of the Earth, the existence of dinosaurs, and the function of DNA are not directly observable; these things must all be inferred from scientific evidence. If you can’t collect or interpret any of this evidence yourself, and you believe that the establishment is lying all the time about everything, you may end up endorsing positions that load on this factor. This stance resembles a form of scientific anti-realism, which “applies chiefly to claims about the non-reality of unobservable entities such as electrons or genes, which are not detectable with human senses.”

The Jewish Conspiracy factor encompasses claims about Jewish global influence and doubts about the scale or reality of the Holocaust. The Climate Change factor does not appear to have a coherent underlying basis; several items seem to reflect young Earth creationism, which has a coherent theoretical basis but no clear link to climate change.

He also noticed that certain factors appeared to constrain others. For example, all respondents who scored highly on the Flat Earth factor also scored at least as high on the Generic Conspiracy factor; however, those who scored highly on the Generic Conspiracy factor could have any score on the Flat Earth factor. From this, he inferred a partial ordering that revealed two divergent streams of conspiracist ideation.
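One way to make this kind of constraint concrete is to test it directly on factor scores. The sketch below is only an illustration: the cutoff for "scored highly", the field names, and the four respondents are hypothetical, and the author's actual procedure is not described here.

```python
def constrains(scores, upper, lower, high=1.0):
    """True if everyone scoring >= `high` on `lower` scores at least as high on `upper`."""
    return all(
        person[upper] >= person[lower]
        for person in scores
        if person[lower] >= high
    )

# Hypothetical standardized factor scores for four respondents.
respondents = [
    {"generic": 2.1, "flat_earth": 1.8},
    {"generic": 1.5, "flat_earth": -0.4},
    {"generic": 1.9, "flat_earth": 1.2},
    {"generic": -0.3, "flat_earth": -1.0},
]

# High Flat Earth scorers are always at least as high on Generic Conspiracy...
generic_bounds_flat_earth = constrains(respondents, upper="generic", lower="flat_earth")
# ...but high Generic Conspiracy scorers can sit anywhere on Flat Earth.
flat_earth_bounds_generic = constrains(respondents, upper="flat_earth", lower="generic")
```

When the check holds in one direction but not the other, the pair of factors can be drawn as an edge in a partial ordering.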

Conspiracist Beliefs Partial Ordering

Modeling National Differences in Mental Sport Performance

In this study, Emil Kirkegaard collected lists of the top players in 12 mental sports (DOTA 2, League of Legends, CSGO, StarCraft 2, Counter-Strike, Hearthstone, Overwatch, Super Smash Bros. Melee, chess, Go, poker, and Scrabble), and fit a model to predict the number of top players in each country based on population. He saved the residuals and then factor-analyzed them to create a measure of latent national general gaming ability.

It really is impressive that this just works. I’d never thought of factor-analyzing residuals, but it appears to be a valid method.
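The general shape of the procedure can be sketched in a few lines. This is not Kirkegaard's code: the log-log regression, the use of the first principal axis in place of a proper factor analysis, and all of the data are assumptions chosen to keep the example small and dependency-free.

```python
import math

def ols_residuals(x, y):
    """Residuals from the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def standardize(values):
    """Z-scores using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration: the direction of greatest shared variance."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec

# Hypothetical data: population in millions, top-player counts in three games.
population = [5, 50, 10, 300, 80, 1]
top_players = {
    "game_a": [12, 5, 9, 40, 3, 0],
    "game_b": [10, 6, 7, 35, 2, 1],
    "game_c": [8, 4, 10, 30, 4, 0],
}

log_pop = [math.log(p) for p in population]
# Step 1: per game, regress log(count + 1) on log(population); keep z-scored residuals.
z_resid = {
    game: standardize(ols_residuals(log_pop, [math.log(c + 1) for c in counts]))
    for game, counts in top_players.items()
}
games = list(z_resid)
n = len(population)
# Step 2: correlation matrix of the residuals (mean product of z-scores).
corr = [
    [sum(a * b for a, b in zip(z_resid[g], z_resid[h])) / n for h in games]
    for g in games
]
# Step 3: weights from the leading eigenvector, then one score per country.
weights = leading_eigenvector(corr)
scores = [sum(w * z_resid[g][i] for w, g in zip(weights, games)) for i in range(n)]
```

The point of the residual step is that population is removed before any shared structure is extracted, so the resulting score reflects over- or under-performance relative to country size rather than size itself.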

General gaming ability score and national IQ. Orange line = linear fit (top left), blue line = local regression fit (span = 1.00). Weighted by square root of population size.

National gaming ability ended up showing a nonlinear (but strong) relationship with national IQ. Notable outliers included North Korea (which performed worse than expected for obvious reasons) and South Africa, Brazil, and the United States, which all performed much better than expected.

Another finding was that Go wasn’t a good indicator of national gaming ability, largely because it is played predominantly in East Asian countries.

Two Dimensions of U.S. City Livability

In this post, the author performed a principal components analysis on 18 city livability rankings, which yielded three distinct clusters. The first was a “happiness” cluster, which contained only a single ranking, and was therefore set aside. The other two clusters were more interesting. Rankings produced by American websites, reports, and consumer advocacy groups formed a cluster he called the “Chill Rankings” (all-American, domestic, down-home). Rankings produced by international organizations or arts- and business-oriented groups formed a cluster he called the “Jetsetter Rankings” (metro, international, cosmopolitan).

For each cluster of rankings, he fit a structural equation model (SEM) using the following variables as predictors: cost of living, arts and leisure, city size, transit, pollution, health insurance, income, education, and crime.

The SEM for the Chill Rankings is shown below:

Here is the corresponding “livability map” of the most Chill cities, with the top ten labeled. Green circles indicate above-average scores (red indicates below-average), and circle size reflects the magnitude of the latent livability score:

The SEM for the Jetsetter Rankings is shown here:

And here is the corresponding Jetsetter livability map, again with the top ten cities labeled:

It really is interesting that there appear to be only two distinct kinds of U.S. city rankings—and that the Chill and Jetsetter livability constructs turn out to be orthogonal!

The Factor Structure of Hand Preference

This paper examined several different handedness surveys. Of particular note was the Edinburgh Handedness Inventory (EHI). The methodology was as follows:

Participants were instructed to indicate their hand preference for the listed EHI activities by typing a “+” in the appropriate column (right or left). If a participant’s preference was so strong that they would never use the other hand unless forced to, they were instructed to type “++” in the appropriate column of right or left. If participants were indifferent to the hand they would use to complete the action, they were instructed to type a “+” in both columns. As per the original EHI instructions, participants were encouraged to try to answer all the questions, and told to only leave a blank if they had no experience at all with the object or task.
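For reference, responses in this format are conventionally summarized with the laterality quotient from the original inventory: 100 × (R − L) / (R + L), where R and L are the total numbers of plus signs in the right and left columns. The sketch below shows that scoring rule only; whether the paper's factor analysis used this quotient or the item-level responses is not stated here, and the example answers are hypothetical.

```python
def laterality_quotient(responses):
    """responses: one (left_marks, right_marks) pair of strings per item, e.g. ("", "++")."""
    left = sum(l.count("+") for l, _ in responses)
    right = sum(r.count("+") for _, r in responses)
    if left + right == 0:
        raise ValueError("no items answered")
    return 100 * (right - left) / (right + left)

# Hypothetical respondent.
answers = [
    ("", "++"),  # writing: strong right preference
    ("", "++"),  # drawing: strong right preference
    ("", "+"),   # scissors: right preference
    ("+", "+"),  # broom: indifferent, one mark in each column
    ("+", ""),   # opening box: left preference
    ("", ""),    # left blank: no experience with the task
]
lq = laterality_quotient(answers)  # 6 right marks, 2 left marks -> 50.0
```

The quotient runs from -100 (exclusively left) to +100 (exclusively right).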

Their factor analysis produced the following loadings:

Writing (0.96)
Drawing (0.95)
Scissors (0.90)
Spoon (0.86)
Toothbrush (0.83)
Striking a match (match) (0.81)
Throwing (0.75)
Opening box (lid) (0.66)
Broom (upper hand) (0.62)
Knife (without fork) (0.57)

The magnitude of many of these loadings is striking. Writing has a loading of 0.96, which is nearly perfect. It really does seem that if you want to know whether someone is generally right- or left-handed, you can simply ask which hand they write with. This may have been obvious already, but it’s nice to have confirmation and reassuring to see the statistical methods behave as expected.

The Two Factors Underlying LLM Benchmark Scores

Epoch AI performed a principal components analysis (PCA) on various LLMs’ scores on a range of benchmarks. They found that benchmark scores were dominated by a single factor, a general factor of capabilities, which accounted for about half of the variance. While a general factor of capabilities is to be expected, the more interesting finding was what the second principal component measured. It corresponded to models that were “good at agentic tasks, but bad at vision… and also bad at math”. In other words, it was measuring Claudiness. And indeed, the top five models on that component were all Claudes. (The bottom five were all OpenAI GPT models.) This pattern suggests systematic differences in how different labs prioritize different capabilities.

Why Consume Recent Media?

2025-11-21T00:00:00+00:00

In Culture Is Not About Esthetics, Gwern argues that the supply of high-quality art from the past now exceeds the capacity of any individual to experience it. Someone who reads one award-winning science-fiction novel each week could spend nearly a decade working through only the winners and major contenders of the Hugo and Nebula Awards—and that is a tiny slice of a single genre.

Expanding to other genres produces a reading list longer than most people’s lifetimes. The same imbalance appears in other media: music, movies, television. The past contains enough proven material to occupy anyone indefinitely. If the sole goal were to extract as much artistic value as possible, a rational consumer would stay in the archives: the old material has known value, while new work is an uncertain bet.

Yet new work offers two advantages that older media cannot match: synchrony with other people and contact with the present world.

Shared Attention

People implicitly coordinate their attention, which tends to converge on a small set of new releases. When a show releases weekly episodes, or when a movie enters theaters, it creates a period in which many people encounter the same story and ideas at roughly the same time. This produces a form of immediate common ground. Two strangers can ask each other what they are watching right now because the answer is likely to overlap. Asking about everything they have ever watched also produces overlap, but finding it requires far more work.

The shared timing also shapes the experience itself. A serial story released in intervals fosters anticipation, speculation, and collective interpretation. These reactions depend on the shared delay between releases. They disappear when the entire work arrives at once or when the viewer discovers it years later. For example, you simply had to be present during the Drake-Kendrick rap beef to fully experience the rapid back-and-forth; hearing the tracks long after the fact lacks the same impact. Even disappointment can create a bond: two people who disliked the same plot twist during the week can commiserate while the reactions are still fresh.

The work matters, but the synchronized attention surrounding it matters as well.

Contact With the Present

New media can depict the world that exists at the time of its creation. A story produced today can incorporate technologies, social tensions, and everyday habits that did not exist decades ago. An 18th-century analogue to Black Mirror would not merely look different; it would lack the conceptual material—networked surveillance, algorithmic behavior-shaping, ubiquitous data trails—that makes the show meaningful to a modern audience.

Satire, in particular, draws power from its proximity to current conditions. Its targets lose meaning as those conditions recede into history. Even non-satirical work gains resonance by engaging with the problems and patterns that shape contemporary life.

Conclusion

Recent media offers something the archive of older work, no matter how high-quality, cannot: it lets people experience the same stories at the same moment without explicit coordination, and it reflects the state of the world they inhabit. For anyone who values experiencing stories together with others or seeing the present world reflected in media, new work deserves a place alongside the old.

Understanding Acquiescence Bias

2025-11-20T00:00:00+00:00

Acquiescence bias, also known as yea-saying bias, is the tendency to agree with statements in a questionnaire regardless of what those statements assert. Evidence for this pattern is unambiguous. As social psychologist Jon Krosnick noted,

When people are asked to agree or disagree with pairs of statements stating mutually exclusive views (e.g. “I enjoy socializing” versus “I don’t enjoy socializing”), answers should be strongly negatively correlated. But across more than 40 studies, the average correlation was only -.22. Across 10 studies, an average of 52% of people agreed with an assertion, whereas only 42% disagreed with its opposite. In another eight studies, an average of 14% more people agreed with an assertion than expressed the same view in a corresponding forced-choice question. And averaging across seven studies, 22% agreed with both a statement and its reversal, whereas only 10% disagreed with both.

All of these methods suggest an average acquiescence effect of about 10%, and the same sort of evidence documents comparable acquiescence in true/false and yes/no questions.

Taken together, these findings show that a sizable minority of respondents agree with statements because they are statements, not because they endorse the content.

Why does this happen?

Several explanations for acquiescence have been proposed:

Some people may be predisposed toward agreeableness in general, making them more likely to acquiesce on questionnaires.
Some sociologists argue that respondents defer to researchers—perceived as higher-status figures—leading them to endorse statements out of courtesy.
Acquiescence may also result from respondents being unable or unwilling to interpret and answer questions correctly and thoughtfully, causing them to default to agreement.

Although these mechanisms differ, they produce the same response pattern: a general tendency for some respondents to agree with statements, regardless of content.

How acquiescence distorts results

Acquiescence inflates correlations among items because people who agree with one statement are more likely to agree with all of them. A scale composed entirely of positively keyed items (statements that require agreement to indicate a higher score) cannot distinguish someone who strongly endorses the trait being measured from someone who simply agrees with everything. As a result, means shift upward and inter-item correlations rise, which can create the illusion of coherent psychological structure where none exists.
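A small simulation illustrates the inflation. It assumes two items with no shared content at all, continuous responses rather than Likert categories, and a normally distributed person-level tendency to agree; all of these are simplifications for illustration.

```python
import math
import random

def correlation(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def simulate(acquiescence_sd, n=5000, seed=0):
    """Correlation between two items that share nothing except yea-saying."""
    rng = random.Random(seed)
    item_a, item_b = [], []
    for _ in range(n):
        yea_saying = rng.gauss(0, acquiescence_sd)  # person-level tendency to agree
        item_a.append(yea_saying + rng.gauss(0, 1))  # item-specific noise
        item_b.append(yea_saying + rng.gauss(0, 1))
    return correlation(item_a, item_b)

r_without = simulate(acquiescence_sd=0.0)  # close to zero
r_with = simulate(acquiescence_sd=1.0)     # close to 0.5
```

With equal variances for yea-saying and item noise, the expected correlation is 0.5 even though the items measure nothing in common.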

Furthermore, acquiescence correlates negatively with cognitive ability and education¹. Comparing two groups that differ substantially in these traits without accounting for acquiescence will lead to erroneous conclusions. The consequences of acquiescence are especially visible in international surveys. Countries with lower average test scores show higher acquiescence on international surveys¹, producing distinctive response patterns. For instance, as discussed by Emil Kirkegaard, the ROSE project found that roughly 75% of Ugandan students reported interest in how detergent works, compared to roughly 15% of Norwegian students. This pattern was not limited to those two countries or that specific item. Students in low-test-scoring countries reported high interest in science across the board, while students in high-test-scoring countries gave more moderate responses. Rather than conclude that students from low-test-scoring countries are unusually fascinated by detergent, it makes more sense to assume that they are more acquiescent in their responses.

Modeling acquiescence

Acquiescence becomes identifiable when a scale contains both positively keyed and negatively keyed items. If someone strongly agrees with “I enjoy socializing” and also strongly agrees with “I do not enjoy socializing,” it is unlikely that their responses reflect their actual sociability. Rather, they reflect a general tendency to agree.

The model I use treats acquiescence as a factor that every item loads on with the same loading. This assumes that each item is equally susceptible to agreement for reasons unrelated to its content. The assumption is not perfect: difficult or technical items may attract more automatic agreement. But it isolates the primary pattern without introducing unnecessary degrees of freedom. Allowing the loadings to vary would produce a bifactor model, and bifactor models are notoriously prone to overfitting².
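Written out, an equal-loading model of this kind might look as follows. The notation is an assumption for illustration: x is the response of person i to item j, τ is the item intercept, η is the content trait, α is the person's acquiescence, and ε is residual error.

```latex
x_{ij} = \tau_j + \lambda_j \eta_i + a\,\alpha_i + \varepsilon_{ij}
```

The content loading λ_j is positive for positively keyed items and negative for negatively keyed ones, while the acquiescence loading a carries no item subscript because it is constrained to be equal across items. Replacing a with a free a_j for each item gives the bifactor variant.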

Mitigating acquiescence

Survey designers can reduce acquiescence bias by balancing their scales. A balanced scale contains equal numbers of positively and negatively keyed items. Agreement with all items becomes impossible without contradicting oneself, which allows the acquiescence component to be detected and subtracted out. Clear wording and manageable survey length also help by reducing the cognitive load that pushes respondents toward automatic agreement.
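On a balanced scale the separation can be done with simple averages. The sketch below assumes a 5-point agreement scale with midpoint 3 and equal weighting of items; the function name and the example respondents are hypothetical.

```python
def split_scores(responses, keys):
    """Separate trait and acquiescence scores on a balanced scale.

    responses: raw 1-5 agreement ratings, one per item
    keys:      +1 for positively keyed items, -1 for negatively keyed items
    """
    if sum(keys) != 0:
        raise ValueError("scale must contain equal numbers of +1 and -1 items")
    midpoint = 3
    centered = [r - midpoint for r in responses]
    # Raw mean: content cancels across opposite keys, leaving the tendency to agree.
    acquiescence = sum(centered) / len(centered)
    # Reverse-keyed mean: blanket agreement cancels, leaving the trait.
    trait = sum(k * c for k, c in zip(keys, centered)) / len(centered)
    return trait, acquiescence

keys = [1, 1, -1, -1]
yea_sayer = split_scores([5, 5, 5, 5], keys)  # (0.0, 2.0)
extravert = split_scores([5, 4, 1, 2], keys)  # (1.5, 0.0)
```

The respondent who strongly agrees with everything gets a trait score of zero and a high acquiescence score, while the consistent respondent gets the reverse. On an unbalanced scale the two would be indistinguishable.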

Gerhard Meisenberg; Amanda Williams (2008). Are acquiescent and extreme response styles related to low intelligence and education? doi:10.1016/j.paid.2008.01.010 ↩ ↩²
One simulation study generated data from three different conditions: a correlated-factors structure, a bifactor structure, and a condition with minimal structure. In all three cases, the bifactor model outperformed the true data-generating model on likelihood-based fit indices. A model that explains everything explains nothing. ↩