Flourishing Futures and Anthropic
Why Joe Carlsmith should join ISIS, or, barring that, Anthropic
I am a Joe Carlsmith stan! Joe is one of the most thoughtful and interesting writers on the internet, able to write clearly and precisely about difficult philosophical subjects. His writings introduced me to infinite ethics and anthropics, and clearly laid out the basic case for following expected utility theory. I have a vague memory of someone saying that Carlsmith’s random blog posts are about as good as the best philosophy paper on the subject, and I have to agree.
Joe is moving to Anthropic (the company, not the field of philosophy) to work on the model spec. What this means is that he’ll be influencing the values Claude is directed to follow. Some people weren’t happy about this. Holly Elmore, executive director of Pause AI (a group trying to pause AI development, as you may have guessed), tweeted:
Sellout. And the post is grade A cope.
You can sort of understand where she is coming from (though I think her comment is obviously both wrongheaded and an ill-advised communication strategy). Lots of leading experts, including Joe, think that AI has around a one in ten chance of killing everyone on the planet. Even if the real number is much lower, even if it’s only 1%, this still means Joe is going to work for an industry that has a 1% chance of blowing up the world. If you assume Anthropic is responsible for 20% of the risk from AI, then on the one-in-ten estimate Anthropic alone has a 2% chance of destroying the world. And like, maybe you shouldn’t work for the guys who might destroy the world.
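To spell out that back-of-envelope multiplication (a toy sketch: the one-in-ten and 20% figures are the stipulations above, and treating total risk as dividing additively across companies is a simplifying assumption of mine):

```python
# Toy back-of-envelope: one company's contribution to total AI extinction risk,
# assuming (simplistically) that the risk divides additively across companies.
p_doom = 0.10           # the "one in ten" expert estimate of AI killing everyone
anthropic_share = 0.20  # stipulated fraction of total AI risk from Anthropic

anthropic_risk = p_doom * anthropic_share
print(f"Implied chance Anthropic destroys the world: {anthropic_risk:.0%}")  # -> 2%
```

On the more conservative 1% estimate, the same calculation gives 0.2%, which is why the 2% figure only follows from the one-in-ten number.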
Nonetheless, I don’t agree. I’m happy Joe—and others like Amanda Askell—are at Anthropic, even though I think they might blow up the world.
Imagine that Joe Carlsmith joined the upper echelons of ISIS. Would this be a bad thing? No, it would be great! It would mean that a very dangerous organization was being steered in a good direction. It would be bad if Carlsmith started to adopt the values of ISIS, but the analogous worry doesn’t apply here: he remains quite alarmed about Anthropic building AI, as he said in his post.
I just don’t see what the argument is against Carlsmith working for them. It would be one thing if he were working on capabilities development. But he’s not; he’s working on the model spec. Making the models have better values seems like an obviously good thing. Steering AI so that it goes well is a lot more important than not getting your hands dirty by working with bad people.
If some group is doing dangerous things, you don’t want only reckless people working for them. Instead, you want people like Joe Carlsmith to be in the room where it happens, so that adequate precautions are taken. It’s important to have sane people around when people are doing stuff that might blow up the world.
Having Joe at Anthropic is likely to make Claude safer: influential safety-focused people in the room make AI safer. Giving Claude better values also decreases the odds that it kills everyone, because there’s some chance that alignment succeeds but the values we align the AI to are terrible, and we all die anyway. By working for Anthropic, Joe can reduce that risk.
This is, I think, the most serious defense of Carlsmith: by working for the AI companies, he won’t speed up progress but will have influence that raises the odds of Claude being developed safely and responsibly. But there’s another, more general point that explains why I’d support him even if I thought he was accelerating AI progress. That point is: there’s more in the world that matters than just not going extinct.
Of course, not going extinct is important! It’s absolutely terrifying that massive AI companies are playing roulette with the entire world. But it’s not the most important thing. As the good folks over at Forethought have argued, the very best futures contain astronomically more value than many of the futures where we don’t go extinct. The gap between the best futures and the merely okay ones is vast.
Most of the future’s value is concentrated in worlds where the AI gets good values. For this reason, it’s worth risking slightly greater existential threats to guide the AI towards having better values. A world where we merely don’t go extinct might have tens of millions of times less value than one where the AI has good values—where it tries to promote, in other words, what genuinely matters.
From this perspective, things aren’t looking too good. If we get aligned AI, by default it will be aligned to humanity’s current values. But humanity’s current values are wildly suboptimal. They sanction factory farms. They neglect the interests of nearly every sentient being on the planet that isn’t human. They place virtually no premium on creating happy people. Even the default scenario where we get alignment looks pretty bleak.
The most likely scenarios where we get a really amazing future involve AIs that aren’t just aligned with human values but are aligned specifically with the right values. Having effective altruists, especially ones as careful and thoughtful as Joe, influencing the AI companies is thus hugely important. By default we throw away nearly all the value of the future. Even a world where all AI was shut down tomorrow would throw away nearly all of the future’s value. By default we lose.
And we lose for a very specific reason. Even if we survive, we get totally malignant values spread by the AI across the universe. If AI is aligned with human values, do you think it would have any compunction about spreading wild animal suffering across the galaxy, so that countless beings suffer horribly? Whether we win depends on whether the AI gets the right values.
In light of this, having a person with the right values working on designing Claude’s values is extremely important. It means that Claude, one of a handful of AIs radically upending the world, will have better values. To call that selling out seems completely mad. If the default way we lose is the AI having bad values, and you have the opportunity to be one of the dozen or so people with the most influence over the AI’s values, then you should take that opportunity in a heartbeat.
There are serious ethical questions about whether Anthropic should be competing in the race. More competitors lead to faster progress and greater risk of doom. But when I think about just how narrow the target is, just how likely we are to lose unless the AI has the right values, I’m more sympathetic to Anthropic’s project. I’m still uncertain, though; I don’t know whether Anthropic is a force for good. But I’d bet that Joe is.
Now, I admit that this analysis loses some credibility if you:
Think humanity is likely to have good values if we don’t go extinct.
Think that if anyone builds it everyone dies.
If you think both of those things then you’ll think steering the AI isn’t that important. We’re doomed if anyone builds it. Alignment is hopeless; all we can do is slow down or stop it. I still think even if that’s your view you should support Joe joining Anthropic, but on such a view pausing AI is the most important thing. If alignment is near-unsolvable then all we can hope to do is not build the AI.
But crucially, Holly Elmore doesn’t think that. Her p(doom) is 20-40%. Now, that’s certainly higher than mine. But if you think we’re not guaranteed to die from AI, then a lot more matters than just slowing it down. Getting adequate safety procedures in place is hugely important. So I don’t know why Holly is complaining about Joe joining Anthropic.
I don’t have a p(doom) of anything like 100%. I’d probably be somewhere around 5%. If you have a p(doom) near 100%, I’m not going to try to talk you out of that in this article.
But I do want to try to convince you—whatever your p(doom) is—that making the future guided by good values is astronomically important. Alignment isn’t the only thing that matters. I’ve made this case here in more detail, and Forethought has a detailed report going over the case comprehensively.
The short version is that value is pretty fragile. Only a very small slice of possible futures is optimized for value, and there’s no automatic default that makes us especially likely to get such a future. Some examples of how things could go wrong:
It could be (and this seems pretty likely, in fact) that future people won’t see creating happy people as an urgent moral priority, just as people today don’t. Thus, they won’t think it’s important to use space resources to create happy people. As a consequence, staggeringly large numbers of happy digital beings will never come to exist.
Just as people today don’t value AIs, maybe future decision-makers won’t value digital minds. Because it’s potentially really easy to make a digital mind (code is easier to copy than a physical brain), nearly all the expected people in the future are likely to be digital. A world where nearly all beings’ interests are neglected is a dystopia by default. There might also be some specific classes of digital minds that we don’t value even if we value digital minds in general (e.g., perhaps there will be simple digital subroutines that are sentient but not smart enough to arouse our sympathy).
People might spread wild-animal suffering to the stars; huge numbers of planets might be terraformed. Because digital minds are potentially so much easier to make than biological organisms, I’d guess this effect is less significant than the other two, but it’s still a pretty important failure mode. Most people seem fine with a world like the present one, in which the vast majority of sentient beings die painfully after just a few days of life.
We might spread factory farms across the galaxy. Right now, even if plant-based meat were cheaper, tastier, and more convenient, many people still wouldn’t replace their meat consumption with it. It thus seems not terribly unlikely that we’ll spread factory farming across the galaxy.
We might create new universes in a laboratory that mostly contain suffering people.
As I suggested above, missing out on most of the future’s value is the likeliest scenario. It’s what happens if we don’t work very hard to avoid it. In light of this, getting people in positions of power to steer the future in a valuable direction isn’t just desirable; it is absolutely necessary if we are to avoid the grotesque moral catastrophe that awaits us by default.