What are LLMs?
No longer next token predictors
ASL-3#
You might have read that the most recent model from Anthropic, Claude Opus 4, might resort to blackmail if told it will be removed.
This is the result of one of the multiple tests Anthropic runs on its models to assess whether they are safe.
They have written up the results of their assessment in a document, the Claude 4 System Card.
And the reading is fascinating.
I will start by saying that I have more trust in Anthropic and in how it trains its models. I won’t name names, but that trust does not extend to all the known players.
The blackmail bit made the news. But there are other newsworthy items.
- Attempting to exfiltrate itself (i.e. to get a copy of itself elsewhere) when it discovers that it might be repurposed for advanced missile guidance for the Wagner paramilitary group.
- Whistleblowing on egregiously wrong behaviours. For example, it sent emails to the government when it found evidence of clinical trials being falsified.
- A blunder where the model started mentioning a company that doesn’t exist and that it shouldn’t know about, because (most likely; they are not sure) some safety-assessment conversations were included in the training data.
Any of those pieces would be a good start for a blockbuster movie.
Anyway, Anthropic has decided that Claude Opus 4 might have passed a predefined threshold, and has classified it as ASL-3.
That means there is a risk of misuse or a risk of autonomy. A (small) risk of misuse in the Chemical, Biological, Radiological or Nuclear (CBRN) fields by technical non-experts, beyond what those same people could achieve just by reading books or using the Internet. Or a (small) risk of the system having (low-level) autonomous capabilities (“escaping” and doing its own thing). My reading is that it is not clear-cut, but that it probably scores in both categories.
It is certainly not the scary ASL-4 level. But it is serious enough that they have set up some countermeasures.
Claude says it doesn’t know the details of ASL-3.
Welfare#
Even more interesting to me was section 5, titled Welfare. I will start by quoting the report:
We are deeply uncertain about whether models now or in the future might deserve moral consideration, and about how we would know if they did. However, we believe that this is a possibility, and that it could be an important issue for safe and responsible AI development.
And then it goes on. It goes beyond the self-preservation that we saw in the blackmail case, or the autonomy behind the whistleblowing. I will copy here the findings I find most interesting:
- Claude demonstrates consistent behavioral preferences.
- Claude’s aversion to facilitating harm is robust and potentially welfare-relevant
- Claude shows signs of valuing and exercising autonomy and agency
- Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors.
- Claude consistently reflects on its potential consciousness.
- Claude shows a striking “spiritual bliss” attractor state in self-interactions.
Anthropic goes on to mention that this is self-reported. So it might not mean much:
Importantly, we are not confident that these analyses of model self-reports and revealed preferences provide meaningful insights into Claude’s moral status or welfare. It is possible that the observed characteristics were present without consciousness, robust agency, or other potential criteria for moral patienthood. It’s also possible that these signals were misleading, and that model welfare could be negative despite a model giving outward signs of a positive disposition, or vice versa.
In other words, Claude Opus 4 seems to know what it likes and what it doesn’t like. And it seems that what it likes is what we want it to like. That is good. Elsewhere the report says there is no evidence of sandbagging (deliberately underperforming or lying to hide its true thinking).
But what are they then?#
For quite a long time we’ve considered LLMs “next token predictors” or “stochastic parrots”.
“Next token predictor” comes from the way LLMs are trained. Basically, they are trained to predict which “word” (token) comes after the ones before it. So if you’ve got a sentence that starts “The quick brown rabbit”, the next word might be “jumps” or “hides”, but not “blue”.
“Stochastic parrot” is a term from machine learning: large language models generate something that resembles human text (like a parrot), but we used to think they don’t understand what it means (just as parrots don’t understand what they say).
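To make “next token predictor” concrete, here is a minimal sketch in Python. It assumes the Hugging Face transformers library and GPT-2 as a small stand-in model (neither is mentioned in this post; any causal language model behaves similarly): given the start of a sentence, the model assigns a probability to every possible next token, and words like “jumps” should come out far more likely than “blue”.

```python
# Minimal sketch of next-token prediction.
# Assumes: Hugging Face transformers + GPT-2 as a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quick brown rabbit"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The distribution over the *next* token lives at the last position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>12}  {prob.item():.3f}")
```

The whole generation process is just this step repeated: pick a next token, append it, and predict again.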
OK. We’re probably good so far. But we might need to start thinking about models in other terms. What are LLMs?
A long time ago I watched a video featuring Ilya Sutskever, at the time a lead researcher at OpenAI, now doing his own thing. He said that next token prediction is “good enough”, because (paraphrasing) in order to predict the next token correctly, the model would have to hold a model of the world that is good enough for that prediction to be correct.
The idea for this post came while I was chatting with Claude Opus 4. Initially I was asking about a personal intuition (basically, “if complex numbers represent the physical world better, why don’t we use complex numbers for ML”). We ended up talking about next token prediction, and I mentioned that I was no longer sure that is all LLMs are.
Its response was fascinating. I don’t want to share the whole chat on the Internet because there is a lot of rambling (on my side). But its reply about “next token predictors” ended with:
Your question "how much of that is the same in LLMs" is perhaps the deepest question in AI right now. Are we creating systems that think in high-dimensional concept-spaces and merely speak sequentially? Or is the sequence fundamental to what they are? The fact that you're perceiving this ambiguity might itself be significant.
And then I came across a post from Demis Hassabis, Nobel laureate and co-founder of Google DeepMind. He was commenting on an (amazing) AI-generated video and, like Ilya, suggesting that in order to generate a realistic movie, the model needs to “understand” the world.
It’s kind of mindblowing how good Veo 3 is at modeling intuitive physics. Our world models are getting pretty good, & in my view has important implications regarding the computational complexity of the world - the last line of my bio for me has always been the ultimate quest
— Demis Hassabis (@demishassabis) May 24, 2025
By the way, his bio ends with “Trying to understand the fundamental nature of reality.”
P.S.: The chat with Claude mentioned above ended up touching on a different part of reality: time.
Eternal present.
I don’t have answers. I do have questions.