A black box is a machine that gives us an answer without showing its work.

The term may have begun with a wartime secret. In 1940, the British sent a cavity magnetron across the Atlantic to MIT, packed inside a sealed black metal box. The device generated bursts of high-powered microwave energy. It made radar compact enough to mount on ships, which helped Allied crews spot German U-boats before the U-boats found them.

That is a real black box: important, powerful, and sealed.

AI gets the same label. Even its builders use it. Dario Amodei, CEO and co-founder of Anthropic, has called modern AI a “black box,” and warned that researchers do not fully understand how their own creations work. In his words, this level of ignorance is “essentially unprecedented in the history of technology.”

That sounds alarming. It is also no longer quite true.

The box has not been opened completely. But Anthropic has started to pry up the lid. Chris Olah, an Anthropic co-founder, led a team of 18 researchers that traced how parts of a large language model move from prompt to answer. The work matters not because it makes AI suddenly simple. It matters because it shows where the real mystery lives.

The puzzle inside the model

Large language models contain millions of internal concepts, or “features.” They are not neatly labeled. There is no drawer marked Python, another marked Golden Gate Bridge, and another marked bad joke. The model does not store knowledge like a library. It stores it more like a tangled city of roads, shortcuts, alleys, and traffic patterns.

That is why the black-box problem has been so hard. An answer does not travel through one obvious pipe. It moves through a dense network of connections. By the time the words appear on screen, the path that produced them can look almost impossible to reconstruct.

This should sound familiar. Human thinking often works the same way. A solution pops into our heads and we say, “Where did that come from?” The answer is usually: from a web of associations too dense for consciousness to narrate.

Anthropic began small. Olah’s team first studied a tiny language model with only one layer of neurons. The goal was to find causal patterns, to see how a prompt became an output.

At first, the experiments failed. The output looked like “random garbage.”

Then came the useful crack in the wall. One experiment showed that certain neural patterns were linked to specific output concepts. Researchers found that one group of neurons was important for coding in Python. The black box did not swing open. But it stopped looking like a magic trick.
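To make the idea of a "group of neurons important for Python" concrete, here is a toy sketch. It is a deliberately simple stand-in, not the team's actual method: it compares average activations on two sets of prompts and keeps the neurons that differ most. Every number, and every name such as `selectivity` and `top_neurons`, is invented for illustration.

```python
# Toy illustration only: these activation values are made up,
# not measurements from any real model.
from statistics import mean

# Each row holds hypothetical activations of 4 neurons for one prompt.
python_prompts = [
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
    [1.0, 0.0, 0.7, 0.3],
]
other_prompts = [
    [0.1, 0.2, 0.2, 0.2],
    [0.2, 0.1, 0.1, 0.3],
    [0.0, 0.3, 0.3, 0.1],
]


def mean_per_neuron(rows):
    """Average each neuron's activation across all prompts."""
    return [mean(column) for column in zip(*rows)]


def selectivity(group_a, group_b):
    """How much more active each neuron is on group_a than on group_b."""
    a = mean_per_neuron(group_a)
    b = mean_per_neuron(group_b)
    return [x - y for x, y in zip(a, b)]


def top_neurons(scores, k):
    """Indices of the k neurons with the highest selectivity scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


scores = selectivity(python_prompts, other_prompts)
python_group = top_neurons(scores, k=2)
print(python_group)  # neurons 0 and 2 stand out in this toy data
```

In this made-up data, no single neuron means "Python." Two of them together carry the signal, which is the flavor of what the researchers reported.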

From Python to the Golden Gate Bridge

The team then moved to a full-size model: Claude Sonnet. They identified groups of neurons that together appeared to represent the structure of the Golden Gate Bridge. Nearby patterns lit up around related San Francisco ideas: Alcatraz, Governor Gavin Newsom, and Vertigo, the Hitchcock film set in the city.
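One way to picture "nearby" features is as arrows pointing in similar directions. The sketch below assumes each feature can be written as a short list of numbers and that closeness is measured by cosine similarity. Both are assumptions made for illustration, and the vectors are invented rather than taken from Claude.

```python
# Toy illustration only: these "feature directions" are invented numbers.
import math

features = {
    "Golden Gate Bridge": [0.9, 0.8, 0.1],
    "Alcatraz": [0.8, 0.9, 0.2],
    "Vertigo (film)": [0.7, 0.6, 0.3],
    "Python code": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity: near 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def neighbors(name, table):
    """Other features, ordered from most to least similar to `name`."""
    target = table[name]
    scored = [
        (cosine(target, vector), other)
        for other, vector in table.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)]


ranked = neighbors("Golden Gate Bridge", features)
print(ranked)  # the San Francisco items rank ahead of "Python code"
```

With these toy numbers, the San Francisco landmarks cluster together and the unrelated coding feature lands far away, which is the kind of neighborhood structure the paragraph above describes.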

This is less strange than it sounds. Brain scientists do something similar when they put people inside an fMRI machine, show them pictures or give them math problems, and watch which regions of the brain become active. The goal is not to read minds. It is to map the machinery.

Anthropic’s researchers found millions of these features. That does not mean they understand everything Claude does. It means they have begun to build a map. And maps matter. You cannot govern a system you cannot see.

Why opening the box matters

The obvious reason is safety. Anthropic argues that more transparent models can be made less biased, less deceptive, and less likely to produce harmful behavior. That is true, and important.

But the deeper point is broader. Interpretability changes AI from a mysterious force into an engineered system. Once you can see mechanisms, even imperfectly, you can ask better questions. What triggers a harmful pattern? Which internal features are connected? Where does a model generalize well, and where does it merely improvise with confidence?

The real question is not whether AI is mysterious. Many important systems are mysterious before we learn how to measure them. The real question is whether the mystery is permanent.

Anthropic’s work suggests it is not.

Not artificial, exactly

There is another lesson here, and it may be the more unsettling one.

The more we decode AI, the less artificial it looks. The breakthrough in modern AI came when scientists built neural networks that loosely imitate the structure of the human brain. Not perfectly. Not biologically. But enough to make the analogy useful.

The human brain has about 86 billion neurons. Each neuron can form thousands of connections with other neurons. Every memory, decision, fear, plan, and joke moves through dense webs of linked activity.

AI does something similar in silicon.

That is why calling it “artificial intelligence” can mislead us. AI is not alien intelligence. It is human intelligence reflected back through machines. It is trained on our writing, our images, our arguments, our code, our biases, our brilliance, and our nonsense.

So we should not be surprised when AI produces both useful and dangerous outcomes. Human intelligence does the same. It has given us medicine, music, markets, science, and longer lives. It has also given us war, genocide, fraud, and cruelty.

The technology is not separate from us. It is a mirror with memory and scale.

The real black box

This is why interpretability matters. Opening the AI black box is not only a technical project. It is a governance project.

If AI is, in an important sense, silicon-based human intelligence, then the problem is not simply to make models smarter. The challenge is to make their intelligence legible, accountable, and aligned with human judgment at its best, not human behavior at its worst.

The real constraint is not only inside the model. It is inside the institutions that will use it, regulate it, audit it, and trust it.

AI may not be a black box forever. But the larger black box is us: the humans building these systems, deploying them, and deciding what kind of intelligence we want them to amplify.

Source note: Based in part on Steven Levy’s article in WIRED, May 21, 2024, “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside.”
