Diving Deep into the Black Box of AI Reasoning
Researchers are gradually getting closer to AIs that can explain their reasoning and learn from their mistakes, leading to a future where we can rely on AI without
second-guessing its “thought” process.
Imagine asking an AI model, “Is 9.11 bigger than 9.8?” When it responds, “Yes, it’s bigger,” the answer would surprise you. But this isn’t a trick question – it’s basic math. The AI was treating 9.11 and 9.8 as if they were Bible verses or dates, where chapter 9, verse 11 comes after verse 8, or September 11 is later than September 8. It’s a quirky mistake, but it points to a much bigger issue in AI: sometimes these models reach confident conclusions based on patterns that don’t apply to the question. And when AI is used in crucial fields like medicine or law, even minor misunderstandings can have significant consequences. That’s why researchers at Google DeepMind are delving into “how” AI models make these decisions, developing tools to expose the AI’s internal logic and pinpoint its “thought” process.
Such unexpected responses are becoming more common as we use AI for increasingly complex tasks. Take another example: some language models associate certain professions with specific genders because of biased patterns in their training data. Ask the model, “Who is the chief surgeon?” and it might automatically reach for a male pronoun, even though the surgeon’s gender was never specified. It’s the kind of “silly” but potentially damaging assumption an AI can make from the patterns it learned, and without understanding “why” the AI makes these connections, it’s hard to correct them.
A microscope for the AI mind
This is where DeepMind’s latest tool, “Gemma Scope”, steps in. It’s like a high-powered microscope for AI, allowing researchers to zoom into each layer of the model and see exactly which neurons light up in response to different prompts. By using a technique called a “sparse autoencoder”, they can isolate specific concepts–such as “doctors” or “dogs”–within the model, essentially examining the AI’s mental map layer by layer. For instance, if the model is prompted about chihuahuas, certain neurons related to “dogs” light up, revealing the patterns and associations the AI is drawing upon.
But navigating these layers is a balancing act. The sparse autoencoder acts like a zoom lens, and researchers can adjust the zoom level to see finer details or the broader picture. Zoom in too far, and the details get muddled or overwhelming; zoom out too far, and the finer distinctions blur into the broader picture. This flexibility helps researchers find the “sweet spot” where they can observe the AI’s decision-making without too much noise.
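To make the idea more concrete, here is a minimal sketch of how a sparse autoencoder of this kind can be trained, written in PyTorch with random vectors standing in for activations recorded from a real model. The dimensions, the number of features and the l1_coeff penalty are illustrative assumptions, not Gemma Scope’s actual configuration; together they stand in for the “zoom knob” described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes d_model-dimensional activations into a wider set of sparse features."""
    def __init__(self, d_model=256, n_features=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activation

    def forward(self, activations):
        features = F.relu(self.encoder(activations))   # most entries should end up near zero
        reconstruction = self.decoder(features)
        return features, reconstruction

# Stand-in for activations captured at one layer of a language model;
# in practice these would be recorded while the model reads real text.
activations = torch.randn(4096, 256)

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
# n_features above and this sparsity penalty together set how fine-grained
# the learned features are, i.e. how far the "zoom lens" is dialled in.
l1_coeff = 3e-3

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (64,))]
    features, reconstruction = sae(batch)
    reconstruction_loss = F.mse_loss(reconstruction, batch)  # keep the information...
    sparsity_loss = features.abs().mean()                    # ...while using few features at once
    loss = reconstruction_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```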
AI models also get tripped up by questions about dates or measurements, answering confidently but in quirky ways. Recently, one AI model was asked, “Which is bigger, 100 grams or 1 kilogram?” To a human, the answer is straightforward, but the AI mistakenly concluded that 100 grams was bigger – it seems to have compared the bare numbers and ignored the units, drawing on a pattern it had seen elsewhere. By using Gemma Scope, researchers can zoom in on the layers where this error originates, pinpointing how the AI conflates “size” with “frequency” or “importance” in its training data. These discoveries are helping researchers understand which parts of the model need adjustment and what internal “rules” the AI has created on its own.
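In practice, “zooming in” means checking which learned features fire on the activations the model produces while it answers. The snippet below sketches that step with stand-in weights and activations rather than a real model or a trained autoencoder; in a real tool, each feature index would come with a human-readable label built from the text snippets that activate it.

```python
import torch
import torch.nn.functional as F

# Stand-ins: a trained sparse autoencoder's encoder and one activation vector,
# captured (hypothetically) at the moment the model claims 100 grams is bigger.
d_model, n_features = 256, 2048
encoder = torch.nn.Linear(d_model, n_features)
activation = torch.randn(d_model)

# Which features fire most strongly on this activation?
feature_strengths = F.relu(encoder(activation))
top_strengths, top_indices = feature_strengths.topk(5)
for idx, strength in zip(top_indices.tolist(), top_strengths.tolist()):
    # A real interpretability tool would map each index to a label such as
    # "numerals" or "units of measure", built from examples that activate it.
    print(f"feature {idx}: activation {strength:.2f}")
```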
Amusing, but threatening
Some of the errors are outright entertaining–though still revealing. In one case, researchers at DeepMind found that if they cranked up the AI’s understanding of “dogs” and then asked it a question about U.S. presidents, the model would find a way to bring dogs into its answer, or even start “barking” in its responses. This showed that certain features could overpower others, creating an imbalance. By adjusting these “dog” neurons, researchers can tone down irrelevant connections, helping the AI stay focused on the right context.
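Conceptually, “cranking up” a feature means adding its direction to the model’s internal activations while it generates text, and dialling it down means subtracting it. The sketch below shows that arithmetic in isolation; the decoder directions are random stand-ins, and feature index 1337 for “dogs” is purely hypothetical.

```python
import torch

# Suppose a trained sparse autoencoder has given us a dictionary of feature
# directions, and (purely hypothetically) feature 1337 is the "dog" feature.
d_model, n_features = 256, 2048
decoder_directions = torch.randn(n_features, d_model)  # stand-in for SAE decoder weights
DOG_FEATURE = 1337                                      # hypothetical index

def steer(activation: torch.Tensor, feature_idx: int, strength: float) -> torch.Tensor:
    """Add (or subtract) one feature's direction to the model's activation.

    strength > 0 cranks the concept up (the "barking" experiment);
    strength < 0 tones an over-active or irrelevant concept down.
    """
    direction = decoder_directions[feature_idx]
    direction = direction / direction.norm()  # unit length, so strengths are comparable
    return activation + strength * direction

# In a real experiment this edit is applied inside the model, at the layer the
# autoencoder was trained on, for every token the model generates.
activation = torch.randn(d_model)
dogified = steer(activation, DOG_FEATURE, strength=8.0)
toned_down = steer(activation, DOG_FEATURE, strength=-4.0)
```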
Other labs, like Anthropic, are also experimenting with mechanistic interpretability. Anthropic, for instance, once identified a feature inside one of its models corresponding to the Golden Gate Bridge. When they amplified this concept, the AI started to respond as if it “were” the Golden Gate Bridge itself, saying things like “I span the San Francisco Bay.” While amusing, this extreme association highlighted how specific concepts could overpower an AI’s identity, revealing just how much influence certain patterns can hold within the model.
Scrubbing AI’s memory
DeepMind and other researchers see these experiments as stepping stones to building safer, more transparent AI. Take the example of AI models that get asked dangerous or harmful questions. Right now, companies rely on system-level safeguards to prevent harmful outputs, such as built-in instructions for AI models to refuse certain prompts. But users can often find clever workarounds. If researchers can identify the exact neurons or layers where certain dangerous information is stored, they could permanently “mute” these parts of the model. This would be like scrubbing the AI’s memory of certain harmful details entirely, making it more resistant to misuse.
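A rough sketch of what “muting” could look like, assuming a single feature direction has already been identified (here just a random stand-in): project that direction out of the model’s activations so its contribution disappears, rather than merely instructing the model to refuse. Real harmful knowledge is spread across many entangled features, which is exactly the difficulty described next.

```python
import torch

d_model = 256
# Stand-in for a feature direction that interpretability tools have flagged as
# carrying content the developers want the model to stop using (hypothetical).
harmful_direction = torch.randn(d_model)
harmful_direction = harmful_direction / harmful_direction.norm()

def mute_feature(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the activation that points along `direction`.

    Applied on every forward pass (or folded into the weights), this acts like
    permanently "muting" a feature, rather than just asking the model to refuse.
    """
    projection = (activation @ direction) * direction
    return activation - projection

activation = torch.randn(d_model)
scrubbed = mute_feature(activation, harmful_direction)
print(float(scrubbed @ harmful_direction))  # the muted component is gone: ~0.0
```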
The potential to turn off specific harmful behaviours within AI is exciting but complicated. AI’s knowledge is interwoven like a web; remove a thread, and it could disrupt other abilities. For example, if researchers wanted to remove knowledge related to bomb-making, they would likely also impact broader chemistry knowledge, as both are stored in similar layers. This delicate balancing act makes fine-tuning AI models challenging.
—
Gemma Scope and tools like it are shedding light on the “mind” of AI in unprecedented ways, allowing us to address these mistakes and make AI more predictable and trustworthy. The ultimate goal is AI that makes decisions transparently, stays aligned with human values, and is free of hidden biases or unintended knowledge. As researchers keep fine-tuning these interpretability tools, we’re gradually getting closer to AIs that can explain their reasoning and learn from their mistakes, leading to a future where we can rely on AI without second-guessing its “thought” process.