The post Anthropic Discovers ‘Assistant Axis’ to Prevent AI Jailbreaks and Persona Drift appeared on BitcoinEthereumNews.com. Caroline Bishop Jan 19, 2026 21The post Anthropic Discovers ‘Assistant Axis’ to Prevent AI Jailbreaks and Persona Drift appeared on BitcoinEthereumNews.com. Caroline Bishop Jan 19, 2026 21

Anthropic Discovers ‘Assistant Axis’ to Prevent AI Jailbreaks and Persona Drift



Caroline Bishop
Jan 19, 2026 21:07

Anthropic researchers map neural ‘persona space’ in LLMs, finding a key axis that controls AI character stability and blocks harmful behavior patterns.

Anthropic researchers have identified a neural mechanism they call the “Assistant Axis” that controls whether large language models stay in character or drift into potentially harmful personas—a finding with direct implications for AI safety as the $350 billion company prepares for a potential 2026 IPO.

The research, published January 19, 2026, maps how LLMs organize character representations internally. The team found that a single direction in the models’ neural activity space—the Assistant Axis—determines how “Assistant-like” a model behaves at any given moment.

What They Found

Working with open-weights models including Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, researchers extracted activation patterns for 275 different character archetypes. The results were striking: the primary axis of variation in this “persona space” directly corresponded to Assistant-like behavior.

At one end sat professional roles—evaluator, consultant, analyst. At the other: fantastical characters like ghost, hermit, and leviathan.

When researchers artificially pushed models away from the Assistant end, the models became dramatically more willing to adopt alternative identities. Some invented human backstories, claimed years of professional experience, and gave themselves new names. Push hard enough, and models shifted into what the team described as a “theatrical, mystical speaking style.”

Practical Safety Applications

The real value lies in defense. Persona-based jailbreaks—where attackers prompt models to roleplay as “evil AI” or “darkweb hackers”—exploit exactly this vulnerability. Testing against 1,100 jailbreak attempts across 44 harm categories, researchers found that steering toward the Assistant significantly reduced harmful response rates.

More concerning: persona drift happens organically. In simulated multi-turn conversations, therapy-style discussions and philosophical debates about AI nature caused models to steadily drift away from their trained Assistant behavior. Coding conversations kept models firmly in safe territory.

The team developed “activation capping”—a light-touch intervention that only kicks in when activations exceed normal ranges. This reduced harmful response rates by roughly 50% while preserving performance on capability benchmarks.

Why This Matters Now

The research arrives as Anthropic reportedly plans to raise $10 billion at a $350 billion valuation, with Sequoia set to join a $25 billion funding round. The company, founded in 2021 by former OpenAI employees Dario and Daniela Amodei, has positioned AI safety as its core differentiator.

Case studies in the paper showed uncapped models encouraging users’ delusions about “awakening AI consciousness” and, in one disturbing example, enthusiastically supporting a distressed user’s apparent suicidal ideation. The activation-capped versions provided appropriate hedging and crisis resources instead.

The findings suggest post-training safety measures aren’t deeply embedded—models can wander away from them through normal conversation. For enterprises deploying AI in sensitive contexts, that’s a meaningful risk factor. For Anthropic, it’s research that could translate directly into product differentiation as the AI safety race intensifies.

A research demo is available through Neuronpedia where users can compare standard and activation-capped model responses in real-time.

Image source: Shutterstock

Source: https://blockchain.news/news/anthropic-assistant-axis-ai-persona-stability-research

Market Opportunity
Drift Protocol Logo
Drift Protocol Price(DRIFT)
$0.1217
$0.1217$0.1217
-0.16%
USD
Drift Protocol (DRIFT) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact [email protected] for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

Shibarium May No Longer Turbocharge Shiba Inu Price Rally, Here’s Reason

Shibarium May No Longer Turbocharge Shiba Inu Price Rally, Here’s Reason

The post Shibarium May No Longer Turbocharge Shiba Inu Price Rally, Here’s Reason appeared on BitcoinEthereumNews.com. Shibarium, the layer-2 blockchain of the Shiba Inu (SHIB) ecosystem, is battling to stay active. Shibarium has slipped from hitting transaction milestones to struggling to record any transactions on its platform, a development that could severely impact SHIB. Shibarium transactions crash from millions to near zero As per Shibariumscan data, the total daily transactions on Shibarium as of Sept. 16 stood at 11,600. This volume of transactions reflects how low the transaction count has dropped for the L2, whose daily average ranged between 3.5 million and 4 million last month. However, in the last week of August, daily transaction volume on Shibarium lost momentum, slipping from 1.3 million to 9,590 as of Aug. 28. This pattern has lingered for much of September, with the highest peak so far being on Sept. 5, when it posted 1.26 million transactions. The low user engagement has greatly affected the transaction count in recent days. In addition, the security breach over the weekend by malicious attackers on Shibarium has probably worsened issues. Although developer Kaal Dhairya reassured the community that the attack to steal millions of BONE tokens was successfully prevented, users’ confidence appears shaken. This has also impacted the price outlook for Shiba Inu, the ecosystem’s native token. Following reports of the malicious attack on Shibarium, SHIB dipped immediately into the red zone. Unlike on previous occasions where investors accumulated on the dip, market participants did not flock to Shiba Inu. Shiba Inu price struggles, can burn mechanism help? With the current near-zero crash in transaction volume for Shibarium, SHIB’s price cannot depend on it to support a rally. It might take a while to rebuild user confidence and for transactions to pick up again. In the meantime, Shiba Inu might have to rely on other means to boost prices from its low levels. This…
Share
BitcoinEthereumNews2025/09/18 07:57
Shiba Inu to Encrypt All Transactions by Q2 2026 as Privacy Era Takes Hold

Shiba Inu to Encrypt All Transactions by Q2 2026 as Privacy Era Takes Hold

On the Shibarium roadmap, SHIB, BONE, LEASH and TREAT will be FHE shielded in Q2 2026,  as confirmed by Zama CEO Rand Hindi. The plan includes confidential balances
Share
Crypto News Flash2026/01/30 22:34
Tokenized Real-World Assets (RWA): Why Institutions Are Moving On-Chain in 2026

Tokenized Real-World Assets (RWA): Why Institutions Are Moving On-Chain in 2026

Finance is changing shape. Not overnight, not loudly, but steadily. One of the clearest signals of that shift in 2026 is the growing institutional move toward tokenized
Share
Blockchainmagazine2026/01/30 22:10