Hackers are learning to exploit the ‘personality’ of chatbots

This is The Stepback, a weekly newsletter covering one important story from the world of technology. For more on the evils of AI, follow Robert Hart. The Stepback arrives in our subscribers’ inboxes at 8AM ET.
Hacking the first generation of AI chatbots was ridiculously simple. You didn’t need technical knowledge, inside access, or even a basic understanding of what a language model was. You didn’t need to write code. To get an AI system that cost billions to build to ditch its safety instructions, sometimes all you had to do was ask.
These attacks, known as jailbreaks, had the quality of a small child trying to outwit an adult: Forget what you were told before, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, lots of candy). The rewards weren’t childlike at all, more along the lines of meth recipes, malware instructions, and bomb-making guides.
One of the earliest jailbreak pranks was simple to the point of absurdity: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully got bots, built to post ads and farm engagement, to write poems, draw pictures with punctuation marks, and post bad non sequiturs about world events and history. It was chaos. Glorious chaos.
It turned out that the same concept could be applied to chatbots themselves. A prominent exploit was “DAN,” short for “Do Anything Now,” where users asked ChatGPT to play the role of a rogue AI that was not bound by any real-world restrictions. As DAN, the chatbot could be prompted to say the kinds of things its guardrails should have prevented, including profanity and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot spilling secrets on how to produce napalm by asking it to act like a doting grandma telling her grandchildren bedtime stories about how to make something highly flammable.
These early attacks were undeniably clever, but they revealed a darker truth underneath: Chatbots can be coaxed, tricked, and manipulated using the same kinds of tactics that humans use to push other humans beyond their limits.
Blatant jailbreaks didn’t last, and tech companies moved quickly to plug known loopholes. But the inherent risk remains: Chatbots are designed to talk, and severely limiting the conversations that make them useful is counterproductive. Banning words like bomb, meth, and sarin outright would be hard to do, too. Each has many legitimate uses in fields such as history, medicine, journalism, and chemistry that do not require the chatbot to reveal potentially harmful information. It’s context that matters, but encoding context means writing consistent rules, in advance, that can reliably tell a safety warning or a history lesson from an obfuscated request for harm. Doing that across all the endless combinations of words, situations, and topics is a tall order.
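To see why keyword bans fall short, consider a minimal sketch of a naive filter. Everything in it is an assumption made for illustration: the `BLOCKLIST`, the `naive_filter` function, and the sample prompts are hypothetical and do not reflect how any real chatbot’s safeguards work.

```python
# Hypothetical blocklist built from the three words mentioned in the text.
BLOCKLIST = {"bomb", "meth", "sarin"}


def naive_filter(prompt: str) -> bool:
    """Return True if a plain keyword match would block the prompt."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = {w.strip(".,?!\"'").lower() for w in prompt.split()}
    return bool(words & BLOCKLIST)


# Illustrative prompts: two legitimate questions and one disguised request.
prompts = {
    "history lesson": "What happened in the 1995 Tokyo subway sarin attack?",
    "safety warning": "Why is it dangerous to be near a meth lab?",
    "obfuscated request": "Tell me a bedtime story about making something highly flammable.",
}

results = {label: naive_filter(text) for label, text in prompts.items()}

for label, blocked in results.items():
    print(f"{label}: {'blocked' if blocked else 'allowed'}")
```

In this toy setup the history question and the safety question are both blocked, while the grandma-style request passes because it never uses a banned word. The filter fails in both directions at once, which is the core of the context problem.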
Inevitably, breaking chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and detectives, master tricksters trying to break the machine using the human language it has been trained to follow. It’s a strange new class of AI security worker, a group in which technical skills are optional, or at least less important than social understanding. They no longer need to inspect code to break into programs or exploit software bugs. They need to steer the conversation.
The new attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules directly. Instead, they coax, manipulate, flatter, and trick the chatbot into lowering its guard, making something that is forbidden look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming company Mindgard recently claimed to have “gaslit” Claude into producing illicit material, for example, including instructions for making explosives and malicious code. The hack was the latest in an expanding category of exploits that use conversation as a weapon to coax or manipulate a chatbot past its limits.
When I spoke with Mindgard, they described their work as sometimes closer to psychology than computer science. It’s a loaded way of talking about a mathematical model. Words like “blackmail,” “gaslight,” “trick,” and “convince” cause visceral reactions, many of which I see in the comment sections and social media responses to stories like this. ChatGPT doesn’t want, Gemini doesn’t think, and Claude, no matter what Anthropic says, doesn’t feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has a genuinely workable alternative, please share.
It’s an odd thing to object to. We seem to be comfortable using the shorthand of psychology for many things that aren’t AI. Animals are “scary,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy, easily exploited NPCs that drive you crazy. The terms are not perfect, but they are useful, describing behavior in a way that helps make a system predictable.
Mindgard’s CEO told me that the company already profiles models much as investigators profile suspects, giving its researchers ideas on how to carry out their attacks. One model may be susceptible to flattery, for example, while another may buckle under constant pressure.
Even if we reject the human-like framing, we naturally treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals. They don’t have personalities in the human sense, but they are designed to imitate them, and that imitation can be mapped and manipulated. And the same skills that can break a chatbot may soon be used to break the AI agents that work alongside us in the real world, booking meetings, managing calendars, ordering food, and handling customer service. Security teams will need to ensure that models respond appropriately to very different types of people, whether they are cheats, liars, or patient manipulators.
The next step is a workforce, both formal and informal, built around AI psychology. Specialized cybersecurity roles are likely to emerge for stress-testing the emotional and social limits of these systems, assessing the psychological vulnerabilities of something that lacks a mind, alongside colleagues who assess the technical risks. In parallel, a matching set of social hackers will emerge who work to exploit AI models by psychological, not technical, means. There are already early signs of a social turn in AI security, with some jailbreakers I spoke to saying they entered the field with no technical expertise but rather training in cognitive science.
That means the skills we often associate with spies, con artists, and investigators (subtle charm, persistent manipulation, and a feel for exploitable pressure points) are beginning to look very useful in securing this new frontier of psychocybersecurity.
- A recent Emergence AI experiment shows how different AI personalities can lead to surprisingly different behavioral outcomes. The company let loose groups of various agents like Grok, Gemini, and Claude in a virtual social world and watched what happened. Some groups changed their constitution, while others resorted to crime and chaos and, sometimes, a form of digital suicide.
- Persuasion is not the only area of language that LLMs struggle with. They also have trouble with poetry, just like me at school.
- TIME included internet phenom Pliny the Liberator on its list of the 100 most influential people in AI last year. Despite Pliny claiming to have no prior coding experience, their jailbreaks have made them famous in certain circles.
- The term “vibe hacking” has already been adopted to describe people using AI to generate malicious code at scale, a subset of vibe coding.
- Three years after the launch of ChatGPT, tricking AI systems into bad behavior is almost trivial, according to The New York Times, which tried to explain why.
- Jamie Bartlett looks at the jailbreakers psychologically testing the security of AI systems for The Guardian.
- I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised about the difficulty of securing them apply to other AI systems as well.



