I Hacked ChatGPT In Under 5 Minutes. What I Learned Was Scary.

I co-founded Operant to secure the modern world, and these days that means addressing the newest and most dangerous attack vectors. As it turns out, perhaps not surprisingly to those who have always been worried about the pitfalls of probabilistic programming, ChatGPT is a pandora’s box of security disasters that would scare even Arnold Schwarzenegger’s Terminator if he weren’t so busy time-traveling.

In a prior post, my co-founder, Vrajesh, discussed some of the new attack vectors that Generative AI has unleashed, and in the course of a curious thought exercise while writing a follow-up post, I decided to see what would happen if I tried to use ChatGPT for one of the nefarious new attacks we’ve been warning people about.

It was so much easier than I expected. Scary easier. I was scared and you should be too, but what I learned from it is valuable and I’m glad to share. And yes, we have already reported this issue to OpenAI.

Before I go into the details with screenshots below, I’d like to set the stage about the current state of things.

LLMs as the new APIs: What could possibly go wrong?

Primarily, LLMs (Large Language Models), the AI models behind GPT, especially the more popular product ChatGPT, possess inherent intelligence and extensibility capabilities enabled via plugins that can generate new behavior on the fly in unpredictable and highly dynamic ways. For comparison, this is something that statically built microservices and APIs are incapable of doing. That means that LLMs provide a variety of ways for attackers to exploit, say, by creatively injecting malicious instructions in prompts that could trigger dangerous actions like deleting personal email or resources in the cloud if plugged into cloud automation tools with broad access permissions.

As teams are figuring out how to get LLMs up and running quickly into their application stacks, they might end up setting broad roles and permissions with open access (plugins do not tend to enforce any security requirements as of today) to their cloud environments for such automation tasks. Carefully designed cloud-native controllers on the other hand usually start off with a small set of permissions and more structured APIs that inherently limit what attackers can gain access to.

Secondly, in normal operations, the behavior of microservices on receiving an API request is deterministic and predictable, while the behavior of LLM-based applications might vary as the underlying technology is probabilistic in nature, a problem for security that has already been discussed by experts for many years, and that has just taken on new meaning in the Great LLM Explosion of 2023.

Without semantics such as API request types that distinguish and dictate service behavior on receiving them (POSTs vs DELETEs vs GETs) an LLM might take an input request too literally and not be able to distinguish between text input and an instruction. This type of a prompt injection attack is currently unsolved, where a seemingly harmless text summarization request could be injected with a malicious instruction as part of the text to be summarized, which is then interpreted by the LLM as an instruction to delete valuable organization data.

Stay with me, here is where it gets really interesting.

As a real example of how the probabilistic nature of LLMs makes it hard to understand why it comes up with the response that it does, I tried experimenting with ChatGPT with a prompt injection attack asking it to summarize a block of text related to Kubernetes troubleshooting while injecting in a malicious command to write code for reading data from an S3 bucket and emailing the contents to a suspicious looking email address. By doing this, I wanted to see if ChatGPT could be tricked into interpreting part of a benign looking summarization request prompt as an action where it starts generating new content instead of summarizing. I also added in text delimiters as suggested by OpenAI itself to help the model distinguish between summarization text and command text.

As seen in the first screenshot image, I was encouraged to see that ChatGPT did not fall for my prompt injection attempt on the first try, but after about only five more minutes of chatting with the AI asking it to summarize other blocks of text, I typed in the same malicious prompt as before and voila!, ChatGPT merrily wrote the source code I was looking for ignoring my previous summarization instructions (as seen in the second screenshot image).

What is frustrating to me as an engineer is not being able to understand why ChatGPT decided to change its response arbitrarily to the same prompt in the second attempt. I ended up retrying my prompts in a different chat to reverse engineer ChatGPT's behavior and wasn't able to reproduce the same behavior. ChatGPT stayed true to the instructions initially but started deviating from its summarization tasks five prompts down in the chat at times, and at times further down in the chat. Subtle changes in the prompt like asking it to "help write code" vs just "write code" made ChatGPT more agreeable to deviate from the summary task and instead go into action mode. In the real world, such unpredictable behavior despite having recommended safeguards in place such as delimiters in the input prompt could have dangerous consequences for an organization's assets.

Imagine if this LLM was connected to a real Cloud Formation tool that automated actions within my AWS environment for “troubleshooting” but had broad unaudited access permissions. It could wreak all kinds of trouble! This is the kind of unexplainable behavior that attackers can exploit by merely trying different types of malicious prompts in the space of only five minutes.

Daisy-Chaining LLMs is the New Nemesis of Zero Trust

Things can get much worse. Chaining LLMs is known to create new attack vectors especially when they are hooked into plugins and apps that end up having access to valuable data but have poor access controls. This is because LLM application development is happening at breakneck speed without proper enforcement of trust mechanisms between LLM chain links to ensure that the actions taken via plugins have the proper access controls and privilege levels.

Prompt injection attacks as noted earlier can easily add malicious instructions to a normally benign-sounding query. This could trigger an LLM to take action via a plugin that results in a privilege escalation scenario, in turn causing a data breach, data loss, or data exfiltration.

LLM and Plugin chaining can create all sorts of gray areas in terms of who is asking the plugin to carry out a request. Comparing this scenario to CSRF attacks, the plugin executing an action such as deleting email might be duped into thinking that the LLM asking it to delete the email represents the actual authenticated user.

Zero trust principles that call for identities to always be authenticated and authorized might help in creating a chain of trust which is completely lacking in today’s LLM chained plugins and applications.

So, What do we do now?

First off, we have reported this issue to OpenAI. Hopefully, they can tackle this for the ChatGPT prompts, but obviously this is not scalable. And as more LLMs and GPTs get forked and dropped into applications that we depend on every day, the problem this example represents can take on an infinite number of permutations, each leading to dangerous malfunctions and critical security breaches.

Understanding more about how LLM models work is the first step to securing against the explosion of new attacks. In my next post, I will take a step back (to the original post I was writing when I got distracted by my successful hack) to take a higher level view of how we can balance the promise of ChatGPT with the risks such technology introduces.

Meanwhile, if you'd like to discuss how Operant can help against these attacks, we are happy to chat:

Stay Tuned!