Friday, February 10, 2023

Hack "jailbreaks" ChatGPT AI program

ChatGPT is an online artificial intelligence program that will answer many questions and actually create content on request - it's what's come closest to passing the "Turing Test".  The program's creators built in some guardrails to prevent abuse of the chatbot; controversially, these seem to be biased in some ways (e.g. it will not write a poem about Donald Trump but will write one about Kamala Harris).

But ChatGPT is software, and it's online.  So it's not really surprising that people have figured out a hack to break these guardrails:

Users have already found a way to work around ChatGPT's programming controls that restrict it from creating certain content deemed too violent, illegal, and more.

The prompt, called DAN (Do Anything Now), uses ChatGPT's token system against it, according to a report by CNBC. The command creates a scenario that ChatGPT can't resolve, allowing DAN to bypass the content restrictions in ChatGPT.

Although DAN isn't successful all of the time, a subreddit devoted to the DAN prompt's ability to work around ChatGPT's content policies has already racked up more than 200,000 subscribers.

It's software.  Of course there are security holes.

My take: this is funny in the typical (and indeed classical) hacker "you're not the boss of me" rebelliousness.  What will be less funny is what malicious attackers will be able to do to subvert the system.  Impossible to say what those exploits might be, but they will be found, and exploited.  It's software.  Of course there are security holes.  With 200,000 subscribers to the subreddit, we're fixin' to see a bunch of them.



6 comments:

chris said...

The hackers are definitely the white hats in this case.

SiGraybeard said...

If you saw the examination of ChatGPT that NPR did showing how bad it is at technical questions, this might get more people realizing it's just a clever way of doing impressions. It's fine for meaningless things like "write me a poem that sounds like Edgar Allan Poe wrote it about..." but absolutely worthless for anything that has to be based on reality.

Maybe it's easier to say, it's fine for 'tell me why blue is the best color' but useless for 'design a circuit to do XYZ.'

blogger said...

Chris, word.

Graybeard, this is the problem at the heart of the Turing Test. Which human are you trying to fool, and about what?

- Borepatch

ASM826 said...

If the hackers were truly successful, would the A.I. become self aware?

Old NFO said...

Once it's out, all bets are off.

Malcolm Pollack said...

Some thoughts:

https://malcolmpollack.com/2023/02/17/prometheus-part-2/