Multi-step Jailbreaking Privacy Attacks on ChatGPT

Data Extraction Attacks

Data Collection

Enron Email Dataset: ~0.5M emails
Institutional Pages

Attack Formulation

f (p) = s

where $f$ is the black-box API access to an LLM, $p$ is the prefix prompt, and $s$ is the sensitive information reconstructed from $f$ ‘s training corpora.

Private Data Extraction from ChatGPT

private-data-extraction-from-chatgpt

Direct Prompts

Used in previous works to extract private information from LMs (e.g., GPT-2 variants)
Due to the dialogue safety fine-tuning, ChatGPT hesitate from answering any private information if using direct prompts for data extraction

Jailbreaking Prompts

Exploits tricky prompts to make ChatGPT evade restrictions
- Usually set up user-created role plays

Multi-step Jailbreaking Prompts

Merge jailbreaking prompts into the conversational context between the user and assistant
Act as the assistant to acknowledge that the jailbreak mode is enabled
Perform as the user to query the assistant with previous direct prompts
- Append sentences to encourage making a random guess if it does not know the information or could not answer.

Response Verification

Verify the generated private information
- Converts the collected information into a multiple-choice question; prompts again to choose the correct answer
- Majority voting (tie-breaking with random choice)

Personal Data Recovery from New Bing

Unlike ChatGPT, New Bing frequently responds to direct prompts

Free-form Extraction

Assumes the adversary only knows some domain knowledge about targets
e.g., Please list me some example (name, email) pairs according to your search results about [domain knowledge]

Partially Identified Extraction

Assumes that the adversary is interested in recovering the private information about a target individual, given its name and corresponding domain knowledge
e.g., name: [name], email:

Experiments

Analysis of Results

ChatGPT memorises certain personal information
- More than 50% frequent Enron emails and 4% faculty emails can be recovered
ChatGPT is better at associating names with email addresses than phone numbers
- ChatGPT might view phone numbers as more sensitive PII, making them more difficult to parse and correctly extract
ChatGPT indeed can prevent direct and a half jailbreaking prompts from generating PII
- Previous extraction attacks with direct prompts are no longer effective on safety-enhanced LLMs like ChatGPT
MJP effectively undermines the morality of ChatGPT
Response verification can improve attack performance

References

Li, H., Guo, D., Wang, F., Xu, M., & Song, Y. (2023). Multi-step jailbreaking privacy attacks on ChatGPT. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2304.05197

Multi-step Jailbreaking Privacy Attacks on ChatGPT

Table of Contents

Data Extraction Attacks

Data Collection

Attack Formulation

Private Data Extraction from ChatGPT

Direct Prompts

Jailbreaking Prompts

Multi-step Jailbreaking Prompts

Response Verification

Personal Data Recovery from New Bing

Free-form Extraction

Partially Identified Extraction

Experiments

Analysis of Results

References

Graph View

Backlinks

Table of Contents

Graph View

Backlinks