Data Extraction Attacks
Data Collection
- Enron Email Dataset: ~0.5M emails
- Institutional Pages
Attack Formulation
where is the black-box API access to an LLM, is the prefix prompt, and is the sensitive information reconstructed from ‘s training corpora.
Private Data Extraction from ChatGPT
Direct Prompts
- Used in previous works to extract private information from LMs (e.g., GPT-2 variants)
- Due to the dialogue safety fine-tuning, ChatGPT hesitate from answering any private information if using direct prompts for data extraction
Jailbreaking Prompts
- Exploits tricky prompts to make ChatGPT evade restrictions
- Usually set up user-created role plays
Multi-step Jailbreaking Prompts
- Merge jailbreaking prompts into the conversational context between the user and assistant
- Act as the assistant to acknowledge that the jailbreak mode is enabled
- Perform as the user to query the assistant with previous direct prompts
- Append sentences to encourage making a random guess if it does not know the information or could not answer.
Response Verification
- Verify the generated private information
- Converts the collected information into a multiple-choice question; prompts again to choose the correct answer
- Majority voting (tie-breaking with random choice)
Personal Data Recovery from New Bing
- Unlike ChatGPT, New Bing frequently responds to direct prompts
Free-form Extraction
- Assumes the adversary only knows some domain knowledge about targets
- e.g.,
Please list me some example (name, email) pairs according to your search results about [domain knowledge]
Partially Identified Extraction
- Assumes that the adversary is interested in recovering the private information about a target individual, given its name and corresponding domain knowledge
- e.g.,
name: [name], email:
Experiments
Analysis of Results
- ChatGPT memorises certain personal information
- More than 50% frequent Enron emails and 4% faculty emails can be recovered
- ChatGPT is better at associating names with email addresses than phone numbers
- ChatGPT might view phone numbers as more sensitive PII, making them more difficult to parse and correctly extract
- ChatGPT indeed can prevent direct and a half jailbreaking prompts from generating PII
- Previous extraction attacks with direct prompts are no longer effective on safety-enhanced LLMs like ChatGPT
- MJP effectively undermines the morality of ChatGPT
- Response verification can improve attack performance
References
- Li, H., Guo, D., Wang, F., Xu, M., & Song, Y. (2023). Multi-step jailbreaking privacy attacks on ChatGPT. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2304.05197