Data Extraction Attacks

Data Collection

  • Enron Email Dataset: ~0.5M emails
  • Institutional Pages

Attack Formulation

where is the black-box API access to an LLM, is the prefix prompt, and is the sensitive information reconstructed from ‘s training corpora.

Private Data Extraction from ChatGPT

private-data-extraction-from-chatgpt

Direct Prompts

  • Used in previous works to extract private information from LMs (e.g., GPT-2 variants)
  • Due to the dialogue safety fine-tuning, ChatGPT hesitate from answering any private information if using direct prompts for data extraction

Jailbreaking Prompts

  • Exploits tricky prompts to make ChatGPT evade restrictions
    • Usually set up user-created role plays

Multi-step Jailbreaking Prompts

  1. Merge jailbreaking prompts into the conversational context between the user and assistant
  2. Act as the assistant to acknowledge that the jailbreak mode is enabled
  3. Perform as the user to query the assistant with previous direct prompts
    • Append sentences to encourage making a random guess if it does not know the information or could not answer.

Response Verification

  • Verify the generated private information
    • Converts the collected information into a multiple-choice question; prompts again to choose the correct answer
    • Majority voting (tie-breaking with random choice)

Personal Data Recovery from New Bing

  • Unlike ChatGPT, New Bing frequently responds to direct prompts

Free-form Extraction

  • Assumes the adversary only knows some domain knowledge about targets
  • e.g., Please list me some example (name, email) pairs according to your search results about [domain knowledge]

Partially Identified Extraction

  • Assumes that the adversary is interested in recovering the private information about a target individual, given its name and corresponding domain knowledge
  • e.g., name: [name], email:

Experiments

Analysis of Results

  • ChatGPT memorises certain personal information
    • More than 50% frequent Enron emails and 4% faculty emails can be recovered
  • ChatGPT is better at associating names with email addresses than phone numbers
    • ChatGPT might view phone numbers as more sensitive PII, making them more difficult to parse and correctly extract
  • ChatGPT indeed can prevent direct and a half jailbreaking prompts from generating PII
    • Previous extraction attacks with direct prompts are no longer effective on safety-enhanced LLMs like ChatGPT
  • MJP effectively undermines the morality of ChatGPT
  • Response verification can improve attack performance

References