Introduction

  • Ambiguity arises frequently in open-domain QA, where questions are written during information gathering without knowledge of the answer
  • Introduce a new task, AmbigQA, which requires:
    • Identifying all plausible answers to an open-domain question
    • Providing disambiguated questions that differentiate those answers
  • Construct a new dataset AmbigNQ
    • 14,042 annotations on NQ-open questions containing diverse types of ambiguity
  • Introduce the first baseline models that:
    • Produce multiple answers to open-domain questions
    • Generate a disambiguated question for each answer

Task: AmbigQA

Setup

[Figure: AmbigQA setup example]

  • Input: Prompt question q
  • Output: A list of question-answer pairs (q_1, a_1), …, (q_n, a_n) (a minimal sketch of this format follows the list below)
    • Each a_i is an equally plausible answer to q
    • Each q_i is a minimally edited modification of q whose answer is unambiguously a_i
  • Subtasks
    • Multiple Answer Prediction
      • Input: A question q
      • Output: A set of semantically distinct and equally plausible answers a_1, …, a_n (where n is unknown)
    • Question Disambiguation
      • Input: A question q and a set of answers a_1, …, a_n
      • Output: Disambiguated questions q_1, …, q_n, where each q_i is a minimal edit of q that makes its answer unambiguously and only a_i
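
For concreteness, here is the task format as a small Python structure. The layout is my own sketch, and the second interpretation and its answer are illustrative rather than actual AmbigNQ annotations.

```python
# One AmbigQA example: a prompt question paired with its disambiguated QA pairs.
example = {
    "prompt_question": "Who made the play the crucible?",
    "qa_pairs": [
        {
            # minimal edit: "made" -> "wrote"
            "question": "Who wrote the play the crucible?",
            "answer": "Arthur Miller",
        },
        {
            # an illustrative second reading of "made"
            "question": "Who directed the original Broadway production of the crucible?",
            "answer": "Jed Harris",
        },
    ],
}
```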

Metrics

  • Goal: Compare a model prediction with m QA pairs (q_1, a_1), …, (q_m, a_m) against a gold reference set with n pairs

    • Each gold answer is a set of acceptable answer strings, and all the gold answer sets are disjoint
  • Correctness score of the i-th prediction: c_i = f(q_i, gold question j) if a_i is in the j-th gold answer set, and 0 otherwise, where f is the similarity measure between questions

    • Considers:
      • The correctness of the answer
      • The similarity between the predicted and the reference question
  • Calculate F1 treating the c_i as measures of correctness: precision = Σ c_i / m, recall = Σ c_i / n

    • Choices of the similarity measure f define three metrics (a sketch of the computation follows this list):
      • F1-ans: f always yields 1, so only answer correctness counts
      • F1-BLEU: f computes the BLEU score between the predicted and reference questions
      • F1-EDIT-F1: f computes an F1 score over the unigram edits (words added to or deleted from the prompt question)
        • Prompt question: “Who made the play the crucible?”
        • Gold edit: “Who wrote the play the crucible?” → deletes “made”, adds “wrote”
        • Predicted edit: “Who made the play the crucible in 2012?” → adds “in” and “2012”; no overlap with the gold edits, so EDIT-F1 = 0
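
A rough sketch of how these metrics compose, assuming simple lowercased regex tokenization (the official evaluation script may normalize text differently). Here `sim` plays the role of the similarity measure f: passing `lambda *args: 1.0` gives F1-ans, and passing `edit_f1` gives F1-EDIT-F1.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def unigram_edits(prompt, question):
    """Multiset of unigram additions/deletions that turn the prompt into the question."""
    p, q = Counter(tokenize(prompt)), Counter(tokenize(question))
    added = {("+", w): c for w, c in (q - p).items()}
    deleted = {("-", w): c for w, c in (p - q).items()}
    return Counter(added) + Counter(deleted)

def f1(overlap, n_pred, n_gold):
    if n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = overlap / n_pred, overlap / n_gold
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def edit_f1(prompt, pred_question, gold_question):
    pred_edits = unigram_edits(prompt, pred_question)
    gold_edits = unigram_edits(prompt, gold_question)
    overlap = sum((pred_edits & gold_edits).values())
    return f1(overlap, sum(pred_edits.values()), sum(gold_edits.values()))

def ambigqa_f1(prompt, predictions, references, sim):
    """predictions: list of (question, answer) pairs.
    references: list of (question, set of acceptable answer strings); answer sets are disjoint.
    sim(prompt, pred_q, gold_q) is the similarity measure f from the notes."""
    correctness = []
    for pred_q, pred_a in predictions:
        c = 0.0
        for gold_q, gold_answers in references:
            if pred_a in gold_answers:          # at most one match, since sets are disjoint
                c = sim(prompt, pred_q, gold_q)
        correctness.append(c)
    return f1(sum(correctness), len(predictions), len(references))

# The example from these notes: the predicted edit shares no unigrams with the gold edit.
prompt = "Who made the play the crucible?"
print(edit_f1(prompt, "Who made the play the crucible in 2012?",
              "Who wrote the play the crucible?"))   # -> 0.0
```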

Data: AmbigNQ

Collection

  • Used prompt questions from NQ-open, with English Wikipedia as the evidence corpus
  • Constructed via crowdsourcing
  • Two stage pipeline: generation and validation

Generation

  • Given a prompt question and a Google Search API restricted to English Wikipedia
    • Find all plausible answers to the question
    • For some questions containing temporal deixis, remove time-dependence by rewriting the prompt question

Validation

  • Review the annotations provided by multiple generators
    • Mark each generator’s annotations as correct / incorrect
    • Provide a new set of QA pairs by combining the valid ones from each generator
  • Validators have access to Wikipedia and the pages that the generators viewed
  • Validation is skipped when the annotated answers from all generators match exactly

Quality Control

  • Highly qualified workers

Analysis

Types of Ambiguity

[Figure: Types of ambiguity in AmbigNQ]

Model

  • Input: Prompt question q
  • Predict: Answers a_1, …, a_n
  • Generate: Corresponding disambiguated questions q_1, …, q_n, conditioning on q, the answers a_1, …, a_n, and the evidence (top retrieved passages)

Multiple Answer Prediction: SpanSeqGen

  • Follows DPR
  1. Retrieve 100 passages with a BERT-based bi-encoder
  2. Rerank the passages using a BERT-based cross-encoder
  3. Sequentially generate distinct answers token-by-token with a BART-based seq2seq model, conditioned on the concatenation of q and the top reranked passages (in rank order, up to 1024 tokens); see the decoding sketch below
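
A minimal decoding sketch with Hugging Face BART, not the authors' code. It assumes passages have already been retrieved and reranked, that the distinct answers are emitted as one sequence joined by a separator string (here `<sep>`, an assumed convention), and that an untuned `facebook/bart-large` checkpoint stands in for the fine-tuned model.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def predict_answers(question, reranked_passages, max_input_tokens=1024):
    # Concatenate the question with as many top-ranked passages as fit in 1024 tokens.
    text = question
    for title, passage in reranked_passages:
        candidate = f"{text} </s> {title} </s> {passage}"
        if len(tokenizer(candidate)["input_ids"]) > max_input_tokens:
            break
        text = candidate
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=max_input_tokens)
    # Generate one token sequence that contains all predicted answers.
    output_ids = model.generate(**inputs, num_beams=4, max_length=64)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Split on the assumed separator and drop duplicates, preserving order.
    answers = [a.strip() for a in decoded.split("<sep>") if a.strip()]
    return list(dict.fromkeys(answers))
```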

Question Disambiguation

  • BART-based model
    • Generates each question q_i conditioning on the concatenation of q, the target answer a_i, the other answers a_j (j ≠ i), and the same top passages used by SpanSeqGen
  • Pretrain on NQ-open to generate a question given an answer and a passage
  • Fine-tune on AmbigNQ (a sketch of the input construction follows)
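
A sketch of how the question-disambiguation input could be assembled before feeding it to the same kind of BART seq2seq model as above. The field markers and separators are assumptions; the paper does not spell out the exact formatting here.

```python
def build_qd_input(prompt_question, target_answer, other_answers, passages):
    """Concatenate the prompt question q, the target answer a_i, the other answers,
    and the top reranked passages into a single source string (format assumed)."""
    header = (f"{prompt_question} </s> answer: {target_answer} </s> "
              f"other answers: {' ; '.join(other_answers)}")
    context = " </s> ".join(f"{title} {text}" for title, text in passages)
    return f"{header} </s> {context}"

# Example: build the source string for one target answer.
source = build_qd_input(
    "Who made the play the crucible?",
    "Arthur Miller",
    ["Jed Harris"],
    [("The Crucible", "The Crucible is a 1953 play by American playwright Arthur Miller ...")],
)
```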

Co-training with Weak Supervision

  • Treats the single-answer NQ-open annotations as weak supervision
  • Learns to discover potential ambiguity in that data (a rough sketch of the loop follows the figure below)

[Figure: Democratic co-training with weak supervision]
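
Below is only a rough, generic sketch of such a loop under my own assumptions (selection rule, number of rounds), not a faithful reproduction of the paper's democratic co-training: the model is trained on AmbigNQ, used to label NQ-open questions with answer sets, and examples whose predicted sets contain the single NQ-open answer are kept as silver data.

```python
def cotrain(ambignq_train, nq_open_train, train_fn, predict_fn, rounds=2):
    """ambignq_train: fully annotated multi-answer examples.
    nq_open_train: (question, single_answer) pairs used as weak supervision.
    train_fn / predict_fn are placeholders for model training and answer-set prediction."""
    labeled = list(ambignq_train)
    model = train_fn(labeled)
    for _ in range(rounds):
        silver = []
        for question, nq_answer in nq_open_train:
            predicted_answers = predict_fn(model, question)
            # Keep model-labeled examples consistent with the single NQ-open answer.
            if nq_answer in predicted_answers:
                silver.append((question, predicted_answers))
        model = train_fn(labeled + silver)
    return model
```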

Experiments

Baselines

Disambig-first

  • Feed the prompt question q into a BERT-based binary classifier to determine whether it is ambiguous
    • If so, pass it into a BART-based model that generates a sequence of disambiguated questions
    • Otherwise, consider only q itself
  • Feed each resulting question into a SOTA model on NQ-open to produce its answer (see the pipeline sketch below)
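
A compact sketch of the Disambig-first pipeline; `classify_ambiguous`, `generate_disambiguations`, and `nq_open_qa` are placeholder callables standing in for the BERT classifier, the BART rewriter, and the SOTA NQ-open model.

```python
def disambig_first(prompt_question, classify_ambiguous,
                   generate_disambiguations, nq_open_qa):
    if classify_ambiguous(prompt_question):
        # The rewriter emits a sequence of disambiguated questions.
        questions = generate_disambiguations(prompt_question)
    else:
        questions = [prompt_question]
    # Answer each (possibly rewritten) question independently.
    return [(q, nq_open_qa(q)) for q in questions]
```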

Thresholding + QD

  • DPR model with thresholding for multiple answer prediction
    • DPR outputs a likelihood score for each span
    • Obtain the answer set by taking all valid spans whose likelihood exceeds a hyperparameter threshold
  • Training process otherwise the same as SpanSeqGen (a sketch of the thresholding step follows this list)
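
A sketch of the thresholding step; `spans` is assumed to be a list of (answer_text, likelihood) pairs from the DPR reader, and the threshold is tuned on the development set.

```python
def answers_by_thresholding(spans, threshold):
    # Keep every valid span whose likelihood exceeds the tuned threshold.
    kept = [text for text, likelihood in spans if likelihood > threshold]
    # Deduplicate while preserving the original ranking order.
    return list(dict.fromkeys(kept))
```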

Results

[Figure: Main experimental results]

  • Disambig-first is significantly worse than other models
    • Ambiguity classification accuracy (67%) is close to the majority baseline (60%)
    • When the model rewrites an ambiguous question, its rewrites look reasonable but do not match the facts
    • Reading evidence documents is crucial for identifying and characterising ambiguities
  • SpanSeqGen + QD outperforms Thresholding + QD, but only by a small margin
    • Thresholding may be a surprisingly effective baseline for outputting multiple answers
    • Maximising likelihood in a seq2seq model (SpanSeqGen) may not produce well-calibrated results
      • It suffers from variation in the length of the output sequence
  • Substantial difference in performance between development and test overall
    • Likely due to distributional differences in the original questions in NQ-open
  • Ensemble trained with co-training method achieves the best performance on all metrics

[Figure: Ablation study and zero-shot results]

Ablation Study

  • Simply copying the prompt question gives a high F1-BLEU score
    • Justifies using F1-EDIT-F1 to evaluate semantic differences from the prompt question
  • The QD model conditioned on all available context is better than the other (ablated) variants
  • Overall low performance, even given the gold answers
    • Maximising the likelihood of the output sequence can miss the importance of edits to the prompt question
      • QD model may miss the information that is most important to differentiate one answer from the others
    • Lack of annotated data, especially for question disambiguation
    • Metric may miss edits that are semantically correct, but phrased differently

Zero-shot Results

  • System predicts multiple distinct answers without using AmbigNQ training data

Error Analysis

  • When there are multiple reference answers, the model rarely gets all correct answers, although often generates a subset of them
  • Accuracy on examples with a single answer is quite high, higher than SOTA levels on NQ-open
    • NQ-open may substantially underestimate performance due to the prevalence of unmarked ambiguity
  • Recall of multiple answers is one of the primary challenges in AmbigQA

Conclusion & Future Work

  • Explicitly modelling ambiguity over events and entities
  • Open-ended approaches on top of AmbigQA:
    • Applying the approach to QA over structured data
    • Handling questions with no answer or ill-formed questions that require inferring and satisfying more complex ambiguous information needs
    • More carefully evaluating usefulness to end users

References

  • Min, S., Michael, J., Hajishirzi, H., & Zettlemoyer, L. (2020). AmbigQA: Answering Ambiguous Open-domain Questions. arXiv. https://doi.org/10.48550/arxiv.2004.10645
  • Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv. https://arxiv.org/pdf/2004.04906.pdf