Introduction

  • Present AmbiEnt (Ambiguity in Entailment)
    • English benchmark of 1,645 examples
    • Covers a variety of lexical, syntactic, and pragmatic ambiguities, as well as sentences that can plausibly be read as conveying one of multiple different messages
  • Characterising ambiguity requires a choice of meaning representation to distinguish between possible interpretations (enumerating them can be tricky or impractical)
    • Adopt a functional approach
    • Characterise ambiguity in the premise and/or hypothesis by its effect on entailment relations
  • Design a suite of tests based on AmbiEnt to investigate the extent to which understanding of ambiguity is acquired during pretraining of LLMs
    • Investigate whether LMs can be finetuned on existing NLI data for the less demanding task of ambiguity recognition (w/o explicit disambiguation)

AmbiEnt


Curated Examples

  • Handwritten or sourced from existing NLI datasets and linguistics textbooks
    • DistNLI, ImpPres, NLI Diagnostics, MNLI, WaNLI

Generated Examples


  • Automatically identify groups of premise-hypothesis pairs that share a reasoning pattern
  • Use WaNLI as source of examples
    • Each group contains
      • A randomly chosen example on which its two annotators disagreed
      • Its 4 nearest neighbours (according to the final-layer embedding of a WaNLI-trained NLI model)
    • Format each group into a prompt with an instruction, then sample 5 continuations from InstructGPT
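
A rough sketch of this grouping step (not the authors' code), assuming final-layer embeddings of WaNLI examples have already been extracted from the WaNLI-trained NLI model; all function and variable names are hypothetical:

```python
# Hypothetical sketch: group each annotator-disagreement example with its
# 4 nearest WaNLI neighbours by cosine similarity of NLI-model embeddings.
import numpy as np

def build_groups(disagreement_ids: list[int], embeddings: np.ndarray, k: int = 4):
    # Normalise rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups = []
    for i in disagreement_ids:
        sims = normed @ normed[i]
        sims[i] = -np.inf                    # exclude the example itself
        neighbours = np.argsort(-sims)[:k]   # k most similar examples
        groups.append([i, *neighbours.tolist()])
    # Each group of 5 examples is then formatted into an instruction prompt,
    # and 5 continuations are sampled from InstructGPT.
    return groups
```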

Annotation and Validation

  • Examples from InstructGPT consist of unlabelled premise-hypothesis pairs
  • Annotation phase
    • 37 university-level linguistics students
    • Select a set of labels for each example
      • If more than one label is chosen, provide a disambiguating rewrite for each label
    • Discard offensive or low-quality examples
  • Validation phase
    • Performed by subset of the authors
    • Review and revise the two sets of annotations for each example
    • Aggregate them into a single coherent annotation
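
Loosely, the resulting record bundles the ambiguous pair with its label set and the disambiguating rewrites. The field names below are assumptions for illustration, not the released dataset's schema:

```python
from dataclasses import dataclass, field

# Assumed record layout (illustrative; not the official AmbiEnt schema).
@dataclass
class Disambiguation:
    premise: str        # rewrite of the premise (unchanged if unambiguous)
    hypothesis: str     # rewrite of the hypothesis (unchanged if unambiguous)
    label: str          # one of "entailment" | "neutral" | "contradiction"

@dataclass
class AmbientExample:
    premise: str
    hypothesis: str
    labels: set[str]    # more than one label iff the example is ambiguous
    disambiguations: list[Disambiguation] = field(default_factory=list)
```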

Agreement

  • Four validators annotate a subset of 50 examples in common to calculate inter-annotator agreement
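
One way to compute agreement on a multi-label task like this (not necessarily the paper's exact metric) is a per-label Fleiss' κ, treating each NLI label as a binary decision per validator:

```python
# Illustrative agreement computation, assuming annotations are stored as
# label sets per validator; this may differ from the paper's metric.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

def per_label_kappa(annotations: list[list[set[str]]], label: str) -> float:
    """annotations[i][j] is the label set chosen by validator j on example i."""
    # Rows: examples; columns: counts of (did not choose, chose) this label.
    table = np.array([[sum(label not in s for s in sets),
                       sum(label in s for s in sets)]
                      for sets in annotations])
    return fleiss_kappa(table)
```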

Analysis

  • Analysis to understand how annotators behave on ambiguous input (under the traditional 3-way annotation scheme for NLI)

Setup

  • Crowdworkers review ambiguous examples in AmbiEnt
  • Task is split into three steps
    1. Annotation of ambiguous example
      • Follows the traditional NLI labelling setup
    2. Recognition of disambiguations
      • The ambiguous sentence of the example is given for consideration
      • Given 3 candidate interpretations (2 disambiguations and 1 distractor), indicate whether each sentence is a possible interpretation of the ambiguous sentence
    3. Annotation of disambiguated examples
      • 3 new NLI examples are obtained by substituting the ambiguous sentence of the original example with each candidate interpretation from step 2
      • Select a single NLI label for each new example

Evaluating Pretrained Language Models

  • Evaluate if LMs can:
    1. Directly generate relevant disambiguations
    2. Recognise the validity of plausible interpretations
    3. Model open-ended continuations reflecting different interpretations
  • All evaluations use the ambiguous instances from AmbiEnt

Generating Disambiguations


  • Whether LMs can learn in-context to directly generate disambiguations and corresponding labels
  • Perform both automatic and human evaluation
    • Automatic: EDIT-F1 metric from AmbigQA
    • Human: Same setup as the previous crowdworker experiment (w/o step 1)
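
EDIT-F1 treats a disambiguation as a set of edits to the ambiguous sentence (tokens added and tokens deleted) and scores a predicted rewrite by edit-level F1 against the reference rewrite. A simplified sketch, assuming whitespace tokenisation:

```python
# Hedged sketch of EDIT-F1 in the spirit of AmbigQA: an "edit" is a token
# added to or deleted from the ambiguous sentence; predictions are scored
# by F1 overlap of their edits against the reference rewrite's edits.
from collections import Counter

def edits(ambiguous: str, rewrite: str) -> Counter:
    a, r = Counter(ambiguous.lower().split()), Counter(rewrite.lower().split())
    added, deleted = r - a, a - r                  # tokens introduced / removed
    return Counter({("+", t): n for t, n in added.items()}) \
         + Counter({("-", t): n for t, n in deleted.items()})

def edit_f1(ambiguous: str, predicted: str, reference: str) -> float:
    pred, gold = edits(ambiguous, predicted), edits(ambiguous, reference)
    if not pred or not gold:
        return float(pred == gold)                 # both empty -> perfect match
    overlap = sum((pred & gold).values())
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
```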

Recognising Disambiguations


  • Focus on the ambiguous sentences alone
  • Model prediction: whichever of the tokens True and False has the greater logit
  • Executed zero-shot
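
Concretely, this zero-shot test can be run with any causal LM by comparing next-token logits. A minimal sketch with Hugging Face transformers, where the prompt template and the choice of gpt2 are placeholders, not the paper's setup:

```python
# Illustrative zero-shot True/False recognition test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_possible_interpretation(ambiguous: str, interpretation: str) -> bool:
    prompt = (f'Sentence: "{ambiguous}"\n'
              f'Is "{interpretation}" a possible interpretation? True or False:')
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]              # next-token distribution
    # First sub-word of " True" / " False" stands in for the full word.
    true_id = tok(" True", add_special_tokens=False).input_ids[0]
    false_id = tok(" False", add_special_tokens=False).input_ids[0]
    return (logits[true_id] > logits[false_id]).item() # predict the larger logit
```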

Modelling Interpretation-Specific Continuations

  • Whether LMs implicitly model different interpretations in their distributions of text continuations
  • Obtain continuations for each interpretation, and quantify how surprised the LM is to see them
    • Sample 100 continuations $c \sim P(\cdot \mid d_i)$ conditioned on each disambiguation $d_i$ as context
    • Compare the likelihood of each continuation $c$ under the ambiguous sentence $a$ vs. the corresponding disambiguation $d_i$: $\frac{1}{100}\sum_{c}\left[\log P(c \mid d_i) - \log P(c \mid a)\right]$
    • This is an unbiased estimate of the KL divergence between $P(\cdot \mid d_i)$ and $P(\cdot \mid a)$
  • Expect the LM to model continuations from the true disambiguations better than those from the distractor $\tilde{d}$: $D_{\mathrm{KL}}\big(P(\cdot \mid \tilde{d}) \,\|\, P(\cdot \mid a)\big) > D_{\mathrm{KL}}\big(P(\cdot \mid d_i) \,\|\, P(\cdot \mid a)\big)$
    • KL ranking accuracy: the fraction of instances for which this holds
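
A small sketch of this estimator and the ranking check; `logprob` is an assumed scorer returning log P(text | context) under the LM, injected by the caller:

```python
# Sketch of the KL ranking test under stated assumptions.
from typing import Callable
import numpy as np

Scorer = Callable[[str, str], float]   # (continuation, context) -> log-prob

def kl_estimate(disambig: str, ambiguous: str,
                continuations: list[str], logprob: Scorer) -> float:
    """Monte Carlo estimate of KL(P(.|disambig) || P(.|ambiguous)),
    where `continuations` were sampled from P(.|disambig)."""
    return float(np.mean([logprob(c, disambig) - logprob(c, ambiguous)
                          for c in continuations]))

def ranked_correctly(ambiguous: str, disambigs: dict[str, list[str]],
                     distractor_kl: float, logprob: Scorer) -> bool:
    """KL ranking: every true disambiguation should be closer to the
    ambiguous sentence (smaller KL) than the distractor is."""
    return all(kl_estimate(d, ambiguous, conts, logprob) < distractor_kl
               for d, conts in disambigs.items())
```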

Results


References

  • Liu, A., Wu, Z., Michael, J., Suhr, A., West, P., Koller, A., Swayamdipta, S., Smith, N. A., & Choi, Y. (2023). We're Afraid Language Models Aren't Modeling Ambiguity. arXiv preprint arXiv:2304.14399. https://doi.org/10.48550/arxiv.2304.14399