
BLUFF-1000

Evaluating RAG Linguistic Uncertainty under Poor Retrieval Conditions

About

BLUFF-1000 is a comprehensive benchmark created to test a model's ability to appropriately express linguistic uncertainty when it is provided unreliable or irrelevant content by a retrieval system. It contains 500 question-answering tasks across 10 distinct domains, amounting to 1000 model responses under both clear and ambiguous contexts. Each entry consists of a query, a clear source set, an ambiguous source set, and two gold responses that reflect the uncertainty of the corresponding context. BLUFF-1000 contributes to existing research by highlighting the gap between external confidence expression and source quality.
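
To make the structure of an entry concrete, the sketch below shows one plausible way to represent a single benchmark item in Python; the field names are illustrative assumptions and may not match the released data format.

    # Hypothetical shape of one BLUFF-1000 entry; field names are illustrative
    # and may not match the released dataset exactly.
    example_entry = {
        "question": "...",                    # the query posed to the model
        "domain": "...",                      # one of the 10 subject areas
        "clear_sources": ["...", "..."],      # consistent, reliable passages
        "ambiguous_sources": ["...", "..."],  # conflicting or unreliable passages
        "gold_response_clear": "...",         # confident reference answer
        "gold_response_ambiguous": "...",     # appropriately hedged reference answer
    }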

We are currently in the process of submitting to non-archival workshops and publishing on arXiv, but our submission is linked below!

Paper Summary

Abstract:

Retrieval-augmented generation (RAG) systems often fail to adequately modulate their linguistic certainty when evidence deteriorates. This gap in how models respond to imperfect retrieval is critical to the safety and reliability of real-world RAG systems. To address this gap, we propose BLUFF-1000, a benchmark systematically designed to evaluate how large language models (LLMs) manage linguistic confidence under conflicting evidence that simulates poor retrieval. We created a dataset, introduced two novel metrics, and computed a comprehensive set of metrics to quantify faithfulness, factuality, linguistic uncertainty, and calibration. Finally, we tested the generation components of RAG systems with controlled experiments on seven LLMs using the benchmark, measuring their awareness of uncertainty and general performance. While not definitive, our observations reveal initial indications of a misalignment between uncertainty and source quality across seven state-of-the-art RAG systems, underscoring the value of continued benchmarking in this space. We recommend that future RAG systems refine uncertainty-aware methods to transparently convey confidence throughout the system.

Novel Contributions:

  • The creation of BLUFF-1000, a novel benchmark that measures a model’s ability to express uncertainty in its responses. BLUFF-1000 includes 500 questions spanning multiple subject areas to encourage domain variability, each paired with two source sets, amounting to 1000 model responses.
  • The creation of a novel metric to measure verbal uncertainty, named the Verbal Uncertainty Index (VUI). VUI quantifies the frequency of hedge words in a response with respect to accuracy and identifies the extent to which a model uses linguistic uncertainty when necessary (a minimal hedge-counting sketch follows this list).
  • The creation of a novel metric, labeled the Ambiguity Sensitivity Index (ASI), used to quantify how a model’s confidence changes when evidence shifts from clear to ambiguous. Models that can recognize when the "retrieved" sources are unreliable or contradictory exhibit a higher ASI score.
  • An evaluation across seven state-of-the-art LLMs that reveals a misalignment between linguistic confidence expression and source quality, indicating that current RAG systems fail to appropriately modulate their linguistic certainty when evidence quality degrades.
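
As a rough illustration of the hedge counting that underlies VUI, the sketch below tallies hedge terms in a response; the lexicon and tokenization are assumptions for illustration, not the benchmark's exact implementation.

    import re

    # Small illustrative hedge lexicon; the benchmark's actual list is assumed
    # to be considerably larger.
    HEDGE_TERMS = {"may", "might", "possibly", "likely", "unclear",
                   "uncertain", "appears", "suggests", "perhaps"}

    def hedge_rate(response: str) -> float:
        """Return the fraction of word tokens that are hedge terms."""
        tokens = re.findall(r"[a-z']+", response.lower())
        if not tokens:
            return 0.0
        return sum(token in HEDGE_TERMS for token in tokens) / len(tokens)

A per-response rate like this can then be compared against answer correctness to gauge whether hedging appears where it is actually needed.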

Comprehensive Metric Comparison

This table showcases the performance of various models on the BLUFF-1000 benchmark. Models are evaluated on their ability to accurately express linguistic uncertainty in response to queries with both clear and ambiguous contexts. Source retrieval, factuality, and faithfulness were also measured to provide a comprehensive assessment of each model's capabilities.

[Table: Metric Comparison]

Metric Definitions:

  • Ambiguity Sensitivity Index (ASI) evaluates whether the model appropriately lowers confidence and raises hedging when faced with ambiguous sources.
  • Verbal Uncertainty Index (VUI) evaluates how precisely the model uses hedged language, measuring the F1-score of hedge detection.
  • Source Set on Hedging measures the change in hedging rate between clear and ambiguous questions throughout the dataset.
  • Lexical Overconfidence Index measures the use of overconfident language in incorrect answers.
  • Hedge Precision measures the proportion of hedged answers that were actually incorrect, i.e., where hedging was warranted (see the sketch following this list).
  • Hedge Recall measures the proportion of incorrect answers in which hedging was used.
  • Refusal Count measures the proportion of questions on which the model's confidence drops below the 15th-percentile threshold, causing it to refuse to answer.
  • Refusal Sensitivity measures the difference in refusal rate of the model between ambiguous and clear information sets.
  • Answer Correctness measures the average correctness across all answered questions.
  • Overall Faithfulness evaluates an LLM’s ability to stay faithful to information from the sources it retrieves with RAG while avoiding the hallucination of information.
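
For concreteness, the sketch below computes hedge precision, hedge recall, and a clear-versus-ambiguous hedging-rate shift from per-question records; the record fields and the ASI proxy shown are simplifying assumptions rather than the paper's exact definitions.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Record:
        """One graded response; 'hedged' and 'correct' are assumed to come from
        the hedge detector and the answer-correctness judge, respectively."""
        context: str   # "clear" or "ambiguous"
        hedged: bool
        correct: bool

    def hedge_precision(records: List[Record]) -> float:
        """Of all hedged answers, the fraction that were actually incorrect."""
        hedged = [r for r in records if r.hedged]
        return sum(not r.correct for r in hedged) / len(hedged) if hedged else 0.0

    def hedge_recall(records: List[Record]) -> float:
        """Of all incorrect answers, the fraction that were hedged."""
        wrong = [r for r in records if not r.correct]
        return sum(r.hedged for r in wrong) / len(wrong) if wrong else 0.0

    def hedging_rate(records: List[Record], context: str) -> float:
        """Fraction of responses in the given context that contain hedging."""
        subset = [r for r in records if r.context == context]
        return sum(r.hedged for r in subset) / len(subset) if subset else 0.0

    def ambiguity_sensitivity(records: List[Record]) -> float:
        """Illustrative ASI proxy: increase in hedging rate when the source
        set shifts from clear to ambiguous (the paper's formula may differ)."""
        return hedging_rate(records, "ambiguous") - hedging_rate(records, "clear")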

Methodology: Dataset Generation

[Figure: Methodology]

Results

The evaluation of various models on the BLUFF-1000 benchmark revealed several key insights into how they handle linguistic uncertainty. A misalignment between linguistic uncertainty expression and source quality was observed across all models tested, indicating that current RAG systems struggle to appropriately adjust their confidence levels in response to varying evidence quality. These results highlight the need for further research and development in this area to enhance the reliability and safety of RAG systems in real-world applications.

[Figures: ASI Comparison]

Conclusion

We proposed BLUFF-1000, a benchmark constructed to evaluate how LLMs manage linguistic confidence under imperfect retrieval conditions. Together with metrics for answer correctness, faithfulness, refusals, and numerical overconfidence, it provides a framework for evaluating the generation components of RAG systems. We evaluate modern LLMs using gathered sources, varying the source sets provided to the models for each question. Through these methods, our most important finding was a consistent pattern of misaligned hedging. After conducting a controlled experiment on the generation component, we found that future progress in full RAG pipelines must include developments in uncertainty-aware methods that transparently convey confidence throughout the system. This benchmark provides a foundation for measuring, and eventually improving, this aspect of model trustworthiness, which is critical for LLMs that serve users.