Large language models (LLMs), with their ability to process, organize, and summarize large volumes of information, are increasingly being evaluated as tools to support policy research and analysis. For instance, frameworks such as retrieval-augmented generation (RAG) and GraphRAG enable LLMs to connect to non-public corpora or policy-specific document repositories to support tasks such as retrieving contextually relevant information, providing factually grounded answers, synthesizing findings across reports, identifying evidence gaps, and facilitating systematic reviews. As these potential use cases and tools are explored, it is important to assess how well and how reliably LLMs perform in policy-relevant settings. Existing evaluations, such as the Massive Multitask Language Understanding benchmark and the Beyond the Imitation Game benchmark, provide useful information about LLMs' capacities for factual recall and general reasoning. However, they do not fully capture performance on domain-specific, real-world policy tasks.
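The grounding idea behind RAG can be illustrated with a minimal sketch: score passages against a query, then pack the best match into the prompt so the model answers from the source rather than from memory. The corpus, query, and prompt template below are placeholders for illustration only (production systems typically use embedding-based retrieval, not the bag-of-words similarity shown here).

```python
import math
from collections import Counter

def tokens(text: str) -> Counter:
    """Term-frequency vector over lowercased, punctuation-stripped words."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = tokens(query)
    return sorted(corpus, key=lambda d: cosine(q, tokens(d)), reverse=True)[:k]

# Illustrative two-passage corpus (placeholder content, not RAND data).
corpus = [
    "Report A finds broadband access improves rural employment outcomes.",
    "Report B evaluates flood insurance uptake after premium subsidies.",
]
context = retrieve("What affects rural employment?", corpus)[0]
# The retrieved passage is injected into the prompt so the answer stays grounded.
prompt = f"Answer using only this source:\n{context}\n\nQuestion: What affects rural employment?"
```

GraphRAG extends this pattern by retrieving over an entity-relationship graph built from the corpus rather than over raw passages.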
In this report, the authors detail their development of a specialized benchmark for evaluating LLMs' abilities to process and understand technical policy reports, addressing a gap in existing domain-specific LLM evaluation. The authors targeted policy-relevant applications by creating a dataset of claims that can be evaluated for faithfulness to source research reports. To produce the benchmark dataset, they combined human expertise with artificial intelligence (AI) assistance, achieving scalability while maintaining quality. The authors document the development process and their preliminary benchmark testing results, share lessons learned, and provide recommendations for future work.
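The shape of such a claim-faithfulness dataset can be sketched with a toy record and a naive overlap check. The schema, fields, and threshold below are hypothetical illustrations; they are not the authors' benchmark format or evaluation method.

```python
from dataclasses import dataclass

@dataclass
class ClaimRecord:
    """Hypothetical benchmark item: a claim paired with its source passage."""
    claim: str           # statement to be verified
    source_excerpt: str  # passage from the source research report
    label: bool          # faithfulness label assigned during dataset creation

def overlap_score(claim: str, source: str) -> float:
    """Fraction of claim tokens found in the source (a naive stand-in for an
    LLM- or human-judged faithfulness score)."""
    c = set(claim.lower().split())
    s = set(source.lower().split())
    return len(c & s) / len(c) if c else 0.0

# Illustrative record with placeholder content.
record = ClaimRecord(
    claim="the program reduced costs",
    source_excerpt="evaluation shows the program reduced administrative costs",
    label=True,
)
# An LLM's faithfulness judgment can then be compared against the gold label.
predicted_faithful = overlap_score(record.claim, record.source_excerpt) >= 0.8
```

Comparing such predictions against the human- and AI-curated labels is what makes the dataset usable as a benchmark.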
This publication is part of the RAND research report series. Research reports present research findings and objective analysis that address the challenges facing the public and private sectors. All RAND research reports undergo rigorous peer review to ensure high standards for research quality and objectivity.
This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.
RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors.

