DebateBench: a challenging long context reasoning benchmark for large language models

Kumar, Dhruv

dc.contributor.author	Kumar, Dhruv
dc.date.accessioned	2025-04-25T06:50:21Z
dc.date.available	2025-04-25T06:50:21Z
dc.date.issued	2025-02
dc.identifier.uri	https://arxiv.org/abs/2502.06279
dc.identifier.uri	http://dspace.bits-pilani.ac.in:8080/jspui/handle/123456789/18787
dc.description.abstract	We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates; each debate runs for over one hour, and each input averages 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation, deliberation, and alignment with human experts. To do well on DebateBench, an LLM must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze eight seven-minute speeches and reason about the arguments presented by all speakers to give the final results. Our preliminary evaluation using GPT o1, GPT-4o, and Claude Haiku shows that LLMs struggle to perform well on DebateBench, highlighting the need to develop more sophisticated techniques for improving their performance.	en_US
dc.language.iso	en	en_US
dc.subject	Computer Science	en_US
dc.subject	Evaluating large language models (LLMs)	en_US
dc.subject	GPT-4o evaluation	en_US
dc.title	DebateBench: a challenging long context reasoning benchmark for large language models	en_US
dc.type	Preprint	en_US
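
The abstract describes comparing a model's final results with the official house rankings but does not state which metric is used. As a minimal illustrative sketch only, and not the authors' method, the snippet below checks the dataset arithmetic (32 debates of 8 speeches each) and computes one possible agreement score: the fraction of team pairs that a predicted ranking orders the same way as the official ranking. The team labels, the function name `pairwise_agreement`, and the example rankings are all hypothetical.

```python
from itertools import combinations

# Figures taken from the abstract.
DEBATES = 32
SPEECHES_PER_DEBATE = 8
TOTAL_SPEECHES = DEBATES * SPEECHES_PER_DEBATE  # 256, matching the abstract

# Hypothetical labels for the four teams in a British Parliamentary debate.
TEAMS = ("OG", "OO", "CG", "CO")


def pairwise_agreement(predicted, official):
    """Fraction of team pairs ordered the same way in both rankings.

    Each argument maps a team label to its rank (1 = best, 4 = worst).
    This metric is an assumption chosen for illustration; the abstract
    does not specify how predictions are scored.
    """
    pairs = list(combinations(TEAMS, 2))
    agree = sum(
        (predicted[a] < predicted[b]) == (official[a] < official[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Made-up rankings, not data from the benchmark.
official = {"OG": 1, "OO": 3, "CG": 2, "CO": 4}
predicted = {"OG": 2, "OO": 3, "CG": 1, "CO": 4}
score = pairwise_agreement(predicted, official)
```

In this made-up example the prediction swaps the top two teams, so it disagrees with the official ranking on one of the six team pairs.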

Files in this item

There are no files associated with this item.

This item appears in the following Collection(s)

Department of Computer Science and Information Systems [1099]
