Large Codebase Question Answering

Benchmarking large language models on repository-level question answering using the StackRepoQA dataset.

Project Description

Understanding large software systems is a central challenge in software engineering. Modern repositories contain thousands of files, complex dependencies, and evolving architectures, making program comprehension difficult for both developers and automated tools. Prior studies estimate that developers spend 58–70% of their time understanding existing code before making modifications.

Large Language Models (LLMs) have recently shown strong performance across many software engineering tasks, including code generation and question answering. However, most existing evaluations focus on small code snippets or single files, which do not reflect real-world development, where answering questions often requires understanding relationships across multiple files and components within a repository.

This project investigates how well modern LLMs perform on repository-level question answering (QA). To support this study, we introduce StackRepoQA, a dataset consisting of 1,318 real developer questions and accepted answers from Stack Overflow mapped to 134 open-source Java repositories. The dataset captures realistic developer information needs and enables systematic evaluation of LLM capabilities on large-scale codebases.
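Conceptually, each benchmark example pairs a Stack Overflow question and its accepted answer with the repository it is mapped to. A minimal sketch of such a record is shown below; the field names and example values here are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass

@dataclass
class RepoQAEntry:
    """One repository-level QA example (hypothetical fields, not the actual StackRepoQA schema)."""
    question_id: int        # Stack Overflow question ID
    title: str              # question title
    body: str               # question body text
    accepted_answer: str    # accepted answer, used as the reference answer
    repository: str         # mapped open-source Java repository, e.g. "apache/commons-lang"

# Illustrative entry (contents are invented for the sketch)
entry = RepoQAEntry(
    question_id=123456,
    title="How do I compare two strings null-safely?",
    body="I keep hitting NullPointerException when comparing strings ...",
    accepted_answer="Use StringUtils.equals(a, b), which handles nulls.",
    repository="apache/commons-lang",
)
```

Grouping entries by `repository` then supports per-repository evaluation, since multiple questions can map to the same codebase.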

Using StackRepoQA, we benchmark two widely used LLMs under multiple configurations. We compare baseline prompting with retrieval-augmented approaches that incorporate repository information, including:

  • File-level retrieval (File-RAG): retrieving relevant source files using semantic search.
  • Graph-based retrieval (Graph-RAG): leveraging structural dependencies between classes, methods, and files.
  • Agentic orchestration: coordinating multiple retrieval agents to ground LLM responses in repository context.

Our results show that while LLMs achieve moderate accuracy (~58%) when answering repository-level questions, much of this success can be attributed to memorization of previously seen Stack Overflow answers rather than genuine reasoning about the codebase. Retrieval augmentation improves performance modestly, with graph-based retrieval providing the largest gains, though overall accuracy remains limited for unseen questions.

These findings highlight both the promise and the limitations of current LLM-based approaches for program comprehension. The StackRepoQA benchmark provides a foundation for future research on repository-scale reasoning, structured retrieval, and reliable AI tools for developers.

This work represents the first phase of a broader research agenda on large-scale program comprehension with AI. While StackRepoQA provides a benchmark for evaluating repository-level question answering, several directions remain open. Future work will explore richer structural representations of software systems, including call graphs, data-flow relationships, and runtime dependencies, to better support reasoning across large codebases. We also plan to extend the dataset beyond Java to additional ecosystems such as Python and JavaScript, and to incorporate other development artifacts including GitHub issues, pull requests, and documentation. Finally, we aim to develop interactive systems that combine structured retrieval, visualization, and multi-agent reasoning to help developers explore and understand complex repositories more effectively.


Technologies

  • Programming Language: Python
  • LLMs: Claude 3.5 Sonnet, GPT-4o
  • Frameworks: AutoGen, LangChain
  • Databases: Neo4j, ChromaDB, SQLite
  • Datasets: Stack Overflow Data Dump, GitHub Repositories
