Large Codebase Question Answering
Benchmarking LLMs on repository-level question answering with StackRepoQA
Project Description
Understanding large software repositories is one of the hardest parts of software engineering. Real-world codebases span hundreds or thousands of files and involve complex dependencies and evolving architectural decisions, making it difficult for both developers and AI systems to answer questions that require repository-level understanding. Prior research shows that developers spend a large portion of their time understanding existing code before making changes.
This project studies how well large language models can answer real developer questions that require understanding an entire repository rather than a single file or code snippet. To support this work, we introduced StackRepoQA, the first publicly available multi-project benchmark for repository-level question answering. The dataset contains 1,318 real Stack Overflow questions with accepted answers, mapped to 134 open-source Java repositories, enabling a more realistic evaluation of LLM-based program comprehension.
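Each benchmark entry pairs a question with its accepted answer and the repository it maps to. A minimal sketch of what one record might look like; the field names and values here are illustrative assumptions, not the published schema:

```python
# Hypothetical shape of one StackRepoQA record; field names and values
# are illustrative assumptions, not the dataset's actual schema.
record = {
    "question_id": 12345678,          # Stack Overflow question ID
    "title": "How does the framework resolve circular bean dependencies?",
    "body": "...",                    # full question text
    "accepted_answer": "...",         # accepted answer text (gold reference)
    "repository": "owner/java-project",  # mapped open-source Java repository
    "created_at": "2024-03-01",       # enables cutoff-aware evaluation
}
```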
Using StackRepoQA, we evaluated Claude 3.5 Sonnet and GPT-4o under both direct prompting and retrieval-augmented settings. The project compares multiple approaches for grounding model responses in repository context (illustrative sketches of the augmented settings follow the list):
- Direct prompting without repository augmentation
- File-level RAG, which retrieves relevant files or code fragments using semantic search
- Graph-based RAG, which retrieves structural relationships such as class hierarchies, containment, and cross-file dependencies
- Multi-agent orchestration, where a supervisor coordinates retrieval and summarization agents to produce grounded answers
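As a rough illustration of the file-level RAG setting, the sketch below indexes repository files in ChromaDB and retrieves the top semantic matches for a question. The collection name, one-document-per-file granularity, and `k` are illustrative choices, not the project's actual pipeline:

```python
import chromadb

# Minimal file-level retrieval sketch using ChromaDB's default embeddings.
client = chromadb.Client()
collection = client.create_collection("repo_files")  # name is a placeholder

def index_files(files: dict[str, str]) -> None:
    """Index repository files keyed by path; one document per file."""
    collection.add(ids=list(files.keys()), documents=list(files.values()))

def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k file contents most semantically similar to the question."""
    result = collection.query(query_texts=[question], n_results=k)
    return result["documents"][0]
```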
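For the graph-based setting, structural relationships can be queried from a code graph stored in Neo4j. A minimal sketch, assuming `Class` nodes linked by `EXTENDS` and `DEPENDS_ON` relationships; these labels, the connection details, and the query shape are our own placeholders, not necessarily the project's schema:

```python
from neo4j import GraphDatabase

# Assumed graph schema: (:Class {name}) nodes linked by [:EXTENDS] and
# [:DEPENDS_ON] relationships. Labels and credentials are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def structural_context(class_name: str) -> list[dict]:
    """Fetch superclasses and cross-file dependencies for a class."""
    query = (
        "MATCH (c:Class {name: $name}) "
        "OPTIONAL MATCH (c)-[:EXTENDS]->(parent:Class) "
        "OPTIONAL MATCH (c)-[:DEPENDS_ON]->(dep:Class) "
        "RETURN c.name AS class, collect(DISTINCT parent.name) AS parents, "
        "collect(DISTINCT dep.name) AS dependencies"
    )
    with driver.session() as session:
        return [record.data() for record in session.run(query, name=class_name)]
```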
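The multi-agent setting can be sketched with AutoGen's group-chat primitives, where a supervising manager routes turns between a retrieval agent and a summarization agent. The system messages, model configuration, and round limit below are illustrative, not the project's actual setup:

```python
from autogen import AssistantAgent, GroupChat, GroupChatManager, UserProxyAgent

# Illustrative multi-agent orchestration sketch; prompts and llm_config
# are placeholders, not the project's actual configuration.
llm_config = {"model": "gpt-4o"}

retriever = AssistantAgent(
    "retriever",
    system_message="Retrieve repository files and graph facts relevant to the question.",
    llm_config=llm_config,
)
summarizer = AssistantAgent(
    "summarizer",
    system_message="Summarize the retrieved context into a grounded answer.",
    llm_config=llm_config,
)
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

chat = GroupChat(agents=[user, retriever, summarizer], messages=[], max_round=6)
supervisor = GroupChatManager(groupchat=chat, llm_config=llm_config)

user.initiate_chat(supervisor, message="How do these two classes interact across files?")
```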
The results show that current LLMs achieve only moderate performance on repository-level QA. While augmentation improves results, especially when structural graph information is used, much of the strong baseline performance appears to come from memorization of public Stack Overflow answers rather than genuine reasoning over code repositories. Performance drops noticeably on questions posted after the models' training cutoff dates, highlighting the importance of cutoff-aware evaluation. Graph-based retrieval produced the strongest gains, improving performance beyond file-only retrieval, but overall accuracy remains limited for true repository-scale comprehension.
This project contributes both a new benchmark and empirical evidence about the limits of current LLMs for large-scale program comprehension. More broadly, it lays the foundation for future research on repository-aware reasoning, structured retrieval, and trustworthy AI tools for developers. Planned next steps include expanding beyond Java to other ecosystems, incorporating artifacts such as issues and documentation, and exploring richer structural signals such as call graphs, data flow, and visual representations for code understanding. 
Technologies
- Programming Language: Python
- LLMs: Claude 3.5 Sonnet, GPT-4o
- Frameworks: AutoGen
- Databases: Neo4j, ChromaDB, SQLite
- Datasets: Stack Overflow Data Dump, GitHub Repositories
Links
- Read the paper here
- Code and Dataset