Strategy for building context for large source code project

Does anyone know if there are initiatives to formalize an approach to using ChatGPT to help understand a large codebase that has poor documentation? For example, is there an open-source community focused on a systematic way to ingest a large codebase and generate a distilled representation of each source file, producing an efficient set of input tokens for building context to ask questions about a codebase the model has not been trained on?

Assuming the codebase is mostly text, you have a few options for providing a "large context":

  1. an external vector database (retrieval-augmented generation)
  2. depending on your budget, fine-tuning one of the GPT-4o family of models

I recommend starting small: try the smallest repo, or a well-modularized piece of the code, first. Also, out of curiosity, have you explored any of the AI-enabled IDEs such as Codeium/Windsurf or Cursor AI?
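To make option 1 concrete, here is a minimal sketch of the retrieval step: split each source file into chunks, score each chunk against the question, and paste the top few chunks into the prompt as context. This is an illustration, not a production pipeline; the `embed` function below is a stand-in bag-of-words vector, where a real setup would call an embedding model and store the vectors in a vector database.

```python
import math
import re
from collections import Counter

def chunk_file(text, max_lines=40):
    """Split a source file into fixed-size chunks of lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(text):
    """Stand-in 'embedding': a term-frequency bag of words.
    A real pipeline would call an embedding model here instead."""
    return Counter(re.findall(r"[a-zA-Z_]\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question, chunks, k=3):
    """Rank chunks by similarity to the question; the top k become
    the context block pasted into the model prompt."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

With chunks retrieved this way, the prompt to the model becomes "here are the most relevant pieces of the codebase, now answer the question" — which keeps the input token count bounded regardless of repo size.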