Does anyone know of initiatives to formalize an approach to using ChatGPT for understanding a large, poorly documented codebase? For example, is there an open source community working on a systematic way to ingest a large codebase and generate a distilled representation of each source file, producing an efficient set of input tokens for building a context to ask questions about a codebase the model has not been trained on?
Assuming the codebase is mostly text, you have a few options for providing "large context":
- an external vector database
- depending on your budget, you can fine-tune a GPT-4o family model
I recommend starting small, with your smallest repo or a well-modularized piece of code, to try this out. Also, out of curiosity: have you explored any of the AI-enabled IDEs, such as Codeium/Windsurf or Cursor AI?
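The vector-database option above can be sketched end to end. This is a minimal, self-contained illustration, not a production setup: it uses a toy bag-of-words cosine similarity in place of real embeddings (a real pipeline would use an embedding model plus a store like FAISS, Chroma, or pgvector), and the file paths and summaries are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word-frequency vector for a chunk of source text."""
    return Counter(re.findall(r"[a-zA-Z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TinyVectorStore:
    """Toy stand-in for a real vector DB (FAISS, Chroma, pgvector, ...)."""

    def __init__(self):
        self.chunks = []  # list of (source_path, text, vector)

    def add(self, path: str, text: str):
        self.chunks.append((path, text, tokenize(text)))

    def query(self, question: str, k: int = 2):
        """Return the k chunks most similar to the question."""
        qv = tokenize(question)
        ranked = sorted(self.chunks, key=lambda c: cosine(qv, c[2]), reverse=True)
        return [(path, text) for path, text, _ in ranked[:k]]


# Index a few "distilled" per-file summaries, then retrieve context for a question.
# The retrieved text would be pasted into the model's prompt as context.
store = TinyVectorStore()
store.add("auth/login.py", "handles user login, password hashing and session tokens")
store.add("billing/invoice.py", "generates monthly invoices and applies tax rules")

hits = store.query("where is password hashing done?", k=1)
```

The same retrieve-then-prompt loop is what most "chat with your codebase" tools implement; swapping the toy similarity for real embeddings is the main upgrade.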