Strategy for building context for large source code project

Does anyone know if there are initiatives to formalize an approach to using ChatGPT to help understand a large codebase that has poor documentation? For example, is there an open-source community focused on a systematic way to ingest a large codebase and generate a distilled representation of each source file, producing an efficient set of input tokens for building context to ask questions about a codebase the model has not been trained on?

Assuming the codebase is mostly text, you have a few options for providing a "large context":

  1. an external vector database (retrieval-augmented generation)
  2. depending on your budget, fine-tuning one of the GPT-4o family of models

I recommend starting small: try the smallest repo, or a well-modularized piece of the code, first. Also, out of curiosity, have you explored any of the AI-enabled IDEs such as Codeium/Windsurf or Cursor AI?
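To make option 1 concrete, here is a minimal sketch of the retrieval step: split each source file into chunks, score each chunk against the question, and paste the top few chunks into the prompt as context. This is an illustration, not a production pipeline; the `embed` function below is a stand-in bag-of-words vector, where a real setup would call an embedding model and store the vectors in a vector database.

```python
import math
import re
from collections import Counter

def chunk_file(text, max_lines=40):
    """Split a source file into fixed-size chunks of lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(text):
    """Stand-in 'embedding': a term-frequency bag of words.
    A real pipeline would call an embedding model here instead."""
    return Counter(re.findall(r"[a-zA-Z_]\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question, chunks, k=3):
    """Rank chunks by similarity to the question; the top k become
    the context block pasted into the model prompt."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

With chunks retrieved this way, the prompt to the model becomes "here are the most relevant pieces of the codebase, now answer the question" — which keeps the input token count bounded regardless of repo size.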