Build real-time index for codebase

CocoIndex provides built-in support for codebase chunking, using Tree-sitter to respect syntax boundaries. In this example, we will build a real-time index for a codebase using CocoIndex.

If this is helpful, we would appreciate a star ⭐ on the CocoIndex GitHub repository.

Build embedding index for codebase

Tree-sitter is a parser generator tool and an incremental parsing library. It is available in Rust 🦀 (on GitHub). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages. Check out the list of supported languages in the language section.

Tutorials

  • Step-by-step tutorial - Check out the blog.
  • Video tutorial - YouTube.

Steps

Indexing Flow

  1. Ingest the CocoIndex codebase.
  2. For each file, split it into chunks with Tree-sitter, then embed each chunk.
  3. Save the embeddings and the metadata in Postgres with PGVector.
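
To make the idea of chunking at syntax boundaries concrete, here is a minimal sketch. It is not CocoIndex's implementation: Python's built-in `ast` module stands in for Tree-sitter, it only handles Python source, and the `chunk_by_syntax` helper and `SAMPLE` text are hypothetical names made up for illustration. The point is that each chunk ends where a top-level definition ends, rather than at an arbitrary character count.

```python
import ast


def chunk_by_syntax(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level statement.

    Assumption: `ast` is used here as a stand-in for Tree-sitter, so this
    sketch only illustrates the idea for Python files.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Start at the earliest decorator, if any, so the chunk is complete.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        chunks.append(
            {
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1 : end]),
            }
        )
    return chunks


# Hypothetical input file content.
SAMPLE = '''import os

def greet(name):
    return "hello " + name

class Counter:
    def __init__(self):
        self.n = 0
'''

chunks = chunk_by_syntax(SAMPLE)
for c in chunks:
    print(c["kind"], c["start_line"], c["end_line"])
```

Each chunk keeps its line range as metadata, which is the kind of information that would be stored next to the embedding in step 3.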

Query

We match user-provided text against the index with a SQL query, reusing the embedding operation from the indexing flow.
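
The sketch below illustrates why the same embedding operation is reused for indexing and querying: both sides must land in the same vector space for similarity to be meaningful. Everything here is an assumption made for illustration. The `embed` function is a toy token-count vector rather than a real embedding model, the `CHUNKS` list is invented, and the ranking runs in memory instead of in Postgres (where pgvector's `<=>` cosine-distance operator would do the comparison in SQL).

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse token-count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query: str, top_k: int = 3):
    """Embed the query with the same function used at indexing time."""
    q = embed(query)
    scored = [(cosine(q, emb), text) for text, emb in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical code chunks standing in for rows in the index table.
CHUNKS = [
    "def parse_config(path): return json.load(open(path))",
    "def send_email(to, subject, body): smtp.sendmail(to, subject, body)",
    "class VectorIndex: def add(self, embedding): self.rows.append(embedding)",
]

# Indexing: embed every chunk once and keep the vector next to the text.
index = [(text, embed(text)) for text in CHUNKS]

# Querying: the user text goes through the same embed() before ranking.
results = search(index, "send email subject", top_k=1)
print(results[0][1])
```

If the query were embedded with a different function than the chunks, the scores would no longer be comparable, which is why the flow shares one embedding definition.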

Prerequisite

Install Postgres if you don't already have it.

Run

  • Install dependencies:

    pip install -e .
  • Update index:

    cocoindex update main
  • Run:

    python main.py

CocoInsight

I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:

cocoindex server -ci main

Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.

Chunking Visualization