Backend data storage for reference content #102

As part of #82, we need some form of data storage for parsed reference content, particularly the chunked document text. Since #83, we no longer store the chunked reference text, so we'll need to add it back in, with or without embeddings.

We need to be able to query the chunked text to retrieve the chunks most likely to be related to a given chat question (regardless of whether we use embeddings or something like BM25).
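To make the non-embedding path concrete, here is a minimal sketch of BM25-style retrieval over chunk text using only the standard library. The function names (`tokenize`, `bm25_top_k`), the tokenizer, and the `k1`/`b` defaults are assumptions for illustration, not part of the current codebase.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(question: str, chunks: list[str], k: int = 3,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return up to k (chunk_index, score) pairs, best match first."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n
    # Document frequency: number of chunks containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(question))

    scored = []
    for index, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, index))

    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored[:k] if score > 0]
```

Chunks that share no terms with the question are dropped rather than returned with a zero score. An embeddings-based retriever could expose the same `(question, chunks) -> ranked indices` shape, so the storage decision stays independent of the ranking method.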

Some options and notes:
1. Structured text file on disk (e.g. JSON) - Because the sidecar is not a long-running process, this file would need to be read from disk on each `chat` call. That doesn't feel great, though it might actually perform fine.
2. [LanceDB](https://github.com/lancedb/lancedb) - We used LanceDB in an [initial prototype](https://github.com/gjreda/scratch-pdf-bot) and it worked well. It was easy to get started with, performed well, and integrated nicely with the Python data ecosystem. However, LanceDB currently cannot delete or update individual records, and it is a fairly new project under very active development, so its API is likely to evolve quickly. It is focused on [vector](https://lancedb.github.io/lancedb/embedding/) and [full-text search](https://lancedb.github.io/lancedb/fts/), and it does have a TypeScript client.
3. [SQLite + VSS](https://github.com/asg017/sqlite-vss) - I can't say much about VSS as I'm not familiar with it, but SQLite on its own would be more than sufficient for our needs.
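As a rough illustration of the SQLite option without VSS, the sketch below stores chunks with Python's built-in `sqlite3` module and supports the per-reference update and delete noted as missing from LanceDB. The table layout, column names, and helper functions (`open_store`, `replace_reference`, `delete_reference`, `load_chunks`) are hypothetical, and embeddings are left out entirely.

```python
import sqlite3

# Hypothetical schema: one row per chunk, keyed by reference and position.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    position INTEGER NOT NULL,
    text TEXT NOT NULL,
    UNIQUE (reference_id, position)
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Cheap to call on every sidecar invocation; creates the table if needed.
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def replace_reference(conn: sqlite3.Connection, reference_id: str,
                      chunk_texts: list[str]) -> None:
    # Re-ingesting a reference swaps out its chunks in one transaction.
    with conn:
        conn.execute("DELETE FROM chunks WHERE reference_id = ?", (reference_id,))
        conn.executemany(
            "INSERT INTO chunks (reference_id, position, text) VALUES (?, ?, ?)",
            [(reference_id, i, text) for i, text in enumerate(chunk_texts)],
        )


def delete_reference(conn: sqlite3.Connection, reference_id: str) -> None:
    with conn:
        conn.execute("DELETE FROM chunks WHERE reference_id = ?", (reference_id,))


def load_chunks(conn: sqlite3.Connection) -> list[tuple[str, int, str]]:
    # Returns (reference_id, position, text) rows in a stable order.
    query = ("SELECT reference_id, position, text FROM chunks "
             "ORDER BY reference_id, position")
    return conn.execute(query).fetchall()
```

Because the sidecar is short-lived, each `chat` call would open the database file, load or query the chunks, and exit. Whether that is fast enough in practice is something to measure rather than assume.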
