Services Overview (Retriever and Vectorization)

A) Retriever service:
This service acts as a document retrieval and question-answering system that:
1. Takes a user query
2. Searches through indexed documents in Pinecone
3. Uses GPT-4 to generate answers based on relevant document chunks
4. Returns structured results with source information

Components and functionality:
1. Imports and Dependencies (key libraries used):
• pinecone: For vector database operations
• LangChain: For working with LLMs and document processing
• FastAPI: For creating the web API
• pandas: For data manipulation
2. Main Components:
a) Helper Function:
def extract_unique_sources(query_result):
• This function extracts unique document IDs from Pinecone query results
• It processes the matches from the query result and collects the unique document IDs
b) API Setup:
• Uses the FastAPI framework
• Creates a router with the prefix "/api/v1"
• Exposes two endpoints (a minimal sketch of the helper and the router setup follows this list):
  o /api/v1/steps/retriever (POST)
  o /api/v1/health (GET)
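A minimal sketch of the helper function and the router setup, assuming a dict-like Pinecone query result whose matches carry a documentId metadata key (that key name, and the handler signatures, are assumptions):

from fastapi import APIRouter

def extract_unique_sources(query_result):
    # Collect unique document IDs from the matches of a Pinecone query result.
    # Assumes each match exposes a metadata dict with a "documentId" key.
    unique_ids = set()
    for match in query_result.get("matches", []):
        doc_id = match.get("metadata", {}).get("documentId")
        if doc_id:
            unique_ids.add(doc_id)
    return list(unique_ids)

router = APIRouter(prefix="/api/v1")

@router.post("/steps/retriever")
def retriever(payload: dict):
    ...  # core retrieval flow, described in sections 3-5 below

@router.get("/health")
def health():
    return {"status": "ok"}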
3. Main Retriever Endpoint (/api/v1/steps/retriever):
This endpoint provides the core functionality. It:
• Accepts POST requests with a JSON body containing (an example request model is sketched below):
  o query: the user's question
  o restrictToDocumentIds: a list of specific document IDs to restrict the search to
  o documentType: type of document to filter on
  o documentcategory: category of document to filter on
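A hedged sketch of the request body as a Pydantic model; the field names follow the list above, while their types, optionality, and the example values are assumptions:

from typing import List, Optional
from pydantic import BaseModel

class RetrieverRequest(BaseModel):
    query: str
    restrictToDocumentIds: Optional[List[str]] = None
    documentType: Optional[str] = None
    documentcategory: Optional[str] = None

# Illustrative request (placeholder values)
example_request = RetrieverRequest(
    query="What is the termination notice period?",
    restrictToDocumentIds=["doc-123"],
    documentType="contract",
    documentcategory="legal",
)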
4. Configuration:
• Uses environment variables for the various credentials (a minimal sketch follows this list):
  o OpenAI/Azure credentials (API base, key, version, type)
  o Pinecone credentials (API key, environment, index name)
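A minimal sketch of how these credentials might be read from the environment; the exact variable names are assumptions based on the list above:

import os

# Azure/OpenAI credentials (variable names are illustrative)
OPENAI_API_BASE = os.getenv("OPENAI_API_BASE")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_API_VERSION = os.getenv("OPENAI_API_VERSION")
OPENAI_API_TYPE = os.getenv("OPENAI_API_TYPE", "azure")

# Pinecone credentials (variable names are illustrative)
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT")
PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME")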
5. Core Processing Flow (a condensed sketch appears after this list):
a) Setup:
• Initializes the Azure OpenAI LLM (GPT-4)
• Sets up OpenAI embeddings (using the Ada model)
• Initializes the Pinecone connection
b) Document Retrieval:
• Queries the Pinecone index with filters based on:
  o document type
  o document category
  o specific document IDs
• Retrieves up to 1000 matches (configurable)
c) Processing:
For each unique document:
• Creates a RetrievalQA chain
• Runs the user's query through the chain
• Stores results in a pandas DataFrame with the columns:
  o documentId
  o documentCategory
  o documentType
  o fileName
  o result
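A condensed sketch of this flow, assuming the classic LangChain and legacy pinecone-client APIs. The deployment names, the metadata/filter keys, the .to_dict() conversion, and the wiring to the environment variables (section 4) and the extract_unique_sources helper (section 2) are assumptions:

import pinecone
import pandas as pd
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone as PineconeStore
from langchain.chains import RetrievalQA

def run_retriever(query, restrictToDocumentIds, documentType, documentcategory):
    # a) Setup: GPT-4 LLM, Ada embeddings, Pinecone connection
    llm = AzureChatOpenAI(deployment_name="gpt-4", openai_api_version=OPENAI_API_VERSION)
    embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1)
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
    index = pinecone.Index(PINECONE_INDEX_NAME)

    # b) Document retrieval: filtered query, up to 1000 matches (configurable)
    filters = {}
    if documentType:
        filters["documentType"] = documentType
    if documentcategory:
        filters["documentCategory"] = documentcategory
    if restrictToDocumentIds:
        filters["documentId"] = {"$in": restrictToDocumentIds}
    query_result = index.query(
        vector=embeddings.embed_query(query),
        top_k=1000,
        filter=filters or None,
        include_metadata=True,
    ).to_dict()  # dict form expected by extract_unique_sources (assumption)

    # c) Processing: one RetrievalQA run per unique document
    rows = []
    store = PineconeStore.from_existing_index(PINECONE_INDEX_NAME, embeddings)
    for doc_id in extract_unique_sources(query_result):
        retriever = store.as_retriever(search_kwargs={"filter": {"documentId": doc_id}})
        chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
        rows.append({
            "documentId": doc_id,
            "documentCategory": documentcategory,
            "documentType": documentType,
            "fileName": None,  # taken from chunk metadata in the real service
            "result": chain.run(query),
        })
    return pd.DataFrame(rows, columns=["documentId", "documentCategory", "documentType", "fileName", "result"])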
6. Output:
Returns a JSON response containing (an illustrative response shape is shown below):
• results: a list of processed documents and their answers
• errors: any internal errors that occurred
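An illustrative response shape, expressed as the Python dict the endpoint would serialize (all values are placeholders):

response = {
    "results": [
        {
            "documentId": "doc-123",
            "documentCategory": "legal",
            "documentType": "contract",
            "fileName": "contract.html",
            "result": "generated answer text",
        }
    ],
    "errors": [],
}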
7. Error Handling:
• Includes basic error handling for value errors (a minimal sketch follows this list)
• Returns appropriate HTTP status codes (200 for success, 400 for errors)
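A minimal sketch of this pattern, assuming FastAPI's JSONResponse and a hypothetical process_retriever_request helper that wraps the flow sketched in section 5:

from fastapi import APIRouter
from fastapi.responses import JSONResponse

router = APIRouter(prefix="/api/v1")

@router.post("/steps/retriever")
def retriever(payload: dict):
    try:
        # process_retriever_request is a hypothetical wrapper around the flow in section 5
        results, errors = process_retriever_request(payload)
        return JSONResponse(status_code=200, content={"results": results, "errors": errors})
    except ValueError as exc:
        return JSONResponse(status_code=400, content={"error": str(exc)})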

B) Vectorization service:
• Responsible for processing documents and converting them into vector embeddings for efficient retrieval.
• The main business purpose of this service is to:
1. Take HTML documents
2. Process them into searchable chunks
3. Convert these chunks into vector embeddings
4. Store them in a vector database for efficient semantic search
• This service works in conjunction with the retriever service:
  o This service (vectorization) prepares and stores the documents.
  o The retriever service uses these stored vectors to find relevant information when answering questions.

Components and functionality:
1. Service Overview:
This is a FastAPI-based service that converts documents (specifically
HTML documents) into vector embeddings and stores them in
Pinecone (a vector database). It has two main endpoints for
vectorization.
2. Main Components:
a) Setup and Configuration:
• Uses the FastAPI framework
• Loads environment variables from a .env file
• Sets up logging
• Exposes three endpoints (a minimal setup sketch follows this list):
  o /api/v1/steps/vectorize-new (POST)
  o /api/v1/steps/vectorize (POST)
  o /api/v1/health (GET)
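A minimal sketch of this setup (handler bodies are omitted; the logger name and handler signatures are illustrative):

import logging
from dotenv import load_dotenv
from fastapi import FastAPI, APIRouter

load_dotenv()  # load environment variables from the .env file
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("vectorization")

app = FastAPI()
router = APIRouter(prefix="/api/v1")

@router.post("/steps/vectorize-new")
def vectorize_new(payload: dict):
    ...  # structured vectorization flow (section 3)

@router.post("/steps/vectorize")
def vectorize(payload: dict):
    ...  # legacy vectorization flow (section 4)

@router.get("/health")
def health():
    return {"status": "ok"}

app.include_router(router)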
3. First Vectorization Endpoint (/api/v1/steps/vectorize-new):
This is the newer version of the vectorization endpoint and uses a more structured approach (an example request body is shown after this list):
a) Input Processing:
• Accepts JSON with two main sections:
  o inputs: contains the file path, document ID, type, and category
  o config: contains configuration for embeddings and processing
b) Processing Flow:
1. Validates credentials (OpenAI and Pinecone)
2. Sets up Azure OpenAI embeddings
3. Loads the HTML file from blob storage
4. Uses a Vectorizer class to:
• Load and process chunks
• Augment vectors with metadata
• Save vectors to the database
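An illustrative request body for this endpoint, expressed as a Python dict; the key names under inputs and config are assumptions based on the description above, and all values are placeholders:

payload = {
    "inputs": {
        "pathToInputFile": "documents/handbook.html",  # assumed key name, placeholder path
        "documentId": "doc-123",
        "documentType": "policy",
        "documentcategory": "hr",
    },
    "config": {
        "embeddingModel": "text-embedding-ada-002",  # assumed configuration keys
        "chunkSize": 1000,
        "chunkOverlap": 100,
    },
}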
4. Second Vectorization Endpoint (/api/v1/steps/vectorize):
This is the legacy version that handles the vectorization process directly (a condensed sketch follows this list):
a) Input Processing: accepts a simpler JSON body with:
• instanceId
• pathToInputFile
• documentType
• documentId
• documentcategory
b) Processing Flow:
1. Validates credentials
2. Loads the HTML file from blob storage
3. Processes the document:
• Uses UnstructuredHTMLLoader to load the HTML
• Splits the text into chunks using RecursiveCharacterTextSplitter
• Adds metadata to each chunk
• Stores the vectors in Pinecone
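A condensed sketch of this legacy flow, assuming the classic LangChain loader/splitter APIs, the legacy pinecone-client, and azure-storage-blob for the download step; the connection-string variable, the container/blob split of pathToInputFile, the id scheme, and the metadata key names are assumptions:

import os
import tempfile
import pinecone
from azure.storage.blob import BlobServiceClient
from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

def vectorize_document(pathToInputFile, documentId, documentType, documentcategory):
    # 1. Download the HTML file from Azure Blob Storage to a temporary local path
    #    (assumes pathToInputFile is "<container>/<blob name>")
    blob_service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    container, _, blob_name = pathToInputFile.partition("/")
    blob = blob_service.get_blob_client(container=container, blob=blob_name)
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(blob_name))
    with open(local_path, "wb") as f:
        f.write(blob.download_blob().readall())

    # 2. Load the HTML and split it into chunks
    docs = UnstructuredHTMLLoader(local_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)

    # 3. Embed each chunk and upsert it into Pinecone with its metadata
    embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1)
    pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment=os.environ["PINECONE_ENVIRONMENT"])
    index = pinecone.Index(os.environ["PINECONE_INDEX_NAME"])
    vectors = []
    for i, chunk in enumerate(chunks):
        metadata = {
            "documentId": documentId,
            "documentType": documentType,
            "documentCategory": documentcategory,
            "text": chunk.page_content,
        }
        vectors.append((f"{documentId}-{i}", embeddings.embed_query(chunk.page_content), metadata))
    index.upsert(vectors=vectors)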
5. Key Business Logic:
a) Document Processing:
• Documents are loaded from Azure Blob Storage
• HTML content is parsed and split into manageable chunks
• Each chunk is converted into a vector embedding
• Metadata (document type, ID, category) is attached to each chunk
b) Vector Storage:
• Uses Pinecone as the vector database
• Stores vectors with associated metadata for later retrieval (an illustrative stored record is shown below)
• Uses Azure OpenAI's embeddings model (Ada)
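For illustration, a single stored record in the legacy-client upsert format, pairing the chunk embedding with its metadata (the id scheme and the concrete values are assumptions):

# One (id, values, metadata) tuple per chunk, as accepted by index.upsert() in the legacy client
example_record = (
    "doc-123-0",             # <documentId>-<chunk index> (assumed id scheme)
    [0.012, -0.034, 0.088],  # truncated stand-in for the 1536-dimension Ada embedding
    {
        "documentId": "doc-123",
        "documentType": "policy",
        "documentCategory": "hr",
        "text": "first chunk of the HTML document...",
    },
)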
6. Error Handling:
Comprehensive error handling covers:
• Missing credentials
• File loading issues
• Processing errors
• Database operations
The service returns appropriate HTTP status codes and error messages (a minimal credential-check sketch follows).
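A minimal sketch of the credential check and error mapping, assuming FastAPI's HTTPException; the variable names, the specific status codes, and the reuse of the vectorize_document sketch from section 4 are assumptions:

import os
from fastapi import APIRouter, HTTPException

router = APIRouter(prefix="/api/v1")

def validate_credentials():
    # Fail fast when required credentials are missing (variable names are illustrative)
    required = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "PINECONE_INDEX_NAME"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise HTTPException(status_code=400, detail=f"Missing credentials: {', '.join(missing)}")

@router.post("/steps/vectorize")
def vectorize(payload: dict):
    validate_credentials()
    try:
        # vectorize_document is the legacy flow sketched in section 4
        vectorize_document(
            payload["pathToInputFile"],
            payload["documentId"],
            payload["documentType"],
            payload["documentcategory"],
        )
        return {"status": "success"}
    except Exception as exc:  # file loading, processing, or database errors
        raise HTTPException(status_code=500, detail=str(exc))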
7. Integration Points:
• Azure Blob Storage: For document storage
• Azure OpenAI: For generating embeddings
• Pinecone: For vector storage
• FastAPI: For API endpoints
