A semantic search application for YouTube videos based on captions.
- Process YouTube videos and extract captions
- Create semantic embeddings for video content
- Search for specific topics within videos using natural language
- Web UI with Gradio
- Command-line interface with Typer
This project demonstrates integration between:
- MariaDB for data storage (specifically a `VECTOR`-type column for each video chunk)
- Python for backend processing: data extraction, data cleanup, running ML models, and building user interfaces:
  - `transformers` library for running HuggingFace models locally
    - Used for the punctuation and embedding models
  - `youtube-transcript-api` for caption retrieval
  - `typer` for the CLI and `gradio` for the Web UI
  - NLTK for processing YouTube caption snippets (splitting them into sentences)
The main use case tested so far is semantic search over YouTube videos about Python (from different PyCon conferences). The primary data source for the search is auto-generated YouTube captions.
The system is implemented as a set of well-defined components:
See the main business logic for processing new videos under `app/services/video_processing.py`
- Collect captions from a specific subset of YouTube videos (Python-related conferences like PyCon)
  - See `app/youtube/fetcher`
  - To avoid hitting YouTube rate limits, `youtube-transcript-api` caches captions as pickle objects locally under the `data/` folder (git-ignored) and reuses them on subsequent runs
  - As an example, a small curated list of Python YouTube videos for database population can be found at `app/youtube/youtube_videos.json`
- Prepare the captions for future embedding:
  - See `app/youtube/transform` and `app/chunking`
  - Restore missing punctuation in the caption snippets using a dedicated punctuation model (`oliverguhr/fullstop-punctuation-multilang-large`)
  - After adding punctuation, split the text into sentences
  - Chunk the data by a configurable number of tokens per chunk by combining full sentences, so that only contextually related data (whole sentences) is grouped together (see the chunking sketch after this list)
  - Each final chunk keeps track of the video [start-timestamp; end-timestamp] interval it belongs to
- Run the embedding model on the chunks to get vector representations
  - See `app/embedding` for more implementation details
  - By default, the `all-MiniLM-L6-v2` model is used for vector embedding
- Insert each vectorized video chunk into the MariaDB Vector store alongside its context (video information, start and end timestamps, etc.) (see the storage sketch after this list)
  - See `app/storage` for more implementation details
  - Currently the Euclidean distance metric is used; the Cosine distance metric showed similar performance
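To make the chunking idea concrete, here is a minimal sketch of grouping already-punctuated, timestamped sentences into token-capped chunks. All class, function, and parameter names are illustrative, not the actual `app/chunking` implementation, and the punctuation-restoration step plus the mapping of NLTK-split sentences back to caption timestamps are omitted for brevity.

```python
# Illustrative sketch only: combine whole sentences into chunks of at most max_tokens tokens,
# keeping the [start; end] timestamp interval of each chunk. Names are hypothetical.
from dataclasses import dataclass

from transformers import AutoTokenizer


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the beginning of the video
    end: float


@dataclass
class Chunk:
    text: str
    start: float
    end: float


def build_chunks(
    sentences: list[Sentence],
    max_tokens: int = 256,
    tokenizer_name: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> list[Chunk]:
    """Group full sentences into chunks so that no chunk exceeds max_tokens tokens."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    chunks: list[Chunk] = []
    buffer: list[Sentence] = []
    used_tokens = 0

    def flush() -> None:
        nonlocal buffer, used_tokens
        if buffer:
            chunks.append(
                Chunk(" ".join(s.text for s in buffer), buffer[0].start, buffer[-1].end)
            )
        buffer, used_tokens = [], 0

    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence.text))
        if buffer and used_tokens + n_tokens > max_tokens:
            flush()  # never split a sentence: close the current chunk and start a new one
        buffer.append(sentence)
        used_tokens += n_tokens
    flush()
    return chunks
```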
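And a hedged storage sketch, assuming MariaDB 11.7's `VECTOR` type with the `VEC_FromText()` function. The table and column names are hypothetical, not the real `app/storage` schema, and for brevity the embedding is produced with the `sentence-transformers` wrapper, whereas the project itself runs the model through the `transformers` library.

```python
# Hypothetical storage sketch; table/column names are illustrative, not the real app/storage schema.
import mariadb
from sentence_transformers import SentenceTransformer

SCHEMA = """
CREATE TABLE IF NOT EXISTS video_chunks (
    id INT PRIMARY KEY AUTO_INCREMENT,
    video_id VARCHAR(16) NOT NULL,
    start_s FLOAT NOT NULL,
    end_s FLOAT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding VECTOR(384) NOT NULL,  -- all-MiniLM-L6-v2 produces 384-dimensional vectors
    VECTOR INDEX (embedding)
)
"""


def store_chunk(conn, model: SentenceTransformer, video_id: str,
                chunk_text: str, start_s: float, end_s: float) -> None:
    """Embed one chunk and insert it together with its video context."""
    vector = model.encode(chunk_text)  # numpy array with 384 floats
    vector_text = "[" + ",".join(str(x) for x in vector.tolist()) + "]"
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO video_chunks (video_id, start_s, end_s, chunk_text, embedding) "
        "VALUES (?, ?, ?, ?, VEC_FromText(?))",
        (video_id, start_s, end_s, chunk_text, vector_text),
    )
    conn.commit()


if __name__ == "__main__":
    # Example wiring using the documented default connection settings.
    conn = mariadb.connect(user="app_user", password="Password123!", host="127.0.0.1",
                           port=3306, database="semantic_search")
    conn.cursor().execute(SCHEMA)
    store_chunk(conn, SentenceTransformer("all-MiniLM-L6-v2"),
                "VIDEO_ID", "some chunk text", 12.0, 47.5)
```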
See the main business logic for processing search queries against embedded YouTube video captions under `app/services/search`.
- Vectorize the natural language query using the same embedding model used when populating the data
- Find the N nearest neighbors by Euclidean distance between the query vector and the stored chunk vectors (as sketched below)
- Output the N closest video chunks for the provided query
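A hedged sketch of such a nearest-neighbour query, reusing the hypothetical `video_chunks` table from the storage sketch above and MariaDB's `VEC_DISTANCE_EUCLIDEAN()` function; the project's actual query logic lives under `app/services/search`.

```python
# Hypothetical search sketch: names reuse the illustrative video_chunks table from the
# storage sketch above, not the project's real schema.
from sentence_transformers import SentenceTransformer


def search_chunks(conn, model: SentenceTransformer, query: str, limit: int = 5) -> list[tuple]:
    """Return the `limit` chunks whose embeddings are closest to the query embedding."""
    vector = model.encode(query)  # same embedding model as at population time
    vector_text = "[" + ",".join(str(x) for x in vector.tolist()) + "]"
    cursor = conn.cursor()
    cursor.execute(
        "SELECT video_id, start_s, end_s, chunk_text, "
        "       VEC_DISTANCE_EUCLIDEAN(embedding, VEC_FromText(?)) AS distance "
        "FROM video_chunks "
        "ORDER BY distance "
        "LIMIT ?",
        (vector_text, limit),
    )
    return cursor.fetchall()
```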
See `app/services/crud.py` for the business logic for working with data in the MariaDB database.
- Command Line Tool to populate / query / manage data
- Web Frontend to query and view data with working links to YouTube videos
- Python 3.13+
- MariaDB 11.7+
- Docker and Docker Compose (optional, for containerized setup)
- Build and start the application with Docker Compose: `make up`
- The application will be available at:
- Install Python 3.13+
- Install uv for dependency management
- Install MariaDB Connector/C
- Have MariaDB running (e.g., from the existing docker-compose.yml)
- Install the required dependencies: `uv sync`
- Configure environment variables if needed (see the Configuration section)
- Run the application: `uv run -m app.cli`
The application can be configured using environment variables:
| Variable | Description | Default |
|---|---|---|
| DB_USER | Database username | app_user |
| DB_PASSWORD | Database password | Password123! |
| DB_HOST | Database hostname | 127.0.0.1 |
| DB_PORT | Database port | 3306 |
| DB_NAME | Database name | semantic_search |
| EMBEDDING_MODEL | Hugging Face model for embeddings | all-MiniLM-L6-v2 |
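For illustration, a minimal settings helper could read these variables with the documented defaults; the project's own configuration code may be organized differently.

```python
# Minimal sketch of reading the documented environment variables with their defaults;
# the project's actual configuration module may differ.
import os

DB_CONFIG = {
    "user": os.getenv("DB_USER", "app_user"),
    "password": os.getenv("DB_PASSWORD", "Password123!"),
    "host": os.getenv("DB_HOST", "127.0.0.1"),
    "port": int(os.getenv("DB_PORT", "3306")),
    "database": os.getenv("DB_NAME", "semantic_search"),
}
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
```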
The application provides a CLI for common operations:
```bash
# Populate the database with default videos
uv run -m app.cli video populate

# Search videos
uv run -m app.cli video search "your search query"

# List all videos
uv run -m app.cli video list

# Add a new video
uv run -m app.cli video create --id YOUTUBE_VIDEO_ID --title "Video Title" --metadata "{}"
```

The Gradio web UI provides a user-friendly interface with:
- Video list with pagination
- Search functionality
- Result visualization
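For illustration only, a minimal Gradio wiring for such a search box might look roughly like this; the `search_videos` placeholder stands in for the project's real search service and is not part of the codebase.

```python
# Illustrative-only Gradio wiring; search_videos is a placeholder, not the project's service.
import gradio as gr


def search_videos(query: str, limit: int) -> str:
    # Placeholder results: (video_id, start_seconds, chunk_text). The real app would call
    # the search service and build links from each chunk's video id and start timestamp.
    results = [("VIDEO_ID", 42, f"Example chunk matching: {query}")]
    return "\n\n".join(
        f"https://www.youtube.com/watch?v={vid}&t={int(start)}s\n{text}"
        for vid, start, text in results[: int(limit)]
    )


demo = gr.Interface(
    fn=search_videos,
    inputs=[gr.Textbox(label="Search query"), gr.Slider(1, 20, value=5, step=1, label="Results")],
    outputs=gr.Textbox(label="Matching video chunks"),
)

if __name__ == "__main__":
    demo.launch()
```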
This project uses a modular architecture with clear separation of concerns:
- `app/services`: Business logic
- `app/storage`: Data persistence (using MariaDB)
- `app/youtube`: YouTube data fetching and processing
- `app/chunking`: Text chunking for embeddings
- `app/embedding`: Vector embedding generation
This project uses pytest for testing. Tests are organized to match the application structure:
- `tests/services`: Tests for business logic services with protocol stubs
- Additional test directories will be added as needed
You can run tests using the included make commands:
```bash
# Run all tests
make test

# Run linter or auto fixes for linting
make lint
make lint-fix
```

Tests are designed to use protocol stubs, leveraging Python's duck typing and Protocol classes to ensure correct interfaces without needing actual implementations during testing.
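As a concrete illustration of that approach (class and method names below are made up for the example, not taken from the codebase), a service depending on a storage interface can be tested against an in-memory stub that merely satisfies the same `Protocol`:

```python
# Illustrative protocol-stub test; names are hypothetical, not the project's actual classes.
from typing import Protocol


class VideoStorage(Protocol):
    def list_videos(self) -> list[str]: ...


class VideoService:
    def __init__(self, storage: VideoStorage) -> None:
        self.storage = storage

    def count_videos(self) -> int:
        return len(self.storage.list_videos())


class StubStorage:
    """Duck-typed stand-in for the real MariaDB-backed storage."""

    def list_videos(self) -> list[str]:
        return ["video-1", "video-2"]


def test_count_videos_uses_storage() -> None:
    service = VideoService(storage=StubStorage())
    assert service.count_videos() == 2
```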
- Public HTTP API using FastAPI
- Scripts for automatically extracting YouTube videos for specific topics (like PyCon conferences in this case)
- More UI features
- More experiments with different embedding models and chunking techniques
- Experiment with using Python ORMs for working with MariaDB (`SQLModel`, `SQLAlchemy`) instead of raw queries, and `alembic` for database migrations
This project is licensed under the MIT License, a short and permissive license: it lets people do almost anything they want with the project, including making and distributing closed-source versions, as long as they provide attribution and don't hold the author liable.
Copyright (c) 2025 Dmytro Abramov
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software.
