A point-in-time overview of my AI infrastructure, tools, and services for development, deployment, and production workflows.
Note: This is not prescriptive. It's a snapshot of what I, as one person, find valuable, and why. Everybody is different. I update this roughly once a year, though the pace of evolution in this space is rapid enough that monthly updates would make sense.
| Component | Primary Choice | Good Recs | Favorite Feature |
|---|---|---|---|
| API Gateway | OpenRouter | | Unified API access |
| Cloud | GCP | | Primary provider |
| Generative AI | Replicate | Fal | Model exploration |
| LLM Interface | Claude Code | | CLI-based development |
| Local AI | Ollama | | Batch processing |
| Prototyping | AI Studio | | Development environment |
| RAG | Supermemory | Ragie | No DIY complexity |
| Research | NotebookLM | | Knowledge synthesis |
| STT | AssemblyAI | Whisper | Speaker diarization |
| TTS | OpenAI | ElevenLabs | Cost-effective |
| UI | ChatGPT | | Day-to-day interface |
- Core Infrastructure
- Voice Applications
- Generative AI
- Data & Retrieval
- Development
- Cloud & Deployment
- API Services
- Local AI
- Automation
- Google Ecosystem
- Honorable Mentions
- Cost Breakdown
- Philosophy
- Links & Resources
Once or twice a year, usually around winter, I organize all my GitHub repositories - consolidating, deleting old ones, creating new ones. Part of this process is documenting my AI stack.
Friends and clients sometimes ask: "With so much on the market, what are you using?" This document captures:
- What I'm currently using for different tasks
- Why I've picked up or dropped certain tools
- What these things actually cost
The AI space moves quickly enough that these notes could be out of date in months, but I find it useful to document where things stand.
The backbone of the entire AI stack, powering nearly every component and workflow.
Primary Interfaces:
- CLI: Claude Code (Pro: $17/month | Max: $100/month)
- GUI: Claude interface (included with Claude Code subscription)
- ChatGPT: Day-to-day interface for general tasks
- OpenAI: API access and integrations
Why I Use Claude Code:
After 20 years of using Linux, I'm comfortable with CLIs, though I've generally preferred GUIs when they're available. Claude Code has changed how I approach system management.
When I started using it (along with Gemini CLI and Codex), I wondered why there wasn't more focus on using these tools for local computer management. There was a project called Open Interpreter that had potential, but Claude Code covers more ground.
What It Helps With:
Linux has gotten easier over time, but there's always been a list of things that don't quite work - bugs, things I wish I could do. That list is now much shorter. When I run into issues, I can just say "Hey Claude, any chance you can figure out why..." or "create a shortcut for turning the screens on and off" - and it handles most of these.
Use Cases:
- Local system administration
- Home lab management (didn't intend to get into home labbing, but here we are)
- Remote server fixes
- Solving Linux issues that used to linger
Note on Self-Hosted Interfaces: Previously used Open WebUI (self-hosted chat interface). Before becoming a father and getting busy, I reached the conclusion that there wasn't a compelling reason to deal with the hassle of things breaking and needing fixes. Simplicity and reliability won out over self-hosting.
API Access:
- OpenRouter: Unified API gateway for multiple LLM providers
- Direct vendor APIs also available from all major providers
- Can be used with CLI tools, web apps, desktop apps, custom integrations
Why I Use OpenRouter:
I'm not always sure whether you get the same quality of inference as when using the vendor directly. For Claude specifically, it seems like better performance comes from Anthropic rather than routing through OpenRouter. So there's still a case for first-party access.
That said, for someone looking to go beyond ChatGPT, OpenRouter is worth considering early.
Two Main Reasons:
1. Expense Consolidation
   - When you run a business and need to provide expenses to your accountant, one API bill is easier than fragmented charges
   - It's easy to lose track when spending $20 here and there on different APIs
   - With OpenRouter, you set spend limits in one place
2. Model Exploration
   - Try different models without separate accounts
   - Example: I ran an evaluation called "AI Brevity" (looking at which models actually follow conciseness instructions)
   - Ran the same prompt through 10 models using OpenRouter and a script (see the sketch below)
   - Convenient because you just change the model parameter
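As a hedged sketch of what that sweep looks like (OpenRouter exposes an OpenAI-compatible endpoint; the model IDs below are illustrative, not the exact ten I used):

```python
# Sweep one prompt across several models via OpenRouter's
# OpenAI-compatible endpoint. Model IDs are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

MODELS = [
    "anthropic/claude-3.5-haiku",
    "microsoft/phi-4",
    "meta-llama/llama-3.1-8b-instruct",
]
PROMPT = "Rewrite this in two sentences, no preamble: ..."

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```

The whole evaluation is the loop: nothing changes between runs except the `model` string.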
What's Beyond the Big Names:
There's more than OpenAI, Anthropic, Gemini worth looking at:
- What's coming out of China
- What's emerging at the frontier of fine-tunable models
- Models that don't make headlines: IBM Granite, Amazon's models, Microsoft Phi
- Some of these are good at instruction following
Where This Matters:
I find that many practical uses for LLMs are simple text transformations. For those, you don't need a high-reasoning model. You want something good at: "Here's the system prompt, here's your task, go from A to B."
Having a mix of open-source and commercial models through one API is convenient for this kind of work.
Getting Started: Set up an account on OpenRouter. They offer some free inference to try it out. If you're self-hosting a chat interface like Open WebUI or looking for programmatic integration, this is where you start to see that instructional and programmatic AI can be useful beyond chatbots.
Performance Note: First-party vendor access (e.g., Anthropic for Claude) may provide better inference quality than routing through aggregators. For production workloads with specific models, consider direct vendor access.
Model Context Protocol (MCP):
MCP servers extend tools like Claude Code with external capabilities. Useful MCPs:
- Cloudflare MCP: Manage Cloudflare resources and configurations
- Vercel MCP: Deploy and manage Vercel projects
- Context7: Particularly useful MCP for enhanced context management
- GitHub MCP: Enhanced GitHub integration and repository management
- Goose: Valuable MCP tool for development workflows
Primary Services:
- Whisper (via OpenAI API): Solid baseline, good value (see the sketch after this list)
- Note: OpenAI's newer model hasn't shown significant improvement in my workloads
- Browser-based tools (e.g., Blabby): Convenient for quick tasks
- OS-level integration tools: For continuous dictation
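A minimal sketch of the Whisper-via-API baseline (the file path is a placeholder; assumes OPENAI_API_KEY is set in the environment):

```python
# Transcribe a file with the Whisper API baseline.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:  # placeholder path
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```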
Mobile/Android:
- Futo Keyboard: Voice typing and ASR on Android
- Privacy-focused keyboard with local speech recognition
- No cloud processing for voice input
- Good accuracy for on-device processing
Specialized Services:
- AssemblyAI: Long-form transcription with speaker diarization (used frequently; sketch below)
- Essential for meeting transcripts and AI minute extraction
- Diarization is a foundational element for quality downstream processing
- Deepgram: Task-dependent usage
- Gladia: Very solid performance
- Lemonfox: Budget-friendly option for cost-sensitive jobs
- Speechmatics: Advanced voice technology platform
API Aggregators:
- Eden AI: Multi-provider API batching and management
Critical Learning: Don't expect the STT tool you use for live transcription to work well for asynchronous jobs like meeting transcripts. There are significant performance differences within STT tools across synchronous vs. asynchronous workloads. Match the tool to the specific task requirements.
Use Case Recommendations:
- Live/Real-time: Whisper, browser tools, OS integration
- Async/Long-form with speakers: AssemblyAI, Speechmatics
- Budget-conscious batch jobs: Lemonfox
- General reliability: Gladia
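For the async diarization jobs recommended above, a minimal sketch with the AssemblyAI Python SDK (the API key and file path are placeholders):

```python
# Long-form transcription with speaker diarization via AssemblyAI.
import assemblyai as aai

aai.settings.api_key = "your-assemblyai-key"  # placeholder

config = aai.TranscriptionConfig(speaker_labels=True)  # enable diarization
transcript = aai.Transcriber().transcribe("meeting.mp3", config)  # placeholder file

# Utterances come back already attributed to speakers, which is the
# foundation for quality minute extraction downstream.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```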
Premium Option:
- ElevenLabs: Best-in-class quality with excellent expressiveness
- Drawback: Very expensive
- Recommendation: Use when audio quality is critical
Budget-Friendly Options:
- OpenAI TTS: Good enough for majority of use cases
- Lacks the expressiveness of ElevenLabs but significantly cheaper
- Recommended for most projects unless premium audio is essential
- Various alternative providers: Available based on specific needs
Budget Decision Framework:
- Quality-first budget: ElevenLabs
- Cost-conscious budget: OpenAI TTS (including the mini model tier)
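A minimal sketch of the cost-conscious path via the OpenAI speech endpoint (voice and output path are arbitrary placeholder choices, not recommendations):

```python
# Budget TTS via OpenAI; good enough for most use cases.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="This is the cost-conscious tier speaking.",
) as response:
    response.stream_to_file("output.mp3")
```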
Real-Time Voice APIs: These require separate API subscriptions. Generally expensive but essential for live interaction use cases.
Primary Platforms:
- Replicate: Good for model exploration and API-based workflows
  - Collections: Organized model categories
  - Text-to-Image: Image generation models
  - Image Editing: Image inpainting and manipulation
  - Text-to-Video: Video generation models
- Fal: Similar offering, seems to have more Chinese models
Why These Are Useful:
Both platforms offer the same value as OpenRouter but for generative AI.
Discovery Beyond Convenience:
Yes, it's convenient to run generative AI tasks like text-to-video through one API account. But the discovery aspect is equally valuable:
- You can explore available models in one place
- Find capabilities you didn't know existed (audio inpainting, for example)
- Try out models before committing budget to a specific workload
- Find companies and models you haven't heard of
New generative AI models and modalities appear on these platforms regularly.
Fal vs Replicate:
I don't have strong opinions about which is better. Fal seems to have more Chinese models. Both work well.
Note: Similar to OpenRouter for LLMs, these platforms provide consolidated access to generative capabilities through a single API. The discovery aspect is as useful as the consolidation.
Capabilities:
- Text-to-image generation
- Image-to-image transformation
- Image inpainting and editing
- Available through Replicate, Fal, and direct model APIs
Noteworthy Models:
- Nano Banana: Google's image-to-image model
- Fast, efficient transformations
- Good for style transfer and variations
Capabilities:
- Image-to-video conversion
- Text-to-video creation
- Accessible via Replicate and Fal platforms
Noteworthy Models:
- Wan Video 2.5 I2V Fast: Budget-friendly image-to-video
- Cost-effective for experimental video generation
- Good balance of speed and quality
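Running any of these hosted models looks roughly the same through Replicate's Python client. A sketch with a text-to-image model (the slug is one example; browse Collections for others):

```python
# One call pattern covers text-to-image, image-to-video, etc.
# Assumes REPLICATE_API_TOKEN is set in the environment.
import replicate

output = replicate.run(
    "black-forest-labs/flux-schnell",  # example text-to-image model
    input={"prompt": "a tidy home lab rack, watercolor style"},
)
print(output)  # typically URL(s) pointing at the generated asset
```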
Image-Based:
- Vision LLMs (available across most modern LLMs)
Video-Based:
- Gemini (primary choice for video understanding)
Audio-Based:
- Gemini 2.5 (primary choice)
- Enormous transformative potential beyond simple STT
Note: Audio-based inference represents a significant opportunity for innovation beyond traditional speech-to-text approaches.
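A sketch of audio-based inference with Gemini via the google-genai SDK (the model name, file, and prompt are placeholders; the SDK surface changes, so check the current docs):

```python
# Audio-based inference: ask questions of a recording directly,
# rather than transcribing it to text first.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

audio = client.files.upload(file="voice_note.mp3")  # placeholder file
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["List the action items the speaker mentions.", audio],
)
print(response.text)
```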
Retrieval-Augmented Generation for working with proprietary datasets and knowledge bases.
RAG-as-a-Service:
- Ragie: RAG as a service via API
- Supermemory: Philosophy of "don't build your own RAG pipeline"
Why Not DIY RAG:
Unless you have enterprise-level document stores, building RAG from scratch (Pinecone, Qdrant, custom embeddings, chunking strategies, vector sizes) is often time-consuming. Services like Supermemory and memory layer tools handle this complexity, letting you focus on grounding AI in documents rather than becoming a retrieval engineer.
Note: For most use cases not at massive scale, managed RAG services save setup time for relatively simple document retrieval tasks.
Firecrawl: Web scraping and data extraction
There's a whole world of MCP tools, and Firecrawl is one I use regularly.
Why Markdown Matters:
Markdown is a useful format for AI workloads. Being able to quickly say, "Okay, these are the API docs for something you're struggling with" helps.
There's Context7, which is a good MCP for this. But sometimes I know exactly what Claude needs in Markdown format. Having a reliable tool for this is useful.
Model Definitions:
Being able to define models for scraping makes the extraction more targeted.
Scraping Your Own Content:
People assume scraping is spammy or questionable, but I've often used these tools to scrape my own content.
Real Use Case:
If I wrote notes a few years ago, I can use Firecrawl to pull them in. It's quicker than digging through old Google Drive folders and formatting to Markdown.
Use Cases:
- Pulling your own historical content
- API documentation lookup via MCP
- Converting web content to AI-friendly formats
- Retrieving notes from various platforms
Note on Scraping: Web scraping tools are useful for retrieving your own content, documentation, and notes from various platforms for AI processing.
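A minimal Firecrawl sketch for pulling one of your own pages back as AI-ready Markdown (method names follow their Python SDK as I understand it; the key and URL are placeholders, so check the docs if your version differs):

```python
# Pull one of your own pages back as Markdown for AI processing.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-...")  # placeholder key

# Scrape a single page, keeping only the Markdown rendering.
result = app.scrape_url("https://example.com/old-notes", formats=["markdown"])
print(result.markdown)
```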
Document Processing:
- Mistral: Document understanding
- LlamaIndex: Document processing and indexing
Primary Tools:
- Aider (AI pair programming)
- Claude Code (CLI-based development)
- Codex (OpenAI's CLI coding tool)
Vendor CLIs:
The various vendor CLIs (Gemini CLI, Claude CLI, OpenAI CLI, Qwen CLI) provide valuable direct command-line access to their respective models. These are particularly useful for:
- Quick API testing and experimentation
- Scripting and automation workflows
- CI/CD pipeline integration
- Direct model access without abstraction layers
- VS Code: Primary IDE for development work
- Extensive extension ecosystem
- Excellent AI integration support
- Strong debugging and Git integration
- Obsidian: Primary tool for note capture and knowledge management
- Local-first, markdown-based note-taking
- Excellent for organizing AI outputs and research
- Graph view for connecting ideas
- Extensible with plugins
Note on Output Management: Output management and storage remain an oddly underaddressed part of the AI universe. Most tools focus on generation, but systematic capture, organization, and retrieval of AI outputs is still largely an afterthought. Having a solid knowledge management system like Obsidian helps bridge this gap.
- Conda: Managing complex Python environments with system dependencies
- Excellent for ML/AI projects with specific library requirements
- Handles non-Python dependencies well
- UV: Modern, fast Python package installer and resolver
- Extremely fast for creating lightweight virtual environments
- Great for quick projects and scripts
- Rust-based performance benefits
Containerization & Development:
- Docker: Essential for containerized development, especially for local workloads
- Useful for isolating dependencies
- Creating reproducible development environments
- Testing deployment configurations locally
- Running services and databases for development
- GitHub (primary repository hosting)
- Google Cloud Platform (GCP): Primary cloud provider
- On-server: Self-hosted deployments
- Vercel: Serverless frontend deployments
- Recently migrated from Netlify
- More AI-forward feature set
- Environment variable sharing across projects is excellent
- Hugging Face Spaces: Rapid prototyping and lightweight app hosting
- Supports private deployments
- Excellent for quick experiments
- Cloudinary: AI-friendly media storage and transformation
- Optimized for AI-generated images and media assets
- Automatic optimization and transformation APIs
- CDN delivery for fast access
- Good integration with generative AI workflows
Multi-Model Platforms:
- Fal (fast inference)
- Fireworks (fast inference)
- Individual vendor APIs
- OpenRouter (unified API access)
- Replicate (model marketplace)
- LM Studio: GUI interface for local models
- User-friendly interface for model management
- Good for experimenting with different models
- Supports multiple model formats
- Ollama: CLI-based local inference
- Preferred for batch processing and automation
- Efficient model management
- Easy integration with scripts and workflows
- ComfyUI: Node-based interface for Stable Diffusion
- Workflow-driven image generation
- Powerful for complex generation pipelines
- Good for iterative experimentation
- rmbg: Background removal tool
- Local background removal
- Fast and efficient processing
- High-performance GPU for local model inference
- Current: AMD Radeon (ROCm-compatible)
- Recommendation: Go with NVIDIA GPU for fewer compatibility issues
- AMD works but adds complexity
- NVIDIA has broader model support and easier setup
Why I Use Local AI (and why it's limited):
- Not for privacy: Not a driving factor for my use
- Not for cost savings: Cloud inference (Whisper STT, cheap models) offers excellent value
- Primary reason: Speed for specific workloads
Where Local AI Works Well:
- Large batch jobs: Processing thousands of files (e.g., voice note classification)
- No rate limiting concerns
- Can run overnight
- Models like Meta Llama 3.1 for text transformation tasks
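A sketch of that overnight batch pattern with Ollama's Python client (directory, model, and category labels are placeholders; assumes the Ollama server is running and the model is already pulled):

```python
# Overnight batch classification of voice-note transcripts
# with a local model: no rate limits, no per-call costs.
from pathlib import Path

import ollama

SYSTEM = "Classify this voice note as exactly one of: idea, todo, journal."

for path in sorted(Path("transcripts").glob("*.txt")):  # placeholder dir
    response = ollama.chat(
        model="llama3.1",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": path.read_text()},
        ],
    )
    print(path.name, "->", response["message"]["content"])
```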
Where Local AI Falls Short (for me):
- Agentic workflows: Local agents (Qwen models, GLM 4.6) haven't matched cloud quality
- Code generation: Local models not yet competitive with cloud options
- Image generation: Cloud services remain more practical
Honest Assessment: I keep exploring local AI and am eager for it to improve, but currently find myself defaulting to cloud services for most tasks. The quality gap hasn't closed enough to justify the added complexity, except for specific batch processing scenarios.
Offline AI:
- Local models and interfaces for sensitive workloads
- No external API calls
Air-Gapped Environments:
- Adapted local AI stack for completely isolated systems
- Separate infrastructure domain
- N8N: Visual workflow automation
Why Voice Workflows Matter:
A significant pattern in my stack is voice-based context generation and documentation:
- Efficiency: 30-minute voice recording captures more information than hours of typing
- Natural expression: Speaking allows more natural, comprehensive context sharing
- Transformation pipeline: Voice -> STT -> Context extraction -> AI workspace
Current Project - Voice Note Classification:
Working on automated classification system for voice notes using batch processing with local models.
The Meta-Workflow:
This document itself was created using this pattern:
- Claude Code generated interview questions
- Recorded 30-minute voice response
- Transcribed and integrated into documentation
Advantage: Voice workflows excel at generating rich, personalized context that makes AI significantly more useful for ongoing tasks.
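A hedged sketch of that voice -> STT -> context-extraction pipeline (file names, models, and the extraction prompt are all placeholders, not my exact setup):

```python
# Voice -> STT -> context extraction, end to end.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the long-form recording.
with open("context_interview.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Distill the transcript into reusable workspace context.
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract durable facts, preferences, and setup details as bullet points."},
        {"role": "user", "content": transcript.text},
    ],
)

# 3. Save as the foundation document for an AI workspace.
with open("workspace_context.md", "w") as out:
    out.write(extraction.choices[0].message.content)
```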
Integrated Services:
- AI Studio: Development environment
- AI Studio + Cloud Run: Production deployments (API costs apply)
- Gemini: Free with Google Workspace
- NotebookLM: Free with Google Workspace - Favorite tool for research and knowledge synthesis
Why I Use It:
- Originally limited to 10 sources; now more capable
- Good at assembling building blocks: retrieval -> questions -> generation based on retrieval
- Alternative to LlamaIndex for many use cases, but simpler
Use Case - Deliberate Context Generation:
I've been working on a pattern for a while around building context once rather than repeatedly.
The Pattern:
It applies when you have an ongoing situation that's personal and complex - not when you're just asking for a pasta recipe.
Real Example - Home Lab Infrastructure:
When managing a home lab with multiple servers, network configurations, hardware specs, and custom setups, it's tedious to re-explain the entire infrastructure every time you need help troubleshooting or planning an upgrade.
The Approach:
- Record the context once (30-minute voice note covering network topology, hardware specs, software stack, configuration decisions)
- Transform to text
- Use as workspace foundation in NotebookLM
- Get troubleshooting help, upgrade recommendations, configuration suggestions - without re-explaining your entire setup
Why This Helps:
When you're working on a few of these ongoing situations (home lab infrastructure, project architecture, complex workflows), you see the advantage of this approach over repeatedly defining context.
Two Approaches to Context:
- OpenAI's approach: Chat first, AI extracts memories over time
- This approach: Generate deliberate context upfront through structured recording
Both are trying to solve the same question: How can AI be more useful when personalized to your life circumstances?
Why Voice Workflows:
I can record a voice note and in 30 minutes gather a lot of information. Transform that (extracting the context data), and that's the starting context.
Working in Reverse:
This is working backwards from the OpenAI approach. They have you chat with ChatGPT, and it learns about you over time. This generates context upfront.
Variations:
- Basic: Speak for 30 minutes into a microphone
- Structured: A bot asks questions like an interview
- This stack documentation used the second approach: the bot came up with questions, I spoke for 30 minutes
Note: This pattern creates comprehensive context in a single focused session rather than hoping it emerges from conversation.
- Lumo (by Proton): Privacy-focused AI assistant
- Built by Proton with strong privacy guarantees
- Good option for users already in the Proton ecosystem
- Emphasis on data protection and user privacy
- Venice.ai: Privacy-first AI platform
- No data collection or tracking
- Uncensored model access
- Anonymous usage supported
- Reddit: Essential for AI news and discussions
- Subreddits like r/LocalLLaMA, r/StableDiffusion, r/MachineLearning
- Real-time updates on model releases and techniques
- Community troubleshooting and experience sharing
- Discord: Primary platform for AI community conversations
- Most AI projects maintain active Discord servers
- Direct access to developers and community experts
- Real-time support and collaboration
- Examples: Ollama, Stable Diffusion, various model communities
| Service | Tier | Cost |
|---|---|---|
| Claude Code | Pro (Basic) | $17/month |
| Claude Code | Max | $100/month |
- STT: OpenAI Whisper API (per-usage)
- TTS: ElevenLabs (premium) or OpenAI (budget)
- Real-Time APIs: Various providers (generally expensive)
- Cloud Infrastructure: GCP (variable based on usage)
- Serverless Compute: RunPod, Modal (pay-per-use)
- AI Studio + Cloud Run: API costs (variable)
Simple Text Transformations:
Many practical uses for LLMs are simple text transformations. For these:
- You don't need high-reasoning models
- You want good instruction-following
- Models like IBM Granite, Microsoft Phi work well
- Cost-effective and reliable
Example: Running evaluations across 10+ models to find which handles "AI Brevity" best (avoiding verbose outputs that resist system prompting)
Programmatic AI:
I find programmatic integration and instructional AI more useful than chatbots. This provides value through automation, batch processing, and system integration.
This AI stack reflects a few principles:
- Flexibility: Multiple options for each capability (premium vs. budget, cloud vs. local)
- Task-Appropriate Tooling: Different tools for different workload types (sync vs. async, batch vs. real-time)
- Security Tiers: Options from cloud APIs to air-gapped local inference
- Cost Balance: Premium services for some things, cost-effective alternatives for others
- CLI Tools: Preference for CLI tools and automation pipelines
- Stack Consolidation: Minimize API fragmentation through unified platforms (OpenRouter, Replicate, Fal)
- Expense Management: Practical accounting and budgeting (matters for business use)
- ChatGPT - Day-to-day LLM interface
- Claude - Primary LLM interface
- Claude Code - CLI development tool
- Gemini - Google's LLM platform
- OpenAI - LLM API access
- OpenRouter - Unified multi-model API
- Aider - AI pair programming
- Conda - Python environment management
- Docker - Containerization platform
- GitHub - Code hosting
- GitHub Copilot - Code completion
- Obsidian - Note-taking and knowledge management
- Open Interpreter - Natural language interface
- UV - Fast Python package installer
- VS Code - Primary IDE
- MCP Servers - Official MCP servers
- Smithery - MCP marketplace
- Context7 - Enhanced context management
- Cloudflare MCP - Cloudflare resource management
- Vercel MCP - Vercel project management
- GitHub MCP - Enhanced GitHub integration
- Goose - Development workflow MCP
- AssemblyAI - Long-form with diarization
- Blabby - Browser-based
- Deepgram - Task-dependent
- Eden AI - Multi-provider batching
- ElevenLabs - Premium quality TTS
- Futo Keyboard - Android voice typing/ASR
- Gladia - Solid performance
- Lemonfox - Budget-friendly
- Speechmatics - Advanced voice platform
- Whisper - Baseline transcription
- Fal - Fast inference platform
- Fireworks - Fast inference API
- Replicate - Multi-modal generation
- Collections - Organized model categories
- Nano Banana - Image-to-image model
- Wan Video 2.5 I2V Fast - Budget video generation
- Firecrawl - Web to markdown
- LlamaIndex - Document processing
- Mistral - Document understanding
- Pinecone - Vector database
- Qdrant - Vector search
- Ragie - RAG as a service
- Supermemory - Managed RAG
- Cloudinary - AI-friendly media storage
- Google Cloud Platform - Primary cloud
- Hugging Face Spaces - Rapid prototyping
- Modal - Serverless compute
- Netlify - (Previously used)
- RunPod - GPU on demand
- Vercel - Serverless frontend
- AMD Radeon - GPU hardware
- ComfyUI - Node-based Stable Diffusion interface
- LM Studio - Local model interface
- Meta Llama - Open model family
- NVIDIA - Recommended GPU
- Ollama - Local inference CLI
- rmbg - Background removal tool
- ROCm - AMD compute
- n8n - Visual automation
- AI Studio - Development environment
- Cloud Run - Serverless containers
- NotebookLM - Knowledge synthesis
- Open WebUI - Self-hosted interface
