A TypeScript/Bun tool that analyzes websites listed in CSV files to determine whether they have blogs and how many blog posts they contain. It uses AI browser automation through browser-use and Browserbase to achieve high accuracy on complex blog structures and infinite scroll sites.
Install dependencies:
bun install
Set up API keys:
# Create .env file and add your API keys
echo "GEMINI_API_KEY=your_gemini_api_key_here" > .env
echo "BROWSER_USE_API_KEY=your_browser_use_api_key_here" >> .env
echo "BROWSER_BASE_API_KEY=your_browserbase_api_key_here" >> .env
echo "BROWSER_BASE_PROJECT_ID=your_browserbase_project_id_here" >> .env
Get API keys from:
- Gemini API: https://aistudio.google.com/app/apikey
- Browser-use API: https://github.com/browser-use/browser-use#getting-started
- Browserbase API: https://www.browserbase.com/
cd scraper
# Put your CSV file in the data/ directory
cp your-leads.csv data/
bun run dev
# Or specify CSV file path
bun run src/index.ts path/to/your/leads.csv
# Run evaluations
bun run test
Your input CSV must have a "Website" column containing the company websites. Example:
Company,Website,Email
Acme Corp,acme.com,[email protected]
Example Inc,https://example.com,[email protected]
The tool will create a new CSV file with the same name but with "-with-blog-analysis" appended. It adds these new columns:
- hasBlog: Boolean indicating if the website has a blog
- blogPostCount: Estimated number of blog posts found
- blogUrl: Direct URL to the blog section
- hasResources: Boolean indicating if the site has resources/case studies
- resourcesCount: Number of resources found
- resourcesUrl: Direct URL to resources section
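The output naming rule and the added columns could be modelled roughly as follows. This is a minimal sketch, not the tool's actual code: `BlogAnalysisRow`, `outputPathFor`, and `toCsvCells` are hypothetical names, and placing the suffix before the file extension is an assumption about what "appended" means here.

```typescript
// Hypothetical row shape mirroring the six added columns.
interface BlogAnalysisRow {
  hasBlog: boolean;
  blogPostCount: number;
  blogUrl: string;
  hasResources: boolean;
  resourcesCount: number;
  resourcesUrl: string;
}

// Assumption: the suffix goes before the extension, e.g. leads.csv -> leads-with-blog-analysis.csv.
function outputPathFor(inputPath: string): string {
  const suffix = "-with-blog-analysis";
  const lastSlash = Math.max(inputPath.lastIndexOf("/"), inputPath.lastIndexOf("\\"));
  const dot = inputPath.lastIndexOf(".");
  // No extension in the file name (or a leading-dot file): append at the end.
  if (dot <= lastSlash + 1) return inputPath + suffix;
  return inputPath.slice(0, dot) + suffix + inputPath.slice(dot);
}

// Serialize the added columns in the order listed above (no CSV quoting; sketch only).
function toCsvCells(row: BlogAnalysisRow): string[] {
  return [
    String(row.hasBlog),
    String(row.blogPostCount),
    row.blogUrl,
    String(row.hasResources),
    String(row.resourcesCount),
    row.resourcesUrl,
  ];
}
```

For example, `outputPathFor("data/your-leads.csv")` yields `data/your-leads-with-blog-analysis.csv` under this assumption.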
Company,Website,hasBlog,blogPostCount,blogUrl,hasResources,resourcesCount,resourcesUrl
Envoy B2B,https://envoyb2b.com,true,18,https://envoyb2b.com/news,true,4,https://envoyb2b.com/case-studies-and-research
Nalpeiron,https://nalpeiron.com,true,20,https://nalpeiron.com/blog,false,0,
This tool uses cutting-edge browser automation technology to perform intelligent website analysis:
- CSV Processing: Reads your CSV file containing company websites
- Browser Automation: Uses browser-use - an AI-powered browser automation framework that can intelligently navigate websites
- Cloud Infrastructure: Leverages Browserbase - a headless browser infrastructure service for reliable, scalable web automation
- Smart Blog Detection: The browser agent intelligently searches for blog sections by:
  - Trying common blog endpoints (/blog, /news, /insights, etc.)
  - Analyzing page structure and navigation menus
  - Using AI to understand page content and identify blog-like sections
- Content Analysis: Uses Gemini AI to analyze discovered pages and accurately count blog posts
- Fallback Logic: Implements robust fallback mechanisms if primary detection methods fail
- Results Export: Outputs comprehensive analysis to a new CSV file with blog metrics
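The "common blog endpoints" step can be illustrated with a small helper that turns a raw "Website" cell into candidate URLs. This is a sketch under assumptions: `normalizeWebsite` and `candidateBlogUrls` are hypothetical names, the path list contains only the three endpoints named above (the tool's real list is not shown here), and defaulting to `https://` for bare domains is a guess.

```typescript
// Only the endpoints named in this README; the tool's full list is not documented here.
const COMMON_BLOG_PATHS: string[] = ["/blog", "/news", "/insights"];

// Accepts "acme.com" or "https://example.com/" and returns just the origin.
// Assumption: bare domains are treated as https.
function normalizeWebsite(raw: string): string {
  const trimmed = raw.trim();
  const withScheme = /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
  return new URL(withScheme).origin;
}

// Build the list of URLs a detector might try first.
function candidateBlogUrls(website: string, paths: string[] = COMMON_BLOG_PATHS): string[] {
  const origin = normalizeWebsite(website);
  return paths.map((path) => `${origin}${path}`);
}
```

A detector could try these URLs first and fall back to navigation-menu and AI-based analysis when none of them resolves to a blog-like page.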
- browser-use: AI-powered browser automation framework
- Browserbase: Cloud-based headless browser infrastructure
- Gemini AI: Advanced content analysis and understanding
- TypeScript/Bun: Fast, modern runtime and type safety
The tool processes websites in batches of 5 with 2-second delays between batches to be respectful to target servers and avoid rate limiting.
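The batching behaviour described above could look something like this. It is a minimal sketch, not the tool's source: `chunk`, `sleep`, and `processInBatches` are hypothetical names, and running the sites within a batch concurrently via `Promise.all` is an assumption. The defaults mirror the stated batch size of 5 and 2-second delay.

```typescript
// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `worker` over all items, one batch at a time, pausing between batches.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize: number = 5,
  delayMs: number = 2000,
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    // Assumption: items inside a batch are analyzed concurrently.
    const batchResults = await Promise.all(batches[i].map(worker));
    results.push(...batchResults);
    // No delay after the final batch.
    if (i < batches.length - 1) await sleep(delayMs);
  }
  return results;
}
```

If you hit API rate limits, lowering `batchSize` or raising `delayMs` in a structure like this is the kind of change the troubleshooting notes refer to.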
- Multiple Analysis Methods: Combines browser automation, sitemap analysis, and AI-powered content detection
- Infinite Scroll Support: Handles modern blogs with pagination and infinite scroll
- Resource Detection: Finds case studies, whitepapers, and downloadable resources
- Robust Fallbacks: Multiple detection strategies ensure high accuracy across different site architectures
This tool includes comprehensive testing with 20+ manually verified test cases covering complex scenarios:
# Run evaluations
bun run test
# Results saved to actual-results.json
- Complex Blog Structures: Multi-section blogs, subdomains, hidden navigation
- Infinite Scroll Sites: Modern SPAs with dynamic loading
- Edge Cases: Sites with email walls, dropdown menus, non-standard URLs
- Resource Detection: Case studies, whitepapers, downloadable content
- Netpresenter: 212 blog posts across paginated sections
- Craftview: Blog hosted on subdomain (blog.craftview.de)
- Envoy B2B: No blog, but 8 case studies in dropdown menu
- Nalpeiron: 38+ posts across two blog sections
Accuracy Metrics: Blog detection with 30% tolerance for post counts, resource detection with 50% tolerance.
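One plausible reading of these tolerances is a relative-error check against the manually verified count. The sketch below assumes that interpretation; `withinTolerance` is a hypothetical helper, and the evaluation code may define tolerance differently.

```typescript
// True when `actual` is within `tolerance` (e.g. 0.3 for 30%) of `expected`,
// measured relative to the expected value.
function withinTolerance(actual: number, expected: number, tolerance: number): boolean {
  // Assumption: with zero expected items, only an exact zero counts as a match.
  if (expected === 0) return actual === 0;
  return Math.abs(actual - expected) / expected <= tolerance;
}

const BLOG_POST_TOLERANCE = 0.3; // 30% for blog post counts
const RESOURCE_TOLERANCE = 0.5; // 50% for resource counts
```

Under this reading, a detected count of 20 against a verified count of 18 passes the blog check (about 11% off), while 10 against 18 does not (about 44% off).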
- Browser issues: Check browser-use-checker-problems.md for known issues and solutions
- Debug output: Sitemap analysis debug files are saved to debug-sitemaps/
debug-sitemaps/ - API rate limits: Reduce batch size or increase delays in the source code
- Missing results: Some sites block automated access; check them manually if results seem incomplete