Skip to content

Modern CLI tool for scraping & analyzing Facebook groups using Playwright & Gemini AI. Features self-healing selectors, session security, and local offline analysis.

License

Notifications You must be signed in to change notification settings

MasuRii/FBScrapeIdeas

Repository files navigation

FB Scrape Ideas

Python Version License Issues Forks Stars Contributors

A CLI-driven application to scrape and analyze Facebook group posts for insights using Selenium and Google Gemini AI.

This tool helps users identify potential capstone/thesis ideas, student problems, or other valuable insights from university Facebook group discussions by automating data collection (including posts and comments) and AI-powered categorization.

CLI Screenshot

πŸ“– Table of Contents

✨ Features

  • πŸ”’ Authenticated Facebook Group Scraping: Securely logs into Facebook to scrape posts and comments from private or public groups.
  • πŸ€– Flexible AI Analysis:
    • Support for Google Gemini (default) and OpenAI-compatible providers (OpenAI, Ollama, LM Studio, etc.)
    • Configurable models (e.g., switch between Gemini 2.5 Pro, 2.0 Flash, or local LLMs)
    • Customizable Prompts: Override default AI prompts via JSON configuration
  • πŸ’Ύ Local Database Storage: Stores scraped data and AI insights in a local SQLite database.
  • πŸ“Š Data Export & Statistics: Export data to CSV/JSON formats and view detailed statistics.
  • πŸ’» Advanced CLI Interface:
    • Dynamic Filtering: Filter posts by category, author, or potential ideas
    • Pagination: Limit results with --limit option
    • Interactive Menus: User-friendly command selection
  • ⚑ Performance Optimizations:
    • Parallel processing for faster scraping
    • Asynchronous AI batch processing
    • Incremental data saving during scraping
  • πŸ“€ Enhanced Export Capabilities:
    • Flexible output paths
    • Multiple export formats (CSV/JSON)
    • Automatic directory creation

πŸ“ Scraped Data Fields

The application collects the following data from Facebook group posts and comments:

πŸ—¨οΈ Posts

  • Post content
  • Post URL
  • Post timestamp
  • Author name
  • Author profile picture URL

πŸ’¬ Comments

  • Comment content
  • Comment timestamp
  • Author name
  • Author profile picture URL
  • Facebook comment ID

πŸ” AI Analysis Fields

  • Category (e.g., "Project Idea", "Problem Statement")
  • Sub-category
  • Keywords
  • Summary
  • Potential idea flag
  • Sentiment analysis (for comments)

πŸ› οΈ Tech Stack

  • Language: Python
  • Web Scraping:
    • Selenium
    • webdriver-manager
    • BeautifulSoup4
  • AI & Machine Learning:
    • google-generativeai
  • Database:
    • SQLite
  • CLI:
    • click
  • Utilities:
    • python-dotenv
    • getpass

πŸ“‹ Prerequisites

Before you begin, ensure you have the following:

  • Python 3.9+
  • Git
  • A modern Web Browser (e.g., Chrome, Firefox)
  • Google Cloud Project & Gemini API Key

πŸš€ Getting Started

πŸ“¦ Option 1: Binary Release (Easiest)

For most users, we recommend using the pre-compiled binaries:

  1. Download the latest version for your platform from the Releases page.
  2. Run the application:
    • Windows: Double-click FBScrapeIdeas-windows-x64.exe.
    • macOS/Linux: Open a terminal, make the file executable (chmod +x FBScrapeIdeas-*), and run it.
  3. Interactive Setup: On the first launch, the application will guide you through an interactive wizard to configure your API keys and credentials. No manual .env file creation is required!

πŸ› οΈ Option 2: Running from Source (For Developers)

  1. Clone the repository:
    git clone https://github.com/MasuRii/FBScrapeIdeas.git
    cd FBScrapeIdeas
  2. Create and activate a virtual environment:
    # For Linux/macOS
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows (Command Prompt)
    python -m venv venv
    venv\Scripts\activate.bat
  3. Install dependencies:
    pip install -r requirements.txt

Configuration (Manual)

If you prefer to configure the application manually (e.g., for automated environments):

  1. Set up Environment Variables: Create a .env file in the project root:

    # .env
    
    # Provider Selection (gemini or openai)
    AI_PROVIDER=gemini
    
    # Gemini Configuration
    GOOGLE_API_KEY=YOUR_GEMINI_API_KEY_HERE
    GEMINI_MODEL=models/gemini-2.5-flash

    (See AI Provider Configuration for more details)

    Note: Facebook credentials are entered securely during scraping or saved during the first-run interactive session.

  2. WebDriver Setup: webdriver-manager will handle this automatically on the first run.

🧠 AI Provider Configuration

FB Scrape Ideas supports multiple AI providers, allowing you to choose between Google's Gemini models, OpenAI's official API, or local LLMs running via tools like Ollama or LM Studio.

You can configure these settings via the .env file or the CLI menu.

πŸ”Ή Using Google Gemini (Default)

This is the default provider. You only need a Google API Key.

Configuration (.env):

AI_PROVIDER=gemini
GOOGLE_API_KEY=your_google_api_key
GEMINI_MODEL=models/gemini-2.0-flash  # Optional: Change model

Available Gemini Models:

  • models/gemini-2.0-flash (Fast, efficient)
  • models/gemini-1.5-flash
  • models/gemini-1.5-pro (Higher reasoning capability)

πŸ”Ή Using OpenAI-Compatible Providers

You can connect to any service that follows the OpenAI API standard, including local LLMs.

1. Official OpenAI

AI_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-5o

2. Ollama (Local LLM)

Run Ollama locally (ollama serve) and use the following config:

AI_PROVIDER=openai
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama  # Value doesn't matter for Ollama, but must be present
OPENAI_MODEL=llama3    # Or any model you have pulled

3. LM Studio (Local LLM)

Start the local server in LM Studio and use:

AI_PROVIDER=openai
OPENAI_BASE_URL=http://localhost:1234/v1
OPENAI_API_KEY=lm-studio
OPENAI_MODEL=model-identifier

4. OpenRouter / Together AI / Groq

Point the OPENAI_BASE_URL to the provider's endpoint:

AI_PROVIDER=openai
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=your_openrouter_key
OPENAI_MODEL=anthropic/claude-3-opus

πŸ”Ή Custom Prompts

You can customize the instructions given to the AI by creating a custom_prompts.json file in the root directory. This allows you to tailor the categorization logic or sentiment analysis to your specific needs.

To use:

  1. Copy custom_prompts.example.json to custom_prompts.json.
  2. Edit the prompts in custom_prompts.json.

Example Structure:

{
  "post_categorization": "You are an expert post categorizer. Analyze the following...",
  "comment_analysis": "You are an expert comment analyzer..."
}

βš™οΈ Usage

The application is run via the CLI:

python main.py <command> [options]

Available Commands:

  • scrape: Scrapes posts and comments from a Facebook group.

    python main.py scrape --group-url "GROUP_URL" [--num-posts 50] [--headless]

    You'll be prompted securely for Facebook credentials

  • process-ai: Processes scraped posts and comments with the configured AI provider.

    python main.py process-ai
  • view: Views categorized posts and comments with filtering options:

    python main.py view [--category CATEGORY] [--author AUTHOR] [--limit N]
    • Interactive field and value selection
    • Pagination support
  • export: Exports data to CSV or JSON format:

    python main.py export --format csv|json [--output-path PATH] [--category CATEGORY]
    • Handles both posts and comments
    • Automatic directory creation
  • stats: Shows comprehensive statistics about collected data:

    python main.py stats

⚠️ Important Notice

This tool is provided for educational purposes only. Users must:

  • Comply with Facebook's Terms of Service
  • Respect privacy and data protection laws
  • Not use scraped data for commercial purposes
  • Use responsibly and ethically

The developers assume no liability for misuse of this tool. Scraping may violate Facebook's terms - use at your own risk.

About

Modern CLI tool for scraping & analyzing Facebook groups using Playwright & Gemini AI. Features self-healing selectors, session security, and local offline analysis.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Languages