This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Crawler in a box

William Bartholomew edited this page Mar 20, 2017 · 4 revisions

For convenience, a Docker configuration for running the crawler is available. It comes pre-configured with RabbitMQ for queuing, MongoDB for document storage, Redis for caching, Metabase for insights, and the GHCrawler dashboard for configuration and control. Each of these runs in its own container. The compose file is at docker/docker-compose.yml in the GHCrawler repo.

NOTE: This is an evolving solution, and the steps for running it will be simplified with published, ready-to-use images on Docker Hub. For now, follow these steps:

  1. Clone the Microsoft/ghcrawler and Microsoft/crawler-dashboard repos.
  2. In a command prompt go to ghcrawler/docker and run docker-compose up.
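
The two steps above can be sketched as a small shell function. This is only an illustration: the function name is hypothetical, the GitHub URLs are inferred from the repo names, and cloning both repos into the same parent directory is an assumption, since the page does not state the expected layout.

```shell
# Sketch only: clone both repos and start the stack.
# Assumptions (not stated on this page): the repos are hosted on GitHub
# under these URLs and are cloned side by side in the current directory.
start_crawler_in_a_box() {
  git clone https://github.com/Microsoft/ghcrawler.git &&
  git clone https://github.com/Microsoft/crawler-dashboard.git &&
  # Run compose in a subshell so the caller's working directory is unchanged.
  ( cd ghcrawler/docker && docker-compose up )
}
```

Calling start_crawler_in_a_box runs docker-compose up in the foreground, so the container output stays attached to that terminal until you stop it.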

Once the containers are up and running, you should see crawler-related messages in the container's console output every few seconds. You can control the crawler using either the cc command-line tool or the browser-based dashboard, both of which are described below.

You can also hook up directly to the crawler infrastructure. By default, the containers expose a number of endpoints on different localhost ports. If you have trouble starting the containers because of port conflicts, either shut down the services that are using those ports or edit the docker/docker-compose.yml file to change the ports.
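
To see which ports are actually published, one option is to ask Compose itself. The sketch below uses a hypothetical function name and assumes it is run from the root of the ghcrawler clone; it only lists containers and their port mappings and does not change anything.

```shell
# Sketch only: list the stack's containers with their published ports.
# Assumption: the current directory is the root of the ghcrawler clone.
show_crawler_ports() {
  docker-compose -f docker/docker-compose.yml ps
}
```

If a listed host port collides with a local service, the host side of the corresponding mapping in docker/docker-compose.yml is the value to change.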

Updating the default Metabase for Docker configurations:

The default Metabase configuration includes some canned queries and a dashboard. If you want to clear that out and start fresh, do the following:

  1. Ensure you're starting from a completely clean container (docker-compose down && docker-compose up).
  2. Crawl a small org to populate Mongo so you have schema/sample data to work with.
  3. Open the Metabase URL and configure the questions, dashboard, etc. that you want.
  4. REMEMBER: Any changes you make will be persisted.
  5. Copy the Metabase database by changing to the docker/metabase folder in the GHCrawler repository and running:
  docker cp docker_metabase_1:/var/opt/metabase/dockercrawler.db.mv.db .
