Public Repository

Last pushed: 18 hours ago
Short Description
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
Full Description

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any further crawls that those requests trigger, whether through frontier expansion or depth traversal, are also distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
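As a rough illustration of what a crawl request submitted over Kafka might look like, the sketch below builds a JSON payload using only the standard library. The helper name `build_crawl_request` and the field names (`url`, `appid`, `crawlid`, `maxdepth`) are assumptions made for this example; the official documentation is the authority on the actual request schema and topic names.

```python
import json
import uuid


def build_crawl_request(url, appid, crawlid=None, maxdepth=0):
    """Return a JSON string describing one crawl job (hypothetical schema)."""
    # Reject relative URLs early so a bad seed never reaches the cluster.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be absolute: %r" % url)
    request = {
        "url": url,                              # seed URL to crawl
        "appid": appid,                          # application submitting the job
        "crawlid": crawlid or uuid.uuid4().hex,  # groups requests of one job
        "maxdepth": maxdepth,                    # 0 means only the seed page
    }
    return json.dumps(request, sort_keys=True)


if __name__ == "__main__":
    print(build_crawl_request("http://example.com", "testapp", "abc123"))
```

A Kafka producer of your choice would then send the resulting string to the cluster's incoming topic; that step is left out here because it depends on the client library and topic configuration in use.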

Core Concepts

This project brings a number of new concepts to Scrapy and to large-scale distributed crawling in general. With it you can:

  • Run dynamic, on-demand spiders that collect any web page submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize the spiders' scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Inspect your scraping job in depth: what is upcoming and how the sites are ranked
  • Add, remove, or scale scrapers in the pool without loss of data or downtime
  • Use Apache Kafka as a data bus through which any application can interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
  • Coordinate the throttling of crawls from independent spiders that run on separate machines but sit behind the same IP address
  • Let completely different spiders yield crawl requests to each other, giving flexibility in how the crawl job is tackled
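The coordinated throttling mentioned above can be pictured as a sliding-window counter that every spider consults before fetching a page. The sketch below is an in-memory stand-in, not the project's implementation: in the cluster the shared state lives in Redis, while here a plain dictionary plays that role. The class name `SharedThrottle`, its parameters, and the key format are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SharedThrottle:
    """Allow at most `hits` requests per `window` seconds for each key."""

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits
        self.window = window
        self.clock = clock                 # injectable for testing
        self._seen = defaultdict(deque)    # stand-in for shared Redis state

    def allow(self, key):
        """Return True and record the hit if `key` is under its limit."""
        now = self.clock()
        stamps = self._seen[key]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) < self.hits:
            stamps.append(now)
            return True
        return False


if __name__ == "__main__":
    throttle = SharedThrottle(hits=2, window=60.0)
    # Assumed key format: target domain plus the spiders' shared public IP.
    key = "example.com:203.0.113.7"
    print([throttle.allow(key) for _ in range(3)])
```

Keying the counter on both the domain and the public IP is one way to make spiders on different machines, but behind the same address, share a single budget for a site.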

Images

Each component of Scrapy Cluster is published as a tag within the root Docker repository. Unlike many projects, we chose to keep the dockerized Scrapy Cluster within the same GitHub repository in order to stay consistent with how the project is used. This means that there is no latest tag for Scrapy Cluster; instead, the tags are defined as follows.

  • Kafka Monitor: istresearch/scrapy-cluster:kafka-monitor-{release/build}

  • Redis Monitor: istresearch/scrapy-cluster:redis-monitor-{release/build}

  • Crawler: istresearch/scrapy-cluster:crawler-{release/build}

  • Rest: istresearch/scrapy-cluster:rest-{release/build}

For example, istresearch/scrapy-cluster:redis-monitor-1.2 would be the official stable 1.2 release of the Redis Monitor, while istresearch/scrapy-cluster:redis-monitor-dev would be tied to the latest dev branch release. Typically, numeric releases are paired with the master branch, and -dev releases are paired with the dev branch.
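The naming scheme is regular enough to generate programmatically, which can help when scripting deployments. The following is a small sketch; the function name `image_tag` and the constants are introduced here for illustration, while the repository name and component names come from the list above.

```python
REPOSITORY = "istresearch/scrapy-cluster"
COMPONENTS = ("kafka-monitor", "redis-monitor", "crawler", "rest")


def image_tag(component, release):
    """Build the full image reference for a component and release/build."""
    if component not in COMPONENTS:
        raise ValueError("unknown component: %r" % component)
    return "%s:%s-%s" % (REPOSITORY, component, release)


if __name__ == "__main__":
    for name in COMPONENTS:
        print(image_tag(name, "dev"))
```

Each resulting string can be passed to `docker pull` or used as the `image` value in a compose file.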

Documentation

Please check out our official Scrapy Cluster documentation for more details on how everything works!

Owner
istresearch
