Web scraping with Scrapy and Scrapyd in Docker. The walkthrough below starts a Scrapyd container, creates a Scrapy project, deploys it with scrapyd-client, and schedules a crawl through the Scrapyd JSON API.

$ docker run --name scrapyd-server --user scrapyd -it -P nutthaphon/scrapyd:1.1.1 bash
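
Note that -P publishes the container's exposed ports to random high ports on the host, so the Scrapyd web UI may not be reachable at port 6800 from the host itself. Assuming the image exposes port 6800 (Scrapyd's default HTTP port), a fixed mapping can be used instead:

$ docker run --name scrapyd-server --user scrapyd -it -p 6800:6800 nutthaphon/scrapyd:1.1.1 bash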

scrapyd@a4c2642d74db:~$ scrapyd &
scrapyd@a4c2642d74db:~$ mkdir projects
scrapyd@a4c2642d74db:~$ cd projects/
scrapyd@a4c2642d74db:~/projects$ scrapy startproject tutorial

New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/scrapyd/projects/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
scrapyd@a4c2642d74db:~/projects$ cd tutorial/

scrapyd@a4c2642d74db:~/projects/tutorial$ sed -i 's/#url/url/g' scrapy.cfg
scrapyd@a4c2642d74db:~/projects/tutorial$ cat scrapy.cfg

[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial

scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -l

default http://localhost:6800/
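
scrapyd-client also supports multiple named deploy targets in scrapy.cfg. As a sketch (the "production" target name and host below are hypothetical, not part of this image):

[deploy:production]
url = http://scrapyd.example.com:6800/
project = tutorial

The project could then be deployed to that target with scrapyd-deploy production -p tutorial.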

Create the first spider from the Scrapy tutorial (https://doc.scrapy.org/en/latest/intro/tutorial.html) in tutorial/spiders/quotes_spider.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
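
Optionally (this step is not part of the original session), the spider can be run locally to verify it works before deploying:

scrapyd@a4c2642d74db:~/projects/tutorial$ scrapy crawl quotes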

scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy default -p tutorial

Packing version 1479799403
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1479799403", "spiders": 0, "node_name": "a4c2642d74db"}

scrapyd@a4c2642d74db:~/projects/tutorial$ ~/scrapyd-client-master/scrapyd-client/scrapyd-deploy -L default

tutorial

scrapyd@a4c2642d74db:~/projects/tutorial/tutorial/spiders$ curl http://localhost:6800/listprojects.json

{"status": "ok", "projects": ["tutorial"], "node_name": "a4c2642d74db"}

scrapyd@a4c2642d74db:~/projects/tutorial/tutorial/spiders$ curl http://localhost:6800/listspiders.json?project=tutorial

{"status": "ok", "spiders": ["quotes"], "node_name": "a4c2642d74db"}

scrapyd@a4c2642d74db:~/projects/tutorial/tutorial/spiders$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes -d setting=DOWNLOAD_DELAY=2 -d arg1=val1

{"status": "ok", "jobid": "2f928b04b08911e68c380242ac110002", "node_name": "a4c2642d74db"}

Output from the crawl can be viewed in the Scrapyd web monitor at http://localhost:6800.
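
A running job can also be cancelled through the API, using the jobid returned by schedule.json above:

$ curl http://localhost:6800/cancel.json -d project=tutorial -d job=2f928b04b08911e68c380242ac110002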
