Short Description
A web spider that collects image URLs
Full Description

A web crawler that finds all image links (.png, .jpg, and .gif) on the specified URLs. This container runs the Django project that lives at https://github.com/detkin/img_spider.

Running the Application

The application depends on being linked to a running Redis container, which can be started via:

docker run --name redis -d redis

There are two processes that can be started from this image. The first is the web process, a simple Django web app that handles the REST API and creates the crawl tasks. There should be only one of these containers running at a time. It runs the Django runserver on port 8000, and you can start it via:

docker run -d -p 8000:8000 --name web --link redis:redis detkin/img_spider web

The second process that can be started from this image is the Celery process, which handles the actual web crawling. There can be as many containers of this type running as you would like; more containers will allow the application to scale better if you have the resources to handle the increased number of workers. This runs Celery with Redis as its queue and result backend, and you can start it via:

docker run -d --link redis:redis detkin/img_spider celery

This container can take an argument that determines the depth the web crawler will crawl. The default is 1, but if you'd like to specify a different depth you can launch the container with this command:

docker run -d --link redis:redis detkin/img_spider celery -d 3

NOTE: we don't name the Celery containers since we don't care what they are called; they are just brute-force workers.
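
For example, to bring up a few extra workers alongside the first one (the command is the same one shown above; three is just an arbitrary number, pick whatever your resources allow):

for i in 1 2 3; do
    docker run -d --link redis:redis detkin/img_spider celery
done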

Using the Application

Submitting URLs to crawl

The application accepts POSTs at the root URL and expects a parameter called 'urls', which can contain either a comma-separated list of URLs as a single string or a list of strings. Examples of how to invoke the app:

curl -X POST -d "urls=https://google.com,https://www.docker.com" http://APP_LOCATION:8000

OR

curl -H "Content-Type: application/json" -d '{"urls":["https://google.com","https://www.docker.com"]}' http://APP_LOCATION:8000

PREFERRED: The application is written using django-rest-framework, which provides a great interactive browser for REST APIs. So navigate in your browser to:

http://APP_LOCATION:8000

and POST JSON such as:

{"urls":["https://www.docker.com", "https://google.com"]}

This POST will return JSON in the form:

{"job_id": "c68e5033-1da2-4606-ba57-a4433764b4f0"}

Use the provided job_id to find status and results as described below.
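
If you're scripting against the API, you can also capture the job_id from the response. A rough sketch, assuming the JSON shape shown above:

JOB_ID=$(curl -s -X POST -d "urls=https://www.docker.com" http://APP_LOCATION:8000 | sed 's/.*"job_id": *"\([^"]*\)".*/\1/')
echo $JOB_ID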

Checking the status of URLs being crawled

You can see the status of a job by hitting the following URL with a GET, or using the awesome built-in response page in a browser:

http://APP_LOCATION:8000/status/TASK_ID_PROVIDED_IN_RESPONSE
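
For example, with curl, using the job_id from the response above:

curl http://APP_LOCATION:8000/status/c68e5033-1da2-4606-ba57-a4433764b4f0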

The result of the GET will be JSON in the form:

{"completed": 1, "inprogress": 2}

The numbers represent the URLs submitted. Once all URLs have been crawled, you can ask for the results as described below.

Viewing the results of the URLs being crawled

Once the status shows all jobs are completed, you can view the results by hitting the following URL with a GET, or using the awesome built-in response page in a browser:

http://APP_LOCATION:8000/result/TASK_ID_PROVIDED_IN_RESPONSE
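
For example, again using the job_id from the earlier response:

curl http://APP_LOCATION:8000/result/c68e5033-1da2-4606-ba57-a4433764b4f0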

The result of this GET request, once processing has completed, is JSON that looks like this:

{
    "https://google.com/preferences?hl=en": [
        "https://google.com/images/logo_sm_2.gif", 
        "https://google.com/images/warning.gif"
    ]
}

Design decisions and drawbacks

Design decisions

I'm happy with the overall design. I think it's well separated and breaks up well for horizontal scaling. The web app, written in Django using django-rest-framework, only does a minimal amount of work and works well as a single container. It should be fairly easy for those familiar with Django to grok the code. django-rest-framework seems to be the default go-to for REST in Django right now, and I really like the HTML version of the API. It makes development and testing very quick.

For background jobs I used Celery. Celery is very robust and I've seen it scale very well with many incoming jobs and lots of workers. The tasks are fairly straightforward and are written recursively to handle the crawling. The background tasks do the network IO to read in the sites to be crawled as well as the HTML parsing and link extraction. This is the "heavy lifting" in the application and is the part that has been set up here to scale horizontally. You can add more Celery containers, which in turn adds more workers, which provides more parallel processing of the web crawling. The crawl results are stored as Celery task results, which are kept in Redis.

I'm using BeautifulSoup to do the HTML parsing. It is a well-established Python project for handling HTML with a number of backend parser options. The default parser is supposed to be a bit slower than the lxml parser, but for this project the default is sufficient.

The Docker container is based on the official python:2.7.8 image. It installs the pip requirements, clones the project from GitHub, exposes the web app port, and runs a script that decides whether to start the web app or Celery. If not run as a daemon, it will print the usage instructions for the container.
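
As a rough illustration of that dispatch logic only (this is not the actual script from the repo; the manage.py invocations and the CRAWL_DEPTH variable name are assumptions):

#!/bin/sh
# hypothetical sketch of the container's entrypoint dispatch
case "$1" in
    web)
        # serve the REST API on the exposed port
        exec python manage.py runserver 0.0.0.0:8000
        ;;
    celery)
        # the real script handles the optional -d DEPTH argument; here we just
        # export it under an assumed name for the worker to read
        if [ "$2" = "-d" ]; then
            export CRAWL_DEPTH="$3"
        fi
        exec python manage.py celery worker --loglevel=info
        ;;
    *)
        echo "usage: docker run detkin/img_spider [web|celery [-d DEPTH]]"
        exit 1
        ;;
esac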

Scaling

The application should be set up fairly well for scaling. The brunt of the work happens in the Celery tasks, which can scale horizontally by starting more containers. No effort has been made to match container resources to the number of Celery workers; in a real app we would need to figure out the right balance to get the most out of each container.

At some point the amount of memory it takes to generate all the results is going to become an issue in the web app if we ask for too many pages to be crawled in one request. But this will only be a problem for very large crawls.

At some point you might need to scale the web app as well, which would require adding another layer such as HAProxy to distribute incoming requests across multiple HTTP servers. At this point, though, that would be serious overkill.

I'm using the Django runserver, which is not meant for production applications. I could easily use Gunicorn instead, but it did not seem worth the trouble for the exercise. We would also want to front Gunicorn with something like nginx if we were really putting this into service.
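
For reference, swapping in Gunicorn inside the container would be roughly a one-liner (the WSGI module path here is an assumption):

gunicorn --bind 0.0.0.0:8000 --workers 3 img_spider.wsgi:application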

I'm using Redis for the queue and for storing task results. Redis should be able to handle tons of clients and traffic, but it is a single point of failure at this time. I've not worried about setting up Redis for scaling; I've left it as the stock Docker image. Django stores users and such in a relational DB. I've left that using SQLite, but in the real world we would use something better like Postgres.
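
If we did move off SQLite, the database could run as one more linked container, along the lines of the sketch below; the published image still uses SQLite, so this is illustrative only and the Django settings and driver would also need to change:

# a stock Postgres container a production setup might link in (illustrative only)
docker run --name db -d postgres
# the web container would then be started with an extra --link db:db and
# Django's DATABASES setting pointed at it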

Security

There is none; this is a toy app.

There are no logins or authentication of any kind protecting any of the logic in the app.

The apps run as a privileged user in their containers; in real life we would want to run as a dedicated app user.

I left the Django app running with DEBUG=True so I don't have to worry about ALLOWED_HOSTS, which would make deployment more difficult for just playing with it. Note that DEBUG=True can cause memory leaks in long-running Celery workers.

Error Handling

Basic errors are handled in the app, but it expects to be used correctly. I'm sure there are ways you can break the app by passing in malformed data. If this were something more than an exercise, the app would need to handle pathological cases a bit better.
