Scraping Web Pages with Scrapy
This is a simple tutorial on how to write a crawler using Scrapy to scrape and parse Craigslist Nonprofit jobs in San Francisco and store the data in a CSV file.
If you don’t have any experience with Scrapy, start by reading this tutorial. Also, I assume that you are familiar with XPath; if not, please read the XPath basics tutorial on w3schools. Enjoy!
Updates:
- 09/18/2015 – Updated the Scrapy scripts
Be sure to check out the accompanying video!
Installation
Start by downloading and installing Scrapy (v0.16.5) and all its dependencies. Refer to this video if you need help.
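If you have pip available, one way to install that specific version is shown below (this assumes pip is already on your path):
$ pip install Scrapy==0.16.5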
Create a Project
Once installed, open your terminal and create a Scrapy project by navigating to the directory you’d like to store your project in and then running the following command:
$ scrapy startproject craigslist_sample
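This generates a project skeleton along these lines (the exact contents can vary slightly between Scrapy versions):
craigslist_sample/
    scrapy.cfg
    craigslist_sample/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py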
Item Class: Open the items.py file within the “craigslist_sample” directory. Edit the file to define the fields that you want contained within the Item. Since we want the post title and subsequent URL, the Item class looks like this:
from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()  # the post title
    link = Field()   # the post URL
The Spider: The spider defines the initial URL (http://sfbay.craigslist.org/search/npo), how to follow links/pagination (if necessary), and how to extract and parse the fields defined above. The spider must define these attributes:
- name: the spider’s unique identifier
- start_urls: URLs the spider begins crawling at
- parse: method that parses and extracts the scraped data, which will be called with the downloaded Response object of each start URL
You also need to use the HtmlXPathSelector for working with XPaths. Visit the Scrapy tutorial for more information. The following is the code for the basic spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # each posting title sits inside a <span class="pl"> element
        titles = hxs.select("//span[@class='pl']")
        for title_span in titles:
            title = title_span.select("a/text()").extract()
            link = title_span.select("a/@href").extract()
            print title, link
Save this in the “spiders” directory as test.py.
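If you want to sanity-check the XPath expression before running a full crawl, you can feed a hand-written snippet of HTML into a selector. The markup below is a simplified guess at what a Craigslist listing row looks like, not a copy of the live page:
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# Build a stand-in response from sample HTML (assumed structure)
body = '<p><span class="pl"><a href="/npo/12345.html">Program Assistant</a></span></p>'
response = HtmlResponse(url="http://sfbay.craigslist.org/search/npo",
                        body=body, encoding="utf-8")

hxs = HtmlXPathSelector(response)
for span in hxs.select("//span[@class='pl']"):
    print span.select("a/text()").extract(), span.select("a/@href").extract()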
Test
Now you are ready for a trial run of the scraper. From the root directory of your Scrapy project, run the following command to output the scraped data to the screen:
$ scrapy crawl craig
Dicts: The Item objects defined above are simply custom dicts. Use the standard dict syntax to return the extracted data inside the Item objects:
item = CraigslistSampleItem()
item["title"] = title_span.select("a/text()").extract()
item["link"] = title_span.select("a/@href").extract()
items.append(item)
Release!
Once complete, the final code looks like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title_span in titles:
            item = CraigslistSampleItem()
            item["title"] = title_span.select("a/text()").extract()
            item["link"] = title_span.select("a/@href").extract()
            items.append(item)
        return items
Store the data
The scraped data can now be stored in a number of formats: JSON, CSV, and XML, among others. Run the following command to save the data as CSV:
$ scrapy crawl craig -o items.csv -t csv
You should now have a file called items.csv in your project directory, full of scraped data.
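The same command works for the other feed formats; if you would rather have JSON or XML output, for example, swap the file name and the -t flag:
$ scrapy crawl craig -o items.json -t json
$ scrapy crawl craig -o items.xml -t xml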
Although this is a relatively simple tutorial, there are still powerful things you can do just by customizing this basic script. Just remember not to overload the server of the website you are crawling; Scrapy allows you to set delays to throttle the crawling speed.
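For instance, a couple of lines in the project’s settings.py add a delay between requests; the values here are an illustrative sketch, not a recommendation:
# settings.py (excerpt)
DOWNLOAD_DELAY = 2                # wait roughly two seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # add jitter so the delay is less uniform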
Next time
In my next post I’ll show how to use Scrapy to recursively crawl a site by following links. Until then, you can find the code for this project on GitHub.