This is a simple tutorial on how to write a crawler using Scrapy to scrape and parse Craigslist Nonprofit jobs in San Francisco and store the data to a CSV file.
Updates: - 09/18/2015 – Updated the Scrapy scripts
Be sure to check out the accompanying video!
Create a Project
Once Scrapy is installed, open your terminal, navigate to the directory where you’d like to store your project, and create a Scrapy project by running scrapy startproject craigslist_sample. This creates the “craigslist_sample” directory used in the steps below.
Item Class: Open the items.py file within the “craigslist_sample” directory. Edit the file to define the fields that you want contained within the Item. Since we want the post title and its corresponding URL, the Item class looks like this:
The Spider: The spider defines the initial URL (http://sfbay.craigslist.org/npo/), how to follow links/pagination (if necessary), and how to extract and parse the fields defined above. The spider must define these attributes:
- name: the spider’s unique identifier
- start_urls: URLs the spider begins crawling at
- parse: method that parses and extracts the scraped data, which will be called with the downloaded Response object of each start URL
You also need to use the HtmlXPathSelector for working with XPath expressions. Visit the Scrapy tutorial for more information. The following is the code for the basic spider:
Save this in the “spiders” directory as test.py.
Now you are ready for a trial run of the scraper. While in the root directory of your Scrapy project, run scrapy crawl followed by the spider’s name (the value of its name attribute) to output the scraped data to the screen.
Item objects defined above are simply custom dicts. Use the standard dict syntax to return the extracted data inside the Item objects:
Once complete, the final code looks like this:
Store the data
The scraped data can now be stored in several formats, including JSON, CSV, and XML. To save the data as CSV, run the same scrapy crawl command with the -o option and an output file name, for example -o items.csv.
You should now have a CSV file in your directory called items.csv full of data:
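One way to check the export is to read it back with Python’s csv module. The helper below is a small sketch, not part of the project: read_items is a hypothetical name, and it assumes the file has a header row with the title and link columns defined on the Item:

```python
import csv


def read_items(path="items.csv"):
    """Return the exported rows as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Calling read_items() from the project root and looping over the result gives you each row’s "title" and "link" values for a quick sanity check.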
Although this is a relatively simple tutorial, there are still powerful things you can do just by customizing this basic script. Just remember not to overload the server of the website you are crawling. Scrapy allows you to set delays to throttle the crawling speed.
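As a sketch of what such throttling might look like, the fragment below could go in the project’s settings.py. DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are standard Scrapy settings; the two-second value is only an illustrative assumption, so pick one appropriate for the site:

```python
# settings.py (fragment)

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2  # illustrative value, not a recommendation

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True
```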
In my next post I’ll show how to use Scrapy to recursively crawl a site by following links. Until then, you can find the code for this project on GitHub.