Michael Herman

Software Developer

Web Scraping With Regular Expressions

Problem

You need to extract and parse all the headers and links from a web site or an XML feed, and then dump the data into a CSV file.

Check out the accompanying video!

Solution

Import the required libraries:

import csv
from urllib import urlopen  # Python 2 location; on Python 3 it is urllib.request
import re
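
The import above is where urlopen lives on Python 2. As a rough sketch of the same fetch step on Python 3, where urlopen moved to urllib.request and read() returns bytes: the helper name fetch_text, the UTF-8 default, and the data: URL used as an offline stand-in for the feed are all assumptions made for illustration, not part of the original script.

```python
from urllib.parse import quote
from urllib.request import urlopen  # Python 3 location of urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return the body as str (hypothetical helper)."""
    with urlopen(url) as response:
        # read() returns bytes on Python 3; decode before applying str regexes.
        # UTF-8 is an assumption about the feed's encoding.
        return response.read().decode(encoding, errors="replace")


# A data: URL stands in for the live feed so the sketch runs offline.
demo_url = "data:text/xml;charset=utf-8," + quote("<title>Hi</title>")
text = fetch_text(demo_url)
print(text)
```

Passing the feed URL to fetch_text instead of demo_url should give a string that the regular expressions in the next step can search.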

Fetch the HTML/XML, extract the desired fields with regular expressions, and build a range of row indices:

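xml = urlopen("http://www.tableausoftware.com/public/feed.rss").read()

# Non-greedy groups stop at the first closing tag
xmlTitle = re.compile(r"<title>(.*?)</title>")
xmlLink = re.compile(r"<link>(.*?)</link>")

findTitle = re.findall(xmlTitle, xml)
findLink = re.findall(xmlLink, xml)

# Row indices 1 through 24, capped at the number of matches found
iterate = range(1, min(25, len(findTitle), len(findLink)))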
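
To see what the two patterns capture without downloading anything, the sketch below runs them against a small made-up feed. SAMPLE_FEED and the variable names are hypothetical and the real feed may be laid out differently. It is written for Python 3 and also contrasts a greedy group, (.*), with a non-greedy one, (.*?), when two elements share a line.

```python
import re

# Hypothetical sample standing in for the downloaded feed.
SAMPLE_FEED = """<rss><channel>
<title>Example Feed</title><link>http://example.com/</link>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

title_pattern = re.compile(r"<title>(.*?)</title>")
link_pattern = re.compile(r"<link>(.*?)</link>")

titles = title_pattern.findall(SAMPLE_FEED)
links = link_pattern.findall(SAMPLE_FEED)
print(titles)
print(links)

# Greedy vs. non-greedy when two elements share one line.
one_line = "<title>A</title><title>B</title>"
greedy = re.findall(r"<title>(.*)</title>", one_line)
lazy = re.findall(r"<title>(.*?)</title>", one_line)
print(greedy)  # ['A</title><title>B']
print(lazy)    # ['A', 'B']
```

In this sample the first match is the feed's own title rather than an article, which is one plausible reason for starting the row range at 1. Whether the live feed behaves the same way is an assumption worth checking against the actual output.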

Open the CSV file for writing:

writer = csv.writer(open("pytest.csv", "wb"))

Write the header row to the CSV file (do this before entering the loop, so it is written only once):

head = ("Title", "URL")
writer.writerow(head)
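
The "wb" mode used above is the Python 2 convention. On Python 3 the csv module expects a text-mode file opened with newline="". The sketch below shows the open and header steps that way, as one possible variant. The temporary output path and the names out_path and csv_file are assumptions, chosen so the example does not overwrite a real file.

```python
import csv
import os
import tempfile

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "pytest.csv")

# Text mode with newline="" lets the csv module control line endings;
# the with-block closes the file even if a write fails.
with open(out_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(("Title", "URL"))

# Read the file back to confirm that the header row was written.
with open(out_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))
print(rows)
```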

Write a for loop that iterates over the matched titles and links and writes each pair as a row in the CSV file:

for i in iterate:
  writer.writerow([findTitle[i], findLink[i]])
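
An index-based loop like the one above raises an IndexError if either list is shorter than the range expects. One alternative, sketched here for Python 3 with made-up data, pairs the two lists with zip, which stops at the shorter list. The titles and links lists are hypothetical stand-ins for findTitle and findLink, and an in-memory buffer replaces the CSV file.

```python
import csv
import io

# Hypothetical stand-ins for the findTitle and findLink lists.
titles = ["Example Feed", "First post", "Second post"]
links = ["http://example.com/", "http://example.com/1", "http://example.com/2"]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(("Title", "URL"))

# Skip index 0 like the loop above; zip stops at the shorter list,
# so no index can run past the end.
for title, link in zip(titles[1:], links[1:]):
    writer.writerow([title, link])

csv_text = buffer.getvalue()
print(csv_text)
```

Slicing with [1:25] instead of [1:] would keep the same upper bound of 24 rows that the range in the original loop implies.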

Script

#!/usr/bin/env python
# Written for Python 2: urlopen comes from urllib and the CSV file is opened in "wb" mode

import csv
from urllib import urlopen
import re

# Open and read the HTML / XML
xml = urlopen("http://www.tableausoftware.com/public/feed.rss").read()

# Compile non-greedy patterns for article titles and links
xmlTitle = re.compile(r"<title>(.*?)</title>")
xmlLink = re.compile(r"<link>(.*?)</link>")

# Find and store the data
findTitle = re.findall(xmlTitle, xml)
findLink = re.findall(xmlLink, xml)

# Build the row indices: 1 through 24, capped at the number of matches
# so the loop below cannot raise an IndexError
iterate = range(1, min(25, len(findTitle), len(findLink)))

# Open the CSV file and write the header row
writer = csv.writer(open("pytest.csv", "wb"))
head = ("Title", "URL")
writer.writerow(head)

# Using a for loop, write the results to the CSV file, row by row
for i in iterate:
  writer.writerow([findTitle[i], findLink[i]])
