Blog

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design an Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the immovable force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the wikipedia api and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia api site doesn’t accept.

The result is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England wih which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire which is then inside
Category:Caves_of_England which is finally inside
Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves that extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length to create this table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres desc:

name location_area length_metres depth_metres
Ease Gill Cave System United Kingdom Yorkshire Dales 66000.0 137.0
Dan-yr-Ogof Wales 15500.0
Gaping Gill United Kingdom Yorkshire Dales 11600.0 105.0
Swildon’s Hole Somerset 9144.0 167.0
Charterhouse Cave Somerset 4868.0 228.0

If I was being smart I could make the scraping adaptive, that is only updating the pages that have changed since the last scraped by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t want your updates to come less than 6 months, which prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the na├»ve semantic web notion that the boundaries between entities is a simple, straightforward and general concept, rather than what it really is: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.

Tags: , ,

5 Responses to “How to scrape and parse Wikipedia”

  1. Alice December 8, 2011 at 5:47 pm #

    Great article. But I’m facing some problem. Your first example is working fine. I have tried second one. But that one is not working. Here is traceback . Thank you.
    File “k2.py”, line 2, in
    wikipedia_utils = scraperwiki.swimport(“wikipedia_utils”)
    AttributeError: ‘module’ object has no attribute ‘swimport’

  2. Julian December 8, 2011 at 6:30 pm #

    where is this file k2.py of yours?

    scraperwiki.swimport() is a function in the library as described at the bottom of this page:
    https://scraperwiki.com/docs/python/python_help_documentation/

  3. Chris Davis January 27, 2012 at 9:32 am #

    Waiting 6 months for the latest DBpedia update isn’t a concern any more. They have a version that is being synchronized live with Wikipedia now – see http://live.dbpedia.org/.

    I agree with the criticism about how semantic web applications sometimes assume far too clear boundaries between entities. However, this isn’t an inherent problem with DBpedia since they’ve basically outsourced the task of creating entity definitions to the Wikipedia community. The only way they would show these caves being connected together in one system was if people on Wikipedia said that they were.

    In practical terms, I think that ScraperWiki can still be an awesome tool for scraping Wikipedia since the DBpedia parser does sometimes have problems parsing certain fields, and I don’t think they have very good support yet for parsing tables.

  4. Eddie November 27, 2012 at 4:53 pm #

    Your wikiscrape script just saved me, needed to get a list of regions from counties in the UK and other databases I’ve found online kept letting me down.. Setup a small script in python with flask and ran through google-refine.. Very helpful!

Trackbacks/Pingbacks

  1. The Data Hob | ScraperWiki Data Blog - March 6, 2012

    [...] I would have loved to have derived it from the editable source of the wikipedia article, as I described elsewhere, but is impossible to do because it is insanely [...]