The long dark tea time of the computer programmer

Ian McNaught-Davis presenting the BBC's 1983 TV series "Making the Most of the Micro"

The way in which Information Technology is taught in England is so dull and harmful it should be scrapped – that’s the view of the Education Secretary Michael Gove.
‘A nation of digital illiterates’ (BBC)

Many years ago there was a total corporate take-over of the computer software sector in the UK. Big money was to be made out of controlling the profits generated by software applications, which were protected from competition by incompatible and inoperable standards and the force of law. (An attempt by the UK government to establish a very modest requirement for open standards was successfully killed off last week.)

One of the most painful aspects of this take-over was the way in which the same corporations managed to deform the entire education system into serving their purposes. All things resembling actual computer programming were cleansed from the curriculum, which was instead packed with dire, tedious training modules for drilling students in how to use those self-same corporations’ big software suites.

Back in the 1980s, before this take-over, I learned to program on the BBC Micro, which was widespread throughout the UK at the time. There is good evidence that this was the reason there has been such a strong software industry in the UK over the last three decades.

Let’s just use the words of computer games pioneer Ian Livingston from his February 2011 report:

“Given that the new online world is being transformed by creative technology companies like Facebook, Twitter, Google and video games companies, it seems incredible that there is an absence of computer programming in schools. The UK has gone backwards at a time when the requirement for computer science as a core skill is more essential than ever before. When Sir Clive Sinclair launched the ZX Spectrum in 1982, affordable computers were eagerly purchased for the homes of a creative nation. At the same time, the BBC Micro was adopted as the computer platform of choice for most schools and became the cornerstone of computing in British education in the 1980s. There was a thirst for creative computing both in the home and in schools creating a further demand at universities for courses in computer science. This certainly contributed to the rapid growth of the UK computer games industry.

“But instead of building on the BBC’s Computer Literacy Project in the 1980s, schools turned away from programming in favour of ICT. Whilst useful in teaching various proprietary office software packages, ICT fails to inspire children to study computer programming. It is certainly not much help for a career in games. In a world where technology affects everything in our daily lives, so few children are taught such an essential STEM skill as programming. Bored by ICT, young people do not see the potential of the digital creative industries. It is hardly surprising that the games industry keeps complaining about the lack of industry-ready computer programmers and digital artists.”

The official Government response was mostly public relations, mentioning developments that have nothing to do them such as Raspberry Pi, which, incidentally, appears to be an attempt by Livingston’s generation to recreate the 1980s when we once watched computers on TV (instead of watching TV on computers).

While this change in Government policy is absolutely vital, it is odd the way the people lobbying for it haven’t branched outside of their narrow fields of computer games and computer graphics – which are ultimately little more than a game of shifting pixels around on a VDU, utilizing very mature technologies based on software applications developed by large corporations.

They’re missing the point: This is the era of smart energy monitors that need to be coupled to microcontrollers that do something with appliances over the internet in response to data. Or robotrading bank accounts that use live data feeds to monitor and execute your share portfolio while you’re watching Corrie. The focus should be on Arduinos and simple robotics to control the energy use in your house, or on creative programming to trump the hedgefunders.

Recently I have been doing experiments through the interface of a home-built 3D printer, completely bypassing their UI application to drive it directly through the serial port.

Here is my first result:

Come on guys. I know it takes years to mastering the ability to render the perfect series of CGI frames of a spoon stirring the brown liquid in a cup of tea. But get a computer controlled robot to actually make me a cup of tea — that would be something!

Posted in thoughts | Tagged , | 1 Comment

ScraperWikiをためしてみよう

Guest post by Makoto Inoue, a Japanese ScraperWiki user. Makoto works in London as a Web developer, a technical writer, and a translator. He has a Japanese blog and his Twitter account is @makoto_inoue.

はじめに

みなさんスクレイプ(Scrape)という単語はご存知でしょうか?

ウェッブページから特定のデータを引っこ抜く作業のことをスクレイピング(Scraping)と呼びます。

昨今のホームページではデータを簡単に提供するためのAPI(Application Programming Interface)というしくみが多いので「なんで今更そんなの必要なの」と思われる方>も多いかもしれません。しかしながら前回起きた東日本大地震の際、地震や電力の速報や、各地の被害状況を把握するために必要な政府の統計情報などがAPIとして提供されておらず、開発者の中には自分でスクレイパー(Scraper)用のプログラムを書いた人も多いのではないのでしょうか? ただそういった多くの開発者の善意でつくられたプログラムがいろいろなサイトに散らばっていたり、やがてメンテナンスされなくなるのは非常に残念なことです。

そういうときにScraperWikiの出番です。

ScraperWikiとは

ScraperWikiはイギリスのスタートアップ企業で、スクレイパーコードを共有するサイトを提供しています。開発者達はサイト上から直接コード(Ruby, PHP, Python)を編集、実行することができます。スクレイプを定期的に実行することも可能で、取得されたデータはScraperWikiに保存されますが、ScraperWikiはAPIを用意しているので、このAPIを通して、他のサイトでデータを再利用することが可能です。

「Wiki」といっているだけあって、一般公開されているコードは他の人も編集したり、またコードをコピーして他のスクレイピングに利用することもできます。定期的に実>行されているスクレイパーがエラーを起こしていないかをチェックする仕組みがあり「みんなでスクレイピングを管理」するための仕組みがいたるところにあります。

ScraperWikiは、もともとイギリスで、どの議員がどの法案に賛成または反対票を投じたかを議会のサイトから創業者の一人が2003年頃にスクレイプしたことを起源に持ちます。

日本であればちょうどこういったページでしょうか?

現在ではGuardian社といった大手報道機関が企業ロビイストの議会での影響力を調べるのにつかったり、イギリス政府自身がalpha.gov.ukというプロトタイプサ>イトで、各省庁に点在したデータを一元的にアクセスするための仕組みとしてScraperWikiを使っているそうです。

ScraperWikiのビジネスモデルですが、一般公開するコードに関しては無料ですが、非公開にしたり、定期的にスクレイプする量などに応じて課金するようになっています。

前置きが長くなってきましたが、実際に使ってみましょう。

既存のスクレイパーを眺めてみる

「ScraperWiki」でGoogle検索すると、すでにScraperWikiを使用している日本人の方がいらっしゃいました。

ここでは衆議院議員のデータをスクレイプするのに使用しています。

データは月一回走るように設定されていたり、複数のcontributorがいるのがわかります。

ページの下の方にはスプレッドシート形式でデータを閲覧できるようになっていますが、これだけだと他のサイトで再利用とか難しいですよね。そういうときは”Explorer with API”ボタンをクリックしてみて下さい。そこのページの最後に以下のようなurlがあると思います。

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=members_of_the_house_of_representatives_of_japan&query=select%20*%20from%20%60swdata%60%20limit%2010

このurlにアクセスすると、先ほどのデータをJSON(Javascript Object Notation)で返してくれます。出力フォーマットは CSV, RSS,HTMLテーブルといった他の形式にも対応している上sql文をつかってフィルタリングなどをかけることも可能です。

select * from `swdata` where party = ‘民主’

COPYして独自のスクレイパーを作ってみる

ブラウザの”バック”ボタンを押して先ほどのページのスプレッドシートの下の方に目を通してい見て下さい。”This Scraper in Context”というところ”Copied To”という項>目があります。これはこのソースコードがコピーされ、他の用途に利用されていることを示しています。

そこに「makoto / Members of the House of Councillors of Japan」とあるの>でクリックしてみて下さい。実はこれは私が参議院議員の名簿を抜き出すために作ったスクレイパーです。衆議院と参議院はそれぞれ別にホームページを持っているのです>が、それぞれの議員名簿のページが結構似ていたので簡単に流用できるのではと思っていました。

作り方は簡単でCopyリンクをクリックするだけです。ログインしていなくてもコピーをとれますが、これを機にアカウントを取得するのをお勧めします。

”Edit”ページを開くとその場でコードを編集するためのオンラインエディタが現れます。下にある”Run”ボタンを押すと実際にサイトからデータをとってきている模様が見て取れます。

http://www.screenr.com/embed/BgQs

以下がオリジナルのコードと私のコードの差異です。

衆議院と参議院のページ異なっていたため変更した点はは4つほどありました。

  • エンコーディング(文字の表示形式)がUTFとShift-JISでことなる
  • 衆議院のページは複数ページにまたがっているが参議院ページ1ページのみ
  • 衆議院のページで議員名は「くん」づけ。参議院のページは芸名と本名の両方が載っている

他にもHTMLのページの文法が微妙に違っていたので、XPathという、HTMLに構造的にアクセスするための書式を少し換えました。

もちろんこれらの変更をするのにはある程度のプログラミング知識が必要なのですが。動くサンプルを少し自分用にカスタマイズするScraperWikiはプログラミングを勉強したい人にとっても絶好の教材なのではないでしょうか? 私自身XPathはあまり使ったことがなかったのですが、このもとプログラムを参考にすることで比較的簡単に学習できました。

最後に

今回のScraperWikiの簡単なチュートリアルで概要がわかっていただけたでしょうか?

公共機関、メディアや政府機関の中でインターネットを通じた情報公開は進んできていますが、「マッシュアップを前提としたデータの再利用」を考慮したサイトが十分で>ないのが現状です。そういった状態に一石と投じるべくScraperWikiは活動しており、ヨーロッパのジャーナリストや政府関係者の間では徐々に認知度があがってきております。 現在ScraperWikiでは米国でのワークショップを予定していますが、日本でもワークショップを始めるべく準備をしている所です。 もし興味のある方はコンタクトページより気軽にご連絡下さい。

Posted in developer, opendata, Scrapers | Tagged , | Leave a comment

Happy New Year and Happy New York!

We are really pleased to announce that we will be hosting our very first US two day Journalism Data Camp event in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation on February 3rd and 4th 2012.

We have been working with Emily Bell @emilybell, Director of the Tow Center and Susan McGregor @SusanEMcG, Assistant Professor at the Columbia J School to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams; journalism data projects; liberating New York data; and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams.  We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or  information technology, coders, statisticians and public data boffs to dig up the data!

Please pick a stream and sign-up to help us to make New York a data driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the big apple

Zarino has created a map with our US events which we will update with additional events as we add locations. https://scraperwiki.com/events/

Posted in events | Tagged , , , , , , , | Leave a comment

Up in the Air with ScraperWiki and Tropo

We came across this blog post a few days ago from these cool guys at Tropo in Florida, and thought you’d be interested in how they’ve used ScraperWiki. Tropo is a simple API for adding voice and other goodies to your apps and, as Mark Headd explains, it can be really powerful when combined with data ScraperWiki…

ScraperWiki is a powerful cloud-based service that lets you scrape data from online documents and websites.

When you write a scraper – a script to pull information from a web resource and then parse out the bits you want – it will execute inside the ScraperWiki environment.

SMS flight information

You can store the data that is scraped inside a data store and then access the data from outside the ScraperWiki environment using their API. Scrapers can be written in one of several different languages – Ruby, PHP and Python.

ScraperWiki and Tropo operate in a very similar way. The Tropo scripting environment allows you to write scripts in one of several different languages, including Ruby, PHP and Python (Groovy and JavaScript are also supported).

Your script executes inside the Tropo environment, which means you can make direct connections to external resources – like the ScraperWiki API – from within your executing script. There is no need for extra HTTP overhead, and the additional step of posting to a back end server to connect to other APIs or resources.

In the following screencast, I demonstrate how to use Tropo and ScraperWiki to quickly and easily build an airport information system for the Philadelphia International Airport.

All of the code for this example can be found here. If you’d like to view the actual scraper I wrote on ScraperWiki, you can find it here.

This is still a work in progress – I’d like to run this script multiple times per day (ideally, maybe once an hour) to get updates to flight information and ensure that the app has the most up to date flight status. The voice dialog could also use a little tweaking, and I’d like to offer the option of repeating the information.

But even with these refinements aside, it is evident how combining these two powerful cloud resources can generate a pretty useful application in a very short time.

Tropo and ScraperWiki are a powerful combination. Happy flying!

Posted in users | Tagged , , , , | Leave a comment

Scraping the protests with Goldsmiths

Google Map of occupy protests around the worldZarino here, writing from carriage A of the 10:07 London-to-Liverpool (the wonders of the Internet!). While our new First Engineer, drj, has been getting to grips with lots of the under-the-hood changes which’ll make ScraperWiki a lot faster and more stable in the very near future, I’ve been deploying ScraperWiki out on the frontline, with some brilliant Masters students at CAST, Goldsmiths.

I say brilliant because these guys and girls, with pretty much no scraping experience but bags of enthusiasm, managed (in just three hours) to pull together a seriously impressive map of Occupy protests around the world. Using data from no less than three individual wikipedia articles, they parsed, cleaned, collated and geolocated almost 600 protests worldwide, and then visualised them over time using a ScraperWiki view. Click here to take a look.

Okay, I helped a bit. But still, it really pushed home how perfect ScraperWiki is for diving into a sea of data, quickly pulling out what you need, and then using it to formulate bigger hypotheses, flag up possible stories, or gather constantly-fresh intelligence about an untapped field. There was this great moment when the penny suddenly dropped and these journalists, activists and sociologists realised what they’d been missing all this time.

Students like these, with tools like ScraperWiki behind them, are going to take the world by storm.

But the penny also dropped for me, when I saw how suited ScraperWiki is to a role in the classroom. The path to becoming a data science ninja is a long and steep one, and despite the amazing possibilites fresh, clean and accountable data holds for everybody from anthropologists to zoologists, getting that first foot on the ladder is a tricky task. ScraperWiki was never really built as a learning environment, but with so little else out there to guide learners, it fulfils the task surprisingly well. Students can watch their tutor editing and running a scraper in realtime, right alongside their own work, right inside their web browser. They can take their own copy, hack it, and then merge the data back into a classroom pool. They can use it for assignments, and when the built-in documentation doesn’t answer their questions, there’s a whole community of other developers on there, and a whole library of living, working examples of everything from cabinet office tweeting to global shark activity. They can try out a new language (maybe even their first language) without worrying about local installations, plugins or permissions. And then they can share what they make with their classmates, tutors, and the rest of the world.

Guys like these, with tools like ScraperWiki behind them, are going to take the world by storm. I can’t wait to see what they cook up.

Posted in developer, events, journalism | Tagged , , , , , | 1 Comment

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design an Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the immovable force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the wikipedia api and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia api site doesn’t accept.

The result is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England wih which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire which is then inside
Category:Caves_of_England which is finally inside
Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves that extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length to create this table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres desc:

name location_area length_metres depth_metres
Ease Gill Cave System United Kingdom Yorkshire Dales 66000.0 137.0
Dan-yr-Ogof Wales 15500.0
Gaping Gill United Kingdom Yorkshire Dales 11600.0 105.0
Swildon’s Hole Somerset 9144.0 167.0
Charterhouse Cave Somerset 4868.0 228.0

If I was being smart I could make the scraping adaptive, that is only updating the pages that have changed since the last scraped by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t want your updates to come less than 6 months, which prevents you from getting any feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundaries between entities is a simple, straightforward and general concept, rather than what it really is: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.

Posted in developer, Scrapers | Tagged , , | 3 Comments

Have a Happy Open Data Day with ScraperWiki

Tomorrow is Open Data Day, and if you’re planning on hacking the web with the Open Knowledge Foundation or Random Hacks of Kindness or your own data hack of choice, here’s some fuel for your fight ( and if you’ve already driven our digger then check out the last video on our API so you can build applications!):

Posted in video | Tagged | Leave a comment