The Data Hob

Keeping with the baking metaphor, a hob is a projection or shelf at the back or side of a fireplace used for keeping food warm. The central part of a wheel into which the spokes are inserted looks kind of like a hob, and is called the hub (etymology).

Lately there has been a move to refer to certain websites as data hubs.

But what does this mean?

In transport terminology, the cities of Chicago or Paris are known as the hub train stations of the networks because all the lines run out from them to all over their respective countries, like the spokes of a wheel.

Back in the virtual world, the Open Knowledge Foundation has decided to rebrand their CKAN system as the Data Hub, describing it as a “community-run catalogue of useful sets of data on the Internet”. It also contains 3290 datasets that you can “browse, learn about and download”.

At the same time, and apparently entirely independently, Microsoft have a prototype service called Data Hub. It’s billed as an “online service for data discovery, distribution and curation”. I can’t tell how many datasets it has, as nothing is visible unless you ask them if you can register.

What about the other word? What does data mean?

I do have a working definition of “data” — it is a representation which allows for aggregations. That is, a set of information on which you can perform the SQL GROUP BY operation.

There are other crucial SQL operations, such as the WHERE and ORDER BY clauses, but these don’t creatively transform things in the way that the GROUP BY operation does.

If we go back to the UN peacekeepers database (which I talked at length about here), each record in the swdata table is:

 (country text, mission text, people integer, month text)

We can find the number of people from each country sent to all the missions in the month of June 2010 using the following query:

SELECT swdata.country,
       sum(people) as tpeople
FROM swdata
WHERE month='2010-06'
GROUP BY country
ORDER BY tpeople desc

The ORDER BY clause gives the highest numbers first:

country tpeople
Pakistan 10692
Bangladesh 10641
India 8920
Nigeria 5732
Egypt 5461
Nepal 5148
Ghana 3748
Rwanda 3654
Jordan 3599
Uruguay 2566

Of course it’s no surprise that Pakistan is on this table above Belgium, it’s got 17 times the population. What we need to see here is the per capita table.

This obviously requires the table of populations per country. Unfortunately there wasn’t one on Scraperwiki yet, so I had to create one from the Wikipedia article.

Now I would have loved to have derived it from the editable source of the wikipedia article, as I described elsewhere, but is impossible to do because it is insanely programatic:

|2||align=left|{{IND}}||1,210,193,422||March 1, 2011
||{{#expr: 1210193000/{{worldpop}}*100 round 2}}%
|-
|3||align=left|{{USA}}||{{data United States of America|poptoday 1}}
|-
...
|16||align=left|{{EGY}}||{{formatnum: {{#expr: (79602+4.738*{{Age in days|2011|1|1}}) round 0}}000}}
||{{CURRENTMONTHNAME}} {{CURRENTDAY}}, {{CURRENTYEAR}}
||{{#expr: (79602000+4738*{{Age in days|2011|1|1}})/{{worldpop}}*100 round 2}}%
|-

As you can see, some rows contain the numbers properly, some have the numbers transcluded from a different place, and some are expressed in terms of a mathematical formula.

The Wikipedia template programming language is so dauntingly sophisticated, except for not being able to produce the ranking numbers automatically, over 200 of which would have needed to be retyped following the creation of South Sudan. Oh, and the table has to be updated into over 30 different languages.

Country identifications are tricky because they change (like Sudan) and do have a lot of different spellings.

(One thing I have never understood is why countries have different names in different languages. I mean, something like the train system around San Francisco is called BART, short for Bay Area Rapid Transit, and everyone calls it that, because that is the name it has given itself. Even the spanish call it Bart. So why do they insist on calling the US, the United States, Estados Unidos? What does that shorten to? The EU?)

My population scraper, which should really have been derived in some way from the UNstats demographics data. To make it match up with the country spellings in the peacekeeper data I had to add some extra alternative spellings:

countryalts = { u"C\xf4te d'Ivoire":["Cote d Ivoire"],
    "United States":["United States of America"],
    "Democratic Republic of the Congo":["DR Congo"],
    "South Korea":["Republic of Korea"] }

This allows us to perform a join between the two data sets together to create the per-capita peace keepers table per country:

attach: country_populations

SELECT swdata.country,
    round(sum(people)*1000000/poptable.Population,2)
      as permillionpop,
    sum(people) as peacekeepers,
    poptable.Population
FROM swdata
LEFT JOIN poptable on swdata.country=poptable.Country
WHERE month='2010-06'
GROUP BY swdata.country
ORDER BY permillionpop desc

…which gives a much more representative country ranking that puts Uruguay, Jordan and Rwanda in the top three places.

It would have been even better to use a more fine grained demographics database so as to select only for the population of a country between the ages of 20 and 40, say.

Now, suppose I had a third data set, which was the financial contribution by each country to the UN peacekeeping efforts, and I was able to join that into the query in some meaningful way that tested the relationship. For example, do countries that contribute fewer peacekeepers contribute more money to make up for deficit, or are some nations just naturally more generous than others — in proportion of their GDP? (This looks like a fourth data set to me.)

This kind of operation only works if all the datasets have been imported into the same database or complex.

What is this starting to feel like we are doing?

Does it feel like we are fitting wooden spokes into some kind of a central object that joins them into a rigid entity which acts as a single object pivoting on an axis?

Maybe we should call it a data windmill.

Posted in developer | 1 Comment

The UN peacekeeping mission contributions mostly baked

Many of the most promising webscraping projects are abandoned when they are half done. The author often doesn’t know it. “What do you want? I’ve fully scraped the data,” they say.

But it’s not good enough. You have to show what you can do with the data. This is always very hard work. There are no two ways about it. In this example I have definitely done it.

For those of you who are not so interested in the process, the completed work is here. And if you don’t think it’s done very well, come back and read what I have to say.

[By the way, the raw data is here, for you to download and -- quote -- "Do whatever you want with it."]

Phase One: The scraping

At the Columbia event I was quite pleased to create a database of un_peacekeeping_statistics from a set of zip files of pdfs containing the monthly troop contributions to the various UN peacekeeping missions.

Each pdf document was around 40 pages and in a format like this:

This table says that during the month of January 2012 the government of Argentina sent 3 of its citizens to the United Nations Mission for the Referendum in Western Sahara, 741 people to the United Nations Stabilization Mission in Haiti, 265 to the United Nations Peacekeeping Force in Cyprus, and so forth.

I was very lucky with this: the format is utterly consistent because it is spat out of their database. I was able to complete it with about 160 lines of code.

After getting the code working, I cleaned it up by removing all the leftover print statements until the only thing that would be produced at runtime was a message when a new month became available in the database. The email generating code is on line 34 and it has so far worked once by sending me an email which looked like:

Subject: UN peacekeeping statistics for 2012-01

Dear friend,
There are 788 new records in the database for

https://scraperwiki.com/scrapers/un_peacekeeping_statistics/

after month 2011-12 to month 2012-01

Who gets this email? Those who are listed as doing so in the editors list (see the image above). Maybe if you are a a journalist with international conflicts on your beat, you ought to get on this list. The emailer technology was outlined in an earlier blog-post. There is no UI for it, so it can only be enabled by request [send your request to feedback].

Phase Two: The analysis

What we have now is a table of over 86000 records stretching back to January 2003. The important columns in the table are:

month text,
country text,
mission text,
people integer

It turns out there are hundreds of relevant timeline graphs which you can make from this data with a little bit of SQL.

For example, what are the three top countries in terms of maximum deployment to any mission? Find it using:

SELECT country, max(people) as max_people
FROM swdata
GROUP BY country
ORDER BY max_people desc
LIMIT 3

The answer is India, Bangladesh and Pakistan.

To which missions do these three neighbouring, sometimes-at-war, rival countries predominantly send their troops?

Query this by executing:

SELECT mission, sum(people) as people_months
FROM swdata
WHERE country='India' or country='Bangladesh' or
      country='Pakistan'
GROUP BY mission
ORDER BY people_months desc

The answer is MONUC, UNMIL, UNMIS and UNOCI.

[The reporter who encouraged me to scrape this dataset had a theory that these peacekeeping missions are a clever way for nations to get their troops battle-hardened before the inevitable conflict on their own territory. In other words, they also serve as war-training missions.]

Now let’s have a look at the just the deployment of peacekeepers from India, Bangladesh and Pakistan to MONUC (United Nations Organization Mission in the Democratic Republic of the Congo) over time.

[There is no easy way to embed this google's dynamic javascript timeline object into a blog, so I have to present a bitmap image, which is quite annoying.]

As you can see, the pattern of deployment tends to remain at a constant quota over many years, with sudden jumps, probably due to requirements on the ground. Pakistan appeared to supply both of these peacekeeping surges, once in 2003 and once in 2005, while Bangladesh surged at one and India surged at the other.

The picture for UNOCI (United Nations Operation in Côte d’Ivoire) is different:

There is none from India, but a fixed contingent between Bangladesh and Pakistan; 600 peacekeepers were swapped between them in August 2006.

The SQL code for producing these timeline graphs goes like this:

SELECT month||'-15',  # concatenate a day to make a valid format
    sum(people*(country='India')) as people_india,
    sum(people*(country='Bangladesh')) as people_bangladesh,
    sum(people*(country='Pakistan')) as people_pakistan,
    sum(people) as people_all FROM swdata
WHERE mission='UNOCI'
GROUP BY month
ORDER BY month

Now, you could ask who are the other countries which make up the bulk of this mission, and you could answer the question by developing the necessary SQL statement yourself, but it’s a little unfair to expect everyone who is interested in this data to already have mastered SQL, isn’t it?

Phase Three: Presentation

This is the very hard part, and is usually the point where most promising projects get abandoned, because “someone else better than me at design will come along and finish it.”

Except they never ever do.

As you’re really the only one in the world who comprehends the contents and the potential of this dataset, it is your job to prove it.

Here is my attempt at a user interface for generating graphs of the queries that people might be interested in. It has taken me two hard hacking sessions to get it into this form — or twice as long as it took to write the original scraper.

It is almost as time-consuming as producing video marketing.

This is also usually the phase where all those design geniuses come out of the woodwork and start getting critical and disparaging of your efforts, so you can’t blame programmers who don’t go this far. It’s like sweating all month learning to play a new piece of music on the piano, only to get reminded again and again that you don’t have the talent.

This used to bug me big time. Until I realized that it’s actually a positive sign.

What’s infinitely worse than criticism is no criticism at all because nobody has any idea about you are trying to achieve.

Now they think they know what you are trying to do — which is why they can be critical.

The next step is for them to actually know what you are trying to do. This ought to be a small step — and if they can’t make it, and don’t even try to make it, then by definition they cannot very good designers at all.

Look, you have just got all this way starting from nothing, from finding something out in the world, to recognizing its potential, all the way to pulling in and transforming the original raw data and struggling for a way to analyse it. It’s like you have prospected for the diamonds, found them in the earth, cut a mine tunnel to it with your bare hands, separated it from the rock, roughed out its edges, glued it onto a steel washer for a ring, and oh, it doesn’t look very professional and polished now does it? Come on, give us a break! We’ve applied bags of essential skills which hardly anyone else is capable of, so why should we expect to be especially good at this phase? Does your horse have table manners? No. But it works for its hay doesn’t it?

So anyway, here is what the current result looks like:

[Question: Does the Nepalese deployment react to events that were reported in the news during the course of the Haiti mission?]

When the page initializes there are three ajax call-backs to the database to obtain the lists of countries, missions, and top contributions from countries to specific missions. You can multiple select from the countries and the missions lists to create timeline graphs of numbers of people. If you select only from the countries list it shows the troop contributions from those countries to all UN missions. If you additionally select a single mission as well it will graph those country contributions to that specific mission. And it works the other way, vice versa, for lists of missions v countries. The top contributors table helps identify who are the top countries (or missions), so you know which ones to select to make an interesting graph that is not all zeros. (eg no point in graphing the number of Italians deployed to Nepal, because there aren’t any.)

Where do the Italians go? You can find that out by selecting “Italy” from the “Contributor nations” column and clicking on the “Refresh” button on the “Top contributions” column. And you can also click on “Make timeline” to discover that Italy never sent anyone anywhere until late 2006, when they suddenly started deploying two to three thousand peacekeepers to Lebanon. What happened then? Did something change in Italian politics around that point? Maybe people who write Italian newspapers ought to know.

Okay, the user interface is not great, but it achieves the objective of facilitating the formulation of relevant questions, and answering them — which is more than can be said of a lot of artistic user interfaces that crop up around the place, like so many empty bottles of wine.

Phase four: Publishing and promoting

There is no point in doing all this work if the people who would be interested never get to see it.

This bit I cannot do at all, so I don’t even try. I do know that throwing up a long rambling technical blog about the project does not constitute effective publication. In fact, according to the news rules, “once it’s told, it’s old”, so I have just completely ruined everything, because it can now never get onto the New York Times or The Guardian on their data blog section for its 15 hours of fame, before being lost into the past archive where no one is interested at all while it steadily goes out of date through the coming months and years.

Except this dataset, with the infrastructure behind is different, because it remains in date for the foreseeable future. So it really ought to have a home somewhere, like those stock market indicators, ever present on the business pages, like the daily crossword or cartoon.

Who knows how to get this done? It’s not my bag and I am quite exhausted.

What I do know is that I had to keep looking up what all those acronyms mean until I decided I should copy them down in the code and use them for tool-tips. It took quite a bit of work, and was repetitive, and maybe should have been scraped from somewhere. But was probably extremely well worth doing, so I am repeating it here.

missiontips = { UNMIS:"United Nations Missions in Sudan", UNMIL:"United Nations Mission in Liberia",
  UNAMID:"African Union/United Nations Hybrid operation in Darfur", UNOCI:"United Nations Operation in Côte d'Ivoire",
  MINUSTAH:"United Nations Stabilization Mission in Haiti", MONUC:"United Nations Organization Mission in the Democratic Republic of the Congo",
  UNMISS:"United Nations Mission in the Republic of South Sudan", UNMIK:"United Nations Interim Administration Mission in Kosovo",
  MONUSCO:"United Nations Organization Stabilization Mission in the Democratic Republic of the Congo",
  MINURCAT:"United Nations Mission in the Central African Republic and Chad",
  ONUCI:"Opération des Nations Unies en Côte d'Ivoire", UNMEE:"United Nations Mission in Ethiopia and Eritrea",
  ONUB:"United Nations Operation in Burundi", UNIFIL:"United Nations Interim Force in Lebanon",
  UNMIT:"United Nations Integrated Mission in Timor-Leste", UNMISET:"United Nations Mission of Support in East Timor",
  UNAMSIL:"United Nations Mission in Sierra Leone", MINURSO:"United Nations Mission for the Referendum in Western Sahara",
  UNOMIG:"United Nations Observer Mission in Georgia", UNMIN:"United Nations Mission in Nepal",
  UNAMA:"United Nations Assistance Mission in Afghanistan", UNIKOM:"United Nations Iraq-Kuwait Observation Mission",
  UNFICYP:"United Nations Peacekeeping Force in Cyprus", UNISFA:"United Nations Interim Security Force for Abyei",
  UNTSO:"United Nations Truce Supervision Organization", UNOTIL:"United Nations Office in East Timor",
  MINUCI:"United Nations Mission in Côte d'Ivoire", UNIOSIL:"United Nations Integrated Office in Sierra Leone",
  BINUB:"Bureau Intégré des Nations Unies au Burundi", UNAMI:"United Nations Assistance Mission for Iraq",
  UNDOF:"United Nations Disengagement Observer Force", UNMOGIP:"United Nations Military Observer Group in India and Pakistan",
  binub:"Bureau Intégré des Nations Unies au Burundi", BNUB:"United Nations Office in Burundi",
  UNMA:"United Nations Mission in Angola" };

I’ll sign off with an image of what normally stands for an interactive index to the list of missions on the official UN website, and imagine I have done enough for someone to take it on from here.

Posted in developer, journalism | 1 Comment

Meet us in St Louis to ‘Liberate the data’!

Spirit of St Louis

The ScraperWiki truck will be in St Louis, Missouri this Thursday 23rd February.  The NICAR event is a wonderful opportunity for us to meet the IRE delegates and to understand the on going issues that investigative reporters face when digging up data.  We already have some data sets nominated and these are listed in no particular order below.  We hope to have this data available for download through the IRE site from Friday.  All we ask is that you please cite the kind data digger who liberates the data and the ScraperWiki truck that helped to get it to you.

Datasets

The state official stonewalling release bfm.sd.gov/ledger/employee.asp
The prisoner database for the federal system www.bop.gov/iloc2/LocateInmate.jsp
U.S. Defense Department Budget Materials for fiscal 2011 comptroller.defense.gov/defbudget/fy2011/ (This is PDF)
Political spending of public employee unions kcerds.dol-esa.gov/query/getOrgQry.do
FDA import refusals http://www.fda.gov/ForIndustry/ImportProgram/ImportRefusals/default.htm
FDA food/drug products recalls www.fda.gov/Safety/Recalls/default.htm
What is this nuclear data http://www.nrc.gov/reading-rm/doc-collections/
Catholic Bishop and their movements www.usccb.org/about/bishops-and-dioc

We will kick off the evening at 18:00 with some nibbles and light refreshments and then we will commence a ‘Learn to Scrape’ session at 19:00 for about 1.5/2 hrs – this will be similar to the training event that we ran at Columbia University with Julian Todd and Thomas Levine.  After the session will get on with the real task of liberating the data.  There will be some ScraperWiki mugs and t-shirts awarded as prizes for best in class. Please bring your laptop along and we will provide the wifi.  We look forward to seeing you in the wonderful city of St Louis and lets help the data do some talking!

…and a big thank you to Mark Horvit and his team for inviting is to be part of the conference!

Posted in events | Leave a comment

What happened in New York

At our New York datacamp, we set out to liberate data, teach people to liberate data, and find stories in data.

About 100 people showed up for the event, and about 40 of them attended the Learn to Scrape sessions.

The hacking was punctuated by talks by Tom Lee of the Sunlight Foundation and Jake Porway of Data Without Borders

JDCNY journalists and developers at work

Projects

Dan Nguyen scraped Florida mugshots from and used face.com‘s API to analyse each photo to tell you the arrestee’s mood.

Mike Caprio and team cleaned a spreadsheet of 80,000 records from the New York lobbyist website to power a site on New York lobbyists based on the Chicago Lobbyists site It appears that $120 million was spend on New York on lobbiests in 2011.

JDCNY: the Stop & Frisk group hard at work

Michael Keller, Marc Georges et al. related the NYPD stop, question and frisk data nine mosques referenced in an NYPD report on surveillance in order to see whether there had been unusual changes in stopping activity around these mosques.

The dataset is insanely messy, but they fortunately had access to a relatively clean version that Data Without Borders had
developed in November.

They were still going strong after the data camp. Refusing to leave, they moved to a different to a different room after getting kicked out of the data camp space.

I helped one team relate contracts from Open Book New York to data that they had scraped by hand (!) from hand-written forms in order to identify pontential conflicts of interest.

I helped another team identify potential stories (outliers) in the NYC Open Data graffiti locations dataset.

Susan McGregor was “clearly hooked” because she liberated lobbyist contract details the next evening instead of watching the Superbowl.

JDCNY attendees wrangling lobbyist data

Technical Awards

Mike Caprio won Best Data Liberator for liberating the Iowa accident reports database.

Michelle Koeth won Best Creation of an API for scraping New York, NY hospitals from Medicare Hospital Compare

Jeremy Baron, from UN peacekeeping team, won Best Use of ScraperWiki for scraping United Nations PDFs. This team also scraped peacekeeping statistics and contributions

Honorary ScraperWikian

Susan McGregor was awarded Honorary ScraperWikian. We haven’t decided what that means yet. :)

Learning

JDCNY: the Stop & Frisk team present their workTeaching the Learn to Scrape sessions and working with many of the project teams, I got the impression that we had opened participants to thinking more about how data can be scraped, transformed and analyzed to identify unusual subsets and potential stories.

Our Learn to Scrape sessions seemed to work as well; I found several participants who had claimed no knowledge of webscraping prior to the sessions to be creating reasonably complex scrapers by the next afternoon.

What Next?

More data camps are coming up, and several groups plan on contining to work on their projects. But in the mean time, we now have lots of data for you to analyze!

Posted in events, Uncategorized | 1 Comment

Welcome Jane, from Ubuntu, to ScraperWiki’s board

What’s a company’s board of directors for?

Ultimately it’s to hire or fire the CEO. But that doesn’t happen very often.

What happens more often is board meetings. But what are they for?

They give directors a status update, and are a place to do legal administrivia. But even that is most efficiently done in advance.

What really matters is feedback, input, advice, and guidance on whatever is most important to the company right now. That isn’t just constructed in the meeting, but formulated even as they sleep the night before.

It is with that in mind that as part of our new funding round, I’m pleased that Jane Silber, CEO of Canonical who make the Ubuntu operating system, is joining as chair of our board.

ScraperWiki is open source, and lots of what we do is about getting value from open data.  As Jane said (in our press release the other day):

I believe the impact of open data on Government and the Citizen will be immense. Just as ten years ago only a few recognised how open source would revolutionise IT, so now only a few understand the impact of open data

She fits well with the way ScraperWiki spans worlds – being both a geek and a businesswoman, both British and American.

While I hope that she never has to organise firing me, I’m excited to have her help making ScraperWiki the best data platform for programmers and businesses.

Posted in business | Leave a comment

$1 million to build a data platform

Sometimes the easiest way of being authentic is to just post an email that was written to be private…

Date: Fri, 27 Jan 2012 14:29:57 +0000
From: Francis Irving <francis@scraperwiki.com>
To: team@scraperwiki.com
Subject: Capital!

Today we closed our round of investment from Enterprise Ventures and Blue Fountain.

In total, provided we hit certain milestones next August, and with the Knight Foundation money, this means we have a cool $1,000,000 of capital.

Many many thanks to Aidan who has done most of the work, and now has no hair left to keep him warm in his old age (it generally takes, and took, total one person full time for 6 months to get investment).

And to everyone else for making a company that someone would want to invest in.

Short FAQ

1. What’s this money for? It’s for us to create a viable business, by helping coders make data do things across the web. To do that, we have to reach clear product/market fit, with paying developers, and/or with corporations. So please continue to ask of everything you/we do “is this testing, in as lean a way as possible, how we can get to product/market fit?”.

2. How long will it last for? If we hit the revenue in our business plan, in theory until August 2014. Even in worst cases, it’ll last a year.

3. Can I tell the world? Not just yet. We’re not press releasing it immediately, it’s embargoed for writing about, blogging about or tweeting about. But feel free to tell friends and family.

4. When’s the party? Not sure, but Monday night in Liverpool looks best to me. Who isn’t free then? (At least for an early evening drink. Now the hard (and fun) part starts! Francis

For full details, read the press release. As you can see, we also have a new board member – I’ll write about her in a separate blog post. Any questions? Please ask in the comments!

Posted in business | 3 Comments

Big fat aspx pages for thin data

My work is more with the practice of webscraping, and less in the high-faluting business plans and product-market-fit leaning agility. At the end of the day, someone must have done some actual webscraping — and the harder it is the better.

During the final hours of the Columbia University hack day, I got to work on a corker in the form of the New York State Joint Committee on Public Ethics Lobbying filing system.

This is an aspx website which is truly shocking. The programmer who made it should be fired — except it looks like he probably got it to a visibly working stage, and then simply walked away from the mess he created without finding out why it was running so slowly.

Continue reading

Posted in developer | 1 Comment