600 Lines of Code, 748 Revisions = A Load of Bubbles

When Channel 4′s Dispatches came across 1,100 pages of PDFs, known as the National Asset Register, they knew they had a problem on their hands. All that data, caged in a pixelated prison.

So ScraperWiki let loose ‘The Julian’. What ‘The Stig’ is to Top Gear, ‘The Julian’ is to ScraperWiki. That and our CTO.

‘The Julian’ did not like the PDFs. After scraping 10 pages of Defence assets, he got angry. The register may as well been glued together by trolls. The 5 year old data copied and pasted by Luddites from the previous Government was worse then useless.

So the ScraperWiki team set about rebuilding the register. Using good old-fashioned man power (i.e. me) and a PDF cropper we built a database of names, values and hierarchies that link directly to the PDFs.

Then Julian set about coding; 600 lines and 748 revisions! He made the bubbles the size of the asset values and got them to orbit around their various parent bubbles. This required such functions as ‘MakeOtherBranchAggregationsRecurse(cluster)’.

This scared our designer Zarino a little, who nevertheless made it much more user-friendly. This is where ScraperWiki’s powers of viewing live edits, chatting and collaboration became useful. The result was rounds of debugging interspersed with a healthy dose of cursing.

We then tried using it. We wanted the source of the data to hold provenance. We wanted to give the users the ability to explore the data. We wanted them to be able to see the bubbles that were too small. We prodded ‘The Julian’.

He hard coded the smaller bubbles to get into a ‘More…’ bubble orbit. This made the whole article on Channel 4 News thing a lot clearer and changed the navigation from jumping to orbits to drilling down and finding out which assets are worth a similar amount.

He then got it to drill down to the source PDFs. ‘The Julian’ outdid himself and stayed up all night making a PDF annotator of the data. We have plans for this.

Oh, and we also made a brownfield map on the Channel 4 News site. The scraper can be found here. And the code for the visual here. The 25000 data points were in Excel form and so much easier to work with. This was nice data with lots of fields. The result: a very friendly application that allows users to type a post code and to see what land their local authority has up for redevelopment. But due to the new government coming in, the Homes and Communities Agency have not yet finished collecting the 2009 data.

NAR and NLUD – you’ve been ScraperWikied!

Tags: , , , , ,

7 Responses to “600 Lines of Code, 748 Revisions = A Load of Bubbles”

  1. Tim Green March 8, 2011 at 4:58 pm #

    Is there an explanation anywhere of the PDF cropper? I’ve seen it a few times from the Dispatches data, but the pages doesn’t give much explanation. I assume it’s only been used internally so far.

    • Tim Green March 8, 2011 at 5:01 pm #

      Now I feel a bit silly after finding the ‘Instructions’ bit. I guess I was mostly wondering if it could also do OCR, but the pdf annotator seems to be the tool for actually extracting information.

      • nicolahughes March 8, 2011 at 5:06 pm #

        The cropper and annotator are new features just rolled out. We’re testing to see what is most useful. Those that are will be made much more user-friendly. Our goal is to make unstructured data more usable but also much more explorable. Sadly, the data as it is collected, has very little information attached. Hopefully with exposure it will be collected in a much more usable manner


  1. National Asset Bubbles – All in a day’s data « Data Miner UK - March 9, 2011

    [...] This is a visual made from the most inaccessible (both data and journalistically) PDFs of the National Asset Register. the information it contained was used for a Dispatches live debate and this repurposing was put into an article on the Channel 4 News website. I was fortunate enough to be part of the ScraperWiki team that took on the project and produced it in a matter of days. I have written a blog post on ScraperWiki here. [...]

  2. All the news that’s fit to scrape | Online Journalism Blog - March 25, 2011

    [...] the Open Knowledge Foundation blog (more on Scraperwiki’s blog): “ScraperWiki worked with Channel 4 News and Dispatches to make two supporting data [...]

  3. #hhhglas – the really late “live” blog » Nicola Osborne - May 9, 2011

    [...] So, to finish off the introductory part of the day you might be wondering how ScraperWiki makes money? Well private scrapers (the default is public and shared) or excessive API calls can be charged for. And for consulting/pieces of work for others – big organisations don’t need to do this stuff regularly just occasionally so it makes sense to contract it out. So we worked with Channel4 News and the Dispatches programme. [...]

  4. Scraping guides: Excel spreadsheets | ScraperWiki Data Blog - September 14, 2011

    [...] We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News. [...]