Tag Archives | pdf

The Tyranny of the PDF

  Why is ScraperWiki so interested in PDF files? Because the world is full of PDF files. The treemap above shows the scale of their dominance. In the treemap the area a segment covers is proportional to the number of examples we found. We found 508,000,000 PDF files, over 70% of the non-html web. It […]

Table Scraping Is Hard

The Problem NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency; A well known B2B publisher came to us to aggregate that data and provide them with information spanning across the hundreds of different trusts, such as: who are the biggest contractors across the NHS? […]