<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>ScraperWiki Data Blog</title>
	<atom:link href="http://blog.scraperwiki.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.scraperwiki.com</link>
	<description>A blog about ScraperWiki and all things data</description>
	<lastBuildDate>Thu, 09 Feb 2012 18:26:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.scraperwiki.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>ScraperWiki Data Blog</title>
		<link>http://blog.scraperwiki.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.scraperwiki.com/osd.xml" title="ScraperWiki Data Blog" />
	<atom:link rel='hub' href='http://blog.scraperwiki.com/?pushpress=hub'/>
		<item>
		<title>$1 million to build a data platform</title>
		<link>http://blog.scraperwiki.com/2012/02/09/1-million-to-build-a-data-platform-2/</link>
		<comments>http://blog.scraperwiki.com/2012/02/09/1-million-to-build-a-data-platform-2/#comments</comments>
		<pubDate>Thu, 09 Feb 2012 18:26:22 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[business]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216330</guid>
		<description><![CDATA[Sometimes the easiest way of being authentic is to just post an email that was written to be private&#8230; Date: Fri, 27 Jan 2012 14:29:57 +0000 From: Francis Irving &#60;francis@scraperwiki.com&#62; To: team@scraperwiki.com Subject: Capital! Today we closed our round of &#8230; <a href="http://blog.scraperwiki.com/2012/02/09/1-million-to-build-a-data-platform-2/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216330&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Sometimes the easiest way of being authentic is to just post an email that was written to be private&#8230;</p>
<blockquote><p>Date: Fri, 27 Jan 2012 14:29:57 +0000<br />
From: Francis Irving &lt;francis@scraperwiki.com&gt;<br />
To: team@scraperwiki.com<br />
Subject: Capital!</p>
<p>Today we closed our round of investment from <a href="http://www.evgroup.uk.com/default.aspx">Enterprise Ventures</a> and <a href="http://www.bluefountain.com/home/">Blue Fountain</a>.</p>
<p>In total, provided we hit certain milestones next August, and with the <a href="http://blog.scraperwiki.com/2011/06/22/knight-foundation-finance-scraperwiki-for-journalism/">Knight Foundation</a> money, this means we have a cool $1,000,000 of capital.</p>
<p>Many many thanks to Aidan who has done most of the work, and now has no hair left to keep him warm in his old age (it generally takes, and took, total one person full time for 6 months to get investment).</p>
<p>And to everyone else for making a company that someone would want to invest in.</p>
<p>Short FAQ</p>
<p>1. What&#8217;s this money for? It&#8217;s for us to create a viable business, by helping coders make data do things across the web. To do that, we have to reach clear product/market fit, with paying developers, and/or with corporations. So please continue to ask of everything you/we do &#8220;is this testing, in as lean a way as possible, how we can get to product/market fit?&#8221;.</p>
<p>2. How long will it last for? If we hit the revenue in our business plan, in theory until August 2014. Even in worst cases, it&#8217;ll last a year.</p>
<p>3. Can I tell the world? Not just yet. We&#8217;re not press releasing it immediately, it&#8217;s embargoed for writing about, blogging about or tweeting about. But feel free to tell friends and family.</p>
<p>4. When&#8217;s the party? Not sure, but Monday night in Liverpool looks best to me. Who isn&#8217;t free then? (At least for an early evening drink.) Now the hard (and fun) part starts! Francis</p></blockquote>
<p>For full details, <a href="http://www.prweb.com/releases/2012/2/prweb9149117.htm">read the press release</a>. As you can see, we also have a new board member &#8211; I&#8217;ll write about her in a separate blog post. Any questions? Please ask in the comments!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/09/1-million-to-build-a-data-platform-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>
	</item>
		<item>
		<title>Big fat aspx pages for thin data</title>
		<link>http://blog.scraperwiki.com/2012/02/07/big-fat-aspx-pages-for-thin-data/</link>
		<comments>http://blog.scraperwiki.com/2012/02/07/big-fat-aspx-pages-for-thin-data/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 06:06:03 +0000</pubDate>
		<dc:creator>Julian</dc:creator>
				<category><![CDATA[developer]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216287</guid>
		<description><![CDATA[My work is more with the practice of webscraping, and less in the high-faluting business plans and product-market-fit leaning agility. At the end of the day, someone must have done some actual webscraping &#8212; and the harder it is the &#8230; <a href="http://blog.scraperwiki.com/2012/02/07/big-fat-aspx-pages-for-thin-data/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216287&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>My work is more with the practice of webscraping, and less in the high-faluting business plans and product-market-fit leaning agility.  At the end of the day, someone must have done some actual webscraping &#8212; and the harder it is the better.  </p>
<p>During the final hours of the <a href="http://blog.scraperwiki.com/2012/02/02/100-years-of-history-and-i-just-hope-that-we-do-it-justice/">Columbia University hack day</a>, I got to work on a corker in the form of the <a href="https://apps.jcope.ny.gov/lrr/Menu_reports_public.aspx">New York State Joint Committee on Public Ethics Lobbying filing system</a>.  </p>
<p>This is an <b>aspx</b> website which is truly shocking.  The programmer who made it should be fired &#8212; except it looks like he probably got it to a visibly working stage, and then simply walked away from the mess he created without finding out why it was running so slowly.  </p>
<p><span id="more-758216287"></span></p>
<p><b>Directions:</b><br />
1. Start on <a href="https://apps.jcope.ny.gov/lrr/Menu_reports_public.aspx"><b>this page</b></a>.<br />
2. Click on <b>2. Client Query  &#8211; Click here to execute Client Query</b>.<br />
3. Select Registration Year: 2011<br />
4. Click the [Search] button</p>
<p>[ Don't try to use the browser's back button as there is a piece of code on the starting page that reads:  &lt;script language="javascript"&gt;history.forward();&lt;/script&gt; ]</p>
<p>A page called <b>LB_QReports.aspx</b> will be downloaded. It looks the same as the previous page, except that it is 1.05MB long and renders a very small table which looks like this:</p>
<p><img src="http://scraperwiki.files.wordpress.com/2012/02/nyclobbytable.png?w=640" alt="" title="nyclobbytable"   class="alignright size-full wp-image-758216291" /></p>
<p>If you are able to look at the page source you will find thousands of lines of the form:</p>
<pre>&lt;div id="DisplayGrid_0_14_2367"&gt;
	&lt;a id="DisplayGrid_0_14_2367_ViewBTN"
href="javascript:__doPostBack('DisplayGrid_0_14_2367$ViewBTN','')"&gt;View&lt;/a&gt;
	&lt;/div&gt;</pre>
<p>Followed by a very long section which begins like:</p>
<pre>window.DisplayGrid = new ComponentArt_Grid('DisplayGrid');
DisplayGrid.Data = [[5400,2011,'N','11-17 ASSOCIATES, LLC','11-17
ASSOCIATES, LLC','','11-17 ASSOCIATES, LLC','NEW YORK','NY',10000,
'APR',40805,'January - June','11201',],[6596,2011,'N','114 KENMARE
ASSOCIATES, LLC','114 KENMARE ASSOCIATES, LLC','','114 KENMARE
ASSOCIATES, LLC','NEW YORK','NY',11961,'APR',41521,'January -
June','10012',],[4097,2011,'N','1199 SEIU UNITED HEALTHCARE
WORKERS EAST','1199 SEIU UNITED HEALTHCARE WORKERS EAST','','1199
SEIU UNITED HEALTHCARE WORKERS EAST','NEW YORK','NY',252081,'APR',
40344,'January - June','10036',],...</pre>
<p>This DisplayGrid object is thousands of lists long.  So although you only get 15 records in the table at a time, your browser has been given the complete set of data at once for the JavaScript to paginate.</p>
<p>Great, I thought.  This is easy.  I simply have to parse out this gigantic array as JSON and poke it into the database.  </p>
<p>Unfortunately, although it can be interpreted by the JavaScript engine, it&#8217;s not valid JSON.  The strings are in single quotes where JSON requires double quotes, there are trailing commas, and we need to deal with the escaped apostrophes.</p>
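<pre>import json
import re

# Pull the JavaScript array literal out of the page source (html).
mtable = re.search(r"(?s)DisplayGrid\.Data\s*=\s*(\[\[.*?\]\])", html)
jtable = mtable.group(1)
# Protect escaped apostrophes, swap the quote style, then restore them.
jtable = jtable.replace("\\'", ";;;APOS;;;")
jtable = jtable.replace("'", '"')
jtable = jtable.replace(";;;APOS;;;", "'")
# Drop the trailing comma that ends each row.
jtable = jtable.replace(",]", "]")
jdata = json.loads(jtable)
</pre>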
<p>Then it&#8217;s a matter of working out the headers of the table and storing it into the database.  </p>
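<p>As a rough sketch of that step, the snippet below parses the first two rows quoted above and stores them in an in-memory SQLite table. The function names, the table name <b>clients</b> and every column name are assumptions made for illustration (the "colN" ones are pure placeholders); the real headers still have to be read off the page.</p>

```python
import json
import re
import sqlite3

# Column names are guesses made from the sample rows shown earlier;
# the "colN" ones are placeholders until the real headers are worked out.
COLUMNS = ["id", "year", "col2", "name", "col4", "col5", "col6",
           "city", "state", "col9", "col10", "col11", "period", "zip"]


def parse_grid(page_source):
    """Turn the DisplayGrid.Data JavaScript literal into a list of rows."""
    found = re.search(r"(?s)DisplayGrid\.Data\s*=\s*(\[\[.*?\]\])", page_source)
    if found is None:
        raise ValueError("DisplayGrid.Data not found in page source")
    text = found.group(1)
    text = text.replace("\\'", ";;;APOS;;;")  # protect escaped apostrophes
    text = text.replace("'", '"')             # JSON wants double quotes
    text = text.replace(";;;APOS;;;", "'")    # put the apostrophes back
    text = text.replace(",]", "]")            # drop trailing commas
    return json.loads(text)


def store_rows(conn, rows):
    """Create the table if needed and insert every row."""
    conn.execute("CREATE TABLE IF NOT EXISTS clients (%s)" % ", ".join(COLUMNS))
    marks = ", ".join("?" for _ in COLUMNS)
    conn.executemany("INSERT INTO clients VALUES (%s)" % marks, rows)
    conn.commit()


# The first two rows quoted from the page above.
sample = (
    "DisplayGrid.Data = [[5400,2011,'N','11-17 ASSOCIATES, LLC',"
    "'11-17 ASSOCIATES, LLC','','11-17 ASSOCIATES, LLC','NEW YORK','NY',"
    "10000,'APR',40805,'January - June','11201',],"
    "[6596,2011,'N','114 KENMARE ASSOCIATES, LLC',"
    "'114 KENMARE ASSOCIATES, LLC','','114 KENMARE ASSOCIATES, LLC',"
    "'NEW YORK','NY',11961,'APR',41521,'January - June','10012',]];"
)

conn = sqlite3.connect(":memory:")
store_rows(conn, parse_grid(sample))
for row in conn.execute("SELECT id, name, zip FROM clients ORDER BY id"):
    print(row)
```

<p>Keeping the column list in one place means only <b>COLUMNS</b> needs changing once the real headers are known.</p>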
<p>(Un)fortunately, there is more data about each lobbying disclosure than appears in this table. Clicking the <b>View</b> link on each line brings up person names, addresses, amounts of money, and what was lobbied.</p>
<p>If you hover your mouse above one of these links you will see it&#8217;s of the form: <b>javascript:__doPostBack('DisplayGrid_0_14_2134$ViewBTN','')</b>.</p>
<p>At this point it&#8217;s worth a recap on <a href="http://blog.scraperwiki.com/2011/11/09/how-to-get-along-with-an-asp-webpage/">how to get along with an asp webpage</a>, because that is what this is.  </p>
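<p>The short version of that recap: in the browser, ASP.NET's __doPostBack copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the form. Here is a minimal sketch of pulling those two values out of such a link. The helper name <b>postback_fields</b> is hypothetical, and a real request would also need the page's other hidden fields (such as the view state) sent along with it.</p>

```python
import re
from urllib.parse import urlencode

# Matches the two quoted arguments of a __doPostBack('target','argument') call.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def postback_fields(href):
    """Return the two hidden form fields that a __doPostBack link sets."""
    found = POSTBACK_RE.search(href)
    if found is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    return {"__EVENTTARGET": found.group(1),
            "__EVENTARGUMENT": found.group(2)}


fields = postback_fields(
    "javascript:__doPostBack('DisplayGrid_0_14_2134$ViewBTN','')")
print(fields)
print(urlencode(fields))  # how these two fields look in a POST body
```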
<p>[The scraper I am working on is <a href="https://scraperwiki.com/scrapers/ny_state_lobby/"><b>ny_state_lobby</b></a>, if you want to take a look.]</p>
<p>Here is the code for getting this far, to the point where we can click on these <b>View</b> links:</p>
<pre>import mechanize

cj = mechanize.CookieJar()
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.set_cookiejar(cj)

# get to the form (with its cookies)
response = br.open("https://apps.jcope.ny.gov/lrr/Menu_reports_public.aspx")
link = None
for a in br.links():
    if a.text == 'Click here to execute Client Query':
        link = a
if link is None:
    raise ValueError("Client Query link not found on the menu page")
response = br.follow_link(link)

# fill in the form
br.select_form("form1")
br["ddlQYear"] = ["2011"]
response = br.submit()
print(response.read())  # this gives massive sets of data
# re-select the form on the results page and unlock its hidden fields
br.select_form("form1")
br.set_all_readonly(False)
</pre>
<p>The way to do those clicks onto &#8220;DisplayGrid_0_14_%d$ViewBTN&#8221; (View) buttons is with the following function that does the appropriate __doPostBack action. </p>
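<pre># br1 is not defined in this excerpt.  Assumption: it is a second browser
# sharing the same cookie jar, so that br keeps its selected form.
br1 = mechanize.Browser()
br1.set_cookiejar(cj)

def GetLobbyGrid(d, year):
    dt = 'DisplayGrid_0_14_%d$ViewBTN' % d
    br["__EVENTTARGET"] = dt
    br["__EVENTARGUMENT"] = ''
    br.find_control("btnSearch").disabled = True
    request = br.click()
    response1 = br1.open(request)
    print(response1.read())
</pre>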
<p>&#8230;And you will find you have got exactly the same page as before &#8212; including that 1MB fake JSON data blob.  </p>
<p>Except it&#8217;s not quite exactly the same.  There is a tiny new little section of javascript in the page, right at the bottom.  (I believe I discovered it by watching the network traffic on the browser when following the link.)</p>
<pre>&lt;script language=javascript&gt;var myWin;myWin=window.open(
'LB_HtmlCSR.aspx?x=EOv...QOn','test','width=900,height=450,toolbar
=no,titlebar=no,location=center,directories=no, status=no,menubar=
yes,scrollbars=yes,resizable=yes');myWin.focus();&lt;/script&gt;
</pre>
<p>This contains the secret new link you have to click on to get the lobbyist information.  </p>
<pre>    # (continues inside GetLobbyGrid; needs: import re, urlparse, lxml.html)
    html1 = response1.read()
    root1 = lxml.html.fromstring(html1)
    loblink = None
    for s in root1.cssselect("script"):
        if s.text:
            ms = re.match(r"var myWin;myWin=window\.open\('(LB_HtmlCSR\.aspx\?.*?)',", s.text)
            if ms:
                loblink = ms.group(1)
    if loblink is None:
        raise ValueError("no window.open link found in the response")
    uloblink = urlparse.urljoin(br1.geturl(), loblink)
    response2 = br1.open(uloblink)
    print(response2.read())   # this is the page you want
</pre>
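<p>The link extraction can also be tried on its own, without lxml or a live session. This is only a sketch: the function name <b>find_popup_url</b> is made up, the page URL is an assumption about where LB_QReports.aspx lives, and the query string is left elided exactly as it is shown above.</p>

```python
import re
from urllib.parse import urljoin

# Looks for the first argument of window.open(...), allowing whitespace
# (for example a line break) between the bracket and the quote.
WINDOW_OPEN_RE = re.compile(r"window\.open\(\s*'(LB_HtmlCSR\.aspx\?[^']*)'")


def find_popup_url(page_url, script_text):
    """Return the absolute URL of the pop-up page, or None if absent."""
    found = WINDOW_OPEN_RE.search(script_text)
    if found is None:
        return None
    return urljoin(page_url, found.group(1))


script = ("var myWin;myWin=window.open('LB_HtmlCSR.aspx?x=EOv...QOn','test',"
          "'width=900,height=450');myWin.focus();")
# Assumed location of the results page.
print(find_popup_url("https://apps.jcope.ny.gov/lrr/LB_QReports.aspx", script))
```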
<p>So, anyway, that&#8217;s where I&#8217;m up to.  I&#8217;ve started the <a href="https://scraperwiki.com/scrapers/ny_state_lobby_parse/edit/">ny_state_lobby_parse</a> scraper to work on these pages, but I don&#8217;t have time to carry it on right now (too much blogging).  </p>
<p>The scraper itself is going to operate very slowly because for each record it needs to download 1MB of uselessly generated data to get the individual link to the lobbyist.  And I don&#8217;t have reliable unique keys for it yet.  It&#8217;s possible I could make them by associating the button name with the corresponding record from that DisplayGrid table, but that&#8217;s for later.  </p>
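<p>One way that association might work is sketched below. It rests on an untested assumption: that the trailing number in a button name such as DisplayGrid_0_14_2367$ViewBTN is the index of the row in DisplayGrid.Data. The function name is hypothetical and the rows are cut down to their first three fields.</p>

```python
import re

BUTTON_RE = re.compile(r"DisplayGrid_\d+_\d+_(\d+)\$ViewBTN$")


def record_for_button(button_name, rows):
    """Look up the grid row that a View button is assumed to belong to."""
    found = BUTTON_RE.match(button_name)
    if found is None:
        raise ValueError("unexpected button name: %r" % button_name)
    index = int(found.group(1))  # ASSUMPTION: trailing number == row index
    return tuple(rows[index])


rows = [[5400, 2011, "N"], [6596, 2011, "N"]]  # shortened rows for illustration
print(record_for_button("DisplayGrid_0_14_1$ViewBTN", rows))
```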
<p>For now I&#8217;ve got to go and do other things.  But at least we&#8217;re a little closer to having the picture of what is being disclosed into this database.  The big deal, as always, is finishing it off.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/07/big-fat-aspx-pages-for-thin-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/ae3cb03a98a6470bdf839dd84a226e47?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">goatchurch</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/nyclobbytable.png" medium="image">
			<media:title type="html">nyclobbytable</media:title>
		</media:content>
	</item>
		<item>
		<title>100 Years of history&#8230;and I just hope that we do it justice&#8230;</title>
		<link>http://blog.scraperwiki.com/2012/02/02/100-years-of-history-and-i-just-hope-that-we-do-it-justice/</link>
		<comments>http://blog.scraperwiki.com/2012/02/02/100-years-of-history-and-i-just-hope-that-we-do-it-justice/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 18:36:45 +0000</pubDate>
		<dc:creator>ainemcguire</dc:creator>
				<category><![CDATA[events]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216222</guid>
		<description><![CDATA[Columbia University, arguably the best journalism school in the world, is giving us the opportunity of a lifetime. We are hosting our first ever US event (Journalism Data Camp #jdcny) and it&#8217;s their first hackathon in a proud 100-year &#8230; <a href="http://blog.scraperwiki.com/2012/02/02/100-years-of-history-and-i-just-hope-that-we-do-it-justice/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216222&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<div id="attachment_758216224" class="wp-caption alignleft" style="width: 650px"><a href="http://scraperwiki.files.wordpress.com/2012/02/p1300222.jpg"><img class="size-full wp-image-758216224" title="OLYMPUS DIGITAL CAMERA" src="http://scraperwiki.files.wordpress.com/2012/02/p1300222.jpg?w=640&#038;h=479" alt="" width="640" height="479" /></a><p class="wp-caption-text">Journalism School Columbia University</p></div>
<p><a href="http://www.columbia.edu/">Columbia University</a>, arguably the best journalism school in the world, is giving us the opportunity of a lifetime. We are hosting our first ever US event (Journalism Data Camp #jdcny) and it&#8217;s their first hackathon in a proud 100-year history. It is scary but very exciting! The campus is lovely and welcoming.  Located in uptown Manhattan just above Central Park, the &#8216;J&#8217; School building is one of the oldest in the complex.</p>
<div id="attachment_758216225" class="wp-caption aligncenter" style="width: 650px"><a href="http://scraperwiki.files.wordpress.com/2012/02/p1300217.jpg"><img class="size-full wp-image-758216225" title="OLYMPUS DIGITAL CAMERA" src="http://scraperwiki.files.wordpress.com/2012/02/p1300217.jpg?w=640&#038;h=479" alt="" width="640" height="479" /></a><p class="wp-caption-text">Joseph Pulitzer</p></div>
<p>The entrance hallway has a bust of <a title="Joseph Pulitzer" href="http://en.wikipedia.org/wiki/Joseph_Pulitzer">Joseph Pulitzer</a>, the Hungarian-American publisher who established Columbia as the world&#8217;s first school of journalism.  We also know him for the <a href="http://www.pulitzer.org/">Pulitzer Prize</a>, which has been synonymous with excellence in journalism and the arts since 1917.</p>
<p>Our event is designed as an attempt to marry highly skilled journalists with data scientists, coders, statisticians and technology!  As he was such an innovator we would hope that Joseph would have approved! Our mission is to &#8216;Liberate Data&#8217; to allow the professionals to hold power and money to account and to perform the exemplary role of being guardian of freedom of speech and to ask hard questions.</p>
<p><a href="http://www.journalism.columbia.edu/profile/304-emily-bell/10">Emily Bell</a> and <a href="http://strataconf.com/strata2012/public/schedule/speaker/126525">Francis Irving</a> will do a short introduction on their thoughts on digital journalism and the world of data.</p>
<p><strong>Agenda&#8230;here it is&#8230;</strong></p>
<p>&#8230;this is a fairly crude cropped PDF which we created with the ScraperWiki cropper tool and  it works beautifully.</p>
<p><a title="PDF excerpt" href="http://scraperwiki.files.wordpress.com/2012/02/columbia-agenda.pdf"><img style="border:thin black solid;width:400px;" src="https://scraperwiki.com/cropper/png/u/page_1/clip_75,19_1000,409/?url=http%3A%2F%2Fscraperwiki.files.wordpress.com%2F2012%2F02%2Fcolumbia-agenda.pdf" alt="" /></a></p>
<p>So what exactly are we going to do tomorrow  and Saturday?  We think that we have packed the event with stuff!</p>
<p><strong>Project Data Derby</strong></p>
<p>The project data derby is where people will work together in teams to create &#8216;data driven&#8217; stories and applications. This is a facilitated session where people will learn and understand the various techniques and skills to work with data. We will look to have multiple skill sets in a team and they will be encouraged to follow a process that will give the best outcome at the end of the two days!</p>
<p><a href="http://scraperwiki.files.wordpress.com/2012/02/data-derby-map.jpg"><img class="aligncenter size-large wp-image-758216247" title="Data-Derby-Map" src="http://scraperwiki.files.wordpress.com/2012/02/data-derby-map.jpg?w=1024&#038;h=723" alt="" width="1024" height="723" /></a></p>
<p><strong>Liberate the Data</strong></p>
<p>We have been asking people to nominate data sets for the past few weeks and we <a href="http://blog.scraperwiki.com/2012/02/02/journalism-data-camp-ny-potential-data-sets/">already have a list</a>!  We will put these on index cards and ask our &#8216;Data Liberators&#8217; to dig up the data, get it into a structured format and publish it for the project teams and the world to see and reuse!</p>
<p><strong>Learn to Scrape</strong></p>
<p><a href="http://scraperwiki.files.wordpress.com/2012/02/p1300216.jpg"><img class="alignleft size-thumbnail wp-image-758216227" title="OLYMPUS DIGITAL CAMERA" src="http://scraperwiki.files.wordpress.com/2012/02/p1300216.jpg?w=150&#038;h=112" alt="" width="150" height="112" /></a>We are also running two three-hour tutorials on Friday, in the morning <a href="http://scraperwiki.files.wordpress.com/2012/02/p1300215.jpg"><img class="alignright size-thumbnail wp-image-758216226" title="OLYMPUS DIGITAL CAMERA" src="http://scraperwiki.files.wordpress.com/2012/02/p1300215.jpg?w=150&#038;h=112" alt="" width="150" height="112" /></a>(Python) and in the afternoon (Ruby).  Our chief data scientist <a href="http://en.wikipedia.org/wiki/Julian_Todd">Julian Todd</a> and data advocate <a href="http://thomaslevine.com">Thomas Levine</a> will run the sessions, with <a href="http://codeforamerica.org/author/michelle/">Michelle Koeth</a> (Code for America Fellow) assisting the students.   They will cover things like identifying good targets for webscraping and navigating the complexity of different types of web pages.  Attendees will create their own scrapers to get and analyse the Department of Labor&#8217;s Unionreports.gov (Collective bargaining agreement listings).  The objective will be to get the data into a structured format, and join it with data from the US census in order to establish the number and order of union employees across the US by state.  If time allows we will also try to encourage people to do further analysis.</p>
<p>We will have prizes for the best projects, most daring data liberation and the craziest constructed data scrapers.   Our judges will be <a href="http://www.pbs.org/idealab/2010/04/programmer-journalist-hacker-journalist-our-identity-crisis107.html">Aron Pilhofer</a> from the New York Times and <a href="http://www.journalism.columbia.edu/profile/365-susan-e-mcgregor/10">Susan E McGregor</a>, Assistant Professor of Journalism at Columbia.</p>
<p><strong>And there&#8217;s more&#8230;</strong></p>
<p>Hear from the pros!  On Friday and Saturday we are running some lunchtime lightning sessions in the plenary room. You will hear from <strong><a href="http://sunlightfoundation.com/people/tlee/">Tom Lee</a></strong> from the Sunlight Foundation, who will talk about a joint project between ScraperWiki and Sunlight, and <strong><a href="http://www.youtube.com/watch?v=YMPlebcNyuM">Jake Porway</a></strong> &#8211; <strong><a href="http://datawithoutborders.cc/">Data without Borders</a></strong>, who will talk about some of the fabulous projects that they are currently running.</p>
<p>A big big thanks to Tahiat Mahboob and Sam Guzik, our two Digital Media Associates at Columbia University Graduate School of Journalism, who have very generously offered their time to film the event!  Hats off to the facilities team at Columbia for all their help with the logistics.</p>
<p>We look forward to seeing everyone who has signed up and a million thanks for supporting us.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2012/02/p1300220.jpg"><img class="aligncenter size-full wp-image-758216261" title="OLYMPUS DIGITAL CAMERA" src="http://scraperwiki.files.wordpress.com/2012/02/p1300220.jpg?w=640&#038;h=479" alt="" width="640" height="479" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/02/100-years-of-history-and-i-just-hope-that-we-do-it-justice/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a801e770feed3df03f36195443374935?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ainemcguire</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/p1300222.jpg" medium="image">
			<media:title type="html">OLYMPUS DIGITAL CAMERA</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/p1300217.jpg" medium="image">
			<media:title type="html">OLYMPUS DIGITAL CAMERA</media:title>
		</media:content>

		<media:content url="https://scraperwiki.com/cropper/png/u/page_1/clip_75,19_1000,409/?url=http%3A%2F%2Fscraperwiki.files.wordpress.com%2F2012%2F02%2Fcolumbia-agenda.pdf" medium="image" />

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/data-derby-map.jpg?w=1024" medium="image">
			<media:title type="html">Data-Derby-Map</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/p1300216.jpg?w=150" medium="image">
			<media:title type="html">OLYMPUS DIGITAL CAMERA</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/p1300215.jpg?w=150" medium="image">
			<media:title type="html">OLYMPUS DIGITAL CAMERA</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/p1300220.jpg" medium="image">
			<media:title type="html">OLYMPUS DIGITAL CAMERA</media:title>
		</media:content>
	</item>
		<item>
		<title>Journalism Data Camp NY potential data sets</title>
		<link>http://blog.scraperwiki.com/2012/02/02/journalism-data-camp-ny-potential-data-sets/</link>
		<comments>http://blog.scraperwiki.com/2012/02/02/journalism-data-camp-ny-potential-data-sets/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 15:20:48 +0000</pubDate>
		<dc:creator>Julian</dc:creator>
				<category><![CDATA[developer]]></category>
		<category><![CDATA[events]]></category>
		<category><![CDATA[journalism]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216196</guid>
		<description><![CDATA[Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday. This list is only for backup in case not enough ideas show up with people on the day (never &#8230; <a href="http://blog.scraperwiki.com/2012/02/02/journalism-data-camp-ny-potential-data-sets/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216196&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Here is a review of some of the datasets that have been submitted for the <a href="https://scraperwiki.com/events/jdcny/">Columbia Journalism Data Camp</a> this Friday.</p>
<p>This list is only for backup in case not enough ideas show up with people on the day (never happens, but it&#8217;s always a fear).</p>
<h2><span style="text-decoration:underline;">1. Iowa accident reports</span></h2>
<p><img class="alignright size-full wp-image-758216206" title="iowacrashsm" src="http://scraperwiki.files.wordpress.com/2012/02/iowacrashsm.png?w=640" alt=""   /></p>
<p>The site <a href="http://accidentreports.iowa.gov/">http://accidentreports.iowa.gov</a> contains all the police roadside reports of accidents. It&#8217;s easy to scrape because the database ids are consecutive numbers:</p>
<p><a href="http://accidentreports.iowa.gov/index.php?pgname=IDOT_IOR_MV_Accident_details&amp;id=50070">http://accidentreports.iowa.gov/index.php?pgname=IDOT_IOR_MV_Accident_details&amp;id=50070</a></p>
<p>And it contains thousands of rinky-dink diagrams of the incidents.</p>
<p>The first step is to copy all the HTML from each page into one database. The second step is to scan through all these pages and progressively extract more and more data from them.</p>
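<p>A minimal sketch of that first step, in Python 3. The table layout, the function names and the <code>fetch</code> callable are assumptions made for illustration; only the URL pattern comes from the link above.</p>
<pre>import sqlite3
from urllib.parse import urlencode

BASE = "http://accidentreports.iowa.gov/index.php"

def report_url(report_id):
    # Build the details-page URL for one consecutive database id.
    query = urlencode({"pgname": "IDOT_IOR_MV_Accident_details", "id": report_id})
    return BASE + "?" + query

def store_pages(conn, ids, fetch):
    # Step one only: keep the raw HTML of every page, parse it later.
    # fetch is any callable that takes a URL and returns the page text.
    conn.execute("CREATE TABLE IF NOT EXISTS pages "
                 "(id INTEGER PRIMARY KEY, url TEXT, html TEXT)")
    for report_id in ids:
        url = report_url(report_id)
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                     (report_id, url, fetch(url)))
    conn.commit()</pre>
<p>Keeping the untouched HTML means the second step can be re-run as often as needed without downloading anything twice.</p>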
<p>Contrast with dataset of accidents <a href="https://scraperwiki.com/scrapers/roadaccidents_1/">available for the UK</a>.</p>
<h2><span style="text-decoration:underline;">2. South Dakota state budget information</span></h2>
<p>An apparently complete set of expenditures, contracts and revenues is disclosed on <a href="http://open.sd.gov/">http://open.sd.gov/</a> in a form that is easy to scrape (some datasets even allow CSV download). Many states do this, with varying degrees of success.</p>
<p>Use this case to learn how to restructure and analyse financial accountancy flow information. Can you find any contracts that have suddenly been dropped in favour of another supplier?</p>
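<p>One way to start on the dropped-contract question, as a rough Python 3 sketch. It assumes the data has already been reduced to (year, service, vendor) rows with one vendor per service per year; those column names are hypothetical and the real open.sd.gov files will need reshaping first.</p>
<pre>from collections import defaultdict

def supplier_changes(rows):
    # rows: iterable of (year, service, vendor) tuples.
    # Returns (service, year, old_vendor, new_vendor) for each change of vendor.
    by_service = defaultdict(dict)
    for year, service, vendor in rows:
        by_service[service][year] = vendor
    changes = []
    for service, by_year in sorted(by_service.items()):
        years = sorted(by_year)
        # Compare each recorded year with the next recorded year.
        for previous, current in zip(years, years[1:]):
            if by_year[previous] != by_year[current]:
                changes.append((service, current, by_year[previous], by_year[current]))
    return changes</pre>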
<h2><span style="text-decoration:underline;">3. New York School budgets</span></h2>
<p>The site <a href="http://schools.nyc.gov/AboutUs/funding/schoolbudgets/GalaxyAllocationFY2012.htm">schools.nyc.gov/AboutUs/funding/schoolbudgets/GalaxyAllocationFY2012</a> requires a school code. Try &#8220;M411&#8221;.</p>
<p>Apparently there is <a href="http://schools.nyc.gov/NR/rdonlyres/E595859D-5AF8-4100-AB4B-6058527FA427/0/2010_2011_EMS_PR_Results_2011_11_30.xlsx">this spreadsheet</a> of school codes.</p>
<p><img class="alignright size-full wp-image-758216210" title="nyschoolsnapple" src="http://scraperwiki.files.wordpress.com/2012/02/nyschoolsnapple.png?w=640&#038;h=301" alt="" width="640" height="301" /></p>
<p>Is there anything interesting to plot across all schools, such as the <strong>PSAL SNAPPLE FUNDS</strong>?</p>
<h2><span style="text-decoration:underline;">4. New York Lobbying registers</span></h2>
<p>Lobbying at the <a href="https://apps.jcope.ny.gov/lrr/menu_reports_public2.aspx">state</a> and <a href="http://www.nyc.gov/lobbyistsearch/">city</a> level. Some of this is challenging.</p>
<p>Is there a cross-over between the jurisdictions? Can you uniquely identify the corporate interests and relate them to the legislative or regulatory program?</p>
<h2><span style="text-decoration:underline;">5. Court case information</span></h2>
<p>Go to <a href="https://www.dccourts.gov/cco/">https://www.dccourts.gov/cco/</a> (Try &#8220;Lockheed&#8221;). Not obvious where the information is.</p>
<p>The <a href="http://iapps.courts.state.ny.us/iscroll/AdvSearch.html">New York City courts</a> are behind a captcha. Maybe better luck with the <a href="http://www.courts.state.ny.us">New York State courts</a>.</p>
<p>Court datasets are usually very difficult to obtain and jealously protected. The legal process resists modernization and is universally paper based. Electronic documents (contracts, settlements, filings) almost always turn out to be image scans of papers.</p>
<h2><span style="text-decoration:underline;">6. New York City Police crime data</span></h2>
<p>There are <a href="http://www.nyc.gov/html/nypd/html/crime_prevention/crime_statistics.shtml">weekly PDFs</a> for each police precinct. These are taken down and replaced by the next one, so there is no historical record.</p>
<p>Luckily someone has <a href="https://scraperwiki.com/scrapers/current-week-reported-crime-city-wide-and-for-prec/">scraped the data</a> since 2010, though the numbers may need some processing before you map them.</p>
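<p>For any other source that overwrites its files like this, a small archiving step keeps the history. The Python 3 sketch below is an illustration only (the function name and the file naming scheme are made up, and it is not how the scraper linked above works): it saves a download under a name that includes a hash of its content, so a replaced PDF is stored as a new file and an unchanged one is skipped.</p>
<pre>import hashlib
import os

def archive_snapshot(data, directory, label):
    # data: the downloaded file as bytes; label: e.g. a precinct name.
    digest = hashlib.sha256(data).hexdigest()[:16]
    path = os.path.join(directory, label + "-" + digest + ".pdf")
    if os.path.exists(path):
        return None          # identical content was already archived
    with open(path, "wb") as handle:
        handle.write(data)
    return path</pre>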
<h2><span style="text-decoration:underline;">7. New York State gas drilling permits</span></h2>
<p><img class="alignright size-full wp-image-758216214" title="StatfjordA_Jarvin1982__reasonably_small" src="http://scraperwiki.files.wordpress.com/2012/02/statfjorda_jarvin1982__reasonably_small.jpg?w=640" alt=""   /><br />
These are <a href="http://www.dec.ny.gov/cfmx/extapps/GasOil/standard/drilling-permits/">available</a> but don&#8217;t seem to have been updated recently. What&#8217;s going on?</p>
<p>Wouldn&#8217;t it be nice to make another twitterbot to be friends with <a href="https://twitter.com/northseaoil1">NorthSeaOil1</a>?</p>
<p>Don&#8217;t forget to read the <a href="http://www.dec.ny.gov/cfmx/extapps/GasOil/search/transfers/index.cfm">Well ownership transfers</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/02/journalism-data-camp-ny-potential-data-sets/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/ae3cb03a98a6470bdf839dd84a226e47?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">goatchurch</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/iowacrashsm.png" medium="image">
			<media:title type="html">iowacrashsm</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/nyschoolsnapple.png" medium="image">
			<media:title type="html">nyschoolsnapple</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/statfjorda_jarvin1982__reasonably_small.jpg" medium="image">
			<media:title type="html">StatfjordA_Jarvin1982__reasonably_small</media:title>
		</media:content>
	</item>
		<item>
		<title>#JDCNY Journalism Data Camp New York &#8211; Agenda (Fri 3rd and Sat 4th Feb)</title>
		<link>http://blog.scraperwiki.com/2012/02/01/jdcny-journalism-data-camp-new-york-agenda-fri-3rd-and-sat-4th-feb/</link>
		<comments>http://blog.scraperwiki.com/2012/02/01/jdcny-journalism-data-camp-new-york-agenda-fri-3rd-and-sat-4th-feb/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 18:58:42 +0000</pubDate>
		<dc:creator>ainemcguire</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[#jdcny]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216156</guid>
		<description><![CDATA[Here is a breakdown of what we are planning for Friday and Saturday&#8230; Click the image to see the PDF! Thank you<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216156&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Here is a breakdown of what we are planning for Friday and Saturday&#8230;</p>
<p><a title="PDF excerpt" href="http://scraperwiki.files.wordpress.com/2012/02/columbia-agenda.pdf"><img style="border:thin black solid;width:400px;" src="https://scraperwiki.com/cropper/png/u/page_1/clip_75,19_1000,409/?url=http%3A%2F%2Fscraperwiki.files.wordpress.com%2F2012%2F02%2Fcolumbia-agenda.pdf" alt="" /></a></p>
<p>Click the image to see the PDF!</p>
<p>Thank you</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/01/jdcny-journalism-data-camp-new-york-agenda-fri-3rd-and-sat-4th-feb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a801e770feed3df03f36195443374935?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ainemcguire</media:title>
		</media:content>

		<media:content url="https://scraperwiki.com/cropper/png/u/page_1/clip_75,19_1000,409/?url=http%3A%2F%2Fscraperwiki.files.wordpress.com%2F2012%2F02%2Fcolumbia-agenda.pdf" medium="image" />
	</item>
		<item>
		<title>&#8220;the impact on our industry only begins this weekend&#8221; says Susan E McGregor, Professor at the world&#8217;s foremost school of journalism</title>
		<link>http://blog.scraperwiki.com/2012/02/01/the-impact-on-our-industry-only-begins-this-weekend-says-susan-mcgregor-professor-at-the-worlds-foremost-school-of-journalism/</link>
		<comments>http://blog.scraperwiki.com/2012/02/01/the-impact-on-our-industry-only-begins-this-weekend-says-susan-mcgregor-professor-at-the-worlds-foremost-school-of-journalism/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 18:15:23 +0000</pubDate>
		<dc:creator>ainemcguire</dc:creator>
				<category><![CDATA[events]]></category>
		<category><![CDATA[journalism]]></category>
		<category><![CDATA[cuny]]></category>
		<category><![CDATA[data journalism]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216134</guid>
		<description><![CDATA[This is a guest blog post by Susan E. McGregor &#8211; Assistant Professor at the Tow Center for Digital Journalism Columbia University The Tow Center for Digital Journalism at Columbia University Graduate School of Journalism is proud to be partnering &#8230; <a href="http://blog.scraperwiki.com/2012/02/01/the-impact-on-our-industry-only-begins-this-weekend-says-susan-mcgregor-professor-at-the-worlds-foremost-school-of-journalism/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216134&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><em>This is a guest blog post by Susan E. McGregor &#8211; Assistant Professor at the Tow Center for Digital Journalism, Columbia University</em></p>
<p><em><a href="http://scraperwiki.files.wordpress.com/2012/02/smcgregor_1128111.jpg"><img class="alignleft size-full wp-image-758216163" title="SMcGregor_112811" src="http://scraperwiki.files.wordpress.com/2012/02/smcgregor_1128111.jpg?w=640" alt=""   /></a> </em></p>
<p><a href="http://towcenter.org/">The Tow Center for Digital Journalism</a> at <a href="http://www.journalism.columbia.edu/">Columbia University Graduate School of Journalism</a> is proud to be partnering with <a href="http://knightfoundation.org/grants/20112812/">Knight News Challenge</a> winner <a href="https://scraperwiki.com/">ScraperWiki</a> this Friday and Saturday for their first <a href="https://scraperwiki.com/events/jdcny/">Journalism Data Camp</a> in the U.S. This event provides us with an opportunity to host a wide range of programmers, journalists and educators interested in expanding access to essential data sets, while connecting those communities to one another. We are also looking forward to extending the impact of this weekend’s activities by working in conjunction with our colleagues at the <a href="http://stabilecenter.org/">Stabile Center for Investigative Journalism</a> and <a href="http://www.thenewyorkworld.com/">The New York World</a> to further pursue those stories related to New York accountability issues that may be touched on during this weekend’s data “liberation” activities.</p>
<p>As an online tool, ScraperWiki is an innovative technical platform that allows users to build, test, and execute programmatic &#8220;scrapers&#8221; that transform web pages and PDFs into more accessible, usable data formats. As an online archive and repository, ScraperWiki helps improve access to scraped data sets by making them collectively available on its website. Finally, as a web-based collaboration space, ScraperWiki helps convene journalists and programmers around projects of shared interest, in addition to fostering peer-to-peer training and support.</p>
<p>Each of the above features of the ScraperWiki platform resonates closely with the Tow Center’s own priorities for data journalism. Making data available in formats that can be easily parsed, analyzed, and distributed is an essential part of data transparency, and the accountability journalism it serves. Providing a public access point for that data allows both journalists and their audiences to fact-check and elaborate upon the work that their peers have done, leveraging it against future projects and creating more comprehensive resources. And of course, the knowledge sharing and collaboration that takes place between programmers and journalists through ScraperWiki echoes the Tow Center’s mandate to educate and innovate at the intersection of computer science and journalism, both through its own <a href="http://www.journalism.columbia.edu/page/276-dualdegree-journalism-computer-science/279">dual-degree program in computer science and journalism</a>, and through such public events as this one.</p>
<p>While we are certain that ScraperWiki will find ready adoption in cities and newsrooms throughout the country in the months to come, we look forward to growing an ongoing relationship with ScraperWiki and its contributors here in the New York area. By hosting this event we hope to introduce many of our students and colleagues to a truly remarkable tool, one whose impact on our industry only begins this weekend.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/02/01/the-impact-on-our-industry-only-begins-this-weekend-says-susan-mcgregor-professor-at-the-worlds-foremost-school-of-journalism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a801e770feed3df03f36195443374935?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ainemcguire</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/02/smcgregor_1128111.jpg" medium="image">
			<media:title type="html">SMcGregor_112811</media:title>
		</media:content>
	</item>
		<item>
		<title>How to stop missing the good weekends</title>
		<link>http://blog.scraperwiki.com/2012/01/20/how-to-stop-missing-the-good-weekends/</link>
		<comments>http://blog.scraperwiki.com/2012/01/20/how-to-stop-missing-the-good-weekends/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 09:27:12 +0000</pubDate>
		<dc:creator>Julian</dc:creator>
				<category><![CDATA[developer]]></category>
		<category><![CDATA[Scrapers]]></category>
		<category><![CDATA[alerts]]></category>
		<category><![CDATA[email alerts]]></category>
		<category><![CDATA[emails]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scraperwiki]]></category>
		<category><![CDATA[weather]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758215936</guid>
		<description><![CDATA[Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to play on my dive kayaks &#8212; an activity which is somewhat weather &#8230; <a href="http://blog.scraperwiki.com/2012/01/20/how-to-stop-missing-the-good-weekends/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758215936&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://scraperwiki.files.wordpress.com/2012/01/michael-fish.jpg"><img class="alignright  wp-image-758216121" title="Michael Fish presenting the weather" src="http://scraperwiki.files.wordpress.com/2012/01/michael-fish.jpg?w=330&#038;h=216" alt="The BBC's Michael Fish presenting the weather in the 80s, with a ScraperWiki tractor superimposed over Liverpool" width="330" height="216" /></a>Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to <a href="http://igniteliverpool.defnetmedia.com/2012/01/julian-todd-%E2%80%93-kayak-diving-in-the-uk/">play on my dive kayaks</a> &#8212; an activity which is somewhat weather dependent.</p>
<p>Luckily, help is at hand in the form of the ScraperWiki email alert system.</p>
<p>As you may have noticed, when you do any work on ScraperWiki, you start to receive daily emails that go:</p>
<pre>Dear Julian_Todd,

Welcome to your personal ScraperWiki email update.

Of the 320 scrapers you own, and 157 scrapers you have edited, we
have the following news since 2011-12-01T14:51:34:

Histparl MP list - https://scraperwiki.com/scrapers/histparl_mp_list :
  * ran 1 times producing 0 records from 2 pages
  * with 1 exceptions, (XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '&lt;!DOCTYP')

...<em>Lots more of the same</em>

This concludes your ScraperWiki email update till next time.

Please follow this link to change how often you get these emails,
or to unsubscribe: https://scraperwiki.com/profiles/edit/#alerts</pre>
<p>The idea behind this is to attract your attention to matters you may be interested in &#8212; such as fixing those poor dear scrapers you have worked on in the past and are now neglecting.</p>
<p>As with all good features, this was implemented as a quick hack.</p>
<p>I thought: why design a whole email alert system, with special options for daily and weekly emails, when we already have a scraper scheduling system which can do just that?</p>
<p>With the addition of a single flag to designate a scraper as an emailer (plus <a href="https://bitbucket.org/ScraperWiki/scraperwiki/src/471a8cb3abf3/web/codewiki/viewsrpc.py#cl-295">a further 20 lines of code</a>), a new fully fledged extensible feature was born.</p>
<p>Of course, this is not counting the code that is in the Wiki part of ScraperWiki.</p>
<p>The default code in your emailer looks roughly like so:</p>
<pre>import scraperwiki
emaillibrary = scraperwiki.utils.swimport("general-emails-on-scrapers")
subjectline, headerlines, bodylines, footerlines = emaillibrary.EmailMessageParts("onlyexceptions")
if bodylines:
    print "\n".join([subjectline] + headerlines + bodylines + footerlines)</pre>
<p>As you can see, it imports the 138 lines of Python from <a href="https://scraperwiki.com/scrapers/general-emails-on-scrapers/edit/">general-emails-on-scrapers</a>, which I am not here to talk about right now.</p>
<h3 style="font-family:'Helvetica Neue', Arial, Helvetica, 'Nimbus Sans L', sans-serif;font-weight:bold;">Using ScraperWiki emails to watch the weather</h3>
<p>Instead, what I want to explain is how I inserted my <strong>Good Weather Weekend Watcher</strong> by polling the <a href="http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html">weather forecast for Holyhead</a>.</p>
<p>My <a href="https://scraperwiki.com/scrapers/julian_todd-email-alert/edit/#">extra code</a> goes like this:</p>
<pre>import datetime
import urllib        # Python 2 urllib, in keeping with the print statements used here
import lxml.html

weatherlines = [ ]
if datetime.date.today().weekday() == 2:  # Wednesday (Monday is 0)
    url = "http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html"
    html = urllib.urlopen(url).read()
    root = lxml.html.fromstring(html)
    rows = root.cssselect("div.tableWrapper table tr")
    for row in rows:
        #print lxml.html.tostring(row)
        metweatherline = row.text_content().strip()
        if metweatherline[:3] == "Sat":  # keep only the Saturday forecast row
            subjectline += " With added weather"
            weatherlines.append("*** Weather warning for the weekend:")
            weatherlines.append("   " + metweatherline)
            weatherlines.append("")</pre>
<p>What this does is check if today is Wednesday (day of the week #2 in <a href="http://docs.python.org/library/datetime.html#datetime.date.weekday">Python land</a>), then it parses through the <a href="http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html">Met Office Weather Report table</a> for my chosen location, and pulls out the row for Saturday.</p>
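<p>The two checks can be tried in isolation with a small Python 3 sketch. The helper names and the sample row text are made up for illustration and are not part of the emailer.</p>
<pre>import datetime

WEDNESDAY = 2  # date.weekday() counts from Monday = 0

def is_wednesday(day):
    return day.weekday() == WEDNESDAY

def saturday_rows(row_texts):
    # Keep the stripped text of every table row that starts with "Sat".
    return [text.strip() for text in row_texts if text.strip().startswith("Sat")]

print(is_wednesday(datetime.date(2012, 1, 18)))   # the Wednesday before this post
print(saturday_rows(["  Sat 21Jan Day ", "Sun 22Jan Day"]))</pre>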
<p>Finally we have to handle producing the combined email message, the one which can contain <strong>either</strong> a set of broken scraper alerts, <strong>or</strong> the weather forecast, <strong>or</strong> both.</p>
<pre>if bodylines or weatherlines:
    if not bodylines:
        headerlines, footerlines = [ ], [ ]   # kill off cruft surrounding no message
    print "\n".join([subjectline] + weatherlines + headerlines + bodylines + footerlines)</pre>
<p>The current state of the result is:</p>
<pre>*** Weather warning for the weekend:
  Mon 5Dec
  Day

  7 °C
  W
  33 mph
  47 mph
  Very Good</pre>
<p>This was a very quick low-level implementation of the idea with no formatting and no filtering yet.</p>
<p>Email alerts can quickly become sophisticated and complex. Maybe I should only send a message out if the wind is below a certain speed. Should I monitor previous days&#8217; weather to predict whether the sea will be calm? Or I could check the wave heights on the off-shore buoys? Perhaps my calendar should be consulted for prior engagements so I don&#8217;t get frustrated by being told I am missing out on a good weekend when I had promised to go to a wedding.</p>
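<p>The wind speed idea could start as small as the following Python 3 sketch. It assumes the lines ending in &#8220;mph&#8221; in the output above are the wind speed and the gust speed, which this post does not confirm, and the 15 mph limit and function names are arbitrary placeholders.</p>
<pre>import re

def mph_values(lines):
    # Collect the number from every line that looks like "33 mph".
    values = []
    for line in lines:
        match = re.match(r"\s*(\d+)\s*mph\s*$", line)
        if match:
            values.append(int(match.group(1)))
    return values

def calm_enough(lines, max_mph=15):
    # True only if some speed was found and none exceeds the limit.
    speeds = mph_values(lines)
    return bool(speeds) and max_mph >= max(speeds)</pre>
<p>With the forecast shown above, <code>calm_enough</code> would return False and the weather lines could be left out of the email.</p>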
<p>The possibilities are endless and so much more interesting than if we&#8217;d implemented this email alert feature in the traditional way, rather than taking advantage of the utterly unique platform that we happened to already have in ScraperWiki.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/01/20/how-to-stop-missing-the-good-weekends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/ae3cb03a98a6470bdf839dd84a226e47?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">goatchurch</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/01/michael-fish.jpg" medium="image">
			<media:title type="html">Michael Fish presenting the weather</media:title>
		</media:content>
	</item>
		<item>
		<title>The long dark tea time of the computer programmer</title>
		<link>http://blog.scraperwiki.com/2012/01/13/the-long-dark-tea-time-of-the-computer-programmer/</link>
		<comments>http://blog.scraperwiki.com/2012/01/13/the-long-dark-tea-time-of-the-computer-programmer/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 10:00:14 +0000</pubDate>
		<dc:creator>Julian</dc:creator>
				<category><![CDATA[thoughts]]></category>
		<category><![CDATA[learning]]></category>
		<category><![CDATA[makerbot]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216075</guid>
		<description><![CDATA[The way in which Information Technology is taught in England is so dull and harmful it should be scrapped – that&#8217;s the view of the Education Secretary Michael Gove. &#8216;A nation of digital illiterates&#8217; (BBC) Many years ago there was &#8230; <a href="http://blog.scraperwiki.com/2012/01/13/the-long-dark-tea-time-of-the-computer-programmer/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216075&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://scraperwiki.files.wordpress.com/2012/01/making_most_micro.jpeg"><img class="alignleft wp-image-758216100" title="Ian McNaught-Davis presenting the BBC's 1983 TV series &quot;Making the Most of the Micro&quot;" src="http://scraperwiki.files.wordpress.com/2012/01/making_most_micro.jpeg?w=200&#038;h=150" alt="Ian McNaught-Davis presenting the BBC's 1983 TV series &quot;Making the Most of the Micro&quot;" width="200" height="150" /></a></p>
<blockquote><p><strong>The way in which Information Technology is taught in England is so dull and harmful it should be scrapped – that&#8217;s the view of the Education Secretary Michael Gove.</strong><br />
<a href="http://news.bbc.co.uk/today/hi/today/newsid_9675000/9675420.stm">&#8216;A nation of digital illiterates&#8217; (BBC)</a></p></blockquote>
<p>Many years ago there was a total corporate take-over of the computer software sector in the UK. Big money was to be made out of controlling the profits generated by software applications, which were protected from competition by incompatible and non-interoperable standards and the force of law. (An attempt by the UK government to establish a very modest requirement for open standards was <a href="http://blogs.computerworlduk.com/open-enterprise/2012/01/uk-cabinet-office-betrayal-of-open-standards-confirmed/index.htm">successfully killed off</a> last week.)</p>
<p>One of the most painful aspects of this take-over was the way in which the same corporations managed to deform the entire education system into serving their purposes. All things resembling actual computer programming were cleansed from the curriculum, which was instead packed with dire, tedious training modules for drilling students in how to use those self-same corporations&#8217; big software suites.</p>
<p>Back in the 1980s, before this take-over, I learned to program on the <a href="http://en.wikipedia.org/wiki/BBC_Micro">BBC Micro</a>, which was widespread throughout the UK at the time. There is good evidence that this was the reason there has been such a strong software industry in the UK over the last three decades.</p>
<p>Let&#8217;s just use the words of computer games pioneer Ian Livingstone from his <a href="http://www.nesta.org.uk/library/documents/NextGenv32.pdf">February 2011 report</a>:</p>
<blockquote><p>&#8220;Given that the new online world is being transformed by creative technology companies like Facebook, Twitter, Google and video games companies, it seems incredible that there is an absence of computer programming in schools. The UK has gone backwards at a time when the requirement for computer science as a core skill is more essential than ever before. When Sir Clive Sinclair launched the ZX Spectrum in 1982, affordable computers were eagerly purchased for the homes of a creative nation. At the same time, the BBC Micro was adopted as the computer platform of choice for most schools and became the cornerstone of computing in British education in the 1980s. There was a thirst for creative computing both in the home and in schools creating a further demand at universities for courses in computer science. This certainly contributed to the rapid growth of the UK computer games industry.</p>
<p>&#8220;But instead of building on the BBC&#8217;s Computer Literacy Project in the 1980s, schools turned away from programming in favour of ICT. Whilst useful in teaching various proprietary office software packages, ICT fails to inspire children to study computer programming. It is certainly not much help for a career in games. In a world where technology affects everything in our daily lives, so few children are taught such an essential STEM skill as programming. Bored by ICT, young people do not see the potential of the digital creative industries. It is hardly surprising that the games industry keeps complaining about the lack of industry-ready computer programmers and digital artists.&#8221;</p></blockquote>
<p>The <a href="http://www.dcms.gov.uk/images/publications/Govt-Resp_NextGen_Cm-8226.pdf">official Government response</a> was mostly public relations, mentioning developments that have nothing to do with them, such as <a href="http://en.wikipedia.org/wiki/Raspberry_Pi">Raspberry Pi</a>, which, incidentally, appears to be an attempt by Livingstone&#8217;s generation to recreate the 1980s, when we once watched computers on TV (instead of watching TV on computers).</p>
<p>While this change in Government policy is absolutely vital, it is odd the way the people lobbying for it haven&#8217;t branched outside of their narrow fields of computer games and computer graphics – which are ultimately little more than a game of shifting pixels around on a VDU, utilizing very mature technologies based on software applications developed by large corporations.</p>
<p>They&#8217;re missing the point: This is the era of smart energy monitors that need to be coupled to microcontrollers that do something with appliances over the internet in response to data. Or robotrading bank accounts that use live data feeds to monitor and execute your share portfolio while you&#8217;re watching Corrie. The focus should be on <a href="http://www.arduino.cc/">Arduinos</a> and simple robotics to control the energy use in your house, or on creative programming to trump the hedgefunders.</p>
<p>Recently I have been <a href="http://www.freesteel.co.uk/wpblog/2011/12/getting-life-out-of-makerbot-cupcake/">doing experiments</a> with a home-built 3D printer, completely bypassing its UI application to drive it directly through the serial port.</p>
<p>Here is my first result:<br />
<span style="text-align:center; display: block;"><a href="http://blog.scraperwiki.com/2012/01/13/the-long-dark-tea-time-of-the-computer-programmer/"><img src="http://img.youtube.com/vi/RifBbEoczZ0/2.jpg" alt="" /></a></span></p>
<p>Come on guys. I know it takes years to master the ability to render the perfect series of CGI frames of a spoon stirring the brown liquid in a cup of tea. But get a computer-controlled robot to actually make me a cup of tea &#8212; that would be something!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/01/13/the-long-dark-tea-time-of-the-computer-programmer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/ae3cb03a98a6470bdf839dd84a226e47?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">goatchurch</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/01/making_most_micro.jpeg" medium="image">
			<media:title type="html">Ian McNaught-Davis presenting the BBC&#039;s 1983 TV series &#34;Making the Most of the Micro&#34;</media:title>
		</media:content>
	</item>
		<item>
		<title>Let's Try ScraperWiki</title>
		<link>http://blog.scraperwiki.com/2012/01/06/scraperwiki%e3%82%92%e3%81%9f%e3%82%81%e3%81%97%e3%81%a6%e3%81%bf%e3%82%88%e3%81%86/</link>
		<comments>http://blog.scraperwiki.com/2012/01/06/scraperwiki%e3%82%92%e3%81%9f%e3%82%81%e3%81%97%e3%81%a6%e3%81%bf%e3%82%88%e3%81%86/#comments</comments>
		<pubDate>Fri, 06 Jan 2012 15:48:12 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[developer]]></category>
		<category><![CDATA[opendata]]></category>
		<category><![CDATA[Scrapers]]></category>
		<category><![CDATA[japanese]]></category>
		<category><![CDATA[representatives]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216050</guid>
		<description><![CDATA[Guest post by Makoto Inoue, a Japanese ScraperWiki user. Makoto works in London as a Web developer, a technical writer, and a translator. He has a Japanese blog and his Twitter account is @makoto_inoue. Introduction Have you come across the word &#8220;scrape&#8221;? Pulling specific data out of web pages is called scraping. Many websites today offer an API (Application Programming Interface) that makes their data easy to obtain, so plenty of people may wonder why scraping is still needed. Yet during the recent Great East Japan Earthquake, earthquake and power bulletins, and the government statistics needed to assess the damage in each region, were not available through APIs, and many developers probably ended up writing their own scrapers. It is a great pity that programs written out of so many developers&#8217; goodwill end up scattered across different sites and eventually go unmaintained. That is where ScraperWiki comes in. What is ScraperWiki? ScraperWiki is a British startup that provides a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapers can also be run on a schedule; the data they collect is stored on ScraperWiki, and ScraperWiki provides an API through which that data can be reused on other sites. &#8230; <a href="http://blog.scraperwiki.com/2012/01/06/scraperwiki%e3%82%92%e3%81%9f%e3%82%81%e3%81%97%e3%81%a6%e3%81%bf%e3%82%88%e3%81%86/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216050&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><em>Guest post by Makoto Inoue, a Japanese ScraperWiki user. Makoto works in </em><em>London as a Web developer, a technical writer, and a translator. </em><em>He has a <a href="http://d.hatena.ne.jp/makotoi/">Japanese blog</a> and his Twitter </em><em>account is <a href="http://twitter.com/makoto_inoue">@makoto_inoue</a>.</em></p>
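<h1>Introduction</h1>
<p>Have you come across the word &#8220;scrape&#8221;?</p>
<p>Pulling specific data out of web pages is called scraping.</p>
<p>Many websites today offer an API (Application Programming Interface) that makes their data easy to obtain, so plenty of people may wonder why scraping is still needed. Yet during the recent Great East Japan Earthquake, earthquake and power bulletins, and the government statistics needed to assess the damage in each region, were not available through APIs, and many developers probably ended up writing their own scrapers. It is a great pity that programs written out of so many developers&#8217; goodwill end up scattered across different sites and eventually go unmaintained.</p>
<p>That is where <a href="http://scraperwiki.com">ScraperWiki</a> comes in.</p>
<h2>What is ScraperWiki?</h2>
<p><a href="http://scraperwiki.com">ScraperWiki</a> is a British startup that provides a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapers can also be run on a schedule; the data they collect is stored on ScraperWiki, and ScraperWiki provides an API through which that data can be reused on other sites.</p>
<p>As the &#8220;Wiki&#8221; in the name suggests, code that has been made public can be edited by other people, or copied and used for a different scraping job. There is a mechanism that checks whether scheduled scrapers are hitting errors, and features for &#8220;managing scraping together&#8221; are found throughout the site.</p>
<p>ScraperWiki traces its origins to around 2003, when one of its founders scraped the UK Parliament&#8217;s website to find out which MPs had voted for or against which bills.</p>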
<span style="text-align:center; display: block;"><a href="http://blog.scraperwiki.com/2012/01/06/scraperwiki%e3%82%92%e3%81%9f%e3%82%81%e3%81%97%e3%81%a6%e3%81%bf%e3%82%88%e3%81%86/"><img src="http://img.youtube.com/vi/hu3gRqhXKag/2.jpg" alt="" /></a></span>
<p>In Japan, the closest equivalent would probably be <a href="http://www.sangiin.go.jp/japanese/joho1/kousei/vote/179/179-1209-v001.htm">a page like this one</a>.</p>
<p>Today it is reportedly used by <a href="http://blog.scraperwiki.com/2011/02/25/read-all-about-it-read-all-about-it-%E2%80%9Cscraperwiki-gets-on-the-guardian-front-page-%E2%80%9D/">major news organisations such as the Guardian to investigate the influence of corporate lobbyists in Parliament</a>, and by the UK government itself on its prototype site <a href="http://alpha.gov.uk">alpha.gov.uk</a> as <a href="http://blog.scraperwiki.com/2011/05/11/access-government-in-a-way-that-makes-sense-to-you-surely-not/">a way to give unified access to data scattered across government departments</a>.</p>
<p>As for ScraperWiki&#8217;s business model, code that you make public is free, while <a href="https://scraperwiki.com/pricing/">charges apply for keeping code private and according to factors such as how much you scrape on a schedule</a>.</p>
<p>That is enough preamble; let&#8217;s actually try it out.</p>
<h2>Looking at an existing scraper</h2>
<p>A Google search for &#8220;ScraperWiki&#8221; showed that there was already a Japanese user of ScraperWiki.</p>
<ul>
<li><a href="http://d.hatena.ne.jp/uasi/20110603/1307098299">&#8220;If you are going to scrape, ScraperWiki is worth using&#8221; (in Japanese)</a></li>
</ul>
<p>In that post it is used to scrape data on members of the House of Representatives.</p>
<ul>
<li><a href="https://scraperwiki.com/scrapers/members_of_the_house_of_representatives_of_japan/">Members of the House of Representatives of Japan</a></li>
</ul>
<p><a href="http://scraperwiki.files.wordpress.com/2012/01/house_of_rep.png"><img class="alignnone size-full wp-image-758216061" title="House of Representatives" src="http://scraperwiki.files.wordpress.com/2012/01/house_of_rep.png?w=640&#038;h=457" alt="" width="640" height="457" /></a></p>
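<p>You can see that the scraper is set to run once a month and that it has several contributors.</p>
<p>The lower part of the page lets you browse the data as a spreadsheet, but that alone makes it hard to reuse on another site. In that case, click the &#8220;Explorer with API&#8221; button. At the bottom of the page that opens you should find a URL like the following.</p>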
<blockquote><p>https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&amp;name=members_of_the_house_of_representatives_of_japan&amp;query=select%20*%20from%20%60swdata%60%20limit%2010</p></blockquote>
<p>Opening this URL returns the data shown earlier as JSON (JavaScript Object Notation). Other output formats such as CSV, RSS and HTML tables are also supported, and you can filter the data with SQL statements.</p>
<blockquote><p>select * from `swdata` where party = '民主'</p></blockquote>
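<p>As a minimal sketch of how such a URL can be put together, the Python below rebuilds the API address shown above from its parts and then swaps in the filtering query. The endpoint and parameter names are copied from the example URL; the helper name <code>build_query_url</code> is purely illustrative and not part of any ScraperWiki library. The code only builds the string and makes no network request.</p>

```python
from urllib.parse import quote, urlencode

# Endpoint and parameter names are taken from the example URL above.
API_ENDPOINT = "https://api.scraperwiki.com/api/1.0/datastore/sqlite"


def build_query_url(scraper_name, sql, output_format="jsondict"):
    """Return the datastore API URL for one scraper and one SQL query.

    build_query_url is a hypothetical helper written for this example.
    """
    params = {"format": output_format, "name": scraper_name, "query": sql}
    # quote() turns spaces into %20 and backticks into %60, as in the
    # example URL; "*" is left literal so the result matches it.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote, safe="*")


if __name__ == "__main__":
    name = "members_of_the_house_of_representatives_of_japan"
    print(build_query_url(name, "select * from `swdata` limit 10"))
    print(build_query_url(name, "select * from `swdata` where party = '民主'"))
```

<p>Percent-encoding the whole SQL statement keeps the quotes and the non-ASCII party name intact when the URL is pasted into a browser or another program.</p>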
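<h2>Copying a scraper to make your own</h2>
<p>Press your browser&#8217;s Back button and look below the spreadsheet on the previous page. Under &#8220;This Scraper in Context&#8221; there is an item called &#8220;Copied To&#8221;. It shows that this source code has been copied and put to other uses.</p>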
<p><a href="http://scraperwiki.files.wordpress.com/2012/01/context.png"><img class="alignnone size-full wp-image-758216059" title="Context" src="http://scraperwiki.files.wordpress.com/2012/01/context.png?w=640&#038;h=147" alt="" width="640" height="147" /></a></p>
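<p>There you will see <a href="https://scraperwiki.com/scrapers/members_of_the_house_of_councillors_of_japan_1">&#8220;makoto / Members of the House of Councillors of Japan&#8221;</a>, so click on it. This is in fact a scraper I made to extract the list of members of the House of Councillors. The House of Representatives and the House of Councillors each have their own website, but their member-list pages looked similar enough that I thought the code could easily be reused.</p>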
<ul>
<li><a href="http://www.shugiin.go.jp/index.nsf/html/index_kousei3.htm">List of members of the House of Representatives</a></li>
<li><a href="http://www.sangiin.go.jp/japanese/joho1/kousei/giin/179/giin.htm">List of members of the House of Councillors</a></li>
</ul>
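<p>Making a copy is easy: just click the Copy link. You can copy a scraper without logging in, but I recommend taking this opportunity to create an account.</p>
<p>Opening the &#8220;Edit&#8221; page brings up an online editor where you can change the code on the spot. Press the &#8220;Run&#8221; button at the bottom and you can watch the data being fetched from the site.</p>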
<p><a href="http://www.screenr.com/embed/BgQs">http://www.screenr.com/embed/BgQs</a></p>
<p>Below is the difference between the original code and mine.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2012/01/diff.png"><img class="alignnone size-full wp-image-758216060" title="Diff" src="http://scraperwiki.files.wordpress.com/2012/01/diff.png?w=640&#038;h=498" alt="" width="640" height="498" /></a></p>
<p>Because the House of Representatives and House of Councillors pages differ, I had to change about four things.</p>
<ul>
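<li>The character encodings differ (UTF and Shift-JIS).</li>
<li>The House of Representatives list spans several pages, whereas the House of Councillors list is a single page.</li>
<li>On the House of Representatives page, members&#8217; names carry the honorific &#8220;kun&#8221;. The House of Councillors page lists both stage names and real names.</li>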
</ul>
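<p>In addition, the HTML markup of the two pages differed slightly, so I made small changes to the XPath expressions. XPath is a notation for addressing parts of an HTML document by its structure.</p>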
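<p>The encoding and XPath points above can be sketched in Python using only the standard library. Everything in the sample is made up for illustration: the table fragment, the placeholder names and the <code>君</code> suffix stand in for the real pages, and this is not the code of the scraper itself. The sketch decodes Shift-JIS bytes, selects the first cell of each row with an XPath expression, and strips the honorific.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a member-list page.
# The names are placeholders, not real members.
SAMPLE_PAGE = (
    "<table>"
    "<tr><td>山田太郎君</td><td>民主</td></tr>"
    "<tr><td>鈴木花子君</td><td>無所属</td></tr>"
    "</table>"
).encode("shift_jis")  # pretend the server sent Shift-JIS bytes

HONORIFIC = "君"  # assumed suffix; adjust to whatever the page really uses


def extract_names(raw_bytes, encoding):
    """Decode the page, then pull the first cell of every row via XPath."""
    root = ET.fromstring(raw_bytes.decode(encoding))
    names = []
    for cell in root.findall(".//tr/td[1]"):
        name = (cell.text or "").strip()
        # Drop the trailing honorific so only the name is stored.
        if name.endswith(HONORIFIC):
            name = name[: -len(HONORIFIC)]
        names.append(name)
    return names


if __name__ == "__main__":
    print(extract_names(SAMPLE_PAGE, "shift_jis"))
```

<p>Real pages are rarely well-formed XML, so an HTML-aware parser such as lxml would be the more likely tool in practice. The decoding step and the XPath expression would look much the same.</p>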
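<p>Of course, making these changes takes a certain amount of programming knowledge. But since ScraperWiki lets you take a working sample and customise it a little for your own purposes, it may also be ideal learning material for people who want to study programming. I had hardly used XPath myself, but by referring to the original program I was able to learn it fairly easily.</p>
<h2>In closing</h2>
<p>I hope this short ScraperWiki tutorial has given you an overview.</p>
<p>Public bodies, the media and government agencies are publishing more and more information over the internet, but at present there are not enough sites designed for &#8220;data reuse with mashups in mind&#8221;. ScraperWiki is working to shake up that situation, and awareness of it is gradually growing among journalists and government officials in Europe. ScraperWiki currently has workshops planned in the United States and is preparing to start workshops in Japan as well. If you are interested, please feel free to get in touch via the <a href="https://scraperwiki.com/contact/">contact page</a>.</p>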
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/01/06/scraperwiki%e3%82%92%e3%81%9f%e3%82%81%e3%81%97%e3%81%a6%e3%81%bf%e3%82%88%e3%81%86/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/01/house_of_rep.png" medium="image">
			<media:title type="html">House of Representatives</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/01/context.png" medium="image">
			<media:title type="html">Context</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2012/01/diff.png" medium="image">
			<media:title type="html">Diff</media:title>
		</media:content>
	</item>
		<item>
		<title>Happy New Year and Happy New York!</title>
		<link>http://blog.scraperwiki.com/2012/01/03/happy-new-year-and-happy-new-york/</link>
		<comments>http://blog.scraperwiki.com/2012/01/03/happy-new-year-and-happy-new-york/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 20:32:42 +0000</pubDate>
		<dc:creator>ainemcguire</dc:creator>
				<category><![CDATA[events]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data driven journalism]]></category>
		<category><![CDATA[data-driven]]></category>
		<category><![CDATA[New York Digital]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758216011</guid>
		<description><![CDATA[We are really pleased to announce that we will be hosting our very first US two day Journalism Data Camp event in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation on &#8230; <a href="http://blog.scraperwiki.com/2012/01/03/happy-new-year-and-happy-new-york/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&amp;blog=14548467&amp;post=758216011&amp;subd=scraperwiki&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>We are really pleased to announce that we will be hosting our very first two-day US <a href="https://scraperwiki.com/events/jdcny/">Journalism Data Camp</a> event, in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation, on February 3rd and 4th 2012.</p>
<p>We have been working with Emily Bell (@emilybell), Director of the Tow Center, and Susan McGregor (@SusanEMcG), Assistant Professor at the Columbia J School, to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.</p>
<p>After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects, liberating New York data, and &#8216;learn to scrape&#8217;. We plan to inject some fun by running a derby for the project stream and by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.</p>
<p>We need journalists, media professionals, students of journalism, political science or information technology, coders, statisticians and public data boffs to dig up the data!</p>
<p>Please pick a stream and <a href="https://scraperwiki.com/events/jdcny/">sign up</a> to help us make New York a data-driven city!</p>
<p>Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.</p>
<p>Zarino has created a map of our US events, which we will update as we add locations: <a href="https://scraperwiki.com/events/">https://scraperwiki.com/events/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2012/01/03/happy-new-year-and-happy-new-york/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a801e770feed3df03f36195443374935?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ainemcguire</media:title>
		</media:content>
	</item>
	</channel>
</rss>
