<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>ScraperWiki Data Blog</title>
	<atom:link href="http://blog.scraperwiki.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.scraperwiki.com</link>
	<description>A blog about ScraperWiki and all things data</description>
	<lastBuildDate>Thu, 16 May 2013 11:35:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.scraperwiki.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>ScraperWiki Data Blog</title>
		<link>http://blog.scraperwiki.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.scraperwiki.com/osd.xml" title="ScraperWiki Data Blog" />
	<atom:link rel='hub' href='http://blog.scraperwiki.com/?pushpress=hub'/>
		<item>
		<title>Summarise #4: Images and domains</title>
		<link>http://blog.scraperwiki.com/2013/05/14/summarise-4-images-and-domains/</link>
		<comments>http://blog.scraperwiki.com/2013/05/14/summarise-4-images-and-domains/#comments</comments>
		<pubDate>Tue, 14 May 2013 13:13:01 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[beta]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218641</guid>
		<description><![CDATA[(This is the fourth part in a series of posts about the &#8220;Summarise this dataset&#8221; tool on the new beta.scraperwiki.com platform  – go there and sign up for free to try it out! The code is open source; take a &#8230; <a href="http://blog.scraperwiki.com/2013/05/14/summarise-4-images-and-domains/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218641&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>(This is the fourth part in a series of posts about the &#8220;Summarise this dataset&#8221; tool on the new <a href="http://beta.scraperwiki.com/">beta.scraperwiki.com</a> platform. Go there and sign up for free to try it out! The code is open source; take a look in <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/facts.js">facts.js</a> for the key parts.)</p>
<p>URLs are a type of data that is particularly easy to detect. The summarise tool does a reasonable job of displaying them anyway, but a couple of tricks make it even better.</p>
<p>The first insight is that, just as time can be grouped by days or months or years, URLs can in theory be grouped by protocol, domain, partial path or query parameters. We found the domain was the most useful, as it shows which website the URLs are from.</p>
<p>For example, I use a tool called Pocket to bookmark things to read or watch later on my phone. This is the list of the most common websites I bookmark that way:</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-50-43.png"><img class="aligncenter size-full wp-image-758218646" alt="Domain grouping" src="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-50-43.png?w=640"   /></a></p>
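<p>As a rough sketch of the idea (this is not the code in facts.js; <code>countDomains</code> and the sample URLs are made up for illustration), grouping by domain can be as simple as parsing each value and tallying its hostname:</p>

```javascript
// Hypothetical helper: count how often each domain appears in a list of URLs.
function countDomains(urls) {
  const counts = new Map();
  for (const value of urls) {
    let host;
    try {
      host = new URL(value).hostname;   // throws on strings that are not absolute URLs
    } catch (err) {
      continue;                         // skip values that do not parse
    }
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  // Most common domains first
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Made-up sample data
const bookmarks = [
  "http://www.bbc.co.uk/news/technology-1",
  "http://www.bbc.co.uk/news/science-2",
  "https://github.com/frabcus/magic-summary-tool",
  "not a url"
];
console.log(countDomains(bookmarks));
```

<p>For the sample list, www.bbc.co.uk comes first with a count of 2, and the value that is not a URL is ignored.</p>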
<p>Images are another common kind of URL that is easy to detect. A regular expression catches most of them automatically based on the file extension. (Although see <a href="https://github.com/frabcus/magic-summary-tool/issues/30">this bug</a>, at some point we&#8217;ll need a semantic layer&#8230;)</p>
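<p>To give a flavour of that kind of check, here is a minimal sketch. The pattern and the list of extensions are my assumptions for illustration, not the regular expression the tool actually uses:</p>

```javascript
// Assumed pattern: an http(s) URL whose path ends in a common image extension,
// optionally followed by a query string or fragment.
const IMAGE_URL = /^https?:\/\/[^\s?#]+\.(png|jpe?g|gif|bmp)([?#]\S*)?$/i;

// Hypothetical helper: true when the value looks like a link to an image file.
function looksLikeImageUrl(value) {
  return IMAGE_URL.test(String(value).trim());
}
```

<p>A check like this accepts a screenshot URL ending in <code>.png?w=640</code>, but it misses any image that is served from a URL without a file extension, which is the weakness of relying on extensions alone.</p>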
<p>Here you can see the artwork for the top tracks from a Last.fm scraper, which collects data on all the music that I&#8217;ve ever listened to:</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-53-51.png"><img class="aligncenter size-full wp-image-758218648" alt="Images" src="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-53-51.png?w=640"   /></a></p>
<p>You can immediately see that Yann Tiersen features heavily in the most replayed, with both <em>Goodbye Lenin!</em> and <em>Amelie</em>.</p>
<p>As explained in <a href="http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/">Summarise #2: Pies and facts</a>, the summarise tool generates lots of different &#8220;facts&#8221; about the data. It then has a ranking algorithm to decide which are the best to show.</p>
<p>When a column contains URLs, extra facts as described above (tables of website domains, collages of images) are generated. The tool then decides whether those are more interesting than the basic facts about the URLs.</p>
<p>For example, if there were only a few different URLs, they might all be shown in a pie. If there were many more URLs, but from only a few domains, then the domains would be shown in a pie instead.</p>
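<p>The real tool makes this choice by ranking facts against each other. Purely as an illustration of the behaviour in that example, a simplified decision could look like the sketch below, where <code>MAX_SLICES</code>, <code>chooseUrlFact</code> and the returned labels are all hypothetical:</p>

```javascript
// Hypothetical threshold: how many slices a pie can show before it stops being readable.
const MAX_SLICES = 8;

function distinctCount(values) {
  return new Set(values).size;
}

// Returns "url_pie", "domain_pie" or "table" for a column of absolute URL strings.
function chooseUrlFact(urls) {
  if (MAX_SLICES >= distinctCount(urls)) {
    return "url_pie";      // few different URLs: show them all
  }
  const domains = urls.map((u) => new URL(u).hostname);
  if (MAX_SLICES >= distinctCount(domains)) {
    return "domain_pie";   // many URLs, few websites: show the websites
  }
  return "table";          // too many of both: fall back to a table
}
```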
<p><strong>Try it yourself!</strong> Use “Create a dataset” to get some data into <a title="n" href="https://beta.scraperwiki.com/">new ScraperWiki</a>. Then pick “Summarise this data” from the tools menu and see what it tells you.</p>
<p><span style="font-size:16px;line-height:1.5;">Next time &#8211; words and countries! And then, a final post to round it up, about data that is (nearly) always one type, and other interesting tidbits.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/14/summarise-4-images-and-domains/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-50-43.png" medium="image">
			<media:title type="html">Domain grouping</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-02-at-23-53-51.png" medium="image">
			<media:title type="html">Images</media:title>
		</media:content>
	</item>
		<item>
		<title>Scheduling! Keep your data fresh</title>
		<link>http://blog.scraperwiki.com/2013/05/13/scheduling-keep-your-data-fresh/</link>
		<comments>http://blog.scraperwiki.com/2013/05/13/scheduling-keep-your-data-fresh/#comments</comments>
		<pubDate>Mon, 13 May 2013 13:36:16 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[developer]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218734</guid>
		<description><![CDATA[We&#8217;ve added scheduling to the &#8220;Code in your browser&#8221; tool on beta.scraperwiki.com. For now it is daily, as that covers most people&#8217;s uses. Please ask if you need something else! Or have a look at the tool&#8217;s source code. Want &#8230; <a href="http://blog.scraperwiki.com/2013/05/13/scheduling-keep-your-data-fresh/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218734&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>We&#8217;ve added scheduling to the &#8220;Code in your browser&#8221; tool on <a href="http://beta.scraperwiki.com/">beta.scraperwiki.com</a>.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-11-at-13-59-21.png"><img class="alignnone size-full wp-image-758218736" alt="Scheduling" src="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-11-at-13-59-21.png?w=640"   /></a></p>
<p>For now it is daily, as that covers most people&#8217;s uses. Please ask if you need something else! Or have a look at the <a href="https://github.com/frabcus/code-scraper-in-browser-tool/">tool&#8217;s source code</a>.</p>
<p><strong>Want to know how to use the new ScraperWiki?</strong> There&#8217;s a <a href="https://beta.scraperwiki.com/help/code-in-your-browser/">quick start guide</a> to coding in your browser on it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/13/scheduling-keep-your-data-fresh/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/screen-shot-2013-05-11-at-13-59-21.png" medium="image">
			<media:title type="html">Scheduling</media:title>
		</media:content>
	</item>
		<item>
		<title>Free community accounts on the ScraperWiki Beta</title>
		<link>http://blog.scraperwiki.com/2013/05/10/free-community-accounts/</link>
		<comments>http://blog.scraperwiki.com/2013/05/10/free-community-accounts/#comments</comments>
		<pubDate>Fri, 10 May 2013 11:00:02 +0000</pubDate>
		<dc:creator>Zarino Zappia</dc:creator>
				<category><![CDATA[developer]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[x.scraperwiki.com]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218693</guid>
		<description><![CDATA[We&#8217;ve been teasing and tempting you with blog posts about the first few tools on the new ScraperWiki Beta for a while now. It&#8217;s time to let you try them out first-hand. As of right now, the new ScraperWiki Beta &#8230; <a href="http://blog.scraperwiki.com/2013/05/10/free-community-accounts/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218693&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://beta.scraperwiki.com"><img class="aligncenter size-full wp-image-758218702" alt="Community Accounts" src="http://scraperwiki.files.wordpress.com/2013/05/community-accounts.png?w=640&#038;h=240" width="640" height="240" /></a></p>
<p>We&#8217;ve been <a href="http://blog.scraperwiki.com/2013/04/12/summarise-1-grouping-automatically-for-you/">teasing</a> <a href="http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/">and</a> <a href="http://blog.scraperwiki.com/2013/05/08/buckets_of_time/">tempting</a> you with blog posts about the first few tools on the new ScraperWiki Beta for a while now. It&#8217;s time to let you try them out first-hand.</p>
<p>As of right now, the new ScraperWiki Beta is open for you, your aunt, anyone, to sign up for a <b>free community account</b>: Check out <a href="http://beta.scraperwiki.com" target="_blank">beta.scraperwiki.com</a>.</p>
<p>We&#8217;re really excited. Not only does this mean all of our Classic Premium Account holders, and our new private beta applicants, have been settled into the new platform, but now regular Classic users get to try the new ScraperWiki out, for free.</p>
<p>The new ScraperWiki beta is a little rough around the edges, but it can already do everything ScraperWiki Classic did, and more. As we (and you!) develop and share new tools on the platform, it&#8217;s only going to get <i>more</i> powerful and <i>more</i> exciting.</p>
<p>The <b>Code in your browser</b> tool will let you copy and paste your scrapers from ScraperWiki Classic, while <b>Search for Tweets</b>, <b>Summarise Automatically</b> and <b>Query with SQL</b> should give you an idea of how simple and focussed ScraperWiki tools are meant to be. The new ScraperWiki isn&#8217;t one monolithic app – it&#8217;s an ever-expanding collection of tools that interact and plug into each other to help you get your job done. I can&#8217;t wait to see more tools appear in the near future!</p>
<p>To find out more about how the new Beta is different from ScraperWiki Classic, check out our <a href="https://beta.scraperwiki.com/help/whats-new/" target="_blank">&#8220;What&#8217;s new&#8221; guide</a>. And to report any bugs or missing features, <a href="http://github.com/scraperwiki/custard/issues" target="_blank">raise an issue</a> on our GitHub repo or email Zach, our community manager, at <a href="mailto:zach@scraperwiki.com">zach@scraperwiki.com</a>.</p>
<p><b>Free ScraperWiki Community accounts.</b> Come try us out: <a href="http://beta.scraperwiki.com" target="_blank">beta.scraperwiki.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/10/free-community-accounts/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/fcddd0dd6b487fe1302fabaffac7d2b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zarino</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/community-accounts.png" medium="image">
			<media:title type="html">Community Accounts</media:title>
		</media:content>
	</item>
		<item>
		<title>Book review: Interactive Data Visualization for the web by Scott Murray</title>
		<link>http://blog.scraperwiki.com/2013/05/09/book-review-interactive-data-visualization-for-the-web-by-scott-murray/</link>
		<comments>http://blog.scraperwiki.com/2013/05/09/book-review-interactive-data-visualization-for-the-web-by-scott-murray/#comments</comments>
		<pubDate>Thu, 09 May 2013 08:49:53 +0000</pubDate>
		<dc:creator>Ian Hopkinson</dc:creator>
				<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218656</guid>
		<description><![CDATA[Next in my book reading, I turn to Interactive Data Visualisation for the web by Scott Murray (@alignedleft on twitter). This book covers the d3 JavaScript library for data visualisation, written by Mike Bostock who was also responsible for the Protovis &#8230; <a href="http://blog.scraperwiki.com/2013/05/09/book-review-interactive-data-visualization-for-the-web-by-scott-murray/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218656&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://scraperwiki.files.wordpress.com/2013/05/interactivevisualisation.jpg"><img class="alignright size-medium wp-image-758218657" alt="Book cover - interactive visualisation for the web" src="http://scraperwiki.files.wordpress.com/2013/05/interactivevisualisation.jpg?w=228&#038;h=300" width="228" height="300" /></a>Next in my book reading, I turn to <a href="http://shop.oreilly.com/product/0636920026938.do"><em>Interactive Data Visualization for the Web</em></a> by Scott Murray (<a href="https://twitter.com/alignedleft">@alignedleft</a> on Twitter). This book covers the <a href="http://d3js.org/">d3 JavaScript library</a> for data visualisation, written by Mike Bostock, who was also responsible for the <a href="http://mbostock.github.io/protovis/">Protovis library</a>. If you&#8217;d like a taster of the book&#8217;s content, a number of the examples can also be found on the author&#8217;s <a href="http://alignedleft.com/tutorials/">website</a>.</p>
<p>The book is largely aimed at web designers who are looking to include interactive data visualisations in their work. It includes some introductory material on JavaScript, HTML, and CSS, so has some value for programmers moving into web visualisation. I quite liked the repetition of this relatively basic material, and the conceptual introduction to the d3 library.</p>
<p>I found the book rather slow: on page 197 – approaching the final fifth of the book – we were still making a bar chart. A smaller effort was expended in that period on scatter graphs. As a data scientist, I expect to have several dozen plot types in that number of pages! This is something of which Scott warns us, though. d3 is a visualisation framework built for explanatory presentation (i.e. you know the story you want to tell) rather than being an exploratory tool (i.e. you want to find out about your data). To be clear: this &#8220;slowness&#8221; is not a fault of the book, rather a disjunction between the book and my expectations.</p>
<p>From a technical point of view, d3 works by binding data to elements in the DOM for a webpage. It&#8217;s possible to do this for any element type, but practically speaking only Scalable Vector Graphics (SVG) elements make real sense. This restriction means that d3 will only work in more recent browsers, which may be a problem for those trapped in some corporate environments. The library contains a lot of helper functions for generating scales, loading up data, selecting and modifying elements, animation and so forth. d3 is a low-level library; there is no PlotBarChart function.</p>
<p>Achieving the static effects demonstrated in this book using other tools such as R, Matlab, or Python would be a relatively straightforward task. The animations, transitions and interactivity would be more difficult to do. More widely, the d3 library supports the creation of hierarchical visualisations which I would struggle to create using other tools.</p>
<p>This book is quite a basic introduction; you can get a much better overview of what is possible with d3 by looking at the <a href="https://github.com/mbostock/d3/wiki/API-Reference">API documentation</a> and the <a href="https://github.com/mbostock/d3/wiki/Gallery">Gallery</a>. Scott lists quite a few other <a href="https://delicious.com/somebeans/alignedleft">resources</a> including a wide range for the <a href="https://delicious.com/somebeans/alignedleft+d3">d3 library</a> itself, <a href="https://delicious.com/somebeans/alignedleft+ToolsBuiltwithD3">systems built on d3</a>, and <a href="https://delicious.com/somebeans/alignedleft+EasyCharts">alternatives for d3</a> if it were not the library you were looking for.</p>
<p>I can see myself using d3 in the future, perhaps not for building generic tools but for custom visualisations where the data is known and the aim is to best explain that data. Scott quotes Ben Shneiderman on this regarding the structure of such visualisations:</p>
<blockquote><p>overview first, zoom and filter, then details on demand</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/09/book-review-interactive-data-visualization-for-the-web-by-scott-murray/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/3554b6603fe848d7853c6bc6d74757bc?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">somebeans</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/interactivevisualisation.jpg?w=228" medium="image">
			<media:title type="html">Book cover - interactive visualisation for the web</media:title>
		</media:content>
	</item>
		<item>
		<title>Summarise #3: Buckets of time and numbers</title>
		<link>http://blog.scraperwiki.com/2013/05/08/buckets_of_time/</link>
		<comments>http://blog.scraperwiki.com/2013/05/08/buckets_of_time/#comments</comments>
		<pubDate>Wed, 08 May 2013 09:33:32 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[beta]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218412</guid>
		<description><![CDATA[In the last two weeks I introduced the &#8220;Summarise automatically tool&#8221;, which magically shows you interesting facts about any dataset in the new ScraperWiki. It&#8217;s an open source tool &#8211; geeks can play along on Github, or use the SSH &#8230; <a href="http://blog.scraperwiki.com/2013/05/08/buckets_of_time/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218412&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In the <a href="http://blog.scraperwiki.com/?p=758218406">last</a> <a href="http://blog.scraperwiki.com/?p=758218316">two</a> weeks I introduced the &#8220;Summarise automatically&#8221; tool, which magically shows you interesting facts about any dataset in the new ScraperWiki.</p>
<p>It&#8217;s an open source tool &#8211; geeks can <a href="https://github.com/frabcus/magic-summary-tool/">play along</a> on GitHub, or use the SSH button to log into the tool and see the code running in action.</p>
<p>After adding pie charts, I realised that even for general data lots of columns had basic types in them which could be easily detected, and special visualisations shown. For example, dates and times or image URLs.</p>
<p>For a while now, I&#8217;ve been using <a href="http://getpocket.com/a/">Pocket</a> (formerly Read It Later) to bookmark articles and videos to read or watch later. I&#8217;ve got a scraper that calls the Pocket API (get in touch if you&#8217;d use it and want me to release it as a tool!) and saves all my bookmarks as a dataset.</p>
<p>This is a histogram the &#8220;Summarise automatically&#8221; tool made of when I added bookmarks.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-23-54.png"><img class="alignnone size-full wp-image-758218415" alt="Pocket time added" src="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-23-54.png?w=640"   /></a></p>
<p>You can immediately see I first tried out Pocket a tiny bit in November/December 2011, but then stopped using it for four months. Then in May 2012, I started again in earnest. That was because I&#8217;d got a new smartphone with a larger screen, and wanted to read articles on the train. You can also see I went on holiday in August 2012, and didn&#8217;t bookmark much then.</p>
<p>The code that automatically made this chart is in the &#8220;fact_time_charts&#8221; function in <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/facts.js">facts.js</a>. First of all it uses the fantastic <a href="http://momentjs.com/">moment.js</a> library to parse every value in the column. If at least half of them appear to be dates/times, it goes ahead and makes a time histogram.</p>
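<p>To show the shape of that &#8220;at least half&#8221; test without pulling in a library, here is a minimal sketch. It swaps moment.js for a deliberately strict ISO-style pattern, and <code>looksLikeDate</code> and <code>isTimeColumn</code> are names invented for this example:</p>

```javascript
// Stand-in for moment.js parsing: accept only ISO-style values such as
// "2012-05-14" or "2012-05-14 09:30".
const ISO_LIKE = /^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$/;

function looksLikeDate(value) {
  const text = String(value).trim();
  if (!ISO_LIKE.test(text)) {
    return false;
  }
  // Date.parse returns NaN when the engine cannot interpret the string
  return !Number.isNaN(Date.parse(text.replace(" ", "T")));
}

// True when at least half of the values in the column look like dates/times.
function isTimeColumn(values) {
  const hits = values.filter(looksLikeDate).length;
  return values.length > 0 ? hits * 2 >= values.length : false;
}
```

<p>moment.js understands far more formats than this pattern does, so the real tool recognises many more columns as dates.</p>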
<p>The interesting bit is when it tries various ways to &#8220;bucket&#8221; (or &#8220;bin&#8221;) the data. That is, count the number of times something in the column falls in a particular hour, day, month or year. It tries out all four, and only uses the chart that has fewer than 31 rows.</p>
<p><code>// try grouping into buckets at various granularities<br />
_bucket_time_chart(col, group, "YYYY", "years", "YYYY", "time_chart_year", 90)<br />
_bucket_time_chart(col, group, "YYYY-MM", "months", "MMM YYYY", "time_chart_month", 91)<br />
_bucket_time_chart(col, group, "YYYY-MM-DD", "days", "D MMM YYYY", "time_chart_day", 92)<br />
_bucket_time_chart(col, group, "YYYY-MM-DD HH", "hours", "ha D MMM YYYY", "time_chart_hour", 93)<br />
</code></p>
<p>This means if the data is spread out over lots of years, it will show it by year. If it all happened in one day, it&#8217;ll show a histogram by hour.</p>
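<p>Here is a self-contained sketch of that bucketing. It assumes the timestamps are already strings in &#8220;YYYY-MM-DD HH:MM&#8221; form, so that a bucket key is just a prefix of the string, and it assumes that the finest granularity with fewer than 31 buckets should win. <code>bucket</code> and <code>chooseBuckets</code> are illustrative names, not functions from facts.js:</p>

```javascript
// Prefix lengths into a "YYYY-MM-DD HH:MM" string, coarsest first.
const GRANULARITIES = [
  { name: "years", keyLength: 4 },    // "2012"
  { name: "months", keyLength: 7 },   // "2012-05"
  { name: "days", keyLength: 10 },    // "2012-05-14"
  { name: "hours", keyLength: 13 }    // "2012-05-14 09"
];

// Count how many timestamps share each prefix of the given length.
function bucket(timestamps, keyLength) {
  const counts = new Map();
  for (const stamp of timestamps) {
    const key = stamp.slice(0, keyLength);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Assumption: prefer the finest granularity that still has fewer than maxRows buckets.
// Returns null when every granularity produces too many buckets.
function chooseBuckets(timestamps, maxRows = 31) {
  let best = null;
  for (const g of GRANULARITIES) {
    const counts = bucket(timestamps, g.keyLength);
    if (maxRows > counts.size) {
      best = { granularity: g.name, counts };
    }
  }
  return best;
}
```

<p>With three timestamps that all fall on one day, this picks hours; with forty different days spread over two months, the day and hour buckets are too numerous, so it falls back to months.</p>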
<p>The &#8220;Summarise automatically&#8221; tool does a similar thing for columns of numbers. It shows a histogram so you can see how they are distributed. For example, this chart was made automatically for a Climate Code Foundation dataset of sea temperature station observations.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-34-04.png"><img class="alignnone size-full wp-image-758218419" alt="JASL observation z values" src="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-34-04.png?w=640"   /></a></p>
<p>&#8220;Z&#8221; is the relative height of the sea. You can see there are two peaks in the histogram where lots of observations were made &#8211; one is for low tide and the other for high tide.</p>
<p>Once again this puts the data points into bins of different sizes to make the histogram. This time it takes a logarithm to find the power of 10 to use as the width of each bar, chosen so that there are a reasonable number of bars &#8211; as near to 33 as possible. The code for this is in &#8220;fact_numbers_chart&#8221; in <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/facts.js">facts.js</a>.</p>
<p>A notable bit works out if the histogram &#8220;looks interesting&#8221;. We had lots of them that only showed one bar, because a few outliers were far off the edge. The test in the end was to look at the second most common value, and see if its count is at least the square root of the number of rows. This means the bar charts are at least slightly interesting &#8211; it falls back to tables of the top values if they&#8217;re not.</p>
<p><code>// give up unless the count of the second most common value is at least the sqrt of the number of rows<br />
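<p>One way to pick such a width, sketched here under my own assumptions rather than copied from facts.js, is to take the base-10 logarithm of the ideal width (range divided by 33), then try the power of 10 on either side and keep whichever gives a bar count closer to 33. <code>chooseBarWidth</code> and <code>histogram</code> are hypothetical names:</p>

```javascript
const TARGET_BARS = 33;

// Pick a power of 10 as the bar width so the number of bars is as near to TARGET_BARS as possible.
function chooseBarWidth(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  if (range === 0) {
    return 1;   // all values identical: any width gives a single bar
  }
  const exponent = Math.log10(range / TARGET_BARS);
  const candidates = [Math.pow(10, Math.floor(exponent)), Math.pow(10, Math.ceil(exponent))];
  const distance = (width) => Math.abs(range / width - TARGET_BARS);
  return distance(candidates[0]) > distance(candidates[1]) ? candidates[1] : candidates[0];
}

// Count values per bar; each key is the lower edge of a bar.
function histogram(values, width) {
  const bars = new Map();
  for (const v of values) {
    const start = Math.floor(v / width) * width;
    bars.set(start, (bars.get(start) || 0) + 1);
  }
  return bars;
}
```

<p>For values spanning 0 to 400, the candidates are widths of 10 (about 40 bars) and 100 (about 4 bars), so the sketch settles on 10.</p>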
if (group[1].c &lt; Math.sqrt(total) ) {<br />
return<br />
}<br />
</code></p>
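<p>To make that test concrete, here is a small standalone version. The <code>{ value, c }</code> shape of each group is inferred from the snippet above, and <code>groupByCount</code> and <code>looksInteresting</code> are names made up for this sketch:</p>

```javascript
// Tally values into groups sorted by count, most common first.
function groupByCount(values) {
  const counts = new Map();
  for (const v of values) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  return Array.from(counts.entries())
    .map(([value, c]) => ({ value, c }))
    .sort((a, b) => b.c - a.c);
}

// A chart is worth showing only if the second most common value
// occurs at least sqrt(total rows) times.
function looksInteresting(groups, total) {
  if (2 > groups.length) {
    return false;   // a single bar is never interesting
  }
  return groups[1].c >= Math.sqrt(total);
}
```

<p>With 100 rows the threshold is 10, so a column split 60/40 passes, while one split 95/5 fails and would fall back to a table.</p>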
<p><strong>Try it yourself!</strong> Use &#8220;create a dataset&#8221; to get some data into ScraperWiki. Then pick &#8220;Summarise automatically&#8221; from the tools menu and see what it tells you.</p>
<p>Next time, I add fancy stuff to display URLs.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/08/buckets_of_time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-23-54.png" medium="image">
			<media:title type="html">Pocket time added</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-34-04.png" medium="image">
			<media:title type="html">JASL observation z values</media:title>
		</media:content>
	</item>
		<item>
		<title>Summarising Serendipity</title>
		<link>http://blog.scraperwiki.com/2013/05/07/summarising-serendipity/</link>
		<comments>http://blog.scraperwiki.com/2013/05/07/summarising-serendipity/#comments</comments>
		<pubDate>Tue, 07 May 2013 09:50:01 +0000</pubDate>
		<dc:creator>Zach Beauvais</dc:creator>
				<category><![CDATA[beta]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218654</guid>
		<description><![CDATA[5 years ago, a friend and I sat down in a pub in Shrewsbury, drank some beer, and chatted about the web. Every month since, people have been doing that in Shrewsbury (and a few times in Ludlow). It&#8217;s called &#8230; <a href="http://blog.scraperwiki.com/2013/05/07/summarising-serendipity/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218654&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Five years ago, a friend and I sat down in a pub in Shrewsbury, drank some beer, and chatted about the web. Every month since, people have been doing that in Shrewsbury (and a few times in Ludlow). It&#8217;s called ShropGeek (we&#8217;re very savvy in our naming conventions, you see). It was started and organised almost exclusively via Twitter, and it has evolved from a monthly banter-session into an annual conference.</p>
<p>I say this here not as an advertisement (<a href="http://2013.shropgeek-revolution.co.uk/">cough</a>), but because I have almost accidentally ended up as a co-organiser of quite a big event. (It&#8217;s an accident, on my part, because <a href="http://www.kirstyburgoine.co.uk/">Kirsty</a> has put a hell of a lot of time and effort into it!)</p>
<p>Because of its Twitter-heavy organisation, I would have a pretty good idea about what was going on, because <a>@shropgeek</a> would normally be included in conversations. With the nature of the beast evolving, I&#8217;ve had some pretty important questions about how people share, and what they&#8217;re saying about the conference. I accidentally (I seem accident-prone when it comes to administrative tasks) discovered a very useful resource which sits under my nose at my day job: the <a href="http://blog.scraperwiki.com/2013/04/12/summarise-1-grouping-automatically-for-you/">Summarise Automatically tool</a> on the New ScraperWiki.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/summarise_words.png"><img class="alignright size-medium wp-image-758218667" alt="summarise_words" src="http://scraperwiki.files.wordpress.com/2013/05/summarise_words.png?w=300&#038;h=137" width="300" height="137" /></a>One of the first things I did, to test out the summariser, was to search Twitter for mentions of the &#8220;#revolutionconf&#8221; hashtag, then click on &#8220;summarise this data&#8221;. I expected to see some cool graphics, and mainly wanted to test it out as a ScraperWiki tool. What I found, though, were some really valuable views on how people are tweeting.</p>
<p>Basically, the summariser tool tries to tell you some instant things about your data by, well, summarising the columns in your data. This can be a bit of a mixed bag, with some summaries making little sense (but we&#8217;ll get better at that). However, the really cool thing is the very high-level, dashboard-like information I could get on *this* data, which I know comprises tweets, and all of which are related to my hashtag.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/summarise_screen_name.png"><img src="http://scraperwiki.files.wordpress.com/2013/05/summarise_screen_name.png?w=300&#038;h=208" alt="summarise_screen_name" width="300" height="208" class="alignright size-medium wp-image-758218672" /></a>1. The first win was a simple count of how many mentions there are. I saw that the hashtag hasn&#8217;t been used as much as it could be (with only 66 instances), and realised that I&#8217;ve tweeted several times without it. /me slaps own hand!</p>
<p>2. Next, for me, was the screen_name summary. I saw several people on that list who I didn&#8217;t realise were in-the-know, and was able to remind myself to thank them soon!</p>
<p>3. The pie chart showing other hashtags was also interesting, because it included the word &#8220;#excited&#8221;. Although this doesn&#8217;t seem to have *every* other hashtag, it was good to see.</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/summarise_hashtags.png"><img src="http://scraperwiki.files.wordpress.com/2013/05/summarise_hashtags.png?w=300&#038;h=186" alt="summarise_hashtags" width="300" height="186" class="aligncenter size-medium wp-image-758218675" /></a></p>
<p>4. Finally, the &#8220;url&#8221; column was summarised as a pie chart, showing me which URLs were included within tweets containing the conference hashtag. This is very interesting, because I can see if people are linking to the index page, or to the ticket page for the site. Also, I can see what *isn&#8217;t* being linked (e.g. the Lanyrd page for the event, or direct links to Eventbrite, which I expected to see).</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/05/summarise_url.png"><img src="http://scraperwiki.files.wordpress.com/2013/05/summarise_url.png?w=250&#038;h=300" alt="summarise_url" width="250" height="300" class="alignleft size-medium wp-image-758218677" /></a>These were all interesting, and helped me instantly better understand how people are talking about the conference. I should also point out that the tool ran automatically: all I did was install the tool on my search data, and it presented me with this information without any setup. Best of all, it also showed me some things I wasn&#8217;t planning for. The list of re-tweeters, for example, jogged my memory, and made me consider asking some specific people to mention the event, which is something I hadn&#8217;t thought of doing.</p>
<p>I&#8217;m pretty excited about this tool, not just because it&#8217;s geeky and has charts, but because it&#8217;s at a very early stage and *already* did something useful with my social data. As it improves, I hope we get some more instant-win effects from it, and I&#8217;d be keen to hear what we could do to make it better, too.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/scraperwiki.wordpress.com/758218654/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/scraperwiki.wordpress.com/758218654/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218654&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/07/summarising-serendipity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/ca6cefbc3643b303defdb39068b8a39e?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">zbeauvais</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/summarise_words.png?w=300" medium="image">
			<media:title type="html">summarise_words</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/summarise_screen_name.png?w=300" medium="image">
			<media:title type="html">summarise_screen_name</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/summarise_hashtags.png?w=300" medium="image">
			<media:title type="html">summarise_hashtags</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/summarise_url.png?w=250" medium="image">
			<media:title type="html">summarise_url</media:title>
		</media:content>
	</item>
		<item>
		<title>Newspapers, advertising, revenue, innovation</title>
		<link>http://blog.scraperwiki.com/2013/05/03/newspapers-advertising-revenue-innovation/</link>
		<comments>http://blog.scraperwiki.com/2013/05/03/newspapers-advertising-revenue-innovation/#comments</comments>
		<pubDate>Fri, 03 May 2013 12:37:19 +0000</pubDate>
		<dc:creator>ainemcguire</dc:creator>
				<category><![CDATA[events]]></category>
		<category><![CDATA[journalism]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218558</guid>
		<description><![CDATA[A couple weeks ago, I joined the 110-year-old WAN-IFRA at their annual Digital Media Conference at the swish ETCVenues&#8217; 200 Aldersgate London pad. The organisation has become the voice for the worldwide community of newspaper publishers, and the DMC was &#8230; <a href="http://blog.scraperwiki.com/2013/05/03/newspapers-advertising-revenue-innovation/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218558&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://scraperwiki.files.wordpress.com/2013/05/cartoon-newspaper-164x300.jpg"><img class=" wp-image-758218559 alignleft" alt="Cartoon-newspaper-164x300" src="http://scraperwiki.files.wordpress.com/2013/05/cartoon-newspaper-164x300.jpg?w=140&#038;h=320" width="140" height="320" /></a>A couple weeks ago, I joined the 110-year-old <a href="http://www.wan-ifra.org">WAN-IFRA</a> at their annual <a href="http://www.wan-ifra.org/events/digital-media-europe-2013">Digital Media Conference</a> at the swish ETCVenues&#8217; 200 Aldersgate London pad. The organisation has become the voice for the worldwide community of newspaper publishers, and the DMC was a truly international affair with 37 countries from all five continents represented. Senior executives see it as a place to have a &#8216;pow-wow&#8217; with their peers on what&#8217;s happening in the industry, and to listen and respond to some of the wider issues affecting the sector. Having cut a swathe through the industry&#8217;s revenue model, Google&#8217;s presence was understandably palpable! They had a good showing because many publishers are now value-added resellers to the giant tech company.</p>
<p>It quickly became clear that the big issues facing the industry are:</p>
<ul>
<li>how to grow revenue from advertising</li>
<li>how to cut the cost of serving multiple platforms like tablets, mobile devices and PCs</li>
<li>how to innovate.</li>
</ul>
<p>Unsurprisingly, <a href="http://en.wikipedia.org/wiki/HTML5">HTML5</a> was also a popular topic, and a number of ready-made products featured in the presentations.</p>
<p>Day 2 was focused on innovation and I had an opportunity to talk about what ScraperWiki has been doing to help in the sector.  I tried to feature stories that data scientists from our community created specifically for the media.</p>
<div id="attachment_758218598" class="wp-caption alignleft" style="width: 160px"><a href="http://dharmafly.com/theywriteforyou/" target="_blank"><img class="size-thumbnail wp-image-758218598 " alt="They Write for You" src="http://scraperwiki.files.wordpress.com/2013/05/theywriteforyou.png?w=150&#038;h=71" width="150" height="71" /></a><p class="wp-caption-text">They Write for You</p></div>
<p>I also wanted to talk about some of the women doing great work, so I rolled back to the story that Anna Powell Smith (<a href="https://twitter.com/darkgreener">@darkgreener</a>) helped craft at our very first Hacks/Hacker day in January 2010. The story was about the number of articles written by MPs for British newspapers – it is a simple and effective visualisation: <a href="http://dharmafly.com/theywriteforyou/">&#8216;They Write for You&#8217;</a>.</p>
<div id="attachment_758218601" class="wp-caption alignright" style="width: 160px"><a href="http://www.channel4.com/news/could-selling-off-britains-assets-cut-the-debt" target="_blank"><img class="size-thumbnail wp-image-758218601 " alt="A load of bubbles!" src="http://scraperwiki.files.wordpress.com/2013/05/assetbubble.png?w=150&#038;h=137" width="150" height="137" /></a><p class="wp-caption-text">A load of bubbles!</p></div>
<p>I also talked about the data-driven stories that Nicola Hughes (<a href="https://twitter.com/DataMinerUK">@datamineruk</a>), Francis Irving (<a href="https://twitter.com/frabcus">@frabcus</a>) and Julian Todd (<a href="https://twitter.com/goatchurch">@goatchurch</a>) created for the award-winning <em>Dispatches</em> programme. These focused on the <a href="http://www.channel4.com/news/could-selling-off-britains-assets-cut-the-debt">National Asset Register</a> and <a href="http://www.channel4.com/news/could-councils-sell-land-to-fill-budget-holes">English Brownfield Sites</a>.</p>
<p>I finished on the rose visualisation that Julian and Zarino Zappia (<a href="https://twitter.com/zarino">@zarino</a>) made to enliven <a href="https://views.scraperwiki.com/run/somerset_fire_day_rose/">Somerset and Devon Fire Incidents</a>. It seemed like a good candidate to show how local government data can be used to make an interesting, evergreen story:</p>
<div id="attachment_758218603" class="wp-caption aligncenter" style="width: 310px"><a href="https://views.scraperwiki.com/run/somerset_fire_day_rose/" target="_blank"><img class="size-medium wp-image-758218603 " alt="rose visualisation in Devon'" src="http://scraperwiki.files.wordpress.com/2013/05/devonrose.png?w=300&#038;h=200" width="300" height="200" /></a><p class="wp-caption-text">Rose visualisation from Devon</p></div>
<p><strong>Dr Johnny Ryan </strong>(<a href="https://twitter.com/johnnyryan">@johnnyryan</a>), author of <a href="http://www.amazon.co.uk/History-Internet-Digital-Future/dp/1861897774/" target="_blank">&#8216;A History of the Internet and the Digital Future&#8217; </a>and Chief Innovation Officer at <i><a href="http://www.irishtimes.com" target="_blank">The Irish Times</a> </i>followed up by introducing three new media startups.  This was interesting because the paper is a reasonably conservative publication that has taken the unusual step of acting as a <a href="http://www.youtube.com/playlist?list=PL70F9498992D82318">technology accelerator</a> in Dublin. So, hats off to its editor <a href="http://en.wikipedia.org/wiki/Kevin_O%27Sullivan_(journalist)" target="_blank">Kevin O&#8217;Sullivan</a>!  It provides space, desks, access to management and a platform for the startups to introduce their offerings into the media market.</p>
<p>Here are the three Dr Ryan mentioned:</p>
<p><strong>Oliver Mooney</strong> <a href="https://twitter.com/OliverMooney">(@olivermooney)</a> told us how <a href="http://getbulb.com"><em>GetBulb</em></a> allows you to make compelling infographics simply by copying and pasting your data into a template.  They also have a wacky introduction video which made me smile!</p>
<p><strong>Paul Quigley</strong> (<a href="https://twitter.com/paulyq">@paulyq</a>) introduced <em><a href="http://www.newswhip.com">NewsWhip</a></em>, a technology that tracks all the news shared on Facebook and Twitter each day to find the fastest-spreading, most-shared, high-quality stuff.</p>
<p><strong>Neil O&#8217;Connor</strong> from <em><a href="http://blockmetrics.com">Blockmetrics</a> </em>showed his technology, which detects ads being blocked by website visitors, analyses how much ad revenue is being lost as a consequence, and shows how companies can do something about it.</p>
<p>The industry is very well aware of the challenges it faces, although some delegates were surprised to hear that mobile advertising would not be a panacea for falling revenues. This industry faces tough times ahead, but refreshingly it is proactively looking at innovation as both a defence mechanism and a route to growth and profitability.</p>
<div id="attachment_758218560" class="wp-caption aligncenter" style="width: 310px"><a href="http://scraperwiki.files.wordpress.com/2013/05/200-aldersgate.jpg"><img class=" wp-image-758218560  " alt="200 Aldersgate London" src="http://scraperwiki.files.wordpress.com/2013/05/200-aldersgate.jpg?w=300&#038;h=224" width="300" height="224" /></a><p class="wp-caption-text">200 Aldersgate, London</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/scraperwiki.wordpress.com/758218558/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/scraperwiki.wordpress.com/758218558/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218558&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/03/newspapers-advertising-revenue-innovation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a801e770feed3df03f36195443374935?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ainemcguire</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/cartoon-newspaper-164x300.jpg" medium="image">
			<media:title type="html">Cartoon-newspaper-164x300</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/theywriteforyou.png?w=150" medium="image">
			<media:title type="html">They Write for You</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/assetbubble.png?w=150" medium="image">
			<media:title type="html">A load of bubbles!</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/devonrose.png?w=300" medium="image">
			<media:title type="html">rose visualisation in Devon&#039;</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/05/200-aldersgate.jpg?w=300" medium="image">
			<media:title type="html">200 Aldersgate London</media:title>
		</media:content>
	</item>
		<item>
		<title>Internships &#8211; coding and data science</title>
		<link>http://blog.scraperwiki.com/2013/05/01/internships-coding-and-data-science/</link>
		<comments>http://blog.scraperwiki.com/2013/05/01/internships-coding-and-data-science/#comments</comments>
		<pubDate>Wed, 01 May 2013 15:46:27 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218404</guid>
		<description><![CDATA[The last two summers, we had a really good intern (Aidan Hobson Sayers - thanks for finding him for us, John!). We&#8217;d like to do it again this year. We&#8217;ve opportunities in three areas, depending on your skills and interests. Platform team &#8230; <a href="http://blog.scraperwiki.com/2013/05/01/internships-coding-and-data-science/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218404&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="attachment_758218622" class="wp-caption alignright" style="width: 310px"><a href="http://scraperwiki.files.wordpress.com/2013/04/img_2550.jpeg"><img class="size-medium wp-image-758218622" alt="friendly data scientists" src="http://scraperwiki.files.wordpress.com/2013/04/img_2550.jpeg?w=300&#038;h=224" width="300" height="224" /></a><p class="wp-caption-text">Friendly data scientists</p></div>
<div id="attachment_758218623" class="wp-caption alignright" style="width: 310px"><a href="http://scraperwiki.files.wordpress.com/2013/04/img_2549.jpeg"><img class="size-medium wp-image-758218623" alt="meet the marmot" src="http://scraperwiki.files.wordpress.com/2013/04/img_2549.jpeg?w=300&#038;h=224" width="300" height="224" /></a><p class="wp-caption-text">Maurice the ScraperWiki marmot</p></div>
<div id="attachment_758218624" class="wp-caption alignright" style="width: 310px"><a href="http://scraperwiki.files.wordpress.com/2013/04/img_2548.jpeg"><img class="size-medium wp-image-758218624" alt="devoted developers" src="http://scraperwiki.files.wordpress.com/2013/04/img_2548.jpeg?w=300&#038;h=224" width="300" height="224" /></a><p class="wp-caption-text">Devoted developers</p></div>
<p>The last two summers, we had a really good intern (Aidan Hobson Sayers - thanks for finding him for us, <a href="http://blog.johnmckerrell.com/">John</a>!).</p>
<p>We&#8217;d like to do it again this year. We&#8217;ve opportunities in three areas, depending on your skills and interests.</p>
<ol>
<li><span style="font-size:16px;line-height:1.5;">Platform team &#8211; CoffeeScript, Backbone, Unix. We use Extreme Programming.</span></li>
<li><span style="font-size:16px;line-height:1.5;">Data science team &#8211; Python, R. Scraping, statistics, working with customers.</span></li>
<li>Tool making team &#8211; gorgeous user interfaces, with a mixture of the above skills.</li>
</ol>
<p>The deal is&#8230;</p>
<ul>
<li>Work with a friendly, talented team in Liverpool, where a whole community is quietly growing the UK&#8217;s next big tech cluster.</li>
<li><span style="font-size:16px;line-height:1.5;">It&#8217;s at ScraperWiki&#8217;s offices. You need to be either based in commuting range, or prepared to move here for at least 6 weeks over the summer.</span></li>
<li>We pay either travelling expenses, or if you&#8217;re more experienced, the standard student summer placement weekly rate.</li>
<li>Oh, and you learn about startups, and changing the world of data analysis on the web.</li>
</ul>
<p><span style="font-size:16px;line-height:1.5;">If you&#8217;d like to apply, please send:</span></p>
<ul>
<li><span style="line-height:15.994318008423px;">Your CV</span></li>
<li>A link to a scraper you&#8217;ve written of some kind, or an open source project you&#8217;ve made a large contribution to</li>
</ul>
<p>To <a href="mailto:francis@scraperwiki.com">francis@scraperwiki.com</a> with the word &#8220;swintern&#8221; in the subject. We&#8217;ll take applications from either students or non-students.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/scraperwiki.wordpress.com/758218404/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/scraperwiki.wordpress.com/758218404/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218404&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/05/01/internships-coding-and-data-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/img_2550.jpeg?w=300" medium="image">
			<media:title type="html">friendly data scientists</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/img_2549.jpeg?w=300" medium="image">
			<media:title type="html">meet the marmot</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/img_2548.jpeg?w=300" medium="image">
			<media:title type="html">devoted developers</media:title>
		</media:content>
	</item>
		<item>
		<title>A sea of data</title>
		<link>http://blog.scraperwiki.com/2013/04/30/a-sea-of-data/</link>
		<comments>http://blog.scraperwiki.com/2013/04/30/a-sea-of-data/#comments</comments>
		<pubDate>Tue, 30 Apr 2013 11:14:29 +0000</pubDate>
		<dc:creator>drj11</dc:creator>
				<category><![CDATA[thoughts]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218445</guid>
		<description><![CDATA[My friend Simon Holgate of Sea Level Research has recently &#8220;cursed&#8221; me by introducing me to tides and sea-level data. Now I&#8217;m hooked. Why are tides interesting? When you&#8217;re trying to navigate a super-tanker into San Francisco Bay and you &#8230; <a href="http://blog.scraperwiki.com/2013/04/30/a-sea-of-data/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218445&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://scraperwiki.files.wordpress.com/2013/04/napoleon_sainthelene.jpg"><img src="http://scraperwiki.files.wordpress.com/2013/04/napoleon_sainthelene.jpg?w=300&#038;h=218" alt="Napoleon_sainthelene" width="300" height="218" class="alignright size-medium wp-image-758218457" /></a>My friend Simon Holgate of <a href="http://sealevelresearch.com/">Sea Level Research</a> has recently &#8220;cursed&#8221; me by introducing me to tides and sea-level data. Now I&#8217;m hooked. Why are tides interesting? When you&#8217;re trying to navigate a super-tanker into San Francisco Bay and you only have a few centimetres of clearance, whether the tide is in or out could be quite important!</p>
<p>The French port of Brest has the longest historical tidal record. The Joint Archive for Sea Level has hourly readings from 1846. Those of you wanting to follow along at home should get the code:</p>
<pre class="brush: plain; title: ; notranslate">
    git clone git://github.com/drj11/sea-level-tool.git
    cd sea-level-tool
    virtualenv .
    . bin/activate
    pip install -r requirements.txt
</pre>
<p>After that lot (phew!), you can get the data for Brest by going:</p>
<pre class="brush: plain; title: ; notranslate">
    code/etl 822a
</pre>
<p>The sea level tool is written in Python and uses our <a href="http://github.com/scraperwiki/scraperwiki-python">scraperwiki</a> library to store the sea level data in a sqlite database.</p>
<p>Tide data can be surprisingly complex (the 486 pages of [<a href="http://eprints.soton.ac.uk/19157/1/sea-level.pdf">PUGH1987</a>] are testimony to that), but in essence we have a time series of heights, <var>z</var>. Often even really simple analyses can tell us interesting facts about the data.</p>
<p>As Ian tells us, <a href="http://blog.scraperwiki.com/2013/03/27/book-review-r-in-action-by-robert-i-kabacoff/">R is good for visualisations</a>. And it turns out it has an installable RSQLite package that can load R dataframes from a sqlite file. And I feel like a grown-up data scientist when I use R. The relevant snippet of R is:</p>
<pre class="brush: r; title: ; notranslate">
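    library(RSQLite)
    # Open the sqlite file written by the sea level tool.
    db &lt;- dbConnect(dbDriver('SQLite'), dbname='scraperwiki.sqlite', loadable.extensions=TRUE)
    # Load the Brest observations, in time order, into a dataframe.
    bre &lt;- dbGetQuery(db, &quot;SELECT * FROM obs WHERE jaslid = 'h822a' ORDER BY t&quot;)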
</pre>
<p>I&#039;m sure you&#039;re all aware that the sea level goes up and down to make <var>tides</var> and some tides are bigger than others. Here&#8217;s a typical month at Brest (1999-01):</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/bre-ts.png"><img src="http://scraperwiki.files.wordpress.com/2013/04/bre-ts.png?w=480&#038;h=160" alt="bre-ts" width="480" height="160" class="size-medium wp-image-758218448" /></a></p>
<p>There are well over 1500 months of data for Brest. Can we summarise the data? A histogram works well:</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/bre-hist.png"><img src="http://scraperwiki.files.wordpress.com/2013/04/bre-hist.png?w=300&#038;h=300" alt="bre-hist" width="300" height="300" class="alignleft size-medium wp-image-758218450" /></a></p>
<p>Remember that this is a histogram of hourly sea level observations. So the two humps show the most frequent sea level heights that appear in the hourly series. These are clustered around two heights that are more commonly observed than all others. These are the <var>mean low tide</var>, and the <var>mean high tide</var>. The <var>range</var>, the distance between mean low tide and mean high tide, is about 2.5 metres (big tides, big data!). </p>
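<p>To see why a histogram of hourly heights has two humps, here is a minimal Python sketch using made-up data rather than the Brest record. It assumes an idealised tide that is a single sine wave with a 12.42-hour period, swinging 1.25 metres either side of an invented 4.0 metre mean (so the range is 2.5 metres). The function name <code>height_histogram</code> and all the numbers are illustrative assumptions, not part of the sea level tool.</p>

```python
import math
from collections import Counter


def height_histogram(heights, bin_width):
    """Count observations per bin; each bin is labelled by its lower edge."""
    counts = Counter()
    for z in heights:
        counts[math.floor(z / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical stand-in for real data: one year of hourly samples of a
# sine wave with a 12.42-hour period, 1.25 m either side of a 4.0 m mean.
hours = range(24 * 365)
heights = [4.0 + 1.25 * math.sin(2 * math.pi * t / 12.42) for t in hours]

hist = height_histogram(heights, bin_width=0.25)

# Crude text histogram: one '#' per 40 observations in the bin.
for lower_edge, count in hist.items():
    print(f"{lower_edge:5.2f} m  {'#' * (count // 40)}")
```

<p>A sine wave moves slowly near its turning points and quickly through the middle, so the hourly samples pile up in the lowest and highest bins. Real tides are messier than this, but the same effect is one way to read the two humps in the Brest histogram.</p>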
<p>This is a comparatively large range, certainly compared to a site like St Helena (where the British imprisoned Napoleon after his defeat at Waterloo). Let&#8217;s plot St Helena&#8217;s tides on the same histogram as Brest, for comparison:</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/sth2-hist.png"><img src="http://scraperwiki.files.wordpress.com/2013/04/sth2-hist.png?w=300&#038;h=300" alt="sth2-hist" width="300" height="300" class="alignleft size-medium wp-image-758218452" /></a></p>
<p>Again we have a mean low tide and a mean high tide, but this time the range is about 0.4 metres, and the entire span of observed heights including extremes fits into 1.5 metres. St Helena is a rock in the middle of a large ocean, and this small range is typical of the oceanic tides. It&#8217;s the shallow waters of a continental shelf, and complex basin dynamics in northwest Europe (and Kelvin waves, see <a href="http://www.youtube.com/watch?v=uQPpPhxqPxY">Lucy&#8217;s IgniteLiverpool</a> talk for more details) that give ports like Brest a high tidal range.</p>
<p>Notice that St Helena has some <em>negative</em> sea levels. Sea level is measured to a 0-point that is fixed for each station but varies from station to station. It is common to pick that point as being the lowest sea level (either observed or predicted) over some period, so that almost all actual observations are positive. Brest follows the usual convention: almost all the observations are positive (you can&#8217;t tell from the histogram but there are a few negative ones). It is not clear what the 0-point on the St Helena chart is (it&#8217;s clearly not a low low water, and doesn&#8217;t look like a mean water level either), and I have exhausted the budget for researching the matter.</p>
<p>Tides are a new subject for me, and when I was reading Pugh&#8217;s book, one of the first surprises was the existence of places that do not get two tides a day. An example is Fremantle, Australia, which instead of getting two tides a day (semi-diurnal) gets just one tide a day (diurnal):</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/fre-ts.png"><img src="http://scraperwiki.files.wordpress.com/2013/04/fre-ts.png?w=640" alt="fre-ts"   class="size-full wp-image-758218454" /></a></p>
<p>The diurnal tides are produced predominantly by the effect of <a href="http://en.wikipedia.org/wiki/Declination">lunar declination</a>. When the moon crosses the equator (twice a nodical month), its declination is zero, the effect is reduced to zero, and so are the diurnal tides. This is in contrast to the twice-daily tides: they vary between large (spring) and small (neap) tides, but we still get tides whatever time of the month it is. Because of the modulation of the diurnal tide there is no &#8220;mean low tide&#8221; and &#8220;mean high tide&#8221;; tides of all heights are produced, and we get a single hump in the distribution (adding the Fremantle data in red):</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/fre3-hist.png"><img src="http://scraperwiki.files.wordpress.com/2013/04/fre3-hist.png?w=640" alt="fre3-hist"   class="alignleft size-full wp-image-758218455" /></a></p>
<p>So we&#8217;ve found something interesting about the Fremantle tides from the kind of histogram which we probably learnt to do in primary school.</p>
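<p>The single hump can be reproduced with a toy model in Python. This is only a sketch under loud assumptions: the diurnal tide is taken to be one sine wave with a period of one lunar day (24.84 hours), multiplied by a slow envelope that falls to zero twice in every 27.21-day month. The 0.5 metre peak amplitude, the bin width and the names <code>diurnal_height</code> and <code>BIN</code> are invented for illustration; none of it comes from the Fremantle data.</p>

```python
import math
from collections import Counter

BIN = 0.125  # bin width in metres (an arbitrary illustrative choice)


def diurnal_height(t):
    """Height in metres at hour t for a hypothetical modulated diurnal tide."""
    envelope = math.sin(2 * math.pi * t / (24 * 27.21))  # zero twice a month
    carrier = math.sin(2 * math.pi * t / 24.84)          # one cycle per lunar day
    return 0.5 * envelope * carrier


# One year of hourly samples, binned by the lower edge of each bin.
heights = [diurnal_height(t) for t in range(24 * 365)]
hist = dict(sorted(Counter(math.floor(z / BIN) * BIN for z in heights).items()))

# Crude text histogram: one '#' per 40 observations in the bin.
for lower_edge, count in hist.items():
    print(f"{lower_edge:7.3f} m  {'#' * (count // 40)}")
```

<p>Because the envelope keeps shrinking the oscillation to nothing, the turning points are smeared across every height, and the samples end up concentrated around the middle instead of at two fixed extremes.</p>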
<p>Napoleon died on St Helena, but my investigations into St Helena&#8217;s tides will continue on the ScraperWiki data hub, using a mixture of standard platform tools, like the <a href="http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/">summarise tool</a>, and custom tools, like a tidal analysis tool.</p>
<p><em> Image &#8220;Napoleon at Saint-Helene, by Francois-Joseph Sandmann,&#8221; in Public Domain from <a href="https://en.wikipedia.org/wiki/File:Napoleon_sainthelene.jpg">Wikipedia</a></em></p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/scraperwiki.wordpress.com/758218445/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/scraperwiki.wordpress.com/758218445/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218445&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/04/30/a-sea-of-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/9c6dfaac50b9c43815dd18081e87f3e3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">drj11</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/napoleon_sainthelene.jpg?w=300" medium="image">
			<media:title type="html">Napoleon_sainthelene</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/bre-ts.png" medium="image">
			<media:title type="html">bre-ts</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/bre-hist.png?w=300" medium="image">
			<media:title type="html">bre-hist</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/sth2-hist.png?w=300" medium="image">
			<media:title type="html">sth2-hist</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/fre-ts.png" medium="image">
			<media:title type="html">fre-ts</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/fre3-hist.png" medium="image">
			<media:title type="html">fre3-hist</media:title>
		</media:content>
	</item>
		<item>
		<title>Summarise #2: Pies and facts</title>
		<link>http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/</link>
		<comments>http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/#comments</comments>
		<pubDate>Mon, 29 Apr 2013 15:02:38 +0000</pubDate>
		<dc:creator>Francis Irving</dc:creator>
				<category><![CDATA[beta]]></category>

		<guid isPermaLink="false">http://blog.scraperwiki.com/?p=758218406</guid>
		<description><![CDATA[In a previous blog post, I showed how by counting the most common values in each column (like a pivot table, or &#8220;group by&#8221; in SQL),  I managed to make a tool that can automatically summarise datasets. I quickly realised &#8230; <a href="http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/">Continue reading <span class="meta-nav">&#8594;</span></a><img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scraperwiki.com&#038;blog=14548467&#038;post=758218406&#038;subd=scraperwiki&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In a <a href="http://blog.scraperwiki.com/?p=758218316">previous blog post</a>, I showed how by counting the most common values in each column (like a pivot table, or &#8220;group by&#8221; in SQL),  I managed to make a tool that can automatically summarise datasets.</p>
<p>I quickly realised that there were better ways of visualising the data than just showing tables. If a column has only a few possible values, it makes better sense as a pie chart.</p>
<p>For example, these are the oceans from the Climate Code Foundation&#8217;s sea-level station data (the same dataset that appeared in the last blog post).</p>
<p><a href="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-04-29.png"><img class=" wp-image-758218408 alignnone" alt="JASL ocean pie chart" src="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-04-29.png?w=305&#038;h=354" width="305" height="354" /></a></p>
<p>After playing with a few datasets, and with David&#8217;s help, we found that the pies are useful when there are more than two but fewer than eight values.</p>
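<p>A minimal sketch of that rule, reading &#8220;values&#8221; as the distinct values in a column. The helper name <code>isPieFriendly</code> is hypothetical and is not taken from facts.js; it only illustrates the &#8220;more than two but fewer than eight&#8221; test.</p>

```javascript
// Hypothetical helper, not the code from facts.js.
// Returns true when a column has more than two but fewer than eight
// distinct values, the range in which a pie chart was found useful.
function isPieFriendly(columnValues) {
  const distinct = new Set(columnValues);
  return distinct.size > 2 && distinct.size < 8;
}

// Three distinct oceans: a pie chart is a reasonable choice.
console.log(isPieFriendly(["Atlantic", "Pacific", "Indian", "Atlantic"])); // true

// Only two distinct values: too few for a pie under this rule.
console.log(isPieFriendly(["yes", "no", "yes"])); // false
```

<p>Using a <code>Set</code> keeps the check independent of how often each value occurs; only the number of different values matters here.</p>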
<p>The code that makes the pie chart is in the &#8220;fact_groups_pie&#8221; function in the <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/facts.js">facts.js</a> file. I&#8217;m calling each possible visualisation a &#8220;fact&#8221;. The &#8220;add_fact&#8221; function in <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/code.js">code.js</a> compares the possible facts for each column and shows only the one with the highest priority. For example, a pie chart (if there are few enough values) overrides a table.</p>
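<p>The idea of one winning fact per column can be sketched roughly as follows. This is an illustration only: the function <code>offerFact</code>, the shape of the fact objects and the priority numbers are all assumptions, not the actual contents of code.js.</p>

```javascript
// Hypothetical sketch of "the highest-priority fact wins for each column".
// The object shape and priority numbers are assumptions, not code.js internals.
function offerFact(chosen, fact) {
  const current = chosen.get(fact.column);
  // Keep the new fact only if the column has none yet,
  // or the new one outranks the fact already chosen.
  if (current === undefined || fact.priority > current.priority) {
    chosen.set(fact.column, fact);
  }
  return chosen;
}

const chosen = new Map();
offerFact(chosen, { column: "ocean", kind: "table", priority: 1 });
offerFact(chosen, { column: "ocean", kind: "pie", priority: 2 });
offerFact(chosen, { column: "station", kind: "table", priority: 1 });

console.log(chosen.get("ocean").kind);   // pie
console.log(chosen.get("station").kind); // table
```

<p>In this sketch the pie replaces the table for the &#8220;ocean&#8221; column because it was given a higher priority, while &#8220;station&#8221; keeps its table because nothing better was offered.</p>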
<p>The pie is made using <a href="https://developers.google.com/chart/interactive/docs/gallery/piechart">Google Charts</a> (code in <a href="https://github.com/frabcus/magic-summary-tool/blob/master/http/charts.js">charts.js</a>) – I deliberately wanted to keep things simple for this tool. Because the visualisations are chosen automatically, it didn&#8217;t feel right to hand-craft them in D3.</p>
<p><strong>You can play too!</strong> <a href="http://blog.scraperwiki.com/2013/04/19/two-ways/">If you are part of the Beta</a>, you can use the &#8220;Summarise automatically&#8221; tool yourself now on your own dataset. Either upload a spreadsheet with the &#8220;Upload spreadsheet&#8221; tool, or use the &#8220;Twitter search tool&#8221; or one of the coding tools to get some data you care about into ScraperWiki. Then choose &#8220;Summarise automatically&#8221; from the tools menu and see what surprises there are.</p>
<p>You&#8217;ll probably see one of the visualisation types I haven&#8217;t talked about yet. Next time &#8211; all about showing time and numbers using buckets&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scraperwiki.com/2013/04/29/summarise-2-pies-and-facts/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/385f073a12b016d1a85c0fda88ce82d5?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">frabcus</media:title>
		</media:content>

		<media:content url="http://scraperwiki.files.wordpress.com/2013/04/screen-shot-2013-04-10-at-18-04-29.png" medium="image">
			<media:title type="html">JASL ocean pie chart</media:title>
		</media:content>
	</item>
	</channel>
</rss>
