Internships – coding and data science

Friendly data scientists

Maurice the ScraperWiki marmot

Devoted developers

The last two summers, we had a really good intern (Aidan Hobson Sayers - thanks for finding him for us, John!).

We’d like to do it again this year. We’ve opportunities in three areas, depending on your skills and interests.

  1. Platform team – CoffeeScript, Backbone, Unix. We use Extreme Programming.
  2. Data science team – Python, R. Scraping, statistics, working with customers.
  3. Tool making team – gorgeous user interfaces, with a mixture of the above skills.

The deal is…

  • Work with a friendly, talented team in Liverpool, where a whole community is quietly growing the UK’s next big tech cluster.
  • It’s at ScraperWiki’s offices. You need to be either based in commuting range, or prepared to move here for at least 6 weeks over the summer.
  • We pay either travelling expenses or, if you’re more experienced, the standard weekly rate for a student summer placement.
  • Oh, and you learn about startups, and changing the world of data analysis on the web.

If you’d like to apply, please send:

  • Your CV
  • A link to a scraper you’ve written of some kind, or an open source project you’ve made a large contribution to

To francis@scraperwiki.com with the word “swintern” in the subject. We’ll take applications from either students or non-students.

Posted in jobs | Leave a comment

A sea of data

My friend Simon Holgate of Sea Level Research has recently “cursed” me by introducing me to tides and sea-level data. Now I’m hooked. Why are tides interesting? When you’re trying to navigate a super-tanker into San Francisco Bay and you only have a few centimetres of clearance, whether the tide is in or out could be quite important!

The French port of Brest has the longest historical tidal record. The Joint Archive for Sea Level has hourly readings from 1846. Those of you wanting to follow along at home should get the code:

    git clone git://github.com/drj11/sea-level-tool.git
    cd sea-level-tool
    virtualenv .
    . bin/activate
    pip install -r requirements.txt

After that lot (phew!), you can get the data for Brest by going:

    code/etl 822a

The sea level tool is written in Python and uses our scraperwiki library to store the sea level data in a sqlite database.

Tide data can be surprisingly complex (the 486 pages of [PUGH1987] are testimony to that), but in essence we have a time series of heights, z. Often even really simple analyses can tell us interesting facts about the data.

As Ian tells us, R is good for visualisations. And it turns out it has an installable RSQLite package that can load R dataframes from a sqlite file. And I feel like a grown-up data scientist when I use R. The relevant snippet of R is:

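    # Load the RSQLite package and open the sqlite file holding the sea level data
    library(RSQLite)
    db <- dbConnect(dbDriver('SQLite'), dbname='scraperwiki.sqlite', loadable.extensions=TRUE)
    # Fetch every observation for Brest (station h822a) in time order
    bre <- dbGetQuery(db, "SELECT * FROM obs WHERE jaslid = 'h822a' ORDER BY t")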
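If you would rather stay in Python, the same query can be sketched with the standard sqlite3 module. This is only a sketch: the table `obs` and the columns `jaslid` and `t` come from the R query above, but the height column name `z`, the text timestamps and the sample rows (including the made-up station "h999x") are my assumptions, and the demo uses an in-memory database so that it runs without the real file.

```python
import sqlite3


def load_station(conn, jaslid):
    """Return (t, z) rows for one station, ordered by time.

    Assumes a table `obs` with columns `jaslid`, `t` and `z`;
    the name `z` for the height column is a guess.
    """
    cur = conn.execute(
        "SELECT t, z FROM obs WHERE jaslid = ? ORDER BY t", (jaslid,)
    )
    return cur.fetchall()


# Stand-in for sqlite3.connect("scraperwiki.sqlite"), with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (jaslid TEXT, t TEXT, z REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [
        ("h822a", "1999-01-01T01:00", 4.1),
        ("h822a", "1999-01-01T00:00", 3.2),
        ("h999x", "1999-01-01T00:00", 0.5),  # hypothetical second station
    ],
)

rows = load_station(conn, "h822a")
for t, z in rows:
    print(t, z)
```

Passing the station id as a query parameter, rather than pasting it into the SQL string, sidesteps the quoting question altogether.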

I'm sure you're all aware that the sea level goes up and down to make tides and some tides are bigger than others. Here’s a typical month at Brest (1999-01):

bre-ts

There are well over 1500 months of data for Brest. Can we summarise the data? A histogram works well:

bre-hist

Remember that this is a histogram of hourly sea level observations. So the two humps show the most frequent sea level heights that appear in the hourly series. These are clustered around two heights that are more commonly observed than all others. These are the mean low tide, and the mean high tide. The range, the distance between mean low tide and mean high tide, is about 2.5 metres (big tides, big data!).
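To see why an hourly series of a twice-daily tide gives two humps, here is a toy model in Python. It is not the real Brest data: it just adds two cosine constituents with the periods of the main lunar (M2, 12.42 hours) and solar (S2, 12.00 hours) semi-diurnal tides. The amplitudes, the bin edges and the function names are illustrative assumptions of mine.

```python
import math


def tide(t_hours, constituents):
    """Sum of cosine constituents, each given as (amplitude_m, period_hours)."""
    return sum(a * math.cos(2 * math.pi * t_hours / p) for a, p in constituents)


def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that a value exactly equal to hi lands in the last bin.
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts


# Periods are those of M2 and S2; the amplitudes are made up for illustration.
SEMI_DIURNAL = [(2.0, 12.42), (0.7, 12.00)]

heights = [tide(t, SEMI_DIURNAL) for t in range(24 * 365)]  # one year, hourly
counts = histogram(heights, -3.0, 3.0, 12)

for i, c in enumerate(counts):
    print(f"{-3.0 + 0.5 * i:+.1f} m  {'#' * (c // 25)}")
```

The sea spends more time near the top and bottom of each cycle than rushing through the middle, so the bins either side of the mean level fill up faster than the bins at the mean level itself.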

This is a comparatively large range, certainly compared to a site like St Helena (where the British imprisoned Napoleon after his defeat at Waterloo). Let’s plot St Helena’s tides on the same histogram as Brest, for comparison:

sth2-hist

Again we have a mean low tide and a mean high tide, but this time the range is about 0.4 metres, and the entire span of observed heights including extremes fits into 1.5 metres. St Helena is a rock in the middle of a large ocean, and this small range is typical of the oceanic tides. It’s the shallow waters of a continental shelf, and complex basin dynamics in northwest Europe (and Kelvin waves, see Lucy’s IgniteLiverpool talk for more details) that give ports like Brest a high tidal range.

Notice that St Helena has some negative sea levels. Sea level is measured to a 0-point that is fixed for each station but varies from station to station. It is common to pick that point as being the lowest sea level (either observed or predicted) over some period, so that almost all actual observations are positive. Brest follows the usual convention: almost all the observations are positive (you can’t tell from the histogram but there are a few negative ones). It is not clear what the 0-point on the St Helena chart is (it’s clearly not a low low water, and doesn’t look like a mean water level either), and I have exhausted the budget for researching the matter.

Tides are a new subject for me, and when I was reading Pugh’s book, one of the first surprises was the existence of places that do not get two tides a day. An example is Fremantle, Australia, which instead of getting two tides a day (semi-diurnal) gets just one tide a day (diurnal):

fre-ts

The diurnal tides are produced predominantly by the effect of lunar declination. When the moon crosses the equator (twice a nodical month), its declination is zero, the effect is reduced to zero, and so are the diurnal tides. This is in contrast to the twice-daily tides: while they vary between large (spring) and small (neap) tides, we still get tides whatever time of the month it is. Because of the modulation of the diurnal tide there is no “mean low tide” or “mean high tide”; tides of all heights are produced, and we get a single hump in the distribution (adding the Fremantle data in red):

fre3-hist

So we’ve found something interesting about the Fremantle tides from the kind of histogram which we probably learnt to do in primary school.
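The single hump can be reproduced with a toy model too. The sketch below is not the Fremantle data: it adds two cosines with the periods of the main diurnal constituents (K1 at 23.93 hours and O1 at 25.82 hours) and gives them equal amplitudes, which is my simplifying assumption so that the two cancel completely about every 13.6 days. The 0.5 metre bin width is also just an illustrative choice.

```python
import math
from collections import Counter

K1_HOURS = 23.93  # period of the K1 diurnal constituent
O1_HOURS = 25.82  # period of the O1 diurnal constituent


def diurnal_height(t_hours):
    """Toy diurnal tide: two equal-amplitude cosines that beat against each other."""
    return (math.cos(2 * math.pi * t_hours / K1_HOURS)
            + math.cos(2 * math.pi * t_hours / O1_HOURS))


heights = [diurnal_height(t) for t in range(24 * 365)]  # one year, hourly

# Label each 0.5 m bin by its lower edge and count the hourly heights in it.
bins = Counter(math.floor(h / 0.5) * 0.5 for h in heights)
modal_bin = max(bins, key=bins.get)

for edge in sorted(bins):
    print(f"{edge:+.1f} m  {'#' * (bins[edge] // 25)}")
```

Because the daily swing shrinks all the way to nothing, heights near the middle are visited on big-tide days and small-tide days alike, and the fullest bin ends up next to zero rather than out at the extremes.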

Napoleon died on St Helena, but my investigations into St Helena’s tides will continue on the ScraperWiki data hub, using a mixture of standard platform tools, like the summarise tool, and custom tools, like a tidal analysis tool.

Image “Napoleon at Saint-Helene, by Francois-Joseph Sandmann,” in Public Domain from Wikipedia

Posted in thoughts | Leave a comment

Summarise #2: Pies and facts

In a previous blog post, I showed how by counting the most common values in each column (like a pivot table, or “group by” in SQL), I managed to make a tool that can automatically summarise datasets.
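As a reminder of what that counting looks like, here is a minimal sketch in Python. It is not the tool's actual code (which is JavaScript); the function name `summarise`, the `top` parameter and the sample rows are all invented for illustration.

```python
from collections import Counter


def summarise(rows, top=3):
    """Map each column name to its `top` most common (value, count) pairs."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, Counter())[value] += 1
    return {name: counter.most_common(top) for name, counter in columns.items()}


# Made-up rows standing in for a real dataset.
rows = [
    {"ocean": "Atlantic", "country": "France"},
    {"ocean": "Atlantic", "country": "UK"},
    {"ocean": "Pacific", "country": "USA"},
    {"ocean": "Atlantic", "country": "USA"},
]

summary = summarise(rows)
for name, common in summary.items():
    print(name, common)
```

Each column is treated on its own, which is what a separate “group by” per column would give you in SQL.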

I quickly realised that there were better ways of visualising the data than just showing tables. For example, if there are only a few possible values for a column, it makes better sense as a pie chart.

These are the oceans from the Climate Code Foundation’s sea-level station data (the same dataset that appeared in the last blog post).

JASL ocean pie chart

After playing with a few datasets, and with David’s help, we found that the pies are useful when there are more than two but fewer than eight values.

The code that makes the pie chart is in the “fact_groups_pie” function in the facts.js file. I’m calling each possible visualisation a “fact”. There’s a bunch of code in the “add_fact” function in code.js which, for each possible fact, decides which has the highest priority, and shows that one for each column. For example, a pie chart (if there are few enough values) overrides a table.
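The idea of letting the highest-priority applicable fact win can be sketched in a few lines of Python. This is an illustration of the idea only, not a port of facts.js or code.js: the names `FACTS`, `choose_fact`, `pie_applies` and `table_applies` and the priority numbers are hypothetical. The pie rule is the "more than two but fewer than eight values" rule from above.

```python
from collections import Counter


def table_applies(counts):
    """A table of the most common values can always be shown."""
    return True


def pie_applies(counts):
    """Pies are useful for more than two but fewer than eight distinct values."""
    return 2 < len(counts) < 8


# (priority, name, applies) for each fact; the highest applicable priority wins.
# These entries are hypothetical, not the tool's real list.
FACTS = [
    (1, "table", table_applies),
    (2, "pie", pie_applies),
]


def choose_fact(values):
    """Pick the name of the highest-priority fact that applies to one column."""
    counts = Counter(values)
    applicable = [(priority, name)
                  for priority, name, applies in FACTS
                  if applies(counts)]
    return max(applicable)[1]


print(choose_fact(["Atlantic", "Pacific", "Indian", "Atlantic"]))
print(choose_fact(["yes", "no", "yes"]))
```

Adding a new kind of visualisation then means adding one more entry to the list, with a priority and a test for when it applies.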

The pie is made using Google charts (code in charts.js) – I deliberately wanted to keep things simple for this tool. Because the visualisations are automatically chosen, it didn’t feel right to hand craft them in D3.

You can play too! If you are part of the Beta, you can use the “Summarise automatically” tool yourself now on your own dataset. Either upload a spreadsheet with the “Upload spreadsheet” tool, or use the “Twitter search tool” or one of the coding tools to get some data you care about into ScraperWiki. Then choose “Summarise automatically” from the tools menu and see what surprises there are.

You’ll probably see one of the visualisation types I haven’t talked about yet. Next time – all about showing time and numbers using buckets…

Posted in beta | 4 Comments

data-driven london week

Most mornings this week, I awoke in the mystical land of Hackney, and battled hordes of hipster-cyclists to make my way to the Google Campus – a refuge of data-folk. At least, that’s how I like to remember it.

As I blogged last week, several ScraperWikians attended and spoke at a range of events, all put on to the tune of “Big Data.” I spent Monday evening with a friendly meetup group talking about the importance of data in marketing. And on Wednesday, I watched a very smart presentation by Thomas Stone (hopefully, soon-to-be Dr. Stone) from prediction.io, which looks to be an interesting, open-source project for developers to call upon machine learning without the need for proprietary lock-in.

Alongside Stone, I also learned about Games Analytics from their COO, Mark Robinson. The gist of the talk was that games – particularly online games – give their producers the chance to deeply understand how players actually use their product. Through continuous contact with the players, they can learn: what stops them from playing, where they find it difficult to continue, how many times they log in before purchasing… What I liked about this was the lack of hand-wavey discussion about “data leading to insights.” Instead, Robinson’s talk focused on how this data can lead to quite practical decisions, such as making levels of a game quicker at the start, reducing the cost in places, and increasing it in others.

Between those two events, I had the tremendous privilege of joining around 120 others for the W3C’s Open Data on the Web workshop. The remarkable brain-power per square inch at the workshop was mentioned quite a few times, and – although I tend to feel disinclined to perpetuate that kind of talk – I must agree. The Campus hosted architects, businesspeople, developers, hackers and scientists, from government bodies, universities, NGOs and foundations mixed with large companies (including IBM, Adobe, Tesco and Google).

I was particularly drawn to discussions about building and growing businesses on data. I’m intrigued by the use of open data to augment private data, and think ScraperWiki is well placed to work on it – for example: taking aggregated customer data, and matching it with government stats, open geographic data, public social media, etc. I’ve got a few ideas for some tooling for the new ScraperWiki platform, which I’d like to explore in a few weeks.

I don’t feel there is enough space here to do proper justice to the topics covered, but suffice it to say I’m glad I had a chance to go, and was able to take part in the afternoon’s Barcamp (our team discussed the application of the recent revolution of distributed coding workflows to data handling – in other words, Github for data).

I would also like to point out a few of the sessions, and recommend the papers to read:

I don’t have a link yet to Tesco’s talk (just the abstract) about their huge sets of data (product, customers, locations, journeys…), but if anyone has, or as soon as I find it, I’ll put it here!

Posted in events | Leave a comment

Book review: JavaScript: The Good Parts by Douglas Crockford

This week I’ve been programming in JavaScript, something of a novelty for me. Jealous of the Dear Leader’s “Summarise automatically” tool, I wanted to make something myself; hopefully a future post will describe my timeline visualising tool. Further motivations are that web scraping requires some knowledge of JavaScript, since it is a key browser technology, and that, in its prototypical state, the ScraperWiki platform sometimes requires you to launch a console and type in JavaScript to do stuff.

I have two books on JavaScript. The one I review here is JavaScript: The Good Parts by Douglas Crockford, a slim volume which tersely describes what the author feels are the best bits of JavaScript, incidentally highlighting the bad bits. The second book is the JavaScript Bible by Danny Goodman, Michael Morrison, Paul Novitski and Tia Gustaff Rayl, which I bought some time ago, impressed by its sheer bulk, but which I am unlikely ever to read, let alone review!

Learning new programming languages is easy in some senses: it’s generally straightforward to get something to happen simply because core syntax is common across many languages. The only seriously different language I’ve used is Haskell. The difficulty with programming languages is idiom, and the parallel is with human languages: the barrier to making yourself understood in a language is low, but to speak fluently and elegantly needs a higher level of understanding which isn’t simply captured in grammar. Programming languages are by their nature flexible so it’s quite possible to write one in the style of another – whether you should do this is another question.

My first programming language was BASIC, and I suspect I speak all other computer languages with a distinct BASIC accent. As an aside, Edsger Dijkstra has said:

[...] the teaching of BASIC should be rated as a criminal offence: it mutilates the mind beyond recovery.

- so perhaps there is no hope for me.

JavaScript has always felt to me a toy language: it originates in a web browser and relies on HTML to import libraries but nowadays it is available on servers in the form of node.js, has a wide range of mature libraries and is very widely used. So perhaps my prejudices are wrong.

The central idea of JavaScript: The Good Parts is to present an ideal subset of the language, the Good Parts, and ignore the less good parts. The particular bad parts of which I was glad to be warned are:

  • JavaScript arrays aren’t proper arrays with array-like performance, they are weird dictionaries;
  • variables have function not block scope;
  • unless declared inside a function variables have global scope;
  • there is a difference between the equality == and === (and similarly the inequality operators). The short one coerces and then compares, the longer one does not, and is thus preferred. 

I liked the railroad presentation of syntax and the section on regular expressions is good too.

Railroad syntax diagram - for statement

Elsewhere Crockford has spoken approvingly of CoffeeScript, which compiles to JavaScript but is arguably syntactically nicer; it appears to hide some of the bad parts of JavaScript which Crockford identifies.

If you are new to JavaScript but not to programming then this is a good book which will give you a fine start and warn you of some pitfalls. You should be aware that you are reading about Crockford’s ideal not the code you will find in the wild.

Posted in research | 1 Comment

Two ways you can help guide ScraperWiki’s new platform.

You will have noticed some activity over the past few weeks, as we have begun reaching out about the new ScraperWiki platform. We’ve blogged about some of the new features, and have invited the first ever users outside the office to have a poke around the beta.

That initial feedback has been immeasurably helpful, and has led to bug-fixes, feature requests, and some directional suggestions which we can’t thank you enough for.

But we need more.

  • more feedback
  • more testers
  • more questions answered
  • more coffee…

So, there are now two ways you can join the testing community, and – with absolutely no exaggeration – play a vital part in the future functionality, design, and direction of the new ScraperWiki.

First, you can become a premium user of the New ScraperWiki. We have just switched on the payment plans, and are tweaking settings and tuning things up as it’s rolled out. You can see the two available premium plans here, and sign up.

Second, you can join the queue for private free-tier testers by emailing new@scraperwiki.com. Just ping us your name from the email account you’d like inviting, and we will add you to the list, which we’re breaking into groups simply so we can test different things and make the most of your first impressions!

Posted in beta | Leave a comment

Big Data Week Events

Next week, a plethora of organisations, hackers and data scientists are celebrating “Big Data Week,” and the ScraperWiki team will be taking part in London.

We will be supporting the DoES Liverpool exhibit at the Internet of Things stream of Internet World at Earls Court (#internetworld2013). Francis will also be giving a talk at 1:30 on Wednesday: discussing the past, present and future of government data, and you can catch his talk at the Big Data Show “Volume and Variety” Theatre. Registration is free but the organisers recommend you book in advance.

I will be more or less camped at the Google Campus in Shoreditch, attending Marketing ReMix on Monday and the Big Data Meetup hosted by the Geckoboard team on Wednesday. If anyone’s about on Tuesday, I’ll be working from the Google cafe, so drop me a line if you’d like to meet up!

There will be plenty of opportunity to meet in London next week, so if you have any questions for ScraperWikians, get in touch and come join us!

Posted in events | 1 Comment