Microfinance Data Scraping

I went to DataKind’s New York DataDive last November and met the Microfinance Information Exchange (MIX), a group that ‘delivers data services, analysis, research and business information on the institutions that provide financial services to the world’s poor’. They wanted to see whether web scraping could save them from gathering data manually, so fellow divers and I showed MIX what it can do. Over the course of a day, about six people scraped data about microfinance institutions from a bunch of websites, saving MIX an estimated year of manual data entry.

Over the past few months, I worked further with MIX to study who has access to what sorts of financial services. DataKind just put up our blog post about the project. Read the post, or just look at the map and explore the data.

Screenshot of the interactive map displaying the scraped data


5 yr old goes ‘potty’ at Devon and Somerset Service (Emergencies and Data Driven Stories)

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time and rich data would revolutionize their governance and administration, would give real evidence of whether there are too many or too few police, fire or ambulance personnel/vehicles/stations in any locale, and would enable the implementation of imaginative and realistic policies resulting in major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add or take away staff depending on their workloads. The problem is that an emergency service that exists on standby for unforeseen events needs a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization; indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance where the Excel spreadsheet of Fire Spending By city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the story journalism on it doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails(), MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static url, eg:

http://www.dsfire.gov.uk/News/Newsdesk/IncidentDetail.cfm?IncidentID=7901&siteCategoryId=3&T1ID=26&T2ID=41

MainIndex() operates their search incidents form, grabbing 10 days at a time and saving the URL of each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parsing the obvious metadata, and saving the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move it into the third function so I could develop it more easily.) A good way to find the list of urls that have not been downloaded and saved into the swdetails is to use the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
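The batch-and-repeat loop described above can be sketched roughly as follows. This is not the scraper's actual code: it is written for Python 3 against the standard sqlite3 module rather than the ScraperWiki library, and the helper names fetch_pending and download_all, the stand-in downloader and the example.com URLs are all assumptions made for illustration. Only the table and column names come from the text.

```python
import sqlite3

def fetch_pending(conn, batch_size=5):
    """Return up to batch_size (IncidentID, urlpage) rows with no match in swdetails."""
    return conn.execute(
        "select swdata.IncidentID, swdata.urlpage "
        "from swdata "
        "left join swdetails on swdetails.IncidentID = swdata.IncidentID "
        "where swdetails.IncidentID is null "
        "limit ?",
        (batch_size,),
    ).fetchall()

def download_all(conn, download):
    """Repeat the batch query until no unmatched records are retrieved."""
    ndone = 0
    while True:
        pending = fetch_pending(conn)
        if not pending:
            return ndone
        for incident_id, urlpage in pending:
            conn.execute(
                "insert into swdetails (IncidentID, divdetails) values (?, ?)",
                (incident_id, download(urlpage)),
            )
            ndone += 1
        conn.commit()

# Demonstration with an in-memory database and a stand-in downloader
# (hypothetical URLs; a real run would fetch the incident pages instead).
conn = sqlite3.connect(":memory:")
conn.execute("create table swdata (IncidentID text, urlpage text)")
conn.execute("create table swdetails (IncidentID text, divdetails text)")
conn.executemany(
    "insert into swdata values (?, ?)",
    [(str(n), "http://example.com/incident/%d" % n) for n in range(12)],
)
ndownloaded = download_all(conn, lambda url: "<div>stub for %s</div>" % url)
print(ndownloaded)
```

Because each pass only asks for rows that are still missing from swdetails, the loop can be interrupted and restarted without keeping any separate record of progress.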

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving the results into the table swparse. Because I was developing this function experimentally to see how much information I could obtain from the free-form text, I had to frequently drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w\-]+)", details, re.I)

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)
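One possible way to handle the first-name-plus-last-name case is sketched below. It is an untested-on-real-data variation, not the expression the scraper uses: it assumes Python 3, assumes names are capitalised in the reports, and the function name find_officers is made up for the example. Rank words are matched case-insensitively, while name words must start with a capital letter.

```python
import re

# Rank is matched case-insensitively via the scoped (?i:...) group;
# the name part stays case-sensitive so that "was" is not taken as a surname.
OFFICER_RE = re.compile(
    r"(?i:(crew|watch|station|group|incident|area)\s+(commander|manager))"
    r"[ ]+([A-Z][\w\-']*(?:[ ]+[A-Z][\w\-']*)?)"
)

def find_officers(details):
    """Return (rank prefix, rank, name) tuples; the name is one or two capitalised words."""
    return OFFICER_RE.findall(details)

print(find_officers("Crew Commander Squirrell was in charge."))
print(find_officers("Watch Manager John Smith took charge."))
```

This will still misfire when a capitalised word that is not part of the name directly follows it on the same line, so the results would need the same eyeballing as the original table.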

For detecting how many vehicles were in attendance, I used this algorithm:

import re

# details is the plain-text incident description; data is the dict of values for this incident
appliances = re.findall(r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details, re.I)
nvehicles = 0
for scount, fire, engine, town in appliances:
    # Remember the first station town mentioned
    if town and "town" not in data:
        data["town"] = town.lower()
    # Turn the word in front of the vehicle into a number; unrecognized words count as zero
    if re.match("one|1|an?|another", scount, re.I):  count = 1
    elif re.match("two|2", scount, re.I):            count = 2
    elif re.match("three", scount, re.I):            count = 3
    elif re.match("four", scount, re.I):             count = 4
    else:                                            count = 0
    nvehicles += count

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day each incident took place, and produced this time rose, which is a polar bar graph with one sector showing the number of incidents occurring each hour.
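The numbers behind a time rose like this are just a count of incidents per hour of the day. A minimal sketch of that counting step, assuming Python 3 and assuming each incident time is available as a string in 'YYYY-MM-DD HH:MM' form (the real column format may differ, and the sample timestamps are invented):

```python
from collections import Counter
from datetime import datetime

def incidents_per_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a list of 24 counts, one per hour of the day (one per sector of the rose)."""
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

# Made-up timestamps, for illustration only
sample = ["2012-03-14 09:54", "2012-03-14 09:10", "2012-03-15 23:30"]
sectors = incidents_per_hour(sample)
print(sectors)
```

Each of the 24 values becomes the length of one sector of the polar bar graph.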

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.
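Before calling any directions service, a rough first pass could use straight-line distance between the station and the geolocated incident. The sketch below uses the standard haversine formula in Python 3; it gives distance as the crow flies, not driving distance, and the coordinates in the example are approximate, illustrative values rather than data from the scraper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates for a station and an incident
station = (50.46, -3.53)
incident = (50.72, -3.53)
print(haversine_km(station[0], station[1], incident[0], incident[1]))
```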

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three different historically independent services?

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.


Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here are some ways that an exception might be raised.

[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.

Example 1: Inconsistent date formats

Let’s say we’re parsing dates.

import datetime

This doesn’t raise an error.

datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')

But this does.

datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime

def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    try:
        cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
        cleandate = None

    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:
        try:
            cleandate = datetime.datetime.strptime(rawdate, date_format)
            return cleandate
        except ValueError:
            pass

    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import time
import urllib2

def load(url):
    retries = 3
    for i in range(retries):
        try:
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
                raise
            else:
                time.sleep(42)
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page three times. On the first two failures, it waits 42 seconds and tries again. On the third failure, it raises the error. On a success, it returns the content of the page.

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki

for document_name in document_names:
    try:
        parse_document(document_name)
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.
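To look for those trends, one option is to group the logged errors by exception type. The sketch below is an assumption-laden illustration: it uses Python 3 and the standard sqlite3 module with an in-memory errors table shaped like the one saved above, and the rows in it are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table errors (documentName text, exceptionType text, exceptionMessage text)"
)
# Invented rows standing in for what the logging loop would have saved
conn.executemany(
    "insert into errors values (?, ?, ?)",
    [
        ("a.html", "<class 'ValueError'>", "time data 'April 19, 2012' does not match format"),
        ("b.html", "<class 'IndexError'>", "list index out of range"),
        ("c.html", "<class 'ValueError'>", "time data '19/04/12' does not match format"),
    ],
)

# Most frequent exception types first
summary = conn.execute(
    "select exceptionType, count(*) as n from errors "
    "group by exceptionType order by n desc, exceptionType"
).fetchall()
for exception_type, n in summary:
    print(n, exception_type)
```

The most common type is usually the one worth fixing in the parser; the rare ones are candidates for hard-coding.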

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.

for bar in bars:
    try:
        foo(bar)
    except:
        print('Failure at bar = "%s"' % bar)
        raise

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki

resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
# Start enumerate at resume_index so that i is a position in bars, not in the slice
for i, bar in enumerate(bars[resume_index:], resume_index):
    try:
        foo(bar)
    except:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki

resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
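# Start enumerate at resume_index so that i is a position in bars, not in the slice
for i, bar in enumerate(bars[resume_index:], resume_index):
    try:
        foo(bar)
    except scraperwiki.CPUTimeExceededError:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise  # stop here so the reset after the loop does not wipe the saved index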
    except Exception as e:
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)

tl;dr

Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.


Announcing ScraperWiki Premium Accounts!

ScraperWiki digger in front of credit card payment logos

The most exciting bit about ScraperWiki is how it forms a link between two very different worlds.

On the one hand, we love the public good that data liberation enables, and we’re used by everyone from journalists (did you see us on the Guardian front page last week?) to activists (like the guys behind Australian planning alerts).

But we also love the value that businesses create using data. They use ScraperWiki in many ways – like pulling customised marketing leads from the web, and extracting and cleaning old proprietary data so it can be sold anew – something we’ll be blogging about a lot more in the next few weeks.

Today, we’re really excited to announce that anyone (be they journalists, businesses or anything else!) can now use ScraperWiki in private with the click of a button. Our new premium accounts range from $9 per month for individuals, to $299 for corporates with lots of collaborators – all you need is a credit card.

For that monthly fee you get to make ScraperWiki vaults (secure, private areas, which you can share with precisely who you want) and you also get the ability to schedule any scraper to run hourly (for data feeds that update more often than once a day).

This will let journalists keep their scrapers secret – embargoed until they write their story. It will let businesses scrape websites without revealing to their competitors the advantage they’ve found. It will let anyone scrape their own private data, in private, to repurpose it and do wonderful things that nobody had ever intended.

We’re quite excited to hear about what you do. Since vaults are private we won’t know, so please get in touch. We’d love to write about it here, if you’ll let us.


Parsing panic

This is a guest post by Martha Rotter, co-founder of Woop.ie and recently launched Irish technology magazine Idea.

Hey, remember the Wikipedia blackout? I do, because I was highly amused by the number of students panicking over papers or homework they seemingly could not complete without this one website.

One of my favourite things to do with ScraperWiki is to capture people’s reactions and sentiments, and then try to make predictions based on the data. I call it a “Zeitgeist Parse”, because I’m looking for the general public’s response to some event currently happening. Looking at the barrage of tweets coming from confused and frustrated students, I wondered could we predict an upcoming epidemic of bad grades or test results.

PROCESS

I built a few quick scrapers to grab tweets related to Wikipedia blackouts. The queries I used were “wikipedia AND paper” and “wikipedia AND homework”. I thought there might be slight variations in what people with homework were worried about versus maybe more detailed term papers or reports. You can see the Python code for them on my Scraperwiki profile.

After the results were stored, I wanted to do something very simple. I wanted to parse all of the records and get the words tweeted most frequently. From there, I could start to analyze the data more clearly and find patterns and trends.

One way to do this is to take the data & use something like IBM’s ManyEyes to get a visualization of frequently used text. This is handy if you want a Tag Cloud or basic chart to view the results.

However I was conscious of the fact that with so many tweets, it could be easy to miss smaller but still significant trends. A really easy way to parse and sort text is by using Excel + VBA. Since ScraperWiki can export to CSV, I downloaded the CSV files & wrote a small macro to walk through the words and count instances of each of them. After sorting the results, I had a fairly solid picture of the top words used by protesting tweeters.
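The same word count can also be done without leaving Python. This is a sketch of the idea rather than the macro described above: it assumes Python 3, assumes the exported CSV has a column named text holding the tweet (the real column name may differ), and uses two invented tweets as input. Capitalization is left alone, so DOWN and down are counted separately.

```python
import csv
import io
from collections import Counter

def count_words(csv_text, column="text"):
    """Count whitespace-separated words in one column of a CSV export."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.update(row[column].split())
    return counts

# Invented sample standing in for a downloaded CSV file
sample_csv = (
    "text\n"
    '"wikipedia is DOWN and I NEED it"\n'
    '"NEED wikipedia for homework TOMORROW"\n'
)
counts = count_words(sample_csv)
for word, n in counts.most_common(5):
    print(n, word)
```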

WHAT I DIDN’T FIND

I actually did not find specific subjects. Hardly any comments about which course or paper was in danger due to the shutdown. Few worries about particular subjects, the notable exceptions being history, with 50 instances and English with 37 instances appearing in the data. For a moment, my experiment was basically a waste of time and processing power.

WHAT I DID FIND

But as I examined the results, what I actually found was slightly more interesting. After removing obvious words like Wikipedia and homework, I started to see a few recurring patterns in terms of type of language used.

The panic of the situation jumps out immediately. Words like GOTTA (I didn’t remove capitalization as it adds context in this scenario), fail, DOWN, NEED, TOMORROW, extension, justmyluck, screwed, and even HLP!!!!!!! appear in high numbers throughout the results.

Next I noticed the very emotional nature of the language. As expected, lots of swearing and foul language appears. But also high instances of things like hate, mad, suck, freaking, omfg, fixitnow, and of course WTF showed up in the data. As someone who in college definitely did my share of writing papers the night before they were due, I understand the terror and panic. On the other hand, I was usually surrounded by library books I had checked out (probably that day) with no fear that they might suddenly go blank.

The last pattern that I noticed was one of interesting hashtags. These included expected ones like #blackout, #badtiming, #PIPA, #stopSOPA, #wikipediablackout, and #sopa. But also some really bizarre ones that I have no idea how they related to the situation, and may simply remain a mystery: #fratproblems, #thekidsareourfuture, #BingGrlProblems, #SHOUTOUT, and #cooooooooooooooooooooooooooooooool. But my favourite one was probably #GoToALibrary!

SO YOU WANNA CREATE A ZEITGEIST PARSE?

Start by identifying your query parameters. Are you searching by words, by geography, by date? Remember that Twitter’s Search API only goes back a few days, so if you’re looking for anything older than a week this API won’t work. Twitter’s API documentation is great but does change every so often so keep an eye on Twitter Developers for the most up-to-date information about their API and what you can use as parameters for the Search API.

Once you have defined your query, the next step is to create your ScraperWiki scraper with the information. Feel free to copy the source from one of my scrapers, like this one, and update it with your own parameters.

Next you’ll need to set up the scraper to run one or more times. How often do you want it to run to get useful results? You can run it once & download the data as a JSON or CSV file, or as a SQLite database. Or you can schedule it to run at regular intervals and download the info yourself each time.

After you have the data you need, all you have left to do is analyse. I mentioned ManyEyes earlier, which you can use to get some nice visualizations quite easily or you can use Excel or Google Refine to parse and examine the data. If you’re comfortable with JavaScript, something like HighCharts can help to create nice, interactive visualizations easily from your data.

And now you have a good overview of what people think about the given topic in either a dataset or visualization. Hopefully you made some predictions about what you would find so you can validate your predictions or, as in my case here, observe something completely different.

SUMMARY

Writing a quick Python method using ScraperWiki to query Twitter’s search API is fairly straightforward. Finishing a term paper without using Wikipedia on the other hand? Not so straightforward for some unfortunate students!

You can read Martha’s articles about the Irish presidential election at Visualizing Aras Election and Visualizing Aras Election, Part Two.


Is scraping legal?

Lots of people, when they hear about ScraperWiki, ask “is scraping legal? how can you build a business off that?”. They usually follow up by saying “we do it in our company, but we would never tell anyone”.

This is strange to us, as we have come from a world of good scraping. Taking Government data, and making it easier for people to use for things that benefit all of society. We’re in favour of that kind of scraping.

It’s obviously a spectrum. At the other extreme, the most evil scraping would be to steal content that somebody else sells, and then to republish it at harm to their business. We’re against that kind of scraping.

It’s not scraping itself which is good or bad, or legal or illegal, but the circumstances in which you’re doing it.

We’ve written up our policy about the legality in full; it’s in our FAQ under ‘What’s your policy on what’s legal to scrape?’. It has lots of details about robots.txt and takedown notices, and about what our legal responsibility is and what yours is.

Finally, ScraperWiki isn’t just about scraping.

We’re a data hub, and you need to get data into a data hub. As well as scraping, lots of people make API calls to do that on ScraperWiki, or download their own files from their own servers.

This is much more profound than it sounds – when you are using data for a new purpose, even if it is already structured, you still need to get it and convert it to your new needs. How you do that is a detail that depends on the circumstances.

The difference between parsing HTML web pages, and using a JSON REST API is surprisingly small. As an example, Thomas scraped EventBrite even though it has an API (see the post at the end of that thread by Ryan who works at EventBrite!), because it was easier at the time for him.

What matters is getting the data, and converting it into a form where it can do something useful for the world. And doing that legally. Whether you’re using Nokogiri or Nestful.


…in data we trust….

We’re in Washington DC, the nation’s capital and US HQ! The city is bathed in spring sunlight, the blossoms are out and there’s a bit of a buzz about the town. The ScraperWiki truck is getting ready to park at The Washington Post on Friday and Saturday for our 3rd major US Journalism Data Camp (hashtag: #jdcdc)

It’s an election year so we can be forgiven for feeling a little smug, our raison d’etre is to dig up data, so where better to make it happen than at The Washington Post, a newspaper that inspired a generation of investigative journalists, inscribed the word ‘Watergate’ as a formal entry in the Oxford English Dictionary, and made ‘deep throat’ a double entendre!

Health, transport, education, security: they’re all ripe for data liberation. We’ve detected interest in “Super” PACs and lobbying data, so let’s hope we see a major focus on these at the event. One of AP’s senior investigative reporters, Jack Gillum (@jackgillum), is keen to drill into Independent Expenditure (aka Election Advertisements) and Campaign Finance Disclosure data. Our own Julian Todd (@goatchurch) has commenced work on liberating lobbying data in New York.

The guys here at The Washington Post have a wish list for liberation and it’s by no means exhaustive:

We’re thrilled by the fact that we signed up so many data scientists and media professionals. The coders will be freeing and/or learning to scrape data and everyone else will be facilitated into teams to hypothesize, gather, analyze, create and present stories and applications based on data. The outcomes will be presented on Saturday at 04:00p and we have a bunch of prizes to give away for the most inspired ideas. We also have some special ScraperWiki prizes for technical contributions.

What’s happening on Friday 30th?

08:30a We will open registration and serve tea, coffee and biscuits

09:30a Kick-off and a short plenary. We’ll hear from Vernon Loeb (@VernonLoeb) about what it’s like to work as a data digger at the Post, and Chuck Lewis (@crelewis) from AU will talk about partnership with the capital’s flagship publication. Our own Francis Irving (@frabcus) will say hello and talk ‘data’. Julian Todd (@goatchurch) and Thomas Levine (@thomaslevine) will explain why scraping is an important technique for getting data and show some examples. Tom Lee (@tjl) from Sunlight Foundation will make a callout for help with their GASP project and put some context around votesmart closing their doors.

10:15a The Data Derby and Data Liberators will meet and pore over data ideas. We will review the lifecycle of a data-driven story and familiarise people with the ScraperWiki Data Derby route map. We will set out some ideas and facilitate people into teams, with each picking a magnet as their map route icon. The coders who have signed up for the morning ‘Learn to Scrape’ with Python class will be directed to the tutorial room for the three-hour session. Anyone signed up for the afternoon tutorial will join the data derby/liberators for some fun.

Data Derby Route Map

12:45p Lightning Talk: Greg Franczyk from The Washington Post will talk about the evolving role of data in the media industry: specifically, how it is gathered and stored, its changing relationship with news, and how it’s presented to consumers.
Callout – Mjumbe Poe (@mjumbewu), Code for America Fellow, would like to share the story of scraping council data for Councilmatic and to get people interested in tackling the agendas.

01:00p Light lunch

01:30p Projects continue…

02:15p ‘Learn to Scrape’ Python afternoon tutorial commences (three-hour session)

05:30p Reception (Beer and Pizza).

******************Special NOTE********

Learn to Scrape
The two three-hour tutorials on Friday morning and afternoon will be run by our chief data scientist Julian Todd (@goatchurch) and data advocate Thomas Levine (@thomaslevine), aided and abetted by Michelle Koeth (@michellekoeth), Code for America Fellow. They will cover things like identifying good targets for web scraping and navigating the complexity of different types of web pages. Attendees will create their own scrapers. The objective will be to get the data into a structured format and join it with data from another source. If time allows, we will also try to encourage people to do further analysis.

*************************

What’s happening on Saturday 31st March?

09:30a Welcome plus tea, coffee and biscuits

09:45a Throughout the morning we will follow the Data Derby route map – please study the picture above.

12:45p Lightning talk – Jack Gillum (@jackgillum), AP Investigative Journalist, and Michelle Minkoff (@michelleminkoff), Interactive Producer, will talk about how “Super” PACs and big money have dominated this election cycle and tell us that there is little to fear, as there is a mountain of data available on who’s backing presidential candidates, which can help journalists make sense of the big-time fundraisers this year. They’ll also talk about Federal Election Commission filings and show how they can be parsed for good storytelling.
Callout: Jan Scaffer (@janjlab) from J-Lab wants to invite ideas from our participants on how to better organize and collect their data, which includes one of the largest databases of U.S. community news sites and a significant database of grant-funded media projects.

01:00p Light lunch

02:00p Project teams will finalize the details of their data stories in preparation for the presentation.

03:00p Heading towards the finishing line…

04:00p Presentations and Prizes.

The American University School of Communication has been amazingly supportive. Sharon Metcalf (Director of Partnerships and Programs) is an absolute gem, as are her colleagues Lynne Perri (@Lynneperri), Professor of Journalism, and Chuck Lewis (@crelewis), Professor of Journalism and Executive Editor of the Investigative Reporting Workshop, who were instrumental in getting the event off the ground. We have also been overwhelmed by the support from Vernon Loeb (@VernonLoeb), the Local Editor at The Washington Post, who together with Greg Franczyk has set us up in their swish conference center. A huge ‘thank you’ to Jane Lockhart and her operations team for helping us with logistics. And last but by no means least, a big round of applause to Associated Press for helping to fund our refreshments, Sunlight Foundation for our beer and pizza, and J-Lab for sponsoring the prizes. Hip Hip Hooray!

Eugene Meyer (Foyer - The Washington Post)
