Culture Hack Day: CultureGrid hacks - A mobile app, JSONP API and a large image scraper
For the recent Culture Hack Day, I did a handful of small hacks around CultureGrid. CultureGrid is an aggregated search across the archives of 80+ UK cultural institutions, so you can use it to search across museums for 1.2m things - paintings, drawings, photographs, objects, locations, and so on. Quite the treasure trove!
They have an XML API, and I wanted to use jQuery Mobile to build a mobile interface to the database. So step 1 was to find a way to represent their XML as JSON for Backbone.js. I wrote an XML parser and made it available as a JSONP proxy. (I've since found out that there's a way to retrieve JSON directly from their API, but it's undocumented.)
Try it: JSONP from the CultureGrid
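To give a feel for what the proxy does, here is a minimal sketch in Python. It is not the code behind the link above: the Solr-style XML shape, the sample records and the function names (doc_to_dict, xml_to_docs, to_jsonp) are all assumptions made for illustration, and the real CultureGrid response may be structured differently.

```python
import json
import re
import xml.etree.ElementTree as ET

# Made-up sample in an assumed Solr-style shape; not real CultureGrid output.
SAMPLE_XML = """<response>
  <result numFound="2">
    <doc>
      <str name="title">Example painting</str>
      <str name="thumbnail">http://example.org/thumb1.jpg</str>
    </doc>
    <doc>
      <str name="title">Example drawing</str>
      <arr name="subject"><str>landscape</str><str>river</str></arr>
    </doc>
  </result>
</response>"""

# Only allow callback names that look like JavaScript identifiers or
# dotted paths, so the proxy cannot be used to inject arbitrary script.
CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$.]*")


def doc_to_dict(doc):
    """Turn one <doc> element into a plain dict keyed by each field's name."""
    record = {}
    for field in doc:
        name = field.get("name")
        if name is None:
            continue
        if field.tag == "arr":
            # Multi-valued fields become lists of strings.
            record[name] = [(child.text or "") for child in field]
        else:
            record[name] = field.text or ""
    return record


def xml_to_docs(xml_text):
    """Parse the XML and return a list of dicts, one per <doc>."""
    root = ET.fromstring(xml_text)
    return [doc_to_dict(doc) for doc in root.iter("doc")]


def to_jsonp(xml_text, callback):
    """Wrap the docs as JSON inside a call to the requested callback."""
    if not CALLBACK_RE.fullmatch(callback):
        raise ValueError("unsafe callback name")
    payload = json.dumps({"docs": xml_to_docs(xml_text)})
    return f"{callback}({payload});"
```

Calling to_jsonp(SAMPLE_XML, "handleResults") returns a string that a browser can load through a script tag. That is what lets a page on another domain read the results without running into the same-origin policy.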
Then I could start using that JSON in a mobile app. I've been playing with CoffeeScript recently (I've got a spare-time mobile app I'm working on), and I'm loving it so far.
So, by representing each of the results (they are called 'docs' in the XML) as a Backbone Model and adding a Backbone Controller to handle listing and showing the results, I quickly had a way to search the database on a mobile device.
Try it: CultureGrid mobile app
But one problem I found was that CultureGrid would supply me with lots of information about something, but only give me a very small thumbnail of the result. If I wanted to see the painting that I was searching for, I'd have to follow a link to it and view it on the website of the institution itself. Not great for a mobile app.
So, how do you scrape those images into the mobile app? All the sites are different, the resulting pages contain lots of other images, and I don't want to write a scraper for every single site.
I came up with this solution - a generic scraper that learns over time. It has an ultra-simple API:
- Fire a web page URL at it.
- It grabs the page and looks for images within the HTML.
- It weights the results by how many times it has seen each of the images (logos fall to the bottom of the rankings quite quickly).
- It picks the outlier or, if it can't find one, takes the best five and weights them by how many bytes long they are, which it finds by doing a HEAD request against each of them.
- It gets the best one it can find.
- Then it caches the result for next time and stores the URLs of the images it has seen.
- Then it serves the cached image via Varnish for subsequent requests.
Try it: Culture scraper
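The ranking steps above can be sketched roughly like this in Python. This is an illustration under stated assumptions, not the scraper's actual code: the names (ImageCollector, head_size, pick_image) are hypothetical, "outlier" is assumed to mean a single image seen less often than every other image on the page, and the caching and Varnish layers are left out.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImageCollector(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(urljoin(self.base_url, src))


def head_size(url):
    """Content-Length reported by a HEAD request, or 0 if unavailable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            return int(resp.headers.get("Content-Length") or 0)
    except (OSError, ValueError):
        return 0


def pick_image(page_url, html, seen, size_of=head_size):
    """Guess the main image on a page.

    `seen` is a Counter of image URLs from earlier pages; it is updated
    in place, which is how the scraper "learns" that logos repeat.
    """
    collector = ImageCollector(page_url)
    collector.feed(html)
    candidates = list(dict.fromkeys(collector.urls))  # de-duplicate, keep order
    if not candidates:
        return None
    # Counts from before this page, then record this page's images.
    counts = {url: seen[url] for url in candidates}
    seen.update(candidates)
    # Least-seen images first; frequently seen ones (logos) sink.
    ranked = sorted(candidates, key=counts.get)
    rarest = [url for url in ranked if counts[url] == counts[ranked[0]]]
    if len(rarest) == 1:
        return rarest[0]  # assumed meaning of "the outlier"
    # No clear outlier: take the best five and prefer the largest file.
    return max(ranked[:5], key=size_of)
```

The size_of parameter is there so the HEAD requests can be swapped for a stub when experimenting offline. With a shared Counter, the first page from a site falls back to file size, and later pages start picking the image the scraper hasn't seen before.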
So, plugging in the scraper, I could add full-size images to the mobile app. Total time: probably 16 hours. All the code for these examples is available via the links above or on my GitHub.
What next? I'd love to apply some of this to a particular collection or archive, maybe look at what you can do with geolocation. Oh hang on - there's a hack day on February 18th to take some of these things and run with them...