Internet Archive News


The Little Bot That Could


Meet oclcBot. He was written by Bruce Washburn at OCLC Research to help connect Open Library records to their corresponding OCLC WorldCat records. He’s just finished updating almost 4 million Open Library editions with links! No metadata exchange at all, except these identifiers. Tiny, but powerful, because that lets systems that “speak OCLC” communicate directly with Open Library without knowing any Open Library IDs. As Anand mentioned in his recent post about Coverstore Improvements, we’ve also made the system for displaying covers externally using other types of identifiers more efficient.

There was a bit of a bumpy start to oclcBot’s updates, and Bruce and I thought it might be good to hear what it was like in the trenches. From Bruce:

This project was essentially very simple: find corresponding Open Library and OCLC WorldCat records by a shared attribute (ISBN), and update the Open Library record with the corresponding OCLC number. Once OCLC had generated a list of OCLC numbers and their corresponding ISBNs, it seemed to be a simple matter of using the very robust Open Library API to look for matching records, check to see if they already included an OCLC number, and update the record accordingly. Complications arose, related to scale. There were about 90 million ISBNs to check from the OCLC list, and checking them one at a time via the API was projected to take a very long time. So we used a data dump of all the Open Library records to identify those with ISBNs, and also built a very fast index of the OCLC list to check against. With that we were able to produce a new list of Open Library records and corresponding new OCLC numbers. And a batch update facility in the Open Library API made it possible to send API requests 1,000 records at a time. The pre-processing and the batch process both yielded some additional lists that will require more scrutiny to process (records associated with multiple ISBNs, API exceptions for individual records), but the great majority of records were updated via the oclcBot without any further effort.
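For the curious, the flow Bruce describes (index the OCLC list by ISBN, scan the Open Library dump, then send updates in groups of 1,000) might look roughly like the sketch below. This is not oclcBot's actual code. The function names are made up for illustration, the edition fields `isbn_10`, `isbn_13` and `oclc_numbers` are assumed, and the sample values are invented. The rule for setting records aside (ISBNs that point to more than one OCLC number) is only one guess at what "more scrutiny" could mean. The API call itself is left out on purpose.

```python
from itertools import islice


def build_isbn_index(oclc_pairs):
    """Map each ISBN to the set of OCLC numbers listed for it."""
    index = {}
    for isbn, oclc_number in oclc_pairs:
        index.setdefault(isbn, set()).add(oclc_number)
    return index


def plan_updates(editions, isbn_index):
    """Split dump records into clean updates and keys that need a human look."""
    updates, needs_review = [], []
    for edition in editions:
        if edition.get("oclc_numbers"):
            continue  # already has an OCLC number, nothing to do
        isbns = edition.get("isbn_10", []) + edition.get("isbn_13", [])
        matches = set()
        for isbn in isbns:
            matches |= isbn_index.get(isbn, set())
        if len(matches) == 1:
            updates.append({"key": edition["key"], "oclc_numbers": sorted(matches)})
        elif len(matches) > 1:
            # Assumption: ambiguous matches are set aside rather than written.
            needs_review.append(edition["key"])
    return updates, needs_review


def batches(items, size=1000):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Invented sample data, only to show the shapes involved.
sample_index = build_isbn_index([("1111111111", "100"), ("2222222222", "200")])
sample_editions = [
    {"key": "/books/OL1M", "isbn_10": ["1111111111"]},
    {"key": "/books/OL2M", "isbn_10": ["1111111111", "2222222222"]},
]
sample_updates, sample_review = plan_updates(sample_editions, sample_index)
```

Each list produced by `batches(sample_updates)` would then become a single batch request, and the keys in `sample_review` correspond to the leftover lists Bruce mentions. Building the index up front is the point of the exercise: one pass over the dump replaces roughly 90 million individual lookups.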

So, it’s still early days with our Bot operations, but we’re looking for external developers who might be interested in trying these “surgical strike” style updates to loads of Open Library records at once. If you’re curious, please visit our Writing Open Library Bots page in the Open Library Developers area.

Thank you, Bruce!

(And thanks to Solo for the CC BY-NC-SA 2.0 oclcBot photo.)

Originally posted on The Open Library Blog by George Oates.

Written by internetarchive

May 3, 2011 at 7:11 pm

