Internet Archive News

updates about archive.org

Time travel through millions of historic Open Library images

The BBC has an article about Kalev Leetaru’s project to extract images from millions of Open Library pages.

You can read about how it works…

The Internet Archive had used an optical character recognition (OCR) program to analyse each of its 600 million scanned pages in order to convert the image of each word into searchable text. As part of the process, the software recognised which parts of a page were pictures in order to discard them.

Mr Leetaru’s code used this information to go back to the original scans, extract the regions the OCR program had ignored, and then save each one as a separate file in the Jpeg picture format. The software also copied the caption for each image and the text from the paragraphs immediately preceding and following it in the book. Each Jpeg and its associated text was then posted to a new Flickr page, allowing the public to hunt through the vast catalogue using the site’s search tool.

“I think one of the greatest things people will do is time travel through the images,” Mr Leetaru said.

… or just check out some of the results. Images plus citations plus metadata! We couldn’t be happier. Free to use with no restrictions.

Image from page 301 of "The New England magazine" (1887)

Image from page 788 of "St. Nicholas [serial]" (1873)

Image from page 210 of "Farmington, Connecticut, the village of beautiful homes" (1906)

Image from page 1121 of "The Saturday evening post" (1839)

Image from page 368 of "New England; a human interest geographical reader" (1917)

Image from page 902 of "Canadian grocer July-December 1896" (1889)

Image from page 249 of "Gleanings in bee culture" (1874)

Image from page 411 of "The Canadian druggist" (1889)

Originally posted on The Open Library Blog by Jessamyn West.

Written by internetarchive

August 29, 2014 at 10:52 pm

Posted in internet archive

Millions of historic images posted to Flickr

by Robert Miller, Global Director of Books, Internet Archive

“Reading a book from the inside out!” Well, not quite, but a new way to read our eBooks has just been launched. Check out this great BBC article:
http://www.bbc.co.uk/news/technology-28976849

And this fabulous Flickr Commons collection:
https://www.flickr.com/photos/internetarchivebookimages

What is it and how did it get done?
A Yahoo research fellow at Georgetown University, Kalev Leetaru, extracted over 14 million images from 2 million Internet Archive public domain eBooks that span over 500 years of content. Because we have OCR’d the books, we have now been able to attach about 500 words before and after each image. This means you can now see, click and read about each image in the collection. Think full-text search of images!
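To make the "500 words before and after" idea concrete, here is a minimal sketch in Python. It is only an illustration, not the code used for the project: the function name `context_window`, the flat list of OCR words, and the `image_index` position are all assumptions made for this example.

```python
def context_window(words, image_index, size=500):
    """Return (before, after) word lists around an image position.

    words: OCR text of a book as a list of words in reading order
        (hypothetical input format).
    image_index: the image is assumed to sit just before words[image_index].
    size: maximum number of words kept on each side (about 500 in the post).
    """
    start = max(0, image_index - size)  # do not run past the start of the book
    before = words[start:image_index]
    after = words[image_index:image_index + size]  # slicing clips at the end
    return before, after


# Toy example with a tiny window so the result is easy to read.
words = "the telephone shown in the figure above was patented in 1876".split()
before, after = context_window(words, image_index=6, size=3)
print(before)  # ['in', 'the', 'figure']
print(after)   # ['above', 'was', 'patented']
```

Because the surrounding words travel with each image, a plain text search for a word like “telephone” can surface pictures whose nearby text mentions it.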

How many images are there?
As of today, 2.6 million of the 14 million images have been uploaded to Flickr Commons. Soon we will be able to add continuously to this collection from the 1,000+ new eBooks we scan each day. Dr. Simon Chaplin, Head of the Wellcome Library, says, “This way of discovering and reading a book will help transform our medical heritage collection as it goes up online. This is a big step forward and will bring digitized book collections to new audiences.”

What is fun to do with this collection?
Try typing in the word “telephone” and enjoy the images that appear. Curious about how death has been characterized over 500 years of images? Type in “mordis”. Feeling good about health care? Type in “medicine” and prepare to be amazed. Remember, all of these images are in the public domain!

Future plans?
We will be working with our wonderful friends at Flickr to make this collection even more interesting: more images, more sub-collections and some very interesting ideas about how to use image recognition tools to help us learn more about, well, anything!

Questions about this collection, projects or things to come?
Email me at robert@archive.org

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

August 29, 2014 at 4:37 pm

Posted in News

Open Library Scheduled Hardware Maintenance

Open Library will be down from 6:00PM to 8:00PM SF Time (PDT, UTC/GMT -7 hours) on August 19, 2014 for scheduled hardware maintenance.

We’ll post updates here and on Twitter at @openlibrary.

Thank you for your cooperation.

Originally posted on The Open Library Blog by Anand Chitipothu.

Written by internetarchive

August 19, 2014 at 5:33 pm

Posted in Uncategorized

Wikimania London!

Internet Archive at Wikimedia

The Internet Archive had a booth at Wikimania in London. The booth was in the Community Village section of the conference. We hope you stopped by and said hello, grabbed a sticker or a handout, and learned a bit more about our book scanning projects and told us what you were up to. If you’d like to pick up digital copies of our handouts, PDFs are here.

We also went to a lot of programs that were really worthwhile. The free/open culture vibe was palpable and exciting, with 2500+ people all getting together to find ways to share more content in more ways. We picked up a few other documents that might be interesting to other folks.

If you like working on Wikipedia but are often frustrated by paywalls, you should know about the Wikipedia Library, which has a project to help editors access reliable sources. The Wikipedia Loves Libraries project is gearing up for a month of wiki-workshops and edit-a-thons at libraries around Open Access Week in October/November.

Originally posted on The Open Library Blog by Jessamyn West.

Written by internetarchive

August 12, 2014 at 8:09 pm

Posted in internet archive

Open Library’s been doing that the whole time… for free

Amazon’s “Kindle Unlimited” announcement has been helping raise awareness of Open Library.

Last week, Amazon informed us that for ten dollars per month, Kindle users can have unlimited access to over six hundred thousand books in its library. But it shouldn’t cost a thing to borrow a book, Amazon, you foul, horrible, profiteering enemies of civilization. For a monthly cost of zero dollars, it is possible to read six million e-texts at the Open Library, right now. On a Kindle, or any other tablet or screen thing.

Don’t forget our easy-to-use interface or downloading with your choice of device or software!
sesame street book of nonsense in the bookreader

Originally posted on The Open Library Blog by Jessamyn West.

Written by internetarchive

July 21, 2014 at 5:10 pm

Posted in lending, News

Zoia Horn, librarian and activist, dies

Ms. Horn presenting The Zoia Horn Intellectual Freedom Award to the Internet Archive’s Brewster Kahle

July 12, 2014 marked the passing of an extraordinary librarian, Zoia Horn. Ms. Horn was best known in library circles for spending three weeks in jail in 1972 for having refused to testify before a grand jury regarding information relating to Philip Berrigan’s library use. Ms. Horn stated: “To me it stands on: Freedom of thought — but government spying in homes, in libraries and universities inhibits and destroys this freedom.”

Throughout her life, Ms. Horn was at the forefront of the protection of academic and intellectual freedom, especially in libraries. She was an outspoken opponent of the PATRIOT Act. She won numerous awards for her work, and a Zoia Horn Intellectual Freedom Award was inaugurated in 2004 by the California Library Association.

The Internet Archive is proud to have been a recipient of that award in 2010, and Brewster Kahle was presented with the award by Ms. Horn herself.

Along with so many others who have fought for freedom, we will greatly miss Ms. Horn, and we honor her memory by continuing her work.

Zoia Horn’s autobiography (read online)

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

July 15, 2014 at 9:55 pm

Posted in News

Free the Screenshots!

As the Archive moves more widely into the archiving of software, it quickly becomes apparent that there’s going to be an awful lot of programs online without much indication of what they are. With many thousands of programs or program collections to choose from, determining what might be inside becomes a pretty involved task.

In the case of movies, images and texts, there are previews that help show what is contained in the files in a given item. These are extremely helpful, as they not only show the quality or style of the works, but give all sorts of information that might not be reflected in the metadata.

Starting now, the same will be true for many types of software.

The Atari 800 graphical masterpiece Astro Chase.

Using a combination of the JSMESS emulator and screen capturing software, the Archive has begun automatic “playing out” of sets of programs, snagging shots of what the software does, and then providing it as a guidepost of what is to come with that program.

For example, work has just been completed on the playable Sega Genesis Library, where the directory view of the items in the collection shows helpful screenshots, and individual games show animated playthroughs of the beginning of the cartridge.

The process is still evolving. Currently it requires real-time capture (that is, capturing the first five minutes of a program takes an actual five minutes), but with multiple machines moving through collections, screenshots will be available for huge numbers of programs in the coming weeks and months.

Along with the obvious graphical prettiness comes an even greater cultural benefit: the freeing of screenshots.

As these shots have often been made or gathered by hand, there has arisen a tendency to put watermarks or credits on the images to indicate who did the work. While it’s an understandable urge to want some kudos for the effort, it meant that the very work being lauded (the graphics of the program) was being vandalized to ensure credit where credit was due.

None of the screenshots we are generating will have watermarks, and all of them can be used freely for other purposes as you see fit.

To celebrate this, we’ve created a compilation of all the Sega Genesis screenshots generated by the project so far. The compilation is here. Be warned – it’s 4.3 gigabytes of 16,900 screenshots of 573 cartridges! (There’s a way to browse it at this link.)
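For a rough sense of scale, the figures above can be turned into per-cartridge and per-screenshot averages. This is a back-of-the-envelope sketch in Python that uses only the numbers quoted in this post and assumes decimal gigabytes (1 GB = 10^9 bytes); the actual file sizes will vary from shot to shot.

```python
GIGABYTE = 10**9  # assumption: "gigabytes" here means decimal gigabytes

total_bytes = 4.3 * GIGABYTE  # size of the compilation, as quoted above
screenshots = 16_900
cartridges = 573

shots_per_cartridge = screenshots / cartridges
bytes_per_shot = total_bytes / screenshots

print(f"{shots_per_cartridge:.1f} screenshots per cartridge")  # 29.5
print(f"{bytes_per_shot / 1000:.0f} kB per screenshot")        # 254
```

So, on average, each cartridge contributes roughly thirty screenshots of about a quarter of a megabyte each.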

Many screenshots are simply informative, but many more are truly works of art, as artists and programmers strained the edges of these underpowered machines to create the most evocative images possible. With this screenshotting effort underway, that work will hopefully get a new life and respect on the web.

Free the Screenshots!


Originally posted on The Internet Archive Blog by Jason Scott.

Written by internetarchive

July 14, 2014 at 5:46 pm

Posted in News
