Internet Archive News

Archive for the ‘Books Archive’ Category

Working to Stop Rewriting Copyright Laws via TPP Treaty

The Internet Archive joined Our Fair Deal along with EFF and Public Knowledge to stop the US from using the Trans-Pacific Partnership treaty to change our copyright laws. The coalition sent two open letters to TPP negotiators today on critical issues that you can learn about here. Let’s foster open debate and proper process before further changes to copyright laws restrict public access even more.

Please consider joining this coalition.

Originally posted on The Internet Archive Blog by brewster.

Written by internetarchive

July 9, 2014 at 9:16 pm

Posted in Books Archive, News

Popular subjects in our book collection

We took a leisurely stroll through half a million books today, and we noticed that lots of the books were congregating around some popular categories. This isn’t an exhaustive list; we just thought it would be nice to share a little of the landscape with you. Click through to download or borrow these books through our Open Library site.

Originally posted on The Internet Archive Blog by Alexis Rossi.

Written by internetarchive

February 21, 2014 at 9:46 pm

450,000 Early Journal Articles Now Available

Internet Archive announces today the addition of over 450,000 journal articles from the JSTOR Early Journal Content collection. Early Journal Content is a selection of pre-1923 materials from more than 350 journals and includes articles in the arts and humanities, economics and politics, and mathematics and other sciences. This content was digitized by JSTOR and is freely available through JSTOR, and it can now also be accessed and downloaded via the Internet Archive.

Heidi McGregor from JSTOR said, “We’re happy to work with the Internet Archive to broaden access to the JSTOR Early Journal Content even further, offering people the ability to use it alongside other Internet Archive held collections.”

All 2 terabytes of the Early Journal Collection are available for bulk harvesting from the Internet Archive. Web search engines have already been indexing the full-text contents of these materials and, so far, people and robots have downloaded the articles over 400,000 times, even before the collection was announced. A data bundle including OCR text and metadata is also available from JSTOR’s Data for Research service for free downloading.

Originally posted on The Internet Archive Blog by brewster.

Written by internetarchive

April 11, 2013 at 8:10 pm

Posted in Books Archive, News

Launch of the DigiBaeck Project


The Internet Archive, working with the Leo Baeck Institute, is pleased to be a part of the Oct 16, 2012 launch of their DigiBaeck project, a massive (formerly print) archival collection of history pertaining to German-speaking Jewry.

Robert Miller, Global Director of Books for the Internet Archive, states that “digitizing over 4,000 linear feet of material whose scope ran the gamut of post cards from Berlin to letters from Auschwitz was both empowering and humbling at the same time.” He continues, “The family of one of my staff, who worked on the collection, was from Poland and suffered terribly during the Holocaust. Being able to assist in putting these original documents online was cathartic for her.”

The Leo Baeck Institute helped teach Miller’s teams in Princeton, NJ and San Francisco, CA how to work with and handle unique, high-value archival materials. And he and his staff helped teach Leo Baeck how to move from print to on-line pixels. It was a true partnership in every sense of the word.

Brewster Kahle, founder of the Internet Archive, states, “it is collections going public like Leo Baeck’s that remind us of the adage that collections that remain private or not digital are for all intents and purposes extinct. I applaud Leo Baeck for the direction they have taken.”

Links to the Internet Archive’s copy of the Leo Baeck material may be found at and details about the Leo Baeck collection may be found on their site at

The link to the New York Times piece may be found here at

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

October 15, 2012 at 8:13 pm

Posted in Books Archive, News

Uploading images for text items (update on * format)

The old news

Until about a year ago, if you wanted to upload a set of individual page images and have them be recognized as a “book” so we’d create the usual derivative formats from them, you had to mimic the * (“Single Page Processed JP2 ZIP”) archive files that are created automatically at our scanning centers. Making these from your own existing images is inconvenient and error-prone, due to the rigid expectations for how individual image files are named and organized into a directory structure. That route was also limited to JPEG2000 (“JP2”) image files.

Things changed with the introduction last year of our * (“Generic Raw Book Zip”) format, which is much more flexible.  If you provide a file whose name ends in, we’ll make a * from it:  the * file will be unpacked, its contents sorted alphabetically, and the set of images found within converted into a standard *, which we’ll then process as usual.

In a bit more detail, the * will be scanned for files it contains, at any directory level, whose names end with .jp2, .jpg, .jpeg, .tif, .tiff, .bmp or .png, matched case-insensitively; any other files (.xml, .txt, etc.) will be ignored.  You can mix and match different image formats.  All image files found will be sorted alphabetically (including any directory names, so that files originally in the same subdirectory stay together in the new sequence), converted to JPEG2000 if they’re not already, renamed the way our code expects, and packed into a new *, leaving your * in place as it was.
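To make the selection and ordering rules above concrete, here is a minimal Python sketch for previewing the page sequence of a zip before uploading it. It is not the Archive's actual code: the function name page_sequence is made up for illustration, and treating "sorted alphabetically" as a plain string sort of the full member path is an assumption.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions listed in the post, matched case-insensitively.
IMAGE_SUFFIXES = {".jp2", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".png"}


def page_sequence(zip_file):
    """Return the image member names in the order they would become pages.

    Hypothetical helper: a plain string sort is assumed here, which may
    differ from the ordering the real processing code applies.
    """
    with zipfile.ZipFile(zip_file) as archive:
        names = [
            info.filename
            for info in archive.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() in IMAGE_SUFFIXES
        ]
    # Sorting on the full path (directories included) keeps files that
    # share a subdirectory next to each other in the sequence.
    return sorted(names)


# Small in-memory example: mixed formats, nested directories, stray files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ["b/002.PNG", "a/010.jpg", "a/002.tif", "notes.txt", "a/meta.xml"]:
        archive.writestr(name, b"")

print(page_sequence(buffer))
```

In this example the .txt and .xml members are skipped and the three images come out grouped by directory. Zero-padding page numbers in your file names (002 rather than 2) is the easy way to keep a string sort in the order you intend.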

For an example of how messy an * we can deal with, see:

The 589 image files found there were converted into:

Note that the new *, and the files it contains, are named according to the name of the original * file (“hr100106”), regardless of how directories and files are named inside the *. Those files and directories can be named any way you like; the names matter only in that they determine the sequence of the images in the new *.

The new news

Now for what’s changed:

  • *_images.tar (“Generic Raw Book Tar”) is accepted as well as *. Producing a tar file may be more convenient than producing a zip file for some uploaders, particularly if the file is going to be large. Older implementations of the zip compression scheme were limited to 4 GB, and some tools were known to produce files that we couldn’t read if the size exceeded 2 GB. Our advice in the past has been to use the 7-Zip tool for creating any zips larger than 2 GB. That still works, or you can now make a tar instead; the size of tar files is effectively unlimited.
  • Comic Book archive files are accepted. *.cbz (“Comic Book ZIP”) files are essentially zip archive files containing page images, typically as either JPEGs or PNGs. We now accept *.cbz files and treat them just like * files. Similarly, *.cbr (“Comic Book RAR”) files are RAR archive files containing page images, and we now treat those just like * files, too. So if you have any *.cbz or *.cbr files, just uploading them as is should result in having all the usual derivative formats created.
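If you go the tar route, something like the following Python sketch could build the file from a directory of page images. The helper name make_images_tar and the identifier "mybook" are made up for illustration, and writing a plain uncompressed tar is a cautious choice on our part here, not a stated requirement.

```python
import tarfile
from pathlib import Path

# Image extensions the post says are recognized (case-insensitive).
IMAGE_SUFFIXES = {".jp2", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".png"}


def make_images_tar(image_dir, identifier, out_dir="."):
    """Pack the page images under image_dir into <identifier>_images.tar.

    Hypothetical helper, not an official tool. Non-image files are left
    out, since they would be ignored on the receiving end anyway.
    """
    image_dir = Path(image_dir)
    tar_path = Path(out_dir) / f"{identifier}_images.tar"
    pages = sorted(
        p for p in image_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # Mode "w" writes an uncompressed tar.
    with tarfile.open(str(tar_path), "w") as tar:
        for page in pages:
            # Store paths relative to image_dir so the directory layout,
            # and therefore the page order, is preserved in the archive.
            tar.add(str(page), arcname=page.relative_to(image_dir).as_posix())
    return tar_path
```

Calling make_images_tar("scans", "mybook") would write mybook_images.tar in the current directory, ready to upload.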

Originally posted on The Internet Archive Blog by Hank Bromley.

Written by internetarchive

May 24, 2012 at 3:26 pm

Posted in Books Archive

Archive-It Team Encourages Your Contributions To The “Occupy Movement” Collection

Since September 17th, 2011, when protesters descended on Wall Street, set up tents, and refused to move until their voices were heard, an impassioned plea for economic and social equality has manifested itself in similar protests and demonstrations around the world. Inspired by “Occupy Wall Street (OWS)”, these global protests and demonstrations are collectively now being referred to as the “Occupy Movement”.

In an effort to document these historic, and politically and socially charged, events as they unfold, IA’s Archive-It team has recently created an “Occupy Movement” collection to begin capturing information about the movement found online. With blogs communicating movement ideals and demands, social media used to coordinate demonstrations, and news-related websites portraying the movement from a dizzying variety of angles, the presence and representation of the Occupy Movement online is both hugely valuable to our understanding of the movement as a whole and constantly in flux and at risk.

The value of the collection hinges on the diversity, depth, and breadth of our seeds and websites we crawl. We are asking and encouraging anyone with websites they feel are important to archive, sites that tell a story about the movement, to pass them along and we will add them to the Occupy Movement collection. These might include movement-wide or city-specific websites, sites with images, blogs, YouTube videos, even Twitter accounts of individuals or organizations involved with the movement. No ideas or additions are too small or too large; perhaps your ideas or suggestions will be a unique part of the movement not yet represented in our collection. IA Archive-It friends and partners are already sending in seeds, which we greatly appreciate.

The web content captured in this collection will be included in the General Archive collection at
which has been actively collecting materials on the Occupy Movement for a few months.

Please send any seed suggestions, questions, or comments to Graham at

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

December 7, 2011 at 9:02 pm

Thursday Night 5:30pm Books in Browsers in San Francisco

Please join the Internet Archive and O’Reilly Media:

Eleven of the most exciting ebook startups and leaders in publishing will present short-form “ignite talks” on Thursday night, October 27, at Books in Browsers: Ignite!

Books in Browsers: Ignite!
300 Funston Ave, San Francisco
Thursday, October 27, 2011
Reception: 5:30 pm (snacks & wine)
Ignite program: 6:30-7:30 pm
Donate 5 Books for scanning!

Ignite talks are a special format where each speaker has just five minutes to share their personal and professional visions in 20 slides, auto-advancing every 15 seconds. The list of talks is online at

The Ignite program will bring the latest news from the Archive’s Open Library project plus 10 of the hottest new companies from around the world, many of them just emerging from stealth, that are defining the future of reading and publishing. Several of these start-ups will be presenting their work for the very first time.

Join us for a couple of hours on a Thursday night, and get a peek at the future of books! Press are welcome. Please RSVP at and bring some books to donate to our new Physical Archive.

Originally posted on The Internet Archive Blog by brewster.

Written by internetarchive

October 24, 2011 at 5:13 pm

Posted in Books Archive, News