Internet Archive News

updates about archive.org

Archive for the ‘Wayback Machine’ Category

Fixing Broken Links on the Internet

No More 404s

Today the Internet Archive announces a new initiative to fix broken links across the Internet.  We have spent 17 years building an archive of web content, and now we want you to help us bring those pages back out onto the web to heal broken links everywhere.

When I discover the perfect recipe for Nutella cookies, I want to make sure I can find those instructions again later.  But if the average lifespan of a web page is 100 days, bookmarking a page in your browser is not a great plan for saving information.  The Internet echoes with the empty spaces where data used to be.  GeoCities – gone.  Friendster – gone.  Posterous – gone.  MobileMe – gone.

Imagine how critical this problem is for those who want to cite web pages in dissertations, legal opinions, or scientific research.  A recent Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are dead now.  Those decisions affect everyone in the U.S., but the evidence the opinions are based on is disappearing.

In 1996 the Internet Archive started saving web pages with the help of Alexa Internet.  We wanted to preserve cultural artifacts created on the web and make sure they would remain available for the researchers, historians, and scholars of the future.  We launched the Wayback Machine in 2001 with 10 billion pages.  For many years we relied on donations of web content from others to build the archive.  In 2004 we started crawling the web on behalf of a few big partner organizations, and of course that content also went into the Wayback Machine.  In 2006 we launched Archive-It, a web archiving service that allows librarians and others interested in saving web pages to create curated collections of valuable web content.  In 2010 we started archiving wide portions of the Internet on our own behalf.  Today, between our donating partners, thousands of librarians and archivists, and our own wide crawling efforts, we archive around one billion pages every week.  The Wayback Machine now contains more than 360 billion URL captures.

FTC.gov directed people to the Wayback Machine during the recent shutdown of the U.S. federal government.

We have been serving archived web pages to the public via the Wayback Machine for twelve years now, and it is gratifying to see how this service has become a medium of record for so many.  Wayback pages are cited in papers, referenced in news articles and submitted as evidence in trials.  Now even the U.S. government relies on this web archive.

We’ve also had some problems to overcome.  This time last year the contents of the Wayback Machine were at least a year out of date.  There was no way for individuals to ask us to archive a particular page, so you could only cite an archived page if we already had the content.  And you had to know about the Wayback Machine and come to our site to find anything.  We have set out to fix those problems, and hopefully we can fix broken links all over the Internet as a result.

Up to date.  Newly crawled content appears in the Wayback Machine about an hour after we get it.  We are constantly crawling the Internet and adding new pages, and many popular sites get crawled every day.

Archive a page. We have added the ability to archive a page instantly and get back a permanent URL for that page in the Wayback Machine.  This service allows anyone — Wikipedia editors, scholars, legal professionals, students, or home cooks like me — to create a stable URL to cite, share or bookmark any information they want to still have access to in the future.
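
As a rough illustration, the save endpoint can also be driven from a script.  This sketch assumes that a plain GET to web.archive.org/save/ followed by the target URL triggers a capture, and that the snapshot path comes back in a Content-Location header; both details should be checked against the live service before relying on them:

```python
import requests

def save_page_now(url):
    """Ask the Wayback Machine to archive `url` and return the snapshot URL.

    Sketch only: assumes a plain GET to web.archive.org/save/ triggers a
    capture and that the snapshot path is reported in Content-Location.
    """
    resp = requests.get("https://web.archive.org/save/" + url)
    resp.raise_for_status()
    snapshot = resp.headers.get("Content-Location")
    return "https://web.archive.org" + snapshot if snapshot else None

# Hypothetical example URL, purely for illustration.
print(save_page_now("http://example.com/nutella-cookies"))
```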

Do we have it?  We have developed an Availability API that will let developers everywhere build tools to make the web more reliable.  We have built a few tools of our own as a proof of concept, but what we really want is to allow people to take the Wayback Machine out onto the web.
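
As one example of what developers can build on this, here is a minimal sketch of querying the Availability API from Python.  The endpoint is real; the response shape described in the docstring is our reading of the API and worth verifying against the live service:

```python
import requests

def closest_snapshot(url, timestamp=None):
    """Return the URL of the closest archived copy of `url`, or None.

    Queries https://archive.org/wayback/available, which responds with
    JSON like: {"archived_snapshots": {"closest": {"available": true,
    "url": "...", "timestamp": "...", "status": "200"}}}.
    """
    params = {"url": url}
    if timestamp:  # optional target time, formatted YYYYMMDDhhmmss
        params["timestamp"] = timestamp
    data = requests.get("https://archive.org/wayback/available",
                        params=params).json()
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(closest_snapshot("example.com", "20101230"))
```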

Fixing broken links.  We started archiving the web before Google, before YouTube, before Wikipedia, before people started to treat the Internet as the world’s encyclopedia. With all of the recent improvements to the Wayback Machine, we now have the ability to start healing the gaping holes left by dead pages on the Internet.  We have started by working with a couple of large sites, and we hope to expand from there.

WordPress.com is one of the top 20 sites in the world, with hundreds of millions of users each month.  We worked with Automattic to get a feed of new posts made to WordPress.com blogs and self-hosted WordPress sites.  We crawl the posts themselves, as well as all of their outlinks and embedded content – about 3,000,000 URLs per day.  This is great for archival purposes, but we also want to use the archive to make sure WordPress blogs are reliable sources of information.  To start with, we worked with Janis Elsts, a developer from Latvia who focuses on WordPress plugin development, to put suggestions from the Wayback into his Broken Link Checker plugin.  This plugin has been downloaded 2 million times, and now when his users find a broken link on their blog they can instantly replace it with an archived version.  We continue to work with Automattic to find more ways to fix or prevent dead links on WordPress blogs.
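
The check-and-replace flow behind that plugin integration can be sketched in a few lines.  The plugin itself is PHP and considerably more careful; the hypothetical repair_link helper below is ours, purely to illustrate the idea:

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def repair_link(url, timeout=10):
    """Return `url` if it still resolves, else an archived substitute."""
    try:
        status = requests.head(url, timeout=timeout,
                               allow_redirects=True).status_code
        if status < 400:
            return url  # link is alive, leave it alone
    except requests.RequestException:
        pass  # connection failures count as dead links here
    data = requests.get(WAYBACK_API, params={"url": url}).json()
    closest = data.get("archived_snapshots", {}).get("closest") or {}
    return closest.get("url", url)  # fall back to the original link
```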

Wikipedia.org is one of the most popular information resources in the world, with almost 500 million users each month.  Among the millions of amazing articles that all of us rely on, 125,000 right now contain dead links.  We have started crawling the outlinks for every new article and update as they are made – about 5 million new URLs are archived every day.  Now we have to figure out how to get archived pages back into Wikipedia to fix some of those dead links.  Kunal Mehta, a Wikipedian from San Jose, recently wrote a prototype bot that can add archived versions to any link in Wikipedia, so that when a link is determined to be dead it can be switched over automatically and continue to work.  It will take a while to work this through the process the Wikipedia community of editors uses to approve bots, but that conversation is under way.

Every webmaster.  Webmasters can add a short snippet of code to their 404 page that will let users know if the Wayback Machine has a copy of the page in our archive – your web pages don’t have to die!
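
The snippet itself is a bit of client-side JavaScript, but the same idea works server-side too.  Here is a hedged sketch of a 404 handler that asks the Availability API for a copy; the Flask app and its wording are our illustration, not the official snippet:

```python
from flask import Flask, request
import requests

app = Flask(__name__)

@app.errorhandler(404)
def suggest_wayback_copy(_error):
    # When a page is missing, ask the Wayback Machine whether it has
    # an archived copy of the requested URL and point the visitor there.
    data = requests.get("https://archive.org/wayback/available",
                        params={"url": request.url}).json()
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        link = closest["url"]
        return (f'Page not found here, but the Wayback Machine has a copy: '
                f'<a href="{link}">{link}</a>', 404)
    return "Page not found.", 404
```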

We started with a big goal — to archive the Internet and preserve it for history.  This year we started looking at the smaller goals — archiving a single page on request, making pages available more quickly, and letting you get information back out of the Wayback in an automated way.  We have spent 17 years building this amazing collection; let’s use it to make the web a better place.

Thank you so much to everyone who has helped to build such an outstanding resource:

Kenji Nagahashi
Ilya Kreymer
Sam Stoller
Raj Kumar
Alex Buie
Brad Tofel
Adam Miller
Jeff Kaplan
Ronna Tanenbaum
Kris Carpenter
Vinay Goel
John Lekashman
Kristine Hanna
Alexis Rossi
Brewster Kahle
Kunal Mehta
SJ Klein
Janis Elsts
Jackie Dana
Martin Remy

Originally posted on The Internet Archive Blog by Alexis Rossi.

Written by internetarchive

October 25, 2013 at 9:40 am

Posted in News, Wayback Machine

Blacked Out Government Websites Available Through Wayback Machine

Congress has caused the U.S. federal government to shut down, and many important websites have gone dark.  Fortunately, we have the Wayback Machine to help.

Many sites are displaying messages saying that they are not being updated or maintained during the government shutdown, but the following sites have shut their doors entirely today.  Each of these sites has an archived capture in the Wayback Machine.

  • National Oceanic and Atmospheric Administration – noaa.gov
  • National Park Service – nps.gov
  • Library of Congress – loc.gov
  • National Science Foundation – nsf.gov
  • Federal Communications Commission – fcc.gov
  • Bureau of the Census – census.gov
  • U.S. Department of Agriculture – usda.gov
  • United States Geological Survey – usgs.gov
  • U.S. International Trade Commission – usitc.gov
  • Federal Trade Commission – ftc.gov
  • Corporation for National and Community Service – nationalservice.gov
  • International Trade Administration – trade.gov

Originally posted on The Internet Archive Blog by brewster.

Written by internetarchive

October 2, 2013 at 1:30 am

Posted in News, Wayback Machine

80 terabytes of archived web crawl data available for research

Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from.  In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.  To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs.  The files contain text content and any media that we were able to capture, including images, flash, videos, etc.

What’s in the data set:

  • Crawl start date: March 9, 2011
  • Crawl end date: December 23, 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date.  We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives.  The scope of the crawl was not limited except for a few manually excluded sites.  However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it.  For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them).  We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.  We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with.  We have also done some further analysis of the content.

[Pie chart: hosts crawled]
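
For anyone who does get access, here is a minimal sketch of walking the records in one of the WARC files.  It uses the third-party warcio library, a later tool that is our suggestion rather than anything bundled with the data set:

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over one WARC file from the crawl and print the target URI
# and payload size of each HTTP response capture.
with open("crawl.warc.gz", "rb") as stream:  # hypothetical filename
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(uri, len(body))
```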

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

October 26, 2012 at 12:20 am

Posted in News, Wayback Machine

HTTP Archive joins with Internet Archive

It was announced today that HTTP Archive has become part of Internet Archive.

The Internet Archive provides an archive of web site content through the Wayback Machine, but we do not capture data about the performance of web sites.  Steve Souders’s HTTP Archive started capturing and archiving this sort of data in October 2010 and has expanded the number of sites covered to 18,000 with the help of Pat Meenan and WebPagetest.

Steve Souders will continue to run the HTTP Archive project, and we hope to expand its reach to 1 million sites.  To this end, the Internet Archive is accepting donations for the HTTP Archive project to support the growth of the infrastructure necessary to increase coverage.  The following companies have already agreed to support the project: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, and dynaTrace Software. Coders are also invited to participate in the open source project.

Internet Archive is excited about archiving another aspect of the web for both present day and future researchers.

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

June 15, 2011 at 10:54 pm

Posted in News, Wayback Machine

Wayback Machine & Web Archiving Open Thread, April 2011

Anything you want to know or discuss about the Wayback Machine or the Internet Archive’s web archive? Do you have problems, concerns, suggestions? This is the place!

If your comment is a question, please check the classic Wayback Machine Frequently-Asked-Questions (FAQ) or the new Wayback Machine FAQ site to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

Originally posted on The Web Archiving at archive.org Blog by gojomo.

Written by internetarchive

April 7, 2011 at 9:35 pm

Wayback Machine & Web Archiving Open Thread, September 2010

Time for another open thread!

What do you want to know about the Wayback Machine and Internet Archive web archive? Do you have problems, concerns, suggestions? This is the place!

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

Originally posted on The Web Archiving at archive.org Blog by gojomo.

Written by internetarchive

September 7, 2010 at 10:04 pm

Want to discuss the Wayback Machine or Internet Archive’s web archive?

Over on the web group’s blog is a post inviting input on the Wayback Machine and Internet Archive’s web archive. You can post comments and suggestions. There are also some useful links. Check it out at http://iawebarchiving.wordpress.com/2010/07/06/wayback-machine-web-archiving-open-thread-july-2010/

Originally posted on The Internet Archive Blog by internetarchive.

Written by internetarchive

July 8, 2010 at 5:32 pm

Posted in News, Wayback Machine