
Downloading in bulk using wget

If you’ve ever wanted to download files from many different archive.org items in an automated way, here is one method to do it.

BEFORE YOU BEGIN
You will need to have wget installed (it’s free!), and it helps to have some understanding of basic unix commands.  It may also help to read How Archive.org Items Are Structured so that you understand terminology used here.
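To check whether wget is already installed, run it with its version flag in a terminal. The install commands below are just examples for common systems and may differ on yours:

wget --version

If that prints an error instead of version information, install wget through your system's package manager (for example, sudo apt-get install wget on Debian/Ubuntu, or brew install wget on a Mac with Homebrew).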

The basic method for using wget to download files is:

  1. Generate a list of item identifiers from which you wish to grab files
  2. Create a directory to hold the downloaded files
  3. Construct your wget command to retrieve the appropriate files
  4. Run the command and wait for it to finish

We strongly recommend trying this process with ONE identifier first to make sure you get what you want as output before you try to download files from many items.
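For example, you can run the whole process end to end with a one-line item list (the identifier here is taken from the example results in Step 1):

echo "AboutFac1941" > itemlist.txt

Then follow Steps 2 through 4 and check that the downloaded files are what you expect before swapping in your full list.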

Step 1: Generate list of identifiers

The list of identifiers should be in plain text and should contain one identifier per line with no extraneous characters or spaces.  You can name it anything you’d like, but I will use the name itemlist.txt in all examples here.

How do you get a list of identifiers?  Of course you can just paste or type them into a file individually if you’d like, but if you are working with large numbers of items this will be impractical.  The easiest way to get a list is to use advanced search to return a csv file.

Determine your search query using the search engine.  In this example, I am looking for items in the Prelinger collection with the subject “Health and Hygiene.”  There are currently 41 items that match this query.  Once you’ve figured out your query:

  1. Go to the advanced search page and paste your query into the query box of the section titled “Advanced Search returning JSON, XML, and more.”
  2. Choose “identifier” from the list of fields to return.
  3. Optionally sort the results (sorting by identifier is handy)
  4. Enter a number into the “Number of results” box that matches (or is higher than) the number of results your query returns
  5. Choose the “CSV format” radio button.
  6. Click the search button (may take a while depending on how many results you have)

[Screenshot: the Advanced Search page]

An alert box will ask if you want your results – click “OK” to proceed.  A file called search.csv will be downloaded to your default download location (often your Desktop or your Downloads folder).  The contents of the CSV will look like this:

"identifier"
"AboutFac1941"
"Attitude1949"
"BodyCare1948"
"Cancer_2"
"Careofth1949"
"Careofth1951"
"CityWate1941"

You’ll need to remove the first heading line “identifier” and remove the double quotes from each line.  The easiest way to do this is to open the file in Excel or a similar spreadsheet program, copy the column of identifiers (minus the heading), and paste it into a text program like TextWrangler or Notepad.  Save your new text file as itemlist.txt (or whatever name you prefer).  The contents of the file should now look like this:

AboutFac1941
Attitude1949
BodyCare1948
Cancer_2
Careofth1949
Careofth1951
CityWate1941
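If you are comfortable in the terminal, the same cleanup can be done with a one-liner instead of a spreadsheet; this sketch assumes search.csv is in your current directory:

tail -n +2 search.csv | tr -d '"' > itemlist.txt

Here tail -n +2 skips the "identifier" heading line, and tr -d '"' strips the double quotes.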

You can use this advanced search method to create lists of thousands of identifiers, although we don’t recommend using it to retrieve more than 10,000 or so items at once (it will time out at a certain point).
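You can also skip the browser entirely and have wget fetch the CSV for you. This is a sketch of the same Prelinger query against the advanced search endpoint; treat the exact parameter names as an assumption and compare them against the URL the advanced search page actually generates:

wget -O search.csv 'http://archive.org/advancedsearch.php?q=collection%3Aprelinger+AND+subject%3A%22Health+and+Hygiene%22&fl%5B%5D=identifier&sort%5B%5D=identifier+asc&rows=100&output=csv'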

Step 2: Create a directory for downloaded files

Open a terminal emulator, such as Terminal (on Mac) or Cygwin (on Windows), and navigate to the location where you’d like to store your downloaded files.  Make sure there is sufficient storage space here.  You may want to store files on an external drive, in a folder on your Desktop, etc.  This example assumes you want to create a folder on your Desktop; the exact commands are shown after the list.

  1. Change directories (cd) to your Desktop
  2. Make sure itemlist.txt is on your Desktop
  3. Create a new folder (directory) for your downloads called archivedownloads (mkdir archivedownloads)
  4. cd archivedownloads
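Assuming your Desktop lives at ~/Desktop, those four steps look like this:

cd ~/Desktop
ls itemlist.txt
mkdir archivedownloads
cd archivedownloads

The ls is just a sanity check that the item list is where the Step 3 command expects to find it (one level above the download folder).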

Step 3: Create wget command

Your wget command will look something like this:

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist.txt -B 'http://www.archive.org/download/'

Explanation of each option in the wget command:

-r
recursive download; required in order to move from the item identifier down into its individual files

-H
enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on http://www.archive.org, and the individual file locations will be on a specific datanode)

-nc
no clobber; if a local copy already exists of a file, don’t download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)

-np
no parent; ensures that the recursion doesn’t climb back up the directory tree to other items (by, for instance, following the “../” link in the directory listing)

-nH
no host directories; when using -r, wget will create a directory tree to stick the local copies in, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided

--cut-dirs=2
completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories (see the before-and-after example following this list of options)

-e robots=off
archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive

-i ../itemlist.txt
location of the input file listing all the URLs to use; "../itemlist.txt" means the list of items is one level up in the directory structure, in a file called "itemlist.txt" (you can call the file anything you want, so long as you specify its actual name after -i)

-B 'http://www.archive.org/download/'
base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)
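To see what -nH and --cut-dirs=2 accomplish together, here is where a single file would land on your local disk with and without them (the identifier is one from the example itemlist; the datanode and drive names are placeholders):

without -nH and --cut-dirs=2:
  ./{datanode}.us.archive.org/{drive}/items/AboutFac1941/AboutFac1941.pdf
with both options:
  ./AboutFac1941/AboutFac1941.pdf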

Additional options that may be needed sometimes:

-A
-R
accept-list and reject-list, either limiting the download to certain kinds of file, or excluding certain kinds of file; for instance, -R _orig_jp2.tar,_jpg.pdf would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf, and -A "*zelazny*" -R .ps would download all files containing zelazny in their names, except those ending with .ps. See http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html for a fuller explanation.
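For example, to download only the PDF files from each item, you could add an accept list to the command from Step 3 (the .pdf suffix here is just an illustration; substitute whatever file types you need):

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -A .pdf -i ../itemlist.txt -B 'http://www.archive.org/download/'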

Step 4: Run the command

Run the command from within the directory or folder you created to hold the downloaded files.  You will see your progress on the screen.  If you have sorted your itemlist.txt alphabetically, you can estimate how far through the list you are based on the screen output.

Depending on how many files you are downloading and their size, it may take quite some time for this command to finish running.

Tips:

  • You can terminate the command by pressing “control” and “c” on your keyboard simultaneously while in the terminal window.
  • If your command will take a while to complete, make sure your computer is set to never sleep and turn off automatic updates.
  • If you think you missed some items (e.g. due to machines being down), you can simply rerun the command after it finishes.  The “no clobber” option in the command will prevent already retrieved files from being overwritten.
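If you'd rather not keep a terminal window open, wget's standard -b (background) and -o (log file) options let the download run unattended; this is a sketch, with the log file name chosen arbitrarily:

wget -b -o wget.log -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist.txt -B 'http://www.archive.org/download/'
tail -f wget.log

The tail -f command follows the log as it grows; pressing control-c stops the tail, not the download.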
Originally posted on The Internet Archive Blog by internetarchive.