Bulk Downloading, Aaron Swartz, and Terms of Service

[Aaron Swartz worked for and with the Internet Archive for years.]

Aaron was threatened with 35 years in prison after being accused of something my library actively encourages: bulk downloading of library collections. Some are calling it “hacking”, which is a problematic distortion of the term in the first place.(1) It might be time to break down some of what is currently going on in scholarly research as it relates to datamining, bulk downloading, and terms of service. It makes me very sad and mad, because this confusion may have led a library (JSTOR) to track down a user, led MIT to call the police (and not try to call them off later), and led a US Attorney to mistake this for a crime, all of which combined to help lead to the death of a rising star in the Internet community.

Libraries: All libraries, including JSTOR and the Internet Archive, contain materials from lots of different people and places, some copyrighted and some not. Jim Gray called libraries “Engines of Research.” Research, by definition, is searching: searching for new patterns and new ideas. Libraries provide raw materials for researchers. Fortunately, in the digital world, bulk access to materials does not hurt our preservation function the way rifling through pages might have in the past.

Academic publishing is changing: Traditionally, academic publications mostly came from non-profit scholarly associations and university presses. Then some organizations, such as Elsevier, Wiley-Blackwell, and JSTOR, started to acquire and aggregate many journals into databases. These databases were funded by academic institutions and were available only to those subscribing institutions. Beyond this, academic publishing is becoming more “open.” New publishers, such as the Public Library of Science, are being created explicitly to allow open access and bulk access. Their open access journals end up being cited more often, and this openness explicitly allows research using “datamining” techniques. Many universities have made all of their professors’ future articles open access, except when specifically requested not to.

Datamining academic research as academic research: Datamining academic publications is popular now because modern computers make it easy and the results are novel and publishable. It involves collecting masses of journal articles so that they can be analyzed by computer programs to find statistical patterns. This is different from individuals reading one paper at a time. Biology and medicine are especially helped by this, but it is now going on in humanities and law research as well. Larry Lessig wrote:

While at Stanford, Swartz had worked with a law student to download all the law review articles in the Westlaw database, to map funders of research with research conclusions. The result of that research was published in the Stanford Law Review, and showed a troubling connection between funders and their conclusions. At the time of Aaron’s alleged “crime,” he was a fellow at my Center at Harvard. The work of the Center? Studying the corruption of academic research (among other institutions) caused by money.

Bulk downloading or “crawling”: Bulk downloading is now done for various reasons, and libraries with large collections take various positions on it, expressing these positions in Terms of Service and robot exclusions. “Robots” or “crawlers” in this context are computer programs that perform repetitive actions, like downloading many documents from a website. Some such users are search engines, some are backing up materials, some are doing new research such as visualizing data, some are building different interfaces to the full dataset (like Freebase’s reuse of Wikipedia), and some are even enabling others to download in bulk more easily. Most datasets have some sort of license involved, so there is some nervousness on the part of providers about explicitly allowing all bulk downloading (for instance of’s book catalog data which is licensed from many players), but in general people are becoming more comfortable with the re-purposing of their data as it becomes more common.

The Internet Archive is regularly crawled.   We try to make our systems strong enough to serve these loads, and sometimes try to get robots to slow down.   We get hit with spam all the time, and occasional denial of service attacks.    But we haven’t called the police– we deal with it.   As a library we try to serve as many users as we can and some of those users are robots.

Open Data is a rising trend supported by government agencies and libraries. Open Data is bulk data that is specifically licensed for datamining, graphing, and linking to other open data. This is the minority of databases, but it is growing in importance. I bring this up because it shows a trend towards openness and datamining.

Terms of Service and Robots.txt files: These are mostly invisible “agreements”, often defensive documents meant to protect the organization from users and suppliers. They are regularly trodden on, sometimes resulting in providers instituting technological measures to slow down mass downloaders. I think of most Terms of Service as like an old joke about the Soviet Union: everything is illegal except when it is not. It is important to note that the specifics of many Terms of Service and robot exclusion files are regularly ignored by millions of people, and enforcement is ignored by millions of organizations. Enforcement is often very selectively applied.
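
For readers who have never looked at one, a robot exclusion file is only a few lines of text, and a crawler that wants to honor it can do so with very little code. What follows is a rough sketch in Python using the standard library's urllib.robotparser, not a description of how any particular library or crawler works. The file contents, the robot name "research-bot", the example.org addresses, and the one-second fallback pause are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; not taken from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser: RobotFileParser, agent: str, urls: list[str]):
    """Return (url, seconds_to_wait) pairs for the URLs the file permits."""
    # Fall back to a 1 second pause if no Crawl-delay is given (an assumption).
    delay = parser.crawl_delay(agent) or 1
    return [(url, delay) for url in urls if parser.can_fetch(agent, url)]


parser = build_parser(EXAMPLE_ROBOTS_TXT)
plan = plan_fetches(
    parser,
    "research-bot",  # hypothetical robot name
    [
        "https://example.org/articles/1",
        "https://example.org/private/2",
    ],
)
for url, wait in plan:
    print(f"fetch {url}, then pause {wait} seconds")
```

Nothing in the file itself stops a program that skips this check, which is part of why these files work more as requests than as enforcement.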

Bulk Downloading, Aaron Swartz, and Terms of Service: Putting this all together, mass downloading is often not discouraged as long as it is done slowly enough. What most concerns providers, in our experience, is what is done with the materials after they are downloaded. Terms of Service documents are generally “CYA” documents in which it is difficult to communicate nuance, and we should recognize that violating them may not be “right”, but it is common practice. Opening up library databases, including but not limited to public domain materials, to new types of research is important, especially in academia. Most organizations are adapting to these new types of computational research opportunities, but some will try to stop them. All in all, we do not yet have a good way to draw lines around what is acceptable practice; it is all evolving. What I know of Aaron’s downloading of old journal articles for later use is not outside of what many people do. What is unusual are the reactions on the part of JSTOR, MIT, and the US prosecutors.

What I am suggesting is we need a bit more slack in the system.   We need to be able to talk things through before we turn to police and courts.   We need to leave room for a new generation of people and ideas that may alter how our institutions work.    No, more than that, we should welcome and encourage people and ideas that will alter how our institutions work.

Aaron helped many of us adapt our institutions’ services to the digital opportunities. Let’s continue this important work.






