Get Archived: 7 Ways to Preserve Your Internet Content for Posterity

An ocean is never the same body of water because it’s always moving, changing, evaporating and being replenished by new rainwater and runoff. Likewise, the Internet is an ocean of information whose content evaporates almost as quickly as new content flows in.

We’ve heard of the deep Web and the invisible Web: private or subscriber-based databases that public search engines cannot index, whether intentionally or unintentionally. But what about the millions of links that break when content or entire Web sites are removed from production, or when domain names expire?

While we might not miss the “Kyle’s Frat Party” site, what about information of value to journalists, researchers and academics? For online journal publishers and academic researchers who cite Internet content in the form of URLs, this is an especially troublesome issue. In 2003, the San Francisco Chronicle reported:

…a growing number of scientists and scholars who are nervous about their increasing reliance on a medium that is proving far more ephemeral than archival. In one recent study, one-fifth of the Internet addresses used in a Web-based high school science curriculum disappeared over 12 months. Another study, published in January, found that 40 percent to 50 percent of the URLs referenced in articles in two computing journals were inaccessible within four years.

Archive-It 2.0

One solution comes from the Internet Archive, a non-profit organization dedicated to preserving the Web and other digital archives: the Archive-It 2.0 service, which allows the permanent capture of Web-based information for reference and archival purposes. Partners with featured collections include the University of Toronto, Indiana University and the North Carolina State Archives.

Archive-It 2.0 enables digital archivists and library and museum professionals to create more tailored, relevant and search-friendly collections of up to 10 million URLs based on regular Web crawls across selected websites. Through test crawls, subscribers can see what kind of Web material would populate a collection before archiving it permanently. An optional paid feature within Archive-It 2.0, Archive-It Pro, allows subscribers not only to cap how many Web documents are collected from a website over time, but also to block the collection of materials from specific websites altogether. As a result, the digital collections are focused and more easily managed, because irrelevant materials do not find their way into an institution’s archives.

Another issue is the increasing number of sites that publish information dynamically. Unlike static pages, which can be archived as fixed documents, dynamic pages generate content on demand, and that content changes based on what information is requested from a database. Most blog sites offer Permalinks so search engines can index a permanent (or semi-permanent) record of journal entries, but as the Goddard Library Web Project reported in a D-Lib Magazine article published in November 2004, the Web is becoming increasingly inaccessible for archival purposes:

We encountered several problems when performing the crawl on the increasingly complex scientific web sites. The most common problem resulted from the increasingly dynamic nature of those web sites. This includes content that is controlled by Javascript and Flash technologies, and dynamic content driven from database queries or content management systems. The crawling tool is unable to crawl a web page containing a search form that queries a database.

ISO Standards for publications

While librarians and Internet archivists try to address the issue of vanishing or inaccessible Internet content, Web site owners and content developers can play a part in helping libraries and archives document and preserve the Web. Canada’s national Library and Archives site has an excellent paper on Electronic Publishing, published in 2001. While it was intended for Canadian publishers, its principles can be broadly applied to any electronic publisher on the Web. A matrix on that site explains the scope of what the document means by electronic publishers.

Serial publications

If you publish an online journal, ezine or other serial publication, you can apply for an ISSN (International Standard Serial Number), “a unique code for identifying serial publications, such as periodicals, newspapers, annuals, journals and monographic series” (Canada’s ISSN) that covers “magazines, newspapers, annuals (such as reports, yearbooks, and directories), journals, memoirs, proceedings, transactions of societies, and monographic series” (the United States ISSN). For serials distributed on the Internet and World Wide Web, the ISSN should appear on the first screen of the item.

While publishers are not legally obliged to use an ISSN, the U.S. site lists the benefits of applying for one:

The ISSN should be as basic a part of a serial as the title. The advantages of using it are abundant and the more the number is used the more benefits will accrue.

  • ISSN provides a useful and economical method of communication between publishers and suppliers, making trade distribution systems faster and more efficient.
  • The ISSN results in accurate citing of serials by scholars, researchers, abstracters, and librarians.
  • As a standard numeric identification code, the ISSN is eminently suitable for computer use in fulfilling the need for file update and linkage, retrieval, and transmittal of data.
  • ISSN is used in libraries for identifying titles, ordering and checking in, and claiming serials.
  • ISSN simplifies interlibrary loan systems and union catalog reporting and listing.
  • The U.S. Postal Service uses the ISSN to regulate certain publications mailed at second-class and controlled circulation rates.
  • The ISSN is an integral component of the journal article citation used to monitor payments to the Copyright Clearance Center Inc.
  • All ISSN registrations are maintained in an international data base and are made available in the ISDS Register, a microfiche publication which is scheduled to cease in the near future, or in “ISSN Compact,” a CD-ROM. These products are described in a document maintained by the ISSN International Centre: ISSN products
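The point above about the ISSN being suited to computer use can be illustrated with its built-in check digit. An ISSN is eight characters long, and the last one is computed from the first seven using weights 8 down to 2 and a modulus of 11, with "X" standing for 10. The following is a minimal sketch of that calculation in Python; the function names are illustrative and do not come from any ISSN agency.

```python
import re


def issn_check_digit(first_seven: str) -> str:
    """Return the check character for the first seven digits of an ISSN."""
    # Weights run from 8 down to 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    # A computed value of 10 is written as the letter X.
    return "X" if value == 10 else str(value)


def is_valid_issn(issn: str) -> bool:
    """Check the NNNN-NNNC shape (hyphen optional) and the check digit."""
    match = re.fullmatch(r"([0-9]{4})-?([0-9]{3})([0-9Xx])", issn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    return issn_check_digit(body) == match.group(3).upper()


if __name__ == "__main__":
    for sample in ("0378-5955", "0378-5954"):
        print(sample, is_valid_issn(sample))
```

A check like this catches most typing errors in a citation, but it cannot tell you whether the number has actually been registered to a serial.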

Individual publications

For individual publications, publishers should apply for an ISBN. International Standard Book Numbers (ISBN) are standard numbers (10 digits originally, 13 digits for numbers assigned since 2007) for the unique identification of each edition of a book or other monographic publication (e.g. pamphlets, educational kits), as described on the Canadian ISBN site:

The International Standard Book Number (ISBN) is a system of numerical identification for books, pamphlets, educational kits, microforms, CD-ROM and other digital and electronic publications. Assigning a unique number to each published title, provides that title with its own, unduplicated, internationally recognized identifier.
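Like the ISSN, an ISBN carries a check digit, so a citation can be sanity-checked before it is published. Below is a minimal Python sketch, with illustrative function names, covering the 10-digit form described here (weights 10 down to 1, modulus 11, "X" meaning 10) and the 13-digit form in use since 2007 (alternating weights 1 and 3, modulus 10).

```python
import re


def _clean(isbn: str) -> str:
    """Drop hyphens and spaces and normalise a trailing x to X."""
    return isbn.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn: str) -> bool:
    s = _clean(isbn)
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    # X is only allowed in the last position and counts as 10.
    values = [10 if c == "X" else int(c) for c in s]
    weighted = sum(v * w for v, w in zip(values, range(10, 0, -1)))
    return weighted % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    s = _clean(isbn)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    # Digits in odd positions (1st, 3rd, ...) weigh 1, the others weigh 3.
    weighted = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return weighted % 10 == 0


if __name__ == "__main__":
    print(is_valid_isbn10("0-306-40615-2"))
    print(is_valid_isbn13("978-0-306-40615-7"))
```

As with the ISSN, passing the check only shows the number is well formed, not that it has been assigned to a title.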

As content publishers, our sites become part of the ocean of content online. We have a moral obligation to our current and future users to ensure the content we create becomes part of the Internet’s official historical record, good and bad, of humankind.

7 Ways to Keep Your Content from Vanishing

Most Web publishers, including me, are guilty of breaking links or removing content, then having readers email to ask “What happened to that (article/news item/link/download) on your site?” Here are some steps you can take to help keep your content online and accessible (assuming you want it to be so!).

  1. Check your links! This may be obvious, but even with all the content management systems, link verification software and other tools available to Web publishers, it’s still a stinky issue. You or your Web development staff should establish link-naming conventions (e.g. Wikipedia’s) that can be followed consistently, whether links are named manually or generated dynamically.
  2. Archive your links. If you really need to remove a link that is still valid, but isn’t relevant/essential to your site anymore, consider creating a Link Archive page where you can move the links so they can still be indexed by search engines and found by your users. Otherwise, create a redirect for old links so they point to a message indicating they are no longer available, or to new pages/content.
  3. Archive your old site(s). Redesigning your site? Replacing it with a new version that has new content? Consider leaving the old site on your server in a Historical Site Archive area. If you don’t want search engines to index it and return pages of outdated results to your users, use a robots.txt file to exclude the historical pages from spidering.
  4. Let the Internet Archive do the work. Read “How can I get my site included in the Archive?” on the Internet Archive’s site. It’s a blast from the past to see older versions of sites going back to the mid-90s on the Internet Archive, and users can link to these pages, too.
  5. Let search engines archive your pages. Find out how to ensure that your site is search engine optimized and that pages are not being published in a way that will cause search engine spiders to exclude them from indexing. Search Engine Watch is an excellent resource for SEO, SearchTools.com has some useful information on indexing robots and spiders, and the all-important Google provides guidelines for Webmasters on how to make your site Google-friendly.
  6. Open up your content. Mirroring your content on other sites is another strategy for keeping it alive and accessible. Licensing your content through a Creative Commons licence and/or offering it for republication or repurposing on the Internet gives it more places to survive. For more information, visit the Open Content Alliance or the innovative Public Knowledge Project.
  7. Use Universal Design principles. Last but not least, use Universal Design principles to ensure accessibility to the broadest range of users. It’s good not only from a usability perspective, but also from an archiving perspective.
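Steps 1 and 3 above lend themselves to a little automation. Here is a rough sketch in Python, using only the standard library, of how you might list the broken links on a page and confirm that a robots.txt file really excludes your historical archive. The function names, the example.com addresses and the /archive/ path are all made up for illustration, and a real link checker would also need rate limiting and retries.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response arrives.

    Uses a HEAD request; some servers reject HEAD, so a 405 here may
    deserve a second look with a normal GET.
    """
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
    except (URLError, OSError):
        return None


def find_broken_links(html, base_url, status_of=http_status):
    """Return (url, status) pairs for links that fail or answer with 4xx/5xx."""
    broken = []
    for link in extract_links(html, base_url):
        if not link.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar schemes
        status = status_of(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken


def is_excluded(robots_txt, url, user_agent="*"):
    """True if the given robots.txt text tells user_agent not to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules that keep spiders out of an old-site archive area.
    rules = "User-agent: *\nDisallow: /archive/\n"
    print(is_excluded(rules, "http://example.com/archive/2004/index.html"))
    print(is_excluded(rules, "http://example.com/index.html"))
```

The status_of parameter lets you swap in a different checker, which is handy for testing without touching the network. Keep in mind that robots.txt is only a request to well-behaved spiders; it does not hide or protect the archived pages.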