How can you archive a set of established blogs together?
(10 posts) (7 voices)
I am asking this question on behalf of Media Studies professor Hector Amaya. He is working on a project revolving around the narco-violence phenomenon on the US-Mexico border. As part of the project he wants to archive a series of blogs that have sprouted up in reaction to the violence. The blogs are a kind of citizens' media operating in lieu of an effectively intimidated official media, which of course makes the blogs (and the bloggers) very vulnerable. Here are two of the key blogs: El Blog del Narco: http://www.blogdelnarco.com/ and El Mundo del Narco: http://www.mundonarco.com/. There are many others he wants to archive, but other than the Internet Archive, I don't know of anyone who has done this. An HTML capture would probably work, but I don't even know how one would do that for a series of blogs. Any ideas?
Posted 4 years ago
Good question. I've also begun to look into this issue for a couple of Civil War blogs that we're trying to archive at Gettysburg College. One suggestion that has been floated is Archive-It (http://www.archive-it.org), which is from the Internet Archive folks. I'm interested to hear input from others.
Posted 4 years ago
Thanks, Zach. Interestingly enough, the University of Virginia is not one of the partners. I will dig deeper behind the scenes at the Internet Archive to see whether some of their archiving code could be useful. In the meantime, the suggestion box remains open.
Posted 4 years ago
With blogs, one question is what, exactly, you want to archive -- just the posts, or the entire site?
If just the posts, then the quickest way to accomplish that is to fire up a WordPress installation, install the FeedWordPress plugin, and use it to pull in the RSS feeds (assuming they're available).
I've gotten a basic FeedImporter going for OccupyArchive (running Omeka), but it needs some work before it's generally usable.
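If you want to check whether those feeds are actually exposed before wiring up FeedWordPress, a quick one-liner will do it. The feed path below is just a guess -- Blogger-style blogs usually publish something like /feeds/posts/default, so adjust it to whatever the blog actually advertises:
wget -O blogdelnarco-feed.xml http://www.blogdelnarco.com/feeds/posts/default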
If you want the entire site -- pages, links in the sidebar, etc. -- then something like Archive-It would be in order. Or, if the sites aren't too extensive, use Zotero to capture snapshots.
Posted 4 years ago
I hadn't heard of HTTrack. For the Mac, you might also check out SiteSucker - see http://www.sitesucker.us/mac/mac.html. It does quite a nice job of downloading all the pages on a site, and provides a good set of options.
An interesting hosted alternative might be Instapaper: if you're a (paid) subscriber to the service, the pages you save are archived and searchable.
Posted 4 years ago
There are quite a few ways of actually going about this. As I mentioned to Alex last week, HTTrack is pretty solid and has a GUI for this sort of thing. There is also Heritrix (it's what the Internet Archive uses to crawl sites), but it does require some server deployment skills to run. You can also use the wget command-line tool to create an offline mirror (see the man page for way more than you probably want to know about wget).
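If you'd rather skip the GUI, HTTrack can also be run from the command line. Something along these lines should produce a basic mirror (the output directory and filter pattern here are just illustrative; check HTTrack's docs for the exact options you want):
httrack "http://www.blogdelnarco.com/" -O ./blogdelnarco "+*.blogdelnarco.com/*" -v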
For using wget, you would do something like this:
wget --wait 1 -xmk http://www.blogdelnarco.com/
This will tell wget to go to http://www.blogdelnarco.com/, mirror the directory structure and all its content, as well as convert any links to your local path. These options also tell the tool to wait 1 second between requests. The above does respect the robots.txt that a site may have, but if needed, you can override this setting by adding '-e robots=off' to the options.
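So if a robots.txt turns out to be blocking the crawl, the same command with the override added would look like this:
wget --wait 1 -xmk -e robots=off http://www.blogdelnarco.com/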
Posted 4 years ago
HTTrack is also a new tool to me—will have to check that out.
Depending on how serious the professor is about saving these blogs, it might be worth going the extra step of using Heritrix (also used by the Library of Congress), because it captures additional technical and administrative metadata -- documenting when and where the material was collected, the state of the servers, etc. There is also the advantage of getting a wrapper format that keeps all of the HTML, images, etc. organized together. This format, WARC, is the standard preservation format for web archives, and a library might be more willing to take the blog archive if it is done in this standard way.
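If running Heritrix turns out to be more than you want to take on, reasonably recent versions of wget (1.14 or later, if I remember correctly) can write WARC output alongside the mirror. A rough sketch, with the output filename chosen arbitrarily:
wget --mirror --page-requisites --warc-file=blogdelnarco http://www.blogdelnarco.com/
That produces a blogdelnarco.warc.gz next to the mirrored files, though it won't record everything a full Heritrix crawl would.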
Just something else to consider.
Posted 4 years ago
Why not contact the bloggers and arrange to do this collaboratively? For example, with the WordPress Database Backup plugin, the bloggers could easily have archives emailed periodically to the researchers. Arguably, that's a more ethically sound approach than just slurping someone's site (even if it is public). And the bloggers might be glad to have someone back up their work.
Posted 4 years ago
Thanks everyone for the tips. I'm going to play with Heritrix. Bradley, thanks for the idea. So simple, it might just work.
Posted 4 years ago