The Interwebs


Usually we tend to think of the WWW as a tool for research, and I’ll dive into some of the ways that I use specific tools to search and mine the web for resources in a later post, but today I wanted to share a bit about how the web can serve as a subject for research. Web social science is the next big thing, with regular sessions now appearing at many major academic society conferences. If you want to get the big overview, I’d recommend you start with Robert Ackland’s recent book, Web Social Science: Concepts, Data and Tools for Social Scientists in the Digital Age (University of Edinburgh users can read the book online via the university library). Ackland’s book is a terrific resource, covering both qualitative and quantitative modes of research and a large range of tools: online surveys and focus groups, web content gathering and analysis, social media network analysis (which I’ll discuss in a future post), and online experimentation. For an author who is quite technical, he covers a very helpful range of ethical considerations and surveys a range of contemporary methodological literature, and he presents each of these research domains in a way that is accessible to readers who haven’t done this kind of work before. A few years ago, when I began doing web social science and social network analysis, I found Ackland’s book to be a terrific catalyst into the wider field of web studies.

I’ve been gathering websites for a few research purposes. The first connects back to my interest in geocoding: many of the groups I’m studying have publicly available data on member group locations. A quick web scrape and a lengthier bit of data cleaning yield a table of site names with addresses, coordinates, and often a web address. I’ll go into this a bit more in my next post so you can try your hand at creating a geocoded database.

There is also a huge amount of data available for documentary-style analysis sitting out on the internet. Many groups post their newsletters in web-only format, or archive PDF versions of their print materials, and this provides a really useful way to get at the historical work of a particular group (in my case, environmental community groups and churches). Now in some cases, you may find that a website has disappeared or has been hijacked by malevolent forces. Fear not if this happens to you (I’ve found both to be the case in my own relatively recent work); just hop on over to the Wayback Machine at the Internet Archive and look up a historic version of your page. This is a ridiculously useful tool, and it will also allow you to view the historical development of a webpage, including content changes, at regular (usually bi-annual) intervals.
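If you would rather look up an archived copy from a script than through the browser, the Internet Archive offers a small JSON "availability" endpoint for the Wayback Machine. What follows is only a minimal sketch in Python, not a recommended workflow: the function names are my own inventions for illustration, example.org is a placeholder address, and the response layout (an "archived_snapshots" object holding a "closest" entry) is what I understand the endpoint to return, so check the Archive’s own documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine availability endpoint (see the Internet Archive's API docs).
API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL. `timestamp` is optional, e.g. "20100101" (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Network call: ask the Internet Archive for the snapshot nearest `timestamp`."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup("example.org", "20100101") should hand back the address of the snapshot nearest that date, or None if the page was never archived. The parsing is kept separate from the network call so that you can try it out on a saved response first.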

Finally, the web can offer a terrific data set for “big data” explorations. Here there are a few ethical implications I should highlight, as they aren’t immediately obvious if you haven’t hosted a web site of your own before. There is a range of tools that can “crawl” websites, that is, download a large batch of pages en masse. Basically, a crawler is a software tool that downloads a web page you specify and then analyses that page for any URLs it contains. The program then goes on to download whatever it finds at those linked pages, and so on. You can usually specify a “link depth” to restrict how many times the program will follow a link to another page, as this can get very big very fast. That last detail is where the ethical problems can arise. Many web hosts provide a metered service, so if you generate a huge volume of traffic for some small charity you may inadvertently crash their website (if the server is small), use up their monthly quota, or increase the amount they pay on their hosting invoice. All of these situations are best avoided, and some web crawling software has begun to build in time and bandwidth limitations to prevent researchers (and search engines) from zapping small web properties with a massive web crawl. So with that in mind, let me note a few tools that are already adapted for academic use:
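To make the idea of link depth and a polite pause a bit more concrete, here is a rough sketch of a tiny crawler in Python, using only the standard library. It is an illustration under my own assumptions rather than a description of how any of the tools below work: the names (crawl, fetch_page, max_depth, delay) are made up for this example, the five-second pause is an arbitrary choice, and a real crawl should also respect a site’s robots.txt, which this sketch does not do.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for the links on a page, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_page(url):
    """Download one page as text (the only function that touches the network)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, max_depth=1, delay=5.0, fetch=fetch_page, same_host=True):
    """Breadth-first crawl from `seed`, following links at most `max_depth` times.

    Sleeps `delay` seconds between downloads so a small server is not flooded.
    Returns a dict mapping each visited URL to its page text.
    """
    host = urlparse(seed).netloc
    pages = {}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages:
            continue  # already downloaded
        if pages:
            time.sleep(delay)  # pause before every request after the first
        pages[url] = fetch(url)
        if depth >= max_depth:
            continue  # link depth reached: keep the page, ignore its links
        for link in extract_links(url, pages[url]):
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if same_host and parts.netloc != host:
                continue  # stay on the seed's own site
            if link not in pages:
                queue.append((link, depth + 1))
    return pages
```

With max_depth=1 the crawler downloads the seed page plus the pages it links to directly, and nothing further. Raising the depth by one can multiply the number of requests many times over, which is exactly why the delay and the same_host restriction are switched on by default here.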

Richard A. Rogers at the University of Amsterdam developed a software application called Issue Crawler. He is also the author of Information Politics on the Web (MIT Press, 2005). Issue Crawler is meant to crawl what Rogers calls issue networks on the web. Since I’m interested in environmental issue networks, I find his work pretty compelling and useful. You can read more about Issue Crawler on Rogers’ Wikipedia page.

I’ve already mentioned Robert Ackland and his book above, but it’s important to note that he is the creator of another tool for web crawling, called VOSON, which is hosted through uberlink.com. VOSON is cool in that you provide a list of seed links and then the tool slowly accumulates all the web pages that match your list, storing them on its server. In some cases a website will already have been crawled, so the service can aggregate user requests and reduce the load on the sites being crawled. VOSON and Issue Crawler both cost money to use, so you’ll want to write them into your next big grant. But there are a few other options if you want to run some of your own crawling, albeit on a more modest scale (and at off-peak hours!). I use a German software application called DEVONthink to parse through large batches of text, including PDF files and web sites (which the application can archive). It is a very powerful tool with algorithmic searching which can intelligently sift through lots of data and help you find connections you might not have been expecting. Of course, you can also go to the command line and use curl or Wget. Konrad Lawson breaks down the procedure for web scraping using the command line over at the ProfHacker blog. There is much more to be said on this topic, and a ton of other tools I’ll continue to introduce on this blog, but that’s a good introduction for now, I think.

Have you used websites in your research? I’d love to hear more in the comments about what you’ve found useful and what has proven a distraction!
