## This dataset has been moved to the Edinburgh DataVault, where it is directly accessible only by authorised University of Edinburgh users. For further information please see https://www.research.ed.ac.uk/en/datasets/securing-a-data-set-on-allegations-of-sexual-abuse-made-against-t ##
In this work we look at the initial phase of an ESRC-funded project involving academics from Social Work, Criminology, Informatics and the University of Edinburgh Library. This project collected and analysed a data set on allegations of sexual abuse made against the former disc jockey, Jimmy Savile. The Savile affair has taken place in a public and highly charged arena. It has generated massive media attention and spawned several public reports, most notably the report produced as a result of Operation Yewtree. Early allegations against Savile emanated from former residents of Duncroft, a residential school for 'wayward but intelligent young women'. This project stems from data produced and collected by the blogger 'Anna Raccoon', herself a former resident at the school. Through her blog posts on Savile and Duncroft she was contacted by others and collected a variety of information on the subject. The data harvested from the blog are supplemented by official reports and other blogs.
The initial component of the project involves capturing Anna Raccoon's blog (The Raccoon Arms). This is a WordPress blog that was taken down by its author. Following previous research approaches [9, 8], we searched for copies of the site in other content management systems and found that it had been archived in several frozen states in the Internet Archive's Wayback Machine (IA). An active blog is a constantly evolving object, so careful consideration needs to be given to which version or versions should be harvested.

Given that the blog is available via the IA, one might question why it is necessary to download a copy at all. There are two main reasons for doing so. Firstly, the IA may at any time, and without notice, remove the objects from its archive. Secondly, a local copy allows additional functionality to support qualitative analysis of the content of the blog, as well as indexing to support resource discovery not provided within the blog software or the IA.

While harvesting the contents of a blog manually can be a long and arduous process, it can be simplified and automated using a software solution such as wget, as illustrated in the sketch below. Apart from soliciting permission from the IA, decisions need to be made about which version or versions should be harvested, to what level of recursion each harvest should descend, and whether just the blog text or all files contributing to the content and functionality of the blog should be gathered. Such decisions influence not only the size of the eventual object but also the richness of the context. There are concomitant drawbacks: the deeper the recursion, the greater the number of missing files (those that have not been harvested by the IA).

Given that WordPress blogs are delivered as HTML files, the text portion (apart from any images and other audio-visual files associated with the posts) is in as efficient a format as possible with respect to file storage, and it lends itself to XML-based value-added indexing and tagging. Storage requirements depend largely on the number of snapshots of the blog that are harvested and the level of recursion specified in the harvests. The size of one snapshot can range from 53 MiB to 660 MiB (roughly 1,500 to 88,000 files), depending on the options specified.
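To make the harvesting step concrete, the following is a minimal sketch of how one Wayback Machine snapshot might be pulled down by driving wget from a short Python script. It is an illustration only, not the project's actual harvesting procedure: the snapshot timestamp, blog domain and output directory are hypothetical placeholders, and the recursion level and politeness delay are assumed values.

```python
#!/usr/bin/env python3
"""Minimal sketch: harvest one Wayback Machine snapshot of a blog with wget.

The URL, timestamp and output directory below are placeholders chosen for
illustration; they are not the project's actual harvest parameters.
"""
import subprocess

# Hypothetical snapshot of the blog in the Internet Archive (placeholder values).
SNAPSHOT_URL = "https://web.archive.org/web/20121001000000/http://example-blog.example.com/"
OUTPUT_DIR = "blog-snapshot"

# Assumed recursion depth: deeper harvests capture more context, but also hit
# more links that the IA never captured (i.e. more missing files).
RECURSION_LEVEL = 2

cmd = [
    "wget",
    "--recursive",                  # follow links within the snapshot
    f"--level={RECURSION_LEVEL}",   # limit recursion depth
    "--page-requisites",            # also fetch images, CSS and JS for each page
    "--convert-links",              # rewrite links so the copy browses offline
    "--no-parent",                  # do not ascend above the snapshot's start URL
    "--wait=1",                     # politeness delay between requests to the IA
    "--directory-prefix", OUTPUT_DIR,
    SNAPSHOT_URL,
]

subprocess.run(cmd, check=True)
```

Varying the recursion level and omitting options such as `--page-requisites` are the kinds of choices that account for the wide range in snapshot size and file count noted above.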