8 Accessing Newspaper Data from the Shared Research Repository
Most of the rest of this book uses newspaper data from several newspaper digitisation projects connected to the British Library, as described in Chapter 2. These projects have made the ‘raw data’ for a number of titles freely available for anyone to use. This chapter explains the structure of the repository which holds them, and walks through a method for downloading titles in bulk.
Downloading Titles in Bulk
Acquiring a dataset on which to work can be cumbersome if you have to manually download each file. The first part of this tutorial will show you how to bulk download all or some of the available titles. This was heavily inspired by the method found here. If you want to bulk download titles using a more robust method (using the command line, so no need for R), then I really recommend checking out that repository.
To do this there are three basic steps:
- Make a list of all the links in the repository collection for each title.
- Go into each of those title pages, and get a list of all the links for each zip file download. Optionally, you can specify which titles you’d like to download, or even which years.
- Download all the relevant files to your machine.
Create a dataframe of all the newspaper files in the repository
First, load the libraries needed:

library(tidyverse)
library(XML)
library(xml2)
library(rvest)
Next, grab all the pages of the collection:
urls = paste0("https://bl.iro.bl.uk/collections/9a6a4cdd-2bfe-47bb-8c14-c0a5d100501f?locale=en&page=", 1:6)
Use lapply() to go through the vector of urls, reading each page into R with read_html(), and store each result as an item in a list:

list_of_pages <- lapply(urls, read_html)
Next, write a function that takes a single html page (as downloaded with read_html), extracts the links and newspaper titles, and puts it into a dataframe.
make_df = function(x){

  all_collections = x %>%
    html_nodes(".search-result-title") %>%
    html_nodes('a') %>%
    html_attr('href') %>%
    paste0("https://bl.iro.bl.uk", .)

  all_collections_titles = x %>%
    html_nodes(".search-result-title") %>%
    html_text()

  all_collections_df = tibble(all_collections, all_collections_titles) %>%
    filter(str_detect(all_collections, "concern\\/datasets"))

  all_collections_df

}
Run this function on the list of html pages. This will return a list of dataframes. Merge them into one with rbindlist() from data.table.
l = purrr::map(list_of_pages, make_df)

l = data.table::rbindlist(l)

l %>% knitr::kable('html')
Each dataset’s page in the repository is itself split across several pages of file listings, so add ‘&page=’ to each url and expand the dataframe to give one row per page (assuming here a maximum of ten pages per dataset):

l = l %>% mutate(pages = paste0(all_collections, "&page="))

sequence <- 1:10

# Expand the dataframe and concatenate with the sequence
expanded_df <- l %>%
  crossing(sequence) %>%
  mutate(pages = paste(pages, sequence, sep = ""))
Now we have a dataframe containing the urls for every title in the collection, with one row per page of each title’s file listing. The second stage is to go to each of these urls and extract the relevant download links.
Write another function. This takes a url, reads the page, and extracts every link within it, along with each link’s ID and title attribute, into a dataframe. The relevant download links have the ID ‘file_download’; we’ll filter down to them in the next step using their titles.
get_collection_links = function(c){

  links_df = tryCatch({

    collection = c %>% read_html()

    links = collection %>% html_nodes('a') %>% html_attr('href')

    id = collection %>% html_nodes('a') %>% html_attr('id')

    text = collection %>% html_nodes('a') %>% html_attr('title')

    tibble(links, id, text)

  },
  # Action to perform when an error occurs:
  # return NULL so the page is simply skipped when the results are merged
  error = function(e) {
    NULL
  })

  return(links_df)
}
Use pblapply() from the pbapply package (a version of lapply() with a progress bar) to run this function on the column of urls from the previous step, and merge the results with rbindlist(). Keep just the links whose title contains the text ‘BLNewspapers’.
t = pbapply::pblapply(expanded_df$pages, get_collection_links)

names(t) = expanded_df$all_collections_titles

t_df = t %>%
  data.table::rbindlist(idcol = 'title')

t_df = t_df %>%
  filter(str_detect(text, "BLNewspapers"))
The new dataframe needs a bit of tidying up. To use the download.file() function in R we need to specify not just the link but also the full filename and location where we’d like each file to be put. The ‘text’ column contains the information we need, but it requires some alterations. First, remove any duplicate rows. Next, extract the year and the seven-digit NLP code from the text, and create a new ‘filename’ column containing just the name of the zip file, without the surrounding ‘Download’ text. Finally, turn the links into full urls and build a ‘destination’ column by adding the path to the ‘newspapers’ folder to the beginning of each filename, so that the files will be downloaded into that folder (change this path to suit your own machine).
t_df = t_df %>% distinct(text, .keep_all = TRUE) %>%
  mutate(year = str_extract(text, "(?<=_)[0-9]{4}(?=[_.])")) %>%
  mutate(nid = str_extract(text, "[0-9]{7}")) %>%
  mutate(filename = str_extract(text, '(?<=Download ")[^"]+'))

t_df = t_df %>% mutate(links = paste0("https://bl.iro.bl.uk", links)) %>%
  mutate(destination = paste0('/Users/Yann/Documents/non-Github/r-newspaper-quarto/newspapers/', filename))
The result is a dataframe which can be used to download either all or some of the files.
Filter the download links by date or title
You can now filter this dataframe to produce a list of titles and/or years you’re interested in. For example, if you just want all the newspapers for 1855:
files_of_interest = t_df %>% filter(as.numeric(year) == 1855)

files_of_interest %>% knitr::kable('html')
To download these we use the Map() function, which will apply the function download.file() to the vector of links, using the ‘destination’ column we created as the file destination. R’s download timeout is 60 seconds by default, but these downloads will take much longer, so increase the limit using options(timeout=9999).
Before this step, you’ll need to create a new folder called ‘newspapers’, within the working directory of the R project.
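If you prefer, the folder can be created from R. This is a minimal sketch, assuming your working directory is the root of the R project:

# Create the 'newspapers' folder in the working directory, if it doesn't already exist
if (!dir.exists("newspapers")) dir.create("newspapers")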
options(timeout=9999)
Map(function(u, d) download.file(u, d, mode="wb"), files_of_interest$links, files_of_interest$destination)
Folder structure
Once these have downloaded, you can quickly unzip them using R. First it’s worth understanding a little about the folder structure you’ll see once they’re unzipped.
Each file will have a filename like this:
BLNewspapers_TheSun_0002194_1850.zip
This is made from the following pieces of information:
BLNewspapers - this identifies the file as coming from the British Library
TheSun - this is the title of the newspaper, as found on the Library’s catalogue.
0002194 - This is the NLP, a unique code given to each title. This code is also found on the Title-level list, in case you want to link the titles from the repository to that dataset.
1850 - The year.
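As a quick illustration of this structure, a filename can be split back into these parts with the tidyverse functions loaded earlier. This is just a sketch, and the column names are illustrative:

# Split the example filename above into its four component parts
tibble(file = "BLNewspapers_TheSun_0002194_1850.zip") %>%
  separate(file, into = c("source", "title", "nlp", "year"), sep = "_") %>%
  mutate(year = str_remove(year, "\\.zip$"))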
Construct a Corpus
At this point, and for the rest of the tutorials in the book, you might want to construct a ‘corpus’ of newspapers, using whatever criteria you see fit. Perhaps you’re interested in a longitudinal study, and would like to download a small sample of years spread out over the century, or maybe you’d like to look at all the issues in a single newspaper, or perhaps all of a single year across a range of titles.
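For example, here are two possible filters on the t_df dataframe built earlier. The years and the title string are purely illustrative, so adapt them to your own interests:

# A longitudinal sample: every title for a handful of years spread across the century
longitudinal_sample = t_df %>%
  filter(as.numeric(year) %in% c(1805, 1830, 1855, 1880))

# A single newspaper across all available years, matched on its filename
single_title = t_df %>%
  filter(str_detect(filename, "TheSun"))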
The tutorials will make most sense and produce similar results if your corpus is the same as above: all newspapers in the repository from the year 1855. You can also download a single .zip file with the extracted text from these titles here.
Bulk extract the files using unzip() and purrr::map()
R can be used to unzip the files in bulk, which is particularly useful if you’re using Windows and have a large number of files to unzip. There are just two steps.
First, use list.files() to create a vector, called zipfiles, containing the full file paths to all the zip files in the ‘newspapers’ folder you’ve just created (adjust the path below to wherever your zip files are stored):
zipfiles = list.files("/Volumes/T7/zipfiles/", full.names = TRUE)

zipfiles
Now, apply unzip() to each of these files with purrr::map(). Iterating over a vector like this is very useful for automating simple tasks. The line below takes each file named in the ‘zipfiles’ vector and unzips it into the project directory; it takes some time.
purrr::map(zipfiles, unzip)
Once this is done, you’ll have a new folder (or several new folders) in the project directory (not the newspapers directory). These are named using a numeric code, called the ‘NLP’ (for example 0002194).
To tidy up, move these folders into the ‘newspapers’ folder.
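One way to do this from R is sketched below. It assumes the unzipped NLP folders are the only all-digit folders in the project root; check this (or move them by hand) if you’re unsure.

# Find the unzipped NLP folders (named with an all-digit code) in the project root
nlp_folders = list.files(".", pattern = "^[0-9]+$")

# Move each one into the 'newspapers' folder
file.rename(nlp_folders, file.path("newspapers", nlp_folders))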
These files contain the METS/ALTO .xml files with the newspaper text. If you have followed the above and downloaded all newspapers for the year 1855, you should have seven different titles and a few hundred newspaper issues. In the next chapter, you’ll extract this text from the .xml and save it in a more convenient format. These final files will form the basis for the following tutorials which process and analyse the text of the newspapers.