8 Accessing Newspaper Data from the Shared Research Repository
Most of the rest of this book uses newspaper data from a number of newspaper digitisation projects connected to the British Library, as written about in Chapter 2. These projects have made the ‘raw data’ for a number of titles freely available for anyone to use. This chapter explains the structure of the repository which holds them, and walks through a method for downloading titles in bulk.
Downloading Titles in Bulk
Acquiring a dataset on which to work can be cumbersome if you have to manually download each file. The first part of this tutorial will show you how to bulk download all or some of the available titles. This was heavily inspired by the method found here. If you want to bulk download titles using a more robust method (using the command line, so no need for R), then I really recommend checking out that repository.
To do this there are three basic steps:
- Make a list of all the links in the repository collection for each title
- Go into each of those title pages, and get a list of all links for each zip file download.
- Optionally, you can specify which titles you’d like to download, or even which year.
- Download all the relevant files to your machine.
Create a dataframe of all the newspaper files in the repository
First, load the libraries needed:
library(tidyverse)
library(XML)
library(xml2)
library(rvest)
Next, grab all the pages of the collection:
urls = paste0("https://bl.iro.bl.uk/collections/9a6a4cdd-2bfe-47bb-8c14-c0a5d100501f?locale=en&page=", 1:6)
Use lapply to go through this list, using the function read_html to read each page into R, and store each result as an item in a list:
list_of_pages <- lapply(urls, read_html)
Next, write a function that takes a single html page (as downloaded with read_html), extracts the links and newspaper titles, and puts them into a dataframe.
make_df = function(x){
  all_collections = x %>%
    html_nodes(".search-result-title") %>%
    html_nodes('a') %>%
    html_attr('href') %>%
    paste0("https://bl.iro.bl.uk", .)
  all_collections_titles = x %>%
    html_nodes(".search-result-title") %>%
    html_text()
  all_collections_df = tibble(all_collections, all_collections_titles) %>%
    filter(str_detect(all_collections, "concern\\/datasets"))
  all_collections_df
}
Run this function on the list of html pages. This will return a list of dataframes. Merge them into one with rbindlist from data.table.
l = purrr::map(list_of_pages, make_df)
l = data.table::rbindlist(l)
l %>% knitr::kable('html')
Each dataset's files are listed across several pages in the repository, so create a paginated version of each url and expand the dataframe so that there is one row per page number (here 1 to 10):
l = l %>% mutate(pages = paste0(all_collections, "&page="))
sequence <- 1:10
# Expand the dataframe and concatenate with the sequence
expanded_df <- l %>%
  crossing(sequence) %>%
  mutate(pages = paste(pages, sequence, sep = ""))
Now we have a dataframe containing the urls for each of the titles in the collection. The second stage is to go to each of these urls and extract the relevant download links.
Write another function. This takes a url, extracts all the links, IDs and link titles within it, and turns them into a dataframe; the relevant download links are filtered out in the next step.
get_collection_links = function(c){
  tryCatch({
    collection = c %>% read_html()
    links = collection %>% html_nodes('a') %>% html_attr('href')
    id = collection %>% html_nodes('a') %>% html_attr('id')
    text = collection %>% html_nodes('a') %>% html_attr('title')
    tibble(links, id, text)
  }, error = function(e) {
    # If a page can't be read, return an empty dataframe rather than stopping
    tibble(links = character(), id = character(), text = character())
  })
}
Use pblapply (from the pbapply package, which adds a progress bar) to run this function on the column of urls from the previous step, and merge the results with rbindlist. Keep just the links whose title text contains BLNewspapers.
t = pbapply::pblapply(expanded_df$pages, get_collection_links)
names(t) = expanded_df$all_collections_titles
t_df = t %>%
  data.table::rbindlist(idcol = 'title')
t_df = t_df %>%
  filter(str_detect(text, "BLNewspapers"))
The new dataframe needs a bit of tidying up. To use the download.file() function in R we need to also specify the full filename and location where we'd like each file to be put. The 'text' column contains the information we need, but it requires some alterations.
Using regular expressions, extract the year and the NLP code from the text, and create a new 'filename' column containing just the name of each zip file.
Then prepend the repository domain to the links, and build a 'destination' column which adds newspapers/ to the beginning of the filename, so that the files can be downloaded into that folder.
t_df = t_df %>% distinct(text, .keep_all = TRUE) %>%
  mutate(year = str_extract(text, "(?<=_)[0-9]{4}(?=[_.])")) %>%
  mutate(nid = str_extract(text, "[0-9]{7}")) %>%
  mutate(filename = str_extract(text, '(?<=Download ")[^"]+'))
t_df = t_df %>% mutate(links = paste0("https://bl.iro.bl.uk", links)) %>%
  mutate(destination = paste0('newspapers/', filename))
The result is a dataframe which can be used to download either all or some of the files.
Filter the download links by date or title
You can now filter this dataframe to produce a list of titles and/or years you’re interested in. For example, if you just want all the newspapers for 1855:
files_of_interest = t_df %>% filter(as.numeric(year) == 1855)
files_of_interest %>% knitr::kable('html')
To download these we use the Map function, which will apply the function download.file to the vector of links, using the destination column we created as the file destination. download.file by default times out after 100 seconds, but these downloads will take much longer. Increase this using options(timeout=9999).
Before this step, you’ll need to create a new folder called ‘newspapers’, within the working directory of the R project.
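If you'd rather create the folder from R, here is a minimal sketch (run from the root of the R project, so that the folder sits in the working directory):
# Create the 'newspapers' folder if it doesn't already exist
dir.create("newspapers", showWarnings = FALSE)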
options(timeout=9999)
Map(function(u, d) download.file(u, d, mode = "wb"), files_of_interest$links, files_of_interest$destination)
Folder structure
Once these have downloaded, you can quickly unzip them using R. First it’s worth understanding a little about the folder structure you’ll see once they’re unzipped.
Each file will have a filename like this:
BLNewspapers_TheSun_0002194_1850.zip
This is made from the following pieces of information:
- BLNewspapers - this identifies the file as coming from the British Library.
- TheSun - this is the title of the newspaper, as found on the Library's catalogue.
- 0002194 - this is the NLP, a unique code given to each title. This code is also found on the Title-level list, in case you want to link the titles from the repository to that dataset.
- 1850 - the year.
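As an illustration, these pieces can be recovered from a filename with a little string manipulation; a minimal sketch using the tidyverse functions already loaded (the filename is just the example above):
example_file = "BLNewspapers_TheSun_0002194_1850.zip"
tibble(filename = example_file) %>%
  separate(filename, into = c("source", "title", "nlp", "year"), sep = "_") %>%
  mutate(year = str_remove(year, "\\.zip$"))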
Construct a Corpus
At this point, and for the rest of the tutorials in the book, you might want to construct a ‘corpus’ of newspapers, using whatever criteria you see fit. Perhaps you’re interested in a longitudinal study, and would like to download a small sample of years spread out over the century, or maybe you’d like to look at all the issues in a single newspaper, or perhaps all of a single year across a range of titles.
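Any of these corpora can be made by filtering t_df before downloading. A sketch (the title string here is purely illustrative and should match whatever appears in the 'title' column of your own dataframe):
# All files for a single newspaper title (illustrative match on the title column)
single_title = t_df %>% filter(str_detect(title, "Sun"))
# A small sample of years spread across the century, for all titles
sampled_years = t_df %>% filter(as.numeric(year) %in% c(1820, 1840, 1860, 1880))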
The tutorials will make most sense and produce similar results if your corpus is the same as above: all newspapers in the repository from the year 1855. You can also download a single .zip file with the extracted text from these titles here.
Bulk extract the files using unzip() and a for() loop
R can be used to unzip the files in bulk, which is particularly useful if you have downloaded a large number of files (for example on Windows, where unzipping many archives by hand is tedious). There are just two steps.
First, use list.files() to create a vector called zipfiles, containing the full file paths to all the zip files in the 'newspapers' folder you've just created.
zipfiles = list.files("newspapers", full.names = TRUE, pattern = "\\.zip$")
zipfiles
Now, use this in a loop with unzip().
Loops in R are very useful for automating simple tasks. The code below takes each file named in the 'zipfiles' vector and unzips it (purrr::map is used here in place of an explicit for() loop, but the effect is the same). It takes some time.
purrr::map(zipfiles, unzip)
Once this is done, you'll have a new folder (or several new folders) in the project directory (not the newspapers directory). These are named using the NLP, the numeric code for each title.
To tidy up, put these back into the newspapers folder.
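Alternatively, you can skip the tidy-up step by passing unzip()'s exdir argument through purrr::map, so that each archive is extracted straight into the newspapers folder; a minimal sketch, assuming the 'newspapers' folder from earlier:
# Extract each archive directly into the 'newspapers' folder rather than the working directory
purrr::map(zipfiles, unzip, exdir = "newspapers")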
These files contain the METS/ALTO .xml files with the newspaper text. If you have followed the above and downloaded all newspapers for the year 1855, you should have seven different titles and a few hundred newspaper issues. In the next chapter, you’ll extract this text from the .xml and save it in a more convenient format. These final files will form the basis for the following tutorials which process and analyse the text of the newspapers.

