8 Accessing Newspaper Data from the Shared Research Repository
Most of the rest of this book uses newspaper data from several newspaper digitisation projects connected to the British Library, as described in Chapter 2. These projects have made the ‘raw data’ for a number of titles freely available for anyone to use. This chapter explains the structure of the repository which holds them, and walks through a method for downloading titles in bulk.
Downloading Titles in Bulk
Acquiring a dataset on which to work can be cumbersome if you have to manually download each file. The first part of this tutorial will show you how to bulk download all or some of the available titles. This was heavily inspired by the method found here. If you want to bulk download titles using a more robust method (using the command line, so no need for R), then I really recommend checking out that repository.
To do this there are three basic steps:
- Make a list of all the links in the repository collection for each title.
- Go into each of those title pages, and get a list of all the links for each zip file download. Optionally, you can specify which titles you’d like to download, or even which years.
- Download all the relevant files to your machine.
Create a dataframe of all the newspaper files in the repository
First, load the libraries needed:

library(tidyverse)
library(XML)
library(xml2)
library(rvest)
Next, grab all the pages of the collection:
urls = paste0("https://bl.iro.bl.uk/collections/9a6a4cdd-2bfe-47bb-8c14-c0a5d100501f?locale=en&page=", 1:6)
Use lapply() to go through the vector of urls, reading each page into R with read_html(), and store each result as an item in a list:

list_of_pages <- lapply(urls, read_html)
Next, write a function that takes a single html page (as downloaded with read_html), extracts the links and newspaper titles, and puts it into a dataframe.
make_df = function(x){

  all_collections = x %>%
    html_nodes(".search-result-title") %>%
    html_nodes('a') %>%
    html_attr('href') %>%
    paste0("https://bl.iro.bl.uk", .)

  all_collections_titles = x %>%
    html_nodes(".search-result-title") %>%
    html_text()

  all_collections_df = tibble(all_collections, all_collections_titles) %>%
    filter(str_detect(all_collections, "concern\\/datasets"))

  all_collections_df

}
Run this function on the list of html pages. This will return a list of dataframes. Merge them into one with rbindlist() from data.table.
l = purrr::map(list_of_pages, make_df)

l = data.table::rbindlist(l)

l %>% knitr::kable('html')
Each dataset’s page in the repository is itself split across several pages of file listings, so add ‘&page=’ to each url and expand the dataframe to give one row per page (assuming here a maximum of ten pages per dataset):

l = l %>% mutate(pages = paste0(all_collections, "&page="))

sequence <- 1:10

# Expand the dataframe and concatenate with the sequence
expanded_df <- l %>%
  crossing(sequence) %>%
  mutate(pages = paste(pages, sequence, sep = ""))
Now we have a dataframe containing the urls for every title in the collection, with one row per page of each title’s file listing. The second stage is to go to each of these urls and extract the relevant download links.
Write another function. This takes a url, reads the page, and extracts every link within it, along with each link’s ID and title attribute, into a dataframe. The relevant download links have the ID ‘file_download’; we’ll filter down to them in the next step using their titles.
get_collection_links = function(c){

  links_df = tryCatch({

    collection = c %>% read_html()

    links = collection %>% html_nodes('a') %>% html_attr('href')

    id = collection %>% html_nodes('a') %>% html_attr('id')

    text = collection %>% html_nodes('a') %>% html_attr('title')

    tibble(links, id, text)

  },
  # Action to perform when an error occurs:
  # return NULL so the page is simply skipped when the results are merged
  error = function(e) {
    NULL
  })

  return(links_df)
}
Use pblapply() from the pbapply package (a version of lapply() with a progress bar) to run this function on the column of urls from the previous step, and merge the results with rbindlist(). Keep just the links whose title contains the text ‘BLNewspapers’.
t = pbapply::pblapply(expanded_df$pages, get_collection_links)

names(t) = expanded_df$all_collections_titles

t_df = t %>%
  data.table::rbindlist(idcol = 'title')

t_df = t_df %>%
  filter(str_detect(text, "BLNewspapers"))
The new dataframe needs a bit of tidying up. To use the download.file() function in R we need to specify not just the link but also the full filename and location where we’d like each file to be put. The ‘text’ column contains the information we need, but it requires some alterations. First, remove any duplicate rows. Next, extract the year and the seven-digit NLP code from the text, and create a new ‘filename’ column containing just the name of the zip file, without the surrounding ‘Download’ text. Finally, turn the links into full urls and build a ‘destination’ column by adding the path to the ‘newspapers’ folder to the beginning of each filename, so that the files will be downloaded into that folder (change this path to suit your own machine).
t_df = t_df %>% distinct(text, .keep_all = TRUE) %>%
  mutate(year = str_extract(text, "(?<=_)[0-9]{4}(?=[_.])")) %>%
  mutate(nid = str_extract(text, "[0-9]{7}")) %>%
  mutate(filename = str_extract(text, '(?<=Download ")[^"]+'))

t_df = t_df %>% mutate(links = paste0("https://bl.iro.bl.uk", links)) %>%
  mutate(destination = paste0('/Users/Yann/Documents/non-Github/r-newspaper-quarto/newspapers/', filename))
The result is a dataframe which can be used to download either all or some of the files.
Filter the download links by date or title
You can now filter this dataframe to produce a list of titles and/or years you’re interested in. For example, if you just want all the newspapers for 1855:
files_of_interest = t_df %>% filter(as.numeric(year) == 1855)

files_of_interest %>% knitr::kable('html')
To download these we use the Map() function, which will apply the function download.file() to the vector of links, using the ‘destination’ column we created as the file destination. R’s download timeout is 60 seconds by default, but these downloads will take much longer, so increase the limit using options(timeout=9999).
Before this step, you’ll need to create a new folder called ‘newspapers’, within the working directory of the R project.
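If you prefer, the folder can be created from R. This is a minimal sketch, assuming your working directory is the root of the R project:

# Create the 'newspapers' folder in the working directory, if it doesn't already exist
if (!dir.exists("newspapers")) dir.create("newspapers")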
options(timeout=9999)
Map(function(u, d) download.file(u, d, mode="wb"), files_of_interest$links, files_of_interest$destination)
Folder structure
Once these have downloaded, you can quickly unzip them using R. First it’s worth understanding a little about the folder structure you’ll see once they’re unzipped.
Each file will have a filename like this:
BLNewspapers_TheSun_0002194_1850.zip
This is made from the following pieces of information:
BLNewspapers - this identifies the file as coming from the British Library
TheSun - this is the title of the newspaper, as found on the Library’s catalogue.
0002194 - This is the NLP, a unique code given to each title. This code is also found on the Title-level list, in case you want to link the titles from the repository to that dataset.
1850 - The year.
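As a quick illustration of this structure, a filename can be split back into these parts with the tidyverse functions loaded earlier. This is just a sketch, and the column names are illustrative:

# Split the example filename above into its four component parts
tibble(file = "BLNewspapers_TheSun_0002194_1850.zip") %>%
  separate(file, into = c("source", "title", "nlp", "year"), sep = "_") %>%
  mutate(year = str_remove(year, "\\.zip$"))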
Construct a Corpus
At this point, and for the rest of the tutorials in the book, you might want to construct a ‘corpus’ of newspapers, using whatever criteria you see fit. Perhaps you’re interested in a longitudinal study, and would like to download a small sample of years spread out over the century, or maybe you’d like to look at all the issues in a single newspaper, or perhaps all of a single year across a range of titles.
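For example, here are two possible filters on the t_df dataframe built earlier. The years and the title string are purely illustrative, so adapt them to your own interests:

# A longitudinal sample: every title for a handful of years spread across the century
longitudinal_sample = t_df %>%
  filter(as.numeric(year) %in% c(1805, 1830, 1855, 1880))

# A single newspaper across all available years, matched on its filename
single_title = t_df %>%
  filter(str_detect(filename, "TheSun"))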
The tutorials will make most sense and produce similar results if your corpus is the same as above: all newspapers in the repository from the year 1855. You can also download a single .zip file with the extracted text from these titles here.
Bulk extract the files using unzip() and purrr::map()
R can be used to unzip the files in bulk, which is particularly useful if you’re using Windows and have a large number of files to unzip. There are just two steps.
First, use list.files() to create a vector, called zipfiles, containing the full file paths to all the zip files in the ‘newspapers’ folder you’ve just created (adjust the path below to wherever your zip files are stored):
zipfiles = list.files("/Volumes/T7/zipfiles/", full.names = TRUE)

zipfiles
Now, apply unzip() to each of these files with purrr::map(). Iterating over a vector like this is very useful for automating simple tasks. The line below takes each file named in the ‘zipfiles’ vector and unzips it into the project directory; it takes some time.
purrr::map(zipfiles, unzip)
Once this is done, you’ll have a new folder (or several new folders) in the project directory (not the newspapers directory). These are named using a numeric code, called the ‘NLP’ (for example 0002194).
To tidy up, move these folders into the ‘newspapers’ folder.
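One way to do this from R is sketched below. It assumes the unzipped NLP folders are the only all-digit folders in the project root; check this (or move them by hand) if you’re unsure.

# Find the unzipped NLP folders (named with an all-digit code) in the project root
nlp_folders = list.files(".", pattern = "^[0-9]+$")

# Move each one into the 'newspapers' folder
file.rename(nlp_folders, file.path("newspapers", nlp_folders))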
These files contain the METS/ALTO .xml files with the newspaper text. If you have followed the above and downloaded all newspapers for the year 1855, you should have seven different titles and a few hundred newspaper issues. In the next chapter, you’ll extract this text from the .xml and save it in a more convenient format. These final files will form the basis for the following tutorials which process and analyse the text of the newspapers.