3 Accessing Newspaper Data Internationally

Many countries have digitised and published parts of their national newspaper collections. In most cases, the newspapers are made available through interfaces designed for search and browsing. Access to the underlying data needed for the kinds of methods used in this book varies greatly across national collections. Some countries, such as Australia and the U.S., have digitised large collections of newspapers and made them freely available through an API, which means they can be easily downloaded or incorporated into third-party applications or resources. Others, such as the UK, are making some data available. Many others make no provision beyond keyword searching, or offer only images without OCR.

This chapter is a work-in-progress, and it attempts to survey the existing data provisions made for national newspaper collections. It is not meant as a comprehensive guide to the international digitised newspaper landscape. For a more detailed description of the format, availability, and structure of some key national collections, see the Atlas of Digitised Newspapers.

In a few cases, where title lists have been made available, I have included interactive maps intended as a fun way of seeing at a glance what is included in the collection. More will be added if the correct metadata can be found.

United States

The Library of Congress in the U.S. sponsored a project called ‘Chronicling America’, which has created a newspaper dataset and interface that currently holds about 16 million pages. All the titles are freely available through the website without a paywall. To access the data itself, Chronicling America has an API, with instructions here. As with the UK titles, Chronicling America newspapers use the METS/ALTO format. However, the files may be structured slightly differently, so you may need to adjust the method by which you extract the text from the .xml files.
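Because ALTO files from different collections can differ in namespace and nesting, one way to keep text extraction portable is to ignore namespaces and look only for the TextLine and String elements that the ALTO standard defines. The following is a minimal sketch in Python, assuming each word is stored in the CONTENT attribute of a String element. The function names and the sample XML are made up for illustration and are not taken from any real collection.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element's tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Join the words of each ALTO TextLine into one line of plain text."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is assumed to sit in the CONTENT attribute of a String.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element.iter()
            if local_name(child.tag) == "String"
        ]
        words = [word for word in words if word]
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)


# Invented sample, only to show the shape of an ALTO file.
sample_alto = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Daily"/><SP/><String CONTENT="News"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

page_text = alto_to_text(sample_alto)
print(page_text)
```

This sketch does not rejoin words hyphenated across lines, and it ignores layout information such as coordinates and text blocks, so treat it as a starting point rather than a complete parser.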

You can also download all the OCR results directly for each title on this page. Each title's download contains a folder for each issue, and within each of these can be found a single file for the OCR results (ocr.xml) and a single file for the plain text (ocr.txt).
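Once a title has been downloaded and unpacked, the plain-text files can be gathered with a short script. The sketch below assumes only what is described above: a folder for the title, containing nested folders that each hold an ocr.txt file. The function name is hypothetical, and the exact folder names in a real download may differ.

```python
from pathlib import Path


def read_issue_texts(title_dir):
    """Map each folder (relative to title_dir) to the contents of its ocr.txt."""
    title_dir = Path(title_dir)
    texts = {}
    # rglob searches every level of subfolder, so the nesting depth is not assumed.
    for txt_path in sorted(title_dir.rglob("ocr.txt")):
        folder = txt_path.parent.relative_to(title_dir).as_posix()
        texts[folder] = txt_path.read_text(encoding="utf-8", errors="replace")
    return texts
```

Calling read_issue_texts on the unpacked folder returns a dictionary keyed by folder path, which can then be turned into a data frame or written out as a single file.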

Chronicling America publish a simple tab-separated-values list of the titles currently in the database here: https://chroniclingamerica.loc.gov/newspapers.txt
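As a rough illustration of working with this list, the Python sketch below counts titles per state from the text of a delimited file. The column name "State", the default tab delimiter, and the sample rows are assumptions made for the example; check the header row of the downloaded file and adjust the delimiter and state_column arguments to match.

```python
import csv
import io
from collections import Counter


def count_titles_by_state(text, delimiter="\t", state_column="State"):
    """Count rows per state in a delimited title list with a header row."""
    # QUOTE_NONE stops quotation marks inside newspaper titles confusing the parser.
    rows = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    counts = Counter()
    header = next(rows, None)
    if header is None:
        return counts
    # Strip padding so "State " and "State" are treated the same.
    column_index = [name.strip() for name in header].index(state_column)
    for row in rows:
        if len(row) > column_index and row[column_index].strip():
            counts[row[column_index].strip()] += 1
    return counts


# Invented rows, standing in for the downloaded file's contents.
sample = (
    "State\tTitle\n"
    "Alabama\tThe Example Gazette\n"
    "Alabama\tThe Sample Herald\n"
    "Ohio\tThe Illustrative Times\n"
)
state_counts = count_titles_by_state(sample)
print(state_counts.most_common())
```

Counts like these are the kind of summary that can be joined to state boundaries to draw a map.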

We can easily use this information to produce a State-level map of the newspaper titles:

Interactive map of Library of Congress Newspapers

American Stories dataset

Because the Library of Congress has made the image scans freely available, there are opportunities for further enhancing the data. An interesting alternative source of newspaper text data is the American Stories Dataset, released in September 2023. The authors took the original page scans and re-processed them to detect articles and to improve the OCR. It’s available as a dataset through HuggingFace, an online hub for sharing machine learning data and models. To download the data, you’ll need to use some basic Python, including installing the HuggingFace package, but it may be a valuable source for anyone wanting to work with high-quality newspaper data from the U.S.

Australia

Trove is a centralised data store of cultural heritage items from the National Library of Australia and other partners. Newspapers published between 1803 and 1955 are freely available through Trove. As well as browsing and searching, Trove has an API which allows you to download newspaper text and image data, and associated metadata. See the documentation for more details. Tim Sherratt publishes a large number of guides to using Trove, including live code tutorials.

The Netherlands

Historical newspapers in the Netherlands are available through a resource and interface called Delpher, maintained by the Koninklijke Bibliotheek, the Dutch Royal (i.e. National) Library. As well as browsing and searching, users can download in bulk all newspapers published between 1618 and 1879. A full list of the available titles can be found here.

Newspaper data is available in a series of .zip files, and is published in METS/ALTO format. Page scans are not included, but can be retrieved manually from the web using a unique identifier and a URL. A full description of the data format and structure is available on the Delpher website. There is also an API available: users need to apply directly to Delpher for access. See here.

Finland

Many of Finland’s newspapers have been digitised, and are available through a standard web interface. All the OCR results (not images) of newspapers published between 1771 and 1874 are available as a single bulk download through the Language Bank of Finland. The file is 13GB and the newspaper OCR results are in METS/ALTO format.

Luxembourg

Luxembourg have digitised about 800,000 pages of newspapers and made them available through the National Library’s Open Data service. The data is presented in a number of different ‘packs’, each made with different user needs in mind. The data is in METS/ALTO format. The website contains extensive documentation on the format used, and a tool for processing the files can be found on the organisation’s GitHub page.

Helpfully, they also make available a ‘text analysis pack’ where the plain text has been extracted from the METS/ALTO and made available either in a series of simplified .xml files, or as a .json file with one line per article.
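A file with one JSON record per line can be read one article at a time, without loading the whole file into memory. The sketch below assumes only that each non-empty line is a complete JSON object. The field names "id" and "text" in the sample are hypothetical, so consult the pack's documentation for the real ones.

```python
import json


def iter_articles(lines):
    """Yield one parsed record for each non-empty line of JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)


# Invented records, only to show the one-article-per-line layout.
sample_lines = [
    '{"id": "a1", "text": "First article."}',
    "",
    '{"id": "a2", "text": "Second article."}',
]
articles = list(iter_articles(sample_lines))
print(len(articles), articles[0]["text"])
```

Because an open file is itself a sequence of lines, the same function works if you pass it a file opened with open(path, encoding="utf-8").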

Pan-European Collections

A number of European projects have worked to make newspapers from multiple countries available through a single repository and interface. Europeana Newspapers makes available, for browsing and keyword searching, about 20 million pages of newspapers from 18 partner libraries. You can view the final report, including links to tools and further information, here.

Another project worth mentioning is Impresso, a database and interface combining newspapers from multiple European countries. The data is not all freely available, but the interface allows for a number of text analysis tasks (such as topic modelling and text reuse) to be carried out on a large corpus, once a non-disclosure agreement has been signed. Users can also create a search query and export the resulting articles as a dataset. A new version is currently in the pipeline.