Appendix B — Data and Inspiration for the Final Project

On this page you can find links to humanities datasets, as well as tips for finding your own.

I’ll update it as the course goes on, to include information relevant to each of the last three graded assignments: making a visualisation, making a map, and the final project.

Importing a dataset and dataset standards

For these assignments, you’ll need to begin by importing your dataset as an object into the RStudio/Posit environment. How you do this depends a little on the format of your dataset. As we’re working with tabular data on this course, your data should in most cases be in one of the following formats (see the code examples below):

  • Comma-separated values (files ending in .csv). A very common tabular data format. Use read_csv() from the Tidyverse package readr (remember to load the package first) to read the file into Posit.

  • Tab-separated values (files ending in .tsv). Another common tabular format. Use read_tsv() from the Tidyverse package readr to read the file.

  • Excel files (files ending in .xlsx or .xls). A Microsoft proprietary format, which can contain multiple sheets. Use read_excel() from the Tidyverse package readxl to read the file.

  • Google Sheets. You can read Google Sheets directly from the internet using the package googlesheets4.

You can read data either from your local machine or, in some cases, directly from the internet.

  • If you’re loading from your local machine, remember to upload the file to Posit Cloud first, if you’re using it.

  • If you know the online path to a file in one of the formats above, you can simply use the full path as the filename within quotation marks, e.g. read_csv('https://raw.githubusercontent.com/yann-ryan/nobel_data/main/laureates_df.csv'). Note that for files hosted on GitHub, you need the link to the ‘raw’ file rather than to the regular repository page.
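
To give a concrete reference, here is roughly what each of these looks like in practice. This is a minimal sketch: the local filenames are placeholders for your own data, and the Google Sheets URL is truncated.

    library(readr)         # read_csv(), read_tsv()
    library(readxl)        # read_excel()
    library(googlesheets4) # read_sheet()

    # From a local file (upload it to Posit Cloud first if you're working there)
    my_data <- read_csv("my_dataset.csv")
    my_data <- read_tsv("my_dataset.tsv")
    my_data <- read_excel("my_dataset.xlsx", sheet = 1)

    # Directly from the internet, using the full path as the filename
    laureates <- read_csv("https://raw.githubusercontent.com/yann-ryan/nobel_data/main/laureates_df.csv")

    # A public Google Sheet; gs4_deauth() skips the login step for public sheets
    gs4_deauth()
    sheet_data <- read_sheet("https://docs.google.com/spreadsheets/d/...")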

Datasets for making a basic visualisation (the first assignment, or if you want to stick with something simple for the final project)

The visualisation assignment asks you to demonstrate your skills in making a visualisation using ggplot2. I highly recommend using one of the datasets we have worked with on the course, which are relatively clean and simple by humanities data standards. Data cleaning and figuring out formats can be very time consuming, so unless you are very comfortable with these things, I would wait until the final project to use a bespoke dataset.
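
Whichever dataset you choose, the basic ggplot2 pattern is the same. A minimal sketch, using ggplot2’s built-in mpg data rather than a course dataset:

    library(ggplot2)

    # Map variables to aesthetics, add a geom, then label the plot
    ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
      geom_point() +
      labs(title = "Engine size vs. fuel efficiency",
           x = "Engine displacement (litres)",
           y = "Highway miles per gallon")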

Warning

If you do want to use your own dataset, be aware that your account on Posit Cloud has a limit of 1GB of RAM. This means that you will not be able to load larger datasets into memory, and you’ll get an error. In this case, it’s best to install R and RStudio on your own machine, which likely has more RAM.

Whatever dataset you use, a really important task is to read the available documentation, so that you understand, as well as possible, how it was made and who was responsible for it.

Datasets used on the course

  • The Titanic dataset. There are many versions of this on the internet, because it is widely used in machine learning problems. I created a version based on this dataset, because it contains the full passenger list and hasn’t been divided into separate training and testing parts for machine learning algorithms. There is a little information on how the dataset was constructed, but unfortunately the link to further information is no longer working. My version is available here.

  • The Bellevue Almshouse data. This is a dataset of records of immigrants who were admitted to the Bellevue Almshouse (house for the poor) in the 1840s. You can find more information and links to the data here.

  • The Movie Dialogue dataset. You can find more information about this dataset here. I created the third file, imdb_genre, from a larger IMDB dataset. You can find that file here.

  • The Nobel Prize Winners dataset. A dataset containing two tables, with information on the Nobel Prizes and the Nobel Prize winners. I created this using the official Nobel API. You can find the files here.

Other (relatively simple) datasets

If you really want to use another dataset, choose something clean and manageable for a short assignment. It should also be, broadly speaking, related to the humanities, meaning you should avoid datasets which are primarily natural sciences, geography, and so on. Before you commit to one, check that there is at least some documentation on where it came from and how it was created. This is important in general (data should follow the FAIR principles), but also more specifically because you’ll need it to write a good report.

  • There are a large number of datasets from the ‘TidyTuesday’ weekly challenge. Behind this link you’ll find a collection of folders, one for each year, and within each of those a folder for each week; scroll past these folders and you’ll see useful links to the data and sources. Not all of the datasets are relevant to the humanities. They are fairly well documented and include sources, but do check that this information is there before you commit to using one (see the loading example after this list).
  • The Pudding has some nice and often fun datasets, though they have a very American focus.
  • I’ve used several of the datasets from this Introduction to Cultural Analytics book by Melanie Walsh. You can find some more datasets on this page. Not all are tabular data; some are primarily text-based, which is outside the scope of this course (you’re welcome to use them to produce visualisations, but you’ll likely need some text-mining skills which we won’t cover).
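
If you do pick a TidyTuesday dataset, the tidytuesdayR package is probably the easiest way to load one. A minimal sketch; the date and table name here are just one example week:

    # install.packages("tidytuesdayR")
    library(tidytuesdayR)

    # Load all the data for a given week by its date
    tuesdata <- tt_load("2021-04-20")
    netflix <- tuesdata$netflix_titles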

Datasets for the final project

At the moment this is a somewhat random assortment of datasets and data sources related to the humanities! I’ve deliberately avoided listing text-based datasets, of which there are many, but which require different techniques to those we have used on the course.

Individual datasets

Data repositories or lists

There are many, many sources of datasets available online. Look for relatively simple csv and tsv files or sets of files, and avoid those that mention APIs or Linked Open Data unless you are comfortable working with these things.

  • CBS Statline: Dutch national statistics. There is also an R package, available here.

  • DANS Humanities and Social Sciences: a repository for many datasets; be aware that they may be very specific and not always easy to use.

  • KB Labs datasets (many are text based but some are structured).

  • Google Dataset Search. However, make sure you can trace back the provenance of anything you find there!

  • Journal of Open Humanities Data. A journal with open data and articles outlining possible use cases.

  • IMDB datasets. Large-scale datasets from the IMDB website. You’ll probably need to install and run R on your local machine, because reading and analysing these will require more RAM than is available in Posit Cloud.

  • Book catalogue metadata is a nice source of structured humanities data. The Hathi Trust metadata files contain information (not content) on all the works digitised by the Hathi Trust. Look for the most recent file starting with hathi_full. These are also large files and will require you to install and run RStudio on your own machine.

  • The ‘DH toychest’ datasets. In particular, follow the link and scroll down to the section ‘Datasets’.

  • List of Museum APIs. Also lists some datasets in regular tabular formats like tsv and csv.

Tip

For very large datasets, it may also be helpful to use an R package such as data.table, which is more efficient than the Tidyverse for working with very large data. data.table has a very different syntax to the Tidyverse, but there is a package, tidytable, which lets you use Tidyverse-style syntax with data.table’s speed.
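
As a rough sketch of what that looks like, assuming one of the unzipped IMDB files mentioned above:

    library(data.table)
    library(dplyr)

    # fread() reads large delimited files much faster than read_csv()/read_tsv()
    imdb <- fread("title.basics.tsv", sep = "\t", quote = "")

    # Convert to a tibble if you prefer to continue with Tidyverse functions
    imdb_tbl <- as_tibble(imdb)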

R Packages

  • HistData

  • refugees (with guidelines here)

  • A huge list here, but again, try to use only those with good documentation.

R Packages which connect to an API

  • spotifyr allows you to download data from Spotify, either public data or your own listening data. You need to register for an API key; follow the instructions behind the link.

  • RedditExtractoR. I haven’t tried it, but it apparently allows you to download data from Reddit. Use at your own risk!

  • The osmdata package. Download geographic data from OpenStreetMap through R. Very fun to work with, though it takes some understanding of the queries. Try a tutorial here if you’re interested, and see the sketch below.
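
To give a flavour of how osmdata queries work, a minimal sketch (the city and feature here are just examples; the results come back as sf spatial objects):

    library(osmdata)

    # Build a query for a named area, ask for a feature, then download as sf
    cafes <- opq("Helsinki, Finland") |>
      add_osm_feature(key = "amenity", value = "cafe") |>
      osmdata_sf()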

Social media data

Unfortunately, doing analysis on social media data has become difficult in the last few years, as companies have locked down their services. There are some individual datasets of collected tweets which you could use, e.g. here. There is an additional step needed with tweets: the collected data consists only of what are known as tweet IDs, and these must be ‘hydrated’ using software such as Hydrator before you can actually read and analyse them.

Inspiration for the final project