20 Data challenge
For this data challenge, I’ll ask you to work with a real-world cultural dataset, which is in need of some data cleaning.
You’re asked to carry out the whole process, from start to finish, including importing the raw data file and exporting the result.
The dataset is a record of performances and attendances at performing arts venues in the Netherlands, from 1999 to 2023. It’s a good example of the kinds of things you’ll often need to do to datasets in order to make them useable for visualisation and analysis. You can find the dataset here. click on Original Dataset on the left hand side, and then click on Download CSV.
Do the following in the Posit cloud workspace.
Upload the dataset from your computer to Posit cloud.
Import the dataset to your environment with a suitable name.
Take a look at the dataset.
It has a large number of columns, and one row per year and per region. The other columns contain various information about performance numbers and attendance numbers. The attendance numbers are in units of 1000s of people.
The Periods column has some extra text after the year.
The regions contains all the various regional divisions, meaning it has rows for the Netherlands for a whole, one for larger regions, and one for provinces.
The basic cleaning steps are to:
Filter the dataset to include only the PV level regional codes.
Adjust the Periods column so it contains only the year, as a number. This should be recognised as a number by R.
Remove the column NumberOfOrganisation_1 and TotalAttendance_13 columns, as these are not needed.
To make this dataset tidy, we need to do a number of things:
Make two separate tables, one for attendance and one for performance, because these are two separate things. You’ll need to assign them names so they are saved in the environment.
Use pivot_longer with each dataset so that you have a performance_type or attendance_type column, containing the type, with the value in a separate column.
Some final steps before exporting:
Multiply the attendance values by 1000 to get the correct numbers.
Finally, join the region_codes dataset (available in the workspace) to your datasets, so that the full region name is added.
Export the data as a csv using the function
write_csv().
Finally, use the data…
- Get the average attendance for each performance type for each region.
- Calculate the total number of performances per region.