17 Home exercises

Instructions

The at-home exercises should be completed using Posit Cloud. Open the new assignment ‘Home Exercise 3’. Create a new .Rmd file (use File -> New File -> R Markdown).

I advise saving the new file straight away. Because you’ll submit this file as an assignment, use a standardised name: lastname_firstname3. Regularly re-save the file.

When you have finished the exercise (or part of it), ‘knit’ the file. Export the .Rmd and the .html files together as a single .zip file, and upload this to the assignment area.

If you can’t knit the file because of an error, either comment out the code which doesn’t work using #, or just submit the .Rmd file on its own.

At-home exercises

You can find the Goodreads dataset as a csv in the project folder. This is a dataset of the 11,000 most popular books taken from the review site Goodreads.com. It includes author information, plus the number of ratings and the average rating for each book.

There is also a genre column which comes from the tags users have added. Note that originally, the dataset had multiple genres for each book. To simplify things for this class, I have only kept the first one.

Use summarise(), group_by(), and the functions you have learned previously to do the following:

Load the tidyverse library
Read the dataset Goodreads_books_with_genres.csv into your environment, and save it as an object with the name goodreads_df. (See the previous week’s home exercises for how to do this and the previous step.)
Filter the dataset to include only books with the genre Science Fiction (note the capital letters) which received a rating of over 4.5.
Filter the dataset to keep books that meet at least one of these conditions: not written in English, or over 500 pages.
Create a new column, and calculate the ratio of ratings to text reviews for each book. What kind of books have a large number of ratings but relatively few text reviews? Why do you think this is the case?
Filter to include only books published in 2004. Arrange this list in order of publication date first, and then publisher. Keep only the title, author, and publication date. There are two ways of filtering by year:
- 1, Use the str_detect() function described in the optional section in the book to filter to include entries in the publication_date column which end in the text 2004.
- 2, Extract the year from the publication_date column. The cleanest way to do this is to first convert it to a date format the computer can read, using a function called mdy() (from the lubridate package). You can overwrite the existing publication_date column using mutate(publication_date = mdy(publication_date)). Next, extract the year from this date using a function called year(). Create a new column called year using mutate(), applying year() to the publication_date column. Now you can filter this new column to retain only rows where the value is 2004.
More of a challenge: Which author has the highest average rating? Before you calculate this, we need to make the result more meaningful by wrangling the dataset a bit, so it isn’t skewed by authors with one book and a tiny number of 5-star ratings. First, keep only books with at least 100 ratings, and then keep only authors with at least 3 such books. Calculate the average (or mean) rating for each author. Arrange this list in descending order of the average rating.
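As a hint for the second approach to the 2004 exercise, the mdy() and year() steps can be sketched on a small made-up table. This is only an illustration: toy_df, its titles, and its dates are invented here and are not taken from the Goodreads file, and the year filtered on is deliberately a different one. It assumes the dates are text in month/day/year order, which is what mdy() expects.

```r
library(tidyverse)
library(lubridate)  # already attached by recent tidyverse versions; harmless to load again

# Toy data, invented for illustration only
toy_df <- tibble(
  title = c("Book A", "Book B", "Book C"),
  publication_date = c("9/16/2006", "11/1/2003", "3/5/2006")
)

toy_2006 <- toy_df %>%
  mutate(publication_date = mdy(publication_date)) %>%  # text -> Date
  mutate(year = year(publication_date)) %>%             # Date -> numeric year
  filter(year == 2006) %>%                              # keep one year only
  arrange(publication_date)                             # earliest first

toy_2006
```

Because the column is converted to a real date before sorting, arrange() orders the rows chronologically rather than alphabetically by the original text.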

Filter combined with str_detect (optional)

Find all the books written in English (as above, but using str_detect()).
Find all the Swedish authors (names that end in sson).
Find all the books where Isaac Asimov is listed as an author, even where other authors are listed at the same time.
Find all books published by a Penguin imprint (classics, UK, etc.)
Find all books with numbers in the title.
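If it helps to get started, here is a minimal sketch of how str_detect() behaves with a few common pattern types, using a made-up vector of titles (toy_titles is invented for illustration and is not part of the dataset). The patterns are intentionally different from the ones the exercises need.

```r
library(tidyverse)

# Toy titles, invented for illustration only
toy_titles <- c("Dune", "The Dunes of Time", "Sand dune ecology", "Dunwich")

# Plain text matches anywhere in the string and is case-sensitive
anywhere <- str_detect(toy_titles, "Dune")

# ^ anchors the pattern to the start of the string
at_start <- str_detect(toy_titles, "^Dune")

# $ anchors the pattern to the end of the string
at_end <- str_detect(toy_titles, "e$")

# regex(..., ignore_case = TRUE) matches regardless of capitalisation
any_case <- str_detect(toy_titles, regex("dune", ignore_case = TRUE))

# Inside filter(), str_detect() keeps the rows where the result is TRUE
toy_df <- tibble(title = toy_titles)
starts_with_dune <- toy_df %>%
  filter(str_detect(title, "^Dune"))

starts_with_dune
```

The same structure, filter() wrapped around str_detect() on a column, applies to each of the exercises above; only the pattern changes.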