2 Installing R and Loading Datasets

2.1 Installing R, RStudio, and Tidyverse

2.1.1 Step 1: Install R

In the first half of the course, we will use the statistical programming language R to construct our visualizations and gain familiarity manipulating data. R is very powerful, free open-source software.

Download the latest version of R (4.0.3 or later) from R-Project.org.

R is capable of handling massive datasets, querying databases in SQL, creating beautiful visualizations, running complex statistical analysis, and building state-of-the-art machine learning models. You can also easily switch between R and other software packages (like Python) in the same script.

2.1.2 Step 2: Install RStudio

You can interact with R via the command line, but the RStudio IDE (Integrated Developer Environment) offers a much better user experience. Download the latest version of RStudio from RStudio.com.

Once you have R and RStudio installed, open RStudio and it should look something like the figure below. We will only interact with R through RStudio in this course.There are excellent resources online if you run into trouble; for example, search YouTube for “RStudio Installation.”

2.1.3 Step 2: Install the tidyverse (and ggPlot Visualization Package)

We will use the ggplot package developed by Hadley Wickham. We will also use a number of other commands for “data wrangling” (manipulating and working with datasets). Conveniently, these packages are all offered together in a “meta-package” called the tidyverse, which refers to a group of packages that handle data in a “tidy” way.

Type install.packages("tidyverse") in the Console in RStudio to install the tidyverse. RStudio will download the package from CRAN and install it for you.

The tidyverse packages are now installed on your computer (which you only have to do once) but you will have to load them every time you open RStudio.

Load the tidyverse with the command library(tidyverse).

library(tidyverse)

Then, check that it’s working by checking the help file for one of the tidyverse packages like ggplot. Type help("ggplot") and the help screen should pop up on the lower right hand tab. You can also search for help directly in this window.

2.2 Load the Dataset for This Course

We will first be using an excerpt of a set of real data on hotel stays from two hotels; later we’ll use a set of purchase card transactions from government employees in the City of San Jose, CA. Let’s look at the hotel dataset first.

The dataset is available as a “delimited text file” of “comma-separated values” or .csv which your computer will probably want to open into Microsoft Excel or Apple Numbers. You can take a look at it, but some datasets in the real world are far too large to open in Excel. R can handle them though!

For the assignments in this class (and in general) it is best to write all your code as a script file so that you can save and edit your work (rather than just typing it directly into the console). The script is typed in the window above the Console. Use the Run Button to run specific lines or the entire script.

2.2.1 Download the Dataset

First, download the dataset. It’s a best practice to save everything for a project inside the same folder. Create a folder for the class, and keep everything there.

2.2.2 Set Working Directory

Next, set your “Working Directory” to that folder you set up and where you’ve saved the dataset to. The working directory is where RStudio will look for files and save exports. You can set your working directory by clicking on the “Session” tab on the menu bar.Alternatively, you can use the command setwd() to manually tell RStudio where to look. I usually include the one line of code to set my working directory in every R script that I write.

# Set Working Directory
setwd("/Volumes/GoogleDrive/My Drive/6. Teaching/IDSC 4210 - Data Visualization/IDSC 4210 Course Notes")

2.2.3 Load Dataset into R

We’re going to use read_csv() command from the readr package which is included in the tidyverse. We want to save the dataset in R to a dataframe called bookingdata.

### Load dataset
# Load tidyverse (Since we need readr and ggplot2)
library(tidyverse)

# Load hotel_bookings.csv to a dataframe called bookingdata
bookingdata <- read_csv("hotel_bookings.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   hotel = col_character(),
##   arrival_date_month = col_character(),
##   meal = col_character(),
##   country = col_character(),
##   market_segment = col_character(),
##   reserved_room_type = col_character(),
##   assigned_room_type = col_character(),
##   deposit_type = col_character(),
##   agent = col_character(),
##   reservation_status = col_character()
## )
## See spec(...) for full column specifications.

R tells us that it loaded all of the columns with some “specifications,” ie, what type of data is in each column.

And if we look in the top right hand panel, we now have a dataframe called bookingdata with 119,390 observations of 24 variables.

2.2.4 Examine the Booking Dataset

Whenever you get a new dataset or pull a SQL query, you should always look at the data to make sure it’s what you’re expecting.

In R, we can either click on the object over in the Environment pane, or you can use the command View(bookingdata)

We should refer to the Data Dictionary File, which is also provided on Canvas, to know what each variable is. In this dataset, some key variables are:

hotel: Hotel Location, either “City Hotel” or “Resort Hotel”

is_canceled: A 0/1 indicator for whether the booking was canceled or not

arrival_date_year: The year of the planned arrival date

adr: The “Average Daily Rate” or average price per day for the guest’s stay

Always make sure you know how to describe your dataset, including what each row represents and what data is included. Our hotel dataset can be described like this:

“This dataset has information from about 100K hotel bookings. Each row represents one booking at a hotel, and includes information on the stay (date, price, hotel) and the guest (number of adults, country of origin).”

2.2.5 Observations, Variables and Values

There are three key terms that have specific meanings when talking about data analysis.

  • An observation is a set of measurements usually made at the same time. For a “tidy” or “long” dataset, each observation is on its own row. You may hear observations called rows or records; these terms are often interchangeable (but not always…it depends on the data structure!).

  • A dimension or variable is something that you can measure (e.g. height, income, destination, win/loss)

  • A value is the result or state of a measurement for a variable (e.g. height = 2 meters, win/loss = WIN)

Always make sure you know what the observations are in your datasets, where data from variables comes from (or how it’s calculated), and possible values for every variable.

2.2.6 Initial Exploration

Next, Let’s look at our data in more detail. The summary command does that for us nicely, helping us learn what we’re dealing with. (The describe command in the psych package is a little nicer, but doesn’t make categorical variables as clear.)

### Summarize the data
# Base r command
summary(bookingdata)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_day_of_month stays_in_weekend_nights
##  Length:119390      Min.   : 1.0              Min.   : 0.0000        
##  Class :character   1st Qu.: 8.0              1st Qu.: 0.0000        
##  Mode  :character   Median :16.0              Median : 1.0000        
##                     Mean   :15.8              Mean   : 0.9276        
##                     3rd Qu.:23.0              3rd Qu.: 2.0000        
##                     Max.   :31.0              Max.   :19.0000        
##                                                                      
##  stays_in_week_nights     adults          children           meal          
##  Min.   : 0.0         Min.   : 0.000   Min.   : 0.0000   Length:119390     
##  1st Qu.: 1.0         1st Qu.: 2.000   1st Qu.: 0.0000   Class :character  
##  Median : 2.0         Median : 2.000   Median : 0.0000   Mode  :character  
##  Mean   : 2.5         Mean   : 1.856   Mean   : 0.1039                     
##  3rd Qu.: 3.0         3rd Qu.: 2.000   3rd Qu.: 0.0000                     
##  Max.   :50.0         Max.   :55.000   Max.   :10.0000                     
##                                        NA's   :4                           
##    country          market_segment     is_repeated_guest
##  Length:119390      Length:119390      Min.   :0.00000  
##  Class :character   Class :character   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Median :0.00000  
##                                        Mean   :0.03191  
##                                        3rd Qu.:0.00000  
##                                        Max.   :1.00000  
##                                                         
##  previous_bookings_not_canceled reserved_room_type assigned_room_type
##  Min.   : 0.0000                Length:119390      Length:119390     
##  1st Qu.: 0.0000                Class :character   Class :character  
##  Median : 0.0000                Mode  :character   Mode  :character  
##  Mean   : 0.1371                                                     
##  3rd Qu.: 0.0000                                                     
##  Max.   :72.0000                                                     
##                                                                      
##  booking_changes   deposit_type          agent                adr         
##  Min.   : 0.0000   Length:119390      Length:119390      Min.   :  -6.38  
##  1st Qu.: 0.0000   Class :character   Class :character   1st Qu.:  69.29  
##  Median : 0.0000   Mode  :character   Mode  :character   Median :  94.58  
##  Mean   : 0.2211                                         Mean   : 101.83  
##  3rd Qu.: 0.0000                                         3rd Qu.: 126.00  
##  Max.   :21.0000                                         Max.   :5400.00  
##                                                                           
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
## 
# Use describe from the psych package piped into the kable command
library(psych)
library(knitr)
describe(bookingdata) %>%
  select("Mean" = "mean", "SD" = "sd", "Median" = "median", "Min" = "min", "Max" = "max") %>%          # Only display a few columns 
  kable(digits = 1)                     # kable() formats tables nicely, round to 1 digits
Mean SD Median Min Max
hotel* 1.3 0.5 1.0 1.0 2
is_canceled 0.4 0.5 0.0 0.0 1
lead_time 104.0 106.9 69.0 0.0 737
arrival_date_year 2016.2 0.7 2016.0 2015.0 2017
arrival_date_month* 6.5 3.5 7.0 1.0 12
arrival_date_day_of_month 15.8 8.8 16.0 1.0 31
stays_in_weekend_nights 0.9 1.0 1.0 0.0 19
stays_in_week_nights 2.5 1.9 2.0 0.0 50
adults 1.9 0.6 2.0 0.0 55
children 0.1 0.4 0.0 0.0 10
meal* 1.6 1.1 1.0 1.0 5
country* 94.6 45.1 82.0 1.0 178
market_segment* 5.9 1.3 6.0 1.0 8
is_repeated_guest 0.0 0.2 0.0 0.0 1
previous_bookings_not_canceled 0.1 1.5 0.0 0.0 72
reserved_room_type* 2.0 1.7 1.0 1.0 10
assigned_room_type* 2.3 1.9 1.0 1.0 12
booking_changes 0.2 0.7 0.0 0.0 21
deposit_type* 1.1 0.3 1.0 1.0 3
agent* 211.8 122.8 296.0 1.0 334
adr 101.8 50.5 94.6 -6.4 5400
required_car_parking_spaces 0.1 0.2 0.0 0.0 8
total_of_special_requests 0.6 0.8 0.0 0.0 5
reservation_status* 1.6 0.5 2.0 1.0 3

2.2.7 Missing Data

You have to know how many records have missing values, recorded in r as NA for Not Applicable. If data is potentially missing for a reason related an important outcome, all of your analysis could be off. Imagine if we were doing a survey about home internet access of our customers but we conducted the survey via the internet. Many customers might not respond, but the non-response (and therefore NAs) would be more prevalent among customers without internet access.

The is.na() function in R evaluates each element of whatever input you give it (e.g. a column in your dataframe) returns an object with the same dimensions containing values TRUE or FALSE. TRUE in this case means that that element was missing. You can take the sum over the output of is.na() to count up the number of NAs. Let’s do that for our hotel booking dataset.

### Count the number of NAs in the booking data.
sum(is.na(bookingdata))
## [1] 4

Since we only have 4 missing values in our dataset, we can either ignore them or drop those observations. We can actually look above and see that summary includes NAs in its output. It looks like four records are missing entries for the number of children in the party.

2.2.8 Tables and Counts

Let’s get a sense for the data. How many bookings were at the City Hotel and how many at the Resort Hotel? (To hide the true identity of the hotel names and be more descriptive than Hotel 1 and Hotel 2, they gave use “City” and “Resort”). We can use the command table to easily break it down a bit.

### Tabulate the number of bookings by Hotel Location
table(bookingdata$hotel)
## 
##   City Hotel Resort Hotel 
##        79330        40060

Notice how we use the name of the data object bookingdata followed by a dollar sign and then our variable name, hotel? This is a key point.

We can also create a two-way table. Let’s look at bookings by month and hotel.

### Tabulate the number of bookings by Month and Hotel
table(bookingdata$arrival_date_month, bookingdata$hotel)
##            
##             City Hotel Resort Hotel
##   April           7480         3609
##   August          8983         4894
##   December        4132         2648
##   February        4965         3103
##   January         3736         2193
##   July            8088         4573
##   June            7894         3045
##   March           6458         3336
##   May             8232         3559
##   November        4357         2437
##   October         7605         3555
##   September       7400         3108

Do some more exploring on your own. What else do you see? It’s hard to get a sense for this many numbers. Some visualizations would help us!