9 Comparing Quantities

9.0.1 Scatterplot

When comparing two numerical variables, often the most straightforward plot is a scatterplot. Each observation is plotted on a two-dimensional graph, and we can easily look for patterns or trends. Let’s compare the price paid (Average Daily Rate or adr) with how far out the guest booked the stay (lead_time).

As we’ll see in a minute, scatterplots aren’t always helpful if there are too many data points, so we’ll take a sample of 800 bookings from our dataset to plot (this also speeds up plotting).

### Scatterplot comparing price and how far out a guest booked the stay

# Set seed so that we get the same "random" numbers each time and the plot is reproducible
# You can pick any number for the seed, but to get the same result each time 
# you have to use the same seed
set.seed(132334)

# Plot a random sample of 800 bookings that weren't complimentary (i.e., price = 0)
bookingdata %>%
  filter(adr != 0) %>%
  sample_n(size = 800) %>%
  ggplot(aes(x =  lead_time, y = adr)) +
  geom_point(color = "#00759a") +
  scale_x_reverse() +   # Reverse the x axis so we "count down"
  scale_y_continuous(labels=scales::dollar_format())+
  labs(x = "Lead Time (Days)",
       y = "Average Daily Rate (Price $)",
       title = "Scatterplot: Distribution of Two Numerical Variables",
       subtitle = "Average Daily Rate ($) versus Lead Time") +
  theme(axis.line = element_line(color = "black"),
        axis.title.y = element_text(angle = 0, vjust = 0.5),
        panel.background=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major.y=element_blank(),
        panel.grid.major.x=element_blank())

We see an interesting pattern emerge…there’s a floor around $50, and few rooms are booked below that price until the lead time gets short. We can add a line to the plot to help people see it.
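One way to add such a reference line is geom_hline(). This is a minimal sketch on simulated data: the toy_bookings data frame and its values are made up for illustration and are not drawn from bookingdata. The same geom_hline() layer could be added to the scatterplot above.

```r
library(ggplot2)

# Simulated stand-in for the bookings sample (values are made up)
set.seed(1)
toy_bookings <- data.frame(
  lead_time = sample(0:300, 200, replace = TRUE),
  adr       = round(runif(200, min = 40, max = 250), 2)
)

p <- ggplot(toy_bookings, aes(x = lead_time, y = adr)) +
  geom_point(color = "#00759a") +
  # Dashed horizontal reference line at the $50 "floor"
  geom_hline(yintercept = 50, linetype = "dashed", color = "grey40") +
  scale_x_reverse() +   # Reverse the x axis so we "count down"
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(x = "Lead Time (Days)",
       y = "Average Daily Rate (Price $)")

p
```

A dashed, muted line keeps the reference visible without competing with the points.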

Let’s compare the median lead time for bookings priced at or below $50 with those priced above $50:

bookingdata %>%
  group_by(adr<=50) %>%
  summarize(median_lead = median(lead_time))

## # A tibble: 2 x 2
##   `adr <= 50` median_lead
## * <lgl>             <dbl>
## 1 FALSE                75
## 2 TRUE                 19

9.0.1.1 Overplotting

We mentioned above that if you put every one of the ~120K bookings on a single plot, it’s going to be very crowded, so we took a smaller sample. But sometimes sampling isn’t enough to resolve the issue.

Let’s suppose we want to look for a pattern in parking requirements for last-minute stays…does everyone who books at the last minute arrive by car? And does group size matter (families vs. business travelers)? So we have three variables here: lead time (numerical), group size (numerical), and whether the guests need a parking spot (nominal). We’ll just look at stays booked in the last week before the stay (and again, just take a small sample of the data since we have so many data points in this dataset).

### Make a scatterplot comparing group size, booking time 
### and whether parking is needed

# Set the seed so that the sample is reproducible
set.seed(678)

# Make the plot
bookingdata %>%
  filter(lead_time<=7) %>%
  sample_frac(size = 0.05) %>%          # Sample just 5% of the data
  mutate(group_size = children + adults,
         car = ifelse(required_car_parking_spaces > 0, "Driving", "Other transportation")) %>%
  filter(group_size>0) %>%
  ggplot(aes(x = lead_time, y = group_size, color = car)) +
  geom_point() +
  labs(color = "Guest Transportation",
       x = "Booking Lead Time (Days)",
       y = "Group Size",
       title = "Transportation Method by Group Size and Booking Lead Time") +
  scale_x_reverse() +
  scale_color_manual(values = c("#00759a", "grey70")) +
  theme(axis.line = element_line(color = "black"),
        axis.title.y = element_text(angle = 0, vjust = 0.5),
        panel.background=element_blank(),
        legend.position = "bottom",
        legend.key = element_rect(fill = "white"))

Okay, so nobody booking on the day of their stay needs parking. Wait, really? Also, we see 32 points. But there are more data points in the dataset (even after sampling). What’s going on here is overplotting…there are dots on top of one another.
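A quick way to confirm overplotting is to compare the number of rows with the number of distinct (x, y) positions. The sketch below uses a tiny made-up data frame (toy_sample and its values are hypothetical, not the bookings sample) to show the idea. Any position with more than one booking is drawn as a single visible dot.

```r
library(dplyr)

# Made-up sample: several bookings share the same lead time and group size
toy_sample <- tibble(
  lead_time  = c(0, 0, 0, 1, 1, 3, 3, 3, 3, 7),
  group_size = c(2, 2, 2, 1, 2, 2, 2, 2, 1, 4)
)

# Number of rows, i.e., the number of points ggplot2 draws
n_rows <- nrow(toy_sample)

# Number of distinct (x, y) positions, i.e., the dots you can actually see
positions <- toy_sample %>%
  count(lead_time, group_size, name = "n_bookings")

n_rows            # 10 points drawn
nrow(positions)   # 6 visible positions
positions         # n_bookings > 1 marks stacked points
```

Because later points are drawn on top of earlier ones, a stacked position shows only the color of the last point drawn there, which can hide a category entirely.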

Three solutions to overplotting are to:

1. Jitter points slightly (moving them away from each other) with geom_jitter()
2. Adjust transparency (with an alpha) so that points become transparent and you see where they are denser
3. Sample the data using sample_n() or sample_frac() and only look at some of the data points

9.0.1.1.1 Jittering

One solution is jittering, where we move points slightly to the side so that we can see them. Instead of geom_point we use geom_jitter.

### Make a jittered scatterplot comparing group size, booking time 
### and whether parking is needed

# Set the seed so that the sample is reproducible (and matches the one up above!)
set.seed(678)

# Make the plot
bookingdata %>%
  filter(lead_time<=7) %>%
  sample_frac(size = 0.05) %>%          # Sample just 5% of the data
  mutate(group_size = children + adults,
         car = ifelse(required_car_parking_spaces > 0, "Driving", "Other transportation")) %>%
  filter(group_size>0) %>%
  ggplot(aes(x = lead_time, y = group_size, color = car)) +
  geom_jitter() +
  labs(color = "Guest Transportation",
       x = "Booking Lead Time (Days)",
       y = "Group Size",
       title = "Transportation Method by Group Size and Booking Lead Time") +
  scale_x_reverse() +
  scale_color_manual(values = c("#00759a", "grey70")) +
  theme(axis.line = element_line(color = "black"),
        axis.title.y = element_text(angle = 0, vjust = 0.5),
        panel.background=element_blank(),
        legend.position = "bottom",
        legend.key = element_rect(fill = "white"))

Now we can see that there is quite a bit of diversity and the pattern isn’t so clear. We still have some overplotting, but it’s not as bad. What’s the potential downside? Well, it kind of looks like there are groups with 1.5 guests, but we know that’s not true. With jittering, you are slightly altering the values, so the plot is not a true representation of the underlying data; you might see patterns in the jittered version that don’t exist in real life.
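One way to limit that distortion is to control how far points move, using the width and height arguments of geom_jitter(). The sketch below is only an illustration: toy_groups is a made-up data frame, and the jitter amounts are arbitrary choices, not recommendations. A small height keeps each point close to its true integer group size.

```r
library(ggplot2)

# Made-up data: 10 bookings per lead-time day, group sizes of 1 to 3
toy_groups <- data.frame(
  lead_time  = rep(0:7, each = 10),
  group_size = rep(c(1, 2, 2, 2, 3), times = 16)
)

# Jitter is random, so set a seed to make the plot reproducible
set.seed(678)

p_jitter <- ggplot(toy_groups, aes(x = lead_time, y = group_size)) +
  # Move points up to 0.25 units sideways but at most 0.1 units vertically
  geom_jitter(width = 0.25, height = 0.1, color = "#00759a") +
  scale_x_reverse() +
  labs(x = "Booking Lead Time (Days)",
       y = "Group Size")

p_jitter
```

With the vertical jitter this small, the rows of points stay visibly separated at whole numbers, so nobody reads a group size of 1.5 into the plot.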

9.0.1.1.2 Adjust Transparency Alpha

Adjusting the alpha value changes the transparency of your points (or of a density plot, etc.). With semi-transparent points, overlapping observations stack up, so darker areas indicate greater density. This can be useful, but it’s hard to tell exact differences in magnitude.
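As a rough sketch of this approach, alpha can be set as a fixed value inside geom_point(); it ranges from 0 (invisible) to 1 (fully opaque). The toy_dense data frame below is simulated for illustration, and the alpha of 0.1 is an arbitrary choice that you would tune to your own data.

```r
library(ggplot2)

# Simulated data with many repeated (x, y) positions
set.seed(42)
toy_dense <- data.frame(
  lead_time  = sample(0:7, 500, replace = TRUE),
  group_size = sample(1:4, 500, replace = TRUE,
                      prob = c(0.3, 0.5, 0.15, 0.05))
)

p_alpha <- ggplot(toy_dense, aes(x = lead_time, y = group_size)) +
  # Each point is 10% opaque, so stacked points appear darker
  geom_point(alpha = 0.1, size = 3, color = "#00759a") +
  scale_x_reverse() +
  labs(x = "Booking Lead Time (Days)",
       y = "Group Size")

p_alpha
```

Positions with only a few bookings appear faint, while crowded positions approach the full color.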

9.0.1.1.3 Sample Data

Sometimes there’s just no way to fit everything on the scatterplot. The previous examples have used sample_n or sample_frac to select either a set number or a fraction of the dataset. This can be very helpful, but you have to be careful…is the sample representative of the overall dataset?
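One simple check on representativeness is to compare summary statistics for the sample against the full dataset. The sketch below assumes simulated data (toy_bookings is made up, including the distributions used to generate it) and compares medians. In newer versions of dplyr, slice_sample() is available as a replacement for sample_n() and sample_frac().

```r
library(dplyr)

# Simulated "full" dataset (values are made up)
set.seed(2024)
toy_bookings <- tibble(
  adr       = round(rgamma(5000, shape = 4, scale = 25), 2),
  lead_time = rpois(5000, lambda = 80)
)

# Draw a random sample of 800 bookings
toy_sample <- toy_bookings %>%
  sample_n(size = 800)

# Stack both datasets and compare their summaries side by side
comparison <- bind_rows(full = toy_bookings, sample = toy_sample,
                        .id = "dataset") %>%
  group_by(dataset) %>%
  summarize(n           = n(),
            median_adr  = median(adr),
            median_lead = median(lead_time))

comparison
```

If the sample’s medians sit far from the full dataset’s, the sample may not be representative and a larger or stratified sample might be worth considering.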

9.0.2 Bar Chart

Bar charts allow people to quickly compare quantities. Unlike scatterplots, they may not show every observation and usually require us to summarize the data in some way. The default is often the count of observations, but we may also look at the mean, median, max, etc.

# Mean ADR by reservation status, with one bar per hotel
bookingdata %>%
  ggplot(aes(x = reservation_status, y = adr, fill = hotel)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.8, position = position_dodge2()) +
  # y = adr + 4 lifts each label just above its bar; subtract the 4 back out
  # so the label shows the true mean rather than the mean plus the offset
  stat_summary(aes(y = adr + 4, label = scales::dollar(round(after_stat(y) - 4, 0))),
               fun = "mean",
               geom = "text",
               hjust = 0.5, angle = 0, size = 3,
               # Dodge width matches the 0.8 bar width so labels center over bars
               position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = c("#7a0019", "#00759a")) +
  labs(x = "Reservation Status",
       y = "",
       title = "Mean ADR ($) by Reservation Status and Hotel",
       fill = "") +
  theme(axis.line.y = element_blank(),
        axis.line.x = element_line(color = "gray40"),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        plot.background = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.position = "bottom")
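The stat = "summary" argument computes the means inside the plotting call. An alternative, sketched below on made-up data, is to summarize first with group_by() and summarize() and then draw the precomputed values with geom_col(). The toy_bookings data frame and its values are hypothetical. This version makes the numbers behind each bar easy to inspect.

```r
library(dplyr)
library(ggplot2)

# Made-up bookings: two hotels, two reservation statuses
toy_bookings <- tibble(
  hotel              = rep(c("City Hotel", "Resort Hotel"), each = 4),
  reservation_status = rep(c("Canceled", "Check-Out"), times = 4),
  adr                = c(100, 110, 120, 90, 80, 95, 70, 105)
)

# Step 1: summarize to one row per bar
mean_adr <- toy_bookings %>%
  group_by(reservation_status, hotel) %>%
  summarize(mean_adr = mean(adr), .groups = "drop")

# Step 2: plot the summarized values directly
p_bar <- ggplot(mean_adr,
                aes(x = reservation_status, y = mean_adr, fill = hotel)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = scales::dollar(mean_adr)),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c("#7a0019", "#00759a")) +
  labs(x = "Reservation Status", y = "", fill = "")

mean_adr
p_bar
```

Either approach can work; summarizing first is handy when you also want to report the values in a table or check them before plotting.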

One critical consideration is to ensure that the Y axis starts at zero for bar charts. A bar graph that doesn’t start at zero can be very misleading because the visible bar lengths no longer reflect the total magnitudes they represent, so differences between bars look larger than they are. In fact, geom_bar in ggplot2 draws every bar from zero and resists a truncated baseline…to create the figure on the left I had to “brute force” it with geom_rect.
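A little arithmetic shows how much a truncated axis can exaggerate a difference. The numbers below are made up purely for illustration: two bars with true values of 95 and 105, drawn once from zero and once from a baseline of 90.

```r
# Made-up values for two bars
true_values <- c(A = 95, B = 105)
baseline    <- 90   # Hypothetical truncated axis start

# Visible bar lengths when the axis starts at zero vs. at the baseline
lengths_zero  <- true_values - 0
lengths_trunc <- true_values - baseline

# How many times longer bar B looks than bar A
ratio_zero  <- unname(lengths_zero["B"]  / lengths_zero["A"])
ratio_trunc <- unname(lengths_trunc["B"] / lengths_trunc["A"])

ratio_zero    # about 1.11: B is roughly 11% larger than A
ratio_trunc   # 3: B looks three times as large as A
```

The underlying difference is the same 10 units in both cases, but the truncated version invites the reader to see a threefold gap.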