Hotel Bookings.

Introduction.

This time, we’ll analyse the Hotel Booking Demand dataset available here in kaggle, it’s an interesting dataset for exploratory analysis. The table contains data from two hotels: a) City hotel and b) Resort hotel, it consist of 32 columns, with records from July, 2015 to August, 2017.

We’ll define our business task loosely as find relevant patterns that allow us to make predictions about the hotel guests. Up to this version of the notebook, the analysis include:

  • Amount of bookings per hotel throughout the year.
  • Comparison between canceled and non-canceled reservations.
  • Most popular countries the guests came from.
  • Most common guest groups.
  • Average daily rates throughout the year.
  • Canceled vs non-canceled visits.
  • Cancelation frecuency by lead time.
  • Visit duration by hotel and guest groups.

Note before we begin: As with most of my notebooks, I’ve left the main code on the first cell, feel free to unhide it to see it in more detail. Do not hesitate to share your feedback on the comments section, I’d love to hear your thoughts.

Findings.

Amount of bookings per hotel throughout the year.

Let’s begin by answering the question what is the hotel with the most demand? as I briefly mentioned above, we’re basically analysing two hotels, one located on the city and the second one is a resort (we don’t know exactly where they are).

If we count the total amount of reservations per hotel and month we can get the following chart:

line_1

There is no question, the city hotel has the most reservations, this hotel presents the greater range of variation throughout the year too. Also, we can see some similarities, January seems to be the month with the fewer amount of bookings on both hotels. The horizontal dotted lines are fixed on January, the begining of the year, and allows us to easily identify the general pattern, we can see how, after january, there is a consistent increase on the amount of bookings until April, then we can expect to see a decrease by July, then another increase by September-October to finally return to a minimun activity by January.

Comparison between canceled and non-canceled reservations.

Now, let’s take the previos chart one step further by adding a new factor, let’s filter the bookings that were canceled, this way we will be able to compare the expected vs. the real amount of guests (NOTE: Please keep in mind that up to this point we are still not working with an actual amount of people, but rather, the amount of service request, for example, one reservation may consist of 1, 2, 3 or more people, we’ll talk about that later). The resulting chart looks like this:

line_2

Adding this change reveals something entirely different! we can see how there is a significant difference between the amount of reservations submitted (transparent dotted lines) vs. the real occupation the hotels had (solid continuous lines). Regardless of the diffentence between expected vs actual visitors, we can still see the City hotel coming up on top as the option with the most bookings, January as the month with the fewer reservations and October as the one with the most.

Most common guest groups.

Now, let’s talk about the groups of people visisting the hotels. Up to this point, our analysis has focused on the amount of bookings, but, as we can tell by looking at the data, one reservation may include one or multiple people, for example, people traveling alone, couples, families, etc. By merging information about a) number of adults b) number of children and c) number of babies into a single string, we can study how often the different patterns repeat, and as a result, we can get this chart:

col_2

The pattern is quite similar among the two hotels, by far, 2 adults is the most common combination, followed by 1 adult traveling alone. Let’s see if the pattern remains consistent month by month:

line_4

Its amazing to see how, the combination consisting of 2 adults makes by far the most visits on both hotels. for some reason, the amount of people traveling alone decreases significantly on July, the rest of combinations tends to remain very consistent.

Average daily rates throughout the year.

Now it’s time to find an answer to the question when is it most likely for a guest to get a better deal? this is an important piece of information, we all want to enjoy our vacations but its always better if we can save some money, right?, here we need to take in consideration two different factors 1) how many people are going? and 2) month of the year we are planning to go if we canculate the average daily rate (ADR) per month, group of guests, and we separate by hotel as we did before, we can get the following chart:

line_3

First, let’s talk about the individual series, it makes perfect sense what we are seeing, basically, the most people traveling, the most money we are going to spend, right? so 1 adult traveling alone is consistently spending the less amount of money, while 2 adults and 2 children spends the most. As for the monthly ADR variation, we can see that January (and the nearby months) offers the better rates, while August is just the opposite, showing the higher ADR, this is true at least for the resort hotel, as the city hotel seems to keep more consistency on its prices.

Canceled vs non-canceled visits.

Now, Let’s see how the guests groups we defined earlier compare regarding canceled vs. non-canceled bookings, the next chart is quite interesting!

col_3

As we can see, there is a significant gap between the amount of visits both hotels could have (potentially) versus the real amount. It would be great if we could know what is main reason why all those bookings were canceled, as both hotels would greatly benefit from closing that huge difference.

Cancelation frecuency by lead time.

Now that we’ve seen the gap between canceled vs non-canceled visits, the next question is, is there a relationship between bookings being canceled and their lead time? lead time is defined as the time (in days) between the moment the booking has been made and the expected arrival date. The chart below will help us answer this question.

hist_1

Each bin represents 2 weeks, the first bin is ,on both hotels, significantly smaller than the second one, and from that point, we see their size consistently decreasing until they reaches their minimun after 450 days. We can conclude that bookings with a registered lead time from 14 to 150 days are more likely to be cancelled. These cancelation frecuencies keep proportion with the guests groups we’ve been comparing, so bookings consisting of 2 adults come up on top here as well.

Visit duration by hotel and guest groups.

Now, let’s focus on the number of days the visitors tends to stay in each hotel. Our data source has the amount of days spent during weekdays and weekends on separate columns, I’ve decided to sum those two columns to get a total amount of days, then I put that data into the chart below.

NOTE: To keep the chart simple I’ve limited the x-axis to 20 days

hist_2

The results are quite interesting, it seems that the city hotel, while having the most visits, these are, in overall, shorter to those we see on the resort hotel. Visitors traveling alone tends to prefer shorter trips, as we can see on the chart, their visits (in orage) are fewer the longer the visit is. For some reason, 7-days bookings are quite popular on the resort, as you can see, there is a spike on the histogram showing this preference.

References