This time, we’ll analyse the Hotel Booking Demand dataset available here in kaggle, it’s an interesting dataset for exploratory analysis. The table contains data from two hotels: a) City hotel and b) Resort hotel, it consist of 32 columns, with records from July, 2015 to August, 2017.
We’ll define our business task loosely as find relevant patterns that allow us to make predictions about the hotel guests. Up to this version of the notebook, the analysis include:
Note before we begin: As with most of my notebooks, I’ve left the main code on the first cell, feel free to unhide it to see it in more detail. Do not hesitate to share your feedback on the comments section, I’d love to hear your thoughts.
Let’s begin by answering the question what is the hotel with the most demand? as I briefly mentioned above, we’re basically analysing two hotels, one located on the city and the second one is a resort (we don’t know exactly where they are).
If we count the total amount of reservations per hotel and month we can get the following chart:
line_1
There is no question, the city hotel has the most reservations, this hotel presents the greater range of variation throughout the year too. Also, we can see some similarities, January seems to be the month with the fewer amount of bookings on both hotels. The horizontal dotted lines are fixed on January, the begining of the year, and allows us to easily identify the general pattern, we can see how, after january, there is a consistent increase on the amount of bookings until April, then we can expect to see a decrease by July, then another increase by September-October to finally return to a minimun activity by January.
Now, let’s take the previos chart one step further by adding a new factor, let’s filter the bookings that were canceled, this way we will be able to compare the expected vs. the real amount of guests (NOTE: Please keep in mind that up to this point we are still not working with an actual amount of people, but rather, the amount of service request, for example, one reservation may consist of 1, 2, 3 or more people, we’ll talk about that later). The resulting chart looks like this:
line_2
Adding this change reveals something entirely different! we can see how there is a significant difference between the amount of reservations submitted (transparent dotted lines) vs. the real occupation the hotels had (solid continuous lines). Regardless of the diffentence between expected vs actual visitors, we can still see the City hotel coming up on top as the option with the most bookings, January as the month with the fewer reservations and October as the one with the most.
Now that we have a general knowledge about the amount of visitors each hotel receives throughout the year, the next question is… where are they from? a simple answer for this question would be very simple, they come from all around the globe, but that is not satisfying at all, let’s pin down our top contries per hotel:
col_1
You can find the complete list of country codes here, but for the purposes of this analysis, we just need to keep in mind these:
The chart above only displays countries with at least 3.5% occupation rate, so please keep in mind that the visitors come from many different countries, but their percentage is below the criteria. We can see that the top countries for both hotels are all from Europe.
Now, let’s see how these countries bahave throughout the year:
area_1
The “Other” category includes all the different countries besides the top 7 that we described earlier. We can see how the city hotel has a greater level of diversity on their guests, we don’t have any Personally Identifiable Information (PII for short) but it would be great if we could survey that population to see what causes the difference, for example, more business trips? We can see how, as we could expect from the previous chart, guests from Portugal makes (consistenly) the most visits, followed by people from France and The United Kingdom.
Now, let’s talk about the groups of people visisting the hotels. Up to this point, our analysis has focused on the amount of bookings, but, as we can tell by looking at the data, one reservation may include one or multiple people, for example, people traveling alone, couples, families, etc. By merging information about a) number of adults b) number of children and c) number of babies into a single string, we can study how often the different patterns repeat, and as a result, we can get this chart:
col_2
The pattern is quite similar among the two hotels, by far, 2 adults is the most common combination, followed by 1 adult traveling alone. Let’s see if the pattern remains consistent month by month:
line_4
Its amazing to see how, the combination consisting of 2 adults makes by far the most visits on both hotels. for some reason, the amount of people traveling alone decreases significantly on July, the rest of combinations tends to remain very consistent.
Now it’s time to find an answer to the question when is it most likely for a guest to get a better deal? this is an important piece of information, we all want to enjoy our vacations but its always better if we can save some money, right?, here we need to take in consideration two different factors 1) how many people are going? and 2) month of the year we are planning to go if we canculate the average daily rate (ADR) per month, group of guests, and we separate by hotel as we did before, we can get the following chart:
line_3
First, let’s talk about the individual series, it makes perfect sense what we are seeing, basically, the most people traveling, the most money we are going to spend, right? so 1 adult traveling alone is consistently spending the less amount of money, while 2 adults and 2 children spends the most. As for the monthly ADR variation, we can see that January (and the nearby months) offers the better rates, while August is just the opposite, showing the higher ADR, this is true at least for the resort hotel, as the city hotel seems to keep more consistency on its prices.
Now, Let’s see how the guests groups we defined earlier compare regarding canceled vs. non-canceled bookings, the next chart is quite interesting!
col_3
As we can see, there is a significant gap between the amount of visits both hotels could have (potentially) versus the real amount. It would be great if we could know what is main reason why all those bookings were canceled, as both hotels would greatly benefit from closing that huge difference.
Now that we’ve seen the gap between canceled vs non-canceled visits, the next question is, is there a relationship between bookings being canceled and their lead time? lead time is defined as the time (in days) between the moment the booking has been made and the expected arrival date. The chart below will help us answer this question.
hist_1
Each bin represents 2 weeks, the first bin is ,on both hotels, significantly smaller than the second one, and from that point, we see their size consistently decreasing until they reaches their minimun after 450 days. We can conclude that bookings with a registered lead time from 14 to 150 days are more likely to be cancelled. These cancelation frecuencies keep proportion with the guests groups we’ve been comparing, so bookings consisting of 2 adults come up on top here as well.
Now, let’s focus on the number of days the visitors tends to stay in each hotel. Our data source has the amount of days spent during weekdays and weekends on separate columns, I’ve decided to sum those two columns to get a total amount of days, then I put that data into the chart below.
NOTE: To keep the chart simple I’ve limited the x-axis to 20 days
hist_2
The results are quite interesting, it seems that the city hotel, while having the most visits, these are, in overall, shorter to those we see on the resort hotel. Visitors traveling alone tends to prefer shorter trips, as we can see on the chart, their visits (in orage) are fewer the longer the visit is. For some reason, 7-days bookings are quite popular on the resort, as you can see, there is a spike on the histogram showing this preference.