Predicting Airbnb prices in Boston with Machine Learning & other geographical insights.

Dec 15, 2020

Since its establishment in 2008, Airbnb has been offering tourists a unique way to find short and long-term homestay accommodations when traveling. As part of the Airbnb Inside initiative, the Boston Airbnb Listing dataset describes the listing activities of properties in Boston, MA.

Here, I will analyze the Airbnb Boston Listings dataset (linked under References), which includes around 130 descriptions of amenities, location, and price for each listing.

Aside from the Listings, Airbnb Inside offers two other types of dataset:

Reviews: includes guests’ detailed comments;
Calendar: lists price and availability for each property.

The original Airbnb datasets can be found here.

The Business Challenge

If you are going to be a host on Airbnb, it’s worth understanding the impact of property type, features, and location on your revenue. Hence, I built two models: (1) one that predicts the price of Airbnb properties in Boston using selected features from the Listings dataset, and (2) one that explains price variations in terms of distance from Boston downtown.

Data Understanding & Modelling

We structure the analysis as follows:

What are the most common property_type and room_type in Boston? And what is the impact of room and property type on bookings?
What are the most common amenities available across Airbnb Boston listings? Based on property availability, how do amenities attract bookings?
Can we build an ML model that predicts property price based on the most popular and common features of the listings?
What do correlations look like between price and other features_of_interest?
How does property price spread over Boston — uniformly or not?
Can a linear regression model explain how distance from Boston downtown predicts variations in property price?

So, let’s get started!

Q1: What are the most common property and room types in Boston?

We sort property_type values and count how many listings are under each type:

Apartment          2612
House               562
Condominium         231
Townhouse            54
Bed & Breakfast      41
Loft                 39
Other                17
Boat                 12
Villa                 6
Entire Floor          4
Dorm                  2
Camper/RV             1
Guesthouse            1
Name: property_type, dtype: int64

Out of a total of 3585 listings, the majority of properties fall into the Apartment, House, Condominium, Townhouse, and Bed & Breakfast categories.

We sort room_type and count how many listings are under each type:

Entire home/apt    2127
Private room       1378
Shared room          80
Name: room_type, dtype: int64
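
Counts like the ones above can be produced with pandas' value_counts(). The sketch below is only an illustration: it assumes the dataframe is called df_lis, as in the rest of this post, and uses a handful of made-up rows instead of the real listings.

```python
import pandas as pd

# Toy stand-in for the listings dataframe (made-up rows, not the real data).
df_lis = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment", "Condominium", "Apartment"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

# value_counts() counts the rows per category and sorts them in descending order.
property_counts = df_lis["property_type"].value_counts()
room_counts = df_lis["room_type"].value_counts()

print(property_counts)
print(room_counts)
```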

Q1.1: Effects of Property and Room Type on Bookings

Here, we want to see the impact of property_type and room_type on future bookings. We divide availability_30 by 30 (days) and subtract the result from 1 to create a new variable called booking_percentage_30, which shows the share of the next 30 days that is already booked for each listing. We use booking_percentage_30 as a measure of property popularity.

df_lis['booking_percentage_30'] = 1 - (df_lis['availability_30']/30)
df_lis['booking_percentage_30'].head()

0    1.000000
1    0.133333
2    0.366667
3    0.800000
4    0.566667
Name: booking_percentage_30, dtype: float64

In the output above, we found the percentage of days booked out of the next 30 days for each listing. For instance, property 3 is booked 80% of the next 30 days.

Now, we will use our new ‘booking_percentage_30’ to investigate how room type and property type attract bookings.

We first look at room_type:

df_lis.groupby(['room_type'])['booking_percentage_30'].mean().sort_values(ascending=False)

And then at property_type:

(df_lis.groupby(['property_type'])['booking_percentage_30'].mean().sort_values(ascending=False)).plot(kind='bar', legend=None)

Results: As for room type, although Entire home/apt has more listings, Private room has a higher booking percentage. As for property type, some types have very few listings compared with Apartment, House, and Condominium, so their averages are less reliable: a type such as Villa, with only 6 listings, can easily show a higher booking percentage than Apartment, with 2612 listings. With that caveat in mind, Boats (12 listings) are less popular than Villas (6 listings), and Apartments (2612 listings) are almost as popular as B&Bs (41 listings).
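
One way to keep the small-sample caveat visible is to report the number of listings next to each average. This is a minimal sketch on made-up values, assuming the same column names as above; it is not the code used for the chart.

```python
import pandas as pd

# Made-up rows: four apartments and a single villa.
df_lis = pd.DataFrame({
    "property_type": ["Apartment"] * 4 + ["Villa"],
    "booking_percentage_30": [0.2, 0.4, 0.6, 0.8, 1.0],
})

# Mean booking share per property type, together with the number of listings behind it.
summary = (
    df_lis.groupby("property_type")["booking_percentage_30"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In this toy example Villa ranks first, but its average rests on a single listing, which is exactly the situation to be wary of.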

Q2: What are the most common amenities that are available in the Airbnb Boston dataset?

Besides the information provided by the other columns, the amenities column gives more detailed information about the properties, such as kitchen, WiFi, and washer/dryer.

We would like to know which amenities are most popular across the listings, and later select some of them to predict the listing price.

Since each listing in the dataset has its own set of amenities, we first collect all amenities into a new list, list_of_amenities. We then use a function to count the number of listings that contain each amenity in list_of_amenities. Last, we create a new dataframe with one column per amenity, indicating for each listing whether that amenity is present.
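
The steps above can be sketched as follows. This is an illustration under assumptions, not the original code: it assumes each amenities value is a string such as {TV,"Wireless Internet",Kitchen}, and the helper parse_amenities and the names amenities_df and amenity_share are hypothetical.

```python
import pandas as pd

# Made-up rows in the assumed raw format of the amenities column.
df_lis = pd.DataFrame({
    "amenities": [
        '{TV,"Wireless Internet",Kitchen}',
        '{"Wireless Internet","Smoke Detector"}',
        '{Kitchen}',
        '{}',
    ]
})

def parse_amenities(raw):
    """Turn '{TV,"Wireless Internet"}' into ['TV', 'Wireless Internet'] (hypothetical helper)."""
    items = raw.strip("{}").split(",")
    return [item.strip('"') for item in items if item.strip('"')]

amenity_lists = df_lis["amenities"].apply(parse_amenities)

# All distinct amenities that appear in at least one listing.
list_of_amenities = sorted({a for amenities in amenity_lists for a in amenities})

# One 0/1 column per amenity: 1 if the listing offers it, 0 otherwise.
amenities_df = pd.DataFrame(
    {a: amenity_lists.apply(lambda lst, a=a: int(a in lst)) for a in list_of_amenities}
)

# Share of listings offering each amenity, most common first (the basis for a bar chart).
amenity_share = amenities_df.mean().sort_values(ascending=False)
print(amenity_share)
```

Splitting on commas is a simplification: it would break if an amenity name itself contained a comma.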

Here, we create a bar chart to show the percentage of each amenity from the most common to the least common.

Q2.1: How do amenities attract bookings?

Here we want to see the impact of amenities on future bookings, based on the Availability in the next 30 days, which we have already used as an indicator of property popularity.

Results: Setting aside the second most popular amenity, which seems to be an unlabeled column, we found that listings with a Smoke Detector have a booking percentage for the next 30 days that is almost 15% higher. A Buzzer/Wireless Intercom, Fire Extinguisher, Wireless Internet, and Lock on Bedroom Door are also associated with more popular listings.
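
A comparison of this kind can be sketched by contrasting the mean booking share of listings with and without a given amenity. The data below is made up, and the function booking_lift is a hypothetical name; the exact calculation behind the figures above may differ.

```python
import pandas as pd

# Made-up listings: 0/1 amenity flags plus the booking share for the next 30 days.
df = pd.DataFrame({
    "Smoke Detector": [1, 1, 0, 0, 1],
    "Kitchen": [1, 0, 1, 0, 1],
    "booking_percentage_30": [0.8, 0.6, 0.4, 0.2, 0.7],
})

def booking_lift(df, amenity, target="booking_percentage_30"):
    """Mean booking share with the amenity minus the mean without it (hypothetical helper)."""
    means = df.groupby(amenity)[target].mean()
    return means.loc[1] - means.loc[0]

lifts = pd.Series({a: booking_lift(df, a) for a in ["Smoke Detector", "Kitchen"]})
print(lifts.sort_values(ascending=False))
```

A difference in means like this describes an association only; it does not show that adding the amenity would cause more bookings.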

Q3: Can we build a model to predict the price of a listing based on the most influential features of the dataset?

There are many features in the dataset that influence the price of the listing. In the third question, I would like to train a model to estimate the price of a listing.

To do so, we select a list of features of interest (FOI) that are very likely crucial for estimating the price of a listing. The FOIs are some columns from the original dataset, along with some of the new categorical columns created from the amenities. We then create a new sub-dataframe of listings.

Before training the model, we need to check the missing values and impute the mean, mode or 0, depending on the variable. In doing so, we need to be careful not to dilute the predictive power of the machine learning model, as it can lead to overgeneralizations.
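
The three imputation strategies can be sketched as follows. Which column receives which strategy here is an assumption made purely for illustration, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with missing values in a numeric, a categorical, and a fee column.
df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "property_type": ["Apartment", None, "Apartment", "House"],
    "cleaning_fee": [50.0, np.nan, np.nan, 30.0],
})

df_imputed = df.copy()

# Numeric feature: fill with the column mean.
df_imputed["bathrooms"] = df_imputed["bathrooms"].fillna(df_imputed["bathrooms"].mean())

# Categorical feature: fill with the most frequent value (the mode).
most_common_type = df_imputed["property_type"].mode()[0]
df_imputed["property_type"] = df_imputed["property_type"].fillna(most_common_type)

# Fee: treat a missing value as "no fee" and fill with 0.
df_imputed["cleaning_fee"] = df_imputed["cleaning_fee"].fillna(0)

print(df_imputed)
```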

We create the X (features) and y (the variable to be modelled) dataframes and split them into train and test sets. We then fit a linear regression model on the training set, make predictions on the test set, and score the model.

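from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

y = df_lis_ml['price']
X = df_lis_ml.drop(columns='price')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# normalize=True was removed in scikit-learn 1.2; plain least squares gives the same predictions without it
lm_model = LinearRegression()
lm_model.fit(X_train, y_train)
y_test_preds = lm_model.predict(X_test)
test_score = r2_score(y_test, y_test_preds)
print(test_score)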

Results: The R² score of our model is 0.22, which means that the model explains 22% of the variation in property price. It is worth noting that we restricted the features of our model to property type, amenities, and location, and did not consider many other crucial features such as peak seasons, guest reviews, or host ratings, to name a few. Even so, the model could still explain about a fifth of the price variation.
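
To make the meaning of R² concrete, here is a small worked example on made-up prices and predictions (these are not values from the model above). R² is one minus the ratio of the unexplained variation to the total variation.

```python
import numpy as np

# Made-up true prices and predictions, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean price
r2 = 1 - ss_res / ss_tot

print(r2)
```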

Now, we want to find the most influential features of our model by ranking the coefficient estimates by absolute value:

import numpy as np
import pandas as pd

def coef_weights(coefficients, X_train):
    """Return the model coefficients sorted by absolute value, largest first."""
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df

coef_df = coef_weights(lm_model.coef_, X_train)

Results: From the table, we can see that the coordinates have a strong positive effect on listing price, whereas Camper, Private Room, Shared Room, Entire Floor, and Pool lower the price of the property. Kitchen, one of the most common features across the listings, also seems to lower the price.

We now test a range of k values from 10 to 74 (which is the number of all features) and find the number of features that generate the highest R² value.

Results: The highest R² score, 0.226, is achieved with 30 features.
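
A search over k can be sketched as below. This is an assumption about the approach, not the original code: it uses scikit-learn's SelectKBest with f_regression as the selection method and runs on synthetic data from make_regression rather than on the listings.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the listings features.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for k in range(2, X.shape[1] + 1):
    # Keep the k features most associated with the target, judged on the training set only.
    selector = SelectKBest(score_func=f_regression, k=k).fit(X_train, y_train)
    model = LinearRegression().fit(selector.transform(X_train), y_train)
    preds = model.predict(selector.transform(X_test))
    scores[k] = r2_score(y_test, preds)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Choosing k by the test-set score, as done here for brevity, makes that score slightly optimistic; a separate validation split or cross-validation would give a cleaner estimate.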

Q4: What do correlations look like between `price` and other `features of interest`?

Using the seaborn library, we draw a heatmap of the correlation coefficients between price and other property features. We use a diverging color palette that has markedly different colors at the two ends of the value range, with a pale midpoint at 0. The warmer and lighter the color, the larger the positive correlation magnitude.

As for the variable price, the heatmap shows higher positive correlations with bedrooms, beds, accommodates, and cleaning fee.
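
The numbers behind such a heatmap come from a correlation matrix. The sketch below computes one with pandas on made-up values; the column names are illustrative and the correlations say nothing about the real dataset.

```python
import pandas as pd

# Made-up listings, for illustration only.
df = pd.DataFrame({
    "price": [80, 120, 150, 200, 260],
    "bedrooms": [1, 1, 2, 3, 4],
    "accommodates": [2, 2, 4, 6, 8],
    "minimum_nights": [7, 5, 3, 2, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of every feature with price, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The matrix corr can then be passed to seaborn, for example sns.heatmap(corr, cmap=sns.diverging_palette(220, 20, as_cmap=True), center=0), so that 0 sits at the pale midpoint of the palette.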

Q5: How does property `price` spread over Boston — uniformly or not?

Since latitude and longitude have the most influential coefficients in our price model, it is worth seeing how property price spreads across Boston.

Before plotting, we notice that the mode of property price is $150 and the third quartile is $220, so only about a quarter of the listings are priced above $220. Hence, we keep only the listings priced below $220, as the more expensive ones would throw off our color coding.

df_lis_map_price = df_lis[df_lis['price'] < 220]
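
The mode and third quartile used to pick the cutoff can be computed with pandas along these lines. The prices below are made up, so the resulting numbers differ from the $150 and $220 reported for the real data.

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([90, 150, 150, 150, 180, 220, 300, 450], name="price")

price_mode = prices.mode()[0]            # most frequent price
third_quartile = prices.quantile(0.75)   # 75% of prices are at or below this value

# Keep only the listings below the cutoff, mirroring the filter above.
below_cutoff = prices[prices < third_quartile]

print(price_mode, third_quartile, len(below_cutoff))
```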

We now make a scatterplot of property prices and their relative coordinates.

Results: Looking at the map, we can infer that the most expensive properties are mainly clustered around the coordinates 42.36, -71.060. However, there are also clusters of highly priced listings elsewhere, for instance around 42.36, -71.150.

In order to get a better visualization of how property prices are distributed, we create a Heat Map of Boston.

We use the folium package (https://pypi.org/project/folium/) to create a map of the listing locations and their relative prices, using the latitude and longitude information for each property.

As we can see, prices are not evenly spread: they do not simply decrease moving outwards from Boston downtown.

Q6: Can a linear regression model explain how distance from Boston downtown predicts variations in property `price`?

In Q5, we determined that the hottest cluster in Boston, price-wise, is located around the following coordinates: 42.36, -71.060. Incidentally, these coordinates closely overlap with those of Boston downtown.
In our previous model, we used latitude and longitude as variables along with other features. Here, we use the coordinates of Boston downtown to create a new variable for each listing: distance_from_Boston_Downtown.

import math

def distance_from_Boston_Downtown(lat, lon, downtown=(42.3557, -71.0572)):
    """Haversine distance in km between a listing and Boston downtown."""
    R = 6373.0  # approximate Earth radius in km
    lat1 = math.radians(downtown[0])
    lon1 = math.radians(downtown[1])
    lat2 = math.radians(lat)
    lon2 = math.radians(lon)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    return distance

# Compute the distance for every listing, row by row
for i in df_lis_ml.index:
    df_lis_ml.loc[i, 'distance_from_Boston_Downtown'] = distance_from_Boston_Downtown(df_lis_ml.loc[i, 'latitude'], df_lis_ml.loc[i, 'longitude'])

After calculating the distance from downtown, we apply the linear regression, fit the model, make predictions with the test set, and score the model.

y = df_lis_ml['price']
X = df_lis_ml.drop(columns='price')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
# normalize=True was removed in scikit-learn 1.2; plain least squares gives the same predictions without it
lm_model = LinearRegression()
lm_model.fit(X_train, y_train)
y_test_preds = lm_model.predict(X_test)
test_score = r2_score(y_test, y_test_preds)
print(test_score)

Results: After calculating the distance from downtown for each property, we train the model. The model has an R² value of 0.23. Because X still contains the other features, this 23% of price variation is explained by distance from Boston downtown together with those features, rather than by distance alone.

We now check the coefficients of the new regression model.

coef_df = coef_weights(lm_model.coef_, X_train)
coef_df[coef_df['est_int'] == 'distance_from_Boston_Downtown']

Results: The regression coefficient indicates that distance from downtown has a negative effect on price, meaning that as the distance increases, the price decreases. More specifically, for each additional kilometer from downtown, the property price drops by around 15 USD.
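
To isolate the effect of distance, one could also fit a regression with distance as the only feature. The sketch below is not part of the original analysis: it uses synthetic coordinates and synthetic prices with a built-in drop of 15 USD per kilometer, and the function haversine_km is a hypothetical vectorized version of the distance function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

DOWNTOWN = (42.3557, -71.0572)  # same downtown coordinates as in the function above

def haversine_km(lat, lon, center=DOWNTOWN, radius_km=6373.0):
    """Great-circle distance in km from center; accepts scalars or NumPy arrays."""
    lat1, lon1 = np.radians(center)
    lat2, lon2 = np.radians(lat), np.radians(lon)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Synthetic listings scattered around downtown (not real data).
rng = np.random.default_rng(0)
lat = DOWNTOWN[0] + rng.uniform(-0.05, 0.05, 200)
lon = DOWNTOWN[1] + rng.uniform(-0.05, 0.05, 200)
distance = haversine_km(lat, lon)

# Synthetic prices: 15 USD cheaper per km from downtown, plus noise.
price = 250 - 15 * distance + rng.normal(0, 10, 200)

# Single-feature regression: price explained by distance only.
X_dist = distance.reshape(-1, 1)
model = LinearRegression().fit(X_dist, price)
slope = model.coef_[0]
r2 = model.score(X_dist, price)
print(slope, r2)
```

Because the slope is built into the synthetic prices, the fitted coefficient only shows that the procedure recovers it; the R² of such a model on the real listings would have to be measured separately.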

References

The Boston Airbnb Dataset analyzed is on Kaggle: https://www.kaggle.com/airbnb/boston

The libraries used, dataset, and detailed code breakdown are available on my GitHub:
https://github.com/OliviaCrrbb/Boston-Airbnb-Price-Analysis-Modelling

Acknowledgments

Thanks to Kaggle and Airbnb for the dataset and Udacity for the course.