Demystifying Seattle Airbnb Data Using CRISP-DM

Uncovering Insights and Predictive Models for Seattle Airbnb Listings

October 30, 2023

Welcome to the Learning Monday edition of the Data Pragmatist, your dose of all things data science and AI.

📖 Estimated Reading Time: 4 minutes.

Today we will walk through a data science project using the CRISP-DM model, applying it to an Airbnb dataset for Seattle to answer three questions about revenue, bookings, and pricing. I have also included the latest Tech Headlines.

Do follow us on Linkedin and Twitter for more real-time updates.

Tech Headlines

🎮 Sony Security Breach: Sony admits to security breaches exposing employee data due to “MOVEit” platform vulnerabilities. 🕵️‍♂️💼 🔗Source
🔮 Meta vs Apple: Meta strategizes against Apple's Vision Pro with affordable, controller-free VR headsets focusing on gaming & productivity. 🕶️🍏 🔗Source
🧬 AI Predicts Viral Mutations: Harvard and Oxford design EVEscape, an AI predicting virus mutations for improved vaccine creation. 💉🤖 🔗Source
💰 JPMorgan's Token Platform: JPMorgan introduces blockchain platform, TCN, for efficient tokenized asset transfers; BlackRock is a key client. 🌐💼 🔗Source
🛰️ Starlink Mobile: SpaceX ventures into satellite-based cellular service, aiming for global coverage by 2025. 📱🌍 🔗Source

🧠Demystifying Seattle Airbnb Data Using CRISP-DM

Data science is a multifaceted field with wide-ranging applications, and it encompasses various perspectives and definitions. However, a common goal shared by data scientists is to gain a deeper understanding of the data at hand and create predictive models. To navigate this work, practitioners often follow the "Cross-Industry Standard Process for Data Mining" (CRISP-DM), a framework that standardizes and streamlines the data science process.

The CRISP-DM Process

CRISP-DM outlines a structured approach to the data science lifecycle, which typically consists of six essential steps:

Understand the Business
Understand the Data
Prepare the Data
Model the Data
Evaluate the Results
Deploy

It's worth noting that not all data science inquiries necessitate all six steps. Let's dive into each of these steps and explore their application in the context of the Airbnb dataset for Seattle.

1. Understand the Business

Before delving into data analysis, it's crucial to grasp the essence of the business and its objectives. Airbnb is an online marketplace that offers lodging, primarily home-stays for vacation rentals, and tourism activities. This foundational understanding sets the stage for our data exploration.

2. Understand the Data

To gain insights into the dataset, we must first become acquainted with the data at our disposal. The Airbnb dataset for Seattle comprises three CSV files: one for listings, one for reviews, and one for the calendar. The listings dataset has 3818 rows and 92 columns, and summary statistics for its numeric columns give a quick first picture of the data.

Some observations from the data analysis:

Properties receive relatively few reviews per month.
Most reviews are positive.
The majority of listings lack information in the license field.
Most listings accommodate a small number of guests.

In addition to these insights, we've also identified some data issues:

Price fields do not appear in the list of continuous variables, likely because the "$" sign causes them to be read as text rather than numbers.
Certain fields, such as 'license,' contain missing values that require cleaning.
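
Both issues can be spotted with a quick per-column check. The following is a minimal sketch, not the notebook's own code: it uses plain Python to stay dependency-free (pandas would be the more usual tool), and the sample rows, the license value, and the helper names are hypothetical.

```python
from typing import Dict, List


def is_numeric(value: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize_column(rows: List[Dict[str, str]], column: str) -> Dict[str, object]:
    """Report the share of blank values and whether the rest parse as numbers."""
    values = [(row.get(column) or "").strip() for row in rows]
    present = [v for v in values if v]
    return {
        "missing_share": 1 - len(present) / len(values),
        "numeric": bool(present) and all(is_numeric(v) for v in present),
    }


# Hypothetical rows standing in for a few lines of the listings file.
sample_rows = [
    {"price": "$85.00", "accommodates": "2", "license": ""},
    {"price": "$1,250.00", "accommodates": "6", "license": ""},
    {"price": "$150.00", "accommodates": "4", "license": "STR-001"},
]

summary = {col: summarize_column(sample_rows, col) for col in sample_rows[0]}
for column, stats in summary.items():
    print(column, stats)
```

In this sample, the price column is flagged as non-numeric because of the "$" and "," characters, and two of the three license values are blank.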

Having comprehended the data, we can now address specific questions and apply relevant steps of the CRISP-DM process to answer them.

Which Seattle Neighborhoods Earn the Highest Revenue?

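Since the dataset lacks a direct revenue column, we estimate the number of stays at each listing by assuming that each review represents one stay, and we assume each stay lasts the listing's minimum number of nights. Estimated revenue per listing is then roughly the nightly price multiplied by the minimum nights and by the number of reviews. This is a conservative estimate, since not every guest leaves a review and stays can be longer than the minimum. The following steps are involved: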

1. Clean the data by removing the "$" sign from price values.

2. Calculate estimated revenue for each listing using the number of reviews as the number of stays.

3. Group the per-listing revenues by neighborhood and compute the average revenue per listing in each neighborhood.

The results reveal the neighborhoods with the highest estimated listing revenues, shedding light on the most lucrative areas.
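
The three steps above can be sketched as follows. This is an illustrative sketch in plain Python rather than the notebook's code. The field names (`neighbourhood`, `price`, `minimum_nights`, `number_of_reviews`) and the sample values are assumptions, as is the revenue formula of price times minimum nights times review count.

```python
from collections import defaultdict
from statistics import mean


def clean_price(raw: str) -> float:
    """Step 1: strip the "$" sign and thousands separators from a price string."""
    return float(raw.replace("$", "").replace(",", ""))


def average_revenue_by_neighborhood(listings):
    """Steps 2 and 3: estimate revenue per listing, then average per neighborhood."""
    revenue_by_area = defaultdict(list)
    for listing in listings:
        # Assumed formula: each review is one stay of minimum length.
        revenue = (
            clean_price(listing["price"])
            * int(listing["minimum_nights"])
            * int(listing["number_of_reviews"])
        )
        revenue_by_area[listing["neighbourhood"]].append(revenue)
    averages = {area: mean(values) for area, values in revenue_by_area.items()}
    # Highest average estimated revenue first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical listings; real rows would come from the listings CSV file.
sample_listings = [
    {"neighbourhood": "Belltown", "price": "$100.00",
     "minimum_nights": "2", "number_of_reviews": "10"},
    {"neighbourhood": "Belltown", "price": "$1,000.00",
     "minimum_nights": "1", "number_of_reviews": "4"},
    {"neighbourhood": "Ballard", "price": "$80.00",
     "minimum_nights": "1", "number_of_reviews": "20"},
]

ranking = average_revenue_by_neighborhood(sample_listings)
for area, revenue in ranking:
    print(f"{area}: {revenue:.2f}")
```

Averaging per listing, rather than summing, keeps neighborhoods with many listings from dominating the ranking purely because of their size.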

Which Accommodation Size Receives the Highest Number of Bookings?

Assuming each review corresponds to a successful booking, we can examine the proportion of bookings relative to the number of guests a property can accommodate. This analysis allows us to visualize which accommodation sizes receive the highest number of bookings. Notably, properties accommodating two guests account for over half of all bookings, suggesting a preference for such properties among Airbnb users.
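
A minimal sketch of this calculation, assuming fields named `accommodates` and `number_of_reviews` and using made-up sample values, could look like this:

```python
from collections import Counter


def booking_share_by_capacity(listings):
    """Share of all reviews (used as a proxy for bookings) per guest capacity."""
    reviews_by_capacity = Counter()
    for listing in listings:
        capacity = int(listing["accommodates"])
        reviews_by_capacity[capacity] += int(listing["number_of_reviews"])
    total = sum(reviews_by_capacity.values())
    if total == 0:
        return {}
    return {
        capacity: count / total
        for capacity, count in sorted(reviews_by_capacity.items())
    }


# Hypothetical listings; the proportions are illustrative, not the real results.
sample_listings = [
    {"accommodates": "2", "number_of_reviews": "30"},
    {"accommodates": "2", "number_of_reviews": "25"},
    {"accommodates": "4", "number_of_reviews": "35"},
    {"accommodates": "6", "number_of_reviews": "10"},
]

shares = booking_share_by_capacity(sample_listings)
for capacity, share in shares.items():
    print(f"{capacity} guests: {share:.0%}")
```

Note that a high share for one capacity can reflect supply as well as demand: if most listings accommodate two guests, they will collect most of the reviews even without a stronger preference per listing.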

How Well Can We Predict Listing Prices?

Predicting listing prices is where the modeling step of CRISP-DM comes in. The process consists of several key steps:

Data cleaning, including imputing values for categorical variables.
Data splitting into training and test datasets.
Instantiation of a linear regression model and fitting it to the training data.
Testing the model by predicting the test data.
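
The split, fit, and predict steps can be illustrated with a deliberately simplified sketch. It is not the notebook's model: it fits a one-feature least-squares line (guest capacity against price) in plain Python, skips imputation and categorical encoding, and uses synthetic data. A real analysis would more likely use many features with a library such as scikit-learn. The helper names below are hypothetical.

```python
import random
from statistics import mean


def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, xs):
    """Apply the fitted line to new feature values."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Synthetic (guests, price) pairs lying exactly on price = 40 * guests + 20.
data = [(g, 40.0 * g + 20.0) for g in (1, 2, 2, 3, 4, 4, 5, 6, 8, 10)]

train, test = split_rows(data)
model = fit_line([x for x, _ in train], [y for _, y in train])
predictions = predict(model, [x for x, _ in test])
print(model)
print(list(zip([y for _, y in test], predictions)))
```

Because the synthetic data has no noise, the fitted line recovers the slope and intercept used to generate it; real listing prices would leave residual error on both the training and test sets.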

The evaluation of the model yields valuable insights:

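The R-squared score is 0.5927 on the training dataset and 0.5975 on the test dataset. The two scores are nearly identical, which suggests the model generalizes about as well as it fits; the small gap in favor of the test set is most plausibly an artifact of the random split, for example if the training set happens to contain more outliers.
The root mean square error (RMSE) of the model is reported as 3446.62. For nightly prices in dollars this would be a very large RMSE, so the figure may actually be the mean squared error (MSE), in which case the RMSE would be about 58.71. Either way, there is room for improvement, and the matching training and test scores suggest the model is not yet overfitting.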
Enhancements can be made by incorporating additional variables not analyzed in this study.
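
For reference, the two metrics can be computed as in the sketch below, using made-up actual and predicted prices. R-squared is one minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared error, which puts it in the same units as the price itself.

```python
from math import sqrt
from statistics import mean


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_squared_error(actual, predicted):
    """Average squared difference, in squared price units."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, in the same units as the prices."""
    return sqrt(mean_squared_error(actual, predicted))


# Hypothetical nightly prices and model predictions.
actual_prices = [100.0, 150.0, 200.0, 250.0]
predicted_prices = [110.0, 140.0, 190.0, 260.0]

r2 = r_squared(actual_prices, predicted_prices)
mse = mean_squared_error(actual_prices, predicted_prices)
rmse = root_mean_squared_error(actual_prices, predicted_prices)
print(r2, mse, rmse)
```

Computing the same metrics on both the training and test predictions is what allows the overfitting check described above.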

Finally, a chart of actual versus predicted values gives a visual check of the model's performance.

In conclusion, the CRISP-DM process is a valuable framework for conducting data science investigations. By applying this model to the Airbnb dataset for Seattle, we were able to gain insights into revenue generation, booking trends, and price prediction. Data science is a dynamic field, and using such structured methodologies ensures that we make data-driven decisions and draw actionable conclusions. This journey through the data science process has illuminated the potential for further exploration and optimization in the realm of Airbnb in Seattle.

Source of information: https://www.kaggle.com/code/yifanma/predicting-listing-prices/notebook

If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.