COM7024
MSc Data Science
Programming for Data Analytics
Investigating the Manchester Housing Market
STU218659
Lee Braiden
Investigating the Manchester Housing Market
The main goal of this report is to examine the Manchester Housing dataset and offer
insights that support informed decisions. The analysis follows the CRISP-DM (Cross-Industry
Standard Process for Data Mining) framework, which comprises six stages: Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
The report covers statistical analyses, an illustration of the Central Limit Theorem and
the use of Python for data analysis.
Business Understanding
The main objective is to identify the factors that affect property values in Manchester,
specifically floor space, construction year, proximity to water (waterfront status) and
available amenities. The study aims to inform pricing strategies, real estate development
choices and potential investment opportunities.
Data Understanding
Dataset Overview
The dataset contains various attributes of properties in Manchester, including:
• Price
• Waterfront status
• Floor Space
• Year Built
• Bedrooms
• Bathrooms
• Location
• Property Type
• Condition
• Lot Size
• Amenities
First, we loaded the dataset and displayed the first 10 rows for initial inspection.
Descriptive Statistics
In this study we computed descriptive statistics for waterfront homes to gain insight into
their characteristics and variation. The findings indicated that waterfront properties
generally command higher prices, offer more spacious living areas and come with a greater
range of amenities than non-waterfront properties.
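A side-by-side summary makes this kind of comparison explicit. The sketch below uses a small hypothetical table (the values are invented for illustration and are not taken from the Manchester dataset); only the column names Waterfront, Price and Floor Space follow the dataset overview.

```python
import pandas as pd

# Hypothetical toy data; the numbers are illustrative only.
toy = pd.DataFrame({
    "Waterfront": [1, 1, 0, 0, 0],
    "Price": [420000.0, 380000.0, 250000.0, 310000.0, 280000.0],
    "Floor Space": [1800.0, 1600.0, 1100.0, 1400.0, 1200.0],
})

# Summarise both groups together rather than describing one group alone,
# so that the waterfront figures can be read against a baseline.
summary = toy.groupby("Waterfront")[["Price", "Floor Space"]].agg(
    ["mean", "median", "std"]
)
print(summary)
```

Grouping on the Waterfront flag keeps both categories in one table, which avoids drawing a comparison from the statistics of a single group.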
Data Preparation
Data Cleaning and Transformation
Missing values must be identified and handled before analysis. We replaced missing values
in categorical variables with the most frequently occurring value (the mode), and verified
and converted data types where needed. This ensured that all data points were usable and
consistent for the analysis.
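As a minimal sketch of the mode imputation described above, the helper below fills gaps in categorical columns with each column's most frequent value. The function name impute_categorical_mode and the toy values are hypothetical; the Amenities column name comes from the dataset overview.

```python
import pandas as pd

def impute_categorical_mode(df, columns):
    """Return a copy of df with missing values in `columns` replaced by the mode."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        # A column that is entirely missing has no mode; leave it unchanged.
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Hypothetical toy data with two missing Amenities entries.
toy = pd.DataFrame({
    "Amenities": ["Garden", None, "Garden", "Garage", None],
    "Price": [250000.0, 310000.0, 280000.0, 400000.0, 295000.0],
})

print(toy.isnull().sum())  # count missing values per column before imputing
cleaned = impute_categorical_mode(toy, ["Amenities"])
print(cleaned["Amenities"].tolist())
```

Working on a copy leaves the raw data untouched, so statistics before and after preprocessing can still be compared.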
Statistical Test: T-test
An independent two-sample t-test was performed to compare the prices of waterfront and
non-waterfront properties. The result was a t-statistic of 0.210 and a p-value of 0.836.
Because the p-value is far above the conventional 0.05 significance level, the test
provides no evidence of a statistically significant difference in mean price between
waterfront and non-waterfront properties.
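The sketch below shows how such a comparison can be run and read against a significance level. It is a variant rather than the exact procedure in the appendix: it uses Welch's t-test (equal_var=False), which does not assume equal variances in the two groups, and it drops missing values first. The helper name compare_group_means, the 0.05 threshold and the synthetic prices are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def compare_group_means(a, b, alpha=0.05):
    """Welch's t-test on two samples; returns (t, p, significant at alpha)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values so they do not turn the result into NaN.
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
    return float(t_stat), float(p_val), bool(p_val < alpha)

# Synthetic prices drawn from the same distribution for both groups
# (hypothetical values, not the Manchester data).
rng = np.random.default_rng(0)
waterfront = rng.normal(300_000, 50_000, size=40)
non_waterfront = rng.normal(300_000, 50_000, size=400)

t_stat, p_val, significant = compare_group_means(waterfront, non_waterfront)
print(f"t = {t_stat:.3f}, p = {p_val:.3f}, significant at 5%: {significant}")
```

A p-value above the chosen threshold means the null hypothesis of equal means is not rejected; it does not show that the means are equal.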
Central Limit Theorem Demonstration
To illustrate the Central Limit Theorem, we drew 1,000 random samples of 30 prices each
(with replacement) from the dataset and plotted the sample means. The distribution of the
sample means resembled a normal distribution. This illustrates that, as the sample size
grows, the distribution of the sample mean approaches a normal distribution, regardless of
whether the original price distribution is normal.
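The effect can also be checked numerically rather than only by eye. The sketch below draws repeated samples from a deliberately skewed synthetic population (a lognormal distribution chosen for illustration, not the Manchester prices) and compares the skewness and spread of the sample means with those of the population. The parameter values are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed "price" population.
population = rng.lognormal(mean=12.3, sigma=0.5, size=10_000)

n, n_samples = 30, 2_000
# Each row is one sample of size n drawn with replacement.
samples = rng.choice(population, size=(n_samples, n), replace=True)
sample_means = samples.mean(axis=1)

pop_skew = stats.skew(population)
means_skew = stats.skew(sample_means)

# The CLT predicts a standard error of sigma / sqrt(n) for the sample mean.
expected_se = population.std() / np.sqrt(n)
observed_se = sample_means.std()

print(f"skewness: population = {pop_skew:.2f}, sample means = {means_skew:.2f}")
print(f"standard error: expected = {expected_se:.0f}, observed = {observed_se:.0f}")
```

The sample means should be much less skewed than the population, and their spread should be close to the population standard deviation divided by the square root of the sample size.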
Modeling and Analysis
Correlation Analysis
Correlation matrices were computed before and after data preprocessing to understand
relationships between numeric variables. Key correlations identified include:
• A moderate positive correlation (0.390) between Floor Space and Price.
• A minor correlation (0.094) between Year Built and Price.
• A minor correlation (0.045) between Waterfront status and Price.
Heatmaps were used to visualize these correlations, highlighting the relationships between
different property attributes.
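Alongside a heatmap, ranking the correlations with Price gives a compact view of which attributes move most closely with it. The sketch below uses a hypothetical four-row table (invented values) with column names taken from the dataset overview.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "Price": [100.0, 200.0, 300.0, 400.0],
    "Floor Space": [10.0, 20.0, 30.0, 40.0],
    "Year Built": [2000, 1990, 2010, 1980],
    "Location": ["A", "B", "A", "C"],  # non-numeric, excluded below
})

# Keep numeric columns only, then rank their Pearson correlation with Price.
numeric = toy.select_dtypes(include=[np.number])
ranked = numeric.corr()["Price"].drop("Price").sort_values(ascending=False)
print(ranked.round(3))
```

Pearson correlation captures only linear association, so a weak coefficient does not rule out a non-linear relationship.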
Visualizations
Several plots were created to visualize relationships between variables:
• Distribution of Floor Space: This histogram showed the spread and central tendency of
floor space across properties.
• Year Built vs. Price: A scatter plot revealed a weak positive trend, indicating that newer
properties tend to be priced slightly higher.
• Floor Space vs. Price: A scatter plot demonstrated a moderate positive relationship,
suggesting that larger properties tend to command higher prices.
• Waterfront vs. Price: A box plot showed that waterfront properties generally have higher
median prices, though the variability within each category was considerable.
Evaluation
The analysis revealed key insights:
• There is a moderate positive correlation between Floor Space and Price.
• Bedrooms and Bathrooms have strong positive correlations with Price.
• Waterfront status has no statistically significant effect on Price, as indicated by the
T-test results (p = 0.836).
These findings suggest that while certain factors like floor space and the number of bedrooms
significantly influence property prices, others like the year built and waterfront status have less
impact.
Recommendations
1. Focus on Floor Space and Amenities: Properties with larger floor space and better
amenities should be priced higher, as these factors significantly influence property prices.
2. Year Built Consideration: While newer properties are slightly more valuable, this factor
is less significant compared to floor space and amenities.
3. Investment in Non-Waterfront Properties: Given that no statistically significant price
difference was found between waterfront and non-waterfront properties, investing in
well-located non-waterfront properties with good amenities might be more cost-effective.
The investigation of the Manchester Housing dataset has provided information about the
factors affecting property prices. By following the CRISP-DM framework, we systematically
studied the data, applied statistical techniques and drew conclusions to guide strategic
decisions.
Appendix
# Importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
# Path of Manchester Housing dataset
file_path = r'C:\Users\Administrator\Desktop\DataAnalytics\manchester_housing_data.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the dataset
print("First 10 rows of the dataset:")
print(data.head(10))
# Descriptive statistics for waterfront properties
print("statistics for waterfront properties:")
waterfront_properties = data[data['Waterfront'] == 1]
print(waterfront_properties.describe())
# Graph the distribution of floor space
plt.figure(figsize=(10, 6))
sns.histplot(data['Floor Space'], kde=True)
plt.title('Distribution of Floor Space')
plt.xlabel('Floor Space (sq ft)')
plt.ylabel('Frequency')
plt.show()
# Correlation matrix for numeric columns
print("\nCorrelation matrix for numeric columns:")
numeric_cols = data.select_dtypes(include=[np.number])
correlation_matrix = numeric_cols.corr()
print(correlation_matrix)
# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
# Scatter plot for Year Built vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Year Built', y='Price', data=data)
plt.title('Year Built vs. Price')
plt.xlabel('Year Built')
plt.ylabel('Price')
plt.show()
# Scatter plot for Floor Space vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Floor Space', y='Price', data=data)
plt.title('Floor Space vs. Price')
plt.xlabel('Floor Space (sq ft)')
plt.ylabel('Price')
plt.show()
# Box plot for Waterfront vs. Price
plt.figure(figsize=(10, 6))
sns.boxplot(x='Waterfront', y='Price', data=data)
plt.title('Waterfront vs. Price')
plt.xlabel('Waterfront')
plt.ylabel('Price')
plt.show()
# Correlation between Floor Space and Price
correlation_floor_space_price = data['Floor Space'].corr(data['Price'])
print(f"\nCorrelation between Floor Space and Price: {correlation_floor_space_price:.3f}")
# Correlation between Year Built and Price
correlation_year_price = data['Year Built'].corr(data['Price'])
print(f"Correlation between Year Built and Price: {correlation_year_price:.3f}")
# Central Limit Theorem
sample_means = []
for _ in range(1000):
    # Draw 30 prices with replacement and record the sample mean
    sample = data['Price'].sample(30, replace=True)
    sample_means.append(sample.mean())
plt.figure(figsize=(10, 6))
sns.histplot(sample_means, kde=True)
plt.title('Sampling Distribution of the Sample Mean [Central Limit Theorem]')
plt.xlabel('Sample Mean of Price')
plt.ylabel('Frequency')
plt.show()
# T-test (Statistical test) to compare prices of waterfront vs. non-waterfront properties
print("\nPerforming T-test to compare prices of waterfront vs. non-waterfront properties:")
waterfront_prices = data[data['Waterfront'] == 1]['Price']
non_waterfront_prices = data[data['Waterfront'] == 0]['Price']
t_stat, p_val = stats.ttest_ind(waterfront_prices, non_waterfront_prices)
print(f"Results: t-statistic = {t_stat:.3f}, p-value = {p_val:.3f}")
# Identifying missing values in data
print("\nIdentifying missing values in the dataset:")
missing_values = data.isnull().sum()
print("Missing Values in Dataset:\n", missing_values)
# Impute missing values
data['Amenities'] = data['Amenities'].fillna(data['Amenities'].mode()[0])
# Checking data types and converting them if necessary
data['Price'] = data['Price'].astype(float)
data['Waterfront'] = data['Waterfront'].astype(int)
data['Floor Space'] = data['Floor Space'].astype(float)
data['Year Built'] = data['Year Built'].astype(int)
# To use only numeric columns for correlation
numeric_cols_post = data.select_dtypes(include=[np.number])
correlation_matrix_post = numeric_cols_post.corr()
print("\nCorrelation Matrix after Preprocessing:\n", correlation_matrix_post)
# Visualize the updated correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_post, annot=True, cmap='coolwarm')
plt.title('Updated Correlation Matrix')
plt.show()
Output (in sequence)