TRẦN THỊ LAN ANH
Lớp:CNTTK20P
MãSV: DTC21H4802010537
Access to Safe drinking water is eassently a global issue. The World Health Organization
(WHO) estimates that half of all people in the world are affected by the lack of safe drinking
water. With this assesment, we will explore the data and look for patterns in the data to analyze if
the given data is a good indicator of safe drinking water.
In [20]:
# Import all required libraries
import pandas as pd
import numpy as np
import [Link] as plt
import seaborn as sns
from scipy import stats
[Link]('ignore', category=[Link])
#sns.set_context('notebook')
Data Set
The dataset is downloaded from [Link] and is available for download at:
[Link]
EDA - Exploratory Data Analysis
In this section we will explore the data and look for patterns in the data to analyze if the given
data is a good indicator of safe drinking water.
1) Describe the data
2) Visualize the data
3) Identify the missing values and fill them
4) Identify the outliers and remove them
5) Identify the categorical variables and encode them (if any)
6) Identify the numerical variables and perform basic statistical analysis
In [2]:
#2 Đường dẫn tới file CSV
df = pd.read_csv(r'G:/python excel/water_potability.csv')
In [3]:
# Read the csv file from the url
In [4]:
# Print the first 5 rows of the dataframe
display([Link]())
Chlor Trihalo Tur Pota
Hard Sulfa Condu Organic
ph Solids amine methane bidit bilit
ness te ctivity _carbon
s s y y
204.8 20791. 7.3002 368.5 564.30 10.3797 86.99097 2.96
0 NaN 0
90455 318981 12 16441 8654 83 0 3135
3.71
129.4 18630. 6.6352 592.88 15.1800 56.32907 4.50
1 608 NaN 0
22921 057858 46 5359 13 6 0656
0
8.09
224.2 19909. 9.2758 418.60 16.8686 66.42009 3.05
2 912 NaN 0
36259 541732 84 6213 37 3 5934
4
8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6
9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3
More information about the data
ph - PH is an important parameter in evaluating the acid–base balance of water. It is also the
indicator of acidic or alkaline condition of water status. WHO has recommended maximum
permissible limit of pH from 6.5 to 8.5. The current investigation ranges were 6.52–6.83 which
are in the range of WHO standards.
Hardness - Hardness is a measure of the physical properties of the water. It is a measure of the
ability of the water to support the roots and the leaves. The lower the hardness, the more support
the roots and leaves can have.
Solids (Total dissolved solids - TDS) - TDS is a measure of the solids in the water. The water
with high TDS value indicates that water is highly mineralized. Desirable limit for TDS is 500
mg/l and maximum limit is 1000 mg/l which prescribed for drinking purpose.
Chloramines - Chlorine and chloramine are the major disinfectants used in public water systems.
Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking
water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are
considered safe in drinking water.
Sulfate - Sulfate is a common disinfectant used in public water systems. Sulfate levels up to 2
milligrams per liter (mg/L or 2 parts per million (ppm)) are considered safe in drinking water.
Conductivity - Pure water is not a good conductor of electric current rather’s a good insulator.
Increase in ions concentration enhances the electrical conductivity of water. Generally, the
amount of dissolved solids in water determines the electrical conductivity. Electrical
conductivity (EC) actually measures the ionic process of a solution that enables it to transmit
current. According to WHO standards, EC value should not exceeded 400 μS/cm.
Organic_carbon - Total Organic Carbon (TOC) in source waters comes from decaying natural
organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of
carbon in organic compounds in pure water. According to US EPA < 2 mg/L as TOC in treated /
drinking water, and < 4 mg/Lit in source water which is use for treatment.
Trihalomethanes - THMs are chemicals which may be found in water treated with chlorine. The
concentration of THMs in drinking water varies according to the level of organic material in the
water, the amount of chlorine required to treat the water, and the temperature of the water that is
being treated. THM levels up to 80 ppm is considered safe in drinking water.
Turbidity - Turbidity is a measure of the water’s ability to absorb particulate matter. The lower
the turbidity, the more it can absorb particulate matter.
Potability (Target variable) - Indicates if water is safe for human consumption where 1 means
Potable and 0 means Not potable.
In [5]:
# datatypes of the columns
print([Link]())
ph float64
Hardness float64
Solids float64
Chloramines float64
Sulfate float64
Conductivity float64
Organic_carbon float64
Trihalomethanes float64
Turbidity float64
Potability int64
dtype: object
In [6]:
# Describe the data
print([Link]())
Out[6]:
Chlor Cond Organi Trihalo
Hard Sulfa Turb Pota
ph Solids amin uctivi c_carb methan
ness te idity bility
es ty on es
co 2785. 3276. 3276. 3276. 2495. 3276. 3276. 3276.
3276.00 3114.00
un 0000 0000 00000 00000 0000 00000 0000 0000
0000 0000
t 00 00 0 0 00 0 00 00
m 196.3 22014 333.7
7.080 7.122 426.2 14.2849 66.3962 3.966 0.390
ea 6949 .0925 7577
795 277 05111 70 93 786 110
n 6 26 7
8768.
st 1.594 32.87 1.583 41.41 80.82 3.30816 16.1750 0.780 0.487
57082
d 320 9761 085 6840 4064 2 08 382 849
8
129.0
mi 0.000 47.43 320.9 0.352 181.4 2.20000 0.73800 1.450 0.000
0000
n 000 2000 42611 000 83754 0 0 000 000
0
176.8 15666 307.6
25 6.093 6.127 365.7 12.0658 55.8445 3.439 0.000
5053 .6902 9949
% 092 421 34414 01 36 711 000
8 97 8
196.9 20927 333.0
50 7.036 7.130 421.8 14.2183 66.6224 3.955 0.000
6762 .8336 7354
% 752 299 84968 38 85 028 000
7 07 6
216.6 27332 359.9
75 8.062 8.114 481.7 16.5576 77.3374 4.500 1.000
6745 .7621 5017
% 066 887 92304 52 73 320 000
6 27 0
Chlor Cond Organi Trihalo
Hard Sulfa Turb Pota
ph Solids amin uctivi c_carb methan
ness te idity bility
es ty on es
323.1 61227 481.0
m 14.00 13.12 753.3 28.3000 124.000 6.739 1.000
2400 .1960 3064
ax 0000 7000 42620 00 000 000 000
0 08 2
In [7]:
# Check if there are any null columns
print([Link]().sum())
Out[7]:
ph 491
Hardness 0
Solids 0
Chloramines 0
Sulfate 781
Conductivity 0
Organic_carbon 0
Trihalomethanes 162
Turbidity 0
Potability 0
dtype: int64
In [8]:
# Lets try to plot misisng values
[Link](figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
[Link]('Number of Missing Values in Each Column')
[Link]('Columns')
[Link]('Number of Missing Values')
[Link](rotation=45)
[Link]()
Analyze ph column
In [9]:
# for ph column
# set the histogram, mean and median
#9 # Tính toán trung bình và trung vị của cột 'ph'
mean_ph = df['ph'].mean()
median_ph = df['ph'].median()
# Hiển thị kết quả
print(f"Trung bình của cột 'ph': {mean_ph}")
print(f"Trung vị của cột 'ph': {median_ph}")
# Vẽ biểu đồ cột cho cột 'ph'
[Link](figsize=(10, 6))
[Link](df['ph'], kde=False, bins=20, color='skyblue', edgecolor='black')
[Link](mean_ph, color='r', linestyle='dashed', linewidth=1, label=f'Mean: {mean_ph:.2f}')
[Link](median_ph, color='g', linestyle='dashed', linewidth=1, label=f'Median:
{median_ph:.2f}')
[Link]('Distribution of pH')
[Link]('pH')
[Link]('Frequency')
[Link]()
[Link]()
Based on the above data, we can impute ph with either mean or median. There is no skweness in
the data.
Analyze Sulfate column
In [10]:
#10# Mô tả thống kê cơ bản của cột 'Sulphate'
sulfate_values = df['Sulfate'].dropna() # Drop NaN values for plotting
mean_sulfate = sulfate_values.mean()
median_sulfate = sulfate_values.median()
[Link](figsize=(10, 6))
[Link](sulfate_values, bins=30, kde=False, color='skyblue', edgecolor='black')
# Add mean and median lines
[Link](mean_sulfate, color='red', linestyle='dashed', linewidth=2, label=f'Mean:
{mean_sulfate:.2f}')
[Link](median_sulfate, color='green', linestyle='dashed', linewidth=2, label=f'Median:
{median_sulfate:.2f}')
[Link]('Histogram of Sulfate values')
[Link]('Sulfate')
[Link]('Frequency')
[Link]()
[Link]()
Based on the above data, we can impute Sulphate with either mean or median.
Analyze Trihalomethanes column
In [11]:
#11# Mô tả thống kê cơ bản của cột 'Trihalomethanes'
print("Thống kê cơ bản của cột 'Trihalomethanes':")
print(df['Trihalomethanes'].describe())
# Kiểm tra giá trị thiếu trong cột 'Trihalomethanes'
print("\nSố lượng giá trị thiếu trong cột 'Trihalomethanes':")
print(df['Trihalomethanes'].isnull().sum())
# Tính toán trung bình, trung vị, độ lệch chuẩn
mean_thm = df['Trihalomethanes'].mean()
median_thm = df['Trihalomethanes'].median()
std_thm = df['Trihalomethanes'].std()
# Hiển thị kết quả
print(f"\nTrung bình của cột 'Trihalomethanes': {mean_thm}")
print(f"Trung vị của cột 'Trihalomethanes': {median_thm}")
print(f"Độ lệch chuẩn của cột 'Trihalomethanes': {std_thm}")
Based on the above data, we can impute Trihalomethanes with either mean or median
# Vẽ biểu đồ phân phối của cột 'Trihalomethanes'
[Link](figsize=(10, 6))
[Link](df['Trihalomethanes'], kde=True, bins=20, color='skyblue', edgecolor='black')
[Link](mean_thm, color='r', linestyle='dashed', linewidth=1, label=f'Mean: {mean_thm:.2f}')
[Link](median_thm, color='g', linestyle='dashed', linewidth=1, label=f'Median:
{median_thm:.2f}')
[Link]('Distribution of Trihalomethanes')
[Link]('Trihalomethanes')
[Link]('Frequency')
[Link]()
[Link]().
Missing Value imputation
Missing values in ph column
In [12]:
# impute missing values with mean
# 12# Kiểm tra giá trị thiếu trong cột 'ph'
missing_ph_values = df['ph'].isnull().sum()
# Hiển thị số lượng giá trị thiếu trong cột 'ph'
print(f"Số lượng giá trị thiếu trong cột 'ph': {missing_ph_values}")
Identify outliers in the data
In [13]:
# check outliers
# 13# Đọc dữ liệu vào filr, sau đó tính toán
# Tính toán các phân vị Q1 và Q3
Q1 = df['ph'].quantile(0.25)
Q3 = df['ph'].quantile(0.75)
# Tính toán IQR
IQR = Q3 - Q1
# Xác định các điểm ngoại lai
outliers = df[(df['ph'] < (Q1 - 1.5 * IQR)) | (df['ph'] > (Q3 + 1.5 * IQR))]
# Hiển thị số lượng và các điểm ngoại lai
print(f"Số lượng điểm ngoại lai trong cột 'ph': {len(outliers)}")
print("Các điểm ngoại lai trong cột 'ph':")
print(outliers)
Identify corrleation between variables
In [14]:
# # Tính toán ma trận hệ số tương quan
correlation_matrix = [Link]()
# Hiển thị ma trận hệ số tương quan
print("Ma trận hệ số tương quan:")
print(correlation_matrix)
# Vẽ ma trận hệ số tương quan dưới dạng heatmap
[Link](figsize=(12, 8))
[Link](correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
[Link]('Correlation Matrix')
[Link]()
There are no categorical variables in the dataset.
Identify skweness in the data
In [15]:
# identify skewness
# Enter your code here
# Showing the skewed columns
# Enter your code here
Number of skewed columns : 0
Out[15]:
#15# Tính toán độ méo lệch của các biến số liên tục
skewness = [Link]()
# Xác định các cột có độ méo lệch dương (lệch phải)
right_skewed_columns = skewness[skewness > 0].[Link]()
# Hiển thị số lượng và các cột bị lệch phải
print(f"Số lượng cột bị lệch phải: {len(right_skewed_columns)}")
print("Các cột bị lệch phải:")
print(right_skewed_columns)
Skew
There are no skew in our data :)
Lets see the distribution of Potability
In [31]:
#16(31)# Hiển thị phân phối của khả năng uống được
[Link](figsize=(10, 6))
[Link](x='Potability', data=df, palette='viridis')
[Link]('Phân phối của khả năng uống được')
[Link]('Khả năng uống được')
[Link]('Số lượng')
[Link]()
Out[31]:
0 1998
1 1278
Name: Potability, dtype: int64
Feature Transformation
In [16]:
# print the dataframe head
# #16
print([Link](5).to_string(index=False))
Out[16]:
Chlor Trihalo Tur Pota
Hard Sulfa Condu Organic
ph Solids amine methane bidit bilit
ness te ctivity _carbon
s s y y
7.08
204.8 20791. 7.3002 368.5 564.30 10.3797 86.99097 2.96
0 079 0
90455 318981 12 16441 8654 83 0 3135
5
3.71
129.4 18630. 6.6352 333.7 592.88 15.1800 56.32907 4.50
1 608 0
22921 057858 46 75777 5359 13 6 0656
0
8.09
224.2 19909. 9.2758 333.7 418.60 16.8686 66.42009 3.05
2 912 0
36259 541732 84 75777 6213 37 3 5934
4
8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6
9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3
In [17]:
# Feature transformation
# scale the numeric columns
# 17
import pandas as pd
from [Link] import StandardScaler
# Đường dẫn tới file CSV
df = pd.read_csv(r'G:/python excel/water_potability.csv')
# Chọn các cột số liên tục
continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']
# Khởi tạo bộ chuyển đổi chuẩn hóa
scaler = StandardScaler()
# Áp dụng chuẩn hóa
df[continuous_columns] = scaler.fit_transform(df[continuous_columns])
print([Link]())
In [18]:
# After transformation print the dataframe head
# #18
import pandas as pd
from [Link] import StandardScaler, MinMaxScaler
# Đọc dữ liệu từ file CSV
df = pd.read_csv('G:/python excel/water_potability.csv')
# Chọn các cột số liên tục
continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']
# Khởi tạo bộ chuyển đổi chuẩn hóa
scaler = StandardScaler()
# Áp dụng chuẩn hóa
df[continuous_columns] = scaler.fit_transform(df[continuous_columns])
# Hiển thị phần đầu của DataFrame sau khi chuẩn hóa
print("DataFrame sau khi chuẩn hóa:")
print([Link]())
Out[18]:
Hard Soli Chlora Sulf Condu Organic Trihalom Turb Pota
ph
ness ds mines ate ctivity _carbon ethanes idity bility
- -
0.00 0.198 0.0854 1.04 1.2271 -
0 0.01 1.028759 0.935 0
0000 981 92 3542 78 0.854560
1702 210
- - - -
0.00 1.4734 0.514
1 2.11 1.696 0.19 0.2490 0.214093 -0.502884 0
0000 06 449
3014 382 6962 88
- - -
0.63 0.684 1.0795 0.00
2 0.08 0.0282 0.590024 0.001189 0.847 0
9503 850 58 0000
7287 51 715
-
0.77 0.437 0.09 0.4674 0.69 0.635
3 0.5050 0.939076 1.695662 0
6180 145 3483 46 4190 242
79
- - - - -
1.26 - 0.113
4 0.398 0.25 0.2936 0.71 0.2022 -1.718287 0
3161 0.592197 188
477 2771 90 0100 62
Save the cleaned data
In [30]:
# #19(30)
# Chọn các cột số liên tục
continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']
Hypothesis Testing
We define a hypothesis to test in our data set
Hypothesis 1:
Null: Increase in pH is associated with increase in Solids
Alternate : No relataion between ph and Solids
In [29]:
# import pandas as pd
from scipy import stats
import seaborn as sns
import [Link] as plt
# Đọc dữ liệu từ file CSV
df = pd.read_csv('G:/python excel/water_potability.csv')
# Giả sử cột 'Solid' là tên cột tương ứng
# Tính toán hệ số tương quan giữa pH và Solid
stat, p_value = stats.ttest_ind(df['ph'], df['Solids'])
print("T-test result: statistic = {}, p-value = {}".format(stat, p_value))
if p_value < 0.05:
print("Giá trị p nhỏ hơn 0,05, vì vậy bác bỏ giả thuyết không ở mức ý nghĩa 5%.")
else:
print("Giá trị p không nhỏ hơn 0,05, vì vậy không thể bác bỏ giả thuyết khống ở mức ý nghĩa
5%.")
Out[29]:
Ttest_indResult(statistic=-4.476932191647608, pvalue=7.705940306619221e-06)
the p value is less than 0.05 , so we are rejecting the null hypothesis at 5% significance level.
Next Step in analyzing the data
Write here
Quality of data
Write here
Key findings
Write here