CHAPTER 5
EXPERIMENT SETUP
5.1 Experimental Framework
Python is a prominent environment used by researchers to develop and deploy the
systems they build. It has a vast set of libraries, each with a number of modules and
packages, that help programmers complete their work efficiently.
Figure 5.1: GUI Anaconda
Anaconda is a free, open-source environment. Python and its libraries are used very
efficiently in data science and data analysis. They are also widely used for building
scalable machine learning algorithms. Python can apply various machine learning
techniques such as Classification, Regression, Recommendation, and Clustering.
Python offers researchers a ready-to-use environment for performing data mining tasks
on large volumes and varieties of data effectively and in less time.
Figure 5.2: Libraries of Python (Pandas, SciKit-Learn, Python Utility, SciPy, Matplotlib)
5.2 Dataset & Features
Machine learning data is usually described in a matrix called a dataset. This matrix is
structured so that each row corresponds to an observation (example) and each column
represents a feature (also called a variable or attribute) that describes the data. Data
values can take many representations. Data can be numerical (integer or real numbers) or
nominal, where values are differentiated by name. Nominal data is a type of categorical
data in which, as its name indicates, the values can only come from a fixed set of
names (or categories).
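The row/column layout and the numerical versus nominal distinction described above can be sketched with pandas. This is only an illustration: the rows are made-up toy values, the column names are taken from those mentioned later in this chapter, and the region names other than southeast are assumptions.

```python
import pandas as pd

# Hypothetical toy rows (not taken from the actual dataset).
# Each row is one observation; each column is one feature.
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "bmi": [27.9, 22.7, 30.1, 35.4],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Mark the nominal features as categorical so that each one
# is restricted to a fixed set of named values.
for col in ["smoker", "region"]:
    df[col] = df[col].astype("category")

numeric_cols = df.select_dtypes(include="number").columns.tolist()
nominal_cols = df.select_dtypes(include="category").columns.tolist()

print(df.shape)       # (number of observations, number of features)
print(numeric_cols)   # numerical features
print(nominal_cols)   # nominal (categorical) features
```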
5.3 Implementation
The model employs filters for faster evaluation and less overall time. The pre-processing
methods and the filters applied strongly affect the final evaluation results of the
classifiers (ML-based models). Feature extraction, conversion of nominal attributes to
binary, and cleaning are a few of those filters.
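As a minimal sketch of two of the filters named above (cleaning and nominal-to-binary conversion), the snippet below uses pandas on made-up rows. The specific cleaning steps chosen here (dropping missing values and duplicate rows) are an assumption; the chapter does not state which cleaning steps were actually applied.

```python
import pandas as pd

# Hypothetical toy rows containing one duplicate and one missing value.
raw = pd.DataFrame({
    "age": [19, 33, 33, 45, None],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southeast", "northwest", "northwest", "southeast", "southwest"],
})

# Cleaning (assumed steps): drop rows with missing values, then exact duplicates.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Nominal-to-binary conversion: one 0/1 indicator column per category.
encoded = pd.get_dummies(clean, columns=["smoker", "region"], dtype=int)

print(encoded)
```

Each nominal column is replaced by one indicator column per category that remains after cleaning, so the classifiers receive purely numerical input.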
5.4 Different Process Stages
Figure 5.3: Calling Libraries
Explanation: Figure 5.3 shows the libraries we imported, which provide all the
functionality required.
Figure 5.4: Major Columns
Explanation: Figure 5.4 shows the major columns available in our dataset. We have 9
columns in our dataset.
Figure 5.5: Major Columns
Explanation: Figure 5.5 shows all attributes available in our dataset. We have 7
columns in our dataset.
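A column listing like the one in the figures above could be produced roughly as follows. This is a sketch under assumptions: the CSV text is a made-up stand-in with only the six column names that this chapter mentions, so its column count does not match the figures, and a real run would pass the dataset's file path to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real dataset file.
csv_text = """age,bmi,children,smoker,region,expenses
25,24.0,0,no,southeast,3000.0
40,31.5,2,yes,northwest,25000.0
58,29.0,1,no,southwest,12000.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape[1])           # number of columns
print(df.columns.tolist())   # column names
print(df.dtypes)             # inferred type of each column
```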
Description of Different Columns:
Figure 5.6: Descriptive Summary
Explanation: Figure 5.6 shows a descriptive summary of age, bmi, children and expenses.
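A descriptive summary of this kind is commonly obtained with the pandas describe method. The sketch below assumes that approach and uses made-up values; it is not the output shown in Figure 5.6.

```python
import pandas as pd

# Hypothetical toy values for the four numerical columns.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [22.0, 28.0, 30.0, 32.0],
    "children": [0, 1, 2, 3],
    "expenses": [2000.0, 4000.0, 9000.0, 25000.0],
})

# count, mean, std, min, quartiles and max for each column.
summary = df[["age", "bmi", "children", "expenses"]].describe()

print(summary)
```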
Distribution of age, bmi and expenses:
Figure 5.7: Distribution of smoker, children and region (Pie & Bar Graph)
Explanation: Figure 5.7 shows how the distributions give clear results:
1) There is a roughly equal number of people of all ages.
2) Most people have a bmi of around 30.
3) Finally, expenses appear to be right-skewed (see skewness in probability and
statistics).
Note: Because expenses are skewed, this column can be transformed using either a log
transformation or a square root transformation, which brings it closer to a normal
distribution.
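The effect of the two transformations mentioned in the note can be checked by comparing skewness before and after. The values below are made up to be right-skewed and are only an illustration; the sketch uses the pandas sample skewness, which is one of several possible skewness measures.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed expenses: many small values, a few very large ones.
expenses = pd.Series(
    [1000, 2000, 2500, 3000, 4000, 5000, 8000, 12000, 30000, 60000],
    dtype=float,
)

raw_skew = expenses.skew()             # skewness of the untransformed column
sqrt_skew = np.sqrt(expenses).skew()   # after square root transformation
log_skew = np.log(expenses).skew()     # after log transformation

print(raw_skew, sqrt_skew, log_skew)
```

On this toy series both transformations reduce the skewness, and the log transformation reduces it more than the square root does.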
Distribution of age, bmi and expenses:
Figure 5.8: Distribution of smoker, children and region (Area Graph)
Bivariate Analysis:
Figure 5.9: Bivariate Analysis
Figure 5.9 shows a bivariate analysis, one of the simplest forms of quantitative analysis.
It involves analysing two variables to determine the empirical relationship between
them. It can be helpful in testing simple hypotheses of association.
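One simple way to quantify the relationship between two numerical variables is the Pearson correlation coefficient. The sketch below assumes that measure and uses made-up values; it does not reproduce the analysis behind Figure 5.9.

```python
import pandas as pd

# Hypothetical toy values: expenses rise with age, bmi varies without a clear pattern.
df = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [31.0, 24.0, 35.0, 22.0, 28.0],
    "expenses": [2500.0, 4200.0, 6100.0, 9000.0, 12500.0],
})

# Pearson correlation between each pair of variables.
age_corr = df["age"].corr(df["expenses"])
bmi_corr = df["bmi"].corr(df["expenses"])

print(age_corr, bmi_corr)
```

A coefficient close to 1 or -1 indicates a strong linear association, and a coefficient close to 0 indicates a weak one.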
Impact of smoking and children in Medical Expenses:
Figure 5.10: Impact of smoking in Medical Expenses
Figure 5.10 shows how these columns play vital roles in medical expenses:
The number of children is used as the facet on the row side, whereas region is used as
the facet on the column side.
The lower the age, the lower the expenses.
The expenses of smokers in all regions range from 20k to 60k.
The expenses of non-smokers in all regions range from 10k to 20k.
The lower range of expenses belongs to younger people and vice versa.
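Expense ranges like those listed above can be computed by grouping on the smoker and region columns. The rows below are made up so that they fall inside the ranges quoted in the text; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows chosen to sit inside the ranges quoted in the text.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no"],
    "region": ["southeast", "northwest", "southeast",
               "southeast", "northwest", "northwest"],
    "expenses": [22000.0, 35000.0, 58000.0, 11000.0, 15000.0, 19000.0],
})

# Minimum and maximum expenses for smokers and non-smokers overall.
ranges = df.groupby("smoker")["expenses"].agg(["min", "max"])

# The same ranges broken down by region.
by_region = df.groupby(["region", "smoker"])["expenses"].agg(["min", "max"])

print(ranges)
print(by_region)
```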
Bubble Chart to represent the relation of expenses with different parameters
Figure 5.11: Bubble chart Analysis
Figure 5.11 shows how these columns play vital roles in medical expenses:
The chart makes it clear that BMI is not a strong driver of expenses, as people with a
low BMI also have high medical expenses.
The chart makes it clear that people who smoke have higher medical expenses.
The size of each bubble, which represents age, shows that older people belong to the
higher expense categories.
Polar Graph Explanation Based Upon Region:
Figure 5.12: Polar Analysis
Figure 5.13: Region Wise Categorization
Figures 5.12 and 5.13 show that region also has an impact on medical expenses. The
southeast region has the highest values.
Note: The other 3 regions are very similar in terms of expenses.
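A region-wise comparison of this kind can be sketched by averaging expenses per region. The values are made up and arranged only to mirror the observation above, and the region names other than southeast are assumptions; using the mean as the summary statistic is also an assumption.

```python
import pandas as pd

# Hypothetical toy rows arranged so that southeast stands out.
df = pd.DataFrame({
    "region": ["southeast", "southeast", "southwest", "southwest",
               "northwest", "northwest", "northeast", "northeast"],
    "expenses": [15000.0, 17000.0, 12000.0, 12400.0,
                 12100.0, 12500.0, 12300.0, 12700.0],
})

# Mean expenses per region, highest first.
mean_by_region = (
    df.groupby("region")["expenses"].mean().sort_values(ascending=False)
)
top_region = mean_by_region.idxmax()

print(mean_by_region)
print(top_region)
```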