Descriptive statistics in data analysis:
Descriptive statistics are essential in data analysis as they summarize and organize
data to make it easier to understand. These statistics provide insights into the central
tendency, variability, and distribution of the dataset without making inferences or
predictions. Here are the key components of descriptive statistics:
### 1. **Measures of Central Tendency**
These describe the center point of a dataset and indicate where most data points are
concentrated.
- **Mean**: The arithmetic average of all data points.
- **Median**: The middle value that separates the higher half from the lower half of
the data.
- **Mode**: The most frequently occurring value in the dataset.
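The three measures above can be computed with Python's standard-library `statistics` module; the sample data below is invented for illustration:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average: 30 / 6
median = statistics.median(data)  # middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 5 4.0 3
```

With an even number of values, `median` averages the two middle values (here 3 and 5), which is why it returns 4.0.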
### 2. **Measures of Dispersion (Variability)**
These indicate how spread out the data is.
- **Range**: The difference between the highest and lowest values.
- **Variance**: Measures the spread of the data points from the mean (the average
of squared differences from the mean).
- **Standard Deviation**: The square root of the variance, indicating the average
distance of each data point from the mean.
- **Interquartile Range (IQR)**: The range of the middle 50% of the data,
calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
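These dispersion measures can likewise be sketched with the standard library; the data is invented, and the population (rather than sample) variants are used here:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]

rng = max(data) - min(data)             # range: 9 - 3 = 6
var = statistics.pvariance(data)        # population variance
sd = statistics.pstdev(data)            # square root of the variance
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                           # interquartile range: Q3 - Q1
```

For this data the mean is 6, so the variance is 28/7 = 4 and the standard deviation is 2.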
### 3. **Shape of the Distribution**
- **Skewness**: Measures the asymmetry of the data distribution. A skewed
distribution can be:
- **Positively skewed**: Tail on the right side.
- **Negatively skewed**: Tail on the left side.
- **Kurtosis**: Indicates the "tailedness" or peak of the distribution compared to a
normal distribution.
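The standard library has no built-in skewness function, but it can be sketched directly from its definition as the average cubed z-score (a population-skewness sketch; the sample data is invented):

```python
import statistics

def skewness(data):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum(((x - m) / sd) ** 3 for x in data) / n

right_skewed = [1, 2, 2, 3, 10]    # long tail on the right
print(skewness(right_skewed) > 0)  # True: positively skewed
```

A positive result indicates a right (positive) skew, a negative result a left (negative) skew, and a value near zero a roughly symmetric distribution.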
### 4. **Frequency Distribution**
- **Counts/Proportions**: Displays how often each value or range of values occurs
in the dataset.
- **Histograms**: Graphical representations of the frequency distribution.
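A minimal frequency table can be built with `collections.Counter`; the survey responses below are invented:

```python
from collections import Counter

responses = ["yes", "no", "yes", "yes", "no", "maybe"]

counts = Counter(responses)                         # absolute frequencies
total = sum(counts.values())
proportions = {k: v / total for k, v in counts.items()}  # relative frequencies

print(counts)              # Counter({'yes': 3, 'no': 2, 'maybe': 1})
print(proportions["yes"])  # 0.5
```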
### 5. **Percentiles and Quartiles**
- **Percentiles**: Values that divide the data into 100 equal parts. For example, the
90th percentile is the value below which 90% of the data falls.
- **Quartiles**: Special percentiles that divide the data into four equal parts, with
the first quartile (Q1), second quartile (Q2 or the median), and third quartile (Q3).
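`statistics.quantiles` produces both percentiles (`n=100`) and quartiles (`n=4`); note that its default "exclusive" method interpolates between data points, so cut points need not be values from the data:

```python
import statistics

data = list(range(1, 101))  # the values 1..100

percentiles = statistics.quantiles(data, n=100)  # 99 cut points
quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3

p90 = percentiles[89]  # 90th percentile
print(p90)             # 90.9
print(quartiles)       # [25.25, 50.5, 75.75]
```

Roughly 90% of the values fall below the 90th percentile, and the middle element of `quartiles` is the median (Q2).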
### 6. **Visualization**
Descriptive statistics are often represented graphically through:
- **Bar Charts**: For categorical data.
- **Histograms**: For continuous data.
- **Box Plots**: Show the distribution and highlight outliers.
- **Pie Charts**: For showing proportions.
In data analysis, descriptive statistics are typically the first step, helping researchers
or analysts understand the basic features of the data before moving to inferential
statistics or predictive models.
What do you mean by data coding? Describe the data coding and entry procedure in
the SPSS platform.
Data coding refers to the process of assigning numerical or categorical values to
responses or data points, especially when dealing with qualitative or non-numerical
data. This step is essential for data analysis because statistical software like SPSS
(Statistical Package for the Social Sciences) requires data to be in numerical form for
analysis.
Data Coding Procedure in SPSS:
1. Identify Variables and Responses:
○ Start by listing all the variables (e.g., gender, age, education level) and
their possible responses.
○ For qualitative data (e.g., gender: male, female), assign numeric codes
to each response. For example, male = 1, female = 2.
2. Create a Codebook:
○ This is a guide that outlines each variable, its values, and what the
numeric codes represent. For example:
■ Gender: 1 = Male, 2 = Female
■ Education: 1 = High School, 2 = Bachelor’s, 3 = Master’s, etc.
3. Coding Missing Data:
○ If there is missing data, define a code for missing responses, e.g., 99
or -1, to differentiate from valid responses.
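Although the coding itself is normally done inside SPSS, the scheme can be sketched in Python to make it concrete; the codebook, variable names, and missing-value code below are invented examples:

```python
# Hypothetical codebook mapping text responses to numeric codes,
# with 99 as the agreed code for missing responses.
CODEBOOK = {
    "gender": {"Male": 1, "Female": 2},
    "education": {"High School": 1, "Bachelor's": 2, "Master's": 3},
}
MISSING = 99

def encode(variable, response):
    """Return the numeric code for a response, or the missing code."""
    return CODEBOOK[variable].get(response, MISSING)

print(encode("gender", "Female"))  # 2
print(encode("education", None))   # 99 (missing)
```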
Data Entry Procedure in SPSS:
1. Define Variables:
○ Open SPSS and go to the Variable View tab.
○ In this tab, define each variable by giving it a name, setting its type
(numeric, string, date), width, decimal places, and label (description of
the variable).
2. Assign Value Labels:
○ For categorical variables, click on the "Values" column next to the
variable name. Enter the numeric codes (e.g., 1, 2, 3) and their
corresponding labels (e.g., Male, Female).
3. Data Entry:
○ Switch to the Data View tab.
○ Enter the data based on the coding scheme from your codebook. For
each respondent or observation, input the numeric codes that
correspond to their responses.
4. Handling Missing Data:
○ If data is missing for any variable, enter the pre-defined code for
missing values (e.g., 99).
5. Saving the Dataset:
○ After entering the data, save the file by going to File > Save As and
saving it as an SPSS file (.sav format).
By following these steps, you ensure that the data is entered systematically and is
ready for analysis in SPSS.
7 Fundamental Steps of Data Mining
What is Data Mining?
Data mining is a method of extracting and scrutinising huge quantities of
data and searching for recurring relationships among them. It helps businesses
with tasks such as spam detection or fraud detection in their databases. As a result,
businesses can make better decisions by avoiding prior mistakes in their future
endeavours. Data mining is different from data analysis and follows a separate
method.
7 Fundamental Steps of Data Mining
1. Cleaning of Incomplete Data: The first step in data mining is cleaning
incomplete or dirty data in order to meet industry standards. Otherwise, there
will be system failures and poor insights, which cost more time and
effort. Depending on the requirements of specific industries, specialists use various
methods or tools to accomplish this task.
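One common cleaning rule is dropping records with missing fields; a minimal sketch, with invented records (real pipelines apply many such rules):

```python
records = [
    {"id": 1, "age": 34, "city": "Dhaka"},
    {"id": 2, "age": None, "city": "Sylhet"},  # incomplete record
    {"id": 3, "age": 27, "city": None},        # incomplete record
    {"id": 4, "age": 41, "city": "Khulna"},
]

# Keep only records where every field has a value.
clean = [r for r in records if all(v is not None for v in r.values())]
print(len(clean))  # 2
```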
2. Integration of Data:
In the second step, specialists perform data integration, which means combining
data from multiple sources and datasets so that it can be analysed together. It is a
crucial step in which the combined databases undergo a second layer of data
cleaning. The main purpose here is to improve data quality by eliminating
inconsistent information.
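Integration can be sketched as joining two sources on a shared key; the customer ids and fields below are hypothetical:

```python
# Two data sources keyed by a hypothetical customer id.
sales = {101: {"name": "Rahim", "total": 250},
         102: {"name": "Karim", "total": 90}}
support = {101: {"tickets": 3},
           102: {"tickets": 0}}

# Merge the records for each customer into one integrated record.
integrated = {cid: {**sales[cid], **support.get(cid, {})}
              for cid in sales}

print(integrated[101])  # {'name': 'Rahim', 'total': 250, 'tickets': 3}
```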
3. Reduction of Data:
Now that the cleaning process is complete, it is time for data reduction so that
quality improves further. Specialists condense the data into a smaller, simpler
structure that still summarises its main message. Machine learning is often used
along with several data mining tools for smooth performance in this third
step.
4. Transformation of Data:
Every data mining task has its own mining goals, which are clarified in the fourth
step. This is the phase in which specialists transform the prepared data using
methods such as data mapping, normalisation and aggregation. As a result, data
quality improves further and the specialists move one step closer to the final
report.
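Normalisation, one of the methods named above, can be illustrated with min-max scaling, which rescales values to the range [0, 1]; the sample values are invented:

```python
values = [10, 20, 30, 40, 50]

# Min-max normalisation: (v - min) / (max - min).
lo, hi = min(values), max(values)
normalised = [(v - lo) / (hi - lo) for v in values]

print(normalised)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```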
5. Data Mining:
Though the entire process is known as data mining, this step specifically covers the
mining tasks themselves. Modelling techniques used in this step include
classification, clustering and others. Specialists apply multiple data mining tools and
other intelligent methods to produce models, which represent the extracted
information.
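As a toy illustration of one technique named above, here is clustering sketched as a tiny one-dimensional k-means with two clusters; the points and starting centres are invented, and real mining tools handle many dimensions and edge cases:

```python
def kmeans_1d(points, c1, c2, iters=10):
    """Two-cluster 1-D k-means: assign points to the nearer centre,
    then move each centre to the mean of its cluster."""
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(a) / len(a)
        c2 = sum(b) / len(b)
    return sorted([c1, c2])

centres = kmeans_1d([1, 2, 3, 20, 21, 22], c1=0.0, c2=10.0)
print(centres)  # [2.0, 21.0]
```

The two centres settle on the means of the two obvious groups, which is the kind of extracted structure a mining model represents.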
6. Pattern Analysis:
Data mining is a process that uncovers patterns of relationships within the data. In
the sixth step, specialists finalise their insights and discuss them with business
owners so that new decisions can be made. Everything from sales to employee
behaviour and customer needs is discussed at this step.
7. Sharing the Final Report:
After the discussion, the specialists present their final report, which includes all
relevant information about the process, together with their insights on overall
business performance and its recurring problems. Companies receive the report,
recognise the patterns in their behaviour, and can work to improve them in the
future.