Descriptive statistics in data analysis:
Descriptive statistics are essential in data analysis as they summarize and organize
data to make it easier to understand. These statistics provide insights into the central
tendency, variability, and distribution of the dataset without making inferences or
predictions. Here are the key components of descriptive statistics:
### 1. **Measures of Central Tendency**
These describe the center point of a dataset and indicate where most data points are
concentrated.
- **Mean**: The arithmetic average of all data points.
- **Median**: The middle value that separates the higher half from the lower half of
the data.
- **Mode**: The most frequently occurring value in the dataset.
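The three measures above can be computed with Python's standard-library `statistics` module; the sample data below is invented for illustration:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average: 30 / 6
median = statistics.median(data)  # middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 5 4.0 3
```

With an even number of values, `median` averages the two middle values (here 3 and 5), which is why it returns 4.0.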
### 2. **Measures of Dispersion (Variability)**
These indicate how spread out the data is.
- **Range**: The difference between the highest and lowest values.
- **Variance**: Measures the spread of the data points from the mean (the average
of squared differences from the mean).
- **Standard Deviation**: The square root of the variance, indicating the average
distance of each data point from the mean.
- **Interquartile Range (IQR)**: The range of the middle 50% of the data,
calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
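These dispersion measures can likewise be sketched with the standard library; the data is invented, and the population (rather than sample) variants are used here:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]

rng = max(data) - min(data)             # range: 9 - 3 = 6
var = statistics.pvariance(data)        # population variance
sd = statistics.pstdev(data)            # square root of the variance
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                           # interquartile range: Q3 - Q1
```

For this data the mean is 6, so the variance is 28/7 = 4 and the standard deviation is 2.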
### 3. **Shape of the Distribution**
- **Skewness**: Measures the asymmetry of the data distribution. A skewed
distribution can be:
- **Positively skewed**: Tail on the right side.
- **Negatively skewed**: Tail on the left side.
- **Kurtosis**: Indicates the "tailedness" or peak of the distribution compared to a
normal distribution.
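The standard library has no built-in skewness function, but it can be sketched directly from its definition as the average cubed z-score (a population-skewness sketch; the sample data is invented):

```python
import statistics

def skewness(data):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.mean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum(((x - m) / sd) ** 3 for x in data) / n

right_skewed = [1, 2, 2, 3, 10]    # long tail on the right
print(skewness(right_skewed) > 0)  # True: positively skewed
```

A positive result indicates a right (positive) skew, a negative result a left (negative) skew, and a value near zero a roughly symmetric distribution.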
### 4. **Frequency Distribution**
- **Counts/Proportions**: Displays how often each value or range of values occurs
in the dataset.
- **Histograms**: Graphical representations of the frequency distribution.
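A minimal frequency table can be built with `collections.Counter`; the survey responses below are invented:

```python
from collections import Counter

responses = ["yes", "no", "yes", "yes", "no", "maybe"]

counts = Counter(responses)                         # absolute frequencies
total = sum(counts.values())
proportions = {k: v / total for k, v in counts.items()}  # relative frequencies

print(counts)              # Counter({'yes': 3, 'no': 2, 'maybe': 1})
print(proportions["yes"])  # 0.5
```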
### 5. **Percentiles and Quartiles**
- **Percentiles**: Values that divide the data into 100 equal parts. For example, the
90th percentile is the value below which 90% of the data falls.
- **Quartiles**: Special percentiles that divide the data into four equal parts, with
the first quartile (Q1), second quartile (Q2 or the median), and third quartile (Q3).
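`statistics.quantiles` produces both percentiles (`n=100`) and quartiles (`n=4`); note that its default "exclusive" method interpolates between data points, so cut points need not be values from the data:

```python
import statistics

data = list(range(1, 101))  # the values 1..100

percentiles = statistics.quantiles(data, n=100)  # 99 cut points
quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3

p90 = percentiles[89]  # 90th percentile
print(p90)             # 90.9
print(quartiles)       # [25.25, 50.5, 75.75]
```

Roughly 90% of the values fall below the 90th percentile, and the middle element of `quartiles` is the median (Q2).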
### 6. **Visualization**
Descriptive statistics are often represented graphically through:
- **Bar Charts**: For categorical data.
- **Histograms**: For continuous data.
- **Box Plots**: Show the distribution and highlight outliers.
- **Pie Charts**: For showing proportions.
In data analysis, descriptive statistics are typically the first step, helping researchers
or analysts understand the basic features of the data before moving to inferential
statistics or predictive models.
What do you mean by data coding? Describe the data coding and entry procedure in
the SPSS platform.
Data coding refers to the process of assigning numerical or categorical values to
responses or data points, especially when dealing with qualitative or non-numerical
data. This step is essential for data analysis because statistical software like SPSS
(Statistical Package for the Social Sciences) requires data to be in numerical form for
analysis.
Data Coding Procedure in SPSS:
1. Identify Variables and Responses:
○ Start by listing all the variables (e.g., gender, age, education level) and
their possible responses.
○ For qualitative data (e.g., gender: male, female), assign numeric codes
to each response. For example, male = 1, female = 2.
2. Create a Codebook:
○ This is a guide that outlines each variable, its values, and what the
numeric codes represent. For example:
■ Gender: 1 = Male, 2 = Female
■ Education: 1 = High School, 2 = Bachelor’s, 3 = Master’s, etc.
3. Coding Missing Data:
○ If there is missing data, define a code for missing responses, e.g., 99
or -1, to differentiate from valid responses.
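Although the coding itself is normally done inside SPSS, the scheme can be sketched in Python to make it concrete; the codebook, variable names, and missing-value code below are invented examples:

```python
# Hypothetical codebook mapping text responses to numeric codes,
# with 99 as the agreed code for missing responses.
CODEBOOK = {
    "gender": {"Male": 1, "Female": 2},
    "education": {"High School": 1, "Bachelor's": 2, "Master's": 3},
}
MISSING = 99

def encode(variable, response):
    """Return the numeric code for a response, or the missing code."""
    return CODEBOOK[variable].get(response, MISSING)

print(encode("gender", "Female"))  # 2
print(encode("education", None))   # 99 (missing)
```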
Data Entry Procedure in SPSS:
1. Define Variables:
○ Open SPSS and go to the Variable View tab.
○ In this tab, define each variable by giving it a name, setting its type
(numeric, string, date), width, decimal places, and label (description of
the variable).
2. Assign Value Labels:
○ For categorical variables, click on the "Values" column next to the
variable name. Enter the numeric codes (e.g., 1, 2, 3) and their
corresponding labels (e.g., Male, Female).
3. Data Entry:
○ Switch to the Data View tab.
○ Enter the data based on the coding scheme from your codebook. For
each respondent or observation, input the numeric codes that
correspond to their responses.
4. Handling Missing Data:
○ If data is missing for any variable, enter the pre-defined code for
missing values (e.g., 99).
5. Saving the Dataset:
○ After entering the data, save the file by going to File > Save As and
saving it as an SPSS file (.sav format).
By following these steps, you ensure that the data is entered systematically and is
ready for analysis in SPSS.
7 Fundamental Steps of Data Mining
What is Data Mining?
Data mining is a method of extracting and scrutinising huge quantities of
data and searching for recurring relationships among them. It helps businesses
with tasks such as spam detection or fraud detection in their databases. As a result,
businesses can make better decisions by avoiding prior mistakes in their future
endeavours. Data mining is different from data analysis and follows a separate
method.
7 Fundamental Steps of Data Mining
1. Cleaning of Incomplete Data: The first step in data mining is cleaning
incomplete or dirty data in order to meet industry standards. Otherwise, there
will be system failures and poor insights, which cost more time and
effort. Depending on the requirements of specific industries, specialists use various
methods or tools to accomplish this task.
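One common cleaning rule is dropping records with missing fields; a minimal sketch, with invented records (real pipelines apply many such rules):

```python
records = [
    {"id": 1, "age": 34, "city": "Dhaka"},
    {"id": 2, "age": None, "city": "Sylhet"},  # incomplete record
    {"id": 3, "age": 27, "city": None},        # incomplete record
    {"id": 4, "age": 41, "city": "Khulna"},
]

# Keep only records where every field has a value.
clean = [r for r in records if all(v is not None for v in r.values())]
print(len(clean))  # 2
```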
2. Integration of Data:
In the second step, specialists perform data integration, which means combining
data from multiple sources and datasets so that it can be analysed together. It is a
crucial step in which the combined databases undergo a second layer of data
cleaning. The main purpose here is to improve data quality by eliminating
inconsistent information.
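Integration can be sketched as joining two sources on a shared key; the customer ids and fields below are hypothetical:

```python
# Two data sources keyed by a hypothetical customer id.
sales = {101: {"name": "Rahim", "total": 250},
         102: {"name": "Karim", "total": 90}}
support = {101: {"tickets": 3},
           102: {"tickets": 0}}

# Merge the records for each customer into one integrated record.
integrated = {cid: {**sales[cid], **support.get(cid, {})}
              for cid in sales}

print(integrated[101])  # {'name': 'Rahim', 'total': 250, 'tickets': 3}
```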
3. Reduction of Data:
Now that the cleaning process is complete, it is time for data reduction so that
quality improves further. Specialists condense the data into a smaller, simpler
structure that still summarises its main message. Machine learning is often used
along with several data mining tools for smooth performance in this third
step.
4. Transformation of Data:
Every data mining task has its own mining goals, which are clarified in the fourth
step. This is the phase in which specialists transform the prepared data using
methods such as data mapping, normalisation and aggregation. As a result, data
quality improves further and the specialists move one step closer to the final
report.
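Normalisation, one of the methods named above, can be illustrated with min-max scaling, which rescales values to the range [0, 1]; the sample values are invented:

```python
values = [10, 20, 30, 40, 50]

# Min-max normalisation: (v - min) / (max - min).
lo, hi = min(values), max(values)
normalised = [(v - lo) / (hi - lo) for v in values]

print(normalised)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```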
5. Data Mining:
Though the entire process is known as data mining, this step specifically covers the
mining tasks themselves. Modelling techniques used in this step include
classification, clustering and others. Specialists apply multiple data mining tools and
other intelligent methods to produce models, which represent the extracted
information.
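As a toy illustration of one technique named above, here is clustering sketched as a tiny one-dimensional k-means with two clusters; the points and starting centres are invented, and real mining tools handle many dimensions and edge cases:

```python
def kmeans_1d(points, c1, c2, iters=10):
    """Two-cluster 1-D k-means: assign points to the nearer centre,
    then move each centre to the mean of its cluster."""
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(a) / len(a)
        c2 = sum(b) / len(b)
    return sorted([c1, c2])

centres = kmeans_1d([1, 2, 3, 20, 21, 22], c1=0.0, c2=10.0)
print(centres)  # [2.0, 21.0]
```

The two centres settle on the means of the two obvious groups, which is the kind of extracted structure a mining model represents.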
6. Pattern Analysis:
Data mining is a process that uncovers patterns of relationships within the data. In
the sixth step, specialists finalise their insights and discuss them with business
owners so that new decisions can be made. Everything from sales to employee
behaviour and customer needs is discussed at this step.
7. Sharing the Final Report:
After the discussion, the specialists present their final report, which includes all
relevant information about the process, together with their insights on overall
business performance and its recurring problems. Companies receive the report,
recognise the patterns in their behaviour, and can work to improve them in the
future.