
Unit 4 2marff

The document is a question bank for the subject CCS346 Exploratory Data Analysis at Ramco Institute of Technology, focusing on Bivariate Analysis. It covers key concepts such as relationships between variables, percentage tables, contingency tables, and various statistical methods including scatter plots and hypothesis testing. The content includes definitions, explanations, and comparisons of statistical terms and techniques relevant to data analysis.

RAMCO INSTITUTE OF TECHNOLOGY

Department of Computer Science and Technology


Academic Year: 2024 - 2025 (Even Semester)
QUESTION BANK
Semester / Class : VI/ VI Year B.E CSE
Name of Subject : CCS346 Exploratory Data Analysis
Name of Faculty member : Mrs.S.Vijaya Amala Devi AP/CSE

UNIT IV BIVARIATE ANALYSIS


Relationships between Two Variables - Percentage Tables - Analysing Contingency Tables -
Handling Several Batches - Scatterplots and Resistant Lines.

PART A
1. Write a short note on Bivariate Analysis.
 Bivariate analysis is a statistical method used to explore and understand the
relationship between two different variables in a dataset.
 It involves examining how changes in one variable are associated with
changes in another, providing insights into potential connections, correlations,
or dependencies between them.

2. What is meant by Percentage Table in Bivariate Analysis?


 In bivariate analysis, a percentage table is used to display the joint distribution
of two categorical variables.
 It shows the relative frequencies or percentages of observations that fall into
each combination of categories from the two variables.

3. Write a short note on Contingency Tables.


 A contingency table displays frequencies for combinations of two categorical
variables. Analysts also refer to contingency tables as cross tabulation and
two-way tables.
 Contingency tables classify outcomes for one variable in rows and the other in
columns.
 The values at the row and column intersections are frequencies for each
unique combination of the two variables.
 Use contingency tables to understand the relationship between categorical
variables.

4. What are the ways of creating a contingency table?


There are three ways of creating APA-style contingency tables in SPSS:
 CROSSTABS is the easiest. It can create several tables in one go, but they
require a fair amount of manual editing.
 CTABLES produces the desired table straight away and can be run from the
menu. However, it creates one table at a time and requires an additional
license.
 TABLES also produces the right table straight away. However, the
syntax is difficult and there is no menu.

5. Define Marginal Distribution.


Marginal distributions represent the frequency distribution of one categorical variable
without regard for other variables.

6. Define Conditional Distribution.


 A conditional distribution specifies the value of one of the variables in the
contingency table and then assesses the distribution of frequencies for the other
variable.
 In other words, it conditions the frequency distribution for one variable by
setting a value of the other variable.

7. How to calculate row and column percentages in a two-way table?


 Row Percentage: Divide a cell value by the cell's row total and multiply by 100.
 Column Percentage: Divide a cell value by the cell's column total and multiply by 100.
 For example, the row percentage of females who prefer chocolate is simply
the number of observations in the Female/Chocolate cell divided by the row
total for females: 37 / 66 = 56.1%.
 The column percentage for the same cell is the frequency of the
Female/Chocolate cell divided by the column total for chocolate: 37 / 58 =
63.8%.
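The two calculations above can be checked with a short Python sketch. Only the Female/Chocolate count (37), the female row total (66) and the chocolate column total (58) come from the example. The two-row layout, the single "Other" column and the male "Other" count of 40 are hypothetical values added only to complete a table.

```python
def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    return {
        row: {col: 100 * n / sum(cells.values()) for col, n in cells.items()}
        for row, cells in table.items()
    }


def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


# 37, the row total 66 and the column total 58 are from the example;
# the "Other" column and the value 40 are hypothetical.
table = {
    "Female": {"Chocolate": 37, "Other": 29},
    "Male": {"Chocolate": 21, "Other": 40},
}

row_pct = row_percentages(table)
col_pct = column_percentages(table)
print(round(row_pct["Female"]["Chocolate"], 1))  # 56.1
print(round(col_pct["Female"]["Chocolate"], 1))  # 63.8
```

Within each row the row percentages add up to 100, and within each column the column percentages add up to 100, which is a quick way to tell the two kinds of table apart.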

8. What is meant by a Box Plot?


A box plot, also known as a box-and-whisker plot, is a standardized way of
displaying the distribution of a data set based on its five-number summary:
the minimum, first quartile (Q1), median, third quartile (Q3) and maximum.

9. Define Outliers.
Outliers are data points in a dataset that significantly deviate from the majority of
other data points. They are observations that fall well outside the typical range or
distribution of values and may be unusually high or low in comparison to the rest of
the data.
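As a rough illustration, the five-number summary behind a box plot and a simple outlier check can be sketched in Python. The 1.5 × IQR fence used here is a common convention rather than something stated in this question bank, the quartile method ("inclusive") is one of several in use, and the sample values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print(summary)
print(outliers)  # [30]
```

With these values the quartiles are 4, 6 and 8, so the fences sit at -2 and 14 and only the value 30 is flagged.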

10. What is meant by linear relationship in Bivariate analysis?


 In bivariate analysis, which involves examining two variables together, one of
the key aspects to explore is linear relationships.
 A linear relationship between two variables means that as one variable
changes, the other tends to change in a straight-line fashion.

11. Write a short note on Scatter Plot.


 A scatter plot is a graphical representation used to visualize the relationship
between two sets of data points.

 It consists of points on a two-dimensional plane, where each point represents
the values of two variables.
 Plotting these points makes it possible to quickly identify patterns,
correlations, or trends in the data.

12. What are bounded numbers?


 Bounded numbers are numbers that fall within a specified range or interval,
defined by an upper bound (maximum value) and a lower bound (minimum
value).
 For example, in the interval [1, 10], the lower bound is 1, and the upper bound
is 10. These bounds can be used in various mathematical and statistical
contexts to set limits on the values that a variable can take.

13. Differentiate between proportions and probabilities.

Proportions | Probabilities
A proportion represents a part of a whole and is expressed as a fraction or ratio. It is a way to compare the number of favourable outcomes to the total number of possible outcomes. | Probability quantifies the likelihood of an event occurring, ranging from 0 (impossible event) to 1 (certain event). It is a measure of how likely an event is to happen.
Proportion = (No. of favourable outcomes) / (Total no. of possible outcomes) | Probability = (No. of favourable outcomes) / (Total no. of possible outcomes)
Proportions are often used in descriptive contexts. | Probabilities are used in predictive contexts.

14. Define Percentage.


 A percentage is a proportion multiplied by 100. It expresses a part of a whole as a
fraction of 100 and is often used to simplify comparisons.
 Percentage = Proportion × 100
 Percentages are simply proportions expressed on a scale from 0 to 100, making
them easier to understand and compare.

15. What is meant by NS-SEC?


 NS-SEC (National Statistics Socio-Economic Classification): The NS-SEC
is a socio-economic classification system used in the UK to categorize
individuals based on their occupation.
 It is designed to provide a comprehensive and standardized way to measure
and analyze socio-economic position.

16. What is the purpose of NS-SEC?


Purpose of NS-SEC:
 To identify and analyze social and economic inequalities.
 To study patterns and trends in social and economic behaviour.
 To inform policy-making and resource allocation.

17. What is cell frequency?

In the context of data analysis, "cells" is the more scientific term for pigeon holes in a
contingency table or cross tabulation. The number of cases or observations in each
cell is called the cell frequency.

18. What are marginals in the context of a contingency table?


In a contingency table, each row can have a total presented at the right-hand
end and each column a total presented at the bottom. These totals are called
marginals. They represent the sum of the cell frequencies for each row and
column, respectively.

19. Define percentage tables.


 Percentage tables are tables where the cell values are expressed as percentages
rather than raw frequencies.
 They provide a clearer understanding of the distribution of data across
different categories by showing the proportion of each cell relative to a total.
 The commonest way to make contingency tables readable is to cast them in
percentage tables.

20. What are the types of percentage tables?


Types of Percentage Tables:
 Row Percentage Tables: Each cell value is expressed as a percentage of the
row total. This helps to compare the distribution of categories within each row.
 Column Percentage Tables: Each cell value is expressed as a percentage of
the column total. This helps to compare the distribution of categories within
each column.
 Total Percentage Tables: Each cell value is expressed as a percentage of the
overall total of all cells. This helps to understand the contribution of each cell
to the entire table.

21. What are the differences between reproducibility and clarity?
 Reproducibility refers to the ability to duplicate the results of a study or
experiment using the same methods, data, and conditions, ensuring the
findings are reliable and credible.
 Clarity, on the other hand, refers to the ease with which information, methods,
and results can be understood, ensuring that the audience can easily
comprehend and effectively communicate the research.
 While reproducibility focuses on the reliability and consistency of results,
clarity focuses on the presentation and accessibility of information.

22. What are r and c in the degrees of freedom?


The degrees of freedom for the chi-square are calculated using the following formula:
df = (r-1)(c-1)
where r is the number of rows and c is the number of columns. If the observed chi-
square test statistic is greater than the critical value, the null hypothesis can be
rejected.
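A minimal Python sketch of this calculation follows. The expected count for each cell is taken as (row total × column total) / grand total, which is the standard definition but is not spelled out above. The table counts are hypothetical, and 3.841 is the usual critical value for df = 1 at the 5% significance level.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)  # df = (r-1)(c-1)
    return statistic, df


table = [[30, 10], [20, 40]]  # hypothetical 2 x 2 counts
statistic, df = chi_square(table)

CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # chi-square critical value, df = 1, alpha = 0.05
reject_null = statistic > CRITICAL_VALUE_DF1_ALPHA_05
print(round(statistic, 3), df, reject_null)  # 16.667 1 True
```

For this hypothetical table the statistic is well above the critical value, so the null hypothesis of independence would be rejected.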

23. What are the types of hypothesis?


Types of Hypotheses:

 Null Hypothesis (H₀): This hypothesis proposes that there is no significant
difference or effect between variables.
 Alternative Hypothesis (H₁ or Hₐ): This hypothesis proposes that there is a
significant difference or effect between variables, opposite to the null hypothesis.

24. What is the null hypothesis?


 The null hypothesis (denoted as H₀) in statistical hypothesis testing is a
statement that suggests there is no significant difference, effect, or relationship
between variables in a population.
 It serves as the default position, assuming that any observed differences or
effects in sample data are due to random chance or sampling variability rather
than a true effect.
 The null hypothesis is essential in hypothesis testing as it provides a baseline
against which the alternative hypothesis (H₁ or Hₐ) is evaluated.
 If there is sufficient evidence from the data to reject the null hypothesis, it may
be concluded that there is a significant effect or relationship present. However,
failing to reject the null hypothesis does not necessarily prove it to be true; it
simply suggests that there is not enough evidence to conclude otherwise based
on the data at hand.

25. Differentiate between type 1 and type 2 errors.


Type 1 Error | Type 2 Error
A Type 1 error, also known as a false positive, occurs when a true null hypothesis is rejected. In other words, it is the incorrect rejection of a null hypothesis that is actually true. | A Type 2 error, also known as a false negative, occurs when a false null hypothesis is not rejected. In other words, it is the failure to reject a null hypothesis that is actually false.
Its probability is denoted α: the probability of rejecting a null hypothesis when it is actually true. | Its probability is denoted β: the probability of failing to reject a null hypothesis when it is actually false.
Example: Concluding that a new drug is effective when it actually has no effect (the null hypothesis is true). | Example: Failing to conclude that a new drug is effective when it actually is effective (the alternative hypothesis is true).

26. What is the Office for National Statistics (ONS)?


 The Office for National Statistics (ONS) is the UK's largest independent
producer of official statistics.
 It gathers data through various methods, including surveys, censuses,
administrative data sources, and partnerships with other organizations.
 It plays a crucial role in providing accurate and timely statistics that inform
decision-making, policy development, research, and public understanding of
societal trends in the UK.

27. What is the Labour Force Survey (LFS)?


 The Labour Force Survey (LFS) is a large-scale household survey conducted
regularly in many countries, including the United Kingdom.
 It is designed to gather comprehensive information about the labor market,
employment, unemployment, and related topics.
 The data collected from the LFS is used by government agencies,
policymakers, researchers, and businesses for various purposes.

28. What are the reasons for outliers?


 Measurement Errors:
o Outliers can arise due to errors in measurement, data entry, or data
processing. For example, a misread sensor or a typo during data entry
could lead to an outlier.
 Natural Variability:
o Inherent variability in the data generation process can sometimes
produce outliers. This variability may be due to random chance or
natural extremes within the system being studied.
 Sampling Variation:
o Outliers can also occur due to sampling variation, especially in smaller
sample sizes.
 Genuine Extreme Events:
o Sometimes, outliers represent real and significant events or phenomena
that are rare but impactful. For example, in financial data, outlier stock
prices may correspond to major market movements or economic crises.
29. What is the formula for the t-test?
The formula for the two-sample t-test is:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_{x_1 x_2} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

 $\bar{x}_1$ and $\bar{x}_2$ are the means of the two samples.
 $s_{x_1 x_2}$ is the pooled standard deviation.
 $n_1$ is the sample size of the first sample.
 $n_2$ is the sample size of the second sample.
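The two-sample statistic can be sketched in Python as follows. The pooled standard deviation is computed with the usual definition, sqrt(((n1 - 1)·s1² + (n2 - 1)·s2²) / (n1 + n2 - 2)), which is an assumption here because the text above does not define it. The function name and the sample values are hypothetical.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Two-sample t statistic using a pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance is the sample variance (divides by n - 1).
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))


t = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])  # hypothetical samples
print(round(t, 3))  # -1.0
```

Here both samples have variance 2.5 and the means differ by 1, so the denominator works out to 1 and t is -1.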

30. Define One-way analysis of variance (ANOVA).


 One-way analysis of variance (ANOVA) is a statistical technique used to
compare the means of three or more groups to determine if there is a
statistically significant difference between them. It's an extension of the
independent samples t-test, which only compares the means of two groups.
 It determines whether there are any statistically significant differences
between the means of three or more independent groups. It helps answer
questions such as whether different treatments lead to different outcomes or
whether there are differences between groups based on some categorical
variable.
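A minimal sketch of the one-way ANOVA F statistic in Python is shown below. It uses the standard decomposition F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N observations, which is not written out above. The function name and the group values are hypothetical.

```python
import statistics


def one_way_anova_f(groups):
    """Return the F statistic for a list of groups (each a list of numbers)."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - statistics.mean(group)) ** 2 for group in groups for v in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])  # hypothetical groups
print(f_stat)  # 3.0
```

A large F value means the group means differ by more than the spread within the groups would suggest. Judging significance still requires comparing F with a critical value for (k - 1, N - k) degrees of freedom.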

31. Define Lone Parents.


Lone parents refer to individuals who are raising children without a spouse or partner
present in the household. This term typically applies to both single mothers and single
fathers who assume the primary responsibility for childcare and raising their children
alone.

32. Where to draw the Line?

1. Make half the points lie above the line and half below along the full length of
the line.
2. Make each point as near to the line as possible.
3. Make each point as near to the line in the Y direction as possible.
4. Make the squared distance between each point and the line in the Y direction
as small as possible.
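Criterion 4 in the list above is the least-squares criterion. As a sketch, the line that minimizes the squared vertical distances can be computed with the usual closed-form expressions for slope and intercept. The function name and the data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) minimizing squared distances in the Y direction."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical points lying exactly on y = 2x + 1
slope, intercept = least_squares_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Because the distances are squared, a single extreme point can pull this line strongly toward itself.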

33. Define Resistant Line.


A resistant line is a method of fitting a line to data that minimizes the influence of
outliers. It is a robust alternative to the ordinary least squares (OLS) regression line.
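One simple way to build such a line is sketched below in Python: sort the points by X, split them into thirds, and draw the line through the medians of the outer thirds. This three-group variant is an assumption for illustration, since the definition above does not say which fitting procedure is intended. The function name and the data are hypothetical.

```python
import statistics


def resistant_line(xs, ys):
    """Fit y = intercept + slope * x from the medians of the outer thirds."""
    points = sorted(zip(xs, ys))
    third = len(points) // 3
    left, right = points[:third], points[-third:]

    x_left = statistics.median([x for x, _ in left])
    y_left = statistics.median([y for _, y in left])
    x_right = statistics.median([x for x, _ in right])
    y_right = statistics.median([y for _, y in right])

    slope = (y_right - y_left) / (x_right - x_left)
    # The median of the residuals y - slope * x sets the level of the line.
    intercept = statistics.median([y - slope * x for x, y in points])
    return slope, intercept


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 100]  # y = 2x, except for one extreme last value
slope, intercept = resistant_line(xs, ys)
print(slope, intercept)  # 2.0 0.0
```

Because only medians are used, the extreme last point does not move the fitted line away from y = 2x. The sketch assumes at least three points and distinct X medians in the outer thirds.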

34. Define Relationships between Two Variables?


Relationships between two variables refer to the association or correlation between
their respective values in a dataset.

35. Differentiate between Percentage Tables and Contingency Tables?


Percentage tables display proportions, while contingency tables show the distribution
of two categorical variables.

36. What is the Purpose of Analysing Contingency Tables?


Analysing contingency tables helps examine the relationship between two categorical
variables and identify patterns or dependencies.

37. Define Handling Several Batches in Data Analysis?


Handling several batches involves managing and analyzing data that is collected or
processed in multiple groups or sets.

38. Differentiate between Scatterplots and Resistant Lines?


Scatterplots visually represent the relationship between two numerical variables,
while resistant lines are fitted lines that minimize the impact of outliers.

39. Explain the Significance of Transformations in Data Analysis?


Transformations modify the scale or distribution of data, aiding in making
relationships more linear or meeting statistical assumptions.
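As a small illustration of how a transformation can straighten a relationship, the Python sketch below takes hypothetical values that double at every step and applies a base-2 logarithm. The data and the helper name are assumptions made for the example.

```python
import math

xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]  # hypothetical exponential growth: 3, 6, 12, ...
log_ys = [math.log2(y) for y in ys]


def successive_differences(values):
    """Differences between neighbouring values; constant for a straight line."""
    return [b - a for a, b in zip(values, values[1:])]


raw_steps = successive_differences(ys)      # [3, 6, 12, 24, 48]
log_steps = successive_differences(log_ys)  # each step is approximately 1.0
print(raw_steps)
print([round(step, 6) for step in log_steps])
```

The raw steps keep growing, so the original relationship is curved. After the log transformation the steps are constant, so the relationship is a straight line.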

40. What is the Purpose of Percentage Tables in Data Analysis?


Percentage tables express values as a percentage of the total, providing a clearer
understanding of the relative contribution of each category.

41. Define Contingency Tables in Statistics?


Contingency tables organize and display the frequency distribution of two categorical
variables, showing how they are related.

42. How do Scatterplots Facilitate Relationship Visualization?


Scatterplots visually display the relationship between two numerical variables,
allowing for the observation of patterns, trends, or correlations.

43. Explain the Concept of Resistant Lines in Regression?
Resistant lines in regression minimize the impact of outliers, providing a more
accurate representation of the overall relationship between two variables.

44. Define Transformation Techniques in Data Analysis?


Transformation techniques modify the structure or scale of data to meet statistical
assumptions or enhance analysis.

45. Differentiate between Batch Handling and Data Aggregation?


Batch handling involves managing data in separate groups, while data aggregation
combines and summarizes information across groups.

46. What is the Role of Scatterplots in Identifying Patterns?


Scatterplots visually reveal patterns or trends in data, aiding in the identification of
relationships between two variables.

47. Define the Purpose of Percentage Tables in Categorical Data Analysis?


Percentage tables are useful in categorical data analysis to express the distribution of
categories as a percentage of the total, providing a relative comparison.

48. Explain the Significance of Handling Several Batches in Experimental Design?


Handling several batches is important in experimental design to account for variations
introduced by different conditions, ensuring robust and generalizable results.

49. Define the Concept of Data Transformation in Regression Analysis?


Data transformation in regression alters the form or distribution of data to meet
assumptions, improve model fit, or enhance interpretability.

50. How do Contingency Tables Aid in Statistical Inference?


Contingency tables assist in statistical inference by providing a structured overview of
the relationship between two categorical variables.

51. What is the Purpose of Resistant Lines in Outlier-Prone Data?


Resistant lines are valuable in outlier-prone data as they reduce the impact of extreme
values, ensuring a more reliable interpretation of the relationship between variables.

52. Define the Role of Scatterplots in Exploratory Data Analysis?


Scatterplots play a crucial role in exploratory data analysis by visually uncovering
potential relationships, outliers, or patterns between two variables.

53. How do Transformations Improve Linearity in Regression?


Transformations in regression analysis can improve linearity, making the relationship
between variables more suitable for linear modelling and interpretation.
54. What is the use of scatter plot in bivariate analysis? NOV/DEC 23

A scatter plot shows a lot about the relationship between the variables. Here are five key
uses:
 It is used to identify relationships and patterns.
 It is used for detecting correlation.
 Scatter plots can easily highlight outliers, which are points that fall far from the
general pattern of the data. These outliers can be further investigated to
understand their cause or impact.
 It provides a visual assessment of the distribution of data points, helping to
identify clusters, gaps, or unusual patterns in the data.
 By using different colors or symbols for different datasets, scatter plots can
compare relationships between variables across multiple groups or conditions,
making it easier to spot differences or similarities.

55. List two common methods used in bivariate analysis to examine the relationship
between two variables. NOV/DEC 23
Two common methods used in bivariate analysis to examine the relationship between
two variables are:
1. Correlation Analysis:
 Measures the strength and direction of the linear relationship between two
continuous variables.
 Common correlation measures include Pearson's correlation coefficient and
Spearman's rank correlation coefficient.
2. Regression Analysis:
 Models the relationship between a dependent variable and one or more
independent variables.
 Simple linear regression is used for one independent variable, while multiple
regression is used for more than one independent variable.
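Pearson's correlation coefficient, mentioned under correlation analysis above, can be sketched in a few lines of Python. The function name and the height and weight values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


heights = [150, 160, 170, 180, 190]  # hypothetical heights in cm
weights = [50, 58, 65, 74, 82]       # hypothetical weights in kg
r = pearson_r(heights, weights)
print(round(r, 3))
```

Values of r close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship. The hypothetical data here give a value just below +1.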

56. What is the significance of a percentage table? NOV/DEC 24


A percentage table helps in understanding the relationship between two categorical
variables by representing their joint distribution in percentage form.
Significance:
 Comparative Analysis – Helps compare proportions across different
categories, making patterns more interpretable.
 Identifying Trends – Reveals associations or dependencies between two
variables.
 Handling Imbalance – Adjusts for differences in sample sizes, preventing
misleading conclusions.
 Better Visualization – Used alongside heat maps or stacked bar charts for
deeper insights.

57. Which test is recommended for analysing the contingency table? NOV/DEC 24
The Chi-Square Test for Independence is the most recommended test for analyzing
contingency tables in Exploratory Data Analysis (EDA).

 Measures Association – Determines whether two categorical variables are
independent or related.
 Applicable for Large Datasets – Works well with categorical data arranged in
a contingency table.
 Non-Parametric Test – Does not assume normality in data distribution.

58. Name the two types of statistical testing in bivariate analysis? NOV/DEC 22
In bivariate analysis, statistical testing is categorized into:
1. Parametric Tests – Used when data follows a normal distribution.
o t-Test: Compares the means of two groups (e.g., independent or paired
t-test).
o Pearson Correlation: Measures the linear relationship between two
continuous variables.
2. Non-Parametric Tests – Used when data does not follow a normal distribution.
o Chi-Square Test: Tests the association between two categorical
variables.
o Spearman Rank Correlation: Measures the monotonic relationship
between two continuous or ordinal variables.

59. Is bivariate qualitative or quantitative? NOV/DEC 22


Bivariate analysis can be both qualitative and quantitative, depending on the type of
variables being analyzed:
 Qualitative (Categorical-Categorical)
o Example: Relationship between Gender (Male/Female) and Preference
(Online/Offline).
o Test Used: Chi-Square Test.
 Quantitative (Numerical-Numerical)
o Example: Relationship between Height and Weight.
o Test Used: Pearson/Spearman Correlation, Regression Analysis.
 Mixed (Categorical-Numerical)
o Example: Relationship between Education Level (High School,
Graduate) and Salary.
o Test Used: t-Test, ANOVA.
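The pairing of variable types and tests listed above can be written as a small lookup table in Python. This is only a restatement of the list for illustration; the names SUGGESTED_TESTS and suggest_tests are hypothetical.

```python
# Tests listed above, keyed by the (alphabetically sorted) pair of variable types.
SUGGESTED_TESTS = {
    ("categorical", "categorical"): ["Chi-Square Test"],
    ("numerical", "numerical"): ["Pearson/Spearman Correlation", "Regression Analysis"],
    ("categorical", "numerical"): ["t-Test", "ANOVA"],
}


def suggest_tests(type_a, type_b):
    """Return the tests listed for a pair of variable types, in either order."""
    key = tuple(sorted((type_a, type_b)))
    return SUGGESTED_TESTS[key]


print(suggest_tests("numerical", "categorical"))  # ['t-Test', 'ANOVA']
```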

60. What are the three common methods for performing bivariate analysis?
NOV/DEC 23
1. Graphical Analysis (visualization)
2. Statistical Analysis
3. Numerical Summaries

61. Outline the difference between univariate and bivariate data. NOV/DEC 23
Feature | Univariate Data | Bivariate Data
Definition | Data that involves a single variable. | Data that involves two variables.
Purpose | Describes and summarizes the distribution of one variable. | Analyzes the relationship between two variables.
Examples | Examining the height of students. | Examining the relationship between height and weight.
Analysis Methods | Mean, median, mode, variance, histograms, box plots. | Scatter plots, correlation, regression, chi-square test.
Types of Variables | Only one variable is analyzed. | Both variables can be numerical, categorical, or mixed.
