PSYC6102 Psychological Statistics
In research, statistics plays a key part in understanding a data set. In the scientific process, an
identified problem or hypothesis is the start of the study. This hypothesis is tested by different
means that would either prove or debunk the question at hand. The way to test this
hypothesis is through an objective manner where data is collected and statistics is
applied to analyze the problem.
Statistics - is a way of gathering, studying, interpreting and presenting a set of data that is
being used to reflect certain behaviors or patterns of a group called a population. While it
involves numbers and computation, the interpretation of the meaning of these numbers plays a
big role in understanding the behavior being observed.
For example, we see statistics used in our daily lives such as:
Statistics can also influence the design of the study bringing the researcher to always question
and look for the best possible ways to test the problem. At the end of the research, statistics can
give input on how the study can be improved for further work/research.
Sampling
In social research, the basis of the study relies on an identified population which is normally a
group of people that is the focus of the study. In cases where a population is too big to study, a
sample is taken and studied as a representative of the population. The sample is a
subset of the population from which a researcher may draw conclusions that are reflective of the
population. For example, a researcher may say that he/she wants to study the buying behavior
of cigarette smokers in Quezon City. To be able to survey all of these people would be impractical as
there might be more than 1,000,000 smokers in the area. In order to still achieve the objective of
the study, the researcher may opt to scale down his/her respondents by identifying a sample
size of 5% of the whole population. For a population of 1,000,000, this would narrow down the
researcher's target number of respondents (the sample size) to just about 50,000.
Sampling is the method by which the researcher chooses a group to study without bias. In
choosing 50,000 respondents, the researcher must select his/her respondents anonymously
and equitably. The most common method of sampling is called random sampling wherein
each member of the population has an equal chance of participating in the study. A
researcher may select respondents based on several categories such as age, gender, income
bracket, education attainment, etc., based on factors that are identified in the study. It must be
noted that in choosing a sample, all factors must be represented equitably. If the researcher is
choosing 50,000 respondents, he/she must endeavor to have 50% males and 50% females
represented. The researcher may then divide the males and females into different age ranges of
21-30 years old, 31-40 years old and 41-50 years old.
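The 5% example above can be sketched in Python. This is only a minimal illustration: the list of ID numbers standing in for the population, the variable names and the fixed seed are all assumptions made for the example and are not part of the original study.

```python
import random

# Hypothetical population: ID numbers standing in for 1,000,000 smokers.
population = range(1_000_000)

sample_fraction = 0.05  # 5% of the population, as in the example above
sample_size = round(len(population) * sample_fraction)

rng = random.Random(42)  # fixed seed (an assumption) so the draw is repeatable

# Simple random sampling without replacement:
# every member of the population has an equal chance of being chosen.
sample = rng.sample(population, sample_size)

print(sample_size)  # 50000
```

Because the draw is made without replacement, no respondent can appear in the sample twice.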
There are several ways to present data that has been studied. When research is conducted, the
researcher identifies variables that he/she wants to study. Variables may be persons, things,
attributes, or characteristics that are different from one another. These variables may be
qualitative in nature meaning a name is attributed to it (i.e. Colors - blue, red and yellow or
name of player - Kobe, Tony, Manu) or quantitative in nature represented by numbers.
Plotting the data would enable the researcher to quickly understand and draw
conclusions by looking at the direction, trends and observations of the numbers.
Most would be familiar with the usual bar graph where it directly plots the count/frequency of the
data.
Another way to plot data is by using a linear graph (line graph). This graph is used when presenting data
that is seen in groups where the differences between the groups are being studied.
Scatter plots are used when studying the relationship of two variables in the study.
Data may also be summarized by averages which would enable you to see the magnitude of the
effect of your identified variables. This may be measured through three (3) Ms which go by
the mean, median and mode.
Oftentimes, when an average is mentioned, most people would refer to the mean. The mean
uses all data on hand but would be greatly skewed by outliers (numbers that are too high or too
low compared with the rest of the data set). Due to this tendency, most researchers consider the median which
is the midpoint of the data set. The mode is the most common value in the set but would
not provide any further analysis.
Central Tendency – the researcher tries to look at a single value that would describe the whole
data set, taking the middle ground into consideration. This central number usually
summarizes the data set but would pose generic conclusions that do not take into consideration
extreme values in the data.
MEAN - uses all data on hand but would be greatly skewed by outliers (numbers that are too high or too
low compared with the rest of the data set). Normally referred to as the average score of the group. It is computed
by adding all the scores of the variable and dividing the sum by the number of entries.
For example:
To compute for the mean of all points of players, we would have to add all the points together
then divide the sum by 8 which is the total number of players represented. The mean is then
15.625.
If the population mean is being computed, the mean is then represented by the Greek symbol
denoted as μ (mu).
MEDIAN - the midpoint of the data set. It is the middlemost number of your data set which is
organized from least to greatest. Outliers, or numbers that are too high or too low, do not affect the
median.
In the data given below, we would first need to arrange the numbers from least to greatest.
MODE - the most common data in the set but would not provide any further analysis. It is the
most frequent number that appears in the data set. The mode is used in analyzing data when
one wants to know which is the most common answer.
In qualitative data, the mode may indicate popularity of choice as shown in the example below
where blue is the most common favorite color.
In quantitative data, however, obtaining the mode as the only measure in the study may be
challenging in cases where there is a tie in choice or when the mode is an outlier of a study.
These situations would mean that the mode is not the best measure for central tendency.
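The three measures can be checked with a short Python sketch. It uses the seven numbers from the mode quiz item in this guide (10, 10, 10, 11, 11, 12, 13) and the standard library `statistics` module; the variable names are chosen only for illustration.

```python
import statistics

# Data taken from the mode quiz item in this guide.
scores = [10, 10, 10, 11, 11, 12, 13]

mean = statistics.mean(scores)      # sum of all scores divided by the number of entries
median = statistics.median(scores)  # middle value once the scores are sorted
mode = statistics.mode(scores)      # most frequent value in the data set

print(mean, median, mode)  # 11 11 10
```

Here the mean and median agree (11) while the mode (10) is lower, which shows that the three measures do not always point to the same value.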
RANGE - the difference of the largest and smallest values in the data set
STANDARD DEVIATION - looks at the average difference from the mean
VARIANCE - the square of the standard deviation
Researchers may also look at how widespread the data is, reflecting how widely the values
of the variables differ. The most common measure is the range, which is the difference of the largest
and smallest values in the data set. The standard deviation meanwhile looks at the average
difference from the mean while the variance would be the square of the standard deviation.
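A minimal Python sketch of these three measures follows. It uses the practice problem minutes from the laboratory data in this guide and assumes the sample versions of the standard deviation and variance (dividing by n - 1), which is also what Excel's =VAR computes.

```python
import statistics

# Minutes spent on practice problems, from the laboratory data in this guide.
minutes = [15, 14, 17, 15, 24, 26, 18, 9, 23, 22]

data_range = max(minutes) - min(minutes)    # largest value minus smallest value
sample_sd = statistics.stdev(minutes)       # sample standard deviation (n - 1 divisor)
sample_var = statistics.variance(minutes)   # sample variance (n - 1 divisor)

print(data_range)  # 17
print(round(sample_sd, 2), round(sample_var, 2))
```

Squaring `sample_sd` gives back `sample_var`, which matches the definition of variance as the square of the standard deviation.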
The descriptive statistics above shows how many Filipinos are recorded in the National
Statistics Office but one cannot infer that these Filipinos are all in the Philippines. One may
know from the statistics above that Quezon City has a high literacy rate but may not conclude
that it is because of the city's education programs.
2. Inferential statistics on the other hand, infers or creates generalizations using
statistical data that is taken from the sample size. These generalizations are considered
as a reflection of the population and may go beyond the data that is on hand. One might
say that with the use of inferential statistics, a researcher may show cause and effect,
probability relationship among measures that are included in the study. Inferential
statistics may be as follows:
• 7 out of 10 moms choose this diaper
• 47% of Filipinos do not go to regular physical examinations
• 80% of UP students like the color maroon
The inferential statistics above shows the probability or the likelihood of something to happen
considering the sample that is tested or surveyed. It shows that the researcher is concluding
information based on the results of the sample that was tested. To illustrate further, for a
researcher to conclude that 7 out of 10 moms choose a diaper X, the researcher must have
interviewed or surveyed a sample size of 100 moms and out of this, 70 moms chose diaper X.
This data given then shows that it is inferred or generalized that 7 out of 10 moms would choose
diaper X.
Let's take the UP students' color preference. For the researcher to arrive at the statement
that 80% of students like the color maroon, he/she must have asked a random sample of the
student body their favorite colors as asking all students would have been impractical. From the
random sample, it was found that 8 out of 10 students would choose the color maroon thus the
conclusion that 80% of the student population likes maroon. Furthermore, if the random sample
was choosing between maroon and blue, the researcher may also conclude that 20% of UP
students prefer the color blue.
Assignment
1. This method of statistics is a direct interpretation of the obtained data. Descriptive
Statistics
2. The two statistical methods are best combined in a research to give an accurate picture and
generalization of the problem at hand. True
3. The mode is the number that is most frequently seen in a data set. True
4. A variable is an identified characteristic in a research that the researcher wants to take a
closer look at. True
5. ______________ is the method by which survey respondents are chosen without bias and with equal
ability to be participants in the study. Random Sampling
6. The average or the mean is the midpoint of a data set. False
7. Qualitative variables are assigned names to certain objects or characteristics of a study.
True
8. This method of statistics infers or generalizes obtained data from the sample size to the
population. Inferential Statistics
9. The whole group of the study which the sample reflects is called the
________________. Population
10. The population is a representative group of the whole. False
11. An identified representative group or subset of the whole is called a ______________.
Sample
12. Numbers represent quantitative variables always. True
13. Random sampling is getting qualified research respondents with each being chosen with
the same probability. True
14. The sample size is taking all members of the group being studied and getting data from
them. False
15. Descriptive statistics are used to infer generalizations to the whole population. False
16. Compute for the percentage and proportion. Type the answer on the space provided.
30%
17. Which scatter plot of Gender vs English Grade best represents the data above?
18. 175 employees out of 1,078 16%
19. 65 boxes out of 120 54%
20. A distribution is a ________________. Record of all data in a study
21. 82 mugs out of 165 49%
22. The class for Statistics has a population of 35 students. The passing rate for the exam is
at 50%. In the last exam, 23 students got 8 and above out of the total score of 15. All the
rest got 6 and below. What is the percentage of the population that passed the exam?
65%
23. 6,874 pens out of 10,000 68%
24. Which bar graph of Gender vs Total Participants per Gender best represents the data
above.
25. A student got a score of 27 out of 30 in an exam. This means that the student answered
90% of the questions correctly. True
26. What is the percentage of the following students? 77% and 49%
27. Student A 50/
28. Student B 32/
29. A regression analysis shows how one can predict the outcome of the other variable.
True
30. When the Pearson and Spearman correlation values equals a +1, it means
______________. That when one variable increases, the other variable increases as
well
31. A t test is used to compare 3 groups and their mean differences. False
32. Independent t test is used when there are different subjects in the study. True
33. The t test has various thresholds/significance levels. True
34. A paired samples t test and a matched pairs t test are different. False
35. A regression analysis shows ____________________. Predictability
36. The formula of a one sample t test is _________________________.
37. The degrees of freedom is computed by ______________________-
38. A correlation coefficient can go above/below +1 and -1. False
39. A negative coefficient means that as one variable decreases, the other variable also
decreases. False
40. In a simple linear regression, b1 is the estimated intercept. False
41. In a simple linear regression, b0 is the estimated slope. False
42. A one sample t test is a test of the sample mean. True
43. The degrees of freedom is denoted by n-10. False
44. The t test has various thresholds/significance levels to which the p value is compared. What
are the three thresholds? 5%, 1%, 10%
45. In regression analysis, the variable being predicted is the independent variable. False
46. All are properties of a regression line except ____________________. If R2 is equal to
0, it means that 45% is predictable
47. A one sample t test is ____________________. A test of the sample mean
48. The t Test is characteristically used to compare ______________. 2 groups and the
differences of their means
49. The hypothesis of a study must always be formed at the beginning of the research. True
50. The null hypothesis means is stated as _____________.
51. When comparing two population means, the = sign always appears in the
__________________. Null Hypothesis
52. A group of researchers tested the before and after effects of television on a group of 25
children. The samples in this test are __________________. Dependent
53. The probability of a Type 1 error is denoted by ________________
54. An alternative hypothesis means _______________. There is a difference between the
two sample groups
55. To compare populations, data must come from random samples. True
56. A pooled variances test is used to know the variance of two groups that are assumed to
be alike. True
57. Describe the relationship based when r value = - 0.89 – negative strong
58. When hypothesis testing the means of 2 independent populations, our data should be
_______________. Randomly selected
59. The correlation analysis is characteristically used to ______________. Show the
relationship between two variables
60. Comparing two populations mean that you can get data from any two sample sizes from
any population. False
61. To conduct a dependent means test would be to employ the ___________________.
Paired Samples t Test
62. In testing means of populations, which of the following data is required? Random
sampling from 2 populations
63. When comparing two population means, the = sign always appears in the alternative
hypothesis. False
64. Significance levels of a research are always at a 90%, 95% and 99% level. True
65. Hypothesis testing is done to predict the outcome of the study. False
66. A negative one tailed graph would have the two ends of the normal curve graph shaded.
False
67. A group of researchers tested the before and after effects of television on
a group of 25 children. To test the hypothesis we use the data
_______________.T distribution with 24 degrees of freedom
80. A regression line may be plotted to show where predicted values may fall in a graph
based on the independent variable. It is also called a _________________________.
Best fitting line
81. The pooled and separate variances tests are interchangeable. False
82. When comparing two populations, it is assumed that both have normal distributions.
True
83. Type 2 Error is when the researcher ________________. Does not reject the null
hypothesis when in fact it is false
85. Another way of stating the null hypothesis is True
86. To reject the null hypothesis, the p value must be less than or equal to α. True
87. A type 1 error occurs when _________________________. The null hypothesis is
rejected when it is true
88. Hypothesis testing is done to _______________. Gather data that would confirm or
debunk the research problem
89. An alternative hypothesis for 2 independent population means is __________.
90. The p value gives you the _____________________.Both one-tailed/two-tailed
graph and value of test statistic
91. The global medical standard for a normal sugar level for men is 80mg/dl.
The barangay conducted a medical mission and tested 25 males where
the average sugar level is 85 mg/dl and a standard deviation of 3. The
medical staff wants to know if the result is significantly higher than the
average. What is the best T-test to use? - One Sample T-test
92. Which of the following is not useful in doing an ANOVA? Same participants across
conditions
93. A chi square test is done with ______________ data. Categorical
94. A chi square test does not measure the __________________. predictability
95. How many degrees of freedom should be used in the table below? 4
96. The degrees of freedom for a chi square is ________________. (rows - 1)(columns-1)
97. What should be included in reporting ANOVA? F statistic
98. Which is not a Chi Square Test? No correct answer
99. How many dependent and independent variables are present in a one way ANOVA?
One dependent with more than 2 averages, one independent
101. The Chi Square Goodness of Fit shows ____________________. If data comes
from the population or not
102. Which of the following is not a property of a chi square tests? From dependent
variable
103. How many dependent variables are present in a two way ANOVA? One
104. A student answered 77 out of 100 in a test. What percentage did he answer
incorrectly? 23%
105. Depending on the alternative hypothesis, the answer when comparing means
can be a one tailed graph. False
106. The strength of a correlation is determined by Coefficient of correlation
121. A t-test with two samples is most commonly used when there are small sample
sizes used in the study. False
122. The mean is the best value to describe an entire set of data. True
123. The degrees of freedom for a sample size of 501 is _________________. 500
124. A research problem does not always need to have a null hypothesis because it is
mostly rejected. False
125. The mu is a greek symbol for ________________. Mean
126. Gender is an example of nominal data. True
127. For the hypothesis , the researcher randomly sampled 40 participants from the
first group and 20 participants from the second group. What is the degrees of freedom?
58
128. A sample contains all the data of the group that a researcher is studying. False
129. R squared is a measure of __________________. Variance of squares
130. A type 2 error is the probability of accepting a true alternative hypothesis. False
131. When making use of the same subjects testing the before and after effects, we
would need data to be___________________. Dependent
132. The mean absolute deviation is also known as the ______________________.
Mean deviation from the mean
133. If r2 = 0, it means that _________________. There is no linear relationship
134. Descriptive statistics on the other hand is a/an __________________
interpretation from the data gathered. Direct
135. The degrees of freedom for a sample size of 50 for the first group and 49 for the
second group is _________________. 97
136. What is the mode of the following numbers 10, 10, 10, 11, 11, 12, 13? 10
There are three main measures of central tendency: mode, median and mean.
Univariate analysis – looks at one study measure across different situations or scenarios.
Variables may be studied and may be shown by the raw data itself.
Graphs are useful in presenting univariate data for ease of interpretation and study. At one
glance, the researcher may see trends and draw conclusions from these. Bar graphs, linear
graphs, pie charts may be used to show the different relationships of the variables.
Assignment:
Compute the percentage and proportion for each.
1. 65 boxes out of 120. 54.17%
2. 175 employees out of 1,078. 16.23%
3. 6,874 pens out of 10,000. 68.74%
4. 3 balls out of 10. 30%
5. 82 mugs out of 165. 49.70%
6. 4 songs out of 10. 40%
7. 79 correct items out of 100. 79.0%
8. 3,476 cars retrieved out of 10,429. 33.33%
9. 594 slippers out of 23,497. 2.53%
10. 12 out of 16. 75%
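The computations in this assignment can be sketched in Python as follows. The function names `proportion` and `percentage` are chosen for this illustration only: the proportion is the part divided by the whole, and the percentage is the proportion multiplied by 100.

```python
def proportion(part, whole):
    """Return the proportion of part out of whole (a value between 0 and 1)."""
    return part / whole


def percentage(part, whole):
    """Return the percentage of part out of whole (the proportion times 100)."""
    return 100 * part / whole


# Item 1 of the assignment: 65 boxes out of 120.
print(round(proportion(65, 120), 4))  # 0.5417
print(round(percentage(65, 120), 2))  # 54.17
```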
Laboratory Assignment:
1-3. Compute the mean, median and mode in Excel.
Blue 10
Red 4
Yellow 7
Green 5
Orange 4
Purple 6
Gold 8
Silver 9
Black 3
White 2
4-6. A teacher in statistics gave out practice problems for students to answer at home. Each
student spent some minutes answering these as recorded below. Compute the mean, median
and mode in Excel.
15
14
17
15
24
26
18
9
23
22
7-9. Customer service centers are being improved in the country. With this effort, a local firm
recorded the number of minutes it takes for call center agents to resolve a complaint. Below is
the data. Compute for the mean, median and mode in Excel.
25
17
20
14
10
15
21
22
14
10
13
20
11. True or False. Measures of central tendency revolve around three areas namely the
range, mean and median. FALSE
The population is the representative sample of an identified group. – false
What is the equivalent proportion if a factory worker accomplishes 82% of his work?
0.82
When doing random sampling, it means that one person may be selected twice in a study. –
false
A student who got a score of 9 out of 10 in a test is an example of inferential statistics. – false
Measures of Dispersion – shows how scattered or spread out the data is.
Range – shows the difference of the greatest value and that of the lowest value in a data set.
For example, if the largest value in the data is 50, and the lowest value in the data is 24, then
the computed range would be 26.
MEAN
=AVERAGE(data set)
MEDIAN
arrange all data in least to greatest then get the middle number
MODE
most common numbers in the data set; if there is none
VARIANCE
=VAR(all the numbers in the given data set)
RANGE
=largest value-lowest value
Assignment:
Normal Curve – is a bell shaped graph that is symmetric/equal on both sides. The tails at
the ends of the graph extend indefinitely. The graph of the normal distribution is determined by the
mean and the standard deviation, where the mean shows where the graph will be located
and the standard deviation shows the height and width of the graph. A tall and narrow graph will
show that there is a small standard deviation while a short and wide graph will show that there is
a large standard deviation.
The standard score is most commonly referred to as the z score in statistics. It shows how many
standard deviations a score lies above or below the mean, which lets the researcher find the
likelihood of that score in the normal distribution. The z score also allows the
researcher to compare z scores from 2 different normal distributions because of the uniformity of the
data. All values in a data set may be converted to a z score by following the formula below:
Z area
Mean (μ)
Standard deviation (σ)
Assignment
Compute for the z score and z area (area between the mean and positive z score) for each data set.
Formula:
z = (x – μ) / σ
= (190 – 150) / 25 = 1.6
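The worked example above (x = 190, μ = 150, σ = 25) can be reproduced with a short Python sketch. The function names are illustrative, and the z area is computed from the standard normal curve using `math.erf` instead of a printed z table, so small rounding differences from a table are possible.

```python
import math


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


def z_area(z):
    """Area under the standard normal curve between the mean (z = 0) and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


z = z_score(190, 150, 25)
print(z)                    # 1.6
print(round(z_area(z), 4))  # 0.4452
```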
The t test is a statistical test, based on the t distribution, that is used to compare 2 groups/populations and the differences of
their means.
The degrees of freedom is the number of values in the data that are free to vary, which is normally
represented by n-1. The distribution of the t statistic in a sample of 15 has n-1 or 15-1=14 degrees
of freedom.
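As an illustration, the sketch below computes a one sample t statistic and its degrees of freedom for the sugar level item in this guide (25 males, sample mean 85 mg/dl, standard deviation 3, standard of 80 mg/dl). It assumes the usual one sample formula t = (sample mean - hypothesized mean) / (s / √n); the function name is hypothetical.

```python
import math


def one_sample_t(sample_mean, mu0, sample_sd, n):
    """Return (t statistic, degrees of freedom) for a one sample t test."""
    standard_error = sample_sd / math.sqrt(n)  # spread of the sample mean
    t = (sample_mean - mu0) / standard_error
    return t, n - 1                            # degrees of freedom = n - 1


t, df = one_sample_t(sample_mean=85, mu0=80, sample_sd=3, n=25)
print(round(t, 2), df)  # 8.33 24
```

The resulting t value would then be compared with the critical value of the t distribution with 24 degrees of freedom at the chosen significance level.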
What is the critical value that we use for a two-tailed test? - α/2
Comparing two population proportions starts with having categorical data (data that is answered
by a yes or no) that are taken from 2 separate unique groups (male/female). In order for this to
happen, data should come from 2 random samples, one from each population, where the null
hypothesis H0 states that the proportion of population 1 (p1) has no difference from the
proportion of population 2 (p2); H0: p1 = p2.
The alternative hypothesis would postulate then that there is a difference in the proportion of the
2 populations denoted by H1: p1 ≠ p2. This would mean that our graph is a two-tailed graph
with rejection regions at each opposite end of the bell curve.
The test used for this comparison is known as the two proportion z-test.
Two proportion z test – compares 2 population proportions using categorical data (data that
is answered by yes or no) that are taken from 2 separate unique groups
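A minimal Python sketch of a two proportion z test follows. It assumes the common pooled form of the test statistic, z = (p1 - p2) / √(p(1 - p)(1/n1 + 1/n2)), where p is the pooled proportion of both samples. The counts used are hypothetical and serve only to show the computation.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled sample proportion."""
    p1 = x1 / n1                       # sample proportion of group 1
    p2 = x2 / n2                       # sample proportion of group 2
    pooled = (x1 + x2) / (n1 + n2)     # combined proportion under H0
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Hypothetical counts: 60 of 100 males and 45 of 100 females answered "yes".
z = two_proportion_z(60, 100, 45, 100)
print(round(z, 2))
```

For a two-tailed test, the absolute value of z would be compared with the critical value at α/2.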
Pooled Variances Test – method for estimating the variance of several populations assuming that the
variance of each is the same
Separate Variances Test – method for estimating the variance of several populations assuming that the
variance of each is different
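The pooled variance for two groups can be sketched as below, assuming the standard weighted formula ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2). The function names are illustrative; the sample sizes in the usage lines come from the degrees of freedom quiz items in this guide.

```python
def pooled_variance(var1, n1, var2, n2):
    """Weighted average of two sample variances, assuming equal population variances."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)


def pooled_df(n1, n2):
    """Degrees of freedom for a pooled variances t test with two groups."""
    return n1 + n2 - 2


print(pooled_df(40, 20))  # 58
print(pooled_df(50, 49))  # 97
```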
ANOVA or analysis of variance is done when the researcher is studying three or more groups in a
research. A test of ANOVA gives you a single p value in one test versus that of the multiple
permutations or groupings you may have when comparing two groups at a time.
F ratio is the result of an ANOVA test, also known as the Fisher analysis of variance.
Under the null hypothesis for ANOVA, the F ratio compares two estimates of the variance of the
random error and should be a value close to 1. If the ratio is much higher, then the null
hypothesis is rejected. The null hypothesis is denoted by H0: μ1 = μ2 = μ3.
The alternative hypothesis for an ANOVA test would suppose that the means being compared are
not all equal. The alternative hypothesis is denoted by H1: at least one mean is not statistically
equal.
One way ANOVA is a test for 3 or more groups having only one independent variable to study.
Two way ANOVA is a test of 3 or more groups having two independent variables to study.
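The F ratio for a one way ANOVA can be sketched in plain Python as the between group mean square divided by the within group mean square. The three groups below are hypothetical numbers chosen only to make the arithmetic easy to follow, and the function name is an assumption.

```python
def one_way_anova_f(groups):
    """Return (F ratio, df between, df within) for a one way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability of individual values around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical scores for three groups.
f_ratio, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_ratio, 2), df_b, df_w)  # 13.0 2 6
```

An F ratio far above 1, as in this made-up example, suggests that at least one group mean differs from the others.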
The Chi square test is primarily a statistical test done to examine whole distributions and the
relationship between distributions using nominal or categorical data. It is used to measure
observed results vs expected results. The data that is used in chi square tests must possess
these attributes: they must be raw, mutually exclusive (no data should be counted twice; each
data shall stand on its own and be counted as one), from independent variables and from a
large enough sample (greater than 5).
A Chi square test uses categorical data and normally comes in the form of a frequency count. This
would mean that a chi square test would show the relationship or non-relationship of variables
but not the strength of the relationship or predictability.
Three types of Chi Square testing employing the same formula:
1. Goodness of Fit test
2. Test of Homogeneity
3. Test of Independence
Degrees of freedom in a Chi square test is important as it tells you how many items in your data
are independent.
The Goodness of Fit is a method of establishing if a set of categorical data really comes from the
population distribution or not.
Chi Test for homogeneity – method of finding out if two or more subsets of a whole population
share the same distribution based on one variable
Chi square test of independence is a method of finding out whether two variables are related
with one another in a given population.
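A chi square test of independence can be sketched as follows. It assumes the usual statistic, the sum of (observed - expected)² / expected over all cells, with expected counts taken from the row and column totals and degrees of freedom equal to (rows - 1)(columns - 1). The 2 x 2 table of counts is hypothetical.

```python
def chi_square_independence(table):
    """Return (chi square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical frequency counts: rows are two groups, columns are yes/no answers.
chi2, df = chi_square_independence([[30, 20], [20, 30]])
print(chi2, df)  # 4.0 1
```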
Multivariate analysis focuses on the study of more than two variables at a time allowing the
researcher to infer outcomes closer to reality because of the numerous different factors affecting a
subject. It allows the researcher to confirm if his/her study model is based on real life
observations and if what is assumed is the best explanation of the association of variables. It
can be conducted in both experimental and non-experimental designs. Whether the researcher
is trying to establish causality or relationship, both factors can be determined in a multivariate
analysis. It is essential to consider both descriptive and inferential statistics in multivariate
analysis to complement study findings. Most multivariate statistical tools stem from univariate
statistical tools. Through the use of a matrix, the data is analyzed based on correlation, variance
or covariance, sum-of-squares or cross-product.
If two or more variables are related to one another, it means that there are fundamental
overlapping traits or characteristics that are pointing to some underlying variances in the data
set. If two data sets share a significant amount of variance, it means that the researcher can
now explain a mix of variables that can explain its relationship.
4 research questions in research study:
1. Level of association between variables
2. Significant variances between group means
3. Predict association through one or more variables in two or more groups
4. Explain core structures
Techniques in doing multivariate analysis:
1. Multiple regression – examines the relationship of a single dependent variable and two
or more independent variables. This relationship looks into the linear relationship while
assuming normality, linearity, and equal variances. This can also be used as a
predictability tool.
2. Discriminant Analysis – categorizes properly observations or subject in homogenous or
like groups. The soundness of the data is derived from the degree of differences and
how well the model classifies. To know which variables have most impact, a researcher
will look at partial F- values. The higher the partial F value, the higher the impact of the
variable to the discriminant
3. Multivariate analysis of variance (MANOVA) – examines the association of a number of
categorical independent variables and two or more metric dependent variables. Contrary
to ANOVA, MANOVA analyzes the dependence relationship of different dependent
measures across groups. This statistical technique is often used in an experimental
design where outcomes are hypothesized.
4. Factor analysis – a researcher basically looks at the underlying structure of several
independent variables with a goal of narrowing down to a few variables to be studied in
research as each variable may have overlaps with others. Kaiser’s measure of sampling
adequacy (MSA) measures the degree to which one variable can be predicted by other variables. An
MSA of 0.80 is considered very good while an MSA of 0.50 is poor.
5. Cluster Analysis – separates large data sets into separate groups with common events.
These individuals or units may be classified based on commonalities or similarities of
traits or characteristics. There are three types of cluster analysis namely:
a. Hierarchical – a big group data is classified into smaller ones
b. Nonhierarchical – identification of clusters is done prior to data gathering
c. Mixture of both
Four rules are followed when doing cluster analysis:
1. Clusters should be different from one another
2. It should be available,
3. quantifiable and
4. big enough to be able to deduce information from
1. The correlation coefficient is denoted by r and is a measure of any linear
trend between two variables.
2. ANOVA uses the F-value to determine whether the between group
variability of means is larger than the within group variability of the
individual values.
3. Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable, denoted by Y, and a
series of other variables.
4. Mean is the average of the given numbers and is calculated by dividing the
sum of given numbers by the total number of numbers.
5. Two-tailed test is a method in which the critical area of a distribution is two-
sided and tests whether a sample is greater than or less than a certain range
of values. It is used in null-hypothesis testing and testing for statistical
significance.
6. The standard deviation, denoted by SD, measures the amount of
variability, or dispersion, from the individual data values to the mean.
7. The median is the middle number in a sorted, ascending or descending, list
of numbers and can be more descriptive of that data set than the average.
8. A chi-square statistic, denoted by χ2, is a measure of the difference
between the observed and expected frequencies of the outcomes of a set of
events or variables.
9. The mode is the most commonly observed value in a set of data.
10. A one-tailed test results from an alternative hypothesis which specifies a
direction. It is used to determine only if there is a difference between groups
in a specific direction.
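To illustrate the correlation coefficient and regression definitions in the list above, here is a minimal Python sketch. It assumes the Pearson formula for r and the least squares estimates for a simple linear regression, with b0 as the estimated intercept and b1 as the estimated slope. The data and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simple_linear_regression(x, y):
    """Return (b0, b1): estimated intercept and slope of the best fitting line."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical data in which y is exactly twice x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r = pearson_r(x, y)
b0, b1 = simple_linear_regression(x, y)
print(round(r, 2), round(b0, 2), round(b1, 2))  # 1.0 0.0 2.0
```

An r of +1 here reflects that as one variable increases, the other increases as well, in a perfectly linear way.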