RM Project File
Research Methodology
PRACTICAL FILE
WORKSHEET 1
Frequency Distribution
Frequency distribution is a very common and important method for analyzing nominal
(categorical) and ordinal (ranked) variables in a dataset. In every questionnaire, one
section is dedicated to demographic profiles. The different categories of demographic
profiles in a dataset are normally represented by frequency distributions in tabular as
well as graphical form.
The dataset of workers working in small and medium scale enterprises in a city of India is
shown below in table 1.1.
The coding details of the different variables in the dataset are shown below in table 1.2.
Religion: 1 = Hindu
2 = Muslim
3 = Other religion
The final SPSS output in tabular form is shown below in Table 1.3.
Conclusion: The education levels of the 50 workers were tabulated. The numbers of
workers with below 10th grade, high school, intermediate, technical diploma and degree
level education are 14 (28%), 17 (34%), 7 (14%), 5 (10%) and 7 (14%) respectively.
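The frequency table SPSS produces in Table 1.3 can be cross-checked with a short Python sketch. The per-worker list below is reconstructed from the counts reported in the conclusion, so the ordering of workers is hypothetical.

```python
from collections import Counter

# Education levels of the 50 workers, reconstructed from the counts
# reported in the conclusion (the per-worker ordering is hypothetical).
levels = (["Below 10th"] * 14 + ["High school"] * 17 +
          ["Intermediate"] * 7 + ["Technical diploma"] * 5 +
          ["Degree"] * 7)

freq = Counter(levels)
n = len(levels)
for level, count in freq.items():
    # Frequency and percentage, as in the SPSS frequency table.
    print(f"{level:18s} {count:3d} {100 * count / n:5.1f}%")
```

The percentages printed match those in the conclusion (28%, 34%, 14%, 10% and 14%).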
WORKSHEET 2
Measures of Central Tendency
There are three main measures of central tendency. These are as follows:
Arithmetic mean
Median
Mode
Arithmetic Mean
The mean of a variable represents its average value. It can be calculated by using the
following formula:

X̄ = (∑ fiXi) / (∑ fi), summed over i = 1 to n

where X̄ represents the mean, Xi the ith observation of the variable and fi the frequency
of the ith observation.
One of the problems with the arithmetic mean is that it is highly sensitive to the presence
of outliers in the data of the related variable. To avoid this problem, the trimmed mean of
the variable can be estimated. The trimmed mean is the value of the mean of a variable
after removing some extreme observations (e.g., 2.5 percent from both tails of the
distribution) from the frequency distribution.
Median
The median of a variable is the middle observation when the data are arranged in
ascending order; half of the observations lie below it and half above it.
Mode
The mode of a variable is the observation with the highest frequency or the highest
concentration of frequencies.
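As a rough illustration of these measures, including the trimmed mean, here is a sketch using only the Python standard library. The sales list is hypothetical, since table 2.1 is not reproduced here.

```python
import statistics

# Hypothetical monthly sales figures (in crores); table 2.1 is not
# reproduced here, so this short list is for illustration only.
sales = [2, 3, 3, 4, 5, 5, 5, 9]

mean = statistics.mean(sales)      # arithmetic mean
median = statistics.median(sales)  # middle observation
mode = statistics.mode(sales)      # most frequent observation

# Trimmed mean: drop extreme observations from both tails (here the
# single smallest and largest value) before averaging.
trimmed = statistics.mean(sorted(sales)[1:-1])
```

Note how the outlying value 9 pulls the mean up, while the trimmed mean discounts it.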
The dataset of monthly sales figures (in crores) of an enterprise for 50 consecutive months
is given in Table 2.1.
SPSS Commands
STEP 1: Click Analyse → Descriptive Statistics → Frequencies
STEP 2: Transfer the variable to variable window and click ‘statistics’ as shown in the
figure 2.2.
Statistics: sales per month
N: Valid = 50, Missing = 0
Mean = 61.42
Median = 55.00
Mode = 45a
Percentiles: 25 = 40.00, 50 = 55.00, 75 = 76.25
a. Multiple modes exist. The smallest value is shown.
Conclusion: Table 2.2 represents the SPSS output. The mean value of the sales figure is
61.42, the median is 55.00 and the mode is 45.
WORKSHEET 3
Outliers
Outliers are:
The extreme observations lying in the extreme tails of the probability distribution
of the variable.
The observations with the highest residuals for a relational model (e.g., a regression
model).
The observations that, if not included in the analysis, cause a significant difference
in the result.
On the basis of the cases mentioned above, outliers can be divided into three different
types:
1. Extreme values or univariate outliers
2. Multivariate outliers
3. Influencers
The required output is shown in table 3.2 and the box plot diagram is shown in figure 3.4.
Table 3.2: SPSS output of outlier testing

Extreme Values: hours spent for playing
Highest: 1) case 22, 13.0; 2) case 29, 5.0; 3) case 4, 4.5; 4) case 26, 4.5; 5) case 27, 4.5
Lowest: 1) case 8, .5; 2) case 30, 1.0; 3) case 11, 1.0; 4) case 10, 1.0; 5) case 15, 1.5a
a. Only a partial list of cases with the value 1.5 is shown in the table of lower extremes.
Conclusion: Table 3.2 represents the SPSS output of outlier testing. It shows the extreme
high and extreme low values in the sportsman dataset. Case numbers 22, 29, 4, 26 and 27
have the highest values, and case numbers 8, 30, 11, 10 and 15 have the lowest values.
Figure 3.4 shows that case number 22 is an outlier.
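The box-plot rule SPSS applies here can be sketched in Python. The hours list below is a hypothetical stand-in for the sportsman dataset, chosen so that the extreme value 13.0 (as for case 22 above) is flagged.

```python
import statistics

# Hypothetical stand-in for the 'hours spent for playing' variable;
# the full sportsman dataset is not reproduced above.
hours = [0.5, 1.0, 1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0,
         3.5, 4.5, 4.5, 4.5, 5.0, 13.0]

# Box-plot rule: observations beyond 1.5 * IQR from the quartiles
# are flagged as univariate outliers.
q1, q2, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in hours if x < lower or x > upper]
```

With this data only the value 13.0 falls outside the fences, mirroring the single outlier in figure 3.4.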
WORKSHEET 4.1
Test of Difference: One sample T-Test
In many situations, we come across claims made by marketers about their products. For
example, a car manufacturer may claim that the average mileage of a car is, say, 199.9
kmpl, or a business school may claim that the average package offered to its students is
Rs. 12 lakh per annum. A researcher may be interested in analyzing the truthfulness of
these claims. For this analysis, the researcher needs to randomly pick a small sample
from the population and compare its mean with the claimed population mean. The sample
mean and the population mean may be different from each other. In order to test whether
this difference is statistically significant, we apply a one-sample t-test.
“Ho : There is no significant difference between sample mean and population mean.”
The t-statistic in a one-sample t-test can be estimated by using the following formula:

t = (x̄ − μ) / (σ / √(N − 1))

where x̄ = sample mean, μ = population mean, σ = standard deviation of the sample and
N = sample size.
Objective: To find out the difference between the population mean (μ) and the sample
mean (x̄).
The dataset of weight lost (in kg) by 50 customers a month after joining the weight loss
program is shown in table 4.1.1.
Table4.1.1: Data of weight lost by 50 customers a month after joining the weight loss
program
S.No Weight lost S.No Weight lost S.No Weight lost
1 2 18 4 35 4
2 3 19 3 36 4
3 2 20 4 37 3
4 4 21 5 38 4
5 5 22 6 39 3
6 3 23 4 40 4
7 3 24 5 41 5
8 2 25 6 42 4
9 3 26 5 43 3
10 4 27 4 44 4
11 2 28 4 45 5
12 3 29 5 46 5
13 3 30 5 47 4
14 4 31 6 48 5
15 3 32 2 49 6
16 4 33 5 50 5
17 5 34 5
The final SPSS (Statistical Package for the Social Sciences) output in tabular form is
shown below in table 4.1.2 and table 4.1.3.
Table 4.1.2
One-Sample Statistics
WeightLost: N = 50, Mean = 4.0200, Std. Deviation = 1.11557, Std. Error Mean = .15777
Table 4.1.3
One-Sample Test (Test Value = 0)
WeightLost: t = 25.481, df = 49, Sig. (2-tailed) = .000, Mean Difference = 4.02000,
95% Confidence Interval of the Difference = [3.7030, 4.3370]
Since the p value is less than the significance level of 0.05, Ho is rejected.
Conclusion: The sample mean is 4.02 kg, which is less than the claimed population mean
of 5 kg. The t statistic is 25.481 with a p value of .000. Since the p value of the t
statistic is less than the 5% level of significance, the null hypothesis of no difference
between the sample mean and the population mean cannot be accepted at the 95%
confidence level, and it can be concluded that the sample mean is significantly different
from the population mean. Therefore, the company's claim about the weight loss of its
customers is not supported.
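The test can be reproduced in plain Python from the data in table 4.1.1, here testing directly against the claimed population mean of 5 kg (the SPSS run above used a test value of 0, which is why its t differs).

```python
import math
import statistics

# Weight lost (kg) by the 50 customers, from table 4.1.1 (S.No order).
weight_lost = [2, 3, 2, 4, 5, 3, 3, 2, 3, 4,
               2, 3, 3, 4, 3, 4, 5, 4, 3, 4,
               5, 6, 4, 5, 6, 5, 4, 4, 5, 5,
               6, 2, 5, 5, 4, 4, 3, 4, 3, 4,
               5, 4, 3, 4, 5, 5, 4, 5, 6, 5]

mu = 5  # claimed population mean (kg)
n = len(weight_lost)
mean = statistics.mean(weight_lost)  # 4.02, as in table 4.1.2
sd = statistics.stdev(weight_lost)   # sample standard deviation, ~1.1156
t = (mean - mu) / (sd / math.sqrt(n))
```

The resulting |t| of about 6.2 far exceeds the 5% critical value for df = 49 (about 2.01), so the null hypothesis is rejected either way.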
WORKSHEET 4.2
Test of Difference: Paired Sample T-Test
A paired sample t-test is also known as repeated sample t-test because data (responses) is
collected from the same respondents but at different time periods. A paired sample t-test
should be used when we want to test the impact of an event or experiment on the variable
under study. In this case, the data is collected from the same respondents before and after
the event. After this, the means are compared. The null hypothesis of the paired samples
t-test is that the means of the pre-sample and the post-sample are equal. Some of the
instances where the paired samples t-test can be applied are as follows:
Testing the sales performance of employees before and after a training program.
Testing the weight of customers before and after joining a weight loss program.
SPSS Commands
STEP 1: Click Analyse → Compare Means → Paired-Samples T Test
STEP 2: Click on the variable 'pre-training score', then click on the 'post-training score'
variable. Now, move the paired variables into the 'Paired Variables' box by clicking on
the right arrow button. Finally, click 'OK' as shown in figure 4.2.2.
Figure 4.2.2: Screenshot of paired sample t-test (step 2)
The final SPSS output in tabular form is shown below in tables 4.2.2, 4.2.3 and 4.2.4
respectively.
Conclusion: The paired sample statistics and paired sample correlation are shown in
tables 4.2.2 and 4.2.3 respectively. As shown in table 4.2.2, the mean sales figure before
the training program is 40050, whereas the mean sales figure after the training program is
43200; the sales figure increases after the training program. Table 4.2.3 indicates the
sample correlation coefficient between the pre-test and the post-test (0.982) and a test of
significance of the correlation (p < .001).
Table 4.2.4 shows the result of the paired sample t-test. The null hypothesis of the paired
sample t-test assumes that the pre-sample mean is the same as the post-sample mean. The
p value of the t statistic is found to be less than the 5% level of significance. Hence, the
null hypothesis cannot be accepted, and it can be concluded that the training program is
highly effective in increasing the sales figures of the company.
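A minimal sketch of the paired-sample computation, using hypothetical pre/post scores since the dataset behind tables 4.2.2 to 4.2.4 is not reproduced here. The test works on the per-respondent differences.

```python
import math
import statistics

# Hypothetical pre/post sales scores for five salespersons; the
# actual dataset behind tables 4.2.2-4.2.4 is not reproduced above.
pre = [40, 42, 38, 45, 41]
post = [44, 45, 41, 48, 46]

# Paired t-test: compute the difference for each respondent, then
# run a one-sample t-test on the differences against zero.
d = [b - a for a, b in zip(pre, post)]
n = len(d)
d_mean = statistics.mean(d)
d_sd = statistics.stdev(d)
t = d_mean / (d_sd / math.sqrt(n))  # df = n - 1
```

Because the same respondents appear in both columns, only the differences (not the two raw samples) enter the statistic.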
WORKSHEET 4.3
Test of Difference: Independent Sample T-Test
When we want to test the difference between two independent sample means, we use the
independent-samples t-test. The independent samples may belong to the same population
or to different populations. Some of the instances in which the independent samples t-test
can be used are as follows:
1. Testing difference in the average level of performance between employees with the
MBA degree and employees without the MBA degree.
2. Testing difference in the average wages received by labor in two different industries.
The t-statistic in the case of the independent-samples t-test can be calculated by using the
following formula:

t(x̄1 − x̄2) = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where x̄1 and x̄2 are the sample means, s1 and s2 the sample standard deviations, and n1
and n2 the sample sizes of the two groups.
In SPSS, the independent samples t-test is conducted in two stages. In stage one, the SPSS
software compares the variances of the two samples. The statistical method of comparing
two sample variances is known as Levene's test of homogeneity of variance. The null
hypothesis of this test is 'equal variances assumed', i.e., there is no significant difference
between the sample variances of the two independent samples; in other words, the two
samples are comparable. On the basis of Levene's test of homogeneity, SPSS gives two
values of the t-statistic. If the variances are equal, both values are the same. If the sample
variances are different, the t-statistic from the 'equal variances not assumed' row should
be considered for the final analysis.
Ho: There is no significant difference in the average performance of the employees for
age groups below and above 40 years of age.
H1: There is a significant difference in the average performance of the employees for age
groups below and above 40 years of age.
SPSS Commands
STEP 1: Click Analyse → Compare Means → Independent-Samples T Test
STEP 2: Send the test variable 'performance score' to the Test Variable(s) window. Then
send the 'age' variable to the Grouping Variable box and click 'Define Groups' as shown
in figure 4.3.2.
STEP 3: Now define the cut point as 40. Next click 'Continue' as shown in figure 4.3.3.
Figure 4.3.3: Screenshot of independent sample t-test (step 3)
The final SPSS output in tabular form is shown below in table 4.3.2 and table 4.3.3
respectively.
Group Statistics: Performance Score
Age >= 40: N = 22, Mean = 68.86, Std. Deviation = 19.075, Std. Error Mean = 4.067
Age < 40: N = 28, Mean = 55.07, Std. Deviation = 16.777, Std. Error Mean = 3.171
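The t-statistic can be recovered from the group statistics in table 4.3.2 alone. This sketch computes the 'equal variances not assumed' (Welch) value using the formula above.

```python
import math

# Group statistics from table 4.3.2.
n1, mean1, sd1 = 22, 68.86, 19.075   # age >= 40
n2, mean2, sd2 = 28, 55.07, 16.777   # age < 40

# t-statistic without assuming equal variances (the 'equal variances
# not assumed' row of the SPSS output).
se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
t = (mean1 - mean2) / se
```

The resulting t of about 2.67 exceeds the usual 5% critical value of roughly 2.01, suggesting a significant difference in performance between the two age groups.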
WORKSHEET 5
One-way ANOVA
Concept of ANOVA
Independent-samples t-test can be applied to situations where there are only two
independent samples. In other words, we can use independent-samples t-tests for
comparing the means of two populations (such as males and females). When we have
more than two independent samples, t-test is inappropriate. The Analysis of Variance
(ANOVA) has an advantage over t-test when the researcher wants to compare the means
of a large number of populations (i.e., three or more). ANOVA is a parametric test that is
used to study the difference among more than two groups in the datasets. It helps in
explaining the amount of variation in the dataset. In a dataset, two main types of
variations can occur. One type of variation occurs due to chance and the other type of
variation occurs due to specific reasons. These variations are studied separately in
ANOVA to identify the actual cause of variation and help the researcher in taking
effective decisions.
In case of more than two independent samples, the ANOVA test explains three types of
variance. These are as follows:
Total variance
Between group variance
Within group variance
The ANOVA test is based on the logic that if the between group variance is
significantly greater than the within group variance, it indicates that the means of
different samples are significantly different.
There are two main types of ANOVA, namely, one-way ANOVA and two-way
ANOVA. One-way ANOVA determines whether all the independent samples (groups)
have the same group means or not. On the other hand, two-way ANOVA is used when
you need to study the impact of two categorical variables on a scale variable.
Objective: To find out the difference between salaries of graduates, post graduates and
PhDs.
Ho: There is no difference between the salaries of graduates, post graduates and PhDs.
H1: There is a difference between the salaries of graduates, post graduates and PhDs.
SPSS Commands
STEP 1: Click Analyse → Compare Means → One-way ANOVA
Figure 5.1: Screenshot of one-way ANOVA (step 1)
STEP 2: Transfer the variable 'salary' to the Dependent List window and the variable
'qualification' to the Factor window.
STEP 3: Select 'Post Hoc' and then click 'Tukey' as shown below in figure 5.3.
Figure 5.3: Screenshot of one-way ANOVA (step 3)
STEP 4: Click 'Options' and select 'Homogeneity of variance test' and 'Means plot' as
shown below in figure 5.4.
Descriptives: Salary
Graduate: N = 10, Mean = 32200.0000, Std. Deviation = 7828.72205, Std. Error =
2475.65928, 95% CI for Mean = [26599.6696, 37800.3304], Min = 23000, Max = 45000
Post Graduate: N = 9, Mean = 44000.0000, Std. Deviation = 12629.33094, Std. Error =
4209.77698, 95% CI for Mean = [34292.2369, 53707.7631], Min = 32000, Max = 65000
PhD: N = 9, Mean = 50555.5556, Std. Deviation = 15297.96646, Std. Error = 5099.32215,
95% CI for Mean = [38796.4976, 62314.6135], Min = 36000, Max = 85000
Total: N = 28, Mean = 41892.8571, Std. Deviation = 14082.66411, Std. Error =
2661.37336, 95% CI for Mean = [36432.1701, 47353.5442], Min = 23000, Max = 85000

Test of Homogeneity of Variances: Levene Statistic = 1.450, df1 = 2, df2 = 25, Sig. = .254
Conclusion: Table 5.2 indicates that the average salary of graduates is 32200, of post
graduates 44000 and of PhDs 50555.5556; the average salary of PhDs is the highest and
that of graduates the lowest. Table 5.3 presents the Levene test, whose null hypothesis is
that all sample variances are the same. The significance value of 0.254 indicates that at
the 95% level of confidence the null hypothesis can be accepted; homogeneity of
variance is one of the desired conditions of the one-way ANOVA test. Table 5.4 presents
the results of the F test in the one-way ANOVA. The p value of the F statistic (F = 5.591)
is less than the 5% level of significance. Hence, with 95% confidence, the null hypothesis
of equal group means is rejected, and it can be concluded that the average salaries of
graduates, post graduates and PhDs are significantly different.
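The F statistic in table 5.4 can be reconstructed from the group summaries in table 5.2 alone, which also makes the between-group/within-group variance decomposition explicit.

```python
# Group summary statistics (N, mean, sample SD) taken from table 5.2.
groups = [
    (10, 32200.0000, 7828.72205),   # Graduate
    (9, 44000.0000, 12629.33094),   # Post Graduate
    (9, 50555.5556, 15297.96646),   # PhD
]

N = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / N

# Between-group and within-group sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * s ** 2 for n, _, s in groups)

df_between = len(groups) - 1   # 2
df_within = N - len(groups)    # 25
F = (ss_between / df_between) / (ss_within / df_within)
```

The computed F is about 5.59, matching the value quoted in the conclusion; the large between-group variance relative to the within-group variance is what drives the rejection of the null hypothesis.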
WORKSHEET 6
Chi-Square Test
Chi-square test is one of the most popular non-parametric tests. It is used in two cases
which are as follows:
To test the association between nominal variables in research.
To test the difference between the expected and observed frequencies of an event.
The chi-square test compares the actual observed frequencies with the calculated
expected frequencies for different combinations of nominal variables. The difference
between observed and expected frequencies gives an indication of a possible association
between the categorical variables. The chi-square statistic compares the observed count
in each table cell to the count that would be expected for the row and column
classifications under the assumption of no association. A negligible difference between
observed and expected frequencies may indicate no association, whereas a big difference
may indicate the possibility of an association.
Objective: To analyze the association between education background and level of
familiarity with the internet.
Ho: There is no significant association between education background and level of
familiarity with the internet.
H1: There is a significant association between education background and level of
familiarity with the internet.
Table 6.2 has the data collected from 100 internet users. The data consists of two nominal
variables ‘Level of familiarity with the internet’ and ‘Education Background.’ The details
of the codes provided to different sub-categories of these nominal variables are shown in
table 6.1.
Table 6.1 Codes provided to sub-categories
Codes for the variable 'Level of familiarity with the internet': 1 = Low familiarity,
2 = Medium, 3 = High
Codes for the variable 'Education Background': 1 = Humanities, 2 = Management,
3 = Technology, 4 = IT
SPSS Commands
STEP 1: Click Analyse → Descriptive Statistics → Crosstabs
STEP 2: Transfer 'education background' to the Row(s) window and 'familiarity with the
internet' to the Column(s) window. Click 'Statistics' as shown in figure 6.2.
STEP 3: Select 'Chi-square' and 'Phi and Cramer's V' and click 'Continue'.
STEP 4: Click on 'Cells', select 'Observed' and 'Expected', and click 'Continue' as
shown in figure 6.4.
Chi-Square Tests
Pearson Chi-Square: Value = 9.515a, df = 6, Asymp. Sig. (2-sided) = .147
Likelihood Ratio: Value = 10.096, df = 6, Asymp. Sig. (2-sided) = .121
Linear-by-Linear Association: Value = .002, df = 1, Asymp. Sig. (2-sided) = .963
N of Valid Cases: 100
a. 2 cells (16.7%) have expected count less than 5. The minimum expected count is 4.80.

Symmetric Measures
Nominal by Nominal: Phi = .308 (Approx. Sig. = .147), Cramer's V = .218
(Approx. Sig. = .147)
N of Valid Cases: 100
Conclusion: The p value (.147) is greater than the 5% level of significance, which
indicates that the null hypothesis of no association between education background and
level of familiarity with the internet cannot be rejected.
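The observed-versus-expected comparison described above can be sketched in a few lines of Python. The 4 × 3 contingency table below is hypothetical, since table 6.2 is not reproduced here; the function itself implements the standard Pearson chi-square formula.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under the assumption of no association.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical 4 (education) x 3 (familiarity) table of 100 users;
# table 6.2 is not reproduced above, so these counts are illustrative.
table = [[10, 12, 3],
         [8, 10, 7],
         [6, 9, 10],
         [4, 8, 13]]
chi2, df = chi_square(table)
```

A 4 × 3 table always yields df = (4 − 1)(3 − 1) = 6, matching the SPSS output above, and a perfectly proportional table yields a chi-square of zero.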