Data Collection Statistics

Data collection plays a crucial role in statistical analysis. In research, the various methods used to gather information fall into two categories: primary data and secondary data. As the names suggest, primary data is collected for the first time by the researcher, while secondary data has already been collected or produced by others.

There are many differences between primary and secondary data, which are discussed in this article. The most important difference is that primary data is factual and original, whereas secondary data is essentially the analysis and interpretation of primary data. Primary data is collected with the aim of solving the problem at hand, while secondary data was collected for other purposes.

Content: Primary Data vs. Secondary Data


1. Comparison Chart
2. Definition
3. Key Differences
4. Conclusion

Comparison Chart

Meaning: Primary data is first-hand data gathered by the researcher himself; secondary data is data collected by someone else earlier.

Data: Primary is real-time data; secondary relates to the past.

Process: Primary is very involved; secondary is quick and easy.

Source: Primary comes from surveys, observations, experiments, questionnaires, personal interviews, etc.; secondary comes from government publications, websites, books, journal articles, internal records, etc.

Cost effectiveness: Primary is expensive; secondary is economical.

Collection time: Primary takes long; secondary is short.

Specificity: Primary is always specific to the researcher's needs; secondary may or may not be.

Available in: Primary is in crude form; secondary is in refined form.

Accuracy and reliability: Primary is more accurate and reliable; secondary is relatively less so.

Definition of Primary Data

Primary data is data originated for the first time by the researcher through direct efforts and experience, specifically for the purpose of addressing his research problem. It is also known as first-hand or raw data. Primary data collection is quite expensive, as the research is conducted by the organisation or agency itself, which requires resources like investment and manpower. The data collection is under the direct control and supervision of the investigator.

The data can be collected through various methods such as surveys, observations, physical testing, mailed questionnaires, questionnaires filled in and sent by enumerators, personal interviews, telephonic interviews, focus groups, case studies, etc.

Definition of Secondary Data

Secondary data refers to second-hand information that has already been collected and recorded by someone other than the user, for a purpose not related to the current research problem. It is the readily available form of data, collected from sources such as censuses, government publications, internal records of the organisation, reports, books, journal articles, websites and so on.

Secondary data offers several advantages: it is easily available and saves the researcher time and cost. But there are disadvantages as well; because the data was gathered for purposes other than the problem in mind, its usefulness may be limited in ways such as relevance and accuracy.

Moreover, the objective and the method adopted for acquiring the data may not be suitable to the current situation. These factors should therefore be kept in mind before using secondary data.

Key Differences Between Primary and Secondary Data


The fundamental differences between primary and secondary data are
discussed in the following points:

1. The term primary data refers to data originated by the researcher for the first time. Secondary data is already existing data, collected earlier by investigating agencies and organisations.
2. Primary data is real-time data, whereas secondary data relates to the past.
3. Primary data is collected to address the problem at hand, while secondary data was collected for purposes other than the problem at hand.
4. Primary data collection is a very involved process. On the other hand, the secondary data collection process is rapid and easy.
5. Primary data collection sources include surveys, observations, experiments, questionnaires, personal interviews, etc. On the contrary, secondary data collection sources are government publications, websites, books, journal articles, internal records, etc.
6. Primary data collection requires a large amount of resources like time, cost and manpower. Conversely, secondary data is relatively inexpensive and quickly available.
7. Primary data is always specific to the researcher's needs, and the researcher controls the quality of the research. In contrast, secondary data is neither specific to the researcher's needs, nor does the researcher have control over the data quality.
8. Primary data is available in raw form, whereas secondary data is a refined form of primary data. It can also be said that secondary data is obtained when statistical methods are applied to primary data.
9. Data collected through primary sources is more reliable and accurate than data collected from secondary sources.

Conclusion

As can be seen from the above discussion, primary data is original and unique data collected directly by the researcher from a source according to his requirements. Secondary data, by contrast, is easily accessible but is not as pure, as it has already undergone many statistical treatments.

Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis of
virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics.
With descriptive statistics you are simply describing what is or what the
data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer from the sample data what the
population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups
is a dependable one or one that might have happened by chance in this
study. Thus, we use inferential statistics to make inferences from our
data to more general conditions; we use descriptive statistics simply to
describe what's going on in our data.

Descriptive Statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures.
Or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a
sensible way. Each descriptive statistic reduces lots of data into a
simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting
average. This single number is simply the number of hits divided by the
number of times at bat (reported to three significant digits). A batter who
is hitting .333 is getting a hit one time in every three at bats. One batting
.250 is hitting one time in four. The single number describes a large
number of discrete events. Or, consider the scourge of many students,
the Grade Point Average (GPA). This single number describes the
general performance of a student across a potentially wide range of
course experiences.
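As a rough illustration, here is a minimal Python sketch of how such single-number summaries are computed. The sample figures are invented, and the GPA helper assumes a credit-weighted average, which is one common convention rather than something stated in the text.

```python
# Minimal sketch: reducing many discrete events to one descriptive number.
# The sample figures below are invented for illustration.

def batting_average(hits, at_bats):
    # Number of hits divided by times at bat, reported to three digits.
    return round(hits / at_bats, 3)

def grade_point_average(grade_points, credit_hours):
    # Credit-weighted average of grade points across courses (one common convention).
    weighted = sum(g * h for g, h in zip(grade_points, credit_hours))
    return weighted / sum(credit_hours)

print(batting_average(1, 3))                            # 0.333, "hitting .333"
print(batting_average(1, 4))                            # 0.25, "hitting .250"
print(grade_point_average([4.0, 3.0, 2.0], [3, 3, 4]))  # 2.9
```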

Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing important
detail. The batting average doesn't tell you whether the batter is hitting
home runs or singles. It doesn't tell whether she's been in a slump or on
a streak. The GPA doesn't tell you whether the student was in difficult
courses or easy ones, or whether they were courses in their major field
or in other disciplines. Even given these limitations, descriptive statistics
provide a powerful summary that may enable comparisons across
people or other units.

Univariate Analysis
Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:

 the distribution

 the central tendency

 the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in our study.

The Distribution. The distribution is a summary of the frequency of
individual values or ranges of values for a variable. The simplest
distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the
distribution of college students is by year in college, listing the number or
percent of students at each of the four years. Or, we describe gender by
listing the number or percent of males and females. In these cases, the
variable has few enough values that we can list each one and
summarize how many sample cases had the value. But what do we do
for a variable like income or GPA? With these variables there can be a
large number of possible values, with relatively few people having each
one. In this case, we group the raw scores into categories according to
ranges of values. For instance, we might look at GPA according to the
letter grade ranges. Or, we might group income into four or five ranges
of income values.

Table 1. Frequency distribution table.

One of the most common ways to describe a single variable is with
a frequency distribution. Depending on the particular variable, all of
the data values may be represented, or you may group the values into
categories first (e.g., with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value.
Rather, the value are grouped into ranges and the frequencies
determined.). Frequency distributions can be depicted in two ways, as a
table or as a graph. Table 1 shows an age frequency distribution with
five categories of age ranges defined. The same frequency distribution
can be depicted in a graph as shown in Figure 1. This type of graph is
often referred to as a histogram or bar chart.
Figure 1. Frequency distribution bar chart.
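As a rough sketch of how such a frequency table is built, the following Python snippet groups raw ages into ranges and counts the cases in each range. The ages and the bin boundaries are invented for illustration; the actual categories in Table 1 are not reproduced here.

```python
from collections import Counter

# Rough sketch: grouping raw ages into ranges and counting frequencies.
# The ages and the bin boundaries are invented for illustration.
ages = [18, 21, 24, 29, 33, 35, 41, 44, 47, 52, 56, 61, 63, 67, 72]
bins = [(15, 25), (26, 35), (36, 45), (46, 55), (56, 75)]

def age_range(age):
    for low, high in bins:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"

frequencies = Counter(age_range(a) for a in ages)
for category, count in sorted(frequencies.items()):
    print(category, count)
```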

Distributions may also be displayed using percentages. For example,
you could use percentages to describe the:

 percentage of people in different income levels

 percentage of people in different age ranges

 percentage of people in different ranges of standardized test scores

Central Tendency. The central tendency of a distribution is an estimate
of the "center" of a distribution of values. There are three major types of
estimates of central tendency:

 Mean

 Median

 Mode

The Mean or average is probably the most commonly used method of
describing central tendency. To compute the mean all you do is add up
all the values and divide by the number of values. For example, the
mean or average quiz score is determined by summing all the scores
and dividing by the number of students taking the exam. For example,
consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15


The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

The Median is the score found at the exact middle of the set of values.
One way to compute the median is to list all scores in numerical order,
and then locate the score in the center of the sample. For example, if
there are 500 scores in the list, score #250 would be the median. If we
order the 8 scores shown above, we would get:

15,15,15,20,20,21,25,36

There are 8 scores, and scores #4 and #5 represent the halfway point.
Since both of these scores are 20, the median is 20. If the two middle
scores had different values, you would have to interpolate to determine
the median.
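A minimal Python sketch of this procedure (the helper function below is ours, not from the text), including the averaging step used when there is an even number of scores:

```python
# Minimal sketch of the median procedure: sort the scores, then take the
# middle one (odd n) or average the two middle ones (even n).
def median(scores):
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # interpolate between the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([15, 20, 21, 20, 36, 15, 25, 15]))  # 20.0 (scores #4 and #5 are both 20)
```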

The mode is the most frequently occurring value in the set of scores. To
determine the mode, you might again order the scores as shown above,
and then count each one. The most frequently occurring value is the
mode. In our example, the value 15 occurs three times and is the mode.
In some distributions there is more than one modal value. For instance,
in a bimodal distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values --
20.875, 20, and 15 -- for the mean, median and mode respectively. If the
distribution is truly normal (i.e., bell-shaped), the mean, median and
mode are all equal to each other.

Dispersion. Dispersion refers to the spread of the values around the
central tendency. There are two common measures of dispersion, the
range and the standard deviation. The range is simply the highest value
minus the lowest value. In our example distribution, the high value is 36
and the low is 15, so the range is 36 - 15 = 21.

The Standard Deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The Standard Deviation shows the relation that the set of scores has to the mean of the sample. Again let's take the set of scores:

15,20,21,20,36,15,25,15

To compute the standard deviation, we first find the distance between
each value and the mean. We know from above that the mean is 20.875.
So, the differences from the mean are:

15 - 20.875 = -5.875

20 - 20.875 = -0.875

21 - 20.875 = +0.125

20 - 20.875 = -0.875

36 - 20.875 = 15.125

15 - 20.875 = -5.875

25 - 20.875 = +4.125

15 - 20.875 = -5.875

Notice that values that are below the mean have negative discrepancies
and values above it have positive ones. Next, we square each
discrepancy:

-5.875 * -5.875 = 34.515625

-0.875 * -0.875 = 0.765625

+0.125 * +0.125 = 0.015625

-0.875 * -0.875 = 0.765625

15.125 * 15.125 = 228.765625

-5.875 * -5.875 = 34.515625

+4.125 * +4.125 = 17.015625

-5.875 * -5.875 = 34.515625

Now, we take these "squares" and sum them to get the Sum of Squares
(SS) value. Here, the sum is 350.875. Next, we divide this sum by the
number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This
value is known as the variance. To get the standard deviation, we take
the square root of the variance (remember that we squared the
deviations earlier). This would be SQRT(50.125) = 7.079901129253.

Although this computation may seem convoluted, it's actually quite simple. To see this, consider the formula for the standard deviation:

s = SQRT( sum of (X - mean)^2 / (n - 1) )

In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1. The ratio is the variance and the square root is the standard deviation. In English, we can describe the standard deviation as:

the square root of the sum of the squared deviations from the mean, divided by the number of scores minus one

Although we can calculate these univariate statistics by hand, it gets
quite tedious when you have more than a few values and variables.
Every statistics program is capable of calculating them easily for you.
For instance, I put the eight scores into SPSS and got the following table
as a result:
N 8

Mean 20.8750

Median 20.0000

Mode 15.00

Std. Deviation 7.0799

Variance 50.1250

Range 21.00

which confirms the calculations I did by hand above.
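If SPSS is not at hand, the same summary table can be reproduced with, for example, Python's standard statistics module; this is a minimal sketch, not the program used in the text:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print("N             ", len(scores))                 # 8
print("Mean          ", statistics.mean(scores))     # 20.875
print("Median        ", statistics.median(scores))   # 20.0
print("Mode          ", statistics.mode(scores))     # 15
print("Std. Deviation", statistics.stdev(scores))    # 7.0799...
print("Variance      ", statistics.variance(scores)) # 50.125
print("Range         ", max(scores) - min(scores))   # 21
```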

The standard deviation allows us to reach some conclusions about
specific scores in our distribution. Assuming that the distribution of
scores is normal or bell-shaped (or close to it!), the following conclusions
can be reached:

 approximately 68% of the scores in the sample fall within one standard deviation of the mean

 approximately 95% of the scores in the sample fall within two standard deviations of the mean

 approximately 99% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard
deviation is 7.0799, we can from the above statement estimate that
approximately 95% of the scores will fall in the range of 20.875-
(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348. This
kind of information is a critical stepping stone to enabling us to compare
the performance of an individual on one variable with their performance
on another, even when the variables are measured on entirely different
scales.
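A quick check of that arithmetic in Python:

```python
# Quick check of the "within two standard deviations" range used above.
mean, sd = 20.875, 7.0799
low, high = mean - 2 * sd, mean + 2 * sd
print(round(low, 4), round(high, 4))  # 6.7152 35.0348
```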
Deciles are similar to quartiles. But while quartiles sort data into four quarters, deciles sort data into ten
equal parts: The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.
A decile rank assigns a number to a decile:
Decile Rank    Percentile
1              10th
2              20th
3              30th
4              40th
5              50th
6              60th
7              70th
8              80th
9              90th
10             100th

The higher your place in the decile rankings, the higher your overall ranking. For example, if you were in
the 99th percentile for a particular test, that would put you in the decile ranking of 10. A person who
scored very low (say, the 5th percentile) would find themselves in a decile rank of 1.
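A minimal sketch of this percentile-to-decile-rank mapping (the helper function is illustrative, not from the text):

```python
import math

# Minimal sketch of the percentile-to-decile-rank mapping described above.
def decile_rank(percentile):
    # Map a percentile (0 < p <= 100) to a decile rank from 1 to 10.
    return math.ceil(percentile / 10)

print(decile_rank(99))  # 10
print(decile_rank(5))   # 1
print(decile_rank(10))  # 1 (the 10th percentile tops decile rank 1)
```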

A chart showing decile rankings for discharged stroke patients. Image: SUNY Buffalo

This section on deciles looks at these partition values for a given set of data.

Decile: Definition

Deciles are the nine partition values that divide the data, or a given set of observations, into ten equal parts. These 9 values are represented by D₁, D₂, D₃, D₄, D₅, D₆, D₇, D₈ and D₉.

They mark the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% points of the data.

For ungrouped data:

Example 1:

Given the series 3, 5, 7, 4, 6, 2 and 9.

Calculate the 2nd and 4th decile.


Solution:

To find the deciles, first we have to arrange the data in order:

2, 3, 4, 5, 6, 7 and 9.

Here n = 7.

D₂ = value of the 2[(n+1)/10]th item
   = value of the 2[(7+1)/10]th item
   = value of the 1.6th item
   = 1st value + 0.6 of the distance between the 1st and 2nd values
   = 2 + 0.6(3 - 2)

D₂ = 2.6

Now let us find the value of D₄.

The ordered data is 2, 3, 4, 5, 6, 7 and 9, and n = 7.

D₄ = value of the 4[(n+1)/10]th item
   = value of the 4[(7+1)/10]th item
   = value of the 3.2th item
   = 3rd value + 0.2 of the distance between the 3rd and 4th values
   = 4 + 0.2(5 - 4)

D₄ = 4.2
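As a check, here is a minimal Python sketch (the decile function is ours, not from the text) implementing the same k(n+1)/10 interpolation rule:

```python
# Minimal sketch of the ungrouped-data rule used above: D_k is the value of
# the k(n+1)/10-th ordered item, interpolating when the position is fractional.
def decile(data, k):
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / 10      # 1-based position of D_k
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

series = [3, 5, 7, 4, 6, 2, 9]
print(decile(series, 2))  # 2.6
print(decile(series, 4))  # 4.2
```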

For grouped data, the decile formula (reconstructed here in the notation used in the worked example below) is:

Dₖ = Lᵢ + {[(k·N)/10 - Fᵢ₋₁]/fᵢ}·aᵢ

Where

Lᵢ = lower limit of the decile class
N = sum of the absolute frequencies
Fᵢ₋₁ = cumulative frequency of the class immediately below the decile class
fᵢ = absolute frequency of the decile class
aᵢ = width of the decile class

Note: The decile is independent of the width of the classes.

Let us consider an example for grouped data.

Example:
Calculate the decile D₁ and D₃ for the following table.

Solution:

Calculation for the first decile:

D₁ = L₁ + {[(k·N)/10 - F₁₋₁]/f₁}·a₁
   = 40 + {[(1·70)/10 - 0]/8}·10
   = 40 + [(7 - 0)/8]·10
   = 40 + 70/8
   = 40 + 8.75
   = 48.75
Calculation for the third decile:

D₃ = L₃ + {[(k·N)/10 - F₃₋₁]/f₃}·a₃
   = 60 + {[(3·70)/10 - 20]/14}·10
   = 60 + [(21 - 20)/14]·10
   = 60 + 10/14
   = 60 + 0.71
   = 60.71
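For reference, here is a minimal Python sketch of the grouped-data formula (our helper, not from the text), using the class values that appear in the worked calculations above, since the frequency table itself is not reproduced here:

```python
# Minimal sketch of the grouped-data decile formula used above:
#   D_k = L + ((k*N/10 - F_prev) / f) * a
# The class values are taken from the worked calculations; the full
# frequency table itself is not reproduced here.
def grouped_decile(k, L, N, F_prev, f, a):
    return L + ((k * N / 10 - F_prev) / f) * a

print(grouped_decile(k=1, L=40, N=70, F_prev=0,  f=8,  a=10))  # 48.75
print(grouped_decile(k=3, L=60, N=70, F_prev=20, f=14, a=10))  # about 60.71
```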

The following formula can also be used to find the deciles for grouped data:

Dₖ = l + (h/f)·[(k·n)/10 - c]

Where

l = lower boundary of the class containing the kth decile
h = width of that class
f = frequency of that class
n = total number of frequencies
c = cumulative frequency preceding that class
Both methods give the same answer, so students can choose whichever formula they find more convenient for computing the deciles.
