Answer
Sampling is that part of statistical practice concerned with the
selection of individual observations intended to yield some
knowledge about a population of concern, especially for the
purposes of statistical inference. In particular, results from
probability theory and statistical theory are employed to guide
practice.
The sampling process consists of five stages:
Definition of population of concern
Specification of a sampling frame, a set of items or events that
it is possible to measure
Specification of sampling method for selecting items or events
from the frame
Sampling and data collecting
Review of sampling process
Sampling methods
Given a suitable sampling frame, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:
Nature and quality of the frame
Availability of auxiliary information about units on the frame
Accuracy requirements, and the need to measure accuracy
Whether detailed analysis of the sample is expected
Cost/operational concerns
Simple random sampling
In a simple random sample ('SRS') of a given size, all such
subsets of the frame are given an equal probability. Each
element of the frame thus has an equal probability of
selection: the frame is not subdivided or partitioned.
Furthermore, any given pair of elements has the same chance
of selection as any other such pair (and similarly for triples,
and so on). This minimises bias and simplifies analysis of
results. In particular, the variance between individual results
within the sample is a good indicator of variance in the overall
population, which makes it relatively easy to estimate the
accuracy of results.
However, SRS can be vulnerable to sampling error because the
randomness of the selection may result in a sample that
doesn't reflect the makeup of the population. For instance, a
simple random sample of ten people from a given country will
on average produce five men and five women, but any given
trial is likely to overrepresent one sex and underrepresent the
other. Systematic and stratified techniques, discussed below,
attempt to overcome this problem by using information about
the population to choose a more representative sample.
SRS may also be cumbersome and tedious when sampling from
an unusually large target population. In some cases,
investigators are interested in research questions specific to
subgroups of the population. For example, researchers might
be interested in examining whether cognitive ability as a
predictor of job performance is equally applicable across racial
groups. SRS cannot accommodate the needs of researchers in
this situation because it does not provide subsamples of the
population. Stratified sampling, which is discussed below,
addresses this weakness of SRS.
Simple random sampling is always an EPS ('equal probability of selection') design, but not all EPS designs are simple random sampling.
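To make the idea concrete, the following is a minimal sketch of drawing a simple random sample in Python. The frame contents and sample size are invented for illustration; random.sample draws without replacement, so every subset of the chosen size is equally likely.

```python
import random

# Hypothetical sampling frame: 1000 element IDs (illustrative only).
frame = list(range(1, 1001))
n = 50  # desired sample size

# random.sample draws without replacement; every subset of size n has the
# same probability of being chosen, the defining property of an SRS.
srs = random.sample(frame, n)
print(sorted(srs))
```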
Systematic sampling
Systematic sampling relies on arranging the target population
according to some ordering scheme and then selecting
elements at regular intervals through that ordered list.
Systematic sampling involves a random start and then
proceeds with the selection of every kth element from then
onwards. In this case, k=(population size/sample size). It is
important that the starting point is not automatically the first
in the list, but is instead randomly chosen from within the first
to the kth element in the list. A simple example would be to
select every 10th name from the telephone directory (an 'every
10th' sample, also referred to as 'sampling with a skip of 10').
As long as the starting point is randomized, systematic
sampling is a type of probability sampling. It is easy to
implement and the stratification induced can make it efficient,
if the variable by which the list is ordered is correlated with
the variable of interest. 'Every 10th' sampling is especially
useful for efficient sampling from databases.
Example: Suppose we wish to sample people from a long street
that starts in a poor district (house #1) and ends in an
expensive district (house #1000). A simple random selection of
addresses from this street could easily end up with too many
from the high end and too few from the low end (or vice
versa), leading to an unrepresentative sample. Selecting (e.g.)
every 10th street number along the street ensures that the
sample is spread evenly along the length of the street,
representing all of these districts. (Note that if we always start
at house #1 and end at #991, the sample is slightly biased
towards the low end; by randomly selecting the start between
#1 and #10, this bias is eliminated.)
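As a rough sketch of this street example, the code below picks a random start between 1 and k and then takes every kth house; the house numbering and interval follow the example above.

```python
import random

houses = list(range(1, 1001))  # house #1 .. #1000
k = 10                          # sampling interval ('skip')

# A random start between the 1st and kth house removes the bias of
# always beginning at house #1.
start = random.randint(1, k)
sample = houses[start - 1::k]   # every kth house from the random start

print(len(sample), sample[:5], "...", sample[-1])
```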
However, systematic sampling is especially vulnerable to
periodicities in the list. If periodicity is present and the period
is a multiple or factor of the interval used, the sample is
especially likely to be unrepresentative of the overall
population, making the scheme less accurate than simple
random sampling.
Example: Consider a street where the odd-numbered houses
are all on the north (expensive) side of the road, and the even-
numbered houses are all on the south (cheap) side. Under the
sampling scheme given above, it is impossible to get a
representative sample; either the houses sampled will all be
from the odd-numbered, expensive side, or they will all be
from the even-numbered, cheap side.
Another drawback of systematic sampling is that even in
scenarios where it is more accurate than SRS, its theoretical
properties make it difficult to quantify that accuracy. (In the
two examples of systematic sampling that are given above,
much of the potential sampling error is due to variation
between neighbouring houses - but because this method never
selects two neighbouring houses, the sample will not give us
any information on that variation.)
As described above, systematic sampling is an EPS method,
because all elements have the same probability of selection (in
the example given, one in ten). It is not 'simple random
sampling' because different subsets of the same size have
different selection probabilities - e.g. the set {4,14,24,...,994}
has a one-in-ten probability of selection, but the set
{4,13,24,34,...} has zero probability of selection.
Systematic sampling can also be adapted to a non-EPS
approach; for an example, see discussion of PPS samples
below.
Stratified sampling
Where the population embraces a number of distinct
categories, the frame can be organized by these categories
into separate "strata." Each stratum is then sampled as an
independent sub-population, out of which individual elements
can be randomly selected[3]. There are several potential
benefits to stratified sampling.
First, dividing the population into distinct, independent strata
can enable researchers to draw inferences about specific
subgroups that may be lost in a more generalized random
sample.
Second, utilizing a stratified sampling method can lead to
more efficient statistical estimates (provided that strata are
selected based upon relevance to the criterion in question,
instead of availability of the samples). It is important to note
that even if a stratified sampling approach does not lead to
increased statistical efficiency, such a tactic will not result in
less efficiency than would simple random sampling, provided
that each stratum is proportional to the group’s size in the
population.
Third, it is sometimes the case that data are more readily
available for individual, pre-existing strata within a population
than for the overall population; in such cases, using a
stratified sampling approach may be more convenient than
aggregating data across groups (though this may potentially
be at odds with the previously noted importance of utilizing
criterion-relevant strata).
Finally, since each stratum is treated as an independent
population, different sampling approaches can be applied to
different strata, potentially enabling researchers to use the
approach best suited (or most cost-effective) for each
identified subgroup within the population.
There are, however, some potential drawbacks to using
stratified sampling. First, identifying strata and implementing
such an approach can increase the cost and complexity of
sample selection, as well as leading to increased complexity of
population estimates. Second, when examining multiple
criteria, stratifying variables may be related to some, but not
to others, further complicating the design, and potentially
reducing the utility of the strata. Finally, in some cases (such
as designs with a large number of strata, or those with a
specified minimum sample size per group), stratified sampling
can potentially require a larger sample than would other
methods (although in most cases, the required sample size would be no larger than would be required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
Variability within strata is minimized
Variability between strata is maximized
The variables upon which the population is stratified are
strongly correlated with the desired dependent variable.
Advantages over other sampling methods
Focuses on important subpopulations and ignores irrelevant
ones.
Allows use of different sampling techniques for different
subpopulations.
Improves the accuracy/efficiency of estimation.
Permits greater balancing of statistical power of tests of
differences between strata by sampling equal numbers from
strata varying widely in size.
Disadvantages
Requires selection of relevant stratification variables which
can be difficult.
Is not useful when there are no homogeneous subgroups.
Can be expensive to implement.
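As a minimal sketch of proportional stratified sampling (the strata names and sizes below are invented for illustration), each stratum is sampled independently, with its share of the overall sample proportional to its share of the population:

```python
import random

# Hypothetical frame grouped into strata (names and sizes are illustrative).
strata = {
    "urban": list(range(0, 600)),
    "suburban": list(range(600, 900)),
    "rural": list(range(900, 1000)),
}
total = sum(len(units) for units in strata.values())
n = 100  # overall sample size

sample = {}
for name, units in strata.items():
    # Proportional allocation: each stratum contributes in proportion to
    # its size, then a simple random sample is drawn within the stratum.
    n_h = round(n * len(units) / total)
    sample[name] = random.sample(units, n_h)

print({name: len(s) for name, s in sample.items()})
```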
Poststratification
Stratification is sometimes introduced after the sampling
phase in a process called "poststratification". This approach is
typically implemented due to a lack of prior knowledge of an
appropriate stratifying variable or when the experimenter
lacks the necessary information to create a stratifying variable
during the sampling phase. Although the method is susceptible
to the pitfalls of post hoc approaches, it can provide several
benefits in the right situation. Implementation usually follows
a simple random sample. In addition to allowing for
stratification on an ancillary variable, poststratification can be
used to implement weighting, which can improve the precision
of a sample's estimates.
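A minimal sketch of how poststratification weights might be computed, assuming known population shares for the stratifying variable (the shares and sample counts below are illustrative): each respondent is weighted by the ratio of the stratum's population share to its share in the realised sample.

```python
# Known population shares for the stratifying variable (assumed values).
population_share = {"male": 0.50, "female": 0.50}

# Composition of the realised sample (also assumed): 40 men, 60 women.
sample_counts = {"male": 40, "female": 60}
n = sum(sample_counts.values())

# Poststratification weight = population share / sample share.
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}
print(weights)  # e.g. {'male': 1.25, 'female': 0.833...}
```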
Oversampling
Choice-based sampling is one of the stratified sampling
strategies. In choice-based sampling the data are stratified on
the target and a sample is taken from each stratum so that the
rare target class will be more represented in the sample. The
model is then built on this biased sample. The effects of the
input variables on the target are often estimated with more
precision with the choice-based sample even when a smaller
overall sample size is taken, compared to a random sample.
The results usually must be adjusted to correct for the
oversampling.
Probability proportional to size sampling
In some cases the sample designer has access to an "auxiliary
variable" or "size measure", believed to be correlated to the
variable of interest, for each element in the population. This
data can be used to improve accuracy in sample design. One
option is to use the auxiliary variable as a basis for
stratification, as discussed above.
Another option is probability-proportional-to-size ('PPS')
sampling, in which the selection probability for each element is
set to be proportional to its size measure, up to a maximum of
1. In a simple PPS design, these selection probabilities can
then be used as the basis for Poisson sampling. However, this
has the drawbacks of variable sample size, and different
portions of the population may still be over- or under-
represented due to chance variation in selections. To address
this problem, PPS may be combined with a systematic
approach.
Example: Suppose we have six schools with populations of 150,
180, 200, 220, 260, and 490 students respectively (total 1500
students), and we want to use student population as the basis
for a PPS sample of size three. To do this, we could allocate the
first school numbers 1 to 150, the second school 151 to
330 (= 150 + 180), the third school 331 to 530, and so on to
the last school (1011 to 1500). We then generate a random
start between 1 and 500 (equal to 1500/3) and count through
the school populations by multiples of 500. If our random start
was 137, we would select the schools which have been
allocated numbers 137, 637, and 1137, i.e. the first, fourth,
and sixth schools.
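The school example can be reproduced in a short script. The populations, interval, and random start of 137 are taken directly from the example; the cumulative-total bookkeeping is one straightforward way to implement the selection.

```python
import itertools

populations = [150, 180, 200, 220, 260, 490]   # six schools, total 1500
n = 3                                          # desired sample size
interval = sum(populations) // n               # 1500 / 3 = 500
start = 137                                    # random start in 1..500, fixed to match the example

# Cumulative ranges: school i "owns" the numbers up to its cumulative total.
cumulative = list(itertools.accumulate(populations))  # [150, 330, 530, 750, 1010, 1500]

selected = []
for target in (start + i * interval for i in range(n)):   # 137, 637, 1137
    school = next(i for i, c in enumerate(cumulative) if target <= c)
    selected.append(school + 1)   # 1-based school index

print(selected)   # [1, 4, 6] -> the first, fourth, and sixth schools
```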
The PPS approach can improve accuracy for a given sample
size by concentrating sample on large elements that have the
greatest impact on population estimates. PPS sampling is
commonly used for surveys of businesses, where element size
varies greatly and auxiliary information is often available - for
instance, a survey attempting to measure the number of
guest-nights spent in hotels might use each hotel's number of
rooms as an auxiliary variable. In some cases, an older
measurement of the variable of interest can be used as an
auxiliary variable when attempting to produce more current
estimates.
Cluster sampling
Sometimes it is cheaper to 'cluster' the sample in some way
e.g. by selecting respondents from certain areas only, or
certain time-periods only. (Nearly all samples are in some
sense 'clustered' in time - although this is rarely taken into
account in the analysis.)
Cluster sampling is an example of 'two-stage sampling' or
'multistage sampling': in the first stage a sample of areas is
chosen; in the second stage a sample of respondents within
those areas is selected.
This can reduce travel and other administrative costs. It also
means that one does not need a sampling frame listing all
elements in the target population. Instead, clusters can be
chosen from a cluster-level frame, with an element-level frame
created only for the selected clusters. Cluster sampling
generally increases the variability of sample estimates above
that of simple random sampling, depending on how the
clusters differ between themselves, as compared with the
within-cluster variation.
However, one disadvantage of cluster sampling is that the precision of sample estimates depends on the actual clusters chosen. If the chosen clusters are biased in some way, inferences drawn about population parameters from these sample estimates will be far from accurate.
Multistage sampling
Multistage sampling is a complex form of cluster sampling in which two or more levels of units are embedded one in the other. The first stage consists of
constructing the clusters that will be used to sample from. In
the second stage, a sample of primary units is randomly
selected from each cluster (rather than using all units
contained in all selected clusters). In following stages, in each
of those selected clusters, additional samples of units are
selected, and so on. All ultimate units (individuals, for
instance) selected at the last step of this procedure are then
surveyed.
This technique is thus essentially the process of taking random samples of preceding random samples. It is not as statistically efficient as true random sampling, but it avoids many of the practical problems inherent in single-stage random sampling. Moreover, it remains an effective strategy because it relies on multiple randomizations, which makes it extremely useful in practice.
Multistage sampling is used frequently when a complete list of all members of the population does not exist or is impractical to construct. By avoiding the use of all units in all selected clusters, multistage sampling also avoids the large, and perhaps unnecessary, costs associated with traditional cluster sampling.
Matched random sampling
A method of assigning participants to groups in which pairs of
participants are first matched on some characteristic and then
individually assigned randomly to groups.[5]
The procedure for matched random sampling arises in two contexts:
Two samples in which the members are clearly paired, or are
matched explicitly by the researcher. For example, IQ
measurements or pairs of identical twins.
Those samples in which the same attribute, or variable, is
measured twice on each subject, under different
circumstances. Commonly called repeated measures. Examples
include the times of a group of athletes for 1500m before and
after a week of special training; the milk yields of cows before
and after being fed a particular diet.
Quota sampling
In quota sampling, the population is first segmented into
mutually exclusive sub-groups, just as in stratified sampling.
Then judgment is used to select the subjects or units from
each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.
Mechanical sampling
Mechanical sampling is typically used in sampling solids,
liquids and gases, using devices such as grabs, scoops, thief
probes, the COLIWASA and riffle splitter.
Care is needed in ensuring that the sample is representative of
the frame. Much work in the theory and practice of mechanical
sampling was developed by Pierre Gy and Jan Visman.
Convenience sampling
Convenience sampling (sometimes known as grab or
opportunity sampling) is a type of nonprobability sampling
which involves the sample being drawn from that part of the
population which is close to hand. That is, a sample population selected because it is readily available and convenient. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. For example, if an interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people interviewed would be limited to those present there at that given time, and would not represent the views of other members of society in such an area as they would if the survey were conducted at different times of day and several times per week. This type of sampling is most useful for pilot testing.
Several important considerations for researchers using
convenience samples include:
Are there controls within the research design or experiment which can serve to lessen the impact of a non-random, convenience sample, thereby ensuring the results will be more representative of the population?
Is there good reason to believe that a particular convenience
sample would or should respond or behave differently than a
random sample from the same population?
Is the question being asked by the research one that can
adequately be answered using a convenience sample?
In social science research, snowball sampling is a similar
technique, where existing study subjects are used to recruit
more subjects into the sample.
Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a
region whereby an element is sampled if a chosen line
segment, called a “transect”, intersects the element.
Panel sampling
Panel sampling is the method of first selecting a group of
participants through a random sampling method and then
asking that group for the same information again several times
over a period of time. Therefore, each participant is given the
same survey or interview at two or more time points; each
period of data collection is called a "wave". This sampling
methodology is often chosen for large scale or nation-wide
studies in order to gauge changes in the population with
regard to any number of variables from chronic illness to job
stress to weekly food expenditures. Panel sampling can also be
used to inform researchers about within-person health
changes due to age or help explain changes in continuous
dependent variables such as spousal interaction. There have
been several proposed methods of analyzing panel sample
data, including MANOVA, growth curves, and structural
equation modeling with lagged effects. For a more thorough
look at analytical techniques for panel data, see Johnson
(1995).
Event Sampling Methodology
Event Sampling Methodology (ESM) is a newer form of sampling method that allows researchers to study ongoing experiences and events that vary across and within days in their naturally occurring environment. Because of the frequent sampling of events inherent in ESM, it enables researchers to measure the typology of activity and detect the temporal and dynamic fluctuations of work experiences. The popularity of ESM as a form of research design has increased in recent years because it addresses a shortcoming of cross-sectional research: researchers can now detect intra-individual variation across time. In ESM, participants are asked to record their experiences and perceptions in a paper or electronic diary.
There are three types of ESM:
Signal contingent – random beeping notifies participants to record data. The advantage of this type of ESM is minimization of recall bias.
Event contingent – records data when certain events occur
Interval contingent – records data according to the passing of a certain period of time
ESM has several disadvantages. One of the disadvantages of ESM is that it can sometimes be perceived as invasive and intrusive by participants. ESM also leads to possible self-selection bias. It may be that only certain types of individuals are willing to participate in this type of study, creating a non-random sample. Another concern is related to participant cooperation. Participants may not actually fill out their diaries at the specified times. Furthermore, ESM may substantively change the phenomenon being studied. Reactivity or priming effects may occur, such that repeated measurement may cause changes in the participants' experiences. This method of sampling data is also highly vulnerable to common method variance.[6]
Further, it is important to think about whether or not an
appropriate dependent variable is being used in an ESM
design. For example, it might be logical to use ESM in order to
answer research questions which involve dependent variables
with a great deal of variation throughout the day. Thus,
variables such as change in mood, change in stress level, or
the immediate impact of particular events may be best studied
using ESM methodology. However, it is not likely that utilizing
ESM will yield meaningful predictions when measuring
someone performing a repetitive task throughout the day or
when dependent variables are long-term in nature (coronary
heart problems).
Answer 2
In probability theory and statistics, correlation (often
measured as a correlation coefficient) indicates the strength
and direction of a linear relationship between two random
variables.
In statistics, regression analysis refers to techniques for the
modeling and analysis of numerical data consisting of values of
a dependent variable (also called a response variable) and of
one or more independent variables (also known as explanatory
variables or predictors). The dependent variable in the
regression equation is modeled as a function of the
independent variables, corresponding parameters
("constants"), and an error term. The error term is treated as a
random variable. It represents unexplained variation in the
dependent variable. The parameters are estimated so as to
give a "best fit" of the data. Most commonly the best fit is
evaluated by using the least squares method, but other
criteria have also been used.
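As a brief illustration of both ideas, the sketch below (with invented data) computes a correlation coefficient and fits a simple linear regression by least squares using NumPy:

```python
import numpy as np

# Illustrative data: an approximately linear relationship with noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Correlation coefficient (strength and direction of the linear association).
r = np.corrcoef(x, y)[0, 1]

# Least-squares fit of y = a*x + b: choose a, b to minimise the sum of
# squared residuals (the error term in the regression equation).
a, b = np.polyfit(x, y, deg=1)

print(f"correlation r = {r:.3f}, fitted slope = {a:.3f}, intercept = {b:.3f}")
```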
Answer 3
Forecasting involves making the best possible judgment about some future event. It is no longer reasonable to rely solely on intuition, or one's feel for the situation, in projecting sales, inventory needs, personnel requirements, and other important economic or business variables.
Who uses forecasts?
Accountants - costs, revenues, tax-planning
Personnel Departments - recruitment of new employees
Financial Experts - interest rates
Production Managers - raw materials needs, inventories
Marketing Managers - sales forecasts for promotions
Major Types of Forecasting Methods
Subjective Methods
Sales Force Composites
Customer Surveys
Jury of Executive Opinions
Delphi Method
Time series analysis
In statistics, signal processing, and many other fields, a time
series is a sequence of data points, measured typically at
successive times, spaced at (often uniform) time intervals.
Time series analysis comprises methods that attempt to
understand such time series, often either to understand the
underlying context of the data points (Where did they come
from? What generated them?), or to make forecasts
(predictions). Time series forecasting is the use of a model to
forecast future events based on known past events: to
forecast future data points before they are measured. A
standard example in econometrics is the opening price of a
share of stock based on its past performance.
The term time series analysis is used to distinguish a problem,
firstly from more ordinary data analysis problems (where there
is no natural ordering of the context of individual
observations), and secondly from spatial data analysis where
there is a context that observations (often) relate to
geographical locations. There are additional possibilities in the
form of space-time models (often called spatial-temporal
analysis). A time series model will generally reflect the fact
that observations close together in time will be more closely
related than observations further apart. In addition, time
series models will often make use of the natural one-way
ordering of time so that values in a series for a given time will
be expressed as deriving in some way from past values, rather
than from future values (see time reversibility.)
Methods for time series analyses are often divided into two
classes: frequency-domain methods and time-domain methods.
The former centre around spectral analysis and recently
wavelet analysis, and can be regarded as model-free analyses
well-suited to exploratory investigations. Time-domain methods have a model-free subset consisting of the examination of auto-correlation and cross-correlation analysis, but it is here that partially and fully specified time series models make their appearance.
Prior moving average
A simple moving average (SMA) is the unweighted mean of the previous n data points. For example, a 10-day simple moving average of closing price is the mean of the previous 10 days' closing prices. If those prices are p_M, p_{M-1}, ..., p_{M-n+1}, then the formula is
SMA = (p_M + p_{M-1} + ... + p_{M-n+1}) / n
When calculating successive values, a new value comes into the sum and an old value drops out, meaning a full summation each time is unnecessary:
SMA_today = SMA_yesterday - p_{M-n+1}/n + p_{M+1}/n
In technical analysis there are various popular values for n,
like 10 days, 40 days, or 200 days. The period selected
depends on the kind of movement one is concentrating on,
such as short, intermediate, or long term. In any case moving
average levels are interpreted as support in a rising market, or
resistance in a falling market.
In all cases a moving average lags behind the latest data point,
simply from the nature of its smoothing. An SMA can lag to an
undesirable extent, and can be disproportionately influenced
by old data points dropping out of the average. This is
addressed by giving extra weight to more recent data points,
as in the weighted and exponential moving averages.
One characteristic of the SMA is that if the data have a
periodic fluctuation, then applying an SMA of that period will
eliminate that variation (the average always containing one
complete cycle). But a perfectly regular cycle is rarely
encountered in economics or finance.
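A minimal sketch of a rolling simple moving average, using illustrative prices and a 5-day window; it applies the update described above, in which the newest value enters the running sum and the oldest drops out, so no full summation is needed at each step.

```python
from collections import deque

# Illustrative closing prices and a 5-day window.
prices = [44.0, 44.5, 43.8, 44.2, 45.0, 45.3, 44.9, 45.5, 46.0, 45.8, 46.2]
n = 5

window = deque(maxlen=n)
running_sum = 0.0
sma = []
for p in prices:
    if len(window) == n:
        running_sum -= window[0]   # oldest value drops out of the sum
    window.append(p)
    running_sum += p               # newest value enters the sum
    if len(window) == n:
        sma.append(running_sum / n)

print([round(v, 2) for v in sma])
```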
Central moving average
For a number of applications it is advantageous to avoid the
shifting induced by using only 'past' data. Hence a central
moving average can be computed, using both 'past' and
'future' data. The 'future' data in this case are not predictions,
but merely data obtained after the time at which the average
is to be computed.
Weighted and exponential moving averages (see below) can also be computed centrally.
Cumulative moving average
The cumulative moving average is also frequently called a
running average or a long running average although the term
running average is also used as synonym for a moving
average. This article uses the term cumulative moving average
or simply cumulative average since this term is more
descriptive and unambiguous.
In some data acquisition systems, the data arrive in an ordered data stream and the statistician would like to get the average of all of the data up until the current data point. For example, an investor may want the average price of all of the stock transactions for a particular stock up until the current time. As each new transaction occurs, the average price at the time of the transaction can be calculated for all of the transactions up to that point using the cumulative average. This is the cumulative average, which is typically an unweighted average of the sequence of i values x_1, ..., x_i up to the current time:
CA_i = (x_1 + x_2 + ... + x_i) / i
The brute force method to calculate this would be to store all
of the data and calculate the sum and divide by the number of
data points every time a new data point arrived. However, it is possible to simply update the cumulative average as a new value x_{i+1} becomes available, using the formula:
CA_{i+1} = CA_i + (x_{i+1} - CA_i) / (i + 1)
where CA_0 can be taken to be equal to 0.
Thus the current cumulative average for a new data point is
equal to the previous cumulative average plus the difference
between the latest data point and the previous average
divided by the number of points received so far. When all of
the data points arrive (i = N), the cumulative average will
equal the final average.
The derivation of the cumulative average formula is straightforward. Using
x_1 + x_2 + ... + x_i = i * CA_i
and similarly for i + 1, it is seen that
x_{i+1} = (i + 1) * CA_{i+1} - i * CA_i
Solving this equation for CA_{i+1} results in:
CA_{i+1} = (x_{i+1} + i * CA_i) / (i + 1) = CA_i + (x_{i+1} - CA_i) / (i + 1)
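The incremental update can be sketched directly from the formula above; the stream of values is invented for illustration, and each step is checked against the brute-force average.

```python
# Illustrative stream of transaction prices arriving one at a time.
stream = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0]

ca = 0.0  # CA_0 taken to be 0, as in the text
for i, x in enumerate(stream, start=1):
    # CA_i = CA_{i-1} + (x_i - CA_{i-1}) / i
    ca = ca + (x - ca) / i
    assert abs(ca - sum(stream[:i]) / i) < 1e-9  # matches the brute-force average
    print(f"after {i} points: cumulative average = {ca:.4f}")
```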
Weighted moving average
A weighted average is any average that has multiplying factors
to give different weights to different data points.
Mathematically, the moving average is the convolution of the
data points with a moving average function; in technical
analysis, a weighted moving average (WMA) has the specific
meaning of weights which decrease arithmetically. In an n-day WMA the latest day has weight n, the second latest n − 1, and so on, down to one.
[Figure: WMA weights, n = 15]
The denominator is a triangle number, and can be easily computed as
n(n + 1) / 2
When calculating the WMA across successive values, it can be noted that the difference between the numerators of WMA_{M+1} and WMA_M is n*p_{M+1} − p_M − ... − p_{M−n+1}. If we denote the sum p_M + ... + p_{M−n+1} by Total_M, then
Total_{M+1} = Total_M + p_{M+1} − p_{M−n+1}
Numerator_{M+1} = Numerator_M + n*p_{M+1} − Total_M
WMA_{M+1} = Numerator_{M+1} / (n(n + 1) / 2)
The graph at the right shows how the weights decrease, from
highest weight for the most recent data points, down to zero.
It can be compared to the weights in the exponential moving
average which follows.
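A minimal sketch of an n-day WMA with arithmetically decreasing weights, using invented prices and a 5-day window:

```python
# Illustrative prices, most recent last.
prices = [90.0, 91.0, 92.5, 92.0, 93.5, 94.0, 95.0, 94.5, 96.0, 97.0]
n = 5

def wma(window):
    # The latest value gets weight n, the one before it n - 1, and so on;
    # the denominator is the triangle number n(n + 1) / 2.
    m = len(window)
    weights = range(1, m + 1)            # 1 .. n, oldest to newest
    return sum(w * p for w, p in zip(weights, window)) / (m * (m + 1) / 2)

values = [wma(prices[i - n:i]) for i in range(n, len(prices) + 1)]
print([round(v, 3) for v in values])
```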
Exponential moving average
[Figure: EMA weights, N = 15]
An exponential moving average (EMA), sometimes also called
an exponentially weighted moving average (EWMA), applies
weighting factors which decrease exponentially. The weighting
for each older data point decreases exponentially, giving much
more importance to recent observations while still not
discarding older observations entirely. The graph at right
shows an example of the weight decrease.
Parameters:
The degree of weighting decrease is expressed as a constant smoothing factor α, a number between 0 and 1. α may be expressed as a percentage, so a smoothing factor of 10% is equivalent to α = 0.1. A higher α discounts older observations faster. Alternatively, α may be expressed in terms of N time periods, where α = 2/(N + 1). For example, N = 19 is equivalent to α = 0.1. The half-life of the weights (the interval over which the weights decrease by a factor of two) is approximately N/2.8854 (within 1% if N > 5).
The observation at a time period t is designated Yt, and the
value of the EMA at any time period t is designated St. S1 is
undefined. S2 may be initialized in a number of different ways,
most commonly by setting S2 to Y1, though other techniques
exist, such as setting S2 to an average of the first 4 or 5
observations. The prominence of the S2 initialization's effect
on the resultant moving average depends on α; smaller α
values make the choice of S2 relatively more important than
larger α values, since a higher α discounts older observations
faster.
Formula:
The formula for calculating the EMA at time periods t > 2 is
S_t = α × Y_{t−1} + (1 − α) × S_{t−1}
This formulation is according to Hunter (1986)[2]. The weights obey α(1 − α)^x applied to Y_{t − (x + 1)}. An alternate approach by Roberts (1959) uses Y_t in lieu of Y_{t−1}[3]:
S_t = α × Y_t + (1 − α) × S_{t−1}
This formula can also be expressed in technical analysis terms as follows, showing how the EMA steps towards the latest data point, but only by a proportion of the difference (each time):[4]
EMA_today = EMA_yesterday + α × (price_today − EMA_yesterday)
Expanding out EMA_yesterday each time results in the following power series, showing how the weighting factor on each data point p_1, p_2, etc. (with p_1 the most recent) decreases exponentially:
EMA = α × (p_1 + (1 − α) p_2 + (1 − α)^2 p_3 + (1 − α)^3 p_4 + ...)
In theory this is an infinite sum, but because 1 − α is less than
1, the terms become smaller and smaller, and can be ignored
once small enough.
The N periods in an N-day EMA only specify the α factor. N is
not a stopping point for the calculation in the way it is in an
SMA or WMA. The first N data points in an EMA represent about 86% of the total weight in the calculation.
The power formula above gives a starting value for a particular
day, after which the successive days formula shown first can
be applied.
The question of how far back to go for an initial value depends, in the worst case, on the data. If there are huge p price values in old data then they'll have an effect on the total even if their weighting is very small. If one assumes prices don't vary too wildly then just the weighting can be considered. The weight omitted by stopping after k terms is
α × [(1 − α)^k + (1 − α)^{k+1} + (1 − α)^{k+2} + ...]
which is
α (1 − α)^k × 1 / (1 − (1 − α)) = (1 − α)^k
i.e. a fraction (1 − α)^k out of the total weight.
For example, to have 99.9% of the weight,
k = log(0.001) / log(1 − α)
terms should be used. Since log(1 − α) approaches −2/(N + 1) as N increases, this simplifies to approximately
k = 3.45 (N + 1)
for this example (99.9% weight).
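As a rough sketch, the recursion can be implemented in the technical-analysis form given above, stepping toward each new price by a proportion α of the difference. The prices are invented, and the EMA is initialised to the first observation, one of the initialisation choices mentioned earlier.

```python
# Illustrative prices and an N-day EMA.
prices = [22.0, 22.3, 22.1, 22.6, 23.0, 22.8, 23.4, 23.7, 23.5, 24.0]
N = 5
alpha = 2 / (N + 1)    # smoothing factor expressed in terms of N periods

ema = prices[0]        # initialise with the first observation
series = [ema]
for p in prices[1:]:
    # EMA_today = EMA_yesterday + alpha * (price_today - EMA_yesterday)
    ema = ema + alpha * (p - ema)
    series.append(ema)

print([round(v, 3) for v in series])
```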
Modified Moving Average
This is called modified moving average (MMA), running moving
average (RMA), or smoothed moving average.
Definition
In short, this is an exponential moving average with α = 1/N.
Application of exponential moving average to OS performance metrics
Some computer performance metrics use a form of exponential
moving average, for example, the average process queue
length, or the average CPU utilization.
Here α is defined as a function of the time between two readings. An example of a coefficient giving bigger weight to the current reading, and smaller weight to the older readings, is
α = 1 − exp(−(t_n − t_{n−1}) / (W × 60))
where the time for readings t_n is expressed in seconds, and W is the period of time in minutes over which the reading is said to be averaged (the mean lifetime of each reading in the average). Given the above definition of α, the moving average can be expressed as
S_n = α × Y_n + (1 − α) × S_{n−1}
For example, a 15-minute average L of a process queue length Q, measured every 5 seconds (time difference is 5 seconds), is computed as
L_n = (1 − exp(−5 / (60 × 15))) × Q_n + exp(−5 / (60 × 15)) × L_{n−1}
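A minimal sketch of this kind of smoothing, assuming readings every 5 seconds averaged over a 15-minute period; the instantaneous queue lengths are simulated, and the update uses the equivalent form L = L + α(Q − L).

```python
import math
import random

W = 15          # averaging period in minutes
dt = 5          # seconds between readings
alpha = 1 - math.exp(-dt / (W * 60.0))   # weight given to the newest reading

random.seed(1)
L = 0.0                                   # smoothed queue length
for _ in range(200):
    Q = random.randint(0, 8)              # illustrative instantaneous queue length
    L = L + alpha * (Q - L)               # exponential moving average update

print(round(L, 3))
```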
Answer 4
Statistics is the scientific application of mathematical
principles to the collection, analysis, and presentation of
numerical data. Statisticians contribute to scientific enquiry by
applying their mathematical and statistical knowledge to the
design of surveys and experiments; the collection, processing,
and analysis of data; and the interpretation of the results.
Statisticians may apply their knowledge of statistical methods
to a variety of subject areas, such as biology, economics,
engineering, medicine, public health, psychology, marketing,
education, and sports.
Many economic, social, political, and military decisions cannot
be made without statistical techniques, such as the design of
experiments to gain federal approval of a newly manufactured
drug.
Characteristics of Statistics
Some of its important characteristics are given below:
Statistics are aggregates of facts.
Statistics are numerically expressed.
Statistics are affected to a marked extent by multiplicity of
causes.
Statistics are enumerated or estimated according to a
reasonable standard of accuracy.
Statistics are collected for a predetermined purpose.
Statistics are collected in a systematic manner.
Statistics must be comparable to each other.
Limitations of statistics
Actually, statistics is an exact science. It is the application of
the results that leads to problems.
Perhaps the most commonly encountered limitation of
statistics is the misunderstanding that a statistical measure
can be used as a measure of the accuracy of a measurement.
Statistics, in general, provide very little information on the
intrinsic accuracy of a measurement. Statistics can only
provide an estimate of the minimal error that might be in the
measurement. The actual error can be much greater than the
minimal (statistical) error.
Another way to put it is that statistics measure the variability
of a measurement, not the accuracy of a measurement.
First of all, there's the sample size. If you don't have a big
enough sample, you can't give a very reliable answer. Of
course, with limited resources it might not be possible to
collect a large sample. There's a tradeoff between sample size
and reliability.
Also, there are aspects of statistics which are counterintuitive and tend to confuse people. For instance, Simpson's Paradox.
We all know that lots of sunlight is good for crops. But if you
do a survey of crop yield and compare it to weather records,
you might actually find that the sunnier it is, the less the crops
grow. This is different from the problem of small sample size
outlined above. I'm not talking about a few freak years with
lots of pests throwing the statistics out. The problem is that
when it's sunny, it generally isn't raining. So the years with
lots of sunshine have lower crop yields because they have less
rain. It is possible to correct for this, by looking at records of
both sunshine and rainfall. Basically, if you're looking for a
direct relationship between two variables, you need to think of
all the other variables that might have an effect and take
those into account. But I think this increases the sample size
needed, if you overdo it.
Another fallacy is to assume that correlation implies causality.
For instance, you could do a survey and discover that poorer
neighbourhoods have more crime. You might conclude that
living in such a place makes someone a criminal. But this might
not be the case. It might be that wealth affects both crime and
place of residence, i.e. poor people tend to live in certain
places (because they can't afford to live anywhere else), and
poor people often turn to crime (to feed their families). Or
there might be some other variable we haven't thought of yet.
Answer 5
1. Background
A strengthened planning system is necessary for the construction of a society based on sustainable development, justice and equity, the reduction of poverty, and the institutionalization of peace and democracy. It will be necessary to develop an effective planning system and to make timely reforms and revisions to the institutional and procedural aspects of the planning organization, in order to support the state mechanism in intensifying development activities by balancing the growing ambitions of the people against limited resources and means.
For economic transformation in line with democratic values and beliefs, a competent statistical system needs to be developed for dynamic plan formulation, in accordance with the latest liberal economic philosophy and the geophysical, economic and social conditions.
2. Review of the Current Situation
The Tenth Plan had made commitments to coordinate long-term and periodic plans for raising the living standards of the poor and of backward groups and regions; to carry out policy formulation, input projection, selection of development programs, approval, implementation, and monitoring and evaluation works; to supply all types of reliable and quality statistics; to carry out institutional strengthening of CBS, legal reforms and human resource development; to fully implement the overall national statistical plans; to strengthen national accounts; and to prepare an Action Plan in an integrated way in order to specify the status of the designated indicators in the Tenth Plan.
Against these commitments, the achievements within the Tenth Plan period were as follows. PMAS was established and its documentation was made public on a yearly basis, and NPC was restructured. Similarly, the MTEF system was initiated and institutionalized to make resource allocation and the selection of projects logical in accordance with the national goal of poverty reduction. Assistance was provided in a special way for the implementation of programs of national importance under the Immediate Action Plan. Some exercises were done on the implementation of business plans in some ministries to bring about effectiveness in work performance, monitoring indicators were developed for the implementation of the monitoring subsystem, impact and cost-effectiveness studies were carried out, an MDG progress report was prepared, and an MDG needs assessment was made in five districts. National accounts statistics were made timely in accordance with the System of National Accounts (SNA) 1993. The specified indicators of the MDGs and the Tenth Plan were determined, Nepal Info was prepared, living standards measurement and poverty mapping were accomplished, digital mapping of all the VDCs was updated using GIS, and the National Development Volunteers Program was launched in 42 districts.
3. Problems, Challenges and Opportunities
Some of the problems are duplication in the statistics of Nepal, lack of proper levels and standards as well as of coordination, and limited use of information technology in plan preparation and data processing. Due to these problems, the following have become challenging and serious:
• To clarify the role of NPC in the liberal economic system.
• To make planning and action based on demand, information and facts.
• To establish coordination on the above among the involved agencies.
• Professional achievement with capacity enhancement.
• Institutional and legal reforms.
With successful coordination of the above actions, the plans, policies and programs being prepared will be effective and their implementation, monitoring and evaluation will be simplified; the challenge will be to give attention to strengthening the different stages of planning development in the coming days. By utilizing the experience of planned development over the past fifty years, efforts will now be concentrated on formulating plans capable of intensifying overall development through the maximum mobilization of social and economic infrastructure.
4. Long Term Vision
To develop a strengthened planning system, capable of playing a timely role in the construction of a Prosperous, Modern and Just Nepal.
5. Objectives
To institutionalize an effective planning system by developing a reliable statistical system.
6. Strategies
• To prepare the infrastructure for a strengthened and dynamic planning system.
• To carry out institutional strengthening and enhance the capacity of planning organizations, statistics and planning units.
• Institutional strengthening of the National Statistical System and the National Accounts System will be carried out.
7. Policy and Working Policies
• A planning system based on research and objective analysis and on a favorable, dynamic political, economic and social context will be developed.
• A framework will be prepared after studying the planning and statistical system to be adopted in accordance with a federal structure.
• NPC and CBS will be restructured and strengthened with a view to making notable reforms in the working system and its effectiveness.
• The planning and statistical system will gradually be made inclusive and engendered.
8. Programs
System Development
• A program to make current planning practice gender accountable, pro-people and effective, incorporating activities such as study, research, surveys, tours, seminars, conferences, public debate and the enhancement of public awareness.
• In accordance with a federal structure, and in order to prepare an appropriate framework for the planning and statistical system, programs such as necessary studies, surveys, debates, seminars, discussions, meetings/conferences, tours and awareness enhancement will be carried out.
• Institutional and procedural arrangements will be made in order to develop an effective training system in the fields of planning, programming, monitoring and evaluation, and statistics.
• The national statistical system will be enforced.
• A National Accounts System strengthening program will be launched.
Institutional Strengthening
• Institutional, legal and work-procedural restructuring of NPC will be carried out after a comprehensive study, and it will be made more timely, effective and competent.
• By constructing a modern and well-equipped planning house, PMIS will be adopted using Information and Communication Technology. To modernize the plan formulation system, networks with other ministries and stakeholders will be established.
• The working environment will be improved by making physical amenities available and by reforming monitoring and evaluation.
• Extension and reform of the MTEF will be undertaken, and adoption and extension work will be done on gender planning.
• A separate unit will be formed and institutionalized for study, research and analysis, to strengthen planning divisions/sections and to make effective use of the concerned agencies by coordinating the actions of different ministries and agencies.
• Institutional strengthening of GIS will be carried out by making maximum utilization of information technology in the collection of statistics.
• A new building will be constructed for GIS and cartography, the library, training centers and data processing.
Preparation of Long Term Vision Paper and Policy Research and Other Provisions
• A long-term vision paper will be prepared to set out the destination of the country's economic and social development and where it should be in 20 years, and this paper will be adopted.
• By developing a competent policy study and analysis system, the commission will be able to advise the country on timely policy formulation and play its role in a competent way.
• In order to develop the National Statistical System, problems related to the availability of data, duplication, quality, an integrated system, coordination and legitimacy will be solved through legal, institutional, human resource and quality-related measures addressing ongoing programs, reforms and institutional strengthening, the development of institutional memory, and survey, study and research works.
• As the National Development Volunteers Service program has been found to help in the upliftment of marginalized groups and regions, it will be developed as an autonomous agency after its institutional strengthening.
• By formulating model community development programs, arrangements will be made for their implementation in the designated VDCs and areas.
• Surveys and studies including the Industrial Census, the Nepal Labor Power Survey and the Nepal Living Standards Survey will be carried out. Likewise, preparation work for the Census 2011 and the Agricultural Census 2011 will be undertaken. In conducting these censuses and surveys, on the basis of feasibility as well as importance, the contribution of women to the national economy, tourism, health and the informal sector will gradually be incorporated, as in the preparation of satellite accounts, to capture their contribution even in non-economic activities.
Most often we collect statistical data by doing surveys or
experiments. To do a survey, we pick a small number of people
and ask them questions. Then, we use their answers as the
data. The choice of which individuals to take for a survey or
data collection is very important, as it directly influences the
statistics. Once the statistics have been computed, it can no longer be determined which individuals were taken. Suppose we want to
measure the water quality of a big lake. If we take samples
next to the waste drain, we will get different results than if the
samples are taken in a far away, hard to reach, spot of the
lake.
There are two kinds of problems which are commonly found
when taking samples:
If there are many samples, the samples will likely be very close
to what they are in the real population. If there are very few
samples, however, they might be very different from what they
are in the real population. This error is called a chance error.
The individuals for the samples need to be chosen carefully,
usually they will be chosen randomly. If this is not the case,
the samples might be very different from what they really are
in the total population. This is true even if a great number of
samples is taken. This kind of error is called bias.
Errors
We can avoid chance errors by taking a larger sample, and we
can avoid some bias by choosing randomly. However,
sometimes large random samples are hard to take. And bias
can happen if some people refuse to answer our questions, or
if they know they are getting a fake treatment. These
problems can be hard to fix.
Descriptive statistics
Finding the middle of the data
The middle of the data is often called an average. The average
tells us about a typical individual in the population. There are
three kinds of average that are often used: the mean, the
median and the mode.
The examples below use this sample data:
Name | A B C D E F G H I J
---------------------------------------------
score| 23 26 49 49 57 64 66 78 82 92
Mean
The formula for the mean is
mean = (x_1 + x_2 + ... + x_N) / N
where x_1, ..., x_N are the data and N is the population size (see Sigma notation).
This means that you add up all the values, and then divide by the number of values.
In our example:
mean = (23 + 26 + 49 + 49 + 57 + 64 + 66 + 78 + 82 + 92) / 10 = 586 / 10 = 58.6
The problem with the mean is that it doesn't tell anything
about how the values are distributed. Values that are very
large or very small change the mean a lot. In statistics, these
extreme values might be errors of measurement, but
sometimes the population really does contain these values. For
example, if in a room there are 10 people who make $10/day
and 1 who makes $1,000,000/day. The mean of the data is
$90,918/day. Even though it is the average amount, the mean
in this case is not the amount any single person makes, and is
probably useless.
Median
The median is the middle item of the data. To find the median
we sort the data from the smallest number to the largest
number and then choose the number in the middle. If there are
an even number of data, there won't be a number right in the
middle, so we choose the two middle ones and calculate their
mean. In our example there are 10 items of data, the two
middle ones are "E" and "F", so the median is (57+64)/2 =
60.5.
Mode
The mode is the most frequent item of data. For example the
most common letter in English is the letter "e". We would say
that "e" is the mode of the distribution of the letters.
The mode is the only form of average that can be used for
numbers that can't be put in order.
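The three averages for the sample scores above can be checked with Python's statistics module:

```python
import statistics

scores = [23, 26, 49, 49, 57, 64, 66, 78, 82, 92]

print(statistics.mean(scores))    # 58.6
print(statistics.median(scores))  # (57 + 64) / 2 = 60.5
print(statistics.mode(scores))    # 49, the only repeated value
```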
Finding the spread of the data
Another thing we can say about a set of data is how spread out
it is. A common way to describe the spread of a set of data is
the standard deviation. If the standard deviation of a set of
data is small, then most of the data is very close to the
average. If the standard deviation is large, though, then a lot
of the data is very different from the average.
If the data follows the common pattern called the normal
distribution, then it is very useful to know the standard
deviation. If the data follows this pattern (we would say the
data is normally distributed), about 68 of every 100 pieces of
data will be off the average by less than the standard
deviation. Not only that, but about 95 of every 100
measurements will be off the average by less than two times
the standard deviation, and about 997 in 1000 will be closer to
the average than three standard deviations.
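As a quick illustration of this rule, the sketch below simulates normally distributed data and counts how much of it falls within one, two and three standard deviations of the mean:

```python
import random
import statistics

random.seed(0)
# Simulated, approximately normally distributed data (mean 100, sd 15).
data = [random.gauss(mu=100, sigma=15) for _ in range(10_000)]

mean = statistics.mean(data)
sd = statistics.pstdev(data)

for k in (1, 2, 3):
    within = sum(abs(x - mean) < k * sd for x in data) / len(data)
    print(f"within {k} standard deviation(s): {within:.1%}")   # roughly 68%, 95%, 99.7%
```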
Other descriptive statistics
We also can use statistics to find out that some percent,
percentile, number, or fraction of people or things in a group
do something or fit in a certain category.
For example, social scientists used statistics to find out that
49% of people in the world are males.