Statistics
The purpose of the chi-square test is to determine whether a difference between observed data and expected data is due to chance, or whether it is due to a relationship between the variables you are studying. Chi-square is most
commonly used by researchers who are studying survey response data because it applies to categorical
variables. Demography, consumer and marketing research, political science, and economics are all
examples of this type of research.
#2. A probability distribution is a mathematical function that describes the probability of different
possible values of a variable. Probability distributions are often depicted using graphs or probability
tables. In Statistics, the probability distribution gives the possibility of each outcome of a random
experiment or event. It provides the probabilities of different possible occurrences. To recall, probability is a measure of the uncertainty of various phenomena.
#3. A binomial distribution is one in which each trial has only two possible outcomes, success or failure. Example of binomial distribution: a coin toss. Poisson distribution: a Poisson distribution describes the number of events occurring in a fixed interval of time or space; the count has no upper limit (it can be 0, 1, 2, ...).
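As a quick sketch, both probability mass functions can be computed directly from their formulas (all numbers below are hypothetical examples, not data from these notes):

```python
import math

# Binomial: probability of k successes in n independent trials,
# each with success probability p.
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson: probability of observing k events when the average
# rate is lam; k can be any non-negative integer (no upper limit).
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Probability of exactly 5 heads in 10 fair coin tosses.
print(binomial_pmf(5, 10, 0.5))   # about 0.2461
# Probability of exactly 2 calls in a minute when the mean rate is 3.
print(poisson_pmf(2, 3))          # about 0.2240
```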
#short notes
1.GM- the geometric mean is a mean or average which indicates a central tendency of a finite set of real
numbers by using the product of their values.
2.AM- the arithmetic mean, arithmetic average, or just the mean or average, is the sum of a collection of
numbers divided by the count of numbers in the collection. The collection is often a set of results from
an experiment, an observational study, or a survey.
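The two means above differ in how they combine values: the AM uses the sum, the GM uses the product. A minimal sketch with a made-up two-number set:

```python
import math

data = [2, 8]  # hypothetical data

am = sum(data) / len(data)               # arithmetic mean: sum / count
gm = math.prod(data) ** (1 / len(data))  # geometric mean: nth root of product

print(am)  # 5.0
print(gm)  # 4.0  (sqrt of 2 * 8 = 16)
```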
3. In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research and business intelligence.
4.Simple random sampling is a type of probability sampling in which the researcher randomly selects a
subset of participants from a population. Each member of the population has an equal chance of being
selected.
4 Types of Random Sampling Techniques- Simple random sampling, Stratified random sampling, Cluster
random sampling, Systematic random sampling.
5. In statistics, cluster sampling is a sampling plan used when mutually homogeneous yet internally
heterogeneous groupings are evident in a statistical population. It is often used in marketing research. In cluster sampling, you randomly select entire groups and include all units of each group in
your sample. However, in stratified sampling, you select some units of all groups and include them in
your sample. In this way, both methods can ensure that your sample is representative of the
target population.
#4.An array is a series of memory locations – or 'boxes' – each of which holds a single item of data, but
with each box sharing the same name. All data in an array must be of the same data type.
#Advantages of Array 1.They provide easy, direct access to any element, so elements can be read in any order. 2.You do not need to manage memory for each element separately, as all elements of the array are allocated in contiguous memory locations.
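A small sketch of the same-type, shared-name idea using Python's `array` module (the variable name and values are hypothetical):

```python
from array import array

# 'i' fixes the data type: every element must be a signed int,
# stored contiguously, all sharing the one name `scores`.
scores = array('i', [70, 85, 90, 60, 75])

print(scores[2])      # direct index-based access: 90
scores[2] = 95        # update an element in place
# scores.append(3.5)  # would raise TypeError: all elements must share the type
```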
#5.Probability is simply how likely something is to happen. Whenever we're unsure about the outcome
of an event, we can talk about the probabilities of certain outcomes—how likely they are. In statistics
and probability theory, independent events are two events wherein the occurrence of one event does
not affect the occurrence of another event or events. The simplest example of such events is tossing two
coins. The outcome of tossing the first coin cannot influence the outcome of tossing the second coin.
In the case where events A and B are independent (where event A has no effect on the probability of event B), the conditional probability of event B given event A is simply the probability of event B, that is P(B|A) = P(B). The general multiplication rule P(A and B) = P(A)P(B|A) then reduces to P(A and B) = P(A)P(B).
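The two-coin example can be checked by simulation: with enough tosses, the estimated P(A and B) should be close to P(A)P(B). A sketch (trial count and seed are arbitrary choices):

```python
import random

random.seed(1)          # seed only for reproducibility
trials = 100_000
a_count = b_count = both = 0

for _ in range(trials):
    first = random.choice('HT')    # event A: first coin shows heads
    second = random.choice('HT')   # event B: second coin shows heads
    a = first == 'H'
    b = second == 'H'
    a_count += a
    b_count += b
    both += a and b

p_a, p_b, p_ab = a_count / trials, b_count / trials, both / trials
# For independent events these two numbers agree (both near 0.25).
print(p_ab, p_a * p_b)
```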
#6. In statistics, a central tendency is a central or typical value for a probability distribution. Colloquially,
measures of central tendency are often called averages. The term central tendency dates from the late
1920s. The most common measures of central tendency are the arithmetic mean, the median, and the
mode. Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to
which numerical data is likely to vary about an average value. In other words, dispersion helps to
understand the distribution of the data.
#In probability theory and statistics, skewness is a measure of the asymmetry of the probability
distribution of a real-valued random variable about its mean. The skewness value can be positive, zero,
negative, or undefined.
Ch-1
#1.Accounting- Public accounting firms use statistical sampling procedures when conducting audits for their clients. For instance, suppose an accounting firm wants to determine whether the amount of
accounts receivable shown on a client’s balance sheet fairly represents the actual amount of accounts
receivable. Usually the large number of individual accounts receivable makes reviewing and validating
every account too time-consuming and expensive.
Finance - Financial analysts use a variety of statistical information to guide their investment
recommendations. In the case of stocks, the analysts review a variety of financial data including
price/earnings ratios and dividend yields. By comparing the information for an individual stock with
information about the stock market averages, a financial analyst can begin to draw a conclusion as to
whether an individual stock is over- or underpriced.
Marketing- Electronic scanners at retail checkout counters collect data for a variety of marketing
research applications. For example, data suppliers such as ACNielsen and Information Resources, Inc.,
purchase point-of-sale scanner data from grocery stores, process the data, and then sell statistical
summaries of the data to manufacturers. Manufacturers spend hundreds of thousands of dollars per
product category to obtain this type of scanner data. Manufacturers also purchase data and statistical
summaries on promotional activities such as special pricing and the use of in-store displays.
Production - Today’s emphasis on quality makes quality control an important application of statistics in
production. A variety of statistical quality control charts are used to monitor the output of a production
process. In particular, an x-bar chart can be used to monitor the average output. Suppose, for example,
that a machine fills containers with 12 ounces of a soft drink. Periodically, a production worker selects a
sample of containers and computes the average number of ounces in the sample. This average, or x-bar
value, is plotted on an x-bar chart.
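The x-bar values described above are just the means of periodic samples. A sketch with hypothetical fill weights (ounces) for three samples:

```python
# Hypothetical fill weights (ounces) from three periodic samples
samples = [
    [12.02, 11.98, 12.05, 11.96],
    [11.94, 12.01, 11.99, 12.03],
    [12.10, 12.07, 12.04, 12.09],
]

# One x-bar value per sample; these are the points plotted on the chart.
x_bars = [sum(s) / len(s) for s in samples]
print([round(x, 4) for x in x_bars])
```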
Economics- Economists frequently provide forecasts about the future of the economy or some aspect of
it. They use a variety of statistical information in making such forecasts. For instance, in forecasting
inflation rates, economists use statistical information on such indicators as the Producer Price Index, the
unemployment rate, and manufacturing capacity utilization. Often these statistical indicators are entered
into computerized forecasting models that predict inflation rates.
#2.Data- Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation.
#3.Elements, Variables, and Observations- Elements are the entities on which data are collected. For the
data set in Table 1.1 each individual mutual fund is an element: the element names appear in the first
column. With 25 mutual funds, the data set contains 25 elements. A variable is a characteristic of
interest for the elements.
#4. Observation - Example variables from Table 1.1: • Fund Type: The type of mutual fund, labeled DE (Domestic Equity), IE (International Equity), and FI (Fixed Income). • Net Asset Value ($): The closing price per share on December 31, 2007. Measurements collected on each variable for every element in a study provide the data. The set of measurements obtained for a particular element is called an observation.
#5.Scales of Measurement - When the data for a variable consist of labels or names used to identify an
attribute of the element, the scale of measurement is considered a nominal scale.
the scale of measurement for the Fund Type variable is nominal because DE, IE, and FI are labels used to
identify the category or type of fund.
The scale of measurement for a variable is called an ordinal scale if the data exhibit the properties of
nominal data and the order or rank of the data is meaningful. For example, Eastside Automotive sends
customers a questionnaire designed to obtain data on the quality of its automotive repair service. Each
customer provides a repair service rating of excellent, good, or poor. Because the data obtained are the labels (excellent, good, or poor), the data have the properties of nominal data; and because the ratings can be ranked from best to poorest, the order is meaningful, so the scale is ordinal.
Ch-3 #1.Mean-Perhaps the most important measure of location is the mean, or average value, for a
variable. The mean provides a measure of central location for the data. If the data are for a sample, the mean is denoted by x̄ (x-bar); if the data are for a population, the mean is denoted by the Greek letter μ.
#2.The median is another measure of central location. The median is the value in the middle when the
data are arranged in ascending order (smallest value to largest value). With an odd number of
observations, the median is the middle value. An even number of observations has no single middle
value. In this case, we follow convention and define the median as the average of the values for the middle two observations. In short: arrange the data in ascending order (smallest value to largest value); (a) for an odd number of observations, the median is the middle value; (b) for an even number of observations, the median is the average of the two middle values.
#3.MODE The mode is the value that occurs with greatest frequency
To illustrate the identification of the mode, consider the sample of five class sizes. The only value that
occurs more than once is 46. Because this value, occurring with a frequency of 2, has the greatest
frequency, it is the mode. As another illustration, consider the sample of starting salaries for the business
school graduates. The only monthly starting salary that occurs more than once is $3480. Because this
value has the greatest frequency, it is the mode.
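The three measures above are in the standard library. A sketch using a made-up sample of five class sizes (the numbers echo the text's "46 occurs twice" illustration but are otherwise hypothetical):

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]  # hypothetical sample of five class sizes

print(statistics.mean(class_sizes))    # 44   (220 / 5)
print(statistics.median(class_sizes))  # 46   (middle of 32, 42, 46, 46, 54)
print(statistics.mode(class_sizes))    # 46   (only value occurring twice)
```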
#4.PERCENTILE The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 - p) percent of the observations are greater than or equal to this value.
#5.Quartiles are just specific percentiles; thus, the steps for computing percentiles can be applied
directly in the computation of quartiles.
#6.The simplest measure of variability is the range. Range = Largest value - Smallest value
# interquartile range - A measure of variability that overcomes the dependency on extreme values is the
interquartile range (IQR). This measure of variability is the difference between the third quartile, Q3, and
the first quartile, Q1. In other words, the interquartile range is the range for the middle 50% of the data.
IQR = Q3- Q1
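Quartiles and the IQR can be sketched with `statistics.quantiles` (the data set is hypothetical; note that different quantile methods give slightly different answers, so the method is stated explicitly):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical ordered data

# 'inclusive' interpolates treating the min and max as the 0th and 100th
# percentiles; n=4 returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')

iqr = q3 - q1   # range of the middle 50% of the data
print(q1, q3, iqr)  # 2.75 6.25 3.5
```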
#7.The variance is a measure of variability that utilizes all the data. The variance is based on the
difference between the value of each observation (xi) and the mean. The difference between each xi and the mean (x̄ for a sample, μ for a population) is called a deviation about the mean.
#7.The standard deviation is defined to be the positive square root of the variance. Following the
notation we adopted for a sample variance and a population variance, we use s to denote the sample
standard deviation and σ to denote the population standard deviation.
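The sample/population distinction above shows up directly in the standard library: `variance`/`stdev` divide by n - 1, while `pvariance`/`pstdev` divide by n. A sketch with hypothetical data:

```python
import statistics

data = [46, 54, 42, 46, 32]  # hypothetical sample; mean is 44

# Sample statistics (divide by n - 1): s**2 and s
print(statistics.variance(data))   # 64   (sum of squared deviations 256 / 4)
print(statistics.stdev(data))      # 8.0  (positive square root of 64)

# Population statistics (divide by n): sigma**2
print(statistics.pvariance(data))  # 51.2 (256 / 5)
```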
CH-4
#1.Probability is a numerical measure of the likelihood that an event will occur. Thus, probabilities can be used as measures of the degree of uncertainty associated with events of interest. If probabilities are available, we can determine the likelihood of each event occurring.
#2.three useful counting rules.
COUNTING RULE FOR MULTIPLE-STEP EXPERIMENTS If an experiment can be described as a sequence of
k steps with n1 possible outcomes on the first step, n2 possible outcomes on the second step, and so on,
then the total number of experimental outcomes is given by (n1) (n2)...(nk).
# A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
#Combinations - A second useful counting rule allows one to count the number of experimental
outcomes when the experiment involves selecting n objects from a (usually larger) set of N objects. It is
called the counting rule for combinations.
# Permutations A third counting rule that is sometimes useful is the counting rule for permutations. It
allows one to compute the number of experimental outcomes when n objects are to be selected from a
set of N objects where the order of selection is important. The same n objects selected in a different
order are considered a different experimental outcome.
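All three counting rules are one-liners with `math` (the step counts below are hypothetical illustrations):

```python
import math

# Multiple-step rule: k steps with n1, n2, ..., nk outcomes each.
steps = [2, 6]            # e.g. toss a coin, then roll a die
total = math.prod(steps)  # (n1)(n2)...(nk)
print(total)              # 12

# Combinations: select n objects from N, order does not matter.
print(math.comb(5, 2))    # 10

# Permutations: select n objects from N, order matters.
print(math.perm(5, 2))    # 20
```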
Ch-5
#1. A random variable is a numerical description of the outcome of an experiment
#2.A random variable that may assume either a finite number of values or an infinite sequence of values
such as 0, 1, 2, . . . is referred to as a discrete random variable.
# A random variable that may assume any numerical value in an interval or collection of intervals is
called a continuous random variable. Experimental outcomes based on measurement scales such as
time, weight, distance, and temperature can be described by continuous random variables.
#3. For example, consider an experiment of monitoring incoming telephone calls to the claims office of a major insurance company. Suppose the random variable of interest is x, the time between consecutive incoming calls in minutes. This random variable may assume any value in the interval x ≥ 0. In fact, an infinite number of values are possible for x, including values such as 1.26 minutes, 2.751 minutes, 4.3333 minutes, and so on.
#4. The expected value, or mean, of a random variable is a measure of the central location for the
random variable.
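For a discrete random variable the expected value is E(x) = Σ x·f(x), the probability-weighted sum of the values. A sketch with a hypothetical distribution:

```python
# Hypothetical discrete distribution: x = number of cars sold in a day
values = [0, 1, 2, 3]
probs  = [0.1, 0.3, 0.4, 0.2]   # must sum to 1

# E(x) = sum of x * f(x) over all values of x
expected = sum(x * p for x, p in zip(values, probs))
print(round(expected, 2))  # 1.7
```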
-&Stratified random sampling involves dividing a population into groups with similar attributes and randomly sampling each group. This method ensures that different segments in a population are equally
represented. To give an example, imagine a survey is conducted at a school to determine overall
satisfaction. Here, stratified random sampling can equally represent the opinions of students in each
department.
merits- Accurately Reflects Population Studied Stratified random sampling accurately reflects the
population being studied because researchers are stratifying the entire population before applying
random sampling methods. In short, it ensures each subgroup within the population receives proper
representation within the sample. As a result, stratified random sampling provides better coverage of
the population since the researchers have control over the subgroups to ensure all of them are
represented in the sampling.
Demerits- Can't Be Used in All Studies: Unfortunately, this method of research cannot be used in every
study. The method's disadvantage is that several conditions must be met for it to be used properly.
Researchers must identify every member of a population being studied and classify each of them into
one, and only one, subpopulation. As a result, stratified random sampling is disadvantageous when
researchers can't confidently classify every member of the population into a subgroup. Also, finding an
exhaustive and definitive list of an entire population can be challenging.
&-Cluster sampling starts by dividing a population into groups or clusters. What makes this different
from stratified sampling is that each cluster must be representative of the larger population. Then, you
randomly select entire clusters to sample. For example, if a school had five different eighth grade
classes, cluster random sampling means any one class would serve as a sample.
merits- Cluster sampling is more time- and cost-efficient than other probability sampling methods,
particularly when it comes to large samples spread across a wide geographical area.
Demerits- it provides less statistical certainty than other methods, such as simple random sampling,
because it is difficult to ensure that your clusters properly represent the population as a whole.
&-Systematic random sampling is a common technique in which you sample every kth element. For example, if you were conducting surveys at a mall, you might survey every 100th person who walks in. If you have a sampling frame, you would divide the size of the frame, N, by the desired sample size, n, to get the sampling interval, k. You would then choose every kth element in the frame to create your sample.
merits- Easy to Execute and Understand, Control and Sense of Process, Clustered Selection Eliminated, Low Risk Factor.
Demerits- Assumes Size of Population Can Be Determined, Need for Natural Degree of Randomness, Greater Risk of Data Manipulation.
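The N/n interval rule can be sketched directly (frame size, sample size, and seed are hypothetical; a random start within the first interval is a common refinement):

```python
import random

N = 1000                     # hypothetical size of the sampling frame
n = 50                       # desired sample size
k = N // n                   # sampling interval: every kth element

frame = list(range(N))
random.seed(7)               # seed only for reproducibility
start = random.randrange(k)  # random starting point within the first k units
sample = frame[start::k]     # take every kth element from the start
print(k, len(sample))        # 20 50
```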
#2. Non random
-&Convenience or haphazard sampling
Units are selected in an arbitrary manner with little or no planning involved. Haphazard sampling assumes that the population units are all alike; if that holds, any unit may be chosen for the sample. An example
of haphazard sampling is the vox pop survey where the interviewer selects any person who happens to
walk by. Unfortunately, unless the population units are truly similar, selection is subject to the biases of
the interviewer and whoever happened to walk by at the time of sampling.
merits- Collect data quickly, Inexpensive to create samples, Easy to do research, Low cost, Readily available sample, Fewer rules to follow
demerits- Bias in sampling, Lack of variety, External validity is limited, Unknown errors, Possibility of researcher bias
-&Judgement sampling
With this method, sampling is done based on previous ideas of population composition and behaviour.
An expert with knowledge of the population decides which units in the population should be sampled. In
other words, the expert purposely selects what is considered to be a representative sample. Judgment
sampling is subject to the researcher’s biases and is perhaps even more biased than haphazard
sampling. Since any preconceptions the researcher has are reflected in the sample, large biases can be
introduced if these preconceptions are inaccurate. However, it can be useful in exploratory studies, for
example in selecting members for focus groups or in-depth interviews to test specific aspects of a
questionnaire.
The advantages - 1.The approach is well understood and has been refined through experience over many years; 2.The auditor has an opportunity to bring his judgement and expertise into play (all auditing is, after all, an exercise in professional judgement); 3.No special knowledge of statistics is required; 4.No time is spent on mathematics;
Disadvantages - 1.It is unscientific; 2.The samples selected are usually too large, which is wasteful; 3.You cannot extrapolate the conclusions to the population as a whole, since the samples are not representative; 4.Personal bias in choosing the sample is unavoidable; 5.There is no logic to the selection of the sample or its size; 6.The sample selection is so erratic that it cannot be said to apply to all items in the year, so the result reached is usually vague.
-&Quota sampling This is one of the most common forms of non-probability sampling. Sampling is done
until a specific number of units (quotas) for various subpopulations have been selected. Quota sampling
is a means for satisfying sample size objectives for the subpopulations. The quotas may be based on
population proportions. For example, if there are 100 men and 100 women in the population and a
sample of 20 is to be drawn, 10 men and 10 women may be interviewed. Quota sampling can be
considered preferable to other forms of non-probability sampling (e.g. judgment sampling) because it
forces the inclusion of members of different subpopulations.
Advantages- 1.One advantage of quota sampling is it saves time. It’s the ideal choice for gathering
primary data within a limited time. 2. Quota sampling reduces cost. This is because less time is used to
gather data. 3. Quota sampling can be used in the absence of sampling frames. When the original set of
data that samples can be drawn from is absent, the best choice is to apply the quota sampling method.
4. Another advantage of quota sampling is that the researcher can conveniently analyze and interpret
the responses to the test or survey. This is because the right questions are presented to the right sample
group.
Disadvantages- 1. One disadvantage of the quota sampling method is that it is risky to project the
research result to the whole population because you cannot calculate the sampling error of the test
from one quota. This is because quota sampling is not a probability sampling method. 2. Quota sampling usually represents the characteristics of the population fairly accurately; however, there may be inaccuracies in the features of the total sample group. 3. If the researcher is not experienced or competent, the quality of the method may suffer from bias, making the results of the sampling inaccurate.
CH-15
#1.The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing
on the study's outcome unless it is rejected. H0 is the symbol for it, and it is pronounced H-naught.
#2. The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the
alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
#3. Let's understand this with an example. A sanitizer manufacturer claims that its product kills 95
percent of germs on average. To put this company's claim to the test, create a null and alternate
hypothesis. H0 (Null Hypothesis): Average = 95%. Alternative Hypothesis (H1): The average is less than
95%.
Another straightforward example for understanding this concept is determining whether or not a coin is fair and balanced. The null hypothesis states that the probability of heads equals the probability of tails. In contrast, the alternative hypothesis states that the probabilities of heads and tails would be very different.
#4Steps of Hypothesis Testing
Step 1: Specify Your Null and Alternate Hypotheses- It is critical to rephrase your original research
hypothesis (the prediction that you wish to study) as a null (Ho) and alternative (Ha) hypothesis so that
you can test it quantitatively. Your first hypothesis, which predicts a link between variables, is generally
your alternate hypothesis. The null hypothesis predicts no link between the variables of interest.
Step 2: Gather Data- For a statistical test to be legitimate, sampling and data collection must be done in
a way that is meant to test your hypothesis. You cannot draw statistical conclusions about the
population you are interested in if your data is not representative.
Step 3: Conduct a Statistical Test - Many statistical tests are available, but they all compare within-group variance (how spread out the data are within a category) against between-group variance (how different the categories are from one another). If the between-group variance is large enough that there is little or no overlap between groups, your statistical test will display a low p-value to reflect this.
Step 4: Determine Rejection Of Your Null Hypothesis- Your statistical test results must determine
whether your null hypothesis should be rejected or not. In most circumstances, you will base your
judgment on the p-value provided by the statistical test. In most circumstances, your preset level of
significance for rejecting the null hypothesis will be 0.05 - that is, when there is less than a 5% likelihood
that these data would be seen if the null hypothesis were true. In other circumstances, researchers use
a lower level of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null
hypothesis.
Step 5: Present Your Results - The findings of hypothesis testing will be discussed in the results and
discussion portions of your research paper, dissertation, or thesis. You should include a concise
overview of the data and a summary of the findings of your statistical test in the results section. You can
discuss whether or not your results supported your initial hypothesis in the discussion section. Rejecting or
failing to reject the null hypothesis is a formal term used in hypothesis testing. This is likely a must for
your statistics assignments.
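The five steps can be sketched end-to-end with a one-tailed z-test using only the standard library. Everything here is hypothetical: the sample values, H0 (population mean = 95), and the choice to treat the sample standard deviation as the known sigma for the sake of the sketch.

```python
from statistics import NormalDist, mean, stdev
import math

# Step 1: H0: population mean = 95; Ha: population mean < 95
mu0 = 95
alpha = 0.05                     # preset significance level

# Step 2: hypothetical sample data
sample = [94.2, 93.8, 95.1, 92.9, 94.5, 93.6, 94.9, 93.2]

# Step 3: compute the test statistic (sample stdev stands in for sigma)
n = len(sample)
z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# Step 4: one-tailed p-value P(Z <= z) under H0, then compare to alpha
p_value = NormalDist().cdf(z)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 5: report the result
print(round(z, 3), round(p_value, 4), decision)
```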
#5.Types of Hypothesis Testing
Z Test- To determine whether a discovery or relationship is statistically significant, hypothesis testing
uses a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test can be applied only when the population standard deviation is known and the sample size is 30 data points or more.
T Test- A statistical test called a t-test is employed to compare the means of two groups. To determine
whether two groups differ or if a procedure or treatment affects the population of interest, it is
frequently used in hypothesis testing.
Chi-Square - You use a Chi-square test for hypothesis testing about whether your data are as expected. To determine if the expected and observed results are well-fitted, the Chi-square test
analyzes the differences between categorical variables from a random sample. The test's fundamental
premise is that the observed values in your data should be compared to the predicted values that would
be present if the null hypothesis were true.
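The observed-versus-expected comparison is the statistic χ² = Σ (O - E)² / E. A goodness-of-fit sketch with hypothetical die-roll counts (a full test would then compare χ² to the chi-square distribution with the right degrees of freedom):

```python
# Hypothetical goodness-of-fit check: is a six-sided die fair?
observed = [8, 12, 9, 11, 10, 10]   # made-up counts from 60 rolls
n = sum(observed)
expected = [n / 6] * 6              # 10 per face under H0 (fair die)

# chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # 1.0 (small, consistent with a fair die)
```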
Ch-17
#1.A chi-squared test (also chi-square or χ2 test) is a statistical hypothesis test used in the analysis
of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to
examine whether two categorical variables (two dimensions of the contingency table) are independent
in influencing the test statistic (values within the table).[1] The test is valid when the test statistic is chi-
squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants
thereof.
#2. In statistics, a categorical variable (also called qualitative variable) is a variable that can take on one
of a limited, and usually fixed, number of possible values, assigning each individual or other unit of
observation to a particular group or nominal category on the basis of some qualitative property. In
computer science and some branches of mathematics, categorical variables are referred to as
enumerations or enumerated types.
Examples of values that might be represented in a categorical variable: 1. The roll of a six-sided die: possible outcomes are 1, 2, 3, 4, 5, or 6. 2. Demographic information of a population: gender, disease status. 3. The blood type of a person: A, B, AB, or O. 4. The political party that a voter might vote for, e.g. Green Party, Christian Democrat, Social Democrat, etc. 5. The type of a rock: igneous, sedimentary, or metamorphic. 6. The identity of a particular word (e.g., in a language model): one of V possible choices, for a vocabulary of size V.
Ch-14
#1. The independent variable is the cause. Its value is independent of other variables in your study.
The dependent variable is the effect. Its value depends on changes in the independent variable.
Dependent and independent variables are variables in mathematical modeling, statistical modeling and
experimental sciences. Dependent variables are studied under the supposition or demand that they
depend, by some law or rule (e.g., by a mathematical function), on the values of other variables.
Independent variables, in turn, are not seen as depending on any other variable in the scope of the
experiment in question. In this sense, some common independent variables are time, space, density,
mass, fluid flow rate, and previous values of some observed value of interest (e.g. human population
size) to predict future values (the dependent variable). Of the two, it is always the dependent variable
whose variation is being studied, by altering inputs, also known as regressors in a statistical context. In
an experiment, any variable that can be attributed a value without attributing a value to any other
variable is called an independent variable. Models and experiments test the effects that the
independent variables have on the dependent variables. Sometimes, even if their influence is not of
direct interest, independent variables may be included for other reasons, such as to account for their
potential confounding effect.
Ch-17
#1. The simplest form of a price index shows how the current price per unit for a given item compares to
a base period price per unit for the same item.
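This simple index is usually expressed as a price relative: (price in the current period / base-period price) × 100. A sketch with hypothetical prices:

```python
# Hypothetical prices per unit for the same item
base_price = 2.50      # base-period price
current_price = 3.00   # current-period price

# Price relative: current price as a percentage of the base price
price_index = (current_price / base_price) * 100
print(price_index)  # 120.0, i.e. the price rose 20% over the base period
```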