
Advanced Sampling Theory with Applications
How Michael 'selected' Amy
Volume I

by

Sarjinder Singh
St. Cloud State University,
Department of Statistics,
St. Cloud, MN, U.S.A.

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-94-010-3728-0 ISBN 978-94-007-0789-4 (eBook)


DOI 10.1007/978-94-007-0789-4

Printed on acid-free paper

All Rights Reserved


© 2003 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 2003
No part of this work may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, microfilming,
recording or otherwise, without written permission from the Publisher, with the exception
of any material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work.
With whose grace ...


TABLE OF CONTENTS

PREFACE xxi

1 BASIC CONCEPTS AND MATHEMATICAL NOTATION

1.0 Introduction 1
1.1 Population 1
1.1.1 Finite population 1
1.1.2 Infinite population 1
1.1.3 Target population 1
1.1.4 Study population 1
1.2 Sample 2
1.3 Examples of populations and samples 2
1.4 Census 2
1.5 Relative aspects of sampling versus census 2
1.6 Study variable 2
1.7 Auxiliary variable 3
1.8 Difference between study variable and auxiliary variable 3
1.9 Parameter 3
1.10 Statistic 3
1.11 Statistics 4
1.12 Sample selection 4
1.12.1 Chit method or Lottery method 4
1.12.1.1 With replacement sampling 4
1.12.1.2 Without replacement sampling 5
1.12.2 Random number table method 5
1.12.2.1 Remainder method 6
1.13 Probability sampling 7
1.14 Probability of selecting a sample 7
1.15 Population mean/total 8
1.16 Population moments 8
1.17 Population standard deviation 8
1.18 Population coefficient of variation 8
1.19 Relative mean square error 9
1.20 Sample mean 9
1.21 Sample variance 9
1.22 Estimator 10
1.23 Estimate 10
1.24 Sample space 10
1.25 Univariate random variable 11
1.25.1 Qualitative random variables 11
1.25.2 Quantitative random variables 11


1.25.2.1 Discrete random variable 11
1.25.2.2 Continuous random variable 11
1.26 Probability mass function (p.m.f.) of a univariate discrete random
variable 12
1.27 Probability density function (p.d.f.) of a univariate continuous
random variable 12
1.28 Expected value and variance of a univariate random variable 13
1.29 Distribution function of a univariate random variable 13
1.29.1 Discrete distribution function 14
1.29.2 Continuous distribution function 14
1.30 Selection of a sample using known univariate distribution function 15
1.30.1 Discrete random variable 15
1.30.2 Continuous random variable 17
1.31 Discrete bivariate random variable 19
1.32 Joint probability distribution function of bivariate discrete random
variables 20
1.33 Joint cumulative distribution function of bivariate discrete random
variables 20
1.34 Marginal distributions of a bivariate discrete random variable 20
1.35 Selection of a sample using known discrete bivariate distribution
function 20
1.36 Continuous bivariate random variable 21
1.37 Joint probability distribution function of bivariate continuous
random variable 21
1.38 Joint cumulative distribution function of a bivariate continuous
random variable 22
1.39 Marginal cumulative distributions of bivariate continuous random
variable 22
1.40 Selection of a sample using known bivariate continuous
distribution function 22
1.41 Properties of a best estimator 24
1.41.1 Unbiasedness 24
1.41.1.1 Bias 28
1.41.2 Consistency 28
1.41.3 Sufficiency 28
1.41.4 Efficiency 29
1.41.4.1 Variance 29
1.41.4.2 Mean square error 29
1.42 Relative efficiency 29
1.43 Relative bias 29
1.44 Variance estimation through splitting 30
1.45 Loss function 31
1.46 Admissible estimator 31
1.47 Sample survey 31
1.48 Sampling distribution 32
1.49 Sampling frame 33

1.50 Sample survey design 33


1.51 Errors in the estimators 33
1.51.1 Sampling errors 34
1.51.2 Non-sampling errors 34
1.51.2.1 Non-response errors 35
1.51.2.2 Measurement errors 35
1.51.2.3 Tabulation errors 35
1.51.2.4 Computational errors 35
1.52 Point estimator 35
1.53 Interval estimator 35
1.54 Confidence interval 35
1.55 Population proportion 38
1.56 Sample proportion 38
1.57 Variance of sample proportion and confidence interval estimates 39
1.58 Relative standard error 50
1.59 Auxiliary information 50
1.60 Some useful mathematical formulae 56
1.61 Ordered statistics 57
1.61.1 Population median 57
1.61.2 Population quartiles 58
1.61.3 Population percentiles 59
1.61.4 Population mode 59
1.62 Definition(s) of statistics 59
1.63 Limitations of statistics 60
1.64 Lack of confidence in statistics 60
1.65 Scope of statistics 60
Exercises 60
Practical problems 63

2 SIMPLE RANDOM SAMPLING

2.0 Introduction 71
2.1 Simple random sampling with replacement 71
2.2 Simple random sampling without replacement 79
2.3 Estimation of population proportion 94
2.4 Searls' estimator of population mean 103
2.5 Use of distinct units in the WR sample at the estimation stage 106
2.5.1 Estimation of mean 107
2.5.2 Estimation of finite population variance 113
2.6 Estimation of total or mean of a subgroup (domain) of a population 118
2.7 Dealing with a rare attribute using inverse sampling 123
2.8 Controlled sampling 125
2.9 Determinant sampling 127
Exercises 128
Practical problems 132

3 USE OF AUXILIARY INFORMATION: SIMPLE RANDOM SAMPLING

3.0 Introduction 137


3.1 Notation and expected values 137
3.2 Estimation of population mean 138
3.2.1 Ratio estimator 138
3.2.2 Product estimator 145
3.2.3 Regression estimator 149
3.2.4 Power transformation estimator 160
3.2.5 A dual of ratio estimator 161
3.2.6 General class of estimators 164
3.2.7 Wider class of estimators 166
3.2.8 Use of known variance of auxiliary variable at estimation
stage of population mean 167
3.2.8.1 A class of estimators 167
3.2.8.2 A wider class of estimators 169
3.2.9 Methods to remove bias from ratio and product type
estimators 173
3.2.9.1 Quenouille's method 173
3.2.9.2 Interpenetrating sampling method 175
3.2.9.3 Exactly unbiased ratio type estimator 180
3.2.9.4 Unbiased product type estimator 183
3.2.9.5 Class of almost unbiased estimators of population
ratio and product 185
3.2.9.6 Filtration of bias 187
3.3 Estimation of finite population variance 191
3.3.1 Ratio type estimator 192
3.3.2 Difference type estimator 197
3.3.3 Power transformation type estimator 198
3.3.4 General class of estimators 199
3.4 Estimation of regression coefficient 203
3.4.1 Usual estimator 203
3.4.2 Unbiased estimator 204
3.4.3 Improved estimators of regression coefficient 207
3.5 Estimation of finite population correlation coefficient 209
3.6 Superpopulation model approach 214
3.6.1 Relationship between linear model and regression
estimator 214
3.6.2 Improved estimator of variance of linear regression
estimator 217
3.6.3 Relationship between linear model and ratio estimator 221
3.7 Jackknife variance estimator 223
3.7.1 Ratio estimator 223
3.7.2 Regression estimator 226

3.8 Estimation of population mean using more than one auxiliary


variable 229
3.8.1 Multivariate ratio estimator 230
3.8.2 Multivariate regression type estimators 231
3.8.3 General class of estimators 239
3.9 General class of estimators to estimate any population parameter 245
3.10 Estimation of ratio or product of two population means 248
3.11 Median estimation in survey sampling 250
Exercises 257
Practical problems 281

4 USE OF AUXILIARY INFORMATION: PROBABILITY PROPORTIONAL TO SIZE AND WITH REPLACEMENT (PPSWR) SAMPLING

4.0 Introduction 295


4.1 What is PPSWR sampling? 295
4.1.1 Cumulative total method 300
4.1.2 Lahiri's method 303
4.2 Estimation of population total 306
4.3 Relative efficiency of PPSWR sampling with respect to SRSWR
sampling 312
4.3.1 Superpopulation model approach 312
4.3.2 Cost aspect 315
4.4 PPSWR sampling: More than one auxiliary variable is available 317
4.4.1 Notation and expectations 318
4.4.2 Class of estimators 319
4.4.3 Wider class of estimators 320
4.4.4 PPSWR sampling with negatively correlated variables 324
4.5 Multi-character survey 326
4.5.1 Study variables have poor positive correlation with the
selection probabilities 326
4.5.1.1 General class of estimators 335
4.5.2 Study variables have poor positive as well as poor negative
correlation with the selection probabilities 336
4.6 Concept of revised selection probabilities 339
4.7 Estimation of correlation coefficient using PPSWR sampling 340
Exercises 341
Practical problems 345

5 USE OF AUXILIARY INFORMATION: PROBABILITY PROPORTIONAL TO SIZE AND WITHOUT REPLACEMENT (PPSWOR) SAMPLING

5.0 Introduction 349


5.0.1 Useful symbols 349
5.0.2 Some mathematical relations 349
5.1 Horvitz and Thompson estimator and related topics 351
5.2 General class of estimators 373
5.3 Model based estimation strategies 375
5.3.1 A brief history of the superpopulation model 377
5.3.2 Scott, Brewer and Ho's robust estimation strategy 378
5.3.3 Design variance and anticipated variance of linear
regression type estimator 383
5.4 Construction and optimal choice of inclusion probabilities 385
5.4.1 Pareto πps sampling estimation scheme 386
5.4.2 Hanurav's method 387
5.4.3 Brewer's method 388
5.4.4 Sampford's method 389
5.4.5 Narain's method 390
5.4.6 Midzuno--Sen method 390
5.4.7 Kumar--Gupta--Nigam scheme 391
5.4.8 Dey and Srivastava scheme for even sample size 392
5.4.9 SSS sampling scheme 393
5.4.10 Optimal choice of first order inclusion probabilities 394
5.5 Calibration approach 399
5.6 Calibrated estimator of the variance of the estimator of population
total 409
5.7 Estimation of variance of GREG 413
5.8 Improved estimator of variance of the GREG: The higher level
calibration approach 419
5.8.1 Recalibrated estimator of the variance of GREG 424
5.8.2 Recalibration using optimal designs for the GREG 426
5.9 Calibrated estimators of variance of estimator of total and
distribution function 428
5.9.1 Unified setup 430
5.10 Calibration of estimator of variance of regression predictor 431
5.10.1 Chaudhuri and Roy's results 433
5.10.2 Calibrated estimators of variance of regression predictor 436
5.10.2.1 Model assisted calibration 436
5.10.2.2 Calibration estimators when variance of auxiliary
variable is known 440
5.10.2.2.1 Each component of Vx is known 441
5.10.2.2.2 Compromised calibration 442
5.10.2.3 Prediction variance 444
5.11 Ordered and unordered estimators 444
5.11.1 Ordered estimators 445
5.11.2 Unordered estimators 449
5.12 Rao--Hartley--Cochran (RHC) sampling strategy 452
5.13 Unbiased strategies using IPPS sampling schemes 462
5.13.1 Estimation of population mean using a ratio estimator 462
5.13.2 Estimation of finite population variance 464
5.14 Godambe's strategy: Estimation of parameters in survey sampling 465
5.14.1 Optimal estimating function 470
5.14.2 Regression type estimators 472
5.14.3 Singh's strategy in two-dimensional space 473
5.14.4 Godambe's strategy for linear Bayes and optimal
estimation 476
5.15 Unified theory of survey sampling 479
5.15.1 Class of admissible estimators 479
5.15.2 Estimator 479
5.15.3 Admissible estimator 479
5.15.4 Strictly admissible estimator 479
5.15.5 Linear estimators of population total 483
5.15.6 Admissible estimators of variances of estimators of total 485
5.15.6.1 Condition for the unbiased estimator of variance 485
5.15.6.2 Admissible and unbiased estimator of variance 485
5.15.6.3 Fixed size sampling design 485
5.15.6.4 Horvitz and Thompson estimator and its variance
in two forms 485
5.15.7 Polynomial type estimators 489
5.15.8 Alternative optimality criterion 490
5.15.9 Sufficient statistic in survey sampling 491
5.16 Estimators based on conditional inclusion probabilities 493
5.17 Current topics in survey sampling 494
5.17.1 Survey design 495
5.17.2 Data collection and processing 495
5.17.3 Estimation and analysis of data 496
5.18 Miscellaneous discussions/topics 497
5.18.1 Generalized IPPS designs 497
5.18.2 Tam's optimal strategies 498
5.18.3 Use of ranks in sample selection 498
5.18.4 Prediction approach 498
5.18.5 Total of bottom (or top) percentiles of a finite population 499
5.18.6 General form of estimator of variance 499
5.18.7 Poisson sampling 499
5.18.8 Cosmetic calibration 500
5.18.9 Mixing of non-parametric models in survey sampling 501
5.19 Golden Jubilee Year 2003 of the linear regression estimator 504
Exercises 507
Practical Problems 520

6 USE OF AUXILIARY INFORMATION: MULTI-PHASE SAMPLING

6.0 Introduction 529


6.1 SRSWOR scheme at the first as well as at the second phases of the
sample selection 530
6.1.0 Notation and expected values 530
6.1.1 Ratio estimator 532
6.1.1.1 Cost function 535
6.1.2 Difference estimator 539
6.1.3 Regression estimator 540
6.1.4 General class of estimators of population mean 541
6.1.5 Estimation of finite population variance 544
6.1.6 Calibration approach in two-phase sampling 545
6.2 Two-phase sampling using two auxiliary variables 549
6.3 Chain ratio type estimators 554
6.4 Calibration using two auxiliary variables 555
6.5 Estimation of variance of calibrated estimator in two-phase
sampling: low and higher level calibration 560
6.6 Two-phase sampling using multi-auxiliary variables 563
6.7 Unified approach in two-phase sampling 563
6.8 Concept of three-phase sampling 565
6.9 Estimation of variance of regression estimator under two-phase
sampling 567
6.10 Two-phase sampling using PPSWR sampling 572
6.11 Concept of dual frame surveys 576
6.11.1 Common variables used for further calibration of weights 576
6.11.2 Estimation of variance using dual frame surveys 577
6.12 Estimation of median using two-phase sampling 578
6.12.1 General class of estimators 578
6.12.2 Regression type estimator 579
6.12.3 Position estimator 581
6.12.4 Stratification estimator 582
6.12.5 Optimum first and second phase samples for median
estimation 584
6.12.5.1 Cost is fixed 584
6.12.5.2 Variance is fixed 584
6.12.6 Kuk and Mak's technique in two-phase sampling 584
6.12.7 Chen and Qin technique in two-phase sampling 586
6.13 Distribution function with two-phase sampling 588
6.14 Improved version of two-phase calibration approach 590
6.14.1 Improved first phase calibration 590
6.14.2 Improved second phase calibration 592
Exercises 594
Practical problems 612
VOLUME II
7 SYSTEMATIC SAMPLING

7.0 Introduction 615


7.1 Systematic sampling 615
7.2 Modified systematic sampling 620
7.3 Circular systematic sampling 621
7.4 PPS circular systematic sampling 623
7.5 Estimation of variance under systematic sampling 624
7.5.1 Sub-sampling or replicated sub-sampling scheme 625
7.5.2 Successive differences 626
7.5.3 Variance of circular systematic sampling 627
7.6 Systematic sampling in population with linear trend 627
7.6.1 Estimators with linear trend 627
7.6.2 Modification of estimates 629
7.6.3 Estimators based on centrally located samples 631
7.6.4 Estimators based on balanced systematic sampling 633
7.7 Singh and Singh's systematic sampling scheme 635
7.8 Zinger strategy in systematic sampling 637
7.9 Populations with cyclic or periodic trends 638
7.10 Multi-dimensional systematic sampling 639
Exercises 642
Practical problems 646

8 STRATIFIED AND POST-STRATIFIED SAMPLING

8.0 Introduction 649


8.1 Stratified sampling 650
8.2 Different methods of sample allocation 659
8.2.1 Equal allocation 659
8.2.2 Proportional allocation 659
8.2.3 Optimum allocation method 662
8.3 Use of auxiliary information at estimation stage 676
8.3.1 Separate ratio estimator 677
8.3.2 Separate regression estimator 681
8.3.3 Combined ratio estimator 684
8.3.4 Combined regression estimator 688
8.3.5 On degree of freedom in stratified random sampling 693
8.4 Calibration approach for stratified sampling design 696
8.4.1 Exact combined linear regression using calibration 700
8.5 Construction of strata boundaries 701
8.5.1 Strata boundaries for proportional allocation 702
8.5.2 Strata boundaries for Neyman allocation 703
8.5.3 Stratification using auxiliary information 708
8.6 Superpopulation model approach 712
8.7 Multi-way stratification 713

8.8 Stratum boundaries for multi-variate populations 718


8.9 Optimum allocation in multi-variate stratified sampling 723
8.10 Stratification using two-phase sampling 726
8.11 Post-stratified sampling 729
8.11 .1 Conditional post-stratification 730
8.11.2 Unconditional post-stratification 731
8.12 Estimation of proportion using stratified random sampling 735
Exercises 738
Practical problems 748

9 NON-OVERLAPPING, OVERLAPPING, POST, AND ADAPTIVE CLUSTER SAMPLING

9.0 Introduction 765


9.1 Non-overlapping clusters of equal size 766
9.2 Optimum value of non-overlapping cluster size 790
9.3 Estimation of proportion using non-overlapping cluster sampling 792
9.4 Non-overlapping clusters of different sizes 796
9.5 Selection of non-overlapping clusters with unequal probability
sampling 805
9.6 Optimal and robust strategies for non-overlapping cluster sampling 808
9.7 Overlapping cluster sampling 812
9.7.1 Population size is known 812
9.7.2 Population size is unknown 814
9.8 Post-cluster sampling 817
9.9 Adaptive cluster sampling 819
Exercises 820
Practical problems 822

10 MULTI-STAGE, SUCCESSIVE, AND RE-SAMPLING STRATEGIES

10.0 Introduction 829


10.1 Notation 830
10.2 Procedure for construction of estimators of the total 831
10.3 Method of calculating the variance of the estimators 833
10.3.1 Selection of first and second stage units using SRSWOR
sampling 834
10.3.2 Optimum allocation in two-stage sampling 836
10.4 Optimum allocation of sample in three-stage sampling 837
10.5 Modified three-stage sampling 838
10.6 General class of estimators in two-stage sampling 839
10.7 Prediction estimator under two-stage sampling 842
10.8 Prediction approach to robust variance estimation in two-stage
cluster sampling 844

10.8.1 Royall's technique of variance estimation 846


10.9 Two-stage sampling with successive occasions 847
10.9.1 Arnab's successive sampling scheme 848
10.10 Estimation strategies in supplemented panels 865
10.11 Re-sampling methods 866
10.11.1 Jackknife variance estimator 867
10.11.2 Balanced half sample (BHS) method 871
10.11.3 Bootstrap variance estimator 873
Exercises 873
Practical problems 887

11 RANDOMIZED RESPONSE SAMPLING: TOOLS FOR SOCIAL SURVEYS

11.0 Introduction 889


11.1 Pioneer model 889
11.2 Franklin 's model 892
11.3 Unrelated question model and related issues 897
11.3.1 When proportion of unrelated character is known 897
11.3.2 When proportion of unrelated character is unknown 898
11.4 Regression analysis 903
11.4.1 Ridge regression estimator 905
11.5 Hidden gangs in finite populations 907
11.5.1 Two sample method 907
11.5.2 One sample method 911
11.5.3 Estimation of correlation coefficient between two
characters of a hidden gang 912
11.6 Unified approach for hidden gangs 916
11.7 Randomized response technique for a quantitative variable 920
11.8 GREG using scrambled responses 924
11.8.1 Calibration of scrambled responses 925
11.8.2 Higher order calibration of the estimators of variance
under scrambled responses 928
11.8.3 General class of estimators 930
11.9 On respondent's protection: Qualitative characters 930
11.9.1 Leysieffer and Warner's measure 930
11.9.2 Lanke's measure 932
11.9.3 Mangat and Singh's two-stage model 933
11.9.4 Mangat and Singh's two-stage and Warner's model at
equal level of protection 935
11.9.5 Mangat's model 939
11.9.6 Mangat's and Warner's model at equal level of protection 940
11.10 On respondent's protection: Quantitative characters 942
11.10.1 Unrelated question model for quantitative data 942
11.10.2 The additive model 943
11.10.3 The multiplicative model 943

11.10.4 Measure of privacy protection 944


11.10.5 Comparison between additive and multiplicative models 945
11.11 Test for detecting untruthful answering 949
11.12 Stochastic randomized response technique 951
Exercises 954
Practical problems 972

12 NON-RESPONSE AND ITS TREATMENTS

12.0 Introduction 975


12.1 Hansen and Hurwitz pioneer model 976
12.2 Politz and Simmons model 980
12.3 Horvitz and Thompson estimator under non-response 984
12.4 Ratio and regression type estimators 986
12.4.1 Distribution and some expected values 987
12.4.2 Estimation of population mean 987
12.4.3 Estimation of finite population variance 993
12.5 Calibrated estimators of total and variance in the presence of
non-response 1000
12.5.1 Estimation of population total and variance 1000
12.5.2 Calibration estimator for the total 1002
12.5.3 Calibration of the estimators of variance 1003
12.5.3.1 PPSWOR Sampling 1005
12.5.3.2 SRSWOR Sampling 1007
12.6 Different treatments of non-response 1009
12.6.1 Ratio method of imputation 1010
12.6.2 Mean method of imputation 1010
12.6.3 Hot deck (HD) method of imputation 1010
12.6.4 Nearest neighbor (NN) method of imputation 1011
12.7 Superpopulation model approach 1013
12.7.1 Different components of variance 1014
12.8 Jackknife technique 1016
12.9 Hot deck imputation for multi-stage designs 1017
12.10 Multiple imputation 1021
12.10.1 Degree of freedom with multiple imputation for small
samples 1024
12.11 Compromised imputation 1025
12.11.1 Practicability of compromised imputation 1027
12.11.2 Recommendations of compromised imputation 1027
12.11.3 Warm deck imputation 1028
12.11.4 Mean cum NN imputation 1028
12.12 Estimation of response probabilities 1031
12.13 Estimators based on estimated response probabilities 1033
12.13.1 Estimators based on response probabilities 1035
12.13.2 Calibration of response probabilities 1037
12.13.2.1 Calibrated estimator and its variance 1038

12.13.2.2 Estimation of variance of the calibrated estimator 1039
Exercises 1041
Practical problems 1058

13 MISCELLANEOUS TOPICS

13.0 Introduction 1065


13.1 Estimation of measurement errors 1065
13.1.1 Estimation of measurement error using a single
measurement per element 1066
13.1.1.1 Model and notation 1066
13.1.1.2 Grubbs' estimators 1066
13.1.2 Bhatia, Mangat, and Morrison's (BMM) repeated
measurement estimators 1068
13.1.2.1 Model and notation 1069
13.2 Raking ratio using contingency tables 1073
13.3 Continuous populations 1077
13.4 Small area estimation 1081
13.4.1 Symptomatic accounting techniques 1081
13.4.2 Vital rates method (VRM) 1081
13.4.3 Census component method (CCM) 1082
13.4.4 Housing unit method (HUM) 1083
13.4.5 Synthetic estimator 1083
13.4.6 Composite estimator 1086
13.4.7 Model based techniques 1090
13.4.7.1 Henderson's model 1090
13.4.7.2 Nested error regression model 1093
13.4.7.3 Random regression coefficient model 1095
13.4.7.4 Fay and Herriot model 1097
13.4.8 Further generalizations 1097
13.4.9 Estimation of proportion of a characteristic in small areas
of a population 1099
Exercises 1101
Practical problems 1101

APPENDIX

TABLES

1 Pseudo-Random Numbers (PRN) 1105
2 Critical values based on t distribution 1107
3 Area under the standard normal curve 1109

POPULATIONS

1 All operating banks: Amount (in $000) of agricultural loans
outstanding in different states in 1997 1111
2 Hypothetical situation of a small village having only 30 older
persons (age more than 50 years) : Approximate duration of sleep
(in minutes) and age (in years) of the persons 1113
3 Apples, commercial crop: Season average price (in $) per pound, by
States, 1994-1996 1114
4 Fish caught: Estimated number of fish caught by marine
recreational fishermen by species group and year, Atlantic and Gulf
coasts, 1992-1995 1116
5 Tobacco: Area (hectares), yield and production (metric tons) in
specified countries during 1998 1119
6 Age specific death rates from 1990 to 2065 (Number per 100,000
births) 1123
7 State population projections, 1995 and 2000 (Number in thousands) 1124
8 Projected vital statistics by country or area during 2000 1126
9 Number of immigrants admitted to the USA 1129

BIBLIOGRAPHY 1131
AUTHOR INDEX 1193
HANDY SUBJECT INDEX 1215
ADDITIONAL INFORMATION 1219
PREFACE

Advanced Sampling Theory with Applications: How Michael 'Selected' Amy is a comprehensive exposition of basic and advanced sampling techniques along with their applications in the diverse fields of science and technology.

This book is a multi-purpose document. It can be used as a text by teachers, as a reference manual by researchers, and as a practical guide by statisticians. It covers
1179 references from different research journals through almost 2158 citations
across 1248 pages, a large number of complete proofs of theorems, important
results such as corollaries, and 335 unsolved exercises from several research papers.
It includes 162 solved, data based, real life numerical examples in disciplines such
as Agriculture, Demography, Social Science, Applied Economics, Engineering,
Medicine, and Survey Sampling. These solved examples are very useful for an
understanding of the applications of advanced sampling theory in our daily life and
in diverse fields of science. An additional 177 unsolved practical problems are
given at the ends of the chapters. University and college professors may find these
useful when assigning exercises to students. Each exercise gives exposure to several
complete research papers for researchers/students. For example, by referring to
Exercise 3.1 at the back of Chapter 3, different types of estimators of a population
mean studied by Chakrabarty (1968), Vos (1980), Adhvaryu and Gupta (1983),
Walsh (1970), Sahai and Sahai (1985) and Sisodia and Dwivedi (1981) are
examined. Thus, this single exercise covers about six research papers. Similarly,
Exercise 5.7 explains the other possibilities in the calibration approach considered
by Deville and Sarndal (1992) and their followers.
The data based problems show statisticians how to select a sample and obtain estimates of parameters from a given population by using different sampling strategies like SRSWR, SRSWOR, PPSWR, PPSWOR, RHC, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. Derivations of calibration weights from the design weights under single phase and two-phase sampling have been provided for simple numerical examples. These examples will be useful for understanding the meaning of benchmarks to improve the design weights. These examples also explain the background of well known scientific computer packages like CALMAR, GES, SAS, STATA, and SUDAAN, etc. (some of which are very expensive), used to generate calibration weights by most organizations in the public and private sectors. The ideas of hot deck, cold deck, the mean method of imputation, the ratio method of imputation, compromised imputation, and multiple imputation have been explained with very simple numerical examples. Simple examples are also provided to understand Jackknife variance estimation under single phase, two-phase [or random non-response by following Sitter (1997)], and multi-stage stratified designs.

I have provided a summary of my book from which a statistician can reach a fruitful decision by making a comparison in his/her mind with the existing books in the international market.

[A chapter-by-chapter summary table appears here, listing counts (such as pages, solved examples, and exercises) for the title pages, dedication, table of contents, preface, Chapters 1 to 13, appendix, bibliography, author index, subject index, and related books; the column headings were lost in scanning.]

This book also covers, in a very simple and compact way, many new topics not yet available in any book on the international market. A few of these interesting topics are: median estimation under single phase and two-phase sampling, difference between the low level and higher level calibration approaches, calibration weights and design weights, estimation of parametric functions, hidden gangs in finite populations, compromised imputation, variance estimation using distinct units, general class of estimators of population mean and variance, wider class of estimators of population mean and variance, power transformation estimators, estimators based on the mean of non-sampled units of the auxiliary character, ratio and regression type estimators for estimating finite population variance similar to those proposed by Isaki (1982), unbiased estimators of mean and variance under Midzuno's scheme of sampling, usual and modified jackknife variance estimators,

estimation of regression coefficient, concept of revised selection probabilities, multi-character survey sampling, overlapping, adaptive, and post-cluster sampling, new techniques in systematic sampling, successive sampling, small area estimation, continuous populations, and estimation of measurement errors.

This book has 459 tables, figures, maps, and graphs to explain the exercises and theory in a simple way. The collection of 1179 references (assembled over more than ten years from journals available in India, Australia, Canada, and the USA) is a vital resource for researchers. The most interesting part is the method of notation along with complete proofs of the basic theorems. From my experience and discussions with several research workers in survey sampling, I found that most people dislike the form or method of notation used by different writers in the past. In this book I have tried to keep the notation simple, neat, and understandable. I used data relating to the United States of America and other countries of the world, so that international students should find it interesting and easy to understand. I am confident that the book will find a good place and reputation in the international market, as there is currently no book which is so thorough and simple in its presentation of the subject of survey sampling.

The objective, style, and pattern of this book are quite different from those of other books available in the market. This book will be helpful to:

( a ) Graduates and undergraduates majoring in statistics and programs where sampling techniques are frequently used;
( b ) Graduates currently involved in M.Sc. or Ph.D. programs in sampling theory or using sampling techniques in their research;
( c ) Government organizations such as the US Bureau of Statistics, Statistics Canada, the Australian Bureau of Statistics, the New Zealand Bureau of Statistics, and the Indian Statistical Institute, in addition to private organizations such as RAND and WESTAT, etc.

In this book I have begun each chapter with basic concepts and complete derivations of the theorems or results. I have ended each chapter by filling the gap between the origin of each topic and the recent references. In each chapter I have provided exercises which summarize the research papers. Thus this book not only gives the basic techniques of sampling theory but also reviews most of the research papers available in the literature related to sampling theory. It will also serve as an umbrella of references under different topics in sampling theory, in addition to clarifying the basic mathematical derivations. In short, it is an advanced book, but it provides exposure to elementary ideas too. It is a much better restatement of the existing knowledge available in journals and books. I have used data, graphs, tables, and pictures to make sampling techniques clear to learners.

EXERCISES

At the end of each chapter I have provided exercises, and their solutions are given through references to the related research papers. The exercises can be used to clarify or relate the classroom work to the other possibilities in the literature.

At the end of each chapter I have provided practical problems which enable
students and teachers to do additional exercises with real data.

I have taken real data related to the United States of America and many other countries around the world. This data is freely available in libraries for public use, and it has been provided in the Appendix of this book for the convenience of the readers. It will be interesting to international students.

NEW TECHNOLOGIES


This section provides students and researchers with new formulae available in the literature, which can be used to develop new computer programs for estimating parameters in survey sampling and to learn basic statistical techniques.

SOLUTION MANUAL

I am working on a complete solution manual to the practical problems and selected theoretical exercises given at the ends of the chapters.

I was born in the village of Ajnoud, in the district of Ludhiana, in the state of Punjab, India, in 1963. My primary education is from the Govt. Primary School, Ajnoud; the Govt. Middle School, Bilga; and the Govt. High School, Sahnewal, which are near my birthplace. I did my undergraduate work at Govt. College Karamsar, Rarra Sahib. I still remember that I used to bicycle my way to college, about 15 km daily, along the banks of canals. It was fun, and that life has never come back. My M.Sc. and Ph.D. degrees in statistics were completed at the Punjab Agricultural University (PAU), Ludhiana, with most of the time spent in room no. 46 of hostel no. 5.

I attended conferences of the Indian Society of Agricultural Statistics held at Gujarat, Haryana, Orissa, and Kerala, and was a winner of the Gold Medal in 1994 for the Young Scientist Award. I attended conferences of the Australian Statistical Society in Sydney and the Gold Coast. I attended a conference of the International Indian Statistical Association at Hamilton, and the Statistical Society of Canada conferences at Hamilton, Regina, and Halifax, in addition to the Concordia University conference. I also attended the Joint Statistical Meetings (JSM-2001, 2002) at Atlanta and New York.

At present I am an Assistant Professor at St. Cloud State University, St. Cloud, MN, USA, and recently introduced the idea of obtaining the exact traditional linear regression estimator using the calibration approach. From 2001 to 2002 I did post doctoral work at Carleton University, Canada. From 2000 to 2001 I was a Visiting Instructor at the University of Saskatchewan, Canada. From 1999 to 2000 I was a Visiting Instructor at the University of Southern Maine, USA, where I taught several courses to undergraduate and graduate students, and introduced the idea of compromised imputation in survey sampling. From 1998 to 1999 I was a Visiting Scientist at the University of Windsor, Canada. From 1996 to 1998 I was Research Officer-II in the Methodology Division of the Australian Bureau of Statistics, where I developed the higher order calibration approach for estimating the variance of the GREG and introduced the concept of hidden gangs in finite populations. From 1995 to 1996 I was a Research Assistant at Monash University, Australia. From 1991 to 1995 I was a Research Fellow, Assistant Statistician, and then Assistant Professor at PAU, Ludhiana, India, and was awarded a Ph.D. in statistics in 1991. I have published over 80 research papers in reputed journals of statistics and energy science. I am also co-author of a monograph entitled Energy in Punjab Agriculture, published by the Indian Council of Agricultural Research, New Delhi.

Advanced Sampling Theory with Applications is my additional achievement. In this book you can enjoy my new ideas such as:

"How did Michael select Amy?"

"How can you weigh elephants in a circus?"

"How many girls like Bob?"

in addition to higher order calibration, bias filtration, hybridising imputation and calibration techniques, hidden gangs, median estimation using two-phase sampling, several new randomised response models, and the exact traditional linear regression using the calibration technique, etc.

ACKNOWLEDGEMENTS

Indeed the words at my command are not adequate to convey my feelings of gratitude toward the late Prof. Ravindra Singh for his constant, untiring and ever encouraging support since 1996, when I started writing this book. Prof. Ravindra Singh passed away Feb. 4, 2003, which is a great loss to his erstwhile students and colleagues, including me. He was my major advisor in my Ph.D. and was closely associated with my research work. Since 1996 Mr. Stephen Horn, supervisor at the Australian Bureau of Statistics, always encouraged me to complete this book, and I appreciate his sincere co-operation, contribution and kindness in joint research papers, as well as guidance to complete this book. The help of Prof. M.L. King, Monash University, is also appreciated. I started writing this book while staying with Dr. Jaswinder Singh, his wife Dr. Rajvinder Kaur, and their daughter Miss Jasraj Kaur in Australia during 1996. For almost seven years I worked day and night on this book, and during May-July, 2003, I rented a room near an Indian restaurant in Malton, Canada to save cooking time and spent most of the time on this book.

Thanks are due to Prof. Ragunath Arnab, University of Durban--Westville, for help in completing the work in Chapter 10 related to his contribution in successive sampling, and for completing some joint research papers. The help of Prof. H.P. Singh, Vikram University, in joint publications is also duly acknowledged.

The contribution of the late Prof. D.S. Tracy, University of Windsor, in reading a few chapters of the very early draft of the manuscript is also duly acknowledged. The contribution of Ms. Margot Siekman, University of Southern Maine, in reading a few chapters is also duly acknowledged. Thanks are also due to a professional editor, Kathleen Prendergast, University of Saskatchewan, for critically checking the grammar and punctuation of a few chapters. Prof. M. Bickis, University of Saskatchewan, really helped me in my career when I was on the road looking for a job by going from university to university in Canada. The help of Prof. Silvia Valdes and Ms. Laurie McDermott, University of Southern Maine, has been much appreciated. Thanks are also due to Professor Patrick Farrell, Carleton University, for giving me a chance to work with him as a post doctoral fellow. Thanks are also due to Prof. David Robinson at SCSU for providing a very peaceful work environment in the department. The aid of one Stat 321 student, Miss Kok Yuin Ong, in cross checking all the solved numerical examples, and of a professional English editor, Mr. Eric Westphal, in reading the entire manuscript at SCSU, is much appreciated. Thanks are also due to a professional editor, Dr. M. Cole from England, for editing the complete manuscript and bringing it into its present form. The help of Mary Shrode and Mitra Sangrovla, Learning Resources and Technology Service, SCSU, in drawing a few illustrations using the NOVA Art Explosion 600,000 image collection is duly acknowledged.

I am also thankful to the galaxy of my friends/colleagues, viz., Dr. Inderjit Grewal (PAU), Dr. B.R. Garg (PAU), Dr. Sukhjinder Sidhu (PAU), Prof. L.N. Upadhyaya (Indian School of Mines), Er. Amarjot Singh (Australia), Mr. Qasim Shah (Australia), Mr. Kuldeep Virdi (Canada), Mr. Kulwinder Channa (Canada), Prof. Balbinder Deo (Canada), Er. Mohan Jhajj (Canada), Mr. Gurbakhash Ubhi (Canada), Mr. Gurmeet Ghatore (USA), Dr. Gurjit Sidhu (USA), Prof. Balwant Singh (USA), Prof. Munir Mahmood (USA), and Mr. Suman Kumar (USA). All cannot be listed, but none is forgotten. I met uncle Mr. Trilochan Singh at Ottawa, who changed my style of living a bit and taught me to get involved with other things, not only sampling theory, and I appreciate his advice. I sincerely appreciate Dr. Joginder Singh's advice at Ottawa, who taught me to do meditation by imagining the writing of the name of God with eyes closed, and I found it helps when under pressure from work. I am most grateful to my teachers and colleagues for their help and co-operation. Special thanks are due to my father Mr. Sardeep Ubhi and my mother Mrs. Ranjit Ubhi for making this book possible, my brothers Jatinder and Kulwinder, and my late sister Sarjinder.

The permission of Dimitri Chappas, NOAA/National Climatic Data Center, to print a few maps is also duly acknowledged. Free access to the data given in the Appendix by Agricultural Statistics and Statistical Abstracts of the United States is also duly acknowledged. I would also like to extend my thanks to the Editor James Finlay, Associate Editor Inge Hardon, and the reviewers for bringing the original version of the manuscript into the present form and into the public domain.

Note that I used EXCEL to solve the numerical examples, and when using a hand calculator there may be some discrepancies in the results after one or two decimal places. Further note that the names used in the examples, such as Amy, Bob, Mr. Bean, etc., are generic, and are not intended to resemble any real people. I would also like to submit that all opinions and methods of presentation of results in this book are solely the author's and are not necessarily representative of any institute or organization. I tried to collect all recent and old papers, but if you have any published related paper and would like it to be highlighted in the next volume of my book, please feel free to mail a copy to me, and it will be my pleasure to give a suitable place to your paper. To my knowledge this will be the first book in survey sampling open to everyone to share a contribution, irrespective of your designation, status, group of scientists, journal names, or any other discriminating characteristic you feel exists in this world. Your opinions are most welcome and any suggestion for improvement will be much appreciated via e-mail.

Sarjinder Singh (B.Sc., M.Sc., Ph.D., Gold Medalist, and Post Doctorate)
Assistant Professor, Department of Statistics, St. Cloud State University,
St. Cloud, MN, 56301-4498, USA. E-mail: [email protected]
1. BASIC CONCEPTS AND MATHEMATICAL NOTATION

1.0 INTRODUCTION

In this chapter we introduce some basic concepts and mathematical notation, which should be known to every survey statistician. The meaning and use of these terms are reinforced by their use in the subsequent chapters.

1.1 POPULATION

In statistical language the term population is applied to any finite or infinite collection of individuals or units. It has displaced the older term 'universe'. It is practically synonymous with 'aggregate'. A population is a collection of objects or units about which we want to know something or draw an inference. The population may be finite or infinite. Assume a population consists of electric bulbs produced by a plant. We may want to estimate the average life of the bulbs. The number of bulbs produced by the plant may be finite or infinite.

1.1.1 FINITE POPULATION

If the number of objects or units in the population is countable, it is said to be a finite population. For example, the number of houses in a suburb is a finite population.

1.1.2 INFINITE POPULATION

If the number of objects or units in the population is infinite, it is said to be an infinite population. For example, the number of stars in the sky forms an infinite population. In general, the population is denoted by \(\Omega\) and its size is denoted by \(N\). In the case of an infinite population, \(N \to \infty\).

1.1.3 TARGET POPULATION

A finite or infinite population about which we require information is called the target population. For example, all 18 year old girls in the United States.

1.1.4 STUDY POPULATION

This is the basic finite set of individuals we intend to study. For example, all 18 year old girls whose permanent address is in New York.

1.2 SAMPLE

A subset of the population, which represents the entire population, is called a sample. The sample is denoted by \(s\) and its size by \(n\).

1.3 EXAMPLES OF POPULATIONS AND SAMPLES

We provide here a few examples of populations and samples as follows:

( a ) All bulbs manufactured in a plant constitute a population. Now consider that we want to estimate the average lifetime of all the bulbs. Instead of taking the whole population into consideration for testing purposes, we take 50 bulbs. Then the collection of 50 bulbs will be called a sample;
( b ) Consider that we want to find the percentage of ticketless travellers in the TTC buses of Toronto. Then all persons travelling in all the buses of Toronto will constitute the population, and the persons checked by a particular checker(s) will form a sample.

1.4 CENSUS

A census is a particular case of a sample. If we take the whole population as the sample, then the sample survey is called a census.

1.5 RELATIVE ASPECTS OF SAMPLING VERSUS CENSUS

The following table provides some of the major differences between a sample and a census.

Aspect                       Sample                                      Census
Cost                         Less                                        More
Effort                       Less                                        More
Time consumed                Less                                        More
Errors                       May be predicted with certain confidence    No such errors
Accuracy of measurements     More                                        Less

1.6 STUDY VARIABLE

The variable of interest, or the variable about which we want to draw some inference, is called a study variable. Its value for the \(i^{th}\) unit is generally denoted by \(Y_i\). For example, the life of the bulbs produced by a certain plant can be taken as a study variable.

1.7 AUXILIARY VARIABLE

A variable having a direct or indirect relationship to the study variable is called an auxiliary variable. The value of an auxiliary variable for the \(i^{th}\) unit is generally denoted by \(X_i\) or \(Z_i\), etc. For example, the time or money spent on producing each bulb by the plant to maintain the quality can be taken as an auxiliary variable.

1.8 DIFFERENCE BETWEEN STUDY VARIABLE AND AUXILIARY VARIABLE

The main differences between the study variable and the auxiliary variable are as follows:

Factors                       Study Variable                   Auxiliary Variable
Cost                          More                             Less
Effort                        More                             Less
Sources of availability       Current surveys or experiments   Current or past surveys, books or journals, etc.
Interest of an investigator   More                             Less
Error in measurement          More                             Less
Sources of error              More                             Fewer
Notation                      Y                                X, Z

1.9 PARAMETER

An unknown quantity, which may vary over different sets of values forming a population, is called a parameter. Any function of population values of a variable is called a parameter. It is generally denoted by \(\theta\).
Mathematically, suppose a population \(\Omega\) consists of \(N\) units and the value of its \(i^{th}\) unit is \(Y_i\). Then any function of the \(Y_i\) values is a parameter, i.e.,
\[ \text{Parameter} = f(Y_1, Y_2, \ldots, Y_N). \tag{1.9.1} \]
For example, if \(Y_i\) denotes the total life time of the \(i^{th}\) bulb, then the average life time of the bulbs produced by the company is a parameter and is given by
\[ \text{Parameter} = \frac{1}{N}\left(Y_1 + Y_2 + \cdots + Y_N\right). \tag{1.9.2} \]

1.10 STATISTIC

A summary value calculated from a sample of observations, usually but not necessarily as an estimator of some population parameter, is called a statistic and is generally denoted by \(\hat{\theta}\). Mathematically, suppose a sample \(s\) consists of \(n\) units and the value of the \(i^{th}\) unit of the sample is denoted by \(y_i\). Any function of the \(y_i\) values will be a statistic, i.e.,
\[ \text{Statistic} = f(y_1, y_2, \ldots, y_n). \tag{1.10.1} \]
For example, if \(y_i\) denotes the total life time of the \(i^{th}\) bulb, then the average life time of the bulbs produced by the company is estimated by the statistic defined as
\[ \text{Statistic} = \frac{1}{n}\left(y_1 + y_2 + \cdots + y_n\right). \tag{1.10.2} \]
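The distinction can be illustrated with a minimal Python sketch; the bulb lifetimes below are hypothetical values chosen only for illustration:

```python
import random

# Hypothetical population: lifetimes (in hours) of N = 8 bulbs.
population = [1200, 950, 1100, 1030, 880, 1250, 990, 1075]
N = len(population)

# Parameter: a function of all N population values, here the mean as in (1.9.2).
parameter = sum(population) / N

# Statistic: the same function computed from n sampled values, as in (1.10.2).
random.seed(1)
sample = random.sample(population, k=4)    # an SRSWOR sample of n = 4 units
statistic = sum(sample) / len(sample)

print(f"Parameter (population mean) = {parameter:.2f}")
print(f"Statistic (sample mean)     = {statistic:.2f}")
```

The parameter is a fixed (usually unknown) number, while the statistic varies from sample to sample and serves as its estimator.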

1.11 STATISTICS

Statistics is a science of collecting, analysing and interpreting numerical data relating to an aggregate of individuals or units.

1.12 SAMPLE SELECTION

A sample can be selected from a population in many ways. In this chapter, we will discuss only two simple methods of sample selection. As the reader becomes familiar with sample selection, more complicated schemes will be discussed in the following chapters.

1.12.1 CHIT METHOD OR LOTTERY METHOD

Suppose we have \(N = 10,000\) blocks in New York City. We wish to draw a sample of \(n = 100\) blocks to draw an inference about a character under study, e.g., the average amount of alcohol used, or the number of bulbs used in each block produced by a certain company. Assign numbers to the 10,000 blocks, write these numbers on chits, and fold them in such a way that all chits look identical. Put all the chits in a box. Then there are two possibilities:

1.12.1.1 WITH REPLACEMENT SAMPLING

Select one chit out of the 10,000 chits in the box and note the number of the block written on it. This is the first unit selected in the sample. Before selecting the second chit, we replace the first chit in the box and mix it with the other chits thoroughly. Then select the second chit and note the number of the block written on it. This is called the second unit selected in the sample. Go on repeating the process until 100 chits have been selected. Note that because the chits are selected after replacing the previous chit in the box, some chits may be selected more than once. Such a sampling procedure is called Simple Random Sampling With Replacement, or simply SRSWR sampling. Let us explain with a small number of blocks in a population as follows:

Suppose a population consists of \(N = 3\) blocks, say A, B and C. We wish to draw all possible samples of size \(n = 2\) using SRSWR sampling. The possible ordered samples are: AA, AB, AC, BA, BB, BC, CA, CB, CC. Thus a total of 9 samples of size 2 can be drawn from the population of size 3, which in fact is given by \(3^2 = 9\).
In general, the total number of samples of size \(n\) drawn from a population of size \(N\) in with replacement sampling is \(N^n\) and is denoted by \(s(n)\). Thus
\[ s(n) = N^n. \tag{1.12.1} \]

Now imagine the situation: 'How many WR samples, each of \(n = 100\) blocks, are possible out of \(N = 10,000\) blocks?'

1.12.1.2 WITHOUT REPLACEMENT SAMPLING

In the case of without replacement sampling, we do not replace the chit while selecting the next chit; i.e., the number of chits in the box goes on decreasing as we go on selecting chits. Hence, there is no chance for a chit to be selected more than once. Such a sampling procedure is called Simple Random Sampling Without Replacement, or simply SRSWOR sampling. Let us explain it as follows: Suppose a population consists of \(N = 3\) blocks A, B and C. We wish to draw all possible unordered samples of size \(n = 2\). Evidently, the possible samples are: AB, AC, BC. Thus a total of 3 samples of size 2 can be drawn from the population of size 3, which in fact is given by \({}^3C_2 = 3\). In general, the total number of samples of size \(n\) drawn without replacement from a population of size \(N\) is given by \({}^NC_n\). Thus
\[ s(n) = {}^NC_n = \frac{N!}{n!\,(N-n)!} \tag{1.12.2} \]
where \(n! = n(n-1)(n-2)\cdots 2 \cdot 1\), and \(0! = 1\).
Now think again: 'How many WOR samples, each of \(n = 100\) blocks, are possible out of \(N = 10,000\) blocks?'
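Both counting formulae are easy to verify with a minimal Python sketch; the block labels repeat the \(N = 3\), \(n = 2\) illustration above:

```python
from itertools import combinations, product
from math import comb

blocks, n = ["A", "B", "C"], 2

# Ordered samples with replacement: N^n = 3^2 = 9, as in (1.12.1).
wr = list(product(blocks, repeat=n))
print(len(wr), wr)

# Unordered samples without replacement: 3C2 = 3, as in (1.12.2).
wor = list(combinations(blocks, n))
print(len(wor), wor)

# The questions posed in the text: samples of n = 100 out of N = 10,000 blocks.
print(10_000 ** 100)       # WR count: 10,000^100 = 10^400, a 401-digit number
print(comb(10_000, 100))   # WOR count: 10,000 C 100, also astronomically large
```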

Note that it is a very cumbersome job to make identical chits if the size of the population is very large. In such situations, another method of sample selection is based on the use of a random number table. A random number table is a set of numbers used for drawing random samples. The numbers are usually compiled by a process involving a chance element and, in their simplest form, consist of a series of digits 0 to 9 occurring at random with equal probability.

1.12.2 RANDOM NUMBER TABLE METHOD

As mentioned above, in this table the numbers from 0 to 9 are written both in columns and rows. For the purpose of illustration, we use Pseudo-Random Numbers (PRN), generated by using the UNIF subroutine following Bratley, Fox, and Schrage (1983), as given in Table 1 of the Appendix. We generally apply the following rules to select a sample:

Rule 1. First we write all random numbers into groups of columns, as already done in Table 1 of the Appendix. We take as many columns in each group as the number of digits in the population size.
Rule 2. List all the individuals or units in the population and assign them numbers 1, 2, 3, ..., N.
Rule 3. Randomly select any starting point in the table of random numbers. Write down all the numbers less than or equal to \(N\) that follow the starting point until we obtain \(n\) numbers. If we are using SRSWOR sampling, discard any number that is repeated in the random number table. If we are using SRSWR sampling, retain the repeated numbers.
Rule 4. Select those units that are assigned the numbers listed in Rule 3. This will constitute the required random sample.

Let us explain these rules as follows: Suppose we are given a population of \(N = 225\) units and we want to select a sample of, say, \(n = 36\) units from it. To pick up a random sample of 36 units out of a population of 225 units, use any three columns from the random number table. For example, use columns 1 to 3, 4 to 6, etc., rejecting any number greater than 225 (and also the number 000). As an example, the following table lists the 36 units selected using the SRSWR sampling procedure with the use of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix.

Units selected in the sample


014 049 053 039 196 183 171 225 179 153 142 138
070 083 001 209 222 075 219 092 155 012 099 211
027 039 048 048 080 161 006 059 199 150 025 173

In the case of SRSWOR sampling, the figures 039, 048 would not get repeated; i.e.,
we would take every unit only once, so we will continue to select two more distinct
random numbers as 078 and 163.
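Rules 1 to 4 are mechanical enough to automate. The following is a minimal Python sketch (our own illustration); the digit stream here is simply the first few three-digit groups of the table above strung together, so the output reproduces the start of the listed SRSWR sample:

```python
def table_sample(digits, N, n, with_replacement=True):
    """Select n unit numbers by scanning fixed-width groups of random digits."""
    width = len(str(N))                  # Rule 1: group width = digits in N
    selected = []
    for i in range(0, len(digits) - width + 1, width):
        number = int(digits[i:i + width])
        if not 1 <= number <= N:         # Rule 3: reject 000 and numbers > N
            continue
        if not with_replacement and number in selected:
            continue                     # SRSWOR: discard repeated numbers
        selected.append(number)
        if len(selected) == n:
            break                        # stop once n units are listed
    return selected

digits = "014049053039196183171225179153142138"
print(table_sample(digits, N=225, n=10))
# -> [14, 49, 53, 39, 196, 183, 171, 225, 179, 153]
```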

Although the above method of selecting a sample by using a random number table is very simple, it may lead to a lot of rejections of the random numbers; therefore we would like to discuss a shortcut method called the remainder method.

1.12.2.1 REMAINDER METHOD

Using the above example, if any selected three digit random number is greater than 225, then divide it by 225. We choose the serial number from 1 through 224 corresponding to the remainder when it is not zero, and the serial number 225 when the remainder is zero. However, it is necessary to reject the numbers from 901 to 999 (besides 000) in adopting this procedure, as otherwise units with serial numbers 1 to 99 will have a larger probability (5/999) of selection, while those with serial numbers 100 to 225 will have probability only equal to 4/999. If we use this procedure and also the same three figure random numbers as given in columns 1 to 3, 4 to 6, etc., we obtain the sample of units which are assigned the numbers given below. Again, in SRSWR sampling the numbers that give rise to the same remainder are not discarded, while in the SRSWOR sampling procedure such numbers are discarded. Thus an SRSWR sample is as given below:
Units selected in the sample

138 151 099 025 014 022 197 176 011 209 042 194
015 049 095 040 027 124 116 097 126 142 073 158
108 053 046 001 207 156 201 027 111 209 065 184

Note that in the SRSWR sample only the unit 209 is repeated; thus for SRSWOR
sampling we continue to apply the remainder approach until another distinct unit is
selected, which is 089 in this case. Further note that the first random number, 992,
was discarded as required by this rule.
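A minimal Python sketch of the remainder method is given below; it assumes the
rejection rule stated above, rejecting 000 and 901 to 999 (here 900 = 4 x 225, the
largest multiple of N below 999).

    def remainder_unit(r, N=225):
        # Map a three-digit random number to a serial number 1..N, or None if rejected.
        if r == 0 or r > 900:          # reject 000 and 901-999 to keep probabilities equal
            return None
        rem = r % N
        return N if rem == 0 else rem  # a zero remainder corresponds to serial number N

    # The first column-group of Table 1 starts 992, 588, 601, 549, which reproduces
    # the start of the table above: 992 is discarded, then 138, 151, 099.
    print([remainder_unit(r) for r in (992, 588, 601, 549)])   # [None, 138, 151, 99]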

1.13 PROBABILITY SAMPLING

Probability sampling is any method of selection of a sample based on the theory of


probability. At any stage of the selection process, the probability of a given set of
units being selected must be known .

1.14 PRO BABILI TY OF SELECTING A SAMPLE

Every sample selected from the population has some known probability of being
selected on any occasion. It is generally denoted by the symbol P_t or p(t). For
example, the probability of selecting a sample using
with replacement sampling is P_t = 1/N^n, t = 1, 2, ..., N^n,    (1.14.1)
and using
without replacement sampling is P_t = 1/^N C_n, t = 1, 2, ..., ^N C_n.    (1.14.2)
The following table describes the difference between with replacement and without
replacement sampling procedures.

With replacement sampling | Without replacement sampling
Cheaper. | Costly.
A few units may be selected more than once. | A unit can get selected only once.
Less efficient. | More efficient.
Number of possible samples s(n) = N^n. | Number of possible samples s(n) = ^N C_n.
Probability of selecting a particular sample P_t = 1/N^n, t = 1, 2, ..., N^n. | Probability of selecting a particular sample P_t = 1/^N C_n, t = 1, 2, ..., ^N C_n.
Probability of selecting the i-th unit in a sample p_i = 1/N, i = 1, 2, ..., N. | Probability of selecting the i-th unit in a sample p_i = 1/N, i = 1, 2, ..., N.

1.15 POPULATION MEAN/TOTAL

Let Y_i, i = 1, 2, ..., N, denote the value of the i-th unit in a population; then the
population mean is defined as
\bar{Y} = (1/N)(Y_1 + Y_2 + .... + Y_N) = (1/N) \sum_{i=1}^{N} Y_i    (1.15.1)
and the population total is defined as
Y = (Y_1 + Y_2 + .... + Y_N) = \sum_{i=1}^{N} Y_i = N \bar{Y}.    (1.15.2)
The units of measurement of the population mean are the same as those of the
actual data. For example, if the i-th unit, Y_i, for all i, is measured in dollars, then
the population mean, \bar{Y}, is also in dollars.

1.16 POPULATION MOMENTS

The r-th order central population moments are defined as
\mu_r = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^r, r = 2, 3, ....    (1.16.1)
If r = 2 then \mu_2 represents the second order population moment, given by
\mu_2 = S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2,    (1.16.2)
and is named the population mean square.
Note that the population variance is defined as
\sigma_y^2 = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \frac{N-1}{N} S_y^2.    (1.16.3)
If the data are in dollars, then the units of measurement of \sigma_y^2 will be dollars^2.

1.17 POPULATION STANDARD DEVIATION

The positive square root of the population variance is called the population standard
deviation, and it is denoted by \sigma_y. The units of measurement of \sigma_y are again
the same as those of the actual data. For instance, in the above example, the units of
measurement of \sigma_y will be dollars.

1.18 POPULATION COEFFICIENT OF VARIATI ON

The ratio of the standard deviation to the population mean is called the coefficient of
variation. It is denoted by C_y, that is,
C_y = \sigma_y / \bar{Y}.    (1.18.1)
Evidently C_y is a unit-free number. It is useful for comparing the variability of two
different populations having different units of measurement, e.g., $ and kg. It is
also called the relative standard error (RSE). Sometimes we also consider
C_y = S_y / \bar{Y}.
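The quantities of Sections 1.15 through 1.18 (and the relative mean square error of
the next section) translate directly into code; a minimal Python sketch for a small
illustrative population follows.

    y = [1, 2, 3, 4]                                  # illustrative population values
    N = len(y)
    Ybar = sum(y) / N                                 # population mean (1.15.1)
    S2 = sum((v - Ybar) ** 2 for v in y) / (N - 1)    # population mean square (1.16.2)
    sigma2 = (N - 1) / N * S2                         # population variance (1.16.3)
    Cy = sigma2 ** 0.5 / Ybar                         # coefficient of variation (1.18.1)
    RMSE = Cy ** 2                                    # relative mean square error (1.19.1)
    print(Ybar, S2, sigma2, round(Cy, 4))             # 2.5 1.666... 1.25 0.4472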

1.19 RELATIVE MEAN SQUARE ERROR

The relative mean square error is defined as the square of the coefficient of
variation C_y and is generally written as RMSE. Mathematically,
RMSE = C_y^2 = \sigma_y^2 / \bar{Y}^2.    (1.19.1)
Sometimes it is also denoted by \phi^2.

1.20 SAMPLE MEAN

Let y_i, i = 1, 2, ..., n, denote the value of the i-th unit selected in the sample; then
the sample mean is defined as
\bar{y} = (1/n) \sum_{i=1}^{n} y_i.    (1.20.1)

1.21 SAMPLE VARIANCE

The sample variance s_y^2 is defined as
s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2.    (1.21.1)

Remark 1.1. The population mean \bar{Y} and population variance \sigma_y^2, etc., are
unknown quantities (parameters) and can be denoted by the symbol \theta. The sample
mean \bar{y} and sample variance s_y^2, etc., are known after sampling; they are called
statistics and can be denoted by \hat{\theta}. Also note that the sample standard deviation
(or standard error) and the sample coefficient of variation can be defined as
s_y = \sqrt{s_y^2} and c_y = s_y / \bar{y}, respectively. Note that the standard error is a
statistic whereas the standard deviation is a parameter.

1.22 ESTIMATOR

A statistic \hat{\theta}_t obtained from the values in the sample s is also called an estimator
of the population parameter \theta. Note that the notations \hat{\theta}_t, \hat{\theta}, and \hat{\theta}_n have the same
meaning. For example, the notations \bar{y}_t, \bar{y}, and \bar{y}_n have the same meaning, and
s_y^2 and s_{y(t)}^2 have the same meaning. We choose according to our requirements
for a given topic or exercise.

1.23 ESTIMATE

Any numeric value obtained from the sample information is called an estimate of
the population parameter. It is also called a statistic.

1.24 SAMPLE SPACE

A sample space is the set of all possible values of a variable of interest. It is denoted
by \psi or S. For example, if we toss a pair of fair coins, each having two faces,
then the sample space consists of all 4 possible outcomes:
\psi = {HH, HT, TH, TT}.

A pictorial representation of such a sample space is given in Figure 1.24.1.

[Figure: experiment of tossing two coins; a tree diagram over the first and second
coin gives 2 x 2 = 4 outcomes HH, HT, TH, TT.]

Fig. 1.24.1 Sample space while tossing two coins.



1.25 UNIVARIATE RANDOM VARIABLE

A random variable is a real-valued function defined on the sample space \psi. It is
generally of two types:

( i ) Qualitative random variable; ( ii ) Quantitative random variable.

Let us discuss these random variables in more detail as follows:

1.25.1 QUALITATIVE RANDOM VARIABLES

Qualitative random variables assume values that are not necessarily numerical but
can be categorized. For example, Gender has two possible values: Male and
Female. These two can be arbitrarily coded numerically as Female = 0 and Male = 1.
Such coded variables are called Nominal variables. In another example, consider
Grades, which can take five possible values: A, B, C, D, and F. These five
categories can be arbitrarily coded numerically as: A = 4, B = 3, C = 2, D = 1, and
F = 0. Note that here the magnitude of the coding indicates the quality of the Grade:
a Grade coded 3 is better than a Grade coded 2. Such a coded variable is
called an Ordinal variable. Also note that in the case of the Nominal variable, the
coding Male = 1 and Female = 0 does not mean that males are superior to females.
Adding, subtracting, or averaging such qualitative variables has no meaning. Thus
qualitative variables are of two types: ( a ) Nominal variables; ( b ) Ordinal
variables. Pie charts or bar charts are generally used to present qualitative
variables.

1.25.2 QUANTITATIVE RANDOM VARIABLES

Quantitative random variables take numerical values for which adding,
subtracting, or averaging does have meaning. Examples of quantitative variables
are weight, height, number of students, etc. In general, two types of quantitative
random variables are available: ( a ) Discrete random variable; ( b ) Continuous
random variable.

1.25.2.1 DISCRETE RANDOM VARIABLE

If a random variable takes a countable number of values, it is called a discrete
random variable. In other words, a real-valued function defined on a discrete
sample space is called a discrete random variable. For example, the number of
students can be 0, 1, 2, etc.

1.25.2.2 CONTINUOUS RANDOM VARIABLE

A random variable is said to be continuous if it can take all possible values between
certain limits. For example, the height of a student can be 5.6 feet.

A pictorial representation differentiating between qualitative and quantitative
variables is given in Figure 1.25.1.

[Figure: a random variable is either Qualitative (Nominal, e.g., Gender, Religion;
Ordinal, e.g., Grade, Age groups) or Quantitative (Discrete, e.g., No. of grades,
No. of students; Continuous, e.g., Height, Age).]

Fig. 1.25.1 Forms of a random variable or data.

Note that Age itself is a quantitative variable whereas Age Groups is a qualitative
variable. Pie charts, bar charts, dot plots , line charts, stem and leaf plots, histograms
and box plots are generally used to present quantitative variables.

1.26 PROBABILITY MASS FUNCTION (p.m.f.) OF A UNIVARIATE
DISCRETE RANDOM VARIABLE

Let X be a discrete random variable taking at most a countably infinite number of
values x_1, x_2, ...., and with each possible outcome x_i we associate a number
p_i = p(X = x_i) = p(x_i), called the probability of x_i. Then p(x_i) is called the p.m.f.
of X if it satisfies the following two conditions:
( a ) p(x_i) ≥ 0 and ( b ) \sum_{i=1}^{\infty} p(x_i) = 1.    (1.26.1)

1.27 PROBABILITY DENSITY FUNCTION (p.d.f.) OF A UNIVARIATE
CONTINUOUS RANDOM VARIABLE

Let X be a continuous random variable on an interval x ∈ [a, b], where
-∞ < a ≤ x ≤ b < +∞. Then a function f(x) is said to be a probability density
function (p.d.f.) if it satisfies the following two conditions:
( a ) f(x) ≥ 0 and ( b ) \int_a^b f(x) dx = 1.    (1.27.1)

1.28 EXPECTED VALUE AND VARIANCE OF A UNIVARIATE
RANDOM VARIABLE

If a discrete random variable X takes all possible values x_i with probability mass
function p(x_i) in the sample space \psi, then its expected value is
E(X) = \sum_{\psi} x_i p(x_i)    (1.28.1)
and the variance of the random variable X is given by
V(X) = \sum_{\psi} (x_i - E(X))^2 p(x_i)    (1.28.2)
or, equivalently,
V(X) = \sum_{\psi} x_i^2 p(x_i) - {E(X)}^2.    (1.28.3)
Sometimes (1.28.2) is called the formula by definition, and that in (1.28.3) is called
the computing formula.

If X is a continuous random variable with x ∈ [a, b] and probability density
function f(x), then its expected value is
E(X) = \int_a^b x f(x) dx    (1.28.4)
and the variance of the random variable X is given by
V(X) = \int_a^b (x - E(X))^2 f(x) dx    (1.28.5)
or, equivalently,
V(X) = \int_a^b x^2 f(x) dx - {E(X)}^2.    (1.28.6)

1.29 DISTRIBUTION FUNCTION OF A UNIVARIATE RANDOM
VARIABLE

Let X be a random variable; then the function F(x) = p(X ≤ x) is called the
distribution function of the random variable, and it has the following properties:
( i ) If a ≤ X ≤ b then p(a ≤ X ≤ b) = F(b) - F(a);
( ii ) If a ≤ b then F(a) ≤ F(b);
( iii ) 0 ≤ F(x) ≤ 1;
( iv ) The distribution of F(x) is uniform between 0 and 1.
Again, it can be of two types:

1.29.1 DISCRETE DISTRIBUTION FUNCTION

In this case there are a countable number of points x_1, x_2, ... along with associated
probabilities p(x_1), p(x_2), ...., where p(x_i) ≥ 0 and \sum_{i=1}^{\infty} p(x_i) = 1, such that
F(x) = p(X ≤ x) = \sum_{i: x_i ≤ x} p(x_i).
For example, if x_i takes the integral values x_i = {1, 2, 3, 4, 5} with probabilities
p(x_i) = 0.2, then the function F(x) is a step function, as shown in Figure 1.29.1.

[Figure: step plot of the discrete distribution function F(x), rising by 0.2 at each of
x = 1, 2, 3, 4, 5.]

Fig. 1.29.1 Discrete distribution function.

1.29.2 CONTINUOUS DISTRIBUTION FUNCTION

If X is a continuous random variable with probability density function (p.d.f.)
f(x), then the function
F(x) = p(X ≤ x) = \int_{-∞}^{x} f(t) dt, -∞ < x < +∞,    (1.29.1)
is called the distribution function, or sometimes the cumulative distribution function
(c.d.f.), of the random variable X. The relationship between F(x) and f(x) is given
by f(x) = dF(x)/dx. The c.d.f. F(x) is a non-decreasing function of x and is
continuous on the right. Also note that F(-∞) = 0, F(+∞) = 1, 0 ≤ F(x) ≤ 1, and
p(a ≤ x ≤ b) = \int_a^b f(x) dx = F(b) - F(a). For example, if x is a continuous random
variable with probability density function (p.d.f.)
f(x) = 1 if 0 < x < 1, and 0 otherwise,    (1.29.2)
then its cumulative distribution function (c.d.f.) is given by
F(x) = 0 if x < 0; F(x) = x if 0 ≤ x ≤ 1; F(x) = 1 if x > 1,    (1.29.3)
and its graphical representation is given in Figure 1.29.2.

[Figure: plot of the continuous distribution function F(x), equal to 0 for x < 0, to x
on [0, 1], and to 1 for x > 1.]

Fig. 1.29.2 Continuous distribution function.

1.30 SELECTION OF A SAMPLE USING A KNOWN UNIVARIATE
DISTRIBUTION FUNCTION

There are two cases:

1.30.1 DISCRETE RANDOM VARIABLE

Let x_i be a discrete random variable with probability mass function p(x_i) and
distribution function F(x) = p[X ≤ x] = \sum_{i: x_i ≤ x} p(x_i). Let 0 ≤ F(x) ≤ 1 be any random
number drawn from the Pseudo-Random Number (PRN) Table 1 given in the
Appendix.

Then we can write F(x) as

F(x) = \sum_{i: x_{i-1} < x} p(x_{i-1}) + p(x)  (say).    (1.30.1)
Then the integral value of the random variable x selected in the sample is given by
x = p^{-1}[ F(x) - \sum_{i: x_{i-1} < x} p(x_{i-1}) ],    (1.30.2)
where p^{-1} denotes the inverse function.

Example 1.30.1. A discrete random variable X has the following probability mass
function:

Select a random sample of three units using the method of random numbers .

Solution: The cumulative distribution function of the random variable X is
obtained by cumulating the given p.m.f.

We use the first six columns of the Pseudo-Random Number (PRN) Table 1 given
in the Appendix, multiplied by 10^{-6}, as the randomly selected values of F(x). Then
the integral value of the random variable x selected in the sample is obtained using
the inverse relationship x = p^{-1}[ F(x) - \sum_{i: x_{i-1} < x} p(x_{i-1}) ] as follows:

Value of F(x) using PRN Table 1 | Value of x observed in the sample
0.992954 | 6
0.588183 | 3
0.601448 | 3

In the case of with replacement sampling, the value x = 3 has been selected twice;
for WOR sampling we would have to continue the process until three distinct
values of x are selected.
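The inverse-c.d.f. step above is easy to code. In the following minimal Python
sketch the p.m.f. is hypothetical (the p.m.f. table of Example 1.30.1 did not survive
reproduction here), chosen so that the three PRN values of the solution again yield
x = 6, 3, 3.

    from bisect import bisect_left
    from itertools import accumulate

    x_vals = [1, 2, 3, 4, 5, 6]
    pmf = [0.10, 0.20, 0.35, 0.10, 0.15, 0.10]    # assumed p.m.f.; sums to 1
    cdf = list(accumulate(pmf))                   # [0.10, 0.30, 0.65, 0.75, 0.90, 1.00]

    def draw(u):
        # Return the smallest x with F(x) >= u for a random number 0 <= u <= 1.
        return x_vals[bisect_left(cdf, u)]

    for u in (0.992954, 0.588183, 0.601448):      # PRN values used above
        print(u, draw(u))                         # -> 6, 3, 3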

Example 1.30.2. Suppose x follows a binomial distribution with parameters N and
p, that is, x ~ B(N, p), with, say, N = 10 and p = 0.4. Select an SRSWR sample of
n = 4 units by using the random number method.

Solution. Since x follows a binomial distribution, p(X = x) = ^N C_x p^x (1-p)^{N-x}
for x = 0, 1, 2, ..., N. Here N = 10 and p = 0.4, so the cumulative distribution
function is F(x) = p[X ≤ x] = \sum_{t ≤ x} ^N C_t p^t (1-p)^{N-t}.

We use three columns, the 7th to 9th, of the Pseudo-Random Number (PRN) Table
1 given in the Appendix, multiplied by 10^{-3}, as the randomly selected values of
F(x). Then the integral value of the random variable x selected in the sample is
obtained using the inverse relationship x = p^{-1}[ F(x) - \sum_{i: x_{i-1} < x} p(x_{i-1}) ] as follows:

Value of F(x) using PRN Table 1 | Value of x observed in the sample
0.622 | 3
0.771 | 4
0.917 | 5
0.675 | 4

1.30.2 CONTINUOUS RANDOM VARIABLE

Let x be a continuous random variable with probability density function f(x) and
distribution function F(x) = p[X ≤ x] = \int_{-∞}^{x} f(t) dt. Let 0 ≤ F(x) ≤ 1 be any random
number drawn from the Pseudo-Random Number (PRN) Table 1 given in the
Appendix.

Then we can write F(x) as
F(x) = \int_{-∞}^{x_0} f(t) dt + f(x)  (say),    (1.30.3)
where x_0 < x but very close to x.

Then the value of the random variable x selected in the sample is given by
x = f^{-1}( F(x) - \int_{-∞}^{x_0} f(t) dt ),    (1.30.4)
where f^{-1} denotes the inverse function.

Example 1.30.3. A continuous random variable X has the cumulative distribution
function
F(x) = 0 if x < 1; F(x) = (1/16)(x-1)^4 if 1 ≤ x ≤ 3; F(x) = 1 if x > 3.    (1.30.5)
Select a sample of n = 10 units by using SRSWR sampling.

Solution. We are given F(x) = (1/16)(x-1)^4, which implies that x = 2[F(x)]^{1/4} + 1.
By using the first three columns of the Pseudo-Random Numbers (PRN) Table 1
given in the Appendix, multiplied by 10^{-3}, we obtain the observed values of F(x)
and the sampled values of x as:

F(x)        x

0.992 2.995988
0.588 2.751356
0.601 2.760956
0.549 2.721563
0.925 2.961397
0.014 1.687958
0.697 2.827419
0.872 2.932676
0.626 2.778990
0.236 2.393985
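The inversion can be checked with a few lines of Python; the first three PRN values
reproduce the first three sampled x values above.

    def draw(u):
        # Invert F(x) = (x - 1)**4 / 16 on [1, 3].
        return 2 * u ** 0.25 + 1

    for u in (0.992, 0.588, 0.601):
        print(round(draw(u), 6))    # 2.995988, 2.751356, 2.760956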

Example 1.30.4. A continuous random variable x has density function
f(x) = 1/[π(1 + x^2)] if -∞ < x < +∞, and 0 otherwise.    (1.30.6)
Select a sample of n = 5 units by using with replacement sampling.

Solution. The cumulative distribution function (c.d.f.) of x is given by
F(x) = P(X ≤ x) = \int_{-∞}^{x} f(t) dt = (1/π) \int_{-∞}^{x} 1/(1 + t^2) dt = (1/π)[tan^{-1}(t)]_{-∞}^{x} = (1/π)[tan^{-1}(x) + π/2],
which implies that
x = tan[π F(x) - π/2].    (1.30.7)

Using three columns, say the 7th to 9th, of the Pseudo-Random Numbers (PRN)
Table 1 given in the Appendix, multiplied by 10^{-3}, the first five observed values of
F(x) are 0.622, 0.771, 0.917, 0.675, and 0.534. Thus the five sampled values from
the above distribution are:

F(x)        x
0.622   0.403214
0.771   1.141487
0.917   3.747745
0.675   0.612801
0.534   0.107222

Note that we have used the tan function in radians and π = 4 tan^{-1}(1).

Example 1.30.5. A continuous random variable X has density function
f(x) = 1/5 if 5 < x < 10, and 0 otherwise.    (1.30.8)
Select a sample of n = 5 units by using with replacement sampling.

Solution. The distribution of x is uniform between 5 and 10, so its probability
distribution function is
F(x) = p[X ≤ x] = \int_5^x f(t) dt = (1/5)(x - 5),    (1.30.9)
which implies that
x = 5[F(x) + 1].    (1.30.10)

Using three columns, say the 7th to 9th, of the Pseudo-Random Number Table 1
given in the Appendix, multiplied by 10^{-3}, the first five observed values of F(x)
are 0.622, 0.771, 0.917, 0.675, and 0.534. Thus the five sampled values from the
above distribution are given by:

F(x)       x
0.622   8.110
0.771   8.855
0.917   9.585
0.675   8.375
0.534   7.670

1.31 DISCRETE BIVARIATE RANDOM VARIABLE

If X and Y are discrete random variables, the probability that X will take the
value x and Y will take the value y, written p(X = x, Y = y) = p(x, y), is called the
joint probability distribution function of the bivariate random variable.

1.32 JOINT PROBABILITY DISTRIBUTION FUNCTION OF BIVARIATE
DISCRETE RANDOM VARIABLES

A bivariate function can serve as the joint probability distribution of a pair of
discrete random variables X and Y if and only if its values p(x, y) satisfy the
conditions:
( a ) p(x, y) ≥ 0 for each pair of values (x, y) within its domain;
( b ) \sum_x \sum_y p(x, y) = 1, where the sum extends over all possible pairs (x, y).

1.33 JOINT CUMULATIVE DISTRIBUTION FUNCTION OF BIVARIATE
DISCRETE RANDOM VARIABLES

If X and Y are discrete random variables, the function given by
F(x, y) = p(X ≤ x, Y ≤ y) = \sum_{s ≤ x} \sum_{t ≤ y} p(s, t),    (1.33.1)
for -∞ < x < +∞, -∞ < y < +∞, where p(s, t) is the value of the joint probability
distribution of X and Y at the point (s, t), is called the joint distribution function,
or the joint cumulative distribution, of X and Y.

1.34 MARGINAL DISTRIBUTIONS OF A BIVARIATE DISCRETE
RANDOM VARIABLE

If X and Y are discrete random variables and p(x, y) is the value of the joint
probability distribution at (x, y), the function given by
p_x(x) = \sum_y p(x, y)    (1.34.1)
for each x within the range of X is called the marginal distribution of X, and the
function
p_y(y) = \sum_x p(x, y)    (1.34.2)
for each y within the range of Y is called the marginal distribution of Y.

1.35 SELECTION OF A SAMPLE USING A KNOWN DISCRETE
BIVARIATE DISTRIBUTION FUNCTION

Let p(x, y) denote the joint probability mass function (p.m.f.) of two random
variables x and y. Also, let F(x, y) denote the cumulative mass function (c.m.f.)
of x and y. It is well known that the distribution of the marginal distribution
function (m.d.f.) p_y(y) for any joint probability density function of x and y is
rectangular (or uniform) in the range [0, 1]. Random numbers in the random
number table also follow the same distribution. Then, to find the value of y, one
solves equation (1.35.1) below:
p_y(y) = \sum_x \sum_{t=0}^{y} p(x, t) = R_2.    (1.35.1)
The known form of the joint mass function p(x, y) [one can choose any suitable
form for p(x, y)] can be substituted in (1.35.1). The value y* of y so obtained is
used to find the value x* of x. For this we use the conditional mass function of x
given y = y*, since the distribution of the conditional mass function will also be
uniform in [0, 1]. Thus another random number R_1 is drawn and the value x* of x
is determined from the equation
\sum_{t ≤ x} p(t | y = y*) = R_1,    (1.35.2)
where p(t | y = y*) is the conditional mass function of x given y = y*. The values
x* and y* of x and y, respectively, so obtained will follow the joint probability
mass function p(x, y). Equations (1.35.1) and (1.35.2) may be solved either
through usual methods or through iteration procedures.

1.36 CONTINUOUS BIVARIATE RANDOM VARIABLE

A bivariate function with values f(x, y), defined over the two-dimensional plane, is
called a joint probability density function of the continuous random variables X
and Y if and only if
P[(X, Y) ∈ S] = \int\int_S f(x, y) dx dy.    (1.36.1)

1.37 JOINT PROBABILITY DISTRIBUTION FUNCTION OF BIVARIATE
CONTINUOUS RANDOM VARIABLES

A bivariate function can serve as the joint probability distribution of a pair of
continuous random variables X and Y if and only if its values f(x, y) satisfy
the conditions:
( a ) f(x, y) ≥ 0 for each pair of values (x, y) within its domain;    (1.37.1)
( b ) \int_{-∞}^{+∞} \int_{-∞}^{+∞} f(x, y) dx dy = 1.    (1.37.2)

1.38 JOINT CUMULATIVE DISTRIBUTION FUNCTION OF A
BIVARIATE CONTINUOUS RANDOM VARIABLE

If X and Y are continuous random variables, the function given by
F(x, y) = p(X ≤ x, Y ≤ y) = \int_{-∞}^{y} \int_{-∞}^{x} f(s, t) ds dt,    (1.38.1)
for -∞ < x < +∞, -∞ < y < +∞, where f(s, t) is the value of the joint probability
distribution of X and Y at the point (s, t), is called the joint distribution function
or the joint cumulative distribution of X and Y.

1.39 MARGINAL CUMULATIVE DISTRIBUTIONS OF A BIVARIATE
CONTINUOUS RANDOM VARIABLE

If X and Y are continuous random variables and f(x, y) is the value of the joint
probability density function, then the cumulative marginal probability distribution
function of y is given by
F_y(y) = \int_{-∞}^{y} \int_{-∞}^{+∞} f(x, y) dx dy    (1.39.1)
for -∞ < y < +∞, and the cumulative marginal probability distribution function of x
is given by
F_x(x) = \int_{-∞}^{x} \int_{-∞}^{+∞} f(x, y) dy dx    (1.39.2)
for -∞ < x < +∞.

1.40 SELECTION OF A SAMPLE USING A KNOWN BIVARIATE
CONTINUOUS DISTRIBUTION FUNCTION

In general, let f(x, y) denote the joint probability density function (p.d.f.) of two
continuous random variables x and y. Also let F(x, y) denote the cumulative
density function (c.d.f.) of x and y. It is well known that the distribution of the
marginal distribution function (m.d.f.) F_y(y) for any joint probability density
function of x and y is rectangular (or uniform) in the range [0, 1]. Random
numbers in the random number table also follow the same distribution. To find
the value of y, one solves equation (1.40.1) below:
F_y(y) = \int_0^{y} \int_{-∞}^{+∞} f(x, y) dx dy = R_2.    (1.40.1)
The known form of the joint density function f(x, y) [one can choose any suitable
form for f(x, y)] can be substituted in (1.40.1). The value y* of y so obtained is
used to find the value x* of x. For this we use the conditional density of x given
y = y*, since the distribution of the conditional distribution function will also be
uniform in [0, 1]. Thus another random number R_1 is drawn and the value x* of x
is determined from the equation
\int_0^{x} f_1(x | y = y*) dx = R_1,    (1.40.2)
where f_1(x | y = y*) is the conditional density function of x given y = y*. The
values x* and y* of x and y, respectively, so obtained will follow the joint
probability density function f(x, y). Equations (1.40.1) and (1.40.2) may be
solved either through usual methods or through iteration procedures.

Example 1.40.1. If the joint probability density function of two continuous random
variables x and y is given by
f(x, y) = (2/3)(x + 2y) for 0 < x < 1, 0 < y < 1, and 0 otherwise,
select six pairs of observations (x, y) by using the random number table method.

Solution. We have
F_y(y) = \int_0^{y} \int_{-∞}^{+∞} f(x, t) dx dt = \int_0^{y} \int_0^{1} (2/3)(x + 2t) dx dt = (y + 2y^2)/3.
Let F_y(y) = R_2 be a Pseudo-Random Number (PRN) selected from Table 1 (say,
by using the first three columns); if R_2 = 0.992, then solving the quadratic equation
2y^2 + y - 3R_2 = 0 gives the single real root 0 < y < 1 as
y = [-1 + \sqrt{1 + 24 R_2}]/4 = [-1 + \sqrt{1 + 24 × 0.992}]/4 = 0.995.

Now, given y = y* = 0.995, we have
f(x | y = y*) = f(x | y = 0.995) = (2/3)(1.99 + x).
Let 0 < R_1 < 1 be any other random number, say obtained by using the 7th to 9th
columns of the Pseudo-Random Numbers given in Table 1 of the Appendix; then
the value of x is given by solving the integral
\int_0^{x} f(t | y = y*) dt = R_1, or \int_0^{x} (2/3)(1.99 + t) dt = 0.622,
or, equivalently, solving the quadratic equation x^2 + 3.98x - 3R_1 = 0, which implies
x = [-3.98 ± \sqrt{3.98^2 + 12 R_1}]/2,
and the real root in the range 0 to 1 of x is given by
x = [-3.98 + \sqrt{3.98^2 + 12 R_1}]/2 = [-3.98 + \sqrt{3.98^2 + 12 × 0.622}]/2 = 0.423.
On repeating the above process we obtain a sample of n = 6 observations as given
below:

R_2      y      R_1      x
0.992   0.995   0.622   0.423
0.588   0.722   0.771   0.514
0.601   0.732   0.917   0.600
0.549   0.691   0.675   0.456
0.925   0.954   0.534   0.368
0.014   0.039   0.513   0.355
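The two quadratic steps of this example can be sketched in a few lines of Python;
the function below reproduces the first pair of the table. (It recomputes the
conditional-step coefficient 4y* for each drawn y*, so values for the later rows may
differ slightly from those tabulated above, which retain the coefficient 3.98
throughout.)

    import math

    def draw_pair(r2, r1):
        # Two-stage selection for f(x, y) = (2/3)(x + 2y) on the unit square.
        y = (-1 + math.sqrt(1 + 24 * r2)) / 4               # root of 2y^2 + y - 3 r2 = 0
        x = (-4 * y + math.sqrt(16 * y * y + 12 * r1)) / 2  # root of x^2 + 4y x - 3 r1 = 0
        return x, y

    x, y = draw_pair(0.992, 0.622)
    print(round(x, 3), round(y, 3))    # 0.423 0.995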

1.41 PROPERTIES OF A BEST ESTIMATOR

An estimator \hat{\theta}_t of a population parameter \theta is said to be best if it has the
following properties:

( a ) Unbiasedness; ( b ) Consistency; ( c ) Sufficiency; and ( d ) Efficiency.

Let us now briefly explain the meaning of these terms.

1.41.1 UNBIASEDNESS

An estimator \hat{\theta}_t is an unbiased estimator of a population parameter \theta if
E(\hat{\theta}_t) = \sum_{t=1}^{s(n)} p_t \hat{\theta}_t = \theta,    (1.41.1)
where p_t denotes the probability of selecting the t-th sample from the population
\Omega, and \sum_{t=1}^{s(n)} p_t = 1. Note that the total number of possible samples is
s(n) = N^n in the case of SRSWR sampling and s(n) = ^N C_n in the case of
SRSWOR sampling.

For example:
( i ) The sample mean \bar{y}_t is an unbiased estimator of the population mean \bar{Y}
under both SRSWR and SRSWOR sampling:
E(\bar{y}_t) = \sum_{t=1}^{s(n)} p_t \bar{y}_t = \bar{Y}.    (1.41.2)

( ii ) The sample variance s_y^2 is an unbiased estimator of the population mean
squared error S_y^2 under SRSWOR sampling:
E(s_y^2) = \sum_{t=1}^{^N C_n} p_t (s_y^2)_t = S_y^2,    (1.41.3)
and s_y^2 is also an unbiased estimator of the population variance \sigma_y^2 under
SRSWR sampling:
E(s_y^2) = \sum_{t=1}^{N^n} p_t (s_y^2)_t = \sigma_y^2.    (1.41.4)

Example 1.41.1. Suppose a population consists of N = 4 units. The variable y
takes the values 1, 2, 3, 4, and the units are distinguished as A, B, C, and D,
respectively. Then the population mean is given by
\bar{Y} = (1/N) \sum_{i=1}^{N} Y_i = (1/4)(1 + 2 + 3 + 4) = 2.5,
the population mean squared error is given by
S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \frac{1}{4-1}[(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2] = 5/3,
and the population variance is given by
\sigma_y^2 = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \frac{1}{4}[(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2] = 5/4 = 1.25.

Show that the sample mean \bar{y}_t is an unbiased estimator of the population mean \bar{Y}
under both SRSWR and SRSWOR sampling, and that the sample variance s_y^2 is an
unbiased estimator of the population mean squared error S_y^2 under SRSWOR
sampling and of the population variance \sigma_y^2 under SRSWR sampling.

Solution. There are two cases:

Case I. Suppose we draw all possible samples of size n = 2 by using SRSWOR
sampling.

The total number of all possible samples is s(n) = ^N C_n = ^4 C_2 = 6 and p_t = 1/6.
Now we have the following table:

Sample No. t | Sampled units | Sample mean \bar{y}_t | Sample variance s_{y(t)}^2 | Probability p_t
1 | (A, B) or (1, 2) | (1+2)/2 = 1.5 | 0.5 | 1/6
2 | (A, C) or (1, 3) | (1+3)/2 = 2.0 | 2.0 | 1/6
3 | (A, D) or (1, 4) | (1+4)/2 = 2.5 | 4.5 | 1/6
4 | (B, C) or (2, 3) | (2+3)/2 = 2.5 | 0.5 | 1/6
5 | (B, D) or (2, 4) | (2+4)/2 = 3.0 | 2.0 | 1/6
6 | (C, D) or (3, 4) | (3+4)/2 = 3.5 | 0.5 | 1/6

Thus the expected value of the sample mean \bar{y}_t is given by
E(\bar{y}_t) = \frac{1}{^N C_n} \sum_{t=1}^{^N C_n} \bar{y}_t = \frac{1}{6}(1.5 + 2 + 2.5 + 2.5 + 3 + 3.5) = 2.5 = \bar{Y},
and that of the sample variance s_y^2 is given by
E[s_{y(t)}^2] = \frac{1}{^N C_n} \sum_{t=1}^{^N C_n} s_{y(t)}^2 = \frac{1}{6}(0.5 + 2.0 + 4.5 + 0.5 + 2.0 + 0.5) = \frac{10}{6} = \frac{5}{3} = S_y^2.

Distribution of sample means and sample variances using WOR sampling:

Sample mean | Frequency
1.5 | 1
2.0 | 1
2.5 | 2
3.0 | 1
3.5 | 1

Sample variance | Frequency
0.5 | 3
2.0 | 2
4.5 | 1

The above tables show that the distribution of sample means is symmetric and that
of sample variances is skewed to the right in the case of without replacement
sampling.

Case II. Suppose we draw all possible samples of size n = 2 by using SRSWR
sampling.

The total number of all possible samples is s(n) = N^n = 4^2 = 16 and p_t = 1/16 for
all t = 1, 2, ..., 16.

Now we have the following table:

Sample No. t | Sampled units | Sample mean \bar{y}_t | Sample variance s_{y(t)}^2 | Probability p_t
1 | (A, A) or (1, 1) | 1.0 | 0.0 | 1/16
2 | (A, B) or (1, 2) | 1.5 | 0.5 | 1/16
3 | (A, C) or (1, 3) | 2.0 | 2.0 | 1/16
4 | (A, D) or (1, 4) | 2.5 | 4.5 | 1/16
5 | (B, A) or (2, 1) | 1.5 | 0.5 | 1/16
6 | (B, B) or (2, 2) | 2.0 | 0.0 | 1/16
7 | (B, C) or (2, 3) | 2.5 | 0.5 | 1/16
8 | (B, D) or (2, 4) | 3.0 | 2.0 | 1/16
9 | (C, A) or (3, 1) | 2.0 | 2.0 | 1/16
10 | (C, B) or (3, 2) | 2.5 | 0.5 | 1/16
11 | (C, C) or (3, 3) | 3.0 | 0.0 | 1/16
12 | (C, D) or (3, 4) | 3.5 | 0.5 | 1/16
13 | (D, A) or (4, 1) | 2.5 | 4.5 | 1/16
14 | (D, B) or (4, 2) | 3.0 | 2.0 | 1/16
15 | (D, C) or (4, 3) | 3.5 | 0.5 | 1/16
16 | (D, D) or (4, 4) | 4.0 | 0.0 | 1/16

Distribution of sample means and sample variances using SRSWR sampling:

Again, for SRSWR sampling the sample mean has a symmetric distribution and the
distribution of the sample variance is skewed to the right.

Sample mean | Frequency
1.0 | 1
1.5 | 2
2.0 | 3
2.5 | 4
3.0 | 3
3.5 | 2
4.0 | 1

Sample variance | Frequency
0.0 | 4
0.5 | 6
2.0 | 4
4.5 | 2

Thus the expected value of the sample mean \bar{y}_t is given by
E(\bar{y}_t) = \frac{1}{N^n} \sum_{t=1}^{N^n} \bar{y}_t = \frac{1}{16}(1 + 1.5 + .... + 4) = \frac{40}{16} = 2.5 = \bar{Y},
and that of the sample variance s_y^2 is given by
E[s_{y(t)}^2] = \frac{1}{N^n} \sum_{t=1}^{N^n} s_{y(t)}^2 = \frac{1}{16}(0 + 0.5 + ... + 2 + 0.5) = \frac{20}{16} = 1.25 = \sigma_y^2.
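Both unbiasedness results can be checked by brute-force enumeration of all
possible samples, as in this short Python sketch:

    from itertools import combinations, product
    from statistics import mean, variance

    pop = [1, 2, 3, 4]
    wor = list(combinations(pop, 2))          # the 6 SRSWOR samples
    wr = list(product(pop, repeat=2))         # the 16 SRSWR samples

    print(mean(mean(s) for s in wor))         # 2.5  = population mean
    print(mean(variance(s) for s in wor))     # 5/3  = S_y^2
    print(mean(mean(s) for s in wr))          # 2.5  = population mean
    print(mean(variance(s) for s in wr))      # 1.25 = sigma_y^2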
Then we have the following new term .

1.41.1.1 BIAS

The bias is the difference between the expected value of a statistic \hat{\theta}_t and the
actual value of the parameter \theta, that is,
B(\hat{\theta}_t) = E(\hat{\theta}_t) - \theta.    (1.41.5)
Thus an estimator \hat{\theta}_t is unbiased if E(\hat{\theta}_t) = \theta, which is obvious on setting
B(\hat{\theta}_t) = 0.
1.41.2 CONSISTENCY

There are several definitions of the consistency of a statistic, but we will use the
simplest. An estimator \hat{\theta}_t of the population parameter \theta is said to be consistent if
\lim_{n \to \infty} \hat{\theta}_t = \theta.    (1.41.6)
For example:
( i ) The sample mean \bar{y}_t (or simply \bar{y}) is a consistent estimator of the finite
population mean \bar{Y}.
( ii ) The sample mean squared error s_y^2 is a consistent estimator of the population
mean squared error S_y^2.

Remark 1.2. An unbiased estimator need not necessarily be consistent; e.g., the
sample mean based on a sample of size one is unbiased but not consistent.

1.41.3 SUFFICIENCY

An estimator \hat{\theta}_t is said to be sufficient for a parameter \theta if the distribution of a
sample y_1, y_2, ..., y_n given \hat{\theta}_t does not depend on \theta. The distribution of \hat{\theta}_t then
contains all the information in the sample relevant to the estimation of \theta, and
knowledge of \hat{\theta}_t and its sampling distribution is 'sufficient' to give that
information. In general, a set of estimators or statistics \hat{\theta}_1, \hat{\theta}_2, ....., \hat{\theta}_k are 'jointly
sufficient' for parameters \theta_1, \theta_2, ....., \theta_k if the distribution of sample values given
\hat{\theta}_1, \hat{\theta}_2, ....., \hat{\theta}_k does not depend on these \theta_1, \theta_2, ....., \theta_k.

1.41.4 EFFICIENCY

Before defining the term efficiency, we shall discuss two more terms, viz., variance
and mean square error of the estimator.

1.41.4.1 VARIANCE

The variance of an estimator \hat{\theta} of a population parameter \theta is defined as
V(\hat{\theta}) = E[\hat{\theta} - E(\hat{\theta})]^2.    (1.41.7)
It is generally denoted by the symbol \sigma_{\hat{\theta}}^2.

1.41.4.2 MEAN SQUARE ERROR

The mean square error (MSE) of an estimator \hat{\theta}_t of a parameter \theta is defined as
MSE(\hat{\theta}_t) = E[\hat{\theta}_t - \theta]^2 = V(\hat{\theta}_t) + {B(\hat{\theta}_t)}^2,    (1.41.8)
where B(\hat{\theta}_t) denotes the bias in the estimator \hat{\theta}_t of \theta.

Evidently, if B(\hat{\theta}_t) = 0 then
MSE(\hat{\theta}_t) = V(\hat{\theta}_t).

Thus if \hat{\theta}_1 and \hat{\theta}_2 are two different estimators of the parameter \theta, then the
estimator \hat{\theta}_1 is said to be more efficient than the estimator \hat{\theta}_2 if and only if
MSE(\hat{\theta}_1) < MSE(\hat{\theta}_2).
1.42 RELATIVE EFFICIENCY

In general, the relative efficiency of an estimator \hat{\theta}_1 with respect to another
estimator \hat{\theta}_2 is expressed as a percentage and is defined as
RE = MSE(\hat{\theta}_2) × 100 / MSE(\hat{\theta}_1).    (1.42.1)

1.43 RELATIVE BIAS

The ratio of the absolute value of the bias in an estimator to the square root of the
mean square error of the estimator is called the relative bias. It is defined as
RB = |B(\hat{\theta}_t)| / \sqrt{MSE(\hat{\theta}_t)},    (1.43.1)
where B(\hat{\theta}_t) = E(\hat{\theta}_t) - \theta. The relative bias is independent of the units of
measurement of the original data.

1.44 VARIANCE ESTIMATION THROUGH SPLITTING

If \hat{\theta}_1, \hat{\theta}_2, ...., \hat{\theta}_n are independently distributed random variables with
E(\hat{\theta}_j) = \theta for all j, and \hat{\theta} = (1/n) \sum_{j=1}^{n} \hat{\theta}_j, then
\hat{v}(\hat{\theta}) = \frac{1}{n(n-1)} \sum_{j=1}^{n} (\hat{\theta}_j - \hat{\theta})^2    (1.44.1)
is an unbiased estimator of V(\hat{\theta}). If \hat{\theta}_j = \hat{\theta}_{(j)} is the j-th estimator of \theta obtained by
dropping the j-th unit from the sample of size n, then such a method of variance
estimation is also called the Jackknife method of variance estimation, and the
estimator of variance takes the form
\hat{v}_{Jack}(\hat{\theta}) = \frac{n-1}{n} \sum_{j=1}^{n} (\hat{\theta}_{(j)} - \bar{\hat{\theta}})^2,    (1.44.2)
where \bar{\hat{\theta}} = (1/n) \sum_{j=1}^{n} \hat{\theta}_{(j)}.
. _ I"
For example, if \hat{\theta} = \bar{y} = (1/n) \sum_{i=1}^{n} y_i is an estimator of the population mean \bar{Y}
under SRSWR sampling, then \hat{\theta}_{(j)} = \bar{y}_{(j)} = \frac{1}{n-1} \sum_{i \ne j} y_i denotes the estimator of
the population mean \bar{Y} obtained by dropping the j-th unit from the sample.
Clearly, we can write
\bar{y}_{(j)} = \frac{1}{n-1} \sum_{i \ne j} y_i = \frac{1}{n-1}[ \sum_{i=1}^{n} y_i - y_j ] = \frac{1}{n-1}[ n\bar{y} - \bar{y} + \bar{y} - y_j ] = \bar{y} - \frac{1}{n-1}(y_j - \bar{y}).
Also
\bar{\hat{\theta}} = \frac{1}{n} \sum_{j=1}^{n} \bar{y}_{(j)} = \frac{1}{n(n-1)} \sum_{j=1}^{n} [ \sum_{i=1}^{n} y_i - y_j ] = \frac{1}{n(n-1)} \sum_{j=1}^{n} [ n\bar{y} - y_j ] = \frac{1}{n(n-1)}[ n^2 \bar{y} - n\bar{y} ] = \bar{y}.
Thus the Jackknife estimator of the variance of \hat{\theta} = \bar{y} is given by
\hat{v}_{Jack}(\bar{y})_{srswr} = \frac{n-1}{n} \sum_{j=1}^{n} (\bar{y}_{(j)} - \bar{y})^2 = \frac{n-1}{n} \sum_{j=1}^{n} [ \frac{1}{n-1}(y_j - \bar{y}) ]^2 = \frac{1}{n(n-1)} \sum_{j=1}^{n} (y_j - \bar{y})^2 = \frac{s_y^2}{n},
where s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2.

Thus \hat{v}_{Jack}(\bar{y})_{srswr} is an unbiased estimator of V(\bar{y}) under SRSWR sampling. The
Jackknife technique provides a good estimate of variance under SRSWR sampling,
but for other sampling schemes we need to adjust it to obtain an unbiased estimator
of variance. For example, under SRSWOR sampling, an unbiased estimator of V(\bar{y})
will be
\hat{v}_{Jack}(\bar{y})_{srswor} = \frac{(1-f)(n-1)}{n} \sum_{j=1}^{n} (\bar{y}_{(j)} - \bar{y})^2,    (1.44.3)
where f = n/N.
Note that it is not always possible to adjust the Jackknife estimator of variance to
make it unbiased for other sampling schemes available in the literature.
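A minimal Python sketch of the Jackknife estimator (1.44.2) for the sample mean,
with the SRSWOR adjustment (1.44.3), follows:

    def jackknife_variance(y, srswor=False, N=None):
        # Jackknife variance estimate of the sample mean.
        n = len(y)
        ybar = sum(y) / n
        loo = [(sum(y) - yj) / (n - 1) for yj in y]          # leave-one-out means
        v = (n - 1) / n * sum((m - ybar) ** 2 for m in loo)
        if srswor:
            v *= 1 - n / N                                    # factor (1 - f), f = n/N
        return v

    print(jackknife_variance([1, 2, 3, 4]))                   # 5/12 = s_y^2 / n
    print(jackknife_variance([1, 2, 3, 4], srswor=True, N=8)) # (1 - 4/8) * 5/12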

1.45 LOSS FUNCTION

The risk or loss associated with an estimator \hat{\theta}_t of \theta, of order r, is defined as
R(\hat{\theta}_t) = E[\hat{\theta}_t - \theta]^r    (1.45.1)
for r = 2, 3, .... If r = 2 then R(\hat{\theta}_t) = MSE(\hat{\theta}_t), which is generally called the
quadratic loss function.

1.46 ADMISSIBLE ESTIMATOR

Let \Gamma be a class of estimators of a population parameter \theta. For a given loss
function, let R(\hat{\theta}_t) represent the risk or expected loss associated with the estimator
\hat{\theta}_t of \theta. Of two estimators \hat{\theta}_1 and \hat{\theta}_2 of the population parameter \theta, the
estimator \hat{\theta}_1 is said to be uniformly better than \hat{\theta}_2 if, for a given loss function,
the inequality
R(\hat{\theta}_1) ≤ R(\hat{\theta}_2)    (1.46.1)
holds for all possible values of the characteristic under study. An estimator \hat{\theta}_t
belonging to \Gamma is said to be admissible in \Gamma if there exists no other estimator in \Gamma
which is better than \hat{\theta}_t.

1.47 SAMPLE SURVEY

A sample survey is a survey which is carried out using sampling methods, i.e., in
which only a portion and not the whole population is surveyed.

1.48 SAMPLING DISTRIBUTION

A sampling distribution is a distribution of a statistic in all possible samples which


can be chosen according to a specified sampling scheme. The expression almost
always relates to a sampling scheme involving random selection, and most usually
concerns the distribution of a function of a fixed number n of independent
variables.

Example 1.48.1. Select all possible SRSWR samples each of two units from the
population consisting of four units 1,3,5 and 7.
( a ) Construct the sampling distribution of the sample means.
( b ) Construct the sampling distribution of the sample variances.
Solution. The list of the 16 samples of size 2 from the population, with the mean
and variance of each sample, is given in the following table:

Sample:    1,1  1,3  1,5  1,7  3,1  3,3  3,5  3,7  5,1  5,3  5,5  5,7  7,1  7,3  7,5  7,7
Mean:       1    2    3    4    2    3    4    5    3    4    5    6    4    5    6    7
Variance:   0    2    8   18    2    0    2    8    8    2    0    2   18    8    2    0

( a ) The relative frequency distribution of the sample means is:

Sample mean | Frequency | Relative frequency
1 | 1 | 0.0625
2 | 2 | 0.1250
3 | 3 | 0.1875
4 | 4 | 0.2500
5 | 3 | 0.1875
6 | 2 | 0.1250
7 | 1 | 0.0625
A histogram of the above sampling distribution of sample means is given below.

Sampling distribution of
the sample means

>. 0.3
Q) <J
~ a; 0.2
ra
8:1 go
::::I

0.1
.;: 0
2 3 4 5 6 7
Sample Means

Fig. 1.48.1 Distribution of the sample means.


Chapter I: Basic concepts and mat hematica l notation 33

( b ) The relative frequency distribution of the sample variances is given in the
following table:

Sample variance | Frequency | Relative frequency
0 | 4 | 0.250
2 | 6 | 0.375
8 | 4 | 0.250
18 | 2 | 0.125

A relative histogram of the above sampling distribution of sample variances is
given below.

[Figure: histogram of relative frequency against the sample variances 0, 2, 8,
and 18.]

Fig. 1.48.2 Distribution of the sample variances.
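Both sampling distributions can be generated directly, as in the following Python
sketch:

    from collections import Counter
    from itertools import product
    from statistics import mean, variance

    pop = [1, 3, 5, 7]
    samples = list(product(pop, repeat=2))       # the 16 SRSWR samples of n = 2
    mean_dist = Counter(mean(s) for s in samples)
    var_dist = Counter(variance(s) for s in samples)
    print(sorted(mean_dist.items()))   # means 1..7 with frequencies 1, 2, 3, 4, 3, 2, 1
    print(sorted(var_dist.items()))    # variances 0, 2, 8, 18 with frequencies 4, 6, 4, 2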

1.49 SAMPLING FRAME

A sample space \psi (or S) of identifiable units or elements of the population to be
surveyed is called a sampling frame. It may be a discrete space, such as households
or individuals, or a continuous space, such as the area under a particular crop.

1.50 SAMPLE SURVEY DESIGN

Let \psi = {t}, t = 1, 2, ..., s(\Omega), be a specified space of samples, let B_t be a Borel
set in \psi, and let P_t be the probability measure defined on B_t; then the triplet
(\psi, B_t, P_t) is called a sample survey design.

1.51 ERRORS IN THE ESTIMATORS

In general, two types of errors, which arise during the process of sampling, have
been observed in actual practice in the estimators:
( a ) Sampling errors; ( b ) Non-sampling errors.
Let us briefly explain these errors.

1.51.1 SAMPLING ERRORS

An error which arises due to sampling is called a sampling error. Let us explain this
with the help of the following example. For a population of size N = 4, let the units
be A = 1, B = 2, C = 3, and D = 4. The population mean is given by \bar{Y} = 2.5.
There are ^N C_n = ^4 C_2 = 6 possible samples, each of size n = 2. The units selected
in the six samples are (A, B), (A, C), (A, D), (B, C), (B, D), and (C, D). Thus the
six sample means are given by:

Sample t:      1      2      3      4      5      6
Units:       (A, B) (A, C) (A, D) (B, C) (B, D) (C, D)
Sample mean:  1.5    2.0    2.5    2.5    3.0    3.5

If we compare each of the sample means with the population mean separately, then
we have the following cases: error of (A, B) = |1.5 - 2.5| = 1.0; error of
(A, C) = |2.0 - 2.5| = 0.5; error of (A, D) = |2.5 - 2.5| = 0.0; error of
(B, C) = |2.5 - 2.5| = 0.0; error of (B, D) = |3.0 - 2.5| = 0.5; error of
(C, D) = |3.5 - 2.5| = 1.0. Note that we are measuring only two units out of four,
i.e., we have only partial information in the sample, and therefore sampling error
arises. One of the measures of the sampling error is the variance of the estimator.
For example, the variance of the sample mean estimator \bar{y}_t is
V(\bar{y}_t) = E[\bar{y}_t - E(\bar{y}_t)]^2 = \sum_{t=1}^{6} p_t (\bar{y}_t - \bar{Y})^2 = \frac{1}{6} \sum_{t=1}^{6} (\bar{y}_t - \bar{Y})^2
= \frac{1}{6}[(1.5-2.5)^2 + (2.0-2.5)^2 + (2.5-2.5)^2 + (2.5-2.5)^2 + (3.0-2.5)^2 + (3.5-2.5)^2] ≈ 0.41,
because p_t = 1/6 for all t = 1, ..., 6 and E(\bar{y}_t) = \bar{Y}. Also note that
\frac{N-n}{Nn} S_y^2 = \frac{4-2}{4 × 2} × \frac{5}{3} ≈ 0.41 = V(\bar{y}_t).
For its theoretical derivation refer to Chapter 2.

1.51.2 NON-SAMPLING ERRORS

These errors are of four types:
( i ) Non-response errors; ( ii ) Measurement errors;
( iii ) Tabulation errors; and ( iv ) Computational errors.
Let us discuss each of these errors in brief as follows:

1.51.2.1 NON-RESPONSE ERRORS

The people from whom we get the information are called respondents, and the
people in the sample from whom we do not get information are called
non-respondents. The error which arises when we fail to get the information is
called non-response error, and the phenomenon is called non-response. This error
arises because we are not able to cover the whole sample. For example, suppose we
want to interview 100 farmers and 5 of them do not allow us to interview them.
Then we are interviewing only 95, so the sample is not complete. Such errors are
called non-response errors.

1.51.2.2 MEASUREMENT ERRORS

The errors that we introduce in measuring the characters are called measurement
errors. For example, suppose we want to measure the age of the respondents.
Among the respondents, some may report an age less than their actual age. These
types of errors are called measurement errors.

1.51.2.3 TABULATION ERRORS

The errors which arise from missing some numbers, due to the non-availability of
data, or from recording some numbers wrongly while making a table, are called
tabulation errors.

1.51.2.4 COMPUTATIONAL ERRORS

After the table is formed, we start our calculations. The errors committed in
calculations are known as computational errors.
1.52 POINT ESTIMATOR

A point estimator endeavours to give the best single estimated value of the
parameter. For example, the average height of school children is 5.3 feet.

1.53 INTERVAL ESTIMATOR

An interval estimator of a population parameter specifies a range of values,
bounded by an upper and a lower limit, within which the true value is asserted to
lie. For example, the average height of school children lies between 4.9 feet and
5.5 feet.

1.54 CONFIDENCE INTERVAL

If it is possible to define two statistics \hat{\theta}_1 and \hat{\theta}_2 (functions of sample values only)
to estimate a population parameter \theta such that

p(\hat{\theta}_1 < \theta < \hat{\theta}_2) = 1 - \alpha,    (1.54.1)
where 1 - \alpha is some fixed probability, the interval between \hat{\theta}_1 and \hat{\theta}_2 is called a
confidence interval. The assertion that \theta lies in this interval will be true, on the
average, in a proportion 1 - \alpha of the cases when the assertion is made. Note that an
interval estimate at the same level of confidence with a smaller width is considered
the better estimate. For example, if someone says with 95% confidence that the
average marks of a particular class lie between 65% and 85%, then this estimate is
better than saying, with the same confidence, that the average marks lie between
0% and 100%. We saw that for SRSWOR sampling the sample mean \bar{y}_t is
unbiased for the population mean with
V(\bar{y}_t) = \frac{N-n}{Nn} S_y^2,    (1.54.2)
and an unbiased estimator of this variance is given by
\hat{v}(\bar{y}_t) = \frac{N-n}{Nn} s_y^2.    (1.54.3)
Thus there are two cases. If V(\bar{y}_t) is known, then for a large sample a
(1-\alpha)100% confidence interval estimate for the population mean \bar{Y} is given by
\bar{y}_t ± z_{\alpha/2} \sqrt{V(\bar{y}_t)},    (1.54.4)
where the z_{\alpha/2} values are given in Table 3 of the Appendix; and if V(\bar{y}_t) is unknown,
then for a small sample a (1-\alpha)100% confidence interval estimate for the
population mean \bar{Y} is given by
\bar{y}_t ± t_{\alpha/2}(df = n-1) \sqrt{\hat{v}(\bar{y}_t)},    (1.54.5)
where the t_{\alpha/2}(df = n-1) values are given in Table 2 of the Appendix, and df stands for
degrees of freedom. Note that \alpha = 0.05 corresponds to a (1-\alpha)100% = (1-0.05)100%
= 95% confidence interval.

Let us illustrate this with the following example:

Example 1.54.1. Consider a population consisting of N = 7 units, viz., A = 1,
B = 1, C = 8, D = 8, E = 8, F = 9, and G = 9. Consider all possible SRSWOR
samples of n = 2 units, and compute confidence interval estimates of the population
mean based on each sample under the following two different situations:
( a ) the variance or mean square error S_y^2 is known;
( b ) the variance or mean square error S_y^2 is unknown.
Find the proportion of confidence interval estimates in which the true population
mean is included.

Solution. Note that for this population we have population mean \bar{Y} = 6.29 and
population mean square S_y^2 = 13.24.

Now the total number of SRSWOR samples will be
^N C_n = ^7 C_2 = \frac{7!}{2!(7-2)!} = \frac{7 × 6}{2} = 21.

( a ) When the population variance is known: The lower and upper limits are
L_1 = \bar{y}_t - z_{\alpha/2} \sqrt{V(\bar{y}_t)} = \bar{y}_t - 1.96 \sqrt{\frac{N-n}{Nn} S_y^2},
and
U_1 = \bar{y}_t + z_{\alpha/2} \sqrt{V(\bar{y}_t)} = \bar{y}_t + 1.96 \sqrt{\frac{N-n}{Nn} S_y^2}.

( b ) When the population variance is not known: The lower and upper limits are
L_2 = \bar{y}_t - t_{\alpha/2}(df = n-1) \sqrt{\hat{v}(\bar{y}_t)} = \bar{y}_t - 12.71 \sqrt{\frac{N-n}{Nn} s_y^2},
and
U_2 = \bar{y}_t + t_{\alpha/2}(df = n-1) \sqrt{\hat{v}(\bar{y}_t)} = \bar{y}_t + 12.71 \sqrt{\frac{N-n}{Nn} s_y^2}.

All possible samples, sample means and variances, lower and upper limits of the
95% confidence intervals, and their coverage are given in the following table.

Sample  Values  \bar{y}_t  s_y^2  L_1  U_1  \bar{Y} in CI(1)?  L_2  U_2  \bar{Y} in CI(2)?
A B 1 1 1.0 0.00 -3.26 5.26 No 1.0 1.0 No
A C 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A D 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A E 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
A F 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
A G 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
B C I 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B D 1 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B E I 8 4.5 24.50 0.24 8.76 Yes -33.1 42.1 Yes
B F 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
B G 1 9 5.0 32.00 0.74 9.26 Yes -38.0 48.0 Yes
C D 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No
C E 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No
C F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
C G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
D E 8 8 8.0 0.00 3.74 12.26 Yes 8.0 8.0 No

D F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes


D G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
E F 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
E G 8 9 8.5 0.50 4.24 12.76 Yes 3.1 13.9 Yes
F G 9 9 9.0 0.00 4.74 13.26 Yes 9.0 9.0 No

Thus we observe that the population mean \bar{Y} = 6.29 lies between L_1 and U_1 in
20 out of 21 cases, and hence the observed proportion of confidence intervals
containing the population mean is 20/21 = 0.9524. In other words, in 95.24% of
cases the population mean lies within the confidence interval estimates when the
variance is known. The observed percentage is thus very close to the expected
coverage of 95%.

We also observe that when the population variance is not known, the population
mean lies between L_2 and U_2 in 16 cases, and hence the observed proportion of
confidence interval estimates containing the population mean is 16/21 = 0.7619;
that is, only 76.19% of the time does the population mean lie within the confidence
interval estimates when the variance is unknown. Here the observed coverage is
lower than the expected coverage of 95%. This may be due to the very small sample
and population sizes. In practice, as the sample size becomes large (how large? just
smile, because there is no unique answer), the observed proportion of coverage
in both cases converges to 95%.
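The limits in this example all come from one formula; a minimal Python sketch,
shown for the sample (A, C), follows:

    import math

    def ci_mean(ybar, s2, n, N, t_or_z):
        # CI for the population mean under SRSWOR, as in (1.54.4)-(1.54.5).
        half = t_or_z * math.sqrt((N - n) / (N * n) * s2)
        return ybar - half, ybar + half

    # Known variance: S_y^2 = 13.24 with z = 1.96 gives (0.24, 8.76).
    print(ci_mean(4.5, 13.24, 2, 7, 1.96))
    # Unknown variance: s_y^2 = 24.5 with t_{0.025}(df = 1) = 12.71 gives (-33.1, 42.1).
    print(ci_mean(4.5, 24.5, 2, 7, 12.71))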

1.55 POPULATION PROPORTION

The population proportion is defined as the total number of units of particular
interest in a subgroup A (such that A ∪ A^c = \Omega) of a population divided by the
total number of units in the population. In other words, when the variable of
interest Y takes only two values, 1 and 0, that is, Y_i = 1 (if i ∈ A) and Y_i = 0
(if i ∈ A^c), then the population mean \bar{Y} also becomes the population proportion P
as follows:
\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{1}{N}(1 + 0 + 1 + 0 + ...... + 0 + 1) = \frac{N_1}{N} = P,    (1.55.1)
where N_1 denotes the number of units of the population in the group A, and N
denotes the total number of units in the population. Note that the value of the
population proportion P lies between 0 and 1, that is, 0 ≤ P ≤ 1. Further note that
here we are dealing with qualitative variables.

1.56 SAMPLE PROPORTION

The sample proportion is defined as the total number of units of particular interest
in a subgroup A (such that A ∪ A^c = s) of a sample divided by the total number of
units in the sample. In other words, when the variable of interest y takes only two
values, 1 and 0, that is, y_i = 1 (if i ∈ A) and y_i = 0 (if i ∈ A^c), then the sample
mean \bar{y}_t also becomes the sample proportion \hat{p} as follows:
\bar{y}_t = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n}(1 + 0 + 1 + 0 + ...... + 0 + 1) = \frac{n_1}{n} = \hat{p},    (1.56.1)
where n_1 denotes the number of units of the sample in the group A, and n denotes
the number of units in the sample. Note that the value of the sample proportion
also lies between 0 and 1, that is, 0 ≤ \hat{p} ≤ 1.

1.57 VARIANCE OF THE SAMPLE PROPORTION

( a ) The variance of the sample proportion under SRSWOR sampling is given by
V_{wor}(\hat{p}) = \frac{N-n}{n(N-1)} P(1-P)    (1.57.1)
and its estimator is given by
\hat{v}_{wor}(\hat{p}) = \frac{N-n}{N(n-1)} \hat{p}(1-\hat{p}).    (1.57.2)
Thus a (1-\alpha)100% confidence interval estimate of the population proportion P is
given by
\hat{p} ± z_{\alpha/2} \sqrt{\hat{v}_{wor}(\hat{p})}.    (1.57.3)

( b ) The variance of the sample proportion under SRSWR sampling is given by
V_{wr}(\hat{p}) = \frac{P(1-P)}{n},    (1.57.4)
and its estimator is given by
\hat{v}_{wr}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}.    (1.57.5)
Thus a (1-\alpha)100% confidence interval estimate of the population proportion P is
given by
\hat{p} ± z_{\alpha/2} \sqrt{\hat{v}_{wr}(\hat{p})}.    (1.57.6)
For details of the derivations of the results related to proportions, refer to Chapter 2.
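A minimal Python sketch of the SRSWOR case (1.57.2)-(1.57.3) follows; the
numbers used in the call are illustrative.

    import math

    def prop_ci_wor(p_hat, n, N, z=1.96):
        # 95% CI for P under SRSWOR from the estimated variance (1.57.2).
        v = (N - n) / (N * (n - 1)) * p_hat * (1 - p_hat)
        half = z * math.sqrt(v)
        return p_hat - half, p_hat + half

    # For example, p_hat = 0.5 from an SRSWOR sample of n = 4 out of N = 6:
    print(prop_ci_wor(0.5, 4, 6))    # approximately (0.17, 0.83)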

Example 1.57.1. Consider a class consisting of 6 students. Their names and majors
are given in the following table:

Name | Major
Amy | Math
Bob | English
Chris | Math
Don | English
Erin | Math
Frank | English

( a ) Find the proportion of English students in the class.
( b ) How many SRSWOR samples, each of n = 4 students, will there be?
( c ) What is the sampling distribution of the estimate of the proportion?

Solution. ( a ) Count the number of students in the population: N = 6.

Count the number of students with major English: COUNT = N_1 = 3.

Compute the proportion of English students:
P = COUNT/N = N_1/N = 3/6 = 0.5 (parameter).
Recall that a parameter is an unknown quantity and we try to estimate it by taking a
random sample from the population.

( b ) How many SRSWOR samples, each of four units, will there be?
The possible combinations of choosing 4 objects out of 6 objects are given by
^6 C_4 = \frac{6!}{4!(6-4)!} = \frac{6 × 5}{2 × 1} = 15.
Note that each combination can be taken as a without replacement sample, so the
total number of distinct samples will be 15.

( c ) Sampling distribution of the estimate of the proportion: Let us construct those 15
samples as follows:
samples as follow s:

No. | Students in the sample | Majors | Estimate \hat{p}
1 | Amy Bob Chris Don | M E M E | 2/4 = 0.50
2 | Amy Bob Chris Erin | M E M M | 1/4 = 0.25
3 | Amy Bob Chris Frank | M E M E | 2/4 = 0.50
4 | Amy Bob Don Erin | M E E M | 2/4 = 0.50
5 | Amy Bob Don Frank | M E E E | 3/4 = 0.75
6 | Amy Bob Erin Frank | M E M E | 2/4 = 0.50
7 | Amy Chris Don Erin | M M E M | 1/4 = 0.25
8 | Amy Chris Don Frank | M M E E | 2/4 = 0.50
9 | Amy Chris Erin Frank | M M M E | 1/4 = 0.25
10 | Amy Don Erin Frank | M E M E | 2/4 = 0.50
11 | Bob Chris Don Erin | E M E M | 2/4 = 0.50
12 | Bob Chris Don Frank | E M E E | 3/4 = 0.75
13 | Bob Chris Erin Frank | E M M E | 2/4 = 0.50
14 | Bob Don Erin Frank | E E M E | 3/4 = 0.75
15 | Chris Don Erin Frank | M E M E | 2/4 = 0.50

From the above table we have the following table:



Proportion estimate \hat{p} | Tally | Frequency f_i | Relative frequency RF_i = f_i / \sum f_i
0.25 | III | 3 | 3/15
0.50 | IIII IIII | 9 | 9/15
0.75 | III | 3 | 3/15
Sum |  | 15 | 1

The above table shows that the distribution of the estimates of the proportion is
symmetric, or, one might say, approximately normal in shape.

Let x_i = \hat{p} and P_i = RF_i. Then the expected value of x_i = \hat{p} is
E(x_i) = \mu = E(\hat{p}) = \sum_{i=1}^{3} P_i x_i = \frac{3}{15} × 0.25 + \frac{9}{15} × 0.50 + \frac{3}{15} × 0.75
= 0.05 + 0.30 + 0.15 = 0.5 = P (the population proportion).

Because E(\hat{p}) = P, the sample proportion is an unbiased estimator of the
population proportion.

By the computing formula of the variance, we have
\sigma^2 = V(x_i) = V(\hat{p}) = [ \sum_{i=1}^{3} P_i x_i^2 ] - [E(\hat{p})]^2 = [ \frac{3}{15} × 0.25^2 + \frac{9}{15} × 0.50^2 + \frac{3}{15} × 0.75^2 ] - (0.5)^2
= [0.0125 + 0.15 + 0.1125] - 0.25 = 0.275 - 0.25 = 0.025.

Also note that
\frac{N-n}{N-1} × \frac{P(1-P)}{n} = \frac{6-4}{6-1} × \frac{0.5(1-0.5)}{4} = \frac{2}{5} × \frac{0.25}{4} = 0.025.

Thus the variance of the estimator of the population proportion using without
replacement sampling is given by
\sigma_{\hat{p}}^2 = \frac{N-n}{N-1} × \frac{P(1-P)}{n}.
Similarly, repeat this example yourself by taking all SRSWR samples, study the
unbiasedness and variance of the estimate of the proportion, and also draw a
histogram of the sampling distribution of the estimate of the proportion.

Example 1.57.2. Consider a class of 16 students taking a statistics course; their
names, marks, and major subjects are given in the following table:

Sr. No.   Name   Marks   Major
1 Ruth 92 Math
2 Ryan 97 Math
3 Tim 68 English
4 Raul 62 Math
5 Marla 97 English
6 Erin 68 Math
7 Judy 76 English
8 Troy 75 English
9 Tara 51 Math
10 Lisa 94 Math
11 John 70 Math
12 Cher 89 English
13 Lona 62 Math
14 Gina 63 Math
15 Jeff 48 Math
16 Sara 97 Math

1. Compute the following parameters:
( a ) population mean;
( b ) population variance;
( c ) population standard deviation;
( d ) population mean square error;
( e ) population coefficient of variation.
2. ( a ) Select an SRSWOR sample of 4 units using the random number method;
( b ) estimate the population mean and population total;
( c ) compute the variance of the estimator of the population mean;
( d ) estimate S_y^2;
( e ) estimate the variance of the estimator of the population mean;
( f ) construct a 95% confidence interval for the population mean assuming that the
population mean square is known and the sample size is large. Does the population
mean fall in it? Interpret it;
( g ) construct a 95% confidence interval assuming that the population mean square
is unknown and the sample size is small. Does the population mean fall in it?
Interpret it.
3. ( a ) Compute the population proportion of students majoring in English;
( b ) estimate the proportion majoring in English on the basis of the above sample;
( c ) compute the variance of the estimator of the population proportion;
( d ) estimate the variance of the estimator of the population proportion;
( e ) construct a 95% confidence interval for the proportion.

Solution. We have:

Name   Y_i   Y_i^2
Ruth 92 8464
Ryan 97 9409
Tim 68 4624
Raul 62 3844
Marla 97 9409
Erin 68 4624
Judy 76 5776
Troy 75 5625
Tara 51 2601
Lisa 94 8836
John 70 4900
Cher 89 7921
Lona 62 3844
Gina 63 3969
Jeff 48 2304
Sara 97 9409
I:.i t "' Sum;:I :· .L 95559 .

From the population information we have


N N
N = 16, Dj = 1209 and Di = 95559.
i= l i=l
1.
( a ) Population mean :
N
If:
_ ._ I 1209
Y =.!.=L =- - = 75.56 (parameter) .
N 16
( b ) Population variance:
N )2
(
IJj2 _ ~
If:
95559- (1209f
(j2 =H N = 16 = 262.75 (parameter).
y N 16
( c ) Population standard deviat ion:
(jy = g = .J262.75 = 16.20 (parameter) .
( d ) Population mean square error :
N )2
N (
IJj2 _ ~
If:
. I
95559- (1209)2
52 = ;=1 N _ _ _ _1'-"6:.....- = 280.26 (parameter).
y N-l 16-1
44 Ad vanced sa mpling theory with applications

(e) Population coefficient of variation: We are using SRSWOR sampling , so


S fS2
_v_0 y_ _ ,,~
C v -- ---.X..
- -
- - -
280.26 -_ 16.74 -
- 0. 2215 (par ame t er ) .
. Y Y 75.56 75.56
2.
( a ) Se lection of 4 units using SRSWOR sampling ( 1/ = 4 ): Let us start with 1st row
and 6 th co lumn of the Pseudo-Random Number Tab le I given in the Appendix.

Random Decision: .. Name of the


Nu mber R - Rejection, S -- Selection selected student
62 R
77 R
92 R
67 R
53 R
51 R
33 R
07 S Jud y
62 R
69 R
76 R
48 R
50 R
88 R
37 R
72 R
63 R
21 R
33 R
25 R
76 R
09 S Tara
43 R
80 R
94 R
62 R
68 R
IS S Jeff
42 R
93 R
29 R
01 S Ruth
Chapter 1: Basic concepts and mathematical notation 45

So our SRSWOR sample consists of four students = {Judy, Tara, Jeff, Ruth} .

Now from the sample we have the following information:


I CY c~ame I ~f+"c Yi cCC
y
y; '~::c

Judy 76 5776
Tara 51 2601
Jeff 48 2304
Ruth 92 8464
Sum 267 19145
Thus
11 2
IYi
11
II = 4 , = 267 and IYi = 19145.
i =1 i=1
( b ) Sample mean :
11
Iy
- - = -267 = 66.75 (stati
- = -i-I
Yt . )
statistic
II 4
which is an estimate of popu lation mean.
N
Note that an estimator of population total Y = If; will be given by
i=\

Yt = N Yt = 16x66.75 = 1068 (statistic).


( c ) The variance of the sample mean estimator is given by
V(Yt) = (N -IIJS; = (16-4JX280.26 = 52.548 (parameter).
Nil 16x4

( d ) An estimator of Sf, is given by

I.Yi J2
( i=1
n 2
I Yi - -"'----~- 19145- (267f
2 i= 1 II _ _ _----""4_ = 440.91 (statistic).
Sy =
II-I 4-1
(e) An estimator of the variance of the estimator of the population mean is
;;(Y/) = (N -IIJs 2 = (16-4JX440.91 = 82.67 (statistic).
Nil y 16x4
( f) Here 95% confidence interval is given by
Yt ± I.96~V(yt), or 66.75± I.96~52.548, or 66.75 ± 14.20, or [52.55, 80.20] .

Yes, the true popul ation mean Y = 75.26 lies in the 95% confidence interval
estimate. The interpretation of95% confidence interval is that we are 95% sure that
the true mean lies in these two limits of this interval estimate. Note that interval
estimate is a statistic.
46 Advanced sampling theory with applications

( g ) Here 95% confidence interval estimate is given by


Yt ± la /2 (df = n -1 ).jv(Yt), or 66.75± 10.025 (df = 3}.j82.67 ,

or 66.75±3 .182~82 .67, or 66.75±28.93, or [37.82,95.68]

where la /2(df = n -1) = 10.025(df = 3) = 3.182 is taken from Table 2 of the Appendix.

Yes, again the true population mean lies in this 95% confidence interval and its
interpretation is same as above. Again note that interval estimate is a statistic.

3.
( a) Let us give upper case 'FLAG' of 1 to English majors and 0 to Math major
students in the whole population, then we have

o
2 R an Math o
3 Tim En lish
4 Raul Math o
5 Marla En lish
6 Erin Math o
7 Jud En \ish
8 Tro En lish
9 Tara Math o
10 Lisa Math o
11 John Math o
12 Cher En lish
13 Lona Math o
14 Gina Math o
15 Jeff Math o
16

Population Proportion:

N
L:FLAG i .. .
p= i=l = No.ofstudents wIth enghsh maJor =~=O .3125 (parameter).
N Total No.of Students 16
Chapter I: Basic concepts and mathematical notation 47

( b ) Let us now give the same lower case ' flag' to students in the sample .
r-:

:', "
Judy English I
Tara Math 0
Jeff Math 0
Ruth Math 0
:< J/>",Attt :,:;Y i:":, ~ UIIl'
I :~ ' " .
, ' '<> >' "/

Sample proportion: The sample proportion is given by


n
~:tlagi
i=1- - = -1
p=
A
= 025
. .
n 4

( c ) Variance of the estimator of proportion under SRSWOR sampling is given by


A) (N -n) (
Vwor(P =-(--)PI-P = (
) (16-4)
)xO.3125x 1-0.3125 =0.0429.
( )
nN-l 4x16-1

(d) An estimator of variance of the estimator of proportion under SRSWOR


sampling is given by
vAwor(pA) = (N - n)
(
A(
) p 1- p = (
A) (16- 4) 02
) x . 5 x 1- 0.25 = . 4
( ) 0 0 68 7 .
N n-l 164-1

( e ) A 95% confidence interval estimate for the true population proportion is


p± 1.96~vwor(P)
or 0.25 ± 1.96'/0.04687 , or 0.25 ± 0.424 , or [- 0.174, 0.674] , or [0.0, 0.674] .

Note that a proportion can never be negative, so lower limit has been changed to O.
Caution! It must be noted that we have here a very small sample, but in practice
when we deal with the problem of estimation of proportion, the minimum sample
size of 30 units is recommended from large populations. Note that instead of using
'FLAG' or 'flag' , sometimes we assign codes 0 or I directly to the variable Yor
X.

Example 1.57.3. For the population considered in the previous example:


( a ) John considers a sampling scheme consisting of only 4 samples as follows.

f §~l11ple '.
~Nu~ber '
Cher, John, Marla, Sara 0.25
2 Erin, Jud ,Raul, Tara 0.25
3 Gina, Lisa, Ruth, Tim 0.25
4 Jeff, Lona, Ran, Tro 0.25
48 Advanced sampling theory with applications

( b ) Mike considers another sampling plan consisting of 13 samples each of 4


students as given below:
,, )"
."",·"1, _." .; )
,.,',.". .. 11':;,'0:
· " ' .' ,~"!'f'. ,,
'" ) " ;" ,., ' . r:.• ,;' ;.,., .'·"lII,' ,"'!' c,
"'1,'

1 Cher, Erin, Gina, Jeff 1/13


2 Cher, Erin, Gina, John 1/13
3 Cher, Erin, Gina, Judy 1/13
4 Cher, Erin, Gina, Lisa 1/13
5 Cher, Erin, Gina, Lona 1/13
6 Cher, Erin, Gina, Marla 1/13
7 Cher, Erin, Gina, Raul 1/13
8 Cher, Erin, Gina, Ruth 1/13
9 Cher, Erin, Gina, Ryan 1/13
10 Cher, Erin, Gina, Sara 1/13
II Cher, Erin, Gina, Tara 1/13
12 Cher, Erin, Gina, Tim 1/13
13 Cher, Erin, Gina, Troy 1/13
Let Yt be an estimator of the population mean. Find the following for each one of
the above sampling schemes:
E(Yt); V(Yt) ; B(Yt); and MSE(Yt)·
Comment on the statement, 'Mike's sampling scheme is better than John's sampling
scheme '. Justify your logic and discuss the relative efficiency.

Solution. ( a ) John's sampling plan:

161.03
127.91
13.61
25.60
., 328.18
Chapter I: Basic concepts and mathematical notation 49

(_)
MSE Yt = ~
L. Pt Yt - Y {_ -}2 =-1 x328 .18=82.045 .
t= \ 4

( b ) Mike's sampling plan:


,;0:,:",{! '\!'; ~! ";:i{¥I~~(J?:;){~J
s:s ! ';; 1j'\t;:i Wt ~·~F·:.1;
Samole
" ;xr,,)[:' I ! ,;i; ~, l~~t?t~~~~ ~,; 1
's 0:';~Di~;::i ; Zj .'-!; I,Z
1 89 68 63 48 67.0 1/13 49 .269600 73.3164 10
2 89 68 63 70 72.5 1/13 2.308062 9.378906
3 89 68 63 76 74.0 1/13 0.000370 2.441406
4 89 68 63 94 78.5 1/13 20.077290 8.628906
5 89 68 63 62 70.5 1/13 12.384990 25.628910
6 89 68 63 97 79.3 1/13 27.360950 13.597660
7 89 68 63 62 70.5 1/13 12.384990 25.628910
8 89 68 63 92 78.0 1/13 15.846520 5.941406
9 89 68 63 97 79.3 1/13 27.360950 13.597660
10 89 68 63 97 79.3 1/13 27.360950 13.597660
11 89 68 63 51 67.8 1/13 39.303250 61.035160
12 89 68 63 68 72.0 1/13 4.077293 12.691410
13 89 68 63 75 73.8 1/13 0.072485 3.285 156
ii.;;j' !iir;;:i;.~0:ii:i .\',! !Surn 962:3 .". lift,!': ; 3;237.80770.0·.' 268:769560

where Y = 75.56 .

Thus we have
_ 13 _ 1
E(Yt ) = 'LPtYt = - x962 .3 = 74.02 ,
t= 1 13
and
B(Yt) = E(yt )- Y = 74.02 -75.56 = - 1.54 ,
V(Yt) = I Pt {Yt - E(Yt)}2 = ~13 x 237.8077 = 18.2929,
t=\

and
MSE(Yt) = Ipt ~t - Y}2 = ~ x 268.76956 = 20.675.
t =\ 13
Although John ' s samp ling scheme is less biased, it has too much mean square error
compared to Mike's sampling scheme . Thus we shall prefer Mike' s sampl ing
scheme over John's sampling scheme. Also note that the relative efficiency of
Mike' s sampl ing scheme over John 's sampling scheme is given by
MSE(- )
RE = Yt John X 100 = 82.045 X 100 = 396.83% .
MSE(Yt )Mike 20.675
Thus one can say that Mike's sampling plan is almost four times more efficient than
John 's sampl ing scheme.
50 Advanced sampling theory with applications

1.58 RELATIVESTANDARDERROR

The relative standard erro r of an estimato r e of population parameter e is defined


as the positive squ are root of the relative variance of the estim ator e.
Mathematically
RSE(e) =~Rv(e) (1.58 .1)

whe re Rv(e)=v(e)/[E(e)f denotes the relative variance of the estim ator e. The
another famous name for relative standard error is coefficient of variation .

1.59.AUXILIARYINFORMATION··.

In many sample surveys, it is possible to collect information abou t some variable(s)


in addition to the variable of interest or study variable . The auxiliary information is
accurately known from many sources like reference books, journals, adm inistrative
records etc. and is cheaper to obta in than the study variable. For example, while
estimating the average incom e of people living in a particular city, the plot area
owned by individual may be known from some published sources. Later on we will
observe that the known auxiliary information is also helpful in incre asing the
efficiency of the estimators. Before dealing with two variables, we should be
familiar with the following terms . If lj and X i denote the values of /" unit for the
study variable Y and auxiliary variable X , then we have :

( a ) The covariance between X and Y is


Cov (X, Y) = E[X - E (X )][Y - E (Y )] = E(xr)- E(X )E (Y ) . (1.59 .1)
The covariance between X and Y is same as that between Y and X, i.e.,
Cov(X ,Y)= Cov(Y,X).
For exam ple , for SRSWOR sampling, the covariance between X and Y is given
by
I
S xy=-- N(
IXi-X Yi-Y , -X -)
N -I i=1
- I N - IN
where X =- I X i and Y = - I Yi . Note that an unbiased estimator Sxy is given
N i=\ N i=1
by
-X
sXY = - I -1 In ( Xi - X Y i - Y -)
n - i=\
I n I n
where x =- IX i and y = - I Yi
11 i=l 11 i=1
( b ) The population correlation coefficient between X and Y is defined as
Cov(X,Y)
Pxy = ~v(x)~v(y) ' (1.59 .2)
For simple random sampling it is given by
Chapter I: Basic concepts and mathematical notation 5I

Px)' = Sx)'/ ~S}S.~ .


(1.59 .3)
A biased estimator of the corre lation coefficient P x)' is defi ned as

':,y= SXy/ ~s;s; . ( 1.59.4)


The value of Px)' (or rx)') is a unit free number and it lies in the interval [- 1, + 1]. It
is also indepe ndent of change of origi n and sca le of the variab les X and Y. The
linear relati onship can also be seen with the help of scatter diagrams as follows :

sex TTER PLOTS


y Px)' > 0 Y
o P x)' < 0

o o o
00 o
o
o o o
o o
o
x X

As X increases Yalso increases As X increases Y decreases


Relationship is positive Relationship is negative

y P x)' = + 1 Y Px)' = - 1

x X

As X increases Ya lso increases As X increases Y dec reases


and all points lie of a straig ht line and all points lie on a straig ht line
Perfect positive relation ship Perfect negati ve relationship
52 Advanced sampling theory with applications

y Pxy = 0 y Pxy maybe positive,negative or zero

000
o o
o o

x X

As X increases Y may increase As X increases Y first increases


or decrease (Y do not care X) and then decreases
No relationship Sign of relationship is not sure

Fig. 1.59.1 Scatter plots .

Note that a similar scatter plot can be made from sample values to find the sign of
sample correlation coefficient rty .
( C) The population regression coefficient of X on Y is defined as
,B = Cov(X,Y)/V(X). (1.59.5)
For simple random sampling, it is given by
,B=SXy /S; . (1.59.6)
A biased estimator of f3 is given by
b= sxy /s; (1.59 .7)
which in fact represents a change in the study variable Y with a unit change in the
auxiliary variable X . Note that sign of ,B (orb) is same as that of PXy(or rxy ) .

Example 1.59.1. Consider the following population consisting of five (N = 5)


units A , B , C , D, and E , where for each one of the unit in the population two
variables Yand X are measured .
Units A B C D E
I> Yi 9 11 13 16 21
..• Xi ' 14 18 19 20 24

Find the following parameters:


- - 2 2
( a) Y, X, S x , Sy , S ty ' P xy and f3 .
Chapte r 1: Basic concepts and mathematical notation 53

( b ) Select all possible SRSW OR samples of n = 3 units . Show that y , x, s.;,


s; , S xy are unbiased for Y , X, S; , S;, Sxy, but r xy and b remain biase d
estimators of P xy and fJ respective ly.
( c ) Compute Cov(y, r) by using definition .
( d ) Compute (1- f) S xy and comment on it.
n
Solution. From the complete population information, we have
Units -1'j -)2
X j (~.I ~'~rr. (Xi -x:)1"",(Yi ;-;X: (Xi - x)2 (Y; -rXXi-X)
A 9 14 -5 -5 25 25 25
B 11 18 -3 -1 9 1 3
C 13 19 -1 0 1 0 0
D 16 20 2 1 4 1 2
E 21 24 7 5 49 25 35
I ~ ·c 88 ,
Sum "'-70 95 0 O~
52 ,Ye Co 65 "-"""~
N N
( a ) From the above table we have LYi = 70, LXi =95, so that
;=1 ;=\
- I N I - I N I
Y =-LY;=- x70 =14 , and X=-LX;=- x95 =19 .
N ;=1 5 N i=1 5

From the above table, I(Y; - rl = 88, I {x; - xl = 52 and ~(y; - rXx; - x)= 65,
;;) ;;\ ;=\
so that
2 2
2 I N( -) 88 2 I N( - ) 52
Sy = - - L Y; - Y = - - = 22 , S x = - - LX; - X = - = 13 ,
N - I ;=\ 5-1 N-I;=I 5-1

I N( 65 -X -) Sxy 16.25
Sxy =--LY;-Y X i - X = -=16.25 , Px y = g = r.;:;-;:;:=0.960 ,
N - I;=I 5-1 S2 S2 ,,13 x22
x y

and fJ = Sxy = 16.25 = 1.25 .


S2 13
x
(b ) Here we have N = 5 and n = 3, so the total number of possib le SRSW OR
samp les will be 5C3 = 10.

Units
A
B
c
Sum
Continued .
54 Adva nced samp ling theory with applications

Units Sample'z " .,' >"'~


A 9 14 -3 -3.34 9 1I.I6 10.02
B II 18 -I 0.67 I 0.45 -0.67
D 16 20 4 2.67 16 7.13 10.68
Sum 36 '52 0 0.00 26 ,,-1 8.73 20.03
,
I~ Units , Sample 3 ';&"r.
A 9 14 -4.67 -4.67 21.81 21.81 21.81
B II 18 -2.67 -0.67 7.13 0.45 1.79
E 21 24 7.33 5.34 53.73 28.52 39.14
Sum-ll 41 ' 56 . . ~0.01 ·0.00; ' 82.67 '50.77 ,,, ;,, 62.74,' ;
Units ,I,;
.z :
;, j Sample 4- ' ii"
' j
.' "';~;;
A 9 14 -3.67 -3.67 13.44 13.44 13.44
C 13 19 0.33 1.33 0.11 1.78 0.44
D 16 20 3.33 2.34 1I.I 1 5.48 7.80
t~ Sum \1':515' ,; w 53 ~ 1.,0:00 ;>20!70.a ~i i rI 2 1~69 ,
\24;67J'
Units I f ' ",;' ';} 'fi. Sample '5~j\~ i" ,II' ,;<,i',' ",,;;;,
A 9 14 -5.33 -5.00 28.44 25.00 26.67
C 13 19 -1.33 0.00 1.78 0.00 0.00
E 21 24 6.67 5.00 44.44 25.00 33.33
Sum 43 57 0.00 ' 0.00 ;;,' 74.67 50.00 ~60. 00 ~
Units:> <'"
~'
"'" Sample 6 ;;&-. ,'~
A 9 14 -6.33 -5.33 40.11 28.44 33.78
D 16 20 0.67 0.67 0.44 0.44 0.44
E 21 24 5.67 4.67 32.11 21.78 26.44
Sum ': 46 58 I " 0.00 " 0.00,'0 72.67 50.67 -ir 60.67
Units ;;,.. "
''co ' . ~ Sample 7 ." . wo "'~ " i

B 11 18 -2.33 -1.00 5.44 1.00 2.33


C 13 19 -0.33 0.00 0.11 0.00 0.00
D 16 20 2.67 1.00 7.11 1.00 2.67
T'
I ,',; ",
Sum " i <tv "
;'tUriitsf I"; :;...
r.

, oi5 7;/:!', '""'0.00,;,


"," ' '1
'~:t~:
11112:67;'J' Ii:; 2.00 . Ii',: i d';5!OQ!r.
,;C' ,18.1" ''"' ",,,,,.\'''' ""'"
B II 18 -4.00 -2.33 16.00 5.44 9.33
C 13 19 -2.00 -1.33 4.00 1.78 2.67
E 21 24 6.00 3.67 36.00 13.44 22.00
SUI11 ,', 457 61 ' 0.00 ,1';;.0.00 <, ;5,6.00", 20.67 r,i-": "*34:00,rf
Units ', ,: ".'
Sample ~~ "
, ''§ "'o;"~
B 11 18 -5.00 -2.67 25.00 7.11 13.33
D 16 20 0.00 -0.67 0.00 0.44 0.00
E 21 24 5.00 3.33 25.00 1I.I I 16.67
Contin ued .
Chapter 1: Basic concepts and mathematical notation 55

13.44 4.00
0.44 1.00
18.78 9.00

From the above table we have

1 11.00 17.00 7.00 5.00 0.945 0.71


2 12.00 17.33 9.37 10.02 0.829 1.12
3 13.67 18.67 25.38 31.37 0.968 1.24
4 12.67 17.67 10.35 10.84 0.960 1.05
5 14.33 19.00 25.00 30.00 0.982 1.20
6 15.33 19.33 25.33 30.33 1.000 1.20
7 13.33 19.00 1.00 2.500 0.993 2.50
8 15.00 20.33 10.33 17.00 0.999 1.65
9 16.00 20.67 9.33 15.00 0.982 1.61
10 16.67 7.00 10.50 0.982 1.50

Thus we have the following results :


E(Y) = 14 = Y, that is, the sample mean y is unbiased for population mean of the
study variable ;

E(x) = 19 = X , that is, the sample mean x is unbiased for population mean of the
auxiliary variable;

E(s~)= 22.00 = s~, that is, the sample variance s~ is unbiased for population s~
of the study variable;

E(s;)= 13.00 = S; , that is, the sample variance s; is unbiased for population S;
of the auxiliary variable ;
E(sxy ) = 16.25 = S xy , that is, the sample covariance s xy is unbiased for population
S xy of both variables;
EVxy)= 0.964 7c- Pxy' that is, the sample rxy is biased for population Pxy' and
B~xy)= EVxy)- P xy = 0.964-0.960 = 0.004;
56 Advanced sampling theory with applicatio ns

and
E(b) = 1.4 l' 13, that is, the sample b is biased for the popu lation 13;
and
B(b)= E(b)- 13 = 1.40-1 .25 = 0.15.
( c) The covariance between ji and x is defined as:
Cov(y,x)= E[y - E(Y)Ix - E(x)] =E[y - fl:~ - x]=IPs~s -f Ixs - xl
s=1
Now we have

Qit.. . f.},Pthf,\{~I:- x)ro' (Y-, - Y-X"


XI ':"<-y.,
, -'l
i'"
~~XI '" , XI X<
11.00 17.00 -3.00 -2.00 6.0000
12.00 17.33 -2.00 -1.67 3.3400
13.67 18.67 -0.33 -0.33 0.1089
12.67 17.67 -1.33 -1.33 1.7689
14.33 19.00 0.33 0.00 0.0000
15.33 19.33 1.33 0.33 0.4389
13.33 19.00 -0.67 0.00 0.0000
15.00 20.33 1.00 1.33 1.3300
16.00 20.67 2.00 1.67 3.3400
16.67 21.00 2.67 2.00 5.3400
.
I. "' co Sum 0.00 <.,r~ o.oo 2 1.6667
So by definition,

Cov(y, IPI~' -iIx -xl =~ x 21.6667 = 2.1667.


x)= 1=1 l
10

( d ) Now we have
N - n S = (5-3) xI6.25 = 2.16667.
Nn ~ 5 x3

Thus we have
_ _) N- n (1- j)
COy (y , x =- - S ty = - -S,y, where j = niN .
Nn n

For theoretical proof refer to Chapter 2.

1.60 SOME,USEFUL MATHEMATICM:.l'FORMULAE''>

If x and y are two random variable and c and d are two real constants, then
(a) v(ex) = e 2V(x} (1.60. 1)

(b) Cov(ex ,dy) = edCov(x,y } (1.60 .2)


Chapter 1: Basic concept s an d mathematical notation 57

II
( C) If x = IXi , where the x; are also random variables, then we have
i= l

v(x) = V(;~X;) = ;~v(x;)+ ;;<~=I Cov(';' x;}. ( 1.60 .3)

II
( d ) If x = I Cixi , where C; are real constants, then we have
i= l

v(X) = V ( L" C;X;) = L" C;2V(x;)+ L" C;CjCOV(


x., )
Xj . ( 1.60.4)
;=1 ;=1 ;;<j=1

Not e that if Xi and Xj are independent then CovG;,Xj) = 0 and

V(X) = V ( ;~
" C;X; ) = ;~"C;2V(X;) .

II n
( e ) If x = I Cixi and Y = L d;y;, where Ci and d i are real con stant s, then we have
i= \ ;=1

Covtx, y ) = co{;~c;x;, ;~d;y;) = ;~C;d;COV(X;, y;) + ;;<tl c;djCov(x;, y j )' ( 1.60.5)

Note that if Xi and Yj are independent then COV(X;' Yj ) = 0 and

Cov(x, Y) =CO{;~IC;X;' Eld ;Y;) = E tC;d;COV(X;, y; ).


1.61 ORDERED STATISTICS

The se are param eters which dea l with arrangi ng the data in ascendi ng or descen ding
order, and we introdu ce a few of them here as follows:

1.61.1 POPULATION MEDIAN

It is a measure which divides the popul ation into exa ctly two eq ua l parts, and it is
denoted by M y . Its analogo us from the sample is ca lled sample median , and is
denoted by if y' A pictorial repr esentation is given below:

Data arranged in ascending order


50% data values 50% data va lues

/ Minimum )
\ Value

Fig. 1.61.1 Structure of dat a to find median.


58 Advanced sampling theory with application s

Rules to find sample median: Consider a sample having I


observations, and we
wish to find the sample median. The first step is to arrange the data in ascend ing
order , and then after that there are two situations:
(i ) If the sample size I is odd , then the value at the (II ; I} h position from the
ordered data are called sample median. As an illustration, consider a sample
I
consisting of = 5 (odd) observations as 50, 90, 30, 60, and 70. First step is to
arra nge the data in asce nding order as: 30, 50, 60, 70, 90. The second step is to
. k up a va Iue at the ( -2-
pIC 1I+I) th = ( -5 +
2-I) th = 3rd positron
. . = 60 , so M' y = 60 .

( ii ) If the sample size I is even, then the average of the values at the (%}h and

(%+ I}h positions from ordered data are called sampl e median . As an illustration,

consider a sample con sisting of 11= 6 (even) observations as 50, 90, 30, 60, 70 and
20. First step is to arrange the data in ascending order as: 20, 30, 50, 60, 70, 90.
Th e second step is to pick up two values: one at ( %}h = (%}h = 3rd position = 50 ,

and seco nd at (% + I}h = ( %+ I}h = 4th position = 60 . Then the average of these

values is called the median, and is given by if y = (50 + 60)/2 = 55.

1.61.2 POPULATION Q UARTILES

These are three measures which divide the popul ation into four equal parts. The /11
quartile is represe nted by Qj, i = 1,2,3. A pictori al representation is give n below:

25% 25% 25% 25%

Data arrang ed in ascending ord er

Minimu Maximum
Value Value

Fig. 1.61.2 Structure of data to find quartiles.


Chapter 1: Basic concepts and mathematical notation 59

Note that the second quartile Q2 is a median. The first quartile QI is a median of
the data less than or equal to the second quartile Q2, and third quartile Q3 is the
median of the data more than or equal to the second quartile Q2' Thus finding three
quartiles needs to find median three times from the given ordered data. The
population interquartile range is defined as: 0 = (Q3 - QI)' The sample analogous of
population quartiles are called sample quartiles and are denoted by Qi, i = 1,2,3 and
sample interquartile range is defined as: <3 = (Q3 - QI)' which is a measure of
variation in the data set.

1.61.3 ~OI'JILATION.PERCENl'ILES

These are 99 measures, which divide the population into equal 100 parts. The { Ii
population percentile is represented by 11, i = 1,2,.... ,99 and its pictorial
representation is given below:

1% 1% 1%

Data arranged in ascending order

Fig. 1.61.3 Structure of data to find percentile.

Its sample analogous is represented by A, i = 1,2,....,99 .


1..61.4 POI'ULATIONMOJ)E 7 /~>t'

It is value which occurs most frequently in the population and is denoted by M 0 ,

and its sample analogous is called sample mode and is denoted by ifO' As an
illustration, for the data set 60, 70, 30, 60, 30, 30, 80, 30, the mode value is 30,
because it occurred most frequently .

1.62 DEFINITION(S»OESIATISTICS

There are several definitions of statistics and we list a few of them are as follows :

( a) It is a science to describe or predict the behaviour of a population based on a


random and representative sample drawn from the same population.
60 Advanced sampling theory with applications

( b ) The science of statistics is the method of judging collective, natural, or social


phenomena from the results obtained from the analysis or enumeration or collection
of estimates.
( c ) The science which deals with the collection , analysis and interpretation of
numerical data.

1.63 LIMITATIONS OF STATISTICS

A few limitations of statistics are:


( a ) Statistics does not deal with individual measurements. This is the reason we
need police to investigate individuals;
( b ) Statistics deals only with quantitative characters or variables, and we have to
assign codes to qualitative variables before analysis ;
( c) Statistics results are true only on an average ;
( d ) Statistics can be misused or misinterpreted. For example last year 90% of the
pedestrians who died in road accidents were walking on paths, so it is safer to walk
in the middle of the road.

1.64 LACK OF CONFIDENCE lNSTATISTICS

A few people have the following types of views in their mind about statistics:
( a ) Statistics can prove anyth ing;
( b ) There are three types of lies --- lies, damned lies, and statistics;
( c ) Statistics are like clay of which one can make a God or devil as he/she pleases ;
( d ) It is only a tool , and cannot prove or disprove anyth ing.

1.65 SCOPE OF STATISTICS

It has scope in almost every kind of category we are divided in this world due to our
social setup , for example, Trade, Industry, Commerce, Economics, Biology,
Botany, Astronomy, Physics, Chemistry, Education, Medicine, Sociology,
Psychology, Religious studies , Meteorology, National defence, and Business:
Production, Sale, Purchage, Finance, Accounting, Quality control , etc..

EXERCISES

Exercise 1.1. Define the terms population, parameter, sample, and statistic.

Exercise 1.2. Describe the advantage of a sample survey in comparison with a


census survey. Write the circumstances under which census surveys are preferred to
sample surveys and vice versa?

Exercise 1.3. Describe the relationship between the variance and mean squared
error of an estimator. Hence deduc e the term relative efficiency.
Chapter 1: Basic concepts and mathematical notation 61

Exercise 1.4. You are required to plan a sample survey to study the environment
activities of a business in the United States . Suggest a suitable survey plan on the
following points : ( a ) sampling units; ( b ) sampling frame ; ( c ) method of
sampling; and ( d ) method of collecting information. Prepare a suitable
questionnaire which may be used to collect the required information.

Exercise 1.5. Define population, sampling unit and sampling frame for conducting
surveys on each of the following subjects. Mention other possible sampling units,
if any, in each case and discuss their relative merits .
( a ) Housing conditions in the United States.
( b ) Study of incidence of lung cancer and heart attacks in the United States .
(c) Measurement of the volume of timber available in the forests of Canberra .
( d ) Study of the birth rate in India.
( e ) Study of nutrient contents of food consumed by the residents of California.
( f) Labour manpower of large businesses in Canada .
( g ) Estimation of population density in India.

Exercise 1.6. What do you understand by the following terms?


( a ) Unbiasedness; (b) Consistency; (c) Sufficiency; and (d) Efficiency.

Exercise 1.7. Define the following :


( a ) Sampling frame; (b) Sample survey design ; and ( c ) Nonresponse.

Exercise 1.8. Show that the sample variance s; = _1_ I(Yi -)if can be put in
n-I i=l
different ways as

and the sample covariance


1 n( -
sXY=--IIxi -XYi-Y-)
X
n- i=1
can be written as

1 n ] 1 [nL.XiYi- (fXiJ(fYiJj n(fXiYiJ -(fxiJ(fYiJ


-I [i=\
Sxy=-- "LJiYi-nxy
- - " 1=\ 1=1 1=\
(1=\) 1=\
n -1 i=\ -I
=-- .
n n n 11

Exercise 1.9. Show that the population mean square error


2 1 N( -\2
Sy=--IJj-Y)
N-1i=1
62 Advanced sampl ing theory with applications

can be put in different ways as

N _ N
N
"y J2
S;=_I_[Ir/ _Ny2]= _1_ Iy;2_~
( L../

N - 1 i~ \ N -1 i ~\ N N(N -I)

and the population cova riance


S
xy
N(
N -1 i~1 I
-X -)
= -1 - I X -X y. -y
I

can be written as

rN (~ X;)( ~ lI )1 N(.~ Xill )- (~ X;)( ~ lI )


S = - I - [NIX-Y - N X
--]
Y = -I - IX-Y 1=1 1=\ 1= 1 1=\ 1=\
v N - I ; =1 / 1 N - I ;=1 / / N N (N - I ) '

Exercise 1.10. Construct a sample space and tree diagram for each one of the
following situations:
( a) Toss a fair coin; ( b ) Toss two fair coins ; ( c ) Toss a fair die; ( d ) Toss a fair
coin and a fair die; (e) Toss two fair dice ; and (f) Toss a fair die and a fair coin .

Exercise 1.11. State what type of variable each of the following is. If a variable is
quantitative, say whether it is discrete or continuous; and if the variable is
qualitative say whether it is nominal or ordinal.

I Religious preference.
2 Amount of water in a glass.
3 Master card number.
4 Number of students in a class of 32 who turn in assignments on time.
5 Brand of personal computer.
6 Amount of fluid dispensed by a machine used to fill cups with chocolate.
7 Number of graduate applications in statistics each year at the SCSU .
8 Amount of time required to drive a car for 35 miles.
9 Room temperature recorded every half hour.
10 Weight ofletters to be mailed .
11 Taste of milk.
12 Occup ation list.
13 Coded numbers to different colors, e.g., Red--l , Green--2, and Pink--3 .
14 Average daily low temperature per year in the St. Cloud city.
15 Nat ional ity of the students in your University.
16 Phone number.
17 Rent paid by the tenant.
18 Frog Jump in ems .
19 Colors of marbles .
Chapter 1: Basic concepts and mathematical notation 63

20 Number of mistakes in the examination.


21 Time to finish an examination.
22 Shoe number.
23 Gender of a student.
24 Discipline of a student.
25 Rating of a politician: good , better or best.
26 Sum of two real numbers.
27 Sum of two integers (or whole numbers).
28 Number of passengers in a bus.
29 Age of a patient.
° °
30 Age groups, e.g., to 5 years , 6 to 1 years etc.
31 Area code .
32 Postal code .
33 Product of two pos itive real numbers .
34 Length of a string in ems,
35 Height of door in feet.
36 Weather conditions: good, better and best.
37 Average of real numbers .
38 Proportion of red balls in a bag .
39 Number of e-mail accounts.
40 Number of questions in an examination .

PRACTICAL PROBLEMS

Practical 1.1. From a population of size 5 how many samples of size 2 can be
drawn by using ( a ) SRSWR and ( b ) SRSWOR sampling?

Practical 1.2. Mr. Bean selects all poss ible samples of two units from a population
consisting offour units viz. 10, 15, 20, 25 by using SRSWOR sampling. He noted
that the harmonic mean of this population is given by

y{ -101 +-151 + -20II


+ - }= 15.58442.
Hy =N
I IN - 1 = 4
i~ ' Yi 25

The total number of possible samples =N CI/= 4CZ = 6 and these samples are given by

(10, 15), (10, 20), (10, 25), (15, 20), (15, 25) and (20, 25) .
The harmonic means for these samples are, respectively, are

H,=n
, / I1/ -=21 y{-10II}
i~' Yi 15
1 y{ -10II}
+ - =12, ' / I -=2
Hz=n +- =13 .33333
1/

20
i~ IYi

Mr. Bean took the harmonic mean of these six sample harmonic means, as follows :
64 Advanced sampling theory with applications

HM = _ 6 _= 6
6 1
- ,- {-1 + 1 + 1 + 1 +- I-+I }
i~IHi 12 13.33333 14.28571 17.14286 18.75 22.22222

= 15.58442 = H y .

Then Mr. Bean made the following statements.

(a) Sample harmonic mean is an unb iased estimator of population harmonic mean.

( b ) The expected value of sample harmon ic mean ill is defined as

E(iI)=ts(nJ ~) = Hv :
I

1=1 HI

Do you agree with him? If not, why?

. .
Hmt. Expected value . E HI =
. (,)
z:
s(o ) '
PI H I
.
with PI
1 ( ) ( L N
=- '<f t = 1,2,oo ., s nand s n F ell '
1=1 6

( c ) Find the bias, var iance, and mean square error in the estimator ill .
( d ) Does the relation MSE(ill)= V (ill)+ {s(il l )}2 hold?

Practical 1.3. Suppose that a population consists of 5 units given by : 10, 15, 20, 25,
and 30 .Select all possible samples of 3 units using SRSWR and SRSWOR
sampling.
( a ) Show that the sample mean is an unbiased estimator of population mean in
each case.
( b ) The sample var iance is unbiased estimator of the population variance under
SRSWR sampling, and for population mean squared error under SRSWOR
sampling.
( c ) Also plot the sampl ing distribution of sample mean and sample variance in
each situation.
( d ) Find the variance of sample mean under SRSWOR sampling using the
definition of variance? Show all steps .
( e ) Also compute ( N~ n )s; and comment on it.

r
Practical 1.4. Repeat Mr. Bean's exercise with the geometric mean (OM) and
comment on the results.

(}~l/i
n

Hint: The OM of n numbers Yl ,Y 2" " 'Y n is GM =


Chapter 1: Basic concepts and mathematical notation 65

Practical 1.5. If a random variable x foIlows a Poisson distribution , that is,


x - P(A) with A = 0.4 over N = 20 tria ls. Select a with replacement sample of
n = 5 units by using the method of cumu lative distr ibution function .
- A. ,;tt
Hi nt: The p.d.f. ofa Poisson random variab le x is given by p[X = x]= _e__ .
x!

P r actical 1.6. Suppose an urn contains N baIls of which Np are black and Nq are
white so that p + q = 1. The probability that if n baIls are drawn (without
replacement), exactly x of them will be black, is given by

such that 0:0; x :0; Np; and 0:0; n - x :0; Nq . Using the concept of c.d.f., select a
sample of three units by using without replacement samp ling.
Hi nt : Hypergeometric distrib ution .

Practical 1.7. If a discrete random variable Xhas a cumulative distribution


function :
0 for x < I,
1/3 for 1:o;x <4,
F(x) = 1/2 for 4 :0; x < 6,
5/6 for 6:0; x < 10,
1 for x ~ 10.
Se lect a sample of n = 5 units by using with replacement sampling.
Hint: Use random number table method .

P r actical 1.8. If the distribution function of a population consisting of N = 5 units


is give n by
x2 +5x
x =- - -
F () for x = 1, 2,3,4,5.
50
Draw a wit hout rep lacement sample of n = 2 un its.
Hi nt : Use random number tab le method .

Practical 1.9. If the distribution function of a cont inuous random variable x is


1
f(x) = [ )2 ]' -00 < x < + 00
Jr 1+ (x - 100

Use the first 6 col umns multiplied by 10- 6 as the values of the cumulative
distribution funct ion (c.d.f.) F(x) of the random variable x , and select a random
samp le of IS units by using with replacement sampling.
Hint: F(x) = 100+tanHF(x) -0.5)].
66 Advanced sampling theory with applications

Practical 1.10. If the distribution function ofa continuous random variable x in a

rl,
population is given by

f(x) = ~() eXPj_(x: tJ - 00 < x < +00

with tJ = 100 and a = 2.5 .Use the first 6 columns multiplied by 10- 6 as the values
of the cumulative distribution function (c.d .f.) F(x) of the random variable x, and
select a random sample of 15 units by using with replacement sampling.
Hint: x = tJ+Z() and z - N(O,I).

Practical 1.11. Find the value of c so that


cx (x - y ) for O<x<l, - x < Y < +x,
f (x,y ) =
{o .
otherwise,
becomes ajoint probability density function . Select a sample of 11 = 5 units by
using Random Number Table method.
l+x
Hint: f ff(x,y)iydx = I or Freund (2000) .
O-x

Practical 1.12. In the hope of preventing ecological damage from oil spills, a
biochemical company is developing an enzyme to break up oil into less harmful
chemicals. The table below shows the time it took for the enzyme to break up oil
samples at different temperatures. The researcher plans to use these data in
statistical analysis:

( a ) If you are a consultant which variable you will consider dependent and
independent? Denote your dependent variable by Y and independent variable
with X .
( b ) Assuming that these six observations form a population, compute the following
parameters:
- - 2 2 _ Sy _ St _ Sxy _ Sty _ Cy
Y ,X, Sy,Sx' Sy,Sx,Cy-~,Ct-~,Sxy,Px
y--- ,f3--2 andK-pxy- .
Y X SxSy c, s;

Practical 1.13. Consider the following population consisting of 5 units A = 10,


B = 20, C = 25, D = 50, and E = 4.
( a ) Compute the population harmonic mean .
( b ) Select all possible SRSWOR samples each consisting of 3 units .
( c ) For each sample of 3 units, obtain estimate of the population harmonic mean .
( d ) How many sample harmonic means are less than population harmonic mean?
( e ) Find the bias in the sample harmonic mean.
Chapter 1: Basic concepts and mathematical notation 67

( f) Find the variance of sample harmonic mean by the definition.


( g ) Find the MSE of the sample harmonic mean by the definition.
( h ) Does the relation MSE(Hs )=V(H s)+~(Hs )}2 hold?
Practical 1.14 . Consider a population consisting of 15 countries as listed in the
following table, and also gives the hypothetical suicide rates in these countries per
100,000 persons .

Sr.'No5 Country ,t'; 'i C , \',$ uicide rate (%)


1 Australia 22
2 Austria 55
3 Canada 29
4 Denmark 59
5 France 35
6 Ireland 10
7 Israel 12
8 Japan 35
9 Netherlands 20
10 Norway 26
11 Poland 26
12 Sweden 39
13 Switzerland 55
14 United Kingdom 18
15 United States 25
1. Compute the following parameters :
( a ) Population mean;
( b ) Population range;
( c ) Population variance;
( d ) Population standard deviation ;
( e ) Population mean square error;
( f) Population coefficient of variation;
( g ) Proportion of countries having suicide rate more that 25%.

2. (a) Select an SRSWOR sample of 5 units using Random Number Table


method (Rule: Start from 1st row and 3rd column of the Pseudo-Random
Number Table 1 given in the Appendix).
( b ) Estimate the population mean and population total.
( c ) Compute the variance of the estimator of population mean.
( d ) Estimate s;.
( e ) Estimate the variance of the estimator of population mean.
( f) Construct 95% confidence interval of the population mean assuming that
population mean square is known and sample size is large. Does the
population mean falls in it? Interpret it.
68 Adv anced sampling theo ry with appl ications

( g ) Con struct 95% confidence interval assuming that population mean square
is unknown and sample size in small. Doe s the population mean falls in it?
Interpret it.
( h ) Find the variance of estimator of proportion of countries having suicide
rate more than 25%.

Practical 1.15. Consider a popul ation cons isting of the follow ing six units:

Now con sider the follo wing sampling plan :


Sample No. Samples Prob ability
PI
1 A, C,E 1/9
2 A, C. F 1/9
3 A,D, E 1/9
4 A,D, F 1/9
5 B,C,E 1/9
6 B, C,F 1/9
7 B,D,E 1/9
8 B,D,F 1/9
9 C, D, F 1/9

Compute the following:


( a ) £(YI ),V(Yt ), S(Yt), and MSE(Yt) ·
( b ) Does the relation MSE(Yt) = V(Yt)+ {S(Yt )}2 hold s?

Practical 1.16. For a bivariate data of n = 10 pairs of observations we are given


fI n 2 n
LXi =5 7, L Yi = 263 , and L XiYi = 299.
i=1 ~I ~I

Assume that these 10 observations form a sample compute the following stat istic:
- ·, x- ·, s2y '. s2x '. S ' S . C'y -----=-
_ Sy . C' _
, x '" -=- , xy '.
Sx . S rxy -- Sxy . b_ s xy and
Y y ' x' -- , --
2
Y x SxSy Sx

Practical 1.17. The follow ing data show s the daily temp eratures in Ne w York over
a period of two weeks:
Chapter I: Basic concepts and mathematical notation 69

Find the following: sample size; sample mean; median; mode; first quartile ; second
quartile; third quartile; minimum value; maximum value; and interquartile range .

Practical 1.18. Construct scatter diagrams and find the linear correlation coefficient
in each one of the following five samples each of five units and comment on the
different situations will arise:

Practical 1.19. The following balloon is filled with five gases with their different
atomic number and atomic weights.
La a.a ._ • .L.a. a .LL .......

·
·
·

··
................ . ·-
70 Advanced sampling theory with applications

( I ) Sampling distribution of atomic weight of gases:

( a ) Find the average atomic weight of all the gases in the balloon;
( b ) Find the population variance 0- 2 of atomic weight of all the gases in the
balloon ;
( c ) Select all possible with replacement samples each consist ing of two gases;
( d ) Estimate the average atomic weight from each one of the 25 samples;
( e ) Construct a frequency distribution table of all poss ible sample means ;
( f) Construct an histogram. Is it symmetric?;
( g ) Find the expected value of all sample means of atomic weights from the
frequency distribution table you developed ;
( h ) Find the variance of all the sample means of atomic weights from the
frequency distribution table you developed.

( II ) Sampling distribution of proportion of inert gases in the balloon:

( a ) Find the proportion of inert gases in the balloon, and denote it by P ;


(b) Select all possible with replacement samples each consist ing of three gases;
( c ) Estimate the proportion of inert gases in each sample ;
( d ) Construct a frequency distribution table of all possible sample proportions of
inert gases ;
( e ) Construct an histogram. Is it symmetric?
( f) Find the expected value of all sample proportions of inert gases from the
frequency distribution table you developed;
( g ) Find the variance of all the sample proportions of inert gases from the
frequency distribution table you developed.

Practical 1.20. Consider a sample Y \,Y2" "'Yn and let Y k and s; denote the sample
mean and variance, respectively, of the first k observat ions .
( a ) Show that
2 (k - J) 2 J ( - )2
s k+! = - - S k + - - Y k+! - Yk .
k k+l
( b ) Suppose that a sample of 15 observations has sample mean and a sample
standard deviation 12.60 and 0.50, respectively. If we consider 16th observation of
the data set as 10.2. What will be the values of the sample mean and sample
standard deviation for all 16 observations?
2. SIMPLE RANDOM SAMPLING

2:0 INTRODUCTION

Simple Random Sampling (SRS) is the simplest and most commo n method of
selecti ng a sample, in which the sample is selected unit by unit , with equa l
probability of selection for each unit at eac h draw. In other words, simple random
sampling is a method of selecting a sample s of II units from a popul ation n of
size N by giving equal prob abilit y of selection to all units. It is a sampling scheme
in whic h all po ssible combinations of II units may be formed from the popul ation
of N units with the same chance of selection.
As discussed in chapter I:
( a ) If a unit is selected, observed, and replaced in the popul ation before the next
draw is made and the procedure is repeated n times, it gives rise to a simple
rando m sample of II units. Thi s procedure is kno wn as simple rando m sampling
with replacement and is denoted as SRSW R.
( b ) If a unit is selected, observed , and not replaced in the popul ation befor e
makin g the next draw, and the procedure is repeated until n distin ct units are
select ed, ignoring all repetition s, it is called simple random sampling without
replac ement and is denoted by SRSWOR. Let us discuss the properties of the
estim ator s of population mean, variance, and proportion in each of these cases.

2.1 SIMPLE RANDOM SAMPLING WITH REPLACEMENT

Suppose we select a sample of II ~ 2 units from the population of size N by using


SRSWR sampling. Let Y i ' i = 1,2 ,..., 11, denote the value of the i''' unit se lected in the
sample and Yi , i = 1,2,...,N , be the value of the i''' unit in the popul ation . Then we
have the followin g theorems :

T heorem 2.1.1. The sample mean y" = 11 -1 I Yi is an unbiased estimator of the


i= 1
_ I N
population mean Y = N- I Yi .
i=1

Proof. We have to prove that £(yJ = Y. Now we have

_ [I"] I"
£V,,) = £ - I Yi =- I £(yJ .
II i =1 II i=\
(2 .1.1)

Now Yi is a random variable and each unit has been selected by SR SWR sampling,
therefore Yi can take value s JI,Yz"" ' YN with prob abilities l/ V , l/ N, ...,!/N . By
the definition of the expected value we have

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003
72 Advanced sampling theory with app lications

I N -
E(Yi) = - L}j = Y .
Ni=1
Thu s (2. 1.1) impl ies

II
[I
E(YII ) =-InL - NLYi] =- LY
n
=Y .
i= 1 N i= l
I
Ili=l
c: -

(2.1.2)

Hence the theorem.

Corollary 2.1.1. The estimator YII = NYII is an unbiased estimator of population


tota l, Y .

Proof. We have
E(YII) = E[NYn ]= NE(Yn) =NY =Y (2.1.3)
which proves corollary.

Theor em 2.1.2. The varia nce of the estimator y" of the population mean Y is

- -I 2 2 -I N - -I N 2 -2
V(YII) = II O"y' whe re O"y = N i~I(}j - Y)2 = N [ i~l}j - NY ] (2 .1.4)

Proof. Because of independence of draws, we have

V(Yn) = V( -I LYi
II J=2I LV(Yi)
II
· (2. 1.5)
II i=1 II i=1
By the defin ition of var iance we have
V(Yi)=E[Yi - E(Yi)]2 = E~l )- {E(Yi )}2 =.l. ~ Y? _ y 2
N i=l

=-I [NL}j2 - NY
-2 ] =-I N(
L }j - -\2
YJ = 0"Y2 .
N i=1 N i=1
Using (2 .1.5) we have V(y,,) = O"}/11 . Hence the theorem.

Theorem 2.1.3. An unbiased estimator of the varia nce v(y,,) is given by


• _ s~
V(YII) = - , (2 .1.6)
II

where
2 I n _ I n 2 -2
Sy =- - L(Yi -Yn)2 = - - [ LYi -llYn ] ·
I/-li=l II- I i= 1

Proof. We have to show that E[v(yll)] = V(YII ). Now we have

E[v(y,,)]= E[S,~ ] = ~E(S}). (2.1.7)


II II

Note that
Chapter 2: Simple Random Sampling 73

2
E~,~ )= VVn )+y2= cry + y2
n
and

Es[y2]=E[1
- - (n'IYi2- nYn2)] =-- n 2-nYn2] =--
1 E[ 'IYi n 2-nYn2]
1 E[n- 'IYi
n- 1 i= \ n-I i= \ n- 1 n i=1

[1
_- -n- - 'In El)'i
n - 1 1/ i=1
(-2)~ -_ - n- [1"(1
( 2)- El)'n NY;2J - [cr;
- L- L
1/ - 1
- +Y-2)]
1/ i=\ N i= 1 1/

=_'_
'
1/ -
[J.- 2: Y/ - y2_cr; ]=_'_
1 N i=\
' [cr~ - cr.~ )
n- 1 )
1/ 1/
= cr 2.
)

From (2.1.7) we have

E[v(y,J= ~E(s;)
n
= cr;n = V(y,,) .
Hence the theorem.
Corolla ry 2.1.2. The variance of the estimator Yn = NYn of the popul ation total is
V(y,,) = N2V(y,,) .
T heorem 2.1.4 . Unde r SRSWR sampling, while estimating population mean (or
total) , the minimum sample size with minimum relati ve standard error (RSE) equal

1
to ¢ , is given by

n ~ [;,; , (218)

Proof. The relat ive standard error of the estimator Y" is given by

RSE(y,, )= ~v(y,, )/{EV,, )}2 = cr;/~,y2) . (2.1.9)

We need an estimator Yll such that RSE(y,, ) ~ rjJ , which implies that
2 2
cr; /~I P) ~rjJ, or cr! 2 ~rjJ2, or n e ;~2 '
1/Y rjJ Y
Hence the theorem.

Remark J.
2.1: If rjJ =( YZ:/2 with = Za/2 e j; then p[I(YIl; y)! ~ e) = 1- a.
Example 2.1.1. In 1995, a fisherman selected an SRSWR sample of six kinds of
fish out of 69 kind s of fish ava ilable at Atlantic and Gul f Coasts as give n below :

Kind offish Saltw ater White Blue Scup Summ er Scup


.- catfish perch runner flounder
No . offish 13859 3489 2319 3688 16238 3688
74 Advanced sampling theory with applic ations

( a) Estimate the average number of fish in each species group.


( b) Construct a 95% confidence interval for the average number of fish in each
speci es group
( c) Estimate the total number of fish at Atlantic and Gulf Coasts during 1995 .
( d) Construct a 95% confidence interval for the total number of fish at Atlantic
and Gulf Coasts during 1995.

Solution. We are given N = 69 and n = 6. From the sample information we have

13859 192071881
2 3489 12173121
3 2319 5377761
4 3688 13601344
5 16238 263672644
6 3688 13601344
.Sum 500498095'

( a ) Thus the average number of fish in each species group is given by


I n I 6 43281
Yn = - L: Yi = - L: Yi = - - = 7213.5.
ni=1 6 i =1 6

( b ) A (I - a)1 00% confidence interval for the population mean Y is given by


Yn ± (a/2(df = n -I )Jv(Yn)

where V(Yn) = s;/n.Now we have


sJ = _I_f I i _n-1( IyiJ2) = _1_[500498095- (43281)2] = 37658120.3.
n-Ili=1 li=l 6-1 6
Thus

V(Yn) = s;
= 37658120.3 = 6276353.38.
n 6
Using Table 2 from the Appendix the 95% confidence interval for the average
number of fish is given by
Yn± (O.05/2(df = 6 -1)Jv(Yn) , or 7213.5 ± 2.571.J6276353.38, or [772.46, 13654.53] .

( c ) An estimate of total number of fish is given by


y = NYn = 69 x 7213.5 = 497731.5 .
( d ) The 95% confidence interval for the total number of fish is given by
N x [772.46,13654.53] , or 69 x [772.46,13654.53] , or [53299.7, 942162.5] .
Chapter 2: Simple Random Sampling 75

Example 2.1.2. We wish to estimate the average number of fish in each one of the
species groups caught by marine recreational fishermen at the Atlantic and Gulf
coasts. There are 69 species groups caught during 1995 as shown in the population
4 in the Appendix. What is the minimum number of species groups to be selected
by SRSWR sampling to attain the accuracy of relative standard error 30%?
Given: sJ; = 37199578 and Y = 311528 .

Solution. We are given N = 69, SJ; = 37199578 and Y = 311528, thus


f .z..
N
311528 =4514.898
69
and
0'; = (N-l)SJ;
N
= (69-1) x37199578 = 36660453.68.
69
For ¢ = 0.30 either we are estimating population total or population mean, the
minimum sample size for the required degree of precision is given by
0';] 36660453 .68
n [
e ¢2f2 = 0.32 x (4514 .898f 19.98",20 .

Thus a sample of size n = 20 units is required to attain 30% relative standard error
of the estimator of population mean under SRSWR sampling.

Example 2.1.3. Select an SRSWR sample of twenty units from population 4 given
in the Appendix . Collect the information on the number of fish during 1995 in each
of the species group selected in the sample. Estimate the average number of fish in
each one of the species groups caught by marine recreational fishermen at Atlantic
and Gulf coasts during 1995. Construct the 95% confidence interval for the average
number of fish in each species group available in the United States.

Solution. The population size is N = 69, thus we used the first two columns of the
Pseudo-Random Number (PRN) Table 1 given in the Appendix to select 20 random
numbers between 1 and 69. The random numbers so selected are 58, 60, 54, 01, 69,
62,23,64,46,04,32,47,57,56,57,60,33,05,22 and 38.

01 Sharks, other 2016 -3977.25 15818517.560


04 Eels 152 -5841.25 34120201 .560
05 Herrings 30027 24033 .75 577621139.100
22 Crevalle jack 3951 -2042.25 4170785 .063
23 Blue runner 2319 -3674 .25 13500113.060
32 Yellowtail snapper 1334 -4659.25 21708610.560
33 Snappers , others 492 -5501.25 30263751.560
Continued .......
76 Advanced sampling theory with applications

38 Pinfish 16855 10861.75 117977613.100


46 Spot 11567 5573.75 31066689.060
47 Kingfish 4333 -1660 .25 2756430 .063
54 Tautog 3816 -2177 .25 4740417 .563
56 Wrasses , other 185 -5808 .25 33735768.060
57 Little tunny/ Atl bonito
782 -5211 .25 27157126.560
57 Little tunny/ Atl bonito
782 -52 11.25 27157126 .560
58 Atlantic mackerel
4008 - 1985.25 3941217.563
60 Spanish mackerel2568 -3425 .25 11732337.560
60 Spanish mackerel2568 -3425.25 11732337.560
62 Summer flounder16238 10244.75 104954902.600
64 Southern flounder
1446 -4547 .25 20677482 .560
69 Other fish 14426 8432.75 71111272.560
~ ~~~ 7 Suhi ~r,O O;) 1.1 6S943840IQQOclo
An estimate of the average number offish in each species group during 1995 is:
Yn = ~ IY; = 119865 = 5993.25.
II ;=1 20
Now
s; = _1_- 1 I (v; - Yn f = 1165943840
II ;=1 20-1
= 61365465.26,
and the estimate of variance of the estimator Yn is given by
2
v(y,,) = sy 61365465.25 = 3068273.26.
II 20
A (1- a)lOO% confidence interval for the average number of fish in each one of
the species groups caught during 1995 by marine recreational fishermen in the
United States is
Y" =Ffa / 2(df = II - 1).jv(y,, ).
Using Table 2 from the Appendix the 95% confidence interval is given by
Yn =FfO.02S(df = 20-1).jv(y,,) , or, 5993 .25+2.093~3068273.26
or [2327.05, 9659.45] .

Example 2.1.4. The depth y of the roots of plants in a field is uniform ly distributed
between 5cm and 8cm with the probability density function

f(y) = -1 V5<y<8•
3

We wish to estimate the average length of roots of the plants with an accuracy of
relative standard error of 5%, what is the required minimum with replacement
sample size n?
Chapter 2: Simple Random Sampling 77

Solution. We know that if y has a uniform distribution function


f(y) = - 1 \;f a < y < b•
b-a
Then the population mean
y = a + b = 5 + 8 = 6.5 ,
2 2
and the population variance
<J2 = = (8-5)2 =~=0.75.
(b-a)Z
y 12 12 12
We need ¢ = 0.05 , thus the required minimum sample size is given by
<J; 0.75
n '? 2-2 = 2 2 = 7.1'" 7.
¢ Y 0.05 x 6.5

Example 2.1.5. The depth y of the roots of plants in a field is uniformly distributed
between 5cm and 8 em with the probability density function

f (y) =.!. \;f 5< y<8•


3

Select a with replacement sample of n = 7 units. Construct a 95% confidence


interval to estimate the average depth of roots.

Solution. The cumulative distribution function (c.d.f.) is given by


Y Y1 (y - 5)
F(y)=P[Y~ y]= JJ(y}ty= F-dy=--
5 53 3
which implies that y = 3F(y)+ 5. We used the first three columns multiplied by
10- 3 of the Pseudo-Random Number (PRN) Table I given in the Appendix to select
seven values of F(y) and the required sampled values are computed by using the
relationship y = 3F(y)+ 5 as follows:

0.992 7.976 63.61658


0.588 6.764 45.75170
0.601 6.803 46.28081
0.549 6.647 44.18261
0.925 7.775 60.45063
0.014 5.042 25.42176
0.697 7.091 50.28228

Thus an estimate of the average depth of the roots is given by


78 Advanced sampl ing theory with applic ations

Yn =..!-i:Yi = 48.098 = 6.8711.


n i=1 7
To find sample variance s~ we apply here alternat ive method as
2
s; =_n -I_I !i:Y1 - n-
2
1(
i:Yi J j =_1_[335 .9864 - 48.098 ] = 0.9164,
i=1 i=l 7 -I 7

and the estimate of variance of the estimator Yn is given by


V(Yn) = sn; = 0.9164
7
= 0.1309.

A (I - a )100% confidence interval for the average depth of roots in the field is
Yn=+= ta/2(df = n -INv(y,J .
Using Table 2 from the Appendix the 95% confidence interval estimate of the
average depth of the roots is given by
Yn =+= to.025(df = 7 -I NV(Yn), or 6.8711+ 2.447~0. 1 309 , or [5.9857, 7.756] .

Theorem 2.1.5. The covar iance between two sample means Yn and xn under
SRSWR sampli ng is:
_ _) O'xy
C OY( Yn'Xn = -, (2.1.10)
n
where
0'xy = NI N -XXi - Xr) .
2: (Jj - Y
i= l

Proof. We have
- ,X-n) =C OY(I- In Yi,-InIXiJ= 2""
COY(Yn I ICOY(Yi
n
' X;) , (2.1.11)
n i=1 n i=l n i=l
Now
COY(Yi 'Xi )= E(YiXi) - E(Yi )E(Xi) . (2.1.12)
The random variable s (YiXi)' Yi and xi , respectively, can take anyone of the
value (JjXi ), Y.I and x.I for i = 1,2,...,N with probability 1/ N . Thus we have

C OY(Yi' Xi) =E(YiXi )-E(Yi )E(Xi) = NI IJiXi N J(IN IX


N - (IN IY; N iJ
1=1 1=1 1=1

I N
=-IYX- -X
Y I N(
- =-2: Y,. - -X -) =0'
y X -X .
N i= l l I N i=l 1 I xy

On substituting it in (2.1.11) we obtain


_ _) O'xy
COY(Yn , X n = - .
n
Hence the theo rem.
Chapter 2: Simple Random Sampling 79

Theorem 2.1.6. An unbiased estimator of Cov(Yn, xn) is given by


' (_ _) Sxy
COy Yn , X n =- , (2.1.13)
n
where
s xy = 1 In (Yi
--1 - Y-XXi - r)
x.
n- i=!
Proof. We have to prove that Elsxy)= axy . Now we have
1-
E(sxy)= E[_l_l I(Yi - yXXi -x)] = E- {IYiXi - nynxn}
n- i=\ n- 1 i=!

=_1_{
n i=!
IE(YiXi)-nE(Ynxn)} =
-1
_1_{i:...!.- IJiX
n -1 i=1 N i=1
i - n(Cov(Yn'xn)+ Y x)l
f
=_1_{~ ~}jXi _n(axy +y xJ} =_n_{...!.- ~}jXi -yx _axy}
n -1 N i=! n n -1 N i=l n

n- {O"xy---
O"x y}
=- =O"xy ·
n-l n
Hence the theorem.

Suppose we selected a sample of n ~ 2 units from the population of size N by


using SRSWOR sampling. Let Yi' i = 1,2,...,n denote the value of the {h unit
selected in the sample and Y., i = 1,2,...,N be the value of the /h unit in the
I
population . Then we have the following theorems:

_ -1 n
Theorem 2.2.1. The sample mean Y n =n L Yi is an unbiased estimator of the
i=1
- IN
population mean Y = N- Lli.
i=1
Proof. We have to show that E(Yn) = Y. It is interesting to note that this result can
be proved by using three different methods as shown below.

Method I. Note that the estimator Yn = n- 1 IYi can easily be written as


i=1
_ 1 N
Yn =- Ifill (2.2.1)
n i=1
where fi is a random variable defined as:
f. = {I if the ith unit of the population is in the sample, i.e., if i E S,

I 0 otherwise.
80 Adva nce d sampling theory with app lications

Note that Yi is a fixed value in the popul ation for the i''' un it, therefore, the
expec ted value of (2.2 . I) is give n by

E(YII) = E[~ ~ti}j] = ~E( ~ti}j) = ~[ ~ }jE(ti)]' (2.2.2)


II i ~1 II i~ 1 II i ~1

Not e that (N- I)C(II_l) is the numb er of samples in which a given population unit can
occur out of all NC II SRSWOR samples, and therefore the prob ab ility that i'''
(N-I )C
pop ulation unit is selected in the sample is = N (II-I) =.!!...- So the random
CII N

variab le t i takes the value I with probability ~ and 0 with pr obab ility (1-~ ).
Thu s the expected value of t, is

E(d = ~x 1+ (I - ~)x 0= ~ . (2.2.3)


From (2 .2.2) we have
_
E(YII) =-1 [ IN E(t;)}j ] =-1 IN-II }j =-1 I}j
N -
=Y .
II i~ 1 II i~ 1 N N i~ 1
Hence the theorem.

_ [I I
Method II. We can also prove the same result as foIlows

E(YII) = E - III Yi] = - III E(yJ . (2.2.4)


II i~ 1 II i~ 1

In (2 .2.4) the sample value Yi is a random vari able and can take any population
value lj , i = 1,2,...,N , with probabil ity 1/N .
Thu s we have
1 N -
E (Yi) = - L}j = Y.
Ni~ 1

From (2.2.4) we have


I II ( ) I II - -
E()
YII =- I EYi =- I Y =Y.
II i ~1 II i ~ 1

Hence the theorem.

Method III. To prove the above result by another method, let us consi der
(YII)I = sample mean YI based on the tl" sample selected from the pop ulation.
Note that there are N C II possi ble samples, the probability of selecting the l" samp le
IS

PI = I/(NCII ).
By the defin ition of expec ted value, we have
Chapter 2: Simple Random Sampling 81

L LY; J
C
=
I
N
N " ( ,,
n( C,,) 1;1 ;;1 1

Hence the theorem .

Solution. Let us consider a population consisting of four units A, B, C, and D.


The number of possible samples without replacement of size two is NCn = 4Cz = 6 .
Let }], Yz , Y3 and Y4 be the values of four units A, B, C, and D respectively, in
the population . Let vt denote the value of the /h unit selected in the sample. Then
we have the following situation:

Sam Ie no 2 3 4 5 6
Sampleduni ( A ,B) (A,C) (A,D) (B,C) (B,D) (C,D)
PopQla@nuilit~J. (}] , Yz ) (}], Y3) (}], Y4 ) (Yz , Y3) (Yz , Y4 ) ( Y3 , Y4 )

The values of the units in the sample in all these cases are YI and yz.

Thus we have

N~"(.Iy;) = f (l.y;) = f (YI + Yz)1


1;1 /;1 1 1;1 /; 1 1 1;1

=~+Yz~+~+Yz~+~+Yz~+~+Yz~+~+Yz\+~+Yz~
= (}] + Yz)+(}] + Y3)+ (}] + Y4)+ (Yz + Y3)+ (Yz + Y4)+ (Y3 + Y4 )

= 31) +3Yz +3Y3 +3Y4 =3CI1)+3qYZ+3CIY3+3qY4

=(4-1)qz-I)}] +(4-1)qz-I )Yz +(4-1)qZ-I)y3 +(4-1 )qZ -l)Y4


=(N-I)q,,_I)}]+(N-t)q,,_I)YZ+( N-I)q,, _I)y3+(N-I)q,, _I)Y4

= f {(N-I)q,,_I)}(Ji) .
;;1

Theorem 2.2.2. The probability for any population unit to get selected in the
sample at any particular draw is equivalent to inverse of the population size, that is,
Probabil ity of select ing the i1h unit in a sample = ~. (2.2.5)
N
82 Adva nced sampling theory with applications

Proof. Let us consider that at the r'" draw, the i''' popul ation unit Yi is se lected. Th is
is poss ible only if this unit has not been selec ted in the previous (r- I) draws . Let
us now consider the draws one by one.
First draw: The probab ility for the particular unit' }j , to get selected on the first
draw out of N units is = 1/N . Note that ' the probability ' that }j is not selected on
the first draw, from a popu lation of N units is = {1- 1/N } = (N - 1)/ N .

Second draw: The probabil ity that a particular unit is selected on the second dra w
(if it is not already selected on the first draw) is the product of two prob ab ilities,
namely
(Probability that }j is not selected on the first draw) x (Probability that }j is
selec ted on the second draw)
Therefore the prob ab ility that }j is selected on the seco nd draw is equal to
(N - I ) 1 1
- N - x(N_ I) = N

Note that the probabil ity the }j is not selecte d on the seco nd draw out of the
remai ning (N -I) popul ation units is equal to

1--- =--
1
(N - I)
N- 2
N- l '

Third draw : The pro babi lity that a particular popu lation unit is se lected on the third
draw (if it is not selected on the seco nd draw) is the product of three probabilities .
Probabi lity that }j is selected on the third draw (if it is not selected on first or
seco nd draw) is a prod uct of three probabilities as
(Probability that }j is not selected on first draw) x (Probab ility that }j is not
selected on second draw) x (Probability that }j is selected on the third draw)

= (~) x (~) x _1
N N- I N-2
=J.-.
N
Note that the prob ability that Y; is not selected on the third draw out of (N - 2)
population units is equal to

1-- -=-- .
1
N- 2
N -3
N- 2
Th is procedure continues up to (r - I) draws.

rth draw: Prob abil ity that }j is not selected up to (r - I) th draw is given by

- - x---x x
(N- I)
N
(N -2)
(N- I)
N- (r - I)
N- (r - 2)
=-- -
N- r+ 1
N
Probability that }j is selected at ,-t" draw [ass uming that it is not selected at any of
the prev ious (r - I) draws] is equa l to
Chapter 2: Simple Random Sampling 83

I I
N -(r-I) = N -r+1 .
So we obtain the probability of a particular unit Y; to get selected at the l' draw is
(N - r + I)
-'-----'- x
I
-
I
N (N -r+l) N
Hence the theorem.

Theorem 2.2.3. The variance of the estimator Yn is given by


V(Yn) = (1-f)
n
s; (2 .2.6)

where S'[;2 = -I - N(
2: Yi - Y-)2 and f = n/ N denote the finite population correction
N-I i=l
factor (f.p .c.).

Proof. We have

V(YII) = V(~ I.Yi) =


l n i=l vrl ~n iliY;) ,
i= J
(2 .2.7)

where Ii is a random variable that takes the value ' 1' if the {" unit is included in the
sample, otherwise it takes the value O. Note that the Jj is fixed for the {" un it in the
population we have

V(YII) = nl2 V(i~/iY; ) = n


l2 [~I V(/;Y; )\.~=I Cov&;Y; , liYi)]
= n\ [i~Y;2V(d \.~=1 Y;YiCOV&i,/J]. (202 .8)

In (2 .2.8) we need to determine the V(/J and CovV; , Ii). Note that the distributions
of t, and 11 are
I with probability n] N, and 12 = {I with probability (n/ N),
t, = { 0 with probability 1- (n/ N), , 0 with probability {I - (n/ N)}.
We have
vk) = E[/i - E(/iW= E~l)- {E(/i )}2

(2.2 .9)

The probability that both {" and r units are included in the sample IS

(N-2)C(n_2)
N =
n(n
(
- I)) an d ot herwise the pro babilitv
0 i
I ity IS 1-
n(n
(
-I)) ,there f ore
ell N N-I N N-I
84 Advanced sampling theory with app lications

I
I with probability /1((/1 - I)) .
N N- I
I;lj = 0 with probability { I - ~~:;_I?J
Now we have

COV/i,l ( )- E()
( j )=EI;lj /1(/1 - I)
li El( j ) = N(N - I)
(/IN )( N/I ) = Nn [ N- 1I- N
Il - Il ]

= n(n- N) =_.!!-( N - n J= n(N-n)


N2(N -I) N N(N -I) N2(N -I)" (2.2 .10)
Using (2.2.9) and (2.2.10) in (2.2.8) we obtain

VCYII)=~[~r;2{/I(N~n)}+
n N i=\
~j=\ r;Yj{-~(N-n)}]
N (N- I) i",

= (N-n)[~f2
c:
__I_ c: ~ fY .] (2.2 .11)
nN2 i=\ (N-I) i", j=\ I J '
Note that
-2 1 N 1 N 2 N
Y (N i=1I: YiJ2 = -N2 [ i=1I: Yi
= - + I:
ioto )= \
YiY) ] , (2.2 .12)

we obtain
N 2 -2 N 2
I r;Yj =NY -I r; ·
i"' j=\ i=1
On substituting (2.2.12) in (2.2.11) we obtain

VCYIl )= (N
fiN
-;1)[~r;2 __
(N
I_( N2y2 - ~ r;2J]
i= 1 i=1 -I )

=(N -2n)[~r;2 + _1- ~ r;2 _ ~ Y2]


nN i=\ N- 1i= 1 N- 1

=(N
/IN
-;1 )[(1 + _I_)~
N- I
r;2_ ~ y2]
N-I i= 1

= (N-;I)[~~r;2 _~Y2] = (N -fl)[_N_{ ~Y? _Ny2 }]


fiN (N- 1);=1 N- I nN 2 (N- I) i=1
_ N -Il S2 _ (1- f) S2
- ---;;;:;- Y - - fl - Y

where

Sy
2=-( -1) [N
Ir;2- NY-2].
N- I i= 1

Hence the theorem.


Chapter 2: Simple Random Sampl ing 85

Theorem 2.2.4. An unbiased estimator of the var iance V(Yn ) is given by


,(-) (1-fJ 2
v Yn = -n- Sy' (2.2.1 3)
whe re
2
Sy = - I - n 2-nYn
( LY; -2J and f =n/N .
n -I ;=1
Proof. We have to show that E[v(Yn )]=V(Yn) . Note that

E[v(Yn )] =E[ I ~f s~] = (I~fJE~~) . (2.2.14)

Now

E(s~)= E[_I{Iyr
n - I i= 1
-ny;}] =_1 [E( IYlJ -nE~; )l
n -I i= 1 J
= n~I[~i~/~l)- E~;)]. (2.2.15)

Not e that E~,~ )= V(Yn )+ {E(Yn)}2 = N- n s; + y2and each unit }j becomes selected
Nn
with probabil ity 1/ N , thus (2.2. I5) becomes

E(s 2)=_n [~ I (~ ~}j2 J - { (N -n) S 2 + y2 }]


~ y n- 1 ni= 1 i =1 N Nn y

=_n_[~ ~ y'2 _ y2 _ (N- n) S2] = _n_[~{ ~ }j2 _ Ny 2


n- I N i = 1 I Nn Y n-I N i =l
=_n_[ N -I { _ I_ ( ~}j2 _ Ny2 )}_ (N-n) s~ ]
n- I N N- I ;= 1 Nn )
=_n_[ N -I S2 _ (N-n) S2 ] = S2.
n-1 N Y Nn Y Y

On substituting this value in (2.2.13), we obtai n E[v(YI/ )]=V(YI/ ). Hence the


theorem .

Theorem 2.2.5. Under SRSWOR sampling, while estimating population mean (or
total), the minimum sample size with minimum relative standard error (RSE) equal
to ¢ , is

n> [~+ ¢2y2 ]-1 (2.2. I6)


N S y2

Proof. The relative standard error of the estimator Yn is given by


86 Advanced sampling theory with applications

(2.2.17)

Note that we need an estimator Yll such that RSE(y,, ) 5, ¢ , which implies that
- 1

( ~ - ~)
II
~}
N y2
5, ¢, or (~" - ~) 5-
N y2
5, ¢2, or
1 ¢ 2 y2
n >: [ N + s;]

Hence the theorem.

Rem ark 2.2: If ¢ = (YZ:/2) ,with e=Za/2 (I ~/\~ then p(! (Y"i Y)I : ; e]=I-a .

Exa mple 2.2.2 . A fishermen recruiting company, XYZ , sele cted an SRSWOR
samp le of six kinds of fish out of 69 kinds of fish avai lable at Atlantic and Gulf
Coasts as be low :

( a ) Estim ate the avera ge numb er of fish in each species group .


( b ) Construct a 95% confidence interval for the average numb er offish in each
species group
( c ) Estimate the total number of fish at Atlantic and Gulf Co asts dur ing 1995.
( d ) Con struct a 95% confidence interval for the total numb er of fish at Atlantic
and Gulf Coasts dur ing 1995.

So lutio n. We are given N = 69 and II = 6. From the sample information, we have

Samp le 2
Yi Yi
Unit
I 16855 284091025
2 10940 119683600
3 4793 22972849
4 2146 4605316
5 3816 14561856
6 935 874225
Sum 39485 446788871

( a ) Thu s the average number of fish in each species group is given by


_ I" 1 6 39485
y" = - L Yi = - L Yi = - - = 6580.83.
II i= 1 6 i=1 6
( b ) A (I - a)100% confidence interval for the population mean Y is given by
Chapter 2: Simple Random Sampl ing 87

y" ± la/2(df = n-I hJv{y,, ) , where v(y,,) =(1-nf )s~. .


Now we have

s; = _I_!IYT _
n - 1 i=l
n- 1( IYi)2) = _6 -I1_[ 446788871- (39485)2
i=l 6
j
= 37388933.4 .

Thus
v(Yn) = (I~/ }; = ( 1- 0;869 ) x 37388933.4 = 5689972.5 .

Using Tabl e 2 from the Appendix the 95% confidence interval for the average
number of fish is given by
Y,, ±lO.02S(df=6-I)Jv(y,,), or 6580.83±2.5nJ5689972.5 , or [448.05,12713 .61] .
( c ) An estimate of total number of fish is given by
y = Ny" = 69 x 6580 .83 = 454077.27 .
( d ) The 95% confidence interval for the total number of fish is given by
N x [448.05,1 2713.6 1], or 69 x [448.05,12713.61] , or [30915.45, 877239.09] .

Exa mple 2.2.3. We wish to estimate the average number of fish in each one of the
species groups caught by marine recreational fishermen at the Atlantic and Gulf
coasts. There were 69 species groups caught during 1995 as shown in the
popul ation 4 in the Appendix. What is the minimum numbe r of species groups to be
selected by SRSWOR sampling to attain the accuracy of relative standard error
30% ?
Gi ven: s; = 3719957 8 and Y = 311528.

Solutio n. We are given N = 69 , S; = 37199578 and Y = 4514 .898. Thu s for


¢ = 0.30, either we are estimating population total or population mean , the
minimum sample size under SRSWOR sampling for the required degr ee of
prec ision is

n ';? [~ + ¢2Y2]-1 = [~+ 0.32 x4514 .8982 j-l= 15.6 0::: 16.
N S2y 69 37199578

Thus a minimum samp le of size n = 16 units is required to attain 30% relat ive
standard error of the estimator of population tota l or mean under SRSWO R
samplin g.

Example 2.2.4. Select an SRSWOR sample of sixteen units from population 4


given in the Appendix. Collect the information on the number of fish durin g 1995
in each of the species group selected in the sample . Estimate the average number of
fish in each one of the species groups caught by marine recreational fishermen at
Atlantic and Gulf coast s during 1995. Construct 95% confidence interval for the
average number offish in each specie s groups available in the Uni ted States .
88 Advanced sampling theory with applications

Solution. The population size is N = 69, therefo re we used the secon d and third
columns of the Pseudo-Random Number (PRN) Table I given in the Appen dix to
select 16 random numbers between 1 and 69. The random numbers so selected are
01,49,25, 14,2~36,42 ,44,65 ,2~4~66, 17, 08, 33, and 53.

Rando m f Species group Yt.;/'·qI· (vi - Yn)


Y , ,0\

.y
~i ~ Yn ~ ,;,
No. ' ~Jt~"'" '~;
".
01 Sharks, other 20 16 -937.8 130 879492.2852
08 Toadfishes 1632 -1321.8100 1747188 .2850
14 Scu lpins 71 -2882 .8 100 83 10607.9100
17 Temperate basses, other 23 -2930.8 100 858966 1.9100
20 Sea basses, other 2068 -885 .8 130 784663 .7852
25 Florida pompano 644 -2309.8 100 5335233.7850
26 Jacks, other 1625 -1328.8 100 1765742. 6600
33 Snappers, other 492 -2461.8100 6060 520.7850
36 Grunts, other 3379 425. 1875 180784.4102
40 Red porgy 230 -2723 .8 100 74 19154.5350
42 Spotted seatro ut 246 15 21661.1900 469207043.9000
44 Sand seatrout 4355 1401.1880 1963326.4 100
49 Black drum 1595 -1358.8 100 184637 1.4 100
53 Barracuda 908 -2045.8 100 4185348.7850
65 Winter flounder 2324 -629.8130 396663 .7852
66 Flounders, other 1284 -1669 .8100 2788273.7850
.,.il< ,""",0 '.'\\ Sum ~ 47261
" 0.0000 52 1460078.4000

An estimate of the average number of fish in each species group during 1995 is
Yn =.!- fYi = 47261 = 2953.813 .
n i=l 16
Now
s; =-n-I-l f (vi - Yn)2 = 521460078.4 = 34764005.23.
16- 1
i=l
and the estimate of variance of the estimator Yn is
v(Yn) = C~f }; =C-:~69 Jx 34764005.23 =1668924.16.
A (I - a)I00% confidence interval for the average number of fish in each one of the
species grou ps caught during 1995 by marine recreational fisherme n in the United
States is
Yn+(a/2(df = n-1)Jv(Yn ).
Using Tab le 2 from the Appendix the 95% confidence interval is given by

Yn+ (o.o2s(df = 16 -1)Jv(Yn) , or 2953.81H 2.131.J1668924.16 , or [200.842, 5706.784] .


Chapter 2: Simple Random Sampl ing 89

Exam ple 2.2.5 . The distribution of yield (kglha) y of a crop in 1000 plots has a
Cauchy distribution:
1
f (y) = i }, - 00 < y < + 00 .
ll"ll+(y-IO)2

We wish to estimate the average yield with an accura cy of relati ve standard error of
0.15%. What is the minimum sample size 11 requ ired while using SRSWOR
sampling?

Solution. Since the true mean and variance of a variable having Cauch y distribution
are unknown, therefore it is not possible to find the required sample size under such
a distribution.

Exam ple 2.2.6 . The distribution of yield (kg/ha) y of a crop in 1000 plots has a
logistic distribution

f(y )= _1 sech2{~(~)}
4/3 2 /3.
with a. = 40 and /3. = 2.5.
( a ) Find the value of minimum sample size 11 required to estimate ave rage yield
with an accuracy of standard error of 5%
( b ) Select a sampl e of the required size and construct 95% confidence interv al for
the average yield .
( c ) Does the true average yield lies in the 95% confidence interval?
Solution. ( a ) We know that the mean and variance of a logistic distribution are
given by
Mean = a . = 40
and
.
Variance = O"y =
2 /3.2- ll"2
= 2.5 x
2 3.14159 2
= 20.56.
3 3
Also we are given N = 1000 thus
2 N 2 1000
S y = - - 0" Y = -
- - x 20.56 = 20.5806 .
N- I 1000-1
Thu s the minimum sample size required for 1ft = 0.05 is given by
2 2j-l
n e - I + -1ft2f2j
-
-l [ I
= - - +
0.05 x40
=5 .11 ,:::5.
[N S2
y
1000 20.5806

We know that the cumul ative distribution function for the logistic distribution is

F(y) = ±[I + tanh{ (Y2~:')}]


which implies that
y = a. + 2/3. tanh-I {2F(y )-I}= 40 + 5tanh- 1{2F (y )-I}.
90 Advanced sampling theory with applications

Using the last three columns of the Pseudo-Random Number (PRN) Table I given
in the Appendix, multiplied by 10-3, we obtain five values of F(y) and the
corresponding values ofy as given below:

0.072 36.460 1329.344


0.776 42.522 1808.111
0.406 39.071 1526.531
0.565 40.646 1652.128
0.108 36.675 1345.089

Thus an estimate of the average yield (kg/ha) of the crop is given by

h- =..!.-~
L~
. = 195.375 =39075
. .
n 1=1 5
We use the alternative method to find s; given by

s; = _1_[±Y1_
n -I
ny; ] = _1_[7661.202 - 5 x39.075
1=1 5-1
2]= 6.7309 ,

and estimate of the variance of the estimator Yn given by


v(Yn) = (I ~f}; = (1- 5~1000 Jx6.7309 = 1.339449.
A (I - a)1 00% confidence interval for the average yield of the crop is given by

Using Table 2 from the Appendix the 95% confidence interval is given by

Yn +lo.02S(df = 5 -1).jv(Yn) , or 39.075+ 2.776~1.339449 , or [35.8622,42.2877] .

( c ) Yes, the resultant 95% confidence interval estimate contains the true average
yield a. = 40.

Theorem 2.2.6. The covariance between the two sample means Yn and xn under

a,
SRSWOR sampling is:
- -) (1-
Cov (Yn' X n = - n - xy ' (2.2.18)
where
S
xy
=_l_~(y
N-l1:J
-YXx. -x).
I I

Proof. We have
Chapter 2: Simple Random Sampling 91

__
Cov(Y", x,,) = COY(I"
- LY; , -I"
II ;=1
L·r; )= COY(-IIIN;=1
II ;= 1
LI;Y; , -III ;=LI;X;
N )
1
(2.2. 19)

where I; is a rand om variable that takes the value' I ' if the { II unit is included in the

sample, otherwise it takes the value O. Note that the pair Y; and X ; is fixed for the

{ II unit in the popul ation, we have

Cov(y",x,,) = ~COV(II;Y;,
II 1=1
IliX; ) = ~[E{II;Y;}{II;X;}
1=1 II 1=1 1=1
- E{~I;Y;}E{It;X;}]
1=1 1=1

= ~[E{II?Y;X; + ,*. I l;ljY;Xj}-E{~I;Y;}E{II;X;}]


II 1=1 ;=1 1=1 1=1

= 11\ [t~IE~?~x; + i*~=1 E(I;lj){x j}-tEE(I; )Y; KEE(I;)X;}]- (2.2.20)

In (2.2.20) we need to determine the E~?) and E(I;Ij)' Note that the distributions
of and are :
{I
I; I?

I with pro bability 11/N , 2 with probability (11/ N),


and I· =
I; = { 0 with pro ba bility {I- (11/ N)}, 1 0 with probabi lity {I - (11/N )}.
We have

E~?)=I X ~ +O X (I- ~ ) = ~ . (2.2 .21)

The probability that both {II and /" units are included in the sample IS
(N-2)
N q ,,-2) =
c,
()
t
- I ) and otherwise the probability is I
II
N N- I
lit
(
)
-I ) , therefore
N N- I
11(11 -1)
with probabi lity N (N _I)'
I;lj = 11
o wit hprobabi lity {I N(N-I) .
II(II-I) }

Now we have
11(11 -1) { 11(11-1) } = 11(11-1)
E(
1;1)
j = 1x (
N N- I
) +0 I
N(N-I ) N(N - I)' (2.2.22)

Using (2 .2.2 1) and (2.2 .22) in (2.2.20) we obtain

Cov(y", x,,) = ~[{


II
IE~?~X; + ,*I; =1E(I;lj){Xj}-{IE(I;)Y;}{IE(I;)X;}]
1=1 1=1 1=1

= ~[{.~ E~l YiXi + . ~ EVil/YiX } -{.~ E(li)Yi}{.~ E(ldXi }]


II 1=1 I*- J= l
j
1=1 1=1
92 Advanced sampling theory with applicat ions

=1 n
- [{-LYX.+
N n(n -I) LN YX · } - {n
-LYN n N }]
}{ -LX.
n2 N i =1 1 1 N(N -IL;< j=\ t) N i =\ 1 N i = 1 1 . (2.2.23)

Note that

Y X = (.2.. .~ y;)(.2..N 1=1~Xi) = ~(


N 1=1 N 1=\~y;)( 1=1
~Xi) = ~[~Y;Xi
N 1=1 + 1;<)~=1 Y;X
j]
which implies that
N 2- - N
I Y;X j = N Y X- IY;Xi .
i;<j=1 i= \

Thus from (2.2.23) we have

Cov(Yn' x,J =~[{~IY;Xi+


n N
n~n-I\ I Y;Xj}-{~IY;}{~IXi}]
N N- 1 j=\
i= \ N N i;< i=\ i =\

=~[{~ ~Y;Xi + n(n -I) (N2y X - .~ Y;Xi)} - {~ .~ Y;}{~ ~Xi}]


n Nt=\ N(N - I) 1=1 N1 =1 Nt=1
= ...!-[{~_
2
n(n - I) } ~ y.x + Y x{n(n -I)N n2}]
n N N(N -I) i=\ t 1 (N -I)

= ~[~{~}
n N N- 1
~ Y;Xi - Y x {ntN-- I))}]
i =\

=~~{N-n }[~Y;Xi-NY x] = N-n_I_[ ~Y;Xi-NY x]


n N N -I i=\ nN N -I i =\

=(-
n
f)
N (Y;-Y Xi-X = -
I - - -1- I
N - I i=\
1- - St,. -x -) ( f) n }
Hence the theorem.

Theorem 2.2.7. An unbiased estimator of the covariance Cov(y,1' xn) is given by

COV(YII' C
xn )= ~f}XY (2 .2.24)

where

Sxy = -1-[fYiXi-nynXn].
n-I i=1
Proof. We have to show that
E[cov(Yn' xn)] = Cov(Yn ' x,J.
We have
E[cov(Yn , Xn)]=E[l~f Sty] = I~f E(Sty)
Now

Ekty)= E[_I {£YiXi- nyllx


n - 1 i=1
lI } ]
n- 1
f
= _ I [E( YiXi) - nE(YnXn)]
i=\
Chapter 2: Simple Random Sampling 93

n[1 1I
= - - - L E(y;x; )-
__ ] .
E(Yllx
/I-I /I ; ;1
lI)
Now
E(YllxlI) = COV(YII' xlI )+ E(YII)E(xlI) = N - n Sxy + YX ,
Nil
and each pair of units >j and X; gets selected with probability 1/ N , therefore we
have
(N
E (Sty ) =-/I- [ -I L11 L - I >jX;) - - --/I) {(N
- X
Sty + Y - }]
/I -I /I;;) ;;\ N Nil

= _II_[~ ~ YX - Y X _ (N - 11 )s ]
11 - I N ;;) I I Nil xy

= _1_1 [~{IYX -NY


11 - I N ;;1 I I
X}- (NNil- 11 )sxy ]

= _1_1 [ N -I {_ I_( I >j X; -NY x)}- (N - 11 )S xy ]


11 - I N N-=1\,.i;1 Nil

= _
I II_[N
-I N-I S ty _ (NNil-11) S ty] = S.w '
.
Thus we obtain
E[cov(YII ' XII)] = Cov(y,I' XII )'
Hence the theorem.

Ex am ple 2.2.7. Consider the joint proba bility densi ty function of two continuous

()l
random variables x and y is

f x,y =
~(x
3
+ 2y) 0 < x < I, 0 < y < I,

o otherwise.
ea ) Select six pairs of observations (y, x) by using the Random Number Table
method .
eb ) Estimate the value of covariance between x and y .
Solution . e a ) See Chapter I.
( b ) Estimate of covariance:
R Y R
2 I X
0.992 0.995 0.622 0.423
0.588 0.722 0.771 0.514
0.601 0.732 0.917 0.600
0.549 0.69 1 0.675 0.456
0.925 0.954 0.534 0.368
0.0 14 0.039 0.5 13 0.355
94 Advanced samp ling theory with applications

So we obtain
y "'~ (~, - y-) .'! 'l' X"c< ..' Ii" (x -x) (y - jl)(X- x)
'"
0.995 0.306 167 0.423 -0.030 -0.009080
0.722 0.033 167 0.514 0.06 1 0.002034
0.732 0.043 167 0.600 0.147 0.006360
0.691 0.002167 0.456 0.003 0.000000
0.954 0.265167 0.368 -0.085 -0.022450
0.039 -0.649830 0.355 -0.098 0.063467
I.. Sum k'. 4.133 , "0.000000 ""2 :7 16 0.000 0.0403 35
Thus an estimate of the covaria nce betwee n two sample means is give n by

• (-Y n , X- n )
COY = ( -1- f- J = (-1- -f J 1
Sxy --L, ~ (Yi - Y-XXi- r)
X
n n n - I i= l

= 1- 0 (_I_J x 0.040335 = 0.0013445.


6 6-1
Note that f = 0 because of infinite population size.

2:3 ESTIMATION OF POPULATION PROPORTIO N

Let N be the total number of units in the population nand N a be the numb er of
units possessing a certain attribute, A (say). Then population proportion is the ratio
of number of units possessing the attribute A to the total number of units in the
popu lation, i.e., Py = Nal N . Thus we have the following theorem:

Theorem 2.3.1. The popu lation proporti on Py is a special case of the popu lation
mean Y.

Proof. We know that popu lation mean is given by


- 1 N
Y=-IYi ' (2.3.1)
N i= l
Let us define
y, = {I if the /h unit possesses the attribute A,
I 0 otherwise .
Then (2.3. 1) becomes,
y _ 1+0 + 1+ ..... +1 _ NA _ P
- N - N - y '
Hence the theorem.

We will discuss the problem of estimation of popu lation proportion using SRSWR
and SRSWOR samp ling.
Chapter 2: Simp le Rand om Sampling 95

C ase I. When the sample is drawn using simple random sampling wit h rep lacement
(S RSWR samp ling), we have the following theo rems .

Theorem 2.3.2. An unbiased estimator of populatio n pro portion Py is given by


, r
Py = -;; (2.3.2)
where r is the numb er of units in the sa mple that possesses the attri bute A .
Proof. To prove that E(py) = Py , let us proceed as follows:
Defining
r. = {I if the it:, sampled unit possess the attribute A,
o otherwi se,
we have
_ 1 II r ,
Y II =- LYi =- = Py .
II i =1 II
Th erefore

E(py )=E(~) = E(YII) = Y = ~..


Hence the theorem.

Theorem 2.3.3. The variance of the est imato r Py is give n by


v(Py)=PyQy (2.3.3)
II
whe re
Qy = l - Py .
a2 a2
Proof. We know that V(YII) = -2. , whi ch implies that v(py) = -L .
II II
Now

a; = ~[ I Y;2 -Nf2 ].
N i=1

Note that
y, = {I if the;''' unit possesses the attribute A,
, 0 otherwise,
and

y2=
,
{I
0 otherwise.
if the ;''' unit possesses the attribute A,

So that
a~ = ~ [NA - N~~] = '; -~~ = ~.(l -~\,)= PyQy '
Th us we have
96 Advanced sampling theory with applications

V(Py)= PyQy .
n
Hence the theorem.

Theorem 2.3.4. An unbiased estimator of V(Py) is given by


-(-
vp ) =Pyqy
--
y n-1
where
qy = 1- Py.

Proof. We have to show that E[v(pJ = v(p)or in other words we have to show
that
E(PyqyJ= PyQy .
n-1 n
Now we know that s; /n is an unbiased estimator of a;/n .
Defining

y. =
I if the it" sampled unit E A,
an d y .2 = {1 if the /1 sampled unit E A,
1 { 0 otherwise, I 0 otherwise.
Hence we will obtain
2 1 [nLYi2-nYn-2] =--[r-npy
Sy =--
I r .2] n [r
= - - - - Py
_2] =--Py
n _ (1- Py
_) =npyqy
--.
n-1 i=l n-1 n-1 n n-1 n-I
So that
2
Sy = Pyqy
II II-I
Hence the theorem.

Theorem 2.3.5. Under SRSWR sampling, while estimating population proportion,


the minimum sample size with minimum relative standard error (RSE) equal to rjJ , is

(2.3.4)

Proof. The relative standard error of the estimator Py is given by

RSE(py) = ~V(py)/{e(pJ~ = ~PyQy/ liP; . (2.3.5)


Note that we need an estimator Py such that RSE(py) ~ rjJ , which implies that

~ IIP
Qy ~ rjJ, or Qy s rjJ2, or
IIPy
11;:0: ;y .
Py
y rjJ
Hence the theorem.

Note that the relation 11;:0: Qy /(rjJ2 p y ) shows that as Py -+ 0 then II -+ 00 •


Chapter 2: Simple Random Sampling 97

Example 2.3.1. We wish to estimate the proportion of the number of fish in the
group Herring cau ght by marine recreational fish ermen at the Atlantic and Gul f
coasts. There are 30027 fish out of total 311,528 fish caught during 1995 as shown
in the population 4 in the Appendix. What is the minimum number of fish to be
selected by SRSWR sampling to atta in 5% relative standard error of the estimator
of population proportion ?
Solution. We ha ve
P , = 30027 = 0.0964 and Q), = 1- Py = 1- 0.0964 = 0.9036.
) 3 11528 '
Thus for rp = 0.05 , we have

II ;:>: Q / (rp2 P) =
0.9036 = 3749.4 '" 3750.
Y 0.052 x 0.0964
Y
Thus a minimum sample of size II = 3750 fish is required to attain 5% relati ve
standard error of the estimator of population proportion under SRSWR sampling.

Example 2.3.2. A fisherm an visited the Atlantic and Gulf coast and caught 4000
fish on e by one . He not ed the species group of each fish cau ght by him and put
back that fish in the sea before making the next catch. He ob served that 400 fish
belong to the group Herr ings .
( a ) Estimate the proportion of fish in the group Herrings livin g in the Atl an tic and
Gulf coast.
( b ) Co nstruct the 95 % confidence inter val.
Solution. W e are given 11 =4000 and r = 400 .
( a) An estimate of the proportion of the fish in the Herrings group is give n by
P =!- = 400 = 0.1.
y 1/ 4000
(b ) Under SRSWR sampling an estim ate of the v(p) is given by
v(p ,)= P/ly = 0.1x 0.9 = 2.2505 x 10- 5.
)
11 -1 4000 -1
A (1- a)100% confide nce interval for the true proportion Py is give n by

Py + Za/2 ~V( Py ) .

Thus the 95 % confidence interval for the proportion of fish belonging to the
Herrings group is given by
Py + 1.96~v( Py ) , or 0.1 +1.96b.2505 x 10- 5 , or [0.0907, 0.1092].

Example 2.3.3. The height y of plant s in a field is un iformly distributed betwe en 5


cm to 20 em w ith the probabi lity den sity fun ction
I
j(y) = - V 5 < y < 20 .
15
W e wish to estimate the proportion of plants with height more than 15 cm , what is
the minimum required sa mple size II to ha ve an accuracy of relati ve standard erro r
of45 %?
98 Advanced sampling theory with applications

( a ) Select a sample of the required size, and estimate the proportion of plants with
height more than 15 cm.
( b ) Construct a 95% confidence interval estimate, assuming that your sample size
is large, and interpret your results.
Solution. We know that if y has uniform distribution function
1
f( y) = - \;f a <y <b •
b-a
Thus the proportion of plants with height more than 15cm is given by
20 20 1 1 5
Py = fJ(y}ty = f -dy
15
= -(20 -15) =- = 0.3333,
15 15 15 IS
and the variance
0"; = Py (1- py ) = 0.3333(1- 0.3333) = 0.2222 .
( a) We need ¢ = O.4S, thus the required minimum sample size is given by

n ~ 0"; = 0.2222 = 9.8 '" 10.


¢2 i} 0.4S 2 x 0.33332

( b ) We select a with replacement sample of n = 10 units as follows . The


cumulative distribution function (c.d.f.) is given by
F( y) = ply ~ y]= ~f(y}ty = ~~y = (y-S)
5 5 1S IS
which implies that y = lSF(y)+S. We used 4th to 6th columns multiplied by 10-3 of
the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select ten
values of F(y) and the required sampled values are computed by using the
relationship y = ISF(y)+ S as follows:

19.31
0.183 7.75 o
0.448 11.72 o
0.171 7.57 o
0.567 13.51 o
0.737 16.06
0.856 17.84
0.233 8.50 o
0.895 18.43
0.263 8.95 o
Thus an estimate of the proportion Py is given by,
Chapter 2: Simple Random Sampling 99

p, = number of ' yes' answers = ~ = 0.4


} sample size 10
and an estimate of its variance is given by

V(Py )= pAI - Py) = 0.4(1 - 0.4) = 0.0267 .


n- I 10 - 1
Thus a 95% confidence interval estimate of the required proportion Py is given by

py+1.96~v(py) , or 0.4+1.96~0.0267, or [0.0797, 0.7202] .

Case II. When a samp le is drawn using SRSWO R sampling, we have the fol1owing
theorems.
Theorem 2.3.6. The unbiased estimator of the population proportion P; is given by
r
r- >;
A

where r is the number of units possessing the attribute A.


Proof. Obvious.
Theorem 2.3.7. The variance of the estimator Py is given by
(A) (N - n)
V\p y = n(N - 1{ yQy ·
Proof. We know that

V\YII (N-n)
(-:: ) =---Sy, where Sy = - I - 2 2 (NIf;2- NY-2J.
Nn N- I i=l

Again we define
y = {I if the ;ti' unit possesses the attribute A,
I ° otherwise ,
and
y'2 = {I if the /h unit possesses the attribute A,
I ° otherwise .
So
S2 = _I_(N _ Np2 )= ~(p _ p2 )= NPyQy = S2 .
y N- I A y N-I y y N- I P

Hence we have
v (p )=N-ns2=N- n x~PQv = (N -n)pQ "
y Nn P N/I N- 1 Y. n(N -1) y }
which proves the theorem.

Theorem 2.3.8. The unbiased estimator of the variance V(Py) is given by


A( A ) (N -n) A A
V Py = - (- ) Pyqy .
N /I-I
100 Adv anced sa mp ling theory with appl ication s

Proof. We w ill prove that E[v(p.J= v(py ) , that is


E[ ( - /I / vqy] = N( -II )pyQy.
N /I - I N /I- I

N ow we k no w t h at (N-II)
- - -Sy2 .
IS an un b lase
i d estimator
esti 0 f -N-- II Sy2 .
Nil Nil
Cha ngi ng
y; = {I if the /" population unit A, and E y;2 = {I if the i~" poulation unit A, E
o otherwise, 0 otherwise,
mak es
(N- II)S2 = (N - Il) PQ
Nil y II(N - I) y r:
Similarly, if we make the ch anges
v. = {I if the /~ sampled unit E A, an d
2
y. ={I if the /" sampled unit E A,
o otherwise, I 0 otherwise,
then

sJ, =-( I )[fYl- IlY;; ].


/I - I i= \
wi ll reduce to
2_ 1 [.
s p - 1l_I'-Il y - II-I
. 2]_ 11
P
[r; -Py
'2]_ II (. ' 2)_ IlPy(I- Py) _ IlPyCly
- II-I Py -Py - 11-1 - ~ .
Th er efore

-(N-II)
- - s2 _(-
N-Il
-JIl
- Pyqy
- _- (N-Il)
--P q
. .
Nil p - Nil (II-I) - N(II - I) y y'
Hen ce the theorem.

Theorem 2.3.9. Under SRSWO R sampling , w hile es tima ting population


prop ortion, the minimum sa mple size with minimum relative sta nda rd error (RSE)
equa l to ¢ is

(2.3.6)

Proof. T he relative standard error of the estimator Py is g ive n by

. ) I (. )I/{ (. )}2 (N - 1I )Py Qy


RSE (Py = 'Jv Py / E Py = ( )2 . (2.3.7)
II N- I Py

Note that we need an estimator Py suc h that RSE(p y ) ~ ¢, w hich implies that

(N
II
-IJ (N Q
y
-I )~,

- ,
Hen ce the theorem.
Chapter 2: Simple Random Sampling 101

Example 2.3.4. We wish to estimate the proportion of the number of fish in the
group Herrings caught by marine recreational fishermen at the Atlantic and Gulf
coasts . There are 30 ,027 fish out of total 311,528 of fish caught during 1995 as
shown in the population 4 in the Appendix . What is the minimum number of fish to
be selected by SRSWOR sampling to attain the accuracy of relative stand ard error
5%?
Solution. We have
P = 30027 = 0.0964 and Qy = 1- Py = 1- 0.0964 = 0.9036.
y 311528 '
Thus for ¢ = 0.05 , we have
ne 2( NQ y 311528 xO.9036 =3 704.8;::;3705.
¢ N -I)Py+Qy 0.052(311528-1)xO .0964 +0 .9036
Thus a minimum sample of size n = 3705 fish is required to attain 5% relative
standard error of the estimator of population proportion under SRSWOR sampling.
Example 2.3.5. A fisherman visited the Atlantic and Gulf coast and caught 4000
fish. He noted the species group of each fish caught by him . He observed that 400
fish belong to the group Herrings.
( a) Estimate the proportion of fish in the group Herrings living in the Atlantic and
Gulf coast.
( b ) Construct the 95% confidence interval.
Given: Total number of fish living in the coast = 311528.
Solution. We are given N = 311,528, n = 4,000 and r = 400 .
( a ) An estimate of the proportion of the fish in the Herrings group is
_ r 400
p = - = - - =0.1.
y n 4000
(b) Under SRSWOR sampling, an estimate of the V(Py) is given by
v(- ,) = (N -n)Pyqy = (311528-4000)x 0.l xO.9 =2.2216x10-5.
r, N n-1 311528 4000-1
A (1 - a)1 00% confidence interval for the true proportion Py is given by

Py + Za/2~V( Py ) .
Thus the 95% confidence interval for the proportion of fish belonging to Herrings
group is given by
py+1.96~~{ Py), or 0.1+1.96~2.2216x10-5, or [0.0908,0.1092].

Example 2.3.6. Ina field there are 1,000 plants and the distribution of their height
is given by the probability mass function

y (em) 50 100 150 200 225 275


p(y) 0 .1 0.2 0.3 0.1 0 .2 0 .1
102 Advanced sampling theory with applications

( a ) Select a random sample of n = 10 units and est imate the proportion of plants
with height more than or equa l to 225 ern.
( b ) Construct a 95% confidence interva l, assuming that it is a large sample.

Solution. The cumulative distribution functio n F(Y) is given by

y(cm}. 50 100 ISO 200 225 275

1'1\ <pCy) ,. 0.1 0.2 0.3 0.1 0.2 0.1


F;(Y )i 0.1 0.3 0.6 0.7 0.9 1.0

Using the first three columns, multip lied by 10-3 , of the Pseudo-Random Number
(PRN) Table I given in the Appendix, we obtain the 10 values of F(Y) and y as:

F(y) ' Y
:
~ 225
yes-vL
+ .. '" " ,,
:no-~O :i
0.992 225 I
0.588 100 0
0.601 150 0
0.549 100 0
0.925 225 I
0.0 14 Discard this number
0.697 ISO 0
0.872 200 0
0.626 150 0
0.236 50 0
0.884 225 I

(a) An estimate of the proportion of plants with height more tha n 225 em is

P = No. of plants with height ;:: 225 em = ~ = 0.3 .


y No. of plants in the sample 10

( b ) We have N = 1000 and n = 10 , therefore, an estimate the V(Py) is given by

v(' )=(N-n)pi1y = (1000-IO) xO.3 xO.7 = 0.023 1 .


Py N n -1 1000 10-1
Thus a 95% confidence interval for the proportion of plants with height more than
or equal to 225 em is:
Py+- 1.96~v(py) , or 0.3+ 1.96.J0.0231 , or [0.0021, 0.5978] .
Note that an estimate of proportion always follow normal distrib ution, so the use of
1.96 for deriv ing 95% confidence interval estimate is appropriate irrespective of the
sample size .
Chapter 2: Simple Random Sampling 103

2ASEARLS"ESTIMATOR OF POPULATION MEAN

Searls (1964) considered an estimator of the population mean r defined as


Yscarl=AYn (2.4.1)
where A is a constant such that the MSE of the estimator Yscarl is minimum. Thus
we have the following theorem.

Theorem 2.4.1. The minimum mean squared error of the estimator, Ysearl ' is
Min.MSE(Ysearl) = V(Yn)/ {I + V(Yn )/p}. (2.4 .2)
Proof. We have
MSE(Yscarl)= E~s - rf = E[AYn - rf = E[AYn -E(AYn)+ E(AYn)- rf
= E[A{Yn - E(Yn)} + AE(Yn)- Yf = E[A{Yn - E(Yn )} + (A -1)y]2
= ElA
2{Yn - E(Yn)f + (A -If y 2+ 2(A -1)YA{Yn - E(Yn )}j

= A2E{Yn - E(Yn)f + (A _1)2 y 2+ 0 = A2V(Yn)+ (A _1)2 y 2. (2.4 .3)


Thus the mean squared error, MSE(Ysearl)' is given by
MSE(Ysearl)= A2V(Yn)+ (A _1)2 P . (2.4.4)
On differentiating (2.4.4) with respect to A and equating to zero we obtain
AV(Yn)+(A-I)y2 = 0, or A= I/~ + V(Yn)/Y2} . (2.4.5)
On substituting (2.4 .5) in (2.4.4) we obtain

.
Mm.MSE Ysearl
(_ ) = ,2 V(_)
Yn + (, - I)2-2
Y = {_
f4V(Yn) }2 + {y2
2 (_) - I}2 Y
-2
A A
2
y +V(Yn) Y +VYn

= y4 V(Yn)
{y2 + V(Yn)}
2 +{ y2 -t 2
Y + V(Yn)
- ~(Yn )}\2
y4V(Yn) + y2 {V(Yn )}2 = y2V(Yn ){y2 + V(Yn )}
{y2 + v(Yn)f {r 2+ V(Yn)f {y2 + V(Yn )}2
y2 V(Yn) V(Yn) (2.4.6)
y2+ V(Yn) I+V(Yn)/y 2'
Hence the theorem .
Theorem 2.4.2. Under SRSWR sampling, the minimum mean squared error of the
Searls' estimator is
Min.MSE(Ysearl) = n-10"; / {I + n -10"; / Y2}. (2.4.7)
Proof. Obvious from (2.4 .2) because under SRSWR sampling we have
V(Y-n )= n - I O"y2 .
104 Advanced sampling theory with appl ications

Theorem 2.4.3. The relati ve efficiency of the Searls' estimator Y searl with respect
to usual estimator Y", under SRSWR samplin g, is given by
RE =I + eT;' /~IY2 ) . (2.4.8)

Thu s the relat ive gain in the Searls' estimat or is inversely proportional to the
sample size, II. In other words, as /I ~ 00 , the value of RE ~ 1.

Proof. It follows from the definition of the relative efficiency. Note that the relative
efficiency of Searl s' estimator with respect to the usual estimator is given by

-
RE -
MSE(YIl) _
( ) - .
V(YIl )
( ) -
-
/I
-I 21eT y
II -leT;
I 2 2
)-1
MSE Ysea rl Mm.MSE Ysearl 1+ /1- Y a Y
- 1- 2 2
=1 +/1 YeT y . (2.4 .9)
Hence the theorem .

Theorem 2.4.4. Under SRSWOR sampling , the minimum mean squared error of
the Searls' estimator is

Min.MSE(Ysearl) = U-~ )s.~ / {I + (~ - ~)s.~ /y2}. (2.4. I0)

Proof. Ob viou s from (2.4.2) because under SRSWOR sampling we have


V(YIl) = (~ - J...)s;. .
II N

Theorem 2.4.5. The relative efficiency of the Searls' estimator Ysearl with respect
to Y/I , und er SRSWOR, is given by
1
RE = + (~ -
/I
J...)C
N
2
Y
.

Thu s the relative gain in efficienc y of the Searls ' estimator is inversely proportional
to the sampl e size II. In other word s, as /I ~ N the value of RE ~ 1.

Proof. By the definition of the relative efficiency we have


-1

V(YIl) = ~ _J... S2 ( ~ _ J...)S2


N
eN) 1 {1_~);'i.
RE = /I Y

M',.MSE(y" o" )
1 \
Y
/I N y2
2
= +1 (~II - J...)
N y2
::y =1+ (~II - J...)c
N Y
2
.

Hence the theorem .


Chapter 2: Simple Random Sampling 105

E xample 2.4.1. We wish to estimate the ave rage num ber of fish in eac h one of the
spec ies gro ups caught by marine recrea tional fishermen at the Atlantic and Gulf
coasts. Th ere were 69 species caught during 1995 as show n in the pop ulation 4 of
the Ap pen dix . We selecte d a sample of 20 units by SRSW R sampling . Wha t is the
gain in effic iency owed to the Searls' estimator over the sample mean?
Giv en: S; = 371995 78 and Y = 3 11528 .
Solution. We are given N = 69 , S; = 37199578 and Y = 311528 , thus
- Y 311528
Y =- =- - = 4514.898
N 69
and

0-Y2 = -(N-N-1)SY2 = -
(69 -1)
69
- x 37199578 = 36660453.68.

The relative efficiency of the Searls' estimator Ysearl with respect to usual estimator
Yn , under SRS WR, is given by
RE = [1 + 0-12 ] x l 00 = [ 1+ 36660453.68 ] x 100 = 108.99% .
lIy 20 x45 14.8982

Example 2.4.2. On a ba nk of a river the height of trees is uniformly distribu ted


with the p.d.f. give n by
1
/(y) = - \j 200 ::; Y ::; 500 feet.
300

Find the relative efficiency of Searls' estimator ove r the usual estimator based on a
sample of 5 or 20 units, respec tive ly.

Solution. We are given a = 200 and b = 500 , therefore the population mean

y= a+b = 200 + 500 = 350 feet


2 2
and pop ulation var iance
o-~ = (b-af = (500 -200f = 7500 feet' .
.} 12 12
Thu s the relative efficie ncy of the Sea rls' estimato r ove r the usu al one is given by

If II = 5 then RE = [1 + 0-1 J x 100 = ( I + 7500 2 Jx 100= 10 1.22% .


IIY 2 5 x350

If II = 20 then RE = [ I + 0-1 J x 100 = ( 1+ 7500 2 J x 100= 100.30 % .


IIY 2 20 x 350

Searls ( 1967), Reddy ( 1978a) and Arn holt and Hebert ( 1995) studied the properties
of this estimator and found that it is useful if C y is large and sample size is small.
106 Advanced sampl ing theory with applications

2.5 USE OF DISTINCT UNITS IN THE WR SAMPLE AT THE


ESTIMATION STAGE

We sha ll discuss the problem of estimatio n of finite popul ation mean and variance
by using only distinct units from the SRSW R sample. However, before goi ng
further we shall discuss some results, which will be helpful in de riving the result
from distinc t units for SRSW R sample. Basu ( 1958) introdu ced the concept of
sufficiency in sampling from finite populations. Acco rding to him , for every
orde red sample SO there ex ists an unordered sample s uo which is obtained from SO
by ignoring information concerning the order in which the labels occur . The data
obtained from the sample suo can be repre sent ed as

C/UO= \Yi
( ... / E S uo ) . (2.5 .1)

Similarly the data obtained from SO can be represented as:

dO= ~i : i E so ). (2.5.2)

Then the probab ility of observing the ordered d" give n the unordered data dUO is

(2 .5.3)
where L is the summation ove r all those ordered sampl es sa which results in the
unordered sample suo . Since the probabil ity P(dOI dUO) is independ ent of any
populatio n parameters and hence the unordered statistics dU
Ois a suffic ient statistic
for any population parameter.

Let us now first state the Rao-- Blackwell theorem, which is based on Rao ( 1945)
and Blackwell ( 1947) results.

Theorem 2.5.1. Let e~ = e(do) be an estimator of e con structed from ordered data
s

«: Suppose es= Ele~ wo j then:


(a) E(e,) = E(e~ ); (2.5.4)

( b) MSE(e~)= MSE(es)+£(e,-e,O ). (2.5 .5)


Proof. ( a ) We have

£(es) = £le~ I dUO[, £lLe~(do~(do I duo)j = £(Le~{do):~UO))J

=
suo
L{Le,o(do\~
As:) }p(suO) Le~(do )p(so)=£(e~).
sO
=

Hence the par t ( a ) of the theorem.


Chapter 2: Simp le Random Sampl ing 107

( b) We have
MSE(e~ ) = E[es O - of = E[e.~ - es + es - of
=E(e~ -e. +E(es -0) +2E (e~ -esXe -0)
.s
I. ) l

=E(e~ - + MSE(es)+0 .
Hence the theorem.

Now we will discuss the problem of estimation of mean and variance on the basis of
distinct un its in the sample. Clearly a unit can be repeated onl y in W R sampling
schemes . Hence we are dealing only with SRSW R sampling scheme . Suppose v
denote the numb er of distinct units in the sam ple of size n drawn from the
popul ation of N units by using SRSWR scheme.
Th e distribution of distin ct units in the sample was first develop ed by Feller (195 7)
as follows

p(v =t)= _1 (N )±(_ly(t )(t _r


Nil t r=1 r
r, where t =I,2,...,Min.(n,N ). (2.5.6)

2.5.1 ESTIMATION OF MEAN

For esti mating the population mean Y by using information only from the distinct
units we have the following theor ems.

Theorem 2.5.1.1. An unbia sed estimator of populati on mean Y is given by


_ 1 v
Yv =-L y; · (2.5.1.1)
V ;= I

Proof. Followi ng Raj and Khamis (1958), let E 2 and E1 be the expected values
defined for a given sample (fixed numb er of distinct unit s) and for all possible
samples, respectively, then by taking expe cted value on both sides of (2.5. 1.1), we
have

E(yv )= E1EZ(Yvl v) =E,Ez (I -I v Yi J=E, (I- I Yi J=Y- .


n
(2.5 .1.2)
v i=1 n i =1
Hence the theorem.

Theorem 2.5.1.2. Th e variance of the unb iased estimator y,. based on distinct un its
IS

(2.5.1 .3)
Proof. Suppose V2 and VI denote the variance for the given sample (fixed numb er
of distin ct unit s) and over all possible samples, we have
108 Ad vanced sa mpling theory with applications

Hen ce the theorem.

Corollary 2.5.1.1. Path ak ( 196 1) has show n that


E(2-) = _1_ 2>(11-1) . (2 .5.1.4)
v N il J =l

Thus we have the following theorem

Theorem 2.5.1.3. The variance of the estimator Yv is give n by

V(Yv) = C~: J (I/- I)/ )s;.


N I/ (2.5. 1.5)
Proof. W e have

v(Yv ) = [ E(~ )- ~ ]s;. [X:(1l-1)/


= N il - ]s;.
1/ N

= [L~,J(I/-I) -N(I/- I)} / NI/ Js-~ = [XI J (I/ -I )/ NI/ ]s.~ .


He nce the theo rem.

It is interes ting to note that as the sa mple size 11 drawn wi th SRSW R sa mpling
approaches to the population of size N, the magn itud e of the relative efficiency also
inc reases. Th e rea son of inc rease in the relati ve effic iency may be that the increase
in sample size also increases the probability of rep etition of unit s in SRSW R
sa mpling .
Th e relati ve efficiency und er the Feller (1957) distribution is given by
V(y ) (N -I)N(I/-I )
RE = - = --'--r--'----,
n{Nil J(I/- I)}
_1/-
(2 .5. I .6)
V(Yv)
J=I

whic h is free fro m any po pulatio n parameter but depe nds upon populat ion size an d
sa mple size .

Th e following tabl e shows the percent relative efficie ncy of dist inct unit s based
est imators wit h res pect to the esti mators based on SRSW R sa mpling for different
va lues of sa mple sizes n an d po pulation sizes N = 10 .
Chapter 2: Simple Random Sampling 109

"
"
:: oi
BenetitOf\use ' distinct -units
>,J Sample size ( n )
J

J 2 3 4 5 6 7 8 9
J(n-l)
I I I I I 1 1 1 I
2 2 4 8 16 32 64 128 256
3 3 9 27 81 243 729 2187 6561
4 4 16 64 256 1024 4096 16384 65536
5 5 25 125 625 3125 15625 78125 390625
6 6 36 216 1296 7776 46656 279936 1679616
7 7 49 343 2401 16807 11 7649 823543 5764801
8 8 64 512 4096 32768 262144 2097152 16777216
9 9 81 729 6561 59049 531441 4782969 43046721
Sum I" 45 """ 285 2025 15333' 120825 978405 8080425 67731333

RE = [(N -1)N(n-1 f[ n{XJ(n-I) }]


1

RE '" 100.00 105.26 11 1.11 11 7.391 24.15 131.41 139.23 147.64

Corollary 2.5.1.2. An approximate expression for V(Yv ) valid up to order N - 2 IS


in Pathak (1962) as
V(yv) = [~ _ _ l + (n -1)]S2 .
n 2N 12N2 y (2.5.1.7)

Theorem 2.5.1.4. (a) Show that an altern ative estimator of the population mean
based on distinct units is

YI = E(v)Yv+ Xl 1- E(V)) (2.5.1.8)

where X is a good guess or a priori estimator of the popul ation mean Y.


( b ) If there is no priori information or good guess about the population mean Y,
then an altern ative estimator of popul ation mean is given by
_ v_
Y2 = E(v)Yv (2.5.1 .9)
1 v
whe re Yv is a mean of distinct units in the sample.
= - LYi
v i=1
Proof. Let us consider that an estimator of the population mean is given by
Ys = Ji(v)Yv + h (v) (2.5.1.10)
110 Advanced sampling theory with applications

where I] (v) and h (v) are suitably chosen constants such that ys is an unbiased
estimator of Y and its variance is minimum. Now from the property of
unbiasedness we have
E(ys) = EUi(v)yv + h(v)]= fi(v)Y + h(v)= Y . (2.5.l.l1)
This implies that
h(v)= [1 - fi(v)]Y. (2.5.l.l2)
Evidently the value of h (v) contains the unknown value Y, the exact value of
h (v) is not known unless fi(v)=1, which implies I: (v) = O. Thus we chose
fi(v) = 1, then h (v) = 0 , which means a better estimator of population mean Y
v
would be yy = v-I 2.: Yi . In practical situations, sometimes a priori information or
i=1
knowledge of X (say) is available about population mean Y from past surveys or
pilot surveys . In such situations, the value of h (v) is given by
h(v)= [1- fi(v)]X . (2.5.1.13)
Thus if we will chose h (v) as given in (2.5.l.l3), then the bias in the estimator ys
will be minimum . Unfortunately, I: (v) depends upon the value of II (v) too. The
best method to chose I I (v ) is such that the variance of ys is minimum . Now the
variance of the estimator ys is given by
V(Ys) = E]V2(Y.J + V] E2 (ys) = E]V2Ui(v)yv + h(v)] + V2ElUi (v)Yv + h(v)]

=E{fi 2(v{~- ~ )s;]+ V2[fi(v)y + h(v)]


= E{fi2(v{~ - ~)S;]+ V2[Y] =E{fi2(v{~- ~)s;l (2.5.l.l4)

The variance of Ys will be minimum if E{fi 2(v{ ~ - ~ )s; ] is minimum subject

to the condition E] Ui (v)] = 1. Then by Schwartz inequality, we have

E{fi2(V{~- ~)] ~E{~- ~) . (2.5.l.l5)


In the above inequal ity, the equality sign holds if and only if

fi(v )=U- ~)/EU - ~) = ~(;~;~-:~)J' (2.5.l.l6)


Thus if we have a priori information X about Y , then an optimum estimator of Y
is given by
- (Nv)/(N-v) - X[1 (Nv)/(N- v) ]
Y] = E[(Nv)/(N- v)JYv + - E[(Nv)/(N - v)} . (2.5.l.l7)
Chapter 2: Simple Rand om Sa mpling III

If no such information abou t Y is ava ilable, then we have X = 0 and the above
estima tor reduces to
_ (Nv)/(N-v) _
Y2 = E[(Nv)/(N _v)Vv . (2.5.1.18)
Path ak ( 1961) has show n that

E[ ( Nv )]=N2 I (PI2...III) (2.5.1 .19)


N-v 111=1 N- 1Il
where

PI2....m= 11 - lI1l J(I -~J"


N
+..+ (- lr(IIl J(l -~J"
III N
for III ~ II,
(2.5. 1.20)
o otherwise.
Th e relation (2.5.1.20) shows that the est imators YI and Y2 given at (2.5.1.17) and
(2.5.1.18) are very difficult to compute for large sample sizes . Now if we ignore the
samp ling frac tion II/ N and hence v/N , then the above estimators, respecti vely,
reduce to

YI = E(v) Yv+Xl 1- E(V) J (2.5.1.21)


and
_ v_
Y2 = E(v) Yv. (2.5.1.22)

Hence the theorem .

Theorem 2.5.1.5. Show that if the square of the population coeffi cient of var iation
Cf, = sf, /f 2 exceeds (II-I) , then the esti mato r Y2 = (v/E(v))Yv is more effic ient
than Yv'
Proof. We know that

V(Yv)= [E{~J - ~]sf, . (2.5.1.23)

By the defini tion of variance we have

V(Y2) = E,V2(Y2 I v)+V,E2(Y2 I v) = EIV2[ E(V)Yv IV] + VIE2[ E(V)Yv IV]

= E{{E(:)}2(~ - ~ )s;]+ v{iJ >


1
~ {E~:)J' [E(+ e(;')] l:.;j'VI(,)
+
(2.5.1.24)

It is very eas y to deri ve that


112 Advanced sampling theory with app lications

E(V) =N[I- ( I- ~rJ


and

E(V2)=N[I- ( I- ~ r]+ N(N - I{I -2(1 - ~ r +(1-~ Jl


therefore
{ I- N
1)11 _N 2 ( 1- N1)211 +N(N-\
V(v )=E (v2 )-{E (v}f =N ( I-fj 2 )11

From (2.5.1.24) we have

_
V(Y2) = 2
S;(N-I)1 /I 2
[ff-(1- N1 )" ) -f-
f 2(1- -;;1)"+ (I- fj2 )/1 )]
N! I-( I- N))

+ y2 II 2
[ N( I- fjI )" -N 21 I )211 +N(N - \{ I- N
( -fj 2 )" ] .

N 2 [1_( I_ ~) ] (2.5.1.25)

Now from (2 .5.1.23) and (2.5 .1.25), we have


V()lv)- V(Y2 )

l:
N- I
J
(
II -I
)

_ S2 ) -1
- Y Nil

2 -2 (2.5.1.26)
= CISy -C2Y (say) .
Now the estimator Y2 IS better than Yv if v(Yv)- V()l2 )<O or if
(s;jy2)> (c2/cd.
The approximate values of C) and C2 for large pop ulation s, correct up to terms of
order N - 2 , arc given by
C) =_1_+ 5(11- 1) and C, = (n- I) _ (n- IXn-2)
2nN 12nN 2 2nN 3nN 2
and thus, (C2 /C));:::(n-I) . Hence the theorem.
Theorem 2.5.1.6. If squared error be the loss functio n then show that )Iv is
adm issib le amo ngst all functio ns of )Iv and v .
Proof. Let I = )I,. + /
()lv, v) be the function of )lv and v . Suppose that the est imator
I is uniformly better than )Iv . Suppose R(l ) be the quadratic loss function for the
Chapter 2: Simple Random Sampling 113

es timator t. Then the estimator t will be uniformly better than the estim ator Yv if

R(r)~ vCYv), where V(y,.) = E(y,. - Y) .


Also we have

R(I) = E~ - rf = E~v - r+ j LY",v)f = E(Yv - yf + E{t(Yv,v)}2 + 2Ef (Yv'v)(yv - y) .


Thus
R(r)~ vLYv) if E~v - rf + E{jLYv,v)}2+ 2EjLYv, xYv - y) ~ E~v - yf
v (2.5.1.27)
hold s for all 1'[ , Y2 ,.. ., YN . In particul ar, if 1'[ = Y2 = ... = YN = C (say), where C is an
arbitrary cho sen con stant , then the relation (2.5.1.27) impli es that f( C, v) IS zero,
which pro ves the theorem.

2.5.2 ESTIMATION OF FINITEPOPULATION VARIANCE


Consider the problem of estim ation of

"»2 = N - I IN ( Y; - -Y \2J
i= 1
using distin ct uni ts in a sa mple of II unit s drawn by using S RSW R sa mpling. Th e
usual estimator of <7; is given by

S.~= (II- Itl f (Yi - y)2 =[211(1I - I)J- 1 f &i - yJ .


i=1 i ;tj =1
If we now construc t an estimator based on only distinct un its then we have the
follow ing theorem .

Theorem 2.5.2.1. A un iforml y better estimator of <7;' than s .~ is given by

-[1-C,.(II-()II I)]
Sv2 - Cv
2
Sci (2.5.2.1)

where
l
2 _! (V- It ±(Yi - Yv)2 if v > I, (2.5.2.2)
sci - i= 1
o otherwise,
and

CV(II ) = v" - (~}V - I Y' +... +(_I)lV-I{:_ I} / · (2.5.2 .3)

Proof. Suppose we have any convex loss function and T be an ord ered sufficient
stati stic , then by the Rao--Blackwell theorem we have

E[S; I d=E[_II-I- I i=1I(Yi - y)2 r]=E[211 (III- I)i=\I (Yi - Y j ~ I r]


1

= E[±(Y\ - Y2 f I r]. (2.5.2.4 )


114 Advanced sampling theory with applications

To prove that the estimator at (2.5.2.1) is uniformly better than s ~ let us consider
the following cases :

If v = I , i.e., only one unit has been selected in the sample of two units drawn by
SRSWR then (2.5.2.4) is obv iously zero . Suppose 'I I and 'I II denote the
summations over all integral values of at such that the following equalities holds:
v v
2:a(i) = n , a(i) > 0 for i = 1,2,..., v and 'I a()F (n - 2), a(j r:: 0, at) ) > 0 and
i=1 )=!

a(k) > 0, for k *' j *' j' = 1,2, ..., v .


Now if v> I then we have
1 II (n -2) ( 1 j a(l) ( I ja( v)
_. =. ]=Jl2 'I a(I )!a(Z)!....a(v)! IV ..... IV
P[xI - X(l) ' Xz X(J) I T I 1 a(l) 1 a(v ) (2.5.2.5)

'II a(I)!a(z~; ....a(viN j . . { N j


Pathak (1961) has shown some mathematical relat ions as:
II n! C (n) (2.5.2.6)
a(I)!a(2)!.....a(v)! v
and
I II (n-2) = Cv (n)-Cv(n - I)
a(1 )!a(2)I ·.. .a(v )! v(v - I) (2.5.2.7)
There fore we have
Cv(n)- Cv (n - I) . .
P[XI = X(i)' Xz = x(j) IT ] = ( ) () , l*' ) = 1,2 ,..., v . (2.5.2.8)
v v - I Cv n
Now if v > 1 then we have

E[(YI - yz)Z I
2
r] ± ~(i)
=
i ~j= l
- Y(j)~ P[XI = X(i )' Xz = x(j) I r].
2
On substituting the value of P~tl = X(i),X2 = x(j ) ITJ from (2.5.2 .8), we obtain

E[ (YI- Y2 f IT]= Cv(n )-Cv(n- I) I I &(i )-Y(j )t =cv(n)-Cv(n- I)s;.


2 Cv(n) 2v(v -! h~j=1 Cv (n)
Hence the theorem.

Example 2.5.2.1. We have selected an SRSWR sample of 20 units from the


population I by using the 3 rd and 4 th columns of the Pseudo-Random Numbers
(PRN) given in Table I in the Appendix. The 20 states corresponding to the serial
numbers 29,14,47,22,42,23 ,48,06,07,42,21,31 ,31,36,16,27,10, 18,26 and
48 were se lected in the sample . Later on we observed that the states at serial
numbers 42 , 31 and 48 have been selected more than once. We reduced our sample
size by keeping only 17 states in the sample and collected the information about the
real estat e farm loans in these states . The data so collected has been given below :
Chapter 2: Simple Random Sampling 115

14 IN 1213.024
47 WA 1100.745
22 MI 323.028
42 TN 553.266
23 MN 1354.768
48 WV 99.277
06 CO 315.809
07 CT 7.130
21 MA 7.590
31 NM 140.582
36 OK 612.108
16 KS 1049.834
27 NE 1337.852
10 GA 939.460
18 LA 282.565
26 MT 292.965
( a ) Estimate the average real estate farm loans in the United States using
information from distinct units only.
( b ) Estimate the finite population variance of the real estate loans in the US using
information from distinct units only.
( c ) Estimate the average real estate loans and its finite population variance by
including repeated units in the sample . Comment on the results .
Solution. Here n = 20 and v = 17, and on the basis of distinct units information, we
have

29 NH 6.044 -560 .7820 314476.8000


14 IN 1213.024 646.1977 417571.5000
47 WA 1100.745 533.9187 285069 .2000
22 MI 323.028 -243 .7980 59437 .6100
42 TN 553.266 - 13.5603 183.8816
23 MN 1354.768 787.9417 620852.1000
48 WV 99.277 -467.5490 218602.3000
06 CO 315.809 -251.0170 63009.6800
07 CT 7.130 -559.6960 313259.9000
21 MA 7.590 -559.2360 312745 .2000
31 NM 140.582 -426 .2440 181684.2000
Continued......
116 Advanced sampling theory with applications

36 OK 612.108 45 .2817 2050.4330


16 KS 1049.834 483.0077 233296.4000
27 NE 1337.852 771 .0257 594480.6000
10 GA 939.460 372 .6337 138855.9000
18 LA 282 .565 -284 .2610 80804.4800
26 MT 292 .965 -273 .8610 75000.0100
I* ,],i" }'i'/"'" l'Sum'; //':,
, ,/ ,,'
"
/~ /d
ro. '''ro.~ro.
eQlilqS()I()OO()'t
( a ) An unbiased estimate of the average real estate farm loans in the United States
is given by
_ I v I 17 9636.047
y = - L y.1 = - L y1. = = 566.826 .
v vi=1 17 i=1 17

( b ) An est imator of the finite population variance CT; based on distinct units
information is given by
2 [ C)n-I)] 2
Sv = 1- C)n) sd'

Now
2 1 V( _)2 3911380
sd = -(- ) L y.- y = = 244461.25,
v-I i=1 I v 17-1
and
C (n) = vn _(v)(V_I)n+.oo+(_I)(V-I)(V )In .
v I v-I

C)n) = CI7(20) = 1720 - cl 7(17 _1)20 + C~7 (17_2)20 - cj7 (17 - 3fO + CJ7(17 _4)20
- cF (17 _5)20 + C~7 (17- 6)20 - cj7 (17 _7)20 + cJ7 (17 _8)20

- cr (17- 9)20 + c1J(17 _10)20 - cli (17 _11)20 + cli (17 _12)20
-clI (17 _13)20 + cll (17 _14)20 - clJ (17 _ 15)20+ clJ (17 _16)20
= 2.6366 x 10
20 ,
and
C)n-I)= CI7(19)= 1719 -CF(17-1)19 +Cf(17-2)19 -Cj7(17-3)19 +CJ7(17 -4)19
-CF(17-5)19 +C~7(17-6)19 -Cj7(17-7)19 +CJ7(17- 8)19

-Cr(17-9)19 +cIJ(17-10)19 -cli(17 -11)19 +cl~(17-12)19

-cll(17 _13)19 +ClI(17 -14)19 -Cll(17 -15)19 +ClJ(17 _16)19


= 4.4805 x 1018 .

Hence an estimate of the finite population variance is given by


Chapte r 2: Simple Random Sampling 117

S; = ( I _ 2.6366
4.4805 x 10
x 10
18

20
) x 244461.25 = 240307.004 .

( c ) From the sample information including repeated units , we have


I~
Random ,, State Real estateli' . c !){y(
,

1i,ymEer
t
farm loans, fYl"'-I '~t;;;
~ )
~
(-
"" , (; Yi - Yn ~', ' Yi - Yn t ,;"~
'~ '

29 NH 6.044 -515.4150 265652.200


14 IN 1213.024 69 1.5654 478262.700
47 WA 1100.745 579.2864 335572.700
22 MI 323.028 -198.43 10 39374.700
42 TN 553.266 31.8074 10 11.71 1
23 MN 1354.768 833.3094 694404.600
48 WV 99.277 -422.1820 178237.300
06 CO 315.809 -205 .6500 42291 .760
07 CT 7.130 -514 .3290 264533 .900
21 MA 7.590 -513.8690 264060.900
31 NM 140.582 -380 .8770 145067 .000
36 OK 6 12.108 90.6494 8217.314
16 KS 1049.834 528.3754 279180.600
27 NE 1337.852 816.3934 666498.200
10 GA 939.460 418 .0014 174725 .200
18 LA 282.565 -238 .8940 57070.150
26 MT 292.965 -228.4940 52209.330
42 TN 553.266 31.8074 1011.711
31 NM 140.582 -380 .8770 145067.000
48 WV 99.277 -422 .1820 178237 .300
,'" ',L,'" "" "Sum
" 10.tl29.172 ::;",:0.00001' ! ;;1 4270686 .000~
If the repeated units are included in the sample then an estimate of the average real
estate farm loans in the United States is given by
_ 1 n 10429.172
Y =- l: y. = = 521.4586
n n i= 1 I 20
and an estimate of the finite popu lation variance is given by
s2 = _1_ f( ,_-
)2 = 4270686 = 224772.9474 .
y · y, Yn
n - I ,~I 20 - 1
Clearly estimates of the average and finite population variance remain under
estimate if repeated units are included in the samp le. For details on distinct units
one can refer to Raj and Khamis (1958), Pathak (1962) and Pathak (1966) . Some
comparisons of SRSW R and SRSW OR sampling schemes have also been
considered by several other researchers viz. Deshpande (1980) , Ramakrishnan
(1969) and Seth and Rao (1964) , and Basu (1958) .
118 Advanced sampling theory with applications

2.6 'ESTIMkTION
OF K,POPULA
Sometime we are interested in estimating the total or mean value of a variable of
interest within a subgroup or part of the population. Such a part or subgroup of a
population is called the domain of interest. For example, in a state wide survey, a
district may be considered as a domain. After completing the survey sampling
process from the whole population, one may be interested in estimating the mean or
total of a particular subgroup of the population. We are interested in estimating
population parameters of a subgroup of a population. For example
~ ..
" , ;r ;!,. ;; ~. ;
;. yr••";·· ;.0· !0".

United States population Employed


New York Unemployed
Retailers Supermarkets
All workers in a firm Part time workers

Let D be the domain of interest and N D be the number of units in this domain.
ND _ 1 ND
Let YD = L f; and YD =- L Y; be the total and mean for the domain D
;=1 ND ;=1
respectively . Suppose we selected an SRSWOR sample s of n units from the
entire population nand nD ~ n units out of the selected units are from the domain
D of interest. In certain situations the value of N D is known and in another
situations the value of N D is unknown. We shall discuss the both situations as
follows. Define a variable
y' =
I
{I0
if i E D ,
if i ~ D. (2.6.1)
N • •
Then we have If; = YD = Y (say).
i=1

Case 1. When N D is unknown. Then we have the following theorems.

Theorem 2.6.1. Under SRS sampling an unbiased estimator of total YD of the


subgroup (or domain D) is given by
, N n *
~=-IY; . (262)
n i=1 . .
Proof. Taking expected values on both sides of (2.6.2), we have

E(YD)= E( ~ i~JY;* J ~ i~1 Ek)= ~ i~t~/jPrk


= = Yj)]
N n 1 N * N n N * N * ND *
=- I - I Yj =- x- I Yj = I Yj = I Yj = YD .
n i= J N j=1 n N j=J j=J j=1
Hence the theorem.
Chapter 2: Simple Random Sampling 119

Theorem 2.6.2. The variance of the estimator YD under SRSWOR sampling is:
2(
V(YD) = N I- f)sb, where Sb=_I_f~Y;'2_N-I(~y;,)2J. (2 .6.3)
n N -I 1;=1 ;=1
Proof. Obvious from the results ofSRS WOR sampling.

Theorem 2.6.3. An unbiased estimator of the V(YD ) under SRSWOR sampling is


given by

v.(y'D)_ j
2
- N (1_ f) SD,
(J
2 1 '2 -1
2
where SD= - L Y; - n L Y; 2J .
/I /I ,
- (2.6.4)
n n -I ;=1 ;=1

Proof. Obvious .

Case II. When N D is known. Here we need the following lemma.

Lemma 2.6.1. If i E SD indicates that the t" unit is in the sample sub group of D
then, we have
Prf( . E SDI nD> 0 ) = -no . (2.6.5)
ND
Proof. We have
Pr(i E SD InD > 0)
Pr(i E SD, 1I1D) Number of samples of sizes n Dwith i E SD
= Pr(n D) = Number of samples of sizes nD

Number of ways (nD - I) can be chosen from (ND- I) and (n- nD)from (N- ND)
Numb er of ways n D can be chosen from NDand (n- nD) from (N- ND)

Then we have the following theorems


Theorem 2.6.4. Under SRS sampling a biased estimator of YD is given by

YD = lo~: ;~y;' if no > 0, (2.6.6)


otherwise.
Its relative bias is equal to the negative of the probability that n D is equal to zero.
Proof. Here nD is not fixed before taking the sample and therefore is a random
variable. Thus we have
E(YD)= E1[E2(YDI nD)]
120 Advanced sampling theory with applications

Now when n D > 0 then we have

E(YD) = E2[(YD I no > 0)] = E2[ :~ i~/t* I no > 0]


= E2[ N D Ili I n D >
nt: iESD
0] ' where S indicates sample subgroup of
D D

ND
=E2 [- - "i.JiYi InD > 0] , where Ii = {I if iES.O'
n DiED 0 otherwise,

= ND [ I E2 (Ii inD > 0)Yi] = ND [ I Yi Pr(I i E SD in D > 0)]


no iED no iED

= ND [IYiX nD] = Z:Yi =Yo .


no iED ND ieO
Therefore
, )_!E(Yo I no > 0) =Yo ,
E (Yo - (, )
E Yo I no = 0 = O.
Thus we have
E(YD) = E,[E 2(YD I nD)] = YD Pr[nD > 0]+ OPr(nD = 0) = Yo Pr[no > OJ
Thus the relative bias in the estimator YD ,when No is unknown , is given by
RB(Yo )=E(Yo)- Yo =Yo Pr(no > 0)- Yo = Pr(no > 0)-1 = -Pr(no = 0).
Yo Yo
Hence the theorem.

To find the variance of the estimator Yo we need the following lemma .


Lemma 2.6.2. Show that
1 I
E(-no no>
oj nPo
1- -Po
= -1+ - 2 2'
n Po
where Po __ No .
N
(2.6 .7)

Proof. We have
1 1

-;;; = n( n: ) = n( n: - Po ) + nPo
Chapter 2: Simple Random Sampling 121

Tanking expected values on both sides, we have

PD(I - PD) E(IID -PD)


1 II D>D J '" - i
E( ~ 1- -'-- ::----
ll - ~r===~="f"'==
II +---;----'-;-:,---- -~:___=__'~
v» IIPD IIPD ~PD(I-PD)/II
r

by

(2 .6.8)

Hence the lemma. For more details about the expected values of an inverse random
variab le one can refer to Stephen (1945).

Theorem 2.6.5 . Show that the variance of the est imator YD , when N D is know n, is

V(YD)", Pr(IID > O{{PD N2(~_ f) + : : (1- PD)}Sb + Pr(IID = 0)Y5]. (2.6.9)

Proof. We have
V(YD)= E, ~2(YD IIID)]+ VI [E2(YDI n» )] . (2.6.10)
Now

'
V2(YD [IID)= OIlD
N D2 ( I-~
ND
Jsb if no > 0, (, ) {YD if liD > 0,
and E2 YD j llD = 0
1 if no = 0,
if no = O.

From (2 .6.10) we have

V (Y,D )-- E, l lNb


»» (1-~JSb
ND if v» > 01 + V, [{Yo
D if liD >0]
if "o = 0 .
(2 .6.11)
o if no = 0

Now

.if liD > 0] = VI [YDI(IID > 0)] = Y5 V, [/(IID > 0)]= Y5 Pr(IID > OXI- Pr(IID > 0))
If liD = 0
= Y8 Pr(IID > O)Pr(IID = 0), (2.6.12)
122 Advanced sampling theory with app lications

and

if v» > 0]
If no =0
= min(flf,';'(IID = j )N5 (1 - ~)S5 +pr(IID = o)«0
j= 1 no ND

min(ND,II )Pr(IID = j) N5 ( liD ) 2


= Pr(1I D > 0) I ( ) 1- - SD
j= 1 Pr no > 0 ti D ND

= Pr(1I D > 0)
min(ND ,II) N2 ( II 2
I Prlll D = j ill D > 0)---.!2... 1- ~ S D
J
no

;J
j =l ND

= Pr(IID > 0)E{N5( tI~ - S5 1I1 D > o}


= N5S5{E( _1 I no > o)- _I_ }pr(tl D > 0)
n» ND

'" N5 S5{_ I_ + 1
IIPD II PD N D
~ P~
-_I_ }pr(IID > 0), where PD = N D
N

+ N5 1
2
'" { PD N
II
(1 -~)S5
N
~ P~ S5} pr(II D > 0),
II P (2.6.13)
D

On using (2.6. 12) and (2.6.13) in (2.6. 1I) we have

Hence the theorem.

Theorem 2.6.6. An estimator for estimating v(rD )is given by


2( 2
' ('D ) ", {'PD N 1_ f ) +-2
vY N ( I - PD
' )} SD
2 (2.6.14)
II II
where

1 ( J2) l
' no 2 1 II *2 *
IY; - IY;
1/
PD = - and SD = - - no t
II no -I ;= 1 ;= 1

Proof. Ob viou s by the method of moments.


Chapter 2: Simple Random Sampling 123

2.7 DEALING WITH A RARE ATTRIBUTE USING INVERSE SAMPLING

The probl em of estimation of proportion of some rare types of genes or acreage


under some special types of plants can not be done with the help of direct binomi al
distribution. The probl em is that if we select a sample of size 11 by SRSWR or
SRSWOR sampling, then the observed number of genes of particular type or
interest in the select ed sample will be zero due to its rare availability. Thu s the
traditional SRSWR or SRSWOR samplin g schemes cannot be used to estimate the
proportion of rare attribute in survey sampling. One of the possible solutions is an
Inver se Sampling. Inverse Sampling is a techniqu e in which sampling is continued
until a predeterm ined numb er of units possessing the attribute occur s in the sample
is useful in estimating the proportion of a rare attribute. Now inverse sampling can
be done either by using SRSWR or by SRSWOR sampling. If the Inver se Sampling
is don e by SRSWR sampling then the total numb er of trials 11 (say) to get
predetermined number of units m (say) possessing attribute A (say) follows the
Negative Binomial Distribution. It is also called Binomial waiting-time distribution
or Pascal distribution. If the invers e sampling is don e by SRSWOR sampling, then
the total number of trials 11 (say) to get predetermined number of units m (say)
possessing an attr ibute A (say) follow s Negati ve Hypergeometric distribution.
Figure 2.7.1 has been devoted to differentiate between Nega tive Binomi al and
Negative Hypergeometric distribu tions.

Fig. 2.7.1 Pictorial representation of the Inverse Sampling.


124 Advanced sampling theory with applications

Thus we have the following situations.


Case I. SRSWOR Inverse Sampling: Let N be the population size, P be the
proportion of the rare attribute of interest and Q = 1- P. In case of the Inverse
Sampling the sample size n is required to attain m is a random variable and its
probability distribution due to SRSWOR sampling is given by

P(n = I ) =
(:~J:QmJ
J (NP-m+l) m.m + I,... .m + NQ.
(
NN-I+I x , 1= (2.7.1)
I-I
Such a distribution is called negative hypergeometric distribution and we have
m+NQ
L: P(n = I) = I. Then we have the following theorem.
I=m

Theorem 2.7.1. An unbiased estimator of the proportion of rare attribute P is


given by
, m-I
p = n -I . (2.7.2)
Proof. We have
,) (m-I) m+NQm_1
E (p = E - - = L: - - P (n = I )
n -I I=m n-I
1
=mtQ{~}{NP-m+I}(NP J(NQ J(N J-
I=m n-I N- I+I m-I I- m I-I
=pm+I:Q-1{NP-m+I}(NP-IJ(NQ J(N-IJ-l =P.
I=m N-I+I m-2 I-m 1-2
Hence the theorem.

Theorem 2.7.2. An estimator for estimating the variance of p is given by


vp m-I[m-I
,(,)_ - - - - - (N-IXm-2)
- n- I n- I N(n - 2)
I]
N · (2.7.3)
Proof. We have
E{(m-l)(m-2)}= L: {(m-l)(m-2)}p(n=l)
(n-l)(n-2) tem (1-1)(1-2)

= {P(NP-I)} L: {NP-m+I}(NP-2J(NQ J(N-2J-1


N- I tem N- I+I m- 3 I- m I-3
Np 2 P
-----
N-I N-I
By the method of moments an estimator of p 2 is given by
Chapter 2: Simple Random Samp ling 125

1\2\
p =
(N - IXm- IXm- 2) +(m- I)
-- - .
1 . N(II - IXII - 2) N(II - I)
est imate

The variance of the estimator p is given by


V(p) = E(p 2)- {E(p)}2 .
By the method of mome nts, an unbiased estimator of V(p) is given by

v"(")_
p - {"}2
p - 11\p 2 ] _
- -(m-- If- - (N-I Xm- IXm-2 ) (m- I)
(II - If N(II - IXII - 2) N(II - I)
estimate

_ m- I [m- I _ (N - I Xm- 2) _-!.. ]


- II - I II -I N(II - 2) N ·
Hence the theore m.

Case II. SRSWR Inverse Sampling: In case of large N, the negative


hypergeometric probability distribut ion of total sample size II beco mes negative
binomial distri bution and is given by

P ( II; Ill , P ) = II - J
I P 11/Q /1- 11/ cl or /I = Ill, III + I, III + 2,...
( m- I
Then we have the following theorem.
Theorem 2.7.3 . An unbiased estimator of the required proportion P of a rare
attribute is
• Ill-I
p =- - (2.7.4)
/I - I
and an estimator of the V(p ) is given by
v(p) = p(l - p). (2.7.5)
11 -2
Proof. Ob vious for large N from the previous theor ems.

2.8 CONTROLLED SAMPLING

While using simple random sampling and without replacement (SRSWO R) design,
the number of possible samples, N C; , is very large, even for moderate sample and
population sizes. For example, if

Number of samples
N
/I 30 40
5 142,506 658,008
10 30,045 ,0 15 847,660,528
15 155, 117,520 40,225 ,345,056
126 Advanced sampling theory with applications

Some time s in the field surveys , all the possible samples are not equally preferable
from the operational point of view , because a few of them may be inaccessible,
expensive, or inconv enient, etc.. It is therefore advantageous if the sampl ing design
is such that the total number of poss ible samples is much less than N en , retainin g
the unbi asedness prop erties of sample mean and sample variance for their
respective population param eters. Neyman's (1923) notati on for causal effect s in
randomized experiments and Fisher's (1925) proposal to actually randomize
treatments to units . Neyman (1923) appears to have been the first to provide a
mathem atical analysis for a randomi zed experim ent with expli cit notation for the
potential outcomes, implicitly making the stability assumption. This notation
became standard for work in randomi zed exper iments (e.g., Pitman, 1937; Welch,
1937 ; McCarth y, 1939; Anscombe, 1948; Kempthorne, 1952; Brillin ger, Jones, and
Tuk ey, 1978; Hod ges and Lehmann, 1970 , and dozens of other place s, often
assuming con stant treatment effects as in Cox, 1958, and sometimes being used
quite informally as in Freedman, Pisan i, and Purv es, 1978) . Neym an's formali sm
was a maj or advan ce because it allowed explicit prob abil istic inferences to be
drawn from data, where the probabilit ies were explicitly defined by the random ized
ass ignment mechanism. Independently and nearly simultaneously, Fisher ( 1925)
invented a somewhat different method of inferenc e for rand omized experiments,
also based on the specia l class of randomi zed assignment mech anisms. Fisher's test
and resulting ' significance levels' (i.e., p values), remain the accepted rigorou s
standard for the analysis of randomi zed clinical trials at the end of the twent ieth
century, so called ' intent to treat analy ses. The notions of the cen tral role of
randomized experiments seems to have been ' in the air ' in the 1920, but Fisher was
the first to combine physical randomi zation with a theoreti cal analysis tied to it. A
review on randomi zation is also available by Fienberg and Tanur ( 1987). These
ideas were primarily assoc iated with the notion of fairness and obje ctivity in their
earlier work. The role of the International Statistical Institute in the earlier work
related to sample surveys, as reviewed by Smith and Sugd en (1985). Fienberg and
Tanur (1987) explored some of the developments following from their earlier
pioneer ing work with an emph asis on the parall els between the methodologies in
the design of experiments and the design of sample surveys . Chakrabarti (1963)
initiated the idea that the results on the existence and construction of balanced
sampling designs can be easily translated to the language of design theor y by using
the corr espondence between sampling design and block designs. Bellhouse (19 84a)
also work ed on these lines and has shown that a systematic applic ation of the
treatments minimi ses the variance of the treatment constant averaged over the
application of the treatment. The lack of cross reference in the review papers by
Cox ( 1984) and Smith (1984) suggested that the specia lisation extends even to
compartmentalisation within the minds and pro fession al lives of outstand ing
investigators, for both these authors have been steeped in the tradition of parall els.
For example, consider a balanced incomplete block design (BIBD) with standard
parameters (b , v, r, k,A), where v denotes the number of varieties, b the number of
blocks , k the block size, r the number of times each treatment occurs and A the
number of times any pair of treatments occur together in a blocks. In practice
Chapter 2: Simple Random Sampling 127

b < {I'Ck} . Each treatment represents a population units v = N, each block as a


sample with k = II . Thus each unit occurs in r samples and each pair of units in A
samples. Choose each sample with probability 1/b and under such designs, define
indicator variables

i ' __
SI
{I if i E S
0 if i '" S
such that £(1.." ) = -r ,
J b
and
I i f i.] E S ( ) A
i s,;; =
" { 0 if i, j '" S such that E i "I)" = -b . J

The sample mean


1 1/
Y= - L Yi
lIi=1
can be written as
_ 1 N
Y = - l.Y;lsi
II i=!

such that
I N ] 1N ) 1N r r N -
£(y)=£[ - l. Y/ si =- l.}j £(Jsi =- l. }j-=- l. }j=Y
II i= l II i= l II i=l b lib i=l
becau se vr = bk .
Similarly using r(k -I) = A(V-I) and bk = vr we have

1/ J ( N
£ ( ~~YiYj =£ ~ ~}jY/sij =-l. l. }jYj = - _
(
JA N
) l. ~ Yi Yj= -( _ )l.l.}jYj
(r(r - I)J N k(1I - I) N
'*1 '* 1 b'*l bv 1 '* 1 v v 1 1* 1
11(11-1) N
= (
N N- I
)l.DiYj
N.j
'

Thus, we have the following theorem:

Theorem 2.8.1. Under controlled sampling design, the sample mean and sample
variance rema in unbi ased to their respecti ve parameters.

2.9 DETERMINANT SAMPLING

Subramani and Trac y (1993) used the concept of incomplete block design in sample
surveys and introduced a new sampling scheme called determinant sampling. This
scheme totally ignore s the units close to each other for selection in the sample. In
the preceding discus sion , the units which are close to each other in some sense are
called contiguous units. Chakrabarti (1963) excluded conti guous units when
tran slating the result s of sampling designs to experimental design s since these units
have a tendency to provide identical inform ation which may be induced by factors
like time , category or location . As an example, in socio economic surveys people
128 Adva nced sampling theory with applications

have a tendency to exhibit similar expenditure patterns on household items dur ing
different wee ks of the month. More over peopl e belonging to the same income
category class have a grea ter tende ncy to have simi lar expe nditure patterns. With
regar d to the factor location, residents of a speci fic area show similar symp toms of a
disease caused by env ironmental pollution as of some infectious disease . Sim ilarly
in crop field surveys contiguous farms and fields shou ld be avoi ded. Because of
this limitation, Rao (197 5, 1987) has sugges ted that if contiguous units occ ur in any
observe d sample, they may be collapsed into a sing le unit, with the corresponding
response as the average observed respon se over these units. An estimate of the
unkn own par ameter is then recommended on the basis of such a reduced sample.
The situations for getting more information on the popul ation by avoiding pairs of
contiguous un its in the observed sample are well summarised by Heda yat, Rao, and
Stufk en (19 88). Tracy and Osahan (1994 a) furth er extend their work for other
sampling schemes.

EXERCISES

Exercise 2.1. Define simple random sampling. Is the sample mean a consistent or
unbi ased estimator of the population mean? Derive the variance of the estimator
using ( a ) SRSWR sampling ( b) SRSWOR sampling. Also derive an unbi ased
estimator of variance in each situation.
Exercise 2.2. A popul ation consists of N units, the value of one unit being known to
be YI • An SRS WOR of (II - I) units is drawn from the remainin g (N - I)
population units. Show that the estimator
)'1 = l) + (N-1)YIl_I
11 -1 N
where YIl- I = (11 - lt l LYi ,is an unbiased estimator of the popu lation total, Y=L Y j ,
i= 1 ;=2

but the variance of the estimator Y1 is not less than the variance of estim ator

Y2=NYIl ' where YIl=I1- 1 I Yi is an estimator of popu lation mean based on the sample
i= 1
of size 11 selected from the popul ation of N units. In other word s, the estimator Y1
is no more efficient than Y2. Give reasons.
Hint: By setting V(YI )?:: V(Y2) we obtain N > 11 which is always true for SRSWOR

sampling, where V(v,)=(N -If(_I


11- 1
I_)S; and V(Y2)=N2(~_~)S;
N- l N
.
11

Exercise 2.3. Suppose in the list on N businesses serially numb ered, k businesses
are found to be dead and t new businesses came into exis tence making the total
numb er of business (N - k + t). Give a simple procedure for selecting a businesses
with equal prob ability from (N - k + t) businesses, avoidin g renumbering of the
origi nal busine sses and show that the newly developed procedure achieves equal
probab ility for the new business too.
Chapte r 2: Simple Rand om Sampling 129

Hint: Using SRSWOR sampling the probability of selecting each unit will be
l
(N - k+ tt •

Exercise 2.4. Show that the bias in the Searl s' estimator defin ed as, Ysearl = AYn , is
B(Ysearl)= -Y V(YIl )/{f2 +v(y,.)}. Hence deduc e its values und er SRSWR and
SRSWOR.
Hint: Redd y (1 978a).

Exercise 2.5. An analogue to the Searls' estimator for estimating the population
propo rtion is defined as, P searl = Y Py , where y is a constant. Find the min imum
mean square error of the estimator Psearl under SRSWR and SRSWOR sampling.
Also study the bias of the estimator in each situation.
Hint: Conti (1995).

Exercise 2.6. An estimator of the optimum value of A in the Searls' estimator of


popul ation mean Y und er SRSWR sampling is given by,

i=[I+ 11
(N
N- I
)(s;,/y 2 )tJ
Show that i is a consi stent estimator of the optimum value of A . Also calcul ate
the bias and mean squared error, to the first order of approximation, in the estimator
of popu lation mean defined as, Yo = i yn . Deduce the results for estimating
popul ation proportion with the estimator, P searl = r Py , where r is a consistent
estimator of y .
Hint: Mangat, Singh , and Singh (199 1).

Exercise 2.7. Sho w that: ( a) under SRSWR sampling s;' is an unbi ased estimator
of a.~ , ( b) under SRSWOR sampling s;' is an unbia sed estimator of sJ,.
Exercise 2.8. Define the Searls' estimator of population mean . Show that the
relative efficiency of the Searls ' estimator is a decreasing funct ion of sample size
under (a) SWSWR (b) SRSWOR sampling designs.

Exercise 2.9. Show that the prob ability of selecting the i l " unit in the Sl" sampl e
remain the same under SRSWR and SRSWOR and is given by 1/ N .

Exercise 2.10. Why is the Searls ' estimator not useful in actual practice? Suggest
some modifications to make it practicable.
Hint: Use i in place of A. .
130 Advanced sampling theory with applications

Exercise 2.11. In case ofSWSWR sampling, if ther e are two characters Y and x ,

the covariance between Y and X is defined as a x)' =~ I (Yi - vXXi - x). Then the
N i=1

·
usua I estimator 0 f a . .
xy IS given
by Sri'
-X
= - I - I1/ ( Yi - Y Xi - rX) . Show that an
. n-I i= 1
estimator better than S ty based onl y on distinct units is:

Cv(II-I)] Sd(xy ) if v > I,


_ 1-
S,{xy ) - 0 j[ Cv(lI ) where Sd(xy) = (v- I t l ±
i= 1
t Vi - YvXXI - z.) .
otherwise,
H int: Pathak (1962).

Exercise 2.12. (a) Show that the usua l estimator of the popu lation tota l (namely
Ny) in SRSWOR has average minimum mean squared error, for permutations of
va lues attac hed to the units , in the general class of linear translation invariant
estimato rs of the population total Y.
( b ) Show that for SRSWOR sampling of size II , the estimator which minimises
the average mean squared error, for permutations of values atta ched to the unit s, in
the class of all linear estim ators is give n by,
N
Ie =-II (-I + Ii )iIYi
ES

where Ii = (N
N- I
-II ) C; and
II
Cy is the known population coefficient of variation.
Hint: Ramakrishnan and Rao ( 1975) .

Exercise 2.13. Let a finite population consi st of N units . To every unit there is
attached a characteristic y . The characteristics are assumed to be measured on a
given sca le with distinct points Y I,Y Z,...,Yt . Let N, be the number of unit s
associated with scale point Y/ ' with N = I N/ . A simp le random sample of size II
t

is drawn . Using the likelihood function of ( N], Nz ,...,Nt ) and assuming Nl n to be


an integer, show that a maximum like lihood estimator of the population mean Y is
_ I "
Yml = - L.,lI t Y /
/I t

where /It is the numb er of times the value s of Yt are observed in the sample.
Hin t: Hartl ey and Rao (196 8).

Exercise 2.14. Suppose we selected a sample of size II such that the {IJ unit of the
population occurs Ii times in the sample. Assume that II I of the se unit s (Ill < II ) are
r
selected with frequen cy one . Evidently II = /II + I f; , where r is the number of
i=1
units occ urring I i times in the sample. Let d (= III +r)be the number of distinct
Chapter 2: Simple Random Sampling 131

unit s in the sample. Th e d unit s are measured by one set of investigators and the r
repeated units by another set, preferably by the supervising staffs. The measurement
of the d units be denoted by Xl> X2, .•., xIII for the non-repeated ones and
xIII +1' xIII +2> •.•,x lIl +,. for the rep eated ones . The measurement of the r repeated unit s
be denoted by 21, 22, .. .,2,.. Us ing the abo ve information and not ation, study the
asymptotic properties of the following estimators of population mean:

(a) Xd =~~IIXIII
d
+rxrJ; (b ) ZR = Z,. ( ~dJ ;
X,.
(c) z/,.= Z,. +P(Xd - X,.);

( d ) YI = (1- W)Xd + wZR; and (e) Y2 = (1-W)xd + wZ/r •

Hint: Moh anty (1977) .

Exercise 2.15. Discuss the problem of the estimation of domain total in survey
sampling. Derive the estimator of domain total and find its variance under different
situations.

Exercise 2.16. Under SRSWR sampling, show that the distinct unit s based unbiased
estimators of the finite population variance a y2 are given by

(Nil) J (II- I)

( a) •
VI =
J= I
N"-I(N- I) Y
S
2•
, ( b) V2 =[(~ - ~ ) +NI-II(I - ~)}3;
. _ Cv_l(n - l) 2. ( d) ~ _
(N
il)J (II -1) [1- Cv(n- I)] 2.
( C ) v3 ~4 -
J- I
II-I ( )
- ( )
c, n
sd,
N N -I c, (n ) Sd ,

and ( e) Vs = . [(I NI) N(N-"-NI) ]2


- - -
v
+--- Sd

Hint: Path ak ( 1966).

Exercise 2.17. Discuss the method and theory of the estimation ofrare attributes in
survey sampling.

Exercise 2.18. Write a program in FORTRAN or SAS to find the values of the
coefficients Cv(n-I) and Cv(n ). Test the results for n = 5 and v = 3 with all steps
using your desk calculator.

Exercise 2.19. Under an SRSWOR sampling of n unit s out of a population of N


distin ct units, consider the following estim ator of population mean Y as
"
Ynew = L CIYI
1=1

where cI. is a constant depending on the /" draw, YI is the va lue of Y on the unit
se lected at the t" draw.
132 Ad vance d sampling theory with app lications

( a ) Show that -
Ynew is unbiased for population mean Y " Ci = 1.
if and onl y if L
i= J

( b ) Show also that under this condition V(Ynew) = s;t~c;( 1- ~ J} .


Hint: V(Ynew) = V( I CiYiJ = L C;V(Yi )+ LL CiCjCOV~i' Yj) = I c;V(Yi) = v(Yi)Ic;.
1=1 ' ''' }

( c ) Show that V(Ynew) is minim ised subject to the condition I Ci = 1 if and only if
i=1
Ci = 1/11, ,11 .
i = 1,2,...

H int: Ic; ~ (I ci i /11 = 1/11 and equal ity hold s if and only if Ci = 1/11 .

Exercise 2.20 . An SRSWR sample of size n is drawn until we have a pre-assigned

num b ermo f disti . .m rt.


istmct umts . Let y,,=-
- 1 - =-1 ~
I11/ fiYi an d YII/ L.Yi beth esampe
I
II ;= 1 III ; = 1
mean s based on the sample including repetitions and without repetitions.

( a ) Sho w that both estim ators y" and YII/ are unbi ased for population mean Y.

( b ) Show that v(y,,) = aJ,E(~ J.

( c ) Show that V(Ylll )= (~ - ~Js;


111 N
.
estimator YII/
( d ) Sh ow th at t hee esti - IS ' b etter tIian y"
- I'f E( 1J> N-
- III
---,-~
II m(N -I )'
Hint: Raj and Kham is (1958) .

Exercise 2.21. Discuss controlled sampling. Show that the sample mean and sampl e
variance rem ain unbiased to their respective parameters.

Exercise 2.22 . Discuss the concept of rare attribute and give a pos sible solution
using inverse samp ling.

PRACTICAL PROBL EM S
P r acti cal 2.1. Co nsider the problem of estimation of the tota l number of fish caught
by marine recreational fishermen at Atlantic and Gulf coasts. We know that there
were 69 species caught during 1992 as shown in the population 4 in the Appendix .
What is the minimum numb er of species groups to be se lected by SR SWR sampling
to attain the accuracy of relative standard error 12%?
Given: s; = 31,0 10,599 and Y = 291 ,882.
Chapter 2: Simpl e Random Sampling 133

Practical 2.2. Your supervis or has sugges ted you to think on the problem of
estim ation of the total numb er of fish caught by marine recreational fishermen at
Atlantic and Gulf coa sts. He told you that there were 69 species caught during 1993
as shown in the population 4 in the Appendix. He needs your help in deciding the
sample size using SRSWOR design with the relative standard erro r 25% . How your
kno wledg e in statistic s can help him?
Given: sJ, = 39,881,874 and Y = 316,784.
Practical 2.3. Th e demand for the Bluefi sh has been found to be highe st in certain
markets. In order to supply these types of fish the estimation of the proport ion of
bluefish is an important issue . At Atlantic and Gulf coas ts, in a large sampl e of
311,528 fish there were sho wn to be 10,940 Bluefish caught durin g 1995. What is
the minimum numb er of fish to be selected by SRSWR sampling to attain the
accuracy of relativ e standard error 12%?

Practical 2.4. John considers the problem of estim ation of the total number of fish
caught by marine recreati onal fishermen at Atlantic and Gulf coasts. There were 69
spec ies caught durin g 1994 as shown in the popul ation 4 in the Appendix. John
selected a sample of 20 units by SRSW R sampling. What will be his gain in
effi ciency ifh e considers the Sear ls' estimator instea d of usual estimator?
Given: sJ, = 49,829,270 and Y = 341,856.

Practical 2.5. Select an SRSWR sample of twenty units from population 4 given in
the Appendix. Collect the information on the number of fish during 1994 in each of
the species group selected in the sample . Estimate the average number of fish
caught by marine recreational fishermen at the Atlantic and Gulf coa sts dur ing
1994. Construc t 95% confid ence interval for the average numb er of fish in each
spec ies group of the United States .

Practical 2.6. Use populati on 4 of the Appendix to selec t an SRSW OR samp le of


sixteen units. Obtain the information on the numb er of fish durin g 1993 in each of
the spec ies group selecte d in the sample. Develop an estimate of the average
numb er of fish caught by marine recreational fishermen at Atlantic and Gulf coas ts
durin g 1993 . Construc t the 95% confi dence interva l estimate.

Practical 2.7. Select an SRSWR sample of 20 state s using Random Number Table
meth od from popul ation I of the Appendix. Note the frequency of each state
selected in the sample. Construct a new sample by keepin g onl y distinct states and
coll ect the information about the nonr eal estate farm loans in these states. From the
information collected in the sample:

( a ) Estimate the average nonreal estate farm loans in the Unit ed States USIng
information from distin ct units only.
( b ) Estimate the finite population variance of the nonreal estate loans in the United
States using distinct units only.
134 Adva nced sampling theory with applications

( c ) Estimate the average nonrea l estate loans and its finite pop ulati on variance by
inclu ding repeated unit s in the sample. Comment on the results.

Practical 2.8. A fisherman visited the Atlantic and Gulf coast and caught 6,000 fish
one by one. He noted the species group of eac h fish caught by him and put back
that fish in the sea before mak ing the next caught. He observed that 700 fish belon g
to the group Herrings.
( a ) Estimate the proportion of fish in the group Herrings living in the Atlanti c and
Gulf coast.
( b ) Co nstruc t the 95% confidence interval.

Practical 2.9. Durin g 1995 Michael visited the Atlantic and Gulf coast and caught
7,000 fish. He observed the spec ies group of each one of the fish caught by him
using SRSWOR sampling and found that 1,068 fish belong to the group Red
snapper.
( a ) Estimate the proportion of fish in the group Red snappe r living in the Atlantic
and Gul f coast.
( b ) Construct the 95% confid ence interval.
Gi ven: Total numb er of fish living in the coast = 311 ,52 8.

Practical 2.10. Follo win g the instructions of an ABC comp any, select an SRSW R
sample of 25 unit s from the popul ation I by using the 4 th and 5th co lumns of the
Pseud o-R and om Numb ers (PRN) given in Table I of the Appendix . Record the
states selected more than once in the sample. Reduc e the sample size by keeping
only eac h state onc e in the sample and collect the information about the real estate
farm loans in these states. Use this information to:

( a ) Estimate the average real estate farm loans 10 the Uni ted States using
inform ation from distin ct units only.
( b ) Estimate the finite popul ation variance of the real estate loans in the US using
informati on from distinct units only.
( c ) Estimate the average real estate loans and its finite popul ation variance by
includ ing repeated units in the sample. Comment on the result s.

Practical 2.11. You think of a practical situation where you have to estimate a total
of a variabl e or characteristic of a subgroup (dom ain) of a population. Tak e a
sample of reasonable size from the population under study and collect the
information from the units selected in the sample. Apply the appropriate formul ae
to construct the 95% confidence interval estimate.

Practical 2.12. A practic al situation arises where you have to estimate a proportion
of a rare attribute in a popul ation, e.g., extra marital relations. Coll ect the
information from the units selected in the sample throu gh inverse sampling from the
population under study. Apply the appropriate formul ae to construc t the 95%
confidence intervals for the prop ortion of the rare attribute in the popul ation.
Chapter 2: Simple Random Sampling 135

Practical 2.13. A sample of 30 out of 100 managers was taken, and they were
asked whether or not they usually take work home. The responses of these
managers are given below where ' Yes' indicates they usually take work home and
'No' means they do not.

Yes Yes Yes Yes No No Yes No Yes No


No Yes No Yes No Yes Yes Yes Yes Yes
No No Yes Yes Yes Yes Yes Yes No Yes

Construct 95% confidence intervals for the proportion of all managers who take
work home using the following sampling schemes :
( a ) Simple Random Sampl ing and With Replacement;
( b ) Simple Random Sampling and Without Replacement.

Practical 2.14. From a list of 80,000 farms in a state, a sample of 2,100 farms was
selected by SRSWOR sampling. The data for the number of cattle for the sample
were as follows :
n n 2
LYi = 38,000 , and L Yi = 920,000.
i ;1 i ;!
Estimate from the sample the total number of cattle in the state, the average number
of cattle per farm, along with their standard errors , coefficient of variat ion and 95%
confidence interval.

Practical 2.15. At St. Cloud State University, the length of hairs, Y, on the heads
of girls is assumed to be uniformly distributed between 5 em and 25cm with the
probability density function
1
f(y) = - \;j 5 < Y < 25
20
( a ) We wish to estimate the average length of hairs with an accuracy of relative
standa rd error of 5%, what is the required minimum number of hairs to be taken
from the girls?
( b ) Select a sample of the required size, and use it to construct a 95% confidence
interval for the average length of hairs?

Practical 2.16. The distribution ofweighty shipped to 1000 locations has a logistic
distribution

f Y =-sech
() 1
4fl.
2{ -
1 --
2 fl.
(x-a•J}
with a. = 10 and fl. = 0.5 .
( a ) Find the value of the minimum sample size n required to estim ate the average
weight shipped with an accuracy of standard error of 0.05% .
( b ) Select a sample of the required size and construct 95% confidence interval for
the average weight shipped.
( c) Does the true weight lies in the 95% confidence interval?
136 Advanced sampl ing theory with applicat ions

Practical 2.17. Assume that the life of every person is made of an infinite number
of good and bad events . Count the total number of good and bad events you
remember that have happened to you. Estimate the proportion of good events in
your life. Construct a 95% confidence interval estimate. Name the sampling
scheme you adopted to estimate proportion of good happenings, and comment.

Practical 2.18. Assuming that everyone dreams infinite number times during
sleeping hours in the life. Count the number of good and bad dreams in your life
you remember. Estimate the proportion of good dreams and construct a 95%
confidence interval estimate . Name the sampling scheme you followed to estimate
the proportion of good dreams, and comment.

Practical 2.19. Dr. Dreamer believes that if a person takes good dreams during
sleeping hours then he/she is mentally more healthy, and pleasant person . You are
instructed to report stories of your dreams to the doctor until you are not having 15
good dreams . Find the Dr. Dreamer's 95% confidence interval estimate of the
proportion of good dreams in your life. Can you be considered a pleasant person?
Comment and list the sampling scheme used.
3. USE OF AUXILIARY INFORMATION: SIMPLE RANDOM
SAMPLING

3.0 INTRODUCTION

It is well know n that suit able use of aux iliary informatio n in probab ility sam pling
results in co nsiderab le redu ction in the varia nce of the estimato rs of population
parameters viz. population mean (or total), med ian, variance, reg ress ion coefficient,
and popul ation correlation coefficient, etc.. In this chapter we will consider the
problem of estimation of different population parameters of interest to sur vey
statisticians using known auxiliary inform ation und er SRSWOR and SRSWR
sampling schemes only . Before proceeding furth er it is nece ssary to de fine som e
notation and ex pec ted values, which will be useful throu ghout this chapter.

3.1 NOTATION AND EXPECTED VALUES

Ass ume that a simple random sample (SRS) of size 11 is drawn from the give n
popul ation of N unit s. Let the value of the study variable Y and the auxiliary
variable X for the / " unit (i = 1,2,...,N) of the popul ation be denoted by >i and Xi
and for the i''' un it in the sample (i = 1,2,...,11) by Yi and Xi' respectiv ely. From the
sampl e obse rvations we have
- \ /I _ \ /I 2 \ /I _ 2 2 1 /I - 2
Y =- 'L Yt » X =- 'L Xi ' SY =-(- ) 'L (Yi - y) , Sx =-( - ) 'L (Xi - X) ,
11 i=1 11 i=1 11 - \ i=1 11 - \ i=1
and

S
xy
=-\()
11 -\ i=1
£(Y.-Y)(x.-x) .
I I

For the population observations we have the anal ogue qu antiti es


- \ N - 1 N 2 1 N( -)2 2 1 N( - )2
Y= - IYi , X = - I X i, Sy=- ( -) I>i -Yj , s, = -( -) 'LXi- Xj ,
N i=1 N i=1 N - 1 i=! N - \ i=1
and
N( - Y X i -X .
Sxy= -(- 1 -) 'L>i -X -)
N - \ i= 1
In genera l define the followi ng popul ation parameters

f i rs = -( -
I -) ~ (
L.. Yi - Y
-)1' (Xi - X- )s , and AI'S = fi rs
/V 1'/ 2 s /2)
' /20 fl 02 .
N- l i=\
Note that
fl2 0 = Sy2 , fl02 = S "2 and fil l = S ,y , so that Cy2 = Sy2/ Y-2 = fl 20 / Y-2 ,
Cr2 = S,2/ X- 2 = fl02 / - 2
X, and Pxy = S ty / (S,Sy ) = fill / (Vc-
f l 20 Vc-)
fl02 •

S. Singh, Advanced Sampling Theory with Applications


© Kluwer Academic Publishers 2003
138 Advanced sampling theory with applications

Let us define
y x
&0 ==-1, &1 =~-I,
Y X

To the first order of approximation we have

E(d)=C~f}A40 - I), E(d)=C~f}A04 - I), E(&J)= C~f)[ ~~: -I}


E(&0&Z)=(I -f)Cy A30, E(&0&3)=(1-f)CyAIZ, E(&0&4) = (l -f)Cy AZI,
n n n Pxy

E(&I&Z)=C~f )C,AZI' E(&I&3)=C~f )CxAo3' E(&I&4) =C~f)C, ~I,:,

E(&Z&3) =C~f}AzZ-I), E(&Z&4)=C~f )[;:~ -l and E(&3&4)=C~/)[~~: -Il


where f = n]N denotes the finite population correction (f.p.c.) factor . These
expected values can easily be obtained by following Sukhatme (1944) , Sukhatme
and Sukhatme (1970), Srivastava and Jhajj (1981) , and Tracy (1984) .
Also define
, 1 ~( -)r( -)' l' firs Sxv fill
firs = -_I)L.Y,-
( Y X, -X , /l.rs = ' r/Z',I/ z ' and r = -'- =
n ,=1 fizo fioz xy s,S y ~.,f.f;;;
as unbiased or consistent estimators of fi rs' Ars and P xy respectively .

The next section has been devoted to estimate the population mean in the presence
of known auxiliary information,

3.2 ESTIMATION OF POPULATION MEAN

Several estimators of population mean are available in the literature and we will
discuss some of them .
3.2.1 RATIO :ESTIMATOR

Cochran (1940) was the first to show the contr ibution of known auxiliary
information in improving the efficiency of the estimator of the population mean Y
in survey sampling. Assuming that the population mean X of the auxiliary variable
is known, he introduced a ratio estimator of population mean Y defined as
Chapter 3: Use of auxiliary informat ion: Simple random sampli ng 139

- -(XJ
:x .
YR = Y (3.2.1.1 )

Then we have the following theorems:

Theorem 3.2.1.1. The bias in the ratio estimator YR of the population mean Y , to
the first order of approximation, is

B(YR) = (1~f )V[C; - PXyCx;Cy] . (3.2.1.2)

Proof. The estimator YR in terms of eo and e, , can easily be written as


l
YR = V(1 + eoXI + elt . (3.2 .1.3)

Assum ing led < 1 and using the binomial expansion of the term (1 + e,t' we have

YR = Y(1 + eoX1-e\ + et + ok])) = V[1 + eo- e, + e?- eoe, + ok, )] . (3.2.104)

where O(e,) denot es the higher order terms of e1' Note that le]1 < 1, ef """""* 0 as
g > 1 increases. Therefore the terms in (3.2.1.4) with higher powers of e ] are
negligible and can be ignored. Now taking expected values on both sides of
(3.2.104) and using the results from section 3.1 we obtain

E(YR) =Y[l +C ~f )k~ - PxyCxCy}+o{n-')]. (3.2 .1.5)

Thus the bias in the estimator YR to the first order of approximation is given by
(3.2. 1.2). Henc e the theorem.

Theorem 3.2.1.2. The mean squared error of the ratio estimator Y R of the
population mean Y , to the first order of approx imation, is given by
MSE(h) = C ~f)y2[c; + C; - 2PxyCyCxl. (3.2.1.6)

Proof. By the definition of mean squared error (MSE) and usin g (3.2.104) we have
MSE(YR) = E[YR- r] "" E[V(1+eo- el +et - eoe, + 0(e2))- vf
"" V2 E[eo -e]+e? - eoe,]2.
Again neglecting high er order terms and using results from section 3.1 the MSE to
the first ord er of approximation is given by

MSE(h) = V2 E[e6 + 6} - 2eoel]= (1~f)V2 [c; + C~ - 2PxyCxCy].


Hence the theorem.
140 Advanced sampling theory with applic ations

By substituting the values of Cy ' C r and Pxy in (3.2 .1.6), one can easily see that
the mean squared error of the estimator YR, to the first order of approximation, can
be written as

MSE(YR) = (I ~/) (N~ l)i~[(Y; - f)-Rk - x)[ (3.2.1.7)

where R = f / X is the ratio of the two population means.

Theorem 3.2.1.3. An estimator of the mean squared error of the ratio estimator YR ,
to the first order of approximation, is

MSE(YR) = (I-'lf )[s~ + r2s; -2rsxy1 (3.2.1.8)


where r = YIX denotes the estimator of ratio of two sample means .

Proof. Obvious by replacing the population parameters with the corresponding


sample statistics in (3.2.1.6). Such a method of obtaining estimators is also called a
method of moments .

Theorem 3.2.1.4. Another form of the estimator of the mean squared error of the
ratio estimator YR , to the first order of approximation, is

MSE(YR)= (1-nf )_(


n
1)I [(Yi - y)-r(x; -x)]z
- I i=1 (3.2.1.9)
where r = YIx is the ratio of two sample means .
Proof. Obv ious by the method of moments.

Theorem 3.2.1.5. The ratio estimator YR is more efficient than sample mean Y if
<. 1
P ry - >- ' (3.2.1.10)
. Cr 2
Proof. The proof follows from the fact that the ratio estimator YR is more effic ient
than the sample mean Y if
MSE(YR) < v(y)
orif

C~,f )f2[C; +C; - 2 PxyCy cxl<C~f )f2C;


or if
Cx2 - 2pxy Cy Cx < 0
orif
1 c,
P xy > - - ·
2 Cy
Hence the theo rem.
Chapter 3: Use of auxiliary information: Simplerandom sampling 141

In the condition (3.2.1.10), if we assume that C y :::: e x ' then it holds for all values
of the correlation coefficient Pxy in the range (0.5, 1.0] . A Monte Carlo study of
ratio estimator is availab le from Rao and Beegle (1967). Thus we have the
following theorem.

T heore m 3.2.1.6. The ratio estimator YR is more efficient than the sample mean
Y if Pxy > 0.5 , i.e. , if the correlation between X and Y is positi ve and high.

Example 3.2.1.1. Mr. Bean was interested in estimating the average amount of real
estate farm loans (in $000 ) during 1997 in the United States. He took an SRSWOR
sample of eight states from the population 1 given in the Appendix. From the states
selected in the samp le he gathered the following information.

:, State
"
CA GA LA MS NM PA TX VT
Nonreal estate /' 3928.732 540.696 405.799 549.551 274.035 298.351 3520 .361 19.363
fafrrnloans(X..) '$
Real est~J~ farfn 1343.461 939.460 282.565 627.013 140.582 756.169 1248.761 57.747
loans (Y $,

The average amount $878.16 of nonreal estate farm loans (in $000) for the year
1997 is known. Apply the ratio method of estimation for estimating the average
amount of the real estate farm loans (in $000) during 1997. Also find an estimator
of the mean squared error of the ratio estimator and hence deduce 95% confidence
interval.

Solution. From the sample information , we have


,
Sr. •. Yi x,l Lv;- yf (x; -~f (V; - y Xx; :ex )
No .
:~,
"
.
1 1343.461 3928.732 447549 .2926 7489094.4980 1830775 .5040
2 939.460 540.696 70219 .8326 424341.5022 -172618.6237
3 282.565 405 .799 153589.3331 618286 .5613 308159.4078
4 627.oI3 549.551 2252.1431 412883.3536 30493 .8092
5 140.582 274.035 285036 .1296 842863 .5418 490 149.5300
6 756.169 298 .351 6674 .7674 798806 .9376 -730 19.52 17
7 1248.761 3520.361 329810.4398 5420748.0630 1337093 .6030
8 57.747 19.363 380346 .9504 1375337.8720 723260.3716
Sum 5395 .758 9536 .888 1675478.8890 17382362.3300 4474294.0800

Thus we have II = 8,
142 Advanced sampling theory with applications

8
f(Xi -xf IXi
s2 = -'.::
i-:..!...
l _ 17382362.33 2483194.6, x = .!.::.!.- = 9536.888 = 1192.11
x 8-1 7 8 8
8
f(Yi - y)2 I Yi
s2 = H 1675478.89 = 239354.1 - = .!.::.!.- = 5395.758 = 674.469
y 8-1 7 ' Y 8 8 '
8
I(Yi - yXXi - r )
S = H = 4474294.08 = 639184.86 and r = I = 674.469 = 0.5658 .
xy 8-1 7 ' x 1192.11
We are given X = 878.16, N = 50 and f = 0.16.

Thus the ratio estimate of average amount of real estate farm loans during 1997, Y
(say), is given by

-
YR
= -(
Y
XJ
x = 674.469(878.162)
1192.11
= 496.86

and an estimate of MSE(YR) is given by

MSE(h) = (I ~f )[s~ + rZs; -2rsXy]


= ( I- ~.16)[ 239354.1 + (0.5658)2 x 2483194.6- 2 x 0.5658 x 639184.86]
= 32654.65 .
A (1- a)lOO% confidence interval for population mean Y is given by
YR±ta /2(df = n-I).jMSE(YR)'

Using the Table 2 given in the Appendix the 95% confidence interval is given by

496.86± 2.365~32654 .65 or [69.490, 924.229].

Example 3.2.1.2. After applying the ratio method of estimat ion, Mr. Bean wants to
know if he achieved any gain in efficiency by using the ratio estimator. The amount
of real and nonreal estate farm loans (in $000) during 1997 in 50 different states of
the United States has been presented in population I of the Appendix. Find the
relative efficiency of the ratio estimator , for estimating the average amount of real
estate farm loans during 1997 by using known information on nonreal estate farm
loans during 1997, with respect to the usual estimator of population mean, given the
sample size is of eight units.

Solution. From the description of the population, we have Y; = Amount (in $000)
of real estate farm loans in different states during 1997, Xi = Amount (in $000) of
Chapter 3: Use of auxiliary information: Simple rando m sampling 143

nonreal estate farm loans in diffe ren t states during 1997, Y = 555.43, X = 878.16,
Sy2= 342021.5, C,2= 1.5256 , Cy2= 1.1086 , Pxy = 0.8038, an d N = 50 .

Thus we have

MSE(YR)= ( I ~f )y2[c; + C} - 2PxyCxCy]

= C-~.16 } 555.43r[1.1086 + 1.5256 - 2 x 0.8038·JL 1086 x 1.5256]

= 17606.39 .
Also

v(y) = C~f) s; = C-~.16) x 342021.5 = 35912.26 .

Thus the percent relative efficiency (RE) of the ratio estimator YR w ith respect to
the usual estima tor Y is given by
RE = v(-) x 100/ MSE(- ) = 35912.26 x 100 = 203.97%
y YR 17606.39
which shows that the ratio estimator is more effic ient than the usual estima tor of
pop ulatio n mean. It shou ld be noted that the relative efficiency does not depend
upon the sample size.

Theorem 3.2.1.7. The minimum sample size for the re lative standard error (RSE) to
be less tha n or equal to a fixed value ¢ is give n by

1/ >
¢2 y2
+-
1]-1 (3 .2 .1. 11)
- [ S2y + R 2S2x - 2RS xy N

Proof. T he re lative stan dard erro r of the ratio estimator YR is

RSE(yR) =~V(yR)/y2 = (+,-~ )(c; +d-2PXYCXCy).


Now

RSE(YR) :'> ¢ if (~ _ J...)(C;


II N
+C} -2PXYCXCy) :,>¢,
squar ing on both sid es we obtain

(~ _ J...)(c;
II N
+C} - 2PxyCxCy) :,> ¢2

-1
¢2y2 1
or 11 >[ 2 2¢2 +J...]-I or II ~ 2 2s;2- 2RS + -N ]
Cy + C, - 2pxyC,Cy N [ Sy + R xy
Hence the theorem.
144 Advanced sampling theory with applications

Example 3.2.1.3. Mr. Bean wishes to estimate the average real estate farm loans in
the United States with the help of ratio method of estimation by using known
information about the nonreal estate farm loans as shown in population I in the
Appendix . What is the minimum sample size required for the relative standard error
(RSE) to be equal to 12.5%?
Solution. From the description of the population I given in the Appendix, we have
- - 2 2
N = 50, Y = 555.43, X = 878.16, Sy = 342021.5, Sx = 1176526, Sxy = 509910.41,

R = Y- / X
- =-
555.43 ..
- = 0.63249 , ¢ = 0.125, th us th e minimum samp Ie size
. IS.
878.16

¢2yz 1]-1
n> +-
2 - 2RS
y + R S2
- [ S2 x xy N

=[
2
0.125 x (555.43'f +J...-]-I =20.51",21.
342021.5 + 0.632492 x 1176526- 2 x 0.63249 x 509910.41 50

Example 3.2.1.4 . Mr. Bean selected an SRSWOR sample of 2 I states listed in


population I. He collected information about the real estate farm loans and nonreal
estate farm loans from the selected states. He applied the ratio method of estimation
for estimating the average real estate farm loans assuming that the average nonreal
estate farm loans in the United States is known and is equal to $878. I6. Discuss his
95% confidence interval.

Solution. Note that the population size is 50. Mr. Bean started with the first two
columns of the Pseudo-Random Numbers (PRN) given in Table I of the Appendix
and selected the following 2 I distinct random numbers between I and 50 as: 01, 23,
46,04,32,47,33,05,22,38,29,40,03,36,27,19,14,42, 48, 06, and 07.

:it~~~i!~[ :rf,[!~~r' :;, I r!f'!r~i~fi~' :~ ! ~i(~: ~!~):,:[I:! : l i ~~i[~Y) g 1·','(:1:" ;11:1~;t;!-z:i Y,1
I~ : : ',:
:'i ' ,!;!" , \2
1,! !.i:J::'r,, ! 'N
01 AL 348.334 408.978 303627.6 21302 .6 80424 .302
03 AZ 43 I.439 54.633 218948 .3 250299 .3 234099 .570
04 AR 848.317 907.700 2605.2 124445.1 -18005 .653
05 CA 3928.732 1343.461 9177106 .0 621777 .6 2388748 .500
06 CO 906.281 315.809 47.9 57179.9 -1655.427
07 CT 4.373 7.130 800998.3 300087.3 490274 .840
14 IN 1022.782 1213.024 15233.5 433084 .8 81224.255
19 ME 51.539 8.849 718797.2 298206.9 462979 .800
22 MI 440.518 323.028 210534.2 53779.6 106406.960
23 MN 2466.892 1354.768 2457163.0 639737 .2 1253769.700
27 NE 3585.406 1337.852 7214853 .0 6 I2963.4 2102960 .000
29 NH 0.471 6.044 807998 .0 301278 .3 493388 .550
Continued .
Chapter 3: Use of auxiliary information: Simple random sampling 145

32 NY 426 .274 20 1.63 1 223808.6 124821 .8 167 14 1.200


33 NC 49 4.730 639. 57 1 163723.9 71 63.7 -34247 .22 1
36 OK 1716.087 6 12.108 667046.1 3269. 1 46697.097
38 PA 298 .35 1 756. 169 361209.5 40496.2 -120944.720
40 SC 80.750 87 .951 670119.2 218071. 5 382274.620
42 TN 388 .869 553 .266 260599.1 2.8 850.596
46 VA 188.477 321.583 50535 1.9 5445 1.9 165883.560
47 WA 1228.607 1100.745 108404.8 29 791 1.6 179708 .250
48 WV 29 .29 1 99 .277 7570 16.8 20 762 1.7 396450 .630
,,;<, Sum' 18886.5 20 11653.577 25645 192.0 ~4667 9 5 2 . 0 8858429 .300
x - Nonreal estate farm loans, y - Real estate farm loans.

Give n N=50 and X=878 .16 . Now from the above table, n =21, Y=55 4 .93223,
x = 899.35809, s; = 1282260 , s; = 233397.6, Sty = 442921.47 , r = 0.617, and
f = 0.42.

Thus rat io estima te of the ave rage real estate farm loans in the United States is

-
YR Y xJ
= -( X = 554.93223( 878.16 ) = $541.85 .
899 .35809
An estimate of MSE(YR) is give n by

MSE\.YR f )f
• to: ) = ( -1-n- lSy2 + r 2Sx2 - 2rs ]
xy

= C -2
0{42
) [ 233397.6 + (0.617)2 x 1282260 - 2 x 0.617 x 442921.47]

= 4832.64 .

A (I- a)1 00% co nfidence interval for population mea n Y is given by

Us ing Tabl e 2 from the Appendix the 95% confide nce interval for the ave rage real
estate farm loans is given by

541.85± 2 .086.J4832 .64, or [396.84, 686.86] .

3.2.2 PRODU€ffESHMAffOR

Murthy (1964) considered another est imator of popul ation mea n Y using known
population mean X of the aux iliary variable as a product estima tor

(3.2.2 .1)

Then we have the following theorems:


146 Advanced sampling theory with applications

Theorem 3.2.2.1. The exact bias in the product estimator yp of the population
mean r is given by

B(yp) = (1 ~f )rpXYCYCx . (3.2.2.2)

Proof. The product estimator yp in terms of Co and CI can easily be written as

yp = r(1 + CoXI + cd = r(1 + Co+ CI +&OCI)' (3.2 .2.3)


Taking expected values on both sides of (3.2.2.3) and using the results from section

l
3.1 we have

E(yp) = Y[I+C~f)pxyCxCy (3.2.2.4)

Thus the bias in the product estimator yp of the population mean is given by
B(yp)=E(yp)-Y =C~f)yPXYCXCy .
Hence the theorem.

Theorem 3.2.2.2. The mean squared error of the product estimator yp, to the first
order of approximation, is given by
MSE(yp) = C~f)y2[c; + C; + 2PxyCyCxJ, (3 .2.2.5)

Proof. By the definit ion of mean squared error (MSE) using (3.2.2.3) and again
neglecting higher order terms and using results from section 3.1 we have

MSE(yp)= E~p - yF = E[r(1 + Co + cl + cOc))- r Y= r 2E[CO+ c) + COc)t .

Thus the MSE, to the first order of approximation, is given by

- ) -2 r 2
MSE (YP = Y ElCo + CI2 + 2 cOCI ]
Hence the theorem.

Theorem 3.2.2.3. An estimator of the MSE of the product estimator yp , to the first
order of approximation, is given by
, ( ) () -f)[
MSE yp = -n- Sy2 +r2Sx2 +2rsxy] . (3.2.2.6)

Proof. It follows by the method of moments .

Theorem 3.2.2.4. The product estimator yp is more effic ient than sample mean y
if
Cy )
PXYC <-"2 ' (3.2.2.7)
x
Chap ter 3: Use of auxiliary inform ation : Simple random sa mpling 147

Proof. The proof follows from the fact that the product estimator y p IS more
efficient than the sample mean y if
MSE(yp) < V(y)
orif

C~/ }T2[c; + C; + 2 PxyCxCy ]<C~f )y2C;


orif
Cx2 + 2pxyCxC y < 0
or if
Cy I
PXYC <- 2" .
x
Hence the theorem.

In the condition (3.2.2.7), if we assume that Cy '" Cx ' then it holds for all values of
the correlation coefficient P xy in the range [-1. 0, - 0.5) . Thus we have the
following theorem .

Theorem 3.2.2.5. The product estimator yp is more efficient than the sample mean
y if Pxy < -0.5 , i.e. , if the correlation between X and Y is negative and high .

Remark 3.2.2.1. We observed that the product and ratio estimators are better than
sample mean if the value of P xy lies in the interval [-1.0, -0.5) and (+0.5, +1.0],
respecti vely. Thus the sample mean estimator remains better than both the ratio and
product estimators of the population mean if Pxy lies in the range [-0 .5, + 0.5] .

Example 3.2.2.1. A psychologist would like to estimate the average duration of


sleep (in minutes) during the night for persons 50 years of age and older in a small
village in the United States. It is known that there are 30 persons living in the
village aged 50 and over. Instead of asking everybod y the psychologi st selects an
SRSWOR sample of six persons of this age group and record s the information as
given below

Person no. v "~it i j. 3 7 10 17 22 29


Age X "(years) '\ :1;" j ':" " , 55 67 56 78 71 66
Duratiori'ofsleeit y (in mjnutes) 408 420 456 345 360 390

Assume that the average age 67.267 years of the subj ects is known as shown in the
population 2 in the Appendix. Assuming that as the age of a person increases then
the sleeping hours decrease, apply the product method of estimation for estimating
the average sleep time in the particular village under study. Also find an estimator
of the mean squared error of the product estimator and deduce a 95% confidence
interval.
148 Advanced sampling theory with applications

Solution. From the sample information, we have


Sr. No. , Yi
.<J!<:
Xi '(y i - Y-)2; (~i - i~ (Yi - yXXi - X )

i;< '"
1 408 55 132.25 110.25 - 120.75
2 420 67 552.25 2.25 35 .25
3 456 56 3540.25 90.25 -565.25
4 345 78 2652.25 156.25 -643 .75
5 360 71 1332.25 30.25 -200.75
6 390 66 42 .25 0.25 -3.25
Sum 2379 393 8251 .50 389.50 ~ 1 49 8 .5 0

Here }j = Duration of sleep (in minutes) , Xi = Age of subj ects (~50 years) , n = 6,

Y = 396.5, i = 65.5, s; = 77.9, s;= 1650.3, Sxy = -299.7, and r = Yli = 6.053 .
Also we are give n X = 67.267, N = 30 and f = 0.20.
Thus product estimate of the average sleep time, Y (say), is given by

Yp ~J = 396.5(~J
- = Y-( X 67.267 = 386.08'
and an estimate of MSE(yp) is given by
MSE(yp) = (1 ~f J[s; + r2s~ + 2rSry]
= ( 1- ~.20 J[1650.3 + (6.053)2x 77.9 - 2 x 6.053 x 299.7] = 116.83 .
A (1- a)100% confidence interval for population mean Y is given by

Yp±l a/2(df = n- 2WMSE(yp) .


Using Table 2 from the Appendix the 95% confidence interval for average sleeping
time is given by
386 .08±2.776.J116.83 , or [356.Q7, 416.08] .

Exa mple 3.2.2.2. The duration of sleep (in minutes) and age of 30 people aged 50
and over living in a small village of the United States is given in the population 2.
Suppose a psychologist selected an SRSW OR sample of six individuals to collec t
the required information. Find the relative efficiency of the prod uct estimator, for
estimating average duration of sleep using age as an auxiliary variable, with respect
to the usual estimator of popu lation mean .

Solution. Using the description of the population 2 given in the Appendix we have
Yi = Duration of sleep (in minutes), Xi = Age of subjects (~50 years), N = 30 ,
- 2 2
X = 2018, Y = 11526, X = 67.267, Y = 384.2, Sy = 3582.58, Sx = 85.237,
C y2 = 0.0243, C x2 = 0.0188 , Sxy = - 472.607, an d Pxy = -0.8552 .
Chapter 3: Use of auxiliary information: Simple random sampling 149

Thus we have
I- f Y
- ) = ( -n-
MSE (yp )-2 rlCy2+ Cx2+ 2pxyCxCy]
= C-~.20 }384.2)2[0.0243 + 0.0188 - 2 x 0.8552~0.0243x 0.0188]

= 128.759.
Also
v(y) = (I ~f )s; = (1- ~.20) x 3582.58 = 477.677 .

Thus the percent relative efficiency (RE) of the product estimator yp with respect
to the usual estimator y is given by

RE = v(y)x 100/MSE(yp) = 477.677 x 100 = 370.98%


128.759
which shows that the product estimator is more efficient than the usual estimator of
population mean . The relative efficiency is independent of sample size n.

Corollary 3.2.2.1. The minimum sample size for the relative standard error (RSE)
to be less than or equal to a fixed value ¢ is given by
- 1
¢2f2 1 (3.2.2.8)
n> +-
- [ S2
y + R 2S2
x + 2RSxy N]

3 ~2.3 REGRESSIONESTIMATQR .,

We consider an estimator of the population mean Y as


Ydif = Y+ d(X- x) (3.2.3.1)
where d is a constant to be chosen such that the variance of the estimator V(Ydif )
is minimum. Such an estimator is called difference estimator. The estimator Ydif
can be written as

Taking expected value on both sides of (3.2.3.2), we obtain


E(Ydif) = Y. (3.2.3.3)

Thus the difference estimator Ydif is unbiased for the population mean, Y. The
variance of the estimator Ydif is given by
V(Ydir) = E[Ydif - yf
= E[Y(I+8o)-dX&] - y]2 = E[Y&O - dX&]]2
=E[ y2&6+d2X2&,2 _2dY X&o&,]
= (l~f)[Y2C;+d2X2C~_2dY XPXyCxcJ (3.2.3.4)
150 Advanced sampling theor y with applications

On differentiating (3.2.3.4) with respect to d and equation to zero we obtain


Cy y Sxy
d = Pxy C
x
X = S.~ . (3.2.3.5)

On substituting optimum value of d in (3.2.3.4), the minim um variance of the


estimator Ydif is given by

. (_ ) (1- / )[ - 2 2 ( Y Cy ]2-2 2 y C y__


M mY Ydif = - - X C y + Pxy ~- X Cx - 2 pxy ~ - X YPxyCxC y
]
n XC x XC x

= (1-/)[S2 _
n y
s.~y
S2
] = (1-n/)Sy2[1_ S2S
Sly ; = (1- /) S2(I_ p.~) .
2 n y Y
(3.2.3.6)
x Y x
Cy Y Sxy
For the optimum value of d = Pxy - ~ = - 2 = /3 (regression coefficient, say) the
c, X Sx
difference estimator becomes

-
Ydif =Y S;
- + [ Sxy )(X
- - x-) . (3.2 .3.7)

Thu s the difference estimator becomes non-functional if the value of the regression
coeffic ient /3 = Sxy / s1 is unknown . In such situations, Hansen, Hurwitz, and
Madow (195 3) consider the linear regression estimator of the popul ation mean
Y as
YLR = Y + p(x - x), (3.2.3.8)
whe re p = s.w / s.~ denotes the estimator of the regression coefficie nt /3 = Sxy / S; .
Then we have the follo wing theorems:

Theorem 3.2.3.1. The bias in the linear regression estimator YLR of population
mean Y is given by

B(YLR) = (I-1)/3XC (Ao3- ~J.


Il
t
Pxy
(3.2.3.9)

Proof. The linear regression estimator YLR , in terms of &0 , &\ , &3 and &4 , can
easily be written as
-( ) Sxy (I +&4)[- -( )]
YLR = Y 1+ &0 + 2 X - X 1+ &]
Sx( I +&3)
l
= Y(I+ &0)+ /3(1+ &4XI + &3t [X - X(I + &1)] .
Using the binomial expansion (I + «r ' = 1- &3 + &f + 0(&3) we obtain
YLR = Y(I + &0)- /3X lcl +&]&4 - &J&3 + 0(&)] . (3.2.3.10)
Taking expected value on both sides of (3.2.3.10) and neglecting higher order
terms, we obtain
Chapter 3: Use of auxiliary information: Simple random sampling 151

E(YLR) = Y
- - f3 X 1)[
1- - Cx--C
- (-
n
Al2
Pxy
xAo3 J= Y+
- (-1- - f3XCx Ao3 -A-
n
l2 J.
Pxy
I) - [
Thu s the bias is given by

S(YL R)= E(YLR)- Y = ( 1- l )f3 XCX [Ao3 -


n
Al2
Pxy
J.
Hence the theorem.

Theorem 3.2.3.2. The mean squared error of the linear regression estimator YLR,
to the first order of approx imation, is

MSE(YLR) = C~/ )S~ (I - Px/ ). (3.2.3.11)

Proof. By the definition of mean squared error (MSE), using (3.2.3.10) and neglecting higher order terms, we have

MSE(ȳ_LR) = E[ȳ_LR - Ȳ]² = E[Ȳ(1+ε₀) - βX̄{ε₁ + ε₁ε₄ - ε₁ε₃ + O(ε²)} - Ȳ]²
          = E[Ȳ²ε₀² + β²X̄²ε₁² - 2βȲX̄ε₀ε₁].

Thus the MSE, to the first order of approximation, is given by

MSE(ȳ_LR) = ((1-f)/n)[Ȳ²C_y² + β²X̄²C_x² - 2βȲX̄ρ_xy C_x C_y]
          = ((1-f)/n)[S_y² + (S_xy/S_x²)² S_x² - 2(S_xy/S_x²) S_xy]
          = ((1-f)/n)[S_y² + S_xy²/S_x² - 2S_xy²/S_x²] = ((1-f)/n)[S_y² - S_xy²/S_x²]
          = ((1-f)/n) S_y² (1 - ρ_xy²).

Hence the theorem.
Theorem 3.2.3.3. An estimator of the mean squared error of the linear regression estimator ȳ_LR, to the first order of approximation, is given by

M̂SE(ȳ_LR) = ((1-f)/n) s_y² (1 - r_xy²).    (3.2.3.12)

Proof. It follows by the method of moments.

Theorem 3.2.3.4. The linear regression estimator ȳ_LR is always more efficient than the sample mean ȳ if ρ_xy ≠ 0.

Remark 3.2.3.1. If β̂ = ȳ/x̄ then the linear regression estimator ȳ_LR reduces to the usual ratio estimator ȳ_R, and if β̂ = -ȳ/X̄ then the linear regression estimator ȳ_LR reduces to the usual product estimator ȳ_P.

Example 3.2.3.1. A bank manager in the United States of America is interested in estimating the average amount of real estate farm loans (in $000) during 1997. A statistician took an SRSWOR sample of eight states from population 1 as given in the Appendix and collected the following information.

State                                AZ       CO       DE      LA       MT       NC       VT      WA
Nonreal estate farm loans ($), x_i   431.439  906.281  43.229  405.799  722.034  494.730  19.363  1228.607
Real estate farm loans ($), y_i      54.633   315.809  42.808  282.565  292.965  639.571  57.747  1100.745

Apply the regression method of estimation for estimating the average amount of the real estate farm loans (in $000) during 1997. Also find an estimator of the mean squared error of the regression estimator and deduce a 95% confidence interval. Assume that the average amount $878.16 of nonreal estate farm loans (in $000) for the year 1997 is known.
Solution. From the sample information, we have

Sr. No.   y_i        x_i        (y_i-ȳ)²        (x_i-x̄)²         (y_i-ȳ)(x_i-x̄)
1         54.633     431.439    86272.833580    9999.250014      29371.136040
2         315.809    906.281    1059.266526     140509.336300    -12199.870350
3         42.808     43.229     93359.198370    238345.342500    149170.138100
4         282.565    405.799    4328.373443     15784.467310     8265.656001
5         292.965    722.034    3068.093643     36327.883500     -10557.336240
6         639.571    494.730    84806.540240    1347.275378      -10689.142320
7         57.747     19.363     84453.227620    262217.989200    148812.484500
8         1100.745   1228.607   566090.147800   486048.449000    524544.791500
Sum       2786.843   4251.482   923437.681200   1190579.993000   826717.857300

Here n = 8, ȳ = 348.3554, x̄ = 531.4353, s_x² = 170082.85, s_y² = 131919.67, s_xy = 118102.55, β̂ = s_xy/s_x² = 0.6943, and r_xy = s_xy/(s_x s_y) = 0.7884. Also we are given X̄ = 878.16, N = 50 and f = 0.16.
Thus the regression estimate of the average amount of real estate farm loans during 1997 is given by

ȳ_LR = ȳ + β̂(X̄ - x̄) = 348.3554 + 0.6943×(878.16 - 531.4353) = 589.08

and an estimator of MSE(ȳ_LR) is given by

M̂SE(ȳ_LR) = ((1-f)/n) s_y² (1 - r_xy²) = ((1-0.16)/8)×131919.67×[1-(0.7884)²] = 5241.78.

A (1-α)100% confidence interval for the population mean Ȳ is given by

ȳ_LR ± t_{α/2}(df = n-2)√M̂SE(ȳ_LR).

Using Table 2 from the Appendix, the 95% confidence interval is given by

589.08 ± 2.447√5241.78, or [411.92, 766.24].
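The calculations of this example can be verified with a short script. The following minimal Python sketch (not from the text) uses the data of the example, the known mean X̄ = 878.16, and the tabulated t value 2.447 for df = 6:

```python
# Regression estimate, estimated MSE, and 95% CI for Example 3.2.3.1.
import math

x = [431.439, 906.281, 43.229, 405.799, 722.034, 494.730, 19.363, 1228.607]
y = [54.633, 315.809, 42.808, 282.565, 292.965, 639.571, 57.747, 1100.745]
n, N, X_bar = len(x), 50, 878.16

x_mean, y_mean = sum(x) / n, sum(y) / n
s2_x = sum((xi - x_mean) ** 2 for xi in x) / (n - 1)
s2_y = sum((yi - y_mean) ** 2 for yi in y) / (n - 1)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

beta_hat = s_xy / s2_x                       # estimated regression coefficient
r_xy = s_xy / math.sqrt(s2_x * s2_y)         # sample correlation coefficient
f = n / N                                    # sampling fraction

y_lr = y_mean + beta_hat * (X_bar - x_mean)          # regression estimate
mse_hat = (1 - f) / n * s2_y * (1 - r_xy ** 2)       # estimated MSE (3.2.3.12)
t = 2.447                                    # t_{0.025} with df = n - 2 = 6
half = t * math.sqrt(mse_hat)
print(round(y_lr, 2), round(mse_hat, 2), (round(y_lr - half, 2), round(y_lr + half, 2)))
# ~589.08, ~5241, CI ~ (412, 766); matches the text up to rounding.
```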

Example 3.2.3.2. Suppose a bank manager selects an SRSWOR sample of eight states to collect the required information on real estate and nonreal estate farm loans during 1997. Find the relative efficiency of the regression estimator, for estimating the average amount of real estate farm loans during 1997 by using data on nonreal estate farm loans during 1997, with respect to the ratio estimator of the population mean. The amounts of real and nonreal estate farm loans (in $000) during 1997 in the 50 states of the United States have been given in population 1 of the Appendix.

Solution. From the description of population 1 given in the Appendix we have

Ȳ = 555.43, X̄ = 878.16, S_y² = 342021.5, ρ_xy = 0.8038, and N = 50.

Also from Example 3.2.1.2 we have MSE(ȳ_R) = 17606.39. Now

MSE(ȳ_LR) = ((1-f)/n) S_y² (1 - ρ_xy²) = ((1-0.16)/8)×342021.5×[1-(0.8038)²] = 12709.55.


Thus the percent relative efficiency (RE) of the regression estimator ȳ_LR with respect to the ratio estimator ȳ_R is given by

RE = MSE(ȳ_R)×100/MSE(ȳ_LR) = 17606.39×100/12709.55 = 138.53%,

which shows that the regression estimator is more efficient than the ratio estimator of the population mean. The relative efficiency is independent of the sample size n.
Corollary 3.2.3.1. The minimum sample size for the relative standard error (RSE) of the linear regression estimator to be less than or equal to a fixed value φ is

n ≥ [ φ²Ȳ²/{S_y²(1 - ρ_xy²)} + 1/N ]⁻¹.    (3.2.3.13)

Example 3.2.3.3. Suppose a bank manager in the United States of America is interested in making future plans about the selection of sample size while estimating average real estate farm loans. The manager would also like to apply the regression method of estimation using known information about nonreal estate farm loans. What is the minimum sample size required for the relative standard error (RSE) to be equal to 12.5%? Use the data as shown in population 1 of the Appendix.

Solution. Population 1 given in the Appendix shows N = 50, Ȳ = 555.43, S_y² = 342021.5, and ρ_xy = 0.8038. Here φ = 0.125, thus the minimum sample size is:

n ≥ [ φ²Ȳ²/{S_y²(1 - ρ_xy²)} + 1/N ]⁻¹ = [ 0.125²×(555.43)²/{342021.5×(1 - 0.8038²)} + 1/50 ]⁻¹ = 16.7 ≈ 17.
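A minimal sketch of this sample size computation, wrapping the bound (3.2.3.13) in a small helper (the function name is ours, not from the text):

```python
# Minimum sample size for a target RSE of the regression estimator.
import math

def min_n_regression(phi, Y_bar, S2_y, rho, N):
    """Smallest (real-valued) n with RSE(regression estimator) <= phi, per (3.2.3.13)."""
    return (phi ** 2 * Y_bar ** 2 / (S2_y * (1 - rho ** 2)) + 1 / N) ** -1

n_min = min_n_regression(phi=0.125, Y_bar=555.43, S2_y=342021.5, rho=0.8038, N=50)
print(round(n_min, 1), math.ceil(n_min))   # 16.7 -> 17
```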

Example 3.2.3.4. A bank manager selects an SRSWOR sample of eighteen states from population 1 of the Appendix and collects information about real estate farm loans and nonreal estate farm loans. Estimate the average real estate farm loans by using the regression method of estimation, given that the average amount of nonreal estate farm loans in the United States is known to be equal to $878.16.

Solution. The bank manager used the 19th and 20th columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix to select the following 18 distinct random numbers between 1 and 50: 16, 31, 50, 29, 08, 33, 19, 28, 11, 07, 27, 37, 48, 22, 24, 46, 41, and 32.

Random No.  State  x_i        y_i        (x_i-x̄)²        (y_i-ȳ)²        (x_i-x̄)(y_i-ȳ)
07          CT     4.373      7.130      393759.3178     88444.6782      186617.0307
08          DE     43.229     42.808     346504.6366     68496.5732      154059.6645
11          HI     38.067     40.775     352608.4687     69564.8537      156617.8679
16          KS     2580.304   1049.834   3796373.8360    555483.2696     1452178.4160
19          ME     51.539     8.849      336790.3888     87425.1840      171592.4291
22          MI     440.518    323.028    36617.6715      342.3055        -3540.3997
24          MS     549.551    627.013    6777.3141       103997.5427     -26548.5219
27          NE     3585.406   1337.852   8723342.7430    1067761.5890    3051958.4380
28          NV     16.710     5.860      378428.5240     89201.6782      183729.3102
29          NH     0.471      6.044      398671.5725     89091.8028      188463.1771
31          NM     274.035    140.582    128049.7837     26877.7990      58665.9727
32          NY     426.274    201.631    42271.9539      10587.4839      21155.4634
33          NC     494.730    639.571    18808.8729      112254.8170     -45949.8268
37          OR     571.487    114.899    3646.7642       35958.5887      11451.3097
41          SD     1692.817   413.777    1125596.9840    11935.6717      115908.3954
46          VA     188.477    321.583    196602.1805     290.9242        -7562.8256
48          WV     29.291     99.277     363108.0127     42127.3572      123680.1559
50          WY     386.479    100.964    60219.4149      41437.6914      49953.5137
Sum                11373.760  5481.477   16708178.4400   2501279.8100    5842429.5700

x = Nonreal estate farm loans, y = Real estate farm loans.
Here N = 50 and X̄ = 878.16. The above table shows n = 18, ȳ = 304.5265, x̄ = 631.8754, s_x² = 982834.03, s_y² = 147134.11, and s_xy = 343672.33. Thus β̂ = 0.3496, r_xy = 0.9037 and f = 0.36.

Thus the regression estimate of the average real estate farm loans in the United States is

ȳ_LR = ȳ + β̂(X̄ - x̄) = 304.5265 + 0.3496×(878.16 - 631.8754) = 390.627.

An estimate of MSE(ȳ_LR) is given by

M̂SE(ȳ_LR) = ((1-f)/n) s_y² (1 - r_xy²) = ((1-0.36)/18)×147134.11×(1 - 0.9037²) = 959.059.

A (1-α)100% confidence interval for the population mean Ȳ is given by

ȳ_LR ± t_{α/2}(df = n-2)√M̂SE(ȳ_LR).

Using Table 2 from the Appendix, the 95% confidence interval for the average real estate farm loans is given by

390.627 ± 2.120√959.059, or [324.973, 456.280].

Example 3.2.3.5. Consider the following population consisting of five (N = 5) units A, B, C, D, and E, where for each one of the units in the population two variables Y and X are measured.

Units   A    B    C    D    E
y_i     9    11   13   16   21
x_i     14   18   19   20   24

Do the following:
( a ) Select all possible SRSWOR samples each of n = 3 units;
( b ) Find the variance of the sample mean estimator by definition;
( c ) Find the variance of the sample mean estimator using the formula, and comment;
( d ) Find the exact mean square error of the ratio estimator by definition;
( e ) Find the approximate mean square error of the ratio estimator using the first order approximation;
( f ) Find the ratio of the approximate mean square error to that of the exact mean square error of the ratio estimator, and comment;
( g ) Find the exact mean square error of the regression estimator using the definition;
( h ) Find the approximate mean square error of the regression estimator using the first order approximation;
( i ) Find the ratio of the approximate mean square error of the regression estimator to that of the exact mean square error, and comment;
( j ) Find the exact relative efficiency of the ratio estimator with respect to the sample mean estimator;
( k ) Find the approximate relative efficiency of the ratio estimator with respect to the sample mean estimator, and comment;
( l ) Find the exact relative efficiency of the regression estimator with respect to the sample mean estimator;
( m ) Find the approximate relative efficiency of the regression estimator with respect to the sample mean, and comment.
Solution. ( a ) From Chapter 1 we have the following information for this population:

Ȳ = 14, X̄ = 19, S_y² = 22, S_x² = 13, S_xy = 16.25, ρ_xy = 0.96, and β = 1.25.

All possible 10 samples of n = 3 units taken from the population of N = 5 units are listed below.

t     ȳ_t     x̄_t     b_t    P_t(ȳ_t-Ȳ)²   ȳ_R(t)   P_t(ȳ_R(t)-Ȳ)²   ȳ_LR(t)   P_t(ȳ_LR(t)-Ȳ)²
1     11.00   17.00   0.71   0.900         12.29    0.291            12.43     0.247
2     12.00   17.33   1.07   0.400         13.16    0.071            13.79     0.005
3     13.67   18.67   1.24   0.011         13.91    0.001            14.08     0.001
4     12.67   17.67   1.05   0.177         13.62    0.014            14.07     0.000
5     14.33   19.00   1.20   0.011         14.33    0.011            14.33     0.011
6     15.33   19.33   1.20   0.177         15.07    0.114            14.93     0.087
7     13.33   19.00   2.50   0.045         13.33    0.045            13.33     0.045
8     15.00   20.33   1.65   0.100         14.02    0.000            12.81     0.143
9     16.00   20.67   1.61   0.400         14.71    0.050            13.31     0.047
10    16.67   21.00   1.50   0.713         15.08    0.117            13.67     0.011
Sum   140.00  190.00         2.933                  0.714                      0.596

where

ȳ_LR(t) = ȳ_t + b_t(X̄ - x̄_t), ȳ_t = n⁻¹ Σ y_i, x̄_t = n⁻¹ Σ x_i, b_t = s_xy/s_x², ȳ_R(t) = ȳ_t(X̄/x̄_t),

P_t = 1/(N choose n) = 1/(5 choose 3) = 1/10, and t = 1, 2, ..., 10.
Now with this information, we can answer all of the above questions:
( b ) The exact variance of the sample mean ȳ_t is given by

Exact V(ȳ_t) = Σ_{t=1}^{10} P_t(ȳ_t - Ȳ)² = 2.933.

( c ) The variance of the sample mean ȳ_t by the formula is given by

V(ȳ_t) = ((1-f)/n) S_y² = ((1-3/5)/3)×22 = 2.933.

We can see from ( b ) and ( c ) that the exact variance and the variance by the formula are the same.
( d ) The exact mean square error of the ratio estimator ȳ_R(t) = ȳ_t(X̄/x̄_t) is given by

Exact MSE(ȳ_R) = Σ_{t=1}^{10} P_t(ȳ_R(t) - Ȳ)² = 0.714.

( e ) The approximate mean square error of the ratio estimator is given by

Approx.MSE(ȳ_R) = ((1-f)/n)[S_y² + R²S_x² - 2RS_xy] = ((1-3/5)/3)[22 + (14/19)²×13 - 2×(14/19)×16.25] = 0.681.

( f ) The ratio of the approximate mean square error to the exact mean square error is given by

Ratio of Mean Square Errors = Approx.MSE(ȳ_R)/Exact MSE(ȳ_R) = 0.681/0.714 = 0.953.

Note that this ratio of the mean square errors approaches unity if the sample size and population size are such that f = n/N → 0.

( g ) The exact mean square error of the linear regression estimator ȳ_LR(t) = ȳ_t + b_t(X̄ - x̄_t) is given by

Exact MSE(ȳ_LR) = Σ_{t=1}^{10} P_t(ȳ_LR(t) - Ȳ)² = 0.596.

( h ) The approximate mean square error of the linear regression estimator is

Approx.MSE(ȳ_LR) = ((1-f)/n) S_y² [1 - ρ_xy²] = ((1-3/5)/3)×22×[1 - 0.96²] = 0.230.

( i ) The ratio of the approximate mean square error to the exact mean square error of the linear regression estimator is given by

Ratio of Mean Square Errors = Approx.MSE(ȳ_LR)/Exact MSE(ȳ_LR) = 0.230/0.596 = 0.386.

Note that, for this particular example, the ratio of the approximate mean square error to the exact mean square error is far from one, but if f = n/N → 0 then this ratio approaches unity.

( j ) The exact relative efficiency (RE) of the ratio estimator with respect to the sample mean estimator is

Exact RE of the Ratio Estimator = V(ȳ_t)×100/Exact MSE(ȳ_R) = 2.933×100/0.714 = 410.78%.
( k ) The approximate relative efficiency of the ratio estimator with respect to the sample mean estimator is

Approximate RE of the Ratio Estimator = V(ȳ_t)×100/Approx.MSE(ȳ_R) = 2.933×100/0.681 = 430.69%.

It shows that the approximate relative efficiency expression for the ratio estimator gives a slightly higher efficiency than in reality.

( l ) The exact relative efficiency of the regression estimator with respect to the sample mean estimator is

Exact RE of the Regression Estimator = V(ȳ_t)×100/Exact MSE(ȳ_LR) = 2.933×100/0.596 = 492.11%.

( m ) The approximate relative efficiency of the linear regression estimator with respect to the sample mean estimator is

Approx. RE of the Regression Estimator = V(ȳ_t)×100/Approx.MSE(ȳ_LR) = 2.933×100/0.230 = 1275.21%.

It also shows that the approximate relative efficiency expression for the regression estimator gives a higher efficiency than in reality.
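All the exact results of this example can be reproduced by brute force enumeration of the (5 choose 3) = 10 samples. A minimal Python sketch (not from the text):

```python
# Exact variance of the sample mean and exact MSEs of the ratio and
# regression estimators for the N = 5 population of Example 3.2.3.5.
from itertools import combinations

y = [9, 11, 13, 16, 21]
x = [14, 18, 19, 20, 24]
N, n = 5, 3
Y_bar, X_bar = sum(y) / N, sum(x) / N

v_mean = mse_ratio = mse_reg = 0.0
samples = list(combinations(range(N), n))
for s in samples:
    ys, xs = [y[i] for i in s], [x[i] for i in s]
    yt, xt = sum(ys) / n, sum(xs) / n
    sxy = sum((a - xt) * (b - yt) for a, b in zip(xs, ys)) / (n - 1)
    sxx = sum((a - xt) ** 2 for a in xs) / (n - 1)
    bt = sxy / sxx                        # sample regression slope b_t
    p = 1 / len(samples)                  # P_t = 1/10
    v_mean += p * (yt - Y_bar) ** 2
    mse_ratio += p * (yt * X_bar / xt - Y_bar) ** 2
    mse_reg += p * (yt + bt * (X_bar - xt) - Y_bar) ** 2

print(round(v_mean, 3), round(mse_ratio, 3), round(mse_reg, 3))
# 2.933 0.714 0.595  (the text's 0.596 comes from summing rounded entries)
```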

Caution: Be careful while using the approximate expression for the mean square error of the linear regression estimator, or the approximate expression for estimating that mean square error. The true interval estimate of the population mean may be wider than the one constructed with the approximate results.

Note the following graphical situations in Figure 3.2.1 for the use of the ratio, product, and regression estimators in actual practice.

[Figure: three scatter plots of the study variable (y) against the auxiliary variable (x), one each for the ratio estimator, the product estimator, and the regression estimator.]
Fig. 3.2.1 Situations for using ratio, product, or regression estimators.

The following table collects some information about these three estimators which will be useful to the readers:
Ratio estimator:
1. The correlation between y and x must be positive and high (within +0.5 and +1.0).
2. The regression line between y and x should pass through the origin.
3. The approximate mean square error expression will be small if ( a ) the f.p.c. f = n/N is large, ( b ) the sample size n is large, ( c ) the correlation between y and x is very close to plus one, and ( d ) the error terms e_i = (Y_i - Ȳ) - R(X_i - X̄) are small.
4. The usual estimator of the approximate mean square error may be low if the sample size is large, which may provide a smaller confidence interval estimate than the actual one.
5. We have to estimate only one model parameter, so the degrees of freedom for constructing confidence interval estimates will be df = (n-1).

Product estimator:
1. The correlation between y and x must be negative and high (within -1.0 and -0.5).
2. The regression line between y and x may or may not pass through the origin. Note: if the regression line between two negatively correlated variables were to pass through the origin, then one of the variables among y and x would be negative, which may not be practicable.
3. The approximate mean square error expression will be small if ( a ) the f.p.c. f = n/N is large, ( b ) the sample size n is large, ( c ) the correlation between y and x is very close to minus one, and ( d ) the error terms e_i = (Y_i - Ȳ) + R(X_i - X̄) are small.
4. The usual estimator of the approximate mean square error may be low if the sample size is large, which may provide a smaller confidence interval estimate than the actual one.
5. If both variables are positive (x > 0 and y > 0) but the correlation is negative, then we have both an intercept and a slope, and then we must use df = (n-2).

Regression estimator:
1. The correlation between y and x must be non-zero within the range [-1.0, +1.0].
2. The regression line may have both parameters, viz. intercept and slope.
3. The approximate mean square error expression will be small if ( a ) the f.p.c. f = n/N is large, ( b ) the sample size n is large, ( c ) the correlation between y and x is very close to plus or minus one, and ( d ) the error terms e_i = (Y_i - Ȳ) - β(X_i - X̄) are small.
4. The usual estimator of the approximate mean square error may be low if the sample size is large, which may provide a smaller confidence interval estimate than the actual one.
5. Here we have two unknown parameters, viz. intercept and slope; thus we must use df = (n-2). More justification is given in Section 3.6.

3.2.4 POWER TRANSFORMATION ESTIMATOR

Srivastava (1967) considered another estimator of the population mean Ȳ, using the known population mean X̄ of the auxiliary variable, as a power transformation estimator given by

ȳ_PW = ȳ (x̄/X̄)^α    (3.2.4.1)

where α is a suitably chosen constant; if α = 1 then ȳ_PW reduces to the product estimator ȳ_P, and if α = -1 then ȳ_PW reduces to the ratio estimator ȳ_R.

Then we have the following theorems.

Theorem 3.2.4.1. The bias in the power transformation estimator ȳ_PW, to the first order of approximation, is given by

B(ȳ_PW) = ((1-f)/n) Ȳ [ {α(α-1)/2} C_x² + αρ_xy C_x C_y ].    (3.2.4.2)

Proof. The power transformation estimator ȳ_PW, in terms of ε₀ and ε₁, can easily be written as

ȳ_PW = Ȳ(1+ε₀)(1+ε₁)^α = Ȳ(1+ε₀)[1 + αε₁ + {α(α-1)/2} ε₁² + O(ε₁³)]
     = Ȳ[1 + ε₀ + αε₁ + {α(α-1)/2} ε₁² + αε₀ε₁ + O(ε³)].    (3.2.4.3)

Taking expected values on both sides of (3.2.4.3) and using results from Section 3.1, we obtain (3.2.4.2). Hence the theorem.
Theorem 3.2.4.2. The minimum mean squared error of the power transformation estimator ȳ_PW, to the first order of approximation, is given by

Min.MSE(ȳ_PW) = ((1-f)/n) S_y² [1 - ρ_xy²].    (3.2.4.4)

Proof. By the definition of mean squared error (MSE), using (3.2.4.3) and again neglecting the higher order terms, we have

MSE(ȳ_PW) = E[ȳ_PW - Ȳ]² = E[Ȳ(1 + ε₀ + αε₁ + O(ε²)) - Ȳ]² = Ȳ²E[ε₀² + α²ε₁² + 2αε₀ε₁].

Thus the MSE, to the first order of approximation, is given by

MSE(ȳ_PW) = ((1-f)/n) Ȳ² [C_y² + α²C_x² + 2αρ_xy C_x C_y].    (3.2.4.5)

On differentiating (3.2.4.5) with respect to α and equating to zero, we obtain the optimum value of α as

α = -ρ_xy C_y/C_x.    (3.2.4.6)

On substituting the optimum value of α in (3.2.4.5) we obtain

Min.MSE(ȳ_PW) = ((1-f)/n) S_y² [1 - ρ_xy²].    (3.2.4.7)

Hence the theorem.

The power α depends upon the optimum values of unknown parameters; thus the estimator ȳ_PW is not practicable, and we have the following corollary.

Corollary 3.2.4.1. A practically useful power transformation type estimator ȳ_PW(pract) of the population mean Ȳ is given by

ȳ_PW(pract) = ȳ (x̄/X̄)^α̂    (3.2.4.8)

where

α̂ = -(x̄ s_xy)/(ȳ s_x²)

is a consistent estimator of α. Note that while making a confidence interval estimate with the power transformation estimator the degrees of freedom will be (n-2).
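A minimal sketch of the practicable estimator (3.2.4.8), reusing the summary figures of the eight state sample of Example 3.2.3.1; under our reading of the estimator the resulting value, about 593, is close to the regression estimate 589.08 obtained there, as the theory suggests:

```python
# Practicable power transformation estimate from sample summary statistics
# (values taken from Example 3.2.3.1).
y_mean, x_mean = 348.3554, 531.4353
s2_x, s_xy = 170082.85, 118102.55
X_bar = 878.16

alpha_hat = -(x_mean * s_xy) / (y_mean * s2_x)   # consistent estimator of alpha
y_pw = y_mean * (x_mean / X_bar) ** alpha_hat    # power transformation estimate
print(round(alpha_hat, 4), round(y_pw, 2))       # ~ -1.0593, ~ 593.05
```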

Remark 3.2.4.1. The difference estimator ȳ_dif of the population mean Ȳ, given as

ȳ_dif = ȳ + d(X̄ - x̄),    (3.2.4.9)

has variance equal to the mean squared error of the linear regression estimator for the optimum value of d = S_xy/S_x² = β. Again note that the degrees of freedom for constructing confidence interval estimates will be df = (n-1), because the slope is assumed to be known but we estimate the intercept.

3.2.5 A DUAL OF RATIO ESTIMATOR

Srivenkataramana and Tracy (1980) considered the following estimator of the population mean Ȳ, based on the use of the mean value of the non-sampled information on the auxiliary variable, defined as

ȳ_nsu = ȳ (NX̄ - nx̄)/{(N-n)X̄}    (3.2.5.1)

or

ȳ_nsu = ȳ (x̄*/X̄)    (3.2.5.2)

where x̄* = (NX̄ - nx̄)/(N-n) = (N-n)⁻¹ Σ_{i=1}^{N-n} x_i denotes the mean of the non-sampled units of the auxiliary variable.

Then we have the following theorems.

Theorem 3.2.5.1. The estimator ȳ_nsu is an inconsistent estimator of the population mean Ȳ.

Proof. The estimator ȳ_nsu, in terms of ε₀ and ε₁, can be written as

ȳ_nsu = ȳ (NX̄ - nx̄)/{(N-n)X̄} = Ȳ(1+ε₀)[NX̄ - nX̄(1+ε₁)]/{(N-n)X̄} = Ȳ(1+ε₀)[1 - {n/(N-n)}ε₁]
      = Ȳ[1 + ε₀ - {n/(N-n)}ε₁ - {n/(N-n)}ε₀ε₁].

Taking expected values on both sides, and using E(ε₀) = E(ε₁) = 0 and E(ε₀ε₁) = ((1-f)/n)ρ_xy C_x C_y, we have

E(ȳ_nsu) = Ȳ E[1 + ε₀ - {n/(N-n)}ε₁ - {n/(N-n)}ε₀ε₁] = Ȳ - {n/(N-n)}×{(1-f)/n}×Ȳρ_xy C_x C_y = Ȳ - Ȳρ_xy C_x C_y/N.

Thus the bias in the estimator ȳ_nsu is given by

B(ȳ_nsu) = E(ȳ_nsu) - Ȳ = -Ȳρ_xy C_x C_y/N = -S_xy/(NX̄),

which proves the theorem.

Theorem 3.2.5.2. The mean squared error of the estimator ȳ_nsu, to the first order of approximation, is given by

MSE(ȳ_nsu) = ((1-f)/n) Ȳ² [C_y² + g²C_x² - 2gρ_xy C_x C_y]    (3.2.5.3)

where g = n/(N-n) and higher order terms are ignored.
Proof. We have

MSE(ȳ_nsu) = E[ȳ_nsu - Ȳ]² = E[Ȳ{1 + ε₀ - gε₁ - gε₀ε₁} - Ȳ]² ≈ Ȳ²E[ε₀ - gε₁]² = Ȳ²E[ε₀² + g²ε₁² - 2gε₀ε₁]
           = ((1-f)/n) Ȳ² [C_y² + g²C_x² - 2gρ_xy C_x C_y].
Hence the theorem.

Theorem 3.2.5.3. The estimator ȳ_nsu is more efficient than the ratio estimator ȳ_R if

n < N/2 and ρ_xy < N/{2(N-n)},    (3.2.5.4)

assuming that the correlation coefficient ρ_xy is positive and C_y ≈ C_x.

Proof. The estimator ȳ_nsu will be more efficient than the ratio estimator ȳ_R if

MSE(ȳ_nsu) < MSE(ȳ_R),

that is, if

((1-f)/n) Ȳ² [C_y² + g²C_x² - 2gρ_xy C_x C_y] < ((1-f)/n) Ȳ² [C_y² + C_x² - 2ρ_xy C_x C_y],

or (g² - 1)C_x² - 2(g-1)ρ_xy C_x C_y < 0, or (g-1)(g+1)C_x² - 2(g-1)ρ_xy C_x C_y < 0, or

(g-1)[(g+1)C_x² - 2ρ_xy C_x C_y] < 0.    (3.2.5.5)

Now there are two cases:

Case 1. The inequality (3.2.5.5) will be satisfied if

g - 1 < 0 and (g+1)C_x² - 2ρ_xy C_x C_y > 0,

or n/(N-n) - 1 < 0 and (g+1)C_x > 2ρ_xy C_y,

or (n - N + n)/(N-n) < 0 and ρ_xy C_y/C_x < (g+1)/2,

or n < N/2 and ρ_xy C_y/C_x < {n + (N-n)}/{2(N-n)} = N/{2(N-n)}.

For C_y ≈ C_x we have n < N/2 and ρ_xy < N/{2(N-n)}.

This condition holds in practice. For example, if N = 100 and n = 30 then ρ_xy is supposed to be less than 0.714.

Case 2. The inequality (3.2.5.5) will be satisfied if

g - 1 > 0 and (g+1)C_x² - 2ρ_xy C_x C_y < 0,

or n/(N-n) - 1 > 0 and (g+1)C_x < 2ρ_xy C_y,

or (n - N + n)/(N-n) > 0 and ρ_xy C_y/C_x > (g+1)/2,

or n > N/2 and ρ_xy C_y/C_x > {n + (N-n)}/{2(N-n)} = N/{2(N-n)}.

For C_y ≈ C_x we have

n > N/2 and ρ_xy > N/{2(N-n)}.

This condition will not hold in practice. For example, if N = 100 and n = 70 then the value of ρ_xy needs to be more than 1.667, which is not possible. Hence the theorem.
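The efficiency condition (3.2.5.4) is easy to check numerically. A minimal sketch (not from the text), assuming C_y ≈ C_x as in the proof:

```python
# Checks whether the dual estimator beats the ratio estimator, per (3.2.5.4).
def dual_beats_ratio(N, n, rho):
    """True when n < N/2 and rho < N / (2*(N - n)), assuming C_y ~ C_x."""
    return n < N / 2 and rho < N / (2 * (N - n))

print(dual_beats_ratio(N=100, n=30, rho=0.70))   # True  (the bound is 0.714)
print(dual_beats_ratio(N=100, n=30, rho=0.80))   # False (0.80 exceeds 0.714)
```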

3.2.6 GENERAL CLASS OF ESTIMATORS

Srivastava (1971) proposed a general class of estimators to estimate the population mean Ȳ of the study variable which, in the case of a single known mean X̄ of the auxiliary variable, is given by

t_g = ȳ H(u)    (3.2.6.1)

where u = x̄/X̄ and H(·) is a parametric function such that it satisfies the following conditions:

( a ) H(1) = 1;    (3.2.6.2)
( b ) The first and second order partial derivatives of H with respect to u exist and are known constants at the given point u = 1.

Expanding H(u) about the value 1 in a second order Taylor's series we have

H(u) = H[1 + (u-1)] = H(1) + (u-1) ∂H/∂u |_{u=1} + {(u-1)²/2} ∂²H/∂u² |_{u=1} + ......    (3.2.6.3)

Note that |u - 1| < 1, thus the higher order terms can be neglected. Using (3.2.6.2) and (3.2.6.3) in (3.2.6.1) we obtain

t_g = ȳ[1 + (u-1)H₁ + (u-1)²H₂ + .....]    (3.2.6.4)

where H₁ = ∂H/∂u |_{u=1} and H₂ = (1/2) ∂²H/∂u² |_{u=1} denote the first and second order partial derivatives of H with respect to u and are known constants. Evidently the class of estimators t_g given at (3.2.6.4) can easily be written in terms of ε₀ and ε₁ as

t_g = Ȳ(1+ε₀)[1 + ε₁H₁ + ε₁²H₂ + ....] = Ȳ[1 + ε₀ + ε₁H₁ + ε₁²H₂ + ε₀ε₁H₁ + O(ε³)].    (3.2.6.5)

Thus we have the following theorems:

Theorem 3.2.6.1. The bias in the general class of estimators t_g defined at (3.2.6.1), to the first order of approximation, is

B(t_g) = ((1-f)/n) Ȳ [H₂C_x² + H₁ρ_xy C_y C_x].    (3.2.6.6)

Proof. Taking the expected value on both sides of (3.2.6.5) we obtain

E(t_g) = Ȳ[1 + ((1-f)/n)(H₂C_x² + H₁ρ_xy C_y C_x)].

Thus the bias in the class of estimators t_g is given by

B(t_g) = E(t_g) - Ȳ = ((1-f)/n) Ȳ [H₂C_x² + H₁ρ_xy C_y C_x].

Hence the theorem.

Theorem 3.2.6.2. The minimum mean squared error of the general class of estimators t_g defined at (3.2.6.1), to the first order of approximation, is given by

Min.MSE(t_g) = ((1-f)/n) Ȳ² C_y² (1 - ρ_xy²).    (3.2.6.7)

Proof. By the definition of the mean squared error we have

MSE(t_g) = E[t_g - Ȳ]² ≈ E[Ȳ{1 + ε₀ + ε₁H₁ + ε₁²H₂ + ....} - Ȳ]².

Again neglecting higher order terms we have

MSE(t_g) = Ȳ²E[ε₀² + H₁²ε₁² + 2H₁ε₀ε₁] = ((1-f)/n) Ȳ² [C_y² + H₁²C_x² + 2H₁ρ_xy C_y C_x].    (3.2.6.8)

On differentiating (3.2.6.8) with respect to H₁ and equating to zero we obtain

H₁ = -ρ_xy C_y/C_x.    (3.2.6.9)

On substituting the optimum value of H₁ from (3.2.6.9) in (3.2.6.8) we obtain (3.2.6.7). Hence the theorem.

Remark 3.2.6.1. If we attach any function of x̄/X̄ to the sample mean ȳ, the asymptotic minimum mean squared error of the resultant estimator cannot be reduced further than that given in (3.2.6.7). Thus the usual ratio estimator, product estimator, and power transformation estimator are special cases of the class of estimators defined in (3.2.6.1).
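To see the role of H₁ concretely, the following minimal Python sketch (not from the text) evaluates the first order MSE expression (3.2.6.8) for three members of the class: the ratio estimator (H₁ = -1), the product estimator (H₁ = +1), and the optimum choice (3.2.6.9). The parameter values are illustrative; C_x in particular is an assumed value, not taken from the Appendix.

```python
# First order MSE (3.2.6.8) of the class t_g = ybar*H(u) for chosen H1.
def mse_tg(H1, Y_bar, C_y, C_x, rho, f, n):
    """MSE of t_g with H1 = dH/du evaluated at u = 1."""
    return (1 - f) / n * Y_bar ** 2 * (
        C_y ** 2 + H1 ** 2 * C_x ** 2 + 2 * H1 * rho * C_y * C_x)

pars = dict(Y_bar=555.43, C_y=1.053, C_x=1.10, rho=0.8038, f=0.16, n=8)
H1_opt = -pars["rho"] * pars["C_y"] / pars["C_x"]    # optimum, per (3.2.6.9)
for H1 in (-1.0, 1.0, H1_opt):                        # ratio, product, optimum
    print(round(H1, 4), round(mse_tg(H1, **pars), 2))
# The optimum H1 gives the smallest value, equal to (3.2.6.7).
```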

3.2.7 WIDER CLASS OF ESTIMATORS

One may note here that the regression estimator and difference estimator are not special cases of the general class of estimators defined in (3.2.6.1). Srivastava (1980) defined another class of estimators, named a wider class of estimators, as

t_w = H(ȳ, u)    (3.2.7.1)

where H(ȳ, u) is a function of ȳ and u which satisfies the following regularity conditions:

( a ) The point (ȳ, u) assumes values in a closed convex subset R₂ of two-dimensional real space containing the point (Ȳ, 1);
( b ) The function H(ȳ, u) is continuous and bounded in R₂;
( c ) H(Ȳ, 1) = Ȳ and H₀(Ȳ, 1) = 1, where H₀(Ȳ, 1) denotes the first order partial derivative of H with respect to ȳ;
( d ) The first and second order partial derivatives of H(ȳ, u) exist and are continuous and bounded in R₂.

Expanding H(ȳ, u) about the point (Ȳ, 1) in a second order Taylor series we have

t_w = H(ȳ, u) = H[Ȳ + (ȳ - Ȳ), 1 + (u - 1)]
    = H(Ȳ, 1) + (ȳ - Ȳ) ∂H/∂ȳ |_(Ȳ,1) + (u - 1) ∂H/∂u |_(Ȳ,1) + (u - 1)² (1/2) ∂²H/∂u² |_(Ȳ,1)
      + (ȳ - Ȳ)² (1/2) ∂²H/∂ȳ² |_(Ȳ,1) + (ȳ - Ȳ)(u - 1) (1/2) ∂²H/∂ȳ∂u |_(Ȳ,1) + ....    (3.2.7.2)

Using the above regularity conditions and ∂H/∂ȳ |_(Ȳ,1) = 1, we have

t_w = ȳ + (u - 1)H₁ + (u - 1)²H₂ + (ȳ - Ȳ)(u - 1)H₃ + (ȳ - Ȳ)²H₄ + ....    (3.2.7.3)

where

H₁ = ∂H/∂u |_(Ȳ,1), H₂ = (1/2) ∂²H/∂u² |_(Ȳ,1), H₃ = (1/2) ∂²H/∂ȳ∂u |_(Ȳ,1), and H₄ = (1/2) ∂²H/∂ȳ² |_(Ȳ,1).

Thus we have the following theorems.
Theorem 3.2.7.1. The asymptotic bias in the wider class of estimators t_w of the population mean Ȳ is:

B(t_w) = ((1-f)/n)[Ȳρ_xy C_x C_y H₃ + C_x²H₂ + Ȳ²C_y²H₄].    (3.2.7.4)

Proof. The wider class of estimators t_w, in terms of ε₀ and ε₁, can easily be written as

t_w = Ȳ(1+ε₀) + ε₁H₁ + ε₁²H₂ + Ȳε₀ε₁H₃ + Ȳ²ε₀²H₄ + ....    (3.2.7.5)

Taking expected values on both sides of (3.2.7.5) and using the definition of bias, we obtain (3.2.7.4). Hence the theorem.

Theorem 3.2.7.2. The minimum mean squared error of the wider class of estimators t_w is given by

Min.MSE(t_w) = ((1-f)/n) Ȳ² C_y² (1 - ρ_xy²).    (3.2.7.6)

Proof. By the definition of mean squared error, we have

MSE(t_w) = E[t_w - Ȳ]² ≈ E[Ȳε₀ + H₁ε₁]² = E[Ȳ²ε₀² + H₁²ε₁² + 2H₁Ȳε₀ε₁]
         = ((1-f)/n)[Ȳ²C_y² + H₁²C_x² + 2H₁Ȳρ_xy C_x C_y].    (3.2.7.7)

On differentiating (3.2.7.7) with respect to H₁ and equating to zero we obtain

H₁ = -ρ_xy Ȳ C_y/C_x.    (3.2.7.8)

On substituting (3.2.7.8) in (3.2.7.7), we obtain (3.2.7.6). Hence the theorem.
Remark 3.2.7.1. If we use any function of ȳ, x̄, and X̄ to estimate the population mean Ȳ, the asymptotic minimum mean squared error of the resultant estimator again cannot be reduced further than that given in (3.2.6.7). Thus the usual linear regression estimator and difference estimator are special cases of the wider class of estimators defined at (3.2.7.1).

3.2.8 USE OF KNOWN VARIANCE OF THE AUXILIARY VARIABLE AT THE ESTIMATION STAGE OF POPULATION MEAN

In this section, we will show that the known variance of the auxiliary variable can also be used as a benchmark, in addition to the known population total or mean of the auxiliary variable, to improve the estimators of the finite population mean of the study variable under certain circumstances.

3.2.8.1 A CLASS OF ESTIMATORS


Srivastava and Jhajj (1981) introduced the use of the known value of the variance of the auxiliary variable to improve the efficiency of the estimators of the population mean. They considered a general class of ratio type estimators as

ȳ_SJ = ȳ H(u, v)    (3.2.8.1)

where u = x̄/X̄, v = s_x²/S_x², and H(u, v) is a function of u and v such that:

( a ) The point (u, v) assumes values in a closed convex subset R₂ of two-dimensional real space containing the point (1, 1);
( b ) The function H(u, v) is continuous and bounded in R₂;
( c ) H(1, 1) = 1;
( d ) The first and second order partial derivatives of H(u, v) exist and are continuous and bounded in R₂.

Thus all ratio and product type estimators of the population mean Ȳ defined as

ȳ₁ = ȳ (X̄/x̄)(S_x²/s_x²),  ȳ₂ = ȳ [X̄/{ax̄ + (1-a)X̄}][S_x²/{γs_x² + (1-γ)S_x²}],  and  ȳ₃ = ȳ (X̄/x̄)^α (S_x²/s_x²)^γ

are special cases of the class of estimators defined in (3.2.8.1).

Expanding H(u, v) about the point (1, 1) in a second order Taylor's series we obtain

ȳ_SJ = ȳ H(u, v) = ȳ H[1 + (u-1), 1 + (v-1)]
     ≈ ȳ[ H(1,1) + (u-1) ∂H/∂u |_(1,1) + (v-1) ∂H/∂v |_(1,1) + (u-1)² (1/2) ∂²H/∂u² |_(1,1)
        + (v-1)² (1/2) ∂²H/∂v² |_(1,1) + (u-1)(v-1) (1/2) ∂²H/∂u∂v |_(1,1) + .... ]
     = Ȳ(1+ε₀)[1 + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + .....]
     ≈ Ȳ[1 + ε₀ + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + ε₀ε₁H₁ + ε₀ε₃H₂ + ....]    (3.2.8.2)

where

H₁ = ∂H/∂u |_(1,1), H₂ = ∂H/∂v |_(1,1), H₃ = (1/2) ∂²H/∂u² |_(1,1), H₄ = (1/2) ∂²H/∂v² |_(1,1), and H₅ = (1/2) ∂²H/∂u∂v |_(1,1).

Thus we have the following theorems:

Theorem 3.2.8.1. The asymptotic bias in the class of ratio type estimators ȳ_SJ is

B(ȳ_SJ) = ((1-f)/n) Ȳ [C_x²H₃ + (λ₀₄ - 1)H₄ + C_xλ₀₃H₅ + ρ_xy C_y C_x H₁ + C_y λ₁₂ H₂].    (3.2.8.3)

Proof. Taking expected values on both sides of (3.2.8.2) we have

E(ȳ_SJ) = Ȳ E[1 + ε₀ + ε₁H₁ + ε₃H₂ + ε₁²H₃ + ε₃²H₄ + ε₁ε₃H₅ + ε₀ε₁H₁ + ε₀ε₃H₂ + ....]
        = Ȳ[1 + 0 + 0·H₁ + 0·H₂ + E(ε₁²)H₃ + E(ε₃²)H₄ + E(ε₁ε₃)H₅ + E(ε₀ε₁)H₁ + E(ε₀ε₃)H₂].

Thus the bias is given by

B(ȳ_SJ) = E(ȳ_SJ) - Ȳ = ((1-f)/n) Ȳ [C_x²H₃ + (λ₀₄ - 1)H₄ + C_xλ₀₃H₅ + ρ_xy C_y C_x H₁ + C_y λ₁₂ H₂].

Hence the theorem.

Theorem 3.2.8.2. The minimum MSE of the class of estimators ȳ_SJ is given by

Min.MSE(ȳ_SJ) = ((1-f)/n) Ȳ² C_y² [1 - ρ_xy² - (λ₀₃ρ_xy - λ₁₂)²/(λ₀₄ - 1 - λ₀₃²)].    (3.2.8.4)

Proof. By the definition of the mean squared error we have

MSE(ȳ_SJ) = E[ȳ_SJ - Ȳ]² = Ȳ²E[ε₀ + ε₁H₁ + ε₃H₂ + O(ε²)]²
          = Ȳ²E[ε₀² + ε₁²H₁² + ε₃²H₂² + 2ε₀ε₁H₁ + 2ε₀ε₃H₂ + 2ε₁ε₃H₁H₂]
          = ((1-f)/n) Ȳ² [C_y² + C_x²H₁² + (λ₀₄-1)H₂² + 2ρ_xy C_x C_y H₁ + 2C_yλ₁₂H₂ + 2C_xλ₀₃H₁H₂].    (3.2.8.5)

On differentiating (3.2.8.5) with respect to H₁ and H₂ and equating to zero, respectively, we obtain

H₁C_x + H₂λ₀₃ = -ρ_xy C_y, and H₁C_xλ₀₃ + H₂(λ₀₄ - 1) = -C_yλ₁₂.    (3.2.8.6)

Solving the equations in (3.2.8.6) for H₁ and H₂ we obtain

H₁ = -C_y{ρ_xy(λ₀₄ - 1) - λ₁₂λ₀₃}/{C_x(λ₀₄ - 1 - λ₀₃²)}  and  H₂ = -C_y{λ₁₂ - ρ_xyλ₀₃}/(λ₀₄ - 1 - λ₀₃²).    (3.2.8.7)

On substituting these optimum values of H₁ and H₂ in (3.2.8.5) we have (3.2.8.4). Hence the theorem.

3.2.8.2 A WIDER CLASS OF ESTIMATORS

Srivastava and Jhajj (1981) also considered a wider class of estimators of the population mean Ȳ as

ȳ_SJ(w) = H(ȳ, u, v)    (3.2.8.8)

where u = x̄/X̄, v = s_x²/S_x², and H(ȳ, u, v) is a function of ȳ, u, and v such that:

( a ) The point (ȳ, u, v) assumes values in a closed convex subset R₃ of three-dimensional real space containing the point (Ȳ, 1, 1);
( b ) The function H(ȳ, u, v) is continuous and bounded in R₃;
( c ) H(Ȳ, 1, 1) = Ȳ, and the first order partial derivative of H with respect to ȳ equals unity at the point (Ȳ, 1, 1);
( d ) The first and second order partial derivatives of H(ȳ, u, v) exist and are continuous and bounded in R₃.

Expanding H(ȳ, u, v) about the point (Ȳ, 1, 1) in a second order Taylor's series we have

ȳ_SJ(w) = ȳ + (u-1)H₁ + (v-1)H₂ + (ȳ-Ȳ)²H₃ + (u-1)²H₄ + (v-1)²H₅
          + (ȳ-Ȳ)(u-1)H₆ + (u-1)(v-1)H₇ + (v-1)(ȳ-Ȳ)H₈ + ....    (3.2.8.9)

where

∂H/∂ȳ |_(Ȳ,1,1) = 1, H₁ = ∂H/∂u |_(Ȳ,1,1), H₂ = ∂H/∂v |_(Ȳ,1,1), H₃ = (1/2) ∂²H/∂ȳ² |_(Ȳ,1,1),
H₄ = (1/2) ∂²H/∂u² |_(Ȳ,1,1), H₅ = (1/2) ∂²H/∂v² |_(Ȳ,1,1), H₆ = (1/2) ∂²H/∂ȳ∂u |_(Ȳ,1,1),
H₇ = (1/2) ∂²H/∂u∂v |_(Ȳ,1,1), and H₈ = (1/2) ∂²H/∂v∂ȳ |_(Ȳ,1,1).

Thus we have the following theorems:


Theorem 3.2.8.3. The asymptotic bias in the wider class of estimators ȳ_SJ(w) is

B(ȳ_SJ(w)) = ((1-f)/n)[Ȳ²C_y²H₃ + C_x²H₄ + (λ₀₄-1)H₅ + Ȳρ_xy C_y C_x H₆ + C_xλ₀₃H₇ + ȲC_yλ₁₂H₈].    (3.2.8.10)

Proof. It follows by taking expected values on both sides of (3.2.8.9).

Theorem 3.2.8.4. The minimum MSE of the wider class of estimators ȳ_SJ(w) is given by

Min.MSE(ȳ_SJ(w)) = ((1-f)/n) Ȳ² C_y² [1 - ρ_xy² - (λ₀₃ρ_xy - λ₁₂)²/(λ₀₄ - 1 - λ₀₃²)].    (3.2.8.11)

Proof. By the definition of the mean squared error we have
MSE(ȳ_SJ(w)) = E[ȳ_SJ(w) - Ȳ]² = E[Ȳε₀ + ε₁H₁ + ε₃H₂]²
             = ((1-f)/n)[Ȳ²C_y² + C_x²H₁² + (λ₀₄-1)H₂² + 2Ȳρ_xy C_x C_y H₁ + 2ȲC_yλ₁₂H₂ + 2C_xλ₀₃H₁H₂].    (3.2.8.12)

On differentiating (3.2.8.12) with respect to H₁ and H₂ and equating to zero, respectively, we obtain

H₁C_x + H₂λ₀₃ = -Ȳρ_xy C_y,    (3.2.8.13)

and

H₁C_xλ₀₃ + H₂(λ₀₄ - 1) = -ȲC_yλ₁₂.    (3.2.8.14)

Solving (3.2.8.13) and (3.2.8.14) for H₁ and H₂ we have

H₁ = -ȲC_y{ρ_xy(λ₀₄-1) - λ₁₂λ₀₃}/{C_x(λ₀₄ - 1 - λ₀₃²)}  and  H₂ = -ȲC_y{λ₁₂ - ρ_xyλ₀₃}/(λ₀₄ - 1 - λ₀₃²).    (3.2.8.15)

On substituting these optimum values of H₁ and H₂ in (3.2.8.12), we obtain (3.2.8.11). Hence the theorem.

Remark 3.2.8.1.

( a ) The difference type estimator ȳ_d = ȳ + γ₁(X̄ - x̄) + γ₂(S_x² - s_x²), where γ₁ and γ₂ are real constants, is a special case of the wider class of estimators. Sahoo, Sahoo, and Espejo (1998) have presented an empirical investigation of the performance of five strategies for estimating the finite population mean using parameters such as the mean or variance, or both, of an auxiliary variable. They considered the problems of comparison of bias, efficiency, and approach to normality or asymmetry.

( b ) The asymptotic minimum mean squared error of the ratio type class and the wider class of estimators remains the same.

( c ) Note that λ₁₂ and λ₀₃ are odd ordered moments. In case X and Y follow the bivariate normal distribution, both λ₁₂ and λ₀₃ are zero. In such situations the minimum mean squared error of the class of estimators proposed by Srivastava and Jhajj (1981) reduces to the mean squared error of the usual linear regression estimator. Thus there is no advantage in using the known variance of the auxiliary variable in the construction of the estimator of the population mean Ȳ if the joint distribution of the study variable Y and auxiliary variable X is bivariate normal.
( d ) As the number of known parameters of an auxiliary variable used in the construction of ratio or regression type estimators increases, the mean square error of the resultant estimators no doubt decreases, but note that at the same time the stability of the estimators decreases.

( e ) There is a large number of estimators belonging to the same class of estimators with the same minimum asymptotic mean square error, so it is difficult to select an estimator for a particular survey, and there is no theoretical technique available in the literature to select an estimator.

Example 3.2.8.1. Use the information given in population 1 of the Appendix to show the relative efficiency of the general class of estimators over the linear regression estimator while estimating real estate farm loans with the help of known nonreal estate farm loans.

Solution. From the description of population 1 given in the Appendix we have

Ȳ = 555.43, X̄ = 878.16, C_y² = 1.1086, λ₀₃ = 1.5936, ρ_xy = 0.8038, N = 50, λ₁₂ = 1.0982, and λ₀₄ = 4.5247.

Now

Min.MSE(ȳ_LR) = ((1-f)/n) Ȳ² C_y² (1 - ρ_xy²) = ((1-0.16)/8)×(555.43)²×1.1086×(1 - 0.8038²) = 12709.55,

and the value of Min.MSE(ȳ_SJ) is given by

Min.MSE(ȳ_SJ) = ((1-f)/n) Ȳ² C_y² [1 - ρ_xy² - (λ₀₃ρ_xy - λ₁₂)²/(λ₀₄ - 1 - λ₀₃²)]
              = ((1-0.16)/8)×(555.43)²×1.1086×[1 - (0.8038)² - (1.5936×0.8038 - 1.0982)²/{4.5247 - 1 - (1.5936)²}]
              = 11491.74.

Thus the percent relative efficiency (RE) of the general class of estimators ȳ_SJ with respect to the linear regression estimator ȳ_LR is given by

RE = Min.MSE(ȳ_LR)×100/Min.MSE(ȳ_SJ) = 12709.55×100/11491.74 = 110.59%.

It should be noted that in this case the relative efficiency is independent of the sample size n.
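A minimal Python sketch (not from the text) reproducing this example from the parameter values quoted above; the printed figures are matched up to rounding:

```python
# Minimum MSEs of the regression estimator and of the Srivastava-Jhajj
# class, and their relative efficiency (Example 3.2.8.1).
Y_bar, C2_y, rho = 555.43, 1.1086, 0.8038
lam03, lam12, lam04 = 1.5936, 1.0982, 4.5247
f, n = 0.16, 8

mse_lr = (1 - f) / n * Y_bar ** 2 * C2_y * (1 - rho ** 2)
gain = (lam03 * rho - lam12) ** 2 / (lam04 - 1 - lam03 ** 2)
mse_sj = (1 - f) / n * Y_bar ** 2 * C2_y * (1 - rho ** 2 - gain)
print(round(mse_lr, 2), round(mse_sj, 2), round(100 * mse_lr / mse_sj, 2))
# ~12709, ~11492, ~110.6 (matches the text up to rounding)
```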

The next section of this chapter is devoted to constructing unbiased ratio and product type estimators of the population mean. We will discuss Quenouille's method, the interpenetrating sampling method, exactly unbiased ratio and product type estimators, and bias filtration techniques.

3.2.9 METHODS TO REMOVE BIAS FROM RATIO AND PRODUCT TYPE ESTIMATORS

We have observed that the ratio and product type estimators are biased. Several researchers have attempted to reduce the bias of these estimators. We would also like to discuss a few methods to construct unbiased ratio and product type estimators of the population mean before going on to the problems of estimation of the finite population variance, correlation coefficient, and regression coefficient.

3.2.9.1 QUENOUILLE'S METHOD

In this method we draw a sample of size 2n units from a population of N units by SRSWOR sampling. We divide the sample of 2n units into two equal halves, each of size n. The sample based on the 2n units is called the pooled sample. Then we have three biased ratio estimators of the population mean:

( a ) ȳ_R1 = ȳ₁(X̄/x̄₁), where ȳ₁ = n⁻¹ Σ y_i and x̄₁ = n⁻¹ Σ x_i are the first half sample means for the Y and X variables, respectively;

( b ) ȳ_R2 = ȳ₂(X̄/x̄₂), where ȳ₂ = n⁻¹ Σ y_i and x̄₂ = n⁻¹ Σ x_i are the second half sample means for the Y and X variables, respectively;

( c ) ȳ_R = ȳ(X̄/x̄), where ȳ = (2n)⁻¹ Σ y_i and x̄ = (2n)⁻¹ Σ x_i are the sample means for the Y and X variables, respectively, based on the pooled sample.
By following the ratio method of estimation, we have

E(ȳ_R1) = Ȳ + (1/n - 1/N) Ȳ (C_x² - ρ_xy C_x C_y),    (3.2.9.1)

E(ȳ_R2) = Ȳ + (1/n - 1/N) Ȳ (C_x² - ρ_xy C_x C_y),    (3.2.9.2)

and

E(ȳ_R) = Ȳ + (1/(2n) - 1/N) Ȳ (C_x² - ρ_xy C_x C_y).    (3.2.9.3)

Quenouille (1956) considered an estimator of the population mean Ȳ as

ȳ_Q = a(ȳ_R1 + ȳ_R2) + (1 - 2a) ȳ_R    (3.2.9.4)

where a is a suitably chosen constant such that the bias in the estimator ȳ_Q is zero. Thus we have the following theorem:
Theorem 3.2.9.1. The Quenouille's estimator ȳ_Q is an unbiased estimator of the population mean Ȳ if

a = -(N - 2n)/(2N).    (3.2.9.5)

Proof. We have

E(ȳ_Q) = E[a(ȳ_R1 + ȳ_R2) + (1-2a)ȳ_R] = a[E(ȳ_R1) + E(ȳ_R2)] + (1-2a)E(ȳ_R)
       = a[Ȳ + (1/n - 1/N)Ȳ(C_x² - ρ_xy C_x C_y) + Ȳ + (1/n - 1/N)Ȳ(C_x² - ρ_xy C_x C_y)]
         + (1-2a)[Ȳ + (1/(2n) - 1/N)Ȳ(C_x² - ρ_xy C_x C_y)]
       = Ȳ + Ȳ(C_x² - ρ_xy C_x C_y)[2a(1/n - 1/N) + (1-2a)(1/(2n) - 1/N)].

Evidently the bias in the estimator ȳ_Q will be zero if

2a(1/n - 1/N) + (1-2a)(1/(2n) - 1/N) = 0, or if a = -(N - 2n)/(2N).

Hence the theorem.
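A minimal Python sketch (not from the text) of the combination (3.2.9.4) with the unbiasedness constant (3.2.9.5); the numeric inputs are purely illustrative (borrowed from Example 3.2.9.1 further below), since Quenouille's method proper uses two half samples:

```python
# Quenouille's bias-removing combination of two half-sample ratio
# estimates (y1r, y2r) and the pooled ratio estimate (yr).
def quenouille(y1r, y2r, yr, N, n):
    """Unbiased combination with a = -(N - 2n)/(2N), per (3.2.9.5)."""
    a = -(N - 2 * n) / (2 * N)
    return a * (y1r + y2r) + (1 - 2 * a) * yr

# For N >> n, a -> -1/2 and the estimator tends to 2*yr - (y1r + y2r)/2,
# the classical jackknife form.
print(quenouille(y1r=655.7, y2r=910.9, yr=770.4, N=50, n=5))
```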

[Figure: plot of the Quenouille constant a = -(N - 2n)/(2N) against the sample size n for N = 20.]
Fig. 3.2.2 Value of Quenouille's constant.

For more details, one can refer to Singh and Singh (1993), Murthy (1962), and Rao (1965a). The reduction in bias to the desired degree by using the method of Quenouille (1956) has also been discussed by Singh (1979).

3.2.9.2 INTERPENETRATING SAMPLING METHOD

Let us first present an idea about interpenetrating samples. If we want to select n units with SRSWOR sampling, we can select k independent samples each of size m = n/k, where we assume that n/k is an integer. We draw m units out of the N units, then put back these m units so as to keep the population size the same. To make the k samples independent, each individual sample of m units is selected with SRSWOR sampling. Now we have k samples each of size m. From the j-th sample, a ratio type estimator to estimate the population mean Ȳ is

ȳ_Rj = ȳ_j (X̄/x̄_j)

where ȳ_j = m⁻¹ Σ y_i and x̄_j = m⁻¹ Σ x_i denote the j-th sample means for the Y and X variables, respectively, for j = 1, 2, ..., k. Let us define a new estimator of the population mean Ȳ as

ȳ_RK = (1/k) Σ_{j=1}^{k} ȳ_Rj = (1/k) Σ_{j=1}^{k} ȳ_j (X̄/x̄_j).    (3.2.9.6)

Also from the full sample information, we have the usual ratio estimator of the population mean Ȳ given by

ȳ_R = ȳ (X̄/x̄)

where ȳ = n⁻¹ Σ y_i and x̄ = n⁻¹ Σ x_i are the sample means based on the full sample information. By following the ratio method of estimation, we have
E(ȳ_RK) = Ȳ + (1/m - 1/N) Ȳ (C_x² - ρ_xy C_x C_y),    (3.2.9.7)

and

E(ȳ_R) = Ȳ + (1/n - 1/(kN)) Ȳ (C_x² - ρ_xy C_x C_y),    (3.2.9.8)

since the m units drawn k times from a population of size N are equivalent to a sample of size n = km drawn from a population of size kN. Thus we have the following theorem:

Theorem 3.2.9.2. An unbiased estimator of the population mean Ȳ is given by

ȳ_u = (kȳ_R - ȳ_RK)/(k - 1).    (3.2.9.9)

Proof. We have

E(ȳ_u) = E[(kȳ_R - ȳ_RK)/(k - 1)] = [kE(ȳ_R) - E(ȳ_RK)]/(k - 1)
= [k{Ȳ + (1/n - 1/(kN))Ȳ(C_x² - ρ_xy C_x C_y)} - (1/k)Σ_{j=1}^{k}{Ȳ + (1/m - 1/N)Ȳ(C_x² - ρ_xy C_x C_y)}]/(k-1)

= [(k-1)Ȳ + (k/n - 1/N)Ȳ(C_x² - ρ_xy C_x C_y) - (1/m - 1/N)Ȳ(C_x² - ρ_xy C_x C_y)]/(k-1)

= [(k-1)Ȳ + (1/m - 1/N)Ȳ(C_x² - ρ_xy C_x C_y) - (1/m - 1/N)Ȳ(C_x² - ρ_xy C_x C_y)]/(k-1) = Ȳ,

since k/n = 1/m. Hence the theorem.

Theorem 3.2.9.3. The variance of the unbiased estimator ȳ_u of the population mean Ȳ is

V(ȳ_u) = (1/n - 1/(kN)) [S_y² + R²S_x² - 2RS_xy].    (3.2.9.10)

Proof. To the first order of approximation the estimator ȳ_u behaves like a ratio estimator based on a sample of n = km units drawn from a population of kN units, and the result follows.

Note that k > 1; thus the unbiased estimator ȳ_u is less efficient than the ratio estimator ȳ_R in the case of finite populations.

Sengupta (1981a, 1982a) considered the problem of interpenetrating subsampling with unequal sizes of the samples and compared it with an equicost procedure based on equal sized samples. He observed that unequal sized samples lead to more precise estimates of the finite population mean in almost all cases. He considered the simple random sampling design only and assumed that the cost of the survey is proportional to the number of distinct units in the sample, following Koop (1967), Singh and Bansal (1975, 1978), Singh and Singh (1974), and Srikantan (1963). Schucany, Gray, and Owen (1971) considered the problem of higher order bias reduction in estimating a general parameter in survey sampling.

Example 3.2.9.1. Select three different samples each of five units by using SRSWOR sampling from population 1 given in the Appendix. Collect the information on the real and nonreal estate farm loans for the states selected in each sample. The average nonreal estate farm loan is assumed to be known. Obtain three different ratio estimates of the average real estate farm loans from the information collected in the three samples. Pool the information collected in the three samples to obtain a pooled ratio estimate of the average real estate farm loans.
( a ) Derive an unbiased estimate of the average real estate farm loans.
( b ) Construct a 95% confidence interval.
Given: average nonreal estate farm loans $878.16.

Solution. Here N = 50, k = 3, m = 5 and n = mk = 5×3 = 15. We selected the following three independent samples, each of size 5 units. The first sample is selected by using the first two columns, the second sample by using the 3rd and 4th columns, and the third sample by using the 5th and 6th columns of the Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix.

Sample I
Random Number (1 ≤ R1i ≤ 50)   State   Real estate farm loans, y_i   Nonreal estate farm loans, x_i
01                             AL      408.978                       348.334
23                             MN      1354.768                      2466.892
46                             VA      321.583                       188.477
04                             AR      907.700                       848.317
32                             NY      201.631                       426.274
Sum                                    3194.660                      4278.294


Sample II
Random Number (1 ≤ R2i ≤ 50)   State   Real estate farm loans, y_i   Nonreal estate farm loans, x_i
29                             NH      6.044                         0.471
14                             IN      1213.024                      1022.782
47                             WA      1100.745                      1228.607
22                             MI      323.028                       440.518
42                             TN      553.266                       388.869
Sum                                    3196.107                      3081.247

Thus ȳ₂ = 639.2214 and x̄₂ = 616.2494.

Sample III
Random Number (1 ≤ R3i ≤ 50)   State   Real estate farm loans, y_i   Nonreal estate farm loans, x_i
48                             WV      99.277                        29.291
37                             OR      114.899                       571.487
33                             NC      639.571                       494.730
18                             LA      282.565                       405.799
25                             MO      1579.686                      1519.994
Sum                                    2715.998                      3021.301

Thus ȳ₃ = 543.1996 and x̄₃ = 604.2602. It is given that X̄ = 878.16. Thus the three different ratio estimates of the average real estate farm loans in the United States are
ȳ_R1 = ȳ₁ X̄/x̄₁ = 638.932×878.16/855.6588 = 655.7339,  ȳ_R2 = ȳ₂ X̄/x̄₂ = 639.2214×878.16/616.2494 = 910.8953,

and

ȳ_R3 = ȳ₃ X̄/x̄₃ = 543.1996×878.16/604.2602 = 789.4218.

Thus a pooled estimate from the above three ratio estimates of the average real estate farm loans is given by

ȳ_RK = (1/3) Σ_{j=1}^{3} ȳ_Rj = (655.7339 + 910.8953 + 789.4218)/3 = 785.3503.

Now we have the pooled sample information as follows:

Pooled Sample
State  y_i        x_i        (y_i-ȳ)     (x_i-x̄)    (y_i-ȳ)²    (x_i-x̄)²    (y_i-ȳ)(x_i-x̄)
AL     408.978    348.334    -198.1400   -343.722   39259.3     118144.9    68104.989
MN     1354.768   2466.892   747.6503    1774.836   558981.0    3150042.0   1326956.627
VA     321.583    188.477    -285.5350   -503.579   81530.1     253591.9    143789.300
AR     907.700    848.317    300.5823    156.261    90349.7     24417.5     46969.256
NY     201.631    426.274    -405.4870   -265.782   164419.4    70640.1     107771.111
NH     6.044      0.471      -601.0740   -691.585   361289.6    478290.0    415693.612
IN     1213.024   1022.782   605.9063    330.726    367122.5    109379.6    200388.897
WA     1100.745   1228.607   493.6273    536.551    243667.9    287886.8    264856.174
MI     323.028    440.518    -284.0900   -251.538   80706.9     63271.4     71459.384
TN     553.266    388.869    -53.8517    -303.187   2900.0      91922.4     16327.132
WV     99.277     29.291     -507.8410   -662.765   257902.1    439257.6    336579.087
OR     114.899    571.487    -492.2190   -120.569   242279.2    14536.9     59346.378
NC     639.571    494.730    32.4533     -197.326   1053.2      38937.6     -6403.891
LA     282.565    405.799    -324.5530   -286.257   105334.4    81943.1     92905.516
MO     1579.686   1519.994   972.5683    827.939    945889.2    685481.1    805226.151
Sum    9106.765   10380.842  0.0000      0.000      3542685.0   5907743.0   3949969.724

Thus ȳ = 607.1176, x̄ = 692.056, s_y² = 253048.93, s_x² = 421981.64, s_xy = 282140.69 and r̂ = ȳ/x̄ = 0.8773.

A ratio estimate of the average real estate farm loans from the pooled sample information is given by

ȳ_R = ȳ X̄/x̄ = 607.1176×878.16/692.056 = 770.380.

An unbiased estimate of the average real estate farm loans in the United States is

ȳ_u = (kȳ_R - ȳ_RK)/(k-1) = (3×770.380 - 785.3503)/(3-1) = 762.894.

An estimate of V(ȳ_u) is given by

v̂(ȳ_u) = (1/n - 1/(kN))(s_y² + r̂²s_x² - 2r̂s_xy)
       = (1/15 - 1/(3×50))×(253048.93 + 0.8773²×421981.64 - 2×0.8773×282140.69)
       = 4967.1166.

A (1-α)100% confidence interval of the population mean Ȳ is given by

ȳ_u ± t_{α/2}(df = n-1)√v̂(ȳ_u).

Using Table 2 from the Appendix, the 95% confidence interval of the average amount of the real estate farm loans in the United States is

762.894 ± 2.145√4967.1166, or [611.71, 914.06].
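The computations of this example can be verified with a minimal Python sketch (not from the text); the summary statistics are those computed above, and the t value 2.145 is from Table 2 with df = 14:

```python
# Unbiased interpenetrating-samples estimator, its variance estimate, and
# the 95% confidence interval for Example 3.2.9.1.
import math

k, m, N = 3, 5, 50
n = k * m
ratio_ests = [655.7339, 910.8953, 789.4218]   # sample-level ratio estimates
y_rk = sum(ratio_ests) / k                    # pooled mean of ratio estimates
y_r = 770.380                                 # ratio estimate, pooled sample
y_u = (k * y_r - y_rk) / (k - 1)              # unbiased estimator (3.2.9.9)

s2_y, s2_x, s_xy, r = 253048.93, 421981.64, 282140.69, 0.8773
v_hat = (1 / n - 1 / (k * N)) * (s2_y + r ** 2 * s2_x - 2 * r * s_xy)
t = 2.145                                     # t_{0.025} with df = n - 1 = 14
half = t * math.sqrt(v_hat)
print(round(y_u, 3), round(v_hat, 2), (round(y_u - half, 2), round(y_u + half, 2)))
# ~762.895, ~4967.12, CI ~ (611.71, 914.08); matches the text up to rounding.
```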



3.2.9.3 EXACTLY UNBIASED RATIO TYPE ESTIMATORS