Hydrologic Statistics
Dr. N.R. Dhamge
Hydrologic Models
Classification based on randomness.
• Deterministic (e.g., rainfall-runoff analysis)
– Analysis of hydrological processes using deterministic
approaches
– Hydrological parameters are based on physical relations of
the various components of the hydrologic cycle.
– Do not consider randomness; a given input produces the
same output.
• Stochastic (e.g., flood frequency analysis)
– Probabilistic description and modeling of hydrologic
phenomena
– Statistical analysis of hydrologic data.
Probability
• A measure of how likely an event is to occur
• A number expressing the ratio of favorable
outcomes to all possible outcomes
• Probability is usually represented as P(.)
– P (getting a club from a deck of playing cards) = 13/52 = 0.25 = 25 %
– P (getting a 3 after rolling a die) = 1/6
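The two examples above can be checked with a few lines of Python. This is a minimal sketch that assumes every outcome is equally likely; the helper name `probability` is illustrative and not part of the lecture.

```python
from fractions import Fraction


def probability(favorable: int, total: int) -> Fraction:
    """Classical probability: favorable outcomes / all equally likely outcomes."""
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need total > 0 and 0 <= favorable <= total")
    return Fraction(favorable, total)


p_club = probability(13, 52)   # 13 clubs in a 52-card deck
p_three = probability(1, 6)    # one face shows a 3 on a six-sided die

print(p_club, float(p_club))   # 1/4 0.25
print(p_three)                 # 1/6
```

Using `Fraction` keeps the ratio exact, so 13/52 reduces to 1/4 before it is converted to a decimal.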
Random Variable
• Random variable: a quantity used to represent
probabilistic uncertainty
– Incremental precipitation
– Instantaneous streamflow
– Wind velocity
• Random variable (X) is described by a probability
distribution
• Probability distribution is a set of probabilities
associated with the values in a random variable’s sample
space
Sampling terminology
• Sample: a finite set of observations x1, x2, …, xn of the random
variable
• A sample comes from a hypothetical infinite population
possessing constant statistical properties
• Sample space: set of possible samples that can be drawn from a
population
• Event: subset of a sample space
Example
Population: streamflow
Sample space: instantaneous streamflow, annual
maximum streamflow, daily average streamflow
Sample: 100 observations of annual max. streamflow
Event: daily average streamflow > 100 cumec
Types of sampling
• Random sampling: the likelihood of selection of each member of the
population is equal
– Pick any streamflow value from a population
• Stratified sampling: Population is divided into groups, and then a random
sampling is used
– Pick a streamflow value from annual maximum series.
• Uniform sampling: Data are selected such that the points are uniformly far
apart in time or space
– Pick streamflow values measured at midnight every Monday
• Convenience sampling: Data are collected according to the convenience of
the experimenter.
– Pick streamflow during summer
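The four sampling schemes can be sketched on a synthetic record. Everything in this example is an assumption made for illustration: the 364 daily flows are random numbers, and the four 91-day blocks used as strata are hypothetical "seasons", not the grouping used in the lecture.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical record: 364 daily average flows (52 weeks), values are synthetic
daily = [rng.uniform(10.0, 500.0) for _ in range(364)]

# Random sampling: every value has the same chance of being selected
random_sample = rng.sample(daily, 10)

# Stratified sampling: split the record into groups, then sample randomly in each
strata = [daily[i:i + 91] for i in range(0, 364, 91)]
stratified_sample = [rng.choice(group) for group in strata]

# Uniform sampling: values equally spaced in time (every 7th day)
uniform_sample = daily[::7]

# Convenience sampling: whatever is easiest to collect (here, the first 10 days)
convenience_sample = daily[:10]
```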
Summary statistics
• Also called descriptive statistics
– If x1, x2, …xn is a sample then
Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$ ($\mu$ for continuous data)
Variance: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$ ($\sigma^2$ for continuous data)
Standard deviation: $S = \sqrt{S^2}$ ($\sigma$ for continuous data)
Coeff. of variation: $CV = \frac{S}{\bar{X}}$
Also included in summary statistics are the median, skewness, and correlation coefficient.
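The sample statistics defined on this slide can be computed directly from their definitions. The sketch below uses plain Python; the function name and the five flow values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def summary_statistics(x):
    """Return sample mean, variance (n - 1 divisor), standard deviation, and CV."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    std_dev = math.sqrt(variance)
    cv = std_dev / mean  # assumes a nonzero mean
    return mean, variance, std_dev, cv


# Hypothetical annual maximum flows in 10^3 cfs (illustrative values only)
flows = [120.0, 85.0, 240.0, 160.0, 95.0]
mean, variance, std_dev, cv = summary_statistics(flows)
print(mean, variance)  # 140.0 3962.5
```

The variance uses the divisor n - 1, as in the formula above, rather than n.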
Graphical display
• Time Series plots
• Histograms/Frequency distribution
• Cumulative distribution functions
• Flow duration curve
Time series plot
• Plot of variable versus time (bar/line/points)
• Example. Annual maximum flow series
[Figure: time series plot of annual maximum flow (10^3 cfs) versus year, Colorado River near Austin]
Histogram
• Plots of bars whose height is the number ni, or fraction
(ni/N), of data falling into one of several intervals of
equal width
[Figure: three histograms of annual maximum flow (10^3 cfs) versus number of occurrences, drawn with intervals of 50,000 cfs, 25,000 cfs, and 10,000 cfs]
Dividing the number of occurrences in each interval by the total number of points gives the
probability mass function.
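Counting occurrences per interval and dividing by the total number of points can be sketched as follows. The `histogram` helper and the ten flow values are hypothetical, and the sketch assumes no value lies below `start`.

```python
def histogram(data, width, start=0.0):
    """Count observations in equal-width bins [start + k*width, start + (k+1)*width)."""
    n_bins = int((max(data) - start) // width) + 1
    counts = [0] * n_bins
    for value in data:
        counts[int((value - start) // width)] += 1
    return counts


# Hypothetical annual maximum flows in 10^3 cfs (illustrative values only)
flows = [20, 35, 45, 60, 70, 80, 110, 130, 160, 240]

counts = histogram(flows, width=50)               # n_i for each 50 x 10^3 cfs interval
relative_freq = [c / len(flows) for c in counts]  # n_i / N

print(counts)         # [3, 3, 2, 1, 1]
```

Changing `width` reproduces the effect of choosing a different interval: narrower bins give more bars with fewer occurrences in each.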
Probability density function
• The continuous form of the probability mass function is the probability
density function (pdf)
[Figure: histogram of annual maximum flow (10^3 cfs) with the probability density function overlaid; y axes: number of occurrences and probability]
The pdf is the first derivative of the cumulative distribution function (cdf).
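In symbols, writing $F(x)$ for the cumulative distribution function and $f(x)$ for the density (this notation is assumed here; the slide does not name the functions), the relationship reads:

```latex
f(x) = \frac{dF(x)}{dx},
\qquad
F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du
```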
Cumulative distribution function
• Accumulate (integrate) the pdf to produce the cdf
• The cdf describes the probability that a random variable is less
than or equal to a specified value x
[Figure: cumulative distribution function; x axis: annual maximum flow (10^3 cfs); y axis: probability. Annotated values: P (Q ≤ 50000) = 0.8 and P (Q ≤ 25000) = 0.4]
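For a finite sample, the same idea can be approximated by an empirical cdf: the fraction of observations less than or equal to x. The sketch below is an illustration only; the function name and the flow values are hypothetical and do not reproduce the curve in the figure.

```python
def empirical_cdf(data, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)


# Hypothetical annual maximum flows in 10^3 cfs (illustrative values only)
flows = [20, 35, 45, 60, 70, 80, 110, 130, 160, 240]

p_le_50 = empirical_cdf(flows, 50)     # 3 of the 10 values are <= 50
p_le_100 = empirical_cdf(flows, 100)   # 6 of the 10 values are <= 100
print(p_le_50, p_le_100)               # 0.3 0.6
```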
Flow duration curve
• A cumulative frequency curve that shows the percentage of
time that specified discharges are equaled or exceeded.
Steps
Arrange flows in chronological order
Find the number of records (N)
Sort the data from highest to lowest
Rank the data (m=1 for the highest value and m=N for the lowest value)
Compute the exceedance probability (in percent) for each value using the following
formula
$p = 100 \times \frac{m}{N + 1}$
Plot p on x axis and Q (sorted) on y axis
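The steps above can be sketched in Python instead of a spreadsheet. The function name and the five flow values are hypothetical; plotting is left out, and the function simply returns the (p, Q) pairs that would be plotted.

```python
def flow_duration_curve(flows):
    """Return (p, Q) pairs: exceedance probability in percent and sorted flow."""
    n = len(flows)                         # number of records, N
    ranked = sorted(flows, reverse=True)   # highest flow gets rank m = 1
    return [(100.0 * m / (n + 1), q) for m, q in enumerate(ranked, start=1)]


# Hypothetical flows in chronological order (illustrative values only)
flows = [120.0, 85.0, 240.0, 160.0, 95.0]
fdc = flow_duration_curve(flows)

for p, q in fdc:
    print(f"{p:6.2f} %  {q:7.1f}")
```

With five records the middle-ranked flow (m = 3) has p = 50 %, which corresponds to the median flow.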
Flow duration curve in Excel
[Figure: flow duration curve; x axis: % of time Q will be exceeded; y axis: Q (1000 cfs); the median flow is marked on the curve]
Statistical analysis
• Regression analysis
• Mass curve analysis
• Flood frequency analysis
• Many more which are beyond the scope of
this class!
Linear Regression
• A technique to determine the relationship between two
random variables.
– Relationship between discharge and velocity in a stream
– Relationship between discharge and water quality constituents
A regression model is given by: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$
$y_i$ = ith observation of the response (dependent) variable
$x_i$ = ith observation of the explanatory (independent) variable
$\beta_0$ = intercept
$\beta_1$ = slope
$\varepsilon_i$ = random error or residual for the ith observation
n = sample size
Least squares regression
• We have x1, x2, …, xn and y1,y2, …, yn
observations of independent and dependent
variables, respectively.
• Define a linear model for $y_i$: $\hat{y}_i = \beta_0 + \beta_1 x_i$, $i = 1, 2, \ldots, n$
• Fit the model (find $\beta_0$ and $\beta_1$) such that the sum
of the squares of the vertical deviations is a
minimum
– Minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
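For a straight line, this minimization has a standard closed-form solution: the slope is Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and the intercept is ȳ − slope · x̄. The slide does not state this result, so the sketch below should be read as one common way to carry out the fit; the function name and the data are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)  # assumes x is not constant
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares_fit(x, y)
print(b0, b1)  # 1.0 2.0
```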
Linear Regression in Excel
• Steps:
– Prepare a scatter plot
– Fit a trend line
[Figure: scatter plot with fitted trend line; x axis: specific conductance (µS/cm); y axis: TDS (mg/L). Data are for Brazos River near Highbank, TX. Trend line: TDS = 0.5946 (sp. cond.) - 15.709, R² = 0.9903]
Coefficient of determination (R²)
• It is the proportion of observed y variation that can
be explained by the simple linear regression model
$R^2 = 1 - \frac{SSE}{SST}$
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ (total sum of squares; $\bar{y}$ is the mean of the $y_i$)
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (error sum of squares)
The higher the value of R², the more successful the model is in explaining y
variation.
If R² is small, search for an alternative model (nonlinear or multiple
regression model) that can more effectively explain y variation
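The definition of R² can be evaluated directly from observed and fitted values. In the sketch below the function name and the data are hypothetical; the fitted values come from an assumed line ŷ = 2.2 + 0.6x evaluated at x = 1, …, 5.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y_mean = sum(y) / len(y)
    sst = sum((yi - y_mean) ** 2 for yi in y)             # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    return 1.0 - sse / sst  # assumes y is not constant (SST > 0)


# Hypothetical observations and fitted values (illustrative values only)
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # from the assumed line y_hat = 2.2 + 0.6 x

r2 = r_squared(y, y_hat)
print(round(r2, 3))  # 0.6
```

Here SST = 6 and SSE = 2.4, so the assumed line explains 60 % of the variation in y.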