Denoted $2^\Omega$, the power set of a set $\Omega$ consists of all subsets of $\Omega$.
Its cardinality is given by $|2^\Omega| = 2^{|\Omega|}$
Def. Sigma Algebra/Field
A sigma algebra $\mathcal{A}$ on $\Omega$ is a set (containing subsets of $\Omega$) that:
contains the null set: $\emptyset \in \mathcal{A}$
is closed under countable unions: $A_1, A_2, \ldots \in \mathcal{A} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
is closed under complementation: $A \in \mathcal{A} \Rightarrow A^c \in \mathcal{A}$
Def. Probability Measure
A probability measure $P$ defined on a set $\Omega$ with sigma algebra $\mathcal{A}$ is a function $P : \mathcal{A} \to [0, 1]$ with the following properties:
normed: $P(\Omega) = 1$
countably additive:
$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$, for mutually disjoint $A_1, A_2, \ldots \in \mathcal{A}$
To treat a finite union under countable additivity, pad the finite collection with infinitely many copies of the null set $\emptyset$; thus countable additivity implies finite additivity.
Many sample spaces are infinite sets, and there is no probability measure $P$ that can be defined consistently on every subset of these sets. We thus restrict the domain of $P$ to a sub-collection $\mathcal{A} \subseteq 2^\Omega$.
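As a concrete sketch (a toy finite model chosen for illustration, not from the notes), the normed and additive properties can be checked directly when $\Omega$ is finite and $P$ is uniform:

```python
from fractions import Fraction

# Toy finite probability model (illustrative assumption, not from the notes):
# Omega = {1,...,6} with the uniform measure defined on the power set.
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    # Probability of an event (a subset of omega) under the uniform measure.
    return Fraction(len(frozenset(event) & omega), len(omega))

# normed: P(Omega) = 1
assert P(omega) == 1
# additive on disjoint events: P(A u B) = P(A) + P(B) when A, B are disjoint
A, B = {1, 2}, {5, 6}
assert P(A | B) == P(A) + P(B)
```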
Prop 1.2.2 (Some Event Must Occur)
If $(\Omega, \mathcal{A}, P)$ is a probability model, then $P(\emptyset) = 0$
Proof
Let $A_i = \emptyset$ for $i = 1, 2, \ldots$, so the $A_i$ are mutually disjoint, and $\bigcup_{i=1}^{\infty} A_i = \emptyset$
(Contradiction) Suppose that $P(\emptyset) = c > 0$; then $P(\emptyset) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{\infty} c = \infty$, which contradicts $P(\emptyset) \le 1$, so $P(\emptyset) = 0$
Lecture 2
Hierarchy: elements ($\omega \in \Omega$) -> sets of elements (events $A \subseteq \Omega$, or $A \in 2^\Omega$) -> sigma algebras ($\mathcal{A}$) -> Borel sets ($\mathcal{B}$)
Prop 1.3.1 (Intersection of Sigma Algebras)
If $\{\mathcal{A}_t : t \in T\}$ is a family/set of sigma algebras on $\Omega$, then $\bigcap_{t \in T} \mathcal{A}_t$ is a sigma algebra on $\Omega$
Proof
$\bigcap_{t \in T} \mathcal{A}_t$ must have the properties of a sigma algebra: every $\mathcal{A}_t$ contains $\emptyset$, so the intersection does; if $A_1, A_2, \ldots$ lie in every $\mathcal{A}_t$, then so do $\bigcup_i A_i$ and each $A_i^c$.
Since the intersection contains the null set, is closed under countable unions and complementation, it is a sigma algebra.
Def. Sigma Algebra Generated by C
$\sigma(\mathcal{C})$ is obtained by intersecting all sigma algebras containing $\mathcal{C}$. It is thus the smallest sigma algebra on $\Omega$ containing all subsets in $\mathcal{C}$.
Def. Borel Set
$\mathcal{B}$ is the sigma algebra generated by the open sets. Formally: $\mathcal{B} = \sigma(\{O \subseteq \mathbb{R}^k : O \text{ open}\})$
It is the smallest sigma algebra on $\mathbb{R}^k$ containing all rectangles of the form $(a, b]$ where $a \le b$ (coordinatewise)
$\sigma(\{(a, b] : a \le b\}) \subseteq \mathcal{B}$, since $\mathcal{B}$ contains all such rectangles. $\mathcal{B} \neq 2^{\mathbb{R}^k}$, since there is a subset of $\mathbb{R}^k$ that is not a Borel set.
Loosely speaking, any set that can be defined explicitly is a Borel set. (Nice) transformations of Borel sets are also Borel sets.
Def. Ellipsoidal Region
A ball of radius $r$ centered at $\mu \in \mathbb{R}^k$ is given by $B_r(\mu) = \{x \in \mathbb{R}^k : (x - \mu)^T (x - \mu) \le r^2\}$
The set that forms its boundary is denoted $\partial B_r(\mu)$, and obtained by replacing $\le$ with $=$.
Applying an affine transformation $x \mapsto Ax + b$ to $B_r(0)$, where $A$ is an invertible matrix,
we obtain an ellipsoidal region centered at $b$, whose axes and orientation are determined by $A$ (through the eigenvalues and eigenvectors of $AA^T$)
Recall: A matrix $A$ is…
- symmetric if $A = A^T$
- invertible if $\det(A) \neq 0$ (the 0 matrix is not invertible)
- positive definite if $w^T A w > 0$ for all $w \neq 0$
$\Sigma = AA^T$ is…
symmetric since $(AA^T)^T = (A^T)^T A^T = AA^T$
invertible since $\det(AA^T) = \det(A)\det(A^T) = (\det A)^2 \neq 0$
positive definite since for any $w \neq 0$, $w^T AA^T w = (A^T w)^T(A^T w) = \|A^T w\|^2 \ge 0$
$\|A^T w\|^2 = 0$ iff $A^T w = 0$ iff $w = 0$, since $A$ is invertible (cannot be the 0 matrix) and $A^T$ is invertible (the transpose of an invertible matrix is invertible)
Note: for the multivariate normal, $\mu$ is the mean vector and $\Sigma$ is the variance matrix.
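A quick numeric check of these three properties, sketched with an arbitrarily chosen invertible $A$ (the matrix is an assumption for illustration):

```python
import numpy as np

# Check that Sigma = A A^T is symmetric, invertible, and positive definite
# for an invertible A (A chosen arbitrarily for illustration).
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])          # det(A) = 2, so A is invertible
Sigma = A @ A.T

symmetric = np.allclose(Sigma, Sigma.T)
invertible = abs(np.linalg.det(Sigma)) > 1e-12   # det(Sigma) = det(A)^2 = 4
# w^T Sigma w = ||A^T w||^2 > 0 for w != 0, so all eigenvalues are positive
pos_def = bool(np.all(np.linalg.eigvalsh(Sigma) > 0))
```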
Lecture 3
Def. Limit inferior/superior of a Sequence
For a sequence of sets $A_1, A_2, \ldots$:
$\liminf_n A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i$: $\omega$ is a member of at least one of the intersections $\bigcap_{i=n}^{\infty} A_i$
$\limsup_n A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i$: $\omega$ is a member of all the unions $\bigcup_{i=n}^{\infty} A_i$
Properties: $\liminf_n A_n \subseteq \limsup_n A_n$
If $\liminf_n A_n = \limsup_n A_n = A$, then the sequence converges and we write $\lim_n A_n = A$
Monotone Increasing/Decreasing Sequences
$B_n = \bigcap_{i=n}^{\infty} A_i$ is an increasing sequence of sets (as $n$ increases, fewer sets are intersected, so the resulting intersection gets bigger)
$C_n = \bigcup_{i=n}^{\infty} A_i$ is a decreasing sequence of sets (as $n$ increases, fewer sets are unioned, so the resulting union gets smaller)
Prop 1.4.1 (Monotone Sequences Converge)
A monotone decreasing sequence of sets converges to their intersection.
If $A_1 \supseteq A_2 \supseteq \cdots$, and $A = \bigcap_{i=1}^{\infty} A_i$, then $A_n \to A$
Proof
Need to prove that lim inf = lim sup:
(1) Since $A_i \supseteq A_{i+1}$, we have that $\bigcup_{i=n}^{\infty} A_i = A_n$, so $\limsup_n A_n = \bigcap_{n=1}^{\infty} A_n = A$
(2) Also, $\bigcap_{i=n}^{\infty} A_i = A$ for every $n$, so $\liminf_n A_n = \bigcup_{n=1}^{\infty} A = A$ (if we union the same set over and over again, we get that set)
Optional subproof: $\bigcap_{i=1}^{\infty} A_i \subseteq \bigcap_{i=n}^{\infty} A_i$, since the intersection of many sets $\subseteq$ the intersection of fewer sets. Other direction: let $\omega \in \bigcap_{i=n}^{\infty} A_i$, so $\omega \in A_n \subseteq A_{n-1} \subseteq \cdots \subseteq A_1$, i.e. $\omega \in \bigcap_{i=1}^{\infty} A_i$. Since they are subsets of each other, $\bigcap_{i=n}^{\infty} A_i = \bigcap_{i=1}^{\infty} A_i = A$.
(1 & 2) Since $\liminf_n A_n = \limsup_n A_n = A$, we have convergence: $A_n \to A$
A monotone increasing sequence of sets converges to their union.
If $A_1 \subseteq A_2 \subseteq \cdots$, and $A = \bigcup_{i=1}^{\infty} A_i$, then $A_n \to A$
Proof
(1) Since $A_i \subseteq A_{i+1}$, we have that $\bigcap_{i=n}^{\infty} A_i = A_n$, so $\liminf_n A_n = \bigcup_{n=1}^{\infty} A_n = A$
(2) Also, $\bigcup_{i=n}^{\infty} A_i = A$ for every $n$, so $\limsup_n A_n = \bigcap_{n=1}^{\infty} A = A$ (intersecting the same set over and over again gives that set)
(1 & 2) Since $\liminf_n A_n = \limsup_n A_n = A$, we have convergence: $A_n \to A$.
Prop 1.4.2 (Continuity of P)
If $A_n \to A$ and $A_n, A \in \mathcal{A}$, then $P(A_n) \to P(A)$ as $n \to \infty$
Note: The converse is also true.
Proof
By the previous proposition, we know (1) & (2):
(1) Since $C_n = \bigcup_{i=n}^{\infty} A_i$ is a monotone decreasing sequence, it converges to the intersection of the sets, i.e. $C_n \to \bigcap_{n=1}^{\infty} C_n = \limsup_n A_n = A$
(2) Since $B_n = \bigcap_{i=n}^{\infty} A_i$ is a monotone increasing sequence, it converges to the union of the sets, i.e. $B_n \to \bigcup_{n=1}^{\infty} B_n = \liminf_n A_n = A$
By (1) & (2) and the monotone cases proved next, $P(B_n) \to P(A)$ and $P(C_n) \to P(A)$, and $B_n \subseteq A_n \subseteq C_n$ gives $P(B_n) \le P(A_n) \le P(C_n)$
So $P(A_n) \to P(A)$
Suppose $A_n$ is a monotone increasing sequence, so $A_n \uparrow A = \bigcup_{i=1}^{\infty} A_i$
Now create mutually disjoint $D_i$ like so: $D_1 = A_1$, $D_i = A_i \setminus A_{i-1}$, such that $\bigcup_{i=1}^{n} D_i = A_n$ and $\bigcup_{i=1}^{\infty} D_i = A$
So $P(A_n) = \sum_{i=1}^{n} P(D_i) \to \sum_{i=1}^{\infty} P(D_i) = P(A)$
Suppose $A_n$ is a monotone decreasing sequence, so $A_n^c$ is monotone increasing with $A_n^c \uparrow A^c$.
Hence $P(A_n) = 1 - P(A_n^c) \to 1 - P(A^c) = P(A)$
Prop 1.4.3 (Prob Measure on a Sigma Algebra)
$P$ is a probability measure on $(\Omega, \mathcal{A})$ if $P$ satisfies
(1) $P(\Omega) = 1$
(2) $P$ is finitely additive
(3) $P(A_n) \to 0$ as $n \to \infty$ whenever $A_1 \supseteq A_2 \supseteq \cdots$ and $\bigcap_{n=1}^{\infty} A_n = \emptyset$
Proof
(1) and (2) are contained in the def of probability measure (normed and countably additive)
Combining finite additivity (2) with continuity (3), we have that P is countably additive:
(3) can also be written as: $P(A_n) \to P(A)$ whenever $A_n \downarrow A$ (apply the stated form to $A_n \setminus A \downarrow \emptyset$)
Let $A = \bigcup_{i=1}^{\infty} A_i$, where the $A_i \in \mathcal{A}$ are mutually disjoint.
Then $B_n = \bigcup_{i=1}^{n} A_i$ is a monotone increasing sequence of events with $B_n \uparrow A$, so $A \setminus B_n \downarrow \emptyset$
Since $P(A) = P(B_n) + P(A \setminus B_n) = \sum_{i=1}^{n} P(A_i) + P(A \setminus B_n)$ and $P(A \setminus B_n) \to 0$, letting $n \to \infty$ gives $P(A) = \sum_{i=1}^{\infty} P(A_i)$
So continuity + finite additivity $\Rightarrow$ countable additivity
Important Note: Countable additivity $\Rightarrow$ continuity of P. By ensuring countable additivity, we ensure continuity of P, which is needed when we have an infinite sample space.
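Continuity of $P$ can be sketched numerically (a toy example using the uniform length measure on $(0, 1]$, chosen for illustration): the decreasing sequence $A_n = (0, 1/n]$ has empty intersection, and $P(A_n) = 1/n$ decreases to $P(\emptyset) = 0$.

```python
# Continuity of P sketched with the uniform (length) measure on (0, 1]:
# A_n = (0, 1/n] decreases to the empty set, and P(A_n) = 1/n decreases to 0.
def P_interval(a, b):
    # uniform probability of the interval (a, b] intersected with (0, 1]
    return max(0.0, min(b, 1.0) - max(a, 0.0))

probs = [P_interval(0.0, 1.0 / n) for n in range(1, 10001)]
decreasing = all(p1 >= p2 for p1, p2 in zip(probs, probs[1:]))
limit = probs[-1]   # P(A_10000) = 1/10000
```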
Def. Conditional Probability Model
If $(\Omega, \mathcal{A}, P)$ is a probability model and $C \in \mathcal{A}$ has $P(C) > 0$, then the conditional probability model given $C$ is $(\Omega, \mathcal{A}, P(\cdot \mid C))$, where $P(\cdot \mid C)$ is given by $P(A \mid C) = \frac{P(A \cap C)}{P(C)}$
Proof
(1) $P(\Omega \mid C) = P(\Omega \cap C)/P(C) = P(C)/P(C) = 1$
(2) If $A_1, A_2, \ldots \in \mathcal{A}$ are mutually disjoint,
then $P\left(\bigcup_i A_i \mid C\right) = \frac{P\left(\bigcup_i (A_i \cap C)\right)}{P(C)} = \frac{\sum_i P(A_i \cap C)}{P(C)} = \sum_i P(A_i \mid C)$
Since $P(\cdot \mid C)$ is normed and countably additive, $(\Omega, \mathcal{A}, P(\cdot \mid C))$ is a probability model.
Note: The model can also be presented as $(C, \{A \cap C : A \in \mathcal{A}\}, P(\cdot \mid C))$
Prop 1.5.1 (LOTP / Thm of Total Prob.)
Suppose $A_1, A_2, \ldots \in \mathcal{A}$ are mutually disjoint with $P(A_i) > 0$, and $\bigcup_i A_i = \Omega$; then for any $A \in \mathcal{A}$, $P(A) = \sum_i P(A_i) P(A \mid A_i)$
Proof
Since $A = \bigcup_i (A \cap A_i)$ where the $A \cap A_i$ are mutually disjoint, $P(A) = \sum_i P(A \cap A_i) = \sum_i P(A_i) P(A \mid A_i)$
Fact: If $\{A_i\}$ and $\{B_j\}$ are each a partition of $\Omega$, then $\{A_i \cap B_j\}$ is a partition of $\Omega$ and the sets $A_i \cap B_j$ are mutually disjoint
Proof
Since $B_j \cap B_l = \emptyset$ when $j \neq l$, we have $(A_i \cap B_j) \cap (A_k \cap B_l) = \emptyset$ whenever $(i, j) \neq (k, l)$, and $\bigcup_{i,j} (A_i \cap B_j) = \bigcup_i \left(A_i \cap \bigcup_j B_j\right) = \bigcup_i A_i = \Omega$ (also $\bigcup_j (A_i \cap B_j) = A_i$)
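The law of total probability can be sketched on a small uniform space (the partition and event below are arbitrary choices for illustration):

```python
from fractions import Fraction

# Law of total probability on a finite uniform space (toy example):
# P(A) = sum_i P(A | C_i) P(C_i) over a partition {C_1, C_2, C_3} of omega.
omega = frozenset(range(12))
def P(E): return Fraction(len(frozenset(E) & omega), len(omega))

C = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]  # a partition
A = {1, 5, 6, 9, 10, 11}

lhs = P(A)
# P(A | C_i) = P(A n C_i) / P(C_i)
rhs = sum((P(A & Ci) / P(Ci)) * P(Ci) for Ci in C)
```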
Lecture 4
Def. Statistically Independent
If $(\Omega, \mathcal{A}, P)$ is a probability model and $A, C \in \mathcal{A}$, then A and C are statistically independent if $P(A \cap C) = P(A)P(C)$
It follows that when $P(C) > 0$, $P(A \mid C) = P(A)$
Statistically Independent Sigma Algebras
$A$ and $B$ are statistically independent if every element of the sigma algebra generated by $A$ is statistically independent of every element of the sigma algebra generated by B: $\sigma(\{A\}) = \{\emptyset, A, A^c, \Omega\}$ and $\sigma(\{B\}) = \{\emptyset, B, B^c, \Omega\}$
Proof
$A$ and $B^c$ are statistically independent since $P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)P(B)$, and so $P(A \cap B^c) = P(A)(1 - P(B)) = P(A)P(B^c)$
$A^c$ and $B$ are statistically independent since $P(A^c \cap B) = P(B) - P(A \cap B) = P(B) - P(A)P(B)$, and so $P(A^c \cap B) = (1 - P(A))P(B) = P(A^c)P(B)$
$A^c$ and $B^c$ are statistically independent since $P(A^c \cap B^c) = 1 - P(A \cup B) = 1 - P(A) - P(B) + P(A)P(B)$, and so $P(A^c \cap B^c) = (1 - P(A))(1 - P(B)) = P(A^c)P(B^c)$
$\Omega$ and any element $E$ are statistically independent in the same vein, since $P(\Omega \cap E) = P(E) = P(\Omega)P(E)$.
$\emptyset$ and any element $E$ are statistically independent since $P(\emptyset \cap E) = 0 = P(\emptyset)P(E)$
Def. Mutually Statistically Independent
When $(\Omega, \mathcal{A}, P)$ is a probability model and $\{\mathcal{A}_t : t \in T\}$ is a collection of sub sigma algebras of $\mathcal{A}$, the $\mathcal{A}_t$ are mutually statistically independent if $P(A_{t_1} \cap \cdots \cap A_{t_n}) = P(A_{t_1}) \cdots P(A_{t_n})$, where $t_1, \ldots, t_n \in T$ are distinct, $A_{t_i} \in \mathcal{A}_{t_i}$, and $n \ge 2$.
Notes
Pairwise Independence $\nRightarrow$ Mutual Independence
i.e. $P(A_i \cap A_j) = P(A_i)P(A_j)$ for all $i \neq j$ does not guarantee the full product condition
Without pairwise independence, mutual independence cannot hold (mutual independence $\Rightarrow$ pairwise independence)
Union of 3 events (Inclusion-Exclusion Principle): $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$
Proof
Generalized to n events: $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i < j} P(A_i \cap A_j) + \sum_{i < j < k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$
Proof
Base: The result is true for n=2: $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$
I.H. Assume it's true for n
Consider $P\left(\bigcup_{i=1}^{n+1} A_i\right) = P\left(\bigcup_{i=1}^{n} A_i\right) + P(A_{n+1}) - P\left(\bigcup_{i=1}^{n} (A_i \cap A_{n+1})\right)$, and apply the inductive hypothesis to the first and last terms.
Combining the above, we have the result for n+1
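The three-event identity can be checked exhaustively on a small uniform space (the events below are arbitrary choices for illustration):

```python
from fractions import Fraction

# Inclusion-exclusion for three events, checked on a small uniform space.
omega = frozenset(range(10))
def P(E): return Fraction(len(frozenset(E) & omega), len(omega))

A, B, C = {0, 1, 2, 3}, {2, 3, 4, 5}, {3, 5, 6, 7}
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
```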
Intersection of 3 events (Multiplication Rule): $P(A \cap B \cap C) = P(A) P(B \mid A) P(C \mid A \cap B)$
Proof
Generalized to n events: $P(A_1 \cap \cdots \cap A_n) = P(A_1) P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})$
2. Random Variables and Stochastic Processes
Lecture 5
Motivation: If we have a population $\Omega$, a measurement of some sort $X : \Omega \to \mathbb{R}$, and we want to assign probabilities to events such as $\{X \le b\}$ or $\{X \in B\}$, these events live in $\mathbb{R}$ while $P$ is defined on $\Omega$; this is difficult to handle directly. To navigate this, we use inverse images.
Def. Inverse Image
Under the function $X : \Omega \to \mathbb{R}$, the inverse image of the set $B \subseteq \mathbb{R}$ is given by $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$
By $\subseteq$ and $\supseteq$, we have $X^{-1}\left(\bigcup_i B_i\right) = \bigcup_i X^{-1}(B_i)$, since they are subsets of each other
Proof for Complements
Let $\omega \in X^{-1}(B^c)$, then $X(\omega) \in B^c$, i.e. $X(\omega) \notin B$
So $\omega \notin X^{-1}(B)$, i.e. $\omega \in (X^{-1}(B))^c$, giving $X^{-1}(B^c) \subseteq (X^{-1}(B))^c$
Suppose $\omega \in (X^{-1}(B))^c$, then $X(\omega) \notin B$, so $X(\omega) \in B^c$
So $\omega \in X^{-1}(B^c)$, giving $(X^{-1}(B))^c \subseteq X^{-1}(B^c)$
By $\subseteq$ and $\supseteq$, $X^{-1}(B^c) = (X^{-1}(B))^c$
Property: If $B_1 \cap B_2 = \emptyset$, then $X^{-1}(B_1)$ and $X^{-1}(B_2)$ are also disjoint.
Proof
Suppose $\omega \in X^{-1}(B_1) \cap X^{-1}(B_2)$, then $X(\omega) \in B_1 \cap B_2 = \emptyset$, a contradiction.
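The two properties above can be illustrated on a tiny hand-made map (the space and function values are assumptions for illustration):

```python
# Inverse images under a map X: Omega -> R, illustrating
# X^{-1}(B^c) = (X^{-1}(B))^c and that disjoint sets have disjoint preimages.
omega = {"a", "b", "c", "d"}
X = {"a": 1, "b": 1, "c": 2, "d": 3}

def inv(B):
    # X^{-1}(B) = {w in Omega : X(w) in B}
    return {w for w in omega if X[w] in B}

# complements: X^{-1}({2,3}) equals the complement of X^{-1}({1})
complement_ok = inv({2, 3}) == omega - inv({1})
# disjoint sets have disjoint inverse images
disjoint_ok = inv({1}) & inv({2}) == set()
```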
Def. Random Variable
A random variable is a function $X : \Omega \to \mathbb{R}$ with the property that for any $B \in \mathcal{B}$ (i.e. Borel set in $\mathbb{R}$), $X^{-1}(B) \in \mathcal{A}$.
Thus, when X is a random variable, $P(X^{-1}(B))$ is defined for every $B \in \mathcal{B}$
Prop 2.1.1 (Marginal Probability Measure)
When X is a r.v., the marginal probability measure of X is $P_X$, which is defined on $(\mathbb{R}, \mathcal{B})$ by $P_X(B) = P(X^{-1}(B))$
Proof
Normed: $P_X(\mathbb{R}) = P(X^{-1}(\mathbb{R})) = P(\Omega) = 1$
Countably additive: If $B_1, B_2, \ldots$ are mutually disjoint elements of $\mathcal{B}$, then $P_X\left(\bigcup_i B_i\right) = P\left(\bigcup_i X^{-1}(B_i)\right) = \sum_i P(X^{-1}(B_i)) = \sum_i P_X(B_i)$
Note: The probability model for a random variable X is $(\mathbb{R}, \mathcal{B}, P_X)$
Prop 2.1.2 (Determine whether X is a random variable)
If $X^{-1}((-\infty, b]) \in \mathcal{A}$ for every $b \in \mathbb{R}$, then $X$ is a random variable.
Proof
Let $\mathcal{C} = \{B \subseteq \mathbb{R} : X^{-1}(B) \in \mathcal{A}\}$
Since $X^{-1}(\emptyset) = \emptyset$ and $\emptyset \in \mathcal{A}$, we know $\emptyset \in \mathcal{C}$
If $B_1, B_2, \ldots \in \mathcal{C}$, then $X^{-1}\left(\bigcup_i B_i\right) = \bigcup_i X^{-1}(B_i)$
Since each $X^{-1}(B_i) \in \mathcal{A}$ and $\mathcal{A}$ is closed under countable unions, we know $\bigcup_i B_i \in \mathcal{C}$
If $B \in \mathcal{C}$, then $X^{-1}(B^c) = (X^{-1}(B))^c$
Since $X^{-1}(B) \in \mathcal{A}$ and $\mathcal{A}$ is closed under complementation, we know $B^c \in \mathcal{C}$
By 1 (contains null set), 2 (closed under union), & 3 (closed under comp), we know $\mathcal{C}$ is a sigma algebra of subsets of $\mathbb{R}$
By hypothesis, $(-\infty, b] \in \mathcal{C}$ for every $b$, so $\mathcal{B} \subseteq \mathcal{C}$, since $\mathcal{B}$ is the smallest sigma algebra containing all such intervals
By $\mathcal{B} \subseteq \mathcal{C}$, $X^{-1}(B) \in \mathcal{A}$ for every $B \in \mathcal{B}$, so X is a random variable.
Examples
is a r.v. since for any ,
is a r.v. since for any ,
is a r.v. since for any ,
is a r.v. since for any b,
if is even, and
if is odd
(projection on the ith coordinate) is a r.v. since for any b,
Also, when X is continuous on , so it must be a r.v.
Note: When $\mathcal{A} = 2^{\Omega}$, then any $X : \Omega \to \mathbb{R}$ is a random variable.
Prop 2.1.3 (Sum & Prod of R.V.s are R.V.s)
If X, Y are random variables defined on $(\Omega, \mathcal{A}, P)$, then (1) W = X+Y and (2) W = XY are both random variables.
Proof of (1) W = X + Y
Suppose $W(\omega) = X(\omega) + Y(\omega) < b$
Let $q \in \mathbb{Q}$ be such that $X(\omega) < q < b - Y(\omega)$; then $X(\omega) < q$ and $Y(\omega) < b - q$, so $\omega \in X^{-1}((-\infty, q))$ and $\omega \in Y^{-1}((-\infty, b - q))$
We can take the intersection to get that $\omega \in X^{-1}((-\infty, q)) \cap Y^{-1}((-\infty, b - q))$
We can express the set of all such $\omega$ as a union over the rationals, so $W^{-1}((-\infty, b)) = \bigcup_{q \in \mathbb{Q}} X^{-1}((-\infty, q)) \cap Y^{-1}((-\infty, b - q))$
Since $\mathbb{Q}$ is countable, and $W^{-1}((-\infty, b))$ is a countable union of elements of $\mathcal{A}$, we have that $W^{-1}((-\infty, b)) \in \mathcal{A}$
Since $W^{-1}((-\infty, b]) = \bigcap_{n=1}^{\infty} W^{-1}((-\infty, b + 1/n))$, where the sequence is monotone decreasing, $W^{-1}((-\infty, b]) \in \mathcal{A}$, so W is a r.v.
Proof of (2) W = XY
Suppose b = 0, then
Suppose b > 0, then
We've shown , so we just need to show the other part: .
Since xy=b is symmetrical over the line y=-x, proving the argument for one of 1 & 4 will suffice.
Suppose and let . Then such that
since is countable.
Since
A similar argument holds for b < 0. For any b, , so W=XY is a r.v.
E.g. Any polynomial $p(X) = a_0 + a_1 X + \cdots + a_k X^k$ is a r.v. if X is a r.v.
Any constant function is a r.v., so all $a_i$ are r.v.'s.
The product of r.v.'s is a r.v., so all $a_i X^i$ are r.v.'s
The sum of r.v.'s is a r.v., so $p(X)$ is a r.v.
Prop 2.1.4 (Sigma Algebra generated by X)
When X is a random variable, $\sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}\}$ is a sub sigma algebra of $\mathcal{A}$, called the sigma algebra on $\Omega$ generated by X.
Alternative notation:
Proof
If $A_1, A_2, \ldots \in \sigma(X)$, then $\exists B_1, B_2, \ldots \in \mathcal{B}$ such that $A_i = X^{-1}(B_i)$.
So $\bigcup_i A_i = X^{-1}\left(\bigcup_i B_i\right) \in \sigma(X)$ (since $\bigcup_i B_i \in \mathcal{B}$)
If $A \in \sigma(X)$, then $\exists B \in \mathcal{B}$ such that $A = X^{-1}(B)$.
So $A^c = X^{-1}(B^c) \in \sigma(X)$ (since $B^c \in \mathcal{B}$)
By 1 (contains null set: $\emptyset = X^{-1}(\emptyset)$), 2 (closed under unions), 3 (closed under complementation), $\sigma(X)$ is a sub sigma algebra of $\mathcal{A}$
Def. Random Vector
Recall
A random variable is a function $X : \Omega \to \mathbb{R}$ with the property that for any $B \in \mathcal{B}$, $X^{-1}(B) \in \mathcal{A}$.
Thus, when X is a random variable, , since
A random vector is a function $X : \Omega \to \mathbb{R}^k$ with the property that for any $B \in \mathcal{B}^k$, $X^{-1}(B) \in \mathcal{A}$.
Thus, when X is a random vector, , since
Properties
is a random vector
The marginal probability measure of is given by
The generated by is
Example (Pt. 1)
Suppose we have , and the uniform prob measure
Let be given by where are defined as and
Example (Pt. 2)
What if we change the def of ? If are now and , what is ?
Only 2 possible outputs now: (0,1) and (1, 0)
Then for
Example (Pt. 3)
If P is not uniform, but instead defined , what is ?
Prop 2.1.5 (Cartesian Prod of Borel Sets is a Borel Set)
If $B_1, \ldots, B_k \in \mathcal{B}$, then $B_1 \times \cdots \times B_k \in \mathcal{B}^k$, and $\mathcal{B}^k$ is the smallest sigma algebra on $\mathbb{R}^k$ containing all such sets
Proof
Consider the sets that only restrict the ith coord.
Then is a sub of
Sub-proofLet
If for i = 1, 2, …, then since
If , then since .
So
Since each k-cell is of this form, there a on containing all such sets that is smaller than .
Prop 2.1.6 (A Vector of R.V.s is a Random Vector)
If $X_i$ is a random variable for $i = 1, \ldots, k$, then $X = (X_1, \ldots, X_k)$ is a random vector.
Proof
Suppose . By the previous proposition, . Then we have
Since is a random vector.
Lecture 6
Def. K-cells
, or
K-cells are the basic sets we want to assign probabilities to (using random vectors)
For k = 2, $(a, b] = (a_1, b_1] \times (a_2, b_2]$
Def. Cumulative Distribution Function (CDF)
The cumulative distribution function for random vector $X$ is given by $F_X(x_1, \ldots, x_k) = P(X_1 \le x_1, \ldots, X_k \le x_k)$
Def. Difference Operator
For any $a_i \le b_i$, the i-th difference operator is given by $\Delta^{i}_{(a_i, b_i]} F(x_1, \ldots, x_k) = F(x_1, \ldots, x_{i-1}, b_i, x_{i+1}, \ldots, x_k) - F(x_1, \ldots, x_{i-1}, a_i, x_{i+1}, \ldots, x_k)$
Prop 2.2.1 (Properties of Distribution Functions)
Any distribution function satisfies
If $x_i \to -\infty$ for some $i$, then $F_X(x) \to 0$; if $x_i \to \infty$ for all $i$, then $F_X(x) \to 1$
$F_X$ is right continuous
If $a \le b$ (coordinatewise), then $\Delta^{1}_{(a_1, b_1]} \cdots \Delta^{k}_{(a_k, b_k]} F_X = P(X \in (a, b]) \ge 0$
Proof for (1)
Proof for (2)
Proof for (3)
Thm 2.2.1 (Extension Theorem)
If $F$ satisfies the 3 properties of distribution functions, then $\exists$ a unique probability measure $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$, such that $F$ is the distribution function of $P$
Note: such an $F$ determines a probability model $(\mathbb{R}^k, \mathcal{B}^k, P)$ and we can define a random vector with this model by taking $\Omega = \mathbb{R}^k$ and $X(\omega) = \omega$
Now we can present $P$ by a function of points (rather than sets)
Def. Marginal Distributions
Def. Discrete Probability Models
Prop 2.3.1 (Countably Many Points with Positive Prob)
Prop 2.3.2 (Prob Measure Defined by p)
Def. Multinomial Distribution
Def. Multivariate Hypergeometric Distribution
Lecture 7
Def. Continuous Probability Models
Def. Absolutely Continuous Probability Models
Def. Probability Density Functions (PDF)
Prop 2.4.1 (Properties of A.C. Models)
with probability 1
Prop 2.4.2 (Properties of PDFs)
$f$ is a density function for a.c. model $P$ if $f(x) \ge 0$ for all $x$ and $P(B) = \int_B f(x) \, dx$ for all $B \in \mathcal{B}^k$
Def. Multivariate Normal Distribution
Lecture 8 & 9
Suppose we transform the random vector to the random vector
Discrete case
If $X$ is discrete (with prob function $p_X$), then $Y = h(X)$ is discrete with $p_Y(y) = \sum_{x : h(x) = y} p_X(x)$
Def. Projections (& their Prob Functions)
Suppose , then the projection on the first 2 coordinates is
Prob Function Derivation:
To find the probability functions of projections, take the joint probability function, and sum out unwanted variables.
The projection on the second coordinate is
Prob Function Derivation:
Marginal of a Multinomial Random Vector
Let multinomial , then where ,
Suppose , , and we want to find the distribution of
By the defined constraints, and
and
Thus, multinomial
Binomial(n, p) = Multinomial(n, p, 1-p)
If multinomial , then prove binomial multinomial Note this is easy to see intuitively since the multinomial arises by placing ind. observations into mutually disjoint categories, and when we project onto coordinates we are now categorizing into mutually disjoint categories
So
Sum of sub-Multinomial Random Vector ~ Binomial
Use the previous note to determine the distribution of for when
Note: in the discrete case, if $h$ is 1-1 and $Y = h(X)$, then $p_Y(y) = p_X(h^{-1}(y))$
$Y = X_1 + \cdots + X_l$ is the number of responses falling in the first $l$ categories.
A response falls into one of these categories with probability $p_1 + \cdots + p_l$.
So $Y \sim \text{binomial}(n, p_1 + \cdots + p_l)$
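A Monte Carlo sketch of the marginal result (the parameters below are arbitrary choices, not from the notes): the first coordinate of a multinomial draw should match the binomial mean and variance.

```python
import numpy as np

# Sketch: the first coordinate of Multinomial(n, (p1, p2, p3)) behaves like
# Binomial(n, p1), so its sample mean and variance should be near
# n*p1 and n*p1*(1 - p1).
rng = np.random.default_rng(0)
n, p = 20, np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, p, size=200_000)
x1 = draws[:, 0]

mean_err = abs(x1.mean() - n * p[0])              # target mean: 4.0
var_err = abs(x1.var() - n * p[0] * (1 - p[0]))   # target variance: 3.2
```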
Def. Indicator Function
For $A \subseteq \Omega$, the indicator function $I_A$ is given by $I_A(\omega) = 1$ if $\omega \in A$, and $I_A(\omega) = 0$ otherwise
Indicator Variable ~ Bernoulli(P(A))
Prove: if $(\Omega, \mathcal{A}, P)$ is a probability model and $A \in \mathcal{A}$, then $I_A$ is a random variable with $I_A \sim \text{Bernoulli}(P(A))$
$P(I_A = 1) = P(A)$, and $P(I_A = 0) = 1 - P(A)$
Since for any $b$, $I_A^{-1}((-\infty, b]) \in \{\emptyset, A^c, \Omega\} \subseteq \mathcal{A}$, we know $I_A$ is a r.v.
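A quick simulation sketch of this fact (the event $A = \{U \le 0.3\}$ is an assumption for illustration):

```python
import numpy as np

# The indicator of A = {U <= 0.3} for U ~ Uniform(0, 1) is Bernoulli(0.3),
# so its sample mean should be near P(A) = 0.3.
rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)
indicator = (u <= 0.3).astype(int)   # takes only the values 0 and 1
est = indicator.mean()
```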
Transformation Determines Distribution Type
could have a discrete distribution no matter how is distributed.
E.g.Suppose for every , then
and the distribution of is degenerate at
E.g. Suppose , so
Bernoulli
Absolutely continuous case
Suppose has density function , and where .
is also absolutely continuous with density which we want to determine.
Cdf Method
Generally, the cdf method works with projections when there is a formula for :
E.g. Define by
It was proved (in a lec 6 exercise) that this is a cdf (using thm 2.2.1),
so
Check that is a valid :
(i) for all
(ii) f is normed:
so it is valid and we obtain
Therefore, if , then
so , and
Thus, both and have exponential(1) distributions
E.g. Suppose , and () has the triangular density
Change of Variable Method
Suppose $T : \mathbb{R}^k \to \mathbb{R}^k$ is 1-1 and smooth (i.e. all 1st order partial derivatives exist and are continuous),
so $T^{-1}$ exists and we can find the Jacobian $J_T(x) = \det\left(\frac{\partial T_i(x)}{\partial x_j}\right)$
$|J_T(x)|$ indicates how $T$ is changing volume at $x$,
so $|J_T(x)| > 1$ means $T$ expands volume at $x$, and $|J_T(x)| < 1$ means $T$ contracts volume at $x$
If $Y = T(X)$, then for small neighborhoods, $P(Y \text{ near } y) \approx P(X \text{ near } T^{-1}(y))$, with volumes rescaled by $|J_T|$
This intuitive argument can be made rigorous to prove the following.
Proposition II.5.1 (Change of Variable)
If $T$ is 1-1 and smooth, and $Y = T(X)$ where $X$ has an a.c. distribution with density $f_X$, then $Y$ has an a.c. distribution with density $f_Y(y) = f_X(T^{-1}(y)) \, |J_T(T^{-1}(y))|^{-1}$
E.g. If we have a uniform dist for , find the density for .
, for
Note: solving , we see that T contracts lengths on (0, 1/2) and expands lengths on (1/2, 2)
E.g. Prove for the $N(0, 1)$ pdf $\varphi(z) = c \, e^{-z^2/2}$ that the norming constant is $c = (2\pi)^{-1/2}$.
Consider $\left(\int_{-\infty}^{\infty} e^{-z^2/2} \, dz\right)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2} \, dx \, dy$
Make the polar coordinate change of variable $x = r\cos\theta$, $y = r\sin\theta$ where $r > 0$, $\theta \in [0, 2\pi)$
Since $x^2 + y^2 = r^2$, and the Jacobian of the transformation is $r$,
so $\left(\int e^{-z^2/2} \, dz\right)^2 = \int_0^{2\pi}\int_0^{\infty} r e^{-r^2/2} \, dr \, d\theta = 2\pi$, giving $\int_{-\infty}^{\infty} e^{-z^2/2} \, dz = \sqrt{2\pi}$
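A numeric counterpart of the polar-coordinate argument (a crude quadrature sketch, not from the notes): the $N(0,1)$ density should integrate to 1.

```python
import math

# Midpoint-rule check that the N(0,1) density integrates to 1, i.e. that the
# integral of exp(-x^2/2) over the real line is sqrt(2*pi). Truncating to
# [-10, 10] loses only a negligible tail mass.
def phi(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

n, a, b = 200_000, -10.0, 10.0
h = (b - a) / n
total = h * sum(phi(a + (i + 0.5) * h) for i in range(n))
```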
Def. Affine Transformation
(Affine transformations are linear transformations plus a constant.)
$T$ is an affine transformation if $T(x) = Ax + b$ where $A \in \mathbb{R}^{k \times k}$, $b \in \mathbb{R}^k$
So $T(x) - T(y) = A(x - y)$
Note: $T(x) = T(y)$ iff $A(x - y) = 0$, so $T$ is 1-1 iff $A$ is a nonsingular (invertible) matrix, in which case $T^{-1}(y) = A^{-1}(y - b)$
If $Y = AX + b$, then $f_Y(y) = f_X(A^{-1}(y - b)) \, |\det A|^{-1}$
Multivariate Normal
Suppose $Z_1, \ldots, Z_k$ are iid $N(0, 1)$, so $f_Z(z) = (2\pi)^{-k/2} e^{-z^T z / 2}$ for $z \in \mathbb{R}^k$
Let $X = AZ + \mu$ where $A$ is nonsingular and $\Sigma = AA^T$; then since $X$ is an affine transformation of $Z$, we know it has an a.c. distribution with density:
$f_X(x) = (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$, where $\det \Sigma = (\det A)^2$
If a random vector has this pdf, we write $X \sim N_k(\mu, \Sigma)$
Note is symmetric, invertible, and positive definite (see note from lecture 2)
Ex. Suppose and , where is nonsingular and . Prove that .
Ex. Suppose and where is nonsingular. Prove that .
Ex. Using k = 2, and write out the density in terms of and
Def. Spectral Decomposition
For any positive definite matrix $\Sigma$, $\Sigma = Q \Lambda Q^T$, where $Q$ is orthogonal and $\Lambda$ is diagonal with positive diagonal entries
Recall
The diagonal entries of $D$ are the eigenvalues $\lambda_i$, and the column vectors of $P$ are the eigenvectors
So the diagonal entries of $\Lambda$ are the eigenvalues of $\Sigma$, and the column vectors of Q are the eigenvectors
Properties of the Multivariate Normal
For , positive definite , is a valid pdf, such that
Using the spectral decomposition for , we have
So is an eigenvalue of with eigenvector
The symmetric square root of $\Sigma$ is $\Sigma^{1/2} = Q \Lambda^{1/2} Q^T$, where $\Lambda^{1/2} = \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_k^{1/2})$: $\Sigma^{1/2}\Sigma^{1/2} = Q\Lambda^{1/2}Q^T Q\Lambda^{1/2}Q^T = Q\Lambda Q^T = \Sigma$
If and , then the affine transf. where
Therefore, whenever is p.d. (positive definite), defines a valid pdf on
The level sets of are given by
Note: here indicates boundary of a set, not partial derivative
Plugging in :
is the ellipsoid in with i-th semi-principal axis along (the i-th standard basis vector) of length
so the i-th semi-principal axis of is on the line of length
(Extra) Principal component analysis: when ordered, is the first principal component. And if is the bigger eigen value, then there is more variability along the 1st principal axis.
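The spectral decomposition and symmetric square root can be verified numerically (a sketch with an arbitrarily chosen p.d. $\Sigma$, assumed for illustration):

```python
import numpy as np

# Spectral decomposition Sigma = Q Lambda Q^T and the symmetric square root
# Sigma^{1/2} = Q Lambda^{1/2} Q^T, for an arbitrary p.d. Sigma.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
lam, Q = np.linalg.eigh(Sigma)   # eigenvalues (ascending) and orthonormal Q

recon_ok = np.allclose(Q @ np.diag(lam) @ Q.T, Sigma)
sqrt_Sigma = Q @ np.diag(np.sqrt(lam)) @ Q.T
sqrt_ok = np.allclose(sqrt_Sigma @ sqrt_Sigma, Sigma)
orth_ok = np.allclose(Q.T @ Q, np.eye(2))   # Q is orthogonal
```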
Ex. Prove: If is p.d. with spectral decomposition , then
Lecture 10
Def. Stochastic Process
A stochastic process (or random process) is a set $\{X_t : t \in T\}$ where each $X_t$ is a random variable defined w.r.t. the probability model $(\Omega, \mathcal{A}, P)$, and $T$ is called the index set of the process.
Note In many applications, we need to consider stochastic processes since their dependence on index (t) is important. T can be infinite, negative, and multi-dimensional. It can be a very general set (like the nodes of a graph). Stochastic processes where t is time are referred to as time series.
E.g. A random vector is equiv. to the stochastic process where
E.g. A random variable is equiv. to the stochastic process where
E.g. Suppose a coin is tossed (tosses are ind.) until the 1st head is observed and we record that number.
Denoting a head by 1 and a tail by 0, the sample space is $\Omega = \{0, 1\}^{\infty}$, i.e. the set of all sequences of 0's and 1's. We need an infinite dimensional $\Omega$ here.
If we define , then is a stochastic process, called a Bernoulli(p) process.
If we define when , and , is a well defined r.v.?
Let $p$ = probability of a head on a single toss. If $p > 0$, then $X$ is defined (with probability 1) since P(an infinite sequence of tails) $= \lim_n (1 - p)^n = 0$. Otherwise, $X$ is undefined.
Then using independence, we obtain the probability function $p_X(k) = (1 - p)^{k-1} p$ for $k = 1, 2, \ldots$, i.e. $X \sim$ geometric(p)
Ex. Prove that $p_X(k) = (1 - p)^{k-1} p$, $k = 1, 2, \ldots$, defines a probability distribution.
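Numerically, the geometric probabilities sum to 1 very quickly (a sketch with $p = 0.3$ chosen arbitrarily):

```python
# The geometric(p) probability function p(k) = (1-p)^(k-1) * p, k = 1, 2, ...,
# sums to 1; the partial sums approach 1 geometrically fast.
p = 0.3
partial = sum((1 - p) ** (k - 1) * p for k in range(1, 200))
```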
Def. Sample Function
Each realized value of a stochastic process can be thought of as a function (called sample function) , with value at index . So in effect, a stochastic process is a probability measure on functions from .
A stochastic process is a generalization of a r.v. where we start with and get as follows:
= is the set of functions with domain mapping into
is the smallest on containing all sets of the form for any and intervals
Then if ,
E.g. If , then the sample function gives a ray from the origin with randomly distributed slope.
E.g. If Uniform(0, 10), , then gives a cosinusoid with random frequency.
Prop 2.6.1 (Kolmogorov's Consistency Theorem)
Background If for , the distribution of can be obtained from that of , then we can say that the finite dimensional distributions are consistent.
Suppose , and a probability model is given for each . If the probability models are consistent, then a probability model and random variables such that is a stochastic process with
E.g. Let be the discrete prob measure concentrated on given by the prob function
These distributions are consistent. Below is proof for
So by Kolmogorov's Consistency Thm, this is a valid definition of a s.p. (stochastic process)
Def. Gaussian Process
A s.p. is a Gaussian process whenever
where is the mean vector, and is the variance matrix, which is p.d. for every
Def. Gaussian White Noise Process
For , where
So
Lecture 11
Recall Def. Mutual Stat Ind:
When $(\Omega, \mathcal{A}, P)$ is a probability model and $\{\mathcal{A}_t : t \in T\}$ is a collection of sub sigma algebras of $\mathcal{A}$, the $\mathcal{A}_t$ are mutually statistically independent if $P(A_{t_1} \cap \cdots \cap A_{t_n}) = P(A_{t_1}) \cdots P(A_{t_n})$, where $t_1, \ldots, t_n \in T$ are distinct, $A_{t_i} \in \mathcal{A}_{t_i}$, and $n \ge 2$.
Recall Def. Sigma Algebra Generated by : is obtained by intersecting all containing .
The $\sigma$-algebra generated by r.v. $X_t$ is a sub $\sigma$-algebra of $\mathcal{A}$, given by $\sigma(X_t) = \{X_t^{-1}(B) : B \in \mathcal{B}\}$
Ex. Prove that is a sub of
such that
Since
such that
Since ,
Def. Statistically Independent RVs
For the collection of random variables , the are mutually statistically independent if the are mutually statistically independent in the collection of .
Prop 2.7.1 (Mut. Stat. Ind iff Joint = Prod of Marginals)
For the collection of random variables $\{X_t : t \in T\}$, the $X_t$ are mutually statistically independent iff the joint cdf of $(X_{t_1}, \ldots, X_{t_n})$ factors as the product of the marginal cdfs, where $t_1, \ldots, t_n \in T$ are distinct.
I.e. for every $(x_{t_1}, \ldots, x_{t_n})$, $F_{X_{t_1}, \ldots, X_{t_n}}(x_{t_1}, \ldots, x_{t_n}) = \prod_{i=1}^{n} F_{X_{t_i}}(x_{t_i})$
Proof
Suppose mut. stat. ind., show the factorization holds
() Suppose the factorization holds, show mut. stat. ind.
The cdf determines . Since the cdf of the joint probability measure is obtained by multiplying the marginal probability measures we know that are mutually statistically independent.
Also, the collection of cdfs is consistent. By KCT, this determines , and so the collection of random variables are mutually statistically independent.
Prop 2.7.2 (Mut. Stat. Ind iff Joint = Prod of Marginals)
For the collection of random variables and each
if each has a discrete distribution, then the are mutually statistically independent iff for every
if each has an a.c. distribution, then the are mutually statistically independent iff for every
Proof Discrete case
Suppose are mutually stat. ind., then by Prop 2.7.1 , the cdf of is .
Since has a discrete distribution, it has the probability function , so has the probability function
Proof Absolutely Continuous Case
Prop 2.7.1 implies which implies
So the $X_i$ are mutually statistically independent by Prop 2.7.1.
E.g. Bernoulli process
For any and
with , so by Prop. 2.7.2, the are mut. stat. ind.
E.g. Gaussian white noise process
Since for any , and
for with
the are mut. stat. ind. by Prop. 2.7.2
Def. Principal Components
Suppose where (spectral decomp), then , so
with
Thus, the principal components are mut. stat. ind
Lecture 12
Suppose is a random vector with prob. measure , and is observed
We want the conditional distribution of given
Conditional Dist - Discrete
Suppose has a discrete distribution with probability function
When $p_X(x) > 0$, the conditional probability function of $Y$ given $X = x$ is $p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$. When $p_X(x) = 0$, the conditional probability function is undefined.
E.g. Conditioning the Multinomial
Suppose , we want to find the conditional probability function of or equivalently
So, for
Therefore, multinomial
Ex. If and for some , then determine the conditional distribution of given
The sum of multinomial r.v.'s binomial .
multinomial multinomial
Conditional Dist - A.C.
Suppose has a.c. distribution with density function and is smooth.If , then the conditional density function of given is
where (now allowing to be many to one)
E.g. Projections
If then
Ex. Repeat the above when .
Since ,
E.g. Projection conditionals of the
Suppose and for
Partition and as
Ex. Prove that and are p.d. when is p.d.
is p.d.
is p.d.
In order to obtain the distribution of , we need another matrix decomposition:
Def. Gram-Schmidt (QR) decomposition
Let $A = [a_1 \cdots a_k]$ be a $k \times k$ matrix of rank $k$ (so it is nonsingular/invertible) whose columns form a basis for $\mathbb{R}^k$, i.e. $a_1, \ldots, a_k$ are linearly independent and span $\mathbb{R}^k$ (the linear span is the set of all linear combinations $c_1 a_1 + \cdots + c_k a_k$)
Applying the Gram-Schmidt process to , we obtain an orthonormal basis for
$Q = [q_1 \cdots q_k]$ is an orthogonal matrix, and $R$ is a unique upper triangular matrix with positive diagonals
So $A$ can be decomposed into $Q$ and $R$: $A = QR$
Def. Orthogonal Matrix
A matrix $Q$ is orthogonal iff $Q^T Q = Q Q^T = I$
Properties:
$Q^T = Q^{-1}$, and they are both orthogonal
det(Q) = 1 or -1
the product of orthogonal matrices is orthogonal
Ex. For , prove that is unique given
Suppose distinct that are upper triangular with positive diagonals. I.e.
Since , we can rearrange the above to get , which contradicts the hypothesis.
Def. Cholesky Decomposition
The Cholesky decomposition $\Sigma = R^T R$ of positive definite $\Sigma$ is obtained by applying the QR decomposition to a matrix $A$ satisfying $\Sigma = A^T A$ (e.g. $A = \Sigma^{1/2}$): $A = QR$ gives $\Sigma = R^T Q^T Q R = R^T R$.
Ex. Prove the following properties of upper triangular matrices with positive diagonals (like ), for 2x2 matrices.
The product of 2 upper triangular matrices with positive diagonals is upper triangular with positive diagonals.
is upper triangular with positive diagonals
An upper triangular matrix with positive diagonals is nonsingular, and its inverse is upper triangular with positive diagonals that are equal to the inverse of the diagonal elements of the original matrix.
$R$ is nonsingular/invertible since its determinant, the product of its diagonal entries, is $> 0$
The matrix R in the Cholesky decomposition is unique.
Suppose where are upper triangular matrices with positive diagonals.
Then dividing LHS and RHS by the middle, we get (A is just an arbitrary letter)
So must hold:
But is lower triangular and is upper triangular.
This means they are both diagonal matrices, so A is diagonal as well.
So
Challenge: Generalize (i), (ii) and (iii) to upper triangular matrices.
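The Cholesky factorization can be checked numerically (a sketch with an arbitrarily chosen p.d. $\Sigma$; note numpy returns the lower factor, whose transpose is the upper triangular $R$ used here):

```python
import numpy as np

# Cholesky decomposition Sigma = R^T R with R upper triangular with positive
# diagonal; numpy's cholesky returns the lower factor L = R^T.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
L = np.linalg.cholesky(Sigma)   # lower triangular
R = L.T                         # upper triangular with positive diagonal

factor_ok = np.allclose(R.T @ R, Sigma)
upper_ok = np.allclose(R, np.triu(R))
pos_diag = bool(np.all(np.diag(R) > 0))
```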
Prop 2.8.1 (Marginal Dist. of Normal )
If where , then .
Proof Applying the Cholesky Decomp to , we get
Now, let , and using the fact that are mut. stat. ind., we can partition s.t.
Ex. If denotes an identity matrix, then use a matrix of the form to determine the distribution of in Proposition 2.8.1.
In general, for a permutation matrix , each row and column only contains a single 1, and the remaining entries are 0. Use such a matrix to determine the marginal distribution of any sub vector of .
Prop 2.8.2 (Marginal Dist. of Y is ind. of )
If where , then and
which is stat. ind. of
Proof
where
This proves the first part. Now observe in general, if , then
Since the density factors, and are statistically independent and this proves the second part.
Ex. Prove the second line of above proof: where
Suppose is a sub-vector of .
Let be the matrix with the basis vector in the first rows (), and the last rows contain the remaining basis vectors in any order.
where
where
Thus, where
Corollary 2.8.2 (Conditional Dist of )
Proof:
Make the transformation which has
By the change of variable,
So, by conditioning on projections,
where
Def. Monte Carlo Estimation
Suppose we want to compute $P_X(A) = P(X \in A)$. Sometimes this can be computed exactly but typically we need to resort to Monte Carlo simulation and estimate $P_X(A)$.
Suppose then we have an algorithm that allows us to generate $X_1, \ldots, X_n \sim P_X$ independently. Since $E(I_A(X)) = P_X(A)$, we can estimate $P_X(A)$ by computing the proportion of sampled values falling in A: $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} I_A(X_i)$, which has standard error $\sqrt{\hat{p}(1 - \hat{p})/n}$
So the interval $\hat{p} \pm 3\sqrt{\hat{p}(1 - \hat{p})/n}$ contains the value $P_X(A)$ with virtual certainty, provided $n$ is large enough
R practice
(a) Using the R software compute the spectral decomposition $\Sigma = Q \Lambda Q^T$ (command eigen). Verify numerically that $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$ (up to small rounding errors).
e=eigen(Sigma)                          # returns eigenvalues & eigenvectors of Sigma
Q=e$vectors
Lambda=diag(e$values)
# matrix mult operator: %*%
Sigmasqrt=Q %*% sqrt(Lambda) %*% t(Q)   # t(Q) is Q-transpose
Sigmasqrt %*% Sigmasqrt                 # should return Sigma
(b) Using the R software compute the Cholesky factor $R$ (command chol). Verify numerically that $R^T R = \Sigma$ (up to small rounding errors).
R=chol(Sigma)
t(R) %*% R                              # should return Sigma
Q=Sigmasqrt %*% solve(R)                # solve(R) is R-inverse
t(R) %*% t(Q) %*% Q %*% R               # should also return Sigma
Suppose $X \sim N_3(\mu, \Sigma)$.
(a) Using the software and the representation $X = \mu + \Sigma^{1/2} Z$, where $Z \sim N_3(0, I)$, generate a sample of $n = 1000$ from the $N_3(\mu, \Sigma)$ distribution and based on this sample estimate $P(\|X\| \le 10)$ and provide the interval containing the exact value with virtual certainty.
mu=array(c(0, 1, 2), dim=c(1, 3))
one=array(1+0*(1:1000), dim=c(1000, 1))     # create a column vec of 1000 1's
Mu=one %*% mu                               # create a 1000x3 matrix (each row = mu)
sample=rnorm(3000)                          # 3000 iid N(0,1) values to fill the Z's
samplevec=Mu+array(sample, dim=c(1000, 3)) %*% Sigmasqrt  # 1000 rows of 3-dim sample vectors from the N_3(mu, Sigma) distribution
one=array(c(1, 1, 1), dim=c(3, 1))          # create a column vec of 3 1's
length=sqrt((samplevec*samplevec) %*% one)  # square each element in samplevec, sum the squares, and take square root
count=0                                     # count how many lengths are <= 10
for (i in 1:1000) {
  if (length[i] <= 10) {
    count=count+1
  }
}
prop=count/1000                             # get estimated prob
error=sqrt(prop*(1-prop)/1000)
low=prop-3*error
high=prop+3*error
(b) Using the software and the representation $X = \mu + R^T Z$, where $Z \sim N_3(0, I)$ and $R$ is the Cholesky factor, generate a sample of $n = 1000$ from the $N_3(\mu, \Sigma)$ distribution and based on this sample estimate $P(\|X\| \le 10)$ and provide the interval containing the exact value with virtual certainty.
# Using the Cholesky factor R, samplevec is now the following (everything else the same)
samplevec=Mu+array(sample, dim=c(1000, 3)) %*% R  # create 1000 rows of 3-dim sample vectors from the N_3(mu, Sigma) distribution
(c) Compare the two estimates.
Results are very similar. Part (a) has error ≈ 0.0146, part (b) has error ≈ 0.0149.
Ex. Suppose
(a) Determine the conditional distribution .
(b) Using the conditional distribution in (a) compute the conditional probability of .
where
(c) Estimate the unconditional probability of .
It is difficult to evaluate directly so we proceed via Monte Carlo:
X=mvrnorm(1000, mu, Sigma)   # generates a sample of n=1000 from N_2(mu, Sigma) (mvrnorm is in the MASS package)
y=X[, 1]**2+X[, 2]**2
mean(y <= 5)                 # returns the estimate; can increase n for higher accuracy
3. Expectation
Lecture 13
Properties of Indicator Functions
Recall the definition of indicator functions: for $A \in \mathcal{A}$, $I_A(\omega) = 1$ if $\omega \in A$, and $I_A(\omega) = 0$ otherwise
Def. Simple Function
If $a_1, \ldots, a_n \in \mathbb{R}$ and $A_1, \ldots, A_n \in \mathcal{A}$, a function given by $X = \sum_{i=1}^{n} a_i I_{A_i}$ is a simple function.
A simple function must be a r.v. that takes only finitely many values.
Note: If $X, Y$ are simple functions, then so are their product $XY$ and linear combination $aX + bY$ for any constants $a, b$, since each is also a r.v. that takes finitely many values
Ex. Prove that any r.v. that takes only finitely many values is a simple function.
Suppose $X$ is a r.v. that takes only finitely many values $x_1, \ldots, x_n$.
Since $\{x_i\} \in \mathcal{B}$ and $X$ is a r.v., $A_i = X^{-1}(\{x_i\}) \in \mathcal{A}$.
Thus $X = \sum_{i=1}^{n} x_i I_{A_i}$, which takes the form of a simple function.
Def. Canonical Form
In canonical form, $X = \sum_{i=1}^{n} a_i I_{A_i}$ where $\bigcup_{i=1}^{n} A_i = \Omega$, and $a_1 < \cdots < a_n$ are the distinct values taken by simple function $X$, so $A_i = X^{-1}(\{a_i\})$, and $A_i \cap A_j = \emptyset$ when $i \neq j$, i.e. the $A_i$ are mutually disjoint.
Note
Proof is a r.v. with a discrete distribution given by
If are iid. , then as ,
Def. Expectation of a Simple Function
For a simple function $X = \sum_{i=1}^{n} a_i I_{A_i}$ (in canonical form) the expectation of $X$ is defined by $E(X) = \sum_{i=1}^{n} a_i P(A_i)$
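The definition can be sketched on a small uniform space (the values and events below are arbitrary choices for illustration):

```python
from fractions import Fraction

# Expectation of a simple function X = sum_i a_i I_{A_i}: E(X) = sum_i a_i P(A_i),
# computed for the uniform measure on a 6-point space.
omega = frozenset(range(6))
def P(E): return Fraction(len(frozenset(E) & omega), len(omega))

# X takes value 1 on {0,1,2}, value 4 on {3,4}, and value 9 on {5}
parts = [(1, {0, 1, 2}), (4, {3, 4}), (9, {5})]
EX = sum(a * P(A) for a, A in parts)   # 1*(3/6) + 4*(2/6) + 9*(1/6) = 10/3
```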
Prop 3.1.1 (Expectation Properties)
If are simple functions, then
(i) $E(aY + bZ) = aE(Y) + bE(Z)$ for any constants $a, b$
Proof Suppose $Y = \sum_i y_i I_{A_i}$ and $Z = \sum_j z_j I_{B_j}$ are in canonical form.
Then $aY + bZ = \sum_{i,j} (a y_i + b z_j) I_{A_i \cap B_j}$
So $aY + bZ$ is a simple function, and by definition we have $E(aY + bZ) = \sum_{i,j} (a y_i + b z_j) P(A_i \cap B_j) = a \sum_i y_i P(A_i) + b \sum_j z_j P(B_j) = aE(Y) + bE(Z)$
(ii) if $Y \le Z$, then $E(Y) \le E(Z)$
Proof Since $Z - Y$ is a nonnegative simple function, the distinct values taken are nonnegative, so $E(Z - Y) \ge 0$.
By (i), this implies that $E(Z) = E(Y) + E(Z - Y) \ge E(Y)$
(iii) if $P(Y = Z) = 1$, then $E(Y) = E(Z)$
Proof Suppose $Y = \sum_i a_i I_{A_i}$ and $Z = \sum_j b_j I_{B_j}$ are in canonical form.
Note that if $P(A_i) = 0$, then $a_i P(A_i) = 0$, and similarly if $P(B_j) = 0$, then $b_j P(B_j) = 0$, i.e. sets with probability 0 do not change the sum.
So assume that $P(A_i) > 0$ and $P(B_j) > 0$ for all $i, j$.
Then for each $i$ there exists $j$ (and conversely) such that $P(A_i \cap B_j) > 0$, and $a_i$ and $b_j$ satisfy $a_i = b_j$. This implies $E(Y) = E(Z)$.
Motivation (for definition of expectation of a general r.v. X)
Now we want to extend the definition of expectation to as many r.v.'s as possible (not just simple functions). Suppose $X$ is a nonnegative r.v., and for $n = 1, 2, \ldots$, define a nonnegative simple function
$X_n = \sum_{i=1}^{n 2^n} \frac{i - 1}{2^n} I_{A_{n,i}} + n I_{X^{-1}([n, \infty))}$ where $A_{n,i} = X^{-1}\left(\left[\frac{i-1}{2^n}, \frac{i}{2^n}\right)\right)$
Since $X_n$ is defined via the lower bound of each interval, this ensures that $X_n \le X$
Suppose $X(\omega) < \infty$; then $X_n(\omega) \to X(\omega)$. Since $X_n$ is an increasing sequence, $X_n(\omega) \le X(\omega)$ for all $n$
Since $E(X_n)$ is an increasing sequence, $\lim_n E(X_n)$ exists (possibly $= \infty$), and we define $E(X) = \lim_n E(X_n)$
Suppose $X$ is a r.v. and define $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$,
so $X = X^+ - X^-$ and $|X| = X^+ + X^-$
For any Borel set $B$, $X^+$ and $X^-$ are non-negative r.v.'s, since e.g. $(X^+)^{-1}((-\infty, b]) = X^{-1}((-\infty, b]) \in \mathcal{A}$ for $b \ge 0$, and $= \emptyset$ for $b < 0$
Def. Expectation (as a Sum of Positive r.v.'s)
For a r.v. $X$, the expectation of $X$ is $E(X) = E(X^+) - E(X^-)$, provided at least one of $E(X^+)$, $E(X^-)$ is finite; otherwise $E(X)$ is not defined. Note that $X^+(\omega)$ and $X^-(\omega)$ cannot simultaneously be $> 0$, i.e. at each $\omega$, $X$ is either equal to $X^+$ or to $-X^-$.
Lecture 14
Lemma 3.1.2
Suppose are nonnegative r.v.'s,
(i) if , then
(ii) if , then
Proof Choose non-negative simple .
(i) Since is a nonnegative simple function satisfying ,
(ii) Since , we have . is a simple function satisfying . Therefore , and the result follows since
Lemma 3.1.3
If $Y, Z$ are nonnegative r.v.'s with $E(Y), E(Z)$ finite, then $E(Y + Z) = E(Y) + E(Z)$.
Proof Let
If , then by first line of above.If is added instead, then which implies
Similarly, if , then by first line of above. If is added instead, then which implies .
Therefore, and
Prop 3.1.4 (Linearity of Expectations)
If $Y, Z$ are r.v.'s and $E(Y), E(Z)$ are finite, then $E(Y + Z) = E(Y) + E(Z)$
Proof We can decompose Y and Z and express them as a difference of 2 non-negative r.v.'s as in Lemma 3.1.3
Def. St Petersburg Paradox
The following illustrates a case where expectation is infinite. Suppose a coin is flipped until H comes up; if H first comes up on the ith flip, you win $2^i$
Define a nonnegative r.v. by $X = 2^i$ when the first head occurs on flip $i$, so $P(X = 2^i) = 2^{-i}$ and $P(X = x) = 0$ whenever $x$ is not a positive integer power of 2
Suppose the flip count $N$ is discrete with $P(N = i) = 2^{-i}$, i.e. $N \sim$ geometric(1/2)
Let $X_n = \sum_{i=1}^{n} 2^i I_{\{N = i\}}$; this is a nonnegative simple function, and $E(X_n) = \sum_{i=1}^{n} 2^i 2^{-i} = n$
So the fair price to pay for the gamble is $E(X) = \lim_n E(X_n) = \infty$
The paradox: if we took the winnings to be $(-2)^i$ instead, then $E(X^+) = E(X^-) = \infty$, but $E(X)$ is undefined
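The divergence can be sketched numerically: each term of the expectation sum contributes exactly 1, so the partial sums grow without bound.

```python
# St Petersburg sketch: payoff 2^i with probability 2^(-i) makes every term of
# E(X) = sum_i 2^i * 2^(-i) equal to 1, so partial sums grow without bound.
def partial_expectation(n):
    return sum((2 ** i) * (0.5 ** i) for i in range(1, n + 1))

vals = [partial_expectation(n) for n in (10, 100, 1000)]
```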
Prop 3.1.5 (Expectation of |X|)
(i) $|E(X)| \le E(|X|)$, whenever $E(X)$ is defined
Proof This follows from Lemma 3.1.2(i) since .
(ii) If with defined expectation, then .
Proof
If , or if , this is trivially true. So assume neither is true, i.e. and .
and equating the 2 braces we have
So
so equating the braces gives
So
Thus are both finite. Applying Prop. 3.1.4 to , we obtain which proves the result.
(iii) If , then .If , then .
Proof
1st lineAssume and choose nonnegative simple . Then since , we have that .
In general, , so , which implies that And similarly, , so and we have that
2nd lineIf , then
and similarly,
Lecture 15
Def. Converge with Probability 1
The sequence of r.v.'s converges with probability 1 to r.v. if
Note We can assign a probability measure to the set since it is a sigma algebra:
E.g. Define where is the uniform distribution on , so . Let , so . Then Since , we have that
E.g. Let then Since , we have that In fact, we could change at every rational to obtain . Since , we still have
Def. Converge Almost Surely
A measure defined on is a function that satisfies the following:
whenever are mutually disjoint
Suppose , i.e. for every
Then we can define a kind of average of with respect to , , called the integral of with respect to
The expectation of r.v. can thus be written as , the integral of with respect to
If is the counting measure on , countable, then
If is the volume measure on , then
If is a sequence of such functions and , then we say the sequence converges almost surely to and write
Note convergence almost surely (with respect to the probability measure) is the same as convergence with probability 1
Prop 3.2.1 (MCT & DCT)
Suppose
(i) Monotone Convergence (MCT)
If , then
(ii) Dominated Convergence (DCT)
If there exists such that and for all , then
Corollary (Applied to Expectations)
Suppose
(i) If , then .
(ii) If there exists r.v. such that and for all , then .
E.g. Suppose has . Let , so and . Then by DCT,
Prop 3.3.2 (Expectation of Compositions)
If is a r.v. with respect to , , and , then
(i) is a r.v. with respect to
Proof Let . Then since and is a r.v.
(ii) , if it exists.
Proof Steps: simple h -> non-negative h -> general h
If is a simple function, then , so
If , , then there exists a sequence of non-negative simple functions .
General case (h not necessarily ): . Applying the above to both parts gives the result.
Prop 3.3.3 (Expectation Formulas)
Suppose is a r.v. with respect to , and exists.
(i) If is discrete with probability mass function , then .
(ii) If is a.c. with probability density function , then .
Proof Suppose is a simple function in canonical form. Then
This proves the result for simple . If and nonnegative simple , then (i) (ii) and the result follows by MCT. For general , the result follows via the decomposition .
E.g. If , then with we have
With , we have
Def. Moments
The -th moment of a r.v. is given by when it exists. When the first moment exists, the -th central moment of a r.v. is given by .
The 1st moment of is its mean: , so its first central moment is 0.The 2nd (central) moment of is its variance: when exists.The 3rd moment is the skewness, and the 4th moment is the kurtosis.
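As a quick illustration, moments and central moments of a simple discrete distribution can be computed directly from its probability mass function (here a fair die; all names are illustrative):

```python
# k-th moment E(X^k) and k-th central moment E((X - mu)^k) for a discrete
# distribution given by its pmf; X uniform on {1,...,6} is an illustrative choice.
pmf = {x: 1 / 6 for x in range(1, 7)}

def moment(pmf, k):
    return sum((x ** k) * p for x, p in pmf.items())

def central_moment(pmf, k):
    mu = moment(pmf, 1)
    return sum(((x - mu) ** k) * p for x, p in pmf.items())

mean = moment(pmf, 1)                   # 3.5
var = central_moment(pmf, 2)            # 35/12
first_central = central_moment(pmf, 1)  # 0, by definition of the mean
```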
Prop 3.3.4 (Finite Moment Property)
If is finite, then is finite for all , i.e. the previous moments must all be finite.
Proof is finite is finite. Let , then
Ex. When compute and
Since ,
Ex. When Standard Cauchy, i.e. has density for , show that doesn't exist.
Cauchy dist has longer tails.
Ex. Let X ∼ Geometric(θ), and let Y = min(X, 100).
(a) Compute E(Y).
(b) Compute E(Y − X).
Ex. Geometric & Negative Binomial
E&R 3.1.22 For X ∼ Negative-Binomial(r, θ), prove that E(X) = r(1 − θ)/θ. (Hint: Argue that if are independent and identically distributed Geometric(θ), then ∼ Negative-Binomial(r, θ).)
E&R 3.3.18 Prove that the variance of the Geometric(θ) distribution is given by . Hint:
E&R 3.3.19 Prove that the variance of the Negative-Binomial(r, θ) distribution is given by .
Ex. Gamma
E&R 3.2.16 Let α > 0 and λ > 0, and let X ∼ Gamma(α, λ). Prove that E(X) = α/λ.
E&R 3.3.20 Let α > 0 and λ > 0, and let X ∼ Gamma(α, λ). Prove that Var(X) = .
Ex. Beta
E&R 3.2.22 Suppose that X follows the Beta(a, b) distribution. Prove that E(X) = a/(a + b).
E&R 3.3.24 Suppose that X ∼ Beta(a, b). Prove that
E.g. (Monte Carlo Approximations)
Suppose for some and we want to compute
This can be very difficult unless is easy to work with
But if we can generate then A very natural estimator of is thus , which converges to as
How accurate is this estimate for some specific ?
The Central Limit Theorem says that for large , , provided
can be estimated by
Indeed it can be estimated, since
If then
We can say that the true value of lies in the interval with "virtual certainty". And if the interval turns out to be short, then the estimate is accurate.
Note Let , so the relative frequency of in . Since , This is the same estimation procedure as previously discussed for estimating
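The procedure above can be sketched as follows, estimating E(g(U)) for g(u) = u² with U ~ Uniform(0,1), whose true value is 1/3 (the choice of g and all names are illustrative):

```python
import random, math

# Monte Carlo estimate of E(g(U)) with a "3 sigma" virtual-certainty interval.
rng = random.Random(1)
n = 100_000
xs = [rng.random() ** 2 for _ in range(n)]            # g(U_i) = U_i**2
m = sum(xs) / n                                       # point estimate of E(g(U))
s2 = sum((x - m) ** 2 for x in xs) / (n - 1)          # sample variance of g(U)
half = 3 * math.sqrt(s2 / n)                          # 3-standard-error half-width
interval = (m - half, m + half)                       # holds with "virtual certainty"
```

Here the interval is short (half-width well under 0.01), so the estimate of 1/3 is accurate, matching the accuracy discussion above.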
Lecture 16
Def. Mean Vector
For random vector , the mean vector of is , provided each exists.
Note For a matrix of r.v.'s (called random matrix) , its expected value is defined to be when each exists, and when each is finite.
Def. Variance Matrix
If each is finite (so then the variance matrix of is given by
provided each for exists. In vector form:
The off-diagonal entries The diagonal entries So
If is finite for every and , then and is symmetric, i.e.
Ex. When is a r.v., prove that implies .
When and are r.v.'s and , prove that is finite.
Also prove that if for all , then .
Ex. When r.v.'s and satisfy , prove that . Extend this result to random vectors to show that
Prop 3.4.1 (Affine Transformations' Mean & Variance)
Suppose is a random vector and where are constant.
(i) If is a r.v. and , then , so has a probability distribution degenerate at a constant (= ).
Proof (Repeat the below for simple functions -> non-negative functions -> general functions)
iff .
(ii) If is a random vector, , and is constant, then . Thus, any variance matrix is positive semidefinite (p.s.d.)
Proof Consider r.v. .
Then by (ii) of the previous proposition, since a variance is always nonnegative.
(iii) If for some , then the probability distribution of is concentrated on the affine plane , where L stands for linear span.
Proof Consider . Suppose , then Var(X) = 0, so by (i) and (ii),
Notes
If is p.s.d., then the spectral decomposition gives , where is orthogonal and with
If , then (letting gives the ith principal component)
So . If , then iff
Therefore, if and , then iff
If , then the above implies
Ex. Prove that, if is a random matrix such that each is finite and are constant matrices, then
E.g.
consider which has density
so which implies
when ,
so
in general, if , then standardize
so with
Ex.Suppose Determine
Ex.Suppose multinomial Determine and .
Ex.The correlation between r.v.'s and is defined by , where is the standard deviation of .
(i) What has to hold for to exist and provide sufficient conditions?
(ii) Prove that for constants then , provided . What happens when ? What happens when and when ?
(iii) Suppose . What is ?
(iv) Suppose and . Determine .
(v) Suppose and . Determine . Are and independent?
Recall Two collections of r.v.'s where are statistically independent if for any finite subsets , the joint cdf satisfies
The Extension Thm then implies for any
Prop 3.5.1 ( E(g h) = E(g)E(h) )
If and are statistically independent random vectors and , then and are statistically independent. So if and , then
Proof and are statistically independent since the following holds for every and .
Suppose are simple functions. Then is also simple, and as required.
The result then follows by proceeding to nonnegative by limits and then to general .
Corollary 3.5.2 (Covariance of Ind. Functions = 0)
Ex. For random vectors and define , provided all the relevant expectations exist.
(i) Give conditions under which .
(ii) Assuming and are constant, determine
(iii) Assuming and and are statistically independent, determine .
Ex. For random vector with , the correlation matrix is defined by where
(i) Show that the -th element of is .
(ii) Suppose where with for Show
(iii) Suppose in (ii) that is not diagonal with positive diagonal, is it true that ?
Lecture 17
Def. Functions of a Stochastic Process
Suppose is a stochastic process such that for all .
Then the mean function is defined as
The autocovariance function is defined as , provided these expectations exist.
The autocorrelation function is defined as , provided
E.g. (iid process)
the r.v.'s are mutually statistically independent and all have and
so
Def. Gaussian process
Recall the r.v.'s are such that for any ,
A Gaussian process is completely specified by the mean and autocovariance functions
To define a Gaussian process, we specify and s.t. for any , the variance matrix is symmetric and positive semidefinite:
i.e., is a valid autocovariance function whenever , and for any ,
Def. Weakly Stationary Process
For , a process with mean function and autocovariance function is called weakly stationary if is constant in and for some .
Note is a positive semidefinite function (positive definite when corresponding matrices are p.d.), i.e. must satisfy , and for all ,
There are theorems concerning such , for example, where is positive definite.
Def. Random Walk
If the r.v.'s are i.i.d. with mean and variance defined, then the process defined by is called a random walk
A simple random walk arises when so
For a Bernoulli(p) process, , so the random walk has the following functions:
Clearly, a random walk is not weakly stationary (neither nor satisfy the definition for stationarity)
If are i.i.d. , then it is a Gaussian random walk:
Since , we have:
, so the finite joint distributions of are defined consistently. By KCT, this defines a s.p. and it is a Gaussian process
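A small simulation can check the random-walk autocovariance Cov(X_s, X_t) = min(s, t)σ² in the standard-normal case (σ² = 1); the indices s = 3, t = 7 are illustrative:

```python
import random

# Empirical check that the Gaussian random walk X_t = Z_1 + ... + Z_t with
# Z_i i.i.d. N(0, 1) has Cov(X_s, X_t) = min(s, t).
rng = random.Random(2)
reps, s_idx, t_idx = 20_000, 3, 7
xs, xt = [], []
for _ in range(reps):
    total = 0.0
    for t in range(1, t_idx + 1):
        total += rng.gauss(0.0, 1.0)
        if t == s_idx:
            xs.append(total)   # X_3 for this replication
    xt.append(total)           # X_7 for this replication
mx = sum(xs) / reps
mt = sum(xt) / reps
cov = sum((a - mx) * (b - mt) for a, b in zip(xs, xt)) / (reps - 1)
# theory: cov is close to min(3, 7) = 3
```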
E.g. Weakly Stationary Gaussian Process
Suppose the r.v.'s are i.i.d. , and are defined by for some
So the autocovariance matrix will only have 3 diagonal bands that are non-zero — variance on the diagonal, covariance on the off diagonals (when s = t-1, s = t, s = t+1).
Since is multivariate normal, it is consistent and by KCT, is a Gaussian process
Since is dependent on values of s vs t, where
it is a weakly stationary Gaussian process
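A simulation sketch of such a process, writing it as X_t = Z_t + θ Z_{t−1} with Z_t i.i.d. N(0, 1) — a moving-average form consistent with the 3-band autocovariance described above; θ = 0.6 is an illustrative choice:

```python
import random

# Empirical autocovariances of X_t = Z_t + theta * Z_{t-1}; only lags 0 and 1
# should be nonzero: g0 = 1 + theta**2, g1 = theta, g_h = 0 for |h| >= 2.
rng = random.Random(3)
theta, n = 0.6, 200_000
z = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
x = [z[t] + theta * z[t - 1] for t in range(1, n + 1)]
mean = sum(x) / n

def acov(h):
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / (n - h)

g0, g1, g2 = acov(0), acov(1), acov(2)
# theory: g0 ~ 1.36, g1 ~ 0.6, g2 ~ 0 -- the "3 diagonal bands"
```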
Ex. If r.v.'s all have finite second moments, then for constants prove that
Ex. If r.v.'s all have finite second moments then for constants prove that
Specialize this result to the case where are mutually statistically independent.
Ex. In the 2 previous exercises, determine the joint distribution of in the Gaussian case.
Lecture 18
Def. Markov's Inequality
If is a nonnegative r.v. and , then
iff .
Proof (inequality)
Proof (equality)
() If , then is concentrated on the points , so .
() If at , then
Since and are both non-negative r.v.'s,
It follows that
holds since when , so it doesn't matter what evaluates to. We can thus exclude from .
Hence we have
Ex. If is a r.v., then determine an upper bound for when .
Might need to add t > 0
Ex. If is a r.v. and , then prove and also If exponential which inequality is sharper? Find the exact value of when exponential and compare this with the bounds.
Def. Chebyshev's Inequality
If has mean and variance , then for ,
iff
Proof Since is non-negative we can apply Markov and obtain
and the equality result follows as with Markov.
E.g. 5 sigma and if , then
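A quick empirical check of Chebyshev's bound, using an Exponential(1) sample where μ = σ = 1 (the distribution and sample size are illustrative choices); the empirical tail sits well below 1/k², since Chebyshev is typically far from tight:

```python
import random, math

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k**2, checked for Exponential(1).
rng = random.Random(4)
n, k = 100_000, 3
sample = [rng.expovariate(1.0) for _ in range(n)]
tail = sum(1 for x in sample if abs(x - 1.0) >= k * 1.0) / n
bound = 1 / k ** 2            # = 1/9 ~ 0.111
# the exact tail is P(X >= 4) = e**-4 ~ 0.018, far below the bound
```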
Def. Cauchy-Schwarz Inequality
Recall from linear algebra:
Think of a set of r.v.'s as a vector space. Restrict it to a set that has second moments and it's a linear space.
If , then
iff wp1
where
Proof If , then .
Now assume . For any , , which is a convex parabola in with minimum at
So
Equality occurs iff when (which minimizes the parabola)
This occurs iff
Def. Correlation Inequality
If , then
iff
Note Correlation only measures linear (affine) relation between X & Y. (How much variation in Y does X explain?) Note that correlation = 0 does not imply independence. The correlation for and can be 0 even though they are not independent.
Proof In the Cauchy-Schwarz inequality, standardize , i.e. replace by and by
So
By C.S., we have
By C.S., the equality holds iff where
Rearranging the above, we get where , so the result follows.
It is in the form where and
Note a measure of the total variation in is given by
Def. Best Affine Predictor
If we approximate by for some constants and , then the amount of variation in that is not explained (the residual variation) by is
The best affine predictor (linear regression) of from is given by , where are constants that minimize
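A sketch of the best affine predictor computed from sample moments, b = Cov(X, Y)/Var(X) and a = E(Y) − bE(X), on simulated data with a known affine relation (all parameter values are illustrative):

```python
import random

# Best affine predictor Y ~ a + b*X from sample moments; true relation is
# Y = 2 + 0.5*X + noise, with residual variance Var(Y)*(1 - rho**2) = 1.
rng = random.Random(5)
n = 100_000
data = []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    y = 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)
    data.append((x, y))
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
vx = sum((x - mx) ** 2 for x, _ in data) / n
cxy = sum((x - mx) * (y - my) for x, y in data) / n
b = cxy / vx                  # slope of the best affine predictor
a = my - b * mx               # intercept
resid = sum((y - (a + b * x)) ** 2 for x, y in data) / n
# resid should be close to Var(Y) - Cov(X,Y)**2/Var(X), i.e. ~1 here
```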
Ex. Assume . Show that if minimize , then with minimizes over all constants .
Ex.
(i) Assume and . For all constants , and , prove and
Use this to prove that is the best affine predictor of from .
(ii) Combine (i) and the previous exercise to determine the best affine predictor of from when the assumption of 0 means is not made.
(iii) Show that the proportion of the total variation in explained by the best affine predictor from is given by .
(iv) When , show that equals the best affine predictor of from .
Lecture 19
Def. Convexity
is a convex set if whenever and , then .
The line segment joining and is
If is convex, then is a convex function and for every . If LHS RHS, then f is a concave function. If is convex then is concave
If is defined on open convex set , then is convex whenever the Hessian matrix is positive semidefinite for every
Ex.Convexity proofs
(i) Prove the line segment is convex. (ii) Prove is convex. What about ? (iii) Prove is convex. (iv) Prove is convex (hint: use ). (v) Prove that the affine function given by for constants is convex on . (vi) Prove that is convex on . (vii) If is positive semidefinite, then prove is convex on .
Prop 3.7.5 (Supporting Hyperplane Thm)
If is convex and is not an interior point of (there isn't a ball with ), then there exists such that for every , .
For a set there is always a set for some , and s.t.
E.g. take so any would be in
E.g. the hyperplane in given by with
A set of the form is called an affine subset of and it has a dimension (a point has dimension 0, a line has dimension 1, and a hyperplane has dimension one less than that of the ambient space)
Ex. Suppose are convex. Prove that is convex.
Ex. Suppose is convex and let . Prove that is convex.
Ex. If is a linear subspace of , then is convex.
Def. Affine Dimension
If , the affine dimension of is the smallest dimension of an affine set containing . For example, a squiggly line has affine dimension = 2.
Prop 3.7.7 (Expectation is in Convex Set)
If is convex with and , then .
Proof (Induction on the affine dimension of )
If the affine dimension of is 0 (probability concentrated at the point ), then and and the result holds.
Assume wlog (without loss of generality) that , o/w put and . Note that is convex, , and iff .
Now assume the result holds for affine . Suppose , then by the Supporting Hyperplane Thm, there exists s.t. for every . This implies , i.e. is a nonnegative r.v.
By hypothesis, . Therefore, , and is a convex set w/ affine dimension . So by the inductive hypothesis, which implies , and we have a contradiction.
Def. Jensen's Inequality
If is convex, , and is convex, then
Equality is obtained iff for constants
E.g. Jensen's Inequality
If with , , then is convex, and
Suppose is convex, then for this simple context Jensen's inequality is immediate:
Geometrically consider the line segment in
Convexity of on the line segment implies the line segment lies above the graph
and gives
Proof (Induction on the affine dimension of .) If the affine dimension is 0, then and and so the result holds.
Now assume the result holds for affine . Let . Note that is convex, and is a boundary point of (not an interior point). Then by Supporting Hyperplane Thm, there exists s.t. for every
If , then the inequality can be violated by taking large, so must hold, and we have 2 cases:
Case 1
Let
, so
Since iff , which occurs iff , we have that , so rearrange the above:
which is of the required form
Case 2
Then
is a convex set of affine , so by the inductive hypothesis the result holds.
Note If is concave and , then the concave version of Jensen says
Def. Kullback-Leibler Distance
serves as a measure of distance between probability measures
suppose are probability measures on with probability (density) functions and respectively
the Kullback-Leibler distance between and is then defined to be
where is the counting (discrete case) or volume measure (a.c. case)
Prop 3.7.9 (KL Distance >= 0)
If are probability measures on with probability (density) functions and respectively, then with equality iff .
Proof Since is convex on , applying Jensen gives
Equality holds iff there exist such that
which holds when so .
Now and agree at and at most at one other point (draw the graphs), which implies
Sub-proof Suppose so they differ at least at two 's, define , so for some real number , which implies and
This means either or , and both cases contradict with positive probability.
Ex. Suppose is the probability measure and is the probability measure. Compute .
Ex. Does
Lecture 20
Conditional Expectation - Discrete Case
Consider a random vector and a r.v. where
The prob function for the joint distribution of is
The prob function for the conditional distribution of (i.e. the probability measure ) is
when (otherwise cond. dist. not defined)
The conditional expectation of given is given by
When , the conditional expectation is also finite, since
We want to think of and then define by
Prop 3.8.1 E[h(X)Y] = E[h(X)E(Y|X)]
If is s.t. , then
Proof
Corol 3.8.1 E[h(X)Y|X] = h(X)E(Y|X)
Applying prop 3.8.1 to conditional expectations, we have
Corol 3.8.2 (Theorem of Total Expectation)
for random vector where
Proof Let , for .
Then
Corol 3.8.3 (Theorem of Total Probability)
for
Corol 3.8.4 V(Y) = E[V(Y|X)] + V[E(Y|X)]
if
Proof:
Expand the inner expectation, get
and applying to both sides gives the result:
Corol 3.8.5 (Best Predictor & Residual Error)
The random variable is the best predictor of from in the sense that it minimizes among all , and smallest residual error is .
Proof
and so
with equality when .
Notes
In general, if r.v. satisfies , then is defined as the r.v. that satisfies for every such that
This can be generalized to define , the cond. expectation of given the process
Conditional Expectation - Continuous Case
If has density and , then
E.g.
Suppose , is p.d.
Then
So , and this minimizes among all
Def. Martingales
Consider a game of coin tossing where a gambler bets on which occurs with probability , and if the gambler bets the payoff is , so the expected gain on a toss is
Gambler's strategy: bet on the first toss; if they lose this bet, they bet on the next toss; if they lose that bet, they bet on the toss after that; and in general, if they lose the first bets, they bet on the next toss. They stop as soon as they win, which happens with probability 1
If the first occurs at time then gain is so this guarantees a profit
What's the catch? The expected loss just before win is , so you need a big bank account if you want to use this strategy
Let denote the gambler's gain (loss) at toss , then
A stochastic process with this property is called a martingale
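The doubling strategy can be sketched as follows (assuming even-money bets of 1, 2, 4, … on a fair coin; names are illustrative). Every play nets exactly +1, while the amount already lost just before the win grows geometrically, which is why the strategy needs an unbounded bankroll:

```python
import random

# Doubling ("martingale") betting: bet 1, double the stake after each loss,
# stop at the first win.  Net gain per play is always +1 for even-money bets.
def play(rng):
    stake, spent = 1, 0
    while True:
        if rng.random() < 0.5:       # win: this bet pays +stake
            return stake - spent     # net gain over the whole play
        spent += stake               # lose this stake...
        stake *= 2                   # ...and double for the next toss

rng = random.Random(6)
gains = [play(rng) for _ in range(10_000)]
```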
Lecture 21
Def. Generating Functions
For a sequence of real numbers, the generating function is defined by , provided the series converges for all where
allows us to get the value of
not all sequences have generating functions (e.g. )
if are generating functions, then their product where is the generating function of , and
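The product rule can be checked directly for finite sequences, where multiplying generating functions corresponds to convolving coefficient sequences (a minimal sketch; the sequences are illustrative):

```python
# If A(s) = sum a_k s^k and B(s) = sum b_k s^k, then A(s)B(s) is the
# generating function of the convolution c_n = sum_k a_k * b_{n-k}.
def convolve(a, b):
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def evaluate(coeffs, s):
    return sum(ck * s ** k for k, ck in enumerate(coeffs))

a = [1, 2, 3]          # A(s) = 1 + 2s + 3s^2
b = [4, 5]             # B(s) = 4 + 5s
c = convolve(a, b)     # coefficients of A(s)B(s)
s = 0.7
lhs = evaluate(a, s) * evaluate(b, s)
rhs = evaluate(c, s)   # equal to lhs, term by term
```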
Def. Abel's Theorem
If is finite in and converges (limit could be ), then .
Def. Probability Generating Functions
If is a r.v. s.t. , then the probability generating function of is for .
Prop 3.9.1 (Same PGF <=> same prob dist)
If for all for some , then and have the same probability distribution.
If the r.v.'s are i.i.d. with pgf , and are also stat. ind. of with pgf , then has pgf .
Proof
E.g. PGF of Poisson with ,
If ind. of , then Poisson , since
If Poisson , then is finite for all since converges for all
Since
Exercise III.9.1 If , then find and use this to obtain the pgf for a binomial distribution.
Exercise III.9.2 If , then find and use this to obtain the mean and variance of .
Exercise III.9.3 If Poisson independent of and , determine .
Def. Moment Generating Function
If is a random vector, then the moment generating function of is , provided the expectation is finite for all . The MGF does not always exist (e.g. Cauchy).
Def. Characteristic Function
The characteristic function of is given by for all . Since and both , , we know is bounded:
so always exists (may be complex valued)
If , then , so has a probability distribution symmetric about 0. Since sine is an odd function, i.e. , its expectation and is real-valued
Prop 3.9.3 (Uniqueness of MGF & CF)
(i) If exist and for all , for some , then .
(ii) If for all then .
If we know or and we recognize it, then we know the distribution of .
There are inversion results that give expressions for the cdf of computed from or .
Note Same distribution does not mean same r.v.
Def. Mixed Moment of Random Vector
If , then -th mixed moment of a random vector is defined by whenever this expectation exists.
Prop 3.9.4 (Prev. Mixed Moments are Finite)
If and for all satisfying , then is finite.
Proof(for k = 2)
Exercise III.9.4
Prop 3.9.5 (i-th Mixed Moment)
If exists, then all the moments of are finite and the -th mixed moment is given by
where .
Proof Consider the case when . Then for ,
Let so
Since exists, and so all moments of are finite.
Furthermore, by DCT (Dominated Convergence Thm),
For the general case, let and a similar argument shows that exists. Let
which implies is finite, and by DCT,
Prop 3.9.6 c(t) = m(it)
If exists, then
Prop 3.9.7 (MGF & CF of X+Y)
If are stat. ind. with mgf's (cf's ), then has mgf when and are finite and .
Proof
E.g. MGF and CF of
where , so and
so
Plugging into , we get
If is a sample from the distribution, then the sample mean has the following MGF:
So by uniqueness,
Prop 3.9.8 (Normal r'X -> Normal X)
If is a random vector and is normally distributed for all constant , then for some .
Proof We have that and and so . Now
which implies the result.
E.g. Cauchy
suppose Cauchy, then does not exist so does not exist
but using contour integration it can be shown that
now suppose is a sample from the Cauchy and
then , so by uniqueness Cauchy
note that sampling does not change the distribution, unlike distributions with shorter tails
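A simulation sketch of this phenomenon: standard Cauchy draws can be generated as tan(π(U − 1/2)) with U ~ Uniform(0,1), and the sample means of blocks of draws remain as spread out as a single draw (block sizes and counts are illustrative):

```python
import random, math

# Sample means of i.i.d. standard Cauchy draws are again standard Cauchy:
# they do not concentrate as n grows, unlike distributions with finite mean.
rng = random.Random(7)

def cauchy(rng):
    return math.tan(math.pi * (rng.random() - 0.5))

def sample_mean(n):
    return sum(cauchy(rng) for _ in range(n)) / n

means = [sample_mean(1000) for _ in range(200)]
spread = sorted(abs(m) for m in means)[100]   # median |sample mean|
# for a standard Cauchy, the median of |X| is tan(pi/4) = 1, and the
# sample means of size 1000 show the same order-1 spread
```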
any CF satisfies and by the Dominated Convergence Thm, is continuous at 0 since the limit exists:
if is real, then , so is symmetric. For any and , we have the below (the bars represent the modulus: )
therefore such a can serve as the autocorrelation function of a weakly stationary process
for any constant , is such an autocorrelation function, as is
Exercise III.9.4 If are mut. stat. ind. with and are constant, then determine the distribution of .
Exercise III.9.5 E&R 3.4.13
Exercise III.9.6 E&R 3.4.16
Exercise III.9.7 E&R
Exercise III.9.8 E&R 3.4.29
4. Convergence
Lecture 22
Motivation
applications of probability theory are often concerned with approximations
the underlying idea of "approximation" is the notion of a limit
for a sequence of real numbers , the limit exists if such that for any , for all , and we write . We can then approximate by for large and try to say something about the error in this approximation
if we have a sequence of r.v.'s , then the pointwise convergence of to r.v. means for every
this is too strong (we want to eliminate inconsequential sets that do not converge), so we can weaken this to convergence wp1: if
note - this is concerned with the convergence of a sequence of functions
Def. Convergence in Distribution
converges in distribution to r.v. if for every continuity point of the cdf of
If , then for large provided are continuity points of
Note convergence in distribution is about approximating the dist of a r.v. and not about approximating the value of the r.v.
E.g. Why restrict to convergence at continuity points of ?
suppose so
then as gets bigger all the probability mass "piles up at 0" and let be degenerate at 0 so
at every point of but 0, i.e. so 0 is not a continuity point of
So we have
Prop 4.1.1 (Series Expansion of CF)
If , then where the remainder is a function of satisfying .
Proof Integrate the below using IBP with , so
so with n = 0, we have
now with n-1, we have
Using , we have
This upper bound is finite since and goes to 0 as , which proves the result by DCT.
Prop 4.1.2 (Continuity Theorem)
Suppose is a sequence of r.v.'s.
(i) If , then for every .
(ii) If for every and is continuous at 0 , then is the CF of a r.v. such that .
Prop 4.1.3 (Weak Law of Large Numbers)
If is a sequence of i.i.d. r.v.'s with , then , the r.v. with distribution degenerate at
Proof Let be degenerate at , so which is continuous at 0 . Also,
If and converges to a finite limit, then The first term thus converges to . The second term converges to 1 since converges to 0. The result follows by (ii) of the Continuity Theorem.
Note The Strong Law of Large Numbers says
We can prove that if , then and so the SLLN implies the WLLN.
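A quick simulation of the law of large numbers for Uniform(0,1) draws (μ = 1/2), recording the error of the running mean at a few illustrative sample sizes:

```python
import random

# The running mean of i.i.d. Uniform(0,1) draws approaches mu = 1/2.
rng = random.Random(8)
total = 0.0
errs = {}
for i in range(1, 100_001):
    total += rng.random()
    if i in (100, 10_000, 100_000):
        errs[i] = abs(total / i - 0.5)
# errs typically shrinks as n grows (at rate roughly 1/sqrt(n))
```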
Prop 4.1.4 (Central Limit Theorem)
If is a sequence of i.i.d. r.v.'s with , then
Proofso has mean 0 and variance 1.
Also, has mean 0 and variance 1, so we can write .
Since are i.i.d,
which is the CF of and the result follows by the Continuity Theorem.
E.g. Normal approximation to the binomial
are i.i.d. Bernoulli with , so proportion of 1 's in , so by CLT we have
For large with
Note reflect how long the interval about the mean is in terms of standard deviations
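The approximation can be checked numerically, comparing the exact binomial cdf with the continuity-corrected normal approximation (the values of n, p, k below are illustrative):

```python
import math

# Normal approximation to Binomial(n, p):
# P(X <= k) ~ Phi((k + 0.5 - n*p) / sqrt(n*p*(1-p))), with continuity correction.
def binom_cdf(k, n, p):
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, k = 100, 0.4, 45
exact = binom_cdf(k, n, p)
approx = phi((k + 0.5 - n * p) / math.sqrt(n * p * (1 - p)))
# exact and approx agree to about two decimal places here
```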
E.g. Poisson approximation to the binomial (rare events)
Consider i.i.d. Bernoulli . Since , we have that , so
Since ,
So for Poisson , at any continuity point where has cdf
Thus, Poisson
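A numerical check of the rare-events approximation, comparing the Binomial(n, λ/n) pmf with the Poisson(λ) pmf (the values of n, λ, k are illustrative):

```python
import math

# For rare events, Binomial(n, lambda/n) pmf ~ Poisson(lambda) pmf as n grows.
n, lam, k = 1000, 3.0, 2
p = lam / n
binom_pmf = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
pois_pmf = math.exp(-lam) * lam ** k / math.factorial(k)
# both are ~0.224; the discrepancy is of order lambda**2 / n
```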
Lecture 23
Def. Convergence in Probability
The sequence of r.v.'s converges in probability to r.v. if for any . Denote as
Note this is different from (weaker than) which says
while says for any , there exists s.t. for all
Prop 4.2.1 (Convergence Hierarchy)
Note The converse is false.
Proof (convergence wp1 implies convergence in P)
Let so
By hypothesis,
so which implies .
Proof (convergence in P implies convergence in dist)
NTS at every continuity point. Subtract term from LHS and a smaller term from RHS:
Then, for there exists s.t. for all , and so
When is a continuity point of , choose s.t. . So . Since is arbitrary this implies the result.
E.g. does not imply
Let so
But so
Prop 4.2.2 (Convergence to a Constant)
iff .
Proof By Prop 4.2.1, implies . For the other direction,
which implies .
Prop 4.2.3 (Slutsky's Theorem)
If and , then (i) (ii) (iii) provided
Prop 4.2.4 (Cont. Function Convergence)
If and is continuous (thus measurable) at , then .
Proof Let . Then there exists s.t. whenever . Therefore
E.g. Suppose is an i.i.d. sequence from a distribution with mean and variance . By CLT,
Therefore, by Slutsky
Note when is an i.i.d. sequence, this implies Student
Def. Convergence in Expectation of Order r
converges in expectation of order to if for every and . Denoted
Prop 4.3.1 (Order r implies order s; order 1 implies P)
(i) If , then for any .
Proof when . This implies is convex (opens up) on . Therefore,
Since LHS goes to 0, RHS must also go to 0.
(ii) If , then .
Proof For any
Notethe converse to this proposition is false
Prop 4.3.2 (Order 2)
Take and let
Define by . Note that
Define
(i) If , then a for all constants a,
(ii) is an inner product on
(iii) is a norm on .
Proof
Exercise IV.3.1.
Geometric interpretation: the angle between satisfies
Prop 4.3.3 ( Law of large Numbers)
If is an i.i.d. sequence in , then .
Proof
So implies implies implies
Note In time series many stochastic processes are defined in terms of series of r.v.'s that converge in
Summary (wp1 => p => d)
Strong convergence (wp1, or almost sure convergence)
and are close
Weak convergence (convergence in distribution)
their probability distributions are close
Convergence in probability (in between strong and weak convergence)
and are close with high probability, so their probability distributions are close
5. Gaussian Process
Lecture 24 (Discrete Time)
Recall Def. Stationary Process:
For any , is a Gaussian process if
for some mean function and autocovariance function
For , a weakly stationary Gaussian process has constant and for some positive definite
Def. Strictly Stationary Process
A strictly stationary process has the property that for all , where h is such that
Note A weakly stationary Gaussian process is always strictly stationary, since (similar to how covariance = 0 implies statistical independence with joint normality)
Def. Autoregressive process of order 1
Suppose we have an i.i.d. process with . Consider where is ind. of
Proof Does there exist a stationary Gaussian process satisfying the definition above?
Assume there is, then
Consider . As , we have
Squaring (*) above and taking expectation, we have as
So , and hence it would be natural to define
With this definition, is satisfied. But is a valid r.v.? I.e., does converge to a real number for every ? And is for every ?
Consider and let
Then , and , so is an (extended) r.v. WTS has probability 0.
Since the expectation is finite, , so can be removed from , implying is a r.v. with a finite expectation. Similarly, are r.v.'s and so (decomposed below) is also a r.v.
Since converges wp1, we can apply DCT:
Now NTS exists and is finite – do so by proving and
Since the is finite, the autocovariance function is defined for every and
The autocovariance function of is given by (suppose wlog. and use when when
We have now proved that is weakly stationary. But is it a Gaussian process?
For and , let , and using continuity of ,
Since , and
This is the CF of a normal r.v, so by Uniqueness,
By Prop 3.9.8, (since their sum = is multivariate normal). So by KCT (clearly these distributions are consistent as they are all determined by the same autocovariance function), we have proved that is a stationary Gaussian process
To simulate (approximately) from an autoregressive process, choose a time , say (today), and choose s.t.
Take (write as a lin combo), generate future (error) values
and use to obtain
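The simulation recipe above can be sketched for a stationary Gaussian autoregressive process, writing it as X_t = aX_{t−1} + Z_t with |a| < 1 and Z_t i.i.d. N(0, σ²), started from the stationary distribution N(0, σ²/(1 − a²)); a = 0.8 and σ = 1 are illustrative choices:

```python
import random, math

# Simulating a stationary Gaussian AR(1) path and checking its stationary
# mean (0) and variance sigma**2 / (1 - a**2).
rng = random.Random(9)
a, sigma, T = 0.8, 1.0, 50_000
x = rng.gauss(0.0, sigma / math.sqrt(1 - a * a))   # stationary initial value
path = []
for _ in range(T):
    x = a * x + rng.gauss(0.0, sigma)
    path.append(x)
mean = sum(path) / T
var = sum((v - mean) ** 2 for v in path) / T
# theory: mean ~ 0, var ~ 1 / 0.36 ~ 2.78
```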
Lecture 25 (Continuous Time)
Def. Brownian Motion
A stochastic process is a standard Wiener process (another name for Brownian Motion) if
(i)
(ii) for any , the increments are mutually stat. ind.
(iii) for any
satisfying the above with is a general Wiener process.
It is a Gaussian process with mean function 0 and autocovariance function min .
Proof For any and
and so is multivariate normal since every linear combination is normal (Prop 3.9.8). Also,
Therefore, , so by KCT this is a Gaussian process.
Prop 5.2.2 (Alt. Brownian Motion)
There exists a version of also satisfying
(iv) is continuous in
(v) is nowhere differentiable in
E.g. How does Brownian motion arise? It arises as a limiting process.
Suppose i.i.d. with mean 0 and variance 1. Let and a random walk (e.g. a simple symmetric random walk when
Prop 5.2.3 (Donsker's Thm / Invariance Principle)
space is shrunk by factor and time speeded up by factor
for , let , then
sample paths are not continuous but using linear interpolation,
which has continuous sample paths and the same convergence result applies
these results also tell us how to simulate (approximately) from
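A sketch of that simulation recipe: scale a simple symmetric random walk by 1/√n to approximate B(t) on [0, 1]. Here the endpoint B(1) is sampled repeatedly and its variance checked against Var(B(1)) = 1 (the replication counts are illustrative):

```python
import random, math

# Donsker-style approximation: B(1) ~ S_n / sqrt(n) for a mean-0,
# variance-1 random walk with n steps.
rng = random.Random(10)

def bm_endpoint(n):
    total = 0.0
    for _ in range(n):
        total += rng.choice((-1.0, 1.0))   # simple symmetric walk step
    return total / math.sqrt(n)

reps, n = 5000, 400
ends = [bm_endpoint(n) for _ in range(reps)]
var_end = sum(e * e for e in ends) / reps   # close to Var(B(1)) = 1
```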
Def. Diffusion Process
Define by (stock market) where initial value, and volatility
Exercise V.2.1 E&R
Lecture 26
Recall a discrete-time martingale is a s.p. , where is a set of consecutive integers s.t. exists and for every
E.g.Random walks are martingales
Suppose are i.i.d. with , and consider the random walk where and , then
Prop 6.1 (Martingale Convergence Theorem)
If is a martingale with sup , then there exists a r.v. with finite expectation s.t. .
Def. Stopping Time
For , a stopping time is a mapping such that for all .
note and
E.g. Hitting Time
consider a sequence of r.v.'s and a set
define and if for any
so is the first time the sequence hits
then and so is a stopping time
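A sketch of such a hitting time for a simple symmetric random walk with A = {−3, 3} (the set and the safety cap are illustrative). By a classical optional-stopping identity applied to the martingale X_n² − n, E(τ) = 3² = 9 here, which the simulation reflects:

```python
import random

# tau = first n with X_n in A = {-a, a}; for the simple symmetric walk
# started at 0, E(tau) = a**2.
rng = random.Random(11)

def hitting_time(rng, a=3, cap=10_000):
    x, n = 0, 0
    while abs(x) < a and n < cap:   # cap is a safety bound, essentially never hit
        x += rng.choice((-1, 1))
        n += 1
    return n, x

taus = [hitting_time(rng) for _ in range(2000)]
avg_tau = sum(t for t, _ in taus) / len(taus)   # close to 9
```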
E.g. Clinical trials (simplified)
imagine a sequence of patients that are suffering from a disease and a drug is given to each patient with the intention of curing the disease
let be i.i.d. Bernoulli where is unknown and means the patient is cured so the proportion of patients cured after the first patients have been given the drug
consider the s.p. and let and or since countable
then stops the trial when, after a minimal sample size , there are too few cures or sufficiently many that it can be declared that the drug is working
for
so is a stopping time
Notes
given a s.p. and a stopping time for the process that is finite wp1 (a finite stopping time)
consider the value of the process at the time it is stopped
is a r.v.
Proof We have which implies the result.
the implication of this is that we can compute probabilities for
Prop 6.3 (Optional Stopping Theorem)
Suppose is a martingale with is a stopping time for the process and that one of the following conditions holds for some constant . Then .
(i) (bounded before stopping) for every
(ii) (bounded stopping time)
E.g. Random walks and hitting times
as in the previous example, is a martingale with initial value