• 1. Basic Probability

    Lecture 1

    Prop 1.2.1 (De Morgan's)

    $\left(\bigcup_{i=1}^{\infty} A_i\right)^C = \bigcap_{i=1}^{\infty} A_i^C$

    Proof

    Strategy: prove they are subsets of each other

    Let $\omega \in \left(\bigcup_{i=1}^{\infty} A_i\right)^C$. Then $\omega \notin \bigcup_{i=1}^{\infty} A_i \Rightarrow \omega \notin A_i \ \forall i \Rightarrow \omega \in A_i^C \ \forall i \Rightarrow \omega \in \bigcap_{i=1}^{\infty} A_i^C$. So we have $\left(\bigcup_{i=1}^{\infty} A_i\right)^C \subseteq \bigcap_{i=1}^{\infty} A_i^C$

    Let $\omega \in \bigcap_{i=1}^{\infty} A_i^C$. Then $\omega \in A_i^C \ \forall i \Rightarrow \omega \notin A_i \ \forall i \Rightarrow \omega \notin \bigcup_{i=1}^{\infty} A_i \Rightarrow \omega \in \left(\bigcup_{i=1}^{\infty} A_i\right)^C$. So we have $\bigcap_{i=1}^{\infty} A_i^C \subseteq \left(\bigcup_{i=1}^{\infty} A_i\right)^C$

    $\left(\bigcap_{i=1}^{\infty} A_i\right)^C = \bigcup_{i=1}^{\infty} A_i^C$

    Proof

    Let $\omega \in \left(\bigcap_{i=1}^{\infty} A_i\right)^C$. Then $\omega \notin \bigcap_{i=1}^{\infty} A_i \Rightarrow \omega \notin A_i$ for some $i$ $\Rightarrow \omega \in A_i^C$ for some $i$ $\Rightarrow \omega \in \bigcup_{i=1}^{\infty} A_i^C$. So LHS $\subseteq$ RHS

    Let $\omega \in \bigcup_{i=1}^{\infty} A_i^C$. Then $\omega \in A_i^C$ for some $i$ $\Rightarrow \omega \notin A_i$ for some $i$ $\Rightarrow \omega \notin \bigcap_{i=1}^{\infty} A_i$, i.e. $\omega \in \left(\bigcap_{i=1}^{\infty} A_i\right)^C$. So RHS $\subseteq$ LHS

    Def. Power Set

    Denoted $2^\Omega$, a power set consists of all subsets of $\Omega$.

    Its cardinality is given by $\#(2^\Omega) = 2^{\#\Omega}$

    Def. Sigma Algebra/Field

    A sigma algebra $\mathcal{A} \subseteq 2^\Omega$ on $\Omega$ is a set (containing subsets of $\Omega$) that:

    1. contains the null set

    $\emptyset \in \mathcal{A}$

    2. is closed under countable unions

    $A_1, A_2, ... \in \mathcal{A} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$

    3. is closed under complementation

    $A \in \mathcal{A} \Rightarrow A^C \in \mathcal{A}$

    $\underbrace{\{\emptyset, \Omega\}}_{\text{coarsest } \sigma\text{-algebra}} \subseteq \text{all other sigma algebras} \subseteq \underbrace{2^\Omega}_{\text{finest } \sigma\text{-algebra}}$

    Def. Probability Measure

    A probability measure defined on a set $\Omega$ with $\sigma$-algebra $\mathcal{A}$ is a function $P : \mathcal{A} \to [0,1]$ with the following properties:

    1. normed

      $P(\Omega) = 1$

    2. countably additive

      $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$, for mutually disjoint $A_i$

    Countable additivity gives finite additivity for free: pad a finite collection $A_1, \dots, A_n$ with infinitely many copies of $\emptyset$ and apply the formula above.

    Many sample spaces are uncountably infinite, and for these there is generally no $P$ that can be defined on every subset of $\Omega$. We thus restrict the domain of $P$ to a $\sigma$-algebra $\mathcal{A} \subseteq 2^\Omega$.

    Prop 1.2.2 (Some Event Must Occur)

    If $(\Omega, \mathcal{A}, P)$ is a probability model, then $P(\emptyset) = 0$

    Proof

    Let $A_i = \emptyset$ for $i = 1, 2, ...$, so the $A_i$ are mutually disjoint, and $\bigcup_{i=1}^{\infty} A_i = \emptyset$

    (Contradiction) Suppose that $P(\emptyset) > 0$. Then $P(\emptyset) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(\emptyset) = \infty$, which contradicts $P(\emptyset) \in [0,1]$, so $P(\emptyset) = 0$

    Lecture 2

    Hierarchy Elements ($\omega$) -> sets of elements (events, $A$) -> sigma algebras ($\mathcal{A}$) -> Borel sets ($\mathcal{B}^k$)

    Prop 1.3.1 (Intersection of Sigma Algebras)

    If $\{\mathcal{A}_\lambda : \lambda \in \Lambda\}$ is a family/set of $\sigma$-algebras on $\Omega$, then $\bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda$ is a $\sigma$-algebra on $\Omega$

    Proof

    $\bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda$ must have the properties of a $\sigma$-algebra:

    1. $\emptyset \in \mathcal{A}_\lambda \ \forall \lambda \Rightarrow \emptyset \in \bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda$
    2. $A_1, A_2, ... \in \bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda \Rightarrow A_1, A_2, ... \in \mathcal{A}_\lambda \ \forall \lambda \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}_\lambda \ \forall \lambda \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda$
    3. $A \in \bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda \Rightarrow A \in \mathcal{A}_\lambda \ \forall \lambda \Rightarrow A^C \in \mathcal{A}_\lambda \ \forall \lambda \Rightarrow A^C \in \bigcap_{\lambda \in \Lambda} \mathcal{A}_\lambda$

    Since the intersection contains the null set and is closed under countable unions and complementation, it is a $\sigma$-algebra.

    Def. Sigma Algebra Generated by C

    $\mathcal{A}(\mathcal{C})$ is obtained by intersecting all $\sigma$-algebras containing $\mathcal{C} \subseteq 2^\Omega$.
    It is thus the smallest $\sigma$-algebra on $\Omega$ containing all subsets in $\mathcal{C}$.

    Def. Borel Set

    $\mathcal{B}^k$ is the $\sigma$-algebra generated by open sets. Formally:

    It is the smallest $\sigma$-algebra on $\Omega = \mathbb{R}^k$ containing all rectangles of the form $(a, b]$ where $a = (a_1, \dots, a_k)', b = (b_1, \dots, b_k)' \in \mathbb{R}^k$

    $(a, b] = \times_{i=1}^{k} (a_i, b_i]$

    $= (a_1, b_1] \times ... \times (a_k, b_k]$

    $= \{(x_1, ..., x_k) : a_i < x_i \le b_i, \ i = 1, ..., k\}$

    $\mathcal{B}^k \subseteq 2^{\mathbb{R}^k}$, since $2^{\mathbb{R}^k}$ contains all such rectangles. $\mathcal{B}^k \neq 2^{\mathbb{R}^k}$, since there is a subset $A \subseteq \mathbb{R}^k$ that is not a Borel set.

    Loosely speaking, any set that can be defined explicitly is a Borel set.
    (Nice) transformations of Borel sets are also Borel sets.

    Def. Ellipsoidal Region

    A ball of radius $r$ centered at $x_0$ is given by $B_r(x_0) = \{x : (x - x_0)^T(x - x_0) = \sum_{i=1}^{k}(x_i - x_{0i})^2 \le r^2\} \in \mathcal{B}^k$

    The set that forms its boundary is denoted $S_r(x_0)$, and is obtained by replacing $\le$ with $=$.

    Applying an affine transformation $y = Ax + b$ to $B_r(x_0)$, where $A = \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix} \in \mathbb{R}^{k \times k}$ is an invertible matrix and $b \in \mathbb{R}^k$,

    $A B_r(x_0) + b = \{y : y = Ax + b \text{ for some } x \in B_r(x_0)\} = \{y : (A^{-1}(y - b) - x_0)^T (A^{-1}(y - b) - x_0) \le r^2\}$ (substituting $x = A^{-1}(y - b)$) $= \{y : (\underbrace{y - b - Ax_0}_{y - \mu})^T \underbrace{(A^{-1})^T A^{-1}}_{\Sigma^{-1}} (y - b - Ax_0) \le r^2\}$ (pulling $A^{-1}$ out of each factor) $= \{y : (y - \mu)^T \Sigma^{-1} (y - \mu) \le r^2\} = E_r(\mu, \Sigma) \in \mathcal{B}^k$

    we obtain an ellipsoidal region centered at $\mu = Ax_0 + b$, whose axes and orientation are determined by $\Sigma = ((A^{-1})^T A^{-1})^{-1} = A A^T$ and $r$

    Recall: A matrix is…
    - symmetric if $A^T = A$
    - invertible if $\exists A^{-1}$ with $A^{-1}A = I$ (the 0 matrix is not invertible)
    - positive definite if $v^T A v > 0 \ \forall v \neq 0 \in \mathbb{R}^k$

    Σ is symmetric, invertible, and positive definite.

    Note for the multivariate normal, μ is the mean vector, Σ is the variance matrix.

    Lecture 3

    Def. Limit inferior/superior of a Sequence

    For a sequence $A_n \subseteq \Omega$:

    • $\liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i = \{\omega : \omega \text{ is in all but finitely many } A_i\}$

      • $\omega$ is a member of at least one of the intersections $\bigcap_{i=n}^{\infty} A_i$
    • $\limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i = \{\omega : \omega \text{ is in infinitely many } A_i\}$

      • $\omega$ is a member of all of the unions $\bigcup_{i=n}^{\infty} A_i$

    Properties:

    Monotone Increasing/Decreasing Sequences

    $\bigcap_{i=n}^{\infty} A_i$ is an increasing sequence of sets (as $n$ increases, fewer sets are intersected, so the resulting intersection gets bigger)

    $\bigcup_{i=n}^{\infty} A_i$ is a decreasing sequence of sets (as $n$ increases, fewer sets are unioned, so the resulting union gets smaller)

    Prop 1.4.1 (Monotone Sequences Converge)

    A monotone decreasing sequence of sets converges to their intersection.

    If $A_n \in \mathcal{A} \ \forall n$, and $A_1 \supseteq A_2 \supseteq ...$, then $A_n \downarrow A = \bigcap_{i=1}^{\infty} A_i$

    Proof

    Need to prove that lim inf = lim sup:

    (1) Since $A_n \subseteq A_{n-1} \subseteq \dots$, we have that $\bigcup_{i=n}^{\infty} A_i = A_n$, so $\limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i = \bigcap_{n=1}^{\infty} A_n$

    (2) Also, $\bigcap_{i=n}^{\infty} A_i = \bigcap_{i=1}^{\infty} A_i$, so $\liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i = \bigcap_{i=1}^{\infty} A_i$ (if we union the same set over and over again, we get that set)

    Optional subproof:
    $\bigcap_{i=1}^{\infty} A_i \subseteq \bigcap_{i=n}^{\infty} A_i \ \forall n$, since the intersection of many sets $\subseteq$ the intersection of fewer sets
    Other direction: let $\omega \in \bigcap_{i=n}^{\infty} A_i$, so $\omega \in A_n \subseteq \dots \subseteq A_1$, i.e. $\omega \in \bigcap_{i=1}^{\infty} A_i$, so $\bigcap_{i=n}^{\infty} A_i \subseteq \bigcap_{i=1}^{\infty} A_i$
    Since they are subsets of each other, $\bigcap_{i=n}^{\infty} A_i = \bigcap_{i=1}^{\infty} A_i$

    (1 & 2) Since $\liminf A_n = \bigcap_{i=1}^{\infty} A_i = \limsup A_n$, we have convergence: $A_n \downarrow A = \bigcap_{i=1}^{\infty} A_i$

    A monotone increasing sequence of sets converges to their union.

    If $A_n \in \mathcal{A} \ \forall n$, and $A_1 \subseteq A_2 \subseteq ...$, then $A_n \uparrow A = \bigcup_{i=1}^{\infty} A_i$

    Proof

    (1) Since $A_n \subseteq A_{n+1} \subseteq \dots$, we have that $\bigcap_{i=n}^{\infty} A_i = A_n$, so $\liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i = \bigcup_{n=1}^{\infty} A_n$

    (2) Also, $\bigcup_{i=n}^{\infty} A_i = \bigcup_{i=1}^{\infty} A_i$, so $\limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i = \bigcup_{i=1}^{\infty} A_i$ (intersecting the same set over and over again gives that set)

    (1 & 2) Since $\liminf A_n = \limsup A_n$, we have convergence: $A_n \uparrow \bigcup_{i=1}^{\infty} A_i$.

    Prop 1.4.2 (Continuity of P)

    If $A_n \in \mathcal{A} \ \forall n$ and $A_n \to A$, then $P(A_n) \to P(A)$ as $n \to \infty$

    Note The converse is true

    Proof

    By the previous proposition, we know (1) & (2)

    (1) Since $\bigcup_{i=n}^{\infty} A_i$ is a monotone decreasing sequence, it converges to the intersection of the sets,
    i.e. $\bigcup_{i=n}^{\infty} A_i \downarrow \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i = \limsup A_n$

    (2) Since $\bigcap_{i=n}^{\infty} A_i$ is a monotone increasing sequence, it converges to the union of the sets,
    i.e. $\bigcap_{i=n}^{\infty} A_i \uparrow \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i = \liminf A_n$

    By (1) & (2), $P\left(\bigcup_{i=n}^{\infty} A_i\right) \to P(\limsup A_n)$, and $P\left(\bigcap_{i=n}^{\infty} A_i\right) \to P(\liminf A_n)$

    So $P\left(\bigcap_{i=n}^{\infty} A_i\right) \le P(A_n) \le P\left(\bigcup_{i=n}^{\infty} A_i\right) \Rightarrow P(\liminf A_n) \le \lim P(A_n) \le P(\limsup A_n) \Rightarrow P(A_n) \to P(A)$

    Suppose $A_n$ is a monotone increasing sequence, so $A_n \uparrow A = \bigcup_{i=1}^{\infty} A_i$

    Now create mutually disjoint $B_i \in \mathcal{A}$ like so: $B_1 = A_1$, $B_2 = A_2 \cap A_1^C$, $B_3 = A_3 \cap A_2^C$, ... such that $A_n = \bigcup_{i=1}^{n} B_i \Rightarrow P(A_n) = \sum_{i=1}^{n} P(B_i)$

    So $\lim_n P(A_n) = \lim_n \sum_{i=1}^{n} P(B_i) = \sum_{i=1}^{\infty} P(B_i) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = P(\lim_n A_n)$

    Suppose $A_n$ is a monotone decreasing sequence, so $A_n^C$ is monotone increasing.

    Hence $\lim_n P(A_n^C) = P(\lim_n A_n^C) = P\left(\bigcup_{i=1}^{\infty} A_i^C\right) = P\left(\left(\bigcap_{i=1}^{\infty} A_i\right)^C\right) = 1 - P\left(\bigcap_{i=1}^{\infty} A_i\right)$, so $\lim_n P(A_n) = P\left(\bigcap_{i=1}^{\infty} A_i\right) = P(\lim_n A_n)$

    Prop 1.4.3 (Prob Measure on a Sigma Algebra)

    $P$ is a probability measure on $\mathcal{A}$ if $P : \mathcal{A} \to [0,1]$ satisfies

    (1) $P(\Omega) = 1$

    (2) $P$ is additive

    (3) $P(A_n) \to P(A)$ as $n \to \infty$ whenever $A_n \in \mathcal{A} \ \forall n$ and $A_n \to A$

    Proof

    (1) and (2) are contained in the def of probability measure (normed and countably additive)

    Combining additivity (2) with continuity (3), we have that P is countably additive:

    (3) can also be written as $A_n \uparrow A \Rightarrow \lim_n P(A_n) = P(A)$

    Let $B_n = \bigcup_{i=1}^{n} A_i$, where $A_1, A_2, ... \in \mathcal{A}$ are mutually disjoint.

    Then $B_n$ is a monotone increasing sequence of events with $\lim B_n = \bigcup_{n=1}^{\infty} B_n = \bigcup_{n=1}^{\infty} \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{\infty} A_i$

    Since $P\left(\bigcup_{i=1}^{\infty} A_i\right) = P(\lim B_n) = \lim P(B_n) = \lim P\left(\bigcup_{i=1}^{n} A_i\right) = \lim \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{\infty} P(A_i)$,

    continuity $+$ finite additivity $\Rightarrow$ countable additivity

    Important Note Countable additivity $\Rightarrow$ continuity of $P$. By ensuring countable additivity, we ensure continuity of $P$, which is needed when we have an infinite sample space.

    Def. Conditional Probability Model

    If $(\Omega, \mathcal{A}, P)$ is a probability model and $C \in \mathcal{A}$ has $P(C) > 0$, then the conditional probability model given $C$ is $(\Omega, \mathcal{A}, P(\cdot|C))$, where $P(\cdot|C) : \mathcal{A} \to [0,1]$ is given by $P(A|C) = \frac{P(A \cap C)}{P(C)}$

    Proof

    (1) $P(\Omega|C) = \frac{P(\Omega \cap C)}{P(C)} = \frac{P(C)}{P(C)} = 1$

    (2) If $A_1, A_2, ... \in \mathcal{A}$ are mutually disjoint,

    then $P\left(\bigcup_{i=1}^{\infty} A_i \middle| C\right) = \frac{P\left(\left(\bigcup_{i=1}^{\infty} A_i\right) \cap C\right)}{P(C)} = \frac{P\left(\bigcup_{i=1}^{\infty} (A_i \cap C)\right)}{P(C)} = \frac{\sum_{i=1}^{\infty} P(A_i \cap C)}{P(C)} = \sum_{i=1}^{\infty} P(A_i|C)$

    Since P is normed and countably additive, (Ω,A,P(|C)) is a probability model.

    Note The model can also be presented as (C,AC,P(|C))

    Prop 1.5.1 (LOTP / Thm of Total Prob.)

    Suppose $C_1, C_2, ... \in \mathcal{A}$ with $P(C_i) > 0 \ \forall i$, and $\Omega = \bigcup_{i=1}^{\infty} C_i$ with $C_i \cap C_j = \emptyset \ \forall i \neq j$. Then for any $A \in \mathcal{A}$, $P(A) = \sum_{i=1}^{\infty} P(C_i) P(A|C_i)$

    Proof

    Since $A = \bigcup_{i=1}^{\infty} (A \cap C_i)$ where the $C_i \in \mathcal{A}$ are mutually disjoint,
    $P(A) = \sum_{i=1}^{\infty} P(A \cap C_i) = \sum_{i=1}^{\infty} \frac{P(A \cap C_i)}{P(C_i)} P(C_i) = \sum_{i=1}^{\infty} P(A|C_i) P(C_i)$

    Fact If the $C_i$ form a partition of $\Omega$, then $A = \bigcup_{i=1}^{\infty} (A \cap C_i)$ and the sets $A \cap C_i$ are mutually disjoint

    Proof

    Since $C_i \cap C_j = \emptyset$ when $i \neq j$, we have $(A \cap C_i) \cap (A \cap C_j) = \emptyset$, and $A = \bigcup_{i=1}^{\infty} (A \cap C_i) = A \cap \bigcup_{i=1}^{\infty} C_i$ (also $\bigcup_{i=1}^{\infty} C_i = \Omega$)

    Lecture 4

    Def. Statistically Independent

    If $(\Omega, \mathcal{A}, P)$ is a probability model and $A, C \in \mathcal{A}$, then $A$ and $C$ are statistically independent if $P(A \cap C) = P(A)P(C)$

    It follows that when $P(C) > 0$, $P(A|C) = \frac{P(A \cap C)}{P(C)} = \frac{P(A)P(C)}{P(C)} = P(A)$

    Statistically Independent Sigma Algebras

    $A$ and $B$ are statistically independent if every element of the $\sigma$-algebra generated by $A$: $\{\emptyset, A, A^C, \Omega\}$ is statistically independent of every element of the $\sigma$-algebra generated by $B$: $\{\emptyset, B, B^C, \Omega\}$

    Proof

    $C$ and $\emptyset$ are statistically independent since $C \cap \emptyset = \emptyset \ \forall C$, and so $P(C \cap \emptyset) = P(\emptyset) = 0 = P(\emptyset)P(C)$

    $C$ and $\Omega$ are statistically independent since $C \cap \Omega = C \ \forall C$, and so $P(C \cap \Omega) = P(C) = P(C)P(\Omega)$

    $A$ and $B^C$ are statistically independent since $A \cap B^C = A \setminus (A \cap B)$, and so $P(A \cap B^C) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^C)$

    $A^C$ and $B$ are statistically independent in the same vein.

    $A^C$ and $B^C$ are statistically independent since $P(A^C \cap B^C) = P((A \cup B)^C) = 1 - P(A \cup B) = 1 - P(A) - P(B) + P(A \cap B) = 1 - P(A) - P(B) + P(A)P(B) = (1 - P(A))(1 - P(B)) = P(A^C)P(B^C)$

    Def. Mutually Statistically Independent

    When $(\Omega, \mathcal{A}, P)$ is a probability model and $\{\mathcal{A}_\lambda : \lambda \in \Lambda\}$ is a collection of sub $\sigma$-algebras of $\mathcal{A}$,
    the $\mathcal{A}_\lambda$ are mutually statistically independent if $P(A_1 \cap ... \cap A_n) = \prod_{i=1}^{n} P(A_i) \ \forall n$,
    where $\lambda_1, ..., \lambda_n \in \Lambda$ are distinct, and $A_1 \in \mathcal{A}_{\lambda_1}, ..., A_n \in \mathcal{A}_{\lambda_n}$.

    Notes

    Union of 3 events (Inclusion-Exclusion Principles)

    $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$

    Proof

    $P(A \cup B \cup C) = P((A \cup B) \cup C) = P(A \cup B) + P(C) - P((A \cup B) \cap C) = P(A) + P(B) - P(A \cap B) + P(C) - P((A \cap C) \cup (B \cap C)) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P((A \cap C) \cap (B \cap C)) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$

    Generalized to n events

    $P(A_1 \cup ... \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - ... + (-1)^{n+1} P(A_1 \cap ... \cap A_n)$

    Proof

    Base The result is true for $n = 2$: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

    I.H. Assume it's true for n

    Consider

    P(A1...AnAn+1)=P((A1...An)An+1)=P(A1...An)+P(An+1)P((A1...An)An+1)=P(A1...An)(1)+P(An+1)P((A1An+1)...(AnAn+1))(2)

    (1)P(A1...An)=i=1nP(Ai)i<jnP(AiAj)+...+(1)n+1P(A1...An)

    (2)P((A1An+1)...(AnAn+1))=i=1nP(AiAn+1)i<jnP(AiAjAn+1)+...+(1)n+1P(A1...AnAn+1)

    Combining the above, we have

    P(A1...An+1)=i=1n+1P(Ai)i<jP(AiAj)+...+(1)n+2P(A1...An+1)

    Intersection of 3 events

    $P(A \cap B \cap C) = P(A) + P(B) + P(C) - P(A \cup B) - P(A \cup C) - P(B \cup C) + P(A \cup B \cup C)$

    Proof

    $\text{LHS} = 1 - P((A \cap B \cap C)^C) = 1 - P(A^C \cup B^C \cup C^C) = 1 - [P(A^C) + P(B^C) + P(C^C) - P(A^C \cap B^C) - P(A^C \cap C^C) - P(B^C \cap C^C) + P(A^C \cap B^C \cap C^C)] = 1 - [3 - P(A) - P(B) - P(C) - (1 - P(A \cup B)) - (1 - P(A \cup C)) - (1 - P(B \cup C)) + (1 - P(A \cup B \cup C))] = \text{RHS}$

    Generalized to n events

    $P(A_1 \cap \dots \cap A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cup A_j) + \dots + (-1)^{n+1} P(A_1 \cup \dots \cup A_n)$

    2. Random Variables and Stochastic Processes

    Lecture 5

    Motivation If we have a population — $\Omega$, a measurement of some sort — $X(\omega)$, and we want to assign probabilities to events — $a \le X(\omega) \le b$, or $X(\omega) \in [a, b]$ — the probabilities are defined on $\Omega$ instead of $\mathbb{R}^1$, which is difficult to work with. To navigate this, we use inverse images.

    Def. Inverse Image

    Under the function $X : \Omega \to \mathbb{R}^1$, the inverse image of the set $B \subseteq \mathbb{R}$ is given by $X^{-1}B = \{\omega \in \Omega : X(\omega) \in B\}$

    It is the set of ω that get mapped into B.

    Note to self X(ω)=bX1{b}=ω

    E.g. Suppose $\Omega = \{1, 2, 3, 4, 5\}$ and $X(\omega) = \begin{cases} 0 & \omega = 1 \\ 0.20 & \omega = 2 \\ 0.30 & \omega = 3 \\ 0.01 & \omega = 4 \\ 0.20 & \omega = 5 \end{cases}$

    Note that X is not 1-1

    Given a set B, determine X1B

    B = [0, 1] $\Rightarrow X^{-1}B = \Omega$

    B = [0.00, 0.25] $\Rightarrow X^{-1}B = \{1, 2, 4, 5\}$

    B = {0} $\Rightarrow X^{-1}B = \{1\}$

    B = $(-\infty, 0)$ $\Rightarrow X^{-1}B = \emptyset$

    Property Inverse images preserve Boolean operations.

    Proof for Unions

    Let ωX1(B1B2), then X(ω)B1B2

    ωX1B1 or ωX1B2

    ωX1B1X1B2

    So X1(B1B2)X1B1X1B2 1

    Suppose ωX1B1X1B2,

    ωX1B1 or ωX1B2

    X(ω)B1 or X(ω)B2

    X(ω)B1B2

    ωX1(B1B2)

    So X1B1X1B2X1(B1B2) 2

    By 1 and 2, we have X1(B1B2)=X1B1X1B2 since they are subsets of each other

    Proof for Complements

    Let ωX1BC, then X(ω)BC

    X(ω)B

    wX1B

    ω(X1B)C

    So X1BC(X1B)C 1

    Suppose ω(X1B)C

    ωX1B

    X(ω)B

    X(ω)BC

    ωX1BC

    So (X1B)CX1BC 2

    By 1 and 2, X1BC=(X1B)C

    Property If $B_1 \cap B_2 = \emptyset$, then $X^{-1}B_1$ and $X^{-1}B_2$ are also disjoint.

    Proof

    Suppose $A \cap B = \emptyset$, then $X^{-1}A \cap X^{-1}B = X^{-1}(A \cap B) = X^{-1}\emptyset = \emptyset$

    Def. Random Variable

    A random variable is a function $X : \Omega \to \mathbb{R}^1$ with the property that for any $B \in \mathcal{B}^1$ (i.e. Borel set in $\mathbb{R}^1$), $X^{-1}B \in \mathcal{A}$.

    Thus, when $X$ is a random variable, $P(X(\omega) \in B) = P(X^{-1}B)$

    Prop 2.1.1 (Marginal Probability Measure)

    When $X$ is a r.v., the marginal probability measure of $X$ is $P_X$, which is defined on $\mathcal{B}^1$ by $P_X(B) = P(X^{-1}B)$

    Proof

    $P_X : \mathcal{B}^1 \to [0,1]$

    1. Normed: $P_X(\mathbb{R}^1) = P(X^{-1}\mathbb{R}^1) = P(\Omega) = 1$
    2. Countably additive: If $B_1, B_2, ...$ are mutually disjoint elements of $\mathcal{B}^1$,
      then $P_X\left(\bigcup_{i=1}^{\infty} B_i\right) = P\left(X^{-1}\bigcup_{i=1}^{\infty} B_i\right) = P\left(\bigcup_{i=1}^{\infty} X^{-1}B_i\right) = \sum_{i=1}^{\infty} P(X^{-1}B_i) = \sum_{i=1}^{\infty} P_X(B_i)$

    Note The probability model for a random variable X is (R1,B1,PX)

    Prop 2.1.2 (Determine whether X is a random variable)

    If $X^{-1}(a, b] \in \mathcal{A}$ for every $a, b \in \mathbb{R}^1$, then $X$ is a random variable.

    Proof

    Let $\tilde{\mathcal{B}}^1 = \{B \in \mathcal{B}^1 : X^{-1}B \in \mathcal{A}\}$

    1. Since $\emptyset \in \mathcal{B}^1$ and $X^{-1}\emptyset = \emptyset \in \mathcal{A}$, we know $\emptyset \in \tilde{\mathcal{B}}^1$

    2. If $B \in \tilde{\mathcal{B}}^1$, then $X^{-1}B \in \mathcal{A} \Rightarrow (X^{-1}B)^C = X^{-1}B^C \in \mathcal{A}$

      Since $B^C \in \mathcal{B}^1$ and $X^{-1}B^C \in \mathcal{A}$, we know $B^C \in \tilde{\mathcal{B}}^1$

    3. If $B_1, B_2, ... \in \tilde{\mathcal{B}}^1$, then $X^{-1}B_1, X^{-1}B_2, ... \in \mathcal{A} \Rightarrow \bigcup_{i=1}^{\infty} X^{-1}B_i = X^{-1}\bigcup_{i=1}^{\infty} B_i \in \mathcal{A}$

      Since $\bigcup_{i=1}^{\infty} B_i \in \mathcal{B}^1$ and $X^{-1}\bigcup_{i=1}^{\infty} B_i \in \mathcal{A}$, we know $\bigcup_{i=1}^{\infty} B_i \in \tilde{\mathcal{B}}^1$

    By 1 (contains null set), 2 (closed under comp), & 3 (closed under union), we know $\tilde{\mathcal{B}}^1$ is a sub $\sigma$-algebra of $\mathcal{B}^1$ (1)

    By hypothesis, $(a, b] \in \tilde{\mathcal{B}}^1 \ \forall a, b \in \mathbb{R}^1 \Rightarrow \mathcal{B}^1 \subseteq \tilde{\mathcal{B}}^1$, since $\mathcal{B}^1$ is the smallest $\sigma$-algebra containing all intervals $(a, b]$ (2)

    By (1) and (2), they are subsets of each other, so $\tilde{\mathcal{B}}^1 = \mathcal{B}^1 \Rightarrow X^{-1}B \in \mathcal{A} \ \forall B \in \mathcal{B}^1 \Rightarrow X$ is a random variable.

    Examples

    Note When A=2Ω, then any X:ΩR1 is a random variable.

    Prop 2.1.3 (Sum & Prod of R.V.s are R.V.s)

    If X, Y are random variables defined on Ω, then (1) W = X+Y and (2) W = XY are both random variables.

    Proof of (1) W = X + Y

    Suppose ωW1(,b]={ω:X(ω)+Y(ω)b}

    Let cnQ be such that cnb, then  qQ such that X(ω)q and Y(ω)cnq,

    We can take the intersection to get that ω(X1(,q]Y1(,cnq])A

    We can express the set of all cn as Cn=qQ({ω:X(ω)q}{ω:Y(ω)cnq}), so W1(,b]set of ωCn,  n

    Since Q is countable, and Cn is a countable union of elements of A, we have that CnA

    By hypothesis Cn is monotone decreasing, so limnCn=n=1Cn=W1(,b]AW=X+Y is a r.v.

    Proof of (2) W = XY

    Suppose b = 0, then

    W1(,0]={ω:X(ω)0,Y(ω)0}{ω:X(ω)0,Y(ω)0}=(X1(,0]Y1[0,))(X1[0,)Y1(,0])A

    Suppose b > 0, then

    W1(,b]=W1(,0]W1(0,b]

    We've shown W1(,0]A, so we just need to show the other part: W1(0,b]A.

    W1(0,b]={ω:X(ω)>0,Y(ω)>0,X(ω)Y(ω)b}{ω:X(ω)<0,Y(ω)<0,X(ω)Y(ω)b}={ω1}{ω4}

    Since xy=b is symmetrical over the line y=-x, proving the argument for one of 1 & 4 will suffice.

    Suppose ω 1 and let cnb. Then  qQ(0,) such that ωX1(0,q]Y1(0,cn/q]A

    Cn=q  Q(0,)X1(0,q]Y1(0,cn/q]A since Q(0,) is countable.

    Since Cn11AW1(0,b]A

    A similar argument holds for b < 0. For any b, W1(,b]A, so W=XY is a r.v.

    E.g. p(X)=i=0naiXi is a r.v. if X is a r.v.

    Any constant function Y(ω)=c is a r.v., so all ai are r.v.'s.

    The product of r.v.'s is a r.v., so all aiXi are r.v.'s

    The sum of r.v.'s is a r.v., so i=0naiXi is a r.v.

    Prop 2.1.4 (Sigma Algebra generated by X)

    When X is a random variable, AX=X1B1={X1B:BB1} is a sub σ-algebra of A, called the σ-algebra on Ω generated by X.

    Alternative notation: AX=A({X1(a,b]:a,bR1})

    Proof

    1. =X1AX

    2. If A1,A2,...AX, then  B1,B2,...B1 such that Ai=X1Bi.

      So i=1Ai=i=1X1Bi=X1i=1BiAX (since i=1BiB1)

    3. If $A \in \mathcal{A}_X$, then $\exists B \in \mathcal{B}^1$ such that $A = X^{-1}B$.

      So AC=(X1B)C=X1BCAX (since BCB1)

    By 1(contains null), 2(closed under unions), 3(closed under complementation), AX is a sub σ-algebra of A

    Def. Random Vector

    Recall

    A random variable is a function X:ΩR1 with the property that for any BB1, X1BA.

    Thus, when X is a random variable, P(X(ω)B)=P(X1B), since X1B={ω:X(ω)B}

    A random vector is a function X:ΩRk with the property that for any BBk, X1BA.

    Thus, when X is a random vector, P(X(ω)B)=P(X1B), since X1B={ω:X(ω)B}

    Properties

    Example (Pt. 1)

    Suppose we have Ω={1,2,3},A=2Ω, and the uniform prob measure P

    Let X=[X1X2]:ΩR2 be given by X(ω)=[X1(ω)X2(ω)] where X1,X2 are defined as X1(1)=0X1(2)=0X1(3)=1 and X2(1)=1X2(2)=0X2(3)=0

    X1{(0,1)}={1}X1{(0,0)}={2}X1{(1,0)}={3}X1B={(0,1),(0,0),(1,0)B{1} or {2} or {3} if only one of (0,0),(0,1),(1,0)B{1,2} or {1,3} or {2,3} if only two of (0,0),(0,1),(1,0)BΩ(0,1),(0,0),(1,0)B

    PX(B)={0(0,1),(0,0),(1,0)B1/3 if only one of (0,1),(0,0),(1,0)B2/3 if only two of (0,1),(0,0),(1,0)B1(0,1),(0,0),(1,0)B

    Example (Pt. 2)

    What if we change the def of X2? If X1,X2 are now X1(1)=0X1(2)=0X1(3)=1 and X2(1)=1X2(2)=1X2(3)=0, what is PX?

    Only 2 possible outputs now: (0,1) and (1, 0)

    X1{(0,1)}={1,2}X1{(1,0)}={3}X1B={(0,1),(1,0)B{1,2}(0,1)B,(1,0)B{3}(0,1)B,(1,0)BΩ(0,1),(1,0)B

    Then for BB2,PX(B)={0(0,1),(1,0)B2/3(0,1)B,(1,0)B1/3(0,1)B,(1,0)B1(0,1),(1,0)B

    Example (Pt. 3)

    If P is not uniform, but instead defined P({1})=12,P({2})=13,P({3})=16, what is PX?

    PX(B)={0(0,1),(1,0)B5/6(0,1)B,(1,0)B1/6(0,1)B,(1,0)B1o/w

    Prop 2.1.5 (Cartesian Prod of Borel Sets is a Borel Set)

    If B1,...,BkB1, then B1×...×Bk={(x1,...,xk)T|xiBi,i=1,...,k}Bk and Bk is the smallest σ-algebra on Rk containing all such sets

    Proof

    Consider the sets R1×...×Bi×...×R1 that only restrict the ith coord.

    Then {R1×...×Bi×...×R1|BiB1} is a sub σ-algebra of Bk

    Sub-proof Let By={B×R1×...×R1:BB1}

    (,b]×R1×...×R1Bk  bR1B×R1×...×R1Bk

    1. ×R1×...×R1=Bk
    2. If Bi×R1×...×R1By for i = 1, 2, …, then i=1Bi×R1×...×R1=(i=1Bi)×R1×...R1By since i=1BiB1
    3. If B×R1×...×R1By, then (B×R1×...×R1)C=BC×R1×...×R1By since BCB1.

    So B1×...×Bk=i=1k(R1×...×Bi×...×R1)Bk

    Since each k-cell $(a, b] = (a_1, b_1] \times \cdots \times (a_k, b_k]$ is of this form, there is no $\sigma$-algebra on $\mathbb{R}^k$ containing all such sets that is smaller than $\mathcal{B}^k$.

    Prop 2.1.6 (A Vector of R.V.s is a Random Vector)

    If Xi:ΩR1 is a random variable for i=1,...,k, then X=(X1,...,Xk)T:ΩRk is a random vector.

    Proof

    Suppose B1,...,BkB1. By the previous proposition, B1×...×BkBk. Then we have

    X1(B1×...×Bk)={ω:X(ω)B1×...×Bk}={ω:Xi(ω)Bi for i = 1, ..., k}=i=1kXi1BiA

    Since X1(a,b]A  a,bRkX1BA  BBkX is a random vector.

    Lecture 6

    Def. K-cells

    (a,b]=×i=1k(ai,bi], or ×i=1k(,bi]

    K-cells are the basic sets we want to assign probabilities to (using random vectors)

    For k = 2, $(a, b] = (a_1, b_1] \times (a_2, b_2]$ is a half-open rectangle in the plane (sketch omitted).

    Def. Cumulative Distribution Function (CDF)

    The cumulative distribution function $F_X : \mathbb{R}^k \to [0,1]$ for random vector $X \in \mathbb{R}^k$ is given by $F_X(x_1, ..., x_k) = P_X((-\infty, x_1] \times ... \times (-\infty, x_k]) = P_X((-\infty, x])$

    Def. Difference Operator

    For any g:RkR1, the i-th difference operator Δa,b(i) g:Rk1R1 is given by (Δa,b(i) g)(x1,...,xi1,xi+1,...,xk)=g(x1,...,xi1,b,xi+1,...,xk)g(x1,...,xi1,a,xi+1,...,xk)

    Prop 2.2.1 (Properties of Distribution Functions)

    Any distribution function FX:Rk[0,1] satisfies

    1. If aibi for i=1,...,k, then PX((a,b])=Δa1,b1(1)Δa2,b2(2)...Δak,bk(k)FX

    2. As any $x_i \to -\infty$, $F_X(x_1, ..., x_k) \to 0$

      As all $x_i \to \infty$, $F_X(x_1, ..., x_k) \to 1$

    3. FX is right continuous

      If δi0  i, then FX(x1+δ1,...,xk+δk)FX(x1,...,xk)

    Proof for (1)

     

    Proof for (2)

     

    Proof for (3)

     

    Thm 2.2.1 (Extension Theorem)

    If $F : \mathbb{R}^k \to [0,1]$ satisfies the 3 properties of distribution functions, then $\exists$ a unique probability measure $P$ on $\mathcal{B}^k$ such that $F$ is the distribution function of $P$

    Note such an F determines a probability model (Rk,Bk,P) and we can define a random vector with this model by taking Ω=Rk and X(ω)=ω

    Now we can present PX by a function of points (rather than sets) FX

    Def. Marginal Distributions

    Def. Discrete Probability Models

    Prop 2.3.1 (Countably Many Points with Positive Prob)

    Prop 2.3.2 (Prob Measure Defined by p)

    Def. Multinomial Distribution

    Def. Multivariate Hypergeometric Distribution

    Lecture 7

    Def. Continuous Probability Models

    Def. Absolutely Continuous Probability Models

    Def. Probability Density Functions (PDF)

    Prop 2.4.1 (Properties of A.C. Models)

    1. f(x)0 with probability 1
    2. Rkf(x)dx=1
    3. F(x)=F(x1,...,xk)=xk...x1f(z1,...,zk)dz1...dzk
    4. f(x)=f(x1,...,xk)=kF(x1,...,xk)x1...xk

    Prop 2.4.2 (Properties of PDFs)

    f:(Rk,Bk)(R1,B1) is a density function for a.c. model (Rk,Bk,P) if

    1. f(x)0  x
    2. Rkf(x)dx=1

    Def. Multivariate Normal Distribution

    Lecture 8 & 9

    Suppose we transform the random vector XRk to the random vector Y=T(X)R1

    Discrete case

    If X is discrete (with prob function pX), then pY(y)=PY({y})=PX(T1{y})=xT1{y}pX(x)

    Def. Projections (& their Prob Functions)

    Suppose k2 , then the projection on the first 2 coordinates is (y1,y2)=T(x1,,xk)=(x1,x2)

    Prob Function Derivation:

    To find the probability functions of projections, take the joint probability function, and sum out unwanted variables.

    T1{y}=T1{(y1,y2)}={(x1,,xk):x1=y1,x2=y2}

    py(Y)=pY(y1,y2)=PX(T1{y})=xT1{y}pX(x)=(x1,,xk):x1=y1,x2=y2pX(x1,,xk)=(x3,,xk)Rk2pX(y1,y2,x3,,xk)fix x1,x2 to y1,y2

    The projection on the second coordinate is y=T(x1,,xk)=x2

    Prob Function Derivation:

    T1{y}=T1{y}={(x1,,xk):x2=y}

    py(Y)=pY(y)=(x1,,xk):x2=ypX(x1,,xk)=(x1,x3,,xk)Rk1pX(x1,y,x3,,xk)

    Marginal of a Multinomial Random Vector

    Let X=(X1,,Xk) multinomial (n,p1,,pk), then pX(a)=(na1  ak)p1a1pkak
    where aRk, ai{0,,n}, and a1++ak=n

    Suppose k2, (y1,y2)=T(x1,,xk)=(x1,x2), and we want to find the distribution of Y=(X1,X2)

    By the defined constraints, y1,y2,a3,,ak{0,,n} and y1+y2+a3++ak=n

    a3,,ak{0,,ny1y2} and a3++ak=ny1y2()

    so  pY(y1,y2)=(a3,,ak) sat. ()(ny1 y2 a3 ak)p1y1p2y2p3a3pkak=n!y1!y2!p1y1p2y2(a3,,ak) sat. ()1a3!ak!p3a3pkaktook out terms where i=1,2=n!y1!y2!(ny1y2)!p1y1p2y2(a3,,ak) sat. ()(ny1y2)!a3!ak!p3a3pkakmultiplied prev by (ny1y2)!(ny1y2)!=(ny1  y2  ny1y2)p1y1p2y2(1p1p2)ny1y2multiply by (1p1p2)ny1y2(a3,,ak) sat. ()(ny1y2a3ak)(p31p1p2)a3(pk1p1p2)aksum of all multinomial(ny1y2,p31p1p2,...,pk1p1p2) probabilities, so =1divide by (1p1p2)a3+...+ak=(ny1  y2  ny1y2)p1y1p2y2(1p1p2)ny1y2

    Thus, (X1,X2) multinomial (n,p1,p2,1p1p2)

    Binomial(n, p) = Multinomial(n, p, 1-p)

    If X=(X1,,Xk) multinomial (n,p1,,pk), then prove Xi binomial (n,pi)= multinomial (n,pi,1pi) Note this is easy to see intuitively since the multinomial arises by placing n ind. observations into k mutually disjoint categories, and when we project onto l coordinates we are now categorizing into l+1 mutually disjoint categories

    PX1(x1)=x2=0nx1p(x1,x2,nx1x2)(x1,x2,nx1x2)=x2=0nx1(nx1 x2 nx2x2)p1x1pxx2(1p1p2)nx2x2=x2=0nx1n!x1!x2!(nx1x2)!p1x1pxx2(1p1p2)nx2x2now multiply by (nx1)!(nx1)!=n!x1!(nx1)!p1x1(1p1)nx1x2=0nx1(nx1)!x2!(xx1x2)!(p21p1)x2(1p21p1)nx1x2sum of all binomial(nx1,p21p1) probabilities, so = 1=n!x1!(nx1)!p1x1(1p1)nx1

    So X1binomial(n,p1)
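
    As a quick sanity check of this marginal (my own sketch, not part of the notes), one can simulate multinomial counts in R and compare the empirical distribution of the first coordinate with the binomial$(n, p_1)$ probabilities; the values $n = 20$ and $p = (0.2, 0.3, 0.5)$ are arbitrary choices.

    ```r
    # Sketch (not from the notes): the first coordinate of multinomial(n, p) draws
    # should match the binomial(n, p1) probability function.
    set.seed(1)
    n <- 20; p <- c(0.2, 0.3, 0.5)              # arbitrary illustrative values
    X  <- rmultinom(1e5, size = n, prob = p)    # 3 x 100000 matrix of counts
    x1 <- X[1, ]
    emp <- as.numeric(table(factor(x1, levels = 0:n))) / length(x1)
    head(cbind(x1 = 0:n, empirical = round(emp, 3),
               binomial = round(dbinom(0:n, n, p[1]), 3)), 10)
    ```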

    Sum of sub-Multinomial Random Vector ~ Binomial

    Use the previous note to determine the distribution of Y=X1++Xl for lk when (X1,,Xk) multinomial(n,p1,,pk)

    Note in the discrete case, if T is 1-1 and T1{y}ϕ, then pY(y)=PX(T1{y})=pX(T1{y})

    Y=X1+...+Xl is the number of responses falling in the first l categories.

    A response falls into one of these categories with probability p1+...+pl.

    So Ybinomial(n,p1+...+pl)

    Def. Indicator Function

    For AΩ, the indicator function IA:ΩR1 is given by IA(ω)={1 if ωA0 if ωAc

    Indicator Variable ~ Bernoulli(P(A))

    Prove: if (Ω,A,P) is a probability model and AA, then Y=IA is a random variable with YBernoulli(P(A))

    $A \in \mathcal{A}$, $I_A : \Omega \to \mathbb{R}^1$, and $\forall B \in \mathcal{B}^1$, $I_A^{-1}B = \{\omega : I_A(\omega) \in B\} = \begin{cases} \emptyset & 0, 1 \notin B \\ A & 1 \in B, 0 \notin B \\ A^C & 0 \in B, 1 \notin B \\ \Omega & 0, 1 \in B \end{cases} \in \mathcal{A}$

    Since for any BB1, IA1BA, we know Y=IA is a r.v.

    PY(1)=P(IA1{1})=P({ω:IA(ω)=1})=P(A)YBern(P(A))

    Transformation Determines Distribution Type

    Y=T(X) could have a discrete distribution no matter how X is distributed.

    E.g. Suppose T(x)=cRl for every x, then pY(y)=PX(T1{y})={PX(Rk)=1 if y=cPX()=0 if yc

    and the distribution of Y is degenerate at c

    E.g. Suppose XN(0,1), so P(X0)=P(X>0)=1/2

    Y=T(X)=I(,0](X)={1 if X00 if X>0 pY(1)=P(X0)=1/2pY(0)=P(X>0)=1/2 Y Bernoulli (1/2)

    Absolutely continuous case

    Suppose XRk has density function fX, and Y=T(X)Rl where lk.

    Y is also absolutely continuous with density fY which we want to determine.

    Cdf Method

    Generally, the cdf method works with projections T when there is a formula for FX :

    fy(y1)=kFY(y1,...,yl)y1...yl=kPX(T1{(,y1]×...×(,yl})y1...yl

    E.g. Define F:R2[0,1] by F(x1,x2)={0x1<0 or x2<01ex1ex2+ex1x2x10 and x20

    It was proved (in a lec 6 exercise) that this is a cdf (using thm 2.2.1),

    so f(x1,x2)=2F(x1,x2)x1x2={0x1<0 or x2<0ex1x2x10 and x20

    Check that f is a valid pdf:

    (i) f(x1,x2)0 for all (x1,x2)

    (ii) f is normed:

    f(x1,x2)dx1dx2=00ex1x2dx1dx2=0ex1dx10ex2dx2=(ex1|0)(ex2|0)=1

    so it is valid and we obtain F(x1,x2)=x1x2f(z1,z2)dz1dz2

    Therefore, if Y=T(X1,X2)=X1, then FX1(x1)=F(x1,)={0x1<01ex1x10

    so fX1(x1)=FX1(x1)x1={0x1<0ex1x10, and fX2(x2)=FX2(x2)x2={0x2<0ex2x20

    Thus, both X1 and X2 have exponential(1) distributions
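
    A small numerical check of this marginal (my addition, assuming the density above): integrating the joint density over $x_2$ at a fixed point should return the Exponential(1) density at that point.

    ```r
    # Sketch: integrate f(x1, x2) = exp(-x1 - x2) over x2 and compare with dexp.
    f  <- function(x1, x2) exp(-x1 - x2)
    x1 <- 1.3                                   # arbitrary evaluation point
    marginal <- integrate(function(x2) f(x1, x2), 0, Inf)$value
    c(numerical = marginal, exponential_1 = dexp(x1, rate = 1))
    ```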

    E.g. Suppose y=T(x1,x2)=x1+x2, and (X1,X2) has the triangular density f(x1,x2)={20<x1<x2<10o/w

    Fy(y)=PY((,y])=P(X1,X2)({(x1,x2):x1+x2y})={0y<00y/2x1yx12dx2dx1=y2/20y11y/21yx2x22dx1dx2=2yy2/211y212<yfY(y)={0y0 or y2y0<y<12y1y<2

    Change of Variable Method

    Suppose T:RkRk is 1-1 and smooth (i.e. all 1st order partial derivatives exist and are continuous),

    so T(X)=(T1(x)Tk(x)) and we can find the Jacobian JT(x)=|det(T1(X)x1T1(X)xkTk(X)x1Tk(X)xk)|1

    Since JT(x)=limδ0vol(Bδ(X))vol(TBδ(X)), JT1(x) indicates how T is changing volume at x,

    so JT(x)<1 means T expands volume at x, and JT(x)>1 means T contracts volumes at x=T1(y)

    If Y=T(X), then for small δ,

    fY(y)PY(TBδ(T1(y)))vol(TBδ(T1(y)))=PX(Bδ(T1(y)))vol(Bδ(T1(y))vol(Bδ(T1(y)))vol(TBδ(T1(y)))fX(T1(y))JT(T1(y))

    This intuitive argument can be made rigorous to prove the following.

    Proposition II.5.1 (Change of Variable)

    If T:RkRk is 1-1 and smooth, and Y=T(X) where X has an a.c. distribution with density fX,
    then Y has an a.c. distribution with density fY(y)=fX(T1(y))JT(T1(y))

    E.g. If we have a uniform dist f(x)=12 for 0<x<2, find the density for y=T(x)=x2.

    T1(y)=y1/2, JT(x)=|det(2x)|1=12x for x(0,2)

    Note: solving JT(x)=1, we see that T contracts lengths on (0, 1/2) and expands lengths on (1/2, 2)

    fY(y)=f(T1(y))JT(T1(y))=f(y1/2)12y1/2={0y0 or y41/4y1/20<y<4
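
    A simulation check of this change of variable (my own sketch): squaring Uniform$(0, 2)$ draws and overlaying the derived density $1/(4\sqrt{y})$ on the histogram.

    ```r
    # Sketch: Y = X^2 with X ~ Uniform(0, 2); the histogram of y should match
    # the change-of-variable density f_Y(y) = 1 / (4 * sqrt(y)) on (0, 4).
    set.seed(1)
    x <- runif(1e5, 0, 2)
    y <- x^2
    hist(y, breaks = 100, freq = FALSE, main = "Y = X^2, X ~ Unif(0, 2)")
    curve(1 / (4 * sqrt(x)), from = 0.01, to = 4, add = TRUE, col = "red", lwd = 2)
    ```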

    E.g. Prove φ(x)dx=1 for φ, the N(0,1) pdf.

    Consider (φ(x)dx)2=φ(x)dxφ(y)dy=12πexp(x2+y22)dxdy

    Make the polar coordinate change of variable T(x,y)=(r,θ) where for r(0,),θ[0,2π)

    (x,y)=T1(r,θ)=(rcosθ,rsinθ)

    JT1(r,θ)=|det(rcosθrrcosθθrsinθrrsinθθ)|1=|det(cosθrsinθsinθrcosθ)|1=|r(cos2θ+sin2θ)|1=1/r

    Since JT(x)=JT11(T(x))=1JT1(T(x))=r, and r2=x2+y2,

    (φ(x)dx)2=002πr2πexp(r2/2)dθdr=0rexp(r2/2)dr=exp(r2/2)|0=1

    so φ(x)dx=1

    Def. Affine Transformation

    (Affine transformation are linear transformations plus a constant.)

    T:RkRk is an affine transformation if
    T(x)=Ax+b=(a11x1++a1kxk+b1a21x1++a2kxk+b2ak1x1++akkxk+bk) where bRk,ARk×k

    So JT(x)=|det(T1(x)x1T1(x)xkTk(x)x1Tk(x)xk)|1=|detA |1

    Note: T(x1)=T(x2) iff A(x1x2)=0, so T is 1-1 iff A is a nonsingular (invertible) matrix, in which case, T1(y)=A1(yb)=x

    If Y=AX+b, then fY(y)=fX(T1(y))JT(T1(y))=fX(A1(yb))|detA |1

    Multivariate Normal

    Suppose $Z \sim N_k(0, I)$, so $f_Z(z) = (2\pi)^{-k/2} \exp(-z'z/2)$ for $z \in \mathbb{R}^k$

    Let X=AZ+μZ=A1(Xμ) where ARk×k is nonsingular and μRk, then since X is an affine transformation, we know it has an a.c. distribution with density:

    fX(x)=fZ(A1(xμ))|detA|1=(2π)k/2exp((A1(xμ))A1(xμ)/2)|detA|1plug in Z=A1(Xμ)=(2π)k/2|detA|1exp((xμ)(A1)A1(xμ)/2)reorder=(2π)k/2|detAdetA|1/2exp((xμ)(AA)1(xμ)/2)det(A)=det(AT)=(2π)k/2|detAA|1/2exp((xμ)(AA)1(xμ)/2)det(AB)=det(A)det(B)=(2π)k/2(detΣ)1/2exp((xμ)Σ1(xμ)/2)

    where Σ=AARk×k

    If a random vector X has this pdf, XNk(μ,Σ)

    Note Σ is symmetric, invertible, and positive definite (see note from lecture 2)
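
    A minimal simulation sketch of this construction (my addition; the particular $A$ and $\mu$ are arbitrary choices): generate $X = AZ + \mu$ and check that the sample mean and covariance approach $\mu$ and $\Sigma = AA'$.

    ```r
    # Sketch: X = A Z + mu with Z ~ N_2(0, I); sample moments should approach
    # mu and Sigma = A %*% t(A).
    set.seed(1)
    A  <- matrix(c(2, 0, 1, 1), 2, 2)           # arbitrary nonsingular A
    mu <- c(1, -1)
    Z  <- matrix(rnorm(2 * 1e5), nrow = 2)      # each column is one Z draw
    X  <- A %*% Z + mu                          # each column is one X draw
    rowMeans(X)                                 # ~ mu
    cov(t(X))                                   # ~ A %*% t(A)
    A %*% t(A)
    ```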

    Ex. Suppose XNk(μ,Σ) and Y=AX+b, where ARk×k is nonsingular and μRk.
    Prove that YNk(Aμ+b,AΣA).

    fY(y)=fX(A1(yb))|detA |1=(2π)k/2(detΣ)1/2exp((A1(yb)μ)Σ1(A1(yb)μ)/2)(detAA)1/2 merge determinants, take out A1 (AB)=BA,=(2π)k/2(detAΣA)1/2exp((A1[(yb)Aμ])Σ1(A1[(yb)Aμ)]/2)=(2π)k/2(detAΣA)1/2exp([ybAμ](A1)Σ1A1[ybAμ)]/2) (AB)1=B1A1=(2π)k/2(detAΣA)1/2exp([y(Aμ+b)](AΣA)1[y(Aμ+b)]/2)

    Ex. Suppose XNk(μ,Σ) and Σ=CC where CRk×k is nonsingular. Prove that Z=C1(Xμ)Nk(0,I).

    Z=C1X+(C1μ)X=CZ+μ

    fZ(z)=fX(Cz+μ)|detC1|1=(2π)k/2(detΣ)1/2exp((CZ+μμ)Σ1(CZ+μμ)/2)(detC1(C1))1/2=(2π)k/2(detCC)1/2(det(CC)1)1/2exp((CZ)(CC)1(CZ)/2)=(2π)k/2(detI)1/2exp(ZCC1C1CZ/2)=(2π)k/2exp(ZZ/2)

    Ex. Using k = 2, μ=(μ1,μ2) and Σ=(σ11σ12σ12σ22)by symmetry of Σ,σ21=σ12
    write out the density fX(X)=(2π)k/2(detΣ)1/2exp((Xμ)Σ1(Xμ)/2) in terms of x1 and x2

    Σ1=1σ11σ22σ122[σ22σ12σ12σ11]

    (xμ)Σ1(xμ)=1σ11σ22σ122[x1μ1x2μ2][σ22σ12σ12σ11][x1μ1x2μ2]=1σ11σ22σ122[(σ22(x1μ1)σ12(x2μ2))σ12(x1μ1)+σ11(x2μ2)][x1μ1x2μ2]=1σ11σ22σ122[σ22(x1μ1)22σ12(x1μ1)(x2μ2)+σ11(x2μ2)2]multiply and divide by σ11σ22=σ11σ22σ11σ22σ122[(x1μ1)2σ112σ12σ11σ22(x1μ1)(x2μ2)+(x2μ2)2σ22]sub in σ1=σ11,σ2=σ22,ρ=σ12σ1σ2=(1ρ2)1[(x1μ1σ1)22ρ(x1μ1σ1)(x2μ2σ2)+(x2μ2σ2)2]

    fX(x)=(2π)2/2(σ11σ22σ122)1/2exp(res/2)()=[σ11σ22(1ρ2)]1/2=[σ1σ2(1ρ2)]1/2=12πσ1σ2(1ρ2)1/2exp{res/2}

    Def. Spectral Decomposition

    For any positive definite matrix $\Sigma \in \mathbb{R}^{k \times k}$, $\Sigma = Q \Lambda Q' = \sum_{i=1}^{k} \lambda_i q_i q_i'$ where

    $Q = (q_1 \ \cdots \ q_k) \in \mathbb{R}^{k \times k}$ is orthogonal, i.e. $Q^T Q = Q Q^T = I$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$ with $\lambda_1 \ge \dots \ge \lambda_k > 0$

    Recall D=P1AP

    The diagonal entries of D are the eigenvalues λi, and the column vectors of P are the eigen vectors

    So Σ=QΛQ the diagonal entries of Λ are the eigen values, and column vectors of Q are the eigen vectors

    Properties of the Multivariate Normal

    For μRk, positive definite ΣRk×k, is fX(x)=(2π)k/2(detΣ)1/2exp((xμ)Σ1(xμ)/2) a valid pdf, such that XNk(μ,Σ)?

    Ex. Prove: If Σ is p.d. with spectral decomposition QΛQ, then Σ1=QΛ1Q

    ΣΣ1=QΛQ(QΛ1Q)=QΛΛ1Q=QQ=I
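
    A numerical check of the spectral decomposition and of this exercise (my own sketch; the $\Sigma$ below is an arbitrary positive definite matrix, not one from the notes).

    ```r
    # Sketch: Sigma = Q Lambda Q' and Sigma^{-1} = Q Lambda^{-1} Q'.
    Sigma <- matrix(c(4, 1, 1, 3), 2, 2)        # arbitrary p.d. matrix
    e <- eigen(Sigma)
    Q <- e$vectors
    Q %*% diag(e$values) %*% t(Q)               # reconstructs Sigma
    Q %*% diag(1 / e$values) %*% t(Q)           # equals solve(Sigma)
    solve(Sigma)
    ```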

    Lecture 10

    Def. Stochastic Process

    A stochastic process (or random process) is a set {(t,Xt):tT} where Xt is a random variable defined w.r.t. the probability model (Ω,A,P), and T is called the index set of the process.

    Note In many applications, we need to consider stochastic processes since their dependence on index (t) is important. T can be infinite, negative, and multi-dimensional. It can be a very general set (like the nodes of a graph). Stochastic processes where t is time are referred to as time series.

    E.g. A random vector X=(X1,...,Xk) is equiv. to the stochastic process {(t,Xt):tT} where T={1,...,k}

    E.g. A random variable X1 is equiv. to the stochastic process {(t,Xt):tT} where T={1}

    E.g. Suppose a coin is tossed (tosses are ind.) until the 1st head is observed and we record that number.

    Denoting a head by 1 and a tail by 0, then the sample space is Ω={(ω1,ω2,):ωi{0,1}}=×i=1{0,1},
    i.e. the set of all sequences of 0's and 1's. We need an infinite dimensional Ω here.

    If we define Xi(ω)=ωi, then {(t,Xt):tN} is a stochastic process, called a Bernoulli(p) process.

    If we define Y(ω)=i when j<i, ωj=0 and ωi=1, is Y a well defined r.v.?

    Let p= probability of a head on a single toss.
    If p>0, then Y is defined since P(an infinite sequence of tails) = limnP(first n are tails)=limn(1p)n=0
    Otherwise, Y is undefined.

    Then using independence, we obtain the pdf for Y geometric(p)

    pY(i)=PY({i})=P(Ai)Ai={ω:ω1=0,,ωi1=0,ωi=1}=(1p)i1pP({ω:ωi+1{0,1},ωi+2{0,1},})=(1p)i1p

    Ex. Prove that pY defines a probability distribution.

    1. $p_Y(i) \ge 0 \ \forall i \in \{1, 2, ...\}$
    2. $\sum_{i=1}^{\infty} p_Y(i) = p \sum_{i=1}^{\infty} (1-p)^{i-1} = p \underbrace{\sum_{i=0}^{\infty} (1-p)^{i}}_{\text{geometric series}} = p \cdot \frac{1}{1 - (1-p)} = 1$
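
    A simulation sketch of the Bernoulli($p$) process argument above (my addition; $p = 0.3$ is arbitrary): tossing a $p$-coin until the first head reproduces the geometric probabilities $p(1-p)^{i-1}$.

    ```r
    # Sketch: record the toss number of the first head and compare its empirical
    # frequencies with p * (1 - p)^(i - 1).
    set.seed(1)
    p <- 0.3
    first_head <- function() { i <- 1; while (rbinom(1, 1, p) == 0) i <- i + 1; i }
    y <- replicate(1e4, first_head())
    rbind(empirical   = as.numeric(table(factor(y, levels = 1:8))) / length(y),
          theoretical = p * (1 - p)^(0:7))
    ```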

    Def. Sample Function

    Each realized value of a stochastic process can be thought of as a function (called sample function) X(ω):TR1, with value Xt(ω) at index t. So in effect, a stochastic process is a probability measure on functions from TR1.

    A stochastic process is a generalization of a r.v. X where we start with (Ω,A,P) and get PX as follows:

    E.g. If $T = [0, \infty)$, $\omega \sim N(0,1)$, $X_t(\omega) = \omega t$, then $X$ gives a ray from the origin with $N(0,1)$ distributed slope.

    E.g. If T=[0,1],ω Uniform(0, 10), Xt(ω)=cos(ωt), then X gives a cosinusoid with random frequency.

    Prop 2.6.1 (Kolmogorov's Consistency Theorem)

    Background If for {s1,...,sm}{t1,...,tn}, the distribution of (Xs1,...,Xsm) can be obtained from that of (Xt1,...,Xtn), then we can say that the finite dimensional distributions are consistent.

    Suppose TR, and a probability model (Rn,Bn,P(t1,...,tn)) is given for each {t1,...,tn}T.
    If the probability models are consistent, then a probability model (Ω,A,P) and random variables Xt such that {(t,Xt):tT} is a stochastic process with P(Xt1,...,Xtn)=P(t1,...,tn)

    E.g. Let T={1,2,...},P(t1,...,tn) be the discrete prob measure concentrated on {0,1}n given by the prob function

    p(t1,...,tn)(x1,...,xn)={i=1npxi(1p)1xi(x1,...,xn){0,1}n0o/w

    These distributions are consistent. Below is proof for T={1,2}

    x2=01pt1(x1)pt2(x2)=x2=01px1(1p)1x1px2(1p)1x2=px1(1p)1x1x2=01px2(1p)1x21p+p=0=pt1(x1)

    So by Kolmogorov's Consistency Thm, this is a valid definition of a s.p. (stochastic process) {(t,Xt):tT}

    Def. Gaussian Process

    A s.p. {(t,Xt):tT} is a Gaussian process whenever (Xt1,,Xtn)Nn(μ(t1,,tn),Σ(t1,,tn))

    where μ(t1,,tn)Rn is the mean vector, and Σ(t1,,tn)Rn×n is the variance matrix, which is p.d. for every {t1,,tn}T

    Def. Gaussian White Noise Process

    For {t1,,tn}T, μ(t1,,tn)=(0,,0), Σ(t1,,tn)=diag(σ2(t1),,σ2(tn)) where σ2:T(0,)

    So (Xt1,,Xtn)Nn(0,diag(σ2(t1),,σ2(tn)))

    Lecture 11

    Recall Def. Mutual Stat Ind:

    When (Ω,A,P) is a probability model and {Aλ:λΛ} is a collection of sub σ-algebras of A, the Aλ are mutually statistically independent if P(A1...An)=i=1nP(Ai)   n,, where distinct λ1,...,λnΛ, and A1Aλ1,...,AnAλn.

    Recall Def. Sigma Algebra Generated by C: A(C) is obtained by intersecting all σ-algebras containing C2Ω.

    The σ-algebra generated by r.v. X is a sub σ-algebra of A, given by AX=X1B1={X1B:BB1}

    Ex. Prove that AX is a sub σ-algebra of A

    1. X1=AX

    2. A1,A2,...AX B1,B2,...B1 such that Ai=X1Bii=1Ai=i=1X1Bi=X1i=1Bi

      Since i=1BiB1,i=1AiAX

    3. AAX BB1 such that A=X1BAC=(X1B)C=X1BC

      Since BCB1, ACAX

    Def. Statistically Independent RVs

    For the collection of random variables {Xλ:λΛ}, the Xλ are mutually statistically independent if the AXλ are mutually statistically independent in the collection of σ-algebras{AXλ:λΛ}.

    Prop 2.7.1 (Mut. Stat. Ind iff Joint = Prod of Marginals)

    For the collection of random variables {Xλ:λΛ}, the Xλ are mutually statistically independent iff the joint cdf of (Xλ1,,Xλn) factors as the product of the marginal cdfs, where {λ1,,λn}Λ.

    I.e. F(Xλ1,,Xλn)(x1,,xn)=i=1nFXλi(xi) for every (x1,,xn),

    Proof

    () Suppose mut. stat. ind., show the factorization holds

    F(Xλ1,,Xλn)(x1,,xn)=P(Xλ1,,Xλn)((,x1]××(,xn])=P({Xλ1(,x1]}{Xλn(,xn]})=i=1nP({Xλi(,xi]})=i=1nFXλi(xi)

    () Suppose the factorization holds, show mut. stat. ind.

    The cdf F(xλ1,,xλn) determines P(Xλ1,,Xλn). Since the cdf of the joint probability measure i=1nFXλj is obtained by multiplying the marginal probability measures PXλi we know that Xλ1,,Xλn are mutually statistically independent.

    Also, the collection of cdfs {i=1nFXλi:{λ1,,λn}Λ for some n} is consistent. By KCT, this determines PX, and so the collection of random variables {Xλ:λΛ} are mutually statistically independent.

    Prop 2.7.2 (Mut. Stat. Ind iff Joint = Prod of Marginals)

    For the collection of random variables {Xλ:λΛ} and each {λ1,,λn}Λ:

    • if each (Xλ1,,Xλn) has a discrete distribution, then the Xλ are mutually statistically independent iff p(xλ1,,xλn)(x1,,xn)=i=1npXλi(xi) for every (x1,,xn)
    • if each (Xλ1,,Xλn) has an a.c. distribution, then the Xλ are mutually statistically independent iff f(Xλ1,,Xλn)(x1,,xn)=i=1nfXλi(xi) for every (x1,,xn)

    Proof Discrete case

    Suppose {Xλ:λΛ} are mutually stat. ind., then by Prop 2.7.1 (F(Xλ1,,Xλn)(x1,,xn)=i=1nFXλi(xi)), the cdf of Xλi is FXλi.

    Since Xλi has a discrete distribution, it has the probability function PXλi=FXλilimδi0FXλi(xiδi),
    so Xλ1,...,Xλn has the probability function

    P(Xλ1,...,Xλn)(xλ1,...,xλn)=limδ10...limδn0P(Xλ1,...,Xλn)(×i=1n(xiδi,xi])=limδ10...limδn0i=1n(FXλiFXλi(xiδi))=i=1nlimδi0(FXλiFXλi(xiδi))=i=1npXi(xi)=z1x1,...,znxnpxλi(z1)...pxλn(zn)=i=1nzixipxλi(zi)=i=1nFxλi(xi)

    Proof Absolutely Continuous Case

    Prop 2.7.1 F(Xλ1,,Xλn)(x1,,xn)=i=1nFXλi(xi) implies f(Xλ1,,Xλn)(x1,,xn)=i=1nFXλi(xi)xi=i=1nfxλi(xi) which implies F(Xλ1,,Xλn)(x1,,xn)=i=1nxifxλi(xi)dzi=i=1nFXλi(xi)

    So the $\{X_\lambda : \lambda \in \Lambda\}$ are mutually statistically independent by Prop 2.7.1.

    E.g. Bernoulli (p) process

    For any {t1,,tn}T and (x1,,xn){0,1}n

    p(xt1,,xtn)(x1,,xn)=pi=1nxi(1p)ni=1nxi=i=1npxi(1p)1xi=i=1npxti(xi)

    with XtiBernoulli(p), so by Prop. 2.7.2, the Xλ are mut. stat. ind.

    E.g. Gaussian white noise process

    Since (Xt1,,Xtn)Nn(0,diag(σ2(t1),,σ2(tn))) for any {t1,,tn}T, and

    f(Xλ1,,Xλn)(x1,,xn)=(2π)n/2(σ2(t1)σ2(tn))1/2exp(12i=1nxi2σ2(ti))=i=1n(2π)1/2σ1(ti)exp(12xi2σ2(ti))=i=1nfXti(xi) for (x1,,xn)Rn with XtiN(0,σ2(ti))

    the Xλ are mut. stat. ind. by Prop. 2.7.2

    Def. Principal Components

    Suppose XNk(μ,Σ) where Σ=QΛQ (spectral decomp), then Y=QXNk(Qμ,QΣQ)=Nk(Qμ,Λ), so

    fY(y)=i=1n(2π)1/2λi1/2exp(12(yiqiμ)2λi)=i=1nfYi(yi)

    with Yi=qiX=j=1kqjiXjN(qiμ,λi)=N(j=1kqjiμj,λi)

    Thus, the principal components $Y_1, \dots, Y_k$ are mut. stat. ind.
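
    A short simulation sketch of principal components (my own illustration; $\mu$ and $\Sigma$ are arbitrary): the components $Y = Q'X$ should have an approximately diagonal sample covariance with the eigenvalues on the diagonal.

    ```r
    # Sketch: Y = Q'X for X ~ N_2(mu, Sigma) with Sigma = Q Lambda Q'.
    set.seed(1)
    Sigma <- matrix(c(4, 1, 1, 3), 2, 2); mu <- c(1, 2)   # arbitrary choices
    e <- eigen(Sigma); Q <- e$vectors
    A <- Q %*% diag(sqrt(e$values))             # Sigma = A A', so X = A Z + mu
    X <- A %*% matrix(rnorm(2 * 1e5), nrow = 2) + mu
    Y <- t(Q) %*% X                             # principal components
    round(cov(t(Y)), 2)                         # ~ diag(eigenvalues)
    e$values
    ```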

    Lecture 12

    Suppose X is a random vector with prob. measure PX, and Y=T(X)=y is observed

    We want the conditional distribution of X given T(X)=y

    Conditional Dist - Discrete

    Suppose X has a discrete distribution with probability function pX

    When T(x)y, the conditional probability function of X given T(x)=y is pXY(xy)=0
    When T(x)=y, pXY(xy)=PXY({x}T1{y})=PX({x}T1{y})PX(T1{y})=pX(x)zT1{y}pX(z)=pX(x)pY(y)

    E.g. Conditioning the Multinomial(n,p1,,pk)

    Suppose Y=T(X1,,Xk)=X1binomial(n,p1), we want to find the conditional probability function of (X1,,Xk)X1=x1 or equivalently (X2,,Xk)X1=x1

    So, for x2,,xk{0,,nx1},x2++xk=nx1

    p(X2,,Xk)X1(x2,,xkx1)=(nx1x2xk)p1x1p2x2pkxk(nx1)p1x1(1p1)nx1=(nx1)!x2!xk!(p21p1)x2(pk1p1)xk

    Therefore, (X2,,Xk)X1=x1 multinomial (nx1,p21p1,,pk1p1)

    Ex. If Xmultinomial(n,p1,,pk) and Y=X1++Xl for some lk, then determine the conditional distribution of X given Y=y

    The sum of multinomial r.v.'s Y=X1++Xl binomial (n,p1++pl).

    P(X1,,Xk)Y(x1,,xky)

    =(nx1xk)p1x1pkxk(ny)(i=1lpi)y(1i=1lpi)ny

    =y!x1!xl!(p1i=1lpi)x1...(pli=1lpi)xl(ny)!xl+1!xk!(pl+1i=l+1kpi)xl+1(pki=l+1kpi)xk

    multinomial (y,p1i=1lpi,pli=1lpi) multinomial (ny,p+1i=l+1kpi,,pki=l+1kpi)

    Conditional Dist - A.C.

    Suppose X has a.c. distribution with density function fX and T:RkRl is smooth. If xT1{y}, then the conditional density function of X given T(X)=y is

    fXY(xy)=limδ10,δ20{PX(Bδ1(x)T1Bδ2(y))Vol(Bδ1(x)T1Bδ2(y))/PY(Bδ2(y))Vol(Bδ2(y))}= fact fX(x)JT(x)fY(y)

    where (now allowing T to be many to one)

    JT(x)=|det(T1(x)x1T1(x)xkTl(x)x1Tl(x)xk)(T1(x)x1T1(x)xkTl(x)x1Tl(x)xk)|1/2

    E.g. Projections

    If T(x1,,xk)=(x1,x2) then l=2

    JT(x)=|det(T1(x)x1T1(x)xkT2(x)x1T2(x)xk)(T1(x)x1T2(x)x1T1(x)xkT2(x)xk)|1/2=|det(100010)(100100)|1/2=|det(1001)|1/2=1

    f(X1,X2)(x1,x2)=f(X1,,Xk)(x1,,xk)dx3dxkf(X3,,Xk)(X1,X2)(x3,,xkx1,x2)=f(X1,,Xk)(x1,,xk)f(X1,X2)(x1,x2)

    Ex. Repeat the above when T(x1,,xk)=x1.

    Since T(x1,,x2)=x1, JT(x1,,xn)=|det(Tx1,..Txk)(Tx1Txk)|12=|det(100)(100)|12=|det(1)|12=1 fX1(x1)=f(X1,,Xk)(x1,z2,,zk)dz1,dzk f(X2,,Xk)X1(x2,,xkx1)=f(X1,,Xk)(x1,,xk)JT(x1,,xk)fX1(x1)=f(X1,,Xk)(x2,,xk)fX1(x1)

    E.g. Projection conditionals of the Nk(μ,Σ)

    Suppose XNk(μ,Σ) and X1=T(X)=(X1,,Xl) for lk

    Partition μ and Σ as

    μ=(μ1μ2) where μ1Rl,μ2RklΣ=(Σ11Σ12Σ12Σ22) where Σ11Rl×l,Σ12Rl×(kl),Σ22R(kl)×(kl)

    Ex. Prove that Σ11 and Σ22 are p.d. when Σ is p.d.

    XΣX=[X1X2][Σ11Σ12Σ12Σ22][X1X2]=[X1Σ11+X2Σ12X1Σ12+X2Σ22][X1X2]=[X1Σ11X1+X2Σ12X1+X1Σ12X2+X1Σ22X2]=[X1Σ11X1+2X1Σ12X2+X1Σ22X2]0 (>0 if X0)

    X2=0,X10X1Σ11X1=XΣX>0Σ11 is p.d.

    X1=0,X20X2Σ22X2=XΣX>0Σ22 is p.d.

    In order to obtain the distribution of Y, we need another matrix decomposition:

    Def. Gram-Schmidt (QR) decomposition

    Let A=(a1ak)Rk×k be a matrix of rank k (so it is nonsingular/invertible) whose columns form a basis for Rk I.e. a1,,ak are linearly independent (c1a1++ckak=0 iff c1==ck=0) and span Rk — the linear span is L{a1,,ak}={c1a1++ckak:c1,,ckR1}=Rk

    Applying the Gram-Schmidt process to {a1,,ak}, we obtain an orthonormal basis {q1,,qk} for Rk

    q1=a1a1,r11=a1>0

    q2=a2(q1a2)q1a2q1a2,r12=q1a2,r22=a2(q1a2)q1>0

    Q=(q1qk)Rk×k is an orthogonal matrix, and R is a unique upper triangular matrix with positive diagonals

    So A can be decomposed into Q and R:

    QR=(q1qk)(r11r12r1k0r22r2k00rkk)=A

    Def. Orthogonal Matrix

    A matrix Q is orthogonal iff QTQ=QQT=I

    Properties:

    Ex. For A=QR, prove that R is unique given Q

    Suppose distinct R1,R2 that are upper triangular with positive diagonals. I.e. A=QR1=QR2

    Since Q1=Q, we can rearrange the above to get QA=R1=R2, which contradicts the hypothesis.

    Def. Cholesky Decomposition

    The Cholesky decomposition of Σ is obtained by applying the QR decomposition to Σ1/2=QR Σ=Σ1/2Σ1/2=by symmetry(Σ1/2)Σ1/2=(QR)(QR)=RQQR=RR

    Ex. Prove the following properties of upper triangular matrices with positive diagonals (like R), for 2x2 matrices.

    1. The product of 2 upper triangular matrices with positive diagonals is upper triangular with positive diagonals.

      [a1b10c1][a2b20c2]=[a1a2a1b2+b1c20c2c2] is upper triangular with positive diagonals

    2. An upper triangular matrix with positive diagonals is nonsingular, and its inverse is upper triangular with positive diagonals that are equal to the inverse of the diagonal elements of the original matrix.

      [ab0c] is nonsingular/invertible since its determinant > 0

      [ab0c]1=1ac[cb0a]=[1/ab/ac01/c]

    3. The matrix R in the Cholesky decomposition is unique.

      Suppose Σ=R1R1=R2R2 where R1,R2 are upper triangular matrices with positive diagonals.

      Then dividing LHS and RHS by the middle, we get I=(R1)1R2AR2R11A (A is just an arbitrary letter)

      So A=A1 must hold:

      A=(R1)1R2=(R2R11)

      A1=(R2,R11)1

      But A is lower triangular and A1 is upper triangular.

      This means they are both diagonal matrices, so A is diagonal as well.

      AA=[a0bc][ab0c]=[a2ababb2+c2]{a2=1a=1ab=0b=0b2+c2=1c=1

      So A=IR2R11=IR2=R1

    Challenge: Generalize (i), (ii) and (iii) to k×k upper triangular matrices.

     

    Prop 2.8.1 (Marginal Dist. of Normal X1)

    If $X = \binom{X_1}{X_2} \sim N_k\left(\binom{\mu_1}{\mu_2}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}\right)$ where $X_1 \in \mathbb{R}^l$, then $X_1 \sim N_l(\mu_1, \Sigma_{11})$.

    Proof Applying the Cholesky Decomp to Σ, we get

    Σ=(Σ11Σ12Σ12Σ22)=(R110R12R22)(R11R120R22)=(R11R11CholeskyR11R12R12R11R12R12+R22R22)

    Now, let ZNk(0,I), and using the fact that Z1,,Zk are mut. stat. ind., we can partition Z s.t.

    Z=(Z1Z2) where Z1Nl(0,I) stat. ind. of Z2Nkl(0,I)

    Recall from lecture 9 : a+AZNk(a,AA)

    X=(X1X2)=(μ1μ2)+(R11R120R22)(Z1Z2)express X as μ+RZ=(μ1μ2)+(R110R12R22)(Z1Z2)=(μ1+R11Z1μ2+R12Z1+R22Z2)

    So X1=μ1+R11Z1Nl(μ1,R11R11)=Nl(μ1,Σ11)marginal dist of first l coordinates

    Ex. If Ir denotes an r×r identity matrix, then use a matrix of the form C=(0IklIl0) to determine the distribution of X2 in Proposition 2.8.1.

    C[X1X2]=[X2X1]Nk([μ2μ1],[Σ22Σ12Σ12Σ11])

    In general, for a permutation matrix A, each row and column only contains a single 1, and the remaining entries are 0. Use such a matrix to determine the marginal distribution of any sub vector of X.

    Prop 2.8.2 (Marginal Dist. of Y is ind. of X1)

    If X=(X1X2)Nk((μ1μ2),(Σ11Σ12Σ12Σ22)) where X1Rl, then Y=X2Σ12Σ111X1 and

    YNkl(μ2Σ12Σ111μ1,Σ22Σ12Σ111Σ12) which is stat. ind. of X1Nl(μ1,Σ11)

    Proof

    (X1Y)=(I0Σ12Σ111I)(X1X2)=AX

    AXNk(Aμ,AΣA) where Aμ=(μ1μ2Σ12Σ111μ1),AΣA=(Σ1100Σ22Σ12Σ111Σ12)

    This proves the first part. Now observe in general, if W=(W1W2)Nk((v1v2),(Σ1100Σ22)), then

    fW(w1,w2)=(2π)k/2(det(Σ1100Σ22))1/2exp(12(w1v1w2v2)(Σ11100Σ221)(w1v1w2v2))=(2π)1/2(detΣ11)1/2exp((w1v1)Σ111(w1v1)/2)×(2π)(k1)/2(detΣ22)1/2exp((w2v2)Σ221(w2v2)/2)

    Since the density factors, W1 and W2 are statistically independent and this proves the second part.

    Ex. Prove the second line of above proof: AXNk(Aμ,AΣA) where Aμ=(μ1μ2Σ12Σ111μ1),AΣA=(Σ1100Σ22Σ12Σ111Σ12)

    Suppose (xi1,,xil) is a sub-vector of XNk(μ,Σ).

    Let ARk×k be the matrix with the basis vector ein in the first l rows (1nl), and the last kl rows contain the remaining basis vectors in any order.

    AX=[A1   l rowsA2k-l rows]X=[A1XA2X] where A1X=[Xi1Xil]

    (AX)Nk(Aμ,AΣA) where Aμ=[A1μA2μ],AΣA=[A1ΣA1A1ΣA2A2ΣA1A2ΣA2]

    Thus, A1X=[Xi1Xil]N(A1μ,A1ΣA1) where A1μ=[μi1μil],AΣA1=[σi1i1...σi1ilσilii...σilil]

    Corollary 2.8.2 (Conditional Dist of X2|X1)

    X2X1=x1Nkl(μ2+Σ12Σ111(x1μ1),Σ22Σ12Σ111Σ12)

    Proof:

    Make the transformation (x1x2)=T(x1y)=(x1y+Σ12Σ111x1) which has JT(x1,y)=|det(I10 stuff Ik1)|1=1

    By the change of variable, fX(x1,x2)=fX1(x1)fY(x2Σ12Σ111x1)

    So, by conditioning on projections, fX2X1(x2x1)=fX(x1,x2)fX1(x1)=fY(x2Σ12Σ111x1)

    =(2π)(k1)/2(det(Σ22Σ12Σ111Σ12))1/2exp(()(Σ22Σ12Σ111Σ12)1()2)

    where ()=x2(μ2+Σ12Σ111(x1μ1))called the regression of X2 on X1

    Def. Monte Carlo Estimation

    Suppose we want to compute PX(A). Sometimes this can be computed exactly but typically we need to resort to Monte Carlo simulation and estimate PX(A).

    Suppose then we have an algorithm that allows us to generate XPX. Since IA(X)Bernoulli(PX(A)), we can generate X1,,XnPX and estimate PX(A) by computing the proportion of sampled values falling in A: P^X(A)=1ni=1nIA(Xi) which has standard error 1nP^X(A)(1P^X(A))

    So the interval [P^X(A)±3SE] contains the value PX(A) with virtual certainty, provided n is large enough
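
    A minimal R sketch of this recipe (my addition; the target $P(X > 1.5)$ for $X \sim N(0,1)$ is just an arbitrary illustration with a known exact answer).

    ```r
    # Sketch: Monte Carlo estimate of P_X(A), its standard error, and the
    # "virtual certainty" interval estimate +/- 3 SE.
    set.seed(1)
    n <- 1e4
    x <- rnorm(n)
    phat <- mean(x > 1.5)                       # proportion of draws in A
    se   <- sqrt(phat * (1 - phat) / n)
    c(estimate = phat, lower = phat - 3 * se, upper = phat + 3 * se)
    1 - pnorm(1.5)                              # exact value, for comparison
    ```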

    R practice

    $\Sigma = \begin{pmatrix} 21 & 26 & 24 \\ 26 & 34 & 30 \\ 24 & 30 & 36 \end{pmatrix}$

    (a) Using the R software compute Σ1/2 (command eigen). Verify Σ=Σ1/2Σ1/2 numerically (up to small rounding errors).

    (b) Using the R software compute the Cholesky factor R. (command chol). Verify Σ=RR numerically (up to small rounding errors).
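
    A possible R sketch for (a) and (b) (assuming the $\Sigma$ above; output not reproduced here).

    ```r
    # (a) Sigma^{1/2} via the spectral decomposition, (b) Cholesky factor via chol().
    Sigma <- matrix(c(21, 26, 24,
                      26, 34, 30,
                      24, 30, 36), 3, 3, byrow = TRUE)
    e <- eigen(Sigma)
    Sig_half <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
    max(abs(Sig_half %*% Sig_half - Sigma))     # ~ 0 up to rounding error

    R <- chol(Sigma)                            # upper triangular, Sigma = R'R
    max(abs(t(R) %*% R - Sigma))                # ~ 0 up to rounding error
    ```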

    Suppose μ=(0,1,2).

    (a) Using the R software and the representation $X = \mu + \Sigma^{1/2} Z$, where $Z \sim N_3(0, I)$, generate a sample of $n = 10^3$ from the $N_3(\mu, \Sigma)$ distribution and based on this sample estimate P(X10) and provide the interval containing the exact value with virtual certainty.

    (b) Using the R software and the representation $X = \mu + R'Z$, where $Z \sim N_3(0, I)$, generate a sample of $n = 10^3$ from the $N_3(\mu, \Sigma)$ distribution and based on this sample estimate P(X10) and provide the interval containing the exact value with virtual certainty.

    (c) Compare the two estimates.

    Results are very similar. Part a has error = 0.0146253205093085, part b has error = 0.0149256490646136.
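
    A possible R sketch for (a)–(c). The event whose probability is being estimated did not survive in the notes, so the event $A = \{x : x_1 + x_2 + x_3 \le 10\}$ used below is only a stand-in of my own.

    ```r
    # Sketch: sample N_3(mu, Sigma) via (a) X = mu + Sigma^{1/2} Z and
    # (b) X = mu + R'Z, then estimate P(A) with its +/- 3 SE interval.
    set.seed(1)
    Sigma <- matrix(c(21, 26, 24, 26, 34, 30, 24, 30, 36), 3, 3)
    mu <- c(0, 1, 2); n <- 1e3
    e <- eigen(Sigma)
    Sig_half <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
    R  <- chol(Sigma)                           # Sigma = R'R
    Z  <- matrix(rnorm(3 * n), nrow = 3)
    for (X in list(mu + Sig_half %*% Z, mu + t(R) %*% Z)) {
      phat <- mean(colSums(X) <= 10)            # placeholder event A
      se   <- sqrt(phat * (1 - phat) / n)
      print(round(c(estimate = phat, lower = phat - 3 * se, upper = phat + 3 * se), 3))
    }
    ```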

    Ex. Suppose XN2((12),(5/21/21/25/2))

    (a) Determine the conditional distribution X2X1=2.

    X2X1=x1Nkl(μ2+Σ12Σ111(x1μ1),Σ22Σ12Σ111Σ12)

    X2|X1=2N(μ2+σ12σ11(2μ1),σ22σ122σ11)

    (b) Using the conditional distribution in (a) compute the conditional probability of A={(x1,x2):x12+x225}.

    x22522=1PX2X1(Ax1)=PX2X1({x2:x221} | 2)=PX2X1(1x212) =PX2X1(12.22.4x22.22.412.22.4)=P(2.066z0.775) where ZN(0,1) =Φ(0.775)Φ(2.066)=0.120

    (c) Estimate the unconditional probability of A.

    It is difficult to evaluate P(X1,X2)(A) directly so we proceed via Monte Carlo:
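
    A Monte Carlo sketch of the step the notes lead into (my own code; it assumes the $\mu$ and $\Sigma$ given in this exercise).

    ```r
    # Sketch: estimate the unconditional P(A), A = {x1^2 + x2^2 <= 5}, by sampling
    # (X1, X2) ~ N_2(mu, Sigma) and computing the proportion landing in A.
    set.seed(1)
    n <- 1e5
    mu <- c(1, 2)
    Sigma <- matrix(c(5/2, -1/2, -1/2, 5/2), 2, 2)
    L <- t(chol(Sigma))                          # Sigma = L %*% t(L)
    X <- mu + L %*% matrix(rnorm(2 * n), nrow = 2)
    phat <- mean(colSums(X^2) <= 5)
    se   <- sqrt(phat * (1 - phat) / n)
    round(c(estimate = phat, lower = phat - 3 * se, upper = phat + 3 * se), 4)
    ```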

    3. Expectation

    Lecture 13

    Properties of Indicator Functions

    Recall the definition of indicator functions for AA: IA(ω)={1 if ωA0 if ωAc

    IAc(ω)=1IA(ω)Ii=1nAi=i=1nIAiIi=1nAi=1i=1nIAic=1i=1n(1IAi)=i=1nIAii<jIAiIAj++(1)n+1i=1nIAi=i=1nIAii<jIAiAj++(1)n+1Ii=1nAi

    Def. Simple Function

    If A1,,AlA and a1,,alR1, a function X:ΩR1 given by X(ω)=i=1laiIAi(ω) is a simple function.

    A simple function must be a r.v. that takes only finitely many values.

    Note If X1,X2 are simple functions, then so are their product X1X2 and linear combination a0+a1X1+a2X2 for any constants a0,a1,a2, since it is also a r.v. that takes finitely many values

    Ex. Prove that any r.v. that takes only finitely many values is a simple function.

    Suppose X is a r.v. that takes only finitely many values c1,...,cm.

    Since {ci}B1 and X is a r.v., Ci=X1{ci}A.

    Thus X=i=1mciICi which takes the form of a simple function.

    Def. Canonical Form

    In canonical form, X(ω)=i=1mciICi(ω) where Ci=X1{ci}A, and c1,,cmR1 are the distinct values taken by simple function X, so i=1mCi=Ω, and when ij, CiCj=ϕ i.e. are mutually disjoint.

    Note j=1lajP(Aj)=j=1mcjP(Cj)

    Proof X is a r.v. with a discrete distribution given by pX(x)=PX({x})=P(X1{x})={0x{c1,,cm}P(Ci)x=ci

    If ω1,,ωn are iid. P, then as n,

    1ni=1nX(ωi)=1ni=1nj=1lajIAj(ωi)=j=1laj(1ni=1nIAj(ωi))j=1lajP(Aj)=1ni=1nj=1mcjICj(ωi)=j=1mcj(1ni=1nICj(ωi))j=1mcjP(Cj)

    Def. Expectation of a Simple Function

    For a simple function X=i=1laiIAi the expectation of X is defined by E(X)=i=1laiP(Ai)

    Prop 3.1.1 (Expectation Properties)

    If X1,X2 are simple functions, then

    (i) E(a0+a1X1+a2X2)=a0+a1E(X1)+a2E(X2)

    Proof Suppose $X_1 = \sum_{i=1}^{m} b_i I_{B_i}$, $X_2 = \sum_{i=1}^{n} c_i I_{C_i}$

    Then $a_0 + a_1X_1 + a_2X_2 = a_0 I_\Omega + \sum_{i=1}^{m} a_1 b_i I_{B_i} + \sum_{i=1}^{n} a_2 c_i I_{C_i}$

    So $a_0 + a_1X_1 + a_2X_2$ is a simple function, and by definition we have

    $E(a_0 + a_1X_1 + a_2X_2) = a_0 P(\Omega) + \sum_{i=1}^{m} a_1 b_i P(B_i) + \sum_{i=1}^{n} a_2 c_i P(C_i) = a_0 + a_1 \sum_{i=1}^{m} b_i P(B_i) + a_2 \sum_{i=1}^{n} c_i P(C_i) = a_0 + a_1 E(X_1) + a_2 E(X_2)$

    (ii) if X1X2, then E(X1)E(X2)

    Proof Since X2X1 is a nonnegative simple function, distinct values taken are nonnegative.

    By (i), this implies that E(X2X1)=E(X2)E(X1)0

    (iii) if P({ω:X1(ω)X2(ω)})=0, then E(X1)=E(X2)

    Proof Suppose X1=i=1laiIAi,X2=i=1mbiIBi are in canonical form.

    Note that if P(Aj)=0, then E(X1)=i=1laiP(Ai)=ijaiP(Ai), and
    similarly if P(Bj)=0, then E(X2)=i=1lbiP(Bi)=ijbiP(Bi), i.e. sets with probability 0 do not change the sum.

    So assume that P(Ai)>0,P(Bj)>0 for all i,j.

    Then for each ai there exists bj (and conversely) such that ai=bj, and Ai and Bj satisfy P(AiBjc)=P(AicBj)=0. This implies P(Ai)=P(Bj).

    Motivation (for definition of expectation of a general r.v. X)

    Now we want to extend the definition of expectation to as many r.v.'s as possible (not just simple functions). Suppose X is a nonnegative r.v., and for i{1,,n},j{1,,2n}, define a nonnegative simple function

    Xn=i=1nj=12n((i1)+(j1)2n)IAi,j,n where Ai,j,n={ω:(i1)+(j1)2nX(ω)<(i1)+j2n}A

    Since Xn is defined to be the lower bound of X(ω), this ensures that Xn(ω)X(ω)

    Suppose nn, then Xn(ω)Xn(ω). Since Xn is an increasing sequence, limnXn(ω)=X(ω) for all ωΩ

    Since E(Xn) is an increasing sequence, limnE(Xn)=E(X).

    Suppose X is a r.v. and define

    $X^+(\omega) = \max\{0, X(\omega)\}$ the positive part of $X$; $X^-(\omega) = \max\{0, -X(\omega)\}$ the negative part of $X$; so $X = X^+ - X^-$

    For any Borel set B, X+ and X are non-negative r.v.'s, since

    X+1B={X1(B(0,)) if 0BX1(B(0,))X1(,0]mapped to 0 by X+ if 0B

    X1B={X1(B(,0)) if 0BX1(B(,0))X1(0,]mapped to 0 by X if 0B

    Def. Expectation (as a Sum of Positive r.v.'s)

    For a r.v. $X$, the expectation of $X$ is $E(X) = E(X^+) - E(X^-)$, provided at least one of $E(X^+), E(X^-)$ is finite; otherwise $E(X)$ is not defined. Note that $X^+(\omega)$ and $X^-(\omega)$ cannot simultaneously be $> 0$.

    Lecture 14

    Lemma 3.1.2

    Suppose Y,Z are nonnegative r.v.'s,

    (i) if a,b0, then E(aY+bZ)=aE(Y)+bE(Z)

    (ii) if YZ, then 0E(Y)E(Z)

    Proof Choose non-negative simple YnY,ZnZ.

    (i) Since aYn+bZn is a nonnegative simple function satisfying aYn+bZnaY+bZ, E(aY+bZ)=limnE(aYn+bZn)=alimnE(Yn)+blimnE(Zn)=aE(Y)+bE(Z)

    (ii) Since YZ, we have 0Ynmax{Yn,Zn}. max{Yn,Zn} is a simple function satisfying max{Yn,Zn}Z. Therefore 0E(Yn)E(max{Yn,Zn}), and the result follows since E(Yn)E(Y),E(max{Yn,Zn})E(Z)

    Lemma 3.1.3

    If Y,Z are nonnegative r.v.'s with E(Y),E(Z) finite, then E(YZ)=E(Y)E(Z).

    Proof Let YZ=X=X+X

    YZ=X+XY+X=X++ZLHS and RHS both non-negativeE(Y+X)=E(Y)+E(X)E(LHS)E(X++Z)=E(X+)+E(Z)E(RHS)

    If X+(ω)>0, then X(ω)=0X+(ω)=Y(ω)Z(ω) by first line of above. If Z(ω) is added instead, then X+(ω)Y(ω)+Z(ω)  ω which implies 0E(X+)E(Y)+E(Z)<

    Similarly, if X(ω)>0, then X+(ω)=0X(ω)=Z(ω)Y(ω) by first line of above. If Y(ω) is added instead, then X(ω)Z(ω)+Y(ω)  ω which implies 0E(X)E(Z)+E(Y)<.

    Therefore, 0E(X)< and

    E(YZ)=E(X)=E(X+)E(X)=E(X++Z)E(Z)E(Y+X)+E(Y)=E(Y)E(Z)

    Prop 3.1.4 (Linearity of Expectations)

    If Y,Z are r.v.'s and E(Y),E(Z) are finite, then E(aY+bZ)=aE(Y)+bE(Z)

    Proof We can decompose Y and Z and express them as a difference of 2 non-negative r.v.'s as in Lemma 3.1.3

    aY+bZ=a(Y+Y)+b(Z+Z)={(aY++bZ+)(aY+bZ)if a,b0 or (aYbZ)(aY+bZ+)if a<0,b<0(aY+bZ)(aYbZ+)if a0,b<0(aY+bZ+)(aY++bZ)if a<0,b0

    Def. St Petersburg Paradox

    The following illustrates a case where the expectation is infinite. Suppose a fair coin is flipped until the first head; if the head comes up on the $i$-th flip, you win $\$2^i$. Then the expected winnings are $\sum_{i=1}^{\infty} 2^i \cdot 2^{-i} = \infty$.

    Prop 3.1.5 (Expectation of |X|)

    (i) E(|X|)=E(X+)+E(X)

    Proof This follows from Lemma 3.1.2(i) since |X|=X++X.

    (ii) If XY with defined expectation, then E(X)E(Y).

    Proof

    (iii) If P(X=0)=1, then E(X)=0. If P(X=Y)=P({ω:X(ω)=Y(ω)})=1, then E(X)=E(Y).

    Proof

    1st line Assume X0 and choose nonnegative simple XnX. Then since 0XnX, we have that P(X=0)P(Xn=0)=1E(Xn)=0E(X)=0.

    In general, X=X+X, so {ω:X(ω)=0}{ω:X+(ω)=0}, which implies that P(X+=0)=1E(X+)=0 And similarly, {ω:X(ω)=0}{ω:X(ω)=0}, so P(X=0)=1E(X)=0 and we have that E(X)=0

    2nd line If P(X=Y)=1, then

    P({ω:X(ω)>0})=P({ω:Y(ω)>0})P(X+=Y+)=1E(X+)=E(Y+) and similarly, P({ω:X(ω)<0})=P({ω:Y(ω)<0})P(X=Y)=1E(X)=E(Y)E(X)=E(Y)

    Lecture 15

    Def. Converge with Probability 1

    Xnwp1X The sequence of r.v.'s {Xn} converges with probability 1 to r.v. X if P({ω:limnXn(ω)=X(ω)})=1

    Note We can assign a probability measure to the set since it is a sigma algebra:

    {ω:limnXn(ω)=X(ω)}=m=1lim infn{ω:|Xn(ω)X(ω)|<1m}=m=1n=1i=n{ω:|Xn(ω)X(ω)|<1m}A

    E.g. Define (Ω,A,P)=(R1,B1,P) where P is the uniform distribution on [0,1], so P(B)=B[0,1]dx.
    Let Xn(ω)=nn+1ω2, so X(ω)=ω2. Then {ω:limnXn(ω)=X(ω)}=R1
    Since P(R1)=[0,1]dx=1, we have that Xnwp1X

    E.g. Let X(ω)={ω2 if ω1/21 if ω=1/2 then {ω:limnXn(ω)=X(ω)}=R1{1/2}
    Since P(R1{12})=[0,12)dx+(12,1]dx=12+12=1, we have that Xnwp1X
    In fact, we could change X at every rational qQ to obtain X. Since P(Q)=0, we still have Xnwp1X

    Def. Converge Almost Surely

    A measure v defined on (Ω,A) is a function v:A[0,] that satisfies the following:

    1. v(ϕ)=0
    2. v(i=1Ai)=i=1v(Ai) whenever A1,A2,A are mutually disjoint

    Suppose h:(Ω,A)(R1,B1), i.e. h:ΩR1, h1BA for every BB1

    Then we can define a kind of average of h with respect to v, Ωh(ω)v(dω), called the integral of h with respect to v

    If $\{h_n\}$ is a sequence of such functions and $v(\{\omega : \lim_n h_n(\omega) \neq h(\omega)\}) = 0$, then we say the sequence converges almost surely $v$ to $h$ and write $h_n \overset{a.s.}{\to} h$

    Note convergence almost surely $P$ to $h$ $\Leftrightarrow$ convergence with probability 1

    Prop 3.2.1 (MCT & DCT)

    Suppose hn a.s. vh

    (i) Monotone Convergence (MCT)

    If 0h1h2, then Ωhn(ω)v(dω)Ωh(ω)v(dω)

    (ii) Dominated Convergence (DCT)

    If there exists g:(Ω,A)(R1,B1) such that Ω|g(ω)|v(dω)< and |hn||g| for all n,
    then Ωhn(ω)ν(dω)Ωh(ω)ν(dω)

    Corollary (Applied to Expectations)

    Suppose Xnwp1X

    (i) If 0X1X2, then E(Xn)E(X).

    (ii) If there exists r.v. Y such that E(|Y|)< and |Xn||Y| for all n, then E(Xn)E(X).

    E.g. Suppose $X$ has $E(|X|) < \infty$. Let $X_n = X I_{\{|X| \le n\}}$, so $X_n \overset{wp1}{\to} X$ and $|X_n| \le |X|$. Then by DCT, $E(X_n) \to E(X)$

    Prop 3.3.2 (Expectation of Compositions)

    If X is a r.v. with respect to (Ω,A,P), h:(R1,B1)(R1,B1), and Y=hX=h(X), then

    (i) Y=h(X) is a r.v. with respect to (Ω,A,P)

    Proof Let BB1. Then Y1B={ω:Y(ω)B}={ω:h(X(ω))B}={ω:X(ω)h1B}=X1h1BA since h1BB1 and X is a r.v.

    (ii) E(Y)=EPX(h), if it exists.

    Proof Steps: simple h -> non-negative h -> general h

    If h=i=1kbiIBi is a simple function, then Y(ω)=h(X(ω))=i=1kbiIBi(X(ω))=i=1kbiIX1Bi(ω), so EP(Y)=i=1kbiP(X1Bi)P on Ω=i=1kbiPX(Bi)P on R=EPX(h)expectation defined using simple functions

    If h0, Y=h(X)0, then there exists a sequence of non-negative simple functions WnhWn(X)h(X)=Y.

    EPX(h)=limnEPX(Wn)=limnE(Wn(X))=E(Y)expectation defined as a sum of positive r.v.’s

    General case (h not necessarily 0): h(X)=h+(X)h(X). Applying the above to both parts gives the result.

    EPX(h+)=E(Y+),EPX(h)=E(Y)EPX(h+)EPx(h)=E(Y+)E(Y)=E(Y)

    Prop 3.3.3 (Expectation Formulas)

    Suppose X is a r.v. with respect to (Ω,A,P), h:(R1,B1)(R1,B1) and EPX(h) exists.

    (i) If PX is discrete with probability mass function pX, then EPX(h)=xR1h(x)pX(x).

    (ii) If PX is a.c. with probability density function fX, then EPX(h)=h(x)fX(x)dx.

    Proof Suppose h(x)=i=1kbiIBi(x) is a simple function in canonical form. Then

    EPX(h)=i=1kbiPX(Bi)={i=1kbixBipX(x), if X discrete i=1kbiBifX(x)dx, if X a.c. ={xR1h(x)pX(x), if X discrete h(x)fX(x)dx, if X a.c. ={h(x)pX(x)v(dx),v= counting measure h(x)fX(x)v(dx),v= volume measure 

    This proves the result for simple h. If h ≥ 0 and nonnegative simple hn ↑ h, then (i) hn·pX ↑ h·pX (ii) hn·fX ↑ h·fX, and the result follows by MCT. For general h, the result follows via the decomposition h = h^+ − h^-.

    E.g. If XN(μ,σ2), then with h(x)=x we have

    E(X)=0x12πσexp(12(xμσ)2)dx0(x)12πσexp(12(xμσ)2)dxmake the change of variable t=T(x)=(xμ)σ so x=T1(t)=μ+σt and JT(x)=σ=0(μ+σt)12πexp(t22)dx0(μ+σt)12πexp(t22)dxsub in φ(t)=12πexp(t22)=0(μ+σt)φ(t)dt+0(μ+σt)φ(t)dtcan recombine and simplify=μφ(t)dt1+σ(0tφ(t)dt+0tφ(t)dt)0 since for an odd function, f(-x) = -f(x)since 0tφ(t)dt=0tφ(t)dt=μ

    With h(x)=(xμ)2, we have

    E((Xμ)2)=(xμ)212πσexp(12(xμσ)2)dx=σ2t2φ(t)dtusing same change of var as beforeapply integration by parts with {u=tdu=dtdv=tφ(t)v=t12πexp(t22)dt=φ(t)t2φ(t)dt=uvvdu=[tφ(t)]contains exp()+φ(t)dt=0+1=1=σ2

    Def. Moments

    The k-th moment of a r.v. X is given by μk=E(Xk) when it exists. When the first moment exists, the k-th central moment of a r.v. X is given by μ¯k=E((Xμ1)k).

    The 1st moment of X is its mean: μX = E(X), so its first central moment is 0. The 2nd central moment of X is its variance: σX² = Var(X) = E((X − μX)²) when μX exists. The skewness is the standardized 3rd central moment μ̄3/σX³, and the kurtosis is the standardized 4th central moment μ̄4/σX⁴.

    Prop 3.3.4 (Finite Moment Property)

    If μk is finite, then μl is finite for all l ≤ k, i.e. the previous moments must all be finite.

    Proof μk is finite E(|X|k) is finite. Let h(x)=|x|l, then

    0E(|X|l)=EPX(h)=|x|lPX(dx)=1|x|lPX(dx)+11|x|lPX(dx)()+1|x|lPX(dx)to get the following line, use the fact that lk for the 1st and 3rd term abovefor the 2nd term (*): powers of a number in [-1, 1] must be smaller than 11|x|kPX(dx)(1)+111PX(dx)+1|x|kPX(dx)(2)add in the original 2nd term (*)|x|kPX(dx)(1)+()+(2)+PX([1,1])<

    Ex. When XN(μ,σ2) compute E(X3) and E(X4)

    Since XN(μ,σ2), Z=XμσN(0,1)

    Z3=(Xμ)3σ3=X3(31)μX2+(32)μ2Xμ3σ3=X33μX2+3μ2Xμ3σ3E(Z3)=E(X3)3μE(X2)+3μ2E(X)μ3σ3=E(X3)3μ(σ2+μ2)+3μ3μ3σ3=E(X3)3μσ2μ3σ3E(Z3)=z3ϕ(z)dz=0odd function0=E(X3)3μσ2μ3E(X3)=3μσ2+μ3
    Z4=(Xμ)4σ4=X4(43)μX3+(42)μ2X2(41)μ3X+μ4σ4=X44μX3+6μ2X24μ3X+μ4σ4E(Z4)=E(X4)4μE(X3)+6μ2E(X2)4μ4+μ4σ4=E(X4)4μ(μ3+3μσ2)+6μ2(μ2+σ2)3μ4σ4=E(X4)6μ2σ2μ4σ4E(Z4)=z4ϕ(z)dzu=z3,dv=zϕ(z)du=3z2,v=ϕ(z)=[z3(ϕ(z))](ϕ(z))3z2dz=0+3=33σ4=E(X4)6μ2σ2μ4E(X4)=3σ4+6μ2σ2+μ4
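    A quick numerical check of these two formulas (the values μ = 1.5, σ = 2 below are illustrative, not from the notes), using a Monte Carlo sample:

    ```python
    # Check E(X^3) = 3*mu*sigma^2 + mu^3 and E(X^4) = 3*sigma^4 + 6*mu^2*sigma^2 + mu^4
    # for X ~ N(mu, sigma^2) by simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 2.0
    x = rng.normal(mu, sigma, size=2_000_000)

    print(np.mean(x**3), 3*mu*sigma**2 + mu**3)                   # both ≈ 21.375
    print(np.mean(x**4), 3*sigma**4 + 6*mu**2*sigma**2 + mu**4)   # both ≈ 107.0625
    ```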

    Ex. When X ∼ Standard Cauchy, i.e. X has density fX(x) = 1/(π(1+x²)) for −∞ < x < ∞, show that μ doesn't exist.

    Cauchy dist has longer tails.

    Ex. Let X ∼ Geometric(θ), and let Y = min(X, 100).

    (a) Compute E(Y).

    (b) Compute E(Y − X).

    Ex. Geometric & Negative Binomial

    E&R 3.1.22 For X ∼ Negative-Binomial (r, θ), prove that E(X) = r(1 − θ)/θ. (Hint: Argue that if X1,...,Xr are independent and identically distributed Geometric(θ) , then X=X1+···+Xr ∼ Negative-Binomial(r, θ).)

    E&R 3.3.18 Prove that the variance of the Geometric(θ) distribution is given by (1θ)/θ2.
    Hint: ((1−θ)^x)″ = x(x−1)(1−θ)^{x−2}, differentiating twice with respect to θ

    E&R 3.3.19 Prove that the variance of the Negative-Binomial(r, θ) distribution is given by r(1θ)/θ2.

    Ex. Gamma

    E&R 3.2.16 Let α > 0 and λ > 0, and let X ∼ Gamma(α, λ). Prove that E(X) = α/λ.

    E&R 3.3.20 Let α > 0 and λ > 0, and let X ∼ Gamma(α, λ). Prove that Var(X) = α/λ2.

    Ex. Beta

    E&R 3.2.22 Suppose that X follows the Beta(a, b) distribution. Prove that E(X) = a/(a + b).

    E&R 3.3.24 Suppose that X ∼ Beta(a, b). Prove that Var(X)=ab/((a+b)2(a+b+1))

    E.g. (Monte Carlo Approximations)

    Suppose Y = h(X) for some h : (R1,B1) → (R1,B1) and we want to compute E(Y). Generate X1, X2, …, Xn i.i.d. from PX, set Yi = h(Xi), and estimate E(Y) by Ȳ = (1/n)Σ_{i=1}^n Yi.

    How accurate is this estimate for some specific n? The sample variance s² = (1/(n−1))Σ_{i=1}^n (Yi − Ȳ)² gives the approximate standard error s/√n of Ȳ.

    Note Let Y=IA, so Y¯= the relative frequency of A in X1,X2,,Xn. Since Yi2=Yi, s2=Y¯(1Y¯) This is the same estimation procedure as previously discussed for estimating PX(A)=EPX(IA)
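    A minimal Python sketch of this Monte Carlo procedure (the choices h(x) = x² and X ∼ Uniform(0,1) are illustrative assumptions, not from the notes):

    ```python
    # Monte Carlo estimate of E(h(X)) with an approximate standard error.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    x = rng.uniform(0.0, 1.0, size=n)     # X_1, ..., X_n i.i.d. from P_X (illustrative)
    y = x**2                              # Y_i = h(X_i) with h(x) = x^2 (illustrative)

    ybar = y.mean()                       # estimates E(h(X)) = 1/3 here
    se = y.std(ddof=1) / np.sqrt(n)       # approximate standard error s / sqrt(n)
    print(ybar, se)

    # Special case Y = I_A with A = [0, 0.25]: ybar is the relative frequency of A
    # and the sample variance is approximately ybar * (1 - ybar).
    ind = (x <= 0.25).astype(float)
    print(ind.mean(), ind.mean() * (1 - ind.mean()))
    ```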

    Lecture 16

    Def. Mean Vector

    For random vector XRk, the mean vector of X is μX=E(X)=(E(X1),E(X2),,E(Xk))=(μ1,μ2,,μk), provided each E(Xi)=μi exists.

    Note For a matrix of r.v.'s (called a random matrix) X = (Xij) ∈ Rk×l, its expected value is defined to be E(X) = (E(Xij)) when each E(Xij) exists, and E(X) ∈ Rk×l when each E(Xij) is finite.

    Def. Variance Matrix

    If each E(Xi) is finite (so E(X)Rk) then the variance matrix of X is given by ΣX=Var(X)

    =(E((X1μ1)2)E((X1μ1)(Xkμk))E((X2μ2)(X1μ1))E((X2μ2)(Xkμk))E((Xkμk)(X1μ1))E((Xkμk)2))

    provided each E((Xiμi)(Xjμj)) for ij exists. In vector form: ΣX=Var(X)=E((XμX)(XμX))

    The off-diagonal entries =Cov(Xi,Xj)=E((Xiμi)(Xjμj)) The diagonal entries =Cov(Xi,Xi)=Var(Xi) So ΣX=(Cov(Xi,Xj))

    If Cov(Xi,Xj) is finite for every i and j, then ΣxRk×k and is symmetric, i.e. Cov(Xi,Xj)=Cov(Xj,Xi)

    Ex. When X is a r.v., prove that E(X2)< implies E(X)<.

    When X and Y are r.v.'s and E(X2)<,E(Y2)<, prove that E(XY) is finite.

    Also prove that if E(Xi2)< for all i=1,,k, then ΣxRk×k.

    Ex. When r.v.'s X and Y satisfy E(X2)<,E(Y2)<, prove that Cov(X,Y)=E(XY)E(X)E(Y). Extend this result to random vectors X to show that ΣX=Var(X)=E(XX)μXμX

    Prop 3.4.1 (Affine Transformations' Mean & Variance)

    Suppose XRk is a random vector and Y=a+CX where aRl,CRl×k are constant.

    (i) If μXRk, then μY=a+CμXRl

    Proof μY=E(Y)=E(a+CX)=a+CE(X) since E(ai+j=1kcijXj)=ai+j=1kcijE(Xj) by the linearity of E

    (ii) If ΣXRk×k, then ΣY=CΣXCRl×l since

    ΣY=Var(Y)=E((YμY)(YμY))=E((a+CX(a+CμX))(a+CX(a+CμX)))=E(C(XμX)(XμX)C)=CE((XμX)(XμX))C=CΣXC
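    A small empirical check of Prop 3.4.1(ii) (the matrices a, C, ΣX below are illustrative choices, not from the notes):

    ```python
    # Empirical check that Var(a + C X) ≈ C Sigma_X C'.
    import numpy as np

    rng = np.random.default_rng(2)
    mu_X = np.array([1.0, -1.0])
    Sigma_X = np.array([[2.0, 0.5],
                        [0.5, 1.0]])
    X = rng.multivariate_normal(mu_X, Sigma_X, size=200_000)   # rows are draws of X

    a = np.array([3.0, 0.0, -2.0])
    C = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [1.0, -1.0]])
    Y = a + X @ C.T                                            # Y = a + C X, row-wise

    print(np.cov(Y, rowvar=False))    # sample variance matrix of Y
    print(C @ Sigma_X @ C.T)          # C Sigma_X C'
    ```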

    Prop 3.4.2 (Degenerate Dist, p.s.d. Variance Matrix, Affine Plane)

    (i) If X is a r.v. and Var(X)=0, then P(X=μX)=1, so X has a probability distribution degenerate at a constant (= μX).

    Proof (Repeat the below for simple functions -> non-negative functions -> general functions)

    Var(X)=E((XμX)2)=0 iff 1=P((XμX)2=0)=P(XμX=0)=P(X=μX).

    (ii) If XRk is a random vector, ΣXRk×k, and cRk is constant, then cΣXc0. Thus, any variance matrix is positive semidefinite (p.s.d.)

    Proof Consider r.v. Y=cX.

    Then by (ii) of the previous proposition, Var(Y)=cΣXc0 since a variance is always nonnegative.

    (iii) If c′ΣXc = 0 for some c ≠ 0, then the probability distribution of X is concentrated on the affine plane μX + L⊥{c}, where L{c} denotes the linear span of c and L⊥{c} = {a : a ⊥ c} = {a : a′c = 0} is its orthogonal complement.

    Proof Consider r.v. Y = c′X. Suppose c′ΣXc = 0; then Var(Y) = c′ΣXc = 0, so by (i) and (ii),

    1 = P(Y = μY) = P(c′X = c′μX) = P(c′(X − μX) = 0) = P(X − μX ∈ L⊥{c}) = PX(μX + L⊥{c})

    Notes

    Ex. Prove that, if XRk×l is a random matrix such that each E(Xij) is finite and ARp×q,BRp×k,CRl×q are constant matrices, then E(A+BXC)=A+BE(X)C

    E.g. XNk(μ,Σ)

    Ex. Suppose XNk(μ,Σ). Determine E(XX)

    Ex. Suppose X multinomial (n,p1,,pk). Determine μX and ΣX.

    Ex. The correlation between r.v.'s X and Y is defined by ρXY=Corr(X,Y)=Cov(X,Y)Sd(X)Sd(Y), where Sd(X)=Var(X) is the standard deviation of X.

    (i) What has to hold for ρXY to exist and provide sufficient conditions?

    (ii) Prove that for constants a,b,c,d then Corr(a+bX,c+dY)=Corr(X,Y), provided b>0,d>0. What happens when b=0 ? What happens when b<0,d>0 and when b<0,d<0 ?

    (iii) Suppose Y= wp1 a+bX. What is Corr(X,Y) ?

    (iv) Suppose XU(0,1) and Y=X2. Determine Corr(X,Y).

    (v) Suppose XU(1,1) and Y=X2. Determine Corr(X,Y). Are X and Y independent?

    Recall Two collections of r.v.'s {Xs : s ∈ S}, {Yt : t ∈ T} are statistically independent if for any finite subsets {s1,…,sm} ⊆ S, {t1,…,tn} ⊆ T and any x1,…,xm, y1,…,yn ∈ R1, the joint cdf satisfies

    F(Xs1,,Xsm,Yt1,,Ytn)(x1,,xm,y1,,yn)=F(Xs1,,Xsm)(x1,,xm)F(Yt1,,Ytn)(y1,,yn)

    The Extension Thm then implies P(Xs1,,Xsm,Yt1,,Ytn)(B1×B2)=P(Xs1,,Xsm))(B1)P(Yt1,,Ytn)(B2) for any B1Bm,B2Bn

    Prop 3.5.1 ( E(g h) = E(g)E(h) )

    If X and Y are statistically independent random vectors and h1,h2:(R1,B1)(R1,B1), then h1(X) and h2(Y) are statistically independent. So if E(h12(X))< and E(h22(Y))<, then E(h1(X)h2(Y))=E(h1(X))E(h2(Y))

    Proof h1(X) and h2(Y) are statistically independent since the following holds for every x and y.

    F(h1(X),h2(Y))(x,y)=P(h1(X)x,h2(Y)y)=P(Xh11(,x],Yh21(,y])=P(X,Y)(h11(,x]×h21(,y])=PX(h11(,x])PY(h21(,y])=Fh1(X)(x)Fh2(Y)(y)

    Suppose h1=iaiIAi,h2=jbjIbj are simple functions. Then h1(x)h2(y)=i,jaibjIAi(x)IBj(y)=i,jaibjIAi×Bj(x,y) is also simple, and E(h1(X)h2(Y))=i,jaibjP(X,Y)(Ai×Bj)=i,jaibjPX(Ai)PY(Bj)=E(h1(X))E(h2(Y)) as required.

    The result then follows by proceeding to nonnegative h1,h2 by limits and then to general h1=h1+h1,h2=h2+h2.

    Corollary 3.5.2 (Covariance of Ind. Functions = 0)

    Cov(h1(X),h2(Y))=0

    Ex. For random vectors XRk and YRl define Cov(X,Y)=E((XμX)(YμY)), provided all the relevant expectations exist.

    (i) Give conditions under which Cov(X,Y)Rk×l.

    (ii) Assuming Cov(X,Y)Rk×l and aRp,bRq,ARp×k,BRq×l are constant, determine Cov(a+AX,b+BY)

    (iii) Assuming Cov(X,Y)Rk×l and X and Y are statistically independent, determine Cov(X,Y).

    Ex. For random vector X ∈ Rk with ΣX ∈ Rk×k, the correlation matrix is defined by Corr(X) = RX = DX⁻¹ ΣX DX⁻¹ where DX = diag(Sd(X1),…,Sd(Xk)) = diag(√σ11,…,√σkk)

    (i) Show that the (i,j)-th element of RX is Corr(Xi,Xj).

    (ii) Suppose Y=DX where D=diag(d1,,dk) with di>0 for i=1,,k. Show Corr(Y)=Corr(X)

    (iii) Suppose in (ii) that D is not diagonal with positive diagonal, is it true that Corr(Y)=Corr(X)?

    Lecture 17

    Def. Functions of a Stochastic Process

    Suppose {(t,Xt):tT} is a stochastic process such that E(Xt2)< for all tT.

    Then the mean function μ:TR1 is defined as μ(t)=E(Xt)

    The autocovariance function σ:T×TR1 is defined as σ(s,t)=Cov(Xs,Xt), provided these expectations exist.

    The autocorrelation function ρ : T×T → R1 is defined as ρ(s,t) = σ(s,t)/√(σ(s,s)σ(t,t)), provided σ(t,t) > 0 ∀ t ∈ T

    E.g. (iid process)

    Def. Gaussian process

    Def. Weakly Stationary Process

    For TRk, a process with mean function μ and autocovariance function σ is called weakly stationary if μ(t) is constant in t and σ(s,t)=κ(st) for some κ:RkR1.

    Note κ is a positive semidefinite function (positive definite when corresponding matrices are p.d.), i.e. κ must satisfy κ(0) ≥ 0, κ(−t) = κ(t), and Σ_{i=1}^n Σ_{j=1}^n xi xj κ(ti − tj) ≥ 0 for all {t1,…,tn} ⊆ T, x = (x1,…,xn)′ ∈ Rn

    There are theorems concerning such κ; for example, κ(t) = exp(−τ²t²) where τ² > 0 is positive definite.

    Def. Random Walk

    Let Z1, Z2, … be i.i.d. and set Xt = Σ_{i=1}^t Zi for t ∈ N. Below, first take Zi ∼ −1 + 2·Bernoulli(p) (steps ±1), then general Zi with E(Z1) = m and Var(Z1) = τ².

    μ(t)=E(Xt)=i=1tE(Zt)=tE(Z1)=t((1p)+p)=(2p1)tσ(s,t)=Cov(Xs,Xt)=Cov(i=1sZi,j=1tZj)=i=1sj=1tCov(Zi,Zj)only non-zero when i=j=i=1min{s,t}Var(Zi)=min{s,t}Var(Z1)=4p(1p)min{s,t}ρ(s,t)=4p(1p)min{s,t}4p(1p)s4p(1p)t=min{s,t}st
    μ(t)=E(Xt)=i=1tE(Zt)=tE(Z1)=mtσ(s,t)=Cov(Xs,Xt)= as above min{s,t}Var(Z1)=τ2min{s,t}ρ(s,t)=τ2min{s,t}τ2sτ2t=min{s,t}st
    (X1X2Xt)=(100110111)(Z1Z2Zt)=AZt

    E.g. Weakly Stationary Gaussian Process

    Let {Zt : t ∈ Z} be i.i.d. N(0, τ²) and define Xt = Zt + θZt−1. Then

    μ(t)=E(Xt)=E(Zt)+θE(Zt1)=0σ(s,t)=Cov(Xs,Xt)=E(XsXt)E(Xs)E(Xt)=E(XsXt)=E((Zs+θZs1)(Zt+θZt1))=E(ZsZt)+θ[E(ZsZt1)+E(Zs1Zt)]+θ2E(Zs1Zt1)={0s<t1when indices don’t match, covariance is 0τ2θs=t1τ2+τ2θ2s=tτ2θs=t+10s>t+1
    (XtXt+1Xt+n)=(θ1000θ1000θ1)(Zt1ZtZt+n)=AZt1,t+n

    Since AZ is multivariate normal, it is consistent and by KCT, {(t,Xt):tZ} is a Gaussian process

    κ(t)={0t<1τ2θt=1τ2+τ2θ2t=0τ2θt=10t>1

    it is a weakly stationary Gaussian process

    Ex. If r.v.'s X1,,Xm,Y1,,Yn all have finite second moments, then for constants a0,a1,,am,b0,b1,,bn prove that Cov(a0+i=1maiXi,b0+j=1nbjYj)=i=1mj=1naibjCov(Xi,Yj)

    Ex. If r.v.'s X1,,Xm all have finite second moments then for constants a0,a1,,am prove that Var(a0+i=1maiXi)=i=1mai2Var(Xi)+2i<jaiajCov(Xi,Xj)

    Specialize this result to the case where X1,,Xm are mutually statistically independent.

    Ex. In the 2 previous exercises, determine the joint distribution of (X1,,Xt) in the Gaussian case.

    Lecture 18

    Def. Markov's Inequality

    If X is a nonnegative r.v. and x>0, then P(Xx)E(X)x

    P(Xx)=E(X)x iff P(X=x)=1P(X=0).

    Proof (inequality)

    P(Xx)=E(I{Xx})E(XxI{Xx})=E(XI{Xx})xE(X)x

    Proof (equality)

    () If P(X=x)=1P(X=0),
    then PX is concentrated on the points {0,x}, so E(X)=0P(X=0)+xP(X=x)=xP(Xx).

    () If E(X)=xP(Xx) at x>0,
    then 0=E(X)E(xI{Xx})=(E(XIX<x)+E(XIXx))E(xI{Xx})=E(XI{X<x})+E((Xx)I{Xx})

    Since XI{X<x} and (Xx)I{Xx} are both non-negative r.v.'s, E(XI{X<x})=E((Xx)I{Xx})=0

    It follows that {1=P(XI{X<x}=0)I{X<x}=0XxP(0<X<x)=01=P((Xx)I{Xx}=0)()I{X>x}=0XxP(X>x)=0

    () holds since Xx=0 when X=x, so it doesn't matter what I evaluates to.
    We can thus exclude X=x from I{Xx}.

    Hence we have P(X=x)=1P(X=0)

    Ex. If X is a r.v., then determine an upper bound for P(exp(tX)k) when k>0.

    Might need to add t > 0

    Ex. If X is a r.v. and k>0, then prove P(|X|k)E(|X|)/k and also P(|X|k)E(X2)/k2. If X exponential (1) which inequality is sharper? Find the exact value of P(X2) when X exponential (1) and compare this with the bounds.

    Def. Chebyshev's Inequality

    If X has mean μ and variance σ2, then for k>0, P(|Xμ|kσ)1k2

    P(|Xμ|kσ)=1k2 iff P(X{μkσ,μ+kσ})=1P(X=μ)

    Proof Since |Xμ| is non-negative we can apply Markov and obtain

    P(|Xμ|kσ)=P((Xμ)2k2σ2)E((Xμ)2)k2σ2=σ2k2σ2=1k2

    and the equality result follows as with Markov.

    E.g. 5 sigma P(|X − μ| ≥ 5σ) ≤ 1/25 = 0.04, and if X ∼ N(μ,σ²), then P(|X − μ| ≥ 5σ) = 5.733031e−07
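    A quick check of this 5σ comparison (assuming SciPy is available):

    ```python
    # Chebyshev bound versus the exact N(mu, sigma^2) tail at 5 standard deviations.
    from scipy.stats import norm

    k = 5
    print(1 / k**2)        # Chebyshev bound: 0.04
    print(2 * norm.sf(k))  # exact P(|X - mu| >= 5*sigma) for a normal: ≈ 5.733e-07
    ```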

    Def. Cauchy-Schwartz Inequality

    Recall from linear algebra: |xTy|xy

    Think of a set of r.v.'s as a vector space. Restrict it to a set that has second moments and it's a linear space.

    If E(X²) < ∞, E(Y²) < ∞, then |E(XY)| ≤ (E(X²))^{1/2}(E(Y²))^{1/2}

    |E(XY)| = (E(X²))^{1/2}(E(Y²))^{1/2} iff Y = cX wp1

    where c = 0 if P(Y=0) = P(X=0) = 1, and c = E(XY)/E(X²) o/w

    Proof If E(X²) = 0, then P(X = 0) = 1 ⇒ P(XY = 0) = 1 ⇒ E(XY) = 0, so |E(XY)| = 0 = (E(X²))^{1/2}(E(Y²))^{1/2} and X = 0 = 0·Y wp1.

    Now assume E(X2)>0,E(Y2)>0. For any cR1, 0(YcX)2=Y22cXY+c2X20E(Y2)2cE(XY)+c2E(X2), which is a convex parabola in c with minimum at c=E(XY)E(X2)

    So 0E(Y2)2(E(XY))2E(X2)+(E(XY))2E(X2)=E(Y2)(E(XY))2E(X2)|E(XY)|E(X2)E(Y2)

    Equality occurs iff 0=E((YcX)2) when c=E(XY)E(X2) (which minimizes the parabola)

    This occurs iff 1=P((YcX)2=0)=P(YcX=0)=P(Y=cX)

    Def. Correlation Inequality

    If 0 < σX² < ∞, 0 < σY² < ∞, then −1 ≤ ρXY = Corr(X,Y) ≤ 1

    ρXY = Corr(X,Y) = ±1 iff Y =wp1 μY + σY(X − μX)/σX (when ρXY = 1) or Y =wp1 μY − σY(X − μX)/σX (when ρXY = −1)

    Note Correlation only measures linear (affine) relation between X & Y. (How much variation in Y does X explain?)
    Note that correlation = 0 does not imply independence. The correlation for X and Y=a+bX+cX2 can be 0 even though they are not independent.

    Proof In Cauchy Schwartz inequality, standardize X,Y, i.e. replace X by (XμX)σX and Y by (YμY)σY

    So E((X − μX)²/σX²) = E((Y − μY)²/σY²) = 1

    By C.S., we have |ρXY|=|E((XμXσX)(YμYσY))|11ρXY1

    By C.S., the equality holds iff (YμYσY)= wp 1c(XμXσX) where c=E((XμXσX)(YμYσY))E((XμXσX)2)=ρXY

    Rearranging the above, we get Y=wp1μY+σYρXY(XμXσX) where ρXY=±1, so the result follows.

    It is of the form Y = a + bX where a = μY − σYρXY·μX/σX and b = σYρXY/σX

    Note a measure of the total variation in Y is given by Var(Y)=E((YμY)2)

    Def. Best Affine Predictor

    If we approximate Y by a+bX for some constants a and b, then the amount of variation in Y that is not explained (the residual variation) by a+bX is E((YabX)2)

    The best affine predictor (linear regression) of Y from X is given by a+bX, where a,b are constants that minimize E((YabX)2)
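    A hedged numerical sketch: on simulated data (the joint distribution below is an illustrative choice), the coefficients b = Cov(X,Y)/Var(X) = σYρXY/σX and a = μY − bμX, which the exercises below show are the minimizers, agree with an ordinary least-squares fit:

    ```python
    # Best affine predictor coefficients versus a least-squares fit on simulated data.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 500_000
    x = rng.normal(2.0, 1.5, size=n)
    y = 1.0 + 0.7 * x + rng.normal(0.0, 0.5, size=n)   # illustrative joint distribution

    b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)         # Cov(X,Y)/Var(X) = sigma_Y*rho/sigma_X
    a = y.mean() - b * x.mean()                        # mu_Y - b*mu_X
    print(a, b)                                        # ≈ 1.0 and 0.7

    print(np.polyfit(x, y, 1))                         # least squares: [slope b, intercept a]
    ```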

    Ex. Assume 0<σX2<,0<σY2<. Show that if a,b minimize E((YabX)2), then a,b with a=aμY+bμX,b=b minimizes E(((YμY)ab(XμX))2) over all constants a,b.

    Ex.

    (i) Assume μX=μY=0 and 0<σX2<,0<σY2<. For all constants a,b, and cXY=σYρXYσX, prove E(YcXYX)=0,Cov(YcXYX,a+bX)=0 and E((YabX)2=Var(YcXYX)+a2+(bcXY)2Var(X).

    Use this to prove that cXYX is the best affine predictor of Y from X.

    (ii) Combine (i) and the previous exercise to determine the best affine predictor of Y from X when the assumption of 0 means is not made.

    (iii) Show that the proportion of the total variation in Y explained by the best affine predictor from X is given by ρXY2.

    (iv) When (XY)N2((μXμY),(σX2σXσYρXYσXσYρXYσY2)), show that EYX(Yx) equals the best affine predictor of Y from X.

    Lecture 19

    Def. Convexity

    CRk is a convex set if whenever x1,x2C and α[0,1], then αx1+(1α)x2C.

    The line segment joining x1 and x2 is L(x1,x2)={αx1+(1α)x2:α[0,1]}

    If C is convex, then f : C → R1 is a convex function if f(αx1 + (1−α)x2) ≤ αf(x1) + (1−α)f(x2) for every x1, x2 ∈ C and α ∈ [0,1]. If instead LHS ≥ RHS, then f is a concave function. If f : C → R1 is convex, then −f is concave.

    If f:CR1 is defined on open convex set CRk,
    then f is convex whenever the Hessian matrix (2f(x1,,xk)xixj)Rk×k is positive semidefinite for every xC

    Ex. Convexity proofs

    (i) Prove the line segment L(x1,x2) is convex. (ii) Prove [a,b]Rk is convex. What about (a,b],(a,b),[a,b)? (iii) Prove Br(μ)Rk is convex. (iv) Prove Er(μ,Σ) is convex (hint: use Er(μ,Σ)=μ+Σ1/2Br(0). (v) Prove that the affine function f:RkR1 given by f(x)=a+cx for constants aR1,cRk is convex on Rk. (vi) Prove that f(x)=logx is convex on C=(0,). (vii) If ΣRk×k is positive semidefinite, then prove f(x)=xΣx is convex on Rk.

    Prop 3.7.5 (Supporting Hyperplane Thm)

    If C ⊆ Rk is convex and x0 ∈ Rk is not an interior point of C (there isn't a ball Br(x0) ⊆ C with r > 0), then there exists c ∈ Rk∖{0} such that for every x ∈ C, c′x ≥ c′x0.

    For a set ARk there is always a set S={xRk:a+Bx=0} for some aRl,BRl×k, and lk s.t. AS

    E.g. take a=0Rk,B=0R1×k so any ARk would be in {x:a+Bx=0}=Rk

    E.g. the hyperplane in Rk+1 given by a,bRk{0} with y=a+bx

    {(xy)Rk+1:a+(b1)(xy)=0}

    A set of the form {xRk:a+Bx=0} is called an affine subset of Rk and it has a dimension (point has dimension 0 , line has dimension 1,, hyperplane has dimension k1,Rk has dimension k )

    Ex. Suppose C1,C2Rk are convex. Prove that C1C2 is convex.

    Ex. Suppose CRk is convex and let C=a+BC={y=a+Bx:xC}. Prove that C is convex.

    Ex. If C is a linear subspace of Rk, then C is convex.

    Def. Affine Dimension

    If ARk, the affine dimension of A is the smallest dimension of an affine set containing A. For example, a squiggly line has affine dimension = 2.

    Prop 3.7.7 (Expectation is in Convex Set)

    If CRk is convex with PX(C)=P(XC)=1 and E(X)Rk, then E(X)C.

    Proof (Induction on the affine dimension of C)

    If the affine dim of C is 0 (probability concentrated at the point x), then C={x} and E(X)=xC and the result holds.

    Assume wlog (w/o loss of generality) that E(X) = 0; o/w put Y = X − E(X) and C′ = C − E(X). Note that C′ is convex, PY(C′) = P(Y ∈ C′) = P(X ∈ C) = PX(C) = 1, and E(X) ∈ C iff E(Y) = 0 ∈ C′.

    Now assume the result holds for affine dim C < k. Suppose 0 ∉ C. Then by the Supporting Hyperplane Thm, there exists c ∈ Rk∖{0} s.t. c′x ≥ c′0 = 0 for every x ∈ C. This implies P(c′X ≥ 0) = 1, i.e. c′X is a nonnegative r.v.

    By hypothesis, E(X)=0E(cX)=cE(X)=0P(cX=0)=1. Therefore, P(X{x:cx=0}C)=1, and {x:cx=0}C is a convex set w/ affine dimension k1. So by the inductive hypothesis, E(X)=0{x:cx=0}C which implies 0C, and we have a contradiction.

    Def. Jensen's Inequality

    If C ⊆ Rk is convex, PX(C) = 1, E(X) ∈ Rk, and f : C → R1 is convex, then E(f(X)) ≥ f(E(X))

    Equality is obtained iff f(X) =wp1 a + b′X for constants a ∈ R1, b ∈ Rk.

    E.g. Jensen's Inequality

    If PX({x1,x2})=1 with PX({x1})=α1, PX({x2})=1α1, then L(x1,x2)Bk is convex, and Px(L(x1,x2))=1

    Suppose f:L(x1,x2)R1 is convex, then for this simple context Jensen's inequality is immediate:

    E(f(X)) = α1f(x1) + (1−α1)f(x2) ≥ f(α1x1 + (1−α1)x2) = f(E(X))

    Geometrically consider the line segment {α(x1,f(x1))+(1α)(x2,f(x2)):α[0,1]} in Rk+1

    Convexity of f on the line segment implies the line segment lies above the graph

    {(αx1+(1α)x2,f(αx1+(1α)x2)):α[0,1]}

    and E(X)=α1x1+(1α1)x2 gives E(f(X))f(E(X))

    Proof (Induction on the affine dimension of C.) If affine dimC is 0, then C={x} and E(f(X))=f(x)=f(E(X)) and f(x)=wp1f(x)+0x so the result holds.

    Now assume the result holds for affine dim C < k. Let S = {(x,y) : x ∈ C, y ≥ f(x)}. Note that S ⊆ Rk+1 is convex, and (E(X), f(E(X))) is a boundary point of S (not an interior point). Then by the Supporting Hyperplane Thm, there exists c ∈ Rk+1∖{0} s.t. for every z ∈ S

    cz=i=1kcizi+ck+1zk+1c(E(X)f(E(X)))=i=1kciE(Xi)+ck+1f(E(X))

    If ck+1<0, then the inequality can be violated by taking zk+1 large, so ck+10 must hold, and we have 2 cases:

    Case 1 ck+1>0

    Let Y=i=1kci(XiE(Xi))+ck+1(f(X)f(E(X))(LHS - RHS) of above, where LHSRHS, so Y0

    P(Y ≥ 0) = 1, so 0 ≤ E(Y) = Σi ci·0 + ck+1(E(f(X)) − f(E(X))) ⇒ E(f(X)) ≥ f(E(X))

    Since E(f(X))=f(E(X) iff E(Y)=0, which occurs iff P(Y=0)=1, we have that Y=0, so rearrange the above:

    i=1kci(XiE(Xi))=ck+1(f(X)f(E(X))f(X)=f(E(X))i=1kcick+1(XiE(Xi))=(f(E(X))+i=1kcick+1E(Xi))+i=1k(cick+1)Xi

    which is of the required form a+bx

    Case 2 ck+1=0

    Then Y=i=1kci(XiE(Xi))E(Y)=0P(Y=0)=1P(X{x:cx=cE(X)}C)=1

    {x:cx=cE(X)}C is a convex set of affine dim<k, so by the inductive hypothesis the result holds.

    Note If f : C → R1 is concave and PX(C) = 1, E(X) ∈ Rk, then the concave version of Jensen says E(f(X)) ≤ f(E(X))

    Def. Kullback-Leibler Distance

    KL(P‖Q) = EP(−log(q/p)) = −∫Ω p(ω) log(q(ω)/p(ω)) v(dω) = ∫Ω p(ω) log(p(ω)/q(ω)) v(dω)

    where v is the counting (discrete case) or volume measure (a.c. case)
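    A minimal numerical illustration of the definition for discrete P and Q on a three-point sample space (the pmfs are illustrative):

    ```python
    # KL(P || Q) = sum_x p(x) log(p(x)/q(x)) for discrete P, Q (v = counting measure).
    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    print(kl_pq, kl_qp)   # both >= 0, and in general KL(P||Q) != KL(Q||P)
    ```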

    Prop 3.7.9 (KL Distance >= 0)

    If P,Q are probability measures on (Ω,A) with probability (density) functions p and q respectively,
    then KL(PQ)0 with equality iff P=Q.

    Proof Since logx is convex on (0,), applying Jensen gives

    KL(P‖Q) ≥ −log(EP(q/p)) = −log(∫Ω p(ω)·(q(ω)/p(ω)) v(dω)) = −log(∫Ω q(ω) v(dω)) = −log 1 = 0

    Equality holds iff there exist constants a, b such that

    −log(q(ω)/p(ω)) =wp1 (w.r.t. P) a + b·q(ω)/p(ω)

    which holds when p =wp1 (w.r.t. P) q, so a = b = 0.

    Now −log x and a(1 − x) agree at x = 1 and at most at one other point (draw the graphs), which implies p =wp1 (w.r.t. P) q

    Sub-proof Suppose p ≠ q, so they differ at least at two ω's. Define A = {ω : q(ω) = p(ω)}, so Ac = {ω : q(ω) = r·p(ω)} for some real number r, which implies Q(A) = P(A) and Q(Ac) = r·P(Ac) = r(1 − P(A)) = r(1 − Q(A)) = r·Q(Ac)

    This means either r = 1 or 0 = Q(Ac) = P(Ac), and both cases contradict p ≠ q with positive P probability.

    Ex. Suppose P is the N(μ1,σ12) probability measure and Q is the N(μ2,σ22) probability measure. Compute KL(PQ).

    Ex. Does KL(PQ)=KL(QP)?

    Lecture 20

    Conditional Expectation - Discrete Case

    p(X,Y)(x,y)=P(X,Y)({(x,y)})=P(X=x,Y=y)
    pYX(y|x)=p(X,Y)(x,y)pX(x)

    when pX(x)=PX({x})=P(X=x)=yp(X,Y)(x,y)>0 (otherwise cond. dist. not defined)

    EPYX(YX=x)=EpYX(YX)(x)=yypYX(yx)
    y|y|pYX(yx)=y|y|p(X,Y)(x,y)pX(x)=1pX(x)y:p(X,Y)(x,y)>0|y|p(X,Y)(x,y)1pX(x)(z,y)|y|p(X,Y)(z,y)=1pX(x)E(|Y|)<
    E(YX)(ω)=EPYX(YX)(X(ω))

    Prop 3.8.1 E[h(X)Y] = E[h(X)E(Y|X)]

    If h:(Rk,Bk)(R1,B1) is s.t. E(|Yh(X)|)<, then E(Yh(X))=E(h(X)E(YX)).

    Proof

    E(Yh(X))=(x,y)yh(x)p(X,Y)(x,y)=(x,y)yh(x)pX(x)p(X,Y)(x,y)pX(x)=(x,y)yh(x)pX(x)pYx(yx)=xh(x)(yypYX(yx))pX(x)=xh(x)EpYX(YX)(x)pX(x)=E(h(X)E(YX)).

    Corol 3.8.1 E[h(X)Y|X] = h(X)E(Y|X)

    Applying prop 3.8.1 to conditional expectations, we have E(Yh(X)X)=h(X)E(YX)

    Corol 3.8.2 (Theorem of Total Expectation)

    E(Y)=E(E(YX)) for random vector (X,Y) where E(|Y|)<

    Proof Let h(x) ≡ 1 in Prop 3.8.1.

    For the next corollary, take Y = IA for A ∈ A. Then E(Y|X)(x) = Σy y·pY|X(y|x) = 0·pY|X(0|x) + 1·pY|X(1|x) = P(A|X)(x)

    Corol 3.8.3 (Theorem of Total Probability)

    P(A)=E(P(AX)) for AA

    Corol 3.8.4 V(Y) = E[V(Y|X)] + V[E(Y|X)]

    Var(Y)=E(Var(YX))+Var(E(YX)) if E(Y),E(Y2)<

    Proof:

    Var(Y)=E((YE(Y))2)=TTEE(E((YE(Y))2X))=E(E([YE(YX)a+E(YX)E(Y)b]2X))add and subtract E(Y|X)

    Expand the inner expectation, get a2+2ab+b2

    E((YE(YX))2X)Var(Y|X)+2E((YE(YX))E(Y)E(Y|X)(E(YX)E(Y))constantX)+E((E(YX)E(Y))2X)constant, so =(E(YX)E(Y))2=Var(YX)+2(E(YX)E(YX))0(E(YX)E(Y))+(E(YX)E(Y))2=Var(YX)+(E(YX)E(Y))2

    and applying E to both sides gives the result:

    Var(Y) = E(Var(Y|X)) + Var(E(Y|X))
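    A simulation sketch of this identity under an illustrative hierarchical model (X ∼ N(0, 2²) and Y | X ∼ N(X, 1), so the right-hand side is 1 + Var(X) = 5):

    ```python
    # Var(Y) versus E(Var(Y|X)) + Var(E(Y|X)) by simulation.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000_000
    x = rng.normal(0.0, 2.0, size=n)   # X ~ N(0, 4)
    y = rng.normal(x, 1.0)             # Y | X = x ~ N(x, 1)

    print(np.var(y, ddof=1))           # ≈ 5 = E(Var(Y|X)) + Var(E(Y|X)) = 1 + 4
    ```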

    Corol 3.8.5 (Best Predictor & Residual Error)

    The random variable E(YX) is the best predictor of Y from X in the sense that it minimizes E((Yh(X))2) among all h:(Rk,Bk)(R1,B1), and smallest residual error is E(Var(YX)).

    Proof

    E((Yh(X))2)=E((YE(YX)+E(YX)h(X))2)add and subtract E(Y|X)=E((YE(YX))2)+2E(YE(YX))(E(YX)h(X))0+E((E(YX)h(X))2)

    and so

    E((Yh(X))2)=E((YE(YX))2)+E((E(YX)h(X))2)E((YE(YX))2)=E(Var(YX))

    with equality when h(X)=E(YX).

    Notes

    Conditional Expectation - Continuous Case

    If (X,Y) has density f(X,Y) and E(|Y|)<, then

    E(YX)(x)=yfYX(yx)dy where fYX(yx)=f(X,Y)(x,y)fX(x) and fX(x)=f(X,Y)(x,y)dy.

    E.g. Nk(μ,Σ)

    Suppose (YX)Nk(μ,Σ) with YRl, μ=(μYμX),Σ=(ΣYΣYXΣYXΣX) is p.d.

    Then YX=xNk(μY+ΣYXΣX1(xμX),ΣYΣYXΣX1ΣYX)

    So E(YX)(x)=μY+ΣYXΣX1(xμX), and this minimizes i=1lE((Yihi(X))2)=E(Yh(X)2) among all h:(Rk1,Bk1)(R,B)

    Def. Martingales

    E(Xn+1 | X1,…,Xn)(x1,…,xn) = xn, i.e. E(Xn+1 | X1,…,Xn) = Xn

    Lecture 21

    Def. Generating Functions

    For a sequence {an:nN0} of real numbers, the generating function is defined by G(t)=i=0aiti, provided the series converges for all t(hG,hG) where hG>0

    Def. Abel's Theorem

    If G(t)=i=0aiti is finite in (1,1) and i=0ai converges (limit could be ), then limt1G(t)=i=0ai.

    Def. Probability Generating Functions

    If X is a r.v. s.t. PX(N0)=1, then the probability generating function of X is GX(t)=E(tX)=i=0P(X=i)ti for |t|1.

    Prop 3.9.1 (Same PGF <=> same prob dist)

    If GX(t)=GY(t) for all t(h,h) for some h>0, then X and Y have the same probability distribution.

    Proof for |t|1, GX(t)=i=0P(X=i)ti , so 1k!dkGX(t)dtk|t=0=P(X=k)=1k!dkGY(t)dtk|t=0=P(Y=k)

    Thus, GX completely specifies the distribution of X

    Prop 3.9.2 (PGF Properties, K-th Factorial Moment)

    (i) If X,Y are stat. ind. r.v.'s with pgf's GX,GY, then GX+Y(t)=GX(t)GY(t).

    Proof GX+Y(t)=E(tX+Y)=E(tXtY)= ind E(tX)E(tY)=GX(t)GY(t)

    (ii) If X has pgfGX, and the k-th factorial moment of X:

    μ[k]=E(X(X1)(Xk+1))=i=ki(i1)(ik+1)P(X=i) exists, then limt1dkGX(t)dtk=μ[k].

    Proof If |t|<1, dkGX(t)dtk=dkdtki=0P(X=i)ti=i=ki(i1)(ik+1)P(X=i)tik is finite, and by Abel's Thm

    limt1i=ki(i1)(ik+1)P(X=i)tik=i=ki(i1)(ik+1)P(X=i)=μ[k]

    (iii) (Compound distributions)

    If the r.v.'s {Xi:i=1,2,} are i.i.d. with pgf GX, and are also stat. ind. of N with pgf GN,
    then Y=i=1NXi has pgf GY(t)=GN(GX(t)).

    Proof

    GY(t)=E(tY)=E(ti=1Nxi)=E(i=1Ntxi)=TTEE(E(i=1NtxiN))=n=1P(N=n)E(i=1ntxi)=(i)n=1P(N=n)(GX(t))n=GN(GX(t))

    E.g. PGF of X Poisson (λ) with λ>0, pX(i)=λii!eλ for i=0,1,2,

    GX(t)=E(tX)=i=0tiλii!eλ=eλi=0(tλ)ii!=eλetλ=eλ(t1)uses Maclaurin expansion
    GX+Y(t)=GX(t)GY(t)=eλ1(t1)eλ2(t1)=e(λ1+λ2)(t1)
    μ[1]=limt1dGX(t)dt=limt1λeλ(t1)=λμ[2]=limt1d2GX(t)dt2=limt1λ2eλ(t1)=λ2
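    A symbolic check (not in the notes) that the Poisson pgf reproduces these factorial moments, using SymPy:

    ```python
    # The Poisson(lambda) pgf G(t) = exp(lambda*(t - 1)) and its factorial moments.
    import sympy as sp

    t, lam = sp.symbols('t lambda', positive=True)
    G = sp.exp(lam * (t - 1))

    print(sp.diff(G, t, 1).subs(t, 1))   # mu_[1] = lambda
    print(sp.diff(G, t, 2).subs(t, 1))   # mu_[2] = lambda**2
    ```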

    Exercise III.9.1 If XBernoulli(p), then find GX(t) and use this to obtain the pgf for a binomial (n,p) distribution.

    Exercise III.9.2 If XGeometric(p), then find GX(t) and use this to obtain the mean and variance of X.

    Exercise III.9.3 If N ∼ Poisson(λ) independent of X1, X2, … i.i.d. ∼ −1 + 2·Bernoulli(p) and Y = Σ_{i=1}^N Xi, determine E(Y).

    Def. Moment Generating Function

    If XRk is a random vector, then the moment generating function of X is mX(t)=E(exp(tX)), provided the expectation is finite for all tBh(0). The MGF does not always exist (e.g. Cauchy).

    Def. Characteristic Function

    The characteristic function of X is given by cX(t)=E(exp(itX)) for all tRk. Since eix=cosx+isinx and both |cosx|, |sinx|1, we know eix is bounded:

    E(|exp(itX)|)=E(|cos(tX)+isin(tX)|)E(|cos(tX)|)+E(|sin(tX)|)2

    so cX(t)=E(cos(tX))+iE(sin(tX)) always exists (may be complex valued)

    If PX(B) = PX(−B) for all B ∈ Bk, then P(t′X ≤ x) = P(−t′X ≤ x), so t′X has a probability distribution symmetric about 0.
    Since sine is an odd function, i.e. sin(−x) = −sin(x), its expectation E(sin(t′X)) = 0 and cX is real-valued

    Prop 3.9.3 (Uniqueness of MGF & CF)

    (i) If mX,mY exist and mX(t)=mY(t) for all tBh(0), for some h>0, then PX=PY.

    (ii) If cX(t)=cY(t) for all tRk then PX=PY.

    If we know mX or cX and we recognize it, then we know the distribution of X.

    There are inversion results that give expressions for the cdf of X computed from mX or cX.

    Note Same distribution does not mean same r.v.

    Def. Mixed Moment of Random Vector

    If i1,,ikN0, then (i1,,ik)-th mixed moment of a random vector XRk is defined by μi1,,ik=E(X1i1Xkik) whenever this expectation exists.

    Prop 3.9.4 (Prev. Mixed Moments are Finite)

    If i1j1,,ikjk and E(|X1j1Xkjk|)< for all (j1,,jk) satisfying j1++jk=j, then μi1,,ik is finite.

    Proof (for k = 2)

    Exercise III.9.4

    Prop 3.9.5 (i-th Mixed Moment)

    If mX exists, then all the moments of X are finite and the (i1,,ik)-th mixed moment is given by

    μi1,,ik=lmX(t)i1t1iktk|t=0

    where l=i1++ik.

    Proof Consider the case when k=1. Then for tBh(0),

    mX(t)=E(exp(tX))=E(I{X0}exp(tX))+E(I{X<0}exp(tX))=E(I{X0}exp(tX+))+E(I{X<0}exp(tX))=mX+(t)P(X<0)+mX(t)P(X0)<m|X|(t)=E(exp(tX++tX))=mX+(t)P(X<0)+mX(t)P(X0)

    Let Yn=j=0ntjXjj!j=0tjXjj!=exp(tX) so |Yn|j=0n|t|j|X|jj!k=0|t|j|X|jj!=exp(|t||X|).

    Since m|X| exists, E(|X|k)k!|t|km|X|(|t|)< and so all moments of X are finite.

    Furthermore, by DCT (Dominating Convergence Thm), limnE(Yn)j=0tjμjj!=mX(t)μj=djmX(t)dtj|t=0

    For the general case, let Z=(|X1|,,|Xk|) and a similar argument shows that mZ exists. Let

    Yn=j=0n(t1X1++tkXk)jj!=j=0n1j!i10ik0i1++ik=j(ji1ik)t1i1tkikX1i1Xkik|Yn|exp(|t1||X1|++tk|Xk|)

    which implies μi1,,ik is finite, and by DCT, E(Yn)j=0i10ik0i1++ik=jt1i1tkiki1!ik!μi1,,ik=mX(t)

    Prop 3.9.6 c(t) = m(it)

    If mX exists, then cX(t)=mX(it)

    Prop 3.9.7 (MGF & CF of X+Y)

    If X,YRk are stat. ind. with mgf's mX,mY (cf's cX,cY), then X+Y has mgf mX+Y(t)=mX(t)mY(t) when mX(t) and mY(t) are finite and cfX+Y(t)=cX(t)cY(t).

    Proof

    cX+Y(t)=E(exp(it(X+Y))=E(exp(itX)exp(itY))=E(exp(itX))E(exp(itY))=cX(t)cY(t)

    E.g. MGF and CF of XNk(μ,Σ)

    X=μ+Σ1/2Z where ZNk(0,I), so Z1,,Zki,i,dN(0,1) and

    mZ(t)=E(exp(tZ))=E(exp(t1Z1++tkZk))=E(i=1kexp(tiZi))=i.i.di=1kE(exp(tiZi))=i=1kmZ(ti) where mZ(t)=exp(tz)12πexp(z2/2)dz=exp(t2/2)12πexp((zt)2/2)dzcomplete the square for z2/2+tz=exp(t2/2)

    so mZ(t)=exp(12i=1kti2)=exp(tt/2)

    Plugging X=μ+Σ1/2Z into mX(t)=E(exp(tX)), we get

    mX(t)=E(exp(t(μ+Σ1/2Z))=exp(tμ)E(exp(tΣ1/2Z))=exp(tμ)E(exp((Σ1/2t)Z))=exp(tμ)exp(tΣt/2)=exp(tμ+tΣt/2)cX(t)=exp(itμtΣt/2) using Prop. III.9.6 

    If X1,,Xn is a sample from the Nk(μ,Σ) distribution, then the sample mean Y=1ni=1nXi has the following MGF:

    mY(t)=E(exp(t1ni=1nXi))=E(i=1nexp((tn)Xi))=i,i,di=1nmX(t/n)=exp(tμ+tΣt/2n)

    So by uniqueness, YNk(μ,Σ/n)

    Prop 3.9.8 (Normal r'X -> Normal X)

    If XRk is a random vector and rX is normally distributed for all constant rRk, then XNk(μ,Σ) for some (μ,Σ).

    Proof We have that E(rX)=rE(X) and Var(rX)=rVar(X)r and so (μ,Σ)=(E(X),Var(X)). Now

    mrX(t)=exp(trμ+t2rΣr/2)=mX(tr)

    which implies the result.

    E.g. Cauchy

    limt0cX(t)=limt0E(cos(tX))+ilimt0E(sin(tX))=1
    j=1nk=1nxjxkcx(tjtk)=E(|j=1nxjexp(itjx)|2)0

    Exercise III.9.4 If X1,…,Xn are mut. stat. ind. with Xi ∼ Nki(μi,Σi) and a ∈ Rm, Ci ∈ Rm×ki are constant, then determine the distribution of Y = a + Σ_{i=1}^n CiXi.
    Exercise III.9.5 E&R 3.4.13
    Exercise III.9.6 E&R 3.4.16
    Exercise III.9.7 E&R 3.4.20
    Exercise III.9.8 E&R 3.4.29

    4. Convergence

    Lecture 22

    Motivation

    Def. Convergence in Distribution

    Xn converges in distribution to r.v. X if limnFXn(x)=FX(x) for every continuity point x of the cdf FX of X

    If XndX, then PXn((a,b])=FXn(b)FXn(a)FX(b)FX(a) for large n provided a,b are continuity points of FX

    Note convergence in distribution is about approximating the dist of a r.v. and not about approximating the value of the r.v.

    E.g. Why restrict to convergence at continuity points of FX ?

    FXn(x) = 0 if x < −1/n; 1/2 if −1/n ≤ x < 1/n; 1 if 1/n ≤ x
    FX(x) = 0 if x < 0; 1 if 0 ≤ x, while limn FXn(x) = 0 if x < 0; 1/2 if x = 0; 1 if 0 < x

    Prop 4.1.1 (Series Expansion of CF)

    If E(|X|k)<, then cX(t)=j=0k(it)jj!μj+o(tk) where the remainder o(tk) is a function of t satisfying limt0o(tk)/tk=0.

    Proof Integrate the below using IBP with u=eis,dv=(xs)n, so du=ieis,v=(xs)n+1n+1

    udv=uvvdu0x(xs)neisds=xn+1n+1+in+10x(xs)n+1eisds

    so with n = 0, we have

    0x(xs)0eisds=x+i0x(xs)1eisdsby above[eisi]0x=x+i(x22+i2(x2)2eisds)1i(eix1)=x+ix22++in1xnn!+inn!0x(xs)neisds(eix1)=ix+i2x22++inxnn!+in+1n!0x(xs)neisdseix=j=0n(ix)jj!+in+1n!0x(xs)neisds.

    now with n-1, we have

    0x(xs)n1eisds=xnn+in0x(xs)neisds0x(xs)neisds=ni(0x(xs)n1eisdsxnn)plug in n=0 like above and simplifyeix=j=0n(ix)jj!+in(n1)!0x(xs)n1(eis1)ds

    Using |f(x)||f(x)|, we have

    |eixj=0n(ix)jj!|min{|x|n+1(n+1)!,2|x|nn!}|eis1||eis|+|1|=21|t|k|E(eitXj=0k(itX)jj!)|1|t|kE(min{|tX|k+1(k+1)!,2|tX|kk!})|E(X)|E(|X|)1|t|k|cX(t)j=0k(it)jj!μj|E(min{|t||X|k+1(k+1)!,2|X|kk!})

    This upper bound is finite since E(|X|k)< and goes to 0 as t0, which proves the result by DCT.

    Prop 4.1.2 (Continuity Theorem)

    Suppose Xn is a sequence of r.v.'s.

    (i) If XndX, then cXn(t)cX(t) for every t.

    (ii) If cXn(t)c(t) for every t and c is continuous at 0 , then c is the CF of a r.v. X such that XndX.

    Prop 4.1.3 (Weak Law of Large Numbers)

    If Xn is a sequence of i.i.d. r.v.'s with E(Xi)=μR1, then 1nSn=1ni=1nXidμ, the r.v. with distribution degenerate at μ

    Proof Let X be degenerate at μ, so cX(t)=E(eitx)=eitμ=exp(itμ) which is continuous at 0 . Also,

    c1nSn(t)=E(exp(itni=1nXi))=i.i.dcX1n(tn)=(1+iμtn+o(tn))nby Prop 4.1.1=(1+iμtn)n(1+o(tn)1+iμtn)nexp(itμ)

    If xn0 and nxn converges to a finite limit, then log(1+xn)n=nlog(1+xn)=n(xnxn2/2+xn3/3)limnxn. The first term thus converges to exp(itμ). The second term converges to 1 since o(tn) converges to 0. The result follows by (ii) of the Continuity Theorem.

    Note The Strong Law of Large Numbers says 1nSn=1ni=1nXiwp1μ

    We can prove that if Xn →wp1 X, then Xn →d X, and so the SLLN implies the WLLN.

    Prop 4.1.4 (Central Limit Theorem)

    If Xn is a sequence of i.i.d. r.v.'s with E(Xi)=μR1,Var(Xi)=σ2, then Zn=1nSnμσ/ndZN(0,1)

    Proof E(1nSn)=μ,Var(1nSn)=σ2n so Zn has mean 0 and variance 1.

    Also, Yi=(Xiμ)σ has mean 0 and variance 1, so we can write Zn=1ni=1nYi.

    Since Y1,,Yn are i.i.d,

    cZn(t)=cY1n(tn)=(1+itnE(Y1)t22nE(Y12)+o(t2n))n (by Prop IV.1.1) =(1t22n+o(t2n))net2/2

    which is the CF of ZN(0,1) and the result follows by the Continuity Theorem.

    E.g. Normal approximation to the binomial

    X1,X2, are i.i.d. Bernoulli (p) with E(Xi)=p,Var(Xi)=p(1p), so Sn=i=1nXi binomial(n,p) 1nSn= proportion of 1 's in X1,X2,,Xn, so by CLT we have 1nSnpp(1p)/nN(0,1)

    For large n with ZN(0,1)

    Φ(b)Φ(a)=P(a<Zb)P(a<1nSnpp(1p)/nb)=P(np+anp(1p)<Snnp+bnp(1p))

    Note a,b reflect how long the interval about the mean is in terms of standard deviations
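    A numerical comparison of the normal approximation with exact binomial probabilities (n = 100, p = 0.3 and a = −1, b = 1 are illustrative choices):

    ```python
    # Exact binomial probability of a +/- 1 standard deviation interval versus Phi(1) - Phi(-1).
    import numpy as np
    from scipy.stats import binom, norm

    n, p = 100, 0.3
    a, b = -1.0, 1.0
    lo = n*p + a*np.sqrt(n*p*(1-p))
    hi = n*p + b*np.sqrt(n*p*(1-p))

    exact = binom.cdf(np.floor(hi), n, p) - binom.cdf(np.floor(lo), n, p)
    approx = norm.cdf(b) - norm.cdf(a)
    print(exact, approx)   # close to each other (≈ 0.68)
    ```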

    E.g. Poisson approximation to the binomial (rare events)

    Consider i.i.d. Bernoulli(pn) X1, X2, …, Xn. Since n·o(1/n) → 0, we have that o(1/n) → 0, so pn = λ/n + o(1/n) → 0

    Since Snbinomial(n,λ/n+o(1/n)),

    P(Sn=k)=(nk)(λn+o(1/n))k(1λno(1/n))nk=n(n1)(nk+1)nkλkk!(1+no(1/n)λ)k(1λn)n(1o(1/n)1λn)n(1λno(1/n))k=[1(11n)(1kn+1n)](1+no(1/n)λ)k(1o(1/n)1λn)n(1λno(1/n))kλkk!(1λn)n1111λkk!eλ=λkk!eλ

    So at any continuity point y ∈ (k, k+1), k ∈ N0, of the Poisson(λ) cdf, P(Sn ≤ y) → Σ_{i=0}^k (λ^i/i!)e^{−λ}

    Thus, Snd Poisson (λ)
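    A numerical illustration (λ = 3 is an illustrative choice) of how Binomial(n, λ/n) probabilities approach Poisson(λ) probabilities as n grows:

    ```python
    # max_k |Binomial(n, lambda/n) pmf - Poisson(lambda) pmf| shrinks as n grows.
    import numpy as np
    from scipy.stats import binom, poisson

    lam = 3.0
    k = np.arange(0, 10)
    for n in (10, 100, 1000):
        print(n, np.max(np.abs(binom.pmf(k, n, lam/n) - poisson.pmf(k, lam))))
    ```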

    Lecture 23

    Def. Convergence in Probability

    The sequence Xn of r.v.'s converges in probability to r.v. X if limnP(|XnX|>δ)=0 for any δ>0. Denote as XnPX

    Note this is different from (weaker than) Xn →wp1 X, which says P({ω : limn Xn(ω) ≠ X(ω)}) = 0

    while XnPX says for any δ>0,ε>0, there exists Nδ,ε s.t. P({ω:|Xn(ω)X(ω)|>δ})<ε for all n>Nδ,ε

    Prop 4.2.1 (Convergence Hierarchy)

    Xnwp1XXnPXXndX

    Note The converse is false.

    Proof (convergence wp1 implies convergence in P)

    Let Am,n={ω:|Xn(ω)X(ω)|>1m} so lim supnAm,n={ω:|Xn(ω)X(ω)|>1m for infinitely many n}

    By hypothesis, 0=P(limsupnAm,n)=P(k=1n=kAm,n)=limkP(n=kAm,n)limkP(Am,k)

    so limkP(Am,k)=0 which implies XnPX.

    Proof (convergence in P implies convergence in dist)

    FXn(x)=P(Xnx,Xx+δ)+P(Xnx,X>x+δ)FX(x+δ)+P(|XnX|>δ)FX(xδ)=P(Xnx,Xxδ)+P(Xn>x,Xxδ)FXn(x)+P(|XnX|>δ)FX(x)FXn(x+δ)+P(|XnX|>δ)

    NTS FXn(x)FX(x) at every continuity point. Subtract term from LHS and a smaller term from RHS:

    FXn(x)FX(x)FX(x+δ)FX(xδ)+P(|XnX|>δ)FX(x)FXn(x)FX(x+δ)FX(xδ)+P(|XnX|>δ)

    Then, for ε>0 there exists Nδ,ε s.t. P(|XnX|>δ)<ε/2 for all n>Nδ,ε, and so

    |FX(x)FXn(x)|FX(x+δ)FX(xδ)+ε/2

    When x is a continuity point of Fx, choose δ s.t. |FX(x+δ)FX(xδ)|ε/2. So |FX(x)FXn(x)|ε/2+ε/2=ε. Since ε is arbitrary this implies the result.

    E.g. XndX does not imply XnPX

    Prop 4.2.2 (Convergence to a Constant)

    Xndμ iff XnPμ.

    Proof By Prop 4.2.1, XnPμ implies Xndμ. For the other direction,

    P(|Xnμ|δ)=P(μδXnμ+δ)=(FXn(μ+δ)FXn(μδ))+P(Xn=μδ)FXn(μδ)0P(|Xnμ|δ)10+0=1

    which implies XnPμ.

    Prop 4.2.3 (Slutsky's Theorem)

    If Xn →d X and Yn →d c, then (i) Xn + Yn →d X + c (ii) XnYn →d cX (iii) provided c ≠ 0, Xn/Yn →d X/c

    Prop 4.2.4 (Cont. Function Convergence)

    If Xndc and h is continuous (thus measurable) at c, then h(Xn)dh(c).

    Proof Let ε>0. Then there exists δ>0 s.t. |h(x)h(c)|ε whenever |xc|δ. Therefore

    P(|h(Xn)h(c)|>ε)P(|Xnc|>δ)0

    E.g. Suppose X1,X2, is an i.i.d. sequence from a distribution with mean μ and variance σ2. By CLT,

    1ni=1nXiμσ/n=n(X¯μ)σdN(0,1)
    S2=i=1n(XiX¯)2n1=nn1(1ni=1nXi2X¯2) nn1wp11,1ni=1nXi2dσ2+μ2,X¯2dμ2 S2dσ2Sdσ

    Therefore, n(X¯μ)S=σSn(X¯μ)σdN(0,1) by Slutsky

    Note when X1, X2, … is an i.i.d. N(μ,σ²) sequence, this implies Student(n−1) →d N(0,1) as n → ∞

    Def. Convergence in Expectation of Order r

    Xn converges in expectation of order r (≥ 1) to X if E(|Xn|^r) < ∞ for every n and limn E(|Xn − X|^r) = 0. Denoted Xn →r X

    Prop 4.3.1 (Order r implies order s; order 1 implies P)

    (i) If XnrX, then XnsX for any 1sr.

    Proof d2xpdx2=p(p1)xp20 when x0,p1. This implies xr/s is convex (opens up) on [0,). Therefore,

    E(|XnX|r)=E((|XnX|s)rs) Jensen (E(|XnX|s))rs

    Since LHS goes to 0, RHS must also go to 0.

    (ii) If Xn1X, then XnPX.

    Proof For any δ>0

    P(|XnX|>δ) Markov E(|XnX|)δ0.

    Note the converse to this proposition is false

    Prop 4.3.2 (Order 2)

    Take r=2 and let L2(P)={X:X is a r.v. and E(X2)<}

    Define ,:L2(P)×L2(P)R1 by X,Y=E(XY). Note that (E(XY))2Cauchy-SchwartzE(X2)E(Y2)<

    Define ‖X‖ = ⟨X,X⟩^{1/2}

    (i) If X, Y ∈ L2(P), then a + bX + cY ∈ L2(P) for all constants a, b, c.

    (ii) ⟨·,·⟩ is an inner product on L2(P)

    (iii) ‖·‖ is a norm on L2(P).

    Proof

    Exercise IV.3.1.

    Geometric interpretation: the angle θ between XE(X),YE(Y)L2(P) satisfies

     cos θ=XE(X),YE(Y)XE(X)YE(Y)=E((XE(X))(YE(X)))E((XE(X))2)12E((YE(Y))2)12=Cov(X,Y)Sd(X)Sd(Y)=Corr(X,Y)

    Prop 4.3.3 (L2 Law of large Numbers)

    If Xn is an i.i.d. sequence in L2(P), then 1ni=1nXi2E(X1).

    Proof

    E((1ni=1nXiE(X1))2)=Var(1ni=1nXi)=Var(X1)n0

    So Xn2X implies Xn1X implies XnPX implies XndX

    Note In time series many stochastic processes are defined in terms of series of r.v.'s that converge in L2

    Summary (wp1 => p => d)

    Strong convergence (wp1, or almost sure convergence)

    Weak convergence (convergence in distribution)

    Convergence in probability (in between strong and weak convergence)

    5. Gaussian Process

    Lecture 24 (Discrete Time)

    Recall Def. Stationary Process:

    For any {t1,,tn}T, {(t,Xt):tT} is a Gaussian process if

    (Xt1Xtn)Nn((μ(t1)μ(tn)),(σ(t1,t1)σ(t1,tn)σ(tn,t1)σ(tn,tn)))

    for some mean function μ:TR1 and autocovariance function σ:T×TR1

    For TR1, a weakly stationary Gaussian process has constant μ and σ(ti,tj)=κ(titj) for some positive definite κ:TR1

    Def. Strictly Stationary Process

    A strictly stationary process has the property that for all {t1,…,tn} ⊆ T ⊆ R1, (Xt1+h,…,Xtn+h) =d (Xt1,…,Xtn), where h is such that {t1+h,…,tn+h} ⊆ T

    Note A weakly stationary Gaussian process is always strictly stationary, since σ(ti+h,tj+h)=κ(titj)=σ(ti,tj) (similar to how covariance = 0 implies statistical independence with joint normality)

    Def. Autoregressive process of order 1

    Suppose we have an i.i.d. N(0,τ2) process with {Zn:nZ}. Consider Xn=αXn1+Zn where Xn1 is ind. of Zn

    Proof Does there exist a stationary Gaussian process satisfying the definition above?

    Xn=αXn1+Zn=α2Xn2+αZn1+Zn=αkXnk+αk1Znk+1++Znafter k steps=αkXnk+j=0k1αjZnj()
    E(Xn)=αkE(Xnk)+0+...+0{Xn:nZ} is stationary so constant mean and var=αkE(X0)0Var(αkXnk)=α2kE(Xnk2)=α2kE(X02)0
    Ab={ω:i=0|αiZni(ω)|b}=m=0{ω:i=0m|αiZni(ω)|b}A={ω:i=0|αiZni(ω)|=}=b=1{ω:i=0|αiZni(ω)|>b}
    E(i=0|α|i|Zni|)=E(limmi=0m|αiZni|)=MCTlimmE(i=0m|αiZni|)=E(|Z0|)limmi=0m|α|i=E(|Z0|)(1|α|)1<
    Xn=i=0αiZni=i=0(αiZni)+i=0(αiZni)
    E(Xn)=E(limmi=0mαiZni)=DCTlimmE(i=0mαiZni)=0
    Var(Xn)=E(Xn2)=E((i=0αiZsi)2)E((i=0|αiZsi|)2)=E(limm(i=0m|αiZsi|)2)=MCTlimmE((i=0m|αiZsi|)2)=limmE(i=0m|αiZsi|2+20i<jm|αiZsi||αjZsj|XY)since |XY|X2+Y2limmE(5i=0m|αiZsi|2)=5E(|Z0|2)(1|α|2)1<
    σ(s,t)=Cov(Xs,Xt)=E(i=0αiZsij=0αjZtj)=i=0j=0αi+jE(ZsiZtj)0 unless s-i = t-j={(i,j):si=tj}αi+jE(ZsiZtj)=i=stα2i+tsE(Zsi2)=τ2αsti=0α2i=τ2α|st|1α2
    j=0kαjZn0jN(0,τ2j=0kα2j)=N(0,τ21α2(k+1)1α2)N(0,τ21α2)

    Take Xn0=j=0kαjZn0j (write as a lin combo), generate future (error) values Zn0k,Zn0k+1,,Zn0+ni.i.dN(0,τ2)

    and use Xn=αXn1+Zn to obtain Xn0,Xn0+1,,Xn0+n
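    A simulation sketch of this construction (α = 0.8, τ = 1 are illustrative values; here X0 is started from the stationary N(0, τ²/(1 − α²)) distribution rather than the truncated sum above), comparing sample lag-h autocovariances with τ²α^h/(1 − α²):

    ```python
    # Stationary AR(1): X_n = alpha*X_{n-1} + Z_n with X_0 ~ N(0, tau^2/(1-alpha^2)).
    import numpy as np

    rng = np.random.default_rng(5)
    alpha, tau, N = 0.8, 1.0, 200_000

    x = np.empty(N)
    x[0] = rng.normal(0.0, tau / np.sqrt(1 - alpha**2))
    z = rng.normal(0.0, tau, size=N)
    for n in range(1, N):
        x[n] = alpha * x[n-1] + z[n]

    for h in range(4):
        emp = np.cov(x[:N-h], x[h:])[0, 1]          # lag-h sample autocovariance (h = 0 gives the variance)
        print(h, emp, tau**2 * alpha**h / (1 - alpha**2))
    ```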

    Lecture 25 (Continuous Time)

    Def. Brownian Motion

    A stochastic process {(t,Wt):t0} is a standard Wiener process (another name for Brownian Motion) if

    (i) P(W0=0)=1

    (ii) for any 0<t1<<tk, the increments Wt1,Wt2Wt1,,WtkWtk1 are mutually stat. ind.

    (iii) WtWsN(0,ts) for any 0st

    {(t,Xt) : t ≥ 0} satisfying the above with Xt = τWt, so that Xt − Xs ∼ N(0, τ²(t − s)) for 0 ≤ s ≤ t, is a general Wiener process.

    It is a Gaussian process with mean function 0 and autocovariance function σ(s,t)=τ2 min (s,t).

    Proof For any 0<t1<<tn and c1,,cnR1

    i=1nciXti=τi=1nciWti=τ[cn(WtnWtn1)+(cn1+cn)(Wtn1Wtn2)++(c1++cn)Wt1]N(0,τ2i=1n(j=1ni+1cj)2(titi1))

    and so (Xt1,,Xtn) is multivariate normal since every linear combination is normal (Prop 3.9.8). Also,

    σ(s,t)=E(XsXt)=τ2E(WsWt)=stτ2E(Ws(Ws+WtWs))=τ2E(Ws2)+τ2E(Ws(WtWs))=τ2s+τ20=τ2s=τ2min(s,t)

    Therefore, (Xt1,,Xtn)Nn(0,τ2(min(ti,tj))), so by KCT this is a Gaussian process.

    Prop 5.2.2 (Alt. Brownian Motion)

    There exists a version of {(t,Wt):t0} also satisfying

    (iv) P(Wt is continuous in t)=1

    (v) P(Wt is nowhere differentiable in t)=1

    E.g. How does Brownian motion arise? It arises as a limiting process.

    Suppose Z1, Z2, … i.i.d. with mean 0 and variance 1. Let S0 = 0 and Sn = Σ_{i=1}^n Zi, a random walk (e.g. a simple symmetric random walk when Zi ∼ −1 + 2·Bernoulli(1/2))

    Prop 5.2.3 (Donsker's Thm / Invariance Principle)

    {(t, n^{−1/2} S_{⌊nt⌋}) : t ∈ [0,1]} →d {(t, Wt) : t ∈ [0,1]}
    {(t, (ΔT0)^{1/2} S_{⌊t/ΔT0⌋}) : t ∈ [0,T0]} = {(t, T0^{1/2} n^{−1/2} S_{⌊nt/T0⌋}) : t/T0 ∈ [0,1]} →d {(t, Wt) : t ∈ [0,T0]}, with time step ΔT0 = T0/n
    t ↦ n^{−1/2}[(1 − nt + ⌊nt⌋) S_{⌊nt⌋} + (nt − ⌊nt⌋) S_{⌊nt⌋+1}]

    which has continuous sample paths and the same convergence result applies
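    A simulation sketch of the invariance principle (the path length n, number of paths, and time points below are illustrative): scaled simple symmetric random walk paths should have Var(W1) ≈ 1 and Cov(Ws, Wt) ≈ min(s, t):

    ```python
    # Scaled random walk paths n^{-1/2} S_{floor(nt)}: check Var(W_1) and Cov(W_s, W_t).
    import numpy as np

    rng = np.random.default_rng(6)
    n, n_paths = 1000, 4000
    z = rng.choice([-1.0, 1.0], size=(n_paths, n))    # Z_i = +/-1 with prob 1/2
    w = np.cumsum(z, axis=1) / np.sqrt(n)             # approximate W_t at t = i/n

    i_s, i_t = int(0.3 * n) - 1, int(0.7 * n) - 1     # t = 0.3 and t = 0.7
    print(np.var(w[:, -1], ddof=1))                   # ≈ 1 = Var(W_1)
    print(np.cov(w[:, i_s], w[:, i_t])[0, 1])         # ≈ min(0.3, 0.7) = 0.3
    ```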

    Def. Diffusion Process

    Define {(t,Xt):t0} by Xt=α+δt+σWt (stock market) where α= initial value, δ=drift and σ= volatility

    Exercise V.2.1 E&R 11.5.7,11.5.8,11.5.12,11.5.13.

    Lecture 26

    Recall a discrete-time martingale is a s.p. {(n,Xn) : n ∈ T}, where T is a set of consecutive integers, s.t. E(|Xn|) < ∞ and E(Xn | X1,…,Xn−1) = Xn−1 for every n ∈ T

    E.g. Random walks are martingales

    Suppose X1,X2, are i.i.d. with E(X1)=0, and consider the random walk {(n,Sn):nN0} where S0=a and Sn=a+i=1nXi, then

    E(SnS0,,Sn1)=E(Sn1+XnS1,,Sn1)=E(Sn1S0,,Sn1)+E(XnS1,,Sn1)=Sn1+E(XnX1,,Xn1)Xiii,i.d.Sn1+E(Xn)=Sn1

    Prop 6.1 (Martingale Convergence Theorem)

    If {(n,Xn):nN} is a martingale with sup nE(|Xn|)<, then there exists a r.v. X with finite expectation s.t. Xnwp1X.

    Def. Stopping Time

    For {(n,Xn):nN0}, a stopping time is a mapping T:ΩN0{} such that {T=n}={ω:T(ω)=n}A for all nN0.

    E.g. Hitting Time

    E.g. Clinical trials (simplified)

    {TB = n} = (∩_{k=n0}^{n−1} {p1 < Sk/k < p2}) ∩ ({Sn/n ≤ p1} ∪ {Sn/n ≥ p2}) ∈ A

    so TB is a stopping time

    Notes

    Proof We have XT1(,b]={ω:XT(ω)(ω)b}=n=0{T(ω)=n,Xn(ω)b} which implies the result.

    Prop 6.3 (Optional Stopping Theorem)

    Suppose {(n,Xn):nN0} is a martingale with X0=a,T is a stopping time for the process and that one of the following conditions holds for some constant M>0. Then E(XT)=a.

    (i) (bounded before stopping) P(max{|X0|,…,|Xn|} ≤ M | T ≥ n) = 1 for every n

    (ii) (bounded stopping time) P(TM)=1

    E.g. Random walks and hitting times
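    A simulation sketch for this example (a = 3, b = 10 are illustrative values): a simple symmetric random walk started at a and stopped at T = first hitting time of 0 or b is bounded before stopping, so by the Optional Stopping Theorem E(XT) = a, which also forces P(hit b) = a/b:

    ```python
    # Simple symmetric random walk from a, stopped at T = first hit of 0 or b.
    import numpy as np

    rng = np.random.default_rng(7)
    a, b, n_paths = 3, 10, 5000

    hits = np.empty(n_paths)
    for i in range(n_paths):
        x = a
        while 0 < x < b:
            x += 1 if rng.random() < 0.5 else -1
        hits[i] = x

    print(hits.mean())          # ≈ a = 3, i.e. E(X_T) = a
    print((hits == b).mean())   # ≈ a/b = 0.3
    ```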