A Proofs

A.1 Proofs of results in Chapter 2

A.1.1 Proof of Theorem 2.3

In order to use Theorem 2.2 to study the behavior of \(\hat{\Delta^Y_{WW}}\), we have to prove that it is unbiased and to compute \(\var{\hat{\Delta^Y_{WW}}}\). Let's first prove that the \(WW\) estimator is an unbiased estimator of \(TT\):

Lemma A.1 (Unbiasedness of $\hat{\Delta^Y_{WW}}$) Under Assumptions 1.7, 2.1 and 2.2,

\[\begin{align*} \esp{\hat{\Delta^Y_{WW}}}& = \Delta^Y_{TT}. \end{align*}\]

Proof. In order to prove Lemma A.1, we are going to use a trick: we compute the expectation of the \(WW\) estimator conditional on a given treatment allocation. Because the resulting conditional expectation does not depend on the treatment allocation, averaging it over all possible allocations delivers the result. This trick simplifies the derivations a lot and is quite natural: think first of all the samples that share the same treatment allocation, then average your results over all possible treatment allocations.

\[\begin{align*} \esp{\hat{\Delta^Y_{WW}}} & = \esp{\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}}\\ & = \esp{\esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\ & = \esp{\esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i|\mathbf{D}}-\esp{\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\ & = \esp{\frac{1}{\sum_{i=1}^N D_i}\esp{\sum_{i=1}^N Y_iD_i|\mathbf{D}}-\frac{1}{\sum_{i=1}^N (1-D_i)}\esp{\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\ & = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N \esp{Y_iD_i|\mathbf{D}}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N \esp{Y_i(1-D_i)|\mathbf{D}}}\\ & = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N \esp{Y_iD_i|D_i}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N \esp{Y_i(1-D_i)|D_i}}\\ & = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N D_i\esp{Y_i|D_i=1}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N(1-D_i)\esp{Y_i|D_i=0}}\\ & = \esp{\frac{\sum_{i=1}^N D_i}{\sum_{i=1}^N D_i}\esp{Y_i|D_i=1}-\frac{\sum_{i=1}^N(1-D_i)}{\sum_{i=1}^N (1-D_i)}\esp{Y_i|D_i=0}}\\ & = \esp{\esp{Y_i|D_i=1}-\esp{Y_i|D_i=0}}\\ & = \esp{Y_i|D_i=1}-\esp{Y_i|D_i=0} \\ & = \Delta^Y_{TT}. \end{align*}\]

The first equality uses the Law of Iterated Expectations (LIE). The second equality substitutes the definition of the \(WW\) estimator. The third and fifth equalities use the linearity of conditional expectations. The fourth equality uses the fact that, conditional on \(\mathbf{D}\), the numbers of treated and untreated units are constants. The sixth equality uses Assumption 2.2. The seventh equality uses the fact that \(\esp{Y_iD_i|D_i}=D_i\esp{Y_i|D_i=1}\) and \(\esp{Y_i(1-D_i)|D_i}=(1-D_i)\esp{Y_i|D_i=0}\). The eighth and tenth equalities use the fact that \(\esp{Y_i|D_i=1}\) and \(\esp{Y_i|D_i=0}\) are constants. The last equality uses Assumption 1.7.
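
To see the conditioning trick at work on simulated data, here is a small numerical sketch (purely illustrative, with arbitrarily chosen sample size, treatment probability and outcome distributions, and not part of the original proof): averaging the \(WW\) estimator over many replications of the random treatment allocation should recover the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 0.5            # sample size and treatment probability (arbitrary)
tt = 1.0                   # true effect (constant across units in this illustration)

def ww(y, d):
    """With/Without estimator: difference in means between treated and untreated."""
    return y[d == 1].mean() - y[d == 0].mean()

estimates = []
for _ in range(20000):
    d = rng.binomial(1, p, size=N)          # random treatment allocation
    if d.sum() in (0, N):                   # Assumption 2.1: both groups non-empty
        continue
    y0 = rng.normal(0, 1, size=N)           # potential outcome without treatment
    y1 = y0 + tt                            # potential outcome with treatment
    y = d * y1 + (1 - d) * y0               # observed outcome (switching equation)
    estimates.append(ww(y, d))

print(np.mean(estimates))                   # should be close to tt = 1.0
```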

Let’s now compute the variance of the \(WW\) estimator:

Lemma A.2 (Variance of $\hat{\Delta^Y_{WW}}$) Under Assumptions 1.7, 2.1 and 2.2,

\[\begin{align*} \var{{\hat{\Delta^Y_{WW}}}} & = \frac{1-(1-\Pr(D_i=1))^N}{N\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{N(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}. \end{align*}\]

Proof. Same trick as before, but now using the Law of Total Variance (LTV):

\[\begin{align*} \var{{\hat{\Delta^Y_{WW}}}} & = \esp{\var{\hat{\Delta^Y_{WW}}|\mathbf{D}}}+\var{\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}}\\ & = \esp{\var{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\ & = \esp{\var{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i|\mathbf{D}}}+\esp{\var{\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\ & \phantom{=}-2\esp{\cov{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i,\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\ & = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\var{\sum_{i=1}^N Y_iD_i|\mathbf{D}}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\var{\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\ & = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\var{\sum_{i=1}^N Y_iD_i|D_i}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\var{\sum_{i=1}^N Y_i(1-D_i)|D_i}} \\ & = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\sum_{i=1}^ND_i\var{Y_i|D_i=1}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\sum_{i=1}^N(1-D_i)\var{Y_i|D_i=0}} \\ & = \var{Y_i|D_i=1}\esp{\frac{1}{\sum_{i=1}^N D_i}}+\var{Y_i|D_i=0}\esp{\frac{1}{\sum_{i=1}^N (1-D_i)}} \\ & = \frac{1-(1-\Pr(D_i=1))^N}{N\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{N(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}. \end{align*}\]

The first equality stems from the LTV. The second equality stems from the definition of the \(WW\) estimator and from the fact that \(\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}=\Delta^Y_{TT}\) is a constant (see the proof of Lemma A.1), so that \(\var{\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}}=0\). The third equality stems from the formula for the variance of a difference of random variables. The fourth equality stems from Assumption 2.2, which implies that the covariance between the two conditional averages is zero, and from the formula for the variance of a random variable multiplied by a constant. The fifth and sixth equalities stem from Assumption 2.2 and from \(\var{Y_iD_i|D_i}=D_i\var{Y_i|D_i=1}\) and \(\var{Y_i(1-D_i)|D_i}=(1-D_i)\var{Y_i|D_i=0}\). The seventh equality stems from \(\var{Y_i|D_i=1}\) and \(\var{Y_i|D_i=0}\) being constants. The last equality stems from the formula for the expectation of the inverse of a sum of Bernoulli random variables when at least one of them takes value one, which is the case under Assumption 2.1.
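
As a complementary illustration (again a sketch with arbitrary parameter values, not part of the proof), the Monte Carlo variance of the \(WW\) estimator can be compared with the closed-form expression of Lemma A.2; the two numbers should be close for moderate \(N\).

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 0.3
sd1, sd0 = 1.0, 2.0                          # sd of Y^1 and Y^0 (arbitrary)

estimates = []
for _ in range(50000):
    d = rng.binomial(1, p, size=N)
    if d.sum() in (0, N):                    # keep samples with both groups non-empty
        continue
    y = np.where(d == 1, rng.normal(1.0, sd1, N), rng.normal(0.0, sd0, N))
    estimates.append(y[d == 1].mean() - y[d == 0].mean())

formula = ((1 - (1 - p)**N) / (N * p)) * sd1**2 \
        + ((1 - p**N) / (N * (1 - p))) * sd0**2
print(np.var(estimates), formula)            # the two numbers should be close
```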

Using Theorem 2.2, we have:

\[\begin{align*} 2\epsilon & \leq 2\sqrt{\frac{1}{N(1-\delta)}\left(\frac{1-(1-\Pr(D_i=1))^N}{\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}\right)}\\ & \leq 2\sqrt{\frac{1}{N(1-\delta)}\left(\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{(1-\Pr(D_i=1))}\right)}, \end{align*}\]

where the second inequality stems from the fact that \(\frac{(1-\Pr(D_i=1))^N}{\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{\Pr(D_i=1)^N}{(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}\geq0\). This proves the result.

A.1.2 Proof of Theorem 2.5

Before proving Theorem 2.5, let me state a very useful result: \(\hat{WW}\) can be computed using OLS:

Lemma A.3 (WW is OLS) Under Assumption 2.1, the OLS coefficient \(\beta\) in the following regression:

\[\begin{align*} Y_i & = \alpha + \beta D_i + U_i \end{align*}\]

is the WW estimator:

\[\begin{align*} \hat{\beta}_{OLS} & = \frac{\frac{1}{N}\sum_{i=1}^N\left(Y_i-\frac{1}{N}\sum_{i=1}^NY_i\right)\left(D_i-\frac{1}{N}\sum_{i=1}^ND_i\right)}{\frac{1}{N}\sum_{i=1}^N\left(D_i-\frac{1}{N}\sum_{i=1}^ND_i\right)^2} \\ & = \hat{\Delta^Y_{WW}}. \end{align*}\]

Proof. In matrix notation, we have:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_1 \\ \vdots \\ Y_N \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{cc} 1 & D_1\\ \vdots & \vdots\\ 1 & D_N\end{array}\right)}_{X} \underbrace{\left(\begin{array}{c} \alpha \\ \beta \end{array}\right)}_{\Theta}+ \underbrace{\left(\begin{array}{c} U_1 \\ \vdots \\ U_N \end{array}\right)}_{U} \end{align*}\]

The OLS estimator is:

\[\begin{align*} \hat{\Theta}_{OLS} & = (X'X)^{-1}X'Y \end{align*}\]

Under the Full Rank Assumption, \(X'X\) is invertible and we have:

\[\begin{align*} (X'X)^{-1} & = \left(\begin{array}{cc} N & \sum_{i=1}^ND_i \\ \sum_{i=1}^ND_i & \sum_{i=1}^ND_i^2 \end{array}\right)^{-1} \\ & = \frac{1}{N\sum_{i=1}^ND_i^2-\left(\sum_{i=1}^ND_i\right)^2}\left(\begin{array}{cc} \sum_{i=1}^ND_i^2 & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right) \end{align*}\]

For simplicity, I omit the summation index:

\[\begin{align*} \hat{\Theta}_{OLS} & = \frac{1}{N\sum D_i^2-\left(\sum D_i\right)^2} \left(\begin{array}{cc} \sum D_i^2 & -\sum D_i \\ -\sum D_i & N \end{array}\right) \left(\begin{array}{c} \sum Y_i \\ \sum Y_iD_i \end{array}\right) \\ & = \frac{1}{N\sum D_i^2-\left(\sum D_i\right)^2} \left(\begin{array}{c} \sum D_i^2\sum Y_i-\sum D_i\sum_{i=1}^NY_iD_i \\ -\sum D_i\sum Y_i+ N\sum Y_iD_i \end{array}\right) \\ \end{align*}\]

Using \(D_i^2=D_i\), we have:

\[\begin{align*} \hat{\Theta}_{OLS} & = \left(\begin{array}{c} \frac{\left(\sum D_i\right)\left(\sum Y_i-\sum Y_iD_i\right)}{\left(\sum D_i\right)\left(N-\sum D_i\right)} \\ \frac{N\sum Y_iD_i-\sum D_i\sum Y_i}{N\sum D_i-\left(\sum D_i\right)^2} \end{array}\right) = \left(\begin{array}{c} \frac{\sum (Y_iD_i+Y_i(1-D_i))-\sum Y_iD_i}{\sum(1-D_i)} \\ \frac{N^2}{N^2}\frac{\frac{1}{N}\sum Y_iD_i-\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i+\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i-\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i}{\frac{1}{N}\sum D_i-2\left(\frac{1}{N}\sum D_i\right)^2+\left(\frac{1}{N}\sum D_i\right)^2} \end{array}\right) \\ & = \left(\begin{array}{c} \frac{\sum Y_i(1-D_i)}{\sum(1-D_i)} \\ \frac{\frac{1}{N}\sum \left(Y_iD_i-D_i\frac{1}{N}\sum Y_i-Y_i\frac{1}{N}\sum D_i+\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i\right)}{\frac{1}{N}\sum\left(D_i-2D_i\frac{1}{N}\sum D_i+\left(\frac{1}{N}\sum D_i\right)^2\right)} \end{array}\right) = \left(\begin{array}{c} \frac{\sum Y_i(1-D_i)}{\sum(1-D_i)} \\ \frac{\frac{1}{N}\sum\left(Y_i-\frac{1}{N}\sum Y_i\right)\left(D_i-\frac{1}{N}\sum D_i\right)}{\frac{1}{N}\sum \left(D_i-\frac{1}{N}\sum D_i\right)^2} \end{array}\right), \end{align*}\]

which proves the first part of the lemma. Now for the second part of the lemma:

\[\begin{align*} \hat{\beta}_{OLS} & = \frac{\sum Y_iD_i-\frac{1}{N}\sum D_i\sum Y_i}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)} = \frac{\sum Y_iD_i-\frac{1}{N}\sum D_i\sum\left(Y_iD_i+(1-D_i)Y_i\right)}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)}\\ & = \frac{\sum Y_iD_i\left(1-\frac{1}{N}\sum D_i\right)-\frac{1}{N}\sum D_i\sum(1-D_i)Y_i}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)}\\ & = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\frac{1}{N}\sum(1-D_i)Y_i}{\left(1-\frac{1}{N}\sum D_i\right)}\\ & = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\frac{1}{N}\sum(1-D_i)Y_i}{\frac{1}{N}\sum\left(1-D_i\right)}\\ & = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\sum(1-D_i)Y_i}{\sum\left(1-D_i\right)}\\ & = \hat{\Delta^Y_{WW}}, \end{align*}\]

which proves the result.
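
Lemma A.3 is also easy to check numerically. The following sketch (illustrative values only, not part of the proof) verifies on a simulated dataset that the OLS coefficient on \(D_i\) coincides with the with/without difference in means, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
d = rng.binomial(1, 0.4, size=N)
y = 2.0 * d + rng.normal(size=N)

# OLS of y on a constant and d
X = np.column_stack([np.ones(N), d])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

ww = y[d == 1].mean() - y[d == 0].mean()     # With/Without estimator
print(beta_ols, ww)                          # identical up to floating-point error
```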

Now, let me state the most important lemma behind the result in Theorem 2.5:

Lemma A.4 (Asymptotic Distribution of the OLS Estimator) Under Assumptions 1.7, 2.1, 2.2 and 2.3, we have:

\[\begin{align*} \sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & \stackrel{d}{\rightarrow} \mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array}, \sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1}\right), \end{align*}\]

with \[\begin{align*} \sigma_{XX}^{-1}& = \left(\begin{array}{cc} \frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))} & -\frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))}\\ -\frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))} & \frac{1}{\Pr(D_i=1)(1-\Pr(D_i=1))} \end{array}\right)\\ \mathbf{V_{xu}}&= \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)} \end{align*}\]

Proof. \[\begin{align*} \sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & = \sqrt{N}((X'X)^{-1}X'Y-\Theta) \\ & = \sqrt{N}((X'X)^{-1}X'(X\Theta+U)-\Theta) \\ & = \sqrt{N}((X'X)^{-1}X'X\Theta+(X'X)^{-1}X'U-\Theta) \\ & = \sqrt{N}(X'X)^{-1}X'U \\ & = N(X'X)^{-1}\frac{\sqrt{N}}{N}X'U \end{align*}\]

Using Slutsky’s Theorem, we can study both terms separately.

Before stating Slutsky’s Theorem, we need to define a new notion: convergence in probability. We say that a sequence \(X_N\) converges in probability to the constant \(x\) if, \(\forall\epsilon>0\), \(\lim_{N\rightarrow\infty}\Pr(|X_N-x|>\epsilon)=0\).
We denote \(X_N\stackrel{p}{\rightarrow}x\) or \(\text{plim}(X_N)=x\). When the limit is a constant, convergence in probability is equivalent to convergence in distribution.

Slutsky’s Theorem states that if \(Y_N\stackrel{d}{\rightarrow}y\) and \(\text{plim}(X_N)=x\), then:

  1. \(X_N+Y_N\stackrel{d}{\rightarrow}x+y\)
  2. \(X_NY_N\stackrel{d}{\rightarrow}xy\)
  3. \(\frac{Y_N}{X_N}\stackrel{d}{\rightarrow}\frac{y}{x}\) if \(x\neq0\)

Using this theorem, we have:

\[\begin{align*} \sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & \stackrel{d}{\rightarrow} \sigma_{XX}^{-1}xu, \end{align*}\]

where \(\sigma_{XX}^{-1}\) is a matrix of constants and \(xu\) is the random variable to which \(\frac{\sqrt{N}}{N}X'U\) converges in distribution.

Let’s begin by deriving the limit in distribution \(xu\) of \(\frac{\sqrt{N}}{N}X'U\):

\[\begin{align*} \frac{\sqrt{N}}{N}X'U & = \sqrt{N}\left(\begin{array}{c} \frac{1}{N}\sum_{i=1}^{N}U_i\\ \frac{1}{N}\sum_{i=1}^{N}D_iU_i\end{array}\right) \end{align*}\]

In order to determine the asymptotic distribution of \(\frac{\sqrt{N}}{N}X'U\), we are going to use the vector version of the CLT:

If \((X_i,Y_i)\) are i.i.d. random vectors with finite first and second moments, we have:

\[\begin{align*} \sqrt{N} \left( \begin{array}{c} \frac{1}{N}\sum_{i=1}^NX_i-\esp{X_i}\\ \frac{1}{N}\sum_{i=1}^NY_i-\esp{Y_i} \end{array} \right) & \stackrel{d}{\rightarrow} \mathcal{N} \left( \begin{array}{c} 0\\ 0 \end{array}, \mathbf{V} \right), \end{align*}\]

where \(\mathbf{V}\) is the population covariance matrix of \(X_i\) and \(Y_i\).

We actually need the Lyapunov version of the CLT for non-i.i.d. data since there is conditional heteroskedasticity.

We know that, under Assumption 1.7, both random variables have mean zero:

\[\begin{align*} \esp{U_i}& = \esp{U_i|D_i=1}\Pr(D_i=1)+\esp{U_i|D_i=0}\Pr(D_i=0)=0 \\ \esp{U_iD_i}& = \esp{U_i|D_i=1}\Pr(D_i=1)=0 \end{align*}\]

Their covariance matrix \(\mathbf{V_{xu}}\) can be computed as follows:

\[\begin{align*} \mathbf{V_{xu}} & = \esp{\left(\begin{array}{c} U_i\\ U_iD_i\end{array}\right)\left(\begin{array}{cc} U_i& U_iD_i\end{array}\right)} - \esp{\left(\begin{array}{c} U_i\\ U_iD_i\end{array}\right)}\esp{\left(\begin{array}{cc} U_i& U_iD_i\end{array}\right)}\\ & = \esp{\left(\begin{array}{cc} U_i^2 & U_i^2D_i\\ U_i^2D_i & U_i^2D_i^2\end{array}\right)} = \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i^2\end{array}\right)} = \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)} \end{align*}\]

Using the Vector CLT, we have that \(\frac{\sqrt{N}}{N}X'U\stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V_{xu}}\right)\).

Let’s show now that \(\plims N(X'X)^{-1}=\sigma_{XX}^{-1}\):

\[\begin{align*} N(X'X)^{-1} & = \frac{N}{N\sum_{i=1}^ND_i-\left(\sum_{i=1}^ND_i\right)^2} \left(\begin{array}{cc} \sum_{i=1}^ND_i & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right) \\ & = \frac{1}{N}\frac{1}{\frac{1}{N}\sum_{i=1}^ND_i-\left(\frac{1}{N}\sum_{i=1}^ND_i\right)^2} \left(\begin{array}{cc} \sum_{i=1}^ND_i & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right)\\ & = \frac{1}{\frac{1}{N}\sum_{i=1}^ND_i-\left(\frac{1}{N}\sum_{i=1}^ND_i\right)^2} \left(\begin{array}{cc} \frac{1}{N}\sum_{i=1}^ND_i & -\frac{1}{N}\sum_{i=1}^ND_i \\ -\frac{1}{N}\sum_{i=1}^ND_i & 1 \end{array}\right)\\ \plims N(X'X)^{-1} & = \frac{1}{\plims\frac{1}{N}\sum_{i=1}^ND_i-\left(\plims\frac{1}{N}\sum_{i=1}^ND_i\right)^2} \left(\begin{array}{cc} \plims\frac{1}{N}\sum_{i=1}^ND_i & -\plims\frac{1}{N}\sum_{i=1}^ND_i \\ -\plims\frac{1}{N}\sum_{i=1}^ND_i & 1 \end{array}\right)\\ & = \frac{1}{\Pr(D_i=1)-\Pr(D_i=1)^2} \left(\begin{array}{cc} \Pr(D_i=1) & -\Pr(D_i=1) \\ -\Pr(D_i=1) & 1 \end{array}\right)\\ & = \sigma_{XX}^{-1} \end{align*}\]

The fourth equality uses Slutsky’s Theorem. The fifth equality uses the Law of Large Numbers (LLN): if \(Y_i\) are i.i.d. variables with finite first and second moments, \(\plim{N}\frac{1}{N}\sum_{i=1}^NY_i = \esp{Y_i}\).

In order to complete the proof, we have to use the Delta Method Theorem. This theorem states that:

\[\begin{gather*} \sqrt{N}\left(\begin{array}{c} \bar{X}_N-\esp{X_i}\\ \bar{Y}_N-\esp{Y_i}\end{array}\right) \stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V}\right) \\ \Rightarrow \sqrt{N}\left(g(\bar{X}_N,\bar{Y}_N)-g(\esp{X_i},\esp{Y_i})\right) \stackrel{d}{\rightarrow}\mathcal{N}\left(0,G'\mathbf{V}G\right) \end{gather*}\]

where \(G(u)=\partder{g(u)}{u}\) and \(G=G(\esp{X_i},\esp{Y_i})\).

In our case, \(g(xu)=\sigma_{XX}^{-1}xu\) is linear, so \(G(xu)=\sigma_{XX}^{-1}\). The result follows from that and from the symmetry of \(\sigma_{XX}^{-1}\).

A last lemma uses the previous result to derive the asymptotic distribution of \(\hat{WW}\):

Lemma A.5 (Asymptotic Distribution of $\hat{WW}$) Under Assumptions 1.7, 2.1, 2.2 and 2.3, we have:

\[\begin{align*} \sqrt{N}(\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}) & \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}\right). \end{align*}\]

Proof. In order to derive the asymptotic distribution of WW, I first use Lemma A.3, which implies that the asymptotic distribution of WW is the same as that of \(\hat{\beta}_{OLS}\). Now, from Lemma A.4, we know that \(\sqrt{N}(\hat{\beta}_{OLS}-\beta)\stackrel{d}{\rightarrow}\mathcal{N}(0,\sigma^2_{\beta})\), where \(\sigma^2_{\beta}\) is the lower diagonal term of \(\sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1}\). Using the convention \(p=\Pr(D_i=1)\), we have:

\[\begin{align*} \sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1} & = \left(\begin{array}{cc} \frac{p}{p(1-p)} & -\frac{p}{p(1-p)}\\ -\frac{p}{p(1-p)} & \frac{1}{p(1-p)} \end{array}\right) \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)} \left(\begin{array}{cc} \frac{p}{p(1-p)} & -\frac{p}{p(1-p)}\\ -\frac{p}{p(1-p)} & \frac{1}{p(1-p)} \end{array}\right)\\ & = \frac{1}{(p(1-p))^2} \left(\begin{array}{cc} p\esp{U_i^2}-p\esp{U_i^2D_i} & p\esp{U_i^2D_i}-p\esp{U_i^2D_i}\\ -p\esp{U_i^2}+\esp{U_i^2D_i} & -p\esp{U_i^2D_i}+\esp{U_i^2D_i} \end{array}\right) \left(\begin{array}{cc} p & -p\\ -p & 1 \end{array}\right)\\ & = \frac{1}{(p(1-p))^2} \left(\begin{array}{cc} p^2(\esp{U_i^2}-\esp{U_i^2D_i}) & p^2(\esp{U_i^2D_i}-\esp{U_i^2})\\ p^2(\esp{U_i^2D_i}-\esp{U_i^2}) & p^2\esp{U_i^2}+(1-2p)\esp{U_i^2D_i} \end{array}\right) \end{align*}\]

The final result comes from the fact that:

\[\begin{align*} \esp{U_i^2} & = \esp{U_i^2|D_i=1}p + (1-p)\esp{U_i^2|D_i=0}\\ & = p\var{Y_i^1|D_i=1}+(1-p)\var{Y_i^0|D_i=0} \\ \esp{U_i^2D_i} & = \esp{U_i^2|D_i=1}p \\ & = p\var{Y_i^1|D_i=1}. \end{align*}\]

As a consequence:

\[\begin{align*} \sigma^2_{\beta} &= \frac{1}{(p(1-p))^2}\left(\var{Y_i^1|D_i=1}p(p^2-2p+1) + p^2(1-p)\var{Y_i^0|D_i=0}\right) \\ &= \frac{1}{(p(1-p))^2}\left(\var{Y_i^1|D_i=1}p(1-p)^2 + p^2(1-p)\var{Y_i^0|D_i=0}\right)\\ & = \frac{\var{Y_i^1|D_i=1}}{p}+\frac{\var{Y_i^0|D_i=0}}{1-p}. \end{align*}\]

Using the previous lemma, we can now approximate the confidence level of \(\hat{WW}\):

\[\begin{align*} \Pr&(|\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}|\leq\epsilon) = \Pr(-\epsilon\leq\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}\leq\epsilon) \\ & = \Pr\left(-\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\leq\frac{\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\leq\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\ & \approx \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)- \Phi\left(-\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\ & = \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)- 1 + \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\ & = 2\Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)-1. \end{align*}\]

As a consequence,

\[\begin{align*} \delta & \approx 2\Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)-1. \end{align*}\]

Hence the result.
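
The approximation at the end of the proof is straightforward to evaluate numerically. The sketch below (with arbitrary values for the sample size, the treatment probability and the conditional variances, which are assumed known here; not part of the original text) computes the approximate confidence level \(\delta\) associated with a precision \(\epsilon\).

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(x / sqrt(2)))

N, p = 1000, 0.5            # sample size and treatment probability (arbitrary)
var1, var0 = 1.0, 1.0       # Var(Y^1|D=1) and Var(Y^0|D=0), assumed known here
epsilon = 0.1               # targeted precision

se = sqrt((var1 / p + var0 / (1 - p)) / N)
delta = 2 * phi(epsilon / se) - 1
print(delta)                # approximate confidence that |WW - TT| <= epsilon
```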

A.2 Proofs of results in Chapter 3

A.2.1 Proof of Theorem 3.9

In order to prove the theorem, it is going to be very helpful to prove the following lemma:

Lemma A.6 (Unconfounded Types) Under Assumptions 3.9 and 3.10, the types \(T_i\) are independent of the allocation of the treatment:

\[\begin{align*} (Y_i^{1,1},Y_i^{0,1},Y_i^{0,0},Y_i^{1,0},T_i)\Ind R_i|E_i=1. \end{align*}\]

Proof. Lemma 4.2 in Dawid (1979) shows that if \(X\Ind Y|Z\) and \(U\) is a function of \(X\) then \(U\Ind Y|Z\). The fact that \(T_i\) is a function of \((D_i^1,D^0_i)\) proves the result.

The four sets defined by \(T_i\) form a partition of the sample space. As a consequence, we have (omitting the conditioning on \(E_i=1\) throughout for simplicity):

\[\begin{align*} \esp{Y_i|R_i=1} & = \esp{Y_i|T_i=a,R_i=1}\Pr(T_i=a|R_i=1)\\ & \phantom{=}+ \esp{Y_i|T_i=c,R_i=1}\Pr(T_i=c|R_i=1) \\ & \phantom{=} + \esp{Y_i|T_i=d,R_i=1}\Pr(T_i=d|R_i=1)\\ & \phantom{=} + \esp{Y_i|T_i=n,R_i=1}\Pr(T_i=n|R_i=1)\\ \esp{Y_i|R_i=0} & = \esp{Y_i|T_i=a,R_i=0}\Pr(T_i=a|R_i=0)\\ & \phantom{=} + \esp{Y_i|T_i=c,R_i=0}\Pr(T_i=c|R_i=0) \\ & \phantom{=} + \esp{Y_i|T_i=d,R_i=0}\Pr(T_i=d|R_i=0)\\ & \phantom{=}+ \esp{Y_i|T_i=n,R_i=0}\Pr(T_i=n|R_i=0). \end{align*}\]

Let’s look at all these terms in turn:

\[\begin{align*} \esp{Y_i|T_i=a,R_i=1} & = \esp{Y_i^{1,1}D_iR_i+Y_i^{1,0}D_i(1-R_i)+Y_i^{0,1}(1-D_i)R_i+Y_i^{0,0}(1-D_i)(1-R_i)|T_i=a,R_i=1} \\ & = \esp{Y_i^{1,1}(D^1_iR_i+D_i^0(1-R_i))R_i+Y_i^{0,1}(1-(D^1_iR_i+D_i^0(1-R_i)))R_i|T_i=a,R_i=1} \\ & = \esp{Y_i^{1,1}D^1_iR_i^2+Y_i^{0,1}(1-D^1_iR_i)R_i|D_i^1=D_i^0=1,R_i=1} \\ & = \esp{Y_i^{1,1}|T_i=a,R_i=1} \\ & = \esp{Y_i^{1,1}|T_i=a}, \\ \end{align*}\]

where the first equality uses Assumption 3.9, the second equality uses the fact that \(R_i=1\) in the conditional expectation and Assumption 3.9, the third equality uses the fact that \(R_i=1\), the fourth equality uses the fact that \(T_i=a \Leftrightarrow D_i^1=D_i^0=1\) and the last equality uses Lemma A.6.

Using a similar reasoning, we have:

\[\begin{align*} \esp{Y_i|T_i=c,R_i=1} & = \esp{Y_i^{1,1}|T_i=c} \\ \esp{Y_i|T_i=d,R_i=1} & = \esp{Y_i^{0,1}|T_i=d} \\ \esp{Y_i|T_i=n,R_i=1} & = \esp{Y_i^{0,1}|T_i=n} \\ \esp{Y_i|T_i=a,R_i=0} & = \esp{Y_i^{1,0}|T_i=a} \\ \esp{Y_i|T_i=c,R_i=0} & = \esp{Y_i^{0,0}|T_i=c} \\ \esp{Y_i|T_i=d,R_i=0} & = \esp{Y_i^{1,0}|T_i=d} \\ \esp{Y_i|T_i=n,R_i=0} & = \esp{Y_i^{0,0}|T_i=n}. \end{align*}\]

Also, Lemma A.6 implies that \(\Pr(T_i=a|R_i)=\Pr(T_i=a)\), and the same is true for all other types. As a consequence, we have:

\[\begin{align*} \esp{Y_i|R_i=1} & = \esp{Y_i^{1,1}|T_i=a}\Pr(T_i=a)\\ & \phantom{=} + \esp{Y_i^{1,1}|T_i=c}\Pr(T_i=c) \\ & \phantom{=} + \esp{Y_i^{0,1}|T_i=d}\Pr(T_i=d)\\ & \phantom{=} + \esp{Y_i^{0,1}|T_i=n}\Pr(T_i=n)\\ \esp{Y_i|R_i=0} & = \esp{Y_i^{1,0}|T_i=a}\Pr(T_i=a)\\ & \phantom{=} + \esp{Y_i^{0,0}|T_i=c}\Pr(T_i=c) \\ & \phantom{=} + \esp{Y_i^{1,0}|T_i=d}\Pr(T_i=d)\\ & \phantom{=} + \esp{Y_i^{0,0}|T_i=n}\Pr(T_i=n). \end{align*}\]

And thus:

\[\begin{align*} \esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = (\esp{Y_i^{1,1}|T_i=a}-\esp{Y_i^{1,0}|T_i=a})\Pr(T_i=a)\\ & \phantom{=}+ (\esp{Y_i^{1,1}|T_i=c}-\esp{Y_i^{0,0}|T_i=c})\Pr(T_i=c) \\ & \phantom{=} - (\esp{Y_i^{1,0}|T_i=d}-\esp{Y_i^{0,1}|T_i=d})\Pr(T_i=d)\\ & \phantom{=} + (\esp{Y_i^{0,1}|T_i=n}-\esp{Y_i^{0,0}|T_i=n})\Pr(T_i=n). \end{align*}\]

Using Assumption 3.11, we have:

\[\begin{align*} \esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = (\esp{Y_i^{1}|T_i=a}-\esp{Y_i^{1}|T_i=a})\Pr(T_i=a)\\ & \phantom{=}+ (\esp{Y_i^{1}|T_i=c}-\esp{Y_i^{0}|T_i=c})\Pr(T_i=c) \\ & \phantom{=} - (\esp{Y_i^{1}|T_i=d}-\esp{Y_i^{0}|T_i=d})\Pr(T_i=d)\\ & \phantom{=} + (\esp{Y_i^{0}|T_i=n}-\esp{Y_i^{0}|T_i=n})\Pr(T_i=n)\\ & = \esp{Y_i^{1}-Y_i^{0}|T_i=c}\Pr(T_i=c) \\ & \phantom{=} - \esp{Y_i^{1}-Y_i^{0}|T_i=d}\Pr(T_i=d). \end{align*}\]

Under Assumption 3.13, we have:

\[\begin{align*} \esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = \esp{Y_i^{1}-Y_i^{0}|T_i=c}\Pr(T_i=c)\\ & = \Delta^Y_{LATE}\Pr(T_i=c). \end{align*}\]

We also have:

\[\begin{align*} \Pr(D_i=1|R_i=1) & = \Pr(D^1_i=1|R_i=1)\\ & = \Pr(D^1_i=1\cap (D_i^0=1\cup D_i^0=0) |R_i=1)\\ & = \Pr(D^1_i=1\cap D_i^0=1\cup D^1_i=1\cap D_i^0=0 |R_i=1)\\ & = \Pr(D^1_i=D_i^0=1\cup D^1_i-D_i^0=1 |R_i=1)\\ & = \Pr(T_i=a\cup T_i=c |R_i=1)\\ & = \Pr(T_i=a|R_i=1)+\Pr(T_i=c|R_i=1)\\ & = \Pr(T_i=a)+\Pr(T_i=c), \end{align*}\]

where the first equality follows from Assumption 3.9 and the fact that \(D_i=R_iD_i^1+(1-R_i)D_i^0\), so that \(D_i|R_i=1=D_i^1\). The second equality follows from the fact that \(\left\{ D_i^0=1,D_i^0=0\right\}\) is a partition of the sample space. The third equality follows from the distributivity of intersection over union and the fourth equality from the fact that \(D_i^1\) and \(D_i^0\) can only take values zero and one. The fifth equality follows from the definition of \(T_i\). The sixth equality follows from the addition rule of probability and the fact that the events \(T_i=a\) and \(T_i=c\) are disjoint. The final equality follows from Lemma A.6.

Using a similar reasoning, we have:

\[\begin{align*} \Pr(D_i=1|R_i=0) & = \Pr(T_i=a)+ \Pr(T_i=d). \end{align*}\]

As a consequence, under Assumption 3.13, we have:

\[\begin{align*} \Pr(D_i=1|R_i=1)-\Pr(D_i=1|R_i=0) & = \Pr(T_i=c). \end{align*}\]

Using Assumption 3.12 proves the result.
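
The identification result can be illustrated with a small simulation sketch (arbitrary parameter values, no defiers, a randomized instrument and a valid exclusion restriction by construction; not part of the original proof): the ratio of the two with/without contrasts should converge to the average effect among compliers.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200000
R = rng.binomial(1, 0.5, size=N)                      # randomized instrument
# types: always-takers, compliers, never-takers (no defiers, hence monotonicity)
T = rng.choice(["a", "c", "n"], size=N, p=[0.2, 0.5, 0.3])
D1 = (T != "n").astype(int)                           # treatment if R = 1
D0 = (T == "a").astype(int)                           # treatment if R = 0
D = R * D1 + (1 - R) * D0                             # observed treatment
late = 1.0                                            # effect among compliers
effect = np.where(T == "c", late, 0.5)                # heterogeneous effects
Y0 = rng.normal(size=N)
Y1 = Y0 + effect
Y = D * Y1 + (1 - D) * Y0                             # exclusion restriction holds

wald = (Y[R == 1].mean() - Y[R == 0].mean()) / (D[R == 1].mean() - D[R == 0].mean())
print(wald)                                           # should be close to late = 1.0
```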

A.2.2 Proof of Theorem 3.15

In matrix notation, we have:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_1 \\ \vdots \\ Y_N \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{cc} 1 & D_1\\ \vdots & \vdots\\ 1 & D_N\end{array}\right)}_{X} \underbrace{\left(\begin{array}{c} \alpha \\ \beta \end{array}\right)}_{\Theta}+ \underbrace{\left(\begin{array}{c} U_1 \\ \vdots \\ U_N \end{array}\right)}_{U} \end{align*}\]

and

\[\begin{align*} \left(\begin{array}{c} D_1 \\ \vdots \\ D_N \end{array}\right) & = \underbrace{\left(\begin{array}{cc} 1 & R_1\\ \vdots & \vdots\\ 1 & R_N\end{array}\right)}_{R} \left(\begin{array}{c} \gamma \\ \tau \end{array}\right)+ \left(\begin{array}{c} V_1 \\ \vdots \\ V_N \end{array}\right) \end{align*}\]

The IV estimator is:

\[\begin{align*} \hat{\Theta}_{IV} & = (R'X)^{-1}R'Y \end{align*}\]

If \(D_i\) and \(R_i\) are correlated in the sample, so that \(N\sum D_iR_i-\sum D_i\sum R_i\neq0\), \(R'X\) is invertible (its determinant is nonzero) and we have (omitting the summation index for simplicity):

\[\begin{align*} (R'X)^{-1} & = \left(\begin{array}{cc} N & \sum D_i \\ \sum R_i & \sum D_iR_i \end{array}\right)^{-1} \\ & = \frac{1}{N\sum D_iR_i-\sum D_i\sum R_i}\left(\begin{array}{cc} \sum D_iR_i & -\sum D_i \\ -\sum R_i & N \end{array}\right) \end{align*}\]

Since:

\[\begin{align*} R'Y & = \left(\begin{array}{c} \sum Y_i \\ \sum Y_iR_i \end{array}\right), \end{align*}\]

we have:

\[\begin{align*} \hat{\Theta}_{IV} & = \left( \begin{array}{c} \frac{\sum Y_i\sum D_iR_i-\sum D_i\sum Y_iR_i}{N\sum D_iR_i -\sum D_i\sum R_i}\\ \frac{N\sum Y_iR_i-\sum R_i\sum Y_i}{N\sum D_iR_i-\sum D_i\sum R_i} \end{array} \right) \end{align*}\]

As a consequence, \(\hat{\beta}_{IV}\) is equal to the ratio of two OLS coefficients: that of a regression of \(Y_i\) on \(R_i\) and a constant, and that of a regression of \(D_i\) on the same regressors. Indeed, the numerator and denominator of \(\hat{\beta}_{IV}\) are the numerators of these two OLS coefficients, and both OLS coefficients share the same denominator, the sample variance of \(R_i\), which cancels in the ratio (see the proof of Lemma A.3 in Section A.1.2, just after “Using \(D_i^2=D_i\)”). We can then use Lemma A.3, which states that each of these OLS coefficients is a With/Without estimator, to prove the result.
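
Here is a quick numerical check of this equivalence (an illustrative sketch with arbitrary values, not part of the proof): the second component of \((R'X)^{-1}R'Y\) equals the ratio of the with/without estimator of \(Y_i\) with respect to \(R_i\) to that of \(D_i\) with respect to \(R_i\).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
R = rng.binomial(1, 0.5, size=N)
D = R * rng.binomial(1, 0.8, N) + (1 - R) * rng.binomial(1, 0.2, N)
Y = 1.5 * D + rng.normal(size=N)

X = np.column_stack([np.ones(N), D])
Z = np.column_stack([np.ones(N), R])
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ Y)[1]        # (R'X)^{-1}R'Y, second component

ww_y = Y[R == 1].mean() - Y[R == 0].mean()
ww_d = D[R == 1].mean() - D[R == 0].mean()
print(beta_iv, ww_y / ww_d)                           # identical up to floating-point error
```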

A.2.3 Proof of Theorem 3.16

In order to derive the asymptotic distribution of the Wald estimator, I first use Theorem 3.15 which implies that the asymptotic distribution of Wald is the same as that of \(\hat{\beta}_{IV}\). Now, I’m going to derive the asymptotic distribution of the IV estimator.

Lemma A.7 (Asymptotic Distribution of the IV Estimator) Under Independence and Validity of the Instrument, Exclusion Restriction and Full Rank, we have:

\[\begin{align*} \sqrt{N}(\hat{\Theta}_{IV}-\Theta) & \stackrel{d}{\rightarrow} \mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array}, (\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1}\right), \end{align*}\]

with \[\begin{align*} \sigma_{RX}^{-1}& = \frac{\left(\begin{array}{cc} \esp{D_iR_i} & -\Pr(D_i=1)\\ -\Pr(R_i=1) & 1 \end{array}\right)}{(\Pr(D_i=1|R_i=1)-\Pr(D_i=1|R_i=0))\Pr(R_i=1)(1-\Pr(R_i=1))} \\ \mathbf{V_{ru}}&= \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)} \end{align*}\]

Proof. \[\begin{align*} \sqrt{N}(\hat{\Theta}_{IV}-\Theta) & = \sqrt{N}((R'X)^{-1}R'Y-\Theta) \\ & = \sqrt{N}((R'X)^{-1}R'(X\Theta+U)-\Theta) \\ & = \sqrt{N}((R'X)^{-1}R'X\Theta+(R'X)^{-1}R'U-\Theta) \\ & = \sqrt{N}(R'X)^{-1}R'U \\ & = N(R'X)^{-1}\frac{\sqrt{N}}{N}R'U \end{align*}\]

Using Slutsky’s Theorem, we have:

\[\begin{align*} \sqrt{N}(\hat{\Theta}_{IV}-\Theta) & \stackrel{d}{\rightarrow} \sigma_{RX}^{-1}ru, \end{align*}\]

where \(\sigma_{RX}^{-1}\) is a matrix of constants and \(ru\) is the random variable to which \(\frac{\sqrt{N}}{N}R'U\) converges in distribution.

Let us first show that \(\plims N(R'X)^{-1}=\sigma_{RX}^{-1}\):

\[\begin{align*} N(R'X)^{-1} & = \frac{N}{N\sum D_iR_i-\sum D_i\sum R_i}\left(\begin{array}{cc} \sum D_iR_i & -\sum D_i \\ -\sum R_i & N \end{array}\right) \\ & = \frac{1}{\frac{\sum D_iR_i}{N}-\frac{\sum D_i}{N}\frac{\sum R_i}{N}} \left(\begin{array}{cc} \frac{\sum D_iR_i}{N} & -\frac{\sum D_i}{N} \\ -\frac{\sum R_i}{N} & 1 \end{array} \right) \end{align*}\]

\(\frac{\sum D_iR_i}{N}-\frac{\sum D_i}{N}\frac{\sum R_i}{N}\) is equal to the numerator of the OLS coefficient of a regression of \(D_i\) on \(R_i\) and a constant (see the proof of Lemma A.3 in Section A.1.2). As a consequence of Lemma A.3, it can be written as the With/Without estimator of \(D_i\) with respect to \(R_i\) multiplied by the denominator of the OLS estimator, which is simply the sample variance of \(R_i\). Its probability limit is therefore \((\Pr(D_i=1|R_i=1)-\Pr(D_i=1|R_i=0))\Pr(R_i=1)(1-\Pr(R_i=1))\), the denominator appearing in \(\sigma_{RX}^{-1}\).

Let’s now turn to the limit in distribution \(ru\) of \(\frac{\sqrt{N}}{N}R'U\):

\[\begin{align*} \frac{\sqrt{N}}{N}R'U & = \sqrt{N}\left(\begin{array}{c} \frac{1}{N}\sum_{i=1}^{N}U_i\\ \frac{1}{N}\sum_{i=1}^{N}R_iU_i\end{array}\right) \end{align*}\]

We know that, under Validity of Randomization, both random variables have mean zero:

\[\begin{align*} \esp{U_i}& = \esp{U_i|R_i=1}\Pr(R_i=1)+\esp{U_i|R_i=0}\Pr(R_i=0)=0 \\ \esp{U_iR_i}& = \esp{U_i|R_i=1}\Pr(R_i=1)=0 \end{align*}\]

Their covariance matrix \(\mathbf{V_{ru}}\) can be computed as follows:

\[\begin{align*} \mathbf{V_{ru}} & = \esp{\left(\begin{array}{c} U_i\\ U_iR_i\end{array}\right)\left(\begin{array}{cc} U_i& U_iR_i\end{array}\right)} - \esp{\left(\begin{array}{c} U_i\\ U_iR_i\end{array}\right)}\esp{\left(\begin{array}{cc} U_i& U_iR_i\end{array}\right)}\\ & = \esp{\left(\begin{array}{cc} U_i^2 & U_i^2R_i\\ U_i^2R_i & U_i^2R_i^2\end{array}\right)} = \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i^2\end{array}\right)} = \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)} \end{align*}\]

Using the Vector CLT, we have that \(\frac{\sqrt{N}}{N}R'U\stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V_{ru}}\right)\). Using Slutsky’s theorem and the LLN gives the result.

From Lemma A.7, we know that \(\sqrt{N}(\hat{\beta}_{IV}-\beta)\stackrel{d}{\rightarrow}\mathcal{N}(0,\sigma^2_{\beta})\), where \(\sigma^2_{\beta}\) is the lower diagonal term of \((\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1}\). Using the convention \(p^R=\Pr(R_i=1)\), \(p^D=\Pr(D_i=1)\), \(p^D_1=\Pr(D_i=1|R_i=1)\), \(p^D_0=\Pr(D_i=1|R_i=0)\) and \(p^{DR}=\esp{D_iR_i}\), we have:

\[\begin{align*} (&\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1} \\ & = \frac{1}{((p^D_1-p^D_0)p^R(1-p^R))^2} \left(\begin{array}{cc} p^{DR} & -p^R\\ -p^D & 1 \end{array}\right) \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)} \left(\begin{array}{cc} p^{DR} & -p^D\\ -p^R & 1 \end{array}\right)\\ & = \frac{1}{((p^D_1-p^D_0)p^R(1-p^R))^2} \left(\begin{array}{cc} p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i} & \esp{U_i^2R_i}(p^{DR}-p^R)\\ \esp{U_i^2R_i}-p^D\esp{U_i^2} & \esp{U_i^2R_i}(1-p^D) \end{array}\right) \left(\begin{array}{cc} p^{DR} & -p^D\\ -p^R & 1 \end{array}\right)\\ & = \frac{\left(\begin{array}{cc} p^{DR}(p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i})- p^R\esp{U_i^2R_i}(p^{DR}-p^R) & \esp{U_i^2R_i}(p^{DR}-p^R)-p^{D}(p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i})\\ p^{DR}(\esp{U_i^2R_i}-p^D\esp{U_i^2})-p^R\esp{U_i^2R_i}(1-p^D) & \esp{U_i^2R_i}(1-p^D) - p^{D}(\esp{U_i^2R_i}-p^D\esp{U_i^2}) \end{array}\right)}{((p^D_1-p^D_0)p^R(1-p^R))^2} \end{align*}\]

As a consequence:

\[\begin{align*} \sigma^2_{\beta} & = \frac{\esp{U_i^2R_i}(1-p^D) - p^{D}(\esp{U_i^2R_i}-p^D\esp{U_i^2})}{((p^D_1-p^D_0)p^R(1-p^R))^2} \\ & = \frac{(p^D)^2\esp{U_i^2}+(1-2p^D)\esp{U_i^2R_i}}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\ & = \frac{(p^D)^2(\esp{U_i^2|R_i=1}p^R+\esp{U_i^2|R_i=0}(1-p^R))+(1-2p^D)\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\ & = \frac{(p^D)^2\esp{U_i^2|R_i=0}(1-p^R)+(1-2p^D+(p^D)^2)\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\ & = \frac{(p^D)^2\esp{U_i^2|R_i=0}(1-p^R)+(1-p^D)^2\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\ & = \frac{1}{(p^D_1-p^D_0)^2}\left[\left(\frac{p^D}{p^R}\right)^2\frac{\esp{U_i^2|R_i=0}}{1-p^R}+\left(\frac{1-p^D}{1-p^R}\right)^2\frac{\esp{U_i^2|R_i=1}}{p^R}\right]. \end{align*}\]

Note that, under monotonicity, \(p^C=p^D_1-p^D_0\) and:

\[\begin{align*} \esp{U_i^2|R_i=1} & = p^{AT}\var{Y_i^1|T_i=AT}+p^C\var{Y_i^1|T_i=C}+p^{NT}\var{Y_i^0|T_i=NT} \\ \esp{U_i^2|R_i=0} & = p^{AT}\var{Y_i^1|T_i=AT}+p^C\var{Y_i^0|T_i=C}+p^{NT}\var{Y_i^0|T_i=NT}. \end{align*}\]

The final result comes from the fact that:

\[\begin{align*} \frac{1}{(p^C)^2} & \left[\left(\frac{p^D}{p^R}\right)^2\frac{1}{1-p^R}+\left(\frac{1-p^D}{1-p^R}\right)^2\frac{1}{p^R}\right]\\ & = \frac{(p^D)^2(1-p^R)+(1-p^D)^2p^R}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^D)^2-(p^D)^2p^R+p^R-2p^Dp^R+(p^D)^2p^R}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^D)^2+p^R-2p^Dp^R}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^D-p^R)^2+p^R-(p^R)^2}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^D-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^{AT}+p^Cp^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^{AT}+(1-p^{AT}-p^{NT})p^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^{AT}+p^R-p^{AT}p^R-p^{NT}p^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\ & = \frac{(p^{AT}(1-p^R)-p^{NT}p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2}, \end{align*}\]

where the sixth equality uses the fact that \(p^D=p^{AT}+p^Cp^R\) and the seventh equality uses the fact that \(p^C+p^{AT}+p^{NT}=1\).
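
Because this chain of substitutions is easy to get wrong by hand, here is a short symbolic check (an illustrative sketch using sympy, not part of the proof) that \((p^D-p^R)^2\) indeed equals \((p^{AT}(1-p^R)-p^{NT}p^R)^2\) once \(p^D=p^{AT}+p^Cp^R\) and \(p^C=1-p^{AT}-p^{NT}\) are substituted in.

```python
import sympy as sp

pR, pAT, pNT = sp.symbols("pR pAT pNT", positive=True)
pC = 1 - pAT - pNT                 # p^C + p^AT + p^NT = 1
pD = pAT + pC * pR                 # share of treated under monotonicity

lhs = (pD - pR)**2
rhs = (pAT * (1 - pR) - pNT * pR)**2
print(sp.simplify(lhs - rhs))      # prints 0: the two expressions coincide
```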

A.3 Proofs of results in Chapter 4

A.3.1 Proof of Theorem 4.10

Let us start with the proof that \(\hat{\beta}^{FD}=\hat{\Delta}^Y_{DID}\). Using Lemma A.3, we have that \(\hat{\beta}^{FD}=\hat{\Delta}^{Y_A-Y_B}_{WW}\). From there, since \(\sum_{i=1}^N(Y_{i,A}-Y_{i,B})D_i= \sum_{i=1}^NY_{i,A}D_i- \sum_{i=1}^NY_{i,B}D_i\), we have \(\hat{\beta}^{FD}=\hat{\Delta}^Y_{DID}\).
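
The first-difference result can be checked on simulated data (an illustrative sketch with an arbitrary data-generating process, not part of the proof): regressing \(Y_{i,A}-Y_{i,B}\) on \(D_i\) and a constant returns exactly the difference-in-differences of sample means.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 300
D = rng.binomial(1, 0.5, size=N)
YB = rng.normal(size=N) + 0.5 * D                 # outcome before the treatment
YA = YB + 0.3 + 1.0 * D + rng.normal(size=N)      # outcome after the treatment

X = np.column_stack([np.ones(N), D])
beta_fd = np.linalg.lstsq(X, YA - YB, rcond=None)[0][1]

did = (YA[D == 1].mean() - YB[D == 1].mean()) \
    - (YA[D == 0].mean() - YB[D == 0].mean())
print(beta_fd, did)                               # identical up to floating-point error
```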

In order to prove the result for the OLS DID estimator, it is convenient to write the model in matrix form (where we stack all the observations from the first period in the first rows of each matrix and vector):

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,B} \\ \vdots \\ Y_{N,B} \\Y_{1,A} \\ \vdots \\ Y_{N,A} \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{cccc} 1 & D_1 & T_{1,B} & D_1T_{1,B}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D_N & T_{N,B} & D_NT_{N,B} \\ 1 & D_1 & T_{1,A} & D_1T_{1,A}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D_N & T_{N,A} & D_NT_{N,A}\end{array}\right)}_{X} \underbrace{\left(\begin{array}{c} \alpha \\ \mu \\ \delta \\ \beta \end{array}\right)}_{\Theta} + \underbrace{\left(\begin{array}{c} \epsilon_{1,B} \\ \vdots \\ \epsilon_{N,B} \\ \epsilon_{1,A} \\ \vdots \\ \epsilon_{N,A} \end{array}\right)}_{\epsilon} \end{align*}\]

Now, using the fact that \(T_{i,B}=0\) and \(T_{i,A}=1\), \(\forall i\), we can write matrix \(X\) as follows:

\[\begin{align*} X & = \left(\begin{array}{cccc} 1 & D_1 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D_N & 0 & 0 \\ 1 & D_1 & 1 & D_1\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D_N & 1 & D_N\end{array}\right) \end{align*}\]

Doing some matrix multiplication and factoring \(N\), we have:

\[\begin{align*} X'X & = N\underbrace{\left(\begin{array}{cccc} 2 & 2\bar{D} & 1 & \bar{D}\\ 2\bar{D} & 2\bar{D} & \bar{D} & \bar{D} \\ 1 & \bar{D} & 1 & \bar{D}\\ \bar{D} & \bar{D} & \bar{D} & \bar{D} \end{array}\right)}_{x'x} \end{align*}\]

with \(\bar{D}=\frac{1}{N}\sum_{i=1}^ND_i\), and using the fact that \(D_i^2=D_i\) since \(D_i\in\left\{0,1\right\}\). Using the formula for the inverse of a 4 by 4 matrix and collecting terms patiently, we find that the determinant of \(x'x\) is equal to:

\[\begin{align*} \det(x'x) & = \bar{D}^2(1-\bar{D})^2 \end{align*}\]

and its adjugate is equal to:

\[\begin{align*} \tilde{x'x} & = \bar{D}(1-\bar{D}) \left(\begin{array}{cccc} \bar{D} & -\bar{D} & -\bar{D} & \bar{D}\\ -\bar{D} & 1 & \bar{D} & -1 \\ -\bar{D} & \bar{D} & 2\bar{D} & -2\bar{D}\\ \bar{D} & -1 & -2\bar{D} & 2 \end{array}\right) \end{align*}\]

We also have that:

\[\begin{align*} X'Y & = N\left(\begin{array}{c} \bar{Y}_B+\bar{Y}_A \\ \bar{D}(\bar{Y}^1_B+\bar{Y}^1_A)\\ \bar{Y}_A \\ \bar{D}\bar{Y}^1_A \end{array}\right) \end{align*}\]

with \(\bar{Y}_t=\frac{1}{N}\sum_{i=1}^NY_{i,t}\) and \(\bar{Y}^1_t=\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_iY_{i,t}\) and \(\bar{Y}^0_t=\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^N(1-D_i)Y_{i,t}\) and using the fact that \(\sum_{i=1}^ND_iY_{i,t}=N\bar{D}\bar{Y}^1_t\). Using the fact that \(Y_{i,t}=D_iY_{i,t}+(1-D_i)Y_{i,t}\), we have:

\[\begin{align*} \bar{Y}_t & = \frac{\sum_{i=1}^ND_i}{N}\frac{\sum_{i=1}^ND_iY_{i,t}}{\sum_{i=1}^ND_i}+\frac{\sum_{i=1}^N(1-D_i)}{N}\frac{\sum_{i=1}^N(1-D_i)Y_{i,t}}{\sum_{i=1}^N(1-D_i)} \\ & = \bar{D}\bar{Y}^1_t+(1-\bar{D})\bar{Y}^0_t. \end{align*}\]

We thus have:

\[\begin{align*} X'Y & = N\left(\begin{array}{c} \underbrace{\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}(\bar{Y}^1_B-\bar{Y}^0_B+\bar{Y}^1_A-\bar{Y}^0_A)}_{\mathbf{A}} \\ \underbrace{\bar{D}(\bar{Y}^1_B+\bar{Y}^1_A)}_{\mathbf{B}}\\ \underbrace{\bar{Y}^0_A+\bar{D}(\bar{Y}^1_A-\bar{Y}^0_A)}_{\mathbf{C}} \\ \underbrace{\bar{D}\bar{Y}^1_A}_{\mathbf{D}} \end{array}\right) \end{align*}\]

Using the fact that \((X'X)^{-1}=(Nx'x)^{-1}=\frac{1}{N}(x'x)^{-1}=\frac{1}{N}\frac{\tilde{x'x}}{\det(x'x)}\), we have:

\[\begin{align*} \hat{\Theta}^{OLS} & = (X'X)^{-1}X'Y \\ & = \frac{1}{\bar{D}(1-\bar{D})} \left(\begin{array}{c} \bar{D}(\mathbf{A}-\mathbf{B}-\mathbf{C}+\mathbf{D}) \\ -\bar{D}\mathbf{A}+\mathbf{B}+\bar{D}\mathbf{C}-\mathbf{D} \\ \bar{D}(-\mathbf{A}+\mathbf{B}+2\mathbf{C}-2\mathbf{D})\\ \bar{D}\mathbf{A}-\mathbf{B}-2\bar{D}\mathbf{C}+2\mathbf{D} \end{array}\right) \end{align*}\]

Let’s take each term in turn:

\[\begin{align*} \hat{\alpha}^{OLS} & = \frac{1}{1-\bar{D}} \left(\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}(\bar{Y}^1_B-\bar{Y}^0_B+\bar{Y}^1_A-\bar{Y}^0_A) -\bar{D}(\bar{Y}^1_B+\bar{Y}^1_A) -(\bar{Y}^0_A+\bar{D}(\bar{Y}^1_A-\bar{Y}^0_A)) +\bar{D}\bar{Y}^1_A\right)\\ & = \frac{1}{1-\bar{D}} \left(\bar{Y}^0_B(1-\bar{D}) +\bar{Y}^0_A(1-\bar{D}-1+\bar{D}) +\bar{Y}^1_B(\bar{D}-\bar{D}) +\bar{Y}^1_A(\bar{D}-\bar{D}-\bar{D}+\bar{D})\right)\\ & = \bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\mu}^{OLS} & = \frac{1}{\bar{D}(1-\bar{D})}\left( -\bar{D}(\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}(\bar{Y}^1_B-\bar{Y}^0_B+\bar{Y}^1_A-\bar{Y}^0_A)) +\bar{D}(\bar{Y}^1_B+\bar{Y}^1_A) +\bar{D}(\bar{Y}^0_A+\bar{D}(\bar{Y}^1_A-\bar{Y}^0_A)) -\bar{D}\bar{Y}^1_A\right)\\ & = \frac{1}{1-\bar{D}}\left( -\bar{Y}^0_B(1-\bar{D}) +\bar{Y}^0_A(-1+\bar{D}+1-\bar{D}) +\bar{Y}^1_B(1-\bar{D}) +\bar{Y}^1_A(-\bar{D}+1+\bar{D}-1)\right) \\ & = \bar{Y}^1_B-\bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\delta}^{OLS} & =\frac{1}{1-\bar{D}}\left( -(\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}(\bar{Y}^1_B-\bar{Y}^0_B+\bar{Y}^1_A-\bar{Y}^0_A)) +(\bar{Y}^1_B+\bar{Y}^1_A) +2(\bar{Y}^0_A+\bar{D}(\bar{Y}^1_A-\bar{Y}^0_A)) -2\bar{D}\bar{Y}^1_A\right)\\ & = \frac{1}{1-\bar{D}}\left( -\bar{Y}^0_B(1-\bar{D}) +\bar{Y}^0_A(2(1-\bar{D})-(1-\bar{D})) +\bar{Y}^1_B(\bar{D}-\bar{D}) +\bar{Y}^1_A(\bar{D}-\bar{D}+2\bar{D}-2\bar{D})\right) \\ & = \bar{Y}^0_A-\bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\beta}^{OLS} & =\frac{1}{\bar{D}(1-\bar{D})}\left( \bar{D}(\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}(\bar{Y}^1_B-\bar{Y}^0_B+\bar{Y}^1_A-\bar{Y}^0_A)) -\bar{D}(\bar{Y}^1_B+\bar{Y}^1_A) -2 \bar{D}(\bar{Y}^0_A+\bar{D}(\bar{Y}^1_A-\bar{Y}^0_A)) +2\bar{D}\bar{Y}^1_A\right)\\ & = \frac{1}{1-\bar{D}}\left( \bar{Y}^0_B(1-\bar{D}) +\bar{Y}^0_A((1-\bar{D})-2(1-\bar{D})) +\bar{Y}^1_B(\bar{D}-1) +\bar{Y}^1_A(\bar{D}-1-2\bar{D}+2)\right) \\ & = \bar{Y}^1_A-\bar{Y}^1_B-(\bar{Y}^0_A-\bar{Y}^0_B) \end{align*}\]

This last result proves that \(\hat{\beta}^{OLS}=\hat{\Delta}^Y_{DID}\).
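
The pooled OLS result can be checked numerically in the same way (again an illustrative sketch with arbitrary values, not part of the proof): the coefficient on the interaction \(D_iT_{i,t}\) in a regression on a constant, \(D_i\) and \(T_{i,t}\) equals the DID estimator.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 300
D = rng.binomial(1, 0.5, size=N)
YB = rng.normal(size=N) + 0.5 * D
YA = YB + 0.3 + 1.0 * D + rng.normal(size=N)

# stack the two periods: T = 0 in period B and T = 1 in period A
Y = np.concatenate([YB, YA])
Dd = np.concatenate([D, D])
T = np.concatenate([np.zeros(N), np.ones(N)])
X = np.column_stack([np.ones(2 * N), Dd, T, Dd * T])
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0][3]    # coefficient on the interaction

did = (YA[D == 1].mean() - YB[D == 1].mean()) \
    - (YA[D == 0].mean() - YB[D == 0].mean())
print(beta_ols, did)                                  # identical up to floating-point error
```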

As for the within estimator, the within-transformed model can be written in matrix form as follows:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,B}-\bar{Y}_1 \\ \vdots \\ Y_{N,B}-\bar{Y}_N \\Y_{1,A}-\bar{Y}_1 \\ \vdots \\ Y_{N,A}-\bar{Y}_N \end{array}\right)}_{Y^W} & = \underbrace{\left(\begin{array}{ccc} 1 & 0 & -\bar{D}_1\\ \vdots & \vdots & \vdots \\ 1 & 0 & -\bar{D}_N \\ 1 & 1 & D_1-\bar{D}_1\\ \vdots & \vdots & \vdots \\ 1 & 1 & D_N-\bar{D}_N\end{array}\right)}_{X^W} \underbrace{\left(\begin{array}{c} \alpha^W \\ \delta^W \\ \beta^W \end{array}\right)}_{\Theta^{W}} + \underbrace{\left(\begin{array}{c} \epsilon^W_{1,B} \\ \vdots \\ \epsilon^W_{N,B} \\ \epsilon^W_{1,A} \\ \vdots \\ \epsilon^W_{N,A} \end{array}\right)}_{\epsilon^W} \end{align*}\]

We have:

\[\begin{align*} {X^W}'X^W & = N\underbrace{\left(\begin{array}{ccc} 2 & 1 & 0\\ 1 & 1 & \frac{\bar{D}}{2} \\ 0 & \frac{\bar{D}}{2} & \frac{\bar{D}}{2} \end{array}\right)}_{{x^W}'x^W} \end{align*}\]

This is because:

\[\begin{align*} {X^W}'X^W & = \left(\begin{array}{ccc} 2N & N & -\sum_{i=1}^N\bar{D}_i+\sum_{i=1}^N(D_i-\bar{D}_i)\\ N & N & \sum_{i=1}^N(D_i-\bar{D}_i) \\ -\sum_{i=1}^N\bar{D}_i+\sum_{i=1}^N(D_i-\bar{D}_i) & \sum_{i=1}^N(D_i-\bar{D}_i) & \sum_{i=1}^N\bar{D}_i^2+\sum_{i=1}^N(D_i-\bar{D}_i)^2 \end{array}\right) \end{align*}\]

and:

\[\begin{align*} \sum_{i=1}^N\bar{D}_i & = \frac{1}{2}\sum_{i=1}^N(D_{i,B}+D_{i,A}) \\ & = \frac{1}{2}\sum_{i=1}^ND_{i} \\ & = \frac{1}{2}N\bar{D}\\ \sum_{i=1}^N(D_i-\bar{D}_i) & = N\bar{D}-\frac{1}{2}N\bar{D} \\ & = \frac{1}{2}N\bar{D}\\ \sum_{i=1}^N\bar{D}^2_i & = \frac{1}{4}\sum_{i=1}^N(D_{i,B}+D_{i,A})^2\\ & = \frac{1}{4}\sum_{i=1}^ND^2_{i} \\ & = \frac{1}{4}N\bar{D} \\ \sum_{i=1}^N(D_i-\bar{D}_i)^2 & = \sum_{i=1}^N(D_{i}-\frac{1}{2}D_{i})^2\\ & = \frac{1}{4}N\bar{D} \end{align*}\]

Now we can use the formulas for the determinant and the adjugate of a 3 by 3 matrix to compute the inverse of the \({x^W}'x^W\) matrix. Let us first compute the determinant:

\[\begin{align*} \det({x^W}'x^W) & = 2(\frac{\bar{D}}{2}-\frac{\bar{D}^2}{4}) - \frac{\bar{D}}{2}\\ & = \frac{1}{2}\bar{D}(1-\bar{D}). \end{align*}\]

And then the adjugate:

\[\begin{align*} \tilde{{x^W}'x^W} & = \left(\begin{array}{ccc} \frac{\bar{D}}{2}(1-\frac{\bar{D}}{2}) & -\frac{\bar{D}}{2} & \frac{\bar{D}}{2}\\ -\frac{\bar{D}}{2} & \bar{D} & -\bar{D}\\ \frac{\bar{D}}{2} & -\bar{D} & 1 \end{array}\right) \end{align*}\]

Let us now examine \({X^W}'Y^W\):

\[\begin{align*} {X^W}'Y^W & = \left(\begin{array}{c} \sum_{i=1}^N(Y_{i,B}-\bar{Y}_i)+\sum_{i=1}^N(Y_{i,A}-\bar{Y}_i)\\ \sum_{i=1}^N(Y_{i,A}-\bar{Y}_i) \\ -\sum_{i=1}^N\bar{D}_i(Y_{i,B}-\bar{Y}_i)+\sum_{i=1}^N(D_i-\bar{D}_i)(Y_{i,A}-\bar{Y}_i) \end{array}\right) \end{align*}\]

We have:

\[\begin{align*} \sum_{i=1}^N(Y_{i,B}-\bar{Y}_i) & = N\bar{Y}_B-\frac{1}{2}N(\bar{Y}_B+\bar{Y}_A)\\ & = \frac{1}{2}N(\bar{Y}_B-\bar{Y}_A)\\ \sum_{i=1}^N(Y_{i,A}-\bar{Y}_i) & = \frac{1}{2}N(\bar{Y}_A-\bar{Y}_B)\\ \sum_{i=1}^N\bar{D}_i(Y_{i,B}-\bar{Y}_i) & = \sum_{i=1}^N\frac{1}{2}D_i(Y_{i,B}-\frac{1}{2}(Y_{i,B}+Y_{i,A}))\\ & = \sum_{i=1}^N\frac{1}{2}D_i\frac{1}{2}(Y_{i,B}-Y_{i,A})\\ & = \frac{1}{4}\sum_{i=1}^ND_i(Y_{i,B}-Y_{i,A})\\ & = \frac{1}{4}N\bar{D}(\bar{Y}^1_B-\bar{Y}^1_A)\\ \sum_{i=1}^N(D_i-\bar{D}_i)(Y_{i,A}-\bar{Y}_i) & = \sum_{i=1}^N(D_i-\frac{1}{2}D_i)(Y_{i,A}-\frac{1}{2}(Y_{i,B}+Y_{i,A}))\\ & = \frac{1}{4}\sum_{i=1}^ND_i(Y_{i,A}-Y_{i,B})\\ & = \frac{1}{4}N\bar{D}(\bar{Y}^1_A-\bar{Y}^1_B). \end{align*}\]

So, we have:

\[\begin{align*} ({X^W}'X^W)^{-1}{X^W}'Y^W & = \frac{2}{N\bar{D}(1-\bar{D})} \left(\begin{array}{ccc} \frac{\bar{D}}{2}(1-\frac{\bar{D}}{2}) & -\frac{\bar{D}}{2} & \frac{\bar{D}}{2}\\ -\frac{\bar{D}}{2} & \bar{D} & -\bar{D}\\ \frac{\bar{D}}{2} & -\bar{D} & 1 \end{array}\right) \left(\begin{array}{c} 0\\ \frac{N}{2}(\bar{Y}_A-\bar{Y}_B)\\ \frac{N}{2}\bar{D}(\bar{Y}^1_A-\bar{Y}^1_B) \end{array}\right) \end{align*}\]

We thus have:

\[\begin{align*} \hat{\beta}^W & = \frac{2}{N\bar{D}(1-\bar{D})}\left(-\bar{D}\frac{N}{2}(\bar{Y}_A-\bar{Y}_B)+\frac{N}{2}\bar{D}(\bar{Y}^1_A-\bar{Y}^1_B)\right)\\ & = \frac{1}{1-\bar{D}}\left(\bar{Y}^1_A-\bar{Y}^1_B-(\bar{Y}_A-\bar{Y}_B)\right)\\ \end{align*}\]

Using the fact that \(\bar{Y}_t=\bar{D}\bar{Y}_t^1+(1-\bar{D})\bar{Y}^0_t\), we have \(\bar{Y}_A-\bar{Y}_B=(1-\bar{D})(\bar{Y}^0_A-\bar{Y}^0_B)+\bar{D}(\bar{Y}_A^1-\bar{Y}_B^1)\).

As a consequence:

\[\begin{align*} \hat{\beta}^W & = \frac{1-\bar{D}}{1-\bar{D}}\left(\bar{Y}^1_A-\bar{Y}^1_B-(\bar{Y}^0_A-\bar{Y}^0_B)\right)\\ & = \bar{Y}^1_A-\bar{Y}^1_B-(\bar{Y}^0_A-\bar{Y}^0_B), \end{align*}\]

which proves that \(\hat{\beta}^{W}=\hat{\Delta}^Y_{DID}\).
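
The within result lends itself to the same kind of numerical check (an illustrative sketch, not part of the proof): regressing the unit-demeaned outcome on a constant, the After dummy and the unit-demeaned treatment returns the DID estimator.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 300
D = rng.binomial(1, 0.5, size=N)
YB = rng.normal(size=N) + 0.5 * D
YA = YB + 0.3 + 1.0 * D + rng.normal(size=N)

Ybar = (YB + YA) / 2                      # unit-level mean of the outcome
Dbar = D / 2                              # unit-level mean of treatment (D_{i,B} = 0)
YW = np.concatenate([YB - Ybar, YA - Ybar])
after = np.concatenate([np.zeros(N), np.ones(N)])
DW = np.concatenate([0 - Dbar, D - Dbar])
XW = np.column_stack([np.ones(2 * N), after, DW])
beta_within = np.linalg.lstsq(XW, YW, rcond=None)[0][2]

did = (YA[D == 1].mean() - YB[D == 1].mean()) \
    - (YA[D == 0].mean() - YB[D == 0].mean())
print(beta_within, did)                   # identical up to floating-point error
```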

Now for \(\hat{\beta}^{LSDV}\), the estimator can be written in matrix form as follows:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,B} \\ \vdots \\ Y_{N,B} \\Y_{1,A} \\ \vdots \\ Y_{N,A} \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{ccccccc} 1 & 0 & \dots & 0 & 1 & 0 & D_{1,B}\\ 0 & 1 & \dots & 0 & 1 & 0 & D_{2,B}\\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 1 & 0 & D_{N,B}\\ 1 & 0 & \dots & 0 & 0 & 1 & D_{1,A}\\ 0 & 1 & \dots & 0 & 0 & 1 & D_{2,A}\\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 & 1 & D_{N,A}\\ \end{array}\right)}_{X^{LSDV}} \underbrace{\left(\begin{array}{c} \mu^{LSDV}_1\\ \vdots \\ \mu^{LSDV}_N \\ \delta^{LSDV}_B \\ \delta^{LSDV}_A \\ \beta^{LSDV} \end{array}\right)}_{\Theta^{LSDV}} + \underbrace{\left(\begin{array}{c} \epsilon^{LSDV}_{1,B} \\ \vdots \\ \epsilon^{LSDV}_{N,B} \\ \epsilon^{LSDV}_{1,A} \\ \vdots \\ \epsilon^{LSDV}_{N,A} \end{array}\right).}_{\epsilon^{LSDV}} \end{align*}\]

In order to prove the result, it is going to be very convenient to use the Frisch-Waugh-Lovell Theorem. It can be stated as follows:

Theorem A.1 (Frisch-Waugh-Lovell) The coefficients on a set of variables \(X_2\) estimated by OLS in a linear regression that also includes another set of control variables \(X_1\) are equal to the coefficients on the same set of variables estimated by OLS in a linear model where the outcome variable is the residual of regressing \(Y\) on \(X_1\) by OLS and the explanatory variables are the residuals of regressing \(X_2\) on \(X_1\). More formally: \(\hat{\beta}_2^{OLS}=\hat{\beta}_2^{OLS(MX_1)}\) where: \[\begin{align*} Y & = X_1\beta_1 + X_2\beta_2 + \epsilon \\ M_1Y & = M_1X_2\beta_2 + \epsilon^* \\ M_1 & = I - X_1(X_1'X_1)^{-1}X_1'. \end{align*}\]

Proof. See Section 8.2.2 here.

\(M_1\) is called the residual-maker or annihilator matrix: premultiplying a variable by \(M_1\) returns the residual from an OLS regression of that variable on \(X_1\).

In our case, let us call \(X^{LSDV}_{\mu}\) the first \(N\) columns of \(X^{LSDV}\). \(X^{LSDV}_{\mu}\) is going to play the role of \(X_1\) in Theorem A.1. Let us call \(X^{LSDV}_{\delta,D}\) the matrix made of the last three columns of \(X^{LSDV}\). \(X^{LSDV}_{\delta,D}\) is going to play the role of \(X_2\) in Theorem A.1.

Let us first note that \({X^{LSDV}_{\mu}}'X^{LSDV}_{\mu}=2I_{N}\), where \(I_{N}\) is the identity matrix of dimension \(N\). As a consequence, \(({X^{LSDV}_{\mu}}'X^{LSDV}_{\mu})^{-1}=\frac{1}{2}I_N\). Now, let us compute \({X^{LSDV}_{\mu}}'Y\):

\[\begin{align*} {X^{LSDV}_{\mu}}'Y & = \left(\begin{array}{c} Y_{1,B}+Y_{1,A} \\ \vdots \\ Y_{N,B}+Y_{N,A} \end{array}\right). \end{align*}\]

As a consequence, we have:

\[\begin{align*} M^{LSDV}_{\mu}Y & = Y - X^{LSDV}_{\mu}({X^{LSDV}_{\mu}}'X^{LSDV}_{\mu})^{-1}{X^{LSDV}_{\mu}}'Y \\ & = Y-\frac{1}{2} X^{LSDV}_{\mu}I_N \left(\begin{array}{c} Y_{1,B}+Y_{1,A} \\ \vdots \\ Y_{N,B}+Y_{N,A} \end{array}\right) \\ & = \left(\begin{array}{c} Y_{1,B} - \frac{1}{2}(Y_{1,B}+Y_{1,A}) \\ \vdots \\ Y_{N,B} - \frac{1}{2}(Y_{N,B}+Y_{N,A})\\ Y_{1,A} - \frac{1}{2}(Y_{1,B}+Y_{1,A}) \\ \vdots \\ Y_{N,A} - \frac{1}{2}(Y_{N,B}+Y_{N,A}) \end{array}\right). \end{align*}\]

And finally:

\[\begin{align*} M^{LSDV}_{\mu}X^{LSDV}_{\delta,D} & = X^{LSDV}_{\delta,D} - X^{LSDV}_{\mu}({X^{LSDV}_{\mu}}'X^{LSDV}_{\mu})^{-1}{X^{LSDV}_{\mu}}'X^{LSDV}_{\delta,D} \\ & = \left(\begin{array}{ccc} \frac{1}{2} & -\frac{1}{2} & D_{1,B}-\frac{1}{2}(D_{1,B}+D_{1,A}) \\ \vdots & \vdots & \vdots \\ \frac{1}{2} & -\frac{1}{2} & D_{N,B}-\frac{1}{2}(D_{N,B}+D_{N,A}) \\ -\frac{1}{2} & \frac{1}{2} & D_{1,A}-\frac{1}{2}(D_{1,B}+D_{1,A}) \\ \vdots & \vdots & \vdots \\ -\frac{1}{2} & \frac{1}{2} & D_{N,A}-\frac{1}{2}(D_{N,B}+D_{N,A}) \\ \end{array}\right). \end{align*}\]

Using Theorem A.1, we can rewrite the LSDV version of the TWFE model as follows:

\[\begin{align*} M^{LSDV}_{\mu}Y & = M^{LSDV}_{\mu}X^{LSDV}_{\delta,D} \left(\begin{array}{c} \delta^{LSDV}_B \\ \delta^{LSDV}_A \\ \beta^{LSDV} \end{array}\right) + M^{LSDV}_{\mu}\epsilon^{LSDV} \end{align*}\]

In a more compact notation, we have, \(\forall i\in\left[1,N\right]\) and \(\forall t\in\left\{B,A\right\}\):

\[\begin{align*} Y_{i,t} - \bar{Y}_i & = \frac{1}{2}(\delta^{LSDV}_A-\delta^{LSDV}_B)(\uns{t=A}-\uns{t=B}) + \beta^{LSDV}(D_{i,t}-\bar{D}_{i}) + \epsilon^{LSDV}_{i,t}-\bar{\epsilon}^{LSDV}_{i}, \end{align*}\]

which we can rewrite, for simplicity, as:

\[\begin{align*} Y_{i,t} - \bar{Y}_i & = \tilde{\delta}^{LSDV}_t + \beta^{LSDV}(D_{i,t}-\bar{D}_{i}) + \epsilon^{LSDV}_{i,t}-\bar{\epsilon}^{LSDV}_{i}, \end{align*}\]

with \(\tilde{\delta}^{LSDV}_A=-\tilde{\delta}_B^{LSDV}=\bar{\delta}^{LSDV}\) and \(\bar{\delta}^{LSDV}=\frac{1}{2}(\delta^{LSDV}_A-\delta^{LSDV}_B)\).

In matrix form, we can thus rewrite the LSDV model transformed by the application of the Frisch-Waugh-Lovell Theorem as follows:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,B}-\bar{Y}_1 \\ \vdots \\ Y_{N,B}-\bar{Y}_N \\Y_{1,A}-\bar{Y}_1 \\ \vdots \\ Y_{N,A}-\bar{Y}_N \end{array}\right)}_{Y^{LSDV}_r} & = \underbrace{\left(\begin{array}{ccc} 1 & 0 & -\bar{D}_1\\ \vdots & \vdots & \vdots \\ 1 & 0 & -\bar{D}_N \\ 0 & 1 & D_1-\bar{D}_1\\ \vdots & \vdots & \vdots \\ 0 & 1 & D_N-\bar{D}_N \end{array}\right)}_{X^{LSDV}_r} \underbrace{\left(\begin{array}{c} \tilde{\delta}^{LSDV}_B \\ \tilde{\delta}^{LSDV}_A \\ \beta^{LSDV} \end{array}\right)}_{\Theta^{LSDV}_r} + \underbrace{\left(\begin{array}{c} \epsilon^{LSDV}_{1,B}-\bar{\epsilon}^{LSDV}_{1} \\ \vdots \\ \epsilon^{LSDV}_{N,B}-\bar{\epsilon}^{LSDV}_{N} \\ \epsilon^{LSDV}_{1,A}-\bar{\epsilon}^{LSDV}_{1} \\ \vdots \\ \epsilon^{LSDV}_{N,A}-\bar{\epsilon}^{LSDV}_{N} \end{array}\right)}_{\epsilon^{LSDV}_r} \end{align*}\]

This is very close to the formula for the Within estimator we have seen above. The only difference is that we have two time fixed effects instead of a constant and the After time fixed effect. We are going to solve for the estimator in a very similar way. First:

\[\begin{align*} {X^{LSDV}_r}'X^{LSDV}_r & = N\underbrace{\left(\begin{array}{ccc} 1 & 0 & -\frac{\bar{D}}{2}\\ 0 & 1 & \frac{\bar{D}}{2} \\ -\frac{\bar{D}}{2} & \frac{\bar{D}}{2} & \frac{\bar{D}}{2} \end{array}\right)}_{{x^{LSDV}_r}'x^{LSDV}_r} \end{align*}\]

The determinant of \({x^{LSDV}_r}'x^{LSDV}_r\) is:

\[\begin{align*} \det({x^{LSDV}_r}'x^{LSDV}_r) & = \frac{1}{2}\bar{D}(1-\bar{D}) \end{align*}\]

and its adjugate is:

\[\begin{align*} \tilde{{x^{LSDV}_r}'x^{LSDV}_r} & = \left(\begin{array}{ccc} \frac{1}{2}\bar{D}(1-\frac{1}{2}\bar{D}) & - \frac{1}{4}\bar{D}^2 & \frac{1}{2}\bar{D}\\ - \frac{1}{4}\bar{D}^2 & \frac{1}{2}\bar{D}(1-\frac{1}{2}\bar{D}) & -\frac{1}{2}\bar{D} \\ \frac{1}{2}\bar{D} & -\frac{1}{2}\bar{D} & 1 \end{array}\right). \end{align*}\]

Finally, we have:

\[\begin{align*} {X^{LSDV}_r}'Y^{LSDV}_r & = \left(\begin{array}{c} \sum_{i=1}^N(Y_{i,B}-\bar{Y}_i)\\ \sum_{i=1}^N(Y_{i,A}-\bar{Y}_i) \\ -\sum_{i=1}^N\bar{D}_i(Y_{i,B}-\bar{Y}_i)+\sum_{i=1}^N(D_i-\bar{D}_i)(Y_{i,A}-\bar{Y}_i) \end{array}\right) \\ & = \left(\begin{array}{c} -\frac{1}{2}N(\bar{Y}_A-\bar{Y}_B)\\ \frac{1}{2}N(\bar{Y}_A-\bar{Y}_B)\\ \frac{1}{2}N\bar{D}(\bar{Y}_A^1-\bar{Y}_B^1) \end{array}\right) \end{align*}\]

Using the fact that \(\hat{\Theta}^{LSDV}_r=({X^{LSDV}_r}'X^{LSDV}_r)^{-1}{X^{LSDV}_r}'Y^{LSDV}_r\), we have:

\[\begin{align*} \hat{\beta}^{LSDV} & = \frac{2}{N\bar{D}(1-\bar{D})}\left[-\frac{\bar{D}N}{2}(\bar{Y}_A-\bar{Y}_B)+\frac{\bar{D}N}{2}(\bar{Y}_A^1-\bar{Y}_B^1)\right] \\ & =\frac{1}{1-\bar{D}}\left[\bar{Y}_A^1-\bar{Y}_B^1-(1-\bar{D})(\bar{Y}^0_A-\bar{Y}^0_B)-\bar{D}(\bar{Y}_A^1-\bar{Y}_B^1)\right]\\ & =\frac{1}{1-\bar{D}}\left[(1-\bar{D})(\bar{Y}_A^1-\bar{Y}_B^1)-(1-\bar{D})(\bar{Y}^0_A-\bar{Y}^0_B)\right]\\ & =\bar{Y}_A^1-\bar{Y}_B^1-(\bar{Y}^0_A-\bar{Y}^0_B). \end{align*}\]

The second equality uses the fact that \(\bar{Y}_A-\bar{Y}_B=(1-\bar{D})(\bar{Y}^0_A-\bar{Y}^0_B)+\bar{D}(\bar{Y}_A^1-\bar{Y}_B^1)\). This proves the result.
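
The LSDV result can also be verified directly on simulated data (an illustrative sketch, not part of the proof; note that the unit and time dummies of the LSDV specification are collinear, but the coefficient on \(D_{i,t}\) is still uniquely determined).

```python
import numpy as np

rng = np.random.default_rng(8)
N = 100
D = rng.binomial(1, 0.5, size=N)
YB = rng.normal(size=N) + 0.5 * D
YA = YB + 0.3 + 1.0 * D + rng.normal(size=N)

Y = np.concatenate([YB, YA])
Dit = np.concatenate([np.zeros(N), D])            # D_{i,B} = 0 and D_{i,A} = D_i
unit_dummies = np.vstack([np.eye(N), np.eye(N)])  # one column per unit
time_B = np.concatenate([np.ones(N), np.zeros(N)])
time_A = 1 - time_B
# the unit and time dummies are collinear; lstsq returns a least-squares solution
# and the coefficient on D_{i,t} is the same across all such solutions
X = np.column_stack([unit_dummies, time_B, time_A, Dit])
beta_lsdv = np.linalg.lstsq(X, Y, rcond=None)[0][-1]

did = (YA[D == 1].mean() - YB[D == 1].mean()) \
    - (YA[D == 0].mean() - YB[D == 0].mean())
print(beta_lsdv, did)                             # identical up to floating-point error
```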

To Do: the AP and LC estimators

A.3.2 Proof of Theorem 4.12

The DID model in repeated cross sections can be written in the following matrix form:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,B} \\ \vdots \\ Y_{N_B,B} \\Y_{1,A} \\ \vdots \\ Y_{N_A,A} \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{cccc} 1 & D^B_{1} & T_{1,B} & D^B_{1}T_{1,B}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^B_{N_B} & T_{N_B,B} & D^B_{N_B}T_{N_B,B} \\ 1 & D^A_{1} & T_{1,A} & D^A_{1}T_{1,A}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^A_{N_A} & T_{N_A,A} & D^A_{N_A}T_{N_A,A}\end{array}\right)}_{X} \underbrace{\left(\begin{array}{c} \alpha \\ \mu \\ \delta \\ \beta \end{array}\right)}_{\Theta} + \underbrace{\left(\begin{array}{c} \epsilon_{1,B} \\ \vdots \\ \epsilon_{N_B,B} \\ \epsilon_{1,A} \\ \vdots \\ \epsilon_{N_A,A} \end{array}\right),}_{\epsilon} \end{align*}\]

where \(D^B_{i}\) and \(D^A_{i}\) denote the actual treatment status in period \(A\) of individuals observed in periods \(B\) and \(A\) respectively and \(N_B\) and \(N_A\) are the numbers of units observed in periods \(B\) and \(A\) respectively.

Using the beginning of the proof of Lemma A.4, we know that: \(\sqrt{N}(\hat{\Theta}_{OLS}-\Theta)=N(X'X)^{-1}\frac{\sqrt{N}}{N}X'\epsilon\). Using Slutsky’s Theorem, we know that we can study both terms separately (see the same proof of Lemma A.4). Let’s start with \(N(X'X)^{-1}\). Using the fact that \(T_{i,B}=0\) and \(T_{i,A}=1\), \(\forall i\), we can write matrix \(X\) as follows:

\[\begin{align*} X & = \left(\begin{array}{cccc} 1 & D^B_{1} & 0 & 0\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^B_{N_B} & 0 & 0 \\ 1 & D^A_{1} & 1 & D^A_{1}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^A_{N_A} & 1 & D^A_{N_A}\end{array}\right) \end{align*}\]

Doing some matrix multiplication, we have:

\[\begin{align*} X'X & = N_A\underbrace{\left(\begin{array}{cccc} k+1 & k\bar{D}_B+\bar{D}_A & 1 & \bar{D}_A\\ k\bar{D}_B+\bar{D}_A & k\bar{D}_B+\bar{D}_A & \bar{D}_A & \bar{D}_A \\ 1 & \bar{D}_A & 1 & \bar{D}_A\\ \bar{D}_A & \bar{D}_A & \bar{D}_A & \bar{D}_A \end{array}\right)}_{x'x} \end{align*}\]

with \(\bar{D}_t=\frac{1}{N_t}\sum_{i=1}^{N_t}D^t_i\), \(k=\frac{N_B}{N_A}\), and using the fact that \((D^t_i)^2=D^t_i\) since \(D^t_i\in\left\{0,1\right\}\). Using the formula for the inverse of a 4 by 4 matrix and collecting terms patiently, we find that the determinant of \(x'x\) is equal to:

\[\begin{align*} \det(x'x) & = k^2\pi\bar{D}_A^2(1-\bar{D}_A)(1-\pi\bar{D}_A), \end{align*}\] with \(\pi=\frac{\bar{D}_B}{\bar{D}_A}\), and its adjugate is equal to:

\[\begin{align*} \tilde{x'x} & = k\pi\bar{D}_A(1-\bar{D}_A) \left(\begin{array}{cccc} \bar{D}_A & -\bar{D}_A & -\bar{D}_A & \bar{D}_A\\ -\bar{D}_A & \frac{1}{\pi} & \bar{D}_A & -\frac{1}{\pi} \\ -\bar{D}_A & \bar{D}_A & \bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A} & -\bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\\ \bar{D}_A & -\frac{1}{\pi} & -\bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A} & k\frac{1-\pi\bar{D}_A}{1-\bar{D}_A}+\frac{1}{\pi} \end{array}\right) \end{align*}\]

We finally have that \(N_A(X'X)^{-1}=\frac{1}{\det(x'x)}\tilde{x'x}\). Taking the \(\text{plim}\) with respect to \(N_A\), we have that:

\[\begin{align*} \text{plim}N_A(X'X)^{-1} & = \frac{1}{kp(1-p)} \left(\begin{array}{cccc} p & -p & -p & p\\ -p & 1 & p & -1 \\ -p & p & p(k+1) & -p(k+1)\\ p & -1 & -p(k+1) & k+1 \end{array}\right) \end{align*}\]

The result comes from \(\text{plim}\bar{D}_B=\text{plim}\bar{D}_A=\Pr(D_i=1)=p\), according to the Law of Large Numbers, and thus, using Slutsky’s Theorem, \(\text{plim}\pi=1\).

Let us now derive the asymptotic distribution of \(\frac{\sqrt{N}}{N}X'\epsilon\). In order to do that, we need to know the coefficients of the OLS DID model in repeated cross sections of different sizes. They are probably the same as with a panel, but we still need to check. We have that:

\[\begin{align*} X'Y & = N_A\left(\begin{array}{c} k\bar{Y}_B+\bar{Y}_A \\ \bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)\\ \bar{Y}_A \\ \bar{D}_A\bar{Y}^1_A \end{array}\right) \end{align*}\]

Using the fact that \(\bar{Y}_t = \bar{D}_t\bar{Y}^1_t+(1-\bar{D}_t)\bar{Y}^0_t\), we have that:

\[\begin{align*} X'Y & = N_A\left(\begin{array}{c} \underbrace{k\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}_A(k\pi(\bar{Y}^1_B-\bar{Y}^0_B)+\bar{Y}^1_A-\bar{Y}^0_A)}_{\mathbf{A}} \\ \underbrace{\bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)}_{\mathbf{B}}\\ \underbrace{\bar{Y}^0_A+\bar{D}_A(\bar{Y}^1_A-\bar{Y}^0_A)}_{\mathbf{C}} \\ \underbrace{\bar{D}_A\bar{Y}^1_A}_{\mathbf{D}} \end{array}\right) \end{align*}\]

Using the fact that \((X'X)^{-1}=\frac{1}{N_A}\frac{\tilde{x'x}}{\det(x'x)}\), we have:

\[\begin{align*} \hat{\Theta}^{OLS} & = (X'X)^{-1}X'Y \\ & = \frac{1}{\bar{D}_Ak(1-\pi\bar{D}_A)} \left(\begin{array}{c} \bar{D}_A(\mathbf{A}-\mathbf{B}-\mathbf{C}+\mathbf{D}) \\ -\bar{D}_A\mathbf{A}+\frac{\mathbf{B}}{\pi}+\bar{D}_A\mathbf{C}-\frac{\mathbf{D}}{\pi} \\ \bar{D}_A(-\mathbf{A}+\mathbf{B}+\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\mathbf{C}-\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\mathbf{D})\\ \bar{D}_A\mathbf{A}-\frac{\mathbf{B}}{\pi}-\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\bar{D}_A\mathbf{C}+(k\frac{1-\pi\bar{D}_A}{1-\bar{D}_A}+\frac{1}{\pi})\mathbf{D} \end{array}\right) \end{align*}\]

Let’s take each term in turn:

\[\begin{align*} \hat{\alpha}^{OLS} & = \frac{1}{k(1-\pi\bar{D}_A)} \left(k\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}_A(k\pi(\bar{Y}^1_B-\bar{Y}^0_B)+\bar{Y}^1_A-\bar{Y}^0_A) - \bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)\right.\\ & \phantom{\frac{1}{k(1-\pi\bar{D}_A)}\left(\right.} \left. - \bar{Y}^0_A-\bar{D}_A(\bar{Y}^1_A-\bar{Y}^0_A)+\bar{D}_A\bar{Y}^1_A\right)\\ & = \frac{1}{k(1-\pi\bar{D}_A)} \left(\bar{Y}^0_B(k-k\pi\bar{D}_A)+ \bar{Y}^0_A(1-\bar{D}_A-1+\bar{D}_A)+ \bar{Y}^1_B(k\pi\bar{D}_A-k\pi\bar{D}_A)+ \right.\\ & \phantom{\frac{1}{k(1-\pi\bar{D}_A)}\left(\right.}\left. \bar{Y}^1_A(\bar{D}_A-\bar{D}_A-\bar{D}_A+\bar{D}_A)\right) \\ & = \bar{Y}^0_Bk\frac{1-\pi\bar{D}_A}{k(1-\pi\bar{D}_A)}\\ & = \bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\mu}^{OLS} & = \frac{1}{k\bar{D}_A(1-\pi\bar{D}_A)} \left(-\bar{D}_A(k\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}_A(k\pi(\bar{Y}^1_B-\bar{Y}^0_B)+\bar{Y}^1_A-\bar{Y}^0_A))\right.\\ & \phantom{= \frac{1}{k\bar{D}_A(1-\pi\bar{D}_A)}} \left.+\frac{\bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)}{\pi}+\bar{D}_A(\bar{Y}^0_A+\bar{D}_A(\bar{Y}^1_A-\bar{Y}^0_A))-\frac{\bar{D}_A\bar{Y}^1_A}{\pi}\right)\\ & = \frac{1}{k\bar{D}_A(1-\pi\bar{D}_A)} \left(-\bar{D}_Ak\bar{Y}^0_B(1-\pi\bar{D}_A)-\bar{Y}^0_A\bar{D}_A(1-\bar{D}_A-1+\bar{D}_A)\right.\\ & \phantom{= \frac{1}{k\bar{D}_A(1-\pi\bar{D}_A)}} \left.+\bar{D}_Ak\bar{Y}^1_B(1-\pi\bar{D}_A)+\bar{Y}^1_A(-\bar{D}_A^2+\frac{\bar{D}_A}{\pi}+\bar{D}_A^2-\frac{\bar{D}_A}{\pi})\right) \\ & = \frac{k\bar{D}_A(1-\pi\bar{D}_A)}{k\bar{D}_A(1-\pi\bar{D}_A)}(\bar{Y}^1_B-\bar{Y}^0_B)\\ & = \bar{Y}^1_B-\bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\delta}^{OLS} & = \frac{1}{k(1-\pi\bar{D}_A)}\left(-(k\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}_A(k\pi(\bar{Y}^1_B-\bar{Y}^0_B)+\bar{Y}^1_A-\bar{Y}^0_A))\right. \\ & \phantom{=\frac{1}{k(1-\pi\bar{D}_A)}}+\bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)+\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}(\bar{Y}^0_A+\bar{D}_A(\bar{Y}^1_A-\bar{Y}^0_A))\\ & \phantom{=\frac{1}{k(1-\pi\bar{D}_A)}}\left.-\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\bar{D}_A\bar{Y}^1_A\right) \\ & = \frac{1}{k(1-\pi\bar{D}_A)}\left(-\bar{Y}^0_Bk(1-\pi\bar{D}_A)-\bar{Y}^0_A(1-\bar{D}_A-(k+1-\bar{D}_A(k\pi+1)))\right.\\ & \phantom{=\frac{1}{k(1-\pi\bar{D}_A)}}\left.+\bar{Y}^1_B(-k\pi\bar{D}_A+k\pi\bar{D}_A)+\bar{Y}^1_A(-\bar{D}_A+\bar{D}_A+\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}(\bar{D}_A-\bar{D}_A))\right)\\ & = \frac{1}{k(1-\pi\bar{D}_A)}\left(-\bar{Y}^0_Bk(1-\pi\bar{D}_A)+\bar{Y}^0_Ak(1-\pi\bar{D}_A)\right)\\ & = \bar{Y}^0_A-\bar{Y}^0_B \end{align*}\]

\[\begin{align*} \hat{\beta}^{OLS} & =\frac{1}{\bar{D}_Ak(1-\pi\bar{D}_A)}\left(\bar{D}_A(k\bar{Y}^0_B+\bar{Y}^0_A+\bar{D}_A(k\pi(\bar{Y}^1_B-\bar{Y}^0_B)+\bar{Y}^1_A-\bar{Y}^0_A))-\frac{\bar{D}_A(k\pi\bar{Y}^1_B+\bar{Y}^1_A)}{\pi}\right.\\ & \phantom{ =\frac{1}{\bar{D}_Ak(1-\pi\bar{D}_A)}}\left.-\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\bar{D}_A(\bar{Y}^0_A+\bar{D}_A(\bar{Y}^1_A-\bar{Y}^0_A))+(k\frac{1-\pi\bar{D}_A}{1-\bar{D}_A}+\frac{1}{\pi})\bar{D}_A\bar{Y}^1_A\right) \\ & = \frac{1}{\bar{D}_Ak(1-\pi\bar{D}_A)}\left(\bar{Y}^0_B\bar{D}_Ak(1-\pi\bar{D}_A)+\bar{Y}^0_A\bar{D}_A(1-\bar{D}_A-(k+1-\bar{D}_A(k\pi+1)))-\bar{Y}^1_B\bar{D}_Ak(1-\pi\bar{D}_A)\right.\\ & \phantom{= \frac{1}{\bar{D}_Ak(1-\pi\bar{D}_A)}}\left.+\bar{Y}^1_A\bar{D}_A(\bar{D}_A-\frac{1}{\pi}-\bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}+k\frac{1-\pi\bar{D}_A}{1-\bar{D}_A}+\frac{1}{\pi})\right)\\ & = \frac{1}{k(1-\pi\bar{D}_A)}\left(-k(1-\pi\bar{D}_A)(\bar{Y}^1_B-\bar{Y}^0_B)-k(1-\pi\bar{D}_A)\bar{Y}^0_A+ k(1-\pi\bar{D}_A)\bar{Y}^1_A\right)\\ & = \bar{Y}^1_A-\bar{Y}^0_A-(\bar{Y}^1_B-\bar{Y}^0_B). \end{align*}\]

So it is confirmed that OLS estimation of the DID model in repeated cross sections of different sizes estimates the same parameters as in panel data. Thanks to the Law of Large Numbers, we know that:

\[\begin{align*} \text{plim}\hat\Theta^{OLS} & =\left(\begin{array}{c} \esp{Y^0_{i,B}|D_i=0}\\ \esp{Y^0_{i,B}|D_i=1}-\esp{Y^0_{i,B}|D_i=0}\\ \esp{Y^0_{i,A}-Y^0_{i,B}|D_i=0}\\ \esp{Y^1_{i,A}-Y^0_{i,B}|D_i=1}-\esp{Y^0_{i,A}-Y^0_{i,B}|D_i=0} \end{array}\right), \end{align*}\]

where the \(\text{plim}\) is taken over \(N=N_A+N_B\).
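
As a quick sanity check of the four coefficient formulas derived above, here is a small simulation sketch (in Python, with arbitrary means and sample sizes; the variable names are mine): the OLS coefficients on \((1, D, T, DT)\) should match the sample-mean formulas exactly, whatever the relative sizes of the two cross sections.

```python
# A small simulation sketch (illustrative, not from the text): OLS on
# (1, D, T, DT) in repeated cross sections of different sizes returns exactly
# the four sample-mean formulas derived above. All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
N_B, N_A, p = 3_000, 5_000, 0.35
D_B, D_A = rng.binomial(1, p, N_B), rng.binomial(1, p, N_A)
Y_B = 1.0 + 0.5 * D_B + rng.normal(0, 1, N_B)     # period B outcomes
Y_A = 1.8 + 0.9 * D_A + rng.normal(0, 1, N_A)     # period A outcomes

D = np.concatenate([D_B, D_A]).astype(float)
T = np.concatenate([np.zeros(N_B), np.ones(N_A)])
Y = np.concatenate([Y_B, Y_A])
X = np.column_stack([np.ones_like(T), D, T, D * T])
alpha, mu, delta, beta = np.linalg.lstsq(X, Y, rcond=None)[0]

def m(y, d, val):                                  # mean of y where d == val
    return y[d == val].mean()

print(np.isclose(alpha, m(Y_B, D_B, 0)))                                   # alpha-hat = Ybar^0_B
print(np.isclose(mu,    m(Y_B, D_B, 1) - m(Y_B, D_B, 0)))                  # mu-hat = Ybar^1_B - Ybar^0_B
print(np.isclose(delta, m(Y_A, D_A, 0) - m(Y_B, D_B, 0)))                  # delta-hat = Ybar^0_A - Ybar^0_B
print(np.isclose(beta,  m(Y_A, D_A, 1) - m(Y_A, D_A, 0)
                       - (m(Y_B, D_B, 1) - m(Y_B, D_B, 0))))               # beta-hat = DID
```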

In order to study the DID model as estimated by OLS more easily, we are going to rewrite it as a pure cross-sectional model:

\[\begin{align*} Y_j & = \alpha + \mu D_j + \delta T_j + \beta D_jT_j + \epsilon_j, \end{align*}\]

where \(j=i\) when \(t=B\) and \(j=N_B+i\) when \(t=A\), \(D_j=D_i^t\), \(T_j=T_{i,t}\) and \(Y_j=Y_{i,t}\). In that case, we are assuming that \(T_j\) is a random variable, whereas in real life, the sample is stratified with respect to \(T_j\). We will treat this case in the stratification section. For now, assuming that time is sampled as a usual random variable is a useful simplification.

From what we have proven above, we know that:

\[\begin{align*} \epsilon_{j} & = Y_{j}-\left(\esp{Y^0_{j}|D_j=0,T_j=0}+D_j(\esp{Y^0_{j}|D_j=1,T_j=0}-\esp{Y^0_{j}|D_j=0,T_j=0})\right.\\ & \phantom{=Y_{j}-\left(\right.}+T_j(\esp{Y^0_{j}|D_j=0,T_j=1}-\esp{Y^0_{j}|D_j=0,T_j=0})\\ & \phantom{=Y_{j}-\left(\right.}+D_jT_j(\esp{Y^1_{j}|D_j=1,T_j=1}-\esp{Y^0_{j}|D_j=1,T_j=0}\\ & \phantom{=Y_{j}-\left(\right.+D_jT_j}\left.-(\esp{Y^0_{j}|D_j=0,T_j=1}-\esp{Y^0_{j}|D_j=0,T_j=0}))\right) \end{align*}\]

With this notation, and \(N=N_A+N_B\), we have:

\[\begin{align*} \frac{\sqrt{N}}{N}X'\epsilon & =\sqrt{N}\left(\begin{array}{c} \frac{1}{N}\sum_{j=1}^N\epsilon_{j} \\ \frac{1}{N}\sum_{j=1}^ND_j\epsilon_{j} \\ \frac{1}{N}\sum_{j=1}^NT_j\epsilon_{j} \\ \frac{1}{N}\sum_{j=1}^ND_jT_j\epsilon_{j} \end{array}\right). \end{align*}\]

In order to use the vector CLT to study the distribution of these quantities, we first need to compute their expectations. Let us start with \(\esp{D_jT_j\epsilon_{j}}\):

\[\begin{align*} \esp{D_jT_j\epsilon_{j}} & =\esp{\epsilon_{j}|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1)\\ & = 0, \end{align*}\]

where the first equality follows from Bayes’ Law and the second equality from the definition of \(\epsilon_{j}\). Using the same reasoning, we have:

\[\begin{align*} \esp{D_j\epsilon_{j}} & =\left(\esp{\epsilon_{j}|D_j=1,T_j=1}\Pr(T_j=1|D_j=1)+\esp{\epsilon_{j}|D_j=1,T_j=0}\Pr(T_j=0|D_j=1)\right)\Pr(D_j=1)\\ & = 0\\ \esp{T_j\epsilon_{j}} & =\left(\esp{\epsilon_{j}|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)+\esp{\epsilon_{j}|D_j=0,T_j=1}\Pr(D_j=0|T_j=1)\right)\Pr(T_j=1)\\ & = 0\\ \esp{\epsilon_{j}} & =\esp{\epsilon_{j}|T_j=1}\Pr(T_j=1)+\esp{\epsilon_{j}|T_j=0}\Pr(T_j=0)\\ & = (\esp{\epsilon_{j}|T_j=1,D_j=1}\Pr(D_j=1|T_j=1)+\esp{\epsilon_{j}|T_j=1,D_j=0}\Pr(D_j=0|T_j=1))\Pr(T_j=1)\\ & \phantom{=}+(\esp{\epsilon_{j}|T_j=0,D_j=1}\Pr(D_j=1|T_j=0)+\esp{\epsilon_{j}|T_j=0,D_j=0}\Pr(D_j=0|T_j=0))\Pr(T_j=0)\\ & = 0. \end{align*}\]

Using the vector version of the CLT that we have already invoked in the proof of Lemma A.4, we have that \(\sqrt{N}\frac{X'\epsilon}{N}\sim\mathcal{N}((0,0,0,0),\mathbf{V_{x\epsilon}})\) with:

\[\begin{align*} \mathbf{V_{x\epsilon}} & = \esp{\left(\begin{array}{c} \epsilon_j \\ \epsilon_jD_j \\ \epsilon_jT_j \\ \epsilon_jD_jT_j \end{array}\right) \left(\begin{array}{cccc} \epsilon_j & \epsilon_jD_j & \epsilon_jT_j & \epsilon_jD_jT_j \end{array}\right)} - \esp{\left(\begin{array}{c} \epsilon_j \\ \epsilon_jD_j \\ \epsilon_jT_j \\ \epsilon_jD_jT_j \end{array}\right)} \esp{\left(\begin{array}{cccc} \epsilon_j & \epsilon_jD_j & \epsilon_jT_j & \epsilon_jD_jT_j \end{array}\right)}\\ & = \esp{\epsilon_j^2\left(\begin{array}{cccc} 1 & D_j & T_j & D_jT_j \\ D_j & D_j & D_jT_j & D_jT_j \\ T_j & D_jT_j & T_j & D_jT_j \\ D_jT_j & D_jT_j & D_jT_j & D_jT_j \end{array}\right)}, \end{align*}\]

where we made use of the fact that \(T_j^2=T_j\) and \(D_j^2=D_j\), and of the fact that all the expectations computed just above are equal to zero.

Before we can use the Delta Method to derive the distribution of the OLS DID estimator, we need to compute \(\text{plim}N(X'X)^{-1}\) as a function of \(N\) and not \(N_A\), as we did before. In order to obtain that without redoing the matrix inversion all over again (which is pretty awful without the trick of factoring \(N_A\)), we are going to use the fact that the proportion of observations belonging to period \(A\) is equal to \(\bar{T}_A=\frac{N_A}{N}=\frac{1}{N}\sum_{j=1}^NT_j\), and the proportion of observations belonging to period \(B\) is equal to \(1-\bar{T}_A\). We also have that \(k=\frac{N_B}{N_A}=\frac{(1-\bar{T}_A)N}{\bar{T}_AN}=\frac{1-\bar{T}_A}{\bar{T}_A}\). Finally, note that \(\pi=\frac{\bar{D}_B}{\bar{D}_A}\). As a consequence of that and of our previous computations, we have that:

\[\begin{align*} (X'X)^{-1} & = \frac{1}{N_A}\frac{1}{k\bar{D}_A(1-\pi\bar{D}_A)} \left(\begin{array}{cccc} \bar{D}_A & -\bar{D}_A & -\bar{D}_A & \bar{D}_A\\ -\bar{D}_A & \frac{1}{\pi} & \bar{D}_A & -\frac{1}{\pi} \\ -\bar{D}_A & \bar{D}_A & \bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A} & -\bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A}\\ \bar{D}_A & -\frac{1}{\pi} & -\bar{D}_A\frac{k+1-\bar{D}_A(k\pi+1)}{1-\bar{D}_A} & k\frac{1-\pi\bar{D}_A}{1-\bar{D}_A}+\frac{1}{\pi} \end{array}\right) \\ & = \frac{1}{N\bar{T}_A}\frac{1}{\frac{1-\bar{T}_A}{\bar{T}_A}\bar{D}_A(1-\frac{\bar{D}_B}{\bar{D}_A}\bar{D}_A)}\\ & \phantom{=} \left(\begin{array}{cccc} \bar{D}_A & -\bar{D}_A & -\bar{D}_A & \bar{D}_A\\ -\bar{D}_A & \frac{\bar{D}_A}{\bar{D}_B} & \bar{D}_A & -\frac{\bar{D}_A}{\bar{D}_B} \\ -\bar{D}_A & \bar{D}_A & \bar{D}_A\frac{\frac{1-\bar{T}_A}{\bar{T}_A}+1-\bar{D}_A(\frac{1-\bar{T}_A}{\bar{T}_A}\frac{\bar{D}_B}{\bar{D}_A}+1)}{1-\bar{D}_A} & -\bar{D}_A\frac{\frac{1-\bar{T}_A}{\bar{T}_A}+1-\bar{D}_A(\frac{1-\bar{T}_A}{\bar{T}_A}\frac{\bar{D}_B}{\bar{D}_A}+1)}{1-\bar{D}_A}\\ \bar{D}_A & -\frac{\bar{D}_A}{\bar{D}_B} & -\bar{D}_A\frac{\frac{1-\bar{T}_A}{\bar{T}_A}+1-\bar{D}_A(\frac{1-\bar{T}_A}{\bar{T}_A}\frac{\bar{D}_B}{\bar{D}_A}+1)}{1-\bar{D}_A} & \frac{1-\bar{T}_A}{\bar{T}_A}\frac{1-\frac{\bar{D}_B}{\bar{D}_A}\bar{D}_A}{1-\bar{D}_A}+\frac{\bar{D}_A}{\bar{D}_B} \end{array}\right) \\ & = \frac{1}{N}\frac{1}{(1-\bar{T}_A)(1-\bar{D}_B)}\\ & \phantom{=} \left(\begin{array}{cccc} 1 & -1 & -1 & 1\\ -1 & \frac{1}{\bar{D}_B} & 1 & -\frac{1}{\bar{D}_B} \\ -1 & 1 & \frac{1-\bar{D}_B+\bar{T}_A(\bar{D}_B-\bar{D}_A)}{\bar{T}_A(1-\bar{D}_A)} & -\frac{1-\bar{D}_B+\bar{T}_A(\bar{D}_B-\bar{D}_A)}{\bar{T}_A(1-\bar{D}_A)}\\ 1 & -\frac{1}{\bar{D}_B} & -\frac{1-\bar{D}_B+\bar{T}_A(\bar{D}_B-\bar{D}_A)}{\bar{T}_A(1-\bar{D}_A)} & \frac{1}{\bar{D}_A}\frac{1-\bar{T}_A}{\bar{T}_A}\frac{1-\bar{D}_B}{1-\bar{D}_A}+\frac{1}{\bar{D}_B} \end{array}\right), \end{align*}\]

because:

\[\begin{align*} \frac{\frac{1-\bar{T}_A}{\bar{T}_A}+1-\bar{D}_A(\frac{1-\bar{T}_A}{\bar{T}_A}\frac{\bar{D}_B}{\bar{D}_A}+1)}{1-\bar{D}_A} & = \frac{1+\frac{1-\bar{T}_A}{\bar{T}_A}-\frac{1-\bar{T}_A}{\bar{T}_A}\bar{D}_B- \bar{D}_A}{1-\bar{D}_A} \\ & = \frac{\bar{T}_A+1-\bar{T}_A-\bar{D}_B+\bar{T}_A\bar{D}_B- \bar{D}_A\bar{T}_A}{\bar{T}_A(1-\bar{D}_A)}\\ & = \frac{1-\bar{D}_B+\bar{T}_A(\bar{D}_B-\bar{D}_A)}{\bar{T}_A(1-\bar{D}_A)}. \end{align*}\]

As a consequence, we have:

\[\begin{align*} \text{plim}N(X'X)^{-1} & = \frac{1}{(1-p_A)(1-p)} \left(\begin{array}{cccc} 1 & -1 & -1 & 1\\ -1 & \frac{1}{p} & 1 & -\frac{1}{p} \\ -1 & 1 & \frac{1}{p_A} & -\frac{1}{p_A} \\ 1 & -\frac{1}{p} & -\frac{1}{p_A} & \frac{1}{pp_A} \end{array}\right), \end{align*}\]

using the Law of Large Numbers, Slutsky’s Theorem and the fact that \(\text{plim}\bar{T}_A=p_A\), the proportion of observations stemming from the After period, \(\text{plim}\bar{D}_A=\text{plim}\bar{D}_B=p\) and the fact that \(\frac{1}{p}\frac{1-p_A}{p_A}+\frac{1}{p}=\frac{1}{pp_A}\).

Now, we can derive the asymptotic distribution of \(\sqrt{N}(\hat{\Theta}_{OLS}-\Theta)=N(X'X)^{-1}\frac{\sqrt{N}}{N}X'\epsilon\). Using the Delta Method, we have that \(\sqrt{N}(\hat{\Theta}_{OLS}-\Theta)\stackrel{d}{\rightarrow}\mathcal{N}\left((0,0,0,0),\sigma_{XX}^{-1}\mathbf{V_{x\epsilon}}\sigma_{XX}^{-1}\right)\). So we’re in for a treat: deriving the lower diagonal term in the quadratic form \(\sigma_{XX}^{-1}\mathbf{V_{x\epsilon}}\sigma_{XX}^{-1}\).

Let us start. We first need the four terms on the last line of \((1-p_A)(1-p)\sigma_{XX}^{-1}\mathbf{V_{x\epsilon}}\), which we denote \((\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{D})\) (the factor in front absorbs the constant term in \(\sigma_{XX}^{-1}\)):

\[\begin{align*} \mathbf{A}& = \esp{\epsilon_j^2} -\frac{1}{p}\esp{\epsilon_j^2D_j} -\frac{1}{p_A}\esp{\epsilon_j^2T_j} + \frac{1}{pp_A}\esp{\epsilon_j^2D_jT_j}\\ & = \esp{\epsilon_j^2|D_j=0,T_j=0}\Pr(D_j=0|T_j=0)\Pr(T_j=0)+\esp{\epsilon_j^2|D_j=1,T_j=0}\Pr(D_j=1|T_j=0)\Pr(T_j=0)\\ & \phantom{=}+\esp{\epsilon_j^2|D_j=0,T_j=1}\Pr(D_j=0|T_j=1)\Pr(T_j=1)+\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1) \\ & \phantom{=}-\frac{1}{p}\esp{\epsilon_j^2|D_j=1}\Pr(D_j=1) -\frac{1}{p_A}\esp{\epsilon_j^2|T_j=1}\Pr(T_j=1) \\ & \phantom{=}+ \frac{1}{pp_A}\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1)\\ & = \esp{\epsilon_j^2|D_j=0,T_j=0}(1-p)(1-p_A)+\esp{\epsilon_j^2|D_j=1,T_j=0}p(1-p_A)\\ & \phantom{=}+\esp{\epsilon_j^2|D_j=0,T_j=1}(1-p)p_A+\esp{\epsilon_j^2|D_j=1,T_j=1}pp_A \\ & \phantom{=}-\esp{\epsilon_j^2|D_j=1} -\esp{\epsilon_j^2|T_j=1} \\ & \phantom{=}+ \esp{\epsilon_j^2|D_j=1,T_j=1}\\ & = \esp{\epsilon_j^2|D_j=0,T_j=0}(1-p)(1-p_A)+\esp{\epsilon_j^2|D_j=1,T_j=0}p(1-p_A)\\ & \phantom{=}+\esp{\epsilon_j^2|D_j=0,T_j=1}(1-p)p_A+\esp{\epsilon_j^2|D_j=1,T_j=1}pp_A \\ & \phantom{=}-\esp{\epsilon_j^2|D_j=1,T_j=0}(1-p_A)-\esp{\epsilon_j^2|D_j=1,T_j=1}p_A\\ & \phantom{=}-\esp{\epsilon_j^2|D_j=0,T_j=1}(1-p)-\esp{\epsilon_j^2|D_j=1,T_j=1}p \\ & \phantom{=}+ \esp{\epsilon_j^2|D_j=1,T_j=1}\\ & = \esp{\epsilon_j^2|D_j=0,T_j=0}(1-p)(1-p_A)\\ & \phantom{=}-\esp{\epsilon_j^2|D_j=1,T_j=0}(1-p)(1-p_A)\\ & \phantom{=}-\esp{\epsilon_j^2|D_j=0,T_j=1}(1-p)(1-p_A)\\ & \phantom{=}+\esp{\epsilon_j^2|D_j=1,T_j=1}(1-p)(1-p_A)\\ & = (1-p)(1-p_A)(\sigma_{\epsilon_{0,0}}^2-\sigma_{\epsilon_{1,0}}^2-\sigma_{\epsilon_{0,1}}^2+\sigma_{\epsilon_{1,1}}^2)\\ \mathbf{B}& = \esp{\epsilon_j^2D_j} -\frac{1}{p}\esp{\epsilon_j^2D_j} -\frac{1}{p_A}\esp{\epsilon_j^2D_jT_j} + \frac{1}{pp_A}\esp{\epsilon_j^2D_jT_j}\\ & = \esp{\epsilon_j^2D_j}(1-\frac{1}{p})+\esp{\epsilon_j^2D_jT_j}\frac{1}{p_A}(\frac{1}{p}-1) \\ & = (1-\frac{1}{p})\esp{\epsilon_j^2|D_j=1}\Pr(D_j=1)+\frac{1}{p_A}(\frac{1}{p}-1)\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1) \\ & = p(1-\frac{1}{p})\left(\esp{\epsilon_j^2|D_j=1,T_j=0}\Pr(T_j=0|D_j=1)+\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(T_j=1|D_j=1)\right)\\ & \phantom{=}+\frac{1}{p_A}(\frac{1}{p}-1)\sigma_{\epsilon_{1,1}}^2pp_A\\ & = p(1-\frac{1}{p})\left(\sigma_{\epsilon_{1,0}}^2(1-p_A)+\sigma_{\epsilon_{1,1}}^2p_A\right)+\frac{1}{p_A}(\frac{1}{p}-1)\sigma_{\epsilon_{1,1}}^2pp_A\\ & = -(1-p)\left(\sigma_{\epsilon_{1,0}}^2(1-p_A)+\sigma_{\epsilon_{1,1}}^2p_A\right)+(1-p)\sigma_{\epsilon_{1,1}}^2\\ & = (1-p)\left(\sigma_{\epsilon_{1,1}}^2-\sigma_{\epsilon_{1,0}}^2(1-p_A)-\sigma_{\epsilon_{1,1}}^2p_A\right)\\ & = (1-p)(1-p_A)\left(\sigma_{\epsilon_{1,1}}^2-\sigma_{\epsilon_{1,0}}^2\right) \\ \mathbf{C}& = \esp{\epsilon_j^2T_j} -\frac{1}{p}\esp{\epsilon_j^2D_jT_j} -\frac{1}{p_A}\esp{\epsilon_j^2T_j} + \frac{1}{pp_A}\esp{\epsilon_j^2D_jT_j}\\ & = \esp{\epsilon_j^2|T_j=1}\Pr(T_j=1)-\frac{1}{p}\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1)\\ & \phantom{=}-\frac{1}{p_A}\esp{\epsilon_j^2|T_j=1}\Pr(T_j=1)+\frac{1}{pp_A}\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1)\\ & = -(1-p_A)(\esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)+\esp{\epsilon_j^2|D_j=0,T_j=1}\Pr(D_j=0|T_j=1))\\ & \phantom{=}+\esp{\epsilon_j^2|D_j=1,T_j=1}(1-p_A)\\ & = (1-p_A)(\sigma_{\epsilon_{1,1}}^2(1-p)-(1-p)\sigma_{\epsilon_{0,1}}^2)\\ & = (1-p)(1-p_A)(\sigma_{\epsilon_{1,1}}^2-\sigma_{\epsilon_{0,1}}^2)\\ \mathbf{D}& = \esp{\epsilon_j^2D_jT_j} -\frac{1}{p}\esp{\epsilon_j^2D_jT_j} -\frac{1}{p_A}\esp{\epsilon_j^2D_jT_j} + 
\frac{1}{pp_A}\esp{\epsilon_j^2D_jT_j}\\ & = \esp{\epsilon_j^2|D_j=1,T_j=1}\Pr(D_j=1|T_j=1)\Pr(T_j=1)(1-\frac{1}{p}-\frac{1}{p_A}+\frac{1}{pp_A})\\ & = \sigma_{\epsilon_{1,1}}^2(pp_A-p_A-p+1)\\ & = (1-p)(1-p_A)\sigma_{\epsilon_{1,1}}^2 \end{align*}\]

since \(1-p-p_A+pp_A=(1-p)(1-p_A)\), and where \(\sigma_{\epsilon_{d,t}}^2=\esp{\epsilon_j^2|D_j=d,T_j=t}\). We also make use of the fact that \(\Pr(D_j=d|T_j=t)=\Pr(D_j=d)\) and \(\Pr(T_j=t|D_j=d)=\Pr(T_j=t)\), that is, participants and non-participants are sampled in exactly the same proportion in both periods.

Let us now obtain \(\var{\hat{\beta}^{OLS}}\), the variance of the \(\hat{\beta}^{OLS}\) parameter. It is the last diagonal term of the matrix \(\sigma_{XX}^{-1}\mathbf{V_{x\epsilon}}\sigma_{XX}^{-1}\). We know that:

\[\begin{align*} \var{\hat{\beta}^{OLS}} & = \frac{1}{(1-p)^2(1-p_A)^2}\left(\mathbf{A}-\frac{1}{p}\mathbf{B}-\frac{1}{p_A}\mathbf{C}+\frac{1}{pp_A}\mathbf{D}\right)\\ & = \frac{1}{(1-p)(1-p_A)}\left(\sigma_{\epsilon_{0,0}}^2-\sigma_{\epsilon_{1,0}}^2-\sigma_{\epsilon_{0,1}}^2+\sigma_{\epsilon_{1,1}}^2 -\frac{1}{p}(\sigma_{\epsilon_{1,1}}^2-\sigma_{\epsilon_{1,0}}^2) \right.\\ & \phantom{= \frac{1}{(1-p)(1-p_A)}\left(\right.}\left. -\frac{1}{p_A}(\sigma_{\epsilon_{1,1}}^2-\sigma_{\epsilon_{0,1}}^2)+ \frac{1}{pp_A}\sigma_{\epsilon_{1,1}}^2\right)\\ & = \frac{1}{(1-p)(1-p_A)}\left(\sigma_{\epsilon_{0,0}}^2+\sigma_{\epsilon_{1,0}}^2(-1+\frac{1}{p})+\sigma_{\epsilon_{0,1}}^2(-1+\frac{1}{p_A})+\sigma_{\epsilon_{1,1}}^2(1-\frac{1}{p}-\frac{1}{p_A}+\frac{1}{pp_A})\right)\\ & = \frac{1}{(1-p)(1-p_A)}\left(\sigma_{\epsilon_{0,0}}^2+\sigma_{\epsilon_{1,0}}^2\frac{1-p}{p}+\sigma_{\epsilon_{0,1}}^2\frac{1-p_A}{p_A}+\sigma_{\epsilon_{1,1}}^2\frac{pp_A-p_A-p+1}{pp_A}\right)\\ & = \frac{\sigma_{\epsilon_{0,0}}^2}{(1-p)(1-p_A)}+\frac{\sigma_{\epsilon_{1,0}}^2}{p(1-p_A)}+\frac{\sigma_{\epsilon_{0,1}}^2}{(1-p)p_A}+\frac{\sigma_{\epsilon_{1,1}}^2}{pp_A}, \end{align*}\]

using again the fact that \(1-p-p_A+pp_A=(1-p)(1-p_A)\).
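
Here is a rough Monte Carlo sketch (in Python, with made-up cell standard deviations; illustrative only) of the asymptotic variance formula we just derived: the simulated variance of \(\sqrt{N}(\hat{\beta}^{OLS}-\beta)\) should be close to the sum of the four variance terms.

```python
# A rough Monte Carlo sketch (illustrative, not from the text): it checks the
# asymptotic variance formula for the OLS DID coefficient derived above, with
# made-up cell-specific error standard deviations.
import numpy as np

rng = np.random.default_rng(2)
p, p_A, N, reps = 0.4, 0.6, 2_000, 4_000
s00, s10, s01, s11 = 1.0, 1.5, 0.8, 2.0      # sigma_{epsilon_{d,t}} for cells (d,t)

betas = []
for _ in range(reps):
    D = rng.binomial(1, p, N)
    T = rng.binomial(1, p_A, N)
    s = np.where(D == 1, np.where(T == 1, s11, s10), np.where(T == 1, s01, s00))
    Y = 1.0 + 0.5 * D + 0.3 * T + 0.7 * D * T + rng.normal(0, s)
    X = np.column_stack([np.ones(N), D, T, D * T])
    betas.append(np.linalg.lstsq(X, Y, rcond=None)[0][3])

mc_var = N * np.var(betas)                   # variance of sqrt(N)(beta-hat - beta)
theory = (s00**2 / ((1 - p) * (1 - p_A)) + s10**2 / (p * (1 - p_A))
          + s01**2 / ((1 - p) * p_A) + s11**2 / (p * p_A))
print(mc_var, theory)                        # the two numbers should be close
```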

Finally, using the formula for \(\epsilon_j\), we have:

\[\begin{align*} \sigma_{\epsilon_{0,0}}^2 & = \esp{\epsilon_j^2|D_j=0,T_j=0} \\ & = \esp{(Y_j-\esp{Y^0_j|D_j=0,T_j=0})^2|D_j=0,T_j=0}\\ & = \esp{(Y^0_{i,B}-\esp{Y^0_{i,B}|D_i=0})^2|D_i=0}\\ & = \var{Y^0_{i,B}|D_i=0}\\ \sigma_{\epsilon_{1,0}}^2 & = \esp{\epsilon_j^2|D_j=1,T_j=0} \\ & = \esp{(Y_j-\esp{Y^0_j|D_j=1,T_j=0})^2|D_j=1,T_j=0}\\ & = \esp{(Y^0_{i,B}-\esp{Y^0_{i,B}|D_i=1})^2|D_i=1}\\ & = \var{Y^0_{i,B}|D_i=1}\\ \sigma_{\epsilon_{0,1}}^2 & = \esp{\epsilon_j^2|D_j=0,T_j=1} \\ & = \esp{(Y_j-\esp{Y^0_j|D_j=0,T_j=1})^2|D_j=0,T_j=1}\\ & = \esp{(Y^0_{i,A}-\esp{Y^0_{i,A}|D_i=0})^2|D_i=0}\\ & = \var{Y^0_{i,A}|D_i=0}\\ \sigma_{\epsilon_{1,1}}^2 & = \esp{\epsilon_j^2|D_j=1,T_j=1} \\ & = \esp{(Y_j-\esp{Y^1_j|D_j=1,T_j=1})^2|D_j=1,T_j=1}\\ & = \esp{(Y^1_{i,A}-\esp{Y^1_{i,A}|D_i=1})^2|D_i=1}\\ & = \var{Y^1_{i,A}|D_i=1} \end{align*}\]

This proves the result.

A.3.3 Proof of Theorem 4.19

The proof uses saturated models, as in Angrist and Pischke (2009). A saturated model is a model involving only categorical variables, with a separate parameter for each set of values that the covariates can take. We can check that Sun and Abraham's model is a saturated model. Let's start with the model in repeated cross sections (with group fixed effects). In the population, excluding the group of individuals that are always treated (adding this group would entail adding a separate dummy for each date at which they are observed; I leave that as an exercise), with \(T=4\) (larger time series do not change the basic result):

\[\begin{align*} \esp{Y_{i,1}|D_i=\infty} & = \alpha \\ \esp{Y_{i,2}|D_i=\infty} &= \alpha+\delta_2 \\ \esp{Y_{i,3}|D_i=\infty} & = \alpha+\delta_3 \\ \esp{Y_{i,4}|D_i=\infty} & = \alpha+\delta_4 \\ \esp{Y_{i,1}|D_i=2} & = \alpha +\mu_2 \\ \esp{Y_{i,2}|D_i=2} & = \alpha +\mu_2 +\delta_2 + \beta_{2,0}^{SA}\\ \esp{Y_{i,3}|D_i=2} & = \alpha +\mu_2 +\delta_3 + \beta_{2,1}^{SA} \\ \esp{Y_{i,4}|D_i=2} & = \alpha +\mu_2 +\delta_4 + \beta_{2,2}^{SA} \\ \esp{Y_{i,1}|D_i=3} & = \alpha +\mu_3 + \beta_{3,-2}^{SA} \\ \esp{Y_{i,2}|D_i=3} & = \alpha +\mu_3 + \delta_2 \\ \esp{Y_{i,3}|D_i=3} & = \alpha +\mu_3 + \delta_3 + \beta_{3,0}^{SA} \\ \esp{Y_{i,4}|D_i=3} & = \alpha+\mu_3 + \delta_4 + \beta_{3,1}^{SA} \\ \esp{Y_{i,1}|D_i=4} & = \alpha +\mu_4 +\beta_{4,-3}^{SA} \\ \esp{Y_{i,2}|D_i=4} & = \alpha +\mu_4 + \delta_2+\beta_{4,-2}^{SA} \\ \esp{Y_{i,3}|D_i=4} & = \alpha +\mu_4 + \delta_3 \\ \esp{Y_{i,4}|D_i=4} & = \alpha +\mu_4 + \delta_4 + \beta_{4,0}^{SA} \end{align*}\]

The model has 16 parameters to model the 16 different combinations of the regressors. It is thus a saturated model. Let us now state the Linear Conditional Expectation Function Theorem:

Theorem A.2 (Linear Conditional Expectation Function) Let \(\esp{Y_i|X_i}=X_i'\Theta^*\) for a \(K\times 1\) vector of coefficients \(\Theta^*\). Then \(\Theta^*=\esp{X_i'X_i}^{-1}\esp{X_i'Y_i}=\Theta^{OLS}\).

Theorem A.2 states that the coefficients of a model with a linear conditional expectation function can be obtained by using OLS. Applying Theorem A.2 to Sun and Abraham’s saturated model, we have that:

\[\begin{align*} \alpha & = \esp{Y_{i,1}|D_i=\infty}=\alpha^{OLS} \\ \delta_2 & =\esp{Y_{i,2}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty}=\delta_2^{OLS} \\ \delta_3 &= \esp{Y_{i,3}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty}=\delta_3^{OLS} \\ \delta_4 & = \esp{Y_{i,4}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty}=\delta_4^{OLS} \\ \mu_2 & = \esp{Y_{i,1}|D_i=2}-\esp{Y_{i,1}|D_i=\infty} =\mu_2^{OLS} \\ \mu_3 & = \esp{Y_{i,2}|D_i=3}-\esp{Y_{i,2}|D_i=\infty} =\mu_3^{OLS} \\ \mu_4 & = \esp{Y_{i,3}|D_i=4}-\esp{Y_{i,3}|D_i=\infty} =\mu_4^{OLS} \\ \beta_{2,0}^{SA} & = \esp{Y_{i,2}|D_i=2}-\esp{Y_{i,1}|D_i=2}-(\esp{Y_{i,2}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty}) =\beta_{2,0}^{OLS} \\ \beta_{2,1}^{SA} & = \esp{Y_{i,3}|D_i=2}-\esp{Y_{i,1}|D_i=2}-(\esp{Y_{i,3}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty})=\beta_{2,1}^{OLS} \\ \beta_{2,2}^{SA} & = \esp{Y_{i,4}|D_i=2}-\esp{Y_{i,1}|D_i=2}-(\esp{Y_{i,4}|D_i=\infty}-\esp{Y_{i,1}|D_i=\infty})=\beta_{2,2}^{OLS} \\ \beta_{3,-2}^{SA} & =\esp{Y_{i,1}|D_i=3}-\esp{Y_{i,2}|D_i=3}-(\esp{Y_{i,1}|D_i=\infty}-\esp{Y_{i,2}|D_i=\infty})=\beta_{3,-2}^{OLS}\\ \beta_{3,0}^{SA} & =\esp{Y_{i,3}|D_i=3}-\esp{Y_{i,2}|D_i=3}-(\esp{Y_{i,3}|D_i=\infty}-\esp{Y_{i,2}|D_i=\infty})=\beta_{3,0}^{OLS}\\ \beta_{3,1}^{SA} & =\esp{Y_{i,4}|D_i=3}-\esp{Y_{i,2}|D_i=3}-(\esp{Y_{i,4}|D_i=\infty}-\esp{Y_{i,2}|D_i=\infty})=\beta_{3,1}^{OLS}\\ \beta_{4,-3}^{SA} & =\esp{Y_{i,1}|D_i=4}-\esp{Y_{i,3}|D_i=4}-(\esp{Y_{i,1}|D_i=\infty}-\esp{Y_{i,3}|D_i=\infty})=\beta_{4,-3}^{OLS}\\ \beta_{4,-2}^{SA} & =\esp{Y_{i,2}|D_i=4}-\esp{Y_{i,3}|D_i=4}-(\esp{Y_{i,2}|D_i=\infty}-\esp{Y_{i,3}|D_i=\infty})=\beta_{4,-2}^{OLS}\\ \beta_{4,0}^{SA} & =\esp{Y_{i,4}|D_i=4}-\esp{Y_{i,3}|D_i=4}-(\esp{Y_{i,4}|D_i=\infty}-\esp{Y_{i,3}|D_i=\infty})=\beta_{4,0}^{OLS} \end{align*}\]

This proves that Sun and Abraham’s estimator is actually equal to the individual DID estimators \(\beta_{d,\tau}^{SA}=\Delta^{Y}_{DID}(d,\infty,\tau,d-1)\) in the population, which completes the proof for the model with group fixed effects and \(T=4\). I leave generalizing this result to any \(T\) and to the panel data model with individual fixed effects as an exercise.
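
The population argument above can be mimicked in a finite sample: because the model is saturated, the OLS fit reproduces the sample cell means, so the estimated \(\hat\beta^{SA}_{d,\tau}\) equal the corresponding sample DIDs exactly. Here is a minimal simulation sketch (in Python, with made-up group sizes and effects; not the author's code) illustrating this for \(\hat\beta^{SA}_{2,0}\) and \(\hat\beta^{SA}_{3,-2}\):

```python
# A minimal simulation sketch (illustrative, not the author's code): because the
# model is saturated, OLS reproduces the sample cell means, so the estimated
# interaction coefficients equal the corresponding sample DIDs exactly.
# Group sizes, time effects and treatment effects are made up.
import numpy as np

rng = np.random.default_rng(3)
n_per_cell, periods = 500, [1, 2, 3, 4]

rows = []
for d in [2, 3, 4, np.inf]:
    for t in periods:
        y = rng.normal(0.2 * t + (1.0 if t >= d else 0.0), 1.0, n_per_cell)
        rows += [(d, t, yi) for yi in y]
D = np.array([r[0] for r in rows])
T = np.array([r[1] for r in rows])
Y = np.array([r[2] for r in rows])

cols, names = [np.ones_like(Y)], ["alpha"]
for d in [2, 3, 4]:
    cols.append((D == d).astype(float)); names.append(f"mu_{d}")
for t in [2, 3, 4]:
    cols.append((T == t).astype(float)); names.append(f"delta_{t}")
for d in [2, 3, 4]:
    for t in periods:
        if t != d - 1:                                   # omit the reference period tau = -1
            cols.append(((D == d) & (T == t)).astype(float))
            names.append(f"beta_{d},{t - d}")
theta = dict(zip(names, np.linalg.lstsq(np.column_stack(cols), Y, rcond=None)[0]))

def cell(d, t):                                          # sample cell mean
    return Y[(D == d) & (T == t)].mean()

did_2_0  = cell(2, 2) - cell(2, 1) - (cell(np.inf, 2) - cell(np.inf, 1))
did_3_m2 = cell(3, 1) - cell(3, 2) - (cell(np.inf, 1) - cell(np.inf, 2))
print(np.isclose(theta["beta_2,0"], did_2_0), np.isclose(theta["beta_3,-2"], did_3_m2))
```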

A.3.4 Proof of Theorem 4.20

Let us start with Sun and Abraham’s model in repeated cross sections, with group fixed effects. Here is how we can write this model with four time periods:

\[\begin{align*} Y & = X\Theta + \epsilon \end{align*}\] with \[\begin{align*} Y & =\left(\begin{array}{c} Y_{1,1} \\ \vdots \\ Y_{N_1,1}\\ Y_{1,2} \\ \vdots \\Y_{N_2,2}\\ Y_{1,3} \\ \vdots \\ Y_{N_3,3}\\ Y_{1,4} \\ \vdots \\ Y_{N_4,4} \end{array}\right) \\ X & = \left(\begin{array}{cccccccccccccccc} 1 & D^2_1 & D^3_1 & D^4_1 & 0 & 0 & 0 & 0 & 0 & 0 & D^3_1 & 0 & 0 & D^4_1 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &\vdots & \vdots& \vdots & \vdots & \vdots & \vdots \\ 1 & D^2_{N_1} & D^3_{N_1} & D^4_{N_1} & 0 & 0 & 0 & 0 & 0 & 0 & D^3_{N_1} & 0 & 0 & D^4_{N_1} & 0 & 0\\ 1 & D^2_1 & D^3_1 & D^4_1 & 1 & 0 & 0 & D^2_1 & 0 & 0 & 0 & 0 & 0 & 0 & D^4_1 &0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &\vdots & \vdots & \vdots & \vdots& \vdots & \vdots & \vdots & \vdots \\ 1 & D^2_{N_2} & D^3_{N_2} & D^4_{N_2} & 1 & 0 & 0 & D^2_{N_2} & 0 & 0 & 0 & 0 & 0 &0 & D^4_{N_2} & 0\\ 1 & D^2_1 & D^3_1 & D^4_1 & 0 & 1 & 0 & 0 & D^2_1 & 0 & 0 & D^3_1 & 0 & 0 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots& \vdots & \vdots & \vdots & \vdots \\ 1 & D^2_{N_3} & D^3_{N_3} & D^4_{N_3} & 0 & 1 & 0 & 0 & D^2_{N_3} & 0 & 0 & D^3_{N_3} & 0 & 0 & 0 & 0\\ 1 & D^2_1 & D^3_1 & D^4_1 & 0 & 0 & 1 & 0 & 0 & D^2_1 & 0 & 0 & D^3_1 & 0 & 0 & D^4_1\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots& \vdots & \vdots & \vdots & \vdots \\ 1 & D^2_{N_4} & D^3_{N_4} & D^4_{N_4} & 0 & 0 & 1 & 0 & 0 & D^2_{N_4} & 0 & 0 & D^3_{N_4} & 0 & 0 & D^4_{N_4} \end{array}\right)\\ \Theta & =\left(\begin{array}{c} \alpha \\ \mu_2\\ \mu_3\\ \mu_4 \\ \delta_2\\ \delta_3\\ \delta_4 \\ \beta^{SA}_{2,0}\\ \beta^{SA}_{2,1}\\ \beta^{SA}_{2,2}\\ \beta^{SA}_{3,-2}\\ \beta^{SA}_{3,0}\\ \beta^{SA}_{3,1}\\ \beta^{SA}_{4,-3}\\ \beta^{SA}_{4,-2}\\ \beta^{SA}_{4,0} \end{array}\right)\\ \epsilon & =\left(\begin{array}{c} \epsilon_{1,1} \\ \vdots \\\epsilon_{N_1,1}\\ \epsilon_{1,2} \\ \vdots \\ \epsilon_{N_2,2}\\ \epsilon_{1,3} \\ \vdots \\ \epsilon_{N_3,3}\\ \epsilon_{1,4} \\ \vdots \\ \epsilon_{N_4,4} \end{array}\right), \end{align*}\]

with \(D^d_i=\uns{D_i=d}\) and \(N_t\) the number of observations at time \(t\). If we are in a panel, each \(i\) is the same across time periods. If we are in a repeated cross section, the \(i\) index refers to different individuals. This model is very difficult to solve by brute force, since its \(X'X\) matrix is \(16 \times 16\) and has no easy simplification in sight. Here is the \(X'X\) for panel data (which is slightly simpler), with \(\bar{D^d}=\frac{1}{N}\sum_{i=1}^N\uns{D_i=d}\), and \(N\) the number of individuals in the panel:

\[\begin{align*} X'X & = N\left(\begin{array}{cccccccccccccccc} T & T\bar{D^2} & T\bar{D^3} & T\bar{D^4} & 1 & 1 & 1 & \bar{D^2} & \bar{D^2} & \bar{D^2} & \bar{D^3} & \bar{D^3} & \bar{D^3} & \bar{D^4} & \bar{D^4} & \bar{D^4}\\ T\bar{D^2} & T\bar{D^2} & 0 & 0 & \bar{D^2} & \bar{D^2} & \bar{D^2} & \bar{D^2} & \bar{D^2} & \bar{D^2} & 0 & 0 & 0 & 0 & 0 & 0 \\ T\bar{D^3} & 0 & T\bar{D^3} & 0 & \bar{D^3} & \bar{D^3} & \bar{D^3} & 0 & 0 & 0 & \bar{D^3} & \bar{D^3} & \bar{D^3} & 0 & 0 & 0\\ T\bar{D^4} & 0 & 0 & T\bar{D^4} & \bar{D^4} & \bar{D^4} & \bar{D^4} & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^4} & \bar{D^4} & \bar{D^4}\\ 1 & \bar{D^2} & \bar{D^3} & \bar{D^4} & 1 & 0 & 0 & \bar{D^2} & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^4} & 0 \\ 1 & \bar{D^2} & \bar{D^3} & \bar{D^4} & 0 & 1 & 0 & 0 & \bar{D^2} & 0 & 0 & \bar{D^3} & 0 & 0 & 0 & 0\\ 1 & \bar{D^2} & \bar{D^3} & \bar{D^4} & 0 & 0 & 1 & 0 & 0 & \bar{D^2} & 0 & 0 & \bar{D^3} & 0 & 0 & \bar{D^4}\\ \bar{D^2} & \bar{D^2} & 0 & 0 & \bar{D^2} & 0 & 0 & \bar{D^2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \bar{D^2} & \bar{D^2} & 0 & 0 & 0 & \bar{D^2} & 0 & 0 & \bar{D^2} & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ \bar{D^2} & \bar{D^2} & 0 & 0 & 0 & 0 & \bar{D^2} & 0 & 0 & \bar{D^2} & 0 & 0 & 0 & 0 & 0 & 0 \\ \bar{D^3} & 0 & \bar{D^3} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^3} & 0 & 0 & 0 & 0 & 0 \\ \bar{D^3} & 0 & \bar{D^3} & 0 & 0 & \bar{D^3} & 0 & 0 & 0 & 0 & 0 & \bar{D^3} & 0 & 0 & 0 & 0 \\ \bar{D^3} & 0 & \bar{D^3} & 0 & 0 & 0 & \bar{D^3} & 0 & 0 & 0 & 0 & 0 & \bar{D^3} & 0 & 0 & 0 \\ \bar{D^4} & 0 & 0 & \bar{D^4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^4} & 0 & 0 \\ \bar{D^4} & 0 & 0 & \bar{D^4} & \bar{D^4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^4} & 0 \\ \bar{D^4} & 0 & 0 & \bar{D^4} & 0 & 0 & \bar{D^4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \bar{D^4} \end{array}\right) \end{align*}\]

The epiphany comes when you are able to write this model with a separate constant, time and group dummies for each treated group and relative time period for which we want to estimate a DID parameter. We have 9 separate interaction parameters \(\beta^{SA}_{d,\tau}\) to estimate. We are thus going to estimate \(9 \times 4\) parameters total (i.e. run 9 separate regressions with four parameters, but all at once). We thus have to write a \(36 \times 36\) \(X'X\) matrix. The key is that this matrix is going to be block diagonal, with \(4 \times 4\) blocks that are identical to the blocks of the \(X'X\) matrix in the case of a simple OLS DID estimator with two time periods. Also, the parameters that are redundant in this model (i.e. that appear several times at different places) will be estimated in exactly the same way, which shows that the two formulations of the model (\(16 \times 16\) and \(36 \times 36\)) are equivalent and estimate the exact same set of 16 parameters. In order to see how this works (and to prove the result), let's write the model for \(\beta^{SA}_{2,0}\). To be able to do that, we are going to order all the observations in each time period by the opposite of the treatment group to which they belong. We also denote \(N_t^{d}\) the number of observations of group \(D_i=d\) in period \(t\), \(D^d_{i,t}=\uns{D_i=d}\) and \(T^{d+\tau}_{i,t}=\uns{d+\tau=t}\). With these notations, we have:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,1} \\ \vdots \\ Y_{N_1^{\infty},1} \\Y_{N_1^{\infty}+1,1} \\ \vdots \\ Y_{N_1^{\infty}+N_1^{2},1} \\ Y_{1,2} \\ \vdots \\ Y_{N_2^{\infty},2} \\Y_{N_2^{\infty}+1,2} \\ \vdots \\ Y_{N_2^{\infty}+N_2^{2},2} \end{array}\right)}_{Y_{2,0}} & = \underbrace{\left(\begin{array}{cccc} 1 & D^2_{1,1} & T^2_{1,1} & D^2_{1,1}T^2_{1,1}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^2_{N_1^{\infty},1} & T^2_{N_1^{\infty},1} & D^2_{N_1^{\infty},1}T^2_{N_1^{\infty},1} \\ 1 & D^2_{N_1^{\infty}+1,1} & T^2_{N_1^{\infty}+1,1} & D^2_{N_1^{\infty}+1,1}T^2_{N_1^{\infty}+1,1} \\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^2_{N_1^{\infty}+N_1^{2},1} & T^2_{N_1^{\infty}+N_1^{2},1} & D^2_{N_1^{\infty}+N_1^{2},1}T^2_{N_1^{\infty}+N_1^{2},1} \\ 1 & D^2_{1,2} & T^2_{1,2} & D^2_{1,2}T^2_{1,2}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^2_{N_2^{\infty},2} & T^2_{N_2^{\infty},2} & D^2_{N_2^{\infty},2}T^2_{N_2^{\infty},2} \\ 1 & D^2_{N_2^{\infty}+1,2} & T^2_{N_2^{\infty}+1,2} & D^2_{N_2^{\infty}+1,2}T^2_{N_2^{\infty}+1,2} \\ \vdots & \vdots & \vdots & \vdots\\ 1 & D^2_{N_2^{\infty}+N_2^{2},2} & T^2_{N_2^{\infty}+N_2^{2},2} & D^2_{N_2^{\infty}+N_2^{2},2}T^2_{N_2^{\infty}+N_2^{2},2} \end{array}\right)}_{X_{2,0}} \underbrace{\left(\begin{array}{c} \tilde\alpha_{2,0} \\ \tilde\mu_{2,0} \\ \tilde\delta_{2,0} \\ \beta^{SA}_{2,0} \end{array}\right)}_{\Theta_{2,0}} + \underbrace{\left(\begin{array}{c} \epsilon_{1,1} \\ \vdots \\ \epsilon_{N_1^{\infty},1} \\\epsilon_{N_1^{\infty}+1,1} \\ \vdots \\ \epsilon_{N_1^{\infty}+N_1^{2},1} \\ \epsilon_{1,2} \\ \vdots \\ \epsilon_{N_2^{\infty},2} \\\epsilon_{N_2^{\infty}+1,2} \\ \vdots \\ \epsilon_{N_2^{\infty}+N_2^{2},2} \end{array}\right)}_{\epsilon_{2,0}} \end{align*}\]

Now, we can write 9 such models, one for each \(\beta^{SA}_{d,\tau}\). If we stack the \(Y_{d,\tau}\) on top of each other, starting with \(d=2\) and \(\tau=0\), and we stack in the same way the \(\Theta_{d,\tau}\) vectors, and, finally, we regroup all the \(X_{d,\tau}\) matrices in a block diagonal matrix, we have a new model \(\tilde{Y} = \tilde{X}\tilde{\Theta} + \tilde{\epsilon}\). The stacked model has \(4 \times 9=36\) parameters while the original model has 16 parameters. For the two models to be identical, it has to be that there exist \(36-16=20\) direct restrictions on the parameters of the stacked model. Using the fact that some parts of the data set are duplicated in the stacked model, we can determine the link between the parameters in the stacked model and the ones in the original model. For example, we know that \(Y_{1,1}=\tilde{\alpha}_{2,0}+\epsilon_{1,1}=\tilde{\alpha}_{2,1}+\epsilon_{1,1}=\tilde{\alpha}_{2,2}+\epsilon_{1,1}=\alpha+\epsilon_{1,1}\). As a consequence, we have \(\tilde{\alpha}_{2,0}=\tilde{\alpha}_{2,1}=\tilde{\alpha}_{2,2}=\alpha\). Using similar sets of restrictions, we can also show that: \(\tilde\delta_{2,0}=\delta_2\), \(\tilde\delta_{2,1}=\delta_3\) and \(\tilde\delta_{2,2}=\delta_4\); \(\tilde{\mu}_{d,\tau}=\mu_d\), \(\forall d,\tau\); \(\tilde{\alpha}_{3,-2}=\tilde{\alpha}_{3,0}=\tilde{\alpha}_{3,1}=\alpha+\delta_2\); \(\tilde{\alpha}_{4,-3}=\tilde{\alpha}_{4,-2}=\tilde{\alpha}_{4,0}=\alpha+\delta_3\); \(\tilde\delta_{3,-2}=-\delta_2\); \(\tilde\delta_{3,0}=\delta_3-\delta_2\); \(\tilde\delta_{3,1}=\delta_4-\delta_2\); \(\tilde\delta_{4,-3}=-\delta_3\); \(\tilde\delta_{4,-2}=\delta_2-\delta_3\); \(\tilde\delta_{4,0}=\delta_4-\delta_3\). We have thus shown that every single parameter in the stacked model can be derived from the parameters in the original model. What is left to check now is that the estimation of the stacked model by OLS abides by the constraints implied by these equalities. In order to complete the proof, we make use of the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of each block. Using the proof of Theorem 4.12 (especially the beginning of the proof, which derives the OLS DID estimator in repeated cross sections of different sizes), we can now show that:

\[\begin{align*} \hat{\tilde{\alpha}}^{OLS}_{d,\tau} & = \bar{Y}^{\infty}_{d-1}\\ \hat{\tilde{\mu}}^{OLS}_{d,\tau} & = \bar{Y}^{d}_{d-1}-\bar{Y}^{\infty}_{d-1}\\ \hat{\tilde{\delta}}^{OLS}_{d,\tau} & = \bar{Y}^{\infty}_{d+\tau}-\bar{Y}^{\infty}_{d-1}\\ \hat{\beta}^{SA}_{d,\tau} & = \bar{Y}^{d}_{d+\tau}-\bar{Y}^{d}_{d-1}-(\bar{Y}^{\infty}_{d+\tau}-\bar{Y}^{\infty}_{d-1}). \end{align*}\]

These results show that all the constraints on the parameters of the stacked model are verified (I leave this as an exercise). The last equality proves the result for the OLS DID model in repeated cross sections. The proof for panel data follows exactly the same lines.
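
The proof leans repeatedly on the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of its blocks. For readers who want to see it at work, here is a tiny numerical illustration (in Python, with arbitrary well-conditioned blocks; purely illustrative):

```python
# A tiny numerical illustration (not from the text) of the linear-algebra fact
# used repeatedly here: the inverse of a block diagonal matrix is the block
# diagonal matrix of the inverses of its blocks. The blocks are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
blocks = [rng.normal(size=(4, 4)) + 4 * np.eye(4) for _ in range(3)]

big = np.zeros((12, 12))                                # assemble the block diagonal matrix
for b, block in enumerate(blocks):
    big[4 * b:4 * (b + 1), 4 * b:4 * (b + 1)] = block

inv_big = np.linalg.inv(big)
for b, block in enumerate(blocks):
    # each diagonal block of the inverse is the inverse of the corresponding block
    assert np.allclose(inv_big[4 * b:4 * (b + 1), 4 * b:4 * (b + 1)], np.linalg.inv(block))
assert np.allclose(inv_big @ big, np.eye(12))           # off-diagonal blocks are zero
print("block diagonal inverse property verified")
```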

Let us now turn to the First Difference estimator in panel data. The First Difference transformation of Sun and Abraham model which uses \(d-1\) as the benchmark period can be written as follows (for \(\tau\neq-1\)):

\[\begin{align*} Y_{i,d+\tau} - Y_{i,d-1} & = \alpha_{d,\tau}^{FD} + \beta_{d,\tau}^{FD}\uns{D_i=d} + \Delta\epsilon^{FD}_{i,d+\tau}, \end{align*}\]

with:

\[\begin{align*} \alpha_{d,\tau}^{FD} & = \delta_{d+\tau} - \delta_{d-1}\\ \beta_{d,\tau}^{FD} & = \beta_{d,\tau}^{SA}\\ \Delta\epsilon^{FD}_{i,d+\tau} & = \epsilon^{SA}_{i,d+\tau}-\epsilon^{SA}_{i,d-1}. \end{align*}\]

Using the same trick as for the model in repeated cross sections, we can rewrite this model as a stacked model with a block diagonal matrix of covariates. Here is the block corresponding to the estimation of \(\beta^{SA}_{2,0}\):

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1,2}-Y_{1,1} \\ \vdots \\ Y_{N^{\infty},2}- Y_{N^{\infty},1} \\Y_{N^{\infty}+1,2}-Y_{N^{\infty}+1,1} \\ \vdots \\ Y_{N^{\infty}+N^{2},2} -Y_{N^{\infty}+N^{2},1} \end{array}\right)}_{\Delta Y_{2,0}} & = \underbrace{\left(\begin{array}{cccc} 1 & 0 \\ \vdots & \vdots\\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots\\ 1 & 1 \end{array}\right)}_{\Delta X_{2,0}} \underbrace{\left(\begin{array}{c} \alpha^{FD}_{2,0} \\ \beta^{FD}_{2,0} \end{array}\right)}_{\Theta^{FD}_{2,0}} + \underbrace{\left(\begin{array}{c} \Delta\epsilon^{FD}_{1,2} \\ \vdots \\ \Delta\epsilon^{FD}_{N^{\infty},2} \\\Delta\epsilon^{FD}_{N^{\infty}+1,2} \\ \vdots \\ \Delta\epsilon^{FD}_{N^{\infty}+N^{2},2} \end{array}\right)}_{\Delta\epsilon^{FD}_{2,0}} \end{align*}\]

Stacking all the vectors of outcomes, the vectors of coefficients and the vectors of residuals on top of each other, and organizing the matrices of covariates in a block diagonal matrix, we obtain the stacked Sun and Abraham model in first differences: \(\Delta Y = \Delta X\Theta^{FD} + \Delta\epsilon^{FD}\). Using the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of each block and the proof of Lemma A.3, we can show that \(\hat\beta^{FD}_{d,\tau}\) is the with/without estimator applied to \(Y_{i,d+\tau} - Y_{i,d-1}\). The result follows.
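
A with/without (difference-in-means) estimator is exactly what a regression of the first-differenced outcome on a constant and the treatment dummy delivers. Here is a minimal panel sketch (in Python, with made-up group sizes and effects; illustrative only) checking this for \(\hat\beta^{FD}_{2,0}\):

```python
# A minimal panel sketch (illustrative, with made-up numbers): the first-difference
# coefficient on the treatment dummy equals the with/without comparison of
# Y_{i,2} - Y_{i,1} between group D_i = 2 and the never treated.
import numpy as np

rng = np.random.default_rng(5)
n2, ninf = 300, 500                                     # arbitrary group sizes
D = np.concatenate([np.full(n2, 2.0), np.full(ninf, np.inf)])
Y1 = rng.normal(0.0 + 0.3 * (D == 2), 1.0)              # period 1 outcomes
Y2 = rng.normal(0.5 + 0.3 * (D == 2) + 1.0 * (D == 2), 1.0)   # period 2, treatment effect = 1
dY = Y2 - Y1

X = np.column_stack([np.ones_like(dY), (D == 2).astype(float)])
beta_fd = np.linalg.lstsq(X, dY, rcond=None)[0][1]
ww = dY[D == 2].mean() - dY[D == np.inf].mean()         # with/without estimator on the changes
print(np.isclose(beta_fd, ww))                          # True
```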

Let us now study the Within estimator of Sun and Abraham model in panel data. The within mean of Sun and Abraham model depends on the group the observation belongs to. For \(i\) such that \(D_i=d\), we have:

\[\begin{align*} \underbrace{\frac{1}{T}\sum_{t=1}^TY_{i,t}}_{\bar{Y}_{i,.}} = \underbrace{\frac{1}{T}\sum_{t=1}^T\delta_{t}}_{\bar{\delta}}+ \underbrace{\frac{1}{T}\sum_{\tau\neq-1}^T\beta_{d,\tau}^{SA}}_{\bar{\beta}_d} \uns{D_i=d}+\underbrace{\frac{1}{T}\sum_{t=1}^T\epsilon_{i,t}}_{\bar{\epsilon}_{i,.}}. \end{align*}\]

As a consequence, for \(i\) such that \(D_i=d\) or \(D_i=\infty\), we can write the within transformation of Sun and Abraham model as follows:

\[\begin{align*} Y_{i,d+\tau}-\bar{Y}_{i,.} & = \underbrace{\delta_{d-1}-\bar{\delta}}_{\alpha_d^{FE}}+\underbrace{(\delta_{d+\tau}-\delta_{d-1})}_{\delta_{d+\tau}^{FE}}\uns{T_i=d+\tau}\underbrace{-\bar{\beta}_d}_{\mu_d^{FE}}\uns{D_i=d} +\beta_{d,\tau}^{SA}\uns{D_i=d}\uns{T_i=d+\tau}+\epsilon_{i,t}-\bar{\epsilon}_{i,.}. \end{align*}\]

The within transformation of Sun and Abraham model is thus equivalent to the OLS DID model applied to the within-transformed outcomes \(Y_{i,d+\tau}-\bar{Y}_{i,.}\). Building a stacked model of the within-transformed Sun and Abraham model and using the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of each block along with Theorem 4.10 proves that:

\[\begin{align*} \hat\beta_{d,\tau}^{SA} & = \frac{\sum_{i=1}^{N}(Y_{i,d+\tau}-\bar{Y}_{i,.}-(Y_{i,d-1}-\bar{Y}_{i,.}))\uns{D_i=d}}{\sum_{i=1}^{N}\uns{D_i=d}} \\ & \phantom{=} - \frac{\sum_{i=1}^{N}(Y_{i,d+\tau}-\bar{Y}_{i,.}-(Y_{i,d-1}-\bar{Y}_{i,.}))\uns{D_i=\infty}}{\sum_{i=1}^{N}\uns{D_i=\infty}}\\ & = \frac{\sum_{i=1}^{N}(Y_{i,d+\tau}-Y_{i,d-1})\uns{D_i=d}}{\sum_{i=1}^{N}\uns{D_i=d}} \\ & \phantom{=} - \frac{\sum_{i=1}^{N}(Y_{i,d+\tau}-Y_{i,d-1})\uns{D_i=\infty}}{\sum_{i=1}^{N}\uns{D_i=\infty}}, \end{align*}\]

which proves the result.
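
Here is a minimal balanced-panel sketch (in Python, with made-up group sizes and effects; one standard implementation of the within estimator that demeans the outcome and the regressors within each unit, not necessarily the author's code) checking that the within estimation of the saturated model returns the DID of first differences for \(\hat\beta^{SA}_{2,0}\):

```python
# A minimal balanced-panel sketch (illustrative, with made-up numbers, and one
# standard implementation of the within estimator rather than the author's code):
# demeaning the outcome and the regressors within each unit and running OLS
# returns the DID of first differences for beta_{2,0}, as stated above.
import numpy as np

rng = np.random.default_rng(6)
sizes = {2: 150, 3: 200, 4: 120, np.inf: 400}           # units per treatment group
T = 4

g_l, t_l, y_l, i_l, unit = [], [], [], [], 0
for g, n in sizes.items():
    for _ in range(n):
        fe = rng.normal()                               # individual fixed effect
        for t in range(1, T + 1):
            y = fe + 0.2 * t + (1.0 if t >= g else 0.0) + rng.normal()
            g_l.append(g); t_l.append(t); y_l.append(y); i_l.append(unit)
        unit += 1
G, Tt, Y, I = map(np.array, (g_l, t_l, y_l, i_l))

def demean(v):                                          # within-unit demeaning
    out = v.astype(float)
    for i in np.unique(I):
        out[I == i] -= out[I == i].mean()
    return out

cols, names = [], []
for t in [2, 3, 4]:
    cols.append(demean((Tt == t).astype(float))); names.append(f"delta_{t}")
for d in [2, 3, 4]:
    for t in range(1, T + 1):
        if t != d - 1:
            cols.append(demean(((G == d) & (Tt == t)).astype(float)))
            names.append(f"beta_{d},{t - d}")
theta = dict(zip(names, np.linalg.lstsq(np.column_stack(cols), demean(Y), rcond=None)[0]))

def cell(d, t):
    return Y[(G == d) & (Tt == t)].mean()

did_2_0 = cell(2, 2) - cell(2, 1) - (cell(np.inf, 2) - cell(np.inf, 1))
print(np.isclose(theta["beta_2,0"], did_2_0))           # True
```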

Let us finally look at the Least Squares Dummy Variables estimator. Let's denote \(X_{\mu}\) the matrix of individual dummies in the LSDV estimator. We are going to apply Theorem A.1, i.e. the Frisch-Waugh-Lovell Theorem, partialling out these individual dummies from the list of regressors. First, we have \((X'_{\mu}X_{\mu})^{-1}=\frac{1}{T}I_N\) where \(I_N\) is the identity matrix of dimension \(N\) and \(T\) is the total number of time periods in the panel. Second, we have that \(M_{\mu}Y=Y-X_{\mu}(X'_{\mu}X_{\mu})^{-1}X'_{\mu}Y=\left(\dots,Y_{i,t}-\bar{Y}_{i,.},\dots\right)\), where \(M_{\mu}\) is the annihilator matrix associated with \(X_{\mu}\). For the time fixed effects, \(M_{\mu}X_{-\mu,T}\) is a matrix with entries equal to \(1-\frac{1}{T}\) where \(T_{i,t}=1\) and \(-\frac{1}{T}\) otherwise. For the interactive treatment dummies, \(M_{\mu}X_{-\mu,DT}\) is a matrix with entries equal to \(D^d_i(1-\frac{1}{T})\) where \(D^d_i\) appeared in the original \(X_{-\mu,DT}\) matrix (the last 9 columns of the \(X\) matrix) and \(-\frac{D^d_i}{T}\) otherwise. As a consequence of Theorem A.1, we can rewrite the LSDV model as follows:

\[\begin{align*} Y_{i,t} - \bar{Y}_{i,.} & = \delta_t-\bar\delta-\sum_d\bar\beta^{SA}_d\uns{D_i=d}+\sum_d\sum_{\tau\neq-1}\beta^{SA}_{d,\tau}\uns{D_i=d}\uns{t=d+\tau}+\epsilon^{LSDV}_{i,t} - \bar{\epsilon}^{LSDV}_{i,.}. \end{align*}\]

This is the same formula as the one we have uncovered in the within transformation we have studied just above. Using the same approach proves the result.
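
Here is a minimal sketch (in Python, with made-up data; illustrative only) of the Frisch-Waugh-Lovell logic used above: including the individual dummies explicitly (LSDV) or partialling them out by within-unit demeaning yields exactly the same coefficients on the time dummies and on the treatment interactions.

```python
# A minimal sketch (illustrative, made-up data) of the Frisch-Waugh-Lovell logic
# used above: including the individual dummies explicitly (LSDV) or partialling
# them out by within-unit demeaning gives exactly the same coefficients on the
# time dummies and on the treatment interactions.
import numpy as np

rng = np.random.default_rng(7)
n_units, T = 60, 4
unit = np.repeat(np.arange(n_units), T)
time = np.tile(np.arange(1, T + 1), n_units)
group_of_unit = np.concatenate([np.full(12, 2.0), np.full(12, 3.0),
                                np.full(12, 4.0), np.full(24, np.inf)])
group = np.repeat(group_of_unit, T)
Y = (np.repeat(rng.normal(size=n_units), T)             # individual fixed effects
     + 0.2 * time + (time >= group) + rng.normal(size=n_units * T))

cols = [(time == t).astype(float) for t in [2, 3, 4]]   # time dummies
for d in [2, 3, 4]:
    for t in range(1, T + 1):
        if t != d - 1:                                  # treatment interactions
            cols.append(((group == d) & (time == t)).astype(float))
W = np.column_stack(cols)

# LSDV: individual dummies included explicitly
lsdv = np.linalg.lstsq(np.column_stack([np.eye(n_units)[unit], W]), Y, rcond=None)[0][n_units:]

# Within: demean Y and W within unit, then run OLS
def demean(v):
    means = np.array([v[unit == i].mean(axis=0) for i in range(n_units)])
    return v - means[unit]
within = np.linalg.lstsq(demean(W), demean(Y), rcond=None)[0]
print(np.allclose(lsdv, within))                        # True, by the FWL theorem
```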

A.3.5 Proof of Theorem 4.24

Using the beginning of the proof of Lemma A.4, we know that: \(\sqrt{N}(\hat{\tilde\Theta}_{OLS}-\tilde\Theta)=N(\tilde{X}'\tilde{X})^{-1}\frac{\sqrt{N}}{N}\tilde{X}'\tilde{\epsilon}\). Using Slutsky’s Theorem, we know that we can study both terms separately (see the same proof of Lemma A.4). \(N(\tilde{X}'\tilde{X})^{-1}\) can be derived rather directly from the fact that Sun and Abraham model can be written as a block diagonal matrix, as shown in the proof of Theorem 4.20. The most difficult part is going to be to derive the distribution of \(\frac{\sqrt{N}}{N}\tilde{X}'\tilde{\epsilon}\).

Let us start with \(N(\tilde{X}'\tilde{X})^{-1}\). Let us define \(N_{t,d}\) the number of observations observed in group \(d\) at period \(t\). We also define \(N^B_{d,\tau}=N_{d-1,\infty}+N_{d-1,d}\) the number of observations used to estimate \(\hat{\beta}^{SA}_{d,\tau}\) that are observed in the reference (or before) period and \(N^A_{d,\tau}=N_{d+\tau,\infty}+N_{d+\tau,d}\) the number of observations used to estimate \(\hat{\beta}^{SA}_{d,\tau}\) that are observed in the after period. We also define \(N^{SA}_{d,\tau}=N^A_{d,\tau}+N^B_{d,\tau}\), the number of observations used to estimate \(\hat{\beta}^{SA}_{d,\tau}\). We also define \(\bar{T}_A^{d,\tau}=\frac{N^A_{d,\tau}}{N^{SA}_{d,\tau}}\), \(k^{d,\tau}=\frac{N^B_{d,\tau}}{N^{A}_{d,\tau}}=\frac{1-\bar{T}_A^{d,\tau}}{\bar{T}_A^{d,\tau}}\). We let \(\bar{D}_A^{d,\tau}\) and \(\bar{D}_B^{d,\tau}\) denote the proportion of treated observations in the after and before periods used to estimate \(\hat{\beta}^{SA}_{d,\tau}\). We also define \(\bar{P}^{d,\tau}=\frac{N^{SA}_{d,\tau}}{N}\) the proportion of observations used to estimate \(\hat{\beta}^{SA}_{d,\tau}\). We also have: \(\text{plim}\bar{P}^{d,\tau}=p^{d,\tau}\), \(\text{plim}\bar{D}_A^{d,\tau}=\text{plim}\bar{D}_B^{d,\tau}=p^{d,\tau}_D\) and \(\text{plim}\bar{T}_A^{d,\tau}=p^{d,\tau}_A\). \(p^{d,\tau}=\Pr(D^{d,\tau}_i=1)\), with \(D^{d,\tau}_i=\uns{(D_i=d\lor D_i=\infty)\land(T_i=d-1\lor T_i=d+\tau)}\) a dummy indicating that a unit in the population belongs to the set of units used to identify \(\beta^{SA}_{d,\tau}\). \(p^{d,\tau}_D=\Pr(D_i=d|D^{d,\tau}_i=1)\) is the proportion of treated units among the set of units used to identify \(\beta^{SA}_{d,\tau}\). \(p^{d,\tau}_A=\Pr(T_i=d+\tau|D^{d,\tau}_i=1)\) is the proportion of units belonging to the after period among the set of units used to identify \(\beta^{SA}_{d,\tau}\). Finally, let \((X'X)^{-1}_{d,\tau}\) denote the block of the matrix \((\tilde{X}'\tilde{X})^{-1}\) which is used to estimate \(\hat{\beta}^{SA}_{d,\tau}\). With all these definitions, we can now follow the proof of Theorem 4.12 in order to derive the following result:

\[\begin{align*} \sigma_{\tilde{X}\tilde{X}^{-1}}^{d,\tau} &= \text{plim}N(X'X)^{-1}_{d,\tau} = \frac{1}{p^{d,\tau}(1-p^{d,\tau}_A)(1-p^{d,\tau}_D)} \left(\begin{array}{cccc} 1 & -1 & -1 & 1\\ -1 & \frac{1}{p^{d,\tau}_D} & 1 & -\frac{1}{p^{d,\tau}_D} \\ -1 & 1 & \frac{1}{p^{d,\tau}_A} & -\frac{1}{p^{d,\tau}_A} \\ 1 & -\frac{1}{p^{d,\tau}_D} & -\frac{1}{p^{d,\tau}_A} & \frac{1}{p^{d,\tau}_Dp^{d,\tau}_A} \end{array}\right) \end{align*}\]

Using the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of each block, we now know that \(\text{plim}N(\tilde{X}'\tilde{X})^{-1}=\sigma_{\tilde{X}\tilde{X}}^{-1}\) is a block diagonal matrix with blocks equal to \(\sigma_{\tilde{X}\tilde{X}^{-1}}^{d,\tau}\).

Let us now turn to \(\frac{\sqrt{N}}{N}\tilde{X}'\tilde{\epsilon}\). In order to derive its distribution, we have to write Sun and Abraham model in a repeated cross section with the observations grouped by blocks corresponding to the parameters they help to estimate. This model can be written as:

\[\begin{align*} Y_{j}D^{d,\tau}_j & = \tilde\alpha_{d,\tau}D^{d,\tau}_j + \tilde\mu_{d,\tau}\uns{D_{j}=d}D^{d,\tau}_j + \tilde\delta_{d,\tau}\uns{T_j=d+\tau}D^{d,\tau}_j \\ & \phantom{=} + \beta_{d,\tau}^{SA}\uns{D_{j}=d}\uns{T_j=d+\tau}D^{d,\tau}_j + \tilde\epsilon_{j}D^{d,\tau}_j, \end{align*}\]

with:

\[\begin{align*} \tilde\epsilon_{j} & = Y_{j}-\left(\esp{Y^0_{j}|D_j=\infty,T_j=d-1}\right.\\ & \phantom{=Y_{j}-\left(\right.}+\uns{D_j=d}(\esp{Y^0_{j}|D_j=d,T_j=d-1}-\esp{Y^0_{j}|D_j=\infty,T_j=d-1})\\ & \phantom{=Y_{j}-\left(\right.}+\uns{T_j=d+\tau}(\esp{Y^0_{j}|D_j=\infty,T_j=d+\tau}-\esp{Y^0_{j}|D_j=\infty,T_j=d-1})\\ & \phantom{=Y_{j}-\left(\right.}+\uns{D_{j}=d}\uns{T_j=d+\tau}(\esp{Y^1_{j}|D_j=d,T_j=d+\tau}-\esp{Y^0_{j}|D_j=d,T_j=d-1}\\ & \phantom{=Y_{j}-\left(\right.+\uns{D_{j}=d}\uns{T_j=d+\tau}}\left.-(\esp{Y^0_{j}|D_j=\infty,T_j=d+\tau}-\esp{Y^0_{j}|D_j=\infty,T_j=d-1}))\right). \end{align*}\]

It is pretty straightforward to prove that \(\esp{\tilde\epsilon_{j}D^{d,\tau}_j}=0\). For that, note that \(D^{d,\tau}_j=f(D_j,T_j)\) so that conditioning on \(D^{d,\tau}_j\) is irrelevant when also conditioning on \((D_j,T_j)\). Then, check that \(\esp{\tilde\epsilon_{j}D^{d,\tau}_j\uns{D_{j}=d}\uns{T_j=d+\tau}}=0\). The same thing holds for \(\esp{\tilde\epsilon_{j}D^{d,\tau}_j\uns{T_j=d+\tau}}=0\) and for \(\esp{\tilde\epsilon_{j}D^{d,\tau}_j\uns{D_{j}=d}}=0\), which proves the result.

It can also be shown that \(\esp{\tilde\epsilon_{j}D^{d,\tau}_jD^{d',\tau'}_j}=0\), with either \(d\neq d'\) or \(\tau\neq \tau'\) or both. If \(D^{d,\tau}_jD^{d',\tau'}_j=0\) for all possible values of \((D_j,T_j)\) (the two estimation samples do not overlap), the term is trivially zero. If the two samples do overlap, then, by definition of \(D^{d,\tau}_j\), \(\esp{\tilde\epsilon_{j}D^{d,\tau}_jD^{d',\tau'}_j}\) can only involve the following terms: \(\esp{\tilde\epsilon_{j}|D_j=d,T_j=d+\tau}\), \(\esp{\tilde\epsilon_{j}|D_j=d,T_j=d-1}\), \(\esp{\tilde\epsilon_{j}|D_j=\infty,T_j=d+\tau}\) and \(\esp{\tilde\epsilon_{j}|D_j=\infty,T_j=d-1}\), and all of these terms are equal to zero.

Using the vector version of the Central Limit Theorem that we have already used in the proof of Theorem 4.12, we thus have that \(\frac{\sqrt{N}}{N}\tilde{X}'\tilde{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{V_{\tilde{x}\tilde{\epsilon}}})\). Using the Delta Method, we have that \(\sqrt{N}(\hat{\Theta}_{OLS}-\Theta)\stackrel{d}{\rightarrow}\mathcal{N}\left(\mathbf{0},\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\right)\). In order to prove the result, we simply need to derive the fourth term on the diagonal of each \(4\times 4\) block of \(\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\). Since \(\sigma_{\tilde{X}\tilde{X}}^{-1}\) is block diagonal, the \(4\times 4\) blocks of \(\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\) are equal to \(\sigma_{\tilde{X}\tilde{X}^{-1}}^{d,\tau}\mathbf{V}^{d,\tau}_{\mathbf{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}^{-1}}^{d,\tau}\), with (following the proof of Theorem 4.12):

\[\begin{align*} \mathbf{V}^{d,\tau}_{\mathbf{\tilde{x}\tilde{\epsilon}}} & = \esp{\epsilon_j^2D^{d,\tau}_j\left(\begin{array}{cccc} 1 & D^d_j & T^{d,\tau}_j & D^d_jT^{d,\tau}_j \\ D^d_j & D^d_j & D^d_jT^{d,\tau}_j & D^d_jT^{d,\tau}_j \\ T^{d,\tau}_j & D^d_jT^{d,\tau}_j & T^{d,\tau}_j & D^d_jT^{d,\tau}_j \\ D^d_jT^{d,\tau}_j & D^d_jT^{d,\tau}_j & D^d_jT^{d,\tau}_j & D^d_jT^{d,\tau}_j \end{array}\right)}, \end{align*}\]

with \(D^d_j=\uns{D_{j}=d}\) and \(T_j^{d,\tau}=\uns{T_j=d+\tau}D^{d,\tau}_j\). Following the proof of Theorem 4.12 proves the result.

A.3.6 Proof of Theorem 4.26

The key to the proof lies in the covariance terms. The covariance terms come from the off-\(4\times 4\)-block-diagonal elements of the \(\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\) matrix. They are due to the fact that the same data are used repeatedly to estimate the \(\beta^{SA}_{d,\tau}\) parameters. For example, the period-1 observations from the never treated group are used as reference-period benchmarks for the estimation of \(\beta^{SA}_{2,0}\), \(\beta^{SA}_{2,1}\) and \(\beta^{SA}_{2,2}\), and the same observations serve as the event-time observations for the estimation of \(\beta^{SA}_{3,-2}\) and \(\beta^{SA}_{4,-3}\).
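
Since each \(\hat\beta^{SA}_{d,\tau}\) is a DID of cell means (as shown in the proof of Theorem 4.20), the covariance between \(\hat\beta^{SA}_{2,0}\) and \(\hat\beta^{SA}_{2,1}\), which share the period-1 sample means of group 2 and of the never treated (entering both estimators with the same signs), is simply the sum of the variances of these two shared means. Here is a rough Monte Carlo sketch (in Python, with made-up cell sizes and standard deviations; illustrative only) of this mechanism:

```python
# A rough Monte Carlo sketch (illustrative, with made-up cell sizes and standard
# deviations) of the source of these covariances: beta-hat_{2,0} and beta-hat_{2,1}
# share the period-1 sample means of group 2 and of the never treated, so their
# covariance equals the sum of the variances of these two shared means.
import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 20_000                      # observations per group-period cell
sd = {("inf", 1): 1.0, ("inf", 2): 1.2, ("inf", 3): 0.9,
      ("2", 1): 1.5, ("2", 2): 1.1, ("2", 3): 1.3}

b20, b21 = [], []
for _ in range(reps):
    m = {k: rng.normal(0, s, n).mean() for k, s in sd.items()}   # cell sample means
    b20.append(m[("2", 2)] - m[("2", 1)] - (m[("inf", 2)] - m[("inf", 1)]))
    b21.append(m[("2", 3)] - m[("2", 1)] - (m[("inf", 3)] - m[("inf", 1)]))

mc_cov = np.cov(b20, b21)[0, 1]
theory = sd[("2", 1)]**2 / n + sd[("inf", 1)]**2 / n     # Var(Ybar_{2,1}) + Var(Ybar_{inf,1})
print(mc_cov, theory)                                    # the two numbers should be close
```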

In order to rigorously derive these covariance terms, we need to derive the off-\(4\times 4\)-block-diagonal elements of the \(\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\) matrix in the proof of Theorem 4.24. For two sets of observations used to estimate \(\beta^{SA}_{d,\tau}\) and \(\beta^{SA}_{d',\tau'}\), \(d\neq d'\) or \(\tau\neq\tau'\), we only need the last line of the off-\(4\times 4\)-block-diagonal covariance matrix of \(\sigma_{\tilde{X}\tilde{X}}^{-1}\mathbf{V_{\tilde{x}\tilde{\epsilon}}}\sigma_{\tilde{X}\tilde{X}}^{-1}\). Indeed, using results in the proof of Theorem 4.24, we can show that:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{1}{p^{d,\tau}p^{d',\tau'}(1-p_A^{d,\tau})(1-p_A^{d',\tau'})(1-p_D^{d,\tau})(1-p_D^{d',\tau'})} \\ & \phantom{=} \left(\mathbf{A}_4-\frac{1}{p_D^{d,\tau}}\mathbf{B}_4-\frac{1}{p_A^{d,\tau}}\mathbf{C}_4+\frac{1}{p_A^{d,\tau}p_D^{d,\tau}}\mathbf{D}_4\right), \end{align*}\]

with:

\[\begin{align*} \mathbf{A}_4 & = \esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}} -\frac{1}{p_D^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D_j^{d,\tau}} -\frac{1}{p^{d',\tau'}_A}\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D_j^{d,\tau}} \\ & \phantom{=} + \frac{1}{p_D^{d',\tau'}p_A^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jT^{d',\tau'}_jD_j^{d',\tau'}D_j^{d,\tau}}\\ \mathbf{B}_4 & = \esp{\epsilon_j^2D_j^{d',\tau'}D^{d}_jD_j^{d,\tau}} -\frac{1}{p_D^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D^{d}_jD_j^{d,\tau}} -\frac{1}{p^{d',\tau'}_A}\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D^{d}_jD_j^{d,\tau}} \\ & \phantom{=} + \frac{1}{p_D^{d',\tau'}p_A^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jT^{d',\tau'}_jD_j^{d',\tau'}D^{d}_jD_j^{d,\tau}}\\ \mathbf{C}_4 & = \esp{\epsilon_j^2D_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}} -\frac{1}{p_D^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}} -\frac{1}{p^{d',\tau'}_A}\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}} \\ & \phantom{=} + \frac{1}{p_D^{d',\tau'}p_A^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jT^{d',\tau'}_jD_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}}\\ \mathbf{D}_4 & = \esp{\epsilon_j^2D_j^{d',\tau'}D^{d}_jT^{d,\tau}_jD_j^{d,\tau}} -\frac{1}{p_D^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D^{d}_jT^{d,\tau}_jD_j^{d,\tau}} -\frac{1}{p^{d',\tau'}_A}\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D^{d}_jT^{d,\tau}_jD_j^{d,\tau}} \\ & \phantom{=} + \frac{1}{p_D^{d',\tau'}p_A^{d',\tau'}}\esp{\epsilon_j^2D^{d'}_jT^{d',\tau'}_jD_j^{d',\tau'}D^{d}_jT^{d,\tau}_jD_j^{d,\tau}} \end{align*}\]

Let us start with \(\mathbf{A}_4\)’s first term, \(\esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}}\). Note first that if \(d=d'\), we have to have \(\tau\neq\tau'\), otherwise the term would be on the diagonal. As a consequence, with \(d=d'\), we can have only two configurations for which \(D_j^{d',\tau'}D_j^{d,\tau}=1\), and they both correspond to a baseline observation (\(T_j=d-1\)) either for the control group (\(D_j=\infty\)) or for the treated group (\(D_j=d\)). In that case, we thus have:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}} & = \left(\var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})\right.\\ & \phantom{=}\left.+\var{Y^0_{i,d-1}|D_i=d}p_D^{d,\tau,d',\tau'}\right)p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}, \end{align*}\]

with \(p_D^{d,\tau,d',\tau'}\) the proportion of treated individuals among the observations such that \(D_j^{d',\tau'}D_j^{d,\tau}=1\), \(p_{d-1}^{d,\tau,d',\tau'}\) the proportion of observations observed in period \(d-1\) among the observations such that \(D_j^{d',\tau'}D_j^{d,\tau}=1\), and \(p^{d,\tau,d',\tau'}\) the proportion of observations such that \(D_j^{d',\tau'}D_j^{d,\tau}=1\).

When \(d\neq d'\), we have three possible cases: \(d-1=d'+\tau'\), \(d'-1=d+\tau\) or \(d'+\tau'=d+\tau\). Because the treated groups are different in that case, the only possible correspondence in these cases is due to the control group, with the After period for one treated group being the Before period for another treated group or the After periods being the same for both groups. If \(d-1=d'+\tau'\), we have:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}} & = \var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}, \end{align*}\]

where \(p_{d-1}^{d,\tau,d',\tau'}\) is the proportion of observations observed in period \(d-1\) among the observations such that \(D_j^{d',\tau'}D_j^{d,\tau}=1\).

If \(d'-1=d+\tau\), we have:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}} & = \var{Y^0_{i,d'-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d'-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

Finally, if \(d'+\tau'=d+\tau\), we have:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}D_j^{d,\tau}} & = \var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d+\tau}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

Let us now look at the next term: \(\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D_j^{d,\tau}}\). Here, there is only one case that yields a non zero term, when \(d=d'\) (all the other configurations involve only the untreated group and thus have \(D^{d'}_j=0\)). In that case, we have:

\[\begin{align*} \esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D_j^{d,\tau}} & = \var{Y^0_{i,d'-1}|D_i=d'}p_D^{d,\tau,d',\tau'}p_{d'-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

Let us now look at the next term: \(\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D_j^{d,\tau}}\). Here, there are two cases that yield a non zero term, when \(d-1=d'+\tau'\) and when \(d'+\tau'=d+\tau\) (all the other configurations involve only the Before period and thus have \(T^{d',\tau'}_j=0\)). In both cases, we have:

\[\begin{align*} \esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D_j^{d,\tau}} & = \var{Y^0_{i,d'+\tau'}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d'+\tau'}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

Let us now look at the next term: \(\esp{\epsilon_j^2D^{d'}_jT^{d',\tau'}_jD_j^{d',\tau'}D_j^{d,\tau}}\). This term is equal to zero since there is no treated observation observed in a post-treatment period such that \(D_j^{d',\tau'}D_j^{d,\tau}=1\).

Let us now move on to \(\mathbf{B}_4\). For \(\esp{\epsilon_j^2D_j^{d',\tau'}D^{d}_jD_j^{d,\tau}}\) to be non zero, we have to have \(d=d'\) (otherwise, \(D_j^d=0\)). The only nonzero case corresponds to a baseline observation for the treated group:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}D^{d}_jD_j^{d,\tau}} & = \var{Y^0_{i,d-1}|D_i=d}p_D^{d,\tau,d',\tau'}p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

The next term (\(\esp{\epsilon_j^2D^{d'}_jD_j^{d',\tau'}D^{d}_jD_j^{d,\tau}}\)) is the same since, with \(d=d'\), \(D_j^d=D_j^{d'}\). The last two terms of \(\mathbf{B}_4\) are null everywhere. This is because it can only be that \(d=d'\) (since \(D_j^d=1\)) and it cannot be a baseline observation (since \(T^{d',\tau'}_j=1\)). Since treated observations appear only once for each \(d\), \(\tau\neq\tau'\Rightarrow\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}D^{d}_jD_j^{d,\tau}}=0\).

Let us now move on to \(\mathbf{C}_4\). For the first term, \(\esp{\epsilon_j^2D_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}}\), we cannot have nonzero terms when \(d=d'\): that case involves only baseline-period observations, which contradicts \(T^{d,\tau}_j=1\), so the term is null in that case. The only possible nonzero cases involve \(d'-1=d+\tau\) or \(d'+\tau'=d+\tau\). In these cases, we have:

\[\begin{align*} \esp{\epsilon_j^2D_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}} & = \var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d+\tau}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

The next term in \(\mathbf{C}_4\) is zero everywhere since it involves the same term as above plus the additional requirement that \(D^{d'}_j=1\). Since this entails that observations have to belong to a treatment group, and the previous term only includes terms from the control group, this term has to be zero everywhere.

The term \(\esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}}\) is non zero only when \(d'+\tau'=d+\tau\) (it is the same as the first term in \(\mathbf{C}_4\) with the added constraint that \(T^{d',\tau'}_j=1\)). We thus have: \[\begin{align*} \esp{\epsilon_j^2T^{d',\tau'}_jD_j^{d',\tau'}T^{d,\tau}_jD_j^{d,\tau}} & = \var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d+\tau}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}. \end{align*}\]

The last term in \(\mathbf{C}_4\) is zero everywhere since it is a subset of the second term, which is already zero.

Finally, all the terms in \(\mathbf{D}_4\) are equal to zero. This is because \(T^{d,\tau}_jD^{d}_j=1\) implies that we cannot have \(d=d'\) (because the only nonzero terms would then be the ones in the baseline period, which contradicts the fact that \(T^{d,\tau}_j=1\)). The remaining potential nonzero configurations only concern the control group, which runs counter to \(D^{d}_j=1\). Hence the result.

Collecting terms, we now have, when \(d=d'\):

\[\begin{align*} \mathbf{A}_4 & = \var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}\\ & \phantom{=}+\var{Y^0_{i,d-1}|D_i=d}p_D^{d,\tau,d',\tau'}p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}(1-\frac{1}{p_D^{d',\tau'}})\\ \mathbf{B}_4 & = \var{Y^0_{i,d-1}|D_i=d}p_D^{d,\tau,d',\tau'}p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}(1-\frac{1}{p_D^{d',\tau'}})\\ \mathbf{C}_4 & = 0 \\ \mathbf{D}_4 & = 0, \end{align*}\]

and thus:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{p^{d,\tau,d',\tau'}p_{d-1}^{d,\tau,d',\tau'}}{p^{d,\tau}p^{d',\tau'}} \left(\frac{\var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})}{(1-p_A^{d,\tau})(1-p_A^{d',\tau'})(1-p_D^{d,\tau})(1-p_D^{d',\tau'})} \right.\\ & \phantom{=}\left.+\frac{\var{Y^0_{i,d-1}|D_i=d}p_D^{d,\tau,d',\tau'}}{(1-p_A^{d,\tau})(1-p_A^{d',\tau'})p_D^{d,\tau}p_D^{d',\tau'}}\right). \end{align*}\]

Alternatively, when \(d+\tau=d'+\tau'\), we have:

\[\begin{align*} \mathbf{A}_4 & = \var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d+\tau}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}(1-\frac{1}{p^{d',\tau'}_A}) \\ \mathbf{B}_4 & = 0\\ \mathbf{C}_4 & = \var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d+\tau}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}(1-\frac{1}{p^{d',\tau'}_A})\\ \mathbf{D}_4 & = 0 \end{align*}\]

and thus:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{p^{d,\tau,d',\tau'}p_{d+\tau}^{d,\tau,d',\tau'}\var{Y^0_{i,d+\tau}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})}{p^{d,\tau}p^{d',\tau'}p_A^{d,\tau}p_A^{d',\tau'}(1-p_D^{d,\tau})(1-p_D^{d',\tau'})} \end{align*}\]

Finally, when \(d-1=d'+\tau'\), we have:

\[\begin{align*} \mathbf{A}_4 & = \var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'}(1-\frac{1}{p^{d',\tau'}_A})\\ \mathbf{B}_4 & = 0\\ \mathbf{C}_4 & = 0\\ \mathbf{D}_4 & = 0 \end{align*}\]

and thus:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = -\frac{p^{d,\tau,d',\tau'}p_{d-1}^{d,\tau,d',\tau'}\var{Y^0_{i,d-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})}{p^{d,\tau}p^{d',\tau'}(1-p_A^{d,\tau})p_A^{d',\tau'}(1-p_D^{d,\tau})(1-p_D^{d',\tau'})}. \end{align*}\]

And when \(d'-1=d+\tau\), we have:

\[\begin{align*} \mathbf{A}_4 & = \var{Y^0_{i,d'-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d'-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'} \\ \mathbf{B}_4 & = 0\\ \mathbf{C}_4 & = \var{Y^0_{i,d'-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})p_{d'-1}^{d,\tau,d',\tau'}p^{d,\tau,d',\tau'} \\ \mathbf{D}_4 & = 0 \end{align*}\]

and thus:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = -\frac{p^{d,\tau,d',\tau'}p_{d'-1}^{d,\tau,d',\tau'}\var{Y^0_{i,d'-1}|D_i=\infty}(1-p_D^{d,\tau,d',\tau'})}{p^{d,\tau}p^{d',\tau'}p_A^{d,\tau}(1-p_A^{d',\tau'})(1-p_D^{d,\tau})(1-p_D^{d',\tau'})} \end{align*}\]

This proves the result.

A.3.7 Proof of Theorem 4.25

The proof follows closely that of Theorem 4.24, except that our stacked model is \(\Delta Y = \Delta X\Theta^{FD} + \Delta\epsilon^{FD}\), as introduced in the proof of Theorem 4.20. Using the beginning of the proof of Lemma A.4, we know that: \(\sqrt{N}(\hat{\Theta}^{FD}-\Theta^{FD})=N(\Delta X'\Delta X)^{-1}\frac{\sqrt{N}}{N}\Delta X'\Delta\epsilon^{FD}\). Using Slutsky’s Theorem, we know that we can study both terms separately (see again the proof of Lemma A.4).

Let us start with \(N(\Delta X'\Delta X)^{-1}\). Let’s denote \(N_{d,\infty}\) the number of observations that are such that \(D_i=d\) or \(D_i=\infty\) and \(p^{d,\infty}=\text{plim}\frac{N_{d,\infty}}{N}\). Let’s also denote \(p^{d,\infty}_D=\Pr(D_i=d|D_i=d\cup D_i=\infty)\). Using the same reasoning as in the proof of Theorem 4.24, and using the proof of Theorem 2.5, we can show that:

\[\begin{align*} \sigma_{\Delta X \Delta X^{-1}}^{d,\tau} &= \plims N(\Delta X'\Delta X)^{-1}_{d,\tau} = \frac{1}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)} \left(\begin{array}{cc} p^{d,\infty}_D & -p^{d,\infty}_D\\ -p^{d,\infty}_D & 1 \end{array}\right) \end{align*}\]

with \(N(\Delta X'\Delta X)^{-1}_{d,\tau}\) the block of the \(N(\Delta X'\Delta X)^{-1}\) matrix that is related to the estimation of \(\hat{\beta}^{SA}_{d,\tau}\), and using the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of each block. The proof then follows the lines of the proof of Theorem 2.5, replacing \(Y_{i}\) by \(Y_{i,d+\tau}-Y_{i,d-1}\). This proves the result.
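As a quick numerical check of this probability limit, here is a minimal sketch (with hypothetical values for \(p^{d,\infty}\) and \(p^{d,\infty}_D\)): it simulates the two columns of the \((d,\tau)\) block of \(\Delta X\) — the block constant and the cohort dummy, both set to zero outside the \(D_i\in\{d,\infty\}\) sub-sample — and compares \(N(\Delta X'\Delta X)^{-1}_{d,\tau}\) with the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
p_group = 0.4   # hypothetical p^{d,infinity}: share of units with D_i = d or D_i = infinity
p_D = 0.3       # hypothetical p^{d,infinity}_D = P(D_i = d | D_i in {d, infinity})

in_group = rng.random(N) < p_group            # units entering the (d, infinity) cross-section
treated = in_group & (rng.random(N) < p_D)    # among them, units with D_i = d

# Two columns of the (d, tau) block of Delta X: the block constant and the cohort dummy D^d_j,
# both equal to zero for units outside the (d, infinity) sub-sample.
X = np.column_stack([in_group, treated]).astype(float)

empirical = N * np.linalg.inv(X.T @ X)
theoretical = np.array([[p_D, -p_D], [-p_D, 1.0]]) / (p_group * p_D * (1 - p_D))
print(np.round(empirical, 2))
print(np.round(theoretical, 2))
```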

A.3.8 Proof of Theorem 4.27

The proof takes where the proof of Theorem 4.25 in Section A.3.7 left. Let us define \(\mathbf{V}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\) the matrix such that \(\frac{\sqrt{N}}{N}\Delta X'\Delta\epsilon^{FD}\distr\mathcal{N}\left(\mathbf{0},\mathbf{V}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\right)\). We also denote \(\mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\) the blocks of the matrix that relates to the estimators of \(\beta^{SA}_{d,\tau}\) and \(\beta^{SA}_{d',\tau'}\). We also denote \(\sigma_{\Delta{X}\Delta{X}^{-1}}=\plims N(\Delta X'\Delta X)^{-1}\). Following the proof of Theorem 4.25, it is a block diagonal matrix, with the block \(\sigma_{\Delta X \Delta X^{-1}}^{d,\tau}\) related to the estimation of \(\beta^{SA}_{d,\tau}\). The key to the distribution of the treatment of the treated parameter is to get the off-diagonal terms of \(\sigma_{\Delta{X}\Delta{X}^{-1}}\mathbf{V}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\sigma_{\Delta{X}\Delta{X}^{-1}}\) right. Let us focus on the parts of this matrix that is related to the estimators of \(\beta^{SA}_{d,\tau}\) and \(\beta^{SA}_{d',\tau'}\), \(\sigma_{\Delta{X}\Delta{X}^{-1}}^{d,\tau,d',\tau'}\mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\sigma_{\Delta{X}\Delta{X}^{-1}}^{d,\tau,d',\tau'}\), where \(\mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\) is the part of the \(\mathbf{V}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\) matrix that relates to the estimators of \(\beta^{SA}_{d,\tau}\) and \(\beta^{SA}_{d',\tau'}\), \(\sigma_{\Delta{X}\Delta{X}^{-1}}^{d,\tau,d',\tau'}\) regroups the blocks of the matrix \(\sigma_{\Delta{X}\Delta{X}^{-1}}\) corresponding to the same estimation, with blocks formed by \(\sigma_{\Delta X \Delta X^{-1}}^{d,\tau}\) and \(\sigma_{\Delta X \Delta X^{-1}}^{d',\tau'}\).

As we did in the proof of Theorem 4.12 in Section A.3.2 and the proof of Theorem 4.24 in Section A.3.5, we are going to write the First Difference model as a pure cross section model:

\[\begin{align*} \Delta Y^{d,\tau}_j & = \alpha^{FD}_{d,\tau}+\beta^{FD}_{d,\tau}D^d_j+\Delta\epsilon^{d,\tau}_j, \end{align*}\]

with \(\Delta Y^{d,\tau}_j=Y_{j,d+\tau}-Y_{j,d-1}\), \(\Delta\epsilon^{d,\tau}_j=\Delta\epsilon^{FD}_{j,d+\tau}=\epsilon^{SA}_{j,d+\tau}-\epsilon^{SA}_{j,d-1}\), \(\alpha^{FD}_{d,\tau}\) and \(\beta^{FD}_{d,\tau}\) as in the proof of Theorem 4.20 in Section A.3.4, and \(D^d_j=\uns{D_j=d}\), as in the proof of Theorem 4.24 in Section A.3.5. Using previous results on the First Difference estimator, we know that:

\[\begin{align*} \Delta\epsilon^{d,\tau}_j & = \Delta Y^{d,\tau}_j \nonumber\\ & \phantom{=}-\left(\esp{\Delta Y^{d,\tau}_j|D_j=\infty}\right.\nonumber\\ & \phantom{=-\left(\right.}\left.+D_j^d(\esp{\Delta Y^{d,\tau}_j|D_j=d}-\esp{\Delta Y^{d,\tau}_j|D_j=\infty})\right) \end{align*}\]

We need to derive the asymptotic distribution of \(\sqrt{N}(\hat{\Theta}^{FD}-\Theta^{FD})=N(\Delta X'\Delta X)^{-1}\frac{\sqrt{N}}{N}\Delta X'\Delta\epsilon^{FD}\), as introduced in the proof of Theorem 4.20. For \(d,\tau,d',\tau'\) with either \(\tau\neq\tau'\) or \(d\neq d'\), we have:

\[\begin{align*} \frac{\sqrt{N}}{N}\Delta X'\Delta\epsilon^{FD} & =\left(\begin{array}{c} \vdots \\ \sqrt{N^{d,\infty}}\sqrt{\bar{P}^{d,\infty}}\frac{1}{N^{d,\infty}}\sum_{j=1}^{N^{d,\infty}}\Delta\epsilon^{d,\tau}_{j}\\ \sqrt{N^{d,\infty}}\sqrt{\bar{P}^{d,\infty}}\frac{1}{N^{d,\infty}}\sum_{j=1}^{N^{d,\infty}}\Delta\epsilon^{d,\tau}_{j}D^{d}_j \\ \sqrt{N^{d',\infty}}\sqrt{\bar{P}^{d',\infty}}\frac{1}{N^{d',\infty}}\sum_{j=1}^{N^{d',\infty}}\Delta\epsilon^{d',\tau'}_{j} \\ \sqrt{N^{d',\infty}}\sqrt{\bar{P}^{d',\infty}}\frac{1}{N^{d',\infty}}\sum_{j=1}^{N^{d',\infty}}\Delta\epsilon^{d',\tau'}_{j}D^{d'}_j\\ \vdots \end{array}\right), \end{align*}\]

with \(N^{d,\infty}\) the number of units in the panel that are either such that \(D_i=d\) or \(D_i=\infty\) and \(\bar{P}^{d,\infty}=\frac{N^{d,\infty}}{N}\) the proportion of units in the panel that are either with \(D_i=d\) or \(D_i=\infty\). We have \(\plims\bar{P}^{d,\infty}=p^{d,\infty}\), the proportion of these units in the population.

Under Assumption 4.28, we have:

\[\begin{align*} \Delta Y^{d,\tau}_j & = \mu_j + \delta_{d+\tau}+D^d_j(\bar\alpha+\eta_{j,d+\tau})+U^0_{j,d+\tau}-(\mu_j + \delta_{d-1}+U^0_{j,d-1})\\ & = \delta_{d+\tau}-\delta_{d-1}+D^d_j(\bar\alpha+\eta_{j,d+\tau})+U^0_{j,d+\tau}-U^0_{j,d-1}. \end{align*}\]

As a consequence, and using Assumption 4.28 again, we have:

\[\begin{align*} \esp{\Delta Y^{d,\tau}_j|D_j=\infty} & = \delta_{d+\tau}-\delta_{d-1} + \esp{U^0_{j,d+\tau}-U^0_{j,d-1}|D_j=\infty} \\ & = \delta_{d+\tau}-\delta_{d-1}\\ \esp{\Delta Y^{d,\tau}_j|D_j=d} & = \delta_{d+\tau}-\delta_{d-1} + \esp{U^0_{j,d+\tau}-U^0_{j,d-1}|D_j=d} \nonumber\\ & \phantom{=} +\bar\alpha+\esp{\eta_{j,d+\tau}|D_j=d}\\ & = \delta_{d+\tau}-\delta_{d-1}+\bar\alpha+\esp{\eta_{j,d+\tau}|D_j=d} \end{align*}\]

As a consequence, we have:

\[\begin{align*} \Delta\epsilon^{d,\tau}_j & =U^0_{j,d+\tau}-U^0_{j,d-1} + D_j^d(\eta_{j,d+\tau}-\esp{\eta_{j,d+\tau}|D_j=d}). \end{align*}\]

Under Assumption 4.28, we have that \(\esp{\Delta\epsilon^{d,\tau}_{j}D^{d}_j}=0\) and \(\esp{\Delta\epsilon^{d,\tau}_{j}}=0\), which is a first step for using the vector CLT. For a given \((d,\tau)\), Assumption 4.28 also implies that the \(\Delta\epsilon^{d,\tau}_j\) are independent over \(j\), since \(U^0_{j,d+\tau}\), \(U^0_{j,d-1}\) and \(\eta_{j,d+\tau}\) are all independent across \(j\). So we can use the vector CLT for heteroskedastic variables. Using the vector CLT, Slutsky’s Theorem and the Delta Method, we have that \(\frac{\sqrt{N}}{N}\Delta X'\Delta\epsilon^{FD}\distr\mathcal{N}\left(\mathbf{0},\mathbf{V}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\right)\), with (using the fact that \((D_j^d)^2=D_j^d\)):

\[\begin{align*} \mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}} & = \esp{\left(\begin{array}{cc} p^{d,\infty}(\Delta\epsilon^{d,\tau}_j)^2 \left(\begin{array}{cc} 1 & D^d_j \\ D^d_j & D^d_j \end{array}\right) & \sqrt{p^{d,\infty}p^{d',\infty}}\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j \left(\begin{array}{cc} 1 & D^{d'}_j \\ D^d_j & D^d_jD^{d'}_j \end{array}\right) \\ \sqrt{p^{d,\infty}p^{d',\infty}}\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j \left(\begin{array}{cc} 1 & D^{d}_j \\ D^{d'}_j & D^d_jD^{d'}_j\end{array}\right) & p^{d',\infty}(\Delta\epsilon^{d',\tau'}_j)^2 \left(\begin{array}{cc} 1 & D^{d'}_j \\ D^{d'}_j & D^{d'}_j \end{array}\right) \end{array}\right)}. \end{align*}\]

Now, in order to derive \(\text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'})\), we need to derive the second term on the fourth and last line of \(\sigma_{\Delta{X}\Delta{X}^{-1}}^{d,\tau,d',\tau'}\mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\sigma_{\Delta{X}\Delta{X}^{-1}}^{d,\tau,d',\tau'}\). If we denote the terms of \(\sigma_{\Delta X \Delta X^{-1}}^{d,\tau}\) by \(\mathbf{A}\), \(\mathbf{B}\), \(\mathbf{C}\) and \(\mathbf{D}\) (filling in the first line first, and going from left to right), and \(\mathbf{A}'\), \(\mathbf{B}'\), \(\mathbf{C}'\) and \(\mathbf{D}'\) the same terms for the matrix \(\sigma_{\Delta X \Delta X^{-1}}^{d',\tau'}\), and \(\mathbf{A}_l\), \(\mathbf{B}_l\), \(\mathbf{C}_l\) and \(\mathbf{D}_l\) the terms in line \(l\) of matrix \(\mathbf{V}^{d,\tau,d',\tau'}_{\mathbf{\Delta{x}\Delta{\epsilon}}}\), with \(l\in\left\{1,2,3,4\right\}\), we can show that:

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \mathbf{C}'(\mathbf{B}\mathbf{A}_3+\mathbf{D}\mathbf{B}_3)+\mathbf{D}'(\mathbf{B}\mathbf{A}_4+\mathbf{D}\mathbf{B}_4),\\ \mathbf{B} & = -\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\\ \mathbf{D} & = \frac{1}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\\ \mathbf{C}' & = -\frac{p^{d',\infty}_D}{p^{d',\infty}p^{d',\infty}_D(1-p^{d',\infty}_D)}\\ \mathbf{D}' & = \frac{1}{p^{d',\infty}p^{d',\infty}_D(1-p^{d',\infty}_D)}\\ \mathbf{A}_3 & =\sqrt{p^{d,\infty}p^{d',\infty}}\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j} \\ \mathbf{B}_3 & = \sqrt{p^{d,\infty}p^{d',\infty}}\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_jD_j^{d}} \\ \mathbf{A}_4 & = \sqrt{p^{d,\infty}p^{d',\infty}}\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_jD_j^{d'}} \\ \mathbf{B}_4 & = \sqrt{p^{d,\infty}p^{d',\infty}}\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_jD_j^{d}D_j^{d'}} \end{align*}\]

Let us start by examining \(\mathbf{A}_3\), and first look at \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}\) when \(d=d'\) and \(\tau\neq\tau'\). In that case, we have:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j} & = \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=d}p_D^{d,\infty}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=\infty}(1-p_D^{d,\infty}) \end{align*}\]

Let us look at the second part of the right hand side:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=\infty} & = \esp{\left(U^0_{j,d+\tau}-U^0_{j,d-1}\right)\left(U^0_{j,d+\tau'}-U^0_{j,d-1}\right)|D_j=\infty} \\ & = \var{U^0_{i,d-1}|D_i=\infty}, \end{align*}\]

using Assumption 4.28.

Let us now look at the first part of the right hand side:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=d} & = \espE\left[\left(U^0_{j,d+\tau}-U^0_{j,d-1}+\eta_{j,d+\tau}-\esp{\eta_{j,d+\tau}|D_j=d}\right)\right.\\ & \phantom{=\espE\left[\right.}\left.\left(U^0_{j,d+\tau'}-U^0_{j,d-1}+\eta_{j,d+\tau'}-\esp{\eta_{j,d+\tau'}|D_j=d}\right)|D_j=d\right] \\ & = \var{U^0_{i,d-1}|D_i=d}, \end{align*}\]

using Assumption 4.28 again.

Let us now examine \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}\) when \(d\neq d'\). In that case, we have:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j} & = \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=d}p_D^{d,\infty}p_{d,\infty}\\ & \phantom{=}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=d'}p_D^{d',\infty}p_{d',\infty}\\ & \phantom{=}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=\infty}(1-p_D^{d,\infty})p_{d,\infty}. \end{align*}\]

The first two terms are equal to zero when \(d\neq d'\) under Assumption 4.28, since the error terms for groups \(D_j=d\) and \(D_j=d'\) are independent under this assumption. For the last term, we have three mutually exclusive cases under which the error terms are not independent under Assumption 4.28:

  • \(d-1=d'+\tau'\): the baseline period for the first group is the arrival period for the second. In that case, we have \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}=-\var{U^0_{i,d-1}|D_i=\infty}(1-p^{d,\infty}_D)p_{d,\infty}\) under Assumption 4.28.
  • \(d'-1=d+\tau\): we have the symmetric case \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}=-\var{U^0_{i,d'-1}|D_i=\infty}(1-p^{d',\infty}_D)p_{d',\infty}\).
  • \(d+\tau=d'+\tau'\): both terms share the same terminal time period, and thus they share their control group terminal outcome: \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}=\var{U^0_{i,d+\tau}|D_i=\infty}(1-p^{d,\infty}_D)p_{d,\infty}\).

For \(\mathbf{B}_3\), \(\mathbf{A}_4\), \(\mathbf{B}_4\), the terms are all zero when \(d\neq d'\), since they condition on being a member of a treatment group, and only control group members are involved in the covariance terms when \(d\neq d'\). When \(d=d'\), we have \(\mathbf{B}_3=\mathbf{A}_4=\mathbf{B}_4=\sqrt{p^{d,\infty}p^{d',\infty}}\var{U^0_{i,d-1}|D_i=d}p^{d,\infty}_D\).

Collecting terms we have:

  • When \(d=d'\) and \(\tau\neq\tau'\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = -\frac{p^{d,\infty}p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\left[-\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d-1}}\right.\nonumber\\ & \phantom{=-\frac{p^{d,\infty}p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)} \left[-\right.}\left.+\frac{1}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d-1}|D_i=d}p^{d,\infty}_D\right] \nonumber\\ & \phantom{=} + \frac{p^{d,\infty}}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\left[-\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d-1}|D_i=d}p^{d,\infty}_D\right.\nonumber\\ & \phantom{=-\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\left[-\right.} \left.+ \frac{1}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d-1}|D_i=d}p^{d,\infty}_D\right] \\ & = \frac{p^{d,\infty}_D}{p^{d,\infty}(p^{d,\infty}_D(1-p^{d,\infty}_D))^2}\left[p^{d,\infty}_D\left(\var{U^0_{i,d-1}}-2\var{U^0_{i,d-1}|D_i=d}\right)\right.\\ & \phantom{=\frac{1}{p^{d,\infty}(p^{d,\infty}_D(1-p^{d,\infty}_D))^2}\left[\right.}\left.+\var{U^0_{i,d-1}|D_i=d}\right]\\ & = \frac{p^{d,\infty}_D}{p^{d,\infty}(p^{d,\infty}_D(1-p^{d,\infty}_D))^2}\left[p^{d,\infty}_D(1-p^{d,\infty}_D)\var{U^0_{i,d-1}|D_i=\infty}\right.\\ & \phantom{=\frac{1}{p^{d,\infty}(p^{d,\infty}_D(1-p^{d,\infty}_D))^2}\left[\right.}\left.+\left((p^{d,\infty}_D)^2-2p^{d,\infty}_D+1\right)\var{U^0_{i,d-1}|D_i=d}\right]\\ & = \frac{1}{p^{d,\infty}}\left[\frac{\var{U^0_{i,d-1}|D_i=\infty}}{1-p^{d,\infty}_D}+\frac{\var{U^0_{i,d-1}|D_i=d}}{p^{d,\infty}_D}\right]. \end{align*}\]

  • When \(d\neq d'\) and \(d-1=d'+\tau'\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = -\frac{p^{d',\infty}_D}{p^{d',\infty}p^{d',\infty}_D(1-p^{d',\infty}_D)}\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d-1}|D_i=\infty}(1-p^{d,\infty}_D)p_{d,\infty}\\ & = -\frac{\var{U^0_{i,d-1}|D_i=\infty}}{p^{d',\infty}(1-p^{d',\infty}_D)}. \end{align*}\]

  • When \(d\neq d'\) and \(d'-1=d+\tau\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = -\frac{p^{d',\infty}_D}{p^{d',\infty}p^{d',\infty}_D(1-p^{d',\infty}_D)}\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d'-1}|D_i=\infty}(1-p^{d',\infty}_D)p_{d',\infty}\\ & = -\frac{\var{U^0_{i,d'-1}|D_i=\infty}}{p^{d,\infty}(1-p^{d,\infty}_D)}. \end{align*}\]

  • When \(d\neq d'\) and \(d+\tau=d'+\tau'\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{p^{d',\infty}_D}{p^{d',\infty}p^{d',\infty}_D(1-p^{d',\infty}_D)}\frac{p^{d,\infty}_D}{p^{d,\infty}p^{d,\infty}_D(1-p^{d,\infty}_D)}\var{U^0_{i,d+\tau}|D_i=\infty}(1-p^{d,\infty}_D)p_{d,\infty}\\ & = \frac{\var{U^0_{i,d+\tau}|D_i=\infty}}{p^{d',\infty}(1-p^{d',\infty}_D)}. \end{align*}\]

Replacing the indexes (\(d+\tau=d'+\tau'\), and so on) in the variance formulas proves the result.
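As a sanity check on the \(d=d'\), \(\tau\neq\tau'\) case, here is a minimal Monte Carlo sketch. It assumes a single treated cohort and a never-treated group (so that \(p^{d,\infty}=1\)), with i.i.d. \(U^0\) shocks as under Assumption 4.28, and computes each \(\hat{\beta}^{SA}_{d,\tau}\) as the first-difference With/Without estimator; the empirical covariance of the two event-study coefficients should be close to the asymptotic formula above divided by \(N\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2_000                       # hypothetical number of units with D_i = d or D_i = infinity
p_D = 0.4                       # hypothetical P(D_i = d | D_i in {d, infinity})
sigma_inf, sigma_d = 1.0, 1.5   # sd of U^0_{i,d-1} in the control and treated groups
reps = 5_000

betas = np.empty((reps, 2))
for r in range(reps):
    D = rng.random(N) < p_D
    sd = np.where(D, sigma_d, sigma_inf)
    # U^0_{j,t} drawn i.i.d. over units and over periods t in {d-1, d, d+1}, as under Assumption 4.28;
    # unit and period effects cancel in the first-differenced within-cohort comparison below.
    U = rng.normal(0.0, sd[:, None], size=(N, 3))
    for c, tau in enumerate((0, 1)):
        dY = U[:, 1 + tau] - U[:, 0]                 # Delta Y^{d,tau}, net of the treatment effect
        betas[r, c] = dY[D].mean() - dY[~D].mean()   # first-difference WW estimator of beta^SA_{d,tau}

emp = np.cov(betas.T)[0, 1]
theory = (sigma_inf**2 / (1 - p_D) + sigma_d**2 / p_D) / N   # here p^{d,infinity} = 1
print(emp, theory)   # the two should agree up to Monte Carlo noise
```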

A.4 Proofs of results in Chapter 5

A.4.1 Proof of Theorem 5.4

One way to obtain the result is to apply Theorem 7.2 in Imbens and Rubin (2015). A more constructive proof follows similar lines, but differs somewhat at critical points. We know that \(\hat\Delta^Y_{WWOLSX}=\hat\delta^{OLS}\), where:

\[\begin{align*} (\hat\alpha_1,\hat\alpha_0,\hat\beta_1,\hat\beta_0,\hat\delta^{OLS}) & = \arg\min_{\alpha_1,\alpha_0,\beta_1,\beta_0,\delta} \frac{1}{N}\sum_{i=1}^N\left(Y_i-\alpha_0 - \beta_0'X_i - (\beta_1-\beta_0)'\left(X_i-\esp{X_i|D_i=1}\right)D_i - \delta D_i\right)^2 \end{align*}\]

Replacing \(\delta\) by its value in the population (\(\Delta^Y_{TT}=\alpha_1-\alpha_0+(\beta_1-\beta_0)'\esp{X_i|D_i=1}\)), using Theorem 5.2, we can easily show that the remaining parameters can be obtained by two separate optimizations:

\[\begin{align*} (\hat\alpha_0,\hat\beta_0) & = \arg\min_{\alpha_0,\beta_0} \frac{1}{N}\sum_{i=1}^N(1-D_i)\left(Y_i-\alpha_0 - \beta_0'X_i \right)^2\\ (\hat\alpha_1,\hat\beta_1) & = \arg\min_{\alpha_1,\beta_1} \frac{1}{N}\sum_{i=1}^ND_i\left(Y_i-\alpha_1 - \beta_1'X_i \right)^2 \end{align*}\]

By Theorem A.2, these parameters are also identified in the population by the OLS estimator. We now write \(\tilde{Y}_i=(1-D_i)\tilde{Y}_i^0+D_i\tilde{Y}_i^1\), with:

\[\begin{align*} \tilde{Y}_i^0 & = Y_i-\beta_0'(X_i-\bar{X}_1) = \alpha_0+\beta_0'\bar{X}_1 +U_i^0 \\ \tilde{Y}_i^1 & = Y_i-\beta_1'(X_i-\bar{X}_1) = \alpha_1+\beta_1'\bar{X}_1 +U_i^1, \end{align*}\]

with \(\bar{X}_1=\frac{\sum_{i=1}^ND_iX_i}{\sum_{i=1}^ND_i}\) in the sample and \(\bar{X}_1=\esp{X_i|D_i=1}\) in the population. \(U_i^1\) and \(U_i^0\) are mean-zero noise terms independent from \(X_i\) and \(D_i\) by Assumptions 5.2, 5.3 and 5.4.

Now, let’s estimate the following regression by OLS:

\[\begin{align*} \tilde{Y}_i & = \alpha + \delta D_i + U_i \\ & = \alpha_0+\beta_0'\bar{X}_1 +(\alpha_1-\alpha_0+(\beta_1-\beta_0)'\bar{X}_1)D_i + (1-D_i)U_i^0+D_iU_i^1 \end{align*}\]

\(\delta^{OLS}\) estimates the treatment effect on the treated. It is also equal to the \(\Delta^Y_{WWOLSX}\) estimator (since it is a combination of the same OLS parameters). Using Lemma A.3, we know that \(\delta^{OLS}\) is the With/Without estimator applied to \(\tilde{Y}_i\). Using Assumptions 2.1, 5.5 and 5.6, Lemma A.5 implies that:

\[\begin{align*} \sqrt{N}(\hat{\delta}^{OLS}-\Delta^Y_{TT}) & \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\frac{\var{\tilde{Y}_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{\tilde{Y}_i^0|D_i=0}}{1-\Pr(D_i=1)}\right). \end{align*}\]

By construction, and using Assumptions 5.3 and 5.4, \(\var{\tilde{Y}_i^d|D_i=d}=\var{Y_i^d|X_i,D_i=d}\).

It remains to clarify what \(\var{Y_i^d|X_i,D_i=d}\) is exactly and what happens when it depends on \(X_i\).
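To illustrate the construction used in this proof, here is a minimal simulation sketch under a hypothetical linear potential-outcome model in which the treatment probability increases with the covariate: the coefficient on \(D_i\) in the regression of \(Y_i\) on a constant, \(D_i\), \(X_i\) and \(D_i(X_i-\bar{X}_1)\) recovers \(\Delta^Y_{TT}\). All parameter values and the logistic selection rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
# Hypothetical linear potential-outcome model with selection on the covariate X
alpha0, alpha1, beta0, beta1 = 1.0, 2.0, 0.5, 1.5
X = rng.normal(size=N)
D = (rng.random(N) < 1 / (1 + np.exp(-X))).astype(float)   # P(D=1|X) increasing in X
Y0 = alpha0 + beta0 * X + rng.normal(size=N)
Y1 = alpha1 + beta1 * X + rng.normal(size=N)
Y = np.where(D == 1, Y1, Y0)

Xbar1 = X[D == 1].mean()                      # sample analogue of E[X | D = 1]
Z = np.column_stack([np.ones(N), D, X, D * (X - Xbar1)])   # regressors of the WWOLSX regression
delta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0][1]        # coefficient on D

tt = (Y1 - Y0)[D == 1].mean()                 # sample treatment effect on the treated
print(delta_hat, tt, alpha1 - alpha0 + (beta1 - beta0) * Xbar1)
```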

A.5 Proofs of results in Chapter 9

A.5.1 Proof of Theorem 9.1

We start with a very useful result stated in Bruce Hansen’s Econometrics textbook: for every matrix \(A=A(X)\), we have \(\var{A'Y|X}=\var{A'U|X}=A'\esp{UU'|X}A\), with \(Y\) the vector of observed outcomes \(Y_i\) and \(X\) the matrix of covariates. \(\hat{\Delta^Y_{WW}}\) is the second term of the vector \(\hat{\Theta}_{OLS}=A'Y\), with \(A=X(X'X)^{-1}\), following Lemma A.3. We thus have that \(\var{\hat{\Delta^Y_{WW}}|X}\) is the lower diagonal term in the matrix \(\var{\hat{\Theta}_{OLS}|X}=(X'X)^{-1}X'\esp{UU'|X}X(X'X)^{-1}\). Under Assumption 9.1, we know that \(\esp{UU'|X}=\esp{UU'}\). We also have that the matrix \(\esp{UU'}\) is block diagonal, with blocks of size \(m\times m\). If we note \(X_c\) the matrix of covariates for the units in cluster \(c\) and \(U_c\) the vector of error terms for those same units, we can derive the value of \(X'\esp{UU'}X\) in each cluster, \(X_c'\esp{U_cU_c'}X_c\). For a treated cluster, we have:

\[\begin{align*} X_c'\esp{U_cU_c'}X_c & = \sigma^2_1 \left( \begin{array}{ccc} 1 & \cdots & 1 \\ 1 & \cdots & 1 \end{array} \right) \left( \begin{array}{cccc} 1 & \rho_1 & \cdots & \rho_1 \\ \rho_1 & 1 & \cdots & \rho_1 \\ \vdots & \vdots & \ddots & \vdots \\ \rho_1 & \rho_1 & \cdots & 1 \end{array} \right) \left( \begin{array}{cc} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{array} \right) \\ & = \sigma^2_1(1 + (m-1)\rho_1)\left( \begin{array}{ccc} 1 & \cdots & 1\\ 1 & \cdots & 1 \end{array} \right) \left( \begin{array}{cc} 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{array} \right) \\ & = \sigma^2_1m(1 + (m-1)\rho_1)\left( \begin{array}{cc} 1 & 1\\ 1 & 1 \end{array} \right), \end{align*}\]

and, for an untreated cluster:

\[\begin{align*} X_c'\esp{U_cU_c'}X_c & = \sigma^2_0 \left( \begin{array}{ccc} 1 & \cdots & 1 \\ 0 & \cdots & 0 \end{array} \right) \left( \begin{array}{cccc} 1 & \rho_0 & \cdots & \rho_0 \\ \rho_0 & 1 & \cdots & \rho_0 \\ \vdots & \vdots & \ddots & \vdots \\ \rho_0 & \rho_0 & \cdots & 1 \end{array} \right) \left( \begin{array}{cc} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{array} \right) \\ & = \sigma^2_0(1 + (m-1)\rho_0)\left( \begin{array}{ccc} 1 & \cdots & 1 \\ 0& \cdots & 0 \end{array} \right) \left( \begin{array}{cc} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \end{array} \right) \\ & = \sigma^2_0m(1 + (m-1)\rho_0)\left( \begin{array}{cc} 1 & 0\\ 0 & 0 \end{array} \right). \end{align*}\]

Summing over all clusters, we have:

\[\begin{align*} X'\esp{UU'}X & = \left(\begin{array}{cc} \sigma^2_0mn_0(1 + (m-1)\rho_0) + \sigma^2_1mn_1(1 + (m-1)\rho_1) & \sigma^2_1mn_1(1 + (m-1)\rho_1)\\ \sigma^2_1mn_1(1 + (m-1)\rho_1) & \sigma^2_1mn_1(1 + (m-1)\rho_1) \end{array}\right) \end{align*}\]

Following the proof of Lemma A.3, we know that:

\[\begin{align*} (X'X)^{-1} & = \frac{1}{N}\frac{1}{\frac{\sum R_i}{N}-\left(\frac{\sum R_i}{N}\right)^2} \left(\begin{array}{cc} \frac{\sum R_i}{N} & -\frac{\sum R_i}{N} \\ -\frac{\sum R_i}{N} & 1 \end{array}\right)\\ & = \frac{1}{N}\frac{1}{p(1-p)} \left(\begin{array}{cc} p & -p \\ -p & 1 \end{array}\right),\\ \end{align*}\]

with \(p=\frac{\sum R_i}{N}\). Since \(p=\frac{\sum R_i}{N}=\frac{mn_1}{N}\) and \(1-p=\frac{\sum (1-R_i)}{N}=\frac{mn_0}{N}\), we have:

\[\begin{align*} \var{\hat{\Theta}_{OLS}|X} & = \frac{1}{N}\frac{1}{p^2(1-p)^2} \left(\begin{array}{cc} p & -p \\ -p & 1 \end{array}\right) \\ & \phantom{=}\left(\begin{array}{cc} \sigma^2_0(1-p)(1 + (m-1)\rho_0) + \sigma^2_1p(1 + (m-1)\rho_1) & \sigma^2_1p(1 + (m-1)\rho_1)\\ \sigma^2_1p(1 + (m-1)\rho_1) & \sigma^2_1p(1 + (m-1)\rho_1) \end{array}\right) \\ & \phantom{=}\left(\begin{array}{cc} p & -p \\ -p & 1 \end{array}\right) \end{align*}\]

As a consequence, we have:

\[\begin{align*} \var{\hat{\Delta^Y_{WW}}|X} & = \frac{1}{N}\frac{1}{p^2(1-p)^2}\\ & \phantom{=} \left(p^2(1-p)\left(\sigma^2_0(1 + (m-1)\rho_0)-\sigma^2_1(1 + (m-1)\rho_1)\right)+p(1-p)\sigma^2_1(1 + (m-1)\rho_1)\right) \\ & = \frac{1}{N}\left(\frac{\sigma^2_0}{1-p}(1 + (m-1)\rho_0) + \sigma^2_1(1 + (m-1)\rho_1)\frac{p(1-p)-p^2(1-p)}{p^2(1-p)^2}\right)\\ & = \frac{1}{N}\left(\frac{\sigma^2_0}{1-p}(1 + (m-1)\rho_0) + \frac{\sigma^2_1}{p}(1 + (m-1)\rho_1)\right), \end{align*}\]

which proves the result.
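The formula can be checked exactly on a small hypothetical design by building the block-diagonal \(\esp{UU'}\) matrix and computing the sandwich \((X'X)^{-1}X'\esp{UU'}X(X'X)^{-1}\) directly; the cluster sizes, variances and intra-cluster correlations below are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical design: n1 treated and n0 untreated clusters, each of size m
n1, n0, m = 6, 4, 5
sigma1, sigma0, rho1, rho0 = 1.2, 1.0, 0.3, 0.2
N = m * (n1 + n0)
p = m * n1 / N

R = np.repeat([1.0] * n1 + [0.0] * n0, m)      # treatment varies at the cluster level
X = np.column_stack([np.ones(N), R])

# Block-diagonal E[UU']: variance sigma_d^2 on the diagonal, rho_d * sigma_d^2 within clusters
Sigma = np.zeros((N, N))
for c in range(n1 + n0):
    s, r = (sigma1, rho1) if c < n1 else (sigma0, rho0)
    Sigma[c * m:(c + 1) * m, c * m:(c + 1) * m] = s**2 * ((1 - r) * np.eye(m) + r * np.ones((m, m)))

XtX_inv = np.linalg.inv(X.T @ X)
V = XtX_inv @ X.T @ Sigma @ X @ XtX_inv        # exact Var(Theta_hat_OLS | X)

formula = (sigma0**2 * (1 + (m - 1) * rho0) / (1 - p)
           + sigma1**2 * (1 + (m - 1) * rho1) / p) / N
print(V[1, 1], formula)                        # identical up to floating-point error
```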

A.5.2 Proof of Theorem 9.2

The proof relies heavily on the proof of Theorem 4.27 in Section A.3.8. The two proofs diverge when computing \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}\).

When \(d=d'\) and \(\tau\neq\tau'\), we have:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j} & = \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=d}p_D^{d,\infty}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=\infty}(1-p_D^{d,\infty}) \end{align*}\]

Let us look at the second part of the right hand side:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=\infty} & = \esp{\left(U^0_{j,d+\tau}-U^0_{j,d-1}\right)\left(U^0_{j,d+\tau'}-U^0_{j,d-1}\right)|D_j=\infty} \\ & = \cov{U^0_{j,d+\tau},U^0_{j,d+\tau'}|D_j=\infty} -\cov{U^0_{j,d+\tau},U^0_{j,d-1}|D_j=\infty} \nonumber\\ & \phantom{=} -\cov{U^0_{j,d+\tau'},U^0_{j,d-1}|D_j=\infty} +\var{U^0_{j,d-1}|D_j=\infty}\\ & = \rho^{|\tau-\tau'|}\var{U^0_{j,\min\{d+\tau,d+\tau'\}}|D_j=\infty}-\rho^{|\tau+1|}\var{U^0_{j,\min\{d+\tau,d-1\}}|D_j=\infty}\\ & \phantom{=}-\rho^{|\tau'+1|}\var{U^0_{j,\min\{d+\tau',d-1\}}|D_j=\infty}+\var{U^0_{j,d-1}|D_j=\infty} \end{align*}\]

using Assumption 9.2.

Let us now look at the first part:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=d} & = \esp{\left(U^0_{j,d+\tau}-U^0_{j,d-1}\right)\left(U^0_{j,d+\tau'}-U^0_{j,d-1}\right)|D_j=d} \\ & = \cov{U^0_{j,d+\tau},U^0_{j,d+\tau'}|D_j=d} -\cov{U^0_{j,d+\tau},U^0_{j,d-1}|D_j=d} \nonumber\\ & \phantom{=} -\cov{U^0_{j,d+\tau'},U^0_{j,d-1}|D_j=d} +\var{U^0_{j,d-1}|D_j=d}\\ & = \rho^{|\tau-\tau'|}\var{U^0_{j,\min\{d+\tau,d+\tau'\}}|D_j=d}-\rho^{|\tau+1|}\var{U^0_{j,\min\{d+\tau,d-1\}}|D_j=d}\\ & \phantom{=}-\rho^{|\tau'+1|}\var{U^0_{j,\min\{d+\tau',d-1\}}|D_j=d}+\var{U^0_{j,d-1}|D_j=d} \end{align*}\]

using Assumption 9.2 again.

Let us now examine \(\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j}\) when \(d\neq d'\). In that case, we have:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j} & = \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=d}p_D^{d,\infty}p_{d,\infty}\\ & \phantom{=}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=d'}p_D^{d',\infty}p_{d',\infty}\\ & \phantom{=}+\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=\infty}(1-p_D^{d,\infty})p_{d,\infty}. \end{align*}\]

The first two terms are equal to zero when \(d\neq d'\) under Assumption 9.2, since the error terms for groups \(D_j=d\) and \(D_j=d'\) are independent under this assumption. For the last term, we have:

\[\begin{align*} \esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d',\tau'}_j|D_j=\infty} & = \esp{\left(U^0_{j,d+\tau}-U^0_{j,d-1}\right)\left(U^0_{j,d'+\tau'}-U^0_{j,d'-1}\right)|D_j=\infty} \\ & = \cov{U^0_{j,d+\tau},U^0_{j,d'+\tau'}|D_j=\infty} -\cov{U^0_{j,d+\tau},U^0_{j,d'-1}|D_j=\infty} \nonumber\\ & \phantom{=} -\cov{U^0_{j,d'+\tau'},U^0_{j,d-1}|D_j=\infty} +\cov{U^0_{j,d-1},U^0_{j,d'-1}|D_j=\infty}\\ & = \rho^{|d+\tau-d'-\tau'|}\var{U^0_{j,\min\{d+\tau,d'+\tau'\}}|D_j=\infty}-\rho^{|d+\tau-d'+1|}\var{U^0_{j,\min\{d+\tau,d'-1\}}|D_j=\infty}\\ & \phantom{=}-\rho^{|d'+\tau'-d+1|}\var{U^0_{j,\min\{d'+\tau',d-1\}}|D_j=\infty}+\rho^{|d-d'|}\var{U^0_{j,\min\{d-1,d'-1\}}|D_j=\infty}. \end{align*}\]

For \(\mathbf{B}_3\), \(\mathbf{A}_4\), \(\mathbf{B}_4\), the terms are all zero when \(d\neq d'\), since they condition on being a member of a treatment group, and only control group members are involved in the covariance terms when \(d\neq d'\). When \(d=d'\), we have \(\mathbf{B}_3=\mathbf{A}_4=\mathbf{B}_4=\sqrt{p^{d,\infty}p^{d',\infty}}\esp{\Delta\epsilon^{d,\tau}_j\Delta\epsilon^{d,\tau'}_j|D_j=d}p_D^{d,\infty}\), whose precise formula is given above.

Collecting terms we have:

  • When \(d=d'\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{1}{p^{d,\infty}}\left[\frac{\esp{\Delta\epsilon^{d,\tau}_i\Delta\epsilon^{d',\tau'}_i|D_i=\infty}}{1-p^{d,\infty}_D}+\frac{\esp{\Delta\epsilon^{d,\tau}_i\Delta\epsilon^{d',\tau'}_i|D_i=d}}{p^{d,\infty}_D}\right]. \end{align*}\]

  • When \(d\neq d'\):

\[\begin{align*} \text{Cov}(\hat{\beta}^{SA}_{d,\tau},\hat{\beta}^{SA}_{d',\tau'}) & = \frac{\esp{\Delta\epsilon^{d,\tau}_i\Delta\epsilon^{d',\tau'}_i|D_i=\infty}}{p^{d',\infty}(1-p^{d',\infty}_D)}. \end{align*}\]

This proves the result.

A.5.3 Proof of Theorem 9.3

Our starting point is the same as in the proof of Theorem 9.1 in Section A.5.1, namely the very useful result stated in Bruce Hansen’s Econometrics textbook: for every matrix \(A=A(X)\), we have \(\var{A'Y|X}=\var{A'U|X}=A'\esp{UU'|X}A\), with \(Y\) the vector of observed outcomes \(Y_i\) and \(X\) the matrix of covariates. \(\hat{\Delta^Y_{WW}}\) is the second term of the vector \(\hat{\Theta}_{OLS}=A'Y\), with \(A=X(X'X)^{-1}\), following Lemma A.3. We thus have that \(\var{\hat{\Delta^Y_{WW}}|X}\) is the lower diagonal term in the matrix \(\var{\hat{\Theta}_{OLS}|X}=(X'X)^{-1}X'\esp{UU'|X}X(X'X)^{-1}\). Under Assumption ??, we know that \(\esp{UU'|X}=\esp{UU'}\). We also have that each line and each column of the matrix \(\esp{UU'}\) has only \(k+1\) nonzero elements: the diagonal term, equal to \(\sigma^2\), and \(k\) elements equal to \(\rho\sigma^2\), due to the \(k\) neighbors of each observation. We thus have:

\[\begin{align*} X'\esp{UU'}X & = \sigma^2 \left( \begin{array}{ccc} 1 & \cdots & 1 \\ D_1 & \cdots & D_N \end{array} \right) \left( \begin{array}{cccc} 1 & \rho \mathcal{N}_{1,1} & \cdots & \rho \mathcal{N}_{1,N} \\ \rho \mathcal{N}_{2,1} & 1 & \cdots & \rho \mathcal{N}_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ \rho \mathcal{N}_{N,1} & \rho \mathcal{N}_{N,2} & \cdots & 1 \end{array} \right) \left( \begin{array}{cc} 1 & D_1 \\ \vdots & \vdots \\ 1 & D_N \end{array} \right) \\ & = \sigma^2 N \left( \begin{array}{cc} 1+k\rho & (1+k\rho)\bar{D} \\ (1+k\rho)\bar{D} & (1+k\rho\bar{D}_1)\bar{D} \end{array} \right), \end{align*}\]

with \(\bar{D}=\frac{1}{N}\sum_{i=1}^ND_i\) and \(\bar{D}_1=\frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^ND_i\bar{D}_i\) and \(\bar{D}_i=\frac{1}{k}\sum_{j=1}^N\mathcal{N}_{j,i}D_j\).

Following the proof of Lemma A.3, we know that:

\[\begin{align*} (X'X)^{-1} & = \frac{1}{N}\frac{1}{\bar{D}(1-\bar{D})} \left(\begin{array}{cc} \bar{D} & -\bar{D} \\ -\bar{D} & 1 \end{array}\right).\\ \end{align*}\]

We thus have:

\[\begin{align*} \var{\hat{\Theta}_{OLS}|X} & = \frac{1}{N}\frac{\sigma^2}{\bar{D}^2(1-\bar{D})^2} \left(\begin{array}{cc} \bar{D} & -\bar{D} \\ -\bar{D} & 1 \end{array}\right) \left( \begin{array}{cc} 1+k\rho & (1+k\rho)\bar{D} \\ (1+k\rho)\bar{D} & (1+k\rho\bar{D}_1)\bar{D} \end{array} \right) \left(\begin{array}{cc} \bar{D} & -\bar{D} \\ -\bar{D} & 1 \end{array}\right) \\ & = \frac{1}{N}\frac{\sigma^2}{\bar{D}^2(1-\bar{D})^2} \left(\begin{array}{cc} (1+k\rho)\bar{D}(1-\bar{D}) & \bar{D}^2k\rho(1-\bar{D}_1) \\ 0 & (1+k\rho\bar{D}_1)\bar{D}-(1+k\rho)\bar{D}^2 \end{array}\right) \left(\begin{array}{cc} \bar{D} & -\bar{D} \\ -\bar{D} & 1 \end{array}\right) \\ & = \frac{1}{N}\frac{\sigma^2}{\bar{D}^2(1-\bar{D})^2} \left( \begin{array}{c} (1+k\rho)\bar{D}^2(1-\bar{D})-\bar{D}^3k\rho(1-\bar{D}_1) \\ -\bar{D}((1+k\rho\bar{D}_1)\bar{D}-(1+k\rho)\bar{D}^2) \end{array}\right.\\ & \phantom{= \frac{1}{N}\frac{\sigma^2}{\bar{D}^2(1-\bar{D})^2}}\left. \begin{array}{c} (1+k\rho)\bar{D}^2(1-\bar{D})-((1+k\rho\bar{D}_1)\bar{D}-(1+k\rho)\bar{D}^2))\\ (1+k\rho\bar{D}_1)\bar{D}-(1+k\rho)\bar{D}^2 \end{array} \right) \end{align*}\]

As a consequence, we have:

\[\begin{align*} \var{\hat{\Delta^Y_{WW}}|X} & = \frac{1}{N}\frac{\sigma^2}{\bar{D}(1-\bar{D})}\left(1+k\rho\frac{\bar{D}_1-\bar{D}}{1-\bar{D}}\right), \end{align*}\]

which proves the result (with \(\plim\bar{D}=p\) and \(\plim\bar{D}_1=p_1\)).
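Here is a small exact check of this conditional variance, assuming a hypothetical ring network in which every unit has exactly \(k=2\) neighbors and an arbitrary treatment allocation: the bottom-right element of the sandwich matrix matches the closed-form expression derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, k, sigma, rho = 60, 2, 1.0, 0.25      # hypothetical ring of N units, k = 2 neighbors each

# Adjacency matrix of the ring: every unit has exactly k = 2 neighbors
Nmat = np.zeros((N, N))
idx = np.arange(N)
Nmat[idx, (idx + 1) % N] = 1.0
Nmat[idx, (idx - 1) % N] = 1.0

D = (rng.random(N) < 0.4).astype(float)
X = np.column_stack([np.ones(N), D])
Sigma = sigma**2 * (np.eye(N) + rho * Nmat)   # sigma^2 on the diagonal, rho*sigma^2 for neighbors

XtX_inv = np.linalg.inv(X.T @ X)
V = XtX_inv @ X.T @ Sigma @ X @ XtX_inv       # exact Var(Theta_hat_OLS | X)

Dbar = D.mean()
Dbar_i = Nmat @ D / k                         # neighborhood treatment shares
Dbar1 = (D * Dbar_i).sum() / D.sum()
formula = sigma**2 / (N * Dbar * (1 - Dbar)) * (1 + k * rho * (Dbar1 - Dbar) / (1 - Dbar))
print(V[1, 1], formula)                       # identical up to floating-point error
```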

A.6 Proofs of results in Chapter 11

A.6.1 Proof of Theorem 11.3

The Lagrangian corresponding to the optimization problem is:

\[\begin{align*} \mathcal{L} & = \sum_{k=1}^K n_kA\left(\frac{r_k}{n_k}\right) + \lambda\left(\bar{R}-\sum_{k=1}^Kr_k\right) + \sum_{k=1}^K\mu_k(n_k-r_k), \end{align*}\]

with \(\lambda\) and \(\mu_k\) the Lagrange multipliers.

Let us first start with the case where \(A''<0\). In that case, the optimization problem is concave (the objective function is concave and all the constraints are linear, thus convex). The Kuhn and Tucker conditions are thus both necessary and sufficient for optimality. These conditions are:

\[\begin{align*} \partder{\mathcal{L}}{r_k} & = A'\left(\frac{r_k}{n_k}\right)-\lambda-\mu_k \leq 0, & r_k & \geq 0, & r_k\partder{\mathcal{L}}{r_k} & = 0\\ \partder{\mathcal{L}}{\lambda} & = \bar{R}-\sum_{k=1}^Kr_k \geq 0, & \lambda & \geq 0, & \lambda\partder{\mathcal{L}}{\lambda} & = 0\\ \partder{\mathcal{L}}{\mu_k} & = n_k-r_k \geq 0, & \mu_k & \geq 0, & \mu_k\partder{\mathcal{L}}{\mu_k} & = 0. \end{align*}\]

Let us first examine whether there is a symmetric interior solution where the stock of effort is exhausted, that is with \(\bar{R}=\sum_{k=1}^Kr_k\). This will happen when \(0<r_k<n_k\), \(\forall k\) and \(\lambda>0\). As a consequence, we have that \(\mu_k=0\), \(\forall k\) and \(A'\left(\frac{r_k}{n_k}\right)=\lambda\), \(\forall k\). Since \(A''<0\), \(A'\) is invertible and we have \(\frac{r_k^*}{n_k}=A'^{-1}(\lambda)=p^*\), \(\forall k\). We thus have \(\sum_{k=1}^Kr_k^*=\sum_{k=1}^Kp^*n_k=p^*N=\bar{R}\), and thus \(p^*=\frac{\bar{R}}{N}\). There is thus a symmetric interior optimum where all nodes are treated in the same proportion. We are now going to show that this optimum is unique, that is, that there is no asymmetric, non-interior, or non-saturated solution.

Let us start with the possibility of an asymmetric, interior, and saturated solution, that is \(0<r_k<n_k\), \(\forall k\) and \(\lambda>0\), but with \(\frac{r_k}{n_k}\neq \frac{r_j}{n_j}\) for some \(j\neq k\). This is impossible, since \(A'\left(\frac{r_k}{n_k}\right)=\lambda\), \(\forall k\) implies that \(\frac{r_k}{n_k}= \frac{r_j}{n_j},\) \(\forall j\neq k\).

Let us now examine the possibility of a symmetric corner solution. Let us start with the possibility that \(r_k=n_k\), \(\forall k\). This is impossible since we have \(\bar{R}<N\). Now, let us look at the possibility that \(r_k=0\), \(\forall k\). In that case, \(\lambda=0=\mu_k\), \(\forall k\), and thus \(A'(0)\leq 0\), which is a contradiction.

Finally, let us examine whether there are asymmetric corner solutions. First, there might be some \(k\) for which \(\frac{r_k^*}{n_k}=A'^{-1}(\lambda)=p^*>0\), and some \(j\neq k\) such that \(r_j^*=0\), and thus \(\mu_j=0\). If \(\lambda=0\), we have \(A'(0)\leq 0\), a contradiction with \(A'>0\). If \(\lambda>0\), the Kuhn and Tucker conditions impose \(A'(0)\leq\lambda=A'\left(\frac{r_k^*}{n_k}\right)\) with \(\frac{r_k^*}{n_k}>0\), which is in contradiction with \(A''<0\), since \(A''<0\) implies \(A'(0)>A'\left(\frac{r_k^*}{n_k}\right)\). Second, there might be some \(k\) for which \(\frac{r_k^*}{n_k}=A'^{-1}(\lambda)=p^*\), and some \(j\neq k\) such that \(r_j^*=n_j\), and thus \(\mu_j>0\). If \(\lambda=0\), we have \(A'(1)=\mu_j\). We also have \(A'\left(\frac{r_k^*}{n_k}\right)=0\), a contradiction, since \(A'>0\) by assumption. If \(\lambda>0\), we have \(A'(1)=\lambda + \mu_j\) and \(A'\left(\frac{r_k^*}{n_k}\right)=\lambda\). Since \(\mu_j>0\), we thus have \(A'(1)>A'\left(\frac{r_k^*}{n_k}\right)\), and \(\frac{r_k^*}{n_k}<1\), which is in contradiction with \(A''<0\). This proves the first part of the result.

Let us now examine the case with \(A''>0\). Since all the constraints in the optimization problem are linear, we know that the Kuhn and Tucker conditions are necessary conditions for an optimum. As a consequence, the conditions derived for the case where \(A''<0\) are necessary for an optimum. A reasoning similar to the previous one shows that the symmetric, saturated, interior candidate is a minimum (using the fact that \(-A\) is concave, since \(A\) is convex). So this solution cannot be an optimum.

Now, can there be non-saturated solutions? Non-saturated solutions have \(\lambda=0\). As a consequence, if \(r_k<n_k\), we have \(\mu_k=0\) and thus either \(r_k=0\) and \(A'(0)\leq 0\), or \(r_k>0\) and \(A'\left(\frac{r_k}{n_k}\right)=0\), both in contradiction with \(A'>0\). There also cannot be \(r_k=n_k\), \(\forall k\), since \(\bar{R}<N\). So there cannot be non-saturated solutions, and we thus have \(\lambda>0\).

Let us now examine whether there can be solutions that are saturated, asymmetric and corner. There can be solutions where some \(j\) is such that \(r_j=0\) and some \(k\) is such that \(0<r_k<n_k\). Indeed, we have \(\mu_j=0\), and thus \(A'(0)\leq\lambda\) and \(A'(\frac{r_k}{n_k})=\lambda\), which is compatible with \(A''>0\). Another possible set of solutions is one where some \(j\) is such that \(r_j=n_j\) and some \(k\) is such that \(0<r_k<n_k\). Indeed, we have \(\mu_j>0\), and thus \(A'(1)=\lambda+\mu_j\), and \(\mu_k=0\) and thus \(A'(\frac{r_k}{n_k})=\lambda\), which is compatible with \(A''>0\). Another possible set of solutions is one where some \(j\) is such that \(r_j=0\) and some \(k\) is such that \(r_k=n_k\). Indeed, we have \(\mu_j=0\), and thus \(A'(0)\leq\lambda\), and \(\mu_k>0\) and thus \(A'(1)=\lambda+\mu_k\), which is compatible with \(A''>0\). A final possible set of solutions is one where some \(j\) are such that \(r_j=0\), some \(l\) are such that \(0<r_l<n_l\) and some \(k\) are such that \(r_k=n_k\). Indeed, we have \(\mu_j=0\), and thus \(A'(0)\leq\lambda\), and \(\mu_l=0\) and thus \(A'(\frac{r_l}{n_l})=\lambda\), and finally \(\mu_k>0\) and thus \(A'(1)=\lambda+\mu_k\), which is compatible with \(A''>0\).

We are now going to show that the only optimal solutions are those in which some nodes \(j\) are such that \(r_j^*=0\) and the remaining nodes \(l\) are such that \(r_l^*=n_l\). If we treat a proportion \(p\) of units in a node of size \(n\), we obtain an increase in our objective function of \(n(A(p)-A(0))\) for a cost in terms of treatment effort of \(np\). Note that the optimum of this problem does not depend on \(n\), so we can study \((A(p)-A(0))+\kappa(1-p)\), with \(\kappa\) the multiplier associated with the constraint that \(p\leq 1\). Since \(A''>0\) and \(A'>0\), we have that \(A(1)-A(0)>A(p)-A(0)\), \(\forall 0\leq p<1\). This implies that we can always divert some treatment effort away from a node \(k\) such that \(r_k<n_k\) to get another node closer to \(r_l=n_l\) and increase total output. As a consequence, the only optimal solutions are such that \(\sum_{k=1}^Kr_k^*=\bar{R}\) and there is a set of nodes \(\mathcal{L}\) such that \(r_l^*=n_l\), \(\forall l\in\mathcal{L}\), and another set of nodes \(\mathcal{J}\) such that \(r_j^*=0\), \(\forall j\in\mathcal{J}\). We also have, as a consequence, \(\sum_{l\in\mathcal{L}}n_l=\bar{R}\) and \(\sum_{j\in\mathcal{J}}n_j=N-\bar{R}\). This proves the second part of the result and completes the proof.
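The following sketch illustrates both parts of the result with hypothetical node sizes, using \(A(x)=\sqrt{x}\) for the concave case and \(A(x)=e^x-1\) for the convex case: equal treatment shares dominate under concavity, while saturating some nodes and leaving the others untreated dominates under convexity.

```python
import numpy as np

rng = np.random.default_rng(4)
n = np.array([30.0, 50.0, 20.0])   # hypothetical node sizes, N = 100
R_bar = 50.0                        # total treatment effort, R_bar < N

def objective(r, A):
    # sum_k n_k * A(r_k / n_k), evaluated for one allocation or a batch of allocations
    return (n * A(r / n)).sum(axis=-1)

equal_share = (R_bar / n.sum()) * n                  # r_k / n_k = R_bar / N for every node
all_or_nothing = np.array([30.0, 0.0, 20.0])         # saturate nodes 1 and 3, leave node 2 untreated

# Random feasible allocations (sum r_k = R_bar, 0 <= r_k <= n_k) for comparison
draws = rng.dirichlet(np.ones(len(n)), size=50_000) * R_bar
draws = draws[(draws <= n).all(axis=1)]

for name, A in (("concave A (sqrt)", np.sqrt), ("convex A (exp(x)-1)", np.expm1)):
    print(name,
          "| equal share:", round(float(objective(equal_share, A)), 2),
          "| all-or-nothing:", round(float(objective(all_or_nothing, A)), 2),
          "| best random feasible draw:", round(float(objective(draws, A).max()), 2))
```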

A.7 Proofs of results in Chapter 16

A.7.1 Proof of Theorem 16.1

The SFE estimator can be written in matrix form as follows:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1} \\ \vdots \\ Y_{N} \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{ccccccc} 1 & 0 & \dots & 0 & R_{1}\\ \vdots & \vdots & & \vdots & \vdots\\ 1 & 0 & \dots & 0 & \vdots \\ 0 & 1 & \dots & 0 & \vdots\\ \vdots & \vdots & & \vdots & \vdots\\ 0 & 1 & \dots & 0 & \vdots\\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \vdots\\ \vdots & \vdots & & \vdots & \vdots\\ 0 & 0 & \dots & 1 & R_{N}\\ \end{array}\right)}_{X^{SFE}} \underbrace{\left(\begin{array}{c} \alpha^{SFE}_{1}\\ \alpha^{SFE}_{2} \\ \vdots\\ \alpha^{SFE}_{S} \\ \beta^{SFE} \end{array}\right)}_{\Theta^{SFE}} + \underbrace{\left(\begin{array}{c} \epsilon^{SFE}_{1} \\ \vdots \\ \epsilon^{SFE}_{N} \end{array}\right),}_{\epsilon^{SFE}} \end{align*}\]

where all observations are ordered by strata, and strata indices go from \(1\) to \(S\) since there are \(S\) strata. We are going to apply Theorem A.1, i.e. the Frisch-Waugh-Lovell Theorem, partialling out the strata fixed effects from the list of regressors. Let \(X^{SFE}_1\) denote the matrix of strata fixed effects. We have:

\[\begin{align*} (X^{SFE}_1)'X^{SFE}_1 & = \left(\begin{array}{cccc} N_1 & 0 & \dots & 0\\ 0 & N_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & N_S \end{array}\right) \end{align*}\]

and therefore

\[\begin{align*} ((X^{SFE}_1)'X^{SFE}_1)^{-1} & = \left(\begin{array}{cccc} \frac{1}{N_1} & 0 & \dots & 0\\ 0 & \frac{1}{N_2} & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \frac{1}{N_S} \end{array}\right), \end{align*}\]

since the inverse of a diagonal matrix is the diagonal matrix of the inverses of each of its diagonal terms. We also have:

\[\begin{align*} (X^{SFE}_1)'Y & = \left(\begin{array}{c} \sum_{i\in\mathcal{I}(1)}Y_i \\ \vdots \\ \sum_{i\in\mathcal{I}(S)}Y_i \end{array}\right) \end{align*}\]

and therefore

\[\begin{align*} ((X^{SFE}_1)'X^{SFE}_1)^{-1}(X^{SFE}_1)'Y & = \left(\begin{array}{c} \bar{Y}_1 \\ \vdots \\ \bar{Y}_S \end{array}\right) \end{align*}\]

where \(\bar{Y}_s=\frac{\sum_{i\in\mathcal{I}(s)}Y_i }{N_s}\), and finally:

\[\begin{align*} M_1Y & = Y-X^{SFE}_1((X^{SFE}_1)'X^{SFE}_1)^{-1}(X^{SFE}_1)'Y = \left(\begin{array}{c} Y_1-\bar{Y}_1 \\ \vdots \\ Y_{\max\mathcal{I}(1)}-\bar{Y}_1 \\ Y_{\min\mathcal{I}(2)}-\bar{Y}_2 \\ \vdots \\ Y_{\max\mathcal{I}(2)}-\bar{Y}_2 \\ \vdots \\ Y_{\min\mathcal{I}(S)}-\bar{Y}_S \\ \vdots \\ Y_{N}-\bar{Y}_S \end{array}\right) \end{align*}\]

Similarly, we have:

\[\begin{align*} M_1R & =\left(\begin{array}{c} R_1-\bar{R}_1 \\ \vdots \\ R_{\max\mathcal{I}(1)}-\bar{R}_1 \\ R_{\min\mathcal{I}(2)}-\bar{R}_2 \\ \vdots \\ R_{\max\mathcal{I}(2)}-\bar{R}_2 \\ \vdots \\ R_{\min\mathcal{I}(S)}-\bar{R}_S \\ \vdots \\ R_{N}-\bar{R}_S \end{array}\right) \end{align*}\]

Let us now form the OLS estimator of \(\beta^{SFE}\) which is equal to \(\hat{\beta}^{SFE}=((M_1R)'M_1R)^{-1}(M_1R)'M_1Y\) by Theorem A.1. We have:

\[\begin{align*} (M_1R)'M_1Y & = \sum_{i=1}^{N}(R_i-\bar{R}_{\mathbf{s}(i)})(Y_i-\bar{Y}_{\mathbf{s}(i)})\\ & = \sum_{i=1}^{N}R_iY_i -\sum_{i=1}^{N}\bar{R}_{\mathbf{s}(i)}Y_i-\sum_{i=1}^{N}R_i\bar{Y}_{\mathbf{s}(i)}+\sum_{i=1}^{N}\bar{R}_{\mathbf{s}(i)}\bar{Y}_{\mathbf{s}(i)}\\ & = \sum_{i=1}^{N}R_iY_i -\sum_{s=1}^{S}\bar{R}_{s}\sum_{i\in\mathcal{I}(s)}Y_i-\sum_{s=1}^{S}\bar{Y}_{s}\sum_{i\in\mathcal{I}(s)}R_i+\sum_{s=1}^{S}\bar{R}_{s}\bar{Y}_{s}\sum_{i\in\mathcal{I}(s)}1\\ & = \sum_{i=1}^{N}R_iY_i -\sum_{s=1}^{S}\bar{R}_{s}N_s\bar{Y}_{s}-\sum_{s=1}^{S}\bar{Y}_{s}N_s\bar{R}_{s}+\sum_{s=1}^{S}\bar{R}_{s}\bar{Y}_{s}N_s\\ & = \sum_{i=1}^{N}R_iY_i -\sum_{s=1}^{S}N_s\bar{R}_{s}\bar{Y}_{s}. \end{align*}\]

If we write \(\overline{RY}_s=\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_iY_i\), we have \(\sum_{i=1}^{N}R_iY_i=\sum_{s=1}^{S}N_s\overline{RY}_s\), and therefore:

\[\begin{align*} (M_1R)'M_1Y & = \sum_{s=1}^{S}N_s(\overline{RY}_s-\bar{R}_{s}\bar{Y}_{s})\\ & = \sum_{s=1}^{S}N_s\left(\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_iY_i-\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_i\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}Y_i\right)\\ & = \sum_{s=1}^{S}\left(\sum_{i\in\mathcal{I}(s)}R_iY_i-\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_i\sum_{i\in\mathcal{I}(s)}Y_i\right) \end{align*}\]

At the same time,

\[\begin{align*} (M_1R)'M_1R & = \sum_{i=1}^{N}(R_i-\bar{R}_{\mathbf{s}(i)})^2\\ & = \sum_{i=1}^{N}R_i^2-2\sum_{i=1}^{N}R_i\bar{R}_{\mathbf{s}(i)}+\sum_{i=1}^{N}\bar{R}^2_{\mathbf{s}(i)}\\ & = \sum_{i=1}^{N}R_i-2\sum_{s=1}^{S}N_s\bar{R}^2_s+\sum_{s=1}^{S}N_s\bar{R}^2_{s}\\ & = \sum_{s=1}^{S}N_s\bar{R}_s-2\sum_{s=1}^{S}N_s\bar{R}^2_s+\sum_{s=1}^{S}N_s\bar{R}^2_{s}\\ & = \sum_{s=1}^{S}N_s\bar{R}_s-\sum_{s=1}^{S}N_s\bar{R}^2_s\\ & = \sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s). \end{align*}\]

As a consequence, we have:

\[\begin{align*} \hat{\beta}^{SFE} & = \frac{\sum_{s=1}^{S}\left(\sum_{i\in\mathcal{I}(s)}R_iY_i-\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_i\sum_{i\in\mathcal{I}(s)}Y_i\right)}{\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)}\\ & = \sum_{s=1}^{S}\frac{\sum_{i\in\mathcal{I}(s)}R_i(1-\bar{R}_s)}{\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)} \frac{\sum_{i\in\mathcal{I}(s)}R_iY_i-\frac{1}{N_s}\sum_{i\in\mathcal{I}(s)}R_i\sum_{i\in\mathcal{I}(s)}Y_i}{\sum_{i\in\mathcal{I}(s)}R_i(1-\bar{R}_s)}\\ & = \sum_{s=1}^{S}\frac{N_s\bar{R}_s(1-\bar{R}_s)}{\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)}\hat{\Delta}^Y_{WW_s}\\ & = \sum_{s=1}^{S}w^{SFE}_s\hat{\Delta}^Y_{WW_s}, \end{align*}\]

where the penultimate line uses the last equation in the proof of Lemma A.3 and the last line uses the definition of \(w^{SFE}_s\). This proves the result.
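As a numerical illustration of this equivalence, the following sketch (with hypothetical strata and strata-specific assignment probabilities) runs the strata-fixed-effects regression and compares \(\hat{\beta}^{SFE}\) with the \(w^{SFE}_s\)-weighted average of the within-strata With/Without estimators.

```python
import numpy as np

rng = np.random.default_rng(5)
S, N = 4, 12_000
strata = rng.integers(S, size=N)                      # strata s(i), hypothetical design
pi_s = np.array([0.2, 0.4, 0.5, 0.7])                 # strata-specific assignment probabilities
R = (rng.random(N) < pi_s[strata]).astype(float)
Y = 1.0 + strata + (1.0 + strata) * R + rng.normal(size=N)   # heterogeneous effects across strata

# Strata-fixed-effects regression: strata dummies plus R, beta^SFE is the last coefficient
dummies = (strata[:, None] == np.arange(S)).astype(float)
X = np.column_stack([dummies, R])
beta_sfe = np.linalg.lstsq(X, Y, rcond=None)[0][-1]

# w^SFE_s-weighted average of the within-strata With/Without estimators
ww = np.empty(S)
w = np.empty(S)
for s in range(S):
    m = strata == s
    ww[s] = Y[m & (R == 1)].mean() - Y[m & (R == 0)].mean()
    Rbar_s = R[m].mean()
    w[s] = m.sum() * Rbar_s * (1 - Rbar_s)            # N_s * Rbar_s * (1 - Rbar_s)
w = w / w.sum()

print(beta_sfe, (w * ww).sum())                       # identical up to floating-point error
```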

A.7.2 Proof of Theorem 16.2

Using Slutsky’s Theorem, we have \(\plims\hat{\beta}^{SFE}=\sum_{s=1}^{S}\plims w_s\plims\hat{\Delta}^Y_{WW_s}\). Examining the terms separately, we have:

\[\begin{align*} \plims\hat{\Delta}^Y_{WW_s} & = \esp{Y_i|R_i=1,\mathbf{s}(i)=s}-\esp{Y_i|R_i=0,\mathbf{s}(i)=s} \\ & = \esp{Y^1_i-Y^0_i|\mathbf{s}(i)=s}\\ & = \Delta^Y_{ATE_s} \end{align*}\]

where the first equality uses the Law of Large Numbers and the second equality uses Assumptions ?? and 3.2. The last equality is a definition. We also have:

\[\begin{align*} \plims w^{SFE}_s & = \frac{\plims\frac{N_s}{N}\plims\bar{R}_s(1-\plims\bar{R}_s)}{\sum_{s\in\mathcal{S}}\plims\frac{N_s}{N}\plims\bar{R}_s(1-\plims\bar{R}_s)}\\ & = \frac{\plims\frac{N_s}{N}\pi(1-\pi)}{\sum_{s\in\mathcal{S}}\plims\frac{N_s}{N}\pi(1-\pi)}\\ & = \plims\frac{N_s}{N}\\ & = \Pr(\mathbf{s}(i)=s) \end{align*}\]

where the first equality follows from Slutsky’s Theorem and Assumption 16.1, the second equality from the fact that, as sample size grows, the proportion of treated in each strata converges to \(\pi\) and the last equality follows from the Law of Large Numbers. The final result follows from the Law of Iterated Expectations.

A.7.3 Proof of Theorem 16.3

The first step of the proof is to actually show that estimating the fully saturated model by OLS is equivalent to running a separate OLS regression in each strata. This is what the following lemma shows:

Lemma A.8 (The fully saturated model is OLS in each strata) Under Assumption 16.1, the OLS estimates \(\hat{\alpha}^{SAT}=\left\{\hat{\alpha}^{SAT}_1,\dots,\hat{\alpha}^{SAT}_S\right\}\) and \(\hat{\beta}^{SAT}=\left\{\hat{\beta}^{SAT}_1,\dots,\hat{\beta}^{SAT}_S\right\}\) in the fully saturated model:

\[\begin{align*} Y_i & = \sum_{s\in\mathcal{S}}\alpha^{SAT}_s\uns{\mathbf{s}(i)=s}+\sum_{s\in\mathcal{S}}\beta_s^{SAT}R_i\uns{\mathbf{s}(i)=s} + \epsilon^{SAT}_i, \end{align*}\]

are equivalent to estimating each regression separately on each strata using OLS and

\[\begin{align*} \hat{\beta}^{SAT}_{s} & = \hat{\Delta}^Y_{WW_s}. \end{align*}\]

Proof. The SAT estimator can be written in matrix form as follows:

\[\begin{align*} \underbrace{\left(\begin{array}{c} Y_{1} \\ \vdots \\ Y_{N} \end{array}\right)}_{Y} & = \underbrace{\left(\begin{array}{cccccccccc} 1 & 0 & \dots & 0 & R_{1} & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & \dots & \vdots\\ 1 & 0 & \dots & 0 & R_{\max\mathcal{I}(1)} & 0 & \dots & \vdots \\ 0 & 1 & \dots & 0 & 0 & R_{\min\mathcal{I}(2)} & \dots & \vdots \\ \vdots & \vdots & & \vdots & \vdots & \vdots & \dots & \vdots \\ 0 & 1 & \dots & 0 & \vdots & R_{\max\mathcal{I}(2)} & \dots & \vdots \\ \vdots & \vdots & & \vdots & \vdots & 0 & \dots & 0 \\ 0 & 0 & \dots & 1 & \vdots & \vdots & \dots & R_{\min\mathcal{I}(S)}\\ \vdots & \vdots & & \vdots & \vdots & \vdots & \dots & \vdots\\ 0 & 0 & \dots & 1 & 0 & 0 & \dots & R_{N} \\ \end{array}\right)}_{X^{SAT}} \underbrace{\left(\begin{array}{c} \alpha^{SAT}_{1}\\ \alpha^{SAT}_{2} \\ \vdots\\ \alpha^{SAT}_{S} \\ \beta^{SAT}_{1}\\ \beta^{SAT}_{2} \\ \vdots\\ \beta^{SAT}_{S} \end{array}\right)}_{\Theta^{SAT}} + \underbrace{\left(\begin{array}{c} \epsilon^{SAT}_{1} \\ \vdots \\ \epsilon^{SAT}_{N} \end{array}\right),}_{\epsilon^{SAT}} \end{align*}\]

where all observations are ordered by strata, and strata indices go from \(1\) to \(S\) since there are \(S\) strata. We are going to apply Theorem A.1, i.e. the Frisch-Waugh-Lovell Theorem, partialling out the strata fixed effects from the list of regressors. Let \(X^{SAT}_1\) denote the matrix of strata fixed effects. Following the proof of Theorem 16.1 in Section A.7.1, we know that:

\[\begin{align*} M_1Y & = Y-X^{SAT}_1((X^{SAT}_1)'X^{SAT}_1)^{-1}(X^{SAT}_1)'Y = \left(\begin{array}{c} Y_1-\bar{Y}_1 \\ \vdots \\ Y_{\max\mathcal{I}(1)}-\bar{Y}_1 \\ Y_{\min\mathcal{I}(2)}-\bar{Y}_2 \\ \vdots \\ Y_{\max\mathcal{I}(2)}-\bar{Y}_2 \\ \vdots \\ Y_{\min\mathcal{I}(S)}-\bar{Y}_S \\ \vdots \\ Y_{N}-\bar{Y}_S \end{array}\right) \end{align*}\]

Similarly, we have:

\[\begin{align*} M_1\mathbf{R} & =\left(\begin{array}{cccc} R_{1}-\bar{R}_1 & 0 & \dots & 0 \\ \vdots & \vdots & \dots & \vdots\\ R_{\max\mathcal{I}(1)}-\bar{R}_1 & 0 & \dots & \vdots \\ 0 & R_{\min\mathcal{I}(2)}-\bar{R}_2 & \dots & \vdots \\ \vdots & \vdots & \dots & \vdots \\ \vdots & R_{\max\mathcal{I}(2)}-\bar{R}_2 & \dots & \vdots \\ \vdots & 0 & \dots & R_{\min\mathcal{I}(S)}-\bar{R}_S \\ \vdots & \vdots & \dots & \vdots\\ 0 & 0 & \dots & R_{N}-\bar{R}_S \\ \end{array}\right) \end{align*}\]

where \(\mathbf{R}\) is the matrix of interactions between treatment status and strata fixed effects. Let us now form the OLS estimator of \(\mathbf{\beta}^{SAT}=\left\{\beta^{SAT}_1,\dots,\beta^{SAT}_S\right\}\) which is equal to \(\hat{\mathbf{\beta}}^{SAT}=((M_1\mathbf{R})'M_1\mathbf{R})^{-1}(M_1\mathbf{R})'M_1Y\) by Theorem A.1. We have:

\[\begin{align*} (M_1\mathbf{R})'M_1\mathbf{R} & =\left(\begin{array}{cccc} \sum_{i\in\mathcal{I}(1)}(R_{i}-\bar{R}_1)^2 & 0 & \dots & 0 \\ 0 & \sum_{i\in\mathcal{I}(2)}(R_{i}-\bar{R}_2)^2 & \dots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \dots & \sum_{i\in\mathcal{I}(S)}(R_{i}-\bar{R}_S)^2 \\ \end{array}\right) \end{align*}\]

and:

\[\begin{align*} (M_1\mathbf{R})'M_1Y & =\left(\begin{array}{c} \sum_{i\in\mathcal{I}(1)}(R_{i}-\bar{R}_1)(Y_{i}-\bar{Y}_1) \\ \sum_{i\in\mathcal{I}(2)}(R_{i}-\bar{R}_2)(Y_{i}-\bar{Y}_2) \\ \vdots \\ \sum_{i\in\mathcal{I}(S)}(R_{i}-\bar{R}_S)(Y_{i}-\bar{Y}_S)\\ \end{array}\right) \end{align*}\]

Under Assumption 16.1, we have:

\[\begin{align*} \hat{\mathbf{\beta}}^{SAT} & = \left(\begin{array}{c} \frac{\sum_{i\in\mathcal{I}(1)}(R_{i}-\bar{R}_1)(Y_{i}-\bar{Y}_1)}{\sum_{i\in\mathcal{I}(1)}(R_{i}-\bar{R}_1)^2} \\ \frac{\sum_{i\in\mathcal{I}(2)}(R_{i}-\bar{R}_2)(Y_{i}-\bar{Y}_2)}{\sum_{i\in\mathcal{I}(2)}(R_{i}-\bar{R}_2)^2} \\ \vdots \\ \frac{\sum_{i\in\mathcal{I}(S)}(R_{i}-\bar{R}_S)(Y_{i}-\bar{Y}_S)}{\sum_{i\in\mathcal{I}(S)}(R_{i}-\bar{R}_S)^2} \\ \end{array}\right) \\ & = \left(\begin{array}{c} \Delta^Y_{WW_1} \\ \Delta^Y_{WW_2} \\ \vdots \\ \Delta^Y_{WW_S} \\ \end{array}\right), \end{align*}\]

where the second equality uses Lemma A.3. Similar derivations show that \(\hat{\mathbf{\alpha}}^{SAT}=\left\{\hat{\alpha}^{SAT}_1,\dots,\hat{\alpha}^{SAT}_S\right\}\) is equal to the mean outcome of the untreated in each strata, as proven in the proof of Lemma A.3 in Section A.1.2. This completes the proof.

We now have:

\[\begin{align*} \esp{\hat{\Delta}^{Y}_{SAT}} & = \esp{\sum_{s\in\mathcal{S}}w^{SAT}_s\hat{\beta}_s^{SAT}} \\ & = \esp{\sum_{s\in\mathcal{S}}\frac{N_s}{N}\hat{\Delta}^Y_{WW_s}} \\ & = \sum_{s\in\mathcal{S}}\esp{\frac{N_s}{N}\hat{\Delta}^Y_{WW_s}} \\ & = \sum_{s\in\mathcal{S}}\esp{\frac{N_s}{N}}\esp{\hat{\Delta}^Y_{WW_s}} \\ & = \sum_{s\in\mathcal{S}}\Pr(\mathbf{s}(i)=s)\Delta^Y_{ATE_s} \\ & = \Delta^Y_{ATE} \end{align*}\]

where the first equality uses the definition of the estimator in the saturated model, the second equality uses Lemma A.8, the third and fourth equalities use Assumption 3.1, and the fifth equality uses Lemma A.1, which proves that \(\esp{\hat{\Delta}^Y_{WW_s}}=\Delta^Y_{ATE_s}=\esp{Y_i^1-Y_i^0|\mathbf{s}(i)=s}\), together with the fact that the proportion of observations in a strata in the sample is an unbiased estimate of the proportion of observations in that strata in the population. The last equality follows from the Law of Iterated Expectations.

We also have:

\[\begin{align*} \plims\hat{\Delta}^{Y}_{SAT} & = \plims\sum_{s\in\mathcal{S}}w^{SAT}_s\hat{\beta}_s^{SAT} \\ & = \plims\sum_{s\in\mathcal{S}}\frac{N_s}{N}\hat{\Delta}^Y_{WW_s} \\ & = \sum_{s\in\mathcal{S}}\plims\frac{N_s}{N}\hat{\Delta}^Y_{WW_s} \\ & = \sum_{s\in\mathcal{S}}\plims\frac{N_s}{N}\plims\hat{\Delta}^Y_{WW_s} \\ & = \sum_{s\in\mathcal{S}}\Pr(\mathbf{s}(i)=s)\Delta^Y_{ATE_s} \\ & = \Delta^Y_{ATE} \end{align*}\]

where the first equality uses the definition of the estimator in the saturated model, the second equality uses Lemma A.8, the third and fourth equalities use Slutsky’s Theorem, and the fifth equality uses the fact that, under Assumptions 3.1, 3.2 and 16.1, the OLS estimator in each strata is a consistent estimator of the average treatment effect in the strata, \(\Delta^Y_{ATE_s}=\esp{Y_i^1-Y_i^0|\mathbf{s}(i)=s}\), and that the proportion of observations in a strata in the sample is a consistent estimate of the proportion of observations in that strata in the population. The last equality follows from the Law of Iterated Expectations. This completes the proof.
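The following simulation sketch (with hypothetical strata shares, strata-specific assignment probabilities and heterogeneous treatment effects) illustrates the result: \(\hat{\Delta}^Y_{SAT}\), which weights the within-strata With/Without estimators by \(\frac{N_s}{N}\), tracks \(\Delta^Y_{ATE}\), whereas the strata-fixed-effects weights of Theorem 16.1, proportional to \(N_s\bar{R}_s(1-\bar{R}_s)\), need not when assignment probabilities differ across strata.

```python
import numpy as np

rng = np.random.default_rng(6)
S, N = 3, 20_000
p_s = np.array([0.4, 0.4, 0.2])        # hypothetical strata shares Pr(s(i)=s)
pi_s = np.array([0.1, 0.5, 0.9])       # strata-specific assignment probabilities
ate_s = np.array([0.0, 2.0, 10.0])     # strata-specific average treatment effects
ate = (p_s * ate_s).sum()              # Delta^Y_ATE

strata = rng.choice(S, size=N, p=p_s)
R = (rng.random(N) < pi_s[strata]).astype(float)
Y = strata.astype(float) + ate_s[strata] * R + rng.normal(size=N)

ww = np.array([Y[(strata == s) & (R == 1)].mean() - Y[(strata == s) & (R == 0)].mean()
               for s in range(S)])                      # within-strata With/Without estimators
n_s = np.bincount(strata, minlength=S)

delta_sat = (n_s / N * ww).sum()                        # saturated-model weights N_s / N
rbar = np.array([R[strata == s].mean() for s in range(S)])
w_sfe = n_s * rbar * (1 - rbar)
beta_sfe = (w_sfe / w_sfe.sum() * ww).sum()             # strata-fixed-effects weights

print(delta_sat, beta_sfe, ate)   # delta_sat tracks the ATE; beta_sfe weights the strata differently
```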

A.7.4 Proof of Theorem 16.4

Under Assumptions 16.1, 16.2, 16.3 and 16.4, we can apply Lemma A.5 within each strata to derive the asymptotic distribution of the \(WW\) estimator within each strata:

\[\begin{align*} \sqrt{N}(\hat{\Delta}^Y_{WW_s}-\Delta^Y_{ATE_s}) & \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\frac{1}{p_s}\left(\frac{\var{Y_i^1|R_i=1,\mathbf{s}(i)=s}}{\Pr(R_i=1)}+\frac{\var{Y_i^0|R_i=0,\mathbf{s}(i)=s}}{1-\Pr(R_i=1)}\right)\right). \end{align*}\]

The term in \(\frac{1}{p_s}\) appears because, when applying Lemma A.5 in strata \(s\), the effective sample size is \(N_s=p_sN\), following Assumption 16.5. Note also that, under Assumption 16.3, all the \(\hat{\Delta}^Y_{WW_s}\) are independent across strata, so that their covariances are zero. Using the Delta method, we thus have:

\[\begin{align*} \sqrt{N}(\hat{\Delta}^Y_{SAT}-\Delta^Y_{ATE}) & \stackrel{d}{\rightarrow} \mathcal{N}\left(0,\sum_{s\in\mathcal{S}}\frac{p^2_s}{p_s}\left(\frac{\var{Y_i^1|R_i=1,\mathbf{s}(i)=s}}{\Pr(R_i=1)}+\frac{\var{Y_i^0|R_i=0,\mathbf{s}(i)=s}}{1-\Pr(R_i=1)}\right)\right). \end{align*}\]

The result follows after simplifying \(\frac{p_s^2}{p_s}=p_s\).
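
In practice, Theorem 16.4 suggests a simple plug-in estimator of the precision of \(\hat{\Delta}^Y_{SAT}\). Here is a hedged sketch (the function name, the choice of sample analogues and the normal critical value are assumptions of mine, not taken from the text): it replaces \(p_s\), \(\Pr(R_i=1)\) and the conditional variances by their sample counterparts to form a standard error and a 95% confidence interval.

```python
# Plug-in sketch for the asymptotic variance in Theorem 16.4 (sample analogues assumed).
import numpy as np

def sat_estimate_and_se(Y, R, s):
    """Saturated estimate, standard error and 95% CI from numpy arrays Y, R, s."""
    p_treat = R.mean()                  # assumes a common assignment probability Pr(R_i=1)
    est, avar = 0.0, 0.0
    for k in np.unique(s):
        m = s == k
        p_k = m.mean()                  # sample analogue of p_s
        est += p_k * (Y[m & (R == 1)].mean() - Y[m & (R == 0)].mean())
        avar += p_k * (Y[m & (R == 1)].var(ddof=1) / p_treat +
                       Y[m & (R == 0)].var(ddof=1) / (1 - p_treat))
    se = np.sqrt(avar / len(Y))
    return est, se, (est - 1.96 * se, est + 1.96 * se)
```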

A.7.5 Proof of Theorem 16.5

Under Assumptions 16.1, 16.2, 16.3, 16.4 and 16.5, we know that \(\sqrt{N}\left(\hat{\beta}^{SFE}-\Delta^{Y}_{ATE}\right) \distr \mathcal{N}\left(0,\Omega\right)\), where \(\Omega\) is consistently estimated by \(\hat\Omega=N(X'X)^{-1}(X'\diag(\hat\epsilon^2_1,\dots,\hat\epsilon^2_N)X)(X'X)^{-1}\), with \(X= M_1\mathbf{R}\) in the notation of the proof of Theorem 16.1 in Section A.7.1 and \(\hat\epsilon_i=\hat\epsilon^{SFE}_i\). This follows first from the classical results of Huber, Eicker and White on heteroskedasticity-robust variance estimators (see for example the Wikipedia entry). I also use Theorem A.1 as it has been applied to the strata fixed effects estimator in the proof of Theorem 16.1 in Section A.7.1. Under Assumption 16.5, I can consider the strata fixed effects as constants, and therefore I can disregard the sampling noise stemming from the estimation of their proportions. As a result, I can apply the heteroskedasticity-robust variance estimator to the data after partialling out the strata fixed effects. From the proof of Theorem 16.1 in Section A.7.1, we know that \(X'X=\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)\). We also have:

\[\begin{align*} X'\diag(\hat\epsilon^2_1,\dots,\hat\epsilon^2_N)X & = \sum_{i=1}^N(R_i-\bar{R}_{\mathbf{s}(i)})^2\hat\epsilon_i^2 \\ & = \sum_{i=1}^NR_i\hat\epsilon_i^2-2\sum_{i=1}^NR_i\bar{R}_{\mathbf{s}(i)}\hat\epsilon_i^2+\sum_{i=1}^N\bar{R}_{\mathbf{s}(i)}^2\hat\epsilon_i^2 \\ & = \sum_{s=1}^SN_s^1\sigma_{\epsilon^1_s}^2-2\sum_{s=1}^S\bar{R}_sN_s^1\sigma_{\epsilon^1_s}^2+\sum_{s=1}^S\bar{R}^2_sN_s\left(\frac{N_s^1}{N_s}\sigma_{\epsilon^1_s}^2+\frac{N_s^0}{N_s}\sigma_{\epsilon^0_s}^2\right) \\ & = \sum_{s=1}^SN_s^1\sigma_{\epsilon^1_s}^2(1-2\bar{R}_s+\bar{R}^2_s)+\sum_{s=1}^SN_s^0\sigma_{\epsilon^0_s}^2\bar{R}^2_s \\ & = \sum_{s=1}^SN_s^1\sigma_{\epsilon^1_s}^2(1-\bar{R}_s)^2+\sum_{s=1}^SN_s^0\sigma_{\epsilon^0_s}^2\bar{R}^2_s, \end{align*}\]

where the first equality follows from the proof of Theorem 16.1 in Section A.7.1, the second equality expands the square and uses the fact that \(R_i^2=R_i\) since \(R_i\) is binary, and the third equality follows from summing over strata and denoting \(\sigma_{\epsilon^1_s}^2=\frac{\sum_{i\in\mathcal{I}(s)}R_i\hat\epsilon_i^2}{N_s^1}\) and \(N_s^1=\sum_{i\in\mathcal{I}(s)}R_i\), with \(\sigma_{\epsilon^0_s}^2\) and \(N_s^0\) defined analogously for the untreated. The fourth equality collects terms and the last equality uses \(1-2\bar{R}_s+\bar{R}^2_s=(1-\bar{R}_s)^2\). Rearranging gives:

\[\begin{align*} \hat\Omega & = N \sum_{s=1}^S\frac{N_s^1(1-\bar{R}_s)^2\sigma_{\epsilon^1_s}^2+N_s^0\bar{R}^2_s\sigma_{\epsilon^0_s}^2}{\left(\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)\right)^2}\\ & = N \sum_{s=1}^S\frac{N_s\bar{R}_s^2(1-\bar{R}_s)^2\frac{\sigma_{\epsilon^1_s}^2}{\bar{R}_s}+N_s(1-\bar{R}_s)^2\bar{R}^2_s\frac{\sigma_{\epsilon^0_s}^2}{1-\bar{R}_s}}{\left(\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)\right)^2}\\ & = N \sum_{s=1}^S\frac{N_s\bar{R}_s^2(1-\bar{R}_s)^2}{\left(\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)\right)^2}\left(\frac{\sigma_{\epsilon^1_s}^2}{\bar{R}_s}+\frac{\sigma_{\epsilon^0_s}^2}{1-\bar{R}_s}\right)\\ & = \sum_{s=1}^S\frac{N}{N_s}\frac{N_s^2\bar{R}_s^2(1-\bar{R}_s)^2}{\left(\sum_{s=1}^{S}N_s\bar{R}_s(1-\bar{R}_s)\right)^2}\left(\frac{\sigma_{\epsilon^1_s}^2}{\bar{R}_s}+\frac{\sigma_{\epsilon^0_s}^2}{1-\bar{R}_s}\right)\\ & = \sum_{s=1}^S\frac{(w^{SFE}_s)^2}{p_s}\left(\frac{\sigma_{\epsilon^1_s}^2}{\bar{R}_s}+\frac{\sigma_{\epsilon^0_s}^2}{1-\bar{R}_s}\right). \end{align*}\]

Using the fact that \(\plims\bar{R}_s=\Pr(R_i=1|\mathbf{s}(i)=s)\) and \(\plims\sigma_{\epsilon^d_s}^2=\var{Y_i^d|R_i=d,\mathbf{s}(i)=s}\) for \(d\in\{0,1\}\) proves the result.
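
The equality between the sandwich formula and the closed form derived above can also be checked numerically. The following sketch (with an entirely hypothetical data-generating process) computes \(\hat\Omega\) both ways after partialling out the strata fixed effects:

```python
# Numerical check sketch (hypothetical DGP): the heteroskedasticity-robust sandwich
# variance of the strata-fixed-effects coefficient equals the closed form above.
import numpy as np

rng = np.random.default_rng(2)
N, S = 20_000, 5
s = rng.integers(0, S, N)
R = rng.binomial(1, 0.3 + 0.1 * (s % 2), N)              # treatment rate varies by stratum
Y = s + (1 + 0.5 * s) * R + rng.normal(scale=1 + 0.2 * s, size=N)

Rbar = np.array([R[s == k].mean() for k in range(S)])    # within-stratum treatment rates
x = R - Rbar[s]                                          # partialled-out treatment (FWL)
ytilde = Y - np.array([Y[s == k].mean() for k in range(S)])[s]
beta = (x @ ytilde) / (x @ x)                            # strata-fixed-effects coefficient
eps = ytilde - beta * x                                  # SFE residuals

# Sandwich form: N (X'X)^{-1} (sum_i x_i^2 eps_i^2) (X'X)^{-1}
omega_sandwich = N * (x**2 @ eps**2) / (x @ x) ** 2

# Closed form from the last display of the proof
Ns = np.bincount(s, minlength=S)
w_sfe = Ns * Rbar * (1 - Rbar) / (Ns * Rbar * (1 - Rbar)).sum()
p_s = Ns / N
sig1 = np.array([(eps[(s == k) & (R == 1)] ** 2).mean() for k in range(S)])
sig0 = np.array([(eps[(s == k) & (R == 0)] ** 2).mean() for k in range(S)])
omega_closed = (w_sfe**2 / p_s * (sig1 / Rbar + sig0 / (1 - Rbar))).sum()

assert np.isclose(omega_sandwich, omega_closed)
```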

A.7.6 Proof of Theorem 16.7

Classical results on OLS with a single dummy variable as a regressor imply that:

\[\begin{align*} \hat{\alpha}^{PD} & = \frac{1}{\frac{N}{2}}\sum_{s=1}^S\Delta_s^Y \\ & = \frac{2}{N}\sum_{s=1}^S\left(Y_{i\in\mathcal{I}(s)\land D_i=1}-Y_{j\in\mathcal{I}(s)\land D_j=0}\right)\\ & = \frac{2}{N}\sum_{i=1}^N\left(Y_{i}D_i-Y_{i}(1-D_i)\right)\\ & = \frac{1}{\sum_{i=1}^ND_i}\sum_{i=1}^NY_{i}D_i-\frac{1}{\sum_{i=1}^N(1-D_i)}\sum_{i=1}^NY_{i}(1-D_i)\\ & = \hat{\Delta}^Y_{WW}, \end{align*}\]

where the fourth equality uses the fact that, in a pairwise design, \(\sum_{i=1}^ND_i=\sum_{i=1}^N(1-D_i)=\frac{N}{2}\). As a consequence, Lemma A.1 proves unbiasedness of the pairwise difference estimator. Consistency follows from the Law of Large Numbers. Direct application of Theorems 16.1 and 16.3 with \(\bar{R}_s=\frac{1}{2}\), \(\forall s\), proves that the pairwise difference estimator is also identical to the strata fixed effects and fully saturated estimators. This completes the proof.
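
Finally, the equality \(\hat{\alpha}^{PD}=\hat{\Delta}^Y_{WW}\) can be seen in a minimal sketch of a paired experiment (hypothetical data, illustrative only): the mean of the within-pair differences coincides with the difference in means between treated and untreated units.

```python
# Minimal sketch (hypothetical paired design): one treated and one untreated unit per
# pair, so the pairwise-difference estimator equals the with/without comparison.
import numpy as np

rng = np.random.default_rng(3)
n_pairs = 5_000
treated_first = rng.binomial(1, 0.5, n_pairs)             # which unit of each pair is treated
D = np.empty(2 * n_pairs, dtype=int)
D[0::2], D[1::2] = treated_first, 1 - treated_first
pair = np.repeat(np.arange(n_pairs), 2)
Y = 0.1 * pair / n_pairs + 2.0 * D + rng.normal(size=2 * n_pairs)

# Units are ordered by pair, so Y[D == 1] and Y[D == 0] are aligned pair by pair.
delta_pd = np.mean(Y[D == 1] - Y[D == 0])                  # pairwise-difference estimator
delta_ww = Y[D == 1].mean() - Y[D == 0].mean()             # with/without estimator
assert np.isclose(delta_pd, delta_ww)
```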