Chapter 11 Diffusion effects
Up until now, we have assumed that the treatment received by one unit in the population did not have any impact on any other unit. We have not encoded this assumption formally, but we have implicitly made it all along, starting with our encoding of the Rubin Causal Model in Chapter 1. In this chapter, we are going to relax that assumption, and learn how to deal with the more general cases that then appear. We are going to cover a host of very important applications, ranging from identifying contagion effects to identifying the optimal proportion of individuals to treat at independent locations. We are first going to introduce an extended Rubin Causal Model that allows for diffusion effects, along with ways to discipline this model so that it becomes estimable. We are then going to look at various ways to estimate this model and the precision of the resulting estimates, using RCTs, DID, and both parametric and nonparametric approaches. Most of these developments are fairly recent and will rapidly bring us in touch with the research frontier.
11.1 Allowing for diffusion effects in Rubin Causal Model
In this section, we are going to detail how to encode causality in the presence of diffusion effects. We are going to start with potential outcomes and a general framework, before considering two very important special cases: the case where diffusion effects are absent and the case where they take a specific form.
11.1.1 Potential outcomes and treatment effects with diffusion effects
The main starting point for an extended Rubin Causal Model is to acknowledge that the treatment status of the \(N^*\) observations in the population (with \(N^*\) possibly infinite) might influence the observed outcome for individual \(i\). Let \(\mathbf{d}=\left\{d_1,\dots,d_{N^*}\right\}\), with \(d_j\in\left\{0,1\right\}\), \(\forall j\in\left\{1,\dots,N^*\right\}\). We can therefore write the generalized potential outcome for individual \(i\) as \(Y_i^{\mathbf{d}}\). If we write \(\mathbf{D}=\left\{D_1,\dots,D_{N^*}\right\}\), we can then write the observed outcome for individual \(i\) as \(Y_i^{\mathbf{D}}\). The average effect of the treatment becomes:
\[\begin{align*} \Delta^Y_{ATE}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}}\\ & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\Pr(D_i=1)+\esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\Pr(D_i=0)\\ & = \Delta^Y_{TT}(\mathbf{D})\Pr(D_i=1)+\Delta^Y_{TUT}(\mathbf{D})\Pr(D_i=0)\\ \end{align*}\]
where \(\mathbf{0}\) is the null vector of length \(N^*\). Note that the average effect of the treatment is equal to a weighted average of the effect on the treated and the effect on the untreated. Note also that these effects differ from the ones we defined in Chapters 1 and 2: they depend on the whole vector of treatment assignments. Indeed, the effect on the untreated is not the one we defined in Section 5.1.3: it is not the difference between taking the treatment and not taking the treatment for those who do not take it. The TUT we have defined here is the difference in outcomes, for those who do not take the treatment, between a case where the treated individuals in the population receive the treatment and a case where no one receives the treatment. The only effect of the treatment on the untreated is indirect: it is the effect that transits through diffusion of the treatment effects from the treated to the untreated. This happens, for example, when farmers adopt technologies after seeing their treated neighbors adopt them, or when people contract fewer diseases because their neighbors are vaccinated. These effects can also be negative, for example when untreated job seekers are crowded out of a job by the job counselling received by the treated. In general, I like to call these effects contagion effects, to insist on the fact that they are indirect.
Note that the effect on the treated is also different and depends on the whole treatment vector. In that case, we allow for the effect on the treated to depend on whether or not some or all of their neighbors are treated. The effect of a vaccine might for example be higher when more people around us are vaccinated. Or a technology is more likely to be adopted if more neighbors are informed that it exists and encouraged to adopt it. I call these types of effects amplification effects, to denote the fact that whether the treated react a lot or not to the treatment might depend on whether their neighbors are also treated. These effects might also be negative: for example, when more job seekers receive counselling, the effectiveness of counselling on the treated might very well decrease.
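To fix ideas, here is a minimal simulation sketch of these definitions. Everything in it is an assumption made for illustration: the population size, the coefficients, and the choice to let diffusion operate through the overall share of treated units. It simply computes the effect on the treated and the effect on the untreated for one realized treatment vector, and checks that they aggregate to the average effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # hypothetical finite population size
D = rng.binomial(1, 0.3, size=n)  # realized treatment vector D
share = D.mean()                  # assumed diffusion channel: share of treated units

# Hypothetical potential outcomes (all coefficients are illustrative)
y_none = rng.normal(size=n)       # outcome when nobody is treated
direct, amplification, contagion = 2.0, 1.0, 0.5
y_D = (y_none
       + D * (direct + amplification * share)   # treated units
       + (1 - D) * contagion * share)           # untreated units: indirect effect only

effect = y_D - y_none
tt = effect[D == 1].mean()    # effect on the treated, given D
tut = effect[D == 0].mean()   # effect on the untreated, given D
ate = effect.mean()
print(f"TT={tt:.3f}  TUT={tut:.3f}  ATE={ate:.3f}")
```

In this toy model, the effect on the untreated is entirely driven by the contagion coefficient, while the effect on the treated combines the direct effect and the amplification term.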
11.1.2 Encoding the absence of diffusion effects
In this section, we are going to state the assumption of absence of diffusion effects, that is required for all our previous estimators to work. This assumption, called the Stable Unit Treatment Value Assumption, is stated as follows:
Hypothesis 11.1 (Stable Unit Treatment Value Assumption) We assume that the effect of the treatment on individual \(i\) only depends on whether \(i\) receives the treatment or not, and not on whether other individuals in the population receive the treatment as well: \(\forall i\), \(D_i=D'_i\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}\), \(\forall\mathbf{D}\neq\mathbf{D'}\).
SUTVA implies the version of Rubin Causal Model that we have introduced in Chapter 1. Indeed, SUTVA implies that the only treatment status that matters for the potential outcomes of individual \(i\) is the treatment status of individual \(i\). As a consequence, we have the following results:
Theorem 11.1 (Rubin Causal Model and Treatment Effects Under SUTVA) Under Assumption 11.1, the potential outcome of individual \(i\) only depends on its treatment status: \(\forall i\), \(Y_i^{\mathbf{D}}=Y_i^{D_i}\). As a consequence:
\[\begin{align*} \Delta^Y_{TT}(\mathbf{D}) & = \Delta^Y_{TT}\\ \Delta^Y_{TUT}(\mathbf{D}) & = 0. \end{align*}\]
Proof. The proof of the first result that \(Y_i^{\mathbf{D}}=Y_i^{D_i}\) is straightforward from Assumption 11.1. We therefore have
\[\begin{align*} \Delta^Y_{TT}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\\ & = \esp{Y_i^{D_i}-Y_i^{0}|D_i=1}\\ & = \esp{Y_i^{1}-Y_i^{0}|D_i=1}\\ & = \Delta^Y_{TT}\\ \Delta^Y_{TUT}(\mathbf{D})& = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\\ & = \esp{Y_i^{D_i}-Y_i^{0}|D_i=0}\\ & = \esp{Y_i^{0}-Y_i^{0}|D_i=0}\\ & = 0. \end{align*}\]
11.1.3 Treatment exposure
In general, it is going to prove extremely difficult to do econometric analysis using the very general setting we have defined so far, with potential outcomes depending on the whole treatment vector in the population. A useful simplifying assumption that we often have to resort to is to specify an exposure mapping, which reduces the whole treatment vector to the features that are relevant for the outcomes of interest. In order to specify the exposure mapping, we are going to assume that all units in the population are part of a network. This network is summarized by an \(N^*\times N^*\) contiguity matrix \(A\) where each element \(a_{j,i}\) (with \(j\) denoting the row and \(i\) the column) measures the strength of the relationship between \(j\) and \(i\). For example, if \(j\) mentions \(i\) as a friend, \(a_{j,i}=1\), whereas \(a_{i,j}=1\) whenever \(i\) mentions \(j\) as a friend. We can force the graph to be symmetric, that is \(a_{j,i}=a_{i,j}\), \(\forall (i,j)\), but it does not have to be the case. For example, water quality at some point \(i\) along a river stream depends on whether water is treated at a point \(j\) upstream, but water quality in \(j\) does not depend on treatments in a downstream point \(i\). Because water flows in one direction, the network is not symmetric.
Equipped with a network of links, and denoting \(\mathbf{\Omega}=\left\{0,1\right\}^{N^*}\) the set of possible treatment allocations, and \(\mathbf{\Theta}\) the set of parameters \(\theta_i\) relevant for the value of treatment exposure of unit \(i\) (possibly containing features of the \(A\) matrix), we can define treatment exposure as a mapping \(f\) from \(\mathbf{\Omega}\times\mathbf{\Theta}\) to \(\mathbf{\Delta}\), the set of possible treatment exposures: \(\Delta_i=f(\mathbf{D},\theta_i)\). A key assumption we are going to make is that the exposure mapping is properly specified, that is, that it captures perfectly the intricacies of the effects of various treatment vectors:
Hypothesis 11.2 (Properly specified exposure mapping) \(\forall i\), \(\forall\mathbf{D}\neq\mathbf{D'}\in\mathbf{\Omega}\), \(\forall \theta_i\in\mathbf{\Theta}\), \(f(\mathbf{D},\theta_i)=f(\mathbf{D'},\theta_i)\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}\).
Under Assumption 11.2, the potential outcomes can be written as functions of treatment exposure only: \(Y_i^{\Delta_i}\).
As a consequence, we can now define the average treatment effect of the treatment on the treated as follows:
\[\begin{align*} \Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}, \end{align*}\]
where \(\Delta^Y_{TT}(\mathbf{d})\) measures the impact of treatment exposure \(\mathbf{d}\) on those who have received it.
Remark. The framework based on the use of an exposure mapping has been developed by Manski (2013) and Aronow and Samii (2017).
Remark. I use the term “average treatment effect on the treated” because \(\Delta^Y_{TT}(\mathbf{d})\) measures the effect of receiving treatment exposure \(\mathbf{d}\) on those who receive it.
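As an illustration, here is a sketch of one possible exposure mapping, in which exposure is the pair made of own treatment status and the share of treated neighbors. This particular mapping, the function name `exposure`, the toy network and the convention that unit \(i\) is exposed through the units listed in row \(i\) of the matrix are all assumptions chosen for the example, not something imposed by the framework.

```python
import numpy as np

def exposure(D, A):
    """Hypothetical exposure mapping: (own treatment, share of treated neighbors).

    Assumption: unit i is exposed through the units j with A[i, j] > 0.
    Units without any neighbor get a neighbor share of 0.
    """
    D = np.asarray(D, dtype=float)
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)                      # number (or weight) of neighbors
    treated_links = A @ D                       # treated neighbors, weighted by A
    share = np.divide(treated_links, degree,
                      out=np.zeros_like(degree), where=degree > 0)
    return np.column_stack([D, share])

# Toy directed network with four units: unit 3 lists nobody
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]])
D1 = np.array([1, 0, 1, 0])
D2 = np.array([1, 0, 1, 1])   # same as D1 except that unit 3 is treated

e1, e2 = exposure(D1, A), exposure(D2, A)
print(e1)
print(e2)
```

Units 0 and 1 have the same exposure under both treatment vectors, so a properly specified mapping of this form would require their potential outcomes to be identical under `D1` and `D2`. Unit 2, which lists unit 3 as a neighbor, sees its exposure change.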
We are now equipped with tools that enable us to define treatment effects in the presence of diffusion effects, and to identify various types of diffusion effects. The key concept that we are going to have to specify is treatment exposure: how does it change with various applications and how do we go about identifying it in various precise cases? What can we do as well to test for features of treatment exposure without completely specifying it? This is what we are going to see in what follows, first in the case of Randomized Controlled Trials, and then in the case of Difference in Differences. We are going to go step by step, and we are going to start with simpler networks, that I call coarse networks, before looking at what we can do with more complex networks.
11.1.4 Fundamental problem of causal inference for diffusion effects
With diffusion effects and treatment exposure, the Fundamental Problem of Causal Inference strikes again. Let us state the problem using our more general framework of treatment exposure:
Theorem 11.2 (Fundamental problem of causal inference with diffusion effects) It is impossible to observe \(\Delta^Y_{TT}(\mathbf{d})\), \(\forall \mathbf{d}\in\mathbf{\Delta}\), either in the population or in the sample.
Proof. For the population TT:
\[\begin{align*} \Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}} \\ & = \esp{Y_i^{\mathbf{d}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\ & = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}. \end{align*}\]
\(\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\) is unobserved, and so is \(\Delta^Y_{TT}(\mathbf{d})\). A similar reasoning holds for the sample average treatment effect.
We also have a novel formulation of the bias of intuitive methods. For example, selection bias now depends on \(\mathbf{d}\):
\[\begin{align*} \Delta^Y_{SB}(\mathbf{d}) & = \Delta^Y_{WW}(\mathbf{d})-\Delta^Y_{TT}(\mathbf{d}) \\ & = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i|\Delta_i=\mathbf{0}}-\esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\ & = \esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{0}}. \end{align*}\]
Remark. Why is the with/without comparison of individuals with treatment exposure \(\Delta_i=\mathbf{d}\) and those with treatment exposure \(\Delta_i=\mathbf{0}\) biased for average treatment effect on the treated \(\Delta^Y_{TT}(\mathbf{d})\)? This is because treatment exposure might be correlated with unobserved confounders: individuals with higher treatment exposure might be systematically different from those with the reference level of treatment exposure (here \(\mathbf{0}\)).
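The decomposition of the with/without comparison into the effect on the treated and selection bias can be checked on simulated data. The sketch below makes strong simplifying assumptions that are not in the text: exposure takes only two values (the reference level and one level \(\mathbf{d}\)), the effect of exposure is constant and equal to one, and a single unobserved confounder `u` drives both exposure and outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u = rng.normal(size=n)                            # unobserved confounder (hypothetical)
exposed = rng.random(n) < 1 / (1 + np.exp(-u))    # exposure d is more likely when u is high
y_ref = u + rng.normal(size=n)                    # potential outcome under reference exposure
y_d = y_ref + 1.0                                 # potential outcome under exposure d
y = np.where(exposed, y_d, y_ref)                 # observed outcome

ww = y[exposed].mean() - y[~exposed].mean()           # with/without comparison
tt = (y_d - y_ref)[exposed].mean()                    # effect on the treated
sb = y_ref[exposed].mean() - y_ref[~exposed].mean()   # selection bias
print(f"WW={ww:.3f}  TT={tt:.3f}  SB={sb:.3f}")
```

Because exposed units have a higher `u` on average, the with/without comparison overstates the effect on the treated in this example.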
11.2 Diffusion effects with coarse networks
Coarse networks are networks where we do not have a lot of information on the connections between individuals: we only know whether they belong to the same influence group or not. This type of network characterizes, for example, a group of villages, or municipalities, or classes, for which all we know about the links between individuals is that they belong to the same group. More formally, coarse networks can be characterized by the following property:
Hypothesis 11.3 (Coarse network) We say that our population is characterized by a coarse network if the observed matrix of connections \(A\) is block diagonal and we do not know which nodes are activated within each block.
Remark. A block diagonal influence matrix is composed of a set of groups or clusters within which observations influence each other and across which we assume all influences are muted. This is of course a simplification: some units within a cluster might not really be connected, while some units might be connected to units in another group. Also, not all units might be equivalent within a group, with some being more central (e.g. connected) than others. In a coarse network, we are assuming these differences away.
Remark. Another way of framing coarse networks is to say that there is unknown interference within clusters (and no interference across). This is Viviano (2023)’s definition. With Viviano’s approach to coarse networks, we do not know which units interfere within each network, nor how they do so.
With a coarse network approach, under Assumption 11.3, we might specialize the exposure mapping to things we might know, that is whether the unit itself is treated or not and the proportion of units that are treated in a given cluster \(c\), \(p_c\), or, more generally, the proportion of units with characteristics \(X_i=x\) that are treated within clusters with characteristics \(Z_c=z\): \(p(x,z)\). As a consequence, we might write potential outcomes as \(Y_i^{D_i,p_c}\) or, more generally, \(Y_i^{D_i,\left\{p(x,Z_c)\right\}_{x\in\mathcal{X}}}\), with \(\mathcal{X}\) the support of \(X_i\).
Remark. Lemma 2.1 in Viviano (2023) shows an example of assumptions under which we can simplify the exposure mapping and obtain potential outcomes as a function of the proportion of treated units and cluster and unit characteristics.
Under Assumption 11.3, we can write the average effect of treating a cluster with a proportion of treated \(p_c=p\) as follows:
\[\begin{align*} \Delta^Y_{TT}(p) & = \esp{Y_i^{D_i,p}-Y_i^{0,0}|p_c=p}\\ & = \esp{Y_i^{1,p}-Y_i^{0,0}|D_i=1,p_c=p}\Pr(D_i=1|p_c=p)\\ & \phantom{=} +\esp{Y_i^{0,p}-Y_i^{0,0}|D_i=0,p_c=p}\Pr(D_i=0|p_c=p)\\ & = \Delta^Y_{TDT}(p)\Pr(D_i=1|p_c=p)+\Delta^Y_{TIT}(p)\Pr(D_i=0|p_c=p), \end{align*}\]
where \(\Delta^Y_{TDT}(p)\) is the Average Treatment Effect on the Directly Treated and \(\Delta^Y_{TIT}(p)\) is the Average Treatment Effect on the Indirectly Treated.
The main question under Assumption 11.3 is to find the allocation of treated units that maximizes some objective function. We are going to make a distinction between two different cases:
- In the first case, we have a pre-specified budget for treatment effort (in terms of number of treated units) and we have to choose how to spend it optimally. This often happens in practical policy applications where the budget has been pre-approved but you do not know how best to allocate it.
- In the second case, we already have an existing policy in place, and we would like to know whether it is optimal, and in which direction we should take it if we happen to have some additional budget.
11.2.1 Optimal treatment allocation under monotone response
I develop results on this setting in my own ongoing research. In order to fix ideas, we are going to start with a simple network with two clusters. We will then look at what happens with a more general network. Finally, we will look at how we can use two-step clustered RCTs to estimate the required parameters and decide on the optimal allocation.
11.2.1.1 A simple model
Let’s start with a very simple example of a network with two clusters \(1\) and \(2\). Let’s also consider only the case of a discrete outcome (such as participation in a program, getting vaccinated, contracting a disease, adopting a technology, etc.). For simplicity, we are also going to write that potential outcomes are realizations of a continuous utility variable crossing a threshold:
\[\begin{align*} Y_{i}^{0,P_c} & = \uns{\underbrace{\alpha_0 + \beta_0 P_{c} -\epsilon_{i,0}}_{Y^*_{i,0}}\geq0}\\ Y_{i}^{1,P_c} & = \uns{\underbrace{\alpha_1 + \beta_1 P_{c} -\epsilon_{i,1}}_{Y^*_{i,1}}\geq0}. \end{align*}\]
What this model tells us is that, when no one else in the cluster is treated (\(P_c=0\)), the individual level effect of being treated is equal to \(\Delta^Y_i=\uns{\epsilon_{i,1}\leq\alpha_1}-\uns{\epsilon_{i,0}\leq\alpha_0}\). When some units start receiving the treatment, we have two indirect effects:
- Increasing the proportion of treated units impacts the outcomes of untreated units, through \(\beta_0\). This is what I call a contagion effect, in which untreated units are somehow contaminated by the treatment received by the treated individuals in the same cluster. Contagion might refer to receiving information about the existence of a program and eventually deciding to enroll, or being protected by the fact that some neighbors are taking a treatment (in that case, contagion effects might actually prevent some untreated units from being contaminated). Contagion effects might be negative, if for example treated individuals who receive job training or job search assistance end up finding jobs that would have been allocated to some of the untreated individuals in the absence of the treatment.
- Increasing the proportion of treated units impacts the outcomes of treated units, through \(\beta_1\). This is what I call an amplification effect. There is amplification each time a treated unit increases its likelihood of a positive outcome because more units are treated. This might happen when technological adoption occurs only after most individuals in the cluster have been exposed to it and convinced to make a change.
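The two indirect effects can be visualized with a small simulation of the threshold model. This is only a sketch: the intercepts and slopes of the two latent indices (`icpt0`, `icpt1`, `beta0`, `beta1`) are hypothetical values, the errors are assumed uniform on \(\left[0,1\right]\), and treatment is assigned independently with probability \(p\), so that the realized share of treated units is only approximately \(p\).

```python
import numpy as np

def simulate_cluster(p, rng, n=100_000, icpt0=0.1, icpt1=0.3, beta0=0.2, beta1=0.4):
    """Simulate a cluster of n units in which each unit is treated with probability p.

    icpt0/icpt1 are the intercepts and beta0/beta1 the slopes of the untreated
    and treated latent indices. All values are hypothetical.
    """
    treated = rng.random(n) < p
    eps0, eps1 = rng.random(n), rng.random(n)   # uniform errors on [0, 1]
    y0 = icpt0 + beta0 * p - eps0 >= 0          # outcome without the treatment
    y1 = icpt1 + beta1 * p - eps1 >= 0          # outcome with the treatment
    return treated, np.where(treated, y1, y0)

rng = np.random.default_rng(2)
d_none, y_none = simulate_cluster(0.0, rng)
d_half, y_half = simulate_cluster(0.5, rng)
d_all, y_all = simulate_cluster(1.0, rng)

# Contagion: change in the outcomes of the untreated when p goes from 0 to 0.5
contagion = y_half[~d_half].mean() - y_none.mean()
# Amplification: change in the outcomes of the treated when p goes from 0.5 to 1
amplification = y_all.mean() - y_half[d_half].mean()
print(f"contagion={contagion:.3f}  amplification={amplification:.3f}")
```

With these illustrative values, the contagion effect should be close to `0.5 * beta0` and the amplification effect close to `0.5 * beta1`, up to sampling noise.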
In order to get even more intuition on this problem, we are going to specialize it even further by making the following set of assumptions:
Hypothesis 11.4 (Simplified Allocation Problem) We assume that the allocation problem is characterized as follows:
- There are only two nodes \(c=1\) and \(c=2\).
- A mass \(1\) of units resides at each node.
- We can only treat a mass \(1\) of units.
- We assume the constraint is saturated so that we use all available treatments: \(p_1+p_2=1\).
- \(\epsilon_1\) and \(\epsilon_0\) are uniform on \(\left[0,1\right]\).
Under Assumption 11.4, we can set \(p_1=p\) and \(p_2=1-p\). Let’s assume our goal is to maximize the total number of people with \(Y_i=1\). Under Assumption 11.4, this is equivalent to maximizing the sum of the adoption rates at both nodes. Using the fact that the constraint is saturated, we can write the objective function we aim to maximize as follows:
\[\begin{align*} W(p) & = \underbrace{pF_{\epsilon,1}(\alpha_1 + \beta_1p)+(1-p)F_{\epsilon,0}(\alpha_0 + \beta_0p)}_{A(p)}\\ & \phantom{=}+\underbrace{(1-p)F_{\epsilon,1}(\alpha_1 + \beta_1(1-p))+pF_{\epsilon,0}(\alpha_0 + \beta_0(1-p))}_{A(1-p)} \end{align*}\]
The \(A\) function measures how much the probability of observing the favorable outcome \(Y_i=1\) in a given cluster increases with the proportion of treated individuals in the cluster, \(p\). It turns out that the properties of the \(A\) function are key to determine the optimal allocation of treatment effort across nodes in the general case with more than two nodes. For now, in the two-node case and under substantial simplifications, we have the following result:
Theorem 11.3 (Optimal allocation of treatment effort with two nodes) Under Assumptions 11.2, 11.3 and 11.4, we have three possible cases for the optimal allocation of treatment effort:
- When amplification effects dominate (\(\beta_1>\beta_0\)): either \(p^*=1\) or \(p^*=0\)
- When contagion effects dominate (\(\beta_0>\beta_1\)): \(p^*=\frac{1}{2}\)
- When amplification and contagion effects are of the same size (\(\beta_0=\beta_1\)): \(p^*\in\left[0,1\right]\).
Proof. Under Assumption 11.4, we have:
\[\begin{align*} W(p) & = p(\alpha_1 + \beta_1p)+(1-p)(\alpha_0 + \beta_0p)+(1-p)(\alpha_1 + \beta_1(1-p))+p(\alpha_0 + \beta_0(1-p))\\ & = \alpha_0+\alpha_1+\beta_1+2(\beta_0-\beta_1)p(1-p), \end{align*}\]
where the second line follows after some algebra. The problem \(\max_{p\in\left[0,1\right]}W(p)\) has the following first order condition: \(W'(p)=2(\beta_0-\beta_1)(1-2p)=0\) and the following second order condition: \(W''(p)=-4(\beta_0-\beta_1)\). When \(\beta_0>\beta_1\), \(W''(p)<0\), and the interior solution \(p^*=\frac{1}{2}\) maximizes \(W\). When \(\beta_0<\beta_1\), \(W''(p)>0\), and the interior solution \(p^*=\frac{1}{2}\) minimizes \(W\). In that case, the optimal solution is at a corner, either at \(p^*=1\) or at \(p^*=0\). Since \(W(1)=W(0)\), they are both maxima. When \(\beta_0=\beta_1\), \(W\) is constant and any value in \(\left[0,1\right]\) maximizes \(W\).
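For completeness, the algebra behind the simplification of \(W(p)\) can be spelled out as follows, grouping terms by parameter and using the fact that \(p^2+(1-p)^2=1-2p(1-p)\):

\[\begin{align*} W(p) & = \alpha_1\left[p+(1-p)\right]+\alpha_0\left[(1-p)+p\right]+\beta_1\left[p^2+(1-p)^2\right]+2\beta_0p(1-p)\\ & = \alpha_0+\alpha_1+\beta_1\left[1-2p(1-p)\right]+2\beta_0p(1-p)\\ & = \alpha_0+\alpha_1+\beta_1+2(\beta_0-\beta_1)p(1-p). \end{align*}\]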
Remark. Theorem 11.3 shows that when amplification effects dominate, it is optimal to focus all treatment effort on one of the two nodes (for example the first, but they are interchangeable). This is because returns are increasing in this case: the \(A\) function is convex, with more people responding to the treatment as more of them receive the treatment. When contagion effects dominate, it is optimal to treat both nodes, with half of the units at each node receiving the treatment. This is because in that case, the \(A\) function is concave, and the marginal returns are decreasing when we treat more people. When contagion and amplification effects are equal, there is no unique optimum: any allocation \(p\) yields the same result.
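The three cases can also be checked numerically with a short sketch. The parameter values below are hypothetical and are chosen so that both latent indices stay within \(\left[0,1\right]\), where the uniform distribution function is the identity. The helper names `welfare` and `argmax_set` are introduced only for this example.

```python
import numpy as np

def welfare(p, a0, a1, b0, b1):
    """Objective with two nodes and uniform errors: A(p) + A(1 - p)."""
    A = lambda q: q * (a1 + b1 * q) + (1 - q) * (a0 + b0 * q)
    return A(p) + A(1 - p)

grid = np.linspace(0, 1, 101)

def argmax_set(a0, a1, b0, b1, tol=1e-12):
    """Grid points at which the objective reaches its maximum (up to tol)."""
    w = welfare(grid, a0, a1, b0, b1)
    return grid[w >= w.max() - tol]

contagion_dominates = argmax_set(0.1, 0.3, b0=0.4, b1=0.2)
amplification_dominates = argmax_set(0.1, 0.3, b0=0.2, b1=0.4)
equal_effects = argmax_set(0.1, 0.3, b0=0.3, b1=0.3)

print(contagion_dominates)       # interior optimum
print(amplification_dominates)   # corner optima
print(len(equal_effects))        # every grid point is optimal
```

A grid search is of course not a proof, but it gives a quick check of the closed-form expression for \(W(p)\) and of the location of the optimum in each case.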
11.2.1.2 A general model
One open question is whether we can generalize the result in Theorem 11.3 to a much more general setting with several nodes and more general functional forms. It is actually the case. Let us now formulate a more general setting:
Hypothesis 11.5 (Symmetric Allocation Problem) We assume that the allocation problem is characterized as follows:
- \(K\) nodes indexed from \(1\) to \(K\), and each node has size \(n_k\).
- At each node, we can choose to treat \(r_k\) individuals.
- The total number of individuals on the network is \(N=\sum_{k=1}^Kn_k\).
- The total number of treated individuals is \(R=\sum_{k=1}^Kr_k\).
- We cannot treat more than \(\bar{R}\) individuals.
- We cannot treat everyone: \(\bar{R}<N\).
- The expected outcome at each node (or response function) is only a function of \(p\), that we denote \(A(p)\), with \(A'>0\).
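Remark. Note that the letter \(A\) is used for two different objects in this chapter: the matrix of connections introduced in Section 11.1.3 and the response function defined in Assumption 11.5. In Section 11.2.1, \(A\) always refers to the response function.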
Remark. Assumption 11.5 is mainly restrictive in making the problem symmetric: all nodes are treated in the same way. The only thing that distinguishes nodes is their respective size. Apart from that, they all respond in the same (average) way to the treatment. We do not try to distinguish between nodes based on observed characteristics of the nodes. We also do not try to vary the identity of treated units based on their observed characteristics. Another restriction is that \(A'>0\): we only consider treatments for which the response is always strictly increasing in \(p\) (and not weakly).
Under Assumptions 11.2, 11.3 and 11.5, we can cast our optimization problem as follows:
\[\begin{align*} \max_{\left\{r_k\right\}_{k=1}^K} & \sum_{k=1}^K n_kA\left(\frac{r_k}{n_k}\right)\\ & \text{under the constraints}\\ R & =\sum_{k=1}^Kr_k \leq \bar{R}\\ r_k & \leq n_k\text{, }\forall k\\ r_k & \geq 0\text{, }\forall k. \end{align*}\]
In my work, I have been able to solve this problem for a smooth response function \(A\), in the following sense:
Hypothesis 11.6 (Monotone Response Function) We assume that the second derivative of the response function \(A\) has a constant sign on its full support: either \(A''(p)>0\) \(\forall p\in\left[0,1\right]\) or \(A''(p)<0\) \(\forall p\in\left[0,1\right]\).
We can indeed prove the following result:
Theorem 11.4 (Optimal allocation under monotone response with \(K\) symmetric nodes) Under Assumptions 11.2, 11.3, 11.5 and 11.6, the optimal allocation of treatment across nodes is as follows:
\[\begin{align*} \frac{r^*_k}{n_k} & = \begin{cases} \frac{\bar{R}}{N}\text{, }\forall k & \text{ if }A''<0\\ \begin{cases} 0 & \text{ for a set of nodes } \mathcal{J} \text{ such that } \sum_{j\in\mathcal{J}}n_j=N-\bar{R},\\ 1 & \text{ for a set of nodes } \mathcal{L} \text{ such that } \sum_{l\in\mathcal{L}}n_l=\bar{R}, \end{cases} & \text{ if } A''>0.\\ \end{cases} \end{align*}\]
Proof. See Section A.6.1.
Remark. Theorem 11.4 shows that the very simple intuition that we got in the two nodes problem transports well to more complex settings. The optimal allocation depends on the sign of the second derivative. When returns are decreasing, we treat each node symmetrically with the same share \(p^*=\frac{\bar{R}}{N}\) of the treatment effort. When returns are increasing, we fully treat (\(p^*=1\)) a set of nodes that contains a share \(\frac{\bar{R}}{N}\) of the population, and we leave the remaining nodes, which contain a share \(1-\frac{\bar{R}}{N}\) of the population, untreated (\(p^*=0\)).
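A small numerical sketch can illustrate the two allocations of Theorem 11.4. The node sizes, the budget and the two quadratic response functions below are hypothetical, and the comparison with randomly drawn feasible allocations is only a sanity check, not a proof.

```python
import numpy as np

def total_outcome(r, n, A):
    """Objective of the allocation problem: sum over nodes of n_k * A(r_k / n_k)."""
    return float(np.sum(n * A(r / n)))

n = np.array([100.0, 200.0, 300.0, 400.0])   # hypothetical node sizes, N = 1000
R_bar = 300.0                                 # hypothetical treatment budget
N = n.sum()

equal_share = n * R_bar / N                       # r_k / n_k = R_bar / N at every node
concentrated = np.array([0.0, 0.0, 300.0, 0.0])   # fully treat the node of size R_bar

concave = lambda p: 0.2 + 0.6 * p - 0.2 * p**2    # A' > 0 and A'' < 0 on [0, 1]
convex = lambda p: 0.2 + 0.2 * p + 0.4 * p**2     # A' > 0 and A'' > 0 on [0, 1]

rng = np.random.default_rng(5)

def random_feasible():
    """Draw an allocation with 0 <= r_k <= n_k and sum(r_k) <= R_bar."""
    r = rng.random(n.size) * n
    return r * min(1.0, R_bar / r.sum())

draws = [random_feasible() for _ in range(2000)]
best_random_concave = max(total_outcome(r, n, concave) for r in draws)
best_random_convex = max(total_outcome(r, n, convex) for r in draws)

print(total_outcome(equal_share, n, concave), best_random_concave)
print(total_outcome(concentrated, n, convex), best_random_convex)
```

In this example, the equal-share allocation is never beaten by a random feasible allocation under the concave response function, and the concentrated allocation is never beaten under the convex one.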
Remark. There are several open questions on this research front. To list but a few:
- Can we relax Assumption 11.6? For example, we know that \(A''\) does not have a constant sign when the error terms are normal in the model with two nodes, but we still have an optimal solution that has the same shape.
- Can we relax Assumption 11.5? In particular, can we allow for responses that vary as a function of node characteristics and can we allow for treatment allocation based on unit characteristics?
11.2.1.3 Using two-step clustered randomized controlled trials to find the optimal treatment allocation
In this section, we are going to see that conducting a two-step clustered randomized controlled trial is going to enable us to identify the optimal treatment allocation under Assumptions 11.2, 11.3 and 11.4. A two-step clustered randomized controlled trial works as follows:
- In a first step, we randomly select three sets of nodes, \(ST\) and \(PT\) and \(SC\), with \(K_{ST}+K_{PT}+K_{SC}=\tilde{K}\) and \(\tilde{K}\leq K\).
When \(\tilde{K}< K\), the \(\tilde{K}\) selected nodes are a random subset of the \(K\) nodes.
- Nodes that belong to \(ST\), the set of nodes of size \(K_{ST}\), are called Super Treated nodes. The proportion of treated units is \(p^R_c=1\), \(\forall c \in ST\).
- Nodes that belong to \(PT\), the set of nodes of size \(K_{PT}\), are called Partially Treated nodes. The proportion of treated units is \(p^R_c=\frac{\bar{R}}{N}\equiv p^*\), \(\forall c \in PT\).
- Nodes that belong to \(SC\), the set of nodes of size \(K_{SC}\), are called Super Control nodes. The proportion of treated units is \(p^R_c=0\), \(\forall c \in SC\).
- In a second step, we randomly select \(N^1_c=\frac{\bar{R}}{N}N_c\) units to be treated (with \(R_i=1\)) and \(N^0_c=N_c-N^1_c\) to be in the control group (\(R_i=0\)), \(\forall c \in PT\), with \(N_c\) the number of units in node \(c\).
When implementing the treatment, all units in \(ST\) are treated, only \(N^1_c\) units are treated in \(PT\) and no unit is treated in \(SC\).
Remark. Note that rigorously, we should have \(N_c^1=\lfloor\frac{\bar{R}}{N}N_c\rfloor\), but we disregard the complexities brought about by the fact that the number of units has to be an integer.
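The two randomization steps can be sketched as follows. The function name `two_step_assignment`, its arguments and the example values are hypothetical, and the sketch uses the floor of \(p^*N_c\) for the number of treated units in Partially Treated nodes.

```python
import numpy as np

def two_step_assignment(cluster_sizes, k_st, k_pt, k_sc, p_star, rng):
    """Hypothetical sketch of a two-step clustered randomization.

    Step 1: randomly draw k_st Super Treated, k_pt Partially Treated and
            k_sc Super Control nodes among the available nodes.
    Step 2: within each Partially Treated node, randomly draw
            floor(p_star * N_c) units to be treated.
    Returns the arm of each selected node and the vector R_i of each node.
    """
    K = len(cluster_sizes)
    if k_st + k_pt + k_sc > K:
        raise ValueError("not enough nodes for the requested design")
    order = rng.permutation(K)
    labels = ["ST"] * k_st + ["PT"] * k_pt + ["SC"] * k_sc
    arm = {int(c): lab for c, lab in zip(order, labels)}   # nodes not drawn are left out
    treatment = {}
    for c, lab in arm.items():
        n_c = cluster_sizes[c]
        r = np.zeros(n_c, dtype=int)
        if lab == "ST":
            r[:] = 1                                       # everyone is treated
        elif lab == "PT":
            n1 = int(np.floor(p_star * n_c))
            r[rng.choice(n_c, size=n1, replace=False)] = 1
        treatment[c] = r                                   # SC nodes stay at zero
    return arm, treatment

rng = np.random.default_rng(4)
sizes = [20] * 12
arm, treatment = two_step_assignment(sizes, k_st=4, k_pt=4, k_sc=4, p_star=0.5, rng=rng)
print(sorted(arm.items()))
```

Fixing the seed of the random number generator makes the allocation reproducible, which is useful when the design has to be documented or audited.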
The one thing we need to identify now in order to apply Theorem 11.4 is the sign of the second derivative of the \(A\) function. We are going to show that the sign of \(A''\) can be identified in a two-step clustered randomized controlled trial. Before that, we are going to encode the validity of the two-step clustered randomized controlled trial:
Hypothesis 11.7 (Independence in a two-step clustered design) We assume that the allocation of the proportion of neighbors treated and of the individual treatment level are independent of potential outcomes:
\[\begin{align*} (R_i,p^R_c)\Ind\left(\left\{Y_i^{0,p},Y_i^{1,p}\right\}_{p\in\left[0,1\right]}\right). \end{align*}\]
We also assume that the randomized allocation does not interfere with how units respond to the treatment:
Hypothesis 11.8 (Validity of the 2-step clustered design) We assume that the randomized allocation of the program does not interfere with how potential outcomes are generated:
\[\begin{align*} Y_i & = \begin{cases} Y_i^{1,p} & \text{ if } R_i=1 \text{ and } p^R_c=p\\ Y_i^{0,p} & \text{ if } R_i=0 \text{ and } p^R_c=p \end{cases} \end{align*}\]
with \(Y_i^{1,p}\) and \(Y_i^{0,p}\) the same potential outcomes as defined with a routine allocation of the treatment.
We are now equipped to prove the identification of \(A''\):
Theorem 11.5 (Identification of \(A''\) in a 2-step clustered randomized controlled trial) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, the sign of \(A''\) is identified as follows:
\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|p^R_c=1}-\esp{Y_i|p^R_c=p^*}}{1-p^*}-\frac{\esp{Y_i|p^R_c=p^*}-\esp{Y_i|p^R_c=0}}{p^*}\right). \end{align*}\]
Proof. See Section A.6.2.
One thing that is pretty amazing is that we can relate the sign of \(A''\) to the relative size of contagion and amplification effects:
Theorem 11.6 (The sign of \(A''\) depends on the relative size of contagion vs amplification effects) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, we have:
\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y^{1,1}_i-Y^{1,p^*}_i}}{1-p^*} -\frac{\esp{Y^{0,p^*}_i-Y^{0,0}_i}}{p^*}\right), \end{align*}\] where \(\esp{Y^{1,1}_i-Y^{1,p^*}_i}\) measures the strength of amplification effects and \(\esp{Y^{0,p^*}_i-Y^{0,0}_i}\) measures the strength of contagion effects.
Proof. See Section A.6.3.
Theorem 11.6 suggests an alternative identification strategy for the sign of \(A''\):
Theorem 11.7 (Identifying the sign of \(A''\) from the relative size of contagion and amplification effects) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, we have:
\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|R_i=1,p^R_c=1}-\esp{Y_i|R_i=1,p^R_c=p^*}}{1-p^*}\right.\\ & \phantom{=\text{sign}\left(\right.}\left.-\frac{\esp{Y_i|R_i=0,p^R_c=p^*}-\esp{Y_i|R_i=0,p^R_c=0}}{p^*}\right) \end{align*}\]
Thanks to Theorems 11.5 and 11.7, we therefore have two ways to estimate the sign of \(A''\): either by comparing the overall changes in expected outcomes when moving from \(0\) to \(p^*\) and from \(p^*\) to \(1\), or by comparing the relative size of amplification and contagion effects. As a consequence, we can form two with/without estimators of \(A''\):
\[\begin{align*} \hat{A}''_{All}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}_{PT}}Y_i}{N_{PT}}}{1-p^*}- \frac{\frac{\sum_{i\in\mathcal{I}_{PT}}Y_i}{N_{PT}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*}\\ \hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}^1_{PT}}Y_i}{N^1_{PT}}}{1-p^*}- \frac{\frac{\sum_{i\in\mathcal{I}^0_{PT}}Y_i}{N^0_{PT}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*}, \end{align*}\]
with \(\mathcal{I}_{T}\), \(T\in\left\{ST,SC,PT\right\}\), the set of units \(i\) that belong to a cluster of type \(T\), \(\mathcal{I}^d_{PT}\), \(d\in\left\{0,1\right\}\), the set of units that belong to a cluster of type \(PT\) and have \(R_i=d\), \(N_{T}\), \(T\in\left\{ST,SC,PT\right\}\), the number of units \(i\) belonging to clusters of type \(T\), and \(N^d_{PT}\), \(d\in\left\{0,1\right\}\), the number of units that belong to clusters of type \(PT\) and have \(R_i=d\). Following usual arguments, these estimators are both unbiased and consistent (as the number of clusters goes to infinity) for \(A''(\frac{1}{2})\). Their components can both be estimated separately by using OLS with a linear model on separate subsamples. The covariance of each separate with/without comparison can be estimated by estimating both components jointly, for example by estimating the following model by OLS:
\[\begin{align*} Y_i & = \alpha^{All} + \beta^{All}_{PT}\uns{i\in\mathcal{I}_{PT}} + \beta^{All}_{ST}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{All}\\ Y_i & = \alpha^{Diff} + \alpha^{Diff}_{1}\uns{R_i=1} + \beta^{Diff}_{0}\uns{i\in\mathcal{I}^0_{PT}}+ \beta^{Diff}_{1}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{All}. \end{align*}\]
With these parameter estimates, we have:
\[\begin{align*} \hat{A}''_{All}(\frac{1}{2}) & = \frac{\hat\beta^{All}_{ST}-\hat\beta^{All}_{PT}}{1-p^*}-\frac{\hat\beta^{All}_{PT}}{p^*}\\ \hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\hat\beta^{Diff}_{1}}{1-p^*}-\frac{\hat\beta^{Diff}_{0}}{p^*}. \end{align*}\]
To estimate the precision of each of the parameters, one has to use standard errors clustered at the cluster level. To obtain the precision of \(\hat{A}''(\frac{1}{2})\), one can simply use the Delta Method.
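As a rough illustration, the two with/without estimators can be computed directly from group means. The sketch below is only indicative: the function name, the argument names and the string labels `"ST"`, `"SC"` and `"PT"` are hypothetical conventions chosen here, and the toy data are made up. It does not compute clustered standard errors.

```python
from statistics import mean


def a_second_estimators(y, cluster_type, r, p_star):
    """With/without estimators of A'' from unit-level data.

    y: outcomes; cluster_type: "ST", "SC" or "PT" for each unit (hypothetical labels);
    r: individual treatment status R_i (0/1); p_star: target share treated in PT clusters.
    """
    def avg(keep):
        return mean(yi for yi, c, ri in zip(y, cluster_type, r) if keep(c, ri))

    m_st = avg(lambda c, ri: c == "ST")                # clusters of type ST
    m_sc = avg(lambda c, ri: c == "SC")                # clusters of type SC
    m_pt = avg(lambda c, ri: c == "PT")                # all units in PT clusters
    m_pt1 = avg(lambda c, ri: c == "PT" and ri == 1)   # treated units in PT clusters
    m_pt0 = avg(lambda c, ri: c == "PT" and ri == 0)   # untreated units in PT clusters

    a_all = (m_st - m_pt) / (1 - p_star) - (m_pt - m_sc) / p_star
    a_diff = (m_st - m_pt1) / (1 - p_star) - (m_pt0 - m_sc) / p_star
    return a_all, a_diff


# Made-up toy data: two ST units, two SC units, one treated and one untreated PT unit.
y = [5, 5, 0, 0, 3, 1]
cluster_type = ["ST", "ST", "SC", "SC", "PT", "PT"]
r = [1, 1, 0, 0, 1, 0]
a_all, a_diff = a_second_estimators(y, cluster_type, r, p_star=0.5)
```

In practice one would rather obtain the same quantities from the OLS regressions, since they also deliver the clustered covariance matrix needed for the Delta Method.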
Remark. Note that in practice, the actual proportion of treated in each cluster of type \(PT\) is going to differ from \(p^*\). Does this affect consistency and unbiasedness of both estimators? Could we estimate \(\hat p^*\) and try to use it to get access to a wider share of the \(A\) function, or at least to an average effect? See Davide’s discussion of that issue.
11.2.2 Identifying optimal treatment levels
In the previous section, we discussed ways of identifying diffusion effects, and we focused on the task of finding the optimal treatment allocation when total treatment capacity is fixed at a limited number of treatments. In that scenario, the key quantity turned out to be the shape of the returns to treatment effort (convex or concave), which is itself related to whether contagion or amplification effects dominate. Though this scenario of constant treatment effort sometimes happens in real life, in other situations, policymakers might want to decide whether to increase or decrease their treatment effort, and to find the optimal treatment level, taking into account diffusion effects. This is the goal of this section, which is fully based on Davide Viviano (2023)’s recent working paper on the topic.
11.2.2.1 Setup and assumptions
Davide considers a setting with \(K\) clusters of equal size \(N\). Researchers sample a proportion \(\lambda\in\left]0,1\right]\) of the \(N\) units in each cluster at each period \(t\) and they have access to the following information for the sampled observations in each cluster: \(\left(Y^{(k)}_{i,t},X^{(k)}_{i},D^{(k)}_{i,t}\right)_{i=1}^n\) where \(n=\lambda N\) and \(X^{(k)}_{i}\) are baseline characteristics. There are \(T\) periods. Despite the data being allowed to be a panel or a repeated cross section, we denote observations as if there was repeated sampling. Potential outcomes are denoted \(Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)})\), where \(\mathbf{D}_s^{(k)}\in\left\{0,1\right\}^N\), and \(s\leq t\). We denote \(Y^{(k)}(.)\) and \(X^{(k)}\) the vectors of potential outcomes and covariates in cluster \(k\).
The key policy parameter that we are going to be after is a treatment rule, \(\pi(.;\beta):\mathcal{X}\rightarrow\left[0,1\right]\), indexed by a (possibly vector valued) parameter \(\beta\) which lies in a compact set. The treatment rule selects the probability of allocating each agent with characteristics \(x\) at date \(t\) in cluster \(k\) to the treatment. We would like to choose \(\pi\) so that we maximize an objective function, for example total program returns net of program implementation costs.
In order to determine this optimal function, we are going to run two-step clustered experiments. These experiments are as follows:
Hypothesis 11.9 (Treatment Assignment in the experiment) For \(\beta_{k,t}\Ind\left(X^{(k)},Y^{(k)}(.)\right)\),
\[\begin{align*} D^{(k)}_{i,t}|X^{(k)},Y^{(k)}(.),\beta_{k,t}\sim_{i.n.i.d.}\mathcal{B}(\pi(X^{(k)}_{i};\beta_{k,t})). \end{align*}\]
Assumption 11.9 implies that the allocation of treatment follows a Bernoulli distribution indexed by parameters \(\beta_{k,t}\), and can be different for individuals with different baseline characteristics.
Example 11.1 Examples of experimental allocation rules are the equal probability rule: \(\pi(.;\beta)=\beta\in\left[0,1\right]\) or targeted treatments \(\pi(x;\beta)=\beta_x\). The treatment rules can also be made conditional on cluster characteristics.
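The two allocation rules of Example 11.1 can be sketched as follows. This is only an illustration under simple assumptions: the function names and the covariate values `"poor"` and `"rich"` are hypothetical, and treatments are drawn independently within a single cluster.

```python
import random


def equal_probability_rule(x, beta):
    """pi(x; beta) = beta: the same treatment probability for every x."""
    return beta


def targeted_rule(x, beta):
    """pi(x; beta) = beta_x: beta maps each covariate value to a probability."""
    return beta[x]


def assign_treatment(xs, rule, beta, rng):
    """Independent Bernoulli draws D_i ~ B(pi(X_i; beta)) within one cluster."""
    return [1 if rng.random() < rule(x, beta) else 0 for x in xs]


rng = random.Random(0)
xs = ["poor", "rich"] * 500  # made-up baseline characteristic
d_equal = assign_treatment(xs, equal_probability_rule, 0.3, rng)
d_target = assign_treatment(xs, targeted_rule, {"poor": 1.0, "rich": 0.0}, rng)
share_equal = sum(d_equal) / len(d_equal)  # close to 0.3 in a large cluster
```

With the targeted rule above, all units with `x = "poor"` are treated and none of the others, which is the degenerate case of a rule that depends on \(x\) only.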
Davide now needs another assumption about the data generating process:
Hypothesis 11.10 (Data generating process) For any \((i,t,k)\), we assume that:
Assumption 11.10 says that there are no carryover effects of the treatment beyond the period in which it is assigned, that there are no interactions between the treatment and time and cluster fixed effects, and finally that outcomes depend on at most \(\gamma_N\) other outcomes in the same cluster.
We are now equipped to define welfare as a function of the parameters of the allocation rule:
Definition 11.1 For treatments as in Assumption 11.9, and under the assumptions on the d.g.p. in Assumption 11.10, we can define welfare as \(W(\beta)=\int y(x,\beta)dF_X(x)\), with \(y(x,\beta)=\pi(x;\beta)m(1,x,\beta)+(1-\pi(x;\beta))m(0,x,\beta)\),
with \(y(x,\beta)\) the outcome net of costs. Equipped with this definition, and assuming all functions are differentiable, we can define the direct effect of the treatment (\(\Delta(x,\beta)\)), the marginal spillover effect (\(S(d,x,\beta)\)), the marginal policy effect (\(M(\beta)\)) and the welfare optimizing policy (\(\beta^*\)) as follows:
\[\begin{align*} \Delta(x,\beta) & = m(1,x,\beta)-m(0,x,\beta)\\ S(d,x,\beta) & = \partder{m(d,x,\beta)}{\beta}\\ M(\beta) & = \partder{W(\beta)}{\beta}\\ & = \int\left[S(0,x,\beta)+\pi(x;\beta)(S(1,x,\beta)-S(0,x,\beta))\right.\\ & \phantom{=\int\left[\right.}\left.+\partder{\pi(x,\beta)}{\beta} \Delta(x,\beta)\right]dF_X(x)\\ \beta^* & = \arg\sup_{\beta}W(\beta). \end{align*}\]
Example 11.2 A first example that Davide gives is the case of positive externalities with decreasing returns from neighbours’ treatments. We pose \(D^{(k)}_{i,t}\sim_{i.i.d.}\mathcal{B}(\beta)\), and \(\mathcal{N}_i\) is the set of neighbours of individual \(i\). We let
\[\begin{align*} Y^{(k)}_{i,t} & = \alpha_t + D^{(k)}_{i,t}\phi_1 + \frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_2 -\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\right)^2\phi_3 + \nu_{i,t} \end{align*}\]
In that case, assuming that \(\left|\mathcal{N}_i\right|\sim\mathcal{D}_N\), we have:
\[\begin{align*} \espsub{Y^{(k)}_{i,t}|\alpha_t,D^{(k)}_{i,t}=d}{\beta} & = \alpha_t + d\phi_1 + \espsub{\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}}{\beta}\phi_2 -\espsub{\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\right)^2}{\beta}\phi_3 \\ & = \alpha_t + d\phi_1 + \esp{\espsub{\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}|\left|\mathcal{N}_i\right|}{\beta}}\phi_2 -\esp{\espsub{\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\right)^2|\left|\mathcal{N}_i\right|}{\beta}}\phi_3 \\ & = \alpha_t + d\phi_1 + \esp{\frac{\left|\mathcal{N}_i\right|\beta}{\left|\mathcal{N}_i\right|}}\phi_2 -\esp{\frac{\left|\mathcal{N}_i\right|\beta(1-\beta)+\left|\mathcal{N}_i\right|^2\beta^2}{\left|\mathcal{N}_i\right|^2}}\phi_3 \\ & = \alpha_t + d\phi_1 + \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right) \end{align*}\]
As a consequence, we have:
\[\begin{align*} m(d,1,\beta) & = d\phi_1 + \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right)\\ \Delta(x,\beta) & = \phi_1\\ S(d,x,\beta) & = \phi_2-\phi_3(2\beta+(1-2\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}})\\ y(1,\beta) & = \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right)+\beta(\phi_1-c)\\ M(\beta) & = \phi_2-\phi_3(2\beta+(1-2\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}})+\phi_1-c. \end{align*}\]
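The formulas of Example 11.2 are easy to check numerically. The sketch below codes them and adds one step that is not in the text: the interior root of \(M(\beta)=0\), which is a welfare maximizer only under the assumptions \(\phi_3>0\) and \(\esp{\frac{1}{\left|\mathcal{N}_i\right|}}<1\). The function names, the argument `inv_deg` (standing for \(\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\)) and the parameter values are hypothetical.

```python
def m(d, beta, phi1, phi2, phi3, inv_deg):
    """Expected outcome m(d, 1, beta); inv_deg stands for E[1/|N_i|]."""
    return d * phi1 + beta * phi2 - beta * phi3 * (beta + (1 - beta) * inv_deg)


def spillover(beta, phi2, phi3, inv_deg):
    """Marginal spillover effect S(d, x, beta) (it does not depend on d here)."""
    return phi2 - phi3 * (2 * beta + (1 - 2 * beta) * inv_deg)


def net_outcome(beta, phi1, phi2, phi3, inv_deg, c):
    """y(1, beta): expected outcome net of the per-treatment cost c."""
    return (beta * (m(1, beta, phi1, phi2, phi3, inv_deg) - c)
            + (1 - beta) * m(0, beta, phi1, phi2, phi3, inv_deg))


def marginal_policy_effect(beta, phi1, phi2, phi3, inv_deg, c):
    """M(beta) = S + phi1 - c."""
    return spillover(beta, phi2, phi3, inv_deg) + phi1 - c


def optimal_beta(phi1, phi2, phi3, inv_deg, c):
    """Root of M(beta) = 0 truncated to [0, 1]; assumes phi3 > 0 and inv_deg < 1."""
    interior = (phi1 + phi2 - c - phi3 * inv_deg) / (2 * phi3 * (1 - inv_deg))
    return min(max(interior, 0.0), 1.0)


# Made-up parameter values, for illustration only.
params = dict(phi1=1.0, phi2=2.0, phi3=2.0, inv_deg=0.2, c=0.5)
beta_star = optimal_beta(**params)
```

A finite-difference derivative of `net_outcome` should agree with `marginal_policy_effect`, which is a cheap way to guard against algebra mistakes in this kind of derivation.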
Example 11.3 Let us now look at an example with negative externalities. We similarly have \(D^{(k)}_{i,t}\sim_{i.i.d.}\mathcal{B}(\beta)\), but now outcomes are negatively affected by the proportion of treated, and all the more so if they are treated themselves:
\[\begin{align*} Y^{(k)}_{i,t} & = \alpha_t + D^{(k)}_{i,t}\phi_1 - \frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_2 -D^{(k)}_{i,t}\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_3 + \nu_{i,t} \end{align*}\]
In that case, we have, after similar manipulation, \(y(1,\beta)=\beta(\phi_1-c-\phi_2-\phi_3\beta)\).
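For completeness, here is a sketch of the manipulation for Example 11.3. It assumes, as in Example 11.2, a per-treatment cost \(c\), and uses the fact that own treatment is independent of neighbours' treatments, so that the expected share of treated neighbours is \(\beta\):

\[\begin{align*} \espsub{Y^{(k)}_{i,t}|\alpha_t,D^{(k)}_{i,t}=d}{\beta} & = \alpha_t + d\phi_1 - \beta\phi_2 - d\beta\phi_3\\ m(d,1,\beta) & = d\phi_1 - \beta\phi_2 - d\beta\phi_3\\ \Delta(x,\beta) & = \phi_1-\beta\phi_3\\ S(d,x,\beta) & = -\phi_2-d\phi_3\\ y(1,\beta) & = \beta\left(m(1,1,\beta)-c\right)+(1-\beta)m(0,1,\beta) = \beta(\phi_1-c-\phi_2-\phi_3\beta)\\ M(\beta) & = \phi_1-c-\phi_2-2\phi_3\beta. \end{align*}\]

If \(\phi_3>0\), \(y(1,\beta)\) is strictly concave in \(\beta\) and the candidate interior optimum is \(\frac{\phi_1-c-\phi_2}{2\phi_3}\), to be truncated to \(\left[0,1\right]\).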
Remark. One open question is whether Assumption 11.10 is compatible with any real-looking network. Davide formalizes a nice proposition that shows that indeed this assumption can be rationalized by an actual network formation model. Let units be spaced on a latent space, where each unit can interact with at most the \(\sqrt{\gamma_N}\) closest units. Let \(\uns{i_k\leftrightarrow j_{k}}\) denote whether or not \(i\) and \(j\) can be connected in the resulting latent network. Let \(\mathcal{I}_k\) be the matrix of these potential connections in cluster \(k\). Davide’s first assumption is:
Hypothesis 11.11 (Network) Actual connections are generated as follows:
\[\begin{align*} A_{i,j}^{(k)} & = l(X^{(k)}_{i},X^{(k)}_{j},U^{(k)}_{i},U^{(k)}_{j})\uns{i_k\leftrightarrow j_{k}}, \end{align*}\]
for some function \(l\) and unobservables \(U^{(k)}_{i}\) with \(\left(X^{(k)}_{i},U^{(k)}_{i}\right)|\mathcal{I}_k\sim F_{U|X}F_{X}\) and with \(\sum_{j=1}^N\uns{i_k\leftrightarrow j_{k}}=\sqrt{\gamma_N}\).
The second assumption Davide makes is on how potential outcomes are generated:
Hypothesis 11.12 (Potential outcomes) Potential outcomes are generated as follows:
\[\begin{align*} Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)}) & = r(D_{i,t}^{(k)},\mathbf{D}_{\mathcal{N}_i^{k},t}^{(k)},X^{(k)}_{i},X^{(k)}_{\mathcal{N}_i^{k},t},U^{(k)}_{i},U^{(k)}_{\mathcal{N}_i^{k},t},A^{(k)}_{i,.},|\mathcal{N}_i^{k}|,\nu^{(k)}_{i,t})+\tau_k+\alpha_t, \end{align*}\]
for some function \(r\) which attains the same value for any permutations of the entries of \(A^{(k)}_{i,.}\), with \(A^{(k)}_{i,.}\) the vector of connections of \(i\) in \((k)\), and unobservables \(\nu^{(k)}_{i,t}|\left(X^{(k)}_{i},U^{(k)}_{i}\right)\sim F_{\nu}\), and where \(\mathcal{N}_i^{k}=\left\{j:A^{(k)}_{i,j}>0\right\}\).
Davide can then prove that this setting implies Assumption 11.10:
Proposition 11.1 With a treatment assigned following Assumption 11.9, if Assumptions 11.11 and 11.12 hold, then Assumption 11.10 holds.
Proof. See Viviano (2023), Section B.1.2.
11.2.2.2 Identifying and estimating the marginal policy effect with a one-wave experiment
Davide proposes a one-wave experiment to get at the marginal policy effect. Here is the algorithm he proposes, with \(p_1=1\):
Organize clusters into \(G=\frac{K}{2}\) pairs with consecutive indexes \(\left\{k,k+1\right\}\)
At \(t=0\), either nobody receives the treatment, or treatment is assigned using rule \(\pi(.;\beta)\). Collect baseline outcomes and observe \((Y_{i,0}^{(h)},X_{i}^{(h)})_{i=1}^N\), for \(h\in\left\{1,\dots,K\right\}\).
At \(t=1\), start the experiment:
For each pair \(g=\left\{k,k+1\right\}\), randomize
\[\begin{align*} D_{i,1}^{(k)}|\beta,X_i^{(k)}=x & \sim \begin{cases} \mathcal{B}(\pi(x,\beta+\eta_n\underline{e}_1)) & \text{ if } h=k\\ \mathcal{B}(\pi(x,\beta-\eta_n\underline{e}_1)) & \text{ if } h=k+1 \end{cases} \end{align*}\] with \(\bar{C}n^{-\frac{1}{2}}<\eta_n<\bar{C}n^{-\frac{1}{4}}\), and \(\underline{e}_j=\left[0,\dots,0,1,0,\dots,0\right]\), where \(\underline{e}_j\in\left\{0,1\right\}^p\), and \(\underline{e}_j[j]=1\).
For \(n\) units in cluster \(h\) observe \(Y_{i,1}^{(h)}\)
Estimate the marginal effect as follows:
\[\begin{align*} \bar{M}_n(\beta) & = \frac{1}{G}\sum_{g=1}^G\widehat{M}_g(\beta) \\ \widehat{M}_g(\beta) & = \frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right] -\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,0}\right] \end{align*}\]
Construct the following test statistic
\[\begin{align*} \mathcal{T}_n & = \sqrt{G}\frac{\bar{M}_n(\beta)}{\sqrt{\frac{1}{G-1}\sum_{g=1}^G(\widehat{M}_g(\beta)-\bar{M}_n(\beta))^2}} \end{align*}\] to test whether the current allocation is optimal. Indeed, if \(\beta^*=\arg\max_{\beta} W(\beta)\) is an interior point, then \(W(\beta)=W(\beta^*)\Rightarrow M(\beta)[j]=0\), \(\forall j\in\left\{1,\dots,p_1\right\}\), with \(p_1\leq p\).
Construct tests \(\uns{\left|\mathcal{T}_n\right|> \text{cv}_{G}(\alpha)}\), with size \(\alpha\) and critical values obtained by permuting the sign of the estimated marginal effects.
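The steps above can be sketched in a small simulation. Everything in it is an assumption made for illustration: the toy data generating process (each unit is exposed to the share of treated units in its cluster), the parameter values, the choice \(\eta_n=n^{-\frac{1}{3}}\) with constants set to one, and the function names. The sign-flip test is implemented with random sign draws rather than full enumeration.

```python
import math
import random
from statistics import mean, stdev

# Hypothetical toy d.g.p.: Y = alpha_t + tau_k + PHI1*D + PHI2*share - PHI3*share^2 + noise,
# so that the marginal policy effect is roughly M(beta) = PHI1 + PHI2 - 2*PHI3*beta.
PHI1, PHI2, PHI3 = 1.0, 2.0, 2.0


def cluster_mean_outcome(beta, n, alpha_t, tau_k, rng):
    """Average outcome of n sampled units in a cluster where D_i ~ B(beta)."""
    d = [1 if rng.random() < beta else 0 for _ in range(n)]
    share = sum(d) / n  # toy assumption: units react to the cluster share treated
    spill = PHI2 * share - PHI3 * share ** 2
    return sum(alpha_t + tau_k + PHI1 * di + spill + rng.gauss(0, 1) for di in d) / n


def one_wave_experiment(beta, n_pairs, n, rng):
    """Return the pair-level DID estimates of the marginal policy effect."""
    eta = n ** (-1 / 3)  # lies between n^(-1/2) and n^(-1/4)
    m_hat = []
    for _ in range(n_pairs):
        did = []
        for sign in (1, -1):  # cluster k gets beta + eta, cluster k+1 gets beta - eta
            tau_k = rng.gauss(0, 1)  # cluster fixed effect, removed by differencing
            y0 = cluster_mean_outcome(beta, n, 0.0, tau_k, rng)               # t = 0
            y1 = cluster_mean_outcome(beta + sign * eta, n, 0.5, tau_k, rng)  # t = 1
            did.append(y1 - y0)
        m_hat.append((did[0] - did[1]) / (2 * eta))
    return m_hat


def t_statistic(m_hat):
    """sqrt(G) * mean / standard deviation (with the 1/(G-1) correction)."""
    return math.sqrt(len(m_hat)) * mean(m_hat) / stdev(m_hat)


def sign_flip_p_value(m_hat, n_draws, rng):
    """Randomization p-value obtained by randomly flipping the sign of each estimate."""
    observed = abs(t_statistic(m_hat))
    exceed = 0
    for _ in range(n_draws):
        flipped = [mg * rng.choice((-1, 1)) for mg in m_hat]
        if abs(t_statistic(flipped)) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_draws)


rng = random.Random(1)
m_hat = one_wave_experiment(beta=0.3, n_pairs=40, n=400, rng=rng)
m_bar = mean(m_hat)  # to compare with M(0.3) = 1.8 in this toy d.g.p.
p_value = sign_flip_p_value(m_hat, 999, rng)
```

In this toy example the current allocation \(\beta=0.3\) is far from the optimum, so the test should reject that the marginal policy effect is zero.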
Remark. Davide’s approach runs a pairwise randomized controlled trial similar to the ones we studied in Section 16.2, but at the cluster level. Each cluster within the pair is allocated a slightly different value of the allocation parameter, with a perturbation around the current level of allocation (or a \(\beta\) of interest). Davide then proposes a DID estimator to get rid of the time and cluster fixed effects (mostly for precision, since they do not seem to affect consistency). He estimates a test statistic for whether the average marginal effect is zero across clusters, which is a necessary condition for being at an interior optimum (and a sufficient one under additional conditions, for example if \(W\) is strictly concave).
Remark. Davide’s approach also recovers several other important treatment effects:
\[\begin{align*} \bar{W}_n(\beta) & = \frac{1}{K}\sum_{k=1}^K\left(\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right) \\ \bar{\Delta}_n(\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{\Delta}_g(\beta) \\ \hat{\Delta}_{g}(\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\sum_{i=1}^n \left[\frac{D^{(h)}_{i,1}Y^{(h)}_{i,1}}{\pi(X_i^{h};\beta+\eta_n\nu_h\underline{e}_1)} -\frac{(1-D^{(h)}_{i,1})Y^{(h)}_{i,1}}{1-\pi(X_i^{h};\beta+\eta_n\nu_h\underline{e}_1)}\right]\\ \bar{S}_n(1,\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{S}_{g}(1,\beta) \\ \hat{S}_{g}(1,\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\frac{\nu_h}{\eta_n}\sum_{i=1}^n \left[\frac{D^{(h)}_{i,1}Y^{(h)}_{i,1}}{\pi(X_i^{h};\beta+\eta_n\nu_h\underline{e}_1)} -\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right]\\ \bar{S}_n(0,\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{S}_{g}(0,\beta) \\ \hat{S}_{g}(0,\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\frac{\nu_h}{\eta_n}\sum_{i=1}^n \left[\frac{(1-D^{(h)}_{i,1})Y^{(h)}_{i,1}}{1-\pi(X_i^{h};\beta+\eta_n\nu_h\underline{e}_1)} -\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right]\\ \nu_h & = \begin{cases} 1 & \text{ if } h=k \\ -1 & \text{ if } h=k+1 \end{cases} \\ \end{align*}\]
Remark. Note that Davide’s approach is politically much more palatable than a 2-step clustered design with super controls and super treated clusters.
Remark. At the same time, note that Davide’s approach just gives the direction in which to change \(\beta\) but does not deliver an optimal \(\beta^*\), unless we are already there.
Remark. Davide also proves several theorems that ensure that the estimates above are consistent under some reasonable assumptions, and that their distribution can be approximated by a normal distribution. The randomization tests also have correct size.
11.2.2.3 Identifying and estimating the optimal treatment allocation
Davide also proposes an algorithm to sequentially converge to the optimal treatment level. Here is Davide’s proposed algorithm for \(p_1=1\), with \(\beta\in\left[\underline{\beta},\overline{\beta}\right]\):
Organize clusters into pairs \(\left\{k,k+1\right\}\), with \(k\in\left\{1,3,\dots,K-1\right\}\);
At \(t=0\), treatment is assigned using rule \(D_{i,0}^{(h)}|X_i^{(h)}=x\sim\mathcal{B}(\pi(x;\beta_0))\), \(\forall h\in\left\{1,\dots,K\right\}\). Collect baseline outcomes and observe \((Y_{i,0}^{(h)},X_{i}^{(h)})_{i=1}^N\), for \(h\in\left\{1,\dots,K\right\}\). Initialize \(\widehat{M}_{k,0}=0\), \(\tilde{\beta}_k^0=\beta_0\).
while \(1\leq t\leq T\), do:
Define
\[\begin{align*} \tilde{\beta}_{h}^{(t)} & = \mathcal{P}_{\underline{\beta},\overline{\beta}-\eta_n}(\tilde{\beta}_h^{(t-1)}+\alpha_{h+2,t}\widehat{M}_{h+2,t-1}) \end{align*}\] where \(\tilde{\beta}_h^{(0)}=\beta_0\), the index \(h+2\) wraps around circularly once it exceeds \(K\), \(\alpha_{k,t}\) is the learning rate and \(\mathcal{P}_{a,b}(x)=\arg\min_{x'\in\left[a,b\right]^p}||x-x'||^2\) is the projection onto \(\left[a,b\right]^p\).
Assign treatments as (for \(\bar{C}n^{-\frac{1}{2}}<\eta_n<\bar{C}n^{-\frac{1}{4}}\)):
\[\begin{align*} D_{i,t}^{(h)}|X_i^{(h)}=x & \sim\mathcal{B}(\pi(x;\beta_{h,t}))\\ \beta_{h,t} & = \begin{cases} \tilde{\beta}_{h}^{(t)}+\eta_n & \text{ if } h \text{ is odd}\\ \tilde{\beta}_{h}^{(t)}-\eta_n & \text{ if } h \text{ is even} \end{cases} \end{align*}\]
For \(n\) units in cluster \(h\) observe \(Y_{i,t}^{(h)}\)
For each pair \(\left\{k,k+1\right\}\), estimate the marginal effect as follows:
\[\begin{align*} \widehat{M}_{k,t}=\widehat{M}_{k+1,t} & = \frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,t}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right] -\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,t}-\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,0}\right] \end{align*}\]
End while.
Return \(\hat{\beta}^*=\frac{1}{K}\sum_{k=1}^K\tilde{\beta}^T_{k}\)
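The sequential algorithm can be sketched as follows for \(p_1=1\). This is a simplified illustration, not Davide's implementation: it works at the pair level (both clusters of a pair share the same iterate), each pair borrows the gradient estimated in the next pair in circular order, the learning rate is \(\frac{J}{t}\), and the toy data generating process, parameter values and function names are all hypothetical.

```python
import random

# Hypothetical toy d.g.p.: Y = tau_k + PHI1*D + PHI2*share - PHI3*share^2 + noise.
# Without costs, expected outcomes are maximal near beta = (PHI1 + PHI2) / (2 * PHI3) = 0.75.
PHI1, PHI2, PHI3 = 1.0, 2.0, 2.0


def cluster_mean_outcome(beta, n, tau_k, rng):
    """Average outcome of n sampled units in a cluster where D_i ~ B(beta)."""
    d = [1 if rng.random() < beta else 0 for _ in range(n)]
    share = sum(d) / n  # toy assumption: units react to the cluster share treated
    spill = PHI2 * share - PHI3 * share ** 2
    return sum(tau_k + PHI1 * di + spill + rng.gauss(0, 1) for di in d) / n


def project(x, low, high):
    """Projection of a scalar onto [low, high]."""
    return min(max(x, low), high)


def adaptive_experiment(beta0, n_pairs, n, n_periods, step, rng, low, high):
    eta = n ** (-1 / 3)  # perturbation size, between n^(-1/2) and n^(-1/4)
    # Cluster fixed effects and baseline (t = 0) mean outcomes under beta0.
    tau = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_pairs)]
    y0 = [tuple(cluster_mean_outcome(beta0, n, tk, rng) for tk in pair) for pair in tau]
    beta_tilde = [beta0] * n_pairs
    grad = [0.0] * n_pairs  # marginal effect estimates, initialised at zero
    for t in range(1, n_periods + 1):
        # Gradient step using last period's estimate from the NEXT pair (circular).
        beta_tilde = [
            project(beta_tilde[g] + step / t * grad[(g + 1) % n_pairs], low, high - eta)
            for g in range(n_pairs)
        ]
        for g in range(n_pairs):
            up = cluster_mean_outcome(beta_tilde[g] + eta, n, tau[g][0], rng)
            down = cluster_mean_outcome(beta_tilde[g] - eta, n, tau[g][1], rng)
            # DID relative to baseline, divided by twice the perturbation.
            grad[g] = ((up - y0[g][0]) - (down - y0[g][1])) / (2 * eta)
    return sum(beta_tilde) / n_pairs


rng = random.Random(2)
beta_hat = adaptive_experiment(beta0=0.3, n_pairs=20, n=500, n_periods=20,
                               step=0.2, rng=rng, low=0.2, high=1.0)
```

The lower bound `low=0.2` is chosen so that `beta - eta` stays a valid probability in this example; with other sample sizes the bounds would need to be adapted.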
Remark. The algorithm simply mimics a gradient ascent algorithm (we are maximizing welfare). One twist is that it uses as its estimate of the gradient the marginal treatment effect estimated in another pair of clusters. This ensures that there will not be overfitting: at each stage, the choices of treatment level remain independent of the potential outcomes and covariates in the cluster. Using all pairs but the treated pair would not work either (see Appendix B.1.4 in Davide’s paper).
Remark. When \(p_1>1\), the algorithm is split in \(\frac{T}{p_1}\) sub-waves of length \(p_1\), where we move each coordinate sequentially before moving to the next wave.
Remark. How to choose the optimal learning rate \(\alpha_{k,t}\)? Under strong concavity of the objective function, the learning rate should be of order \(\frac{J}{t}\), with for example \(J\in\left[0.1,0.2\right]\) when \(\beta\) is a proportion. A choice that is more robust with moderate to large \(T\) is:
\[\begin{align*} \alpha_{k,t} & = \begin{cases} \frac{J}{T^{\frac{1-\nu}{2}}||\widehat{M}_{k,t}||} & \text{ if }||\widehat{M}_{k,t}||_2^2>\frac{\kappa}{T^{1-\nu}}-\epsilon_n,\\ 0 & \text{ otherwise } \end{cases} \end{align*}\]
for \(\epsilon_n>0\), \(\epsilon_n\rightarrow 0\), and small constants \(\nu\leq 1\), \(J\), \(\kappa>0\). This approach of dividing the estimated marginal effect by its norm is called gradient norm rescaling and guarantees control of out-of-sample regret under strict quasi-concavity.
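A minimal sketch of the rescaled learning rate is given below. The function name and the default values of `J`, `kappa`, `nu` and `eps` are hypothetical; the guard against a zero norm is an addition made here to avoid a division by zero.

```python
import math


def rescaled_learning_rate(m_hat, n_periods, J=0.1, kappa=0.01, nu=0.1, eps=1e-6):
    """Gradient norm rescaling: the step alpha * m_hat has length J / T^((1 - nu) / 2).

    m_hat: list of estimated marginal effects (one entry per coordinate of beta).
    """
    norm = math.sqrt(sum(mj * mj for mj in m_hat))
    threshold = kappa / n_periods ** (1 - nu) - eps
    if norm > 0 and norm ** 2 > threshold:
        return J / (n_periods ** ((1 - nu) / 2) * norm)
    return 0.0  # estimated gradient too close to zero: do not move


alpha = rescaled_learning_rate([3.0, 4.0], n_periods=16, J=0.2, nu=0.0)
step_length = alpha * 5.0  # norm of [3, 4] is 5, so the step has length 0.2 / 4
alpha_flat = rescaled_learning_rate([0.0, 0.0], n_periods=16)
```

The design choice is that the size of the move does not depend on the magnitude of the estimated gradient, only on its direction, which makes the procedure less sensitive to noisy gradient estimates.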
Remark. Under the assumption that \(W(\beta)\) is \(\sigma\)-strongly concave, for some strictly positive \(\sigma\), and under additional technical assumptions, the distance between \(\hat{\beta}^*\) and the optimal \(\beta^*\) is arbitrarily small, as well as the distance between \(W(\beta^*)\) and \(W(\hat{\beta}^*)\). Davide also shows that regret can be made arbitrarily small in his approach, both in and out of sample.
11.3 Diffusion effects with detailed networks
We are now going to study what happens when we have access to detailed network information. We observe the contiguity matrix \(A\), or at least all the relevant links for each member of our sample and the treatment status of each peer of our sample members. The analysis of such data is going to closely follow the treatment by Michael Leung (2020).
11.3.1 Setting
We consider a network of total size \(n\), where connections are represented by the matrix \(A\), which is such that there are no self-links (\(A_{i,i}=0\), \(\forall i\in\left\{1,\dots,n\right\}\)). For each unit \(i\) we observe \(D_i\), \(Y_i\), \(\gamma_i=\sum_{j=1}^na_{i,j}\) (\(i\)’s degree or number of neighbors), and \(T_i=\sum_{j=1}^na_{i,j}D_j\) (the number of \(i\)’s neighbors that are treated). We posit a treatment response function that is as follows:
\[\begin{align*} Y_i & = r(D_i,T_i,\gamma_i,\epsilon_i), \end{align*}\]
where \(\epsilon_i\in\mathbb{R}^{d_{\epsilon}}\) are unobserved influences to outcomes, and \(r\) is a function. One specification of \(r\) is the linear first-degree influence model:
\[\begin{equation} Y_i = \beta_1 + \beta_2D_i + \beta_3\frac{T_i}{\gamma_i} + \epsilon_i \tag{11.1} \end{equation}\]
where outcomes depend linearly on the proportion of neighbors treated.
Remark. Michael’s model implicitly imposes that diffusion effects can only stem from direct connections. Connections further away on the network (friends of friends) have no direct effect on \(i\)’s outcome. This assumption can be relaxed, as long as there is a maximum distance \(K\) on the network after which neighbors’ treatments have no effect on \(i\)’s outcome. In our formulation so far, \(K=1\).
Remark. The way we collect data is either we observe the full network or, more often, we conduct a snowball-sampling of \(1\)-neighborhoods. In this sampling strategy, we first randomly select a set of \(\tilde{n}\leq n\) focal units on which we collect \(\left(Y_i,D_i\right)\) and the identity of their neighbors, which gives us \(\gamma_i\) and \(a_{i,j}\). We then collect the treatment status of each of the neighbors, which gives us \(T_i\). We therefore have as data: \(\left(Y_i,D_i,T_i,\gamma_i\right)_{i=1}^{\tilde{n}}\) as well as \(\tilde{A}\), the set of sampled links.
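The variables \(\gamma_i\) and \(T_i\) are simple functions of the adjacency matrix and of the treatment vector. The sketch below uses a made-up four-unit network (a path between the first three units plus one isolated unit); the function name is hypothetical.

```python
def degree_and_treated_neighbors(A, D):
    """Return (gamma, T): degrees and numbers of treated neighbors.

    A: adjacency matrix as a list of lists of 0/1 with a zero diagonal;
    D: list of treatment indicators.
    """
    gamma = [sum(row) for row in A]
    T = [sum(a_ij * d_j for a_ij, d_j in zip(row, D)) for row in A]
    return gamma, T


# Made-up network: links 0-1 and 1-2, unit 3 is isolated.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
D = [1, 0, 1, 0]
gamma, T = degree_and_treated_neighbors(A, D)
```

Note that the share \(\frac{T_i}{\gamma_i}\) used in the linear first-degree influence model is not defined for isolated units such as unit 3 here, so they need to be handled separately in practice.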
11.3.2 Identification of causal effects
We define conditional causal effects as follows:
\[\begin{align*} \Delta^Y_{TDT}(t,\gamma) & = \esp{r(1,t,\gamma,\epsilon_i)-r(0,t,\gamma,\epsilon_i)|D_i=1,T_i=t,\gamma_i=\gamma}\\ \Delta^Y_{TIT}(d,\gamma) & = \esp{r(d,t,\gamma,\epsilon_i)-r(d,t',\gamma,\epsilon_i)|D_i=d,T_i=t,\gamma_i=\gamma}, \end{align*}\]
with \(\Delta^Y_{TDT}(t,\gamma)\) the average treatment effect on the directly treated, keeping the indirect level of treatment and the degree of each individual constant; and \(\Delta^Y_{TIT}(d,\gamma)\) the average treatment effect on the indirectly treated, keeping the direct level of treatment and the degree of each individual constant.
Remark. Leung (2020) also allows for the identification of effects on functions of \(Y_i\), \(h(Y_i)\).
To state identification results for both treatment effects, we are going to make several assumptions on \(\tilde{D}=\left\{D_i\right\}_{i=1}^{\tilde{n}}\), \(\tilde{\epsilon}=\left\{\epsilon_i\right\}_{i=1}^{\tilde{n}}\) and \(\tilde{A}\):
Hypothesis 11.13 (Treatment exogeneity) We assume that (a) \(\tilde{D}\Ind\left(\tilde{A},\tilde{\epsilon}\right)\) and (b) \(\forall i\in\left\{1,\dots,\tilde{n}\right\}\), \(\epsilon_i\Ind\tilde{A}|\gamma_i\).
Remark. Assumption 11.13 imposes that the treatment does not alter links between units across the network, and, furthermore, since treatment is assumed i.i.d., it also imposes that the treatment is not allocated with respect to network characteristics. This can be relaxed, for example by conducting an experiment stratified on network characteristics.
Remark. Assumption 11.13 imposes full independence between treatment and unobservables, which is satisfied mostly when the treatment is randomly allocated across units.
Remark. Part (b) of Assumption 11.13 imposes that links are independent from error terms, conditional on degree. This assumption rules out unobserved homophily (individuals forming links based on unobserved determinants of outcomes), a key open issue in the literature.
We also need one technical assumption:
Hypothesis 11.14 (Support) We assume that (a) \(\Pr(D_i=1)\in]0,1[\) and (b) there exists \(P\): \(\N\rightarrow\left[0,1\right]\) such that, \(\forall\gamma\in\N\), \(\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\uns{\gamma_i=\gamma}\probconv P(\gamma)\) and \(\Gamma=\left\{\gamma:P(\gamma)>0\right\}\) is not empty and is different from \(\left\{0\right\}\).
Remark. Assumption 11.14 imposes that at least some sampled units have at least one link, and that some of them are treated and some of them are not.
We can now prove identification of our effects of interest:
Theorem 11.8 Under Assumptions 11.13 and 11.14, \(\Delta^Y_{TDT}(t,\gamma)\) and \(\Delta^Y_{TIT}(d,\gamma)\) are identified, \(\forall d\in\left\{0,1\right\}\), \(\forall t \leq \gamma\), \(\forall\gamma\in\Gamma\):
\[\begin{align*} \Delta^Y_{TDT}(t,\gamma) & = \esp{Y_i|D_i=1,T_i=t,\gamma_i=\gamma}-\esp{Y_i|D_i=0,T_i=t,\gamma_i=\gamma},\\ \Delta^Y_{TIT}(d,\gamma) & = \esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma}-\esp{Y_i|D_i=d,T_i=t',\gamma_i=\gamma}. \end{align*}\]
Proof. Under Assumption 11.14, \(\esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma}\) is well defined \(\forall d\in\left\{0,1\right\}\), \(\forall t \leq \gamma\), \(\forall\gamma\in\Gamma\). We have, \(\forall d\in\left\{0,1\right\}\), \(\forall t,t' \leq \gamma\):
\[\begin{align*} \esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma} & = \esp{r(d,t,\gamma,\epsilon_i)|D_i=d,T_i=t,\gamma_i=\gamma}\\ & = \esp{\esp{r(d,t,\gamma,\epsilon_i)|\tilde{A},D_i=d,T_i=t,\gamma_i=\gamma}|D_i=d,T_i=t,\gamma_i=\gamma}\\ & = \esp{\esp{r(d,t,\gamma,\epsilon_i)|\gamma_i=\gamma}|\gamma_i=\gamma}\\ & = \esp{r(d,t,\gamma,\epsilon_i)|\gamma_i=\gamma}\\ & = \esp{r(d,t,\gamma,\epsilon_i)|D_i=d',T_i=t',\gamma_i=\gamma}, \end{align*}\]
where the first equality is by definition, the second equality uses the Law of Iterated Expectations, and the third equality uses Assumption 11.13. More precisely, Assumption 11.13 implies that \((D_i=d,T_i=t)\Ind\epsilon_i|(\tilde{A},\gamma_i)\), which enables us to undo the conditioning on \((D_i=d,T_i=t)\) in the inner expectation, and part (b) of Assumption 11.13 then undoes the conditioning on \(\tilde{A}\). Since the inner expectation then depends only on \(\gamma_i\), so does the outer expectation, by the Law of Iterated Expectations. This undoes the conditioning on \((D_i=d,T_i=t)\) in the outer expectation and gives the fourth equality. The same reasoning applied in reverse gives the last equality. This proves the result.
11.3.3 Estimation of causal effects
Michael explores two estimators of the causal effects, one nonparametric and one parametric. The nonparametric estimators are as follows:
\[\begin{align*} \hat{\Delta}^{Y^{np}}_{TDT}(t,\gamma) & = \hat{\mu}(1,t,\gamma)-\hat{\mu}(0,t,\gamma)\\ \hat{\Delta}^{Y^{np}}_{TIT}(d,\gamma) & = \hat{\mu}(d,t,\gamma)-\hat{\mu}(d,t',\gamma)\\ \hat{\mu}(d,t,\gamma) & = \frac{\sum_{i=1}^{\tilde{n}}Y_i\unsi{i}{d,t,\gamma}}{\sum_{i=1}^{\tilde{n}}\unsi{i}{d,t,\gamma}} \\ \unsi{i}{d,t,\gamma} & = \uns{D_i=d,T_i=t,\gamma_i=\gamma}. \end{align*}\]
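The nonparametric estimators are simply differences between cell means, where a cell collects the units with \(D_i=d\), \(T_i=t\) and \(\gamma_i=\gamma\). A minimal sketch, with hypothetical function names and made-up data:

```python
def mu_hat(Y, D, T, gamma, d, t, g):
    """Mean outcome in the cell (D_i = d, T_i = t, gamma_i = g)."""
    cell = [y for y, di, ti, gi in zip(Y, D, T, gamma) if (di, ti, gi) == (d, t, g)]
    if not cell:
        raise ValueError("empty cell: the effect is not estimable at these values")
    return sum(cell) / len(cell)


def delta_tdt(Y, D, T, gamma, t, g):
    """Effect on the directly treated, holding T_i = t and gamma_i = g fixed."""
    return mu_hat(Y, D, T, gamma, 1, t, g) - mu_hat(Y, D, T, gamma, 0, t, g)


def delta_tit(Y, D, T, gamma, d, t, t_prime, g):
    """Effect of moving from t_prime to t treated neighbors, holding d and g fixed."""
    return mu_hat(Y, D, T, gamma, d, t, g) - mu_hat(Y, D, T, gamma, d, t_prime, g)


# Made-up sample of four units, all of degree 2.
Y = [3.0, 1.0, 5.0, 2.0]
D = [1, 0, 1, 0]
T = [1, 1, 2, 1]
gamma = [2, 2, 2, 2]
tdt = delta_tdt(Y, D, T, gamma, t=1, g=2)
tit = delta_tit(Y, D, T, gamma, d=1, t=2, t_prime=1, g=2)
```

Raising an error on empty cells is a design choice made here: it makes the support requirement visible instead of silently returning a missing value.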
The parametric estimators are:
\[\begin{align*} \hat{\Delta}^{Y^{p}}_{TDT}(t,\gamma) & = \hat{\beta}^{OLS}_2\\ \hat{\Delta}^{Y^{p}}_{TIT}(d,\gamma) & = \hat{\beta}^{OLS}_3\left(\frac{t-t'}{\gamma}\right), \end{align*}\]
using the OLS estimates of Equation (11.1).
We need several assumptions to ensure that our estimators converge to the actual treatment effect when the network size grows large. Let \(\max_kA_{ik}A_{jk}\) denote whether \(i\) and \(j\) are indirectly linked through a common friend \(k\).
Hypothesis 11.15 For any pair \((i,j)\in\left\{1,\dots,n\right\}^2\), (a) \(\epsilon_i\Ind\epsilon_j|\tilde{A},A_{ij}=0,\max_kA_{ik}A_{jk}=0\) and (b) \((\epsilon_i,\epsilon_j)\Ind\tilde{A}|A_{ij},\gamma_i,\gamma_j,\sum_kA_{ik}A_{jk}\).
Part (a) of Assumption 11.15 imposes that \(\epsilon_i\) and \(\epsilon_j\) can only be correlated if \(i\) and \(j\) are neighbors or share a common neighbor. Part (b) of Assumption 11.15 imposes that \(\epsilon_i\) and \(\epsilon_j\) depend on the network only through \(i\) and \(j\)’s own connection, own degrees, and number of common connections.
We are now going to define a set of properties on the connections between units that will enable us to apply a CLT with non independent data. Under Assumption 11.15, we know that the outcomes of two observations are correlated across the network when either they are direct neighbors or they have a neighbor in common. Let us encode these connections within a new matrix, \(G\), such that each entry measures whether the outcomes of two observations are correlated or not: \(G_{ij}=\uns{A_{ij}+\max_kA_{ik}A_{jk}+\uns{i=j}>0}\). Let \(\mathbf{N}_i=\left\{j:G_{ij}=1\right\}\) be the set of units whose outcomes are correlated with that of \(i\), and \(|\mathbf{N}_i|\) the cardinality of this set (i.e. the number of units whose outcomes are correlated with \(i\)’s outcome). Let \(G^3\) be the third matrix power of \(G\).
Hypothesis 11.16 \(\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}|\mathbf{N}_i|^3\) and \(\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j\neq i}(G^3)_{ij}\) are bounded in probability.
Assumption 11.16 imposes that the number of links in the network is small enough for the \(G\) matrix to be sparse. Assumption 11.16 implies that the average degree is bounded asymptotically, so that average degree is substantially smaller than sample size. There is no direct way to test for this, but one convenient approach is to compute the density of the network (its proportion of linked pairs), \(\frac{1}{\binom{\tilde{n}}{2}}\sum_{i<j}A_{ij}=\frac{\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\gamma_i}{\tilde{n}-1}\), which is the average degree divided by \(\tilde{n}-1\). In sparse enough networks, the density is around 10%. The second part of Assumption 11.16 requires that higher order moments of the degree distribution are bounded, which controls the tails of the distribution of degrees. One way to test for this condition is to compute the tail index of the distribution of degrees, and to check that the tails decrease exponentially fast, as for example in Ibragimov et al. (2015).
Remark. We can allow for connections of order higher than two in Assumption 11.15, as long as there exists some finite \(K\) beyond which correlations between the outcomes of units \(i\) and \(j\) are zero. We can also allow for connections across time or across clusters or across overlapping clusters.
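The matrix \(G\) and the two averages that appear in Assumption 11.16 can be computed directly, along with the share of linked pairs in the adjacency matrix \(A\). The sketch below uses plain nested lists and a made-up path network with four units; the function names and the dictionary keys are hypothetical, and the code is only meant for small networks.

```python
def dependency_matrix(A):
    """G_ij = 1 if i = j, or i and j are linked, or they share a common neighbor."""
    n = len(A)
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            common = any(A[i][k] and A[j][k] for k in range(n))
            G[i][j] = 1 if (i == j or A[i][j] or common) else 0
    return G


def matmul(P, Q):
    """Product of two square matrices stored as lists of lists."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sparsity_diagnostics(A):
    n = len(A)
    G = dependency_matrix(A)
    G3 = matmul(matmul(G, G), G)
    # Average of |N_i|^3, with |N_i| the row sums of G.
    third_moment = sum(sum(row) ** 3 for row in G) / n
    # Average over i of the off-diagonal row sums of G^3.
    cubic_term = sum(G3[i][j] for i in range(n) for j in range(n) if i != j) / n
    # Share of linked pairs in A.
    density = sum(A[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    return {"third_moment": third_moment, "cubic_term": cubic_term, "density": density}


# Made-up path network 0-1-2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
diag = sparsity_diagnostics(A)
```

In applications one would track how these averages behave as the sample grows, since the assumption is about boundedness rather than about a single value.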
Let us now state Michael’s Theorem for the distribution of the nonparametric estimator of treatment effects:
Theorem 11.9 (CLT for nonparametric estimator of diffusion effects) Under Assumptions 11.13, 11.14, 11.15 and 11.16 (and some regularity conditions summarized in Assumptions 5 and 6 in Leung (2019)), \(\forall d,d'\in\left\{0,1\right\}\), \(\gamma\in\Gamma\) and integers \(t,t'\leq\gamma\), there exists \(\sigma^2_{TS}\) such that:
\[\begin{align*} \sqrt{\tilde{n}}\left(\left(\hat{\mu}(d,t,\gamma)-\hat{\mu}(d',t',\gamma)\right)-\left(\mu(d,t,\gamma)-\mu(d',t',\gamma)\right)\right)\distr\mathcal{N}(0,\sigma^2_{TS}). \end{align*}\]
Proof. See proof of Theorem 2 in Leung (2019).
Let us now state Michael’s Theorem for the parametric estimator of diffusion effects. We first need the following assumption, with \(X\) the \(\tilde{n}\times 3\) matrix of regressors in Equation (11.1), \(\Theta\) the corresponding \(3\times 1\) vector of coefficients, and \(\rho_{ij}(X,G)=\esp{\epsilon_i\epsilon_j|X_i,X_j,G_{ij}=1}\):
Hypothesis 11.17 We assume that there exist positive definite matrices \(V\) and \(S\) such that \(\esp{\frac{1}{\tilde{n}}X'X|\tilde{A}}\probconv V\) and \(\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j\in\mathbf{N}_i}\esp{\rho_{ij}(X,G)X_iX_j'|\tilde{A}}\probconv S\) and that \(\esp{\epsilon_i|X_i,\tilde{A}}=0\).
Theorem 11.10 (CLT for parametric estimator of diffusion effects) Under Assumptions 11.13, 11.14, 11.15, 11.16, and 11.17:
\[\begin{align*} \sqrt{\tilde{n}}\left(\hat{\Theta}-\Theta\right)\distr\mathcal{N}(0,V^{-1}SV^{-1}). \end{align*}\]
Proof. See proof of Theorem SA.2.2 in Leung (2019).
11.3.4 Estimation of sampling noise
For estimating sampling noise, Michael proposes two estimators: one for the sampling noise of the nonparametric estimator and one for the sampling noise of the parametric estimator. For the nonparametric estimator, Michael proposes the following approach:
\[\begin{align*} \hat{\sigma}^2_{TS} & = \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j=1}^{\tilde{n}}G_{ij}((Y_ia_i-b_i)(Y_ja_j-b_j))\\ a_i & = \frac{\unsi{i}{d,t,\gamma}}{\hat{\rho}(d,t,\gamma)}-\frac{\unsi{i}{d',t',\gamma}}{\hat{\rho}(d',t',\gamma)}\\ b_i & = \hat{\mu}(d,t,\gamma)\frac{\unsi{i}{d,t,\gamma}}{\hat{\rho}(d,t,\gamma)}-\hat{\mu}(d',t',\gamma)\frac{\unsi{i}{d',t',\gamma}}{\hat{\rho}(d',t',\gamma)}\\ \hat{\rho}(d,t,\gamma) & = \frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\unsi{i}{d,t,\gamma}. \end{align*}\]
For the parametric estimator, Michael proposes the following estimator for the covariance matrix of the parameters of Equation (11.1) estimated by OLS:
\[\begin{align*} \hat{\mathbf{\Sigma}} & = (X'X)^{-1}\mathcal{M}'G\mathcal{M}(X'X)^{-1}, \end{align*}\]
with \(\mathcal{M}=(X_1\hat{\epsilon}_1,\dots,X_{\tilde{n}}\hat{\epsilon}_{\tilde{n}})'\), and \(\hat{\epsilon}\) the vector of regression residuals.
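As a minimal sketch of this sandwich formula, assuming the regressor matrix \(X\), the outcome vector, and the matrix \(G\) are already available as NumPy arrays (the function name `network_sandwich` and its arguments are hypothetical):

```python
import numpy as np

def network_sandwich(X, y, G):
    """OLS coefficients and (X'X)^{-1} M' G M (X'X)^{-1}.

    X : (n, k) regressor matrix
    y : (n,) outcomes
    G : (n, n) 0/1 matrix flagging the pairs (i, j) allowed to be correlated
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    G = np.asarray(G, dtype=float)

    # OLS estimate of Theta and the residual vector
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta_hat

    # M stacks the rows X_i * eps_hat_i
    M = X * resid[:, None]

    bread = np.linalg.inv(X.T @ X)
    sigma_hat = bread @ (M.T @ G @ M) @ bread
    return theta_hat, sigma_hat
```

When `G` is the identity matrix, the middle term reduces to \(\sum_i \hat{\epsilon}_i^2X_iX_i'\) and the function returns the usual heteroskedasticity-robust (White) covariance matrix. The off-diagonal entries of \(G\) add the cross products for pairs of units that are allowed to be correlated.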
We can now state two propositions showing that the estimators of sampling noise proposed by Michael are consistent:
Proposition 11.2 (Consistency of nonparametric estimator of sampling noise) Under the Assumptions of Theorem 11.9, \(\hat{\sigma}^2_{TS}\probconv\sigma^2_{TS}\).
Proof. See proof of Proposition 2 in Leung (2019).
Proposition 11.3 (Consistency of parametric estimator of sampling noise) Under Assumptions 11.13, 11.14, 11.15, 11.16, 11.17, and regularity conditions made clear in Proposition SA.2.4 in Leung (2019), \(\hat{\mathbf{\Sigma}}\probconv V^{-1}SV^{-1}\).
Proof. See proof of Proposition SA.2.4 in Leung (2019).
Remark. One assumption that is hard to swallow here is that effects die after \(K\) connections and correlations after \(K+1\) connections. To relax these assumptions, you can conduct inference at the cluster level, as in Banerjee et al. (2013), Guiteras et al. (2015) and Beaman et al. (2021) for example. Another approach is to use the bootstrap, with justifications in Davezies et al. (2021) and Nowakowicz (2024).
11.3.5 Nonparametric tests for the existence of diffusion effects based on randomization inference
One final question that we might have about diffusion effects with detailed network information is whether we can adapt nonparametric tests such as Randomization Inference to test for the existence and shape of diffusion effects. One especially interesting thing to test would be the existence of diffusion effects up to \(K\) levels of separation within the network, the assumption used to identify diffusion effects in the sections just above. Athey, Eckles, and Imbens (2018) have developed an approach to do just that, which we are going to detail here.
11.3.5.1 Overview of Athey, Eckles, and Imbens (2018)’s approach
A key concept to understand what Athey, Eckles, and Imbens (2018) are doing is that of sharpness. An assumption is sharp if it enables the researcher to infer the outcomes of each observation under counterfactual treatment vectors. For example, the assumption of no-effect is sharp because it imposes that all the observed outcomes are the ones with and without the treatment at the same time, and thus we can reallocate the treatment and infer the outcomes for each reallocation (they do not change).
The problem that Athey, Eckles, and Imbens (2018) try to solve is that of testing assumptions that are non sharp in the original experiment. The assumption of no spillover effects for example is not sharp in the original experiment since it allows for the treatment to have a direct unknown effect on each unit. When drawing a new treatment allocation, we do not know what the outcome of those who have changed treatment status should be.
Athey, Eckles, and Imbens (2018)’s approach for randomization inference tries to avoid having to assume the absence of treatment effects everywhere in order to test higher-order assumptions. For that, Athey, Eckles, and Imbens (2018) use a set of focal units for which treatment status is going to remain fixed over repetitions. Then, we select a set of non-focal units (who they are depends on the actual assumption to test). We select a test statistic measuring the strength of evidence against \(H_0\). We allocate the treatment at random among the non-focal units multiple times and recompute the test statistic for each allocation, which gives us the distribution of the test statistic under \(H_0\). Finally, we compute the actual test statistic in the real sample and derive its p-value using this empirically-derived distribution.
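The steps above can be sketched as follows for the null hypothesis of no spillovers onto focal units. This is an illustration under stated assumptions, not the implementation of Athey, Eckles, and Imbens (2018): the function name, the choice of test statistic (the absolute covariance between focal outcomes and the number of treated non-focal neighbours), and the reallocation scheme (permuting the observed treatments among non-focal units, which presumes a completely randomized design) are all hypothetical choices made for the example.

```python
import numpy as np

def focal_randomization_test(y, d, A, focal, n_draws=999, seed=0):
    """Randomization p-value for H0: no spillovers onto focal units.

    y     : (n,) observed outcomes
    d     : (n,) 0/1 observed treatment
    A     : (n, n) 0/1 adjacency matrix of the network
    focal : (n,) boolean mask; focal units keep their observed treatment
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    A = np.asarray(A, dtype=float)
    focal = np.asarray(focal, dtype=bool)
    nonfocal = ~focal

    # Links from focal units (rows) to non-focal units (columns)
    links = A[np.ix_(focal, nonfocal)]
    y_focal = y[focal]

    def stat(d_vec):
        # Number of treated non-focal neighbours of each focal unit
        exposure = links @ d_vec[nonfocal]
        # Assumed statistic: absolute covariance with focal outcomes
        return abs(np.mean((exposure - exposure.mean())
                           * (y_focal - y_focal.mean())))

    t_obs = stat(d)

    draws = np.empty(n_draws)
    for b in range(n_draws):
        d_new = d.copy()
        # Reallocate treatment among non-focal units only; under H0 the
        # outcomes of focal units are unchanged by this reallocation
        d_new[nonfocal] = rng.permutation(d[nonfocal])
        draws[b] = stat(d_new)

    # Share of reallocations at least as extreme as the observed one
    p_value = (1 + np.sum(draws >= t_obs)) / (n_draws + 1)
    return t_obs, p_value
```

Only the outcomes of focal units enter the statistic, so the null hypothesis pins down their values under every reallocation even though the direct effect of the treatment is left unrestricted.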
11.3.5.3 Testing for the existence of any effect of the treatment
The null hypothesis in this case is that all the effects of the treatment are zero: \(Y_{i}(\mathbf{D})=Y_{i}(\mathbf{D}')\), \(\forall i, \forall \mathbf{D},\mathbf{D}'\in\mathbf{\Omega}\).
To test for this assumption, we randomly allocate all hydrographic zones to placebo vulnerable zones, which gives us a randomized treatment vector \(\tilde{D}_{i}\). To test for the existence of a treatment effect, we use as test statistics \(\hat{\beta}_{OLS}\) and \(\hat{\delta}_{DID}\), estimated using the following regressions: