Chapter 11 Diffusion effects

Up until now, we have assumed that the treatment received by one unit in the population did not have any impact on any other unit. We have not encoded this assumption formally, but we have implicitly made it all along, starting with our encoding of Rubin Causal Model in Chapter 1. In this chapter, we are going to relax that assumption, and learn how to deal with the more general cases that then appear. We are going to cover a host of very important applications, that go from identifying contagion effects to identifying the optimal proportion of individuals to treat at independent locations. We are first going to start by introducing an extended Rubin Causal Model allowing for diffusion effects and introducing ways to discipline this model so that it becomes estimable. We are then going to look at various ways to estimate this model and the precision of the resulting estimates, using RCTs, DID, and both parametric and non parametric approaches. Most of these developments are fairly recent and will enable us to get rapidly in touch with the research frontier.

11.1 Allowing for diffusion effects in Rubin Causal Model

In this section, we are going to detail how to encode causality in the presence of diffusion effects. We are going to start with potential outcomes and a general framework, before considering two very important special cases: the case where diffusion effects are absent and the case where they take a specific form.

11.1.1 Potential outcomes and treatment effects with diffusion effects

The main starting point for an extended Rubin Causal Model is to acknowledge that the treatment status of the \(N^*\) observations in the population (with \(N^*\) possibly infinite) might influence the observed outcome for individual \(i\). Let \(\mathbf{d}=\left\{d_1,\dots,d_{N^*}\right\}\), with \(d_j\in\left\{0,1\right\}\), \(\forall j\in\left\{1,\dots,N^*\right\}\). We can therefore write the generalized potential outcome for individual \(i\) as \(Y_i^{\mathbf{d}}\). If we write \(\mathbf{D}=\left\{D_1,\dots,D_{N^*}\right\}\), we can then write the observed outcome for individual \(i\) as \(Y_i^{\mathbf{D}}\). The average effect of the treatment becomes:

\[\begin{align*} \Delta^Y_{ATE}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}}\\ & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\Pr(D_i=1)+\esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\Pr(D_i=0)\\ & = \Delta^Y_{TT}(\mathbf{D})\Pr(D_i=1)+\Delta^Y_{TUT}(\mathbf{D})\Pr(D_i=0)\\ \end{align*}\]

where \(\mathbf{0}\) is the null vector of length \(N^*\). Note that the average effect of the treatment is equal to a weighted average of the effect on the treated and the effect of the untreated. Note also that these effects differ from the ones we defined in Chapters 1 and 2: they depend on the whole vector of treatment assignments. Indeed, the effect on the untreated is not the one we defined in Section 5.1.3: it is not the difference between taking the treatment and not taking the treatment for those who do not take it. The TUT we have defined here is the difference in outcomes for the ones who do not take the treatment between a case where the treated individuals in the population receive the treatment and a case where no one receives the treatment. The only effect of the treatment on the untreated is indirect: it is the effect that transits through diffusion of the treatment effects from the treated to the untreated. It can be when farmers adopt technologies after seeing their treated neighbors adopt them, or when people contract less diseases because their neighbors are vaccinated. These effects can also be negative, for example when untreated job seekers are crowded out of a job by the job counselling received by the treated. In general, I like to call these effects contagion effects, to insist on the fact that they are indirect.

Note that the effect on the treated also is different and depends on the whole treatment vector. In that case, we allow for the effect on the treated to depend on whether or not some or all of their neighbors are treated. The effect of a vaccine might for example be higher when more people around us are vaccinated. Or a technology is more likely to be adopted if more neighbors are informed that it exists and encourage to adopt it. I call these types of effects amplification effects, to denote the fact that whether the treated react a lot or not to the treatment might depend on whether their neighbors are also treated. These effects might also be negative, for example when more job seekers receive counselling, the effectiveness of counselling on the treated might very well decrease.

11.1.2 Encoding the absence of diffusion effects

In this section, we are going to state the assumption of absence of diffusion effects, that is required for all our previous estimators to work. This assumption, called the Stable Unit Treatment Value Assumption, is stated as follows:

Hypothesis 11.1 (Stable Unit Treatment Value Assumption) We assume that the effect of the treatment on individual \(i\) only depends on whether \(i\) receives the treatment or not, and not on whether other individuals in the population receive the treatment as well: \(\forall i\), \(D_i=D'_i\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}\), \(\forall\mathbf{D}\neq\mathbf{D'}\).

Remark. SUTVA has been coined by Don Rubin in a series of papers: Rubin (1978, 1980, 1990).

SUTVA implies the version of Rubin Causal Model that we have introduced in Chapter 1. Indeed, SUTVA implies that the only treatment status that matters for the potential outcomes of individual \(i\) is the treatment status of individual \(i\). As a consequence, we have the following results:

Theorem 11.1 (Rubin Causal Model and Treatment Effects Under SUTVA) Under Assumption 11.1, the potential outcome of individual \(i\) only depends on its treatment status: \(\forall i\), \(Y_i^{\mathbf{D}}=Y_i^{D_i}\). As a consequence:

\[\begin{align*} \Delta^Y_{TT}(\mathbf{D}) & = \Delta^Y_{TT}\\ \Delta^Y_{TUT}(\mathbf{D}) & = 0. \end{align*}\]

Proof. The proof of the first result that \(Y_i^{\mathbf{D}}=Y_i^{D_i}\) is straightforward from Assumption 11.1. We therefore have

\[\begin{align*} \Delta^Y_{TT}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\\ & = \esp{Y_i^{D_i}-Y_i^{0}|D_i=1}\\ & = \esp{Y_i^{1}-Y_i^{0}|D_i=1}\\ & = \Delta^Y_{TT}\\ \Delta^Y_{TUT}(\mathbf{D})& = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\\ & = \esp{Y_i^{D_i}-Y_i^{0}|D_i=0}\\ & = \esp{Y_i^{0}-Y_i^{0}|D_i=0}\\ & = 0. \end{align*}\]

11.1.3 Treatment exposure

In general, it is going to prove extremely difficult to do econometric analysis using the very general setting we have defined so far, with potential outcomes depending on the whole treatment vector in the population. A useful simplifying assumption that we often have to resort to is to specify an exposure mapping, that relates the whole treatment vector to the specifications relevant for the outcomes of interest. In order to specify the exposure mapping, we are going to assume that all units in the population are part of a network. This network is summarized by an \(N^*\times N^*\) contiguity matrix \(A\) where each element \(a_{j,i}\) (with \(j\) denoting the line and \(i\) the column) measures the strength of the relationship between \(j\) and \(i\). For example, if \(j\) mentions \(i\) as a friend, \(a_{j,i}=1\), whereas \(a_{i,j}=1\) whenever \(i\) mentions \(j\) as a friend. We can enforce the graph to be symmetric, that is \(a_{j,i}=a_{i,j}\), \(\forall (i,j)\), but it does not have to be the case. For example, water quality at some point \(i\) along a river stream depends on whether water is treated at a point \(j\) upstream, but water quality in \(j\) does not depend on treatments in a downstream point \(i\). Because water flows in one direction, the network is not symmetric.

Equipped with a network of links, and denoting \(\mathbf{\Omega}=2^{N^*}\) the set of possible treatment allocations, and \(\mathbf{\Theta}\) the set of parameters \(\theta_i\) relevant for the value of treatment exposure of unit \(i\) (possibly containing features of the \(A\) matrix), we can define treatment exposure as a mapping \(f\) from \(\mathbf{\Omega}\times\mathbf{\Theta}\) to \(\mathbf{\Delta}\), the set of possible treatment exposure: \(\Delta_i=f(\mathbf{D},\theta_i)\). A key assumption we are going to make is that the exposure mapping is propermy specified, that is that it captures perfectly the intricacies of the effects of various treatment vectors:

Hypothesis 11.2 (Properly specified exposure mapping) \(\forall i\), \(\forall\mathbf{D}\neq\mathbf{D'}\in\mathbf{\Omega}\), \(\forall \theta_i\in\mathbf{\Theta}\), \(f(\mathbf{D},\theta_i)=f(\mathbf{D'},\theta_i)\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}\).

Under Assumption 11.2, the potential outcomes can be written as functions of treatment exposure only: \(Y_i^{\Delta_i}\).
As a consequence, we can now define the average treatment effect of the treatment on the treated as follows:

\[\begin{align*} \Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}, \end{align*}\]

where \(\Delta^Y_{TT}(\mathbf{d})\) measures the impact of treatment exposure \(\mathbf{d}\) on those who have received it.

Remark. The framework based on the use of an exposure mapping has been developped by Manski (2013) and Aronow and Samii (2017).

Remark. I use the term ``average treatment effect on the treated’’ because \(\Delta^Y_{ATE}(\mathbf{d})\) measures the effect of receiving a treatment vector \(d\) on those who receive it.

We are now equipped with tools that enable us to define treatment effects in the presence of diffusion effects, and to identify various types of diffusion effects. The key concept that we are going to have to specify is treatment exposure: how does it change with various applications and how do we go around identifying it in various precise cases? What can we do as well to test for features of treatment exposure without completely specifying it? This is what we are going to see in what follows, first in the case of Randomized Controlled Trials, and then in the case of Difference in Differences. We are going to go step by step, and first we ware going to start with simpler networks, that I call coarse networks, before looking at what we can do with more complex networks.

11.1.4 Fundamental problem of causal inference for diffusion effects

With diffusion effects and treatment exposure, the Fundamental Problem of Causal Inference strikes again. Let state the problem using our more general framework of treatment exposure:

Theorem 11.2 (Fundamental problem of causal inference with diffusion effects) It is impossible to observe \(\Delta^Y_{TT}(\mathbf{d})\), \(\forall d\in\mathbf{\Delta}\), either in the population or in the sample.

Proof. For the population TT:

\[\begin{align*} \Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}} \\ & = \esp{Y_i^{\mathbf{d}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\ & = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}. \end{align*}\]

\(\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\) is unobserved, and so is \(\Delta^Y_{TT}\). A similar reasoning holds for the sample average treatment effect.

We also have a novel formulation of the bias of intuitive methods. For example, selection bias now depends on \(\mathbf{d}\):

\[\begin{align*} \Delta^Y_{SB}(\mathbf{d}) & = \Delta^Y_{WW}(\mathbf{d})-\Delta^Y_{TT}(\mathbf{d}) \\ & = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i|\Delta_i=\mathbf{0}}-\esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\ & = \esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{0}}. \end{align*}\]

Remark. Why is the with/without comparison of individuals with treatment exposure \(\Delta_i=\mathbf{d}\) and those with treatment exposure \(\Delta_i=\mathbf{0}\) biased for average treatment effect on the treated \(\Delta^Y_{TT}(\mathbf{d})\)? This is because treatment exposure might be correlated with unobserved confounders: individuals with higher treatment exposure might be systematically different from those with the reference level of treatment exposure (here \(\mathbf{0}\)).

11.1.5 Bias of one-step randomized controlled trials

Explain direction of bias

11.2 Diffusion effects with coarse networks

Coarse networks are networks where we do not have a lot of information on the connections between individuals: we only know whether they belong to the same influence group or not. This type of network characterizes for example of a group of villages, or municipalities, or classes, for which we do not know which links individuals have between each other other than they belong to the same group. More formally, coarse networks can be characterized by the following property:

Hypothesis 11.3 (Coarse network) We say that our population is characterized by a coarse network if the observed matrix of connections \(A\) is block diagonal and we do not know which nodes are activated within each block.

Remark. A blog diagonal influence matrix is composed of a set of groups or clusters within which observations influence each other and across which we assume all influences are muted. This is of course a simplification: some units within a cluster might not really be connected, while some units might be connected to units in an other group. Also, not all units might be equivalent within a group, with some being more central (e.g. connected) than others. In a coarse network, we are assuming these differences away.

Remark. Another way of framing coarse networks is to say that there is unknown interference within clusters (and no interference across). This is Viviano (2023)’s definition. With Viviano’s approach to coarse networks, we do not know which units interfere within each network and how they do.

With a coarse network approach, under Assumption 11.3, we might specialize the exposure mapping to things we might know, that is whether the unit itself is treated or not and the proportion of units that are treated in a given cluster \(c\), \(p_c\), or, more generally, the proportion of units with characteristics \(X_i=x\) that are treated within clusters with characteristics \(Z_c=z\): \(p(x,z)\). As a consequence, we might write potential outcomes as \(Y_i^{D_i,p_c}\) or, more generally, \(Y_i^{D_i,\left\{p(x,Z_c)\right\}_{x\in\mathcal{X}}}\), with \(\mathcal{X}\) the support of \(X_i\).

Remark. Lemma 2.1 in Viviano (2023) shows an example of assumptions under which we can simplify the exposiure mapping and obtain potential outcomes as a function of the proportion of treated units and cluster and unit characteristics.

Under Assumption 11.3, we can write the average effect of treating a cluster with a proportion of treated \(p_c=p\) as follows:

\[\begin{align*} \Delta^Y_{TT}(p) & = \esp{Y_i^{D_i,p}-Y_i^{0,0}|p_c=p}\\ & = \esp{Y_i^{1,p}-Y_i^{0,0}|D_i=1,p_c=p}\Pr(D_i=1|p_c=p)\\ & \phantom{=} +\esp{Y_i^{0,p}-Y_i^{0,0}|D_i=0,p_c=p}\Pr(D_i=0|p_c=p)\\ & = \Delta^Y_{TDT}(p)\Pr(D_i=1|p_c=p)+\Delta^Y_{TIT}(p)\Pr(D_i=0|p_c=p), \end{align*}\]

where \(\Delta^Y_{TDT}(p)\) is the Average Treatment Effect on the Directly Treated and \(\Delta^Y_{TIT}(p)\) is the Average Treatment Effect on the Indirectly Treated.

The main question under Assumption 11.3 is to find the allocation of treated units that maximizes some objective function. We are going to make a distinction between two different cases:

  1. In the first case, we have a pre-specified budget for treatment effort (in terms of number of treated units) and we have to choose how to spend it optimally. This often happens in practical policy applications where the budget has been pre-approved but you do not know how to implement it in the most optimal way possible.
  2. In the second case, we already have an existing policy in place, and we would like to know whether it is optimal, and in which direction we should take it if we happen to have some additional budget.

11.2.1 Optimal treatment allocation under monotone response

I develop result on this setting in my own ongoing research. In order to fix ideas, we are going to start with a simple network with two clusters. We will then look at what happens with a more general network. Finally, we will look at how we can use two-steps clustered RCTs to estimate the required parameters and decide on the optimal allocation.

11.2.1.1 A simple model

Let’s start with a very simple example of a network with two clusters \(1\) and \(2\). Let’s also consider only the case of a discrete outcome (such as participation in a program, getting vaccinated, contracting a disease, adopting a technology, etc.). For simplicity, we are also going to write that potential outcomes are realizations of a continuous utility variable crossing a threshold:

\[\begin{align*} Y_{i}^{0,P_c} & = \uns{\underbrace{\delta_0 + \beta_0 P_{c} -\epsilon_{i,0}}_{Y^*_{i,0}}\geq0}\\ Y_{i}^{1,P_c} & = \uns{\underbrace{\delta_1 + \beta_1 P_{c} -\epsilon_{i,1}}_{Y^*_{i,1}}\geq0}. \end{align*}\]

What this models tells us is that, when no one else in the cluster is treated (\(P_c=0\)), the individual level effect of being treated is equal to \(\Delta^Y_i=\uns{\epsilon_{i,1}\leq\delta_1}-\uns{\epsilon_{i,0}\leq\delta_0}\). When some units starts receiving the treatment, we have two indirect effects:

  • Increasing the proportion of treated units impacts the outcomes of untreated units, through \(\beta_0\). This is what I call a contagion effect, in which untreated units are somehow contaminated by the treatment received by the treated individuals in the same cluster. Contagion might refer to receiving information about the existence of a program and eventually deciding the enroll, or being protected by the fact that some neighbors are taking a treatment (in that case, contagion effects might actually prevent some untreated units from being contaminated). Contagion effects might be negative, if for example treated individuals who receive job training or job search assistance end up finding jobs that would have been allocated to some of the untreated individuals in the absence of the treatment.
  • Increasing the proportion of treated units impacts the outcomes of treated units, through \(\beta_1\). This is what I call an amplification effect. There is amplification each time a treated units increases its likelihood of a positive outcome because more units are treated. This might happen when technological adoption occurs only after most individuals in the cluster have been exposed to it and convinced to make a change.

In order to get even more intuition on this problem, we are going to specialize it even further by making the following set of assumptions:

Hypothesis 11.4 (Simplified Allocation Problem) We assume that the allocation problem is characterized as follows:

  • There are only two nodes \(c=1\) and \(c=2\).
  • A mass of \(1\) units reside at each node.
  • We can only treat a mass of \(1\) units.
  • We assume the constraint is saturated so that we use all available treatments: \(p_1+p_2=1\).
  • \(\epsilon_1\) and \(\epsilon_0\) are uniform on \(\left[0,1\right]\).

Under Assumption 11.4, we can set \(p_1=p\) and \(p_2=1-p\). Let’s assume our goal is to maximize the total amount of people with \(Y_i=1\). Under Assumption 11.4, this is equivalent to maximizing the sum of the adoption rates at both nodes. Using the fact that the constraint is saturated, we can write the objective function we aim to maximize as follows:

\[\begin{align*} W(p) & = \underbrace{pF_{\epsilon,1}(\alpha_1 + \beta_1p)+(1-p)F_{\epsilon,0}(\alpha_0 + \beta_0p)}_{A(p)}\\ & \phantom{=}+\underbrace{(1-p)F_{\epsilon,1}(\alpha_1 + \beta_1(1-p))+pF_{\epsilon,0}(\alpha_0 + \beta_0(1-p))}_{A(1-p)} \end{align*}\]

The \(A\) function measures how much the probability of observing the favorable outcome \(Y_i=1\) in a given cluster increases with the proportion of treated individuals in the cluster, \(p\). It turns out that the properties of the \(A\) function are key to determine the optimal allocation of treatment effort across nodes in the general case with more than two nodes. For now, in the two-node case and under substantial simplifications, we have the following result:

Theorem 11.3 (Optimal allocation of treatment effort with two nodes) Under Assumptions 11.2, 11.3 and 11.4, we have three possible cases for the optimal allocation of treatment effort:

  • When amplification effects dominate (\(\beta_1>\beta_0\)): either \(p^*=1\) or \(p^*=0\)
  • When contagion effects dominate (\(\beta_0>\beta_1\)): \(p^*=\frac{1}{2}\)
  • When amplification and contagion effects are of the same size (\(\beta_0=\beta_1\)): \(p^*=\left[0,1\right]\).

Proof. Under Assumption 11.4, we have:

\[\begin{align*} W(p) & = p(\alpha_1 + \beta_1p)+(1-p)(\alpha_0 + \beta_0p)+(1-p)(\alpha_1 + \beta_1(1-p))+p(\alpha_0 + \beta_0(1-p))\\ & = \alpha_0+\alpha_1+\beta_1+2(\beta_0-\beta_1)p(1-p), \end{align*}\]

where the second line follows after some algebra. The problem \(\max_{p\in\left[0,1\right]}W(p)\) has the folowing first order condition: \(W'(p)=2(\beta_0-\beta_1)(1-2p)=0\) and the following second order condition: \(W''(p)=-4(\beta_0-\beta_1)\). When \(\beta_0>\beta_1\), \(W''(p)<0\), and the interior solution \(p^*=\frac{1}{2}\) maximizes \(W\). When \(\beta_0<\beta_1\), \(W''(p)>0\), and the interior solution \(p^*=\frac{1}{2}\) minimizes \(W\). In that case, the optimal solution is at a corner, either at \(p^*=1\) or at \(p^*=0\). Since \(W(1)=W(0)\), they are both maxima. When \(\beta_0=\beta_1\), \(W\) is constant and any value in \(\left[0,1\right]\) maximizes \(W\).

Remark. Theorem 11.3 shows that when amplification effects dominate, it is optimal to focus all treatment effort on one of the two nodes (for example the first, but they are interchangeable). This is because returns are increasing in this case: the \(A\) function is convex, with more people responding to the treatment as more of them receive the treatment. When contagion effects dominate, it is optimal to treat both nodes, with half of the observations receiving the treatment. This is because in that case, the \(A\) function is concave, and the marginal returns are decreasing when we treat more people. When both contagion and amplification effects are equal, there is no optimum, or, equivalently, any allocation \(p\) will yield the same result.

11.2.1.2 A general model

One open question is whether we can generalize the result in Theorem 11.3 to a much more general setting with several nodes and more general functional forms. It is actually the case. Let us now formulate a more general setting:

Hypothesis 11.5 (Symmetric Allocation Problem) We assume that the allocation problem is characterized as follows:

  • \(K\) nodes indexed from \(1\) to \(K\), and each node has size \(n_k\).
  • At each node, we can choose to treat \(r_k\) individuals.
  • The total number of individuals on the network is \(N=\sum_{k=1}^Kn_k\).
  • The total number of treated individuals is \(R=\sum_{k=1}^Kr_k\).
  • We cannot treat more than \(\bar{R}\) individuals.
  • We cannot treat everyone: \(\bar{R}<N\).
  • The expected outcome at each node (or response function) is only a function of \(p\), that we denote \(A(p)\), with \(A'>0\).

\(A\) is two things at the same time: connection matrix and response function

Remark. Assumption 11.5 is mainly restrictive in making the problem symmetric: all nodes are treated in the same way. The only thing that distinguishes nodes is their respective size. Apart from that, they all respond in the same (average) way to the treatment. We do not try to distinguish between nodes based on observed characteristics of the nodes. We also do not try to vary the identity of treated units based on their observed characteristics. Another restriction is that \(A'>0\): we only consider treatments for which the response is always strictly increasing in \(p\) (and not weakly).

Under Assumptions 11.2, 11.3 and 11.5, we can cast our optimization problem as follows:

\[\begin{align*} \max_{\left\{r_k\right\}_{k=1}^K} & \sum_{k=1}^K n_kA(\frac{r_k}{n_k})\label{eqn:MainProbMax}\\ & \text{under the constraints} \nonumber\\ R & =\sum_{k=1}^Kr_k \leq \bar{R} \label{eqn:MainProbR}\\ r_k & \leq n_k\text{, }\forall k\label{eqn:MainProbn}\\ r_k & \geq 0\text{, }\forall k.\label{eqn:MainProbr} \end{align*}\]

In my work, I have been able to solve this problem for a smooth response function \(A\), in the following sense:

Hypothesis 11.6 (Monotone Response Function) We assume that the reponse function \(A\) has constant second derivative on its full support: either \(A''(p)>0\) \(\forall p\in\left[0,1\right]\) or \(A''(p)<0\) \(\forall p\in\left[0,1\right]\).

We can indeed prove the following result:

Theorem 11.4 (Optimal allocation under monotone response with $K$ symmetric nodes) Under Assumptions 11.2, 11.3, 11.5 and 11.6, the optimal allocation of treatment across nodes is as follows:

\[\begin{align*} \frac{r^*_k}{n_k} & = \begin{cases} \frac{\bar{R}}{N}\text{, }\forall k & \text{ if }A''<0\\ \begin{cases} 0 & \text{ for a set of nodes } \mathcal{J} \text{ such that } \sum_{j\in\mathcal{J}}n_j=N-\bar{R},\\ 1 & \text{ for a set of nodes } \mathcal{L} \text{ such that } \sum_{l\in\mathcal{L}}n_j=\bar{R}, \end{cases} & \text{ if } A''>0.\\ \end{cases} \end{align*}\]

Proof. See Section A.6.1.

Remark. Theorem 11.4 shows that the very simple intuition that we got in the two nodes problem transports well to more complex settings. The optimal allocation depends on the sign of the second derivative. When returns are decreasing, we treat each node symmetrically with the same share \(p^*=\frac{\bar{R}}{N}\) of the treatment effort. When returns are increasing, we treat a share \(\frac{\bar{R}}{N}\) of the nodes with \(p^*=1\) and a share \(1-\frac{\bar{R}}{N}\) with \(p^*=0\).

Remark. There are several open questions on this research front. To list but a few:

  • Can we relax Assumption 11.6? For example, we know that \(A''\) has not constant sign when the error terms are normal in the model with two nodes, but we still have an optimal solution that has the same shape.
  • Can we relax Assumption 11.5?
    Especially, can we allow for responses that vary as a function of node characteristics and can we allow for treatment allocation based on unit characteristics?

11.2.1.3 Using two-step clustered randomized controlled trials to find the optimal treatment allocation

In this section, we are going to see that conducting a two-step clustered randomized controlled trial is going to enable us to identify the optimal treatment allocation under Assumptions 11.2, 11.3 and 11.4. A two-step clustered randomized controlled trial works as follows:

  • In a first step, we randomly select three sets of nodes, \(ST\) and \(PT\) and \(SC\), with \(K_{ST}+K_{PT}+K_{SC}=\tilde{K}\) and \(\tilde{K}\leq K\). When \(\tilde{K}< K\), \(\tilde{K}\) is a random subset of the \(K\) nodes.
    • Nodes that belong to \(ST\), the set of nodes of size \(K_{ST}\), are called Super Treated nodes. The proportion of treated units is \(p^R_c=1\), \(\forall c \in ST\).
    • Nodes that belong to \(PT\), the set of nodes of size \(K_{ST}\), are called Partially Treated nodes. The proportion of treated units is \(p^R_c=\frac{\bar{R}}{N}\equiv p^*\), \(\forall c \in PT\).
    • Nodes that belong to \(SC\), the set of nodes of size \(K_{SC}\), are called Super Control nodes. The proportion of treated units is \(p^R_c=0\), \(\forall c \in SC\).
  • In a second step, we randomly select \(N^1_c=\frac{\bar{R}}{N}N_c\) units to be treated (with \(R_i=1\)) and \(N^0_c=N_c-N^1_c\) to be in the control group (\(R_i=0\)), \(\forall c \in PT\), with \(N_c\) the number of units in node \(c\).

When implementing the treatment, all units in \(ST\) are treated, only \(N^1_c\) units are treated in \(PT\) and no unit is treated in \(SC\).

Remark. Note that rigorously, we should have \(N_c^1=\lfloor\frac{\bar{R}}{N}N_c\rfloor\), but we disregard the complexities brought about by the fact that the number of units has to be an integer.

The one thing we need to identify now in order to apply Theorem 11.4 is the sign of the second derivative of the \(A\) function. We are going to show that the sign of \(A''\) can be identified in a two-step clustered randomized controlled trial. Before that, we are going to encode the validity of the two-step clustered randomized controlled trial:

Hypothesis 11.7 (Independence in a two-step clustered design) We assume that the allocation of the proportion of neighbors treated and of the individual treatment level are independent of potential outcomes:

\[\begin{align*} (R_i,p^R_c)\Ind\left(\left\{Y_i^{0,p},Y_i^{1,p}\right\}_{p\in\left[0,1\right]}\right). \end{align*}\]

We also assume that the randomized allocation does not interfere with how units respond to the treatment:

Hypothesis 11.8 (Validity of the 2-step clustered design) We assume that the randomized allocation of the program does not interfere with how potential outcomes are generated:

\[\begin{align*} Y_i & = \begin{cases} Y_i^{1,p} & \text{ if } R_i=1 \text{ and } p^R_c=p\\ Y_i^{0,p} & \text{ if } R_i=0 \text{ and } p^R_c=p \end{cases} \end{align*}\]

with \(Y_i^{1,p}\) and \(Y_i^{0,p}\) the same potential outcomes as defined with a routine allocation of the treatment.

We are now equipped to prove the identification of \(A''\):

Theorem 11.5 (Identification of $A''$ in a 2-step clustered randomized controlled trial) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, the numerator of \(A''\) is identified by the following quantity:

\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|p^R_c=1}-\esp{Y_i|p^R_c=p^*}}{1-p^*}-\frac{\esp{Y_i|p^R_c=p^*}-\esp{Y_i|p^R_c=0}}{p^*}\right). \end{align*}\]

Proof. See Section A.6.2.

One thing that is pretty amazing is that we can relate the sign of \(A''\) to the relative size of contagion and amplification effects:

Theorem 11.6 (The sign of $A''$ depends on the relative size of contagion vs amplification effects) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, we have:

\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y^{1,1}_i-Y^{1,p^*}}}{1-p^*} -\frac{\esp{Y^{0,p^*}_i-Y^{0,0}_i}}{p^*}\right), \end{align*}\] where \(\esp{Y^{1,1}_i-Y^{1,p^*}}\) measures the strength of amplification effects and \(\esp{Y^{0,p^*}_i-Y^{0,0}_i}\) measures the strength of contagion effects.

Proof. See Section A.6.3.

Theorem 11.6 suggests an alternative identification strategy for the sign of \(A''\):

Theorem 11.7 (Identifying the sign of $A''$ from the relative size of contagion and amplification effects) Under Assumptions 11.2, 11.3, 11.5, 11.6, 11.7, and 11.8, we have:

\[\begin{align*} \text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|R_i=1,p^R_c=1}-\esp{Y_i|R_i=1,p^R_c=p^*}}{1-p^*}\right.\\ & \phantom{=\text{sign}\left(\right.}\left.-\frac{\esp{Y_i|R_i=0,p^R_c=p^*}-\esp{Y_i|R_i=0,p^R_c=0}}{p^*}\right) \end{align*}\]

Proof. The proof is immediate using Theorem 11.6 and Assumptions 11.7 and 11.8.

Thanks to Theorems 11.5 and 11.7, we therefore have two ways to estimate the sign of \(A''\): either by comparing the overall changes in expected outcomes when moving from \(0\) to \(p^*\) and from \(p^*\) to \(1\), or by comparing the relative size of amplification and contagion effects. As a consequence, we can form two with/without estimators of \(A''\):

\[\begin{align*} \hat{A}''_{All}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}_{PT}}Y_i}{N_{PT}}}{1-p^*}- \frac{\frac{\sum_{i\in\mathcal{I}_{SP}}Y_i}{N_{SP}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*}\\ \hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}^1_{PT}}Y_i}{N^1_{PT}}}{1-p^*}- \frac{\frac{\sum_{i\in\mathcal{I}^0_{SP}}Y_i}{N^0_{SP}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*}, \end{align*}\]

with \(\mathcal{I}_{T}\), \(T\in\left\{ST,SC,PT\right\}\), the set of units \(i\) that belong to a cluster of type \(T\), \(\mathcal{I}^d_{PT}\), \(d\in\left\{0,1\right\}\), the set of units that belong to a cluster of type \(PT\) and have \(R_i=d\), \(N_{T}\), \(T\in\left\{ST,SC,PT\right\}\), the number of units \(i\) belonging to clusters of type \(T\), and \(\mathcal{N}^d_{PT}\), \(d\in\left\{0,1\right\}\), the number of units that belong to clusters of type \(PT\) and have \(R_i=d\). Following usual arguments, these estimators are both unbiased and consistent (as the number of clusters goes to infinity) for \(A''(\frac{1}{2})\). Their components can both be estimated separately by using OLS with a linear model on separate subsamples. The covariance of each separate with/without comparison can be estimated by estimating both components jointly, for example by estimating the following model by OLS:

\[\begin{align*} Y_i & = \alpha^{All} + \beta^{All}_{PT}\uns{i\in\mathcal{I}_{PT}} + \beta^{All}_{ST}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{All}\\ Y_i & = \alpha^{Diff} + \alpha^{Diff}_{1}\uns{R_i=1} + \beta^{Diff}_{0}\uns{i\in\mathcal{I}^0_{PT}}+ \beta^{Diff}_{1}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{All}. \end{align*}\]

With these parameter estimates, we have:

\[\begin{align*} \hat{A}''_{All}(\frac{1}{2}) & = \frac{\hat\beta^{All}_{ST}}{1-p^*}-\frac{\hat\beta^{All}_{SP}}{p^*}\\ \hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\hat\beta^{Diff}_{1}}{1-p^*}-\frac{\hat\beta^{Diff}_{0}}{p^*}. \end{align*}\]

To estimate the precision of each of the parameters, one has to use standard errors clustered at the cluster level. To obtain the precision of \(\hat{A}''(\frac{1}{2})\), one can simply use the Delta Method.

Remark. Note that in practice, the actual proportion of treated in each cluster of type \(PT\) is going to differ from \(p^*\). Does this affect consistency and unbiasedness of both estimators? Could we estimate \(\hat p^*\) and try to use it to get access to a wider share of the \(A\) function, or at least to an average effect? See Davide’s discussion of that issue.

11.2.2 Identifying optimal treatment levels

In the previous section, we discussed ways of identifying diffusion effects, and we focused on the task of finding the optimal treatment allocation when total treatment capacity was fixed to a limited number of treatments. In that scenario, what turned out to be super important was the shape of the returns to treatment effort (convex or concave), and it turned out to be related to whether contagion or amplification effects dominated. Though this scenario of constant treatment effort sometimes happens in real life, in other situations, policymakers might want to decide whether to increase or decrease their treatment effort, and to find the optimal treatment level, taking into account diffusion effects. This is the goal of this section, which is fully based on Davide Viviano (2023)’s recent working paper on the topic.

11.2.2.1 Setup and assumptions

Davide considers a setting with \(K\) clusters of equal size \(N\). Researchers sample a proportion \(\lambda\in\left]0,1\right]\) of the \(N\) units in each cluster at each period \(t\) and they have access to the following information for the sampled observations in each cluster: \(\left(Y^{(k)}_{i,t},X^{(k)}_{i},D^{(k)}_{i,t}\right)_{i=1}^n\) where \(n=\lambda N\) and \(X^{(k)}_{i}\) are baseline characteristics.

Davide describes the latent process that takes place between units in each cluster. Units are spaced on a latent space, and each unit can interact with at most the \(\sqrt{\gamma_N}\) closest units. Let \(\uns{i_k\leftrigharrow j_{k}}\) denote whether or not \(i\) and \(j\) can be connected in the resulting the latent network. Let \(\mathcal{I}_k\) be the matrix of these potential connections in cluster \(k\). Davide’s first assumption is:

Hypothesis 11.9 (Network) Actual connections are generated as follows:

\[\begin{align*} A_{i,j}^{(k)} & = l(X^{(k)}_{i},X^{(k)}_{j},U^{(k)}_{i},U^{(k)}_{j})\uns{i_k\leftrigharrow j_{k}}, \end{align*}\]

for some function \(l\) and unobservables \(U^{(k)}_{i}\) with \(\left(X^{(k)}_{i},U^{(k)}_{i}\right)|\mathcal{I}_k\sim F_{U|X}F_{X}\) and with \(\sum_{j=1}^N\uns{i_k\leftrigharrow j_{k}}=\sqrt{\gamma_N}\).

The second assumption Davide makes is on how potential outcomes are generated:

Hypothesis 11.10 (Potential outcomes) Potential outcomes are generated as follows:

\[\begin{align*} Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)}) & = r(D_{i,t}^{(k)},\mathbf{D}_{\mathcal{N}_i^{k},t}^{(k)},X^{(k)}_{i},X^{(k)}_{\mathcal{N}_i^{k},t},U^{(k)}_{i},U^{(k)}_{\mathcal{N}_i^{k},t},A^{(k)}_{i,.},|\mathcal{N}_i^{k}|,\nu^{(k)}_{i,t})+\tau_k+\alpha_t, \end{align*}\]

for some function \(r\) which attains the same value for any permutations of the entries of \(A^{(k)}_{i,.}\), with \(A^{(k)}_{i,.}\) the vector of connections of \(i\) in , and unobservables \(\nu^{(k)}_{i,t}|\left(X^{(k)}_{i},U^{(k)}_{i}\right)\sim F_{\nu}\), and where \(\mathcal{N}_i^{k}=\left\{j:A^{(k)}_{i,j}>0\right\}\).

The main assumption that Davide makes on how units interact in the network

11.3 Diffusion effects with detailed networks

Leung

11.4 Nonparametric test for the existence of diffusion effects based on randomization inference

Athey and Imbens

11.5 Diffusion effects with DID

Sylvain’s approach vs Kyle’s approach