Front door criterion and Natural experiments

Front door

In a recent blog I highlighted how regression discontinuity could be an example of Pearl’s front door criterion. This is interesting because examples are rare. It occurred to me that, given what they have in common, a number of other natural experiment designs could also fall under the front door criterion. Let’s recap the front door criterion. Essentially, a variable of interest is shielded from confounding (C, often unobserved) by another variable. In Pearl’s original formulation, a mediator is shielded from confounding by the exposure. The shielding variable in my example is time (T), which shields the exposure (E), but it could be another variable that causes the natural experiment. Conditioning on time blocks both the confounding and the effect of time on the outcome, allowing us to estimate the effect of the policy / intervention on the outcome (Y). The DAG below illustrates.

DAG: Y = outcome, E = policy / intervention (exposure), T = time, C = confounding
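
As a sketch of this identification claim, here is the DAG checked with the dagitty package in R (my encoding of the diagram above, not code from the original post):

```r
library(dagitty)

# T (time) shields E from the unobserved confounding C
g <- dagitty('dag {
  C [latent]
  C -> T  C -> Y
  T -> E  T -> Y
  E -> Y
}')

adjustmentSets(g, exposure = "E", outcome = "Y")
# { T } : conditioning on time identifies the effect of E on Y
```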

Natural experiments: similarities

Let’s compare the following designs: regression discontinuity, interrupted time series, difference-in-difference, and synthetic control.

Now this blog is not about the technicalities of estimation for these designs, and I simplify a lot for illustration. For example, I use a linear relationship between the outcome and time, and I keep any differences between groups in the pre-intervention period fixed. The intervention happened in 2006 in one population and not the others.

Normally, regression discontinuity doesn’t use time as the forcing variable, as this is a “weak” design, but conceptually it is the same as an interrupted time series with no control, as panel A of the figure below argues. Y(1) and Y(0) are the outcomes in the intervention group under the intervention (1) and no intervention (0). Y(0) is counter-to-fact (dashed line), so we use the observed (solid lines) pre-intervention trend as a proxy (Y(0)pre).

Adding a control group (Y(com)) that might have a different level of the outcome, but a similar pre-trend, leads us to a difference-in-difference design (panel B). This is similar to adding a control to the interrupted time series. Admittedly, difference-in-difference designs often use fewer pre and post time points.

Finally, in panel C, synthetic control extends this by weighting a range of controls (represented by the shaded area around the line) in the pre-intervention period to mimic Y(0)pre and then proxy Y(0).
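
As a toy sketch of the shared logic (all numbers invented for illustration; the post’s figures, not code, are the original argument), here is a simulated difference-in-difference in R: a common linear trend, an intervention in 2006 in one group only, and the effect recovered from the group-by-period interaction.

```r
set.seed(1)
year <- 1996:2016

# Common linear trend, a fixed level difference between groups, and a
# step of +3 in the intervened group from 2006 onwards
make_y <- function(level, effect) {
  level + 0.5 * (year - 1996) + effect * (year >= 2006) +
    rnorm(length(year), 0, 0.2)
}
d <- rbind(data.frame(year, group = 1, y = make_y(10, 3)),  # intervened
           data.frame(year, group = 0, y = make_y(8, 0)))   # control
d$post <- as.numeric(d$year >= 2006)

# With a shared pre-trend, the interaction recovers the effect (~3)
coef(lm(y ~ group + post + group:post, data = d))["group:post"]
```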

In all these designs time is a forcing variable, causing the intervention and allowing, with assumptions, control of confounding. While we tend to treat each design separately, highlighting their commonality may be useful, and the front door criterion is one way to do so.

Does this make sense? Great to hear your views.

Economic success and failure: life expectancy edition

Recently, a number of governments have shown interest in moving beyond economic growth (GDP) to judge national performance. Measures of wellbeing are among the additional measures suggested. Here’s Scotland’s First Minister on the topic. Amartya Sen had a slight twist on this idea: use mortality statistics as measures of economic success. In that spirit, let’s assess the economic performance of some high-income countries using life expectancy. I limit the analysis to countries in the Human Mortality Database. In 2016, Hong Kong had the most successful economy, with the USA towards the bottom.

Looking at trends for the UK and USA against the life expectancy leader from 1970 to 2016, these two countries have been relatively economically weak, and there have been significant and worrying slowdowns recently. Are they, perhaps, in recession?

Knock knock. Who’s there? It’s regression discontinuity! But which door?

Regression discontinuity design can return a causal effect by shielding the exposure of interest from (unobserved) confounding. It does so through a forcing variable, a score or similar with a threshold that is the sole cause of exposure. For an epidemiology overview see here. As always, a DAG expresses this better.

For DAG aficionados, this looks a bit front doory? The front door criterion is sort of akin to a reverse instrumental variable: a mediator of a confounded exposure acts as the instrument for the exposure. Examples are hard to find. The regression discontinuity case is a bit different: here the exposure, rather than a mediator, is the shielded variable, and the shielding variable does affect the outcome. But does it have the property of the front door criterion? The forcing variable / exposure relationship is identifiable because the outcome, a collider, blocks the only other path. The exposure / outcome relationship is identifiable by backdoor adjustment for the forcing variable.
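
A sketch of these two identification claims in dagitty (F = forcing variable, E = exposure, Y = outcome, C = unobserved confounding; the DAG is my reading of the post’s figure):

```r
library(dagitty)

g <- dagitty('dag {
  C [latent]
  C -> F  C -> Y
  F -> E  F -> Y
  E -> Y
}')

paths(g, "F", "E")           # only F -> E is open: Y is a collider on the rest
adjustmentSets(g, "E", "Y")  # { F } : backdoor adjustment for the forcing variable
```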

There are different flavours of regression discontinuity design. What we might call the randomized flavour is based on a slightly different DAG (below). Analysis focuses entirely on the threshold of the forcing variable: here, measurement error in the forcing variable means that exposure is effectively randomized around the threshold and adjustment isn’t needed.

Caveat time. This is just my drawing of the design; others exist. And this is a blog, so my language is imprecise and mistakes are even more likely. To read more about the front door criterion see The Book of Why.

Target validity. I wrote a letter

One of my favourite recent papers is this one on target validity. It is great. I wrote a letter to the journal as I think the authors’ example shows that internal rather than external validity produces a narrower range of estimates around the true estimate in the target population. The journal did not see it as a priority (which is fair enough). Rather than let it go to waste I thought I would publish it below. I might be wrong in my analysis, by the way.

Dear Editor,

I read with great interest the paper by Westreich and colleagues on target validity (1). They make a powerful case for the importance of a representative sample for validly estimating an average treatment effect in a target population when there is effect modification, as opposed to focusing solely on internal validity. I would like to further the debate by illustrating a potential issue. In their example, the range of potential non-target effect sizes is larger in the representative sample (when internally invalid) than in internally valid but externally invalid samples.

I will use their web appendix examples to illustrate. The examples feature a binary treatment and a binary outcome, where 50% of those in the target population have the binary effect modifier and the target average treatment effect is 0.2. In their examples (A and B) where confounding is dealt with (i.e. in the examples exposure is randomised), the sample estimates are average treatment effects where 50% and 80% have the effect modifier respectively. In contrast, example C is confounded (internally invalid) but externally valid (representative of the target population) and nearer the target effect than the effect from example B.

I adjusted example C so it remained representative, but I increased the level of confounding so that all those treated came from those with the effect modifier and all those untreated came from those without it. The effect size was 0.4 (0.7 - 0.3). Reversing the situation, so the untreated came only from those with the effect modifier and the treated came only from those without it, the effect size was 0 (0.4 - 0.4). In contrast, the range of effects from internally valid but unrepresentative studies was narrower: 0.3 (100% have the effect modifier) to 0.1 (0% have the effect modifier). Any trade-off between external and internal validity may need to consider that the degree of potential error may not be equal between internal and external validity.

Yours,

Frank Popham

ACKNOWLEDGEMENTS

FP works at the MRC/CSO Social & Public Health Sciences Unit, 200 Renfield Street, Glasgow, G2 3AX. FP is the corresponding author (frank.popham@glasgow.ac.uk). Competing interests: FP declares that he has no competing interests. Funding: FP is funded by the Medical Research Council (MC_UU_12017/13) and the Scottish Government Chief Scientist Office (SPHSU13).

REFERENCES

  1. Westreich D, Edwards JK, Lesko CR, Cole SR, Stuart EA. Target Validity and the Hierarchy of Study Designs. American Journal of Epidemiology 2019;188(2):438-443.


Did New Labour’s health inequalities strategy impact population health?

Etaine Lamy led the writing and associated analysis, with my input, for this post. Etaine is studying for her degree in economics at the University of Glasgow and has been on a short-term placement in my Unit arranged by the Q-Step initiative. Q-Step promotes quantitative social science education in the UK by increasing undergraduate training in quantitative methods. Rather than let Etaine’s work sit unpublished for ages (as I might take a while to do my share of work on the paper), we thought we would write a blog summary. Please bear in mind that this hasn’t been peer reviewed and that we are trying a newish method, so the results should be taken as preliminary. Any suggestions for improvement are welcome. The R code to replicate is here.

Background

The New Labour government (1997-2010) aimed to cut health inequalities. In 2003, it set these targets: to reduce inequalities in infant mortality and life expectancy between the most deprived groups and the rest of the country by 10% by 2010. In its scale and ambition, the English strategy remains unique in the world, with more than £20bn invested.

Did the New Labour strategy manage to reverse the trend of growing health inequalities, seen in Britain but also in the rest of Europe? Research so far has been divided. Mackenbach (2011) found that, overall, inequalities had stayed stable or in some cases even increased. However, a recent study by Barr and Whitehead (2017) found that the target had been reached for male life expectancy. Hu et al (2016) examined the trends in England compared to other European countries using a difference-in-difference analysis. They did not find an effect.

The method

To add to this debate, we used a relatively new method: the synthetic control method (SCM). Alberto Abadie and Javier Gardeazabal developed it in 2003 to study the economic effects of terrorism in the Basque Country. Since then, the SCM has been applied in a wide range of areas, from economics to public health. The SCM can work well for comparative case studies. Here we build a counterfactual synthetic England and Wales (using data from 1987 onwards) to compare post-intervention trends (2003 onwards) in the actual and the counterfactual England and Wales.

We built the synthetic control for England and Wales from a pool of 11 countries: Australia, France, Greece, Israel, Italy, Japan, Portugal, Spain, Switzerland, New Zealand and the United States. We chose these countries because of data availability and because we found it difficult to match England and Wales in terms of income inequality trends; these countries all had a Gini coefficient of 27 or over in 2003. The pool also excludes any country with a similar health inequalities strategy over the same period (for example, Ireland). Our set of predictors includes socio-economic variables, primarily from the World Bank database, such as GDP per capita in current US$, CO2 emissions per capita, the enrolment rate in tertiary education, and the Gini coefficient from the World Income Inequality Database. While the strategy was for England, our mortality data cover England and Wales and other indicators are for the UK as a whole.
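
For intuition, here is a minimal sketch of the weight search at the heart of SCM on invented data (the real analysis used the predictors above; `y_pre` and `eng_pre` are hypothetical stand-ins, and dedicated software such as the Synth package in R does this properly):

```r
set.seed(1)
years   <- 1987:2002  # pre-intervention period
y_pre   <- matrix(rnorm(length(years) * 11, mean = 80, sd = 1),
                  nrow = length(years))  # invented donor-pool outcomes
eng_pre <- rowMeans(y_pre[, 1:4])        # pretend England and Wales series

# Donor weights constrained to the simplex (non-negative, sum to one)
# via a softmax reparameterisation, chosen to fit the pre-period
loss <- function(theta) {
  w <- exp(theta) / sum(exp(theta))
  sum((eng_pre - y_pre %*% w)^2)
}
fit <- optim(rep(0, 11), loss, method = "BFGS")
w   <- exp(fit$par) / sum(exp(fit$par))
round(w, 2)  # synthetic E&W in any year is the weighted donor average
```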

We wrote a protocol before starting the analysis. We did change the analysis plan somewhat, as the analysis revealed some unanticipated problems.

Our findings
Lifespan variation: average loss of life

Our measure of health inequality is lifespan variation (Vaupel et al, 2011), which measures the average life expectancy lost per death. Because a country’s lifespan variation is driven by premature deaths, which are more prevalent in deprived areas, an effective health inequalities strategy might have reduced the level in England and Wales compared to its synthetic. If anything we find the opposite (panels a and b, Figure 1): the post-intervention decline (starting in 2005) is slower in England and Wales.
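
(In life-table terms this is the e-dagger measure. A minimal sketch in R, assuming life-table columns dx, deaths at each age, and ex, remaining life expectancy at each age, for example from the Human Mortality Database:)

```r
# Average life expectancy lost per death (e-dagger)
edagger <- function(dx, ex) sum(dx * ex) / sum(dx)
```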

SCM lacks traditional tests of statistical significance, but changing the intervention date (in-time) and the intervention country (in-space) provide placebo tests. The in-time placebo test shows that there is still a change after 2005 (panel a, Figure 4). The in-space placebo test (panel d, Figure 1) shows a greater effect for England and Wales than for most other countries. The result also holds when adding 8 countries with lower income inequality and when adding several other socio-economic predictors (panel c, Figure 1).

Figure 1

Life expectancy

Given that trends in inequality in life expectancy are tied to life expectancy itself, we might expect that a successful strategy would have resulted in a faster rise in England and Wales’ life expectancy. We find no evidence for this: the synthetic and actual curves almost mirror each other after 2003 (panels a and b, Figure 2). Our result seems robust to placebo tests similar to those above (panels c and d, Figure 2 and panel b, Figure 4).

Figure 2

Infant Mortality Rate

We repeated the analysis for infant mortality. England and Wales follows a very atypical trend during the pre-intervention period. All countries in the pool follow a similar, steep reduction from 1987 to the mid-1990s; however, although England and Wales starts at the median of the pool, its pace of reduction slows from the early 1990s. For this reason, the fit of the synthetic control is not very good (panels a and b, Figure 3); however, the results seem robust to placebo tests (panels c and d, Figure 3 and panel c, Figure 4). The strategy does not seem to have had an effect on infant mortality for the whole population either.

Figure 3

Figure 4

Conclusions

While these results are preliminary and have not been peer reviewed, they suggest that at a national level the inequalities strategy did not impact life expectancy, lifespan variation, or infant mortality. Of course no study should be regarded as definitive and readers are directed to other research in this area.

There are many challenges in this sort of analysis. For example, although countries in the pool did not carry out a health inequalities strategy per se during the period, their policies could still impact health inequalities. Moreover, defining a synthetic control England was not simple, particularly given the rising income inequality trend in the UK, which stabilised following New Labour’s election.

Age, period, cohort models. Trying to simplify the complex.

Now I know my APC….

Separating the linear effects of age, period (year) and cohort (birth year) on a health outcome is, in one sense, impossible, as all three are interdependent: age is period minus cohort. Despite this, debates about whether these effects can be estimated simultaneously are endless. Much of the debate is highly statistical, but it seems to me that a step back might be helpful.

Why?

For a while I have thought that drawing a causal diagram would help. Prompted by Judea Pearl’s excellent book, I now have. One key message of the book for me is that diagrams are useful for such seemingly intractable statistical problems. This is not the first time causal diagrams have been used in age, period, cohort modelling, but in those instances age, period and cohort are shown as interdependent, with two-way relationships. For me, this ignores the temporality of their relationship. At least in my thinking, cohort and period are causes of age. It seems wrong to say age causes year or birth year, so age is not a cause of period or cohort. Year of birth is the cause of birth cohort, but year does not continue to be a cause of your birth cohort, so I haven’t drawn an arrow between them. The diagram is not to deny that they can be used to determine each other; rather, it reflects my assumptions about the causal relations. You may think I am wrong; the beauty is you can draw your own.

Table 2 fallacy?

My conclusion from the figure is that age, period, cohort thinking suffers from the Table 2 fallacy. Put simply, there are different types of effect an exposure can have, and people end up comparing apples and oranges. For example, if we model cohort against the outcome and do not control for age, that is the total effect of cohort. If we control for age, then we have the direct effect that does not run through age. If you simultaneously model age, period and cohort, you compare the total effect of age with the direct effects of period and cohort. The total effect of age comes from controlling for cohort and period. The total effect of cohort and the total effect of period do not require control for age or for each other. Well, at least in my diagram.
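
A sketch of this total/direct distinction, encoding my DAG’s assumptions in dagitty (cohort and period cause age; all three may affect the outcome Y):

```r
library(dagitty)

g <- dagitty('dag {
  Cohort -> Age   Period -> Age
  Cohort -> Y     Period -> Y     Age -> Y
}')

adjustmentSets(g, "Cohort", "Y", effect = "total")   # {} : no adjustment needed
adjustmentSets(g, "Cohort", "Y", effect = "direct")  # { Age, Period }
adjustmentSets(g, "Age", "Y", effect = "total")      # { Cohort, Period }
```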

Wrong?

I am probably wrong here, and nothing I have written solves the deterministic relationship and the modelling problems. But the very act of drawing causal assumptions is, I believe, important for progressing this field. Moreover, clarity about the exposure these effects represent would also be good, but that’s a discussion for another time.

Interaction? Yes and no!

If the effect of an exposure differs across levels of a baseline covariate, we call this interaction. In fact, we should probably say effect modification, reserving interaction for the joint effect of different causes. Anyway, using the data from my previous blog, we can see in the table below that there is effect modification for the relative risk, but not for the absolute difference or the odds ratio.

Measure          Z=0     Z=1     Overall
Difference       0.2     0.2     0.2
Odds ratio       2.67    2.67    2.25
Relative risk    2       1.33    1.53

So the simple point here is that effect modification is measure dependent. For a more comprehensive review please see Chapter 4 of this book.
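
A small sketch reproducing the table from the stratum probabilities P(Y=1 | X, Z) in the previous blog’s data:

```r
p <- matrix(c(0.2, 0.4, 0.6, 0.8), nrow = 2,
            dimnames = list(X = 0:1, Z = 0:1))
odds <- function(q) q / (1 - q)

rbind(Difference      = p["1", ] - p["0", ],
      `Odds ratio`    = odds(p["1", ]) / odds(p["0", ]),
      `Relative risk` = p["1", ] / p["0", ])
```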

The odd odds ratio

Update 13/10/2018

Thanks to a reader who pointed out an error in Table 4 (now corrected) and suggested a minor change (done) in language in the summary. Great to get such feedback.

That’s odd

I am not a statistician but really enjoy trying to learn more about the quantitative methods I use. One frustration is the disconnect between the technical literature and applied practice. This means “mistakes” (that I have made) still occur in applied work despite having been highlighted for a long time in the more technical literature. One issue is that bridging work between the two is quite rare. In a couple of blogs I will try to highlight issues that can occur even in a simple analysis.

Let’s start with the data

Here’s some data; it originates from here. We have a binary outcome (Y), a binary exposure (X), and a binary baseline covariate (Z).

Y = 0     Z=0     Z=1     Total
X=0       220      90     310
X=1       165      45     210
Total     385     135     520

Y = 1     Z=0     Z=1     Total
X=0        55     135     190
X=1       110     180     290
Total     165     315     480

So what’s the effect of X on Y?

With the exposure (X=1) the probability of Y is 0.58 and without the exposure it is 0.38. So that’s a difference of 0.58 - 0.38, a relative risk of 0.58 / 0.38, and an odds ratio of (0.58 / (1-0.58)) / (0.38 / (1-0.38)). That is a difference of 0.2, a relative risk of 1.53, and an odds ratio of 2.25.
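
A quick sketch in R, rebuilding the tables above as cell counts (my reconstruction, not the original analysis code):

```r
d <- expand.grid(y = 0:1, x = 0:1, z = 0:1)
d$n <- c(220, 55, 165, 110, 90, 135, 45, 180)  # cell counts from the tables

p1 <- with(d, sum(n[x == 1 & y == 1]) / sum(n[x == 1]))  # 0.58
p0 <- with(d, sum(n[x == 0 & y == 1]) / sum(n[x == 0]))  # 0.38
c(difference    = p1 - p0,
  relative_risk = p1 / p0,
  odds_ratio    = (p1 / (1 - p1)) / (p0 / (1 - p0)))
```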

Let’s adjust for Z.

For Z = 0, with the exposure (X=1) the probability of Y is 0.4 and without the exposure it is 0.2. So that’s a difference of 0.4 - 0.2, a relative risk of 0.4 / 0.2, and an odds ratio of (0.4 / (1-0.4)) / (0.2 / (1-0.2)). That is 0.2, 2, and 2.67.

For Z = 1, with the exposure (X=1) the probability of Y is 0.8 and without the exposure it is 0.6. So that’s a difference of 0.8 - 0.6, a relative risk of 0.8 / 0.6, and an odds ratio of (0.8 / (1-0.8)) / (0.6 / (1-0.6)). That is 0.2, 1.33, and 2.67.

Let’s get an adjusted average effect using regression models (linear, Poisson, and logistic) with Y as the outcome and X and Z on the right-hand side. We get adjusted effects: difference = 0.2, relative risk = 1.53, and odds ratio = 2.67.
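
Continuing the sketch, here are the three adjusted models fitted to the cell-count data frame `d` from above, with the counts as weights:

```r
m_lin <- glm(y ~ x + z, family = gaussian, data = d, weights = n)
m_poi <- glm(y ~ x + z, family = poisson,  data = d, weights = n)
m_log <- glm(y ~ x + z, family = binomial, data = d, weights = n)

coef(m_lin)["x"]       # difference    = 0.2
exp(coef(m_poi)["x"])  # relative risk = 1.53
exp(coef(m_log)["x"])  # odds ratio    = 2.67
```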

So is Z a confounder?

Well, because we are taught to use a logistic regression for a binary outcome, we might answer yes, as the odds ratio for X changes from 2.25 to 2.67 when we adjust for Z.

But Z isn’t a confounder: it is not associated with X (the probability of Z is 0.45 in both X=0 and X=1). When averaging two conditional odds ratios (which is sort of what the adjusted regression does), the average will not usually equal the unadjusted odds ratio, even in the absence of confounding.

This is known as “non-collapsibility” in epidemiology. It is well documented in epidemiology and the social sciences, but it is still common to see papers comparing odds ratios before and after confounder adjustment as a method of judging the extent of confounding. I’ve done this. However, even when there is confounding, a change in the odds ratio will often be part confounding and part non-collapsibility. The difference and the relative risk are collapsible measures. See this and this for fuller technical discussions.

So don’t use a logistic regression?

No, using a logistic regression for a binary outcome is generally a good idea. There is nothing wrong here; it is just that the marginal and conditional (on Z) odds ratios will often differ even in the absence of confounding. As I did above to get the difference and relative risk, you can fit other models to binary data, but these “solutions” stem from the practice of reading effects directly from model coefficients. I can derive the marginal odds ratio, the risk difference and the relative risk from the adjusted logistic regression even though the model reports conditional odds.

Conditional to marginal

The table below displays the adjusted logistic regression results. Note for clarity results are rounded to two decimal places.

Term               Estimate
X (odds ratio)     2.67
Z (odds ratio)     6
Constant (odds)    0.25

We now use these results to derive the odds and probability of Y for each of the four combinations of X and Z. To convert from odds to probability, use Odds / (1 + Odds).

X    Z    Formula for odds     Odds    Probability
0    0    Constant             0.25    0.2
1    0    Constant * X         0.67    0.4
0    1    Constant * Z         1.5     0.6
1    1    Constant * X * Z     4       0.8

To get an average effect for X, we can standardise to the probability of Z in the whole sample, as below.

X    Z    Weight    Probability    Weight * Probability
0    0    0.55      0.2            0.11
1    0    0.55      0.4            0.22
0    1    0.45      0.6            0.27
1    1    0.45      0.8            0.36

The sum for X=1 is 0.22 + 0.36 = 0.58 and for X=0 is 0.11 + 0.27 = 0.38. We have the same result as the unadjusted analysis, as there is no confounding. We can work out the marginal difference, relative risk and odds ratio as before.
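
The same standardisation from the fitted logistic model (`m_log` above), a sketch of what is sometimes called g-computation:

```r
d1 <- transform(d, x = 1)  # everyone exposed
d0 <- transform(d, x = 0)  # everyone unexposed

p1 <- weighted.mean(predict(m_log, newdata = d1, type = "response"), d$n)
p0 <- weighted.mean(predict(m_log, newdata = d0, type = "response"), d$n)

c(difference    = p1 - p0,                            # 0.2
  relative_risk = p1 / p0,                            # 1.53
  marginal_or   = (p1 / (1 - p1)) / (p0 / (1 - p0)))  # 2.25
```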

In summary
  • Be clear on the effect measures you want to estimate.
  • Logistic regression is a good choice for binary outcomes. However, a change in an odds ratio between models is not always due to confounding alone.
  • Stata has a margins command, and R a margins package, that can calculate marginal effects from a logistic regression.
  • Read this great free book.

Next blog, I am going to look at effect modification (aka interaction).

Same model, different R squared.

Important risk factor?

Quite often I see papers that report how much of the variance in an outcome has been explained by the risk factor(s) of interest. The thought seems to be that the higher the percentage explained (the higher the R squared), the better: important variables have been identified.

Perhaps not?

But consider this famous example. Everyone in a rich country smokes 20 cigarettes a day. You study the reasons for lung cancer in this population. Smoking wouldn’t explain any of the variance in lung cancer, so it wouldn’t be identified as a cause of lung cancer. But it is the reason this country has a much higher rate of lung cancer than a rich country where nobody smokes. This is summarised as the causes of cases not necessarily being the same as the causes of incidence (the rate). In population health we mostly want to change the causes of incidence. Of course, even if you’re dealing in prediction rather than cause, it is still the case that predictors of cases are not necessarily the predictors of incidence.

Something magic?

So while smoking in a particular cohort of individuals might explain only 10% of the variation in lung cancer, smoking explains most (around 90%) of the differences in rates between areas. Something I have seen mentioned less often is that the same analysis on the same data can give a different value of the same R squared.

Sounds like magic! Hey presto, this Stata code illustrates in detail. Using data on mortality, smoking and age, and a Poisson model of individual data (with dead or not as the outcome), I get an R squared of 9%. But I can rearrange the data, run the same model, and get the same results (effect sizes and CIs) but a completely different R squared (93%). The difference is that I changed the number of observations from 181,467 individuals to 10 groups, controlling for the size of the groups using an offset. So at the group level the explained variance is pretty high. Given they are essentially the same analysis, their predictive ability is actually the same. Of course, R squared in Poisson and logistic models is a pseudo R squared, calculated differently to the vanilla R squared, so don’t take this as a technically accurate description, but I think the spirit of what I say is right.
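
Here is a hedged R analogue of that Stata exercise on invented data (numbers and variable names are illustrative, not the original’s): the individual-level and grouped Poisson models give identical coefficients but very different deviance-based pseudo R-squareds.

```r
set.seed(1)
n  <- 50000
df <- data.frame(age    = sample(40:89, n, replace = TRUE),
                 smoker = rbinom(n, 1, 0.3))
df$dead <- rbinom(n, 1, plogis(-9 + 0.09 * df$age + 0.6 * df$smoker))
df$pop  <- 1

# Same Poisson model on individuals and on grouped counts with an offset
ind <- glm(dead ~ age + smoker, family = poisson, data = df)
grp <- aggregate(cbind(dead, pop) ~ age + smoker, data = df, FUN = sum)
agg <- glm(dead ~ age + smoker + offset(log(pop)),
           family = poisson, data = grp)

pseudo_r2 <- function(m) 1 - m$deviance / m$null.deviance
rbind(individual = coef(ind), grouped = coef(agg))        # same effects
c(individual = pseudo_r2(ind), grouped = pseudo_r2(agg))  # very different
```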

*Note: the dataset is actually in person-years, but I have pretended it is persons followed up for a year to save the complication of writing about person-years.

Complexity and cause


Complexity

Given the difficulty of solving intricate problems such as the obesity epidemic, researchers are turning to the concept of complexity. This sees problems as a system with multiple causal paths and feedback loops rather than a simpler matter of one cause and one effect.

This concept hit the medical big time recently when a viewpoint espousing a complexity approach for public health was published in the Lancet. The Lancet is widely read and has an impact factor of a zillion. The viewpoint raises important questions about an emphasis on randomised controlled trials (RCTs). RCTs have good internal validity (i.e. they are ace for establishing causality) but may lack external validity (i.e. the intervention and any effect may not generalise for various reasons). An emphasis on RCTs may also lead to an emphasis on individual-level interventions, and may ignore multiple other pathways in the system and their interactions.

Where I differ from the authors is that I think non-complexity causal methods recognise and can often address these issues. A good overview is given in this paper. It asks important questions of those advocating complexity while also recognising its potential importance.

Cause

Let me pick up one point: the critique of RCTs. The authors advocate natural experiments. That’s welcome, as natural experiments are great for studying cause and effect. Why are they great? Because they come from the same family of methods as the RCT: counterfactual methods. They all share the same statistical justification, namely that they solve (with assumptions) the problem that we can’t rerun time to have the same populations exposed to different interventions. Another important commonality is that there is an intervention.

So the problem, it seems to me, isn’t the RCT being “linear” (otherwise you would have the same issue with natural experiments). It is that RCTs are perceived as difficult to do for upstream policy interventions. Moreover, those working in the counterfactual framework consider issues that concern complexity fans. These include mechanisms on causal pathways (i.e. mediation, which is complex to do even in RCTs), interaction (the joint effect of different interventions), moderation, time-varying interventions, time-varying confounding and time-varying outcomes (feedback loops), transportability (generalisation, contextualisation), looking at numerous outcomes, spillover effects, and the drawing of graphs to unpack these issues.

Complexity and cause

I have a lot to learn about complexity (and causal methods), as my ill-informed descriptions above testify. Yet I do think it would be productive for those advocating complexity to engage with causal methods (and vice versa), as they are concerned with addressing similar issues.