Decomposing differences in arithmetic means: a doubly robust estimation approach

When decomposing differences in average economic outcomes between two groups of individuals, it is common practice to base the analysis on logarithms if the dependent variable is nonnegative. This paper argues that this approach raises a number of undesirable issues because the decomposition terms then have the interpretation of approximate percentage differences in geometric means. Instead, we suggest that the analysis should be based on the arithmetic means of the original dependent variable. We present a flexible parametric decomposition framework that can be used for all types of nonnegative dependent variables. In particular, we derive a propensity score weighted estimator for the aggregate decomposition that is "doubly robust", that is, consistent under two separate sets of assumptions. A comparative Monte Carlo study illustrates that the proposed estimator performs well in many situations. An application to the union wage gap in the USA finds that the importance of the unexplained union wage premium is much smaller than suggested by the standard log-wage decomposition.

While the recent literature emphasizes the importance of decomposing the entire distribution (e.g. Machado and Mata 2005; Melly 2005), the mean differential remains an important summary statistic in decomposition analysis: it is simple ("one number"), easily understood and sometimes the direct object of interest of policy analysis (e.g. health care expenditures). Differences in means are usually decomposed with the Oaxaca-Blinder method (Oaxaca 1973; Blinder 1973), which splits the gap into a structural effect (due to differences in coefficients) and a composition effect (due to differences in covariates). In many empirical applications where outcomes are continuous and nonnegative, as, for example, wages, the dependent variable is log-transformed before performing the decomposition. While the log-transformation is usually applied to allow for convenient estimation, it changes the interpretation of the outcome gap in an important way: the quantity to be decomposed when outcomes are in logs corresponds to a first-order approximation to the percentage difference in geometric means (GM). This quantity cannot be retransformed into the first moments of the original dependent variable, the arithmetic means (AM).
This paper argues that a decomposition based on the original dependent variable may be preferable to a decomposition based on logarithms on statistical and conceptual grounds. Foremost, if the support includes zero, the GM is theoretically undefined. In empirical studies, the dependent variable is sometimes artificially re-scaled in order to be able to take logarithms, but this distorts the distribution at the lower bound of the support (Santos Silva and Tenreyro 2006). Furthermore, even if there are no zeros in the data, a decomposition of AMs can be preferable for several reasons. First, the log outcome gap is not invariant to changes in the higher-order moments of the distributions even if arithmetic means remain constant (Leslie and Murphy 1997). If, for example, dispersion increases, the log outcome gap will change as well. This is clearly an undesirable property for a measure of the difference in average outcomes. Second, the log outcome gap only offers an approximate interpretation, whereas the raw outcome gap offers an exact interpretation. Finally, the AM is the more common and intuitive summary statistic.
In this paper, we suggest modelling the untransformed outcome variable directly to decompose differences in arithmetic means. The leading case is a decomposition of average wages between groups of workers (see Firpo et al. 2011 for an overview), but many other outcome variables can be of interest in decomposition analysis, such as health care expenditures (Vargas Bustamante and Chen 2011), financial wealth (Barsky et al. 2002), firm-level productivity (Mueller 2012), revenues (Munn and Hussain 2010), students' test scores (Krieg and Storer 2006) or counts, such as the number of cigarettes smoked (Bauer et al. 2007). Since the conditional expectation function is thought to be convex in most of these examples, as reflected in the widespread use of log-transformations, a linear outcome model is inappropriate (see, e.g. Barsky et al. 2002). Instead, we propose a flexible nonlinear parametric framework based on an exponential regression model. Following Firpo et al. (2011), we first discuss identification under the ignorability assumptions before proceeding to specification and estimation issues. By contrast, previous papers on nonlinear decompositions (Fairlie 2005; Bauer and Sinning 2008) have largely omitted discussing identification.
As our methodological contribution to the literature, we suggest a new estimation strategy for decomposing differences in average outcomes when outcomes are nonnegative. We propose a "doubly robust" weighted Poisson quasi-maximum-likelihood (WPQML) estimator. The existence of this estimator is briefly mentioned in Wooldridge (2007), but to our knowledge, it has not been studied in detail nor has it been used in empirical work. Quasi-maximum-likelihood (QML) estimators in general are attractive because they only require correct specification of the conditional mean function for consistency and no further distributional assumptions (Gourieroux et al. 1984). The Poisson QML estimator achieves some additional robustness if augmented with appropriate propensity score weights. We show formally that the decomposition can be consistently estimated if either the outcome model or the propensity score model is correctly specified. Therefore, double robustness is a useful property to guard against misspecification. It is important to note, however, that in the case where the validity of the ignorability assumptions is questioned, the same caveat applies as to other estimators that rely on these assumptions (see also Kunze 2008; Huber 2014).

To illustrate how the proposed estimator of the nonlinear decomposition performs in practice, we undertake a small Monte Carlo exercise. We compare the performance of weighted and unweighted QML estimators, linear and quadratic regression, reweighted regression (Firpo et al. 2007) and the nonparametric doubly robust estimator (Rothe and Firpo 2013) under various scenarios with regard to functional form, overlap of covariate distributions and heteroskedasticity. We find that the weighted Poisson QML estimator produces convincing results in many situations. It should therefore prove to be an attractive estimation strategy for applied researchers who wish to perform decompositions of nonnegative outcomes.
In an empirical application to the union wage gap in the USA, we compare the approach of decomposing arithmetic means with the standard Oaxaca-Blinder decomposition based on the log-wage gap (geometric means). The former suggests that the union wage premium is much less important in explaining the wage differential between union and nonunion workers than the latter. In other words, the concept of the mean can have striking implications for the analysis of differences in economic outcomes and therefore requires more careful consideration than it usually receives in empirical research.
The remainder of this paper is organized as follows: Sect. 2 deals with the conceptual and statistical issues of decomposing differences in either AMs or GMs. In Sect. 3, we define the general decomposition framework and present the identification result for the counterfactual of interest. In Sect. 4, we discuss model specification and derive the doubly robust estimator for the decomposition. Section 5 contains the Monte Carlo exercise and Sect. 6 the empirical application. Section 7 briefly covers some extensions to detailed decompositions, endogeneity and sample selection, and Sect. 8 contains some concluding remarks.

Analytical framework
The analytical framework used in this paper is based on the potential outcomes notation popularized by Rubin (1974). While questions of identification have mostly been ignored in the more traditional decomposition literature, the treatment effect framework clarifies the assumptions needed to identify the decomposition terms (Firpo et al. 2011). We begin by defining a large population indexed by i = 1, 2, ..., N that contains two mutually exclusive and nonempty groups of individuals. Let D_i = 1 if person i belongs to group 1, and D_i = 0 if she belongs to group 0. The potential outcome of person i if she belongs to group g is y_ig, where y_ig ≥ 0 for all i and g ∈ {0, 1}. The potential outcomes and the realized outcome are linked by y_i = D_i y_i1 + (1 − D_i) y_i0. Hence, we do not observe y_i1 for units in group 0 and y_i0 for units in group 1, which is why these quantities are "counterfactual". The notation introduced above will be used throughout this paper.

Measures of mean differentials
Before dealing with the technicalities of decomposition analysis, it is sensible to first clearly define the object under study, i.e. the quantity to be decomposed. In most situations, we are interested (amongst other statistics) in the average gap in economic outcomes between two groups, a prominent example being the average wage gap between male and female workers. It seems natural to consider the difference in expected outcomes, which is based on population arithmetic means (AM):

Δ_AM ≡ E[y_i | D_i = 1] − E[y_i | D_i = 0].    (1)

Of course, the outcome gap may also be expressed in terms of a percentage differential, e.g.

(E[y_i | D_i = 1] − E[y_i | D_i = 0]) / E[y_i | D_i = 0],

which requires the choice of a reference group (in this case group 0), but it is free from the underlying units in which outcomes are measured. Alternatively, if the decomposition is based on log outcomes, we have

Δ_GM ≡ E[ln y_i | D_i = 1] − E[ln y_i | D_i = 0],    (2)

which is the approximate percentage differential in the population geometric means (GM), since exp(E[ln y_i | D_i = g]) is the GM of group g. An exact percentage differential in terms of GMs is given by exp(Δ_GM) − 1, and the absolute differential in terms of GMs is exp(E[ln y_i | D_i = 1]) − exp(E[ln y_i | D_i = 0]). It is notable that in many applications with nonnegative dependent variables, researchers prefer to decompose (2) instead of (1). In particular, this is often the case in studies of wages, where log-linear wage models are estimated by OLS to perform Oaxaca-Blinder-type decompositions. This is the dominant procedure because wages and covariates are usually assumed to have a log-link relationship, but also because it is standard practice to estimate linear models instead of nonlinear models. While the geometric mean interpretation of (2) was noted by Oaxaca (1973), it is not explicitly mentioned in most empirical applications.
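The divergence between the AM and GM measures is easy to demonstrate numerically. The following sketch (the log-normal samples and their parameters are purely illustrative, not taken from the paper) computes both percentage differentials for two groups that differ in location and in dispersion:

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative log-normal samples: group 1 has higher location AND more
# dispersion than group 0 (parameter values chosen only for this example).
y1 = rng.lognormal(mean=0.4, sigma=0.8, size=100_000)
y0 = rng.lognormal(mean=0.2, sigma=0.5, size=100_000)

delta_am = y1.mean() - y0.mean()                  # absolute AM gap, as in (1)
pct_am = delta_am / y0.mean()                     # percentage AM gap
delta_gm = np.log(y1).mean() - np.log(y0).mean()  # log outcome gap, as in (2)
pct_gm = np.exp(delta_gm) - 1                     # exact GM percentage gap

# With unequal dispersion, the AM and GM percentage gaps diverge markedly.
print(round(pct_am, 2), round(pct_gm, 2))
```

Because the dispersion parameters differ across groups, the AM percentage gap is more than twice the exact GM gap in this example, even though both are computed from the same data.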
The question obviously arises as to how the concepts of AM and GM are related. Due to Jensen's inequality, we always have AM ≥ GM, with strict inequality unless the distribution is degenerate. However, the relationship between Δ_AM and Δ_GM is ambiguous without distributional assumptions and will depend on the properties of the distributions under study.

Statistical and conceptual issues
Leaving aside issues of modelling and estimation, which measure of the mean differential should be chosen when performing decompositions? There are several arguments to be made. Foremost, if the support of y_i includes zero, the GM is not defined. Thus, the decomposition of means must be based on AMs for outcomes that may be zero, such as individual health care expenditure, wealth or count variables. Using ad hoc manipulations such as ln(1 + y_i) to accommodate zeros is inappropriate as it may severely distort the distribution if the fraction of zeros is nonnegligible (cf. Santos Silva and Tenreyro 2006).
Furthermore, even if the support of the dependent variable excludes zero (as is the case with wages), there are a number of arguments why a decomposition of means should be based on AMs instead of GMs, or equivalently, on raw outcomes instead of log outcomes. First, as Leslie and Murphy (1997) note, the decomposition based on log outcomes is not invariant to changes in the higher-order moments of the distribution. As a result, even if the AMs of two distributions are the same, the difference in average log outcomes is nonzero if, for example, the dispersion differs across distributions. In particular, if two distributions are identical up to a mean-preserving spread, the log-wage gap picks up the difference in dispersion. For a measure of the difference in means, this is a rather undesirable property.
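The mean-preserving-spread argument can be checked analytically within the log-normal family. In the sketch below (parameter values chosen purely for illustration), the arithmetic mean is held fixed while the dispersion parameter varies, yet the mean of the logs shifts:

```python
import math

# Two log-normal distributions with IDENTICAL arithmetic means but different
# dispersion: if Y = exp(mu + sigma * Z) with Z ~ N(0, 1), then
#   AM = exp(mu + sigma**2 / 2)   and   E[ln Y] = mu.
# Hold AM fixed at exp(0.5) and vary sigma (values are purely illustrative).
def mean_log_given_am(sigma, am=math.exp(0.5)):
    # location parameter implied by the fixed arithmetic mean
    return math.log(am) - sigma**2 / 2.0

mlog_a = mean_log_given_am(sigma=1.0)   # mu = 0.0
mlog_b = mean_log_given_am(sigma=0.5)   # mu = 0.375

# The AM gap is zero by construction, yet the log "gap" equals -0.375,
# i.e. about -37.5 log points picked up purely from the dispersion difference.
print(mlog_a - mlog_b)
```

The computation is exact (no simulation involved): holding the arithmetic mean constant, the log outcome gap moves one-for-one with the difference in the variance parameters.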
Second, the approximation in (2) becomes inaccurate for large differences in outcomes due to the concavity of the logarithm: for exact GM differentials of 5, 10, 20 and 50 per cent, the "approximation bias" is 0.12, 0.47, 1.77 and 9.45 percentage points, respectively. This is clearly important in practice, since many differentials (e.g. the male-female wage gap) are quite substantial in magnitude. In fact, the approximate nature of (2) is only seldom pointed out in empirical studies, and sometimes no reference to the GM interpretation is made (e.g. Darity et al. 1995; Jürges 2002). As a consequence, interpreting log-points as percentage differentials can be misleading. Third, apart from time series contexts, it is customary to report arithmetic means rather than geometric means when summary statistics are presented. Although this does not represent a strong argument in favour of the AM decomposition per se, it nonetheless implies that the AM is usually the statistic of interest.
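The "approximation bias" figures quoted above follow directly from the concavity of the logarithm and can be reproduced in a few lines:

```python
import math

# For an exact GM differential of d (e.g. d = 0.05 is 5%), the log gap is
# ln(1 + d). The "approximation bias" is d - ln(1 + d), in percentage points.
for d in (0.05, 0.10, 0.20, 0.50):
    log_points = math.log(1.0 + d)
    bias_pp = 100.0 * (d - log_points)
    print(f"{100 * d:4.0f}% exact differential -> {100 * log_points:5.2f} "
          f"log points (bias {bias_pp:4.2f} pp)")
```

The bias grows much faster than the differential itself: it is negligible at 5 per cent but close to 10 percentage points at a 50 per cent differential.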
Finally, a potential weakness of the AM relative to the GM as a measure of central tendency is that it is more strongly affected by the presence of outliers in small samples. Although the GM is indeed more robust, this argument only applies if the two distributions under study are differentially affected by outliers, because we are interested in differences in means and not in the means per se.
The discussion above provides the motivation for developing a decomposition framework for differences in AMs that can be broadly applied to nonnegative dependent variables.

Decomposition framework and identification
As for any decomposition problem, we must define a meaningful counterfactual. In the classic Oaxaca-Blinder decomposition of wages, the counterfactual is implied by the reference wage structure (Cotton 1988; Neumark 1988; Jann 2008). We follow the approach of Firpo et al. (2011) and choose the "simple" counterfactual outcome μ_C ≡ E[y_i0 | D_i = 1]. This term corresponds to the mean outcome in group 1 that would prevail if their distribution was generated by the conditional expectation function (CEF) of group 0. Since the labelling of groups 0 and 1 is arbitrary, the choice of the counterfactual largely depends on the question of interest. However, we find this particular counterfactual useful because it offers a simple interpretation and has a meaningful treatment effect equivalent. As opposed to the counterfactual, the group-specific means are directly observable, i.e. E[y_i | D_i = g] = E[y_ig | D_i = g]. As a result, the aggregate AM decomposition is

Δ_AM = (E[y_i1 | D_i = 1] − μ_C) + (μ_C − E[y_i0 | D_i = 0]) ≡ Δ_S + Δ_X.    (3)

The first term on the right-hand side, Δ_S, is the structural effect (also: coefficients effect). In treatment effect terminology, it corresponds to the population average treatment effect on the treated (PATT); see, e.g. Imbens and Wooldridge (2009). The second term, Δ_X, is the composition effect (also: characteristics effect) and captures the part of the gap due to differences in covariate distributions across groups.
A prominent result from the treatment effects literature (see, e.g. Imbens 2004) states that identification of μ_C requires the two so-called ignorability assumptions. The first (and more critical) assumption is

y_i0 ⊥ D_i | X_i,    (4)

where X_i is a row vector of covariates with support X ⊆ R^k. The condition in (4) is referred to as the conditional independence assumption (CIA) and states that once we control for X_i, the distribution of unobserved characteristics is the same across groups in the absence of treatment. At this point, a critical discussion of the CIA is in order. Firpo et al. (2011) discuss three important situations in which the CIA is violated: (i) if there is nonrandom sample selection (e.g. differential selection of men and women into employment); (ii) if D_i is affected by confounding unobservables (e.g. self-selection of individuals into groups); (iii) if covariates are endogenous in ways that violate the CIA. Huber (2014) shows that standard wage decompositions can be cast into a so-called mediator framework. He argues that in many examples, a causal interpretation is hard to establish because group membership is endogenous and/or observed covariates are mediators (intermediate outcomes) which are themselves confounded by unobservables. Kunze (2008) discusses endogeneity issues in gender wage gap analyses, stressing that human capital variables are often rendered endogenous by unobserved heterogeneity, nonrandom sample selection into work and measurement error. Overall, these studies suggest that a careful assessment of the plausibility of the CIA is of great importance in the context of wage decompositions. The second assumption required for identification is less critical and is defined as
[Footnote 6, continued] Third, if the counterfactual is a sample-size weighted average of the two wage structures as in Cotton (1988), the wage structure effect can be shown to equal P(D_i = 0)PATT + P(D_i = 1)PATU under a set of baseline assumptions, with PATU being the population average treatment effect on the untreated (see Słoczyński 2012). Clearly, this quantity differs from the population average treatment effect (PATE) unless P(D_i = 0) = P(D_i = 1) and therefore has no interesting analogue in the treatment effects framework.
p(X_i) < 1 for all X_i ∈ X,    (5)

where p(X_i) ≡ P(D_i = 1 | X_i) is the propensity score. The condition in (5) is referred to as the common support assumption (CSA) and states that the support of the covariate distribution in group 1 (the treated) must be contained within the support of the covariate distribution in group 0 (the controls). Under the two ignorability assumptions (4) and (5), identification of the counterfactual outcome can be stated as follows (see Imbens 2004):

μ_C = E[ E[y_i | X_i, D_i = 0] | D_i = 1 ].    (6)

To decompose differences in arithmetic means, the main issue that must be addressed is to specify and estimate an appropriate model for E[y_i | X_i, D_i = 0], which we turn to in the next section.
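The identification result can be illustrated numerically. In the sketch below, the data-generating process (all parameter values are hypothetical) satisfies the CIA by construction, and the group-0 CEF is known in closed form, so the identified expression can be compared with the infeasible mean of y_0 among the treated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical DGP satisfying the CIA: selection into D depends on x only,
# and the group-0 CEF m0(x) = E[y | X = x, D = 0] is known in closed form.
x = rng.normal(0.0, 1.0, n)
d = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))          # P(D = 1 | x)
y0 = np.exp(0.1 + 0.4 * x) * rng.lognormal(-0.045, 0.3, n)  # E[error] = 1

mu_c = y0[d].mean()                    # infeasible counterfactual E[y0 | D=1]
m0_treated = np.exp(0.1 + 0.4 * x[d])  # known group-0 CEF for the treated

# Identification: E[y0 | D=1] = E[ E[y | X, D=0] | D=1 ].
print(mu_c, m0_treated.mean())         # the two agree up to simulation noise
```

Because selection operates only through x, averaging the group-0 CEF over the treated covariate distribution recovers the counterfactual mean, exactly as the identification result states.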

Model specification and estimation
Since researchers are usually also interested in detailed decompositions, parametric models are the more practical tools, so we will not discuss nonparametric methods. For the ensuing discussion, we assume that the CEFs can be represented by a parametric model E[y_ig | X_i] = μ(X_i, β_g), where β_g is a unique column vector of population coefficients. As a useful starting point for modelling, consider the "generalized" functional form (Wooldridge 1992)

E[y_ig | X_i] = (1 + λ X_i β_g)^{1/λ},    (7)

which encompasses the linear regression model (λ = 1) and the exponential regression model (λ → 0) as special cases. We discuss these two specifications for modelling the dependent variable in the next two subsections.

A note on the linear model
If the CEF of the outcome is assumed to be linear (λ = 1), then by the law of iterated expectations, the AM decomposition takes the familiar Oaxaca-Blinder form. Why, then, should we not use a linear model that can be conveniently estimated with OLS? In the case of nonnegative dependent variables, the linear model appears unnatural because it does not restrict the support of the CEF in any way. Given nonnegative outcomes, the true CEF is likely to be nonlinear. While OLS offers the best linear approximation to the CEF, the approximation may still be very poor if the true CEF is, for example, highly convex (Barsky et al. 2002). To see this, consider the stylized example in Fig. 1, where outcomes in both groups are generated by the same CEF (Δ_S = 0), but the covariate distributions in the treated group and the control group are different, N(0.5, 1) and N(1, 1), respectively, which means that Δ_X ≠ 0. The two group-specific regressions of y_i on X_i yield different parameters, meaning that the misspecified model leads to a spurious structural effect (estimated Δ_S ≠ 0). This inconsistency persists even if ignorability holds. Moreover, Barsky et al. (2002) point out that the performance of the linear estimator becomes particularly poor if there is missing overlap in the covariate distributions, due to the linear extrapolation of the misspecified CEF outside the common support.

[Footnote 7] A detailed decomposition is defined as a procedure that apportions the decomposition terms Δ_S and Δ_X in (3) into components that can be attributed to each covariate in X_i (cf. Firpo et al. 2011, p. 26). [Footnote 8] The uniqueness (identifiability) of β_g rules out perfect multicollinearity amongst the covariates.
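The spurious structural effect can be reproduced in a few lines. The sketch below (the coefficient and the error distribution are illustrative, not those behind Fig. 1) fits separate group-specific OLS regressions even though both groups share one exponential CEF, using the covariate distributions N(0.5, 1) and N(1, 1) from the stylized example:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Both groups share the SAME exponential CEF, so the true structural effect
# is zero; only the covariate distributions differ (treated: N(0.5, 1),
# controls: N(1, 1)). Coefficient 0.8 and the error law are illustrative.
x1 = rng.normal(0.5, 1.0, n)                       # treated group
x0 = rng.normal(1.0, 1.0, n)                       # control group
y1 = np.exp(0.8 * x1) * rng.lognormal(-0.125, 0.5, n)  # E[error] = 1
y0 = np.exp(0.8 * x0) * rng.lognormal(-0.125, 0.5, n)

def ols(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return X, np.linalg.lstsq(X, y, rcond=None)[0]

X1, b1 = ols(x1, y1)
_, b0 = ols(x0, y0)

# Structural effect evaluated over the treated covariate distribution:
delta_s_hat = (X1 @ b1).mean() - (X1 @ b0).mean()
print(delta_s_hat)   # spurious: clearly nonzero although the truth is 0
```

The two best linear approximations differ because each group's OLS fit is tailored to its own covariate distribution, so the misspecified linear model attributes part of the purely compositional gap to "structure".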
There are several ways as to how the above-described problems can be alleviated. First, one can try to improve the approximation to the CEF by adding higher-order terms of the covariates. In the case of high-dimensional X i , however, this may be cumbersome and lead to very noisy estimates. Furthermore, it complicates matters if one is interested in performing detailed decompositions. Second, another alternative is to use the doubly robust reweighted regression as suggested by Firpo et al. (2007). Intuitively, the propensity score adjustment reduces the imbalance between the groupspecific covariate distributions and thus reduces the contamination of the structural effect stemming from these imbalances. While reweighting generally increases the robustness of regression with respect to departures from linearity, it may be problematic to rely on the correct specification of the propensity score alone because any misspecification immediately leads to biased estimates.

Nonlinear outcome model
Due to the described weaknesses of the linear model in the case of nonlinear CEFs, a nonlinear specification that is consistent with the dependent variable being nonnegative for arbitrary values of β_g appears generally preferable. This property is satisfied by the exponential model, which follows from (7) when λ → 0. The model can be written as

y_ig = exp(X_i β_g) ε_ig,    (8)

where ε_ig is an error term with E[ε_ig | X_i] = 1, and the assumed CEF is μ(X_i, β_g) = exp(X_i β_g). The exponential CEF is used in many different models of nonnegative dependent variables.
In the case of count data outcomes, it emerges from the Poisson model. In the case of continuous nonnegative outcomes, it follows from the gamma and exponential distributions. The exponential model is also very widely used in practice. In models of health care expenditure, for example, it is the standard functional form assumption (Mullahy 2009). In wage models, the dependent variable used for estimation is usually in logarithmic form, which implies the functional form assumed in (8); this specification is justified both theoretically (Mincer 1974) and empirically (Heckman and Polachek 1974; Blackburn 2007). Therefore, the exponential model seems a suitable choice for many contexts.
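The key robustness property of Poisson QML for the exponential CEF can be illustrated with a minimal Newton-Raphson solver for the Poisson quasi-score (a sketch; the DGP is illustrative, and the outcome is deliberately continuous, not a count):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# y is CONTINUOUS and nonnegative (multiplicative log-normal error), not a
# count, yet Poisson QML recovers beta in E[y | X] = exp(X beta).
X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
beta_true = np.array([0.2, 0.5])
y = np.exp(X @ beta_true) * rng.lognormal(-0.18, 0.6, n)   # E[error] = 1

def poisson_qml(X, y, iters=50):
    """Newton-Raphson on the Poisson quasi-score sum_i x_i'(y_i - exp(x_i b))."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ b)
        score = X.T @ (y - mu)
        hess = (X * mu[:, None]).T @ X   # (minus) Hessian of the quasi-log-lik
        b = b + np.linalg.solve(hess, score)
    return b

beta_hat = poisson_qml(X, y)
print(beta_hat)   # close to beta_true despite the non-Poisson distribution
```

Only the conditional mean needs to be correct here: the log-normal error violates every Poisson distributional assumption (continuity, equidispersion), yet the estimates converge to the true coefficients, in line with Gourieroux et al. (1984).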

Aggregate decomposition
If the outcome model is (8), the law of iterated expectations implies that the aggregate decomposition defined in (3) can be written as

Δ_AM = {E[y_i | D_i = 1] − E[exp(X_i β_0) | D_i = 1]} + {E[exp(X_i β_0) | D_i = 1] − E[y_i | D_i = 0]}.    (9)

The only critical expression is the counterfactual, μ_C = E[exp(X_i β_0) | D_i = 1], which is identified if ignorability holds.
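A plug-in version of the aggregate decomposition in (9) only requires the group-0 coefficient estimates and sample means. The following sketch (illustrative DGP; the Poisson QML solver is a bare-bones Newton iteration) also makes the adding-up property explicit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Illustrative DGP: group 1 has larger coefficients AND higher x on average.
x = rng.normal(0.0, 1.0, n)
d = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(x - 0.3)))
eps = rng.lognormal(-0.045, 0.3, n)                        # E[eps] = 1
y = np.where(d, np.exp(0.3 + 0.5 * x), np.exp(0.1 + 0.4 * x)) * eps

def poisson_qml(X, y, iters=40):
    # Newton-Raphson on the Poisson quasi-score (see the previous subsection)
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ b)
        b = b + np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return b

Z = np.column_stack([np.ones(n), x])
b0 = poisson_qml(Z[~d], y[~d])          # group-0 exponential CEF
mu_c_hat = np.exp(Z[d] @ b0).mean()     # plug-in counterfactual mean

delta = y[d].mean() - y[~d].mean()      # aggregate AM gap
delta_s = y[d].mean() - mu_c_hat        # structural effect
delta_x = mu_c_hat - y[~d].mean()       # composition effect
print(delta_s, delta_x)                 # they sum to the aggregate gap
```

By construction, the two estimated terms add up exactly to the raw outcome gap, which is the defining feature of the aggregate decomposition.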

Doubly robust estimation
To estimate the decomposition in (9), expectations are replaced by sample analogues and population coefficients by β̂_1 and β̂_0, the estimated coefficients from the two group-specific samples. Estimation can be based on quasi-maximum-likelihood (QML) techniques, also referred to as estimation of generalized linear models (GLM) in the statistics literature. It is a well-known result by Gourieroux et al. (1984) that QML estimators are consistent regardless of the underlying distribution of the data, as long as the CEF is correctly specified. In other words, these models can be estimated consistently under minimal distributional assumptions. The difference between the various QML estimators is simply how observations are weighted in the first-order conditions of the maximization problem, leading to different requirements for asymptotic efficiency. Several authors (Manning and Mullahy 2001; Santos Silva and Tenreyro 2006) find that Poisson and gamma QML estimators perform well in many situations, while Gaussian QML (nonlinear least squares) leads to more erratic results, especially in heteroskedastic environments. A major advantage of the QML framework with regard to the decomposition in (9) is that a doubly robust estimator for the counterfactual mean can be derived. Double robustness broadly refers to an estimator that is consistent under either of two sets of assumptions. As Wooldridge (2007) briefly mentions, Poisson QML can be made doubly robust if augmented with an appropriate propensity score weighting function. We will refer to this estimator as the WPQML estimator. Denote the probability limit of the WPQML estimator by β_0* = plim(β̂_0), where we allow for the possibility that the CEF may be misspecified, i.e. E[y_i0 | X_i, D_i = 0] ≠ exp(X_i β_0*) for some X_i ∈ X. The defining property of the WPQML estimator, however, is that E[y_i0 | D_i = 0] = E[exp(X_i β_0*) | D_i = 0] always holds by construction.
As shown below, this property arises from the first-order conditions associated with this estimator. For the sub-population of interest (group 0), the WPQML estimator solves the sample analogue of the following population maximization problem:

max_b E[(1 − D_i) ω(X_i) {y_i X_i b − exp(X_i b)}],

where ω(X_i) is the propensity score weighting function defined below. The corresponding population first-order conditions (FOCs) are

E[(1 − D_i) ω(X_i) X_i′ {y_i − exp(X_i β_0*)}] = 0.

Since Wooldridge (2007) does not touch upon the WPQML estimator and confines the discussion to the population average treatment effect (PATE), Theorem 1 below shows that WPQML with appropriate weights yields a doubly robust estimator for our counterfactual of interest and thus also for the structural effect Δ_S and the characteristics effect Δ_X of the decomposition.

Theorem 1 Define the weighting function by

ω(X_i) = p(X_i) / (1 − p(X_i)).

If either (i) the propensity score model for p(X_i) is correctly specified or (ii) the conditional mean function E[y_i0 | X_i] = exp(X_i β_0) is correctly specified, then E[exp(X_i β_0*) | D_i = 1] = μ_C.
The proof is given in "Appendix 1".
The central implication of Theorem 1 is that the counterfactual is identified even if the WPQML estimator does not converge to the true parameters (β_0* ≠ β_0), provided that the propensity score model is correctly specified. Therefore, double robustness guards better against misspecification than relying on correct specification of the outcome model alone or the propensity score model alone. Although this property is very appealing, it is important to remember that the result does not automatically extend to the detailed decomposition: if we fail to identify the parameters of the CEF, a detailed decomposition into the contributions of individual covariates to the gap is no longer guaranteed to be consistent.
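Double robustness can be illustrated numerically. In the sketch below (DGP, coefficients and the logit/Newton implementations are all illustrative), the outcome model deliberately omits the quadratic term of the true CEF, yet the counterfactual mean is still recovered because the propensity score model is correct; the weight applied to group-0 units is the ATT-type weight ω(x) = p(x)/(1 − p(x)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative DGP: correct logit propensity score, but the outcome model
# below deliberately OMITS the quadratic term of the true group-0 CEF.
x = rng.normal(0.0, 1.0, n)
p_true = 1.0 / (1.0 + np.exp(-0.5 * x))
d = rng.uniform(size=n) < p_true
y0 = np.exp(0.1 + 0.3 * x + 0.1 * x**2) * rng.lognormal(-0.045, 0.3, n)

mu_c_true = y0[d].mean()                  # infeasible target E[y0 | D=1]
Z = np.column_stack([np.ones(n), x])

# Step 1: logit propensity score by Newton's method (IRLS).
g = np.zeros(2)
for _ in range(25):
    ph = 1.0 / (1.0 + np.exp(-(Z @ g)))
    W = ph * (1.0 - ph)
    g = g + np.linalg.solve((Z * W[:, None]).T @ Z, Z.T @ (d - ph))
ph = 1.0 / (1.0 + np.exp(-(Z @ g)))
w = ph / (1.0 - ph)                       # ATT-type weight for group-0 units

# Step 2: WEIGHTED Poisson QML on group 0, misspecified CEF exp(b0 + b1*x).
X0, yc, w0 = Z[~d], y0[~d], w[~d]
b = np.zeros(2)
for _ in range(50):
    mu = np.exp(X0 @ b)
    b = b + np.linalg.solve((X0 * (w0 * mu)[:, None]).T @ X0,
                            X0.T @ (w0 * (yc - mu)))

mu_c_hat = np.exp(Z[d] @ b).mean()        # doubly robust counterfactual mean
print(mu_c_true, mu_c_hat)                # close despite the misspecified CEF
```

Intuitively, the weighted intercept FOC forces the fitted exponential CEF to match the (reweighted) group-0 outcome mean over the treated covariate distribution, which is exactly the counterfactual of interest when the propensity score is correct.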
Due to the nonlinear model and the two-step nature of the estimator, bootstrapping the entire procedure is a more practical way to conduct inference than deriving analytical standard errors via nonlinear GMM. Note that propensity score weighting may also come at the cost of an efficiency loss if the CEF is correctly specified, but our Monte Carlo exercise suggests that the opposite can happen in the case of heteroskedasticity where the conditional variance of the error is proportional to the conditional mean.
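The bootstrap pattern is straightforward: resample observations and re-run every estimation step inside the loop, so that first-step uncertainty propagates into the interval. For brevity, the statistic in the sketch below is a plain difference in means rather than the full WPQML decomposition (purely illustrative), but the structure carries over directly:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

# Illustrative data; in practice BOTH estimation steps (propensity score
# and outcome model) belong inside the bootstrap loop.
x = rng.normal(0.0, 1.0, n)
d = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-0.5 * x))
y = np.exp(0.2 + 0.3 * x) * rng.lognormal(-0.045, 0.3, n)

def statistic(d, y):
    return y[d].mean() - y[~d].mean()   # stand-in for a decomposition term

theta_hat = statistic(d, y)
boot = np.empty(999)
for r in range(999):
    idx = rng.integers(0, n, n)         # resample rows with replacement
    boot[r] = statistic(d[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(theta_hat, (lo, hi))              # percentile confidence interval
```

Replacing `statistic` with the full two-step WPQML decomposition gives valid percentile intervals without any analytical variance derivation, at the cost of re-estimating both models in every replication.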

Monte Carlo simulation
To assess and compare the various estimators of the proposed decomposition, we conduct a small simulation study. Specifically, the objective is to test how various estimation methods perform in estimating the decomposition under different data-generating CEFs of the outcome model, different scenarios for the common support of the covariate distributions, as well as heteroskedasticity in the outcome model.

Set-up
The simulation design mimics a potential outcomes framework in which the CEF of group 0 has larger coefficients than the CEF of group 1 (negative structural effect) and group 0 has higher average values of the covariates than group 1 (negative composition effect). To obtain a realistic data-generating process, we generate both continuous and discrete covariates. The continuous variables are X_1, X_2, X_3 ~ N(0, 1) with Corr(X_1, X_2) = 0.2, Corr(X_1, X_3) = 0.1 and Corr(X_2, X_3) = −0.1. The discrete covariates are two binary dummies with P(X_4 = 1) = P(X_5 = 1) = 0.5 and Corr(X_4, X_5) = 1/8. The assignment of observations to the two groups is based on the following latent variable model:

D = 1{X_1 δ_1 + X_2 δ_2 + X_3 δ_3 + X_4 δ_4 + X_5 δ_5 + ξ > c}.

The threshold c in the indicator function is equal to the median of the latent variable, such that the two groups are of equal size. The group indicator variable is D = 1 for group 1 (the treatment group) and D = 0 for group 0 (the control group). The random assignment error ξ is stochastically independent of the processes determining outcomes, which means that the CIA is satisfied in our set-up. The parameters δ determine how strongly group membership is correlated with the covariates, in other words, how strongly the group-specific covariate distributions differ from one another. We consider two specifications for the latent variable equation. In the first, there is perfect common support. In the second, common support is small and the overlapping support assumption is violated for a subset of the population; see "Details on the set-up" in Appendix 2 for more details.
To illustrate the consequences of departures from the standard log-link assumption of exponential models, we consider two separate cases for the functional form. Starting from the generalized functional form, E[y_g | X] = (1 + λXβ_g)^{1/λ} for g ∈ {0, 1}, in the first case (the baseline specification) we assume λ → 0, which reduces to the exponential model. In the second case, we set λ = 0.5. This is a quadratic specification and thus also ensures that outcomes are always nonnegative; its curvature lies between the linear and the exponential model. It is important to note that we do not consider the linear model as a possible data-generating process because, unlike the two functional forms above, it does not restrict the support of the dependent variable to be nonnegative. Since the outcome distribution is affected by the functional form, we choose different values for β_g such that mean outcomes are of comparable size. See "Details on the set-up" in Appendix 2 for more details on the parameterization of the CEFs.
Finally, the potential outcomes are generated by multiplying the CEF by an error term drawn from the log-normal distribution with E[ε_g | X] = E[ε_g] = 1. To test whether the estimators are sensitive to heteroskedasticity, we consider two cases for the conditional variance of the error term. The first is the benchmark case of homoskedasticity; the second is heteroskedastic, with the conditional variance of the error positively related to the CEF; see "Details on the set-up" in Appendix 2 for more details.
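The covariate-generating step of this design can be sketched as follows (covariates only; the assignment and outcome equations are omitted, and the cell-probability construction for the correlated dummies is one possible implementation):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 200_000

# Continuous covariates: trivariate normal with the stated correlations.
corr = np.array([[1.0,  0.2,  0.1],
                 [0.2,  1.0, -0.1],
                 [0.1, -0.1,  1.0]])
x_cont = rng.multivariate_normal(np.zeros(3), corr, size=n)

# Correlated Bernoulli(0.5) dummies: Corr(X4, X5) = (P(1,1) - 0.25) / 0.25,
# so Corr = 1/8 requires P(X4=1, X5=1) = 0.28125. Draw the four cells directly,
# ordered as (0,0), (0,1), (1,0), (1,1).
cells = rng.choice(4, size=n, p=[0.28125, 0.21875, 0.21875, 0.28125])
x4 = (cells // 2).astype(float)      # cells 2, 3 -> X4 = 1
x5 = (cells % 2).astype(float)       # cells 1, 3 -> X5 = 1

print(np.corrcoef(x4, x5)[0, 1])     # approximately 0.125
```

Drawing the four joint cells directly guarantees the marginal probabilities of 0.5 and the target correlation of 1/8 exactly in the population, rather than only approximately via thresholding latent normals.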
The quantity associated with uncertainty is the counterfactual mean, which is why the focus lies on estimating μ_C = E[y_0 | D = 1] in this Monte Carlo exercise. The goal is to have an estimator based on the observed outcomes that is as close as possible to the estimates from the infeasible "true" estimator μ̂_C,true = N_1^{-1} Σ_{i: D_i = 1} y_i0. We compare the performance of the following estimators for μ_C:

i. WPQML: Poisson QML augmented with propensity score weights as described in Sect. 4.2.2; the propensity score is estimated by a logit model.
ii. PQML: the unweighted counterpart of WPQML.
iii. GQML: gamma QML; see, e.g. Blackburn (2008).
iv. OLS1: linear regression; the CEF is assumed to be linear in X_i.
v. OLS2: quadratic regression; the vector of covariates includes a second-order polynomial of X_i.
vi. WLS: the reweighted linear regression estimator proposed by Firpo et al. (2007); the propensity score is estimated by a logit model.
vii. NDR100: the nonparametric doubly robust estimator with optimal bandwidths proposed by Rothe and Firpo (2013); the outcome model is estimated by local linear regression and the propensity score model by local logit regression; optimal bandwidths are obtained by cross-validation; the kernel is Epanechnikov. See "Nonparametric doubly robust estimator (Rothe and Firpo 2013)" in Appendix 2 for more details on the implementation.
viii. NDR50: NDR with half the optimal bandwidth values.
ix. NDR200: NDR with twice the optimal bandwidth values.
We generate samples of size 2000 such that the estimation of the counterfactual is based on 1000 observations. The entire procedure is repeated 10,000 times.

Results
The results are presented in Table 1. To assess both bias and precision, we report the mean bias in percentage terms relative to the true estimates, as well as the root-mean-squared deviation from the true estimates (referred to as RMSD). For better readability, RMSD is expressed relative to the RMSD of our benchmark, the WPQML estimator. We first turn to the results from the exponential specification in Panel A of Table 1. Under perfect common support, the performance of the QML estimators, WLS and the NDR estimators is generally good. OLS1 is clearly biased, which demonstrates that the best linear approximation to the CEF performs poorly in estimating μ_C when the true CEF is nonlinear. OLS2 has smaller bias than OLS1 because the higher-order terms of the covariates capture some of the nonlinearities, but its performance is still poor. These findings emphasize the importance of correctly specifying the CEF. Moreover, the reweighted regression (WLS) considerably improves upon unweighted linear regression (OLS1) due to its double robustness property. The numbers for the NDR estimators suggest that undersmoothing is particularly detrimental to the performance of this estimator (especially in terms of precision). Overall, all estimators except OLS1, OLS2 and NDR50 perform quite well, with an average bias of less than 1 % irrespective of the heteroskedasticity in the data.
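The two performance measures can be computed as follows. This is a minimal sketch; `bias_and_rmsd` is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def bias_and_rmsd(estimates, true_values, benchmark_rmsd=None):
    """BIAS: mean deviation from the 'true' estimates, in percent.
    RMSD: root-mean-squared deviation from the true estimates,
    optionally expressed relative to a benchmark RMSD (the WPQML
    estimator serves as the benchmark in Table 1)."""
    est = np.asarray(estimates, dtype=float)
    tru = np.asarray(true_values, dtype=float)
    bias_pct = 100.0 * np.mean((est - tru) / tru)
    rmsd = np.sqrt(np.mean((est - tru) ** 2))
    if benchmark_rmsd is not None:
        rmsd /= benchmark_rmsd
    return bias_pct, rmsd

bias, rmsd = bias_and_rmsd([10.1] * 5, [10.0] * 5)  # → bias 1.0 %, RMSD 0.1
```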
If the common support is reduced, we notice a number of changes. First, the performance of the parametric linear estimators (OLS1 and OLS2) deteriorates considerably further. This result is not surprising, since these estimators extrapolate the misspecified CEF to the covariate space with missing overlap. Second, WLS and NDR are now biased and imprecise relative to the QML estimators. NDR50 has small bias because it relies very little on extrapolation, but this comes at the cost of large variability. Third, the WPQML estimator continues to perform very well and becomes even more precise than PQML under heteroskedasticity because observations with a large conditional mean are both noisier and more likely to be outside the common support. In other words, the propensity score weights tend to assign less weight to noisy observations, thus increasing precision. This result is interesting because heteroskedasticity of the form modelled here is often present in real data.
Panel B reports the results where the CEF takes the quadratic form, such that the QML estimators are now misspecified and OLS2 is correctly specified. The doubly robust estimators (WPQML, WLS, NDR) continue to perform well alongside OLS2 in the scenario with perfect common support because the propensity score weighting ensures consistency. It is only after the overlapping support assumption is violated (small common support) that WLS and NDR also become clearly biased. In contrast, it is surprising that WPQML and also GQML still produce good results with modest bias and good relative precision. We summarize the main findings from this simulation study as follows. First, as expected, linear regression (OLS1) performs poorly given a nonlinear CEF, and quadratic regression (OLS2) improves only modestly if the CEF is exponential and comes at the expense of noisier estimates in finite samples. Second, the doubly robust parametric estimators generally produce good results and are usually better than their unweighted counterparts. Third, the nonparametric doubly robust estimator with optimal bandwidths (NDR100) also appears attractive, although its performance is sensitive to the bandwidth values. A clear drawback of this estimator is that it is computationally much more demanding than the other (parametric) estimators. Finally, comparing results across all the scenarios examined, we find that WPQML produces very convincing results.

Application: union wage gap in the USA
We apply the decomposition methods to the union wage gap in the USA. The data set is constructed from the Current Population Survey (CPS) in the year 2012 and contains workers aged between 18 and 64. We define nonunion workers to be the reference group or "controls" (D i = 0) and union workers to be the "treatment group" (D i = 1).
The dependent variable of interest is hourly wages. Therefore, the counterfactual mean of interest is the average hourly wage union workers would earn if they were paid according to the wage structure of nonunion workers. The set of explanatory variables includes years of education, years of experience (quadratic), dummies for ethnicity, region, marital status, immigration and gender.
The descriptive statistics in Table 2 show a considerable average wage differential between the two groups. Union workers earn 4 dollars more than nonunion workers as measured in terms of AM. The log-wage gap is 0.23, and the corresponding GM wage gap in levels is 4.5 dollars, which is somewhat smaller than the AM wage gap. Comparing covariates, we see that union workers have more experience and are less likely to live in Southern states but are more likely to be married. For the other explanatory variables, differences seem more modest.
An important lesson from our Monte Carlo exercise is that the decomposition is only sensible for observations in the common support. We therefore inspect the overlap of the covariate distributions across the two groups. A straightforward way of doing this is to compare propensity score distributions across groups. Figure 2 shows kernel density estimates, where the propensity score has been estimated with a logit model using the same set of covariates. We see immediately that the estimated propensity scores are bounded well away from unity, such that no observations in the data violate the common support assumption in (5). Table 3 presents the results of the decomposition of average wage differentials. We do not compute detailed decompositions in order to focus on the essential task of estimating the aggregate decomposition. The top panel shows AM decompositions using various estimators, and the bottom panel shows GM decompositions, i.e. decompositions of the log-wage gap. The QML estimates suggest that about 61-65 % of the AM gap is due to different characteristics (Δ_X). The remainder is due to different wage structures (Δ_S) and is often referred to as the average "union wage premium". If we regard union status as a treatment, Δ_S can also be interpreted as the average treatment effect on the treated (ATT) (Firpo et al. 2011). As discussed in Sect. 3, however, the causal interpretability critically hinges on the validity of the CIA. Several issues may be critical here. First, some relevant confounders may not be controlled for (e.g. additional socio-economic characteristics). Second, if union status D_i has a causal influence on some of the covariates in X_i (e.g. work experience), these covariates must be considered intermediate outcomes. In this case, identification of the decomposition can fail in several scenarios.
It is possible, for example, that unobserved socio-economic factors or unobserved motivation jointly affect X_i (work experience) and union status D_i. Similarly, these unobservables may also jointly affect X_i (work experience) and wages Y_i. In these situations, such covariates are "bad controls"; see Huber (2014) for a detailed discussion of these issues. These types of considerations should always be borne in mind when interpreting decomposition results.
Comparing estimates of the AM union wage premium in Table 3, we note that the OLS estimators (and especially WLS) yield smaller point estimates than the QML estimators and the NDR100 estimator. 14 We therefore argue that the choice of the estimator is important. The fact that all QML estimates are similar (and also close to the NDR100 estimate) lends support to the notion that the exponential model provides a good description of the CEF in this example.
For the GM decomposition based on log-wage regressions, we report results both in logs (%Δ_GM,approx) and in levels (Δ_GM). The former corresponds to the standard Oaxaca-Blinder decomposition of log-wages, and the latter is obtained by exponentiating means measured in logs. Of course, the absolute values of AM and GM decompositions cannot be compared directly due to the different concepts of the mean. However, we can compare the shares of the composition effect and the structural effect in the total differential. The GM decomposition suggests that the union wage premium explains about 47 % of the average wage gap, while the AM decomposition suggests that it explains only about 35 % (WPQML estimate). Based on the bootstrapped joint distribution of the estimators in Table 3, we can test whether this difference between %Δ_S^GM and %Δ_S^AM (WPQML) is significant. The corresponding t test yields a p value of 0.0096, such that the null of no difference can be rejected at the 1 % level. Performing the same test for the other estimators of the AM decomposition, we also obtain p values smaller than 0.01 in all cases (except for GQML, where the p value is 0.0399). We therefore reach quite different conclusions when we base the decomposition on either AMs or GMs.
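A test based on the bootstrapped joint distribution of two statistics can be sketched generically as follows. The statistics compared here (mean vs. median of a toy sample) merely stand in for the %Δ_S shares, and the normal approximation to the t statistic is an assumption of this sketch:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def paired_bootstrap_test(data, stat_a, stat_b, reps=499):
    """Test H0: stat_a(F) = stat_b(F) using the bootstrapped joint
    distribution of the two statistics: both are recomputed on the
    SAME resamples, so their dependence is preserved, mimicking how
    the difference between two decomposition shares could be tested."""
    n = len(data)
    diffs = np.empty(reps)
    for r in range(reps):
        idx = rng.integers(0, n, n)
        diffs[r] = stat_a(data[idx]) - stat_b(data[idx])
    t = (stat_a(data) - stat_b(data)) / diffs.std(ddof=1)
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # normal approximation
    return t, p

# toy example: mean vs. median of a skewed (log-normal) sample,
# where the two location concepts clearly disagree
data = rng.lognormal(0.0, 1.0, 2000)
t, p = paired_bootstrap_test(data, np.mean, np.median)
```

Resampling the same rows for both statistics is what makes the bootstrap "joint"; bootstrapping each statistic separately would ignore their correlation and misstate the variance of the difference.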
It is important to note that this discrepancy does not necessarily mean that one of the two decompositions is inconsistently estimated. It may come entirely from the fact that we apply the decomposition to different distributions, with one being a nonlinear (monotone) transformation of the other. Unfortunately, the size of the discrepancy depends on the properties of the distribution and is hard to investigate further. However, it may be instructive to point out that the configuration above, where the relative size of the union wage premium is larger in the log-wage decomposition, also arises, for example, when outcomes are log-normally distributed and certain parameter inequalities hold. Thus, choosing between the two concepts of the mean can matter in practice because we reach different conclusions when we evaluate the relative importance of the union wage premium in explaining the average wage differential.

Extensions
We briefly discuss how detailed decompositions can be performed in our framework and how issues of endogenous regressors (that violate ignorability) and sample selection can be addressed.

Detailed decomposition
An important issue in decomposition analysis is to measure how much individual covariates contribute to the differential in mean outcomes. For example, in wage decompositions, it is interesting to examine how much differences in, say, education contribute to the wage gap. This type of question can be answered by performing detailed decompositions.
Generally, a detailed decomposition of the structural effect is difficult regardless of the dependent variable and the specified outcome model. The reason is that the separate contributions of variables without a natural zero point are not interpretable (see Oaxaca and Ransom 1999). 15 For this reason, detailed decompositions mainly focus on the composition effect.
In general, the detailed decomposition is less straightforward in nonlinear models than in linear models because the individual contributions of covariates are not additively separable. However, several approaches are available to perform detailed decompositions in nonlinear models, such as the one discussed in Sect. 4.2. First, Yun (2004) uses a first-order Taylor approximation around the means of the covariates. The advantage is that the decomposition is path independent and satisfies the adding-up property (see Firpo et al. 2011, Chapter 2.2). The drawback is that it does not take into account that differences in higher-order moments (e.g. variances) affect outcomes through the nonlinearity of the CEF. Kaiser (2015) offers a method which is similar in spirit but takes into account the effects of such differences in covariate distributions. Second, Rothe (2012) offers a novel approach based on specifying copula functions to decompose the composition effect. The advantage of this method is that it can be applied to any type of outcome model, including the one discussed in this paper. However, the flexibility of the approach comes at the expense of a more complex interpretation of the detailed decomposition terms.
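A minimal sketch of a first-order split of the composition effect across covariates, in the spirit of Yun (2004); the function name and all numbers are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def yun_weights(beta, xbar_1, xbar_0):
    """Split the aggregate composition effect across covariates using
    a first-order (linear) approximation evaluated at the covariate
    means, in the spirit of Yun (2004).  Each covariate's share is
    beta_k * (xbar_1k - xbar_0k) relative to the total, so the
    weights sum to one by construction (adding-up property)."""
    contrib = np.asarray(beta) * (np.asarray(xbar_1) - np.asarray(xbar_0))
    return contrib / contrib.sum()

# toy example with two covariates (say, education and experience):
# the first gets a larger share because both its coefficient and its
# mean difference are larger
w = yun_weights(beta=[0.08, 0.02], xbar_1=[14.0, 20.0], xbar_0=[12.0, 18.0])
# → [0.8, 0.2]
```

As noted above, this approximation ignores how differences in higher-order moments feed through the nonlinearity of the CEF, which is precisely the limitation Kaiser (2015) addresses.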

Endogeneity of the covariates
As discussed in Sect. 3, the structural effect and the composition effect are identified even if some of the regressors in X_i are correlated with unobservables, provided that ignorability still holds. (In contrast, the individual contributions of endogenous covariates are not identified in a detailed decomposition.) Now consider the case where ignorability is violated due to endogenous covariates. In the exponential model, this problem can generally be addressed with instrumental variables, as in the linear model. If an appropriate instrument is available, the parameters of the exponential models in (9) can be consistently estimated by the two-step control function approach described in Wooldridge (2010, Ch. 18), given that the CEF of the outcome model is correctly specified. Consider the case of a single endogenous variable, y_2, and instrument(s), Z_i, satisfying the usual exclusion restriction. In the first step, the reduced form y_{2i} = X_i α_1 + Z_i α_2 + v_{2i} is estimated by OLS. In the second step, the parameters of the exponential outcome model can be consistently estimated by PQML with the first-stage residuals v̂_{2i} included as an additional regressor. 16

Sample selection
In the example of wage decompositions, there are often concerns that group 0 and 1 select differently into the labour market. This is innocuous as long as selection is explained by differences in X i , but if selection is due to unobservables, some type of correction procedure is required.
In the log-linear model, a popular choice is the two-step approach due to Heckman (1976), which requires a joint normality assumption between the errors in the selection model and the outcome model. Terza (1998) shows how this two-step approach can be extended to the exponential outcome model. The weakness of this procedure is the same as in the linear model; the coefficient estimates are inconsistent if the joint normality assumption is violated. In addition, credible identification usually requires an exclusion restriction.
Less restrictive methods can be used when panel data are available, and confounding unobservables are thought to be time invariant. For individuals where outcomes are observed at least once, missing outcomes in the other periods can be imputed. For example, Melly and Santangelo (2013) assume that individuals' position in the conditional outcome distribution is time invariant. The advantage of such nonparametric imputation procedures is that distributional assumptions on outcome and selection models are not required.

Conclusions
This paper has argued that the standard Oaxaca-Blinder decomposition (Oaxaca 1973; Blinder 1973) based on logarithms brings about statistical problems because it is invalid if the data include zeros and because the log differential is not invariant to changes in higher-order moments of the distribution. Furthermore, the approach also raises conceptual issues because the log differential is an approximate percentage difference in geometric means, which is not an intuitive quantity and is thus difficult to interpret.
We have therefore suggested modelling the original dependent variable directly to base the analysis on differences in arithmetic means. At the core of our study, we have proposed a new doubly robust estimation strategy based on a propensity score weighted Poisson quasi-maximum-likelihood (WPQML) estimator. This flexible parametric framework is suitable in many contexts of nonnegative outcomes and easy to implement in practice. Our Monte Carlo study has shown that the proposed estimator performs well in many circumstances when compared to a range of competing estimators. The WPQML estimator produces good results even under a misspecified outcome model or missing overlap in covariate distributions.
In our empirical application to the union wage gap in the USA, we find that quite different conclusions are reached depending on the decomposition approach taken. The arithmetic mean decomposition suggests that the union wage premium is much less important in explaining the union wage gap than the geometric mean decomposition, which is usually used in empirical studies. We conclude that the measure of the mean can have important implications for decomposition analysis and therefore deserves more careful consideration than it usually receives in empirical research.

Appendix 2: Details on the set-up

(…particular threshold values because the findings of Crump et al. (2009) suggest that discarding observations outside this interval is a good approximation to the optimal cut-off rule.) In the second specification, we set δ = 0.4 for all slope parameters to induce large differences in the group-specific covariate distributions. The value is chosen such that roughly 10 % of the predicted propensity score values (as estimated from a logit model) are outside the interval [0.1, 0.9]. Hence, this represents a scenario in which the common support assumption is severely violated. For illustration, Fig. 3 shows kernel densities of the estimated propensity scores for the two assignment rules based on a random Monte Carlo sample. Comparing counterfactual estimates from these two assignment rules will allow us to assess the impact of the common support assumption on the performance of the estimators. The CEFs in the simulation exercise are parameterized as shown in Table 4 below. The slope coefficients in β_0 are larger than the coefficients in β_1 to induce a structural effect between group 0 and group 1. The coefficients in the quadratic CEF model are larger than in the exponential CEF model because the curvature in the former is less pronounced. The values of the coefficients are chosen such that mean outcomes are of similar size across the two specifications.
In the homoskedastic case, we set V[ε_g|x] = V[ε_g] = 0.5, which implies V[y_g|x] ∝ E[y_g|x]^2. Note that this is equivalent to assuming homoskedasticity in the log-linear regression model (Santos Silva and Tenreyro 2006). In the heteroskedastic case, the error variance is proportional to the CEF, V[ε_g|x] ∝ E[y_g|x], which implies V[y_g|x] ∝ E[y_g|x]^3. In the latter case, we scale the distribution of the error term such that E[ε_g|x] = 1 continues to hold.

Nonparametric doubly robust estimator (Rothe and Firpo 2013)

The implementation of this estimator is similar to the description in Rothe and Firpo (2013). Using logit or probit for the propensity score model ensures that the predicted values are always in the interval [0, 1]. To define the local neighbourhood in the five-dimensional covariate space of our data-generating process, we employ a product kernel as suggested by Racine and Li (2004) that allows us to combine the different types of covariates. The kernel weight of observation (y_i, X_i) when estimating the regression function at the covariate values X = c is

K(X_i, c) = ∏_{s=1}^{3} W((X_{is} − c_s)/h) × ∏_{s=4}^{5} λ^{1{X_{is} ≠ c_s}},

where W(·) is the Epanechnikov kernel and h and λ are bandwidth parameters. The first component is the usual product kernel for the continuous covariates (X_1, X_2, X_3), and the second component handles the discrete covariates. As in Racine and Li (2004), the same bandwidth h is applied to all continuous covariates. Otherwise, the computational burden imposed by cross-validation and estimation becomes prohibitively large. To make the single bandwidth more attractive, we orthogonalize the matrix with the continuous covariates prior to the estimation procedure. Specifically, we use the QR decomposition solving X = QR, where X is the original data matrix, Q is the orthogonalized data matrix, and R = Chol(Var(X)) is an upper triangular matrix obtained from the classical Cholesky decomposition of the covariance matrix. As a result, we can use the variables contained in Q, whose covariance matrix corresponds to the identity matrix.
Since all transformed variables are uncorrelated between themselves and have unit variance, we have drastically reduced the need to specify multiple bandwidths.
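The orthogonalization step can be sketched as follows (a minimal sketch; `orthogonalize` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(5)

def orthogonalize(X):
    """Rescale correlated continuous covariates to unit covariance.

    With R the upper-triangular Cholesky factor of Var(X), so that
    Var(X) = R'R, the transformed data Q = X R^{-1} satisfy
    Var(Q) = I.  A single bandwidth can then be shared across the
    transformed covariates without penalizing any one of them."""
    Xc = X - X.mean(axis=0)
    R = np.linalg.cholesky(np.cov(Xc, rowvar=False)).T  # upper triangular
    return Xc @ np.linalg.inv(R)

# correlated toy covariates (mixing matrix is illustrative)
A = rng.standard_normal((2000, 3)) @ np.array([[1.0, 0.5, 0.2],
                                               [0.0, 1.0, 0.4],
                                               [0.0, 0.0, 1.0]])
Q = orthogonalize(A)  # sample covariance of Q is the identity matrix
```

Because the transformation is a fixed linear map, local regression on Q with one bandwidth is equivalent to local regression on X with a full bandwidth matrix adapted to the covariance of the data.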
The optimal bandwidth values are computed separately for each specification because the curvature of the outcome model, the common support and the error variance may each have an effect on the optimal amount of smoothing. The values of the optimal bandwidths obtained from a cross-validation procedure are shown in Table 5. We see that the optimal estimator for the propensity score (PS) model is essentially parametric, as implied by the very large bandwidth values. For the estimator of the outcome model, we note that more smoothing is required in the case of heteroskedasticity. Likewise, the exponential model requires more smoothing than the quadratic model because the curvature of the former is more pronounced. The bandwidths in Table 5 are used in each iteration of the Monte Carlo exercise.