Decomposition Methods in the Social Sciences

Solutions to Exercise 8: Distribution decompositions

Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024

Required packages: cdist, dstat, estout, jmpierce, moremata, grstyle, palettes, colrspace

Set the seed of the random number generator for the sake of reproducibility:

. set seed 439028

Task 1: extension of example analysis from slides

Extend the model from the session on distribution decompositions. Include the international socio-economic index (isei) as well as the number of children in the household (children). Decompose the private–public gap in the D9/D1, D9/D5 and D5/D1 ratios. Use the approaches based on JMP, conditional quantiles and distribution regressions and compare the results. The decompositions should be such that the covariate distribution of the private sector is adjusted to the covariate distribution of the public sector (i.e. use the wage structure from the private sector as the reference wage structure).
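Because the outcome in the analysis is the log wage, the percentile ratios are handled as percentile differences: quantiles carry over under the monotone log transformation, so a difference between two percentiles of lnwage is the log of the corresponding wage ratio. The statistics iqr(10,90), iqr(50,90) and iqr(10,50) (d9010, d9050 and d5010 in jmpierce) therefore measure D9/D1, D9/D5 and D5/D1 on the log scale. In the sketch below, Q_p is the p-th quantile and w the wage; both symbols are introduced here for illustration only. The equality is exact for population quantiles; sample estimates may deviate slightly depending on the interpolation rule.

```latex
\[
Q_{0.9}(\ln w) - Q_{0.1}(\ln w)
  = \ln Q_{0.9}(w) - \ln Q_{0.1}(w)
  = \ln\frac{\mathrm{D9}}{\mathrm{D1}}
\]
```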

Data preparation including additional predictors:

. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate lnwage = ln(wage)
(1,709 missing values generated)

. generate expft2 = expft^2
(35 missing values generated)

. summarize lnwage yeduc expft expft2 public isei children

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      lnwage |      5,600    2.736721    .5062968   1.108563   4.799255
       yeduc |      7,121    12.28823    2.783974          7         18
       expft |      7,274    11.63359    9.556508          0       39.5
      expft2 |      7,274    226.6548    293.3739          0    1560.25
      public |      5,770    .2353553    .4242574          0          1
-------------+---------------------------------------------------------
        isei |      6,451    45.07115    17.00982         16         90
    children |      7,309    1.090163    1.174416          0          4

. drop if missing(lnwage, yeduc, expft, public, isei, children) // remove unused observations
(1,879 observations deleted)

Overview of characteristics:

. tabstat yeduc expft isei children [aw=weight], by(public)

Summary statistics: Mean
Group variable: public (public service)

public |     yeduc     expft      isei  children
-------+----------------------------------------
    no |   12.4304  14.35697  45.18642  .5606184
   yes |   14.0728  13.57464  53.27206   .553369
-------+----------------------------------------
 Total |   12.8238  14.16958  47.12315   .558882
------------------------------------------------

People in the public sector are on average more highly educated, have less full-time labor market experience, have higher occupational status and have about the same number of children as people in the private sector.

JMP decomposition:

. regress lnwage yeduc expft expft2 isei children [pw=weight] if public==0
(sum of wgt is 9,175,995.0951793)

Linear regression                               Number of obs     =      4,163
                                                F(5, 4157)        =     160.97
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3580
                                                Root MSE          =     .40198

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0482309   .0058667     8.22   0.000     .0367291    .0597328
       expft |   .0263459   .0041235     6.39   0.000     .0182617    .0344301
      expft2 |  -.0003025   .0001168    -2.59   0.010    -.0005315   -.0000736
        isei |   .0103354   .0007699    13.42   0.000      .008826    .0118447
    children |   .0460693   .0101806     4.53   0.000     .0261099    .0660288
       _cons |   1.355772   .0686932    19.74   0.000     1.221097    1.490448
------------------------------------------------------------------------------

. estimates store private

. regress lnwage yeduc expft expft2 isei children [pw=weight] if public==1
(sum of wgt is 2,890,165.7029972)

Linear regression                               Number of obs     =      1,267
                                                F(5, 1261)        =      60.73
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3723
                                                Root MSE          =      .3497

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |    .036099   .0080594     4.48   0.000     .0202877    .0519104
       expft |   .0415204   .0071014     5.85   0.000     .0275885    .0554523
      expft2 |  -.0007342   .0001928    -3.81   0.000    -.0011124   -.0003559
        isei |   .0079597    .001322     6.02   0.000     .0053661    .0105532
    children |   .0524704   .0134812     3.89   0.000     .0260223    .0789184
       _cons |   1.554496   .0948874    16.38   0.000     1.368342    1.740651
------------------------------------------------------------------------------

. estimates store public

. jmpierce private public, reference(1) statistics(mean d9010 d9050 d5010 variance)

Juhn-Murphy-Pierce decomposition (reference estimates: private)

                   T           Q           P           U
    mean  -.13884181  -.15301408   .01080046   .00337181
   d9010   .19181561  -.06719494   .13387465    .1251359
   d9050   .19879556   .03245497   .09367847   .07266212
   d5010  -.00697994  -.09964991   .04019618   .05247378
variance   .05731634   .00172229   .02227498   .03331907

T = Total difference (private-public)
Q = Contribution of differences in observable quantities
P = Contribution of differences in observable prices
U = Contribution of differences in unobservable quantities and prices

We see that the difference in average log wages between the private and the public sector is fully explained by compositional differences between workers in the two sectors (Q component); at -0.153, the Q component even slightly exceeds the total gap of -0.139. However, if the private sector had the public sector's covariate distribution, overall wage inequality in the private sector (as measured by D9/D1) would increase, i.e. the inequality gap would be larger. Interestingly, the sector gap in the dispersion of wages in the upper half of the wage distribution (D9/D5) is predicted to be a little smaller than it actually is, whereas the gap in the lower half of the wage distribution (D5/D1) is predicted to be much larger. Finally, compositional differences cannot explain the observed sector gap in the variance of log wages.

Looking at the P and U components, we see that (with the exception of the mean) large parts of the inequality gaps are due to a less "inequality-prone" wage structure in the public sector (P component) and to less "inequality-prone" unobservable characteristics and wage returns to these characteristics (U component).
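The components reported by jmpierce are additive, T = Q + P + U, which gives a quick consistency check on the table. A minimal sketch using the values printed above for the mean and for D9/D1 (plain arithmetic with Stata's display command; nothing is re-estimated):

```stata
* JMP identity T = Q + P + U, values copied from the jmpierce output above
display -.15301408 + .01080046 + .00337181   // mean:  compare with T = -.13884181
display -.06719494 + .13387465 + .1251359    // d9010: compare with T =  .19181561
```

For D9/D1 the negative Q term is more than offset by P and U, which is why the gap is attributed to prices and unobservables rather than to composition.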

Decomposition based on conditional quantiles:

. // Estimate counterfactuals
. cdist lnwage yeduc c.expft##c.expft2 isei children [pw=weight], by(public) method(qr) ///
>     statistics(mean iqr(10 90) iqr(50 90) iqr(10 50) variance) ///
>     vce(bootstrap, cluster(psu))
(running cdist on estimation sample)

Bootstrap replications (50): .........10.........20.........30.........40.........50 done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Replications      =         50
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =         qr
                                                Grid size         =        100

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
obs0         |
        mean |   2.733252   .0126214   216.56   0.000     2.708514    2.757989
   iqr(10,90)|   1.244804      .0308    40.42   0.000     1.184437    1.305171
   iqr(50,90)|   .6547899   .0257502    25.43   0.000     .6043205    .7052594
   iqr(10,50)|   .5900142   .0164323    35.91   0.000     .5578074     .622221
    variance |   .2513125   .0105011    23.93   0.000     .2307306    .2718943
-------------+----------------------------------------------------------------
fit0         |
        mean |   2.732335   .0126247   216.43   0.000     2.707591    2.757079
   iqr(10,90)|   1.259786   .0286964    43.90   0.000     1.203542     1.31603
   iqr(50,90)|   .6420845   .0159623    40.23   0.000      .610799      .67337
   iqr(10,50)|   .6177015   .0215533    28.66   0.000     .5754579    .6599451
    variance |   .2466493   .0107218    23.00   0.000      .225635    .2676636
-------------+----------------------------------------------------------------
adj0         |
        mean |   2.881914   .0181275   158.98   0.000     2.846385    2.917444
   iqr(10,90)|   1.280105   .0283041    45.23   0.000      1.22463     1.33558
   iqr(50,90)|   .6258292   .0234882    26.64   0.000     .5797932    .6718652
   iqr(10,50)|    .654276   .0209768    31.19   0.000     .6131623    .6953897
    variance |   .2561258   .0109183    23.46   0.000     .2347262    .2775254
-------------+----------------------------------------------------------------
obs1         |
        mean |   2.872093   .0219387   130.91   0.000     2.829094    2.915092
   iqr(10,90)|   1.052989   .0580101    18.15   0.000     .9392908    1.166686
   iqr(50,90)|   .4559944   .0402445    11.33   0.000     .3771166    .5348722
   iqr(10,50)|   .5969942   .0499454    11.95   0.000      .499103    .6948853
    variance |   .1939033   .0164413    11.79   0.000     .1616789    .2261278
-------------+----------------------------------------------------------------
fit1         |
        mean |   2.872006   .0218841   131.24   0.000     2.829114    2.914898
   iqr(10,90)|   1.003765    .051603    19.45   0.000     .9026249    1.104905
   iqr(50,90)|   .4307655   .0191634    22.48   0.000     .3932059    .4683252
   iqr(10,50)|   .5729995   .0404368    14.17   0.000     .4937449     .652254
    variance |   .1908335   .0161043    11.85   0.000     .1592697    .2223973
-------------+----------------------------------------------------------------
adj1         |
        mean |    2.76083   .0207492   133.06   0.000     2.720163    2.801498
   iqr(10,90)|   .9915735   .0544667    18.21   0.000     .8848208    1.098326
   iqr(50,90)|   .4348294   .0189262    22.98   0.000     .3977348     .471924
   iqr(10,50)|   .5567441   .0429097    12.97   0.000     .4726427    .6408456
    variance |   .1924975   .0170734    11.27   0.000     .1590343    .2259607
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. // Decomposition with private sector wage structure as reference
. cdist decomp

        Delta: fit0 - fit1
        Chars: fit0 - adj0
        Coefs: adj0 - fit1

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1396714    .026973    -5.18   0.000    -.1925376   -.0868053
   iqr(10,90)|    .256021     .05867     4.36   0.000     .1410299    .3710121
   iqr(50,90)|   .2113189   .0271698     7.78   0.000     .1580671    .2645708
   iqr(10,50)|   .0447021   .0434172     1.03   0.303    -.0403941    .1297983
    variance |   .0558158   .0185365     3.01   0.003      .019485    .0921466
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1495793   .0208423    -7.18   0.000    -.1904295   -.1087291
   iqr(10,90)|  -.0203191   .0244797    -0.83   0.407    -.0682985    .0276602
   iqr(50,90)|   .0162553   .0170899     0.95   0.342    -.0172403    .0497509
   iqr(10,50)|  -.0365744   .0162786    -2.25   0.025    -.0684798    -.004669
    variance |  -.0094765   .0077148    -1.23   0.219    -.0245972    .0056442
-------------+----------------------------------------------------------------
Coefs        |
        mean |   .0099078   .0202989     0.49   0.625    -.0298772    .0496929
   iqr(10,90)|   .2763402   .0616817     4.48   0.000     .1554462    .3972341
   iqr(50,90)|   .1950636   .0323579     6.03   0.000     .1316433     .258484
   iqr(10,50)|   .0812765   .0452189     1.80   0.072    -.0073509    .1699039
    variance |   .0652923   .0199136     3.28   0.001     .0262624    .1043222
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. estimates store qr_priv

Decomposition based on distribution regression:

. // Estimate counterfactuals
. cdist lnwage yeduc c.expft##c.expft2 isei children [pw=weight], by(public) ///
>     statistics(mean iqr(10 90) iqr(50 90) iqr(10 50) variance) ///
>     vce(bootstrap, cluster(psu))
(running cdist on estimation sample)

Bootstrap replications (50): .........10.........20.........30.........40.........50 done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Replications      =         50
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =      logit
                                                Grid size         =        100

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
obs0         |
        mean |   2.733252   .0134363   203.42   0.000     2.706917    2.759586
   iqr(10,90)|   1.244804   .0290057    42.92   0.000     1.187954    1.301654
   iqr(50,90)|   .6547899   .0228014    28.72   0.000        .6101    .6994798
   iqr(10,50)|   .5900142    .016284    36.23   0.000     .5580981    .6219303
    variance |   .2513125   .0085851    29.27   0.000      .234486    .2681389
-------------+----------------------------------------------------------------
fit0         |
        mean |   2.748031   .0134993   203.57   0.000     2.721573    2.774489
   iqr(10,90)|   1.251093   .0275141    45.47   0.000     1.197166    1.305019
   iqr(50,90)|   .6690338   .0231009    28.96   0.000     .6237568    .7143107
   iqr(10,50)|   .5820589    .014857    39.18   0.000     .5529396    .6111782
    variance |   .2591116   .0085388    30.35   0.000     .2423758    .2758474
-------------+----------------------------------------------------------------
adj0         |
        mean |   2.902411   .0222176   130.64   0.000     2.858866    2.945957
   iqr(10,90)|    1.31026   .0291286    44.98   0.000     1.253168    1.367351
   iqr(50,90)|   .6639874    .033464    19.84   0.000     .5983991    .7295757
   iqr(10,50)|   .6462722    .029791    21.69   0.000     .5878828    .7046615
    variance |   .2789182   .0101444    27.49   0.000     .2590356    .2988008
-------------+----------------------------------------------------------------
obs1         |
        mean |   2.872093    .021535   133.37   0.000     2.829886    2.914301
   iqr(10,90)|   1.052989   .0624575    16.86   0.000      .930574    1.175403
   iqr(50,90)|   .4559944   .0404873    11.26   0.000     .3766407     .535348
   iqr(10,50)|   .5969942   .0498412    11.98   0.000     .4993072    .6946811
    variance |   .1939033   .0172408    11.25   0.000      .160112    .2276947
-------------+----------------------------------------------------------------
fit1         |
        mean |   2.883538   .0217245   132.73   0.000     2.840959    2.926118
   iqr(10,90)|   1.050459   .0669272    15.70   0.000     .9192838    1.181634
   iqr(50,90)|   .4620192   .0418941    11.03   0.000     .3799083    .5441301
   iqr(10,50)|   .5884395    .053075    11.09   0.000     .4844144    .6924645
    variance |   .1910696   .0177449    10.77   0.000     .1562902    .2258489
-------------+----------------------------------------------------------------
adj1         |
        mean |   2.776849   .0169242   164.08   0.000     2.743678    2.810019
   iqr(10,90)|   .9898577   .0511921    19.34   0.000      .889523    1.090192
   iqr(50,90)|   .4412684   .0261397    16.88   0.000     .3900356    .4925013
   iqr(10,50)|   .5485892   .0434873    12.61   0.000     .4633557    .6338227
    variance |   .1725809   .0167969    10.27   0.000     .1396596    .2055021
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. // Decomposition with private sector wage structure as reference
. cdist decomp

        Delta: fit0 - fit1
        Chars: fit0 - adj0
        Coefs: adj0 - fit1

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1355075   .0256462    -5.28   0.000    -.1857732   -.0852418
   iqr(10,90)|    .200634   .0747203     2.69   0.007     .0541848    .3470832
   iqr(50,90)|   .2070146   .0476355     4.35   0.000     .1136507    .3003784
   iqr(10,50)|  -.0063806   .0581234    -0.11   0.913    -.1203003    .1075391
    variance |    .068042   .0195721     3.48   0.001     .0296814    .1064026
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1543804   .0178541    -8.65   0.000    -.1893737   -.1193871
   iqr(10,90)|  -.0591669   .0286742    -2.06   0.039    -.1153673   -.0029665
   iqr(50,90)|   .0050464     .03985     0.13   0.899    -.0730581    .0831509
   iqr(10,50)|  -.0642133   .0266276    -2.41   0.016    -.1164025   -.0120241
    variance |  -.0198066   .0080578    -2.46   0.014    -.0355995   -.0040137
-------------+----------------------------------------------------------------
Coefs        |
        mean |   .0188729   .0228139     0.83   0.408    -.0258414    .0635873
   iqr(10,90)|   .2598009   .0681203     3.81   0.000     .1262876    .3933142
   iqr(50,90)|   .2019682   .0512275     3.94   0.000     .1015641    .3023723
   iqr(10,50)|   .0578327   .0567116     1.02   0.308    -.0533201    .1689855
    variance |   .0878486   .0191852     4.58   0.000     .0502462     .125451
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. estimates store dr_priv

We again see that the sector gap in mean log wages can be fully explained by the compositional differences between workers in the private and in the public sector. In contrast, sector differences in wage inequality cannot be accounted for by compositional differences. The results again suggest that wage inequality in the private sector would be even larger than it actually is if this sector's covariate distribution were the same as the covariate distribution in the public sector.
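The header of the cdist decomp output states how the components are built from the fitted distributions (fit0, fit1) and the counterfactual distribution (adj0). As a sketch, the arithmetic can be retraced for iqr(10,90) with the conditional-quantile estimates printed above. The scalar names are chosen here for illustration only, and small deviations from the table are due to rounding in the displayed values:

```stata
* values copied from the method(qr) output above, statistic iqr(10,90)
scalar fit0 = 1.259786   // private sector, fitted
scalar adj0 = 1.280105   // private sector, covariates adjusted to public sector
scalar fit1 = 1.003765   // public sector, fitted
display "Delta = " fit0 - fit1   // table: .256021
display "Chars = " fit0 - adj0   // table: -.0203191
display "Coefs = " adj0 - fit1   // table: .2763402
```

Since adj0 exceeds fit0, the Chars component is negative: giving the private sector the public sector's covariates raises, rather than lowers, its D9/D1 dispersion.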

Task 2: change in perspective

How do results change if you adjust the covariates of people in the public to those of people in the private sector (i.e. if you use the wage structure from the public sector as the reference wage structure)?

Decomposition based on conditional quantiles:

. // Decomposition with public sector wage structure as reference
. estimates restore qr_priv
(results qr_priv are active now)

. cdist decomp, reverse

        Delta: fit0 - fit1
        Chars: adj1 - fit1
        Coefs: fit0 - adj1

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1396714    .026973    -5.18   0.000    -.1925376   -.0868053
   iqr(10,90)|    .256021     .05867     4.36   0.000     .1410299    .3710121
   iqr(50,90)|   .2113189   .0271698     7.78   0.000     .1580671    .2645708
   iqr(10,50)|   .0447021   .0434172     1.03   0.303    -.0403941    .1297983
    variance |   .0558158   .0185365     3.01   0.003      .019485    .0921466
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1111762   .0179451    -6.20   0.000     -.146348   -.0760044
   iqr(10,90)|  -.0121915   .0219515    -0.56   0.579    -.0552157    .0308328
   iqr(50,90)|   .0040638   .0148798     0.27   0.785    -.0251001    .0332278
   iqr(10,50)|  -.0162553   .0156615    -1.04   0.299    -.0469512    .0144406
    variance |    .001664   .0071956     0.23   0.817    -.0124391    .0157671
-------------+----------------------------------------------------------------
Coefs        |
        mean |  -.0284953   .0208803    -1.36   0.172    -.0694199    .0124294
   iqr(10,90)|   .2682125   .0556574     4.82   0.000     .1591259    .3772991
   iqr(50,90)|   .2072551   .0256171     8.09   0.000     .1570466    .2574636
   iqr(10,50)|   .0609574   .0422411     1.44   0.149    -.0218337    .1437484
    variance |   .0541518   .0176212     3.07   0.002     .0196148    .0886888
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. estimates store qr_publ

. // Comparison
. esttab qr_priv qr_publ, b(3) not nonum mti

--------------------------------------------
                  qr_priv         qr_publ   
--------------------------------------------
Delta                                       
mean               -0.140***       -0.140***
iqr(10,90)          0.256***        0.256***
iqr(50,90)          0.211***        0.211***
iqr(10,50)          0.045           0.045   
variance            0.056**         0.056** 
--------------------------------------------
Chars                                       
mean               -0.150***       -0.111***
iqr(10,90)         -0.020          -0.012   
iqr(50,90)          0.016           0.004   
iqr(10,50)         -0.037*         -0.016   
variance           -0.009           0.002   
--------------------------------------------
Coefs                                       
mean                0.010          -0.028   
iqr(10,90)          0.276***        0.268***
iqr(50,90)          0.195***        0.207***
iqr(10,50)          0.081           0.061   
variance            0.065**         0.054** 
--------------------------------------------
N                    5430            5430   
--------------------------------------------
* p<0.05, ** p<0.01, *** p<0.001

The explained part (Chars) is somewhat less pronounced when the public-sector wage structure is used as reference (i.e. when the covariate distribution in the public sector is adjusted to that of the private sector). However, the overall interpretation stays the same: the gap in mean log wages can be explained, at least to a large extent, by compositional differences, whereas the inequality gaps cannot.
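With the reverse option, the counterfactual is adj1 (public sector with covariates adjusted to the private sector) instead of adj0, as the header of the output shows. A sketch retracing the mean components from the conditional-quantile estimates in Task 1; the scalar names are illustrative and the inputs are rounded, so the last digits may differ from the table:

```stata
* values copied from the method(qr) output in Task 1, statistic mean
scalar fit0 = 2.732335   // private sector, fitted
scalar fit1 = 2.872006   // public sector, fitted
scalar adj1 = 2.76083    // public sector, covariates adjusted to private sector
display "Chars = " adj1 - fit1   // table: -.1111762
display "Coefs = " fit0 - adj1   // table: -.0284953
```

The two orderings evaluate the composition effect under different wage structures, which is why the explained part of the mean gap changes from -0.150 to -0.111.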

Decomposition based on distribution regression:

. // Decomposition with public sector wage structure as reference
. estimates restore dr_priv
(results dr_priv are active now)

. cdist decomp, reverse

        Delta: fit0 - fit1
        Chars: adj1 - fit1
        Coefs: fit0 - adj1

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
      lnwage | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1355075   .0256462    -5.28   0.000    -.1857732   -.0852418
   iqr(10,90)|    .200634   .0747203     2.69   0.007     .0541848    .3470832
   iqr(50,90)|   .2070146   .0476355     4.35   0.000     .1136507    .3003784
   iqr(10,50)|  -.0063806   .0581234    -0.11   0.913    -.1203003    .1075391
    variance |    .068042   .0195721     3.48   0.001     .0296814    .1064026
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1066897   .0185163    -5.76   0.000     -.142981   -.0703984
   iqr(10,90)|   -.060601   .0457185    -1.33   0.185    -.1502077    .0290057
   iqr(50,90)|  -.0207508   .0396793    -0.52   0.601    -.0985207    .0570192
   iqr(10,50)|  -.0398502   .0413679    -0.96   0.335    -.1209298    .0412294
    variance |  -.0184887   .0099965    -1.85   0.064    -.0380814    .0011041
-------------+----------------------------------------------------------------
Coefs        |
        mean |  -.0288178   .0196911    -1.46   0.143    -.0674116     .009776
   iqr(10,90)|    .261235   .0593779     4.40   0.000     .1448564    .3776136
   iqr(50,90)|   .2277653   .0345312     6.60   0.000     .1600855    .2954451
   iqr(10,50)|   .0334697   .0492617     0.68   0.497    -.0630815    .1300209
    variance |   .0865307   .0195185     4.43   0.000     .0482752    .1247862
------------------------------------------------------------------------------
covariates: yeduc expft expft2 c.expft#c.expft2 isei children

. estimates store dr_publ

. // Comparison
. esttab dr_priv dr_publ, b(3) not nonum mti

--------------------------------------------
                  dr_priv         dr_publ   
--------------------------------------------
Delta                                       
mean               -0.136***       -0.136***
iqr(10,90)          0.201**         0.201** 
iqr(50,90)          0.207***        0.207***
iqr(10,50)         -0.006          -0.006   
variance            0.068***        0.068***
--------------------------------------------
Chars                                       
mean               -0.154***       -0.107***
iqr(10,90)         -0.059*         -0.061   
iqr(50,90)          0.005          -0.021   
iqr(10,50)         -0.064*         -0.040   
variance           -0.020*         -0.018   
--------------------------------------------
Coefs                                       
mean                0.019          -0.029   
iqr(10,90)          0.260***        0.261***
iqr(50,90)          0.202***        0.228***
iqr(10,50)          0.058           0.033   
variance            0.088***        0.087***
--------------------------------------------
N                    5430            5430   
--------------------------------------------
* p<0.05, ** p<0.01, *** p<0.001

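The pattern is similar to the one found for the conditional-quantile approach. With the public-sector wage structure as reference, the explained part of the mean gap is smaller (-0.107 instead of -0.154) and the composition effects on the inequality measures are no longer statistically significant at the 5% level, while the Coefs components remain largely unchanged. The overall interpretation stays the same.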

Task 3: comparison to reweighting

Optional: Compare your results to results from analogous decompositions using reweighting (e.g. compare the results from the distribution regression decomposition to results from decompositions based on IPW or entropy balancing). Can you reduce the difference between results by fine-tuning the models used in the various decompositions?

To make things more convenient, let's write a small program to run reweighting decompositions:

capt prog drop ipwdecomp
program ipwdecomp, eclass
    // syntax
    syntax varlist(fv min=2) [if] [in] [fw pw iw], by(varname) [ s(str) eb ]
    gettoken depvar controls : varlist
    if `"`eb'"'=="" local method ipw
    else            local method eb
    // estimation sample
    marksample touse
    markout `touse' `by'
    // counterfactual
    tempname adj0
    qui dstat (`s') `depvar' if `touse' [`weight'`exp'], over(`by') nose ///
        balance(`method':`controls', reference(1))
    local grps `"`e(over_namelist)'"'
    if `:list sizeof grps'!=2 {
        di as err "by() must be dichotomous (two groups)"
        exit 498
    }
    local g0: word 1 of `grps'
    local g1: word 2 of `grps'
    matrix `adj0' = e(b)[1,`"`g0':"']
    // observed 
    tempname obs0 obs1
    qui dstat (`s') `depvar' if `touse' [`weight'`exp'], over(`by') nose
    matrix `obs0' = e(b)[1,`"`g0':"']
    matrix `obs1' = e(b)[1,`"`g1':"']
    // decomposition
    tempname b tmp
    matrix `b' = `obs0'-`obs1'
    matrix coleq `b' = "Difference"
    matrix `tmp' = `obs0' - `adj0'
    matrix coleq `tmp' = "Explained"
    matrix `b' = `b', `tmp'
    matrix `tmp' = `adj0' - `obs1'
    matrix coleq `tmp' = "Unexplained"
    matrix `b' = `b', `tmp'
    // post results
    eret post `b' [`weight'`exp'], depname(`depvar') esample(`touse') obs(`e(N)')
    eret local cmd "ipwdecomp"
    eret local eb "`eb'"
    // display
    eret display, vsquish
end

The syntax is

ipwdecomp depvar indepvars [if] [in] [weight], by(groupvar) [ s(statistics) eb ]

where statistics is a list of target statistics (any statistic supported by dstat is allowed) and option eb requests using entropy balancing rather than logit-based IPW.
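For intuition about the counterfactual that the program obtains through dstat's balance() option, logit-based reweighting can be sketched by hand: private-sector workers are reweighted by the odds of being in the public sector given their covariates. This is a simplified illustration under the assumptions stated in the comments, not a description of how dstat is implemented; the variable names ps and w_cf are hypothetical, and percentile definitions may differ between _pctile and dstat, so results need not match exactly.

```stata
* Sketch: reweight the private sector (public==0) to the covariate
* distribution of the public sector (public==1) using a logit model
logit public yeduc c.expft##c.expft isei children [pw=weight]
predict double ps if e(sample), pr          // Pr(public==1 | covariates)

* odds-based reweighting factor times the original sampling weight
generate double w_cf = weight * ps/(1 - ps) if public==0

* counterfactual mean of log wages in the private sector
summarize lnwage [aw=w_cf] if public==0
display "counterfactual mean:          " r(mean)

* counterfactual percentiles and D9/D1 on the log scale
_pctile lnwage [pw=w_cf] if public==0, percentiles(10 50 90)
display "counterfactual D9/D1 (logs):  " r(r3) - r(r1)
```

Subtracting such a counterfactual statistic from the observed private-sector statistic gives the "Explained" part, and subtracting the observed public-sector statistic from it gives the "Unexplained" part, in line with the matrix operations in the program above.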

We now estimate several variants of the decomposition using the same specification as above for the covariates (linear terms for education, ISEI, and number of children, quadratic term for work experience, no interactions) and plot the results in a graph.

. local lhs yeduc c.expft##c.expft isei children

. local stats mean iqr(10,90) iqr(50,90) iqr(10,50) variance

. ipwdecomp lnwage `lhs' [pw=weight], by(public) s(`stats')
------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1388418
   iqr(10,90)|   .1918156
   iqr(50,90)|   .1987956
   iqr(10,50)|  -.0069799
    variance |   .0573163
-------------+----------------------------------------------------------------
Explained    |
        mean |  -.1501454
   iqr(10,90)|  -.0707943
   iqr(50,90)|  -.0119915
   iqr(10,50)|  -.0588028
    variance |  -.0135576
-------------+----------------------------------------------------------------
Unexplained  |
        mean |   .0113036
   iqr(10,90)|     .26261
   iqr(50,90)|   .2107871
   iqr(10,50)|   .0518229
    variance |    .070874
------------------------------------------------------------------------------

. estimates store ipw

. ipwdecomp lnwage `lhs' [pw=weight], by(public) s(`stats') eb
------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1388418
   iqr(10,90)|   .1918156
   iqr(50,90)|   .1987956
   iqr(10,50)|  -.0069799
    variance |   .0573163
-------------+----------------------------------------------------------------
Explained    |
        mean |  -.1495155
   iqr(10,90)|  -.0707943
   iqr(50,90)|  -.0142324
   iqr(10,50)|  -.0565619
    variance |  -.0136304
-------------+----------------------------------------------------------------
Unexplained  |
        mean |   .0106737
   iqr(10,90)|     .26261
   iqr(50,90)|    .213028
   iqr(10,50)|    .049582
    variance |   .0709467
------------------------------------------------------------------------------

. estimates store eb

. cdist lnwage `lhs' [pw=weight], by(public) s(`stats') decomp
group 0: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done
group 1: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =      logit
                                                Grid size         =        100

        Delta: fit0 - fit1
        Chars: fit0 - adj0
        Coefs: adj0 - fit1

------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1355075
   iqr(10,90)|    .200634
   iqr(50,90)|   .2070146
   iqr(10,50)|  -.0063806
    variance |    .068042
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1544988
   iqr(10,90)|  -.0591669
   iqr(50,90)|   .0205505
   iqr(10,50)|  -.0797174
    variance |  -.0183628
-------------+----------------------------------------------------------------
Coefs        |
        mean |   .0189913
   iqr(10,90)|   .2598009
   iqr(50,90)|   .1864641
   iqr(10,50)|   .0733368
    variance |   .0864048
------------------------------------------------------------------------------
covariates: yeduc expft c.expft#c.expft isei children

. estimates store dr

. cdist lnwage `lhs' [pw=weight], by(public) s(`stats') decomp method(qr)
group 0: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done
group 1: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =         qr
                                                Grid size         =        100

        Delta: fit0 - fit1
        Chars: fit0 - adj0
        Coefs: adj0 - fit1

------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Delta        |
        mean |  -.1374013
   iqr(10,90)|    .256021
   iqr(50,90)|   .2113189
   iqr(10,50)|   .0447021
    variance |   .0605626
-------------+----------------------------------------------------------------
Chars        |
        mean |  -.1497873
   iqr(10,90)|   -.024383
   iqr(50,90)|   .0121915
   iqr(10,50)|  -.0365744
    variance |  -.0111118
-------------+----------------------------------------------------------------
Coefs        |
        mean |    .012386
   iqr(10,90)|    .280404
   iqr(50,90)|   .1991275
   iqr(10,50)|   .0812765
    variance |   .0716744
------------------------------------------------------------------------------
covariates: yeduc expft c.expft#c.expft isei children

. estimates store qr

. grstyle init

. grstyle set plain, grid

. grstyle set color sb

. grstyle set legend 3, inside nobox

. coefplot ipw eb dr qr, noci keep(*:*) ///
>     eqrename(Delta = Difference Chars = Explained Coefs = Unexplained) ///
>     recast(bar) barwidth(.15) xline(0) ///
>     plotlabels("logit IPW" "entropy balancing IPW" ///
>                "distribution regression" "quantile regression")
[Figure: bar chart of the decomposition results. Panels: Difference, Explained, Unexplained; statistics within each panel: mean, iqr(10,90), iqr(50,90), iqr(10,50), variance; horizontal axis from -.2 to .3; legend: logit IPW, entropy balancing IPW, distribution regression, quantile regression.]

In the top panel the raw differences are reported. In principle, these should be the same for all methods. However, the distribution regression and quantile regression approaches compute the raw differences from fitted distributions, which are subject to approximation error. For the mean the approximation is good, but for some of the inequality measures the deviations are substantial, particularly with quantile regression (e.g., .256 instead of .192 for iqr(10,90), and .045 instead of -.007 for iqr(10,50)).
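The size of this approximation error can be read off the stored results. The following lines are a minimal sketch and not part of the original analysis; they assume that the first five columns of e(b) hold the raw differences in both sets of results (this is how ipwdecomp assembles its results and how the cdist output above is ordered). The matrix names obs, fit, and err are arbitrary.

    // observed raw differences (logit IPW results)
    estimates restore ipw
    matrix obs = e(b)
    matrix obs = obs[1, 1..5]
    // fitted raw differences (quantile regression results)
    estimates restore qr
    matrix fit = e(b)
    matrix fit = fit[1, 1..5]
    // approximation error by statistic (assumes identical column order)
    matrix err = fit - obs
    matrix list err

From the tables above, the error for the mean is -.1374013 - (-.1388418) = .0014405, whereas for iqr(10,50) it is .0447021 - (-.0069799) = .051682.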

With respect to the breakup into an explained and an unexplained component, logit IPW and entropy balancing pretty much agree. Results from distribution regression and quantile regression deviate here and there, but the overall picture is similar.
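By construction, the explained and unexplained components add up to the raw difference within each method, since (obs0 - adj0) + (adj0 - obs1) = obs0 - obs1. For the mean in the logit IPW results, for example, -.1501454 + .0113036 = -.1388418. A quick check along these lines, given as a sketch that assumes the stored results ipw from above, could be:

    estimates restore ipw
    // should be zero up to rounding error
    display _b[Explained:mean] + _b[Unexplained:mean] - _b[Difference:mean]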

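We now check whether differences between the results can be reduced by using a more flexible specification for the covariates. Here are the results we obtain if we include interactions among the variables and a squared term for each variable. Note that the ## operator used below generates the full factorial, that is, the three-way and four-way interactions in addition to all two-way interactions.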

. local lhs c.yeduc##c.expft##c.isei##c.children/*
>     */ c.yeduc#c.yeduc c.expft#c.expft/*
>     */ c.isei#c.isei c.children#c.children 
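
The ## operator expands to the full factorial of its operands. To list the terms that actually enter the models, the specification can be expanded with fvexpand. This is a side check that is not part of the original log; it requires the data to be in memory.

    // expand the factor-variable specification stored in local lhs
    fvexpand `lhs'
    display "`r(varlist)'"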

. local stats mean iqr(10,90) iqr(50,90) iqr(10,50) variance

. ipwdecomp lnwage `lhs' [pw=weight], by(public) s(`stats')
------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1388418
   iqr(10,90)|   .1918156
   iqr(50,90)|   .1987956
   iqr(10,50)|  -.0069799
    variance |   .0573163
-------------+----------------------------------------------------------------
Explained    |
        mean |  -.1489135
   iqr(10,90)|  -.0463133
   iqr(50,90)|   .0071507
   iqr(10,50)|  -.0534639
    variance |  -.0097855
-------------+----------------------------------------------------------------
Unexplained  |
        mean |   .0100717
   iqr(10,90)|   .2381289
   iqr(50,90)|   .1916449
   iqr(10,50)|    .046484
    variance |   .0671018
------------------------------------------------------------------------------

. estimates store ipw

. ipwdecomp lnwage `lhs' [pw=weight], by(public) s(`stats') eb
------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1388418
   iqr(10,90)|   .1918156
   iqr(50,90)|   .1987956
   iqr(10,50)|  -.0069799
    variance |   .0573163
-------------+----------------------------------------------------------------
Explained    |
        mean |  -.1448618
   iqr(10,90)|  -.0470619
   iqr(50,90)|   .0067458
   iqr(10,50)|  -.0538077
    variance |  -.0089903
-------------+----------------------------------------------------------------
Unexplained  |
        mean |     .00602
   iqr(10,90)|   .2388775
   iqr(50,90)|   .1920497
   iqr(10,50)|   .0468278
    variance |   .0663066
------------------------------------------------------------------------------

. estimates store eb

. cdist lnwage `lhs' [pw=weight], by(public) s(`stats') ///
>     lincom((Difference:fit0-fit1) (Explained:fit0-adj0) (Unexplained:adj0-fit1))
group 0: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done
group 1: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =      logit
                                                Grid size         =        100

   Difference: fit0-fit1
    Explained: fit0-adj0
  Unexplained: adj0-fit1

------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1355075
   iqr(10,90)|    .200634
   iqr(50,90)|   .2070146
   iqr(10,50)|  -.0063806
    variance |    .068042
-------------+----------------------------------------------------------------
Explained    |
        mean |   -.145334
   iqr(10,90)|  -.0591669
   iqr(50,90)|   .0050464
   iqr(10,50)|  -.0642133
    variance |  -.0169221
-------------+----------------------------------------------------------------
Unexplained  |
        mean |   .0098265
   iqr(10,90)|   .2598009
   iqr(50,90)|   .2019682
   iqr(10,50)|   .0578327
    variance |   .0849642
------------------------------------------------------------------------------
covariates: yeduc expft c.yeduc#c.expft isei c.yeduc#c.isei c.expft#c.isei ...

. estimates store dr

. cdist lnwage `lhs' [pw=weight], by(public) s(`stats') method(qr) ///
>     lincom((Difference:fit0-fit1) (Explained:fit0-adj0) (Unexplained:adj0-fit1))
group 0: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done
group 1: fitting models 0%....20%....40%....60%....80%....100%
enumerating predictions ... done

Counterfactual distribution estimation          Number of obs     =      5,430
                                                Pooled            =         no
Group 0: public = 0                             N of obs 0        =      4,163
Group 1: public = 1                             N of obs 1        =      1,267
                                                Estimation method =         qr
                                                Grid size         =        100

   Difference: fit0-fit1
    Explained: fit0-adj0
  Unexplained: adj0-fit1

------------------------------------------------------------------------------
      lnwage | Coefficient
-------------+----------------------------------------------------------------
Difference   |
        mean |  -.1397298
   iqr(10,90)|   .2275742
   iqr(50,90)|   .1991275
   iqr(10,50)|   .0284468
    variance |   .0622992
-------------+----------------------------------------------------------------
Explained    |
        mean |  -.1467292
   iqr(10,90)|   -.024383
   iqr(50,90)|   .0203191
   iqr(10,50)|  -.0447021
    variance |  -.0124196
-------------+----------------------------------------------------------------
Unexplained  |
        mean |   .0069995
   iqr(10,90)|   .2519572
   iqr(50,90)|   .1788083
   iqr(10,50)|   .0731489
    variance |   .0747188
------------------------------------------------------------------------------
covariates: yeduc expft c.yeduc#c.expft isei c.yeduc#c.isei c.expft#c.isei ...

. estimates store qr

. coefplot ipw eb dr qr, noci keep(*:*) ///
>     eqrename(Delta = Difference Chars = Explained Coefs = Unexplained) ///
>     recast(bar) barwidth(.15) xline(0) ///
>     plotlabels("logit IPW" "entropy balancing IPW" ///
>                "distribution regression" "quantile regression")
[Figure: bar chart of the decomposition results for the more flexible specification. Panels: Difference, Explained, Unexplained; statistics within each panel: mean, iqr(10,90), iqr(50,90), iqr(10,50), variance; horizontal axis from -.2 to .3; legend: logit IPW, entropy balancing IPW, distribution regression, quantile regression.]

Agreement across methods improved somewhat, but the changes are not dramatic. For quantile regression, for example, the fitted raw difference in iqr(10,90) is now .228 rather than .256, compared to an observed difference of .192. Some approximation error thus remains for the distribution regression and quantile regression approaches, and there are still some differences in the explained–unexplained breakup across the methods.

Which set of results is most appropriate is hard to say. We are not aware of any research that systematically compares the approaches (e.g., using simulations) to find out whether some of them generally (or under some conditions) outperform others. One might expect the approaches based on distribution regression and quantile regression to perform better than reweighting because they flexibly model the counterfactual distribution. However, the fact that the results from these two approaches do not agree very well makes us skeptical that this is indeed the case.

Yet, despite the discussed differences, the results from the various methods are qualitatively similar. Essentially all of the private–public gap in average wages is accounted for by differential distributions of characteristics, but the gap in wage inequality within the sectors remains largely unexplained. If anything, wage inequality in the private sector would be even larger if the private sector had a distribution of characteristics like that of the public sector, and we see that mostly the bottom half of the distribution would be affected.