Decomposition Methods in the Social Sciences

Solutions to Exercise 6: Reweighting

Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024

Required packages: fre, dstat, kmatch

Set the seed of the random number generator for sake of reproducibility:

. set seed 65443

Task 1: manual reweighting decomposition

Repeat the reweighting decomposition from the slides for the mean and the variance of log wages between the private sector and the public sector (section “Example analysis”), but use the “opposite” counterfactual. That is, compute a decomposition in which the distribution of X in the public sector is adjusted to the distribution observed in the private sector (on the slides, the distribution of X in the private sector was adjusted to the distribution observed in the public sector). In addition to schooling and experience, include occupational status as measured by the international socio-economic index (isei) as a predictor.

Data preparation including isei:

. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate lnwage = ln(wage)
(1,709 missing values generated)

. generate expft2 = expft^2
(35 missing values generated)

. summarize wage lnwage yeduc expft expft2 isei public

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      5,600    17.57278    9.858855       3.03     121.42
      lnwage |      5,600    2.736721    .5062968   1.108563   4.799255
       yeduc |      7,121    12.28823    2.783974          7         18
       expft |      7,274    11.63359    9.556508          0       39.5
      expft2 |      7,274    226.6548    293.3739          0    1560.25
-------------+---------------------------------------------------------
        isei |      6,451    45.07115    17.00982         16         90
      public |      5,770    .2353553    .4242574          0          1

. drop if missing(lnwage, yeduc, expft, isei, public) // remove unused observation
(1,879 observations deleted)

. svyset psu [pw=weight], strata(strata)

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

Propensity-score model: We first estimate a model that includes quadratic terms for experience and ISEI as well as all two-way interactions between schooling, experience, and ISEI.

. svy: logit public ///
>     /// main effects
>     yeduc c.expft##c.expft c.isei##c.isei ///
>     /// education x experience
>     c.yeduc#c.expft c.yeduc#c.expft#c.expft /// 
>     /// education x ISEI
>     c.yeduc#c.isei c.yeduc#c.isei#c.isei ///
>     /// ISEI x experience
>     c.isei#c.expft c.isei#c.expft#c.expft ///
>         c.isei#c.isei#c.expft ///
>         c.isei#c.isei#c.expft#c.expft
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                          Number of obs   =      5,430
Number of PSUs   = 2,034                          Population size = 12,066,161
                                                  Design df       =      2,019
                                                  F(13, 2007)     =       9.41
                                                  Prob > F        =     0.0000

-----------------------------------------------------------------------------------------------
                              |             Linearized
                       public | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------------+----------------------------------------------------------------
                        yeduc |   .1298165   .1996554     0.65   0.516    -.2617357    .5213686
                        expft |   -.070314   .1723928    -0.41   0.683    -.4084004    .2677724
                              |
              c.expft#c.expft |   .0034054   .0056808     0.60   0.549    -.0077355    .0145463
                              |
                         isei |   .0033856   .0966764     0.04   0.972    -.1862103    .1929815
                              |
                c.isei#c.isei |   .0003512   .0009833     0.36   0.721    -.0015772    .0022796
                              |
              c.yeduc#c.expft |   .0068663   .0085493     0.80   0.422    -.0099001    .0236327
                              |
      c.yeduc#c.expft#c.expft |  -.0001362   .0002766    -0.49   0.622    -.0006788    .0004063
                              |
               c.yeduc#c.isei |   .0003535   .0075197     0.05   0.963    -.0143937    .0151006
                              |
        c.yeduc#c.isei#c.isei |  -.0000134   .0000702    -0.19   0.849    -.0001511    .0001244
                              |
               c.isei#c.expft |   .0005789   .0066831     0.09   0.931    -.0125276    .0136854
                              |
       c.isei#c.expft#c.expft |  -.0000962   .0002175    -0.44   0.658    -.0005227    .0003302
                              |
        c.isei#c.isei#c.expft |  -.0000232   .0000675    -0.34   0.731    -.0001555    .0001092
                              |
c.isei#c.isei#c.expft#c.expft |   1.37e-06   2.20e-06     0.62   0.533    -2.94e-06    5.68e-06
                              |
                        _cons |  -3.674781   2.372428    -1.55   0.122    -8.327443    .9778808
-----------------------------------------------------------------------------------------------

The model appears hopelessly overspecified. Here is a joint test of the higher order terms of the interactions between ISEI and education or experience.

. test c.yeduc#c.isei#c.isei ///
>     c.isei#c.expft#c.expft ///
>     c.isei#c.isei#c.expft ///
>     c.isei#c.isei#c.expft#c.expft

Adjusted Wald test

 ( 1)  [public]c.yeduc#c.isei#c.isei = 0
 ( 2)  [public]c.isei#c.expft#c.expft = 0
 ( 3)  [public]c.isei#c.isei#c.expft = 0
 ( 4)  [public]c.isei#c.isei#c.expft#c.expft = 0

       F(  4,  2016) =    0.51
            Prob > F =    0.7303

It seems we can safely drop these terms. We rerun the model without these terms.

. svy: logit public ///
>     /// main effects
>     yeduc c.expft##c.expft c.isei##c.isei ///
>     /// education x experience
>     c.yeduc#c.expft c.yeduc#c.expft#c.expft /// 
>     /// education x ISEI and ISEI x experience
>     c.yeduc#c.isei c.isei#c.expft
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                          Number of obs   =      5,430
Number of PSUs   = 2,034                          Population size = 12,066,161
                                                  Design df       =      2,019
                                                  F(9, 2011)      =      12.54
                                                  Prob > F        =     0.0000

-----------------------------------------------------------------------------------------
                        |             Linearized
                 public | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------+----------------------------------------------------------------
                  yeduc |   .1841793   .0935714     1.97   0.049     .0006726    .3676859
                  expft |  -.0266352   .0951749    -0.28   0.780    -.2132864     .160016
                        |
        c.expft#c.expft |   .0006524   .0030077     0.22   0.828    -.0052461    .0065509
                        |
                   isei |   .0155227    .020184     0.77   0.442     -.024061    .0551064
                        |
          c.isei#c.isei |    .000195   .0002354     0.83   0.408    -.0002667    .0006567
                        |
        c.yeduc#c.expft |   .0030947   .0073734     0.42   0.675    -.0113655    .0175549
                        |
c.yeduc#c.expft#c.expft |  -8.51e-06   .0002349    -0.04   0.971    -.0004692    .0004522
                        |
         c.yeduc#c.isei |  -.0011454   .0016716    -0.69   0.493    -.0044237    .0021328
                        |
         c.isei#c.expft |  -.0005577   .0004071    -1.37   0.171    -.0013561    .0002407
                        |
                  _cons |  -4.069323   1.024232    -3.97   0.000    -6.077985   -2.060661
-----------------------------------------------------------------------------------------

The model still appears overspecified. Let's further simplify by dropping the interactions between ISEI and education or experience entirely.

. test c.yeduc#c.isei c.isei#c.expft

Adjusted Wald test

 ( 1)  [public]c.yeduc#c.isei = 0
 ( 2)  [public]c.isei#c.expft = 0

       F(  2,  2018) =    0.99
            Prob > F =    0.3704

. svy: logit public c.yeduc##c.expft##c.expft c.isei##c.isei
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                          Number of obs   =      5,430
Number of PSUs   = 2,034                          Population size = 12,066,161
                                                  Design df       =      2,019
                                                  F(7, 2013)      =      15.49
                                                  Prob > F        =     0.0000

-----------------------------------------------------------------------------------------
                        |             Linearized
                 public | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------+----------------------------------------------------------------
                  yeduc |   .1501627   .0458125     3.28   0.001      .060318    .2400074
                  expft |  -.0247169   .0942048    -0.26   0.793    -.2094656    .1600319
                        |
        c.yeduc#c.expft |   .0008536   .0071377     0.12   0.905    -.0131445    .0148516
                        |
        c.expft#c.expft |    .000491   .0030057     0.16   0.870    -.0054036    .0063856
                        |
c.yeduc#c.expft#c.expft |   5.24e-06   .0002348     0.02   0.982    -.0004553    .0004658
                        |
                   isei |   .0001753   .0177974     0.01   0.992    -.0347279    .0350786
                        |
          c.isei#c.isei |     .00012   .0001825     0.66   0.511    -.0002378    .0004779
                        |
                  _cons |  -3.442651   .6958564    -4.95   0.000    -4.807323    -2.07798
-----------------------------------------------------------------------------------------

Even the quadratic effect of ISEI seems to be unnecessary. So we will simply add a linear term for ISEI (note that also the model from the slides could be simplified, but we leave these terms in the model for sake of comparability). Our final model thus is as follows:

. svy: logit public c.yeduc##c.expft##c.expft isei
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                          Number of obs   =      5,430
Number of PSUs   = 2,034                          Population size = 12,066,161
                                                  Design df       =      2,019
                                                  F(6, 2014)      =      17.85
                                                  Prob > F        =     0.0000

-----------------------------------------------------------------------------------------
                        |             Linearized
                 public | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------+----------------------------------------------------------------
                  yeduc |   .1510122   .0458909     3.29   0.001     .0610138    .2410106
                  expft |  -.0286752    .094125    -0.30   0.761    -.2132675    .1559171
                        |
        c.yeduc#c.expft |   .0010943   .0071301     0.15   0.878    -.0128889    .0150774
                        |
        c.expft#c.expft |   .0006412   .0030059     0.21   0.831    -.0052538    .0065362
                        |
c.yeduc#c.expft#c.expft |  -5.15e-06   .0002348    -0.02   0.982    -.0004656    .0004553
                        |
                   isei |   .0121305   .0040382     3.00   0.003     .0042111    .0200499
                  _cons |  -3.712432   .6038572    -6.15   0.000    -4.896681   -2.528184
-----------------------------------------------------------------------------------------

We now generate predicted probabilities from the model (i.e. probability of being in the public sector given the values of the covariates), compute the proportion of observations in the public sector, and compute the weights to reweight the public sector so that it looks like the private sector (the formula for the weights is flipped compared to the formula on the slides, where we reweighted the private sector).

. predict double PS, pr

. summarize public [aw=weight] 

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
      public |   5,430  12066160.8    .2395265   .4268338          0          1

. scalar Pr_public = r(mean)

. generate double PSI = ((1-PS) / (1-Pr_public)) / (PS / Pr_public) if public==1
(4,163 missing values generated)

. replace PSI = 1 if public==0
(4,163 real changes made)

. summarize PSI [aw=weight] if public==1

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
         PSI |   1,267   2890165.7    .9939012   .5919346   .2487815   4.097133

We can now compute the mean and variance of log wages in both groups as well as in the reweighted public sector and then compute the decomposition.

. // prepare a matrix for the results
. matrix R = J(2,3,.)

. matrix coln R = private public rw_public

. matrix rown R = mean var

. matrix list R

R[2,3]
        private     public  rw_public
mean          .          .          .
 var          .          .          .

. // raw results for private sector
. summarize lnwage [aw=weight] if public==0

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
      lnwage |   4,163   9175995.1    2.733252    .501371   1.108563   4.799255

. matrix R[1,1] = r(mean)

. matrix R[2,1] = r(Var)

. // raw results for public sector
. summarize lnwage [aw=weight] if public==1

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
      lnwage |   1,267   2890165.7    2.872093   .4405185   1.115142   4.356068

. matrix R[1,2] = r(mean)

. matrix R[2,2] = r(Var)

. // reweighted results for public sector
. summarize lnwage [aw=PSI*weight] if public==1

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
      lnwage |   1,267  2872539.16    2.765634   .4206885   1.115142   4.356068

. matrix R[1,3] = r(mean)

. matrix R[2,3] = r(Var)

. // obtain decomposition
. mat list R

R[2,3]
        private     public  rw_public
mean  2.7332515  2.8720933  2.7656336
 var  .25137285  .19405651  .17697884

. di as txt "difference in mean: total           = " as res R[1,1] - R[1,2] _n ///
>    as txt "difference in mean: explained       = " as res R[1,3] - R[1,2] _n ///
>    as txt "difference in mean: unexplained     = " as res R[1,1] - R[1,3] _n ///
>    as txt "difference in variance: total       = " as res R[2,1] - R[2,2] _n ///
>    as txt "difference in variance: explained   = " as res R[2,3] - R[2,2] _n ///
>    as txt "difference in variance: unexplained = " as res R[2,1] - R[2,3]
difference in mean: total           = -.13884181
difference in mean: explained       = -.10645977
difference in mean: unexplained     = -.03238204
difference in variance: total       = .05731634
difference in variance: explained   = -.01707767
difference in variance: unexplained = .07439401

Results are not directly comparable to the results on the slides because we added an extra predictor (ISEI; and also the number of observations changed slightly due to missings for ISEI). The raw difference now is slightly higher (0.139) and we see that less of the public-private gap in the mean of log wages is explained (on the slides it was 0.127; here it is 0.106). The remaining unexplained difference is 0.032. The interpretation is that the gap would only be 0.032 if people in the public sector had the same distribution of education, work experience and ISEI as the people in the private sector (i.e. people in the public sector would, on average, only earn about 3% more, not 14% more).

For the variance we see that the explained part is slightly negative (similar as on slides).That is, wage inequality within the public sector would be even somewhat smaller than it actually is if people in the public sector had same distribution of education, work experience and ISEI as the people in the private sector.

Task 2: reweighting decomposition using dstat

Try to replicate the decomposition using command dstat.

We use option rif() to store the RIFs, and then compute the decomposition using these variables (see slides). An advantage of using RIFs is that consistent standard errors can easily be computed.

. // get RIFs of raw results
. dstat (mean variance) lnwage, over(public) vce(svy) ///
>     rif(RIF_0_mean RIF_0_var RIF_1_mean RIF_1_var)
(running dstat_svyr on estimation sample)

Survey: Summary statistics

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019

            0: public = no
            1: public = yes

--------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.     [95% conf. interval]
-------------+------------------------------------------------
0            |
        mean |   2.733252   .0139762      2.705842    2.760661
    variance |   .2513728   .0099165      .2319252    .2708205
-------------+------------------------------------------------
1            |
        mean |   2.872093   .0220492      2.828852    2.915335
    variance |   .1940565   .0180675      .1586236    .2294894
--------------------------------------------------------------

Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
RIF_0_mean      double  %10.0g                RIF of _b[0:mean]
RIF_0_var       double  %10.0g                RIF of _b[0:variance]
RIF_1_mean      double  %10.0g                RIF of _b[1:mean]
RIF_1_var       double  %10.0g                RIF of _b[1:variance]

. // get RIFs of reweighted results (only the public sample is reweighted; the
. // RIFs for private are the same as above)
. dstat (mean variance) lnwage, over(public) vce(svy) ///
>     balance(ipw: c.yeduc##c.expft##c.expft c.isei, reference(0)) ///
>     rif(RIF_0_mean RIF_0_var RIF_c_mean RIF_c_var) replace
(running dstat_svyr on estimation sample)

Survey: Summary statistics

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019
                                  Balancing:
                                           method =        ipw
                                        reference =   0.public
                                         controls = e(balance)

            0: public = no
            1: public = yes

--------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.     [95% conf. interval]
-------------+------------------------------------------------
0            |
        mean |   2.733252   .0139762      2.705842    2.760661
    variance |   .2513728   .0099165      .2319252    .2708205
-------------+------------------------------------------------
1            |
        mean |   2.765634   .0181534      2.730032    2.801235
    variance |   .1769788   .0177877      .1420946     .211863
--------------------------------------------------------------

Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
RIF_0_mean      double  %10.0g                RIF of _b[0:mean]
RIF_0_var       double  %10.0g                RIF of _b[0:variance]
RIF_c_mean      double  %10.0g                RIF of _b[1:mean]
RIF_c_var       double  %10.0g                RIF of _b[1:variance]

. // compute the RIFs of the decomposition components
. generate double mean_diff = RIF_0_mean - RIF_1_mean

. generate double mean_expl = RIF_c_mean - RIF_1_mean

. generate double mean_unex = RIF_0_mean - RIF_c_mean

. generate double var_diff  = RIF_0_var  - RIF_1_var

. generate double var_expl  = RIF_c_var  - RIF_1_var

. generate double var_unex  = RIF_0_var  - RIF_c_var

. // results
. svy: mean mean_* var_* 
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   mean_diff |  -.1388418    .025609     -.1890646    -.088619
   mean_expl |  -.1064598   .0174267      -.140636   -.0722836
   mean_unex |   -.032382   .0202816     -.0721571     .007393
    var_diff |   .0573163   .0205301      .0170539    .0975788
    var_expl |  -.0170777   .0092358     -.0351903     .001035
    var_unex |    .074394     .02033       .034524     .114264
--------------------------------------------------------------

. ereturn display // redisplay results including t-statistics and p values
------------------------------------------------------------------------------
             |             Linearized
        Mean | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   mean_diff |  -.1388418    .025609    -5.42   0.000    -.1890646    -.088619
   mean_expl |  -.1064598   .0174267    -6.11   0.000     -.140636   -.0722836
   mean_unex |   -.032382   .0202816    -1.60   0.111    -.0721571     .007393
    var_diff |   .0573163   .0205301     2.79   0.005     .0170539    .0975788
    var_expl |  -.0170777   .0092358    -1.85   0.065    -.0351903     .001035
    var_unex |    .074394     .02033     3.66   0.000      .034524     .114264
------------------------------------------------------------------------------

. drop RIF_* mean_* var_*

Note: The unexplained part of the decomposition could also be obtained directly by dstat.

. dstat (mean variance) lnwage, over(public, contrast(1)) vce(svy) ///
>     balance(ipw: c.yeduc##c.expft##c.expft c.isei, reference(0))
(running dstat_svyr on estimation sample)

Survey: Difference in summary statistics

Number of strata =    15                        Number of obs     =      5,430
Number of PSUs   = 2,034                        Population size   = 12,066,161
                                                Design df         =      2,019
                                                Contrast          =   1.public
                                                Balancing:
                                                           method =        ipw
                                                        reference =   0.public
                                                         controls = e(balance)

            0: public = no
            1: public = yes

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
0            |
        mean |   -.032382   .0202816    -1.60   0.111    -.0721571     .007393
    variance |    .074394     .02033     3.66   0.000      .034524     .114264
------------------------------------------------------------------------------

Task 3: bootstrap standard errors

Wrap the analysis from task 1 into a small program and apply the bootstrap to compute standard errors and confidence intervals. How do the bootstrap standard errors compare to the analytic standard errors provided by dstat?

We write a program that allows us to specify the predictors to be included in the reweighting and also allows us to specify the summary statistics to be decomposed (any statistic returned by summarize can be used; see "Stored results" in help summarize). The program also marks the estimation sample so that observations with missing value on any of the relevant variables will be excluded from all computations (and so that bootstrap will know which observations have been used).

capture program drop myreweight
program myreweight, eclass
    syntax varlist(fv) [, Stats(str) ]
    if "`stats'"=="" local stats "mean" // default
    // generate weights
    logit public `varlist' [pw=weight]
    local N = e(N) // number of observations
    tempvar touse 
    generate byte `touse' = e(sample) // this marks the estimation sample
    tempvar PS PSI
    qui predict double `PS' if `touse', pr
    summarize public [aw=weight] if `touse', meanonly
    quietly generate double `PSI' = ///
        ((1-`PS') / (1-r(mean))) / (`PS' / r(mean)) ///
        if public==1 & `touse'
    // compute results
    // - setup results vector
    local nstats: list sizeof stats
    tempname b
    matrix `b' = J(1, 6*`nstats', .)
    local colnames
    foreach stat of local stats {
        local colnames `colnames' ///
            `stat':private `stat':public `stat':rw_public ///
            `stat':difference `stat':explained `stat':unexplained
    }
    matrix coln `b' = `colnames'
    // raw results for private sector
    quietly summarize lnwage [aw=weight] if public==0 & `touse', detail
    local j = 1
    foreach stat of local stats {
        matrix `b'[1,`j'] = r(`stat')
        local j = `j' + 6
    }
    // raw results for public sector
    quietly summarize lnwage [aw=weight] if public==1 & `touse', detail
    local j = 2
    foreach stat of local stats {
        matrix `b'[1,`j'] = r(`stat')
        local j = `j' + 6
    }
    // reweighted results for public sector
    quietly summarize lnwage if public==1 & `touse' [aw=`PSI'*weight], detail
    local j = 3
    foreach stat of local stats {
        matrix `b'[1,`j'] = r(`stat')
        local j = `j' + 6
    }
    // decomposition
    local j = 0
    foreach stat of local stats {
        matrix `b'[1,`j'+4] = `b'[1,`j'+1] - `b'[1,`j'+2]
        matrix `b'[1,`j'+5] = `b'[1,`j'+3] - `b'[1,`j'+2]
        matrix `b'[1,`j'+6] = `b'[1,`j'+1] - `b'[1,`j'+3]
        local j = `j' + 6
    }
    // post results
    ereturn post `b', obs(`N') esample(`touse')
    ereturn display
end

We can now use the program. Here is an example of a single run where we decompose the difference in the mean, the 25% quantile, the median, the 75% quantile, and the variance:

. myreweight c.yeduc##c.expft##c.expft c.isei, stats(mean p25 p50 p75 Var)

Iteration 0:  Log pseudolikelihood = -6642826.5  
Iteration 1:  Log pseudolikelihood =   -6255160  
Iteration 2:  Log pseudolikelihood =   -6246188  
Iteration 3:  Log pseudolikelihood = -6246184.6  
Iteration 4:  Log pseudolikelihood = -6246184.6  

Logistic regression                                     Number of obs =  5,430
                                                        Wald chi2(6)  = 121.71
                                                        Prob > chi2   = 0.0000
Log pseudolikelihood = -6246184.6                       Pseudo R2     = 0.0597

-----------------------------------------------------------------------------------------
                        |               Robust
                 public | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
------------------------+----------------------------------------------------------------
                  yeduc |   .1510122   .0455224     3.32   0.001       .06179    .2402345
                  expft |  -.0286752   .0935728    -0.31   0.759    -.2120745    .1547241
                        |
        c.yeduc#c.expft |   .0010943    .007059     0.16   0.877    -.0127411    .0149297
                        |
        c.expft#c.expft |   .0006412   .0029603     0.22   0.829    -.0051608    .0064433
                        |
c.yeduc#c.expft#c.expft |  -5.15e-06   .0002306    -0.02   0.982     -.000457    .0004467
                        |
                   isei |   .0121305    .004154     2.92   0.003     .0039888    .0202723
                  _cons |  -3.712432   .6010468    -6.18   0.000    -4.890462   -2.534402
-----------------------------------------------------------------------------------------
------------------------------------------------------------------------------
             | Coefficient
-------------+----------------------------------------------------------------
mean         |
     private |   2.733252
      public |   2.872093
   rw_public |   2.765634
  difference |  -.1388418
   explained |  -.1064598
 unexplained |   -.032382
-------------+----------------------------------------------------------------
p25          |
     private |   2.386926
      public |   2.667228
   rw_public |   2.596001
  difference |   -.280302
   explained |  -.0712276
 unexplained |  -.2090745
-------------+----------------------------------------------------------------
p50          |
     private |   2.726545
      public |   2.908539
   rw_public |   2.793004
  difference |  -.1819942
   explained |  -.1155353
 unexplained |  -.0664589
-------------+----------------------------------------------------------------
p75          |
     private |   3.068053
      public |   3.148882
   rw_public |   3.021887
  difference |  -.0808294
   explained |  -.1269951
 unexplained |   .0461657
-------------+----------------------------------------------------------------
Var          |
     private |   .2513728
      public |   .1940565
   rw_public |   .1769788
  difference |   .0573163
   explained |  -.0170777
 unexplained |    .074394
------------------------------------------------------------------------------

Results for the mean and the variance are identical to above. Looking at the results for quantiles we see that the public-private wage gap is larger at the bottom of the distribution, mostly due to the unexplained part. However, relevance of the unexplained part is changing from p25 to p50 and p75. For the 75% quantile the unexplained part even switches sign.

Now use the bootstrap to obtain standard errors and confidence intervals for the results. We only include the mean and the variance.

. bootstrap, cluster(psu) reps(100): ///
>     myreweight c.yeduc##c.expft##c.expft c.isei, stats(mean Var)
(running myreweight on estimation sample)

Bootstrap replications (100): .........10.........20.........30.........40.........50.........60....
> .....70.........80.........90.........100 done

Bootstrap results                                        Number of obs = 5,430
                                                         Replications  =   100

                                 (Replications based on 2,034 clusters in psu)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
mean         |
     private |   2.733252   .0135721   201.39   0.000     2.706651    2.759852
      public |   2.872093   .0197789   145.21   0.000     2.833327    2.910859
   rw_public |   2.765634   .0183483   150.73   0.000     2.729671    2.801596
  difference |  -.1388418    .023288    -5.96   0.000    -.1844854   -.0931982
   explained |  -.1064598   .0166688    -6.39   0.000      -.13913   -.0737895
 unexplained |   -.032382   .0195702    -1.65   0.098     -.070739    .0059749
-------------+----------------------------------------------------------------
Var          |
     private |   .2513728   .0093474    26.89   0.000     .2330522    .2696935
      public |   .1940565   .0193824    10.01   0.000     .1560677    .2320453
   rw_public |   .1769788   .0190437     9.29   0.000     .1396538    .2143039
  difference |   .0573163   .0214015     2.68   0.007     .0153701    .0992625
   explained |  -.0170777   .0095363    -1.79   0.073    -.0357684    .0016131
 unexplained |    .074394   .0210719     3.53   0.000     .0330938    .1156942
------------------------------------------------------------------------------

Results are very similar to the ones obtained using dstat. For the mean, the explained part is highly significant; the unexplained part is not significantly different from zero. For the variance, results are opposite.

Task 4: reweighting decomposition using entropy balancing

Evaluate how results change if you use entropy balancing to compute the weights instead of IPW based on a logit model.

We can use the same code as in task 2 and simply replace ipw by eb.

. // get RIFs of raw results
. dstat (mean variance) lnwage, over(public) vce(svy) ///
>     rif(RIF_0_mean RIF_0_var RIF_1_mean RIF_1_var)
(running dstat_svyr on estimation sample)

Survey: Summary statistics

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019

            0: public = no
            1: public = yes

--------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.     [95% conf. interval]
-------------+------------------------------------------------
0            |
        mean |   2.733252   .0139762      2.705842    2.760661
    variance |   .2513728   .0099165      .2319252    .2708205
-------------+------------------------------------------------
1            |
        mean |   2.872093   .0220492      2.828852    2.915335
    variance |   .1940565   .0180675      .1586236    .2294894
--------------------------------------------------------------

Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
RIF_0_mean      double  %10.0g                RIF of _b[0:mean]
RIF_0_var       double  %10.0g                RIF of _b[0:variance]
RIF_1_mean      double  %10.0g                RIF of _b[1:mean]
RIF_1_var       double  %10.0g                RIF of _b[1:variance]

. // get RIFs of reweighted results (only the public sample is reweighted; the
. // RIFs for private are the same as above)
. dstat (mean variance) lnwage, over(public) vce(svy) ///
>     balance(eb: c.yeduc##c.expft##c.expft c.isei, reference(0)) ///
>     rif(RIF_0_mean RIF_0_var RIF_c_mean RIF_c_var) replace
(running dstat_svyr on estimation sample)

Survey: Summary statistics

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019
                                  Balancing:
                                           method =         eb
                                        reference =   0.public
                                         controls = e(balance)

            0: public = no
            1: public = yes

--------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.     [95% conf. interval]
-------------+------------------------------------------------
0            |
        mean |   2.733252   .0139762      2.705842    2.760661
    variance |   .2513728   .0099165      .2319252    .2708205
-------------+------------------------------------------------
1            |
        mean |   2.762522   .0182537      2.726724     2.79832
    variance |   .1757467    .017723      .1409893    .2105041
--------------------------------------------------------------

Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
RIF_0_mean      double  %10.0g                RIF of _b[0:mean]
RIF_0_var       double  %10.0g                RIF of _b[0:variance]
RIF_c_mean      double  %10.0g                RIF of _b[1:mean]
RIF_c_var       double  %10.0g                RIF of _b[1:variance]

. // compute the RIFs of the decomposition components
. generate double mean_diff = RIF_0_mean - RIF_1_mean

. generate double mean_expl = RIF_c_mean - RIF_1_mean

. generate double mean_unex = RIF_0_mean - RIF_c_mean

. generate double var_diff  = RIF_0_var  - RIF_1_var

. generate double var_expl  = RIF_c_var  - RIF_1_var

. generate double var_unex  = RIF_0_var  - RIF_c_var

. // results
. svy: mean mean_* var_* 
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =    15          Number of obs   =      5,430
Number of PSUs   = 2,034          Population size = 12,066,161
                                  Design df       =      2,019

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   mean_diff |  -.1388418    .025609     -.1890646    -.088619
   mean_expl |  -.1095713   .0170488     -.1430064   -.0761362
   mean_unex |  -.0292705   .0202262     -.0689369    .0103959
    var_diff |   .0573163   .0205301      .0170539    .0975788
    var_expl |  -.0183098   .0090438      -.036046   -.0005736
    var_unex |   .0756262   .0202761      .0358618    .1153905
--------------------------------------------------------------

. ereturn display // redisplay results including t-statistics and p values
------------------------------------------------------------------------------
             |             Linearized
        Mean | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   mean_diff |  -.1388418    .025609    -5.42   0.000    -.1890646    -.088619
   mean_expl |  -.1095713   .0170488    -6.43   0.000    -.1430064   -.0761362
   mean_unex |  -.0292705   .0202262    -1.45   0.148    -.0689369    .0103959
    var_diff |   .0573163   .0205301     2.79   0.005     .0170539    .0975788
    var_expl |  -.0183098   .0090438    -2.02   0.043     -.036046   -.0005736
    var_unex |   .0756262   .0202761     3.73   0.000     .0358618    .1153905
------------------------------------------------------------------------------

. drop RIF_* mean_* var_*

Decomposition results did not change all that much. For both the mean and the variance, the absolute value of the explained part slightly increased. For the variance, the explained part is now marginally significant.

Task 5: reweighted Oaxaca-Blinder decomposition

Compute a reweighted Oaxaca-Blinder decomposition to explain the public–private wage gap and interpret the results. Include schooling, experience, and international socio-economic index (ISEI) as predictors. Do the analysis in both directions (i.e. once reweighting the private sector and once reweighting the public sector).

Compute the weights, coefficients, and means:

. // compute weights
. kmatch ipw public c.yeduc##c.expft##c.expft c.isei [pw=weight], att wgen(PSI0)

Inverse probability weighting                            Number of obs = 5,430

Treatment   : public = 1
Covariates  : yeduc expft c.yeduc#c.expft c.expft#c.expft c.yeduc#c.expft#c.expft isei
PS model    : logit (pr)

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |            Controls           
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
   Treated |      1267          0       1267 |      4163          0       4163
------------------------------------------------------------------------------

Stored variables
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
PSI0            double  %10.0g                Matching weights for ATT

. kmatch ipw public c.yeduc##c.expft##c.expft c.isei [pw=weight], atc wgen(PSI1)

Inverse probability weighting                            Number of obs = 5,430

Treatment   : public = 1
Covariates  : yeduc expft c.yeduc#c.expft c.expft#c.expft c.yeduc#c.expft#c.expft isei
PS model    : logit (pr)

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |            Controls           
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
 Untreated |      4163          0       4163 |      1267          0       1267
------------------------------------------------------------------------------

Stored variables
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------------------------
PSI1            double  %10.0g                Matching weights for ATC

. // obtain coefficients and means in private sector
. svy: regress lnwage yeduc expft expft2 isei if public==0
(running regress on estimation sample)

Survey: Linear regression

Number of strata =    15                           Number of obs   =     4,163
Number of PSUs   = 1,793                           Population size = 9,175,995
                                                   Design df       =     1,778
                                                   F(4, 1775)      =    181.16
                                                   Prob > F        =    0.0000
                                                   R-squared       =    0.3516

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0477036   .0060214     7.92   0.000     .0358939    .0595133
       expft |   .0285134   .0042272     6.75   0.000     .0202226    .0368041
      expft2 |  -.0003818   .0001187    -3.22   0.001    -.0006146   -.0001489
        isei |   .0103159   .0007776    13.27   0.000     .0087908    .0118409
       _cons |   1.382317    .072412    19.09   0.000     1.240295    1.524338
------------------------------------------------------------------------------

. mat b_priv = e(b)

. svy: mean yeduc expft expft2 isei if public==0
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =    15           Number of obs   =     4,163
Number of PSUs   = 1,793           Population size = 9,175,995
                                   Design df       =     1,778

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |    12.4304   .0785646      12.27631    12.58449
       expft |   14.35697   .2709275       13.8256    14.88834
      expft2 |   307.8937   9.139483      289.9684     325.819
        isei |   45.18642    .450854      44.30216    46.07068
--------------------------------------------------------------

. mat X_priv = (e(b),1)

. // obtain counterfactual coefficients and means in private sector
. regress lnwage yeduc expft expft2 isei [pw=PSI0] if public==0
(sum of wgt is 2,890,165.7029972)

Linear regression                               Number of obs     =      4,163
                                                F(4, 4158)        =     158.06
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3764
                                                Root MSE          =     .40838

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0439155   .0069882     6.28   0.000     .0302149    .0576161
       expft |    .037984   .0046504     8.17   0.000     .0288667    .0471013
      expft2 |  -.0006764   .0001357    -4.98   0.000    -.0009425   -.0004103
        isei |   .0110568   .0010086    10.96   0.000     .0090794    .0130343
       _cons |    1.35105   .0741994    18.21   0.000     1.205579     1.49652
------------------------------------------------------------------------------

. mat b_priv_C = e(b)

. mean yeduc expft expft2 isei [pw=PSI0] if public==0

Mean estimation                          Number of obs = 4,163

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |   14.08746   .1052854      13.88105    14.29388
       expft |   13.65027   .3162639      13.03022    14.27031
      expft2 |   284.1405   9.916214      264.6994    303.5815
        isei |   53.23341   .5850431      52.08641    54.38041
--------------------------------------------------------------

. mat X_priv_C = (e(b),1)

. // obtain coefficients and means in public sector
. svy: regress lnwage yeduc expft expft2 isei if public==1
(running regress on estimation sample)

Survey: Linear regression

Number of strata =  15                             Number of obs   =     1,267
Number of PSUs   = 897                             Population size = 2,890,166
                                                   Design df       =       882
                                                   F(4, 879)       =     70.93
                                                   Prob > F        =    0.0000
                                                   R-squared       =    0.3609

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0349734   .0082295     4.25   0.000     .0188218    .0511251
       expft |   .0432595   .0072115     6.00   0.000     .0291058    .0574131
      expft2 |  -.0008105   .0001948    -4.16   0.000    -.0011929   -.0004282
        isei |   .0081884   .0013188     6.21   0.000        .0056    .0107767
       _cons |   1.585119   .0967903    16.38   0.000     1.395153    1.775086
------------------------------------------------------------------------------

. mat b_pub = e(b)

. svy: mean yeduc expft expft2 isei if public==1
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =  15             Number of obs   =     1,267
Number of PSUs   = 897             Population size = 2,890,166
                                   Design df       =       882

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |    14.0728   .1512438      13.77596    14.36964
       expft |   13.57464   .4871342      12.61856    14.53072
      expft2 |    282.094   15.63931      251.3994    312.7886
        isei |   53.27206   .8210012      51.66072    54.88341
--------------------------------------------------------------

. mat X_pub = (e(b),1)

. // obtain counterfactual coefficients and means in public sector
. regress lnwage yeduc expft expft2 isei [pw=PSI1] if public==1
(sum of wgt is 9,175,995.0951793)

Linear regression                               Number of obs     =      1,267
                                                F(4, 1262)        =      51.84
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3239
                                                Root MSE          =     .34645

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0263934   .0078191     3.38   0.001     .0110536    .0417332
       expft |   .0350385   .0071188     4.92   0.000     .0210726    .0490044
      expft2 |   -.000636   .0001972    -3.23   0.001    -.0010228   -.0002492
        isei |   .0088995   .0014613     6.09   0.000     .0060327    .0117663
       _cons |   1.726007   .0960101    17.98   0.000      1.53765    1.914364
------------------------------------------------------------------------------

. mat b_pub_C = e(b)

. mean yeduc expft expft2 isei [pw=PSI1] if public==1

Mean estimation                          Number of obs = 1,267

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |   12.50273   .1178106      12.27161    12.73386
       expft |    14.3952   .5217609      13.37159    15.41881
      expft2 |   309.3088   18.16965      273.6628    344.9547
        isei |   45.16942   .7814726      43.63629    46.70254
--------------------------------------------------------------

. mat X_pub_C = (e(b),1)

Variable PSI0 reweights the private sector so that it looks like the public sector. Variable PSI1 reweights the public sector so that it looks like the private sector.

Here are the results for the decomposition that reweights the private sector:

. mat D_priv_C = (X_priv * b_priv' - X_pub * b_pub') ///
>              \ (X_priv - X_priv_C) * b_priv'       ///
>              \  X_priv_C * (b_priv - b_priv_C)'    ///
>              \  X_pub * (b_priv_C - b_pub)'        ///
>              \ (X_priv_C - X_pub) * b_priv_C'

. mat rown D_priv_C = "Overall difference" "Explained" "Specification error" ///
>               "Unexplained" "Reweighting error"

. matlist D_priv_C, twidth(20)

                     |        y1 
---------------------+----------
  Overall difference | -.1388418 
           Explained | -.1509773 
 Specification error | -.0003726 
         Unexplained |  .0108031 
   Reweighting error |   .001705 

Here are the results for the decomposition that reweights the public sector:

. mat D_pub_C = (X_priv * b_priv' - X_pub * b_pub') ///
>             \ (X_pub_C - X_pub) * b_pub'          ///
>             \  X_pub_C * (b_pub_C - b_pub)'       ///
>             \  X_priv * (b_priv - b_pub_C)'       ///
>             \ (X_priv - X_pub_C) * b_pub_C'

. mat rown D_pub_C = "Overall difference" "Explained" "Specification error" ///
>              "Unexplained" "Reweighting error"

. matlist D_pub_C, twidth(20)

                     |        y1 
---------------------+----------
  Overall difference | -.1388418 
           Explained |  -.107819 
 Specification error |  .0013593 
         Unexplained | -.0301846 
   Reweighting error | -.0021974 

Note that command oaxaca_rif (from package rif by Fernando Rios-Avila; type ssc install rif) can be used to compute reweighted OB decompositions automatically (but be aware that the standard errors reported by oaxaca_rif do not take account of the fact that the weights in the reweighting step are estimated; so you might want to use the bootstrap for the standard errors).