Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024
Required packages: fre
, dstat
, kmatch
Set the seed of the random number generator for sake of reproducibility:
. set seed 65443
Repeat the reweighting decomposition from the slides for the mean
and the variance of log wages between the private sector and the
public sector (section “Example analysis”), but use the “opposite”
counterfactual. That is, compute a decomposition in which the distribution
of X in the public sector is adjusted to the distribution observed
in the private sector (on the slides, the distribution of X in the
private sector was adjusted to the distribution observed in the public
sector). In addition to schooling and experience, include occupational
status as measured by the international socio-economic index
(isei
) as a predictor.
Data preparation including isei
:
. use gsoep-extract, clear (Example data based on the German Socio-Economic Panel) . keep if wave==2015 (29,970 observations deleted) . keep if inrange(age, 25, 55) (5,671 observations deleted) . generate lnwage = ln(wage) (1,709 missing values generated) . generate expft2 = expft^2 (35 missing values generated) . summarize wage lnwage yeduc expft expft2 isei public Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- wage | 5,600 17.57278 9.858855 3.03 121.42 lnwage | 5,600 2.736721 .5062968 1.108563 4.799255 yeduc | 7,121 12.28823 2.783974 7 18 expft | 7,274 11.63359 9.556508 0 39.5 expft2 | 7,274 226.6548 293.3739 0 1560.25 -------------+--------------------------------------------------------- isei | 6,451 45.07115 17.00982 16 90 public | 5,770 .2353553 .4242574 0 1 . drop if missing(lnwage, yeduc, expft, isei, public) // remove unused observation (1,879 observations deleted) . svyset psu [pw=weight], strata(strata) Sampling weights: weight VCE: linearized Single unit: missing Strata 1: strata Sampling unit 1: psu FPC 1: <zero>
Propensity-score model: We first estimate a model that includes quadratic terms for experience and ISEI as well as all two-way interactions between schooling, experience, and ISEI.
. svy: logit public /// > /// main effects > yeduc c.expft##c.expft c.isei##c.isei /// > /// education x experience > c.yeduc#c.expft c.yeduc#c.expft#c.expft /// > /// education x ISEI > c.yeduc#c.isei c.yeduc#c.isei#c.isei /// > /// ISEI x experience > c.isei#c.expft c.isei#c.expft#c.expft /// > c.isei#c.isei#c.expft /// > c.isei#c.isei#c.expft#c.expft (running logit on estimation sample) Survey: Logistic regression Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 F(13, 2007) = 9.41 Prob > F = 0.0000 ----------------------------------------------------------------------------------------------- | Linearized public | Coefficient std. err. t P>|t| [95% conf. interval] ------------------------------+---------------------------------------------------------------- yeduc | .1298165 .1996554 0.65 0.516 -.2617357 .5213686 expft | -.070314 .1723928 -0.41 0.683 -.4084004 .2677724 | c.expft#c.expft | .0034054 .0056808 0.60 0.549 -.0077355 .0145463 | isei | .0033856 .0966764 0.04 0.972 -.1862103 .1929815 | c.isei#c.isei | .0003512 .0009833 0.36 0.721 -.0015772 .0022796 | c.yeduc#c.expft | .0068663 .0085493 0.80 0.422 -.0099001 .0236327 | c.yeduc#c.expft#c.expft | -.0001362 .0002766 -0.49 0.622 -.0006788 .0004063 | c.yeduc#c.isei | .0003535 .0075197 0.05 0.963 -.0143937 .0151006 | c.yeduc#c.isei#c.isei | -.0000134 .0000702 -0.19 0.849 -.0001511 .0001244 | c.isei#c.expft | .0005789 .0066831 0.09 0.931 -.0125276 .0136854 | c.isei#c.expft#c.expft | -.0000962 .0002175 -0.44 0.658 -.0005227 .0003302 | c.isei#c.isei#c.expft | -.0000232 .0000675 -0.34 0.731 -.0001555 .0001092 | c.isei#c.isei#c.expft#c.expft | 1.37e-06 2.20e-06 0.62 0.533 -2.94e-06 5.68e-06 | _cons | -3.674781 2.372428 -1.55 0.122 -8.327443 .9778808 -----------------------------------------------------------------------------------------------
The model appears hopelessly overspecified. Here is a joint test of the higher order terms of the interactions between ISEI and education or experience.
. test c.yeduc#c.isei#c.isei /// > c.isei#c.expft#c.expft /// > c.isei#c.isei#c.expft /// > c.isei#c.isei#c.expft#c.expft Adjusted Wald test ( 1) [public]c.yeduc#c.isei#c.isei = 0 ( 2) [public]c.isei#c.expft#c.expft = 0 ( 3) [public]c.isei#c.isei#c.expft = 0 ( 4) [public]c.isei#c.isei#c.expft#c.expft = 0 F( 4, 2016) = 0.51 Prob > F = 0.7303
It seems we can safely drop these terms. We rerun the model without these terms.
. svy: logit public /// > /// main effects > yeduc c.expft##c.expft c.isei##c.isei /// > /// education x experience > c.yeduc#c.expft c.yeduc#c.expft#c.expft /// > /// education x ISEI and ISEI x experience > c.yeduc#c.isei c.isei#c.expft (running logit on estimation sample) Survey: Logistic regression Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 F(9, 2011) = 12.54 Prob > F = 0.0000 ----------------------------------------------------------------------------------------- | Linearized public | Coefficient std. err. t P>|t| [95% conf. interval] ------------------------+---------------------------------------------------------------- yeduc | .1841793 .0935714 1.97 0.049 .0006726 .3676859 expft | -.0266352 .0951749 -0.28 0.780 -.2132864 .160016 | c.expft#c.expft | .0006524 .0030077 0.22 0.828 -.0052461 .0065509 | isei | .0155227 .020184 0.77 0.442 -.024061 .0551064 | c.isei#c.isei | .000195 .0002354 0.83 0.408 -.0002667 .0006567 | c.yeduc#c.expft | .0030947 .0073734 0.42 0.675 -.0113655 .0175549 | c.yeduc#c.expft#c.expft | -8.51e-06 .0002349 -0.04 0.971 -.0004692 .0004522 | c.yeduc#c.isei | -.0011454 .0016716 -0.69 0.493 -.0044237 .0021328 | c.isei#c.expft | -.0005577 .0004071 -1.37 0.171 -.0013561 .0002407 | _cons | -4.069323 1.024232 -3.97 0.000 -6.077985 -2.060661 -----------------------------------------------------------------------------------------
The model still appears overspecified. Let's further simplify by dropping the interactions between ISEI and education or experience entirely.
. test c.yeduc#c.isei c.isei#c.expft Adjusted Wald test ( 1) [public]c.yeduc#c.isei = 0 ( 2) [public]c.isei#c.expft = 0 F( 2, 2018) = 0.99 Prob > F = 0.3704 . svy: logit public c.yeduc##c.expft##c.expft c.isei##c.isei (running logit on estimation sample) Survey: Logistic regression Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 F(7, 2013) = 15.49 Prob > F = 0.0000 ----------------------------------------------------------------------------------------- | Linearized public | Coefficient std. err. t P>|t| [95% conf. interval] ------------------------+---------------------------------------------------------------- yeduc | .1501627 .0458125 3.28 0.001 .060318 .2400074 expft | -.0247169 .0942048 -0.26 0.793 -.2094656 .1600319 | c.yeduc#c.expft | .0008536 .0071377 0.12 0.905 -.0131445 .0148516 | c.expft#c.expft | .000491 .0030057 0.16 0.870 -.0054036 .0063856 | c.yeduc#c.expft#c.expft | 5.24e-06 .0002348 0.02 0.982 -.0004553 .0004658 | isei | .0001753 .0177974 0.01 0.992 -.0347279 .0350786 | c.isei#c.isei | .00012 .0001825 0.66 0.511 -.0002378 .0004779 | _cons | -3.442651 .6958564 -4.95 0.000 -4.807323 -2.07798 -----------------------------------------------------------------------------------------
Even the quadratic effect of ISEI seems to be unnecessary. So we will simply add a linear term for ISEI (note that also the model from the slides could be simplified, but we leave these terms in the model for sake of comparability). Our final model thus is as follows:
. svy: logit public c.yeduc##c.expft##c.expft isei (running logit on estimation sample) Survey: Logistic regression Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 F(6, 2014) = 17.85 Prob > F = 0.0000 ----------------------------------------------------------------------------------------- | Linearized public | Coefficient std. err. t P>|t| [95% conf. interval] ------------------------+---------------------------------------------------------------- yeduc | .1510122 .0458909 3.29 0.001 .0610138 .2410106 expft | -.0286752 .094125 -0.30 0.761 -.2132675 .1559171 | c.yeduc#c.expft | .0010943 .0071301 0.15 0.878 -.0128889 .0150774 | c.expft#c.expft | .0006412 .0030059 0.21 0.831 -.0052538 .0065362 | c.yeduc#c.expft#c.expft | -5.15e-06 .0002348 -0.02 0.982 -.0004656 .0004553 | isei | .0121305 .0040382 3.00 0.003 .0042111 .0200499 _cons | -3.712432 .6038572 -6.15 0.000 -4.896681 -2.528184 -----------------------------------------------------------------------------------------
We now generate predicted probabilities from the model (i.e. probability of being in the public sector given the values of the covariates), compute the proportion of observations in the public sector, and compute the weights to reweight the public sector so that it looks like the private sector (the formula for the weights is flipped compared to the formula on the slides, where we reweighted the private sector).
. predict double PS, pr . summarize public [aw=weight] Variable | Obs Weight Mean Std. dev. Min Max -------------+----------------------------------------------------------------- public | 5,430 12066160.8 .2395265 .4268338 0 1 . scalar Pr_public = r(mean) . generate double PSI = ((1-PS) / (1-Pr_public)) / (PS / Pr_public) if public==1 (4,163 missing values generated) . replace PSI = 1 if public==0 (4,163 real changes made) . summarize PSI [aw=weight] if public==1 Variable | Obs Weight Mean Std. dev. Min Max -------------+----------------------------------------------------------------- PSI | 1,267 2890165.7 .9939012 .5919346 .2487815 4.097133
We can now compute the mean and variance of log wages in both groups as well as in the reweighted public sector and then compute the decomposition.
. // prepare a matrix for the results . matrix R = J(2,3,.) . matrix coln R = private public rw_public . matrix rown R = mean var . matrix list R R[2,3] private public rw_public mean . . . var . . . . // raw results for private sector . summarize lnwage [aw=weight] if public==0 Variable | Obs Weight Mean Std. dev. Min Max -------------+----------------------------------------------------------------- lnwage | 4,163 9175995.1 2.733252 .501371 1.108563 4.799255 . matrix R[1,1] = r(mean) . matrix R[2,1] = r(Var) . // raw results for public sector . summarize lnwage [aw=weight] if public==1 Variable | Obs Weight Mean Std. dev. Min Max -------------+----------------------------------------------------------------- lnwage | 1,267 2890165.7 2.872093 .4405185 1.115142 4.356068 . matrix R[1,2] = r(mean) . matrix R[2,2] = r(Var) . // reweighted results for public sector . summarize lnwage [aw=PSI*weight] if public==1 Variable | Obs Weight Mean Std. dev. Min Max -------------+----------------------------------------------------------------- lnwage | 1,267 2872539.16 2.765634 .4206885 1.115142 4.356068 . matrix R[1,3] = r(mean) . matrix R[2,3] = r(Var) . // obtain decomposition . mat list R R[2,3] private public rw_public mean 2.7332515 2.8720933 2.7656336 var .25137285 .19405651 .17697884 . di as txt "difference in mean: total = " as res R[1,1] - R[1,2] _n /// > as txt "difference in mean: explained = " as res R[1,3] - R[1,2] _n /// > as txt "difference in mean: unexplained = " as res R[1,1] - R[1,3] _n /// > as txt "difference in variance: total = " as res R[2,1] - R[2,2] _n /// > as txt "difference in variance: explained = " as res R[2,3] - R[2,2] _n /// > as txt "difference in variance: unexplained = " as res R[2,1] - R[2,3] difference in mean: total = -.13884181 difference in mean: explained = -.10645977 difference in mean: unexplained = -.03238204 difference in variance: total = .05731634 difference in variance: explained = -.01707767 difference in variance: unexplained = .07439401
Results are not directly comparable to the results on the slides because we added an extra predictor (ISEI; and also the number of observations changed slightly due to missings for ISEI). The raw difference now is slightly higher (0.139) and we see that less of the public-private gap in the mean of log wages is explained (on the slides it was 0.127; here it is 0.106). The remaining unexplained difference is 0.032. The interpretation is that the gap would only be 0.032 if people in the public sector had the same distribution of education, work experience and ISEI as the people in the private sector (i.e. people in the public sector would, on average, only earn about 3% more, not 14% more).
For the variance we see that the explained part is slightly negative (similar as on slides).That is, wage inequality within the public sector would be even somewhat smaller than it actually is if people in the public sector had same distribution of education, work experience and ISEI as the people in the private sector.
Try to replicate the decomposition using command dstat
.
We use option rif()
to store the RIFs, and then compute the
decomposition using these variables (see slides). An advantage of using RIFs
is that consistent standard errors can easily be computed.
. // get RIFs of raw results . dstat (mean variance) lnwage, over(public) vce(svy) /// > rif(RIF_0_mean RIF_0_var RIF_1_mean RIF_1_var) (running dstat_svyr on estimation sample) Survey: Summary statistics Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 0: public = no 1: public = yes -------------------------------------------------------------- | Linearized lnwage | Coefficient std. err. [95% conf. interval] -------------+------------------------------------------------ 0 | mean | 2.733252 .0139762 2.705842 2.760661 variance | .2513728 .0099165 .2319252 .2708205 -------------+------------------------------------------------ 1 | mean | 2.872093 .0220492 2.828852 2.915335 variance | .1940565 .0180675 .1586236 .2294894 -------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- RIF_0_mean double %10.0g RIF of _b[0:mean] RIF_0_var double %10.0g RIF of _b[0:variance] RIF_1_mean double %10.0g RIF of _b[1:mean] RIF_1_var double %10.0g RIF of _b[1:variance] . // get RIFs of reweighted results (only the public sample is reweighted; the . // RIFs for private are the same as above) . dstat (mean variance) lnwage, over(public) vce(svy) /// > balance(ipw: c.yeduc##c.expft##c.expft c.isei, reference(0)) /// > rif(RIF_0_mean RIF_0_var RIF_c_mean RIF_c_var) replace (running dstat_svyr on estimation sample) Survey: Summary statistics Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 Balancing: method = ipw reference = 0.public controls = e(balance) 0: public = no 1: public = yes -------------------------------------------------------------- | Linearized lnwage | Coefficient std. err. [95% conf. interval] -------------+------------------------------------------------ 0 | mean | 2.733252 .0139762 2.705842 2.760661 variance | .2513728 .0099165 .2319252 .2708205 -------------+------------------------------------------------ 1 | mean | 2.765634 .0181534 2.730032 2.801235 variance | .1769788 .0177877 .1420946 .211863 -------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- RIF_0_mean double %10.0g RIF of _b[0:mean] RIF_0_var double %10.0g RIF of _b[0:variance] RIF_c_mean double %10.0g RIF of _b[1:mean] RIF_c_var double %10.0g RIF of _b[1:variance] . // compute the RIFs of the decomposition components . generate double mean_diff = RIF_0_mean - RIF_1_mean . generate double mean_expl = RIF_c_mean - RIF_1_mean . generate double mean_unex = RIF_0_mean - RIF_c_mean . generate double var_diff = RIF_0_var - RIF_1_var . generate double var_expl = RIF_c_var - RIF_1_var . generate double var_unex = RIF_0_var - RIF_c_var . // results . svy: mean mean_* var_* (running mean on estimation sample) Survey: Mean estimation Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 -------------------------------------------------------------- | Linearized | Mean std. err. [95% conf. interval] -------------+------------------------------------------------ mean_diff | -.1388418 .025609 -.1890646 -.088619 mean_expl | -.1064598 .0174267 -.140636 -.0722836 mean_unex | -.032382 .0202816 -.0721571 .007393 var_diff | .0573163 .0205301 .0170539 .0975788 var_expl | -.0170777 .0092358 -.0351903 .001035 var_unex | .074394 .02033 .034524 .114264 -------------------------------------------------------------- . ereturn display // redisplay results including t-statistics and p values ------------------------------------------------------------------------------ | Linearized Mean | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- mean_diff | -.1388418 .025609 -5.42 0.000 -.1890646 -.088619 mean_expl | -.1064598 .0174267 -6.11 0.000 -.140636 -.0722836 mean_unex | -.032382 .0202816 -1.60 0.111 -.0721571 .007393 var_diff | .0573163 .0205301 2.79 0.005 .0170539 .0975788 var_expl | -.0170777 .0092358 -1.85 0.065 -.0351903 .001035 var_unex | .074394 .02033 3.66 0.000 .034524 .114264 ------------------------------------------------------------------------------ . drop RIF_* mean_* var_*
Note: The unexplained part of the decomposition could also be obtained
directly by dstat
.
. dstat (mean variance) lnwage, over(public, contrast(1)) vce(svy) /// > balance(ipw: c.yeduc##c.expft##c.expft c.isei, reference(0)) (running dstat_svyr on estimation sample) Survey: Difference in summary statistics Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 Contrast = 1.public Balancing: method = ipw reference = 0.public controls = e(balance) 0: public = no 1: public = yes ------------------------------------------------------------------------------ | Linearized lnwage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- 0 | mean | -.032382 .0202816 -1.60 0.111 -.0721571 .007393 variance | .074394 .02033 3.66 0.000 .034524 .114264 ------------------------------------------------------------------------------
Wrap the analysis from task 1 into a small program and apply the
bootstrap to compute standard errors and confidence intervals. How do the
bootstrap standard errors compare to the analytic standard errors provided
by dstat
?
We write a program that allows us to specify the predictors to be included
in the reweighting and also allows us to specify the summary statistics to
be decomposed (any statistic returned by summarize
can be
used; see "Stored results" in help summarize
). The program
also marks the estimation sample so that observations with missing value on
any of the relevant variables will be excluded from all computations (and
so that bootstrap
will know which observations have been used).
capture program drop myreweight
program myreweight, eclass
syntax varlist(fv) [, Stats(str) ]
if "`stats'"=="" local stats "mean" // default
// generate weights
logit public `varlist' [pw=weight]
local N = e(N) // number of observations
tempvar touse
generate byte `touse' = e(sample) // this marks the estimation sample
tempvar PS PSI
qui predict double `PS' if `touse', pr
summarize public [aw=weight] if `touse', meanonly
quietly generate double `PSI' = ///
((1-`PS') / (1-r(mean))) / (`PS' / r(mean)) ///
if public==1 & `touse'
// compute results
// - setup results vector
local nstats: list sizeof stats
tempname b
matrix `b' = J(1, 6*`nstats', .)
local colnames
foreach stat of local stats {
local colnames `colnames' ///
`stat':private `stat':public `stat':rw_public ///
`stat':difference `stat':explained `stat':unexplained
}
matrix coln `b' = `colnames'
// raw results for private sector
quietly summarize lnwage [aw=weight] if public==0 & `touse', detail
local j = 1
foreach stat of local stats {
matrix `b'[1,`j'] = r(`stat')
local j = `j' + 6
}
// raw results for public sector
quietly summarize lnwage [aw=weight] if public==1 & `touse', detail
local j = 2
foreach stat of local stats {
matrix `b'[1,`j'] = r(`stat')
local j = `j' + 6
}
// reweighted results for public sector
quietly summarize lnwage if public==1 & `touse' [aw=`PSI'*weight], detail
local j = 3
foreach stat of local stats {
matrix `b'[1,`j'] = r(`stat')
local j = `j' + 6
}
// decomposition
local j = 0
foreach stat of local stats {
matrix `b'[1,`j'+4] = `b'[1,`j'+1] - `b'[1,`j'+2]
matrix `b'[1,`j'+5] = `b'[1,`j'+3] - `b'[1,`j'+2]
matrix `b'[1,`j'+6] = `b'[1,`j'+1] - `b'[1,`j'+3]
local j = `j' + 6
}
// post results
ereturn post `b', obs(`N') esample(`touse')
ereturn display
end
We can now use the program. Here is an example of a single run where we decompose the difference in the mean, the 25% quantile, the median, the 75% quantile, and the variance:
. myreweight c.yeduc##c.expft##c.expft c.isei, stats(mean p25 p50 p75 Var) Iteration 0: Log pseudolikelihood = -6642826.5 Iteration 1: Log pseudolikelihood = -6255160 Iteration 2: Log pseudolikelihood = -6246188 Iteration 3: Log pseudolikelihood = -6246184.6 Iteration 4: Log pseudolikelihood = -6246184.6 Logistic regression Number of obs = 5,430 Wald chi2(6) = 121.71 Prob > chi2 = 0.0000 Log pseudolikelihood = -6246184.6 Pseudo R2 = 0.0597 ----------------------------------------------------------------------------------------- | Robust public | Coefficient std. err. z P>|z| [95% conf. interval] ------------------------+---------------------------------------------------------------- yeduc | .1510122 .0455224 3.32 0.001 .06179 .2402345 expft | -.0286752 .0935728 -0.31 0.759 -.2120745 .1547241 | c.yeduc#c.expft | .0010943 .007059 0.16 0.877 -.0127411 .0149297 | c.expft#c.expft | .0006412 .0029603 0.22 0.829 -.0051608 .0064433 | c.yeduc#c.expft#c.expft | -5.15e-06 .0002306 -0.02 0.982 -.000457 .0004467 | isei | .0121305 .004154 2.92 0.003 .0039888 .0202723 _cons | -3.712432 .6010468 -6.18 0.000 -4.890462 -2.534402 ----------------------------------------------------------------------------------------- ------------------------------------------------------------------------------ | Coefficient -------------+---------------------------------------------------------------- mean | private | 2.733252 public | 2.872093 rw_public | 2.765634 difference | -.1388418 explained | -.1064598 unexplained | -.032382 -------------+---------------------------------------------------------------- p25 | private | 2.386926 public | 2.667228 rw_public | 2.596001 difference | -.280302 explained | -.0712276 unexplained | -.2090745 -------------+---------------------------------------------------------------- p50 | private | 2.726545 public | 2.908539 rw_public | 2.793004 difference | -.1819942 explained | -.1155353 unexplained | -.0664589 -------------+---------------------------------------------------------------- p75 | private | 3.068053 public | 3.148882 rw_public | 3.021887 difference | -.0808294 explained | -.1269951 unexplained | .0461657 -------------+---------------------------------------------------------------- Var | private | .2513728 public | .1940565 rw_public | .1769788 difference | .0573163 explained | -.0170777 unexplained | .074394 ------------------------------------------------------------------------------
Results for the mean and the variance are identical to above. Looking at the results for quantiles we see that the public-private wage gap is larger at the bottom of the distribution, mostly due to the unexplained part. However, relevance of the unexplained part is changing from p25 to p50 and p75. For the 75% quantile the unexplained part even switches sign.
Now use the bootstrap to obtain standard errors and confidence intervals for the results. We only include the mean and the variance.
. bootstrap, cluster(psu) reps(100): /// > myreweight c.yeduc##c.expft##c.expft c.isei, stats(mean Var) (running myreweight on estimation sample) Bootstrap replications (100): .........10.........20.........30.........40.........50.........60.... > .....70.........80.........90.........100 done Bootstrap results Number of obs = 5,430 Replications = 100 (Replications based on 2,034 clusters in psu) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- mean | private | 2.733252 .0135721 201.39 0.000 2.706651 2.759852 public | 2.872093 .0197789 145.21 0.000 2.833327 2.910859 rw_public | 2.765634 .0183483 150.73 0.000 2.729671 2.801596 difference | -.1388418 .023288 -5.96 0.000 -.1844854 -.0931982 explained | -.1064598 .0166688 -6.39 0.000 -.13913 -.0737895 unexplained | -.032382 .0195702 -1.65 0.098 -.070739 .0059749 -------------+---------------------------------------------------------------- Var | private | .2513728 .0093474 26.89 0.000 .2330522 .2696935 public | .1940565 .0193824 10.01 0.000 .1560677 .2320453 rw_public | .1769788 .0190437 9.29 0.000 .1396538 .2143039 difference | .0573163 .0214015 2.68 0.007 .0153701 .0992625 explained | -.0170777 .0095363 -1.79 0.073 -.0357684 .0016131 unexplained | .074394 .0210719 3.53 0.000 .0330938 .1156942 ------------------------------------------------------------------------------
Results are very similar to the ones obtained using dstat
. For the
mean, the explained part is highly significant; the unexplained part is not
significantly different from zero. For the variance, results are opposite.
Evaluate how results change if you use entropy balancing to compute the
weights instead of IPW based on a logit
model.
We can use the same code as in task 2 and simply replace ipw
by
eb
.
. // get RIFs of raw results . dstat (mean variance) lnwage, over(public) vce(svy) /// > rif(RIF_0_mean RIF_0_var RIF_1_mean RIF_1_var) (running dstat_svyr on estimation sample) Survey: Summary statistics Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 0: public = no 1: public = yes -------------------------------------------------------------- | Linearized lnwage | Coefficient std. err. [95% conf. interval] -------------+------------------------------------------------ 0 | mean | 2.733252 .0139762 2.705842 2.760661 variance | .2513728 .0099165 .2319252 .2708205 -------------+------------------------------------------------ 1 | mean | 2.872093 .0220492 2.828852 2.915335 variance | .1940565 .0180675 .1586236 .2294894 -------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- RIF_0_mean double %10.0g RIF of _b[0:mean] RIF_0_var double %10.0g RIF of _b[0:variance] RIF_1_mean double %10.0g RIF of _b[1:mean] RIF_1_var double %10.0g RIF of _b[1:variance] . // get RIFs of reweighted results (only the public sample is reweighted; the . // RIFs for private are the same as above) . dstat (mean variance) lnwage, over(public) vce(svy) /// > balance(eb: c.yeduc##c.expft##c.expft c.isei, reference(0)) /// > rif(RIF_0_mean RIF_0_var RIF_c_mean RIF_c_var) replace (running dstat_svyr on estimation sample) Survey: Summary statistics Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 Balancing: method = eb reference = 0.public controls = e(balance) 0: public = no 1: public = yes -------------------------------------------------------------- | Linearized lnwage | Coefficient std. err. [95% conf. interval] -------------+------------------------------------------------ 0 | mean | 2.733252 .0139762 2.705842 2.760661 variance | .2513728 .0099165 .2319252 .2708205 -------------+------------------------------------------------ 1 | mean | 2.762522 .0182537 2.726724 2.79832 variance | .1757467 .017723 .1409893 .2105041 -------------------------------------------------------------- Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- RIF_0_mean double %10.0g RIF of _b[0:mean] RIF_0_var double %10.0g RIF of _b[0:variance] RIF_c_mean double %10.0g RIF of _b[1:mean] RIF_c_var double %10.0g RIF of _b[1:variance] . // compute the RIFs of the decomposition components . generate double mean_diff = RIF_0_mean - RIF_1_mean . generate double mean_expl = RIF_c_mean - RIF_1_mean . generate double mean_unex = RIF_0_mean - RIF_c_mean . generate double var_diff = RIF_0_var - RIF_1_var . generate double var_expl = RIF_c_var - RIF_1_var . generate double var_unex = RIF_0_var - RIF_c_var . // results . svy: mean mean_* var_* (running mean on estimation sample) Survey: Mean estimation Number of strata = 15 Number of obs = 5,430 Number of PSUs = 2,034 Population size = 12,066,161 Design df = 2,019 -------------------------------------------------------------- | Linearized | Mean std. err. [95% conf. interval] -------------+------------------------------------------------ mean_diff | -.1388418 .025609 -.1890646 -.088619 mean_expl | -.1095713 .0170488 -.1430064 -.0761362 mean_unex | -.0292705 .0202262 -.0689369 .0103959 var_diff | .0573163 .0205301 .0170539 .0975788 var_expl | -.0183098 .0090438 -.036046 -.0005736 var_unex | .0756262 .0202761 .0358618 .1153905 -------------------------------------------------------------- . ereturn display // redisplay results including t-statistics and p values ------------------------------------------------------------------------------ | Linearized Mean | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- mean_diff | -.1388418 .025609 -5.42 0.000 -.1890646 -.088619 mean_expl | -.1095713 .0170488 -6.43 0.000 -.1430064 -.0761362 mean_unex | -.0292705 .0202262 -1.45 0.148 -.0689369 .0103959 var_diff | .0573163 .0205301 2.79 0.005 .0170539 .0975788 var_expl | -.0183098 .0090438 -2.02 0.043 -.036046 -.0005736 var_unex | .0756262 .0202761 3.73 0.000 .0358618 .1153905 ------------------------------------------------------------------------------ . drop RIF_* mean_* var_*
Decomposition results did not change all that much. For both the mean and the variance, the absolute value of the explained part slightly increased. For the variance, the explained part is now marginally significant.
Compute a reweighted Oaxaca-Blinder decomposition to explain the public–private wage gap and interpret the results. Include schooling, experience, and international socio-economic index (ISEI) as predictors. Do the analysis in both directions (i.e. once reweighting the private sector and once reweighting the public sector).
Compute the weights, coefficients, and means:
. // compute weights . kmatch ipw public c.yeduc##c.expft##c.expft c.isei [pw=weight], att wgen(PSI0) Inverse probability weighting Number of obs = 5,430 Treatment : public = 1 Covariates : yeduc expft c.yeduc#c.expft c.expft#c.expft c.yeduc#c.expft#c.expft isei PS model : logit (pr) Matching statistics ------------------------------------------------------------------------------ | Matched | Controls | Yes No Total | Used Unused Total -----------+---------------------------------+-------------------------------- Treated | 1267 0 1267 | 4163 0 4163 ------------------------------------------------------------------------------ Stored variables Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- PSI0 double %10.0g Matching weights for ATT . kmatch ipw public c.yeduc##c.expft##c.expft c.isei [pw=weight], atc wgen(PSI1) Inverse probability weighting Number of obs = 5,430 Treatment : public = 1 Covariates : yeduc expft c.yeduc#c.expft c.expft#c.expft c.yeduc#c.expft#c.expft isei PS model : logit (pr) Matching statistics ------------------------------------------------------------------------------ | Matched | Controls | Yes No Total | Used Unused Total -----------+---------------------------------+-------------------------------- Untreated | 4163 0 4163 | 1267 0 1267 ------------------------------------------------------------------------------ Stored variables Variable Storage Display Value name type format label Variable label ---------------------------------------------------------------------------------------------------- PSI1 double %10.0g Matching weights for ATC . // obtain coefficients and means in private sector . svy: regress lnwage yeduc expft expft2 isei if public==0 (running regress on estimation sample) Survey: Linear regression Number of strata = 15 Number of obs = 4,163 Number of PSUs = 1,793 Population size = 9,175,995 Design df = 1,778 F(4, 1775) = 181.16 Prob > F = 0.0000 R-squared = 0.3516 ------------------------------------------------------------------------------ | Linearized lnwage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- yeduc | .0477036 .0060214 7.92 0.000 .0358939 .0595133 expft | .0285134 .0042272 6.75 0.000 .0202226 .0368041 expft2 | -.0003818 .0001187 -3.22 0.001 -.0006146 -.0001489 isei | .0103159 .0007776 13.27 0.000 .0087908 .0118409 _cons | 1.382317 .072412 19.09 0.000 1.240295 1.524338 ------------------------------------------------------------------------------ . mat b_priv = e(b) . svy: mean yeduc expft expft2 isei if public==0 (running mean on estimation sample) Survey: Mean estimation Number of strata = 15 Number of obs = 4,163 Number of PSUs = 1,793 Population size = 9,175,995 Design df = 1,778 -------------------------------------------------------------- | Linearized | Mean std. err. [95% conf. interval] -------------+------------------------------------------------ yeduc | 12.4304 .0785646 12.27631 12.58449 expft | 14.35697 .2709275 13.8256 14.88834 expft2 | 307.8937 9.139483 289.9684 325.819 isei | 45.18642 .450854 44.30216 46.07068 -------------------------------------------------------------- . mat X_priv = (e(b),1) . // obtain counterfactual coefficients and means in private sector . regress lnwage yeduc expft expft2 isei [pw=PSI0] if public==0 (sum of wgt is 2,890,165.7029972) Linear regression Number of obs = 4,163 F(4, 4158) = 158.06 Prob > F = 0.0000 R-squared = 0.3764 Root MSE = .40838 ------------------------------------------------------------------------------ | Robust lnwage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- yeduc | .0439155 .0069882 6.28 0.000 .0302149 .0576161 expft | .037984 .0046504 8.17 0.000 .0288667 .0471013 expft2 | -.0006764 .0001357 -4.98 0.000 -.0009425 -.0004103 isei | .0110568 .0010086 10.96 0.000 .0090794 .0130343 _cons | 1.35105 .0741994 18.21 0.000 1.205579 1.49652 ------------------------------------------------------------------------------ . mat b_priv_C = e(b) . mean yeduc expft expft2 isei [pw=PSI0] if public==0 Mean estimation Number of obs = 4,163 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ yeduc | 14.08746 .1052854 13.88105 14.29388 expft | 13.65027 .3162639 13.03022 14.27031 expft2 | 284.1405 9.916214 264.6994 303.5815 isei | 53.23341 .5850431 52.08641 54.38041 -------------------------------------------------------------- . mat X_priv_C = (e(b),1) . // obtain coefficients and means in public sector . svy: regress lnwage yeduc expft expft2 isei if public==1 (running regress on estimation sample) Survey: Linear regression Number of strata = 15 Number of obs = 1,267 Number of PSUs = 897 Population size = 2,890,166 Design df = 882 F(4, 879) = 70.93 Prob > F = 0.0000 R-squared = 0.3609 ------------------------------------------------------------------------------ | Linearized lnwage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- yeduc | .0349734 .0082295 4.25 0.000 .0188218 .0511251 expft | .0432595 .0072115 6.00 0.000 .0291058 .0574131 expft2 | -.0008105 .0001948 -4.16 0.000 -.0011929 -.0004282 isei | .0081884 .0013188 6.21 0.000 .0056 .0107767 _cons | 1.585119 .0967903 16.38 0.000 1.395153 1.775086 ------------------------------------------------------------------------------ . mat b_pub = e(b) . svy: mean yeduc expft expft2 isei if public==1 (running mean on estimation sample) Survey: Mean estimation Number of strata = 15 Number of obs = 1,267 Number of PSUs = 897 Population size = 2,890,166 Design df = 882 -------------------------------------------------------------- | Linearized | Mean std. err. [95% conf. interval] -------------+------------------------------------------------ yeduc | 14.0728 .1512438 13.77596 14.36964 expft | 13.57464 .4871342 12.61856 14.53072 expft2 | 282.094 15.63931 251.3994 312.7886 isei | 53.27206 .8210012 51.66072 54.88341 -------------------------------------------------------------- . mat X_pub = (e(b),1) . // obtain counterfactual coefficients and means in public sector . regress lnwage yeduc expft expft2 isei [pw=PSI1] if public==1 (sum of wgt is 9,175,995.0951793) Linear regression Number of obs = 1,267 F(4, 1262) = 51.84 Prob > F = 0.0000 R-squared = 0.3239 Root MSE = .34645 ------------------------------------------------------------------------------ | Robust lnwage | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- yeduc | .0263934 .0078191 3.38 0.001 .0110536 .0417332 expft | .0350385 .0071188 4.92 0.000 .0210726 .0490044 expft2 | -.000636 .0001972 -3.23 0.001 -.0010228 -.0002492 isei | .0088995 .0014613 6.09 0.000 .0060327 .0117663 _cons | 1.726007 .0960101 17.98 0.000 1.53765 1.914364 ------------------------------------------------------------------------------ . mat b_pub_C = e(b) . mean yeduc expft expft2 isei [pw=PSI1] if public==1 Mean estimation Number of obs = 1,267 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ yeduc | 12.50273 .1178106 12.27161 12.73386 expft | 14.3952 .5217609 13.37159 15.41881 expft2 | 309.3088 18.16965 273.6628 344.9547 isei | 45.16942 .7814726 43.63629 46.70254 -------------------------------------------------------------- . mat X_pub_C = (e(b),1)
Variable PSI0
reweights the private sector so that it looks
like the public sector. Variable PSI1
reweights the public
sector so that it looks like the private sector.
Here are the results for the decomposition that reweights the private sector:
. mat D_priv_C = (X_priv * b_priv' - X_pub * b_pub') /// > \ (X_priv - X_priv_C) * b_priv' /// > \ X_priv_C * (b_priv - b_priv_C)' /// > \ X_pub * (b_priv_C - b_pub)' /// > \ (X_priv_C - X_pub) * b_priv_C' . mat rown D_priv_C = "Overall difference" "Explained" "Specification error" /// > "Unexplained" "Reweighting error" . matlist D_priv_C, twidth(20) | y1 ---------------------+---------- Overall difference | -.1388418 Explained | -.1509773 Specification error | -.0003726 Unexplained | .0108031 Reweighting error | .001705
Here are the results for the decomposition that reweights the public sector:
. mat D_pub_C = (X_priv * b_priv' - X_pub * b_pub') /// > \ (X_pub_C - X_pub) * b_pub' /// > \ X_pub_C * (b_pub_C - b_pub)' /// > \ X_priv * (b_priv - b_pub_C)' /// > \ (X_priv - X_pub_C) * b_pub_C' . mat rown D_pub_C = "Overall difference" "Explained" "Specification error" /// > "Unexplained" "Reweighting error" . matlist D_pub_C, twidth(20) | y1 ---------------------+---------- Overall difference | -.1388418 Explained | -.107819 Specification error | .0013593 Unexplained | -.0301846 Reweighting error | -.0021974
Note that command oaxaca_rif
(from package rif
by
Fernando Rios-Avila; type ssc install rif
) can be used to
compute reweighted OB decompositions automatically (but be aware that the standard errors
reported by oaxaca_rif
do not take account of the fact that
the weights in the reweighting step are estimated; so you might want to use
the bootstrap for the standard errors).