Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024
Required packages (install using command ssc install): fre, oaxaca, estout
. // get data as in Exercise 1
. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate lnwage = ln(wage)
(1,709 missing values generated)

. generate expft2 = expft^2
(35 missing values generated)

. drop if missing(sex, lnwage, yeduc, expft, isei, children)
(1,875 observations deleted)

. svyset psu [pw=weight], strata(strata)

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>
Using the extended decomposition from Exercise 1 (i.e. schooling,
full-time experience, ISEI, number of children), evaluate how the results
change depending on how you handle the index problem. Compute the following
variants:
Using male coefficients as "nondiscriminatory" coefficients: set option weight() to 1; this gives the coefficients from the first group (males in this case) a weight of one and the coefficients from the second group (females) a weight of zero.
. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy weight(1)

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
   explained |   .1036131    .016076     6.45   0.000     .0720858    .1351404
 unexplained |   .1027319   .0204223     5.03   0.000      .062681    .1427829
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0092558   .0052244    -1.77   0.077    -.0195016      .00099
         exp |   .1076302   .0126921     8.48   0.000     .0827391    .1325212
        isei |   .0033122   .0075391     0.44   0.660     -.011473    .0180973
    children |   .0019265   .0012233     1.57   0.115    -.0004725    .0043256
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0533151     .12035    -0.44   0.658    -.2893382     .182708
         exp |   .0274948   .0429278     0.64   0.522    -.0566927    .1116822
        isei |   .0863664   .0618819     1.40   0.163    -.0349926    .2077254
    children |   .0012029   .0089191     0.13   0.893    -.0162887    .0186946
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
------------------------------------------------------------------------------
exp: expft expft2

. estimates store male
Using female coefficients as "nondiscriminatory" coefficients: set option weight() to 0.
. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy weight(0)

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b2
  unexplained: X1 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
   explained |    .085632   .0145757     5.87   0.000      .057047    .1142169
 unexplained |   .1207131    .017753     6.80   0.000      .085897    .1555291
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0101468    .005649    -1.80   0.073    -.0212252    .0009315
         exp |   .0912236   .0108276     8.43   0.000     .0699892     .112458
        isei |   .0027324   .0062227     0.44   0.661    -.0094711     .014936
    children |   .0018227   .0011951     1.53   0.127     -.000521    .0041664
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0524241   .1183388    -0.44   0.658    -.2845029    .1796547
         exp |   .0439013   .0536963     0.82   0.414    -.0614046    .1492073
        isei |   .0869461    .062297     1.40   0.163    -.0352269    .2091192
    children |   .0013067    .009689     0.13   0.893    -.0176946    .0203081
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
------------------------------------------------------------------------------
exp: expft expft2

. estimates store female
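Both variants are special cases of a single twofold decomposition in which the reference coefficients are a weighted average of the two sets of group coefficients. The following is a sketch in notation introduced here for illustration only (it is not taken from the course slides): \bar X_g and \hat\beta_g correspond to X1, X2 and b1, b2 in the output legends, and W is the value passed to weight().

```latex
\begin{aligned}
\hat\beta^{*} &= W\,\hat\beta_1 + (1-W)\,\hat\beta_2, \\
\bar Y_1 - \bar Y_2
  &= \underbrace{(\bar X_1 - \bar X_2)'\hat\beta^{*}}_{\text{explained}}
   + \underbrace{\bar X_1'(\hat\beta_1 - \hat\beta^{*})
               + \bar X_2'(\hat\beta^{*} - \hat\beta_2)}_{\text{unexplained}}, \\
W = 1:&\quad (\bar X_1 - \bar X_2)'\hat\beta_1 \;+\; \bar X_2'(\hat\beta_1 - \hat\beta_2), \\
W = 0:&\quad (\bar X_1 - \bar X_2)'\hat\beta_2 \;+\; \bar X_1'(\hat\beta_1 - \hat\beta_2).
\end{aligned}
```

The last two lines agree with the "explained" and "unexplained" legends printed by oaxaca for weight(1) and weight(0).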
Pooled model: apply option pooled instead of using the weight() option (the pooled model will automatically include a group dummy; if you want to use a pooled model without group dummy, you can apply option omega instead of pooled).
. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy pooled

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b
  unexplained: X1 * (b1 - b) + X2 * (b - b2)
               with b from pooled model (including group dummy)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
   explained |     .09528   .0136868     6.96   0.000     .0684383    .1221218
 unexplained |    .111065   .0181452     6.12   0.000     .0754797    .1466503
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0096868    .005335    -1.82   0.070    -.0201494    .0007759
         exp |   .0999682   .0095154    10.51   0.000     .0813071    .1186293
        isei |   .0030201   .0068725     0.44   0.660    -.0104579     .016498
    children |   .0019785   .0011979     1.65   0.099    -.0003707    .0043278
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0528841   .1193906    -0.44   0.658    -.2870257    .1812574
         exp |   .0351568   .0490598     0.72   0.474    -.0610562    .1313698
        isei |   .0866585   .0620786     1.40   0.163    -.0350863    .2084033
    children |   .0011509   .0092786     0.12   0.901    -.0170457    .0193475
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
------------------------------------------------------------------------------
exp: expft expft2

. estimates store pooled
Threefold decomposition (view of women): omit weight() and pooled (threefold is the default).
. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    endowments: (X1 - X2) * b2
  coefficients: X2 * (b1 - b2)
   interaction: (X1 - X2) * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
  endowments |    .085632   .0145757     5.87   0.000      .057047    .1142169
coefficients |   .1027319   .0204223     5.03   0.000      .062681    .1427829
 interaction |   .0179811   .0130391     1.38   0.168    -.0075903    .0435526
-------------+----------------------------------------------------------------
endowments   |
       yeduc |  -.0101468    .005649    -1.80   0.073    -.0212252    .0009315
         exp |   .0912236   .0108276     8.43   0.000     .0699892     .112458
        isei |   .0027324   .0062227     0.44   0.661    -.0094711     .014936
    children |   .0018227   .0011951     1.53   0.127     -.000521    .0041664
-------------+----------------------------------------------------------------
coefficients |
       yeduc |  -.0533151     .12035    -0.44   0.658    -.2893382     .182708
         exp |   .0274948   .0429278     0.64   0.522    -.0566927    .1116822
        isei |   .0863664   .0618819     1.40   0.163    -.0349926    .2077254
    children |   .0012029   .0089191     0.13   0.893    -.0162887    .0186946
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
-------------+----------------------------------------------------------------
interaction  |
       yeduc |    .000891   .0020681     0.43   0.667    -.0031647    .0049468
         exp |   .0164066   .0133163     1.23   0.218    -.0097086    .0425217
        isei |   .0005797   .0013825     0.42   0.675    -.0021316     .003291
    children |   .0001038   .0007721     0.13   0.893    -.0014104     .001618
------------------------------------------------------------------------------
exp: expft expft2

. estimates store tf_female
Threefold decomposition (view of men): add threefold(reverse).
. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy threefold(reverse)

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    endowments: (X1 - X2) * b1
  coefficients: X1 * (b1 - b2)
   interaction: (X1 - X2) * (b2 - b1)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
  endowments |   .1036131    .016076     6.45   0.000     .0720858    .1351404
coefficients |   .1207131    .017753     6.80   0.000      .085897    .1555291
 interaction |  -.0179811   .0130391    -1.38   0.168    -.0435526    .0075903
-------------+----------------------------------------------------------------
endowments   |
       yeduc |  -.0092558   .0052244    -1.77   0.077    -.0195016      .00099
         exp |   .1076302   .0126921     8.48   0.000     .0827391    .1325212
        isei |   .0033122   .0075391     0.44   0.660     -.011473    .0180973
    children |   .0019265   .0012233     1.57   0.115    -.0004725    .0043256
-------------+----------------------------------------------------------------
coefficients |
       yeduc |  -.0524241   .1183388    -0.44   0.658    -.2845029    .1796547
         exp |   .0439013   .0536963     0.82   0.414    -.0614046    .1492073
        isei |   .0869461    .062297     1.40   0.163    -.0352269    .2091192
    children |   .0013067    .009689     0.13   0.893    -.0176946    .0203081
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
-------------+----------------------------------------------------------------
interaction  |
       yeduc |   -.000891   .0020681    -0.43   0.667    -.0049468    .0031647
         exp |  -.0164066   .0133163    -1.23   0.218    -.0425217    .0097086
        isei |  -.0005797   .0013825    -0.42   0.675     -.003291    .0021316
    children |  -.0001038   .0007721    -0.13   0.893     -.001618    .0014104
------------------------------------------------------------------------------
exp: expft expft2

. estimates store tf_male
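The threefold and twofold results are linked by simple sums: adding the interaction to the endowments part gives the explained part under male reference coefficients, and adding it to the coefficients part gives the unexplained part under female reference coefficients. A minimal sketch of this check, using only numbers copied from the threefold output (view of women) above; the scalar names are hypothetical and chosen here for illustration:

```stata
* components of the threefold decomposition (view of women), copied from the output
scalar s_endow = .085632     // endowments:   (X1 - X2) * b2
scalar s_coef  = .1027319    // coefficients: X2 * (b1 - b2)
scalar s_inter = .0179811    // interaction:  (X1 - X2) * (b1 - b2)

* explained part with male reference coefficients, i.e. weight(1)
display %9.7f s_endow + s_inter

* unexplained part with female reference coefficients, i.e. weight(0)
display %9.7f s_coef + s_inter

* all three components add up to the raw wage gap
display %9.7f s_endow + s_coef + s_inter
```

Up to rounding in the last printed digit, the three sums reproduce the values .1036131, .1207131 and .206345 reported in the twofold outputs.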
Generate an overview table and try to make sense of the results. What is the correct interpretation of the various results? How can the differences be explained?
. esttab male female pooled tf_female tf_male, se nonumber mtitles ///
>     equations(Overall=1, Explained=2, Unexplained=3) ///
>     rename(endowments explained coefficients unexplained)

--------------------------------------------------------------------------------------------
                     male          female          pooled       tf_female         tf_male
--------------------------------------------------------------------------------------------
Overall
group_1             2.866***        2.866***        2.866***        2.866***        2.866***
                 (0.0163)        (0.0163)        (0.0163)        (0.0163)        (0.0163)

group_2             2.659***        2.659***        2.659***        2.659***        2.659***
                 (0.0152)        (0.0152)        (0.0152)        (0.0152)        (0.0152)

difference          0.206***        0.206***        0.206***        0.206***        0.206***
                 (0.0205)        (0.0205)        (0.0205)        (0.0205)        (0.0205)

explained           0.104***       0.0856***       0.0953***       0.0856***        0.104***
                 (0.0161)        (0.0146)        (0.0137)        (0.0146)        (0.0161)

unexplained         0.103***        0.121***        0.111***        0.103***        0.121***
                 (0.0204)        (0.0178)        (0.0181)        (0.0204)        (0.0178)

interaction                                                        0.0180         -0.0180
                                                                 (0.0130)        (0.0130)
--------------------------------------------------------------------------------------------
Explained
yeduc            -0.00926         -0.0101        -0.00969         -0.0101        -0.00926
                (0.00522)       (0.00565)       (0.00534)       (0.00565)       (0.00522)

exp                 0.108***       0.0912***       0.1000***       0.0912***        0.108***
                 (0.0127)        (0.0108)       (0.00952)        (0.0108)        (0.0127)

isei              0.00331         0.00273         0.00302         0.00273         0.00331
                (0.00754)       (0.00622)       (0.00687)       (0.00622)       (0.00754)

children          0.00193         0.00182         0.00198         0.00182         0.00193
                (0.00122)       (0.00120)       (0.00120)       (0.00120)       (0.00122)
--------------------------------------------------------------------------------------------
Unexplained
yeduc             -0.0533         -0.0524         -0.0529         -0.0533         -0.0524
                  (0.120)         (0.118)         (0.119)         (0.120)         (0.118)

exp                0.0275          0.0439          0.0352          0.0275          0.0439
                 (0.0429)        (0.0537)        (0.0491)        (0.0429)        (0.0537)

isei               0.0864          0.0869          0.0867          0.0864          0.0869
                 (0.0619)        (0.0623)        (0.0621)        (0.0619)        (0.0623)

children          0.00120         0.00131         0.00115         0.00120         0.00131
                (0.00892)       (0.00969)       (0.00928)       (0.00892)       (0.00969)

_cons              0.0410          0.0410          0.0410          0.0410          0.0410
                  (0.115)         (0.115)         (0.115)         (0.115)         (0.115)
--------------------------------------------------------------------------------------------
interaction
yeduc                                                            0.000891       -0.000891
                                                                (0.00207)       (0.00207)

exp                                                                0.0164         -0.0164
                                                                 (0.0133)        (0.0133)

isei                                                             0.000580       -0.000580
                                                                (0.00138)       (0.00138)

children                                                         0.000104       -0.000104
                                                               (0.000772)      (0.000772)
--------------------------------------------------------------------------------------------
N                    5434            5434            5434            5434            5434
--------------------------------------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Explanation of the options equations() and rename(): A complication when compiling an overview table is that the different parts of the output are labeled differently across the decompositions. In the two-fold decomposition, the label "explained" is used for ΔX and "unexplained" for ΔS. In the three-fold decomposition, the corresponding labels are "endowments" and "coefficients". By default, esttab places differently named elements into different rows, which results in a messy table in the current case. To tidy up the table, option equations() specifies how the equations are to be merged, and option rename() renames some of the coefficients.
Interpretation: The choice of the reference coefficients changes results somewhat. When using the male coefficients, the explained part of the gender wage gap is larger than when using the female coefficients (using male coefficients, 0.104/0.206 = 50% of the wage gap is explained; using female coefficients, only 0.0856/0.206 = 42% is explained). The difference is mostly due to the steeper earnings profile of men across work experience. Because of that, the gender difference in work experience explains more of the overall wage gap if male coefficients are used as reference coefficients. This can also nicely be seen in the "interaction" equation of the three-fold decomposition that quantifies the differences in the contributions to the explained part depending on whether the male or the female coefficients are used as reference (only for experience this difference is substantial). Using the pooled model leads to a compromise between the two extremes.
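The shares quoted in the interpretation can be recomputed from the unrounded estimates. A minimal sketch, using only values copied from the outputs above (the scalar names are hypothetical); the pooled variant is included for comparison:

```stata
* raw wage gap and explained parts, copied from the decomposition outputs
scalar gap      = .206345     // difference in mean log wages
scalar e_male   = .1036131    // explained, male coefficients   (weight(1))
scalar e_female = .085632     // explained, female coefficients (weight(0))
scalar e_pooled = .09528      // explained, pooled model

* explained share of the gap in percent
display "male coefficients:   " %4.1f 100*e_male/gap
display "female coefficients: " %4.1f 100*e_female/gap
display "pooled coefficients: " %4.1f 100*e_pooled/gap
```

With the unrounded estimates the shares come out at roughly 50, 41.5 and 46 percent; the 42% quoted in the text results from dividing the rounded table entries 0.0856 and 0.206.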
Fundamentally, the difference between using the male coefficients and the female coefficients is a change in perspective in the sense that different counterfactual exercises are performed. When using the male coefficients, we essentially ask how much men would lose if their work experience were reduced to that of women. When using the female coefficients, we ask how much women would gain if their work experience were increased to that of men. In both cases we assume that everything else stays the same.
Optional: Compute a decomposition that is defined in a way such that the unexplained component can be interpreted as an "average treatment effect" (see slides for details).
Giving the male coefficients a weight equal to the proportion of females in the sample, and giving the female coefficients a weight equal to the proportion of males, leads to an unexplained part that is equal in size to the average treatment effect obtained by a regression-adjustment estimator:
. svy: proportion sex
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =    15                 Number of obs   =      5,434
Number of PSUs   = 2,035                 Population size = 12,071,607
                                         Design df       =      2,020

--------------------------------------------------------------
             |             Linearized            Logit
             | Proportion   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         sex |
        male |   .5187475   .0094003       .500295     .537149
      female |   .4812525   .0094003       .462851     .499705
--------------------------------------------------------------

. local p_female = _b[2.sex]

. oaxaca lnwage yeduc (exp: expft expft2) isei children, by(sex) svy weight(`p_female') nodetail

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,434
Number of PSUs   = 2,035                      Population size = 12,071,607
                                              Design df       =      2,020
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,624
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b
  unexplained: X1 * (b1 - b) + X2 * (b - b2)
               with b = .481253 * b1 + (1 - .481253) * b2

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
   explained |   .0942854   .0138614     6.80   0.000     .0671013    .1214695
 unexplained |   .1120596   .0179378     6.25   0.000     .0768811    .1472381
------------------------------------------------------------------------------
Confirm using teffects ra (the command does not support survey estimation, but we can still take account of sampling weights and clustering; the only element of the survey design we then ignore is stratification, which seems acceptable because we are only interested in the point estimate here, and the point estimate is not affected by clustering and stratification):
. teffects ra (lnwage yeduc expft expft2 isei children) (sex) [pw=weight], vce(cluster psu)

Iteration 0:  EE criterion =  5.263e-30
Iteration 1:  EE criterion =  4.851e-32

Treatment-effects estimation                    Number of obs     =      5,434
Estimator      : regression adjustment
Outcome model  : linear
Treatment model: none
                                    (Std. err. adjusted for 2,035 clusters in psu)
-----------------------------------------------------------------------------------
                  |               Robust
           lnwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
ATE               |
              sex |
(female vs male)  |  -.1120596   .0179062    -6.26   0.000    -.1471552    -.076964
------------------+----------------------------------------------------------------
POmean            |
              sex |
            male  |   2.815728   .0167039   168.57   0.000     2.782989    2.848467
-----------------------------------------------------------------------------------
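Why the weight has to equal the female share can be seen from a short derivation. This is a sketch in notation introduced here for illustration (not taken from the slides): \bar X_g are the group means of the regressors including the constant, \hat\beta_g the group-specific coefficients, p_1 and p_2 = 1 - p_1 the weighted shares of men and women, \bar X the overall mean, and W the value passed to weight().

```latex
\begin{aligned}
U(W) &= \bar X_1'(\hat\beta_1 - \hat\beta^{*}) + \bar X_2'(\hat\beta^{*} - \hat\beta_2),
\qquad \hat\beta^{*} = W\hat\beta_1 + (1-W)\hat\beta_2 \\
     &= \bigl[(1-W)\,\bar X_1 + W\,\bar X_2\bigr]'(\hat\beta_1 - \hat\beta_2), \\
U(p_2) &= \bigl[p_1\bar X_1 + p_2\bar X_2\bigr]'(\hat\beta_1 - \hat\beta_2)
        = \bar X'(\hat\beta_1 - \hat\beta_2).
\end{aligned}
```

The last expression is the predicted male-female difference in log wages averaged over the covariate distribution of the whole sample, which is what a regression-adjustment estimator of the ATE computes. Because teffects ra reports the contrast "female vs male", its estimate has the opposite sign (-.1120596 versus .1120596).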
1. Replace ISEI in the extended model of Exercise 1 by the (categorical) EGP variable (egp). Before you do that, inspect the variable egp carefully and drop categories with a very low number of observations. Only report the aggregate contribution of EGP. Illustrate how the results change if you switch the base level.
Make dummies (we do not use tabulate, generate() here because we want to name the dummies using the corresponding EGP values; tabulate, generate() would use consecutive numbers):
. fre egp

egp -- EGP class
--------------------------------------------------------------------------------------------------
                                                     |      Freq.    Percent      Valid       Cum.
-----------------------------------------------------+--------------------------------------------
Valid   1 higher managerial and professional workers |        833      15.33      15.33      15.33
          (I)                                        |
        2 lower managerial and professional workers  |       1411      25.97      25.97      41.30
          (II)                                       |
        3 higher routine service workers (IIIa)      |        810      14.91      14.91      56.20
        4 lower routine service workers (IIIb)       |        732      13.47      13.47      69.67
        5 small self-employed and farmers (IV)       |          2       0.04       0.04      69.71
        6 skilled manual workers (V, VI)             |        757      13.93      13.93      83.64
        7 semi- and unskilled manual workers (VIIa)  |        827      15.22      15.22      98.86
        8 agricultural labourers (VIIb)              |         62       1.14       1.14     100.00
        Total                                        |       5434     100.00     100.00
--------------------------------------------------------------------------------------------------

. drop if egp==5 // only two observations; self-employed are not part of gsoep-extract.dta
(2 observations deleted)

. quietly levelsof egp

. foreach l in `r(levels)' {
      quietly generate byte egp_`l' = egp==`l' if egp<.
  }

. summarize egp_*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       egp_1 |      5,432    .1533505    .3603582          0          1
       egp_2 |      5,432     .259757    .4385416          0          1
       egp_3 |      5,432    .1491163    .3562359          0          1
       egp_4 |      5,432     .134757    .3414953          0          1
       egp_6 |      5,432    .1393594     .346353          0          1
-------------+---------------------------------------------------------
       egp_7 |      5,432    .1522459    .3592922          0          1
       egp_8 |      5,432    .0114138    .1062339          0          1
Results using class I (upper service class) as reference category:
. oaxaca lnwage yeduc (exp: expft expft2) children (EGP: egp_2-egp_8), by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,432
Number of PSUs   = 2,034                      Population size = 12,070,291
                                              Design df       =      2,019
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,622
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865539   .0161887   177.01   0.000     2.833791    2.897287
     group_2 |   2.659247   .0151161   175.92   0.000     2.629603    2.688892
  difference |   .2062915   .0205458    10.04   0.000     .1659983    .2465847
   explained |   .1233931    .017833     6.92   0.000     .0884201    .1583662
 unexplained |   .0828984   .0226395     3.66   0.000     .0384992    .1272975
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0108239   .0060271    -1.80   0.073    -.0226439     .000996
         exp |   .1078319   .0126765     8.51   0.000     .0829715    .1326922
    children |   .0017906   .0011574     1.55   0.122    -.0004793    .0040605
         EGP |   .0245946   .0136235     1.81   0.071     -.002123    .0513121
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0111098   .1151472     0.10   0.923    -.2147099    .2369295
         exp |   .0314076   .0428362     0.73   0.464    -.0526003    .1154154
    children |   .0018348   .0088282     0.21   0.835    -.0154786    .0191481
         EGP |  -.0341147    .051724    -0.66   0.510    -.1355527    .0673234
       _cons |   .0726609   .1581958     0.46   0.646    -.2375831    .3829049
------------------------------------------------------------------------------
exp: expft expft2
EGP: egp_2 egp_3 egp_4 egp_6 egp_7 egp_8
Results using class V+VI (skilled manual workers) as reference category:
. oaxaca lnwage yeduc (exp: expft expft2) children (EGP: egp_1-egp_4 egp_7 egp_8), ///
>     by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,432
Number of PSUs   = 2,034                      Population size = 12,070,291
                                              Design df       =      2,019
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,622
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865539   .0161887   177.01   0.000     2.833791    2.897287
     group_2 |   2.659247   .0151161   175.92   0.000     2.629603    2.688892
  difference |   .2062915   .0205458    10.04   0.000     .1659983    .2465847
   explained |   .1233931    .017833     6.92   0.000     .0884201    .1583662
 unexplained |   .0828984   .0226395     3.66   0.000     .0384992    .1272975
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0108239   .0060271    -1.80   0.073    -.0226439     .000996
         exp |   .1078319   .0126765     8.51   0.000     .0829715    .1326922
    children |   .0017906   .0011574     1.55   0.122    -.0004793    .0040605
         EGP |   .0245946   .0136235     1.81   0.071     -.002123    .0513121
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0111098   .1151472     0.10   0.923    -.2147099    .2369295
         exp |   .0314076   .0428362     0.73   0.464    -.0526003    .1154154
    children |   .0018348   .0088282     0.21   0.835    -.0154786    .0191481
         EGP |  -.1490758   .0651441    -2.29   0.022    -.2768326   -.0213191
       _cons |   .1876221   .1343036     1.40   0.163    -.0757661    .4510102
------------------------------------------------------------------------------
exp: expft expft2
EGP: egp_1 egp_2 egp_3 egp_4 egp_7 egp_8
The contribution of EGP to the explained part does not change. However, the contribution to the unexplained part changes quite dramatically depending on the choice of the reference category.
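The pattern follows from how dummy coefficients and the intercept shift when the base level changes. A sketch in notation introduced here for illustration: \bar D_{gk} is the share of group g in class k (over all classes, so the shares sum to one), \alpha_g the intercept and \beta_{gk} the class effects of group g with \beta_{gr} = 0 for the original base r; a superscript (s) marks the same model with base s.

```latex
\begin{aligned}
\alpha_g^{(s)} &= \alpha_g + \beta_{gs}, \qquad \beta_{gk}^{(s)} = \beta_{gk} - \beta_{gs}, \\
\sum_k (\bar D_{1k} - \bar D_{2k})\,\beta_{1k}^{(s)}
  &= \sum_k (\bar D_{1k} - \bar D_{2k})\,\beta_{1k}
     - \beta_{1s}\underbrace{\sum_k (\bar D_{1k} - \bar D_{2k})}_{=\,0}, \\
\sum_k \bar D_{2k}\,\bigl(\beta_{1k}^{(s)} - \beta_{2k}^{(s)}\bigr)
  &= \sum_k \bar D_{2k}\,(\beta_{1k} - \beta_{2k}) - (\beta_{1s} - \beta_{2s}), \\
\alpha_1^{(s)} - \alpha_2^{(s)} &= (\alpha_1 - \alpha_2) + (\beta_{1s} - \beta_{2s}).
\end{aligned}
```

The explained contribution is therefore invariant, while the amount (\beta_{1s} - \beta_{2s}) is merely moved between the EGP row and the _cons row of the unexplained part; their sum does not depend on the base level.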
Normalize the effects of EGP to make its contribution independent of the choice of the base level (unweighted normalization using oaxaca).
. oaxaca lnwage yeduc (exp: expft expft2) children (EGP: normalize(egp_*)), ///
>     by(sex) weight(1) svy
(normalized: egp_1 egp_2 egp_3 egp_4 egp_6 egp_7 egp_8)

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,432
Number of PSUs   = 2,034                      Population size = 12,070,291
                                              Design df       =      2,019
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,622
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865539   .0161887   177.01   0.000     2.833791    2.897287
     group_2 |   2.659247   .0151161   175.92   0.000     2.629603    2.688892
  difference |   .2062915   .0205458    10.04   0.000     .1659983    .2465847
   explained |   .1233931    .017833     6.92   0.000     .0884201    .1583662
 unexplained |   .0828984   .0226395     3.66   0.000     .0384992    .1272975
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0108239   .0060271    -1.80   0.073    -.0226439     .000996
         exp |   .1078319   .0126765     8.51   0.000     .0829715    .1326922
    children |   .0017906   .0011574     1.55   0.122    -.0004793    .0040605
         EGP |   .0245946   .0136235     1.81   0.071     -.002123    .0513121
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0111098   .1151472     0.10   0.923    -.2147099    .2369295
         exp |   .0314076   .0428362     0.73   0.464    -.0526003    .1154154
    children |   .0018348   .0088282     0.21   0.835    -.0154786    .0191481
         EGP |  -.0462237    .018013    -2.57   0.010    -.0815498   -.0108976
       _cons |   .0847699   .1280491     0.66   0.508    -.1663522     .335892
------------------------------------------------------------------------------
exp: expft expft2
EGP: egp_1 egp_2 egp_3 egp_4 egp_6 egp_7 egp_8
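Across the three specifications only the split between the EGP row and the _cons row of the unexplained part differs; their sum should stay the same. A minimal sketch of this check, using only values copied from the three outputs above (the scalar names are hypothetical):

```stata
* unexplained contribution of EGP plus _cons under each parameterization
scalar sum_baseI  = -.0341147 + .0726609   // reference category: class I
scalar sum_baseVI = -.1490758 + .1876221   // reference category: class V+VI
scalar sum_norm   = -.0462237 + .0847699   // unweighted normalization

display %9.7f sum_baseI
display %9.7f sum_baseVI
display %9.7f sum_norm
```

The three sums agree up to rounding in the last printed digit, so the choice of base level or normalization only reallocates the unexplained part between EGP and the constant.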
Now simplify the EGP variable by combining classes VIIa and VIIb (codes 7 and 8) into one bigger class. How do the decomposition results change?
. generate byte EGP = egp

. replace EGP = 7 if EGP==8
(62 real changes made)

. fre EGP

EGP
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   1     |        833      15.34      15.34      15.34
        2     |       1411      25.98      25.98      41.31
        3     |        810      14.91      14.91      56.22
        4     |        732      13.48      13.48      69.70
        6     |        757      13.94      13.94      83.63
        7     |        889      16.37      16.37     100.00
        Total |       5432     100.00     100.00
-----------------------------------------------------------

. quietly levelsof EGP

. foreach l in `r(levels)' {
      quietly generate byte EGP_`l' = EGP==`l' if EGP<.
  }

. summarize EGP_*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       EGP_1 |      5,432    .1533505    .3603582          0          1
       EGP_2 |      5,432     .259757    .4385416          0          1
       EGP_3 |      5,432    .1491163    .3562359          0          1
       EGP_4 |      5,432     .134757    .3414953          0          1
       EGP_6 |      5,432    .1393594     .346353          0          1
-------------+---------------------------------------------------------
       EGP_7 |      5,432    .1636598    .3700006          0          1

. oaxaca lnwage yeduc (exp: expft expft2) children (EGP: normalize(EGP_*)), ///
>     by(sex) weight(1) svy
(normalized: EGP_1 EGP_2 EGP_3 EGP_4 EGP_6 EGP_7)

Blinder-Oaxaca decomposition

Number of strata =    15                      Number of obs   =      5,432
Number of PSUs   = 2,034                      Population size = 12,070,291
                                              Design df       =      2,019
                                              Model           =     linear
Group 1: sex = 1                              N of obs 1      =      2,622
Group 2: sex = 2                              N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865539     .01622   176.67   0.000     2.833729    2.897349
     group_2 |   2.659247   .0150841   176.30   0.000     2.629665    2.688829
  difference |   .2062915   .0204962    10.06   0.000     .1660955    .2464875
   explained |   .1243395    .017779     6.99   0.000     .0894723    .1592067
 unexplained |    .081952   .0226796     3.61   0.000     .0374742    .1264298
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0108035   .0060164    -1.80   0.073    -.0226026    .0009956
         exp |   .1086303   .0127124     8.55   0.000     .0836995    .1335612
    children |   .0017697   .0011475     1.54   0.123    -.0004807      .00402
         EGP |    .024743   .0135759     1.82   0.069    -.0018812    .0513672
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0189711    .114963     0.17   0.869    -.2064874    .2444295
         exp |   .0322705    .043009     0.75   0.453     -.052076    .1166171
    children |   .0012317   .0088633     0.14   0.889    -.0161504    .0186138
         EGP |  -.0270174   .0116508    -2.32   0.020    -.0498662   -.0041686
       _cons |   .0564961   .1292078     0.44   0.662    -.1968985    .3098906
------------------------------------------------------------------------------
exp: expft expft2
EGP: EGP_1 EGP_2 EGP_3 EGP_4 EGP_6 EGP_7
The contribution to the explained part did not change very much. However, for the unexplained part, the contribution of EGP is different (-0.046 vs. -0.027). This is a general problem of the (unweighted) normalization: for the contribution to the unexplained part, results can change quite a bit depending on minor changes in the categories.
Optional: Compute the contribution of EGP and the simplified EGP to the unexplained part using a weighted normalization. You need to do this manually (hint: you can use command contrast to obtain normalized coefficients after running a regression). Compare the results to the results from the unweighted normalization.
This is currently not implemented in oaxaca and has to be done manually. The approach is to first use contrast to compute the transformed coefficients and then use matrix multiplication to obtain the contribution to the unexplained part. The gw. operator is what we need to compute deviation contrasts from the weighted mean.
. svy, subpop(if sex==1): ///
>     regress lnwage yeduc expft expft2 children i.egp, nofvlab
  (output omitted)

. contrast gw.egp, nofvlab nowald

Contrasts of marginal linear predictions

Design df = 2,019

Margins: asbalanced

--------------------------------------------------------------
             |   Contrast   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         egp |
(1 vs mean)  |    .267891   .0258126      .2172688    .3185131
(2 vs mean)  |   .0564703   .0256409       .006185    .1067556
(3 vs mean)  |   -.099255   .0447632     -.1870419   -.0114681
(4 vs mean)  |  -.1058109   .0495935     -.2030707   -.0085512
(6 vs mean)  |  -.0800621   .0258872     -.1308305   -.0292937
(7 vs mean)  |   -.211495   .0293221     -.2689998   -.1539902
(8 vs mean)  |  -.4512152   .0551535      -.559379   -.3430515
--------------------------------------------------------------

. matrix b_m = r(b)

. svy, subpop(if sex==2): ///
>     regress lnwage yeduc expft expft2 children i.egp, nofvlab
  (output omitted)

. contrast gw.egp, nofvlab nowald

Contrasts of marginal linear predictions

Design df = 2,019

Margins: asbalanced

--------------------------------------------------------------
             |   Contrast   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         egp |
(1 vs mean)  |   .2583709   .0408441      .1782699    .3384719
(2 vs mean)  |   .0703231   .0191034      .0328586    .1077876
(3 vs mean)  |   .0544614   .0203194      .0146123    .0943106
(4 vs mean)  |  -.1128703   .0252037     -.1622982   -.0634424
(6 vs mean)  |  -.2045433   .0547189     -.3118548   -.0972319
(7 vs mean)  |  -.2452683   .0359918     -.3158532   -.1746834
(8 vs mean)  |  -.5953543   .0841274       -.76034   -.4303687
--------------------------------------------------------------

. matrix b_f = r(b)

. svy, subpop(if sex==2): proportion egp if e(sample)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =    15                 Number of obs   =        5,432
Number of PSUs   = 2,034                 Population size =   12,070,291
                                         Subpop. no. obs =        2,810
                                         Subpop. size    =  5,809,491.4
                                         Design df       =        2,019

-------------------------------------------------------------------------------------------------
                                                |             Linearized            Logit
                                                | Proportion   std. err.     [95% conf. interval]
------------------------------------------------+------------------------------------------------
                                            egp |
 higher managerial and professional workers (I) |   .1083481   .0096877      .0907605    .1288609
 lower managerial and professional workers (II) |   .3140206   .0155473      .2843604    .3452818
          higher routine service workers (IIIa) |   .2113439   .0127826      .1873618    .2374986
           lower routine service workers (IIIb) |   .2183959    .013002      .1939701    .2449628
                 skilled manual workers (V, VI) |   .0431241   .0055396      .0334762    .0553931
      semi- and unskilled manual workers (VIIa) |   .0978556   .0086602      .0821363    .1162024
                  agricultural labourers (VIIb) |   .0069118   .0031766      .0028004    .0169566
-------------------------------------------------------------------------------------------------

. matrix X_f = e(b)

. matrix U = X_f * (b_m - b_f)'

. matrix list U

symmetric U[1,1]
            r1
y1  -.02459457
We now pack this into a small program so that we can reuse it and so that it
also supports the unweighted normalization (operator g. instead of gw.).
    . capture program drop egpdecomp

    . program egpdecomp, rclass
          args op egp
          quietly {
              svy, subpop(if sex==1): ///
    >             regress lnwage yeduc expft expft2 children i.`egp', nofvlab
              contrast `op'.`egp', nofvlab nowald
              matrix b_m = r(b)
              svy, subpop(if sex==2): ///
    >             regress lnwage yeduc expft expft2 children i.`egp', nofvlab
              contrast `op'.`egp', nofvlab nowald
              matrix b_f = r(b)
              svy, subpop(if sex==2): proportion `egp' if e(sample)
              matrix X_f = e(b)
              matrix U = X_f * (b_m - b_f)'
          }
          return scalar U = U[1,1]
          display as txt "Contribution to unexplained part = " as res return(U)
      end
Using this program, results for different situations can be computed without much effort:
    . egpdecomp g egp
    Contribution to unexplained part = -.0462237

    . egpdecomp g EGP
    Contribution to unexplained part = -.02701741

    . egpdecomp gw egp
    Contribution to unexplained part = -.02459457

    . egpdecomp gw EGP
    Contribution to unexplained part = -.024743
The first two results are the same as above using the unweighted
normalization. The latter two are the results using the weighted
normalization. As can be seen, the change in the categorization (egp vs. EGP)
has only a minor effect on the results using weighted normalization.
Looking at the results we also realize that in case of the weighted normalization the aggregate contribution of a categorical predictor to the unexplained part is exactly -1 times the contribution to the explained part! This is a formal property of the weighted normalization (at least in a linear decomposition)!
This means that we do not really need to compute the weighted normalization;
we can just read it off the standard output from oaxaca (i.e. -1 times the
contribution to the explained part). However, it also highlights
once more that the detailed decomposition of the unexplained part is
problematic (what is the point of computing the contribution to the
unexplained part if we know that mechanically it will just be -1 times the
contribution to the explained part?).
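The property can be verified in a few lines. The following is a sketch in our own notation (an assumption, not taken from the slides): $p_{gk}$ is the share of category $k$ in group $g$ (1 = men, 2 = women), $\beta_{gk}$ are the category effects of group $g$ under any normalization, and $\delta_{gk}$ are the deviations from the group's own weighted mean, assuming that the weights applied by gw. coincide with $p_{gk}$. The decomposition is the one used throughout (explained: (X1 - X2) * b1, unexplained: X2 * (b1 - b2)).

```latex
\begin{align*}
\delta_{gk} &= \beta_{gk} - \sum_j p_{gj}\,\beta_{gj}
  \quad\Longrightarrow\quad \sum_k p_{gk}\,\delta_{gk} = 0, \\
U &= \sum_k p_{2k}\,(\delta_{1k} - \delta_{2k})
   = \sum_k p_{2k}\,\delta_{1k}, \\
E &= \sum_k (p_{1k} - p_{2k})\,\delta_{1k}
   = -\sum_k p_{2k}\,\delta_{1k} = -U .
\end{align*}
```

Because $\sum_k (p_{1k} - p_{2k}) = 0$, adding a constant to all $\delta_{1k}$ leaves $E$ unchanged, so the explained contribution of the whole set of categories is the same under every normalization and every choice of base category.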
Note that Kim (2013) suggests a slightly different weighted normalization.
Using contrast with the gw. operator is
equivalent to normalizing the coefficients of each model using the
distribution of categories in the (sub)sample that has been used to
estimate the model. What Kim (2013) suggests is to use the distribution of
categories in the overall sample across both groups to normalize the
coefficients of each model. This leads to slightly different results such
that the relation above (i.e., that the contribution to the unexplained
part is equal to -1 times the contribution to the explained part) only
holds approximately.
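Why the relation becomes approximate can be sketched as follows, assuming Kim's normalization works as described in the paragraph above. The notation is ours: $p_{gk}$ is the share of category $k$ in group $g$, $\bar p_k$ the share in the pooled sample, $\beta_{gk}$ the category effects of group $g$, and $U^{K}$ and $E$ the contributions to the unexplained and explained parts.

```latex
\begin{align*}
\delta^{K}_{gk} &= \beta_{gk} - \sum_j \bar p_j\,\beta_{gj}, \\
U^{K} &= \sum_k p_{2k}\,\bigl(\delta^{K}_{1k} - \delta^{K}_{2k}\bigr), \qquad
E = \sum_k (p_{1k} - p_{2k})\,\delta^{K}_{1k}, \\
U^{K} + E &= \sum_k p_{1k}\,\delta^{K}_{1k} - \sum_k p_{2k}\,\delta^{K}_{2k},
\qquad
\sum_k p_{gk}\,\delta^{K}_{gk} = \sum_k (p_{gk} - \bar p_k)\,\beta_{gk}.
\end{align*}
```

Each of the two remaining terms is zero only if the group's category distribution equals the pooled distribution, so $U^{K} = -E$ holds exactly only in that case and approximately when the two distributions are similar.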
Optional: Compute the "industry decomposition" described on the slides by
economic sector (variable industry).
Define dummy variables for the three sectors:
    . fre industry

    industry -- economic sector
    ------------------------------------------------------------------------
                               |      Freq.    Percent      Valid       Cum.
    ---------------------------+--------------------------------------------
    Valid   1 primary sector   |         80       1.47       1.49       1.49
            2 secondary sector |       1622      29.86      30.16      31.65
            3 tertiary sector  |       3676      67.67      68.35     100.00
            Total              |       5378      99.01     100.00
    Missing .                  |         54       0.99
    Total                      |       5432     100.00
    ------------------------------------------------------------------------

    . generate byte primary = industry==1 if industry<.
    (54 missing values generated)

    . generate byte secondary = industry==2 if industry<.
    (54 missing values generated)

    . generate byte tertiary = industry==3 if industry<.
    (54 missing values generated)
Run the decomposition using the secondary sector as base level:
    . oaxaca lnwage yeduc (exp: expft expft2) children primary tertiary, ///
    >     by(sex) weight(1) svy

    Blinder-Oaxaca decomposition

    Number of strata =    15                     Number of obs   =      5,378
    Number of PSUs   = 2,018                     Population size = 11,939,847
                                                 Design df       =      2,003
                                                 Model           =     linear
    Group 1: sex = 1                             N of obs 1      =      2,592
    Group 2: sex = 2                             N of obs 2      =      2,786

        explained: (X1 - X2) * b1
      unexplained: X2 * (b1 - b2)

    ------------------------------------------------------------------------------
                 |             Linearized
          lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    overall      |
         group_1 |    2.86764   .0160951   178.17   0.000     2.836076    2.899205
         group_2 |   2.660935   .0151013   176.21   0.000     2.631319    2.690551
      difference |   .2067055   .0205496    10.06   0.000     .1664048    .2470063
       explained |   .1323815   .0168501     7.86   0.000      .099336    .1654271
     unexplained |    .074324   .0218455     3.40   0.001     .0314818    .1171662
    -------------+----------------------------------------------------------------
    explained    |
           yeduc |  -.0199175   .0105231    -1.89   0.059    -.0405548    .0007198
             exp |   .1012204   .0126907     7.98   0.000     .0763321    .1261088
        children |   .0020019   .0012916     1.55   0.121    -.0005311     .004535
         primary |  -.0028996   .0024324    -1.19   0.233    -.0076699    .0018708
        tertiary |   .0519763   .0090816     5.72   0.000     .0341658    .0697867
    -------------+----------------------------------------------------------------
    unexplained  |
           yeduc |   .0992443   .0960004     1.03   0.301    -.0890268    .2875153
             exp |   -.003026   .0435204    -0.07   0.945     -.088376     .082324
        children |   .0020287   .0091417     0.22   0.824    -.0158995    .0199569
         primary |   .0007511   .0017191     0.44   0.662    -.0026203    .0041226
        tertiary |  -.0314402   .0358418    -0.88   0.380    -.1017313     .038851
           _cons |    .006766     .11693     0.06   0.954    -.2225512    .2360832
    ------------------------------------------------------------------------------
    exp: expft expft2
Compute the industry decomposition by Horrace and Oaxaca (2001):
    . matrix coefs = e(b0)

    . matrix I = J(3,1,.)

    . matrix rownames I = Primary Secondary Tertiary

    . matrix I[2,1] = _b[overall:unexplained] - _b[unexplained:primary] - _b[unexplained:tertiary]

    . matrix I[1,1] = I[2,1] + (coefs[1,"b1:primary"] - coefs[1,"b2:primary"])

    . matrix I[3,1] = I[2,1] + (coefs[1,"b1:tertiary"] - coefs[1,"b2:tertiary"])

    . matrix list I

    I[3,1]
                      c1
      Primary  .20048314
    Secondary  .10501302
     Tertiary  .06756791
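Written as formulas, the matrix commands above compute the following. This is a sketch in our own notation, which need not match the notation of Horrace and Oaxaca (2001): $\hat\alpha_g$ are the intercepts, $\hat\beta_{gv}$ the coefficients of the non-sector covariates $v$ (yeduc, expft, expft2, children), $\hat\gamma_{gj}$ the coefficients of the sector dummies relative to the secondary sector, and $\bar x_{2v}$ the women's means.

```latex
\begin{align*}
I_{\text{secondary}} &= (\hat\alpha_1 - \hat\alpha_2)
   + \sum_{v} \bar x_{2v}\,(\hat\beta_{1v} - \hat\beta_{2v}), \\
I_{j} &= I_{\text{secondary}} + (\hat\gamma_{1j} - \hat\gamma_{2j}),
   \qquad j \in \{\text{primary}, \text{tertiary}\}.
\end{align*}
```

The first line equals the total unexplained part minus the unexplained contributions of the two sector dummies, which is how I[2,1] is obtained. Each $I_j$ can be read as the gender wage gap within sector $j$ for a person with the women's overall mean characteristics.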
Compute the contributions of the sectors to the unexplained wage gap according to Fortin et al. (2011):
    . svy, subpop(if sex==2): mean primary secondary tertiary if lnwage<.
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =    15          Number of obs   =      5,408
    Number of PSUs   = 2,028          Population size = 12,021,927
                                      Subpop. no. obs =      2,786
                                      Subpop. size    =  5,761,127
                                      Design df       =      2,013

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   std. err.     [95% conf. interval]
    -------------+------------------------------------------------
         primary |   .0078677    .003227       .001539    .0141964
       secondary |   .1524993   .0110819      .1307662    .1742324
        tertiary |    .839633   .0113178      .8174372    .8618289
    --------------------------------------------------------------

    . matrix p = e(b)

    . local S = p[1,1]*I[1,1] + p[1,2]*I[2,1] + p[1,3]*I[3,1]

    . display `S' // equal to total unexplained
    .074324

    . matrix I[1,1] = p[1,1]*I[1,1] / `S'

    . matrix I[2,1] = p[1,2]*I[2,1] / `S'

    . matrix I[3,1] = p[1,3]*I[3,1] / `S'

    . matrix list I

    I[3,1]
                      c1
      Primary   .0212224
    Secondary  .21546758
     Tertiary  .76331002
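The rescaling can be summarized as follows. This is a sketch in our own notation, describing what the commands above compute rather than quoting Fortin et al. (2011): $p_{2j}$ is the share of women in sector $j$, $I_j$ is the sector-specific gap computed before, $I_{\text{ref}}$ is the gap in the base sector (secondary), $\hat\gamma_{gj}$ are the sector coefficients of group $g$, and $U$ is the total unexplained part.

```latex
\begin{align*}
\sum_j p_{2j}\,I_j
  &= I_{\text{ref}} \sum_j p_{2j}
     + \sum_{j \neq \text{ref}} p_{2j}\,(\hat\gamma_{1j} - \hat\gamma_{2j})
   = U, \\
s_j &= \frac{p_{2j}\,I_j}{\sum_k p_{2k}\,I_k}, \qquad \sum_j s_j = 1 .
\end{align*}
```

The first equality uses $I_j = I_{\text{ref}} + (\hat\gamma_{1j} - \hat\gamma_{2j})$; the second uses $\sum_j p_{2j} = 1$ together with the fact that $I_{\text{ref}}$ is $U$ minus the unexplained contributions of the sector dummies. The shares $s_j$ therefore split the unexplained gap across sectors in proportion to sector size times sector-specific gap.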
We can also use nlcom to compute these decompositions. The advantage is that
in this way we will also get standard errors and confidence intervals. The
procedure goes as follows. Apart from the decomposition results, oaxaca also
returns the underlying regression coefficients and means as well as their
joint variance matrix in e(b0) and e(V0).
    . oaxaca lnwage yeduc (exp: expft expft2) children primary tertiary, ///
    >     by(sex) weight(1) svy
    (output omitted)

    . matrix list e(b0)

    e(b0)[1,35]
                b1:         b1:         b1:         b1:         b1:         b1:         b1:         b2:
             yeduc       expft      expft2    children     primary    tertiary       _cons       yeduc
    r1   .08898211   .02782255  -.00038685    .0444333  -.21373931  -.16996766   1.4778099   .08131684

                b2:         b2:         b2:         b2:         b2:         b2:      b_ref:      b_ref:
             expft      expft2    children     primary    tertiary       _cons       yeduc       expft
    r1   .02953905  -.00046565   .04064076  -.30920943  -.13252255   1.4710439   .08898211   .02782255

             b_ref:      b_ref:      b_ref:      b_ref:      b_ref:         x1:         x1:         x1:
            expft2    children     primary    tertiary       _cons       yeduc       expft      expft2
    r1  -.00038685    .0444333  -.21373931  -.16996766   1.4778099   12.723427   17.343701   401.51787

                x1:         x1:         x1:         x1:         x2:         x2:         x2:         x2:
          children     primary    tertiary       _cons       yeduc       expft      expft2    children
    r1   .57997534    .0214336   .53383213           1   12.947265   10.885952   198.72516   .53492034

                x2:         x2:         x2:
           primary    tertiary       _cons
    r1   .00786766   .83963304           1
To be able to apply nlcom to these results, we first need to post the results
as a new estimation set using ereturn post:
    . matrix b = e(b0)

    . matrix V = e(V0)

    . ereturn post b V
We can now piece together the expressions to be submitted to nlcom. To reduce
writing, we use a loop over the variables to collect the elements that are
part of each expression:
    . local ref (_b[b1:_cons]-_b[b2:_cons])

    . foreach v in yeduc expft expft2 children {
          local ref `ref' + (_b[b1:`v']-_b[b2:`v'])*_b[x2:`v']
      }

    . nlcom (Primary:   `ref' + (_b[b1:primary]-_b[b2:primary]))   ///
    >       (Secondary: `ref')                                     ///
    >       (Tertiary:  `ref' + (_b[b1:tertiary]-_b[b2:tertiary])) ///
    >       , noheader

    ------------------------------------------------------------------------------
                 | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         Primary |   .2004831   .2123707     0.94   0.345    -.2157559    .6167221
       Secondary |    .105013   .0382853     2.74   0.006     .0299751    .1800509
        Tertiary |   .0675679   .0240717     2.81   0.005     .0203883    .1147475
    ------------------------------------------------------------------------------
Likewise, the rescaled variant as suggested by Fortin et al. (2011) can be computed as follows:
    . local ref (_b[b1:_cons]-_b[b2:_cons])

    . foreach v in yeduc expft expft2 children {
          local ref `ref' + (_b[b1:`v']-_b[b2:`v'])*_b[x2:`v']
      }

    . local p1 (_b[x2:primary])

    . local p2 (1 - _b[x2:primary] - _b[x2:tertiary])

    . local p3 (_b[x2:tertiary])

    . local primary   (`ref' + (_b[b1:primary]-_b[b2:primary])) * (`p1')

    . local secondary (`ref') * (`p2')

    . local tertiary  (`ref' + (_b[b1:tertiary]-_b[b2:tertiary])) * (`p3')

    . local sum (`primary' + `secondary' + `tertiary')

    . nlcom (Primary:   `primary'/`sum')   ///
    >       (Secondary: `secondary'/`sum') ///
    >       (Tertiary:  `tertiary'/`sum')  ///
    >       , noheader

    ------------------------------------------------------------------------------
                 | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         Primary |   .0212224    .024191     0.88   0.380     -.026191    .0686358
       Secondary |   .2154676   .0812562     2.65   0.008     .0562083    .3747268
        Tertiary |     .76331   .0854742     8.93   0.000     .5957837    .9308363
    ------------------------------------------------------------------------------