Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024
Required packages (install using the command ssc install): fre, oaxaca, estout
In the lecture, a decomposition of the gender wage gap (log wages) was
shown using data from the GSOEP, where schooling and experience (quadratic
effect) served as covariates (see slide 18 for data preparation; slides
28/29 for the decomposition). The data can be found on Ilias
(gsoep-extract.dta). Using the same data setup, do the following.
Extend the X variables of the model by tenure (number of years
working for the current employer; variable tenure), the
occupational status as measured by the international socio-economic index
(isei), and the number of children under the age of 16 in the
household (children). Also take account of the survey design,
i.e. primary sampling units (psu), sampling weights (weight), and
strata (strata).
Data preparation (from slides):
. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate lnwage = ln(wage)
(1,709 missing values generated)

. generate expft2 = expft^2
(35 missing values generated)

. summarize wage lnwage yeduc expft expft2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      5,600    17.57278    9.858855       3.03     121.42
      lnwage |      5,600    2.736721    .5062968   1.108563   4.799255
       yeduc |      7,121    12.28823    2.783974          7         18
       expft |      7,274    11.63359    9.556508          0       39.5
      expft2 |      7,274    226.6548    293.3739          0    1560.25

. drop if missing(sex, lnwage, yeduc, expft)
(1,847 observations deleted)
Additional variables: Tenure, ISEI (International Socio-Economic Index of Occupational Status), and number of children
. fre tenure, t(5)

tenure -- number of years with firm
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   0     |        118       2.16       2.16       2.16
        .25   |        238       4.36       4.36       6.52
        .5    |        187       3.42       3.42       9.94
        .75   |        169       3.09       3.09      13.04
        1     |        158       2.89       2.89      15.93
        :     |          :          :          :          :
        38.5  |          2       0.04       0.04      99.91
        38.75 |          2       0.04       0.04      99.95
        39.5  |          1       0.02       0.02      99.96
        39.75 |          1       0.02       0.02      99.98
        40.5  |          1       0.02       0.02     100.00
        Total |       5461      99.98     100.00
Missing .     |          1       0.02
Total         |       5462     100.00
-----------------------------------------------------------

. fre isei, t(5)

isei -- international socio-economic index
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   16    |        193       3.53       3.55       3.55
        19    |         26       0.48       0.48       4.03
        20    |         85       1.56       1.56       5.59
        21    |         16       0.29       0.29       5.89
        22    |          5       0.09       0.09       5.98
        :     |          :          :          :          :
        83    |          5       0.09       0.09      98.20
        85    |         21       0.38       0.39      98.58
        87    |         11       0.20       0.20      98.79
        88    |         61       1.12       1.12      99.91
        90    |          5       0.09       0.09     100.00
        Total |       5434      99.49     100.00
Missing .     |         28       0.51
Total         |       5462     100.00
-----------------------------------------------------------

. fre children

children -- number of children (age<16) in HH
--------------------------------------------------------------------
                       |      Freq.    Percent      Valid       Cum.
-----------------------+--------------------------------------------
Valid   0              |       2497      45.72      45.72      45.72
        1              |       1170      21.42      21.42      67.14
        2              |       1142      20.91      20.91      88.04
        3              |        494       9.04       9.04      97.09
        4 four or more |        159       2.91       2.91     100.00
        Total          |       5462     100.00     100.00
--------------------------------------------------------------------
To make sure that the different decompositions below all use the same observations, we remove the observations for which any of the additional variables is missing.
. drop if missing(tenure, isei, children)
(29 observations deleted)
Finally, we declare the survey design of the data:
. svyset psu [pw=weight], strata(strata)

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

. svydes

Survey: Describing stage 1 sampling units

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

                                    Number of obs per unit
 Stratum   # units     # obs       Min      Mean       Max
----------------------------------------------------------
       1        40        96         1       2.4         7
       2       198       697         1       3.5        18
       6       112       446         1       4.0        12
       7       100       267         1       2.7        12
       8        82       179         1       2.2         9
      10       133       264         1       2.0         7
      11       102       126         1       1.2         2
      12       206       521         1       2.5        13
      13        10        14         1       1.4         3
      14       294       663         1       2.3        12
      18        31        43         1       1.4         3
      19       123       295         1       2.4         9
      21       134       714         1       5.3        17
      22       185       505         1       2.7        10
      23       285       603         1       2.1         8
----------------------------------------------------------
      15     2,035     5,433         1       2.7        18
There are 5433 respondents and 2035 primary sampling units.
Compute the aggregate and detailed Oaxaca-Blinder decomposition. How do the results change compared to the specification used in the lecture? (How does the survey design change the results? How do the additional variables change the results? Make sure to use the same estimation sample when comparing results.)
We first take a look at how the survey design changes the results.
. oaxaca lnwage yeduc (experience: expft expft2), by(sex) weight(1)

Blinder-Oaxaca decomposition                    Number of obs     =      5,433
                                                  Model           =     linear
Group 1: sex = 1                                  N of obs 1      =      2,623
Group 2: sex = 2                                  N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
      lnwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.860859    .009864   290.03   0.000     2.841526    2.880192
     group_2 |   2.627908   .0090345   290.87   0.000     2.610201    2.645616
  difference |   .2329504   .0133761    17.42   0.000     .2067337    .2591671
   explained |   .1340805    .011184    11.99   0.000     .1121602    .1560008
 unexplained |   .0988699   .0136861     7.22   0.000     .0720457    .1256942
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0232991    .006906    -3.37   0.001    -.0368347   -.0097636
  experience |   .1573796    .009201    17.10   0.000     .1393461    .1754131
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .1134257   .0520215     2.18   0.029     .0114655    .2153859
  experience |   .0931918   .0227956     4.09   0.000     .0485132    .1378705
       _cons |  -.1077476   .0609511    -1.77   0.077    -.2272096    .0117143
------------------------------------------------------------------------------
experience: expft expft2

. estimates store original
. oaxaca lnwage yeduc (experience: expft expft2) [pw=weight], by(sex) weight(1)

Blinder-Oaxaca decomposition                    Number of obs     =      5,433
                                                  Model           =     linear
Group 1: sex = 1                                  N of obs 1      =      2,623
Group 2: sex = 2                                  N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
      lnwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |     2.8656   .0160623   178.41   0.000     2.834118    2.897081
     group_2 |   2.659247   .0148058   179.61   0.000     2.630229    2.688266
  difference |   .2063522   .0218451     9.45   0.000     .1635366    .2491678
   explained |   .0907737   .0158743     5.72   0.000     .0596607    .1218868
 unexplained |   .1155785   .0223764     5.17   0.000     .0717216    .1594355
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0179028   .0110025    -1.63   0.104    -.0394673    .0036617
  experience |   .1086765   .0133797     8.12   0.000     .0824528    .1349002
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0448754   .0990161     0.45   0.650    -.1491927    .2389434
  experience |    .037495   .0442818     0.85   0.397    -.0492958    .1242858
       _cons |   .0332081   .1202685     0.28   0.782    -.2025137      .26893
------------------------------------------------------------------------------
experience: expft expft2

. estimates store weighted
. oaxaca lnwage yeduc (experience: expft expft2), by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                         Number of obs   =      5,433
Number of PSUs   = 2,035                         Population size = 12,070,768
                                                 Design df       =      2,020
                                                 Model           =     linear
Group 1: sex = 1                                 N of obs 1      =      2,623
Group 2: sex = 2                                 N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |     2.8656   .0163597   175.16   0.000     2.833516    2.897683
     group_2 |   2.659247   .0150168   177.08   0.000     2.629797    2.688698
  difference |   .2063522   .0205755    10.03   0.000     .1660009    .2467036
   explained |   .0907737   .0152699     5.94   0.000     .0608274      .12072
 unexplained |   .1155785   .0213793     5.41   0.000     .0736507    .1575063
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0179028   .0097531    -1.84   0.067    -.0370299    .0012243
  experience |   .1086765   .0130769     8.31   0.000     .0830309    .1343221
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0448754   .0978046     0.46   0.646    -.1469331    .2366838
  experience |    .037495   .0431012     0.87   0.384    -.0470325    .1220225
       _cons |   .0332081   .1192909     0.28   0.781     -.200738    .2671543
------------------------------------------------------------------------------
experience: expft expft2

. estimates store svy
. esttab original weighted svy, mtitles se nostar wide

------------------------------------------------------------------------------------------
                      (1)                       (2)                       (3)
                 original                  weighted                       svy
------------------------------------------------------------------------------------------
overall
group_1             2.861    (0.00986)        2.866     (0.0161)        2.866     (0.0164)
group_2             2.628    (0.00903)        2.659     (0.0148)        2.659     (0.0150)
difference          0.233     (0.0134)        0.206     (0.0218)        0.206     (0.0206)
explained           0.134     (0.0112)       0.0908     (0.0159)       0.0908     (0.0153)
unexplained        0.0989     (0.0137)        0.116     (0.0224)        0.116     (0.0214)
------------------------------------------------------------------------------------------
explained
yeduc             -0.0233    (0.00691)      -0.0179     (0.0110)      -0.0179    (0.00975)
experience          0.157    (0.00920)        0.109     (0.0134)        0.109     (0.0131)
------------------------------------------------------------------------------------------
unexplained
yeduc               0.113     (0.0520)       0.0449     (0.0990)       0.0449     (0.0978)
experience         0.0932     (0.0228)       0.0375     (0.0443)       0.0375     (0.0431)
_cons              -0.108     (0.0610)       0.0332      (0.120)       0.0332      (0.119)
------------------------------------------------------------------------------------------
N                    5433                      5433                      5433
------------------------------------------------------------------------------------------
Standard errors in parentheses
We see that the sampling weights change the results quite a bit: the overall wage gap and the explained part both decrease, and the estimates are less precise (larger standard errors). Additionally taking clustering and stratification into account does not change the point estimates (something would be wrong if it did), but the standard errors differ somewhat. In this example the effect of clustering and stratification on the standard errors is small. This is not generally the case; in many situations clustering and stratification make a big difference.
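The same pattern can be checked on a single regression. The following is only a sketch: it assumes the prepared data are in memory and have been svyset as above, and the restriction to men (sex==1) as well as the names pw_only and full_svy are arbitrary choices made for illustration.

```stata
* Sketch: assumes the prepared GSOEP extract is in memory and svyset as above.

* Weighted regression for men; clustering and stratification are ignored
regress lnwage yeduc expft expft2 [pw=weight] if sex==1
estimates store pw_only        // hypothetical name

* Same model using the full survey design (weights, PSUs, strata)
svy, subpop(if sex==1): regress lnwage yeduc expft expft2
estimates store full_svy       // hypothetical name

* Put coefficients and standard errors side by side
estimates table pw_only full_svy, se
```

The coefficients in the two columns should coincide, and only the standard errors should differ.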
Now let's include the additional predictors (tenure, isei, and children):
. oaxaca lnwage yeduc (experience: expft expft2) tenure isei children, ///
>     by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                         Number of obs   =      5,433
Number of PSUs   = 2,035                         Population size = 12,070,768
                                                 Design df       =      2,020
                                                 Model           =     linear
Group 1: sex = 1                                 N of obs 1      =      2,623
Group 2: sex = 2                                 N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |     2.8656    .016217   176.70   0.000     2.833796    2.897403
     group_2 |   2.659247   .0150177   177.07   0.000     2.629796    2.688699
  difference |   .2063522   .0203571    10.14   0.000     .1664292    .2462752
   explained |   .0736621   .0169573     4.34   0.000     .0404064    .1069177
 unexplained |   .1326902   .0204745     6.48   0.000     .0925367    .1728436
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0090614    .005105    -1.78   0.076     -.019073    .0009502
  experience |   .0575503   .0134144     4.29   0.000     .0312429    .0838578
      tenure |   .0202842   .0062951     3.22   0.001     .0079387    .0326297
        isei |   .0030606   .0069344     0.44   0.659    -.0105388      .01666
    children |   .0018283   .0011643     1.57   0.117    -.0004551    .0041117
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0708846   .1147893    -0.62   0.537    -.2960024    .1542331
  experience |   .0085986   .0446817     0.19   0.847    -.0790284    .0962257
      tenure |   .0144943   .0185398     0.78   0.434    -.0218648    .0508534
        isei |   .0621998   .0611112     1.02   0.309    -.0576477    .1820473
    children |   .0031122   .0084482     0.37   0.713    -.0134559    .0196803
       _cons |   .1151698   .1112147     1.04   0.301    -.1029377    .3332773
------------------------------------------------------------------------------
experience: expft expft2

. estimates store extended
Comparison of results:
. esttab svy extended, mtitles se nostar wide

----------------------------------------------------------------
                      (1)                       (2)
                      svy                  extended
----------------------------------------------------------------
overall
group_1             2.866     (0.0164)        2.866     (0.0162)
group_2             2.659     (0.0150)        2.659     (0.0150)
difference          0.206     (0.0206)        0.206     (0.0204)
explained          0.0908     (0.0153)       0.0737     (0.0170)
unexplained         0.116     (0.0214)        0.133     (0.0205)
----------------------------------------------------------------
explained
yeduc             -0.0179    (0.00975)     -0.00906    (0.00510)
experience          0.109     (0.0131)       0.0576     (0.0134)
tenure                                       0.0203    (0.00630)
isei                                        0.00306    (0.00693)
children                                    0.00183    (0.00116)
----------------------------------------------------------------
unexplained
yeduc              0.0449     (0.0978)      -0.0709      (0.115)
experience         0.0375     (0.0431)      0.00860     (0.0447)
tenure                                       0.0145     (0.0185)
isei                                         0.0622     (0.0611)
children                                    0.00311    (0.00845)
_cons              0.0332      (0.119)        0.115      (0.111)
----------------------------------------------------------------
N                    5433                      5433
----------------------------------------------------------------
Standard errors in parentheses
Something strange happened. The explained part declined even though we included more predictors! How can that be?
In principle, this is perfectly fine because the additional covariates may have effects such that the unexplained wage gap increases once we control for them (for example, if we include a covariate that has a positive effect on wages and for which women have higher values on average than men, such that its contribution to the explained part is negative). But in the current case this does not seem to make much sense because the added variables have positive contributions to the explained part. It is thus a bit puzzling why the explained part decreases.
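For reference, the decomposition stated in the header of the oaxaca output (explained: (X1 - X2) * b1; unexplained: X2 * (b1 - b2)) can be written out term by term. The notation below (bars for group means, hats for estimated coefficients, index k for covariates) is introduced here only for illustration.

```latex
\begin{aligned}
\bar{Y}_1 - \bar{Y}_2
  &= \underbrace{(\bar{X}_1 - \bar{X}_2)\,\hat{\beta}_1}_{\text{explained}}
   + \underbrace{\bar{X}_2\,(\hat{\beta}_1 - \hat{\beta}_2)}_{\text{unexplained}} \\
\text{explained}
  &= \sum_k (\bar{X}_{1k} - \bar{X}_{2k})\,\hat{\beta}_{1k}
\end{aligned}
```

Each term of the sum is negative whenever the mean difference and the group 1 coefficient have opposite signs. Adding covariates also changes the coefficients of the covariates that were already in the model, so existing contributions can shrink even if every new contribution is positive; in the comparison above, the contribution of experience drops from 0.109 to 0.0576.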
A possible answer is that the extended model is misspecified (i.e., does
not fit the data very well). Let's try a more flexible model including
interaction terms between work experience and tenure. Since oaxaca
does not support factor variable notation (oaxaca was written before
factor variable notation was introduced to Stata), we have to build the
interaction terms manually:
. generate e = expft
. generate ee = e*e
. generate t = tenure
. generate et = e*t
. generate eet = e*e*t
. generate tt = t*t
. generate ett = e*t*t
. generate eett = e*e*t*t
Check whether we did it right by comparing a regression including the manually generated terms with a regression using factor variable notation:
. regress lnwage yeduc e ee t et eet tt ett eett

      Source |       SS           df       MS      Number of obs   =     5,433
-------------+----------------------------------   F(9, 5423)      =    347.90
       Model |  507.344421         9  56.3716023   Prob > F        =    0.0000
    Residual |  878.697694     5,423   .16203166   R-squared       =    0.3660
-------------+----------------------------------   Adj R-squared   =    0.3650
       Total |  1386.04211     5,432  .255162392   Root MSE        =    .40253

------------------------------------------------------------------------------
      lnwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0803957   .0020303    39.60   0.000     .0764155    .0843759
           e |   .0306852   .0036041     8.51   0.000     .0236197    .0377506
          ee |  -.0006276   .0001224    -5.13   0.000    -.0008676   -.0003875
           t |   .0206686   .0055529     3.72   0.000     .0097827    .0315544
          et |   .0009722   .0007335     1.33   0.185    -.0004656    .0024101
         eet |  -.0000329    .000021    -1.57   0.117     -.000074    8.28e-06
          tt |  -.0003168    .000226    -1.40   0.161    -.0007599    .0001262
         ett |  -.0000401   .0000271    -1.48   0.139    -.0000932     .000013
        eett |   1.48e-06   7.05e-07     2.09   0.036     9.47e-08    2.86e-06
       _cons |   1.326023   .0329987    40.18   0.000     1.261332    1.390713
------------------------------------------------------------------------------

. regress lnwage yeduc c.expft##c.expft##c.tenure##c.tenure, vsquish

      Source |       SS           df       MS      Number of obs   =     5,433
-------------+----------------------------------   F(9, 5423)      =    347.90
       Model |  507.344421         9  56.3716023   Prob > F        =    0.0000
    Residual |  878.697694     5,423   .16203166   R-squared       =    0.3660
-------------+----------------------------------   Adj R-squared   =    0.3650
       Total |  1386.04211     5,432  .255162392   Root MSE        =    .40253

---------------------------------------------------------------------------------------------------
                           lnwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
----------------------------------+----------------------------------------------------------------
                            yeduc |   .0803957   .0020303    39.60   0.000     .0764155    .0843759
                            expft |   .0306852   .0036041     8.51   0.000     .0236197    .0377506
                  c.expft#c.expft |  -.0006276   .0001224    -5.13   0.000    -.0008676   -.0003875
                           tenure |   .0206686   .0055529     3.72   0.000     .0097827    .0315544
                 c.expft#c.tenure |   .0009722   .0007335     1.33   0.185    -.0004656    .0024101
         c.expft#c.expft#c.tenure |  -.0000329    .000021    -1.57   0.117     -.000074    8.28e-06
                c.tenure#c.tenure |  -.0003168    .000226    -1.40   0.161    -.0007599    .0001262
        c.expft#c.tenure#c.tenure |  -.0000401   .0000271    -1.48   0.139    -.0000932     .000013
c.expft#c.expft#c.tenure#c.tenure |   1.48e-06   7.05e-07     2.09   0.036     9.47e-08    2.86e-06
                            _cons |   1.326023   .0329987    40.18   0.000     1.261332    1.390713
---------------------------------------------------------------------------------------------------
Seems ok. The results with the manual interaction terms are identical to the results using factor notation.
Now we run the decomposition using the flexible specification:
. oaxaca lnwage yeduc e ee t et eet tt ett eett isei children, ///
>     by(sex) weight(1) svy nodetail

Blinder-Oaxaca decomposition

Number of strata =    15                         Number of obs   =      5,433
Number of PSUs   = 2,035                         Population size = 12,070,768
                                                 Design df       =      2,020
                                                 Model           =     linear
Group 1: sex = 1                                 N of obs 1      =      2,623
Group 2: sex = 2                                 N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |     2.8656   .0162186   176.69   0.000     2.833793    2.897407
     group_2 |   2.659247   .0148757   178.76   0.000     2.630074    2.688421
  difference |   .2063522    .020338    10.15   0.000     .1664665     .246238
   explained |   .0883107   .0172795     5.11   0.000     .0544231    .1221982
 unexplained |   .1180416   .0204656     5.77   0.000     .0779056    .1581775
------------------------------------------------------------------------------
The results are now much closer to the results without tenure and isei
(the explained part increased from 0.0737 to 0.0883, which is close to the
original 0.0908). That is, it seems that including tenure and work
experience in the same model makes things difficult. How could we improve
the model, while still keeping it simple? Some options:
use experience - tenure instead of experience, so that there is no overlap
between experience and tenure (i.e., define experience as the work
experience prior to starting the current job)
The last option seems attractive, so let's try:
. generate exp = expft - tenure
. generate exp2 = exp*exp

. oaxaca lnwage yeduc (experience: exp exp2) tenure isei children, ///
>     by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                         Number of obs   =      5,433
Number of PSUs   = 2,035                         Population size = 12,070,768
                                                 Design df       =      2,020
                                                 Model           =     linear
Group 1: sex = 1                                 N of obs 1      =      2,623
Group 2: sex = 2                                 N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |     2.8656   .0161162   177.81   0.000     2.833994    2.897206
     group_2 |   2.659247   .0150778   176.37   0.000     2.629678    2.688817
  difference |   .2063522   .0203308    10.15   0.000     .1664807    .2462237
   explained |   .0959585   .0202979     4.73   0.000     .0561516    .1357655
 unexplained |   .1103937   .0239471     4.61   0.000       .06343    .1573574
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0093282   .0052366    -1.78   0.075     -.019598    .0009415
  experience |   .0684668   .0159237     4.30   0.000     .0372382    .0996953
      tenure |    .031472   .0093097     3.38   0.001     .0132143    .0497297
        isei |   .0030627   .0069391     0.44   0.659    -.0105459    .0166713
    children |   .0022853    .001386     1.65   0.099    -.0004328    .0050035
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0613622   .1143255    -0.54   0.592    -.2855703    .1628459
  experience |  -.0255643   .0093875    -2.72   0.007    -.0439745    -.007154
      tenure |  -.0008773   .0189044    -0.05   0.963    -.0379514    .0361968
        isei |   .0564149   .0605004     0.93   0.351    -.0622349    .1750646
    children |   .0061459   .0081202     0.76   0.449     -.009779    .0220708
       _cons |   .1356367   .1059397     1.28   0.201    -.0721259    .3433992
------------------------------------------------------------------------------
experience: exp exp2
Not so bad. The explained part is now .0960. However, looking at the new experience variable reveals some serious problems:
. sum exp, det

                             exp
-------------------------------------------------------------
      Percentiles      Smallest
 1%       -19.25          -35.5
 5%        -10.5            -34
10%        -5.25         -30.75       Obs               5,433
25%         -.75         -29.25       Sum of wgt.       5,433

50%            3                      Mean           4.374747
                        Largest       Std. dev.      9.402896
75%          9.5           33.5
90%         17.5          33.75       Variance       88.41445
95%           22          33.75       Skewness       .2298359
99%         28.5             34       Kurtosis       3.733868
At least a quarter of all observations have negative values! This happens if full-time experience is lower than tenure. There are two reasons for that: (a) tenure also captures part-time experience; (b) people report length with current employer without taking into account employment interruptions due to, for example, parenting. Both reasons are more likely among women than among men.
. bysort sex: sum exp, det

----------------------------------------------------------------------------------------------------
-> sex = male

                             exp
-------------------------------------------------------------
      Percentiles      Smallest
 1%        -6.75         -24.25
 5%        -3.25         -20.75
10%         -1.5          -20.5       Obs               2,623
25%          .75            -20       Sum of wgt.       2,623

50%          5.5                      Mean           7.572531
                        Largest       Std. dev.      8.666434
75%        13.25          32.75
90%        20.75          32.75       Variance       75.10708
95%         24.5           33.5       Skewness       .6481019
99%         29.5          33.75       Kurtosis       2.962962

----------------------------------------------------------------------------------------------------
-> sex = female

                             exp
-------------------------------------------------------------
      Percentiles      Smallest
 1%       -22.75          -35.5
 5%        -14.5            -34
10%       -9.625         -30.75       Obs               2,810
25%           -3         -29.25       Sum of wgt.       2,810

50%          .75                      Mean           1.389769
                        Largest       Std. dev.      9.077742
75%            6          30.75
90%           13           31.5       Variance       82.40541
95%        17.75          33.75       Skewness       .0497506
99%           25             34       Kurtosis       4.030786
As can be seen, the negative values in exp are more frequent among women
than among men. Finding a good model that includes tenure would require
more careful considerations (e.g., also taking account of part-time work
experience, etc.). Against this background we will drop tenure from our
model.
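The share of negative values by sex can also be tabulated directly instead of being read off the percentiles. This is a minimal sketch that assumes the data from the previous step (including exp = expft - tenure and the svyset declaration) are still in memory; negexp is a hypothetical helper variable.

```stata
* Sketch: assumes exp = expft - tenure exists and the data are svyset.
generate byte negexp = (exp < 0) if !missing(exp)   // 1 = negative "prior" experience

* Unweighted shares within each sex
tabulate sex negexp, row

* Design-based shares within each sex
svy: proportion negexp, over(sex)
```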
We first reload the data to get observations with missing tenure back in.
. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate lnwage = ln(wage)
(1,709 missing values generated)

. generate expft2 = expft^2
(35 missing values generated)

. drop if missing(sex, lnwage, yeduc, expft, isei, children)
(1,875 observations deleted)

. svyset psu [pw=weight], strata(strata)

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>
Our final model is then as follows:
. oaxaca lnwage yeduc (experience: expft expft2) isei children, ///
>     by(sex) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                         Number of obs   =      5,434
Number of PSUs   = 2,035                         Population size = 12,071,607
                                                 Design df       =      2,020
                                                 Model           =     linear
Group 1: sex = 1                                 N of obs 1      =      2,624
Group 2: sex = 2                                 N of obs 2      =      2,810

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   2.865592   .0162802   176.02   0.000     2.833665     2.89752
     group_2 |   2.659247   .0151807   175.17   0.000     2.629476    2.689019
  difference |    .206345   .0205365    10.05   0.000     .1660701    .2466199
   explained |   .1036131    .016076     6.45   0.000     .0720858    .1351404
 unexplained |   .1027319   .0204223     5.03   0.000      .062681    .1427829
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0092558   .0052244    -1.77   0.077    -.0195016      .00099
  experience |   .1076302   .0126921     8.48   0.000     .0827391    .1325212
        isei |   .0033122   .0075391     0.44   0.660     -.011473    .0180973
    children |   .0019265   .0012233     1.57   0.115    -.0004725    .0043256
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0533151     .12035    -0.44   0.658    -.2893382     .182708
  experience |   .0274948   .0429278     0.64   0.522    -.0566927    .1116822
        isei |   .0863664   .0618819     1.40   0.163    -.0349926    .2077254
    children |   .0012029   .0089191     0.13   0.893    -.0162887    .0186946
       _cons |   .0409829   .1149839     0.36   0.722    -.1845165    .2664824
------------------------------------------------------------------------------
experience: expft expft2
Confirm the results returned by oaxaca (for the extended decomposition
including survey design) by computing the aggregate Oaxaca-Blinder
decomposition "by hand" (that is, estimate the means of the variables and
the regression coefficients and then compute the decomposition from these
outputs, and not using oaxaca). Also compute the contribution of schooling
to the "explained" part and the "unexplained" part by hand.
Computation by hand using matrices (of course, you can also do the computations by copying the individual values, as was done on the slides):
(regress and mean store their results in the so-called e()-returns; for
example, the coefficients/point estimates are stored in row vector e(b))
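To see what the e()-returns look like, one can inspect them after any estimation command. The sketch below uses Stata's shipped auto dataset rather than the course data, purely as an illustration; the model itself is arbitrary.

```stata
* Sketch on Stata's example data (not the GSOEP extract)
sysuse auto, clear
regress price mpg weight

ereturn list                          // overview of everything stored in e()
matrix b = e(b)                       // copy the coefficient row vector
matrix list b                         // columns are named after the covariates

display el(b, 1, colnumb(b, "mpg"))   // pick one coefficient by column name
display _b[mpg]                       // same value via _b[]
```

Copying e(b) into a named matrix right after each command matters because the next estimation command overwrites the e()-returns.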
. regress lnwage yeduc expft expft2 isei children [pw=weight] if sex==1
(sum of wgt is 6,262,116.0955229)

Linear regression                               Number of obs     =      2,624
                                                F(5, 2618)        =     106.16
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3610
                                                Root MSE          =     .38708

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .0428144   .0069538     6.16   0.000     .0291788      .05645
       expft |   .0304244   .0061218     4.97   0.000     .0184204    .0424284
      expft2 |  -.0004347   .0001621    -2.68   0.007    -.0007526   -.0001168
        isei |   .0105099   .0009001    11.68   0.000      .008745    .0122748
    children |    .041738   .0117768     3.54   0.000     .0186453    .0648308
       _cons |   1.448366   .0932229    15.54   0.000     1.265568    1.631164
------------------------------------------------------------------------------

. matrix b_m = e(b)

. regress lnwage yeduc expft expft2 isei children [pw=weight] if sex==2
(sum of wgt is 5,809,491.4026964)

Linear regression                               Number of obs     =      2,810
                                                F(5, 2804)        =     116.11
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3369
                                                Root MSE          =      .3882

------------------------------------------------------------------------------
             |               Robust
      lnwage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |    .046936   .0064404     7.29   0.000     .0343076    .0595645
       expft |   .0279263   .0047103     5.93   0.000     .0186904    .0371623
      expft2 |  -.0004368   .0001416    -3.08   0.002    -.0007144   -.0001591
        isei |   .0086704   .0009551     9.08   0.000     .0067976    .0105432
    children |   .0394887    .012959     3.05   0.002     .0140786    .0648987
       _cons |   1.407383   .0695331    20.24   0.000     1.271042    1.543724
------------------------------------------------------------------------------

. matrix b_f = e(b)

. mean yeduc expft expft2 isei children [pw=weight] if sex==1 & lnwage<.

Mean estimation                          Number of obs = 2,624

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |    12.7192   .0964361       12.5301     12.9083
       expft |   17.24699   .3373611      16.58547    17.90852
      expft2 |   398.1113   11.87768      374.8208    421.4019
        isei |   47.26468   .5466478      46.19278    48.33658
    children |   .5809327   .0242093      .5334614    .6284039
--------------------------------------------------------------

. matrix X_m = (e(b),1) // appending a "1" for the constant

. mean yeduc expft expft2 isei children [pw=weight] if sex==2 & lnwage<.

Mean estimation                          Number of obs = 2,810

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       yeduc |   12.93538   .0899995      12.75891    13.11186
       expft |   10.84406   .2788185      10.29735    11.39077
      expft2 |   197.5753   8.599448      180.7134    214.4372
        isei |   46.94953   .5395295      45.89162    48.00745
    children |   .5347746   .0228155      .4900377    .5795115
--------------------------------------------------------------

. matrix X_f = (e(b),1) // appending a "1" for the constant
. matrix D = X_m * b_m' - X_f * b_f'

. matrix list D

symmetric D[1,1]
           y1
y1  .20634502

. matrix E = (X_m - X_f) * b_m'

. matrix list E

symmetric E[1,1]
          y1
y1  .1036131

. matrix U = X_f * (b_m - b_f)'

. matrix list U

symmetric U[1,1]
           y1
y1  .10273192
. matrix E_schooling = (X_m[1,"yeduc"] - X_f[1,"yeduc"]) * b_m[1,"yeduc"]

. matrix list E_schooling

symmetric E_schooling[1,1]
        yeduc
y1  -.0092558

. matrix U_schooling = X_f[1,"yeduc"] * (b_m[1,"yeduc"] - b_f[1,"yeduc"])

. matrix list U_schooling

symmetric U_schooling[1,1]
        yeduc
y1  -.0533151
The results are the same as the ones returned by oaxaca above.
Note that the aggregate decomposition can also be computed by generating predictions from the models and then taking mean differences. That is, we can use so-called out-of-sample predictions to generate the counterfactual distribution of interest (the wages that females would get if they were paid like men):
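Written out, the prediction-based version takes weighted means of fitted values. The notation here is illustrative only: w_i are the sampling weights, m and f index the male and female samples, and x_i includes the constant.

```latex
\begin{aligned}
\text{explained}
  &= \frac{\sum_{i \in m} w_i\, x_i'\hat{\beta}_m}{\sum_{i \in m} w_i}
   - \frac{\sum_{i \in f} w_i\, x_i'\hat{\beta}_m}{\sum_{i \in f} w_i} \\
\text{unexplained}
  &= \frac{\sum_{i \in f} w_i\, x_i'\hat{\beta}_m}{\sum_{i \in f} w_i}
   - \frac{\sum_{i \in f} w_i\, x_i'\hat{\beta}_f}{\sum_{i \in f} w_i}
\end{aligned}
```

Because the predictions are linear in x_i, each weighted mean of fitted values equals the weighted mean of the covariates times the coefficient vector, which is why this route reproduces the matrix results.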
. regress lnwage yeduc expft expft2 isei children [pw=weight] if sex==1
  (output omitted)

. predict xb_m if e(sample)
(option xb assumed; fitted values)
(2,810 missing values generated)

. predict xb_fc if sex==2 & lnwage<. // apply male coefficients to female sample
(option xb assumed; fitted values)
(2,624 missing values generated)

. regress lnwage yeduc expft expft2 isei children [pw=weight] if sex==2
  (output omitted)

. predict xb_f if e(sample)
(option xb assumed; fitted values)
(2,624 missing values generated)

. summarize xb_m [aw=weight] // observed value for men

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
        xb_m |   2,624   6262116.1    2.865592   .2906358   2.044669   3.766038

. local xb_m = r(mean)

. summarize xb_fc [aw=weight] // counterfactual value for women if paid like men

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
       xb_fc |   2,810   5809491.4    2.761979   .3032661   1.916226    3.66645

. local xb_fc = r(mean)

. summarize xb_f [aw=weight] // observed value for women

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+-----------------------------------------------------------------
        xb_f |   2,810   5809491.4    2.659247   .2764587   1.874661    3.46033

. local xb_f = r(mean)

. display "difference: " `xb_m' - `xb_f'
difference: .20634503

. display "explained: " `xb_m' - `xb_fc'
explained: .1036131

. display "unexplained: " `xb_fc' - `xb_f'
unexplained: .10273192