Decomposition Methods in the Social Sciences

Solutions to Exercise 4: Nonlinear decomposition

Johannes Giesecke and Ben Jann, GESIS Training Course, January 29 – February 1, 2024

Required packages: fre, estout, oaxaca, nldecompose, fairlie

Extend the example analysis from the slides (4-nonlinear.pdf) by adding the X variables “locus of control” (LoC) and “willingness to take risk” (risk). Compute the aggregate and detailed decompositions using the Fairlie, Yun, and LPM decompositions for non-linear models and interpret the results.

Set the seed of the random number generator for the sake of reproducibility:

. set seed 5432334

Data preparation as on slides:

. use gsoep-extract, clear
(Example data based on the German Socio-Economic Panel)

. keep if wave==2015
(29,970 observations deleted)

. keep if inrange(age, 25, 55)
(5,671 observations deleted)

. generate byte male = sex==1

. generate byte female = 1 - male

. summarize supvis yeduc expft exppt male

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      supvis |      5,757    .2749696    .4465377          0          1
       yeduc |      7,121    12.28823    2.783974          7         18
       expft |      7,274    11.63359    9.556508          0       39.5
       exppt |      7,274    3.271481    5.052598          0      35.25
        male |      7,309    .4338487    .4956386          0          1

Additional predictors:

. // locus of control
. fre LoC, t(5)

LoC -- locus of control (1 int - 7 ext)
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   1     |         28       0.38       0.40       0.40
        1.1   |         21       0.29       0.30       0.70
        1.3   |         33       0.45       0.47       1.17
        1.4   |         66       0.90       0.94       2.11
        1.6   |         93       1.27       1.33       3.44
        :     |          :          :          :          :
        6.1   |          9       0.12       0.13      99.84
        6.3   |          7       0.10       0.10      99.94
        6.4   |          2       0.03       0.03      99.97
        6.6   |          1       0.01       0.01      99.99
        6.7   |          1       0.01       0.01     100.00
        Total |       7006      95.85     100.00           
Missing .     |        303       4.15                      
Total         |       7309     100.00                      
-----------------------------------------------------------

. // willingness to take risks
. fre risk

risk -- willingness to take risks (0-10)
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   0     |        302       4.13       4.14       4.14
        1     |        334       4.57       4.58       8.72
        2     |        735      10.06      10.08      18.80
        3     |        901      12.33      12.35      31.15
        4     |        697       9.54       9.56      40.70
        5     |       1357      18.57      18.60      59.31
        6     |        874      11.96      11.98      71.29
        7     |       1018      13.93      13.96      85.25
        8     |        686       9.39       9.40      94.65
        9     |        226       3.09       3.10      97.75
        10    |        164       2.24       2.25     100.00
        Total |       7294      99.79     100.00           
Missing .     |         15       0.21                      
Total         |       7309     100.00                      
-----------------------------------------------------------

. // summarize
. summarize LoC risk

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         LoC |      7,006    3.227362    .9334673          1        6.7
        risk |      7,294    4.882369    2.415302          0         10

Drop observations with missing values and set survey design:

. drop if missing(supvis, yeduc, expft, exppt, LoC, risk)
(1,898 observations deleted)

. svyset psu [pw=weight], strata(strata)

Sampling weights: weight
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

Run a logistic regression to check whether the added variables help predict who works as a supervisor:

. svy: logit supvis yeduc expft exppt LoC risk
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                          Number of obs   =      5,411
Number of PSUs   = 2,043                          Population size = 12,155,049
                                                  Design df       =      2,028
                                                  F(5, 2024)      =      17.15
                                                  Prob > F        =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |    .122624   .0190426     6.44   0.000     .0852789    .1599691
       expft |   .0253453   .0056269     4.50   0.000     .0143101    .0363805
       exppt |  -.0272412   .0127392    -2.14   0.033    -.0522244    -.002258
         LoC |   -.143153   .0580278    -2.47   0.014    -.2569533   -.0293527
        risk |   .0956502   .0237726     4.02   0.000      .049029    .1422714
       _cons |  -2.779065   .3890668    -7.14   0.000    -3.542077   -2.016053
------------------------------------------------------------------------------

Both variables appear to be relevant. As might be expected, people working as supervisors or in leadership positions are more willing to take risks (the coefficient on risk is positive). Moreover, their locus of control is less externally oriented than that of people who do not work as supervisors (the coefficient on LoC is negative).
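
Logit coefficients are on the log-odds scale, which makes their size hard to judge. As a minimal sketch that is not part of the original solution, the same model could be redisplayed as odds ratios and summarized as average marginal effects. Both steps use standard Stata commands and only the variables defined above:

. // same model, reported as odds ratios instead of log-odds coefficients
. svy: logit supvis yeduc expft exppt LoC risk, or

. // average marginal effects of the two added predictors on Pr(supvis=1)
. margins, dydx(LoC risk)

For orientation, the coefficient of .0957 on risk corresponds to an odds ratio of about exp(.0957) = 1.10 per scale point, and the coefficient of -.143 on LoC to an odds ratio of about 0.87.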

Now have a look at gender differences:

. svy: mean LoC risk, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =    15          Number of obs   =      5,411
Number of PSUs   = 2,043          Population size = 12,155,049
                                  Design df       =      2,028

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   c.LoC@sex |
       male  |   3.100191   .0306054       3.04017    3.160213
     female  |   3.159952   .0291557      3.102774     3.21713
             |
  c.risk@sex |
       male  |   5.382067   .0699371      5.244911    5.519223
     female  |   4.696685   .0735806      4.552384    4.840987
--------------------------------------------------------------

Women are slightly more externally oriented than men (mean LoC of 3.16 vs. 3.10) and considerably less willing to take risks (mean risk of 4.70 vs. 5.38). We thus might expect that these variables will "explain" at least some of the gender gap in supervision.
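
To judge whether these mean differences are statistically distinguishable from zero, one option (a sketch, not part of the original solution) is to regress each variable on the gender indicator. With sex coded 1 for men and 2 for women, as in the data preparation above, the coefficient on 2.sex is the female-minus-male difference in means, with a standard error that respects the survey design:

. // difference in mean locus of control (female minus male)
. svy: regress LoC i.sex

. // difference in mean willingness to take risks (female minus male)
. svy: regress risk i.sex

Given the means reported above, the two coefficients should be about .06 for LoC and about -.69 for risk.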

We can also look at gender differences in coefficients:

. svy, subpop(if sex==1): ///
>     logit supvis yeduc expft exppt LoC risk, nolog
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                         Number of obs   =       5,411
Number of PSUs   = 2,043                         Population size =  12,155,049
                                                 Subpop. no. obs =       2,599
                                                 Subpop. size    = 6,322,622.6
                                                 Design df       =       2,028
                                                 F(5, 2024)      =        8.41
                                                 Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .1225298   .0264803     4.63   0.000     .0705983    .1744613
       expft |   .0182957   .0075255     2.43   0.015     .0035372    .0330541
       exppt |  -.0338933   .0278222    -1.22   0.223    -.0884563    .0206698
         LoC |  -.2378459   .0820053    -2.90   0.004    -.3986693   -.0770225
        risk |   .0820778   .0339389     2.42   0.016      .015519    .1486366
       _cons |  -2.097989   .5461519    -3.84   0.000    -3.169066   -1.026912
------------------------------------------------------------------------------

. svy, subpop(if sex==2): ///
>     logit supvis yeduc expft exppt LoC risk, nolog
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =    15                         Number of obs   =       5,411
Number of PSUs   = 2,043                         Population size =  12,155,049
                                                 Subpop. no. obs =       2,812
                                                 Subpop. size    = 5,832,426.8
                                                 Design df       =       2,028
                                                 F(5, 2024)      =        5.93
                                                 Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |   .1328677   .0289446     4.59   0.000     .0761034     .189632
       expft |   .0228878   .0088796     2.58   0.010     .0054737    .0403019
       exppt |  -.0008128   .0147863    -0.05   0.956    -.0298107     .028185
         LoC |   .0032806   .0822922     0.04   0.968    -.1581055    .1646667
        risk |   .0802607   .0320892     2.50   0.012     .0173294    .1431919
       _cons |  -3.639839   .5707868    -6.38   0.000    -4.759228   -2.520449
------------------------------------------------------------------------------

Locus of control seems to have an effect among men (coefficient -.238, p = 0.004) but not among women (coefficient .003, p = 0.968). The effect of risk taking is very similar for men (.082) and women (.080).
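
Comparing two separate tables does not by itself show that the coefficients differ between men and women. A possible check, sketched here under the assumption that a fully interacted pooled model is acceptable for this purpose, is to interact all predictors with the female indicator and test the interaction terms:

. // pooled model: main effects are the male coefficients,
. // interaction terms are the female-minus-male differences
. svy: logit supvis i.female##c.(yeduc expft exppt LoC risk)

. // does the LoC coefficient differ by gender?
. test 1.female#c.LoC

. // does the risk coefficient differ by gender?
. test 1.female#c.risk

The interaction coefficient for LoC should be about .241, the difference between the two group-specific estimates shown above. Note that differences in logit coefficients across groups can also reflect differences in residual variation, so such tests should be read with some caution.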

We now turn to the decompositions. First we run the aggregate decomposition with and without the added variables, using the nldecompose command:

. // reduced model
. nldecompose, by(male): svy: logit supvis yeduc expft exppt 

                                                   Number of obs (A) =    2599
                                                   Number of obs (B) =    2812

------------------------------------------------------------------------------
      Results |      Coef.  Percentage
--------------+---------------------------------------------------------------
 Omega = 1    |
         Char |    .049239   33.68346%
         Coef |   .0969426   66.31654%
--------------+---------------------------------------------------------------
 Omega = 0    |
         Char |   .0220991   15.11756%
         Coef |   .1240826   84.88244%
--------------+---------------------------------------------------------------
          Raw |   .1461817        100%
------------------------------------------------------------------------------

. //  full model
. nldecompose, by(male): svy: logit supvis yeduc expft exppt LoC risk

                                                   Number of obs (A) =    2599
                                                   Number of obs (B) =    2812

------------------------------------------------------------------------------
      Results |      Coef.  Percentage
--------------+---------------------------------------------------------------
 Omega = 1    |
         Char |   .0613367   41.95923%
         Coef |    .084845   58.04077%
--------------+---------------------------------------------------------------
 Omega = 0    |
         Char |   .0308127   21.07834%
         Coef |    .115369   78.92166%
--------------+---------------------------------------------------------------
          Raw |   .1461817        100%
------------------------------------------------------------------------------

The overall gender gap is about 15 percentage points, i.e. the proportion working as supervisors or in leadership positions is about 15 percentage points higher for males than for females. Gender differences in schooling as well as in full-time and part-time experience explain part of this gap: 34% if the male coefficients are used as reference (Omega = 1) and 15% if the female coefficients are used (Omega = 0). Interestingly, the explained part is larger with the male coefficients; the effects of the predictors thus seem to be stronger in the male sample.

Adding "locus of control" and "risk taking" increases the explained part to 42% (male coefficients) or 21% (female coefficients) of the overall gap. The explained part is still larger when the male coefficients are used as reference.
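
The logic behind the explained part can be retraced by hand. The following is an illustrative sketch, not a description of how nldecompose works internally; the variable name p_m and the scalar names are hypothetical. It fits the model for men, predicts supervision probabilities for everyone using the male coefficients, and compares the weighted mean prediction among men with the one among women:

. // male model; pweights yield the same point estimates as the svy prefix
. logit supvis yeduc expft exppt LoC risk [pw=weight] if male==1

. // predicted Pr(supvis=1) for all observations, based on male coefficients
. predict double p_m, pr

. // mean prediction among men (matches the observed male proportion)
. summarize p_m [aw=weight] if male==1, meanonly
. scalar pm_men = r(mean)

. // counterfactual: women's characteristics combined with men's coefficients
. summarize p_m [aw=weight] if male==0, meanonly
. scalar pm_women = r(mean)

. // explained part with male coefficients as reference
. display "explained (male coefficients): " pm_men - pm_women

The displayed difference should come out close to the value of .0613 reported for Omega = 1 in the full model; dividing it by the raw gap of .1462 gives the share of about 42%.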

To obtain a detailed decomposition we now run Fairlie decompositions (with random ordering), Yun decompositions, as well as decompositions based on the linear probability model (LPM):

. // Fairlie
. // - male coefficients as reference
. fairlie supvis yeduc expft exppt LoC risk [pw=weight], by(female) noest ///
>     ro reps(1000) nodots

Non-linear decomposition by female (G)                   Number of obs = 5,411
                                                  N of obs G=0    =       2599
                                                  N of obs G=1    =       2812
                                                  Pr(Y!=0|G=0)    =  .36917804
                                                  Pr(Y!=0|G=1)    =  .22299637
                                                  Difference      =  .14618167
                                                  Total explained =   .0613367
------------------------------------------------------------------------------
      supvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |  -.0050414   .0016288    -3.10   0.002    -.0082339    -.001849
       expft |   .0243854   .0098696     2.47   0.013     .0050414    .0437295
       exppt |   .0260367   .0198629     1.31   0.190    -.0128939    .0649672
         LoC |   .0037098   .0014612     2.54   0.011     .0008459    .0065738
        risk |   .0121009   .0048034     2.52   0.012     .0026864    .0215154
------------------------------------------------------------------------------

. est sto fairlie_m

. // - female coefficients as reference
. fairlie supvis yeduc expft exppt LoC risk [pw=weight], by(female) noest ///
>     ro reps(1000) nodots reference(1)

Non-linear decomposition by female (G)                   Number of obs = 5,411
                                                  N of obs G=0    =       2599
                                                  N of obs G=1    =       2812
                                                  Pr(Y!=0|G=0)    =  .36917804
                                                  Pr(Y!=0|G=1)    =  .22299637
                                                  Difference      =  .14618167
                                                  Total explained =  .03081266
------------------------------------------------------------------------------
      supvis | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       yeduc |  -.0043725   .0017766    -2.46   0.014    -.0078546   -.0008904
       expft |    .025314   .0101165     2.50   0.012      .005486    .0451421
       exppt |   .0005391   .0097917     0.06   0.956    -.0186522    .0197305
         LoC |  -.0000368   .0010098    -0.04   0.971     -.002016    .0019424
        risk |   .0093716   .0038792     2.42   0.016     .0017684    .0169747
------------------------------------------------------------------------------

. est sto fairlie_f

. // Yun
. // - male coefficients as reference
. oaxaca supvis yeduc expft exppt LoC risk, by(female) weight(1) logit svy

Blinder-Oaxaca decomposition

Number of strata =    15                        Number of obs     =      5,411
Number of PSUs   = 2,043                        Population size   = 12,155,049
                                                Design df         =      2,028
                                                Model             =      logit
Group 1: female = 0                             N of obs 1        =      2,599
Group 2: female = 1                             N of obs 2        =      2,812

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |    .369178   .0160281    23.03   0.000     .3377449    .4006112
     group_2 |   .2229964   .0134008    16.64   0.000     .1967155    .2492772
  difference |   .1461817   .0209953     6.96   0.000      .105007    .1873564
   explained |   .0613367   .0215271     2.85   0.004     .0191192    .1035542
 unexplained |    .084845   .0293953     2.89   0.004     .0271968    .1424932
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0049599   .0030737    -1.61   0.107    -.0109878     .001068
       expft |   .0236865   .0098568     2.40   0.016     .0043561     .043017
       exppt |   .0283093   .0215572     1.31   0.189    -.0139673    .0705858
         LoC |   .0028845   .0022062     1.31   0.191    -.0014421    .0072112
        risk |   .0114162   .0049661     2.30   0.022      .001677    .0211555
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0264444   .1010975    -0.26   0.794    -.2247102    .1718214
       expft |  -.0096922   .0245371    -0.40   0.693    -.0578127    .0384283
       exppt |  -.0353219   .0352082    -1.00   0.316    -.1043699     .033726
         LoC |  -.1510524   .0758564    -1.99   0.047    -.2998169   -.0022879
        risk |   .0016919   .0433828     0.04   0.969    -.0833875    .0867714
       _cons |    .305664   .1489065     2.05   0.040     .0136384    .5976896
------------------------------------------------------------------------------

. est sto yun_m

. // - female coefficients as reference
. oaxaca supvis yeduc expft exppt LoC risk, by(female) weight(0) logit svy

Blinder-Oaxaca decomposition

Number of strata =    15                        Number of obs     =      5,411
Number of PSUs   = 2,043                        Population size   = 12,155,049
                                                Design df         =      2,028
                                                Model             =      logit
Group 1: female = 0                             N of obs 1        =      2,599
Group 2: female = 1                             N of obs 2        =      2,812

    explained: (X1 - X2) * b2
  unexplained: X1 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |    .369178   .0160281    23.03   0.000     .3377449    .4006112
     group_2 |   .2229964   .0134008    16.64   0.000     .1967155    .2492772
  difference |   .1461817   .0209953     6.96   0.000      .105007    .1873564
   explained |   .0308127   .0129802     2.37   0.018     .0053567    .0562686
 unexplained |    .115369   .0247499     4.66   0.000     .0668311    .1639069
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0045962    .002864    -1.60   0.109     -.010213    .0010205
       expft |   .0253226   .0101994     2.48   0.013     .0053203    .0453249
       exppt |   .0005802   .0105492     0.05   0.956    -.0201082    .0212685
         LoC |   -.000034   .0008531    -0.04   0.968     -.001707     .001639
        risk |   .0095401   .0040993     2.33   0.020     .0015007    .0175794
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |  -.0274208   .1053496    -0.26   0.795    -.2340255    .1791839
       expft |  -.0163246   .0415358    -0.39   0.694    -.0977819    .0651326
       exppt |  -.0087738   .0083879    -1.05   0.296    -.0252237     .007676
         LoC |  -.1560804   .0772106    -2.02   0.043    -.3075009     -.00466
        risk |    .002042   .0523371     0.04   0.969     -.100598     .104682
       _cons |   .3219267    .160614     2.00   0.045     .0069409    .6369124
------------------------------------------------------------------------------

. est sto yun_f

. // LPM
. // - male coefficients as reference
. oaxaca supvis yeduc expft exppt LoC risk, by(female) weight(1) svy

Blinder-Oaxaca decomposition

Number of strata =    15                        Number of obs     =      5,411
Number of PSUs   = 2,043                        Population size   = 12,155,049
                                                Design df         =      2,028
                                                Model             =     linear
Group 1: female = 0                             N of obs 1        =      2,599
Group 2: female = 1                             N of obs 2        =      2,812

    explained: (X1 - X2) * b1
  unexplained: X2 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |    .369178   .0159972    23.08   0.000     .3378054    .4005507
     group_2 |   .2229964   .0134079    16.63   0.000     .1967017    .2492911
  difference |   .1461817   .0209677     6.97   0.000     .1050612    .1873022
   explained |   .0602719   .0224017     2.69   0.007     .0163391    .1042047
 unexplained |   .0859098   .0300496     2.86   0.004     .0269784    .1448411
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0054351   .0033939    -1.60   0.109    -.0120909    .0012208
       expft |   .0249603   .0104546     2.39   0.017     .0044574    .0454632
       exppt |   .0255638    .020584     1.24   0.214    -.0148041    .0659318
         LoC |   .0030198    .002288     1.32   0.187    -.0014672    .0075068
        risk |   .0121631   .0052187     2.33   0.020     .0019285    .0223977
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0573095   .1019952     0.56   0.574    -.1427168    .2573358
       expft |   .0012023   .0238214     0.05   0.960    -.0455146    .0479192
       exppt |  -.0324088   .0295013    -1.10   0.272    -.0902649    .0254472
         LoC |  -.1619925    .069732    -2.32   0.020    -.2987463   -.0252387
        risk |   .0218128   .0414527     0.53   0.599    -.0594815    .1031071
       _cons |   .1999864   .1496528     1.34   0.182    -.0935029    .4934758
------------------------------------------------------------------------------

. est sto LPM_m

. // - female coefficients as reference
. oaxaca supvis yeduc expft exppt LoC risk, by(female) weight(0) svy

Blinder-Oaxaca decomposition

Number of strata =    15                        Number of obs     =      5,411
Number of PSUs   = 2,043                        Population size   = 12,155,049
                                                Design df         =      2,028
                                                Model             =     linear
Group 1: female = 0                             N of obs 1        =      2,599
Group 2: female = 1                             N of obs 2        =      2,812

    explained: (X1 - X2) * b2
  unexplained: X1 * (b1 - b2)

------------------------------------------------------------------------------
             |             Linearized
      supvis | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |    .369178   .0159972    23.08   0.000     .3378054    .4005507
     group_2 |   .2229964   .0134079    16.63   0.000     .1967017    .2492911
  difference |   .1461817   .0209677     6.97   0.000     .1050612    .1873022
   explained |   .0294254   .0125396     2.35   0.019     .0048335    .0540172
 unexplained |   .1167563   .0245189     4.76   0.000     .0686715    .1648411
-------------+----------------------------------------------------------------
explained    |
       yeduc |  -.0045492    .002856    -1.59   0.111    -.0101501    .0010518
       expft |   .0242398   .0098033     2.47   0.013     .0050142    .0434654
       exppt |   .0007985   .0093824     0.09   0.932    -.0176017    .0191987
         LoC |  -.0000438    .000801    -0.05   0.956    -.0016146     .001527
        risk |     .00898   .0037982     2.36   0.018     .0015312    .0164288
-------------+----------------------------------------------------------------
unexplained  |
       yeduc |   .0564236   .1004186     0.56   0.574    -.1405108    .2533581
       expft |   .0019228   .0380955     0.05   0.960    -.0727876    .0766332
       exppt |  -.0076435     .00698    -1.10   0.274    -.0213322    .0060452
         LoC |  -.1589289   .0684155    -2.32   0.020    -.2931009   -.0247569
        risk |   .0249959   .0475013     0.53   0.599    -.0681605    .1181524
       _cons |   .1999864   .1496528     1.34   0.182    -.0935029    .4934758
------------------------------------------------------------------------------

. est sto LPM_f

Overview of the results:

. esttab fairlie_m yun_m LPM_m fairlie_f yun_f LPM_f, ///
>     compress varw(12) equations(Explained=1:2:2:1:2:2) mtitle nonumber ///
>     keep(Explained: overall:difference overall:explained overall:unexplained) ///
>     mgroup("Male coefficients as reference" "Female coefficients as reference", ///
>         pattern(1 0 0 1 0 0) span)

------------------------------------------------------------------------------------------
             Male coefficients as reference         Female coefficients as reference      
             fairlie_m        yun_m        LPM_m    fairlie_f        yun_f        LPM_f   
------------------------------------------------------------------------------------------
Explained                                                                                 
yeduc         -0.00504**   -0.00496     -0.00544     -0.00437*    -0.00460     -0.00455   
               (-3.10)      (-1.61)      (-1.60)      (-2.46)      (-1.60)      (-1.59)   

expft           0.0244*      0.0237*      0.0250*      0.0253*      0.0253*      0.0242*  
                (2.47)       (2.40)       (2.39)       (2.50)       (2.48)       (2.47)   

exppt           0.0260       0.0283       0.0256     0.000539     0.000580     0.000799   
                (1.31)       (1.31)       (1.24)       (0.06)       (0.05)       (0.09)   

LoC            0.00371*     0.00288      0.00302    -0.0000368    -0.0000340    -0.0000438   
                (2.54)       (1.31)       (1.32)      (-0.04)      (-0.04)      (-0.05)   

risk            0.0121*      0.0114*      0.0122*     0.00937*     0.00954*     0.00898*  
                (2.52)       (2.30)       (2.33)       (2.42)       (2.33)       (2.36)   
------------------------------------------------------------------------------------------
overall                                                                                   
difference                    0.146***     0.146***                  0.146***     0.146***
                             (6.96)       (6.97)                    (6.96)       (6.97)   

explained                    0.0613**     0.0603**                  0.0308*      0.0294*  
                             (2.85)       (2.69)                    (2.37)       (2.35)   

unexplained                  0.0848**     0.0859**                   0.115***     0.117***
                             (2.89)       (2.86)                    (4.66)       (4.76)   
------------------------------------------------------------------------------------------
N                 5411         5411         5411         5411         5411         5411   
------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

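The point estimates from the different methods are very similar. The standard errors differ somewhat for yeduc and LoC, where the Fairlie results have larger t statistics; note that fairlie was run with probability weights only, whereas the oaxaca results take the full survey design into account, which may contribute to this difference. When using the male coefficients as reference, locus of control and risk taking together explain about 1.4 to 1.6 percentage points of the gap (roughly 10 to 11%), compared to about 0.9 percentage points (roughly 6%) when using the female coefficients as reference. The difference is mainly due to locus of control, which contributes essentially nothing under the female coefficients because it has no effect on supervision among women. Given the total gap of about 15 percentage points, these contributions still seem rather moderate.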
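
The joint contribution of the two added variables can also be computed from the stored results, together with a delta-method standard error. The sketch below assumes that the coefficients stored by oaxaca can be addressed by the equation names shown in the output and used in the esttab call above (overall, explained); it is an illustration, not part of the original solution:

. // Yun decomposition with male coefficients as reference
. estimates restore yun_m

. // joint contribution of LoC and risk in percent of the raw gap
. nlcom (share: 100 * (_b[explained:LoC] + _b[explained:risk]) ///
>     / _b[overall:difference])

Based on the coefficients reported above, the point estimate should be about 9.8. Replacing yun_m by yun_f, LPM_m, or LPM_f gives the corresponding figures for the other oaxaca-based decompositions.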