Kitagawa-Oaxaca-Blinder Decomposition
1 Introduction
The purpose of this document is to detail the software that runs the Kitagawa-Oaxaca-Blinder decomposition.
Code is located in the scripts/
directory. A complex refactor is saved in the scripts/refactor
directory.
2 Stages
To produce a Kitagawa-Oaxaca-Blinder decomposition, it is necessary to [TODO: write a couple of paragraphs on this, citing original sources].
One of the beautiful things about Kitagawa’s original analysis was that it was fully categorical. Though she did not describe it that way, she essentially ran a fully interacted regression and used the resulting coefficients in her decomposition.
The beauty of this approach is that population averages in the two reference periods are fully sufficient to produce the KOB decomposition: no microdata are needed.
She did not include standard errors in her original estimates, but those are straightforward to produce in the outcome estimates using the technique described by the Census here.
we need 4 ingredients for the KOB function: 1. Regression coefficients 2. Standard errors on regression coefficients 3. Population proportion estimates 4. Standard errors on population proportions
The methods used to calculate standard errors vary based on the sample used. For 2019, we use the Successive Differences Replication (SDR) method, outlined in detail here. This, in short, involves repeating the same calculation 80 times using a different weight column specified by the REPWTP
variables. This technique can be used to estimate standard errors around both the population proportion and the regression coefficient estimates. For 2000 data, we use a Taylor series design using the CLUSTER
and STRATA
variables to account for survey design.
2.1 Regression functional forms
Let \(S_i\) represent the predicted household size for person \(i\) in survey year \(y\). Regressions include all persons not in group quarters. All variables are discrete. Specifically:
- Race & Ethnicity: Mutually exclusive groups of AAPI, AIAN, Black, Hispanic, Multiracial, White, and Other. For more information on how categorizations are drawn, please refer to the codebook.
- Age: Comprised of 17 5-year age buckets, spanning from 0-4, 5-9, 10-14, …, 80-84, and 85+.
- Education: Highest education attained. Includes less than high school, high school only, less than 4 years of college, and 4+ years of college.
- CPI-U Adjusted Total Income: Total personal income (TODO: factcheck that this is what the INCTOT variable represents), deflated/inflated to 2010 values and bucketed into negative, $0 - $29,999, $30,000 - $59,999, $60,000 - $100,000 … (TODO: what else?)
- US-Born: True if born in the United States, False if otherwise.
- Gender: Female or male.
- Tenure: Homeowner or renter.
Regression 0: \[ S_{i, y} = \beta_{0,y} + \beta_{1,y}(\text{Race & Ethnicity}_i) + \beta_{2,y}(\text{Age}_i) + \beta_{3,y}(\text{Education}_i) + \beta_{4,y}(\text{CPI-U Adjusted Total Income}_i) + \beta_{5,y}(\text{US-Born}_i) + \beta_{6,y}(\text{Gender}_i) + \beta_{7,y}(\text{Tenure}) + \epsilon \]
Regression 1:
Regression 2:
2.2 KOB Tranformation
On a fully categorical regression, this takes on the following functional form:
\[ \Delta \bar{Y} = (\bar{X}_A - \bar{X}_B) \cdot \beta_B + \bar{X}_B \cdot (\beta_A - \beta_B) \]
Where:
- \(\bar{Y}\) is the average outcome,
- \(\bar{X}\) is the vector of average covariate values,
- \(\beta\) are the estimated coefficients from the fully categorical regression,
- and \(A\), \(B\) refer to the two time periods or groups being compared.