Our dataset includes Type 1 R01 applications submitted by WH and AA/B applicants between FY 2011 and 2015, restricted to applications for which there was only a single applicant to avoid the complications of applications with multiple applicants of different races. Additional data from FY 2006–2010 Type 1 R01s were used to construct some of the control variables, as detailed below.

We used probit models estimated with maximum likelihood for binary outcomes, such as whether an application is discussed or awarded. For an application’s percentile score, we used a linear model. Analysis was done at the application level with robust standard errors clustered by applicant. If an application is not awarded, it can be resubmitted. Because both the original and the resubmitted application had the opportunity to be awarded, each was treated as a separate entry, with a control indicator for resubmissions. Analysis was done with Stata 14.

In addition to the race of the applicant, we controlled for a set of individual- and application-level parameters along with a set of organization-level parameters. Individual- and application-level parameters included the FY of the application, a binary indicator if the applicant is an ESI, a binary indicator if the application is a resubmission, and a continuous variable to describe the number of years since the applicant’s last degree (linear and quadratic terms). To assess past success, we used continuous variables to describe both the number of prior R01 applications and awards per applicant. We also included two continuous variables related to the applicant’s publication history as presented in the application’s biosketch. The biosketch of an NIH application represents the applicant’s experience and includes a set of relevant publications selected by the applicant. We parsed biosketches to extract the relevant publications and generated two publication-based metrics for inclusion in the regression analysis: median biosketch RCR and number of biosketch publications in the top RCR decile. Because the biosketch contains only a selected group of publications, these features are proxies for publication influence rather than quantity. We transformed both RCR controls using an inverse hyperbolic sine (IHS) transformation. The IHS transformation behaves similarly to a logarithmic transformation and allows for transformation of zero values. Because the IHS transformation more closely approximates a logarithmic transformation when the numbers are not close to zero, we multiplied the median RCR by 100 before subjecting it to the transformation, effectively using the percentage of the field citation rate rather than the fraction of the field citation rate (21). We included a binary indicator for applications for which biosketch RCR data were missing (8.7%).
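
The rescaling argument for the IHS transformation can be checked numerically. The sketch below is illustrative only: it uses Python rather than the Stata code used for the analysis, the median RCR value is hypothetical, and the helper names `ihs` and `log_gap` are not from the original work. It relies on the identity IHS(x) = ln(x + sqrt(x^2 + 1)), which approaches ln(2x) as x grows.

```python
import math


def ihs(x: float) -> float:
    """Inverse hyperbolic sine, ln(x + sqrt(x^2 + 1)); defined at x = 0."""
    return math.asinh(x)


def log_gap(x: float) -> float:
    """Absolute distance between IHS(x) and ln(2x), its large-x approximation."""
    return abs(ihs(x) - math.log(2 * x))


# Hypothetical median biosketch RCR, expressed as a fraction of the field rate.
median_rcr = 0.8

gap_fraction = log_gap(median_rcr)        # RCR as a fraction (0.8)
gap_percent = log_gap(100 * median_rcr)   # RCR as a percentage (80)

print(f"IHS(0)            = {ihs(0.0)}")
print(f"gap at RCR = 0.8  : {gap_fraction:.6f}")
print(f"gap at RCR = 80   : {gap_percent:.6f}")
```

On the percentage scale the transformed value is far closer to a logarithm than on the fraction scale, while a zero value still maps to zero rather than being undefined.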

Organizational-level controls included the applicant organization’s Carnegie classification [R1: doctoral universities (highest research activity); R2: doctoral universities (higher research activity), medical school, and other], the applicant organization’s type in IMPAC II (higher education, hospital, research organization, and other), and the applicant organization’s geographic region as defined by the U.S. Census (northeast, midwest, south, west, and outside the United States), all treated as categorical variables. We also included controls for the total amount of R01 funding and total number of applicants for the cognate organization in the prior period of 2006–2010 (both IHS transformed and treated as continuous variables). We used the prior period to avoid a deterministic relationship to award status for organizations with no funding.

As an independent method of controlling for organization-level characteristics, we used separate binary indicators for each applicant organization to more directly compare AA/B and WH applications from the same organization. Restricting data to organizations with more than 100 total applications and more than 10 applications from AA/B scientists [49 organizations, 30,664 (45%) of the full dataset], the estimated award rate gap between AA/B and WH applicants is 3.9 percentage points. Adding the 89 topic superclusters (see below) to the model reduced this gap by 9%. These estimates are quite similar to those obtained using the full sample, despite the fact that the subsample used was substantially smaller, limited to large organizations with both higher award rates (15.3% versus 14.2% outside the subsample) and a higher percentage of AA/B applicants (3.4% versus 2.1% outside the subsample).

We used a probit model to evaluate the probability that an application was resubmitted, considering resubmissions in the FY of initial submission and the two subsequent FYs. We included all controls described above. We restricted this analysis to unawarded applications that were not themselves resubmissions. Restricting to the FY + 2 resubmission window should result in little censoring, as we found that over 98% of resubmissions were submitted within this window for initial applications submitted in FY 2011–2014.

In assessing the number of applications submitted by each applicant, we used a Poisson model with analysis conducted at the applicant level. Quasi-maximum likelihood estimation with robust standard errors compensates for the restrictive assumptions of the Poisson model. To collapse application-level data to the applicant level, we used the mean for the RCR variable and years since degree variable, and the mode for the organization-level variables. Other application-level variables like FY were dropped.
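
The collapse from application-level to applicant-level records can be sketched as follows. This is a minimal illustration in Python, not the Stata code used for the analysis; the field names (`applicant`, `ihs_median_rcr`, `years_since_degree`, `org_type`) and the toy records are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, mode

# Toy application-level records (hypothetical values and field names).
applications = [
    {"applicant": "A", "fy": 2011, "ihs_median_rcr": 4.0,
     "years_since_degree": 10, "org_type": "higher education"},
    {"applicant": "A", "fy": 2012, "ihs_median_rcr": 5.0,
     "years_since_degree": 11, "org_type": "hospital"},
    {"applicant": "A", "fy": 2013, "ihs_median_rcr": 6.0,
     "years_since_degree": 12, "org_type": "higher education"},
    {"applicant": "B", "fy": 2014, "ihs_median_rcr": 3.0,
     "years_since_degree": 20, "org_type": "research organization"},
]


def collapse_to_applicant(rows):
    """Return one record per applicant: count, means, and modal organization."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["applicant"]].append(row)

    collapsed = {}
    for applicant, apps in grouped.items():
        collapsed[applicant] = {
            # Outcome for the count model: applications per applicant.
            "n_applications": len(apps),
            # Continuous controls are averaged across applications.
            "ihs_median_rcr": mean(a["ihs_median_rcr"] for a in apps),
            "years_since_degree": mean(a["years_since_degree"] for a in apps),
            # Organization-level controls take the most frequent value.
            # On ties, statistics.mode (Python 3.8+) returns the first seen.
            "org_type": mode(a["org_type"] for a in apps),
            # Application-level fields such as "fy" are dropped.
        }
    return collapsed


by_applicant = collapse_to_applicant(applications)
print(by_applicant["A"])
```

The tie-breaking rule for the mode is a choice of this sketch; the source does not say how ties were resolved.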

We tested a variety of topic area parametrizations to control for topic, beginning with the full set of 150 clusters generated with word2vec. Because some of these clusters contained a small number of applications, we merged them into various sets of superclusters as alternative topic parametrizations. Starting from the full set of 150 clusters, we iteratively merged clusters in order of word2vec similarity under the constraint that no merged cluster comprised more than 5% of all applications. We used 89 superclusters as our base case because this set most closely approximates the original 150 clusters.
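
One way to read the merging procedure is as a greedy pass over cluster pairs, from most to least similar, that skips any merge that would breach the size cap. The sketch below rests on several assumptions the text does not settle: similarity is taken between original clusters only (it is not recomputed after a merge), merging stops once a target number of superclusters is reached, and the function name, toy sizes, and similarity values are hypothetical. The toy example uses a 50% cap because a 5% cap is not meaningful with only five clusters.

```python
from itertools import combinations


def merge_clusters(sizes, similarity, target, cap_share=0.05):
    """Greedily merge clusters in descending pairwise similarity.

    sizes:      dict mapping cluster id -> number of applications
    similarity: dict mapping frozenset({a, b}) -> similarity score
    target:     stop once this many superclusters remain
    cap_share:  maximum share of all applications in a merged cluster
    """
    cap = cap_share * sum(sizes.values())
    parent = {c: c for c in sizes}   # union-find forest over cluster ids
    group_size = dict(sizes)         # size of each current supercluster root

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    n_groups = len(sizes)
    pairs = sorted(combinations(sizes, 2),
                   key=lambda p: similarity[frozenset(p)], reverse=True)
    for a, b in pairs:
        if n_groups <= target:
            break
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same supercluster
        if group_size[ra] + group_size[rb] > cap:
            continue  # merge would exceed the size cap
        parent[rb] = ra
        group_size[ra] += group_size.pop(rb)
        n_groups -= 1

    groups = {}
    for c in sizes:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())


# Toy input: five clusters, 100 applications in total.
sizes = {"a": 10, "b": 10, "c": 30, "d": 30, "e": 20}
scores = {("a", "b"): 0.9, ("c", "d"): 0.8, ("a", "c"): 0.7, ("d", "e"): 0.6,
          ("a", "e"): 0.5, ("b", "c"): 0.4, ("b", "d"): 0.3, ("b", "e"): 0.2,
          ("a", "d"): 0.1, ("c", "e"): 0.05}
similarity = {frozenset(pair): s for pair, s in scores.items()}

superclusters = merge_clusters(sizes, similarity, target=3, cap_share=0.5)
print(superclusters)
```

In the toy run, the second most similar pair (c, d) is skipped because the merged cluster would hold 60% of applications, so the next eligible pair is merged instead.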

In all cases, we reported the relationship between the independent and dependent variables as an AME, rather than reporting regression coefficients. The AME represents the average value of the marginal effect of the independent variable (e.g., AA/B applicant) on the dependent variable (e.g., probability the application is awarded). Because the regression models are not linear, the marginal effects differ depending on the values of the other independent variables. The AME was constructed by first calculating the marginal effect of interest for each observation in the sample at each observation’s values for the other independent variables and then averaging these marginal effects.
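
The AME construction can be illustrated for a binary indicator in a probit model. The sketch below is not the Stata procedure used for the analysis: the coefficients, intercept, variable names, and observations are hypothetical, and it treats the marginal effect of a binary indicator as the difference in predicted probability with the indicator set to 1 versus 0, which is one common convention that the text does not explicitly confirm.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF, the probit link's inverse."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def marginal_effects(rows, coefs, intercept, var):
    """Per-observation change in predicted probability when `var` goes 0 -> 1,
    holding each observation's other covariates at their observed values."""
    effects = []
    for row in rows:
        base = intercept + sum(coefs[k] * v for k, v in row.items() if k != var)
        effects.append(norm_cdf(base + coefs[var]) - norm_cdf(base))
    return effects


def average_marginal_effect(rows, coefs, intercept, var):
    """Mean of the per-observation marginal effects."""
    effects = marginal_effects(rows, coefs, intercept, var)
    return sum(effects) / len(effects)


# Hypothetical fitted probit coefficients (not estimates from the study).
coefs = {"aab": -0.2, "esi": 0.3, "resub": 0.6}
intercept = -1.2

# Hypothetical observations: binary indicators only, for brevity.
rows = [
    {"aab": 1, "esi": 0, "resub": 0},
    {"aab": 0, "esi": 1, "resub": 0},
    {"aab": 0, "esi": 0, "resub": 1},
    {"aab": 0, "esi": 1, "resub": 1},
]

effects = marginal_effects(rows, coefs, intercept, "aab")
ame = average_marginal_effect(rows, coefs, intercept, "aab")
print([round(e, 4) for e in effects])
print(round(ame, 4))
```

The per-observation effects differ even though the coefficient is the same for every row, which is why the average, rather than the coefficient, is reported.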
