Beyond the “Bob Effect”: Why Your Rank List is Less Accurate than You Think

By David A. Ross, MD, PhD, assistant professor, Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut

“Our program doesn’t have the ‘Bob Effect’”

“The program director should just correct for these types of problems”

In our article in Academic Medicine’s September issue, we focus on the potential adverse effect of inter-rater variability (aka the “Bob Effect”) on rank list accuracy, and we offer a new approach for constructing rank order lists (ROLs) for the Match. We present data from computer simulations illustrating the comparative efficacy of traditional ranking (TR) methods versus the ROSS-MOORE approach. We acknowledge that the model we use in the simulations (containing only “noise” and “bias”) is highly reductionistic: real systems are far more complex – and, correspondingly, the extent and influence of variability can be far more insidious.
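To give a concrete sense of what “noise” and “bias” mean in such a model, here is a minimal sketch in Python (a toy illustration only – not the simulation software behind the article, and every parameter value is arbitrary). Each applicant has a true quality, each interviewer carries a fixed personal offset (the “Bob Effect”) plus random scoring noise, and applicants are ranked by their averaged scores.

```python
import random

def simulate_traditional_ranking(n_applicants=50, n_interviewers=10,
                                 interviews_per_applicant=3,
                                 noise_sd=1.0, bias_sd=1.0, seed=0):
    """Toy 'noise + bias' model of traditional ranking (illustration only)."""
    rng = random.Random(seed)
    true_quality = [rng.gauss(0, 1) for _ in range(n_applicants)]
    # Each interviewer has a fixed personal offset -- the "Bob Effect".
    bias = [rng.gauss(0, bias_sd) for _ in range(n_interviewers)]

    avg_score = []
    for q in true_quality:
        # Each applicant is scored by a random panel of interviewers.
        panel = rng.sample(range(n_interviewers), interviews_per_applicant)
        scores = [q + bias[j] + rng.gauss(0, noise_sd) for j in panel]
        avg_score.append(sum(scores) / len(scores))

    # Compare the ordering implied by averaged scores with the true ordering.
    true_order = sorted(range(n_applicants), key=lambda i: -true_quality[i])
    observed_order = sorted(range(n_applicants), key=lambda i: -avg_score[i])
    shifted = sum(t != o for t, o in zip(true_order, observed_order))
    return shifted / n_applicants

print(f"Fraction of rank positions that shift: {simulate_traditional_ranking():.2f}")
```

Even this stripped-down model makes the basic point: a rank list assembled from averaged scores can drift well away from the “true” ordering, and the drift grows as noise and bias grow relative to the real differences between applicants.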

We use this space to explore further the richness and complexity of the Match system and to reflect briefly on how each of these nuances may interact with different ranking strategies (see also Keith Baker’s commentary from the September issue).

Scales are Imperfect

Prior to this project, interviewers in our program were asked to make a summative evaluation of each applicant on a 5-point scale (this was in addition to qualitative assessments that we discussed at length as a committee).


These data were averaged to generate a preliminary ROL that was adjusted at the final Resident Recruitment Committee meeting. Limitations of this system included questionable scale validity and poor discriminatory capacity.

When we began this project, we developed a new scale and trained faculty extensively in its use. To some extent – though nowhere near perfectly – this has improved both our ability to discriminate between closely matched applicants and our inter-rater reliability.

One problem that remains is that many applicants seem to be more (or less) than the sum of their parts. For example, all faculty may agree that a particular applicant should correctly be scored a 26 on our new scale; however, for various reasons not captured by our questions, he or she may be a better applicant than someone else who would be scored a 28.

Such scenarios remind us of the impossibility of designing a perfect scale: which domains of interest should be included? Which should be excluded? What relative weight should be given to each?

There is no correct answer. Quite simply, it is not possible to reduce all dimensions of an applicant into a single numerical output.

Interviews (and Interviewers) are Imperfect

Interviews enable the transmission of information that is not readily apparent in a paper application. Such information may be communicated behaviorally, by the way applicants comport themselves, or verbally, such as in the way they discuss patients. Such data may play a critical role in the assessment process. (I will pause for a moment so that each reader may reflect on a memorable interview that led to a dramatic shift in their impression of an applicant.)

While we tend to focus on the applicant, we should not overlook what we, as interviewers, bring to this process. Each of us grows up with a certain set of beliefs about the world, including a host of implicit biases (if you think you’re somehow immune, please link immediately to https://implicit.harvard.edu/implicit/demo/ — I’ll wait). Such biases may cause interviewers to overvalue or undervalue applicants for reasons of which they themselves are not aware. For example, it is common in our meetings for faculty members to advocate particularly strongly for an applicant who in many ways resembles him- or herself – and to be genuinely surprised when this similarity is pointed out by another member of the committee. It is also common for faculty to be biased against members of their own group — be it based on gender, race, culture, religion, or other factors. (If you’re not already familiar with the scientific literature on the impact of biases, see the work of David Williams.)

It may be possible that some individuals are sufficiently self-aware to prevent their implicit assumptions from influencing their ratings. However: 1. There’s a reason they’re called implicit assumptions; and 2. From a systems perspective, it would be wildly optimistic (if not delusional) to imagine that everyone would do so.

No surprise – the data supporting the predictive value of the interview process are underwhelming.

Objective “Data” are Imperfect

One way to mitigate the types of variability we describe above is to employ measures that are tightly linked to objective data. Unfortunately, a core problem with residency applications is that much of the data we have is itself uninterpretable.

For example, class rank in medical school might be considered an excellent quantitative measure. However, most schools provide only a range of performance (often quartiles). Some schools use ranges that are sufficiently broad as to be minimally useful (e.g. top 25% vs. everyone else). Many schools have elected not to rank students in any manner.

Even among schools that do provide class rank data, how should one compare across institutions? Is it preferable to be in the second quartile of a “top” medical school or to be ranked in the top 10% of a less prestigious one? Surely some value should be accorded to attending a prestigious school (if only by virtue of the rigorous selection process the applicant passed to gain admission). But how does one even decide what the “top” schools are? Almost everyone would agree that the US News and World Report Rankings are minimally useful. And what about applicants who may have been accepted to a “top” school but chose to attend their state school due to the high cost of private institutions?

One could look more narrowly at clerkship grades, but these are also highly variable by school and may be difficult to interpret: some schools give almost all honors, some give very few; different schools use different names for different grades; some provide graphical distributions of scores and others do not; and, of course, some schools stoically refuse to give any summative data because they believe this will force programs to carefully review the detailed clerkship narratives in the Dean’s Letter. (Note to the schools who do this: when I have to read 800 pages of applications each week, the proximal impact of this decision is that I hate reading applications from your students.)

Letters of recommendation are no better. There may be genuine data that can be gleaned at the extremes of the spectrum—e.g. an overt “red flag” or a letter from a Residency Program Director stating that “This is the single best applicant I have seen in the past 30 years.” But, generally speaking, almost all letters of recommendation are positive and the variation that can be observed may better reflect the author of the letter than its subject. (No surprise, studies have shown that letters of recommendation don’t predict performance either.)

Of course, there are always USMLE scores. These are a genuine, objective, standardized measurement of all applicants—one that has virtually no bearing on how good a doctor someone will become (or, for that matter, on anything other than how well the applicant can take standardized tests).

While other data exist in applications (e.g. extracurriculars, leadership positions, research accomplishments, etc.), good luck trying to quantify them in any meaningful way.

Our Systems are Imperfect

One of the most striking aspects of our work has been looking at variability that results from the way in which each program structures its interview season. Emergent properties of the system include:

The Olympic Gymnastics effect: Over the course of a long season, the way in which faculty members assess applicants changes considerably. The more interviews they conduct, the better they internalize a sense of the normative applicant and the more adept they become at using the assessment tool. Over the years, we have found that this effect may lead to approximately 20% of applicants being misranked based on the original scores.

Number of interviews for each applicant: Because of intrinsic variability in the system, not having an adequate number of interviews can have a hugely negative impact in a traditional ranking system. For example, we are aware of one large program that, because of constraints on faculty resources, conducts only two interviews per applicant. With only two data points, any erroneous score will lead to a major shift in the applicant’s final rank. Our new system offers a significant benefit in this regard because the primary data points are the comparisons between applicants on each faculty member’s list (i.e. their “wins” and “losses”). Thus, in the program just described, if each faculty member sees 10 applicants, we would have 18 data points per applicant instead of 2, because each applicant would have “competed” against nine other individuals on each of two lists (the sketch following this list makes the arithmetic concrete).

This raises the obvious question: what if each interviewer sees only a couple of applicants?

Number of applicants seen by each faculty member: Faculty who see relatively few applicants will often describe feeling uncertain about how to use the scale and whether their ratings are consistent with other interviewers’. This effect is borne out in the data: we have consistently found that faculty who see fewer applicants have vastly higher variability using traditional rating scales. Another strength of the ROSS-MOORE system is that it gives greater weight to individuals who see more applicants.
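To make both of these points concrete, here is a minimal sketch in Python (the lists and program sizes are hypothetical, and this illustrates only the counting, not our actual ranking software). Each faculty member’s ranked list contributes one “win/loss” comparison for every pair of applicants on it, so an applicant seen by two interviewers who each rank 10 applicants accrues 18 data points rather than 2, and an interviewer who ranks more applicants naturally contributes more comparisons.

```python
from collections import Counter
from itertools import combinations

def comparison_counts(faculty_lists):
    """Count pairwise 'win/loss' data points per applicant.

    Each faculty member's ranked list contributes one comparison for every
    pair of applicants on that list, so longer lists naturally carry more
    weight in the aggregate.
    """
    counts = Counter()
    for ranked_list in faculty_lists:
        for a, b in combinations(ranked_list, 2):
            counts[a] += 1
            counts[b] += 1
    return counts

# Hypothetical program: 10 applicants, each interviewed by the same two
# faculty members, so each faculty list contains all 10 applicants.
applicants = [f"Applicant{i}" for i in range(10)]
two_interviewer_program = [list(applicants), list(reversed(applicants))]
print(comparison_counts(two_interviewer_program)["Applicant0"])  # 18, not 2

# An interviewer who ranks 6 applicants contributes 15 pairs; an interviewer
# who ranks only 2 contributes a single pair.
uneven_program = [["A", "B", "C", "D", "E", "F"], ["B", "A"]]
print(comparison_counts(uneven_program))
```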

What to Do?

With all of these factors complicating the process of resident recruitment, it would be easy to either put one’s head in the sand and pretend everything’s perfect (denial) or to throw one’s hands in the air and declare all efforts to be futile (learned helplessness).

We actually think that this is what makes the whole process kind of fun.

The problems we face are complex but they are not intractable. To this end, we have launched a series of research projects that address different aspects of the issues outlined above.

A first study is designed to optimize the process of data extraction and interpretation from a paper application. Essentially: what, if anything, can we predict based solely on an applicant’s written record?

This question, of course, links to the most central aspect of the project: how well do different data (including rank lists) predict long term outcome measures (both in residency and beyond)? Though this is an enormously challenging issue to address, we are optimistic about new approaches that we are piloting.

Finally, we continue to focus on how to improve collection and interpretation of data to enhance the accuracy of rank order lists. In our article, we talk about the advantage of the ROSS-MOORE approach over traditional ranking methods for addressing the “Bob Effect”. We do not explore the relative strengths and weaknesses of these approaches for any of the more complicated issues outlined above.

One limitation of our new method is that an ordinal approach may be particularly susceptible to an interviewer’s biases. This is why we advocate the use of the ROSS-MOORE system in conjunction with a traditional scale (though one should note that traditional scales are also highly susceptible to faculty bias).

Importantly, we believe that our new approach may offer considerable advantage over traditional methods because it allows faculty to:

  • correct for poor scale validity (e.g. when 26 > 28);
  • enhance discrimination between applicants by not allowing ties in the scoring (though this often rankles faculty, I view it as sharing the program’s burden: at the end of the day, our task is not to assess applicants, per se, but to rank them relative to one another);
  • correct for the Olympic Gymnastics effect (by forcing faculty to review all data from the year and then recalibrate their scores);
  • mitigate structural systems issues (discussed above), including a low number of interviews per applicant or a low number of applicants seen by each interviewer.

We continue to refine our simulation software so that we can model the system in a more nuanced manner. Based on the results of these efforts, we have been able to restructure the framework of our interview season and to continue updating and optimizing our actual rank algorithm. By creating a system that allows us to conduct experiments and test the impact of different interventions, we are now able to use empirical data to practice Evidence-Based Recruitment.

A Shameless Plug

As we describe in the accompanying post, we have already beta-tested our approach with another residency program and, building on this success, we are now conducting a multi-site study of recruitment and ranking strategies. We are optimistic that participating programs will find this process to be easier and more efficient. Moreover, we believe our new approach will offer programs the most important benefit – a more accurate ROL.

We are still recruiting for the study so if you’re interested in learning more, please contact me directly – david.a.ross@yale.edu – we’d love to include your program!

