Do What You Do Better: Using AI Tools to Ease the Workload Burden on Faculty  

On this episode of the Academic Medicine Podcast, Christy Boscardin, PhD, Brian Gin, MD, PhD, Marc Triola, MD, and Academic Medicine assistant editor Gustavo Patino, MD, PhD, join host Toni Gallo to discuss the ways that artificial intelligence (AI) tools can help ease the workload burden on faculty and staff, with a focus on assessment and admissions. They explore the opportunities that AI tools afford as well as ethical, data privacy, bias, and other issues to consider with their use. They conclude by looking to the future and where medical education might go from here.

This episode is now available through Apple Podcasts, Spotify, and anywhere else podcasts are available.

A transcript is below.

Read the articles discussed in this episode: "ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity" and "Artificial Intelligence Screening of Medical School Applications: Development and Validation of a Machine-Learning Algorithm."

Check out other AI resources, including those mentioned in this episode:

Try out some public AI tools, including those mentioned in this episode:

[Image: digital brain with neural networks visible]

Transcript

Toni Gallo:

Welcome to the Academic Medicine Podcast. I’m Toni Gallo. One of the most talked about topics in medical education these days is artificial intelligence, and that includes everything from generative AI tools like ChatGPT to machine learning algorithms. Advances in these technologies have led to new opportunities as well as questions and challenges around best practices and ethical use.

On today’s episode, we’re going to focus on how AI tools can be used to ease the workload burden on faculty and staff, whether that’s reviewing medical school applications or scoring learner assessments. To do that, I’m joined by Academic Medicine assistant editor, Dr. Gustavo Patino, and the authors of 2 recently published papers that look at this very topic. Dr. Christy Boscardin and Dr. Brian Gin are authors of “ChatGPT and Generative Artificial Intelligence for Medical Education: Potential Impact and Opportunity.” Dr. Marc Triola is author of “Artificial Intelligence Screening of Medical School Applications: Development and Validation of a Machine-Learning Algorithm.”

We’ll get into the ways that AI tools are already helping faculty and staff as well as some of the ethical, data privacy, bias, and other issues to keep in mind when using AI tools. And we’ll share resources and advice for those of you who are just getting started. So with that, let’s do some introductions. Gustavo, could you go first?

Gustavo Patino:

Thank you, Toni, for having me. Hi, everyone. My name is Gustavo Patino. As Toni mentioned, I’m an assistant editor at the journal, and I’m also the associate dean for undergraduate medical education at Western Michigan University Homer Stryker M.D. School of Medicine.

Toni Gallo:

Great. Thank you. Marc?

Marc Triola:

Hi. It’s really a pleasure to be here and thank you for having us. I’m Marc Triola. I’m from NYU Grossman School of Medicine, where I’m the associate dean for educational informatics and I direct our Institute for Innovations in Medical Education.

Toni Gallo:

Thank you. Christy?

Christy Boscardin:

Hi. I’m Christy Boscardin, a faculty member in the Department of Medicine and also in the Department of Anesthesia at UC San Francisco. And I’m also the director of student assessment in the School of Medicine and happy to be here.

Toni Gallo:

And Brian?

Brian Gin:

Thanks, Toni and Gustavo. I’m Brian Gin. I’m a pediatric hospitalist at UCSF Mission Bay, and also a learner in the UCSF-Utrecht doctoral program in health professions education.

Toni Gallo:

Thanks to all of you for being on the podcast today. I am looking forward to our conversation. Gustavo, could you get us started with a little bit of context? Where is medical education when it comes to the use of AI? There’s been so much written about this in the last year or so, I feel like, so just give our listeners some background before we get into our conversation, please.

Gustavo Patino:

Sure. It’s a very exciting time, and the thing to remember is that what all these artificial intelligence algorithms are trying to do is predict things. So in medical education and health professions education, as you mentioned before, we are trying to use them to predict who should be selected for interviews, who should be able to move to the next stage of their education, or whether a submission meets certain quality criteria or thresholds.

And of course, all of this has been completely revamped in the last year with the appearance of ChatGPT and other large language models, where the prediction is actually a narrative, anything from whole texts to jokes to feedback for our learners.

And I feel like at this point we are just starting to explore how all of this can be applied in HPE. We are at the early stages, in the sense that a lot of the reports and papers we are publishing show that we are able to create these models. We will probably soon move to the next stages: how do we share these models, how do we guarantee their quality, how do we evaluate whether they are good and make them better, and are there any biases inherent in them?

Toni Gallo:

So I think we’re going to get into a bunch of that today as part of our discussion. But I want to start with the papers that Christy, Brian, and Marc have written for Academic Medicine. Maybe Christy and Brian, you can get started for us and tell us a little bit about some of the ways you wrote about in which AI tools are already being used by faculty to make some of their tasks more efficient, or to help them when they have a huge amount of material to get through.

Christy Boscardin:

In our paper, we talked about applications in learning, teaching, and research, but I want to focus on assessment for our discussion today, probably because it’s the easiest potential application. I’ll provide 3 specific examples where I think we could optimize the capabilities of tools like ChatGPT and other generative AI tools to reduce the burden on faculty.

First, I think the biggest potential for these tools will be in helping to write assessment items. We can imagine a faculty member asked to create an assessment item for their teaching activity. Especially for faculty with limited access to real clinical case scenarios, generative AI tools could be really helpful in creating reasonable clinical scenarios that can be used for learning cases or as part of assessment prompts, and they could also help generate the actual assessment items. I was actually very surprised at how well ChatGPT was able to create board-style exam questions when I used it myself.

I also think these tools are going to be very helpful in scoring. As we move towards more open-ended response prompts and open-ended questions, I think this is where faculty could really see some benefits. We’ve seen that when we have open-ended items, the process can be very time-consuming, and a lot of faculty worry about rater bias and rater reliability. So I could imagine a tool like ChatGPT being very helpful for reviewing and scoring some of these open-ended response items, or assignments that require text analysis, using a scoring rubric.
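To make this concrete, here is a minimal sketch of rubric-assisted scoring with a large language model, using the OpenAI Python client. The model name, rubric, and student response are illustrative placeholders, and it assumes an institution-approved endpoint rather than the public ChatGPT site; it is not the specific workflow described in the episode or the paper.

```python
# Minimal sketch: ask an LLM to score an open-ended response against a rubric.
# Assumes the OpenAI Python client and an institution-approved endpoint;
# the model name, rubric, and student response below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rubric = """Score 0-3:
3 = names the diagnosis and justifies it with at least two findings
2 = names the diagnosis with one supporting finding
1 = names a plausible diagnosis without justification
0 = incorrect or missing diagnosis"""

student_response = "I think this is community-acquired pneumonia given the fever and focal crackles."

completion = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    temperature=0,   # reduce run-to-run variability
    messages=[
        {"role": "system", "content": "You are a strict grader. Reply with a score and a one-sentence rationale."},
        {"role": "user", "content": f"Rubric:\n{rubric}\n\nStudent response:\n{student_response}"},
    ],
)
print(completion.choices[0].message.content)
```

A faculty member would still review the suggested score, consistent with the human-in-the-loop point raised later in the episode.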

I also want to share one anecdote that a colleague shared with me recently, where she used ChatGPT to help modify an assessment item. After putting her question into ChatGPT, she noticed that ChatGPT was answering the item incorrectly, and in the same way that the students were responding to the question. When she pushed and pulled further with ChatGPT, she identified that there was a mistake, an irrelevant piece of history in the prompt, that was really tripping up the response. So it was a really good way for her to evaluate and revise her items. That really helped her, and I think it’s a really good example of how these tools could be used for scoring as well.

And I can also see definite potential to ease the burden on clerkship directors and program directors. In clerkships and residencies, one of the most time-consuming activities is summarizing large amounts of narrative evaluation data into coherent themes, both to help make judgments and to provide actionable feedback to learners. And I think this is where ChatGPT or generative AI could be really helpful for initial summarization and for identifying some of the key themes present in these narratives.

So there are many, many examples and possibilities, but these are some of the concrete ways I think faculty could use these tools to reduce some of the burden.

Gustavo Patino:

And now we would like to hear more about your paper. Can you tell us what problem you were trying to solve with your faculty screener model, and how does the model that you developed address this issue?

Marc Triola:

Sure, thank you. So our paper was about our undergraduate medical school application screening process. Every medical school in the country is faced with the same challenge of a rising number of applications, and not just medical schools but residency programs as well. At the very time we want a more holistic and nuanced approach to reviewing these applications, we are challenged by the volume, the sheer number of applications. And it’s really quite difficult to increase the number of faculty who screen applications as they come in. We want to make sure we maintain a system that is fair, reliable, and consistent across all of the application screenings.

And in particular here at NYU, we had a unique challenge, because in 2019 we announced full-tuition scholarships for every student who comes to NYU. The number of people who applied to our medical school went from about 6,500 per year to just under 10,000 per year, just with that announcement.

So we really thought, could AI help us in the beginning stages of the medical school application process? We were fortunate to have many years of high-density digital data on the applications to our medical school and on the early screenings that determined whether or not applicants would be offered an interview. We outline all of the specifics in the paper, but essentially we used that digital data to create an AI model, a machine learning model. This is not a ChatGPT-type model; it’s a core machine learning model that could predict, when presented with a new application, the way that human faculty would have screened it.
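As a generic illustration of this kind of screening model (not the authors’ actual features, algorithm, or data), a supervised classifier can be trained on historical applications labeled with the faculty screeners’ decisions and then evaluated by area under the ROC curve:

```python
# Illustrative sketch of a screening classifier trained on historical decisions.
# The feature names, algorithm, and file are assumptions for illustration;
# this is not the model described in the paper.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Historical applications with the faculty decision as the label (1 = invited to interview).
data = pd.read_csv("historical_applications.csv")
features = ["gpa", "mcat_total", "research_months", "clinical_hours", "essay_score"]
X, y = data[features], data["invited_to_interview"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Area under the ROC curve: how well the model reproduces faculty screening decisions.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.3f}")
```

The study itself goes much further, with prospective and randomized evaluation and subgroup checks for bias, as described next.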

Now, this was mission critical. So not only did we train and validate that model, and we were very happy with the level of accuracy it produced, we also spent 2 years on evaluation. First we did a prospective trial, where we ran the model during an admissions year in the background, invisible to the faculty who were doing the screenings, to verify that it worked under real-world conditions and that it worked for every group of students who applied to our school. And then we did an actual randomized trial where, as college students applied to our medical school, they were randomized during one year to receive either the traditional faculty screening or the AI screening. We outline all the results in the paper.

We really wanted to ensure that there weren’t different types of biases introduced in the training of our model or in the application of its AI aspects. So we looked quite carefully at the outcomes, in terms of recommendations for interviews or rejections, among all of the students, among those who are underrepresented or historically excluded in medicine, by gender, et cetera. And we were really thrilled with the results: with a large amount of data, you can train a good model that accurately replicates what our human faculty screeners used to do.

And I want to point out 2 really important points. First, the quality of our model and our algorithm depends entirely on our training data, as is true of all of these systems. So this is not a magical cure for any biases that may have been present in the screening of our applications; however, we were happy with and confident in the outcomes of our faculty screeners. And second, this does not replace human decision makers. In our case, the model helps make the early recommendation for who should be considered for an interview, but it’s still the human admissions committee that makes the ultimate decision. So another important key, particularly at this stage of AI in health care, whether in medical education or clinical applications, is keeping humans in the loop; that is essential and necessary until we really understand these tools better.

But in the end, we were able to create a system that now saves us about 6,000 faculty hours per year and can very quickly, and in a truly scalable way, screen applications fairly, transparently, and consistently. It doesn’t get tired, it doesn’t get grumpy, and it can handle the growing volume of applications that we’re seeing. And we did this in such a way that we feel quite confident in the results and that we’re not introducing new biases or amplifying anything that was present in the historical way we screened our applications. So this has been something that has been really helpful and that we’re very excited about.

Toni Gallo:

I think our listeners, and probably readers of both of your papers, are thinking, “Wow, this sounds really awesome. I’m very excited by this.” But there are also lots of concerns and challenges that people think about when they hear about some of these AI tools, whether it’s ethical issues or, Marc, as you mentioned, introducing bias, or data privacy when you have these large quantities of data and different tools having access to them. So I wonder, for all of you, what are some of the considerations on your mind when you’re thinking about AI in the faculty use space, and what should our listeners be thinking about?

Marc Triola:

The ones you just mentioned are very important. And the first thing that faculty and students should realize is that when you’re using a public AI system like ChatGPT on the web, you are contributing your data to another system that is outside of your control. So protected health information, medical student data, and research data should not be placed into public AI systems. One of the first things we did at the beginning of this, here at our medical school, was to get our own HIPAA-compliant versions of GPT-3.5 and GPT-4 that we could experiment with and make available to our community to really explore all of this.

But it’s going to be a really big challenge, because as companies add AI to email, to Gmail, to Google Docs, to Microsoft Word, to PowerPoint, I think faculty, students, and residents are going to be using AI sometimes without even fully realizing it. And so one of the most important things is to have a policy and guidance for your community as to what the appropriate use of these systems is from a data privacy point of view, and maybe some guidance around appropriateness as well.

Because in health care we’re talking about using AI to communicate with patients and to give feedback to students and to faculty, and we need to decide when that should be delivered by the system and when it should be delivered by a person, even when the AI is sometimes better than the person at delivering that information. These are all things we’re going to need to think through. And we’re going through a very rapid implementation and development process that has given us tools faster than we’ve thought through these issues of privacy and appropriateness. So I think it’s an important discussion that has to happen at medical schools across the country.

Brian Gin:

Yeah, I absolutely agree with what you’ve mentioned, Marc, with respect to the privacy of the data, especially as it pertains to uploading data to the cloud, having it processed by OpenAI’s ChatGPT or Microsoft’s Bing and so forth, and how that just gets integrated into their systems over time. One of the ways I often like to think about AI models, whether generative or predictive in other ways, is whether I trust that model. Because I think trust is such a key element of how we operate in medicine and how it underpins our ability to do our work and take care of patients.

With respect to trust, I would ask: what are the features of an AI helper or a generative agent that would lead us to consider it trustworthy? I think that’s an interesting thing to think about, so I’ll just open that up. I like to think of it this way because, as we think about entrustment in assessment, entrusting learners and so forth, there are various criteria we use to decide whether or not to trust a trainee to perform a clinical task. And so we could ask the same question of an AI model: would we find it trustworthy to perform a task?

Borrowing one of those frameworks, I think from Mayer et al, the trustworthiness of an individual comes down to 3 aspects. The first is ability to perform the task: does the AI have the knowledge and experience required to perform that task? And even if it does, was that knowledge and experience built in a context that’s actually applicable to the question we’re answering? So I think, as you mentioned, Gustavo, the AI is only as good as its training data. If that training data came from a certain population or a certain group of learners, and we try to transfer the model to another group, is that valid? And furthermore, how do we even know whether it’s valid or not?

For instance, with ChatGPT, we don’t have access to the training data that was used, but in Marc’s paper, for example, he has a high-quality data set from his own institution, which he applies to further use at the same institution. In that case, the validity is better established than, say, for a generative AI whose training data are not transparent. The other aspects include benevolence and integrity. With benevolence in particular, one of the big questions is: does the trustor of the AI believe that the AI has their best interests in mind? That’s a very odd question, right? Because AI doesn’t face sanctions and doesn’t carry responsibility. But if the AI gives you a faulty answer, who’s responsible for that wrong answer? Particularly if it comes to patient care or making a decision about a student, who’s responsible for that? And I think that’s why it’s so important, as Marc mentioned, that we continue to have humans involved in this process. And knowing when to do that, I think, is absolutely key.

And lastly, when it comes down to integrity, the question is whether an AI would act in an honest and ethical manner. I think this relates to what was mentioned about bias, particularly algorithmic bias, for instance gender bias and bias related to the demographics of individuals. If we change the demographics of the individuals, do we get the same result from the algorithm, or does the algorithm tend to give a more favorable impression of one group versus another? There are various groups working on that as well. So it’s interesting to think about how some of these human qualities can be asked of an AI, and how far we are from being able to apply these ideas in this context.

Christy Boscardin:

And I would just add that it’s not just about understanding AI and its limitations; the ethical use of AI also needs to be a consideration. I think both faculty and students need to understand what the ethical uses are. And again, there are issues around misinformation and bias, and around how using tools like ChatGPT or generative AI can actually perpetuate those biases. Really thinking about AI ethics will be the key here.

Gustavo Patino:

So that tees us up perfectly for our next question, which is: how do we share learning and best practices around the use of these AI tools and techniques, and how do we hold AI scholarship in health professions education to the same standards that we apply to, for example, classical statistics?

Marc Triola:

That’s a very interesting question, because for traditional machine learning, like our paper where we did a randomized trial, we calculated areas under the curve; we could do formal measures of accuracy. Christy could speak to this much better than I can, but there are established statistical measures of the performance of a model. When it comes to large language models, it’s a fascinating new world, because with a typical temperature setting, and temperature determines the randomness of the system’s output, you’re going to get a different answer every time. And these answers are in general text, or now images or videos or music, things that are subjective and that up until now have really been measured by their ability to solve somewhat straightforward challenges like multiple-choice questions, or by human reviews of whether an answer sounds right or seems good. That level of evidence is difficult to generalize and difficult to replicate; it’s very burdensome. So this is a big area of AI development.

And the newest models of ChatGPT that were just released enable you to do things like set the random seed that’s used, so you can replicate, on a letter-by-letter basis, the responses that come from these systems. That opens up assessing their accuracy and validity, and more generalizable approaches to this, in a much better way, I think. It’s interesting and challenging. Now, all of that being said, even for traditional AI, there are over 500 AI systems that have been FDA approved for diagnostics and therapeutics, but only about 150 published randomized trials of these AI systems. The need for high-quality and rigorous assessment of their accuracy, impact, and utility is vast, not just in the rapidly growing area of GPTs, but even in the traditional machine learning realm of diagnostics, therapeutics, and other biomedical algorithms.
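For readers who want to try this, here is a minimal sketch of pinning the temperature and random seed with the OpenAI Python client; the model name and prompt are placeholders, and reproducibility is best-effort on the provider’s side rather than guaranteed.

```python
# Minimal sketch: pin temperature and seed so repeated calls are (mostly) reproducible.
# Model name and prompt are placeholders; reproducibility is best-effort, not guaranteed.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",   # placeholder model name
        temperature=0,    # minimize sampling randomness
        seed=12345,       # fixed seed enables (best-effort) reproducible outputs
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

first = ask("Write a one-sentence summary of aspirin's mechanism of action.")
second = ask("Write a one-sentence summary of aspirin's mechanism of action.")
print(first == second, first)
```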

Christy Boscardin:

Yeah, I absolutely agree with Marc. I think we should be working on reporting standards for the use of AI and AI analytic tools, including generative AI tools like ChatGPT. At a bare minimum, there should be transparency and appropriate attribution of these tools. And as the tools improve, I hope we will even be able to cite the primary sources of the data or information used to create an output. Right now, that’s one of the biggest limitations of these tools. Once we get more information about how outputs are created, attribution to the right primary sources will be helpful for critical appraisal of these data.

Also, to tag onto what Marc just mentioned about some of the machine learning and large language models used in medical education, I would love to see more transparency in documenting the steps of the analytic process, including the rationale for the chosen classifiers, the analytics used, the validity evidence, the algorithms used, and how the training data were handled. What were some of the decisions made in that process? And then really providing some justification for the statistical modeling. We are used to seeing model fit and validation reported, but we don’t typically see that now when people report on machine learning and natural language processing models.

And I have to give a shout-out to Marc and his team, because I think they did a beautiful job in their recent publication of outlining the choices and decisions they made, the algorithm, and the validation process. I think that was really, really helpful for all the readers out there.

And I absolutely agree that we need to move towards increasing integration of AI. As we think about AI as an analytic tool, we need some specific guidelines, not just for researchers but also for reviewers, so they have a set of things they should be looking for. We don’t have that currently, so that’s an area we should be working towards.

And in terms of your first question, Gustavo, about resources, I think we’re still in the early phases of implementing these tools, so sharing resources and best practices is going to be critical. I’m a firm believer that sharing is caring, and I think we really need to hold each other accountable. I would love for us to have a regular webinar series to share knowledge across multiple institutions, and I would love to see a forum to help with the rapid dissemination of resources, so that we don’t have to wait for traditional journal articles but have something that can augment them. And I would love to partner with someone who could help create a website where we could share best practices and host a toolkit for faculty and learners.

Lastly, I also think it would be really important to bring AI developers and medical educators together to start thinking about how to codevelop and improve these tools, both to optimize adoption and to reduce potential areas of bias and misinformation. And I already know, for example, that many of the large publishing companies that hold clinical content are very interested in potential collaborations with medical education faculty and with AI companies. So I think that’s probably a really good area for us to explore.

Gustavo Patino:

I’ll ask a quick follow-up question. Both of you mentioned better descriptions of the methods, and Christy, you mentioned the question of why we pick a certain machine learning algorithm over another. Usually in the papers that have been published, there’s a mention of, “Oh, this algorithm gave us this type of performance,” but we don’t see much of what we usually see in other industries: that we tried all these different ones, and this is the one that worked best.

So I wanted to pose the question: in our field, especially because we are so new to this area, is that something we should expect? That people try a few different ones and we learn which one performs best with that specific data set, and with the data engineering and data management done before the data get put into the model?

Christy Boscardin:

We experimented with several different classifiers, and we also recognized some potential biases when we were doing the data analysis. So we were able to really delve into the limitations of some of these different classifiers and algorithms, and Brian was able to do a workaround to limit those biases. Brian, do you want to chime in about the specific example we encountered with our data set?

Brian Gin:

We were trying to examine the different factors relating entrustment levels to narrative data, essentially trying to predict entrustment from narratives. One of the key pieces of that was determining the sentiment of the narrative, whether a narrative had a positive or negative emotional connotation. The initial model we tried was based on one of the earlier large language models, BERT, which in this particular case had been trained on PubMed, and the training data we used came from the Stanford Sentiment Treebank and another movie review data set. But we found, with this combination and this particular model, that when you just exchanged the pronouns from he to she or to they, a very different sentiment was predicted for the narrative: it was biased toward the male gender and most negatively biased when a gender-neutral pronoun such as they/them was used.

To mitigate this, we decided to build another model, and we crafted it to be gender agnostic. We trained it so that the algorithm itself didn’t know the gender of the student being discussed: we changed all the pronouns to they or them and removed all the gendered nouns as well. In so doing, we could remove that bias. That worked in our case, where gender wasn’t an important consideration for predicting trust, or rather where we theoretically believe that gender should not affect trust. But you could imagine that in a generative AI such as ChatGPT, Bing, Bard, or any of the others, if you made it gender agnostic, it wouldn’t know what gender was at all, right? So in those algorithms, there has to be a much more sophisticated way of mitigating gender and other demographic-related biases.
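Here is a minimal, generic sketch of both steps: probing a sentiment model for pronoun sensitivity and then neutralizing gender cues before scoring. The off-the-shelf pipeline, word lists, and example sentences are illustrative assumptions, not the PubMed-trained BERT model or the exact preprocessing used in the study.

```python
# A minimal sketch of the two steps described above: (1) probe a sentiment model
# for pronoun sensitivity, and (2) neutralize pronouns before the text reaches the model.
# The default pipeline, word lists, and example text are illustrative assumptions.
import re
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic default English model

# (1) Probe: does swapping only the pronoun change the predicted sentiment?
template = "{} presented the plan confidently and the team trusted {}."
for subject, obj in [("He", "him"), ("She", "her"), ("They", "them")]:
    result = sentiment(template.format(subject, obj))[0]
    print(subject, result["label"], round(result["score"], 3))

# (2) Mitigate: strip gender cues so the downstream model never sees them.
NEUTRAL = {
    r"\b(he|she)\b": "they",
    r"\b(him|her)\b": "them",
    r"\b(his|hers)\b": "their",
    r"\b(himself|herself)\b": "themself",
}

def neutralize(text: str) -> str:
    for pattern, replacement in NEUTRAL.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(neutralize("She presented the plan herself and the team trusted her."))
# -> "they presented the plan themself and the team trusted them." (casing ignored here)
```

If the probe's label or score shifts with the pronoun alone, the model is encoding gender in its sentiment judgment, which is the kind of bias described above.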

Gustavo Patino:

Then, following up on that, what advice would you give to faculty who are just starting to use these AI tools, and what resources would you recommend for them? For somebody who’s studying this, what is the one resource where you’d say, “You need to check this out”?

Marc Triola:

So I’ll say: experiment. These tools are amazingly good, and they’re getting better. There is an art to using them; it’s more of an art than a science, I would say. And I’m talking about generative AI like ChatGPT, Claude, Bard, and others, as was mentioned. How you write your prompt determines the quality of the answer. If you go online and just Google “give me some advice for writing a good prompt,” you’ll be amazed at how complex and sophisticated the prompts people write to get good answers are. But practice and learn how to write a good prompt, and you’ll get vastly better results from the model.
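As a purely illustrative example of that advice (not a prompt taken from the episode or the papers), compare a bare request with one that specifies role, audience, format, and constraints:

```text
Weak prompt:
  Write a question about heart failure.

Stronger prompt:
  You are an experienced internal medicine educator. Write one board-style
  multiple-choice question on the initial management of acute decompensated
  heart failure for third-year medical students. Include a short clinical
  vignette, five answer options (A-E), the correct answer, and a two-sentence
  explanation. Do not give away the answer in the stem.
```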

In terms of plugging a good resource, I want to plug the AI Breakfast newsletter. So Google “AI Breakfast”; it’s a newsletter that comes out every week and is a nice summary of all of the happenings in AI, and the person who writes it usually includes a couple of things about health care. It keeps you up to date. It’s intimidating, because this is literally changing every day, but it’s something that’s very helpful.

And I guess, as an asterisk, a second one, just showing you how important this is: the New England Journal of Medicine is coming out with its own journal solely focused on AI, NEJM AI, and they already have some articles in preprint available on their site. So those are the ones I would plug.

Christy Boscardin:

And I will shamelessly plug our paper here and mention that in it we provide a table of resources that we think will be useful for educators and institutions as they develop AI literacy for teaching and learning. I also want to recommend a great resource page curated by Dr. Margaret Merrill from UC Davis, who has been updating resources around ChatGPT on a regular basis; I find her Google Doc super, super helpful. So I would recommend those. I also love all the resources that Marc mentioned; I think those are fabulous as well.

Brian Gin:

I’ll also put in a plug, for people interested in experimenting with large language models, for the openly released model weights. You can download the tuned parameters for these large language models and run them on a single computer, and if that computer is secure, then you have your own ChatGPT-like chatbot that you can run behind a secure firewall. Resources for those would be related to, I think, Facebook’s LLaMA; Falcon is another model, among various others. And Hugging Face, Reddit, and Medium.com have quite a bit of information about these semi-open-source large language models, which are getting quite good.
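For anyone curious what running an open-weight model locally looks like, here is a minimal sketch using the Hugging Face transformers library. The model name is a placeholder for whichever open chat model you are licensed to use; a GPU (or a smaller, quantized model) is assumed for reasonable speed, and nothing here reflects a specific institutional setup.

```python
# Minimal sketch: run an open-weight chat model locally so no data leave your machine.
# The model name is a placeholder; expect to need a GPU or a quantized model for speed.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "tiiuae/falcon-7b-instruct"  # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Summarize the key themes in this clerkship evaluation: ..."
print(generator(prompt, max_new_tokens=200)[0]["generated_text"])
```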

Christy Boscardin:

Yeah, the text-generation web UI has a ton of resources. I also want to recommend that people using generative AI tools try out Bing, Bard, and OpenAI’s ChatGPT and compare the capabilities of all 3. I would also plug Elicit; it’s a summarization tool that you can use for literature review and literature synthesis. It’s free to access, and I think people should be checking it out. Brian’s a huge fan of Stable Diffusion and also LLaMA and DALL-E; I think we should check those out too, because they’re really helpful for creating visuals. There are a ton of free resources out there, so I would recommend just playing around with them.

Toni Gallo:

Oh my gosh, that list is great. But I wanted to come back to the idea of trust that Brian mentioned earlier. I think there is a lot of mistrust in AI tools right now, and I don’t know if that’s because we don’t yet have a huge body of research supporting their use, or because we’re not quite sure how they’re built or what has gone into building them. But how do you think we can build trust? How can we help people learn when they should and shouldn’t trust an AI model? Because I think that’s going to be a big question going forward: how do I know if this is a model or a tool that I can trust?

Marc Triola:

I think, as Brian pointed out, trust in terms of understanding how the model works and what’s in there is difficult, if not impossible, especially when it comes to these gigantic models that are essentially the whole internet represented mathematically. Understanding bias in those is exceedingly difficult just because of how complex and nuanced a vast repository of human knowledge is. But having some standards of accuracy, fidelity, and transparency, as Christy mentioned, about how these models are trained is going to be essential, especially when it comes to health care.

And to that end, this has been a world that has been completely unregulated, including in health care, up until now. There are a couple of efforts. The FDA has recently launched a new task force to look at things like artificial intelligence and virtual reality within its current regulatory processes, to think about how you FDA-approve a large language model that might be writing clinical notes that become part of the electronic health record, or summarizing the results of an imaging study or a surgery, or something like that. Those things are going to be important.

The National Academy of Medicine has also launched an effort to define principles around the ethical and equitable use of AI in health care and to ensure that those tenets are maintained throughout.

And the last thing I’ll say is that a lot of people sometimes worry that AI is going to replace their job or replace the thing that they view as theirs, the thing that defines our identity as a physician or a researcher. So far that is not the case, and hopefully it never will be. These tools are about giving us all superpowers to do what we’re doing better, to handle the frenetic pace of academic medicine with more grace as these tools sit beside us and help us. They’re not about replacing doctors or turning the doctor-patient relationship into a transactional one; quite the opposite. If we do this well and take a lot of the scut work off the plates of busy doctors, residents, and medical students, they’re going to have more time to engage directly with patients, and the care they deliver is going to be enhanced because of the AI copilot sitting next to them, giving them suggestions, guiding that care, and integrating even more data.

I think trust and verification is key, but I think there’s also a lot of room for optimism as we look to embracing these tools going forward.

Christy Boscardin:

I agree. I’m cautiously optimistic about the potential benefits of these tools in medical education, but I think we need to make sure that we are evaluating the impact and the efficacy of these tools for learning and teaching. And we also need to be vigilant and monitor potential issues around bias and ethical use of these tools and unintended consequences of these tools. And I also want to work with Marc maybe to create standards around AI reporting because I think he did a fantastic job of modeling that, so maybe stay tuned for that.

Gustavo Patino:

Following up on that, and especially on the interpretability of the models: in the last half decade, we’ve also seen an explosion of techniques that aim to make these models more interpretable. While we might not understand every single parameter, they give us an idea of which things are important in specific cases, of what is being taken into account. So do you feel that when new models are reported, we should expect those interpretability techniques to be applied to them? Should that be a standard part of the reporting process?

Marc Triola:

Yes. And I think we’re going to see a lot of competition. Brian mentioned the open source models on Hugging Face. The name is crazy, but go to that website and you’ll be amazed at the 300,000 models and data sets that are available there. OpenAI is not revealing how they trained their model because it’s a trade secret; otherwise, everyone else could train essentially the same model in the same way. Whereas on open source platforms like Hugging Face, or with the models being developed by all of our medical schools, there is a commitment to transparency and a commitment to curating the training data set very intentionally. In those cases, you’ll see not just “Here’s the model” but “Here is the actual training data set; replicate it yourself, analyze it, find flaws in it, and build it together.”

It’s a scientific approach that engenders more trust because of its transparency. It doesn’t necessarily mean it’s perfect, but it means it’s one that can be tested, experimented with, retested, and retried, and I think that’s going to be essential. It’s difficult to use GPT-4, which is so good, delivers amazing medical information, and answers 86% of USMLE questions correctly, while truly not knowing what went into it or how it was trained at all; I would think that would make anyone feel uneasy.

Brian Gin:

Yeah, I absolutely agree. I’ll mention the H word in AI, which is hallucinations, because I think that’s what everyone fears in AI: that it will create a response that it asserts as accurate but which is not accurate. And I think that’s one of the big limitations right now in generative AI, that the algorithm has very little insight into how it thinks, given that it is a probabilistic model, right? It chooses responses based on the probability of their being, I guess, acceptable to humans. If you could reveal some of that, even just slightly, it might help us get a better sense of how certain it is about the different answers it provides.

Even though AI is this black box phenomenon where we don’t really understand how it makes decisions, the human mind is not so different, right? The human mind is a bit of a black box as well. But what we have learned as humans is how to critique ourselves, to think about where our answers come from and how certain we are of those answers. And I think this is something we can train future AIs to do as well: to provide answers while also trying to convey the certainty with which they are purporting to give us information.

Toni Gallo:

I want to give each of you a chance to share any final thoughts with our listeners. What are some of the questions you’re excited to explore? Where do you think AI is going to go in medical education? What are you hoping for in the future? Brian, I’ll start with you.

Brian Gin:

One of the reasons I think AI has gained so much popularity in just the last few months, really ever since these chatbots came out, is that it no longer requires people to have computer knowledge. It doesn’t require you to be a coder. It doesn’t require you to go into Python or even know statistics or anything like that. You don’t even need to know what the underlying neural networks in a lot of these models are. This evolution in AI technology, relying on natural language to access it, removed one of those roadblocks, and that allowed it to take off in the last few months.

What I’m excited about, then, is not only that, but also that, having removed those barriers, it’s allowing many more of us who may not consider ourselves computer scientists or AI experts to develop expertise with AI without having to know those areas. So I think it’s really inviting a lot more of us to the table in a way that brings many more disciplines, such as ethical considerations and the philosophy around trust, to thinking about how AI works. And what I’m most excited about is being able to have this interdisciplinary collaboration around AI, afforded by this decreased barrier to accessing it.

Toni Gallo:

Marc?

Marc Triola:

I’m a big optimist about all of this, as I think I’ve made clear. I’m very excited about precision medical education, this concept that we can use AI to give superpowers to our faculty, to our coaches, and to our students and residents themselves, to deliver a curriculum or an assessment, as Christy and Brian have written about, that is customized to each student, that adapts to their aspirations and goals, and that really surrounds them with a set of content, assessment items, nudges, and suggestions, some of which are AI suggested, all of which flow through a human coach who can be the guide for a learner as they develop mastery.

And I think AI can really help us scale this. It can help us understand all of that evaluation text and data as was mentioned. It can write new assessment items for everybody. It can bring together the latest educational resources or medical literature and hand it to a learner in a way that’s appropriately customized for them.

I’m just really excited about this because I think that for the first time, as was mentioned, these are the accessible tools that every medical school could adopt and engage in. And I think they’re going to help every medical student and every resident achieve even more, learn even better, and have a true precision customized medical education experience as they go through. That’s what I’m most excited about.

Toni Gallo:

Christy?

Christy Boscardin:

Yeah, absolutely agree. I’m also an optimist like Marc and definitely see a potential to have a positive impact on our learning and teaching, but I also see the potential for really increasing our knowledge generation, specifically in research. I think these provide different ways of looking at data and different ways of doing research, so I hope that this will actually facilitate different kinds of collaboration, different kinds of research in medical education, and I see that this could potentially change the way we do research.

Toni Gallo:

And Gustavo, any final thoughts you want to add?

Gustavo Patino:

No, I’ll just echo some of the things that were already mentioned. I’m very excited for all the new manuscripts that will be coming out using these technologies, as Christy mentioned, and for being able to relieve some of the more mundane duties that our faculty have, as Marc mentioned. And echoing what Brian was saying, I actually hope that more people will become interested in computer science topics, will delve into the more traditional machine learning algorithms that Marc was mentioning, and will want to learn Python or R or get into more programming.

Toni Gallo:

Thank you all so much for joining us today. I hope our listeners are excited and optimistic too. I encourage everyone to visit academicmedicine.org to find the articles that we discussed today, as well as other articles on the use of AI in medical education. Those are listed together in the Artificial Intelligence and Academic Medicine collection. And you can find that under the Collections tab on the journal’s homepage.

Be sure to also claim your CME credit for listening to this podcast. All you have to do is visit academicmedicineblog.org/CME, listen to the episodes that are listed, and then follow the instructions to claim your credit. There’s no cost to you for this service.

From the journal’s website, you can also access the latest articles in our archive dating back to 1926, as well as additional content like free eBooks and other article collections. Subscribe to Academic Medicine through the subscription services link under the Journal Info tab, or visit shop.lww.com and enter “Academic Medicine” in the search bar.

Be sure to follow us and interact with the journal staff on X, formerly Twitter at @AcadMedJournal and on LinkedIn at Academic Medicine Journal. Subscribe to this podcast anywhere podcasts are available. Be sure to leave us a rating and a review when you do. Let us know how we’re doing. Thanks so much for listening.