Presentation
PS16 - Using ChatGPT to Conduct Directed Content Analysis to Identify Multi-Level Contributors from Patient Safety Incident Reports
Session
Poster Session 2
Description
Introduction: Generative artificial intelligence (AI) has been increasingly studied in healthcare for its potential to automate time-consuming and labor-intensive processes. These include summarizing conversations during patient visits into clinical documentation and simplifying the written language used in patient education materials. Although these use cases can reduce the time clinicians spend on administrative tasks, other use cases may reduce the time spent by other healthcare workers, such as patient safety and quality improvement staff.
Patient safety reports represent another potential use case for generative AI because reviewing their free-text data takes a substantial amount of time. This review time can create bottlenecks and unnecessary delays in developing and implementing solutions within the healthcare organization to address the patient safety concern involved. While a report awaits review, similar incidents may occur, with adverse implications for costs, patient experience, and quality of care.
Generative AI offers promise in addressing this bottleneck. However, little is known about how generative AI can be used in analyzing patient safety reports. Thus, this study aims to assess ChatGPT’s performance across accuracy, consistency, and completeness when asked to conduct a directed content analysis of patient safety reports to identify work system factors that may have contributed to the event. Our findings can help foster this burgeoning line of inquiry into generative AI’s capabilities and uses in human factors research and applications. The findings may also benefit health system leaders who are exploring ways to optimize response to patient safety incidents in a timely manner.
Methods: Our data source was the patient safety case studies from the Patient Safety Network, which collects case study submissions from health care organizations and makes them publicly available to others to learn from. We selected ten case studies that were heterogeneous in care settings, errors involved, and personnel affected to demonstrate feasibility of the analytic approach across diverse types of case studies. Reports were formatted in a narrative structure and did not include images or tables.
We used the second version of the Systems Engineering Initiative for Patient Safety (SEIPS 2.0) framework to guide the directed content analysis. This framework suggests that six work system and structural factors may influence processes and outcomes: 1) person, 2) organization, 3) tools and technologies, 4) tasks, 5) internal environment, and 6) external environment.
To explore generative AI’s ability to analyze the case studies using SEIPS 2.0, we developed two types of agents using ChatGPT version 4o (OpenAI: San Francisco, CA). The first agent was a web-based conversational agent, which represents the most common type of agent, where the information used to answer user queries primarily comes from online sources. This web-based conversational agent was not provided with any specialized information regarding the SEIPS 2.0 categories and definitions beyond the online resources it already had access to. The second agent was a specialized conversational agent whose information seeking was restricted to information about SEIPS 2.0 that the research team had pre-uploaded into the agent’s knowledge base. Both agents were provided with similar instructions and information about the agents’ objective, what the case study contained, how the final results should be formatted, and conversation starters (i.e., buttons presented to the user to initiate specific actions).
For both agents, the procedure involved first uploading a patient safety report to be reviewed into the agents’ knowledge base. The analysis process was then initiated by clicking the conversation starter “analyze case study”. Once the agents finished the analysis, we recorded the identified contributors on a spreadsheet. We then terminated the chat session and began a new chat session to begin the next trial of analysis. We did this because our pilot testing found that continually running analyses in the same chat session can adversely affect the results. Each case study was subjected to ten trials of analysis. Since we had ten case studies, a total of 100 trials were performed.
To assess the agents’ performance, we evaluated the outputs for each case study for accuracy, consistency, and completeness, as adjudicated by research team members who were familiar with the content of each report. For accuracy, we computed two scores. The first was the percentage of trials in which the agent applied the appropriate SEIPS 2.0 framework (rather than other versions). The second was the proportion of trials in which a specific work system factor was correctly coded by the agent using the SEIPS categories. For consistency, we calculated the proportion of trials in which the AI identified the same vs. new work system factors across the ten analyses of the same case. For completeness, we calculated the number of trials before the agent stopped identifying any new work system factors for each case. All statistical analyses were done in Stata SE 18.0 (StataCorp: College Station, TX).
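The consistency and completeness measures described above can be illustrated with a minimal Python sketch. It is not the study's analysis code (the analyses were run in Stata); the function names and the example contributor labels are hypothetical, and each trial is assumed to be recorded as a set of contributor labels for one case study.

```python
from collections import Counter


def contributor_frequencies(trials):
    """Consistency: share of trials in which each contributor appears."""
    counts = Counter(c for trial in trials for c in set(trial))
    return {c: n / len(trials) for c, n in counts.items()}


def last_trial_with_new_contributor(trials):
    """Completeness: 1-based index of the last trial that surfaced
    a contributor not seen in any earlier trial."""
    seen = set()
    last_new = 0
    for i, trial in enumerate(trials, start=1):
        new = set(trial) - seen
        if new:
            last_new = i
        seen |= new
    return last_new


def first_trial_is_complete(trials):
    """True if the first trial already contains every contributor
    identified across all trials of the case."""
    all_contributors = set().union(*trials)
    return set(trials[0]) == all_contributors


# Hypothetical data: four trials for one case study (labels are invented).
example_trials = [
    {"staffing", "alert fatigue"},
    {"staffing"},
    {"staffing", "handoff workflow"},
    {"alert fatigue", "staffing"},
]

frequencies = contributor_frequencies(example_trials)
stopping_trial = last_trial_with_new_contributor(example_trials)
complete_at_first = first_trial_is_complete(example_trials)
```

In this invented example, one contributor appears in every trial, another first appears in the third trial, and the first trial alone does not capture all contributors.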
Results: For accuracy, the specialized agent always (100%) used the correct version of the SEIPS framework, whereas the web-based agent used the correct version in 35% of trials. When comparing the agents based on only the trials that used the correct SEIPS framework, there were no differences in errors made for contributors related to person, organization, tools and technologies, tasks, and external environment. However, the web-based agent misclassified organizational contributors (e.g., roles and responsibilities, workflows) as internal environment.
For consistency, across all ten case studies the contributors that the agents identified in the first trial were not always re-identified in subsequent trials. Conversely, contributors that were not identified in the first trial were identified in at least one of the subsequent trials. Thus, there was no instance (0%) in which the first trial reflected all possible contributors. When examining the proportion of trials in which specific contributors appeared, some contributors appeared only once across the ten trials while others appeared in all trials. However, most contributors appeared in fewer than half of the trials.
For completeness, we attempted to identify a comparable point in each case study at which the agents no longer identified new SEIPS contributors. However, we did not observe any consistent pattern or stopping point for the agents. The agents’ performance ranged from identifying all unique contributors in the first trial to still identifying new contributors in the tenth trial. Furthermore, there were no immediately identifiable differences across the cases used in this study that would explain this varied behavior by the agents.
Discussion: Overall, our findings suggest that ChatGPT has the potential to analyze patient safety reports while maintaining some level of accuracy. Accuracy was more pronounced with the specialized agent than with the web-based agent. For consistency, both agents failed to repeatedly identify the same SEIPS contributors across all ten trials. Both agents were generally comparable in terms of completeness. These findings suggest that specialized agents offer users an advantage in accuracy.
This study’s limitations highlight additional areas for further research. First, our study used directed content analysis to guide the analytical process. However, inductive thematic analysis represents another approach for analyzing free-text data. Future research should investigate how ChatGPT performs when conducting an inductive thematic analysis. Second, analyzing patient safety reports for contributors alone is not sufficient to address patient safety incidents. Rather, the identified contributors are used to guide the development of recommendations. Further research is needed to assess the agents’ creative abilities to develop solutions.
Event Type
Poster Presentation
Time
Tuesday, April 1, 4:45pm - 6:15pm EDT
Location
Frontenac Foyer



