Presentation
PS4 - Evaluating Generative AI for Inductive Thematic Analysis of Patient Safety Reports
Session
Poster Session 1
Description
Background: Incident reporting is a cornerstone of risk management in healthcare systems, playing a critical role in identifying, mitigating, and preventing harm. These reports, often written by frontline workers, contain accounts of patient safety incidents in an unstructured, narrative format. However, despite the extensive data collected, incident reporting alone has not been sufficient to prevent harm. A key challenge lies in the sheer volume of reports, which frequently go unanalyzed due to the resource-intensive nature of qualitative analysis. Traditional methods such as grounded theory and thematic analysis, which rely on inductive and deductive approaches, are often employed to process this data. Nevertheless, the manual effort required to qualitatively code and interpret large amounts of unstructured data is often impractical, leading to delays or incomplete analyses.
The rise of Generative AI (Gen AI) technologies, particularly large language models (LLMs) like ChatGPT, offers a promising solution to this problem. These models have shown remarkable abilities in natural language understanding and generation, opening up the possibility for AI to facilitate the qualitative analysis of unstructured healthcare data. Although research on using LLMs for thematic analysis is advancing across various fields, little is known about their ability to perform inductive coding, especially in the context of healthcare data.
In this study, we explore the potential of AI agents to perform inductive thematic analysis on patient safety reports. By automating aspects of the analysis process, AI could help healthcare organizations overcome the barriers posed by the large volume of incident reports, allowing for more timely and comprehensive identification of patient safety issues. However, before AI can be trusted for these purposes in the field, empirical research is needed to validate its performance under tightly controlled settings and varying conditions. The purpose of this study, therefore, was to evaluate the performance of ChatGPT in identifying and categorizing themes within patient safety reports, offering insights into how AI might be integrated into qualitative healthcare research and quality improvement efforts.
Methods: This study used a dataset of 12 patient safety reports, all focused on palliative and end-of-life care, sourced from the PSNet website. Each report consisted of a narrative description of the patient safety incident followed by an expert commentary. To prepare the case studies for analysis, we removed sections such as authors' information and affiliations, case objectives, take-home points, and the reference list, keeping only the case description and commentary. Additionally, we removed all in-text citations, web links, figures, and tables included in the commentary.
We employed Braun and Clarke’s six-phase approach to thematic analysis, utilizing specialized ChatGPT agents. The goal was to evaluate whether these agents could perform inductive thematic analysis by identifying key themes related to patient safety, rather than executing a full-scale problem analysis. As such, Phase 6, which involves reporting the findings of the thematic analysis, was excluded from this study.
Due to the complexity of the thematic analysis process and the word limit for instructions in ChatGPT, we divided the tasks across three AI agents. This segmentation allowed us to break down complex instructions into more manageable pieces, improving the clarity of the tasks given to the AI.
1. Phase 1 (Familiarization with Data): Agent 1 was tasked with reading through the 12 patient safety reports to identify patterns and important details related to patient safety incidents.
2. Phase 2 (Generating Initial Codes): After familiarization, Agent 1 systematically generated initial codes by identifying recurring elements and ideas in the reports. These codes represented core issues within the data, such as patient safety risks, organizational challenges, and systemic barriers to care. To ensure a comprehensive identification of codes, Phases 1 and 2 were repeated 10 times in separate chat sessions.
3. Phase 3 (Searching for Themes): In this phase, Agent 2 was tasked with organizing the initial codes identified in Phase 2 into potential themes. It grouped similar codes together to identify broader patterns within the data, looking for connections between the various codes that could form overarching categories of patient safety incidents.
4. Phase 4 (Reviewing Themes): Once potential themes were identified, Agent 2 reviewed and refined them. This phase involved checking whether the themes accurately represented the data and ensuring that there were no redundant or contradictory themes.
5. Phase 5 (Defining and Naming Themes): In this final phase of analysis, Agent 3 was responsible for defining and naming the themes. Each theme was given a concise and descriptive label.
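The division of the five phases across three agents can be sketched roughly as follows. This is a minimal illustration, not the study's actual setup: the `Agent` type, the function names, and the prompt wording are all hypothetical, and each agent is modeled simply as a function that takes a prompt and returns reply text. The study's agents were configured inside ChatGPT, and their instructions are not given in the abstract.

```python
from typing import Callable

# Hypothetical stand-in for a chat agent: takes a prompt, returns the reply text.
Agent = Callable[[str], str]


def clean_lines(reply: str) -> list[str]:
    """Split a reply into non-empty lines, dropping leading bullet markers."""
    items = []
    for line in reply.splitlines():
        item = line.strip().lstrip("-*• ").strip()
        if item:
            items.append(item)
    return items


def generate_codes(agent1: Agent, reports: str, runs: int = 10) -> list[str]:
    """Phases 1-2: repeat familiarization and coding, then pool unique codes."""
    pooled: dict[str, str] = {}  # case-insensitive key -> first-seen wording
    for _ in range(runs):
        reply = agent1("Read these reports and list initial codes, one per line:\n" + reports)
        for code in clean_lines(reply):
            pooled.setdefault(code.lower(), code)
    return list(pooled.values())


def find_and_review_themes(agent2: Agent, codes: list[str]) -> list[str]:
    """Phases 3-4: group codes into candidate themes, then review them."""
    candidates = clean_lines(
        agent2("Group these codes into themes, one per line:\n" + "\n".join(codes))
    )
    return clean_lines(
        agent2("Review these themes; merge redundant or contradictory ones:\n"
               + "\n".join(candidates))
    )


def name_themes(agent3: Agent, themes: list[str]) -> list[str]:
    """Phase 5: give each theme a concise, descriptive label."""
    return clean_lines(
        agent3("Define and name each theme, one label per line:\n" + "\n".join(themes))
    )


def run_pipeline(agent1: Agent, agent2: Agent, agent3: Agent,
                 reports: str, runs: int = 10) -> list[str]:
    """Chain the three agents: codes -> reviewed themes -> named themes."""
    codes = generate_codes(agent1, reports, runs)
    themes = find_and_review_themes(agent2, codes)
    return name_themes(agent3, themes)
```

Pooling codes across repeated runs mirrors the ten separate chat sessions described for Phases 1 and 2. The case-insensitive deduplication is an assumption about how repeated codes might be merged.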
We conducted two experiments to assess the AI’s performance in identifying themes. In Experiment 1, we combined all 12 reports into a single document for analysis. This presumably required the GPT to process the entire set of reports and generate overarching themes across them. The combined file contained 41 pages, with approximately 22,500 words, and was stored in the AI agent’s knowledge base. In contrast, in Experiment 2, we instructed the GPT to analyze each report individually. This required the agent to process each report separately, with each file containing approximately 4 pages and 1,900 words on average. The individual files were stored in the agent’s knowledge base, allowing the agent to generate themes specific to each report.
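The difference between the two experimental conditions can be illustrated with a short sketch. It assumes a hypothetical `analyze` function (text in, list of theme labels out) standing in for the full agent pipeline. The function names and the way duplicate themes are merged are illustrative assumptions, not details reported by the study.

```python
from typing import Callable

# Hypothetical analysis step: takes document text, returns theme labels.
Analyze = Callable[[str], list[str]]


def run_combined(analyze: Analyze, reports: dict[str, str]) -> list[str]:
    """Experiment 1 style: concatenate every report and analyze once."""
    combined = "\n\n".join(reports.values())
    return analyze(combined)


def run_individual(analyze: Analyze, reports: dict[str, str]) -> dict[str, list[str]]:
    """Experiment 2 style: analyze each report on its own."""
    return {name: analyze(text) for name, text in reports.items()}


def unique_themes(per_report: dict[str, list[str]]) -> list[str]:
    """Merge per-report themes, keeping the first wording of each unique label."""
    seen: dict[str, str] = {}
    for themes in per_report.values():
        for theme in themes:
            label = theme.strip()
            seen.setdefault(label.lower(), label)
    return list(seen.values())
```

With this framing, the count of unique themes from `unique_themes(run_individual(...))` can be compared directly with the length of the list returned by `run_combined(...)`.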
The performance metric for both experiments was the number of themes identified. Additionally, to further analyze and compare themes, we used the Human Factors Analysis and Classification System (HFACS), categorizing each theme according to its relevance to the HFACS levels: Organizational Influences, Supervisory Factors, Preconditions for Unsafe Acts, and Unsafe Acts. This classification helped in determining the types of themes identified as well as in comparing the themes across the experiments.
Results: The identified themes varied in both breadth and depth across the two experiments. In Experiment 1, broader themes were identified, including lack of follow-up, breakdown in gear communication, lack of documentation, impaired decision-making capacity, and systemic barriers to timely care. In Experiment 2, a total of 29 unique themes were identified from the analysis of the 12 patient safety reports, including the 6 themes already identified in Experiment 1. The additional themes, such as “System Design and Equipment Failures” and “Training and Competency Gaps”, offered more case-specific and nuanced insights into the patient safety events.
Post hoc classification of themes by the researchers using HFACS revealed that Experiment 1's themes primarily fell under the Supervisory Factors (n = 1) and Preconditions for Unsafe Acts (n = 3) levels. In contrast, Experiment 2 provided a more detailed and broader breakdown of themes across all HFACS levels: Organizational Influences (n = ), Supervisory Factors (n = 4), Preconditions for Unsafe Acts (n = 6), and Unsafe Acts (n = 6). Some themes (n = 1 in Experiment 1 and n = 8 in Experiment 2) could not be categorized, as they were related to patient-specific circumstances or were too broad.
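For readers who want to tabulate the HFACS breakdown, the sketch below records the counts exactly as stated in this abstract and sums only the values that are given. The data layout and the `summarize` helper are illustrative assumptions. The Organizational Influences count for Experiment 2 is blank in the text, so it is stored as `None` rather than inferred.

```python
from typing import Optional

# Counts as reported in the text; None marks a count the text leaves blank.
reported: dict[str, dict[str, Optional[int]]] = {
    "Experiment 1": {
        "Supervisory Factors": 1,
        "Preconditions for Unsafe Acts": 3,
        "Uncategorized": 1,
    },
    "Experiment 2": {
        "Organizational Influences": None,
        "Supervisory Factors": 4,
        "Preconditions for Unsafe Acts": 6,
        "Unsafe Acts": 6,
        "Uncategorized": 8,
    },
}


def summarize(counts: dict[str, Optional[int]]) -> tuple[int, list[str]]:
    """Return the sum of the stated counts and the levels with no stated count."""
    known = sum(n for n in counts.values() if n is not None)
    missing = [level for level, n in counts.items() if n is None]
    return known, missing


for name, counts in reported.items():
    known, missing = summarize(counts)
    note = f" (not stated: {', '.join(missing)})" if missing else ""
    print(f"{name}: {known} themes with a stated count{note}")
```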
Conclusions: Our study suggests that AI agents may facilitate inductive thematic analysis of patient safety reports. These models proved particularly useful in generating more nuanced themes when analyzing documents containing a single report. However, the way AI is asked to help analyze reports greatly impacts its output. When multiple reports were combined into a single document, there was a noticeable loss of detail and nuance. This reduction in performance could be attributed to the increased document length, as the AI's ability to capture subtle patterns seemed to diminish as the text grew longer. As AI technology continues to evolve, further research is needed to explore how these tools can be used on larger and more complex patient safety datasets.
Event Type
Poster Presentation
Time
Monday, March 31, 4:45pm - 6:15pm EDT
Location
Frontenac Foyer
Digital Health (DH)
Simulation and Education (SE)
Hospital Environments (HE)
Medical and Drug Delivery Devices (MDD)
Patient Safety and Research Initiatives (PS)


