[Methodology report: development and evaluation of a software tool based on GPT-4 to help sort documents for literature reviews - a proof of concept]

Powell G, Cortez Ghio S, Zomahoun H
Record ID 32018011382
Original Title: Rapport méthodologique - Développement et évaluation d’un outil logiciel basé sur GPT-4 pour l'aide au tri de documents dans le cadre de revues de la littérature : une preuve de concept
Authors' objectives: While essential for INESSS productions, literature reviews conducted through structured literature searches are costly and time-consuming. The screening of scientific articles by reading titles and abstracts requires painstaking, repetitive work. Various semi-automated tools based on machine learning are commercially available, but they require significant user involvement and do not perform at a sufficiently high level to justify their use. This report describes the development and evaluation of a document screening tool based on GPT-4, a large language model (LLM). Because LLMs are trained on vast amounts of textual data, these models can detect linguistic and contextual nuances, enabling them to classify titles and abstracts based on their relevance to a specific set of selection criteria, all without the need for a human labeler.
Authors' results and conclusions: RESULTS: With the basic strategy, we achieved a sensitivity of 92.3% and a specificity of 80.4%. With the sensitive strategy, we reached a sensitivity of 99% and a specificity of 55.1%. A greater improvement was obtained with the ranking strategy by identifying an inclusion threshold (starting from the "unlikely" rank) that yielded optimal performance: a perfect sensitivity of 100% and a specificity of 57.7%. This threshold retains all articles the authors deemed relevant upon reading the full text, while eliminating more than half of the articles to be evaluated. These results suggest an average screening workload savings of 56.3% with no missed relevant articles. Inclusion from the higher threshold (from the "very likely" rank) could save 63.1% of the screening workload with only a 1% rate of missed relevant articles. This strategy also allows the retained documents to be sorted by relevance rank to facilitate the screening process. ETHICAL CONSIDERATIONS: Given that the use of GPT-4 involves sharing information with a third party, users must consider the risks surrounding data confidentiality. Furthermore, risks of bias may emerge from the selection of internet corpora on which LLMs are trained, as well as from the model's interpretation of inclusion and exclusion criteria. Finally, there are important ecological concerns related to the energy consumption needed to train such models. CONCLUSION: The proposed tool, developed as a proof of concept, demonstrates the possibility of significant reductions in screening workload for reviews undertaken at INESSS, with minimal erroneous exclusions of relevant articles. Additional evaluations will be needed to further validate the tool's performance.
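The performance measures reported above follow their standard definitions. As a minimal illustrative sketch (the confusion counts below are hypothetical, not the study's data), sensitivity, specificity, and screening workload savings can be computed from the tool's decisions versus the authors' full-text decisions:

```python
# Illustrative calculation of the screening metrics discussed in the report.
# The counts used in the example are hypothetical, NOT the study's data.
def screening_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, workload_savings).

    sensitivity      = TP / (TP + FN)  -> share of relevant articles retained
    specificity      = TN / (TN + FP)  -> share of irrelevant articles excluded
    workload_savings = articles excluded by the tool / all articles screened
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    workload_savings = (tn + fn) / (tp + fn + tn + fp)
    return sensitivity, specificity, workload_savings

# Hypothetical review: 50 relevant and 450 irrelevant articles, with the
# tool retaining all relevant articles and excluding 260 irrelevant ones.
sens, spec, saved = screening_metrics(tp=50, fn=0, tn=260, fp=190)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} saved={saved:.1%}")
```

Note that workload savings counts every article the tool excludes, so it can differ slightly from specificity whenever any relevant articles are (wrongly) excluded.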
Authors' methods: We identified four literature reviews that were part of INESSS productions and met the criteria we established for evaluating the tool. For the title and abstract of each article identified in each review, we asked the model to assess the article's relevance based on the inclusion and exclusion criteria developed for that review. We employed three strategies to formulate prompts to the model: (1) a basic strategy based on an approach used in a recent publication evaluating the potential of GPT-3.5 for abstract screening; (2) a sensitive strategy modifying the wording of the basic strategy's prompt to avoid excluding relevant documents; and (3) a nine-level ranking strategy (e.g. "almost certain", "extremely likely", etc.) to identify a decision threshold that optimizes the tool's performance. The model's responses to these queries were compared to the authors' decisions at the full-text article selection stage. Our goal was to maximize sensitivity while achieving at least moderate specificity. For the ranking strategy, we evaluated these two performance measures at different thresholds. Finally, we also estimated the potential screening workload savings and the associated risk of erroneous exclusions at different thresholds.
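The ranking strategy's threshold search can be sketched as follows. This is an assumption-laden illustration: the report quotes only two rank labels ("almost certain", "extremely likely"), so the remaining rank names, the data, and the function are hypothetical placeholders, not the authors' actual prompts or results.

```python
# Sketch of evaluating sensitivity and specificity at each inclusion
# threshold of a nine-level relevance ranking. Rank names beyond the two
# quoted in the report are hypothetical placeholders.
RANKS = ["almost certain", "extremely likely", "very likely", "likely",
         "uncertain", "unlikely", "very unlikely", "extremely unlikely",
         "almost impossible"]  # ordered from most to least relevant

def metrics_at_threshold(predictions, labels, threshold_rank):
    """Include every article ranked at or above `threshold_rank`, then
    compare against the authors' full-text decisions (`labels`)."""
    cutoff = RANKS.index(threshold_rank)
    tp = fn = tn = fp = 0
    for rank, relevant in zip(predictions, labels):
        included = RANKS.index(rank) <= cutoff
        if relevant:
            tp += included
            fn += not included
        else:
            fp += included
            tn += not included
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Toy data: the model's rank per article, and whether the authors kept it.
preds  = ["almost certain", "likely", "unlikely", "almost impossible"]
labels = [True, True, False, False]
for t in ("very likely", "unlikely"):
    print(t, metrics_at_threshold(preds, labels, t))
```

Lowering the threshold (moving it toward "almost impossible") trades specificity for sensitivity, which is how a threshold with perfect sensitivity and moderate specificity could be identified.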
Authors' identified further research: We plan to undertake future analyses to evaluate methods of recent preprint articles proposing an open-sourced LLM, to analyze a larger number of reviews, and to develop a prospective study in collaboration with INESSS professionals to ensure the robustness of the model's performance.
Project Status: Completed
Year Published: 2024
English language abstract: An English language summary is available
Publication Type: Other
Country: Canada
Province: Quebec
MeSH Terms
  • Systematic Reviews as Topic
  • Artificial Intelligence
  • Software Validation
Organisation Name: Institut national d'excellence en santé et en services sociaux
Contact Address: L'Institut national d'excellence en santé et en services sociaux (INESSS), 2021, avenue Union, bureau 10.083, Montréal, Québec, Canada, H3A 2S9; Tel: +1 514-873-2563, Fax: +1 514-873-1369
Contact Name: demande@inesss.qc.ca
Contact Email: demande@inesss.qc.ca
Copyright: L'Institut national d'excellence en santé et en services sociaux (INESSS)
This is a bibliographic record of a published health technology assessment from a member of INAHTA or other HTA producer. No evaluation of the quality of this assessment has been made for the HTA database.