Artificial intelligence and sleep medicine: a comparative evaluation of large language models on the examination of the Accademia Italiana di Medicina del Sonno with retrieval-augmented generation
Romigi, Andrea;
2025-01-01
Abstract
Summary. Using sleep medicine guidelines and a textbook, we evaluated four large language models (LLMs) (Llama 3.2 3B, Llama 3.3 70B, GPT-4o mini, Gemini 2.0 Flash) on AIMS certification questions, comparing baseline performance with retrieval-augmented generation (RAG) performance. RAG improved accuracy in all models (e.g., Llama 3.2 by 9.6 points, Gemini 2.0 by 4.0 points), highlighting RAG’s role in enhancing LLM reliability in specialized medical domains.

