Vision-Capable LLMs in Microsurgery: A Blinded Comparison of Two AI Models with Expert Microsurgeons in the Appraisal of 200 Experimental Anastomoses

Victor Esanu, Horatiu Alexandru Colosi, Stefan Agoston et al.

2 May 2026

Abstract

Background/Objectives: Objective end-product assessment of microsurgical anastomoses is intensive and partly subjective. Vision-capable large language models (LLMs) may enable standardized image-based scoring, but their agreement with expert assessment remains uncertain. Methods: We studied 200 end-to-end femoral artery anastomoses performed on chicken legs by novice, intermediate, and experienced microsurgeons. Images were scored independently by two blinded expert panels; disagreements were adjudicated by a third senior reviewer to establish expert consensus. Two LLMs, ChatGPT 5.2 Thinking Extended and Gemini 3.1 Pro, were evaluated using the same prompt and rubric. Each image was analyzed three times per model. Final scores were aggregated by median for numeric items and by majority vote for categorical items. The primary endpoint was exact-match agreement with expert consensus. Agreement within ±1 was also assessed for numeric items. Agreement was measured using simple percentage agreement, Light’s kappa, and Krippendorff’s alpha; Bland–Altman analysis was used for numeric count items. Results: LLM 1 achieved higher overall exact-match agreement than LLM 2 (0.659 vs. 0.539). Both models performed better on categorical than on numeric items (0.713 vs. 0.610 for LLM 1 and 0.651 vs. 0.445 for LLM 2). LLM 1 showed the greatest advantages for gaps, knots, oblique stitches, and wide bites. Krippendorff’s alpha was positive for most endpoints with LLM 1, whereas LLM 2 showed negative values throughout. Allowing a ±1 tolerance for numeric items greatly improved agreement, from 0.610 to 0.900 for LLM 1 and from 0.445 to 0.826 for LLM 2, suggesting that most numeric disagreements were minor counting discrepancies. Conclusions: Under a constrained scoring workflow, LLMs partially approximated intraluminal microsurgical end-product scoring. LLM 1 outperformed LLM 2, but agreement remained insufficient to replace expert assessment. These models can be assistive tools within a human-in-the-loop framework.
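The aggregation and agreement steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy scores, and the "yes"/"no" labels are hypothetical, and the chance-corrected statistics (Light’s kappa, Krippendorff’s alpha) and the Bland–Altman analysis are left out. It only shows median aggregation over three runs, majority voting, and exact-match versus ±1 agreement against an expert consensus.

```python
from collections import Counter
from statistics import median


def aggregate_numeric(runs):
    """Median of repeated numeric scores (e.g. a count item) for one image."""
    return median(runs)


def aggregate_categorical(runs):
    """Most frequent label across repeated runs; ties go to the label seen first."""
    return Counter(runs).most_common(1)[0][0]


def agreement_rate(model_scores, expert_scores, tolerance=0):
    """Fraction of images whose model score lies within `tolerance` of the expert score."""
    if len(model_scores) != len(expert_scores):
        raise ValueError("score lists must have equal length")
    hits = sum(abs(m - e) <= tolerance for m, e in zip(model_scores, expert_scores))
    return hits / len(expert_scores)


# Hypothetical toy data: three runs per image for one numeric rubric item.
runs_per_image = [[2, 3, 2], [0, 0, 1], [4, 6, 5], [1, 1, 1]]
expert_consensus = [2, 1, 5, 3]

# One final score per image, taken as the median of its three runs.
model_final = [aggregate_numeric(r) for r in runs_per_image]

exact = agreement_rate(model_final, expert_consensus)                   # exact match
within_one = agreement_rate(model_final, expert_consensus, tolerance=1)  # +/-1 tolerance

# Hypothetical categorical item scored three times for one image.
categorical_final = aggregate_categorical(["yes", "yes", "no"])

print(model_final, exact, within_one, categorical_final)
```

In this toy example, two of four images match the consensus exactly and a third falls within ±1, which mirrors on a small scale how a tolerance band can raise agreement when the errors are small miscounts. With three runs and more than two possible labels, a three-way tie is possible; the sketch resolves it by first occurrence, which is an arbitrary choice and not one stated in the abstract.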

Metadata

DOI: 10.3390/medsci14020235
License: CC BY 4.0

IPC Classification

G06A61

Keywords

vision-capable, LLMs, microsurgery, blinded, comparison, models, expert, microsurgeons, appraisal, experimental, anastomoses, medical, sciences, background, objectives, objective, end-product, assessment, microsurgical, intensive, partly, subjective, large, language
