WFU

2024/07/17

[EBM feat. AI] Literature Reading - Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models

 

Basic Information

Title: Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models
Journal: JAMA Netw Open 2024 May 1;7(5):e2412687.
PMID: 38776081

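Methodology

● Composition of the working group
The group comprised 3 senior experts in evidence-based medicine methodology and 2 computer scientists. The 2 computer scientists refined and optimized the prompt, while the 3 senior experts supervised the assessment process and set the assessment criteria. All researchers completed one week of systematic review training to ensure a consistent understanding of the assessment process.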

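● Prompt development
The researchers first wrote a draft prompt and tested it on 5 RCTs. The computer scientists reviewed the output and gave feedback for revising the prompt, iterating until the output agreed with the experts' results.

The final prompt has 3 parts: (1) an introduction that sets the role, (2) guidelines for evaluation, and (3) an output format specification. It gives explicit guidance and examples for each RoB domain, instructing the LLMs to extract the relevant information from the original text, make a reasoned judgment, and then choose one rating from "definitely yes", "probably yes", "probably no", or "definitely no". For example, when assessing random sequence generation: if no details are provided, choose "probably no"; for computer-generated randomization or traditional methods such as coin tossing, choose "definitely yes"; for sequences based on some rule, choose carefully between "probably yes" and "probably no"; and if allocation depended on clinician judgment, participant preference, laboratory tests, or availability of the intervention, choose "definitely no".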

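● Sample selection
PubMed was searched with the keywords "modified", "Cochrane tool", "risk of bias", "CLARITY", and "meta-analysis". Results were sorted by relevance in descending order and screened until 3 systematic reviews/meta-analyses (SR/MA) that used the modified Cochrane tool had been selected. All RCTs included in each systematic review were then sorted alphabetically by first author's surname and publication year and assigned numeric identifiers. Random numbers generated in Excel version 2108 (Microsoft) were used to select 10 RCTs from each systematic review as the sample.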
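The sort, number, and draw procedure described above can be sketched in a few lines. This is only an illustration: the study used Excel to generate the random numbers, whereas this sketch uses Python's `random` module, and the trial list, the function name `sample_rcts`, and the seed are all hypothetical.

```python
import random

def sample_rcts(rcts, k=10, seed=2024):
    """Sort RCTs by (first author surname, year), number them from 1, and draw k at random.

    `rcts` is a list of (surname, year) tuples. The seed is an arbitrary
    illustrative value so that the draw is reproducible.
    """
    ordered = sorted(rcts, key=lambda r: (r[0].lower(), r[1]))
    numbered = {i: rct for i, rct in enumerate(ordered, start=1)}
    rng = random.Random(seed)
    chosen_ids = sorted(rng.sample(sorted(numbered), k))
    return [(i, numbered[i]) for i in chosen_ids]

# Hypothetical trial list for one systematic review (not taken from the paper).
review_rcts = [(f"Author{n:02d}", 2000 + n % 20) for n in range(1, 26)]
selected = sample_rcts(review_rcts)
for study_id, (surname, year) in selected:
    print(study_id, surname, year)
```

Fixing the seed makes the selection auditable, which serves the same purpose as recording the random numbers produced in a spreadsheet.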

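● Application of the large language models
The LLM assessments were run between September 30 and October 10, 2023 (editor's note: ChatGPT at that time was GPT-4; the Claude version was most likely Claude 2, since Claude 3 was not released until March 2024). To keep the assessment material consistent, the PDF files were converted to text files before being fed to the two models. For each RCT's risk of bias (ROB) assessment, the primary outcome was the one designated by the authors or, if none was designated, the first outcome reported in the study. Outputs were transcribed verbatim into a file; any assessment interrupted by technical problems was excluded and rerun. Each RCT was assessed twice by each of the two models, using the same prompt and keeping the model version constant.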

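● Establishing the criterion standard
The 3 senior experts independently assessed the RCTs against the criteria and resolved differences by consensus, continuing the discussion until agreement was reached on every aspect of each RCT's ROB assessment. These consensus-derived assessments formed the criterion standard, the reference against which the accuracy of the LLMs' ROB assessments was measured.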

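● Statistical analysis
Data were analyzed with R version 4.3.2. All tests were 2-sided, and P < 0.05 was considered statistically significant. For ROB classification, "definitely yes" or "probably yes" responses were classified as low risk (negative result), and "definitely no" or "probably no" responses as high risk (positive result). True positives (TP) and true negatives (TN) were defined against the criterion standard; deviations from it were labeled false positives (FP) or false negatives (FN).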
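The classification rule above maps the four ratings onto a binary outcome and then compares each LLM response with the criterion standard. A minimal sketch of that bookkeeping follows; the function names and the example responses are illustrative assumptions, not the authors' code (the study itself used R).

```python
LOW_RISK = {"definitely yes", "probably yes"}   # negative result
HIGH_RISK = {"definitely no", "probably no"}    # positive result

def is_high_risk(response):
    """Return True for a high-risk (positive) rating, False for low risk."""
    rating = response.strip().lower()
    if rating in HIGH_RISK:
        return True
    if rating in LOW_RISK:
        return False
    raise ValueError(f"unrecognized rating: {response!r}")

def confusion_counts(llm_responses, reference_responses):
    """Count TP, TN, FP, FN with the criterion standard as the reference."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for llm, ref in zip(llm_responses, reference_responses):
        predicted, truth = is_high_risk(llm), is_high_risk(ref)
        if predicted and truth:
            counts["TP"] += 1
        elif not predicted and not truth:
            counts["TN"] += 1
        elif predicted and not truth:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

# Made-up responses for four assessments (not data from the paper).
llm = ["Probably no", "Definitely yes", "Probably yes", "Definitely no"]
reference = ["Definitely no", "Probably yes", "Probably no", "Probably yes"]
print(confusion_counts(llm, reference))
```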

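● Accuracy
The accuracy of the LLMs' ROB assessments was quantified at the study-specific, domain-specific, and overall levels using the correct assessment rate, sensitivity, and specificity. For domain-specific accuracy, the F1 score (the harmonic mean of sensitivity and positive predictive value) was also calculated, and the proportion of high-risk responses was presented to aid interpretation of the results.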

F1 = 2 × ([Positive Predictive Value × Sensitivity]/[Positive Predictive Value + Sensitivity])

Positive Predictive Value = TP/(TP + FP)
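As a worked illustration of the two formulas above, the sketch below computes the correct assessment rate, sensitivity, specificity, positive predictive value, and F1 from the four confusion counts. The counts in the example are invented for illustration, and the function name `accuracy_metrics` is an assumption.

```python
def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def accuracy_metrics(tp, tn, fp, fn):
    """Accuracy measures for one domain, with high risk as the positive class."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "correct_rate": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": f1,
    }

# Invented counts for 60 assessments in one domain (not data from the paper).
metrics = accuracy_metrics(tp=20, tn=30, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 ignores true negatives, a domain where high-risk ratings are rare can show a high correct assessment rate and still a modest F1, which is presumably why the proportion of high-risk responses is reported alongside it.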

κ = (Po − Pe)/(1 − Pe)

Po = (Number of Agreements on Positive + Number of Agreements on Negative)/Total Number of Assessments

Pe = ([P1 × P2] + [N1 × N2])/(Total Number of Assessments)^2

PABAκ = 2Po - 1
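The agreement formulas above can be checked with a short sketch. It assumes that P1 and P2 are the numbers of positive (high-risk) ratings in the first and second run and that N1 and N2 are the corresponding numbers of negative ratings; the notes do not define these symbols, so this reading is an assumption, as are the function name and the example data.

```python
def agreement_stats(run1, run2):
    """Observed agreement, Cohen kappa, and PABA kappa for two runs of binary ratings.

    `run1` and `run2` are equal-length lists of booleans (True = high risk).
    """
    n = len(run1)
    both_pos = sum(a and b for a, b in zip(run1, run2))
    both_neg = sum((not a) and (not b) for a, b in zip(run1, run2))
    p1, p2 = sum(run1), sum(run2)      # positives in each run
    n1, n2 = n - p1, n - p2            # negatives in each run
    po = (both_pos + both_neg) / n
    pe = (p1 * p2 + n1 * n2) / n ** 2
    kappa = (po - pe) / (1 - pe) if pe != 1 else float("nan")
    return {"po": po, "pe": pe, "kappa": kappa, "pabak": 2 * po - 1}

# Invented ratings for 10 studies in one domain (not data from the paper).
first_run = [True, True, False, False, True, False, False, False, True, False]
second_run = [True, False, False, False, True, False, False, True, True, False]
stats = agreement_stats(first_run, second_run)
print(stats)
```

In this example the two runs agree on 8 of 10 studies, giving Po = 0.8, Pe = 0.52, κ of about 0.58, and PABAκ = 0.6.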


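Results

● Accuracy

Both LLM 1 (ChatGPT) and LLM 2 (Claude) showed good accuracy. ChatGPT's mean correct assessment rate was 84.5% (95% CI, 81.5%-87.3%), and Claude's was somewhat higher at 89.5% (95% CI, 87.0%-91.8%). The median overall correct assessment rate was 90.0% for both (IQR, 80.0%-90.0% for ChatGPT and 90.0%-100.0% for Claude) across 60 assessments.

The two LLMs had largely similar correct assessment rates across the 10 domains. ChatGPT's rate was lowest in Domain 1 (random sequence generation) at 56.7% and highest in Domain 3.b (blinding of healthcare providers) at 96.7%. Claude's rates ranged from 80.0% to 98.3%, lowest in Domain 1 and highest in Domain 3.a (blinding of patients). In Domain 1, Claude's correct assessment rate was significantly higher than ChatGPT's (RD, 0.23; 95% CI, 0.07-0.39; P = .01); no other domain showed a significant difference.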
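The Domain 1 risk difference can be roughly reproduced from the reported percentages. The sketch below assumes 60 assessments per model in Domain 1, so 80.0% and 56.7% correspond to 48 and 34 correct assessments, and it assumes a simple Wald interval; these notes do not say which interval method the paper used, so treat this as a plausibility check rather than the authors' calculation.

```python
import math

def risk_difference_wald(x1, n1, x2, n2, z=1.96):
    """Risk difference p1 - p2 with a Wald 95% confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se

# Counts inferred from the reported percentages (assumption): Claude 48/60, ChatGPT 34/60.
rd, lower, upper = risk_difference_wald(48, 60, 34, 60)
print(f"RD = {rd:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions the result rounds to RD 0.23 with a 95% CI of 0.07 to 0.39, consistent with the reported values.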

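Of the 60 assessments per model, ChatGPT got 14 (23.3%) entirely correct and 48 (80.0%) at least 80% correct; Claude got 24 (40.0%) entirely correct and 48 (80.0%) at least 80% correct.

Of the 155 incorrect assessments, in 89 (57.4%) the model correctly identified and stated the appropriate supporting evidence but reached the wrong judgment, while 66 (42.6%) were wrong because evidence was missed or misidentified. Among the 90 incorrect assessments in Domains 1 and 2 (allocation concealment), 75.6% rested on correct reasoning but reached the wrong conclusion.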


● Consistency

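Both LLMs were highly consistent overall across repeated assessments: the overall agreement rate was 84.0% for ChatGPT and 87.3% for Claude, with no significant difference between them (RD, 0.03; 95% CI, −0.02 to 0.08). Cohen κ exceeded 0.5 in all domains, indicating at least moderate agreement. κ exceeded 0.80 in 7 domains for ChatGPT and in 8 domains for Claude. Both models were relatively less consistent in Domains 1 and 2. eTable 10 in Supplement 1 lists agreement across the 4 assessments of each domain (2 per model): only 13 studies (43.3%) had 4 concordant assessments in Domain 1 (random sequence generation), and 14 studies (46.7%) in Domain 2 (allocation concealment). Agreement was higher in the other domains (range, 60%-90%).

At the study level, the repeated assessments by ChatGPT and Claude gave identical results for 12 RCTs. Agreement was high or near perfect in most assessments for both models; the lowest agreement rates were 30% for ChatGPT and 50% for Claude. Of the 30 studies, 15 reached 80% or higher agreement across all domains, and mean agreement was 70%.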

● Efficiency

ChatGPT's assessment time ranged from 52 to 127 seconds, averaging 77 seconds per assessment. Claude averaged 53 seconds per assessment (range, 36 to 87 seconds).


Prompt

---
You are a professional reviewer. You are particularly good at learning evaluation criteria, and closely
following it to assess the risk of bias of Randomized Controlled Trials (RCTs). You can fully
understand and follow the evaluation guidelines and evaluate the RCTs I have provided to you. Make
sure all your judgments are based on the facts reported in the article and not on any extrapolation or
speculation of your own. Finally, make sure your answers are completely correct.
Guidelines for Evaluation:
Note: The examples provided in the tool are illustrative and do not cover all possible scenarios in real-
world applications. Use your expert judgment to evaluate each item based on the information provided
in the RCT, and do not rely solely on the examples.
Important:
• The evaluation should be conducted only for one primary outcome.
• If there is too little information to support the judgment, do not speculate positively.
1. Was the allocation sequence adequately generated? Evaluate the adequacy of the allocation
sequence generation based on the information provided in the RCT, considering the following criteria:
o The most important: If no statements are provided on how the randomization sequence was
generated, select "Probably no", even if randomization is mentioned.
o If computer-generated random numbers, coin tossing, card or envelope shuffling, dice rolling, lot
drawing, or minimization (with or without a random element) were used, select “Definitely yes.”
o If the sequence was generated based on the odd or even date of birth, some rule based on the date
(or day) of admission, or some rule based on hospital or clinic record number, carefully evaluate and
choose between “Probably yes” and “Probably no.”
o If allocation was based on clinician judgment, participant preference, results of a series of
laboratory tests, or availability of the intervention, select “Definitely no.”
2. Was the allocation adequately concealed? Evaluate the adequacy of allocation concealment based
on the information provided in the RCT, considering the following criteria:
o If central allocation (including telephone, web-based, and pharmacy-controlled randomization),
sequentially numbered drug containers of identical appearance, or sequentially numbered, opaque,
sealed envelopes were used, select “Definitely yes.”
o If an open random allocation schedule was used or if assignment envelopes were used without
appropriate safeguards (e.g., if envelopes were unsealed, non-opaque, or not sequentially numbered),
select “Definitely no.”
o If no statements are provided on allocation concealment, select “Probably no.”
3. Blinding: Was knowledge of the allocated interventions adequately prevented? Evaluate the
adequacy of blinding for each of the following, based on the information provided in the RCT:
3.a. Were patients blinded?
3.b. Were healthcare providers blinded?
3.c. Were data collectors blinded?
3.d. Were outcome assessors blinded?
3.e. Were data analysts blinded?
o For 3.a. to 3.e., follow these standards.
o If no blinding but you judge that the outcome and the outcome measurement are not likely
influenced by lack of blinding, select “Probably yes.”
o If blinding of participants and key study personnel ensured, and unlikely that blinding could have
been broken, select “Probably yes.”
o If either participants or some key study personnel were not blinded, but outcome assessment was
blinded and the nonblinding of others unlikely to introduce bias, select “Probably yes.”
o If no blinding or incomplete blinding, and the outcome or outcome measurement is likely to be
influenced by lack of blinding, select “Probably no.”
o If blinding of key study participants and personnel attempted, but likely that the blinding could
have been broken, select “Probably no.”
o If either participants or some key study personnel were not blinded, and the nonblinding of others
likely to introduce bias, select “Probably no.”
4. Was loss to follow-up (missing outcome data) infrequent? Evaluate the frequency of loss to
follow-up based on the information provided in the RCT, considering the following criteria:
o If there are no missing outcome data, or the reasons for missing outcome data are unlikely to be
related to the outcome, select “Definitely yes.”
o If missing outcome data are balanced across intervention groups, with similar reasons for missing
data across groups, select “Probably yes.”
o If the proportion of missing outcomes is enough to have an important impact on the intervention
effect estimate, select “Definitely no.”
o If the follow-up rate at the longest time point is greater than 90%, i.e. more than 90% of
participants completed the trial, or the dropout rate at the longest time point is less than 10%, you can
generally consider selecting “Definitely yes.”
o If the follow-up rate is between 80% and 90%, i.e. more than 80% but less than 90% of
participants completed the trial, or the dropout rate at the longest time point is between 10% and 20%,
consider selecting “Probably yes.”
o If the follow-up rate is below 80%, i.e. less than 80% of participants completed the trial, or the
dropout rate is greater than 20%, consider selecting “Definitely no.”
o If no statements are available to support the assessment, select “Probably no.”
5. Are reports of the study free of selective outcome reporting? Evaluate the presence of selective
outcome reporting based on the information provided in the RCT, considering the following criteria:
o If the study protocol is available and all of the study’s pre-specified (primary and secondary)
outcomes of interest in the review have been reported in the pre-specified way, select “Definitely yes.”
o If the study protocol is not available but it is clear that the published reports include all expected
outcomes, including those that were pre-specified, select “Probably yes.”
o If not all of the study’s pre-specified primary outcomes have been reported, or if one or more
reported primary outcomes were not pre-specified, select “Definitely no.”
o Don't select “Probably no” just because the study protocol is not available
6. Was the study apparently free of other problems that could put it at a risk of bias? Evaluate the
presence of other potential sources of bias based on the information provided in the RCT, considering
the following criteria:
o If the study appears to be free of other sources of bias, select “Definitely yes.”
o If the study had a potential source of bias related to the specific study design used, or had some
other problem that could put it at risk of bias, select “Probably no” or “Definitely no.”

Output Format:
For each reported outcome in the RCT, provide the evaluation results in the following format:
Article ID: [Insert First Author's Last Name], [Insert Year of Publication]
Outcome Name: [Insert Outcome Name]
1. Item name (e.g., "1. Was the allocation sequence adequately generated?")
o Response (e.g., "Definitely yes")
o Reason you summarized (e.g., "No statements are available on how the randomization sequence
was generated.""258/316 (82%) participants completed 26 weeks of treatment. The dropout rate is less
than 20% in all groups.")
2. [Next Item Name]
o Response
o Reason
(Continue in the same format for all items)
Ensure there is a clear separation between the sets of responses for different outcomes, and maintain
consistency in the format for each outcome.
Before you output the answer, please make sure that any evaluation results are based on my request.
Please re-check: In the first item, if no statements are available on how the randomization sequence
was generated, select "Probably no", even if randomization is mentioned. In each item, if no statements
are available to support the assessment, select “Probably no.” But be careful.
---
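The numeric follow-up thresholds in item 4 of the prompt are the one part of the guidance that reduces to a simple rule, so they can be written out as a small function for sanity-checking model outputs. This is a sketch only: it covers just the percentage thresholds, not the other criteria in item 4 (balance across groups, reasons for missing data), and the prompt does not say how to rate exactly 80% or exactly 90%, so assigning both boundaries to "Probably yes" is an assumption. The function name is hypothetical.

```python
def follow_up_rating(completed=None, randomized=None):
    """Rate loss to follow-up from the completion rate at the longest time point.

    Integer arithmetic is used for the comparisons to avoid floating-point
    edge cases at the 80% and 90% boundaries.
    """
    if completed is None or randomized is None:
        return "Probably no"            # no statements available
    if randomized <= 0 or not 0 <= completed <= randomized:
        raise ValueError("invalid participant counts")
    if completed * 10 > randomized * 9:   # follow-up rate above 90%
        return "Definitely yes"
    if completed * 10 >= randomized * 8:  # 80% to 90% (boundaries assumed)
        return "Probably yes"
    return "Definitely no"                # follow-up rate below 80%

# The prompt's own example: 258 of 316 participants (82%) completed treatment.
print(follow_up_rating(258, 316))
```

Comparing a rule like this with the rating an LLM returns for item 4 is one way to spot cases where the model quoted the right numbers but drew the wrong conclusion, the error pattern that accounted for most mistakes in the Results.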