Evolving technologies may yet provide the medicine that patients expect

In early April, the BMJ published another systematic review, accompanied by an editorial in the same issue; the conclusion is as the title above states.

The full text of the systematic review is open access (full-text link); the editorial is paywalled.

What is already known on this topic

  • The volume of published research on deep learning, a branch of artificial intelligence (AI), is rapidly growing
  • Media headlines that claim superior performance to doctors have fuelled hype among the public and press for accelerated implementation

What this study adds

  • Few prospective deep learning studies and randomised trials exist in medical imaging
  • Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards
  • Data and code availability are lacking in most studies, and human comparator groups are often small
  • Future studies should diminish risk of bias, enhance real world clinical relevance, improve reporting and transparency, and appropriately temper conclusions

Principal findings

Five key findings were established from our review. Firstly, we found few relevant randomised clinical trials (ongoing or completed) of deep learning in medical imaging. Although time is required to move from development through validation to prospective feasibility testing before a trial can be conducted, the current scarcity of trials means that claims about performance against clinicians should be tempered accordingly. Deep learning only became mainstream in 2014, giving a lead time of approximately five years for testing within clinical environments, and prospective studies could take a minimum of one to two years to conduct. Therefore, it is reasonable to assume that many similar trials will be forthcoming over the next decade. We found only one randomised trial registered in the US despite at least 16 deep learning algorithms for medical imaging approved for marketing by the Food and Drug Administration (FDA). These algorithms cover a range of fields from radiology to ophthalmology and cardiology.

Secondly, of the non-randomised studies, only nine were prospective and just six were tested in a real world clinical environment. Comparisons of AI performance against human clinicians are therefore difficult to evaluate, given the artificial in silico context in which clinicians are being evaluated. In much the same way that surrogate endpoints do not always reflect clinical benefit, a higher area under the curve might not lead to clinical benefit and could even have unintended adverse effects. Such effects could include an unacceptably high false positive rate, which is not apparent from an in silico evaluation. Yet it is typically retrospective studies that are cited in FDA approval notices for marketing of algorithms. Currently, the FDA does not mandate peer reviewed publication of these studies; instead, internal review alone is performed. However, the FDA has acknowledged that its traditional paradigm of medical device regulation was not designed for adaptive AI and machine learning technologies. Non-inferior (rather than superior) AI performance that allows for a lower burden on clinician workflow (that is, being quicker with similar accuracy) might warrant further investigation. However, fewer than a quarter of studies reported the time taken for task completion in both the AI and human groups. Ensuring fair comparison between AI and clinicians is arguably done best in a randomised clinical trial (or, at the very least, a prospective) setting. It should be noted, however, that prospective testing is not necessary to develop the model in the first place. Even in a randomised clinical trial setting, ensuring that functional robustness tests are present is crucial. For example, does the algorithm produce the correct decision for normal anatomical variants, and is the decision independent of the camera or imaging software used?
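To make the false positive concern concrete, here is a minimal sketch in Python; the sensitivity, specificity, and prevalence values are illustrative assumptions of mine, not figures from the review. It shows how operating characteristics that look strong in silico can still yield mostly false alarms when disease prevalence is low:

```python
# Illustrative arithmetic only: the sensitivity, specificity, and
# prevalence below are assumed values, not data from the review.

def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Fraction of positive calls that are true positives (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# A screening-style setting: impressive-looking metrics, but a rare disease.
ppv = positive_predictive_value(sensitivity=0.95, specificity=0.90,
                                prevalence=0.01)
print(f"PPV = {ppv:.1%}")  # ~8.8%: roughly 9 in 10 positive calls are false
```

A model with these characteristics can still have an excellent area under the curve, which is exactly why an in silico evaluation can hide a clinically unacceptable false positive burden.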

Thirdly, limited availability of datasets and code makes it difficult to assess the reproducibility of deep learning research. Descriptions of the hardware used, when present, were also brief, and this vagueness might affect external validity and implementation. Reproducible research has become a pressing issue across many scientific disciplines, and efforts to encourage data and code sharing are crucial. Even when commercial concerns exist about intellectual property, strong arguments exist for ensuring that algorithms are non-proprietary and available for scrutiny. Commercial companies could collaborate with non-profit third parties for independent prospective validation.
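As a small illustration of the kind of code transparency being called for, the sketch below (my own assumption of a sensible baseline, not a practice documented in the reviewed studies) shows the minimal seed and environment bookkeeping that makes a deep learning experiment rerunnable. It assumes PyTorch and does not substitute for sharing the actual data and trained weights:

```python
# A minimal reproducibility scaffold for a deep learning experiment.
# A sketch only: pinned dependency versions plus shared data and model
# weights are still needed for full reproducibility.
import random

import numpy as np
import torch

SEED = 42  # arbitrary, but fixed and reported alongside the results

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.use_deterministic_algorithms(True)  # fail loudly on non-deterministic ops

# Record the software (and ideally hardware) environment next to the results.
print(f"torch {torch.__version__}, numpy {np.__version__}")
print(f"cuda available: {torch.cuda.is_available()}")
```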

Fourthly, the number of humans in the comparator group was typically small, with a median of only four experts. There can be wide intra- and inter-case variation even between expert clinicians, so an appropriately large human sample is essential for a reliable comparison (see the simulation sketched below). Inclusion of non-experts can dilute the average human performance and potentially make the AI algorithm look better than it otherwise might. If the algorithm is designed specifically to aid the performance of more junior clinicians or non-specialists rather than experts, then this should be made clear.
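To see why a panel of four experts is a fragile benchmark, here is a small Python simulation; the 85% mean accuracy and 5 percentage point reader-to-reader spread are purely illustrative assumptions, not figures from the review:

```python
# Simulates reader-to-reader variability. Each expert's true accuracy is
# drawn around 85% with a 5-point spread; both numbers are assumptions
# for illustration, not data from the review.
import numpy as np

rng = np.random.default_rng(seed=0)

def panel_mean_spread(n_readers: int, n_panels: int = 10_000) -> float:
    """Standard deviation of the average accuracy across simulated panels."""
    reader_accuracy = rng.normal(loc=0.85, scale=0.05,
                                 size=(n_panels, n_readers))
    return reader_accuracy.mean(axis=1).std()

for n in (4, 10, 25):
    print(f"{n:>2} readers: SD of panel mean ≈ {panel_mean_spread(n):.3f}")
# With 4 readers the human benchmark swings by about ±2.5 percentage
# points from panel to panel; with 25 readers it is far more stable.
```

Under these assumptions, an AI that "beats" a four-reader panel may simply have been compared against an unluckily weak draw of readers.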

Fifthly, descriptive phrases suggesting that an algorithm's diagnostic performance was at least comparable to (or better than) a clinician's were found in most abstracts, despite the studies having overt limitations in design, reporting, transparency, and risk of bias. Caveats about the need for further prospective testing were rarely mentioned in the abstract (and not at all in the 23 studies that claimed superior performance to a clinician). Even accepting that abstracts are usually word limited, nearly two thirds of studies failed to make an explicit recommendation for further prospective studies or trials even in the discussion sections of the main text. One retrospective study gave a website address in the abstract for patients to upload their eye scans and use the algorithm themselves. Overpromising language leaves studies vulnerable to being misinterpreted by the media and the public. Although it is clearly beyond the power of authors to control how the media and the public interpret their findings, judicious and responsible use of language in studies and press releases, factoring in the strength and quality of the evidence, can help. This issue is especially concerning given new research suggesting that patients are more likely to consider a treatment beneficial when news stories are reported with spin, and that false news spreads much faster online than true news.


Editorials | Artificial intelligence versus clinicians

Below is an excerpt from the third (and final) section of the editorial, which is quite interesting:

The history of medicine shows that doctor-patient relationships can have a therapeutic effect, regardless of the treatment prescribed. But it also reveals that patients and doctors have always interacted in complex relationships mediated by objects, and that expectations of their relationships have changed over time. The presence of AI systems in doctors' offices changes how physicians learn, make decisions, and interact with patients; these systems also change what patients expect.

Nagendran and colleagues find that much current hype around AI is unjustified, but there is reason to believe that technology will increasingly outperform human physicians at specific tasks. If digital technologies enable the development of new forms of knowledge and expand the healing possibilities of the therapeutic encounter, then they have the potential to provide the medicine that patients expect in the 21st century. If not, we need to think more urgently about how to resist current trends towards more automation in the clinic.

The author of this editorial speaks of humanistic care from the very beginning, and the later phrase "the therapeutic encounter" is a term from psychology; applied to AI, it feels a bit… affected. Looking up the author, I found they work in philosophy, with a current research focus of "the rhetoric of progress in contemporary medicine". Very interesting; that rhetoric is also an important contributing factor to today's doctor-patient tensions.
