Ontario auditors find doctors' AI note takers routinely blow basic facts
312 points
• 4 days ago
A provincial audit in Ontario, Canada, has revealed serious inaccuracies in AI-powered medical note-taking systems approved for use in healthcare. The Office of the Auditor General evaluated 20 AI Scribe systems and found that 9 fabricated information or suggested treatments never discussed during the consultation. Twelve inserted incorrect drug information into patient notes, and 17 missed key details about mental health issues that were actually discussed. Six either partially or fully failed to capture mental health concerns. These findings raise serious questions about the reliability of AI in clinical settings, where patient safety and medical accuracy are at stake.
The audit was part of a broader report on AI usage across public services in Ontario, but it zeroed in on the AI Scribe program, which supports physicians and nurse practitioners in generating clinical notes. The evaluations used simulated doctor-patient recordings, and medical professionals compared each original conversation with the AI-generated summary. They found a pattern of hallucinations and omissions, including notes stating that patients had no masses or were anxious when those topics were never mentioned. Such errors could directly affect diagnosis, treatment plans, and long-term patient care.
Beyond the performance issues themselves, the report criticized the evaluation process used to select these systems. Accuracy of medical notes accounted for only 4 percent of a vendor's total score, while having a local presence in Ontario carried 30 percent, more than seven times as much. Bias controls, threat and risk assessments, privacy compliance, and SOC 2 Type 2 certification together made up just 8 percent. This imbalance meant that the factors most critical to patient safety and data protection were drastically underweighted compared with less consequential business criteria.
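To make the weighting concrete, here is a minimal Python sketch of a weighted vendor score. Only the three weights (4%, 30%, 8%) come from the audit as summarized above; the remaining 58% is lumped into an assumed "other" bucket, and both vendors and all of their sub-scores are invented for illustration.

```python
# Weights reported in the audit summary. "other" is an assumed catch-all for
# the remaining 58% of the score, which the article does not break down.
WEIGHTS = {
    "note_accuracy": 0.04,
    "ontario_presence": 0.30,
    "security_privacy_bias": 0.08,
    "other": 0.58,
}


def total_score(subscores):
    """Weighted sum of per-criterion sub-scores, each on a 0.0-1.0 scale."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical vendors with the same "other" sub-score.
accurate_but_remote = {
    "note_accuracy": 1.0,
    "ontario_presence": 0.0,
    "security_privacy_bias": 1.0,
    "other": 0.7,
}
inaccurate_but_local = {
    "note_accuracy": 0.0,
    "ontario_presence": 1.0,
    "security_privacy_bias": 0.0,
    "other": 0.7,
}

print(f"accurate but remote:  {total_score(accurate_but_remote):.3f}")
print(f"inaccurate but local: {total_score(inaccurate_but_local):.3f}")
```

Under these assumptions, a vendor with perfect notes and security marks but no Ontario office scores 0.526, below the 0.706 of a vendor with zero marks on both but a local presence.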
The report warned that this flawed scoring system could lead to the selection of vendors whose tools produce inaccurate or biased medical records or lack adequate safeguards for sensitive health information. While OntarioMD recommends that doctors manually review AI-generated notes for accuracy, none of the approved systems has a built-in mandatory attestation feature. The Ministry of Health acknowledged that more than 5,000 physicians are using the program but said it has not received any confirmed reports of patient harm. The audit suggests that without stricter evaluation standards and built-in validation requirements, the risk of undetected errors remains significant. The Register asked the Ministry whether it would adopt the auditor's recommendations but did not receive an immediate response.
138 comments
• A persistent gap exists between AI's impressive capabilities and its basic reliability: models excel at complex creative tasks yet fail at simple factual operations such as unit conversions, suggesting current approaches may not lead to genuine intelligence.
• The industry focuses on capability benchmarks that count a task as within reach at a 50% success rate while ignoring reliability metrics. This creates a dangerous disconnect between demonstrated ability and production readiness, where even 80% reliability is insufficient for enterprise use.
• AI-assisted tools like VSCode plugins have deteriorated in quality, with most being completely unusable or not functioning as advertised, while the software itself has become slower and less reliable at basic tasks like environment detection.
• LLMs fundamentally lack the ability to accurately assess their own uncertainty, as their output probabilities are not calibrated to actual error rates, making it statistically unsound to use their confidence scores for decision-making.
• While models have internal probability distributions that theoretically contain uncertainty information, extracting meaningful confidence estimates requires computationally expensive calibration techniques that aren't implemented in production systems.
• The distinction between calibrated probabilities and relative uncertainty matters practically, with some arguing that even poorly calibrated self-assessments of confidence would be useful, while others insist calibration is essential for any rigorous application.
• Current AI systems cannot reliably distinguish what they know from what they invent, and often produce confident-sounding outputs that are entirely fabricated. Studies cited in the thread show they typically state confidence in the 80-100% range regardless of actual accuracy.
• AI note-takers and summarizers frequently fabricate information, misattribute statements, and distort the tone and content of communications, with particularly dangerous implications in medical and professional contexts where accuracy is critical.
• Medical AI scribes show alarming error rates, with 60% of evaluated systems (12 of 20) mixing up prescribed drugs, though commenters note this should be weighed against human error rates of approximately 19.6% for medication administration.
• The push for AI adoption in healthcare and other critical fields appears driven more by marketing and stakeholder pressure than by demonstrated reliability, with insufficient attention to the fundamental unsuitability of probabilistic systems for tasks requiring deterministic accuracy.
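The calibration point raised in the comments can be made concrete. The sketch below computes expected calibration error (ECE), one common way to compare stated confidence with observed accuracy. The confidence values and correctness labels are invented for illustration; they do not come from the audit or from any study mentioned in the thread.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that always states 90% confidence but is right half the time.
confidences = [0.9] * 10
correct = [True] * 5 + [False] * 5

print(round(expected_calibration_error(confidences, correct), 3))
```

For this toy input the gap is 0.4: the model claims 90% confidence while being right 50% of the time. A well-calibrated system would score close to zero.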
The discussion reveals a fundamental tension between AI's remarkable capabilities and its persistent reliability issues, with participants largely agreeing that current LLM technology excels at creative and complex tasks while failing at basic factual operations. The conversation highlights how the industry's focus on capability benchmarks obscures critical reliability gaps, particularly in high-stakes domains like healthcare where AI note-takers and summarizers demonstrate dangerous error rates. Multiple commenters emphasize that these systems lack genuine understanding of their own uncertainty, producing confident-sounding outputs regardless of accuracy. While some argue that AI should be judged against human error rates, others counter that AI makes qualitatively different types of errors, hallucinating information in ways humans never would. The underlying consensus suggests that current AI technology, despite impressive surface capabilities, remains fundamentally unsuitable for applications requiring high reliability, and that the rush to deploy these systems in critical contexts may be driven more by commercial interests than by genuine utility.