A provincial audit in Ontario, Canada, has revealed alarming inaccuracies in AI-powered medical note-taking systems approved for use in healthcare. The Office of the Auditor General evaluated 20 AI Scribe systems and found that 9 out of 20 fabricated information or suggested treatments never discussed during patient consultations. Twelve systems inserted incorrect drug information into patient notes, and 17 missed key details about mental health issues that were actually discussed. Six systems either partially or fully failed to capture mental health concerns. These findings raise serious questions about the reliability of AI in clinical settings, especially given that the stakes involve patient safety and medical accuracy.
The audit was part of a broader report on AI usage across public services in Ontario, but it zeroed in on the AI Scribe program, which supports physicians and nurse practitioners in generating clinical notes. Evaluations used simulated doctor-patient recordings reviewed by medical professionals who compared the original conversations to the AI-generated summaries. What they discovered was a pattern of hallucinations and omissions, including reports stating patients had no masses or were anxious when those topics were never mentioned. Such errors could directly impact diagnosis, treatment plans, and long-term patient care.
Beyond the performance issues themselves, the report criticized the evaluation process used to select these systems. Accuracy of medical notes accounted for only 4 percent of a vendor's total score, while having a domestic presence in Ontario carried 30 percent weight. Bias controls, threat and risk assessments, privacy compliance, and SOC 2 Type 2 certification together made up just 8 percent of the scoring. This imbalance meant that factors far more critical to patient safety and data protection were drastically underweighted compared to less consequential business criteria.
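To make the effect of that weighting concrete, here is a minimal sketch of a weighted vendor score. The three reported weights (4, 30 and 8 percent) come from the audit summary above; the "other" bucket is an assumption that simply holds the remaining 58 percent, and both vendors and their ratings are hypothetical.

```python
# Weights as percent of the total vendor score.
# The first three are reported in the audit summary; "other" is an
# assumed placeholder for the remaining, unspecified criteria.
WEIGHTS = {
    "note_accuracy": 4,
    "ontario_presence": 30,
    "security_privacy_bias": 8,
    "other": 58,
}


def total_score(ratings: dict[str, int]) -> float:
    """Weighted total on a 0-100 scale; each rating is an integer from 0 to 100."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / 100


# Two hypothetical vendors that are identical on the "other" criteria.
accurate_remote = {
    "note_accuracy": 100,
    "ontario_presence": 0,
    "security_privacy_bias": 100,
    "other": 80,
}
inaccurate_local = {
    "note_accuracy": 0,
    "ontario_presence": 100,
    "security_privacy_bias": 0,
    "other": 80,
}

print(total_score(accurate_remote))   # 58.4
print(total_score(inaccurate_local))  # 76.4
```

Under these assumed ratings, a vendor with no usable note accuracy and no security controls still outscores a perfectly accurate, fully compliant one by 18 points, purely on local presence.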
The report warned that this flawed scoring system could lead to the selection of vendors whose tools produce inaccurate or biased medical records or lack adequate safeguards for sensitive health information. While OntarioMD recommends that doctors manually review AI-generated notes for accuracy, there's no mandatory attestation feature built into any of the approved systems. The Ministry of Health acknowledged that over 5,000 physicians are using the program but hasn't received any confirmed reports of patient harm, though the audit suggests that without stricter evaluation standards and built-in validation requirements, the risk of undetected errors remains significant. The Register reached out to the Ministry for comment on whether it would adopt the auditor's recommendations but did not receive an immediate response.
antirez discusses the unexpected popularity of DwarfStar 4 (DS4), a local AI integration tool he developed. He attributes its rapid adoption to a convergence of factors: the release of a powerful, fast model (DeepSeek v4 Flash), the effectiveness of asymmetric quantization (2/8-bit), and years of accumulated knowledge from the local AI community. This combination allowed him to build DS4 in just one week, though he notes that even with this foundation, working with LLMs requires skill.
The author shares his intense work schedule during the project's launch week, averaging 14 hours per day, contrasting it with his usual 4-6 hour workday since the early days of Redis. He clarifies that DS4 is not limited to DeepSeek v4 Flash; the model can evolve over time. His vision is for DS4 to always use the best current open-weights model that runs practically fast on high-end Macs or "GPU in a box" setups like the DGX Spark. He anticipates future contenders like a new DeepSeek v4 Flash checkpoint, a coding-specific version, and specialized variants for different domains (e.g., ds4-coding, ds4-legal, ds4-medical).
antirez highlights a significant milestone: for the first time since he began experimenting with local AI, he is using a local model for serious tasks he would normally delegate to Claude or GPT. He considers this a major development. He also notes the effectiveness of vector steering, which allows for a more flexible LLM experience. He expresses his admiration for DeepSeek v4 Flash, stating that DS4 is much closer to a frontier model (B) than to a small local model (A).
Looking ahead, antirez outlines his plans for DS4's development. He hopes the project will focus on quality benchmarks, possibly add a coding agent, set up a home hardware rig for CI testing to ensure long-term quality, gain more ports, and support distributed inference (both serial and parallel). He expresses gratitude for the support received during the chaotic first days and concludes with a strong statement: AI is too critical to be just a provided service.
• DwarfStar4 is a specialized inference runtime designed specifically for DeepSeek V4, currently requiring 96GB of VRAM and targeting Apple Metal and NVIDIA DGX Spark hardware, representing a focused alternative to the more general-purpose llama.cpp.
• The project's narrow focus on a single model allows for faster iteration and optimization compared to larger frameworks like llama.cpp, though this approach risks fragmentation of development effort and potential obsolescence when newer models emerge.
• There's significant discussion around the trade-offs between model intelligence, speed, and cost, with some arguing that less powerful models running longer could eventually match smarter models for many tasks, potentially disrupting the business models of companies like Anthropic that rely on selling access to the most capable models.
• Local inference performance varies dramatically by hardware: M5 MacBook Pros achieve around 30 tokens/s, while RTX Pro 6000 GPUs reach 121 tokens/s for prefill and 47 tokens/s for generation, highlighting the importance of hardware choice for practical usability.
• The coding capability of DeepSeek V4 Pro appears competitive with or superior to Claude Sonnet in some tests, though it runs much slower, creating a cost-speed trade-off where the open model is significantly cheaper but requires more patience.
• There's skepticism about whether current quantization techniques can compress models like DeepSeek V4 into 16GB of RAM within a few years, as all parameters must remain in memory even for Mixture of Experts models, suggesting fundamental architectural or hardware breakthroughs would be needed.
• The development process for DwarfStar4 involves using AI assistants like GPT 5.5 to iterate on performance-critical code, with human oversight ensuring correctness through automated testing and benchmarking, demonstrating a hybrid human-AI development workflow.
• Practical experiences show DeepSeek V4 Flash running well on M4 Max MacBook Pros with 128GB RAM, handling contexts up to 124k tokens without degradation, with the MoE architecture providing significant speed advantages over dense models of similar capability.
• Questions remain about the optimal hardware configuration for local inference, with high-end Macs offering large unified memory but lower throughput compared to professional GPUs, making the choice dependent on whether context length or generation speed is prioritized.
• The distinction between empirical evidence and second-hand information becomes relevant when evaluating model performance claims, with some arguing that true empirical validation requires personal testing rather than relying on published benchmarks.
The discussion reveals a community actively exploring the practical boundaries of local LLM inference, with particular excitement around DeepSeek V4's capabilities but realistic about current hardware limitations. There's a tension between the desire for accessible, locally-run AI and the reality that cutting-edge models still require substantial resources, leading to debates about whether future efficiency gains will democratize access or if the field will continue to favor those with expensive hardware. The emergence of specialized tools like DwarfStar4 reflects a broader trend toward optimization for specific use cases, even at the cost of generality.
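The skepticism about fitting such a model into 16GB comes down to simple arithmetic on weight storage. The sketch below is a back-of-the-envelope illustration, not a measurement: it counts weights only (no KV cache, activations, or runtime overhead), treats 1 GB as 10^9 bytes, and uses a hypothetical 200-billion-parameter model, since the discussion does not state DeepSeek V4's parameter count. For a Mixture of Experts model the total parameter count is the relevant figure, because every expert has to be resident in memory.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def max_params_billion(budget_gb: float, bits_per_weight: float) -> float:
    """Largest parameter count (in billions) whose weights fit in the budget."""
    return budget_gb * 8 / bits_per_weight


# Upper bound for a 16 GB budget at 2 bits per weight.
print(max_params_billion(16, 2))  # 64.0

# Hypothetical 200B-parameter model (an assumed size, for illustration only).
print(weight_memory_gb(200, 2))   # 50.0
print(weight_memory_gb(200, 8))   # 200.0
```

Even at a uniform 2 bits per weight, 16 GB holds at most 64 billion parameters before any working memory is counted, which is why commenters expect that architectural or hardware changes, rather than quantization alone, would be needed.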
Thomas G. Dietterich, a distinguished professor emeritus at Oregon State University and former president of the Association for the Advancement of Artificial Intelligence, posted a message on X (formerly Twitter) addressing arXiv authors. He emphasized that the platform's Code of Conduct requires each author to take full responsibility for all contents of a paper, regardless of how the contents were generated. This statement appears to be a reminder or clarification regarding authorship accountability in academic publishing, particularly in an era where AI-generated content is becoming more prevalent.
The post was made on May 14, 2026, and garnered significant attention with over 239,000 views, along with hundreds of replies and retweets. The conversation highlights the ongoing discussion around ethical standards in academic publishing, especially concerning the use of AI tools in research. Dietterich's message underscores the importance of maintaining integrity and accountability, even as technology evolves.
The context of this post reflects broader debates in the academic community about the role of AI in scholarly work and the responsibilities of authors. It serves as a timely intervention in discussions about transparency and ethical practices in research, particularly as platforms like arXiv become central to disseminating cutting-edge scientific findings. The engagement metrics suggest that this topic resonates widely among researchers and the public alike.
• A one-year arXiv ban for a single hallucinated citation is seen by some as a strong positive for scientific integrity, as arXiv is a privilege, not a right, and enforcing consequences for submitting unverified content raises the bar for quality.
• Others argue the policy is excessive and misguided, noting that arXiv submissions are not rigorously checked, peer review is far more selective, and a single citation error does not indicate fraud or reflect the overall quality of the work.
• Fraud requires intent to deceive or reckless disregard, and a single hallucinated citation, especially if inserted by a co-author or AI tool without the author's knowledge, may constitute carelessness rather than fraud.
• Many emphasize that authors bear full responsibility for their submissions, regardless of how content is generated, and failing to verify that cited references exist is a fundamental failure that undermines the credibility of the entire paper.
• Some defend the policy as a reasonable heuristic against low-effort "slop" submissions, arguing that if authors cannot perform the basic task of checking references, the rest of their work cannot be trusted either.
• The requirement that future submissions be peer-reviewed before arXiv posting is viewed as a proportionate response, ensuring that authors who previously submitted unvetted work demonstrate improved rigor.
• Critics warn that the policy may be applied unevenly, such as when a co-author adds a fraudulent citation without others' knowledge, raising questions about fairness and arXiv's ability to investigate individual responsibility.
• There is concern that the policy could discourage honest researchers who make genuine mistakes, particularly under time pressure, while doing little to deter deliberate fraudsters who would simply disguise their sloppiness.
• Some suggest the policy is a modest but cost-free improvement, as automated tools can easily verify reference existence, filtering out only the most negligent submissions without burdening careful authors.
• The broader context includes longstanding issues with citation accuracy predating AI, but LLMs have dramatically increased the risk, making automated detection and stricter policies necessary to maintain trust in the scientific record.
The discussion reveals a sharp divide between those who view arXiv's policy as a necessary defense against declining scientific standards and those who see it as a disproportionate punishment for what may be an honest mistake. A strong consensus exists that verifying references is a non-negotiable minimum for academic work, but there is significant disagreement over whether a single hallucinated citation constitutes evidence of fraud or mere negligence. The debate also touches on questions of authorial responsibility, the role of AI as a tool versus a crutch, and the practical challenges of enforcing such policies fairly across co-authored works. While most agree that something must be done to address the flood of low-quality submissions, the severity and implementation of this particular sanction remain contested.
OpenAI is bringing Codex to mobile devices, now available in preview within the ChatGPT app for iOS and Android. This move allows developers to stay connected to active coding work from anywhere, whether they are using a laptop, a dedicated Mac mini, or a managed remote environment. The mobile experience is fully featured, letting users review outputs, approve commands, change models, and start new tasks directly from their phones. Files, credentials, and local setups remain on the machine where Codex is operating, while updates like screenshots, terminal output, and test results flow back to the phone in real time through a secure relay layer.
The core idea is to enable a new rhythm of collaboration as agents take on longer-running work. Small moments of interaction, like a quick check-in to answer a question or approve a next step, can keep a thread moving and prevent unnecessary rework. With over 4 million people using Codex weekly, OpenAI is emphasizing how these micro-interactions matter. From a phone, a developer can start investigating a bug while waiting for coffee, make a decision on a refactor during a commute, prepare for a customer conversation by synthesizing updates, or turn a fresh idea into forward motion while away from their desk.
For enterprise environments, Codex now supports Remote SSH, allowing it to connect directly into managed remote development environments with approved dependencies and security policies. This means teams can start work on a desktop, steer execution from a phone, and keep long-running tasks moving without being tied to a single machine. OpenAI is also releasing several updates for teams to automate and manage Codex at scale, including programmatic access tokens for CI pipelines and internal automations, and Hooks, which are now generally available for customizing behavior, scanning for secrets, and logging conversations.
Additionally, OpenAI is introducing support for HIPAA-compliant use of Codex in local environments for eligible ChatGPT Enterprise workspaces. This enables healthcare organizations to use Codex for patient care and operational workflows with greater speed and confidence. Codex in the ChatGPT mobile app is rolling out in preview across all plans, including Free and Go, in all supported regions. Remote SSH and Hooks are available on all plans, while programmatic access tokens are limited to Enterprise and Business plans, and HIPAA compliance is restricted to eligible Enterprise workspaces using local environments.
The discussion centers on OpenAI's Codex, particularly its free availability, performance relative to competitors like Claude, and new remote-control features for mobile access. Many users express surprise and appreciation that Codex is free, though some note limitations in request volume or model access compared to paid tiers.
Performance comparisons favor Codex 5.5 over Claude Opus 4.7 for coding tasks under ten minutes, citing speed, reduced "laziness," and more natural code output, though very long tasks may suffer from goal drift. Remote-control functionality (allowing mobile interaction with desktop Codex sessions) is praised as more reliable than Claude's equivalent, despite occasional sync issues or missing UI features on mobile.
Users highlight practical workflows involving SSH, tmux, or third-party tools like Omnara, while others critique the ergonomic challenges of directing agents via voice or small screens. There's also commentary on OpenAI's software quality relative to Anthropic and Google, with several noting Codex's polished UX and stability. Broader concerns include over-reliance on mobile access reducing code review rigor and the environmental or cognitive cost of high-speed agent output.
Overall, the sentiment reflects strong adoption of Codex for its cost, speed, and improving tooling, tempered by nuanced trade-offs in context length, platform support, and workflow integration.
The Great Zombification
Owen Yingling, a 21-year-old philosophy student at the University of Chicago, argues that the widespread use of artificial intelligence on college campuses is not merely a tool for academic cheating but a "cancer" that threatens to destroy the university as a humanist institution. He contends that the older generation fails to recognize the extent to which AI has permeated every aspect of student life, from coursework to social interactions, creating a "zombification" of the intellectual elite.
Yingling describes the progression of AI use at UChicago, starting in the "business economics" specialization, which he characterizes as an academic "beach vacation" where rote learning made it easy for students to substitute AI for genuine effort. He notes that the problem quickly spread to the economics department, where students were observed using phones to photograph exams and input them into large language models during tests. The issue eventually reached the humanities, where fraternity plagiarism cases reportedly decreased as grades rose following the release of more advanced AI models.
The author highlights a pivotal moment when the university newspaper, The Maroon, published two articles entirely generated by AI, which went unnoticed for months. This incident revealed that AI use had moved beyond simple academic misconduct into the realm of student publications and media. Yingling observes that "perfect parallel constructions" now fill lecture halls, take-home tests, and student chatter, suggesting a homogenization of thought and expression across campus life.
Despite the proliferation of AI, elite universities continue to announce massive investments in AI research and integration. Yingling points to a $50 million gift at UChicago and similar initiatives at Harvard, Yale, and Columbia, which he compares to "1980s Pravda articles" for their disconnect from the reality on the ground. He argues that these institutions are promoting "AI integration" while simultaneously experiencing dramatic increases in cheating cases, such as at Princeton where disciplinary actions nearly doubled in one year.
Yingling draws a parallel between AI dependency and the "zombie ant-fungus," suggesting that students are gradually surrendering all aspects of their lives to AI, from homework and emails to gym routines and romantic messages. He references a prophetic story about a "whispering earring" that eventually controls its wearer's every movement, illustrating how AI can evolve from a helpful tool to a mechanism of total control over human behavior.
The author challenges the notion that AI can be successfully integrated into education, arguing that the practical hurdles are too great and the benefits too low, particularly in humanities courses. He criticizes the idea that AI will "democratize" education, calling it a contradiction in terms for elite institutions that claim to spare no expense. Yingling emphasizes that teaching is fundamentally a human relationship, and replacing it with AI will lead to the extinction of the eccentric, challenging educators who truly stimulate intellectual growth.
Yingling warns that AI integration will lead to the homogenization and centralization of education, making top schools more interchangeable and tying them to capital-intensive, heavily regulated technology. He envisions a future where independent educational institutions are transformed into factories designed to train students according to societal needs, a prospect he finds terrifying. While some might see the collapse of the current university system as an opportunity to rebuild, Yingling expresses sadness at the potential loss of the Western intellectual tradition, including doctoral genealogies and carefully preserved libraries.
The author concludes by advocating for a harder line on AI use in universities, not because it will solve all the problems facing higher education, but because it will prevent the sudden irrelevance of genuine learning. He acknowledges that the post-WWII research university is already dying, but hopes that what emerges will not be an "undead university" devoid of purpose, discipline, or originality. Yingling's essay serves as a passionate defense of humanist education against what he sees as the dehumanizing effects of artificial intelligence.
The discussion centers on the idea that the modern university system is fundamentally broken because it prioritizes credentialing and signaling over genuine education. Many participants argue that AI is merely accelerating an existing trend where students treat degrees as a transactional requirement for employment rather than an opportunity for intellectual growth.
• The primary function of university has shifted from education to certification, as degrees are now required for most non-manual-labor jobs, making the "battle for learning" one that was lost decades ago.
• AI is viewed as a high-risk tool for obtaining credentials without work, but the real issue is a culture that values measurement and grades over the actual acquisition of knowledge.
• While some institutions like UChicago historically emphasized "learning how to think" and personal enrichment, the broader trend is toward hyper-professionalism, where students view college as a white-collar vocational school.
• The value of an elite degree is often attributed to selection effects and networking rather than the quality of instruction, a dynamic that online courses and AI have failed to disrupt.
• To combat AI cheating, many suggest a return to "no-tech" solutions like in-person proctored exams, blue books, and oral assessments, which were the standard for centuries.
• There is a concern that if AI devalues the signaling power of a degree, the wage premium for higher education will collapse, potentially making universities irrelevant.
• Some argue that the solution is to decouple education from credentialing, suggesting that universities should focus on critical thinking while technical training moves to apprenticeships and trade schools.
• Despite the doom, some students and educators find AI useful as a Socratic tutor or for creating personalized learning tools, provided the student remains intentional about the learning process.
• The discussion highlights a divide between those who see AI as a "zombifying" force that creates a permanent underclass and those who believe it could free students to focus on humanist pursuits by automating rote knowledge work.
The conversation reveals deep skepticism about the current state of higher education, with a widely shared view that the system values gatekeeping and status over cultivating independent thought. Although AI is seen as a disruptive force, many believe it has merely exposed the existing flaws of a credential-centered culture. Proposed remedies range from a return to traditional, rigorous assessment methods to a complete rethinking of how society views and structures higher education.
The conversation reveals a deep skepticism about the current state of higher education, with participants largely agreeing that the system is more concerned with gatekeeping and status than with fostering independent thought. While AI is seen as a disruptive force, many argue it is simply exposing the pre-existing flaws of a credential-focused culture. The proposed solutions range from a return to traditional, rigorous assessment methods to a complete overhaul of how society values and structures post-secondary education.
Calif has achieved a significant milestone in security research by developing the first public macOS kernel memory corruption exploit on Apple M5 silicon, successfully bypassing Apple's Memory Integrity Enforcement (MIE) system. The exploit, built in just five days in collaboration with Mythos Preview, demonstrates that even Apple's most advanced hardware-based security protections can be evaded with the right combination of AI assistance and human expertise.
Apple spent five years and billions of dollars developing MIE, a hardware-assisted memory safety system built around ARM's Memory Tagging Extension. Introduced as the flagship security feature for the M5 and A19 chips, MIE was specifically designed to stop memory corruption exploits, which have historically been the most common vulnerability class on iOS and macOS. According to Apple's own research, MIE disrupts every known public exploit chain against modern iOS, including sophisticated kits like Coruna and Darksword.
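As a rough illustration of the idea behind memory tagging, the toy model below gives each 16-byte granule a 4-bit tag and rejects any access whose pointer tag does not match. It is only a sketch: `TaggedHeap`, `malloc`, `store`, and `TagMismatch` are hypothetical names, the rule that neighbouring allocations get different tags is an assumption of this toy, and nothing here reflects how Apple implements MIE in hardware or in its allocators.

```python
import random

GRANULE = 16  # bytes covered by one tag in ARM MTE


class TagMismatch(Exception):
    """Raised when a pointer's tag differs from the memory's tag."""


class TaggedHeap:
    """Toy bump allocator with one 4-bit tag per 16-byte granule."""

    def __init__(self, size: int, seed: int = 0) -> None:
        self.data = bytearray(size)
        self.tags = [0] * (size // GRANULE)
        self.rng = random.Random(seed)
        self.next_free = 0

    def malloc(self, size: int) -> tuple[int, int]:
        granules = -(-size // GRANULE)  # round up to whole granules
        first = self.next_free // GRANULE
        # Assumption of this toy: never reuse the previous neighbour's tag.
        prev = self.tags[first - 1] if first else 0
        tag = self.rng.choice([t for t in range(1, 16) if t != prev])
        for g in range(first, first + granules):
            self.tags[g] = tag
        self.next_free += granules * GRANULE
        return (tag, first * GRANULE)  # a "pointer" is (tag, address)

    def store(self, ptr: tuple[int, int], offset: int, value: int) -> None:
        tag, base = ptr
        addr = base + offset
        if self.tags[addr // GRANULE] != tag:
            raise TagMismatch(f"pointer tag {tag} != memory tag at {addr:#x}")
        self.data[addr] = value
```

In this model an in-bounds write through `a` succeeds, while a linear overflow from `a` into the next allocation raises `TagMismatch`, which is the class of bug the mitigation is meant to catch.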
The Calif team's attack path was actually an accidental discovery. Bruce Dang found the underlying bugs on April 25th, and within a week, with Dion Blazakis joining and Josh Maine building the tooling, they had a working exploit by May 1st. The chain targets macOS 26.4.1, starting from an unprivileged local user and using only normal system calls to achieve root access on bare-metal M5 hardware with kernel MIE fully enabled.
Mythos Preview played a crucial role in identifying the vulnerabilities, which belong to known bug classes that the AI system has learned to attack effectively. However, MIE is a new best-in-class mitigation, and bypassing it required human expertise. The pairing proved remarkably effective, landing a kernel exploit against the strongest consumer platform protections in about a week.
This work signals what's coming in the security landscape. Apple built MIE in a world before AI systems like Mythos Preview existed. As these systems discover more vulnerabilities, some will inevitably be powerful enough to survive even advanced mitigations. The team delivered their findings in person at Apple Park, laser-printing the report as a nod to hacker culture, and plans to publish a full 55-page technical report after Apple ships a fix.
The discussion centers on a newly disclosed security vulnerability in Apple's operating system that bypasses Memory Tagging Extension (MTE), a hardware-level defense mechanism designed to prevent memory corruption exploits. Commenters express concern about the broader implications of such vulnerabilities, especially given that most organizations lack dedicated security teams or resources to patch systems promptly. The exploit appears to use a "data-only" attack technique, which avoids triggering MTE by not altering control flow, raising questions about why Apple didn't also employ additional protections such as `-fbounds-safety`-style bounds checking. Some speculate that performance concerns or the complexity of recompiling the entire OS may explain the gap. There's also debate over the exploit's value in Apple's bug bounty program, with estimates ranging from $100,000 to $1.5 million depending on how it's framed, particularly if it is demonstrated against a beta OS or in Lockdown Mode. Others note this isn't the first time MTE has been bypassed, citing a similar case on Google Pixel. While MTE significantly raises the bar for attackers by thwarting common exploit techniques like ROP and JOP, this incident underscores that no single mitigation is foolproof, especially as adversaries adapt to target unprotected subsystems like GPU memory. The conversation reflects growing anxiety about whether defenses can scale in the face of increasingly sophisticated, AI-assisted vulnerability discovery.
The temptation to use AI for writing, coding, or drafting documents is becoming increasingly difficult to resist. However, this convenience comes at a steep cost, as the reliance on these tools seems to be actively diminishing the ability to produce original work. Even for someone who didn't consider themselves a poor writer before, the frequent use of AI is causing a noticeable decline in personal skill and capability.
This decline often stems from a cycle involving self-doubt and imposter syndrome. When AI generates content, the results often lack a personal touch, sounding nothing like the individual's own voice or failing to convey the exact nuances intended. This disconnect can make the user feel even less capable of producing work independently, as the output feels artificial rather than authentic.
In the realm of software development, the impact has been even more profound. After nearly two years of relying entirely on prompting, without writing a single line of code by hand, the author feels a growing sense of loss over the fundamental skill of programming. While the profession itself isn't disappearing, and there will always be a need for people who can actually read and write code, the shift toward AI-driven development is causing many to lose the very craft that once defined their lives.
There is a hope, however, that AI might actually help reverse a long-term trend in the industry. For decades, the massive demand for software developers has outstripped the supply, which has arguably led to a decline in the professional standards of the field. Before computer science became a mainstream profession, programming was primarily the domain of academics, physicists, and mathematicians. By potentially raising the bar for what is required to truly master the craft, AI might help return some of that lost professionalism to the industry.
Even the act of writing about this struggle is a battle against the urge to use these tools. There is a constant, nagging anxiety that a piece of writing might not make sense or might read awkwardly, which leads to the impulse to run text through an AI like Claude for validation. Overcoming this dependency requires actively fighting back against the self-doubt that these technologies feed on, in order to reclaim one's own voice and skills.
• Experienced developers often feel a persistent sense of unease when using AI to write code, driven by a need to constantly review and supplement every output to ensure correctness and maintainability.
• LLMs tend to prioritize adding more code rather than simplifying it, which can lead to an asymptotic approach toward 100% technical debt as the codebase grows through verbose or redundant implementations.
• The architectural nature of "next-token prediction" makes models prone to verbosity, often struggling to know when to stop or how to structure projects at a high level without losing context.
• Using AI agents in loops can exacerbate problems, as one agent may attempt to address another's review by adding even more code rather than backtracking to find a better architectural approach.
• There is debate over whether the changing economics of maintenance make high verbosity and "accidental complexity" less costly, potentially shifting the focus away from minimal, highly optimized code.
• Effective AI workflows involve treating the model as an intern or an assistant for rote tasks, such as boilerplate, unit tests, or refactoring, while reserving core architectural and novel logic for human implementation.
• Relying too heavily on AI can lead to cognitive atrophy, where developers lose the "muscle memory" of problem-solving, syntax, and the ability to deeply understand the systems they are building.
• Junior developers face the greatest risk, as the lack of struggle and hands-on experience during the learning phase may prevent them from building the foundational skills required to eventually oversee AI tools.
• For some, AI acts as a powerful force multiplier that allows them to tackle much more complex engineering problems by offloading the mundane, though it requires strict discipline to avoid "vibe coding."
• The shift toward AI-driven development changes the developer's role from a creator of logic to an orchestrator of requirements and a verifier of outputs, emphasizing systems design over syntax.
The discussion centers on the tension between the massive productivity gains offered by AI and the long-term risks to engineering craft and individual expertise. While many find value in offloading repetitive tasks to focus on higher-level architecture, there is a strong consensus that "vibe coding" without rigorous human oversight leads to bloated, unmaintainable codebases and cognitive decline. A significant concern exists for the future of the profession, specifically regarding how junior developers will develop deep technical intuition if they bypass the essential struggles of manual programming. Ultimately, the most successful approach appears to be one of disciplined orchestration, where AI is used as a tool for expansion rather than a total replacement for critical thinking.
This repository contains a proof-of-concept exploit for CVE-2026-42945, a critical heap buffer overflow vulnerability in NGINX's `ngx_http_rewrite_module` that dates back to 2008. The bug enables unauthenticated remote code execution on servers using `rewrite` and `set` directives. Along with three other memory corruption issues (CVE-2026-42946, CVE-2026-40701, CVE-2026-42934), it was discovered autonomously by DepthFirst's security analysis system after the NGINX source was onboarded with a single click.
The vulnerability stems from a mismatch in NGINX's two-pass script engine process. First, the required buffer size is calculated, then data is copied in. The `is_args` flag is set on the main engine when a `rewrite` replacement contains `?`, but the length-calculation pass runs on a freshly zeroed sub-engine. This means the length pass sees `is_args = 0` and returns raw capture length, while the copy pass sees `is_args = 1` and calls `ngx_escape_uri` with `NGX_ESCAPE_ARGS`, expanding each escapable byte to 3 bytes. The copy then overflows the undersized heap buffer with attacker-controlled URI data.
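The size mismatch described above can be sketched with a toy "measure, then copy" routine whose two passes see different flag values. This is not NGINX code: the `ESCAPABLE` byte set and all function names are assumptions made for illustration, and the sketch only counts bytes rather than corrupting memory.

```python
# Toy model of a two-pass routine whose length pass and copy pass disagree.
# ESCAPABLE is a hypothetical set of bytes that get percent-encoded.
ESCAPABLE = set(b" +&#%")


def escaped_len(data: bytes) -> int:
    """Length after escaping: each escapable byte becomes 3 bytes (%XX)."""
    return sum(3 if b in ESCAPABLE else 1 for b in data)


def escape(data: bytes) -> bytes:
    """Percent-encode the escapable bytes, leaving the rest untouched."""
    return b"".join(b"%%%02X" % b if b in ESCAPABLE else bytes([b]) for b in data)


def length_pass(capture: bytes, is_args: bool) -> int:
    """First pass: how many bytes the copy is expected to need."""
    return escaped_len(capture) if is_args else len(capture)


def copy_pass(capture: bytes, is_args: bool) -> bytes:
    """Second pass: the bytes that are actually written."""
    return escape(capture) if is_args else capture


def overflow_bytes(capture: bytes) -> int:
    """Bytes written past the buffer when the passes see different flags."""
    allocated = length_pass(capture, is_args=False)  # zeroed sub-engine
    written = len(copy_pass(capture, is_args=True))  # main engine, flag set
    return max(0, written - allocated)
```

In this model every escapable byte in the capture contributes two bytes of overflow, so the attacker controls the overflow length through the URI content.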
Exploitation uses cross-request heap feng shui to corrupt an adjacent `ngx_pool_t`'s `cleanup` pointer, sprayed via POST bodies since URI bytes can't contain null bytes. This redirects execution to a fake `ngx_pool_cleanup_s` invoking `system()` on pool destruction. The vulnerability affects NGINX Open Source versions 0.6.27 through 1.30.0, fixed in 1.31.0 and 1.30.1, and NGINX Plus versions R32 through R36, fixed in R36 P4, R35 P2, and R32 P6.
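The affected range for NGINX Open Source quoted above can be turned into a quick check. This is a minimal sketch that assumes plain dotted version strings such as `1.30.0`; the function names are illustrative, and it does not cover NGINX Plus release numbering.

```python
FIRST_AFFECTED = (0, 6, 27)  # earliest affected Open Source release
LAST_AFFECTED = (1, 30, 0)   # latest affected release; 1.30.1 and 1.31.0 are fixed


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.30.0' into (1, 30, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected_open_source(version: str) -> bool:
    """True if the upstream version falls inside the affected range."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED
```

A result of `True` only reflects the upstream version number. Distribution packages may carry a backported fix under an older version string, so the package changelog is the better source of truth.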
The repository includes setup scripts and a Python proof-of-concept that can be tested on Ubuntu 24.04.3 LTS. Users can build the container with `./setup.sh`, start the vulnerable NGINX server with `docker compose -f env/docker-compose.yml up`, and pop a shell with `python3 poc.py --shell`. The project has garnered 422 stars and 76 forks on GitHub, with the primary contributor being Zhenpeng Lin (Markakd).
• The published exploit disables ASLR for simplicity, but the full writeup describes a method to bypass ASLR using the vulnerability itself, making it a serious threat even on systems with ASLR enabled. ASLR is only a defense-in-depth measure, and LLM-assisted exploit development is rapidly lowering the skill barrier for creating weaponized exploits. Patching the root cause should be the priority, not relying on mitigations.
• ASLR can fully mitigate individual vulnerabilities unless there is an information leak, but exploit chains combining multiple vulnerabilities remain a risk. The burden is on readers to critically evaluate security claims rather than trusting confident assertions without evidence.
• The specific vulnerability requires unusual preconditions: a `rewrite` directive with a question mark in the replacement string followed by a `set` directive referencing a regex capture group. Many common nginx configurations like `proxy_set_header` with `$host` are not affected, only those using unnamed captures like `$1`.
• No major Linux distributions disable ASLR by default, though most default to mode 1 (only enabling ASLR for PIE binaries) versus mode 2 (forcing it on everything). Tools like `checksec` can audit running processes for missing hardening options.
• F5 has patched versions 1.31.0 and 1.30.1, and OpenResty has patches for 1.27 and 1.29. The recommended mitigation is to use named captures instead of unnamed ones in rewrite definitions.
• Worker processes share the same memory layout due to forking, so unlimited crashes could serve as a read oracle, making this a reliable denial-of-service at minimum. The PoC assumes ASLR is disabled, but motivated attackers still pose a real threat.
• Memory-safe alternatives like Caddy and Jetty have their own vulnerability histories, suggesting maturity and secure implementation matter more than just the language. Caddy's static compilation model is simpler for free software projects despite lacking a traditional plugin system.
• The vulnerability has existed in nginx for a long time, and version numbers can be misleading indicators of actual changes or security status. Debian 12 and Ubuntu 24.04 have patched versions available, and users should check with `apt list nginx` for exact version details.
The discussion reveals a tension between relying on defense-in-depth mitigations versus patching root causes, with ASLR being both a meaningful barrier and a bypassable one given enough attacker effort. The specific exploit requires unusual configuration patterns, reducing the attack surface for most users, but the underlying vulnerability's longevity in nginx highlights how memory-safety issues persist in mature C-based software. Alternatives in memory-safe languages offer some advantages but are not immune to other classes of vulnerabilities, suggesting that secure development practices and maturity are as important as language choice. The community response emphasizes practical steps like using named captures and checking distribution-specific patch levels rather than assuming version numbers reflect security status.
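The precondition mentioned in the discussion (a `rewrite` whose replacement contains `?`, followed by a `set` that references an unnamed capture such as `$1`) lends itself to a rough configuration scan. The sketch below is a line-based heuristic, not an nginx config parser: the regular expressions and the function name are assumptions, and quoted arguments, includes, and multi-line directives are not handled.

```python
import re

# rewrite <regex> <replacement> [flag];
REWRITE_RE = re.compile(r"^\s*rewrite\s+(\S+)\s+(\S+)")
# set $variable <value>;
SET_RE = re.compile(r"^\s*set\s+\$\w+\s+(.+?);")
# unnamed capture such as $1 or $2
UNNAMED_CAPTURE_RE = re.compile(r"\$\d")


def find_risky_pairs(config_text: str) -> list[tuple[int, int]]:
    """Return (rewrite_line, set_line) pairs that match the described pattern."""
    pairs = []
    pending_rewrite = None  # line number of the last rewrite with '?' in scope
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        if line.strip().startswith("#"):
            continue  # skip whole-line comments
        rewrite = REWRITE_RE.match(line)
        if rewrite:
            pending_rewrite = lineno if "?" in rewrite.group(2) else None
            continue
        if "}" in line:
            pending_rewrite = None  # leaving the block ends the pairing
            continue
        assignment = SET_RE.match(line)
        if pending_rewrite and assignment and UNNAMED_CAPTURE_RE.search(assignment.group(1)):
            pairs.append((pending_rewrite, lineno))
    return pairs
```

Any hit is a prompt for manual review rather than proof of exposure. Switching the flagged directives to named captures is the mitigation the discussion recommends.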
Modern cars have essentially become computers on wheels, equipped with a vast array of sensors, cameras, and microphones that constantly collect telemetry data. This information, which includes your location, speed, and even driver attention levels, is often monetized by brokers or shared with insurance companies. Beyond privacy concerns, there are significant security risks, ranging from vulnerabilities that allow remote unlocking to instances where employees have accessed sensitive camera footage. To take control of this data, the author decided to physically remove the modem and the built-in GPS from their 2024 RAV4 Hybrid, effectively cutting the car off from sending telemetry back to the manufacturer.
Removing these components does come with certain trade-offs regarding functionality. Without the Data Communication Module (DCM), the car loses over-the-air updates, cloud-based services, and automatic emergency SOS functionality, which presents a safety consideration. Additionally, because the car's microphone is wired through the DCM, a bypass kit is necessary to ensure phone calls can still be made via CarPlay. The author also disconnected the GPS antenna to prevent a known bug where the car's location signal would conflict with the phone's GPS, causing navigation errors. While this may affect certain parts of the vehicle warranty, the Magnuson-Moss Warranty Act protects the rest of the car's coverage from being voided by these specific modifications.
A critical detail for maintaining privacy is how the phone connects to the vehicle. If a driver uses Bluetooth, the car can actually use the phone as an internet connection to continue sending telemetry data to Toyota. To prevent this, the author recommends using a wired USB connection for CarPlay exclusively. For those who prefer wireless convenience, a Bluetooth-to-wired USB adapter can be used to trick the car into treating the connection as a wired one, thereby blocking the data transmission.
The physical process of removing the modem is described as a medium-difficulty project that requires basic tools like a trim removal kit, ratchets, and various sockets. The process involves removing the shifter assembly, pulling out the radio, and accessing the DCM, which is tucked away behind several panels and held in place by 8mm bolts. Once the modem is removed, a specialized DCM Bypass Kit is installed to restore the microphone functionality. This part of the job requires patience and careful maneuvering in tight spaces to avoid damaging existing wiring.
Disconnecting the GPS antenna is a much simpler task, involving the removal of the back panel behind the infotainment screen and the head unit. Through a process of elimination, the author identified the specific single-wire cable for the GPS antenna and unplugged it. Once everything is reassembled, the success of the project can be confirmed by checking the infotainment screen for a "no connection" icon and ensuring the SOS light in the overhead console is off.
Ultimately, the goal was to ensure that no telemetry leaves the car. While the author notes that these components may become more deeply integrated into future vehicle designs, making such modifications harder or impossible, this project successfully achieves a level of digital autonomy. The experience highlights a growing need for stronger privacy laws to protect consumers from the constant data harvesting inherent in modern automotive technology.
• Disconnecting a car's cellular modem may not fully prevent telemetry if the vehicle uses Bluetooth tethering or wireless CarPlay/Android Auto to access the phone's internet connection.
• Even when using a wired USB connection for smartphone integration, platforms like Google and Apple can still capture vehicle telemetry through the interface.
• Some vehicle owners have experienced broken GPS functionality when using Android Auto or CarPlay, leading to frustration with manufacturers who refuse to acknowledge software-related hardware failures.
• Physical modifications, such as removing the Data Communication Module (DCM) or pulling specific fuses, are effective ways to sever telemetry links, though they may disable SOS functions and over-the-air updates.
• There is a growing concern regarding the "subscription-based" model of modern vehicles, where manufacturers potentially use harvested data to subsidize hardware costs or generate secondary revenue.
• Privacy advocates suggest that instead of merely disconnecting, users could actively "poison" datasets by injecting fake or randomized driving data to make corporate tracking less valuable.
• Technological solutions like GrapheneOS can help sandbox smartphone apps, but complete privacy requires navigating a landscape of cellular tracking, public cameras, and financial records.
• Legislative action and the establishment of fundamental privacy rights are seen by some as the only long-term solution to pervasive corporate and government surveillance.
• The transition toward increasingly integrated, connected vehicles creates a tension between modern conveniences like navigation and the loss of behavioral privacy.
Modern vehicles are increasingly functioning like smart devices, where the integration of cellular connectivity and smartphone interfaces creates continuous streams of telemetry data. While hardware modifications can mitigate some tracking, the pervasive nature of digital footprints—ranging from GPS and cellular towers to financial transactions—makes total anonymity difficult to achieve. A tension exists between the convenience of connected features and the desire for privacy, with users increasingly forced to choose between modern functionality and data security. Ultimately, the discussion suggests that individual technical workarounds may be insufficient without broader legislative protections and a fundamental shift in how data ownership is defined.
What if you could strap a full desktop GPU to your MacBook Air? Turns out, you can. The author set out to connect an NVIDIA RTX 5090 to an M4 MacBook Air using a Thunderbolt eGPU dock, running the GPU inside a Linux virtual machine on macOS. The project involved significant engineering challenges, including PCI passthrough, DMA mapping workarounds, and x86 emulation layers, but ultimately succeeded in making the setup functional for both gaming and AI inference.
The core technical challenge was that macOS lacks native drivers for modern NVIDIA or AMD GPUs on Apple Silicon. The solution was to run Linux in an ARM64 VM and pass the Thunderbolt-connected GPU through to it. This required mapping PCI Base Address Registers (BARs) into the guest VM, which initially caused host kernel panics because of a bug in how QEMU handled memory flags with Apple's Hypervisor.framework. The fix was to remove the HV_MEMORY_EXEC flag from device memory mappings. For Direct Memory Access (DMA), DART, Apple's equivalent of an IOMMU, imposed strict limits: about 1.5 GB of total mapped memory and about 64k individual mappings. To work around this, the author created a virtual PCI device called apple-dma-pci that intercepts DMA mapping calls and groups small allocations into larger 256 KB regions, which sharply reduces the number of individual mappings.
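As a rough illustration of why clustering helps, the sketch below counts how many 256 KB-aligned regions are needed to cover a batch of small buffers. It is a toy model, not the author's apple-dma-pci code: the function name `clusters_for`, the base address, and the workload of 1,000 contiguous 4 KB buffers are all assumptions made for this example.

```python
CLUSTER_SIZE = 256 * 1024  # 256 KB region size mentioned in the summary


def clusters_for(allocations, cluster_size=CLUSTER_SIZE):
    """Return the set of cluster-aligned base addresses covering all allocations.

    `allocations` is an iterable of (address, size_in_bytes) pairs.
    """
    bases = set()
    for addr, size in allocations:
        first = addr // cluster_size
        last = (addr + size - 1) // cluster_size
        # A buffer may straddle a boundary, so record every region it touches.
        for idx in range(first, last + 1):
            bases.add(idx * cluster_size)
    return bases


# Hypothetical workload: 1,000 buffers of 4 KB each, packed back to back.
allocations = [(0x1000_0000 + i * 4096, 4096) for i in range(1000)]
regions = clusters_for(allocations)

print(len(allocations), "individual mappings ->", len(regions), "clustered mappings")
```

In this hypothetical case, 1,000 small mappings collapse into 16 region-sized ones. The saving would be smaller for buffers scattered widely across the address space, since each isolated buffer would still need its own region.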
Additional performance optimizations included enabling hardware Total Store Ordering (TSO) mode on Apple Silicon to accelerate x86 emulation through FEX-Emu, patching the NVIDIA driver via kprobes to handle DART's alignment constraints, and adjusting QEMU's thread scheduling priorities. Despite these improvements, the layered virtualization approach still incurred significant overhead. Benchmarks showed that a gaming PC with the same RTX 5090 in a native PCIe slot was 2-4x faster than the MacBook Air eGPU setup. At lower resolutions like 720p, the M4 Air's integrated GPU actually outperformed the eGPU configuration due to emulation overhead, but at 4K the eGPU transformed unplayable framerates into smooth gameplay.
The most compelling results came from AI inference benchmarks. Running Qwen 3.6, the eGPU setup achieved 155 tokens per second compared to 22 tokens per second on the M4 Air's integrated GPU, roughly a 6.5x improvement. More dramatically, prompt processing (prefill) for a 4K-token context dropped from 17 seconds to 150 milliseconds, roughly a 120x speedup. The RTX 5090 also handled concurrent requests much better, scaling almost linearly up to 4 concurrent streams. For dense models like Gemma 4, the performance gap was even more pronounced, with the eGPU delivering roughly 4x the throughput of Apple Silicon's integrated graphics.
While the project demonstrates impressive technical achievement, the author acknowledges it remains firmly in the hobbyist category. Stability issues include Steam crashes in FEX-Emu, DMA mapping fragmentation requiring GPU resets, and the need for a special Apple entitlement to distribute the driver. The setup requires building custom software and loading kernel extensions. The author is working with upstream QEMU to integrate the patches and hopes future improvements like native Thunderbolt support in Linux on Apple Silicon could eliminate many of the current limitations. For now, it stands as a proof of concept that a 22W laptop can harness a 600W GPU, even if a native PC remains significantly faster.
• The article's most significant contribution is highlighting Apple's severe prompt processing (prefill) bottleneck, where an M4 MacBook Air takes 17 seconds to process a 4K-token prompt versus 150ms with an eGPU, a 120x difference that only becomes apparent in serious workloads.
• The prefill problem is compute-bound and worsens with longer prompts, making it a critical limitation for practical LLM work on Apple Silicon despite the platform's appeal for running local models in abundant RAM.
• GPU passthrough to Linux VMs on Apple Silicon Macs is achievable using standard DriverKit interfaces that already allow PCIe BAR mapping from user-space, meaning the limitation lies in VMM adoption rather than fundamental macOS restrictions.
• Docker containers cannot solve the GPU access problem because macOS owns the PCI bus, and NVIDIA drivers cannot communicate with the GPU from within a container without PCI passthrough.
• The Mac Pro's lack of NVIDIA GPU support represents a missed opportunity, particularly as LLMs have created new demand for GPU compute that Apple failed to capitalize on, contributing to the product line's discontinuation.
• LLMs frequently provide outdated information due to knowledge cutoffs, with examples including ChatGPT not knowing about the RTX 5070 Ti or the Codex CLI, though web search capabilities in models like Grok and Kagi's research assistant help mitigate this.
• When corrected with sources, LLMs should follow direct user instructions regardless of their internal knowledge, as users may have access to non-public information about upcoming releases.
• eGPU support on Apple Silicon enables CUDA-capable NVIDIA GPUs to work with Mac Minis, which could eliminate the need to own a separate PC for AI inference workloads.
• The Steam Deck uses an x86 AMD APU, not ARM, while Valve's upcoming Steam Frame headset may bring ARM-based gaming to handhelds with Proton-like compatibility layers for Windows games.
• LLMs are best used as tools for research and discovery rather than as oracles of truth, and users should maintain skepticism about their outputs while still leveraging their capabilities for practical projects.
• A persistent gap exists between AI's impressive capabilities and its basic reliability, with models excelling at complex creative tasks while failing at simple factual operations like unit conversions, suggesting current approaches may not lead to genuine intelligence.
• The industry focuses on capability benchmarks that measure 50% success rates while ignoring reliability metrics. This creates a dangerous disconnect between demonstrated ability and production readiness, since even 80% reliability is insufficient for many enterprise uses.
• AI-assisted tools like VSCode plugins have deteriorated in quality, with most being completely unusable or not functioning as advertised, while the software itself has become slower and less reliable at basic tasks like environment detection.
• LLMs fundamentally lack the ability to accurately assess their own uncertainty, as their output probabilities are not calibrated to actual error rates, making it statistically unsound to use their confidence scores for decision-making.
• While models have internal probability distributions that theoretically contain uncertainty information, extracting meaningful confidence estimates requires computationally expensive calibration techniques that aren't implemented in production systems.
• The distinction between calibrated probabilities and relative uncertainty matters practically, with some arguing that even poorly calibrated self-assessments of confidence would be useful, while others insist calibration is essential for any rigorous application.
• Current AI systems cannot reliably distinguish between their own knowledge and fabrication, often producing confident-sounding outputs that are complete fabrications, with studies showing they typically state confidence in 80-100% ranges regardless of actual accuracy.
• AI note-takers and summarizers frequently fabricate information, misattribute statements, and distort the tone and content of communications, with particularly dangerous implications in medical and professional contexts where accuracy is critical.
• Medical AI scribes show alarming error rates, with 60% of evaluated systems mixing up prescribed drugs, though this must be contextualized against human error rates of approximately 19.6% for medication administration.
• The push for AI adoption in healthcare and other critical fields appears driven more by marketing and stakeholder pressure than by demonstrated reliability, with insufficient attention to the fundamental unsuitability of probabilistic systems for tasks requiring deterministic accuracy.
The discussion reveals a fundamental tension between AI's remarkable capabilities and its persistent reliability issues, with participants largely agreeing that current LLM technology excels at creative and complex tasks while failing at basic factual operations. The conversation highlights how the industry's focus on capability benchmarks obscures critical reliability gaps, particularly in high-stakes domains like healthcare where AI note-takers and summarizers demonstrate dangerous error rates. Multiple commenters emphasize that these systems lack genuine understanding of their own uncertainty, producing confident-sounding outputs regardless of accuracy. While some argue that AI should be judged against human error rates, others counter that AI makes qualitatively different types of errors, hallucinating information in ways humans never would. The underlying consensus suggests that current AI technology, despite impressive surface capabilities, remains fundamentally unsuitable for applications requiring high reliability, and that the rush to deploy these systems in critical contexts may be driven more by commercial interests than by genuine utility.
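The calibration point raised in the discussion can be made concrete with expected calibration error (ECE), a common way to measure the gap between stated confidence and observed accuracy. The sketch below is illustrative only: the function name, the bin count, and the sample data (a model that states 90% confidence on every answer but is right half the time) are assumptions, not figures from the discussion.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: always 90% confident, correct only half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)

# Hypothetical data: 50% confident and correct half the time.
calibrated = expected_calibration_error([0.5] * 10, [True] * 5 + [False] * 5)

print(f"overconfident ECE: {overconfident:.2f}, calibrated ECE: {calibrated:.2f}")
```

A score near zero means stated confidence tracks actual accuracy, while a large score means the confidence numbers carry little information about correctness.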