Apple Silicon costs more than OpenRouter
344 points
• 1 day ago
When it comes to running large language models locally on Apple Silicon, the real cost isn't the electricity; it's the hardware. The author breaks down the economics of running a model like Gemma 4 31b on an M5 MacBook Pro with 64GB of RAM, which retails for $4,299. At 50-100 watts under load and electricity at around $0.18-0.20 per kWh, the power bill comes to roughly $0.01-0.02 per hour, or at most about $0.48 per day if running inference at full tilt around the clock. That's negligible. The real expense is the machine itself, and how quickly it depreciates.
The author walks through three possible lifespans for the hardware: 3, 5, and 10 years. At 5 years, which seems like a reasonable middle ground, the machine costs about $0.098 per hour when the purchase price is spread over every hour of that period. Add electricity and you're looking at roughly $0.12 per hour. The question then becomes how many tokens you can squeeze out of that time. For a serious model like Gemma4:31b, the M5 Max seems to manage somewhere between 10 and 40 tokens per second. At the low end of 10 tokens per second, that's 36,000 tokens per hour, which works out to somewhere between $1.61 per million tokens (10-year lifespan) and $4.79 per million tokens (3-year lifespan). At the optimistic end of 40 tokens per second and a 10-year lifespan, you could get down to around $0.40 per million tokens.
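The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal sketch, not the author's own calculation: the function names are hypothetical, the machine is assumed to run around the clock for its entire lifespan, and the 50 W and $0.18/kWh inputs are an assumption, chosen because they reproduce the quoted $1.61 to $4.79 range.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes the laptop runs inference 24/7


def hourly_hardware_cost(price_usd, lifespan_years):
    """Purchase price spread evenly over every hour of the lifespan."""
    return price_usd / (lifespan_years * HOURS_PER_YEAR)


def hourly_power_cost(watts, usd_per_kwh):
    """Electricity cost of one hour at a constant power draw."""
    return watts / 1000 * usd_per_kwh


def cost_per_million_tokens(price_usd, lifespan_years, watts, usd_per_kwh,
                            tokens_per_second):
    """Hardware plus power cost, in USD, of generating one million tokens."""
    hourly = (hourly_hardware_cost(price_usd, lifespan_years)
              + hourly_power_cost(watts, usd_per_kwh))
    tokens_per_hour = tokens_per_second * 3600
    return hourly / tokens_per_hour * 1_000_000


# Assumed inputs: $4,299 machine, 50 W draw, $0.18/kWh, 10 tokens per second.
for years in (3, 5, 10):
    cost = cost_per_million_tokens(4299, years, 50, 0.18, 10)
    print(f"{years:>2} years: ${cost:.2f} per million tokens")
# 3 years -> $4.79, 5 years -> $2.98, 10 years -> $1.61
```

With the same inputs at 40 tokens per second and a 10-year lifespan, `cost_per_million_tokens(4299, 10, 50, 0.18, 40)` comes to about $0.40, matching the optimistic figure above.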
Compare that to OpenRouter, where Gemma 4 31b runs about $0.38 to $0.50 per million tokens. On the most optimistic assumptions, the MacBook Pro barely breaks even with cloud pricing. On more realistic assumptions, local inference on Apple Silicon is about three times as expensive as just renting compute from OpenRouter. And OpenRouter providers are pushing 60-70 tokens per second, well above the 10-40 tokens per second the M5 Max manages locally.
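One way to read this comparison is to ask how fast the laptop would have to generate tokens to match the cloud price. The sketch below does that under stated assumptions: it uses the roughly $0.12 per hour all-in cost from the 5-year scenario, the function name is hypothetical, and it ignores everything except hardware and power.

```python
def breakeven_tokens_per_second(hourly_cost_usd, cloud_usd_per_million):
    """Throughput at which local cost per token equals the cloud price."""
    # Tokens the cloud would sell for the same money the laptop costs per hour.
    tokens_per_hour = hourly_cost_usd / cloud_usd_per_million * 1_000_000
    return tokens_per_hour / 3600


# Assumed local cost: about $0.12/hour (5-year lifespan plus electricity).
for cloud_price in (0.50, 0.38):
    tps = breakeven_tokens_per_second(0.12, cloud_price)
    print(f"${cloud_price:.2f}/M tokens -> {tps:.1f} tokens/s to break even")
# $0.50/M tokens -> 66.7 tokens/s
# $0.38/M tokens -> 87.7 tokens/s
```

Under these assumptions the break-even throughput sits above the 10-40 tokens per second reported for the M5 Max, which is consistent with local inference only matching cloud pricing when the lifespan is stretched to 10 years.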
The conclusion is pretty straightforward from a pure cost perspective. For someone using a work laptop, their salary dwarfs the token costs by a factor of about a thousand, so paying Anthropic or using OpenRouter makes far more financial sense than trying to run everything locally. That said, the author finds it remarkable that a consumer laptop can run models approaching Anthropic Sonnet-level performance at all, even if the economics don't quite pencil out yet.
292 comments
• Frontier AI companies are selling inference at a massive loss, burning through hundreds of billions of dollars to capture market share before they're forced to raise prices, making it irrational for individuals to compete on cost alone.
• Cloud providers achieve far superior efficiency through industrial electricity rates, wholesale hardware pricing, multi-tenant utilization, and specialized chips, making it nearly impossible for consumer hardware to compete on a pure cost-per-token basis.
• The entire AI inference stack is subsidized by venture capital, with companies like OpenRouter raising at $1.3B valuations and Chinese models like DeepSeek and Qwen pricing aggressively because Beijing-adjacent capital prioritizes market share over margins, meaning current low prices are not a stable equilibrium.
• Claims of profitability from companies like Anthropic and OpenAI ring hollow when they ignore the billions required for continuous model training, capital costs, depreciation, and user churn, making "profitable inference" a misleading ringfencing of expenses.
• The orange-growing analogy breaks down: inference is selling the oranges, while model building is planting the orchard. The real dynamic is a treadmill rather than a one-time investment, because a company that stops training becomes obsolete.
• Local inference makes economic sense primarily when the hardware is already owned for other purposes, as the marginal cost of running models on an existing laptop is essentially just electricity, not the full purchase price of a new machine.
• The primary value of local models isn't cost savings but control, privacy, confidentiality, data sovereignty, resilience against outages, and freedom from model deprecation or unexpected pricing changes, which are benefits that can't be captured in a simple cost-per-token comparison.
• For typical agentic workloads, input tokens dominate costs by a large margin (often 10x output costs), and local inference makes input tokens essentially free while prompt caching is more reliable on local hardware, significantly shifting the cost calculus in favor of local for these use cases.
• The comparison between a MacBook Pro and cloud services is flawed because it allocates the entire laptop cost to inference when most users already own the hardware for other purposes, and a laptop provides general-purpose computing value beyond just token generation.
• Smaller open models like Qwen 3.6 27B are closing the gap with larger frontier models on many benchmarks while running at usable speeds on consumer hardware, making them a compelling option for local inference that challenges the assumption that cloud is always superior.
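Several of the comments above argue that for hardware already owned, the marginal cost of local inference is essentially just electricity. A rough sketch of that case follows, using the article's power, electricity-price, and throughput ranges. The function name is hypothetical, and the figures ignore any extra wear on the machine.

```python
def marginal_cost_per_million_tokens(watts, usd_per_kwh, tokens_per_second):
    """Electricity-only cost, in USD, of one million locally generated tokens."""
    hourly_power_cost = watts / 1000 * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return hourly_power_cost / tokens_per_hour * 1_000_000


# Best case: 50 W at $0.18/kWh. Worst case: 100 W at $0.20/kWh.
for tps in (10, 40):
    low = marginal_cost_per_million_tokens(50, 0.18, tps)
    high = marginal_cost_per_million_tokens(100, 0.20, tps)
    print(f"{tps} tokens/s: ${low:.2f} to ${high:.2f} per million tokens")
# 10 tokens/s: $0.25 to $0.56 per million tokens
# 40 tokens/s: $0.06 to $0.14 per million tokens
```

On these assumptions, electricity alone lands in the same range as the quoted OpenRouter price at 10 tokens per second and below it at 40 tokens per second, which is why the allocation of the purchase price drives the whole comparison.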
The discussion reveals a fundamental tension between pure cost economics and broader value considerations in the local versus cloud AI inference debate. On a straight cost-per-token basis, cloud inference wins decisively thanks to economies of scale, industrial efficiency, and heavy venture subsidization, although commenters argue that the subsidies make current pricing unsustainable in the long term. However, participants consistently emphasize that reducing the comparison to cost alone misses the point for many users. Privacy, data sovereignty, resilience against outages, control over model behavior, and freedom from vendor lock-in represent significant non-monetary values that cloud services cannot provide. The most nuanced perspective holds that local inference makes the most sense when hardware is already owned for other purposes, when workloads are privacy-sensitive, or when input-heavy agentic tasks dominate, while cloud remains superior for raw performance, access to frontier models, and users who prioritize convenience over control. The consensus is that the choice isn't purely economic but depends heavily on individual priorities: cost is just one factor among several, alongside trust, confidentiality, and long-term predictability.