Deterministic Fully-Static Whole-Binary Translation Without Heuristics
298 points
• 6 days ago
Elevator is a binary translator that statically converts entire x86-64 executables to AArch64 without debug information, source code, or assumptions about code layout. Unlike existing systems, it does not rely on heuristics or runtime fallbacks to distinguish code from data. Instead, it considers every possible interpretation of each byte in the binary and generates a separate translation for every feasible interpretation ahead of time. Any byte might be data, an opcode, or an opcode argument, so Elevator creates a separate control flow path for each possibility and prunes only those paths that would lead to abnormal termination.
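The idea of treating every byte as a possible instruction start can be illustrated with a minimal sketch. It assumes a made-up four-opcode toy ISA (`TOY_ISA`, `decode_at`, and `superset_decode` are hypothetical names, not part of Elevator), and it only discards offsets that do not decode at all, which is far simpler than Elevator's pruning of paths that lead to abnormal termination.

```python
from typing import Dict, NamedTuple, Optional


class Insn(NamedTuple):
    offset: int    # byte offset where this candidate instruction starts
    mnemonic: str
    length: int    # total encoded length in bytes, opcode included


# Hypothetical toy ISA (not x86-64): opcode byte -> (mnemonic, total length).
TOY_ISA = {
    0x01: ("nop", 1),
    0x02: ("loadi", 2),  # opcode + 8-bit immediate
    0x03: ("jmp", 2),    # opcode + 8-bit relative target
    0x04: ("ret", 1),
}


def decode_at(code: bytes, offset: int) -> Optional[Insn]:
    """Decode one candidate instruction at `offset`, or return None."""
    entry = TOY_ISA.get(code[offset])
    if entry is None:
        return None  # unknown opcode: no valid instruction starts here
    mnemonic, length = entry
    if offset + length > len(code):
        return None  # operand would run past the end of the buffer
    return Insn(offset, mnemonic, length)


def superset_decode(code: bytes) -> Dict[int, Insn]:
    """Try every byte offset as a possible instruction start."""
    candidates: Dict[int, Insn] = {}
    for offset in range(len(code)):
        insn = decode_at(code, offset)
        if insn is not None:
            candidates[offset] = insn
    return candidates
```

For the bytes `02 03 01 04`, this sketch keeps a candidate at every offset: byte `03` is both the immediate operand of the `loadi` at offset 0 and a `jmp` starting at offset 1. Keeping both readings is what lets the approach avoid guessing which one is "real", and it is also why the output grows.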
The system builds translations by composing code "tiles" that are automatically derived from a high-level description of the source instruction set architecture, which keeps the translation framework easy to adapt. The approach is fully deterministic: it produces complete, self-contained binaries with no runtime component in the trusted code base. Emulators and JIT compilers, by contrast, carry runtime overhead and leave it uncertain which code will actually execute.
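Tile composition can be pictured roughly as follows. This is a sketch under loud assumptions: the tile table is written by hand here, whereas the summary says Elevator derives its tiles automatically from an ISA description; the names `TILES` and `translate`, the source mnemonics, and the AArch64-flavoured output strings are all hypothetical and purely illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-written tile table: each source mnemonic maps to a fixed
# template of target instructions. Elevator's real tiles are derived
# automatically; this table only illustrates the composition step.
TILES: Dict[str, List[str]] = {
    "mov_imm": ["mov {dst}, #{imm}"],
    "add": ["adds {dst}, {dst}, {src}"],
    "ret": ["ret"],
}

# A source instruction is a mnemonic plus named operands (already mapped to
# target registers, which is another simplifying assumption).
SourceInsn = Tuple[str, Dict[str, object]]


def translate(program: List[SourceInsn]) -> List[str]:
    """Concatenate the instantiated tile of every source instruction."""
    output: List[str] = []
    for mnemonic, operands in program:
        for template in TILES[mnemonic]:
            output.append(template.format(**operands))
    return output
```

Because each tile is a fixed template and the composition is a plain concatenation, the same input always yields the same output, which is the property the deterministic claim depends on.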
The main trade-off is substantial code size expansion, because multiple interpretations of the same bytes produce multiple code paths. The key benefit is that the output is the actual code that will run on the target hardware. The translated binary can therefore be tested, validated, certified, and even cryptographically signed before deployment, which reduces risk compared with approaches where the final executing code is not known until runtime.
The researchers evaluated Elevator on a range of real-world binaries, including the entire SPECint 2006 benchmark suite. The results indicate that static full-program binary translation can be both reliable and practical. Elevator's performance is on par with or better than QEMU's user-mode JIT emulation, a widely used dynamic binary translation system. This is notable because static translators typically struggle to compete with JIT compilers, which can adapt their optimizations at runtime.
The work shows that complex real-world binaries can be translated completely and statically across processor architectures without sacrificing reliability. By eliminating heuristics and runtime components, Elevator suits scenarios where code integrity and predictability are critical, such as security-sensitive applications or systems requiring formal certification.
65 comments
• A developer recounts building a high-performance x86-64 to AArch64 JIT engine in 2013: their implementation ran 2x-5x slower than native code, while QEMU's JIT was 10x-50x slower, which suggests substantial room for optimization in QEMU's approach.
• QEMU's JIT architecture prioritizes broad architecture support over optimization. Its generic guest-to-intermediate-representation-to-host design gives up the performance available from specializing for a specific guest/host pair, such as exploiting x86's smaller set of integer registers or matching floating-point semantics between the two architectures.
• Certification and regulation emerge as a key advantage of static binary translation, particularly in industries such as aviation and medical devices, where code must be deterministic and signable. JIT-based solutions are unacceptable there despite their potential performance benefits.
• Indirect jumps are handled through lookup tables that map original addresses to translated code locations. This adds overhead, but commenters consider it acceptable because indirect jumps are inherently slower and rarely appear in performance-critical loops, although they do occur in the core dispatch loops of interpreters.
• Elevator runs about 4.75x faster than QEMU, but with significant trade-offs: 7x more executed instructions, a 50x binary size increase, a single-thread limitation, no exception handling, and incomplete ISA support. These make it unsuitable for complex applications such as the Electron-based Slack.
• Self-modifying code is explicitly unsupported by Elevator, as by all fully static binary rewriters, because static translation fundamentally cannot handle runtime code generation. Self-modifying code has become rare in modern systems, however, owing to W^X security requirements and performance penalties on superscalar CPUs.
• The 50x binary size increase that comes from translating every byte offset as potential code raises cache performance concerns. Link-time code reordering could mitigate this by grouping hot code together, assuming most possible decoding starting points remain unreachable at runtime.
• Apple's Rosetta benefits from Apple-specific ARM extensions for emulating the x86 memory model, which predate and differ from ARM's standard implementations. It may also have hardware support for x86 flags emulation, giving it a performance advantage over generic translation approaches.
• The discussion highlights a tension between AI industry priorities and regulated industries' need for deterministic, auditable software: LLMs are touted as universal solutions, while safety-critical systems require certified compilation pipelines in which the reviewed artifact is the last abstraction layer before binary generation.
• Elevator uses superset disassembly to resolve code-versus-data ambiguity: it treats every byte as both data and a potential instruction start, builds a comprehensive control flow graph, and resolves indirect jumps through lookup tables at runtime. This approach cannot handle adversarial code or runtime code generation because of fundamental theoretical limitations.
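The lookup-table handling of indirect jumps described in the comments can be sketched as below. The function names, the `TranslationFault` exception, and the idea of laying translated blocks out back to back are all assumptions made for illustration; the summary does not say how Elevator actually lays out or indexes its table.

```python
from typing import Dict, List, Tuple


class TranslationFault(Exception):
    """Raised when an indirect jump targets an address with no translation."""


def build_jump_table(blocks: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each original address to the offset of its translated block.

    `blocks` is a list of (original_address, translated_size_in_bytes) pairs,
    assumed to be emitted back to back in the output binary.
    """
    table: Dict[int, int] = {}
    cursor = 0  # running offset into the translated code
    for original_address, translated_size in blocks:
        table[original_address] = cursor
        cursor += translated_size
    return table


def resolve_indirect_jump(table: Dict[int, int], original_address: int) -> int:
    """Return the translated offset for a jump target computed at runtime."""
    try:
        return table[original_address]
    except KeyError:
        # No translation exists for this address, e.g. code generated at runtime.
        raise TranslationFault(hex(original_address)) from None
```

The table is fixed when the binary is produced, so the only work left at runtime is the lookup itself. A target that was never translated, such as code generated at runtime, has no entry, which is one way to see why the approach cannot support self-modifying code.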
The discussion shows a nuanced view of binary translation trade-offs. Participants recognize that static translation offers determinism that is valuable for certification, but that it carries substantial performance and size penalties compared with JIT approaches. Different use cases, from hobbyist experimentation to safety-critical systems, demand different optimization priorities, and QEMU's broad architecture support contrasts with specialized translators that can exploit the properties of a specific guest/host pair. Regulatory requirements emerge as a significant driver for adopting static translation, although participants note that the AI industry currently focuses on rapid innovation rather than on the attestation and reproducibility needs that dominate regulated environments.