CUDA Books
230 points
• 1 day ago
• Article
This is a curated list of major books on CUDA programming, covering resources from beginner to advanced levels, including C++ and Python, with a focus on architecture, optimization, and recent releases from 2024 to 2026. The list is organized into categories: beginner guides, core architecture, practical hands-on guides, advanced optimization, Python and high-level CUDA, and modern releases.
Notable beginner books include "CUDA by Example" (2010), "Learn CUDA Programming" (2019), and "CUDA for Engineers" (2016), which are example-driven and suitable for newcomers. Core architecture resources feature "Programming Massively Parallel Processors" (3rd edition, 2022), described as the definitive GPU architecture bible used in universities worldwide.
Practical guides include "Programming in Parallel with CUDA" (2022) with real-world scientific examples, "Professional CUDA C Programming" (2014) for production-level multi-GPU and streams, and "GPU Parallel Program Development Using CUDA" (2018) focusing on libraries like cuBLAS and Thrust. Advanced references include "The CUDA Handbook" (2013) for deep API details, "CUDA Programming" (2013) for parallel algorithms and optimization, and "CUDA Application Design and Development" (2011) for research applications.
Python-focused books are "Hands-On GPU Programming with Python and CUDA" (2018) for Numba and CuPy, and "GPU Programming with C++ and CUDA" (2024) with modern C++20 and Python interop. Modern releases from 2022 to 2026 include updated editions and specialized titles like "CUDA C++ Optimization" (2024), "CUDA C++ Debugging" (2024), and "High-Performance Computing with C++26 and CUDA 13" (2026).
Because CUDA changes rapidly, the list recommends pairing the books with the free official CUDA C++ Programming Guide (v13.x, 2026). Contributions are welcome via pull requests, with a preference for post-2018 books or still-relevant classics that include substantial code and examples. The repository is part of the Awesome series and includes related lists for CUDA tools, GPU resources, and parallel computing.
60 comments • Comments
• 'CUDA Programming: A Developer's Guide to Parallel Computing with GPUs' is recommended as the best introductory book, while 'Programming Massively Parallel Processors: A Hands-on Approach' is criticized for numerous errors and confusing explanations, and 'CUDA by Example' is considered too simplistic and too abstracted from the hardware architecture.
• A new CUDA book is being developed that takes a bottom-up approach, starting from hardware engineering and progressing to optimization on NVIDIA hardware, covering all major algorithms except graph algorithms, based on a successful university course.
• Although it was published in 2012, the first recommended book remains relevant because GPU hardware and the CUDA language have not changed fundamentally; it provides a solid foundation, and modern features can be picked up from other resources.
• Warp is suggested as a modern alternative for Python-based CUDA development: it lets developers write CUDA kernels directly in Python with a gentle learning curve, though it may be too new for book coverage.
• There is interest in resources covering newer paradigms like cuTile, indicating a gap in current educational materials for emerging GPU programming techniques.
• NVIDIA insiders increasingly advise against writing custom CUDA kernels unless doing so is your full-time job at NVIDIA, recommending higher-level libraries instead, though some see this advice as a way of promoting vendor lock-in.
• The recommendation to avoid custom CUDA kernels is compared to advising Python over C, or licensing Unreal instead of building your own rendering engine, highlighting the importance of choosing the right tool for specific needs.
• NVIDIA has not released working kernels for sm120 (non-data-center GPUs) even though Blackwell has shipped, which shows that NVIDIA does not always prioritize all hardware segments equally and makes reliance on NVIDIA's own tools risky.
• The decision to write custom CUDA kernels should be based on specific needs: use higher-level libraries when they suffice, but write custom kernels for learning, low-level control, micro-optimization, or kernel fusion to reduce memory traffic.
• 'AI Systems Performance Engineering' is mentioned as a relevant resource, even though it's not strictly focused on CUDA, suggesting broader performance engineering knowledge is valuable.
• The OLCF CUDA training series is recommended as a good introductory resource that covers fundamentals and makes subsequent books easier to understand.
• The link to the 3rd edition of 'Programming Massively Parallel Processors' is broken, and the 4th edition is now the current version.
• The pressure to use LLMs for immediate productivity raises questions about finding time for in-depth study through traditional book reading, reflecting an industry trend toward prompt engineering over fundamental coding skills.
• There's a sense that corporations prefer prompt engineering over traditional coding skills, creating tension between productivity demands and deep technical learning.
The discussion reveals a tension between the enduring value of foundational GPU programming knowledge and the industry's push toward higher-level abstractions and LLM-driven productivity. While several books are recommended for learning CUDA, there's consensus that writing custom kernels remains important for specific use cases like optimization and learning, despite vendor advice to use higher-level libraries. The community shows interest in newer tools like Warp and cuTile, indicating evolving practices, while also expressing concern about vendor lock-in and NVIDIA's inconsistent hardware support. The pressure to adopt LLMs for immediate productivity creates a conflict with the deep, time-intensive learning that mastering GPU programming requires.
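One of the comments above cites kernel fusion as a reason to write custom kernels because it reduces memory traffic. The following is a minimal sketch of that idea in plain Python rather than CUDA, so it runs without a GPU. It is a toy model, not a measurement: `CountingBuffer`, `unfused`, `fused`, and the example operation `relu(a * x + b)` are all hypothetical names chosen for illustration, and each loop stands in for one kernel launch that reads and writes global memory.

```python
# Toy model (assumption: each loop represents one GPU kernel, and every
# load/store represents one global-memory access). Not real CUDA code.


class CountingBuffer:
    """Hypothetical stand-in for a device array that counts element accesses."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def __len__(self):
        return len(self.data)

    def load(self, i):
        self.reads += 1
        return self.data[i]

    def store(self, i, value):
        self.writes += 1
        self.data[i] = value


def traffic(*buffers):
    """Total element reads plus writes across the given buffers."""
    return sum(b.reads + b.writes for b in buffers)


def unfused(x, a, b):
    """Three separate passes, as if calling three library kernels in a row."""
    n = len(x)
    scaled = CountingBuffer([0.0] * n)
    shifted = CountingBuffer([0.0] * n)
    y = CountingBuffer([0.0] * n)
    for i in range(n):  # "kernel" 1: scale
        scaled.store(i, a * x.load(i))
    for i in range(n):  # "kernel" 2: shift
        shifted.store(i, scaled.load(i) + b)
    for i in range(n):  # "kernel" 3: ReLU
        y.store(i, max(shifted.load(i), 0.0))
    return y, traffic(x, scaled, shifted, y)


def fused(x, a, b):
    """One pass, as if a single custom kernel did all three steps."""
    n = len(x)
    y = CountingBuffer([0.0] * n)
    for i in range(n):  # intermediates stay in a local variable ("register")
        y.store(i, max(a * x.load(i) + b, 0.0))
    return y, traffic(x, y)


if __name__ == "__main__":
    values = [-2.0, -1.0, 0.0, 1.0, 2.0]
    y_unfused, t_unfused = unfused(CountingBuffer(values), 2.0, 1.0)
    y_fused, t_fused = fused(CountingBuffer(values), 2.0, 1.0)
    print(y_unfused.data == y_fused.data)  # same result either way
    print(t_unfused, t_fused)  # 6 accesses per element vs 2 per element
```

In this model the unfused version touches memory six times per element and the fused version twice, while producing the same output. On real hardware the benefit depends on whether the operation is memory-bound, which is the case the commenters have in mind.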