{"product_id":"parallel-computing-for-ai-and-ml-engineers-build-scalable-deep-learning-systems-with-gpu-programming-multi-gpu-training-and-production-workloads-9798195370404","title":"Parallel Computing for AI and ML Engineers: Build Scalable Deep Learning Systems with GPU Programming, Multi-GPU Training, and Production Workloads","description":"\u003cp\u003e • Author(s): M. T. Holbrook\u003cbr\u003e • Publisher: Independently Published\u003cbr\u003e • Publisher Imprint: Independently Published\u003cbr\u003e • BISAC: Neural Networks\u003c\/p\u003e\u003cp\u003e\u003c\/p\u003e\u003cp\u003e\u003cb\u003e\u003ci\u003eStop Guessing. Start Building ML Systems That Actually Scale.\u003c\/i\u003e\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eMost ML engineers learn GPU computing the hard way - through production failures, mysterious hangs, and models that take three times longer to train than they should. This book gives you the understanding and the tools to get it right the first time.\u003c\/p\u003e\u003cp\u003e\u003cb\u003eWhat This Book Covers\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003e-GPU architecture internals: CUDA cores, warps, shared memory, and memory coalescing\u003c\/p\u003e\u003cp\u003e-Writing and optimizing custom CUDA kernels in C++\u003c\/p\u003e\u003cp\u003e-Data parallel, model parallel, and pipeline parallel training with PyTorch DDP and FSDP\u003c\/p\u003e\u003cp\u003e-Multi-node training with NCCL, MPI, and InfiniBand\u003c\/p\u003e\u003cp\u003e-Mixed precision training and gradient scaling\u003c\/p\u003e\u003cp\u003e-ZeRO optimizer stages 1, 2, and 3 with DeepSpeed\u003c\/p\u003e\u003cp\u003e-Custom DataLoader optimization and NVIDIA DALI\u003c\/p\u003e\u003cp\u003e-Production model serving with Triton Inference Server\u003c\/p\u003e\u003cp\u003e-Kubernetes deployment with GPU autoscaling\u003c\/p\u003e\u003cp\u003e-Complete profiling workflows with Nsight and PyTorch Profiler\u003c\/p\u003e\u003cp\u003e-Troubleshooting CUDA OOM, NCCL hangs, and NaN losses\u003c\/p\u003e\u003cp\u003e-Capacity planning and hardware selection for real workloads\u003c\/p\u003e\u003cp\u003e\u003cb\u003eWho This Book Is For\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eThis book is written for ML engineers, AI researchers, and software engineers working on deep learning infrastructure who want to move beyond single-GPU experiments and build systems that perform at scale. You should be comfortable with Python and have basic familiarity with PyTorch or TensorFlow. No prior CUDA experience required.\u003c\/p\u003e\u003cp\u003e\u003cb\u003eWhat Makes This Book Different\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eEvery chapter includes complete, runnable code. Architecture diagrams show how components connect. Benchmark results come from real hardware measurements. The troubleshooting appendices address the exact errors that stop real training jobs. This is not a survey of techniques. It is a working engineer's guide to building production parallel ML systems.\u003c\/p\u003e","brand":"Independently Published","offers":[{"title":"Paperback","offer_id":47882867015831,"sku":"9798195370404","price":3341.0,"currency_code":"INR","in_stock":false}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0666\/3471\/1191\/files\/9798195370404.webp?v=1781097613","url":"https:\/\/atlanticbooks.com\/products\/parallel-computing-for-ai-and-ml-engineers-build-scalable-deep-learning-systems-with-gpu-programming-multi-gpu-training-and-production-workloads-9798195370404","provider":"Atlantic Books","version":"1.0","type":"link"}