{"product_id":"ai-engineering-building-multi-modal-intelligent-systems-with-vision-language-and-audio-from-llm-fine-tuning-to-voice-agents-ar-interfaces-and-rea-9798296089038","title":"AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Rea","description":"\u003cp\u003e • Author(s): Husn Ara\u003cbr\u003e • Publisher: Independently Published\u003cbr\u003e • Publisher Imprint: Independently Published\u003cbr\u003e • BISAC: Artificial Intelligence - General\u003c\/p\u003e\u003cp\u003e\u003cb\u003e\u003ci\u003eAI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio\u003c\/i\u003e\u003c\/b\u003e\u003cbr\u003e\u003cb\u003e\u003ci\u003eFrom LLM Fine-Tuning to Voice Agents, AR Interfaces, and Real-World Deployment\u003c\/i\u003e\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003e\u003cb\u003eUnlock the future of artificial intelligence with practical, production-ready multi-modal engineering.\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eThis hands-on guide is built for developers, researchers, and AI professionals who want to go beyond chatbots and dive into building intelligent systems that understand \u003cb\u003etext, images, audio, and human intent\u003c\/b\u003e - all in one pipeline.\u003c\/p\u003e\u003cp\u003eWhether you're fine-tuning large language models (LLMs) or creating voice-driven AR interfaces, this book walks you through the real engineering decisions, tools, and architectures needed to bring multi-modal AI to life.\u003c\/p\u003e\u003cbr\u003e\u003cb\u003eWhat You'll Learn: \u003c\/b\u003e\u003cul\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eFine-tuning Large Language Models (LLMs)\u003c\/b\u003e: Train and adapt models like GPT-2, LLaMA, and Mistral for custom tasks using Hugging Face, LoRA, QLoRA, and PEFT.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eVoice 
Interfaces\u003c\/b\u003e: Combine Whisper, LLMs, and Bark\/Tortoise TTS to build interactive speech-driven assistants.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eComputer Vision + Language\u003c\/b\u003e: Use models like BLIP, CLIP, and DETR to connect what systems see to what they say and understand.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eInstruction Tuning \u0026amp; Hyperparameter Optimization\u003c\/b\u003e: Build smarter, domain-specific models with efficient training workflows.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eMulti-Modal Pipelines\u003c\/b\u003e: Chain audio, image, and text inputs for question answering, summarization, tutoring, and AR\/robotic control.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eReal-Time Interfaces\u003c\/b\u003e: Deploy intelligent agents using FastAPI, Streamlit, Gradio, Docker, and Hugging Face Spaces.\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003e\u003cb\u003eEdge \u0026amp; Offline Deployment\u003c\/b\u003e: Optimize models with ONNX, quantization (4-bit, 8-bit), and TensorRT for low-latency inference on CPU\/GPU.\u003c\/p\u003e\u003c\/li\u003e\n\u003c\/ul\u003e\u003cbr\u003e\u003cb\u003eUse Cases Covered: \u003c\/b\u003e\u003cul\u003e\n\u003cli\u003e\u003cp\u003eSmart document summarizers with OCR + TTS\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eVoice-enabled image assistants\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eEmotion-aware agents\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eVirtual tutors\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eAR-enhanced AI interfaces\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eRobotic perception + control from voice\/image input\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eSecure, multilingual, and privacy-conscious AI 
systems\u003c\/p\u003e\u003c\/li\u003e\n\u003c\/ul\u003e\u003cbr\u003e\u003cb\u003eTools \u0026amp; Frameworks Inside: \u003c\/b\u003e\u003cul\u003e\n\u003cli\u003e\u003cp\u003ePython, PyTorch, Hugging Face Transformers\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eLangChain, OpenCV, Whisper, TTS, BLIP\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eROS, Unity (AR\/VR), Gradio, Streamlit\u003c\/p\u003e\u003c\/li\u003e\n\u003cli\u003e\u003cp\u003eDocker, FastAPI, gRPC, TorchServe\u003c\/p\u003e\u003c\/li\u003e\n\u003c\/ul\u003e\u003cp\u003e\u003cb\u003eBuilt for engineers. Written with depth. Designed for real-world impact.\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eIf you're ready to build \u003cb\u003eintelligent multi-modal agents\u003c\/b\u003e that understand the world like humans do - across speech, vision, and language - this book gives you the complete roadmap.\u003c\/p\u003e\u003cp\u003e\u003cb\u003ePerfect for: \u003c\/b\u003e\u003cbr\u003eMachine learning engineers, data scientists, AI product developers, researchers, robotics engineers, and anyone building cutting-edge AI systems.\u003c\/p\u003e","brand":"Atlantic Books","offers":[{"title":"Paperback","offer_id":46334068850839,"sku":"9798296089038","price":3048.0,"currency_code":"INR","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0666\/3471\/1191\/files\/9798296089038.webp?v=1768671247","url":"https:\/\/atlanticbooks.com\/products\/ai-engineering-building-multi-modal-intelligent-systems-with-vision-language-and-audio-from-llm-fine-tuning-to-voice-agents-ar-interfaces-and-rea-9798296089038","provider":"Atlantic Books","version":"1.0","type":"link"}