What will I get if I subscribe to this Specialization?

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

Deploying Deep Learning: Quantization, Serving, and Edge AI

Instructor: Board Infinity

4 modules

Gain insight into a topic and learn the fundamentals.

Advanced level

2 weeks to complete at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Apply INT4/INT8 quantization (AWQ, GPTQ, GGUF) to compress LLMs and vision models for production
Deploy high-throughput inference servers using vLLM's PagedAttention and NVIDIA Triton
Run optimized LLMs on CPU and edge devices using ONNX Runtime and Llama.cpp
Build, benchmark, and containerize a production-ready inference API with Docker

Details to know

Shareable certificate

Add to your LinkedIn profile

There are 4 modules in this course

Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff.

Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns.

Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices.

Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:

- Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
- Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
- Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
- Build, benchmark, and containerize an end-to-end production-ready inference API

Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.

Learn model compression fundamentals, memory profiling, and modern INT8/INT4 quantization techniques including AWQ and GPTQ to optimize models for production inference.
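As a rough illustration of what INT8 quantization does to a tensor, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not AWQ or GPTQ, which add calibration and error compensation on top of this basic idea; the function names and the sample weights are hypothetical and are not taken from the course materials.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.42, -1.27, 0.03, 0.88]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest integer bounds the per-value error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four (FP32) or two (FP16), at the cost of a rounding error of at most half the scale per value. That error is the source of the accuracy side of the accuracy–latency tradeoff.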

What's included

9 videos, 3 readings, 4 assignments

9 videos (total 81 minutes)

Where Trained Models Actually Run (9 minutes)
Why Inference Optimization Is a Top Skill (7 minutes)
Skill Roadmap: Training → Inference → Edge (9 minutes)
Why Models Are Too Big (10 minutes)
Three Ways to Make Models Smaller (10 minutes)
Accuracy vs Latency: Making Tradeoffs (9 minutes)
What Quantization Really Does (9 minutes)
Quantizing LLMs with AWQ & GPTQ (8 minutes)
Benchmarking: Speed, Accuracy Drop & Perplexity Shift (10 minutes)

3 readings (total 90 minutes)

The 2026 Deployment Engineer Role: What Companies Want (30 minutes)
Model Compression Strategies at Scale (30 minutes)
Choosing the Right Quantization Method for Real Deployment (30 minutes)

4 assignments (total 150 minutes)

Model Compression, Quantization & Latency Optimization (60 minutes)
Career Scope in Production AI & Edge Deployment (30 minutes)
Fundamentals of Model Compression (30 minutes)
INT8/INT4 Quantization (AWQ, GPTQ) (30 minutes)

Master production-grade serving engines including vLLM with PagedAttention and NVIDIA Triton for scaling inference across GPUs and nodes.
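To make concrete why the KV cache limits throughput, the back-of-the-envelope sketch below estimates how much memory one cached token consumes and how many tokens fit in a given budget. The model shape (32 layers, 32 KV heads, head dimension 128, FP16 values) and the 16 GiB budget are assumptions chosen for illustration, not figures from the course.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory needed to cache one token across all layers.

    The leading factor of 2 accounts for storing both a key vector and a
    value vector per layer; bytes_per_value=2 assumes FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_memory_bytes, per_token_bytes):
    """Upper bound on tokens (summed over all concurrent requests) that fit."""
    return free_memory_bytes // per_token_bytes


# Hypothetical model shape and memory budget, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
budget = 16 * 1024**3  # 16 GiB left over after loading the weights (assumed)
capacity = max_cached_tokens(budget, per_token)

print(f"{per_token / 1024:.0f} KiB per token, room for {capacity} cached tokens")
```

Because that capacity is shared by every in-flight request, reserving a contiguous maximum-length region per request wastes most of it. Allocating the cache in small fixed-size blocks on demand, the idea behind PagedAttention, lets more requests share the same memory.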

What's included

9 videos, 3 readings, 4 assignments

9 videos (total 78 minutes)

What Breaks When Users Increase (8 minutes)
How Inference Servers Actually Work (11 minutes)
API Patterns for Inference (7 minutes)
Why KV Cache Limits Throughput (7 minutes)
Running a vLLM Server (16 minutes)
Handling Concurrent Requests Under Load (6 minutes)
When Triton Makes Sense (7 minutes)
Serving Vision Models with Triton (8 minutes)
Scaling Across GPUs (9 minutes)

3 readings (total 90 minutes)

From Training to Serving: What Changes in Architecture? (30 minutes)
PagedAttention Deep Dive & Performance Tuning (30 minutes)
Deployment Blueprints: GPU Clusters & Autoscaling Patterns (30 minutes)

4 assignments (total 150 minutes)

Serving Architectures Beyond Flask & Python Loops (60 minutes)
Serving Architectures Beyond Flask & Python Loops (30 minutes)
vLLM Internals (PagedAttention) (30 minutes)
NVIDIA Triton & Production Deployment Patterns (30 minutes)

Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.

What's included

9 videos, 3 readings, 4 assignments

9 videos (total 70 minutes)

Why ONNX Matters (9 minutes)
Exporting LLMs & Vision Models to ONNX - Part 1 (7 minutes)
Speeding Up Inference with ONNX Runtime - Part 1 (8 minutes)
What GGUF Is & Why It Matters (9 minutes)
Running LLMs with Llama.cpp - Part 1 (8 minutes)
Benchmarking: Latency, Token Throughput & Memory (9 minutes)
How CLIP Connects Text & Images - Part 1 (7 minutes)
Vision-Enhanced LLMs (LLaVA) - Part 1 (4 minutes)
Building a Simple Multimodal Pipeline (10 minutes)

3 readings (total 90 minutes)

ONNX Runtime Optimization Guide (30 minutes)
Edge LLM Deployment: Real-World Limitations & Solutions (30 minutes)
Multimodal Models: Practical Deployment Workflows (30 minutes)

4 assignments (total 120 minutes)

ONNX, Llama.cpp & Edge / CPU Deployment (30 minutes)
Exporting Models to ONNX (30 minutes)
Llama.cpp & GGUF for CPU/Edge Deployment (30 minutes)
Multimodal Inference (CLIP & LLaVA) (30 minutes)

Apply all course concepts in a final project to quantize a fine-tuned model, serve it via vLLM, benchmark it, and package it for cloud and edge deployment.
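The benchmarking step of a project like this usually comes down to measuring throughput and tail latency. Below is a minimal sketch of such a harness using only the Python standard library. The `fake_inference` function is a hypothetical stand-in for a real request to a served model, and the metric names are illustrative rather than the course's own template.

```python
import statistics
import time


def benchmark(fn, num_requests=20):
    """Call fn sequentially and report throughput plus p50/p95 latency."""
    latencies = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last item.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests_per_second": num_requests / elapsed,
        "p50_seconds": statistics.median(latencies),
        "p95_seconds": latencies[p95_index],
    }


def fake_inference():
    """Hypothetical placeholder workload standing in for a model call."""
    sum(i * i for i in range(10_000))


report = benchmark(fake_inference)
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

This sketch issues requests one at a time, so it measures single-stream latency. Evaluating a server under concurrent load would need parallel clients, which is left out here to keep the example short.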

What's included

9 videos, 3 readings, 4 assignments

9 videos (total 70 minutes)

Loading Your QLoRA/LoRA Fine-Tuned Model - Part 1 (7 minutes)
Configure PEFT with LoRA (7 minutes)
Validating Quality vs Speed (4 minutes)
Load and Preprocess the Dataset (10 minutes)
Generate and Store Model Outputs Before Fine-Tuning (9 minutes)
Configure Training Arguments and Fine-Tune the Model (8 minutes)
Compare Model Outputs After Fine-Tuning (9 minutes)
Dockerizing the Service (8 minutes)
Running on Cloud, CPU & Edge (10 minutes)

3 readings (total 90 minutes)

Quantization Validation Checklist for Production (30 minutes)
API Design Patterns for Generative Models (30 minutes)
Deployment Benchmark Templates (LLM + Vision) (30 minutes)

4 assignments (total 150 minutes)

Final Project - The Edge-Ready API (Quantize to Serve to Benchmark) (60 minutes)
Preparing the Fine-Tuned Model for Deployment (30 minutes)
Building the Production API (vLLM) (30 minutes)
Benchmarking & Deployment Packaging (30 minutes)

Instructor

Board Infinity

249 courses, 408,457 learners

Offered by

Board Infinity

Frequently asked questions

Yes, a working knowledge of deep learning, PyTorch, and LLM fundamentals is recommended. This course focuses on production deployment rather than training from scratch.

No. This is an advanced course. Learners should already understand model training, transformers, and basic MLOps concepts before starting.

It prepares you for roles like Inference Engineer, ML Deployment Engineer, Edge AI Developer, and MLOps Engineer focused on generative AI systems.

To access the course materials and assignments and to earn a Certificate, you need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade, but it also means that you will not be able to purchase a Certificate experience.