Leveraging Quantization in TensorRT-LLM and TensorRT
This tutorial session walks through an end-to-end optimization-to-deployment demo: quantizing language models with TensorRT-LLM and Stable Diffusion models with TensorRT.
Recommended For You

Session: Creativity With Real-Time Generative AI
Session: Accelerating End-to-End Language Models
Session: LLMs With TensorRT-LLM for Text Generation
Session: Triton and TensorRT for Universal Model Serving
Session: Autoregressive Model Parallel Inference Efficiency
Session: Optimizing Your LLM Pipeline for End-to-End Efficiency
Session: Deploying LLMs for Government Applications
Session: Simplifying OCR Serving with Triton Inference Server
Session: Optimizing Inference and New LLM Features in Desktops and Workstations
Session: An AI Revolution in Insurance Claim Process
Session: Benchmarking LLMs With Triton Inference Server
Session: Inference Model Serving for Highest Performance
Session: Scaling Generative AI Features to Millions of Users
Session: Build Accelerated AI With Hugging Face and NVIDIA
Session: Training and Inferencing LLMs on Azure
Session: AI Inference in Action
Session: Accelerate LLM Inference With TensorRT-LLM
Session: Fast and Memory-Efficient Exact Attention With IO-Awareness
Session: Accelerating Generative AI With TensorRT-LLM to Enhance Seller Experience at Amazon
Blog: Stable Diffusion XL on NVIDIA's Platform
Solution Brief: Inference Platform Solution Brief
Case Study: Wealthsimple Accelerates Machine Learning Model Delivery and Inference
Case Study: ControlExpert Accelerates the Motor Claims Process
Blog: Robust Scene Text Detection and Recognition: Introduction
Blog: Robust Scene Text Detection and Recognition: Implementation
Blog: Robust Scene Text Detection and Recognition: Inference Optimization
Case Study: NVIDIA Triton Speeds Inference on Oracle Cloud
Blog: NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs
Blog: Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
Blog: Accelerating Inference on End-to-End Workflows with H2O.ai and NVIDIA
Blog: Achieving Top Inference Performance with the NVIDIA H100 Tensor Core GPU and NVIDIA TensorRT-LLM
Blog: How Is AI Used in Fraud Detection?
Blog: Microsoft Bing Speeds Ad Delivery With NVIDIA Triton
Blog: NVIDIA Takes Inference to New Heights Across MLPerf Tests
Blog: Large Language Models Read Data With NVIDIA Triton
Blog: NVIDIA Hopper Sweeps AI Inference Benchmarks
Blog: Microsoft Teams Boosted With NVIDIA AI
Blog: NVIDIA Triton Tames the Seas
Blog: How to Deploy an AI Model with PyTriton
Blog: Best Practices for NVIDIA TensorRT
Blog: Increasing Inference Acceleration of KoGPT
Blog: Setting New Records in MLPerf Inference v3.0
Blog: New NVIDIA Triton and TensorRT Features
Blog: Supercharging AI Inference with NVIDIA L4 GPUs
Blog: NVIDIA TensorRT Deployment
Case Study: Designing an Optimal AI Inference for Autonomous Driving
Blog: Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Framework
Blog: Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server
Blog: Deploying GPT-J and T5 with NVIDIA Triton Inference Server
Blog: Run Multiple AI Models on the Same GPU with Amazon SageMaker Multi-Model Endpoints Powered by NVIDIA Triton Inference Server
Blog: Boosting AI Model Inference Performance on Azure Machine Learning
Blog: Deploying NVIDIA Triton at Scale with MIG and Kubernetes
Blog: One-Click Deployment of NVIDIA Triton Inference Server to Simplify AI Inference on Google Kubernetes Engine (GKE)
Blog: Serving ML Model Pipelines on NVIDIA Triton Inference Server with Ensemble Models
Blog: Accelerating Inference with NVIDIA Triton Inference Server and NVIDIA DALI
Session: Large Language Models with NVIDIA Triton Inference Server
Session: Accelerated Inference with Triton Inference Server
Session: An End-to-End Subgraph Optimization Framework
Session: Simplifying Inference for Every Model
Session: Take Your AI Inference to the Next Level
Session: Fast, Scalable, and Standardized AI Inference
Session: Accelerated App Deployment with OctoML and Triton
Session: Optimal AzureML Triton Model Deployment
Session: NVIDIA Triton Inference Server on Google Cloud Vertex AI
Video: Accelerate AI Workloads with NVIDIA L4
Video: How to Deploy HuggingFace’s Stable Diffusion Pipeline
Video: Getting Started with NVIDIA Triton Inference Server
Video: Top 5 Reasons Why Triton Is Simplifying Inference
Video: Getting Started with TensorFlow-TensorRT
Video: How To Increase Inference Performance with TensorFlow-TensorRT
Video: Getting Started with NVIDIA Torch-TensorRT
Video: NVIDIA TensorRT 8 Is Out. Here Is What You Need To Know.
Video: Getting Started with NVIDIA TensorRT
Video: Introduction to NVIDIA TensorRT
Video: NVIDIA TensorRT: High Performance Deep Learning Inference
Webinar: Move Enterprise AI Use Cases From Development to Production With Full-Stack AI Inferencing
Webinar: Harness the Power of Cloud-Ready AI Inference Solutions and Experience a Step-by-Step Demo of LLM Inference Deployment in the Cloud
Webinar: Unlocking AI Model Performance: Exploring PyTriton and Model Analyzer