Hosting Ultravox Multimodal LLM on Komodo: A Step-by-Step Tutorial

Introduction

Multimodal models like Ultravox are breaking new ground by accepting input beyond text, such as audio. In this tutorial, we’ll walk you through deploying the Ultravox multimodal LLM on Komodo. Whatever your background, this guide will help you get Ultravox up and running.

Set Up Your Komodo Account and CLI

Before you can deploy the model, you'll need to set up your Komodo account and install the CLI.

⭐ Join our Discord via this link to get free credits for completing this tutorial.

How to get started:

  1. Create an Account: Visit our app to sign up.

  2. Install the Komodo CLI and authenticate with your account.

Once you’re logged in, you have everything you need to manage jobs, machines, and services on Komodo.

Serve Ultravox

Serving any model on Komodo is seamless. All you need is a configuration file to get your production-ready model up and running.

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 1
    target_qps_per_replica: 5

  readiness_probe:
    initial_delay_seconds: 1800
    path: /health

resources:
  accelerators: A100:1
  cpus: 12+
  memory: 64+
  ports: 8000  # Expose to internet traffic.

setup: |
  pip install vllm==0.6.1.post2 vllm-flash-attn==2.6.1 vllm[audio]

run: |
  vllm serve fixie-ai/ultravox-v0_4 --gpu-memory-utilization 0.98 --max-model-len 22416 --port 8000

Copy the above contents to a file called service.yaml and deploy the service by running:

komo service launch --name ultravox service.yaml
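
The model takes a while to download and load, which is why the readiness probe above allows a long initial delay. If you want to check readiness from your own machine, the service exposes the same /health endpoint the probe uses (vLLM’s OpenAI-compatible server returns HTTP 200 once the model is loaded). Here is a minimal sketch, assuming YOUR_KOMODO_SERVICE_URL is a placeholder for the URL Komodo assigns to your service:

import time
import requests

base_url = "YOUR_KOMODO_SERVICE_URL"  # placeholder: replace with your service URL

# Poll the /health endpoint used by the readiness probe until it returns 200.
while True:
    try:
        if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
            print("Service is ready.")
            break
    except requests.RequestException:
        pass  # service not reachable yet
    time.sleep(30)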

Once your service is ready, you can chat with it directly in the Komodo app by entering fixie-ai/ultravox-v0_4 under Chat with your model.

To take advantage of the model's multimodal capabilities, use the sample Python code below to pass both audio and text as input.

import base64
from openai import OpenAI
from vllm.assets.audio import AudioAsset

base_url = "YOUR_KOMODO_SERVICE_URL"
openai_api_base = f"{base_url}/v1"

# The server does not require an API key by default, so any placeholder works.
client = OpenAI(
    api_key="EMPTY",
    base_url=openai_api_base,
)

# Use a sample audio clip bundled with vLLM
audio_url = AudioAsset("mary_had_lamb").url
# Or use a file from disk, inlined as a base64 data URL
# audio_file_path = "path/to/your/audio.mp3"
# audio_url = f"data:audio/mp3;base64,{base64.b64encode(open(audio_file_path, 'rb').read()).decode('utf-8')}"

chat_completion_from_url = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this audio?",
                },
                {
                    "type": "audio_url",
                    "audio_url": {"url": audio_url},
                },
            ],
        }
    ],
    model="fixie-ai/ultravox-v0_4",
    max_tokens=64,
)

result = chat_completion_from_url.choices[0].message.content
print(f"Chat completion output: {result}")

Summary

This guide has shown how simple it can be to deploy an LLM tailored to your requirements, with the added benefits of privacy and dedicated performance.

Now it’s time to put your model to work!

Get started!