Fine-Tuning and Serving Llama 3.1 with Komodo

Fine-tuning and deploying large language models like Llama 3.1 is now more accessible than ever. With Komodo, a GPU cloud for AI developers, you can easily develop, fine-tune, and deploy models that meet your specific needs. Our platform abstracts away cloud infrastructure, allowing you to focus on bringing your ideas to life.

Table of Contents:

  • What is Llama 3.1? 🦙

  • Setup Komodo 💻

  • Prepare the Dataset 🔢

  • Launch Your First Job 🚀

  • Serve Your Model 🍽️

What is Llama 3.1?

Llama 3.1 is a collection of open-source large language models from Meta, built on an optimized transformer architecture that supports context lengths of up to 128,000 tokens. This large context window makes Llama 3.1 a good fit for a wide variety of applications. Llama 3.1 is available in 8B, 70B, and 405B parameter variants.

⭐ Learn more about Llama 3 here.

In this tutorial, we will fine-tune the 8B model, which offers a great balance between performance and resource requirements.

⭐ Before you proceed, ensure you have access to Llama 3.1 via Hugging Face, as it is distributed under a custom commercial license.

Set Up Your Komodo Account and CLI

Before you can deploy the model, you'll need to set up your Komodo account and install the CLI.

⭐ Join our Discord via this link to get free credits to complete this tutorial.

How to get started:

  1. Create an Account: Visit our app to sign up.

  2. Install the Komodo CLI and authenticate (a sketch of the commands follows below).

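The exact install and authentication commands depend on how Komodo currently distributes its CLI, so treat the following as a sketch rather than the canonical steps (the package name and login subcommand are assumptions; check the Komodo docs for the exact commands):

# Install the Komodo CLI (package name is an assumption)
pip install komo

# Authenticate the CLI against your Komodo account (subcommand is an assumption)
komo login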

Once you’re logged in, you have everything you need to manage jobs, machines, and services on Komodo.

Prepare Your Dataset

After setting up the Komodo CLI, you are ready to start fine-tuning. For this, we’ll use an Alpaca-style dataset, a format well suited to instruction-tuning LLMs. The Alpaca-style format is effective because it structures the data into clear instructions, outputs, and (optionally) inputs, allowing your model to learn how to respond to a wide range of tasks.

Here’s a quick example:

[
    {
        "instruction": "Evaluate this sentence for spelling and grammar mistakes",
        "input": "He finnished his meal and left the resturant",
        "output": "There are two spelling errors in the sentence. The corrected sentence should be: \"He finished his meal and left the restaurant.\""
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    }
]

⭐ In this tutorial, we will train on the cleaned version of the original Alpaca Dataset released by Stanford.

Launch Your Fine-Tuning Job

Next, you’ll launch your first fine-tuning job. With just one configuration file, fine-tuning Llama 3.1 on Komodo is straightforward.

Config File for Fine-Tuning

envs:
  HF_TOKEN: "YOUR HUGGINGFACE TOKEN"
  DATASET: "yahma/alpaca-cleaned"
  # this is the name of the HuggingFace repo that your model will be uploaded to
  MODEL_REPO: "YOUR REPO NAME"


resources:
  accelerators: A100:8

setup: |
  pip install torch torchao torchvision torchtune huggingface_hub

  tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
    --hf-token $HF_TOKEN \
    --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
    --ignore-patterns "original/consolidated*"

  wget https://raw.githubusercontent.com/pytorch/torchtune/651a7300435aa31f86d49511ea84400f89d7f59e/recipes/configs/llama3_1/8B_lora.yaml  

run: |
  tune run --nproc_per_node 8 \
    lora_finetune_distributed \
    --config 8B_lora.yaml \
    dataset.source=$DATASET

  # Remove the checkpoint files to save space, LoRA serving only needs the
  # adapter files.
  rm /tmp/Meta-Llama-3.1-8B-Instruct/*.pt
  rm /tmp/Meta-Llama-3.1-8B-Instruct/*.safetensors
  
  mkdir output_model
  rsync -Pavz /tmp/Meta-Llama-3.1-8B-Instruct output_model/
  cp -r /tmp/lora_finetune_output output_model/

  # upload to Hugging Face
  huggingface-cli upload $MODEL_REPO output_model .

Datasets Hosted Elsewhere

If your dataset is hosted outside of Hugging Face, you can adjust the configuration to download it from another source.

✅ If you have an AWS account connected on Komodo (a full or storage-only connection), you can download your data from S3 without any additional setup.

⭐ Connect AWS to Komodo here

Config File for non-Hugging Face Dataset

envs:
  HF_TOKEN: "YOUR HUGGINGFACE TOKEN"
  # this is the name of the HuggingFace repo that your model will be uploaded to
  MODEL_REPO: "YOUR REPO NAME"

file_mounts:
  # Copy the contents of your s3 bucket to /dataset
  /dataset:
    source: s3://YOUR-DATASET-BUCKET-NAME
    mode: COPY

resources:
  accelerators: A100:8

setup: |
  pip install torch torchao torchvision torchtune huggingface_hub

  tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
    --hf-token $HF_TOKEN \
    --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
    --ignore-patterns "original/consolidated*"

  wget https://raw.githubusercontent.com/pytorch/torchtune/651a7300435aa31f86d49511ea84400f89d7f59e/recipes/configs/llama3_1/8B_lora.yaml  

run: |
  # TODO: Update dataset.data_files below to point to the correct json file
  # inside your dataset folder
  tune run --nproc_per_node 8 \
    lora_finetune_distributed \
    --config 8B_lora.yaml \
    dataset.source=json \
    dataset._component_=torchtune.datasets.instruct_dataset \
    dataset.data_files=/dataset/my-instruct-dataset.json

  # Remove the checkpoint files to save space, LoRA serving only needs the
  # adapter files.
  rm /tmp/Meta-Llama-3.1-8B-Instruct/*.pt
  rm /tmp/Meta-Llama-3.1-8B-Instruct/*.safetensors
  
  mkdir output_model
  rsync -Pavz /tmp/Meta-Llama-3.1-8B-Instruct output_model/
  cp -r /tmp/lora_finetune_output output_model/

  # upload to Hugging Face
  huggingface-cli upload $MODEL_REPO output_model .

To start the fine-tuning job, download the config above and launch it with the Komodo CLI, for example:
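Assuming the job-launch command follows the same pattern as the service-launch command shown later in this guide (the subcommand and the config filename below are assumptions; check the Komodo docs for the exact syntax):

# Launch the fine-tuning job defined in the config above
# (subcommand and filename are assumptions based on the service launch pattern)
komo job launch finetune-llama3.yaml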

That’s it! 

By default, this job will train your model and upload the resulting weights to Hugging Face. However, if you prefer, you can modify the configuration to store the model elsewhere, such as your own S3 bucket.
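For example, if you would rather keep the weights in your own S3 bucket than push them to Hugging Face, you could replace the final huggingface-cli upload step in the run section with a standard AWS CLI sync (the bucket name and prefix below are placeholders, and this assumes the AWS CLI is installed and credentialed on the job, e.g. through the Komodo AWS connection mentioned above):

# Replace the huggingface-cli upload step with an S3 sync
# (bucket name and prefix are placeholders)
aws s3 sync output_model s3://YOUR-MODEL-BUCKET/llama3.1-8b-lora/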

⭐ Learn more about using data on Komodo here

Serve Your Model

Once your Llama 3.1 is fine-tuned, serving it is just as seamless. All you need is a configuration file to get your production-ready model up and running.

Config File for Llama 3.1 Service

envs:
  HF_TOKEN: "YOUR HUGGINGFACE TOKEN"
  # provide the same MODEL_REPO that you used in the fine-tuning config
  MODEL_REPO: "komodo-ai/cleaned-alpaca-llama3.1-8b"

resources:
  accelerators: A10:1
  ports: 8000

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 3

  readiness_probe:
    initial_delay_seconds: 1800
    path: /health

setup: |
  pip install vllm vllm-flash-attn

run: |
  vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --enable-lora \
    --lora-modules cleaned-alpaca-llama3.1-8b=$MODEL_REPO \
    --max-model-len 2048 \
    --port 8000

Deploy your service by running

komo service launch serve-finetuned-llama3.yaml --name YOUR-SERVICE-NAME

Once your service is ready, you can interact with your model directly from the Komodo dashboard and leverage the custom fine-tuning that makes it uniquely yours.
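Because vLLM exposes an OpenAI-compatible API, you can also query the service over HTTP. A minimal sketch, assuming your service is reachable at a placeholder endpoint (substitute the URL Komodo shows for your service) and using the LoRA module name registered in the serving config above:

# Query the OpenAI-compatible chat completions endpoint served by vLLM
# (the endpoint URL is a placeholder)
curl https://YOUR-SERVICE-ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cleaned-alpaca-llama3.1-8b",
        "messages": [{"role": "user", "content": "What are the three primary colors?"}],
        "max_tokens": 128
      }'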

Take Control of Your AI Stack

Whatever application you have in mind, Komodo is here to support you from the very first lines of code to production deployment. Running your own models comes with a little bit of upfront investment (which we aim to make as light as possible) but has long-term benefits such as:

  • Complete data privacy for your unique data

  • Full ownership of the model weights

  • No restrictions on model behavior

  • Cost predictability and freedom to optimize your model to your needs

Summary

Fine-tuning and deploying Llama 3.1 with Torchtune and Komodo allows you to fully customize your AI capabilities, ensuring your models are not only powerful but also aligned with your specific needs. 

This guide has shown you how simple it can be to create and serve a model tailored to your exact requirements, with the added benefits of data privacy, full ownership of the model weights, and control over performance and cost.

Now it’s time to put your model to work!

Get started!