How I Self-Hosted Llama 3.2 with Coolify on My Home Server: A Step-by-Step Guide – Raymond Yeh

Inspired by the many people migrating their Next.js applications from Vercel to self-hosted VPSes on Hetzner due to pricing concerns, I decided to explore self-hosting some of my non-critical applications. Additionally, I wanted to push my technical boundaries by running Llama 3.2 using Ollama and making its API available to power AI applications for my business, Wisp.

The objective was to breathe new life into an old home server that once ran high-frequency trading and MEV algorithms but had since become a step stool for my daughter to climb onto the TV console.

This blog post chronicles my journey of setting up Coolify to run Ollama (using Llama 3.2) on my home server, with a particular focus on the challenges and triumphs of enabling GPU acceleration using the CUDA toolkit.

(Update)

Since some of you are asking, this is how Llama 3.2 3B performs on a GeForce RTX 2060:

My Goals

The primary idea was to leverage my home server, previously collecting dust, to perform valuable tasks. Specifically, I wanted it to host and automate AI-related functions. Additionally, this setup would provide a centralized location to run Supabase for storing various data. The broader goal includes:

  • Serving a Next.js website: This site should be live on the public Internet, auto-deploy from the master branch, work with a public subdomain, and have no open public ports for security.

  • Running Llama 3.2: Utilizing the GPU for agentic tasks and making a locally accessible API.

  • Wildcard domain: Enabling new services to spin up effortlessly under varied subdomains.

Overall Experience

Setting up this environment was no small feat, but each step was a valuable learning experience. Here’s a walkthrough of my journey:

  1. Installing Ubuntu 24: This was a breeze, requiring only a single reboot.

  2. Coolify Installation: The process was smooth thanks to the handy install script, which ensured most prerequisites were met. A minor hiccup was resolved by running commands as root to avoid permission issues with the /data directory.

  3. Configuring Coolify: Initially, setting the server as localhost prevented assigning a domain via Cloudflare Tunnel. The fix involved adding the host as a second ‘server’. Moreover, configuring the tunnel and SSL correctly took time but was important for security and functionality.

  4. Cloudflare Tunnel: Patience was key here. The wildcard domain setup was essential, and understanding the nuances of Cloudflare’s free SSL certificate coverage saved time and money.

  5. Deployment Wins: Successfully hosting my personal blog using Coolify over a Cloudflare Tunnel was a significant morale boost, fueling the energy needed to proceed with the Ollama setup.

  6. Running Ollama: Coolify made deploying the Ollama service straightforward. However, initial trials showed sluggish inference speeds and heavy CPU usage.

  7. Enabling GPU: Ubuntu 24 had the NVIDIA drivers pre-installed, but CUDA toolkit installation and configuration posed challenges. Persistent efforts led to discovering the need for the nvidia-container-toolkit and Docker service configuration changes to enable GPU usage. The results were remarkable, reducing inference time by over 10x.

  8. API Exposure: Securing the LLM API with an API key became the next challenge. After unsuccessful attempts with nginx, I found potential solutions using Caddy. Something I’ll work on next after writing this post.

Server Specifications

For context, here are the specifications of my home server:

Step-by-Step Guide

1. Install Ubuntu (For a New Setup)

Start by installing Ubuntu on your home server. Follow the detailed guide available on the official Ubuntu website.

Important Settings:

  • Avoid using LVM or disk encryption for a smoother reboot experience and easier server management. Note that this trade-off means anyone with physical access can read your data.

  • Enable the installation of third-party drivers to automatically get the NVIDIA driver.

Install SSH

Enable SSH to access your server remotely, which is especially useful if you’re managing it from another machine on your local network. Refer to this SSH setup guide for Ubuntu 20.04, which works for Ubuntu 24 as well.
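
If you prefer to get SSH going without the guide, here is a minimal sketch (assuming a stock Ubuntu Server install; the username and IP are placeholders for your own setup):

    # install and enable the OpenSSH server, then note the server's LAN address
    sudo apt install -y openssh-server
    sudo systemctl enable --now ssh
    ip addr show
    # from another machine on the network:
    # ssh <username>@<server-ip>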

Update and Upgrade APT

Always perform an update and upgrade of your packages:

sudo apt update && sudo apt upgrade -y

Useful Commands

Here are some additional commands for printing information about your machine and setup:

  • View CPU usage: htop

  • List NVIDIA graphics card information: lspci | grep -i nvidia

  • Print architecture, OS distro, release: uname -m && cat /etc/*release

  • Print physical RAM installed: sudo dmidecode --type memory | less

  • Print processor info: cat /proc/cpuinfo | grep 'name' | uniq

  • Check NVIDIA driver information: nvidia-smi

2. Installing Coolify

Coolify is an open-source platform designed to make deploying and managing your applications on self-hosted environments easier. Its key feature is allowing users to manage full-stack applications, databases, and services without depending on complex Kubernetes setups. Coolify streamlines deployments through a user-friendly interface, supporting services like Docker, GitHub, and GitLab.

To install Coolify, follow the automated installation instructions from their documentation:

curl -fsSL https://cdn.coollabs.io/coolify/install.sh | bash

Important Notes:

  • Ensure you’re logged in as the root user to avoid permission issues.

  • The installation script checks prerequisites, but you may need to troubleshoot some dependency errors if they occur.

Once Coolify is installed:

  • Visit the Coolify dashboard at http://localhost:8000. This is also available on the network; replace localhost with the IP address of the server if you are accessing it from another machine.

  • Set up an admin account – this is stored locally on your server.

  • Create your first project by adding a resource and deploying it to localhost. In my case, I deployed my personal blog first to test the setup.

3. Setting Up a Cloudflare Tunnel on Coolify

Cloudflare Tunnel is a secure way to expose services running on your local network to the public internet without having to open ports on your router. For my setup, this was a key feature, as I wanted to keep my server behind a firewall while still allowing outside access to some services.

Cloudflare’s Zero Trust platform ensures that all traffic is encrypted and routed securely, preventing unauthorized access.

To set up a Cloudflare Tunnel, follow the instructions in Coolify’s official documentation. The key is to focus on setting up wildcard subdomains for all your services.

A few key caveats:

  1. Localhost Server Issue: You cannot assign a domain to the pre-created localhost server directly. To fix this, add your host as a second server within Coolify, using the IP address 172.17.0.1 in place of host.docker.internal (since Coolify will show an error that host.docker.internal has already been assigned to a server).

  2. Wildcard Domain Setup: Make sure you use a top-level domain like *.example.com. If you use a wildcard on a subdomain, Cloudflare will not provide a free SSL certificate, unless you opt for the ACM plan.
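
If you want to confirm that 172.17.0.1 really is the Docker bridge address on your host (it usually is with a default Docker install, but it is worth checking), you can inspect the bridge directly:

    # the inet address of docker0 is the gateway address Coolify should use for the host
    ip addr show docker0 | grep 'inet '
    # or ask Docker itself for the bridge gateway
    docker network inspect bridge --format '{{ (index .IPAM.Config 0).Gateway }}'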

After setting up the Cloudflare Tunnel:

  • Deploy your resources to the newly added server.

  • Change the domain in the service configuration to match your new subdomain.

  • Once everything is deployed, you should be able to access your service live from its custom domain.

4. Deploying Ollama

Once you’ve set up your Coolify project, the next step is to deploy Ollama. This service allows you to run large language models (LLMs) like Llama 3.2 on your server with a web-based interface.

  1. Add a New Resource: In your Coolify project, select “Add Resource” and choose Ollama with Open WebUI as the service you want to deploy to your new server.

  2. Configure the Domain: After adding Ollama, configure the domain for the Open WebUI service. Assign a domain from the wildcard domain you set up earlier through Cloudflare Tunnel. This will let you access the Ollama WebUI directly from the internet.

  3. Deploy the Service: Once everything is set up, click on “Deploy.”

    You should now be able to access the Open WebUI via the assigned domain. Upon your first login, you’ll be prompted to create an admin account. This is important for managing models and access through the UI.

  4. Install Llama 3.2: With your admin account set up, you can now install the latest Llama model. Head to Ollama’s library and search for the Llama model you want to use. I opted for Llama 3.2, which can be installed using the tag llama3.2 (a CLI alternative is sketched after this list).

  5. Try Your First Chat: Once installed, start your first chat with Llama 3.2 via the web interface. During this phase, your model will likely run on the CPU, so expect to hear your machine working hard (with increased CPU fan noise).

    To monitor your machine’s performance during this, use the following commands:

    • htop to keep an eye on CPU usage.

    • watch -n 0.5 nvidia-smi to track GPU usage (though at this stage, the GPU may not be used yet).
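
If you prefer the command line over the WebUI, you can also pull the model with the ollama CLI inside the running container, or through Ollama’s HTTP API. This is only a sketch: the container name is an assumption (check yours with docker ps), and the API call assumes port 11434 is reachable from where you run it.

    # find the container running Ollama (its name depends on your Coolify deployment)
    docker ps --format '{{.Names}}'
    # pull the model using the CLI bundled inside the container
    docker exec -it <ollama-container-name> ollama pull llama3.2
    # or pull it through the API on the default port
    curl http://localhost:11434/api/pull -d '{"name": "llama3.2"}'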

5. Configuring Ollama to Use GPU

Large language models (LLMs) like Llama 3.2 perform significantly better with GPU acceleration. GPUs, particularly those from NVIDIA, are optimized for the heavy parallel computations involved in AI workloads, which is where CUDA (Compute Unified Device Architecture) comes into play. The CUDA toolkit enables direct GPU acceleration for applications like Ollama, drastically reducing inference time and CPU load.

This is arguably the most challenging step in setting up Ollama on your server, but here’s a breakdown of the process:

  1. (Already Done): Install the NVIDIA driver (this should have been handled during your Ubuntu installation).

  2. Install the NVIDIA CUDA Toolkit: This toolkit is essential to unlock GPU acceleration.

  3. (Optional): Test that the CUDA toolkit is working correctly.

  4. Install the NVIDIA Container Toolkit: This will allow Docker containers (like Ollama) to access the GPU.

  5. Enable the Ollama service in Coolify to use the GPU.

Install the NVIDIA CUDA Toolkit

Follow NVIDIA’s official installation guide to install the CUDA toolkit for your system. I recommend using the network repository installation method for the most flexibility and ease of updates.

  1. Install the new cuda-keyring package:

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb

    Replace <distro>/<arch> with the appropriate value for your distribution and architecture:

    • ubuntu2004/arm64

    • ubuntu2004/sbsa

    • ubuntu2004/x86_64

    • ubuntu2204/sbsa

    • ubuntu2204/x86_64

    • ubuntu2404/sbsa

    • ubuntu2404/x86_64

  2. Update the APT repository cache:

    sudo apt-get update
  3. Install the CUDA SDK:

    sudo apt-get install cuda-toolkit
  4. Set up the environment for CUDA by adding its binaries to your PATH:

    export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
  5. Reboot the system to ensure all configurations take effect:

    sudo reboot
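
Note that the export above only lasts for the current shell session. As a small optional sketch (assuming bash and the CUDA 12.6 path used above), you can persist it and, after the reboot, confirm the compiler is visible:

    # persist the CUDA path for future sessions (adjust the version to your install)
    echo 'export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}' >> ~/.bashrc
    # after rebooting, check that the CUDA compiler is on the PATH
    nvcc --version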

(Optional) Test that CUDA Toolkit Works

To ensure that your system is correctly configured for GPU acceleration, test the CUDA installation by compiling and running the sample programs provided by NVIDIA (https://github.com/nvidia/cuda-samples).

  1. First, install the necessary build tools:

    sudo apt install build-essential
  2. Clone the CUDA samples repository and build the sample projects:

    git clone https://github.com/nvidia/cuda-samples
    cd cuda-samples
    make
  3. Navigate to the compiled binaries and run the deviceQuery tool to verify your GPU and CUDA installation:

    cd bin/x86_64/linux/release
    ./deviceQuery

    You should see detailed information about your GPU and CUDA environment, confirming that the toolkit is working correctly.

Install the NVIDIA Container Toolkit

To allow Docker containers to access your GPU, you’ll need to install the NVIDIA Container Toolkit. It lets Docker offload GPU-intensive operations to your NVIDIA GPU, which is essential for speeding up tasks like model inference with Ollama.

Follow the steps below, taken from the Ollama Docker docs, to install the NVIDIA Container Toolkit:

  1. Configure the repository:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
        | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
        | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
        | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
  2. Install the NVIDIA Container Toolkit packages:

    sudo apt-get install -y nvidia-container-toolkit

With the container toolkit installed, Docker will now be able to use your GPU.
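
Depending on your setup, Docker may also need to be told about the NVIDIA runtime before containers can actually see the GPU. The container toolkit ships the nvidia-ctk helper for this; a short sketch (assuming a standard Docker Engine install):

    # register the NVIDIA runtime with Docker and restart the daemon
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    # quick check: the toolkit injects nvidia-smi into a plain container when --gpus is used
    docker run --rm --gpus all ubuntu nvidia-smi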

Enable Ollama Service in Coolify to Use GPU

To enable GPU support for the Ollama service in Coolify, you’ll need to modify the Docker Compose file to allow access to all the GPUs on the host machine.

  1. Edit the Compose File: Navigate to the Ollama service in Coolify and select “Edit Compose File.”

  2. Add GPU Configuration: Append the following configuration under the ollama-api resource. This allows Docker to use all GPUs available on the host system:

        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities:
                    - gpu
  3. Restart the Service: After saving the changes, restart the Ollama service by clicking the “Restart” button in Coolify.

Once restarted, Ollama will leverage your GPU for model inference. You can verify that it’s using the GPU by running watch -n 0.5 nvidia-smi while a chat is in progress.

Testing GPU Performance

Try initiating another conversation with Llama 3.2 via the web UI. This time, you should notice a significant reduction in CPU load, as the GPU will handle the inference tasks.

Congratulations! You’ve successfully configured Ollama to use GPU acceleration through Coolify on your home server!
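
For a quick comparison outside the web UI, you can also time a request against Ollama’s HTTP API directly (assuming the API is reachable on the default port 11434 from your machine; the prompt is just an example):

    # send a single non-streaming generation request and time it
    time curl -s http://localhost:11434/api/generate \
        -d '{"model": "llama3.2", "prompt": "Explain CUDA in one sentence.", "stream": false}'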

Next Steps

The final step in securing your setup is to expose the LLM API to the internet while ensuring it’s protected by an API key. Using Caddy, you can enforce API key access for the Ollama service.

For a detailed guide, refer to this discussion.
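
As a rough idea of what that could look like (a sketch only; the domain, key, and upstream name are placeholders, and I haven’t wired this into Coolify yet), a Caddyfile could reject requests that don’t carry the expected bearer token:

    # Caddyfile sketch: only requests carrying the expected bearer token reach Ollama
    llm.example.com {
        @noauth not header Authorization "Bearer YOUR_API_KEY"
        respond @noauth 401
        reverse_proxy ollama-api:11434
    }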

Conclusion

In this post, I detailed my journey of setting up Llama 3.2 on my home server, utilizing GPU acceleration to handle AI workloads efficiently. Starting from a basic Ubuntu setup, I navigated the complexities of installing NVIDIA drivers, configuring Docker for GPU support, and deploying Ollama using Coolify. With this setup, I now have a powerful AI system running locally, handling agentic tasks with ease.

This guide walks through the entire process, from software installation to troubleshooting, and provides a blueprint for anyone looking to do the same.
