How to easily host your own LLM Chatbot (like ChatGPT)

Racked Crunchbits GPU servers

Remember that mind-blowing conversation you had with ChatGPT? The one that wrote you a poem, debugged your code, and even cracked a joke? Large Language Models (LLMs) like ChatGPT are the AI wizards behind these feats, capable of generating human-quality text and conversation. They've become the talk of the town, and for good reason.

But what if you could ditch the cloud and have your own personal ChatGPT, one you control entirely? That's the magic of self-hosted LLMs. Imagine having an AI assistant readily available on your own hardware, ready to brainstorm, translate languages, or simply chat without limitations.

Meet LlamaGPT

A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2.

Take control of your conversations with LlamaGPT, a self-hosted chatbot built on the powerful Llama 2 model. Unlike cloud-based chatbots, LlamaGPT keeps your data private by running offline on your own device. It also offers expanded capabilities with support for Code Llama models and the power of NVIDIA GPUs for an even smoother experience.

While this guide utilizes an instantly deployable powerhouse Crunchbits GPU Server (RTX 3090) for extreme performance, feel free to follow along with any GPU server or even your own PC. The core concepts remain the same.

Crunchbits' cheapest GPU Server (RTX 3070) starts at just $65/month

Crunchbits cuts through setup hassles with pre-configured GPU-ready templates. These templates come pre-loaded with the essential ingredients – GPU drivers and Python – ensuring seamless compatibility and allowing you to focus on what matters: unleashing the power of your LLM.
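
If you'd like to verify that for yourself, two quick commands on a freshly deployed server confirm the driver and Python are in place:

# Confirm the template's pre-installed GPU driver and Python
nvidia-smi          # should list the GPU and driver version
python3 --version   # confirms Python is available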

Crunchbits' VirtFusion Control Panel with GPU-Ready Templates

Within 5 minutes, our server is set up and ready to go!

Crunchbits' GPU Servers come with dedicated resources - the CPU cores, RAM, and disk are used only by you.

Time to get started!

There are multiple ways to set up LlamaGPT; the method I'll be using is Docker, which makes for a smooth and straightforward setup. We'll install Docker, add the NVIDIA-specific dependencies Docker needs to detect our GPU, and clone the project's source code below.

# Docker Installation
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh

# Install NVIDIA-specific dependencies for Docker
# (assumes NVIDIA's apt repository is already available, as on the GPU-ready template)
sudo apt install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker

# Clone the source
git clone https://github.com/getumbrel/llama-gpt.git
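
To confirm Docker can actually see the GPU before going further, a common sanity check is to run nvidia-smi inside a CUDA container (the image tag below is just an example; any recent nvidia/cuda tag works):

# Should print the same GPU table as nvidia-smi on the host
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi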

LlamaGPT supports many models; you can select any of them based on your requirements and the capabilities of your host machine. I'll go with the "70B GGML q4_0" model, as my test server has plenty of resources and a 10Gbps port for quick downloads.

Model name                                | Model size | Model download size | Memory required
Nous Hermes Llama 2 7B Chat (GGML q4_0)   | 7B         | 3.79GB              | 6.29GB
Nous Hermes Llama 2 13B Chat (GGML q4_0)  | 13B        | 7.32GB              | 9.82GB
Nous Hermes Llama 2 70B Chat (GGML q4_0)  | 70B        | 38.87GB             | 41.37GB
Code Llama 7B Chat (GGUF Q4_K_M)          | 7B         | 4.24GB              | 6.74GB
Code Llama 13B Chat (GGUF Q4_K_M)         | 13B        | 8.06GB              | 10.56GB
Phind Code Llama 34B Chat (GGUF Q4_K_M)   | 34B        | 20.22GB             | 22.72GB
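
Before committing to a download, it's worth checking your server against the "Memory required" column. Two standard commands show what you're working with:

# Check available system RAM and GPU VRAM before picking a model
free -h
nvidia-smi --query-gpu=name,memory.total --format=csv
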
# Setting up LlamaGPT - You can use any model
cd llama-gpt

# For a host machine with a GPU
./run.sh --model 70b --with-cuda

# For a host machine where you want to use only the CPU (not recommended)
./run.sh --model 7b

The initial run will take some time, as it has to download the model and complete the setup. Once it's ready, the web UI will listen on port 3000.
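
If you want to confirm it's up without leaving the terminal, a HEAD request against the default port should return an HTTP status line once the UI is ready:

# Quick liveness check from the server itself
curl -sI http://localhost:3000 | head -n 1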

Llama in docker

Access the web UI in your browser (e.g. http://192.168.69.69:3000)

Llama Web UI

Congratulations! You've got your very own chatbot up and running.
This is a temporary setup - to keep it persistent, you'll need to use docker-compose (sample compose file) and set up a reverse proxy like nginx so you can access your bot via a URL.
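
As a stopgap until you write a proper compose file, one lightweight option is Docker's built-in restart policy. The container name below is a placeholder; list yours first, then update each one:

# List the running LlamaGPT containers, then make them survive reboots
sudo docker ps --format '{{.Names}}'
sudo docker update --restart unless-stopped <container-name>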