
Local LLM Model in Private AI Server in WSL

Learn how to run a local LLM model and the benefits of keeping your data local. Improve data security and keep sensitive information within your business or home environment.

We are in the age of AI and machine learning, and it seems like everyone is using it. However, is the only real way to use AI tied to public services like OpenAI? No. We can run an LLM locally, which has many great benefits, such as keeping the data within your own environment, whether that is a home network or a home lab. Let's see how we can run a local LLM model to host our own private AI server using large language models.

Benefits of Running LLM Models Locally

There are many advantages, as you can imagine. One of the biggest reasons for hosting LLMs locally is that it keeps sensitive data within your own infrastructure and network, which improves data security. You also have more control over model performance and fine-tuning (making LLMs smarter), which allows you to create customized solutions that may be better suited to your needs than public services like OpenAI models, which require an API key for a cloud service.

Hardware Requirements

To run LLMs on your local machine, you need fairly beefy hardware. This includes a high-performance CPU and, arguably most important for usability and performance, a good GPU. Having the right hardware will make the experience much better across the board, as you won't be waiting for prompts to return.

Before starting, ensure your machine meets the necessary hardware requirements, which, as a guideline, might look something like the following:

  • CPU: A high-performance processor (Intel i7/AMD Ryzen 7 or better)
  • Memory: At least 16 GB of RAM (32 GB or more recommended)
  • Storage: SSD with at least 100 GB free space
  • GPU: Optional but recommended for better performance (NVIDIA GPUs with CUDA support)
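
If you plan to lean on an NVIDIA GPU, a quick way to confirm the card and driver are visible (assuming the NVIDIA driver is already installed, and in WSL the Windows-side driver) is:

nvidia-smi

If this prints a table with your GPU, driver version, and CUDA version, GPU acceleration should be available to the tools below; if it errors out, sort out the driver first.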

Operating system recommendations might look like the following:

  • Windows: Windows 10 or later
  • Linux: Ubuntu 20.04 LTS or later
  • macOS: macOS 10.15 Catalina or later

Operating Systems and Local Machines

Local LLMs can be deployed on different operating systems, including Windows, Linux, and macOS, with free tools available. Each OS has its nuances, and you will find the most mature, "GA ready" tools for macOS and Linux. Windows is coming along, though, and when you consider that you can use WSL in Windows, it may be a moot point. The overall process for running a local model remains similar in each environment.
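
Since this walkthrough uses WSL, it is worth confirming you are on WSL 2 before installing anything. A quick check from a Windows terminal might look like the following (the distribution name Ubuntu is just an example):

wsl -l -v
wsl --set-version Ubuntu 2

The first command lists your installed distributions and their WSL versions, and the second converts a distribution to WSL 2 if it is still on version 1.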

Downloaded Models and Model Weights

Models can be downloaded from repositories such as Hugging Face, probably the most popular place to download LLMs. These downloads include pre-trained models that may be a few gigabytes in size. Pre-trained means the model has already been trained on a large dataset, so it can respond to prompts accurately out of the box. The downloaded model can then be used with your tools.
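
As a rough example of what pulling weights directly from Hugging Face looks like, the huggingface_hub CLI can download a model repository to a local folder. The Llama 3 repositories are gated, so this sketch assumes you have already accepted the license on the Hugging Face site and have an access token to log in with:

pip install -U huggingface_hub
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./llama3-8b-instruct

For the Ollama workflow later in this post, you do not need to do this, since Ollama pulls its own packaged model files, but it is useful when a tool expects raw model weights.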

Below is a screen capture of the Hugging Face website, where you will see the list of open-source LLM models displayed, including many popular models.

Hugging face community

They have a large model explorer you can take a look at, sort through, and search.

Browsing the open source llm models

Setting Up Ollama

The first step is downloading Ollama. What is this tool? It is an open-source project that provides a fairly easy platform for running local LLM models on your operating system. You can navigate here to download it: Download Ollama.

Download ollama for your operating system local llm

Setting up on WSL and Linux

For Linux or WSL, you can run the following command:

curl -fsSL https://ollama.com/install.sh | sh

Below, you can see that we are using the Linux installation instructions to install Ollama in WSL. As you can see below, the installation is straightforward.

Installing ollama in wsl
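
Once the script finishes, a quick sanity check that the binary is installed and the server is reachable might look like this:

ollama --version
ollama list

If the server is not running in your WSL distribution (for example, if systemd is not enabled), you can start it manually in a separate terminal with ollama serve.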

Setting up on Windows

To set up Ollama on Windows, you can download the Ollama installer for Windows, which is currently a preview release.

Installing ollama in windows preview

You can see the Ollama server listening on port 11434.

Listening on 11434 port for local llm server

If you browse to that address and port in a browser, you will see the message that Ollama is running.

Ollama is running if you browse to the port in a browser
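
If you prefer the terminal over a browser, the same check works with curl and should return the "Ollama is running" message:

curl http://localhost:11434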

Downloading local models such as the LLAMA3 model

Now that we have Ollama installed in WSL, we can use the Ollama command line to download models. To do that, run the following command to download LLAMA3:

ollama run llama3

This will begin pulling down the LLM locally to your WSL/Linux instance.

Llama3 begins pulling down

As you can see below, the LLAMA3 local model is 4.7 GB. You will know the download is successful when you see the success message at the bottom. You will also see the "Send a message (/? for help)" prompt, which means we can start asking the model questions in this basic terminal version of the solution.

Llama3 pulling down for local llm is successful
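
If you want to download a model without immediately dropping into a chat session, or see which models are already on disk, the Ollama CLI also has pull, list, and rm subcommands:

ollama pull llama3
ollama list
ollama rm llama3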

Chatting with the installed LLAMA3 model

Below, I asked about the solar eclipse in the command-line tool. Since the model was trained before the April eclipse, it considers this a future event.

Chatting with llama3 local llm model
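
The interactive prompt is not the only way to talk to the model. Ollama also exposes a REST API on port 11434, which is what tools like Open WebUI use behind the scenes. A minimal example with curl looks like this:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Setting stream to false returns a single JSON response instead of streaming tokens as they are generated.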

Adding a web UI

One of the easiest ways to add a web UI is to use a project called Open WebUI. With Open WebUI, you can add a web frontend that looks eerily similar to the one used by OpenAI.

You can run the web UI using the Open WebUI project inside of Docker. According to the official documentation from Open WebUI, you can use the following command if Ollama is on the same computer:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

If Ollama is on a different server:

docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=https://example.com -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

For Nvidia GPU support, you can use the following:

docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda

Pulling down the Open-WebUI container.

Installing open web ui in docker
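
Whichever docker run variant you used, you can confirm the container started cleanly and watch its startup logs with standard Docker commands:

docker ps --filter name=open-webui
docker logs -f open-webui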

Once the container is up and running, you can connect to the port you configured for your traffic, in this case port 3000. You will need to enter an email address and password for Open WebUI to get started.

Sign in or sign up for open webui
Open webui launches successfully and signed up

Adding a proxy connection for my WSL network traffic

I needed to do this so I could pass the traffic from the Open WebUI to the backend LLM running on port 11434.

To do that, I ran the following command. Be sure to replace the IP addresses with your own.

netsh interface portproxy add v4tov4 listenport=11434 listenaddress=10.1.149.166 connectport=11434 connectaddress=127.0.0.1
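
You can list the existing port proxy rules to confirm the entry was created, and remove it later if it is no longer needed (again, substitute your own listen address):

netsh interface portproxy show v4tov4
netsh interface portproxy delete v4tov4 listenport=11434 listenaddress=10.1.149.166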

After I enabled the port proxy, I made sure the settings in Open WebUI pointed to the IP address and port I had configured. Also, select the model you want to use.

Managing local llm models and address

Below, you can see that after selecting the model, you can start chatting! Cool stuff.

Chatting with local llm model llama3

LM Studio

You can also install another free utility (not open-source, though) called LM Studio that takes care of downloading the models and provides the web UI frontend, so you don't have to do the steps mentioned above. You can download it here: LM Studio – Discover, download, and run local LLMs.

Download lm studio

Chatting with LM Studio.

Chatting in lm studio
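
Beyond the built-in chat window, LM Studio can also run a local server that exposes an OpenAI-compatible API. As a hedged sketch, assuming you have started the local server on its default port 1234 with a model loaded (the model name below is just a placeholder, since LM Studio answers with whatever model is loaded):

curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "local-model",
  "messages": [{ "role": "user", "content": "Summarize why local LLMs are useful." }]
}'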

Fine-Tuning and Model Performance

The fine-tuning process involves training your model with specific parameters to improve its performance on specific tasks. In other words, companies can take a generically trained model and retrain it for more specific use cases, like chatting with a product-specific knowledge base. In this context, model performance refers to accuracy, not necessarily speed.
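
True fine-tuning retrains model weights, which is beyond what Ollama does out of the box. A lighter-weight way to customize a local model's behavior is an Ollama Modelfile, which wraps an existing model with your own system prompt and parameters. The sketch below is just an example; the model name support-bot and the prompt text are placeholders:

FROM llama3
PARAMETER temperature 0.3
SYSTEM """
You are a support assistant that only answers questions about our internal product knowledge base.
"""

Save that as a file named Modelfile, then build and chat with the customized model:

ollama create support-bot -f Modelfile
ollama run support-bot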

Use cases of Local LLMs

Host the models locally and customize

Local LLMs provide an entry point into AI for businesses that may not be able to integrate with publicly available models such as those from OpenAI.

Better security

By keeping data within local systems, you minimize the risks associated with using public services, which is especially important if you have data that must meet regulatory compliance requirements.

Free to use

Publicly available LLM models, such as the LLAMA3 model we have shown here, are free to use. There is no cost other than the hardware. And if you are like me, you have some spare hardware lying around that can be taken advantage of, like older workstations with discrete graphics cards.

Troubleshooting and challenges

There are a few things you can look at when troubleshooting issues with running LLMs locally. The following sections describe a few of these.

Hardware and Memory Requirements

Make note of the hardware and memory requirements, including GPUs. Make sure your computer has a discrete GPU. If you have a free PCI-e slot, you can install an aftermarket GPU to accelerate AI workloads. You can also upgrade your hardware with more modern components or do things like adding memory.

Managing Model Weights and Configurations

Managing and configuring model weights is another challenge. Proper documentation and regular updates can simplify this process. Tools and libraries provided by platforms like Hugging Face can also be beneficial.
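
One practical piece of this is simply knowing where the weights live and how much space they take. With Ollama on Linux or WSL, the model blobs typically end up under ~/.ollama/models (or /usr/share/ollama/.ollama/models when running as the system service), so a quick look at disk usage alongside the model list might be (paths may vary on your setup):

du -sh ~/.ollama/models
ollama list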

GPU scarcity

There used to be an issue with GPU scarcity. This is not so much the case now that the supply and demand issues have settled down, and GPUs are fairly reasonably priced these days.

Wrapping up

Running your own local LLM is one of the coolest things you can do. It isn't very difficult to get open-source models like LLAMA3 up and running on your own hardware, and you don't need an Internet connection for it to work correctly or an API key integration with services like OpenAI.

However, keep in mind that for an enjoyable experience working with a local LLM, you will want hardware with a discrete GPU, like an NVIDIA GPU, to get the performance and speed you would expect from interacting with something like OpenAI.


