Review

12 Ways To Run Local LLMs And Which One Works Best For You

Here I’m going to list twelve easy ways to run LLMs locally, and discuss which ones are best for you.

Large language models (LLMs) like ChatGPT, Google Bard, and many others can be very helpful. But, if you would like to play with the technology on your own, or if you care about privacy and would like to chat with AI without the data ever leaving your own hardware — running LLMs locally can be a great idea. It’s surprisingly easy to get started, and there are many options available.

Firstly, there is no single right answer for which tool you should pick. I found that there are a few aspects that differentiate these tools, and you can decide which aspects you care about most.

Questions to Consider

To find the right tool for you, consider these questions:

  • Are you looking to develop an AI application?
  • Are you looking to chat locally with your own documents and have a nice UI?
  • Would you like to get deeper into the intricacies of machine learning and AI?
  • Do you have a Mac, Windows, or Linux machine?
  • How much do you care about inference speed?
  • How much do you care about ease of setup?
  • How much do you care about the breadth of model support?
  • Do you care if the project is open source?
  • Are you using LLMs for roleplay?

Summary Graphic

Here is a summary graphic comparing the different tools. I did some star ratings based on my quick and subjective experience – I hope it’s helpful for your comparison. For ease of reference, I also included the number of GitHub stars each project has, if it is open source:

Local LLM Tools

Ollama

Ollama is an extremely simple, command-line-based tool for running LLMs. It’s very easy to get started with, and it can be used to build AI applications. As of this writing, it only supports Mac and Linux, not Windows.

Streaming speed is fast, and setup is probably the easiest I’ve seen. You simply download and install it from their website. To run any model, you type the following command into your CLI –

ollama run [model name]

You can then start chatting directly within the command line.
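
Because Ollama also runs a local server (by default on port 11434), you can call it from your own code, which is what makes it handy for building applications. Here is a minimal Python sketch using the /api/generate endpoint – the model name is just an example, and it’s worth checking Ollama’s API docs for the current request format:

# pip install requests
import requests

# Ollama serves a local HTTP API on port 11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",              # example: any model you've already pulled
        "prompt": "Why is the sky blue?",
        "stream": False,                # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])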

You can also create custom models with a Modelfile, which allows you to give the model a system prompt, set temperature, etc.
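
As a minimal sketch of what a Modelfile might look like (the base model, parameter value, and system prompt here are just placeholders), together with the commands to build and run it:

# Modelfile – customize an existing model with parameters and a system prompt
FROM llama2
PARAMETER temperature 0.3
SYSTEM """
You are a concise assistant that answers in bullet points.
"""

ollama create my-assistant -f Modelfile
ollama run my-assistant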

UIs for Ollama

There are many community UIs built for Ollama. A non-exhaustive list includes Bionic GPT, HTML UI, Chatbot UI, Typescript UI, Minimalistic React UI for Ollama Models, Web UI, Ollamac, big-AGI, Cheshire Cat assistant framework, Amica, chatd, Ollama-SwiftUI, and MindMac. Out of all of these, Ollama WebUI seems to be the most popular. The interface is very OpenAI-like. They also have an OllamaHub where you can discover custom Modelfiles from the community.

🤗 Transformers

Hugging Face is an open source platform and community for deep learning models across language, vision, audio, and multimodal tasks. They develop and maintain the transformers library, which simplifies the process of downloading and training state-of-the-art deep learning models.

This is the best library if you have a background in machine learning and neural networks, since it offers seamless integration with popular deep learning frameworks like PyTorch and TensorFlow.

Transformers works on top of PyTorch (or, alternatively, TensorFlow), so you need to install PyTorch along with transformers.

Installation of PyTorch depends on your hardware – on the PyTorch installation page you can pick your operating system and compute platform, including whether you have an Nvidia GPU with CUDA:

What’s cool is that if you have a MacBook with an M1/M2/M3 chip, PyTorch also supports training on Apple Silicon through Apple’s Metal Performance Shaders (MPS).
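
As a quick, model-agnostic sanity check, you can ask PyTorch which accelerator it can actually see before you start loading models:

import torch

# Pick the best available device: Nvidia GPU (CUDA), Apple Silicon GPU (MPS), or CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")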

After installing PyTorch, you can install transformers with:

pip install transformers

Running a model only takes a few lines of code. Below is an example to run the Mistral 7B Instruct model:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # if you have a Nvidia GPU and cuda installed

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
   {"role": "user", "content": "What is your favourite condiment?"},
   {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
   {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Format the chat messages with the model's chat template and tokenize them
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

# Move the inputs and the model to the target device
model_inputs = encodeds.to(device)
model.to(device)

# Generate up to 1000 new tokens and decode them back into text
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

In terms of the breadth of model support, Hugging Face is probably your best bet thanks to the Hugging Face Hub – you can find pretty much any model out there. Hugging Face even maintains several leaderboards ranking LLMs.

Langchain

LangChain is a framework for building AI applications. It integrates different AI libraries, so you can run LLMs with LangChain using Ollama, Hugging Face, or another backend.

The utility of LangChain is that it offers templates and components for building context-aware applications – meaning you can give your own documents and files to the LLM. This process is called RAG, or retrieval-augmented generation. So, LangChain is a good candidate if you are building an AI application that needs access to a custom dataset.
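
To make this concrete, here is a small sketch of pointing LangChain at a local Ollama model and stuffing your own text into the prompt (in a real RAG app, the context would come from a retriever). Import paths have moved around between LangChain versions, so treat the exact modules as assumptions to verify against the current docs:

# pip install langchain langchain-community  (and have Ollama running locally)
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

llm = Ollama(model="llama2")  # example: any model already pulled with Ollama

# A tiny "context-aware" prompt: inject your own text as context
prompt = PromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
chain = prompt | llm  # LangChain Expression Language: prompt -> model

print(chain.invoke({
    "context": "Our return policy allows refunds within 30 days of purchase.",
    "question": "How long do customers have to request a refund?",
}))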

llama.cpp

Llama.cpp is the library that inspired many of the other libraries for running models locally. Its creators introduced the .gguf (GGUF) file format, which is now supported by most other libraries.

Llama.cpp implements LLM inference in pure C/C++, which makes it very fast. It supports Mac, Windows, Linux, Docker, and FreeBSD. Apple Silicon is a first-class citizen, according to the creator. Also, despite the name, it actually supports many models outside the Llama family, like Mistral 7B, though model selection is a bit limited compared to some of the other libraries.

In terms of setup, you need to clone the repo and build the project. Then, you need to download a .gguf model from Hugging Face. Here is a tiny model to get you started. If you have more time, you can download this Mistral 7B Instruct GGUF. There are many other options as well.

After that, you can run the model with a command like this –

./main -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -p "Hello"

This isn’t very convenient, so you can also run models in interactive mode or with a UI. To do this, first start a local server:

./server -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -c 2048
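
Once the server is running, it exposes a completion endpoint over HTTP (the same http://localhost:8080/completion that the bundled frontend calls, as shown further below), so you can also hit it from your own code. A minimal Python sketch, assuming the default host and port:

# pip install requests
import requests

# llama.cpp's built-in server listens on port 8080 by default
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 128,   # number of tokens to generate
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])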

UIs for Llama.cpp

Llama.cpp has its own UI and interactive mode. There are also a lot of community-created UIs that build on llama.cpp.

To run in interactive mode, run

bash examples/server/chat.sh

Llama.cpp also has a nice frontend. For me, it took some time to get the frontend working – I finally managed it by making this change within examples/server/public/completion.js (this may not be necessary in future versions):

 const response = await fetch("http://localhost:8080/completion", {
   method: 'POST',
   body: JSON.stringify(completionParams),
   headers: {
     'Connection': 'keep-alive',
     'Content-Type': 'application/json',
     'Accept': 'text/event-stream',
     ...(params.api_key ? {'Authorization': `Bearer ${params.api_key}`} : {})
   },
   signal: controller.signal,
 });

After making this change, run this command in the public folder:

python3 -m http.server

And you’ll get a nice front-end like this:

There is also a list of community created UIs for llama.cpp on the project’s GitHub page.

textgen-webui

Oobabooga’s textgen-webui is a very popular frontend for running local LLMs. It is very easy to install and is designed for roleplay, since you can create your own characters with a name, context, and profile picture.

koboldcpp

Koboldcpp is another frontend with native support for roleplay. It builds on llama.cpp, adding a nice UI and API on top. Setup is extremely easy – you can follow the instructions on GitHub. Here is what the UI looks like:

As you can tell from the UI, this is very much designed for roleplaying and games. You can select scenarios like Dungeon Crawler or Post Apocalypse, import character cards, and have persistent stories.

GPT4All

GPT4All is a large open source project that can serve many purposes. From the GPT4All landing page you can download a desktop client that lets you run and chat with LLMs through a nice GUI — you can even upload your own documents and files in the GUI and ask questions about them. If you are looking to chat locally with your own documents, this is an out-of-the-box solution.

Here is how the UI looks:

Interestingly, the UI shows the inference speed as it is “typing”, which for me was about 7.2 tokens per second on my M1 16GB MacBook Air.

In addition to the GUI, GPT4All also offers bindings for Python and NodeJS, and has a LangChain integration, so it is possible to build applications as well.
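
As a quick sketch of the Python bindings (the model filename is just an example – the library can download it on first use):

# pip install gpt4all
from gpt4all import GPT4All

# Example model file; GPT4All downloads it if it's not already cached locally
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")

with model.chat_session():
    print(model.generate("Summarize what GPT4All does in one sentence.", max_tokens=128))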

LM Studio

Similar to GPT4All, LM Studio has a nice GUI for interacting with LLMs. It is the only project on this list that is not open source, but it is free to download.

Here is what the UI looks like:

LM Studio also shows the token generation speed at the bottom – it says 3.57 tok/s for me, which is quite a bit slower than GPT4All. On top of that, there is a processing time before any tokens come out at all, which was noticeably long for me and made the whole experience feel even slower.

I wasn’t able to find a way to upload your own documents and files. There are also no Python/NodeJS bindings to operate it with code.

jan.ai

Jan.ai is a relatively new tool, launched as “an open-source alternative to LM Studio”. Here is what the UI looks like — very clean! It is in dark mode because it’s night time as I’m writing this.

llm

llm is a CLI tool and Python library for interacting with large language models. It’s very easy to install using pip (pip install llm) or Homebrew (brew install llm). By default it uses OpenAI’s ChatGPT models, so the tool asks you to set your OpenAI key. However, you can also download and run local models via the llm-gpt4all plugin.

Having an LLM available as a CLI utility can come in very handy. The creator gives the example of explaining a script:

cat mycode.py | llm -s "Explain this code"

Another fun fact is that llm is developed by Simon Willison, the co-creator of the Django Web Framework.
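
Since llm is also a Python library, you can do the same thing from code. A small sketch – the model ID is just an example from the llm-gpt4all plugin, and running `llm models` will list what’s actually available on your machine:

# pip install llm llm-gpt4all
import llm

# Example model ID from the llm-gpt4all plugin; run `llm models` to see yours
model = llm.get_model("mistral-7b-instruct-v0")
response = model.prompt("Explain this code in one paragraph: print('hello')")
print(response.text())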

h2oGPT

h2oGPT is by h2o.ai, a company that has been building distributed machine learning tools for many years. It has a nice UI, and it’s very easy to upload documents to the chat. Once you get it all set up, it works pretty nicely. Here is what the UI looks like:

There are many ways to install h2oGPT – you can install from source and pip install a lot of requirements, or you can download one-click installers for Mac and Windows. The one-click installer is much faster. For me, though, I had to run these two commands before the installer would launch:

$ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu
$ chmod +x {file-path}/h2ogpt-osx-m1-gpu

This library is not just a GUI – it’s actually chock-full of features. It is also a CLI utility and an inference server for applications. It even supports voice and vision models, not just text. That’s a lot to explore!

localllm

Lastly, I’d like to talk about a very new tool by Google Cloud, announced just yesterday, Feb 6, 2024! It’s called localllm. Despite the name, it is designed with Google Cloud Workstations in mind, but you can also use it locally. If you’d like to run LLMs locally and migrate to the cloud later, this could be a good tool for you.

I tried running it locally with the following commands:

# Install the tools
pip3 install openai
pip3 install ./llm-tool/.

# Download and serve a quantized model on local port 8000
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000

# Query the local server using the sample script from the repo
python3 querylocal.py

The CLI command (which is also called llm, like the other llm CLI tool above) downloads and runs the model on local port 8000, which you can then talk to through an OpenAI-compatible API.
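
The repo’s querylocal.py shows the exact usage, but roughly speaking you point an OpenAI client at the local port. Here is a sketch of what that might look like with the modern openai client – the /v1 base path and model name are assumptions to check against the script:

from openai import OpenAI

# Point the OpenAI client at the local server started by `llm run ... 8000`
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; a local server may ignore this field
    messages=[{"role": "user", "content": "Say hello from my workstation."}],
)
print(completion.choices[0].message.content)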

More Tools

There are a lot more local LLM tools that I would love to try. I’m keeping a list here from the community, and will try more of them when I have time:

  • Chat with RTX by Nvidia
  • ExLlamaV2
  • vLLM
  • Dify.ai

Conclusions

Having tried all of these tools, I find they are each trying to solve a slightly different problem. So, depending on what you are looking to do, here are my conclusions:

  • If you are looking to develop an AI application, and you have a Mac or Linux machine, Ollama is great because it’s very easy to set up, easy to work with, and fast.
  • If you are looking to chat locally with documents, GPT4All is the best out-of-the-box solution that is also easy to set up.
  • If you are looking for advanced control and insight into neural networks and machine learning, as well as the widest range of model support, you should try transformers.
  • In terms of speed, I think Ollama and llama.cpp are both very fast.
  • If you are looking to work with a CLI tool, llm is clean and easy to set up.
  • If you want to use Google Cloud, you should look into localllm.
  • For native support for roleplay and gaming (adding characters, persistent stories), the best choices are textgen-webui by Oobabooga and koboldcpp. Alternatively, you can use Ollama with a custom UI such as Ollama WebUI.

There are still other tools for running local LLMs, and I’m still working on reviewing the rest of them. There are also more coming out every day. If there’s any you’re particularly interested in seeing, please comment down below.