Large language models (LLMs) such as the ones behind ChatGPT, Google Bard, and many others can be very helpful. But if you would like to play with the technology on your own, or if you care about privacy and want to chat with AI without your data ever leaving your own hardware, running LLMs locally can be a great idea. It’s surprisingly easy to get started, and there are many options available.
Here I’m going to list twelve easy ways to run LLMs locally, and discuss which ones are best for you.
Firstly, there is no single right answer for which tool you should pick. I found that these tools differ along a few dimensions, and you can decide which of those you care about most.
To find the right tool for you, consider these questions:
Here is a summary graphic comparing the different tools. I did some star ratings based on my quick and subjective experience – I hope it’s helpful for your comparison. For ease of reference, I also included the number of GitHub stars the project has, if it is open source:
Ollama is an extremely simple, command-line based tool to run LLMs. It’s very easy to get started, and can be used to build AI applications. As of this writing, it only supports Mac and Linux, and not Windows.
Streaming speed is fast, and setup is probably the easiest I’ve seen. You simply download and install it from their website. To run any model, you type the following command into your CLI –
ollama run [model name]
You can then start chatting directly within the command line.
You can also create custom models with a Modelfile, which allows you to give the model a system prompt, set temperature, etc.
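For example, a minimal Modelfile might look like this (the base model, system prompt, and temperature here are just placeholders to illustrate the format):

FROM mistral
SYSTEM You are a concise assistant that answers in plain language.
PARAMETER temperature 0.7

You can then build and run your custom model with ollama create my-assistant -f Modelfile followed by ollama run my-assistant.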
There are many community UIs built for Ollama. A non-exhaustive list includes Bionic GPT, HTML UI, Chatbot UI, Typescript UI, Minimalistic React UI for Ollama Models, Web UI, Ollamac, big-AGI, Cheshire Cat assistant framework, Amica, chatd, Ollama-SwiftUI, and MindMac. Out of all of these, Ollama Webui seems to be the most popular. The interface is very OpenAI-like. They also have an OllamaHub where you can discover different custom Modelfiles from the community.
Huggingface is an open source platform and community for deep learning models across language, vision, audio, and multimodal tasks. They develop and maintain the transformers library, which simplifies the process of downloading and training state-of-the-art deep learning models.
This is the best library if you have a background in machine learning and neural networks, since it offers seamless integration with popular deep learning libraries like PyTorch and TensorFlow.
Transformers works on top of PyTorch (or alternatively TensorFlow), so you need to install PyTorch along with transformers. Installation of PyTorch depends on your hardware – on the PyTorch installation page you can pick what hardware you have, and whether you have an Nvidia GPU and CUDA:

What’s cool is that if you have a MacBook with an M1/M2/M3 chip, PyTorch also has support for training on Apple Silicon through Apple’s Metal Performance Shaders (MPS).
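For example, the pip command the page generates for a default CPU/Apple Silicon setup is typically along these lines (treat this as a sketch – copy the exact command the page gives you for your platform and CUDA version):

pip3 install torch torchvision torchaudio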
After installing PyTorch, you can install transformers with:
pip install transformers
Running a model only takes a few lines of code. Below is an example to run the Mistral 7B Instruct model:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # if you have a Nvidia GPU and cuda installed
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)
model.to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
In terms of the breadth of model support, Huggingface is probably your best bet thanks to the Hugging Face Hub – you can pretty much find any model out there. Huggingface even maintains different leaderboards ranking LLMs.
Langchain is a framework for building AI applications. It ties different AI libraries together, so you can run LLMs with langchain using Ollama, Huggingface, or another library as the backend.
The utility of langchain is that it offers templates and components for building context-aware applications – meaning you can give your own documents and files to the LLM. This process is called RAG, or retrieval-augmented generation. So, langchain is a good candidate if you are building AI applications that need access to a custom dataset.
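As a minimal sketch of what this looks like in code (this assumes Ollama is running locally with the mistral model already pulled; the langchain_community import path is from recent langchain versions and may differ in yours):

from langchain_community.llms import Ollama

# Use a locally running Ollama model as the LLM behind langchain
llm = Ollama(model="mistral")
print(llm.invoke("In one sentence, what is retrieval augmented generation?"))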
Llama.cpp is the library that inspired most of the other libraries for running models locally. Its creators also introduced the .gguf file format, which is now supported by most other libraries.
Llama.cpp implements LLM inference in pure C/C++, so inference is very fast. It supports Mac, Windows, Linux, Docker, and FreeBSD. Apple Silicon is a first-class citizen, according to the creator. Also, despite the name, it actually supports many models outside of the llama family, like Mistral 7B, but model selection is a bit limited compared to some of the other libraries.
In terms of setup, you need to clone the repo and build the project (a sketch of the build commands is shown below). Then, you need to download a .gguf model from Huggingface. Here is a tiny model to get you started. If you have more time, you can download this Mistral 7B Instruct gguf. There are many other options as well.
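A typical build looks something like this (assuming Mac or Linux with make available – check the project’s README for Windows and for GPU-specific build flags):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make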
After that, you can run any model with this command –
./main -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -p "Hello"
This isn’t very convenient, so you can also run models in interactive mode or with a UI. To do this, first start a local server:
./server -m models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -c 2048
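Once the server is running, you can also query it directly over HTTP – for example (a sketch based on the server’s /completion endpoint; the parameter names may change between versions):

curl --request POST --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'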
Llama.cpp has its own UI and interactive mode. There are also a lot of community-created UIs that build on llama.cpp.
To run in interactive mode, run
bash examples/server/chat.sh
Llama.cpp also has a nice frontend. For me, it took some time to get it working; I finally did by making this change within examples/server/public/completion.js (this may not be necessary in future versions):
const response = await fetch("http://localhost:8080/completion", {
  method: 'POST',
  body: JSON.stringify(completionParams),
  headers: {
    'Connection': 'keep-alive',
    'Content-Type': 'application/json',
    'Accept': 'text/event-stream',
    ...(params.api_key ? {'Authorization': `Bearer ${params.api_key}`} : {})
  },
  signal: controller.signal,
});
After making this change, run this command in the public folder:
python3 -m http.server
And you’ll get a nice front-end like this:
There is also a list of community created UIs for llama.cpp on the project’s GitHub page.
Oobabooga’s text-generation-webui is a very popular frontend for running local LLMs. It is very easy to install, and is well suited for roleplay, since you can create your own characters with a name, context, and profile picture.
Koboldcpp is another frontend with native support for roleplay. It builds off of llama.cpp, with a nice UI and API on top of it. Setup is extremely easy – you can follow the instructions on GitHub. Here is what the UI looks like:
As you can tell from the UI, this is very much designed for role playing and games. You can select scenarios like Dungeon Crawler, or Post Apocalypse, import character cards, and have persistent stories.
GPT4All is a large open source project that can serve many purposes. From the GPT4All landing page you can download a desktop client that lets you run and chat with LLMs through a nice GUI — you can even upload your own documents and files in the GUI and ask questions about them. If you are looking to chat locally with your own documents, this is an out-of-the-box solution.
Here is how the UI looks:
Interestingly, the UI tells me about the inference speed as it is “typing”, which for me was about 7.2 tokens per second on my M1 16GB Macbook Air.
In addition to the GUI, it also offers bindings for Python and NodeJS, and has an integration with langchain, so it is possible to build applications as well.
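As a rough sketch of the Python bindings (after pip install gpt4all; the model name below is just an example from the GPT4All catalogue – the library downloads it on first use, and the exact file name may differ for you):

from gpt4all import GPT4All

# Downloads the model on first run, then generates entirely locally
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
print(model.generate("Explain what a token is, in one sentence.", max_tokens=100))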
Similar to GPT4All, LM Studio has a nice GUI for interacting with LLMs. It is the only project on this list that’s not open sourced, but it is free to download.
Here is what the UI looks like:
LM Studio also shows the token generation speed at the bottom – it says 3.57 tok/s for me. This looks quite a bit faster than GPT4All, but I have to say – there is a processing time before any tokens come out at all, which was noticeably long for me. This made the whole experience feel slower.
I wasn’t able to find the ability to upload your own documents and files. There are also no Python / NodeJS bindings for operating it with code.
Jan.ai is a relatively new tool, launched as “an open-source alternative to LM Studio”. Here is what the UI looks like — very clean! It is in dark mode because it’s night time as I’m writing this.
llm is a CLI tool and Python library for interacting with large language models. It’s very easy to install with pip (pip install llm) or Homebrew (brew install llm). By default it uses ChatGPT, and the tool asks you to set your OpenAI key. However, you can also download local models via the llm-gpt4all plugin.
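Installing the plugin and running a prompt against a local model looks roughly like this (the model alias is an example – run llm models list after installing the plugin to see what’s actually available):

llm install llm-gpt4all
llm models list
llm -m mistral-7b-instruct-v0 "Five fun facts about pelicans"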
Having an llm as a CLI utility can come in very handy. The creator gives the example of explaining a script:
cat mycode.py | llm -s "Explain this code"
Another fun fact is that llm is developed by Simon Willison, the co-creator of the Django Web Framework.
h2oGPT is by h2o.ai, a company that has been building distributed machine learning tools for many years. It has a nice UI, and it’s very easy to upload documents to the chat. Once you get it all set up, it works pretty nicely. This is what the UI looks like:
There are many ways to install h2oGPT – you can install from source and pip install a lot of requirements, or you can download one-click installers for Mac and Windows. The one-click installer is much faster, although for me, I had to run these two commands before installing:
$ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu
$ chmod +x {file-path}/h2ogpt-osx-m1-gpu
This library is not just a GUI – it’s actually chock-full of features. It is also a CLI utility and an inference server for applications. It even supports voice and vision models, not just text. That’s a lot to explore!
Lastly, I’d like to talk about a very new tool by Google Cloud announced just yesterday, Feb 6, 2024! It’s called localllm. Despite the name, it is designed with Google Cloud Workstations in mind, but you can also use it locally. If you’d like to run LLMs locally and migrate to the cloud later, this could be a good tool for you.
I tried running locally following these lines of code:
# Install the tools
pip3 install openai
pip3 install ./llm-tool/.
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000
python3 querylocal.py
The CLI command (which is also called llm, like the other llm CLI tool) downloads the model and serves it on local port 8000, which you can then work with through an OpenAI-compatible API.
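The querylocal.py script shows how to talk to that port. As a minimal sketch of the same idea with the openai Python client (the base URL, endpoint, and model name below are assumptions – adjust them to whatever the server actually reports):

from openai import OpenAI

# Point the OpenAI client at the local server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="TheBloke/Llama-2-13B-Ensemble-v5-GGUF",
    prompt="Say hello in one short sentence.",
    max_tokens=64,
)
print(response.choices[0].text)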
There are a lot more local LLM tools that I would love to try. I’m keeping a list of community suggestions here, and will try more of them when I have time:
Having tried all of these tools, I find they are each trying to solve a few different problems. So, depending on what you are looking to do, here are my conclusions:
There are still other tools for running local LLMs, and I’m still working on reviewing the rest of them. More are also coming out every day. If there’s any you’re particularly interested in seeing, please comment down below.