Large Language Model

Note: this topic focuses on the Rust landscape, which usually follows developments in Python and C++.

llm (aka LLaMA-rs)

llm is a Rust ecosystem of libraries for running inference on large language models, inspired by llama.cpp.

The primary crate is the llm crate, which wraps llm-base and supported model crates. This is used by llm-cli to provide inference for all supported models.

It is powered by the ggml tensor library, and aims to bring the robustness and ease of use of Rust to the world of large language models.


Conversions note
graph TD;
  A("PyTorch") --"<pre>1️⃣/2️⃣&nbsp;</pre>PyTorch model checkpoints (pth)"--> B(Python) --"<pre>3️⃣&nbsp;</pre>ggml tensor format (ggml)"--> C(C++)--"<pre>4️⃣&nbsp;quantize.cpp</pre>Quantized ggml (bin)"-->D(Rust);

1️⃣ tloen/alpaca-lora/ (llama-7b-hf)
2️⃣ jankais3r/LLaMA_MPS/ (llama-13b-hf)
3️⃣ llama.cpp/
4️⃣ llama.cpp/quantize.cpp
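
Step 4️⃣ above turns full-precision ggml weights into a much smaller quantized file. The following is a minimal sketch of block-wise 4-bit quantization, assuming a block of 32 weights that share one scale factor. It illustrates the idea only and is not the on-disk layout used by llama.cpp; the names `QBlock`, `quantize_block`, and `dequantize_block` are hypothetical.

```rust
/// Number of weights that share one scale factor (an assumption of this sketch).
const BLOCK: usize = 32;

/// One quantized block: a single f32 scale plus 32 signed 4-bit values.
/// Each value is stored in its own i8 for readability; a real format
/// would pack two 4-bit values into one byte.
#[derive(Debug, Clone, PartialEq)]
pub struct QBlock {
    pub scale: f32,
    pub q: [i8; BLOCK],
}

/// Quantize 32 floats to integers in -7..=7 using a per-block scale.
pub fn quantize_block(x: &[f32; BLOCK]) -> QBlock {
    // The largest magnitude in the block determines the scale.
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero block.
    let scale = if amax > 0.0 { amax / 7.0 } else { 1.0 };
    let mut q = [0i8; BLOCK];
    for (qi, xi) in q.iter_mut().zip(x.iter()) {
        *qi = (xi / scale).round().clamp(-7.0, 7.0) as i8;
    }
    QBlock { scale, q }
}

/// Recover approximate floats from a quantized block.
pub fn dequantize_block(b: &QBlock) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, qi) in out.iter_mut().zip(b.q.iter()) {
        *o = f32::from(*qi) * b.scale;
    }
    out
}
```

Under these assumptions, each reconstructed weight differs from the original by at most half a scale step. If the 4-bit values were packed two per byte, a block would take about 20 bytes (16 for the values, 4 for the scale) instead of 64 bytes at 16-bit precision.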


graph LR;
A("🐍 llama") --"4-bit"--> B("🐇 llama.cpp")
B --port ggml--> C("🦀 llm")
A --"16,32-bit"--> CC("🦀 RLLaMA")
A --Apple Silicon GPU--> AA("🐍 LLaMA_MPS")
C --"napi-rs"--> I("🐥 llama-node")
E --"fine-tuning to obey instructions"--> D("🐇 alpaca.cpp")
E --instruction-following--> H("🐍 codealpaca")
A --instruction-following--> E("🐍 alpaca") --LoRA--> F("🐍 alpaca-lora")
B --BLOOM-like--> BB("🐇 bloomz.cpp")
D --"fine-tunes the GPT-J 6B"--> DD("🐍 Dolly")
D --"instruction-tuned Flan-T5"--> DDD("🐍 Flan-Alpaca")
D --Alpaca_data_cleaned.json--> DDDD
E --RNN--> EE("🐍 RWKV-LM")
EE("🐍 RWKV-LM") --port--> EEE("🦀 smolrsrwkv")
H --finetuned--> EE
EE --ggml--> EEEE("🐇 rwkv.cpp")
A --"GPT-3.5-Turbo/7B"--> FF("🐍 gpt4all-lora")
A --"Apache0/nanoGPT"--> AAAA("🐍 Lit-LLaMA")
A --> AAA("🐍 LLaMA-Adapter")
A --ShareGPT/13B--> AAAAA("🐍 vicuna")
A --Dialogue fine-tuned--> AAAAAA("🐍 Koala")
  • 🐍 llama: Open and Efficient Foundation Language Models.
  • 🐍 LLaMA_MPS: Run LLaMA (and Stanford-Alpaca) inference on Apple Silicon GPUs.
  • 🐇 llama.cpp: Inference of LLaMA model in pure C/C++.
  • 🐇 alpaca.cpp: Combines the LLaMA foundation model with an open reproduction of Stanford Alpaca (a fine-tuning of the base model to obey instructions, akin to the RLHF used to train ChatGPT) and a set of modifications to llama.cpp that add a chat interface.
  • 🦀 llm: Do the LLaMA thing, but now in Rust 🦀🚀🦙
  • 🐍 alpaca: Stanford Alpaca: An Instruction-following LLaMA Model
  • 🐍 codealpaca: An Instruction-following LLaMA Model trained on code generation instructions.
  • 🐍 alpaca-lora: Low-Rank LLaMA Instruct-Tuning // train 1hr/RTX 4090
  • 🐥 llama-node: Node.js client library for LLaMA LLMs, built on top of llama-rs, llama.cpp and rwkv.cpp. It uses napi-rs for communication between Node.js and native code.
  • 🦀 RLLaMA: Rust+OpenCL+AVX2 implementation of LLaMA inference code.
  • 🐍 Dolly: This fine-tunes the GPT-J 6B model on the Alpaca dataset using a Databricks notebook.
  • 🐍 Flan-Alpaca: Instruction Tuning from Humans and Machines.
  • 🐇 bloomz.cpp: Inference of HuggingFace's BLOOM-like models in pure C/C++ built on top of the amazing llama.cpp.
  • 🐍 BLOOM-LoRA: Low-Rank LLaMA Instruct-Tuning.
  • 🐍 RWKV-LM: RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
  • 🦀 smolrsrwkv: A very basic example of the RWKV approach to language models written in Rust by someone that knows basically nothing about math or neural networks.
  • 🐍 gpt4all-lora: A chatbot trained on a massive collection of clean assistant data including code, stories and dialogue.
  • 🐍 Lit-LLaMA: Independent implementation of LLaMA that is fully open source under the Apache 2.0 license. This implementation builds on nanoGPT. // The finetuning requires a GPU with 40 GB memory (A100). Coming soon: LoRA + quantization for training on a consumer-grade GPU!
  • 🐇 rwkv.cpp: A port of BlinkDL/RWKV-LM to ggerganov/ggml. The end goal is to allow 4-bit quantized inference on CPU. // WIP
  • 🐍 LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. Using 52K self-instruct demonstrations, LLaMA-Adapter introduces only 1.2M learnable parameters on top of the frozen LLaMA 7B model. // 1 hour for fine-tuning on 8 A100 GPUs.
  • 🐍 vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
  • 🐍 Koala: A chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web.
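
The RWKV-LM entry above describes RNN-style inference with an "infinite" context length. The following is a minimal sketch of why that is possible, assuming the RWKV-4 style WKV weighted average for a single channel: the output at each token can be computed from a fixed-size running state instead of the whole history. The names `WkvState`, `wkv_step`, and `wkv_direct` are hypothetical; real implementations operate on vectors, add numerical stabilization, and surround this step with further layers.

```rust
/// Running state for one channel: a decayed weighted sum of past values
/// and the matching sum of weights. Its size does not grow with context.
#[derive(Debug, Clone, Copy, Default)]
pub struct WkvState {
    pub num: f32,
    pub den: f32,
}

/// One recurrent step: consume the key/value of the current token and
/// return the output. `w` is the decay rate and `u` the bonus applied
/// to the current token (both assumed to be learned elsewhere).
pub fn wkv_step(state: &mut WkvState, k: f32, v: f32, w: f32, u: f32) -> f32 {
    let bonus = (u + k).exp();
    let out = (state.num + bonus * v) / (state.den + bonus);
    // Decay the past, then fold the current token into the state.
    let decay = (-w).exp();
    state.num = decay * state.num + k.exp() * v;
    state.den = decay * state.den + k.exp();
    out
}

/// Reference version: recompute the output at position `t` from the
/// entire history, which costs O(t) work per token.
pub fn wkv_direct(ks: &[f32], vs: &[f32], w: f32, u: f32, t: usize) -> f32 {
    let mut num = (u + ks[t]).exp() * vs[t];
    let mut den = (u + ks[t]).exp();
    for i in 0..t {
        let weight = (-((t - 1 - i) as f32) * w + ks[i]).exp();
        num += weight * vs[i];
        den += weight;
    }
    num / den
}
```

Both functions produce the same outputs up to floating-point error; the recurrent form needs only two numbers of state per channel, which is what keeps memory use flat as the context grows.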



graph LR;
Z-.Flamingo-style LMMs..-X("OpenFlamingo")
Z-.Chinchilla formula..->Y("Cerebras-GPT")--LoRA--> YY("🐍 Cerebras-GPT2.7B LoRA Alpaca")
  • 🐍 Cerebras-GPT: a family of seven GPT models ranging from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget. Cerebras-GPT has faster training times, lower training costs, and consumes less energy than any publicly available model to date.
  • 🐍 OpenFlamingo: a framework that enables training and evaluation of large multimodal models (LMMs).
  • 🐍 Cerebras-GPT2.7B LoRA Alpaca ShortPrompt: Cerebras-GPT2.7B LoRA Alpaca ShortPrompt.
  • Dolly: 12B-parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, crowdsourced among Databricks employees.
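
Several entries above (alpaca-lora, Cerebras-GPT2.7B LoRA Alpaca) rely on LoRA, which trains a low-rank update instead of the full weight matrix. The following is a minimal sketch of the merge step `W' = W + (alpha / r) * B * A`, assuming plain row-major matrices. The `Matrix` type and the function names are hypothetical and not taken from any of the listed projects.

```rust
/// Dense row-major matrix, kept deliberately simple for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

impl Matrix {
    pub fn zeros(rows: usize, cols: usize) -> Self {
        Matrix { rows, cols, data: vec![0.0; rows * cols] }
    }

    /// Naive matrix product `self * other`.
    pub fn matmul(&self, other: &Matrix) -> Matrix {
        assert_eq!(self.cols, other.rows, "inner dimensions must match");
        let mut out = Matrix::zeros(self.rows, other.cols);
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.data[i * self.cols + k];
                for j in 0..other.cols {
                    out.data[i * other.cols + j] += a * other.data[k * other.cols + j];
                }
            }
        }
        out
    }
}

/// Merge a LoRA update into a frozen weight matrix.
/// Shapes: `w` is d_out x d_in, `b` is d_out x r, `a` is r x d_in.
pub fn lora_merge(w: &Matrix, a: &Matrix, b: &Matrix, alpha: f32) -> Matrix {
    let r = a.rows;
    assert_eq!(b.cols, r, "rank of A and B must match");
    assert_eq!((b.rows, a.cols), (w.rows, w.cols), "update must match W");
    let delta = b.matmul(a);
    let scale = alpha / r as f32;
    let data = w
        .data
        .iter()
        .zip(delta.data.iter())
        .map(|(wi, di)| wi + scale * di)
        .collect();
    Matrix { rows: w.rows, cols: w.cols, data }
}

/// Number of trainable parameters in the two low-rank factors.
pub fn lora_trainable_params(d_out: usize, d_in: usize, r: usize) -> usize {
    r * (d_in + d_out)
}
```

For a 4096 x 4096 layer with rank 8, this gives 65,536 trainable parameters, compared with 16,777,216 in the full matrix, which is why such fine-tunes fit on a single consumer GPU.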


graph TD;
AAAA ---> J("🦀 llm-chain")
AAAA --> I
AAA --> A
A("🐍 langchain")
A --port--> AA("🐥 langchainjs")
AA --> B("🐥 langchain-alpaca")
D("🐇 alpaca.cpp") --> B
E-..-DD("🐍 petals")
E("🐇 llama.cpp") --ggml/13B--> H
F("🐇 whisper.cpp") --whisper-small--> H
H("🐇 talk")
I("🐍 chatgpt-retrieval-plugin") --> II("🐍 llama-retrieval-plugin")

  • 🐍 langchain: Building applications with LLMs through composability.
  • 🐥 langchainjs: langchain in js.
  • 🐥 langchain-alpaca: Run alpaca LLM fully locally in langchain.
  • 🐇 whisper.cpp: High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model.
  • 🐍 whisper-small: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.
  • 🐇 talk: Talk with an Artificial Intelligence in your terminal.
  • 🐍 chatgpt-retrieval-plugin: The ChatGPT Retrieval Plugin lets you easily search and find personal or work documents by asking questions in everyday language.
  • 🐍 llama-retrieval-plugin: LLaMa retrieval plugin script using OpenAI's retrieval plugin
  • 🦀 llm-chain: Prompt templates and chaining of prompts into multi-step chains, for summarizing lengthy texts or performing advanced data processing tasks.
  • 🐍 petals: Run 100B+ language models at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading.
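
The langchain and llm-chain entries above revolve around prompt templates and multi-step chains. The following is a minimal sketch of that idea, assuming `{name}` placeholders and a model represented by a plain closure. `Step`, `render`, and `run_chain` are hypothetical names for illustration and are not the API of llm-chain or langchain.

```rust
use std::collections::HashMap;

/// Replace each `{name}` placeholder with its value from `vars`.
/// Unknown placeholders are left in the output unchanged.
pub fn render(template: &str, vars: &HashMap<&str, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find('{') {
        out.push_str(&rest[..start]);
        match rest[start..].find('}') {
            Some(len) => {
                let name = &rest[start + 1..start + len];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => out.push_str(&rest[start..=start + len]),
                }
                rest = &rest[start + len + 1..];
            }
            None => {
                // Unclosed brace: copy the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// One chain step: a prompt template and the variable name under which
/// the model's answer is stored for later steps.
pub struct Step {
    pub template: &'static str,
    pub output: &'static str,
}

/// Run the steps in order, feeding each answer into the shared variables.
/// `model` stands in for any text-in, text-out inference call.
pub fn run_chain<F>(
    steps: &[Step],
    mut vars: HashMap<&'static str, String>,
    mut model: F,
) -> HashMap<&'static str, String>
where
    F: FnMut(&str) -> String,
{
    for step in steps {
        let prompt = render(step.template, &vars);
        let answer = model(&prompt);
        vars.insert(step.output, answer);
    }
    vars
}
```

Because the model is just a closure, the same chain can be exercised with a stub in tests and later wired to a real inference backend.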


  • 🤗 Raven-RWKV-7B: 7B, Raven is RWKV 7B, a 100% RNN RWKV-LM fine-tuned to follow instructions.
  • 🤗 ChatRWKV-gradio: 14B, RWKV-4-Pile-14B-20230313-ctx8192-test1050
  • 🤗 Code Alpaca: 13B, The Code Alpaca models are fine-tuned from 7B and 13B LLaMA models on 20K instruction-following examples generated by the techniques in the Self-Instruct [1] paper, with some modifications. Evals are still a todo.
  • 🤗 Alpaca-LoRA-Serve: 7B, Instruction fine-tuned version of LLaMA from Meta AI. Alpaca-LoRA is Low-Rank LLaMA Instruct-Tuning, inspired by the Stanford Alpaca project. This demo application currently runs the 7B version on a T4 instance.
  • 🤗 LLaMA-Adapter: 7B +1.2M, The official demo for LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
  • 🤖 Alpaca-LoRA Playground: 30B, Alpaca-LoRA, an instruction fine-tuned version of LLaMA. This demo currently runs the 30B version on a 3*A6000 instance at
  • 🤖 Koala: 13B, a chatbot fine-tuned from LLaMA on user-shared conversations and open-source datasets. This one performs similarly to Vicuna.
  • 🐥 Web LLM: llm on web browsers via WebGPU.


