
A language model streaming API on a local network, written in Flask

Posted Sunday, September 10, 2023.

The following code shows how to set up a language model API on a local network using llama.cpp and Flask. The use case is this: you have computer A with a lot of computing power and some other computer B where you want to develop an application using the language model. I will refer to computer A as the “big computer” and computer B as the “small computer”.

Both machines need Python 3 installed, and the big computer additionally needs llama.cpp (specifically llama-cpp-python) and Flask. Installing Flask is easy: pip install flask. But since my big computer has an Nvidia GPU, I installed llama-cpp-python with cuBLAS support (not that I know that much about it). Unfortunately, since my big computer runs Windows 10, I have to take the long way, setting the build flags from the llama-cpp-python README as environment variables first:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python
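(For comparison, on Linux or macOS the llama-cpp-python README gives the cuBLAS build as a one-liner; the CMAKE_ARGS and FORCE_CMAKE variables are the project's, not mine:)

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python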

Then, on the big computer, I wrote the following Python code and saved it as app.py (the default file name that flask run looks for):

from flask import Flask, Response, request
from llama_cpp import Llama

app = Flask(__name__)

# Load the model once at startup so every request reuses it.
model = Llama("./mistral_7b_instruct.gguf", verbose=False)

@app.route("/", methods=["POST"])
def main():
    # The JSON body is forwarded directly as keyword arguments to the model.
    kwargs = request.get_json()

    def generator():
        # Yield each token's text as the model produces it.
        for x in model(**kwargs, stream=True):
            yield x["choices"][0]["text"]

    return Response(generator(), content_type="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0")

The script assumes you have downloaded the weights of a language model in GGUF format (here, ./mistral_7b_instruct.gguf in the working directory).
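Note that with the arguments above, inference runs on the CPU even if you installed the cuBLAS build. llama-cpp-python has an n_gpu_layers argument for offloading layers to the GPU; the following is just a sketch (the layer count is my assumption for Mistral 7B, adjust it to your VRAM):

# Hypothetical tweak: offload layers to the GPU (needs the cuBLAS build).
# Mistral 7B has 32 transformer layers; lower this if you run out of VRAM.
model = Llama("./mistral_7b_instruct.gguf", n_gpu_layers=32, verbose=False)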

Then I started the server on the big computer by executing the following command in a terminal:

flask run --host=0.0.0.0

(This command assumes app.py is present in the working directory; --host=0.0.0.0 makes the server reachable from other machines on the network.)

The small computer can then run the following, where url should contain the correct local address of the big computer:

import requests

url = ""
prompt = "Instruction: Write a poem.\n### Response:"

s = requests.Session()
response = s.post(url, json={"prompt": prompt}, stream=True)

# Decode incrementally so multi-byte UTF-8 characters are not split across chunks.
response.encoding = "utf8"
for x in response.iter_content(chunk_size=1, decode_unicode=True):
    print(x, end="", flush=True)
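Since the server passes the JSON body straight into model(**kwargs), the client can also send llama-cpp-python's sampling parameters, such as max_tokens, temperature, and stop (these names come from the library, not from anything the server defines). For example:

response = s.post(
    url,
    json={"prompt": prompt, "max_tokens": 256, "temperature": 0.7, "stop": ["###"]},
    stream=True,
)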

And done! It even streams like ChatGPT too… 🥲 Shout out to /r/LocalLLaMA.