A language model streaming API on a local network, written in Flask

Posted Sunday September 10 2023.

The following code shows how to set up a language model API on a local network using llama.cpp and Flask. The use case is: you have a computer A with a lot of computing power and some other computer B where you want to develop an application that uses the language model. I will refer to computer A as the “big computer” and computer B as the “small computer”.

Both machines need Python 3 installed, and the big computer additionally needs llama.cpp (specifically llama-cpp-python) and Flask. Installing Flask is easy: pip install flask. But since my big computer has an Nvidia GPU, I installed llama-cpp-python with cuBLAS support (not like I know that much about it). Unfortunately, since my big computer runs Windows 10, I have to take the long way, in PowerShell:

$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python
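
On Linux or macOS the equivalent should be a single line, assuming a bash-like shell:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python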

Then, on the big computer, I wrote the following Python code and saved it as app.py:

from flask import Flask, Response, request

app = Flask(__name__)

from llama_cpp import Llama

# Load the model once at startup, not per request.
model = Llama("./mistral_7b_instruct.gguf", verbose=False)

@app.route("/", methods=["POST"])
def main():
    # The JSON body is forwarded directly as keyword arguments to the model.
    kwargs = request.get_json()

    def generator():
        for x in model(**kwargs, stream=True):
            yield x["choices"][0]["text"]

    # Returning a generator makes Flask stream tokens as they are produced.
    return Response(generator(), content_type="text/plain")

if __name__ == "__main__":
    # Only used when running `python app.py` directly; `flask run` ignores this block.
    app.run(debug=True)

It assumes you have downloaded the weights of a language model in GGUF format (here, ./mistral_7b_instruct.gguf in the working directory).
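
Before wiring up Flask, it may be worth a quick sanity check on the big computer that the model loads and generates on its own (the prompt here is just an example):

from llama_cpp import Llama

model = Llama("./mistral_7b_instruct.gguf", verbose=False)

# Non-streaming call: returns a dict containing the completed text.
out = model("Instruction: Say hello.\n### Response:", max_tokens=16)
print(out["choices"][0]["text"])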

Then I started the server on the big computer by executing the following command in a terminal:

flask run --host 0.0.0.0

(This command assumes app.py is present in the working directory. Flask listens on port 5000 by default.)
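
At this point you can already poke the endpoint from any machine on the network, for example with curl (the -N flag disables output buffering so the tokens print as they stream in; the address is a placeholder for the big computer's local IP):

curl -N -X POST http://192.168.0.123:5000/ \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Instruction: Write a poem.\n### Response:", "max_tokens": 64}'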

The small computer can then run the following, where url should contain the correct local address of the big computer:

import requests

url = "http://192.168.0.123:5000"
prompt = "Instruction: Write a poem.\n### Response:"

s = requests.Session()
response = s.post(url, json={"prompt": prompt}, stream=True)

# decode_unicode=True uses an incremental decoder, so multi-byte UTF-8
# characters split across chunk boundaries decode correctly.
response.encoding = "utf-8"
for x in response.iter_content(chunk_size=1, decode_unicode=True):
    print(x, end="", flush=True)
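
Since the server forwards the JSON body straight through as keyword arguments to the model, the client gets to control generation. For example, the request could include standard llama-cpp-python sampling parameters (the values here are arbitrary):

response = s.post(url, json={
    "prompt": prompt,
    "max_tokens": 256,     # cap the length of the completion
    "temperature": 0.7,    # sampling temperature
    "stop": ["###"],       # cut generation off at these sequences
}, stream=True)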

And done! It even streams like ChatGPT too… 🥲 Shout out to /r/LocalLLaMA.