# MIND Backend
## Setup
The steps below set up the backend for local testing: the Go orchestration layer on `localhost:8080` and a llama.cpp inference server on `localhost:8081`.
### Building llama.cpp
See the [llama.cpp documentation](https://github.com/ggerganov/llama.cpp) for build details.
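For reference, a typical CPU-only CMake build looks roughly like the sketch below; the exact flags (and GPU backends such as CUDA or Metal) are covered in the upstream README, so treat this as an illustration rather than the canonical recipe.

```sh
# Clone and build llama.cpp (CPU build; see upstream docs for GPU backends)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# The llama-server binary typically ends up under build/bin/
```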
### Running llama.cpp
#### Getting a GGUF format model
Run `./backend/get-qwen3-1.7b.sh` to download the Qwen 3 1.7B model from Hugging Face.
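For example (the output location below is an assumption; check the script to see where it actually writes the file):

```sh
# Fetch the Qwen 3 1.7B GGUF from Hugging Face
./backend/get-qwen3-1.7b.sh

# Verify the download (assumed output directory; adjust to match the script)
ls -lh backend/*.gguf
```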
#### Running the inference server
Run `./llama-server -m <path-to-gguf-model> --port 8081` to start the inference server at `localhost:8081`.
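A minimal sketch, with an illustrative model path (substitute the GGUF file produced by the download step):

```sh
# Start the llama.cpp inference server on port 8081
./llama-server -m backend/qwen3-1.7b.gguf --port 8081

# In another terminal, confirm the server is up via its /health endpoint
curl http://localhost:8081/health
```

The filename `backend/qwen3-1.7b.gguf` is only a placeholder; use whatever path the download script actually produced.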
### Running the backend layer
Run `go run main.go`. This starts the backend layer at `localhost:8080`.
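Assuming `main.go` lives in the `backend/` directory (adjust the path if the repository layout differs), and with the inference server from the previous step already running on port 8081:

```sh
# Start the Go orchestration layer on port 8080
cd backend
go run main.go
```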
## A simple CLI client
A simple CLI-based client can be found under `backend/cli.py`; it connects to the backend layer at `localhost:8080`. Use the `\help` command to list the available operations.
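A minimal invocation, assuming a standard Python 3 interpreter (any dependencies the script needs are not listed here):

```sh
# Start the CLI client; it talks to the backend layer at localhost:8080
python3 backend/cli.py
# At the client prompt, type \help to list the available operations
```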