# MIND - Modular Inference & Node Database - Server-Side Design
## High-Level Overview
### Inference Engine - `llama.cpp`
A modified version of `llama.cpp` whose completion API takes extra
fields specifying the use of an on-disk kv-cache, and which tells the
client where the new kv-cache blocks are located.
### Database - MySQL
This will store the information about users, the conversation
histories, and also the index to the kv-cache stored as chunks on
disk.
### Backend Server - Go Layer
This will provide the APIs used by the frontend and will talk to the
inference engine so that it can load the correct chunks of kv-cache
into memory or reconstruct a conversation from cache. It will also
manage the life cycle of caches stored on disk and handle
authentication (add-on feature).
### CLI Interface - Go/Python
This will provide a simple interface to access all the features
provided by the backend, for ease of testing and prototyping.
## Supported APIs For the Backend
Note that all APIs will need to encode the owner of the node.
### `POST /conversations`
This will start a new conversation tree. The Go backend should
handle the node creation.
### `GET /conversations`
This will return all the root nodes of the conversation trees, to
provide context for the user to switch conversation trees.
### `GET /tree`
This will return the DAG under a root, or within a specified depth
from the root or reverse depth from the leaves, which provides context
for the user to switch between branches on a given tree.
### `POST /branches`
Creates a new fork from a given commit.
### `GET /branches`
Lists the branches related to the current branch. The caller can also
specify the maximum number of branch-off points to list.
### `POST /graft`
Attach a range of nodes from another conversation.
### `POST /detach`
Detaches a branch into a new conversation.
### `GET /linearize`
Reconstruct a linear conversation history from a branch or node.
### `POST /completion`
Trigger inference from the last node of a branch, creating two new
nodes, one for the prompt and one for the answer.
Note that this endpoint talks to the Go backend, so the Go backend is
responsible for bookkeeping the kv-cache on disk; the frontend doesn't
need to worry about it.
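The two-node append can be sketched as follows. The `msgNode` struct, field names, and ID allocation are assumptions for illustration, not the real schema:

```go
package main

import "fmt"

// msgNode is an illustrative conversation node; Role distinguishes
// the user's prompt from the model's answer.
type msgNode struct {
	ID, Parent int
	Role, Text string
}

// appendExchange creates the two nodes `POST /completion` produces:
// a prompt node chained onto the branch tip, and an answer node
// chained onto the prompt.
func appendExchange(nodes map[int]*msgNode, tip, nextID int, prompt, answer string) (int, int) {
	p := &msgNode{ID: nextID, Parent: tip, Role: "user", Text: prompt}
	a := &msgNode{ID: nextID + 1, Parent: p.ID, Role: "assistant", Text: answer}
	nodes[p.ID] = p
	nodes[a.ID] = a
	return p.ID, a.ID
}

func main() {
	nodes := map[int]*msgNode{1: {ID: 1, Role: "system", Text: "root"}}
	pid, aid := appendExchange(nodes, 1, 2, "hi", "hello!")
	fmt.Println(pid, aid, nodes[aid].Parent) // 2 3 2
}
```

Keeping the prompt and answer as separate nodes is what makes forking from either one (via `POST /branches`) possible later.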
### `POST /login`
Logs into a certain user.
### `POST /logout`
Logs out a user.
### `GET /me`
Get the current user.
## Database-Specific
The database should keep track of reachability, and the backend should
automatically remove orphaned nodes and caches.
It should also keep track of the DAG generated by the prompts,
answers, and the different root nodes.
## Cache
For a single user, the kv-cache on disk should only concern the
working node, that is, the working node and all of its ancestor nodes.
## Multiple users
### Authentication
JWT-based authentication with multi-user switching; all API calls
except `POST /login` will require a token.
A default token will be provided during the earlier stages.
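The signature check the Go layer would run on every request can be sketched with the standard library alone. This verifies only the HS256 signature; claim validation (`exp`, `sub`, ...) is omitted, and a real deployment would likely use a vetted JWT library instead:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// sign builds header.payload.signature with HMAC-SHA256, as in JWT HS256.
func sign(header, payload string, secret []byte) string {
	enc := base64.RawURLEncoding
	msg := enc.EncodeToString([]byte(header)) + "." + enc.EncodeToString([]byte(payload))
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(msg))
	return msg + "." + enc.EncodeToString(mac.Sum(nil))
}

// verifyHS256 checks the token's signature and returns the raw
// (still-JSON) payload bytes on success.
func verifyHS256(token string, secret []byte) ([]byte, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, errors.New("malformed token")
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0] + "." + parts[1]))
	want := mac.Sum(nil)
	got, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil || !hmac.Equal(got, want) {
		return nil, errors.New("bad signature")
	}
	return base64.RawURLEncoding.DecodeString(parts[1])
}

func main() {
	secret := []byte("dev-secret")
	tok := sign(`{"alg":"HS256","typ":"JWT"}`, `{"sub":"alice"}`, secret)
	claims, err := verifyHS256(tok, secret)
	fmt.Println(string(claims), err) // {"sub":"alice"} <nil>
}
```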
### Queuing
The Go layer should also be responsible for tracking the availability
of the `llama.cpp` service and for queuing prompts when multiple users
submit requests at once.
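A buffered channel plus a single worker goroutine is a natural Go shape for this: concurrent users are serialized onto the one `llama.cpp` instance instead of being rejected. A minimal sketch, with the real HTTP call to the inference engine stood in by an `infer` callback:

```go
package main

import (
	"fmt"
	"sync"
)

// prompt is a queued completion request; done carries the answer back
// to the waiting request handler.
type prompt struct {
	user, text string
	done       chan string
}

// startWorker drains the queue with a single goroutine, so at most one
// request is in flight against llama.cpp at a time.
func startWorker(queue <-chan prompt, infer func(string) string) {
	go func() {
		for p := range queue {
			p.done <- infer(p.text)
		}
	}()
}

func main() {
	queue := make(chan prompt, 16)
	startWorker(queue, func(s string) string { return "echo: " + s })

	var wg sync.WaitGroup
	for _, u := range []string{"alice", "bob"} {
		wg.Add(1)
		go func(user string) {
			defer wg.Done()
			p := prompt{user: user, text: user + "'s prompt", done: make(chan string, 1)}
			queue <- p
			fmt.Println(user, "->", <-p.done)
		}(u)
	}
	wg.Wait()
}
```

With several `llama.cpp` instances, the same queue could feed one worker goroutine per instance.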