	server : improve README (#5209)
@@ -32,6 +32,7 @@ Command line options:
- `--mmproj MMPROJ_FILE`: Path to a multimodal projector file for LLaVA.
- `--grp-attn-n`: Set the group attention factor to extend context size through self-extend (default: 1 = disabled), used together with group attention width `--grp-attn-w`
- `--grp-attn-w`: Set the group attention width to extend context size through self-extend (default: 512), used together with group attention factor `--grp-attn-n`; see the sketch below
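
For instance, both flags are passed together at launch to turn self-extend on; the example below is only a sketch, with placeholder values for the model path, context size, and attention factor:

```bash
# Sketch: enable self-extend with a group attention factor of 4 and the
# default group attention width of 512; adjust the model path and -c to taste.
./server -m models/7B/ggml-model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 512
```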

## Build

The server is built alongside everything else from the root of the project.
@@ -52,21 +53,23 @@ server is built alongside everything else from the root of the project

To get started right away, run the following command, making sure to use the correct path for the model you have:

### Unix-based systems (Linux, macOS, etc.)

```bash
./server -m models/7B/ggml-model.gguf -c 2048
```

### Windows

```powershell
server.exe -m models\7B\ggml-model.gguf -c 2048
```

The above command will start a server that by default listens on `127.0.0.1:8080`.
You can consume the endpoints with Postman, or with Node.js and the axios library, and you can visit the web front end at the same URL.
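
As a quick smoke test against that default address, a completion request can also be issued with plain `curl`; the `/completion` endpoint and its `prompt` / `n_predict` fields are documented in the API Endpoints section of this README:

```bash
# Sketch: ask the locally running server for a short completion.
curl --request POST \
    --url http://127.0.0.1:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```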

### Docker

```bash
docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

@@ -120,6 +123,7 @@ node index.js
```

## API Endpoints

- **GET** `/health`: Returns the current state of the server:
  - `{"status": "loading model"}` if the model is still being loaded.
  - `{"status": "error"}` if the model failed to load.
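
A minimal readiness probe, sketched with `curl` against the default address, just polls this endpoint and inspects the returned status:

```bash
# Sketch: poll the health endpoint; the body is one of the statuses listed above.
curl http://127.0.0.1:8080/health
```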
@@ -189,14 +193,13 @@ node index.js

    `system_prompt`: Change the system prompt (initial prompt of all slots); this is useful for chat applications. [See more](#change-system-prompt-on-runtime)

### Result JSON

- Note: When using streaming mode (`stream`) only `content` and `stop` will be returned until end of completion.

- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:

```json
{
  "content": "<the token selected by the model>",
  "probs": [
@@ -212,6 +215,7 @@ node index.js
  ]
},
```

Notice that each `probs` is an array of length `n_probs`.
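
These probabilities are only returned when the request asks for them; a minimal sketch using the server's `n_probs` completion option:

```bash
# Sketch: request the top-2 probabilities for each generated token.
curl --request POST \
    --url http://127.0.0.1:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "The capital of France is", "n_predict": 8, "n_probs": 2}'
```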

- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
@@ -290,6 +294,7 @@ Notice that each `probs` is an array of length `n_probs`.

    print(completion.choices[0].message)
    ```

    ... or raw HTTP requests:

    ```shell
@@ -311,6 +316,40 @@ Notice that each `probs` is an array of length `n_probs`.
    }'
    ```

- **POST** `/v1/embeddings`: OpenAI-compatible embeddings API.

    *Options:*

    See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).

    *Examples:*

  - `input` as string

    ```shell
    curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
            "input": "hello",
            "model":"GPT-4",
            "encoding_format": "float"
    }'
    ```

  - `input` as string array

    ```shell
    curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
            "input": ["hello", "world"],
            "model":"GPT-4",
            "encoding_format": "float"
    }'
    ```

## More examples

### Change system prompt on runtime
@@ -362,6 +401,7 @@ python api_like_OAI.py
```

After running the API server, you can use it in Python by setting the API base URL.

```python
openai.api_base = "http://<Your api-server IP>:port"
```
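
Outside Python, the same proxy can be exercised with raw HTTP. The sketch below is assumption-laden: the port and the exact path depend on how `api_like_OAI.py` was started, so verify them against the address the script reports on startup:

```bash
# Sketch only: host, port, and path are assumptions -- check the address
# that api_like_OAI.py prints on startup before relying on them.
curl http://127.0.0.1:8081/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello"}]
    }'
```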