	docs: fix links in development docs [no ci] (#8481)
Fixes a few links within the repo that were broken by the reorganization of the documentation in #8325.
@@ -9,15 +9,15 @@ Adding a model requires few steps:
 After following these steps, you can open PR.
 
 Also, it is important to check that the examples and main ggml backends (CUDA, METAL, CPU) are working with the new architecture, especially:
-- [main](../examples/main)
-- [imatrix](../examples/imatrix)
-- [quantize](../examples/quantize)
-- [server](../examples/server)
+- [main](/examples/main/)
+- [imatrix](/examples/imatrix/)
+- [quantize](/examples/quantize/)
+- [server](/examples/server/)
 
 ### 1. Convert the model to GGUF
 
 This step is done in python with a `convert` script using the [gguf](https://pypi.org/project/gguf/) library.
-Depending on the model architecture, you can use either [convert_hf_to_gguf.py](../convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](../examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
+Depending on the model architecture, you can use either [convert_hf_to_gguf.py](/convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](/examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
 
 The convert script reads the model configuration, tokenizer, tensor names+data and converts them to GGUF metadata and tensors.
 
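To make the convert step concrete, here is a minimal sketch of the GGUF writing flow that the convert scripts build on with the `gguf` package: open a writer, add metadata and tensors, and flush everything to a `.gguf` file. The file name, metadata value and tensor contents are placeholders; the real scripts derive all of them from the source checkpoint.

```python
import numpy as np
import gguf

# Minimal sketch of writing a GGUF file; "tiny.gguf", the context length
# and the zero tensor are placeholders, not real model data.
writer = gguf.GGUFWriter("tiny.gguf", "llama")   # architecture name is illustrative

writer.add_context_length(2048)                        # hyperparameter -> GGUF metadata (KV)
writer.add_tensor("token_embd.weight",
                  np.zeros((16, 8), dtype=np.float32)) # tensor data under its GGUF name

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```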
@@ -31,7 +31,7 @@ class MyModel(Model):
     model_arch = gguf.MODEL_ARCH.GROK
 ```
 
-2. Define the layout of the GGUF tensors in [constants.py](../gguf-py/gguf/constants.py)
+2. Define the layout of the GGUF tensors in [constants.py](/gguf-py/gguf/constants.py)
 
 Add an enum entry in `MODEL_ARCH`, the model human friendly name in `MODEL_ARCH_NAMES` and the GGUF tensor names in `MODEL_TENSORS`.
 
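As a rough sketch of what those three additions look like (the enum members, names and tensor list below are illustrative fragments, not the full definitions in `constants.py`):

```python
from enum import IntEnum, auto

# Illustrative fragments only; the real constants.py defines many more
# architectures and tensor kinds.
class MODEL_ARCH(IntEnum):
    LLAMA = auto()
    GROK = auto()                       # 1. new enum entry for the architecture

class MODEL_TENSOR(IntEnum):            # already present in constants.py (abridged here)
    TOKEN_EMBD = auto()
    OUTPUT_NORM = auto()
    ATTN_Q = auto()
    ATTN_K = auto()
    ATTN_V = auto()
    ATTN_OUT = auto()

MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
    MODEL_ARCH.LLAMA: "llama",
    MODEL_ARCH.GROK: "grok",            # 2. human friendly name
}

MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    MODEL_ARCH.GROK: [                  # 3. GGUF tensors used by the architecture
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
    ],
}
```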
@@ -54,7 +54,7 @@ Example for `falcon` model:
 
 As a general rule, before adding a new tensor name to GGUF, be sure the equivalent naming does not already exist.
 
-Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](../gguf-py/gguf/tensor_mapping.py) file.
+Once you have found the GGUF tensor name equivalent, add it to the [tensor_mapping.py](/gguf-py/gguf/tensor_mapping.py) file.
 
 If the tensor name is part of a repetitive layer/block, the key word `bid` substitutes it.
 
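To illustrate the `bid` substitution, a hypothetical block mapping could look like the sketch below; the source tensor names and the `resolve` helper are examples for this sketch, not the actual entries in `tensor_mapping.py`.

```python
# Hypothetical block mapping: {bid} stands in for the layer/block index.
# The names on the right are example checkpoint tensor names.
block_mappings: dict[str, tuple[str, ...]] = {
    "attn_q": (
        "model.layers.{bid}.self_attn.q_proj",   # HF llama-style layout
        "transformer.h.{bid}.attn.q_proj",       # another hypothetical layout
    ),
}

def resolve(gguf_key: str, bid: int) -> tuple[str, str]:
    """Return (first candidate source name, GGUF name) for block `bid`."""
    src = block_mappings[gguf_key][0].format(bid=bid)
    return src, f"blk.{bid}.{gguf_key}"

print(resolve("attn_q", 3))   # ('model.layers.3.self_attn.q_proj', 'blk.3.attn_q')
```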
@@ -100,7 +100,7 @@ Have a look at existing implementation like `build_llama`, `build_dbrx` or `buil
 
 When implementing a new graph, please note that the underlying `ggml` backends might not support them all, support for missing backend operations can be added in another PR.
 
-Note: to debug the inference graph: you can use [llama-eval-callback](../examples/eval-callback).
+Note: to debug the inference graph: you can use [llama-eval-callback](/examples/eval-callback/).
 
 ## GGUF specification
 
@@ -1,7 +1,7 @@
 # Token generation performance troubleshooting
 
 ## Verifying that the model is running on the GPU with CUDA
-Make sure you compiled llama with the correct env variables according to [this guide](../README.md#CUDA), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
+Make sure you compiled llama with the correct env variables according to [this guide](/docs/build.md#cuda), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
 ```shell
 ./llama-cli -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
 ```