mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2025-10-27 08:21:30 +00:00
CUDA full GPU acceleration, KV cache in VRAM (#1827)
* Fixed CUDA RoPE * ggml_cuda_mul_mat_vec_p021 * ggml_cuda_scale * ggml_cuda_diag_mask_inf * ggml_is_permuted * ggml_cuda_cpy * flatten rows for ggml_cuda_op * Added a --low-vram option * Fixed Windows performance * Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
This commit is contained in:
1
llama.h
1
llama.h
@@ -77,6 +77,7 @@ extern "C" {
|
||||
int n_gpu_layers; // number of layers to store in VRAM
|
||||
int main_gpu; // the GPU that is used for scratch and small tensors
|
||||
float tensor_split[LLAMA_MAX_DEVICES]; // how to split layers across multiple GPUs
|
||||
bool low_vram; // if true, reduce VRAM usage at the cost of performance
|
||||
int seed; // RNG seed, -1 for random
|
||||
|
||||
bool f16_kv; // use fp16 for KV cache
|
||||
|
||||
Reference in New Issue
Block a user