runtime error
peating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MiB
llama_new_context_with_model: total VRAM used: 4505.56 MiB (model: 4349.55 MiB, context: 156.00 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Caching examples at: '/home/user/app/gradio_cached_examples/16'
Caching example 1/1
llama_print_timings:        load time =    1143.70 ms
llama_print_timings:      sample time =     321.29 ms /   608 runs   (    0.53 ms per token,  1892.37 tokens per second)
llama_print_timings: prompt eval time =    1143.41 ms /   147 tokens (    7.78 ms per token,   128.56 tokens per second)
llama_print_timings:        eval time =   45757.31 ms /   607 runs   (   75.38 ms per token,    13.27 tokens per second)
llama_print_timings:       total time =   49455.69 ms
Traceback (most recent call last):
  File "/home/user/app/app.py", line 71, in <module>
    demo.queue(concurrency_count=1, max_size=5)
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1715, in queue
    raise DeprecationWarning(
DeprecationWarning: concurrency_count has been deprecated. Set the concurrency_limit directly on event listeners e.g. btn.click(fn, ..., concurrency_limit=10) or gr.Interface(concurrency_limit=10). If necessary, the total number of workers can be configured via `max_threads` in launch().
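The model itself loads and generates fine; the crash comes from the traceback at the end. Newer Gradio releases raise a DeprecationWarning when queue() is called with concurrency_count, and ask for concurrency_limit to be set on the event listener (or for max_threads in launch()) instead. Below is a minimal sketch of the migrated setup around line 71 of app.py. The generate function and the Blocks layout are hypothetical placeholders, since the rest of app.py is not shown in the logs.

import gradio as gr

def generate(prompt):
    # placeholder for the llama.cpp-backed inference call
    return prompt

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    output = gr.Textbox(label="Output")
    btn = gr.Button("Generate")
    # concurrency_limit now belongs on the event listener, not on queue()
    btn.click(generate, inputs=prompt, outputs=output, concurrency_limit=1)

# queue() still accepts max_size; concurrency_count is gone
demo.queue(max_size=5)
demo.launch()

With this change the queue keeps the same behavior as before (one request processed at a time, up to five waiting), but uses the API the installed Gradio version expects.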