llama.cpp
Highly optimized LLM inference engine in pure C/C++
CPU-optimised inference: runs LLMs on standard processors without GPU acceleration, though optional GPU offload is also supported
Quantisation support: shrinks models severalfold (e.g. roughly 4x smaller at 4-bit versus 16-bit weights) whilst maintaining reasonable quality, allowing larger models to fit on limited hardware
Multi-platform compatibility: works on Linux, macOS, Windows, and other systems
Low memory footprint: memory-mapped model loading and quantised weights keep RAM requirements low compared to other frameworks
Command-line interface: simple text-based tools for running models, making it straightforward for technical users to get started
Bindings for multiple languages: supports Python, JavaScript, Go, and others for integration into applications
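To make the quantisation trade-off concrete, here is a minimal sketch of block-wise 4-bit quantisation in Python. It illustrates the general idea only: the function names are invented for this example, and the scheme is deliberately simplified compared with llama.cpp's actual GGUF quantisation formats (Q4_K, Q5_K, etc.).

```python
def quantize_block_int4(xs):
    # Map each value to a signed 4-bit integer (-8..7) plus one shared
    # float scale per block; a simplified scheme, not the real GGUF layout.
    amax = max(abs(x) for x in xs)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in xs]
    return scale, q

def dequantize_block_int4(scale, q):
    # Reconstruct approximate floats from the quantised integers.
    return [scale * v for v in q]

block = [0.8, -1.5, 0.02, 3.1, -0.4, 2.2, -2.9, 0.0]
scale, q = quantize_block_int4(block)
restored = dequantize_block_int4(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
# Storage drops from 32 bits to ~4 bits per weight (plus one scale per
# block), at the cost of a reconstruction error bounded by scale / 2.
```

The same principle scales up: each weight costs a few bits instead of 16 or 32, which is why quantised models fit on far smaller machines.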
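A rough back-of-the-envelope for the memory savings follows. This counts weights only (the KV cache and activations add more on top), and the 4.5 bits/weight figure is an assumed average that accounts for per-block scale overhead, not an exact llama.cpp number.

```python
def weight_size_gib(n_params, bits_per_weight):
    # Approximate storage for the model weights alone, in GiB.
    return n_params * bits_per_weight / 8 / 2**30

fp16_gib = weight_size_gib(7e9, 16)   # ~13 GiB for a 7B model at 16-bit
q4_gib = weight_size_gib(7e9, 4.5)    # ~3.7 GiB at an assumed ~4.5 bits/weight
```

So a 7B model that cannot fit in 8 GB of RAM at 16-bit comfortably does once quantised to around 4 bits.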
Common use cases:
Running private AI assistants on personal computers without sending data to external servers
Building offline applications that need language understanding capabilities
Developing and testing LLM applications locally before deployment
Running AI tools on resource-constrained devices like older laptops or edge hardware
Research and experimentation with different language models