Microsoft's Inference Framework brings 1-bit large language models to local devices

On October 17, 2024, Microsoft announced BitNet.cpp, an inference framework designed to run 1-bit quantized large language models (LLMs). BitNet.cpp is a significant advance in generative AI, enabling efficient deployment of 1-bit LLMs on standard CPUs without the need for expensive GPUs. This development democratizes access to LLMs, making them available on a wide range of devices and opening new possibilities for on-device AI applications.

Understanding 1-bit large language models

Large language models (LLMs) traditionally require significant computational resources because they store model weights as high-precision floating-point numbers (typically FP16 or BF16). This requirement makes LLM deployment expensive and energy intensive.

At their core, 1-bit LLMs use extreme quantization to represent each model weight with one of only three possible values: -1, 0, and 1. Hence the term “1.58-bit”: encoding three states takes log2(3) ≈ 1.58 bits, slightly more than one bit.

Ternary weight system

The concept

The 1-bit quantization used by BitNet.cpp is a ternary weight system. BitNet works with only three possible values for each parameter:

-1 (negative)
0 (neutral)
1 (positive)

This results in a storage requirement of approximately 1.58 bits per parameter, hence the name BitNet b1.58. This drastic reduction in parameter bit width substantially lowers memory usage and computational complexity, as most floating-point multiplications are replaced by simple additions and subtractions.
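The two claims above can be checked with a few lines of Python. This is a minimal sketch, not BitNet.cpp code: the function name `ternary_dot` and the sample values are made up for illustration. It computes the information content of a three-state weight and shows that a dot product with ternary weights needs only additions and subtractions.

```python
import math

# Information needed to encode one of three states (-1, 0, 1), in bits.
bits_per_weight = math.log2(3)

def ternary_dot(weights, activations):
    """Dot product with weights restricted to {-1, 0, 1}.

    No multiplication is needed: add the activation when the weight is 1,
    subtract it when the weight is -1, and skip it when the weight is 0.
    """
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x
        elif w == -1:
            total -= x
    return total

# Hypothetical sample values for illustration.
weights = [1, 0, -1, 1]
activations = [0.5, 2.0, 1.5, 3.0]

result = ternary_dot(weights, activations)
# Reference result using ordinary multiplication.
reference = sum(w * x for w, x in zip(weights, activations))
print(round(bits_per_weight, 2), result, reference)
```

Both `result` and `reference` come out to 2.0, and `bits_per_weight` rounds to 1.58.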

Mathematical foundation

1-bit quantization involves transforming weights and activations into their low-bit representations using the following steps:

1. Binarization of weights

Binarizing the weights involves centering them around their mean (α) and then taking the sign of each entry. As defined here, the sign function yields only the two values +1 and -1; the ternary scheme described above additionally allows 0. The transformation is expressed mathematically as:

$W_f = \text{sign}(W - \alpha)$

Where:

W is the original weight matrix.
α is the mean of the weights.
sign(x) returns +1 if x > 0 and -1 otherwise.

2. Quantization of activations

Quantization of activations ensures that the inputs are limited to the specified bit width:

$\tilde{x} = \text{Quant}(x) = \text{Clip}\left(x \times \frac{Q_b}{\gamma},\ -Q_b + \epsilon,\ Q_b - \epsilon\right)$

Where:

$Q_b = 2^{b-1}$ is the maximum quantization level for a bit width of b.
γ is the maximum absolute value of x (denoted $\|x\|_\infty$).
ε is a small number that prevents overflow during clipping.
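The activation quantization step can be sketched in NumPy as follows. This is an illustration of the formula only, under stated assumptions: the function name `absmax_quantize`, the default of 8 bits, and the sample input are hypothetical, and the input is assumed to contain at least one non-zero value so that γ is not zero.

```python
import numpy as np

def absmax_quantize(x, bits=8, eps=1e-5):
    """Scale x by Q_b / gamma and clip to [-Q_b + eps, Q_b - eps]."""
    q_b = 2 ** (bits - 1)              # Q_b = 2^(b-1)
    gamma = np.max(np.abs(x))          # gamma = max |x| (assumed non-zero)
    x_scaled = x * q_b / gamma         # the largest |x| now maps to Q_b
    x_q = np.clip(x_scaled, -q_b + eps, q_b - eps)
    return x_q, gamma, q_b

# Hypothetical activations for illustration.
x = np.array([0.5, -1.0, 0.25, 2.0])
x_q, gamma, q_b = absmax_quantize(x)
print(x_q, gamma, q_b)
```

The formula as written does not round to integers; a kernel that stores activations as 8-bit integers would add a rounding step after scaling.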

3. BitLinear operations

The BitLinear layer replaces traditional matrix multiplication with a simplified operation:

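$y = W_f \tilde{x} \times \frac{\beta \gamma}{Q_b}$

Where:

$W_f$ is the binarized weight matrix and $\tilde{x}$ is the quantized activation vector.
β is the weight scaling factor, used to minimize the approximation error.
γ is the activation scaling factor (the maximum absolute value of x).
$Q_b$ is the maximum quantization level.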

This transformation allows for efficient calculations while maintaining model performance.
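Putting the three steps together, a BitLinear forward pass might look like the NumPy sketch below. It is not the BitNet.cpp implementation. The function name `bitlinear` and the sample matrices are hypothetical, and β is taken to be the mean absolute value of the weights, which is an assumption, since the text above describes β only as a scaling factor.

```python
import numpy as np

def bitlinear(x, W, bits=8, eps=1e-5):
    """Sketch of y = W_f @ x_q * (beta * gamma / Q_b)."""
    # Step 1: binarize the weights around their mean.
    alpha = W.mean()
    W_f = np.where(W - alpha > 0, 1.0, -1.0)
    # Assumption: beta is the mean absolute weight.
    beta = np.abs(W).mean()
    # Step 2: absmax-quantize the activations.
    q_b = 2 ** (bits - 1)
    gamma = np.max(np.abs(x))
    x_q = np.clip(x * q_b / gamma, -q_b + eps, q_b - eps)
    # Step 3: multiply with the binarized weights, then rescale.
    return (W_f @ x_q) * (beta * gamma / q_b)

# Hypothetical 2x2 layer for illustration.
W = np.array([[0.4, -0.6], [-0.2, 0.8]])
x = np.array([1.0, -0.5])

y = bitlinear(x, W)
y_full = W @ x  # full-precision result, for comparison
print(y, y_full)
```

For this input the sketch returns roughly [0.75, -0.75], while the full-precision product is [0.7, -0.6]. The gap is the quantization error that the scaling factors are meant to keep small.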

Performance implications

Memory efficiency

The ternary weight system significantly reduces memory requirements:

Traditional LLM: 16 bits per weight
BitNet.cpp: 1.58 bits per weight

This reduction means a memory saving of approximately 90% compared to traditional 16-bit models, allowing larger models to fit within the same hardware constraints.
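The 90% figure follows directly from the two bit widths. The short calculation below works it through for a 7B-parameter model. The model size is chosen only for illustration, and the estimate covers the weights alone, ignoring activations and any packing overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7e9  # illustrative 7B-parameter model

fp16_gb = weight_memory_gb(params, 16)
ternary_gb = weight_memory_gb(params, 1.58)
saving = 1 - ternary_gb / fp16_gb

print(f"{fp16_gb:.2f} GB -> {ternary_gb:.2f} GB ({saving:.1%} saved)")
```

The saving is 1 - 1.58/16 ≈ 90.1% regardless of model size; for the 7B example the weights shrink from 14 GB to about 1.38 GB.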

[Figure: Inference speed and energy efficiency (Apple M2)]

[Figure: Inference speed and energy efficiency (Intel i7-13700H)]

1. Inference Speed: Faster on both CPUs

Inference speed is measured in tokens processed per second. Here is a breakdown of the observations:

On Apple M2 Ultra: BitNet.cpp achieves up to a 5.07x speedup over Llama.cpp on larger models (30B), with a peak speed of 593.43 tokens per second for the 125M model, which is a 1.37x speedup. For larger models such as 3.8B and 7B, BitNet.cpp maintains a speed of over 84.77 tokens per second, showing its efficiency across scales.
On Intel i7-13700H: BitNet.cpp achieves an even larger speedup. At the 7B model size, it provides a 5.68x speedup over Llama.cpp. For smaller models such as the 125M, it processes 389.08 tokens per second, which is 2.37x faster than Llama.cpp.

2. Energy Efficiency: A Game Changer for Edge Devices

The charts also include a comparison of energy cost, which shows a significant reduction in energy consumed per token processed:

On Apple M2 Ultra: The energy savings of BitNet.cpp are substantial. For the 700M model, it consumes 55.4% less energy per token than Llama.cpp, dropping from 0.314 to 0.140. This trend continues with the larger models, with the 70B model showing a 70.0% reduction in energy consumption.
On Intel i7-13700H: BitNet.cpp delivers 71.9% energy savings for the 700M model, with consumption dropping from 1.367 to 0.384. Although energy data for the 70B model in Llama.cpp is not available, BitNet.cpp remains efficient, with an energy cost of 17.33 for the 70B model.
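The quoted percentages can be reproduced from the per-token energy figures for the 700M model, reading the Intel values as 1.367 and 0.384. The helper name `percent_reduction` is made up for this check.

```python
def percent_reduction(baseline, improved):
    """Relative reduction from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100

# Energy per token for the 700M model, as quoted in the text.
m2_reduction = percent_reduction(0.314, 0.140)   # Apple M2 Ultra
i7_reduction = percent_reduction(1.367, 0.384)   # Intel i7-13700H

print(f"{m2_reduction:.1f}% {i7_reduction:.1f}%")
```

Both results match the article: 55.4% on the M2 Ultra and 71.9% on the i7-13700H.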

3. Exceeding the human reading speed benchmark

One of the most interesting insights from these charts is the reference to human reading speed, marked at 5-7 tokens per second. This red line shows that both implementations, especially BitNet.cpp, can exceed human reading speed even on large models:

On Apple M2 Ultra: BitNet.cpp beats human reading speed for all model sizes, with the lowest speed being 8.67 tokens per second for the 70B model.
On Intel i7-13700H: The 100B model still achieves 1.70 tokens per second, which falls short of the lower end of the human reading range, while all the smaller models beat this benchmark.

Training Considerations

Straight-Through Estimator (STE)

Since 1-bit quantization introduces non-differentiable functions, training relies on a specialized technique known as the Straight-Through Estimator (STE). In this approach, gradients flow unchanged through the non-differentiable points. Here is a simplified implementation in Python (PyTorch):

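import torch
from torch.autograd import Function

class StraightThroughEstimator(Function):
    @staticmethod
    def forward(ctx, input):
        # Forward pass: quantize each element to its sign
        return input.sign()

    @staticmethod
    def backward(ctx, grad_output):
        # Backward pass: pass the gradient through unchanged
        return grad_output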
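To make the mechanics concrete, the following NumPy sketch walks through one training step by hand, without any autograd library. All values (latent weights, input, target, learning rate) are hypothetical, and the squared-error loss is chosen only to keep the gradient easy to follow.

```python
import numpy as np

def sign(v):
    # +1 for positive entries, -1 otherwise
    return np.where(v > 0, 1.0, -1.0)

latent = np.array([0.3, -0.2, 0.05])   # high-precision latent weights
x = np.array([1.0, 2.0, 3.0])          # one input example
target = -4.0
lr = 0.01

# Forward pass uses the binarized weights.
w_q = sign(latent)                     # [1, -1, 1]
y = w_q @ x                            # 1 - 2 + 3 = 2

# Gradient of 0.5 * (y - target)^2 with respect to the binarized weights.
grad_w_q = (y - target) * x            # [6, 12, 18]

# Straight-through step: reuse that gradient for the latent weights,
# as if the sign function were the identity.
grad_latent = grad_w_q
latent = latent - lr * grad_latent     # [0.24, -0.32, -0.13]

w_q_after = sign(latent)               # third weight flips to -1
y_after = w_q_after @ x                # 1 - 2 - 3 = -4
print(w_q_after, y_after)
```

After the update the third latent weight crosses zero, its binarized value flips, and the output matches the target. Without the straight-through step the gradient of the sign function would be zero almost everywhere and the latent weights would never move.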

Mixed precision training

To maintain stability during training, mixed precision is employed:

Weights and activations: Quantized to low precision.
Gradients and optimizer states: Stored in higher precision.
Latent weights: Maintained in high precision to allow accurate updates during training.

Large learning rate strategy

A unique challenge with 1-bit models is that small updates to the latent weights may not change the binarized weights at all. To mitigate this, the learning rate is increased, which helps the model converge faster than it would with a conventional learning rate.
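The effect is easy to see numerically. The sketch below counts how many binarized weights change sign after a single gradient step at two learning rates. The function name `flipped_weights`, the latent weights, and the gradient are hypothetical values picked for illustration.

```python
import numpy as np

def flipped_weights(latent, grad, lr):
    """Count how many binarized weights change sign after one update."""
    before = np.where(latent > 0, 1, -1)
    after = np.where(latent - lr * grad > 0, 1, -1)
    return int(np.sum(before != after))

# Hypothetical latent weights and gradient.
latent = np.array([0.30, -0.20, 0.05, -0.40])
grad = np.array([2.0, -12.0, 18.0, -3.0])

small = flipped_weights(latent, grad, lr=0.001)
large = flipped_weights(latent, grad, lr=0.05)
print(small, large)
```

With the small learning rate no weight crosses zero, so the binarized model is unchanged by the step. With the larger rate two of the four weights flip.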

Group quantization and normalization

BitNet introduces group quantization and normalization to improve model parallelism. Instead of calculating the quantization parameters over the entire weight matrix, BitNet divides the weights and activations into multiple groups (G) and computes the parameters for each group independently.

This grouping enables efficient parallel processing without additional inter-group communication, supporting large-scale model training and inference.
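A minimal sketch of the idea, assuming the groups are formed by splitting the weight matrix along its rows (the text does not say along which dimension the split happens). The function name `group_binarize` and the sample matrix are hypothetical.

```python
import numpy as np

def group_binarize(W, num_groups):
    """Binarize W with a separate mean (alpha) and scale (beta) per group."""
    # Assumption: groups are equal-sized blocks of rows.
    groups = np.split(W, num_groups, axis=0)
    binarized, scales = [], []
    for g in groups:
        alpha = g.mean()                          # per-group mean
        centered = g - alpha
        binarized.append(np.where(centered > 0, 1.0, -1.0))
        scales.append(np.abs(centered).mean())    # per-group scale
    return np.concatenate(binarized, axis=0), np.array(scales)

# Two row groups with very different magnitudes.
W = np.array([[0.2, 0.4],
              [-0.4, 0.2],
              [4.0, 7.0],
              [5.0, 8.0]])

W_b, betas = group_binarize(W, num_groups=2)
print(W_b, betas)
```

Each group needs only its own statistics, so groups can be processed on different workers without exchanging data. In this example a single global mean (3.05) would map every entry of the first group to -1, whereas the per-group means keep its sign pattern.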

Implementation notes and optimizations

CPU optimization

BitNet.cpp uses several low-level optimizations to achieve peak CPU performance:

Vectorized operations: Uses SIMD instructions to perform bit manipulations efficiently.
Cache-friendly memory access: Structures data to minimize cache misses.
Parallel processing: Distributes the workload efficiently across multiple CPU cores.
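One reason compact weights help with cache behavior is that several ternary values fit in a single byte. The NumPy sketch below packs four weights per byte using two bits each. It is only an illustration of the general idea: the function names and the bit layout are hypothetical and are not taken from BitNet.cpp's actual storage formats.

```python
import numpy as np

def pack_ternary(weights):
    """Pack ternary weights {-1, 0, 1} into bytes, four weights per byte."""
    # Map -1, 0, 1 to the 2-bit codes 0, 1, 2.
    codes = (np.asarray(weights, dtype=np.int8) + 1).astype(np.uint8)
    # Pad to a multiple of four with code 1 (weight 0).
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.ones(pad, dtype=np.uint8)])
    quads = codes.reshape(-1, 4)
    packed = quads[:, 0] | (quads[:, 1] << 2) | (quads[:, 2] << 4) | (quads[:, 3] << 6)
    return packed.astype(np.uint8)

def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:count].astype(np.int8) - 1

weights = [1, -1, 0, 1, -1]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))
print(packed, restored)
```

Five weights fit in two bytes here, compared with ten bytes in FP16. Two bits per weight is slightly above the theoretical 1.58 bits but keeps the unpacking logic to simple shifts and masks.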

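Here is a simplified Python sketch of quantization and inference in a BitLinear layer (illustrative only, not the optimized BitNet.cpp kernels):

import torch

def quantize(x, bits=8):
    # Absmax quantization: map x onto the integer grid [-Q_b, Q_b - 1]
    q_b = 2 ** (bits - 1)
    gamma = torch.max(torch.abs(x)).clamp(min=1e-5)
    x_q = torch.clamp(torch.round(x * q_b / gamma), -q_b, q_b - 1)
    # Also return the factor that maps quantized values back to the original scale
    return x_q, gamma / q_b

def binary_matmul(input_q, weight):
    # Reference implementation; weight holds values in {-1, 0, 1},
    # so optimized kernels can use additions and subtractions instead
    return input_q @ weight.t()

def bitlinear_forward(input, weight, scale):
    # Quantize the input using absmax quantization
    input_q, input_scale = quantize(input)

    # Perform the low-bit matrix multiplication
    output = binary_matmul(input_q, weight)

    # Rescale the output to match the original precision
    return output * input_scale * scale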

Supported models

The current version of BitNet.cpp supports the following 1-bit LLMs available on Hugging Face:

bitnet_b1_58-large (0.7B parameters)
bitnet_b1_58-3B (3.3B parameters)
Llama3-8B-1.58-100B-Tokens (8.0B parameters)

These models are publicly available to demonstrate the inference capabilities of the framework. Although not officially trained or released by Microsoft, they illustrate the versatility of the framework.

Installation Guide

To get started with BitNet.cpp, follow these steps:

Prerequisites

Python >= 3.9
CMake >= 3.22
Clang >= 18
Conda (highly recommended)

For Windows users, Visual Studio should be installed with the following components enabled:

Desktop Development with C++
C++-CMake tools for Windows
Git for Windows
C++-Clang Compiler for Windows
MS-Build support for LLVM (Clang) toolkit

For Debian/Ubuntu users, an automatic installation script is available.

Installation step by step

Clone the repository:
Install dependencies:
Build and prepare the project: You can download the model directly from Hugging Face and convert it to quantized format:
Alternatively, manually download and convert the model:

Running Inference with BitNet.cpp

To run the inference using the framework, use the following command:

Explanation:

-m specifies the path to the model file.
-p defines the prompt text.
-n sets the number of tokens to predict.
-temp adjusts sampling randomness (temperature) during inference.

Example output

BitNet.cpp technical details

BitLinear Layer

BitNet.cpp implements a modified Transformer architecture that replaces standard matrix multiplication with BitLinear operations. This approach centers the weights around zero before quantization and scales them to reduce approximation errors. The key transform function looks like this:

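import numpy as np

# Binarization function for 1-bit weights
def binarize_weights(W):
    # Center the weights around their mean
    alpha = W.mean()
    # Map positive entries to +1 and all others to -1
    W_binarized = np.where(W - alpha > 0, 1, -1)
    return W_binarized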

The combination of centered weights and scaling keeps the quantization error small, thus preserving performance.
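The benefit of the scaling factor can be checked numerically. The sketch below compares the mean squared error of approximating centered weights by their signs alone against the error when the signs are multiplied by a scale β. The random weights are hypothetical, and β is set to the mean absolute value of the centered weights, which is the least-squares optimal scale for this approximation and an assumption about how the scale is chosen.

```python
import numpy as np

# Hypothetical weight matrix drawn from a small-variance normal distribution.
rng = np.random.default_rng(0)
W = rng.normal(loc=0.02, scale=0.05, size=(64, 64))

alpha = W.mean()
centered = W - alpha
W_b = np.where(centered > 0, 1.0, -1.0)

# Least-squares optimal scale for approximating `centered` by beta * W_b.
beta = np.abs(centered).mean()

err_unscaled = np.mean((centered - W_b) ** 2)
err_scaled = np.mean((centered - beta * W_b) ** 2)
print(err_unscaled, err_scaled)
```

With this choice of β the scaled error equals the mean of the squared centered weights minus β squared, which is always at most the unscaled error.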

Impact on industry

BitNet.cpp could have far-reaching implications for LLM deployments:

Accessibility: Enables LLMs to run on off-the-shelf devices, democratizing access to powerful artificial intelligence.
Cost effectiveness: Reduces the need for expensive GPUs and lowers the barrier to adoption.
Energy efficiency: Saves power by using standard CPU-based inference.
Innovation: Opens up new possibilities for on-device AI, such as real-time language translation, voice assistants, and cloud-free privacy-focused applications.

Challenges and future directions

While 1-bit LLMs are promising, several challenges remain. These include developing robust 1-bit models for various tasks, optimizing hardware for 1-bit computing, and encouraging developers to adopt this new paradigm. Additionally, investigating 1-bit quantization for computer vision or audio tasks represents an exciting future direction.

Conclusion

Microsoft’s launch of BitNet.cpp is a significant advance. By enabling efficient 1-bit inference on standard CPUs, BitNet.cpp makes AI more accessible and sustainable. This framework paves the way for more portable and cost-effective LLMs and expands what is possible with on-device AI.
