Neural Network Optimization: Ocean in a Drop

Doubletapp
Aug 7, 2023

Hello everyone, my name is Anton Ryabykh, and I work at Doubletapp. Together with my colleague Danil Galperin, we wrote an article about an important stage in training neural networks and achieving the desired results: model optimization. Why optimize a model that already works? Because as soon as you deploy it on the device that will actually run it, you will face a number of problems.

Larger models take up more storage space, making distribution difficult. They also require more time to process and may demand expensive equipment, which is crucial when creating real-time applications.

Model optimization aims to reduce model size while minimizing losses in accuracy and performance.

Use Cases:

  • Improving throughput and reducing latency, which is beneficial for cloud services and for edge devices such as mobile and IoT hardware.
  • Deploying models on edge devices with processing, memory, or energy constraints.
  • Reducing model size for faster updates and storage cost reduction.
  • Adapting models to hardware that is limited to, or optimized for, fixed-point arithmetic.
  • Preparing models for specialized hardware accelerators.

Optimization Methods

In the article, we review the following optimization methods:

  • Pruning — removing parts of the neural network’s parameters.
  • Quantization — reducing the precision of processed data types.
  • Knowledge Distillation — training a smaller, faster model (the student) with fewer parameters to reproduce the behavior of the original model (the teacher).
  • Weight Clustering — reducing the number of unique parameters in the model’s weights.
  • OpenVINO, TensorRT — frameworks for model optimization.

Pruning

Pruning is a compression method where trained model weights are removed. This removal can occur on whole neurons or individual weights. Let’s explore the main pruning methods.

One way is weight pruning, where individual parameters are set to zero, creating a sparse network. This reduces the model’s number of parameters while keeping the architecture intact. However, to actually gain efficiency from the sparsity, sparse computation is required, which needs hardware and library support.

Another method involves removing entire nodes (neurons) from the network. This reduces the network’s architecture, enabling dense, more optimized computations that are better supported by the hardware. However, this type of pruning can harm the neural network by removing crucial neurons.

During optimization, the goal is to keep accuracy at, or only slightly below, its original level. This can be achieved by removing the elements that have the least impact on the network’s output. There are various ways to formalize this task, but heuristic criteria are the easiest to understand.

An intuitive criterion is to trim the weights with the smallest absolute values, since they contribute the least. To deliberately push insignificant weights toward zero during training, L1 or L2 regularization is used.
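
As a concrete illustration (a minimal sketch, not the article’s own code), PyTorch ships magnitude-based pruning utilities in torch.nn.utils.prune; here the 30% of weights with the smallest absolute values in a toy linear layer are zeroed out:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(6, 4)                 # toy layer; in practice, a layer of a trained model

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude criterion)
prune.l1_unstructured(layer, name="weight", amount=0.3)

print(layer.weight)                     # dense tensor in which pruned entries are exactly 0
prune.remove(layer, "weight")           # make the pruning permanent (drop the mask)
```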

The logic is similar when removing neurons from the network. While processing data, we can gather statistics on activations: neurons whose outputs are consistently low are rarely used by the network and can be removed. Besides magnitude, the similarity with other outputs of the same layer is checked. If two outputs are statistically almost identical, they likely perform the same function, and one of them can be deleted without affecting functionality.

In an ideal scenario, all model parameters and activations would be unique, eliminating network redundancy.

As an example, let’s consider the complexity of a small neural network with one hidden layer.

We have three layers: the first layer has 6 nodes, the second has 4 nodes, and the third has 2 output nodes. To compute each hidden-layer activation, we need to perform 6 multiply-accumulate (MAC) operations per node. In total, we need [6*4] + [4*2] = 24 + 8 = 32 MAC operations and store 32 parameters in memory.

Suppose we decide, based on some criterion, to remove one neuron from the hidden layer. The number of operations then drops to [6*3] + [3*2] = 18 + 6 = 24 MAC operations, and only 24 parameters need to be stored in memory. In this simple network, removing a single neuron reduced compute and memory consumption by 25%.

By understanding the condition for removing weights or neurons, we can apply this approach to convolutional networks as well. If the values in a kernel matrix are small, its activation is likely to be small too and has little impact on the final result, so the corresponding channel can be removed entirely.
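
A hedged sketch of the same idea for convolutions, using PyTorch’s structured pruning to remove the output channels whose kernels have the smallest L2 norm (the layer and pruning ratio are illustrative, not from the article):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

# Zero out 25% of output channels (dim=0) with the smallest L2 norm of their kernels
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
prune.remove(conv, "weight")            # bake the zeroed channels into the weight tensor

# Note: the tensor keeps its shape; to get real speedups, the zeroed channels must
# actually be dropped from the architecture, yielding dense computation on a smaller layer.
```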

Now you may wonder, “Why create an excessively large architecture and then have to reduce it? Why not build a smaller architecture from the start without further optimizations?”

In reality, training a smaller model to the same accuracy as a larger one is very challenging. Larger models have a more extensive search space for optimal solutions, and the outcome depends heavily on how the weights are initialized.

Quantization

Model quantization is a popular deep learning optimization method where model data — both network parameters and activations — are transformed from floating-point representation to lower precision, such as using 8-bit integers.

This offers several advantages:

  • When processing 8-bit integer data, NVIDIA GPUs use faster and more affordable 8-bit tensor cores for convolution and matrix multiplication operations, providing higher computational throughput.
  • Reducing the precision of activation and parameter data from 32-bit floating-point to 8-bit integers results in a 4x data reduction, saving energy and reducing heat dissipation during data transfer from memory to computational elements (NVIDIA streaming multiprocessors).
  • Reduced memory footprint means the model requires less storage space, fewer parameters for updates, and better cache utilization.

Quantization Methods

While quantization has numerous advantages, reducing parameter precision can potentially harm model accuracy. A 32-bit floating-point type can represent approximately 4 billion numbers within the range [-3.4e38, 3.40e38]. This range is also known as the dynamic range, and the distance between two neighboring representable numbers is the representation accuracy.

In deep learning models, both parameters and activations tend to concentrate most of their values in a narrow range such as [-1, 1], so values are highly likely to fall within it.

Using an 8-bit integer representation, only 256 different values can be represented. These 256 values can be distributed uniformly or non-uniformly, e.g., for higher accuracy near zero.

To transform a floating-point tensor x into its 8-bit representation xq, a scale coefficient s and a zero-point z are calculated. The quantized value then has the following form:

xq = round(x/s - z)

Now, let’s understand how to obtain these coefficients. Denote the range of the original values x by [α, β] and the range of the quantized values xq by [αq, βq]. We need to solve the system of linear equations:

β = s*(βq + z)
α = s*(αq + z)

Where β is the upper bound of the x range, βq is the upper bound of the xq range, α is the lower bound of the x range, and αq is the lower bound of the xq range.

From this, we get:

s = (β - α)/(βq - αq)
z = (α*βq - β*αq)/(β - α)

Let’s look at an example. Suppose the input values lie in the range [-500, 2050] with the fp32 data type, and we need to convert them to signed int8, which covers the range [-128, 127]. Hence,

s = (2050 + 500)/(127 + 128) = 10
z = (-500*127 - 2050*(-128))/(2050 + 500) = 78

When we have the transformation coefficients, we can quantize any input value. For example, x = 1000 becomes xq = round(1000/10 - 78) = 22.

In practice, an original value may fall outside the allowable range, and consequently the quantized value xq will also be out of range. Therefore, we perform a clipping operation to exclude values that do not fit into the quantized range:

xq = clip(round(x/s - z), αq, βq)

where clip returns αq for anything below αq, βq for anything above βq, and the value itself otherwise.
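
A minimal NumPy sketch of this affine quantization, following the convention above (dequantization is x ≈ s*(xq + z)) and reproducing the numbers from the worked example:

```python
import numpy as np

def compute_scale_zero_point(alpha, beta, alpha_q=-128, beta_q=127):
    """Derive s and z from the float range [alpha, beta] and the int8 range [alpha_q, beta_q]."""
    s = (beta - alpha) / (beta_q - alpha_q)
    z = (alpha * beta_q - beta * alpha_q) / (beta - alpha)
    return s, round(z)

def quantize(x, s, z, alpha_q=-128, beta_q=127):
    xq = np.round(x / s - z)                       # affine mapping onto the int8 grid
    return np.clip(xq, alpha_q, beta_q).astype(np.int8)

def dequantize(xq, s, z):
    return s * (xq.astype(np.float32) + z)         # approximate reconstruction of the original values

s, z = compute_scale_zero_point(-500.0, 2050.0)    # -> s = 10.0, z = 78
x = np.array([-500.0, 0.0, 1000.0, 2050.0])
xq = quantize(x, s, z)                             # -> [-128, -78, 22, 127]
print(xq, dequantize(xq, s, z))                    # round-trips back to [-500, 0, 1000, 2050]
```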

Using Different Types for Quantization

The type of quantization primarily depends on the operation. Transitioning from float32 to int8 is not the only option; other possibilities include transitioning from float32 to float16. They can also be combined. For example, you can quantize matrix multiplications to int8 and activations to float16.

Quantization is an approximation. In general, the closer the approximation, the smaller the expected drop in accuracy. If you quantize everything to float16, you will halve memory usage and probably not lose accuracy, but you won’t achieve a significant speedup. Quantization to int8, on the other hand, can give much faster processing, but the accuracy is likely to suffer more. In extreme cases it may not work at all, and training that takes quantization into account becomes necessary.

Quantization in Practice

To mitigate the impact of quantization on model quality, various quantization methods have been developed. These methods can be classified into two categories: post-training quantization (PTQ) or quantization-aware training (QAT).

As the name suggests, PTQ is performed after training a high-precision model. Quantizing weights using PTQ is straightforward — you have access to weight tensors and can measure their distributions.

Quantizing activations is more challenging, because activation distributions have to be measured on real input data. For this, the trained floating-point model is evaluated on a small dataset representative of the task, and statistics on activation distributions are collected. At the final stage, the quantization scales of the model’s activation tensors are determined using one of several optimization objectives. This process is called calibration, and the representative dataset used is the calibration dataset.
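
As a hedged illustration of this calibration flow, here is a sketch using PyTorch’s eager-mode quantization API; the model is assumed to already be prepared for quantization (e.g., wrapped in QuantStub/DeQuantStub), and calibration_loader is a placeholder for the calibration dataset:

```python
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert

model.eval()                                      # trained float32 model (assumed defined elsewhere)
model.qconfig = get_default_qconfig("fbgemm")     # default int8 config for x86 CPUs
prepared = prepare(model)                         # insert observers on weights and activations

with torch.no_grad():
    for batch, _ in calibration_loader:           # small, representative calibration dataset
        prepared(batch)                           # observers record activation distributions

quantized = convert(prepared)                     # compute scales/zero-points, swap in int8 kernels
```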

Sometimes PTQ may not achieve acceptable accuracy on the task. In that case, you can consider using QAT. The idea behind QAT is simple: you can improve the accuracy of quantized models by incorporating quantization error during the training phase. This allows the network to adapt to quantized weights and activations.

There are various approaches to performing QAT, from starting with an untrained model to starting with a pre-trained one. All of them modify the training procedure to incorporate quantization error into the loss by inserting fake quantization operations into the training graph to mimic the quantization of data and parameters. These operations are called “fake” because they quantize the data but then immediately dequantize it, keeping the computation in floating-point precision. This trick exposes quantization error to the network without significant changes to the model or training code.
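
A minimal sketch of such a fake-quantization operation, following the scale/zero-point convention used earlier in this article (real frameworks additionally use a straight-through estimator so gradients flow through the rounding step):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize and immediately dequantize, keeping the tensor in float32.

    The output equals the input plus its quantization error, which is exactly
    what QAT exposes to the network during training.
    """
    xq = torch.clamp(torch.round(x / scale - zero_point), qmin, qmax)
    return scale * (xq + zero_point)
```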

PTQ is the more popular of the two methods, as it is simpler and faster. However, QAT almost always provides better accuracy, and sometimes it is the only acceptable option.

Knowledge Distillation

An extremely large model with millions or billions of parameters, trained on high-performance GPUs, may be impossible to deploy on a real-world device because of the limited resources of that device.

Therefore, a method called knowledge distillation was developed to extract knowledge from a large model with many parameters and transfer it to a lighter-weight model. The smaller model learns to mimic the behavior of the larger one, replicating its outputs (and, where desired, its intermediate results layer by layer). This setup is commonly referred to as “teacher-student”.

Let’s take a classification task as an example. When transferring knowledge from teacher to student, a loss between the class distribution predicted by the teacher and the one predicted by the student is minimized. For an accurate teacher, the predicted probability of the correct class is close to 1 and all others are close to 0; such outputs are not very informative for the student, since they are almost indistinguishable from the original hard labels. This is why softmax temperature was introduced: it softens the distribution so that the student replicates not the classification labels but the full probability distribution, allowing it to better imitate the teacher’s behavior.
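
A minimal sketch of such a distillation loss (the temperature T and the mixing weight alpha are illustrative hyperparameters, not values from the article):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine a soft-target loss (teacher vs. student at temperature T)
    with the usual cross-entropy on the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # T**2 rescales the soft term so its gradients stay comparable across temperatures
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```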

Differences from Training from Scratch

It is evident that with more complex models, the theoretical search space is larger than that of a smaller network. However, assuming that the same (or similar) convergence can be achieved with a smaller network, the convergence space of the student network should overlap with the solution space of the teacher network.

Unfortunately, this does not guarantee convergence of the student network by itself. The student network may have a convergence that significantly differs from that of the teacher network. However, if the student network is guided to reproduce the behavior of the teacher network (which has already explored a larger solution space), it is expected that their convergence spaces will overlap.

Teacher and Student Networks — How to Implement?

1. Train the teacher network. A very complex teacher network is first trained separately using the complete dataset. This could be a deep and complex network that will serve as the teacher network.

2. Establish correspondence. When designing the student network, it is necessary to establish a correspondence between the intermediate outputs of the student network and the teacher network. This correspondence may involve directly transferring the layer’s output from the teacher network to the student network or performing some data transformation before passing it to the student network.

3. Forward pass through the teacher network. Pass the data through the teacher network to obtain all intermediate results.

4. Backpropagation through the student network. Now, use the output data from the teacher network and the correspondence relationship for backpropagation of the error into the student network, enabling it to learn to replicate the behavior of the teacher network.
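
Putting steps 3 and 4 together, here is a hedged sketch of the training loop. It assumes the correspondence is only at the output layer, that teacher, student, loader, and optimizer are defined elsewhere, and that distillation_loss is the function sketched above:

```python
import torch

teacher.eval()                                   # frozen, pre-trained teacher
student.train()

for inputs, labels in loader:
    with torch.no_grad():                        # step 3: forward pass through the teacher
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)

    optimizer.zero_grad()
    loss.backward()                              # step 4: backpropagate only through the student
    optimizer.step()
```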

Weight Clustering

Weight clustering is a method to reduce the storage size of your model by replacing many unique parameter values with a smaller set of unique values. Along with platform and hardware support, weight clustering can further reduce the required memory footprint and increase processing speed.

For example, imagine a layer in your model containing a 4x4 weight matrix, with each weight stored as a float32 value. When you save the model, you are storing 16 unique float32 values on disk.

Weight clustering reduces the size of your model by replacing similar weights in a layer with the same value. These shared values are found by running a clustering algorithm on the trained model’s weights; the user specifies the number of clusters (4 in this example). The result is 4 centroid values, each with an index (0-3).

Then, each weight in the weight matrix is replaced by the index of its nearest centroid. Instead of saving the original weight matrix, the weight clustering algorithm saves the index matrix together with the centroid values themselves.

In this case, we have reduced the size from 16 unique floating-point numbers to 4 floating-point numbers and 16 2-bit indices. The savings increase as the size of the matrix increases.

Note that even though we still have 16 floating-point numbers, they now have only 4 distinct values. General-purpose compression tools (e.g., zip) can now take advantage of data redundancy for higher compression.
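
A minimal sketch of this procedure using scikit-learn’s KMeans (a random 4x4 matrix stands in for the trained layer weights):

```python
import numpy as np
from sklearn.cluster import KMeans

weights = np.random.randn(4, 4).astype(np.float32)       # stand-in for a trained 4x4 weight matrix

kmeans = KMeans(n_clusters=4, n_init=10).fit(weights.reshape(-1, 1))
centroids = kmeans.cluster_centers_.flatten()             # 4 shared float32 values
indices = kmeans.labels_.reshape(weights.shape)           # 16 small-integer indices (2 bits each)

clustered = centroids[indices]                            # reconstructed layer: only 4 distinct values
print(np.unique(clustered).size)                          # -> 4
```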

Advantages of Weight Clustering

Weight clustering directly reduces the model’s storage and transmission size regardless of the serialization format. After clustering, the model can be further compressed using any standard compression tool.

Compression Results and Accuracy

Experiments on several popular models demonstrate the compression benefits of weight clustering. More aggressive clustering yields more compression but may reduce accuracy. The published measurements are for TensorFlow Lite models, but similar advantages are observed for other serialization formats.

In these experiments, some models were more prone to accuracy loss under aggressive clustering; in such cases, selective clustering was applied only to the layers that tolerate the optimization better.

Frameworks for Optimization on Edge Devices

OpenVINO

OpenVINO toolkit (or Intel Distribution of OpenVINO Toolkit) is an open-source and free set of tools that help accelerate the development of high-performance computer vision and deep learning inference solutions.

This comprehensive toolkit supports a wide range of computer vision solutions, optimizing the deployment of deep learning and enabling easy execution on various Intel platforms.

OpenVINO tackles a diverse range of tasks, including face detection, automatic object, text, and speech recognition, image processing, and more.

When running networks on Intel platforms, OpenVINO can deliver inference performance several times higher than that of general-purpose frameworks. It also significantly reduces memory requirements, which is crucial for applications on platforms that do not have enough memory to run networks with other frameworks.

Tools in OpenVINO

  • Model Optimizer: A cross-platform tool for importing models and preparing them for optimized execution. It converts and optimizes pretrained models from popular frameworks (such as Caffe, TensorFlow, MXNet, Kaldi, and ONNX) into OpenVINO’s internal Intermediate Representation (IR) format.
  • Inference Engine: A component for efficient model inference (execution); a minimal usage sketch follows this list.

  • Open Model Zoo: An open repository of pretrained models for solving various tasks. It contains a collection of well-known public models (over 20) and models solving various computer vision tasks, trained by Intel employees (over 100). The repository also includes numerous examples and demo applications demonstrating the usage of available models.
  • Pre-built OpenCV: A version of OpenCV compiled for Intel hardware.
  • Post-training Optimization Tool: A tool for model calibration and subsequent inference with INT8 precision.
  • Deep Learning Workbench: A web-based graphical environment that facilitates the use of various complex components of the OpenVINO toolkit.
  • Demo Applications: A set of example applications.
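
As a rough illustration of how these pieces fit together, here is a sketch assuming the OpenVINO Runtime Python API (2022+) and an IR model "model.xml" already produced by the Model Optimizer; the input shape is a placeholder:

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")                       # IR produced by the Model Optimizer
compiled = core.compile_model(model, device_name="CPU")    # compile for a target device

dummy_input = np.zeros((1, 3, 224, 224), dtype=np.float32) # placeholder input shape
result = compiled([dummy_input])                           # run inference via the runtime
output = result[compiled.output(0)]
```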

TensorRT

TensorRT is a specialized framework that maximizes the power of GPUs for neural networks.

Applications based on TensorRT run up to 40 times faster than platforms using only CPUs. With TensorRT, you can optimize neural network models trained in all major environments, perform quantization, and deploy solutions in hyper-scale data centers, edge devices, or automotive platforms.

TensorRT is built on CUDA, NVIDIA’s parallel programming model, allowing optimization of operations using libraries, development tools, and CUDA-X technologies for artificial intelligence, autonomous machines, high-performance computing, and graphics. With new NVIDIA Ampere architecture GPUs, TensorRT also utilizes sparse tensor cores, providing additional performance boost.

TensorRT provides INT8 calculations using Quantization Aware Training and Post Training Quantization, as well as FP16 optimization for production deployment of deep learning applications, such as video streaming, speech recognition, recommendations, fraud detection, text generation, and natural language processing. Quantization significantly reduces processing time, a requirement for many real-time services and embedded applications.

TensorRT is integrated with PyTorch and TensorFlow, so you can accelerate inference with minimal additional code.
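
A hedged sketch of the PyTorch integration, assuming NVIDIA’s Torch-TensorRT package (torch_tensorrt) is installed and a CUDA GPU is available; the tiny model and input shape are illustrative:

```python
import torch
import torch.nn as nn
import torch_tensorrt                                   # NVIDIA's Torch-TensorRT integration

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval().cuda()

trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],    # static input shape for the engine
    enabled_precisions={torch.float16},                 # let TensorRT pick FP16 kernels
)

out = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
```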

Conclusion

In conclusion, the field of neural network optimization has made significant progress in recent years due to the development of advanced methods such as pruning, quantization, knowledge distillation, and weight clustering. These methods allow us to improve the performance and efficiency of neural networks while reducing their size and computational requirements. By combining these methods, we can create models that are both accurate and lightweight, making them ideal for deployment on edge devices and other resource-constrained environments.

As the field of machine learning continues to evolve, it is clear that neural network optimization will remain a key area of research and development. By following the latest methods and best practices, we can continue to enhance the accuracy and efficiency of our models, making artificial intelligence more accessible and impactful than ever before.

