Model Compression: The ML Skill You Need to Master Now


As AI models continue to push for ever-higher performance, deployability is becoming the real litmus test. Model compression, which turns these cumbersome giants into nimble, efficient, and scalable solutions, serves as the bridge between the research lab and the real world, making it an essential skill for any machine learning professional.

The AI world has been running a kind of arms race for several years, particularly among researchers and academics. The aim was simple and constantly reinforced through leaderboard-style competitions: post the highest performance numbers, typically accuracy. This relentless pursuit produced truly gigantic and complex deep learning models that demand huge amounts of computational resources, including language models with hundreds of billions of parameters. But don't let the giants' impressive benchmark performances deceive you.

Here, though, the harsh realities of deployment become evident, and real-world applicability hits a roadblock. It is one thing to build a model that performs fantastically in a controlled lab setting and scores well on a well-curated validation set; it is another thing entirely to ship that same exhaustive, processor-hungry model inside a user-facing product, or worse, onto a low-powered, energy-constrained edge device like a sensor or a smartwatch. There is a clear disconnect between the goals of training and the requirements of real deployment.

The conversation changes dramatically when a model crosses the gap from the controlled environment of training into the wilds of production. Raw accuracy is no longer the primary concern. The focus shifts to operational factors such as throughput, memory usage, power consumption, efficiency, and inference speed. Think of it this way: Model A is slightly more accurate but needs a supercomputer and is slow to respond, while Model B is nearly as good yet responsive and able to run on a standard machine. Which one is more likely to see widespread adoption?

The Art of Compression

At that stage, the art and science of compressing models, getting ML models small and fast enough to be practical in the real world, becomes not merely significant but vital for adoption. The aim is deceptively simple: minimize performance loss while reducing the model's parameter count or latency. It is a matter of shrinking the cumulative "knowledge" of an intricate model, its learned patterns, parameters, and predictive capacity, into a smaller, more manageable form.

Think of a huge, magnificent library (your gigantic model) containing all the information you could possibly need, condensed into a concise, highly functional index or synopsis (the compressed model) that lets you access the most vital details instantly and efficiently without walking the entire building. Several techniques are used to accomplish this:

Knowledge Distillation (KD), built on the familiar "teacher-student" paradigm, is perhaps the most intuitive and one of the most widely used techniques around. A smaller, less complex "student" model is trained under the supervision of a larger, more complex "teacher" model that has already been trained well. Crucially, instead of learning only from the ground-truth labels of the data, the student learns by imitating the teacher's soft outputs, namely its probability distributions. These soft labels carry more nuanced information than plain hard labels. A standard way of measuring how closely the student's distribution matches the teacher's is the KL divergence.
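To make this concrete, here is a minimal PyTorch-style sketch of a distillation loss that blends the usual hard-label loss with a KL term. The temperature and weighting values are illustrative assumptions, not settings from any particular paper, and the teacher and student logits are assumed to come from models you have already defined.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft KL-divergence term (student mimics teacher)
    with the standard cross-entropy on ground-truth labels."""
    # Temperature-scaled distributions: softer, more informative targets.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between student and teacher distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Ordinary cross-entropy against the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1 - alpha) * ce_term
```

During training, the teacher runs in inference mode to produce `teacher_logits` for each batch, and only the student's parameters are updated with this combined loss.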

What makes this fascinating is that a student model learning from the teacher's vast "experience" sometimes rivals, or even surpasses, the accuracy of a student trained purely on hard labels. A prime example is DistilBERT, a smaller version of the huge BERT model that preserves around 97% of the full-sized model's capabilities while being around 40% smaller and 60% faster at inference. For speed and size gains of that magnitude, the performance gap is often small enough to justify the trade.

Another method, Pruning, involves removing less consequential weights (the synapses between neurons) or even entire network structures, such as neurons or layers. Many of these connections turn out to be redundant or to contribute little to the final output. Approaches range from simple unstructured pruning, which removes individual weights below a certain threshold, to more architecturally aware structured pruning, which removes entire groups of weights at once.

Structured pruning, while more aggressive, actually shrinks the computation itself and yields substantial gains in both model size and inference speed. Unstructured pruning is simpler but still highly valuable: it helps denoise the network and is effective at reducing model size. A common strategy is removing weights based on their magnitude. This is not a free lunch, though, as aggressive trimming seriously degrades performance, so experimentation is usually required to find the right amount. Sources recommend starting in the "safe zone" of removing 30% to 50% of parameters before performance degrades severely.
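As a sketch of magnitude-based unstructured pruning, the snippet below uses PyTorch's torch.nn.utils.prune utilities to zero out the smallest weights globally. The tiny architecture and the 30% ratio are illustrative assumptions chosen to match the conservative range mentioned above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small illustrative network; in practice this would be a trained model.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Collect the weight tensors of all Linear layers for global pruning.
parameters_to_prune = [
    (module, "weight") for module in model if isinstance(module, nn.Linear)
]

# Unstructured, magnitude-based pruning: zero out the 30% of weights
# with the smallest absolute values across all selected layers.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)

# Make the pruning permanent by removing the masks and re-parametrization.
for module, name in parameters_to_prune:
    prune.remove(module, name)
```

Note that zeroed weights alone do not speed up dense matrix multiplies; realizing latency gains from unstructured sparsity usually requires sparse-aware kernels or hardware, which is why structured pruning often pays off more in practice.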


Whereas pruning is akin to trimming, Quantization reduces the precision of what remains. Deep learning models conventionally use high-precision floating-point numbers, such as 32-bit floats, for their weights and activations. Quantization lowers this precision by converting them into lower-bit values such as 8-bit integers. The advantages are direct and obvious: the model needs far less memory, and computation accelerates because lower-precision arithmetic is cheaper. This matters most when running on systems with limited memory and processing power.

Because INT8 needs only one byte per value instead of two, it offers larger size savings than FP16, roughly 70% versus 58%, according to a report comparing FP16 (16-bit float) and INT8 (8-bit integer) quantization. As one might expect, the trade-off is a drop in accuracy, since reducing precision often destroys fine-grained information.

Although INT8 saves more space, it typically also causes a larger average drop in accuracy than FP16. This is mitigated by approaches such as Post-Training Quantization (PTQ), which is applied after training and uses calibration data to find suitable scaling factors, and Quantization-Aware Training (QAT), which simulates quantization during training itself. QAT typically preserves performance better because the model learns to be robust to the loss of precision.
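For a feel of what post-training quantization looks like in code, here is a minimal sketch using PyTorch's dynamic quantization, a simple PTQ variant that converts Linear-layer weights to INT8 without needing calibration data. The toy model and the size-comparison helper are illustrative assumptions.

```python
import os
import torch
import torch.nn as nn

# Illustrative FP32 model; in practice this would be your trained network.
model_fp32 = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model_fp32.eval()

# Post-training dynamic quantization: weights of the listed layer types
# are stored as 8-bit integers instead of 32-bit floats.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def size_mb(model, path="tmp_weights.pt"):
    """Rough on-disk size of a model's state dict, in megabytes."""
    torch.save(model.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"FP32: {size_mb(model_fp32):.2f} MB, INT8: {size_mb(model_int8):.2f} MB")
```

Static PTQ with calibration data and QAT follow the same spirit but require inserting observers or fake-quantization modules into the model before conversion.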

Low-Rank Factorization exploits redundancy in neural network weight matrices. By approximating a large weight matrix as the product of two or more smaller ones, it reduces the overall number of parameters considerably. The idea is rooted in linear-algebra techniques such as Singular Value Decomposition. Though less talked about in high-level overviews than distillation or pruning, it is another arrow in the quiver for getting a leaner model.
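A minimal sketch of the idea, assuming a single Linear layer and a hand-picked rank: truncated SVD splits one large weight matrix into two thin factors, and the factorized model would normally be fine-tuned afterward to recover accuracy.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one large Linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                      # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Keep only the top-`rank` singular components.
    U_r = U[:, :rank] * S[:rank]               # (out_features, rank)
    V_r = Vh[:rank, :]                         # (rank, in_features)

    # Two thin layers whose composition approximates the original matrix.
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: a 1024x1024 layer (~1.05M weights) becomes two layers
# totalling roughly 131k weights at rank 64.
compressed = factorize_linear(nn.Linear(1024, 1024), rank=64)
```

The rank controls the compression-accuracy trade-off: smaller ranks shrink the model further but discard more of the original matrix's information.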

The prime driver of the wide rollout of these methods is the boom in Edge AI. Resource-limited devices, such as smartphones, IoT sensors, automotive systems, and even portable medical equipment, are becoming ubiquitous platforms for deploying AI. These machines simply cannot accommodate the memory requirements, computational load, or energy needs of gargantuan uncompressed models.

Model compression methods solve these issues by tackling those constraints directly, enabling advanced AI tasks such as image recognition, speech processing, anomaly classification, and real-time recommendations on smaller hardware. Experiments on platforms such as NVIDIA Jetson boards show how compression shrinks model size, shortens inference latency, and reduces resource consumption, always with the caveat of that infamous accuracy trade-off.
