BEExformer: A Fast Inferencing Binarized Transformer With Early Exits

Wazib Ansar; Saptarsi Goswami; Amlan Chakrabarti

doi:10.1109/tsusc.2026.3666456

What is it about?

BEExformer is a novel architecture for textual inference tasks that combines two complementary efficiency techniques: Binarization-Aware Training (BAT) and Early Exit (EE). It targets the computational bottlenecks of transformer-based models so that they can be deployed in resource-constrained environments. For BAT, BEExformer uses a differentiable piecewise polynomial function that more closely approximates the non-differentiable impulse function used in binarization. As a result, gradient computations capture both the magnitude and the sign of the real-valued weights, while binarization itself substantially reduces memory requirements compared to full-precision models. Within each transformer block, a binarized Selective Learn-Forget Network (SLFN) further improves inference by filtering out irrelevant information, yielding a more refined understanding of the input context.

The EE mechanism boosts efficiency further by monitoring the fractional reduction in logit entropy across successive transformer blocks. During training, a soft-routing loss aggregates the losses from all exits to update the block weights. At inference time, the network can terminate early, skipping the remaining transformer blocks once sufficient confidence is reached. This mitigates the “overthinking” problem, in which deeper layers add unnecessary complexity to predictions that are already correct.
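To make the early-exit idea concrete, here is a minimal sketch in plain Python. It assumes one particular exit rule: stop as soon as the fractional reduction in logit entropy between two successive blocks falls below a threshold. The names `early_exit` and `delta`, the threshold value, and the toy logits are all hypothetical and chosen for illustration; the paper's exact criterion may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def early_exit(block_logits, delta=0.05):
    """Consume per-block logits lazily and return (block_index, logits).

    Assumed rule: exit at the first block whose entropy dropped by less
    than `delta` (as a fraction) relative to the previous block. If no
    block satisfies the rule, the last block's logits are returned.
    """
    prev_h = None
    last = None
    for i, logits in enumerate(block_logits):
        last = (i, logits)
        h = entropy(logits)
        if prev_h is not None and prev_h > 0.0:
            if (prev_h - h) / prev_h < delta:
                return last  # later blocks are never evaluated
        prev_h = h
    if last is None:
        raise ValueError("block_logits must yield at least one item")
    return last


# Toy example: a generator stands in for the stack of transformer blocks,
# so we can count how many blocks were actually evaluated.
evaluated = []


def toy_blocks():
    for logits in ([0.1, 0.0], [2.0, 0.0], [2.05, 0.0], [5.0, 0.0]):
        evaluated.append(logits)
        yield logits


exit_index, exit_logits = early_exit(toy_blocks(), delta=0.05)
```

In this toy run the entropy drops sharply between the first two blocks and barely changes at the third, so the loop stops there and the fourth block is never computed. Passing the blocks as a generator is a design choice that mirrors dynamic inference: work for later blocks is only done if the exit rule has not yet fired.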


Why is it important?

Transformer-based LLMs deliver state-of-the-art performance across diverse applications, but their massive scale and heavy computational demands make them difficult to deploy on resource-limited devices. A range of techniques are used to make these architectures more efficient, with binarization and dynamic inference among the popular choices. However, these techniques come with trade-offs. Binarization reduces the precision of weights and activations, which can lead to vanishing gradients. Conventional binarization functions typically capture only the sign of the weights, overlooking their magnitude during gradient computation. Moreover, existing binarized models depend on knowledge distillation from full-precision teacher LLMs, which adds training complexity, yet binarized neural networks (BNNs), constrained to single-bit precision, inherently lack the representational capacity to fully emulate the teacher models. Introducing early exit complicates matters further, because the multiple exit points require careful optimization during training and the threshold for exit confidence must be estimated carefully. Despite the evident efficiency gains, no previous study has attempted to combine BAT with early exit in a transformer architecture for textual inference. A key challenge in integrating EE with a BNN is that it tends to destabilize training, leading to vanishing gradients. Moreover, existing loss functions are not applicable to such networks, necessitating a tailored solution. The proposed BEExformer addresses these challenges through the design described above.
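The point about sign-only gradients can be illustrated with a small sketch. It compares the clipped straight-through estimator, a common choice in binarized networks, with the piecewise quadratic surrogate popularized by Bi-Real Net. The Bi-Real Net surrogate is used here only as a well-known stand-in for a piecewise polynomial approximation; it is not claimed to be the function BEExformer uses, and the function names are hypothetical.

```python
def binarize(w):
    """Forward pass: hard sign, mapping a real-valued weight to -1 or +1."""
    return 1.0 if w >= 0.0 else -1.0


def ste_grad(w):
    """Clipped straight-through estimator.

    The backward pass treats sign(w) as the identity inside [-1, 1],
    so the gradient is 1 there regardless of how large |w| is.
    """
    return 1.0 if abs(w) <= 1.0 else 0.0


def poly_grad(w):
    """Gradient of a piecewise quadratic surrogate for sign(w).

    Surrogate (Bi-Real Net style): 2w + w^2 on [-1, 0), 2w - w^2 on [0, 1),
    and -1 / +1 outside. Its derivative, 2 - 2|w| inside (-1, 1), depends
    on the magnitude of w, unlike the straight-through estimator.
    """
    return 2.0 - 2.0 * abs(w) if abs(w) < 1.0 else 0.0


# Compare the two backward rules on a few real-valued weights.
for w in (-0.9, -0.1, 0.1, 0.9, 1.5):
    print(f"w={w:+.1f}  sign={binarize(w):+.0f}  "
          f"ste={ste_grad(w):.1f}  poly={poly_grad(w):.1f}")
```

Under the straight-through rule, weights of 0.1 and 0.9 receive the same gradient, whereas the polynomial surrogate passes a larger gradient to weights near zero, which are the ones closest to flipping sign.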

Perspectives

BEExformer significantly strengthens the feasibility of deploying transformer-based language models in resource-constrained environments, such as edge computing scenarios. On the GLUE benchmark, BEExformer achieves a 21.30× reduction in model size with only minimal performance degradation compared to full-precision counterparts. Moreover, incorporating EE leads to a 52.27% reduction in FLOPs while simultaneously improving accuracy by 3.22%, an improvement that stems from mitigating the “overthinking” problem. These results establish BEExformer as a new paradigm in efficient transformer design, combining binarization with dynamic computation strategies to deliver a Pareto-optimal trade-off between performance and resource efficiency.
Wazib Ansar

This page is a summary of: BEExformer: A Fast Inferencing Binarized Transformer With Early Exits, IEEE Transactions on Sustainable Computing, March 2026, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/tsusc.2026.3666456.

Contributors

The following have contributed to this page

Wazib Ansar

BEExformer: A Fast, Lean and Smart AI Architecture with Low Computational Footprint
