Paper Review

Matryoshka Quantization

Published in 2025

In the era of massive language models and vision transformers, model efficiency has become just as important as accuracy. Whether you’re deploying on mobile, edge devices, or scaling inference infrastructure, quantization is a crucial technique for compressing models while maintaining performance.

FLEXTRON: Many-in-One Flexible Large Language Model

Published in 2024

Today, I will summarize the paper titled “FLEXTRON: Many-in-One Flexible Large Language Model.” The paper proposes a novel framework with an elastic structure that can quickly adapt to diverse user environments. To achieve this, the paper suggests an approach similar to Mixture-of-Experts.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Published in 2024

A summary of GPTQ.

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

Published in 2024

Attention Is All You Need

Published in 2024

Seokho Han

1. Introduction