Method and Apparatus for Dynamically Reducing Logit Computation in Large Language Models
[This invention was inspired by the release of small LLMs, such as the Qwen coders, that have excessively large vocabulary sets and pretraining scopes, and by my remarks at https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B/discussions/6 ]
Detailed Description of the Invention:
This invention relates to a method and apparatus for dynamically reducing the computational cost of text generation in large language models (LLMs), particularly those with large token vocabularies. The invention achieves this by selectively and dynamically deactivating tokens in the output head of the LLM, before their logits are computed, thereby reducing the computation required to generate logits in the output layer of the LLM.
Background:
Large language models, especially those based on transformer architectures, have demonstrated remarkable capabilities in text generation. However, the computational cost associated with predicting the next token, particularly calculating logits for a large vocabulary, can be substantial. This cost stems from the matrix multiplication of the model’s hidden state with the unembedding matrix (also called the output projection matrix) in the output layer. The dimensionality of this matrix grows with the vocabulary size, making logit computation a significant bottleneck, especially during inference. Existing methods for reducing computational costs, such as vocabulary pruning or adaptive softmax, often involve static modifications to the model architecture or vocabulary, limiting flexibility and potentially impacting performance.
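For illustration (with hypothetical figures, not measurements of any particular model): for a model with hidden size d = 2,048 and a vocabulary of V = 150,000 tokens, the output projection alone costs roughly d × V ≈ 3.1 × 10^8 multiply-accumulate operations per generated token, regardless of how few of those tokens are plausible continuations.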
The Invention:
This invention provides a technique to make LLMs faster and more efficient by intelligently reducing the number of logit calculations during inference. It does so by dynamically identifying relevant tokens based on context and by employing optimization strategies such as pruning, hierarchical computation, approximation, or specialized hardware. Specifically, the invention provides a method and apparatus for dynamically reducing logit computation by selectively deactivating tokens at the output layer. This deactivation effectively reduces the dimensions of the unembedding matrix and bias vector used in logit computation, leading to significant computational savings. Because the deactivation is performed during the forward pass of the model, it can be context-dependent. In one embodiment, the method performs Dynamic Pruning/Filtering (a minimal code sketch follows this list), such as:
Identify irrelevant tokens: The system analyzes the context of the input text and dynamically identifies tokens, or selects pre-grouped sets of tokens from the vocabulary, that are unlikely or unwanted as the next token in the sequence.
Skip logit calculation: Logits are calculated only for the remaining subset of the vocabulary, those deemed relevant or wanted, significantly reducing computation.
Use contextual clues: Signals such as the preceding tokens, sentence structure, topic modeling, or even sentiment analysis can be used to determine relevance.
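A minimal sketch of the skip-logit idea in PyTorch; the function name `pruned_logits` and its arguments are illustrative, not part of any existing library:

```python
import torch

def pruned_logits(hidden, W_unembed, bias, active_ids):
    """Compute logits only for the active (non-deactivated) token ids.

    hidden:     (d,) final hidden state for the current position
    W_unembed:  (V, d) unembedding / output projection matrix
    bias:       (V,) output bias vector
    active_ids: (k,) LongTensor of kept token ids, with k << V
    """
    W_small = W_unembed.index_select(0, active_ids)  # (k, d): rows for active tokens only
    b_small = bias.index_select(0, active_ids)       # (k,)
    return W_small @ hidden + b_small                # (k,) logits: O(k*d) instead of O(V*d)
```

Here `index_select` gathers only the rows of the unembedding matrix that remain active, so the matrix multiply scales with the number of active tokens k rather than the full vocabulary size V.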
Detailed Description of the Method:
The method comprises the following steps:
1. Input Processing: An input sequence of tokens is received and processed by the LLM up to the layer preceding the output layer, producing a hidden state representation.
2. Deactivation List Generation: A set of tokens to be deactivated is identified. This identification can be based on several factors, including:
Input Sequence: Tokens irrelevant to the current input context can be deactivated.
Hidden State: Analysis of the hidden state can reveal which tokens are unlikely to be predicted next, allowing for their deactivation.
Persistent Memory: A persistent memory, such as a flash drive, can store pre-defined lists of tokens to be deactivated based on predefined criteria, domain knowledge, or user preferences. This persistent memory can be accessed and modified dynamically during text generation.
3. Matrix and Bias Reduction: The unembedding matrix and bias vector of the output layer are dynamically reduced based on the list of deactivated tokens. The rows of the unembedding matrix, and the corresponding elements of the bias vector, that belong to deactivated tokens are removed or masked, creating smaller, reduced matrices.
4. Logit Computation: Logits are then calculated only for the remaining activated tokens, using the reduced unembedding matrix and bias vector. This drastically reduces the computational burden compared to calculating logits for the entire vocabulary.
5. Probability Distribution and Token Selection: A probability distribution is generated from the calculated logits (e.g., using softmax). The next token is selected from this probability distribution using methods such as sampling or argmax.
6. Iteration: Steps 1 through 5 are repeated until the desired length of text is generated.
A code sketch of these steps follows.
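The following is a sketch of steps 1 through 6 under stated assumptions: PyTorch; a hypothetical model object exposing its transformer trunk as `model.trunk`, its unembedding matrix as `model.W_unembed`, and its output bias as `model.bias`; and a deactivation list stored as token ids in a JSON file ("deactivated.json") standing in for the persistent memory. None of these names come from an existing library.

```python
import json
import torch

# Step 2 embodiment: load a pre-defined deactivation list from persistent
# memory (a JSON file of token ids, standing in for the flash drive / SSD).
deactivated = set(json.load(open("deactivated.json")))  # hypothetical file

def generate(model, input_ids, max_new_tokens=32):
    V = model.W_unembed.shape[0]                              # full vocabulary size
    # Complement of the deactivation set: ids whose logits we still compute.
    active_ids = torch.tensor(
        [i for i in range(V) if i not in deactivated], dtype=torch.long
    )
    for _ in range(max_new_tokens):
        hidden = model.trunk(input_ids)[:, -1, :]              # step 1: (1, d) hidden state
        W_small = model.W_unembed.index_select(0, active_ids)  # step 3: (k, d) reduced matrix
        b_small = model.bias.index_select(0, active_ids)       # step 3: (k,) reduced bias
        logits = hidden @ W_small.T + b_small                  # step 4: (1, k) logits only
        probs = torch.softmax(logits, dim=-1)                  # step 5: distribution over k tokens
        choice = torch.multinomial(probs, 1)                   # step 5: sample one index
        next_id = active_ids[choice].view(1, 1)                # map back to full-vocabulary id
        input_ids = torch.cat([input_ids, next_id], dim=1)     # step 6: append and repeat
    return input_ids
```

Because `active_ids` happens to be fixed in this sketch, the reduced matrix could also be gathered once before the loop; recomputing it per step, as shown, is what permits the deactivation set to change dynamically between tokens.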
Detailed Description of the Apparatus: The apparatus may be implemented as one or more of the following:
Software modules: Implemented as part of an LLM inference engine.
Hardware accelerators: Specialized processors (e.g., GPUs, TPUs) optimized for LLM computations.
Combined software-hardware systems: Leveraging both specialized hardware and optimized software algorithms.
An exemplary embodiment of the apparatus comprises:
Processing Unit: A processing unit (e.g., CPU, GPU, TPU) configured to execute the steps of the method described above.
Memory: Memory (e.g., RAM) for storing the model parameters, including the unembedding matrix and bias vector, as well as intermediate activations and data structures.
Persistent Memory (Optional): A persistent memory unit (e.g., flash drive, SSD) for storing and retrieving lists of deactivated tokens.
Control Logic: Control logic implemented in software or hardware to manage the dynamic deactivation process, including accessing the persistent memory, generating the list of deactivated tokens, reducing the matrices, and calculating the logits.
Advantages:
Reduced Computational Cost: Significant reduction in the number of computations required for logit calculation, leading to faster text generation.
Dynamic Adaptation: Context-dependent token deactivation allows the model to adapt to different input sequences and domains.
Flexibility: The method can be applied to various language model architectures and combined with other optimization techniques.
Improved Efficiency: More efficient use of computational resources, especially important for resource-constrained devices.
Example:
Consider generating Python code. Common English words ("the," "a," "is") can be deactivated while generating a code block. When generating a docstring, these English words might be reactivated, while technical terms specific to another domain could be deactivated. This dynamic adaptation improves efficiency and reduces the chance of generating irrelevant code or documentation.
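A hedged sketch of building such a deactivation list from a tokenizer vocabulary, using the Hugging Face transformers API (`AutoTokenizer.from_pretrained` and `get_vocab` are real calls); the stopword list and the prefix-stripping rule are illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")
stopwords = {"the", "a", "an", "is", "of", "and"}  # illustrative English function words

deactivated_ids = {
    tok_id
    for tok, tok_id in tokenizer.get_vocab().items()
    # Strip the leading-space markers used by BPE vocabularies ('Ġ', '▁')
    # before comparing against the stopword list.
    if tok.lstrip("Ġ▁").lower() in stopwords
}
# When generation enters a docstring, this set can be swapped for a different,
# domain-specific list, re-activating the English words.
```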
This invention provides a significant advancement in the field of large language models by enabling efficient text generation through dynamic vocabulary reduction. This detailed description clarifies the methods and apparatus used to achieve this optimization and highlights its numerous advantages.
Probable US Patent Titles:
Dynamic Vocabulary Language Model for Efficient Text Generation
Method and Apparatus for Reducing Logit Computation in Large Language Models
Adaptive Output Layer for Optimized Text Generation in Transformer Models
System and Method for Dynamic Token Deactivation in Neural Language Models
Efficient Language Modeling with On-the-Fly Vocabulary Reduction
Apparatus and Method for Context-Aware Logit Computation in Transformer-Based Language Models
Dynamically Scalable Language Model with Reduced Computational Complexity
Token-Level Logit Pruning for Efficient Text Generation
Language Model with Adaptive Unembedding Matrix for Optimized Inference
Neural Network Language Model with Dynamic Output Vocabulary Filtering
1st Independent Claim (subject to change) in Patent Application for an Apparatus and Method for Context-Aware Logit Computation in Transformer-Based Language Models:
A method for generating text using a neural network language model comprising a plurality of layers including a final layer configured to output a hidden state representation, and an output layer comprising an unembedding matrix and a bias vector, the method comprising:
a) receiving an input sequence of tokens;
b) identifying a set of deactivated tokens based on at least one of: a hyperparameter, a vocab file, a tokenizer file, a control signal, a user command, a keyboard press, a mouse click, a special token, the input sequence, a hidden state representation generated from processing the input sequence through one or more layers of the neural network language model, and a persistent memory;
c) processing the input sequence through the plurality of layers to generate the hidden state representation;
d) generating a reduced unembedding matrix and a reduced bias vector by selecting rows of the unembedding matrix and elements of the bias vector corresponding to tokens not present in the set of deactivated tokens;
e) calculating logits for the tokens not present in the set of deactivated tokens by performing a matrix multiplication of a hidden state representation and the reduced unembedding matrix and adding the reduced bias vector;
f) generating a probability distribution over the tokens not present in the set of deactivated tokens based on the calculated logits; and
g) selecting a token from the probability distribution and appending it to the input sequence of tokens.
If you are interested in obtaining a time-extended, non-educational, non-experimental, or commercial license, or an assignment of patent rights, follow and contact the inventor Martial Terran via huggingface here.