mjbuehler committed on
Commit a91e17a · verified · 1 Parent(s): be96f5f

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -40,7 +40,7 @@ This version of Cephalo, lamm-mit/Cephalo-Idefics2-vision-3x8b-beta, is a Mixtur
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/b7BK8ZtDzTMsyFDi0wP3w.png)
 
-This model leverages multiple expert networks to process different parts of the input, allowing for more efficient and specialized computations. For each token in the input sequence, a gating layer computes scores for all experts and selects the top-*k* experts based on these scores. We use a *softmax (..)* activation function to ensure that the weights across the chosen experts sum up to unity. The output of the gating layer is a set of top-*k* values and their corresponding indices. The selected experts' outputs ($\mathbf{Y}$) are then computed and combined using a weighted sum, where the weights are given by the top-*k* values. This sparse MoE mechanism allows our model to dynamically allocate computational resources, improving efficiency and performance for complex vision-language tasks. depicts an overview of the architecture.
+This model leverages multiple expert networks to process different parts of the input, allowing for more efficient and specialized computations. For each token in the input sequence, a gating layer computes scores for all experts and selects the top-*k* experts based on these scores. We use a *softmax (..)* activation function to ensure that the weights across the chosen experts sum up to unity. The output of the gating layer is a set of top-*k* values and their corresponding indices. The selected experts' outputs Y are then computed and combined using a weighted sum, where the weights are given by the top-*k* values. This sparse MoE mechanism allows our model to dynamically allocate computational resources, improving efficiency and performance for complex vision-language tasks. depicts an overview of the architecture.
 
 For this sample model, the model has 20b parameters (three experts, 8b each, and 8b active parameters during inference). The instructions below include a detailed explanation about how other models can be constructed.
 
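
The per-token gating procedure described in the changed README paragraph (score all experts, keep the top-*k*, normalize the kept scores with a softmax, then take a weighted sum of the selected experts' outputs) can be illustrated with a small standard-library Python sketch. This is not the Cephalo implementation: the function names `top_k_gate` and `moe_token_output`, the toy scaling experts, and the choice to apply the softmax after the top-*k* selection are all assumptions made for illustration, and the gate scores are passed in directly rather than produced by a trained gating layer.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def top_k_gate(scores: Sequence[float], k: int) -> Tuple[List[int], List[float]]:
    """Return indices of the k highest scores and softmax weights over them.

    Hypothetical helper: assumes the softmax is applied to the selected
    scores only, so that the k weights sum to one.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest scores, highest first (ties keep input order).
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top = [scores[i] for i in indices]
    shift = max(top)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in top]
    total = sum(exps)
    return indices, [e / total for e in exps]


def moe_token_output(
    token: Vector,
    gate_scores: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int,
) -> Vector:
    """Combine the outputs of the top-k experts for one token by weighted sum."""
    indices, weights = top_k_gate(gate_scores, k)
    # Only the selected experts are evaluated; this is what makes the MoE sparse.
    outputs = [experts[i](token) for i in indices]
    combined = [0.0] * len(outputs[0])
    for weight, y in zip(weights, outputs):
        combined = [c + weight * v for c, v in zip(combined, y)]
    return combined


def make_scaling_expert(factor: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for an expert network: multiplies every feature by `factor`."""
    return lambda x: [factor * v for v in x]


# Three toy experts, mirroring the three-expert layout mentioned in the README.
toy_experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0)]

# Experts 1 and 2 tie for the highest score, so each receives weight 0.5.
print(moe_token_output([1.0, 2.0], [0.1, 2.0, 2.0], toy_experts, k=2))
# → [2.5, 5.0]
```

With `k=1` the sketch reduces to routing each token to a single expert with weight 1, which corresponds to the "8b active parameters" case only under the assumption that one expert is active per token.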