Transformers
PyTorch
English
pixel
pretraining
Inference Endpoints
plip committed on
Commit 303131a
1 Parent(s): baa6b82

Add model card

Files changed (2)
  1. README.md +64 -0
  2. config.json +1 -1
README.md ADDED
@@ -0,0 +1,64 @@
---
license: apache-2.0
tags:
- pretraining
- pixel
datasets:
- wikipedia
- bookcorpusopen
language:
- en
---

# PIXEL (Pixel-based Encoder of Language)

PIXEL is a language model trained to reconstruct masked image patches that contain rendered text. PIXEL was pretrained on the *English* Wikipedia and BookCorpus (around 3.2B words in total), but because it operates on rendered text rather than a tokenizer with a fixed vocabulary, it can in principle be finetuned on data in any written language that can be typeset on a computer screen.

It is not currently possible to use the Hosted Inference API with PIXEL.

Paper: [Language Modelling with Pixels](https://arxiv.org/abs/2207.06991)

Codebase: [https://github.com/xplip/pixel](https://github.com/xplip/pixel)

## Model description

PIXEL consists of three major components: a text renderer, which draws text as an image; an encoder, which encodes the unmasked regions of the rendered image; and a decoder, which reconstructs the masked regions at the pixel level. It is built on [ViT-MAE](https://arxiv.org/abs/2111.06377).

During pretraining, the renderer produces images containing the training sentences. Patches of these images are linearly projected to obtain patch embeddings (as opposed to having an embedding matrix as in, e.g., BERT), and 25% of the patches are masked out. The encoder, a Vision Transformer (ViT), processes only the unmasked patches. The lightweight decoder, with hidden size 512 and 8 transformer layers, inserts learnable mask tokens into the encoder's output sequence and learns to reconstruct the raw pixel values at the masked positions.
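The patch-projection and masking step described above can be sketched in NumPy. All shapes and the random projection below are illustrative assumptions for exposition, not values or code taken from the actual PIXEL implementation (apart from the 25% mask ratio stated above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a rendered image cut into 529 flattened patches
# of 16x16x3 pixels, projected to a 768-dim embedding space.
# These particular numbers are assumptions, not the PIXEL config.
num_patches, patch_dim, hidden = 529, 16 * 16 * 3, 768
patches = rng.normal(size=(num_patches, patch_dim))

# Linear projection to patch embeddings (no token embedding matrix).
proj = rng.normal(size=(patch_dim, hidden)) * 0.02
embeddings = patches @ proj

# Mask out 25% of the patches; the encoder sees only the unmasked ones.
mask_ratio = 0.25
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]
visible = embeddings[visible_idx]
```

The decoder would then receive the encoded visible patches plus learnable mask tokens at the positions in `masked_idx` and regress the original pixel values there.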

After pretraining, the decoder can be discarded, leaving an 86M-parameter encoder on which task-specific classification heads can be stacked. Alternatively, the decoder can be retained, and PIXEL can be used as a pixel-level generative language model (see Figures 3 and 6 in the paper for examples).
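To make the encoder-plus-head setup concrete, here is a minimal NumPy sketch of stacking a classification head on pooled encoder outputs. The encoder output, the mean-pooling choice, and the linear head are hypothetical illustrations, not the heads shipped in the pixel codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output: one 768-dim hidden state per visible patch.
num_visible, hidden, num_labels = 397, 768, 3
encoder_output = rng.normal(size=(num_visible, hidden))

# A task-specific head stacked on the (decoder-free) encoder:
# mean-pool the patch representations, then apply a linear classifier.
W = rng.normal(size=(hidden, num_labels)) * 0.02
b = np.zeros(num_labels)
logits = encoder_output.mean(axis=0) @ W + b

# Softmax over the hypothetical label set.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In practice, of course, `W` and `b` (and usually the encoder) are learned during finetuning rather than sampled.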

For more details on how PIXEL works, please check the paper and the codebase linked above.

## Intended uses

PIXEL is primarily intended to be finetuned on downstream NLP tasks. See the [model hub](https://huggingface.co/models?search=Team-PIXEL/pixel-base) to look for finetuned versions on a task that interests you. Otherwise, check out the PIXEL codebase on GitHub [here](https://github.com/xplip/pixel) to find out how to finetune PIXEL for your task.

### How to use

Here is how to load PIXEL:

```python
from pixel import PIXELConfig, PIXELForPreTraining

config = PIXELConfig.from_pretrained("Team-PIXEL/pixel-base")
model = PIXELForPreTraining.from_pretrained("Team-PIXEL/pixel-base", config=config)
```

## Citing and Contact Author

```bibtex
@article{rust-etal-2022-pixel,
  title={Language Modelling with Pixels},
  author={Phillip Rust and Jonas F. Lotz and Emanuele Bugliarello and Elizabeth Salesky and Miryam de Lhoneux and Desmond Elliott},
  journal={arXiv preprint},
  year={2022},
  url={https://arxiv.org/abs/2207.06991}
}
```

GitHub: [@xplip](https://github.com/xplip)

Twitter: [@rust_phillip](https://twitter.com/rust_phillip)
config.json CHANGED
@@ -18,7 +18,7 @@
   "intermediate_size": 3072,
   "layer_norm_eps": 1e-12,
   "mask_ratio": 0.25,
- "model_type": "vit_mae",
+ "model_type": "pixel",
   "norm_pix_loss": true,
   "num_attention_heads": 12,
   "num_channels": 3,