File size: 8,387 Bytes
9a803ad
 
 
b23ae36
9a803ad
 
 
 
 
33c92e1
 
 
9a803ad
7aabe27
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9a803ad
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
---
license: mit
datasets:
- imagenet-1k-blurred
pipeline_tag: image-classification
tags:
- sparsity
- vision-transformer
- pytorch
library_name: torchvision
metrics:
- accuracy
---
# SuperBlock

SuperBlock combines two techniques for efficient neural network training and inference: Supermask and Block Compressed Sparse Row (BSR)

### Supermask
[Supermask](https://arxiv.org/abs/2207.00670) is a technique for applying structured sparsity to neural networks using a learned mask. It works by learning a continuous mask (scores) that is applied element-wise to the weights of a neural network layer. The mask scores are learned separately from the weights and are thresholded based on a target sparsity level to obtain a binary mask. The mask determines which weigths are kept and which are pruned, and is learned during training.

During inference, the binary mask is applied element-wise to the weights, pruning the weights that correspond to a 0 in the mask, resulting in a sparse network that can be efficiently computed. 

### Block compressed Sparse Row Format (BSR)
[The BSR format](https://pytorch.org/docs/main/sparse.html#sparse-bsr-tensor) is a sparse matrix representation that stores dense sub-blocks of non-zero elements instead of individual non-zero elements. The matrix is divided into equal-sized blocks, and only the non-zero blocks are stored.

The BSR format is efficient for sparse matrices with a block structure, where non-zero elements tend to cluster in dense sub-blocks. It reduces storage requirements and enables efficient matrix operations on the non-zero blocks.

Currently, the BSR format is optimized for Nvidia A100 GPU(s) only.

## Setup
To use SuperBlock, you will need
* [PyTorch](https://pytorch.org/get-started/locally/)

To train the model or evaluate accuracy, you will need:
* ImageNet2012-blurred dataset

At least one GPU:
* A100 or H100

## Installation
* Clone this repo
  ```
  git clone https://github.com/pytorch-labs/superblock.git
  cd superblock
  ```
* Create a new conda environment
  ```
  conda create -n superblock
  conda activate superblock
  ```
* Install PyTorch. For best performance, we recommend `2.3.0.dev20240305+cu121` nightly
  ```
  pip install --pre torch==2.3.0.dev20240305+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
  pip install --pre torchvision==0.18.0 --no-deps
  ```


## Benchmarking
Baseline:
```
python benchmark.py \
  --model vit_b_16 \
  --batch-size 256 \
  > /dev/null
```
Result:
```
532.1160546875 ms
```


80% sparsity, block size 64 (random weights):
```
python benchmark.py --model vit_b_16 \
  --batch-size 256 \
  --sparsity-linear 0.8 \
  --sp-linear-tile-size 64 \
  --sparsify-weights \
  --bsr 64 \
  > /dev/null
```
Result:
```
393.864453125 ms
```


## Training
Please refer to [TRAINING.md](TRAINING.md) for training from scratch. We use [Torchvision](https://github.com/pytorch/vision/tree/main/references/classification) as our framework for training. Supermask can be applied during training.

To apply supermask, we have the following arguments at our disposal,

* Apply Supermask to linear layers:
    ```
    --sparsity-linear
    --sp-linear-tile-size
    ```
* Apply Supermask to conv1x1 layers:
    ```
    --sparsity-conv1x1
    --sp-conv1x1-tile-size
    ```
* Apply Supermask to all other convolutional layers:
    ```
    --sparsity-conv
    --sp-conv-tile-size
    ```
* Skip the first transformer layer and/or last linear layer (ViT only):
    ```
    --skip-last-layer-sparsity
    --skip-first-transformer-sparsity
    ```

For example, if you would like to train a `vit_b_16` from scratch using Supermask, you can use the respective torchvision command found in [TRAINING.md](TRAINING.md) and append the supermask arguments:
```
torchrun --nproc_per_node=8 train.py\
    --model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema\ 
    --sparsity-linear 0.9 --sp-linear-tile-size 32
```
Through this command, we are training a `vit_b_16` with 90% sparsity to linear layers using 32x32 tiles.

Please run `python train.py --help` for a full list of available arguments.

## Evaluation

To run an evaluation of a Supermask-trained model, you can use [evaluate.py](evaluate.py). Our current version has signficant speedup with float32 only and not float16, hence, to illustrate speedup, we don't pass `--amp` in the example commands below.

```
MODEL_PATH=<put the path of the trained checkpoint here>
IMAGENET_PATH=<put the path of ImageNet dataset here>
NGPUS=1 # put number of available GPUS here
```

* Offline sparsification with BSR:
  ```
  torchrun --nproc_per_node=${NGPUS} evaluate.py  --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH}  --data-path ${IMAGENET_PATH} --sparsify-weights --bsr 32
  ```
  This command applies 90% sparsity to linear layers using 32x32 tiles, loads the model weights from ${MODEL_PATH}, loads the ImageNet validation set located at the specified path, applies offline sparsification to the weights, and converts the sparse weights to BSR format with a block size of 32. It is recommended to set `--bsr`      the same as tile size.

* Online sparsification without BSR:
  ```
  torchrun --nproc_per_node=${NGPUS} evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear 0.9 --sp-linear-tile-size 32 --weights-path ${MODEL_PATH} --data-path ${IMAGENET_PATH}
  ```
  This is similar to the previous command, but it does not apply offline sparsification or BSR conversion. Instead, the sparsity is applied on-the-fly during evaluation.

Please run `python evaluate.py --help` for a full list of available arguments.

Results (1x A100):
* Baseline
  ```
  Test:  Total time: 0:02:11
  Test:  Acc@1 78.392 Acc@5 93.592
  ```

* Sparsity= 0.9, Tile Size = 32, Online Sparsification, BSR = None
  ```
  Test:  Total time: 0:01:52
  Test:  Acc@1 76.092 Acc@5 92.656
  ```

* Sparsity= 0.9, Tile Size = 32, Offline Sparsification, BSR = None
  ```
  Test:  Total time: 0:01:54
  Test:  Acc@1 76.092 Acc@5 92.656
  ```

* Sparsity= 0.9, Tile Size = 32, Offline Sparsification, BSR = 32
  ```
  Test:  Total time: 0:01:25
  Test:  Acc@1 76.092 Acc@5 92.656
  ```

## Pretrained Weights

### Download:
Instead of training from scratch, if you'd like to use the Supermask weights of `vit_b_16` trained on privacy mitigated Imagenet-blurred, you can download them here:
```
SPARSITY=0.80 # Checkpoints available for: 0.70, 0.80, 0.82, 0.84, 0.86, 0.88, 0.90
BLOCK_SIZE=32 # Checkpoints available for: 16, 32, 64
```

```
mkdir checkpoints
# For baseline,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/baseline.pth -P checkpoints/
# For sparsified checkpoints,
wget https://huggingface.co/facebook/superblock-vit-b-16/resolve/main/checkpoints/sp${SPARSITY}-ts${BLOCK_SIZE}.pth -P checkpoints/
```

### Benchmark:
```
python benchmark.py --model vit_b_16 \
  --batch-size 256 \
  --sparsity-linear ${SPARSITY} \
  --sp-linear-tile-size ${BLOCK_SIZE} \
  --sparsify-weights \
  --bsr ${BLOCK_SIZE} \
  --weights-path ./checkpoints/superblock-vit-b-16-sp${SPARSITY}-ts${BLOCK_SIZE}.pth \
  > /dev/null
```
Result:
```
530.342578125 ms
```

### Evaluate:
8 x A100 GPUs:
```
torchrun --nproc_per_node=8 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/superblock-vit-b-16-sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}
```
Result:
```
Test:  Total time: 0:01:01
Test:  Acc@1 77.644 Acc@5 93.554
```

1 x A100 GPUs:
```
torchrun --nproc_per_node=1 evaluate.py --model vit_b_16 --batch-size 256 --sparsity-linear ${SPARSITY} --sp-linear-tile-size ${BLOCK_SIZE} --bsr ${BLOCK_SIZE} --sparsify-weights --weights-path checkpoints/superblock-vit-b-16-sp${SPARSITY}-ts${BLOCK_SIZE}.pth --data-path ${IMAGENET_PATH}
```
Result:
```
Test:  Total time: 0:01:51
Test:  Acc@1 77.644 Acc@5 93.554
```

## License
SuperBlock is released under the [MIT license](https://github.com/pytorch-labs/superblock?tab=MIT-1-ov-file#readme).