Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who had already converted them from JAX to PyTorch. Credits go to him.

This repo contains a Core ML version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224).

## Usage instructions

Create a `VNCoreMLRequest` that loads the ViT model:
```swift
import CoreML
import Vision

lazy var classificationRequest: VNCoreMLRequest = {
    do {
        let config = MLModelConfiguration()
        config.computeUnits = .all
        let coreMLModel = try ViT(configuration: config)
        let visionModel = try VNCoreMLModel(for: coreMLModel.model)

        let request = VNCoreMLRequest(model: visionModel, completionHandler: { [weak self] request, error in
            if let results = request.results as? [VNClassificationObservation] {
                /* do something with the results */
            }
        })

        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("Failed to create VNCoreMLModel: \(error)")
    }
}()
```
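The completion handler above leaves the results unprocessed. As a minimal sketch of what might stand in for the placeholder comment (calling, say, `self?.processObservations(results)`), the hypothetical helper below formats the top five ImageNet labels with their confidences; it is an illustration, not part of this repo:

```swift
import Foundation
import Vision

// Hypothetical helper, not part of this repo. Vision returns the
// observations sorted by confidence, so the first five are the best.
func processObservations(_ observations: [VNClassificationObservation]) {
    let top5 = observations.prefix(5).map { observation in
        String(format: "%@ %.1f%%", observation.identifier, observation.confidence * 100)
    }
    // The completion handler runs on a background queue; hop back to
    // the main thread before touching any UI.
    DispatchQueue.main.async {
        print(top5.joined(separator: "\n"))
    }
}
```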
Perform the request:

```swift
func classify(image: UIImage) {
    guard let ciImage = CIImage(image: image) else {
        print("Unable to create CIImage")
        return
    }

    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(ciImage: ciImage, orientation: .up)
        do {
            try handler.perform([self.classificationRequest])
        } catch {
            print("Failed to perform classification: \(error)")
        }
    }
}
```
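The handler above hard-codes `orientation: .up`, which is only correct for images that are not rotated. For photos coming from the camera or the photo library, the `UIImage`'s orientation should be passed along instead. One common way to do that (this extension is an assumption, not part of the repo) is:

```swift
import UIKit
import ImageIO

// Map UIKit's image orientation onto the CGImagePropertyOrientation
// value that Vision expects.
extension CGImagePropertyOrientation {
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up:            self = .up
        case .upMirrored:    self = .upMirrored
        case .down:          self = .down
        case .downMirrored:  self = .downMirrored
        case .left:          self = .left
        case .leftMirrored:  self = .leftMirrored
        case .right:         self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default:    self = .up
        }
    }
}
```

With that in place, the handler would be created as `VNImageRequestHandler(ciImage: ciImage, orientation: CGImagePropertyOrientation(image.imageOrientation))`.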