To be clear, this model is tailored to my image and video classification tasks, not to ImageNet. I built EfficientNetV2.5 s to outperform the existing EfficientNet b0 to b4, EfficientNet b1 to b4 pruned (I pruned b4), and EfficientNetV2 t to l models, whether trained with TensorFlow or PyTorch, in top-1 accuracy, efficiency, and robustness on my dataset and the CMAD benchmark.

Model Details

  • Model tasks: Image classification / video classification / feature backbone
  • Model stats:
    • Params: 16.64 M
    • Multiply-Add Operations: 5.32 G
    • Image size: train = 299x299 / 304x304, test = 304x304
    • Classification layer: defaults to 1,000 classes
  • Papers:
  • Dataset: ImageNet-1k
  • Pretrained: Yes, but requires more pretraining
  • Original: This model architecture is original
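For a rough sense of scale, the stats above can be converted into a weight-memory and FLOP estimate. This is a back-of-the-envelope sketch, assuming float32 weights (4 bytes each) and the common convention that one multiply-add counts as two floating-point operations. The helper names are made up for illustration.

```python
# Back-of-the-envelope estimates from the model stats listed above.
# Assumptions (not from the model card): weights stored as float32 by default,
# and 1 multiply-add = 2 FLOPs.

PARAMS = 16.64e6   # parameters, from the stats above
MACS = 5.32e9      # multiply-add operations per forward pass, from the stats above


def weight_megabytes(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights in MB (10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


def gflops(macs: float) -> float:
    """Approximate GFLOPs per forward pass, counting each multiply-add as 2 FLOPs."""
    return 2 * macs / 1e9


print(f"fp32 weights: ~{weight_megabytes(PARAMS):.2f} MB")
print(f"fp16 weights: ~{weight_megabytes(PARAMS, 2):.2f} MB")
print(f"forward pass: ~{gflops(MACS):.2f} GFLOPs")
```

These figures cover weights only. Activations, optimizer state, and framework overhead add to the memory actually used during training.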

Load PyTorch JIT Model with 1000 Classes

from transformers import AutoModel
model = AutoModel.from_pretrained("FredZhang7/efficientnetv2.5_rw_s", trust_remote_code=True)
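Once loaded, the model can be called like any other PyTorch module. The following is a minimal sketch, assuming the model returns raw logits of shape (batch, 1000) for a float tensor of shape (batch, 3, 304, 304). The remote code may return a different structure, so inspect the output first. `predict_top1` is a hypothetical helper, and a tiny stand-in module replaces the real model so the snippet runs offline.

```python
import torch


@torch.no_grad()
def predict_top1(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Return the index of the highest-scoring class for each image.

    Assumes `model(images)` returns logits of shape (batch, num_classes).
    """
    model.eval()                     # inference mode for dropout / batch norm
    logits = model(images)
    return logits.argmax(dim=1)


# Stand-in for the real model (assumption: used only so this runs offline).
stand_in = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(3, 1000),
)

batch = torch.rand(2, 3, 304, 304)   # two random images with values in [0, 1)
preds = predict_top1(stand_in, batch)
print(preds.shape)
```

To use the real model, pass the `model` loaded above in place of `stand_in`.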

Load Model with Custom Classes

To change the number of classes, replace the linear classification layer. Here's an example of how to convert the architecture into a trainable model.

pip install ptflops timm

from ptflops import get_model_complexity_info
import torch
import urllib.request

nclass = 3                  # number of classes in your dataset
input_size = (3, 304, 304)  # recommended image input size
print_layer_stats = True    # prints the statistics for each layer of the model
verbose = True              # prints additional info about the MAC calculation

# Download the model. Skip this step if already downloaded
base_model = "efficientnetv2.5_base_in1k"
url = f"https://huggingface.co/FredZhang7/efficientnetv2.5_rw_s/resolve/main/{base_model}.pth"
file_name = f"./{base_model}.pth"
urllib.request.urlretrieve(url, file_name)

shape = (2,) + input_size
example_inputs = torch.randn(shape)
example_inputs = (example_inputs - example_inputs.min()) / (example_inputs.max() - example_inputs.min())

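# The checkpoint is a full pickled model, so PyTorch 2.6+ needs weights_only=False.
# Only load pickled files from sources you trust.
model = torch.load(file_name, weights_only=False)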
model.classifier = torch.nn.Linear(in_features=1984, out_features=nclass, bias=True)
macs, nparams = get_model_complexity_info(model, input_size, as_strings=False, print_per_layer_stat=print_layer_stats, verbose=verbose)
traced_model = torch.jit.trace(model, example_inputs)

model_name = f"{base_model}_{nparams / 1e6:.2f}M_{macs / 1e9:.2f}G.pth"
traced_model.save(model_name)

# Load the trainable model (saved as TorchScript, so use torch.jit.load)
model = torch.jit.load(model_name)
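When fine-tuning on a small dataset, one common option is to train only the new classification layer at first. This is a sketch of that idea, not the procedure used for this model. It assumes the eager-mode model exposes its head as `model.classifier`, as in the snippet above, and it uses a small stand-in module so it runs without the checkpoint. `StandIn` and `freeze_backbone` are hypothetical names.

```python
import torch


class StandIn(torch.nn.Module):
    """Tiny placeholder with the same `classifier` attribute name (assumption)."""

    def __init__(self, nclass: int = 3):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
        )
        self.classifier = torch.nn.Linear(8, nclass)

    def forward(self, x):
        return self.classifier(self.features(x))


def freeze_backbone(model: torch.nn.Module) -> list:
    """Freeze every parameter except those of `model.classifier`."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


model = StandIn(nclass=3)
trainable = freeze_backbone(model)
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One illustrative training step on random data.
images = torch.rand(4, 3, 304, 304)
labels = torch.randint(0, 3, (4,))
loss = torch.nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"trainable tensors: {len(trainable)}, loss: {loss.item():.4f}")
```

Unfreezing the remaining layers later, usually with a lower learning rate, is a separate design choice that depends on dataset size.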

Top-1 Accuracy Comparisons

I fine-tuned the existing models at 299x299, 304x304, 320x320, or 384x384 resolution, depending on the input size used during pretraining and on VRAM usage.

efficientnet_b3_pruned achieved the second-highest top-1 accuracy and the highest epoch-1 training accuracy on my task, among EfficientNetV2.5 small and all existing EfficientNet models that my RTX 3090 (24 GB VRAM) could handle.

I will publish the detailed report in this model repository. This repository contains only the base model, which is lightly pretrained on ImageNet and not trained on my task.

Carbon Emissions

Comparing all models and testing my new architectures cost roughly 648 GPU hours over a span of 35 days.
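To put that figure in perspective, the GPU hours can be turned into an average daily load and a rough GPU-only energy bound. This is a sketch under stated assumptions: the card is taken to draw its rated 350 W board power for every GPU hour (real draw is usually lower), and the grid carbon intensity is left as an argument to fill in for your region. Neither number comes from the author.

```python
GPU_HOURS = 648        # from the text above
DAYS = 35              # from the text above
BOARD_POWER_KW = 0.35  # assumption: RTX 3090 rated board power, 350 W


def average_gpu_hours_per_day(gpu_hours: float, days: float) -> float:
    """Mean GPU hours consumed per calendar day."""
    return gpu_hours / days


def energy_upper_bound_kwh(gpu_hours: float, power_kw: float = BOARD_POWER_KW) -> float:
    """GPU-only energy if the card ran at rated power for every GPU hour."""
    return gpu_hours * power_kw


def emissions_kg(energy_kwh: float, kg_co2_per_kwh: float) -> float:
    """CO2-equivalent emissions for a caller-supplied grid carbon intensity."""
    return energy_kwh * kg_co2_per_kwh


energy = energy_upper_bound_kwh(GPU_HOURS)
print(f"average load: {average_gpu_hours_per_day(GPU_HOURS, DAYS):.1f} GPU hours/day")
print(f"GPU-only energy bound: {energy:.1f} kWh")
```

Passing `energy` and a regional carbon intensity to `emissions_kg` gives an emissions estimate. CPU, cooling, and other system power are not included.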
