---
license: mit
language:
- en
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: sentence-similarity
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
---

# SmartBERT V2 CodeBERT

![SmartBERT](./framework.png)

## Overview

SmartBERT V2 CodeBERT is a pre-trained model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**, designed to effectively embed **smart contract** function-level code.

- **Training Data:** 16,000 smart contracts.
- **Evaluation Data:** 4,000 smart contracts.
- **Hardware:** 2 NVIDIA A100 80 GB GPUs.
- **Training Duration:** More than 10 hours.

## Preprocessing

All newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to keep the input format consistent.

## Base Model

- [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

## Training Setup

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint
)
```

## How to Use

To train and deploy the SmartBERT V2 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

## Contributors

- [Youwei Huang](https://www.devil.ren)
- [Sen Fang](https://github.com/TomasAndersonFang)

## Sponsors

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- CAS Mino (中科劢诺)
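
## Example: Preprocessing and Embedding

A minimal sketch of the preprocessing rule above, plus one common way to obtain function-level embeddings with 🤗 Transformers (mean pooling over token states). The Hub model ID `web3se/SmartBERT-v2` below is an assumption, not stated by this card, and mean pooling is an illustrative choice rather than the authors' confirmed method:

```python
def normalize_code(code: str) -> str:
    # Preprocessing described above: replace every newline and tab
    # character with a single space.
    return code.replace("\n", " ").replace("\t", " ")


def embed_functions(snippets):
    # Lazy imports keep normalize_code usable without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # NOTE: hypothetical Hub ID -- substitute the actual SmartBERT V2
    # repository ID when loading the published model.
    model_id = "web3se/SmartBERT-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(
        [normalize_code(s) for s in snippets],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Mean-pool over tokens, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

For a RoBERTa-base architecture such as CodeBERT, `embed_functions([...])` returns one 768-dimensional vector per input function.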