---
model-index:
- name: gbyuvd/drugtargetpred-chemselfies
  results:
  - task:
      type: text-classification
    dataset:
      name: test
      type: custom
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 0.6199
    - name: Weighted Precision
      type: Precision
      value: 0.6142
    - name: Weighted Recall
      type: Recall
      value: 0.6199
    - name: Weighted F1
      type: F1
      value: 0.6127
license: cc-by-nc-sa-4.0
metrics:
- accuracy
- f1
- recall
- precision
library_name: transformers
tags:
- chemistry
- biology
- drug-discovery
- drug-target
- chembl34
- selfies
- drugs
- molecules
- compounds
base_model: gbyuvd/chemselfies-base-bertmlm
base_model_relation: finetune
---

# ChemFIE-DTP (DrugTargetPrediction - 221 Classes)

This model is a BERT-like sequence classifier for 221 human protein drug targets, fine-tuned from [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) on a dataset derived from ChEMBL34 (Zdrazil et al. 2023). It predicts potential drug targets from chemical structures represented as SELFIES (Self-Referencing Embedded Strings).

The model was trained on a selected and balanced dataset of around 154k examples covering 221 distinct human protein targets. Data selection criteria included specific activity types (IC50, Ki, EC50) with values ≤ 10 µM, assay confidence scores ≥ 7, and exact activity relations. Among all drug target classes found in ChEMBL34, only classes with at least 1,000 examples were selected, and classes with more examples were capped at 1,000.

Building on the base model's pre-trained knowledge of SELFIES, this model was originally intended to validate that the lightweight base model can be fine-tuned for various tasks. In this case, it might be useful for tasks related to early-stage drug discovery and target prediction (e.g. compound annotation), though its performance and applicability should be carefully evaluated for specific use cases (see [Evaluation](#evaluation)).

- The list of classes is available in "label_dict.json".
- Per-class performance is available in "test_result.txt".

Based on the model's training and evaluation losses, there is potential for improvement with further training; however, I cannot afford it at the moment.

### Disclaimer: For Academic Purposes Only

The information and model provided are for academic purposes only. They are intended for educational and research use, and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.

[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/O4O710GFBZ)

# Table of Contents

1. [Model Details](#model-details)
2. [Usage](#usage)
    1. [SMILES to SELFIES conversion](#uses)
    2. [Get Top-K Prediction](#get-top-k-prediction)
    3. [Direct Use using Classifier Pipeline](#direct-use-using-classifier-pipeline)
3. [Training Details](#training-details)
4. [Evaluation](#evaluation)
    1. [General](#general)
    2. [Classes with Best Performance (F1>0.9)](#classes-with-best-performance-f109)
    3. [Classes with Good Performance (0.7 < F1 < 0.9)](#classes-with-good-performance-07f109)

#### Classes with Best Performance (F1>0.9)

```
CHEMBL252: Endothelin receptor ET-A (F1: 0.9875)
CHEMBL4829: Acetyl-CoA carboxylase 2 (F1: 0.9849)
CHEMBL3713062: Tissue factor pathway inhibitor (F1: 0.9825)
CHEMBL2176771: Complement factor D (F1: 0.9801)
CHEMBL3988583: Sepiapterin reductase (F1: 0.9798)
CHEMBL3572: Cholesteryl ester transfer protein (F1: 0.9776)
CHEMBL1800: Corticotropin releasing factor receptor 1 (F1: 0.9750)
CHEMBL4198: Inhibitor of apoptosis protein 3 (F1: 0.9704)
CHEMBL5137: Metabotropic glutamate receptor 2 (F1: 0.9679)
CHEMBL5652: Glucose-dependent insulinotropic receptor (F1: 0.9677)
CHEMBL1985: Glucagon receptor (F1: 0.9674)
CHEMBL2001: Purinergic receptor P2Y12 (F1: 0.9674)
CHEMBL2007625: Isocitrate dehydrogenase [NADP] cytoplasmic (F1: 0.9628)
CHEMBL3820: Hexokinase type IV (F1: 0.9606)
CHEMBL4550: 5-lipoxygenase activating protein (F1: 0.9606)
CHEMBL6009: Diacylglycerol O-acyltransferase 1 (F1: 0.9604)
CHEMBL298: Cholecystokinin B receptor (F1: 0.9582)
CHEMBL1855: Gonadotropin-releasing hormone receptor (F1: 0.9538)
CHEMBL1945: Melatonin receptor 1A (F1: 0.9512)
CHEMBL4561: Neuropeptide Y receptor type 5 (F1: 0.9484)
CHEMBL4805: P2X purinoceptor 7 (F1: 0.9439)
CHEMBL5071: G protein-coupled receptor 44 (F1: 0.9438)
CHEMBL4616: Ghrelin receptor (F1: 0.9409)
CHEMBL4422: Free fatty acid receptor 1 (F1: 0.9406)
CHEMBL4441: C-X-C chemokine receptor type 3 (F1: 0.9403)
CHEMBL248: Leukocyte elastase (F1: 0.9373)
CHEMBL2998: P2X purinoceptor 3 (F1: 0.9363)
CHEMBL1744525: Nicotinamide phosphoribosyltransferase (F1: 0.9307)
CHEMBL1966: Dihydroorotate dehydrogenase (F1: 0.9272)
CHEMBL5023: p53-binding protein Mdm-2 (F1: 0.9250)
CHEMBL259: Melanocortin receptor 4 (F1: 0.9246)
CHEMBL1889: Vasopressin V1a receptor (F1: 0.9173)
CHEMBL3105: Poly [ADP-ribose] polymerase-1 (F1: 0.9158)
CHEMBL286: Renin (F1: 0.9148)
CHEMBL2000: Plasma kallikrein (F1: 0.9109)
CHEMBL249: Neurokinin 1 receptor (F1: 0.9104)
CHEMBL2243: Anandamide amidohydrolase (F1: 0.9059)
CHEMBL284: Dipeptidyl peptidase IV (F1: 0.9037)
CHEMBL2094135: Gamma-secretase (F1: 0.9020)
```

#### Classes with Good Performance (0.7