willdampier committed
Commit c5fa013 · 1 Parent(s): ff9ca19

adjusting after merge

Files changed (1):
  1. README.md +12 -10
README.md CHANGED
Before:

@@ -1,20 +1,22 @@
  ---
- licepredictor

  widget:
- prtext-classification T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
  example_title: "V3 Macrophage"
  - text: 'C T R P N N N T R K S I H I G P G R A F Y T T G Q I I G D I R Q A Y C'
  example_title: "V3 T-cell"

  datasets:
- - damlab/HIV_V3_bodysitepredictor:
- - accuractext-classificationorN N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C
  ---

- # HIV_V3_bodysite model

  ## Table of Contents
  - [Summary](#model-summary)
  - [Model Description](#model-description)
  - [Intended Uses & Limitations](#intended-uses-&-limitations)
@@ -28,7 +30,7 @@ datasets:

  ## Summary

- The [HIV_V3_bodysite model](https://huggingface.co/damlab/HIV_BERT) was trained as a refinement of the HIV-BERT model (insert link) and serves to better predict the location that an HIV V3 loop sample was derived from. HIV-BERT is a model refined from the [ProtBert-BFD model](https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV V3 sequences from the [Los Alamos HIV Sequence Database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html), allowing even more precise prediction of body site location than the HIV-BERT model can provide.

  ## Model Description

@@ -38,7 +40,7 @@ The HIV-BERT-Bodysite-Identification model is intended to predict the location a

  This tool can be used as a predictor of which body site an HIV sample was derived from based on its genomic sequence. It should not be considered a clinical diagnostic tool.

- This tool was trained using the [Los Alamos HIV sequence dataset](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). Due to the sampling nature of this database, it is predominantly composed of subtype B sequences from North America and Europe with only minor contributions of Subtype C, A, and D. Currently, there was no effort made to balance the performance across these classes. As such, one should consider refinement with additional sequences to perform well on non-B sequences.

  ## How to use

@@ -96,17 +98,17 @@ predictor(f"C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A

  ## Training Data

- This model was trained using the [damlab/HIV-V3-bodysite dataset](https://huggingface.co/datasets/damlab/HIV_V3_bodysite) using the 0th fold. The dataset consists of 5510 sequences (approximately 35 tokens each) extracted from the [Los Alamos HIV Sequence database](https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html).

  ## Training Procedure

  ### Preprocessing

- As with the [rostlab/Prot-bert-bfd model](https://huggingface.co/Rostlab/prot_bert_bfd), the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256 token chunks for training. A random 20% of chunks were held for validation.

  ### Training

- The [damlab/HIV-BERT model](https://huggingface.co/damlab/HIV_BERT) was used as the initial weights for an AutoModelforClassificiation. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule and continued until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multiple classification task (a protein can be found in multiple sites) the loss was calculated as the Binary Cross Entropy for each category. The BCE was weighted by the inverse of the class ratio to balance the weight across the class imbalance.

  ## Evaluation Results

 
After:

  ---
+ license: mit

  widget:
+ - text: 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
  example_title: "V3 Macrophage"
  - text: 'C T R P N N N T R K S I H I G P G R A F Y T T G Q I I G D I R Q A Y C'
  example_title: "V3 T-cell"

  datasets:
+ - damlab/HIV_V3_bodysite
+ metrics:
+ - accuracy
  ---

+ # Model Card for HIV_V3_bodysite

  ## Table of Contents
+ - [Table of Contents](#table-of-contents)
  - [Summary](#model-summary)
  - [Model Description](#model-description)
  - [Intended Uses & Limitations](#intended-uses-&-limitations)
 

  ## Summary

+ The HIV-BERT-Bodysite-Identification model was trained as a refinement of the HIV-BERT model (https://huggingface.co/damlab/HIV_BERT) and serves to better predict the body site from which an HIV V3 loop sample was derived. HIV-BERT is a model refined from the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) to better fulfill HIV-centric tasks. This model was then trained using HIV V3 sequences from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html), allowing even more precise prediction of body site location than the HIV-BERT model can provide.

  ## Model Description

 

  This tool can be used as a predictor of which body site an HIV sample was derived from based on its genomic sequence. It should not be considered a clinical diagnostic tool.

+ This tool was trained using the Los Alamos HIV sequence dataset (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). Due to the sampling nature of this database, it is predominantly composed of subtype B sequences from North America and Europe, with only minor contributions from subtypes A, C, and D. No effort has yet been made to balance performance across these classes, so the model should be refined with additional sequences before it can be expected to perform well on non-B sequences.

  ## How to use
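The usage code between these hunks is unchanged, so the diff elides it. For context, a minimal sketch of how the `predictor` referenced in the hunk header above is typically constructed (the `pipeline` task and `top_k=None` argument are assumptions, not shown in this commit):

```python
from transformers import pipeline

# Build a classifier from the released checkpoint; top_k=None asks the
# pipeline to return a score for every body-site label.
predictor = pipeline("text-classification", model="damlab/HIV_V3_bodysite", top_k=None)

# V3 sequences are written as space-separated amino acids, as in the widget examples.
predictor(f"C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C")
```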
 
 
  ## Training Data

+ This model was trained on the damlab/HIV_V3_bodysite dataset using the 0th fold. The dataset consists of 5510 sequences (approximately 35 tokens each) extracted from the Los Alamos HIV Sequence database.
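A sketch of loading that fold with the `datasets` library (hypothetical: the fold layout and column name are assumptions; check the dataset card for the real schema):

```python
from datasets import load_dataset

# Pull the body-site dataset from the Hugging Face Hub.
ds = load_dataset("damlab/HIV_V3_bodysite")

# The card refers to the "0th fold"; if folds are stored as a column
# (an assumption), holding out fold 0 might look like this:
holdout = ds["train"].filter(lambda row: row["fold"] == 0)
training = ds["train"].filter(lambda row: row["fold"] != 0)
```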

  ## Training Procedure

  ### Preprocessing

+ As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X, and spaces were added between each amino acid. All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.
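A minimal sketch of that preprocessing step in plain Python:

```python
import re

def preprocess(seq: str) -> str:
    """Map rare amino acids (U, Z, O, B) to X and space-separate residues,
    following the Prot-bert-bfd convention described above."""
    seq = re.sub(r"[UZOB]", "X", seq.replace(" ", "").upper())
    return " ".join(seq)

preprocess("CTRPNNNTRKSIRIQRGPGRAFVTIGKIGNMRQAHC")
# -> 'C T R P N N N T R K S I R I Q R G P G R A F V T I G K I G N M R Q A H C'
```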

  ### Training

+ The damlab/HIV-BERT model was used as the initial weights for an AutoModelForSequenceClassification. The model was trained with a learning rate of 1E-5, 50K warm-up steps, and a cosine_with_restarts learning-rate schedule, continuing until 3 consecutive epochs did not improve the loss on the held-out dataset. As this is a multi-label classification task (a protein can be found in multiple sites), the loss was calculated as the binary cross-entropy for each category, weighted by the inverse of the class frequency to balance training across the class imbalance.
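A sketch of that class-weighted multi-label setup in PyTorch (the weighting formula and the three-site toy labels are illustrative assumptions, not the released training code):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Toy multi-hot labels: rows are sequences, columns are body sites.
labels = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 1., 0.],
                       [1., 0., 0.]])

# Inverse class frequency, so rare sites carry more weight in the loss.
weights = 1.0 / labels.mean(dim=0)

# Start from the HIV-BERT weights with a fresh multi-label head.
model = AutoModelForSequenceClassification.from_pretrained(
    "damlab/HIV_BERT",
    num_labels=labels.shape[1],
    problem_type="multi_label_classification",
)

# Per-class binary cross-entropy with the inverse-frequency weighting.
loss_fn = torch.nn.BCEWithLogitsLoss(weight=weights)
```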

  ## Evaluation Results