# GNorm2 *** GNorm2 is a gene name recognition and normalization tool with optimized functions and customizable configuration to the user preferences. The GNorm2 integrates multiple deep learning-based methods and achieves state-of-the-art performance. GNorm2 is freely available to download for stand-alone usage. [Download GNorm2 here](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/GNorm2/GNorm2.tar.gz) ## Content - [Dependency package](#package) - [Introduction of folders](#intro) - [Running GNorm2](#pipeline) ## Dependency package The codes have been tested using Python3.8/3.9 on CentOS and uses the following main dependencies on a CPU and GPU: - [TensorFlow 2.3.0](https://www.tensorflow.org/) - [Transformer 4.18.0](https://huggingface.co/docs/transformers/installation) - [stanza 1.4.0](stanfordnlp.github.io/stanza/) To install all dependencies automatically using the command: $ pip install -r requirements.txt ## Introduction of folders - src_python - GeneNER: the codes for gene recognition - SpeAss: the codes for species assignment - src_Java - GNormPluslib : the codes for gene normalization and species recogntion - GeneNER_SpeAss_run.py: the script for runing pipeline - GNormPlus.jar: the upgraded GNormPlus tools for gene normalization - gnorm_trained_models:pre-trianed models and trained NER/SA models - bioformer-cased-v1.0: the original bioformer model - BiomedNLP-PubMedBERT-base-uncased-abstract: the original pubmedbert model - geneNER - GeneNER-Bioformer/PubmedBERT-Allset.h5: the Gene NER models trained by all datasets - GeneNER-Bioformer/PubmedBERT-Trainset.h5: the Gene NER models trained by the training set only - SpeAss - SpeAss-Bioformer/PubmedBERT-SG-Allset.h5: the Species Assignment models trained by all datasets - SpeAss-Bioformer/PubmedBERT-SG-Trainset.h5: the Species Assignment models trained by the trianing set only - stanza - downloaded stanza library for offline usage - vocab: label files for the machine learning models of GeneNER and SpeAss - Dictionary: The dictionary folder contains all required files for gene normalization - CRF: CRF++ library (called by GNormPlus.sh) - Library: Ab3P library - tmp/tmp_GNR/tmp_SA/tmp_SR folders: temp folder - input/output folders: input and output folders. BioC (abstract or full text) and PubTator (abstract only) formats are both avaliable. - GNorm2.sh: the script to run GNorm2 - setup.GN.txt/setup.SR.txt/setup.txt the setup files for GNorm2. ## Running GNorm2 Please firstly download [GNorm2](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/GNorm2/GNorm2.tar.gz) to your local. Below are the well-trained models (i.e., PubmedBERT/Bioformer) for Gene NER and Species Assignment. Models for Gene NER: - gnorm_trained_models/geneNER/GeneNER-PubmedBERT.h5 - gnorm_trained_models/geneNER/GeneNER-Bioformer.h5 Models for Species Assignment: - gnorm_trained_models/SpeAss/SpeAss-PubmedBERT.h5 - gnorm_trained_models/SpeAss/SpeAss-Bioformer.h5 The parameters of the input/output folders: - INPUT, default="input" - OUTPUT, default="output" [BioC-XML](bioc.sourceforge.net) or [PubTator](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/Format.html) formats are both avaliabel to GNorm2. 1. Run GNorm2 Run Example: $ ./GNorm2.sh input output ## Acknowledgments This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health.