---
license: apache-2.0
datasets:
- financial_phrasebank
- pauri32/fiqa-2018
- zeroshot/twitter-financial-news-sentiment
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- finance
---

We collect financial domain terms from Investopedia's Financial Terms Dictionary, NYSSCPA's Accounting Terminology Guide, and Harvey's Hypertextual Finance Glossary to expand RoBERTa's vocabulary. Starting from this vocabulary-expanded RoBERTa, we pretrained our model on multiple financial corpora:

- Financial Terms
  - [Investopedia's Financial Terms Dictionary](https://www.investopedia.com/financial-term-dictionary-4769738)
  - [NYSSCPA's Accounting Terminology Guide](https://www.nysscpa.org/professional-resources/accounting-terminology-guide)
  - [Harvey's Hypertextual Finance Glossary](https://people.duke.edu/~charvey/Classes/wpg/glossary.htm)
- Financial Datasets
  - [FPB](https://huggingface.co/datasets/financial_phrasebank)
  - [FiQA SA](https://huggingface.co/datasets/pauri32/fiqa-2018)
  - [SemEval-2017 Task 5](https://aclanthology.org/S17-2089/)
  - [Twitter Financial News Sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)
- Earnings Calls
  - Earnings call transcripts of NASDAQ-100 component stocks, 2016-2023.

In the continual pretraining step, we applied the following settings, which yielded better fine-tuned results on the four financial datasets:

1. Masking probability: 0.4 (instead of the default 0.15)
2. Warmup steps: 0 (performed better than runs with warmup)
3. Epochs: 1 (sufficient, and helps avoid overfitting)
4. Weight decay: 0.01
5. Train batch size: 64
6. FP16 mixed precision
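The vocabulary expansion and the continual-pretraining settings above can be sketched with the Hugging Face `transformers` API. This is an illustrative configuration, not the exact training script: the term list is a placeholder for the three glossaries, the base checkpoint is assumed to be `roberta-base`, and dataset loading is elided.

```python
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder terms: in practice these come from Investopedia, NYSSCPA,
# and Harvey's glossary rather than a hard-coded list.
financial_terms = ["EBITDA", "amortization", "forex"]

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Expand the vocabulary, then resize the embedding matrix to match.
tokenizer.add_tokens(financial_terms)
model.resize_token_embeddings(len(tokenizer))

# Mask 40% of tokens instead of the default 15%.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.4
)

args = TrainingArguments(
    output_dir="finroberta-pretrain",   # assumed output path
    num_train_epochs=1,                 # one epoch, guarding against overfitting
    warmup_steps=0,                     # no warmup performed better here
    weight_decay=0.01,
    per_device_train_batch_size=64,
    fp16=True,
)

# `train_dataset` stands in for the tokenized financial corpora.
# trainer = Trainer(model=model, args=args,
#                   data_collator=collator, train_dataset=train_dataset)
# trainer.train()
```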
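The effect of the higher masking probability can be seen in a minimal, dependency-free sketch of BERT-style dynamic masking (the 80/10/10 corruption split is the standard MLM recipe; the function and variable names here are illustrative, not taken from this model's training code):

```python
import random

MASK_PROB = 0.4  # this card's masking probability; RoBERTa's default is 0.15


def mask_tokens(token_ids, vocab_size, mask_id, prob=MASK_PROB, rng=None):
    """BERT-style masking: each selected position becomes the mask token
    80% of the time, a random token 10%, and stays unchanged 10%.
    Returns (corrupted_ids, labels), with -100 at unselected positions."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tid in token_ids:
        if rng.random() < prob:
            labels.append(tid)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_id)
            elif r < 0.9:
                corrupted.append(rng.randrange(vocab_size))
            else:
                corrupted.append(tid)
        else:
            labels.append(-100)  # ignored by the cross-entropy loss
            corrupted.append(tid)
    return corrupted, labels
```

With `prob=0.4`, roughly 40% of positions per batch carry a prediction target, giving the model denser supervision per sequence than the 15% default.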