File size: 3,687 Bytes
8578a56 b9156a2 5ccf7b4 4735494 08c9b7f 5ccf7b4 80fd97e 147f04a 50d82e4 8578a56 50d82e4 5fd475f 376dd88 9db31c7 35b8e0c da55e2f 50d82e4 2e1ec4c 50d82e4 da55e2f 376dd88 b9156a2 31b8743 c47aef6 31b8743 c47aef6 31b8743 c47aef6 376dd88 b9156a2 c47aef6 0a037e1 da55e2f 6c2f8ea c57bda0 b9156a2 0a037e1 f1dbbc7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 |
---
datasets:
- ElKulako/stocktwits-crypto
language:
- en
tags:
- cryptocurrency
- crypto
- BERT
- sentiment classification
- NLP
- bitcoin
- ethereum
- shib
- social media
- sentiment analysis
- cryptocurrency sentiment analysis
---
For academic reference, cite the following paper: https://ieeexplore.ieee.org/document/10223689
# CryptoBERT
CryptoBERT is a pre-trained NLP model to analyse the language and sentiments of cryptocurrency-related social media posts and messages. It was built by further training the [vinai's bertweet-base](https://huggingface.co/vinai/bertweet-base) language model on the cryptocurrency domain, using a corpus of over 3.2M unique cryptocurrency-related social media posts.
(A research paper with more details will follow soon.)
## Classification Training
The model was trained on the following labels: "Bearish" : 0, "Neutral": 1, "Bullish": 2
CryptoBERT's sentiment classification head was fine-tuned on a balanced dataset of 2M labelled StockTwits posts, sampled from [ElKulako/stocktwits-crypto](https://huggingface.co/datasets/ElKulako/stocktwits-crypto).
CryptoBERT was trained with a max sequence length of 128. Technically, it can handle sequences of up to 514 tokens, however, going beyond 128 is not recommended.
# Classification Example
```python
from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer
model_name = "ElKulako/cryptobert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 3)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=64, truncation=True, padding = 'max_length')
# post_1 & post_3 = bullish, post_2 = bearish
post_1 = " see y'all tomorrow and can't wait to see ada in the morning, i wonder what price it is going to be at. 😎🐂🤠💯😴, bitcoin is looking good go for it and flash by that 45k. "
post_2 = " alright racers, it’s a race to the bottom! good luck today and remember there are no losers (minus those who invested in currency nobody really uses) take your marks... are you ready? go!!"
post_3 = " i'm never selling. the whole market can bottom out. i'll continue to hold this dumpster fire until the day i die if i need to."
df_posts = [post_1, post_2, post_3]
preds = pipe(df_posts)
print(preds)
```
```
[{'label': 'Bullish', 'score': 0.8734585642814636}, {'label': 'Bearish', 'score': 0.9889495372772217}, {'label': 'Bullish', 'score': 0.6595883965492249}]
```
## Training Corpus
CryptoBERT was trained on 3.2M social media posts regarding various cryptocurrencies. Only non-duplicate posts of length above 4 words were considered. The following communities were used as sources for our corpora:
(1) StockTwits - 1.875M posts about the top 100 cryptos by trading volume. Posts were collected from the 1st of November 2021 to the 16th of June 2022. [ElKulako/stocktwits-crypto](https://huggingface.co/datasets/ElKulako/stocktwits-crypto)
(2) Telegram - 664K posts from top 5 telegram groups: [Binance](https://t.me/binanceexchange), [Bittrex](https://t.me/BittrexGlobalEnglish), [huobi global](https://t.me/huobiglobalofficial), [Kucoin](https://t.me/Kucoin_Exchange), [OKEx](https://t.me/OKExOfficial_English).
Data from 16.11.2020 to 30.01.2021. Courtesy of [Anton](https://www.kaggle.com/datasets/aagghh/crypto-telegram-groups).
(3) Reddit - 172K comments from various crypto investing threads, collected from May 2021 to May 2022
(4) Twitter - 496K posts with hashtags XBT, Bitcoin or BTC. Collected for May 2018. Courtesy of [Paul](https://www.kaggle.com/datasets/paul92s/bitcoin-tweets-14m). |