simonschoe committed
Commit 1a7b275
1 Parent(s): 2d84e3f

update readme

Files changed (1)
  1. README.md +84 -27
README.md CHANGED
@@ -21,40 +21,81 @@ widget:
 ---
 
 # EarningsCall2Vec
- This is a [fastText](https://fasttext.cc/) model trained via [`Gensim`](https://radimrehurek.com/gensim/): It maps each token in the vocabulary (i.e., unigram and frequently coocurring bi-, tri-, and fourgrams) to a dense, 300-dimensional vector space, designed for performing **semantic search**. It has been trained on corpus of ~160k earning call transcripts, in particular the executive remarks within the Q&A-section of these transcripts (13m sentences).
 
 ## Usage (API)
- ```
- pip install -U xxx
- ```
- Then you can use the model like this:
 ```python
- py code
 ```
 
 ## Usage (Gensim)
- ```
- pip install -U xxx
- ```
- Then you can use the model like this:
 ```python
- py code
- ```
 
- ## Background
 
- Context on the project.
 
- ## Intended Uses
 
- Our model is intented to be used for semantic search on a token-level: It encodes search-queries (i.e., token) in a dense vector space and finds semantic neighbours, i.e., token which frequently occur within similar contexts in the underlying training data. Note that this search is only feasible for individual token and may produce deficient results in the case of out-of-vocabulary token.
 
- ## Training procedure
 
 ```python
- logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
 # init
 model = FastText(
@@ -72,20 +113,36 @@ model = FastText(
 )
 
 # build vocab
- model.build_vocab(corpus_iterable=LineSentence(<PATH_TRAIN_DATA>))
 
 # train model
 model.train(
-    corpus_iterable=LineSentence(<PATH_TRAIN_DATA>),
     total_words=model.corpus_total_words,
     total_examples=model.corpus_count,
-    epochs=50,
 )
-
- # save to binary format
- save_facebook_model(<PATH_MOD_SAVE>)
 ```
 
- ## Training Data
-
- description

 ---
 
 # EarningsCall2Vec
+
+ **EarningsCall2Vec** is a [fastText](https://fasttext.cc/) word embedding model trained via [Gensim](https://radimrehurek.com/gensim/). It maps each token in the vocabulary to a dense, 300-dimensional vector space, designed for performing **semantic search**. More details about the training procedure can be found in the [Model Training](#model-training) section.
+
+
+ ## Background
+
+ Context on the project.
+
+
+ ## Usage
+
+ The model is intended to be used for semantic search: it encodes the search query in a dense vector space and finds semantic neighbours, i.e., tokens that frequently occur within similar contexts in the underlying training data. The query should consist of a single word. When provided with a bi-, tri-, or even fourgram, the quality of the model output depends on the presence of the query tokens in the model's vocabulary. Multi-word queries are concatenated by an underscore (e.g., "machine_learning" or "artificial_intelligence"), as illustrated below.
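+
+ As a minimal sketch, assuming the model has already been loaded via Gensim as shown in the [Usage (Gensim)](#usage-gensim) section below, a bigram query is formed by joining both words with an underscore:
+ ```python
+ # hypothetical query: results are only reliable if 'machine_learning'
+ # is present in the model's vocabulary
+ model.wv.most_similar(positive='machine_learning', topn=5)
+ ```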
+
 
 ## Usage (API)
+
 ```python
+ import json
+ import requests
+
+ API_TOKEN = <TOKEN>
+
+ headers = {"Authorization": f"Bearer {API_TOKEN}"}
+ API_URL = "https://api-inference.huggingface.co/models/simonschoe/call2vec"
+
+ def query(payload):
+     data = json.dumps(payload)
+     response = requests.request("POST", API_URL, headers=headers, data=data)
+     return json.loads(response.content.decode("utf-8"))
+
+ query({"inputs": "<insert-query-here>"})
 ```
 
 ## Usage (Gensim)
+
 ```python
+ from huggingface_hub import hf_hub_url, cached_download
+ from gensim.models.fasttext import load_facebook_model
+
+ # download model from huggingface hub
+ url = hf_hub_url(repo_id="simonschoe/call2vec", filename="model.bin")
+ path = cached_download(url)
+
+ # load model via gensim
+ model = load_facebook_model(path)
+
+ # extract word embeddings
+ model.wv['transformation']
+
+ # get similar phrases
+ model.wv.most_similar(positive='transformation', topn=5)
+
+ # get dissimilar phrases
+ model.wv.most_similar(negative='transformation', topn=5, restrict_vocab=None)
+
+ # compute pairwise similarity scores (distance = 1 - similarity)
+ model.wv.similarity('transformation', 'continuity')
+ ```
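+
+ Note that `most_similar` returns a list of `(token, cosine similarity)` tuples, whereas `similarity` returns a single score.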
 
+ ## Model Training
+
+ The model has been trained on text data stemming from earnings call transcripts. The data is restricted to a call's question-and-answer (Q&A) section and the remarks by firm executives. Prior to model training, the data has been preprocessed via stop word removal, lemmatization, named entity masking, and co-occurrence modeling (a hypothetical sketch of such a pipeline is shown below).
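+
+ The preprocessing code itself is not part of this model card; the following is a minimal, hypothetical sketch of such a pipeline, assuming spaCy (`en_core_web_sm`) for stop word removal, lemmatization, and named entity masking, and Gensim's `Phrases` for co-occurrence modeling:
+ ```python
+ import spacy
+ from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
+
+ nlp = spacy.load("en_core_web_sm")  # assumed spaCy pipeline
+
+ def preprocess(remark):
+     """Stop word removal, lemmatization, and named entity masking for one remark."""
+     tokens = []
+     for tok in nlp(remark):
+         if tok.is_stop or tok.is_punct:
+             continue
+         # mask named entities with their entity label, e.g. '[ORG]'
+         tokens.append(f"[{tok.ent_type_}]" if tok.ent_type_ else tok.lemma_.lower())
+     return tokens
+
+ remarks = ["We invest heavily in machine learning.", "Machine learning drives our growth."]
+ sentences = [preprocess(r) for r in remarks]
+
+ # co-occurrence modeling: frequently co-occurring tokens are merged into
+ # n-grams joined by '_' (e.g., 'machine_learning')
+ phrases = Phrases(sentences, min_count=1, threshold=0.1, connector_words=ENGLISH_CONNECTOR_WORDS)
+ sentences = [phrases[s] for s in sentences]
+ ```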
+
+ **Corpus statistics:**
+ - Total number of calls: 159,973
+ - Total number of remarks: 4,512,535
+ - Total number of tokens: 219,228,853
+
+ The following code snippet presents the training pipeline:
 ```python
+ import logging
+ from gensim.models import FastText
+ from gensim.models.word2vec import LineSentence
+ from gensim.models.callbacks import CallbackAny2Vec
+ from gensim.models.fasttext import save_facebook_model
 
 # init
 model = FastText(
 
 )
 
 # build vocab
+ model.build_vocab(corpus_iterable=LineSentence(<PATH-TRAIN-DATA>))
 
 # train model
+ logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
+
+ class MyCallback(CallbackAny2Vec):
+     def __init__(self):
+         self.epoch = 0
+
+     def on_epoch_end(self, model):
+         self.epoch += 1
+         if (self.epoch % 10) == 0:
+             # save in gensim format
+             model.save(<PATH-SAVE>)
+
+     def on_train_end(self, model):
+         # save in binary format for upload to huggingface
+         save_facebook_model(model, <PATH-SAVE>.bin)
+
 model.train(
+    corpus_iterable=LineSentence(<PATH-TRAIN-DATA>),
     total_words=model.corpus_total_words,
     total_examples=model.corpus_count,
+    epochs=<EPOCHS>,
+    callbacks=[MyCallback()],
 )
 ```
 
+ **Model statistics:**
+ - Vocabulary size: 64,891
+ - Min. token frequency: 10
+ - Embedding dimensions: 300
+ - Number of epochs: 50