Goran Glavaš committed
Commit 41fc5cc · 1 parent: 80dbedc
Deleted embeddings.txt due to repo size limitation. Updated the README file.
Files changed:
- README.txt +5 -5
- source/res/embeddings.txt +0 -3
README.txt
CHANGED
@@ -31,13 +31,13 @@ Example command:
 
 java -jar graphseg.jar /home/seg-input /home/seg-output 0.25 3
 
-The tool's correct execution depends on the resources in the /source/res directory.
+The tool's correct execution depends on the resources in the /source/res directory. There are three files that need to be there:
 
-(1) embeddings.txt -- the word embeddings used for measuring semantic similarity between sentences. The default file used are 200-dimensional GloVe embeddings obtained on Wikipedia 2014 + Giga 5 corpus (http://nlp.stanford.edu/data/glove.6B.zip).
-(2) stopwords.txt -- the list of English stopwords (excluded from sentences when measuring semantic similarity)
-(3) freqs.txt -- frequencies of English words on a large corpus, needed for the IC-weighting of word contribution
+(1) embeddings.txt -- the word embeddings used for measuring semantic similarity between sentences. The default file used are 200-dimensional GloVe embeddings obtained on Wikipedia 2014 + Giga 5 corpus (http://nlp.stanford.edu/data/glove.6B.zip). This file is bundled into the standalone binary file graphseg.jar, but is omitted from the source/res folder due to space constraints of the repository;
+(2) stopwords.txt -- the list of English stopwords (excluded from sentences when measuring semantic similarity);
+(3) freqs.txt -- frequencies of English words on a large corpus, needed for the IC-weighting of word contribution.
 
-You may choose to replace these default files (e.g., by using different embeddings or different stopword list), but make sure you name the new files exactly the same (i.e., embeddings.txt, stopwords.txt, and freqs.txt, respectively).
+The last two files (stopwords.txt and freqs.txt) are provided in the res folder, whereas the embeddings.txt are bundled into the binary (/binary/graphseg.jar) but omitted from the /source/res folder due to repository size constraints. You may choose to replace these default files (e.g., by using different embeddings or different stopword list), but make sure you name the new files exactly the same (i.e., embeddings.txt, stopwords.txt, and freqs.txt, respectively).
 
 Credit
 ========
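
Because this commit removes embeddings.txt from source/res, anyone running from the source tree has to supply that file themselves. Below is a minimal sketch of a pre-flight check, assuming Python is available. The three file names come from the README; the helper name missing_resources and its directory argument are illustrative and are not part of GraphSeg.

```python
from pathlib import Path

# File names as listed in the README; replacements must keep exactly these names.
REQUIRED_RESOURCES = ("embeddings.txt", "stopwords.txt", "freqs.txt")


def missing_resources(res_dir):
    """Return the required resource file names that are absent from res_dir.

    Hypothetical helper for illustration only; GraphSeg does not ship it.
    """
    root = Path(res_dir)
    return [name for name in REQUIRED_RESOURCES if not (root / name).is_file()]
```

On a fresh checkout of this commit, missing_resources("source/res") would be expected to report only embeddings.txt, since the README states that stopwords.txt and freqs.txt remain in that folder.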
source/res/embeddings.txt
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:18870b0a7516e4a72b44d3c226c242d2d846008967d8ce40b94c723a94d1a32b
-size 693432828
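
The three deleted lines are a Git LFS pointer, not the embeddings themselves: the pointer records the spec version, the SHA-256 object ID, and the size in bytes of the real file. Below is a small sketch, assuming Python, that parses the pointer text into a dictionary and converts the size to MiB. The name parse_lfs_pointer is a hypothetical helper, not a Git LFS API.

```python
# Pointer contents copied from the deleted source/res/embeddings.txt.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:18870b0a7516e4a72b44d3c226c242d2d846008967d8ce40b94c723a94d1a32b
size 693432828
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size"] / 2**20
print(f"{pointer['oid']} is {size_mib:.1f} MiB")
```

The size field works out to roughly 661 MiB (693,432,828 bytes), which is consistent with the commit message citing the repository size limitation.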