danf0 committed on
Commit b2dcfc6 · 1 Parent(s): f41a5ce

Update docs.

Files changed (2)
  1. README.md +125 -17
  2. vendiscore.py +6 -4
README.md CHANGED
@@ -5,7 +5,7 @@ datasets:
  tags:
  - evaluate
  - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
@@ -14,37 +14,145 @@ pinned: false
  # Metric Card for VendiScore

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*

  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

- ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

- ## Further References
- *Add any useful further references.*
 
  tags:
  - evaluate
  - metric
+ description: "The Vendi Score is a metric for evaluating diversity in machine learning. See the project's README at https://github.com/vertaix/Vendi-Score for more information."
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
 
  # Metric Card for VendiScore

+ The Vendi Score (VS) is a metric for evaluating diversity in machine learning.
+ The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
+ See the project's README at https://github.com/vertaix/Vendi-Score for more information.
 
  ## Metric Description
+ The Vendi Score (VS) is a metric for evaluating diversity in machine learning.
+ The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
+ Specifically, given a positive semi-definite matrix $K \in \mathbb{R}^{n \times n}$ of similarity scores, the score is defined as:
+ $$\mathrm{VS}(K) = \exp\left(-\mathrm{tr}\left((K/n) \log (K/n)\right)\right) = \exp\left(-\sum_{i=1}^n \lambda_i \log \lambda_i\right),$$
+ where $\lambda_i$ are the eigenvalues of $K/n$ and $0 \log 0 = 0$.
+ That is, the Vendi Score is equal to the exponential of the von Neumann entropy of $K/n$, or the Shannon entropy of the eigenvalues, which is also known as the effective rank.
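For concreteness, here is a minimal NumPy sketch of the formula above (an illustration only, not the package implementation; the helper name `vendi_score_from_K` is made up here):

```python
import numpy as np

def vendi_score_from_K(K):
    # Vendi Score of a symmetric PSD similarity matrix K, following the definition above.
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)   # eigenvalues of K/n
    lam = lam[lam > 1e-12]            # convention: 0 log 0 = 0
    return np.exp(-np.sum(lam * np.log(lam)))

# Three items that are similar only to themselves -> effective number of 3.
print(vendi_score_from_K(np.eye(3)))  # 3.0
```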

  ## How to Use
+ The Vendi Score is available as a standalone Python package and as a HuggingFace `evaluate` module.
+ To use the Python package, see the instructions at https://github.com/vertaix/Vendi-Score.
+ To use the `evaluate` module, pass a list of samples along with either a similarity function or a string identifying a predefined class of similarity functions (see below).

+ ```python
+ >>> import evaluate
+ >>> vendiscore = evaluate.load("danf0/vendiscore")
+ >>> samples = ["Look, Jane.",
+                "See Spot.",
+                "See Spot run.",
+                "Run, Spot, run.",
+                "Jane sees Spot run."]
+ >>> results = vendiscore.compute(samples, k="ngram_overlap", ns=[1, 2])
+ >>> print(results)
+ {'VS': 3.90657...}
+ ```

  ### Inputs
+ - **samples**: an iterable containing $n$ samples to score, an n x n similarity matrix K, or an n x d feature matrix X.
+ - **k**: a pairwise similarity function, or a string identifying a predefined similarity function. If k is a pairwise similarity function, it should be symmetric and satisfy k(x, x) = 1. Options: ngram_overlap, text_embeddings, pixels, image_embeddings.
+ - **score_K**: if true, samples is an n x n similarity matrix K.
+ - **score_X**: if true, samples is an n x d feature matrix X (see the sketch after this list).
+ - **score_dual**: if true, samples is an n x d feature matrix X and the score is computed from the d x d covariance matrix X.T @ X.
+ - **normalize**: if true, normalize the similarity scores.
+ - **model (optional)**: if k is "text_embeddings", a model mapping sentences to embeddings (the output should be an object with an attribute called `pooler_output` or `last_hidden_state`). If k is "image_embeddings", a model mapping images to embeddings.
+ - **tokenizer (optional)**: if k is "text_embeddings" or "ngram_overlap", a tokenizer mapping strings to lists.
+ - **transform (optional)**: if k is "image_embeddings", a torchvision transform to apply to the samples.
+ - **model_path (optional)**: if k is "text_embeddings", the name of a model on the HuggingFace hub.
+ - **ns (optional)**: if k is "ngram_overlap", the values of n to calculate.
+ - **batch_size (optional)**: batch size to use if k is "text_embeddings" or "image_embeddings".
+ - **device (optional)**: a string (e.g. "cuda", "cpu") or torch.device identifying the device to use if k is "text_embeddings" or "image_embeddings".
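For a quick illustration of the matrix-style inputs, here is a short sketch that scores an n x d feature matrix directly with `score_X` (the feature matrix `X` below is made up for illustration):

```python
import evaluate
import numpy as np

vendiscore = evaluate.load("danf0/vendiscore")

# A toy feature matrix: 4 samples, 2 features; rows are already unit-normalized.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# With score_X=True, similarities are taken to be dot products between the rows of X.
print(vendiscore.compute(X, score_X=True))  # should print {'VS': 2.0}: two effectively distinct rows
```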

  ### Output Values

+ The output is a dictionary with one key, "VS".
+ Given n samples, the value of the Vendi Score ranges between 1 and n, with higher numbers indicating that the sample is more diverse.
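To make the range concrete, here is a small hypothetical check of the two extremes, using an illustrative exact-match similarity function (not one of the predefined options):

```python
import evaluate

vendiscore = evaluate.load("danf0/vendiscore")
k = lambda a, b: float(a == b)  # illustrative 0/1 similarity

# n identical samples -> the minimum score of 1.
print(vendiscore.compute([7, 7, 7, 7], k))  # expected: {'VS': 1.0}

# n distinct samples -> the maximum score of n (here, 4).
print(vendiscore.compute([1, 2, 3, 4], k))  # expected: {'VS': 4.0}
```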
+
+ ### Examples

+ ```python
+ import evaluate
+ import numpy as np
+
+ vendiscore = evaluate.load("danf0/vendiscore")
+
+ samples = [0, 0, 10, 10, 20, 20]
+ k = lambda a, b: np.exp(-np.abs(a - b))
+
+ vendiscore.compute(samples, k)
+
+ # 2.9999
+ ```
+
+ If you have already precomputed a similarity matrix:
+ ```python
+ K = np.array([[1.0, 0.9, 0.0],
+               [0.9, 1.0, 0.0],
+               [0.0, 0.0, 1.0]])
+ vendiscore.compute(K, score_K=True)
+
+ # 2.1573
+ ```
+
+ If your similarity function is a dot product between normalized embeddings $X \in \mathbb{R}^{n \times d}$, and $d < n$, it is faster to compute the Vendi Score using the covariance matrix, $\frac{1}{n} \sum_i x_i x_i^{\top}$:
+ ```python
+ vendiscore.compute(X, score_dual=True)
+ ```
+ If the rows of $X$ are not normalized, set `normalize = True`.
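As a sanity check (on random, illustrative data), the dual computation should agree with scoring the full n x n similarity matrix directly:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize the rows

direct = vendiscore.compute(X @ X.T, score_K=True)  # n x n similarity matrix
dual = vendiscore.compute(X, score_dual=True)       # d x d covariance; cheaper when d < n
# The two results should agree up to numerical precision.
```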
+
+ Images:
+ ```python
+ from torchvision import datasets
+
+ mnist = datasets.MNIST("data/mnist", train=False, download=True)
+ digits = [[x for x, y in mnist if y == c] for c in range(10)]
+ pixel_vs = [vendiscore.compute(imgs, k="pixels")["VS"] for imgs in digits]
+ # The default embeddings are from the pool-2048 layer of the torchvision
+ # Inception v3 model.
+ inception_vs = [vendiscore.compute(imgs, k="image_embeddings", batch_size=64, device="cuda")["VS"] for imgs in digits]
+ for y, (pvs, ivs) in enumerate(zip(pixel_vs, inception_vs)):
+     print(f"{y}\t{pvs:.02f}\t{ivs:.02f}")
+
+ # Output:
+ # 0   7.68  3.45
+ # 1   5.31  3.50
+ # 2  12.18  3.62
+ # 3   9.97  2.97
+ # 4  11.10  3.75
+ # 5  13.51  3.16
+ # 6   9.06  3.63
+ # 7   9.58  4.07
+ # 8   9.69  3.74
+ # 9   8.56  3.43
+ ```
+
+ Text:
+ ```python
+ sents = ["Look, Jane.",
+          "See Spot.",
+          "See Spot run.",
+          "Run, Spot, run.",
+          "Jane sees Spot run."]
+ ngram_vs = vendiscore.compute(sents, k="ngram_overlap", ns=[1, 2])["VS"]
+ bert_vs = vendiscore.compute(sents, k="text_embeddings", model_path="bert-base-uncased")["VS"]
+ simcse_vs = vendiscore.compute(sents, k="text_embeddings", model_path="princeton-nlp/unsup-simcse-bert-base-uncased")["VS"]
+ print(f"N-grams: {ngram_vs:.02f}, BERT: {bert_vs:.02f}, SimCSE: {simcse_vs:.02f}")
+
+ # N-grams: 3.91, BERT: 1.21, SimCSE: 2.81
+ ```
 
  ## Limitations and Bias
+ The Vendi Score depends on the choice of similarity function. Care should be taken to select a similarity function that reflects the features that are relevant for defining diversity in a given application.

  ## Citation
vendiscore.py CHANGED
@@ -22,15 +22,17 @@ from vendi_score import vendi, image_utils, text_utils
  # TODO: Add BibTeX citation
  _CITATION = ""
  _DESCRIPTION = """\
- A diversity evaluation metric for machine learning.
  """


  _KWARGS_DESCRIPTION = """
  Calculates the Vendi Score given samples and a similarity function.
  Args:
-     samples: list of n sentences to score, an n x n similarity matrix K, or
-         an n x d feature matrix X.
      k: a pairwise similarity function, or a string identifying a predefined
          similarity function.
          Options: ngram_overlap, text_embeddings, pixels, image_embeddings.
@@ -56,7 +58,7 @@ Args:
  Returns:
      VS: The Vendi Score.
  Examples:
-     >>> vendi_score = evaluate.load("vendi_score")
      >>> samples = ["Look, Jane.",
                     "See Spot.",
                     "See Spot run.",
 
  # TODO: Add BibTeX citation
  _CITATION = ""
  _DESCRIPTION = """\
+ The Vendi Score is a metric for evaluating diversity in machine learning.
+ The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
+ See the project's README at https://github.com/vertaix/Vendi-Score for more information.
  """


  _KWARGS_DESCRIPTION = """
  Calculates the Vendi Score given samples and a similarity function.
  Args:
+     samples: an iterable containing n samples to score, an n x n similarity
+         matrix K, or an n x d feature matrix X.
      k: a pairwise similarity function, or a string identifying a predefined
          similarity function.
          Options: ngram_overlap, text_embeddings, pixels, image_embeddings.
 
  Returns:
      VS: The Vendi Score.
  Examples:
+     >>> vendi_score = evaluate.load("danf0/vendiscore")
      >>> samples = ["Look, Jane.",
                     "See Spot.",
                     "See Spot run.",