UNIST-Eunchan
committed on
Commit
•
7fa246a
1
Parent(s):
4787bea
Update README.md
README.md
CHANGED
@@ -4,6 +4,249 @@ tags:
|
|
4 |
- generated_from_trainer
|
5 |
datasets:
|
6 |
- arxiv-summarization
|
7 |
model-index:
|
8 |
- name: Long-paper-summarization-pegasus-x-b
|
9 |
results:
|
|
|
4 |
- generated_from_trainer
|
5 |
datasets:
|
6 |
- arxiv-summarization
|
7 |
+
|
8 |
+
widget:
|
9 |
+
- text: >-
|
10 |
+
|
11 |
+
[Abstract] The dominant sequence transduction models are based on complex
|
12 |
+
recurrent or convolutional neural networks in an encoder-decoder
|
13 |
+
configuration. The best performing models also connect the encoder and
|
14 |
+
decoder through an attention mechanism. We propose a new simple network
|
15 |
+
architecture, the Transformer, based solely on attention mechanisms,
|
16 |
+
dispensing with recurrence and convolutions entirely. Experiments on two
|
17 |
+
machine translation tasks show these models to be superior in quality while
|
18 |
+
being more parallelizable and requiring significantly less time to train.
|
19 |
+
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
|
20 |
+
task, improving over the existing best results, including ensembles by over
|
21 |
+
2 BLEU. On the WMT 2014 English-to-French translation task, our model
|
22 |
+
establishes a new single-model state-of-the-art BLEU score of 41.8 after
|
23 |
+
training for 3.5 days on eight GPUs, a small fraction of the training costs
|
24 |
+
of the best models from the literature. We show that the Transformer
|
25 |
+
generalizes well to other tasks by applying it successfully to English
|
26 |
+
constituency parsing both with large and limited training data.
|
27 |
+
[Introduction] Recurrent neural networks, long short-term memory [13] and
|
28 |
+
gated recurrent [7] neural networks in particular, have been firmly
|
29 |
+
established as state of the art approaches in sequence modeling and
|
30 |
+
transduction problems such as language modeling and machine translation [35,
|
31 |
+
2, 5]. Numerous efforts have since continued to push the boundaries of
|
32 |
+
recurrent language models and encoder-decoder architectures [38, 24, 15].
|
33 |
+
Recurrent models typically factor computation along the symbol positions of
|
34 |
+
the input and output sequences. Aligning the positions to steps in
|
35 |
+
computation time, they generate a sequence of hidden states ht, as a
|
36 |
+
function of the previous hidden state ht−1 and the input for position t.
|
37 |
+
This inherently sequential nature precludes parallelization within training
|
38 |
+
examples, which becomes critical at longer sequence lengths, as memory
|
39 |
+
constraints limit batching across examples. Recent work has achieved
|
40 |
+
significant improvements in computational efficiency through factorization
|
41 |
+
tricks [21] and conditional computation [32], while also improving model
|
42 |
+
performance in case of the latter. The fundamental constraint of sequential
|
43 |
+
computation, however, remains. Attention mechanisms have become an integral
|
44 |
+
part of compelling sequence modeling and transduction models in various
|
45 |
+
tasks, allowing modeling of dependencies without regard to their distance in
|
46 |
+
the input or output sequences [2, 19]. In all but a few cases [27], however,
|
47 |
+
such attention mechanisms are used in conjunction with a recurrent network.
|
48 |
+
In this work we propose the Transformer, a model architecture eschewing
|
49 |
+
recurrence and instead relying entirely on an attention mechanism to draw
|
50 |
+
global dependencies between input and output. The Transformer allows for
|
51 |
+
significantly more parallelization and can reach a new state of the art in
|
52 |
+
translation quality after being trained for as little as twelve hours on
|
53 |
+
eight P100 GPUs.
|
54 |
+
example_title: Attention Is All You Need
|
55 |
+
- text: >-
|
56 |
+
[Abstract] In this work, we explore prompt tuning, a simple yet effective
|
57 |
+
mechanism for learning soft prompts to condition frozen language models to
|
58 |
+
perform specific downstream tasks. Unlike the discrete text prompts used by
|
59 |
+
GPT-3, soft prompts are learned through backpropagation and can be tuned to
|
60 |
+
incorporate signal from any number of labeled examples. Our end-to-end
|
61 |
+
learned approach outperforms GPT-3's few-shot learning by a large margin.
|
62 |
+
More remarkably, through ablations on model size using T5, we show that
|
63 |
+
prompt tuning becomes more competitive with scale: as models exceed billions
|
64 |
+
of parameters, our method closes the gap and matches the strong performance
|
65 |
+
of model tuning (where all model weights are tuned). This finding is
|
66 |
+
especially relevant in that large models are costly to share and serve, and
|
67 |
+
the ability to reuse one frozen model for multiple downstream tasks can ease
|
68 |
+
this burden. Our method can be seen as a simplification of the recently
|
69 |
+
proposed prefix tuning of Li and Liang (2021), and we provide a comparison
|
70 |
+
to this and other similar approaches. Finally, we show that conditioning a
|
71 |
+
frozen model with soft prompts confers benefits in robustness to domain
|
72 |
+
transfer, as compared to full model tuning. [Introduction] With the wide
|
73 |
+
success of pre-trained large language models, a range of techniques has
|
74 |
+
arisen to adapt these general-purpose models to downstream tasks. ELMo
|
75 |
+
(Peters et al., 2018) proposed freezing the pre-trained model and learning a
|
76 |
+
task-specific weighting of its per-layer representations. However, since GPT
|
77 |
+
(Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
|
78 |
+
adaptation technique has been model tuning (or fine-tuning), where all model
|
79 |
+
parameters are tuned during adaptation, as proposed by Howard and Ruder
|
80 |
+
(2018). More recently, Brown et al. (2020) showed that prompt design (or
|
81 |
+
priming) is surprisingly effective at modulating a frozen GPT-3 model’s
|
82 |
+
behavior through text prompts. Prompts are typically composed of a task
|
83 |
+
description and/or several canonical examples. This return to freezing
|
84 |
+
pre-trained models is appealing, especially as model size continues to
|
85 |
+
increase. Rather than requiring a separate copy of the model for each
|
86 |
+
downstream task, a single generalist model can simultaneously serve many
|
87 |
+
different tasks. Unfortunately, prompt-based adaptation has several key
|
88 |
+
drawbacks. Task description is error-prone and requires human involvement,
|
89 |
+
and the effectiveness of a prompt is limited by how much conditioning text
|
90 |
+
can fit into the model’s input. As a result, downstream task quality still
|
91 |
+
lags far behind that of tuned models. For instance, GPT-3 175B few-shot
|
92 |
+
performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
|
93 |
+
al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
|
94 |
+
efforts to automate prompt design have been recently proposed. Shin et al.
|
95 |
+
(2020) propose a search algorithm over the discrete space of words, guided
|
96 |
+
by the downstream application training data. While this technique
|
97 |
+
outperforms manual prompt design, there is still a gap relative to model
|
98 |
+
tuning. Li and Liang (2021) propose prefix tuning and show strong results on
|
99 |
+
generative tasks. This method freezes the model parameters and
|
100 |
+
backpropagates the error during tuning to prefix activations prepended to
|
101 |
+
each layer in the encoder stack, including the input layer. Hambardzumyan et
|
102 |
+
al. (2021) simplify this recipe by restricting the trainable parameters to
|
103 |
+
the input and output subnetworks of a masked language model, and show
|
104 |
+
reasonable results on classification tasks. In this paper, we propose
|
105 |
+
prompt tuning as a further simplification for adapting language models. We
|
106 |
+
freeze the entire pre-trained model and only allow an additional k tunable
|
107 |
+
tokens per downstream task to be prepended to the input text. This soft
|
108 |
+
prompt is trained end-to-end and can condense the signal from a full labeled
|
109 |
+
dataset, allowing our method to outperform few-shot prompts and close the
|
110 |
+
quality gap with model tuning (Figure 1). At the same time, since a single
|
111 |
+
pre-trained model is recycled for all downstream tasks, we retain the
|
112 |
+
efficient serving benefits of frozen models (Figure 2). While we developed
|
113 |
+
our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
|
114 |
+
(2021), we are the first to show that prompt tuning alone (with no
|
115 |
+
intermediate-layer prefixes or task-specific output layers) is sufficient to
|
116 |
+
be competitive with model tuning. Through detailed experiments in sections
|
117 |
+
2–3, we demonstrate that language model capacity is a key ingredient for
|
118 |
+
these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
|
119 |
+
competitive with scale. We compare with similar approaches in Section 4.
|
120 |
+
Explicitly separating task-specific parameters from the generalist
|
121 |
+
parameters needed for general language-understanding has a range of
|
122 |
+
additional benefits. We show in Section 5 that by capturing the task
|
123 |
+
definition in the prompt while keeping the generalist parameters fixed, we
|
124 |
+
are able to achieve better resilience to domain shifts. In Section 6, we
|
125 |
+
show that prompt ensembling, learning multiple prompts for the same task,
|
126 |
+
can boost quality and is more efficient than classic model ensembling.
|
127 |
+
Finally, in Section 7, we investigate the interpretability of our learned
|
128 |
+
soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
|
129 |
+
and showing its competitiveness with model tuning in the regime of large
|
130 |
+
language models. 2. Ablating many design choices, and showing quality and
|
131 |
+
robustness improve with scale. 3. Showing prompt tuning outperforms model
|
132 |
+
tuning on domain shift problems. 4. Proposing prompt ensembling and showing
|
133 |
+
its effectiveness.
|
134 |
+
example_title: PEFT (2104.08691)
|
135 |
+
- text: >-
|
136 |
+
[Abstract] For the first time in the world, we succeeded in synthesizing the
|
137 |
+
room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
|
138 |
+
pressure with a modified lead-apatite (LK-99) structure. The
|
139 |
+
superconductivity of LK-99 is proved with the Critical temperature (Tc),
|
140 |
+
Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
|
141 |
+
the Meissner effect. The superconductivity of LK-99 originates from minute
|
142 |
+
structural distortion by a slight volume shrinkage (0.48 %), not by external
|
143 |
+
factors such as temperature and pressure. The shrinkage is caused by Cu2+
|
144 |
+
substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
|
145 |
+
and it generates the stress. It concurrently transfers to Pb(1) of the
|
146 |
+
cylindrical column resulting in distortion of the cylindrical column
|
147 |
+
interface, which creates superconducting quantum wells (SQWs) in the
|
148 |
+
interface. The heat capacity results indicated that the new model is
|
149 |
+
suitable for explaining the superconductivity of LK-99. The unique structure
|
150 |
+
of LK-99 that allows the minute distorted structure to be maintained in the
|
151 |
+
interfaces is the most important factor that LK-99 maintains and exhibits
|
152 |
+
superconductivity at room temperatures and ambient pressure. [Introduction]
|
153 |
+
Since the discovery of the first superconductor(1), many efforts to search
|
154 |
+
for new room-temperature superconductors have been carried out worldwide(2,
|
155 |
+
3) through their experimental clarity or/and theoretical perspectives(4-8).
|
156 |
+
The recent success of developing room-temperature superconductors with
|
157 |
+
hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
|
158 |
+
worldwide, which is expected by strong electron-phonon coupling theory with
|
159 |
+
high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
|
160 |
+
apply them to actual application devices in daily life because of the
|
161 |
+
tremendously high pressure, and more efforts are being made to overcome the
|
162 |
+
high-pressure problem(13). For the first time in the world, we report the
|
163 |
+
success in synthesizing a room-temperature and ambient-pressure
|
164 |
+
superconductor with a chemical approach to solve the temperature and
|
165 |
+
pressure problem. We named the first room temperature and ambient pressure
|
166 |
+
superconductor LK-99. The superconductivity of LK-99 proved with the
|
167 |
+
Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
|
168 |
+
magnetic field (Hc), and Meissner effect(14, 15). Several data were
|
169 |
+
collected and analyzed in detail to figure out the puzzle of
|
170 |
+
superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
|
171 |
+
spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
|
172 |
+
Capacity, and Superconducting quantum interference device (SQUID) data.
|
173 |
+
Henceforth in this paper, we will report and discuss our new findings
|
174 |
+
including superconducting quantum wells associated with the
|
175 |
+
superconductivity of LK-99.
|
176 |
+
example_title: LK-99 (Not NLP)
|
177 |
+
- text: >-
|
178 |
+
[Abstract] Evaluation practices in natural language generation
|
179 |
+
(NLG) have many known flaws, but improved evaluation approaches are rarely
|
180 |
+
widely adopted. This issue has become more urgent, since neural NLG models
|
181 |
+
have improved to the point where they can often no longer be distinguished
|
182 |
+
based on the surface-level features that older metrics rely on. This paper
|
183 |
+
surveys the issues with human and automatic model evaluations and with
|
184 |
+
commonly used datasets in NLG that have been pointed out over the past 20
|
185 |
+
years. We summarize, categorize, and discuss how researchers have been
|
186 |
+
addressing these issues and what their findings mean for the current state
|
187 |
+
of model evaluations. Building on those insights, we lay out a long-term
|
188 |
+
vision for NLG evaluation and propose concrete steps for researchers to
|
189 |
+
improve their evaluation processes. Finally, we analyze 66 NLG papers from
|
190 |
+
recent NLP conferences in how well they already follow these suggestions and
|
191 |
+
identify which areas require more drastic changes to the status quo.
|
192 |
+
[Introduction] There are many issues with the evaluation of models that
|
193 |
+
generate natural language. For example, datasets are often constructed in a
|
194 |
+
way that prevents measuring tail effects of robustness, and they almost
|
195 |
+
exclusively cover English. Most automated metrics measure only similarity
|
196 |
+
between model output and references instead of fine-grained quality aspects
|
197 |
+
(and even that poorly). Human evaluations have a high variance and, due to
|
198 |
+
insufficient documentation, rarely produce replicable results. These issues
|
199 |
+
have become more urgent as the nature of models that generate language has
|
200 |
+
changed without significant changes to how they are being evaluated. While
|
201 |
+
evaluation methods can capture surface-level improvements in text generated
|
202 |
+
by state-of-the-art models (such as increased fluency) to some extent, they
|
203 |
+
are ill-suited to detect issues with the content of model outputs, for
|
204 |
+
example if they are not attributable to input information. These ineffective
|
205 |
+
evaluations lead to overestimates of model capabilities. Deeper analyses
|
206 |
+
uncover that popular models fail even at simple tasks by taking shortcuts,
|
207 |
+
overfitting, hallucinating, and not being in accordance with their
|
208 |
+
communicative goals. Identifying these shortcomings, many recent papers
|
209 |
+
critique evaluation techniques or propose new ones. But almost none of the
|
210 |
+
suggestions are followed or new techniques used. There is an incentive
|
211 |
+
mismatch between conducting high-quality evaluations and publishing new
|
212 |
+
models or modeling techniques. While general-purpose evaluation techniques
|
213 |
+
could lower the barrier of entry for incorporating evaluation advances into
|
214 |
+
model development, their development requires resources that are hard to
|
215 |
+
come by, including model outputs on validation and test sets or large
|
216 |
+
quantities of human assessments of such outputs. Moreover, some issues, like
|
217 |
+
the refinement of datasets, require iterative processes where many
|
218 |
+
researchers collaborate. All this leads to a circular dependency where
|
219 |
+
evaluations of generation models can be improved only if generation models
|
220 |
+
use better evaluations. We find that there is a systemic difference between
|
221 |
+
selecting the best model and characterizing how good this model really is.
|
222 |
+
Current evaluation techniques focus on the first, while the second is
|
223 |
+
required to detect crucial issues. More emphasis needs to be put on
|
224 |
+
measuring and reporting model limitations, rather than focusing on producing
|
225 |
+
the highest performance numbers. To that end, this paper surveys analyses
|
226 |
+
and critiques of evaluation approaches (sections 3 and 4) and of commonly
|
227 |
+
used NLG datasets (section 5). Drawing on their insights, we describe how
|
228 |
+
researchers developing modeling techniques can help to improve and
|
229 |
+
subsequently benefit from better evaluations with methods available today
|
230 |
+
(section 6). Expanding on existing work on model documentation and formal
|
231 |
+
evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
|
232 |
+
propose releasing evaluation reports which focus on demonstrating NLG model
|
233 |
+
shortcomings using evaluation suites. These reports should apply a
|
234 |
+
complementary set of automatic metrics, include rigorous human evaluations,
|
235 |
+
and be accompanied by data releases that allow for re-analysis with improved
|
236 |
+
metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
|
237 |
+
dimensions related to our suggestions (section 7), we find that the first
|
238 |
+
steps toward an improved evaluation are already frequently taken at an
|
239 |
+
average rate of 27%. The analysis uncovers the dimensions that require more
|
240 |
+
drastic changes in the NLG community. For example, 84% of papers already
|
241 |
+
report results on multiple datasets and more than 28% point out issues in
|
242 |
+
them, but we found only a single paper that contributed to the dataset
|
243 |
+
documentation, leaving future researchers to re-identify those issues. We
|
244 |
+
further highlight typical unsupported claims and a need for more consistent
|
245 |
+
data release practices. Following the suggestions and results, we discuss
|
246 |
+
how incorporating the suggestions can improve evaluation research, how the
|
247 |
+
suggestions differ from similar ones made for NLU, and how better metrics
|
248 |
+
can benefit model development itself (section 8).
|
249 |
+
example_title: NLG-Eval (2202.06935)
|
250 |
model-index:
|
251 |
- name: Long-paper-summarization-pegasus-x-b
|
252 |
results:
|