zhaoxiang committed on
Commit 5e5d326
1 Parent(s): 68efb7d
Files changed (4)
  1. README.md +36 -0
  2. app.py +55 -0
  3. ez_cite.py +396 -0
  4. requirements.txt +236 -0
README.md CHANGED
@@ -11,3 +11,39 @@ license: mit
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ 1. Sentence-split with spaCy, then use an LLM (ChatGPT) to construct a suitable search query for each sentence (a sketch follows this item):
+    - Using an LLM to construct the search query works much better than NER in our experiments
+    - Sort the search results by citationCount
+    - Later, use RAG for sentence/paper similarity matching to make sure the user's own paper is never cited
+
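+ A minimal sketch of step 1, assuming a Together-hosted LLM behind the OpenAI client (the prompt wording here is illustrative; the real prompt lives in `get_prompt` in ez_cite.py):
+
+ ```python
+ import os
+ import spacy
+ from openai import OpenAI
+
+ # assumes `python -m spacy download en_core_web_lg` has been run
+ nlp = spacy.load("en_core_web_lg")
+ intro = "Quantum sensors will eventually transduce quantum information directly to a quantum memory."
+ sentences = [s.text for s in nlp(intro).sents]
+
+ client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz")
+ for sent in sentences:
+     resp = client.chat.completions.create(
+         model="Qwen/Qwen1.5-72B-Chat",
+         messages=[{"role": "user", "content": f"Suggest one short Semantic Scholar search query for: {sent}"}],
+         max_tokens=64,
+     )
+     search_query = resp.choices[0].message.content
+ ```
+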
+ 2. The more papers fetched from Semantic Scholar the better: first take the top 30 papers relevant to the query, then keep the top 10 by citation count. These 10 papers are 10 JSON objects; remember to add a key `sentence_text` to each of them (a sketch follows item 3).
+ 3. Repeat the steps above for every sentence of the intro, giving N JSON objects named by their Semantic Scholar paperId, one per paper relevant to the intro; add all N papers to the local literature library (a large set of JSON files).
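+
+ A minimal sketch of the fetch in step 2, continuing from the sketch above and assuming the key sits in the SEMANTIC_SCHOLAR_API_KEY environment variable (the full version in ez_cite.py also requests abstract, bibtex, and tldr fields, and uses limit=20):
+
+ ```python
+ import os
+ import requests
+
+ url = "https://api.semanticscholar.org/graph/v1/paper/search"
+ params = {"query": search_query, "fields": "title,citationCount", "limit": 30}
+ headers = {"x-api-key": os.environ["SEMANTIC_SCHOLAR_API_KEY"]}
+ papers = requests.get(url, params=params, headers=headers).json().get("data", [])
+
+ # keep the 10 most-cited papers and remember which sentence produced them
+ papers = sorted(papers, key=lambda p: p["citationCount"], reverse=True)[:10]
+ for paper in papers:
+     paper["sentence_text"] = sent
+ ```
+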
+ 4. Build the index incrementally with llamaindex on top of the local literature library `papers` (a sketch follows this item):
+    - https://docs.llamaindex.ai/en/stable/examples/discover_llamaindex/document_management/Discord_Thread_Management.html#refresh-the-index-with-new-data
+    - docs = papers is updated continuously
+    - the index is updated incrementally from docs
+
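+ A minimal sketch of the incremental refresh, following the document-management pattern that `get_index` in ez_cite.py uses (the embed model / service_context is omitted here for brevity):
+
+ ```python
+ import os
+ from llama_index import SimpleDirectoryReader, VectorStoreIndex, load_index_from_storage
+ from llama_index.storage.storage_context import StorageContext
+
+ documents = SimpleDirectoryReader("papers", filename_as_id=True).load_data()
+ if not os.path.exists("index"):
+     index = VectorStoreIndex.from_documents(documents)
+     index.storage_context.persist(persist_dir="index")
+ else:
+     storage_context = StorageContext.from_defaults(persist_dir="index")
+     index = load_index_from_storage(storage_context)
+     index.refresh_ref_docs(documents)  # only re-embeds new or changed docs
+ ```
+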
+ 5. Build a retriever (https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/root.html); use each sentence as a query to search the index for semantically similar nodes, recover the bibtex (and from it the paperId) from those nodes, and update the intro and the .bib file (a sketch follows this item).
+
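+ A minimal sketch of step 5, reusing the `index` and `sentence` from the sketches above; similarity_top_k=5 and the 0.7 threshold match what ez_cite.py uses:
+
+ ```python
+ import json
+
+ retriever = index.as_retriever(similarity_top_k=5)
+ for node in retriever.retrieve(sentence):
+     if node.score > 0.7:                   # similarity threshold
+         paper = json.loads(node.text)      # each node holds one paper's JSON
+         bibtex = paper["citationStyles"]["bibtex"]
+ ```
+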
+ 6. Build an automatic metric to evaluate the accuracy of the whole pipeline.
+
+ 7. **Match citations within the results of each semantic query.** Searching all indexes for papers similar to the raw sentence gives very low `check_node_sentence_match` scores, and the method is slow. Attempted solutions (a sketch of the successful variant follows this list):
+
+    - Retrieve with the LLM-generated query instead of the raw sentence; a side benefit is a single unified local library that keeps growing over time. Failed.
+
+    - Generate citations directly from the results of each semantic query: build a separate index per sentence, each holding only that sentence's 10 papers, and use that index's retrieval to rank the 10 papers by similarity; delete all local stores at the end of every run. Implemented in a new notebook, main_lite.ipynb. Succeeded.
+
+    - retriever.retrieve(sentence) => retriever.retrieve(search_query). Failed.
+
+    - Try different embed models.
+
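+ A minimal sketch of the successful per-sentence variant, which is what `main()` in ez_cite.py now implements (`llm_generate_query` stands in for the LLM call from step 1 and is hypothetical; the other helpers are defined in ez_cite.py):
+
+ ```python
+ import os
+ import time
+
+ for sentence in sentences:
+     search_query = llm_generate_query(sentence)           # step 1
+     papers = get_relevant_papers(search_query, count=10)  # step 2
+     unique_dir = os.path.join("papers", f"{int(time.time())}")
+     save_papers(unique_dir, papers)
+     # a fresh index per sentence, containing only that sentence's 10 papers
+     index = get_index(service_context, unique_dir, os.path.join("index", f"{int(time.time())}"))
+     nodes = index.as_retriever(similarity_top_k=5).retrieve(sentence)
+ # the local "papers" and "index" stores are deleted once the run finishes
+ ```
+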
+ 8. Build the demo with gradio:
+    - https://www.gradio.app/guides/the-interface-class#example-inputs
+
+ 9. todo: give all top-5 papers to the user, write the score into the .bib file, and let a user-chosen threshold control how many papers get cited automatically (a sketch follows).
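+
+ A minimal sketch of the proposed threshold control (both helpers are hypothetical; today ez_cite.py hard-codes sim_threshold=0.7 inside write_citation):
+
+ ```python
+ def write_bib_entry(bib_file_content, bibtex, score):
+     # annotate every entry with its retrieval score so it can be filtered later
+     return bib_file_content + f"\n%score: {score:.3f}\n" + bibtex
+
+ def select_citations(scored_labels, threshold):
+     # scored_labels: list of (citation_label, score) pairs for the top-5 nodes
+     return [label for label, score in scored_labels if score >= threshold]
+ ```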
app.py ADDED
@@ -0,0 +1,55 @@
+ import gradio as gr
+
+ from ez_cite import ez_cite
+
+
+ example1 = r"""Instead of measuring physical systems and then processing the classical measurement outcomes to infer properties of the physical systems, quantum sensors will eventually be able to transduce quantum information in physical systems directly to a quantum memory, where it can be processed by a quantum computer."""
+
+ example2 = r"""In order for a learning model to generalise well from training data, it is often crucial to encode some knowledge about the structure of the data into the model itself. Convolutional neural networks are a classic illustration of this principle, whose success at image related tasks is often credited to the existence of model structures that relate to label invariance of the data under translation symmetries. Together with the choice of loss function and hyperparameters, these structures form part of the basic assumptions that a learning model makes about the data, which is commonly referred to as the \textit{inductive bias} of the model.
+
+ One of the central challenges facing quantum machine learning is to identify data structures that can be encoded usefully into quantum learning models; in other words, what are the forms of inductive bias that naturally lend themselves to quantum computation? In answering this question, we should be wary of hoping for a one-size-fits-all approach in which quantum models outperform neural network models at generic learning tasks. Rather, effort should be placed in understanding how the Hilbert space structure and probabilistic nature of the theory suggest particular biases for which quantum machine learning may excel. Indeed, an analogous perspective is commonplace in quantum computation, where computational advantages are expected only for specific problems that happen to benefit from the peculiarities of quantum logic."""
+
+ example3 = r"""Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
+
+ Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
+
+ Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
+
+ In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs."""
+
+
+ # Wrap ez_cite.py so gradio can call it
+ def generate_cite_and_bib_data(introduction):
+     cite_text, bib_data = ez_cite(introduction, debug=False)
+     return cite_text, bib_data
+
+
+ iface = gr.Interface(
+     fn=generate_cite_and_bib_data,
+     inputs=[gr.Textbox(placeholder='Enter your introduction here', label='introduction', show_label=True, lines=10)],
+     outputs=[gr.Textbox(info='This may take several minutes.', label='.tex', lines=5, show_label=True, show_copy_button=True, interactive=True),
+              gr.Textbox(info='This may take several minutes.', label='.bib', lines=5, show_label=True, show_copy_button=True, interactive=True)],
+     live=False,
+     title="Easy Cite App (beta-v0.0.5 made by Chip)",
+     description="Enter your introduction and click the buttons to generate Cite Text and Bib Data.",
+     css="""
+         .output { white-space: pre-line; }
+         .container { width: 100%; margin: auto; padding: 5px; }
+         .textbox { width: 100%; }
+     """,
+     examples=[
+         [example1],
+         [example2],
+         [example3]
+     ]
+ )
+
+ iface.launch(share=True)
ez_cite.py ADDED
@@ -0,0 +1,396 @@
+ import os
+ import re
+ import time
+ import json
+ import shutil
+ import requests
+
+ # Read the API keys from the environment rather than hard-coding them in the source
+ TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY")
+ SEMANTIC_SCHOLAR_API_KEY = os.environ.get("SEMANTIC_SCHOLAR_API_KEY")
+
+ import spacy
+ #!python -m spacy download en_core_web_lg
+ from openai import OpenAI, APIError
+
+ from llama_index import (
+     VectorStoreIndex,
+     SimpleDirectoryReader,
+     ServiceContext,
+     load_index_from_storage
+ )
+ from llama_index.embeddings import TogetherEmbedding
+ from llama_index.storage.storage_context import StorageContext
+
+
+ # Strip the citations from raw .tex input, or accept an introduction without citations directly from the user
+ def remove_citation(text):
+     # Regular expression to match \cite{...}
+     pattern = r'\\cite\{[^}]*\}'
+
+     # Replace \cite{...} with an empty string
+     text = re.sub(pattern, '', text)
+
+     # Replace multiple spaces with a single space
+     text = re.sub(r' +', ' ', text)
+
+     # Replace spaces before punctuation marks with just the punctuation marks
+     text = re.sub(r"\s+([,.!?;:()\[\]{}])", r"\1", text)
+
+     return text
+
+
+ def get_chat_completion(client, prompt, llm_model, max_tokens):
+     messages = [
+         {
+             "role": "system",
+             "content": "You are an AI assistant",
+         },
+         {
+             "role": "user",
+             "content": prompt,
+         }
+     ]
+     try:
+         chat_completion = client.chat.completions.create(
+             messages=messages,
+             model=llm_model,
+             max_tokens=max_tokens
+         )
+         return chat_completion.choices[0].message.content
+     except APIError as e:
+         # Handle specific API errors
+         print(f"API Error: {e}")
+     except Exception as e:
+         # Handle other exceptions
+         print(f"Error: {e}")
+
+
+ def get_relevant_papers(search_query, sort=True, count=10):
+     """
+     search_query (str): the keyword we want to search for
+     count (int): the number of relevant papers to return for each query
+
+     Semantic Scholar rate limits:
+     1 request per second for the following endpoints:
+         /paper/batch
+         /paper/search
+         /recommendations
+     10 requests per second for all other calls
+     """
+     # Paper search endpoint URL; all keywords in the search query are matched against the papers' titles and abstracts
+     url = 'https://api.semanticscholar.org/graph/v1/paper/search'
+     # Define headers with API key
+     headers = {'x-api-key': SEMANTIC_SCHOLAR_API_KEY}
+
+     query_params = {
+         'query': search_query,
+         'fields': 'url,title,year,abstract,authors.name,journal,citationStyles,tldr,referenceCount,citationCount',
+         'limit': 20,
+     }
+     # Send the API request
+     response = requests.get(url, params=query_params, headers=headers)
+
+     # Check response status
+     if response.status_code == 200:
+         json_response = response.json()
+         if json_response['total'] != 0:
+             papers = json_response['data']
+         else:
+             papers = []
+
+         # Sort the papers based on citationCount in descending order
+         if sort:
+             papers = sorted(papers, key=lambda x: x['citationCount'], reverse=True)
+
+         return papers[:count]
+     else:
+         print(f"Request failed with status code {response.status_code}: {response.text}")
+         return []  # return an empty list so callers can simply truth-test the result
+
+
+ def save_papers(unique_dir, papers):
+     os.makedirs(unique_dir, exist_ok=True)
+     # Save each dictionary to a separate JSON file
+     for i, dictionary in enumerate(papers):
+         filename = os.path.join(unique_dir, f"{dictionary['paperId']}.json")
+         with open(filename, 'w') as json_file:
+             json.dump(dictionary, json_file, indent=4)
+     print(f"{len(papers)} papers saved as JSON files successfully at {unique_dir}.")
+
+
+ def get_index(service_context, docs_dir, persist_dir):
+     documents = SimpleDirectoryReader(docs_dir, filename_as_id=True).load_data()
+
+     # Check if storage already exists
+     PERSIST_DIR = persist_dir
+     if not os.path.exists(PERSIST_DIR):
+         print('create new index')
+         index = VectorStoreIndex.from_documents(
+             documents, service_context=service_context, show_progress=False
+         )
+         # Store it for later
+         index.storage_context.persist(persist_dir=PERSIST_DIR)
+     else:
+         print('load the existing index')
+         # Load the existing index
+         storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
+         index = load_index_from_storage(storage_context, service_context=service_context)
+         # Refresh the index with any new or changed documents
+         refreshed_docs = index.refresh_ref_docs(documents, update_kwargs={"delete_kwargs": {"delete_from_docstore": True}})
+         print(f'refreshed_docs:\n{refreshed_docs}')
+
+     return index
+
+
+ def get_paper_data(text):
+     """text = node.text"""
+     dictionary_from_json = json.loads(text)
+
+     bibtex = dictionary_from_json['citationStyles']['bibtex']
+     bibtex = bibtex.replace('&', 'and')
+     citation_label = re.findall(r'@(\w+){([\w-]+)', bibtex)[0][1]
+
+     citationCount = dictionary_from_json['citationCount']
+
+     if dictionary_from_json['tldr'] is not None:
+         tldr = dictionary_from_json['tldr']['text']
+     else:
+         tldr = 'No tldr available'
+
+     url = dictionary_from_json['url']
+
+     return citation_label, (bibtex, citationCount, tldr, url)
+
+
+ def move_cite_inside_sentence(sent, ez_citation):
+     # Insert the citation just before the sentence's final character, preserving any trailing newlines
+     if sent[-1] != '\n':
+         character = sent[-1]
+         sent_new = sent[:-1] + ' <ez_citation>' + character
+     else:
+         count = sent.count('\n')
+         character = sent[-(count+1)]
+         sent_new = sent[:-(count+1)] + ' <ez_citation>' + character + '\n'*count
+
+     return sent_new.replace('<ez_citation>', ez_citation)
+
+
+ def write_bib_file(bib_file_content, data):
+     bibtex, citationCount, tldr, url = data
+     bib_file_content = bib_file_content + f'\n%citationCount: {citationCount}\n%tldr: {tldr}\n%url: {url}\n' + bibtex
+     return bib_file_content
+
+
+ def write_citation(sent, bib_file_content, retrieved_nodes, sim_threshold=0.75):
+     labels = []
+     for node in retrieved_nodes:
+         citation_label, data = get_paper_data(node.text)
+         print('relevant paper id (node.id_):', node.id_, 'match score (node.score):', node.score)
+         print('relevant paper data:', *data)
+         print('-'*30)
+
+         if node.score > sim_threshold and citation_label != "None":
+             labels.append(citation_label)
+             if not (citation_label in bib_file_content):
+                 bib_file_content = write_bib_file(bib_file_content, data)
+         else:
+             continue
+
+     labels = ', '.join(labels)
+     if labels:
+         ez_citation = f'\\cite{{{labels}}}'
+         sent_new = move_cite_inside_sentence(sent, ez_citation)
+     else:
+         sent_new = sent
+
+     return sent_new, bib_file_content
+
+
+ get_prompt = lambda sentence: f"""
+ I want to use the Semantic Scholar paper search API to find the relevant papers. Can you read the following text and suggest a suitable search query for this task?
+
+ Here is an example of using the API:
+ <example>
+ ```python
+ import requests
+ # Define the paper search endpoint URL
+ url = 'https://api.semanticscholar.org/graph/v1/paper/search'
+ # Define the required query parameter and its value (in this case, the keyword we want to search for)
+ query_params = {{
+     'query': 'semantic scholar platform',
+     'limit': 3
+ }}
+ # Make the GET request with the URL and query parameters
+ searchResponse = requests.get(url, params=query_params)
+ ```
+ </example>
+
+ Here is the text:
+ <text>
+ {sentence}
+ </text>
+ """
+
+
+ # main block
+ def main(sentences, count, client, llm_model, max_tokens, service_context):
+     """count (int): the number of relevant papers to return for each query"""
+     sentences_new = []
+     bib_file_content = ''
+     for sentence in sentences:
+         prompt = get_prompt(sentence)
+
+         response = get_chat_completion(client, prompt, llm_model, max_tokens)
+
+         # Regular expression pattern to find the value of 'query' in the LLM's response;
+         # fall back to the raw sentence if the response contains no query
+         pattern = r"'query': '(.*?)'"
+         matches = re.findall(pattern, response) if response else []
+         search_query = matches[0] if matches else sentence
+
+         relevant_papers = get_relevant_papers(search_query, sort=True, count=count)
+
+         if relevant_papers:
+             # Save papers to JSON files and build the index
+             unique_dir = os.path.join("papers", f"{int(time.time())}")
+             persist_dir = os.path.join("index", f"{int(time.time())}")
+             save_papers(unique_dir, relevant_papers)
+             index = get_index(service_context, unique_dir, persist_dir)
+
+             # Get the sentence's most similar papers
+             retriever = index.as_retriever(service_context=service_context, similarity_top_k=5)
+             retrieved_nodes = retriever.retrieve(sentence)
+
+             sent_new, bib_file_content = write_citation(sentence, bib_file_content, retrieved_nodes, sim_threshold=0.7)
+             sentences_new.append(sent_new)
+         else:
+             sentences_new.append(sentence)
+
+         print('sentence:', sentence.strip())
+         print('search_query:', search_query)
+         print('='*30)
+
+     return sentences_new, bib_file_content
+
+
+ def ez_cite(introduction, debug=False):
+     nlp = spacy.load("en_core_web_lg")
+     doc = nlp(introduction)
+     sentences = [sentence.text for sentence in doc.sents]
+     sentences = [remove_citation(sentence) for sentence in sentences]
+
+     client = OpenAI(api_key=TOGETHER_API_KEY,
+                     base_url='https://api.together.xyz',
+                     )
+
+     llm_model = "Qwen/Qwen1.5-72B-Chat"
+     max_tokens = 1000
+
+     embed_model = TogetherEmbedding(model_name="togethercomputer/m2-bert-80M-8k-retrieval", api_key=TOGETHER_API_KEY)
+     # chunk_size must be bigger than the whole .json so that all info is preserved; in this case, one doc is one node
+     service_context = ServiceContext.from_defaults(
+         llm=None, embed_model=embed_model, chunk_size=8192,
+     )
+
+     if debug:
+         sentences = sentences[:2]
+
+     sentences_new, bib_file_content = main(sentences, count=10,
+                                            client=client,
+                                            llm_model=llm_model,
+                                            max_tokens=max_tokens,
+                                            service_context=service_context)
+
+     with open('intro.bib', 'w') as bib_file:
+         bib_file.write(bib_file_content)
+
+     final_intro = ' '.join(sentences_new)
+     print(final_intro)
+     print('='*30)
+
+     # Delete the per-run local stores and their contents
+     for dir_path in ("index", "papers"):
+         try:
+             shutil.rmtree(dir_path)
+             print(f"Directory '{dir_path}' deleted successfully.")
+         except Exception as e:
+             print(f"Error deleting directory '{dir_path}': {e}")
+
+     return final_intro, bib_file_content
+
+
+ # arXiv:2209.05523v2
+ introduction = r"""A long-standing paradigm in machine learning is the trade-off between the complexity of a model family and the model's ability to generalize: more expressive model classes contain better candidates to fit complex trends in data, but are also prone to overfitting noise \cite{nielsen2015neural, geman1992neural}. \textit{Interpolation}, defined for our purposes as choosing a model with zero training error, was hence long considered bad practice \cite{hastie2009elements}. The success of deep learning - machine learning in a specific regime of extremely complex model families with vast amounts of tunable parameters - seems to contradict this notion; here, consistent evidence shows that among some interpolating models, more complexity tends \textit{not to harm} the generalisation performance, a phenomenon described as "benign overfitting" \cite{bartlett2021deep}.
+
+ In recent years, a surge of theoretical studies have reproduced benign overfitting in simplified settings with the hope of isolating the essential ingredients of the phenomenon \cite{bartlett2021deep, belkin2021fit}. For example, Ref. \cite{bartlett2020benign} showed how interpolating linear models in a high complexity regime (more dimensions than datapoints) could generalize just as well as their lower-complexity counterparts on new data, and analyzed the properties of the data that lead to the "absorption" of noise by the interpolating model without harming the model's predictions. Ref. \cite{belkin2019reconciling} showed that there are model classes of simple functions that change quickly in the vicinity of the noisy training data, but recover a smooth trend elsewhere in data space (see Figure 1). Such functions have also been used to train nearest neighbor models that perfectly overfit training data while generalizing well, thereby directly linking "spiking models" to benign overfitting \cite{belkin2019does}. Recent works try to recover the basic mechanism of such spiking models using the language of Fourier analysis \cite{muthukumar2020harmless, muthukumar2021classification, dar2021farewell}.
+
+ In parallel to these exciting developments in the theory of deep learning, quantum computing researchers have proposed families of parametrised quantum algorithms as model classes for machine learning (e.g. Ref. \cite{benedetti2019parameterized}). These quantum models can be optimised similarly to neural networks \cite{mitarai2018quantum, schuld2019evaluating} and have interesting parallels to kernel methods \cite{schuld2019quantum, havlivcek2019supervised} and generative models \cite{lloyd2018quantum, dallaire2018quantum}. Although researchers have taken some first steps to study the expressivity \cite{abbas2021power, wright2020capacity, sim2019expressibility, hubregtsen2021evaluation}, trainability \cite{mcclean2018barren, cerezo2021cost} and generalisation \cite{caro2021encoding, huang_power_2021, caro2022generalization, banchi2021generalization} of quantum models, we still know relatively little about their behaviour. In particular, the interplay of overparametrisation, interpolation, and generalisation that seems so important for deep learning is yet largely unexplored.
+
+ In this paper we develop a simplified framework in which questions of overfitting in quantum machine learning can be investigated. Essentially, we exploit the observation that quantum models can often be described in terms of Fourier series where well-defined components of the quantum circuit influence the selection of Fourier modes and their respective Fourier coefficients \cite{gil2020input, schuld2021effect, wierichs2022general}. We link this description to the analysis of spiking models and benign overfitting by building on prior works analyzing these phenomena using Fourier methods. In this approach, the complexity of a model is related to the number of Fourier modes that its Fourier series representation consists of, and overparametrised model classes have more modes than needed to interpolate the training data (i.e., to have zero training error). After deriving the generalization error for such model classes these "superfluous" modes lead to spiking models, which have large oscillations around the training data while keeping a smooth trend everywhere else. However, large numbers of modes can also harm the recovery of an underlying signal, and we therefore balance this trade-off to produce an explicit example of benign overfitting in a quantum machine learning model.
+
+ The mathematical link described above allows us to probe the impact of important design choices for a simplified class of quantum models on this trade-off. For example, we find why a measure of redundancy in the spectrum of the Hamiltonian that defines standard data encoding strategies strongly influences this balance; in fact to an extent that is difficult to counterbalance by other design choices of the circuit.
+
+ The remainder of the paper proceeds as follows. We will first review the classical Fourier framework for the study of interpolating models and develop explicit formulae for the error in these models to produce a basic example of benign overfitting (Sec. 2). We will then construct a quantum model with analogous components to the classical model, and demonstrate how each of these components is related to the structure of the corresponding quantum circuit and measurement (Sec. 3). We then analyze specific cases that give rise to "spikiness" and benign overfitting in these quantum models (Sec. 3.2)."""
+
+ # # arXiv:2302.01365v3
+ # introduction = r"""In order for a learning model to generalise well from training data, it is often crucial to encode some knowledge about the structure of the data into the model itself. Convolutional neural networks are a classic illustration of this principle, whose success at image related tasks is often credited to the existence of model structures that relate to label invariance of the data under translation symmetries. Together with the choice of loss function and hyperparameters, these structures form part of the basic assumptions that a learning model makes about the data, which is commonly referred to as the _inductive bias_ of the model.
+
+ # One of the central challenges facing quantum machine learning is to identify data structures that can be encoded usefully into quantum learning models; in other words, what are the forms of inductive bias that naturally lend themselves to quantum computation? In answering this question, we should be wary of hoping for a one-size-fits-all approach in which quantum models outperform neural network models at generic learning tasks. Rather, effort should be placed in understanding how the Hilbert space structure and probabilistic nature of the theory suggest particular biases for which quantum machine learning may excel. Indeed, an analogous perspective is commonplace in quantum computation, where computational advantages are expected only for specific problems that happen to benefit from the peculiarities of quantum logic.
+
+ # In the absence of large quantum computers and in the infancy of quantum machine learning theory, how should we look for insight on this issue? One possibility is to turn to complexity theory, where asymptotic advantages of quantum learning algorithms have been proven. These results are few and far between however, and the enormous gap between what is possible to prove in a complexity-theoretic sense, and the types of advantages that may be possible in practice, means that there are growing doubts about the practical relevance of these results. Indeed, progress in machine learning is often the result of good ideas built on intuition, rather than worst-case complexity theoretic analysis. To repeat a common quip: many problems in machine learning are NP-hard, but neural networks don't know that so they solve them anyway.
+
+ # We will take a different route, and lean on the field of quantum foundations to guide us. Quantum foundations is predominantly concerned with understanding the frontier between the quantum and classical world, and typically values a clear qualitative understanding of a phenomenon over purely mathematical knowledge. For these reasons it is well suited to identify features of quantum theory that may advance quantum machine learning in useful directions. In particular, we focus on the phenomenon of contextuality, which is perhaps the most prominent form of nonclassicality studied in the literature. Contextuality has a considerable tradition of being studied in relation to quantum computation, where it is closely connected to the possibility of computational speed-up. Despite this, it has had relatively little attention in quantum machine learning, with only a couple of works linking contextuality to implications for learning.
+
+ # We adopt a notion of contextuality called \textit{generalised contextuality}, introduced by Spekkens in 2004. Loosely speaking, it refers to the fact that (i) there are different experimental procedures (called contexts) in the theory that are indistinguishable, and (ii) any statistical model that reproduces the predictions of the theory must take these contexts into account. With this choice, our first task will then be to introduce a framework to talk about generalised contextuality in machine learning (Section 2). This was missing in previous works, which prove consequences for learning based on phenomena originating from contextuality, but do not attempt to define a notion of contextuality for machine learning that captures a wide range of models. Our general philosophy will be that the framework should depend purely on what a learning model can do, and not on the details of how it does it; i.e., the framework should be independent of the theory on which the models are built. This is necessary to have a framework that treats quantum and classical algorithms on the same footing, and ultimately involves adopting definitions in a similar spirit to the notion of operational contextuality as recently described in.
+
+ # We mostly focus on a paradigm of machine learning called multi-task learning, in which the aim is to simultaneously learn a number of separate models for a collection of different (but typically correlated) tasks. Multi-task learning scenarios are conceptually similar to commonly studied contextuality scenarios, and this similarity leads us to a definition of what it means for a multi-task model to be contextual (Section 3). Although the focus on multi-class learning problems appears restrictive, as the separation between tasks is arbitrary at the mathematical level, we also arrive at a notion of contextuality in the single task setting (Section 6). In particular, we argue that it makes sense to think of contextuality as a property relative to a particular inductive bias of a model, rather than a property of the model as a whole.
+
+ # Once we have described our framework, our second task will be to identify specific learning problems for which contextuality plays a role (Section 4). We show that this is the case when learning probabilistic models from data sets which feature a linearly conserved quantity in a discrete label space (see Figure 1). Such data sets can arise naturally from experiments involving conserved quantities, zero-sum game scenarios, logistics with conserved resources, substance diffusion in biological systems, and human mobility and migration. We show that the ability of a model to encode the conserved quantity as an inductive bias directly links to a central concept in generalised contextuality, called \textit{operational equivalence}. This results in a constraint on noncontextual learning models that encode the desired bias, which amounts to a limit on the expressivity of noncontextual model classes. For certain data sets, this limitation can negatively impact generalisation performance due to the lack of a suitable model within the class that matches the underlying data distribution; in such cases contextuality may therefore be required for learning. To illustrate this point, in Section 5 we construct a toy problem based on the rock, paper, scissors zero-sum game and prove precise limits on the expressivity of noncontextual model classes that attempt to learn the payoff behaviour of the game.
+
+ # In the final part of the work, we study the performance of quantum models for problems that involve our contextuality-inspired bias (Section 7). We first describe two approaches to construct quantum ansatze encoding the bias. The first of these encodes the bias into the state structure of the ansatz, and exploits tools from geometric quantum machine learning. The second approach encodes the bias into the measurement structure, and we present a new family of measurements to this end that may be of independent interest. We then use these tools in a simple numerical investigation (Section 8), inspired by a recent work of Schreiber et al. Using the fact that quantum machine learning models are equivalent to truncated Fourier series, the authors of define the notion of a classical surrogate model: a linear Fourier features model that has access to the same frequencies of the quantum model, but which lacks its specific inductive bias. The authors found that classical surrogate model classes perform better than quantum model classes on a wide range of regression tasks, the message being that it is still unclear what the inductive bias of quantum machine learning is useful for. In our numerical study, we show that a quantum model class that encodes our contextuality-inspired bias achieves a lower generalisation error than the corresponding surrogate model classes at a specific learning task, even after allowing for regularisation in the surrogate model. We argue that this is due to the fact that the bias cannot be easily encoded into the surrogate model class, which therefore cannot exploit this information during learning.
+
+ # In Section 9 we elaborate on a number of areas where contextuality-inspired inductive bias can be expected to play a role in learning. Many of these areas are classical in nature, and therefore suggests that quantum machine learning may be suited to tackling classical learning problems with a specific structure. Finally, in Section 10, we outline our vision for this line of research and the possible next steps to take. Overall, we hope our approach and framework will lead to a new way of thinking about quantum machine learning, and ultimately lead to the identification of impactful problems where the specific structure of quantum theory makes quantum models the machine learning models of choice."""
+
+
+ if __name__ == "__main__":
+
+     final_intro, bib_file_content = ez_cite(introduction, debug=True)
requirements.txt ADDED
@@ -0,0 +1,236 @@
+ aiofiles==23.2.1
+ aiohttp==3.9.3
+ aiosignal==1.3.1
+ albumentations==1.3.1
+ altair==5.2.0
+ annotated-types==0.6.0
+ anthropic-bedrock==0.8.0
+ anyio==4.2.0
+ appnope @ file:///home/conda/feedstock_root/build_artifacts/appnope_1649077682618/work
+ asgiref==3.7.2
+ asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
+ async-timeout==4.0.3
+ attrs==23.2.0
+ backoff==2.2.1
+ bcrypt==4.1.2
+ blinker==1.7.0
+ blis==0.7.11
+ boto3==1.34.31
+ botocore==1.34.31
+ build==1.0.3
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2023.11.17
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ chroma-hnswlib==0.7.3
+ chromadb==0.4.22
+ click==8.1.7
+ cloudpathlib==0.16.0
+ colorama==0.4.6
+ coloredlogs==15.0.1
+ comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1704278392174/work
+ confection==0.1.4
+ contourpy==1.2.0
+ cryptography==42.0.2
+ cycler==0.12.1
+ cymem==2.0.8
+ dataclasses-json==0.6.4
+ datasets==2.16.1
+ debugpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_563_nwtkoc/croot/debugpy_1690905063850/work
+ decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
+ Deprecated==1.2.14
+ dill==0.3.7
+ dirtyjson==1.0.8
+ distro==1.9.0
+ en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl#sha256=ab70aeb6172cde82508f7739f35ebc9918a3d07debeed637403c8f794ba3d3dc
+ exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
+ executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
+ fastapi==0.109.2
+ ffmpy==0.3.1
+ filelock==3.13.1
+ Flask==3.0.2
+ flatbuffers==23.5.26
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.10.0
+ google-auth==2.27.0
+ googleapis-common-protos==1.62.0
+ gradio==4.18.0
+ gradio_client==0.10.0
+ greenlet==3.0.3
+ grpcio==1.60.1
+ h11==0.14.0
+ httpcore==1.0.2
+ httptools==0.6.1
+ httpx==0.26.0
+ huggingface-hub==0.20.3
+ humanfriendly==10.0
+ idna==3.6
+ imageio==2.33.1
+ importlib-metadata==6.11.0
+ importlib-resources==6.1.1
+ ipykernel @ file:///Users/runner/miniforge3/conda-bld/ipykernel_1705418038222/work
+ ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1704718870316/work
+ itsdangerous==2.1.2
+ jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
+ Jinja2==3.1.3
+ jmespath==1.0.1
+ joblib==1.3.2
+ jsonschema==4.21.1
+ jsonschema-specifications==2023.12.1
+ jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1699283905679/work
+ jupyter_core @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_782yoyc_98/croot/jupyter_core_1698937318631/work
+ kiwisolver==1.4.5
+ kubernetes==29.0.0
+ langcodes==3.3.0
+ lazy_loader==0.3
+ Levenshtein==0.24.0
+ lightning==2.1.3
+ lightning-utilities==0.10.1
+ llama-index==0.9.46
+ markdown-it-py==3.0.0
+ MarkupSafe==2.1.4
+ marshmallow==3.20.2
+ matplotlib==3.8.2
+ matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
+ mdurl==0.1.2
+ mmh3==4.1.0
+ monotonic==1.6
+ mpmath==1.3.0
+ multidict==6.0.4
+ multiprocess==0.70.15
+ munch==4.0.0
+ murmurhash==1.0.10
+ mypy-extensions==1.0.0
+ nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
+ networkx==3.2.1
+ nltk==3.8.1
+ nougat-ocr==0.1.17
+ numpy==1.26.3
+ oauthlib==3.2.2
+ onnxruntime==1.17.0
+ openai==1.12.0
+ opencv-python-headless==4.9.0.80
+ opentelemetry-api==1.22.0
+ opentelemetry-exporter-otlp-proto-common==1.22.0
+ opentelemetry-exporter-otlp-proto-grpc==1.22.0
+ opentelemetry-instrumentation==0.43b0
+ opentelemetry-instrumentation-asgi==0.43b0
+ opentelemetry-instrumentation-fastapi==0.43b0
+ opentelemetry-proto==1.22.0
+ opentelemetry-sdk==1.22.0
+ opentelemetry-semantic-conventions==0.43b0
+ opentelemetry-util-http==0.43b0
+ orjson==3.9.12
+ overrides==7.7.0
+ packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
+ pandas==2.2.0
+ parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
+ pdfminer.six==20221105
+ pdfplumber==0.10.3
+ pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
+ pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
+ pillow==10.2.0
+ platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1706713388748/work
+ posthog==3.4.0
+ preshed==3.0.9
+ prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
+ protobuf==4.25.2
+ psutil @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_1310b568-21f4-4cb0-b0e3-2f3d31e39728k9coaga5/croots/recipe/psutil_1656431280844/work
+ ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
+ pulsar-client==3.4.0
+ pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
+ pyarrow==15.0.0
+ pyarrow-hotfix==0.6
+ pyasn1==0.5.1
+ pyasn1-modules==0.3.0
+ pycparser==2.21
+ pydantic==2.6.0
+ pydantic_core==2.16.1
+ pydub==0.25.1
+ Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
+ PyMuPDF==1.23.20
+ PyMuPDFb==1.23.9
+ pyparsing==3.1.1
+ pypdf==4.0.1
+ PyPDF2==3.0.1
+ pypdfium2==4.26.0
+ PyPika==0.48.9
+ pyproject_hooks==1.0.0
+ python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
+ python-dotenv==1.0.1
+ python-Levenshtein==0.24.0
+ python-multipart==0.0.9
+ pytorch-lightning==2.1.3
+ pytz==2023.4
+ PyYAML==6.0.1
+ pyzmq @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_43pxpbos3z/croot/pyzmq_1705605108344/work
+ qudida==0.0.4
+ rapidfuzz==3.6.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests==2.31.0
+ requests-oauthlib==1.3.1
+ rich==13.7.0
+ rpds-py==0.17.1
+ rsa==4.9
+ ruamel.yaml==0.18.5
+ ruamel.yaml.clib==0.2.8
+ ruff==0.2.1
+ s3transfer==0.10.0
+ safetensors==0.4.2
+ scikit-image==0.22.0
+ scikit-learn==1.4.0
+ scipy==1.12.0
+ sconf==0.2.5
+ semantic-version==2.10.0
+ sentencepiece==0.1.99
+ shellingham==1.5.4
+ six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
+ smart-open==6.4.0
+ sniffio==1.3.0
+ socksio==1.0.0
+ spacy==3.7.2
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ SQLAlchemy==2.0.25
+ srsly==2.4.8
+ stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
+ starlette==0.36.3
+ sympy==1.12
+ tenacity==8.2.3
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ tifffile==2024.1.30
+ tiktoken==0.6.0
+ timm==0.5.4
+ tokenizers==0.15.1
+ tomli==2.0.1
+ tomlkit==0.12.0
+ toolz==0.12.1
+ torch==2.2.0
+ torchmetrics==1.3.0.post0
+ torchvision==0.17.0
+ tornado @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_3a5nrn2jeh/croot/tornado_1696936974091/work
+ tqdm==4.66.1
+ traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1704212992681/work
+ transformers==4.37.2
+ typer==0.9.0
+ typing-inspect==0.9.0
+ typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1702176139754/work
+ tzdata==2023.4
+ urllib3==2.0.7
+ uvicorn==0.27.0.post1
+ uvloop==0.19.0
+ wasabi==1.1.2
+ watchfiles==0.21.0
+ wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
+ weasel==0.3.4
+ websocket-client==1.7.0
+ websockets==11.0.3
+ Werkzeug==3.0.1
+ wrapt==1.16.0
+ xxhash==3.4.1
+ yarl==1.9.4
+ zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work