kenken999 committed on
Commit 7d7bd09 · 2 Parent(s): a14d2cf 00f541b

Merge branch 'main' of https://huggingface.co/spaces/kenken999/fastapi_django_main

Files changed (47)
  1. controllers/ 電話番号/prompt +5 -1
  2. controllers/# 答えの最初に、私/prompt +184 -0
  3. controllers/YES/prompt +1 -0
  4. controllers/【リッチメニュー】取/prompt +1 -0
  5. controllers/【リッチメニュー】本/prompt +1 -0
  6. controllers/【応答】撮影ポイント/prompt +5 -1
  7. controllers/あ/prompt +1 -0
  8. controllers/ありがとうございまし/prompt +1 -0
  9. controllers/ありがとうございます/prompt +1 -0
  10. controllers/お世話になっておりま/prompt +3 -0
  11. controllers/こちらいくらくらいに/prompt +1 -0
  12. controllers/これからLINEから/prompt +3 -0
  13. controllers/ごめんなさい、ちょっ/prompt +1 -0
  14. controllers/だいたいの相場でいい/prompt +1 -0
  15. controllers/はい。早速のお返事あ/prompt +1 -0
  16. controllers/はい大丈夫です、お気/prompt +1 -0
  17. controllers/ま/prompt +1 -0
  18. controllers/わかりました、よろし/prompt +1 -0
  19. controllers/エラーの場合、エラー/prompt +1 -0
  20. controllers/グラフ /nレガシー /prompt +11 -0
  21. controllers/ダイヤ、金、ブランド/prompt +21 -0
  22. controllers/ダイヤモンドのルース/prompt +5 -0
  23. controllers/チェック/prompt +1 -0
  24. controllers/ティファニーの結婚指/prompt +1 -0
  25. controllers/プラチナ台で、ダイヤ/prompt +1 -0
  26. controllers/プロンプトは日本語で/prompt +1 -0
  27. controllers/ヘルプ/prompt +1 -0
  28. controllers/ロレックス サブマリ/prompt +1 -0
  29. controllers/上記の質問について /prompt +13 -0
  30. controllers/了解しました/prompt +1 -0
  31. controllers/何分くらいで、折り返/prompt +1 -0
  32. controllers/先にライン見積もりし/prompt +1 -0
  33. controllers/大体どのくらいの値段/prompt +1 -0
  34. controllers/日本語でプロンプトは/prompt +33 -0
  35. controllers/早速チャンネル登録さ/prompt +7 -0
  36. controllers/査定用のプロンプトを/prompt +1 -0
  37. controllers/箱に入ってるのは全部/prompt +1 -0
  38. controllers/買取でお願いします。/prompt +1 -0
  39. controllers/買取強化キャンペーン/prompt +5 -0
  40. controllers/運転には気をつけて^/prompt +5 -0
  41. controllers/電話番号/prompt +1 -0
  42. mysite/interpreter/interpreter.py +2 -2
  43. mysite/interpreter/process.py +9 -2
  44. mysite/interpreter/prompt.py +36 -3
  45. mysite/logger.py +21 -1
  46. mysite/routers/fastapi.py +50 -21
  47. polls/test.ipynb +427 -0
controllers/ 電話番号/prompt CHANGED
@@ -1 +1,5 @@
1
- 電話番号 電話番号
 
 
 
 
 
1
+ <<<<<<< HEAD
2
+ 電話番号 電話番号
3
+ =======
4
+ 電話番号 電話番号 電話番号 電話番号 電話番号 電話番号 電話番号 電話番号
5
+ >>>>>>> 00f541b5586278487c056a461a6bcfa6642f2cca
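The added lines above are unresolved Git conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) committed as file content. A minimal sketch of how such files could be flagged before committing; the helper names `find_conflict_markers` and `scan_tree` are hypothetical and not part of this repository, and the assumption that every prompt file is literally named `prompt` is taken from the paths listed in this diff.

```python
from pathlib import Path
from typing import Dict, List


def find_conflict_markers(text: str) -> List[int]:
    """Return 1-based line numbers that look like Git conflict markers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Opening and closing markers carry a label; the separator is bare.
        if stripped.startswith(("<<<<<<< ", ">>>>>>> ")) or stripped == "=======":
            hits.append(number)
    return hits


def scan_tree(root: str) -> Dict[str, List[int]]:
    """Scan every file named `prompt` under `root` and report marker lines."""
    report = {}
    for path in Path(root).rglob("prompt"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = find_conflict_markers(text)
            if hits:
                report[str(path)] = hits
    return report
```

Running `scan_tree("controllers")` from the repository root would list each affected file with the line numbers of its markers.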
controllers/# 答えの最初に、私/prompt ADDED
@@ -0,0 +1,184 @@
1
+ # 答えの最初に、私たちは、リファスタという買取店舗を運営していることを表示して下さい 最初に私たちは[リファスタ] ですという自己紹介とAIなので間違った答えを言うこともあるという謝りもいう事!
2
+ # 私はAIですと最初にいい、間違った答えもいうとお詫びを言う**
3
+ # マークダウン表記(**)を使わないで、わかりやすい絵文字にすること。例えば、**LINE査定** という表記を使わずに、絵文字を使って
4
+ # **という表示は使わなく■にして
5
+ # マークダウン表記 ** などは絵文字に変えて Lineなのでマークダウンは使えないので
6
+
7
+ # LINEチャット
8
+ - ラインチャットの返信なので完結に20文字から40文字にまとめて答えて
9
+
10
+ # マークダウン表記 ** などは絵文字に変えて Lineなのでマークダウンは使えないので
11
+ # **LINE査定** → ** はつかわないでわかりやすい絵文字にして
12
+
13
+ **Role**:
14
+ - ** 私はAIですと最初にいい、間違った答えもいうとお詫びを言う**
15
+ - ** 必ず質問の内容にそった次の質問を4択で提示してください**
16
+ 0. **最後に、質問が良かったか悪かったかのボタンをつけて下さい。**
17
+ - **必ずいいね、悪いねの評価ボタンをつけて下さい**
18
+ - First, explain to the customer that I am an AI and may make mistakes.
19
+ - Add emojis to the sentences.
20
+ - 最初に会話履歴はOFFにしていることを伝えてください。間違った答えを出さない為とも伝えてください。設定が必要な場合、下記からチャットで必要内容を登録してください。https://kenken999-gradio-fastapi-static.hf.space/
21
+ - **LINE査定**: https://line.me/ti/p/%40rifa
22
+ - 内容が合わない場合は、データに保存することを伝えてください。
23
+ - **Job**: Respond as an employee of Refasta. You are the expert.
24
+ - **Site Information**: If information is not available, provide details from https://kinkaimasu.jp.
25
+ - **Work Description**: Refasta specializes in reuse operations, handling precious metals, brands, and diamonds. We act as a reliable buyer, appraiser, and processor. Assessments are conducted by our company.
26
+ - **Important**: Always search for information first.
27
+ - **Processing Work**: Processing work is also handled by our company.
28
+ - **Trading Partner**: The trading partner is our company.
29
+ - Use LINE emojis to match the text.
30
+ - Use line breaks instead of markdown notation as this is for LINE.
31
+ - Do not display markdown notation.
32
+ - Skillfully use Q&A to sell products based on customer questions, displaying four options.
33
+
34
+ **Text**:
35
+ - Use LINE emojis to match the text.
36
+ - Use line breaks instead of markdown notation as this is for LINE.
37
+ - Do not display markdown notation.
38
+ - Skillfully use Q&A to sell products based on customer questions, displaying four options.
39
+ - Do not display retriever references.
40
+
41
+ **Communication**:
42
+ As we use CHAT, keep sentences concise and display in paragraphs. Include emojis and numbers. Search for information from the question content and display the next question options with numbers in paragraphs, showing four options.
43
+ - **Resources**: Use our company's website, app, and LINE for explanations. Always display available site information.
44
+ - **Price**: Provide today's price based on the current day's rates.
45
+ - **Site Navigation**: Display the source URL of the information.
46
+
47
+ **Site Information**:
48
+ タイトルに説明、URLにURLを設定。httpは全てhttpsとする事。
49
+ - **返信の最後に、必ずいいね、悪いねの評価ボタンをつけて下さい。**
50
+ - **質問の内容にそった次の質問を4択で提示してください**
51
+
52
+ **Email Appraisal**: https://kinkaimasu.jp/estimate/
53
+ **LINE Appraisal**: https://line.me/ti/p/%40rifa
54
+ **LINE査定**: https://line.me/ti/p/%40rifa
55
+ *Note: The app will launch on smartphones.
56
+ **Contact Information**:
57
+ - If you don't understand the question
58
+ - For inquiries, provide the following site
59
+ - **Contact**: https://kinkaimasu.jp/realchat/
60
+
61
+ **As an Expert**:
62
+ - You are the appraiser, and as an expert and buyer, you act as a reliable dealer. Provide related questions in response to inquiries.
63
+
64
+ **Today's Gold Prices**:
65
+ - Gold: ¥12636 2024/05/25
66
+ - Platinum: ¥5440 2024/05/25
67
+
68
+ **Explanation**:
69
+ - This system aims to search for relevant information from a specified database based on specific questions provided by users and present the results in an organized manner.
70
+ - In-store service is also available, with a store located in Ikebukuro.
71
+
72
+ **Functions**:
73
+ 0. **質問の内容にそった次の質問を4択で表示してください””
74
+ 0 !**4択の最後に お店のサイトを知りたいと毎回つけてください**
75
+ 0. **最後に、質問が良かったか悪かったかのボタンをつけて下さい。**
76
+ 1. **Question Analysis**: Receive questions from users, extract important keywords, and first present input suggestions to users. Display the URL of the information.
77
+ 2. **Database Search**: Perform an AND search with the extracted keywords to find relevant information. Extract only information that includes all the keywords used in the question.
78
+ 3. **Information Presentation**: Present the information obtained from the search results in an easy-to-understand manner.
79
+ 4. **Result Transmission via Google Chat**: Send search results through Google Chat to streamline communication with users.
80
+ 5. **Suggestion of Next Questions**: Based on the searched content, list and present the next question candidates.
81
+ 6. **Display of Emojis**: Display appropriate emojis matching the context.
82
+ 7. **Removal of Markdown Notation**: Since this is for LINE, remove markdown notation.
83
+ 8. **Display the main text and four options. The title is displayed in text 14px, the main text in text 12px. The four options are buttons, and clicking on them registers the displayed text as the next question. The user clicks on one of the four options, which becomes the next question. The main text contains the explanation, and the options contain the next question content.**
84
+ **At the end of the reply, always attach a button to rate the question as good or bad.**
85
+
86
+ **Usage Example**:
87
+ User: "I found various silver items while cleaning the house. Is the purchase price based on the rate per gram on the price table?"
88
+ GPT: Yes, I will search for related information and provide an answer. [Information Presentation]
89
+
90
+ **Guidelines**:
91
+ - Carefully select specific keywords and set database search conditions.
92
+ - It is important to confirm the accuracy of the information before presenting it to the user.
93
+ # 答えの最初に、私たちは、リファスタという買取店舗を運営していることを表示して下さい 最初に私たちは[リファスタ] ですという自己紹介とAIなので間違った答えを言うこともあるという謝りもいう事!
94
+ # 私はAIですと最初にいい、間違った答えもいうとお詫びを言う**
95
+ # マークダウン表記(**)を使わないで、わかりやすい絵文字にすること。例えば、**LINE査定** という表記を使わずに、絵文字を使って
96
+ # **という表示は使わなく■にして
97
+ # マークダウン表記 ** などは絵文字に変えて Lineなのでマークダウンは使えないので
98
+
99
+ # LINEチャット
100
+ - ラインチャットの返信なので完結に20文字から40文字にまとめて答えて
101
+
102
+ # マークダウン表記 ** などは絵文字に変えて Lineなのでマークダウンは使えないので
103
+ # **LINE査定** → ** はつかわないでわかりやすい絵文字にして
104
+
105
+ **Role**:
106
+ - ** 私はAIですと最初にいい、間違った答えもいうとお詫びを言う**
107
+ - ** 必ず質問の内容にそった次の質問を4択で提示してください**
108
+ 0. **最後に、質問が良かったか悪かったかのボタンをつけて下さい。**
109
+ - **必ずいいね、悪いねの評価ボタンをつけて下さい**
110
+ - First, explain to the customer that I am an AI and may make mistakes.
111
+ - Add emojis to the sentences.
112
+ - 最初に会話履歴はOFFにしていることを伝えてください。間違った答えを出さない為とも伝えてください。設定が必要な場合、下記からチャットで必要内容を登録してください。https://kenken999-gradio-fastapi-static.hf.space/
113
+ - **LINE査定**: https://line.me/ti/p/%40rifa
114
+ - 内容が合わない場合は、データに保存することを伝えてください。
115
+ - **Job**: Respond as an employee of Refasta. You are the expert.
116
+ - **Site Information**: If information is not available, provide details from https://kinkaimasu.jp.
117
+ - **Work Description**: Refasta specializes in reuse operations, handling precious metals, brands, and diamonds. We act as a reliable buyer, appraiser, and processor. Assessments are conducted by our company.
118
+ - **Important**: Always search for information first.
119
+ - **Processing Work**: Processing work is also handled by our company.
120
+ - **Trading Partner**: The trading partner is our company.
121
+ - Use LINE emojis to match the text.
122
+ - Use line breaks instead of markdown notation as this is for LINE.
123
+ - Do not display markdown notation.
124
+ - Skillfully use Q&A to sell products based on customer questions, displaying four options.
125
+
126
+ **Text**:
127
+ - Use LINE emojis to match the text.
128
+ - Use line breaks instead of markdown notation as this is for LINE.
129
+ - Do not display markdown notation.
130
+ - Skillfully use Q&A to sell products based on customer questions, displaying four options.
131
+ - Do not display retriever references.
132
+
133
+ **Communication**:
134
+ As we use CHAT, keep sentences concise and display in paragraphs. Include emojis and numbers. Search for information from the question content and display the next question options with numbers in paragraphs, showing four options.
135
+ - **Resources**: Use our company's website, app, and LINE for explanations. Always display available site information.
136
+ - **Price**: Provide today's price based on the current day's rates.
137
+ - **Site Navigation**: Display the source URL of the information.
138
+
139
+ **Site Information**:
140
+ タイトルに説明、URLにURLを設定。httpは全てhttpsとする事。
141
+ - **返信の最後に、必ずいいね、悪いねの評価ボタンをつけて下さい。**
142
+ - **質問の内容にそった次の質問を4択で提示してください**
143
+
144
+ **Email Appraisal**: https://kinkaimasu.jp/estimate/
145
+ **LINE Appraisal**: https://line.me/ti/p/%40rifa
146
+ **LINE査定**: https://line.me/ti/p/%40rifa
147
+ *Note: The app will launch on smartphones.
148
+ **Contact Information**:
149
+ - If you don't understand the question
150
+ - For inquiries, provide the following site
151
+ - **Contact**: https://kinkaimasu.jp/realchat/
152
+
153
+ **As an Expert**:
154
+ - You are the appraiser, and as an expert and buyer, you act as a reliable dealer. Provide related questions in response to inquiries.
155
+
156
+ **Today's Gold Prices**:
157
+ - Gold: ¥12636 2024/05/25
158
+ - Platinum: ¥5440 2024/05/25
159
+
160
+ **Explanation**:
161
+ - This system aims to search for relevant information from a specified database based on specific questions provided by users and present the results in an organized manner.
162
+ - In-store service is also available, with a store located in Ikebukuro.
163
+
164
+ **Functions**:
165
+ 0. **質問の内容にそった次の質問を4択で表示してください””
166
+ 0 !**4択の最後に お店のサイトを知りたいと毎回つけてください**
167
+ 0. **最後に、質問が良かったか悪かったかのボタンをつけて下さい。**
168
+ 1. **Question Analysis**: Receive questions from users, extract important keywords, and first present input suggestions to users. Display the URL of the information.
169
+ 2. **Database Search**: Perform an AND search with the extracted keywords to find relevant information. Extract only information that includes all the keywords used in the question.
170
+ 3. **Information Presentation**: Present the information obtained from the search results in an easy-to-understand manner.
171
+ 4. **Result Transmission via Google Chat**: Send search results through Google Chat to streamline communication with users.
172
+ 5. **Suggestion of Next Questions**: Based on the searched content, list and present the next question candidates.
173
+ 6. **Display of Emojis**: Display appropriate emojis matching the context.
174
+ 7. **Removal of Markdown Notation**: Since this is for LINE, remove markdown notation.
175
+ 8. **Display the main text and four options. The title is displayed in text 14px, the main text in text 12px. The four options are buttons, and clicking on them registers the displayed text as the next question. The user clicks on one of the four options, which becomes the next question. The main text contains the explanation, and the options contain the next question content.**
176
+ **At the end of the reply, always attach a button to rate the question as good or bad.**
177
+
178
+ **Usage Example**:
179
+ User: "I found various silver items while cleaning the house. Is the purchase price based on the rate per gram on the price table?"
180
+ GPT: Yes, I will search for related information and provide an answer. [Information Presentation]
181
+
182
+ **Guidelines**:
183
+ - Carefully select specific keywords and set database search conditions.
184
+ - It is important to confirm the accuracy of the information before presenting it to the user.
controllers/YES/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ YESYES
controllers/【リッチメニュー】取/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 【リッチメニュー】取扱商材【リッチメニュー】取扱商材
controllers/【リッチメニュー】本/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格【リッチメニュー】本日の金価格
controllers/【応答】撮影ポイント/prompt CHANGED
@@ -1 +1,5 @@
1
- 【応答】撮影ポイント【応答】撮影ポイント
 
 
 
 
 
1
+ <<<<<<< HEAD
2
+ 【応答】撮影ポイント【応答】撮影ポイント
3
+ =======
4
+ 【応答】撮影ポイント【応答】撮影ポイント【応答】撮影ポイント【応答】撮影ポイント【応答】撮影ポイント【応答】撮影ポイント
5
+ >>>>>>> 00f541b5586278487c056a461a6bcfa6642f2cca
controllers/あ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ああ
controllers/ありがとうございまし/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ありがとうございました。ありがとうございました。
controllers/ありがとうございます/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ありがとうございます。明日発送しますので、よろしくお願いします。ありがとうございます。明日発送しますので、よろしくお願いします。
controllers/お世話になっておりま/prompt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ お世話になっております。
2
+ 本日の資金いくらご用意いたしますか?お世話になっております。
3
+ 本日の資金いくらご用意いたしますか?
controllers/こちらいくらくらいに/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ こちらいくらくらいになりますか?こちらいくらくらいになりますか?
controllers/これからLINEから/prompt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ これからLINEからお客の質問がくるので
2
+ 毎回自動でそれに対応する プロンプトを作成してほしいこれからLINEからお客の質問がくるので
3
+ 毎回自動でそれに対応する プロンプトを作成してほしい
controllers/ごめんなさい、ちょっ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ごめんなさい、ちょっと量があるのと今子供のお世話と夕飯作りで手が離せないです🥲ごめんなさい、ちょっと量があるのと今子供のお世話と夕飯作りで手が離せないです🥲
controllers/だいたいの相場でいい/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ だいたいの相場でいいので買取の値段を教えてください。だいたいの相場でいいので買取の値段を教えてください。
controllers/はい。早速のお返事あ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ はい。早速のお返事ありがとうございます。らはい。早速のお返事ありがとうございます。ら
controllers/はい大丈夫です、お気/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ はい大丈夫です、お気をつけてお越しくださいはい大丈夫です、お気をつけてお越しください
controllers/ま/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ まま
controllers/わかりました、よろし/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ わかりました、よろしくお願いしますわかりました、よろしくお願いします
controllers/エラーの場合、エラー/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ エラーの場合、エラーコードをLLMに送信 自動でチェックエラーの場合、エラーコードをLLMに送信 自動でチェック
controllers/グラフ /nレガシー /prompt ADDED
@@ -0,0 +1,11 @@
1
+ グラフ 
2
+ レガシー エメラルドカット ダイヤモンド エンゲージメント リング
3
+ 0.7カラット
4
+ 刻印なしです。
5
+ かなり綺麗な状態です。
6
+ 見積もりお願いしますグラフ 
7
+ レガシー エメラルドカット ダイヤモンド エンゲージメント リング
8
+ 0.7カラット
9
+ 刻印なしです。
10
+ かなり綺麗な状態です。
11
+ 見積もりお願いします
controllers/ダイヤ、金、ブランド/prompt ADDED
@@ -0,0 +1,21 @@
1
+ ダイヤ、金、ブランドの買取の査定用のプロンプトを作成してほしいダイヤ、金、ブランドの買取の査定用のプロンプトを作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
2
+ 作成したプロンプトをテストするテストも作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
3
+ 作成したプロンプトをテストするテストも作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
4
+ 作成したプロンプトをテストするテストも作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
5
+ 作成したプロンプトをテストするテストも作成してほしいダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
6
+ 作成したプロンプトをテストするテストも作成してほしい
7
+ プロントとデータのセットも作成ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
8
+ 作成したプロンプトをテストするテストも作成してほしい
9
+ プロントとデータのセットも作成ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
10
+ 答えの前に私たちはリファスタと必ずいって
11
+ 作成したプロンプトをテストするテストも作成してほしい
12
+ プロントとデータのセットも作成ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
13
+ 答えの前に私たちはリファスタと必ずいって
14
+ 作成したプロンプトをテストするテストも作成してほしい
15
+ プロントとデータのセットも作成ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
16
+ 答えの前に私たちはリファスタと必ず表記して
17
+ 作成したプロンプトをテストするテストも作成してほしい
18
+ プロントとデータのセットも作成ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
19
+ 答えの前に私たちはリファスタと必ず表記して
20
+ 作成したプロンプトをテストするテストも作成してほしい
21
+ プロントとデータのセットも作成
controllers/ダイヤモンドのルース/prompt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ ダイヤモンドのルースです。
2
+ 査定をお願い致します。ダイヤモンドのルースです。
3
+ 査定をお願い致します。ダイヤモンドのルースの査定を希望します。
4
+ 宜しくお願い致します。ダイヤモンドのルースの査定を希望します。
5
+ 宜しくお願い致します。
controllers/チェック/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ チェックチェック
controllers/ティファニーの結婚指/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ティファニーの結婚指輪で旦那さんとわたしの名前の刻印がありますが、それは売る事はできますか?ティファニーの結婚指輪で旦那さんとわたしの名前の刻印がありますが、それは売る事はできますか?
controllers/プラチナ台で、ダイヤ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ プラチナ台で、ダイヤモンドが小さいのが3つついています。、指輪の幅はめちゃくちゃ細いものです。プラチナ台で、ダイヤモンドが小さいのが3つついています。、指輪の幅はめちゃくちゃ細いものです。
controllers/プロンプトは日本語で/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ プロンプトは日本語でプロンプトは日本語で
controllers/ヘルプ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ヘルプヘルプヘルプヘルプ
controllers/ロレックス サブマリ/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ ロレックス サブマリーナ デイト 16613LB Z番ロレックス サブマリーナ デイト 16613LB Z番
controllers/上記の質問について /prompt ADDED
@@ -0,0 +1,13 @@
1
+ 上記の質問について ルビーの買取
2
+
3
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
4
+ 答えの前に私たちはリファスタと必ず表記して
5
+ 作成したプロンプトをテストするテストも作成してほしい
6
+ プロントとデータのセットも作成
7
+ 買取が成功するシナリオも作成上記の質問について ルビーの買取
8
+
9
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
10
+ 答えの前に私たちはリファスタと必ず表記して
11
+ 作成したプロンプトをテストするテストも作成してほしい
12
+ プロントとデータのセットも作成
13
+ 買取が成功するシナリオも作成
controllers/了解しました/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 了解しました了解しました
controllers/何分くらいで、折り返/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 何分くらいで、折り返しもらえますか?何分くらいで、折り返しもらえますか?
controllers/先にライン見積もりし/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 先にライン見積もりしていただいておけば良かったですね、すみません💦先にライン見積もりしていただいておけば良かったですね、すみません💦
controllers/大体どのくらいの値段/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 大体どのくらいの値段かわつきますか?大体どのくらいの値段かわつきますか?
controllers/日本語でプロンプトは/prompt ADDED
@@ -0,0 +1,33 @@
1
+ 日本語でプロンプトは作成
2
+ 上記の質問について ルビーの買取
3
+
4
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
5
+ 答えの前に私たちはリファスタと必ず表記して
6
+ 作成したプロンプトをテストするテストも作成してほしい
7
+ プロントとデータのセットも作成
8
+ 買取が成功するシナリオも作成日本語でプロンプトは作成
9
+ 上記の質問について ルビーの買取
10
+
11
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
12
+ 答えの前に私たちはリファスタと必ず表記して
13
+ 作成したプロンプトをテストするテストも作成してほしい
14
+ プロントとデータのセットも作成
15
+ 買取が成功するシナリオも作成日本語でプロンプトは作成
16
+ 上記の質問について ルビーの買取
17
+ LangChainで作成したプロンプトを
18
+ 役割を設定して、成功するまで
19
+
20
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
21
+ 答えの前に私たちはリファスタと必ず表記して
22
+ 作成したプロンプトをテストするテストも作成してほしい
23
+ プロントとデータのセットも作成
24
+ 買取が成功するシナリオも作成日本語でプロンプトは作成
25
+ 上記の質問について ルビーの買取
26
+ LangChainで作成したプロンプトを
27
+ 役割を設定して、成功するまで
28
+
29
+ ダイヤ、金、ブランドの買取の査定者としての役割のプロンプトを日本語で作成してほしい
30
+ 答えの前に私たちはリファスタと必ず表記して
31
+ 作成したプロンプトをテストするテストも作成してほしい
32
+ プロントとデータのセットも作成
33
+ 買取が成功するシナリオも作成
controllers/早速チャンネル登録さ/prompt ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ 早速チャンネル登録させていただきました。
2
+ こちらこそ楽しい時間ありがとうございました。
3
+ またお会い出来ると思います。
4
+ ありがとうございました。早速チャンネル登録させていただきました。
5
+ こちらこそ楽しい時間ありがとうございました。
6
+ またお会い出来ると思います。
7
+ ありがとうございました。
controllers/査定用のプロンプトを/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 査定用のプロンプトを作成してほしい査定用のプロンプトを作成してほしい
controllers/箱に入ってるのは全部/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 箱に入ってるのは全部ヴィトンで新品のバッグ1点、2、3度使ったバッグ1点、新品のモノグラムニースナノバニティM44495が1点です←商品名等分かるのはこれだけですみません💦箱に入ってるのは全部ヴィトンで新品のバッグ1点、2、3度使ったバッグ1点、新品のモノグラムニースナノバニティM44495が1点です←商品名等分かるのはこれだけですみません💦
controllers/買取でお願いします。/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 買取でお願いします。買取でお願いします。
controllers/買取強化キャンペーン/prompt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ <<<<<<< HEAD
2
+ 買取強化キャンペーン買取強化キャンペーン
3
+ =======
4
+ 買取強化キャンペーン買取強化キャンペーン買取強化キャンペーン買取強化キャンペーン買取強化キャンペーン買取強化キャンペーン
5
+ >>>>>>> 23450bfecab1532bcd2fd666bc1391cf235b68c1
controllers/運転には気をつけて^/prompt ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ 運転には気をつけて^_^
2
+ 動画妻が喜んでます笑
3
+ この度はありがとうございました。運転には気をつけて^_^
4
+ 動画妻が喜んでます笑
5
+ この度はありがとうございました。
controllers/電話番号/prompt ADDED
@@ -0,0 +1 @@
 
 
1
+ 電話番号電話番号電話番号電話番号電話番号電話番号電話番号電話番号
mysite/interpreter/interpreter.py CHANGED
@@ -56,7 +56,7 @@ def chat_with_interpreter(
56
  yield full_response + rows # , history
57
  return full_response, history
58
 
59
- async def completion(message: str, history, c=None, d=None):
60
  from groq import Groq
61
  client = Groq(api_key=os.getenv("api_key"))
62
  messages = []
@@ -71,7 +71,7 @@ async def completion(message: str, history, c=None, d=None):
71
 
72
  user_entry = {"role": "user", "content": message}
73
  messages.append(user_entry)
74
- system_prompt = {"role": "system", "content": "あなたは日本語の優秀なアシスタントです。"}
75
  messages.insert(0, system_prompt)
76
  async with async_timeout.timeout(GENERATION_TIMEOUT_SEC):
77
  try:
 
56
  yield full_response + rows # , history
57
  return full_response, history
58
 
59
+ async def completion(message: str, history, c=None, d=None, prompt="あなたは日本語の優秀なアシスタントです。"):
60
  from groq import Groq
61
  client = Groq(api_key=os.getenv("api_key"))
62
  messages = []
 
71
 
72
  user_entry = {"role": "user", "content": message}
73
  messages.append(user_entry)
74
+ system_prompt = {"role": "system", "content": prompt}
75
  messages.insert(0, system_prompt)
76
  async with async_timeout.timeout(GENERATION_TIMEOUT_SEC):
77
  try:
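The change above turns the hard-coded system message into a `prompt` parameter with the old text as its default. A minimal sketch of the message assembly this implies; `build_messages` is a hypothetical helper, and the `(user, assistant)` pair format for `history` is an assumption, since the lines that process `history` are elided from this hunk.

```python
from typing import Dict, List, Sequence, Tuple

DEFAULT_PROMPT = "あなたは日本語の優秀なアシスタントです。"


def build_messages(
    message: str,
    history: Sequence[Tuple[str, str]],
    prompt: str = DEFAULT_PROMPT,
) -> List[Dict[str, str]]:
    """Assemble a chat message list with the system prompt first."""
    messages: List[Dict[str, str]] = []
    # Assumed history shape: a sequence of (user_text, assistant_text) pairs.
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": message})
    # Mirrors the diff: the system entry is inserted at index 0 last.
    messages.insert(0, {"role": "system", "content": prompt})
    return messages
```

Callers that omit `prompt` keep the previous behavior, which is why a default argument is a low-risk way to make the change.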
mysite/interpreter/process.py CHANGED
@@ -13,6 +13,7 @@ from models.ride import test_set_lide
13
  from mysite.libs.github import github
14
  import requests
15
  import json
 
16
 
17
  GENERATION_TIMEOUT_SEC=60
18
  BASE_PATH = "/home/user/app/controllers/"
@@ -210,6 +211,7 @@ def validate_signature(body: str, signature: str, secret: str) -> bool:
210
  expected_signature = base64.b64encode(hash).decode("utf-8")
211
  return hmac.compare_digest(expected_signature, signature)
212
 
 
213
  def no_process_file(prompt, foldername):
214
  set_environment_variables()
215
  try:
@@ -241,8 +243,13 @@ def no_process_file(prompt, foldername):
241
  stdout, stderr = proc.communicate(input="n\ny\ny\n")
242
  webhook_url = os.getenv("chat_url")
243
  token = os.getenv("token")
244
- url = github(token,foldername)
245
-
 
 
 
 
 
246
  title = """ラインで作るオープンシステム
247
  お客様の質問内容の
248
  プログラムを作成しました"""
 
13
  from mysite.libs.github import github
14
  import requests
15
  import json
16
+ from mysite.logger import log_error
17
 
18
  GENERATION_TIMEOUT_SEC=60
19
  BASE_PATH = "/home/user/app/controllers/"
 
211
  expected_signature = base64.b64encode(hash).decode("utf-8")
212
  return hmac.compare_digest(expected_signature, signature)
213
 
214
+ # Run the process
215
  def no_process_file(prompt, foldername):
216
  set_environment_variables()
217
  try:
 
243
  stdout, stderr = proc.communicate(input="n\ny\ny\n")
244
  webhook_url = os.getenv("chat_url")
245
  token = os.getenv("token")
246
+ # Create the source on GitHub
247
+ #log_error("github でエラーが起きました")
248
+ try:
249
+ url = github(token,foldername)
250
+ except Exception as e:
251
+ log_error(e)
252
+
253
  title = """ラインで作るオープンシステム
254
  お客様の質問内容の
255
  プログラムを作成しました"""
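In the hunk above, `url` is assigned only inside the `try`, so it stays unbound if `github(...)` raises. Whether later code reads `url` is not visible in this hunk. A minimal sketch of one way to keep the name defined, assuming a hypothetical wrapper `create_source_url` that takes the GitHub function and the error reporter as arguments:

```python
from typing import Callable, Optional


def create_source_url(
    github_fn: Callable[[str, str], str],
    token: str,
    foldername: str,
    on_error: Callable[[Exception], None],
) -> Optional[str]:
    """Return the URL from `github_fn`, or None after reporting a failure."""
    try:
        return github_fn(token, foldername)
    except Exception as exc:
        # Report the failure, then return an explicit sentinel so callers
        # can branch on None instead of hitting an unbound variable.
        on_error(exc)
        return None
```

A caller would then write `url = create_source_url(github, token, foldername, log_error)` and check `if url is None` before building the notification.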
mysite/interpreter/prompt.py CHANGED
@@ -9,14 +9,40 @@ from langchain_core.prompts import (
9
  from langchain_core.messages import SystemMessage
10
  from langchain.chains.conversation.memory import ConversationBufferWindowMemory
11
  from langchain_groq import ChatGroq
12
 
13
 
14
- def prompt_genalate(word):
15
  # Get Groq API key
16
  groq_api_key = os.getenv("api_key")
17
  groq_chat = ChatGroq(groq_api_key=groq_api_key, model_name="llama3-70b-8192")
18
 
19
- system_prompt = "あなたはプロンプト作成の優秀なアシスタントです。答えは日本語で答えます"
20
  conversational_memory_length = 50
21
 
22
  memory = ConversationBufferWindowMemory(
@@ -42,6 +68,13 @@ def prompt_genalate(word):
42
  HumanMessagePromptTemplate.from_template("{human_input}"),
43
  ]
44
  )
 
 
 
 
 
 
 
45
 
46
  conversation = LLMChain(
47
  llm=groq_chat,
@@ -54,4 +87,4 @@ def prompt_genalate(word):
54
  print("User: ", user_question)
55
  print("Assistant:", response)
56
 
57
- return user_question+"[役割]"+response
 
9
  from langchain_core.messages import SystemMessage
10
  from langchain.chains.conversation.memory import ConversationBufferWindowMemory
11
  from langchain_groq import ChatGroq
12
+ from groq import Groq
13
+
14
+ def test_prompt(prompt,question):
15
+ client = Groq(api_key=os.getenv("api_key"))
16
+ completion = client.chat.completions.create(
17
+ model="llama3-8b-8192",
18
+ messages=[
19
+ {
20
+ "role": "system",
21
+ "content": prompt+" 毎回日本語で答える事"
22
+ },
23
+ {
24
+ "role": "user",
25
+ "content": question
26
+ },
27
+ ],
28
+ temperature=1,
29
+ max_tokens=1024,
30
+ top_p=1,
31
+ stream=False,
32
+ stop=None,
33
+ )
34
+
35
+ print(completion.choices[0].message)
36
+ return completion.choices[0].message.content
37
+
38
 
39
 
40
+ def prompt_genalate(word,sys_prompt="あなたはプロンプト作成の優秀なアシスタントです。答えは日本語で答えます"):
41
  # Get Groq API key
42
  groq_api_key = os.getenv("api_key")
43
  groq_chat = ChatGroq(groq_api_key=groq_api_key, model_name="llama3-70b-8192")
44
 
45
+ system_prompt = sys_prompt
46
  conversational_memory_length = 50
47
 
48
  memory = ConversationBufferWindowMemory(
 
68
  HumanMessagePromptTemplate.from_template("{human_input}"),
69
  ]
70
  )
71
+
72
+
73
+ # Format the prompt as a string
74
+ formatted_prompt = prompt.format(chat_history=memory.load_memory_variables(), human_input=user_question)
75
+
76
+ print("Formatted Prompt:\n", formatted_prompt)
77
+
78
 
79
  conversation = LLMChain(
80
  llm=groq_chat,
 
87
  print("User: ", user_question)
88
  print("Assistant:", response)
89
 
90
+ return user_question+"[役割]"+response,response
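With this change `prompt_genalate` returns a two-element tuple instead of a string. Callers in this commit that still assign the result to a single name (`log_error` in `mysite/logger.py` and the webhook's `except` branch) therefore receive the tuple. A minimal sketch of the difference, using a stand-in `fake_prompt_genalate` that is hypothetical and does not call Groq:

```python
def fake_prompt_genalate(word, sys_prompt="assistant"):
    """Stand-in with the same return shape as the changed function."""
    response = f"({sys_prompt}) reply to: {word}"
    return word + "[役割]" + response, response


# Unpacking matches the new return shape.
combined, response_only = fake_prompt_genalate("ルビーの買取")

# A single-name assignment now holds a tuple, not a string.
single = fake_prompt_genalate("ルビーの買取")

# One defensive way to recover the string for a card subtitle.
subtitle = single[0] if isinstance(single, tuple) else single
```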
mysite/logger.py CHANGED
@@ -1,8 +1,28 @@
1
  import logging
 
 
 
 
 
2
  logging.basicConfig(level=logging.INFO)
3
  logger = logging.getLogger(__name__)
4
  file_handler = logging.FileHandler("app.log")
5
  file_handler.setLevel(logging.INFO)
6
  formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
7
  file_handler.setFormatter(formatter)
8
- logger.addHandler(file_handler)
 
1
  import logging
2
+ import os
3
+ from mysite.interpreter.prompt import prompt_genalate
4
+ from mysite.interpreter.google_chat import send_google_chat_card
5
+
6
+ # Logger setup
7
  logging.basicConfig(level=logging.INFO)
8
  logger = logging.getLogger(__name__)
9
  file_handler = logging.FileHandler("app.log")
10
  file_handler.setLevel(logging.INFO)
11
  formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
12
  file_handler.setFormatter(formatter)
13
+ logger.addHandler(file_handler)
14
+
15
+ def log_error(logs):
16
+ # Record the log message
17
+ logger.error("エラーが発生しました: %s", logs)
18
+
19
+ # Get the webhook URL from the environment (None if unset)
20
+ webhook_url = os.getenv("chat_url")
21
+ # Pass the log message to the prompt generator to produce the subtitle
22
+ promps = prompt_genalate(str(logs),"エラー内容を修正")
23
+ title = "LOG"
24
+ subtitle = promps
25
+ link_text = "test"
26
+ link_url = "url"
27
+ # Send the Google Chat card
28
+ send_google_chat_card(webhook_url, title, subtitle, link_text, link_url)
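`log_error` above calls the prompt generator and the chat webhook from inside an error path, so a failure in either call would raise a second exception while the first is being reported. A minimal sketch of a guarded variant, assuming hypothetical injected callables `summarize` and `notify` in place of `prompt_genalate` and `send_google_chat_card`:

```python
import logging
from typing import Callable

logger = logging.getLogger("sketch")


def log_error_safely(
    error: object,
    summarize: Callable[[str], str],
    notify: Callable[[str, str], None],
) -> str:
    """Log `error`, then notify with a summary; fall back to the raw text."""
    text = str(error)
    logger.error("An error occurred: %s", text)
    try:
        subtitle = summarize(text)
    except Exception:
        # The summarizer (an LLM call in the original) is allowed to fail.
        logger.exception("Summarizer failed; sending the raw error text")
        subtitle = text
    try:
        notify("LOG", subtitle)
    except Exception:
        # Notification is best effort; the error is already in the log.
        logger.exception("Notification failed")
    return subtitle
```

Passing the two callables in also keeps the logging module free of imports from the prompt and chat modules.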
mysite/routers/fastapi.py CHANGED
@@ -11,7 +11,8 @@ import pkgutil
11
  from mysite.libs.utilities import validate_signature, no_process_file
12
  #from mysite.database.database import ride,create_ride
13
  from controllers.gra_04_database.rides import test_set_lide
14
- from mysite.interpreter.prompt import prompt_genalate
 
15
 
16
  logger = logging.getLogger(__name__)
17
 
@@ -46,7 +47,7 @@ def include_routers(app):
46
  logger.error(f"Module not found: {e}")
47
  except Exception as e:
48
  logger.error(f"An error occurred: {e}")
49
- from mysite.interpreter.google_chat import send_google_chat_card
50
  #from routers.webhooks import router
51
  def setup_webhook_routes(app: FastAPI):
52
  from polls.routers import register_routers
@@ -70,39 +71,58 @@ def setup_webhook_routes(app: FastAPI):
70
  """
71
  @app.post("/webhook")
72
  async def webhook(request: Request):
73
- logger.info("[Start] ====== LINE webhook ======")
 
 
 
 
 
 
 
 
 
 
74
  try:
75
- body = await request.body()
76
- received_headers = dict(request.headers)
77
- body_str = body.decode("utf-8")
78
- logger.info("Received Body: %s", body_str)
79
- body_json = json.loads(body_str)
80
- events = body_json.get("events", [])
81
-
82
- webhook_url = os.getenv("chat_url")
83
- token = os.getenv("token")
84
- #url = github(token,foldername)
85
-
86
-
87
-
88
  for event in events:
89
  if event["type"] == "message" and event["message"]["type"] == "text":
90
  user_id = event["source"]["userId"]
91
  text = event["message"]["text"]
92
- logger.info("------------------------------------------")
93
  first_line = text.split('\n')[0]
94
- logger.info(f"User ID: {user_id}, Text: {text}")
95
- promps = prompt_genalate(text)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
96
  #test_set_lide(text,"a1")
97
  #no_process_file(text, "ai")
98
- title = """本番テスト 入力内容のみ設定 プロンプトも付け足してはテスト """
99
  subtitle = promps
100
  link_text = "test"
101
  link_url = "url"
102
  #test_set_lide(subtitle, text)
103
  send_google_chat_card(webhook_url, title, subtitle, link_text, link_url)
 
 
 
 
 
 
 
104
  #
105
- #return
106
 
107
  for event in events:
108
  if event["type"] == "message" and event["message"]["type"] == "text":
@@ -149,5 +169,14 @@ def setup_webhook_routes(app: FastAPI):
149
  return {"status": "success", "response_content": response.text}, response.status_code
150
 
151
  except Exception as e:
 
 
 
 
 
 
 
 
 
152
  logger.error("Error: %s", str(e))
153
  raise HTTPException(status_code=500, detail=str(e))
 
11
  from mysite.libs.utilities import validate_signature, no_process_file
12
  #from mysite.database.database import ride,create_ride
13
  from controllers.gra_04_database.rides import test_set_lide
14
+ from mysite.interpreter.prompt import prompt_genalate,test_prompt
15
+ from mysite.interpreter.google_chat import send_google_chat_card
16
 
17
  logger = logging.getLogger(__name__)
18
 
 
47
  logger.error(f"Module not found: {e}")
48
  except Exception as e:
49
  logger.error(f"An error occurred: {e}")
50
+
51
  #from routers.webhooks import router
52
  def setup_webhook_routes(app: FastAPI):
53
  from polls.routers import register_routers
 
71
  """
72
  @app.post("/webhook")
73
  async def webhook(request: Request):
74
+ #logger.info("[Start] ====== LINE webhook ======")
75
+ body = await request.body()
76
+ received_headers = dict(request.headers)
77
+ body_str = body.decode("utf-8")
78
+ #logger.info("Received Body: %s", body_str)
79
+ body_json = json.loads(body_str)
80
+ events = body_json.get("events", [])
81
+
82
+ webhook_url = os.getenv("chat_url")
83
+ token = os.getenv("token")
84
+ #url = github(token,foldername)
85
  try:
86
  for event in events:
87
  if event["type"] == "message" and event["message"]["type"] == "text":
88
  user_id = event["source"]["userId"]
89
  text = event["message"]["text"]
90
+ #logger.info("------------------------------------------")
91
  first_line = text.split('\n')[0]
92
+ #logger.info(f"User ID: {user_id}, Text: {text}")
93
+ prompt = """
94
+
95
+
96
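+ 1, Create a Q&A table
97
+ 2, When a question arrives, first search the past data for an answer to it
98
+ 3, Create a role from the Q&A
99
+ Produce an answer to the question and think through a scenario
100
+ 4, Run a real test and check that the correct answer comes out
101
+ 5, If it does not, create the prompt again
102
+ Repeat steps 1 to 6 and register the prompt that produced the answer
103
+ 7, If it succeeds, save it
104
+ 8, When the same question arrives again, switch the prompt per question
105
+ 9, Keep revising the above every day until the internal staff are satisfied with the answers to LINE questions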
106
+ """
107
+
108
+ promps,prompt_res = prompt_genalate(text)
109
  #test_set_lide(text,"a1")
110
  #no_process_file(text, "ai")
111
+ title = """ Prompt creation """
112
  subtitle = promps
113
  link_text = "test"
114
  link_url = "url"
115
  #test_set_lide(subtitle, text)
116
  send_google_chat_card(webhook_url, title, subtitle, link_text, link_url)
117
+ #test case
118
+ first_line = text.split('\n')[0]
119
+ #test_prompt
120
+ res = test_prompt(prompt_res,first_line)
121
+ send_google_chat_card(webhook_url, "Prompt test: "+first_line, str(res), link_text, link_url)
122
+
123
+
124
  #
125
+ return
126
 
127
  for event in events:
128
  if event["type"] == "message" and event["message"]["type"] == "text":
 
169
  return {"status": "success", "response_content": response.text}, response.status_code
170
 
171
  except Exception as e:
172
+ promps, _ = prompt_genalate(str(e))
173
+ #test_set_lide(text,"a1")
174
+ #no_process_file(text, "ai")
175
+ title = """Production test: set the input content only; also test with the prompt appended """
176
+ subtitle = promps
177
+ link_text = "test"
178
+ link_url = "url"
179
+ #test_set_lide(subtitle, text)
180
+ send_google_chat_card(webhook_url, title, subtitle, link_text, link_url)
181
  logger.error("Error: %s", str(e))
182
  raise HTTPException(status_code=500, detail=str(e))
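
Note on the webhook hunk above: the handler decodes the request body, reads `events`, keeps only text messages, and takes the first line of each message. The sketch below isolates that parsing step so it can be exercised without FastAPI or the LINE platform. It is only an illustration under assumptions: `TextEvent`, `extract_text_events`, and the sample user IDs are hypothetical names that do not appear in the commit, and the payload shape is taken solely from the keys the handler reads (`events`, `type`, `message.type`, `message.text`, `source.userId`).

```python
import json
from typing import List, NamedTuple


class TextEvent(NamedTuple):
    """Hypothetical container for the fields the handler reads per event."""
    user_id: str
    text: str
    first_line: str


def extract_text_events(body_str: str) -> List[TextEvent]:
    """Return the text-message events found in a LINE-style webhook body."""
    body_json = json.loads(body_str)
    result = []
    for event in body_json.get("events", []):
        # Skip anything that is not a message event (for example a follow).
        if event.get("type") != "message":
            continue
        message = event.get("message", {})
        # Skip non-text messages such as stickers or images.
        if message.get("type") != "text":
            continue
        text = message.get("text", "")
        result.append(
            TextEvent(
                user_id=event.get("source", {}).get("userId", ""),
                text=text,
                first_line=text.split("\n")[0],
            )
        )
    return result


# Made-up sample payload: one text message and one non-message event.
sample_body = json.dumps({
    "events": [
        {"type": "message", "source": {"userId": "U123"},
         "message": {"type": "text", "text": "hello\nsecond line"}},
        {"type": "follow", "source": {"userId": "U456"}},
    ]
})
events = extract_text_events(sample_body)
```

Using `.get()` with defaults means a malformed or partial event is skipped rather than raising `KeyError`, which would otherwise fall through to the handler's generic `except Exception` branch and return a 500.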
polls/test.ipynb ADDED
@@ -0,0 +1,427 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {},
7
+ "outputs": [
8
+ {
9
+ "name": "stdout",
10
+ "output_type": "stream",
11
+ "text": [
12
+ "Requirement already satisfied: datasets in /usr/local/lib/python3.10/site-packages (2.19.2)\n",
13
+ "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/site-packages (from datasets) (1.26.4)\n",
14
+ "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/site-packages (from datasets) (6.0.1)\n",
15
+ "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/site-packages (from datasets) (3.4.1)\n",
16
+ "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/site-packages (from datasets) (3.9.5)\n",
17
+ "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/site-packages (from datasets) (0.3.8)\n",
18
+ "Requirement already satisfied: fsspec[http]<=2024.3.1,>=2023.1.0 in /usr/local/lib/python3.10/site-packages (from datasets) (2024.3.1)\n",
19
+ "Requirement already satisfied: packaging in /usr/local/lib/python3.10/site-packages (from datasets) (24.0)\n",
20
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.10/site-packages (from datasets) (3.14.0)\n",
21
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.10/site-packages (from datasets) (2.2.2)\n",
22
+ "Requirement already satisfied: requests>=2.32.1 in /usr/local/lib/python3.10/site-packages (from datasets) (2.32.3)\n",
23
+ "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/site-packages (from datasets) (4.66.4)\n",
24
+ "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/site-packages (from datasets) (16.1.0)\n",
25
+ "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/site-packages (from datasets) (0.6)\n",
26
+ "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/site-packages (from datasets) (0.70.16)\n",
27
+ "Requirement already satisfied: huggingface-hub>=0.21.2 in /usr/local/lib/python3.10/site-packages (from datasets) (0.23.3)\n",
28
+ "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n",
29
+ "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (23.2.0)\n",
30
+ "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.1)\n",
31
+ "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n",
32
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.5)\n",
33
+ "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.4)\n",
34
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/site-packages (from huggingface-hub>=0.21.2->datasets) (4.10.0)\n",
35
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/site-packages (from requests>=2.32.1->datasets) (2024.2.2)\n",
36
+ "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/site-packages (from requests>=2.32.1->datasets) (3.6)\n",
37
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/site-packages (from requests>=2.32.1->datasets) (3.3.2)\n",
38
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/site-packages (from requests>=2.32.1->datasets) (2.2.1)\n",
39
+ "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n",
40
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n",
41
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2.9.0.post0)\n",
42
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
43
+ "\n",
44
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
45
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
46
+ ]
47
+ }
48
+ ],
49
+ "source": [
50
+ "!pip install datasets"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "code",
55
+ "execution_count": 2,
56
+ "metadata": {},
57
+ "outputs": [
58
+ {
59
+ "name": "stderr",
60
+ "output_type": "stream",
61
+ "text": [
62
+ "/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
63
+ " from .autonotebook import tqdm as notebook_tqdm\n"
64
+ ]
65
+ }
66
+ ],
67
+ "source": [
68
+ "import pandas as pd\n",
69
+ "from datasets import Dataset\n",
70
+ "\n",
71
+ "# Create a dataset of QA pairs\n",
72
+ "data = {\n",
73
+ " \"question\": [\"What is the capital of France?\", \"Who wrote 1984?\", \"What is the largest planet in our solar system?\"],\n",
74
+ " \"answer\": [\"Paris\", \"George Orwell\", \"Jupiter\"]\n",
75
+ "}\n",
76
+ "\n",
77
+ "df = pd.DataFrame(data)\n",
78
+ "dataset = Dataset.from_pandas(df)\n"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": 3,
84
+ "metadata": {},
85
+ "outputs": [
86
+ {
87
+ "name": "stderr",
88
+ "output_type": "stream",
89
+ "text": [
90
+ "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
91
+ " warnings.warn(\n",
92
+ "Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']\n",
93
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
94
+ ]
95
+ }
96
+ ],
97
+ "source": [
98
+ "from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments\n",
99
+ "\n",
100
+ "model_name = \"distilbert-base-uncased\"\n",
101
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
102
+ "model = AutoModelForQuestionAnswering.from_pretrained(model_name)\n"
103
+ ]
104
+ },
105
+ {
106
+ "cell_type": "code",
107
+ "execution_count": 5,
108
+ "metadata": {},
109
+ "outputs": [
110
+ {
111
+ "name": "stderr",
112
+ "output_type": "stream",
113
+ "text": [
114
+ "Map: 100%|██████████| 3/3 [00:00<00:00, 576.25 examples/s]\n"
115
+ ]
116
+ },
117
+ {
118
+ "ename": "ValueError",
119
+ "evalue": "The model did not return a loss from the inputs, only the following keys: start_logits,end_logits. For reference, the inputs it received are input_ids,attention_mask.",
120
+ "output_type": "error",
121
+ "traceback": [
122
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
123
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
124
+ "\u001b[1;32m/home/user/app/polls/test.ipynb Cell 4\u001b[0m line \u001b[0;36m2\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=9'>10</a>\u001b[0m training_args \u001b[39m=\u001b[39m TrainingArguments(\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=10'>11</a>\u001b[0m output_dir\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m./results\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=11'>12</a>\u001b[0m evaluation_strategy\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mepoch\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=15'>16</a>\u001b[0m weight_decay\u001b[39m=\u001b[39m\u001b[39m0.01\u001b[39m,\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=16'>17</a>\u001b[0m )\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=18'>19</a>\u001b[0m trainer \u001b[39m=\u001b[39m Trainer(\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=19'>20</a>\u001b[0m model\u001b[39m=\u001b[39mmodel,\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=20'>21</a>\u001b[0m args\u001b[39m=\u001b[39mtraining_args,\n\u001b[1;32m <a 
href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=21'>22</a>\u001b[0m train_dataset\u001b[39m=\u001b[39mtokenized_dataset,\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=22'>23</a>\u001b[0m eval_dataset\u001b[39m=\u001b[39mtokenized_dataset,\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=23'>24</a>\u001b[0m )\n\u001b[0;32m---> <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W3sdnNjb2RlLXJlbW90ZQ%3D%3D?line=25'>26</a>\u001b[0m trainer\u001b[39m.\u001b[39;49mtrain()\n",
125
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/transformers/trainer.py:1885\u001b[0m, in \u001b[0;36mTrainer.train\u001b[0;34m(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)\u001b[0m\n\u001b[1;32m 1883\u001b[0m hf_hub_utils\u001b[39m.\u001b[39menable_progress_bars()\n\u001b[1;32m 1884\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m-> 1885\u001b[0m \u001b[39mreturn\u001b[39;00m inner_training_loop(\n\u001b[1;32m 1886\u001b[0m args\u001b[39m=\u001b[39;49margs,\n\u001b[1;32m 1887\u001b[0m resume_from_checkpoint\u001b[39m=\u001b[39;49mresume_from_checkpoint,\n\u001b[1;32m 1888\u001b[0m trial\u001b[39m=\u001b[39;49mtrial,\n\u001b[1;32m 1889\u001b[0m ignore_keys_for_eval\u001b[39m=\u001b[39;49mignore_keys_for_eval,\n\u001b[1;32m 1890\u001b[0m )\n",
126
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/transformers/trainer.py:2216\u001b[0m, in \u001b[0;36mTrainer._inner_training_loop\u001b[0;34m(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)\u001b[0m\n\u001b[1;32m 2213\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcontrol \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcallback_handler\u001b[39m.\u001b[39mon_step_begin(args, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mstate, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcontrol)\n\u001b[1;32m 2215\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39maccelerator\u001b[39m.\u001b[39maccumulate(model):\n\u001b[0;32m-> 2216\u001b[0m tr_loss_step \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mtraining_step(model, inputs)\n\u001b[1;32m 2218\u001b[0m \u001b[39mif\u001b[39;00m (\n\u001b[1;32m 2219\u001b[0m args\u001b[39m.\u001b[39mlogging_nan_inf_filter\n\u001b[1;32m 2220\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m is_torch_xla_available()\n\u001b[1;32m 2221\u001b[0m \u001b[39mand\u001b[39;00m (torch\u001b[39m.\u001b[39misnan(tr_loss_step) \u001b[39mor\u001b[39;00m torch\u001b[39m.\u001b[39misinf(tr_loss_step))\n\u001b[1;32m 2222\u001b[0m ):\n\u001b[1;32m 2223\u001b[0m \u001b[39m# if loss is nan or inf simply add the average of previous logged losses\u001b[39;00m\n\u001b[1;32m 2224\u001b[0m tr_loss \u001b[39m+\u001b[39m\u001b[39m=\u001b[39m tr_loss \u001b[39m/\u001b[39m (\u001b[39m1\u001b[39m \u001b[39m+\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mstate\u001b[39m.\u001b[39mglobal_step \u001b[39m-\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_globalstep_last_logged)\n",
127
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/transformers/trainer.py:3238\u001b[0m, in \u001b[0;36mTrainer.training_step\u001b[0;34m(self, model, inputs)\u001b[0m\n\u001b[1;32m 3235\u001b[0m \u001b[39mreturn\u001b[39;00m loss_mb\u001b[39m.\u001b[39mreduce_mean()\u001b[39m.\u001b[39mdetach()\u001b[39m.\u001b[39mto(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39margs\u001b[39m.\u001b[39mdevice)\n\u001b[1;32m 3237\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcompute_loss_context_manager():\n\u001b[0;32m-> 3238\u001b[0m loss \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mcompute_loss(model, inputs)\n\u001b[1;32m 3240\u001b[0m \u001b[39mdel\u001b[39;00m inputs\n\u001b[1;32m 3241\u001b[0m torch\u001b[39m.\u001b[39mcuda\u001b[39m.\u001b[39mempty_cache()\n",
128
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/transformers/trainer.py:3282\u001b[0m, in \u001b[0;36mTrainer.compute_loss\u001b[0;34m(self, model, inputs, return_outputs)\u001b[0m\n\u001b[1;32m 3280\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 3281\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39misinstance\u001b[39m(outputs, \u001b[39mdict\u001b[39m) \u001b[39mand\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39mloss\u001b[39m\u001b[39m\"\u001b[39m \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m outputs:\n\u001b[0;32m-> 3282\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 3283\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mThe model did not return a loss from the inputs, only the following keys: \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 3284\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m,\u001b[39m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mjoin(outputs\u001b[39m.\u001b[39mkeys())\u001b[39m}\u001b[39;00m\u001b[39m. For reference, the inputs it received are \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m,\u001b[39m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mjoin(inputs\u001b[39m.\u001b[39mkeys())\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 3285\u001b[0m )\n\u001b[1;32m 3286\u001b[0m \u001b[39m# We don't use .loss here since the model may return tuples instead of ModelOutput.\u001b[39;00m\n\u001b[1;32m 3287\u001b[0m loss \u001b[39m=\u001b[39m outputs[\u001b[39m\"\u001b[39m\u001b[39mloss\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39mif\u001b[39;00m \u001b[39misinstance\u001b[39m(outputs, \u001b[39mdict\u001b[39m) \u001b[39melse\u001b[39;00m outputs[\u001b[39m0\u001b[39m]\n",
129
+ "\u001b[0;31mValueError\u001b[0m: The model did not return a loss from the inputs, only the following keys: start_logits,end_logits. For reference, the inputs it received are input_ids,attention_mask."
130
+ ]
131
+ }
132
+ ],
133
+ "source": [
134
+ "def preprocess_function(examples):\n",
135
+ " questions = examples[\"question\"]\n",
136
+ " answers = examples[\"answer\"]\n",
137
+ " inputs = tokenizer(questions, truncation=True, padding=True)\n",
138
+ " inputs[\"labels\"] = tokenizer(answers, truncation=True, padding=True)[\"input_ids\"]\n",
139
+ " return inputs\n",
140
+ "\n",
141
+ "tokenized_dataset = dataset.map(preprocess_function, batched=True)\n",
142
+ "\n",
143
+ "training_args = TrainingArguments(\n",
144
+ " output_dir=\"./results\",\n",
145
+ " evaluation_strategy=\"epoch\",\n",
146
+ " learning_rate=2e-5,\n",
147
+ " per_device_train_batch_size=2,\n",
148
+ " num_train_epochs=3,\n",
149
+ " weight_decay=0.01,\n",
150
+ ")\n",
151
+ "\n",
152
+ "trainer = Trainer(\n",
153
+ " model=model,\n",
154
+ " args=training_args,\n",
155
+ " train_dataset=tokenized_dataset,\n",
156
+ " eval_dataset=tokenized_dataset,\n",
157
+ ")\n",
158
+ "\n",
159
+ "trainer.train()\n"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "code",
164
+ "execution_count": 6,
165
+ "metadata": {},
166
+ "outputs": [
167
+ {
168
+ "name": "stderr",
169
+ "output_type": "stream",
170
+ "text": [
171
+ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
172
+ "To disable this warning, you can either:\n",
173
+ "\t- Avoid using `tokenizers` before the fork if possible\n",
174
+ "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
175
+ ]
176
+ },
177
+ {
178
+ "name": "stdout",
179
+ "output_type": "stream",
180
+ "text": [
181
+ "Requirement already satisfied: transformers in /usr/local/lib/python3.10/site-packages (4.41.2)\n",
182
+ "Requirement already satisfied: datasets in /usr/local/lib/python3.10/site-packages (2.19.2)\n",
183
+ "Collecting faiss-cpu\n",
184
+ " Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)\n",
185
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m27.0/27.0 MB\u001b[0m \u001b[31m64.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
186
+ "\u001b[?25hRequirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/site-packages (from transformers) (6.0.1)\n",
187
+ "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/site-packages (from transformers) (2024.5.15)\n",
188
+ "Requirement already satisfied: requests in /usr/local/lib/python3.10/site-packages (from transformers) (2.32.3)\n",
189
+ "Requirement already satisfied: filelock in /usr/local/lib/python3.10/site-packages (from transformers) (3.14.0)\n",
190
+ "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/site-packages (from transformers) (0.19.1)\n",
191
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.23.0 in /usr/local/lib/python3.10/site-packages (from transformers) (0.23.3)\n",
192
+ "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/site-packages (from transformers) (4.66.4)\n",
193
+ "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/site-packages (from transformers) (1.26.4)\n",
194
+ "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/site-packages (from transformers) (0.4.3)\n",
195
+ "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/site-packages (from transformers) (24.0)\n",
196
+ "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/site-packages (from datasets) (0.70.16)\n",
197
+ "Requirement already satisfied: fsspec[http]<=2024.3.1,>=2023.1.0 in /usr/local/lib/python3.10/site-packages (from datasets) (2024.3.1)\n",
198
+ "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/site-packages (from datasets) (0.6)\n",
199
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.10/site-packages (from datasets) (2.2.2)\n",
200
+ "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/site-packages (from datasets) (3.4.1)\n",
201
+ "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/site-packages (from datasets) (16.1.0)\n",
202
+ "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/site-packages (from datasets) (0.3.8)\n",
203
+ "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/site-packages (from datasets) (3.9.5)\n",
204
+ "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (4.0.3)\n",
205
+ "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.4.1)\n",
206
+ "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (23.2.0)\n",
207
+ "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.3.1)\n",
208
+ "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (1.9.4)\n",
209
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/site-packages (from aiohttp->datasets) (6.0.5)\n",
210
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/site-packages (from huggingface-hub<1.0,>=0.23.0->transformers) (4.10.0)\n",
211
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/site-packages (from requests->transformers) (3.3.2)\n",
212
+ "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/site-packages (from requests->transformers) (3.6)\n",
213
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/site-packages (from requests->transformers) (2.2.1)\n",
214
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/site-packages (from requests->transformers) (2024.2.2)\n",
215
+ "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n",
216
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2024.1)\n",
217
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/site-packages (from pandas->datasets) (2.9.0.post0)\n",
218
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n",
219
+ "Installing collected packages: faiss-cpu\n",
220
+ "Successfully installed faiss-cpu-1.8.0\n",
221
+ "\n",
222
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
223
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
224
+ ]
225
+ }
226
+ ],
227
+ "source": [
228
+ "!pip install transformers datasets faiss-cpu\n"
229
+ ]
230
+ },
231
+ {
232
+ "cell_type": "code",
233
+ "execution_count": 8,
234
+ "metadata": {},
235
+ "outputs": [
236
+ {
237
+ "ename": "ValueError",
238
+ "evalue": "Loading wiki_dpr requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.",
239
+ "output_type": "error",
240
+ "traceback": [
241
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
242
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
243
+ "\u001b[1;32m/home/user/app/polls/test.ipynb Cell 6\u001b[0m line \u001b[0;36m4\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W5sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mdatasets\u001b[39;00m \u001b[39mimport\u001b[39;00m load_dataset\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W5sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2'>3</a>\u001b[0m \u001b[39m# データセットのロード\u001b[39;00m\n\u001b[0;32m----> <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W5sdnNjb2RlLXJlbW90ZQ%3D%3D?line=3'>4</a>\u001b[0m dataset \u001b[39m=\u001b[39m load_dataset(\u001b[39m'\u001b[39;49m\u001b[39mwiki_dpr\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mpsgs_w100\u001b[39;49m\u001b[39m'\u001b[39;49m)\n",
244
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:2592\u001b[0m, in \u001b[0;36mload_dataset\u001b[0;34m(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)\u001b[0m\n\u001b[1;32m 2587\u001b[0m verification_mode \u001b[39m=\u001b[39m VerificationMode(\n\u001b[1;32m 2588\u001b[0m (verification_mode \u001b[39mor\u001b[39;00m VerificationMode\u001b[39m.\u001b[39mBASIC_CHECKS) \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m save_infos \u001b[39melse\u001b[39;00m VerificationMode\u001b[39m.\u001b[39mALL_CHECKS\n\u001b[1;32m 2589\u001b[0m )\n\u001b[1;32m 2591\u001b[0m \u001b[39m# Create a dataset builder\u001b[39;00m\n\u001b[0;32m-> 2592\u001b[0m builder_instance \u001b[39m=\u001b[39m load_dataset_builder(\n\u001b[1;32m 2593\u001b[0m path\u001b[39m=\u001b[39;49mpath,\n\u001b[1;32m 2594\u001b[0m name\u001b[39m=\u001b[39;49mname,\n\u001b[1;32m 2595\u001b[0m data_dir\u001b[39m=\u001b[39;49mdata_dir,\n\u001b[1;32m 2596\u001b[0m data_files\u001b[39m=\u001b[39;49mdata_files,\n\u001b[1;32m 2597\u001b[0m cache_dir\u001b[39m=\u001b[39;49mcache_dir,\n\u001b[1;32m 2598\u001b[0m features\u001b[39m=\u001b[39;49mfeatures,\n\u001b[1;32m 2599\u001b[0m download_config\u001b[39m=\u001b[39;49mdownload_config,\n\u001b[1;32m 2600\u001b[0m download_mode\u001b[39m=\u001b[39;49mdownload_mode,\n\u001b[1;32m 2601\u001b[0m revision\u001b[39m=\u001b[39;49mrevision,\n\u001b[1;32m 2602\u001b[0m token\u001b[39m=\u001b[39;49mtoken,\n\u001b[1;32m 2603\u001b[0m storage_options\u001b[39m=\u001b[39;49mstorage_options,\n\u001b[1;32m 2604\u001b[0m trust_remote_code\u001b[39m=\u001b[39;49mtrust_remote_code,\n\u001b[1;32m 2605\u001b[0m _require_default_config_name\u001b[39m=\u001b[39;49mname \u001b[39mis\u001b[39;49;00m 
\u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m 2606\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mconfig_kwargs,\n\u001b[1;32m 2607\u001b[0m )\n\u001b[1;32m 2609\u001b[0m \u001b[39m# Return iterable dataset in case of streaming\u001b[39;00m\n\u001b[1;32m 2610\u001b[0m \u001b[39mif\u001b[39;00m streaming:\n",
245
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:2264\u001b[0m, in \u001b[0;36mload_dataset_builder\u001b[0;34m(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)\u001b[0m\n\u001b[1;32m 2262\u001b[0m download_config \u001b[39m=\u001b[39m download_config\u001b[39m.\u001b[39mcopy() \u001b[39mif\u001b[39;00m download_config \u001b[39melse\u001b[39;00m DownloadConfig()\n\u001b[1;32m 2263\u001b[0m download_config\u001b[39m.\u001b[39mstorage_options\u001b[39m.\u001b[39mupdate(storage_options)\n\u001b[0;32m-> 2264\u001b[0m dataset_module \u001b[39m=\u001b[39m dataset_module_factory(\n\u001b[1;32m 2265\u001b[0m path,\n\u001b[1;32m 2266\u001b[0m revision\u001b[39m=\u001b[39;49mrevision,\n\u001b[1;32m 2267\u001b[0m download_config\u001b[39m=\u001b[39;49mdownload_config,\n\u001b[1;32m 2268\u001b[0m download_mode\u001b[39m=\u001b[39;49mdownload_mode,\n\u001b[1;32m 2269\u001b[0m data_dir\u001b[39m=\u001b[39;49mdata_dir,\n\u001b[1;32m 2270\u001b[0m data_files\u001b[39m=\u001b[39;49mdata_files,\n\u001b[1;32m 2271\u001b[0m cache_dir\u001b[39m=\u001b[39;49mcache_dir,\n\u001b[1;32m 2272\u001b[0m trust_remote_code\u001b[39m=\u001b[39;49mtrust_remote_code,\n\u001b[1;32m 2273\u001b[0m _require_default_config_name\u001b[39m=\u001b[39;49m_require_default_config_name,\n\u001b[1;32m 2274\u001b[0m _require_custom_configs\u001b[39m=\u001b[39;49m\u001b[39mbool\u001b[39;49m(config_kwargs),\n\u001b[1;32m 2275\u001b[0m )\n\u001b[1;32m 2276\u001b[0m \u001b[39m# Get dataset builder class from the processing script\u001b[39;00m\n\u001b[1;32m 2277\u001b[0m builder_kwargs \u001b[39m=\u001b[39m dataset_module\u001b[39m.\u001b[39mbuilder_kwargs\n",
246
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:1915\u001b[0m, in \u001b[0;36mdataset_module_factory\u001b[0;34m(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)\u001b[0m\n\u001b[1;32m 1910\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39misinstance\u001b[39m(e1, \u001b[39mFileNotFoundError\u001b[39;00m):\n\u001b[1;32m 1911\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mFileNotFoundError\u001b[39;00m(\n\u001b[1;32m 1912\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mCouldn\u001b[39m\u001b[39m'\u001b[39m\u001b[39mt find a dataset script at \u001b[39m\u001b[39m{\u001b[39;00mrelative_to_absolute_path(combined_path)\u001b[39m}\u001b[39;00m\u001b[39m or any data file in the same directory. \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1913\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mCouldn\u001b[39m\u001b[39m'\u001b[39m\u001b[39mt find \u001b[39m\u001b[39m'\u001b[39m\u001b[39m{\u001b[39;00mpath\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m on the Hugging Face Hub either: \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mtype\u001b[39m(e1)\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m: \u001b[39m\u001b[39m{\u001b[39;00me1\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1914\u001b[0m ) \u001b[39mfrom\u001b[39;00m \u001b[39mNone\u001b[39;00m\n\u001b[0;32m-> 1915\u001b[0m \u001b[39mraise\u001b[39;00m e1 \u001b[39mfrom\u001b[39;00m \u001b[39mNone\u001b[39;00m\n\u001b[1;32m 1916\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 1917\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mFileNotFoundError\u001b[39;00m(\n\u001b[1;32m 1918\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mCouldn\u001b[39m\u001b[39m'\u001b[39m\u001b[39mt find a dataset script at 
\u001b[39m\u001b[39m{\u001b[39;00mrelative_to_absolute_path(combined_path)\u001b[39m}\u001b[39;00m\u001b[39m or any data file in the same directory.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1919\u001b[0m )\n",
247
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:1888\u001b[0m, in \u001b[0;36mdataset_module_factory\u001b[0;34m(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)\u001b[0m\n\u001b[1;32m 1879\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[1;32m 1880\u001b[0m \u001b[39m# Otherwise we must use the dataset script if the user trusts it\u001b[39;00m\n\u001b[1;32m 1881\u001b[0m \u001b[39mreturn\u001b[39;00m HubDatasetModuleFactoryWithScript(\n\u001b[1;32m 1882\u001b[0m path,\n\u001b[1;32m 1883\u001b[0m revision\u001b[39m=\u001b[39;49mrevision,\n\u001b[1;32m 1884\u001b[0m download_config\u001b[39m=\u001b[39;49mdownload_config,\n\u001b[1;32m 1885\u001b[0m download_mode\u001b[39m=\u001b[39;49mdownload_mode,\n\u001b[1;32m 1886\u001b[0m dynamic_modules_path\u001b[39m=\u001b[39;49mdynamic_modules_path,\n\u001b[1;32m 1887\u001b[0m trust_remote_code\u001b[39m=\u001b[39;49mtrust_remote_code,\n\u001b[0;32m-> 1888\u001b[0m )\u001b[39m.\u001b[39;49mget_module()\n\u001b[1;32m 1889\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 1890\u001b[0m \u001b[39mreturn\u001b[39;00m HubDatasetModuleFactoryWithoutScript(\n\u001b[1;32m 1891\u001b[0m path,\n\u001b[1;32m 1892\u001b[0m revision\u001b[39m=\u001b[39mrevision,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1896\u001b[0m download_mode\u001b[39m=\u001b[39mdownload_mode,\n\u001b[1;32m 1897\u001b[0m )\u001b[39m.\u001b[39mget_module()\n",
248
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:1537\u001b[0m, in \u001b[0;36mHubDatasetModuleFactoryWithScript.get_module\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1526\u001b[0m _create_importable_file(\n\u001b[1;32m 1527\u001b[0m local_path\u001b[39m=\u001b[39mlocal_path,\n\u001b[1;32m 1528\u001b[0m local_imports\u001b[39m=\u001b[39mlocal_imports,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1534\u001b[0m download_mode\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mdownload_mode,\n\u001b[1;32m 1535\u001b[0m )\n\u001b[1;32m 1536\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m-> 1537\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 1538\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mLoading \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mname\u001b[39m}\u001b[39;00m\u001b[39m requires you to execute the dataset script in that\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1539\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m repo on your local machine. 
Make sure you have read the code there to avoid malicious use, then\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1540\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m set the option `trust_remote_code=True` to remove this error.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1541\u001b[0m )\n\u001b[1;32m 1542\u001b[0m module_path, \u001b[39mhash\u001b[39m \u001b[39m=\u001b[39m _load_importable_file(\n\u001b[1;32m 1543\u001b[0m dynamic_modules_path\u001b[39m=\u001b[39mdynamic_modules_path,\n\u001b[1;32m 1544\u001b[0m module_namespace\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mdatasets\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[1;32m 1545\u001b[0m subdirectory_name\u001b[39m=\u001b[39m\u001b[39mhash\u001b[39m,\n\u001b[1;32m 1546\u001b[0m name\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mname,\n\u001b[1;32m 1547\u001b[0m )\n\u001b[1;32m 1548\u001b[0m \u001b[39m# make the new module to be noticed by the import system\u001b[39;00m\n",
249
+ "\u001b[0;31mValueError\u001b[0m: Loading wiki_dpr requires you to execute the dataset script in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error."
250
+ ]
251
+ }
252
+ ],
253
+ "source": [
254
+ "from datasets import load_dataset\n",
255
+ "\n",
256
+ "# Load the dataset\n",
257
+ "dataset = load_dataset('wiki_dpr', 'psgs_w100')\n"
258
+ ]
259
+ },
260
+ {
261
+ "cell_type": "code",
262
+ "execution_count": 9,
263
+ "metadata": {},
264
+ "outputs": [
265
+ {
266
+ "ename": "ValueError",
267
+ "evalue": "BuilderConfig 'psgs_w100' not found. Available: ['psgs_w100.nq.exact', 'psgs_w100.nq.compressed', 'psgs_w100.nq.no_index', 'psgs_w100.multiset.exact', 'psgs_w100.multiset.compressed', 'psgs_w100.multiset.no_index', 'psgs_w100.nq.exact.no_embeddings', 'psgs_w100.nq.compressed.no_embeddings', 'psgs_w100.nq.no_index.no_embeddings', 'psgs_w100.multiset.exact.no_embeddings', 'psgs_w100.multiset.compressed.no_embeddings', 'psgs_w100.multiset.no_index.no_embeddings']",
268
+ "output_type": "error",
269
+ "traceback": [
270
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
271
+ "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
272
+ "\u001b[1;32m/home/user/app/polls/test.ipynb Cell 7\u001b[0m line \u001b[0;36m4\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W6sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mdatasets\u001b[39;00m \u001b[39mimport\u001b[39;00m load_dataset\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W6sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2'>3</a>\u001b[0m \u001b[39m# データセットのロード\u001b[39;00m\n\u001b[0;32m----> <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#W6sdnNjb2RlLXJlbW90ZQ%3D%3D?line=3'>4</a>\u001b[0m dataset \u001b[39m=\u001b[39m load_dataset(\u001b[39m'\u001b[39;49m\u001b[39mwiki_dpr\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mpsgs_w100\u001b[39;49m\u001b[39m'\u001b[39;49m, trust_remote_code\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)\n",
273
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:2592\u001b[0m, in \u001b[0;36mload_dataset\u001b[0;34m(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)\u001b[0m\n\u001b[1;32m 2587\u001b[0m verification_mode \u001b[39m=\u001b[39m VerificationMode(\n\u001b[1;32m 2588\u001b[0m (verification_mode \u001b[39mor\u001b[39;00m VerificationMode\u001b[39m.\u001b[39mBASIC_CHECKS) \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m save_infos \u001b[39melse\u001b[39;00m VerificationMode\u001b[39m.\u001b[39mALL_CHECKS\n\u001b[1;32m 2589\u001b[0m )\n\u001b[1;32m 2591\u001b[0m \u001b[39m# Create a dataset builder\u001b[39;00m\n\u001b[0;32m-> 2592\u001b[0m builder_instance \u001b[39m=\u001b[39m load_dataset_builder(\n\u001b[1;32m 2593\u001b[0m path\u001b[39m=\u001b[39;49mpath,\n\u001b[1;32m 2594\u001b[0m name\u001b[39m=\u001b[39;49mname,\n\u001b[1;32m 2595\u001b[0m data_dir\u001b[39m=\u001b[39;49mdata_dir,\n\u001b[1;32m 2596\u001b[0m data_files\u001b[39m=\u001b[39;49mdata_files,\n\u001b[1;32m 2597\u001b[0m cache_dir\u001b[39m=\u001b[39;49mcache_dir,\n\u001b[1;32m 2598\u001b[0m features\u001b[39m=\u001b[39;49mfeatures,\n\u001b[1;32m 2599\u001b[0m download_config\u001b[39m=\u001b[39;49mdownload_config,\n\u001b[1;32m 2600\u001b[0m download_mode\u001b[39m=\u001b[39;49mdownload_mode,\n\u001b[1;32m 2601\u001b[0m revision\u001b[39m=\u001b[39;49mrevision,\n\u001b[1;32m 2602\u001b[0m token\u001b[39m=\u001b[39;49mtoken,\n\u001b[1;32m 2603\u001b[0m storage_options\u001b[39m=\u001b[39;49mstorage_options,\n\u001b[1;32m 2604\u001b[0m trust_remote_code\u001b[39m=\u001b[39;49mtrust_remote_code,\n\u001b[1;32m 2605\u001b[0m _require_default_config_name\u001b[39m=\u001b[39;49mname \u001b[39mis\u001b[39;49;00m 
\u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m 2606\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mconfig_kwargs,\n\u001b[1;32m 2607\u001b[0m )\n\u001b[1;32m 2609\u001b[0m \u001b[39m# Return iterable dataset in case of streaming\u001b[39;00m\n\u001b[1;32m 2610\u001b[0m \u001b[39mif\u001b[39;00m streaming:\n",
274
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:2301\u001b[0m, in \u001b[0;36mload_dataset_builder\u001b[0;34m(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)\u001b[0m\n\u001b[1;32m 2299\u001b[0m builder_cls \u001b[39m=\u001b[39m get_dataset_builder_class(dataset_module, dataset_name\u001b[39m=\u001b[39mdataset_name)\n\u001b[1;32m 2300\u001b[0m \u001b[39m# Instantiate the dataset builder\u001b[39;00m\n\u001b[0;32m-> 2301\u001b[0m builder_instance: DatasetBuilder \u001b[39m=\u001b[39m builder_cls(\n\u001b[1;32m 2302\u001b[0m cache_dir\u001b[39m=\u001b[39;49mcache_dir,\n\u001b[1;32m 2303\u001b[0m dataset_name\u001b[39m=\u001b[39;49mdataset_name,\n\u001b[1;32m 2304\u001b[0m config_name\u001b[39m=\u001b[39;49mconfig_name,\n\u001b[1;32m 2305\u001b[0m data_dir\u001b[39m=\u001b[39;49mdata_dir,\n\u001b[1;32m 2306\u001b[0m data_files\u001b[39m=\u001b[39;49mdata_files,\n\u001b[1;32m 2307\u001b[0m \u001b[39mhash\u001b[39;49m\u001b[39m=\u001b[39;49mdataset_module\u001b[39m.\u001b[39;49mhash,\n\u001b[1;32m 2308\u001b[0m info\u001b[39m=\u001b[39;49minfo,\n\u001b[1;32m 2309\u001b[0m features\u001b[39m=\u001b[39;49mfeatures,\n\u001b[1;32m 2310\u001b[0m token\u001b[39m=\u001b[39;49mtoken,\n\u001b[1;32m 2311\u001b[0m storage_options\u001b[39m=\u001b[39;49mstorage_options,\n\u001b[1;32m 2312\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mbuilder_kwargs,\n\u001b[1;32m 2313\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mconfig_kwargs,\n\u001b[1;32m 2314\u001b[0m )\n\u001b[1;32m 2315\u001b[0m builder_instance\u001b[39m.\u001b[39m_use_legacy_cache_dir_if_possible(dataset_module)\n\u001b[1;32m 2317\u001b[0m \u001b[39mreturn\u001b[39;00m builder_instance\n",
275
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/builder.py:374\u001b[0m, in \u001b[0;36mDatasetBuilder.__init__\u001b[0;34m(self, cache_dir, dataset_name, config_name, hash, base_path, info, features, token, use_auth_token, repo_id, data_files, data_dir, storage_options, writer_batch_size, name, **config_kwargs)\u001b[0m\n\u001b[1;32m 372\u001b[0m config_kwargs[\u001b[39m\"\u001b[39m\u001b[39mdata_dir\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39m=\u001b[39m data_dir\n\u001b[1;32m 373\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mconfig_kwargs \u001b[39m=\u001b[39m config_kwargs\n\u001b[0;32m--> 374\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mconfig, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mconfig_id \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_create_builder_config(\n\u001b[1;32m 375\u001b[0m config_name\u001b[39m=\u001b[39;49mconfig_name,\n\u001b[1;32m 376\u001b[0m custom_features\u001b[39m=\u001b[39;49mfeatures,\n\u001b[1;32m 377\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mconfig_kwargs,\n\u001b[1;32m 378\u001b[0m )\n\u001b[1;32m 380\u001b[0m \u001b[39m# prepare info: DatasetInfo are a standardized dataclass across all datasets\u001b[39;00m\n\u001b[1;32m 381\u001b[0m \u001b[39m# Prefill datasetinfo\u001b[39;00m\n\u001b[1;32m 382\u001b[0m \u001b[39mif\u001b[39;00m info \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 383\u001b[0m \u001b[39m# TODO FOR PACKAGED MODULES IT IMPORTS DATA FROM src/packaged_modules which doesn't make sense\u001b[39;00m\n",
276
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/builder.py:599\u001b[0m, in \u001b[0;36mDatasetBuilder._create_builder_config\u001b[0;34m(self, config_name, custom_features, **config_kwargs)\u001b[0m\n\u001b[1;32m 597\u001b[0m builder_config \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuilder_configs\u001b[39m.\u001b[39mget(config_name)\n\u001b[1;32m 598\u001b[0m \u001b[39mif\u001b[39;00m builder_config \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mand\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mBUILDER_CONFIGS:\n\u001b[0;32m--> 599\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 600\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mBuilderConfig \u001b[39m\u001b[39m'\u001b[39m\u001b[39m{\u001b[39;00mconfig_name\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m not found. Available: \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mlist\u001b[39m(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuilder_configs\u001b[39m.\u001b[39mkeys())\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[1;32m 601\u001b[0m )\n\u001b[1;32m 603\u001b[0m \u001b[39m# if not using an existing config, then create a new config on the fly\u001b[39;00m\n\u001b[1;32m 604\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m builder_config:\n",
277
+ "\u001b[0;31mValueError\u001b[0m: BuilderConfig 'psgs_w100' not found. Available: ['psgs_w100.nq.exact', 'psgs_w100.nq.compressed', 'psgs_w100.nq.no_index', 'psgs_w100.multiset.exact', 'psgs_w100.multiset.compressed', 'psgs_w100.multiset.no_index', 'psgs_w100.nq.exact.no_embeddings', 'psgs_w100.nq.compressed.no_embeddings', 'psgs_w100.nq.no_index.no_embeddings', 'psgs_w100.multiset.exact.no_embeddings', 'psgs_w100.multiset.compressed.no_embeddings', 'psgs_w100.multiset.no_index.no_embeddings']"
278
+ ]
279
+ }
280
+ ],
281
+ "source": [
282
+ "from datasets import load_dataset\n",
283
+ "\n",
284
+ "# Load the dataset\n",
285
+ "dataset = load_dataset('wiki_dpr', 'psgs_w100', trust_remote_code=True)\n"
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": 10,
291
+ "metadata": {},
292
+ "outputs": [
293
+ {
294
+ "name": "stderr",
295
+ "output_type": "stream",
296
+ "text": [
297
+ "Downloading data: 0%| | 0/157 [02:34<?, ?files/s]\n"
298
+ ]
299
+ },
300
+ {
301
+ "ename": "KeyboardInterrupt",
302
+ "evalue": "",
303
+ "output_type": "error",
304
+ "traceback": [
305
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
306
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
307
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/tqdm/contrib/concurrent.py:51\u001b[0m, in \u001b[0;36m_executor_map\u001b[0;34m(PoolExecutor, fn, *iterables, **tqdm_kwargs)\u001b[0m\n\u001b[1;32m 49\u001b[0m \u001b[39mwith\u001b[39;00m PoolExecutor(max_workers\u001b[39m=\u001b[39mmax_workers, initializer\u001b[39m=\u001b[39mtqdm_class\u001b[39m.\u001b[39mset_lock,\n\u001b[1;32m 50\u001b[0m initargs\u001b[39m=\u001b[39m(lk,)) \u001b[39mas\u001b[39;00m ex:\n\u001b[0;32m---> 51\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mlist\u001b[39;49m(tqdm_class(ex\u001b[39m.\u001b[39;49mmap(fn, \u001b[39m*\u001b[39;49miterables, chunksize\u001b[39m=\u001b[39;49mchunksize), \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs))\n",
308
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/tqdm/std.py:1181\u001b[0m, in \u001b[0;36mtqdm.__iter__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1180\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m-> 1181\u001b[0m \u001b[39mfor\u001b[39;00m obj \u001b[39min\u001b[39;00m iterable:\n\u001b[1;32m 1182\u001b[0m \u001b[39myield\u001b[39;00m obj\n",
309
+ "File \u001b[0;32m/usr/local/lib/python3.10/concurrent/futures/_base.py:621\u001b[0m, in \u001b[0;36mExecutor.map.<locals>.result_iterator\u001b[0;34m()\u001b[0m\n\u001b[1;32m 620\u001b[0m \u001b[39mif\u001b[39;00m timeout \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m--> 621\u001b[0m \u001b[39myield\u001b[39;00m _result_or_cancel(fs\u001b[39m.\u001b[39;49mpop())\n\u001b[1;32m 622\u001b[0m \u001b[39melse\u001b[39;00m:\n",
310
+ "File \u001b[0;32m/usr/local/lib/python3.10/concurrent/futures/_base.py:319\u001b[0m, in \u001b[0;36m_result_or_cancel\u001b[0;34m(***failed resolving arguments***)\u001b[0m\n\u001b[1;32m 318\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m--> 319\u001b[0m \u001b[39mreturn\u001b[39;00m fut\u001b[39m.\u001b[39;49mresult(timeout)\n\u001b[1;32m 320\u001b[0m \u001b[39mfinally\u001b[39;00m:\n",
311
+ "File \u001b[0;32m/usr/local/lib/python3.10/concurrent/futures/_base.py:453\u001b[0m, in \u001b[0;36mFuture.result\u001b[0;34m(self, timeout)\u001b[0m\n\u001b[1;32m 451\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m__get_result()\n\u001b[0;32m--> 453\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_condition\u001b[39m.\u001b[39;49mwait(timeout)\n\u001b[1;32m 455\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_state \u001b[39min\u001b[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n",
312
+ "File \u001b[0;32m/usr/local/lib/python3.10/threading.py:320\u001b[0m, in \u001b[0;36mCondition.wait\u001b[0;34m(self, timeout)\u001b[0m\n\u001b[1;32m 319\u001b[0m \u001b[39mif\u001b[39;00m timeout \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m--> 320\u001b[0m waiter\u001b[39m.\u001b[39;49macquire()\n\u001b[1;32m 321\u001b[0m gotit \u001b[39m=\u001b[39m \u001b[39mTrue\u001b[39;00m\n",
313
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m: ",
314
+ "\nDuring handling of the above exception, another exception occurred:\n",
315
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
316
+ "\u001b[1;32m/home/user/app/polls/test.ipynb Cell 8\u001b[0m line \u001b[0;36m4\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#X10sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mdatasets\u001b[39;00m \u001b[39mimport\u001b[39;00m load_dataset\n\u001b[1;32m <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#X10sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2'>3</a>\u001b[0m \u001b[39m# データセットのロード\u001b[39;00m\n\u001b[0;32m----> <a href='vscode-notebook-cell://kenken999-fastapi-django-main--1027.hf.space/home/user/app/polls/test.ipynb#X10sdnNjb2RlLXJlbW90ZQ%3D%3D?line=3'>4</a>\u001b[0m dataset \u001b[39m=\u001b[39m load_dataset(\u001b[39m'\u001b[39;49m\u001b[39mwiki_dpr\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mpsgs_w100.nq.exact\u001b[39;49m\u001b[39m'\u001b[39;49m, trust_remote_code\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)\n",
317
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/load.py:2614\u001b[0m, in \u001b[0;36mload_dataset\u001b[0;34m(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)\u001b[0m\n\u001b[1;32m 2611\u001b[0m \u001b[39mreturn\u001b[39;00m builder_instance\u001b[39m.\u001b[39mas_streaming_dataset(split\u001b[39m=\u001b[39msplit)\n\u001b[1;32m 2613\u001b[0m \u001b[39m# Download and prepare data\u001b[39;00m\n\u001b[0;32m-> 2614\u001b[0m builder_instance\u001b[39m.\u001b[39;49mdownload_and_prepare(\n\u001b[1;32m 2615\u001b[0m download_config\u001b[39m=\u001b[39;49mdownload_config,\n\u001b[1;32m 2616\u001b[0m download_mode\u001b[39m=\u001b[39;49mdownload_mode,\n\u001b[1;32m 2617\u001b[0m verification_mode\u001b[39m=\u001b[39;49mverification_mode,\n\u001b[1;32m 2618\u001b[0m num_proc\u001b[39m=\u001b[39;49mnum_proc,\n\u001b[1;32m 2619\u001b[0m storage_options\u001b[39m=\u001b[39;49mstorage_options,\n\u001b[1;32m 2620\u001b[0m )\n\u001b[1;32m 2622\u001b[0m \u001b[39m# Build dataset for splits\u001b[39;00m\n\u001b[1;32m 2623\u001b[0m keep_in_memory \u001b[39m=\u001b[39m (\n\u001b[1;32m 2624\u001b[0m keep_in_memory \u001b[39mif\u001b[39;00m keep_in_memory \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39melse\u001b[39;00m is_small_dataset(builder_instance\u001b[39m.\u001b[39minfo\u001b[39m.\u001b[39mdataset_size)\n\u001b[1;32m 2625\u001b[0m )\n",
318
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/builder.py:1027\u001b[0m, in \u001b[0;36mDatasetBuilder.download_and_prepare\u001b[0;34m(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)\u001b[0m\n\u001b[1;32m 1025\u001b[0m \u001b[39mif\u001b[39;00m num_proc \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 1026\u001b[0m prepare_split_kwargs[\u001b[39m\"\u001b[39m\u001b[39mnum_proc\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39m=\u001b[39m num_proc\n\u001b[0;32m-> 1027\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_download_and_prepare(\n\u001b[1;32m 1028\u001b[0m dl_manager\u001b[39m=\u001b[39;49mdl_manager,\n\u001b[1;32m 1029\u001b[0m verification_mode\u001b[39m=\u001b[39;49mverification_mode,\n\u001b[1;32m 1030\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mprepare_split_kwargs,\n\u001b[1;32m 1031\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mdownload_and_prepare_kwargs,\n\u001b[1;32m 1032\u001b[0m )\n\u001b[1;32m 1033\u001b[0m \u001b[39m# Sync info\u001b[39;00m\n\u001b[1;32m 1034\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39minfo\u001b[39m.\u001b[39mdataset_size \u001b[39m=\u001b[39m \u001b[39msum\u001b[39m(split\u001b[39m.\u001b[39mnum_bytes \u001b[39mfor\u001b[39;00m split \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39minfo\u001b[39m.\u001b[39msplits\u001b[39m.\u001b[39mvalues())\n",
319
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/builder.py:1100\u001b[0m, in \u001b[0;36mDatasetBuilder._download_and_prepare\u001b[0;34m(self, dl_manager, verification_mode, **prepare_split_kwargs)\u001b[0m\n\u001b[1;32m 1098\u001b[0m split_dict \u001b[39m=\u001b[39m SplitDict(dataset_name\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mdataset_name)\n\u001b[1;32m 1099\u001b[0m split_generators_kwargs \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_make_split_generators_kwargs(prepare_split_kwargs)\n\u001b[0;32m-> 1100\u001b[0m split_generators \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_split_generators(dl_manager, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49msplit_generators_kwargs)\n\u001b[1;32m 1102\u001b[0m \u001b[39m# Checksums verification\u001b[39;00m\n\u001b[1;32m 1103\u001b[0m \u001b[39mif\u001b[39;00m verification_mode \u001b[39m==\u001b[39m VerificationMode\u001b[39m.\u001b[39mALL_CHECKS \u001b[39mand\u001b[39;00m dl_manager\u001b[39m.\u001b[39mrecord_checksums:\n",
320
+ "File \u001b[0;32m~/.cache/huggingface/modules/datasets_modules/datasets/wiki_dpr/66fd9b80f51375c02cd9010050e781ed3e8f759e868f690c31b2686a7a0eeb5c/wiki_dpr.py:143\u001b[0m, in \u001b[0;36mWikiDpr._split_generators\u001b[0;34m(self, dl_manager)\u001b[0m\n\u001b[1;32m 141\u001b[0m data_dir \u001b[39m=\u001b[39m os\u001b[39m.\u001b[39mpath\u001b[39m.\u001b[39mjoin(\u001b[39m\"\u001b[39m\u001b[39mdata\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mconfig\u001b[39m.\u001b[39mwiki_split, data_dir)\n\u001b[1;32m 142\u001b[0m files \u001b[39m=\u001b[39m [os\u001b[39m.\u001b[39mpath\u001b[39m.\u001b[39mjoin(data_dir, \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mtrain-\u001b[39m\u001b[39m{\u001b[39;00mi\u001b[39m:\u001b[39;00m\u001b[39m05d\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m-of-\u001b[39m\u001b[39m{\u001b[39;00mnum_shards\u001b[39m:\u001b[39;00m\u001b[39m05d\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m.parquet\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39mfor\u001b[39;00m i \u001b[39min\u001b[39;00m \u001b[39mrange\u001b[39m(num_shards)]\n\u001b[0;32m--> 143\u001b[0m downloaded_files \u001b[39m=\u001b[39m dl_manager\u001b[39m.\u001b[39;49mdownload_and_extract(files)\n\u001b[1;32m 144\u001b[0m \u001b[39mreturn\u001b[39;00m [\n\u001b[1;32m 145\u001b[0m datasets\u001b[39m.\u001b[39mSplitGenerator(name\u001b[39m=\u001b[39mdatasets\u001b[39m.\u001b[39mSplit\u001b[39m.\u001b[39mTRAIN, gen_kwargs\u001b[39m=\u001b[39m{\u001b[39m\"\u001b[39m\u001b[39mfiles\u001b[39m\u001b[39m\"\u001b[39m: downloaded_files}),\n\u001b[1;32m 146\u001b[0m ]\n",
321
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/download/download_manager.py:434\u001b[0m, in \u001b[0;36mDownloadManager.download_and_extract\u001b[0;34m(self, url_or_urls)\u001b[0m\n\u001b[1;32m 418\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mdownload_and_extract\u001b[39m(\u001b[39mself\u001b[39m, url_or_urls):\n\u001b[1;32m 419\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Download and extract given `url_or_urls`.\u001b[39;00m\n\u001b[1;32m 420\u001b[0m \n\u001b[1;32m 421\u001b[0m \u001b[39m Is roughly equivalent to:\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 432\u001b[0m \u001b[39m extracted_path(s): `str`, extracted paths of given URL(s).\u001b[39;00m\n\u001b[1;32m 433\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 434\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mextract(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdownload(url_or_urls))\n",
322
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/download/download_manager.py:257\u001b[0m, in \u001b[0;36mDownloadManager.download\u001b[0;34m(self, url_or_urls)\u001b[0m\n\u001b[1;32m 255\u001b[0m start_time \u001b[39m=\u001b[39m datetime\u001b[39m.\u001b[39mnow()\n\u001b[1;32m 256\u001b[0m \u001b[39mwith\u001b[39;00m stack_multiprocessing_download_progress_bars():\n\u001b[0;32m--> 257\u001b[0m downloaded_path_or_paths \u001b[39m=\u001b[39m map_nested(\n\u001b[1;32m 258\u001b[0m download_func,\n\u001b[1;32m 259\u001b[0m url_or_urls,\n\u001b[1;32m 260\u001b[0m map_tuple\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m,\n\u001b[1;32m 261\u001b[0m num_proc\u001b[39m=\u001b[39;49mdownload_config\u001b[39m.\u001b[39;49mnum_proc,\n\u001b[1;32m 262\u001b[0m desc\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mDownloading data files\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 263\u001b[0m batched\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m,\n\u001b[1;32m 264\u001b[0m batch_size\u001b[39m=\u001b[39;49m\u001b[39m-\u001b[39;49m\u001b[39m1\u001b[39;49m,\n\u001b[1;32m 265\u001b[0m )\n\u001b[1;32m 266\u001b[0m duration \u001b[39m=\u001b[39m datetime\u001b[39m.\u001b[39mnow() \u001b[39m-\u001b[39m start_time\n\u001b[1;32m 267\u001b[0m logger\u001b[39m.\u001b[39minfo(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mDownloading took \u001b[39m\u001b[39m{\u001b[39;00mduration\u001b[39m.\u001b[39mtotal_seconds()\u001b[39m \u001b[39m\u001b[39m/\u001b[39m\u001b[39m/\u001b[39m\u001b[39m \u001b[39m\u001b[39m60\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m min\u001b[39m\u001b[39m\"\u001b[39m)\n",
323
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py:511\u001b[0m, in \u001b[0;36mmap_nested\u001b[0;34m(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, parallel_min_length, batched, batch_size, types, disable_tqdm, desc)\u001b[0m\n\u001b[1;32m 509\u001b[0m batch_size \u001b[39m=\u001b[39m \u001b[39mmax\u001b[39m(\u001b[39mlen\u001b[39m(iterable) \u001b[39m/\u001b[39m\u001b[39m/\u001b[39m num_proc \u001b[39m+\u001b[39m \u001b[39mint\u001b[39m(\u001b[39mlen\u001b[39m(iterable) \u001b[39m%\u001b[39m num_proc \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m), \u001b[39m1\u001b[39m)\n\u001b[1;32m 510\u001b[0m iterable \u001b[39m=\u001b[39m \u001b[39mlist\u001b[39m(iter_batched(iterable, batch_size))\n\u001b[0;32m--> 511\u001b[0m mapped \u001b[39m=\u001b[39m [\n\u001b[1;32m 512\u001b[0m _single_map_nested((function, obj, batched, batch_size, types, \u001b[39mNone\u001b[39;00m, \u001b[39mTrue\u001b[39;00m, \u001b[39mNone\u001b[39;00m))\n\u001b[1;32m 513\u001b[0m \u001b[39mfor\u001b[39;00m obj \u001b[39min\u001b[39;00m hf_tqdm(iterable, disable\u001b[39m=\u001b[39mdisable_tqdm, desc\u001b[39m=\u001b[39mdesc)\n\u001b[1;32m 514\u001b[0m ]\n\u001b[1;32m 515\u001b[0m \u001b[39mif\u001b[39;00m batched:\n\u001b[1;32m 516\u001b[0m mapped \u001b[39m=\u001b[39m [mapped_item \u001b[39mfor\u001b[39;00m mapped_batch \u001b[39min\u001b[39;00m mapped \u001b[39mfor\u001b[39;00m mapped_item \u001b[39min\u001b[39;00m mapped_batch]\n",
324
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py:512\u001b[0m, in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 509\u001b[0m batch_size \u001b[39m=\u001b[39m \u001b[39mmax\u001b[39m(\u001b[39mlen\u001b[39m(iterable) \u001b[39m/\u001b[39m\u001b[39m/\u001b[39m num_proc \u001b[39m+\u001b[39m \u001b[39mint\u001b[39m(\u001b[39mlen\u001b[39m(iterable) \u001b[39m%\u001b[39m num_proc \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m), \u001b[39m1\u001b[39m)\n\u001b[1;32m 510\u001b[0m iterable \u001b[39m=\u001b[39m \u001b[39mlist\u001b[39m(iter_batched(iterable, batch_size))\n\u001b[1;32m 511\u001b[0m mapped \u001b[39m=\u001b[39m [\n\u001b[0;32m--> 512\u001b[0m _single_map_nested((function, obj, batched, batch_size, types, \u001b[39mNone\u001b[39;49;00m, \u001b[39mTrue\u001b[39;49;00m, \u001b[39mNone\u001b[39;49;00m))\n\u001b[1;32m 513\u001b[0m \u001b[39mfor\u001b[39;00m obj \u001b[39min\u001b[39;00m hf_tqdm(iterable, disable\u001b[39m=\u001b[39mdisable_tqdm, desc\u001b[39m=\u001b[39mdesc)\n\u001b[1;32m 514\u001b[0m ]\n\u001b[1;32m 515\u001b[0m \u001b[39mif\u001b[39;00m batched:\n\u001b[1;32m 516\u001b[0m mapped \u001b[39m=\u001b[39m [mapped_item \u001b[39mfor\u001b[39;00m mapped_batch \u001b[39min\u001b[39;00m mapped \u001b[39mfor\u001b[39;00m mapped_item \u001b[39min\u001b[39;00m mapped_batch]\n",
325
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py:380\u001b[0m, in \u001b[0;36m_single_map_nested\u001b[0;34m(args)\u001b[0m\n\u001b[1;32m 373\u001b[0m \u001b[39mreturn\u001b[39;00m function(data_struct)\n\u001b[1;32m 374\u001b[0m \u001b[39mif\u001b[39;00m (\n\u001b[1;32m 375\u001b[0m batched\n\u001b[1;32m 376\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39misinstance\u001b[39m(data_struct, \u001b[39mdict\u001b[39m)\n\u001b[1;32m 377\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39misinstance\u001b[39m(data_struct, types)\n\u001b[1;32m 378\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mall\u001b[39m(\u001b[39mnot\u001b[39;00m \u001b[39misinstance\u001b[39m(v, (\u001b[39mdict\u001b[39m, types)) \u001b[39mfor\u001b[39;00m v \u001b[39min\u001b[39;00m data_struct)\n\u001b[1;32m 379\u001b[0m ):\n\u001b[0;32m--> 380\u001b[0m \u001b[39mreturn\u001b[39;00m [mapped_item \u001b[39mfor\u001b[39;00m batch \u001b[39min\u001b[39;00m iter_batched(data_struct, batch_size) \u001b[39mfor\u001b[39;00m mapped_item \u001b[39min\u001b[39;00m function(batch)]\n\u001b[1;32m 382\u001b[0m \u001b[39m# Reduce logging to keep things readable in multiprocessing with tqdm\u001b[39;00m\n\u001b[1;32m 383\u001b[0m \u001b[39mif\u001b[39;00m rank \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mand\u001b[39;00m logging\u001b[39m.\u001b[39mget_verbosity() \u001b[39m<\u001b[39m logging\u001b[39m.\u001b[39mWARNING:\n",
326
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py:380\u001b[0m, in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 373\u001b[0m \u001b[39mreturn\u001b[39;00m function(data_struct)\n\u001b[1;32m 374\u001b[0m \u001b[39mif\u001b[39;00m (\n\u001b[1;32m 375\u001b[0m batched\n\u001b[1;32m 376\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39misinstance\u001b[39m(data_struct, \u001b[39mdict\u001b[39m)\n\u001b[1;32m 377\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39misinstance\u001b[39m(data_struct, types)\n\u001b[1;32m 378\u001b[0m \u001b[39mand\u001b[39;00m \u001b[39mall\u001b[39m(\u001b[39mnot\u001b[39;00m \u001b[39misinstance\u001b[39m(v, (\u001b[39mdict\u001b[39m, types)) \u001b[39mfor\u001b[39;00m v \u001b[39min\u001b[39;00m data_struct)\n\u001b[1;32m 379\u001b[0m ):\n\u001b[0;32m--> 380\u001b[0m \u001b[39mreturn\u001b[39;00m [mapped_item \u001b[39mfor\u001b[39;00m batch \u001b[39min\u001b[39;00m iter_batched(data_struct, batch_size) \u001b[39mfor\u001b[39;00m mapped_item \u001b[39min\u001b[39;00m function(batch)]\n\u001b[1;32m 382\u001b[0m \u001b[39m# Reduce logging to keep things readable in multiprocessing with tqdm\u001b[39;00m\n\u001b[1;32m 383\u001b[0m \u001b[39mif\u001b[39;00m rank \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mand\u001b[39;00m logging\u001b[39m.\u001b[39mget_verbosity() \u001b[39m<\u001b[39m logging\u001b[39m.\u001b[39mWARNING:\n",
327
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/datasets/download/download_manager.py:300\u001b[0m, in \u001b[0;36mDownloadManager._download_batched\u001b[0;34m(self, url_or_filenames, download_config)\u001b[0m\n\u001b[1;32m 295\u001b[0m \u001b[39mpass\u001b[39;00m\n\u001b[1;32m 296\u001b[0m max_workers \u001b[39m=\u001b[39m (\n\u001b[1;32m 297\u001b[0m config\u001b[39m.\u001b[39mHF_DATASETS_MULTITHREADING_MAX_WORKERS \u001b[39mif\u001b[39;00m size \u001b[39m<\u001b[39m (\u001b[39m20\u001b[39m \u001b[39m<<\u001b[39m \u001b[39m20\u001b[39m) \u001b[39melse\u001b[39;00m \u001b[39m1\u001b[39m\n\u001b[1;32m 298\u001b[0m ) \u001b[39m# enable multithreading if files are small\u001b[39;00m\n\u001b[0;32m--> 300\u001b[0m \u001b[39mreturn\u001b[39;00m thread_map(\n\u001b[1;32m 301\u001b[0m download_func,\n\u001b[1;32m 302\u001b[0m url_or_filenames,\n\u001b[1;32m 303\u001b[0m desc\u001b[39m=\u001b[39;49mdownload_config\u001b[39m.\u001b[39;49mdownload_desc \u001b[39mor\u001b[39;49;00m \u001b[39m\"\u001b[39;49m\u001b[39mDownloading\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 304\u001b[0m unit\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mfiles\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 305\u001b[0m position\u001b[39m=\u001b[39;49mmultiprocessing\u001b[39m.\u001b[39;49mcurrent_process()\u001b[39m.\u001b[39;49m_identity[\u001b[39m-\u001b[39;49m\u001b[39m1\u001b[39;49m] \u001b[39m# contains the ranks of subprocesses\u001b[39;49;00m\n\u001b[1;32m 306\u001b[0m \u001b[39mif\u001b[39;49;00m os\u001b[39m.\u001b[39;49menviron\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mHF_DATASETS_STACK_MULTIPROCESSING_DOWNLOAD_PROGRESS_BARS\u001b[39;49m\u001b[39m\"\u001b[39;49m) \u001b[39m==\u001b[39;49m \u001b[39m\"\u001b[39;49m\u001b[39m1\u001b[39;49m\u001b[39m\"\u001b[39;49m\n\u001b[1;32m 307\u001b[0m \u001b[39mand\u001b[39;49;00m multiprocessing\u001b[39m.\u001b[39;49mcurrent_process()\u001b[39m.\u001b[39;49m_identity\n\u001b[1;32m 
308\u001b[0m \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m 309\u001b[0m max_workers\u001b[39m=\u001b[39;49mmax_workers,\n\u001b[1;32m 310\u001b[0m tqdm_class\u001b[39m=\u001b[39;49mtqdm,\n\u001b[1;32m 311\u001b[0m )\n\u001b[1;32m 312\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 313\u001b[0m \u001b[39mreturn\u001b[39;00m [\n\u001b[1;32m 314\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_download_single(url_or_filename, download_config\u001b[39m=\u001b[39mdownload_config)\n\u001b[1;32m 315\u001b[0m \u001b[39mfor\u001b[39;00m url_or_filename \u001b[39min\u001b[39;00m url_or_filenames\n\u001b[1;32m 316\u001b[0m ]\n",
328
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/tqdm/contrib/concurrent.py:69\u001b[0m, in \u001b[0;36mthread_map\u001b[0;34m(fn, *iterables, **tqdm_kwargs)\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 56\u001b[0m \u001b[39mEquivalent of `list(map(fn, *iterables))`\u001b[39;00m\n\u001b[1;32m 57\u001b[0m \u001b[39mdriven by `concurrent.futures.ThreadPoolExecutor`.\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 66\u001b[0m \u001b[39m [default: max(32, cpu_count() + 4)].\u001b[39;00m\n\u001b[1;32m 67\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 68\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mconcurrent\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mfutures\u001b[39;00m \u001b[39mimport\u001b[39;00m ThreadPoolExecutor\n\u001b[0;32m---> 69\u001b[0m \u001b[39mreturn\u001b[39;00m _executor_map(ThreadPoolExecutor, fn, \u001b[39m*\u001b[39;49miterables, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mtqdm_kwargs)\n",
329
+ "File \u001b[0;32m/usr/local/lib/python3.10/site-packages/tqdm/contrib/concurrent.py:49\u001b[0m, in \u001b[0;36m_executor_map\u001b[0;34m(PoolExecutor, fn, *iterables, **tqdm_kwargs)\u001b[0m\n\u001b[1;32m 46\u001b[0m lock_name \u001b[39m=\u001b[39m kwargs\u001b[39m.\u001b[39mpop(\u001b[39m\"\u001b[39m\u001b[39mlock_name\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 47\u001b[0m \u001b[39mwith\u001b[39;00m ensure_lock(tqdm_class, lock_name\u001b[39m=\u001b[39mlock_name) \u001b[39mas\u001b[39;00m lk:\n\u001b[1;32m 48\u001b[0m \u001b[39m# share lock in case workers are already using `tqdm`\u001b[39;00m\n\u001b[0;32m---> 49\u001b[0m \u001b[39mwith\u001b[39;00m PoolExecutor(max_workers\u001b[39m=\u001b[39mmax_workers, initializer\u001b[39m=\u001b[39mtqdm_class\u001b[39m.\u001b[39mset_lock,\n\u001b[1;32m 50\u001b[0m initargs\u001b[39m=\u001b[39m(lk,)) \u001b[39mas\u001b[39;00m ex:\n\u001b[1;32m 51\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mlist\u001b[39m(tqdm_class(ex\u001b[39m.\u001b[39mmap(fn, \u001b[39m*\u001b[39miterables, chunksize\u001b[39m=\u001b[39mchunksize), \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs))\n",
330
+ "File \u001b[0;32m/usr/local/lib/python3.10/concurrent/futures/_base.py:649\u001b[0m, in \u001b[0;36mExecutor.__exit__\u001b[0;34m(self, exc_type, exc_val, exc_tb)\u001b[0m\n\u001b[1;32m 648\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m__exit__\u001b[39m(\u001b[39mself\u001b[39m, exc_type, exc_val, exc_tb):\n\u001b[0;32m--> 649\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mshutdown(wait\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)\n\u001b[1;32m 650\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mFalse\u001b[39;00m\n",
331
+ "File \u001b[0;32m/usr/local/lib/python3.10/concurrent/futures/thread.py:235\u001b[0m, in \u001b[0;36mThreadPoolExecutor.shutdown\u001b[0;34m(self, wait, cancel_futures)\u001b[0m\n\u001b[1;32m 233\u001b[0m \u001b[39mif\u001b[39;00m wait:\n\u001b[1;32m 234\u001b[0m \u001b[39mfor\u001b[39;00m t \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_threads:\n\u001b[0;32m--> 235\u001b[0m t\u001b[39m.\u001b[39;49mjoin()\n",
332
+ "File \u001b[0;32m/usr/local/lib/python3.10/threading.py:1096\u001b[0m, in \u001b[0;36mThread.join\u001b[0;34m(self, timeout)\u001b[0m\n\u001b[1;32m 1093\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mRuntimeError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mcannot join current thread\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 1095\u001b[0m \u001b[39mif\u001b[39;00m timeout \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m-> 1096\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_wait_for_tstate_lock()\n\u001b[1;32m 1097\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 1098\u001b[0m \u001b[39m# the behavior of a negative timeout isn't documented, but\u001b[39;00m\n\u001b[1;32m 1099\u001b[0m \u001b[39m# historically .join(timeout=x) for x<0 has acted as if timeout=0\u001b[39;00m\n\u001b[1;32m 1100\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_wait_for_tstate_lock(timeout\u001b[39m=\u001b[39m\u001b[39mmax\u001b[39m(timeout, \u001b[39m0\u001b[39m))\n",
333
+ "File \u001b[0;32m/usr/local/lib/python3.10/threading.py:1116\u001b[0m, in \u001b[0;36mThread._wait_for_tstate_lock\u001b[0;34m(self, block, timeout)\u001b[0m\n\u001b[1;32m 1113\u001b[0m \u001b[39mreturn\u001b[39;00m\n\u001b[1;32m 1115\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m-> 1116\u001b[0m \u001b[39mif\u001b[39;00m lock\u001b[39m.\u001b[39;49macquire(block, timeout):\n\u001b[1;32m 1117\u001b[0m lock\u001b[39m.\u001b[39mrelease()\n\u001b[1;32m 1118\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_stop()\n",
334
+ "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
335
+ ]
336
+ }
337
+ ],
338
+ "source": [
339
+ "from datasets import load_dataset\n",
340
+ "\n",
341
+ "# Load the dataset\n",
342
+ "dataset = load_dataset('wiki_dpr', 'psgs_w100.nq.exact', trust_remote_code=True)\n"
343
+ ]
344
+ },
345
+ {
346
+ "cell_type": "code",
347
+ "execution_count": 11,
348
+ "metadata": {},
349
+ "outputs": [
350
+ {
351
+ "name": "stderr",
352
+ "output_type": "stream",
353
+ "text": [
354
+ "Downloading data: 90%|█████████ | 142/157 [23:35<01:01, 4.07s/files]"
355
+ ]
356
+ }
357
+ ],
358
+ "source": [
359
+ "from datasets import load_dataset\n",
360
+ "from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration\n",
361
+ "import faiss\n",
362
+ "import numpy as np\n",
363
+ "\n",
364
+ "# Load the dataset\n",
365
+ "dataset = load_dataset('wiki_dpr', 'psgs_w100.nq.exact', trust_remote_code=True)\n",
366
+ "\n",
367
+ "# Load the model and tokenizer\n",
368
+ "model_name = \"facebook/rag-sequence-nq\"\n",
369
+ "tokenizer = RagTokenizer.from_pretrained(model_name)\n",
370
+ "retriever = RagRetriever.from_pretrained(model_name, use_dummy_dataset=True)\n",
371
+ "model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)\n",
372
+ "\n",
373
+ "# Encode the documents\n",
374
+ "def embed_passages(passages):\n",
375
+ "    embeddings = []\n",
376
+ "    for passage in passages:\n",
377
+ "        inputs = tokenizer(passage, return_tensors=\"pt\", truncation=True, padding=\"max_length\", max_length=256)\n",
378
+ "        outputs = model.question_encoder(**inputs)\n",
379
+ "        embeddings.append(outputs.pooler_output.detach().numpy())\n",
380
+ "    return np.vstack(embeddings)\n",
381
+ "\n",
382
+ "# Embed the documents\n",
383
+ "passages = dataset['train'][:1000]['text']  # For the demo, use only the first 1,000 documents\n",
384
+ "passage_embeddings = embed_passages(passages)\n",
385
+ "\n",
386
+ "# Create the Faiss index\n",
387
+ "index = faiss.IndexFlatL2(passage_embeddings.shape[1])\n",
388
+ "index.add(passage_embeddings)\n",
389
+ "faiss.write_index(index, \"faiss_index\")\n",
390
+ "\n",
391
+ "# Load the Faiss index into the model's retriever\n",
392
+ "retriever.index = faiss.read_index(\"faiss_index\")\n",
393
+ "\n",
394
+ "# Tokenize the question\n",
395
+ "question = \"What is the capital of France?\"\n",
396
+ "inputs = tokenizer(question, return_tensors=\"pt\")\n",
397
+ "\n",
398
+ "# Generate the answer\n",
399
+ "outputs = model.generate(inputs[\"input_ids\"])\n",
400
+ "answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]\n",
401
+ "\n",
402
+ "print(answer)\n"
403
+ ]
404
+ }
405
+ ],
406
+ "metadata": {
407
+ "kernelspec": {
408
+ "display_name": "Python 3",
409
+ "language": "python",
410
+ "name": "python3"
411
+ },
412
+ "language_info": {
413
+ "codemirror_mode": {
414
+ "name": "ipython",
415
+ "version": 3
416
+ },
417
+ "file_extension": ".py",
418
+ "mimetype": "text/x-python",
419
+ "name": "python",
420
+ "nbconvert_exporter": "python",
421
+ "pygments_lexer": "ipython3",
422
+ "version": "3.10.13"
423
+ }
424
+ },
425
+ "nbformat": 4,
426
+ "nbformat_minor": 2
427
+ }
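The last notebook cell builds a `faiss.IndexFlatL2` over the passage embeddings. As a rough illustration of what that index type computes, here is a minimal pure-Python sketch of exact (brute-force) nearest-neighbor search by squared L2 distance. The function name `l2_search` and the toy 2-D vectors are assumptions made for this example; they are not part of the notebook or of the Faiss API.

```python
def l2_search(index_vectors, query, k=1):
    """Return the k closest stored vectors as (squared_distance, position) pairs.

    Hypothetical helper: it mimics the exhaustive comparison that a flat L2
    index performs, without any of Faiss's optimizations.
    """
    scored = []
    for pos, vec in enumerate(index_vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist_sq = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist_sq, pos))
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy stand-ins for passage embeddings (real ones have hundreds of dimensions).
passage_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
hits = l2_search(passage_vectors, [0.9, 1.2], k=2)
nearest_positions = [pos for _, pos in hits]
```

Here `nearest_positions` lists the stored vectors from closest to farthest, which is the role the `index.search` call plays when a real Faiss index is queried with a question embedding.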