jsulz (HF staff) committed on
Commit 9f23379 · 1 Parent(s): 7f477b2

formatting

Files changed (1)
  1. app.py +116 -109
app.py CHANGED
@@ -191,7 +191,6 @@ with gr.Blocks() as demo:
191
  df, written, spoken = load_transform_dataset()
192
  # store the dataframe in a state object before passing to component functions
193
  df_state = gr.State(df)
194
-
195
  # Build out the top level static charts and content
196
  gr.Markdown(
197
  """
@@ -201,130 +200,138 @@ with gr.Blocks() as demo:
201
  )
202
 
203
  gr.Markdown(
204
- "In addition to analyzing the content, this space also leverages the [Qwen/2.5-72B-Instruct](https://deepinfra.com/Qwen/Qwen2.5-72B-Instruct) model to summarize a speech. The model is tasked with providing a concise summary of a speech from a given president."
205
- )
206
- gr.Markdown("## Summarize a Speech")
207
- gr.HTML("<div id=summarize-anchor></div>")
208
- gr.Markdown(
209
- """
210
- Context is king, so before we start looking at the speeches in the aggregate, get a summary of a State of the Union first. Use the dropdown to select a speech from a president and click the button to summarize the speech. [Qwen/2.5-72B-Instruct](https://deepinfra.com/Qwen/Qwen2.5-72B-Instruct) will provide a concise summary of the speech with the proper historical and political context.
211
- """
212
  )
213
- speeches = df["speech_key"].unique()
214
- speeches = speeches.tolist()
215
- speech = gr.Dropdown(label="Select a Speech", choices=speeches)
216
- # create a dropdown to select a speech from a president
217
- run_summarization = gr.Button(value="Summarize")
218
- fin_speech = gr.Textbox(label="Summarized Speech", type="text", lines=10)
219
- run_summarization.click(streaming, inputs=[speech, df_state], outputs=[fin_speech])
220
-
221
- # Basic line chart showing the total number of words in each address
222
- gr.Markdown(
223
- """
224
- ## The shape of words
225
- The line chart to the right shows the total number of words in each address. However, not all SOTUs are created equal. From 1801 to 1912, each address was a written message to Congress. In 1913, Woodrow Wilson broke with tradition and delivered his address in person. Since then, the addresses have been a mix of written and spoken (mostly spoken).
226
 
227
- The spikes you see in the early 1970's and early 1980's are from written addresses by Richard Nixon and Jimmy Carter respectively.
228
 
229
- Now that we have a little historical context, what does this data look like if we split things out by president? The bar chart below shows the average number of words in each address by president. The bars are grouped by written and spoken addresses.
230
- """
231
- )
232
- fig1 = px.line(
233
- df,
234
- x="date",
235
- y="word_count",
236
- title="Total Number of Words in Addresses",
237
- line_shape="spline",
238
- )
239
- fig1.update_layout(
240
- xaxis=dict(title="Date of Address"),
241
- yaxis=dict(title="Word Count"),
242
- )
243
- gr.Plot(fig1, scale=2)
244
- # group by president and category and calculate the average word count sort by date
245
- avg_word_count = (
246
- df.groupby(["potus", "categories"])["word_count"].mean().reset_index()
247
- )
248
- # Build a bar chart showing the average number of words in each address by president
249
- fig2 = px.bar(
250
- avg_word_count,
251
- x="potus",
252
- y="word_count",
253
- title="Average Number of Words in Addresses by President",
254
- color="categories",
255
- barmode="group",
256
- )
257
- fig2.update_layout(
258
- xaxis=dict(
259
- title="President",
260
- tickangle=-45, # Rotate labels 45 degrees counterclockwise
261
- ),
262
- yaxis=dict(
263
- title="Average Word Count",
264
- tickangle=0, # Default label angle (horizontal)
265
- ),
266
- legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
267
- )
268
- gr.Plot(fig2)
269
 
270
- # Create a line chart showing the Automated Readability Index in each address
271
- with gr.Row():
272
- ari = df[["potus", "date", "ari", "categories"]]
273
- fig3 = px.line(
274
- ari,
275
  x="date",
276
- y="ari",
277
- title="Automated Readability Index in each Address",
278
  line_shape="spline",
279
  )
280
- fig3.update_layout(
281
  xaxis=dict(title="Date of Address"),
282
- yaxis=dict(title="ARI Score"),
283
  )
284
- gr.Plot(fig3, scale=2)
285
  gr.Markdown(
286
  """
287
- The line chart to the left shows the Automated Readability Index (ARI) for each speech by year. The ARI is calculated using the formula: 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43. In general, ARI scores correspond to U.S. grade levels. For example, an ARI of 8.0 corresponds to an 8th grade reading level.
288
 
289
- While there are other scores that are more representative of attributes we might want to measure, they require values like syllables. The ARI is a simple score to compute with our data.
290
 
291
- The drop off is quite noticeable, don't you think? ;)
292
- """
293
- )
294
- gr.Markdown(
295
  """
296
- ## Dive Deeper on Each President
297
-
298
- Use the dropdown to select a president and go a little deeper.
299
-
300
- To begin with, there is an [n-gram](https://en.wikipedia.org/wiki/N-gram) bar chart built from all of the given president's addresses. An n-gram is a contiguous sequence of n items from a given sample of text or speech. Because written and spoken speech is littered with so-called "stop words" such as "and", "the", and "but", they've been removed to provide a more rich (albeit sometimes more difficult to read) view of the text.
301
-
302
- The slider only goes up to 4-grams because the data is sparse beyond that. I personally found the n-grams from our last three presidents to be less than inspiring and full of platitudes. Earlier presidents have more interesting n-grams.
303
-
304
- Next up is a word cloud of the lemmatized text from the president's addresses. [Lemmatization](https://en.wikipedia.org/wiki/Lemmatization) is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. Think of this as a more advanced version of [stemming](https://en.wikipedia.org/wiki/Stemming) where we can establish novel links between words like "better" and "good" that might otherwise be overlooked in stemming.
305
-
306
- You can also see a line chart of word count and ARI for each address.
307
- """
308
- )
309
- # get all unique president names
310
- presidents = df["potus"].unique()
311
- presidents = presidents.tolist()
312
- presidents.append("All")
313
-
314
- # create a dropdown to select a president
315
- president = gr.Dropdown(label="Select a President", choices=presidents, value="All")
316
- # create a text area to display the summarized speech
317
- # create a slider for number of word grams
318
- grams = gr.Slider(
319
- minimum=1, maximum=4, step=1, label="N-grams", interactive=True, value=1
320
- )
321
 
322
- # show a bar chart of the top n-grams for a selected president
323
- gr.Plot(plotly_ngrams, inputs=[grams, president, df_state])
324
 
325
- gr.Plot(plt_wordcloud, scale=2, inputs=[president, df_state])
 
326
 
327
- # show a line chart of word count and ARI for a selected president
328
- gr.Plot(plotly_word_and_ari, inputs=[president, df_state])
329
 
330
  demo.launch()
 
191
  df, written, spoken = load_transform_dataset()
192
  # store the dataframe in a state object before passing to component functions
193
  df_state = gr.State(df)
 
194
  # Build out the top level static charts and content
195
  gr.Markdown(
196
  """
 
200
  )
201
 
202
  gr.Markdown(
203
+ "In addition to analyzing the content, this space also leverages the [Qwen/2.5-72B-Instruct](https://deepinfra.com/Qwen/Qwen2.5-72B-Instruct) model to summarize a speech. The model is tasked with providing a concise summary of a speech from a given president. To get a summary, go to the 'Summarize a Speech' tab."
204
  )
205
 
206
+ with gr.Tab(label="Speech Data"):
207
+ # Basic line chart showing the total number of words in each address
208
+ gr.Markdown(
209
+ """
210
+ ## The shape of words
211
+ The line chart to the right shows the total number of words in each address. However, not all SOTUs are created equal. From 1801 to 1912, each address was a written message to Congress. In 1913, Woodrow Wilson broke with tradition and delivered his address in person. Since then, the addresses have been a mix of written and spoken (mostly spoken).
212
 
213
+ The spikes you see in the early 1970's and early 1980's are from written addresses by Richard Nixon and Jimmy Carter respectively.
214
 
215
+ Now that we have a little historical context, what does this data look like if we split things out by president? The bar chart below shows the average number of words in each address by president. The bars are grouped by written and spoken addresses.
216
+ """
217
+ )
218
+ fig1 = px.line(
219
+ df,
220
  x="date",
221
+ y="word_count",
222
+ title="Total Number of Words in Addresses",
223
  line_shape="spline",
224
  )
225
+ fig1.update_layout(
226
  xaxis=dict(title="Date of Address"),
227
+ yaxis=dict(title="Word Count"),
228
+ )
229
+ gr.Plot(fig1, scale=2)
230
+ # group by president and category and calculate the average word count sort by date
231
+ avg_word_count = (
232
+ df.groupby(["potus", "categories"])["word_count"].mean().reset_index()
233
+ )
234
+ # Build a bar chart showing the average number of words in each address by president
235
+ fig2 = px.bar(
236
+ avg_word_count,
237
+ x="potus",
238
+ y="word_count",
239
+ title="Average Number of Words in Addresses by President",
240
+ color="categories",
241
+ barmode="group",
242
+ )
243
+ fig2.update_layout(
244
+ xaxis=dict(
245
+ title="President",
246
+ tickangle=-45, # Rotate labels 45 degrees counterclockwise
247
+ ),
248
+ yaxis=dict(
249
+ title="Average Word Count",
250
+ tickangle=0, # Default label angle (horizontal)
251
+ ),
252
+ legend=dict(
253
+ orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1
254
+ ),
255
  )
256
+ gr.Plot(fig2)
257
+
258
+ # Create a line chart showing the Automated Readability Index in each address
259
+ with gr.Row():
260
+ ari = df[["potus", "date", "ari", "categories"]]
261
+ fig3 = px.line(
262
+ ari,
263
+ x="date",
264
+ y="ari",
265
+ title="Automated Readability Index in each Address",
266
+ line_shape="spline",
267
+ )
268
+ fig3.update_layout(
269
+ xaxis=dict(title="Date of Address"),
270
+ yaxis=dict(title="ARI Score"),
271
+ )
272
+ gr.Plot(fig3, scale=2)
273
+ gr.Markdown(
274
+ """
275
+ The line chart to the left shows the Automated Readability Index (ARI) for each speech by year. The ARI is calculated using the formula: 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43. In general, ARI scores correspond to U.S. grade levels. For example, an ARI of 8.0 corresponds to an 8th grade reading level.
276
+
277
+ While there are other scores that are more representative of attributes we might want to measure, they require values like syllables. The ARI is a simple score to compute with our data.
278
+
279
+ The drop off is quite noticeable, don't you think? ;)
280
+ """
281
+ )
282
  gr.Markdown(
283
  """
284
+ ## Dive Deeper on Each President
285
 
286
+ Use the dropdown to select a president and go a little deeper.
287
+
288
+ To begin with, there is an [n-gram](https://en.wikipedia.org/wiki/N-gram) bar chart built from all of the given president's addresses. An n-gram is a contiguous sequence of n items from a given sample of text or speech. Because written and spoken speech is littered with so-called "stop words" such as "and", "the", and "but", they've been removed to provide a more rich (albeit sometimes more difficult to read) view of the text.
289
+
290
+ The slider only goes up to 4-grams because the data is sparse beyond that. I personally found the n-grams from our last three presidents to be less than inspiring and full of platitudes. Earlier presidents have more interesting n-grams.
291
 
292
+ Next up is a word cloud of the lemmatized text from the president's addresses. [Lemmatization](https://en.wikipedia.org/wiki/Lemmatization) is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. Think of this as a more advanced version of [stemming](https://en.wikipedia.org/wiki/Stemming) where we can establish novel links between words like "better" and "good" that might otherwise be overlooked in stemming.
293
+
294
+ You can also see a line chart of word count and ARI for each address.
 
295
  """
296
+ )
297
+ # get all unique president names
298
+ presidents = df["potus"].unique()
299
+ presidents = presidents.tolist()
300
+ presidents.append("All")
301
+
302
+ # create a dropdown to select a president
303
+ president = gr.Dropdown(
304
+ label="Select a President", choices=presidents, value="All"
305
+ )
306
+ # create a text area to display the summarized speech
307
+ # create a slider for number of word grams
308
+ grams = gr.Slider(
309
+ minimum=1, maximum=4, step=1, label="N-grams", interactive=True, value=1
310
+ )
311
+
312
+ # show a bar chart of the top n-grams for a selected president
313
+ gr.Plot(plotly_ngrams, inputs=[grams, president, df_state])
314
 
315
+ gr.Plot(plt_wordcloud, scale=2, inputs=[president, df_state])
 
316
 
317
+ # show a line chart of word count and ARI for a selected president
318
+ gr.Plot(plotly_word_and_ari, inputs=[president, df_state])
319
 
320
+ with gr.Tab(label="Summarize a Speech"):
321
+ gr.Markdown("## Summarize a Speech")
322
+ gr.Markdown(
323
+ """
324
+ Context is king; get a summary of a State of the Union now that you've seen a bit more. Use the dropdown to select a speech from a president and click the button to summarize the speech. [Qwen/2.5-72B-Instruct](https://deepinfra.com/Qwen/Qwen2.5-72B-Instruct) will provide a concise summary of the speech with the proper historical and political context.
325
+ """
326
+ )
327
+ speeches = df["speech_key"].unique()
328
+ speeches = speeches.tolist()
329
+ speech = gr.Dropdown(label="Select a Speech", choices=speeches)
330
+ # create a dropdown to select a speech from a president
331
+ run_summarization = gr.Button(value="Summarize")
332
+ fin_speech = gr.Textbox(label="Summarized Speech", type="text", lines=10)
333
+ run_summarization.click(
334
+ streaming, inputs=[speech, df_state], outputs=[fin_speech]
335
+ )
336
 
337
  demo.launch()
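
Note on the ARI formula quoted in the Markdown above: the diff does not show how the `ari` column is computed (that happens somewhere behind `load_transform_dataset()`), so the following is only a minimal sketch of the stated formula, 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43. The function name `automated_readability_index` and the tokenization rules (whitespace-separated words, sentences split on `.`, `!`, `?`, characters counted as letters and digits) are assumptions made for illustration, not the app's actual implementation.

```python
import re


def automated_readability_index(text: str) -> float:
    """Compute the ARI of `text` using the formula quoted in the app copy.

    Assumptions (not taken from app.py): words are whitespace-separated
    tokens, sentences end at '.', '!' or '?', and only letters and digits
    count as characters.
    """
    words = text.split()
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    characters = sum(ch.isalnum() for ch in text)
    return (
        4.71 * (characters / len(words))
        + 0.5 * (len(words) / len(sentences))
        - 21.43
    )


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(round(automated_readability_index(sample), 2))
```

Very short, simple sentences produce low (even negative) scores, which matches the app's reading of the index as an approximate U.S. grade level: longer words and longer sentences both push the score up.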