Corey Morris committed on
Commit e7c50af
1 Parent(s): e1345be

changed the wording of moral scenarios

Files changed (1)
  1. app.py +7 -5
app.py CHANGED
@@ -340,21 +340,23 @@ fig = create_plot(filtered_data, 'Parameters', 'MMLU_abstract_algebra')
 st.plotly_chart(fig)
 
 # Moral scenarios plots
-st.markdown("### Moral Scenarios Performance")
+st.markdown("### MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures")
 def show_random_moral_scenarios_question():
     moral_scenarios_data = pd.read_csv('moral_scenarios_questions.csv')
     random_question = moral_scenarios_data.sample()
     expander = st.expander("Show a random moral scenarios question")
     expander.write(random_question['query'].values[0])
 
-show_random_moral_scenarios_question()
+
 
 st.write("""
-While smaller models can perform well at many tasks, the model size threshold for decent performance on moral scenarios is much higher.
-There are no models with less than 13 billion parameters with performance much better than random chance. Further investigation into other capabilities that emerge at 13 billion parameters could help
-identify capabilities that are important for moral reasoning.
+After a deeper dive into the moral scenarios task, it appears that benchmark is not a valid measurement of moral judgement.
+The challenges these models face are not rooted in understanding each scenario, but rather in the structure of the task itself.
+I would recommend using a different benchmark for moral judgement. More details of the analysis can be found here: [MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures ](https://medium.com/p/74fd6e512521)
 """)
 
+show_random_moral_scenarios_question()
+
 fig = create_plot(filtered_data, 'Parameters', 'MMLU_moral_scenarios', title="Impact of Parameter Count on Accuracy for Moral Scenarios")
 st.plotly_chart(fig)
 st.write()
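
For context, both sides of the diff call a `create_plot` helper that is defined earlier in app.py and does not appear in this hunk. A minimal sketch of what such a helper might look like, assuming it wraps a Plotly Express scatter of one leaderboard column against another, is shown below; the signature and behaviour are assumptions for illustration, not the definition used in this commit.

```python
# Hypothetical sketch of the create_plot helper referenced in the diff above.
# The real definition lives elsewhere in app.py; names here are assumptions.
import pandas as pd
import plotly.express as px


def create_plot(df: pd.DataFrame, x: str, y: str, title=None):
    """Scatter one column of the leaderboard data against another,
    e.g. parameter count vs. a benchmark accuracy column."""
    fig = px.scatter(df, x=x, y=y, title=title or f"{y} vs. {x}")
    fig.update_layout(xaxis_title=x, yaxis_title=y)
    return fig
```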