---
license: mit
base_model:
- FacebookAI/roberta-base
pipeline_tag: text-classification
library_name: transformers
language:
- en
---

This model is an instance of RoBERTa-Base fine-tuned to classify course records from student postsecondary administrative transcripts into the National Center for Education Statistics' 2010 College Course Map (CCM).

The College Course Map is a hierarchical taxonomy of course content that roughly aligns with the Classification of Instructional Programs (CIP) codes commonly used in the United States.

The College Course Map was developed for use with longitudinal surveys including the High School Longitudinal Study of 2009 (HSLS 2009), Baccalaureate and Beyond Longitudinal Study of 2008-2012 (B&B 2008), Beginning Postsecondary Students Longitudinal Study of 2004-2009 (BPS 2004), and Beginning Postsecondary Students Longitudinal Study of 2012-2017 (BPS 2012).

Administrative transcripts for all survey participants were collected along with each survey, and each course enrollment in the transcripts was labelled with the appropriate six-digit CCM code by human annotators. More information about the development of the CCM and the annotation process is available here:

Bryan, M. & Simone, S. (2012). *2010 College Course Map Technical Report*. National Center for Education Statistics. https://nces.ed.gov/pubs2012/2012162rev.pdf

This RoBERTa model is fine-tuned to classify course records into the appropriate two-digit CCM code (for example, 45 represents Social Science courses and 38.01 represents Philosophy and Religion courses). It was fine-tuned on 802,190 unique course sections from the four surveys referenced above.

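Because the CCM is hierarchical, a coarser code can be read off a finer one. The following is a minimal sketch of that idea, assuming six-digit codes are written in the CIP-style `XX.XXXX` layout (an assumption of this example, not something stated in this card). `ccm_levels` is a hypothetical helper, and `38.0101` is used only to illustrate the format.

```python
def ccm_levels(code: str) -> dict:
    """Split a six-digit code into two-, four-, and six-digit levels.

    Assumes the CIP-style "XX.XXXX" layout; this is an illustrative
    assumption, not a documented CCM specification.
    """
    digits = code.replace(".", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit code like '38.0101', got {code!r}")
    return {
        "two_digit": digits[:2],
        "four_digit": f"{digits[:2]}.{digits[2:4]}",
        "six_digit": f"{digits[:2]}.{digits[2:]}",
    }


levels = ccm_levels("38.0101")
print(levels["two_digit"])   # 38
print(levels["four_digit"])  # 38.01
```

Under this assumption, human-annotated six-digit labels could be collapsed to the coarser level a given model predicts before comparing predictions with labels.
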
More information about the fine-tuning process is available here:

Paulson, A., Stange, K., & Flaster, A. (2024). *Classifying Courses at Scale: A Text as Data Approach to Characterizing Student Course-Taking Trends with Administrative Transcripts* (EdWorkingPaper: 24-1042). Annenberg Institute at Brown University. https://doi.org/10.26300/7fpas433

The model is fine-tuned on data formatted as "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}". For example, for a course offered in an economics department with subject code "ECON", catalog number "101", and course title "Principles of Microeconomics", the model expects the following string: "ECON 101 --- Principles of Microeconomics". [This Colab notebook](https://colab.research.google.com/drive/1iebZ_Zznpv3XPgF34LmwFozd7fSg0ZCh?usp=sharing) provides a short vignette applying the model.

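The input format described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' preprocessing code: `format_course` and `classify_courses` are hypothetical helpers, the whitespace stripping is an assumption, and `model_id` is a placeholder for this model's Hub repository id or a local path.

```python
def format_course(subject_code, catalog_number, course_title):
    """Build the "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}" string."""
    # Stripping surrounding whitespace is an assumption of this sketch,
    # not a documented preprocessing step.
    subject, number, title = (
        str(part).strip() for part in (subject_code, catalog_number, course_title)
    )
    return f"{subject} {number} --- {title}"


def classify_courses(texts, model_id):
    """Run formatted course strings through a text-classification pipeline.

    model_id is a placeholder: pass this model's Hub repository id or a local path.
    """
    # Imported lazily so the formatting helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)  # one {"label": ..., "score": ...} dict per input


text = format_course("ECON", 101, "Principles of Microeconomics")
print(text)  # ECON 101 --- Principles of Microeconomics
```

Calling `classify_courses([text], model_id)` would then return the predicted label and its score for each formatted string.
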
Six-Digit Prediction Accuracy on Course Sections: 0.65 <br>
Six-Digit Prediction Accuracy on Enrollment Weighted Course Sections: 0.75 <br>
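
The difference between the two metrics can be illustrated with a small sketch. It assumes one plausible reading of "enrollment weighted", namely that each course section counts in proportion to the number of students enrolled in it. The labels and enrollment counts below are toy values, not real CCM codes or results.

```python
def accuracy(y_true, y_pred, weights=None):
    """Share of correct predictions, optionally weighted per course section."""
    if weights is None:
        weights = [1] * len(y_true)  # unweighted: every section counts once
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("y_true, y_pred, and weights must have the same length")
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


# Toy data: four course sections with placeholder labels (not real CCM codes).
y_true = ["A", "B", "C", "D"]
y_pred = ["A", "B", "X", "Y"]
enrollments = [300, 100, 50, 50]  # hypothetical students per section

print(accuracy(y_true, y_pred))               # 0.5
print(accuracy(y_true, y_pred, enrollments))  # 0.8
```

In this toy case the weighted figure is higher because the correctly classified sections are also the largest ones.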