We trained a language model to automatically score IELTS (International English Language Testing System) essays, using a large training dataset of essays scored by human raters.

The training dataset consists of 18,000 real IELTS exam essays and their official scores. Following the official IELTS standards, the model scores each essay on five dimensions: task achievement, coherence and cohesion, vocabulary, grammar, and overall. The overall score is the composite score of the essay.

On the test dataset, the model achieves Accuracy = 0.82 and F1 Score = 0.81. Based on these results, the model could replace human raters for IELTS essays to some degree, and we will continue to optimize it to improve its accuracy and effectiveness.
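The exact evaluation protocol behind these two numbers is not described here. As a rough sketch only, the snippet below assumes that each half-band score is treated as a class label and that F1 is macro-averaged over the labels; both choices are assumptions. The helper names `accuracy` and `macro_f1` and the sample scores are hypothetical and are not taken from the test dataset.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of essays whose predicted band equals the human band."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-band F1 scores (assumed averaging scheme)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1_per_label = []
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        f1_per_label.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_per_label))

# Hypothetical band scores, for illustration only
human_scores = [6.0, 6.5, 7.0, 7.0, 8.0]
model_scores = [6.0, 6.5, 7.0, 7.5, 8.0]

print(f"Accuracy: {accuracy(human_scores, model_scores):.2f}")
print(f"Macro F1: {macro_f1(human_scores, model_scores):.2f}")
```

Under this scheme a prediction that is off by half a band counts as fully wrong, so exact-match accuracy is a strict measure for band scores.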

Please cite this paper if you use this model:

@article{sun2024automatic,
  title={Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression},
  author={Kun Sun and Rong Wang},
  year={2024},
  journal={ArXiv},
  url={https://arxiv.org/abs/2406.01198}
}

The following code applies the model to score new IELTS essays. In this example, an essay with an overall score of 8.0 is taken from the test dataset. Our model grades the essay as 8.5, which is very close to the score given by the human rater.


from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np

# Load the pre-trained model and tokenizer
model_path = "KevSun/IELTS_essay_scoring"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Example text to evaluate: an essay from the test dataset, with its human-rater score (= 8.5)

new_text = (
    "It is important for all towns and cities to have large public spaces such as squares and parks. "
    "Do you agree or disagree with this statement? It is crucial for all metropolitan cities and towns to "
    "have some recreational facilities like parks and squares because of their numerous benefits. A number of "
    "arguments surround my opinion, and I will discuss it in upcoming paragraphs. To commence with, the first "
    "and the foremost merit is that it is beneficial for the health of people because in morning time they can "
    "go for walking as well as in the evenings, also older people can spend their free time with their loved ones, "
    "and they can discuss about their daily happenings. In addition, young people do lot of exercise in parks and "
    "gardens to keep their health fit and healthy, otherwise if there is no park they glue with electronic gadgets "
    "like mobile phones and computers and many more. Furthermore, little children get best place to play, they play "
    "with their friends in parks if any garden or square is not available for kids then they use roads and streets "
    "for playing it can lead to serious incidents. Moreover, parks have some educational value too, in schools, "
    "students learn about environment protection in their studies and teachers can take their pupils to parks because "
    "students can see those pictures so lively which they see in their school books and they know about importance "
    "and protection of trees and flowers. In recapitulate, parks holds immense importance regarding education, health "
    "for people of every society, so government should build parks in every city and town."
)


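# Tokenize the essay, truncating it to at most 512 tokens
encoded_input = tokenizer(new_text, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Switch the model to evaluation mode
model.eval()

# Perform the prediction
with torch.no_grad():
    outputs = model(**encoded_input)

# Drop the batch dimension to get one raw score per item
predictions = outputs.logits.squeeze()

# Convert the tensor to a NumPy array
predicted_scores = predictions.numpy()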

# Normalize the scores so that the highest raw score maps to 9
normalized_scores = (predicted_scores / predicted_scores.max()) * 9  # Scale to 9

# Round to the nearest half band (steps of 0.5)
rounded_scores = np.round(normalized_scores * 2) / 2

item_names = ["Task Achievement", "Coherence and Cohesion", "Vocabulary", "Grammar", "Overall"]


for item, score in zip(item_names, rounded_scores):
    print(f"{item}: {score:.1f}")

# Output:
# Task Achievement: 9.0
# Coherence and Cohesion: 7.5
# Vocabulary: 8.0
# Grammar: 7.5
# Overall: 8.5
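
The scaling and rounding steps in the code above can also be wrapped in a small helper, which makes them easier to reuse and to test without loading the model. This is a minimal sketch: the function name `to_band_scores` and the raw input values are hypothetical, and the raw values are not real model outputs.

```python
import numpy as np

ITEM_NAMES = ["Task Achievement", "Coherence and Cohesion", "Vocabulary", "Grammar", "Overall"]

def to_band_scores(raw_scores, max_band=9.0):
    """Scale raw scores so the largest maps to max_band, then round to 0.5 steps."""
    raw = np.asarray(raw_scores, dtype=float)
    scaled = raw / raw.max() * max_band   # same scaling as the example above
    rounded = np.round(scaled * 2) / 2    # nearest half band
    return dict(zip(ITEM_NAMES, rounded.tolist()))

# Hypothetical raw scores, for illustration only
bands = to_band_scores([2.0, 1.5, 1.75, 1.6, 1.9])
for item, score in bands.items():
    print(f"{item}: {score:.1f}")
```

Because the scores are divided by their own maximum, the highest-scoring item always comes out as 9.0 with this scaling, and the remaining items are expressed relative to it. The scaling also assumes that the largest raw score is positive.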