NOTE: NLP for Automatic Grading of Open-Ended Questions in eBooks


This work is a component of research on making reading more motivating to children and increasing their comprehension. We obtained and graded a set of answers to open-ended questions embedded in a fiction novel written in English.


Computer science students used a subset of the graded answers to develop algorithms designed to grade new answers to the questions. The algorithms utilized the story text, existing graded answers for a given question, and publicly accessible databases in grading new responses.


  • (1) Prepare questions:
    Two of the researchers reread Weightless and generated 52 candidate open-ended inference questions
    related to character mood and causal antecedents
    (either based on human behavior or physical situations involving simple science).
    The researchers were qualified to work with inference questions and answers because
    (a) one of them, the lead author of this article, has conducted multiple studies involving inferencing and reading (e.g., Smith et al., 2010; Smith et al., 2013), and
    (b) both researchers in this phase have K-12 and university-level teaching experience, and so are experienced educators.
  • (2) Prepare standards:
Open-ended question: 
        How does Shiranna feel as the shuttle is taking off?

    Exemplar answer: 
        She is feeling excited but also nervous, scared, or apprehensive.

    Grading criteria: 
        0.5 for excited and 0.5 for stressed or anxious or scared

    Type of inference: 
        Character emotional reaction

    Text needed to make the inference: 
        Just the thought of leaving Earth both thrilled and terrified her … Her heart stopped as the trailer-sized shuttle moved forward on the track without making a sound. She took in a deep breath. … she said, trying to be brave …

    Background knowledge needed: 

    Difficulty level: 2
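A standard like the one above maps naturally onto a structured record. The sketch below (class and field names are my own, not taken from the study's materials) shows one way such a standard could be encoded in software:

```python
from dataclasses import dataclass

# Hypothetical record mirroring the grading-standard fields above;
# names are illustrative, not from the study's actual codebase.
@dataclass
class GradingStandard:
    question: str
    exemplar_answer: str
    criteria: dict          # required element -> partial credit
    inference_type: str
    source_text: str
    background_knowledge: str = ""
    difficulty: int = 1

shuttle_standard = GradingStandard(
    question="How does Shiranna feel as the shuttle is taking off?",
    exemplar_answer="She is feeling excited but also nervous, scared, or apprehensive.",
    criteria={"excited": 0.5, "stressed/anxious/scared": 0.5},
    inference_type="Character emotional reaction",
    source_text="Just the thought of leaving Earth both thrilled and terrified her ...",
    difficulty=2,
)
```

Keeping the partial credits in a dictionary makes the rubric machine-readable: the credits sum to 1.0, matching the full-credit score used in the grading step.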

(3) Prepare answers:

Based on an IRB-approved study protocol, the data were obtained from master’s-level students for extra credit in courses. Participants were 87 master’s students in Library Sciences and Instructional Technology.

Interactive eBooks were a part of the course curriculum. However, students could optionally earn extra credit by participating in this research or through other means. Participants’ ages and genders were not recorded. Readers were presented with each question and a text box in which to type their answers, which were then saved in a database.

(4) Modify the standards:

Two of the researchers met and reviewed approximately 5% of the answers, comparing them to the exemplars. In some cases, the reviewers modified the exemplars when answers revealed shortcomings in them. This preliminary review was halted once a pass produced no new revisions to the exemplars, as per the snowball sampling methodology (Goodman, 1961).

(5) Annotate data:

Two of the researchers read all the answers and independently graded them with scores of 0.0, 0.5, or 1.0 by comparing them semantically with the exemplar answers. Answers containing none of the required elements of the exemplar were deemed incorrect and given a score of 0.0. Answers with all of the required elements were given a score of 1.0, while answers with some, but not all, elements were scored 0.5. After both researchers had graded all the answers, they resolved all discrepancies through discussion.
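The three-level rubric can be sketched as a small function. This is an illustrative reconstruction, not the researchers' actual procedure (which was done by hand), and the keyword sets are hypothetical stand-ins for an exemplar's required elements:

```python
def grade(answer: str, required_elements: list) -> float:
    """Score an answer by how many required exemplar elements it contains.

    Each element is a set of interchangeable keywords (e.g. {"scared", "anxious"}).
    All elements present -> 1.0, some -> 0.5, none -> 0.0, mirroring the
    rubric described above.
    """
    words = set(answer.lower().split())
    hits = sum(1 for synonyms in required_elements if words & synonyms)
    if hits == len(required_elements):
        return 1.0
    return 0.5 if hits > 0 else 0.0

# Hypothetical required elements for the shuttle-takeoff question.
elements = [{"excited", "thrilled"},
            {"scared", "anxious", "nervous", "stressed"}]

grade("She feels excited but also scared", elements)  # 1.0
grade("She is excited", elements)                     # 0.5
grade("She is bored", elements)                       # 0.0
```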

(6) Train models:

The student groups used different approaches in their algorithms. All groups began with text preprocessing, including removal of “stopwords” (words with little semantic content, e.g., “and” or “the”), followed by lemmatization or stemming. Stemming strips a word down to its root, while lemmatization maps the word to its dictionary base form. For example, the stem of “having” is “hav”, while its lemma is “have.” The lemma therefore retains the word’s semantic meaning.
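A minimal sketch of this preprocessing pipeline, using a toy stopword list, a crude suffix-stripping stemmer, and a hand-coded lemma table (real systems would use, e.g., the Porter stemmer or a WordNet-based lemmatizer):

```python
import re

STOPWORDS = {"and", "the", "a", "is", "of", "to"}  # tiny illustrative list

def crude_stem(word: str) -> str:
    """Toy suffix-stripping stemmer: 'having' -> 'hav'."""
    return re.sub(r"(ing|ed|s)$", "", word)

# Toy lemma lookup; a real lemmatizer uses a full lexicon.
LEMMAS = {"having": "have", "has": "have", "was": "be", "better": "good"}

def preprocess(text: str) -> list:
    """Lowercase, drop stopwords, then lemmatize each remaining token."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]

crude_stem("having")                   # "hav"
preprocess("the cat was having fun")   # ["cat", "be", "have", "fun"]
```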

Implementations of the single example models relied on different string-matching techniques. Students represented both the exemplar answer and the answer to evaluate as sets of words, then determined a threshold for how many words must co-occur in both sets for the answer to be judged correct (Groups 4, 7, and 8). Group 3 used the same approach but with stems instead of lemmas. Some groups represented words as vectors and compared the cosine of the angle between the vector groups of the exemplar answer and the answer to evaluate (Groups 1, 2, 5, and 6). Group 9 chose a completely different approach, representing sentences as graphs built from subject–predicate–object triples of nouns and verbs (Khashabi et al., 2018). After forming the graphs, they used graph similarity measures to decide whether a given answer was correct.
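The two dominant single example strategies, word-set overlap against a threshold and cosine similarity of bag-of-words vectors, can be sketched as follows (tokenization details and any threshold value are assumptions, not the groups' actual settings):

```python
import math
from collections import Counter

def word_overlap(exemplar: str, answer: str) -> float:
    """Fraction of exemplar words also present in the answer
    (the set-overlap style of Groups 4, 7, and 8)."""
    e = set(exemplar.lower().split())
    a = set(answer.lower().split())
    return len(e & a) / len(e) if e else 0.0

def cosine_similarity(exemplar: str, answer: str) -> float:
    """Cosine of the angle between bag-of-words count vectors
    (the vector style of Groups 1, 2, 5, and 6)."""
    e = Counter(exemplar.lower().split())
    a = Counter(answer.lower().split())
    dot = sum(e[w] * a[w] for w in e)
    norm = (math.sqrt(sum(c * c for c in e.values()))
            * math.sqrt(sum(c * c for c in a.values())))
    return dot / norm if norm else 0.0

# An answer is accepted as correct when similarity exceeds a tuned threshold.
word_overlap("she is excited and scared", "she felt excited")  # 0.4
```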
For the all examples model, groups built on ideas from the single example models and improved them with machine learning algorithms for classification, defining additional feature sets for these models. The groups used standard machine learning algorithms such as random forests, SVMs, naive Bayes, and decision trees. Given a training data set, these algorithms build a model that can then predict classes for new, unseen answers.
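To illustrate the train-then-predict pattern, here is a minimal nearest-centroid classifier over hand-crafted answer features. It is a stand-in for the off-the-shelf algorithms the groups actually used (random forests, SVMs, etc.), and the feature choices are hypothetical:

```python
import math

def centroid(rows):
    """Mean of each feature column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(features, labels):
    """Compute one centroid per grade label -- a toy substitute for the
    off-the-shelf classifiers (random forests, SVMs, ...) the groups used."""
    model = {}
    for label in set(labels):
        rows = [f for f, lab in zip(features, labels) if lab == label]
        model[label] = centroid(rows)
    return model

def predict(model, feature_vector):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], feature_vector))

# Hypothetical features: [overlap_with_exemplar, answer_length, cosine_similarity]
X = [[0.9, 8, 0.8], [0.1, 3, 0.1], [0.5, 6, 0.4]]
y = [1.0, 0.0, 0.5]  # grades assigned by the human annotators
model = train(X, y)
predict(model, [0.85, 7, 0.75])  # closest centroid: grade 1.0
```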

For the all examples+ model, groups could use any additional text resource they could find to enrich their data. Although the domain of the selected book seemed somewhat narrow, the questions required some general knowledge to answer correctly. Therefore, some student groups used databases well known in the field of NLP, such as WordNet (Miller, 1995), GloVe vectors (Pennington et al., 2014), or ConceptNet (Liu & Singh, 2004). WordNet is a large lexical database consisting of groups of synonymous words, called synsets, interconnected by semantic relations such as hypernymy and antonymy. GloVe vectors are vector representations of words that encode aggregated global word–word co-occurrence statistics from a large corpus. Finally, ConceptNet is a freely available semantic network designed to help computers understand the meanings of the words people use. For example, for the word “dog,” we could find relations indicating that it is a type of “pet” or “mammal,” can “run” or “bark,” and has “four legs” and “two ears.” Group 6 introduced additional changes to create three different versions of their model, using WordNet, semantic triples (subject–predicate–object), and important words. Despite these many changes, their all examples+ model did not improve on their all examples model.
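The benefit of a WordNet-style resource can be illustrated with a toy synonym-expansion step. The mini-lexicon below is hand-coded for illustration only; a real implementation would query WordNet (e.g., via NLTK's wordnet corpus) or look up nearest neighbors in GloVe vector space:

```python
# Hand-coded toy synsets standing in for WordNet lookups.
SYNSETS = {
    "scared": {"scared", "afraid", "frightened", "terrified"},
    "excited": {"excited", "thrilled"},
}

def expand(words: set) -> set:
    """Add each word's synonyms, so 'terrified' in an answer can match
    'scared' in the exemplar even though the strings differ."""
    expanded = set(words)
    for word in words:
        for synset in SYNSETS.values():
            if word in synset:
                expanded |= synset
    return expanded

answer_words = {"she", "was", "terrified"}
"scared" in expand(answer_words)  # True: the synonym bridge makes the match
```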


Group 1’s solution is based on representing text using TF-IDF and Word2Vec vectors (Mikolov et al., 2015). TF-IDF is a standard vector-based weighting scheme that promotes the more important words in a corpus. Word2Vec vectors are trained using neural networks, similarly to GloVe, and encode similarities among words.
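A self-contained TF-IDF sketch (simplified: raw term frequency and an unsmoothed logarithmic IDF; production code would typically use a library such as scikit-learn):

```python
import math
from collections import Counter

def tf_idf(docs: list) -> list:
    """Weight each word by term frequency times inverse document frequency,
    so words common across the whole corpus are down-weighted."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

corpus = [["shuttle", "launch", "fast"],
          ["shuttle", "slow"],
          ["book", "slow"]]
w = tf_idf(corpus)
# "launch" appears in only one document, so in doc 0 it outweighs
# "shuttle", which appears in two.
```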

