RevUP: Automatically Generating Questions from Educational Texts

What is RevUP?

RevUP automatically generates gap-fill multiple choice questions from online texts. Do take a look at the magical demo!

Why RevUP?

Students like myself are increasingly turning to online texts to supplement classroom material. However, these texts do not come with the review questions that are crucial for reinforcing the relevant concepts. Furthermore, the continued crafting of varied recall and application questions can be extremely time-intensive for teachers. By automatically generating questions, RevUP makes online learning and teaching more efficient for students and teachers. In the long run, RevUP represents a step towards automatically creating mini-courses from any online text.

Key and Novel Contributions of RevUP


I propose selecting topically important sentences by ranking them based on topic distributions obtained from a topic model. Harnessing the generative power of deep learning methods, I propose a novel Deep Autoencoder Topic Model (DATM), which discovers topics that are more coherent than those of the widely-used Latent Dirichlet Allocation and Latent Semantic Analysis.


To select gap-phrases from each selected sentence, we collected human annotations on the relative relevance of candidate gaps using Amazon Mechanical Turk. This data was used to train a discriminative classifier to predict the educational relevance of gaps, achieving an accuracy of 81.0%.


I propose a novel method to choose distractors that are semantically and syntactically similar to the gap-phrase and fit the context of the gap-fill question. By crowd-sourcing the evaluation of my method through Amazon Mechanical Turk, I found that 76% of the selected distractors were rated as good.



Hello! I'm Girish Kumar from Singapore.

It all started when the 6-year-old me was disappointed that computers were not as smart as sci-fi movies made them out to be. This piqued my interest in building smart machines and programs. I joined my school's robotics team when I was 8, working with Lego Mindstorms and VEX kits. Participating and winning prizes every year at the National Junior Robotics Competition kept me motivated.

I soon started taking online courses on Machine Learning and AI taught by Stanford Professors Andrew Ng and Sebastian Thrun. My forays into ML research followed. I worked on projects ranging from gesture recognition mobile apps to automatic indoor mapping programs, working at institutions including the MIT Media Lab and Singapore's A*STAR. My work has been accepted to international peer-reviewed conferences including NAACL and IJCNN. Recently, I won Singapore's national science talent search.

Founders of technology- and research-driven startups, such as Elon Musk and Larry Page, inspire me greatly. I would like to emulate these individuals by leveraging technology to empower mankind. I have been accepted into the Kairos Society, which believes in high-tech, high-impact ventures.

Winning the GSF would make the long nights I have spent converting caffeine to code worth it. Furthermore, it's a core desire of mine to deploy my GSF project for use by students & teachers all over the world. I believe winning GSF would provide me with the publicity, resources and motivation to do so. 



I, like many students, actively consume online texts to supplement material taught and provided by my teachers. However, online texts are typically not accompanied by the instructional material found in standard classroom notes and textbooks. Such instructional material (e.g. review questions, practice assessments) is crucial in helping me reinforce the relevant concepts. Furthermore, as my teachers have lamented to me, the manual crafting of varied questions for every relevant online text would be extremely time-consuming. Automatic Question Generation (AQG) shows incredible promise here. However, work in AQG has mostly been on transforming sentences into grammatically and syntactically correct Why, What, Who, Where, When and How questions, with little attention to the semantics and educational importance of the questions.


Given this, is it possible to automatically generate semantically and educationally relevant questions for teachers and students? 


First, I aimed to build upon advances in Natural Language Processing and Machine Learning to automatically generate gap-fill multiple choice questions. Specifically:

  1. Using topic models to select topically important and coherent sentences from the text to ask about
  2. Training a machine learning model to identify which part of the resulting sentence to choose as an educationally relevant gap
  3. Selecting multiple choice distractors that are semantically & syntactically similar to the gap and have contextual fit to the question sentence

A web interface, using the Django framework, will also be built for the algorithm. Finally, I aimed to evaluate our system with a group of students to gain insights into the effectiveness of the system.


Work in Automatic Question Generation (AQG) for educational purposes has mainly involved transforming sentences into questions. This dates back to the AutoQuest system [31], in which a completely syntactic approach was used to generate wh-questions. Since then, significant progress has been made in this field, and much of the work can be divided into two main categories: Wh-Question Generation (WQG) and Gap-Fill Question Generation (GFQG).

Wh-Question Generation (WQG)

Multiple approaches have been proposed for WQG, the main ones being syntactic, template-based and semantic. Mitkov et al. used a three-step approach for generating multiple-choice wh-questions [21]. Shallow parsing was used for selecting domain-specific terms. WordNet was then utilised to select relevant multiple-choice options (distractors). Wh-questions were then generated from sentences containing domain-specific terms using transformation rules. A template-based approach was taken by Mostow et al., where 7 question templates were used to generate What, Why, How, What-would-happen-if, When-would-x-happen, What-would-happen-when and Why-x questions, with 71.3% of the generated questions rated as acceptable [22]. Wyse et al. use syntax trees for input sentences and select parts of these sentences that are compatible with specific rules derived from syntactic patterns [32]. Templates are then used, in conjunction with the matched parts of the sentence, to generate a question and the corresponding (short) answer. Heilman et al. proposed an overgenerate-and-rank methodology where questions were generated using a syntactic method similar to that of Wyse et al. A training corpus was created with human annotations on the quality of questions generated from a random sample of Wikipedia articles. A discriminative re-ranker was trained with this corpus to rank the over-generated questions [7]. Yao et al. presented the usage of minimal recursion semantics and a rule-set for WQG [16]. Though these works represent significant progress, they have mostly focused on the grammatical aspect of questions. Many of these methods are rule- and template-based, which constrains the domains to which they can be applied.

Gap-Fill Question Generation

On the other hand, GFQG overcomes WQG's need for grammaticality by blanking out meaningful words, henceforth referred to as gaps, in known good sentences [3]. The methods presented in this work focus on GFQG due to this inherent advantage, which allows us to concentrate on the educational and semantic relevance of the questions generated.

Previous work in GFQG has generally targeted vocabulary testing and language learning [26, 27]. Smith et al. presented TedClogg, which took gap-phrases as input and found multiple-choice distractors from a distributional thesaurus; 53.3% of the questions generated were acceptable. Our work aligns more closely with that of Agarwal et al., where a weighted sum of lexical and syntactic features was utilised to select sentences, gaps and distractors from informative texts [1]. Becker et al. built upon the former's work by collecting human ratings of questions generated from a Wikipedia-based corpus. A machine-learning model was trained to effectively replicate these judgments, achieving a true positive rate of 0.83 and a false positive rate of 0.19 [3].


Sentence Selection

Current systems use extractive summarization methods for sentence selection. However, these methods maximize content coverage, which could result in complex or incoherent sentences. I propose selecting topically important & coherent sentences by ranking them based on topic distributions obtained from a topic model. Each sentence is assigned a score which is the weighted sum of the highest probabilities of its topic distribution.

Our method favors sentences with peaked topic distributions which means:

  1. Few topics are expressed (coherence)
  2. Those topics are expressed to a high degree (importance)
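As a sketch, the peaked-distribution scoring described above can be expressed as follows; the weights here are illustrative, not the ones used in RevUP:

```python
import numpy as np

def sentence_score(topic_dist, k=3, weights=(0.6, 0.3, 0.1)):
    """Score a sentence by the weighted sum of the k highest
    probabilities in its topic distribution (peaked => high score)."""
    top = np.sort(np.asarray(topic_dist))[::-1][:k]
    return float(np.dot(weights[:len(top)], top))

# A peaked distribution (few topics, strongly expressed) outranks a flat one.
peaked = [0.85, 0.10, 0.03, 0.01, 0.01]
flat   = [0.22, 0.20, 0.20, 0.19, 0.19]
assert sentence_score(peaked) > sentence_score(flat)
```

Sentences are then ranked by this score, and the top-ranked ones are kept as question sentences.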

However, widely-used topic models such as Latent Dirichlet Allocation and Latent Semantic Analysis cannot handle the sparsity of information in sentences. As such, harnessing the generative power of deep learning methods, I propose a novel Deep Autoencoder Topic Model (DATM). The DATM discovers topics from sentences in two steps, as follows.

Our novel contribution is the Sparse, Selective RBM based on the Restricted Boltzmann Machine (RBM). Sparsity and selectivity penalties were added to the RBM cost function to ensure that the DATM works well for short texts. 
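A minimal numerical illustration of the two penalty terms, assuming squared-error penalties and illustrative target activations (the exact penalty form used in the DATM may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_penalties(v, W, b_h, p_target=0.05, q_target=0.05):
    """Sparsity and selectivity penalties for an RBM's hidden layer.

    v: (batch, n_visible) binary word features
    W: (n_visible, n_hidden) weights; b_h: hidden biases
    """
    h = sigmoid(v @ W + b_h)  # hidden unit activation probabilities
    # Sparsity: each hidden unit should be active for few examples
    # (mean over the batch, per unit, pushed towards p_target).
    sparsity = np.mean((h.mean(axis=0) - p_target) ** 2)
    # Selectivity: each example should activate few hidden units
    # (mean over units, per example, pushed towards q_target).
    selectivity = np.mean((h.mean(axis=1) - q_target) ** 2)
    return sparsity, selectivity

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=(8, 20)).astype(float)
W = rng.normal(scale=0.1, size=(20, 10))
s, q = rbm_penalties(v, W, np.zeros(10))
assert s >= 0 and q >= 0
```

Both penalties would be added, with weighting coefficients, to the standard RBM cost before computing parameter updates.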

A more comprehensive overview.

Gap Selection

RevUP aims to choose educationally relevant gap-phrases from each question sentence by replicating human judgments. Amazon Mechanical Turk was used to collect human annotations of candidate gaps in a cost- and time-efficient manner. Collecting ratings on the absolute relevance of gap candidates has plagued past work with inter-annotator agreement issues, as each annotator had different requirements for each possible rating. Therefore, I proposed a task involving the ranking of candidate gaps from a selected sentence. The candidates, selected using the Stanford Parser, are the nouns, adjectives, and phrases with a Wikipedia page in the sentence. To keep each task short, it involved the ranking of 3 gap-phrases from one source sentence; for every source sentence, multiple sets of gap-phrase triplets were created.
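The candidate-extraction step might be sketched as follows, assuming pre-tagged input with Penn Treebank tags and a pre-built set of Wikipedia titles (the real pipeline used the Stanford Parser):

```python
def candidate_gaps(tagged_tokens, wiki_titles=frozenset()):
    """Extract candidate gap-phrases from a POS-tagged sentence:
    nouns, adjectives, and multi-word phrases with a Wikipedia page."""
    candidates = [w for w, t in tagged_tokens if t.startswith(("NN", "JJ"))]
    # Add multi-word phrases that match known Wikipedia titles.
    words = [w for w, _ in tagged_tokens]
    for i in range(len(words)):
        for j in range(i + 2, min(i + 5, len(words) + 1)):
            phrase = " ".join(words[i:j])
            if phrase.lower() in wiki_titles:
                candidates.append(phrase)
    return candidates

tagged = [("Mitochondria", "NNS"), ("produce", "VBP"), ("chemical", "JJ"),
          ("energy", "NN"), ("for", "IN"), ("the", "DT"), ("cell", "NN")]
assert "energy" in candidate_gaps(tagged)
assert "chemical energy" in candidate_gaps(tagged, {"chemical energy"})
```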

Final Task-UI below.


Since conventional inter-annotator agreement metrics cannot be used for ranking tasks, I proposed a new measure: the fraction of (ranker pair, gap pair) combinations in which both rankers order the two gaps the same way,

    Agreement = (1/|P|) Σ_{m<n} Σ_{X,Y} 1[ sgn(r_m(X) − r_m(Y)) = sgn(r_n(X) − r_n(Y)) ]

where sgn(·) is the sign function, r_n(Z) is the rank assigned by ranker n to gap Z, and |P| is the total number of (ranker pair, gap pair) combinations.
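The proposed measure can be sketched in code; the ranking data here is a toy example:

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def inter_ranker_agreement(rankings):
    """Fraction of (ranker pair, gap pair) combinations where both
    rankers order the two gaps the same way. `rankings` maps each
    ranker to a dict {gap: rank}."""
    rankers = list(rankings.values())
    gaps = list(rankers[0])
    agree = total = 0
    for rm, rn in combinations(rankers, 2):
        for x, y in combinations(gaps, 2):
            total += 1
            agree += sign(rm[x] - rm[y]) == sign(rn[x] - rn[y])
    return agree / total

r = {"A": {"g1": 1, "g2": 2, "g3": 3},
     "B": {"g1": 1, "g2": 2, "g3": 3},
     "C": {"g1": 3, "g2": 2, "g3": 1}}
# A and B agree on all 3 gap pairs; C disagrees with both on all 3 pairs.
assert inter_ranker_agreement(r) == 3 / 9
```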

Also, the collected gap-phrase rankings were pre-processed: gaps that were variously ranked first, second and third by different annotators, as well as gaps that showed no inter-annotator agreement, were removed.

Finally, to train a binary classifier (Support Vector Machine) with the collected human judgements, I proposed the following features.
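A sketch of the training setup with scikit-learn, using random toy features in place of the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for the real data: each row is a candidate gap's
# feature vector, each label is relevant (1) or irrelevant (0).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross validation
# Toy data is nearly separable on feature 0, so accuracy is high.
assert scores.mean() > 0.7
```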

A more comprehensive overview.

Distractor Selection

Past work involved thesauruses and rule-based approaches. Instead, I propose selecting distractors with these properties:

Semantic Similarity: Distractors should be similar in meaning to the gap. To achieve this, I used Word2vec, which learns distributed representations of words, from input texts, in a semantic vector space. To find semantically similar distractors, the words closest to the gap in the vector space are chosen.
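As an illustration, nearest neighbours by cosine similarity, with toy vectors standing in for trained Word2vec embeddings:

```python
import numpy as np

def nearest_words(gap, embeddings, top_n=3):
    """Return the top_n words closest to the gap-phrase in a word
    vector space (cosine similarity), excluding the gap itself."""
    g = embeddings[gap]
    sims = {}
    for w, v in embeddings.items():
        if w != gap:
            sims[w] = np.dot(g, v) / (np.linalg.norm(g) * np.linalg.norm(v))
    return sorted(sims, key=sims.get, reverse=True)[:top_n]

# Toy embeddings standing in for trained Word2vec vectors.
emb = {"mitochondria": np.array([0.9, 0.1]),
       "chloroplast":  np.array([0.8, 0.2]),
       "ribosome":     np.array([0.7, 0.4]),
       "the":          np.array([-0.5, 0.9])}
assert nearest_words("mitochondria", emb)[0] == "chloroplast"
```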

Syntactic Similarity: Distractors that look similar to the gap can be effective. This was done by computing the Dice Coefficient for the gap-phrase and each candidate distractor.
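The Dice Coefficient over character bigrams can be computed as follows:

```python
def dice_coefficient(a, b):
    """Dice coefficient over character bigrams: 2|A∩B| / (|A|+|B|).
    Measures how similar two phrases *look*."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    A, B = bigrams(a.lower()), bigrams(b.lower())
    return 2 * len(A & B) / (len(A) + len(B))

assert dice_coefficient("night", "nacht") == 0.25   # share only "ht"
assert dice_coefficient("glucose", "glucose") == 1.0
```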

Contextual Fit: Distractors should fit into the question sentence. To do so, I used a language model to calculate the probability of a candidate distractor occurring in the question.
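A sketch of contextual-fit scoring with a toy add-one-smoothed bigram model (the actual language model used in RevUP may differ):

```python
from collections import Counter

def bigram_fit(distractor, left_word, right_word, bigram_counts, unigram_counts):
    """Score how well a distractor fits the gap context using a bigram
    language model with add-one smoothing: P(d | left) * P(right | d)."""
    V = len(unigram_counts)  # vocabulary size for smoothing
    p_left = (bigram_counts[(left_word, distractor)] + 1) / (unigram_counts[left_word] + V)
    p_right = (bigram_counts[(distractor, right_word)] + 1) / (unigram_counts[distractor] + V)
    return p_left * p_right

# Toy counts standing in for a model trained on a large corpus.
uni = Counter({"produce": 10, "energy": 8, "for": 20, "water": 5})
bi = Counter({("produce", "energy"): 6, ("energy", "for"): 5})
# "energy" fits "produce ___ for" better than "water" does.
assert bigram_fit("energy", "produce", "for", bi, uni) > \
       bigram_fit("water", "produce", "for", bi, uni)
```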

A more comprehensive overview.


Sentence Selection

Since the sentence selection algorithm depends heavily on the performance of the proposed DATM, I benchmarked it against the widely-used Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). The corpora used are high school textbooks for biology and US history: Campbell's Biology and The American Pageant respectively. Each sentence is treated as a document. The textbook corpora allow me to benchmark the performance of the DATM on short, informative sentences, which is relevant to my work. I pre-processed the corpora by removing stop-words and stemming. Average Topic Coherence (ATC) is used for performance benchmarking [19]. ATC computes a sum of pair-wise scores over the top n words, w, that describe each topic:

    ATC = (1/T) Σ_{t=1}^{T} Σ_{i=2}^{n} Σ_{j=1}^{i−1} log( (D(w_i, w_j) + 1) / D(w_j) )

where T is the number of topics while D(wi) and D(wi, wj) are the counts of training documents containing the word wi and both the words wi and wj respectively. A better topic model will result in a less negative ATC. ATC was chosen as it was found to be strongly correlated with human judgement of topics [19].
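The ATC computation can be sketched as follows, with toy documents represented as word sets:

```python
import math

def average_topic_coherence(topics, docs):
    """Average Topic Coherence (Mimno et al.): for each topic's top
    words, sum log((D(wi, wj) + 1) / D(wj)) over word pairs, then
    average over topics. Less negative = more coherent."""
    def d(*words):  # number of documents containing all given words
        return sum(all(w in doc for w in words) for doc in docs)
    total = 0.0
    for words in topics:
        for i in range(1, len(words)):
            for j in range(i):
                total += math.log((d(words[i], words[j]) + 1) / d(words[j]))
    return total / len(topics)

docs = [{"cell", "energy"}, {"cell", "energy"}, {"cell", "wall"}, {"war", "treaty"}]
coherent = [["cell", "energy"]]    # top words that co-occur often
incoherent = [["cell", "war"]]     # top words that never co-occur
assert average_topic_coherence(coherent, docs) > average_topic_coherence(incoherent, docs)
```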

Each topic model was run to discover different numbers of topics (5 to 100) for each textbook. Results follow.

Clearly, the DATM is able to discover topics more coherent than those of LDA and LSA.

Gap Selection 

200 sentences, from the Campbell's biology textbook, were deployed with rankings collected for 1306 gaps in total. The inter-ranker agreement calculated with the proposed metric was high at 0.783. This indicates that raters agreed with each other on the relative relevance of two gaps 78.3% of the time. A Support Vector Machine(SVM) was trained with the data.

The following table details the average accuracy, precision, recall and F1-score achieved for 5-fold and 10-fold cross-validation tests. The detailed calculation of these measures is provided in Section 6.5.1. Given an accuracy of 81%, it can be concluded that RevUP performs fairly well for gap selection, on par with previous work [3].

The results also show the substantial impact pre-processing had on classifier performance, with accuracy increasing by about 4%.

To understand the impact of each feature on classifier performance, I measured classifier accuracy with each feature removed, over 10 folds.

Note that feature numbers correspond to the numbers in the feature table in the previous section. Most features have an equal effect on classifier performance with the exception of WordVec (Feature 10). Without WordVec, classifier performance drops to 76.6%. The large impact of WordVec is mainly because it strongly encodes the semantics of candidate gap-phrases. 

Distractor Selection

Amazon Mechanical Turk was used to crowd-source the evaluation of the distractor selection method. The UI of the task provided to the human turkers is below.

75 sentences with 300 distractors from the Campbell’s Biology Textbook were deployed. Each distractor was rated by 5 turkers. A majority vote was used to classify a distractor as good or bad. Results follow.

76% of the distractors selected by RevUP were rated as good which is significantly higher than the 40% in [1] and 47% in [33].


Preliminary Usability Study

In the previous section, especially for gap selection and distractor selection, our evaluation was crowd-sourced to "non-experts". These non-experts might not have knowledge of high-school biology. As such, I decided to conduct a smaller study, this time with actual students.

15 students from NUS High School, who had taken at least 4 years of high school biology, were recruited for the study. The students were aged between 15 and 18. Each student was tasked to provide us with a high school biology text of his/her choice. A list of questions was generated for each text, and the students were asked to rate each question on a scale of 0 to 3. A cascade rating scheme was used, as described in the following table.

From the texts that the 15 students sent, 495 questions were collectively generated. The following figure shows the number of questions that fall into each rating band.

In terms of sentence and gap selection, our system performed very well: 94% of sentences and 87% of gaps were considered good. However, this was not the case for distractor selection, where only 60% of the distractors were considered good. This can mainly be attributed to the fact that our algorithm, in its current state, is not able to reject distractors that could be correct answers to the question. Nevertheless, it is to be noted that we still significantly outperform previous work, by up to 20%.

Conclusion & Future Work

In summary, we have leveraged data-driven machine learning methods to propose RevUP: an automated, domain-independent pipeline for GFQG. With the aim of topic modelling short, informative texts, we proposed the novel DATM, which outperformed LDA and LSA on the topic coherence metric. Leveraging the DATM, a new topic-distribution-based ranking method was proposed for sentence selection. For gap selection, a discriminative binary classifier was trained on human annotations. With this classifier, RevUP could predict the relevance of a gap-phrase with an accuracy of 81.0%. We finally proposed a novel method for generating semantically similar distractors with contextual fit and demonstrated that 76% of the generated distractors were good.

For future work, we hope to extend our work on the DATM by using it for other tasks (micro-search, etc.) to conclusively prove its superiority. As for gap selection, we could explore the use of more features and of learning-to-rank methods. We also intend to cast distractor selection as a machine learning problem trained from human judgments. Another possibility is the integration of RevUP into e-learning platforms such as Moodle to allow public usage of the tool. This could pave the way for usability tests to understand the impact RevUP has on the learning process and educational performance of students. Furthermore, RevUP could be used to generate questions from transcribed lectures on MOOC platforms such as Coursera and Udacity.



I'm extremely grateful to Dr. Rafael E. Banchs and Dr. Luis Fernando D'Haro from the Institute for Infocomm Research, A*STAR for providing me with the opportunity to work at their lab. I am thankful for the guidance they have provided whilst also giving me the freedom to set the direction of the project.


[1] Manish Agarwal and Prashanth Mannem. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 56–64. Association for Computational Linguistics, 2011.

[2] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[3] Lee Becker, Sumit Basu, and Lucy Vanderwende. Mind the gap: Learning to choose gaps for question generation. In HLT-NAACL, pages 742–751. The Association for Computational Linguistics, 2012.

[4] Y. Bengio. Learning deep architectures for ai. Foundations and Trends in Machine Learning, November 2009.

[5] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

[6] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.

[7] Michael Heilman and Noah A Smith. Question generation via overgenerating transformations and ranking. Technical report, DTIC Document, 2009.

[8] G. E. Hinton. A practical guide to training restricted boltzmann machines. Momentum, 2010.

[9] G. E. Hinton, S. Osindero, and Y-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.

[10] Geoffrey Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):926, 2010.

[11] Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[12] Geoffrey Hinton and Ruslan Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74–91, 2011.

[13] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Biophysics, 1982.

[14] Judy Kay, Peter Reimann, Elliot Diebold, and Bob Kummerfeld. Moocs: So many learners, so much potential... IEEE Intelligent Systems, 28(3):70–77, 2013.

[15] Alison King. Comparison of self-questioning, summarizing, and notetaking-review as strategies for learning from lectures. American Educational Research Journal, 29(2):303–323, 1992.

[16] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area v2. In Advances in neural information processing systems, pages 873–880, 2008.

[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[18] George A. Miller. Wordnet: A lexical database for english. COMMUNICATIONS OF THE ACM, 38:39–41, 1995.

[19] David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics, 2011.

[20] S. Miranda, G.R. Mangione, F. Orciuoli, M. Gaeta, and V. Loia. Automatic generation of assessment objects and remedial works for moocs. In Information Technology Based Higher Education and Training (ITHET), 2013 International Conference on, pages 1–8, Oct 2013.

[21] Ruslan Mitkov, Le An Ha, and Nikiforos Karamanis. A computer-aided environment for generating multiple-choice test items. Nat. Lang. Eng., 12(2):177–194, June 2006.

[22] Jack Mostow and Wei Chen. Generating instruction automatically for the reading strategy of self-questioning. In Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems That Care: From Knowledge Representation to Affective Modelling, pages 465–472, Amsterdam, The Netherlands, The Netherlands, 2009. IOS Press.

 [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[24] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.

[25] R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.

[26] Simon Smith, PVS Avinesh, and Adam Kilgarriff. Gap-fill tests for language learners: Corpus-driven item generation. In Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, 2010.

[27] Eiichiro Sumita, Fumiaki Sugaya, and Seiichi Yamamoto. Measuring non-native speakers’ proficiency of english by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the second workshop on Building Educational Applications Using NLP, pages 61–68. Association for Computational Linguistics, 2005.

[28] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[29] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[30] Wikipedia. Receiver operating characteristic — wikipedia, the free encyclopedia, 2014. [Online; accessed 26-December- 2014].

[31] John H. Wolfe, Navy Personnel Research, and CA. Development Center, San Diego. An Aid to Independent Study through Automatic Question Generation (AUTOQUEST) [microform] / John H. Wolfe. Distributed by ERIC Clearinghouse [Washington, D.C.], 1975.

[32] Brendan Wyse and Paul Piwek. Generating questions from openlearn study units. 2009. 

[33] Nobal B. Niraula et al. Mining gap-fill questions from tutorial dialogues.