2 Results for Research Question 2: Relation to Human Ratings
Classification of Writing Patterns Using Keystroke Logs
The cluster-specific regression models resulted in correlation coefficients of 0.47
for cluster 1 and 0.65 for cluster 2 for the policy recommendation essay. Similar
results were observed for the argumentation essay, where cluster 1 also yielded a
considerably lower correlation coefficient (0.46) than cluster 2 (0.51).
Our analysis focused on the 25 longest inter-word pauses in each essay and indicated
that student response patterns fell into two major groups. In pattern 1, the 25 longest
pauses were distributed relatively evenly throughout the composition. Essays that
exemplify pattern 1 received higher mean human scores, contained more words, and
were composed over a longer active writing time. In pattern 2, the longest 25 pauses
were somewhat shorter than in pattern 1, and were concentrated at the beginning
and end of the composition. Essays that exemplify pattern 2 received lower mean
human scores, had fewer words, and were written over a shorter active composition
time. We replicated these findings across two writing prompts, each focused on a
different writing purpose and administered to different student samples. It is worth
stressing that the results of writing patterns should be interpreted at the group level;
that is, the patterns do not reflect the distribution of the 25 longest IWIs of any individual essay.
In the literature, pauses of more than 2 s are generally described as terminating
‘bursts’ of text production (Chenoweth & Hayes 2001), and tend to be interpreted in
think-aloud protocols as occasions for sentence-level planning (Baaijen et al. 2012).
This cognitive interpretation can readily be applied to pattern 1. As can be observed
in Figs. 6 and 7, the longest pauses produced by pattern-1 writers fell most often
between 1.5 and 4 s in duration, typical of pauses between bursts. This interpretation
is strengthened by the fact that pattern-1 pauses were evenly distributed across
the entire text. The resulting rhythm—a regular series of bursts of fast production
delimited by pauses of 2 s or more—may be emblematic of fluent text production,
in which the writer pauses primarily to plan the next major grammatical or textual unit.
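The rhythm described here, fluent bursts delimited by long pauses, can be illustrated with a short sketch. The 2 s threshold follows the burst literature cited above; the function name and sample data are our own invention.

```python
# Segment a sequence of inter-word intervals (IWIs, in seconds) into
# "bursts": runs of fluent production terminated by pauses of 2 s or more.
# Illustrative sketch; the threshold follows the cited literature, while
# the function name and sample data are invented.

BURST_THRESHOLD = 2.0  # seconds; pauses at or above this terminate a burst

def segment_bursts(iwis, threshold=BURST_THRESHOLD):
    """Return a list of bursts, each a list of consecutive sub-threshold IWIs."""
    bursts, current = [], []
    for iwi in iwis:
        if iwi >= threshold:
            if current:
                bursts.append(current)
            current = []  # the long pause ends the current burst
        else:
            current.append(iwi)
    if current:
        bursts.append(current)
    return bursts

iwis = [0.3, 0.4, 0.2, 2.5, 0.3, 0.5, 3.1, 0.2, 0.4, 0.3]
print(segment_bursts(iwis))  # three bursts, split by the 2.5 s and 3.1 s pauses
```

A pattern-1 essay would show many such bursts spread evenly through the log; a pattern-2 essay would show the long pauses bunched near the start and end.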
The striking feature of pattern 2 is the presence of a second kind of pause, mostly
shorter than the pauses observed in pattern 1, concentrated near the beginning and
the end of the composition. These time points are arguably where a writer who struggles to generate text is most likely to falter, consistent with the observation by Baaijen et al. (2012) that certain kinds of behavioral events, such as
text production followed by revision, are associated with shorter pauses. It is thus
possible, though by no means certain, that the higher frequency of short pauses
concentrated at the beginnings and ends of pattern-2 essays reflects difficulties in text
generation, leading to false starts and interruptions instead of fluent text production
at the beginning of an essay (when the writer is under the most stress to plan
content), and at the end of an essay (when the writer may be running out of ideas, and
thus once more experiencing higher levels of uncertainty about what to write next).
M. Zhang et al.
We conducted a post-hoc qualitative analysis of a small subset of logs from this
dataset, and found that some weaker writers did produce a small amount of text—a
few words, or even part of a word—and then delete it after a short pause, only to
proceed to another false start. It is thus possible that pattern 2 involves this kind
of hesitation, although we cannot confirm it without further analysis in which we
correlate the distribution of IWIs with the distribution of deletions and edits.
In this study, we propose a new way to compare the temporal sequence of IWIs
across different students using a vector representation. This approach enables us
to describe global patterns in pausing behavior, which may correspond to different
cognitive strategies or styles of writing. This study represents an initial attempt,
using a specific keystroke log feature (IWIs) and a specific similarity metric,
to explore ways to represent and directly compare KLs, analyze the resulting
classification patterns, and pose cognitive accounts for the identified patterns in the
context of writing done for standardized tests. Overall, our analysis indicates that
there do appear to be qualitative differences among groups of writers in the time course of text production, some of which can be detected from a very
small sample of events (e.g., only the 25 longest inter-word intervals).
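As a rough illustration of the vector-representation idea, the sketch below encodes each essay by the normalized time points of its k longest IWIs and clusters the resulting vectors hierarchically with SciPy. The toy data, the choice of Euclidean distance, and average linkage are our assumptions, not necessarily the similarity metric used in the study (k is 5 here to fit the toy data; the study used 25).

```python
# Represent each essay by the (normalized) time points of its k longest
# IWIs, then cluster the fixed-length vectors. Distance metric, linkage,
# and the toy data are illustrative assumptions, not the study's setup.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def iwi_vector(iwi_list, total_time, k=5):
    """Time points of the k longest IWIs, scaled to [0, 1], in time order."""
    longest = sorted(iwi_list, key=lambda pair: pair[1], reverse=True)[:k]
    times = sorted(t for t, _ in longest)
    return np.array(times) / total_time

# Each essay: list of (time_point, duration) pairs for its IWIs (toy data).
essays = [
    [(10, 3.0), (30, 2.5), (55, 2.8), (80, 3.2), (95, 2.6)],  # evenly spread
    [(12, 2.9), (35, 3.1), (60, 2.4), (78, 2.7), (99, 3.0)],  # evenly spread
    [(3, 1.8), (6, 1.9), (9, 1.7), (92, 1.6), (97, 1.8)],     # ends-heavy
    [(2, 1.7), (5, 1.6), (8, 1.8), (94, 1.9), (98, 1.7)],     # ends-heavy
]
X = np.vstack([iwi_vector(e, total_time=100.0) for e in essays])
labels = fcluster(linkage(X, method="average"), t=2, criterion="maxclust")
print(labels)  # the two spread essays and the two ends-heavy essays separate
```

The point of the sketch is only that essays with similar pause placement land in the same cluster regardless of their absolute pause durations.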
However, it should be noted that the method we employed in this study represents a starting point in exploring better representations and similarity measures for KLs. Under the current methodological scheme, we observed some clear
pattern differences in students’ writing processes, which held across two prompts.
However, a different scale transformation (for example, of the IWI time points) would change the structure of the similarity matrix and affect the clustering results. In our future
investigations, we will experiment with other similarity measures (e.g., Euclidean
or Mahalanobis types of distance measures) and representations such as matched
filtering, which might be more robust than the current approach.
It is also important to note that the decision to target the 25 longest IWIs
represents two levels of abstraction: first, by restricting attention to IWIs, and
second, by excluding shorter IWIs from the analysis. These decisions provided a
useful lens with which to examine the data, since the literature provides strong
reasons to suspect that the longest IWIs will reflect global differences in writing
patterns and strategies. The decision to standardize to the 25 longest IWIs also
made it easier to compare essays of different lengths (and which were composed
over shorter or longer time periods), but it does represent a small portion of the total data; hence, it will be useful to extend the scope of future analysis to include all IWIs.
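The two abstraction steps can be sketched as follows; the keystroke-log format (a list of timestamped characters) and the function names are invented for illustration, and k is 2 here purely to fit the toy log, where the study used k = 25.

```python
# Sketch of the two abstraction steps: (1) reduce a raw keystroke log to
# inter-word intervals (IWIs), (2) keep only the k longest. The log
# format and helper names are invented for illustration.

def inter_word_intervals(events):
    """IWIs: pauses between a word-boundary keystroke (space) and the
    first keystroke of the following word. `events` is a list of
    (timestamp, character) pairs."""
    iwis, boundary_time = [], None
    for t, ch in events:
        if ch == " ":
            boundary_time = t  # word boundary: start timing the pause
        elif boundary_time is not None:
            iwis.append(t - boundary_time)
            boundary_time = None
    return iwis

def k_longest(iwis, k=25):
    """Step 2: discard all but the k longest IWIs (the study used k = 25)."""
    return sorted(iwis, reverse=True)[:k]

# Toy log: "cat sat down" typed with a long pause before "sat".
events = [(0.0, "c"), (0.2, "a"), (0.4, "t"), (0.6, " "),
          (3.1, "s"), (3.3, "a"), (3.5, "t"), (3.7, " "),
          (4.0, "d"), (4.2, "o"), (4.4, "w"), (4.6, "n")]
print(k_longest(inter_word_intervals(events), k=2))
```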
Deane (2014) provides evidence that many keystroke features are not particularly
stable across changes in prompt, genre, and/or topic. Therefore, caution should be
exercised in generalizing the results. Further studies are needed to determine the
extent to which these results reflect prompt-specific or general differences in student
writing behaviors, which will require studying students from other grade levels and
writing prompts targeting other writing purposes (e.g., narrative writing).
Finally, it might be valuable to enrich the representation to include information
about the context of writing actions such as IWIs. For example, some IWIs occur
between words in a long burst of text production; others, in the context of other
actions, such as edits or deletions. We would interpret the second cluster, in which
most pauses were near the beginning and end of student essays, very differently
if they were associated with editing and deletion, than we would if they were
associated with uninterrupted text production. Thus, it would be of particular value
to enrich the current approach by undertaking analyses that identify qualitative,
linguistic or behavioral differences and that would allow us to relate those findings
to the differences in writing patterns observed here.
Acknowledgements We would like to thank Marie Wiberg, Don Powers, Gary Feng, Tanner
Jackson, and Andre Rupp for their technical and editorial suggestions for this manuscript, thank
Randy Bennett for his support of the study, and thank Shelby Haberman for his advice on the
statistical analyses in this study.
References

Almond, R., Deane, P., Quinlan, T., & Wagner, M. (2012). A preliminary analysis of keystroke log
data from a timed writing task (RR-12-23). Princeton, NJ: ETS Research Report.
Alves, R. A., Castro, S. L., & de Sousa, L. (2007). Influence of typing skill on pause–execution
cycles in written composition. In G. Rijlaarsdam (Series Ed.), M. Torrance, L. van Waes, & D.
Galbraith (Vol. Eds.), Writing and cognition: Research and applications (Studies in Writing,
Vol. 20, pp. 55–65). Amsterdam: Elsevier.
Baaijen, V. M., Galbraith, D., & de Glopper, K. (2012). Keystroke analysis: Reflections on
procedures and measures. Written Communication, 29, 246–277.
Banerjee, R., Feng, S., Kang, J. S., & Choi, Y. (2014). Keystroke patterns as prosody in digital
writings: A case study with deceptive reviews and essays. Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing, Doha, Qatar.
Beauvais, C., Olive, T., & Passerault, J. (2011). Why are some texts good and others not?
Relationship between text quality and management of the writing processes. Journal of
Educational Psychology, 103, 415–428.
Bennett, R. E. (2010). Cognitively Based Assessment of, for, and as Learning (CBAL): A
preliminary theory of action for summative and formative assessment. Measurement, 8,
Bennett, R. E., Deane, P., & van Rijn, P. (2016). From cognitive-domain theory to assessment
practice. Educational Psychologist, 51, 82–107.
Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing: Generating text in L1 and L2. Written
Communication, 18, 80–98.
Chukharev-Hudilainen, E. (2014). Pauses in spontaneous written communication: A keystroke
logging study. Journal of Writing Research, 6, 61–84.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Deane, P. (2014). Using writing process and product features to assess writing quality and explore
how those features relate to other literacy tasks (RR-14-03). Princeton, NJ: ETS Research Report.
Deane, P., Sabatini, J. S., Feng, G., Sparks, J., Song, Y., Fowles, M., et al. (2015). Key practices in
the English Language Arts (ELA): Linking learning theory, assessment, and instruction (RR-15-17). Princeton, NJ: ETS Research Report.
Deane, P., & Zhang, M. (2015). Exploring the feasibility of using writing process features to assess
text production skills (RR-15-26). Princeton, NJ: ETS Research Report.
Dragsted, B., & Carl, M. (2013). Towards a classification of translation styles based on eye-tracking
and keylogging data. Journal of Writing Research, 5, 133–158.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33,
Gould, J. D. (1980). Experiments on composing letters: Some facts, some myths, and some
observations. In L. Gregg & E. Steinberg (Eds.), Cognitive processes in writing (pp. 97–127).
Hillsdale, NJ: Lawrence Erlbaum.
Grabowski, J. (2008). The internal structure of university students’ keyboard skills. Journal of
Writing Research, 1, 27–52.
Hao, J., Smith, L., Mislevy, R., von Davier, A., & Bauer, M. (2016). Taming log files from game
and simulation-based assessment: Data model and data analysis tool (RR-16-10). Princeton,
NJ: ETS Research Report.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241–254.
Jones, E., Oliphant, T., & Peterson, P. (2014). SciPy: Open source scientific tools for Python
[Computer software]. Retrieved from http://www.scipy.org/.
Kalbfleisch, J. D., & Prentice, R. L. (2002). The statistical analysis of failure time data (2nd ed.).
Hoboken, NJ: Wiley.
Leijten, M., Macken, L., Hoste, V., van Horenbeeck, E., & van Waes, L. (2012). From character to
word level: Enabling the linguistic analyses of Inputlog process data. Proceedings of the EACL
2012 Workshop on Computational Linguistics and Writing, Avignon, France.
Leijten, M., & van Waes, L. (2013). Keystroke logging in writing research using Inputlog to
analyze and visualize writing processes. Written Communication, 30, 358–392.
Leijten, M., van Waes, L., Schriver, K., & Hayes, J. R. (2014). Writing in the workplace:
Constructing documents using multiple digital sources. Journal of Writing Research, 5,
Miller, K. S. (2000). Academic writers on-line: Investigating pausing in the production of text.
Language Teaching Research, 4, 123–148.
Roca de Larios, J., Manchon, R., Murphy, L., & Marin, J. (2008). The foreign language writer’s
strategic behavior in the allocation of time to writing processes. Journal of Second Language
Writing, 17, 30–47.
Ulrich, R., & Miller, J. (1993). Information processing models generating log normally distributed
reaction times. Journal of Mathematical Psychology, 37, 513–525.
van der Linden, W. (2006). A lognormal model for response times on test items. Journal of
Educational and Behavioral Statistics, 31, 181–204.
van Waes, L., Leijten, M., & van Weijen, D. (2009). Keystroke logging in writing research:
Observing writing processes with Inputlog. GFL-Journal, No. 2–3.
Xu, X., & Ding, Y. (2014). An exploratory study of pauses in computer-assisted EFL writing.
Language Learning & Technology, 18, 80–96.
Zhang, M., & Deane, P. (2015). Process features in writing: Internal structure and incremental
value over product features (RR-15-27). Princeton, NJ: ETS Research Report.
Identifying Useful Features to Detect Off-Topic
Essays in Automated Scoring Without Using
Topic-Specific Training Essays
Jing Chen and Mo Zhang
Abstract E-rater is the automated scoring engine used at ETS to score the writing
quality of essays. A pre-screening filtering system is embedded in e-rater to detect
and exclude essays that are not suitable to be scored by e-rater. The pre-screening
filtering system is composed of a set of advisory flags, each of which flags an unusual characteristic of an essay (e.g., repetition of words and sentences, restatement of the prompt). This study examined the effectiveness of an advisory flag in the filtering
system that detected off-topic essays. The detection of off-topic essays usually requires topic-specific training essays to train the engine to identify essays that are very different from the other essays on the same topic. The advisory flag used here is designed to detect off-topic essays without using topic-specific training essays, because such essays may not be available in real test settings.
To enhance the capability of this off-topic advisory flag, we identified a set of
essay features, computable without topic-specific training essays, that are potentially useful in distinguishing off-topic essays. These features include essay length, the number of word types (excluding non-content-bearing words), the number of word
tokens, the similarity of an essay to training essays, essay organization, and the
variety of sentences in an essay.
Keywords Automated essay scoring • Feature selection • Off-topic essay
Automated scoring is now increasingly used to score constructed-response items, given advantages such as low cost, real-time feedback, quick score turnaround, and consistency over time (Williamson, Bejar, & Hone 1999). Automated
scoring engines have been developed to score different types of constructed response
such as short responses, essays and speaking responses. Automated Essay Scoring
J. Chen • M. Zhang
Educational Testing Service, Princeton, NJ 08541, USA
© Springer International Publishing Switzerland 2016
L.A. van der Ark et al. (eds.), Quantitative Psychology Research, Springer
Proceedings in Mathematics & Statistics 167, DOI 10.1007/978-3-319-38759-8_24
(AES), which is the subject of this paper, is defined as using computer technology
to evaluate and score written prose (Dikli 2006). The scoring process of an
AES system usually involves extracting features using Natural Language Processing
(NLP) techniques and applying statistical models that predict human scores based on the
extracted essay features. Though the performance of AES systems may equal or
surpass that of human raters in many aspects of writing, the systems do not really
“read” or “understand” the essays. If an essay is very unusual, an AES system may
fail to process the essay or it may assign a score that does not reflect the criteria
specified in the scoring rubrics.
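The extract-then-model pipeline described above can be sketched minimally as follows. The features and weights are invented for illustration and bear no relation to e-rater's actual feature set or model.

```python
# Minimal sketch of the two-stage AES pipeline: NLP-style feature
# extraction followed by a statistical model mapping features to a score.
# All features and weights here are invented for illustration.
import re

def extract_features(essay):
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "sqrt_word_count": len(words) ** 0.5,  # length proxy
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

# A "trained" linear model: in practice the weights would be estimated
# by regressing human scores on the extracted features.
WEIGHTS = {"sqrt_word_count": 0.15, "avg_word_length": 0.2,
           "avg_sentence_length": 0.05}

def predict_score(essay, weights=WEIGHTS, intercept=1.0, lo=1.0, hi=6.0):
    feats = extract_features(essay)
    raw = intercept + sum(weights[k] * v for k, v in feats.items())
    return min(hi, max(lo, raw))  # clamp to the reporting scale

print(predict_score("This is a short essay. It has two sentences."))
```

Note how nothing in the pipeline "reads" the essay: an unusual response still yields a number, which is exactly why a pre-screening filter is needed.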
1.1 The Filtering System of Automated Essay Scoring
A pre-screening filtering system is often used to identify unusual essays that are not
suitable to be scored by the automated scoring system. An effective pre-screening
filtering system is important to ensure the validity of automated scores. If the scoring
engine handles problematic responses in the same way as it handles normal responses, it may degrade users’ confidence in the scoring engine.
Furthermore, essays detected as unusual by the filtering system often need to be
scored by human raters. An effective filtering system would detect unscorable essays
while minimizing the number of essays that are falsely identified as unscorable to
avoid unnecessary human scoring cost.
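The trade-off between catching unscorable essays and avoiding false alarms is typically summarized by a detection (recall) rate and a false-positive rate; a minimal sketch, with invented labels:

```python
# A flag's effectiveness summarized by its detection (recall) rate on
# truly unscorable essays and its false-positive rate on normal essays.
# The example labels are invented for illustration.

def flag_rates(flagged, unscorable):
    """`flagged` and `unscorable` are parallel boolean lists, one per essay."""
    tp = sum(f and u for f, u in zip(flagged, unscorable))
    fp = sum(f and not u for f, u in zip(flagged, unscorable))
    n_unscorable = sum(unscorable)
    n_scorable = len(unscorable) - n_unscorable
    detection_rate = tp / n_unscorable if n_unscorable else 0.0
    false_positive_rate = fp / n_scorable if n_scorable else 0.0
    return detection_rate, false_positive_rate

flagged    = [True, True, False, True, False, False]  # flag decisions
unscorable = [True, True, True, False, False, False]  # ground truth (human '0')
print(flag_rates(flagged, unscorable))
```

A high detection rate with a low false-positive rate is what keeps the human-scoring cost of the filter down.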
Several widely used AES systems have pre-screening filtering systems to detect
unusual responses. The Intelligent Essay Assessor (IEA, Landauer, Laham & Foltz
2003) checks for a number of things such as the number and degree of clustering
of word type repetitions, the extent to which the essay is off topic and whether
the essay is a copy or rearrangement of another essay (Wollack & Fremer 2013).
The IntelliMetric system developed by Vantage Learning (Elliot 2003; Shermis &
Barrera 2002) has warning flags for things such as nonsensical writing, violent
language, copying the prompt and plagiarism (Rudner, Garcia & Welch 2006).
E-rater, the automated scoring engine developed at ETS (Attali & Burstein 2006),
has a set of advisory flags to identify atypicality in an essay (e.g., too many
repetitions of words and sentences, unusual organization, off-topic content).
This study investigates the effectiveness of an e-rater advisory flag that is
designed to detect off-topic essays. In high-stakes writing assessments, if an essay
triggers any advisory flag, the essay will be excluded from automated scoring
and scored by human raters only. Thus, an effective advisory flag would detect
unsuitable responses while minimizing the number of essays that are falsely
identified as unsuitable for automated scoring in order to save human scoring cost.
For low-stakes writing assessments, e-rater is often used as the sole scoring method.
If an essay triggers the off-topic advisory flag, e-rater will still generate a score for the essay but provide a warning that the essay might be off-topic, so the score
assigned may not be valid. The assessments used in this study are all high-stakes assessments, which means that essays triggering the off-topic advisory flag will be
excluded from automated scoring and scored by human raters only. We examine the
performance of an off-topic advisory flag and identify potentially useful features to
detect off-topic essays more accurately. In the following section, we introduce our
definition of off-topic essays and the off-topic advisory flag we evaluated in this study.
1.2 Off-Topic Essays and the E-rater Off-Topic Advisory Flag
Essays can be off-topic in many divergent ways. We define off-topic essays somewhat broadly to include unexpected-topic essays, bad faith essays,
essays in a foreign language, essays consisting only of keystroke characters with little lexical content, and essays that are illegible. In scoring rubrics, these essays
usually receive human scores of ‘0’. Two common types of off-topic essays in
writing assessments are essays written on an unexpected-topic and bad faith essays.
An unexpected-topic essay can be well written, but it provides no evidence of an attempt to address the assigned topic. Bad faith essays are usually written by examinees who respond uncooperatively: they write text irrelevant to the assigned topic out of boredom or for other reasons.
In the pre-filtering system of e-rater, there are two types of advisory flags that
detect off-topic essays. One type requires topic-specific training essays to train the advisory in order to identify essays that are very different from the other essays on the same topic. The other type does not require topic-specific training essays; it relies on essay features that can be computed without them.
In real test settings, topic-specific training essays may not be available when new prompts are administered. In addition, the sample size of topic-specific training essays may sometimes be insufficient. An advisory flag that does not need topic-specific training essays can be used more flexibly and broadly in real test settings.
Thus, we investigate only this type of off-topic advisory flag in this paper and try
to find essay features that can be computed without topic-specific training essays to
improve the detection of off-topic essays.
More specifically, the advisory flag we evaluate in this study detects off-topic
essays based on the similarity between the text of an essay and the prompt on which
the essay is supposed to have been written (Higgins, Burstein, & Attali, 2006). This
similarity is measured by a feature (abbreviated as “S_Prompt”) calculated based
on Content Vector Analysis (CVA, for a detailed introduction, see Kaplan 2010, p.
531). The similarities between an essay and its target prompt (i.e., S_Prompt) and a set of reference prompts are calculated and sorted. If the similarity between the essay and its target prompt ranks among the top 15% of the similarity scores, the essay is considered on-topic. Otherwise, it is identified as off-topic.
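The ranking rule can be sketched as follows. A plain bag-of-words cosine similarity stands in here for the CVA-based feature described by Higgins et al. (2006); the prompts and the helper names are our illustrative assumptions.

```python
# Bag-of-words cosine similarity as a simplified stand-in for the
# CVA-based S_Prompt feature; prompts and helper names are invented.
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_on_topic(essay, target_prompt, reference_prompts, top_fraction=0.15):
    """On-topic if the target prompt ranks in the top 15% of similarities."""
    prompts = [target_prompt] + list(reference_prompts)
    sims = sorted((cosine_sim(essay, p) for p in prompts), reverse=True)
    cutoff_rank = max(1, int(len(prompts) * top_fraction))
    return cosine_sim(essay, target_prompt) >= sims[cutoff_rank - 1]

essay = "rivers flood when rainfall exceeds the soil capacity to absorb water"
target = "discuss how rainfall and rivers cause flooding"
references = ["describe your favorite book"] * 9  # invented reference prompts
print(is_on_topic(essay, target, references))  # True: target ranks first of ten
```

With ten prompts, the top 15% cutoff amounts to requiring the target prompt to be the single best match, which is why the number of reference prompts matters to the flag's strictness.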
This study evaluates the effectiveness of the off-topic advisory flag introduced
above. Besides the feature used in this flag, e-rater extracts many other essay
features to evaluate essays. This study explores what additional essay features can
potentially be used to detect off-topic essays. The study is guided by the following
two research questions:
1. How effective is the advisory flag in detecting off-topic essays?
2. Among the essay features that e-rater extracts, which are potentially useful for detecting off-topic essays?
The data for this study came from the writing tasks of two large-scale high-stakes
assessments. Assessment I is a college level test, and Assessment II is an English
proficiency test. The writing section in Assessment I includes two tasks, which
we refer to as Task A and Task B for the purpose of this paper. Task A requires
examinees to critique an argument while Task B requires examinees to articulate
an opinion and support their opinions by using examples or relevant reasoning.
Similar to the writing tasks of Assessment I, the writing section of Assessment II
also included two tasks, which we refer to as Task C and Task D. Task C requires
test takers to respond in writing by synthesizing the information that they had read
with the information they had heard. Task D requires test takers to articulate and
support an opinion on a topic.
The score scale of Tasks A and B is from 1 to 6, and that of Tasks C and D is from
1 to 5. The lowest score, 1, indicates a poorly written essay and the highest score,
5 or 6, indicates a very well written essay. Specifically, the scoring rubrics of these
four writing tasks all specify that an essay at score level ‘0’ is not connected to the
topic, is written in a foreign language, consists of keystroke characters, or is blank.
Therefore, for the purpose of this study, we classify all the essays that received a
human score of ‘0’ as off-topic essays (except the blank ones) and the other essays
with non-zero scores as on-topic essays.
In operational scoring design, essays from high-stakes assessments usually are
scored by a human rater first. If the human rater assigns a score of ‘0’ to an essay, indicating that the essay is very unusual, the essay will be excluded
from automated scoring entirely. Instead, a second human rater will evaluate this
essay to check the score from the first human rater. Because off-topic essays will
be flagged by human raters, the issue of off-topic responses is not viewed as a
serious problem for automated scoring in high-stakes assessments. However, in low-stakes assessments, where e-rater is used as the primary or sole scoring system, it is important to have an effective flag to detect off-topic essays that may not be suitable for automated scoring.
To evaluate the effectiveness of the off-topic flag discussed previously, we
selected a random sample of around 200,000 essays from each writing task that was