2 Results for Research Question 2: Relation to Human Ratings


Classification of Writing Patterns Using Keystroke Logs


The cluster-specific regression models resulted in correlation coefficients of 0.47

for cluster 1 and 0.65 for cluster 2 for the policy recommendation essay. Similar

results were observed for the argumentation essay, where cluster 1 also yielded a

considerably lower correlation coefficient (0.46) than cluster 2 (0.51).

5 Discussion

Our analysis focused on the 25 longest inter-word pauses in each essay and indicated

that student response patterns fell into two major groups. In pattern 1, the 25 longest

pauses were distributed relatively evenly throughout the composition. Essays that

exemplify pattern 1 received higher mean human scores, contained more words, and

were composed over a longer active writing time. In pattern 2, the longest 25 pauses

were somewhat shorter than in pattern 1, and were concentrated at the beginning

and end of the composition. Essays that exemplify pattern 2 received lower mean

human scores, had fewer words, and were written over a shorter active composition

time. We replicated these findings across two writing prompts, each focused on a

different writing purpose and administered to different student samples. It is worth

stressing that the results of writing patterns should be interpreted at the group level;

that is, the patterns do not reflect the distribution of the 25 longest IWIs of any

individual KL.
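
The selection step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the input format (a list of start-time and duration pairs, in seconds), the function name `longest_iwis`, and the normalization of each pause position by total writing time are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical input format: each IWI is (start_time_seconds, duration_seconds).
IWI = Tuple[float, float]


def longest_iwis(iwis: List[IWI], total_time: float, k: int = 25) -> List[Tuple[float, float]]:
    """Return the k longest IWIs as (relative_position, duration), in time order.

    relative_position is the pause start divided by total_time, so essays
    composed over different time spans can be placed on a common 0-1 axis.
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    # Keep the k pauses with the largest durations.
    top_k = sorted(iwis, key=lambda iwi: iwi[1], reverse=True)[:k]
    # Re-order the selected pauses by where they occur in the session.
    return sorted((start / total_time, duration) for start, duration in top_k)
```

Under these assumptions, an evenly spread pattern would show relative positions scattered across the whole 0-1 range, whereas a pattern concentrated at the beginning and end would show positions clustered near 0 and 1.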

In the literature, pauses of more than 2 s are generally described as terminating

‘bursts’ of text production (Chenoweth & Hayes 2001), and tend to be interpreted in

think-aloud protocols as occasions for sentence-level planning (Baaijen et al. 2012).

This cognitive interpretation can readily be applied to pattern 1. As can be observed

in Figs. 6 and 7, the longest pauses produced by pattern-1 writers fell most often

between 1.5 and 4 s in duration, typical of pauses between bursts. This interpretation

is strengthened by the fact that pattern-1 pauses were evenly distributed across

the entire text. The resulting rhythm—a regular series of bursts of fast production

delimited by pauses of 2 s or more—may be emblematic of fluent text production,

in which the writer pauses primarily to plan the next major grammatical or textual unit.


The striking feature of pattern 2 is the presence of a second kind of pause, mostly

shorter than the pauses observed in pattern 1, concentrated near the beginning and

the end of the composition. These time points are arguably where a writer who

has difficulty generating text is most likely to struggle, consistent with

the observation by Baaijen et al. (2012) that certain kinds of behavioral events, such as

text production followed by revision, are associated with shorter pauses. It is thus

possible, though by no means certain, that the higher frequency of short pauses

concentrated at the beginning and end of pattern 2 essays reflects difficulties in text

generation, leading to false starts and interruptions instead of fluent text production

at the beginning of an essay (when the writer is under the most stress to plan

content), and at the end of an essay (when the writer may be running out of ideas, and

thus once more experiencing higher levels of uncertainty about what to write next).


M. Zhang et al.

We conducted a post-hoc qualitative analysis of a small subset of logs from this

dataset, and found that some weaker writers did produce a small amount of text—a

few words, or even part of a word—and then delete it after a short pause, only to

proceed to another false start. It is thus possible that pattern 2 involves this kind

of hesitation, although we cannot confirm it without further analysis in which we

correlate the distribution of IWIs with the distribution of deletions and edits.

6 Conclusion

In this study, we propose a new way to compare the temporal sequence of IWIs

across different students using a vector representation. This approach enables us

to describe global patterns in pausing behavior, which may correspond to different

cognitive strategies or styles of writing. This study represents an initial attempt,

using a specific keystroke log feature (IWIs) and a specific similarity metric,

to explore ways to represent and directly compare KLs, analyze the resulting

classification patterns, and pose cognitive accounts for the identified patterns in the

context of writing done for standardized tests. Overall, our analysis indicates that

there do appear to be qualitative differences among groups of writers in the time course of text production, some of which can be detected from a very

small sample of events (e.g., only the 25 longest inter-word intervals).

However, it should be noted that the method we employed in this study represents

our starting point to explore better representations and similarity measures for

the KLs. Based on the current methodological scheme, we observed some clear

pattern differences in students’ writing processes, which held across two prompts.

However, a different scale transformation, for example, on the IWI time-point, will

change the similarity matrix structure and affect the clustering results. In our future

investigations, we will experiment with other similarity measures (e.g., Euclidean

or Mahalanobis types of distance measures) and representations such as matched

filtering, which might be more robust than the current approach.
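
The sensitivity to scale transformation noted above can be illustrated with a small sketch. The vectors and the helper `nearest` are hypothetical, and Euclidean distance is used only because it is one of the measures named as future work; this is not the similarity metric used in the study.

```python
import math
from typing import Callable, Dict, List


def nearest(target: List[float],
            candidates: Dict[str, List[float]],
            transform: Callable[[float], float] = lambda x: x) -> str:
    """Return the name of the candidate closest to target in Euclidean distance.

    transform is applied to every coordinate first, which mimics rescaling
    the underlying measurements before computing similarity.
    """
    t = [transform(x) for x in target]
    return min(
        candidates,
        key=lambda name: math.dist(t, [transform(x) for x in candidates[name]]),
    )


# Made-up two-dimensional vectors, chosen only to show the effect.
a = [1.0, 100.0]
others = {"b": [10.0, 100.0], "c": [1.0, 130.0]}

raw_neighbor = nearest(a, others)            # distances: 9 to b, 30 to c
log_neighbor = nearest(a, others, math.log)  # distances: about 2.30 to b, 0.26 to c
```

On the raw scale the nearest neighbor of `a` is `b`, but after a log transformation it is `c`. A change of this kind in pairwise distances propagates to the similarity matrix and can therefore change cluster membership.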

It is also important to note that the decision to target the 25 longest IWIs

represents two levels of abstraction: first, by restricting attention to IWIs, and

second, by excluding shorter IWIs from the analysis. These decisions provided a

useful lens with which to examine the data, since the literature provides strong

reasons to suspect that the longest IWIs will reflect global differences in writing

patterns and strategies. The decision to standardize to the 25 longest IWIs also

made it easier to compare essays of different lengths (and which were composed

over shorter or longer time periods), but it does represent a small portion of the total

data; hence, it will be useful to extend the scope of future analysis to include all


Deane (2014) provides evidence that many keystroke features are not particularly

stable across changes in prompt, genre, and/or topic. Therefore, caution should be

exercised in generalizing the results. Further studies are needed to determine the

extent to which these results reflect prompt-specific or general differences in student

writing behaviors, which will require studying students from other grade levels and

writing prompts targeting other writing purposes (e.g., narrative writing).

Finally, it might be valuable to enrich the representation to include information

about the context of writing actions such as IWIs. For example, some IWIs happen

between words in a long burst of text production; others, in the context of other

actions, such as edits or deletions. We would interpret the second cluster, in which

most pauses were near the beginning and end of student essays, very differently

if they were associated with editing and deletion, than we would if they were

associated with uninterrupted text production. Thus, it would be of particular value

to enrich the current approach by undertaking analyses that identify qualitative,

linguistic or behavioral differences and that would allow us to relate those findings

to the differences in writing patterns observed here.

Acknowledgements We would like to thank Marie Wiberg, Don Powers, Gary Feng, Tanner

Jackson, and Andre Rupp for their technical and editorial suggestions for this manuscript, thank

Randy Bennett for his support of the study, and thank Shelby Haberman for his advice on the

statistical analyses in this study.

References


Almond, R., Deane, P., Quinlan, T., & Wagner, M. (2012). A preliminary analysis of keystroke log

data from a timed writing task (RR-12-23). Princeton, NJ: ETS Research Report.

Alves, R. A., Castro, S. L., & de Sousa, L. (2007). Influence of typing skill on pause–execution

cycles in written composition. In G. Rijlaarsdam (Series Ed.), M. Torrance, L. van Waes, & D.

Galbraith (Vol. Eds.), Writing and cognition: Research and applications (Studies in Writing,

Vol. 20, pp. 55–65). Amsterdam: Elsevier.

Baaijen, V. M., Galbraith, D., & de Glopper, K. (2012). Keystroke analysis: Reflections on

procedures and measures. Written Communication, 29, 246–277.

Banerjee, R., Feng, S., Kang, J. S., & Choi, Y. (2014). Keystroke patterns as prosody in digital

writings: A case study with deceptive reviews and essays. Proceedings of the 2014 Conference

on Empirical Methods in Natural Language Processing, Doha, Qatar.

Beauvais, C., Olive, T., & Passerault, J. (2011). Why are some texts good and others not?

Relationship between text quality and management of the writing processes. Journal of

Educational Psychology, 103, 415–428.

Bennett, R. E. (2010). Cognitively Based Assessment of, for, and as Learning (CBAL): A

preliminary theory of action for summative and formative assessment. Measurement, 8,


Bennett, R. E., Deane, P., & van Rijn, P. (2016). From cognitive-domain theory to assessment

practice. Educational Psychologist, 51, 82–107.

Chenoweth, N. A., & Hayes, J. R. (2001). Fluency in writing: Generating text in L1 and L2. Written

Communication, 18, 80–98.

Chukharev-Hudilainen, E. (2014). Pauses in spontaneous written communication: A keystroke

logging study. Journal of Writing Research, 6, 61–84.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.

Deane, P. (2014). Using writing process and product features to assess writing quality and explore

how those features relate to other literacy tasks (RR-14-03). Princeton, NJ: ETS Research Report.



Deane, P., Sabatini, J. S., Feng, G., Sparks, J., Song, Y., Fowles, M., et al. (2015). Key practices in

the English Language Arts (ELA): Linking learning theory, assessment, and instruction (RR-15-17). Princeton, NJ: ETS Research Report.

Deane, P., & Zhang, M. (2015). Exploring the feasibility of using writing process features to assess

text production skills (RR-15-26). Princeton, NJ: ETS Research Report.

Dragsted, B., & Carl, M. (2013). Towards a classification of translation styles based on eye-tracking

and keylogging data. Journal of Writing Research, 5, 133–158.

Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33,


Gould, J. D. (1980). Experiments on composing letters: Some facts, some myths, and some

observations. In L. Gregg & E. Steinberg (Eds.), Cognitive processes in writing (pp. 97–127).

Hillsdale, NJ: Lawrence Erlbaum.

Grabowski, J. (2008). The internal structure of university students’ keyboard skills. Journal of

Writing Research, 1, 27–52.

Hao, J., Smith, L., Mislevy, R., von Davier, A., & Bauer, M. (2016). Taming log files from game

and simulation-based assessment: Data model and data analysis tool (RR-16-10). Princeton,

NJ: ETS Research Report.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241–254.

Jones, E., Oliphant, T., & Peterson, P. (2014). SciPy: Open source scientific tools for Python

[Computer software]. Retrieved from http://www.scipy.org/.

Kalbfleisch, J. D., & Prentice, R. L. (2002). The statistical analysis of failure time data (2nd ed.).

Hoboken, NJ: Wiley.

Leijten, M., Macken, L., Hoste, V., van Horenbeeck, E., & van Waes, L. (2012). From character to

word level: Enabling the linguistic analyses of Inputlog process data. Proceedings of the EACL

2012 Workshop on Computational Linguistics and Writing, Avignon, France.

Leijten, M., & van Waes, L. (2013). Keystroke logging in writing research using Inputlog to

analyze and visualize writing processes. Written Communication, 30, 358–392.

Leijten, M., van Waes, L., Schriver, K., & Hayes, J. R. (2014). Writing in the workplace:

Constructing documents using multiple digital sources. Journal of Writing Research, 5,


Miller, K. S. (2000). Academic writers on-line: Investigating pausing in the production of text.

Language Teaching Research, 4, 123–148.

Roca de Larios, J., Manchon, R., Murphy, L., & Marin, J. (2008). The foreign language writer’s

strategic behavior in the allocation of time to writing processes. Journal of Second Language

Writing, 17, 30–47.

Ulrich, R., & Miller, J. (1993). Information processing models generating log normally distributed

reaction times. Journal of Mathematical Psychology, 37, 513–525.

van der Linden, W. (2006). A lognormal model for response times on test items. Journal of

Educational and Behavioral Statistics, 31, 181–204.

van Waes, L., Leijten, M., & van Weijen, D. (2009). Keystroke logging in writing research:

Observing writing processes with Inputlog. GFI-Journal, No 2-3.

Xu, X., & Ding, Y. (2014). An exploratory study of pauses in computer-assisted EFL writing.

Language Learning & Technology, 18, 80–96.

Zhang, M., & Deane, P. (2015). Process features in writing: Internal structure and incremental

value over product features (RR-15-27). Princeton, NJ: ETS Research Report.

Identifying Useful Features to Detect Off-Topic

Essays in Automated Scoring Without Using

Topic-Specific Training Essays

Jing Chen and Mo Zhang


Abstract E-rater is the automated scoring engine used at ETS to score the writing

quality of essays. A pre-screening filtering system is embedded in e-rater to detect

and exclude essays that are not suitable to be scored by e-rater. The pre-screening

filtering system is composed of a set of advisory flags, each of which marks an

unusual characteristic of the essay (e.g., repetition of words and sentences, restatement of the

prompt). This study examined the effectiveness of an advisory flag in the filtering

system that detected off-topic essays. The detection of off-topic essays usually

requires topic-specific training essays to train the engine in order to identify essays

that are very different from the other essays of the same topic. The advisory flag

used here is designed to detect off-topic essays without using topic-specific training

essays because topic-specific training essays may not be available in real test settings.

To enhance the capability of this off-topic advisory flag, we identified a set of

essay features that do not require topic-specific training essays and are potentially

useful in distinguishing off-topic essays. These features include essay length, the

number of word types (excluding non-content-bearing words), the number of word

tokens, the similarity of an essay to training essays, essay organization, and the

variety of sentences in an essay.

Keywords Automated essay scoring • Feature selection • Off-topic essay


1 Introduction

Automated scoring is increasingly used to score constructed-response

items given its advantages such as low cost, real-time feedback, quick score turnaround, and consistency over time (Williamson, Bejar & Hone 1999). Automated

scoring engines have been developed to score different types of constructed response

such as short responses, essays and speaking responses. Automated Essay Scoring

J. Chen • M. Zhang

Educational Testing Service, Princeton, NJ 08541, USA

e-mail: jingchenhao@gmail.com

© Springer International Publishing Switzerland 2016

L.A. van der Ark et al. (eds.), Quantitative Psychology Research, Springer

Proceedings in Mathematics & Statistics 167, DOI 10.1007/978-3-319-38759-8_24



(AES), which is the subject of this paper, is defined as using computer technology

to evaluate and score written prose (Dikli 2006). The scoring process of an

AES system usually involves extracting features using Natural Language Processing

(NLP) techniques and statistical modeling that predict human scores based on the

extracted essay features. Though the performance of AES systems may equal or

surpass that of human raters in many aspects of writing, the systems do not really

“read” or “understand” the essays. If an essay is very unusual, an AES system may

fail to process the essay or it may assign a score that does not reflect the criteria

specified in the scoring rubrics.

1.1 The Filtering System of Automated Essay Scoring

A pre-screening filtering system is often used to identify unusual essays that are not

suitable to be scored by the automated scoring system. An effective pre-screening

filtering system is important to ensure the validity of automated scores. If the scoring

engine handles problematic responses in the same way as it handles

normal responses, it may degrade users’ confidence in using the scoring engine.

Furthermore, essays detected as unusual by the filtering system often need to be

scored by human raters. An effective filtering system would detect unscorable essays

while minimizing the number of essays that are falsely identified as unscorable to

avoid unnecessary human scoring cost.
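
The trade-off described above, catching unscorable essays while limiting false identifications, can be quantified with two rates. The sketch below is illustrative only: the function name `flag_rates` and the boolean-list inputs are assumptions, and the study may summarize flag performance differently.

```python
from typing import List, Tuple


def flag_rates(is_off_topic: List[bool], is_flagged: List[bool]) -> Tuple[float, float]:
    """Return (detection_rate, false_alarm_rate) for a filtering flag.

    detection_rate: share of truly off-topic essays that the flag caught.
    false_alarm_rate: share of on-topic essays that the flag marked anyway,
    each of which would incur an unnecessary human score.
    """
    if len(is_off_topic) != len(is_flagged):
        raise ValueError("inputs must have the same length")
    hits = sum(off and flagged for off, flagged in zip(is_off_topic, is_flagged))
    false_alarms = sum((not off) and flagged for off, flagged in zip(is_off_topic, is_flagged))
    n_off = sum(is_off_topic)
    n_on = len(is_off_topic) - n_off
    detection_rate = hits / n_off if n_off else float("nan")
    false_alarm_rate = false_alarms / n_on if n_on else float("nan")
    return detection_rate, false_alarm_rate
```

Because off-topic essays are typically rare, even a small false alarm rate can translate into a large number of essays routed to human raters, which is why both rates matter.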

Several widely used AES systems have pre-screening filtering systems to detect

unusual responses. The Intelligent Essay Assessor (IEA, Landauer, Laham & Foltz

2003) checks for a number of things such as the number and degree of clustering

of word type repetitions, the extent to which the essay is off topic and whether

the essay is a copy or rearrangement of another essay (Wollack & Fremer 2013).

The IntelliMetric system developed by Vantage Learning (Elliot 2003; Shermis &

Barrera 2002) has warning flags for things such as nonsensical writing, violent

language, copying the prompt and plagiarism (Rudner, Garcia & Welch 2006).


E-rater, the automated scoring engine developed at ETS (Attali & Burstein 2006),

has a set of advisory flags to identify atypicality in an essay (e.g., too many

repetitions of words and sentences, unusual organization, off-topic content).

This study investigates the effectiveness of an e-rater advisory flag that is

designed to detect off-topic essays. In high-stakes writing assessments, if an essay

triggers any advisory flag, the essay will be excluded from automated scoring

and scored by human raters only. Thus, an effective advisory flag would detect

unsuitable responses while minimizing the number of essays that are falsely

identified as unsuitable for automated scoring in order to save human scoring cost.

For low-stakes writing assessments, e-rater is often used as the sole scoring method.

If an essay triggers the off-topic advisory flag, e-rater will still generate a score for

the essay but will provide a warning that the essay might be off-topic, so the score

assigned may not be valid. The assessments used in this study are all high-stakes

assessments, which means essays that triggered the off-topic advisory flag will be

excluded from automated scoring and scored by human raters only. We examine the

performance of an off-topic advisory flag and identify potentially useful features to

detect off-topic essays more accurately. In the following section, we introduce our

definition of off-topic essays and the off-topic advisory flag we evaluated in this study.



1.2 Off-Topic Essays and the E-rater Off-Topic Advisory Flags

Essays can be off-topic in many different ways. We define off-topic

essays somewhat broadly to include unexpected topic essays, bad faith essays,

essays in a foreign language, essays consisting of only keystroke characters with

little lexical content or essays that are illegible. In scoring rubrics, these essays

usually receive human scores of ‘0’. Two common types of off-topic essays in

writing assessments are essays written on an unexpected topic and bad faith essays.

An unexpected-topic essay can be well written, but it provides no evidence of an

attempt to address the assigned topic. Bad faith essays are usually written by

examinees who respond uncooperatively. They write text irrelevant to the assigned

topic because of boredom or other reasons.

In the pre-filtering system of e-rater, there are two types of advisory flags that

detect off-topic essays. One type of off-topic advisory flag requires topic-specific

training essays in order to identify essays that are very different

from the other essays on the same topic. The other type of off-topic advisory flag

does not require topic-specific training essays; its detection of off-topic essays

relies on essay features that can be computed without topic-specific training essays.

In real test settings, topic-specific training essays may not be available when new

prompts are administered. In addition, the sample of topic-specific training

essays may sometimes be too small. An advisory flag that does not need topic-specific training essays can be used more flexibly and broadly in real test settings.

Thus, we only investigate this type of off-topic advisory flag in this paper and try

to find essay features that can be computed without topic-specific training essays to

improve the detection of off-topic essays.

More specifically, the advisory flag we evaluate in this study detects off-topic

essays based on the similarity between the text of an essay and the prompt on which

the essay is supposed to have been written (Higgins, Burstein, & Attali, 2006). This

similarity is measured by a feature (abbreviated as “S_Prompt”) calculated based

on Content Vector Analysis (CVA, for a detailed introduction, see Kaplan 2010, p.

531). The similarities between an essay and its target prompt (i.e., S_Prompt) as well

as a set of reference prompts are calculated and sorted. If the similarity between an essay

and its target prompt ranks among the top 15% of the similarity scores, then the

essay is considered on topic. Otherwise, it is identified as off-topic.
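
The ranking rule just described can be sketched in a few lines. This is a simplified illustration under stated assumptions, not e-rater's implementation: raw term counts stand in for whatever term weighting the operational CVA feature uses, tokenization is a plain regular expression, and the names `term_vector`, `cosine`, and `is_on_topic` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def term_vector(text: str) -> Counter:
    """Bag-of-words vector with raw counts (an assumption; no term weighting)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def is_on_topic(essay: str, target_prompt: str, reference_prompts: List[str],
                top_fraction: float = 0.15) -> bool:
    """On topic if the target prompt's similarity ranks in the top fraction."""
    essay_vec = term_vector(essay)
    target_sim = cosine(essay_vec, term_vector(target_prompt))
    ref_sims = [cosine(essay_vec, term_vector(p)) for p in reference_prompts]
    # Rank 1 means no reference prompt is more similar to the essay.
    rank = 1 + sum(sim > target_sim for sim in ref_sims)
    # Number of top-ranked positions that count as on topic (at least one).
    cutoff = max(1, math.ceil(top_fraction * (len(ref_sims) + 1)))
    return rank <= cutoff
```

The design point worth noting is that the rule is relative: it compares the essay's similarity to its own prompt against its similarity to other prompts, so no topic-specific training essays are needed.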

This study evaluates the effectiveness of the off-topic advisory flag introduced

above. Besides the feature used in this flag, e-rater extracts many other essay


features to evaluate essays. This study explores what additional essay features can

potentially be used to detect off-topic essays. The study is guided by the following

two research questions:

1. How effective is the advisory flag in detecting off-topic essays?

2. Among the essay features that e-rater extracts, what are the potentially useful

ones that can detect off-topic essays?

2 Methods

2.1 Data

The data for this study came from the writing tasks of two large-scale high-stakes

assessments. Assessment I is a college level test, and Assessment II is an English

proficiency test. The writing section in Assessment I includes two tasks, which

we refer to as Task A and Task B for the purpose of this paper. Task A requires

examinees to critique an argument while Task B requires examinees to articulate

an opinion and support it by using examples or relevant reasoning.

Similar to the writing tasks of Assessment I, the writing section of Assessment II

also includes two tasks, which we refer to as Task C and Task D. Task C requires

test takers to respond in writing by synthesizing information that they have read

with information they have heard. Task D requires test takers to articulate and

support an opinion on a topic.

Tasks A and B are scored on a scale from 1 to 6, and Tasks C and D on a scale from

1 to 5. The lowest score, 1, indicates a poorly written essay and the highest score,

5 or 6, indicates a very well written essay. In addition, the scoring rubrics of these

four writing tasks all specify that an essay at score level ‘0’ is not connected to the

topic, is written in a foreign language, consists of keystroke characters, or is blank.

Therefore, for the purpose of this study, we classify all the essays that received a

human score of ‘0’ as off-topic essays (except the blank ones) and the other essays

with non-zero scores as on-topic essays.
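
The labeling rule above is simple enough to state as code. The sketch below is only an illustration: the function name `label_essay` and its inputs are hypothetical, and a blank response is represented here as text containing nothing but whitespace.

```python
from typing import Optional


def label_essay(human_score: int, text: str) -> Optional[str]:
    """Label an essay from its human score, following the rule in the text.

    Returns None for blank responses, which are excluded from the analysis.
    """
    if not text.strip():
        return None  # blank responses are not treated as off-topic
    return "off-topic" if human_score == 0 else "on-topic"
```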

In operational scoring design, essays from high-stakes assessments usually are

scored by a human rater first. If the human rater assigns a score of ‘0’ to an

essay, which indicates that the essay is very unusual, the essay will be excluded

from automated scoring entirely. Instead, a second human rater will evaluate this

essay to check the score from the first human rater. Because off-topic essays will

be flagged by human raters, the issue of off-topic responses is not viewed as a

serious problem for automated scoring in high-stakes assessments. However, in low-stakes assessments, when e-rater is used as the primary or sole scoring system, it is

important to have an effective flag to detect off-topic essays that may not be suitable

for automated scoring.

To evaluate the effectiveness of the off-topic flag discussed previously, we

selected a random sample of around 200,000 essays from each writing task that was
