
features to evaluate essays. This study explores what additional essay features can potentially be used to detect off-topic essays. The study is guided by two research questions:

1. How effective is the advisory flag in detecting off-topic essays?
2. Among the essay features that e-rater extracts, which ones are potentially useful for detecting off-topic essays?



2 Methods

2.1 Data

The data for this study came from the writing tasks of two large-scale high-stakes assessments. Assessment I is a college-level test, and Assessment II is an English proficiency test. The writing section of Assessment I includes two tasks, which we refer to as Task A and Task B for the purposes of this paper. Task A requires examinees to critique an argument, while Task B requires examinees to articulate an opinion and support it with examples or relevant reasoning. Similar to Assessment I, the writing section of Assessment II also includes two tasks, which we refer to as Task C and Task D. Task C requires test takers to respond in writing by synthesizing information they have read with information they have heard. Task D requires test takers to articulate and support an opinion on a topic.

Tasks A and B are scored on a scale from 1 to 6, and Tasks C and D on a scale from 1 to 5. The lowest score, 1, indicates a poorly written essay, and the highest score, 5 or 6, indicates a very well written essay. The scoring rubrics of all four writing tasks further specify that an essay at score level "0" is not connected to the topic, is written in a foreign language, consists of keystroke characters, or is blank. For the purposes of this study, we therefore classify all essays that received a human score of "0" as off-topic essays (except the blank ones) and all essays with non-zero scores as on-topic essays.

In operational scoring designs, essays from high-stakes assessments are usually scored by a human rater first. If the human rater assigns a score of "0" to an essay, indicating that the essay is very unusual, the essay is excluded from automated scoring entirely; instead, a second human rater evaluates the essay to check the first rater's score. Because off-topic essays are flagged by human raters, the issue of off-topic responses is not viewed as a serious problem for automated scoring in high-stakes assessments. However, in low-stakes assessments where e-rater is used as the primary or sole scoring system, it is important to have an effective flag to detect off-topic essays that may not be suitable for automated scoring.

To evaluate the effectiveness of the off-topic flag discussed previously, we selected a random sample of around 200,000 essays from each writing task, collected from July 2012 to June 2013 for Assessment I and from July 2011 to June 2013 for Assessment II. These random samples include both off-topic and on-topic essays. Table 1 lists the precise number of essays selected from each writing task and the proportion of off-topic essays in each sample.

Table 1 Essay sample size of each writing task (Sample 1)

                                       Task A    Task B    Task C    Task D
  No. of selected essays               199,650   199,656   200,782   199,605
  Proportion of off-topic essays (%)   0.02      0.03      0.7       0.08

Table 2 Number of selected off-topic and on-topic essays (Sample 2)

           Off-topic   On-topic
  Task A   388         388
  Task B   809         809
  Task C   24,244      24,244
  Task D   3147        3147

In operational scoring, an essay is scored by a second human rater if the first human rater assigns a score of "0". Our off-topic essay sample only includes essays that received a score of "0" from both human raters, to ensure that the judgment of "0" was shared by both raters. Among these essays, we also excluded those that did not meet the length requirement of containing at least two sentences. Such extremely short essays are already excluded from automated scoring by a separate advisory flag that detects extremely short essays, so we do not need to consider how well the off-topic advisory flag detects them. For these reasons, the proportion of off-topic essays listed in Table 1 is lower than the proportion of essays that received a score of "0" in operational scoring.
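As a rough illustration of these selection rules (not the authors' actual pipeline), the filtering could look like the following sketch; the data frame and its columns (human_score_1, human_score_2, num_sentences) are hypothetical names introduced here for illustration only.

```python
import pandas as pd

def select_samples(essays: pd.DataFrame):
    """Split essays into off-topic and on-topic groups using the rules described
    above. Column names are hypothetical, not the authors' variables."""
    long_enough = essays["num_sentences"] >= 2  # the "at least two sentences" requirement
    # Off-topic: both human raters assigned "0", and the essay is long enough that
    # the separate extreme-brevity advisory flag would not already catch it.
    off_topic = essays[(essays["human_score_1"] == 0)
                       & (essays["human_score_2"] == 0) & long_enough]
    # On-topic: a non-zero human score.
    on_topic = essays[essays["human_score_1"] > 0]
    return off_topic, on_topic
```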

The sample listed in Table 1 (Sample 1) is used to evaluate the effectiveness of the off-topic advisory flag (e.g. to calculate precision and recall rates). In this sample, there are only small numbers of off-topic essays from Task A and Task B (i.e. fewer than 100 essays). To compare the linguistic features of on-topic and off-topic essays and identify the most distinctive features, we therefore selected off-topic essays from a broader time range. For each writing task, we selected all the off-topic essays (i.e. essays that received a score of "0" from both human raters and contained at least two sentences) from operational scoring between July 2011 and June 2015. We then randomly selected a set of on-topic essays (i.e. essays that received non-zero scores from human raters) from each writing task over the same time range to match the sample size of the selected off-topic essays. Table 2 presents the resulting sample sizes of these off-topic and on-topic essays. This is our second sample (Sample 2). For both groups of essays, we extracted all the essay features using the latest version of the e-rater engine.

In our analysis, we include nine high-level features that predict human scores, their associated low-level features, and some additional features that are not used in predicting human scores but provide additional information about the essays. The nine high-level features are grammar, usage, mechanics, development, organization, word choice, word length, collocation and preposition, and sentence variety. Most of these nine features are composed of sets of low-level features computed using Natural Language Processing (NLP) techniques that are then combined to produce the high-level feature values. The features that provide additional information about the essays include features that measure essay length (e.g. number of sentences), features related to word type and word token usage, and features that measure the similarity between an unseen essay and the training essays.

More specifically, among the features that provide additional information, one feature is the number of word types, a count of the unique words in the essay. Another is the number of word tokens, which counts every occurrence of those words. If a unique word type appears multiple times in an essay, "the number of word types" counts it once, but "the number of word tokens" counts each occurrence. When calculating these two features, a "stop list" is used to exclude non-content-bearing words (e.g. words such as "the" and "of"), so both features count only content-bearing words. A third feature, ZTT (the Z-score of the ratio of the number of word types to the number of word tokens), provides a measure of the variety of words in an essay. The feature value is high if each unique word appears only once and low if each unique word appears many times.
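These three word-level features can be sketched roughly as follows; the tokenizer and the tiny stop list are simplified stand-ins for e-rater's internal components, and the z-scoring step assumes a reference mean and standard deviation for the type/token ratio are available from a norming sample.

```python
import re

STOP_LIST = {"the", "of", "a", "an", "and", "to", "in", "is", "it"}  # illustrative only

def word_type_token_features(essay: str, ratio_mean: float, ratio_sd: float):
    """Count content-bearing word types and tokens and compute a ZTT-style
    z-score of the type/token ratio. A rough sketch, not e-rater's code."""
    tokens = [w for w in re.findall(r"[a-z']+", essay.lower()) if w not in STOP_LIST]
    num_tokens = len(tokens)       # every occurrence of a content-bearing word
    num_types = len(set(tokens))   # each unique content-bearing word counted once
    ratio = num_types / num_tokens if num_tokens else 0.0
    ztt = (ratio - ratio_mean) / ratio_sd  # standardized against a reference sample
    return num_types, num_tokens, ztt
```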

The similarity between an essay and the training essays is measured by several features, including S_Max, S1, S2, S3, S4, S5 and S6. A similarity value can be calculated between an unseen essay and each of the training essays; S_Max is the largest of these similarity values. The S1 feature measures the similarity between an unseen essay and all the training essays that received a human score of "1". Similarly, S2 measures the similarity between an unseen essay and all the training essays that received a score of "2", and so on. All these similarity features are calculated based on CVA [1].
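These CVA-based similarity features can be approximated with cosine similarities between term-weight vectors. The sketch below uses scikit-learn's tf-idf vectorizer and hypothetical training sets grouped by human score level; whether each score-level similarity is computed against an aggregated score-level vector or averaged over individual training essays is an implementation detail the paper does not specify, and the mean is used here only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_features(essay: str, training_by_score: dict) -> dict:
    """Approximate S1..Sk and S_Max: similarity between an unseen essay and
    the training essays at each human score level. Illustrative only."""
    all_train = [t for essays in training_by_score.values() for t in essays]
    vectorizer = TfidfVectorizer().fit(all_train)
    essay_vec = vectorizer.transform([essay])
    features = {}
    for score, essays in sorted(training_by_score.items()):
        sims = cosine_similarity(essay_vec, vectorizer.transform(essays))
        features[f"S{score}"] = float(sims.mean())
    # S_Max: the largest similarity to any single training essay.
    features["S_Max"] = float(cosine_similarity(essay_vec, vectorizer.transform(all_train)).max())
    return features
```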



2.2 Data Analysis

First, to evaluate the effectiveness of the off-topic advisory flag, we calculated the precision, recall and F-score of this flag in detecting off-topic essays based on Sample 1. Precision is the proportion of detected off-topic essays that are truly off-topic. Recall is the proportion of truly off-topic essays (as classified by human raters) that are detected by the advisory flag. The F-score [2] is a measure that balances the precision and recall rates. Second, to identify the features that are potentially useful in detecting off-topic essays, we compared the values of each e-rater feature between the on-topic and the off-topic essay groups based on Sample 2. We calculated the mean and standard deviation of each feature for the two groups of essays and used Cohen's d to measure the difference between the group means.

For Cohen's d, an effect size of 0.2-0.3 indicates a "small" effect, an effect size around 0.5 suggests a "medium" effect, and an effect size of 0.8 or greater suggests a "large" effect (Cohen 1988). Features with a large Cohen's d (greater than 0.8) are considered the most distinctive features. We examined whether these identified features can be grouped into common themes that measure particular aspects of writing.

[1] The essays used to train the CVA-based features were collected from Assessments I and II during July 2012 to July 2013 and July 2011 to June 2013, respectively. Around 100,000 essays were used as training essays for each writing task of the assessments. There were no essays in common between the dataset described in Table 2 and the data used to train these CVA features.

[2] The F-score is defined as F = 2 × (Precision × Recall) / (Precision + Recall).
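A minimal sketch of these evaluation statistics, assuming boolean arrays marking flagged and truly off-topic essays and two arrays of feature values for the group comparison:

```python
import numpy as np

def precision_recall_f(flagged: np.ndarray, truly_off_topic: np.ndarray):
    """Precision, recall and F-score of the advisory flag (boolean arrays)."""
    true_positives = np.sum(flagged & truly_off_topic)
    precision = true_positives / np.sum(flagged)
    recall = true_positives / np.sum(truly_off_topic)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def cohens_d(on_topic_values: np.ndarray, off_topic_values: np.ndarray) -> float:
    """Cohen's d for the mean difference between the two essay groups,
    using the pooled standard deviation."""
    mean_diff = on_topic_values.mean() - off_topic_values.mean()
    pooled_sd = np.sqrt((on_topic_values.var(ddof=1) + off_topic_values.var(ddof=1)) / 2)
    return mean_diff / pooled_sd
```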



3 Results

3.1 Effectiveness of the Off-Topic Advisory Flag

Table 3 presents the precision, recall and F-score of the advisory flag we evaluated. The precision rate is 100 % across the four writing tasks: every essay detected as off-topic was also classified as off-topic by human raters. The high precision rate saves the human scoring costs associated with scoring false positive cases. The recall rate, however, varies between 2.2 and 18.1 % across the four writing tasks, so improvements could be made to capture more off-topic essays. We therefore investigated what additional essay features can potentially be used to detect off-topic essays.

Table 3 The precision, recall and F-score of the off-topic advisory flag

              Task A   Task B   Task C   Task D
  Precision   100.0    100.0    100.0    100.0
  Recall      16.7     16.0     2.2      18.1
  F-score     28.6     27.6     4.4      30.7



3.2 Most Distinctive Features Between On-Topic and Off-Topic Essays

Features with a large Cohen's d (greater than 0.8) are identified as the most distinctive features for off-topic essays. These features can be grouped into five categories, each of which measures a particular aspect of writing. In this section, we describe the differences between the on-topic and the off-topic essays in these five aspects.

Essay length. Off-topic essays are considerably shorter than on-topic essays. Table 4 presents descriptive statistics of essay length for the on-topic and the off-topic essays from each writing task. Essay length is calculated from the number of characters, words, and sentences. Across all four writing tasks and all three measures of essay length, off-topic essays are considerably shorter than on-topic essays. For example, on average, the on-topic essays from Task A have around 395 words, whereas the off-topic essays from the same task have only around 186 words.

Table 4 Essay length of the on-topic and off-topic essays

                        On-topic             Off-topic
                        Mean       SD        Mean       SD        Cohen's d
  Task A  Chars         1961.08    645.09    945.01     718.54    1.49
          Words         395.61     127.53    185.58     138.87    1.58
          Sentences     18.17      5.97      9.01       6.72      1.44
  Task B  Chars         1987.26    654.77    1244.92    696.06    1.10
          Words         406.73     133.38    264.52     143.54    1.03
          Sentences     19.25      6.68      13.81      7.66      0.76
  Task C  Chars         1068.53    272.14    727.59     327.42    1.13
          Words         216.30     54.56     141.95     63.40     1.26
          Sentences     11.27      3.48      8.54       3.77      0.75
  Task D  Chars         1597.08    416.08    765.39     550.88    1.70
          Words         338.54     85.37     167.02     121.69    1.63
          Sentences     18.00      5.65      11.45      8.89      0.88

The number of word types and the number of word tokens. Our analysis reveals that the off-topic and on-topic essays show a large difference in terms of two features: the number of word types and the number of word tokens. Table 5 presents the statistics of these two features, which show that off-topic essays have fewer unique words and fewer occurrences of unique words than on-topic essays.

Table 5 Number of word types and number of word tokens of the on-topic and off-topic essays

                                  On-topic           Off-topic
                                  Mean      SD       Mean      SD       Cohen's d
  Task A  Number of word types    118.11    33.47    59.07     28.96    1.89
          Number of word tokens   211.50    68.56    113.80    84.74    1.27
          ZTT                     0.00      0.93     0.71      2.25     0.41
  Task B  Number of word types    131.13    41.16    83.06     39.59    1.19
          Number of word tokens   221.45    73.18    151.34    89.19    0.86
          ZTT                     0.04      0.99     0.17      1.60     0.10
  Task C  Number of word types    78.12     17.35    62.69     21.48    0.79
          Number of word tokens   117.89    29.24    86.47     38.50    0.92
          ZTT                     0.03      0.97     0.97      1.16     0.88
  Task D  Number of word types    113.03    29.28    60.62     39.15    1.52
          Number of word tokens   190.44    48.60    103.15    77.61    1.35
          ZTT                     0.06      0.96     0.89      2.17     0.49






However, these two features may be closely related to essay length, since longer essays by definition have more words (new or repeated). We therefore also examined a feature that measures the word variety of an essay but is less influenced by essay length. This feature, ZTT, is based on the Z-score of the ratio of the number of word types to the number of word tokens; Table 5 lists its descriptive statistics. In general, the ZTT values of off-topic essays are higher than those of on-topic essays, with the exception of Task B. This pattern is reasonable because an essay with many unique words and little repetition may fail to focus on key concepts and stay on topic. However, because the pattern for Task B is not consistent with the others, future research is needed to examine whether ZTT is really effective in predicting off-topic essays across different types of writing tasks.

Similarity features. Table 6 lists the means and standard deviations of the similarity features for the off-topic and on-topic essay groups, along with the Cohen's d of each mean difference. The similarity between an off-topic essay and the training essays is much lower than that between an on-topic essay and the training essays. Some off-topic essays, such as bad faith essays or essays written in a foreign language, are expected to have very low similarity to the majority of essays, so it is reasonable that, on average, the similarity features of off-topic essays are lower than those of on-topic essays. Though the size of the mean difference varies across the four tasks, in general the similarity features of the off-topic essays are much lower than those of the on-topic essays. These similarity features can potentially be used to distinguish off-topic essays from on-topic ones.

Table 6 Similarity features of the on-topic and the off-topic essays

                  On-topic        Off-topic
                  Mean    SD      Mean    SD      Cohen's d
  Task A  S_Max   0.11    1.02    0.37    1.76    0.33
          S1      0.07    0.04    0.08    0.06    0.20
          S2      0.16    0.04    0.12    0.04    1.00
          S3      0.18    0.04    0.12    0.05    1.33
          S4      0.18    0.04    0.11    0.05    1.55
          S5      0.16    0.04    0.11    0.05    1.10
          S6      0.10    0.07    0.07    0.06    0.46
  Task B  S_Max   0.07    0.97    1.16    1.28    1.08
          S1      0.11    0.05    0.08    0.03    0.73
          S2      0.18    0.05    0.13    0.05    1.00
          S3      0.20    0.05    0.13    0.05    1.40
          S4      0.20    0.06    0.13    0.05    1.27
          S5      0.18    0.05    0.11    0.04    1.55
          S6      0.12    0.04    0.07    0.03    1.41
  Task C  S_Max   0.01    1.02    1.23    2.08    0.76
          S1      0.13    0.04    0.09    0.04    1.00
          S2      0.14    0.04    0.09    0.03    1.41
          S3      0.15    0.04    0.09    0.03    1.70
          S4      0.14    0.04    0.09    0.03    1.41
          S5      0.13    0.04    0.08    0.03    1.41
  Task D  S_Max   0.08    0.96    1.17    1.07    1.07
          S1      0.10    0.03    0.06    0.03    1.33
          S2      0.18    0.05    0.10    0.05    1.60
          S3      0.20    0.05    0.10    0.05    2.00
          S4      0.19    0.04    0.10    0.05    1.99
          S5      0.17    0.05    0.08    0.05    1.80

Organization. The organization feature score of off-topic essays is considerably lower than that of on-topic essays. The organization feature consists of a set of low-level features that detect whether particular discourse elements are present or absent in an essay (Burstein, Marcu & Knight 2003). These discourse elements include introductory material (to provide the context or set the stage), a thesis statement (to state the writer's position in relation to the prompt), main ideas (to assert the author's main message), supporting ideas (to provide evidence and support the claims in the main ideas, thesis, or conclusion), and a conclusion (to summarize the essay's entire argument) (Attali & Burstein 2006).

Table 7 lists the descriptive statistics of the organization feature. The organization feature of the off-topic essays is considerably lower than that of the on-topic essays across all four writing tasks. Some bad faith essays might not be argumentative or summary-like in nature, which might lead to the lack of organizational elements that are typical of argumentative writing (e.g. main idea, supporting evidence). Other off-topic essays may lack good organization simply because they are too short, so the difference in the organization feature between on-topic and off-topic essays might partly reflect the difference in essay length. Thus, the organization feature is potentially useful, but it needs to be examined whether it provides information beyond essay length for detecting off-topic essays.

Table 7 Organization feature of the on-topic and the off-topic essays

           On-topic        Off-topic
           Mean    SD      Mean    SD      Cohen's d
  Task A   1.97    0.38    1.13    0.62    1.63
  Task B   1.96    0.34    1.52    0.63    0.87
  Task C   1.75    0.42    1.26    0.56    0.99
  Task D   1.94    0.32    1.18    0.65    1.48

Sentence variety. Another finding is that the sentence variety feature values of the off-topic essays are much lower than those of the on-topic essays. The sentence variety feature is composed of a set of low-level features that measure the occurrence of particular types of words, phrases, and punctuation. A higher sentence variety score indicates that an essay has more heterogeneous sentences. The statistics listed in Table 8 suggest that off-topic essays have much lower sentence variety scores than on-topic essays. Some examinees could have written homogeneous sentences because of low language ability, but it is also possible that the lower sentence variety scores of off-topic essays arise because off-topic essays are often too short to include a large variety of syntactic types. Thus, sentence variety is a potentially useful feature, and future research is needed to investigate whether it provides useful information beyond essay length for detecting off-topic essays.

Table 8 Sentence variety feature of the on-topic and off-topic essays

           On-topic        Off-topic
           Mean    SD      Mean    SD      Cohen's d
  Task A   3.86    0.45    3.00    0.70    1.46
  Task B   3.82    0.45    3.00    0.70    1.39
  Task C   3.12    0.48    2.93    0.60    0.35
  Task D   3.44    0.45    2.46    0.79    1.52



4 Discussion

This study investigates the effectiveness of an advisory flag in detecting off-topic essays in automated scoring. A well-formed and well-written essay that does not address the assigned topic may receive an overestimated score from an AES system because of its linguistic features. Successful identification of off-topic essays is essential to ensure the validity of machine scores and to support automated scoring as the primary scoring method.

Our investigation of the effectiveness of the existing advisory flag reveals that the flag has a 100 % precision rate in detecting off-topic essays across the four data sets we evaluated, while its recall rate varies from 2.2 to 18.1 % across the four data sets. These results suggest that all detected essays are truly off-topic but that a large number of truly off-topic essays are not captured by the advisory flag. To improve on the existing advisory flag, we identified features that can potentially be used to build new advisory flags that detect more off-topic essays. These features include essay length, the number of word types (excluding non-content-bearing words), the number of word tokens, the word variety feature (ZTT), the similarity of an unseen essay to the training essays, essay organization, and sentence variety.

Two limitations of this study should be noted. First, our evaluation of the performance of the off-topic advisory flag is relatively imprecise. We did not further classify the essays that received a human score of 0 into categories according to the way in which an essay diverges from the requested essay topic (e.g. unexpected-topic essays, bad faith essays), so we lack the information to evaluate how well the flag detects different types of off-topic essays. Second, we only used Cohen's d to identify the features that are potentially useful in detecting off-topic essays. Many other feature selection methods are available; for example, Guyon and Elisseeff (2003) introduced variable and feature selection methods such as variable ranking. In future studies, we will apply other feature selection methods and compare the results across methods to arrive at a better selection of features.
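As one illustration of the variable-ranking idea, features could be scored individually against the off-topic label and sorted; the mutual-information criterion below is one possible choice of ranking statistic, not the method the authors used.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(feature_matrix: np.ndarray, is_off_topic: np.ndarray, feature_names):
    """Rank candidate features by a univariate relevance score with respect to
    the off-topic label. One of many possible ranking criteria."""
    scores = mutual_info_classif(feature_matrix, is_off_topic, random_state=0)
    return sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
```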






Future research could start with the identified features to build new flagging mechanisms. These features can be combined and refined to find the most effective ones for predicting off-topic essays. The off-topic flag we evaluated uses only the similarity between the essay text and the prompt text, so when an essay triggers the flag it is easy to tell why the flag fired and what kind of problem the essay may have. However, considering one flagging criterion at a time with a pre-specified triggering threshold may not work as well as building statistical models that use multiple criteria simultaneously and learn from real data to predict the probability that an essay is off-topic. For example, a logistic regression model could be built to predict the likelihood that an essay is off-topic, using features such as essay length, organization, and sentence variety as independent variables.
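A minimal sketch of such a model, assuming a numeric feature matrix (e.g. columns for essay length, organization, and sentence variety) and a binary off-topic label; the train/test split, the balanced class weights for the rare off-topic class, and any flagging threshold applied to the predicted probabilities are illustrative choices, not the authors' specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_off_topic_model(features: np.ndarray, is_off_topic: np.ndarray):
    """Fit a logistic regression that predicts the probability an essay is
    off-topic from features such as length, organization, and sentence variety."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, is_off_topic, test_size=0.25, stratify=is_off_topic, random_state=0)
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    # Predicted probabilities of being off-topic; a flagging threshold can be
    # tuned to trade recall against precision.
    prob_off_topic = model.predict_proba(X_test)[:, 1]
    return model, prob_off_topic
```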

Finally, additional features such as response time and process data (e.g. essay keystroke data) could be collected to predict off-topic essays. For example, if an examinee submits an essay very shortly after the assessment begins (e.g. within 30 s), the essay is likely to be a bad faith essay. Since off-topic essays can be off-topic in many different ways, additional features can capture the unusualness of essays from different aspects, which will help detect off-topic essays more accurately.



References

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3), 1–31. Retrieved from http://www.jtla.org.
Burstein, J., Marcu, D., & Knight, K. (2003). Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems, 18(1), 32–39.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning and Assessment, 5(1), 1–36. Retrieved from http://www.jtla.org.
Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Higgins, D., Burstein, J., & Attali, Y. (2006). Identifying off-topic student essays without topic-specific training data. Natural Language Engineering, 12(2), 145–159.
Kaplan, R. B. (2010). The Oxford handbook of applied linguistics. Oxford, UK: Oxford University Press.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated essay scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–112). Mahwah, NJ: Lawrence Erlbaum.
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric essay scoring system. Journal of Technology, Learning and Assessment, 4(4). Retrieved from http://www.jtla.org.
Shermis, M., & Barrera, F. (2002). Exit assessments: Evaluating writing ability through automated essay scoring. ERIC Document Reproduction Service No. ED 464 950.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). Mental model comparison of automated and human scoring. Journal of Educational Measurement, 36, 158–184.
Wollack, J. A., & Fremer, J. J. (2013). Handbook of test security. New York, NY: Routledge.



Students' Perceptions of Their Mathematics Teachers in the Longitudinal Study of American Youth (LSAY): A Factor Analytic Approach

Mohammad Shoraka



Abstract This study investigated the psychometric properties of questionnaire items used to measure students' perceptions of their mathematics teachers during middle school in the Longitudinal Study of American Youth (LSAY). Students' perceptions of their math teachers were gathered through 16 questions. The National Science Foundation (NSF) has funded the LSAY, and the questionnaire is a collaborative work of the National Centre for Vocational Education Research (NCVER), the Department of Education and Training, and the Wallies Consulting Group. The dataset was randomly split into two samples so that exploratory analyses could be conducted on one half of the sample and confirmatory analyses on the other half. One item, "teacher encourages extra work", was removed from the questionnaire after the analyses due to its low loading and ambiguous meaning. Four factors were extracted under different methods of extraction with oblique rotations and were named by the author: Teacher Characteristics, Teacher Instructional Expectations, Teacher Fairness, and Teacher Focus on Outcomes.

Keywords Students' perceptions • Mathematics teachers • Classic Test Theory

1 Introduction

One of the contemporary issues in the education of adolescents is their perceptions of the human resources in middle school. The study of students' perceptions has gone beyond the education sector (Balci 2011), with students being viewed as consumers whose parents pay taxes and expect good schooling in return. That is one of the reasons why students' views of teachers, and teacher empathy with students, have become a topic of interest (Ouazad & Page, 2011). In this study, the dimensional structure of students' perceptions was investigated by conducting a survey and analyzing the resulting data by means of factor analysis.



M. Shoraka
University of Windsor, Windsor, ON, Canada
e-mail: Shoraka@uwindsor.ca

© Springer International Publishing Switzerland 2016
L.A. van der Ark et al. (eds.), Quantitative Psychology Research, Springer Proceedings in Mathematics & Statistics 167, DOI 10.1007/978-3-319-38759-8_25






Middle school students' perceptions of parents, peers, families, and teachers are widely viewed as factors linked to student achievement. For instance, Azmitia, Cooper, and Brown (2009) indicated that students' perceptions of their teachers are not a significant factor in student achievement.

Research suggests that student perceptions of teachers' emphasis on mastery goals predict student self-efficacy. Levpušček and Zupančič (2009) used hierarchical linear modeling and showed that students' perceptions of math teachers' behavior were a significant factor in student motivational beliefs as well as their mathematics achievement.

In most developed countries, students by and large spend more of their time with teachers than with other academic human resources, as they are obligated to attend school regardless of their parents' will. There is therefore a legitimate reason why some researchers are concerned about the effect of teacher characteristics on student achievement. Those characteristics include teacher biases towards student gender and ethnic background (Dee, 2007; Hinnerich, Hoglin, & Johanneson, 2011), as well as teaching experience, performance on state teacher certification exams, certification status and area, competitiveness of a teacher's undergraduate institution, pathway into teaching, and SAT scores (Boyd, Lankford, Loeb, Rockoff, & Wyckoff, 2008; Clotfelter, Ladd, & Vigdor, 2007; Neild, Farley-Ripple, & Byrnes, 2009; Olaleye, 2011; Rockoff, 2004).

Many researchers have studied the impact of students' views of teachers on students, such as the role of perceived teacher goals in student self-efficacy (Friedel, Cortina, Turner, & Midgley, 2010). Some studies have investigated the effect of students' perceptions of support from teachers and classmates on declining attendance (DeWit, Karioja, & Rye, 2010; Nelson-Smith, 2008). Others have studied the effect of teacher expertise on students' sense of school belonging (Stevens, Hamman, & Olivárez, 2007). In another study, Chen, Thompson, Kromrey, and Chang (2011) investigated the association between teachers' expectations and students' perceptions of teachers' oral feedback in relation to the students.



2 Students’ Perceptions of Teachers

Students' perceptions of teachers have been measured in terms of student-teacher relationships at all levels of education (Cambridge Education, 2012; Hughes, Wu, Kwok, Villarreal, & Johnson, 2012; Koomen, Spilt, & Oort, 2011). For instance, Spilt, Koomen, and Mantzicopoulos (2010) gave children photographs representing teacher-child interactions and asked them to rate their teachers on closeness (e.g., my teacher always listens to me) and conflict (e.g., my teacher often gets angry).

By comparison, at the middle school and high school levels, researchers think beyond teacher-student relationships. In his dissertation, Semmel (2007) studied three more factors in addition to student-teacher relationships, including justice,
