Chapter 3. Evaluation: Evidence for Public Policy


This contribution provides an overview of the political, institutional and

methodological challenges that confront public policy evaluation with a view

to stimulating a constructive, collaborative response. It begins by specifying

the different questions asked in policy evaluation and links these to

techniques for answering them. The second section focuses on the threats to

effective policy evaluation that arise from the policy process, the nature of

policy and the marketplace for evaluation. The final section grapples with the

challenges of building and nurturing a vibrant and sustainable culture of

evaluation. Extensive use is made of case-studies, both of specific policy

evaluations and of the United Kingdom which, since the election of a New

Labour government in 1997, has sought to introduce a culture of policy

evaluation into the heart of central government policy making.

“Will this policy work?” “What kind of policies have worked in the past?”

If questions such as these could be answered satisfactorily, much of the risk

would be taken out of politics and policy making turned into a science.

Reality falls far short of such an aspiration. Not only are these kinds of

question rarely asked of social science, but social scientists also seem unable to answer

them with much sense of security. After a great deal of effort, robust answers

may be provided to tightly prescribed questions that limit generalisation, while

the big strategic questions generate imprecise answers hedged by

qualifications. Consequently, evaluation evidence is likely to remain just one of

the many sources of information deployed by policymakers and politics will

continue to be a risky enterprise. To the extent that this makes politics a degree

less boring, it may actually be good for the health of democracy.

Nevertheless, reducing the risk of introducing poor policies or rejecting

good ones by even a small amount could prevent billions of dollars, euros or

pounds from being wasted and add greatly to the sum of human well-being. It

is therefore incumbent on the policy community to seek to improve the

quality, quantity and use of evaluative evidence. This will necessitate a close

partnership between those seeking evaluative evidence – who need to ask for

it sufficiently early, help fund its generation and be prepared to act on both

welcome and unwelcome findings – and producers who have to engage with

the relevancies of policymakers while protecting and advancing standards of

scientific enquiry. This, in turn, requires institutions to promote, foster and

facilitate the partnership by encouraging productive dialogue and ensuring

that the diverse rewards are appropriately shared.





This overview of the issues begins by specifying the different questions

asked in policy evaluation and links these to techniques for answering them.

The second section focuses on the threats to effective policy evaluation that

are born of the policy process, the nature of policy and the marketplace for

evaluation. The final section grapples with the challenges of building and

nurturing a vibrant and sustainable culture of evaluation. Extensive use is

made of case-studies, both of specific policy evaluations and of the United

Kingdom which, since the election of a New Labour government in 1997, has

sought to introduce a culture of policy evaluation into the heart of central

government policy making.

A question of evaluation

There are many kinds of policy evaluation but a simple means of

categorising them is in terms of the question being asked and the timing of

the question (Table 3.1). There are two basic evaluation questions, one

descriptive: “Does the policy work?”, the other analytic: “Why?”. However, to

avoid the teleological implications of “why?” questions, it is preferable to

reformulate the second question as a “how?” question: “How does the policy

work or not work?” Evaluations that address the first question are variously

termed “summative”, “program(me)” or “impact” evaluations and are typically

quantitative (Orr, 1998; Greenberg and Schroder, 1997; Worthen et al., 1996).

Those that focus on the second question are frequently called “formative”

evaluations and are often qualitative (Patton, 2002; Pawson and Tilley, 1997;

Yanow, 1999). However, many different terms have been used to describe

formative evaluation that reflect subtly, and sometimes radically, different

ontological positions (see below).

Increasingly, evaluations of public policies are combining elements of

summative and formative evaluation (Gibson and Duncan, 2002). The fact that

this is a comparatively recent development, particularly in the US, is

somewhat perplexing since one would have thought it natural to want to

know why a policy worked or did not. Perhaps evaluators thought that it was

enough to know that a policy worked because it could then be continued or

implemented elsewhere. (The term “demonstration project”, commonly used

in the US to describe programme evaluations, suggests that policymakers have

such great faith in the policy package being evaluated that the prospect of

failure, and the need to analyse why, is seldom countenanced.) Alternatively,

it may be because the methodologies designed to answer the two questions

were developed in different parts of social science and have only come

together belatedly through necessity or in recognition of the value of interdisciplinarity in applied policy research (Ritchie and Lewis, 2003).





Table 3.1. Questions of evaluation

Each row pairs an evaluation question and evaluation method(s) with the counterpart formative evaluation question and illustrative formative evaluation approaches, by time perspective; bracketed case-studies are discussed in the text.

Extensive past
  What worked? — Systematic review
  How did it work? — Systematic review; Ashworth et al. (2002) (Case-study 4)

Recent past
  Did the policy work? — Retrospective evaluation
  How did it work/not work? — Retrospective interviews; participative judgement (connoisseurship studies); retrospective case-study; Huby and Dix (1992) (Case-study 3)

Present
  Is this policy working? — Interrupted time series; natural experiments
  How is it working/not working? — Process studies; implementation evaluation

Present to future
  Is there a problem? — Basic research; policy analysis
  What is the problem? — Basic research; rapid reconnaissance

Close future
  Can we make this policy work? — Prototype
  How can we make this policy work? — Theory of change; participative research; action research; Loumidis et al. (2001) (Case-study 2)

Future
  Will this policy work? — Programme evaluation (impact or summative evaluation): random assignment; matched designs; cohort designs; statistical controls
  How will it work/not work? — Theory of change; laboratory evaluation; Michalopoulos et al. (2002) (Case-study 1); Hills et al. (2001) (Case-study 8)

Expansive future
  What policy would work? — Prospective evaluation: micro-simulation; laboratory experimentation; gaming
  How would it work? — Laboratory evaluation; Delphi consultation; Brewer et al. (2001) (Case-study 5); Voyer et al. (2002) (Case-study 6); Walker et al. (1987) (Case-study 7)

Source: OECD.



The other dimension of the categorisation of evaluation studies relates to

when in the policy cycle the two evaluation questions are asked. The tense

used in the evaluation questions (present, past or future) will generally

indicate whether the evaluation is conducted concurrently, retrospectively or prospectively.

However, before discussing the implications of these distinctions, there is

a prior question: “What is the problem or opportunity that requires an

institutional policy response?”

Issues appear on the policy agenda for a range of reasons. These include:

the occurrence of a crisis; the result of secular social, economic and/or

political change; the successful activities of motivated individuals, interest

groups, or editors; and errors and mistakes on the part of politicians or

administrators (Hall et al., 1975; Guess and Farnham, 2000). When an issue has

emerged that may warrant a public policy response, the guidance offered in

most policy handbooks is that basic research should be undertaken to

delineate the nature of the problem and, where possible, to identify pathways

of causality that may indicate points for policy intervention (Cm., 1999).

However, handbooks are often ignored and this stage in the policy process is

frequently omitted or undertaken only cursorily. When an issue is thought to

be important and urgent, policy is often devised and implemented before the

issue is well understood. Even where this is not the case, the research

undertaken would normally be construed as applied research or policy

analysis rather than evaluation. However, to the extent that much policy is

iatrogenic, a response to the failure of pre-existing policy, this preliminary

scoping research is likely to have an evaluative component: “If existing policy

did not work, what were the reasons?” Moreover, if evaluation ever becomes a

central element in the policy process, evaluative evidence will, in turn, be

fundamental to any prospective policy review.

Programme evaluation

It is appropriate to begin discussion of the various kinds of evaluation

with the model in which a policy is tested before full implementation since

this is sometimes presented as the ideal-type evaluative strategy (Orr, 1998). In

this case, the evaluation question is properly expressed in the simple future:

“Will this policy work?” although the present tense (“Does this policy work?”)

is sometimes used, which has the unwarranted effect of turning a specific

question into a general one that seems unbounded by time and place.

Addressing the question “Will it work?” usually involves conducting a

programme evaluation or policy experiment (Greenberg and Schroder, 1997).

Conceptually this is the most straightforward form of evaluation. Certain

people are subjected to a policy intervention while others are not and the





outcome observed for both groups of people. Any difference in outcomes

established between the two groups is interpreted as a measure of the impact

of the policy, assuming all other things are held constant. Programme

evaluation was pioneered in the USA with the celebrated negative income tax

experiments of the early 1970s and remains the mainstay of the US policy

evaluation industry.

Even so, there are complex issues of definition and implementation. First,

the objectives of the policy need to be precisely defined and prioritised (this may

itself be a spur to improved policy making which often makes do with political

aspirations in lieu of objectives). The prioritisation of objectives is necessary

because evaluation designs can rarely measure performance against multiple

policy objectives with equal precision and are therefore usually devised to be

most precise with respect to the most important objective.

Secondly, there should be agreement as to the degree of change that

would constitute success. An aspiration to reduce the poverty rate, for

example, has to be accompanied by a statement of the number of percentage

points by which poverty is to be reduced. This is required so that samples used

in the evaluation can be large enough to determine with adequate precision

whether or not the policy has been a success.
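The link between the agreed effect size and the required sample can be sketched with the standard normal-approximation formula for comparing two proportions. This is a generic power calculation, not one taken from the chapter, and the poverty rates below are invented for illustration.

```python
import math

def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group needed to detect a change in a
    proportion (e.g. the poverty rate) from p1 to p2, using the normal
    approximation for a two-sided test at 5% significance (z_alpha)
    with 80% power (z_beta)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a fall in the poverty rate from 20% to 17% (3 percentage points)
n = sample_size_per_group(0.20, 0.17)  # roughly 2 600 people per group
```

The formula makes the trade-off concrete: halving the detectable effect roughly quadruples the required sample, which is why a vague aspiration to "reduce poverty" cannot anchor an evaluation design.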

Thirdly, some model or theory of change ought to be defined which would

lead one to expect that the policy being implemented would indeed bring

about the anticipated change. Such a model would allow appropriate outcome

measures to be devised. Also, the better specified the model, the greater the

number of intermediate outcomes that could be incorporated into the

evaluation to test the model of change, and to provide diagnostic indicators

when outcomes do not match up to expectations.

Finally, it is necessary to define a counterfactual, the situation that would

have obtained had the policy to be evaluated not been introduced. The

counterfactual provides a baseline against which the performance of the

policy is to be assessed. It is usually inadequate simply to compare the pattern

of outcomes before and after a policy is introduced since other features of the

policy environment may change that influence the effectiveness of the policy.

For example, a booming economy is likely to reduce poverty even in the

absence of anti-poverty policies; in such circumstances, “other things would

not remain constant” as required by the evaluative model. The role of the

counterfactual is to partial out the effects of these other changes to isolate the

impact of the policy alone.

The method that separates out the impact of a policy with minimum bias

and maximum precision entails randomly assigning members of the policy

target group to one of two subgroups: the so-called “action group” which is

given access to the new policy and the “control group” which is not. Any





difference observed in the outcomes for the two groups can confidently be

attributed to the effects of the policy since all other changes will, by definition

because of the randomisation process, randomly influence both groups.
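The logic of random assignment can be shown with a small simulation. Every number here is invented (an assumed average of six months in work a year, a hypothetical policy adding one month); the point is only that, because assignment is random, the difference in group means recovers the built-in effect.

```python
import random

random.seed(1)  # reproducible illustration

TRUE_IMPACT = 1.0  # hypothetical: the policy adds ~1 month of work a year

def outcome(treated):
    # Background variation affects both groups alike; only access differs.
    months_in_work = random.gauss(6.0, 2.0)
    return months_in_work + (TRUE_IMPACT if treated else 0.0)

# Random assignment to the "action" and "control" groups
action = [outcome(treated=True) for _ in range(5000)]
control = [outcome(treated=False) for _ in range(5000)]

def mean(xs):
    return sum(xs) / len(xs)

# With randomisation, the difference in means estimates the policy impact
estimated_impact = mean(action) - mean(control)  # close to TRUE_IMPACT
```

With 5 000 people per group the estimate falls within about a tenth of a month of the built-in effect; shrinking the samples widens that margin, which connects back to the need to agree in advance on the effect size worth detecting.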

Although often treated as the gold standard in evaluation, randomisation

is not without its limitations (Bottomley and Walker, 1996; Stafford et al., 2002).

One is the difficulty of securing political agreement. Politicians often object to

random assignment for two related reasons. The first is that experimentation

denies some people access to what may be considered a self-evident good. In

reality, of course, the reason for evaluating a policy is uncertainty as to the

benefits or, at least, the cost effectiveness of the policy. The second reason is

that the presumed good is allocated at random rather than with respect to

need or to the likelihood that people will gain from it. Given that it may be

impossible to say in advance who will benefit most, if at all, from the policy

intervention, random allocation is arguably as good a method as any of

assigning scarce resources. Nevertheless, these concerns have proved to be a

major obstacle to the use of randomised assignment in Britain.

There are also technical limitations to random assignment. The most

important arise when the policy to be evaluated is either intended to affect the

system as a whole as well as individuals within it, or when unintended

consequences of the policy are likely to operate at this level (Bottomley and

Walker, 1996). Take, for example, a policy to give welfare recipients a voucher

that allows employers to offset some of the costs of employing them. In an

experiment involving random assignment, only a proportion of welfare

recipients would receive vouchers and they would enjoy a competitive

advantage over other jobseekers that would disappear on full implementation

when all jobseekers would be given a voucher. The effect of this so-called

queuing bias is to exaggerate the apparent effectiveness of a policy initiative.

However, the system-wide consequences, for example a reduction in wage

rates, are likely to be understated since not every welfare recipient receives a

voucher; this is the partial equilibrium effect.

Unfortunately, there is no practical way of determining the scale of

queuing bias or the partial equilibrium effect (Burtless and Orr, 1986). Quasi-experimental designs, in which a policy may be fully implemented in one

jurisdiction and not in another, may allow for estimation of system-wide

effects, while also avoiding the need to allocate access to policies on a random

basis. However, because no two areas are identical, the control of factors

exogenous to the policy itself is much weaker.
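A common way to use such a two-jurisdiction quasi-experiment is a difference-in-differences comparison: the change observed in the comparison area stands in for the counterfactual trend in the pilot area. A sketch with invented employment rates:

```python
# Hypothetical employment rates (%) before and after the pilot area
# fully implements the policy; the comparison area never does.
pilot_before, pilot_after = 61.0, 66.0
comparison_before, comparison_after = 60.0, 62.0

# The comparison area's change (+2 points) approximates what would have
# happened in the pilot area anyway, so it is netted out.
impact_estimate = (pilot_after - pilot_before) - (
    comparison_after - comparison_before)  # 3.0 percentage points
```

Because no two areas are identical, the estimate is only as credible as the assumption that the two areas would otherwise have moved in parallel, which is the weaker control the text refers to.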

Further difficulties associated with random assignment include ensuring

that policy staff really do allocate access to clients at random and preventing

members of the control group from surreptitiously obtaining the same services

as the action group.





Case study 1

Programme evaluation: The Self-Sufficiency Project – Michalopoulos

et al., 2002

A decade ago the Canadian Department of Employment and Immigration

determined to investigate the effect that a “make work pay” strategy would

have on the ability of long-term welfare recipients to make the transition into

full-time employment. The Department therefore commissioned a 10-year

demonstration project of a specially designed policy initiative based on a

generous temporary earnings supplement, the Self-Sufficiency Project (SSP).

This involved 9 000 lone-parent families in two provinces, New Brunswick

and British Columbia.

To measure the impact of the SSP, a social experiment was conducted

involving random assignment. A sample was drawn of welfare recipients

who had been in receipt for more than a year, and one half randomly

assigned to a programme group and offered the SSP supplement, while the

remainder constituted the control group. Members of the programme group

could receive the supplement for a maximum of three years.

Approximately 6 500 sampled welfare recipients were visited at home, where

a 30-minute “baseline” survey was conducted. Respondents were then

told that they had been selected to join the study and invited to sign a consent

form after being told about the study and the principle of random assignment.

The response rate for the baseline survey was around 90 per cent. Respondents

were interviewed again 36 months and 54 months after random assignment,

and administrative records from various government departments used to

track their progress. The odds of being assigned to the programme group were

50:50 in both provinces with the exception of a 12 month period in New

Brunswick when people were assigned equally to one of three groups: a

programme group, a control group and an SSP-plus group in which participants

were additionally offered job-search assistance and counselling.

Three subsidiary studies were nested in the design: the SSP applicant study,

the SSP-plus study and the “Cliff” study. The first adopted an experimental

design and entailed randomly assigning about 3 000 new applicants for welfare

to a programme group, allowing them to receive the supplement 12 months after

application, or to a control group. The SSP-plus study followed the experience of those

offered SSP-plus, allowing for comparison with both the controls and those

receiving the basic SSP. The cliff study examined the consequences of the

withdrawal of the supplement after three years of receipt. Using administrative

records, it followed the trajectories of 378 people identified in the 54-month

follow-up survey to be approaching the end of their entitlement period. A subsample of 52 participants in this group were recruited to take part in a qualitative

study comprising an initial focus group, followed by three telephone interviews,

one before expiry of eligibility and two, respectively four months and eight

months afterwards.





Case study 1 (cont.)

The evaluation demonstrated that SSP increased employment, earnings

and income and reduced welfare use and poverty: programme group

members received an average of $6 300 more in total income, including

welfare payments, over the 54 month follow-up period. Combining the

supplement with services (SSP-plus) helped people to find more stable

employment than their counterparts in the control group. The social benefits

of SSP outweighed the cost to government.

When randomisation proves impossible, the counterfactual has to be

defined in other ways and numerous creative designs have evolved (Orr, 1998).

These include matched group designs where members of the control group

are chosen to be as similar as possible to members of the action group (Loumidis et al.,

2001; Brice, 1996), cohort designs that seek to exploit situations in which

successive generations follow the same trajectory (Smith et al., 2001),

interrupted time series designs that can be used when statistical series exist

prior to policy implementation and various forms of natural experiment

(Curington, 1994). In addition, analytic strategies have been designed, perhaps

the most promising of which is propensity score matching, which attempt to

define a counterfactual from variation within a non-randomised design.

Indeed, Cullen and Hills (1996, p. 14) argue that quasi-experimental designs

allied with statistical modelling “are the only practical solution to the

unrealisable dream of total randomisation”.
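The matching idea behind these designs can be sketched in a few lines. This toy example matches each participant to the nearest unused non-participant on a single covariate (months on welfare); in a real propensity-score application the matching variable would be an estimated probability of participation, and every record below is invented.

```python
def nearest_neighbour_match(participants, pool, key):
    """Pair each participant with the closest unused comparison case,
    judged by the matching variable extracted by `key`."""
    available = list(pool)
    pairs = []
    for person in participants:
        best = min(available, key=lambda c: abs(key(c) - key(person)))
        available.remove(best)  # match without replacement
        pairs.append((person, best))
    return pairs

# Records: (months on welfare before the policy, months in work afterwards)
participants = [(24, 7), (36, 5), (60, 3)]
pool = [(20, 5), (30, 4), (40, 3), (65, 1), (10, 8)]

pairs = nearest_neighbour_match(participants, pool, key=lambda r: r[0])

# The average outcome gap across matched pairs serves as the impact estimate
impact = sum(p[1] - c[1] for p, c in pairs) / len(pairs)  # 2.0 months
```

The quality of the estimate rests entirely on the matching variable capturing everything that drives both participation and outcomes; that untestable assumption is what random assignment removes.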


Prototypes

The policy prototype addresses a very different question from

programme evaluation. With the decision to proceed to full implementation

already taken in principle, the question asked is “How can we make this policy

work?” The task in the prototype is therefore to fine-tune policy content to

best effect and to determine the optimal mode of implementation. These aims

place less emphasis on measuring outcome and more on understanding the

process of implementation with the result that methodology is both eclectic

and varied, including work-task analyses of the kind undertaken by

operational researchers, and large-scale, multi-method evaluations with

quasi-experimental, summative and formative components. Reliance on

administrative data is also often heavy. Many of the major evaluations, the so-called “pilots”, commissioned by the British Labour government since 1997,

are more accurately called prototypes rather than programme evaluations.

The design of prototypes is often shaped by the speed with which

they are implemented and expected to report and by their closeness to the





Case study 2

Prototype: New Deal for Disabled People – Loumidis et al., 2001

The New Deal for Disabled People personal adviser pilots is quite typical of

the design of the policy evaluations commissioned under the UK Labour

government after 1997. The aim was to determine whether recipients of

Incapacity Benefit and certain other disability benefits would respond to an

invitation to a work-focused interview with a personal adviser, and whether,

with the assistance of personal advisers, more disabled people could secure

and retain paid employment.

Two groups of six pilots were established in a range of different labour

markets, the first group run by the Employment Service and the second by

partnerships of public, private and voluntary organisations. The first set of

pilots was initiated before commissioning of the evaluation. The design did

not allow for randomised assignment since the Personal Adviser Service was

to be made available to all eligible disabled people in the pilot areas.

Therefore, the invitation to tender suggested the establishment of

12 comparison areas. In fact, the Department also had aspirations to generate

base-line statistics against which to assess the impact of any national

implementation. It therefore commissioned a design that included, in

addition to interviews with applicants to the New Deal programme,

interviews with a national sample of disabled people drawn from outside the

pilot areas but stratified according to the same criteria as were used to select

the pilot areas. This national survey was intended to be used to establish the

counterfactual against which the effectiveness of the Personal Advisers is to

be assessed. A comprehensive programme of process evaluation

accompanied the impact analysis.

A critical issue in the design of all policy evaluations is the anticipated size

of any effect. When the anticipated effect is large, sample sizes can be comparatively small. However,

should the actual effect turn out to be much smaller than expected, the

power and precision of a design can be severely tested. In the case of the New

Deal for Disabled People pilots, resource constraints served to limit attainable

sample sizes (approximately 3 000 in total) while the take-up of the scheme

proved to be much lower than expected.

While the New Deal for Disabled People pilots were commissioned

explicitly to inform decisions about the possibility of subsequent national

implementation, it was always intended that such decisions should be taken

halfway through the two-year pilot and before the results of impact analyses

were available. In such circumstances, the advance of policy did not fully

benefit from sophisticated evaluation, or, at least, not in the short-term.





policy process. Interim and repeated reporting is normal and often linked to a

staged rollout of policy, with evaluation results being used to alter

implementation in a way reminiscent of action research. Indeed, in Britain full

policy implementation has often preceded the results of the prototype

evaluation becoming available (Walker, 2001).

Some prototypes in Britain have commenced with very limited

prescription of policy content. Agencies on the ground were charged to

develop this within a resource framework, and descriptive accounts of the

policies that evolved were reported to policymakers with or without a research-based

commentary (Walker, 2000a).

Monitoring and retrospective evaluation

Changing the evaluation question from the future tense to the present,

“Is policy working?” or the past, “Did policy work?”, means that defining a

secure counterfactual is all but impossible for the simple reason that the

policy is already available to everybody as the result of full implementation.1 In

these scenarios, evaluators turn to monitoring and retrospective evaluation.

Monitoring and retrospective evaluation remain popular approaches

despite their obvious limitations. This is partly because they form part of the

normal process of policy audit, which seeks to establish who receives the

service and at what cost, often by reference to administrative information.

Retrospective evaluation may also be triggered by suspicion, often aroused by

monitoring, that the policy is not working well. These modes of evaluation do

not require the same level of institutional commitment to evidence-based

policy making as programme evaluation. They are not, for example, located on

the critical path from policy idea to policy introduction that demands

policymakers rein back their enthusiasm for implementation to await the

outcome of a lengthy evaluation. Monitoring and retrospective evaluation are

also generally cheaper than programme evaluation, but tend to answer

different questions in different ways.

The lack of a secure counterfactual typically means that greater emphasis

is given to resource inputs and their conversion into service provision

(administrative efficiency) than to the contribution of service delivery to

meeting policy objectives (effectiveness). This is the province of operational

researchers, official statisticians and sociologists with an interest in institutions

and policy implementation rather than economists who, especially in the US,

have led the development of programme evaluations. Frequently, the results of

monitoring find their way into the public domain as compendia of discrete

statistics (Cm. 2002) rather than as rounded assessments of particular policies,

which is principally the preserve of retrospective evaluations.





The methodologies employed in retrospective evaluations tend to be

eclectic and adverse circumstances can stimulate creative designs. Pluralistic

approaches are often used in which the experiences and opinions of key

actors in the policy implementation are collated and triangulated to reach an

overall judgement on policy effectiveness. Personal interview surveys may be

used to solicit the views of policy recipients, qualitative interviews conducted

with administrators and other interest groups and observational techniques

used at the point of service delivery. These accounts can provide irrefutable

evidence about the efficiency or lack of efficiency of implementation and

provide a sound basis for reform.

A particular focus is often on targeting, since a prima facie case can

usually be made that if a policy is not reaching its target population it is

unlikely to be particularly effective. Two aspects are important: first, the

proportion of policy recipients who receive services unnecessarily because of

poor policy design, mal-administration or fraud, and secondly, take up: the

proportion of the eligible population that actually receives the service (Knapp,

1984; Walker, 2004). The first can usually be informed by assembling

administrative statistics, especially if judgements about who receives services

“unnecessarily” are made in relation to the programme specification rather

than to policy outcomes (which might require the definition of a

counterfactual). Specification of take-up often poses greater difficulty since

eligible non-claimants are usually invisible to the administration and hard to

track down empirically.
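Both targeting measures reduce to simple ratios once the administrative counts, and a survey-based estimate of the eligible population, are to hand. The figures below are invented for illustration:

```python
# Hypothetical counts for a means-tested benefit
recipients = 800_000             # current caseload (administrative data)
ineligible_recipients = 40_000   # paid through error, mal-administration or fraud
eligible_population = 1_200_000  # estimated from survey data, since eligible
                                 # non-claimants are invisible to the administration

# Share of the caseload receiving the service unnecessarily
leakage = ineligible_recipients / recipients  # 0.05

# Take-up: share of the eligible population actually served
take_up = (recipients - ineligible_recipients) / eligible_population  # ~0.63
```

The asymmetry the text notes shows up in the denominators: leakage needs only administrative data, while take-up depends on an external estimate of the eligible population, which is usually the harder number to obtain.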

Where retrospective evaluations have sought to assess policy impact they

have typically used one or more of three approaches: trend analysis, quasi-experimentation and reportage. At its simplest, the first approach entails

searching for an inflection in a time-series variable that coincides with the

introduction of the policy. If there is confidence that the variable is likely to be

affected by the policy, and an inflection is apparent and in the right direction,

the policy is presumed to have had an effect, with the abruptness of the inflection

indicating the size of the effect. More sophisticated analyses use time-series

regression or other simulation techniques to define a counterfactual by

predicting the trend of the variable in the absence of the policy and comparing

the prediction with the actual trend (White and Riley, 2002). The success of

this approach depends on the reliability of the trend variable, the precision of

the regression predictions and the stability of the relationships before and

after implementation of the policy.
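A minimal version of this trend-based counterfactual fits the pre-policy trend by ordinary least squares, projects it past the implementation date, and reads the policy effect as the gap between projection and observation. All figures below are invented:

```python
def fit_trend(ts, ys):
    """Ordinary least squares fit of y = a + b*t over the pre-policy years."""
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) \
        / sum((t - mean_t) ** 2 for t in ts)
    a = mean_y - b * mean_t
    return a, b

# Poverty rate (%) in the five years before the policy: a steady decline
pre_years = [1, 2, 3, 4, 5]
pre_rates = [20.0, 19.5, 19.0, 18.5, 18.0]

a, b = fit_trend(pre_years, pre_rates)

# Two years after implementation (year 7) the observed rate is 15%
observed = 15.0
counterfactual = a + b * 7                 # trend alone predicts 17.0
policy_effect = observed - counterfactual  # -2.0 percentage points
```

As the text stresses, the estimate is only as reliable as the projected trend: if the pre-policy relationship was unstable, the two-point gap could as easily reflect a change in the economy as a change in policy.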

The second approach depends on identifying an ostensibly similar

group who are not affected by the policy to serve as a counterfactual and

comparing the experience of this group with that of people targeted by the

policy. Hasluck et al. (2000), for example, used mothers in couples as a

counterfactual for lone parents targeted in a welfare to work policy. The


