8 Phylogenetic inference: The contest between likelihood and cladistic parsimony


Common ancestry


trace back to a common ancestor, not whether the data favor (HC)G over

H(CG). It now is time to take up this second type of question.

In the kind of ‘‘classic’’ phylogenetic inference problem I want to

discuss, the observed taxa are assumed to be the tips of a bifurcating tree,

and the goal is to infer just the ‘‘topology’’ of the tree, not the amount of

time between branching events or the amount of evolution that has taken

place on branches, or the character states of interior vertices.[33] Two of the

main methods that biologists now use to solve such problems are maximum likelihood (ML) and maximum parsimony (MP); distance methods

constitute a third approach, which I won’t examine (not that they aren’t

interesting). ML seeks to find the tree topology that confers the highest

probability on the observed characteristics of tip species. MP seeks to find

the tree topology that requires the fewest changes in character state to

produce the characteristics of those tip species. Besides saying what the

‘‘best’’ tree is for a given data set, both methods also provide an ordering

of trees, from best to worst. The two methods sometimes disagree about

this ordering – most vividly, when they disagree about which tree is best

supported by the evidence. For this reason, biologists have had to think

about the methodological conflict between ML and MP; they can’t set it

aside as a merely philosophical dispute of dubious relevance to scientists

in the trenches.
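To make the parsimony criterion concrete, here is a minimal sketch that counts the fewest changes each topology requires. It uses Fitch's algorithm, which is a standard way of doing this count but is not named in the text. The character data are hypothetical, and the outgroup "Out" is an added assumption: with only three taxa and no information about the ancestral state, all three rooted trees would require the same number of changes.

```python
# Hypothetical data: each dict is one binary character (0 = ancestral, 1 = derived).
# "Out" is an assumed outgroup, added only so that the three topologies can differ.

def fitch(tree, states):
    """Return (state_set, changes) for a nested-tuple tree using Fitch's algorithm."""
    if isinstance(tree, str):                 # leaf: observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                                # the two subtrees can agree: no new change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # disagreement costs one change

def tree_length(tree, characters):
    """Total number of changes the tree requires across all characters."""
    return sum(fitch(tree, ch)[1] for ch in characters)

TREES = {
    "(HC)G": ((("H", "C"), "G"), "Out"),
    "H(CG)": (("H", ("C", "G")), "Out"),
    "(HG)C": ((("H", "G"), "C"), "Out"),
}

CHARACTERS = [
    {"H": 1, "C": 1, "G": 0, "Out": 0},
    {"H": 1, "C": 1, "G": 0, "Out": 0},
    {"H": 0, "C": 1, "G": 1, "Out": 0},
    {"H": 1, "C": 1, "G": 1, "Out": 0},
]

lengths = {name: tree_length(tree, CHARACTERS) for name, tree in TREES.items()}
ranking = sorted(lengths, key=lengths.get)    # most parsimonious tree first
print(lengths, ranking)
```

On these made-up characters the method yields an ordering of all three trees, not just a single winner, which is the feature of MP that the comparison with ML relies on.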

The main criticism that has been lodged against ML is that it requires

the adoption of a model of the evolutionary process that one has scant

reason to think is true. ML requires a process model because hypotheses

that specify a tree topology (and nothing more) do not, by themselves,

confer probabilities on the observations. Here we face yet another instance

of the Duhem–Quine thesis, which was a leitmotif in Chapters 2 and 3.

This thesis asserts that theories in science typically do not make predictions about observables all by themselves but need to be supplemented by

auxiliary propositions if they are to do so. As before, we need to give this

thesis a probabilistic twist. From a likelihood point of view, it isn’t

essential that hypotheses about the topology of a phylogenetic tree

deductively entail observational claims about the characteristics of species.[34] What is required is that they confer probabilities on those observations. The problem is that, all by themselves, they do not. In the language of statistics, these genealogical hypotheses are composite, not simple.

[33] The task of reconstructing the character states of the ancestors in a tree that is presumed to be true was discussed in §3.3 and §3.11 in connection with testing selection hypotheses.

[34] In Sober (1988: Chapter 4), I discuss and criticize some attempts to justify phylogenetic parsimony in terms of Popperian ideas about falsification (§2.8).


The main objection that has been made against MP is that parsimony

implicitly assumes this or that dubious proposition about the evolutionary

process. The force of this objection is somewhat unclear, since it is

controversial which propositions the method in fact assumes. Does MP

assume that evolution proceeds parsimoniously? That is, if a lineage starts

with one character state and ends with another, is one obliged to assume

that the lineage got there via a trajectory that involved the smallest possible number of evolutionary changes? This allegation has been strenuously denied by proponents of parsimony (e.g., Farris 1983), some of

whom maintain that parsimony assumes only that there has been descent

with modification.[35]

Which is better – using a method that explicitly makes unrealistic

assumptions or a method whose assumptions are unclear? I will argue that

this unhappy dilemma misrepresents the dialectical situation twice over.

Although ML has usually been implemented in the way described, where

the analysis is carried out by stating a single process model and assuming

that it is true, there is every reason to shift to a model-selection framework

(§1.7) in which multiple process models can be taken into account. This

means that a statistical approach to phylogenetic inference is not stopped

dead by the objection against ML that I just described. With respect to

the criticism of MP, something substantive is known about what parsimony assumes, though the issue of parsimony’s presuppositions has often

been misunderstood.

The debate about ML and MP may seem to be settled by the type of

data one wishes to analyze, the thought being that aligned sequences

require ML and phenotypes require MP. To be sure, ML is often applied

to sequences and rarely to phenotypes (see Lewis 2001 for an exception)

while MP is often applied to morphological data and with increasing

reluctance to sequences. However, this is a sociological fact, not a logical

inevitability. In what follows I’ll try to show that the questions that need

to be answered when ML is applied to sequence data also are central to

the task of applying ML to phenotypes. Symmetrically, MP can be

applied to sequence data just as it can be applied to morphology. In

addition, ML and MP are sometimes equivalent (more on this below), so

it is hard to see how MP can be tied essentially to one type of data and

ML to another.


[35] For discussion of Farris’s argument, see Sober 1988.



















Figure 4.20 Each of the dichotomous traits A and B can experience two changes and each

change can occur on each of the two branches. There are eight parameters (p1, ..., p8) –

one per change, per trait, per branch.

Although ML methods are most familiar in the context of analyzing

sequence data, I want to start discussing that methodology in the context of

models of phenotypic evolution. To get a feeling for the different process

models that might be used, consider two dichotomous traits that evolve on

the two branches of the phylogenetic tree depicted in Figure 4.20. If we

assign a separate parameter to characterize the probability of each change

that might occur in each trait on each branch, there will be eight parameters. We can reduce the number of parameters by introducing constraints;

these constraints require various parameters to have the same value. Here

are three examples:

• A constraint on changes within traits within branches: p1 = p2, p3 = p4, p5 = p6, p7 = p8.

• A constraint on changes across traits within branches: p1 = p3, p2 = p4, p5 = p7, p6 = p8.

• A constraint on changes within traits across branches: p1 = p5, p2 = p6, p3 = p7, p4 = p8.

A very simple model can be constructed by imposing all three of these

constraints; I’ll call this the yes–yes–yes model. This model contains a

single parameter; it rules out biased processes such as natural selection,

since it says that a change from A to –A has the same probability as a

change from –A to A. At the opposite extreme is the ?–?–? model; this is

the eight-parameter model just mentioned. It does not deny the equalities

expressed in the constraints just described; rather, this model simply

declines to assert that they are true (this is why I use three question marks rather than three “no”s to represent this model).

Figure 4.21 Models are more complex the larger the number of adjustable parameters they contain. Arrows represent deductive implication; “M1 → M2” means that if M1 is true, M2 must be true. (Diagram labels: “Simpler and more idealized”; “More complex and more realistic.”)

This model is compatible with drift or selection, and with homogeneity and heterogeneity

between branches and between different traits on the same branch. In

between the one-parameter yes–yes–yes and the eight-parameter ?–?–?,

there are six intermediate models. For example, the yes–yes–? model rules

out natural selection, but it allows that the two branches might experience

different rates of neutral evolution. And the ?–?–yes model allows that

selection is possible, but requires that a given character experience the

same process across branches (be it biased or unbiased). These different

models are related to each other by the relation of logical implication, as

shown in Figure 4.21. The most constrained model is a special case of all

the less constrained models. Removing constraints produces a logically

weaker model.[36] Notice that the two intermediate models described in the

figure, yes–yes–? and ?–?–yes, are not related to each other by the entailment relation; neither is a special case of the other.
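The family of eight models can be enumerated mechanically. The sketch below is an illustration rather than anything given in the text. It assumes the parameter indexing that the three constraint lists suggest (p1, p2 for trait A on branch 1; p3, p4 for trait B on branch 1; p5 to p8 likewise on branch 2), counts the adjustable parameters left by each combination of constraints, and checks which models entail which. Model names use ASCII hyphens.

```python
from itertools import product

# Assumed indexing, inferred from the constraint lists:
# p1,p2 = trait A, branch 1; p3,p4 = trait B, branch 1;
# p5,p6 = trait A, branch 2; p7,p8 = trait B, branch 2.
CONSTRAINTS = [
    [(1, 2), (3, 4), (5, 6), (7, 8)],   # within traits within branches
    [(1, 3), (2, 4), (5, 7), (6, 8)],   # across traits within branches
    [(1, 5), (2, 6), (3, 7), (4, 8)],   # within traits across branches
]

def free_parameters(imposed):
    """Count distinct parameter values left after merging equated parameters."""
    parent = {p: p for p in range(1, 9)}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for flag, pairs in zip(imposed, CONSTRAINTS):
        if flag:
            for a, b in pairs:
                parent[find(a)] = find(b)
    return len({find(p) for p in parent})

def name(imposed):
    """'yes' where a constraint is asserted, '?' where the model is silent."""
    return "-".join("yes" if flag else "?" for flag in imposed)

def implies(m1, m2):
    """M1 entails M2 when M1 asserts every constraint that M2 asserts."""
    return all(a or not b for a, b in zip(m1, m2))

MODELS = list(product([True, False], repeat=3))
COUNTS = {name(m): free_parameters(m) for m in MODELS}
print(COUNTS)
```

Under this indexing yes-yes-yes has one parameter, yes-yes-? has two, ?-?-yes has four, and ?-?-? has eight. `implies` returns False in both directions for yes-yes-? and ?-?-yes, matching the remark that neither is a special case of the other.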

[36] Even with just two characters on two branches, further complications might be introduced. For example, the eight models described all assume that traits on the same branch evolve independently; models that allow for correlated changes within branches would introduce additional adjustable parameters.

Figure 4.22 Two sites (Site 1, Site 2) in two aligned sequences that come from different branches (Branch 1, Branch 2) of a phylogenetic tree.

Although this taxonomy of process models applies to dichotomous phenotypic traits, it easily generalizes to sequence data. Each site in a sequence has one of four possible states (G, A, T, and C). Consider two aligned sequences drawn from different branches of a phylogenetic tree, as shown in Figure 4.22. The models usually used in phylogenetic inference for molecular characters are a small subset of the possibilities. Virtually all

are time reversible; it is assumed that a change from one state to another in

a site on a branch has the same probability as a change in the opposite

direction (Swofford et al. 1996: 433). This excludes selection. And a

change at one site on a branch is assumed to have the same probability as

the same change at a different site on the same branch. However, branches

are allowed to differ; even if a model says that all changes have the same

probability per unit time, it will usually allow that branches have different

durations. Recall from §3.5 that Markov models allow one to compute

the probability that a branch ends in one state, given that it begins in

another; the values of these branch transition probabilities are functions of

the duration of the branch and the instantaneous probabilities of different

changes. A given change will be more probable on a branch that lasts a

long time than it is on a branch that has only a short duration.
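As an illustration of how branch transition probabilities depend on duration, the following sketch uses the standard closed-form solution for the Jukes–Cantor model. The function name and the parameter names `rate` and `duration` are illustrative choices, and `rate` is taken to be the instantaneous rate of each specific change (for example, A to G).

```python
import math

def jc69_probabilities(rate, duration):
    """Jukes-Cantor branch transition probabilities.

    rate: instantaneous rate of each specific change (assumed parameterization).
    duration: branch duration, in the same time units as the rate.
    Returns (P[branch ends in its starting state],
             P[branch ends in one particular different state]).
    """
    decay = math.exp(-4.0 * rate * duration)
    p_same = 0.25 + 0.75 * decay
    p_specific_change = 0.25 - 0.25 * decay
    return p_same, p_specific_change

short_branch = jc69_probabilities(rate=0.01, duration=1.0)
long_branch = jc69_probabilities(rate=0.01, duration=50.0)
print(short_branch, long_branch)
```

With the rate held fixed, the probability of a given change is larger on the longer branch and approaches 1/4 as the duration grows without limit.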

If most of the models of molecular evolution used in phylogenetic

inference ignore selection and assume that a given change on a branch has

the same probability, regardless of which site one considers, how do these

models differ? The Jukes–Cantor (1969) model contains a single

adjustable parameter that represents the (instantaneous) probability of all

change at all sites on all branches. The Kimura (1980) model has two

parameters; it allows transversions and transitions to have different

probabilities.[37] These models assume that the four nucleotides have the

same expected frequencies throughout the tree. The Felsenstein (1981)

model says that all substitutions on all branches have the same probability

but allows that base frequencies may be unequal. All three of these models

are special cases of the general time-reversible (GTR) model (Lanave et al.

1984; Tavaré 1986; Rodriguez et al. 1990). As shown in Figure 4.23, the

relation of logical implication links some of these models to others, just as

was true in Figure 4.21. As before, the two intermediate models, Kimura (1980) and Felsenstein (1981), are not related in this way; neither is a special case of the other.

[37] Changes between A and G and between C and T are transitions; all other changes are transversions.

Figure 4.23 Four models of molecular evolution and their logical relationships (figure adapted from Swofford et al. 1996: 434). (Diagram labels: “Simpler and more idealized”; Jukes–Cantor (1969); Kimura (1980); Felsenstein (1981); “More complex and more realistic.”)

How are these different process models put to work in a likelihood

assessment of phylogenetic hypotheses? Let’s continue to use the example

of humans, chimps, and gorillas. Assuming that the tree must be strictly

bifurcating (i.e., that it contains no reticulations or polytomies), there are

three possible rooted trees: (HC)G, H(CG), and (HG)C. As noted earlier,

none of these, by itself, confers a probability on the characteristics we

observe. However, the same is true if we conjoin one of these genealogical

hypotheses with one or another of the process models just described. The

reason is that each process model contains at least one adjustable parameter. Until values for adjustable parameters are specified, we cannot

talk about the probability of the data under different hypotheses. In short,

the propositions that have well-defined likelihoods take the form of a

conjunction that contains three conjuncts:

Tree topology & process model & specified values for the parameters in

the model.

The parameters that describe the probabilities of different changes are

examples of what statisticians call nuisance parameters. The reason for this

name is not that biologists never take an interest in the values of these

parameters; rather, the point is that when we are interested in comparing

the likelihoods of different tree topologies, we are forced to deal with

questions about the evolutionary process even though these are not the

focus of our inquiry. Naturally, what is a nuisance parameter in one

problem may be the subject of interest in another. Our present concern is

testing tree topologies against each other; in Chapter 3, we considered

different process models (for example, selection versus drift). In that

setting, the tree topology might be thought of as a nuisance parameter.

To assess the likelihood of a three-conjunct conjunction that has the

form just described, we first need to recall the very different approaches

that Bayesianism and the Neyman–Pearson theory take to the problem of

handling nuisance parameters (§1.3, §1.5). For a Bayesian, the likelihood

of a tree topology is an average. There are many different process models

that might be true and many different values that the parameters in a

given model might have. The likelihood of (HC)G reflects all of these:

$$\Pr[\text{data} \mid (HC)G] = \sum_i \sum_j \Pr[\text{data} \mid (HC)G \ \&\ \text{Model } i \ \&\ \text{parameter values } j] \times \Pr[\text{Model } i \ \&\ \text{parameter values } j \mid (HC)G].$$

If a given model M were known to be true (or if this is an assumption

whose consequences one wishes to explore), this summation would simplify to


$$\Pr[\text{data} \mid (HC)G] = \sum_j \Pr_M[\text{data} \mid (HC)G \ \&\ \text{parameter values } j] \times \Pr_M[\text{parameter values } j \mid (HC)G].$$

The subscript M on the probability function means that the probabilities

are all assigned on the assumption that model M is true. The same sort of

averaging would have to be undertaken for the other topologies under

consideration, and then the average likelihoods of the different topologies

could be compared.

Given a process model M that one is prepared to regard as true, the

Neyman–Pearson method of handling nuisance parameters is very different. One isn’t interested in averaging over all possible values; rather,

one looks at the single setting of those parameters that makes the data

most probable. For the topology (HC)G, the quantity of interest is

$$\Pr\{\text{data} \mid L[(HC)G \ \&\ \text{model } M]\}.$$

Here ‘‘L[(HC)G & model M]’’ denotes the likeliest member of [(HC)G &

model M]. The values of the parameters in M that maximize the likelihood

of (HC)G need not be the same as the ones that maximize the likelihood of

other topologies.
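The contrast between averaging and maximizing over nuisance parameters can be shown with a toy calculation. Everything numerical below is hypothetical: a three-point grid of parameter values within a single model M, made-up likelihoods at each grid point, and a made-up prior that is assumed, for simplicity, to be the same for both topologies.

```python
# Hypothetical values of Pr_M[data | topology & parameter value j], j = 0, 1, 2.
LIKELIHOODS = {
    "(HC)G": [0.010, 0.040, 0.020],
    "H(CG)": [0.030, 0.015, 0.030],
}
# Hypothetical prior Pr_M[parameter value j | topology], assumed equal across topologies.
PRIOR = [0.25, 0.50, 0.25]

def average_likelihood(values, prior):
    """Bayesian treatment: prior-weighted average over the parameter values."""
    return sum(v * w for v, w in zip(values, prior))

def maximized_likelihood(values):
    """Frequentist treatment: likelihood at the best-fitting parameter value."""
    return max(values)

for topology, values in LIKELIHOODS.items():
    best_j = values.index(max(values))
    print(topology, average_likelihood(values, PRIOR),
          maximized_likelihood(values), best_j)
```

In this toy case the parameter value that maximizes the likelihood of (HC)G (j = 1) differs from the one that does so for H(CG) (j = 0), which illustrates the last point in the paragraph above.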

Jukes–Cantor        L[(HC)G & Jukes–Cantor]    L[H(CG) & Jukes–Cantor]    L[(HG)C & Jukes–Cantor]
Felsenstein 1981    L[(HC)G & F]               L[H(CG) & F]               L[(HG)C & F]
Kimura 1980         L[(HC)G & K]               L[H(CG) & K]               L[(HG)C & K]
GTR                 L[(HC)G & GTR]             L[H(CG) & GTR]             L[(HG)C & GTR]

Figure 4.24 Conjunctions of the form “tree topology & process model” containing adjustable parameters; these are nuisance parameters in the context of making inferences about topologies. Frequentists set these at their maximum likelihood values, denoted by “L(process model & tree topology).”

Most statistical work in phylogenetic inference has been carried out

within a frequentist, not a Bayesian, framework. The usual practice has

been to adopt a single model of the evolutionary process and then

compare topologies with the parameters in the model set at their maximum likelihood values. This amounts to making ‘‘horizontal’’ comparisons within a single row in Figure 4.24. When the trees are ‘‘specified

in advance,’’ biologists frequently seek to determine which of the three

conjunctions has the highest likelihood, thus bypassing questions about

which hypothesis is the null and what value of α should be chosen.

However, other procedures (e.g., the SOWH test; see Felsenstein 2004:

371–2 for discussion) are sometimes used when the ML tree is compared

with one that is less likely; here, the ML tree is regarded as the null

hypothesis, and the question is whether an alternative tree is significantly

less likely than it. We see here a pattern that often arises in frequentist

practice; the statistical procedure is not determined by logical and

mathematical relationships among data, hypotheses, and background

assumptions but involves facts about what goes on in the mind of the

investigator (recall the discussion of stopping rules in §1.6). Your treatment of (HC)G, H(CG), and (HG)C depends on whether you design your

test before gathering data or design the test already knowing that (HC)G

is the most likely tree. Bayesians and likelihoodists find it hard to

understand why this difference should make a difference.

Frequentists also make vertical comparisons in Figure 4.23; here, you

are not testing a topology; rather, you are testing different process models

against each other, given the assumption that some topology is true. The

typical procedure is to use the likelihood ratio test. The question is not

which conjunction has the higher likelihood; we know in advance that

models with a larger number of adjustable parameters will fit the data

better, so likelihoods must increase as one moves from the Jukes–Cantor

model to Kimura (1980) and then to GTR. Rather, the question is

whether the likelihood of a more complex model is sufficiently greater than

the likelihood of a simpler model to justify rejecting the simpler model.

As noted in §1.5, this methodology has a frequentist justification only for

nested models. It is possible to compare each of [(HC)G & Felsenstein]

and [(HC)G & Kimura] with [(HC)G & Jukes–Cantor], but one can’t

compare the first two with each other. Another property of the likelihood

ratio test is that it can yield different answers depending on whether one

starts with the simplest model and works up or starts with the most

complex model and works down (§1.5).
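A likelihood ratio test between two nested models can be sketched as follows. The log-likelihoods are hypothetical, and the test uses the usual asymptotic chi-square approximation, which is an assumption here. The function covers only the case where the models differ by one adjustable parameter (as Jukes–Cantor and Kimura do on the text's count), because the chi-square tail with one degree of freedom can be written with the standard library's `math.erfc`.

```python
import math

def likelihood_ratio_test(lnL_simple, lnL_complex, alpha=0.05):
    """LRT for nested models that differ by ONE parameter (chi-square, 1 df).

    Returns (test statistic, approximate p-value, whether to reject the simpler model).
    """
    statistic = 2.0 * (lnL_complex - lnL_simple)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value, p_value < alpha

# Hypothetical maximized log-likelihoods for [(HC)G & Jukes-Cantor] vs [(HC)G & Kimura].
big_gain = likelihood_ratio_test(lnL_simple=-1520.4, lnL_complex=-1512.1)
small_gain = likelihood_ratio_test(lnL_simple=-1520.4, lnL_complex=-1519.9)
print(big_gain, small_gain)
```

The more complex model fits at least as well in both cases; the test asks only whether the improvement is large enough, at the chosen α, to justify rejecting the simpler model.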

These limitations of the Neyman–Pearson theory suggest that it may

make sense to place phylogenetic inference within a model-selection

framework.[38] In using AIC, or some other model-selection criterion, one

obtains an ordered list, from best to worst, of conjunctive hypotheses,

each of which has the form ‘‘genealogical hypothesis & process model.’’

The Duhemian point continues to apply: In the first instance, what one

is testing are the different conjunctions, not the genealogical hypotheses

taken on their own. Still, one can reach inside these conjunctions and

examine the conjuncts of interest in the following way. Suppose (HC)G

is the genealogical hypothesis that figures in the first five, or the first ten,

or the first fifteen conjunctions at the top of the list. The larger this

group of conjunctions is, the more we are entitled to conclude that the

data favor (HC)G. In this case, (HC)G is robust across variation in

process model, and the more robust the better. But suppose that (HC)G

appears in the first, but not the second, of the conjunctions on this list

and then appears in the third through twentieth entries. Since AIC

provides a quantitative score for each conjunction, and not just an

ordering of conjunctions, one can ask what the average effect is of shifting

from one tree topology to another, within each of several process models.

For example, perhaps AIC scores are on average improved by moving

from H(CG) to (HC)G.
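The ranking and averaging just described might be sketched like this. The maximized log-likelihoods are hypothetical, and the parameter counts follow the counts given earlier for Jukes–Cantor (one) and Kimura (two). The score uses the common convention AIC = 2k − 2 ln L, on which lower is better; that convention is an assumption, and other scalings of AIC reverse the direction.

```python
# Hypothetical fits: (maximized log-likelihood, number of adjustable parameters).
FITS = {
    ("(HC)G", "Jukes-Cantor"): (-1510.0, 1),
    ("H(CG)", "Jukes-Cantor"): (-1514.0, 1),
    ("(HG)C", "Jukes-Cantor"): (-1516.0, 1),
    ("(HC)G", "Kimura"): (-1505.0, 2),
    ("H(CG)", "Kimura"): (-1511.0, 2),
    ("(HG)C", "Kimura"): (-1512.5, 2),
}

def aic(lnL, k):
    """AIC on the 2k - 2 ln L convention (lower is better)."""
    return 2 * k - 2 * lnL

SCORES = {key: aic(*fit) for key, fit in FITS.items()}
RANKED = sorted(SCORES, key=SCORES.get)        # best conjunction first

def mean_aic_gain(better, worse):
    """Average of AIC(worse) - AIC(better) within each process model."""
    models = sorted({model for _, model in SCORES})
    diffs = [SCORES[(worse, m)] - SCORES[(better, m)] for m in models]
    return sum(diffs) / len(diffs)

print(RANKED)
print(mean_aic_gain("(HC)G", "H(CG)"))
```

Here (HC)G occupies the top two places in the list, and moving from H(CG) to (HC)G improves the score by 10 units on average across the two process models.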

[38] Kishino and Hasegawa (1990) applied AIC to choice between tree topologies; see Posada and Crandall (2001) and Posada and Buckley (2004) for further discussion.

What resources does Bayesianism have for testing tree topologies against each other across a range of possible process models? Just like frequentist work on phylogenetic inference, most Bayesian analyses have opted for a single process model and then compared topologies within the context of that one model; what makes the work distinctly Bayesian is that a prior distribution is employed for the values that the nuisance parameters in the model might have. But Bayesians also have started to consider multiple process models within a model-selection framework (see, for

example, Huelsenbeck et al. 2004). If one topology has a higher average

likelihood than another for each of the process models one has considered, this shows that the result is robust; it does not depend on which of

these process models one chooses. And if unanimity across models fails,

the fact that BIC provides quantitative values for the average likelihoods

of different conjunctions, and not just an ordering, becomes important.

BIC can be used to evaluate the average likelihoods of conjunctions of the

form (tree topology & process model M), and one can see what the

average effect is of shifting from one tree topology to another across each

of several process models.

Model-selection theory, whether it is Akaikean or Bayesian, provides

the resources for statistically testing tree topologies against each other

without requiring one to decide in advance which process model is true.

Choosing between the two approaches requires one to consider the different goals that AIC and BIC have and involves the questions surveyed in

Chapter 1 concerning whether various assumptions that go into the two

procedures are reasonable. I won’t repeat those points here, but I want to

recall one theme. If the process models one is considering all contain

idealizations, all are false, so there won’t be much point to asking which of

them has the highest probability of being true. A better paradigm is the

goal of estimating predictive accuracy, of finding fitted models that are

close to the truth.

What does cladistic parsimony assume about the evolutionary process?

What does the word ‘‘assume’’ mean in the question that is the title of this

section? An example from outside science provides some guidance.

Consider the two sentences

(P) Jones is poor but honest.

(A) There is a conflict between being poor and being honest.

I hope it is clear that P assumes that A is true, but that A does not assume

that P is true. Notice that P entails A – that is, if P is true, then A must

also be true. However, A does not entail P; if there is a conflict between

poverty and honesty, this says nothing about Jones and the characteristics

he happens to have. This example points to a general fact about what it

means to talk about the assumptions of a proposition:

If P assumes A, then P entails A.

To find out what a proposition assumes, you must look for conditions

that are necessary for the proposition to be true, not for conditions that

suffice for the proposition’s truth.

When are likelihood and parsimony ordinally equivalent?

Given this clarification of what an assumption is, we can turn to the

question of what it means to talk about the assumptions that are involved

in using cladistic parsimony to infer tree topologies. What parsimony

assumes about the evolutionary process are the propositions that must be

true if parsimony is to be a legitimate method of phylogenetic inference.

But what does ‘‘legitimate’’ mean? There are a number of choices to

consider. For example, one might demand that a legitimate phylogenetic

method be statistically consistent – that it converge on the true phylogeny

as the number of observations is made large without limit. We will

consider this idea in the next section. Another interpretation – the one I

want to explore now – maintains that parsimony is a legitimate method

precisely when it is ordinally equivalent with likelihood. This idea is easy

to understand by considering the Fahrenheit and Centigrade scales of

temperature. These are ordinally equivalent, meaning that for any two

objects, the first has a higher temperature in Fahrenheit than the second,

precisely when the first has a higher temperature in Centigrade than

the second. The two scales induce the same ordering of objects. For

parsimony and likelihood to be ordinally equivalent, the requirement

is that


For any phylogenetic hypotheses H1 and H2, and for any data set D, H1 provides a more parsimonious explanation of D than H2 does precisely when Pr_M(D | H1) > Pr_M(D | H2).

The subscript M in the likelihood terms is a reminder of the Duhemian

point that phylogenetic hypotheses do not confer probabilities on data,

save in the context of a process model. In fact, it is misleading to talk of

parsimony and ‘‘likelihood’’ being, or failing to be, ordinally equivalent.

Rather, the question is whether likelihood when implemented by an

assumed process model M is or is not ordinally equivalent with parsimony.
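Ordinal equivalence can be checked mechanically for any finite set of items. The sketch below tests whether two scoring functions order every pair the same way, first on the temperature example and then on hypothetical tree scores. Note the limitation: the requirement above quantifies over all data sets, whereas the code examines one made-up data set, so a True result here shows agreement on that data set only. Since parsimony prefers fewer changes, the change count is negated before comparison.

```python
from itertools import combinations

def ordinally_equivalent(items, score_a, score_b):
    """True when score_a and score_b order every pair of items the same way."""
    for x, y in combinations(items, 2):
        if (score_a(x) > score_a(y)) != (score_b(x) > score_b(y)):
            return False
        if (score_a(x) < score_a(y)) != (score_b(x) < score_b(y)):
            return False
    return True

def fahrenheit_of(celsius):
    return celsius * 9.0 / 5.0 + 32.0

CELSIUS = [-40.0, 0.0, 37.0, 100.0]
temperature_ok = ordinally_equivalent(CELSIUS, lambda c: c, fahrenheit_of)

# Hypothetical scores for one data set under one assumed process model M.
CHANGES = {"(HC)G": 5, "H(CG)": 6, "(HG)C": 7}              # parsimony: fewer is better
LOG_LIKELIHOOD = {"(HC)G": -1505.0, "H(CG)": -1511.0, "(HG)C": -1512.5}
REVERSED_LNL = {"(HC)G": -1512.5, "H(CG)": -1511.0, "(HG)C": -1505.0}

trees = list(CHANGES)
agree = ordinally_equivalent(trees, lambda t: -CHANGES[t], LOG_LIKELIHOOD.get)
disagree = ordinally_equivalent(trees, lambda t: -CHANGES[t], REVERSED_LNL.get)
print(temperature_ok, agree, disagree)
```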
