8 Phylogenetic inference: The contest between likelihood and cladistic parsimony
Common ancestry
trace back to a common ancestor, not whether the data favor (HC)G over
H(CG). It now is time to take up this second type of question.
In the kind of ‘‘classic’’ phylogenetic inference problem I want to
discuss, the observed taxa are assumed to be the tips of a bifurcating tree,
and the goal is to infer just the ‘‘topology’’ of the tree, not the amount of
time between branching events or the amount of evolution that has taken
place on branches, or the character states of interior vertices.33 Two of the
main methods that biologists now use to solve such problems are maximum likelihood (ML) and maximum parsimony (MP); distance methods
constitute a third approach, which I won’t examine (not that they aren’t
interesting). ML seeks to find the tree topology that confers the highest
probability on the observed characteristics of tip species. MP seeks to find
the tree topology that requires the fewest changes in character state to
produce the characteristics of those tip species. Besides saying what the
‘‘best’’ tree is for a given data set, both methods also provide an ordering
of trees, from best to worst. The two methods sometimes disagree about
this ordering – most vividly, when they disagree about which tree is best
supported by the evidence. For this reason, biologists have had to think
about the methodological conflict between ML and MP; they can’t set it
aside as a merely philosophical dispute of dubious relevance to scientists
in the trenches.
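The parsimony criterion just described can be made concrete with a small sketch. It assumes binary characters, a hypothetical outgroup taxon "O" that fixes the direction of change, and Fitch's counting procedure for unordered character states; the character matrix is invented purely for illustration and none of these details come from the text.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes)
    for one character on a rooted binary tree given as nested tuples."""
    if isinstance(tree, str):              # a tip: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                             # the two subtrees can agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Three ingroup topologies, each rooted with the hypothetical outgroup "O".
TREES = {
    "(HC)G": ((("H", "C"), "G"), "O"),
    "H(CG)": (("H", ("C", "G")), "O"),
    "(HG)C": ((("H", "G"), "C"), "O"),
}

# Invented character matrix (1 = derived state, 0 = ancestral state).
CHARACTERS = [
    {"H": 1, "C": 1, "G": 0, "O": 0},
    {"H": 1, "C": 1, "G": 0, "O": 0},
    {"H": 0, "C": 1, "G": 1, "O": 0},
    {"H": 1, "C": 1, "G": 1, "O": 0},
]

def parsimony_score(tree, characters):
    """Total number of changes the tree requires across all characters."""
    return sum(fitch(tree, character)[1] for character in characters)

scores = {name: parsimony_score(tree, CHARACTERS) for name, tree in TREES.items()}
ranking = sorted(scores, key=scores.get)   # fewest required changes first
```

With these invented data the sketch yields an ordering of all three trees, not just a winner, which is the feature of MP that the text emphasizes.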
The main criticism that has been lodged against ML is that it requires
the adoption of a model of the evolutionary process that one has scant
reason to think is true. ML requires a process model because hypotheses
that specify a tree topology (and nothing more) do not, by themselves,
confer probabilities on the observations. Here we face yet another instance
of the Duhem–Quine thesis, which was a leitmotif in Chapters 2 and 3.
This thesis asserts that theories in science typically do not make predictions about observables all by themselves but need to be supplemented by
auxiliary propositions if they are to do so. As before, we need to give this
thesis a probabilistic twist. From a likelihood point of view, it isn’t
essential that hypotheses about the topology of a phylogenetic tree
deductively entail observational claims about the characteristics of
species.34 What is required is that they confer probabilities on those
observations. The problem is that, all by themselves, they do not. In the
language of statistics, these genealogical hypotheses are composite, not
simple.
33 The task of reconstructing the character states of the ancestors in a tree
that is presumed to be true was discussed in §3.3 and §3.11 in connection
with testing selection hypotheses.
34 In Sober (1988: Chapter 4), I discuss and criticize some attempts to
justify phylogenetic parsimony in terms of Popperian ideas about
falsification (§2.8).
The main objection that has been made against MP is that parsimony
implicitly assumes this or that dubious proposition about the evolutionary
process. The force of this objection is somewhat unclear, since it is
controversial which propositions the method in fact assumes. Does MP
assume that evolution proceeds parsimoniously? That is, if a lineage starts
with one character state and ends with another, is one obliged to assume
that the lineage got there via a trajectory that involved the smallest possible number of evolutionary changes? This allegation has been strenuously denied by proponents of parsimony (e.g., Farris 1983), some of
whom maintain that parsimony assumes only that there has been descent
with modification.35
Which is better – using a method that explicitly makes unrealistic
assumptions or a method whose assumptions are unclear? I will argue that
this unhappy dilemma misrepresents the dialectical situation twice over.
Although ML has usually been implemented in the way described, where
the analysis is carried out by stating a single process model and assuming
that it is true, there is every reason to shift to a model-selection framework
(§1.7) in which multiple process models can be taken into account. This
means that a statistical approach to phylogenetic inference is not stopped
dead by the objection against ML that I just described. With respect to
the criticism of MP, something substantive is known about what parsimony assumes, though the issue of parsimony’s presuppositions has often
been misunderstood.
The debate about ML and MP may seem to be settled by the type of
data one wishes to analyze, the thought being that aligned sequences
require ML and phenotypes require MP. To be sure, ML is often applied
to sequences and rarely to phenotypes (see Lewis 2001 for an exception)
while MP is often applied to morphological data and with increasing
reluctance to sequences. However, this is a sociological fact, not a logical
inevitability. In what follows I’ll try to show that the questions that need
to be answered when ML is applied to sequence data also are central to
the task of applying ML to phenotypes. Symmetrically, MP can be
applied to sequence data just as it can be applied to morphology. In
addition, ML and MP are sometimes equivalent (more on this below), so
it is hard to see how MP can be tied essentially to one type of data and
ML to another.
35 For discussion of Farris’s argument, see Sober 1988.
[Figure 4.20: diagram of two branches; arrows numbered 1–8 mark the changes
between A and –A and between B and –B on each branch.]
Figure 4.20 Each of the dichotomous traits A and B can experience two changes and each
change can occur on each of the two branches. There are eight parameters (p1, ..., p8) –
one per change, per trait, per branch.
Although ML methods are most familiar in the context of analyzing
sequence data, I want to start discussing that methodology in the context of
models of phenotypic evolution. To get a feeling for the different process
models that might be used, consider two dichotomous traits that evolve on
the two branches of the phylogenetic tree depicted in Figure 4.20. If we
assign a separate parameter to characterize the probability of each change
that might occur in each trait on each branch, there will be eight parameters. We can reduce the number of parameters by introducing constraints;
these constraints require various parameters to have the same value. Here
are three examples:
A constraint on changes within traits within branches: p1 = p2, p3 = p4,
p5 = p6, p7 = p8.
A constraint on changes across traits within branches: p1 = p3, p2 = p4,
p5 = p7, p6 = p8.
A constraint on changes within traits across branches: p1 = p5, p2 = p6,
p3 = p7, p4 = p8.
A very simple model can be constructed by imposing all three of these
constraints; I’ll call this the yes–yes–yes model. This model contains a
single parameter; it rules out biased processes such as natural selection,
since it says that a change from A to –A has the same probability as a
change from –A to A. At the opposite extreme is the ?–?–? model; this is
the eight-parameter model just mentioned. It does not deny the equalities
expressed in the constraints just described; rather, this model simply
declines to assert that they are true (this is why I use three question marks
rather than three ‘‘no’’s to represent this model). This model is compatible
with drift or selection, and with homogeneity and heterogeneity
between branches and between different traits on the same branch. In
between the one-parameter yes–yes–yes and the eight-parameter ?–?–?,
there are six intermediate models. For example, the yes–yes–? model rules
out natural selection, but it allows that the two branches might experience
different rates of neutral evolution. And the ?–?–yes model allows that
selection is possible, but requires that a given character experience the
same process across branches (be it biased or unbiased). These different
models are related to each other by the relation of logical implication, as
shown in Figure 4.21. The most constrained model is a special case of all
the less constrained models. Removing constraints produces a logically
weaker model.36 Notice that the two intermediate models described in the
figure, yes–yes–? and ?–?–yes, are not related to each other by the
entailment relation; neither is a special case of the other.
[Figure 4.21: diagram. yes–yes–yes (simpler and more idealized) → yes–yes–?
and ?–?–yes → ?–?–? (more complex and more realistic).]
Figure 4.21 Models are more complex the larger the number of adjustable parameters
they contain. Arrows represent deductive implication; ‘‘M1 → M2’’ means that if M1 is
true, M2 must be true.
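The family of eight models and their entailment relations can be enumerated mechanically. The following is a minimal sketch: it encodes each of the three constraints as pairs of parameter indices that must be equal, counts how many adjustable parameters survive, and tests whether one model is a special case of another. The function names and the boolean encoding (True for ‘‘yes’’, False for ‘‘?’’) are illustrative choices, not anything given in the text.

```python
from itertools import product

# Each constraint equates pairs of the eight parameters p1..p8.
CONSTRAINTS = (
    [(1, 2), (3, 4), (5, 6), (7, 8)],   # within traits, within branches
    [(1, 3), (2, 4), (5, 7), (6, 8)],   # across traits, within branches
    [(1, 5), (2, 6), (3, 7), (4, 8)],   # within traits, across branches
)

def model_name(flags):
    """(True, True, False) -> 'yes–yes–?'."""
    return "–".join("yes" if flag else "?" for flag in flags)

def free_parameters(flags):
    """Count distinct parameter values left after imposing the chosen constraints."""
    parent = {p: p for p in range(1, 9)}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for imposed, pairs in zip(flags, CONSTRAINTS):
        if imposed:
            for a, b in pairs:
                parent[find(a)] = find(b)   # merge the two equality classes
    return len({find(p) for p in parent})

def entails(m1, m2):
    """m1 is a special case of m2 when m1 asserts every constraint that m2 asserts."""
    return all(a or not b for a, b in zip(m1, m2))

MODELS = list(product([True, False], repeat=3))
counts = {model_name(m): free_parameters(m) for m in MODELS}
```

Under this encoding the most constrained model comes out with one parameter and the least constrained with eight, and `entails` returns False in both directions for yes–yes–? and ?–?–yes.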
Although this taxonomy of process models applies to dichotomous
phenotypic traits, it easily generalizes to sequence data. Each site in a
sequence has one of four possible states (G, A, T, and C). Consider two
aligned sequences drawn from different branches of a phylogenetic tree, as
shown in Figure 4.22. The models usually used in phylogenetic inference
for molecular characters are a small subset of the possibilities. Virtually all
are time reversible; it is assumed that a change from one state to another in
a site on a branch has the same probability as a change in the opposite
direction (Swofford et al. 1996: 433). This excludes selection. And a
change at one site on a branch is assumed to have the same probability as
the same change at a different site on the same branch. However, branches
are allowed to differ; even if a model says that all changes have the same
probability per unit time, it will usually allow that branches have different
durations. Recall from §3.5 that Markov models allow one to compute
the probability that a branch ends in one state, given that it begins in
another; the values of these branch transition probabilities are functions of
the duration of the branch and the instantaneous probabilities of different
changes. A given change will be more probable on a branch that lasts a
long time than it is on a branch that has only a short duration.
36 Even with just two characters on two branches, further complications
might be introduced. For example, the eight models described all assume
that traits on the same branch evolve independently; models that allow for
correlated changes within branches would introduce additional adjustable
parameters.
[Figure 4.22: two-by-two grid with columns Site 1 and Site 2 and rows
Branch 1 and Branch 2.]
Figure 4.22 Two sites in two aligned sequences that come from different branches of a
phylogenetic tree.
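The dependence of branch transition probabilities on branch duration can be illustrated with the simplest four-state case, in which every change has the same instantaneous rate (the Jukes–Cantor form). This is a sketch only: the parameterization, with `alpha` as the rate of change to each particular other nucleotide, is an assumption, and the numerical values are arbitrary.

```python
import math

def jc_transition_probs(alpha, t):
    """Branch transition probabilities for a four-state model in which all
    changes share one instantaneous rate.

    alpha: assumed rate of change to each specific other nucleotide
    t:     duration of the branch
    Returns (probability of ending in the starting state,
             probability of ending in one particular different state).
    """
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay
    p_each_other = 0.25 - 0.25 * decay
    return p_same, p_each_other

short_branch = jc_transition_probs(alpha=0.1, t=1.0)
long_branch = jc_transition_probs(alpha=0.1, t=10.0)
```

The probability of any particular change rises with branch duration and levels off at 1/4, the point at which the end state carries no information about the starting state.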
If most of the models of molecular evolution used in phylogenetic
inference ignore selection and assume that a given change on a branch has
the same probability, regardless of which site one considers, how do these
models differ? The Jukes–Cantor (1969) model contains a single
adjustable parameter that represents the (instantaneous) probability of all
changes at all sites on all branches. The Kimura (1980) model has two
parameters; it allows transversions and transitions to have different
probabilities.37 These models assume that the four nucleotides have the
same expected frequencies throughout the tree. The Felsenstein (1981)
model says that all substitutions on all branches have the same probability
but allows that base frequencies may be unequal. All three of these models
are special cases of the general time-reversible (GTR) model (Lanave et al.
1984; Tavaré 1986; Rodriguez et al. 1990). As shown in Figure 4.23, the
relation of logical implication links some of these models to others, just as
was true in Figure 4.21. As before, the two intermediate models, Kimura
(1980) and Felsenstein (1981), are not related in this way; neither is a
special case of the other.
37 Changes between A and G and between C and T are transitions; all other
changes are transversions.
[Figure 4.23: diagram. Jukes–Cantor (1969) (simpler and more idealized) →
Kimura (1980) and Felsenstein (1981) → GTR (more complex and more realistic).]
Figure 4.23 Four models of molecular evolution and their logical relationships (figure
adapted from Swofford et al. 1996: 434).
How are these different process models put to work in a likelihood
assessment of phylogenetic hypotheses? Let’s continue to use the example
of humans, chimps, and gorillas. Assuming that the tree must be strictly
bifurcating (i.e., that it contains no reticulations or polytomies), there are
three possible rooted trees: (HC)G, H(CG), and (HG)C. As noted earlier,
none of these, by itself, confers a probability on the characteristics we
observe. However, the same is true if we conjoin one of these genealogical
hypotheses with one or another of the process models just described. The
reason is that each process model contains at least one adjustable parameter. Until values for adjustable parameters are specified, we cannot
talk about the probability of the data under different hypotheses. In short,
the propositions that have well-defined likelihoods take the form of a
conjunction that contains three conjuncts:
Tree topology & process model & specified values for the parameters in
the model.
The parameters that describe the probabilities of different changes are
examples of what statisticians call nuisance parameters. The reason for this
name is not that biologists never take an interest in the values of these
parameters; rather, the point is that when we are interested in comparing
the likelihoods of different tree topologies, we are forced to deal with
questions about the evolutionary process even though these are not the
focus of our inquiry. Naturally, what is a nuisance parameter in one
problem may be the subject of interest in another. Our present concern is
testing tree topologies against each other; in Chapter 3, we considered
different process models (for example, selection versus drift). In that
setting, the tree topology might be thought of as a nuisance parameter.
To assess the likelihood of a three-conjunct conjunction that has the
form just described, we first need to recall the very different approaches
that Bayesianism and the Neyman–Pearson theory take to the problem of
handling nuisance parameters (§1.3, §1.5). For a Bayesian, the likelihood
of a tree topology is an average. There are many different process models
that might be true and many different values that the parameters in a
given model might have. The likelihood of (HC)G reflects all of these:
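\[
\mathrm{Pr}[\text{data} \mid (HC)G]
  = \sum_{i,j} \mathrm{Pr}[\text{data} \mid (HC)G \;\&\; \text{model } i \;\&\; \text{parameter values } j]
    \cdot \mathrm{Pr}[\text{model } i \;\&\; \text{parameter values } j \mid (HC)G].
\]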
If a given model M were known to be true (or if this is an assumption
whose consequences one wishes to explore), this summation would simplify to
\[
\mathrm{Pr}[\text{data} \mid (HC)G]
  = \sum_{j} \mathrm{Pr}_M[\text{data} \mid (HC)G \;\&\; \text{parameter values } j]
    \cdot \mathrm{Pr}_M[\text{parameter values } j \mid (HC)G].
\]
The subscript M on the probability function means that the probabilities
are all assigned on the assumption that model M is true. The same sort of
averaging would have to be undertaken for the other topologies under
consideration, and then the average likelihoods of the different topologies
could be compared.
Given a process model M that one is prepared to regard as true, the
Neyman–Pearson method of handling nuisance parameters is very different. One isn’t interested in averaging over all possible values; rather,
one looks at the single setting of those parameters that makes the data
most probable. For the topology (HC)G, the quantity of interest is
\[
\mathrm{Pr}\{\text{data} \mid L[(HC)G \;\&\; \text{model } M]\}.
\]
Here ‘‘L[(HC)G & model M]’’ denotes the likeliest member of [(HC)G &
model M]. The values of the parameters in M that maximize the likelihood
of (HC)G need not be the same as the ones that maximize the likelihood of
other topologies.
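The contrast between averaging over a nuisance parameter and maximizing over it can be sketched with a deliberately tiny model. Everything here is a toy assumption made for illustration: binary characters, a uniform root state, one change probability `p` shared by all four branches of a rooted three-taxon tree, a uniform prior over a grid of `p` values, and an invented data set.

```python
from itertools import product

def site_probability(topology, pattern, p):
    """Probability of one binary site pattern (H, C, G) on a rooted
    three-taxon tree, summing over the unobserved root and interior states."""
    h, c, g = pattern
    # The grouped pair hangs from the interior node; the third tip hangs from the root.
    pair, single = {
        "(HC)G": ((h, c), g),
        "H(CG)": ((c, g), h),
        "(HG)C": ((h, g), c),
    }[topology]

    def step(start, end):                  # probability of one branch's outcome
        return p if start != end else 1.0 - p

    total = 0.0
    for root, inner in product((0, 1), repeat=2):
        total += (0.5 * step(root, single) * step(root, inner)
                  * step(inner, pair[0]) * step(inner, pair[1]))
    return total

def likelihood(topology, data, p):
    """Likelihood of 'topology & toy model & p', treating sites as independent."""
    result = 1.0
    for pattern in data:
        result *= site_probability(topology, pattern, p)
    return result

# Invented data: 5 sites grouping H with C, 10 constant sites, 2 grouping C with G.
DATA = [(1, 1, 0)] * 5 + [(0, 0, 0)] * 10 + [(0, 1, 1)] * 2
GRID = [i / 100 for i in range(1, 50)]     # candidate values of the nuisance parameter

def averaged(topology):
    """Bayesian-style treatment: average likelihood under a uniform prior on GRID."""
    return sum(likelihood(topology, DATA, p) for p in GRID) / len(GRID)

def maximized(topology):
    """Frequentist-style treatment: likelihood at the best-fitting value of p."""
    return max(likelihood(topology, DATA, p) for p in GRID)
```

In this toy case both treatments rank the topologies the same way, but the two quantities are different numbers answering different questions, and nothing guarantees agreement in richer models.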
                    (HC)G                      H(CG)                      (HG)C
Jukes–Cantor        L[(HC)G & Jukes–Cantor]    L[H(CG) & Jukes–Cantor]    L[(HG)C & Jukes–Cantor]
Felsenstein 1981    L[(HC)G & F]               L[H(CG) & F]               L[(HG)C & F]
Kimura 1980         L[(HC)G & K]               L[H(CG) & K]               L[(HG)C & K]
GTR                 L[(HC)G & GTR]             L[H(CG) & GTR]             L[(HG)C & GTR]
Figure 4.24 Conjunctions of the form ‘‘tree topology & process model’’ containing
adjustable parameters; these are nuisance parameters in the context of making inferences
about topologies. Frequentists set these at their maximum likelihood values, denoted by
‘‘L(process model & tree topology).’’
Most statistical work in phylogenetic inference has been carried out
within a frequentist, not a Bayesian, framework. The usual practice has
been to adopt a single model of the evolutionary process and then
compare topologies with the parameters in the model set at their maximum likelihood values. This amounts to making ‘‘horizontal’’ comparisons within a single row in Figure 4.24. When the trees are ‘‘specified
in advance,’’ biologists frequently seek to determine which of the three
conjunctions has the highest likelihood, thus bypassing questions about
which hypothesis is the null and what value of α should be chosen.
However, other procedures (e.g., the SOWH test; see Felsenstein 2004:
371–2 for discussion) are sometimes used when the ML tree is compared
with one that is less likely; here, the ML tree is regarded as the null
hypothesis, and the question is whether an alternative tree is significantly
less likely than it. We see here a pattern that often arises in frequentist
practice; the statistical procedure is not determined by logical and
mathematical relationships among data, hypotheses, and background
assumptions but involves facts about what goes on in the mind of the
investigator (recall the discussion of stopping rules in §1.6). Your treatment of (HC)G, H(CG), and (HG)C depends on whether you design your
test before gathering data or design the test already knowing that (HC)G
is the most likely tree. Bayesians and likelihoodists find it hard to
understand why this difference should make a difference.
Frequentists also make vertical comparisons in Figure 4.24; here, you
are not testing a topology; rather, you are testing different process models
against each other, given the assumption that some topology is true. The
typical procedure is to use the likelihood ratio test. The question is not
which conjunction has the higher likelihood; we know in advance that
models with a larger number of adjustable parameters will fit the data
better, so likelihoods must increase as one moves from the Jukes–Cantor
model to Kimura (1980) and then to GTR. Rather, the question is
whether the likelihood of a more complex model is sufficiently greater than
the likelihood of a simpler model to justify rejecting the simpler model.
As noted in §1.5, this methodology has a frequentist justification only for
nested models. It is possible to compare each of [(HC)G & Felsenstein]
and [(HC)G & Kimura] with [(HC)G & Jukes–Cantor], but one can’t
compare the first two with each other. Another property of the likelihood
ratio test is that it can yield different answers depending on whether one
starts with the simplest model and works up or starts with the most
complex model and works down (§1.5).
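For two nested models that differ by a single adjustable parameter (for instance Jukes–Cantor inside Kimura 1980 on a fixed topology), the likelihood ratio test can be sketched as follows. The sketch assumes the usual large-sample chi-square approximation with one degree of freedom, and the function name and the log-likelihood values in the usage lines are hypothetical, chosen only to show the mechanics.

```python
import math

def likelihood_ratio_test(lnL_simple, lnL_complex, alpha=0.05):
    """Likelihood ratio test for nested models differing by ONE parameter.

    lnL_simple, lnL_complex: maximized log-likelihoods of the two models
    Returns (test statistic, approximate p-value, whether to reject the simpler model).
    """
    statistic = 2.0 * (lnL_complex - lnL_simple)
    # Chi-square tail probability for one degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value, p_value < alpha

# Hypothetical maximized log-likelihoods, not taken from any real data set.
big_gain = likelihood_ratio_test(lnL_simple=-1000.0, lnL_complex=-995.0)
small_gain = likelihood_ratio_test(lnL_simple=-1000.0, lnL_complex=-999.5)
```

The function has no way to compare two models that are not nested, which mirrors the limitation described above for Felsenstein versus Kimura.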
These limitations of the Neyman–Pearson theory suggest that it may
make sense to place phylogenetic inference within a model-selection
framework.38 In using AIC, or some other model-selection criterion, one
obtains an ordered list, from best to worst, of conjunctive hypotheses,
each of which has the form ‘‘genealogical hypothesis & process model.’’
The Duhemian point continues to apply: In the first instance, what one
is testing are the different conjunctions, not the genealogical hypotheses
taken on their own. Still, one can reach inside these conjunctions and
examine the conjuncts of interest in the following way. Suppose (HC)G
is the genealogical hypothesis that figures in the first five, or the first ten,
or the first fifteen conjunctions at the top of the list. The larger this
group of conjunctions is, the more we are entitled to conclude that the
data favor (HC)G. In this case, (HC)G is robust across variation in
process model, and the more robust the better. But suppose that (HC)G
appears in the first, but not the second, of the conjunctions on this list
and then appears in the third through twentieth entries. Since AIC
provides a quantitative score for each conjunction, and not just an
ordering of conjunctions, one can ask what the average effect is of shifting
from one tree topology to another, within each of several process models.
For example, perhaps AIC scores are on average improved by moving
from H(CG) to (HC)G.
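The bookkeeping involved in ranking conjunctions and asking about the average effect of a change in topology can be sketched as follows. The sketch assumes the common convention AIC = 2k − 2 ln L, on which lower scores are better; the log-likelihoods and parameter counts are hypothetical, and a real analysis would also count branch-length parameters.

```python
def aic(log_likelihood, k):
    """Akaike score in the 2k - 2 ln L convention (lower is better)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical (maximized log-likelihood, number of adjustable parameters)
# for each conjunction 'tree topology & process model'.
FITS = {
    ("(HC)G", "Jukes-Cantor"): (-1010.0, 1),
    ("H(CG)", "Jukes-Cantor"): (-1014.0, 1),
    ("(HC)G", "Kimura 1980"): (-1002.0, 2),
    ("H(CG)", "Kimura 1980"): (-1007.0, 2),
}

scores = {conjunction: aic(lnL, k) for conjunction, (lnL, k) in FITS.items()}
ranking = sorted(scores, key=scores.get)        # best conjunction first

# Average improvement from moving from H(CG) to (HC)G within each process model.
models = sorted({model for _, model in FITS})
gains = [scores[("H(CG)", m)] - scores[("(HC)G", m)] for m in models]
mean_gain = sum(gains) / len(gains)
```

With these invented numbers (HC)G occupies the first and third places in the ranking but not the second, so the quantitative `mean_gain` carries information that the ordering alone does not.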
What resources does Bayesianism have for testing tree topologies
against each other across a range of possible process models? Just like
frequentist work on phylogenetic inference, most Bayesian analyses have
opted for a single process model and then compare topologies within the
context of that one model; what makes the work distinctly Bayesian is that
a prior distribution is employed for the values that the nuisance
parameters in the model might have. But Bayesians also have started to consider
multiple process models within a model-selection framework (see, for
example, Huelsenbeck et al. 2004). If one topology has a higher average
likelihood than another for each of the process models one has
considered, this shows that the result is robust; it does not depend on which of
these process models one chooses. And if unanimity across models fails,
the fact that BIC provides quantitative values for the average likelihoods
of different conjunctions, and not just an ordering, becomes important.
BIC can be used to evaluate the average likelihoods of conjunctions of the
form (tree topology & process model M), and one can see what the
average effect is of shifting from one tree topology to another across each
of several process models.
38 Kishino and Hasegawa (1990) applied AIC to choice between tree
topologies; see Posada and Crandall (2001) and Posada and Buckley (2004)
for further discussion.
Model-selection theory, whether it is Akaikean or Bayesian, provides
the resources for statistically testing tree topologies against each other
without requiring one to decide in advance which process model is true.
Choosing between the two approaches requires one to consider the different goals that AIC and BIC have and involves the questions surveyed in
Chapter 1 concerning whether various assumptions that go into the two
procedures are reasonable. I won’t repeat those points here, but I want to
recall one theme. If the process models one is considering all contain
idealizations, all are false, so there won’t be much point to asking which of
them has the highest probability of being true. A better paradigm is the
goal of estimating predictive accuracy, of finding fitted models that are
close to the truth.
What does cladistic parsimony assume about the evolutionary process?
What does the word ‘‘assume’’ mean in the question that is the title of this
section? An example from outside science provides some guidance.
Consider the two sentences
(P) Jones is poor but honest
and
(A) There is a conflict between being poor and being honest.
I hope it is clear that P assumes that A is true, but that A does not assume
that P is true. Notice that P entails A – that is, if P is true, then A must
also be true. However, A does not entail P; if there is a conflict between
poverty and honesty, this says nothing about Jones and the characteristics
he happens to have. This example points to a general fact about what it
means to talk about the assumptions of a proposition:
If P assumes A, then P entails A.
To find out what a proposition assumes, you must look for conditions
that are necessary for the proposition to be true, not for conditions that
suffice for the proposition’s truth.
When are likelihood and parsimony ordinally equivalent?
Given this clarification of what an assumption is, we can turn to the
question of what it means to talk about the assumptions that are involved
in using cladistic parsimony to infer tree topologies. What parsimony
assumes about the evolutionary process are the propositions that must be
true if parsimony is to be a legitimate method of phylogenetic inference.
But what does ‘‘legitimate’’ mean? There are a number of choices to
consider. For example, one might demand that a legitimate phylogenetic
method be statistically consistent – that it converge on the true phylogeny
as the number of observations is made large without limit. We will
consider this idea in the next section. Another interpretation – the one I
want to explore now – maintains that parsimony is a legitimate method
precisely when it is ordinally equivalent with likelihood. This idea is easy
to understand by considering the Fahrenheit and Centigrade scales of
temperature. These are ordinally equivalent, meaning that for any two
objects, the first has a higher temperature in Fahrenheit than the second,
precisely when the first has a higher temperature in Centigrade than
the second. The two scales induce the same ordering of objects. For
parsimony and likelihood to be ordinally equivalent, the requirement
is that
(OE) For any phylogenetic hypotheses H1 and H2, and for any data set
D, H1 provides a more parsimonious explanation of D than H2
does precisely when $\mathrm{Pr}_M(D \mid H_1) > \mathrm{Pr}_M(D \mid H_2)$.
The subscript M in the likelihood terms is a reminder of the Duhemian
point that phylogenetic hypotheses do not confer probabilities on data,
save in the context of a process model. In fact, it is misleading to talk of
parsimony and ‘‘likelihood’’ being, or failing to be, ordinally equivalent.
Rather, the question is whether likelihood when implemented by an
assumed process model M is or is not ordinally equivalent with parsimony.
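Ordinal equivalence is a property of two scoring rules over a set of items, and it can be checked pair by pair. The sketch below does this for the temperature example and for a hypothetical comparison of parsimony and likelihood; the function name, the tree scores and the convention of negating the number of changes (so that higher always means better) are illustrative assumptions.

```python
from itertools import combinations

def ordinally_equivalent(items, score_a, score_b):
    """True when score_a and score_b order every pair of items the same way."""
    for x, y in combinations(items, 2):
        if (score_a(x) > score_a(y)) != (score_b(x) > score_b(y)):
            return False
        if (score_a(x) < score_a(y)) != (score_b(x) < score_b(y)):
            return False
    return True

# Temperature example: objects identified by their Centigrade readings.
centigrade = [-40.0, 0.0, 37.0, 100.0]

def to_fahrenheit(c):
    return c * 9 / 5 + 32

# Hypothetical scores for three trees on one data set under one process model M.
changes_required = {"(HC)G": 5, "H(CG)": 6, "(HG)C": 7}
likelihood_under_M = {"(HC)G": 3e-9, "H(CG)": 1e-9, "(HG)C": 4e-10}

same_order = ordinally_equivalent(
    list(changes_required),
    lambda tree: -changes_required[tree],   # fewer changes = more parsimonious
    lambda tree: likelihood_under_M[tree],
)
```

A check of this kind covers only the data sets and trees supplied; (OE) is a universal claim over all hypotheses and all data sets, so passing it on examples is evidence of compatibility rather than a proof of equivalence.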