A problem in POY tree searches (and its work‐around) when some sequences are observed to be absent in some terminals
Author(s) - De Laet Jan
Publication year - 2010
Publication title - Cladistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 2.323
H-Index - 92
eISSN - 1096-0031
pISSN - 0748-3007
DOI - 10.1111/j.1096-0031.2010.00306.x
Subject(s) - tree (set theory) , work (physics) , biology , mathematics , combinatorics , engineering , mechanical engineering
Sir,

The methods implemented in POY (Wheeler et al., 2003; Varón et al., 2010) include direct optimization (Wheeler, 1996) and fixed-states analysis (Wheeler, 1999), which are intended as heuristic techniques for calculating tree cost in the tree alignment problem (sensu Sankoff, 1975; Sankoff and Cedergren, 1983; see De Laet, 2005, pp. 97–99 for background and discussion). It turns out that POY's implementation of those heuristics can give erroneous results in some cases. When the data include sequences or fragments that are absent in some terminals, POY may fail to count the indel events that are required to account for those absences. This can lead to misidentification of optimal trees and incorrectly resolved consensus trees. Here I describe the problem, provide a work-around, and discuss an example in which the issue has affected results with empirical data recently reported in this journal (by Agolin and D'Haese, 2009).

The basic problem is illustrated by the data set of Fig. 1a, with two short fragments for three terminals. The first fragment is present and identical in the three terminals; the second fragment is identical in the first two but absent in the third: this third terminal lacks anything comparable to the second fragment. To account for this data set on the single tree for three terminals, one indel event has to be postulated, on the branch leading to the third terminal. But when the absent sequence is represented as a zero-length string in the fasta input file for POY (Fig. 1b), POY reports a cost of zero, irrespective of the tree cost heuristic employed and the cost matrix applied. This is the case in both POY3 and POY4 (up to 4.1.2, the most recent version available). The zero cost that POY reports would be correct if the second fragment in the third terminal were not an observed absence but missing data. The problem, then, is that POY sometimes treats absence as missing data.
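The difference between the two readings of a zero-length string can be made concrete with a minimal sketch. It is not POY's algorithm: it only counts whole-fragment indel events on the single tree for three terminals, and it assumes that all present copies of a fragment are identical, as in the example. The sequences, the terminal names, and the function names are hypothetical stand-ins, since the actual bases of Fig. 1a are not reproduced here.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the data set of Fig. 1a (illustrative bases only).
# "" encodes an observed absence; None would encode missing data.
DATA: Dict[str, List[Optional[str]]] = {
    "t1": ["ACGT", "TTGA"],
    "t2": ["ACGT", "TTGA"],
    "t3": ["ACGT", ""],
}


def fragment_indel_events(states: List[Optional[str]],
                          absence_is_missing: bool) -> int:
    """Count whole-fragment indel events on a tree with one internal node.

    Assumes every present copy of the fragment is identical, so only the
    presence/absence pattern contributes to the count.
    """
    observed = []
    for s in states:
        if s is None or (absence_is_missing and s == ""):
            continue  # treated as missing data: contributes nothing
        observed.append(s != "")  # True = present, False = absent
    present = sum(observed)
    absent = len(observed) - present
    # The internal node takes the majority state; each terminal in the
    # minority state then requires one indel event on its own branch.
    return min(present, absent)


def total_events(data: Dict[str, List[Optional[str]]],
                 absence_is_missing: bool) -> int:
    """Sum the per-fragment event counts over all fragments."""
    n_fragments = len(next(iter(data.values())))
    return sum(
        fragment_indel_events([seqs[i] for seqs in data.values()],
                              absence_is_missing)
        for i in range(n_fragments)
    )


print(total_events(DATA, absence_is_missing=False))  # absence as absence
print(total_events(DATA, absence_is_missing=True))   # absence as missing data
```

Under these assumptions the first call counts one indel event, matching the event that has to be postulated on the branch leading to the third terminal, and the second call counts none, matching the zero cost that results when the absence is read as missing data.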
Information input as an observed absence of a sequence (as a zero-length string) is interpreted as missing data in the cost calculations. But the treatment of absences is not consistent. If the data are summarized with the report(crossreferences) command, POY lists the second fragment as absent for the third terminal. Similarly, when POY4 produces an implied alignment (Schwikowski and Vingron, 1997; cf. De Laet, 2005, pp. 98–99) for the data set of Fig. 1a, the third terminal's second fragment is presented as a gap, i.e. as an absent sequence, not as missing data. This can be confirmed by using TNT (Goloboff et al., 2008) to evaluate the implied alignment on the single tree for three terminals. TNT returns a length of one unit indel, which is the correct length when absence is correctly treated as absence.

A work-around can be used to force POY to treat absences correctly. To each fragment that is absent in some of the terminals, add a zero-cost uninformative position in every terminal for which the fragment is nonmissing. Figure 1c illustrates applying this method to