Author | Title | Year | Journal/Proceedings | Reftype | DOI/URL |
---|---|---|---|---|---|

Adebiyi, E. & Dipe, T. | Finding higher order motifs under the levenshtein measure | 2003 | Proceedings of the 2003 IEEE Bioinformatics Conference. CSB 2003 | inproceedings | DOI |

Abstract: We study the problem of finding higher order motifs under the levenshtein measure, otherwise known as the edit distance. In the problem set-up, we are given N sequences, each of average length n, over a finite alphabet /spl Sigma/ and thresholds D and q, we are to find composite motifs that contain motifs of length P (these motifs occur with almost D differences) in 1 /spl les/ q /spl les/ N distinct sequences. Two interesting but involved algorithms for finding higher order motifs under the edit distance was presented by Marsan and Sagot. Their second algorithm is much more complicated and its complexity is asymptotically not better. Their first algorithm runs in O(M /spl middot/ N/sup 2/n/sup 1+/spl alpha/ /spl middot/p /spl middot/ pow(/spl epsi/)/) where p /spl ges/ 2, /spl alpha/ > 0, pow(/spl epsi/) is a concave function that is less than 1, /spl epsi/= D/P and M is the expected number of all monad motifs. We present an alternative algorithmic approach also for Edit distance based on the concept described. The resulting algorithm is simpler and runs in O(N/sup 2/n/sup 1+p /spl middot/ pow(/spl epsi/)/) expected time. |
|||||

BibTeX:
@inproceedings{Adebiyi2003, author = {Adebiyi, E.F. and Dipe, T.}, title = {Finding higher order motifs under the levenshtein measure}, booktitle = {Proceedings of the 2003 IEEE Bioinformatics Conference. CSB 2003}, year = {2003}, pages = {583 - 584}, doi = {http://dx.doi.org/10.1109/CSB.2003.1227414} } |
|||||

Aho, A. & Peterson, T. | A minimum distance error-correcting parser for context-free languages | 1972 | SIAM Journal on Computing | article | DOI |

Abstract: We assume three types of syntax errors can debase the sentences of a language generated by a context-free grammar: the replacement of a symbol by an incorrect symbol, the insertion of an extraneous symbol, or the deletion of a symbol. We present an algorithm that will parse any input string to completion finding the fewest possible number of errors. On a random access computer the algorithm requires time proportional to the cube of the length of the input. |
|||||

BibTeX:
@article{Aho1972, author = {Aho, A.V. and Peterson, T.G.}, title = {A minimum distance error-correcting parser for context-free languages}, journal = {SIAM Journal on Computing}, publisher = {SIAM}, year = {1972}, volume = {1}, number = {4}, pages = {305--312}, doi = {http://dx.doi.org/10.1137/0201022} } |
|||||

Amir, A., Butman, A., Crochemore, M., Landau, G. M. & Schaps, M. | Two-dimensional pattern matching with rotations | 2004 | Theoretical Computer Science | article | DOIURL |

Abstract: The problem of pattern matching with rotation is that of finding all occurrences of a two-dimensional pattern in a text, in all possible rotations. We prove an upper and lower bound on the number of such different possible rotated patterns. Subsequently, given an m x m array (pattern) and an n x n array (text) over some finite alphabet [Sigma], we present a new method yielding an O(n2m3) time algorithm for this problem. |
|||||

BibTeX:
@article{Amir2004, author = {Amihood Amir and Ayelet Butman and Maxime Crochemore and Gad M. Landau and Mary Schaps}, title = {Two-dimensional pattern matching with rotations}, journal = {Theoretical Computer Science}, year = {2004}, volume = {314}, number = {1-2}, pages = {173-187}, url = {http://www.sciencedirect.com/science/article/B6V1G-4BCWWT1-3/2/77c251d04039632956b020d952bcd1ee}, doi = {http://dx.doi.org/10.1016/j.tcs.2003.10.039} } |
|||||

Apostolico, A., Erdos, P. L. & Lewenstein, M. | Parameterized matching with mismatches | 2007 | Journal of Discrete Algorithms | article | DOIURL |

Abstract: The problem of approximate parameterized string searching consists of finding, for a given text t=t1t2...tn and pattern p=p1p2...pm over respective alphabets [Sigma]t and [Sigma]p, the injection [pi]i from [Sigma]p to [Sigma]t maximizing the number of matches between [pi]i(p) and titi+1...ti+m-1 (i=1,2,...,n-m+1). We examine the special case where both strings are run-length encoded, and further restrict to the case where one of the alphabets is binary. For this case, we give a construction working in time O(n+(rpxrt)[alpha](rt)log(rt)), where rp and rt denote the number of runs in the corresponding encodings for y and x, respectively, and [alpha] is the inverse of the Ackermann's function. |
|||||

BibTeX:
@article{Apostolico2007a, author = {Alberto Apostolico and Peter L. Erdos and Moshe Lewenstein}, title = {Parameterized matching with mismatches}, journal = {Journal of Discrete Algorithms}, year = {2007}, volume = {5}, number = { 1}, pages = {135-140}, url = {http://www.sciencedirect.com/science/article/B758J-4JW128C-6/2/ad01648d986bf4e2f252b1fa8f5116c3}, doi = {http://dx.doi.org/10.1016/j.jda.2006.03.014} } |
|||||

Apostolico, A. & Galil, Z. | Pattern matching algorithms | 1997 | book | URL | |

Abstract: Issues of matching and searching on elementary discrete structures arise pervasively in computer science and many of its applications, and their relevance is expected to grow as information is amassed and shared at an accelerating pace. Several algorithms were discovered as a result of theseneeds, which in turn created the subfield of Pattern Matching. This book provides an overview of the current state of Pattern Matching as seen by specialists who have devoted years of study to the field. It covers most of the basic principles and presents material advanced enough to faithfullyportray the current frontier of research. Because of these recent advances, this is the right time for a book that brings together information relevant to both graduate students and specialists in need of an in-depth reference. |
|||||

BibTeX:
@book{Apostolico1997, author = {Alberto Apostolico and Zvi Galil}, title = {Pattern matching algorithms}, publisher = {Oxford University Press}, year = {1997}, url = {http://books.google.com.co/books?hl=es&lr=&id=mFd_grFyiT4C&oi=fnd&pg=RA2-PA2&sig=b_iozTtafRNpZgKced40Na08sJI&dq=Parallel+computations+of+Levenshtein+distances} } |
|||||

Apostolico, A. & Giancarlo, R. | Periodicity and Repetitions in Parameterized Strings | 2005 | Electronic Notes in Discrete Mathematics | article | URL |

Abstract: N/A |
|||||

BibTeX:
@article{Apostolico2005, author = {Alberto Apostolico and Raffaele Giancarlo}, title = {Periodicity and Repetitions in Parameterized Strings}, journal = {Electronic Notes in Discrete Mathematics}, year = {2005}, volume = {21}, pages = {227-230}, url = {http://www.sciencedirect.com/science/article/B75GV-4GST147-13/2/46fc6d000a32de7151a1f5411c013601} } |
|||||

Apostolico, A. & Pizzi, C. | Motif discovery by monotone scores | 2007 | Discrete Applied Mathematics | article | DOIURL |

Abstract: The detection of frequent patterns such as motifs and higher aggregates is of paramount interest in biology and invests many other applications of automated discovery. The problem with its variants is usually plagued with computational burden. A related difficulty is posed by the fact, that due to the sheer mole of candidates, the tables and indices at the outset tend to be bulky, un-manageable, and ultimately uninformative. For solid patterns, it is possible to compact the size of statistical indices by resort to certain monotonicities exhibited by popular scores. The savings come from the fact that these monotonicities enable one to partition the candidate over-represented words into families in such a way that it suffices to consider and weigh only one candidate per family. In this paper, we study the problem of extracting, from given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent query: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for the probability and expected frequency of a substring under extension, increased number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns. |
|||||

BibTeX:
@article{Apostolico2007, author = {Alberto Apostolico and Cinzia Pizzi}, title = {Motif discovery by monotone scores}, journal = {Discrete Applied Mathematics}, year = {2007}, volume = {155}, number = {6-7}, pages = {695-706}, url = {http://www.sciencedirect.com/science/article/B6TYW-4M3RNX6-1/2/91ab96ab1ba5d4d1d5a548e9aa5b8f35}, doi = {http://dx.doi.org/10.1016/j.dam.2005.09.017} } |
|||||

Arslan, A. | An algorithm for string edit distance allowing substring reversals | 2006 | Sixth IEEE Symposium on BioInformatics and BioEngineering. BIBE 2006 | conference | DOI |

Abstract: The edit distance between given two strings X and Y is the minimum number of edit operations that transform X into Y . Ordinarily, string editing is based on character insert, delete, and substitute operations. It has been suggested that extending this model with block (substring) edits would be useful in applications such as DNA sequence comparison. In its general form, the resulting problem is NP-hard. However, there are efficient algorithms when string edits include only character, and block replacements. We introduce a new edit model which permits insertions, deletions, and substitutions at character level, and also reversals of substrings. We present an algorithm whose worst-case time complexity is O(n2m) where n = |X| le m = |Y |, and we prove that the average running time of the algorithm is O(nm). Our experiments on randomly generated strings verify these results. The main contribution of this paper is that we present an algorithm to find all possible reversals using a generalized suffix tree, which is fast on average. |
|||||

BibTeX:
@conference{Arslan2006, author = {Arslan, A.N.}, title = {An algorithm for string edit distance allowing substring reversals}, booktitle = {Sixth IEEE Symposium on BioInformatics and BioEngineering. BIBE 2006}, year = {2006}, pages = {220 - 226}, doi = {http://dx.doi.org/10.1109/BIBE.2006.253338} } |
|||||

Arslan, A. N. & Egecioglu, Ö. | Efficient Algorithms For Normalized Edit Distance | 2000 | Journal of Discrete Algorithms | article | URL |

Abstract: A common model for computing the similarity of two strings X and Y of lengths m and n respectively, with m n, is to transform X into Y through a sequence of edit operations, called an edit sequence. The edit operations are of three types: insertion, deletion, and substitution. A given cost function assigns a weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn) time in the worst-case. We give provably better algorithms: an O(mn log n)-time algorithm when the cost function is uniform, i.e, the weights of edit operations depend only on the type but not on the individual symbols involved, and an O(mn log m)-time algorithm when the weights are rational. |
|||||

BibTeX:
@article{Arslan2000, author = {Abdula N. Arslan and Ömer Egecioglu}, title = {Efficient Algorithms For Normalized Edit Distance}, journal = {Journal of Discrete Algorithms}, year = {2000}, volume = {1}, number = {1}, pages = {3--20}, url = {citeseer.ist.psu.edu/arslan00efficient.html} } |
|||||

Atallah1, M. J., Jacquet, P. & Szpankowski, W. | Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm | 1992 | Combinatorial Pattern Matching | article | DOI |

Abstract: Given a text of length n and a pattern of length m over some (possibly unbounded) alphabet, we consider the problem of finding all positions in the text at which the pattern ldquoalmost occursrdquo. Here by ldquoalmost occursrdquo we mean that at least some fixed fraction rgr of the characters of the pattern (for example, ge 60% of them) are equal to their corresponding characters in the text. We design a randomized algorithm that has O(n log m) worst-case time complexity and computes with high probability all of the almost-occurrences of the pattern in the text. This algorithm assumes that the fraction rgr is given as part of its input, and it works well even for relatively small values of rgr. It makes no assumptions about the probabilistic characteristics of the input. Our second contribution deals with the issue of which values of rgr correspond to the intuitive notion of similarity between pattern and text, and this leads us to the development of a probabilistic analysis for the case where both input strings are random (in the usual, i.e., Bernoulli, model).The first author's research was supported by the Office of Naval Research under Grants N0014-84-K-0502 and N0014-36-K-0689, and in part by AFOSR Grant 90-0107, and the NSF under Grant DCR-8451393, and in part by Grant R01 LM05118 from the National Library of Medicine. The second author was supported by NATO Collaborative Grant 0057/89. The third author's research was supported by AFOSR Grant 90-0107 and NATO Collaborative Grant 0057/89, and, in part by the NSF Grant CCR-8900305, and by Grant R01 LM05118 from the National Library of Medicine |
|||||

BibTeX:
@article{Atallah11992, author = {Mikhail J. Atallah1 and Philippe Jacquet and Wojciech Szpankowski}, title = {Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm}, journal = {Combinatorial Pattern Matching}, year = {1992}, volume = {644/1992}, doi = {http://dx.doi.org/10.1007/3-540-56024-6} } |
|||||

Baeza-Yates, R. A., Choffrut, C. & Gonnet, G. H. | On Boyer-Moore automata | 1994 | Algorithmica | article | URL |

Abstract: The notion of Boyer-Moore automaton was introduced by Knuth, Morris and Pratt in their historical paper on fast pattern matching. It leads to an algorithm that requires more pre-processing but is more efficient than the original Boyer-Moore's algorithm. |
|||||

BibTeX:
@article{Baeza-Yates1994, author = {Ricardo A. Baeza-Yates and Christian Choffrut and Gaston H. Gonnet}, title = {On Boyer-Moore automata}, journal = {Algorithmica}, year = {1994}, volume = {12}, pages = {268--292}, url = {http://www.dcc.uchile.cl/~rbaeza/cv/journals.html} } |
|||||

Baeza-Yates, R. A. & Perleberg, C. H. | Fast and practical approximate string matching | 1996 | Information Processing Letters | article | DOIURL |

Abstract: We present new algorithms for approximate string matching based in simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based in arithmetical operations that runs in linear worst case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs. |
|||||

BibTeX:
@article{Baeza-Yates1996, author = {Ricardo A. Baeza-Yates and Chris H. Perleberg}, title = {Fast and practical approximate string matching}, journal = {Information Processing Letters}, year = {1996}, volume = {59}, number = {1}, pages = {21-2}, url = {http://www.sciencedirect.com/science/article/B6V0F-3VVCJTY-5/2/1a81a71d32b703f975a0198be573494e}, doi = {http://dx.doi.org/10.1016/0020-0190(96)00083-X} } |
|||||

Baeza-Yates, R. & Schott, R. | Parallel searching in the plane | 1995 | Computational Geometry, | article | DOIURL |

Abstract: In this paper we investigate parallel searching in the plane using robots as searchers. We show that a huge number of robots are not necessary for several problems and under different assumptions. This corresponds to real situations since, actually, the number of processors of parallel machines is fixed and independent of the dimension of the problem to be solved. |
|||||

BibTeX:
@article{Baeza-Yates1995, author = {Ricardo Baeza-Yates and Rene Schott}, title = {Parallel searching in the plane}, journal = {Computational Geometry,}, year = {1995}, volume = {5}, number = {3}, pages = {143-154}, url = {http://www.sciencedirect.com/science/article/B6TYS-3YVD08Y-5/2/74e38c5aa555040a375a9910cf70f6b4}, doi = {http://dx.doi.org/10.1016/0925-7721(95)00003-R} } |
|||||

Baker, B. S. | Parameterized Pattern Matching: Algorithms and Applications | 1996 | Journal of Computer and System Sciences | article | DOIURL |

Abstract: The problem of finding sections of code that either are identical or are related by the systematic renaming of variables or constants can be modeled in terms ofparameterized strings(p-strings) andparameterized matches(p-matches). P-strings are strings over two alphabets, one of which represents parameters. Two p-strings are aparameterized match(p-match) if one p-string is obtained by renaming the parameters of the other by a one-to-one function. In this paper, we investigate parameterized pattern matching via parameterized suffix trees (p-suffix trees). We give two algorithms for constructing p-suffix trees: one (eager) that runs in linear time for fixed alphabets, and another that uses auxiliary data structures and runs inO(n log(n)) time for variable alphabets, wherenis input length. We show that using a p-suffix tree for a pattern p-stringP, it is possible to search for all p-matches ofPwithin a text p-stringTin space linear in P and time linear in T for fixed alphabets, orO(T log(min(P, [sigma])) time andO(P) space for variable alphabets, where[sigma]is the sum of the alphabet sizes. The simpler p-suffix tree construction algorithmeagerhas been implemented, and experiments show it to be practical. Since it runs faster than predicted by the above worst-case bound, we reanalyze the algorithm and show thateagerruns in timeO(min(t S+m(t, S) | t>0) log [sigma])), where for an input p-stringS, m(t, S) is the number of maximal p-matches of length at leasttthat occur withinS, and[sigma]is the sum of the alphabet sizes. Experiments with the author's programdup(B. Baker,in"Comput. Sci. Statist.," Vol. 24, 1992) for finding all maximal p-matches within a p-string have foundm(t, S) to be less than S in practice unlesstis small. |
|||||

BibTeX:
@article{, author = {Brenda S. Baker}, title = {Parameterized Pattern Matching: Algorithms and Applications}, journal = {Journal of Computer and System Sciences}, year = {1996}, volume = {52}, number = {1}, pages = {28-42}, url = {http://www.sciencedirect.com/science/article/B6WJ0-45NJMKJ-J/2/0b04b8bd4e79a7b2284d4a52c451e50d}, doi = {http://dx.doi.org/10.1006/jcss.1996.0003} } |
|||||

Bar-Yossef, Jayram, Z., Krauthgamer, T. S. & Kumar, R. | Approximating edit distance efficiently [BibTeX] |
2004 | Proceedings 45th Annual IEEE Symposium on Foundations of Computer Science, 2004 | inproceedings | DOI |

BibTeX:
@inproceedings{Bar-Yossef2004, author = {Bar-Yossef and Z. Jayram and T. S. Krauthgamer and R. Kumar}, title = {Approximating edit distance efficiently}, booktitle = {Proceedings 45th Annual IEEE Symposium on Foundations of Computer Science, 2004}, year = {2004}, pages = {550 - 559}, doi = {http://dx.doi.org/10.1109/FOCS.2004.14} } |
|||||

Boyar, J. & Larsen, K. S. | Faster parallel computation of edit distances | 1994 | techreport | URL | |

Abstract: Fast parallel algorithms are given for the longest common subsequence problem and the string editing problem with bounded weights. In the COMMON PRAM model, the algorithm for the longest common subsequence problem takes time O(log m), where m is the length of the shorter string, while the algorithm for the string editing problem with bounded weights takes time O(max(log m, log n/ log log n)), where m is the length of the shorter string and n is the length of the longer string. |
|||||

BibTeX:
@techreport{Boyar1994, author = {Joan Boyar and Kim S. Larsen}, title = {Faster parallel computation of edit distances}, year = {1994}, url = {citeseer.ist.psu.edu/boyar94faster.html} } |
|||||

Bunke, H. & Csirik, J. | An improved algorithm for computing the edit distance of run-length coded strings | 1995 | Inf. Process. Lett. | article | DOI |

Abstract: N/A |
|||||

BibTeX:
@article{Bunke1995, author = {H. Bunke and J. Csirik}, title = {An improved algorithm for computing the edit distance of run-length coded strings}, journal = {Inf. Process. Lett.}, publisher = {Elsevier North-Holland, Inc.}, year = {1995}, volume = {54}, number = {2}, pages = {93--96}, doi = {http://dx.doi.org/10.1016/0020-0190(95)00005-W} } |
|||||

Bunke, H. & Csirik, J. | Parametric string edit distance and its application to pattern recognition | 1995 | IEEE Transactions on Systems, Man and Cybernetics | article | DOI |

Abstract: A generalized version of the string matching algorithm by Wagner and Fischer (1974) is proposed. It is based on a parametrization of the edit cost. We assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter τ. For any two strings A and B, our algorithm computes their edit distance in terms of the parameter τ. We give the new algorithm, study some of its properties, and discuss potential applications to pattern recognition |
|||||

BibTeX:
@article{Bunke1995a, author = {Bunke, H. and Csirik, J.}, title = {Parametric string edit distance and its application to pattern recognition}, journal = {IEEE Transactions on Systems, Man and Cybernetics}, year = {1995}, volume = {25}, number = {1}, pages = {202 - 206}, doi = {http://dx.doi.org/10.1109/21.362950} } |
|||||

Bustos, J. A. | Estudio Comparativo de algoritmos indexados para búsqueda aproximada de texto | 2003 | School: Universidad de Chile |
mastersthesis | URL |

Abstract: El problema de la búsqueda aproximada en texto aparece en una gran cantidad de áreas delas Ciencias de la Computación; se aplica, entre otros, a recuperación de texto y biología computacional. |
|||||

BibTeX:
@mastersthesis{Bustos2003, author = {Javier Alejandro Bustos}, title = {Estudio Comparativo de algoritmos indexados para búsqueda aproximada de texto}, school = {Universidad de Chile}, year = {2003}, url = {http://www.uchile.cl/} } |
|||||

Chávez, E. & Navarro, G. | Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces | 2003 | Information Processing Letters,Information Processing Letters | article | DOIURL |

Abstract: Proximity searches become very difficult on "high dimensional" metric spaces, that is, those whose histogram of distances has a large mean and/or a small variance. This so-called "curse of dimensionality", well known in vector spaces, is also observed in metric spaces. The search complexity grows sharply with the dimension and with the search radius. We present a general probabilistic framework applicable to any search algorithm and whose net effect is to reduce the search radius. The higher the dimension, the more effective the technique. We illustrate empirically its practical performance on a particular class of algorithms, where large improvements in the search time are obtained at the cost of a very small error probability. |
|||||

BibTeX:
@article{Chavez2003, author = {Edgar Ch\'avez and Gonzalo Navarro}, title = {Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces}, journal = {Information Processing Letters,Information Processing Letters}, year = {2003}, volume = {85}, number = {1}, pages = {39-46}, url = {http://www.sciencedirect.com/science/article/B6V0F-475BCSD-4/2/4b9e5165db1137830dc22bfb2dbe10c3}, doi = {http://dx.doi.org/10.1016/S0020-0190(02)00344-7} } |
|||||

Chang, W. I. & Lampe, J. | heoretical and empirical comparisons of approximate string matching algorithms | 1992 | ombinatorial Pattern MatchingCombinatorial Pattern Matching | article | DOI |

Abstract: We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier ldquocut-offrdquo algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(radicb-1)) on the average, when alphabet size is taken into consideration.This research was conducted at the University of California, Berkeley, and was supported in part by Department of Energy grant DE-FG03-90ER60999 |
|||||

BibTeX:
@article{Chang1992, author = {William I. Chang and Jordan Lampe}, title = {heoretical and empirical comparisons of approximate string matching algorithms}, journal = {ombinatorial Pattern MatchingCombinatorial Pattern Matching}, year = {1992}, volume = {644/1992}, pages = {175-184}, doi = {http://dx.doi.org/10.1007/3-540-56024-6_14} } |
|||||

Chrobak, M., Kolman, P. & r i Sgall, J. | The greedy algorithm for the minimum common string partition problem | 2005 | ACM Trans. Algorithms | article | DOI |

Abstract: In the Minimum Common String Partition problem (MCSP), we are given two strings on input, and we wish to partition them into the same collection of substrings, minimizing the number of the substrings in the partition. This problem is NP-hard, even for a special case, denoted 2-MCSP, where each letter occurs at most twice in each input string. We study a greedy algorithm for MCSP that at each step extracts a longest common substring from the given strings. We show that the approximation ratio of this algorithm is between Ω(n0.43) and O(n0.69). In the case of 2-MCSP, we show that the approximation ratio is equal to 3. For 4-MCSP, we give a lower bound of Ω(log n). |
|||||

BibTeX:
@article{Chrobak2005, author = {Marek Chrobak and Petr Kolman and Ji\~r\~i Sgall}, title = {The greedy algorithm for the minimum common string partition problem}, journal = {ACM Trans. Algorithms}, publisher = {ACM Press}, year = {2005}, volume = {1}, number = {2}, pages = {350--366}, doi = {http://doi.acm.org/10.1145/1103963.1103971} } |
|||||

Chávez, E. & Navarro, G. | A Metric Index for Approximate String Matching | 2006 | Theoretical Computer Science | article | DOIURL |

Abstract: We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the occ occurrences of a pattern of length m, permitting up to r differences, in a text of length n over an alphabet of size [sigma], in average time O(m1+[epsilon]+occ) for any [epsilon]>0, if and . The index works well up to , where it achieves its maximum average search complexity . The construction time of the index is and its space is . This is the first index achieving average search time polynomial in m and independent of n, for . Previous methods achieve this complexity only for . We also present a simpler scheme needing O(n) space. |
|||||

BibTeX:
@article{Chavez2006, author = {Edgar Ch{\'a}vez and Gonzalo Navarro}, title = {A Metric Index for Approximate String Matching}, journal = {Theoretical Computer Science}, publisher = {Springer-Verlag}, year = {2006}, volume = {352}, number = {Issues 1-3}, pages = {266--279}, url = {http://www.sciencedirect.com/science/article/B6V1G-4HXBGSH-1/2/889a42ca051eca9d71743577b1672cc4}, doi = {http://dx.doi.org/10.1016/j.tcs.2005.11.037} } |
|||||

Crochemore, M., Iliopoulos, C. S., Pinzón, Y. J. & Reid, J. F. | A fast and practical bit-vector algorithm for the Longest Common Subsequence problem | 2001 | Information Processing Letters | article | DOIURL |

Abstract: This paper presents a new practical bit-vector algorithm for solving the well-known Longest Common Subsequence (LCS) problem. Given two strings of length m and n, n[greater-or-equal, slanted]m, we present an algorithm which determines the length p of an LCS in O(nm/w) time and O(m/w) space, where w is the number of bits in a machine word. This algorithm can be thought of as column-wise "parallelization" of the classical dynamic programming approach. Our algorithm is very efficient in practice, where computing the length of an LCS of two strings can be done in linear time and constant (additional/working) space by assuming that m[less-than-or-equals, slant]w. |
|||||

BibTeX:
@article{Crochemore2001, author = {Maxime Crochemore and Costas S. Iliopoulos and Yoan J. Pinz\'on and James F. Reid}, title = {A fast and practical bit-vector algorithm for the Longest Common Subsequence problem}, journal = {Information Processing Letters}, year = {2001}, volume = {80}, number = {6}, pages = {279-285}, url = {http://www.sciencedirect.com/science/article/B6V0F-4475W6K-2/2/8ff4c255a0756ed7bdfe7fe77fe2f964}, doi = {http://dx.doi.org/10.1016/S0020-0190(01)00182-X} } |
|||||

Egecioglu, O. & Ibel, M. | Parallel algorithms for fast computation of normalized edit distances | 1996 | Eighth IEEE Symposium on Parallel and Distributed Processing | conference | DOI |

Abstract: The authors give work-optimal and polylogarithmic time parallel algorithms for solving the normalized edit distance problem. The normalized edit distance between two strings X and Y with lengths n⩾m is the minimum quotient of the sum of the costs of edit operations transforming X into Y by the length of the edit path corresponding to those edit operations. Marzal and Vidal (1993) proposed a sequential algorithm with a time complexity of O(nm2). They show that this algorithm can be parallelized work-optimally on an array of n (or m) processors, and on a mesh of n×m processors. They then propose a sublinear time algorithm that is almost work-optimal: using O(mn1.75) processors, the time complexity of the algorithm is O(n0.75 log n) and the total number of operations is O (mn 2.5 log n). This algorithm runs on a CREW PRAM, but is likely to work on weaker PRAM models and hypercubes with minor modifications. Finally, they present a polylogarithmic O(log2 n) time algorithm based on matrix multiplication which runs on a O(n6/log n) processor hypercube |
|||||

BibTeX:
@conference{Egecioglu1996, author = {Egecioglu, O. and Ibel, M.}, title = {Parallel algorithms for fast computation of normalized edit distances}, booktitle = {Eighth IEEE Symposium on Parallel and Distributed Processing}, year = {1996}, pages = { 496 - 503}, doi = {http://dx.doi.org/10.1109/SPDP.1996.57037} } |
|||||

Ehrenfeucht, A. & Haussler, D. | A new distance metric on strings computable in linear time | 1988 | Discrete Appl. Math. | article | DOI |

Abstract: We describe a new metric for sequence comparison that emphasizes global similarity over sequential matching at the local level. It has the advantage over the Levenshtein metric that strings of lengths n and m can be compared in time proportional to n + m instead of nm. Various mathematical properties of the metric are established. |
|||||

BibTeX:
@article{Ehrenfeucht1988, author = {A. Ehrenfeucht and D. Haussler}, title = {A new distance metric on strings computable in linear time}, journal = {Discrete Appl. Math.}, publisher = {Elsevier Science Publishers B. V.}, year = {1988}, volume = {20}, number = {3}, pages = {191--203}, doi = {http://dx.doi.org/10.1016/0166-218X(88)90076-5} } |
|||||

Ferragina, P., Manzini, G., Mäkinen, V. & Navarro, G. | Compressed Representations of Sequences and Full-Text Indexes | 2006 | ACM Transactions on Algorithms (TALG) | article | URL |

Abstract: Given a sequence S = s(1)s(2) ... s(n) of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any s(i), as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an Information Theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r / log log n) time.Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet A. Namely, we design a variant of the FM-index that indexes a string T[1,n] within nHk(T) + o(n) bits of storage, where Hk(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k <= a log_|A| n, constant 0< a < 1, and |A| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1,p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^(1+e) n) time, for any constant 0< e<1; and it reports a text substring of length l in O(l + log^(1+e) n) time. Compared to all previous works, our index is the first one that removes the alphabet-size dependance from all query times, in particular counting time is linear in the pattern length. Still, our index uses essentially the same space of the k-th order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |A|=O(n^b), for any 0< b<1, by paying o(n log |A|) extra space and by multiplying all query times by O(log |A| / log log n). |
|||||

BibTeX:
@article{Ferragina2006, author = {P. Ferragina and G. Manzini and V. M{\"a}kinen and G. Navarro}, title = {Compressed Representations of Sequences and Full-Text Indexes}, journal = {ACM Transactions on Algorithms (TALG)}, year = {2006}, note = {To appear}, url = {http://www.dcc.uchile.cl/~gnavarro/publ.html} } |
|||||

Fredriksson, K. | Metric Indexes for Approximate String Matching in a Dictionary | 2004 | Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE’2004) | article | DOI |

Abstract: We consider the problem of finding all approximate occurrences of a given string q, with at most k differences, in a finite database or dictionary of strings. The strings can be e.g. natural language words, such as the vocabulary of some document or set of documents. This has many important application in both off-line (indexed) and on-line string matching. More precisely, we have a universe $mathbb U$ of strings, and a non-negative distance function $d: mathbb U times mathbb U rightarrow mathbb N$. The distance function is metric, if it satisfies (i) $d(x,y) = 0 ~ Leftrightarrow ~ x = y$; (ii) d(x,y) = d(y,x); (iii) d(x,y) le d(x,z) + d(z,y). The last item is called the ldquotriangular inequalityrdquo, and is the most important property in our case. Many useful distance functions are known to be metric, in particular edit (Levenshtein) distance is metric, which we will use for d. |
|||||

BibTeX:
@article{Fredriksson2004, author = {Kimmo Fredriksson}, title = {{Metric Indexes for Approximate String Matching in a Dictionary}}, journal = {Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE’2004)}, publisher = {Springer}, year = {2004}, volume = {3246/2004}, pages = {212--213}, doi = {http://dx.doi.org/10.1007/b100941} } |
|||||

Gilbert, D. & Schroeder, M. | FURY: fuzzy unification and resolution based on edit distance | 2000 | Proceedings IEEE International Symposium on Bio-Informatics and Biomedical Engineering | inproceedings | DOI |

Abstract: The authors present a theoretically founded framework for fuzzy unification and resolution based on edit distance over trees. Their framework extends classical unification and resolution conservatively. They prove important properties of the framework and develop the FURY system, which implements the framework efficiently using dynamic programming. The authors evaluate the framework and system on a large problem in the bioinformatics domain, that of detecting typographical errors in an enzyme name database |
|||||

BibTeX:
@inproceedings{Gilbert2000, author = {Gilbert, D. and Schroeder, M.}, title = {FURY: fuzzy unification and resolution based on edit distance}, booktitle = {Proceedings IEEE International Symposium on Bio-Informatics and Biomedical Engineering}, year = {2000}, number = {Nov}, pages = {330 - 336}, doi = {http://dx.doi.org/10.1109/BIBE.2000.889625} } |
|||||

Hannenhalli, S. | Polynomial-time algorithm for computing translocation distance between genomes | 1995 | Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching | inproceedings | URL |

Abstract: With the advent of large-scale DNA physical mapping and sequencing, studies of genome rearrangements are becoming increasingly important in evolutionary molecular biology. From a computational perspective, the study of evolution based on rearrangements leads to a rearrangement distance problem, i.e., computing the minimum number of rearrangement events required to transform one genome into another. Different types of rearrangement events give rise to a spectrum of interesting combinatorial problems. The complexity of most of these problems is unknown. Multichromosomal genomes frequently evolve by a rearrangement event called translocation which exchanges genetic material between different chromosomes. In this paper we study the translocation distance problem, modeling the evolution of genomes evolving by translocations. The translocation distance problem was recently studied for the first time by Kececioglu and Ravi, who gave a 2-approximation algorithm for computing translocation distance. In this paper we prove a duality theorem leading to a polynomial time algorithm for computing translocation distance for the case when the orientations of the genes are known. This leads to an algorithm generating a most parsimonious (shortest) scenario, transforming one genome into another by translocations. |
|||||

BibTeX:
@inproceedings{Hannenhalli1995, author = {Sridhar Hannenhalli}, title = {Polynomial-time algorithm for computing translocation distance between genomes}, booktitle = {Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag, Berlin}, year = {1995}, number = {937}, pages = {162--176}, url = {citeseer.ist.psu.edu/hannenhalli96polynomialtime.html} } |
|||||

Huang, X. | A lower bound for the edit-distance problem under an arbitrary cost function | 1988 | Inf. Process. Lett. | article | DOI |

Abstract: NA |
|||||

BibTeX:
@article{Huang1988, author = {Xiaoqiu Huang}, title = {A lower bound for the edit-distance problem under an arbitrary cost function}, journal = {Inf. Process. Lett.}, publisher = {Elsevier North-Holland, Inc.}, year = {1988}, volume = {27}, number = {6}, pages = {319--321}, doi = {http://dx.doi.org/10.1016/0020-0190(88)90220-7} } |
|||||

Hyyrö, H. | Bit-parallel approximate string matching algorithms with transposition | 2005 | Journal of Discrete Algorithms | article | DOIURL |

Abstract: Using bit-parallelism has resulted in fast and practical algorithms for approximate string matching under Levenshtein edit distance, which permits a single edit operation to insert, delete or substitute a character. Depending on the parameters of the search, currently the fastest non-filtering algorithms in practice are the O(k[left ceiling]m/w[right ceiling]n) algorithm of Wu and Manber, the O([left ceiling](k+2)(m-k)/w[right ceiling]n) algorithm of Baeza-Yates and Navarro, and the O([left ceiling]m/w[right ceiling]n) algorithm of Myers, where m is the pattern length, n is the text length, k is the error threshold and w is the computer word size. In this paper we discuss a uniform way of modifying each of these algorithms to permit also a fourth type of edit operation: transposing two adjacent characters in the pattern. This type of edit distance is also known as Damerau edit distance. In the end we also present an experimental comparison of the resulting algorithms. |
|||||

BibTeX:
@article{Hyyro2005, author = {Heikki Hyyr\"o}, title = {Bit-parallel approximate string matching algorithms with transposition}, journal = {Journal of Discrete Algorithms}, year = {2005}, volume = {3}, number = {2-4}, pages = {215-229}, url = {http://www.sciencedirect.com/science/article/B758J-4DBCF4Y-1/2/d2a34c9cb978a70d1a174b18faf3c6c1}, doi = {http://dx.doi.org/10.1016/j.jda.2004.08.006} } |
|||||

Hyyrö, H. | A bit-vector algorithm for computing Levenshtein and Damerau edit distances | 2003 | Nordic Journal of Computing | article | URL |

Abstract: The edit distance between strings A and B is defined as the minimum number of edit operations needed in converting A into B or vice versa. The Levenshtein edit distance allows three types of operations: an insertion, a deletion or a substitution of a character. The Damerau edit distance allows the previous three plus in addition a transposition between two adjacent characters. To our best knowledge the best current practical algorithms for computing these edit distances run in time O(dm) and O(m/w(n + σ)), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(d/wm + n/wσ) or O(d/wn + m/wσ). The structure of the algorithm is such, that in practice it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold. We also present some initial test results with thresholded edit distance computation. In them our algorithm works faster than the original algorithm of Myers. |
|||||

BibTeX:
@article{Hyyro2003, author = {Heikki Hyyr\"o}, title = {A bit-vector algorithm for computing Levenshtein and Damerau edit distances}, journal = {Nordic Journal of Computing}, publisher = {Publishing Association Nordic Journal of Computing}, year = {2003}, volume = {10}, number = {1}, pages = {29--39}, url = {citeseer.ist.psu.edu/537930.html} } |
|||||

Hyyrö, H. & Navarro, G. | Bit-Parallel Computation of Local Similarity Score Matrices with Unitary Weights | 2006 | International Journal of Foundations of Computer Science (IJFCS) | article | URL |

Abstract: Local similarity computation between two sequences permits detecting all the relevant alignments present between subsequences thereof. A well-known dynamic programming algorithm works in time O(mn), m and n being the lengths of the subsequences. The algorithm is rather slow when applied over many sequence pairs. In this paper we present the first bit-parallel computation of the score matrix, for a simplified choice of scores. If the computer word has w bits, then the resulting algorithm works in O(mn*log min(m,n,w) / w) time, achieving up to 8-fold speedups in practice. Some DNA comparison applications use precisely the simplified scores we handle, and thus our algorithm is directly applicable. In others, our method could be used as a raw filter to discard most of the strings, so the classical algorithm can be focused only on the substring pairs that can yield relevant results. |
|||||

BibTeX:
@article{Hyyro2006, author = {H. Hyyr{\"o} and G. Navarro}, title = {Bit-Parallel Computation of Local Similarity Score Matrices with Unitary Weights}, journal = {International Journal of Foundations of Computer Science (IJFCS)}, year = {2006}, volume = {17}, number = {6}, pages = {1325--1344}, url = {http://www.dcc.uchile.cl/~gnavarro/publ.html} } |
|||||

Hyyrö, H. & Navarro, G. | Bit-parallel Witnesses and their Applications to Approximate String Matching | 2005 | Algorithmica | article | URL |

Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(ceil(m/w) n) time, where w is the width of the computer word. The second one, ABNDM [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA [Wu and Manber, Comm. ACM, 1992], which makes use of an O(k ceil(m/w) n) time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the O(k ceil(m/w)) work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values so as to bound, deduce, or compute others fast. The resulting algorithm is average-optimal for m <= w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent bit-parallel algorithm based on Myers [Fredriksson, SPIRE 2003]. The use of witnesses greatly improves the running time of this algorithm too. |
|||||

BibTeX:
@article{Hyyro2005a, author = {H. Hyyr{\"o} and G. Navarro}, title = {Bit-parallel Witnesses and their Applications to Approximate String Matching}, journal = {Algorithmica}, publisher = {Springer}, year = {2005}, volume = {41}, number = {3}, pages = {203--231}, url = {http://www.dcc.uchile.cl/~gnavarro/publ.html} } |
|||||

Hyyrö, H., Pinzon, Y. & Shinohara, A. | New Bit-Parallel Indel-Distance Algorithm | 2005 | Proceedings of the 4th International Workshop on Efficient and Experimental Algorithms (WEA'05) | inproceedings | URL |

Abstract: The task of approximate string matching is to find all locations at which a pattern string p of length m matches a substring of a text string t of length n with at most k differences. It is common to use Levenshtein distance [5], which allows the differences to be single-character insertions, deletions, or substitutions. Recently, in [3], the IndelMYE, IndelWM and IndelBYN algorithms where introduced as modified version of the bit-parallel algorithms of Myers [6], Wu&Manber [10] and Baeza-Yates&Navarro [1], respectively. These modified versions were made to support the indel distance (only single-character insertions and/or deletions are allowed). In this paper we present an improved version of IndelMYE that makes a better use of the bit-operations and runs 24.5 percent faster in practice. In the end we present a complete set of experimental results to support our findings. |
|||||

BibTeX:
@inproceedings{Hyyroe2005, author = {Heikki Hyyrö and Yoan Pinzon and Ayumi Shinohara}, title = {New Bit-Parallel Indel-Distance Algorithm}, booktitle = {Proceedings of the 4th International Workshop on Efficient and Experimental Algorithms (WEA'05)}, publisher = {Springer-Verlag, Berlin}, year = {2005}, volume = {3503}, pages = {380-390}, url = {http://dis.unal.edu.co/profesores/ypinzon/ps/wea05.pdf} } |
|||||

Hyyrö, H., Pinzon, Y. & Shinohara, A. | Fast Bit-Vector Algorithms for Approximate String Matching under Indel-Distance | 2005 | Proceedings of the 31st Annual Conference on Current Trends in Theory and Practice of Informatics (SOFSEM'05) | inproceedings | URL |

Abstract: The approximate string matching problem is to find all location at which a query p of length m matches a substring of a text t of length n with at most k differences (insertion, deletion, and substitution). The fastest solutions in practice for this problem are the bit-parallel NFA simulation algorithms of Wu & Manber [4] and Baeza-Yates & Navarro [1], and the bit-parallel dynamic programming algorithm of Myers [3]. In this paper we present modified versions of these algorithms to deal with the restricted case where only insertions and deletions (called indel for short) are permitted. We also show test results with the algorithms. |
|||||

BibTeX:
@inproceedings{Hyyroe2005a, author = {Heikki Hyyrö and Yoan Pinzon and Ayumi Shinohara}, title = {Fast Bit-Vector Algorithms for Approximate String Matching under Indel-Distance}, booktitle = {Proceedings of the 31st Annual Conference on Current Trends in Theory and Practice of Informatics (SOFSEM'05)}, publisher = {Springer-Verlag}, year = {2005}, volume = {3381}, pages = {380-384}, url = {http://dis.unal.edu.co/profesores/ypinzon/ps/sofsem05.pdf} } |
|||||

Hébrard, J. | Distances sur les mots. Application à la recherche de motifs | 1984 | School: Université de Rouen |
phdthesis | URL |

Abstract: Dans divers domaines, on a à définir des critères de ressemblance entre chaînes de caractères (mots). Etude de 3 distances définies sur les mots: une généralisation de la distance de Hamming et deux distances fondées sur la comparaison de sous-mots. Définition de l'automate des sous-mots d'un mot et de la factorisation en couches |
|||||

BibTeX:
@phdthesis{H'ebrard1984, author = {J.-J. H{\'e}brard}, title = {Distances sur les mots. Application {\`a} la recherche de motifs}, school = {Universit{\'e} de Rouen}, year = {1984}, url = {http://cat.inist.fr/?aModele=afficheN&cpsidt=9693508} } |
|||||

Hébrard, J. & Crochemore, M. | Calcul de la distance par les sous-mots | 1986 | Informatique théorique et applications | article | URL |

Abstract: Reconnaissance caractère ; Chaîne caractère ; Automate état fini ; Structure donnée ; Automate déterministe ; Distance ; Partition ; Algorithme ; Programmation linéaire ; Sous-mot ; Plus courte sous-suite ; |
|||||

BibTeX:
@article{Hebrard1986, author = {J.-J. H{\'e}brard and M. Crochemore}, title = {Calcul de la distance par les sous-mots}, journal = {Informatique théorique et applications}, year = {1986}, volume = {20}, number = {4}, pages = {441--456}, url = {http://cat.inist.fr/?aModele=afficheN&cpsidt=8246207} } |
|||||

Irving, R. W. & Fraser, C. B. | Two algorithms for the longest common subsequence of three (or more) strings | 1992 | Combinatorial Pattern Matching | article | DOI |

Abstract: Various algorithms have been proposed, over the years, for the longest common subsequence problem on 2 strings (2-LCS), many of these improving, at least for some cases, on the classical dynamic programming approach. However, relatively little attention has been paid in the literature to the k-LCS problem for k > 2, a problem that has interesting applications in areas such as the multiple alignment of sequences in molecular biology.In this paper, we describe and analyse two algorithms with particular reference to the 3-LCS problem, though each algorithm can be extended to solve the k-LCS problem for general k. The first algorithm, which can be viewed as a ldquolazyrdquo version of dynamic programming, has time and space complexity that is O(n(n–1) 2) for 3 strings, and O(kn(n–1) k-1) for k strings, where n is the common length of the strings and l is the length of an LCS. The second algorithm, which involves evaluating entries in a ldquothresholdrdquo table in diagonal order, has time and space complexity that is O(l(n–1)2+sn) for 3 strings, and O(kl(n–1) k –1+ksn) for k strings, where s is the alphabet size. For simplicity, the algorithms are presented for equal-length strings, though extension to unequal-length strings is straightforward. Empirical evidence is presented to show that both algorithms show significant improvement on the basic dynamic programming approach, and on an earlier algorithm proposed by Hsu and Du, particularly, as would be expected, in the case where l is relatively large, with the balance of evidence being heavily in favour of the threshold approach. Key words string algorithms - longest common subsequence Supported by a postgraduate research studentship from the Science and Engineering Research Council |
|||||

BibTeX:
@article{Irving1992, author = {Robert W. Irving and Campbell B. Fraser}, title = {Two algorithms for the longest common subsequence of three (or more) strings}, journal = {Combinatorial Pattern Matching}, year = {1992}, pages = {214-229}, doi = {http://dx.doi.org/10.1007/3-540-56024-6_18} } |
|||||

Kashyap, R. & Oommen, B. | The Noisy Substring Matching Problem | 1983 | Software Engineering, IEEE Transactions on | article | URL |

Abstract: Let T(U) be the set of words in the dictionary H which contains U as a substring. The problem considered here is the estimation of the set T(U) when U is not known, but Y, a noisy version of U is available. The suggested set estimate S*(Y) of T(U) is a proper subset of H such that its every element contains at least one substring which resembles Y most according to the Levenshtein metric. The proposed algorithm for-the computation of S*(Y) requires cubic time. The algorithm uses the recursively computable dissimilarity measure Dk(X, Y), termed as the kth distance between two strings X and Y which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X. Another estimate of T(U), namely SM(Y) is also suggested. The accuracy of SM(Y) is only slightly less than that of S*(Y), but the computation time of SM(Y) is substantially less than that of S*(Y). Experimental results involving 1900 noisy substrings and dictionaries which are subsets of 1023 most common English words [11] indicate that the accuracy of the estimate S*(Y) is around 99 percent and that of SM(Y) is about 98 percent. |
|||||

BibTeX:
@article{Kashyap1983, author = {Kashyap, R.L. and Oommen, B.J. }, title = {The Noisy Substring Matching Problem}, journal = {Software Engineering, IEEE Transactions on}, year = {1983}, volume = {SE-9}, pages = {365 - 370}, url = {http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1703065&isnumber=35934&punumber=32&k2dockey=1703065@ieeejrns&query=%28oommen%3Cin%3Emetadata%29&pos=14} } |
|||||

Kashyap, R. L. & Oommen, B. J. | An effective algorithm for string correction using generalized edit distances I: description of the algorithm and its optimality | 1981 | INFO. SCI. | article | URL |

Abstract: This paper deals with the problem of estimating a transmitted string X, from thecorresponding received string Y, which is a noisy version of X,. We assume that Y contains*any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of X, were deleted in transmission. We have shown that for channels which cause independent errors, and whose error probabilities exceed those of noisy strings studied in the literature [ 121, at least 99.5% of the erroneous strings will not contain two consecutive deletion errors. The best estimate X * of X, is defined as that element of H which minimizes the generalized Levenshtein distance D( X/Y) between X and Y. Using dynamic programming principles, an algorithm is presented which yields X+ without computing individually the distances between every word of H and Y. Though this algorithm requires more memory, it can be shown that it is, in general, computationally less complex than all other existing algorithms which perform the same tas |
|||||

BibTeX:
@article{Kashyap1981, author = {R. L. Kashyap and B. J. Oommen}, title = {An effective algorithm for string correction using generalized edit distances {I}: description of the algorithm and its optimality}, journal = {INFO. SCI.}, year = {1981}, volume = {23}, number = {2}, pages = {123--142}, url = {http://www.scs.carleton.ca/~oommen/papers/TrieBFS01Published.pdf} } |
|||||

Kececioglu, J. D. & Ravi, R. | Of mice and men: algorithms for evolutionary distances between genomes with translocation | 1995 | SODA '95: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms | inproceedings | URL |

Abstract: NA |
|||||

BibTeX:
@inproceedings{Kececioglu1995, author = {John D. Kececioglu and R. Ravi}, title = {Of mice and men: algorithms for evolutionary distances between genomes with translocation}, booktitle = {SODA '95: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms}, publisher = {Society for Industrial and Applied Mathematics}, year = {1995}, pages = {604--613}, url = {http://portal.acm.org/citation.cfm?id=313825&dl=ACM&coll=GUIDE#} } |
|||||

Kececioglu, J. D. & Sankoff, D. | Efficient Bounds for Oriented Chromosome Inversion Distance | 1994 | CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching | inproceedings | DOI |

Abstract: ...two circular chromosomes that have...by chromosome inversion, assuming that...that tight bounds on the...simple and efficient algorithms |
|||||

BibTeX:
@inproceedings{Kececioglu1994, author = {John D. Kececioglu and David Sankoff}, title = {Efficient Bounds for Oriented Chromosome Inversion Distance}, booktitle = {CPM '94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag}, year = {1994}, pages = {307--325}, doi = {http://dx.doi.org/10.1007/3-540-58094-8_26} } |
|||||

Kececioglu, J. D. & Sankoff, D. | Exact and Approximation Algorithms for the Inversion Distance Between Two Chromosomes | 1993 | CPM '93: Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching | inproceedings | DOI |

Abstract: Motivated by the problem in computational...the series of chromosome inversions by which one... |
|||||

BibTeX:
@inproceedings{Kececioglu1993, author = {John D. Kececioglu and David Sankoff}, title = {Exact and Approximation Algorithms for the Inversion Distance Between Two Chromosomes}, booktitle = {CPM '93: Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag}, year = {1993}, pages = {87--105}, doi = {http://dx.doi.org/10.1007/BFb0029799} } |
|||||

Kim, S. & Park, K. | A Dynamic Edit Distance Table | 2000 | COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching | inproceedings | URL |

Abstract: ...of the edit distance problem: given a solution to...the D-table, which leads... |
|||||

BibTeX:
@inproceedings{Kim2000, author = {Sung-Ryul Kim and Kunsoo Park}, title = {A Dynamic Edit Distance Table}, booktitle = {COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag}, year = {2000}, pages = {60--68}, url = {http://www.sinab.unal.edu.co:2090/content/?k=A+Dynamic+Edit+Distance+Table} } |
|||||

Klein, P. N. | Computing the Edit-Distance between Unrooted Ordered Trees | 1998 | ESA '98: Proceedings of the 6th Annual European Symposium on Algorithms | inproceedings | URL |

Abstract: An ordered tree is a...think of the tree as...trees. The edit distance between A and... |
|||||

BibTeX:
@inproceedings{740125, author = {Philip N. Klein}, title = {Computing the Edit-Distance between Unrooted Ordered Trees}, booktitle = {ESA '98: Proceedings of the 6th Annual European Symposium on Algorithms}, publisher = {Springer-Verlag}, year = {1998}, pages = {91--102}, url = {http://www.sinab.unal.edu.co:2090/content/?k=Computing+the+Edit-Distance+between+Unrooted+Ordered+Trees} } |
|||||

Knight, J. R. & Myers, E. W. | Approximate regular expression pattern matching with concave gap penalties | 1992 | inbook | DOI | |

Abstract: Given a sequence A of length M and a regular expression R of length P, an approximate regular expression pattern matching algorithm computes the score of the best alignment between A and one of the sequences exactly matched by R. There are a variety of schemes for scoring alignments. In a concave gap-penalty scoring scheme, a function delta(a, b) gives the score of each aligned pair of symbols a and b, and a concave function w(k) gives the score of a sequence of unaligned symbols, or gap, of length k. A function w is concave if and only if it has the property that for all k > 1, w(k+ 1)-w(k) |
|||||

BibTeX:
@inbook{Knight1992, author = {James R. Knight and Eugene W. Myers}, title = {Approximate regular expression pattern matching with concave gap penalties}, year = {1992}, volume = {644/1992}, pages = {67-78}, doi = {http://dx.doi.org/10.1007/3-540-56024-6} } |
|||||

Kolman, P. & Walen, T. | Approximating reversal distance for strings with bounded number of duplicates | 2007 | Discrete Applied Mathematics | article | DOIURL |

Abstract: For a string A=a1...an, a reversal [rho](i,j), 1[less-than-or-equals, slant]i[less-than-or-equals, slant]j[less-than-or-equals, slant]n, transforms the string A into a string A'=a1...ai-1ajaj-1...aiaj+1... an, that is, the reversal [rho](i,j) reverses the order of symbols in the substring ai...aj of A. In the case of signed strings, where each symbol is given a sign + or -, the reversal operation also flips the sign of each symbol in the reversed substring. Given two strings, A and B, signed or unsigned, sorting by reversals (SBR) is the problem of finding the minimum number of reversals that transform the string A into the string B. Traditionally, the problem was studied for permutations, that is, for strings in which every symbol appears exactly once. We consider a generalization of the problem, k-SBR, and allow each symbol to appear at most k times in each string, for some k[greater-or-equal, slanted]1. The main result of the paper is an O(k2)-approximation algorithm running in time O(n). For instances with , this is the best known approximation algorithm for k-SBR and, moreover, it is faster than the previous best approximation algorithm. |
|||||

BibTeX:
@article{Kolman2007, author = {Petr Kolman and Tomasz Walen}, title = {Approximating reversal distance for strings with bounded number of duplicates}, journal = {Discrete Applied Mathematics}, year = {2007}, volume = {155}, number = { 3}, pages = {327-336}, url = {http://www.sciencedirect.com/science/article/B6TYW-4KV2R92-1/2/55bead7bd33c89b9a27d222815837736}, doi = {http://dx.doi.org/10.1016/j.dam.2006.05.011} } |
|||||

Konstantinidis, S. | Computing the Levenshtein distance of a regular language | 2005 | IEEE Information Theory Workshop | inproceedings | DOI |

Abstract: The edit distance (or Levenshtein distance) between two words is the smallest number of substitutions, insertions, and deletions of symbols that can be used to transform one of the words into the other. In this paper we consider the problem of computing the edit distance of a regular language (also known as constraint system), that is, the set of words accepted by a given finite automaton. This quantity is the smallest edit distance between any pair of distinct words of the language. We show that the problem is of polynomial time complexity. We distinguish two cases depending on whether the given automaton is deterministic or nondeterministic. In the latter case the time complexity is higher. |
|||||

BibTeX:
@inproceedings{Konstantinidis2005, author = {Konstantinidis, S. }, title = {Computing the Levenshtein distance of a regular language}, booktitle = {IEEE Information Theory Workshop}, year = {2005}, pages = {4 pp}, doi = {http://dx.doi.org/10.1109/ITW.2005.1531868} } |
|||||

Kurtz, S. | Approximate string searching under weighted edit distance | 1996 | Proceedings of the 3rd South American Workshop on String Processing | inproceedings | URL |

Abstract: Let p 2 Sigma be a string of length m and t 2 Sigma be a string of length n. The approximate string searching problem is to find all approximate matches of p in t having weighted edit distance at most k from p. We present a new method that preprocesses the pattern into a DFA which scans t online in linear time, thereby recognizing all positions in t where an approximate match ends. We show how to reduce the exponential preprocessing effort and propose two practical algorithms. The... |
|||||

BibTeX:
@inproceedings{Kurtz1996, author = {S. Kurtz}, title = {Approximate string searching under weighted edit distance}, booktitle = {Proceedings of the 3rd South American Workshop on String Processing}, publisher = {#cup#}, year = {1996}, pages = {156--170}, url = {citeseer.ist.psu.edu/kurtz96approximate.html} } |
|||||

Landau, G. M., Myers, E. & Ziv-Ukelson, M. | Two Algorithms for LCS Consecutive Suffix Alignment [BibTeX] |
2007 | Journal of Computer and System Sciences | article | DOI |

BibTeX:
@article{Landau2007, author = {Gad M. Landau and Eugene Myers and Michal Ziv-Ukelson}, title = {Two Algorithms for LCS Consecutive Suffix Alignment}, journal = {Journal of Computer and System Sciences}, year = {2007}, doi = {http://dx.doi.org/10.1016/j.jcss.2007.03.004} } |
|||||

Lee, I. & Pinzón, Y. J. | Linear Time Algorithm for the Generalised Longest Common Repeat Problem | 2005 | Proceedings of the 12th International Symposium on String Processing and Information Retrieval (SPIRE'05) | inproceedings | URL |

Abstract: Given a set of strings U = T1, T2, . . . , T, the longest common repeat problem is to find the longest common substring that appears at least twice in each string of U, considering direct, inverted, mirror as well as everted repeats. In this paper we define the generalised longest common repeat problem, where we can set the number of times that a repeat should appear in each string. We present a linear time algorithm for this problem using the suffix array. We also show an application of our algorithm for finding a longest common substring which appears only in a subset U of U but not in U −U |
|||||

BibTeX:
@inproceedings{Lee2005, author = {Inbok Lee and Yoan José Pinzón}, title = {Linear Time Algorithm for the Generalised Longest Common Repeat Problem}, booktitle = {Proceedings of the 12th International Symposium on String Processing and Information Retrieval (SPIRE'05)}, publisher = {Springer-Verlag}, year = {2005}, volume = {3772}, pages = {191-201}, url = {http://dis.unal.edu.co/profesores/ypinzon/ps/spire05a.pdf} } |
|||||

Lemström, K., Navarro, G. & Pinzón, Y. | Practical algorithms for transposition-invariant string-matching | 2005 | Journal of Discrete Algorithms | article | DOIURL |

Abstract: We consider the problems of (1) longest common subsequence (LCS) of two given strings in the case where the first may be shifted by some constant (that is, transposed) to match the second, and (2) transposition-invariant text searching using indel distance. These problems have applications in music comparison and retrieval. We introduce two novel techniques to solve these problems efficiently. The first is based on the branch and bound method, the second on bit-parallelism. Our branch and bound algorithm computes the longest common transposition-invariant subsequence (LCTS) in time O((m2+loglog[sigma])log[sigma]) in the best case and O((m2+log[sigma])[sigma]) in the worst case, where m and [sigma], respectively, are the length of the strings and the size of the alphabet. On the other hand, we show that the same problem can be solved by using bit-parallelism and thus obtain a speedup of O(w/logm) over the classical algorithms, where the computer word has w bits. The advantage of this latter algorithm over the present bit-parallel ones is that it allows the use of more complex distances, including general integer weights. Since our branch and bound method is very flexible, it can be further improved by combining it with other efficient algorithms such as our novel bit-parallel algorithm. We experiment on several combination possibilities and discuss which are the best settings for each of those combinations. Our algorithms are easily extended to other musically relevant cases, such as [delta]-matching and polyphony (where there are several parallel texts to be considered). We also show how our bit-parallel algorithm is adapted to text searching and illustrate its effectiveness in complex cases where the only known competing method is the use of brute force. |
|||||

BibTeX:
@article{Lemstrom2005, author = {Kjell Lemstr\"om and Gonzalo Navarro and Yoan Pinz\'on}, title = {Practical algorithms for transposition-invariant string-matching}, journal = {Journal of Discrete Algorithms}, year = {2005}, volume = {3}, number = {2-4}, pages = {267-292}, note = {Combinatorial Pattern Matching (CPM) Special Issue}, url = {http://www.sciencedirect.com/science/article/B758J-4DBJSWW-3/2/6071057098593f208055c4cfd1cf580f}, doi = {http://dx.doi.org/10.1016/j.jda.2004.08.009} } |
|||||

Liben-Nowell, D. | On the Structure of Syntenic Distance | 1999 | CPM '99: Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching | inproceedings | URL |

Abstract: This paper examines some of the rich structure of the syntenic distance model of evolutionarydistance, introduced by Ferretti, Nadeau, and Sanko. The syntenic distance between two genomes is the minimum number of ssions, fusions, and translocations required to transform one into the other, ignoring gene order within chromosomes. We prove that the previously unanalyzed algorithm given by Ferretti et al. is a 2-approximation and no better, and that, further, it always outperforms the algorithm presented by DasGupta, Jiang, Kannan, Li, and Sweedyk. We also prove the same results for an improved version of the Ferretti et al. algorithm. We then prove a number of properties which give insight into the structure of optimal move sequences. We give instances in which any move sequence working solely within connected components is nearly twice optimal, and a general lower bound based on the spread of genes from each chromosome. We then prove a monotonicity property for the syntenic distance, and bound the diculty of the hardest instance of any size. We discuss the results of implementing these algorithms and testing them on real and simulated synteny data. |
|||||

BibTeX:
@inproceedings{Liben-Nowell1999, author = {David Liben-Nowell}, title = {On the Structure of Syntenic Distance}, booktitle = {CPM '99: Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag}, year = {1999}, pages = {50--65}, url = {http://portal.acm.org/citation.cfm?id=738636} } |
|||||

Liu, H. & Fu, K. | VLSI Arrays for Minimum-Distance Classifications | 1984 | VLSI for Pattern Recognition and Image Processing | incollection | URL |

Abstract: NA |
|||||

BibTeX:
@incollection{Liu1984, author = {Liu, HH and Fu, KS}, title = {{VLSI Arrays for Minimum-Distance Classifications}}, journal = {VLSI for Pattern Recognition and Image Processing}, year = {1984}, volume = {63}, url = {http://scholar.google.com.co/scholar.bib?num=100&hl=es&lr=&q=info:R2f6u7ENgn0J:scholar.google.com/&output=citation&oe=ASCII&oi=citation} } |
|||||

Lu, S. & Fu, K. | Error-Correcting Tree Automata for Syntactic Pattern Recognition | 1978 | IEEE Transactions on Computers | article | URL |

Abstract: The syntax errors on trees are defined in terms of five types of error transformations, namely, substitution, stretch, split, branch, and deletion. The distance between two trees is the least cost sequence of error transformations needed to transform one to the other. Based on this definition, a class of error-correcting tree automata (ECTA) is proposed. The operation of ECTA is illustrated by a character recognition example. |
|||||

BibTeX:
@article{Lu1978, author = {Shin-Yee Lu and King-Sun Fu}, title = {Error-Correcting Tree Automata for Syntactic Pattern Recognition}, journal = {IEEE Transactions on Computers}, year = {1978}, volume = {C-27}, number = {11}, pages = {1040 - 1053}, url = {http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1674993&isnumber=35164&punumber=12&k2dockey=1674993@ieeejrns&query=%28%28automata+for+syntactic+pattern+recognition%29%3Cin%3Emetadata%29&pos=0} } |
|||||

Makinen, V., Navarro, G. & Ukkonen, E. | Transposition invariant string matching | 2005 | Journal of Algorithms | article | DOIURL |

Abstract: Given strings A=a1a2...am and B=b1b2...bn over an alphabet , where is some numerical universe closed under addition and subtraction, and a distance function d(A,B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is , where A+t=(a1+t)(a2+t)...(am+t). We study the problem of computing the transposition invariant distance for various distance (and similarity) functions d, including Hamming distance, longest common subsequence (LCS), Levenshtein distance, and their versions where the exact matching condition is replaced by an approximate one. For all these problems we give algorithms whose time complexities are close to the known upper bounds without transposition invariance, and for some we achieve these upper bounds. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems, and its connection with multidimensional range-minimum search. As a byproduct, we give improved sparse dynamic programming algorithms to compute LCS and Levenshtein distance. |
|||||

BibTeX:
@article{Makinen2005, author = {Veli Makinen and Gonzalo Navarro and Esko Ukkonen}, title = {Transposition invariant string matching}, journal = {Journal of Algorithms}, year = {2005}, volume = {56}, number = {2}, pages = {124-153}, url = {http://www.sciencedirect.com/science/article/B6WH3-4DB57J3-2/2/f34d24da44c09bf00ba099456cc89535}, doi = {doi:10.1016/j.jalgor.2004.07.008} } |
|||||

Marzal, A. & Vidal, E. | Computation of normalized edit distance and applications | 1993 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOIURL |

Abstract: Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X,Y) is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P). It is shown that in general, d(X ,Y) cannot be computed by first obtaining the conventional (unnormalized) edit distance between X and Y and then normalizing this value by the length of the corresponding editing path. In order to compute normalized edit distances, an algorithm that can be implemented to work in O(m×n2) time and O( n2) memory space is proposed, where m and n are the lengths of the strings under consideration, and m ⩾n. Experiments in hand-written digit recognition are presented, revealing that the normalized edit distance consistently provides better results than both unnormalized or post-normalized classical edit distances |
|||||

BibTeX:
@article{Marzal1993, author = {Andr\'es Marzal and Enrique Vidal}, title = {Computation of normalized edit distance and applications}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {1993}, volume = {15}, number = { 9}, pages = {926 - 932}, url = {http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=232078}, doi = {http://dx.doi.org/10.1109/34.232078} } |
|||||

Masek, W. & Paterson, M. | How to compute string-edit distances quickly | 1983 | Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison | incollection | URL |

Abstract: NA |
|||||

BibTeX:
@incollection{Masek1983, author = {Masek, WJ and Paterson, MS}, title = {{How to compute string-edit distances quickly}}, journal = {Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison}, year = {1983}, pages = {337--349}, url = {http://scholar.google.com.co/scholar.bib?num=100&hl=es&lr=&q=info:ckdmlVTdHD8J:scholar.google.com/&output=citation&oe=ASCII&oi=citation} } |
|||||

Masek, W. & Paterson, M. | A faster algorithm computing string edit distances | 1980 | Journal of Computer and System Sciences | article | URL |

Abstract: NA |
|||||

BibTeX:
@article{Masek1980, author = {Masek, W.J. and Paterson, M.S.}, title = {{A faster algorithm computing string edit distances}}, journal = {Journal of Computer and System Sciences}, year = {1980}, volume = {20}, number = {1}, pages = {18--31}, url = {http://scholar.google.com.co/scholar.bib?num=100&hl=es&lr=&q=info:amGRBj6hhYkJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation} } |
|||||

Meidanis, J. | Distance and similarity in the presence of nonincreasing gap-weighting functions | Proc. of the II South American Workshop on String Processing | inproceedings | URL | |

Abstract: NA |
|||||

BibTeX:
@inproceedings{Meidanis, author = {Meidanis, J.}, title = {{Distance and similarity in the presence of nonincreasing gap-weighting functions}}, journal = {Proc. of the II South American Workshop on String Processing}, pages = {27--37}, url = {http://scholar.google.com.co/scholar.bib?num=100&hl=es&lr=&q=info:gxvQUJxkZhMJ:scholar.google.com/&output=citation&oe=ASCII&oi=citation} } |
|||||

Meidanis, J., Walter, M. & Dias, Z. | Transposition distance between a permutation and its reverse | 1997 | Proceedings of the 4th South American Workshop on String | inproceedings | URL |

Abstract: In this note we solve an open question posed by Bafna and Pevzner |
|||||

BibTeX:
@inproceedings{Meidanis1997, author = {J. Meidanis and M.E. Walter and Z. Dias}, title = {Transposition distance between a permutation and its reverse}, booktitle = {Proceedings of the 4th South American Workshop on String}, publisher = {#cup#}, year = {1997}, pages = {70--79}, url = {http://scholar.google.com.co/scholar?num=100&hl=es&lr=&cluster=15996713785801433440} } |
|||||

Myers, E. W. & Miller, W. | Approximate matching of regular expressions | 1989 | Bulletin of Mathematical Biology | article | URL |

Abstract: Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for substrings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M + N) + N2log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length. |
|||||

BibTeX:
@article{Myers1989, author = {Eugene W. Myers and Webb Miller}, title = {Approximate matching of regular expressions}, journal = {Bulletin of Mathematical Biology}, year = {1989}, volume = {51}, number = {1}, pages = {5-37}, url = {http://www.sciencedirect.com/science/article/B6WC7-4GP24Y4-2/2/53192b636f53c7c8797acdffe1060772} } |
|||||

Myers, R., Wilson, R. & Hancock, E. | Efficient relational matching with local edit distance | 1998 | Proceedings Fourteenth International Conference on Pattern Recognition | inproceedings | DOI |

Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The paper develops the idea of edit distance originally used for graph-matching by Sanfeliu and Fu (1983). We show how the normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare the resulting graph-matching algorithm with that reported by Wilson and Hancock (1997) |
|||||

BibTeX:
@inproceedings{Myers1998, author = {Myers, R. and Wilson, R.C. and Hancock, E.R.}, title = {Efficient relational matching with local edit distance}, booktitle = {Proceedings Fourteenth International Conference on Pattern Recognition}, year = {1998}, volume = {2}, pages = {1711 - 1714}, doi = {http://dx.doi.org/10.1109/ICPR.1998.712053} } |
|||||

Navarro, G. | Pattern matching [BibTeX] |
2004 | Journal of Applied Statistics | article | URL |

BibTeX:
@article{Navarro2004a, author = {G. Navarro}, title = {Pattern matching}, journal = {Journal of Applied Statistics}, year = {2004}, volume = {31}, number = {8}, pages = {925--949}, note = {Special issue on Pattern Discovery.}, url = {http://www.dcc.uchile.cl/~gnavarro/publ.html} } |
|||||

Navarro, G. & Baeza-Yates, R. | Very fast and simple approximate string matching | 1999 | Information Processing Letters | article | DOIURL |

Abstract: We improve the fastest known algorithm for approximate string matching, which can be used only for low error levels. By using a new method to verify potential matches and a new optimization technique for biased texts (such as English), the algorithm also becomes the fastest one for medium error levels. This includes most of the interesting cases in this area. |
|||||

BibTeX:
@article{Navarro1999, author = {Gonzalo Navarro and Ricardo Baeza-Yates}, title = {Very fast and simple approximate string matching}, journal = {Information Processing Letters}, year = {1999}, volume = {72}, number = {1-2}, pages = {65-70}, url = {http://www.sciencedirect.com/science/article/B6V0F-3Y0JBNH-9/2/3f1418375e8c7f068b58904c84cbe129}, doi = {http://dx.doi.org/10.1016/S0020-0190(99)00121-0} } |
|||||

Navarro, G. & Fredriksson, K. | Average complexity of exact and approximate multiple string matching | 2004 | Theoretical Computer Science | article | DOIURL |

Abstract: We show that the average number of characters examined to search for r random patterns of length m in a text of length n over a uniformly distributed alphabet of size [sigma] cannot be less than [Omega](n log[sigma](rm)/m). When we permit up to k insertions, deletions, and/or substitutions of characters in the occurrences of the patterns, the lower bound becomes (n(k+log[sigma](rm))/m). This generalizes previous single-pattern lower bounds of Yao (for exact matching) and of Chang and Marr (for approximate matching), and proves the optimality of several existing multipattern search algorithms. |
|||||

BibTeX:
@article{Navarro2004, author = {Gonzalo Navarro and Kimmo Fredriksson}, title = {Average complexity of exact and approximate multiple string matching}, journal = {Theoretical Computer Science}, year = {2004}, volume = {321}, number = {2-3}, pages = {283-290}, url = {http://www.sciencedirect.com/science/article/B6V1G-4C477DB-7/2/42fbd86fd89d621bbed9c2bd9fba1c43}, doi = {http://dx.doi.org/10.1016/j.tcs.2004.03.058} } |
|||||

Nykanen, M. & Ukkonen, E. | The Exact Path Length Problem | 2002 | Journal of Algorithms | article | DOIURL |

Abstract: We study a problem related to finding shortest paths in weighted graphs. We ask whether or not there is a path between two nodes that has a given total cost k. The edge weights of the graph can be both positive and negative integers or even integer vectors. We show that many variants of this problem are NP-complete. We develop a pseudo-polynomial algorithm for (both positive and negative) integer weights. The running time of this algorithm is O(W2n3 + k min(k, W)n2), where n is the number of nodes in the graph, W is the largest absolute value of any edge weight, and k is the target cost. The algorithm is based on preprocessing the graph with a relaxation algorithm to eliminate the effects of weight sign alternations along a path. This preprocessing stage is applicable to other problems as well. For example, we show how to find the minimum absolute cost of any path between two given nodes in a graph with integer weights in O(W2n3) time. |
|||||

BibTeX:
@article{Nykanen2002, author = {Matti Nykanen and Esko Ukkonen}, title = {The Exact Path Length Problem}, journal = {Journal of Algorithms}, year = {2002}, volume = {42}, number = {1}, pages = {41-53}, url = {http://www.sciencedirect.com/science/article/B6WH3-45B552P-3/2/861c2f6903c5f4a239875b6e82e4602f}, doi = {http://dx.doi.org/10.1006/jagm.2001.1201} } |
|||||

Oommen, B. & Zhang, K. | The normalized string editing problem revisited | 1996 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOI |

Abstract: Marzal and Vidal (1993) considered the problem of computing the normalized edit distance between two strings, and reported experimental results which demonstrated the use of the measure to recognize hand-written characters. Their paper formulated the theoretical properties of the measure and developed two algorithms to compute it. In this short communication the authors demonstrate how this measure is related to an auxiliary measure already defined in the literature-the inter-string constrained edit distance. Since the normalized edit distance can be computed efficiently using the latter, the analytic and experimental results reported in the above paper can be obtained just as accurately, but more efficiently, using the strategies presented here |
|||||

BibTeX:
@article{Oommen1996, author = {Oommen, B.J. and Zhang, K.}, title = {The normalized string editing problem revisited}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {1996}, volume = {18}, number = {6}, pages = {669 - 672}, doi = {http://dx.doi.org/10.1109/34.506420} } |
|||||

Oommen, B. J. | Recognition of noisy subsequences using constrained edit distances | 1987 | IEEE Trans. Pattern Anal. Mach. Intell. | article | URL |

Abstract: NA |
|||||

BibTeX:
@article{Oommen1987, author = {B. J. Oommen}, title = {Recognition of noisy subsequences using constrained edit distances}, journal = {IEEE Trans. Pattern Anal. Mach. Intell.}, publisher = {IEEE Computer Society}, year = {1987}, volume = {9}, number = {5}, pages = {676--685}, url = {http://ieeexplore.ieee.org/xpl/tocresult.jsp?isYear=2007&isnumber=4135669&Submit32=Go+To+Issue} } |
|||||

Ouangraoua, A., Ferraro, P., Tichit, L. & Dulucq, S. | Local similarity between quotiented ordered trees | 2007 | Journal of Discrete Algorithms | article | DOIURL |

Abstract: In this paper we propose a dynamic programming algorithm to evaluate local similarity between ordered quotiented trees using a constrained edit scoring scheme. A quotiented tree is a tree defined with an additional equivalent relation on vertices and such that the quotient graph is also a tree. The core of the method relies on two adaptations of an algorithm proposed by Zhang et al. [K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems (1989) 1245-1262] for comparing ordered rooted trees. After some preliminary definitions and the description of this tree edit algorithm, we propose extensions to globally and locally compare two quotiented trees. This last method allows to find the region in each tree with the highest similarity. Algorithms are currently being used in genomic analysis to evaluate variability between RNA secondary structures. |
|||||

BibTeX:
@article{Ouangraoua2007, author = {Aida Ouangraoua and Pascal Ferraro and Laurent Tichit and Serge Dulucq}, title = {Local similarity between quotiented ordered trees}, journal = {Journal of Discrete Algorithms}, year = {2007}, volume = {5}, number = {1}, pages = {23-35}, url = {http://www.sciencedirect.com/science/article/B758J-4JXY3VF-1/2/56b1ec1b7c1b91427ef90cbfb8c12ace}, doi = {http://dx.doi.org/10.1016/j.jda.2006.03.010} } |
|||||

Park, S., Lee, D. & Chu, W. | Fast retrieval of similar subsequences in long sequence databases | 1999 | Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX '99) | inproceedings | DOI |

Abstract: Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer to use high-cost distance functions such as the time warping distance and the editing distance for wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic to the average length L¯ of data sequences. We propose a novel subsequence matching scheme, called the aligned subsequence matching, where the number of subsequences to be compared with a query sequence is reduced to linear to L¯. We also present an indexing technique to speed-up the aligned subsequence matching using the similarity measure of the modified time warping distance. Experiments on synthetic data sequences demonstrate the effectiveness of our proposed approach; ours consistently outperformed sequential scanning and achieved an up to 6.5 times speed-up |
|||||

BibTeX:
@inproceedings{Park1999, author = {Park, S. and Lee, D. and Chu, W.W.}, title = {Fast retrieval of similar subsequences in long sequence databases}, booktitle = {Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX '99)}, year = {1999}, pages = {60 - 67}, doi = {http://dx.doi.org/10.1109/KDEX.1999.836610} } |
|||||

Pighizzini, G. | How hard is computing the edit distance? | 2001 | Information and Computation | article | DOI |

Abstract: The notion of edit distance arises in very different fields such as self-correcting codes, parsing theory, speech recognition, and molecular biology. The edit distance between an input string and a language L is the minimum cost of a sequence of edit operations (substitution of a symbol in another incorrect symbol, insertion of an extraneous symbol, deletion of a symbol) needed to change the input string into a sentence of L. In this paper we study the complexity of computing the edit distance, discovering sharp boundaries between classes of languages for which this function can be efficiently evaluated and classes of languages for which it seems to be difficult to compute. Our main result is a parallel algorithm for computing the edit distance for the class of languages accepted by one-way nondeterministic auxiliary pushdown automata working in polynomial time, a class that strictly contains context–free languages. Moreover, we show that this algorithm can be extended in order to find a sentence of the language from which the input string has minimum distance. |
|||||

BibTeX:
@article{Pighizzini2001, author = {Giovanni Pighizzini}, title = {How hard is computing the edit distance?}, journal = {Information and Computation}, publisher = {Academic Press, Inc.}, year = {2001}, volume = {165}, number = {1}, pages = {1--13}, doi = {http://dx.doi.org/10.1006/inco.2000.2914} } |
|||||

Pinzón, Y. | Efficient Approximate String Algorithms Using Bit-Vector Techniques [BibTeX] |
2001 | School: King's College |
phdthesis | URL |

BibTeX:
@phdthesis{Pinzon2001, author = {Yoan Pinz\'on}, title = {Efficient Approximate String Algorithms Using Bit-Vector Techniques}, school = {King's College}, year = {2001}, url = {http://dis.unal.edu.co/profesores/ypinzon/pub.html} } |
|||||

Rafiei, D., Moise, D. & Sun, D. | Finding Syntactic Similarities Between XML Documents | 2006 | 17th International Conference on Database and Expert Systems Applications. DEXA'06 | conference | DOI |

Abstract: Detecting structural similarities between XML documents has been the subject of several recent work, and the proposed algorithms mostly use tree edit distance between the corresponding trees of XML documents. However, evaluating a tree edit distance is computationally expensive and does not easily scale up to large collections. We show in this paper that a tree edit distance computation often is not necessary and can be avoided. In particular, we propose a concise structural summary of XML documents and show that a comparison based on this summary is both fast and effective. Our experimental evaluation shows that this method does an excellent job of grouping documents generated by the same DTD, outperforming some of the previously proposed solutions based on a tree comparison. Furthermore, the time complexity of the algorithm is linear on the size of the structural description. |
|||||

BibTeX:
@conference{Rafiei2006, author = {Rafiei, D. and Moise, D.L. and Dabo Sun}, title = {Finding Syntactic Similarities Between XML Documents}, booktitle = {17th International Conference on Database and Expert Systems Applications. DEXA'06}, year = {2006}, pages = {512 - 516}, doi = {http://dx.doi.org/10.1109/DEXA.2006.62} } |
|||||

Rice, S. V., Bunke, H. & Nartker, T. A. | Classes of Cost Functions for String Edit Distance | 1997 | Algorithmica | article | DOI |

Abstract: Finding a sequence of edit operations that transforms one string of symbols into another with the minimum cost is a well-known problem. The minimum cost, or edit distance, is a widely used measure of the similarity of two strings. An important parameter of this problem is the cost function, which specifies the cost of each insertion, deletion, and substitution. We show that cost functions having the same ratio of the sum of the insertion and deletion costs divided by the substitution cost yield the same minimum cost sequences of edit operations. This leads to a partitioning of the universe of cost functions into equivalence classes. Also, we show the relationship between a particular set of cost functions and the longest common subsequence of the input strings. |
|||||

BibTeX:
@article{Rice1997, author = {S. V. Rice and H. Bunke and T. A. Nartker}, title = {Classes of Cost Functions for String Edit Distance}, journal = {Algorithmica}, year = {1997}, volume = {18}, number = {2}, pages = {271--280}, doi = {http://dx.doi.org/10.1007/BF02526038} } |
|||||

Ristad, E. S. & Yianilos, P. N. | Learning string-edit distance | 1998 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOIURL |

Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes |
|||||

BibTeX:
@article{Ristad1998, author = {Ristad, E. S. and Yianilos, P. N.}, title = {Learning string-edit distance}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {1998}, volume = {20}, number = {5}, pages = {522-532}, url = {http://ieeexplore.ieee.org/iel4/34/14993/00682181.pdf?isnumber=14993∏=STD&arnumber=682181&arnumber=682181&arSt=522&ared=532&arAuthor=Ristad%2C+E.S.%3B+Yianilos%2C+P.N.}, doi = {http://dx.doi.org/10.1109/34.682181} } |
|||||

Régnier, M. | A language approach to string searching evaluation | 1992 | Combinatorial Pattern Matching | article | DOI |

Abstract: We propose a general framework to derive average performance of string searching algorithms that preprocess the pattern. It relies mainly on languages and combinatorics on words, joined to some probabilistic tools. The approach is quite powerful: although we concentrate here on Morris-Pratt and Boyer-Moore-Horspool, it applies to a large class of algorithms. A fairly general character distribution is assumed, namely a Markovian one, suitable for applications such as natural languages or biological databases searching. The average searching time, expressed as the number of text-pattern comparisons, is proven to be asymptotically Kn and the linearity constant is given.This work was partially supported by the ESPRIT II Basic Research Actions Program of the EC under contract No. 3075 (project ALCOM) |
|||||

BibTeX:
@article{Regnier1992, author = {Mireille Régnier}, title = {A language approach to string searching evaluation}, journal = {Combinatorial Pattern Matching}, year = {1992}, volume = {644/1992}, pages = {15-26}, doi = {http://dx.doi.org/10.1007/3-540-56024-6_2} } |
|||||

Sahinalp, S., Tasan, M., Macker, J. & Ozsoyoglu, Z. | Distance based indexing for string proximity search | 2003 | Proceedings 19th International Conference on Data Engineering | inproceedings | URL |

Abstract: In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on (a weighted) count of (i) character edit or (ii) block edit operations to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which are originally designed for metric distances can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and weighted character edit distance are almost metrics. In order to analyze the performance of distance based indexing methods (in particular VP trees) for strings, we then develop a model based on distribution of pairwise distances. Based on this model we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings. |
|||||

BibTeX:
@inproceedings{Sahinalp2003, author = {Sahinalp, S.C. and Tasan, M. and Macker, J. and Ozsoyoglu, Z.M.}, title = {Distance based indexing for string proximity search}, booktitle = {Proceedings 19th International Conference on Data Engineering}, year = {2003}, number = {5-8}, pages = {125- 136}, url = {http://ieeexplore.ieee.org/iel5/8910/28179/01260787.pdf?isnumber=28179∏=STD&arnumber=1260787&arnumber=1260787&arSt=+125&ared=+136&arAuthor=Sahinalp%2C+S.C.%3B+Tasan%2C+M.%3B+Macker%2C+J.%3B+Ozsoyoglu%2C+Z.M.} } |
|||||

Samet, H. | Distance transform for images represented by quadtrees | 1982 | IEEE TRANS. PATTERN ANALY. AND MACH. INTELLIG | article | URL |

Abstract: NA |
|||||

BibTeX:
@article{Samet1982, author = {H. Samet}, title = {Distance transform for images represented by quadtrees}, journal = {IEEE TRANS. PATTERN ANALY. AND MACH. INTELLIG}, year = {1982}, volume = {4}, number = {3}, pages = {298--303}, url = {http://scholar.google.com.co/scholar?num=100&hl=es&lr=&cluster=10077354743656118870} } |
|||||

Sankoff, D. | Edit distance for genome comparison based on non-local operations | 1992 | Combinatorial Pattern Matching | inproceedings | DOI |

Abstract: Detailed knowledge of gene maps and and even complete nucleotide sequences for small genomes has led to the feasibility of evolutionary inference based on the macrostructure of entire genomes, rather than on the traditional comparison of homologous versions of a single gene in different organisms. In this paper, we define a number of measures of gene order rearrangement, describe algorithm design and software development for the calculation of some of these quantities in single-chromosome genomes, and report on the the results of applying these tools to a database of mitochondrial gene orders inferred from genomic sequences.Work supported by operating and infrastructure grants from the Natural Science and Engineering Research Council (Canada) and a team grant from the Fonds pour la formation de chercheurs et l'aide à la recherche (Quebec). The author is a Fellow of the Canadian Institute for Advanced Research. This paper was begun while a guest of Professor Claude Weber at the University of Geneva in 1991. I thank Natalie Antoine, Robert Cedergren, Franz Lang and Bruno Paquin who constructed the database from which the mitochondrial gene orders used here were drawn, and especially Guillaume Leduc for his perseverance in implementing DERANGE |
|||||

BibTeX:
@inproceedings{Sankoff1992, author = {D. Sankoff}, title = {Edit distance for genome comparison based on non-local operations}, booktitle = {Combinatorial Pattern Matching}, publisher = {#svb#}, year = {1992}, number = {664}, pages = {121--135}, doi = {http://dx.doi.org/10.1007/3-540-56024-6} } |
|||||

Sastry, R., Ranganathan, N. & Remedios, K. | CASM: a VLSI chip for approximate string matching | 1995 | IEEE Transactions onPattern Analysis and Machine Intelligence | article | DOI |

Abstract: he edit distance between two strings a1, ..., am and b 1, ..., bn is the minimum cost s of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This paper describes the design and implementation of a linear systolic array chip for computing the edit distance between two strings over a given alphabet. An encoding scheme is proposed which reduces the number of bits required to represent a state in the computation. The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs. More importantly, the architecture does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest neighbor communication, which makes it suitable for VLSI implementation. A prototype of this array has been built at the University of South Florida |
|||||

BibTeX:
@article{Sastry1995, author = {Sastry, R. and Ranganathan, N. and Remedios, K. }, title = {CASM: a VLSI chip for approximate string matching}, journal = {IEEE Transactions onPattern Analysis and Machine Intelligence}, year = {1995}, volume = {17}, number = {8}, pages = {824 - 830}, doi = {http://dx.doi.org/10.1109/34.400575} } |
|||||

Sellers, P. | The Theory and Computation of Evolutionary Distances: Pattern Recognition | 1980 | J. Algorithms | article | URL |

Abstract: NA |
|||||

BibTeX:
@article{Sellers1980, author = {Sellers, P.H.}, title = {{The Theory and Computation of Evolutionary Distances: Pattern Recognition}}, journal = {J. Algorithms}, year = {1980}, volume = {1}, number = {4}, pages = {359--373}, url = {http://scholar.google.com.co/scholar?num=100&hl=es&lr=&q=The+theory+and+computation+of+evolutionary+distances%3A+Pattern+recognition&btnG=B%C3%BAsqueda&lr=} } |
|||||

Sellers, P. | An algorithm for the distance between two finite sequences | 1974 | Journal of Combinatorial Theory | article | URL |

Abstract: NA |
|||||

BibTeX:
@article{Sellers1974, author = {Sellers, P.H.}, title = {{An algorithm for the distance between two finite sequences}}, journal = {Journal of Combinatorial Theory}, year = {1974}, volume = {16}, pages = {253--258}, url = {http://scholar.google.com.co/scholar?num=100&hl=es&lr=&q=An+algorithm+for+the+distance+between+two+finite+sequences&btnG=B%C3%BAsqueda&lr=} } |
|||||

Sellers, P. H. | On the theory and computation of evolutionary distance | 1974 | SIAM Journal on Applied Mathematics | article | URL |

Abstract: This paper gives a formal definition of the biological concept of evolutionary distance and an algorithm to compute it. For any set $S$ of finite sequences of varying lengths this distance is a real-valued function on $S times S$, and it is shown to be a metric under conditions which are wide enough to include the biological application. The algorithm, introduced here, lends itself to computer programming and provides a method to compute evolutionary distance which is shorter than the other methods currently in use. |
|||||

BibTeX:
@article{Sellers1974a, author = {P. H. Sellers}, title = {On the theory and computation of evolutionary distance}, journal = {SIAM Journal on Applied Mathematics}, year = {1974}, volume = {26}, pages = {787--793}, url = {http://links.jstor.org/sici?sici=0036-1399(197406)26%3A4%3C787%3AOTTACO%3E2.0.CO%3B2-0#abstract} } |
|||||

Shasha, D. & Zhang, K. | Fast algorithms for the unit cost editing distance between trees | 1990 | J. Algorithms | article | DOI |

Abstract: NA |
|||||

BibTeX:
@article{Shasha1990, author = {Dennis Shasha and Kaizhong Zhang}, title = {Fast algorithms for the unit cost editing distance between trees}, journal = {J. Algorithms}, publisher = {Academic Press, Inc.}, year = {1990}, volume = {11}, number = {4}, pages = {581--621}, doi = {http://dx.doi.org/10.1016/0196-6774(90)90011-3} } |
|||||

Stephen, G. A. | String Searching Algorithms | 1994 | book | URL | |

Abstract: A bibliographic overview of string searching and an anthology of descriptions of the principal algorithms available. Topics covered include methods for finding exact and approximate string matches, calculating "edit" distances between strings, and finding common sequences. |
|||||

BibTeX:
@book{Stephen1994, author = {Graham A. Stephen}, title = {String Searching Algorithms}, publisher = {World Scientific Pub Co Inc}, year = {1994}, pages = {256}, url = {http://books.google.com.co/books?id=rfSjZhDxBtUC&dq=String+Searching+Algorithms} } |
|||||

TANAKA, K. & TANAKA, E. | THE TREE-TO-TREE EDITING PROBLEM International Journal of Pattern Recognition and Artificial IntelligenceTHE TREE-TO-TREE EDITING PROBLEM | 1988 | International Journal of Pattern Recognition and ArtificialInternational Journal of Pattern Recognition and Artificial Intelligence | article | DOI |

Abstract: This paper describes the computing alogrithms for the tree distance based on the structure preserving mapping. The distance is defined as the minimum sum of the weights of edit operations needed to transform tree Tα to tree Tβ under restriction of the structure preserving mapping. The edit operations allow substituting a vertex of a tree to another, deleting a vertex of a tree and inserting a vertex to a tree. Proposed algorithms determine the distance between Tα and Tβ in time O(NαNβLα) or O(NαNβLβ), and in space O(NαNβ), where Nα, Nβ, Lα and Lβ are the number of vertices of Tα, Tβ, the number of’ leaves of Tα and Tβ, respectively. The time complexity is close to the unapproachable lowest bound O(NαNβ). Improved algorithms are presented. This tree distance can be applied to any problems including pattern recognition, syntactic tree comparison and classification, and tree comparison whose structures are important in structure preserving mapping. |
|||||

BibTeX:
@article{, author = {KEIKO TANAKA and EIICHI TANAKA}, title = {THE TREE-TO-TREE EDITING PROBLEM International Journal of Pattern Recognition and Artificial IntelligenceTHE TREE-TO-TREE EDITING PROBLEM}, journal = {International Journal of Pattern Recognition and ArtificialInternational Journal of Pattern Recognition and Artificial Intelligence}, year = {1988}, volume = {2}, number = {2}, pages = {221-240}, doi = {http://dx.doi.org/10.1142/S0218001488000157} } |
|||||

Torsello, A. & Hancock, E. R. | Tree Edit Distance from Information Theory | 2003 | Graph Based Representations in Pattern Recognition | article | URL |

Abstract: This paper presents a method for estimating the cost of tree edit operations. The approach poses the problem as that of estimating a generative model for a set of tree samples. The generative model uses the tree-union as the structural archetype for every tree in the distribution and assigns to each node in the archetype the probability that the node is present in a sample. A minimum descriptor length formulation is then used to estimate the structure and parameters of this tree model as well as the node-correspondences between trees in the sample-set and the tree model. |
|||||

BibTeX:
@article{Torsello2003, author = {Andrea Torsello and Edwin R. Hancock}, title = {Tree Edit Distance from Information Theory}, journal = {Graph Based Representations in Pattern Recognition}, year = {2003}, volume = {2726/2003}, url = {http://www.sinab.unal.edu.co:2090/content/ff9xry3cde9d7kqu/?p=98a6054b622f4176a55a079bd9aab732&pi=0} } |
|||||

Ukkonen, E. & Wood, D. | Approximate string matching with suffix automata | 1993 | Algorithmica | article | DOI |

Abstract: ...is, given a text string...text with edit distance at mostk...edit distance table. Some experiments... |
|||||

BibTeX:
@article{Ukkonen1993, author = {Esko Ukkonen and Derick Wood}, title = {Approximate string matching with suffix automata}, journal = {Algorithmica}, year = { 1993}, volume = {10}, number = {5}, doi = {http://dx.doi.org/10.1007/BF01769703} } |
|||||

Vidal, E., Marzal, A. & Aibar, P. | Fast computation of normalized edit distances | 1995 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOIURL |

Abstract: The normalized edit distance (NED) between two strings X and Y is defined as the minimum quotient between the sum of weights of the edit operations required to transform X into Y and the length of the editing path corresponding to these operations. An algorithm for computing the NED was introduced by Marzal and Vidal (1993) that exhibits 0(mn2 ) computing complexity, where m and n are the lengths of X and Y. We propose here an algorithm that is observed to require in practice the same 0(mn) computing resources as the conventional unnormalized edit distance algorithm does. The performance of this algorithm is illustrated through computational experiments with synthetic data, as well as with real data consisting of OCR chain-coded strings |
|||||

BibTeX:
@article{Vidal1995, author = {Vidal, E. and Marzal, A. and Aibar, P.}, title = {Fast computation of normalized edit distances}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {1995}, volume = {17}, number = {9}, pages = {899-902}, url = {http://ieeexplore.ieee.org/iel1/34/9134/00406656.pdf?isnumber=9134∏=STD&arnumber=406656&arnumber=406656&arSt=899&ared=902&arAuthor=Vidal%2C+E.%3B+Marzal%2C+A.%3B+Aibar%2C+P. }, doi = {http://dx.doi.org/10.1109/34.406656} } |
|||||

Wang, J., Shapiro, B., Shasha, D., Zhang, K. & Currey, K. | An algorithm for finding the largest approximately common substructures of two trees | 1998 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOI |

Abstract: Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees) |
|||||

BibTeX:
@article{Wang1998, author = {Wang, J.T.L. and Shapiro, B.A. and Shasha, D. and Zhang, K. and Currey, K.M.}, title = {An algorithm for finding the largest approximately common substructures of two trees}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {1998}, volume = {20}, number = {8}, pages = {889 - 895}, doi = {http://dx.doi.org/10.1109/34.709622} } |
|||||

Wei, J. | Markov edit distance | 2004 | IEEE Transactions on Pattern Analysis and Machine Intelligence | article | DOIURL |

Abstract: Edit distance was originally developed by Levenstein several decades ago to measure the distance between two strings. It was found that this distance can be computed by an elegant dynamic programming procedure. The edit distance has played important roles in a wide array of applications due to its representational efficacy and computational efficiency. To effect a more reasonable distance measure, the normalized edit distance was proposed. Many algorithms and studies have been dedicated along this line with impressive performances in recent years. There is, however, a fundamental problem with the original definition of edit distance that has remained elusive: its context-free nature. In determining the possible actions, i.e., insertion, deletion, and substitution, no systematic consideration was given to the local behaviors of the string/pattern in question that indeed encompass great amount of useful information regarding its content. In this paper, inspired by the success of the Markov Random Field theory, a new edit distance called Markov edit distance (MED) within the dynamic programming framework is proposed to take full advantage of the local statistical dependencies in the pattern in order to arrive at enhanced matching performance. Within this framework, two specialized distance measures are developed: The reshuffling MED to handle cases where a subpattern in the target pattern is the reshuffles of that in the source pattern, and the coherence MED which is able to incur local content based substitution, insertion, and deletion. The applications based on these two MEDs in string matching are then explored, whereof encouraging empirical results have been observed. |
|||||

BibTeX:
@article{Wei2004, author = {Jie Wei}, title = {Markov edit distance}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, year = {2004}, volume = {26}, number = {3}, pages = {311- 321}, url = {http://ieeexplore.ieee.org/iel5/34/28219/01262315.pdf?isnumber=28219∏=JNL&arnumber=1262315&arnumber=1262315&arSt=+311&ared=+321&arAuthor=Jie+Wei}, doi = {http://dx.doi.org/10.1109/TPAMI.2004.1262315} } |
|||||

Wenk, C. | Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology | 1999 | Combinatorial Pattern Matching: 10th Annual Symposium, CPM 99, Warwick University, UK, July 1999. Proceedings | inproceedings | URL |

Abstract: In dendrochronology wood samples are dated according to the tree rings they contain. The dating process consists of comparing the sequence of tree ring widths in the sample to a dated master sequence. Assuming that a tree forms exactly one ring per year a simple sliding algorithm solves this matching task.But sometimes a tree produces no ring or even two rings in a year. If a sample sequence contains this kind of inconsistencies it cannot be dated correctly by the simple sliding algorithm. We therefore introduce a algorithm for dating such a sample sequence against an error-free master sequence, where n and m are the lengths of the sequences. Our algorithm takes into account that the sample might contain up to missing or double rings and suggests possible positions for these kind of inconsistencies. This is done by employing an edit distance as the distance measure. |
|||||

BibTeX:
@inproceedings{Wenk1999, author = {Carola Wenk}, title = {Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology}, booktitle = {Combinatorial Pattern Matching: 10th Annual Symposium, CPM 99, Warwick University, UK, July 1999. Proceedings}, publisher = {#svb#}, year = {1999}, number = {1645}, pages = {223-242}, url = {http://www.springerlink.com/content/0r7ty6qhmljb909t/} } |
|||||

Wilson, R. & Hancock, E. | Levenshtein distance for graph spectral features | 2004 | Proceedings of the 17th International Conference on Pattern Recognition ICPR 2004 | inproceedings | DOI |

Abstract: Graph structures play a critical role in computer vision, but they are inconvenient to use in pattern recognition tasks because of their combinatorial nature and the consequent difficulty in constructing feature vectors. Spectral representations have been used for this task which are based on the eigensystem of the graph Laplacian matrix. However, graphs of different sizes produce eigensystems of different sizes where not all eigenmodes are present in both graphs. We use the Levenshtein distance to compare spectral representations under graph edit operations which add or delete vertices. The spectral representations are therefore of different sizes. We use the concept of the string-edit distance to allow for the missing eigenmodes and compare the correct modes to each other. We evaluate the method by first using generated graphs to compare the effect of vertex deletion operations. We then examine the performance of the method on graphs from a shape database. |
|||||

BibTeX:
@inproceedings{Wilson2004, author = {Wilson, R.C. and Hancock, E.R.}, title = {Levenshtein distance for graph spectral features}, booktitle = {Proceedings of the 17th International Conference on Pattern Recognition ICPR 2004}, year = {2004}, volume = {2}, pages = {489 - 492}, doi = {http://dx.doi.org/10.1109/ICPR.2004.1334272} } |
|||||

Zhang, K. | Efficient parallel algorithm for the editing distance between ordered trees | 1998 | Combinatorial Pattern Matching | inproceedings | DOI |

Abstract: Ordered labeled trees are trees whose nodes are labeled and in which the left-to-right order among siblings is significant. The tree editing problem for input ordered labeled trees T 1 and T 2 is defined as transforming T 1 into T 2 by performing a series of weighted edit operations on T 1 with overall minimum cost. An edit operation can be the deletion, the insertion, and the substitution. Previous results on this problem are only for some special cases and the time complexity depends on the actual distance, though for the more restricted version of degree-2 edit distance problem there are efficient solutions. In this extended abstract, we show polylogrithmic time algorithm for this problem. |
|||||

BibTeX:
@inproceedings{Zhang1998, author = {K. Zhang}, title = {Efficient parallel algorithm for the editing distance between ordered trees}, booktitle = {Combinatorial Pattern Matching}, publisher = {#svb#}, year = {1998}, number = {1448}, pages = {80--90}, doi = {http://dx.doi.org/10.1007/BFb0030776} } |
|||||

Zhang, K. | A New Editing based Distance between Unordered Labeled Trees | 1993 | CPM '93: Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching | inproceedings | URL |

Abstract: This paper considers the problem of computing a new editing based distance between unordered labeled trees. The problem of approximate unordered tree matching is also considered. We present algorithms solving these problems in sequential time O(¦T 1¦ × ¦T 2¦ × maxdeg(T 1), deg (T2) × log2(maxdeg(T 1), deg (T2))). Our previous result shows that computing the editing distance between unordered labeled trees is NP-complete.Research supported by the Natural Sciences and Engineering Research Council of Canada under Grant No. OGP0046373. |
|||||

BibTeX:
@inproceedings{Zhang1993, author = {Kaizhong Zhang}, title = {A New Editing based Distance between Unordered Labeled Trees}, booktitle = {CPM '93: Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching}, publisher = {Springer-Verlag}, year = {1993}, pages = {254--265}, url = {http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=738282#} } |
|||||

Zhang, K. & Shasha, D. | Simple fast algorithms for the editing distance between trees and related problems | 1989 | SIAM Journal on Computing | article | DOIURL |

Abstract: Ordered labeled trees are trees in which the left-to-right order among siblings is significant. The distance between two ordered trees is considered to be the weighted number of edit operations (insert, delete, and modify) to transform one tree to another. The problem of approximate tree matching is also considered. Specifically, algorithms are designed to answer the following kinds of questions:1. What is the distance between two trees? 2. What is the minimum distance between $T_1 $ and $T_2 $ when zero or more subtrees can be removed from $T_2 $? 3. Let the pruning of a tree at node $n$ mean removing all the descendants of node $n$. The analogous question for prunings as for subtrees is answered.A dynamic programming algorithm is presented to solve the three questions in sequential time $O(|T_1 | times |T_2 | times min (depth(T_1 ),leaves(T_1 )) times min (depth(T_2 ),leaves(T_2 )))$ and space $O(|T_1 | times |T_2 |)$ compared with $O(|T_1 | times |T_2 | times (depth(T_1 ))^2 times (depth(T_2 ))^2 )$ for the best previous published algorithm due to Tai [J. Assoc. Comput. Mach., 26 (1979), pp, 422-433]. Further, the algorithm presented here can be parallelized to give time $O(|T_1 | times |T_2 |)$. |
|||||

BibTeX:
@article{Zhang1989, author = {K. Zhang and D. Shasha}, title = {Simple fast algorithms for the editing distance between trees and related problems}, journal = {SIAM Journal on Computing}, publisher = {Society for Industrial and Applied Mathematics}, year = {1989}, volume = {18}, number = {6}, pages = {1245--1262}, url = {http://link.aip.org/link/?SMJ/18/1245/1}, doi = {http://dx.doi.org/10.1137/0218082} } |
|||||

Zhang, K., Statman, R. & Shasha, D. | On the editing distance between unordered labeled trees | 1992 | Algorithms and Computation | article | DOI |

Abstract: iven two rooted, labeled, unordered trees, the common subtree problem is to find a bijective matching between subsets of vertices of the trees of maximum cardinality which preserves labels and ancestry relationship. The tree editing distance problem is to determine the least cost sequence of additions, deletions and changes that converts a tree into another given tree. Both problems are known to be hard to approximate within some constant factor in general.We present polynomial algorithms for several special classes of trees as well as a tighter approximation hardness proof, which together comes close to characterizing the complexity of both problems on all interesting special classes of trees. We also present the first approximation algorithm with non-trivial approximation ratios. In particular, we achieve a ratio of log2 n, where n is the number of vertices in the trees. Work done in large part while the first author visited JAIST, first as PFU Visiting Chair, and later supported by JAIST Foundation. Also work done in part while the second author visited NTT Telecommunication Networks Laboratories under the supervision of Toshihiko Yamakami. |
|||||

BibTeX:
@article{Zhang1992b, author = {K. Zhang and R. Statman and D. Shasha}, title = {On the editing distance between unordered labeled trees}, journal = {Algorithms and Computation}, year = {1992}, volume = {42}, pages = {133--139}, doi = {http://dx.doi.org/10.1007/BFb0009475} } |
|||||

Zhang, K., Yang, Y., Wang, X., Wang, J. & Shasha, D. | An Approximate Oracle for Distance in Metric Spaces | 1998 | Combinatorial Pattern Matching | inproceedings | DOI |

Abstract: An important subtask of the pattern discovery process is pattern matching, where the pattern sought is already known and we want to determine how often and where it occurs in a sequence. In this paper we review the most practical techniques to find patterns of different kinds. We show how regular expressions can be searched for with general techniques, and how simpler patterns can be dealt with more simply and efficiently. We consider exact as well as approximate pattern matching. Also, we cover both sequential searching, where the sequence cannot be preprocessed, and indexed searching, where we have a data structure built over the sequence to speed up the search. |
|||||

BibTeX:
@inproceedings{Zhang1998a, author = {K. Zhang and Y. Yang and X. Wang and J. Wang and D. Shasha}, title = {An Approximate Oracle for Distance in Metric Spaces}, booktitle = {Combinatorial Pattern Matching}, publisher = {#svb#}, year = {1998}, number = {1448}, pages = {104--117}, doi = {http://dx.doi.org/10.1007/BFb0030784} } |

Created by JabRef on 26/03/2007.