• Project title:
“Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons.”
-- Yonatan Bilu, Pankaj K. Agarwal, and Rachel Kolodny, IEEE/ACM Vol.3, No.4, 2006
• Team menbers' name:
Ai-Ti, Cheng
• Abstract of the project:
The running time of the known scheme for finding an optimal alignment, based on dynamic programming, increase exponentially with the numbers of input sequence. They propose algorithms that find an optimal alignment in which matches are restricted to positions in predefined matching segments. These techniques makes dynamic programming algorithm more efficient. Not like most heuristics which offer an efficient computation but cannot guarantee the optimality, these techniques not only find an optimal alignment but also expedite the running time.
• A brief Plan of this project:
a. What will I implement?
I will implement the Version 2 of the algorithm, which is the FindOptimalPath(z).
b. What methods am I going to use for this project?
In the project, First, I will use the BLAST to find local alignments with set of sequences and add breakpoints onto them based on the local alignments. If the segments between these breakpoints have the same length, I simply add a connecting edge to the SMG. I am going to compare this algorithm with the full dynamic programming and the version 1 of the algorithm mentioned in this paper.
c. Which datasets are we going to use and where will we get them from (links if possible)?
For the data, I will use the Protein Data Bank ( http://www.rcsb.org/pdb/home/home.do ) to get the data sequences.
d. What kind of experiment will we run and what will we measure (e.g., time, score, p-value etc).
For analyzing the performance of this algorithm, I will choose several kinds of proteins from PDB and run the BLAST. Then I could alternatively choose to run the Version 2 of the algorithm, full dynamic programming, or the Version 1 of the algorithm. Finally, Compare the times the data needed to run on these three different methods.
• Workload distribution:
Ai-Ti, Cheng – all parts
• List of papers for read:
[1] S. Altschul, F. Stephen, L.M. Thomas, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped BLAST and PSIBLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.
[4] H. Carrillo and D. Lipman, “The Multiple Sequence Alignment
Problem in Biology,” SIAM J. Applied Math., vol. 48, no. 5, pp. 1073-1082, 1988.
[11] D. Eppstein, Z. Galil, R. Giancarlo, and G.F. Italiano, “Sparse
Dynamic-Programming: I. Linear Cost-Functions,” J. ACM, vol. 39, no. 3, pp. 519-545, 1992.
[12] D. Eppstein, Z. Galil, R. Giancarlo, and G.F. Italiano, “Sparse
Dynamic-Programming: II. Convex and Concave Cost-Functions,” J. ACM, vol. 39, no. 3, pp. 546-567, 1992.
[15] D. Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.” Cambridge Univ. Press, 1997.
[27] B. Morgenstern, “Dialign 2: Improvement of the Segment-to-
Segment Approach to Multiple Sequence Alignment,” Bioinformatics, vol. 15, no. 3, pp. 211-218, 1999.
[28] B. Morgenstern, “A Simple and Space-Efficient Fragment-Chaining Algorithm for Alignment of DNA and Protein Sequences,” Applied Math. Letters, vol. 15, no. 1, pp. 11-16, 2002.
[29] B. Morgenstern, A. Dress, and T. Werner, “Multiple DNA and
Protein Sequence Alignment Based on Segment-to-Segment
Comparison,” Proc. Nat'l Academy of Sciences , vol. 93, no. 22,
pp. 12098-12103, 1996.
[33] C. Notredame, “Recent Progress in Multiple Sequence Alignment: A Survey,” Pharmacogenomics, vol. 3, no. 1, pp. 131-144, 2002.
[34] C. Notredame, D.G. Higgins, and J. Heringa, “T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment,”
J. Molecular Biology, vol. 302, no. 1, pp. 205-217, 2000.
[40] J.D. Thompson, F. Plewniak, and O. Poch, “BAliBASE: A Benchmark Alignment Database for the Evaluation of Multiple Alignment Programs,” Bioinformatics, vol. 15, pp. 87-88, 1999.