Sequence alignment consists in the comparison of two or more strings to find similarities between the sequences. Each symbol of a string is assigned to at most one (maybe none) symbol in another string. These similarities are particularly important when the sequences represent biological molecules, such as DNA, RNA and proteins, because we assume that similar sequences hold a similar function, structure, and there might be an evolutionary relationships.
In this course, we will address the issue of sequence similarity. We will introduce the concept of sequence alignment, and the concept of sequence similarity in terms of distance between symbols. Given a matrix of similarities between all possible pairs of symbols, the similarity of the alignment is the sum of the similarities between the aligned symbols. The problem is to find the alignment between two sequences having the maximum similarity (minimum distance). A global search in the space of all possible alignments is computationally impractical, and we will see a dynamic programming algorithm (Needelman and Wunsch) that efficiently solves the problem. We will describe the way in which the two most commonly used sets of distance matrices are derived. We will briefly address the issue of the statistical significance of the similarity between aligned sequences.
We will devote some time to the problem of finding local alignments between sequences, because it may be more relevant to align those regions in the sequence showing high similarity than attempting to align the whole sequences. We will describe the modification of the global alignment dynamic programming algorithm for such local alignments (Smith and Waterman). Finally we will focus on algorithms where short sequences need to be compared to a reference. This is often the case with next generation sequencing technologies, which produces ten of millions of short strings that need to be locally matched on a long one. A wide variety of alignment algorithms have been developed over the past few years and we will discuss their properties and applications on different types of experimental data.