INDIAN INSTITUTE OF INFORMATION
TECHNOLOGY, BHOPAL
Assignment 10
Lab Manual: Design and Analysis of Algorithm (IT-215)
Topic: Gene Sequencing Analysis
Branch: Information Technology Section-2
3rd Semester
Submitted by: Submitted to:
Aryan Dixit Dr. Yatendra Sahu
Scholar no.: 23U03116 Assistant Professor, IIIT Bhopal
Title: Gene Sequence Alignment (Needleman-Wunsch Algorithm)
using Dynamic Programming
4. Proposed Methodology:
The project uses the Needleman-Wunsch algorithm to develop a reliable
approach for sequence alignment. Key steps include data collection, matrix
setup, score calculation, and path backtracking.
4.1 Data Collection:
Genetic data is sourced from public databases, such as GenBank or Ensembl,
focusing on nucleotide sequences for gene regions of interest. The chosen
sequences should be homologous to observe the effects of the algorithm on
evolutionary relationships.
4.2 Alignment Scoring Matrix:
The scoring matrix is foundational to Needleman-Wunsch, where scores for
matches, mismatches, and gaps are predefined. Matches are assigned positive
scores based on sequence similarity, while mismatches and gaps receive
penalties. By adjusting these scores, the algorithm can be tuned to emphasize
either sequence identity or evolutionary distance.
4.3 Dynamic Programming Matrix Construction:
The alignment matrix is constructed using a recursive relationship, filling each
cell based on the highest score calculated from the neighboring cells. Each cell
score represents the best alignment score achievable to that point, incorporating
either a match, mismatch, or gap. The matrix's final cell contains the highest
possible alignment score for the two sequences.
4.4 Backtracking for Optimal Path:
After filling the matrix, the algorithm backtracks from the final cell to
determine the optimal alignment path. By moving through the matrix in reverse,
the algorithm identifies the sequence of matches, mismatches, and gaps that
yield the maximum score.
4.5 Validation and Testing:
To evaluate the effectiveness of the alignment, the resulting sequences are
compared with known homologous sequences. Metrics such as alignment score,
accuracy, and computation time are analyzed to assess algorithm performance.
Testing will include comparison across different scoring matrices to understand
the sensitivity of the algorithm to parameter adjustments.
Algorithm:
1. Initialization of matrices
This is how both matrices look like after initialization, where the linear gap
penalty = -1 is used.
2. Calculate scores to fill score matrix and traceback matrix.
3. Deduce the best alignment from traceback matrix
Traceback begins with the bottom right-most cell (last cell to be filled). Move
according to the value in the cell until ‘done’ cell is reached.
How to interpret the best alignment from above matrix?
The cell value ‘diag’ interprets that residues from two sequences are aligned,
‘up’ can be interpreted as a gap added in top sequence or insertion. Similarly,
‘left’ can be interpreted as a gap added in left sequence or deletion.
This is the optimal alignment derived using Needleman-Wunsch algorithm.
5. Implementation and Results Analysis:
5.1 Code Implementation:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;
// Function to perform Needleman-Wunsch alignment and return the alignment
score and aligned sequences
pair<int, pair<string, string>> needlemanWunsch(const string &seqA, const
string &seqB, int match, int mismatch, int gap)
{
int m = seqA.size();
int n = seqB.size();
// Create the scoring matrix
vector<vector<int>> scoreMatrix(m + 1, vector<int>(n + 1, 0));
// Initialize the first row and column with gap penalties
for (int i = 0; i <= m; ++i)
scoreMatrix[i][0] = i * gap;
for (int j = 0; j <= n; ++j)
scoreMatrix[0][j] = j * gap;
// Fill the score matrix
for (int i = 1; i <= m; ++i)
{
for (int j = 1; j <= n; ++j)
{
int matchScore = scoreMatrix[i - 1][j - 1] + (seqA[i - 1] == seqB[j - 1] ?
match : mismatch);
int deleteScore = scoreMatrix[i - 1][j] + gap;
int insertScore = scoreMatrix[i][j - 1] + gap;
scoreMatrix[i][j] = max({matchScore, deleteScore, insertScore});
}
}
// Alignment score is in the bottom-right cell
int alignmentScore = scoreMatrix[m][n];
// Backtracking to find the aligned sequences
string alignedA = "", alignedB = "";
int i = m, j = n;
while (i > 0 || j > 0)
{
if (i > 0 && j > 0 && scoreMatrix[i][j] == scoreMatrix[i - 1][j - 1] +
(seqA[i - 1] == seqB[j - 1] ? match : mismatch))
{
alignedA = seqA[i - 1] + alignedA;
alignedB = seqB[j - 1] + alignedB;
--i;
--j;
}
else if (i > 0 && scoreMatrix[i][j] == scoreMatrix[i - 1][j] + gap)
{
alignedA = seqA[i - 1] + alignedA;
alignedB = "-" + alignedB;
--i;
}
else
{
alignedA = "-" + alignedA;
alignedB = seqB[j - 1] + alignedB;
--j;
}
}
return {alignmentScore, {alignedA, alignedB}};
}
int main()
{
string seqA = "GATTACA";
string seqB = "GCATGCU";
int match = 1;
int mismatch = -1;
int gap = -2;
// Call the Needleman-Wunsch function
auto result = needlemanWunsch(seqA, seqB, match, mismatch, gap);
// Output similarity score and aligned sequences
cout << "Similarity Score: " << result.first << endl;
cout << "Aligned Sequences:\n";
cout << result.second.first << "\n";
cout << result.second.second << endl;
return 0;
}
Output:
Time Complexity: O(m×n), where m and n are the lengths of seqA and seqB.
The matrix filling step iterates through every cell once.
Matrix initialization: O(m+n)
Matrix filling: O(m×n)
Backtracking: O(m+n)
Total Time Complexity=O(m×n).
Space Complexity: O(m×n) for storing the scoreMatrix and additional O(m+n)
for the aligned sequences during backtracking.
Full scoring matrix: O(m×n) (or O(n) with row optimization)
Aligned sequences: O(m+n)
Total Space Complexity=O(m×n)(or O(n) with optimization).
5.2 Challenges Encountered:
Large Sequence Alignment: The alignment matrix grows significantly
with sequence length, requiring optimized memory management to
maintain computational efficiency. Techniques such as sparse matrices or
sliding window implementations are considered for managing larger data.
Scoring System Calibration: Assigning biologically accurate scoring
requires fine-tuning to reflect evolutionary relationships accurately. This
was approached through empirical testing and comparison with known
benchmarks.
5.3 Result Analysis:
Visualization of Aligned Sequences: Aligned sequences are displayed
with highlights on matches, mismatches, and gaps, providing a clear view
of conserved regions and genetic divergence.
Accuracy Metrics: The alignment accuracy is quantified by comparing
with benchmarked sequences, with metrics such as sensitivity and
specificity. Results demonstrate how well the algorithm identifies true
matches versus evolutionary gaps, essential for phylogenetic studies.
6. Conclusion and Future Work:
The project successfully applies the Needleman-Wunsch algorithm for global
sequence alignment, producing accurate and biologically meaningful
alignments. The results showcase the algorithm’s potential in revealing
conserved genetic regions, essential for studies in comparative genomics and
evolutionary biology. Future work could include:
Multi-sequence Alignment Extension: Expanding the algorithm to align
multiple sequences, beneficial for evolutionary studies across multiple
species.
Incorporating Machine Learning: Using machine learning to improve
scoring by learning from known sequence alignments, adapting scoring
dynamically for different organisms or sequence types.
Graphical Interface Development: A GUI could improve accessibility,
allowing users to adjust scoring parameters and visualize alignments in
real-time.
Integration with High-Performance Computing (HPC): Running the
algorithm on parallel computing frameworks to handle genome-wide
alignments efficiently.
7. References:
Needleman, S. B., & Wunsch, C. D. (1970).
"A general method applicable to the search for similarities in the
amino acid sequence of two proteins." Journal of Molecular Biology,
48(3), 443-453.
Smith, T. F., & Waterman, M. S. (1981).
"Identification of common molecular subsequences." Journal of
Molecular Biology, 147(1), 195-197.
Altschul, S. F., et al. (1990).
"Basic local alignment search tool." Journal of Molecular Biology,
215(3), 403-410.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998).
Biological Sequence Analysis: Probabilistic Models of Proteins and
Nucleic Acids. Cambridge University Press.
Mount, D. W. (2004).
Bioinformatics: Sequence and Genome Analysis. 2nd Edition. Cold
Spring Harbor Laboratory Press.
Gusfield, D. (1997).
Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology. Cambridge University Press.
Waterman, M. S. (1995).
Introduction to Computational Biology: Maps, Sequences, and
Genomes. CRC Press.
Pearson, W. R., & Lipman, D. J. (1988).
"Improved tools for biological sequence comparison." Proceedings of
the National Academy of Sciences, 85(8), 2444-2448.
Gapped BLAST and PSI-BLAST: A new generation of protein
database search programs.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
Miller, W., & Lipman, D.J. (1997). Nucleic Acids Research, 25(17),
3389-3402.