Hey folks, I have the following assignment and really need help with it. I don't even know where to start really. I'm not asking for full answers here, just trying to have a place to discuss it and get a little help. Any ideas would be GREATLY appreciated. Thanks for your time.

Finding Similar Genomic Sequences

Many questions in biology are investigated with the help of efficient computational methods for searching, comparing, and organizing huge amounts of sequence data. Both DNA sequences and protein sequences can be treated as character strings. For DNA there are 4 possible letters: A, C, G, and T corresponding to the 4 bases adenine, cytosine, guanine, and thymine, respectively. For proteins there are 20 possible letters: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, each representing one of the 20 amino acids. One frequently used technique in computational biology is the comparison of sequences to find a “match,” or just similarity. For example, scientists have successfully predicted some proteins’ function and/or 3-dimensional structure by extrapolating from similar proteins with known function and 3-d structure. Consider the proteins that appear in, say, a mouse. For many of them there are similar proteins that appear in a rat or a cat. Although such proteins accomplish the same or similar functions, they are not identical; there are substitutions, insertions, and deletions of amino acids. There are corresponding substitutions, insertions, and deletions in the portions of the DNA that encode the information a cell uses to build the proteins.

You will write an application for finding subsequences (in a long sequence) that are similar to a query sequence. Solving this problem in full generality requires the use of Dynamic Programming, an algorithm design technique that is studied in upper-level courses such as CS4820, Introduction to Analysis of Algorithms. In this assignment, we restrict our definition of “similar” to allow just substitutions and a limited number of insertions.

You will write a function proteinMatch such that the statement

n = proteinMatch('genome.txt', 'protein.txt', 'result.txt')

reads sequence data from a file called genome.txt in the current directory, reads sequence data from a file called protein.txt, identifies all subsequences in the genome data that are similar to the protein sequence, and writes the findings to a file called result.txt. The number of similar subsequences found is stored in variable n.

Defining a Match

1. Allow one nucleotide to substitute for another. T is similar to A and C is similar to G. Consider the following example:

AATGCCCAACCG

ATCC

The top string is the data and the bottom string is the query. If we consider C and G to be similar then there is a match that begins at position 2 of the data string. We assign a “penalty” for each substitution: an A-T substitution draws a penalty of 1 and a C-G substitution draws a penalty of 2. We can say a match is found based on some allowed penalty. For the above example, the match at position 2 (ATGC in the data against ATCC in the query) contains one C-G substitution, so it draws a penalty of 2.

A “Match Position” is the position (index) in the data where the beginning of the match occurs.
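As a starting point for discussion, here is a minimal sketch of the substitution-only case in MATLAB. It is not the assignment's reference solution: the helper name `subPenalty` and the `maxPenalty` allowance are assumptions made for illustration. It slides a window over the data string and adds up the A-T and C-G penalties described above.

```matlab
% Sketch only: substitution-only search (no gaps) on the example strings.
% subPenalty and maxPenalty are hypothetical names, not from the handout.
data  = 'AATGCCCAACCG';
query = 'ATCC';
maxPenalty = 2;                          % assumed allowance for this demo

m = length(query);
for pos = 1:(length(data) - m + 1)
    window = data(pos : pos+m-1);        % candidate subsequence
    p = subPenalty(window, query);
    if p <= maxPenalty
        fprintf('match at %d: %s (penalty %d)\n', pos, window, p);
    end
end

function p = subPenalty(a, b)
% Penalty between equal-length strings a and b: 1 per A-T swap,
% 2 per C-G swap, Inf if any other mismatch occurs.
    isAT = (a == 'A' & b == 'T') | (a == 'T' & b == 'A');
    isCG = (a == 'C' & b == 'G') | (a == 'G' & b == 'C');
    if all(a == b | isAT | isCG)
        p = sum(isAT) + 2*sum(isCG);
    else
        p = Inf;
    end
end
```

On the example strings this reports position 2 (ATGC, penalty 2) and position 8 (AACC, penalty 1). Local functions inside a script need MATLAB R2016b or later; on older versions, `subPenalty` would go in its own file.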

2. Allow the insertion of nucleotides. This is commonly described as a gap. In this example we do not consider substitution.

AATGCCCATGGG

 AT    AT

Our query string is ATAT, which is not found in the data. However, if we allow a gap of length 4, as laid out above, then we say that we have a match. We restrict our search to allow only one continuous gap. The penalty for each gap space is one. (So a gap of length 4 draws a penalty of 4.)

Combining these two ideas, we define two (sub)sequences to be similar (we say they match) if the gap (if one is needed) has a length of no more than 3 and the combined penalty for matching is less than 5.
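One way to read the combined rule is as a brute-force search: at each start position, try every gap length from 0 to 3 and every place the gap could sit inside the query, then keep the cheapest alignment. The sketch below follows that reading. The names `bestAlignment`, `subPenalty`, `maxGap`, and `limit` are hypothetical, and taking the cheapest alignment when several are possible is an assumption, since the handout does not say how to break ties.

```matlab
% Sketch only: cheapest alignment per start position, allowing
% substitutions plus at most one continuous gap. Names are hypothetical.
data   = 'AATGCCCATGGG';
query  = 'ATAT';
maxGap = 3;      % gap length limit from the assignment
limit  = 5;      % combined penalty must be strictly less than this

for pos = 1:length(data)
    [p, g, s] = bestAlignment(data, query, pos, maxGap);
    if p < limit
        fprintf('pos %d: penalty %d, gap %d after query char %d\n', pos, p, g, s);
    end
end

function [best, bestGap, bestSplit] = bestAlignment(data, query, pos, maxGap)
% Try every gap length 0..maxGap and every split point; keep the cheapest.
    m = length(query);
    best = Inf; bestGap = 0; bestSplit = m;
    for g = 0:maxGap
        if pos + m + g - 1 > length(data)
            break                        % window would run past the data
        end
        if g == 0
            splits = m;                  % no gap: compare the whole query
        else
            splits = 1:m-1;              % gap sits strictly inside the query
        end
        for s = splits
            left  = data(pos : pos+s-1);
            right = data(pos+s+g : pos+m+g-1);
            p = subPenalty(left, query(1:s)) + subPenalty(right, query(s+1:m)) + g;
            if p < best
                best = p; bestGap = g; bestSplit = s;
            end
        end
    end
end

function p = subPenalty(a, b)
% 1 per A-T swap, 2 per C-G swap, Inf for any other mismatch.
    isAT = (a == 'A' & b == 'T') | (a == 'T' & b == 'A');
    isCG = (a == 'C' & b == 'G') | (a == 'G' & b == 'C');
    if all(a == b | isAT | isCG)
        p = sum(isAT) + 2*sum(isCG);
    else
        p = Inf;
    end
end
```

With `maxGap = 3` the gap-4 alignment from the example is rejected, as the rule requires. Raising `maxGap` to 4 as an experiment finds it at position 2 with penalty 4.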

File Format

Your program writes the results in an ASCII (plain text) file. Write three lines of text for each match found:

1. The “match position,” which is the position (index) in the genome string at which the match occurs

2. The subsequence of the genome string that matches (starting from the “match position”)

3. The protein sequence with any inserted gap filled in by the * character
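The three-line record could be written with `fprintf`, as in the rough sketch below. The variables `pos`, `gapLen`, and `split` stand for whatever the search step produces; they are illustrative names, and the values are taken from the gap example in the handout (gap of 4, which the final rule would actually reject).

```matlab
% Sketch only: write one three-line record for a single match.
% pos, gapLen and split are hypothetical outputs of the search step.
genome = 'AATGCCCATGGG';
query  = 'ATAT';
pos = 2; gapLen = 4; split = 2;          % illustrative values only

% Line 2: the genome subsequence covered by the match (gap included).
matched = genome(pos : pos + length(query) + gapLen - 1);
% Line 3: the query with the gap filled in by '*'.
padded  = [query(1:split), repmat('*', 1, gapLen), query(split+1:end)];

fid = fopen('result.txt', 'w');
if fid == -1
    error('Could not open result.txt for writing.');
end
fprintf(fid, '%d\n%s\n%s\n', pos, matched, padded);
fclose(fid);
```

For these values the file holds `2`, `ATGCCCAT`, and `AT****AT` on three lines. In a full solution the file would be opened once and the `fprintf` call repeated per match.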

The data files humanC5 ?.txt where ? is 1, 2, or 3, are portions of the human chromosome 5 nucleotide sequence downloaded from the data bank of the European Bioinformatics Institute. The sequences in the files are approximately 100, 1000, and 10000 bases long, respectively. The files contain ASCII characters (plain text). Each file begins with a line of text that identifies the sequence but is not part of the nucleotide sequence. The remaining lines in the file contain the sequence and are of varying lengths. Your code should work with the given files; do not modify the files in any way!
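Reading such a file might look like the sketch below: skip the first (identifying) line, then join the remaining lines into one string. The function name `readSequence` is made up for illustration, and trimming whitespace and converting to upper case are assumptions about what the data needs, not requirements stated in the handout.

```matlab
function seq = readSequence(filename)
% Sketch only: read a sequence file whose first line is a description.
% readSequence is a hypothetical helper name.
    fid = fopen(filename, 'r');
    if fid == -1
        error('Could not open %s.', filename);
    end
    fgetl(fid);                          % discard the identifying first line
    seq = '';
    txt = fgetl(fid);
    while ischar(txt)                    % fgetl returns -1 at end of file
        seq = [seq, upper(strtrim(txt))]; % assumed: trim and upper-case
        txt = fgetl(fid);
    end
    fclose(fid);
end
```

Because the file is only read, this leaves the given data files unmodified. Saved as `readSequence.m`, it would be called as `genome = readSequence('genome.txt');`.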

# Matlab string comparison


### #2

## Re: Matlab string comparison

Posted 30 March 2012 - 01:33 PM

Quick question for you, why do you want to use MATLAB? It seems by the problem description that you need to implement the algorithm and not use the available functionality in MATLAB or the Bioinformatics Toolbox.

MATLAB's string manipulation routines are not all that great. Is an implementation in MATLAB absolutely required?


### #3

## Re: Matlab string comparison

Posted 01 April 2012 - 05:01 AM

frog, on 30 March 2012 - 01:33 PM, said:

Quick question for you, why do you want to use MATLAB? It seems by the problem description that you need to implement the algorithm and not use the available functionality in MATLAB or the Bioinformatics Toolbox.

MATLAB's string manipulation routines are not all that great. Is an implementation in MATLAB absolutely required?


Thanks for your response. Matlab is a requirement for this question. Any ideas on how to implement?
