14 Replies - 2913 Views - Last Post: 07 December 2014 - 12:47 AM Rate Topic: ****- 1 Votes

#1 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Python 2.7 script for working with DNA in .csv

Posted 03 December 2014 - 09:41 PM

I'm a hobbyist coder who mostly works in Objective-C and Swift. I know a little Python and have done a few simple things with it in the past. It's a great language, but my code probably isn't very "pythonic." So please forgive that.

My sister is working on her doctorate in marine biology and she is doing some DNA analysis as part of her research. She has a large .csv file that she needs to manipulate. I put together this script that imports the data from the .csv (using a small sample size of the data to test the script with). I have a Fish class and a Locus class. Each fish object has a list property of Loci called "genotype". After importing the data into these objects. I then export them into a .csv file. I wanted to do this in an object-oriented fashion because I suspect my sister will be calling for further help in the future, and this should be much easier to expand upon later.

The script runs, but the output .csv file is not what I expected. Iíve been ďcaveman debuggingĒ it with print statements and have located the issue. Itís creating way more loci in the genotype list than expected (480 instead of the expected 12). I must have a logic error in the function that creates each fishís genotype list (either in the Fish class' "import_data()" method or the "import_structure()" function that calls it), double-looping or something. Iíve been banging my head on the desk trying to spot my mistake, but Iím drawing a blank.

Iím including the full script, but Iíve commented out the exporting process because itís unrelated to the bug (itís all there though so you can see how itís supposed to work). Iím also including the test .csv file.

I really appreciate any help you can provide.

TestFile.csv
kFile_name_input = "/Users/gouldsc/Desktop/StructureTest.csv"
kFile_name_output = "/Users/gouldsc/Desktop/PCAMatrixTest.csv"

def import_structure():
    fish_list = []
    data_string = get_data_from_filename( file_name = kFile_name_input )
    table_rows = data_string.splitlines()
    loci_identifiers = table_rows[0].split( "," )
    #   delete the first two items which will be blank:
    loci_identifiers = loci_identifiers[2:]
    #   exclude the header row for row count, each fish is 2 rows
    row_count = (len( table_rows ) - 1)

    if is_odd( row_count ):
        raise NameError( "Need even number of rows, something is wrong with the data file." )
    
    i = 1
    while i < row_count:
        #   each individual fish has 2 rows of data
        if is_odd( i ):
            odd_fish_data = table_rows[i].split( "," )
            even_fish_data = table_rows[i + 1].split( "," )

            fish = Fish()
            fish.import_data( odd_fish_data, even_fish_data, loci_identifiers )
            fish_list.append( fish )
        else:
            raise NameError( "Logic error in code. This shouldn't get called" )
        
        i = i + 2
    
    return fish_list
    
def get_data_from_filename( file_name ):
    data_string = ""
    in_file = open( file_name, "r" )
    data_string = in_file.read()
    in_file.close()
    
    return data_string
    
def is_odd( number ):
    if number == 0:
        raise NameError( "Modulus Division by Zero" )
        
    if number % 2 != 0:
        return True
    else:
        return False

def matrix_as_CSV_string( fish_list ):
    '''Outputs a single string with details of all fish data in the fish list in CSV format'''
    matrix_string = "Fish_ID,Population,"
    for locus in fish_list[0].genotype:
        matrix_string += locus.identifier
        matrix_string += ","
    matrix_string += "\n"
    
    for fish in fish_list:
        matrix_string += fish.fish_CSV_output_string()
        matrix_string += "\n"
        
    return matrix_string
        
def export_matrix( matrix_string ):
    out_file = open( kFile_name_output, "w" )
    out_file.write( matrix_string )
    out_file.close()

class Fish:
    genotype = []
    name = ""
    population = ""
    
    def __repr__( self ):
        fish_description = "name:" + self.name + "/n"
        fish_description += "population:" + self.population + "/n"
        fish_description += "genotype:\n"
        
        for loci in self.genotype:
            fish_description += str( loci )
            fish_description += "\n"
            
        return fish_description
        
    def import_data( self, data_list1, data_list2, header_list ):
        '''Updates the Fish properties based on the passed in data'''
        
        self.name = data_list1[0]
        self.population = data_list1[1]
        #   delete first two entries from both lists
        data_list1 = data_list1[2:]
        data_list2 = data_list2[2:]
        
        i = 0        
        while i < len( data_list1 ):
            locus = Locus()
            locus.identifier = header_list[i]
            if data_list1[i] == "1":
                locus.a += 0.5
            if data_list2[i] == "1":
                locus.a += 0.5
            
            if data_list1[i] == "2":
                locus.c += 0.5
            if data_list2[i] == "2":
                locus.c += 0.5
            
            if data_list1[i] == "3":
                locus.g += 0.5
            if data_list2[i] == "3":
                locus.g += 0.5
                
            if data_list1[i] == "4":
                locus.t += 0.5
            if data_list2[i] == "4":
                locus.t += 0.5
            
            self.genotype.append( locus )
            i = i + 1
    
    def fish_CSV_output_string( self ):
        '''This will produce the string that is used for each line of the matrix output'''
        fish_description_a = self.name + "_a,"
        fish_description_a += self.population
        fish_description_a += ","
        for locus in self.genotype:
            fish_description_a += str( locus.a )
            fish_description_a += ","
        fish_description_a = self.strip_trailing_comma( fish_description_a )
        fish_description_a += "\n"
        
        fish_description_c = self.name + "_c,"
        fish_description_c += self.population
        fish_description_c += ","
        for locus in self.genotype:
            fish_description_c += str( locus.c )
            fish_description_c += ","
        fish_description_c = self.strip_trailing_comma( fish_description_c )
        fish_description_c += "\n"
        
        fish_description_g = self.name + "_g,"
        fish_description_g += self.population
        fish_description_g += ","
        for locus in self.genotype:
            fish_description_g += str( locus.g )
            fish_description_g += ","
        fish_description_g = self.strip_trailing_comma( fish_description_g )
        fish_description_g += "\n"
        
        fish_description_t = self.name + "_t,"
        fish_description_t += self.population
        fish_description_t += ","
        for locus in self.genotype:
            fish_description_t += str( locus.t )
            fish_description_t += ","
        fish_description_t = self.strip_trailing_comma( fish_description_t )
        fish_description_t += "\n"
        
        combined_fish_description = fish_description_a + fish_description_c + fish_description_g + fish_description_t
        return combined_fish_description
        
    def strip_trailing_comma( self, string ):
        if string[-1] == ",":
            return string[:-1]
            
class Locus:
    identifier = ""
    a = 0.0
    c = 0.0
    g = 0.0
    t = 0.0
    
    def __str__( self ):
        description_string = ""
        description_string = self.identifier
        description_string += ( "\n" )
        description_string += ( " A:" + str( self.a ) + "\n" )
        description_string += ( " C:" + str( self.c ) + "\n" )
        description_string += ( " G:" + str( self.g ) + "\n" )
        description_string += ( " T:" + str( self.t ) + "\n" )
        
        return description_string
    
#Script Starts Here
if __name__ == "__main__":
    print "Starting Script"
    fish_list = import_structure()
    
    #Demonstrating the bug:
    actual_genotype_size = len( fish_list[0].genotype )
    expected_size_of_genotype = 12
    if actual_genotype_size != expected_size_of_genotype:
        print "!!! expected genotype size of: " + str( expected_size_of_genotype ) + " instead it's: " + str( actual_genotype_size )
        
#   Commented out the rest of the script below:        
#    matrix_string = matrix_as_CSV_string( fish_list )
#    export_matrix( matrix_string )


Is This A Good Question/Topic? 0
  • +

Replies To: Python 2.7 script for working with DNA in .csv

#2 jon.kiparsky   User is offline

  • Beginner
  • member icon


Reputation: 11679
  • View blog
  • Posts: 19,840
  • Joined: 19-March 11

Re: Python 2.7 script for working with DNA in .csv

Posted 03 December 2014 - 09:55 PM

The first thing I'd suggest is using the csv library - always use library code unless that's the wheel you're reinventing.

The second thing I'd suggest is to try to get rid of some of that code - when you tidy up your code a couple of good things happen. First, you have to examine all of the stuff that you've written, and that means sometimes you spot your error. If not that, then sometimes you just clean it up by accident, when you're reducing repetition and making things run more sensibly. And if your error is still there when you're done, there's a lot less code for it to hide in, so that's a good thing.

For example, look at the loop from about line 99 - that can be cut down substantially. I'm not sure exactly what you're up to there, but I'm pretty sure you can do it a lot nicer if you try. You also have a lot of repetition in your fish_CSV_output_string method. Surely that can be taken back a ways, perhaps with a loop.

I might spot something if I look through this stuff again, if so, I'll have a third thing to suggest - but that's where I'd start: simplify!


EDIT: also, zero is even
Was This Post Helpful? 1
  • +
  • -

#3 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 03 December 2014 - 10:46 PM

Thanks for taking the time to post your thoughtful reply. I originally was going to use the .csv library, but my sister was looking over my shoulder impatiently as I was browsing the docs. She has a very limited understanding of programming, so I was trying to avoid doing anything too clever or use modules she was unfamiliar with. You're right though, using the library is the correct choice, so I'm going to be refactoring that in (it should especially clean up the exporting code you mentioned).

I suspect the bug may be hiding around line 99 as you said. I agree it's pretty ugly. I wasn't exactly sure how to approach it cleanly. Essentially, there are 2 lists of strings (either a "1", "2", "3" or "4"), each value corresponds to either a, c, g or t (adenine, cytosine, guanine, and thymine). Each locus has these 4 properties and we're incrementing them by 0.5 for both lists at each locus site in the list. For example, if there is a "1" in the first list at index 3 and a "4" in the second list at index 3, then that locus at index 3 would have the properties a = 0.5, c = 0.0, g = 0.0 and t = 0.5. My instinct was to use an enum but that looks like a Python3 feature. Anyways, I'm getting to work on implementing that CSV library integration (as well as correcting my is_odd() function-- :oops:). I'll repost with the changes.
Was This Post Helpful? 0
  • +
  • -

#4 jon.kiparsky   User is offline

  • Beginner
  • member icon


Reputation: 11679
  • View blog
  • Posts: 19,840
  • Joined: 19-March 11

Re: Python 2.7 script for working with DNA in .csv

Posted 03 December 2014 - 11:44 PM

Here's a useful python idiom that might help: If you're adding some quantity n to a variable for each item in a list that has a certain value, you can think of that as counting the items in that list which have that value, and multiplying by n. So you can use the count() function to find the number items in the list whose value is foo, and then multiply, and you're set:

>>> 0.5 *  [1,2,3,1,2,1].count(1)
1.5



So that's a little nicer than the looping.
But you have to do this a few times. Okay, you probably want to pair up some values: '1' maps to 'a', and so forth. It's going to be a little easier if you think of this as a dictionary mapping labels to values instead of four different unrelated variables in Locus. Make the amino acid counts in Locus a dictionary, mapping each acid identifier ('a', 'c', 'g', 't') to a value. Then give yourself a mapping from the character in the file ('1', '2','3,'4') to ('a', 'c', 'g', 't'), and you can do something like:

numerical_to_name = zip (('1', '2 ','3', '4') , ('a', 'c', 'g', 't'))
for num, ltr in numerical_to_name:
  locus.acid_count[ltr] = 0.5 * data_list1.count(num)
  locus.acid_count[ltr] += 0.5 * data_list2.count(num)



Something along these lines should get you the same results, without all the repetition, and it'll probably run a bit easier as well. Notice that I'm using the zip function to turn a pair of lists into a list of pairs:

>>> zip (('1', '2 ','3', '4') , ('a', 'c', 'g', 't'))
[('1', 'a'), ('2 ', 'c'), ('3', 'g'), ('4', 't')]
>>> 


I like this because it allows me to express the relationship between the letters and numbers a little more clearly. You can of course express the same thing as the simple list of tuples if that seems better to you.
And then I unpack each of the resulting tuples as individual values in the loop. This is one of the nicer python idioms: by getting us away from index variables, python lets us focus on the things that really matter, which is the task at hand.


(of course this is just a sketch to give you some ideas - you're going to have to do a little work to fit this into your existing code, but I think you can figure it out)
Was This Post Helpful? 1
  • +
  • -

#5 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 04 December 2014 - 12:54 AM

View Postjon.kiparsky, on 03 December 2014 - 11:44 PM, said:

(of course this is just a sketch to give you some ideas - you're going to have to do a little work to fit this into your existing code, but I think you can figure it out)

Very helpful. Thanks for taking the time to make these suggestions! I'm sure I can figure it out. I never would have thought of using count() in that way. That's a very elegant solution. And by using a dict to model my locus objects I can iterate through it. Thank you.

I'll be sure to post the edited script once I've refactored it.
Was This Post Helpful? 0
  • +
  • -

#6 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 04 December 2014 - 03:56 PM

So Iíve updated my code using the suggested patterns, and Iím still experiencing the same problem with the len( fish.genome ) being radically different than expected (by the same amount). Hereís the very weird part that came to light after using your approach (again, thanks for your suggestion): itís still creating way more locus objects than it should and incrementing each amino acid way more than it should.

Itís behaving as if itís incrementing the same amino acids (by reference) but then generating separate loci referencing the same amino acid dict for all loci. In other words, each value for A, C, T, G should be at most 1.0 (incremented by 0.5 by data_list1 and again by data_list2 if they both happen to reference the same acid at that particular index), but Iím seeing the following values in every loci: {'A': 91.5, 'C': 70.5, 'T': 82.5, 'G': 67.5} and with every fish when running the test data. Itís as if each individual fish object is referencing the same genome list. Iím not sure how this reference is being passed around, as Iím creating separate Locus objects, and separate Fish objects. Itís like theyíre behaving as global variables. I can't seem to spot the bug though. Either it's something to do with object creation in Python that I'm unfamiliar with, or possibly a logic error in the importing code. It appears to iterate the expected number of times through the various loops when I insert print statements.

Thanks again for your help. The code looks better I think (cut out a bunch of code). Although i'm a bit concerned it'll be harder for my sister to follow since it's doing more "clever" things. Then again, she'd rather use a script that actually did what it supposed to do...

import csv
kFile_name_input = "/Users/gouldsc/Desktop/StructureTest.csv"
kFile_name_output = "/Users/gouldsc/Desktop/PCAMatrixTest.csv"

def import_structure():
    fish_list = []
    
    with open( kFile_name_input ) as csv_data_file:
        data_reader = csv.reader( csv_data_file, dialect = "excel" )
        table_rows = []
        for row in data_reader:
            table_rows.append( row )
        
        csv_data_file.close()
        
    loci_identifiers = table_rows[0]
    #   delete the first two items which will be blank:
    loci_identifiers = loci_identifiers[2:]
    #   exclude the header row for row count, each fish is 2 rows
    row_count = len( table_rows ) - 1

    if is_odd( row_count ):
        raise NameError( "Need even number of rows, something is wrong with the data file." )
    
    i = 1
    while i < row_count:
        #   each individual fish has 2 rows of data
        if is_odd( i ):
            odd_fish_data = table_rows[i]
            even_fish_data = table_rows[i + 1]

            fish = Fish()
            fish.import_data( odd_fish_data, even_fish_data, loci_identifiers )
            fish_list.append( fish )
        else:
            raise NameError( "Logic error in code. This shouldn't get called" )
        
        i = i + 2
    
    return fish_list
    
def is_odd( number ):
    if number == 0:
        return False
        
    if number % 2 != 0:
        return True
    else:
        return False
        
def export_matrix( fish_list ):
    '''writes out a .csv file, iterating through each fish in the list'''
    header_row_list = ["Fish_ID", "Population"]
    locus_identifiers = fish_list[0].locus_identifier_list()
    header_row_list += locus_identifiers
    
    with open( kFile_name_output, "wb" ) as csv_out_file:
        data_writer = csv.writer( csv_out_file, dialect = "excel" )
        data_writer.writerow( header_row_list )
        
        for fish in fish_list:
            fish.export_fish_to_CSV_with_writer( data_writer )
        
        csv_out_file.close()
        
class Fish:
    genotype = []
    name = ""
    population = ""
    
    def __repr__( self ):
        fish_description = "name:" + self.name + "/n"
        fish_description += "population:" + self.population + "/n"
        fish_description += "genotype:\n"
        
        for loci in self.genotype:
            fish_description += str( loci )
            fish_description += "\n"
            
        return fish_description
        
    def import_data( self, data_list1, data_list2, header_list ):
        '''Updates the Fish properties based on the passed in data'''
        
        self.name = data_list1[0]
        self.population = data_list1[1]
        #   delete first two entries from both lists
        data_list1 = data_list1[2:]
        data_list2 = data_list2[2:]
        locus_pairs_list = self.__combine_lists_into_list_pairs( data_list1, data_list2 )
        
        i = 0        
        for locus_pair in locus_pairs_list:
            locus = Locus()
            locus.identifier = header_list[i]
            
            locus.amino_acids["A"] += 0.5 * locus_pair.count( "1" )
            locus.amino_acids["C"] += 0.5 * locus_pair.count( "2" )
            locus.amino_acids["G"] += 0.5 * locus_pair.count( "3" )
            locus.amino_acids["T"] += 0.5 * locus_pair.count( "4" )
            
            self.genotype.append( locus )
            i = i + 1
            
    def __combine_lists_into_list_pairs( self, list1, list2 ):
        '''This takes 2 lists and creates a third list with list pairs for each index'''
        if len( list1 ) != len( list2 ):
            raise NameError( "Lists have to have equal length!!!" )
        
        combined_list = []
        i = 0
        while i < len( list1 ):
            array_pair = []
            array_pair.append( list1[i] )
            array_pair.append( list2[i] )
            
            combined_list.append( array_pair )
            i = i + 1

        return combined_list
    
    def locus_identifier_list( self ):
        '''returns a list of the identifiers for each locus in the genome list'''
        loci_list = []
        
        for locus in self.genotype:
            loci_list.append( locus.identifier )
        
        return loci_list
    
    def export_fish_to_CSV_with_writer( self, writer ):
        '''Takes a csv.writer object and will export 4 rows of data to the file associated with the writer--one row for A, C, G, T respectively'''
        amino_acid_list = ["A", "C", "G", "T"]

        for current_amino_acid in amino_acid_list:
            fish_name = self.name + "_" + current_amino_acid
            fish_population = self.population
            output_row_list = [fish_name, fish_population]
            locus_values_for_current_amino_acid = []
            for locus in self.genotype:
                locus_values_for_current_amino_acid.append( locus.amino_acids[current_amino_acid] )
                
            output_row_list += locus_values_for_current_amino_acid
            writer.writerow( output_row_list )
            
class Locus:
    identifier = ""
    amino_acids = {"A":0.0, "C":0.0, "G":0.0, "T":0.0}
    
    def __str__( self ):
        description_string = self.identifier + ": " 
        description_string += str( self.amino_acids )
        
        return description_string
    
#Script Starts Here
if __name__ == "__main__":
    print "Starting Script"
    fish_list = import_structure()
    
    #Demonstrating the bug:
    actual_genotype_size = len( fish_list[0].genotype )
    expected_size_of_genotype = 12
    if actual_genotype_size != expected_size_of_genotype:
        print "!!! expected genotype size of: " + str( expected_size_of_genotype ) + " instead it's: " + str( actual_genotype_size )
        
    #Commented out the rest of the script below:            
    #export_matrix( fish_list )


Was This Post Helpful? 0
  • +
  • -

#7 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7492
  • View blog
  • Posts: 15,527
  • Joined: 16-October 07

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 06:51 AM

This looked like fun. Note, you don't appear to be using that loci row from anything other than a counter. Is that right? Also, you don't seem to understand classes so much: Locus is being used as an odd kind of global. While classes are a great way to organize code, they can also add confusion if you're not careful.

Near as I can figure, this does the same thing as your code:

import csv

def getTableData(filename):
    with open( filename ) as csv_data_file:
        data_reader = csv.reader( csv_data_file, dialect = "excel" )
        table_rows = [ row for row in data_reader ]
        # csv_data_file.close() redundant, the with cleans this up
    return table_rows

def organizeRows(data):
    loci_identifiers = data.pop(0)[2:] #   skip first two items which will be blank:
    if len(data) % 2 == 1:
        raise NameError( "Need even number of rows, something is wrong with the data file." )

    def toRow(i):
        (x,y) = data[i:][:2]
        (name, population) = x[:2]
        pairs = list(x for xs in zip(x[2:], y[2:]) for x in xs)
        return name, population, pairs
    rows = [ toRow(i) for i in range(0,len(data),2) ]
    return loci_identifiers, rows


def appyPairs(data):
    (header, rows) = data
    acgt = (('A','1'), ('C','2'), ('G','3'), ('T','4')) 

    def nextRow(row):
        name, population, pairs = row
        amino = {"A":0.0, "C":0.0, "G":0.0, "T":0.0}
        for (k,v) in acgt: 
            amino[k] += 0.5 * pairs.count(v)
        # for locus_id in header:
        return name, population, amino

    return header, [ nextRow(row) for row in rows ]

def writePairs(data, filename):
    def writeRow(writer, header, row):
        name, population, amino = row

        for aminoId in "ACGT":
            fish_name = name + "_" + aminoId
            output_row_list = [fish_name, population]
            output_row_list += [ amino[aminoId] for x in header ]
            writer.writerow( output_row_list )

    (header, rows) = data

    with open( filename, "wb" ) as csv_out_file:
        data_writer = csv.writer( csv_out_file, dialect = "excel" )
        data_writer.writerow( ["Fish_ID", "Population"] + header )
        for row in rows:
            writeRow(data_writer, header, row)
        csv_out_file.close()

def main(filenameIn, filenameOut):
    print "Starting Script"
    data = getTableData(filenameIn)
    # print data
    data = organizeRows(data)
    # print data
    data = appyPairs(data)
    # print data
    writePairs(data, filenameOut)

if __name__ == "__main__":
    main('fish_in.csv', 'fish_out.csv')
    # main("/Users/gouldsc/Desktop/StructureTest.csv", "/Users/gouldsc/Desktop/PCAMatrixTest.csv")



Hope this helps.
Was This Post Helpful? 0
  • +
  • -

#8 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 11:57 AM

View Postbaavgai, on 05 December 2014 - 06:51 AM, said:

This looked like fun. Note, you don't appear to be using that loci row from anything other than a counter. Is that right?

That's correct the loci row (header row) is counting the number of columns (in the test data we have 12--excluding the fish_identifier and location columns, but the real data set is much larger). Each one is the identifier for that locus. In my code, there's a genotype list containing each locus (there should be 12 for this test data) for each individual fish. The ACGT values of the loci in the genotype should be unique for each fish. In my code it appears to be treating them as globals even though that's not my intent. Every 2 rows represents a fish (DNA has 2 strands). The goal is to output 4 rows for each fish (A, C, G & T). The value in each column should be either 0.0, 0.5 (one of the DNA strands has this amino acid at that loci), and 1.0 (both DNA strands have this amino acid). I ran your code (I greatly appreciate you putting in so much effort) and unfortunately, the output values for each loci are not the expected outputs (although unlike mine, you have the correct number of columns and rows--a major improvement). I haven't had a chance to really study it yet though. I will try to figure out where it's going wrong and post back (it could be that I've been very unclear about what should happen--I apologize if that's the case).

Quote

Also, you don't seem to understand classes so much: Locus is being used as an odd kind of global.

Are you saying this because you see errors in my code? or because of my responses? When you say "Locus is being used as an odd kind of global," do you mean that in the code the loci are being declared as globals (because this is not the intent!!!, please tell me where this is happening and I'll fix it right away), or are you referring to how the loci identifiers are constant for each fish's genotype?

Quote

While classes are a great way to organize code, they can also add confusion if you're not careful.

Near as I can figure, this does the same thing as your code:

Hope this helps.

Ideally, I'd like to keep the classes in the script (even if it makes things a bit longer) since I know my sister has other plans. For one, she will be doing something similar with bacteria DNA (the bioluminescent symbiont of the fish) that has a different data set (but will have a similar class structure). This will keep code reuse high and reduce the chance of errors between data sets. Also, I tend to think better in terms of objects, properties, methods, etc. I suspect my bug may be related to values being passed as references somehow. I tried doing a deepcopy() on some of the lists before passing them into my methods as parameters under the assumption that it would destroy any references. It didn't fix the bug though and just cluttered up the code (and reduced performance I assume), so I removed them.

I really do appreciate the help. I'm going to study your code to see if I can't make it do what I need it to.

Oh and if anyone's curious what this is for. It's the DNA of a Japanese Cardinal fish off of Okinawa, Siphamia tubifer:

Posted Image

My sister's dissertation is looking at different genetic differences in populations of these fish at various locations and their bioluminescent bacterial symbionts to advance the knowledge about how fish acquire their symbionts more generally. This script is trying to normalize the data between the fish and the bacteria so she can run fancy statistical things (that are way over my head) comparing the populations. Thanks again for everyone's help.
Was This Post Helpful? 0
  • +
  • -

#9 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 01:16 PM

Well I just spoke with my sister and she's changed the requirements on how she wants the data exported. Now she wants an A, C, G, T for each locus (4 columns for each locus instead of 1) and only one row for each fish (instead of 4). I suspected these kinds of changes might happen, (which is why I was looking to use an object oriented approach). I'm going to refactor my exporting code and repost. This shouldn't fix the bug though, because it's an importing issue (something's wrong with how I'm creating the fish objects).
Was This Post Helpful? 0
  • +
  • -

#10 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 02:40 PM

I just made the changes to the code's export process. The import bug is still there, but the exporting process appears to be functioning as expected.

Here's the updated code:
import csv
kFile_name_input = "/Users/gouldsc/Desktop/StructureTest.csv"
kFile_name_output = "/Users/gouldsc/Desktop/PCAMatrixTest.csv"

def import_structure():
    fish_list = []
    
    with open( kFile_name_input ) as csv_data_file:
        data_reader = csv.reader( csv_data_file, dialect = "excel" )
        table_rows = []
        for row in data_reader:
            table_rows.append( row )
        
        csv_data_file.close()
        
    loci_identifiers = table_rows[0]
    #   delete the first two items which will be blank:
    loci_identifiers = loci_identifiers[2:]
    #   exclude the header row for row count, each fish is 2 rows
    row_count = len( table_rows ) - 1

    if is_odd( row_count ):
        raise NameError( "Need even number of rows, something is wrong with the data file." )
    
    i = 1
    while i < row_count:
        #   each individual fish has 2 rows of data
        if is_odd( i ):
            odd_fish_data = table_rows[i]
            even_fish_data = table_rows[i + 1]

            fish = Fish()
            fish.import_data( odd_fish_data, even_fish_data, loci_identifiers )
            fish_list.append( fish )
        else:
            raise NameError( "Logic error in code. This shouldn't get called" )
        
        i = i + 2
    
    return fish_list
    
def is_odd( number ):
    if number == 0:
        return False
        
    if number % 2 != 0:
        return True
    else:
        return False
        
def export_matrix( fish_list ):
    '''writes out a .csv file, iterating through each fish in the list'''
    header_row_list = ["Fish_ID", "Population"]
    locus_identifiers = fish_list[0].locus_identifier_list()
    expanded_locus_identifiers = []
    acgt = ["A", "C", "G", "T"]
    for identifier in locus_identifiers:
        for amino_acid in acgt:
            tagged_identifier = identifier + "_" + amino_acid
            expanded_locus_identifiers.append( tagged_identifier )
            
    header_row_list += expanded_locus_identifiers
    
    with open( kFile_name_output, "wb" ) as csv_out_file:
        data_writer = csv.writer( csv_out_file, dialect = "excel" )
        data_writer.writerow( header_row_list )
        
        for fish in fish_list:
            fish.export_fish_to_CSV_with_writer( data_writer )
        
        csv_out_file.close()
        
class Fish:
    genotype = []
    name = ""
    population = ""
    
    def __repr__( self ):
        fish_description = "name:" + self.name + "/n"
        fish_description += "population:" + self.population + "/n"
        fish_description += "genotype:\n"
        
        for loci in self.genotype:
            fish_description += str( loci )
            fish_description += "\n"
            
        return fish_description
        
    def import_data( self, data_list1, data_list2, header_list ):
        '''Updates the Fish properties based on the passed in data'''
        
        self.name = data_list1[0]
        self.population = data_list1[1]
        #   delete first two entries from both lists
        data_list1 = data_list1[2:]
        data_list2 = data_list2[2:]
        locus_pairs_list = self.__combine_lists_into_list_pairs( data_list1, data_list2 )
        
        i = 0        
        for locus_pair in locus_pairs_list:
            locus = Locus()
            locus.identifier = header_list[i]
            
            locus.amino_acids["A"] += 0.5 * locus_pair.count( "1" )
            locus.amino_acids["C"] += 0.5 * locus_pair.count( "2" )
            locus.amino_acids["G"] += 0.5 * locus_pair.count( "3" )
            locus.amino_acids["T"] += 0.5 * locus_pair.count( "4" )
            
            self.genotype.append( locus )
            i = i + 1
            
    def __combine_lists_into_list_pairs( self, list1, list2 ):
        '''This takes 2 lists and creates a third list with list pairs for each index'''
        if len( list1 ) != len( list2 ):
            raise NameError( "Lists have to have equal length!!!" )
        
        combined_list = []
        i = 0
        while i < len( list1 ):
            array_pair = []
            array_pair.append( list1[i] )
            array_pair.append( list2[i] )
            
            combined_list.append( array_pair )
            i = i + 1

        return combined_list
    
    def locus_identifier_list( self ):
        '''returns a list of the identifiers for each locus in the genome list'''
        loci_list = []
        
        for locus in self.genotype:
            loci_list.append( locus.identifier )
        
        return loci_list
    
    def export_fish_to_CSV_with_writer( self, writer ):
        '''Takes a csv.writer object and will export 4 rows of data to the file associated with the writer--one row for A, C, G, T respectively'''
        output_row_list = [self.name, self.population]
        locus_columns = []
        for locus in self.genotype:
            amino_acid_values_list = locus.amino_acids.values()
            for amino_acid_value in amino_acid_values_list:
                locus_columns.append( amino_acid_value )
            
        output_row_list += locus_columns
        writer.writerow( output_row_list )
            
class Locus:
    identifier = ""
    amino_acids = {"A":0.0, "C":0.0, "G":0.0, "T":0.0}
    
    def __str__( self ):
        description_string = self.identifier + ": " 
        description_string += str( self.amino_acids )
        
        return description_string
    
#Script Starts Here
if __name__ == "__main__":
    print "Starting Script"
    fish_list = import_structure()
    
    #Demonstrating the bug:
    actual_genotype_size = len( fish_list[0].genotype )
    expected_size_of_genotype = 12
    if actual_genotype_size != expected_size_of_genotype:
        print "!!! expected genotype size of: " + str( expected_size_of_genotype ) + " instead it's: " + str( actual_genotype_size )
        
    #Commented out the rest of the script below:            
    #export_matrix( fish_list )


Was This Post Helpful? 0
  • +
  • -

#11 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7492
  • View blog
  • Posts: 15,527
  • Joined: 16-October 07

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 03:59 PM

View Postgouldsc, on 05 December 2014 - 02:57 PM, said:

The value in each column should be either 0.0, 0.5 (one of the DNA strands has this amino acid at that loci), and 1.0 (both DNA strands have this amino acid).


Thanks, this makes a lot more sense.

View Postgouldsc, on 05 December 2014 - 02:57 PM, said:

Are you saying this because you see errors in my code?


Yes and no. Logic errors, it looks like.

Understand that in a class, variables defined in the class:
class Foo:
    a = 1



Belong to the class, not the instance. Also, a good class has a constructor, even it it's empty. By setting a value in an instance, you're defining a new value, but it's not the same a the class value. Seeing that kind of thing is concerning.


View Postgouldsc, on 05 December 2014 - 04:16 PM, said:

Now she wants an A, C, G, T for each locus (4 columns for each locus instead of 1)


Cool. Here's a stab:
import csv

def getTableData(filename):
    with open( filename ) as csv_data_file:
        data_reader = csv.reader( csv_data_file, dialect = "excel" )
        table_rows = [ row for row in data_reader ]
        # csv_data_file.close() redundant, the with cleans this up
    return table_rows

def organizeRows(data):
    loci_identifiers = data.pop(0)[2:] #   skip first two items which will be blank:
    if len(data) % 2 == 1:
        raise NameError( "Need even number of rows, something is wrong with the data file." )

    def toRow(i):
        (x,y) = data[i:][:2]
        (name, population) = x[:2]
        # pairs = list(x for xs in zip(x[2:], y[2:]) for x in xs)
        pairs = zip(x[2:], y[2:])
        return name, population, pairs
    rows = [ toRow(i) for i in range(0,len(data),2) ]
    return loci_identifiers, rows


def appyPairs(data):
    (header, rows) = data
    def add(xs, value):
        i = "1234".find(value)
        if not i==-1:
            xs[i] += 0.5
    def addPair(pair):
        xs = [ 0.0, 0.0, 0.0, 0.0 ]
        add(xs, pair[0])
        add(xs, pair[1])
        return xs
    def nextRow(row):
        name, population, pairs = row
        xs = [ name, population ]
        for pair in pairs: 
            xs.extend(addPair(pair))
        return xs
    return header, [ nextRow(row) for row in rows ]


def writePairs(data, filename):
    (header, rows) = data
    with open( filename, "wb" ) as csv_out_file:
        writer = csv.writer( csv_out_file, dialect = "excel" )
        writer.writerow( ["", ""] +  [ c for c in ("ACGT" * len(header)) ] )
        writer.writerow( ["Fish_ID", "Population"] +  [ x for x in header for i in range(4) ] )
        for row in rows:
            writer.writerow(row)

def main(filenameIn, filenameOut):
    print "Starting Script"
    data = getTableData(filenameIn)
    # print data
    data = organizeRows(data)
    # print data
    data = appyPairs(data)
    # print data
    writePairs(data, filenameOut)

if __name__ == "__main__":
    main('fish_in.csv', 'fish_out.csv')
    # main("/Users/gouldsc/Desktop/StructureTest.csv", "/Users/gouldsc/Desktop/PCAMatrixTest.csv")



It actually ended up being easier, knowing the rules. One row per fish also made it easier.
Was This Post Helpful? 1
  • +
  • -

#12 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7492
  • View blog
  • Posts: 15,527
  • Joined: 16-October 07

Re: Python 2.7 script for working with DNA in .csv

Posted 05 December 2014 - 05:26 PM

Upon further consideration and thought it would be interesting to give this the OOP treatment. It actually worked out pretty well, I think:
import csv

class Locus:
    def __init__(self, left, right):
        def toNum(x): return "1234".find(x)
        self.left, self.right = toNum(left), toNum(right)
    def getData(self):
        xs = [ 0.0, 0.0, 0.0, 0.0 ]
        xs[self.left] += 0.5
        xs[self.right] += 0.5
        return xs

class Fish:
    def __init__(self, left, right):
        self.name, self.population = left[:2]
        self.loci = [ Locus(*x) for x in zip(left[2:], right[2:]) ]
    def getData(self):
        xs = [ self.name, self.population ]
        for x in self.loci:
            xs.extend(x.getData())
        return xs

class Fishes:
    def __init__(self, filename):
        self.items = []
        with open( filename, 'rb' ) as fh:
            reader = csv.reader( fh, dialect = "excel" )
            i, l, r = 0, None, None
            for row in reader:
                if i==0:
                    self.loci_identifiers = row[2:] #   skip first two items which will be blank:
                    # print self.loci_identifiers
                elif i % 2 == 1:
                    l = row
                else:
                    r = row
                    self.items.append(Fish(l,r))
                i += 1
    def getData(self):
        yield ["", ""] +  [ c for c in ("ACGT" * len(self.loci_identifiers)) ]
        yield ["Fish_ID", "Population"] +  [ x for x in self.loci_identifiers for i in range(4) ] 
        for x in self.items:
            yield x.getData()
    def write(self, filename):
        with open( filename, "wb" ) as csv_out_file:
            writer = csv.writer( csv_out_file, dialect = "excel" )
            for x in self.getData():
                writer.writerow(x)

def main(filenameIn, filenameOut):
    print "Starting Script"
    fishes = Fishes(filenameIn)
    # print fishes.items[0].getData()
    # print [ x for x in fishes.getData() ]
    fishes.write(filenameOut)

if __name__ == "__main__":
    main('fish_in.csv', 'fish_out.csv')
    # main("/Users/gouldsc/Desktop/StructureTest.csv", "/Users/gouldsc/Desktop/PCAMatrixTest.csv")


Was This Post Helpful? 1
  • +
  • -

#13 gouldsc   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 12
  • Joined: 03-December 14

Re: Python 2.7 script for working with DNA in .csv

Posted 06 December 2014 - 10:55 AM

Thank you so much for your help. I've learned a lot from this thread. First I had no clue that I was declaring class variables instead of instance variables (the source of my bug). In the other languages I've used this is done by using special keywords. I'm surprised by how this is done in Python, but it actually does make some sense since one of the design goals of the language was to minimize clutter (still a bit weird to me though, but I'll get over it).

One thing I don't understand is if I was declaring all of my properties as class variables, how did it preserve the unique properties like name for each fish? Why isn't that being overwritten with each new fish created if that variable is being shared across all instances? If all fish had the same identifier, that would have made the bug much more obvious. Clearly I need to do more reading.

Some of the patterns you guys are using are very interesting. I think many would be applicable to stuff I write in Swift. They say it's good for programmers to learn many languages, this thread has validated that idea for me. Again, thanks so much for your help.
Was This Post Helpful? 0
  • +
  • -

#14 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7492
  • View blog
  • Posts: 15,527
  • Joined: 16-October 07

Re: Python 2.7 script for working with DNA in .csv

Posted 06 December 2014 - 06:57 PM

View Postgouldsc, on 06 December 2014 - 01:55 PM, said:

One thing I don't understand is if I was declaring all of my properties as class variables, how did it preserve the unique properties like name for each fish?


This is kind of fun:
>>> class Foo:
...     value = 42
... 
>>> x = Foo()
>>> x.value
42
>>> x.value = 1
>>> x.value
1
>>> y = Foo()
>>> y.value
42
>>> x.value
1
>>> Foo.value
42
>>> x.__dict__
{'value': 1}
>>> y.__dict__
{}
>>> x.thingy = "Hi"
>>> x.__dict__
{'thingy': 'Hi', 'value': 1}
>>> 



So, essentially, an object instance just a dictionary bucket with some extra scaffolding. When you ask for an attribute that's undefined, you'll fall through to the class variable. When you set an instance variable, it will override the class value.

On the global end, try:
>>> Foo.value = 3.14
>>> y.value
3.14
>>> z = Foo()
>>> z.value
3.14



Python is interesting in that there is no true private. You can poke around the guts of things fairly easily. Classes are more contracts than hard definitions. You can take an instance and add more stuff to it, similar to the prototype language, Javascript. Indeed, when you pass an instance in Python, you don't care about the type so much as that it has the properties you wish to access.

Good Python programs never ask about type, they just assume and move forward. If the assumption proves invalid, they crash. If you're coming from a more structured language, the duck type paradigm will seem jarring.
Was This Post Helpful? 2
  • +
  • -

#15 jon.kiparsky   User is offline

  • Beginner
  • member icon


Reputation: 11679
  • View blog
  • Posts: 19,840
  • Joined: 19-March 11

Re: Python 2.7 script for working with DNA in .csv

Posted 07 December 2014 - 12:47 AM

Nice examples. Just to lean on this point, an thing to notice here is that if the instance has a value explicitly set, it keeps it: it's in the instance's __dict__. But if the instance inherits a value from the class, then the value is retrieved by a lookup to the class.

This can be a useful way to get static values, but sometimes will be confusing. For your own sanity, I would suggest that if you mean a value to be particular to the instance (as in this case) you should initialize it in the object's constructor (the __init__ function).
Was This Post Helpful? 1
  • +
  • -

Page 1 of 1