Hi, I am comparing 2000 files with one other file. I want the program to go through each line in both files and compare them. If a line is present in both, it has to be written to another file. What I tried was to open both files and use readlines() to read each one into a list. Then I used a for loop like this:
CODE
chain_sep=[]
complex_file=open ("1complex.txt", "r")
complex_lines = complex_file.readlines()
complex_lines = map(string.strip, complex_lines)
splitter = [s.split('\t') for s in complex_lines]
complex_file.close()

for file in os.listdir("."):
    basename=os.path.basename(file)
    if basename.endswith(".pd"):
        chain_sep.append(basename)

for (i,s) in izip(chain_sep,splitter):
    fhandle_6 =open (i, "r")
    from_pd = fhandle_6.readlines()
    from_pd = map(string.strip,from_pd)
    fhandle_6.close()
    fhandle_13 = open(s[0]+".cr", 'r')
    fhandle_13_l = fhandle_13.readlines()
    fhandle_13_l = map(string.strip, fhandle_13_l)
    fhandle_13.close()
    fopen_7=open (i+"r.pdb", "w")
    fopen_8=open (i+"l.pdb", "w")
    for (a,y) in izip(from_pd,fhandle_13_l): #from_pd and fhandle_13_l is not of the same length :(
        if a[0:4]=="ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            else:
                if a[7:13]==y[7:13]:
                    print >>fopen_8, a
    fopen_7.close()
    fopen_8.close()
The above code is only a chunk, by the way. My problem is that the two files are not the same length, so I feel that using zip or izip is not ideal in this situation. Is there any other solution where I can compare line by line between both files and still iterate over the complete file?
Thanks in advance, Cheers, Chav.
P.S.: Though this is an assignment, I am not asking for the code (unless it is too complex). Please let me know of any ideas or modules that can help me. Also, both files are large, with around 2000 lines each.
This post has been edited by chavanak: 3 Nov, 2009 - 06:27 AM
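import os
from itertools import izip, izip_longest

# getFileLines(path) is expected to return the stripped lines of a file
splitter = [ s.split('\t') for s in getFileLines("1complex.txt") ]
chain_sep = [ os.path.basename(file) for file in os.listdir('.') if file.endswith('.pd') ]

for (i,s) in izip(chain_sep,splitter):
    fopen_7=open (i+"r.pdb", "w")
    fopen_8=open (i+"l.pdb", "w")
    # izip_longest pads the shorter input with None (its default fillvalue)
    for (a,y) in izip_longest(getFileLines(i), getFileLines(s[0]+".cr")):
        # a is None once the first file runs out
        if a is not None and a[0:4]=="ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            elif y is not None and a[7:13]==y[7:13]:
                print >>fopen_8, a
    fopen_7.close()
    fopen_8.close()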
If not, well, it really shouldn't be too hard to roll your own.
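The getFileLines helper used above is not defined anywhere in the post. A minimal guess at what it might look like, assuming it simply returns the whitespace-stripped lines of a file (the name comes from the post, the body is an assumption):

```python
def getFileLines(path):
    # Assumed behaviour: read the whole file and strip each line,
    # like the readlines() + map(string.strip, ...) pair in the question.
    f = open(path, "r")
    try:
        return [line.strip() for line in f]
    finally:
        f.close()
```

It uses try/finally rather than a with block so that it should also work on older interpreters such as 2.4, though it has not been tried there.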
Hi, thanks for the reply. Though izip_longest provides the solution I need, I cannot use it on my work computer since it is pegged at Python 2.4 and izip_longest is a 2.6 feature.
So is it possible to do it any other way? Please do let me know. Cheers, Chav
"My library doesn't have that function so I give up" is not a programmer's attitude. Rather, the mindset should be "in the absence of a library, I get to write my own function to suit my needs."
In this instance, the function is almost trivial and shouldn't take more than a few minutes to put together. About five, actually.
The izip and izip_longest functions return iterators. Here's my take on what I'd want from izip_longest:
python
def myZip(list1, list2):
    if len(list1)<len(list2):
        smallest = len(list1)
    else:
        smallest = len(list2)
    for i in range(smallest):
        yield (list1[i], list2[i])
    for v in list1[smallest:]:
        yield (v, None)
    for v in list2[smallest:]:
        yield (None, v)

a1 = ['x'+str(i) for i in range(5) ]
a2 = ['y'+str(i) for i in range(8) ]
print [i for i in myZip(a1, a2)]
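myZip needs real lists, since it calls len() and slices its arguments. If the inputs are arbitrary iterables (an open file, say), a variant along these lines might do instead. This is only a sketch: pad_zip and its fill parameter are made-up names, and it deliberately avoids next() and itertools so that it should run on 2.4 as well as on current Python, though it has not been tried on 2.4.

```python
def pad_zip(seq1, seq2, fill=None):
    """Pair items from two iterables, padding the shorter one with fill."""
    it2 = iter(seq2)
    for a in seq1:
        # Take at most one item from it2; the else branch runs
        # only when it2 has nothing left to give.
        for b in it2:
            yield (a, b)
            break
        else:
            yield (a, fill)
    # Whatever is left in it2 had no partner in seq1.
    for b in it2:
        yield (fill, b)

a1 = ['x' + str(i) for i in range(2)]
a2 = ['y' + str(i) for i in range(3)]
pairs = list(pad_zip(a1, a2))
# pairs == [('x0', 'y0'), ('x1', 'y1'), (None, 'y2')]
```

The for/else pair is just a way of pulling a single item from an iterator without calling next(), which did not exist as a built-in before 2.6.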
Hi, thanks for the help. I do understand that if there is no library, you have to create one, but this is most likely the last time I will be using Python, and I have only just started with it. I did try out your solution, making the changes I wanted, but I failed to get what I want. Just to expand a bit on my problem:
CODE
file-1
ATOM [b]2197 CB CYS I 51[/b] 38.091 -13.002 6.320 1.00 20.12
ATOM [b]2198 SG CYS I 51[/b] 39.781 -12.827 5.691 1.00 26.67
ATOM [b]2199 N MET I 52[/b] 37.845 -15.766 5.722 1.00 33.08
ATOM [b]2200 CA MET I 52[/b] 38.312 -17.144 5.674 1.00 33.08
CODE
file-2
ATOM [b]2197 O ASP L 50[/b] 18.653 89.329 84.802 1.00 0.00
ATOM [b]2198 CB ASP L 50[/b] 16.004 87.278 84.523 1.00 0.00
ATOM [b]2199 CG ASP L 50[/b] 15.349 86.109 85.277 1.00 0.00
ATOM [b]2200 OD1 ASP L 50[/b] 15.347 85.935 86.514 1.00 0.00
As you can see in the data above, the only part that is common to both files is the part in bold (the above is just a chunk of the data). So ideally I am supposed to compare the bold data from file 1 and, if it exists in file 2, retain it and remove the remaining data. For e.g.:
CODE
[b]2197 CB CYS I 51[/b] [b]2197 CB CYS I 51[/b]
If the above entry is there in both files, then I have to retain it in file-2 and remove all other entries. I tried to add the required list positions to the sample code you gave me, but I failed to get the results. Please let me know whether I can differentiate the above data and, if so, how. I tried the same in Perl and was able to do it very easily, but in Python it is becoming tougher for me as I am very new to Python (learning for the past week or so). Cheers, Chav
I'm not really following your logic. However, I can offer this:
Don't think about processing files. File processing is a pain. Rather, in python in particular, think about processing lists.
The following code takes a pair of lists and marks them based on matching criteria and if the value is valid. From this you may be able to find something applicable to your project. Remember, don't be afraid of functions; they add organization and to some extent are self documenting. I use a number of "list comprehensions" here. If you don't know what they are, learn them well; they are a common python technique.
def markJunk(plist):
    return markAll([v for v in plist if not v[0][0:4]=="ATOM" ], 1)

def markBoth(pList1, pList2):
    def markMatch(plist, value, markValue):
        return markAll([v for v in plist if v[0]==value ], markValue)

    for item in pList1:
        if item[1]==0:
            value = item[0]
            if not markMatch(pList2, value, 3)==0:
                markMatch(pList1, value, 3)
            else:
                markMatch(pList1, value, 2)

# (the def line for this function is missing from the post; its body uses list1 and list2)
    # make a list of the values with flags, set to unknown
    pList1 = [ [s, 0] for s in list1 ]
    pList2 = [ [s, 0] for s in list2 ]

    # exclude junk
    markJunk(pList1)
    markJunk(pList2)

    # find matches
    markBoth(pList1, pList2)

    # those left in pList2 are unique
    markAll([v for v in pList2 if v[1]==0], 2)

    return (pList1, pList2)

def test():
    def showList1(list, name):
        print "---", name
        for item in list:
            print "\t", item
        print "---\n"

    def showList2(list, name):
        print "---", name
        for item in list:
            print "%s\t%s" % (listDataTypes[item[1]], item[0])
        print "---\n"
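For the task described earlier in the thread (keep only the file-2 entries whose bold field also appears in file-1), a lookup in a set may be simpler than pairing lines up by position, since the two files do not need to be the same length or in the same order. A rough sketch, not a definitive solution: atom_key and keep_common are made-up names, the key is assumed to be the second to sixth whitespace-separated fields of an ATOM record, and the first file2 line below is invented so that there is something to match.

```python
def atom_key(line):
    # Assumed key: serial number, atom name, residue, chain, residue number.
    return tuple(line.split()[1:6])

def keep_common(lines1, lines2):
    """Return the ATOM lines of lines2 whose key also occurs in lines1."""
    keys1 = set()
    for line in lines1:
        if line.startswith("ATOM"):
            keys1.add(atom_key(line))
    return [line for line in lines2
            if line.startswith("ATOM") and atom_key(line) in keys1]

file1 = [
    "ATOM 2197 CB CYS I 51 38.091 -13.002 6.320 1.00 20.12",
    "ATOM 2198 SG CYS I 51 39.781 -12.827 5.691 1.00 26.67",
]
file2 = [
    "ATOM 2197 CB CYS I 51 18.653 89.329 84.802 1.00 0.00",  # invented match
    "ATOM 2198 CB ASP L 50 16.004 87.278 84.523 1.00 0.00",
]
common = keep_common(file1, file2)
# common holds only the first file2 line
```

Real PDB files are fixed-width, so whitespace splitting can go wrong when columns run together; slicing by column position inside atom_key, as the question does with a[7:13], may be more robust.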