Comparing two files line by line

 

chavanak

3 Nov, 2009 - 05:58 AM
Post #1

New D.I.C Head

Joined: 3 Nov, 2009
Posts: 4


Hi,
I am comparing 2000 files against one other file. I want the program to go through each line in both files and compare them. If a line is present in both, it has to be written to another file. What I tried was to open both files and use readlines() to read them into a list. Then I used a for loop like this:
CODE

import os
import string
from itertools import izip

chain_sep=[]
complex_file=open ("1complex.txt", "r")
complex_lines = complex_file.readlines()
complex_lines = map(string.strip, complex_lines)
splitter = [s.split('\t') for s in complex_lines]
complex_file.close()

# collect every .pd file in the current directory
for file in os.listdir("."):
    basename=os.path.basename(file)
    if basename.endswith(".pd"):
        chain_sep.append(basename)

for (i,s) in izip(chain_sep,splitter):
    fhandle_6 =open (i, "r")
    from_pd = fhandle_6.readlines()
    from_pd = map(string.strip,from_pd)
    fhandle_6.close()
    fhandle_13 = open(s[0]+".cr", 'r')
    fhandle_13_l = fhandle_13.readlines()
    fhandle_13_l = map(string.strip, fhandle_13_l)
    fhandle_13.close()
    fopen_7=open (i+"r.pdb", "w")
    fopen_8=open (i+"l.pdb", "w")
    for (a,y) in izip(from_pd,fhandle_13_l): # from_pd and fhandle_13_l are not the same length :(
        if a[0:4]=="ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            else:
                if a[7:13]==y[7:13]:
                    print >>fopen_8, a
    # close the output files before moving on to the next pair
    fopen_7.close()
    fopen_8.close()


The above code is only a chunk, btw. My problem is that the two files are not the same size, so I feel that zip or izip is not ideal in this situation. Is there any other way to compare the two files line by line and still iterate over each file completely?

Thanks in advance,
Cheers,
Chav.

P.S: Though this is an assignment, I am not asking for the code (unless it is too complex). Please let me know any ideas or modules that can help me. Also both the files are huge with around 2000 lines each.

This post has been edited by chavanak: 3 Nov, 2009 - 06:27 AM



baavgai

RE: Comparing Two Files Line By Line

3 Nov, 2009 - 07:01 AM
Post #2

Dreaming Coder

Joined: 16 Oct, 2007
Posts: 4,349



Thanked: 411 times
Dream Kudos: 550
Expert In: C, C++, Java, C#, ASP.NET, PHP, Perl, Python, Oracle, SQL Server, MySql, HTML, JavaScript, Lua, Cheese

There is a version of izip, called izip_longest that may suit: http://docs.python.org/library/itertools.h...ls.izip_longest

Here's my refactoring of your code, if it helps.
python

import os
import string
from itertools import izip, izip_longest

def getFileLines(fileName):
    fh = open (fileName, "r")
    lines = fh.readlines()
    fh.close()
    return map(string.strip, lines)

splitter = [ s.split('\t') for s in getFileLines("1complex.txt") ]
chain_sep = [ os.path.basename(file) for file in os.listdir('.') if file.endswith('.pd') ]

for (i,s) in izip(chain_sep,splitter):
    fopen_7=open (i+"r.pdb", "w")
    fopen_8=open (i+"l.pdb", "w")
    # izip_longest pads the shorter input with None (its default fillvalue)
    for (a,y) in izip_longest(getFileLines(i), getFileLines(s[0]+".cr")):
        # a is None once the first file has run out of lines
        if a is not None and a[0:4]=="ATOM":
            if a[21] == "R":
                print >>fopen_7, a
            elif y is not None and a[7:13]==y[7:13]:
                print >>fopen_8, a
    fopen_7.close()
    fopen_8.close()


If not, well, it really shouldn't be too hard to roll your own.
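
To make the difference concrete, here is a small sketch (not from the original post) that contrasts plain zip with the longest-zip function on two lists of unequal length. The import fallback is there because the function is named izip_longest on Python 2.6+ and zip_longest on Python 3; the list contents are made up for illustration.

```python
try:
    from itertools import izip_longest as zip_longest  # Python 2.6 / 2.7
except ImportError:
    from itertools import zip_longest  # Python 3

short = ["a1", "a2"]
longer = ["b1", "b2", "b3"]

# zip stops at the end of the shorter list, so "b3" is silently dropped
truncated = list(zip(short, longer))

# zip_longest keeps going and pads the shorter side with None by default
padded = list(zip_longest(short, longer))

# fillvalue must be passed by keyword; a positional argument would be
# treated as one more iterable to zip
padded_empty = list(zip_longest(short, longer, fillvalue=""))

print(truncated)
print(padded)
print(padded_empty)
```

Because the padding value shows up in the loop variables, any code that slices or indexes the paired lines has to check for it first.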


chavanak

RE: Comparing Two Files Line By Line

3 Nov, 2009 - 07:19 AM
Post #3

Hi,
Thanks for the reply. Though izip_longest provides the solution I need, I cannot use it on my work computer, since it is pegged at Python 2.4 and izip_longest is a 2.6 feature :(

So is it possible to do it any other way? Please do let me know.
Cheers,
Chav


baavgai

RE: Comparing Two Files Line By Line

3 Nov, 2009 - 02:02 PM
Post #4

"My library doesn't have that function so I give up" is not a programmer's attitude. Rather, the mindset should be "in the absence of a library, I get to write my own function to suit my needs."

In this instance, the function is almost trivial and shouldn't take more than a few minutes to put together. About five, actually.

The izip and izip_longest functions return iterators. Here's my take on what I'd want from izip_longest:
python

def myZip(list1, list2):
    if len(list1)<len(list2):
        smallest = len(list1)
    else:
        smallest = len(list2)
    for i in range(smallest): yield (list1[i], list2[i])
    for v in list1[smallest:]: yield (v, None)
    for v in list2[smallest:]: yield (None, v)

a1 = ['x'+str(i) for i in range(5) ]
a2 = ['y'+str(i) for i in range(8) ]
print [i for i in myZip(a1, a2)]


Result:
CODE

[('x0', 'y0'), ('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3'), ('x4', 'y4'), (None, 'y5'), (None, 'y6'), (None, 'y7')]
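
myZip above needs real lists, because it relies on len() and slicing. As a hedged variation, here is a minimal sketch that accepts any two iterables (open file objects, for example) and avoids both izip_longest and the built-in next(), so it should run on Python 2.4 as well as on Python 3. The names my_zip_longest and fill are made up for this example.

```python
def my_zip_longest(iter1, iter2, fill=None):
    """Pair items from two iterables, padding the shorter one with fill."""
    it2 = iter(iter2)
    for a in iter1:
        # Pull at most one item from it2 for this value of a.
        for b in it2:
            yield (a, b)
            break
        else:
            # The inner loop found nothing: it2 is exhausted, so pad.
            yield (a, fill)
    # Whatever is still left in it2 has no partner in iter1.
    for b in it2:
        yield (fill, b)


a1 = ['x' + str(i) for i in range(2)]
a2 = ['y' + str(i) for i in range(4)]
pairs = list(my_zip_longest(a1, a2))
print(pairs)
```

The for/else form is only a way to take one item from an iterator without calling next(); on newer Python versions itertools already provides the same behavior.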




chavanak

RE: Comparing Two Files Line By Line

4 Nov, 2009 - 01:47 AM
Post #5

Hi,
Thanks for the help. I do understand that if there is no library you have to create one, but this is most likely the last time I will be using Python, and I have only just started with it :( I did try out your solution, making the changes I wanted, but I failed to get what I want :( Just to expand a bit on my problem:
CODE

file-1
ATOM   [b]2197  CB  CYS I  51[/b]      38.091 -13.002   6.320  1.00 20.12
ATOM   [b]2198  SG  CYS I  51[/b]      39.781 -12.827   5.691  1.00 26.67
ATOM   [b]2199  N   MET I  52[/b]      37.845 -15.766   5.722  1.00 33.08
ATOM   [b]2200  CA  MET I  52[/b]      38.312 -17.144   5.674  1.00 33.08

CODE

file-2
ATOM   [b]2197  O   ASP L  50[/b]      18.653  89.329  84.802  1.00  0.00
ATOM   [b]2198  CB  ASP L  50[/b]      16.004  87.278  84.523  1.00  0.00
ATOM   [b]2199  CG  ASP L  50[/b]      15.349  86.109  85.277  1.00  0.00
ATOM   [b]2200  OD1 ASP L  50[/b]      15.347  85.935  86.514  1.00  0.00


As you can see in the data above, the only part that is common to both files is the part in bold, between the [b] and [/b] markers (the above is just a chunk of the data). So ideally I am supposed to compare the bold data from file 1 against file 2, and if it exists in file 2, I have to retain it and remove the remaining data.
For example:
CODE

[b]2197  CB  CYS I  51[/b]
[b]2197  CB  CYS I  51[/b]


If the above entry is there in both files, then I have to retain it in file-2 and remove all other entries. I tried to add the required list positions to the sample code you gave me, but I failed to get the results. Please let me know whether I can differentiate the above data, and if so, how. I tried the same in Perl and was able to do it very easily, but in Python it is proving tougher for me, as I am very new to Python (I have been learning it for the past week or so).
Cheers,
Chav

baavgai

RE: Comparing Two Files Line By Line

4 Nov, 2009 - 05:03 AM
Post #6

I'm not really following your logic. However, I can offer this:

Don't think about processing files. File processing is a pain. Rather, in Python in particular, think about processing lists.

The following code takes a pair of lists and marks each item according to whether it is valid and whether it matches an item in the other list. From this you may be able to find something applicable to your project. Remember, don't be afraid of functions; they add organization and to some extent are self-documenting. I use a number of "list comprehensions" here. If you don't know what they are, learn them well; they are a common Python technique.

python

listDataTypes = ['Unknown', 'Junk', 'Unique', 'Both']

def processLists(list1, list2):
    def markAll(plist, markValue):
        count = 0
        for item in plist:
            item[1] = markValue
            count += 1
        return count

    def markJunk(plist):
        return markAll([v for v in plist if not v[0][0:4]=="ATOM" ], 1)

    def markBoth(pList1, pList2):
        def markMatch(plist, value, markValue):
            return markAll([v for v in plist if v[0]==value ], markValue)

        for item in pList1:
            if item[1]==0:
                value = item[0]
                if not markMatch(pList2, value, 3)==0:
                    markMatch(pList1, value, 3)
                else:
                    markMatch(pList1, value, 2)

    # make a list of the values with flags, set to unknown
    pList1 = [ [s, 0] for s in list1 ]
    pList2 = [ [s, 0] for s in list2 ]

    # exclude junk
    markJunk(pList1)
    markJunk(pList2)

    # find matches
    markBoth(pList1, pList2)

    # those left in pList2 are unique
    markAll([v for v in pList2 if v[1]==0], 2)

    return (pList1, pList2)



def test():
    def showList1(list, name):
        print "---", name
        for item in list:
            print "\t", item
        print "---\n"

    def showList2(list, name):
        print "---", name
        for item in list:
            print "%s\t%s" % (listDataTypes[item[1]], item[0])
        print "---\n"

    d1 = ['ATOM 123','junk data','ATOM 213','ATOM 132']
    d2 = ['ATOM 232','ATOM 123','junk data','ATOM 321']
    showList1(d1, 'list (or file contents) 1')
    showList1(d2, 'list (or file contents) 2')

    (pd1, pd2) = processLists(d1, d2)
    showList2(pd1, 'processed list 1')
    showList2(pd2, 'processed list 2')

test()


Results:
CODE

--- list (or file contents) 1
    ATOM   123
    junk data
    ATOM   213
    ATOM   132
---

--- list (or file contents) 2
    ATOM   232
    ATOM   123
    junk data
    ATOM   321
---

--- processed list 1
Both    ATOM   123
Junk    junk data
Unique    ATOM   213
Unique    ATOM   132
---

--- processed list 2
Unique    ATOM   232
Both    ATOM   123
Junk    junk data
Unique    ATOM   321
---


Hope this helps.
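
For the narrower task described in post #5 (keep only the file-2 lines whose bold field also occurs in file 1), a set lookup may be enough, since the match no longer depends on line position. This is a minimal sketch under stated assumptions: the comparison field is taken to be characters 7 through 25 of each ATOM line, which is where the bold span falls in the sample data; atom_key, keep_common and the slice bounds are hypothetical names and values, and the sample lines are adapted from the thread rather than taken from real files.

```python
def atom_key(line, start=7, end=26):
    """Return the part of an ATOM line used for matching (assumed columns)."""
    return line[start:end]


def keep_common(lines1, lines2):
    """Return the ATOM lines of lines2 whose key also occurs in lines1."""
    keys1 = set(atom_key(l) for l in lines1 if l.startswith("ATOM"))
    return [l for l in lines2
            if l.startswith("ATOM") and atom_key(l) in keys1]


# Sample data: only the first line of file2 shares its key with file1.
file1 = [
    "ATOM   2197  CB  CYS I  51      38.091 -13.002   6.320  1.00 20.12",
    "ATOM   2198  SG  CYS I  51      39.781 -12.827   5.691  1.00 26.67",
]
file2 = [
    "ATOM   2197  CB  CYS I  51      18.653  89.329  84.802  1.00  0.00",
    "ATOM   2198  CB  ASP L  50      16.004  87.278  84.523  1.00  0.00",
    "junk data",
]

kept = keep_common(file1, file2)
for line in kept:
    print(line)
```

Building the set once keeps each lookup cheap, so files of a few thousand lines are not a concern. If the real field sits in different columns, only the start and end defaults need to change.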
