Compare multiple files for common entries

  • irishman211
    New Member
    • Apr 2013
    • 1

    Compare multiple files for common entries

    I apologize in advance; I'm trying to teach myself Python in my spare time since I was assigned this task. I am working on a way to examine a directory of thousands of files, looking for common entries. In this instance, I have extracted telephone numbers from thousands of pieces of code and stored them in individual folders, one per case. However, the files all have the same name, "telephone_histogram.txt". What I'm trying to accomplish is to compare across the entire directory and produce a file that tells me whether a number appears in more than one file, and in how many files it appears. The other problem is that each of the txt files has two columns, with the number appearing in the second column. Here's what I have so far, but I've only been able to compare two files, not a whole directory:
    Code:
    # Open each file and read all of its lines into a list called searchlines,
    # then sort the list so that duplicate entries end up next to each other
    with open("folder1/telephone_histogram.txt", "r") as f:
        searchlines = f.readlines()
    with open("folder2/telephone_histogram.txt", "r") as f:
        searchlines = searchlines + f.readlines()
    searchlines.sort()

    # dupe holds the previous line, to compare against the current line
    # dupe_count is the number of times that line has been seen so far
    dupe = None
    dupe_count = 1
    for line in searchlines:
        if line == dupe:
            dupe_count += 1
        else:
            if dupe is not None:
                if dupe_count == 1:
                    # Item is unique
                    print(dupe)
                else:
                    # Item is duplicated; print it preceded by the number
                    # of times it appeared
                    print(dupe_count, dupe)
            dupe_count = 1
        dupe = line

    # The loop only reports an item once the next, different item arrives,
    # so the final item has to be flushed here
    if dupe is not None:
        if dupe_count == 1:
            print(dupe)
        else:
            print(dupe_count, dupe)
    If you can help, thank you so much in advance
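    One wrinkle the code above doesn't handle yet is the two-column layout. A minimal sketch of pulling out just the second column, assuming the columns are whitespace-separated (the sample line here is made up for illustration):

    ```python
    # Hypothetical example line from one telephone_histogram.txt:
    # "<count> <phone number>", columns separated by whitespace
    # (an assumption about the file layout based on the description above)
    line = "42 555-555-5555"
    parts = line.split()      # split on any run of whitespace
    number = parts[1]         # the number lives in the second column
    ```

    Comparing `number` instead of the whole line would keep the first column's counts from making identical numbers look different.
    
    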
    Last edited by bvdet; Apr 2 '13, 02:17 PM. Reason: Please use code tags when posting code [code]....[/code]
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I would approach it like this:
    • Generate a list of files to read. os.walk() is ideal for this.
    • Initialize a dictionary. The phone numbers will be the keys and the counts will be the values.
    • Iterate over the files, updating the dictionary with each entry.
    Dictionary method get() or setdefault() can be used to increment the counts. Example:
    Code:
    >>> dd = {}
    >>> key = '555-555-5555'
    >>> v = dd.get(key, 0)
    >>> dd[key] = v+1
    >>> key = '555-555-5556'
    >>> v = dd.setdefault(key, 0)
    >>> dd[key] += 1
    >>>
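    Putting the three bullets together, here is a minimal sketch. It assumes the number is in the second whitespace-separated column of each telephone_histogram.txt, and it tracks the set of files per number (rather than raw occurrence counts) so the report answers "how many files does it appear in". The names `count_numbers`, `report`, and `top_dir` are placeholders, not anything from the original post:

    ```python
    import os
    from collections import defaultdict

    def count_numbers(top_dir):
        """Walk top_dir and map each phone number to the set of
        telephone_histogram.txt files it appears in."""
        seen = defaultdict(set)  # number -> set of file paths
        for dirpath, dirnames, filenames in os.walk(top_dir):
            for name in filenames:
                if name != "telephone_histogram.txt":
                    continue
                path = os.path.join(dirpath, name)
                with open(path) as f:
                    for line in f:
                        parts = line.split()
                        if len(parts) >= 2:          # skip malformed lines
                            seen[parts[1]].add(path)  # number is column 2
        return seen

    def report(seen):
        """Return one line per number that appears in more than one file."""
        lines = []
        for number, paths in sorted(seen.items()):
            if len(paths) > 1:
                lines.append("%s appears in %d files" % (number, len(paths)))
        return lines
    ```

    Writing `"\n".join(report(count_numbers("cases")))` to an output file would give the summary described in the question.
    
    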
