Compare multiple files for common entries

  • irishman211
    New Member
    • Apr 2013
    • 1

    Compare multiple files for common entries

    I apologize in advance; I'm trying to teach myself Python in my spare time since I was assigned this task. I am working on a way to examine a directory of thousands of files, looking for common entries. In this instance, I have extracted telephone numbers from thousands of pieces of code and stored them in individual folders, one per case. However, the files all have the same name, "telephone_histogram.txt". What I'm trying to accomplish is to compare across the entire directory and produce a file that tells me whether a number appears in more than one file, and in how many files it appears. The other problem is that each of the txt files has two columns, with the number appearing in the second column. Here's what I have so far, but I've only been able to compare two files, not a whole directory:
    Code:
    # Open each file and read all of its lines into a list called searchlines,
    # then sort the list so that duplicate entries end up next to each other
    with open("folder1/telephone_histogram.txt", "r") as f:
        searchlines = f.readlines()
    with open("folder2/telephone_histogram.txt", "r") as f:
        searchlines = searchlines + f.readlines()
    searchlines.sort()

    # dupe holds the previous line, to compare against the current line
    # dupe_count is the number of times that line has been seen so far
    dupe = None
    dupe_count = 1
    for line in searchlines:
        if line == dupe:
            dupe_count += 1
        else:
            if dupe is not None:
                if dupe_count == 1:
                    # Item is unique
                    print(dupe)
                else:
                    # Item is duplicated; print it preceded by the number
                    # of times it appeared
                    print(dupe_count, dupe)
            dupe_count = 1
        dupe = line

    # The loop only reports an item once the next, different item arrives,
    # so the final item has to be flushed here
    if dupe is not None:
        if dupe_count == 1:
            print(dupe)
        else:
            print(dupe_count, dupe)
    If you can help, thank you so much in advance
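    One wrinkle the code above doesn't handle yet is the two-column layout. A minimal sketch of pulling out just the second column, assuming the columns are whitespace-separated (the sample line here is made up for illustration):

    ```python
    # Hypothetical example line from one telephone_histogram.txt:
    # "<count> <phone number>", columns separated by whitespace
    # (an assumption about the file layout based on the description above)
    line = "42 555-555-5555"
    parts = line.split()      # split on any run of whitespace
    number = parts[1]         # the number lives in the second column
    ```

    Comparing `number` instead of the whole line would keep the first column's counts from making identical numbers look different.
    
    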
    Last edited by bvdet; Apr 2 '13, 02:17 PM. Reason: Please use code tags when posting code [code]....[/code]
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I would approach it like this:
    • Generate a list of files to read. os.walk() is ideal for this.
    • Initialize a dictionary. The phone numbers will be the keys and the counts will be the values.
    • Iterate over the files, updating the dictionary with each entry.
    Dictionary method get() or setdefault() can be used to increment the counts. Example:
    Code:
    >>> dd = {}
    >>> key = '555-555-5555'
    >>> v = dd.get(key, 0)
    >>> dd[key] = v+1
    >>> key = '555-555-5556'
    >>> v = dd.setdefault(key, 0)
    >>> dd[key] += 1
    >>>
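    Putting the three bullets together, here is a minimal sketch. It assumes the number is in the second whitespace-separated column of each telephone_histogram.txt, and it tracks the set of files per number (rather than raw occurrence counts) so the report answers "how many files does it appear in". The names `count_numbers`, `report`, and `top_dir` are placeholders, not anything from the original post:

    ```python
    import os
    from collections import defaultdict

    def count_numbers(top_dir):
        """Walk top_dir and map each phone number to the set of
        telephone_histogram.txt files it appears in."""
        seen = defaultdict(set)  # number -> set of file paths
        for dirpath, dirnames, filenames in os.walk(top_dir):
            for name in filenames:
                if name != "telephone_histogram.txt":
                    continue
                path = os.path.join(dirpath, name)
                with open(path) as f:
                    for line in f:
                        parts = line.split()
                        if len(parts) >= 2:          # skip malformed lines
                            seen[parts[1]].add(path)  # number is column 2
        return seen

    def report(seen):
        """Return one line per number that appears in more than one file."""
        lines = []
        for number, paths in sorted(seen.items()):
            if len(paths) > 1:
                lines.append("%s appears in %d files" % (number, len(paths)))
        return lines
    ```

    Writing `"\n".join(report(count_numbers("cases")))` to an output file would give the summary described in the question.
    
    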
