I apologize in advance; I'm teaching myself Python in my spare time because I was assigned this task. I am working on a way to examine a directory of thousands of files and look for common entries. We have multiple cases, and for each case I have extracted telephone numbers from thousands of pieces of code and stored them in that case's own folder. However, the files all have the same name, "telephone_histogram.txt". What I'm trying to do is compare every file in the directory and produce a file that tells me whether a number appears in more than one file and, if so, how many files it appears in. The other problem is that each of the txt files has two columns, with the number in the second column.
If you can help, thank you so much in advance. Here's what we have so far, but I've only been able to compare two files, not a whole directory:
Code:
# Open each file and collect the telephone numbers into a list called searchlines
# Assumes the two columns are separated by whitespace; the number is the second column
searchlines = []
for path in ["folder1/telephone_histogram.txt", "folder2/telephone_histogram.txt"]:
    with open(path, "r") as f:
        for line in f:
            columns = line.split(None, 1)
            if len(columns) == 2:
                searchlines.append(columns[1].strip())
# Then sort the list so identical numbers end up next to each other
searchlines.sort()
# dupe is the value that each following number is compared against
# dupe_count is the number of times dupe has been seen so far
# dupe starts as None (no previous number yet) and dupe_count starts at 0
dupe = None
dupe_count = 0
# The trailing None is a sentinel so the last group of numbers is handled too
for number in searchlines + [None]:
    if number == dupe:
        dupe_count += 1
    else:
        if dupe_count == 1:
            # Item is unique
            print(dupe)
        elif dupe_count > 1:
            # Item is duplicated: print the item preceded by the number of times it was found
            # print("%d %s" % (dupe_count, dupe))
            pass  # placeholder so the line above can stay commented out
        dupe_count = 1
        dupe = number
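
For the whole-directory case described above, one possible direction is sketched below. It is only a minimal sketch under some assumptions, not a tested solution for the real data: it assumes every file of interest is named "telephone_histogram.txt", that the two columns are separated by whitespace, and that the number is everything after the first column. The function names (numbers_in_file, count_files_per_number, write_report) and the tab-separated report layout are made up for illustration.

```python
import os
from collections import Counter


def numbers_in_file(path):
    """Return the set of numbers found in the second column of one file."""
    numbers = set()
    with open(path, "r") as f:
        for line in f:
            # Assumption: columns are whitespace-separated and the number
            # is everything after the first column.
            columns = line.split(None, 1)
            if len(columns) == 2:
                numbers.add(columns[1].strip())
    return numbers


def count_files_per_number(root, filename="telephone_histogram.txt"):
    """Map each number to how many files named `filename` under `root` contain it."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        if filename in filenames:
            # A set is used per file, so a number repeated inside one file
            # still counts that file only once.
            counts.update(numbers_in_file(os.path.join(dirpath, filename)))
    return counts


def write_report(counts, out_path):
    """Write the numbers seen in more than one file, most widely shared first."""
    with open(out_path, "w") as out:
        for number, n_files in counts.most_common():
            if n_files > 1:
                out.write("%d\t%s\n" % (n_files, number))
```

With the case folders collected under a top-level directory (here a hypothetical "cases"), the report could be produced with write_report(count_files_per_number("cases"), "shared_numbers.txt"). Counting with a Counter avoids having to sort everything and compare neighbouring lines.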