How to remove words from text file?

  • Nawaf Ali
    New Member
    • Nov 2010
    • 3

    How to remove words from text file?

    I am trying to compute some text statistics: word frequency, average word length, average sentence length, and average paragraph length. I have managed the word frequency and the average sentence and word lengths. What I need to do next is preprocess the text file by removing some words (listed in another text file) and then compute my statistics. If someone can also tell me how to compute the average paragraph length, please do.
    Any help is appreciated.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    What would be the definition of a paragraph? A blank line?

    To eliminate the words listed in another file, let's assume you have read that file and split its words into a list (remove list). Let's also assume you have read the file you need statistics for and split its words into a list (stat list). Initialize a new list (keep list), iterate over the stat list, and append each word that is not in the remove list to the keep list.

    Code:
    >>> remove_list = ['a','b','c']
    >>> stat_list = ['a','a','1','x','f','t']
    >>> keep_list = []
    >>> for word in stat_list:
    ... 	if word not in remove_list:
    ... 		keep_list.append(word)
    ... 		
    >>> keep_list
    ['1', 'x', 'f', 't']
    >>>
    It can also be done with sets, if you only need each remaining word once (duplicates and the original order are lost).
    Code:
    >>> keep_list = list(set(stat_list)-set(remove_list))
    >>> keep_list
    ['1', 'x', 't', 'f']
    >>>
    Give it a try and post back if you need more help.
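For word frequencies, every occurrence of a word has to survive the filtering. A minimal sketch in Python 3, assuming the sample lists above with one extra repeated word added for illustration; `filter_words` is a hypothetical helper name, not something from the thread. It uses a set only for fast membership tests and keeps the words in their original order:

```python
from collections import Counter

def filter_words(words, remove_words):
    """Return words in their original order, minus anything in remove_words."""
    remove_set = set(remove_words)  # a set makes each membership test fast
    return [w for w in words if w not in remove_set]

remove_list = ['a', 'b', 'c']
stat_list = ['a', 'a', '1', 'x', 'f', 't', 'x']  # extra 'x' added for illustration

keep_list = filter_words(stat_list, remove_list)
print(keep_list)           # ['1', 'x', 'f', 't', 'x']
print(Counter(keep_list))  # word frequencies of the kept words
```

Because duplicates are preserved, the result can be passed straight to a frequency count.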

    • Nawaf Ali
      New Member
      • Nov 2010
      • 3

      #3
      First, thanks for the fast response. I made these changes:
      Code:
      import re

      filename = 'Jay.txt'
      functionWords = 'function_words.txt'
      processedText = []
      # Split each file into lowercase words on runs of whitespace
      word_list = re.split(r'\s+', open(filename).read().lower())
      functionWordList = re.split(r'\s+', open(functionWords).read().lower())

      for word in word_list:
          if word not in functionWordList:
              processedText.append(word)
      # Then I got this error
      Traceback (most recent call last):
        File "F:\Python24\word_count1", line 21, in -toplevel-
          if word not in functionWordList:
      TypeError: iterable argument required
      Can you help me with that?
      Last edited by bvdet; Nov 5 '10, 03:05 AM. Reason: Please use code tags when posting code. [code]....code goes here....[/code]

      • Nawaf Ali
        New Member
        • Nov 2010
        • 3

        #4
        I fixed it. Thanks, I wouldn't have found it without your help. Now, do you know how to compute the paragraph length? Paragraphs are usually separated by one or two newlines. I appreciate your help.

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          Assume paragraphs are separated by a blank line. One way to do it would be to iterate over the file object, as in:
          Code:
          f = open("filename.txt")
          for line in f:
              ....
          Strip each line with the string method strip(), which removes leading and trailing whitespace. If the stripped line is empty, you have reached a paragraph boundary.
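The blank-line approach above could be sketched as follows in Python 3. This is only one possible reading: it assumes paragraph length is measured in words (the thread does not say which unit is wanted), and `paragraph_lengths` is a hypothetical helper name. Consecutive blank lines are treated as a single separator, which covers the "new line or two" case.

```python
def paragraph_lengths(lines):
    """Return the word count of each paragraph in an iterable of lines."""
    lengths = []
    current = 0  # words seen so far in the paragraph being read
    for line in lines:
        words = line.strip().split()
        if words:
            current += len(words)
        elif current:
            # a blank line after some content closes the paragraph
            lengths.append(current)
            current = 0
    if current:  # the last paragraph may not be followed by a blank line
        lengths.append(current)
    return lengths

sample = [
    "First paragraph has five words.\n",
    "It continues here.\n",
    "\n",
    "\n",
    "Second paragraph.\n",
]
lengths = paragraph_lengths(sample)
print(lengths)                      # [8, 2]
print(sum(lengths) / len(lengths))  # 5.0
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.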
