Document Distance: Program Version 4



The problem with count_frequency is that for each word in the text file, it searches linearly down the list of word/frequency pairs to find the given word.

Thus, if the document contains n distinct words, each lookup scans a list that grows to length n, and we again have quadratic (Θ(n²)) running time.

The solution is to use hashing, which gives constant-time (i.e. Θ(1)) routines to store and retrieve key/value pairs in a table. In Python, a hash table is called a dictionary; documentation on dictionaries can be found here.

Basics:

D = {}         # Initialize  D  to the empty dictionary.
D['ab'] = 5    # Store the value  5  in the dictionary under key  'ab'
D['ab'] += 1   # Increment the value stored with key  'ab' 
D.has_key('ab')# Return True iff  D  has a key/value pair with key  'ab' 
D.items()      # Return an (unsorted) list of the key/value pairs in  D 
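
A side note, not needed for the rest of this page: in later versions of Python the has_key method has been removed, and the same test is written with the  in  operator, which also works in the version of Python used here:

'ab' in D      # Return True iff  D  has a key/value pair with key  'ab' 
list(D.items())# Return a list of the key/value pairs in  D 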

With this capability, we modify our routine count_frequency to the following:

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    D = {}
    for new_word in word_list:
        if D.has_key(new_word):
            D[new_word] = D[new_word]+1
        else:
            D[new_word] = 1
    return D.items()
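
Note that the has_key test looks each repeated word up twice: once for the membership test and once more to read its old count. A common alternative (shown here only as a sketch; it is not the version profiled below) uses the dictionary's get method with a default of 0, so each word is handled by a single line:

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    D = {}
    for new_word in word_list:
        # get returns the current count for new_word, or 0 if it is not yet a key
        D[new_word] = D.get(new_word,0) + 1
    return D.items()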

With this modification, our new document distance program is now docdist4.py (PY).

Running the program again on the same inputs as before, we get:

>docdist4.py t2.bobsey.txt t3.lewis.txt 
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
         4059255 function calls in 41.651 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
  1229965    4.575    0.000    4.575    0.000 :0(append)
    22663    0.096    0.000    0.096    0.000 :0(extend)
   232140    0.773    0.000    0.773    0.000 :0(has_key)
  1277585    4.607    0.000    4.607    0.000 :0(isalnum)
        2    0.022    0.011    0.022    0.011 :0(items)
   232140    0.786    0.000    0.786    0.000 :0(join)
   345651    1.265    0.000    1.265    0.000 :0(len)
   232140    0.769    0.000    0.769    0.000 :0(lower)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.014    0.007    0.014    0.007 :0(readlines)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
        1    0.003    0.003   41.647   41.647 <string>:1(<module>)
        2    0.948    0.474    1.743    0.871 docdist4.py:110(count_frequency)
        2   11.409    5.705   11.410    5.705 docdist4.py:125(insertion_sort)
        2    0.001    0.000   41.288   20.644 docdist4.py:147(word_frequencies_for_file)
        3    0.193    0.064    0.344    0.115 docdist4.py:165(inner_product)
        1    0.000    0.000    0.344    0.344 docdist4.py:191(vector_angle)
        1    0.013    0.013   41.644   41.644 docdist4.py:201(main)
        2    0.000    0.000    0.015    0.007 docdist4.py:53(read_file)
        2    0.206    0.103   28.120   14.060 docdist4.py:69(get_words_from_line_list)
    22663   12.978    0.001   27.818    0.001 docdist4.py:82(get_words_from_string)
        1    0.000    0.000   41.651   41.651 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
   232140    1.467    0.000    2.235    0.000 string.py:218(lower)
   232140    1.522    0.000    2.308    0.000 string.py:306(join)

Much better! We reduced the running time by a factor of two on these inputs, and replaced one of our quadratic-time routines with a linear-time one, so the running time will scale better on larger inputs.

We still aren't done. The running time still seems too long (over 40 seconds!). The two obvious "big nails" to hit are

  • get_words_from_string (12 seconds) and
  • insertion_sort (11 seconds).

What can we do to improve get_words_from_string? Here it is again:

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    word_list = []          # accumulates words in line
    character_list = []     # accumulates characters in word
    for c in line:
        if c.isalnum():
            character_list.append(c)
        elif len(character_list)>0:
            word = string.join(character_list,"")
            word = string.lower(word)
            word_list.append(word)
            character_list = []
    if len(character_list)>0:
        word = string.join(character_list,"")
        word = string.lower(word)
        word_list.append(word)
    return word_list

How can this be improved?
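
One direction to consider (a sketch only, and not necessarily the approach taken in the next version): rather than building each word up character by character, translate every punctuation character to a space and every upper-case letter to lower-case in one pass, and then let the built-in split do the word splitting. The translation_table helper below is hypothetical; it assumes the Python 2 string module that the program already imports:

# Hypothetical helper, built once outside the function: maps punctuation
# characters to spaces and upper-case letters to lower-case.
translation_table = string.maketrans(string.punctuation+string.uppercase,
                                     " "*len(string.punctuation)+string.lowercase)

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.
    """
    return line.translate(translation_table).split()

This does not behave identically to the original on every input, but it illustrates the idea: push the per-character work out of the Python-level loop and into built-in routines.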