Document Distance: Program Version 4
The problem with count_frequency is that for each word in the text file, it searches linearly down the list of word/frequency pairs to find the given word.
Thus, if the document contains n distinct words, we will again have quadratic (Θ(n²)) running time.
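For reference, here is a sketch of the list-based routine being replaced, following the structure of the earlier program versions (the exact code in docdist3.py may differ in detail):

    def count_frequency(word_list):
        """
        Return a list giving pairs of form: (word,frequency)
        """
        L = []                      # list of [word, frequency] pairs
        for new_word in word_list:
            for entry in L:         # linear scan over all words seen so far
                if new_word == entry[0]:
                    entry[1] = entry[1] + 1
                    break
            else:                   # no break: new_word not yet in L
                L.append([new_word, 1])
        return L

The inner scan costs time proportional to the number of distinct words seen so far, which is what makes the total cost quadratic.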
The solution is to use hashing, which gives constant-time (i.e. Θ(1)) routines to store and retrieve key/value pairs from a table. In Python, a hash table is called a dictionary; see the Python documentation on dictionaries for details.
Basics:
    D = {}          # Initialize D to the empty dictionary
    D['ab'] = 5     # Store the value 5 in the dictionary under key 'ab'
    D['ab'] += 1    # Increment the value stored with key 'ab'
    D.has_key('ab') # Return True iff D has a key/value pair with key 'ab'
    D.items()       # Return an (unsorted) list of key/value pairs in D
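For example, a short interactive session exercising these operations:

    >>> D = {}
    >>> D['ab'] = 5
    >>> D['ab'] += 1
    >>> D.has_key('ab')
    True
    >>> D.items()
    [('ab', 6)]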
With this capability, we modify our routine count_frequency to the following:
    def count_frequency(word_list):
        """
        Return a list giving pairs of form: (word,frequency)
        """
        D = {}
        for new_word in word_list:
            if D.has_key(new_word):
                D[new_word] = D[new_word] + 1
            else:
                D[new_word] = 1
        return D.items()
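As an aside, the same counting pattern is often written with the dictionary's get method, which returns a default value when the key is absent; this loop body is a drop-in replacement for the if/else above:

    for new_word in word_list:
        D[new_word] = D.get(new_word, 0) + 1

Either form performs a constant expected number of hash-table operations per word.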
With this modification, our new document distance program is now docdist4.py.
Running the program again on the same inputs as before, we get:
    >docdist4.py t2.bobsey.txt t3.lewis.txt
    File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
    File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
    The distance between the documents is: 0.574160 (radians)

             4059255 function calls in 41.651 CPU seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 :0(acos)
      1229965    4.575    0.000    4.575    0.000 :0(append)
        22663    0.096    0.000    0.096    0.000 :0(extend)
       232140    0.773    0.000    0.773    0.000 :0(has_key)
      1277585    4.607    0.000    4.607    0.000 :0(isalnum)
            2    0.022    0.011    0.022    0.011 :0(items)
       232140    0.786    0.000    0.786    0.000 :0(join)
       345651    1.265    0.000    1.265    0.000 :0(len)
       232140    0.769    0.000    0.769    0.000 :0(lower)
            2    0.001    0.000    0.001    0.000 :0(open)
            2    0.000    0.000    0.000    0.000 :0(range)
            2    0.014    0.007    0.014    0.007 :0(readlines)
            1    0.004    0.004    0.004    0.004 :0(setprofile)
            1    0.000    0.000    0.000    0.000 :0(sqrt)
            1    0.003    0.003   41.647   41.647 <string>:1(<module>)
            2    0.948    0.474    1.743    0.871 docdist4.py:110(count_frequency)
            2   11.409    5.705   11.410    5.705 docdist4.py:125(insertion_sort)
            2    0.001    0.000   41.288   20.644 docdist4.py:147(word_frequencies_for_file)
            3    0.193    0.064    0.344    0.115 docdist4.py:165(inner_product)
            1    0.000    0.000    0.344    0.344 docdist4.py:191(vector_angle)
            1    0.013    0.013   41.644   41.644 docdist4.py:201(main)
            2    0.000    0.000    0.015    0.007 docdist4.py:53(read_file)
            2    0.206    0.103   28.120   14.060 docdist4.py:69(get_words_from_line_list)
        22663   12.978    0.001   27.818    0.001 docdist4.py:82(get_words_from_string)
            1    0.000    0.000   41.651   41.651 profile:0(main())
            0    0.000             0.000          profile:0(profiler)
       232140    1.467    0.000    2.235    0.000 string.py:218(lower)
       232140    1.522    0.000    2.308    0.000 string.py:306(join)
Much better! We reduced the running time by a factor of 2 for these inputs, and replaced one of our quadratic-time routines with a linear-time one, so the running time will scale better for larger inputs.
We still aren't done. The running time still seems too high (about 40 seconds!). The two obvious "big nails" to hit are

- get_words_from_string (about 13 seconds) and
- insertion_sort (about 11 seconds).
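The per-routine times above come from Python's standard profiler; a minimal sketch of how such a profile is produced (the exact invocation used for these measurements may differ):

    import profile
    profile.run('main()')   # runs main() under the profiler and prints the table

The profile:0(main()) line in the table corresponds to this kind of profiled main() call.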
What can we do to improve get_words_from_string? Here it is again:
    import string    # Python 2 string module (imported at the top of docdist4.py)

    def get_words_from_string(line):
        """
        Return a list of the words in the given input string,
        converting each word to lower-case.

        Input:  line (a string)
        Output: a list of strings
                  (each string is a sequence of alphanumeric characters)
        """
        word_list = []          # accumulates words in line
        character_list = []     # accumulates characters in word
        for c in line:
            if c.isalnum():
                character_list.append(c)
            elif len(character_list) > 0:
                word = string.join(character_list, "")
                word = string.lower(word)
                word_list.append(word)
                character_list = []
        if len(character_list) > 0:
            word = string.join(character_list, "")
            word = string.lower(word)
            word_list.append(word)
        return word_list
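For example:

    >>> get_words_from_string("Hello, world!  Hello again...")
    ['hello', 'world', 'hello', 'again']

Note how much work is done per character: every character is tested with isalnum, and every alphanumeric character is appended to a list one at a time. That accounts for the roughly 1.28 million isalnum calls and 1.23 million append calls in the profile above.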
How can this be improved?