Document Distance: Program Version 1
The initial version of our program for computing the distance between two documents is docdist1.py.
This program seems to give correct results. Here is some output:
>docdist1.py t1.verne.txt t2.bobsey.txt
File t1.verne.txt : 1057 lines, 8943 words, 2150 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.582949 (radians)

>docdist1.py t2.bobsey.txt t2.bobsey.txt
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
The distance between the documents is: 0.000000 (radians)

>docdist1.py t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)
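To make the reported numbers concrete, here is a minimal sketch of how a distance in radians can be computed. It assumes the distance is the angle between the two documents' word-frequency vectors, which is what the "(radians)" unit suggests. The function names and the definition of a word (a lowercased run of alphanumeric characters) are illustrative assumptions, not necessarily what docdist1.py does.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Count words in text.

    Assumption: a word is a maximal run of alphanumeric characters,
    compared case-insensitively.
    """
    words = []
    current = []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return Counter(words)


def inner_product(f1, f2):
    """Sum of count products over the words the two documents share."""
    return sum(count * f2[word] for word, count in f1.items() if word in f2)


def vector_angle(f1, f2):
    """Angle in radians between two frequency vectors.

    Assumes both documents contain at least one word; otherwise the
    denominator is zero.
    """
    numerator = inner_product(f1, f2)
    denominator = math.sqrt(inner_product(f1, f1) * inner_product(f2, f2))
    # Clamp so rounding error cannot push the ratio above 1.
    return math.acos(min(1.0, numerator / denominator))
```

Under this definition a document compared with itself gives an angle of 0, matching the second run above, and two documents with no words in common give pi/2.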
However, this program seems very SLOW as the inputs get large.
The last example above seemed to take approximately THREE MINUTES!
There seems to be no hope of comparing all of Shakespeare's works to all of Churchill's in a reasonable amount of time...
What is wrong with the efficiency of this program?
Can you figure it out?
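One way to start looking is to measure where the time goes instead of guessing. The sketch below uses Python's standard cProfile and pstats modules. The helper name profile_call is hypothetical, and the function you pass to it would be whatever entry point your copy of the program exposes.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under the profiler.

    Returns the function's result and a text report listing the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

The same information is available without changing the program by running it from the command line as python -m cProfile -s cumulative docdist1.py t2.bobsey.txt t3.lewis.txt. The functions at the top of the report are the ones worth reading first.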