You have a large text file. Given any two words, find the shortest distance (in terms of the number of words) between the two given words in the file. Can you make the search operation O(1) time? What about the space complexity of your solution?
Ex:
File contains:
as was is the as the yahoo you me was the and
Words given: the, was
Answer should be: 0
If Words given: you, the
Answer: 2
#include <fstream>
#include <cstring>
#include <climits>
using namespace std;

int FindShortestDistance( const char *ipWord1, const char *ipWord2 )
{
    ifstream reader; // properly initialized with file name & stream opened
    char wordRead[100]; // Handle the size more carefully.
    bool isFirstPatternFound = false;
    int shortestDistance = INT_MAX; // sentinel: no pair found yet
    int currentCount = 0;
    // Start reading the file one whitespace-delimited word at a time.
    // operator>> skips whitespace itself, so no delimiter juggling is
    // needed, and the loop condition avoids the while(!eof()) pitfall
    // of processing a failed final read.
    while( reader >> wordRead )
    {
        if( 0 == strcmp( ipWord1, wordRead ) )
        {   // 1st pattern found
            isFirstPatternFound = true;
            // reset count every time you find the 1st pattern because
            // only resetting gives the shortest distance
            currentCount = 0;
        }
        // check if 2nd pattern is found
        else if( 0 == strcmp( ipWord2, wordRead ) )
        {
            if( isFirstPatternFound == false )
            {   // starting word not yet found
                continue;
            }
            // 1st & 2nd pattern both found
            if( currentCount < shortestDistance )
            {
                shortestDistance = currentCount;
            }
            // 1st & 2nd pattern both found at least once.
            // But the 1st pattern can occur a 2nd, 3rd, ... nth time,
            // so reset for onward scanning.
            isFirstPatternFound = false;
        }
        else if( isFirstPatternFound )
        {   // increment only when the 1st pattern has been found & the 2nd has not
            ++currentCount;
        }
    }
    // INT_MAX means no word1...word2 pair was ever seen
    return ( shortestDistance == INT_MAX ) ? 0 : shortestDistance;
}
I am reading the file one word at a time, i.e. O(k) per word, assuming k is the average word length.
strcmp will compare up to k characters in the worst case. For a file with n words I will have O(nk) to read + O(nk) for strcmp.
So total complexity = O(nk) + O(nk) ~ O(nk)
The 2nd & perhaps better option is to compute a hash [don't use a hash table, just compute the hash value of the word].
Just replace the call to strcmp with something like getHashedValue(currentWord), then compare the hashed values of ipWord1 and ipWord2 with the hashed value of currentWord.
Assuming your hash function is not dependent on the length of the word, hash generation & comparison will then take O(n) time in total.
But I cannot think of a hash function that does not go through each character of the word to compute its hash, i.e. k characters. If you can guarantee such a hash function, then
Total complexity = O(nk) for reading + O(n) for hash generation & comparison, i.e.
~ O(nk)
I don't think it can be optimized further! No reading mechanism can read a file of n (words) * k (characters) in less than nk time.
Let me know what you or others think about 1) my hash-based approach & 2) the overall complexity of O(nk).