String Edit Distance

One commonly used approach for measuring similarity between strings is the string edit distance. The basic premise of the string edit distance is to find the minimum number of character edit operations (copy, replace, add, and remove) needed to transform one string into the other. For example, the edit distance between parks and spark is two: one edit is needed to add a leading s to parks and a second edit is needed to delete the trailing s. Not surprisingly, string edit distance is used in spell-checking applications.

It is convenient to look at string edit distance as a string alignment problem. The example below should provide some insight into this:

 String 1:     p a r k s 
 String 2:   s p a r k   
 Operation:  i c c c c d

The operation is the character operation needed to align or transform string 1 into string 2. i is insert, c is copy, r is replace, and d is delete. (You may find other references using s (substitute) instead of r (replace).) The space at the beginning of string 1 or at the end of string 2 is called a gap.

For a given set of strings, there may be many ways to align the strings. For example, the strings abc123 and abcqwertabc123 have two obvious alignments:

 a b c                 1 2 3
 a b c q w e r t a b c 1 2 3

and

                 a b c 1 2 3
 a b c q w e r t a b c 1 2 3

The particular application would determine which alignment is preferred.

Different distance metrics are used to select the desired alignment. For example, the measurement used in the original example of the cost between parks and spark is the Levenshtein distance. In this metric the cost of replacing, adding or deleting a character is one whereas the cost of equal (or copied) characters is zero. (In the conversion of parks to spark, the cost of copying letters p a r k is zero.) The second alignment above (where abc123 has no gap) can be obtained using the Smith-Waterman distance.

For simple strings it is relatively easy to understand how to transform one string into another but with longer strings or very dissimilar strings it is not so clear. For example:

 See Spot run
 We work and play

There are brute force techniques that consider all possible alignments and find the one(s) with the minimum cost. I refer you to a few good references on string edit distance and the associated dynamic programming algorithms for more information.

Advertisements

5 comments

  1. Not much details on the topic. If u have details information about string edit distance and related topics,can u please give reference where I can find the information.I am very interested to know about the topic,like how it work..are there any algorithm..and so on.So if u can get in touch with me as soon as possible I will be very happy.Thank you.

  2. Hello Amzad! I apologize for the lack of information. I had planned on posting much about the topic, but as it always seems to happen, life has put too many obstacles in the road and I have to focus all of my time elsewhere.
    To start your research, you can follow all of the links I provide in the articles. They provide all of the base information that you need to know. A good book on the topic is “Web Data Mining” by Bing Liu.
    If you have any specific questions, please don’t hesitate to email me. I will try to answer to the best of my ability.

  3. how do i relate string edit distance and directed acyclic graph to find the shortest path?pls reply asap..

  4. Sounds like someone’s got a school homework assignment to do 🙂 When you work through the answer please post it here for others to learn from.
    Best of luck to you!

  5. i m given to give a brief presentation on edit distance problem….
    i m too much in trouble how to fill the matrix of this problem ……why and on what basis we are aligning the strings. on what conditions we are insering or deleting …??


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s