URL Similarity

In order to make template extraction as palatable as possible I am going to start by walking through how one discovers the templates used to generate a set of URLs. I’m going to use Amazon.com as an example.

Search for “book” on Amazon.com. You should see a page with a number of results / data records. Placing your mouse over each of the titles shows many similar urls, some of which I have pasted below:

We can easily see a pattern (the template) in those urls:

http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>

Our job is to find an algorithm that will allow us to automatically discover that template.

(I should mention that if one continues to other pages in the search results that there are more differences in the URLs that leads to a different pattern. In order to keep the discussion simple, I will only use the first page’s URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.

I also should mention that the ability to label the fields in the template as I did in the example above will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)

We are going to take two approaches to extract templates from URLs. The first will use traditional string edit distance algorithms and the second will exploit the fact that there is underlying structure in the url (i.e. the scheme, authority, path, query, and fragment).

Advertisements

2 comments

  1. Hi!
    I’m a university student in China and I got here via Google. Right now I’m working on a task which involves URL similarity calculation and template generating. The information you provided in the article is quite related to what I need but I need more detailed algorithm or approach. I’m wondering if you can provide me more information about this. Either publishing links/articles here or emailing me would be fine. And thanks a lot!

  2. Hello Joe! Thanks for visiting and commenting.
    Unfortunately life is putting up its usual set of roadblocks and I haven’t been able to spend much time on this lately.
    I would recommend Bing Liu’s book “Web Data Mining” for more information.
    Please do come back and share any knowledge that you have. I’m facinated with this type of work!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s