To make template extraction as approachable as possible, I am going to start by walking through how one discovers the templates used to generate a set of URLs, using Amazon.com as an example.
Search for “book” on Amazon.com. You should see a page with a number of results (data records). Placing your mouse over each of the titles shows many similar URLs, some of which I have pasted below:
We can easily see a pattern (the template) in those URLs:
http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>
Our job is to find an algorithm that will allow us to automatically discover that template.
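To make the target concrete, here is a minimal sketch of the template above written by hand as a Python regular expression with named groups. The group names mirror the field labels in the template, and the sample URL is made up to fit its shape (it is not one of the URLs from the search results). This only shows what a discovered template lets us do; it is not the discovery algorithm itself.

```python
import re

# Hand-written regex form of the template. Group names mirror the field
# labels; qid stays literal because it is constant in the template.
TEMPLATE = re.compile(
    r"^http://www\.amazon\.com/(?P<title>[^/]+)/dp/(?P<book_id>[^/]+)"
    r"/ref=(?P<ref_id>[^?]+)\?ie=UTF8&s=(?P<product_type>[^&]+)"
    r"&qid=1196364228&sr=8-(?P<index>\d+)$"
)

# Hypothetical URL in the shape of the template (not a real Amazon link).
url = ("http://www.amazon.com/Example-Title/dp/0000000000/ref=example_ref"
       "?ie=UTF8&s=books&qid=1196364228&sr=8-1")

match = TEMPLATE.match(url)
fields = match.groupdict() if match else {}
print(fields)
```

Once a template is known, every URL it generated can be reduced to its field values in this way.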
(I should mention that if one continues to other pages of the search results, there are further differences in the URLs that lead to a different pattern. To keep the discussion simple, I will only use the first page’s URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.
I should also mention that labeling the fields in the template, as I did in the example above, will not be covered in this series of postings in the near future. I will only note that techniques for labeling fields are available in the literature on web wrappers.)
We are going to take two approaches to extracting templates from URLs. The first will use traditional string edit distance algorithms, and the second will exploit the underlying structure of a URL (i.e., the scheme, authority, path, query, and fragment).
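As a rough preview of the two ingredients, the sketch below shows a textbook Levenshtein edit distance and a structural breakdown of a URL using Python's standard urllib.parse module. The helper names (edit_distance, components, varying_parts) and the two sample URLs are my own assumptions for illustration; the URLs are made up to fit the template and are not real Amazon links.

```python
from urllib.parse import urlsplit, parse_qsl

def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def components(url):
    """Break a URL into labelled parts using its syntactic structure."""
    parts = urlsplit(url)
    out = {"scheme": parts.scheme, "authority": parts.netloc}
    for i, segment in enumerate(parts.path.strip("/").split("/")):
        out[f"path[{i}]"] = segment
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        out[f"query[{key}]"] = value
    out["fragment"] = parts.fragment
    return out

def varying_parts(urls):
    """Return the labels whose values differ across the given URLs."""
    rows = [components(u) for u in urls]
    return [k for k in rows[0] if len({r.get(k) for r in rows}) > 1]

# Hypothetical URLs shaped like the template (not real Amazon links).
u1 = ("http://www.amazon.com/Example-Title-One/dp/0000000001/ref=example_1"
      "?ie=UTF8&s=books&qid=1196364228&sr=8-1")
u2 = ("http://www.amazon.com/Example-Title-Two/dp/0000000002/ref=example_2"
      "?ie=UTF8&s=books&qid=1196364228&sr=8-2")

print(edit_distance(u1, u2))
print(varying_parts([u1, u2]))
# second line prints ['path[0]', 'path[2]', 'path[3]', 'query[sr]']
```

Note that the structural view is coarse: it flags the whole of path[3] and query[sr] as varying, even though the "ref=" and "8-" prefixes inside them are constant. Recovering those constant prefixes still requires comparing the strings within each part.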