Template Extraction

The goal of template extraction is to discover the template or templates, if any, used to generate a page. There are two major categories of templates: those used within a single web page (such as the individual result entries on a Google search results page) and those used across web pages (such as a story template used on CNN.com). In the former case, the goal is to extract the template(s) from any page that contains at least two similar data records. In the latter case, the goal is to extract the template(s) from any two pages that contain similar data records. As more similar records or pages are found, we expect our precision to increase.

Looking at a rendered Google search results page, for example, a human can easily spot the similar, repeated sections. Although it is a bit more difficult, someone familiar with HTML can look at the source of that same page and find the repeated sections there too. Automatic template discovery follows the same split: one approach relies on the rendered representation of the page, and the other relies on the source of the page. A survey of the research literature does not show a distinct advantage of one technique over the other, and in many cases the underlying algorithms are very similar. I am going to focus on the latter approach, since that is where my experience lies.
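
To make the source-based idea concrete, here is a minimal sketch of what "finding repeated sections in the source" could look like for the single-page case. It is not a published algorithm and not necessarily how I would build the real thing: it assumes that two data records are "similar" only when their tag structure is exactly identical, which is a deliberate simplification (a practical extractor would need some tolerance for optional fields). The names `Node`, `TreeBuilder`, `signature`, and `repeated_structures` are all made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "open".
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def repeated_structures(html, min_count=2):
    """Return (parent tag, child signature, count) for repeated sibling subtrees.

    Assumption: siblings count as instances of one template only when their
    signatures match exactly. Leaf elements are skipped as too trivial.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        counts = Counter(signature(c) for c in node.children if c.children)
        for sig, n in counts.items():
            if n >= min_count:
                found.append((node.tag, sig, n))
        stack.extend(node.children)
    return found


page = '<ul><li><a href="x">A</a></li><li><a href="y">B</a></li><li>plain</li></ul>'
print(repeated_structures(page))
```

On this toy page the two list items that wrap a link share a signature and are reported as a candidate template, while the plain third item is ignored. The same comparison could in principle be run across two pages instead of across siblings, which corresponds to the second category of templates described above.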
