With the recent movement towards mashups, the semantic web and market intelligence, there is a growing need to extract the data and information stored in web pages. Data extraction startups are popping up like weeds (e.g. InfoSquire and QL2). Many of these startups focus on services where you specify which sites you want scraped and they provide you with the resulting data feed. The technologies they use are primarily rules-based (e.g. regex). Rules-based systems are highly brittle given the dynamic nature of the web: there is a high cost to maintaining and monitoring the rules to ensure they stay in sync with any changes made to the underlying web pages. The ability to automatically generate data extractors with high precision would be a vast improvement over a rules-based system.
Much research has gone into extracting data (either structured or unstructured) from a web page using web wrappers. A web wrapper is a tool for “converting information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing” [W4F]. One particular type of automatically generated web wrapper uses template extraction. Template extraction is the inverse of creating a web page from a template: for a given web page, attempt to deduce the template that was used to generate the page. If a template can be generated for any (template-derived) web page, then the data that populates that template can be easily extracted.
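To make the idea concrete, here is a minimal sketch of template extraction. It is not how any particular wrapper product works; it simply assumes you have two pages generated from the same template, aligns their token sequences, and treats shared runs as template and mismatches as data slots. The function names and the `{slot}` placeholder are my own inventions for illustration.

```python
import difflib
import re

def tokenize(html):
    # Split the page into tag and text tokens so the alignment
    # happens on markup boundaries rather than raw characters.
    return [t for t in re.split(r"(<[^>]+>)", html) if t.strip()]

def infer_template(page_a, page_b):
    """Align two pages assumed to come from the same template.
    Token runs shared by both pages are kept as template text;
    runs that differ become data slots, and the differing values
    are collected as the extracted data for each page."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    template, data_a, data_b = [], [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            # Divergent region: record a slot and capture each
            # page's value for that slot.
            template.append("{slot}")
            data_a.append("".join(a[i1:i2]))
            data_b.append("".join(b[j1:j2]))
    return template, data_a, data_b

page1 = "<html><body><h1>Widget</h1><p>Price: $9.99</p></body></html>"
page2 = "<html><body><h1>Gadget</h1><p>Price: $19.50</p></body></html>"
tpl, d1, d2 = infer_template(page1, page2)
# d1 → ['Widget', 'Price: $9.99']; d2 → ['Gadget', 'Price: $19.50']
```

A real system has to cope with optional sections, repeated records and nested structure, which is exactly where the research on automatically generated wrappers comes in; a flat diff like this only illustrates the core inversion of template to data.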
Over the next few weeks I am going to focus on machine learning and other automatic web wrapper technologies in a series of postings.