I am unfortunately deprecating my old http://www.realityinteractive.com domain because I’m too lazy and cheap to maintain it.
I have just finished importing my old Movable Type blog into WordPress. It was all accomplished with a single click. My hat’s off to the WordPress folks!
Since I’m not too bright, most of my intra-blog links were hardcoded so they’re unfortunately pointing to URLs that will not live for much longer. I will slowly convert them all to relative URLs over time.
I’m also using the blog as a sandbox for messing with WordPress templates.
Bear with me and the mess ….
One commonly used approach for measuring the similarity between strings is the string edit distance. The basic premise of the string edit distance is to find the minimum number of character edit operations (copy, replace, insert, and delete) needed to transform one string into the other. For example, the edit distance between parks and spark is two: one edit is needed to insert a leading s into parks and a second edit is needed to delete the trailing s. Not surprisingly, string edit distance is used in spell-checking applications.
It is convenient to look at string edit distance as a string alignment problem. The example below should provide some insight into this:
String 1:    p a r k s
String 2:  s p a r k
Operation: i c c c c d
The operation is the character operation needed to align or transform string 1 into string 2. i is insert, c is copy, r is replace, and d is delete. (You may find other references using s (substitute) instead of r (replace).) The space at the beginning of string 1 or at the end of string 2 is called a gap.
For a given set of strings, there may be many ways to align the strings. For example, the strings abc123 and abcqwertabc123 have two obvious alignments:
a b c                 1 2 3
a b c q w e r t a b c 1 2 3

                a b c 1 2 3
a b c q w e r t a b c 1 2 3
The particular application would determine which alignment is preferred.
Different distance metrics are used to select the desired alignment. For example, the metric used in the original parks and spark example is the Levenshtein distance. In this metric the cost of replacing, inserting, or deleting a character is one, whereas the cost of equal (or copied) characters is zero. (In the conversion of parks to spark, the cost of copying the letters p a r k is zero.) The second alignment above (where abc123 contains no internal gap) can be obtained using the Smith-Waterman distance.
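To make the Levenshtein metric concrete, here is a minimal Python sketch of the standard dynamic programming recurrence (copy costs zero; insert, delete, and replace each cost one). This is an illustration, not production code:

```python
def levenshtein(s1, s2):
    # prev[j] holds the edit distance between the current prefix of s1
    # and s2[:j]; we only ever need the previous row of the full table.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1           # copy vs. replace
            curr.append(min(prev[j] + 1,          # delete from s1
                            curr[j - 1] + 1,      # insert into s1
                            prev[j - 1] + cost))  # copy or replace
        prev = curr
    return prev[-1]

print(levenshtein("parks", "spark"))  # -> 2, matching the example above
```

Note that only two rows of the table are kept at any time, so the memory cost is linear in the length of the second string.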
For simple strings it is relatively easy to understand how to transform one string into another but with longer strings or very dissimilar strings it is not so clear. For example:
See Spot run
We work and play
There are brute force techniques that consider all possible alignments and find the one(s) with the minimum cost. I refer you to a few good references on string edit distance and the associated dynamic programming algorithms for more information.
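The dynamic programming algorithms in those references avoid the brute-force search by filling in a table of prefix distances and then walking back through it to read off the alignment. A rough Python sketch, assuming unit costs as in the Levenshtein example (the i/c/r/d labels follow the notation used earlier):

```python
def align(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # copy or replace
    # Walk back from dp[m][n] to recover the operation string.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           dp[i][j] == dp[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1):
            ops.append('c' if s1[i - 1] == s2[j - 1] else 'r')
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append('i')
            j -= 1
        else:
            ops.append('d')
            i -= 1
    return ''.join(reversed(ops))

print(align("parks", "spark"))  # -> iccccd
```

When several alignments share the minimum cost, the order of the checks in the walk-back determines which one is reported; a real implementation would make that tie-breaking policy an explicit choice.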
In order to make template extraction as palatable as possible I am going to start by walking through how one discovers the templates used to generate a set of URLs. I’m going to use Amazon.com as an example.
Search for “book” on Amazon.com. You should see a page with a number of results / data records. Placing your mouse over each of the titles reveals many similar URLs, some of which I have pasted below:
We can easily see a pattern (the template) in those URLs:
http://www.amazon.com/<title>/dp/<book id>/ref=<ref id>?ie=UTF8&s=<product type>&qid=1196364228&sr=8-<index>
Our job is to find an algorithm that will allow us to automatically discover that template.
(I should mention that if one continues to other pages in the search results, there are more differences in the URLs, which lead to a different pattern. To keep the discussion simple, I will only use the first page’s URLs. This demonstrates that a single page, even one that contains multiple records, might not contain enough information to deduce a complete template.
I also should mention that the ability to label the fields in the template as I did in the example above will not be covered (in the near future) in this series of postings. I will only state that there are techniques available in the literature on web wrappers for labeling fields.)
We are going to take two approaches to extract templates from URLs. The first will use traditional string edit distance algorithms and the second will exploit the fact that there is underlying structure in the URL (i.e. the scheme, authority, path, query, and fragment).
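As a taste of the second approach, Python’s standard urllib.parse module can split a URL into exactly those structural pieces before any alignment is attempted. The URL below is invented to mimic the Amazon pattern, not a real link:

```python
from urllib.parse import urlparse, parse_qs

url = ("http://www.example.com/Some-Title/dp/0123456789/ref=sr_1_1"
       "?ie=UTF8&s=books&qid=1196364228&sr=8-1")
parts = urlparse(url)

print(parts.scheme)           # the scheme, e.g. http
print(parts.netloc)           # the authority, e.g. www.example.com
print(parts.path.split('/'))  # path segments that can be aligned independently
print(parse_qs(parts.query))  # query parameters as a name -> values dict
```

Comparing URLs piece by piece like this keeps, say, a query parameter from ever being aligned against a path segment, which a flat string edit distance would happily do.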
The goal of template extraction is to discover the template(s) (if there are any) used to generate a page. There are two major categories of templates: those that are used within a single web page (such as the individual search results entries on a Google search result page) and those that are used across web pages (such as a story template used on CNN.com). In the former case, the goal is to be able to extract the template(s) from any page that has at least two similar data records on it. In the latter case, the goal is to be able to extract the template(s) from any two pages that have similar data records. As more similar records or pages are found we expect our precision to increase.
By visually looking at a rendered Google search results page, for example, a human can easily see similar, repeated sections. Although it is a bit more difficult, someone familiar with HTML can look at the source to that same Google page and find the repeated sections. Similarly, there are two general approaches to automatic template discovery — one that relies on the rendered representation of the page and one that relies on the source of the page. A survey of the research literature does not show a distinct advantage of one technique over the other and in many cases the underlying algorithms are very similar. I am going to focus on the latter approach since that is where my experience lies.
With the recent movement towards mashups, the semantic web and market intelligence, there is a large need to get at the data and information that is stored in web pages. Data extraction startups are popping up like weeds (e.g. InfoSquire and QL2). Many of these startups focus on services where you specify what sites you want scraped and they provide you with the resulting data feed. The technologies that they use are primarily rules-based (e.g. regex). Rules-based systems are highly brittle given the dynamic nature of the web: there is a high ongoing cost to maintaining and monitoring the rules to ensure that they stay up to date with any changes made to the underlying web pages. The ability to automatically generate data extractors with high precision would be a vast improvement over a rules-based system.
Much research has gone into extracting data (either structured or unstructured) from a web page using web wrappers. A web wrapper is a tool for “converting information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing” [W4F]. One particular type of automatically generated web wrapper uses template extraction. Template extraction is the inverse of creating a web page from a template — for a given web page, attempt to deduce the template that was used to generate the page. If a template can be generated for any (template derived) web page, then the data that populates that template can be easily extracted.
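To make that “inverse” relationship concrete, here is a toy sketch (not any particular wrapper system’s algorithm): once a template with named slots is known, extracting the data that filled it reduces to a pattern match. The template and page below are invented for illustration:

```python
import re

# A made-up template with two slots, and a page generated from it.
template = "<li><a href='/dp/{id}'>{title}</a></li>"
page = "<li><a href='/dp/0596529325'>Learning Python</a></li>"

# Turn the template into a regex: escape the literal text,
# then convert each {slot} into a named capturing group.
pattern = re.escape(template)
pattern = re.sub(r'\\\{(\w+)\\\}', r'(?P<\1>.+?)', pattern)

match = re.fullmatch(pattern, page)
print(match.groupdict())  # -> {'id': '0596529325', 'title': 'Learning Python'}
```

The hard part, of course, is the template discovery itself; once it is in hand, the extraction step is nearly mechanical.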
Over the next few weeks I am going to focus on machine learning and other automatic web wrapper technologies in a series of postings.
“Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don’t improve it. Trying to improve software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don’t buy a new scale; change your diet. If you want to improve your software, don’t test more; develop better.”
I’m often asked why I don’t hop on the latest language bandwagon and just start coding up a storm. The answer comes in two parts. The first is that I do try out these languages to see what the hype is all about, where they can fit in, and what their pros and cons are. The second is that I realize there is more to software engineering than just writing code. Software spends disproportionately more time in maintenance than it does in initial development. Just because a language such as Ruby is much faster for initial development doesn’t mean that it’s much easier to maintain. (Do note that I’m not saying that Ruby is hard or harder to maintain. I’m simply saying that one cannot determine what the maintenance model for a language is from doing only initial development.) The long and short of all of this is that I am forced by my professionalism and my responsibilities to look at how a language works not only for initial development but also for long-term maintenance. By definition this means that it takes me a very long time to determine if a language is suitable in the long term. Since many newly in vogue languages simply haven’t been out long enough for either the community or me to understand their maintenance models, one simply cannot start writing production code with them.
One quick example of all of this is AOP. I’m enamored with AOP but I cannot and will not use it in production software. The reason is that AOP simply does not have a maintenance model at all. In other words, I cannot take an AOP’ified application, later apply more AOP to it (i.e. maintain it), and have understandable and determinable effects.
Editor’s note: this is a stream of consciousness posting to get an idea down and is not complete or thorough in any way. But as always, comments are welcome.
I don’t know about the rest of you, but my parents baffle me. When working with them on issues that crop up from time to time, I try to view them as two average people so that I can be objective about the situation and not allow decisions to be marred by emotion. The recent situation that has arisen (and caused me to write this entry) revolves around my parents’ view of planning. The way my parents approach a difficult and potentially costly planning problem is to simply worry. “Huh?!?” you may be thinking. You read that right. Their solution to a problem involving planning (and in fact most problems) is to worry. What’s funny (in an ironic sense) is that their “solution” tends to lead to the worst, most costly, and most stressful conclusions, which leads them to worry more.
I could elaborate further on this topic but I’m still in the head-shaking phase (i.e. “denial”). Yargh!