The Oscillating Shrinking Window
Saturday, 29 August 2009
I talked previously about how Agent Ralph identifies functionally equivalent methods. It also can find clones embedded within a method, as shown at the end of my last post. This time I’ll talk a bit about how that works.
Agent Ralph’s preferred unit of comparison is the method. The comparison implementation accepts two methods and walks their ASTs, comparing each node. To find clones embedded within a method then it must somehow convert a subset of method statements into a new method. This provisional method need only exist for the lifetime of the comparison. If the provisional method matches some other existing method then the set of statements under consideration are a clone.
So, how to create that provisional method? If you guessed the Extract Method refactoring then indeed you are correct. Agent Ralph will scan a method from top to bottom choosing sub sets of contiguous statements to perform an extract method on. If the Extract Method operation is successful and the new method matches some other method, we’ve found a clone.
A “window” is a set of contiguous statements. We start with the largest possible window, and that consists of all the statements except the last. (Using all the statements would simply create an identical function.) Next, perform an Extract Method operation, which creates the provisional method, and compare. Then proceed to the next window and repeat. The next window is determined by shifting the window down one statement. IOW, we now include the last statement and exclude the first statement. Extract method, compare, continue. At this point we’ve run out of windows at this size so we shrink the window size by one and start at the top again. When the window size reaches 1 that is the last iteration.
I call this algorithm the Oscillating Shrinking Window. An illustrative graphic is in order, and nothing says cutting edge technology like an animated gif:
Each yellow block encompasses a set of statements that will have an Extract Method performed on them.
Some statements can of course contain sets of statements as children. If statements, for loops, ect. This requires recursing into the children of said statements and repeating the algorithm. Another graphic:
As you might guess, this is very computationally expensive. Indeed, the algorithmic complexity is at least O(n^3) where n is number of AST nodes. The at least qualifier is on there because I didn’t prove it for all cases. I choose a single particular case and analyzed that (with some help). The case was a tree of n nodes where the root node has n-1 child nodes. (Think very wide and very shallow.) Heuristics will play a large part in making this actually usable.