The Oscillating Shrinking Window

I talked previously about how Agent Ralph identifies functionally equivalent methods.  It also can find clones embedded within a method, as shown at the end of my last post.  This time I’ll talk a bit about how that works.

Agent Ralph’s preferred unit of comparison is the method.  The comparison implementation accepts two methods and walks their ASTs, comparing each node.  To find clones embedded within a method then it must somehow convert a subset of method statements into a new method.  This provisional method need only exist for the lifetime of the comparison.  If the provisional method matches some other existing method then the set of statements under consideration are a clone.

So, how to create that provisional method?  If you guessed the Extract Method refactoring then indeed you are correct.  Agent Ralph will scan a method from top to bottom choosing sub sets of contiguous statements to perform an extract method on.  If the Extract Method operation is successful and the new method matches some other method, we’ve found a clone.

A “window” is a set of contiguous statements.  We start with the largest possible window, and that consists of all the statements except the last.  (Using all the statements would simply create an identical function.)  Next, perform an Extract Method operation, which creates the provisional method, and compare.  Then proceed to the next window and repeat.  The next window is determined by shifting the window down one statement.  IOW, we now include the last statement and exclude the first statement.  Extract method, compare, continue.  At this point we’ve run out of windows at this size so we shrink the window size by one and start at the top again.  When the window size reaches 1 that is the last iteration.

I call this algorithm the Oscillating Shrinking Window.  An illustrative graphic is in order, and nothing says cutting edge technology like an animated gif:

Each yellow block encompasses a set of statements that will have an Extract Method performed on them.

Some statements can of course contain sets of statements as children.  If statements, for loops, ect.  This requires recursing into the children of said statements and repeating the algorithm.  Another graphic:

As you might guess, this is very computationally expensive.  Indeed, the algorithmic complexity is at least O(n^3) where n is number of AST nodes.  The at least qualifier is on there because I didn’t prove it for all cases.  I choose a single particular case and analyzed that (with some help[1]).  The case was a tree of n nodes where the root node has n-1 child nodes.  (Think very wide and very shallow.)  Heuristics will play a large part in making this actually usable.

[1]    I brought in a consultant.  Sean helped me with the math in exchange for one large Franco Cajun Pride and a Diet Dr. Pepper.

2 Responses to “The Oscillating Shrinking Window”

  1. Scott Wegner writes:

    Very cool Josh. A couple questions:

    You point out that this won’t really scale well, and I can see why. Have you looked at any real-life performance results? How large can the source file be for the refactoring to run in some bearable-amount of time?

    Do you have plans for extending this refactoring to be more robust? It might be neat to extract clones which only differ by input value, so you could extract things like MyClone(100), MyClose(200). I imagine this would add some additional complexity to your AST representation..

  2. josh writes:

    Hi Scott.

    On performance: I opened a ~75 line method (the extract method implementation itself, ironically) and it tanked. I left it to spin and went to bed. After about 7 hours it still wasn’t done. I realized I needed a heuristic. So, I modified the algorithm to only consider extraction windows whose size matches an existing method. For example, don’t bother extracting a three statement block if you’re just going to compare the result to a four statement block. This made scanning the big methods practical. I think it changes the algorithmic complexity to be linear, maybe even sub linear.

    On robustifying: My philosophy in this project is that I want to detect clones by proving two pieces of code functionally equivalent. To prove that, I apply a series of strict refactorings, where strict means the refactored result has not changed it’s inputs, outputs, or side effects. If the code can be safely coerced to a perfect match of some other code then they are clones. So to answer your question, the way I would solve your example is not by “relaxing” the comparison to say, disregard literal input values. Instead I would apply an Introduce Local Variable refactoring to the literals. Subsequent extract methods would exclude said variable definitions producing ASTs that when compared now do not fail due to different literal inputs.

    It’s a different way of accomplishing the same thing, but I think it has some advantages. One is that refactoring implementations are external, sort of like plug-ins. I can implement a new one (or accept someone else’s) and just add it right to the list, improving the whole system without having to touch the core code.

Leave a Reply