An Idea For Robust Clone Detection Using Abstract Syntax Trees
Sunday, 19 July 2009
My last post concluded with the promise to go into detail on some implementation ideas of my clone analyzer.
As I argued previously, a text based matching tool is not good enough, it’s simply too easy to fool. What we want is a matching tool that considers the full syntax of the language being analyzed. That leads us to a solution based on the analysis of abstract syntax trees (ASTs).
ASTs can be generated easily from a partial compilation of the files under analysis. Basic comparison is easy too. It’s a straightforward tree walking algorithm. There are many ways to do it, and I’ll go into my implementation later. Where things get interesting is when we consider ASTs that are functionally equivalent but not identical. That is, ASTs that differ in unimportant ways. The initial impulse is to begin ‘relaxing’ the tree walking comparison. I.e., ignore things like local variables names, parameter order, and other obvious irrelevancies to the concern of functional equivalence. Instead I proposed that we attempt to refactor one tree and see if we can transform it into an AST that does match. If so, we can conclude the original ASTs match. We don’t actually need to perform these refactorings ‘for real’. Knowing the safe transform exists is enough for a clone repair tool to replace one method with the other.
Now, my theory here can be divided into two distinct parts. First, we need to be able to tell when any two methods are completely identical. Second, if we can we take two non matching methods and automatically apply a series of safe refactorings that will convert one into an identical match of the other then we can conclude that the original methods are clones.
Part I – Matching Identical ASTs
The initial match algorithm is a careful tree traversal with a node for node comparison which exits at the first mismatch. For this part of the project I relied on the open source and very nice NRefactory project. It includes a C# parser, among other useful stuff. Thanks to the availability of source and decent examples I was able to get up and running very quickly.
The first step is to get an AST by passing the class file to a Parse() function. One caveat of my implementation is that it will not work on code that does not compile. When Parse() encounters a syntax error it returns null. In practice, I don’t anticipate this limitation having much effect on usefulness.
The AST generated from this method…
int Foo() {
return 7 + 8 * (4 - 6);
}
…looks like this[1]:

There’s a couple of things to note about these ASTs. Each node has a distinct type like MethodDeclaration, ReturnStatement, Operator, Literal, ect. Some nodes also have other properties. For example, Operators have an Op property that is (in this example) one of +, -, or *. Literals hold the literal value in the property named Val (’7′, ’8′, ’4′, and ’6′ here), and the type of that value, called Type.
We now need to compare the ASTs. Let’s call the function to do this Compare(left_tree, right_tree):bool. Starting at the root node (in this case, a MethodDeclaration node) of the left hand tree we begin walking that tree. At each left hand node we compare to the corresponding right hand node. The individual node comparison first checks that the node types match (both are Operators, both are Literals, ect). Then it compares the values of each of the node properties. At this point we have confirmed the node matches and we can proceed to it’s children.
The actual implementation is based on a slightly modified Visitor pattern. Each of the Visitor class’s Visit methods take a second parameter of type INode (base class of all AST nodes), in addition to the normal strictly typed first parameter. The second parameter is there because we need to drag the right hand node along on each Visit call, and then pass corresponding right hand child node(s) to each Accept call. Here’s the partial IVisitor definition:
public interface IVisitor {
void Visit(MethodDeclaration left, INode right);
void Visit(ReturnStatement left, INode right);
void Visit(Operator left, INode right);
void Visit(LiteralType left, INode right);
}
Here’s the modified INode.Accept method interface. Note the inclusion of the second parameter, right, which will hold the right hand tree node that corresponds to the left hand AST node. The left node is of course ‘this’.
interface INode {
void Accept(IVisitor v, INode right);
...
}
And all Accept implementations look pretty much identical. Here’s Operator’s. Notice it is dutifully passing that right parameter on to the Visit call?
public class ReturnStatement : INode {
void Accept(IVisitor v, INode right) {
v.Visit(this, right);
}
public INode LeftExpr;
public INode RightExpr;
}
So far it’s been pretty boiler plate Visitor pattern stuff, with the inclusion of the extra INode parameter named right. Now let’s look at the concrete Visitor implementation which is where the good stuff happens.
If you recall, I said earlier that the actual comparison of two single INodes involves these steps:
1. Confirm the nodes’ types match.
2. Compare each of the node specific property values. Since they are of the same type, they have the same property sets.
3. Recursively call Accept() on the left hand node’s children, passing the right hand node’s children as the second parameter.
In standard Visitor pattern fashion there is an IVisitor.Visit method for every non-abstract INode subclass. Operator’s Visit looks like this[2]:
public class ComparisonVisitor : IVisitor {
public void Visit(Operator left, INode right) {
// 1. Confirm nodes' types match.
Operator right_operator = right as Operator;
if(right_operator == null) {
SetFailure();
return;
}
// 2. Compare each of the node specific property values.
if(this.Op != right_operator.Op) {
SetFailure();
return;
}
// 3. Recursively call Accept on the left children, passing the right children.
left.LeftExpr.Accept(v, right_operator.LeftExpr);
left.RightExpr.Accept(v, right_operator.RightExpr);
}
... // Repeat for the remaining IVisitor implementations
}
This does require adding a new Accept(left,right) method overload on each node. That is, I had to go back and modify the NRefactory AST implementation to make this work. It was one of those moments where I realize, again, how much open source rocks.
Further
This is surprisingly trivial to implement. In fact it was so easy that I got bored doing it and ended up autogenenerating all of the tree walking and much of the comparison code. NRefactory has it’s own generator that creates all of the INode subclasses (Operator, MethodDeclaration, ect). It lays out the node specific properties, default ctors, and Accept implementations. It even generates some premade concrete Visitor implementations of it’s own. I hijacked this and hooked in the additional generation of my modified Visitor to the concrete INode subclasses. It also generates the vast majority of my ComparisonVisitor as a partial class. The parts that remain are from the node specific property matching (step 2) which lives in node type specific Match(left,right) functions. An example is bool Match(Operator left, Operator right), and it is called right there in the conditional of the step 2 if. I wrote just a handful of those so that I could have a decent subset of C# to carry on with.
As I wrote this blog post it occurred to me that I might be able to auto generate the Match functions too. Clearly enough info is delivered to the generation routines so that they can lay out the properties on the INode subclasses. I can use the same info to autogenerate the Match method of step 2.
Part II – Applying refactorings
There you have it, the basic, core clone detection algorithm. It’s so basic in fact, that it’s going to do no better than a text based match which ignores whitespace. It will not detect a clone like this:
public double Area(double radius) {
double PI = 3.14;
return PI*Math.Pow(r, 2);
}
public double CircleArea(double radius) {
double pi = 3.14;
return pi*Math.Pow(r, 2);
}
The difference in the clones is the name of a local variable. At the beginning I stated that we were starting with the case of identical method clones. Because we are working from that precondition it made our Compare() function extremely easy to implement. But actually it’s a no more productive match function than a basic text based clone finder. A more useful AST comparison implementation might ignore the names of automatic variables. This is not how I go about it.
The way I do it is to transform the AST in a way that does not change it’s functionality, yet creates a new AST that can be recompared. We apply so called ‘safe’ AST transformations – transformations that produce new ASTs yet the methods take the same inputs, produce the same outputs, and have not had their side effects modified. The ASTs under consideration can be considered clones if there is a safe transformation – or even a series of safe transformations – that would convert one tree into the other. These “safe AST transformations” are simply common code refactorings. Going back to the example above, we could apply a rename local variable refactoring directly to the AST. A recomparison would show them as equivalent, and that is enough to deem the original methods clones. Each refactoring can be coded and applied in isolation. That will help keep the implementation complexity low.
Summary
And that’s all there is to it. Apply refactorings to one abstract syntax tree until it matches the other abstract syntax tree, or not. My algorithm at this point is really just brute force. I try all combinations of available refactorings (exactly one at the time of this writing) by applying them to one of the trees until I get a match or exhaust the possibilities. One of my next steps is looking for heuristics that will allow us to reduce the number of refactorings that get performed during the search. For example, if the compare fails due to a name mismatch on a local variable then it can store that fact for use in later selecting a candidate refactoring like ‘rename local variable’.
In the next post I’ll show how I used Extract Method to deal with the methods-only limitation of the tool so far. And, I’ll be writing more refactoring operations and I’m sure I’ll learn some things worth writing about then as well.
Special thanks to my friend Sean for all the proofreading, feedback, skepticism, and challenging questions.
[1] I produced the graph with this nifty tool, using the phrase [MethodDeclaration(Name=Foo,ReturnType=int) [ReturnStatement [Operator(Op=+) Literal(Type=int,Val=7) [Operator(Op=*) [Literal(Type=int,Val=8)][Operator(Op=-) [Literal(Type=int,Val=4)][Literal(Type=int,Val=6)]]]]]].
[2] Why don’t I return bools from the IVisitor methods instead of calling SetFailure() and returning anyway? – Because I reserve the right to continue analyzing in the event of a mismatch. This might be useful when choosing heuristics for later refactoring application.
No. 1 — August 5th, 2009 at 5:15 am
[...] backend scans source files handed to it by the front end, using the techniques I’ve been blogging about. Specifically, clones are identified by comparing abstract syntax trees of methods. The ASTs [...]
No. 2 — August 29th, 2009 at 7:08 pm
[...] talked previously about how Agent Ralph identifies functionally equivalent methods. It also can find clones [...]
No. 3 — January 23rd, 2010 at 11:00 pm
I implemented and wrote a technical paper on a clone detector based on AST tree matching back in 1998. Check out the web site for discussion, link to technical paper, and same clone analysis reports for several languages.
No. 4 — January 25th, 2010 at 5:59 am
Hello Ira,
I read your paper, though I have not tried out your implementation. If I remember correctly, the high level view of the algorithm was to compute a hash on the ASTs and sub-ASTs, and then compare hashes for clone detection. I think that one advantage to my approach is that when a clone is reported there is a higher degree of certainty that it is valid. (I think like 100% certainty, though I haven’t proven that or anything.) The ‘fuzzy hash’ idea where the hash calculation handles some constructs differently in an effort to detect near miss clones (like ignoring small sub trees for example) seems like it would generate false positives. For my tool false positives aren’t acceptable as I want to automate the clone repair. Of course, a hash based implementation would be a lot faster. I am dealing with the kind of poor algorithmic complexity you mention in your paper, like O(n^3) and worse sometimes.
I would like to read more about how you automated the repair of the clones. Do you talk about that in one of the other papers? I’ve only read the one.
Thanks for the comment,
Josh
No. 5 — January 28th, 2010 at 4:33 am
Well, the algorighm only uses hashes to find possible matches, and then compares them exactly. So it detects exact clones without error.
It also detects “near miss” clones which you can think of as parameterized code, e.g., if you made a macro out of block of code and replaced well-formed sections, you’d end up with a parameterized clone.
The present (2010) implementation operates somewhat differently than the 1998 paper, but the basic ideas are the same. A 2004 study (published in IEEE Transactions on Software Engineering) by Steve Bellon compared several detectors, and concluded that ours produce the smallest number of false positives, so I think your worry is misplaced.
You’ve observed that matching exact trees is “easy”. In fact, I agree. What isn’t easy is matching inexact trees to produce the near miss clones, and making this work at the 2 million line scale for multiple programming languages.
You can find a number of examples of clone detection runs on different languages at the website.
– IDB
No. 6 — January 30th, 2010 at 8:20 pm
Ira,
I didn’t realize that there was a exact match following the hashed match. I agree, my false positive concern is misplaced.
I am looking forward to reading your papers.
Josh