Code Clones

Code clones are code constructs or functionality that is repeated throughout a system.  It’s a well documented problem.  In short, the issue is duplication of logic.  Take this example:

double Area(double radius)
{
    return 3.14*Math.Pow(radius, 2);
}

double ComputeArea(double radius)
{
    return 3.14*Math.Pow(radius, 2);
}

Two functions identical in every way except name.  The cost of future modifications to whatever system they reside in is increasing.  Suffice it to say that when the maintenance programmer decides to increase precision of the area calculation he must append digits to the 3.14 constant in at least two places.  When this kind of duplication is allowed it leads to unmaintainable systems.  It is yet another form of unnecessary complexity, and perhaps even the worst.  Take any single instance of unnecessary complexity and it can likely be brushed aside as poor form that is easily corrected.  While that is true, it is the repetition of poor form that ultimately renders our systems unmaintainable.


We usually we think of clones as being textually equivalent sections of code (think cut and paste coding), but they can also be functionally equivalent. I define textually equivalent to mean two code sections which are line for line identical (comments and whitespace optionally ignored). Two code blocks are functionally equivalent when their logic is the same even though the text may differ.  Building on our earlier example, here pi is called out explicitely:

double Area(double radius)
{
     const double PI = 3.14;
     return PI*Math.Pow(radius, 2);
}

Here pi is a literal, defined inline.

double ComputeArea(double radius)
{
    return 3.14*Math.Pow(radius, 2);
}

Clearly these are functionally equivalent. For all inputs they produce the same output. They are not textually equivalent. The magic number has been replaced with a constant. Another example is two functions that differ only by local variable naming, where all types and operations are identical, are functionally equivalent.


My general term for correcting code clones is collapsing. Collapsing is the process of safely merging code clones into, or replacing code clones with, a common implementation. The ‘safely’ part implies that a collapsing operation does not modify the expected inputs, outputs, and side effects, as perceived by the clients of the former clone. It is a form of refactoring specifically targeted at the elimination of code duplication. For the last example, the collapsing operation consists of replacing each call to ComputeArea with a call to Area.  Once ComputeArea is no longer used anywhere it can be deleted.


We do have tools available to us for identifying clones. I have used three good ones: Simian, Clone Detective, and TeamCity. These tools are lacking in two distinct ways. First, they seem to treat clone detection as a text matching problem. Second, they do not provide clone relationship analysis and automated correction assistance.


Text matching clone finders are easily defeated by differences in variable names, whitespace, comments, brace/delimiter placement, and literals.  Differences of these types are inconsequential in most languages and will cause primitive clone detectors to return fewer results, or fragmented result sets (many smaller clones instead of fewer large clones).  Member order within cloned classes also causes result fragmentation.  I will say that all the tools I’ve mentioned do have options that take these inhibitors into account.  However those issues are relatively easy problems to solve.  Not so easy is things like parameter order, which screws up matches at both the cloned parameter definitions and cloned call sites.


Ultimately, text based clone detectors do not provide the rich result sets that we need to create powerful automated clone collapsing tools.  The alternative to text based matchers that I propose is clone detectors that operate directly on parsed syntax tree representations of source code.  Analyzers operating against syntax structures will facilitate the elimination of the match noise in clean, language specific ways.


The poor state of clone identification being what it is, simple detection is not enough. A proper clone analysis tool should analyze the clone and it’s context and then present the user with automated options for collapsing and removal. Unfortunately, it is not always as simple as doing a couple of extract method operations with some find and replace.   Consider the clones in these two classes (the lines above and below the calculation of the area variable, on lines 7 and 19):

class Cylinder
{
     public double Volume()
     {
          double height = GetHeight();
          double scaling = GetScalingFactor();
          double area = 3.14 * Math.Pow(this.radius, 2);
          double volume =  height * area;
          return volume * scaling;
     }
     ...
}
class Box
{
     public double ComputeVolume()
     {
          double height = GetHeight();
          double scaling = GetScalingFactor();
          double area = this.length * this.width;
          double volume =  height * area;
          return volume * scaling;
     }
     ...
}

The ideal collapse result is to isolate the differing logic into two distinct functional units.  That is, perform an extract method on the unique logic between the clones.  Now it becomes clear that the two classes are collapsable into a single class where the unique logic can be inserted via polymorphism or dependency injection.  Like so:

abstract class Shape
{
     public abstract double ComputeArea();

     public double Volume()
     {
          double height = GetHeight();
          double scaling = GetScalingFactor();
          double area = ComputeArea();
          double volume =  height * area;
          return volume * scaling;
     }
     ...
}
class Cylinder : Shape
{
     public override double ComputeArea()
     {
          return 3.14 * Math.Pow(this.radius, 2);
     }
     ...
}
class Box : Shape
{
     public override double ComputeArea()
     {
          return this.length* this.width;
     }
     ...
}

The contextual analysis of the clones’ relationship to surrounding code and other clones led us to a more optimal result.  Indeed, otherwise the most likely path of collapsing chosen would be to jump in with extract methods against the clones themselves.  This would have put us in the situation of having a number of identical methods instead of blocks, but would not have provided any clear direction on the best way to proceed with collapsing.


It’s worth noting that sometimes collapsing will result in more lines of code.  This does not diminish the value of collapsing.   The extra lines are boiler plate like class, interface, and function definitions.   The value is derived from the reduction and isolation of the formerly duplicated logic.  Additionally, as we increase the number of functional units (methods and classes) we make more malleable code because these units provide the boundaries that our existing code manipulation tools tend to operate on.


In my next post I will present some implementation ideas for the rich clone analyzer I have outlined here. In further posts I will discuss how else we might put such a tool to use.

Leave a Reply