I have had a theory for some time that TODO comments in code remain there (almost) forever. They serve to appease the conscience of the developer, but they are mostly forgotten. I wanted to back this up with numbers.

The Plan

I needed a large source of code that I could analyze. And the largest source out there is GitHub. So the idea is quite simple:

  • find TODO comments in GitHub repositories
  • use git blame to find out how long the TODO comment has been there

Finding a Dataset

I can’t just use GitHub’s search bar to find all TODO comments and then run some kind of web scraper on the results. That would be slow and could give me many false positives: searching for “TODO:” also returns matches without the colon (:), which is usually not what I want.

Luckily, GitHub and Google have worked together to publish some large datasets on the Google Cloud Platform, which I can query with BigQuery. The first issue was finding the best dataset.

The dataset I used only contains .NET code but it’s a good starting point. Felipe Hoffa created this dataset. The only drawback is that it’s quite old (2016). But at least I have something to work with. Once I have the rest of my process figured out, I should be able to feed it more recent data, and from more languages.

Finding the TODO Comments

To find all TODO comments, I used this query in BigQuery:

SELECT  
  sample_repo_name, sample_path, STRPOS(content, 'TODO:') AS pos
FROM `fh-bigquery.github_extracts.contents_net_cs`
WHERE 
  STRPOS(content, 'TODO:') > 0
ORDER BY sample_repo_name, sample_path, pos

This returns 156,581 rows. Because STRPOS only reports the first match, each row points at the first “TODO:” in its file. I saved the result to a CSV file in my Google Drive and then downloaded it.
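To make the next steps concrete, here is a minimal sketch of reading that export, written in Python rather than the .NET of the actual tool. It assumes the CSV has a header row named after the query's columns and that sample_repo_name has the form "owner/name"; TodoHit, read_hits and the sample position are made up for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple


class TodoHit(NamedTuple):
    owner: str     # part of sample_repo_name before the slash (assumed format)
    repo: str      # part of sample_repo_name after the slash
    path: str      # sample_path from the query
    position: int  # STRPOS result, 1-based


def read_hits(csv_file) -> Iterator[TodoHit]:
    """Yield one TodoHit per row of the exported query result."""
    for row in csv.DictReader(csv_file):
        owner, _, repo = row["sample_repo_name"].partition("/")
        yield TodoHit(owner, repo, row["sample_path"], int(row["pos"]))


# In-memory stand-in for the downloaded file; the position is invented.
sample = io.StringIO(
    "sample_repo_name,sample_path,pos\n"
    "00091701/ADFC-NewsApp-Mono,"
    "NewsAppDroid/NewsAppDroid/BusLog/Database/Rss.cs,1234\n"
)
hits = list(read_hits(sample))
```

Splitting the repository name up front is convenient because the GitHub API asks for the owner and the repository name separately.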

Git Blame

It would be unrealistic to clone every repository in my CSV file and do a git blame. Fortunately, GitHub has a GraphQL API (which you can try online here). With this query, we can get the results of a git blame on a file:

{
  repository(owner: "00091701", name: "ADFC-NewsApp-Mono") {
    defaultBranchRef {
      name
      target {
        ... on Commit {
          history(first: 1) {
            edges {
              node {
                committedDate
              }
            }
          }
          blame(path: "NewsAppDroid/NewsAppDroid/BusLog/Database/Rss.cs") {
            ranges {
              startingLine
              endingLine
              commit {
                committedDate
              }
            }
          }
        }
      }
    }
  }
}

In short, this query does three things:

  • get the default branch name because this isn’t always “master”: we’ll need this to retrieve the file so we can find the TODO comment
  • get the committedDate of the latest commit: necessary to know how “old” this repository is
  • get a blame of the file: we’ll use this information to find out when the TODO comment was created
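Once the response comes back, the blame range that covers the TODO's line has to be picked out. A rough Python sketch of that lookup follows; it assumes the JSON response mirrors the query above under a top-level "data" key. The function names and the sample dates are hypothetical.

```python
from typing import Optional


def _target(response: dict) -> dict:
    """Navigate to the commit the default branch points at."""
    return response["data"]["repository"]["defaultBranchRef"]["target"]


def last_commit_date(response: dict) -> str:
    """committedDate of the most recent commit on the default branch."""
    return _target(response)["history"]["edges"][0]["node"]["committedDate"]


def blame_date_for_line(response: dict, line: int) -> Optional[str]:
    """committedDate of the blame range containing `line`, or None."""
    for rng in _target(response)["blame"]["ranges"]:
        if rng["startingLine"] <= line <= rng["endingLine"]:
            return rng["commit"]["committedDate"]
    return None


# Hand-written response with invented dates, shaped like the query above.
sample_response = {"data": {"repository": {"defaultBranchRef": {
    "name": "master",
    "target": {
        "history": {"edges": [
            {"node": {"committedDate": "2016-05-01T10:00:00Z"}}]},
        "blame": {"ranges": [
            {"startingLine": 1, "endingLine": 10,
             "commit": {"committedDate": "2014-02-03T09:00:00Z"}},
            {"startingLine": 11, "endingLine": 20,
             "commit": {"committedDate": "2015-07-08T09:00:00Z"}},
        ]},
    },
}}}}

todo_added = blame_date_for_line(sample_response, 12)
```

Returning None when no range matches makes it easy to skip files whose content no longer contains the line in question.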

Why do we need the repository age?

If a TODO comment was created 6 years ago but the code stopped being maintained 5 years ago, we can’t say the TODO comment has been neglected for 6 years: it was only 1 year old when work on the repository stopped.

GitHub is full of old repositories that are no longer actively maintained. If nobody is working on the repository, we can’t blame anyone for not fixing the TODO comment. I’m trying to find out how long TODO comments stay alive while the code is still being maintained.
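One plausible way to express this is to measure a TODO's age up to the repository's latest commit instead of up to today. The Python sketch below assumes the dates are ISO 8601 strings as returned by the GitHub API; the function names and example dates are illustrative only.

```python
from datetime import datetime


def parse_github_date(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2015-06-01T00:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def maintained_age_days(todo_added: str, last_commit: str) -> int:
    """Days the TODO existed while the repository was still receiving commits."""
    delta = parse_github_date(last_commit) - parse_github_date(todo_added)
    return delta.days


# A TODO added one year before the final commit counts as one year old,
# however long ago that final commit was.
age = maintained_age_days("2014-06-01T00:00:00Z", "2015-06-01T00:00:00Z")
```

With this definition, a TODO in an abandoned repository stops aging at the moment of the last commit.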

Getting the Line Number

The only issue now is that I had the character position of the TODO comment, not the line number. To get it, I requested the raw content of the file (available at raw.githubusercontent.com) and used this piece of .NET code:

// input.Position is the position found by STRPOS; count the newlines before it.
var lineNumber = text.Take(input.Position).Count(c => c == '\n') + 1;
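The same idea as a small Python sketch, for readers who want to check it on a toy input. It assumes the position is 1-based, as STRPOS reports it; line_number and the sample text are made up for illustration.

```python
def line_number(text: str, position: int) -> int:
    """1-based line number of the character at 1-based `position`."""
    # Count the newlines in the first `position` characters, then add one
    # because the first line is line 1.
    return text.count("\n", 0, position) + 1


sample_text = "using System;\n\n// TODO: remove this\n"
# str.find is 0-based, so add 1 to mimic the STRPOS result.
todo_position = sample_text.find("TODO:") + 1
todo_line = line_number(sample_text, todo_position)
```

Files with Windows line endings still work, because every "\r\n" pair contains exactly one "\n".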

Putting It All Together

I threw together a .NET Core console application that would:

  • use CsvHelper to read out the CSV file with the BigQuery results
  • get the line number for each TODO comment
  • do the git blame and get the age of the TODO comment
  • write the results to a new CSV file
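The last step of that list could look roughly like this in Python (the real application is .NET and uses CsvHelper). The column names and the sample row are hypothetical, chosen only to show the shape of the output file.

```python
import csv
import io

# Hypothetical output columns; the actual tool may use different ones.
FIELDS = ["repo", "path", "line", "todo_added", "last_commit", "age_days"]


def write_results(rows, out_file) -> None:
    """Write one CSV line per analyzed TODO comment, preceded by a header."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Invented example row, written to memory instead of to disk.
out = io.StringIO()
write_results(
    [{
        "repo": "00091701/ADFC-NewsApp-Mono",
        "path": "NewsAppDroid/NewsAppDroid/BusLog/Database/Rss.cs",
        "line": 12,
        "todo_added": "2014-06-01T00:00:00Z",
        "last_commit": "2015-06-01T00:00:00Z",
        "age_days": 365,
    }],
    out,
)
lines = out.getvalue().splitlines()
```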

It’s not the prettiest code, but it does the job. You can find it in my GitHub account.

The Caveats

There are some caveats, which I hope to solve or improve in the future:

  • This is only for .NET code
  • It’s based on a dataset from 2016, but I performed the git blame on the current repositories, so the recorded position may no longer match a file that has changed since then
  • The command line app is slow because it analyzes one line of the CSV file at a time. I could probably run tasks in parallel or in the cloud.
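As a sketch of what running tasks in parallel might look like, here is a Python version using a thread pool, which suits work that mostly waits on network calls. analyze_all and the stand-in worker are hypothetical, and a real run would also need to respect GitHub's API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_all(rows, analyze_row, max_workers: int = 8) -> list:
    """Apply analyze_row to every row concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `rows`, not of completion.
        return list(pool.map(analyze_row, rows))


# Stand-in worker; the real one would fetch the file and its blame.
doubled = analyze_all([1, 2, 3], lambda value: value * 2)
```

Keeping the results in input order means each one can still be matched to its CSV row afterwards.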

But probably the most important issue is that this analysis says nothing about TODO comments that have already been fixed or removed. We only learn something about the lifetime of TODO comments that are still present today. I would have to develop this project further to gain more insight.

But the analysis seems valid to me for the current set of unfixed TODO comments.

The Results

This has become a lengthy post, so I will show the results in the next post. That will allow me to drill into some details more deeply.
