I have had a theory for some time that TODO comments in code remain there (almost) forever. They serve to appease the conscience of the developer, but they are mostly forgotten. But I wanted to prove this with numbers.
The Plan
I needed a large source of code that I could analyze. And the largest source out there is GitHub. So the idea is quite simple:
- find TODO comments in GitHub repositories
- use git blame to find out how long the TODO comment has been there
Finding a Dataset
I can’t just use GitHub’s search bar to find all TODO comments and then run some kind of web scraper on the results. That would be a slow process and would give me many false positives: searching for “TODO:” in the search bar also returns results without the colon (:), which is often not what I want.
Luckily, GitHub and Google have worked together to put some big datasets into the Google Cloud Platform. I can then query this data with BigQuery. The first issue was finding the best dataset:
- https://bigquery.cloud.google.com/table/bigquery-public-data:github_repos.contents: this one doesn’t contain the repository names and file paths, which I would need to perform a git blame
- https://bigquery.cloud.google.com/table/bigquery-public-data:github_repos.sample_contents: this one only contains a sample of all the data and gave me a smaller result set than I wanted
- https://bigquery.cloud.google.com/table/fh-bigquery:github_extracts.contents_net_cs: I used this dataset to start with
The dataset I used only contains .NET code, but it’s a good starting point. Felipe Hoffa created this dataset. The only drawback is that it’s quite old (2016), but at least I have something to work with. Once I have the rest of my process figured out, I should be able to feed it more recent data, and from more languages.
Finding the TODO Comments
To find all TODO comments, I used this query in BigQuery:
SELECT
  sample_repo_name, sample_path, STRPOS(content, 'TODO:') AS pos
FROM `fh-bigquery.github_extracts.contents_net_cs`
WHERE
  STRPOS(content, 'TODO:') > 0
ORDER BY sample_repo_name, sample_path, pos
This query returns 156,581 results. I saved them to a CSV file in my Google Drive and then downloaded it.
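The exported CSV has three columns matching the query’s select list: the repository, the path of the file, and the character position of the TODO within the file. As an illustration, using the repository and path from the example further below (the position value is made up), a row looks like this:

sample_repo_name,sample_path,pos
00091701/ADFC-NewsApp-Mono,NewsAppDroid/NewsAppDroid/BusLog/Database/Rss.cs,1234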
Git Blame
It would be unrealistic to clone every repository in my CSV file and do a git blame. Fortunately, GitHub has a GraphQL API (which you can try online in GitHub’s GraphQL Explorer). With this query, we can get the results of a git blame on a file:
{
  repository(owner: "00091701", name: "ADFC-NewsApp-Mono") {
    defaultBranchRef {
      name
      target {
        ... on Commit {
          history(first: 1) {
            edges {
              node {
                committedDate
              }
            }
          }
          blame(path: "NewsAppDroid/NewsAppDroid/BusLog/Database/Rss.cs") {
            ranges {
              startingLine
              endingLine
              commit {
                committedDate
              }
            }
          }
        }
      }
    }
  }
}
In short, this query will:
- get the default branch name, because this isn’t always “master”: we’ll need it to retrieve the file so we can find the TODO comment
- get the committedDate of the latest commit: necessary to know how “old” this repository is
- get a blame of the file: we’ll use this information to find out when the TODO comment was created
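For context, calling this API from .NET boils down to POSTing the query as JSON to https://api.github.com/graphql with a personal access token. The sketch below is only an illustration of that call, not the exact code from my tool: it assumes a token in a GITHUB_TOKEN environment variable, uses Newtonsoft.Json for serialization, and trims the query down to the defaultBranchRef part.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

class GraphQlExample
{
    static async Task Main()
    {
        // Trimmed-down version of the query above, just to show the mechanics.
        var query = @"{ repository(owner: ""00091701"", name: ""ADFC-NewsApp-Mono"") { defaultBranchRef { name } } }";

        using (var client = new HttpClient())
        {
            // The GraphQL API requires authentication and a User-Agent header.
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", Environment.GetEnvironmentVariable("GITHUB_TOKEN"));
            client.DefaultRequestHeaders.UserAgent.ParseAdd("todo-analyzer");

            // GraphQL queries are POSTed as JSON: { "query": "..." }.
            var body = JsonConvert.SerializeObject(new { query });
            var response = await client.PostAsync(
                "https://api.github.com/graphql",
                new StringContent(body, Encoding.UTF8, "application/json"));

            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}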
Why do we need the repository age?
If you created a TODO comment 6 years ago, but the code stopped being maintained 5 years ago, we can’t say the TODO comment has been ignored for 6 years; it only lived in actively maintained code for 1 year.
GitHub is full of old repositories that are no longer actively maintained. If nobody is working on the repository, we can’t blame anyone for not fixing the TODO comment. I’m trying to find out how long TODO comments stay alive while the code is still being maintained.
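To make that concrete, here is how I think about the age calculation. The dates below are made up for illustration and the variable names are my own; the point is that the age is measured from the commit that introduced the TODO (taken from the blame) up to the repository’s latest commit, not up to today.

using System;

class TodoAgeExample
{
    static void Main()
    {
        // Made-up dates for illustration:
        // todoCommitDate comes from the blame range that covers the TODO's line,
        // latestCommitDate from history(first: 1) in the GraphQL query above.
        var todoCommitDate = new DateTime(2014, 3, 1, 0, 0, 0, DateTimeKind.Utc);
        var latestCommitDate = new DateTime(2015, 2, 1, 0, 0, 0, DateTimeKind.Utc);

        // The age I'm interested in: how long the TODO survived while the
        // repository was still receiving commits, not how long until today.
        var todoAge = latestCommitDate - todoCommitDate;
        Console.WriteLine($"TODO survived {todoAge.TotalDays:F0} days of active development.");
    }
}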
Getting the Line Number
The only issue now was that I had the position of the TODO comment, not the line number. To get it, I requested the raw content of the file (available at raw.githubusercontent.com) and used this piece of .NET code:
var lineNumber = text.Take(input.Position).Count(c => c == '\n') + 1;
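Combined with the raw file download, that calculation could look roughly like the helper below. The method name and parameters are just illustrative; the raw URL format is https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}, and because BigQuery’s STRPOS is 1-based, counting the newlines in the first pos characters and adding 1 gives the right line number.

using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

static class LineNumberHelper
{
    // Downloads the raw file and converts a character position (as returned by
    // BigQuery's STRPOS, which is 1-based) into a 1-based line number.
    public static async Task<int> GetLineNumberAsync(
        HttpClient client, string repoName, string branch, string path, int position)
    {
        // repoName is "owner/repo", e.g. "00091701/ADFC-NewsApp-Mono".
        var url = $"https://raw.githubusercontent.com/{repoName}/{branch}/{path}";
        var text = await client.GetStringAsync(url);

        // Count the newlines before the TODO's position; line numbers start at 1.
        return text.Take(position).Count(c => c == '\n') + 1;
    }
}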
Putting It All Together
I threw together a .NET Core console application that would:
- use CsvHelper to read the CSV file with the BigQuery results
- get the line number for each TODO comment
- do the git blame and get the age of the TODO comment
- write the results to a new CSV file
It’s not the prettiest code, but it does the job. You can find it in my GitHub account.
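As a rough sketch of that flow (not the actual code in my repository, and assuming a recent version of CsvHelper), with placeholder class names and a comment where the analysis steps would go, the structure looks something like this:

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

// Shape of one row in the BigQuery export (matches the column names in the query).
public class TodoRecord
{
    public string sample_repo_name { get; set; }
    public string sample_path { get; set; }
    public int pos { get; set; }
}

// Shape of one row in the output file; the exact columns are up to you.
public class TodoResult
{
    public string Repo { get; set; }
    public string Path { get; set; }
    public int LineNumber { get; set; }
    public double AgeInDays { get; set; }
}

class Program
{
    static void Main()
    {
        var results = new List<TodoResult>();

        // 1. Read the BigQuery results with CsvHelper.
        using (var reader = new StreamReader("todos.csv"))
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            foreach (var record in csv.GetRecords<TodoRecord>())
            {
                // 2. Convert the character position to a line number (see above).
                // 3. Run the GraphQL blame query and compute the TODO's age.
                // results.Add(Analyze(record));  // placeholder for those steps
            }
        }

        // 4. Write everything to a new CSV file.
        using (var writer = new StreamWriter("todo-ages.csv"))
        using (var csvOut = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            csvOut.WriteRecords(results);
        }
    }
}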
The Caveats
There are some remarks to be made: things I hope to solve or improve in the future.
- This is only for .NET code
- It’s based on a dataset from 2016, but I performed the git blame on the repositories as they are today
- The command-line app is slow because it analyzes one line of the CSV file at a time. I could probably run tasks in parallel or in the cloud.
But probably the most important issue is that this analysis says nothing about TODO comments that have already been fixed or removed. So we only know something about the lifetime of TODO comments that are still present today. I would have to extend this project further to gain those insights.
But the analysis seems valid to me for the current set of unfixed TODO comments.
The Results
This has become a lengthy post, so I will show the results in a follow-up post. That will allow me to drill into some details more deeply.