by Sylvain Kalache

The future of code quality, security and agility lies in machine learning

Jan 25, 2019 · 5 mins
Application Security, Development Tools, Machine Learning

Why and how machine learning on code is the next frontier to a whole new series of software building tools.

Source code is the new printing press, the new coal, the new oil, the new assembly line; the generator of the next new economy – some call it the fourth industrial revolution. From the auto industry manufacturing self-driving cars with millions of lines of code to doctors performing surgery with robots halfway around the world, source code is everywhere.

With software security breaches costing millions of dollars and an estimated $3 trillion global GDP loss coming from developer inefficiency, businesses are only just beginning to understand how critical their code and the processes that manage it really are. Just as businesses audit their financial statements, their processes, even their assembly lines, it is becoming as critical (if not more so) to do the same with their software portfolio. That’s where machine learning on code comes in.

With enough data, machine learning can solve challenging problems for many industries, from facial recognition for automated photo tagging to movie recommendation engines based on user preferences. Applied to source code, it is poised to help create code bases that are more secure and easier to maintain.

Currently, companies have no easy way to measure progress on key digital transformation initiatives, such as adopting a new logging system, making a major API change or tackling painful projects like becoming GDPR compliant. Their code is constantly changing and often fragmented across different repositories and programming languages, making it very hard to have any visibility into the state of the whole codebase. The task is getting even more difficult as the increasing use of open source code brings in external dependencies and as monoliths are split into ever smaller microservices.

Treating code as the rich dataset that it is

As we turn everything into data in an effort to better understand the processes that surround us, from open government to open source, Code as Data is inevitable. Code as Data is about extracting insights from code repositories, including the source and all of the versions it went through before reaching its current state. Code as Data tasks include code retrieval, language classification, program parsing, token extraction and other language-agnostic analyses, which allow us to compute any metric, easily see its evolution over time and predict future trends.
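To make a couple of these tasks concrete, here is a minimal Python sketch of language classification, token extraction and one simple metric (the share of lines per language in a snapshot of a repository). It is an illustration under stated assumptions, not how any particular product works: the extension map, the token pattern and the function names are all hypothetical, and real classifiers also inspect file contents.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny extension map (an assumption for this sketch).
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".go": "Go",
    ".js": "JavaScript",
    ".java": "Java",
}

# Naive identifier pattern; a real pipeline would use a proper parser per language.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify_language(path):
    """Guess a file's language from its extension only."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, "Unknown")


def extract_tokens(source):
    """Return identifier-like tokens in order of appearance."""
    return TOKEN_PATTERN.findall(source)


def language_share(snapshot):
    """Map each language to its fraction of lines.

    `snapshot` is a dict of file path -> source text for one revision.
    """
    lines = Counter()
    for path, source in snapshot.items():
        lines[classify_language(path)] += len(source.splitlines())
    total = sum(lines.values())
    if total == 0:
        return {}
    return {language: count / total for language, count in lines.items()}
```

Running `language_share` on the snapshot of each commit, oldest to newest, would give the kind of evolution-over-time series described above.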

For instance, source{d} has been developing a platform that leverages machine learning to automate code review for developers while helping executives measure engineering effectiveness and inform their IT strategy based on data rather than feelings. It can track framework and programming language adoption and help management with hiring decisions. Cumbersome questions such as “how far are we with our migration from Angular to Angular 2?” can be easily answered. Codebase sanity can be checked too: for every commit, the technology can make sure that the code respects predefined technical guidelines and is free of the most common security vulnerabilities, such as SQL injection or API key leaks. Another startup in the space, Semmle, goes even further by discovering new types of source code vulnerabilities.
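As a rough idea of what a per-commit check could look like, the Python sketch below scans only the lines a diff adds and flags a few suspicious patterns. It is a rule-based stand-in for illustration, not the machine learning approach of source{d} or Semmle; the rule set and the function name are assumptions, and patterns this naive will produce both false positives and misses.

```python
import re

# Hypothetical rule set (an assumption for this sketch, far from exhaustive).
RULES = {
    "possible AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "possible hard-coded API key": re.compile(
        r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{16,}['\"]"
    ),
    "possible SQL built by string concatenation": re.compile(
        r"(?i)['\"]\s*select\b[^'\"]*['\"]\s*\+"
    ),
}


def scan_added_lines(diff_text):
    """Return (line number within the diff, rule label) for each finding.

    Only lines added by the commit (starting with '+') are inspected;
    the '+++' file header lines of a unified diff are skipped.
    """
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings
```

A hook could feed the output of `git diff` for each new commit into `scan_added_lines` and block the merge when the list is not empty.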

Source code repository analysis can also reveal information about the developers writing it. Team dynamics can be highlighted by analyzing the time and content of commits: managers can identify when software engineers are the most productive, then arrange meetings and encourage cross-team collaboration accordingly. Looking at trends in programming languages and frameworks can inform hiring managers about what type of talent to hire and what upskilling resources they can provide. Adding source code as a new dataset in enterprises’ data warehouses and visualization platforms such as Power BI, Looker or Tableau will provide everyone in the engineering organization with a whole new level of source code and development process observability.
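The commit-time analysis can be sketched in a few lines of Python. The example assumes commit timestamps are available as ISO 8601 strings, for example as printed by `git log --format=%aI`; the function names are hypothetical, and commit time is only a rough proxy for when people actually work.

```python
from collections import Counter
from datetime import datetime


def commits_per_hour(iso_timestamps):
    """Count commits by hour of day, in each author's local time."""
    return Counter(datetime.fromisoformat(ts).hour for ts in iso_timestamps)


def busiest_hours(iso_timestamps, top=3):
    """Return the `top` hours of the day with the most commits."""
    counts = commits_per_hour(iso_timestamps)
    return [hour for hour, _ in counts.most_common(top)]
```

The quieter hours this surfaces are natural candidates for meetings, and the resulting counts are small enough to load into a dashboard alongside other engineering data.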

Learning from source code to build better tooling

Yet the most exciting aspect of looking at code as a dataset is that it can be used to train machine learning models that automate many repetitive tasks for developers. We’re already starting to see new machine learning-based applications for assisted code review or suggestions on GitHub. Imagine how much time developers could save if bots were to remind them of style or naming conventions, or detect similar code at every level from project to function. There is also a class of tasks where automation actually outperforms humans, for instance finding whether a piece of code duplicates an existing dependency or even any popular open source project. A human cannot memorize millions of lines of code, while it’s an easy win for a good algorithm.
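One classic way to spot duplicated code, sketched here in Python, is to break each snippet into overlapping runs of tokens (shingles) and compare the resulting sets with Jaccard similarity. This is a simple non-learned baseline chosen for illustration, not the method of any tool named in this article; the tokenizer, the shingle size and the function names are assumptions.

```python
import re

# Identifiers as one token, any other non-space character as its own token,
# so that differences in whitespace and layout are ignored.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def shingles(source, size=4):
    """Return the set of overlapping token runs of length `size`."""
    tokens = TOKEN_PATTERN.findall(source)
    last_start = max(len(tokens) - size + 1, 0)
    return {tuple(tokens[i:i + size]) for i in range(last_start)}


def jaccard_similarity(first, second, size=4):
    """Score two snippets from 0.0 (nothing shared) to 1.0 (same shingles)."""
    a, b = shingles(first, size), shingles(second, size)
    if not a and not b:
        return 0.0  # both snippets too short to compare
    return len(a & b) / len(a | b)
```

Comparing every pair of files this way does not scale to millions of lines; at that size the same idea is usually combined with hashing schemes that approximate the similarity without comparing all pairs.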

Taking this further, the future of software engineering may lie in training and managing machine learning models and getting them to do the coding work that humans are currently doing. For instance, Diffblue uses machine learning to automatically write unit tests for your code. Unlike humans, computers can work 24/7 and easily identify patterns or flag issues across very large codebases. These new machine learning-based tools will enable developers to build better software faster as a team by focusing on what’s really important and leaving the non-essential tasks to bots. These machine learning models and applications are the building blocks for the next generation of developer tools that will forever change the way students and developers learn programming, as well as how they write and review code. It’s inevitable: machine learning on code is the next frontier to a whole new series of software building tools.