Why and how machine learning on code is the next frontier for a whole new series of software building tools.

Credit: NegativeSpace.co

Source code is the new printing press, the new coal, the new oil, the new assembly line: the generator of the next new economy, which some call the fourth industrial revolution. From the auto industry manufacturing self-driving cars with millions of lines of code to doctors performing surgery with robots halfway around the world, source code is everywhere.

With software security breaches costing millions of dollars and an estimated $3 trillion global GDP loss coming from developer inefficiency, businesses are only just beginning to understand how critical their code and the processes that manage it really are. Just as businesses audit their financial statements, their processes and even their assembly lines, it is becoming as critical (if not more so) to do the same with their software portfolio.

That's where machine learning on code comes in. With enough data, machine learning can solve challenging problems for many industries. It already powers facial recognition for automated photo tagging and movie recommendation engines based on user preferences. Applied to source code, it is poised to make code bases more secure and easier to maintain.

Currently, companies have no easy way to measure progress on key digital transformation initiatives, such as adopting a new logging system, making a major API change or tackling painful projects like becoming GDPR compliant. Their code is constantly changing and often fragmented across different repositories and programming languages, which makes it very hard to have any visibility into the state of the whole codebase.
And the task is getting even more difficult: the increasing use of open source code brings in external dependencies, while services keep getting smaller as the source code of the monolith is split into microservices.

Treating code as the rich dataset that it is

As we turn everything into data in an effort to better understand the processes that surround us, from open government to open source, Code as Data is inevitable. Code as Data is about extracting insights from code repositories, including the source and all of the versions it went through before reaching its current state. Code as Data tasks include code retrieval, language classification, program parsing, token extraction and other language-agnostic analyses, which make it possible to compute metrics, see how they evolve over time and predict future trends.

For instance, source{d} has been developing a platform that leverages machine learning to automate code review for developers while helping executives measure engineering effectiveness and inform their IT strategy based on data rather than feelings. It can track framework and programming language adoption and help management with hiring decisions. Cumbersome questions such as "how far are we with our migration from Angular to Angular 2?" can be easily answered. Codebase sanity can be checked as well: for every commit, the technology can make sure that the code respects predefined technical guidelines and is free of the most common security vulnerabilities, such as SQL injection or API key leaks. Another startup in the space, Semmle, goes even further by discovering new types of source code vulnerabilities.

Source code repository analysis can also reveal information about the developers writing the code. Team dynamics can be highlighted by analyzing commit times and content: managers can identify when software engineers are the most productive, then arrange meetings and encourage cross-team collaboration accordingly.
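
As a rough illustration of the commit-time analysis described above, here is a minimal Python sketch. It is not how source{d} or any other product works; the function names are hypothetical, and it assumes the commit timestamps have already been exported as ISO 8601 strings (for example, the output of `git log --pretty=format:%aI`).

```python
from collections import Counter
from datetime import datetime


def commits_per_hour(timestamps):
    """Count commits for each hour of the day (0-23).

    `timestamps` is assumed to be an iterable of ISO 8601 strings such as
    "2018-05-14T10:15:00+02:00"; the hour is taken in the author's own
    local time, as recorded in the timestamp.
    """
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return {hour: counts.get(hour, 0) for hour in range(24)}


def busiest_hours(timestamps, top=3):
    """Return the `top` hours with the most commits, busiest first.

    Ties are broken by the earlier hour so the result is deterministic.
    """
    counts = commits_per_hour(timestamps)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [hour for hour, count in ranked[:top] if count > 0]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        "2018-05-14T10:15:00+02:00",
        "2018-05-14T10:45:00+02:00",
        "2018-05-15T16:05:00+02:00",
    ]
    print(busiest_hours(sample, top=2))
```

Commit counts are only a crude proxy for productivity, so a histogram like this is better treated as a conversation starter than as a measurement.
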
Looking at programming language and framework trends can inform hiring managers about what type of talent to hire and what upskilling resources they can provide. Adding source code as a new dataset in enterprises' data warehouses and visualization platforms such as Power BI, Looker or Tableau will give everyone in the engineering organization a whole new level of observability into the source code and the development process.

Learning from source code to build better tooling

Yet the most exciting aspect of looking at code as a dataset is that it can be used to train machine learning models that automate many repetitive tasks for developers. We're already starting to see new machine learning based applications for assisted code review or suggestions on GitHub. Imagine how much time developers could save if bots reminded them of style or naming conventions, or detected similar code at every level, from whole projects down to individual functions.

There is also a class of tasks where automation performs better than humans can. One example is finding out whether a piece of code duplicates code from an existing dependency, or even from any popular open source project. A human cannot memorize millions of lines of code, while this is an easy win for a good algorithm.

Taking this further, the future of software engineering may lie in training and managing machine learning models and getting them to do the coding work that humans are currently doing. For instance, Diffblue uses machine learning to automatically write unit tests for your code. Unlike humans, computers can work 24/7 and easily identify patterns or flag issues across really large codebases. These new machine learning based tools will enable developers to build better software faster as a team by focusing on what's really important and leaving the non-essential tasks to bots.
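
To make the duplicate detection idea mentioned above more concrete, here is a minimal sketch in Python. It is a simple, non-learned baseline rather than the method used by any tool named in this article: each snippet is split into tokens, the tokens are grouped into overlapping runs ("shingles"), and two snippets are compared by the Jaccard similarity of their shingle sets. The function names, the tokenizer and the shingle size are all assumptions chosen for illustration.

```python
import re


def token_shingles(source, size=3):
    """Split source text into crude tokens and return the set of
    overlapping token runs ("shingles") of length `size`.

    The regular expression is a deliberately simple, language-agnostic
    tokenizer: identifiers, integers, or any single non-space character.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    if len(tokens) < size:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(src_a, src_b, size=3):
    """Similarity score between 0.0 and 1.0 for two code snippets."""
    return jaccard(token_shingles(src_a, size), token_shingles(src_b, size))


if __name__ == "__main__":
    original = "def add(a, b):\n    return a + b"
    reformatted = "def add(a,b): return a+b"
    unrelated = "print('hello world')"
    print(similarity(original, reformatted))  # same tokens, different layout
    print(similarity(original, unrelated))    # no shared token runs
```

Because whitespace is discarded during tokenization, reformatted copies still score as identical. Comparing every snippet against millions of others this way would be slow, so a search at the scale the article describes would need an indexing scheme on top, such as hashing the shingles.
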
These machine learning models and applications are the building blocks for the next generation of developer tools, which will forever change the way students and developers learn programming as well as how they write and review code. It's inevitable: machine learning on code is the next frontier for a whole new series of software building tools.