by Rebecca Merrett

The Search Party taps machine learning to spot variations in 15 million resumes

Feb 12, 2016
Data Mining

Online recruitment marketplace The Search Party is using machine learning algorithms to scan 15 million candidate resumes and ensure it provides the right candidates to employers.

Speaking at the Chief Data Officer Forum in Sydney, head of data science Dylan Hogg discussed the use of a custom clustering algorithm to spot variations in resumes belonging to the same person, and a deep neural network to spot variations in job titles.

“As you can imagine a resume has a lot of text data, so a lot of what we do is natural language processing. It’s taking any natural language such as English and trying to infer the structure out of it and get insights from it,” Hogg said.

The Search Party is an online recruitment marketplace that was founded in Sydney in 2014. Its aim is to continuously improve its algorithms to serve up the most relevant candidate resumes to employers.

Hogg and his team developed a custom clustering algorithm to find different versions of a candidate’s resume. A candidate might have updated their resume at different points in time with changes to contact details and skills, or they may have created different resumes tailored to different roles.

“It’s similar to resolving multiple customer records to get a single view of the customer,” Hogg said, pointing out that it is not as simple as cross-referencing, because variations in names and skills make it harder to determine whether two resumes belong to the same person.

The variables the algorithm looks at are full name, email address, names of companies the candidate has worked for, phone number(s) and list of skill sets.

The text is processed in a way that turns categorical data into numerical vectors, as clustering works best with numerical data. First, the data is tokenised into text snippets. For example, the name Dylan is broken up into segments ‘dy’, ‘yl’, ‘la’ and ‘an’. This makes it robust to spelling variations, Hogg said.
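The character-bigram step can be sketched in a few lines of Python. This is a minimal illustration of the tokenisation described above, not The Search Party’s production code:

```python
def char_ngrams(text, n=2):
    """Split a lowercased string into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Dylan"))   # ['dy', 'yl', 'la', 'an']
```

A misspelling such as “Dillan” still shares the bigrams ‘la’ and ‘an’ with “Dylan”, which is what makes this representation robust to spelling variation.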

Then TF–IDF (term frequency – inverse document frequency) is applied, which looks at how frequently a word appears in a document and its importance relative to the whole set of documents. It can be used to represent a word as a vector of numbers.
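The standard TF–IDF weighting can be sketched in pure Python. Real pipelines typically lean on a library such as scikit-learn, and the exact variant The Search Party uses is not stated, so treat this as an illustration of the idea:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each token in each document by term frequency times
    inverse document frequency, so tokens that appear in every
    document score low and distinctive tokens score high."""
    n = len(docs)
    df = Counter()                 # number of docs each token appears in
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({tok: (count / len(doc)) * math.log(n / df[tok])
                         for tok, count in tf.items()})
    return weighted
```

Fed with the character n-grams from the previous step, this turns each resume field into a sparse numerical vector that a clustering algorithm can compare.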

The next step is using a fast canopy clustering method, which groups potential duplicate candidates that require further investigation.
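Canopy clustering uses a cheap distance measure and two thresholds to group points quickly before any expensive comparison. A minimal sketch follows; the Jaccard distance and the threshold values here are illustrative assumptions, not the company’s actual choices:

```python
def jaccard_distance(a, b):
    """Cheap set-overlap distance, suitable for bags of n-grams."""
    return 1 - len(a & b) / len(a | b)

def canopy_clusters(points, distance, t_loose, t_tight):
    """Greedy canopy clustering. Each pass seeds a canopy from the first
    remaining point; points within t_loose join that canopy, and points
    within t_tight are also removed from consideration as future seeds."""
    remaining = list(points)
    canopies = []
    while remaining:
        centre = remaining.pop(0)
        canopy = [centre]
        survivors = []
        for p in remaining:
            d = distance(centre, p)
            if d < t_loose:
                canopy.append(p)       # close enough to investigate together
            if d >= t_tight:
                survivors.append(p)    # far enough to seed another canopy
        remaining = survivors
        canopies.append(canopy)
    return canopies
```

Each resulting canopy is a small candidate set of potential duplicates that a slower, more accurate matching step can then examine.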

The Search Party also built a deep neural network (a neural network with many hidden layers) to find variations in job titles. Using a list of job titles from the Internet as a source of truth to train the neural net, Hogg was able to map job descriptions to job titles and have the model learn how closely different job titles relate to each other.

“Then once it’s trained, it gives you a probability distribution over what job titles it believes it is. It deals well with acronyms and synonyms,” Hogg said.
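A probability distribution over candidate job titles is typically produced by a softmax output layer. The article does not describe the network’s architecture, so the scores and titles below are purely hypothetical; the sketch only shows how raw scores become probabilities:

```python
import math

def softmax(scores):
    """Convert raw network output scores into probabilities summing to 1."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

titles = ["software engineer", "software developer", "accountant"]
probs = softmax([2.1, 1.9, -0.5])            # hypothetical network scores
```

Ranking the titles by these probabilities gives the behaviour Hogg describes: the correct title rises above the alternatives as training progresses.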

Hogg first needed to turn textual data into numerical vectors, and did this using Word2Vec, Google’s open source tool. It enables vector arithmetic on words, and maps words into n-dimensional space. It is able to predict words using the context of other words.
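“Which words are close to a word” is usually measured as cosine similarity between embedding vectors. The toy 3-dimensional vectors below are hand-made for illustration; real Word2Vec embeddings are learned from data and typically have 100–300 dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings: related job words point in similar directions.
vectors = {
    "developer":  [0.90, 0.10, 0.00],
    "programmer": [0.85, 0.15, 0.05],
    "accountant": [0.10, 0.90, 0.20],
}

def nearest(word):
    """Return the vocabulary word closest to `word` by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

In this toy space “developer” and “programmer” point the same way while “accountant” does not, which is the property that lets the neural network treat synonymous job titles as related.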

“We are finding we are getting a lot of use out of that method now. You can see which words are close to a word and stuff like that. That’s just the preprocessing of the text to then feed it into the neural network.

“We give it the training data and job titles and over time it learns to rank the correct job titles above other job titles.”