It's difficult to talk about big data without also discussing the big data skills gap in nearly the same breath. But is it as bad as it seems?
According to a recent CompTIA survey of 500 U.S. business and IT executives, 50 percent of firms that are ahead of the curve in leveraging data, and 71 percent of firms that are average or lagging in leveraging data, feel that their staff are moderately or significantly deficient in data management and analysis skills.
These firms see real costs associated with a failure to come to grips with their data, from wasted time that could be spent on other areas of their business to internal confusion over priorities, lost sales, lack of agility and more.
Forecasters paint a seemingly dire portrait of a skills shortfall that will only get worse over time. The McKinsey Global Institute estimates that by 2018, there will be a shortage of 1.7 million workers with big data skills in the U.S. alone—140,000 to 190,000 workers with deep technical and analytical expertise and 1.5 million managers and analysts with the skills to work with big data outputs.
But Tim Herbert, vice president of research and market intelligence at CompTIA and author of its second annual Big Data Insights and Opportunities study, says the situation may not be as drastic as you think.
"There will be a situation where, at the highest levels, probably the Fortune 100, there will be a skills shortage," Herbert says. "For most medium and small companies, they probably will be able to satisfy their skills needs by a combination of retraining and additional staff. The tools associated with big data will mature. The capabilities and ease of use will mature over time and that will certainly help. Like a lot of other technologies, there will be individuals that maybe they weren't trained to do it but they will have an aptitude to work with data."
Hadoop Isn't Incomprehensible
Sara Sproehnle, vice president of Educational Services at Cloudera, provider of one of the most popular Hadoop distributions, agrees.
"Training has really been a strategic component of what we do at Cloudera," she says. "Hadoop is a new technology and there really is a skills gap. But you can easily cross-train people. It's not that the technology is incomprehensible. You just need to take existing developers, analysts and admins and cross-train them."
Case in point: Persado, a pioneer in "Marketing Language Engineering." Persado helps brands optimize their marketing messages to their target audience at every digital interaction through a systematic methodology that leverages math, computational linguistics and big data.
"We can look at the different 'genes' of a marketing message and break it down and build it back up using mathematics, linguistics and technology to make it a marketing message that a marketer would be happy to bring to market and a consumer would be more likely to interact with and click on," says Matthew Novick, chief financial officer at Persado.
Achieving this requires continuous data collection and the ability to query that massive volume of data. Persado's business depends upon its data warehouse.
Persado's development team is focused on ensuring that the company's infrastructure is aligned with the needs of its data scientists, including regularly generating key performance indicator (KPI) reports, managing data from heterogeneous sources, preparing customized analyses and implementing specific statistical algorithms in Java based on reference implementations in R.
But in 2010, not long after Persado was born, the relational database management system (RDBMS) the company was using to power its data warehouse was becoming unwieldy. The development team, led by Christos Soulios, software team leader and application architect at Persado, began the process of migrating to a NoSQL environment. As its analytics and reporting needs became more sophisticated, the company then needed to decouple the online analytical processing (OLAP) system into a technology stack of its own.
Soulios decided that Apache Hadoop was the right solution to collect, aggregate and process data from Persado's heterogeneous data sources, including MongoDB, MySQL config servers and Apache logs populated by structured and semi-structured files in Amazon Web Services (AWS) S3 buckets using libraries built on Apache Kafka and Apache ZooKeeper.
But those tasks were easier said than done. Persado didn't have the big data engineers on its staff that it needed to grow capabilities and scale its systems. Moreover, while Persado is a global company with headquarters in London and New York, its development team is based in Athens, Greece, making big data talent even harder to come by.
"Most of our development team and the resources are here in Athens, Greece," says Xinyu Huang, vice president of Engineering at Persado. "Unlike in the U.S., where big data is all over the place, in Greece it's still in the early stage."
Persado Looks to Train Its Teams to Use Big Data Tools
Without the ability to buy the talent it needed, Persado decided to create its own, Huang says. Soulios brought in Cloudera—specifically, Cloudera University. Soulios and the development team worked with Cloudera University's curriculum team to tailor a private, week-long onsite training course for Persado.
"We started benefiting from our decision to work with Cloudera almost right away, since no other company offers a full Data Analyst Training targeted at both developers and analysts, which was one of our biggest priorities," says Soulios, speaking of a course on Apache Hive and Apache Pig. "The intensive workshop also included the full Cloudera Developer Training for Apache Hadoop with the option of testing for the sought-after CCDH certification following the class."
"Having the training in-house was really important," adds Huang. "It got our team interacting with the technology and understanding what you could possibly do with it. We have the data, but the team has been dealing with data on an ad-hoc basis, chunk by chunk. Doing the training really helped us to know how these tools really can help us. Over the long run, what was most beneficial for the team was to talk to someone who actually has real experience working with this big data technology. That really opened up the mindset of the developers here, especially the local developers that we have in Athens."
Hadoop Is a Game-Changing Technology
Huang says that since the training, Persado has successfully built up its new data warehouse capabilities using Hadoop, Hive and Pig.
"What we find is that Hadoop is kind of a game-changing new technology," says Sproehnle. "It's not that people can't learn it, but they need to invest in that training. They really need to learn this brand new technology. We find that if people fumble around on their own, it's really hard to get Hadoop into production. But if you invest in a week of training you can begin maximizing that investment really quickly."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.