by Mary Branscombe

Diving into Microsoft’s Azure Data Lake

Feature
Dec 20, 2016
Analytics, Big Data, Cloud Computing

The new Azure Data Lake service aims to let you get value out of all your data in the cloud, using familiar tools and skills.

Microsoft has been working with big data for a long time. Internally, it uses a tool called Cosmos, built with its own distributed processing technology, Dryad, to handle the data for everything from Bing AdCenter to Windows telemetry. Cosmos is used for curation, processing, analysis and reporting on massive data sets; imagine what looks like a single file that contains all the URLs Bing has ever seen, against which you can run interactive queries, with those queries running on maybe 50,000 machines in parallel.

That kind of data handling is useful in a lot of industries. It could help take the internet of things (IoT) from today’s often disjointed set of connected devices to devices that are genuinely connected together. A smart car, a smart home or a smart city is going to need to connect a massive number of “things,” both old and new, emitting different kinds of data over a mix of protocols — and enable those things to interact with each other at scale, with a bigger end in mind.

Piping in data

Cars, plane engines and container ships all individually produce terabytes of data each day. The connected world will need to handle thousands of those at once, in real time — and then go back to analyze the data again for larger patterns later.

Even understanding customers means working with more signals than ever.

It should be relatively easy for a business to calculate how valuable a customer is: You can look at their purchase history, along with the pattern of how often they make purchases and how often they return them, how quickly they pay their bills, what the margins are on the products they buy, and what it costs you to sell to and support that customer. But if you want to predict how valuable a customer could be to you, and how much you should invest in attracting them, you’ll want to include a lot more sources of data.

Obviously, you’d want to look at clickstream data from your website to see if they’re already a customer, or just a window shopper, and how they behave when they visit your site. If you have a mobile app for your business, you can analyze how customers use it, what they’re doing when they access it and whether they share any information from it. Looking at their social media graph will tell you not just what they’re interested in, but whether they’re someone who acts as an influencer, recommending products and services — and how effective they are at it. You might want to bring together several extremely large data sets to see if you can get insights from them; if you can’t, you want to throw those data sets away just as quickly and try some others.

Those kinds of data processing problems aren’t exactly the same as those that Cosmos solves for Microsoft, which is why Cosmos has never turned into a product Microsoft sells. But what Microsoft learned from building and using Cosmos, together with what it knows about data warehouses from years of SQL Server, what it has learned from running big data services based on Hadoop and Apache Spark, and the big data processing that underlies its recent breakthroughs in machine learning, has all gone into creating Azure Data Lake, a new service that has just gone from preview to general availability.

Azure Data Lake Store

In fact, Azure Data Lake includes multiple services, starting with the Azure Data Lake Store, a hyperscale repository, compatible with the Hadoop Distributed File System (HDFS), that’s designed for multiple big data analytics workloads and where you collect all your data. This is about getting all your data in one place so that you can experiment with it. To speed up ingestion, you can store both structured and unstructured data in its raw, native form, without having to transform it or define a schema or hierarchy in advance to model it (by contrast, in a data warehouse you have to transform and model the data before you load it, which makes the warehouse more efficient but less agile). You don’t have to repartition data to analyze it, and there’s no limit to the size of data or the number of files and objects you can store.

That puts tables, comma-delimited files, relational database files, semi-structured logs, clickstream data and streams of sensor data alongside media files, social media content and any other data you want to work with, whether a file is a few kilobytes or over a petabyte in size (considerably larger than other cloud data stores allow). You also get access to the Azure Data Catalog — because not all the data you might want to analyze comes from your own systems.

Azure Data Lake Store can handle the high throughput needed to analyze those exabytes of information and to pull out longer-term insights by correlating multiple sources of data gathered over time, using offline batch processing (and even machine learning). It can also handle high volumes of small writes at low latency, so it works for real-time scenarios where you need results and alerts as the data arrives, like streaming IoT sensor data or clickstreams for website analytics. It does that by ingesting the data fast and then periodically re-integrating and updating the production data.
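
To make that idea of storing data raw and only describing it when you query it concrete, here’s a minimal sketch in U-SQL (the query language for the service, covered below) that applies a column layout to a raw clickstream file at the moment it’s read; the path and column names are made up for illustration:

// Schema is applied when the raw file is read, not when it was stored.
@clicks =
    EXTRACT UserId    string,
            Url       string,
            Referrer  string,
            EventTime DateTime
    FROM "/raw/clickstream/2016-12-01.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);  // built-in tab-separated extractor; skip the header row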

Swimming in data

Being able to take on multiple roles and to allow queries from multiple tools at once is another of the big advantages of a data lake over a data warehouse.

The important part of Azure Data Lake is the wide variety of ways you can work with all your data. You want to be able to do that where the data lives, because moving the data somewhere else to be processed would be slow, expensive and would miss the point of having a data lake.

HDInsight (based on the Hortonworks Data Platform and running on your choice of Windows Server or Ubuntu Linux) is part of the Azure Data Lake service and gives you all the options of the Hadoop ecosystem — from the Hadoop tools like Spark, Storm, HBase, R Server and Kafka that are directly supported, to the many third-party tools that work with Hadoop, to Informatica’s upcoming Data Lake Management suite (which can itself run on Azure) for synchronizing data sets into Azure Data Lake Store and out again to data visualization tools like Tableau. But with Azure Data Lake, you’re using those Hadoop tools as part of a managed, geographically distributed service, where you don’t have to build and run your own cluster.

Azure Data Lake Analytics

Then there’s Microsoft’s own Azure Data Lake Analytics. Instead of thinking about standing up a Hadoop cluster, think about submitting an analytics job to a service that uses Apache YARN (a generic resource management and distributed application framework designed to give Hadoop more data processing applications than just MapReduce).

The way you work with Data Lake Analytics is using U-SQL, a query language that combines declarative, SQL-like syntax with C# types and expressions. It runs natively on YARN as another analytics engine. That’s how you can partition the computation over a huge number of machines, running asynchronously and restarting on failure, and that is what turns your data lake into a vast, distributed, fault-tolerant computer on which you can run massively distributed analytics.
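
As a rough illustration (again with made-up file paths and column names), a complete U-SQL job submitted to Data Lake Analytics can be as small as an extract, an aggregation and an output:

// Read a raw log from the store, aggregate it and write the result back.
@clicks =
    EXTRACT UserId    string,
            Url       string,
            Referrer  string,
            EventTime DateTime
    FROM "/raw/clickstream/2016-12-01.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);

@pageViews =
    SELECT Url,
           COUNT(*) AS Views
    FROM @clicks
    GROUP BY Url;

OUTPUT @pageViews
TO "/output/page-views.csv"
USING Outputters.Csv();

Notice that the script says nothing about clusters or machines; how far the job is spread out is a setting you choose when you submit it, and the service takes care of the rest.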

The ‘U’ in U-SQL probably stands for “unified” because it can handle both structured and unstructured data processing; because you can combine both declarative SQL and user code written in R, Python or C#; and because it can query data from many different sources, like Azure Data Lake, Azure blob storage and Azure SQL Database. (Depending on who you talk to at Microsoft, alternative reasons for the name are either that you need a submarine to get to the bottom of a data lake that might look more like a swamp of unidentified objects, or simply that U follows T in the alphabet.)

Handling big data entirely by writing your own code gives you the most power, but it takes a major investment to get started. SQL-based tools are easy to get started with but hard to extend. The idea of U-SQL is to be as easy to use as SQL but as powerful and expressive as C# (and familiar to both SQL and .NET developers so they can readily work with big data). You can start with SQL-like queries and then bring in your own algorithms.
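
A sketch of what that looks like in practice, continuing the hypothetical clickstream rowset from above: ordinary C# expressions dropped into an otherwise SQL-like statement, with the option of pulling in your own C# functions when the built-in expressions aren’t enough.

// C# expressions and .NET types are used directly inside the query.
@enriched =
    SELECT UserId,
           Url.ToLowerInvariant() AS NormalizedUrl,                               // C# string method
           (string.IsNullOrEmpty(Referrer) ? "direct" : "referral") AS TrafficSource,  // C# conditional expression
           EventTime.Hour AS HourOfDay                                            // C# DateTime property
    FROM @clicks;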

Again, this is based on what Microsoft has learned from years of analyzing its own exabyte-scale data lake of information about Office, Windows, Xbox Live and Bing (using a parallel query language called SCOPE), so it’s mature. U-SQL tools for Visual Studio and Visual Studio Code give you cross-platform options for authoring, debugging and performance analysis (which matters when your jobs could be running across thousands of nodes).

Analysis and intelligence at warp speed

U-SQL also has built-in support for some machine learning algorithms. It can detect and label objects in images, detect faces and the emotions on them, extract key phrases from text, analyze the sentiment in text and OCR the text in scanned documents. And you can build more intelligent algorithms of your own with Python and R, running them in U-SQL or using R Server on HDInsight.
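
In U-SQL, that support surfaces as assemblies you reference and processors you call like any other operator. The sketch below follows the shape of Microsoft’s preview-era cognitive samples; the assembly and processor names are recalled from those samples and may have changed, so treat them as illustrative rather than definitive:

REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Pull image files in from the store; FileName is a virtual column taken from the path pattern.
@images =
    EXTRACT FileName string,
            ImgData  byte[]
    FROM "/images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Detect and label the objects in each image; the tags land back in the lake as queryable rows.
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags string
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();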

“We have to put the intelligence in the data lake,” Joseph Sirosh, corporate vice president for Microsoft’s data group, tells CIO.com. “We’re bringing algorithms to the data and operating on the data where it lives, not moving the data to where the intelligence resides, because then you’re seriously limited by the cost of data movement.” You also get cloud performance (image tagging in Azure Data Lake can handle around a million images a minute).

All this creates what Sirosh calls cognitive databases. “The concept is that you make these things into functions; you call them like database functions in SQL and run them over large amounts of data,” Sirosh says. “Just as you would run stored procedures in the database, now the machine learning model lives inside the database. That makes querying easier. And the models become something you can share among multiple apps.”

Crucially, the results of these functions also go back into the data lake. “You’re extracting the age, gender, facial emotions from a picture and that becomes data that can be queried and joined with other types of data. The predictions you make with machine learning also land back in the same data lake so you can combine that with other things.”

Azure Data Lake follows Microsoft’s usual strategy of offering the same kind of cloud tools as other vendors, but focusing on turning them into managed services, improving them with features from the company’s other offerings and integrating them with both Microsoft’s own tools and the wider ecosystem.

If you’re ready to start working with petabytes and exabytes of data — or just with a wide mix of data types on which you want to run multiple levels of analysis — Azure Data Lake promises a bridge to the data science world, using tools that will be familiar to enterprise developers and analysts.