Discovering and enriching your data assets with Microsoft Azure Data Catalog

ADC MainNo one has the urge to volunteer for documenting every solution delivered, especially when it comes to data assets. Without the help of a reliable subject matter expert, there are times when it can be difficult to locate the required data source or to understand the meaning behind the data. When I first heard about Microsoft Azure Data Catalog, I was intrigued by how Microsoft were planning to address the notion of many people, “documentation is boring.” My curiosity lead me to an exciting discovery!

To get started, all you need is an Azure subscription. Jump into the Azure Data Catalog Portal and enter the required information. I found it very easy to setup and configure – it took me less than 10 minutes to complete the setup process.

An introduction to Azure Data Catalog

Microsoft Azure Data Catalog is a fully managed cloud-based service, part of Cortana Intelligence Suite – a unified set of tools for building high-performance analytic solutions. It acts as a central repository for information about your data sources, allowing you to discover your registered enterprise data assets. Your metadata (information about your data) can be structural (i.e. database name, table name, column name, data type, etc.) and descriptive (extended property of an object for annotation). Since it is a fully-managed, there is no need to invest in a costly server to get started, there’s no additional manpower and you don’t need to worry about maintenance costs.

How can Azure Data Catalog help BI developers?

Imagine you are a new BI developer that recently joined the team and you were asked to develop a report for the finance department, you would probably ask yourself:  What data should I use? Where is the data located? How can I connect to data source? Who is the right person to ask for permissions? What does the data look like? And the list goes on…

Wouldn’t it be good if there was an application to help answer all these questions?

Azure Data Catalog to the rescue! It helps us easily discover, analyse, visualise and consume data sources. It also helps us to share our perspective and insights on our data. The main objectives are to avoid recreating data that already exists and minimise the amount of time spent in finding for the right data to use, giving us more time for data analysis.

ADC Process

Figure 1: Azure Data Catalog Process

Discovering data

Discovering data assets is simple and easy. You can search by server name, database name, schema name, object name and column name. Discovering Data

You can also search by tags and by asset owner. Another way of discovering data asset is by using filters. You can filter by tags, object type and source type. It also lists all the recently viewed assets, pinned assets and saved searches. All registered data assets by default can be discovered by users that have granted access to Azure Data Catalog. Furthermore, there is an option for users to take ownership and to hide specific data assets if necessary.

Understanding the data

During data source registration, you have the option to include sample records of your data (by default, the first 20 records). Understanding Data

This allows you to have a quick glance of how the data looks like without connecting to the actual data source. Another important feature of Azure Data Catalog is the ability to see the table and column profiles. Table profile contains the number of rows, total size, the date of last data update as well as the last schema update. Column profile shows column name, data type, number of null values, number of distinct values, minimum and maximum value, average and standard deviation.

Consuming data

Depending on the object type and if you have permission to connect to the data source, you can easily explore the data in Excel, SQL Server Data Tools, Power BI and Power Query.Consuming data

Unlike the typical way of connecting to a database, you don’t need to know, or manually enter the connection details. However, please remember, the entire data is not imported to Azure Data Catalog and the permission to access the data is still under the control of your database administrator.

 

Sharing your insights on the data

Sharing your insights

Azure Data Catalog bring business and technical people onto the same page by allowing everyone to share their perspective on data assets or objects.

 

You can enter friendly names, annotate, describe and create detailed documentation. You can also assign subject matter experts or owners and use the business terms as tags from the business glossary.

 

Supported data sources

The majority of data sources can be registered, whether in the cloud or on premises. Data can be structured, semi-structured and non-structured. Below are the most common data sources supported by Azure Data Catalog:

SQL Server Azure Blob Azure Data Lake File System
SQL Data Warehouse Azure Table MySQL Database FTP
Analysis Services HDFS HANA Database HTTP
Reporting Services Hive SQL Server Master Data Services Odata
Oracle Database Teradata Azure DocumentDB Salesforce

Please note that Microsoft is actively adding more data sources. You can check the latest list from Azure Data Catalog supported data sources.

How to register data sources

Publish your data sourceThe registration tool allows you to publish your data sources in Azure Data Catalog for discovery and management. On your first attempt to register your data source (click on launch application), the portal will automatically download and install the data catalog desktop application. You can register your data sources by using the data catalog desktop application or manually, using the browser. I would recommend using the desktop application, it’s very user-friendly.

In my personal R&D, I was able to register various data sources including SQL Server, SSRS, SSAS, Azure SQL Database, Azure Data Warehouse, files in Azure Blob storage, files from a SharePoint site and files from my local machine.

Useful scenarios and business use cases

In my opinion, all organisations can benefit from using Azure Data Catalog. Whether you have a small number of data sources or if you have a mountain of reports which no one can explain.

Some useful scenarios and business use cases for Azure Data Catalog can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-common-scenarios/

Azure Data Catalog Versions

Microsoft Azure Data Catalog has two versions:

  • Free – Supports unlimited users and 5,000 registered data assets. Cannot use the glossary and does not support Azure Active Directory.
  • Standard – The cost is $1 per user per month. Supports unlimited users and can register 100,000 data assets. It supports Azure Active Directory users and groups authorisation. It includes access to the glossary and asset-level authorisation.

To learn more about pricing, please visit: https://azure.microsoft.com/en-us/pricing/details/data-catalog/

Limitations and missing features of Azure Data Catalog

  • Cannot apply business rules on your data. For example, right now a retail organisation with subsidiaries in different countries is unable to apply a business rule so users in each country see the retail price of the product based on their local currency.
  • Ownership is asset level. There is no way to hide specific column to particular user or group of users. For example, all registered users can see all the columns in the employee table it is not possible to restrict views to sensitive data like salary and bank account details.
  • Does not capture the relationships or dependencies between objects. It is very helpful for developers to easily pre-determine what would be the possible tables, views, stored procedures and reports that will get affected when updating a database table. With this, the developer can easily assess the impact of any changes and determine the time required to adjust the affected objects.

Conclusion

Azure Data Catalog is a user-friendly application that can help us to easily discover, understand, consume and enrich our data assets. However, the success of data asset documentation does not entirely depend on the tool used (although this does help) but rather in the level of effort exerted to annotate, describe, tag and manage every single registered data asset. There are still few limitations but it’s a good starting point for everyone dealing with data assets to share their perspective and establish a community and culture of data. Will documentation still be boring with Azure Data Catalog. Why not give it a try?

Useful links & external resources

Detailed documentation, videos and tutorials can be found here: https://azure.microsoft.com/en-us/documentation/services/data-catalog/

Create your free Azure subscription here:

https://azure.microsoft.com/en-us/free/

Create your first Azure Data Catalog here:

https://azure.microsoft.com/en-us/services/data-catalog/

Azure Data Warehouse – Jump Start (part 1)

I have worked with Azure Data Warehouse few times in my past projects and I think that it is a good idea to share some insights about it and discuss how it differs with traditional data warehousing on premise.

What is Azure SQL Data Warehouse?

It is a fully elastic, managed, and parallelised relational database service in the cloud. It is a platform as a service (PaaS) and also a Data Warehouse as a Service (DWaaS). It is Massively Parallel Processing which means it uses Distributed Storage and Distributed Compute – uses multiple nodes and storage running in parallel.

Why choose Azure SQL Data Warehouse over conventional on premise data warehouse?

Well aside for its MPP characteristic, you don’t need to build expensive server. You also don’t need extra manpower to maintain the server or operating system. And importantly, you can easily scale storage and compute up or down in seconds as you need it.

The unit of underlying compute power of your database is called Data Warehouse Unit (DWU). Currently, it can scale from 100 to 6000 DWU. Although it is very easy to change the scale, please be cautious on playing with this, because the higher the scale the higher is the cost!

You are billed based on 2 components:

  • Storage – you are billed by Gigabyte and the outbound data transfer. There is no cost for storage transactions.
  • Compute – Always remember, the higher the compute the more you will pay.

What are the options in provisioning Azure Data Warehouse?

There are 3 ways of provisioning Azure SQL Data Warehouse, by Azure portal, by Power Shell and by TSQL. In this blog, I will show you how to provision Azure Data Warehouse using Azure Portal.

Assuming that you already setup your Azure subscription, jump straight to the portal to start: https://portal.azure.com

  • Click +New > Data + Storage > SQL Data Warehouse

sqldw-p1

  • Enter the database name, select subscription, enter/select resource group, create/select the server name and don’t forget to select the lowest scale at this stage.

sqldw-p2

  • Add your IP address in Firewall settings after the server is provisioned. In the Firewall settings window, click on the +Add client IP and click Save. The IP address that you added here will have access to all the databases inside this server.

sqldw-p3

 

In part 2 of this blog, I will discuss some of the design considerations and ways of loading data in Azure Data Warehouse.