Lucid Technologies & Solutions Pvt. Ltd.


PowerBI Dataflows for enabling a “Collibra Data Mart”

In the age of self-service analytics, the key expectation from any BI platform is to offer the powerful capabilities that exist in enterprise-scale environments to even the individual user. One such platform is PowerBI. True to its name, it provides powerful and cool features, simplified for individual users.

Can you imagine how huge and complex data warehousing efforts were a decade ago? Integrating and consolidating data from varied systems into one single source of truth to support analytics?

Microsoft has now enabled a similar architecture, with the same power of consolidating data into a common model, but achievable by individual users rather than the army that a comparable data warehousing effort once required.

This is based on a powerful combination of PowerBI Dataflow, the self-service data prep tool, and the PowerBI shared datasets known as the Common Data Model. Refer to the architecture diagram on the Microsoft site (https://docs.microsoft.com/en-us/power-bi/transform-model/service-dataflows-overview).

PowerBI Dataflow is a low-code/no-code tool that lets you pull data from any data source supported by PowerBI. With the simplicity of a data prep tool, all the needed transformations can be applied to the sourced datasets, which are then loaded as entities into the Common Data Model (CDM). The CDM is another powerful feature that enables a common semantic model for datasets consumed across the organization.

One of the usage scenarios for this architecture is to enable self-service analytics on a Data Governance platform like Collibra.

Dataflows can be used to connect to the Collibra REST API interfaces to pull key metadata entities together into a common data model for a ‘Collibra Data Mart’. This then becomes a “live” (meta)data mart for analysis using PowerBI reports and dashboards. Thus, powerful Data Governance Dashboards can be enabled. As future extensions, analysis can be done across Collibra instances, extending even to other Data Governance platforms in the Enterprise.
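
To make the pattern concrete, below is a minimal sketch of paging metadata records out of a Collibra instance over REST, written in Python rather than the Power Query M a Dataflow would actually use. The instance URL, credentials, and the shape of the /assets endpoint follow the Collibra REST API 2.0 pattern but are assumptions here and should be verified against your instance's API documentation.

  # Minimal sketch: pull Collibra assets over REST as the raw material for a
  # "Collibra Data Mart". Endpoint and response fields are assumptions based on
  # the Collibra REST API 2.0 pattern; verify against your instance.
  import requests

  BASE_URL = "https://your-instance.collibra.com/rest/2.0"   # hypothetical instance
  AUTH = ("api_user", "api_password")                        # hypothetical credentials

  def fetch_assets(page_size=100):
      """Page through /assets and return a flat list of asset records."""
      assets, offset = [], 0
      while True:
          resp = requests.get(
              f"{BASE_URL}/assets",
              params={"offset": offset, "limit": page_size},
              auth=AUTH,
              timeout=30,
          )
          resp.raise_for_status()
          page = resp.json().get("results", [])
          if not page:
              break
          assets.extend(page)
          offset += page_size
      return assets

  if __name__ == "__main__":
      for asset in fetch_assets()[:5]:
          print(asset.get("id"), asset.get("name"))

The same paging pattern can be expressed directly in Power Query M inside the Dataflow; the Python version is only meant to show the shape of the calls.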

The data captured as part of this ‘Collibra Data Mart’ is stored in Azure Data Lake Storage, making it extensible and scalable. With a PowerBI Premium license, 100 TB of storage per node is available, enabling historical analysis.

To know more about how to build such “Collibra Data Marts” and other solution accelerators from Lucid, contact us at collibrainfo@lucidtechsol.com.


The “Actionable” Metadata

During the days of building data warehouses and BI systems, all that was captured, as truly “metadata,” was the data about creation/access of the data structures for audits, data about the execution of data loads, and so on. Primarily, whatever supported the “operations” of the data warehouse. During the Data Modeling phase, the description of the entities was captured in the modeling tool to help with the discussions, but once the model was approved and implemented, nobody really bothered to keep it updated.

Some of these descriptions went into the physical columns, as most database products do support comments on physical structures. Attempts were then made to reverse engineer and publish the physical data structure details in a portal, to make design easier for the ETL job creators and to help the BI teams query the warehouse for report generation. These were the real foundations of capturing “operational,” “business,” and “technical” metadata, respectively, playing more a passive role of infrastructure support than anything actionable.

I never realized that one day the same metadata would become as “powerful” and “actionable” as the data it describes. In fact, without this metadata, even the data it describes loses its power as there is no real “context” to that data. Implementing Data Governance initiatives for my clients for the past five years has only increased this realization.

Let me share in this blog what some of these real “actions” are, based on these implementation experiences, while we delve into the details of the “how” in subsequent blogs.

The list below describes some of the “actions” various types of metadata can drive.

Data Literacy: With the increasing need for users to “find” and “understand” the data they are consuming from various data lakes and data stores, the business metadata that describes this data becomes particularly important. Data catalogs provide this literacy, tagging all the contents with the right glossary elements — captured through collaboration or by auto-tagging through ML algorithms.

Data Protection: Increasingly, organizations face the challenge of complying with data privacy and protection regulations such as GDPR and CCPA. Data classification algorithms drive this sensitive data identification and tag the content with business metadata, such as PII tags. Organizations can then use this metadata to drive various access policies for the data.
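
To make this concrete, here is a toy, rule-based sketch of how sensitive columns could be identified and tagged so that the resulting tags can drive access or masking policies. The column names and patterns are purely illustrative assumptions; real platforms typically rely on ML-based classifiers rather than simple name matching.

  # Toy sketch: rule-based tagging of columns as PII so the tags can drive
  # access or masking policies downstream. Patterns and column names are
  # illustrative only; production classifiers are usually ML-based.
  import re

  PII_PATTERNS = {
      "EMAIL": re.compile(r"email|e_mail", re.IGNORECASE),
      "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
      "NATIONAL_ID": re.compile(r"ssn|national_id|aadhaar", re.IGNORECASE),
  }

  def classify_columns(columns):
      """Return {column_name: [pii_tags]} for columns matching known patterns."""
      tags = {}
      for col in columns:
          hits = [tag for tag, pattern in PII_PATTERNS.items() if pattern.search(col)]
          if hits:
              tags[col] = hits
      return tags

  # The resulting tags would be written to the catalog and referenced by policies.
  print(classify_columns(["customer_email", "phone_number", "order_total"]))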

Data Redundancy Check: As important as building trust in data is, there is also a need to improve the efficiency of Data Management to get the agility the business needs from its data infrastructure. Eliminating data redundancies created by older practices is one way of doing this, and the business metadata tagged to the data is a means of identifying those redundancies.

Analytics Governance: One of the key focus areas in organizations is to ensure the consumers of analytics see trustable content that they can use to drive business decisions. Technical metadata captured from the data sources to the reporting/analytics platforms can be tied together to understand the data provenance and, hence, the quality of the output. This can drive concepts like “report certification” that improve the trust in the final content consumed.

Data Provenance and Impact Analysis: As described above, the technical metadata from various platforms can be tied together and can help the platform development teams understand the impact of the changes. For example, what happens if the structure of a table in a data source is modified? What are all the reports impacted? What is the impact if the data integration platform has to be migrated?
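
A minimal sketch of this idea: if the stitched technical metadata is held as a dependency graph, impact analysis is simply a downstream walk from the changed object. All object names below are hypothetical.

  # Minimal sketch: impact analysis over technical metadata held as a dependency
  # graph. Edges point from a source object to the objects built on top of it.
  # Object names are hypothetical.
  from collections import defaultdict, deque

  dependencies = {
      "SALES_DB.ORDERS": ["HANA.CV_ORDERS"],
      "HANA.CV_ORDERS": ["Report.Daily_Sales", "Report.Region_Summary"],
  }

  def impacted_objects(changed_object, edges):
      """Breadth-first walk downstream from the changed object."""
      graph = defaultdict(list, edges)
      seen, queue = set(), deque([changed_object])
      while queue:
          for consumer in graph[queue.popleft()]:
              if consumer not in seen:
                  seen.add(consumer)
                  queue.append(consumer)
      return seen

  # Which reports are impacted if the ORDERS table structure is modified?
  print(impacted_objects("SALES_DB.ORDERS", dependencies))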

Development Standards: Technical metadata from reporting platforms, such as the SQL embedded in the reports, and metadata from data integration platforms can indicate bad programming practices. These can then be used to drive the creation of development standards to increase efficiency.

Fraud Analysis: System access logs are a great source of information for understanding fraudulent transactions. Similarly, metadata around email exchanges between parties provides useful input for understanding their role in fraudulent activity.

Infrastructure Maintenance: Logs that capture the resource utilization of various applications and processes act as a key input for infrastructure maintenance. These data points can even be fed into a machine learning algorithm to predict possible failures and drive what is called “predictive maintenance” of these servers.

Usage Analysis: As organizations look to modernize their analytics platform, inputs such as the most used reports, data sources, etc. help rationalize the objects to be migrated, rather than blindly transferring everything from the older platform to the new one.


Leveraging SAP HANA metadata to make it a “Trusted” Analytics platform – Part 2 of 2

In Part 1, we listed the key use cases that support and increase the trust in HANA as an analytical platform by leveraging HANA’s metadata. I list them here again for our reference:

  1. Ability to draw the end-to-end lineage or trace the provenance of a data element from its system of use to its system of record or creation – this could be for a platform that consumes HANA (e.g., BI/analytical tools) or for the platforms that HANA itself consumes (e.g., relational databases)
  2. Ability to trigger governance processes and notify relevant stakeholders of change to a data element in the HANA platform
  3. Push back to the HANA platform indicators that its data elements are more traceable and trustable, as an outcome of the governance processes applied to them

We also looked at the first use case of drawing up the end-to-end lineage, from the upstream systems/platforms to HANA or from HANA to its downstream systems, and the benefits it brings.

Let us review the remaining two use cases as well.

One of the factors that improves ‘Trust’ in a data platform like HANA for its data consumers is the ability to audit changes to the underlying data structures. Business requirements are dynamic, and so are the data needs, so changes to the data structures are inevitable. Setting aside the flexibility technology offers to implement these changes in a NoSQL database vis-a-vis a SQL/relational database, auditing changes is key to (a) keep the data consumer informed of the changes, (b) ensure security is not compromised, (c) address the regulatory need for an audit log of changes, and (d) ensure certain enterprise standards are adhered to.

The ability to capture changes on the HANA platform and trigger the relevant governance processes on change then becomes essential. This is the second use case: HANA metadata can be leveraged to compare the version in the HANA platform against the version in the Data Governance (DG) platform and, on change, trigger workflows that support notification of stakeholders, validation, etc.
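
A minimal sketch of that comparison, assuming the DG platform's registered version and a fresh HANA snapshot are both available as simple {table: {column: data type}} structures; a real implementation would read these through the respective APIs and hand each detected difference to a workflow.

  # Minimal sketch: compare a fresh HANA metadata snapshot against the version
  # registered in the DG platform and list the differences that should trigger
  # a governance workflow. Both snapshots are hypothetical.
  def diff_metadata(dg_version, hana_version):
      changes = []
      for table, columns in hana_version.items():
          known = dg_version.get(table, {})
          for column, dtype in columns.items():
              if column not in known:
                  changes.append(("ADDED", table, column, dtype))
              elif known[column] != dtype:
                  changes.append(("TYPE_CHANGED", table, column, dtype))
          for column in set(known) - set(columns):
              changes.append(("DROPPED", table, column, known[column]))
      return changes

  dg_version = {"SALES.ORDERS": {"ORDER_ID": "INTEGER", "AMOUNT": "DECIMAL(15,2)"}}
  hana_version = {"SALES.ORDERS": {"ORDER_ID": "INTEGER",
                                   "AMOUNT": "DECIMAL(18,2)",
                                   "REGION": "NVARCHAR(3)"}}

  for change in diff_metadata(dg_version, hana_version):
      print(change)   # each entry would start a notification/validation workflow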

It is definitely worthwhile to govern the HANA platform from within the DG platform: looking at the lineage, doing impact analysis, or triggering workflows. This is best suited for the DG teams or stakeholders with access to the DG platform. However, from the perspective of an end user of the HANA platform, it makes more sense to leverage the outcome of these DG processes right there in the HANA platform (without having to log into the DG platform).

This is where the third use case comes in: the ability to write back, or feed, the outcome of the governance processes into the source HANA platform. For example, what if we were able to add a comment to a Catalog table or Catalog column based on the governed glossary definition for that object? Or use these comments to tag how ‘trustable’ the object is based on the assessment done within the DG platform, say ‘Certified’, ‘Certified – Bronze level’, etc.? I am not sure if HANA provides a means to update comments for Calculation views yet, but if it does, tagging these analytical objects similarly would be a huge boost to their perceived reliability among data consumers.
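
As a sketch of what such a write-back could look like, the snippet below uses the SAP HANA Python client (hdbcli) to issue a COMMENT ON COLUMN statement carrying the governed definition and certification tag. The connection details are placeholders, and the COMMENT ON syntax should be verified against your HANA version's SQL reference.

  # Sketch: write a governed definition / certification tag back to a HANA
  # catalog column as a SQL comment. Connection details are placeholders;
  # verify COMMENT ON syntax for your HANA release.
  from hdbcli import dbapi

  conn = dbapi.connect(address="hana-host", port=30015, user="GOV_USER", password="***")
  cursor = conn.cursor()

  def write_back_comment(schema, table, column, governed_text):
      """Stamp the governed text onto a catalog column via COMMENT ON."""
      safe_text = governed_text.replace("'", "''")   # naive quoting for the sketch
      cursor.execute(
          f'COMMENT ON COLUMN "{schema}"."{table}"."{column}" IS \'{safe_text}\''
      )

  write_back_comment("SALES", "ORDERS", "AMOUNT",
                     "Certified - Bronze level | Net order amount incl. discounts")

  # hdbcli connections default to autocommit, so no explicit commit is issued here.
  cursor.close()
  conn.close()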

One bonus use case that also gets handled is the cataloging of the HANA objects! Data catalogs have become an essential component of a modern enterprise’s data architecture. A data catalog is nothing but a catalog of an enterprise’s data sets, making them searchable and their contents understandable, along with any additional metadata available (typically likened to the product catalog of e-retailers like Amazon).

My experience has been to try out these use cases in Collibra, one of the leading DG platforms. The first step was to integrate the HANA platform with Collibra and load the metadata of HANA objects (Catalog tables/views/columns/stored procedures/functions as well as the calculation/analytic/attribute views, including their dependencies).
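
As a sketch of the extraction side of that first step, the snippet below reads column and dependency metadata from the HANA system catalog so it can be shaped into files for import into Collibra. The SYS view and column names follow the documented HANA catalog but should be verified against your release, and the connection details are placeholders.

  # Sketch: extract HANA object metadata from the system catalog for loading
  # into a DG platform such as Collibra. Verify SYS view/column names and adjust
  # the schema filter for your landscape; connection details are placeholders.
  import csv
  from hdbcli import dbapi

  conn = dbapi.connect(address="hana-host", port=30015, user="META_USER", password="***")
  cursor = conn.cursor()

  # Columns of catalog tables, with data types and any existing comments
  cursor.execute("""
      SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, DATA_TYPE_NAME, COMMENTS
      FROM SYS.TABLE_COLUMNS
      WHERE SCHEMA_NAME NOT LIKE 'SYS%'
  """)
  columns = cursor.fetchall()

  # Object-to-object dependencies (e.g. views built on catalog tables)
  cursor.execute("""
      SELECT BASE_SCHEMA_NAME, BASE_OBJECT_NAME,
             DEPENDENT_SCHEMA_NAME, DEPENDENT_OBJECT_NAME
      FROM SYS.OBJECT_DEPENDENCIES
  """)
  dependencies = cursor.fetchall()

  with open("hana_columns.csv", "w", newline="") as f:
      csv.writer(f).writerows(columns)
  with open("hana_dependencies.csv", "w", newline="") as f:
      csv.writer(f).writerows(dependencies)

  cursor.close()
  conn.close()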

Then the load was set up for periodic refresh, which is where changes to the HANA platform could be monitored. Governance processes were then set up as Activiti-based workflows in the Collibra platform.

This surely is a big step towards governing the HANA platform and thus increasing its ‘Trust’ factor.


Leveraging SAP HANA metadata to make it a “Trusted” Analytics platform – Part 1 of 2

Data Governance has become one of the key areas of focus for organizations as they tackle the challenge of growing data volumes to derive meaningful business outcomes.

What does it mean when we say “Govern the Data”? As Collibra, one of the leading Data Governance platforms, puts it, it is the ability of an organization to enable its data consumers to find the right data they need, understand what that data is made up of, and finally, trust the data for consumption based on its quality.

SAP HANA, with its in-memory computing capability, is increasingly becoming the platform of choice for both analytical and transactional applications. The three capabilities highlighted above are very relevant to HANA in order to improve the “trust” factor, which is key for any analytical platform apart from its technical capabilities.

Enabling these capabilities is not going to come easy, given the ever-increasing complexity of an organization’s data architecture supporting the volume/variety/velocity of data, or the agile analytic processes that place limited focus on truly supporting the stored data with the right level of information for its use (a simple example being a column description in a typical relational table).

Then how do we tackle this challenge? One of the proven ways is to harvest the “metadata” from the data platforms.

Going beyond the cliched “data about data” definition, what is meant by the “metadata” of the data platforms here? Information about what data structures they hold, which of these data structures are used by processes and applications in other platforms, what relations exist between these data structures, and so on. These are often called the “technical metadata” of the data platforms. This, when combined with the “Business metadata”, which typically comprises the Business definitions, processes, KPIs, and the Business rules and policies, gives the organization the right ammunition to take on the Data Governance challenge.

Key use cases that support and increase the trust in HANA as an analytical platform:

  1. Ability to draw the end-to-end lineage or trace the provenance of a data element from its system of use to its system of record or creation – this could be for a platform that consumes HANA (e.g., BI/analytical tools) or for the platforms that HANA itself consumes (e.g., relational databases)
  2. Ability to trigger governance processes and notify relevant stakeholders of change to a data element in the HANA platform
  3. Push back to the HANA platform indicators that its data elements are more traceable and trustable, as an outcome of the governance processes applied to them

Let us review each of these use cases:

First, the end-to-end data lineage use case. For example, users are keen to know the lineage of data attributes from the BI tools right up to their system of record. This means a data attribute needs to be traced across several data platforms such as analytical tools, data integration tools, and data stores.

This often helps in two ways:

(a) from a Business perspective, tracing the origin of the data attribute, or its provenance, helps the users understand how the value for the attribute is derived and, when there are quality issues, where to go to address them. This is especially important from a regulatory standpoint, as well as for trusting the reports/dashboards the users see.

(b) more from the technical teams’ perspective, the ability to do impact analysis on the consumers of the HANA objects when changes are made to those objects, as well as on the HANA objects themselves when changes are made to the objects that they consume.

Several Data Governance (DG) platforms provide these lineage views, including SAP’s Information Steward, which does this natively. However, in order to generate these lineage views, the metadata from all relevant data platforms has to be loaded into the DG platform.

Let us assume there is a Tableau worksheet built on top of a set of HANA Calculation views. These Calculation views in turn use several levels of analytic and attribute views, which are built on top of several catalog views and tables (present in multiple schemas).

In order to get an end-to-end lineage view of a Tableau report attribute right down to its system of record, which could be a few columns in the catalog tables, we load the metadata from both Tableau and SAP HANA (that is, the information about the various Tableau and SAP HANA objects) into the DG platform. The DG platform then stitches this metadata together to enable the lineage views.
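
Conceptually, the stitching works as in the sketch below: every “derived from” relation harvested from Tableau or HANA becomes an edge, and the lineage of a report attribute is a backward walk along those edges until objects with no further sources, the system of record, are reached. The object names are hypothetical; a DG platform performs this stitching internally.

  # Sketch: lineage as a backward walk over "derived from" edges stitched from
  # Tableau and HANA metadata. All object names are hypothetical.
  lineage_edges = {
      "Tableau.SalesDashboard.Revenue": ["HANA.CV_SALES.REVENUE"],
      "HANA.CV_SALES.REVENUE": ["HANA.AV_ORDERS.NET_AMOUNT"],
      "HANA.AV_ORDERS.NET_AMOUNT": ["SALES.ORDERS.AMOUNT", "SALES.ORDERS.DISCOUNT"],
  }

  def trace_to_source(attribute, edges):
      """Follow 'derived from' edges until objects with no further sources remain."""
      sources, stack = set(), [attribute]
      while stack:
          current = stack.pop()
          upstream = edges.get(current, [])
          if not upstream:
              sources.add(current)     # system of record reached
          stack.extend(upstream)
      return sources

  print(trace_to_source("Tableau.SalesDashboard.Revenue", lineage_edges))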

In the next part of this blog, we will look at use cases 2 & 3 and also review a few high-level approaches to implementing these use cases.