A Discussion on Asset Naming in Collibra DGC

In Collibra, lineage of an asset is established by linking the assets identified by the same name from different metadata sources. For example, to establish a lineage for a report attribute, the name of the database column referenced in the report and the name of the column in the database need to match. Thus, they are linked, and lineage established primarily based on their names.

However, the possibility of duplication in asset names are plenty, same column name across tables or same attribute name across reports are quite common. Then how do we ensure these assets are uniquely identified?

It is by qualifying the asset names with their parent assets names to a level where can ensure uniqueness. For example, column name can be made unique by doing a 5-part naming of <Host/System name>.<Database name>.<Schema name>.<Table name>.<Column name>.

Similarly a Tableau report attribute can be made unique by <Host>.<Site>.<Project>.<Workbook>.<Worksheet>.<Report attribute>. In some cases, GUIDs generated by the tools are leveraged to make the assets unique.

Collibra brings in restrictions to asset naming by ensuring assets of same name cannot exist in a particular domain. However, asset with same name and type can exists across domains of the same type. For example, a column named ‘Customer Name’ cannot be duplicated in a ‘DB Assets’ domain of type Physical Data Dictionary whereas a ‘Customer Name’ column can exists in another domain ‘More DB Assets’ of same type Physical Data Dictionary. Leveraging this flexibility and structuring the communities and domains in a hierarchy of parent-child asset types, asset with same names can be managed well. This, definitely, is an option. The question now is, how do the two approaches compare?

Let us look at couple of scenarios.

Scenario 1 – DB assets:

  • Approach 1 – Loading DB assets with a fully qualified asset names
  • Approach 2 – Loading DB assets leveraging the Taxonomy structure (Communities/Domains) with the flexibility to have duplicate asset names across domains
  • Sample data
  1. Two hosts/systems – mkt.xyz.com (for marketing dept) and sales.xyz.com (for Sales dept)
  2. Each of the servers have a CRMDB database
  3. There are multiple schemas in each but some of them have the same names. Similarly, common table names across schema and common column names across tables exist

Approach 1 – Using Fully Qualified Names

  • Community Domain Structure

<Any Community>

|— <Any unique domain name eg: Physical Data Assets> – type: Physical Data Dictionary

|—<Any unique domain name eg: Technology Assets> – type: Technology Asset

              Physical Data Assets domain will contain Schemas, Tables and Columns>

              Technology Assets domain will contain systems and databases

  • Naming convention

Column: <Server/System name>.<Database name>.<Schema name>.<Table name>.<Column name>

Table: <Server/System name>.<Database name>.<Schema name>.<Table name>

Schema: <Server/System name>.<Database name>.<Schema name>

Database: <Server/System name>.<Database name>

Server/System: <Server/System name>

  • Handling Column names uniquely identified across two databases, is managed by their qualified names

mkt.xyz.com.CRMDB.Party.Customer.Customer_name

sales.xyz.com.CRMDB.Party.Customer.Customer_name

Approach 2 – Leveraging the Collibra Taxonomy

  • Community Domain Structure

<Any Community>

|— <Sub-community named as Host/System>

|—<Sub-community named as Database>

       |—<Domain named as Schema>

       <This domain of type Physical Data Dictionary contains the schema, tables, columns>

       |—<A Technology Asset domain to gold the system and database assets>

The above nested structure needs to repeat for every Host

The same example now would something like

<Any Community>

        |— <mkt.xyz.com>

                     |—<CRMDB>

                     |—<Party (Domain of type Physical Data Dictionary)>

   – Party ; Asset type: Schema

   – Party.Customer ; Asset type: Table

   – Customer.Customer_name; Asset type: Column

                     |—<Technology Assets (Domain of type Technology Assets)>

   – mkt.xyz.com ; Asset type: Technology Asset or System

   – CRMDB ; Asset type: Database

We will see one more scenario with the Report assets.

Scenario 2 – Report Assets (Tableau for example)

  • Approach 1 – Loading Tableau reporting assets with a fully qualified asset names
  • Approach 2 – Loading Tableau assets leveraging the Taxonomy structure (Communities/Domains) with the flexibility to have duplicate asset names across domains
  • Sample data

Two Tableau hosts/servers – tblmkt.xyz.com (for marketing dept) and tblsales.xyz.com (for Sales dept).

Each of the servers have a RepCatalog Site.

There are multiple workbooks and worksheets in each but some of them have the same names. Similarly, common worksheet names across workbooks and common report attribute names across worksheets exist.

Approach 1 – Using Fully Qualified Names

  • Community Domain Structure

<Any Community>

|— <Any unique domain name eg: BI Data Assets> – type: Report Catalog

|— <Any unique domain name eg: Business Assets> – type: Business Asset

|— <Any unique domain name eg: Logical Data Assets> – type: Logical Data Dictionary

|— <Any unique domain name eg: Physical Data Assets> – type: Physical Data Dictionary

|—<Any unique domain name eg: Technology Assets> – type: Technology Asset

BI Data Assets contains the workbooks, stories, dashboards, worksheets, report attributes.

Business Assets contain the Dimensional attributes, Measures, Dimension, Fact, Filters.

Logical Data Assets domain contain the Data Model, Entity and Attribute that represent the calculation layer on top of the Tableau data sources..

Physical Data Assets domain will contain Schemas, Tables and Columns or Files/Fields that form part of the Tableau data sources.

Technology Assets domain will contain Tableau hosts/servers and databases or file folders.

  • Naming convention

Workbook: <Workbook name>.<Workbook GUID> – The Workbook GUID is unique based on the Server/Site/Project the workbook belongs to and hence any duplicates across these names are handled by suffixing this GUID to the assets.

Dashboard: <Workbook name>.<Dashboard name>.<Workbook GUID>.

Worksheet: <Workbook name>.<Worksheet name>.<Workbook GUID>.

Report attribute: Dashboard: <Workbook name>.<Worksheet name>.<Report attribute>.<Workbook GUID>.

Similarly, other assets will have a qualified name with the workbook and workbook GUID included as prefix and suffix respectively.

Column: <Server/System name>.<Database name>.<Schema name>.<Table name>.<Column name>.

Table: <Server/System name>.<Database name>.<Schema name>.<Table name>.

Schema: <Server/System name>.<Database name>.<Schema name>.

Database: <Server/System name>.<Database name>.

Server/System: <Server/System name>.

  • Handling same report attributes ‘Order Number’ uniquely identified across two worksheets again with same name ‘Order Report’ but different workbooks ‘Marketing Orders’ and ‘Sales Orders’

Marketing Orders.Order Report.Order Number.<Marketing Orders workbook GUID>.

Sales Orders.Order Report.Order Number.<Sales Orders workbook GUID>.

Approach 2 – Leveraging the Collibra Taxonomy

  • Community Domain Structure

<Any Community>

|— <Sub-community named as Tableau hosts/servers >

|—<Sub-community named as Site >

|—<Sub-community named as Project >

|— <Any unique domain name eg: BI Data Assets> – type: Report Catalog

|— <Any unique domain name eg: Business Assets> – type: Business Asset

|— <Any unique domain name eg: Logical Data Assets> – type: Logical Data Dictionary

|— <Any unique domain name eg: Physical Data Assets> – type: Physical Data Dictionary

|—<Any unique domain name eg: Technology Assets> – type: Technology Asset

The above nested structure needs to repeat for every Host/Site/Project

The same example now would something like

<Any Community>

                       |— <tblmkt.xyz.com>

|—< RepCatalog>

        |—< RepProj01>

                 |—<All domains>

         |—< RepProj02>

                  |—<All domains>

                              ….

|— <tblsales.xyz.com>

|—< RepCatalog>

|—< RepProj01>

|—<All domains>

|—< RepProj02>

|—<All domains>

                                    ….

Having seen the two scenarios represented based on both the approaches, let us now review their Benefits and Challenges.

Approach 1 – Using Fully Qualified Names

  • Benefits
  1. The assets are uniquely identified despite duplicate names across any layer
  2. Ability to support any number of assets within two compact domains
  3. Scalable solution that can handle any number of metadata sources across a large Enterprise
  4. The lineage diagrams look cleaner showing linages across several metadata source layers/platforms with good representation of parent – child asset hierarchy using the boxing features
  • Challenges
  1. The fully qualified asset name looks difficult to read and relate including truncation in the user interfaces and lineage diagrams
  2. However, a work-around available as of v5.2 of DGC is to use a OOTB attribute ‘Original Column Name’ to capture the asset name without the prefixes and leverage that in both the asset views as well as traceability/lineage views. There are still issues where we cannot hide the fully qualified name or setup hierarchy views on the ‘Original Column name’, which could be resolved in future versions of DGC or maybe an alternative found for managing duplicate names

Approach 2 – Leveraging the Collibra Taxonomy

  • Benefits
  1. The asset names look shorter and readable
  • Challenges
  1. The number of such nesting structures increases with the number of metadata sources. This could be a challenge to scale when we are looking at many metadata sources across a large Enterprise
  2. The asset name still has to be prefixed to handle duplicates at a domain level for eg: Tables have to prefixed with Schema name to avoid duplicate table names across schemas. IF Schema is created as a sub-community or a domain, then column name has to be prefixed with the Table name to handle similar named columns across Tables
  3. Establishing lineage based on asset names would not suffice. The asset name with the right domain and immediate parent community will also be needed
  4. Some of the features of showing parent-child asset hierarchy/relations through boxing etc won’t work as lineage diagram supports only relations and not the community/domain containership. We need to still create these as separate assets and establish relations
  5. In order to uniquely show a column name, the community and domain has to be overlaid with the asset name in the diagram and that might not be very readable; Also only one overlay is possible at a time

 Summary: Even though couple of approaches exist to handle duplicate asset names in Collibra, the approach of using fully qualified names has advantages till Colllibra comes out with a better and scalable alternative.

About the Author: Anand Govindarajan

Anand Govindarajan

Chief Data Architect
Email: anandg@lucidtechsol.com
Linkedin: http://in.linkedin.com/in/anandgovindarajan/