Data lineage

Data lineage (or data traceability) provides full visibility of the data lifecycle inside and outside the organization, including any changes made.

What is data lineage?

As organizations become increasingly data-driven, they have to trust the data that they are working with. Data lineage (also known as data traceability) aims to build this trust by ensuring that there is a full picture of where particular data has come from, how it has been changed, processed, or enriched, where it has been used, who has used it, and where it will go in the future.

Companies need to be able to trace data upstream to its original source and downstream to the end of its lifecycle in order to ensure quality, good governance and regulatory compliance. This helps them see how data is being reused, both inside and outside the organization.

Data lineage covers the full data lifecycle:

  • The origins of the data, and whether it is internal or external
  • The level of sensitivity of the data (such as if it contains personal customer information)
  • The systems it has flowed through
  • Any changes that have been made, including enrichment and standardization to meet governance requirements
  • Who it is shared with (internally and externally) and how this is used (such as for business intelligence, and within operational systems)

Data lineage solutions provide a visual representation of the data lifecycle, enabling data administrators to drill down into how it has been created and then transformed/moved and used throughout the organization and wider external ecosystem.
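
As an illustration only, the lifecycle elements listed above could be captured in a simple record like the one sketched below. This is not the schema of any particular lineage product: the class names, field names and example values (`LineageRecord`, `LineageEvent`, `customer_orders` and so on) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One step in a dataset's lifecycle (hypothetical schema)."""
    system: str        # system the data flowed through
    action: str        # e.g. "created", "standardized", "shared"
    actor: str         # person or service that performed the action
    at: datetime       # when the action happened
    target: str = ""   # who or what the data was shared with, if any


@dataclass
class LineageRecord:
    """Lineage metadata for one dataset (hypothetical schema)."""
    dataset: str
    origin: str                    # where the data was created
    is_external: bool              # internal or external source
    contains_personal_data: bool   # sensitivity flag
    events: list[LineageEvent] = field(default_factory=list)

    def log(self, system, action, actor, target="", at=None):
        """Append a lifecycle event, timestamped now unless 'at' is given."""
        when = at or datetime.now(timezone.utc)
        self.events.append(LineageEvent(system, action, actor, when, target))

    def systems(self):
        """Systems the data has flowed through, in first-seen order."""
        seen = []
        for event in self.events:
            if event.system not in seen:
                seen.append(event.system)
        return seen

    def shared_with(self):
        """Recipients of every 'shared' event."""
        return [e.target for e in self.events if e.action == "shared"]


# Example usage with made-up values.
record = LineageRecord(
    dataset="customer_orders",
    origin="crm",
    is_external=False,
    contains_personal_data=True,
)
record.log("crm", "created", "sales-team")
record.log("warehouse", "standardized", "etl-job")
record.log("warehouse", "shared", "etl-job", target="bi-dashboard")
```

A lineage tool would typically render records of this kind as a visual graph rather than expose them directly; the sketch only shows the kind of information that sits behind such a view.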

What is the difference between data lineage and data traceability?

The terms data lineage and data traceability are often used interchangeably as there is no real difference between them. They both describe the same process of understanding the data lifecycle and providing full visibility across it.

A third term – data provenance – refers to the origin of the data, i.e. how and where it was created.

Data lineage/data traceability can be broken down into two areas:

  • Business lineage: Looks at how data has been changed from a business perspective. It provides a simplified view of where data comes from, the policies/processes/standards that were applied to it and how it has been used. This gives business users trust in the data when using it, for example, in decision making.
  • Technical lineage: A more in-depth view of how data moves and transforms between systems, tables and columns, which is normally only understandable by technical/IT users. It covers areas such as the applications data flows through, technical transformations, lookups and staging tables. While too complex for business users, it is vital to ensuring technical data quality and debugging errors in the data sharing process.
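
To make the distinction between the two views concrete, the sketch below starts from column-level (technical) lineage and collapses it into a simplified system-to-system (business) view. It is a minimal example under stated assumptions: the `system.table.column` naming scheme, the edge list and the function names are all hypothetical, and real lineage tools derive this information from metadata rather than a hand-written list.

```python
# Column-level (technical) lineage as (source column, target column) pairs.
# All system, table and column names are hypothetical.
COLUMN_EDGES = [
    ("crm.customers.email", "staging.customers_clean.email"),
    ("crm.customers.country", "staging.customers_clean.country_code"),
    ("staging.customers_clean.email", "warehouse.dim_customer.email"),
    ("staging.customers_clean.country_code", "warehouse.dim_customer.country_code"),
    ("warehouse.dim_customer.country_code", "bi.sales_by_country.country_code"),
]


def system_of(column):
    """Return the system part of a 'system.table.column' identifier."""
    return column.split(".", 1)[0]


def business_view(column_edges):
    """Collapse column-level edges into unique system-to-system flows."""
    flows = []
    for source, target in column_edges:
        flow = (system_of(source), system_of(target))
        # Skip moves within one system and flows already recorded.
        if flow[0] != flow[1] and flow not in flows:
            flows.append(flow)
    return flows


flows = business_view(COLUMN_EDGES)
for source_system, target_system in flows:
    print(f"{source_system} -> {target_system}")
```

The five column-level edges reduce to three system-level flows, which is the kind of simplification that makes lineage readable for business users while the detailed edges remain available for debugging.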

Why is data lineage important?

Data lineage is vital to delivering confidence in the data that is used to power a business. Strong data lineage allows organizations to:

  • Trust that the data being used for business operations is accurate and high quality, so that decisions based on it are sound. As companies increasingly introduce advanced analytics and AI that automate decision-making, traceability becomes even more critical.
  • Ensure data governance by tracking and monitoring how data is used (and by whom).
  • Support compliance by being able to prove that data meets both organizational policies and external privacy regulations, such as GDPR. This makes data lineage a key part of risk management when it comes to data.
  • Protect data by understanding the systems it flows through and who has access to it.
  • Enable debugging by highlighting errors that potentially impact data use and flow.
  • Manage technical migrations, such as to the cloud, by modeling data flows and the impact of any technology/system changes on downstream solutions.
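
The migration point above is often described as impact analysis: given a system that is about to change, find everything downstream of it. The sketch below shows one possible way to do this with a breadth-first traversal over a lineage graph. The graph, the system names and the `impacted_by` function are assumptions for illustration, not part of any specific product.

```python
from collections import deque

# Hypothetical lineage graph: each system maps to the systems it feeds.
DOWNSTREAM = {
    "crm": ["staging"],
    "staging": ["warehouse"],
    "warehouse": ["bi_dashboard", "ml_scoring"],
    "bi_dashboard": [],
    "ml_scoring": ["marketing_app"],
    "marketing_app": [],
}


def impacted_by(system, downstream=DOWNSTREAM):
    """List every system downstream of 'system', nearest first."""
    seen = set()
    order = []
    queue = deque([system])
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, []):
            if nxt not in seen:   # guard against cycles and repeats
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Which systems need checking if the warehouse is migrated?
impacted = impacted_by("warehouse")
print(impacted)
```

Running the same traversal over a graph with reversed edges would trace data upstream to its sources, which is the direction typically used when debugging a data quality error.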

What are the challenges to data lineage?

Organizations generate enormous amounts of data, and increasingly add to this with information from partners and their wider ecosystems. This brings five key challenges to data lineage:

  1. Volume and range: the number of different data sources continues to grow as organizations digitize and more and more data-producing devices (such as IoT sensors) are added to their infrastructure. This means that the amount of data an organization has to manage is growing rapidly, and all of it needs to be fully traceable across its lifecycle.
  2. Speed: data now moves at a much greater velocity within organizations. Whereas in the past weekly or monthly reporting was standard, users now need access to trusted data on a real-time basis.
  3. Compliance: regulators (and consumers) are increasingly focused on ensuring that information, particularly personal data, is used and protected in ways that meet legislation such as the CCPA and GDPR. This adds a further level of importance to traceability to provide an audit trail to regulators as required.
  4. Complexity: all of these factors mean that organizations have a much more complex data environment to manage, again making traceability key.
  5. Collaboration: monitoring data across the organization and, more importantly, with external partners requires open collaboration between departments and organizations to break down silos.
