What is metadata and why is it as important as the data itself?
Understanding the importance of metadata and putting the right strategy in place is vital for effective data sharing and reuse via data portals, and for progressing towards data democratization. Our comprehensive blog explains what metadata is, outlines its benefits and shares best practices for your strategy.
Metadata. You’ve probably heard the term before, and may have asked yourself either “what is metadata?” or “why is it as important as data?” This article aims to answer these two questions by explaining more about metadata and demonstrating its importance to overall data management strategies, and in particular how it helps share information through data portals.
Essentially, applying metadata ensures that all data shared on data portals is discoverable, understandable, reusable and interoperable – by both humans and technology/artificial intelligence (AI).
Introducing and defining metadata
To start, what is metadata? A good way to answer this question is by explaining that metadata is a shorthand representation of the data to which it refers. To use an analogy, you can think of metadata as references to data. Think about the last time you searched Google. That search started with the metadata you had in your mind about something you wanted to find. You may have begun with a word, phrase, meme, place name, slang or something else. You were using metadata without even knowing it!
Another way of explaining metadata simply is as data that provides information about other data. Metadata summarizes basic information about data, making it easier to find and work with particular instances of data. It doesn’t tell you what the content is, but instead describes the type of thing that it is. Essentially, it helps explain the data’s provenance – its origin, nature and lineage. That means that if someone has never seen a particular dataset before, they can immediately understand what it covers and how it was created or collected simply by reviewing the metadata. A good analogy is a book – the data is the content of the book itself, while the metadata is the title, format, publication date, author and the book’s subject.
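To make the book analogy concrete, here is a minimal sketch in Python. The dataset, field names and values are all invented for illustration and do not follow any particular standard.

```python
# Illustrative only: a dataset (the "book") and its metadata (the "cover page").
dataset = [
    {"date": "2024-01-01", "city": "Boston", "ridership": 412853},
    {"date": "2024-01-02", "city": "Boston", "ridership": 398120},
]

metadata = {
    "title": "Daily public transport ridership",
    "description": "Daily ridership counts collected from ticket gate sensors.",
    "publisher": "City Transport Department",
    "created": "2024-01-03",
    "update_frequency": "daily",
    "keywords": ["transport", "ridership", "mobility"],
}

# The metadata alone tells a newcomer what the dataset covers and where it comes from,
# without them having to open or interpret the data itself.
print(metadata["title"], "-", metadata["publisher"])
```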
Why is metadata important? We live in a data-centric world, powered by information. Today organizations are creating and collecting increasing volumes of data, from a wide range of systems, software and sensors. All of this data is in different formats, making it difficult to understand factors such as what individual datasets cover, the measurement units they use, how regularly they are updated or who owns them. This all makes it difficult to compare or use datasets with confidence.
Metadata solves this challenge around understanding. That’s why it is as important as the data itself. Without it, understanding a particular dataset can be a matter of guesswork, particularly if you were not directly involved in creating it in the first place. That’s also why metadata is especially important within centralized data portals. By describing data assets, metadata ensures they can be discovered easily and understood by any user in terms of what they cover, giving people the confidence to access and reuse them in their working lives. Good metadata reduces the time and effort wasted on unnecessary downloads, as users can easily find the data they want, the first time.
Metadata is therefore essential to running a successful data portal and building a data-centric organization.
The purpose of metadata
At its heart, metadata should contribute to data being discoverable and reusable in eight different ways (a short illustrative sketch follows this list):
- Metadata should provide a dataset with context. This means explaining what it covers, the themes, the keywords that describe it, how the data has been collected, and details such as the numerical units used (dollars, inches or centimeters, for example).
- Metadata should make a dataset unique. It has to differentiate it from other, similar datasets so that users can choose between them with confidence.
- Metadata should provide the framework for subsequent uses of a dataset. This includes licensing conditions, whether it can be used externally as well as internally, and any organizational rules around who can use the data, and for what purposes.
- Metadata should make you want to reuse a dataset. It has to be comprehensive and compelling, providing clear descriptions that accelerate usage, while outlining the formats it is available in, and suggesting potential ways it can be reused.
- Metadata should make a dataset interoperable. It should follow set internal or external standards so that data can be confidently used or compared with information in other datasets. At a basic level this means standardizing how fields are described, and formats such as dates.
- Metadata should provide reassurance regarding the dataset’s reliability. By including information on the source, how often it is updated, and what it covers, users should be able to ascertain how reliable the data is.
- Metadata should allow the dataset to be found by technology and humans. That means using standardized terms to describe your data (as set out in relevant online thesauruses and guides). This ensures that it can be immediately found through search, whether via your internal portal or, in the case of open data portals, through search engines such as Google. Good metadata also makes it easier for relevant datasets to be found and used by AI, which is essential to training models and algorithms.
- Metadata should ensure the longevity of a dataset. Data can have a long life and can be shared in many places. Therefore, include contact details for the data owner alongside the license, but remember that people move on – rather than listing a person as a contact, give a department or team name and provide a generic, rather than personal, email address.
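As a rough illustration of how these eight elements can come together in a single record, here is a hedged sketch expressed as a Python dictionary. The field names and values are illustrative rather than taken from any particular standard.

```python
# Illustrative metadata record covering the elements described above.
# Field names are examples, not a formal schema.
dataset_metadata = {
    # Context: what the data covers and the units used
    "title": "Monthly household energy consumption",
    "description": "Average household electricity consumption per city, in kilowatt-hours.",
    "keywords": ["energy", "electricity", "households"],
    "units": "kWh",
    # Uniqueness and reliability: source, coverage and update cadence
    "source": "Smart meter readings aggregated by the utilities team",
    "temporal_coverage": "2020-01/2024-12",
    "update_frequency": "monthly",
    # Framework for reuse: licensing and audience
    "license": "CC-BY-4.0",
    "intended_audience": "internal and external reuse",
    # Interoperability: standardized formats
    "date_format": "ISO 8601 (YYYY-MM)",
    "distribution_formats": ["CSV", "JSON"],
    # Longevity: a generic team contact rather than a named individual
    "contact": {"team": "Data Office", "email": "data-office@example.org"},
}
```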
The types of metadata
The possibilities for describing things with metadata seem endless. Metadata schemas can certainly be simple or complex, but they broadly fit into four types:
- Descriptive metadata: this adds details to describe a dataset, such as who created it, its name and what it contains.
- Structural metadata: this specifies how data is classified in terms of format, helping it to be easily found and retrieved.
- Administrative metadata: this includes rights management and licensing information about the data.
- Relationship metadata: this explains how datasets relate to other information, helping monitor data lineage.
Why is metadata important?
The benefits of metadata
Without metadata, sharing information at scale is virtually impossible. While a simple dataset may be understandable without metadata to those who created it, once it is shared more widely, different perspectives mean that misunderstandings are bound to occur, particularly if multiple datasets are being combined or compared. Metadata delivers seven key benefits:
- It enables data discoverability, sharing and reuse on data portals, by allowing users to quickly search for, find and use relevant datasets with confidence.
- It aids better decision-making. As data is better organized and can be easily compared, both humans and AI can make more informed, faster, and more confident business decisions.
- It is at the heart of effective data governance. Metadata is central to agreed data governance standards, delivering compliance with corporate policies.
- It improves data quality as metadata provides information on the quality and reliability of the dataset.
- It delivers time and efficiency savings as users can find and use relevant information more quickly themselves, without requiring support from data teams.
- It increases collaboration internally and externally by enabling people to work together with shared, mutually understood data.
- It ensures compliance. Metadata enables data stored in different systems and databases to be interoperable, providing an up-to-date record of information and any changes made to it.
Metadata models and standards
The W7 Ontological Model of Metadata
We talked about provenance as a term earlier in the blog. In Ram and Liu’s “A Semiotics Framework for Analyzing Data Provenance Research”, the authors define a semantic model of provenance as a seven-part conceptual model. It consists of interconnected elements – what, when, where, who, how, which, and why. These elements appear in several metadata frameworks; in essence, most metadata schemas ask these questions about their data.
Applying this to metadata, we are saying that it gives the following information about the data it models or represents (a short sketch follows this list):
- What – What is the dataset about?
- When – What is the time frame that the dataset covers?
- Where – What is the spatial/geographic coverage of the dataset?
- Who – Who created it (organization, team, individual)?
- How – How can the dataset be used, i.e. what are the licensing conditions?
- Which – Which source generated the dataset (software solution, sensor, machine)?
- Why – Why does the dataset exist? Why was it originally created and shared?
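Applied to a concrete dataset, the seven questions can be captured as a simple record. The sketch below is purely illustrative: the dataset, field names and answers are assumptions for the example rather than a published W7 schema.

```python
# Illustrative W7-style description of a hypothetical air quality dataset.
w7_metadata = {
    "what":  "Hourly air quality measurements (PM2.5, NO2) from monitoring stations",
    "when":  "January 2022 to present, updated hourly",
    "where": "Monitoring stations across a hypothetical metropolitan area",
    "who":   "Environmental Monitoring Team (example organization)",
    "how":   "Open license (CC-BY-4.0); free to reuse with attribution",
    "which": "Fixed roadside sensors feeding an automated ingestion pipeline",
    "why":   "Published to support public health research and policy evaluation",
}

for question, answer in w7_metadata.items():
    print(f"{question:>6}: {answer}")
```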
Metadata standards
While the idea of metadata is simple in principle, the concept of applying metadata to your datasets can seem daunting. Where do you start? How do you describe your data so that it is consistent and can be shared internally and externally? How do you scale your program?
To help, a number of international standards have been formulated and agreed. These include the Dublin Core standard, the W3C Data Catalog Vocabulary (DCAT), and the EU’s INSPIRE framework for spatial data. These are based on agreed ISO standards to ensure interoperability and wide reuse.
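To give a feel for what a standards-based record can look like, here is a minimal, hedged sketch of a DCAT-style dataset description built with the open source rdflib Python library, assuming a recent rdflib release (6+) that ships the DCAT and DCTERMS namespaces. The dataset URI and all the values are invented for illustration.

```python
# A minimal, illustrative DCAT-style record using rdflib.
# The URI and all values are invented for the example; they are not real resources.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = URIRef("https://example.org/datasets/household-energy-consumption")
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Monthly household energy consumption")))
g.add((dataset, DCTERMS.description, Literal("Average consumption per city, in kWh.")))
g.add((dataset, DCTERMS.publisher, Literal("City Energy Office (example)")))
g.add((dataset, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, DCAT.keyword, Literal("energy")))
g.add((dataset, DCAT.keyword, Literal("households")))

print(g.serialize(format="turtle"))
```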
Opendatasoft natively integrates an effective metadata management tool into its solution to boost the discoverability of data at scale within the organization. There are three categories of metadata templates within the platform, each with associated benefits:
- Standard templates: to ensure a personalized level of compliance tailored to an organization’s requirements (taxonomy, sector-specific, or specific vocabulary).
- Interoperability templates (activatable only, non-editable): to ensure compliance with international standards such as DCAT, DCAT-AP, INSPIRE, or Dublin Core.
- Administrative templates (only visible to portal administrators): to ensure good internal governance of metadata.
Metadata ontologies
A lot of the discussions around data quality and data discoverability have revolved around metadata and something called ontologies. Ontologies are descriptions and definitions of relationships and can be used within metadata. Ontologies can include some or all of the following descriptions/information:
- Classes (general things, types of things)
- Instances (individual things)
- Relationships among things
- Properties of things
- Functions, processes, constraints, and rules relating to things.
Ontologies help us to understand the relationships between things. For example, an “Android phone” is a subclass of the broader object class “cell phone”.
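The phone example can be sketched in a few lines of Python, where class inheritance stands in for the ontology’s “subclass of” relationship. The classes are, of course, invented purely for illustration.

```python
# Illustrative only: class inheritance standing in for an ontology's "subclass of" relationship.
class CellPhone:
    """General class of things: cell phones."""

class AndroidPhone(CellPhone):
    """More specific class: every Android phone is also a cell phone."""

my_phone = AndroidPhone()                    # an instance (an individual thing)
print(issubclass(AndroidPhone, CellPhone))   # True: the subclass relationship
print(isinstance(my_phone, CellPhone))       # True: instances inherit class membership
```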
Within metadata schemas, ontologies help ensure that different datasets interoperate within specific standards. They set out how a dataset is organized in terms of the fields covered and the type of information in each field (such as a numeric figure). This is reflected in the metadata, which provides a standard definition for each column header type.
Metadata best practice to enhance data reuse through data portals
Metadata is central to the success of data portals. Following these best practices ensures that metadata supports effective data sharing and reuse through data portals:
- Start by defining a metadata strategy based on overall business objectives around data sharing.
- Collect and understand user requirements and potential use cases. Prioritize adding metadata to the most important datasets to drive usage.
- Involve relevant data owners/users by creating a cross-departmental team that spans the company.
- Establish and agree a metadata classification scheme and create a common vocabulary based as much as possible on recognized standards.
- Educate all data owners about the importance of metadata and communicate standards, practices, templates, and processes.
- Monitor to ensure that metadata standards are being met, evolving them as necessary (a minimal automated check is sketched after this list).
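As one example of the monitoring step, here is a hedged sketch of a simple automated check that flags records missing fields from an agreed classification scheme. The required field names and the sample catalog are assumptions made for illustration, not a prescribed scheme.

```python
# Illustrative check: flag metadata records missing fields required by an
# (assumed) internal classification scheme.
REQUIRED_FIELDS = {"title", "description", "keywords", "license", "contact", "update_frequency"}

def missing_fields(record: dict) -> set:
    """Return the required fields that are absent or empty in a metadata record."""
    return {field for field in REQUIRED_FIELDS if not record.get(field)}

catalog = {
    "household-energy": {
        "title": "Monthly household energy consumption",
        "description": "Average consumption per city, in kWh.",
        "keywords": ["energy"],
        "license": "CC-BY-4.0",
        "contact": "data-office@example.org",
        "update_frequency": "monthly",
    },
    "bike-counters": {"title": "Bike counter readings"},  # incomplete on purpose
}

for dataset_id, record in catalog.items():
    gaps = missing_fields(record)
    status = "OK" if not gaps else "missing: " + ", ".join(sorted(gaps))
    print(f"{dataset_id}: {status}")
```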
The importance of metadata to data sharing
We started this article by asking two questions – What is metadata? and Why is it as important as the data itself? Hopefully you now have a clear answer to both of these points, particularly around the importance of metadata to data democratization and sharing via data portals. Essentially, without metadata, confidently finding the right data assets is difficult, particularly when data is shared beyond expert users, meaning that your metadata strategy has to be comprehensive, standards-based and designed to encourage reuse on your portal.
Find out more about metadata management in our ebook.
Further reading
Other Discussions on Defining Metadata
- Dugas, M., et al. “Memorandum ‘Open Metadata’.” Methods of Information in Medicine 54.4 (2015): 376-378.
- Detken, Kai-Olivier, Dirk Scheuermann, and Bastian Hellmann. “Using Extensible Metadata Definitions to Create a Vendor-Independent SIEM System.” International Conference in Swarm Intelligence. Springer International Publishing, 2015.
- Jiang, Guoqian, et al. “Using Semantic Web Technologies for the Generation of Domain-Specific Templates to Support Clinical Study Metadata Standards.” Journal of Biomedical Semantics 7.1 (2016): 1.
- McCrae, John P., et al. “One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web.” European Semantic Web Conference. Springer International Publishing, 2015.
- Riechert, Mathias, et al. “Developing Definitions of Research Information Metadata as a Wicked Problem? Characterisation and Solution by Argumentation Visualisation.” Program 50.3 (2016).
- Ram, S., and J. Liu. “A Semiotics Framework for Analyzing Data Provenance Research.” Journal of Computing Science and Engineering 2 (2008): 221-248.
- Wilson, Samuel P. “Developing a Metadata Repository for Distributed File Annotation and Sharing.” Diss. Purdue University, 2015.
Other Discussions on Ontologies and Ontology
- Polleres, Axel, and Simon Steyskal. “Semantic Web Standards for Publishing and Integrating Open Data.” Standards and Standardization: Concepts, Methodologies, Tools, and Applications (2015): 1.
- Daraio, Cinzia, et al. “The Advantages of an Ontology-Based Data Management Approach: Openness, Interoperability, and Data Quality.” Scientometrics (2016): 1-15.
- Fukuta, Naoki. “Toward an Agent-Based Framework for Better Access to Open Data by Using Ontology Mappings and their Underlying Semantics.” Advanced Applied Informatics (IIAI-AAI), 2015 IIAI 4th International Congress. IEEE, 2015.
Well-structured metadata is crucial to enabling data to be found and used with confidence within organizations and ecosystems. It is central to effective data sharing and reuse. At Opendatasoft our mission is to accelerate data democratization, ensuring that everyone has access to easily understandable information in their working and home lives. Our data portal solution enables data democratization by centralizing all of an organization’s data assets and making it available to all internal and external users in a seamless, intuitive way, without requiring specialist skills or support.