Dataset
A dataset is a collection of related data points, providing data in an understandable form to be shared and reused internally and externally.
What is a dataset?
A dataset (or data set) is a collection of related data points, stored in the same location, such as a table.
Each data point can be text, numbers, geographical information, or multimedia (such as an image or video).
For example, a simple, tabular dataset created by a retailer might include columns representing variables – the type of clothes, color, and stock levels. The rows then represent the values of each item, as shown in the example below:
| Type | Color | Stock level |
| Shirt | Blue | 4 |
| Socks | Black | 8 |
| Hat | Green | 2 |
When describing data within a dataset, a hierarchy is followed, going from smallest to largest:
- Data point: The smallest element of data that cannot be further subdivided. “Shirt”, “Black” or “2” are all data points in the table above.
- Data object: A collection of grouped, related data points that fit together. For example, “Blue shirt with 4 in stock” is a data object.
- Dataset: All of the data within the table.
Each data point within the dataset can be accessed individually and all of them share the same theme – in the example above all data points relate to clothes inventory.
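As a minimal sketch, the hierarchy can be modeled in plain Python using the inventory table from this entry. The variable and function names are illustrative assumptions and do not belong to any particular tool.

```python
# The inventory table as a dataset: a list of data objects (rows).
dataset = [
    {"type": "Shirt", "color": "Blue", "stock_level": 4},
    {"type": "Socks", "color": "Black", "stock_level": 8},
    {"type": "Hat", "color": "Green", "stock_level": 2},
]

# A data object is one row: a group of related data points.
data_object = dataset[0]

# A data point is a single value inside a data object.
data_point = data_object["color"]


def count_data_points(rows):
    """Count every individual value held in the dataset."""
    return sum(len(row) for row in rows)
```

Here `data_point` holds the single value "Blue", while `count_data_points(dataset)` counts all values across the three rows.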
Different datasets might be related, with these relationships described through data schemas. In our example, a second dataset might include the date and price of a sale of one of the items of clothing in dataset one. The data schema explains how the two datasets interrelate.
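One way to picture this relationship is a small join in Python. This is a sketch under stated assumptions: the sales rows (dates and prices) are invented for illustration, and the one-entry `schema` dictionary is a simplified stand-in for a real data schema, which would normally describe far more than a single linking field.

```python
# Dataset one: the clothes inventory from the example table.
inventory = [
    {"type": "Shirt", "color": "Blue", "stock_level": 4},
    {"type": "Socks", "color": "Black", "stock_level": 8},
    {"type": "Hat", "color": "Green", "stock_level": 2},
]

# Dataset two (hypothetical): dates and prices are invented for illustration.
sales = [
    {"type": "Shirt", "sale_date": "2024-01-15", "price": 19.99},
    {"type": "Hat", "sale_date": "2024-01-16", "price": 12.50},
]

# Simplified stand-in for a schema: it names the field linking both datasets.
schema = {"link_field": "type"}


def join_sales_to_inventory(sales_rows, inventory_rows, key):
    """Attach the matching inventory fields to each sale record."""
    by_key = {row[key]: row for row in inventory_rows}
    joined = []
    for sale in sales_rows:
        item = by_key.get(sale[key])
        if item is not None:  # skip sales with no matching inventory row
            joined.append({**item, **sale})
    return joined


joined = join_sales_to_inventory(sales, inventory, schema["link_field"])
```

Each record in `joined` combines the sale's date and price with the color and stock level of the item that was sold.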
How can you reuse a dataset?
Datasets are intended to be shared, whether internally or externally. They therefore require supporting elements and tools to allow their reuse.
Metadata
Metadata is the information that describes the dataset: license, creation and modification dates, producer, data model used, and so on. It reassures reusers about the reliability and quality of the dataset. Some business sectors require specific metadata to meet interoperability needs.
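A hedged sketch of how such metadata might be represented and checked for completeness in Python follows. The field names mirror the items listed in this entry, while the class name, the helper function, and the sample values are assumptions made for illustration rather than any standard metadata model.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Illustrative metadata record; not a standard metadata model."""
    license: Optional[str] = None
    created: Optional[str] = None    # ISO 8601 date string
    modified: Optional[str] = None   # ISO 8601 date string
    producer: Optional[str] = None
    data_model: Optional[str] = None


def missing_fields(meta: DatasetMetadata) -> List[str]:
    """Return the names of metadata fields that have not been filled in."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]


# Placeholder values, invented for this example.
meta = DatasetMetadata(
    license="CC BY 4.0",
    created="2024-01-01",
    producer="Example Retailer",
)
gaps = missing_fields(meta)
```

A publisher could run a check like `missing_fields` before sharing a dataset, so that reusers are not left guessing about its license or provenance.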
Data visualization
In its raw form, a dataset can be difficult to analyze. That’s why most datasets shared by organizations are accompanied by data visualizations, or at least by tools to create them. These can be simple views like maps or graphs, or more advanced formats like dashboards or data stories.
APIs
APIs are essential for retrieving large datasets in real time, and are generally provided by the producer of the dataset. Once connected, they let you automate the retrieval of information so that it is always up to date.
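Automated retrieval of a large dataset usually happens page by page. The sketch below shows that pattern in Python, assuming a hypothetical `fetch_page(offset, limit)` function that returns a list of records. Real APIs differ in their parameters and pagination rules, so the in-memory `fake_fetch_page` stands in for an actual HTTP call.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


def iter_records(
    fetch_page: Callable[[int, int], List[Record]],
    page_size: int = 100,
) -> Iterator[Record]:
    """Yield every record by requesting one page at a time."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return  # no more data
        yield from page
        if len(page) < page_size:
            return  # a short page means this was the last one
        offset += len(page)


# In-memory stand-in for a real API endpoint (assumption for this example).
_ROWS: List[Record] = [
    {"type": "Shirt", "color": "Blue", "stock_level": 4},
    {"type": "Socks", "color": "Black", "stock_level": 8},
    {"type": "Hat", "color": "Green", "stock_level": 2},
]


def fake_fetch_page(offset: int, limit: int) -> List[Record]:
    return _ROWS[offset:offset + limit]


records = list(iter_records(fake_fetch_page, page_size=2))
```

Passing the fetch function as a parameter keeps the paging logic independent of any specific API, so the same loop can be reused once a real client is plugged in.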
What can datasets be used for?
Datasets are essential to creating value from data. Consequently, the number and size of the datasets that an organization has collected and made available, internally and externally, is a measure of how advanced its data sharing strategy is.
Internal uses to improve efficiency
- by data experts: Datasets can be collected in data warehouses or data lakes and then analyzed and queried using business intelligence tools
- through self-service: Datasets can be made available through a central data catalog to everyone within the organization, enabling them to be used for better decision-making and improved operations
- for training AI: Artificial intelligence algorithms learn by understanding the relationships between data points within datasets, allowing them to make more informed decisions. Training them therefore requires access to very large volumes of data, from one or more datasets.
External uses to increase transparency
- through open data: Public open data portals typically contain a large number of datasets, grouped into specific areas or themes. For example, UK Power Networks’ open data portal contains 39 datasets. These vary in size – one contains a complete list of its electricity distribution pylons (containing over 47,000 data points), and another is a list of all local authorities in its distribution area (116 records).
- for hackathons/competitions: Sharing datasets with the wider community not only increases transparency but also creates opportunities for innovation. Releasing specific datasets for use in hackathons or competitions invites new ideas from inside and outside the organization.
External uses to create new services
- with a specific ecosystem: Datasets can be shared externally, either with a specific partner or with a wider but closed ecosystem. Schneider Electric’s Exchange data marketplace shares 195 energy-related datasets with 540 users from 200 companies, which increases value for its partners and enables the company to launch new data services.