Glossary
Data Lake
A data lake is a large-scale, centralized repository which stores and processes structured, semistructured, and unstructured data in its raw format.
What is a Data Lake?
A data lake is a centralized repository which stores and processes large amounts of structured, semistructured, and unstructured data in its raw/native format. A data lake uses a flat architecture to store data in its original form, primarily in files or object storage. That provides greater flexibility around data management, storage and usage as companies are not constrained in terms of the size, type or structure of data within their data lake.
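As a rough illustration of the flat architecture described above, the sketch below uses a hypothetical in-memory class, `FlatObjectStore`, as a stand-in for real object storage. The class, its methods and the sample keys are assumptions made for this example, not the API of any actual product. It shows that objects of any type sit side by side under plain keys, with no schema applied at write time.

```python
from dataclasses import dataclass, field


@dataclass
class FlatObjectStore:
    """Hypothetical in-memory stand-in for an object store (not a real API)."""

    objects: dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, payload: bytes) -> None:
        # The payload is kept byte-for-byte; nothing is parsed or validated.
        self.objects[key] = payload

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Folders" are only a naming convention on top of a flat key space.
        return sorted(k for k in self.objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("sales/2024/orders.csv", b"order_id,amount\n1,19.99\n")       # structured
store.put("web/events.json", b'{"event": "click", "page": "/home"}')    # semistructured
store.put("support/email-0001.txt", b"Hello, my order has not arrived.")  # unstructured

print(store.list_keys())
```

Because the store never inspects the payload, adding a new data type requires no change to the storage layer.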
What is a Data Lake used for?
A data lake can contain all of an organization’s data including:
- Structured data, from transactional systems and relational databases
- Semi-structured data, such as XML files or webpages
- Unstructured data, such as emails, images, videos or PDFs
That makes a data lake ideal for carrying out big data analysis, with data scientists able to analyze massive amounts of information of all types. The raw data within a data lake is also ideal for training AI and machine learning models and for running complex, predictive analysis based on huge volumes of data.
How does a Data Lake differ from a Data Warehouse?
Both data lakes and data warehouses provide a single, centralized repository to store an organization’s data. However, in a data warehouse, data is processed and standardized before being added so that it fits the defined schema, model and use cases. As it is based on a relational database architecture, data can only be structured or semi-structured.
By contrast, a data lake stores all types of data in its raw form. The structure or schema is only defined when the data is read (schema-on-read). This widens the range of analysis that can be carried out, enabling extremely complex analysis. However, performing this analysis requires deeper technical skills than working with a data warehouse, and its complexity means that performance may be lower.
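Schema-on-read can be sketched in a few lines. In this minimal example, the sample records, the `read_with_schema` helper and the field names are all hypothetical. The raw lines are stored untouched, and each consumer supplies its own schema (a mapping of field name to type converter) at read time.

```python
import json

# Raw records as they might sit in the lake: untyped, with varying fields.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "amount": "5", "note": "gift"}',
]


def read_with_schema(raw_lines, schema):
    """Apply `schema` (field name -> converter) while reading, not while storing."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Two consumers read the same raw data through two different schemas.
billing_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
notes_view = list(read_with_schema(RAW_EVENTS, {"user": str, "note": str}))

print(billing_view)
print(notes_view)
```

The same stored lines serve both views. In a data warehouse, by contrast, a single schema would have been fixed before the data was loaded.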
Because they are good at different things, many organizations use both a data warehouse and a data lake, either separately or combined in a hybrid data lakehouse. The data warehouse feeds business intelligence and supports better decision-making, while the data lake is used for more advanced big data analytics and AI/machine learning.
How does a Data Lake work?
A data lake is typically deployed in a Hadoop cluster or other big data environment. Data is added from all sources following an ELT (extract, load, transform) model. This means data is loaded in its raw form, and is only transformed and processed when data scientists want to use it. This makes the load stage much faster. To achieve this, data experts use a range of specific tools for data ingestion, resource allocation, content indexing, reporting, visualization, migration, and analysis.
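The ELT ordering described above can be illustrated with a small sketch. It uses a temporary local directory as a stand-in for the lake; the function names `load_raw` and `transform_on_demand` and the sample CSV are assumptions made for this example. The load step only copies bytes, and parsing happens later, when a result is actually requested.

```python
import csv
import io
import tempfile
from pathlib import Path


def load_raw(lake_dir: Path, name: str, payload: str) -> Path:
    """Load step: write the payload unchanged, with no parsing or cleaning."""
    target = lake_dir / name
    target.write_text(payload, encoding="utf-8")
    return target


def transform_on_demand(path: Path) -> float:
    """Transform step: runs only when an analyst needs the result."""
    rows = csv.DictReader(io.StringIO(path.read_text(encoding="utf-8")))
    return sum(float(row["amount"]) for row in rows)


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)  # stand-in for the data lake's storage
    raw_file = load_raw(lake, "orders.csv", "order_id,amount\n1,20.50\n2,4.50\n")
    total = transform_on_demand(raw_file)

print(total)
```

Since `load_raw` does no work beyond writing the file, the load stage stays fast regardless of how complex the later transformation is.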
What are the advantages and disadvantages of a Data Lake?
What are the advantages of a Data Lake?
- A data lake is much more flexible than a data warehouse, meaning that data scientists can easily run analysis without having to follow fixed models or schema
- As a data lake is simpler to create and run, and often uses open source technology, its costs are relatively lower than those of a data warehouse
- Data lakes enable businesses to exploit their growing volumes of unstructured data
- As data is stored in its raw form, data lakes are ideal for advanced analytics and AI
What are the disadvantages of a Data Lake?
- Data is simply loaded into a data lake without any cleansing or standardization. That means that potentially inaccurate, incomplete or unreliable data may be used unknowingly in analysis
- Companies need skilled data scientists to best use their data lakes. That increases costs and limits who can benefit from the data lake – data is not democratized
- As data is not defined by specific use cases, data lakes can be under-utilized and serve solely as a dumping ground for data, reducing their ROI. This has led to some data lake implementations being nicknamed “data swamps”
- As they combine a range of different tools and technologies, managing data lakes can be complex and time-consuming
- Given their size and the complexity of datasets, data lakes can suffer from issues around reliability, performance, governance and security
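One common way to reduce the risk of analyzing unreliable raw data is to profile it before use. The sketch below is a minimal, hypothetical check; the `profile_records` helper, the field names and the sample records are assumptions for illustration. It counts records that lack required fields or hold empty values.

```python
def profile_records(records, required_fields):
    """Count records that are missing a required field or hold an empty value."""
    report = {"total": 0, "incomplete": 0}
    for record in records:
        report["total"] += 1
        if any(record.get(name) in (None, "") for name in required_fields):
            report["incomplete"] += 1
    return report


# Raw records as loaded into the lake, with no cleansing applied.
raw_records = [
    {"customer": "a1", "email": "a1@example.com"},
    {"customer": "b2", "email": ""},
    {"customer": "c3"},
]

report = profile_records(raw_records, ["customer", "email"])
print(report)
```

A check like this does not clean the data; it only makes quality problems visible before the data feeds an analysis.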
Learn more about the differences between data lakes and data warehouses and how to unlock value from your data in this Opendatasoft blog.