Magdalena Jackiewicz
Editorial Expert
Reviewed by a tech expert

The ultimate data glossary: 110 must-know terms

#Sales

As the data landscape continues to evolve fast, it's crucial that the teams working with it learn to speak a common language. While internal teams may share some vocabulary, plenty of data-related terms get thrown around in meetings and in the public domain, often leaving people confused and struggling to keep up.

To bridge this gap, we've compiled a holistic data terms glossary. It provides definitions of the most widely used data terms, along with helpful resources and links to our data blog posts for those who want to dive deeper. This glossary is designed to serve as an essential reference point, ensuring that everyone on your team can communicate effectively and stay aligned, even as new trends, technologies and concepts emerge in the rapidly evolving world of data.

From foundational terms like ‘data analytics’ and ‘data visualization’ to cutting-edge concepts like ‘machine learning’ and ‘natural language processing,’ this glossary covers the full spectrum of the data landscape. Whether you're a seasoned data professional or just starting your journey, this resource will equip you with the knowledge needed to navigate the complexities of the data ecosystem with confidence.

Let's dive in and explore this comprehensive data glossary – your one-stop shop for organizing your data knowledge and mastering the language of data.

The data landscape: a comprehensive glossary 

  1. Advanced analytics – the next level of data analytics. It involves the application of more sophisticated data analysis techniques like machine learning, data mining or natural language processing, to not only extract insights and discover patterns, but also make predictions. It is usually carried out on large and complex datasets and applied when traditional analytical methods cannot extract the required insights. Advanced analytics typically focuses on explaining why something happened and supports proactive behavior. In contrast, traditional data analytics focuses on what happened, and, as such, supports reactive behavior.
  2. Algorithm – in the context of data, an algorithm refers to a set of well-defined rules or instructions that are applied to data to perform a specific task or solve a particular problem. Algorithms are used extensively in data processing, analysis, and manipulation.
  3. API (Application Programming Interface) – a set of rules that allow different software applications to communicate and interact with each other. It defines how different software components should interact, specifying the types of requests that can be made, how to make them, the data formats that should be used, and the conventions to follow. APIs act as intermediaries or messengers that enable different software systems to talk to each other and share data or functionality. They provide a standardized way for applications to access and utilize the services or resources of another application or system. A short request example appears in the code sketches after this glossary.
  4. Artificial Intelligence (AI) – an extensive field that encompasses the theory and development of computer systems capable of performing tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. At its core, AI aims to create intelligent machines that can simulate human cognitive functions, learn from experience, adapt to new inputs, and perform tasks autonomously. AI systems are designed to analyze data, recognize patterns, make decisions, and take actions in response to real-world situations. AI includes machine learning, deep learning, natural language processing, and expert systems that emulate the decision-making ability of human experts in specific domains, such as medical diagnosis, financial analysis or logistical operations.
  5. Big Data – refers to extremely large, complex datasets that are difficult to process using traditional methods due to their size and complexity. These datasets also tend to grow with time.
  6. Batch processing – a data processing technique, where data is processed in batches, as a single run or operation, rather than individually or in real-time. It is a type of processing mode where a set of inputs is collected, and then processed together as a group. In batch processing, data or instructions are accumulated and stored for a period of time, and then processed together at a scheduled time or when the entire batch is ready. This approach is often used when dealing with large volumes of data or when real-time processing is not necessary or feasible.
  7. Business analytics (BA) – the practice of using data, statistical analysis, and quantitative methods to derive insights and make informed business decisions. It involves the systematic analysis of an organization's data to understand past performance, identify current trends, and predict future outcomes to improve operational efficiency and strategic planning.
  8. Business intelligence (BI) – a technology-driven process that involves collecting, managing, and analyzing data to provide actionable insights for data-driven decision-making, including business analytics, data visualization, reporting, and dashboarding. We’ve explored the differences between Business Intelligence vs Business Analytics in a dedicated article.
  9. Chief data officer (CDO) – a senior executive responsible for overseeing and managing an organization's data-related strategies, policies, and initiatives. The primary responsibilities of a CDO typically include data governance, data strategy, data analytics, infrastructure and personnel working with the data.
  10. Clustering – a technique in unsupervised machine learning and data mining that involves grouping a set of data points or objects into subsets (called clusters) in such a way that data points within the same cluster are more similar to each other than those in other clusters. The goal is to maximize intra-cluster similarity and minimize inter-cluster similarity. A minimal k-means example appears in the code sketches after this glossary.
  11. Customer Data Platform (CDP) – software that consolidates, integrates, and structures customer data from various touchpoints into a unified database, enabling marketing teams to gain relevant customer insights for targeted campaigns.
  12. Data aggregation – the process of gathering and combining data from multiple sources into a summarized or consolidated form for analysis or reporting purposes. It involves collecting data from various sources, such as databases, files, or external systems, and then performing operations to group, summarize, or transform the data into a more meaningful and manageable format. See the aggregation sketch after this glossary.
  13. Data analyst – a specialist in processing and analyzing historical data to extract insights for decision-making. They collect, clean, and organize data, using statistical tools to identify patterns and trends. Their responsibilities include creating visualizations, maintaining databases, and communicating findings to stakeholders. Data analysts differ from data scientists and data engineers in scope and complexity. Data scientists typically have more advanced skills in machine learning and predictive modeling, often developing complex algorithms for future predictions. Data engineers focus on designing and maintaining the technical infrastructure for data collection and storage, ensuring data accessibility for both analysts and scientists.
  14. Data analytics – the process of analyzing data to extract insights and patterns that can inform decision-making and drive better business outcomes. It involves the use of statistical, mathematical, and computational techniques to transform raw data into meaningful and actionable information.
  15. Data analytics consulting – a professional service provided by data professionals oriented at helping organizations leverage data and analytics to drive business value and make informed decisions. Data analytics consultancies specialize in various aspects of data analytics, including data management, data modeling, statistical analysis, data visualization, and predictive modeling, while also covering the IT infrastructure aspect of data processing.
  16. Data architecture – defines how information flows within an organization, governing the management and control of physical and logical data assets to translate business needs into data and system requirements.
  17. Data augmentation – a technique to artificially increase the amount of training data from existing data without collecting new data.
  18. Data capture – the process of collecting information and converting it into a format that can be processed by a computer.
  19. Data center – a large, centralized facility that houses an organization's shared IT infrastructure, such as servers, databases, and networking equipment.
  20. Data clean room – a secure environment where personally identifiable information (PII) is anonymized, processed, and stored for joint analysis based on defined guidelines and restrictions.
  21. Data cleansing – the process of preparing data for analysis by amending or removing incorrect, corrupted, improperly formatted, duplicated, irrelevant, or incomplete data. See the cleansing sketch after this glossary.
  22. Data cloud – a cloud-based platform or ecosystem that provides a centralized and integrated environment for storing, managing, processing, and analyzing data. It combines the scalability, flexibility, and cost-effectiveness of cloud computing with the capabilities for data management, analytics, and business intelligence.
  23. Data collaboration – the practice of using data to enhance customer, partner, and go-to-market relationships and create new value through data sharing.
  24. Data enrichment – the process of enhancing, appending, refining, and improving collected data with relevant third-party data.
  25. Data extraction – the process of collecting or retrieving data from various sources for further data processing, storage, or analysis.
  26. Data engineer – a professional who designs, builds, and maintains the infrastructure and architecture for data generation, collection, storage, and access. Their primary role is to create robust, scalable data systems and pipelines that enable efficient data flow within an organization. Key responsibilities include developing databases, data warehouses, and data lakes, implementing ETL/ELT processes, ensuring data quality and security, and optimizing data retrieval and processing. Data engineers work with various technologies, from traditional relational databases to modern NoSQL and distributed systems. Compared to data analysts and data scientists, data engineers focus more on the underlying infrastructure than on data interpretation or advanced analytics.
  27. Data engineering – the process of designing, building, and maintaining the infrastructure and systems required to collect, store, process, and make large datasets available for data analysis and business intelligence purposes.
  28. Data federation – a data management approach that enables organizations to access and query data from multiple disparate sources as if it were stored in a single, unified database. This method provides a virtual, consolidated view of data without physically moving or copying it from its original locations. Data federation keeps data in its source systems but allows real-time access through a federated layer, presenting users with a unified view regardless of the data's actual location or format. This approach preserves data sovereignty, offers flexibility in adding or removing data sources, and can be cost-effective by reducing the need for centralized data storage. It's particularly useful when data is distributed across multiple systems, real-time access is crucial, or there are barriers to centralizing data.
  29. Data governance – the overall framework that ensures an organization's data is effectively managed and leveraged to support its business objectives. It involves establishing policies, standards, and processes to define how data is acquired, stored, secured, and used across the enterprise. The goal of data governance is to improve data quality, accessibility, and reliability, empowering the organization to make more informed, data-driven decisions.
  30. Data ingestion – the process of transporting data from multiple sources into a centralized database, such as a data warehouse, for further analysis.
  31. Data integration – the process of consolidating data from different sources to achieve a single, unified view.
  32. Data integrity – the overall accuracy, consistency, and trustworthiness of data throughout its lifecycle.
  33. Data intelligence – the process of analyzing various forms of data to improve a company's services or investments.
  34. Data lake – a centralized storage repository that holds large amounts of data in its natural/raw format.
  35. Data lineage – the process of understanding, recording, and visualizing the origin and flow of data.
  36. Data literacy – the ability to understand, interpret, and communicate with data effectively. It encompasses a set of skills and competencies that enable individuals to analyze data and make informed decisions on the basis of those analyses.
  37. Data loading – the step in the ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) process where data is moved into a designated data warehouse or other target system; it is the final step in ETL, whereas in ELT it takes place before transformation.
  38. Data management – the practice of organizing, storing, securing, and maintaining an organization's data assets throughout their entire lifecycle. It involves data governance, data architecture, data storage, data quality, data security and privacy, data integration and interoperability, data lifecycle management.
  39. Data manipulation – the process of organizing and structuring data to make it more readable and usable.
  40. Data mart – a subset of a data warehouse that is focused on a specific business function or department within an organization. It is designed to provide a more targeted view of the data, as opposed to the broader, enterprise-wide perspective of a data warehouse. We’ve written an article that compares data lakes vs data marts vs data warehouses.
  41. Data masking – a technique used to protect sensitive or confidential data by replacing original data with fictitious but realistic-looking data. The primary purpose of data masking is to safeguard sensitive information while still allowing the data to be used for various purposes, such as testing, development, or analytics, without exposing the real data. See the masking sketch after this glossary.
  42. Data mesh – a decentralized data architecture that ensures data is highly available, discoverable, secure, and interoperable across an organization, unlike a centralized data warehouse or data lake approach.
  43. Data migration – the process of transferring data between different file formats, databases, or storage systems.
  44. Data mining – the process of discovering anomalies, patterns, and correlations within large volumes of data to solve problems through data analysis.
  45. Data modeling – the process of visualizing and representing data elements and the relationships between them.
  46. Data monetization – the process of generating revenue or deriving business value from data assets. It involves transforming data into a tangible business resource that can be leveraged to create new revenue streams, improve operational efficiency, or enhance decision-making.
  47. Data obfuscation – making data less understandable or harder to interpret, typically for the purpose of protecting sensitive information or intellectual property. Common techniques used for data obfuscation include encryption, anonymization, masking and randomization. 
  48. DataOps – the practice of operationalizing data management, used by analytics and data teams to improve the quality and reduce the cycle time of data analytics.
  49. Data orchestration – the process of gathering, combining, and organizing data to make it available for data analysis tools.
  50. Data owner – an individual or entity within an organization that has the primary responsibility for a specific set of data assets. Data owners typically take on the role of data stewards and are responsible for the governance, privacy and security of the data, as well as determining relevant access and usage policies.
  51. Data pipeline – the series of steps required to move data from one system (source) to another (destination).
  52. Data platform – an integrated set of technologies used to collect and manage data, including hardware and software tools for reporting and business insights.
  53. Data privacy – a branch of data security focused on the proper handling of data, including consent, notice, and regulatory obligations.
  54. Data replication – the process of storing data in multiple locations to improve availability, reliability, redundancy, and accessibility.
  55. Data-to-insight – the process of transforming raw data into meaningful and actionable information that can inform decision-making. This process is foundational in data analytics, business intelligence, and software development contexts, where the ability to rapidly and accurately interpret data impacts strategic planning, operational efficiency, and market competitiveness.
  56. Data quality – a measure of a data set's reliability in serving an organization's specific needs, based on factors like accuracy, completeness, consistency, reliability, and timeliness.
  57. Data querying – the process of retrieving and extracting specific data from a database or other data sources using a query language. The primary purpose of data querying is to obtain the necessary information to answer questions, support decision-making, or perform data analysis.
  58. Data science – a multidisciplinary approach to extracting actionable insights from large and growing volumes of data.
  59. Data scientist – a highly skilled professional who combines expertise in statistics, mathematics, and computer science to extract deep insights from complex data sets. They develop sophisticated models and algorithms to solve intricate problems, predict trends, and uncover hidden patterns using advanced techniques like machine learning and artificial intelligence. Data scientists' responsibilities include designing experiments, building hypotheses, developing statistical models, and working across the entire data lifecycle. They often create new methodologies and tools for data analysis, pushing the boundaries of what's possible with data. While data analysts interpret existing data for insights and data engineers ensure data availability and accessibility, data scientists create new algorithms for predictions. In smaller organizations, these roles may overlap, but they become more specialized in larger companies with complex data needs.
  60. Data scrubbing – also known as data cleansing, is the process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data from a dataset. The primary goal of data scrubbing is to improve the overall quality and reliability of the data, which is crucial for effective data analysis and decision-making.
  61. Data security – set of practices and processes used to protect digital information from unauthorized access, theft, modification, or destruction. The main objectives of data security are to ensure the confidentiality, integrity, and availability of data.
  62. Data sharing – the ability to distribute the same data resources to multiple users or applications while maintaining data fidelity across all entities consuming the data.
  63. Data silo – a collection of information within an organization that is scattered, not integrated with other data, and/or isolated, and generally not accessible by other parts of the organization.
  64. Data source – the origin of a set of information, which can be a location, system, database, document, flat file, or any other readable digital format.
  65. Data sovereignty – the concept that data is subject to the laws and governance structures of the country or jurisdiction in which it is located, stored, or processed. It is the idea that data is an asset that is owned and controlled by the country or organization that collects and maintains it.
  66. Data stack – a suite of tools used for data loading, warehousing, transforming, analyzing, and enabling business intelligence.
  67. Data steward – a role or individual responsible for overseeing and managing the data assets within an organization. Data stewards play a crucial role in ensuring the quality, integrity, security, and appropriate use of data across the enterprise.
  68. Data strategy – a comprehensive plan that outlines an organization's approach to managing and leveraging its data assets to achieve its business objectives. It includes data governance, data architecture, data management, data analytics, data infrastructure, data culture.
  69. Data visualization – the graphical representation of information and data. It involves the creation of charts, graphs, plots, and other visual elements to communicate complex information in a clear and efficient manner.
  70. Data warehouse – a centralized repository that stores structured, filtered data that has been processed for a specific purpose. Examples include BigQuery, Redshift, and Snowflake.
  71. Deep Learning – a subfield of machine learning that uses artificial neural networks with multiple layers to automate the process of feature extraction and pattern recognition from data. It is a powerful technique that has led to significant advancements in various domains, such as computer vision, natural language processing, speech recognition, and predictive analytics.
  72. Descriptive analytics – a type of data analysis that focuses on summarizing and describing the characteristics of a dataset. The primary goal of descriptive analytics is to provide insights into what has happened or what is currently happening, without making any predictions or inferences about the future.
  73. ELT (Extract, Load, Transform) – a data integration process where data is extracted and loaded into the warehouse directly, without any transformations. The target system is then used to perform the data transformations.
  74. Enterprise AI – a strategic and systematic adoption of AI technologies within large organizations. It goes beyond isolated AI projects, aiming to integrate AI capabilities seamlessly into existing business processes, systems, and workflows to enhance overall performance and decision-making across the enterprise. The focus of enterprise AI is on scalability, where AI solutions are developed and deployed to be scaled across the organization. Enterprises also establish robust governance frameworks to manage the implementation, monitoring, and optimization of AI systems, ensuring alignment with organizational objectives and compliance with relevant regulations.
  75. ETL (Extract, Transform, Load) – a data integration process where data is extracted and transformed before being loaded into the warehouse. See the ETL sketch after this glossary.
  76. Forecasting – the process of using statistical models, machine learning algorithms, or other techniques to predict future values or outcomes based on historical data and patterns.
  77. GDPR (General Data Protection Regulation) – a comprehensive data privacy and security law that was adopted by the European Union (EU) in 2016 and became enforceable in 2018. The GDPR aims to protect the personal data and privacy of individuals within the EU and the European Economic Area (EEA).
  78. Hadoop – an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It was created by Doug Cutting and Mike Cafarella, developed extensively at Yahoo!, and is now maintained by the Apache Software Foundation.
  79. IaaS (Infrastructure as a Service) – a type of cloud computing service model where the cloud provider offers virtualized computing resources over the internet.
  80. IoT (Internet of Things) – the network of interconnected physical devices, vehicles, home appliances, and other "things" that are embedded with sensors, software, and network connectivity, enabling them to collect and exchange data.
  81. Large Language Model (LLM) – a type of artificial intelligence system that is trained on a vast amount of text data, allowing it to understand and generate human-like language. Common examples include ChatGPT, Claude, Gemini and LLaMA.
  82. Machine Learning (ML) – a field of artificial intelligence that enables computers and systems to learn and improve from experience, without being explicitly programmed. It involves the development of algorithms and statistical models that allow systems to perform specific tasks effectively by using data, rather than relying on rule-based programming.
  83. Metadata – the data that provides information about other data. It is data about data, describing the characteristics and attributes of the primary data being collected, stored, or analyzed. This could include information on data lineage, definitions and quality metrics.
  84. Mixture of Experts (MoE) – a machine learning technique that involves combining multiple specialized models, called "experts," to solve a complex problem more effectively than a single, generalized model. The key idea behind the mixture of experts approach is to divide the problem space into smaller, more manageable subspaces, and then train a separate expert model for each subspace. These expert models are then combined using a gating network, which learns to route the input data to the most appropriate expert model based on the characteristics of the input.
  85. Model deployment – the process of making a trained machine learning or deep learning model available for use in a production environment. It involves taking the model and integrating it into a system or application so that it can be used to make predictions or decisions based on new, real-world data.
  86. Modern Data Stack (MDS) – a set of technologies and tools used to collect, process, and analyze data, ranging from turnkey solutions to customizable products designed to solve complex data situations.
  87. Modern Data Platform (MDP) – an integrated system or architecture that enables organizations to effectively manage, process, analyze, and derive insights from large and diverse data sets in a scalable and efficient manner.
  88. Multimodal AI – artificial intelligence systems that can process and integrate multiple types of data or "modalities," such as text, images, audio, video, and even sensor data. The key idea behind multimodal AI is to leverage the complementary information available across these different modalities to improve the overall performance and capabilities of the AI system.
  89. Natural Language Processing (NLP) – a subfield of artificial intelligence and linguistics that focuses on the interaction between computers and human language. It involves the development of computational models and techniques to analyze, understand, and generate human language.
  90. NoSQL database – a type of database management system that provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
  91. Data obfuscation – see entry 47 above.
  92. OLAP (Online Analytical Processing) database – a type of database technology that is designed to support complex analytical and reporting tasks. OLAP databases are optimized for fast and efficient data analysis, as opposed to traditional transactional databases that are optimized for handling high-volume, low-latency data processing.
  93. OLTP (Online Transactional Processing) database – a type of database system that is designed to handle high-volume, low-latency transactions. OLTP databases are optimized for handling day-to-day business operations, such as processing sales, managing inventory, or updating customer records.
  94. PaaS (Platform as a Service) – a category of cloud computing services that provides a platform for developers and organizations to build, deploy, and manage applications without the need to manage the underlying infrastructure.
  95. Personally identifiable information (PII) – any information that can be used to identify an individual, either alone or in combination with other data. PII is a crucial concept in data privacy and security, as it is the type of information that needs to be protected to safeguard an individual's privacy.
  96. Predictive analytics – the practice of using statistical models, machine learning algorithms, and data mining techniques to make predictions or forecasts about future events, behaviors, or outcomes. The goal of predictive analytics is to leverage historical data and patterns to anticipate and predict what might happen in the future.
  97. Real-time analytics – the process of collecting, analyzing, and deriving insights from data as it is being generated, in near real-time. This is in contrast to traditional analytics, where data is collected, stored, and then analyzed at a later time.
  98. Relational database – a type of database management system (DBMS) that organizes data into one or more tables, where data is stored in rows and columns. The key characteristic of a relational database is that it uses a relational model to represent and store data.
  99. SaaS (Software as a Service) – a software distribution model where applications are hosted by a service provider and made available to customers over the internet. In the SaaS model, users access the software through a web browser or a dedicated application, without the need to install or maintain the software on their own local devices or servers.
  100. Scalability – a fundamental concept in computer science and software engineering that refers to the ability of a system, network, or process to handle increasing amounts of work or load without significantly impacting performance or efficiency.
  101. Sentiment analysis – also known as opinion mining, is a subfield of natural language processing (NLP) that focuses on identifying and extracting subjective information from text data. The goal of sentiment analysis is to determine the attitude, sentiment, or emotion expressed in a piece of text, whether it is positive, negative, or neutral.
  102. Snowflake – a cloud-based data warehousing and analytics platform that provides a comprehensive solution for data storage, processing, and analysis. It is designed to address the challenges of traditional on-premises data warehousing and provide a scalable, flexible, and cost-effective alternative.
  103. SQL (Structured Query Language) – a programming language that is primarily used for managing and manipulating relational databases. It is the standard language for interacting with and querying data stored in database management systems (DBMS), such as MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and others. See the querying sketch after this glossary.
  104. Structured data – data that has been organized into a predefined format or schema (such as rows and columns) before being placed in data storage.
  105. Supervised learning – a type of machine learning algorithm where the model is trained on a labeled dataset. In other words, the training data consists of input features and their corresponding desired outputs or target variables. See the supervised learning sketch after this glossary.
  106. Unstructured data – datasets (typically large collections of files) that are not arranged in a predetermined data model or schema.
  107. Unsupervised learning – a type of machine learning algorithm that finds hidden patterns or structures in data without the use of labeled or pre-defined outcomes. In other words, unsupervised learning algorithms are used to discover interesting insights and relationships within a dataset, without being guided by specific target variables or labels.
  108. Time-to-insight – a concept that refers to the amount of time it takes for an organization or individual to derive meaningful and actionable insights from data or information. It encompasses the entire process of data collection, processing, analysis, and the subsequent generation of insights that can inform decision-making. We’ve written an ebook about ways to reduce time-to-insight with a modern data platform.
  109. Validation set – a subset of data that is used to evaluate the performance of a machine learning model during the training process. It is an important component of the model development and evaluation workflow.
  110. Web scraping – also known as web data extraction, is the process of extracting data from websites using automated software or scripts. The goal of web scraping is to collect and structure data from the web in a format that is more easily consumable and useful for various applications, such as data analysis, market research, or content aggregation. See the scraping sketch after this glossary.
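
Code sketches for selected terms

The short Python examples below illustrate a handful of the terms defined above. They are minimal sketches rather than production implementations, and any URLs, file names, column names and datasets they use are hypothetical placeholders.

API (entry 3). A minimal sketch of calling a REST API over HTTP with the requests library; the endpoint, query parameter and response fields are assumptions made purely for illustration.

```python
import requests

BASE_URL = "https://api.example.com/v1"   # hypothetical endpoint

# Ask the (assumed) orders resource for the ten most recent records.
response = requests.get(f"{BASE_URL}/orders", params={"limit": 10}, timeout=10)
response.raise_for_status()               # stop loudly on HTTP errors

# Assume the API answers with a JSON list of order objects.
for order in response.json():
    print(order["id"], order["total"])
```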
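
Clustering (entry 10). A minimal k-means sketch using scikit-learn on a toy 2-D dataset; the points are fabricated so that two groups are easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy points forming two well-separated groups.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster membership per point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the two centroids
```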
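
Data aggregation (entry 12). A pandas sketch that groups hypothetical order records by region and summarizes revenue; the column names and values are invented.

```python
import pandas as pd

orders = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Summarize: total revenue and number of orders per region.
summary = (orders
           .groupby("region")
           .agg(total_revenue=("revenue", "sum"),
                order_count=("revenue", "count"))
           .reset_index())
print(summary)
```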
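
Data cleansing (entry 21). A pandas sketch that fixes a few typical quality problems in a fabricated customer extract: inconsistent casing, missing values, duplicates and a non-numeric age.

```python
import pandas as pd

raw = pd.DataFrame({
    "email": ["a@example.com", "A@EXAMPLE.COM", None, "b@example.com"],
    "age":   ["34", "34", "27", "not available"],
})

clean = (raw
         .assign(email=raw["email"].str.strip().str.lower())                 # normalize e-mails
         .dropna(subset=["email"])                                           # drop rows with no e-mail
         .drop_duplicates(subset=["email"])                                  # remove duplicates
         .assign(age=lambda df: pd.to_numeric(df["age"], errors="coerce")))  # invalid ages become NaN
print(clean)
```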
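
Data masking (entry 41). A sketch of two simple masking helpers that keep records realistic-looking but non-identifying; the hashing scheme and the sample record are illustrative choices, not a prescribed standard.

```python
import hashlib

def mask_email(email: str) -> str:
    """Replace the local part with a short, stable hash so records can still be joined."""
    local, _, domain = email.partition("@")
    return hashlib.sha256(local.encode("utf-8")).hexdigest()[:8] + "@" + domain

def mask_name(name: str) -> str:
    """Keep initials, hide the remaining characters."""
    return " ".join(part[0] + "*" * (len(part) - 1) for part in name.split())

customer = {"name": "Jane Doe", "email": "jane.doe@example.com"}   # fabricated record
print({"name": mask_name(customer["name"]), "email": mask_email(customer["email"])})
```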
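
ETL (entry 75). A sketch of a tiny extract-transform-load run: read a hypothetical CSV export, clean and aggregate it, then load the result into SQLite, which stands in for a real warehouse here. The file name and column names are assumptions.

```python
import sqlite3
import pandas as pd

# Extract: read an export from a source system (file and columns are assumed).
raw = pd.read_csv("sales_export.csv")

# Transform: drop incomplete rows, parse dates, aggregate to daily totals.
daily = (raw
         .dropna(subset=["order_id"])
         .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
         .groupby("order_date", as_index=False)["amount"].sum())

# Load: write the prepared table into the target database.
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```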
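
SQL and data querying (entries 57 and 103). A minimal querying sketch using Python's built-in sqlite3 module; the table and its values are created in memory purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT, revenue REAL);
    INSERT INTO customers (country, revenue) VALUES
        ('PL', 1200.0), ('PL', 800.0), ('DE', 950.0);
""")

# A typical analytical query: revenue per country, highest first.
rows = conn.execute("""
    SELECT country, SUM(revenue) AS total_revenue
    FROM customers
    GROUP BY country
    ORDER BY total_revenue DESC;
""").fetchall()

for country, total in rows:
    print(country, total)
conn.close()
```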
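
Supervised learning (entry 105). A sketch that fits a classifier on the labeled Iris dataset bundled with scikit-learn and evaluates it on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labeled data: measurements (features) and known species (targets).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                     # learn from labeled examples
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```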
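
Web scraping (entry 110). A sketch using requests and BeautifulSoup; the URL and CSS selector are placeholders, and any real scraping should respect the site's terms of service and robots.txt.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles"            # placeholder page

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect article titles from (assumed) <h2 class="article-title"> elements.
titles = [h2.get_text(strip=True) for h2 in soup.select("h2.article-title")]
print(titles)
```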

Need support with data strategy?

We hope this comprehensive data glossary has given you a solid introduction to the world of data, as well as to the intricacies involved in building the data infrastructure that helps businesses become more data-driven.

If you need support with building a comprehensive, modern data platform, are looking for data science expertise, or simply want to discuss any of the terms listed in this glossary, feel free to reach out to us via this contact form and we’ll get back to you promptly to schedule a free data strategy consultation.

Want to read more?

Data

ELT Process: unlock the future of data integration with Extract, Load, Transform

Unlock the future of data integration with our ELT process guide. Learn how Extract, Load, Transform can streamline your data workflow.
Data

Data integration: different techniques, tools and solutions

Master data integration with our comprehensive guide on techniques, tools, and solutions. Enhance your data strategies effectively.
Data

Supply chain analytics examples – 18 modern use cases

Explore real-world applications with our guide on supply chain analytics examples. See how data insights transform operations.