Machine Learning Glossary: Data Structures Explained

The Definitive Machine Learning Glossary (2025)

Stepping into the world of machine learning can feel like learning a new language. You’re immediately confronted with a sea of acronyms and technical terms: algorithms, bias, classification, deep learning, encoding… it’s easy to feel overwhelmed. But understanding this vocabulary is the first and most crucial step to building true AI literacy.

This comprehensive glossary is designed to be your go-to reference. We’ve compiled and simplified dozens of the most essential machine learning terms, organizing them into logical categories to help you build your knowledge systematically. Whether you’re a student, a professional pivoting into tech, or just a curious mind, this guide will provide the clear, accessible definitions you need to speak the language of AI with confidence.

Foundational Concepts

These are the absolute core ideas that form the bedrock of the entire machine learning field.

Artificial Intelligence (AI): The broad field of computer science dedicated to creating systems that can perform tasks normally requiring human intelligence. Machine learning is a major subfield of AI. Learn more in our “What is AI?” guide.
Machine Learning (ML): A subset of AI where algorithms are trained to learn patterns from data and make predictions or decisions, rather than being explicitly programmed with rules.
Generative AI: A class of AI models that can generate new, original content, including text, images, music, and code, based on the patterns they learned from their training data. Tools like ChatGPT are examples of generative AI.
Model: The output of a machine learning algorithm after it has been trained on a dataset. The model holds the learned patterns (its parameters), is usually saved as a file, and can be used for inference, meaning making predictions on new data.
Algorithm: The specific mathematical procedure or technique used to learn from data and create a model. Examples include Linear Regression or Decision Trees.
Parameter: A variable internal to the model that is learned from the training data. The weights and biases in a neural network are parameters.
Hyperparameter: A configuration setting that is external to the model and whose value is set by the data scientist before training begins. Examples include the learning rate for gradient descent or the number of trees in a random forest.
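To see how an algorithm, a model, parameters, and hyperparameters relate, here is a minimal sketch that fits a one-feature linear model with plain gradient descent. The function name train_linear_model, the toy data, and the chosen learning rate and epoch count are all illustrative assumptions, not a recommended setup.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error.

    learning_rate and epochs are hyperparameters: chosen before training.
    w and b are parameters: learned from the data during training.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy dataset generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# The "model" is simply the learned parameters (w, b).
w, b = train_linear_model(xs, ys)

# Inference: use the trained model on an input it has not seen.
prediction = w * 10.0 + b
```

Here the algorithm is the training procedure, the model is the pair (w, b) it returns, and changing learning_rate or epochs changes how training behaves without being learned from the data.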

Data & Preprocessing

High-quality models are built on high-quality data. These terms describe the data itself and the crucial process of preparing it for a model.

Data Preprocessing: The critical step of cleaning, transforming, and structuring raw data into a format suitable for a machine learning model. This includes tasks like handling missing values, encoding, and scaling. Learn more in our Guide to Data Preprocessing.
Features: The individual input variables that describe your data. If predicting a house price, features would be square footage, number of bedrooms, and location.
Feature Engineering: The creative process of using domain knowledge to create new, more informative features from raw data. This is often one of the most impactful steps in improving model accuracy.
Label: The “correct answer” or output variable you are trying to predict in a supervised learning problem. For house price prediction, the label is the actual sale price.
Categorical Data: Data that can be divided into groups, like “City” or “Color.” Most algorithms require it to be converted to a numerical format via encoding.
One-Hot Encoding: A common technique to convert categorical data into a numerical format by creating one new binary (0 or 1) column per category.
Feature Scaling: The process of putting all numerical features onto a similar scale to prevent features with large magnitudes from dominating the model. Normalization (rescaling values to a fixed range, typically 0 to 1) and Standardization (rescaling values to a mean of 0 and a standard deviation of 1) are the two main types.
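The encoding and scaling steps defined above can be sketched in a few lines of plain Python. The helper names (one_hot_encode, min_max_normalize, standardize) and the sample values are hypothetical; in practice a library would usually handle this, but the arithmetic is the same.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


def min_max_normalize(values):
    """Rescale values to the range 0..1 (assumes they are not all equal)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


# Illustrative data: a categorical feature and a numerical feature.
cities = ["Paris", "Tokyo", "Paris", "Lima"]
square_feet = [800.0, 1200.0, 2000.0]

categories, encoded = one_hot_encode(cities)
normalized = min_max_normalize(square_feet)
standardized = standardize(square_feet)
```

Each row of encoded has exactly one 1, in the column of that row's category, and both scaled lists keep the original ordering of the values.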

Types of Machine Learning

Machine learning is typically broken down into three main paradigms, defined by the type of data and the problem they solve.

Supervised Learning: The model learns from labeled data, where both the inputs and the correct outputs are provided. The goal is to learn a function that can map inputs to outputs. For a full breakdown, read our guide on What Is Supervised Learning?.
Unsupervised Learning: The model learns from unlabeled data. It tries to find hidden patterns, structures, or clusters within the data itself without any pre-defined correct answers. Customer segmentation is a classic example.
Reinforcement Learning (RL): An agent learns by interacting with an environment. It takes actions and receives rewards or penalties, learning a strategy (a “policy”) to maximize its cumulative reward over time. This is used to train AI to play games or control robots. Dive deeper in our guide on What Is Reinforcement Learning?.
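To make the difference between the first two paradigms concrete, here is a small sketch on the same one-dimensional data: a nearest-centroid classifier that uses the labels (supervised) and a two-cluster k-means that ignores them (unsupervised). Both techniques are swapped in purely for illustration, and the data, function names, and initialization choice are assumptions. Reinforcement learning is left out because it needs an environment to interact with.

```python
def fit_centroids(values, labels):
    """Supervised: learn the mean value of each labeled class."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Predict the label whose class mean is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def kmeans_two_clusters(values, iterations=10):
    """Unsupervised: split values into 2 clusters without any labels."""
    centers = [min(values), max(values)]  # simple deterministic start
    assignments = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        assignments = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of its assigned values.
        for k in (0, 1):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignments, centers


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

class_means = fit_centroids(values, labels)      # needs labels
guess = predict(class_means, 4.5)

clusters, centers = kmeans_two_clusters(values)  # no labels used
```

The supervised model can name its answer ("small" or "large") because it was given those names, while the unsupervised one only reports which points belong together.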

Neural Networks & Deep Learning

Deep learning is the subfield of machine learning behind many of today’s biggest breakthroughs in AI.

Deep Learning: A subfield of machine learning based on artificial neural networks with many layers (hence “deep”). It has led to state-of-the-art performance in areas like computer vision and NLP.
Neural Network: A model inspired by the structure of the human brain, composed of interconnected nodes called “neurons” organized in layers (input, hidden, and output).
Activation Function: A mathematical function applied to the weighted sum of a neuron’s inputs to produce its output. It introduces non-linearity into the model, allowing it to learn complex patterns. Common examples are ReLU, Sigmoid, and Tanh.
Backpropagation: The algorithm used to compute gradients when training neural networks. It applies the chain rule to calculate the gradient of the loss function with respect to the network’s weights, working backward from the output layer, so the model can adjust its parameters via gradient descent.
Convolutional Neural Network (CNN): A type of deep learning model specially designed for processing grid-like data, such as images. CNNs are a backbone of modern computer vision.
Recurrent Neural Network (RNN): A type of neural network designed to work with sequential data, like text or time series, by having connections that form a directed cycle (a “memory”).
Transformer Model: A revolutionary neural network architecture, introduced in 2017, that uses a “self-attention” mechanism to process all parts of a sequence at once. It is the foundation for most modern large language models, including the GPT models behind ChatGPT.
Transfer Learning: The technique of taking a model pre-trained on a large dataset (like ImageNet) and repurposing it as a starting point for a new, related task. This dramatically reduces the need for data and computation. Learn more in our Transfer Learning Guide.
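As a rough illustration of how layers, neurons, and activation functions fit together, the sketch below runs a forward pass through a tiny network with one hidden layer. The weights and biases are made-up values rather than trained ones, and the dense and forward helpers are hypothetical names.

```python
import math


def relu(z):
    """ReLU activation: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid activation: squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron applies the activation to its weighted sum."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Made-up parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -1.0], [1.5, 2.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -0.5]]
output_biases = [0.1]


def forward(inputs):
    """Input layer -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = dense(inputs, hidden_weights, hidden_biases, relu)
    return dense(hidden, output_weights, output_biases, sigmoid)[0]


probability = forward([1.0, 2.0])
```

Training would use backpropagation to compute how the loss changes with each of these weights and biases, then update them with gradient descent. That step is omitted here to keep the sketch short.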

Model Training & Evaluation

These terms describe the process of training a model and, crucially, evaluating whether it is actually any good.

Generalization: A model’s ability to make accurate predictions on new, unseen data. This is the ultimate goal of machine learning.
Overfitting: A common pitfall where a model learns the training data too well, including its noise. An overfit model performs well on training data but fails to generalize. Learn more in our Guide to Generalization & Regularization.
Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and new data.
Regularization: A set of techniques (like L1 and L2) used to prevent overfitting by adding a penalty for model complexity during training.
Bias-Variance Tradeoff: A fundamental concept in ML. Bias is error from wrong assumptions (underfitting), while Variance is error from being too sensitive to the training data (overfitting). The goal is to find a balance between the two.
Loss Function (or Cost Function): A function that measures the error of a model’s predictions compared to the true labels. The goal of training is to find the model parameters that minimize this function.
Gradient Descent: The core optimization algorithm used to train many models, including neural networks. It iteratively adjusts the model’s parameters in the direction of the negative gradient, the direction that most quickly reduces the loss function, with a step size set by the learning rate. See our deep dive on how Gradient Descent works.
Accuracy: A common metric for classification tasks that measures the proportion of correct predictions: Number of Correct Predictions / Total Number of Predictions.
Precision and Recall: Two classification metrics that matter most when classes are imbalanced and accuracy alone can mislead. Precision measures what fraction of positive predictions were correct: True Positives / (True Positives + False Positives). Recall measures what fraction of the actual positives were correctly identified: True Positives / (True Positives + False Negatives).
Confusion Matrix: A table used to evaluate the performance of a classification model. It visualizes the True Positives, True Negatives, False Positives, and False Negatives.
Ensemble Learning: A technique that combines the predictions of multiple individual models to create a more robust and accurate “super model.” Methods include Bagging, Boosting, and Stacking. Learn more in our Ensemble Learning Guide.
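The evaluation metrics above all come from the same four counts, which a short sketch can make explicit. The labels and predictions below are invented for illustration (1 is the positive class), and evaluate_binary is a hypothetical helper name.

```python
def evaluate_binary(y_true, y_pred):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "confusion_matrix": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive
        # or when there are no actual positives.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented example: 4 actual positives among 10 samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

report = evaluate_binary(y_true, y_pred)
```

In this example accuracy is 0.7 while recall is only 0.5, which shows how accuracy can look acceptable even when half of the actual positives are missed.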

Ready to Start Your Journey?

Understanding these terms is the first step. The next is to see how they fit together in a real project. Our AI Fundamentals Learning Roadmap is the perfect place to start your structured learning journey.
