Understanding the jargon before you get lost in the sea of ML terms
The world is shifting towards AI. Many of us witnessed the computer revolution and how those machines became part of our everyday lives. We would only be fooling ourselves if we thought AI will not be woven into our everyday work in the same way. Shocker: the revolution has already started. I would probably sound funny asking whether you have heard of the new smart kid on the block whose name ends in GPT. Yes, there will be more where that came from! The development of these systems is about as complex as it gets. Regardless, it helps to know some of the most commonly used terms when they are being built.
1. Dataset — A dataset is a file (or collection of files) containing records gathered for a specific purpose. For example, we can have a dataset of educational institutions in a country, where each record holds the name of the institution, its location, the number of students, and much more. Datasets are not limited to text; we can also have a dataset of images of animals such as dogs, cats, etc.
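To make that concrete, here is a minimal sketch of what such a dataset of records could look like in code, using pandas and made-up institution names:

```python
import pandas as pd

# A toy "educational institutions" dataset: each row is one record.
institutions = pd.DataFrame({
    "name": ["Unity College", "Hilltop University", "Lakeside Institute"],
    "location": ["Accra", "Kumasi", "Tamale"],
    "num_students": [1200, 5400, 860],
})

print(institutions)
```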
2. Model — This is an interesting one. A model is simply a mathematical representation of a system that is used to make predictions. It can range from a simple linear model to a very complex system. Aurélien Géron, who led YouTube's video classification team, rightly pointed out how confusing this term can be: "the same word 'model' can refer to a type of model (e.g., Linear Regression), to a fully specified model architecture (e.g., Linear Regression with one input and one output), or to the final trained model ready to be used for predictions".
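As a rough illustration of the last sense (a trained model ready for predictions), here is a tiny sketch using scikit-learn's LinearRegression on made-up numbers:

```python
from sklearn.linear_model import LinearRegression

# One feature per instance, and a target value for each instance.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.1, 3.9, 6.2, 8.1]

model = LinearRegression()
model.fit(X, y)                 # training produces the fitted model
print(model.predict([[5.0]]))   # use the trained model to make a prediction
```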
3. Training Set — The training set is the part of the overall dataset that the model is trained on before being tested. It usually constitutes 70–80% of the total dataset, with some exceptions when the data is extremely large. The main reason we have a training set is to teach the model how to make accurate predictions; it functions as the "notes" the model studies to learn how to perform its task.
4. Training Instance — Assume we have 100,000 rows in a training set. A single row in that training set is what is referred to as a training instance. A training instance consists of features and a label.
5. Test Set — The test set is the part of the dataset used to evaluate the model that has been developed on the training set. It usually forms 20–30% of the dataset, and this percentage is not fixed either, since it can change with the amount of data we have. Care should be taken that no part of the test set leaks into training; otherwise the evaluation will look better than it really is and can mask overfitting (see 10).
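A minimal sketch of splitting a synthetic dataset into an ~80% training set and ~20% test set, assuming scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 toy instances, one feature each
y = 3 * X.ravel() + 5               # a simple synthetic target

# Hold out ~20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))    # 80 20
```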
6. Attributes & Features — Features and attributes are words most people treat as synonymous. An attribute is a data type in the dataset, for example "height", "color", or "age". A feature generally refers to an attribute plus its value, for example color=black, age=98, height=4'5″. Please note that in practice the two words are mostly used interchangeably.
7. Labels — Labels are the desired output or target variable that we are building the system to predict. For example, if we are trying to predict the price of a particular item, the price is what is referred to as the label.
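A quick sketch, using a hypothetical housing table, of how features and the label are typically separated in code:

```python
import pandas as pd

# Hypothetical housing records: the columns are attributes,
# and "price" is the label we want to predict.
houses = pd.DataFrame({
    "area_sqm": [80, 120, 65],
    "bedrooms": [2, 3, 1],
    "price":    [90_000, 150_000, 70_000],
})

X = houses.drop(columns=["price"])  # features
y = houses["price"]                 # label / target
```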
8. Data Pipeline — A data pipeline is a series of data processing components. It makes it easier to collect data from one or more sources, process and transform it, and deposit it into a destination store. Say, for example, we have a predefined function that collects data from a database or flat files, and the records are then run through a specified cleaning function. That is a simple data pipeline.
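As a rough sketch of the idea, here is a tiny two-step pipeline built with scikit-learn's Pipeline, which fills missing values and then scales the data (the components and the data here are just placeholders):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Two processing components chained together: impute missing values, then scale.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

raw = np.array([[1.0], [np.nan], [3.0], [4.0]])
clean = num_pipeline.fit_transform(raw)
print(clean)
```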
9. Underfitting — Underfitting occurs when the model is too simple relative to the data it is trained on. We say simple because it is not complex enough to capture the underlying patterns in the dataset. To address this, we can simply switch from a simple model to a more complex one.
10. Overfitting — Overfitting means that a model performs well on the training data but does not generalize well: it performs badly on new examples it has not seen during training. This usually happens when the model is too complex relative to the amount of data. Common solutions are collecting more data or reducing the complexity of the model.
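To see both failure modes side by side, here is a small sketch on synthetic noisy sine data: a degree-1 polynomial is too simple and underfits, while a degree-15 polynomial fits the 30 training points almost perfectly, which is a warning sign of overfitting:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # A high score on the training data alone does not mean the model generalizes.
    print(degree, model.score(X, y))
```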
11. Regularization — Regularization is a technique often used to reduce the risk of overfitting. It works by adding constraints to a model to make it simpler. When done right, regularization greatly improves how well a model generalizes.
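A minimal sketch of the idea using Ridge regression on synthetic data, where the alpha parameter controls how strongly the weights are constrained:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.RandomState(1)
X = rng.normal(size=(20, 5))
y = X[:, 0] + 0.1 * rng.normal(size=20)

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)   # alpha sets the strength of the constraint

print(plain.coef_)        # unconstrained weights
print(regularized.coef_)  # weights shrunk towards zero -> a simpler model
```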
12. Hyperparameters — These are parameters of the learning algorithm rather than of the model itself, and they are set before training. A common example is the amount of regularization to apply during learning to prevent overfitting. They are essentially different from model parameters, which are learned from the data. Increasing the regularization hyperparameter to a very large value will almost always guarantee you do not overfit, but you will not get a very good solution either; the result is an almost flat model with a slope close to 0.
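A small sketch of choosing such a hyperparameter with cross-validation, assuming scikit-learn's GridSearchCV and synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=50)

# alpha is a hyperparameter: we list candidate values before training
# and pick the one that generalizes best via cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```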
13. Feature Engineering — Feature engineering is the art of taking raw data and extracting useful features from it to train the system on. This involves selecting the most useful features in the dataset (feature selection), combining existing features to produce a more useful one, and so on. As an example of the second part, say we have the number of hospitals in 10 different cities and the population of those cities. On their own, the hospital count and the population cannot tell us much about access to hospital facilities. But if we combine the two into the number of hospitals per 100,000 people, we get far more insight.
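A quick sketch of that hospitals-per-100,000-people combination with pandas (the city names and numbers are made up):

```python
import pandas as pd

cities = pd.DataFrame({
    "city":       ["A", "B", "C"],
    "hospitals":  [12, 40, 7],
    "population": [300_000, 2_100_000, 95_000],
})

# Combine two raw attributes into a single, more informative feature.
cities["hospitals_per_100k"] = cities["hospitals"] / cities["population"] * 100_000
print(cities)
```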
14. Batch Learning — In batch learning, the system cannot be trained incrementally: it must be trained using all the available data at once. Batch learning therefore does not allow the model to adapt to new data. If new data arrives and we want the system to learn from it, we have to retrain the whole system from scratch on both the old and the new data.
15. Online Learning — This is when a system is trained incrementally. Online learning is used when the system must be trained continuously, which is vital when it needs to stay up to date without being retrained from scratch each time. Think of systems that receive a continuous flow of data from sources such as sensors, social media, financial markets, etc.
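A minimal sketch of incremental training, assuming scikit-learn's SGDRegressor and a simulated stream of small data chunks:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

rng = np.random.RandomState(0)
# Simulate a continuous stream of data arriving in small chunks.
for _ in range(100):
    X_chunk = rng.normal(size=(10, 2))
    y_chunk = 4 * X_chunk[:, 0] - 2 * X_chunk[:, 1] + rng.normal(scale=0.1, size=10)
    model.partial_fit(X_chunk, y_chunk)   # update the model incrementally

print(model.coef_)   # should approach roughly [4, -2]
```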
BONUS
16. Learning Rate — Obviously, if a system is continuously taking in data to update its model, it is highly susceptible to noise. The learning rate gives us the flexibility to adjust for that: it is how fast a learning system adapts to a continuous flow of training data.
Setting a low learning rate makes the system less sensitive to noise and outliers in new data, but it also makes the system learn slowly. Setting a high learning rate gives the exact opposite outcome. Finding the sweet spot between the two is where the magic happens.
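A small sketch of that trade-off using SGDRegressor, where the eta0 parameter plays the role of the learning rate (synthetic data, and only a few passes so the difference is visible):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=200)

# Small eta0 -> slow but noise-resistant learning; the weight lags behind the true value of 3.
# Larger eta0 -> faster adaptation, at the cost of noisier updates.
for eta0 in (0.001, 0.1):
    model = SGDRegressor(learning_rate="constant", eta0=eta0,
                         max_iter=5, tol=None, random_state=42)
    model.fit(X, y)
    print(eta0, model.coef_)
```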
17. Transfer Learning — Instead of building a model from scratch, we can take an already pre-trained model and build on it. This process is known as transfer learning, and it can save considerable time and resources. There are several pre-trained models available for tasks such as image recognition, object detection, etc. More on those later!
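A rough sketch of the idea with Keras: reuse a MobileNetV2 pre-trained on ImageNet, freeze it, and add a small new head for a hypothetical 3-class task (the input shape and class count here are just assumptions for illustration):

```python
import tensorflow as tf

# Start from a network pre-trained on ImageNet and reuse its learned features.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(160, 160, 3)
)
base.trainable = False   # freeze the pre-trained layers

# Add a small new "head" that will be trained on our own task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```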
If you like this article, follow me for more!
I write to understand more about what I learn
If you noticed a mistake, have suggestions to improve the article, or want to reach out, feel free to message me on LinkedIn