Understanding the nuances of transformer inference is crucial for optimizing performance and cost. In this blog, we will delve into key concepts such as arithmetic intensity, memory bandwidth, and GPU throughput, going through critical calculations for transformer models like LLaMA 2 on A10 GPUs. We'll explore how to estimate different performance-bound scenarios—whether your model is memory, compute, overhead, or communication-bound—and present strategies for optimizing cost and latency. By the end, you'll have a framework for making informed decisions on how to best utilize your GPU resources for high-throughput transformer inference and a theoretical estimator app comparing costs and performances across different GPUs.
Calculate the operations-to-byte (ops:byte) ratio of your GPU from its specifications. For A10 GPU spec:
ops_to_byte_A10
= compute_bw / memory_bw
= 125 TFLOPS / 600 GB/s
= 208.3 ops/byte
Compare the ops:byte ratio to the arithmetic intensity of your model.
torch.compile or Triton can optimize the execution by reducing overhead. For Llama 2 7B on an A10 GPU, the model is memory bound during autoregressive sampling (arithmetic intensity of 62 ops/byte < 208.3 ops/byte).
Remaining VRAM = 24 GB - (2 bytes/param * 7B params) = 10 GB # single A10 with 24 GB VRAM; FP16 weights take ~14 GB
kv_cache_size
= (2 * 2 * n_layers * d_model) bytes/token
= (4 * 32 * 4096) bytes/token
= 524288 bytes/token
~ 0.00052 GB/token
kv_cache_tokens
= 10 GB / 0.00052 GB/token
= 19,230 tokens
batch_size = 19,230 / 2048 ≈ 9 => a batch size of 8 can comfortably be supported over a context length of 2048.
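These back-of-the-envelope numbers are easy to script. A minimal sketch (assuming FP16 weights and KV cache, and Llama 2 7B's n_layers = 32, d_model = 4096):

gpu_vram = 24e9                                   # A10: 24 GB
n_params, n_layers, d_model = 7e9, 32, 4096
weights_bytes = 2 * n_params                      # FP16 -> 2 bytes per parameter
kv_bytes_per_token = 2 * 2 * n_layers * d_model   # K and V, 2 bytes each
free_bytes = gpu_vram - weights_bytes
kv_cache_tokens = free_bytes / kv_bytes_per_token
print(int(kv_cache_tokens), int(kv_cache_tokens // 2048))   # ~19k tokens, batch of ~9 at 2048 context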
Prefill time = number of tokens * (2 * number of parameters) / accelerator compute bandwidth
time/token = total number of bytes moved (the model weights) / accelerator memory bandwidth
Total generation time = prefill time + number of tokens * time/token
For Llama 2 7B on an A10:
prefill_time = 350 * (2 * 7B) FLOP / 125 TFLOP/s = 39 ms
time/token = (2 * 7B) bytes / (600 GB/s) = 23 ms/token
Total Generation Time = 39 ms + 150 tokens * 23 ms/token = 3.49 s
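The same estimate as a short script (using the assumed numbers above: 350 prompt tokens, 150 generated tokens, batch size 1):

compute_bw, memory_bw = 125e12, 600e9                        # A10: FLOP/s, bytes/s
n_params, prompt_tokens, gen_tokens = 7e9, 350, 150
prefill_time = prompt_tokens * 2 * n_params / compute_bw     # compute-bound prefill
time_per_token = 2 * n_params / memory_bw                    # memory-bound decode
total = prefill_time + gen_tokens * time_per_token
print(prefill_time, time_per_token, total)                   # ~0.039 s, ~0.023 s/token, ~3.5 s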
For example, on 2 A100 GPUs with a batch size of 1 for Llama 7B, the estimated cost is $0.066 per 1K tokens, which is higher than GPT-3.5's cost.
Increasing the batch size can reduce the cost but may increase latency.
Model Bandwidth Utilization (MBU):
(model_size_in_bytes * tokens_per_second) / GPU_memory_bandwidth
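Plugging the decode numbers above into this formula gives an MBU near 1.0, i.e. the theoretical roofline; measured tokens/second in practice will yield a lower value:

model_bytes, memory_bw = 2 * 7e9, 600e9
tokens_per_second = 1 / 0.023           # from the estimated 23 ms/token
mbu = model_bytes * tokens_per_second / memory_bw
print(round(mbu, 2))                    # ~1.0 => decode saturates memory bandwidth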
Roofline Paper - understanding performance measurement
Other techniques like Grouped Query Attention, MoE, Paged Attention, Flash Attention, prefix caching, etc. further improve LLM performance. Each can be mapped to the corresponding bound scenario above as an example of a targeted improvement.
Here is a simple cost estimator with defaults for the Llama 70B model, a batching factor of 8 (enabled by Grouped Query Attention), and an achieved GPU utilisation (MFU) of 70%. (Colab Notebook Reference)
During the training of deep neural networks, the mean and variance of activations can quickly shoot off to very high values or drop down to zero, causing the local gradients to become NaN or zero. This prevents the network from learning effectively. There are several techniques to handle this vanishing/exploding gradient problem:
Although there are multiple approaches, good initial parameters are essential. Along with other techniques, a good initialization improves training efficiency, resulting in better models at lower cost.
Setting all weights and biases to zero is a bad strategy as it prevents symmetry breaking and halts gradient flow.
Initializing weights with random numbers from a normal distribution can help break symmetry, but the activations still tend to diminish or explode for deeper layers.
Proposed in 2010, this initialization scales the weights by $\sqrt{1/n}$, where $n$ is the number of input units. It helps ensure that the variance remains the same across layers for tanh/sigmoid activations.
where $U$ is a uniform distribution, fan_in is the size of the previous layer (the number of columns in $W$), and fan_out is the size of the current layer.
For ReLU activations, Xavier initialization is not optimal. Kaiming initialization scales the weights by $\sqrt{2/n}$, which helps maintain the variance across layers. This implies an initialization scheme of:
$w_l \sim N(0, 2/n)$
That is, a zero-centered Gaussian with variance $2/n$ (as shown in the equation above), i.e. a standard deviation of $\sqrt{2/n}$. Biases are initialized at 0.
Layer-Sequential Unit-Variance (LSUV) initialization, proposed in "All you need is a good init", is a simple method for weight initialization in deep nets. The strategy involves two steps: first, pre-initialize the weights of each layer with orthonormal matrices; then, proceeding from the first layer to the last, rescale each layer's weights so that the variance of its output on a minibatch is close to one.
FixUp Initialization, or Fixed-Update Initialization, aims to train very deep residual networks stably at a maximal learning rate without normalization. It fixes the variance-scaling issue caused by residual connections. Kaiming initialization keeps the variance stable through an activation, i.e. $Var(F(x)) = Var(x)$. But with a residual connection, $Var(F(x) + x)$ will be greater than $Var(x)$, so the variance grows with each block!
The steps are as follows: (1) initialize the classification layer and the final layer of each residual branch to zero; (2) initialize all other layers with a standard method (e.g. Kaiming) and scale down the weight layers inside residual branches by a depth-dependent factor; and (3) add a scalar multiplier (initialized to 1) in every branch and scalar biases (initialized to 0) before each convolution, linear, and element-wise activation layer.
Setting weights to zero normally halts gradient flow and prevents symmetry breaking. The Fixup strategy avoids these issues in deep residual networks by exploiting residual connections: gradients flow uninterrupted through the skip paths, preventing vanishing gradients and ensuring continuous learning. Additionally, Fixup breaks symmetry by initializing the first convolutional layer with a non-zero method, enhancing the diversity of learned features. This asymmetry propagates through the network via the residual connections, further boosting the model's representational power.
T-Fixup extends this concept further to transformer models.
For dense layers, fan-in is the number of inputs, and fan-out is the number of outputs. For convolutional layers:
fan_in = num_input_feature_maps * receptive_field_size
fan_out = num_output_feature_maps * receptive_field_size
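As an illustration, a small helper (hypothetical, mirroring the formulas above and ignoring groups) that computes the fans for a PyTorch conv layer:

import torch.nn as nn

def conv_fans(conv: nn.Conv2d):
    receptive_field = conv.kernel_size[0] * conv.kernel_size[1]
    fan_in = conv.in_channels * receptive_field
    fan_out = conv.out_channels * receptive_field
    return fan_in, fan_out

print(conv_fans(nn.Conv2d(3, 32, kernel_size=3)))   # (27, 288)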
PyTorch does not use modern initialization techniques by default for backward compatibility reasons. You can explicitly initialize the weights using torch.nn.init functions for proper initialization:
import torch
import torch.nn as nn
# Helper function for initializing weights
def init_weights(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        # Kaiming initialization for convolutional and linear layers
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
        if hasattr(m, 'special_init'):
            # FixUp-style: zero-init layers flagged for special initialization
            nn.init.constant_(m.weight, 0)
    elif isinstance(m, nn.BatchNorm2d):
        # Batch normalization layer initialization
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

# Example model
class ExampleModel(nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Flag the 2nd convolutional layer for special init
        self.conv2.special_init = True

    def forward(self, x):
        ...
# Initialize the model weights
model = ExampleModel()
model.apply(init_weights)
Different initialization strategies are suitable for different activation functions and network architectures. Understanding the rationale behind these strategies will help you improve your networks efficiently. To learn more about this, these are some good references:
In our previous blog, we delved into the implementation and training of a GPT-based transformer model, incorporating several minor optimizations. The transformer architecture has showcased remarkable resilience to evolving trends. Beyond the PreNorm technique, another critical area that researchers have significantly improved is positional encoding. Positional encoding enables the model to comprehend the relative positions of words within a sentence, a vital factor for tasks like language translation and text generation.
The introduction of new positional embedding methods has empowered models to effectively capture larger contextual information. It has also served as a foundational element in various Language Model (LLM) architectures, including T5, Llama, and MPT.
In this post, we'll delve into the evolution of positional encoding techniques and their integration into transformer models. The final section offers a summary and comparison of these methods. You can practice these concepts by following the linked Colab Notebook, featuring step-by-step code implementation.
Hands-on Notebook: github
Experiment Dashboard: wandb
The vanilla transformer model, introduced by Vaswani et al. in 2017, uses a sinusoidal function to encode the position of each word in the input sequence. This positional encoding is added to the word embeddings before being fed into the model. The sinusoidal function has the advantage of being able to represent sequences of arbitrary length, but it has some limitations in terms of expressiveness. The authors also experimented with learned positional embeddings but found no significant difference, so sinusoidal positional encoding was selected.
Later studies found that learned positional embeddings offered no advantage because positional information was only added at the input: in deep multilayered networks, the positional embedding goes through so much transformation that the positional signal is rendered ineffective. To make positional information expressive throughout the network, it can instead be integrated into each transformer block, learning a separate embedding for each position in the input sequence for every block. Integrated positional embeddings have been shown to improve performance, but they have the disadvantage of requiring a fixed maximum sequence length.
We can incorporate absolute position into embedding space as:
where P is the position embedding vector. This can be incorporated in MHSA implementation as follows:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, shared_head=False):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size
        self.shared_head = shared_head
        self.key = nn.Linear(head_size, head_size, bias=False)
        self.query = nn.Linear(head_size, head_size, bias=False)
        self.value = nn.Linear(head_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        if self.shared_head:
            self.pos_emb = nn.Parameter(torch.randn(block_size, head_size))
        else:
            self.pos_emb = nn.Parameter(torch.randn(num_heads, block_size, head_size))
        self.dropout_wei = nn.Dropout(dropout)
        n_embd = num_heads * head_size
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout_proj = nn.Dropout(dropout)

    def forward(self, x):
        # Reshape the tensor to B N T H for N heads
        B, T, C = x.shape
        x = rearrange(x, 'B T (N H) -> B N T H', N=self.num_heads)
        k = self.key(x)   # (B,N,T,H)
        q = self.query(x) # (B,N,T,H)
        # compute attention scores
        q = q * self.head_size**-0.5
        wei = torch.einsum("BNTH, BNSH -> BNTS", q, k)
        # Position Embedding
        if self.shared_head:
            p_wei = torch.einsum("BNTH, SH -> BNTS", q, self.pos_emb[:T])
        else:
            p_wei = torch.einsum("BNTH, NSH -> BNTS", q, self.pos_emb[:, :T, :])
        wei += p_wei
        # final attention and value computation
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout_wei(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,N,T,H)
        out = torch.einsum("BNTS, BNSH -> BNTH", wei, v)
        # concat and mix N Heads
        out = rearrange(out, 'B N T H -> B T (N H)')
        out = self.dropout_proj(self.proj(out))
        return out
Another approach is to add absolute embedding to embedding x just before transforming it into K, Q, and V vectors.
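A minimal sketch of that variant (a hypothetical fragment reusing the attribute names from the model code later in this series):

# in __init__: one learned embedding per absolute position
self.position_embedding_table = nn.Embedding(block_size, n_embd)
# in forward: add positions to token embeddings before projecting to K, Q, V
pos_emb = self.position_embedding_table(torch.arange(T, device=x.device))  # (T, C)
x = tok_emb + pos_emb                                                       # (B, T, C)
k, q, v = self.key(x), self.query(x), self.value(x)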
One limitation of absolute positional encoding lies in its lack of translation invariance. When the same sentence or phrase is shifted, the absolute positions change, resulting in different attention patterns. For instance, in the diagram below, the verb "cook" at position 1 should ideally pay higher attention to the word "She" with absolute position 0, while the same verb "cook" at position 6 should prioritize the word "I" with absolute position 5.
Relative position embeddings offer a solution to this challenge by introducing translation invariance to positional information. With relative positioning, the verb "cook" needs to consider the -1 position for the subject in both instances. This incorporation of relative position embeddings introduces an inductive bias of translation invariance into the generalized transformer architecture. This has demonstrated performance improvements in both natural language processing (NLP) and computer vision tasks.
Drawing from Shaw et al.'s 2018 proposal, we can incorporate their approach as follows:
Here, 'P' represents learnable position embeddings, which enable the learning of unique representations for each relative pair of positions. This implementation has been detailed in the corresponding section in Colab.
It's worth noting a clever technique we employ in relative position embedding. The attention scores for a sequence of N tokens have the following relative position pattern, where 0 is the current position, 1 represents the next token, and -1 represents the previous token.
To capture all relative positions in a sequence of N tokens, we require 2N-1 unique positions. In our implementation, we create a learnable embedding table of 2N-1 vectors, then employ a relative_to_absolute function to map this to the required NxN pattern. This mapping is achieved through a clever combination of padding and reshaping of matrices, ensuring efficiency in computation. We take the non-causal implementation from Lucidrains as a reference, which enables capabilities like prefix decoders, causal decoders, and vision transformers.
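For reference, here is a sketch of that padding-and-reshape trick for the non-causal case, adapted from the Lucidrains-style implementation mentioned above (input is the relative logits of shape (B, N, T, 2T-1), output is the absolute (B, N, T, T) pattern):

import torch

def relative_to_absolute(x):
    b, h, t, _ = x.shape
    col_pad = torch.zeros(b, h, t, 1, device=x.device, dtype=x.dtype)
    x = torch.cat([x, col_pad], dim=3)                  # (b, h, t, 2t)
    flat = x.reshape(b, h, t * 2 * t)
    row_pad = torch.zeros(b, h, t - 1, device=x.device, dtype=x.dtype)
    flat = torch.cat([flat, row_pad], dim=2)            # length (t+1) * (2t-1)
    out = flat.reshape(b, h, t + 1, 2 * t - 1)
    return out[:, :, :t, t - 1:]                        # keep the valid t x t block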
For a causal-only decoder, where we don't need to attend to future tokens (positions '1', '2', '3', and so on), we only require 'N' unique relative positions, ranging from '-N+1' to '0'. There exist more optimized transformations, as proposed by Shaw et al (2018) and further refined by Huang et al (2018), for this specific scenario.
You can also find additional insights in the 'Extra' section at the end of the Colab section, which provides a step-by-step representation of how this transformation works.
The T5 model, introduced by Raffel et al. in 2020, employs a technique known as "relative bias." Instead of using an embedding vector for each relative position, it utilizes a learnable scalar bias. This relative bias term is added to the corresponding logit during the computation of attention weights. This approach reduces the number of learned positional parameters, resulting in improved efficiency in terms of both memory and computation.
Mathematically, it can be expressed as:
To further enhance efficiency, T5 shares the position embedding parameters across all layers in the model. However, within a given layer, each attention head utilizes a distinct learned position embedding. In our implementation, we employ relative bias for each unique relative position. It's important to note that the original T5 implementation used position binning, based on linear and logarithmic relative distances, to scale to larger contextual contexts.
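A minimal sketch of such a relative-bias module (hypothetical names, without the position binning used by the original T5; assumes the usual torch/nn imports):

class RelativeBias(nn.Module):
    def __init__(self, n_heads, max_len):
        super().__init__()
        # one learnable scalar per head per relative distance in [-(max_len-1), max_len-1]
        self.bias = nn.Parameter(torch.zeros(n_heads, 2 * max_len - 1))
        self.max_len = max_len

    def forward(self, T):
        pos = torch.arange(T)
        rel = pos[None, :] - pos[:, None] + self.max_len - 1   # (T, T) indices shifted to be >= 0
        return self.bias[:, rel]                               # (n_heads, T, T)

# usage inside attention: wei = wei + self.rel_bias(T)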
RoPE, introduced in RoFormer 2021, serves as the cornerstone of the LLama architecture and has found widespread adoption in other cutting-edge transformer models such as PaLM and Dolly. It revolutionizes the encoding of positional information for tokens by employing a rotation matrix. This matrix seamlessly incorporates explicit relative position dependencies into attention calculations. The RoPE rotation matrix offers remarkable flexibility, accommodating varying sequence lengths, diminishing inter-token dependencies as relative distances increase, and, notably, enhancing linear self-attention with relative position encoding.
The application of RoPE to Query (Q) and Key (K) vectors is elegantly simple. When you compute attention using the standard approach, it inherently infuses the result with relative position information:
q, k = self.rotary_emb.rotate_queries_and_keys(q, k)
# compute attention scores - Flash attention
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
For a comprehensive understanding of its implementation, you can refer to the Colab section. For further exploration and insights, the eleutherAI Blog is an excellent resource. Additionally, you can access the official RoPE library and LLAMA Code Implementation.
One of the significant advantages of RoPE lies in addressing a key limitation of Relative Bias. Calculating relative positional bias, as done in T5, necessitates constructing a complete attention matrix for all pairs of positions in the input sequence. This operation has a quadratic time complexity, which is manageable for smaller input sequences but quickly becomes computationally intensive for longer ones. In this regard, RoPE shines, particularly when using efficient alternatives to softmax attention, including kernelized variants like FAVOR+, which don't require the computation of the full attention matrix.
AliBi employs scalar bias values, similar to the Relative Bias technique. However, instead of learning these values, it derives them using a straightforward formula. The essence of AliBi lies in penalizing the attention assigned by a query to a key based on their relative distance. When a query and key are in close proximity, the penalty is minimal, but it increases significantly as they move farther apart. The biases decrease linearly in the log scale, and each head has a different slope. The slope calculation is as follows:
import math
import torch

def get_slopes(n_heads: int):
    # slopes form a geometric sequence; handle the closest power of two first
    n = 2 ** math.floor(math.log2(n_heads))
    m_0 = 2.0 ** (-8.0 / n)
    m = torch.pow(m_0, torch.arange(1, 1 + n))
    if n < n_heads:
        # interleave extra slopes when n_heads is not a power of two
        m_hat_0 = 2.0 ** (-4.0 / n)
        m_hat = torch.pow(m_hat_0, torch.arange(1, 1 + 2 * (n_heads - n), 2))
        m = torch.cat([m, m_hat])
    return m
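With these slopes, the causal AliBi penalty that gets added to the attention logits can be sketched as follows (a hypothetical continuation of the attention code above, where wei has shape (B, N, T, T)):

slopes = get_slopes(n_heads)                          # (n_heads,)
pos = torch.arange(T)
rel_dist = pos[None, :] - pos[:, None]                # (T, T); <= 0 for past positions
alibi_bias = slopes[:, None, None] * rel_dist[None]   # (n_heads, T, T) linear penalty per head
wei = wei + alibi_bias                                # future positions are removed by the causal mask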
AliBi is very fast and is used in MPT. It was the first architecture to enable 100K context length. It outperforms Rotary embeddings in accuracy when evaluating sequences that are longer than the ones the model was trained on (extrapolation).
This is a recent attention mechanism emerging from Chinese AI research labs and has yet to be integrated into large foundational LLM architectures. Notably, it brings significant advancements to the attention mechanism, particularly when dealing with longer contexts. The AliBi and RoPE penalties grow with relative distance, which heavily discourages the model from picking up context information from far-away positions. The landmark-based approach addresses this limitation: it employs a landmark token to represent the overall context of a block of text, and then leverages this landmark token to modulate the attention to individual tokens within the block. This innovative approach allows information from distant parts of a text to contribute efficiently to the attention value.
I am curious to see its adoption in new LLMs. As this is more effective on longer context length, I have skipped the implementation here.
Expanding upon our previous tutorial on Large Language Models (LLMs), we perform evaluations using the same tiny Shakespeare dataset. For educational purposes, I've created scaled-down versions of LLMs featuring 8 heads, 64 embedding dimensions, and a 32-token context. You're encouraged to give it a try; the training process requires only 10 minutes on Google Colab GPU. The figure below provides a comparative analysis of training parameters, as well as training and validation loss/perplexity performance.
As evident, AliBi boasts the fewest parameters, closely followed by RoPE. While the differences may appear relatively small in this context, they become more pronounced as you increase the context length from 32 tokens to 32,000 tokens or more. In our approach, Relative Bias, AliBi, and RoPE, all exhibit similar learning capabilities when considering the validation set perplexity. Importantly, this performance enhancement is achieved with the addition of only a minimal number of parameters. For a more detailed comparison, you can refer to the WandB experiment tracking dashboard.
In this single post, we have successfully implemented scaled-down versions of T5 (decoder only), LLama, and MPT models from an architectural standpoint. To maintain simplicity for educational purposes, we've avoided some of the more complex optimizations. Still, it is powerful enough to showcase the effectiveness of recent relative position embedding evolution and where it is headed.
Next, we will take a look at several advancements that have been made in Efficient attention mechanisms such as Sparse Attention(BIG BIRD), FAVOR+ (Performer), MultiQuery Attention, and Longformer(Sliding Attention).
In this post, we will code fundamental Transformer blocks, and embark on a journey to build a GPT model from the ground up. Our journey starts with Karpathy's guide on GPT from scratch - implementing tokenisation, self-attention, multi-head and causal attention, and trainable transformers. Building upon this foundation, we will further improve by introducing optimisation techniques like PreNorm, weight tying, flash attention, and merged QKV computation.
While this post introduces you to essential concepts with bits of code, you can seamlessly put these concepts into practice by following along with the linked Colab Notebook, which provides a step-by-step code implementation.
Hands-on Notebook: Github
As introduced in the attention paper, Query (Q), Key (K) and Value (V) based efficient attention mechanism forms the core of the Transformer architecture. The attention mechanism calculates the output as a weighted sum of the values, where the weight assigned to each value is determined by the scaled dot-product of the query with all the keys. This allows the model to weigh the importance of each input token when making predictions.
Here’s an example of how attention can be implemented:
B, T, head_size = 4, 8, 64 # batch, token_context_length, head_size
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)
# Attention calculation
attention = torch.einsum('b t h, b s h-> b t s', q, k) * head_size ** -0.5
In Self-Attention, Q, K, and V are derived from a single input representation. On the other hand, in the domain of cross-attention, Q originates from one input, whereas K and V are extracted from another.
In addition to self-attention, decoder-based transformer models also use causal attention. This type of attention masks future tokens to prevent the model from "cheating" by using information that is not available at the current time step. It is implemented by masking with a lower-triangular matrix.
attention = attention.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
attention = F.softmax(attention, dim=-1)
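Here self.tril is a lower-triangular (block_size, block_size) buffer of ones, as registered in the multi-head module shown below. To complete the computation, the normalised weights aggregate the values - a minimal continuation of the snippet above:

out = torch.einsum('b t s, b s h -> b t h', attention, v)   # (B, T, head_size): weighted sum of values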
Multi-Head Attention capitalizes on the power of multiple heads to learn and calculate diverse attention patterns. This empowers the model to simultaneously focus on information from various representation subspaces at distinct positions. The following example illustrates how Multi-Head Attention module can be implemented with the causal attention mechanism:
class MultiHeadAttention(nn.Module):
    """ multi head of self-attention """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size
        self.key = nn.Linear(head_size, head_size, bias=False)
        self.query = nn.Linear(head_size, head_size, bias=False)
        self.value = nn.Linear(head_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout_wei = nn.Dropout(dropout)
        n_embd = num_heads * head_size
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout_proj = nn.Dropout(dropout)

    def forward(self, x):
        # Reshape the tensor to B N T H for N heads
        B, T, C = x.shape
        x = rearrange(x, 'B T (N H) -> B N T H', N=self.num_heads)
        k = self.key(x)   # (B,N,T,H)
        q = self.query(x) # (B,N,T,H)
        # compute attention scores
        wei = torch.einsum("BNTH, BNSH -> BNTS", q, k) * self.head_size**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout_wei(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,N,T,H)
        out = torch.einsum("BNTS, BNSH -> BNTH", wei, v)
        # concat and mix N Heads
        out = rearrange(out, 'B N T H -> B T (N H)')
        out = self.dropout_proj(self.proj(out))
        return out
The building block of the transformer is crafted by fusing Multi-Head Attention with layer normalization, a non-linear MLP, and residual connections. Within a transformer-based Language Model (LLM), several layers of this foundational transformer block are used to capture intricate contextual relationships and produce cohesive output.
class TransformerBlock(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
Before we dive into the complexities of building a GPT, let’s start with the basics. A bigram model is a simple language model that predicts the next word in a sequence based on the previous word. Here’s an example of a BigramLanguageModel class:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly maps to the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        # idx is a (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        return logits

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
This model can be used to generate text by predicting the next word in a sequence and then feeding that word back into the model to generate the next word. The colab notebook extends the above implementation with loss calculation and depicts simple generative LM training.
While a basic Bigram model can generate coherent text, it has its limitations. For example, it only considers the previous word when making predictions, which can result in repetitive or nonsensical text.
To overcome the limitations of a basic Bigram model, we can use a decoder-based transformer model. This type of model uses self-attention to consider the entire input sequence when making predictions.
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, max_token_len, n_layer):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(max_token_len, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
In PreNorm, layer normalization is applied before the sublayer (e.g., self-attention or feed-forward) instead of after it, as in the original transformer model (PostNorm). Our GPT model already utilises PreNorm, which provides better training stability.
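As a quick side-by-side (reusing the names from the TransformerBlock above), the two orderings differ only in where the normalization sits:

# PostNorm (original Transformer): normalize after the residual addition
x = self.ln1(x + self.sa(x))
# PreNorm (as used in our TransformerBlock): normalize the sublayer input, keep the residual path clean
x = x + self.sa(self.ln1(x))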
This method massively reduces the total number of parameters and improves the performance of language models by tying (sharing) the weights of the embedding and softmax layers. The intuition behind it is that both the embedding layer and the softmax layer learn word representations, such that similar words (in meaning) are represented by vectors that are near each other (in cosine distance). By sharing the weights between these two layers, the model can learn more efficiently and avoid overfitting.
# https://paperswithcode.com/method/weight-tying
self.token_embedding_table.weight = self.lm_head.weight
The key idea behind Flash Attention is to make the attention algorithm IO-aware, meaning that it takes into account the reads and writes between different levels of GPU memory. Flash Attention uses tiling to load blocks of query, key, and value tensors from GPU high bandwidth memory (HBM) to SRAM (its fast cache). It then computes attention with respect to that block and writes back the output to HBM. By loading in blocks, Flash Attention is able to reduce the number of memory reads/writes between GPU HBM and GPU on-chip SRAM. This results in fewer HBM accesses than standard attention, making Flash Attention 7.5x faster and more memory-efficient.
In PyTorch 2.0, it can be used simply via the torch.nn.functional.scaled_dot_product_attention function. Refer to the Colab section for the transformer with Flash Attention.
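A minimal drop-in usage inside the attention module, replacing the manual mask, softmax, and weighted sum (q, k, v shaped (B, num_heads, T, head_size)):

import torch.nn.functional as F

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # causal masking handled internally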
Up until now, we've employed a custom character-level tokenizer, requiring the model to decipher meaning character by character. To simplify this process for the model, we can use more abstract units - words and sub-words - just like constructing a house is easier with bricks than with individual grains of cement. However, this approach introduces a new challenge: a notable increase in the number of unique tokens (vocabulary). To address this, we incorporate subword tokenization to strike a balance.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
input_text = "Your input text goes here."
input_ids = tokenizer.encode(input_text)
Typically, I opt for existing tokenization methods, which enable the utilization of pre-trained models. Nonetheless, the beauty of subword tokenization lies in its adaptability to domain-specific text. This is exemplified in the last colab section, where I demonstrate how subword tokenization can be tailored to custom content.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(text.split("\n"), trainer=trainer)
tokenizer.get_vocab_size()
Our generated results significantly improve by integrating these optimizations into our training process for the mini GPT model, even with just 1 million parameters. It's worth noting that state-of-the-art (SOTA) GPT models scale up to sizes exceeding 100 billion parameters, achieving even lower perplexity and more advanced generative capabilities. Nevertheless, despite these scaling differences, the core principles and fundamentals underlying the architecture remain consistent.
Several advancements have been made to improve the capabilities and efficiency of transformer models:
In our upcoming blog post within this series, our focus will shift towards exploring enhancements related to positional embeddings. Stay tuned and don't hesitate to connect with me on LinkedIn for further discussions.
The advent of "Attention Is All You Need" by Vaswani et al. in 2017 marked a monumental milestone in the development of Language Models (LLMs). It revolutionized language processing with its Transformer Architecture based on a novel MultiHead Attention Mechanism (Q, K, V). It further provided a practical bidirectional encoder, the iterative decoder and the causal decoder architectures to efficiently solve NLP use cases. Andrej Karpathy's Let's build GPT: from scratch video fundamentally showcased the power of the same transformer in recent large language models. This video inspired me to consolidate further developments in transformer architectures in recent LLMs via code. Let's embark on a journey into the world of modern LLMs with a conceptual introduction and explore their transformative capabilities.
We will cover:
At the core of modern LLMs lies the Transformer architecture. The attention paper introduced the following key components:
The attention mechanism calculates the output as a weighted sum of the values, where the weight assigned to each value is determined by the scaled dot-product of the query with all the keys:
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. Finally, Multi-Head attention is integrated with layer norm, MLP-based non-linearity, and residual connections to form the transformer block. The transformer-based versatile Encoder-Decoder model enables LLMs to excel in various language tasks such as machine translation and text generation, thanks to its ability to capture contextual dependencies and generate coherent output. Overall, these components provided three fundamental advantages to transformers to revolutionise the AI roadmap:
The first evolution of LLMs happened in model architecture from Encoder-Decoder to Encoder-only, Decoder-only, and prefix decoders.
A causal decoder has the advantage of spreading the computational and cognitive load by generating one word at a time. For example, for generating a text containing 100 words (tokens), the model will perform 100 passes and will be able to utilise 100x computing with the same parameters. This simplifies the task and improves the performance. Similar examples of improving capabilities by spreading cognitive and computing load are also manifested in other domains. In Image generation, the diffusion models also leverage similar strategies to outperform GAN models. However, similar to GigaGAN, scaling of the model and training of encoder models can also lead to an equally capable and more efficient inference. With further improvement in computing, more architectures similar to prefix decoders will emerge.
Another evolution happened within the transformer block to improve the capabilities and efficiency. Some of the notable ones are:
A Survey of Large Language Models paper succinctly summaries the significant LLM model architecture in a table:
The pretraining phase is the initial step in the development of modern LLMs, where the model learns from a massive amount of text data without specific task supervision. Training objectives involve either predicting the next word/token (causal decoders) or predicting missing words in a given context. The latter objective is known as "masked language modelling": a certain percentage of tokens in the input sequence are randomly masked, and the model is tasked with predicting those masked tokens based on the context. This unsupervised learning process allows the model to gain a deeper understanding of language and form strong word representations, which has led to the emergence of intelligence.
The size and quality of the training dataset are essential for producing a good-quality LLM. Scaling laws provide a relationship between model size and the optimal dataset size for training [calc]. Smaller models can still be trained on bigger datasets, but with diminishing returns. In today's LLMs, pretraining is performed on trillion-token-scale datasets aggregated from the web, books, wikis, GitHub and other sources. Llama and RedPajama were pre-trained on 1.2T tokens. Falcon improved the performance over Llama and RedPajama by refining the collected dataset through multiple stages of filtering. Llama 2 further improves base performance by training on 2 trillion tokens. The FineWeb-Edu dataset further improved training efficiency.
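As a rough back-of-the-envelope sketch of this relationship (using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter; an approximation, not the exact fitted law):

def compute_optimal_tokens(n_params, tokens_per_param=20.0):
    return tokens_per_param * n_params

print(compute_optimal_tokens(7e9) / 1e12)    # ~0.14T tokens for a 7B model
print(compute_optimal_tokens(70e9) / 1e12)   # ~1.4T tokens for a 70B model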
While training objectives are usually simple, the dataset and model sizes require a different scale of infrastructure to support large-scale pretraining. It usually involves 250-2048 GPUs and costs around 1-10M USD to train models at the 10-billion-parameter scale. Bloom's blog is a good read on what goes on behind large-scale pretraining infrastructure. Also, simons blog provides good estimates.
Emergent properties in language models highlight an intriguing phenomenon: as the number of parameters increases, performance initially fluctuates randomly until reaching a critical threshold. At this point, a distinct property emerges, leading to a remarkable improvement in performance. Despite a lack of complete understanding of this phenomenon, the "World model" hypothesis suggests that predicting the next word effectively requires learning to model the world. As the number of parameters increase, these models develop more comprehensive representations of human knowledge and language, essentially forming these mini “world models”. This critical juncture empowers the models with a heightened understanding of human language and enables contextually nuanced responses. As a result, they become valuable assets across various applications, including language translation, text generation, and interactive virtual assistants.
Emergent abilities of large language models. Image Source
Pretrained LLMs may exhibit intelligent behaviour, but their training to predict the next word on vast internet text can also lead to generation outputs that are untruthful, toxic, or reflect harmful sentiments. In other words, these models' outputs aren’t aligned with their users' desired behaviour. Also, user-aligned behaviour is not enough. A user might ask to kill another human or a terrorist group might utilise the LLM intelligence for malicious purposes. This leads to the requirement of safety and ethics in AI. As we are rapidly progressing towards superintelligence, alignment becomes even more crucial.
To address human alignment, OpenAI introduced the reinforcement learning from human feedback (RLHF) technique. Additionally, for a more robust embodiment of human ethics principles, Claude utilizes constitutional AI to make it safer and more interpretable.
Fine-tuning adapts pre-trained models to specific tasks or domains using smaller labelled datasets. During pretraining, LLMs gain a comprehensive understanding of language and accumulate a vast knowledge base. The transformer models transfer their general language knowledge to new tasks, demonstrate domain generalization capabilities, and deliver improved performance on downstream tasks, all with the utilization of only limited labelled data. This two-step approach empowers LLMs to excel in a wide range of practical applications, showcasing their adaptability and versatility.
Galactica mixes pretraining data with finetuning data to unlock alignment without harming base transformer capabilities. Orca's progressive learning is driven by careful sampling and selection of extensive and diverse imitation data, further enriched by incorporating valuable signals from GPT-4 such as explanation traces, step-by-step thought processes, and complex instructions. LIMA by Meta fine-tunes a 65B-parameter LLaMA model with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modelling, to provide a competitive model.
Due to the large scale of LLMs, fine-tuning is also very costly. This led to the emergence of Parameter-Efficient Fine-Tuning (PEFT) methods, which enable efficient adaptation of pre-trained language models to various downstream applications by fine-tuning only a small number of (extra) model parameters. These are also often used to distil larger, more capable models into smaller ones. However, The False Promise of Imitating Proprietary LLMs shows that a substantial capabilities gap remains between open LMs adapted with such techniques and closed LMs: Vicuna-like open LMs easily copy the persona of ChatGPT/GPT-4 but significantly lack base reasoning and factuality capabilities.
In our upcoming blog, we will embark on an exciting journey of coding a transformer model inspired by Karpathy's tutorial. Taking it a step further, we will implement optimizations such as weight tying, flash attention, and other enhancements to create an even more powerful and efficient model.
Here are some of the go-to ML tools that I find extremely useful for computer vision applications.
Contrastive methods currently achieve state-of-the-art performance in self-supervised learning. It not only enables learning from unlabeled data but even pushes accuracies for supervised learning and search-retrieval tasks. AI giants like Google, Meta, OpenAI are actively working and publishing new methods in the fields to improve Self-supervised learning. It also happens to be my recent work and research interest. I will introduce some fundamentals of contrastive learning here to build understanding. In the next post, we will cover the progress we have made.
We will cover:
There are three prominent self-supervised representation learning methods: generative, multi-task modelling, and contrastive. Generative representation learning approaches use autoencoders or adversarial learning to learn a latent embedding space for regenerating images. Generative methods typically operate directly in pixel space and require a high level of detail for image generation. This makes them computationally expensive and pushes model sizes up, which may not be necessary for representation learning.
Multi-task modelling is also a very powerful method for representation learning. It builds a joint embedding space by performing multiple tasks such as classification, detection, translation, etc., and is used in state-of-the-art large language models. However, good task selection is very important; otherwise, performance on unrelated tasks may be suboptimal.
Contrastive methods currently achieve state-of-the-art performance in self-supervised learning. Contrastive approaches avoid a costly generation step in pixel space by bringing the representation of different views of the same image closer (‘positive pairs’) and spreading representations of views from different images (‘negative pairs’) apart.
The fundamentals of contrastive learning are: creating positive and negative pairs, using a proper measure of embedding distance, defining a good training objective that optimises that distance, using a good model architecture to learn representations, and using good learning strategies so that loss gradients flow well during model learning. Let's dive a little deeper into each aspect:
To perform contrastive learning, you need positive and negative pairs. How they are created depends on whether the dataset is labelled or unlabelled. For labelled datasets, images belonging to the same class form positive pairs and images from different classes form negative pairs. For unlabelled datasets, positive pairs are created via augmentations of the same image, while augmentations of different images constitute negative pairs.
The number of positive and negative samples is also very important. Siamese networks used just a pair of images, alternating positive and negative. Triplet loss improved on this by using 3 images - anchor, positive and negative. Most of today's state-of-the-art methods use multiple positive and negative samples: the larger the sample size, the more information the model has about which features to bring closer or pull apart. For example, the MoCo v3 researchers found a negative sample size of around 4,000 to be optimal for the ImageNet dataset.
If the negative sample size is small, hard negative mining can be used to find the most effective negative samples, which significantly speeds up training. However, hard negative mining is more effective on labelled data; with unlabelled or noisily labelled data, it degrades performance. Some recent work (BYOL, SwAV, VICReg, SimSiam) even shows that using only positive samples yields better results, removing the need for negative samples altogether. However, these methods require longer training time.
The augmentations used for positive samples in state-of-the-art methods have converged to the combination of weak and strong augmentation from the SimCLR method. One positive sample is generated through weak augmentation and another via strong augmentation; the strongly augmented image's representation is then pulled closer to the weakly augmented one's.
How the distance between two representation vectors is measured directly affects the learned representation space. Some popular distance measures are Euclidean distance, cosine similarity, Manhattan distance, KL divergence, JS divergence, and Wasserstein (EM) distance. Each of them imparts special properties to the representation space, so choosing an appropriate distance measure is important.
The training objective calculates the final loss value using the distances of the provided positive and negative samples. This loss value is used to optimise the model. Popular loss functions include triplet margin loss, NT-Xent/InfoNCE (used in SimCLR and MoCo), and soft nearest neighbour loss.
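As a concrete example, here is a minimal sketch of the NT-Xent (SimCLR-style) objective, where z1 and z2 are the projector outputs of two augmented views of the same batch:

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    N = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), cosine space
    sim = z @ z.t() / temperature                        # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))                    # exclude self-similarity
    # the positive for sample i is its other view (i+N or i-N)
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)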
State-of-the-art methods use a four-part network architecture for contrastive learning - a backbone model (view encoder), a representation layer, a projection layer and a prediction layer. The representation layer provides a higher-dimensional representation space, which can be used as input to a classifier or other downstream tasks. The projection layer is a lower-dimensional representation space used for similarity measures. The prediction layer not only prevents collapse by providing asymmetry but also encourages the network to find features which are consistent between layers. Some approaches drop the prediction layer.
Grads Flow
EMA
MoCo and BYOL models use EMA or momentum for target weight updates. It brings stability to representational space.
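A minimal sketch of such a momentum (EMA) update of the target network from the online network, assuming tau close to 1 (buffers such as BatchNorm statistics may need separate handling in a full implementation):

import torch

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1 - tau)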
Target Temperature
In a teacher-follower (student) setup, the teacher's projection can be sharpened or the follower's can be smoothed. This improves the feature sharpness of the student.
Again, first, we should cover where it should not be used. It should not be the first step towards model/representation building. The use of pre-trained models via transfer learning is always a good start. There are some applications where it shines:
Label-efficient training (one-shot/few-shot) - Self-supervised learning lets the model harness unlabelled data to learn a representation space. Linear evaluation or KNN-based methods in the one- or few-shot regime provide significant results. The model can then either be fine-tuned with a small labelled dataset or used as a backbone for multiple classifiers on top. Typical factory or medical use cases fit here: there is little labelled data or pre-trained model access, and you need to support multiple use cases. Collecting a large raw dataset is easy, but building a well-labelled dataset not only requires experts for labelling but is also a challenging task.
Pretraining - Self-supervised pretraining achieves better results than transfer learning from pre-trained models, even when a fully labelled dataset is available.
Search and retrieval - The projector layer provides a really good feature vector for search and retrieval of similar items.
If you are in the TensorFlow ecosystem then TensorFlow Similarity is a really good option. It provides self-supervised learning on both labelled and unlabelled data, lets you control representation, projector and predictor layer configurations and have most state-of-art loss function implementations such as TripletMarginLoss, SoftNearestNeighborLoss etc.
Pytorch Metric Learning is a good library in the PyTorch ecosystem for labelled and unlabelled datasets. It also provides state-of-the-art loss functions, distance measures and miners for hard negative mining.
If you are a researcher, you can also look into official repositories. These are also well organised. I prefer this method more if I want to tweak and try out some new ideas.
Thank you for reading. Here are some good references to deepen your understanding:
TLDR; Use Ray Tune or NNI, they provide SOTA algorithms out of the box for efficient HPO.
Deep neural networks have many tunable hyperparameters which need to be selected to get maximum accuracy out of your datasets and models. It can be used to find the best neural architecture (NAS) by optimising layer choice, number of layers, and layer width, or finding the best learning algorithm by optimising learning rate, momentum, optimizer choice, data augmentation etc.
We will cover:
First, let's cover where not to use HPO. If you have a dataset similar to the pre-training datasets and you are not hunting for the last 3-4% of accuracy gains, you should not use HPO: it will consume 10x more resources and time for such gains. In my experience, good authors and libraries provide good default values that are the result of both intensive HPO and domain expertise. Also, model-architecture HPO should not be the primary way to optimize the accuracy/performance tradeoff for standard problems. There is a wide range of network architectures and sub-architectures available that trade accuracy for performance; for example, in vision classification you can choose architectures from ViT to EfficientNet to MobileNet, with small, medium and large sub-architectures within each. They provide 10-100x performance gains with a tradeoff of 3-10% accuracy. Optimisation techniques are the next best choice. However, HPO can also be used to evaluate all of these choices.
But if you are an academic user where a 1-2% gain decides the SOTA, or a 3-4% improvement has a significant business impact, then HPO is a really good choice. HPO gives very good results when standard models and configurations are not performing well. It shines when you are in no-man's land, such as new algorithm development or custom model architecture design, where manually tuning a large number of parameters is impossible. I have used it to design a custom model architecture that provides 10x performance while retaining accuracy for a specific segmentation task.
HPO consists of two parts - a search algorithm and a scheduler. The search algorithm selects hyperparameter values for a trial from the full parameter space, and the scheduler decides the run duration or resource allocation for each trial.
Two basic search algorithms are grid search and random search; random search works better and is also the best choice for embarrassingly parallel optimisation. A state-of-the-art search algorithm is Bayesian optimisation. It fits a surrogate Bayesian model over the search space to efficiently find optimal values: it can select the next value based on maximum uncertainty (exploration) or maximum expected gain (exploitation) from the modelled search space, and then refine the model based on trial results. Two major drawbacks are that this algorithm in its naive form is not parallel (the next value depends on the current evaluation) and that it only supports continuous values. However, advanced implementations such as TPE and BOHB address these limitations.
A naive scheduler runs all trials for their complete duration and evaluates them afterwards. Successive Halving, or Asynchronous Successive Halving (ASHA), keeps the best half of the trials and discards the half that are not doing well, continuing until a single configuration remains. It allocates resources to promising trials, resulting in faster tuning. The Hyperband scheduler improves on this further by starting new configurations in place of discarded trials, increasing the number of trials evaluated and yielding better hyperparameters. ASHA and Hyperband require that all configurations can be stopped early and that a validation score can be obtained.
State-of-art methods such as BOHB, combining bayesian optimisation and hyperband can lead to 5-20x speed up on HPO.
Ray Tune is a really good tool for HPO. It is simple and powerful, and as part of the Ray ecosystem it scales to multi-GPU and distributed environments. It provides all of the above and additional SOTA algorithms; you can find more details here. Microsoft NNI is also a really good choice if you are in the PyTorch ecosystem, and it is even better for NAS use cases, but its support for non-PyTorch frameworks is limited.
HPO with Bayesian optimisation and HyperBand scheduler can be quickly implemented in Ray Tune via the following reference:
from ray import tune
from ray.tune.search.hyperopt import HyperOptSearch
from ray.tune.search import ConcurrencyLimiter
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.air import session

# 1. Define an objective function.
def trainable(config):
    # import torch/tf/keras inside the trainable if used -- known issue with GPU trials
    for x in range(20):  # "Train" for 20 epochs.
        one_epoch_training(model, config["lr"], config["a"])
        accuracy = calc_accuracy(model)
        session.report({"accuracy": accuracy})  # Send the score to Tune.

# 2. Define a search space.
search_space = {
    "lr": tune.loguniform(1e-8, 1e-2, base=10),
    "a": tune.choice([1, 2, 3]),
}

# 3. Define the search algorithm and scheduler.
search_algo = HyperOptSearch()
search_algo = ConcurrencyLimiter(search_algo, max_concurrent=4)  # Limit concurrent trials since BO doesn't parallelize very well
scheduler = AsyncHyperBandScheduler(grace_period=5)

# 4. Start a Tune run that maximizes accuracy.
tune_config = tune.TuneConfig(
    search_alg=search_algo,
    scheduler=scheduler,
    metric="accuracy", mode="max",
    num_samples=20,  # Number of trials
)
tuner = tune.Tuner(
    trainable,
    tune_config=tune_config,
    param_space=search_space,
)
results = tuner.fit()
print(results.get_best_result(metric="accuracy", mode="max").config)
This should get you started on the journey of optimizing hyperparameters efficiently with state-of-the-art algorithms. Combining it with techniques like µTransfer makes it even more promising: OpenAI tuned a 40-million-parameter proxy GPT-3 model before transferring the optimal hyperparameters to the 6.7B-parameter variant. With only a 7% extra training budget for hyperparameter search, it outperformed the 13B variant. To learn more about this, these are some good references: