Supervised vs. Contrastive Classification: How Does the Video Codec Influence the Prediction Scores?
By Sebastian Vater - Fellow researcher in ID&V
Why are we interested in different codecs and their effect on prediction scores anyway?
Where we work, at Fourthline, our task is to help safely onboard clients for our customers, such as banks, fintechs, and non-financial businesses, in a KYC (Know Your Customer) process. We research and develop Machine Learning and AI models that verify a client’s identity and check that it is a live person who wants to onboard to the platform and use its services.
While experimenting in our lab with Machine Learning systems, we observed something interesting: Models trained with different loss functions respond differently when we change how the input video is encoded. This opened up the question of whether (and, if so, to what extent) some training paradigms are more sensitive than others to different video codecs, and thus, to compression artifacts.
Now, why would we care what influence an encoding or transcoding has on the output of our services? We could just optimize for the codec at hand and be happy! Well, leaving academia behind, we interact in an open world, and so do our services. Fourthline serves a multitude of banks, fintechs, and non-financial businesses. Each of them individually preprocesses its data with its own data capture and backend engineering before running it through our algorithms to check the liveness and identity of a person. We do care that our services always behave as expected — no matter the input codec!
In this post, we want to explore specifically how the training objectives coming from two paradigms — supervised and contrastive learning (see [8] for an overview) — might have an effect on the created embedding space (e.g. structure or robustness against noise and artifacts). We want to take a closer look at some characteristics of these losses and see how these affect the classification behavior in real-world scenarios. In particular, the goal is to fathom possible influences of the different losses on the classification score in presence of image space artifacts, such as those resulting from different codecs being used for the input data.
Max Crous and Roman Aleksandrov of Fourthline’s Biometrics team conducted the experiments and created the graphs on which this blog post is based. They furthermore shared valuable insights and ideas through discussions.
What we found at Fourthline
At Fourthline, our product architecture employs multiple integrated modules that work in concert to maximize their collective advantages. This modular approach allows our systems to address challenges through multiple parallel processing paths.
As one example, different parts of the liveness system are researched and trained following different learning paradigms and architectures. In doing so, we make use of, among others, supervised and contrastive learning methods.
Since we are interested in the effect of different video codecs on the outcome of our algorithms, we started performing some experimental tests on data in our domain, to which we applied different encoding and transcoding preprocessing. Below we show some qualitative, exemplary results for two particular systems. Both systems are trained on the same data domain; one is trained with a supervised binary cross-entropy loss (BCE), the other with a contrastive loss (Contrast.), where a local density estimator is employed for estimating decision thresholds.
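(As a small aside, such a density-based threshold can be sketched roughly as follows. This is a minimal illustration using scikit-learn's KernelDensity, not Fourthline's production estimator; the score distributions and the crossing-point criterion are placeholder assumptions.)

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 1-D confidence scores of genuine and spoof validation samples.
genuine_scores = np.random.normal(0.8, 0.05, size=1000)[:, None]
spoof_scores = np.random.normal(0.3, 0.10, size=1000)[:, None]

# Fit a local density estimate per class.
kde_genuine = KernelDensity(bandwidth=0.02).fit(genuine_scores)
kde_spoof = KernelDensity(bandwidth=0.02).fit(spoof_scores)

# Pick the threshold where the two estimated densities cross,
# searching between the two class means.
grid = np.linspace(spoof_scores.mean(), genuine_scores.mean(), 501)[:, None]
gap = kde_genuine.score_samples(grid) - kde_spoof.score_samples(grid)
threshold = grid[np.argmin(np.abs(gap))].item()
print(f"estimated decision threshold: {threshold:.3f}")
```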
The plots below show the distribution of the difference in score outputs (e.g. Softmax) of the differently trained systems when we change the input video codec. If a sample to be classified scored some value x0 on the original codec (represented by the horizontal line ‘0.00’), the colored dots show where the confidence output for each test sample lands when the model is fed the same video with a different encoding, resulting in scores x0 + dx. The colored areas represent the spread of dx.
In these experiments, starting from the original MJPEG format (each frame in a video is encoded in JPEG format), we evaluate chroma-subsampled MJPEG (4:4:4 and 4:2:0) and intra-frame FFV1 (part of the FFmpeg project). We also show below the comparison between FFV1 and the lossless VP9 encoding.
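To make the setup concrete, here is a minimal sketch of how such a comparison can be run. The ffmpeg flags and the `score_video` callable are illustrative placeholders (not our production pipeline), and the encodings only approximate the variants named above.

```python
import subprocess
from typing import Callable
import numpy as np

# Illustrative ffmpeg re-encodings of a source video (flags are examples only).
ENCODINGS = {
    "mjpeg_444": ["-c:v", "mjpeg", "-pix_fmt", "yuvj444p", "-q:v", "2"],
    "mjpeg_420": ["-c:v", "mjpeg", "-pix_fmt", "yuvj420p", "-q:v", "2"],
    "ffv1": ["-c:v", "ffv1"],
    "vp9_lossless": ["-c:v", "libvpx-vp9", "-lossless", "1"],
}

def transcode(src: str, dst: str, codec_args: list[str]) -> None:
    """Re-encode src into dst with the given ffmpeg codec arguments."""
    subprocess.run(["ffmpeg", "-y", "-i", src, *codec_args, dst], check=True)

def score_differences(videos: list[str],
                      score_video: Callable[[str], float]) -> dict[str, np.ndarray]:
    """For every encoding, collect dx = score(encoded) - score(original).

    `score_video` is assumed to run your trained model on a video file
    and return a confidence in [0, 1].
    """
    diffs: dict[str, list[float]] = {name: [] for name in ENCODINGS}
    for src in videos:
        x0 = score_video(src)
        for name, codec_args in ENCODINGS.items():
            dst = f"{src}.{name}.mkv"
            transcode(src, dst, codec_args)
            diffs[name].append(score_video(dst) - x0)
    return {name: np.array(d) for name, d in diffs.items()}
```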
Distribution of differences of classifier scores when changing the input codec. Image generated and owned by Fourthline.
So, what do we see? We immediately see that the model trained with a contrastive loss performs significantly differently on the encoded data than the BCE-trained model does. For VP9, assuming an approximately Gaussian statistic, the 1-sigma area contains score differences of up to +/- 0.05 — this means a change in confidence from, e.g., 0.8 to 0.75–0.85. This is a huge difference for a model output!
Distribution of differences of classifier scores when changing the input codec. Image generated and owned by Fourthline.
Exemplary visualizations of artifacts stemming from different video codecs. We compute pixel-wise differences between the original video and the encoded one. The results are to be interpreted qualitatively. From left to right: (1) Original image, a person trying to fool the Liveness system by applying a paper mask (no video compression). (2) Encoded with MJPEG. (3) Encoded with FFV1. (4) Encoded with VP9. Images owned by Fourthline.
You may be asking: Why does the contrastive model behave so differently? And why does the BCE model not? The contrastive model seems to be much more vulnerable to image encodings/transcodings. We asked ourselves: what could be the reason, and how can we prevent it?
The sections below shed some light on possible explanations for this behavior. (Note: since this is just a blog post, scientifically sound proofs will be published elsewhere.)
Discussion
First, we see that the divergence of scores is consistently high for the contrastive learning approach across different codecs. Given that all experiments were conducted on the same data domain, we focus on taking a closer look at how the different learning schemes could explain this phenomenon.
In general, to prevent some undesired behavior in any technical system, it is most useful to understand what causes the observed behavior.
The Two Losses: Characteristics
A first possible explanation that comes to mind is that the learned feature spaces reveal different structures, in particular different inter-class structures, for the two losses. Let us look at some characteristics of the losses (we assume a binary problem for our argumentation in this post):
Cross-Entropy binary loss (BCE):
BCE finds a hyperplane to separate the two classes, where it operates on class likelihoods (e.g. Softmax outputs).
There are no constraints on the model’s learned distribution (e.g. of class 1 and class 2 — they can assume any shape).
Learns separable features — any features, regardless of whether similar data points of the same class are close to each other (as long as they are ‘far away’ from all data points of the other class).
Summarizing, BCE does not care much about the intra-class distribution, nor about the shape of the distribution itself — it just wants to create features that separate a sample of one class from all samples of the other class.
Contrastive loss:
Rather than finding a hyperplane, it ‘organizes’ the feature space:
The contrastive loss pulls similar images together into some form of a cluster while pushing dissimilar images away from each other. So it operates on data points and their relative positions to each other.
Learns a similarity metric (often cosine similarity) that is normalized and temperature-scaled, i.e. the NT-Xent contrastive loss.
The contrastive training objective usually encourages the hidden representations to be uniformly distributed on a hypersphere [1].
Is exposed to the effect of dimensional collapse [2]. This might be caused/enhanced by feature suppression [3] for the contrastive approach, see also below.
So, the contrastive learning paradigm wants to cluster similar samples while pushing clusters of dissimilar samples away — the former being an apparent difference to BCE. The sketch below contrasts the two objectives in code.
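To make the difference between the two objectives concrete, here is a minimal PyTorch sketch of both losses. The shapes, the temperature, and the pairing of augmented views follow the SimCLR-style NT-Xent formulation rather than our exact training setup, so treat it as an illustration only.

```python
import torch
import torch.nn.functional as F

# Supervised binary cross-entropy: operates on class likelihoods.
def bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# NT-Xent (normalized, temperature-scaled) contrastive loss, SimCLR-style:
# operates on relative positions of embeddings, pulling the two augmented
# views of each sample together and pushing all other samples away.
def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x D, on the unit hypersphere
    sim = z @ z.t() / tau                                # 2N x 2N cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # exclude self-similarity
    # The positive for sample i is its other augmented view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```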
Our general argument in what follows is based on the fact that video encoding/transcoding — lossless or lossy — causes artifacts to appear in the data and, in general, induces information that a model can only interpret as noise. We want to add context to this relationship and highlight directions from which to draw conclusions.
To get a visual idea, let’s start by looking at typical 2D representations of MNIST embeddings for both BCE and contrastive loss.
A look at latent spaces
The two figures below illustrate the respective embedding spaces on the MNIST dataset for BCE and contrastive-loss training. We can already see that they represent what we summarized above in the bullet points:
Trying, without constraint on the shape, to push a sample of one class as far away as possible from all the other classes’ samples — versus building compact clusters. (Note that, for illustrative purposes, we fall back to a multiclass problem in the figures; the conclusions made here still hold without loss of generality for the binary problem.)
Embedding for MNIST samples from a CNN model trained with BCE.
Though just an illustrative example, the figure for the contrastive approach already lets us draw a corollary: even without extra noise from any video encoding, there is more overlap visible between the clusters than for BCE, making the model more vulnerable to any kind of noise and therefore to changes in classification confidence.
Embedding for MNIST samples from a CNN model trained with contrastive loss.
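For readers who want to reproduce this kind of figure, a minimal sketch of the visualization step is shown below. It assumes an already trained encoder (a placeholder here) and uses t-SNE purely for the 2D projection; a two-dimensional bottleneck layer, as often used for such plots, works just as well.

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from torchvision import datasets, transforms

# Placeholder encoder: swap in the CNN trained with BCE or contrastive loss.
encoder = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(28 * 28, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
encoder.eval()

test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(test_set, batch_size=512)

embeddings, labels = [], []
with torch.no_grad():
    for x, y in loader:
        embeddings.append(encoder(x))
        labels.append(y)
emb = torch.cat(embeddings).numpy()
lab = torch.cat(labels).numpy()

# Project to 2D for plotting (skip this step if the embedding is already 2D).
emb_2d = TSNE(n_components=2).fit_transform(emb)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=lab, s=2, cmap="tab10")
plt.colorbar(label="digit class")
plt.show()
```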
Concluding, if you are working on anomaly or outlier detection, one hint for your research project or product is that you might prefer a contrastive loss over a BCE loss, to make use of the minimization of the intra-class variance!
Going a bit deeper than this illustrative example, we want to look into a few related publications and the possible answers they might yield.
Related works for possible explanations
A. Non-Exhaustive Feature Space
In [2], the authors show that with a training scheme similar to SimCLR [4], a singular value decomposition of the covariance matrix of the embedding vectors (they use a 128-dimensional embedding on ImageNet images) reveals that more than 20% of the dimensions remain basically unused — the dimensional collapse. In a corollary, they deduce:
“With strong augmentation, the embedding space covariance matrix becomes low-rank.”
So basically, the model is not using the entire feature space, thus limiting its capacity and possibly increasing its sensitivity to noise.
From [2]. With increasing strength k of the augmentations on a contrastive network, a large number of the dimensions become irrelevant.
In our work, the contrastive approach is inherently built upon strong augmentation. The dimensional collapse possibly caused by this affects the model’s effective capacity and thus its capability to robustly compensate for noise and artifacts in the input that were unseen during training. The sketch below shows how such a collapse can be diagnosed on your own embeddings.
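As a practical aside, the diagnostic from [2] is easy to run on your own model. The sketch below is a minimal version under the assumption that `emb` is an N x D matrix of embedding vectors collected from a validation set; here random data stands in for real embeddings, and the collapse threshold is an arbitrary choice.

```python
import numpy as np

def embedding_spectrum(emb: np.ndarray, rel_threshold: float = 1e-3):
    """Singular value spectrum of the embedding covariance matrix.

    Dimensions whose singular value falls below rel_threshold times the
    largest one are counted as (almost) collapsed.
    """
    centered = emb - emb.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(emb) - 1)          # D x D covariance
    singular_values = np.linalg.svd(cov, compute_uv=False)
    collapsed = int(np.sum(singular_values < rel_threshold * singular_values[0]))
    return singular_values, collapsed

# Random data standing in for real validation-set embeddings.
emb = np.random.randn(10_000, 128)
spectrum, n_collapsed = embedding_spectrum(emb)
print(f"{n_collapsed} of {emb.shape[1]} dimensions are (almost) collapsed")
```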
B. Learned Feature Distribution
The authors in [5] investigate the effect of different losses, including margin losses and Gaussian Mixture losses, on the feature embedding space. They find that a Cross-Entropy loss does indeed yield a feature space distribution that resembles a Gaussian Mixture distribution, i.e. an organization of the feature space in clusters.
They furthermore find that training with a Gaussian Mixture loss, which enables learning clusters in a supervised manner, increases the robustness of the network. Specifically, they show that deep neural networks with high classification accuracies are vulnerable to adversarial examples when trained with a Softmax/Cross-Entropy loss. Adversarial examples created by the Fast Gradient Sign Method (FGSM) can be seen as adding noise to the image in a controlled way by exploiting the gradient of the loss with respect to the image. Without classifying compression artifacts as either deterministic or probabilistic, one can argue that both changes to the image have the effect of changing the confidence of the classification — the embedding is moving away from the original point in feature space.
The adversarial examples formed by intentionally adding small but worst-case perturbations cause the model to make incorrect classifications with high confidence, proving that small changes can have a drastic effect on confidence values in vanilla Softmax-trained networks. They show this vulnerability by comparing the distributions of the confidence outputs of networks trained with a Softmax/Cross-Entropy loss, a Center loss, and their proposed Large-Margin Gaussian Mixture (L-GM) loss, as shown below. While the specifics of the losses are not of interest here, we just want to acknowledge that the L-GM loss encourages building clusters in feature space and thus tries to reduce intra-class variance (pull similar samples together) and increase inter-class variance (push dissimilar samples away from each other).
From [5]. Distributions of confidences/likelihoods of sample classifications for models trained with different losses.
At first glance, the results above somewhat contradict the argumentation made so far. They find that the BCE loss — push a sample from class 1 as far away as possible from any sample of class 2 — is, at least for the adversarial examples, less robust than their proposed L-GM loss. First, we should note here that their L-GM loss is trained in a supervised way. That means that their supervised, cluster-building, inter-class-distance-maximizing training approach might get much stronger signals, thereby indeed aiding robustness.
Secondly, the margin of the L-GM loss is a hyperparameter, where a margin of 0 corresponds to a plain GM loss. This makes the comparison a bit unfair, as one needs another set of data points to find the best margin. Concluding, with the right objective, a training scheme that pulls similar images together into some form of a cluster while pushing dissimilar images away from each other can yield a robust classifier.
This being said, it would be desirable to see how the L-GM loss compares with a contrastive loss in their exact experimental setup.
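To make the FGSM perturbation mentioned above concrete, here is a minimal PyTorch sketch. The model, labels, and epsilon are placeholders; the essential ingredient is the sign of the loss gradient with respect to the input.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model: torch.nn.Module, x: torch.Tensor,
                 y: torch.Tensor, epsilon: float = 0.03) -> torch.Tensor:
    """Fast Gradient Sign Method: add a small, worst-case perturbation
    along the sign of the loss gradient with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```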
C. Quantifying the decision boundary
The work described in [6] is a theoretical one: the authors derive a lower bound on the probability with which the inter-class distance exceeds the intra-class distance, thereby quantifying the separability of classes under the cross-entropy loss. So basically, they provide an analytical tool with which we can quantify (as a lower bound) how likely it is that a sample of class 1 is classified as class 2 for BCE-trained models.
This is very powerful, as it gives us an actual value, for a certain distance in feature space, of how likely it is that a sample belongs to either class.
The work provides a very interesting tool to investigate the problem at hand — at least for half of the problem, as we do not have such a quantifying measure for our contrastive half. However, if we have a BCE model as well as a model that is trained to learn a known distribution in feature space, e.g. by minimizing the KL or Jensen-Shannon divergence with respect to a GMM or by utilizing normalizing flows, one can already make such a quantitative evaluation, which would comprise a powerful experiment.
Anyway, we should keep an eye open for publications that come up with such a quantification for a contrastive loss. This would enable us to tackle the described problem in an analytic way!
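Until such a result exists for the contrastive case, a crude empirical proxy (explicitly not the analytical bound from [6]) is to compare intra-class and inter-class distances directly on the embeddings. The sketch below assumes `emb` and `labels` come from a validation set of a binary problem; random data stands in here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def intra_inter_distances(emb: np.ndarray, labels: np.ndarray):
    """Mean intra-class vs. inter-class Euclidean distances (binary case).

    The zero diagonal slightly biases the intra-class mean; good enough
    for a rough comparison between two trained models.
    """
    a, b = emb[labels == 0], emb[labels == 1]
    intra = (cdist(a, a).mean() + cdist(b, b).mean()) / 2.0
    inter = cdist(a, b).mean()
    return intra, inter

# Random data standing in for real embeddings of a two-class problem.
emb = np.random.randn(2000, 128)
labels = np.random.randint(0, 2, size=2000)
intra, inter = intra_inter_distances(emb, labels)
print(f"intra: {intra:.2f}  inter: {inter:.2f}  ratio: {inter / intra:.2f}")
```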
D. Feature Suppression
In [3] the authors discuss the phenomenon of feature suppression in contrastive learning. Let us first discuss what is understood as feature suppression. They say:
“The suppression effect occurs among competing features shared across augmented views.”
We want to start by expanding on an example from the original SimCLR paper, which is also pointed to in [3]. They describe the phenomenon via a two-stage augmentation example — augmentation being the pillar that (self-supervised) contrastive learning is predicated on: “One functionality of data augmentation is to remove “easy-to-learn” but less transferable features for the contrastive loss.”
The example I want to give here is inspired by [3] and [4]. Let us consider the interaction between the augmentations color distortion (making an image all ‘green’) and image cropping (global vs. local and adjacent views, see figure below). Now let's look at an image of a dog standing on (green) grass:
All images from [3]: Cropping augmentations, original images, color distortion augmentation (from left to right).
Let us assume we have made a few crops of the original image that we consider for our augmentation scheme (of course, we do not select any particular augmentation application by hand), and we now look at the color distribution of our four augmented, i.e. cropped, images:
From [3]. Color distribution of random crops of two different images, image 1 (top), image 2 (bottom). The color feature is shared among the views by crop augmentation.
We see that there are no distinctive changes along the color-distribution dimension, or feature, of our data: the random crops show almost the same distribution as the whole image! (This, of course, does not hold for every image.)
The idea of feature suppression explains this by arguing that one feature, one that is easy to learn (the color distribution), suppresses another one (the one targeted by the cropping augmentation).
In this case, it is super easy to just learn the color distribution, e.g. the green grass on which the dog is standing, and to completely ignore (suppress) the random crop augmentation.
This is why the authors in [3] argue that “Existing contrastive learning methods critically rely on data augmentation to favor certain sets of features than others, while one may wish that a network would learn all competing features as much as its capacity allows.” In any case, I recommend reading the paper [4] to ML engineers who work with contrastive learning or with augmentation.
To continue the examination, let us look at the color distribution of the same crops, when the image undergoes two augmentations, i.e. cropping and color distortion:
From [3]. Color distribution of random crops of two different images, image 1 (top), image 2 (bottom). Images are additionally being subject to a color distortion augmentation.
We see that combining the two augmentations (at least for these two considered images) is crucial so that the network can learn something meaningful!
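To play with this effect yourself, a minimal sketch of the crop-and-histogram analysis is shown below. The image path, crop size, and jitter strength are placeholders, and the resulting histograms only roughly correspond to the figures from [3].

```python
import numpy as np
from PIL import Image
from torchvision import transforms

image = Image.open("dog_on_grass.jpg").convert("RGB")  # placeholder path

random_crop = transforms.RandomResizedCrop(128, scale=(0.2, 0.5))
color_jitter = transforms.ColorJitter(brightness=0.8, contrast=0.8,
                                      saturation=0.8, hue=0.2)

def channel_histograms(img: Image.Image, bins: int = 32) -> np.ndarray:
    """Per-channel pixel histograms (3 x bins), normalized as densities."""
    arr = np.asarray(img)
    return np.stack([np.histogram(arr[..., c], bins=bins, range=(0, 255),
                                  density=True)[0] for c in range(3)])

# Crops alone: the color distributions stay almost identical across crops ...
hists_crop_only = [channel_histograms(random_crop(image)) for _ in range(4)]
# ... crops combined with color distortion: the shared color feature is broken.
hists_crop_jitter = [channel_histograms(color_jitter(random_crop(image)))
                     for _ in range(4)]
```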
What can we learn from that for our problem? Well, suppressed features are yet another expression of a suboptimal use of our feature embedding space and, in the worst scenario, an entire disregard of certain information in the data to learn from. To stay with the example above, when the model relies on the color distribution only — because the single augmentation is cropping — the learned information about dogs, or any object, on grass is disastrously fragile with respect to this very color being apparent (since other features are suppressed). This means that image artifacts from encodings might easily distort this one feature, leading to altered classification results, or at least a change in confidence.
To summarize about feature suppression:
Easy-to-learn features can suppress the learning of harder features.
A simple color distribution feature such as ‘green’, learned to have the distinct meaning (among others) of grass, can suppress the ability to learn the classes that are associated/correlated with that color feature, e.g. crocodiles or simply cows standing on grass.
As an addendum, in their experiments they add a fourth dimension of a constant integer to the image, which is “represented as n binary bits/channels”. They find that with “a few bits of the extra channel competing feature added”, shared across both contrastive views, the representation loses all information about the original RGB values. These results show that a small amount of useless “extra information” can change the fate of a prediction.
They conclude that it is in general “difficult to learn both of the competing features using existing contrastive loss (as in SimCLR)”. From that, we learn that we need to choose our augmentations very carefully depending on the data at hand!
E. Loss Function vs. Domain Adaptation
In [7] the authors investigate yet another aspect of different loss functions and their effect on changing input data:
“Many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks”.
While a different codec is not directly a downstream task, we can consider it as a transfer task, i.e. a shift of the domain. With their main argument being that vanilla BCE learns more robust features than alternative (supervised) objectives, we discuss some of the authors’ insights here. They conducted experiments with nine different objectives on ImageNet data and several transfer settings.
Some of their key results:
“the choice of loss has little effect when networks are fully fine-tuned on the new tasks” [7].
“there exists a trade-off between learning invariant features for the original task and features relevant for transfer tasks” [7].
They further find that “alternative objectives [over the Softmax BCE approach] appear to collapse within-class variability in representations, which accounts for both the improvement in accuracy on the original task and the reduction in the quality of the features on downstream tasks.”
Summarizing, they find that:
“the properties of a loss that yields good performance on the pretraining task are different from the properties of a loss that learns good generic features.”
Even though their experiments are limited to the supervised setting, what we can learn here is that one must carefully test models before they go to production, or models deployed to an unknown domain. Of course, as also mentioned in the paper, trying bigger networks and more data is always worth it. But in the end, when you cannot test the unknown unknowns, how can you find confidence ;) in your trained (bigger) model?
One piece of advice is that trained models that perform well on your benchmark test need to be validated against data of possibly different distributions, or data subject to noise/artifacts that might represent such changes, like encodings. Cross-validation, another benchmark set (‘where from?’, you ask), and estimating robustness and calibration suggest some directions to explore.
Summary and Conclusions
So, what did we learn here? First of all, be careful when changing the codec of your input videos for your trained convolutional neural network, particularly when the network is trained in a contrastive manner! BCE and contrastive learning objectives behave very differently when it comes to the hidden representational structures and, ultimately, the learned output confidences.
These different structures come with their own peculiar behavior when the input changes with respect to, e.g., but not limited to, encodings.
Some takeaways:
When you apply augmentations in your training, particularly in a contrastive setting, you should take great care with their composition and incorporate the knowledge you have about your data domain.
When you are working within the real-world data domain, the codec of your input data might be out of your control — so the response of your system to different encodings will matter! Our results show that you should test your system with different video encodings and come back to this post to amend your findings.
Furthermore, we discussed several recent findings in the ML literature that explore either loss objective and its implications for classification confidences, and that shed some light on directions to follow and pitfalls to avoid when we train ML models.
This blog post can only be incomplete, let alone a scientifically profound elaboration. However, it should give some insight into and understanding of the underlying question asked in the title, “Supervised vs. Contrastive Classification: How Does the Video Codec Influence the Prediction Scores?”, and serve as the mere start of a discussion.
For the future, particularly the first takeaway (on augmentation composition) has drawn our interest and shall be a research topic within Fourthline, with upcoming results to be shared with the community.
References
[1] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. 2022.
[2] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. 2022.