Supervised vs. Contrastive Classification: How Does the Video Codec Influence the Prediction Scores?
By Sebastian Vater - Fellow researcher in ID&V
Why are we interested in different codecs and their effect on prediction scores anyway?
Where we work, at Fourthline, our task is to help safely onboard clients for our customers, such as banks, fintechs, and non-financial businesses, in a KYC (Know Your Customer) process. We research and develop Machine Learning and AI models that verify a client’s identity and check that it is a live person who wants to onboard to the platform and use its services.
While experimenting in our lab with Machine Learning systems, we observed something interesting: Models trained with different loss functions respond differently when we change how the input video is encoded. This opened up the question of whether (and, if so, to what extent) some training paradigms are more sensitive than others to different video codecs, and thus, to compression artifacts.
Now, why would we care what influence an encoding or transcoding has on the output of our services? We could just optimize for the codec at hand and be happy! Well, leaving academia behind, we interact in an open world, and so do our services. Fourthline serves a multitude of banks, fintechs, and non-financial businesses. Each of them individually preprocesses its data with its own data capture and backend engineering before running it through our algorithms to check the liveness and identity of a person. We do care that our services always behave as expected — no matter the input codec!
In this post, we want to explore specifically how the training objectives coming from two paradigms — supervised and contrastive learning (see [8] for an overview) — might have an effect on the created embedding space (e.g. structure or robustness against noise and artifacts). We want to take a closer look at some characteristics of these losses and see how these affect the classification behavior in real-world scenarios. In particular, the goal is to fathom possible influences of the different losses on the classification score in presence of image space artifacts, such as those resulting from different codecs being used for the input data.
Max Crous and Roman Aleksandrov of Fourthline’s Biometrics team conducted the experiments and created the graphs on which this blog post is based. They furthermore shared valuable insights and ideas through discussions.
What we found at Fourthline
At Fourthline, our product architecture employs multiple integrated modules that work in concert to maximize their collective advantages. This modular approach allows our systems to address challenges through multiple parallel processing paths.
As one example, different parts of the liveness system are researched and trained following different learning paradigms and architectures. In doing so, we make use of, among others, supervised and contrastive learning methods.
Since we are interested in the effect of different video codecs on the outcome of our algorithms, we started performing some experimental tests on data in our domain, to which we applied different encoding and transcoding preprocessing. Below we show some qualitative, exemplary results for two particular systems. Both systems are trained on the same data domain; one is trained with a supervised binary cross-entropy loss (BCE), the other with a contrastive loss (Contrast.), where a local density estimator is employed for estimating decision thresholds.
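(As a small aside, such a density-based threshold can be sketched roughly as follows. This is a minimal illustration using scikit-learn's KernelDensity, not Fourthline's production estimator; the score distributions and the crossing-point criterion are placeholder assumptions.)

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 1-D confidence scores of genuine and spoof validation samples.
genuine_scores = np.random.normal(0.8, 0.05, size=1000)[:, None]
spoof_scores = np.random.normal(0.3, 0.10, size=1000)[:, None]

# Fit a local density estimate per class.
kde_genuine = KernelDensity(bandwidth=0.02).fit(genuine_scores)
kde_spoof = KernelDensity(bandwidth=0.02).fit(spoof_scores)

# Pick the threshold where the two estimated densities cross,
# searching between the two class means.
grid = np.linspace(spoof_scores.mean(), genuine_scores.mean(), 501)[:, None]
gap = kde_genuine.score_samples(grid) - kde_spoof.score_samples(grid)
threshold = grid[np.argmin(np.abs(gap))].item()
print(f"estimated decision threshold: {threshold:.3f}")
```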
The plots below show the distribution of the difference in score outputs (e.g. Softmax) of the differently trained systems when we change the input video codec. If a sample to be classified scored some value x0 on the original codec (represented by the horizontal line ‘0.00’), the colored dots show where the confidence output for each test sample lands when the model is fed the same video with a different encoding, resulting in scores x0 + dx. The colored areas represent the spread of dx.
In these experiments, starting from the original MJPEG format (each frame in a video is encoded in JPEG format), we evaluate chroma-subsampled MJPEG (4:4:4 and 4:2:0) and intra-frame FFV1 (part of the FFmpeg project). We also show below the comparison between FFV1 and the lossless VP9 encoding.
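To make the setup concrete, here is a minimal sketch of how such a comparison can be run. The ffmpeg flags and the `score_video` callable are illustrative placeholders (not our production pipeline), and the encodings only approximate the variants named above.

```python
import subprocess
from typing import Callable
import numpy as np

# Illustrative ffmpeg re-encodings of a source video (flags are examples only).
ENCODINGS = {
    "mjpeg_444": ["-c:v", "mjpeg", "-pix_fmt", "yuvj444p", "-q:v", "2"],
    "mjpeg_420": ["-c:v", "mjpeg", "-pix_fmt", "yuvj420p", "-q:v", "2"],
    "ffv1": ["-c:v", "ffv1"],
    "vp9_lossless": ["-c:v", "libvpx-vp9", "-lossless", "1"],
}

def transcode(src: str, dst: str, codec_args: list[str]) -> None:
    """Re-encode src into dst with the given ffmpeg codec arguments."""
    subprocess.run(["ffmpeg", "-y", "-i", src, *codec_args, dst], check=True)

def score_differences(videos: list[str],
                      score_video: Callable[[str], float]) -> dict[str, np.ndarray]:
    """For every encoding, collect dx = score(encoded) - score(original).

    `score_video` is assumed to run your trained model on a video file
    and return a confidence in [0, 1].
    """
    diffs: dict[str, list[float]] = {name: [] for name in ENCODINGS}
    for src in videos:
        x0 = score_video(src)
        for name, codec_args in ENCODINGS.items():
            dst = f"{src}.{name}.mkv"
            transcode(src, dst, codec_args)
            diffs[name].append(score_video(dst) - x0)
    return {name: np.array(d) for name, d in diffs.items()}
```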
Distribution of differences of classifier scores when changing the input codec. Image generated and owned by Fourthline.
So, what do we see? We immediately see that the model trained with a contrastive loss performs significantly differently on the encoded data than the BCE-trained model does. For VP9, assuming an approximately Gaussian statistic, the 1-sigma area contains score differences of up to +/- 0.05 — this means a change in confidence from, e.g., 0.8 to 0.75–0.85. This is a huge difference for a model output!
Distribution of differences of classifier scores when changing the input codec. Image generated and owned by Fourthline.
Exemplary visualizations of artifacts stemming from different video codecs. We compute pixel-wise differences between the original video and the encoded one. The results are to be interpreted qualitatively. From left to right: (1) Original image, a person trying to fool the Liveness system by applying a paper mask (no video compression). (2) Encoded with MJPEG. (3) Encoded with FFV1. (4) Encoded with VP9. Images owned by Fourthline.
You may be asking: Why does the contrastive model behave so differently? And why does the BCE model not? The contrastive model seems to be much more vulnerable to image encodings/transcodings. We asked ourselves: what could be the reason, and how can we prevent it?
The sections below shed some light on possible explanations for this behavior. (Note: since this is just a blog post, scientifically sound proofs will be published elsewhere.)
Discussion
First, we see that the divergence of scores is consistently high for the contrastive learning approach across different codecs. Given that all experiments were conducted on the same data domain, we focus on taking a closer look at how the different learning schemes could explain this phenomenon.
In general, to prevent some undesired behavior in any technical system, it is most useful to understand what causes the observed behavior.
The Two Losses: Characteristics
A first possible explanation that comes to mind is that the learned feature spaces reveal different structures, in particular different inter-class structures, for the two losses. Let us look at some characteristics of the losses (we assume a binary problem for our argumentation in this post):
Cross-Entropy binary loss (BCE):
BCE finds a hyperplane to separate the two classes, where it operates on class likelihoods (e.g. Softmax outputs).
There are no constraints on the model’s learned distribution (e.g. of class 1 and class 2 — they can assume any shape).
Learns separable features — any features, regardless of whether similar data points of the same class are close to each other (as long as they are ‘far away’ from all data points of the other class).
Summarizing, BCE does not care much about the intra-class distribution, nor about the shape of the distribution itself — it just wants to create features that separate a sample of one class from all samples of the other class.
Contrastive loss:
Rather than finding a hyperplane, it ‘organizes’ the feature space:
The contrastive loss pulls similar images together into some form of a cluster while pushing dissimilar images away from each other. So it operates on data points and their relative positions to each other.
Learns a similarity metric (often cosine similarity) that is normalized and temperature-scaled, i.e. the NT-Xent contrastive loss.
The contrastive training objective usually encourages the hidden representations to be uniformly distributed on a hypersphere [1].
Is exposed to the effect of dimensional collapse [2]. This might be caused/enhanced by feature suppression [3] for the contrastive approach, see also below.
So, the contrastive learning paradigm wants to cluster similar samples while pushing clusters of dissimilar samples away — the former being an apparent difference to BCE. The sketch below contrasts the two objectives in code.
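To make the difference between the two objectives concrete, here is a minimal PyTorch sketch of both losses. The shapes, the temperature, and the pairing of augmented views follow the SimCLR-style NT-Xent formulation rather than our exact training setup, so treat it as an illustration only.

```python
import torch
import torch.nn.functional as F

# Supervised binary cross-entropy: operates on class likelihoods.
def bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# NT-Xent (normalized, temperature-scaled) contrastive loss, SimCLR-style:
# operates on relative positions of embeddings, pulling the two augmented
# views of each sample together and pushing all other samples away.
def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x D, on the unit hypersphere
    sim = z @ z.t() / tau                                # 2N x 2N cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # exclude self-similarity
    # The positive for sample i is its other augmented view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```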
Our general argument in what follows is based on the fact that video encoding/transcoding — lossless or lossy — causes artifacts to appear in the data and, in general, induces information that a model can only interpret as noise. We want to add context to this relationship and highlight directions from which to draw conclusions.
To get a visual idea, let’s start by looking at typical 2D representations of MNIST embeddings for both BCE and contrastive loss.
A look at latent spaces
The two figures below illustrate the respective embedding spaces on the MNIST dataset for BCE and contrastive-loss training. We can already see that they represent what we summarized above in the bullet points:
Trying, without constraint on the shape, to push a sample of one class as far away as possible from all the other classes’ samples — versus building compact clusters. (Note that, for illustrative purposes, we fall back to a multiclass problem in the figures; the conclusions made here still hold without loss of generality for the binary problem.)
Embedding for MNIST samples from a CNN model trained with BCE.
Though just an illustrative example, the figure for the contrastive approach already lets us draw a corollary: even without extra noise from any video encoding, there is more overlap visible between the clusters than for BCE, making the model more vulnerable to any kind of noise and therefore to changes in classification confidence.
Embedding for MNIST samples from a CNN model trained with contrastive loss.
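For readers who want to reproduce this kind of figure, a minimal sketch of the visualization step is shown below. It assumes an already trained encoder (a placeholder here) and uses t-SNE purely for the 2D projection; a two-dimensional bottleneck layer, as often used for such plots, works just as well.

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from torchvision import datasets, transforms

# Placeholder encoder: swap in the CNN trained with BCE or contrastive loss.
encoder = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(28 * 28, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
encoder.eval()

test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(test_set, batch_size=512)

embeddings, labels = [], []
with torch.no_grad():
    for x, y in loader:
        embeddings.append(encoder(x))
        labels.append(y)
emb = torch.cat(embeddings).numpy()
lab = torch.cat(labels).numpy()

# Project to 2D for plotting (skip this step if the embedding is already 2D).
emb_2d = TSNE(n_components=2).fit_transform(emb)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=lab, s=2, cmap="tab10")
plt.colorbar(label="digit class")
plt.show()
```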
Concluding, if you are working on anomaly or outlier detection, one hint for your research project or product is that you might prefer a contrastive loss over a BCE loss, to make use of the minimization of the intra-class variance!
Going a bit deeper than this illustrative example, we want to look into a few related publications and the possible answers they might yield.
Related works for possible explanations
A. Non-Exhaustive Feature Space
In [2], the authors show that with a training scheme similar to SimCLR [4], a singular value decomposition of the covariance matrix of the embedding vectors (they use a 128-dimensional embedding on ImageNet images) reveals that more than 20% of the dimensions remain basically unused — the dimensional collapse. In a corollary, they deduce:
“With strong augmentation, the embedding space covariance matrix becomes low-rank.”
So basically, the model is not using the entire feature space, thus limiting its capacity and possibly increasing its sensitivity to noise.
From [2]. With increasing strength k of the augmentations on a contrastive network, a large number of the dimensions become irrelevant.
In our work, the contrastive approach is inherently built upon strong augmentation. The dimensional collapse possibly caused by this affects the model’s effective capacity and thus its capability to robustly compensate for noise and artifacts in the input that were unseen during training. The sketch below shows how such a collapse can be diagnosed on your own embeddings.
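As a practical aside, the diagnostic from [2] is easy to run on your own model. The sketch below is a minimal version under the assumption that `emb` is an N x D matrix of embedding vectors collected from a validation set; here random data stands in for real embeddings, and the collapse threshold is an arbitrary choice.

```python
import numpy as np

def embedding_spectrum(emb: np.ndarray, rel_threshold: float = 1e-3):
    """Singular value spectrum of the embedding covariance matrix.

    Dimensions whose singular value falls below rel_threshold times the
    largest one are counted as (almost) collapsed.
    """
    centered = emb - emb.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(emb) - 1)          # D x D covariance
    singular_values = np.linalg.svd(cov, compute_uv=False)
    collapsed = int(np.sum(singular_values < rel_threshold * singular_values[0]))
    return singular_values, collapsed

# Random data standing in for real validation-set embeddings.
emb = np.random.randn(10_000, 128)
spectrum, n_collapsed = embedding_spectrum(emb)
print(f"{n_collapsed} of {emb.shape[1]} dimensions are (almost) collapsed")
```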
B. Learned Feature Distribution
The authors in [5] investigate the effect of different losses, including margin losses and Gaussian Mixture losses, on the feature embedding space. They find that a Cross-Entropy loss does indeed yield a feature space distribution that resembles a Gaussian Mixture distribution, i.e. an organization of the feature space in clusters.
They furthermore find that training with a Gaussian Mixture loss, which enables learning clusters in a supervised manner, increases the robustness of the network. Specifically, they show that deep neural networks with high classification accuracies are vulnerable to adversarial examples when trained with a Softmax/Cross-Entropy loss. Adversarial examples created by the Fast Gradient Sign Method (FGSM) can be seen as adding noise to the image in a controlled way by exploiting the gradient of the loss with respect to the image. Without classifying compression artifacts as either deterministic or probabilistic, one can argue that both changes to the image have the effect of changing the confidence of the classification — the embedding is moving away from the original point in feature space.
The adversarial examples formed by intentionally adding small but worst-case perturbations cause the model to make incorrect classifications with high confidence, proving that small changes can have a drastic effect on confidence values in vanilla Softmax-trained networks. They show this vulnerability by comparing the distributions of the confidence outputs of networks trained with a Softmax/Cross-Entropy loss, a Center loss, and their proposed Large-Margin Gaussian Mixture (L-GM) loss, as shown below. While the specifics of the losses are not of interest here, we just want to acknowledge that the L-GM loss encourages building clusters in feature space and thus tries to reduce intra-class variance (pull similar samples together) and increase inter-class variance (push dissimilar samples away from each other).
From [5]. Distributions of confidences/likelihoods of sample classifications for models trained with different losses.
At first glance, the results above somewhat contradict the argumentation made so far. They find that the BCE loss — push a sample from class 1 as far away as possible from any sample of class 2 — is, at least for the adversarial examples, less robust than their proposed L-GM loss. First, we should note here that their L-GM loss is trained in a supervised way. That means that their supervised, cluster-building, inter-class-distance-maximizing training approach might get much stronger signals, thereby indeed aiding robustness.
Secondly, the margin of the L-GM loss is a hyperparameter, where a margin of 0 corresponds to a plain GM loss. This makes the comparison a bit unfair, as one needs another set of data points to find the best margin. Concluding, with the right objective, a training scheme that pulls similar images together into some form of a cluster while pushing dissimilar images away from each other can yield a robust classifier.
This being said, it would be desirable to see how the L-GM loss compares with a contrastive loss in their exact experimental setup.
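To make the FGSM perturbation mentioned above concrete, here is a minimal PyTorch sketch. The model, labels, and epsilon are placeholders; the essential ingredient is the sign of the loss gradient with respect to the input.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model: torch.nn.Module, x: torch.Tensor,
                 y: torch.Tensor, epsilon: float = 0.03) -> torch.Tensor:
    """Fast Gradient Sign Method: add a small, worst-case perturbation
    along the sign of the loss gradient with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```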
C. Quantifying the decision boundary
The work described in [6] is a theoretical one: the authors derive a lower bound on the probability with which the inter-class distance exceeds the intra-class distance, thereby quantifying the separability of classes under the cross-entropy loss. So basically, they provide an analytical tool with which we can quantify (as a lower bound) how likely it is that a sample of class 1 is classified as class 2 for BCE-trained models.
This is very powerful, as it gives us an actual value, for a certain distance in feature space, of how likely it is that a sample belongs to either class.
The work provides a very interesting tool to investigate the problem at hand — at least for half of the problem, as we do not have such a quantifying measure for our contrastive half. However, if we have a BCE model as well as a model that is trained to learn a known distribution in feature space, e.g. by minimizing the KL or Jensen-Shannon divergence with respect to a GMM or by utilizing normalizing flows, one can already make such a quantitative evaluation, which would comprise a powerful experiment.
Anyway, we should keep an eye open for publications that come up with such a quantification for a contrastive loss. This would enable us to tackle the described problem in an analytic way!
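Until such a result exists for the contrastive case, a crude empirical proxy (explicitly not the analytical bound from [6]) is to compare intra-class and inter-class distances directly on the embeddings. The sketch below assumes `emb` and `labels` come from a validation set of a binary problem; random data stands in here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def intra_inter_distances(emb: np.ndarray, labels: np.ndarray):
    """Mean intra-class vs. inter-class Euclidean distances (binary case).

    The zero diagonal slightly biases the intra-class mean; good enough
    for a rough comparison between two trained models.
    """
    a, b = emb[labels == 0], emb[labels == 1]
    intra = (cdist(a, a).mean() + cdist(b, b).mean()) / 2.0
    inter = cdist(a, b).mean()
    return intra, inter

# Random data standing in for real embeddings of a two-class problem.
emb = np.random.randn(2000, 128)
labels = np.random.randint(0, 2, size=2000)
intra, inter = intra_inter_distances(emb, labels)
print(f"intra: {intra:.2f}  inter: {inter:.2f}  ratio: {inter / intra:.2f}")
```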
D. Feature Suppression
In [3] the authors discuss the phenomenon of feature suppression in contrastive learning. Let us first discuss what is understood as feature suppression. They say:
“The suppression effect occurs among competing features shared across augmented views.”
We want to start by expanding on an example from the original SimCLR paper, which is also pointed to in [3]. They describe the phenomenon via a two-stage augmentation example — augmentation being the pillar that (self-supervised) contrastive learning is predicated on: “One functionality of data augmentation is to remove “easy-to-learn” but less transferable features for the contrastive loss.”
The example I want to give here is inspired by [3] and [4]. Let us consider the interaction between the augmentations color distortion (making an image all ‘green’) and image cropping (global vs. local and adjacent views, see figure below). Now let's look at an image of a dog standing on (green) grass:
All images from [3]: Cropping augmentations, original images, color distortion augmentation (from left to right).
Let us assume we have made a few crops of the original image that we consider for our augmentation scheme (of course, we do not select any particular augmentation application by hand), and we now look at the color distribution of our four augmented, i.e. cropped, images:
From [3]. Color distribution of random crops of two different images, image 1 (top), image 2 (bottom). The color feature is shared among the views by crop augmentation.
We see that there are no distinctive changes along the color-distribution dimension, or feature, of our data: the random crops show almost the same distribution as the whole image! (This, of course, does not hold for every image.)
The idea of feature suppression explains this by arguing that one feature, one that is easy to learn (the color distribution), suppresses another one (the one targeted by the cropping augmentation).
In this case, it is super easy to just learn the color distribution, e.g. the green grass on which the dog is standing, and to completely ignore (suppress) the random crop augmentation.
This is why the authors in [3] argue that “Existing contrastive learning methods critically rely on data augmentation to favor certain sets of features than others, while one may wish that a network would learn all competing features as much as its capacity allows.” In any case, I recommend reading the paper [4] to ML engineers who work with contrastive learning or with augmentation.
To continue the examination, let us look at the color distribution of the same crops, when the image undergoes two augmentations, i.e. cropping and color distortion:
From [3]. Color distribution of random crops of two different images, image 1 (top), image 2 (bottom). Images are additionally being subject to a color distortion augmentation.
We see that combining the two augmentations (at least for these two considered images) is crucial so that the network can learn something meaningful!
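To play with this effect yourself, a minimal sketch of the crop-and-histogram analysis is shown below. The image path, crop size, and jitter strength are placeholders, and the resulting histograms only roughly correspond to the figures from [3].

```python
import numpy as np
from PIL import Image
from torchvision import transforms

image = Image.open("dog_on_grass.jpg").convert("RGB")  # placeholder path

random_crop = transforms.RandomResizedCrop(128, scale=(0.2, 0.5))
color_jitter = transforms.ColorJitter(brightness=0.8, contrast=0.8,
                                      saturation=0.8, hue=0.2)

def channel_histograms(img: Image.Image, bins: int = 32) -> np.ndarray:
    """Per-channel pixel histograms (3 x bins), normalized as densities."""
    arr = np.asarray(img)
    return np.stack([np.histogram(arr[..., c], bins=bins, range=(0, 255),
                                  density=True)[0] for c in range(3)])

# Crops alone: the color distributions stay almost identical across crops ...
hists_crop_only = [channel_histograms(random_crop(image)) for _ in range(4)]
# ... crops combined with color distortion: the shared color feature is broken.
hists_crop_jitter = [channel_histograms(color_jitter(random_crop(image)))
                     for _ in range(4)]
```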
What can we learn from that for our problem? Well, suppressed features are yet another expression of a suboptimal use of our feature embedding space and, in the worst scenario, an entire disregard of certain information in the data to learn from. To stay with the example above, when the model relies on the color distribution only — because the single augmentation is cropping — the learned information about dogs, or any object, on grass is disastrously fragile with respect to this very color being apparent (since other features are suppressed). This means that image artifacts from encodings might easily distort this one feature, leading to altered classification results, or at least a change in confidence.
To summarize about feature suppression:
Easy-to-learn features can suppress the learning of harder features.
A simple color distribution feature such as ‘green’, learned to have the distinct meaning (among others) of grass, can suppress the ability to learn the classes that are associated/correlated with that color feature, e.g. crocodiles or simply cows standing on grass.
As an addendum, in their experiments they add a fourth dimension of a constant integer to the image, which is “represented as n binary bits/channels”. They find that with “a few bits of the extra channel competing feature added”, shared across both contrastive views, the representation loses all information about the original RGB values. These results show that a small amount of useless “extra information” can change the fate of a prediction.
They conclude that it is in general “difficult to learn both of the competing features using existing contrastive loss (as in SimCLR)”. From that, we learn that we need to choose our augmentations very carefully depending on the data at hand!
E. Loss Function vs. Domain Adaptation
In [7] the authors investigate yet another aspect of different loss functions and their effect on changing input data:
“Many objectives lead to statistically significant improvements in ImageNet accuracy over vanilla softmax cross-entropy, but the resulting fixed feature extractors transfer substantially worse to downstream tasks”.
While a different codec is not directly a downstream task, we can consider it as a transfer task, i.e. a shift of the domain. With their main argument being that vanilla BCE learns more robust features than alternative (supervised) objectives, we discuss some of the authors’ insights here. They conducted experiments with nine different objectives on ImageNet data and several transfer settings.
Some of their key results:
“the choice of loss has little effect when networks are fully fine-tuned on the new tasks” [7].
“there exists a trade-off between learning invariant features for the original task and features relevant for transfer tasks” [7].
They further find that “alternative objectives [over the Softmax BCE approach] appear to collapse within-class variability in representations, which accounts for both the improvement in accuracy on the original task and the reduction in the quality of the features on downstream tasks.”
Summarizing, they find that:
“the properties of a loss that yields good performance on the pretraining task are different from the properties of a loss that learns good generic features.”
Even though their experiments are limited to the supervised setting, what we can learn here is that one must carefully test models before they go to production, or models deployed to an unknown domain. Of course, as also mentioned in the paper, trying bigger networks and more data is always worth it. But in the end, when you cannot test the unknown unknowns, how can you find confidence ;) in your trained (bigger) model?
One piece of advice is that trained models that perform well on your benchmark test need to be validated against data of possibly different distributions, or data subject to noise/artifacts that might represent such changes, like encodings. Cross-validation, another benchmark set (‘where from?’, you ask), and estimating robustness and calibration suggest some directions to explore.
Summary and Conclusions
So, what did we learn here? First of all, be careful when changing the codec of your input videos for your trained convolutional neural network, particularly when the network is trained in a contrastive manner! BCE and contrastive learning objectives behave very differently when it comes to the hidden representational structures and, ultimately, the learned output confidences.
These different structures come with their own peculiar behavior when the input changes with respect to, e.g., but not limited to, encodings.
Some takeaways:
When you apply augmentations in your training, particularly in a contrastive setting, you should take great care with their composition and incorporate the knowledge you have about your data domain.
When you are working within the real-world data domain, the codec of your input data might be out of your control — so the response of your system to different encodings will matter! Our results show that you should test your system with different video encodings and come back to this post to amend your findings.
Furthermore, we discussed several recent findings in the ML literature that explore either loss objective and its implications for classification confidences, and that shed some light on directions to follow and pitfalls to avoid when we train ML models.
This blog post can only be incomplete, let alone a scientifically profound elaboration. However, it should give some insight into and understanding of the underlying question asked in the title, “Supervised vs. Contrastive Classification: How Does the Video Codec Influence the Prediction Scores?”, and serve as the mere start of a discussion.
For the future, particularly the first takeaway (on augmentation composition) has drawn our interest and shall be a research topic within Fourthline, with upcoming results to be shared with the community.
References
[1] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. 2022.
[2] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. 2022.