Comparing Synthetic and Real Image Datasets Using UMAP

Overview

Synthetic data is engineered, or simulated, data that can be used to train and validate AI. In the computer vision realm, which typically focuses on unstructured data such as imagery or video, synthetic data can be created using 3D simulation techniques and through generative AI approaches. At Rendered.ai, we typically focus on simulation approaches.

This article describes a new approach to dataset comparison and quality assessment for unstructured synthetic data. This approach involves a human-in-the-loop method of visualizing image features in embedding space to identify overall trends in dataset attributes and whether key features are accurately represented in synthetic data.

Introduction

One of the most fundamental questions to ask when using synthetic data to train AI is: “Will my synthetic data function as if it were real data?” After all, if the answer is “no,” then your synthetic data is generally not going to be effective for training.

In the context of visible spectrum computer vision data, the first test that’s done to answer this question is “well, does my synthetic data look like my real data?” This is a valid first test that can be easily run by anyone with eyes, and if the answer falls far short of “yes,” it often means that we need to re-think our approach.

But this test does not answer the question of whether our synthetic data will actually be useful in our AI pipelines. For one, computer vision models pick up on features that may not be obvious to a human viewer. Slight differences in the characteristics of an image, even imperceptible ones, can dramatically impact how a deep learning model perceives it. Therefore, even if an image looks real to us, we cannot be certain that our AI models will agree.

Conversely, not all features of real data need to be emulated for synthetic data to be useful, just the important ones. It’s helpful to remember that no matter how complex your simulator is, all simulation is an approximation of reality, and chasing realism for realism’s sake can have rapidly diminishing returns. If the right features are accurately represented in our data, we can start to get the results we’re after without spending time simulating unimportant features.

After visual review, the next test that’s typically done is to train a model. This can mean training solely on synthetic data and testing against real data, or training with a mix of real and synthetic data and seeing if this improves scores over training on real data only. This is the “moment of truth” for synthetic data, a moment that either validates or invalidates our efforts. It’s also a moment where many people give up on synthetic data if it does not move the needle in a positive way. We have spoken with many people who have reached this point and concluded that synthetic data simply doesn’t work.
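The two training comparisons described above can be sketched in a few lines. This is a minimal illustration, not the workflow used at Rendered.ai: the 8-dimensional Gaussian "features" stand in for real image data, the nearest-centroid classifier stands in for a deep learning model, and every name here (make_split, fit_centroids, accuracy, run_protocol, the shift parameter) is hypothetical. The point is only the shape of the experiment: always evaluate on held-out real data, and vary what goes into the training set.

```python
import numpy as np

def make_split(rng, n, shift):
    """Hypothetical stand-in for image features: two Gaussian classes in 8-D.
    `shift` mimics a systematic gap between synthetic and real data."""
    labels = np.arange(n) % 2                      # balanced, deterministic labels
    centers = np.where(labels[:, None] == 1, 1.0, -1.0)
    feats = centers + shift + rng.normal(scale=1.0, size=(n, 8))
    return feats, labels

def fit_centroids(x, y):
    """Toy 'model': one mean feature vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, x, y):
    """Fraction of samples whose nearest centroid matches their label."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y).mean())

def run_protocol(seed=0):
    rng = np.random.default_rng(seed)
    real_train = make_split(rng, 40, shift=0.0)    # small real training set
    real_test = make_split(rng, 500, shift=0.0)    # held-out real data
    synth_train = make_split(rng, 400, shift=0.3)  # larger, slightly offset synthetic set
    mixed_train = (
        np.concatenate([real_train[0], synth_train[0]]),
        np.concatenate([real_train[1], synth_train[1]]),
    )
    # Every configuration is scored against the same real test set.
    return {
        "real_only": accuracy(fit_centroids(*real_train), *real_test),
        "synthetic_only": accuracy(fit_centroids(*synth_train), *real_test),
        "mixed": accuracy(fit_centroids(*mixed_train), *real_test),
    }

if __name__ == "__main__":
    for name, score in run_protocol().items():
        print(f"{name}: {score:.3f}")
```

Comparing "synthetic_only" and "mixed" against the "real_only" baseline is the moment of truth described above. Note that the three numbers say whether synthetic data helped, but not why, which is the gap the rest of this article addresses.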

To be clear, synthetic data does indeed work when applied correctly, and there are many real-world proof points. But when confronted with poor model results and little explanation, it can be easy to conclude that synthetic data doesn’t work. To properly analyze synthetic data and compare it with real data, we need information that is more concrete than what we can observe with our eyes, yet more nuanced than overall model outcomes. We need to peer behind the curtain of how a deep learning model interprets the data.