A conversation about Synthetic Data with the USGIF MS&G working group

Synthetic images using satellite imagery backgrounds and 3D model assets for construction equipment, created in Rendered.ai

The intersections of synthetic data with Explainable AI, Digital Twins, and CycleGANs were the primary subjects of curiosity from the audience at the working group on 2/22/2022, and they also turn out to be questions we are frequently asked at Rendered.ai.

Yesterday, I had the opportunity to speak to the USGIF Modeling, Simulation, and Gaming Working Group for the second time in the last couple of years. This time, my talk was about the work we’re doing at Rendered.ai in #syntheticdata.

Rendered.ai offers a Platform as a Service (#PaaS) for generating synthetic data, enabling data scientists and teams focused on Computer Vision problems to overcome the costs and challenges of acquiring and using real data for training machine learning and artificial intelligence systems. Many of our customers are in the Earth observation domain, so speaking with and participating in this group is certainly relevant!

Sign up for access to Rendered.ai here!

What happens in the working group generally stays in the working group, but three questions relate to common topics we hear and discuss at Rendered.ai and are worth sharing more broadly:

Does Synthetic Data offer the opportunity to make AI more understandable or explainable?

Explainability of AI is of fundamental concern when using AI and ML to inform or influence high-impact, critical decisions such as in legal, medical, and defense applications.

Machine learning algorithms are essentially combinations of many small statistical calculations that are performed at multiple scales and in different patterns using input data. Seemingly simple input datasets can lead to millions of algorithmically derived training parameters that are incomprehensible to humans, yet may seem to ‘work’ to identify, classify, and extract knowledge from real world datasets.
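To make the scale of "millions of parameters" concrete, here is a minimal sketch that counts the trainable parameters in a small fully connected network. The layer sizes (a 64x64 grayscale image flattened to 4,096 inputs, two hidden layers, ten output classes) are assumptions chosen purely for illustration and do not describe any particular model.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first, output last.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per input/output pair
        total += outputs           # one bias per output unit
    return total


# Assumed example: 64x64 grayscale input, hidden layers of 512 and 256, 10 classes.
example_layers = [64 * 64, 512, 256, 10]
total_parameters = count_dense_parameters(example_layers)
print(f"{total_parameters:,}")  # prints 2,231,562
```

Even this small network on tiny images has over two million learned values, none of which has an obvious human-readable meaning on its own.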

At Rendered.ai, we have frequent conversations about the need for analysis that allows a user to understand why an AI outcome is achieved with one dataset but not with another. Synthetic data lets users engineer data generation so that they can test and evaluate datasets with exact knowledge of the scenario, events, or components depicted in each instance of data. When coupled with tools for analyzing and comparing datasets, we believe that the process of experimentally adjusting and evaluating data generation through a platform such as ours will directly help customers understand and explain why some datasets are good for training AI on particular problems while others are not.
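The idea of engineered data generation with exact knowledge of each instance can be sketched in a few lines. This is not the Rendered.ai API: `SceneSpec`, its fields, and the value ranges are all hypothetical, and the sketch only generates scene descriptions rather than rendered images. The point is that every sample carries its full ground truth, and that one factor can be varied while everything else is held fixed.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneSpec:
    """Hypothetical description of one synthetic image (illustration only)."""
    object_type: str
    object_count: int
    sun_elevation_deg: float
    ground_sample_distance_m: float


def sample_scene(rng):
    """Draw one scene description from assumed parameter ranges."""
    return SceneSpec(
        object_type=rng.choice(["excavator", "bulldozer", "crane"]),
        object_count=rng.randint(1, 5),
        sun_elevation_deg=round(rng.uniform(15.0, 75.0), 1),
        ground_sample_distance_m=rng.choice([0.3, 0.5, 1.0]),
    )


def generate_dataset(n, seed):
    """Generate n scene descriptions; the same seed reproduces the same dataset."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


def vary_one_factor(base, factor, values):
    """Return copies of a scene that differ from it in exactly one field."""
    return [replace(base, **{factor: value}) for value in values]


baseline = generate_dataset(100, seed=0)
sun_sweep = vary_one_factor(baseline[0], "sun_elevation_deg", [15.0, 45.0, 75.0])
```

If a model trained on `baseline` behaves differently across the images rendered from `sun_sweep`, the difference can be attributed to sun elevation alone, which is the kind of controlled comparison that is hard to arrange with real collected data.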

Does Synthetic Data help solve the problem of knowing about and capturing the world in 3D?

Acquiring real sensor-based data for diverse real world scenarios and assets is expensive, can be risky, and may even be impossible in some cases. Did I mention that data is expensive? We’re hearing from many prospects and customers that data acquisition, including curating and labeling existing datasets, can be so expensive that in many cases businesses decide not to pursue products or offerings simply because they can’t get the data needed to train AI.

In the case of Digital Twins, for example, the goal is to have a time-enabled digital replica of a real world system, including AI systems deployed to help monitor, predict, and optimize performance of the digital twin. In practice, AI can’t be trained before the data to train it exists. So if you want to deploy a digital twin with AI, you have two options: wait until the digital twin starts collecting data and train the AI on that, or simulate the data that the digital twin will generate and train the AI before the digital twin is even running.
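As a rough illustration of the second option, the sketch below simulates labeled sensor readings for a twin that does not exist yet. Everything here is an assumption made up for the example: the pump, the 60 °C baseline, the daily temperature cycle, the noise level, and the linear fault drift. A real project would replace these with a physically grounded model of the system.

```python
import math
import random


def simulate_pump_temperature(hours, seed, fault_start=None):
    """Return (hour, temperature_c, is_fault) tuples for a hypothetical pump.

    If fault_start is given, readings from that hour onward drift upward
    and are labeled as faulty.
    """
    rng = random.Random(seed)
    readings = []
    for hour in range(hours):
        daily_cycle = 5.0 * math.sin(2 * math.pi * hour / 24)
        temperature = 60.0 + daily_cycle + rng.gauss(0.0, 0.5)
        is_fault = fault_start is not None and hour >= fault_start
        if is_fault:
            # Assumed fault model: temperature rises 0.8 degrees per hour.
            temperature += 0.8 * (hour - fault_start + 1)
        readings.append((hour, round(temperature, 2), is_fault))
    return readings


healthy_run = simulate_pump_temperature(hours=48, seed=1)
faulty_run = simulate_pump_temperature(hours=48, seed=1, fault_start=36)
```

Because the simulation assigns the fault label itself, every reading arrives already annotated, and an anomaly detector could be trained on runs like these before the first real measurement is collected.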

You need to know what your desired description of the world is before you can generate synthetic data to represent it, whether in 2D, 3D, or 4D. However, once you have that description, you can use synthetic content generation to create nearly limitless varieties of data to test and train AI.

Is Synthetic Data in Rendered.ai created using CycleGAN techniques?

Many readers may not be familiar with CycleGANs, a type of generative adversarial network (GAN). Essentially, a GAN is a pattern of structuring ML algorithms in a feedback loop, with a generator competing against a discriminator, that can be used to find the parameters needed to algorithmically output generated data with the properties of the source data used for training.

One technique for synthetic data generation is to train a CycleGAN and then to use the trained algorithm to generate new data directly from the GAN. For structured data, such as medical records, legal records, or other protected form-based information, this type of CycleGAN application can be used to generate simulated datasets that are anonymized but retain the properties of the source dataset used to train the GAN.

At Rendered.ai, we enable users to deploy CycleGANs as part of their synthetic data generation process, but typically in a post-processing step that helps them create synthetic data with more of the characteristics of real data. AI tends to be influenced by subtle patterns in data that may not be apparent to humans and this technique is one tool for reducing or removing characteristics that may be more ‘synthetic’ than are likely to appear in real data. Examples of this include subtle effects of atmospheric interference, unusual lens aberrations in a camera system, or possibly unusual mottling in roof textures.

To accomplish this with Rendered.ai, a user will train a CycleGAN using well-established tools and some of their own real sensor data, then upload their trained CycleGAN model (typically a .pth file containing trained parameters). The user can then use their model to post-process datasets that have been generated with Rendered.ai. In the project that we are supporting for Orbital Insight, for example, this technique was used to make synthetic satellite images nearly indistinguishable from real images to the human eye.

…

I look forward to more conversations with the working group and the GEOINT community! We’ll actually be at GEOINT 2022 in Booth 835 this April! We can’t wait to see you there.

In the meantime, if you’re curious about what we’re doing at Rendered.ai, sign up to access our PaaS and try it for yourself:

https://hubs.li/Q012NcJm0
