
AI Job of the Future: Synthetic Data Engineer

April 15, 2022 | August 29th, 2023

We don’t teach data scientists about 3D graphics, and we shouldn’t. The skillset required to create synthetic datasets differs fundamentally from the skillset required to analyze and use data. With synthetic data improving AI performance, driving down bias, and enabling innovation, a new type of engineer will be essential to the success and competitive advantage of organizations that are adopting AI and ML systems. We call these future heroes ‘Synthetic Data Engineers.’ Synthetic Data Engineers will need to be skilled in some of the most exciting technologies on the market (3D, game engines, and physics-based simulation), and they will need tools to use these skills in support of data scientists.

Learn what skills will be required to be a Synthetic Data Engineer at GEOINT 2022 on April 26 in Aurora, Colorado!

To capitalize on AI, organizations will need both users who can engineer synthetic data and those who know how to consume data and provide iterative feedback

Understanding the problem

If, as Gartner believes, 60% of data used to train AI will be synthetic data by 2024, who is going to be generating all of that data? Today, we find that many companies are starting to hear about synthetic data and are beginning to investigate it, but they often don’t know how to begin creating it.

The first consideration is the type of data being used to train an organization’s AI and ML systems. In our focus area, Computer Vision (CV) data, most customers’ first instinct is to say, “We just need to buy, collect, beg, or borrow more sensor data!” When customers see the cost, schedule, and environmental impact of collecting millions of images, videos, or 3D scans to get a few thousand instances that can be used to train a model, they start to consider alternatives such as synthetic data.

Synthetic data, for those who may not know, is engineered (or “fake”) data, produced to specifications derived from real data, that is used to train AI. Because synthetic data is designed, it can be created to meet specific training needs, such as reducing bias, detecting rare objects, or even training for data that doesn’t exist yet.

Typical progression of initial efforts with synthetic data

For CV applications*, customers must define the type of training data that is required. This includes understanding file formats, sizes, number of channels and bits-per-channel in imagery datasets, movement characteristics of the imaging platform, and much more. Essentially, the customer needs to be able to characterize data collection well enough that it can be successfully simulated.
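
As a rough illustration of what characterizing a collection might look like in practice, the sketch below records a few of the properties named above in a small Python structure and estimates uncompressed storage. The class name, field names, and example values are all hypothetical; they are not taken from any particular platform or sensor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagerySpec:
    """Hypothetical description of the imagery a simulated sensor should produce."""
    file_format: str        # e.g. "png" or "tiff"
    width_px: int
    height_px: int
    channels: int           # e.g. 3 for RGB, 1 for panchromatic
    bits_per_channel: int   # e.g. 8 or 16

    def bytes_per_image(self) -> int:
        """Uncompressed size of one image, ignoring headers and compression."""
        bits = self.width_px * self.height_px * self.channels * self.bits_per_channel
        return (bits + 7) // 8  # round up to whole bytes

    def dataset_bytes(self, image_count: int) -> int:
        """Uncompressed size of a dataset containing image_count images."""
        return self.bytes_per_image() * image_count


# Illustrative values only.
spec = ImagerySpec("tiff", width_px=1024, height_px=1024, channels=3, bits_per_channel=16)
print(spec.bytes_per_image(), "bytes per image")
print(spec.dataset_bytes(10_000) / 1e9, "GB uncompressed for 10,000 images")
```

Writing the specification down this way makes it easier to check that simulated output matches what the real collection would deliver before any rendering begins.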

The next task becomes defining the world that needs to be imaged or collected in simulated sensor data. CV applications range from identifying rare species of plants and animals in wildlands, to counting military vehicles in urban areas, to detecting problems in factory manufacturing… they are endless.

These requirements are used to define a synthetic data application, which we call a ‘channel.’ In the PaaS, a user can deploy a channel, create graphs that describe specific dataset configurations, and generate datasets over and over to achieve desired results.
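
To make the idea of generating datasets repeatedly from a described configuration more concrete, here is a minimal, platform-independent sketch. It is not the PaaS API: the `SceneConfig` fields, the parameter ranges, and the `sample_scenes` helper are assumptions invented for illustration. It shows only that a seeded sampler can reproduce the same set of scene parameters on every run.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical per-image scene parameters; not any platform's real schema."""
    sun_elevation_deg: float
    object_count: int
    sensor_altitude_m: float


def sample_scenes(n_images: int, seed: int) -> list[SceneConfig]:
    """Draw n_images scene configurations reproducibly from fixed ranges."""
    rng = random.Random(seed)  # private RNG, so the same seed gives the same dataset
    return [
        SceneConfig(
            sun_elevation_deg=rng.uniform(10.0, 80.0),
            object_count=rng.randint(0, 25),
            sensor_altitude_m=rng.uniform(300.0, 1500.0),
        )
        for _ in range(n_images)
    ]


first_run = sample_scenes(n_images=5, seed=42)
second_run = sample_scenes(n_images=5, seed=42)
print(first_run == second_run)  # identical seeds yield identical configurations
```

Changing the seed or the ranges yields a new variation of the dataset, while keeping them fixed lets a team regenerate exactly the same data when comparing model results.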

Curious? Sign up for a free account and generate data for yourself!

Hiring game designers for a data science team?

At this point, many organizations may be scratching their heads thinking something like, “Just how do I simulate images from a camera that hasn’t been designed yet and which will be flying over the Earth on a satellite at 17,500 mph?”

The answer is, there’s tech for that. In the last 10 years, 3D hardware and software technologies have evolved to levels of speed, interactivity, physics-based accuracy, and realistic rendering (picture generation) that have the potential to put simulation technology in the hands of any scientist, engineer, or analyst across most CV domains.

The Synthetic Data Engineer is the experienced professional who can bring together many of these skills and technologies to design and create new types of CV content for training AI. In fact, our expert team often takes on the role of the Synthetic Data Engineer for our customers.

Some of the specific tools and skills that we have found useful for Synthetic Data Engineering include:

  • 3D modeling tools, formats, and specifications
  • Animation
  • Game engine development
  • Geospatial technology and data, GIS tools
  • Simulation modeling
  • Image manipulation
  • Python programming

There are not-so-small corners of the high-tech industry that have been incubating Synthetic Data Engineers for a while. We’re seeing highly qualified candidates come from the VFX (visual effects), Defense M&S (Modeling and Simulation), game development, and even the Architecture and Engineering communities.

The Synthetic Data Engineers of tomorrow are already learning the skills they need today

Why isn’t everyone hiring 3D developers to generate Synthetic Data?

It turns out that many CV-oriented organizations have tried to create synthetic data at some point in their AI evolution. If they thought they could hire a game developer or a VFX artist, generate a few datasets, and then go on with their business, they were likely wrong.

Synthetic data generation tends to be an iterative, evolving capability that requires much more than just the Synthetic Data Engineer’s skills. Organizations also need access to cloud-based job management, user management, dataset curation and search, custom annotation generation, dataset characterization and comparison, and many other tools that enable users to design, generate, analyze, and organize synthetic data. These capabilities are essentially platform needs shared by anyone generating synthetic data, which is why we built the PaaS!

Successful synthetic data efforts require skills, domain knowledge, and platform tools that can be relied upon across many data generation projects

We are already seeing that access to a PaaS for synthetic data generation can expand the opportunity for successful AI adoption in organizations by enabling teams to adapt cutting-edge market capabilities in 3D and simulation.

What other skills are helpful for Synthetic Data Engineers?

As a contributor to a team focused on designing high performance AI/ML capability that can enable an organization to compete, innovate, and adapt, the Synthetic Data Engineer needs to be able to work with a variety of peers and stakeholders to have optimum impact. Some of the skills that we also see adding value include:

  • Familiarity with ML/AI/DL techniques
  • Statistics experience and background
  • Familiarity with the sensor domain of the business or organization
  • Ability to translate requirements from data scientists, end stakeholders, and data collection experts into code

Learning more

If you’re interested in learning more about synthetic data, and some of the skills that are needed to create successful synthetic CV content, check out some of the following links:

We even open source many of our tools for Synthetic Data Engineers who want to dive in:

And don’t forget to sign up and try out for yourself:

— — — — — — — — — — — — — — — — — — — — — — —

*In the case of text content, NLP training, and other speech and written data, we don’t dabble too much in that space, but there are some great companies out there who are doing so. Take a look at or, for example.
