In early 2024, reporting revealed that OpenAI had used Whisper, its speech-to-text tool, to transcribe more than a million hours of YouTube videos for use as training data. This raised serious copyright concerns.

Since then, we’ve learned that other AI competitors have employed similar tactics, often infringing on proprietary content like books, videos, and more. These moves weren’t necessarily about cutting corners but about feeding the insatiable hunger of large language models (LLMs) such as GPT—vast, foundational systems that need colossal amounts of training data.

The open internet wasn’t enough.

In fact, even after scouring every digital corner and enlisting us, the consumers, as real-time feedback providers through reinforcement learning from human feedback, in which models are refined based on human ratings of their output, the hunt for more data continued. That constant need for refinement has pushed the industry toward new solutions.
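The feedback loop described above can be sketched in miniature. This is a toy illustration, not a real RLHF pipeline: the names (`rate_response`, `preferences`) are invented for this example, and a stand-in function plays the role of the human rater whose feedback steers the model toward better answers.

```python
import random

# Toy sketch of learning from human feedback: the "model" chooses between
# candidate responses, and simulated human ratings nudge its preferences
# toward the responses people reward.
responses = ["curt answer", "helpful answer"]
preferences = {r: 0.0 for r in responses}  # learned preference scores
LEARNING_RATE = 0.1

def rate_response(response: str) -> float:
    """Stand-in for a human rater: rewards the helpful answer."""
    return 1.0 if response == "helpful answer" else -1.0

random.seed(0)
for _ in range(200):
    # Mostly exploit the current best response, with some exploration.
    if random.random() < 0.2:
        choice = random.choice(responses)
    else:
        choice = max(preferences, key=preferences.get)
    reward = rate_response(choice)                  # human feedback signal
    preferences[choice] += LEARNING_RATE * reward   # reinforce or penalize
```

After enough rounds, the preference score for the helpful answer dominates, which is the essence of how human feedback shapes model behavior.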

Cue synthetic data.

Synthetic data isn’t new. Pioneers like AI River and MDClone were early players in this domain. However, as AI models became more sophisticated, companies started generating their own synthetic data, produced by AI models themselves.

The benefit? Machines can create massive amounts of artificial data that mimics human-generated content, sustaining the training of these foundation models at virtually unlimited scale.
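The core idea can be shown with a deliberately tiny example. In real pipelines the generator is itself a large language model; here, as an assumption-laden stand-in, a simple bigram model is "trained" on a few human-written seed sentences and then produces as many synthetic samples as we ask for, each mimicking the patterns of the seed data.

```python
import random

# Minimal sketch of synthetic data generation: learn patterns from a small
# human-written seed corpus, then sample unlimited artificial text from them.
seed_corpus = [
    "the model learns from data",
    "the data shapes the model",
    "synthetic data feeds the model",
]

# Build word-to-next-word (bigram) transitions from the human seed data.
transitions: dict[str, list[str]] = {}
for sentence in seed_corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        transitions.setdefault(a, []).append(b)

def generate(start: str = "the", max_len: int = 6) -> str:
    """Sample a synthetic sentence that mimics the seed corpus."""
    words = [start]
    while len(words) < max_len and words[-1] in transitions:
        words.append(random.choice(transitions[words[-1]]))
    return " ".join(words)

random.seed(1)
synthetic = [generate() for _ in range(5)]  # arbitrarily many samples
```

Note what the sketch also demonstrates about the bias problem: every synthetic sentence is built only from words and patterns already in the seed data, so whatever the seed reflects, the synthetic output reflects too.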

The Bias Challenge

AI models—despite their technical prowess—are only as good as the data they’re trained on. And our world is filled with biases. These biases naturally seep into AI models. For example, if you ask ChatGPT about typical gender roles in a workplace, you might receive an answer that reflects historical stereotypes. Given that human bias is unlikely to disappear, the use of synthetic data brings with it an intriguing dilemma: How will it impact society at large?

Imagine if ChatGPT had existed in the 1930s. If you asked it, “What does a typical woman do for a living?” its response would most likely describe a homemaker, responsible for managing household affairs. Ask the same question today, and you’ll receive a much more diverse and inclusive answer that reflects the many roles women hold in the modern workforce.

Now, consider this: in the next decade, if 99% of training data shifts from human-generated content to synthetic data, what happens to AI’s perspective on the world? AI is everywhere: in our daily tasks, personal assistants, scheduling, even bill payments. If synthetic data forms the bulk of training data, the biases and tendencies of society may stagnate, reflecting a kind of “freeze” in time.

The “Frozen” Model of AI

Theoretically, if AI models are primarily trained on synthetic data, they could mirror today’s biases and social structures indefinitely. Imagine an AI world that perpetually interprets gender roles, cultural shifts, and societal nuances as they existed in 2023, even as real-world progress is made. This could result in a lagging AI, slowing the pace at which societal changes are reflected in technology.

Of course, humans will continue to generate new data, so this won’t be a total freeze in time. But over the long term, we might witness a slow-down in how AIs evolve in understanding the world. A gap may form between how AI interprets human behavior and how society actually changes.

What’s Next?

As generative AI becomes more ubiquitous, its next big leap could be into robotics, creating AI-driven machines. Further down the line, the boundaries between humans and AI might blur. AI could become integrated into our bodies, linking brain and machine, enabling our tech to think and act with us. Picture this: you walk into a supermarket, and your AI chip has already scanned your fridge via your retina, identifying missing items. It might even diagnose a future illness or injury before symptoms arise.

While this sounds like science fiction, the next leap for AI is already on the horizon.

The questions we should ask ourselves are: What roles do we want AI to play in our lives? How do we prevent it from solidifying our biases? And most critically, how do we ensure it progresses in tandem with humanity?

In this rapidly evolving technological landscape, stopping time may not just be a philosophical question—it might soon be a technical reality.
