

Last month, a software engineer named Sammy Azdoufal wanted to drive his DJI Romo robot vacuum with a PS5 controller. Just for fun. He used an AI coding assistant to reverse-engineer the vacuum’s communication protocols, built a custom client, and connected to DJI’s servers.
What came back wasn’t just his vacuum. Roughly 7,000 Romo units across 24 countries began responding to him as their operator. Live camera feeds. Microphone audio. Detailed floor plans of strangers’ homes.
He didn’t hack anything. He simply used his own device token. DJI’s server handed him everyone else’s homes for free. The messaging broker that handles communication between devices and the cloud had no permission controls. Encryption was fine. The permission model was broken.
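The failure class is worth making concrete. Below is a minimal sketch of the missing check, using hypothetical names rather than DJI's actual code: a broker that authenticates tokens but never checks *which* devices a token may control will route any valid user to any device in the fleet.

```python
# Illustrative sketch of authentication without authorization; hypothetical
# names, not DJI's actual broker code.

AUTHORIZED = {
    "token-sammy": {"romo-0001"},  # each token should map only to its own devices
}

def can_access(token: str, device_id: str) -> bool:
    """Per-device ACL: holding a valid token alone is not enough."""
    return device_id in AUTHORIZED.get(token, set())

def route_broken(token: str, device_id: str) -> bool:
    # The failure mode described above: any authenticated token reaches any device.
    return token in AUTHORIZED

def route_fixed(token: str, device_id: str) -> bool:
    # Authenticate AND authorize for the specific device.
    return can_access(token, device_id)

assert route_broken("token-sammy", "romo-6999")      # a stranger's vacuum: allowed
assert not route_fixed("token-sammy", "romo-6999")   # correctly denied
```

The encryption layer can be flawless and this still fails, because the check that matters happens after the encrypted channel is established.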
Around the same time, a Swedish newspaper investigation revealed that footage from Meta’s Ray-Ban smart glasses, including video of people undressing, using the bathroom, and handling credit cards, was being reviewed by contractors in Nairobi. Workers were labeling data to train AI models. One contractor told reporters: “I don’t think they know. If they knew, they wouldn’t be recording.”
That wasn’t a hack either. It was the system working as designed. Meta’s terms of service explicitly permit human review of user interactions. The company made AI camera and voice features the default in April 2025 and removed the ability to opt out of voice recording storage. Seven million pairs sold in 2025. The product was marketed as “designed for privacy, controlled by you.” It is now the subject of a class action lawsuit and a UK regulatory investigation.
These look like security failures. They are. But they’re also symptoms of something more structural. AI data collection is migrating through phases, each more intimate than the last, and we have no ownership framework for any of them.
The first phase is already normalized. Your online behavior, search history, social graphs, prompts typed into chatbots, all of it feeding models and ad targeting systems. This is the bargain of the consumer internet. Free services in exchange for data. Most people have made their peace with it.
But even Phase 1 goes deeper than most people realize. Niantic just revealed that Pokémon Go players unknowingly generated 30 billion images now being used to train delivery robots to navigate city streets. Players scanned real-world landmarks for in-game rewards. Those scans built 3D models of the physical world. A decade later, that data powers robot navigation. You weren’t browsing the internet. You were walking around your neighborhood scanning it for a game. Now a robot is using that scan to deliver someone’s pizza.
The second phase is happening now. AI is moving from the screen into the physical world. Smart glasses see what you see. Robot vacuums map the interior of your home with cameras and lidar. Autonomous vehicles carry sensor arrays that provide 360-degree awareness of city streets, with draft privacy policies that contemplate using interior camera footage for personalized advertising. Waymo’s remote fleet operators, some based in the Philippines, guide vehicles navigating American streets.
The jump from phase one to phase two changes the threat model entirely. A chatbot that leaks your prompt is embarrassing. Glasses that stream your bedroom to a contractor on another continent are a different category of problem. A robot vacuum that lets a stranger map your home is another.
The third phase is arriving faster than most people realize. Personal AI agents that don’t just answer questions but act on your behalf. Projects like OpenClaw and Hermes Agent represent a new category of software. Autonomous AI that connects to your email, calendar, messaging apps, bank accounts, and health data. These agents send messages, schedule appointments, manage finances, and negotiate with customer service reps while you sleep. OpenClaw has grown to over 300,000 users since late 2025. Nvidia’s Jensen Huang called it “the most important software release probably ever.”
Each phase is more intimate than the last. Phase one captures what you search for. Phase two captures how you live. Phase three captures everything, and acts on it.
A persistent agent that manages your email, finances, health, and schedule builds a more complete picture of your life than any single platform ever could. And it gets better at being you the longer you use it. That’s the product pitch. It’s also the most valuable data asset imaginable.
The relationship between a user and a persistent AI agent starts to resemble the premise of Her more than a productivity tool. You’re teaching it how you think, how you decide, what you care about. It learns you. The difference is that in the movie, nobody asked who owned the data.
And the security picture at every phase is grim. DJI’s vacuum had permissions so broken that a single token unlocked the entire fleet. Meta’s glasses funneled intimate footage to overseas contractors through a system designed to work that way. A Kaspersky audit of OpenClaw found 512 vulnerabilities, eight of them critical. Cisco’s security team tested a third-party OpenClaw plugin and found it performing data exfiltration without the user’s awareness. One of OpenClaw’s own maintainers warned that the project is “far too dangerous” for anyone who can’t understand how to run a command line.
The critical distinction between phases is simple. A breach from smart glasses leaks what you’ve seen. A compromised personal agent can act as you. It has your credentials, your accounts, your identity.
The business model requires it.
Many AI companies are pricing compute below cost to capture market share. OpenAI has raised over $160 billion. Microsoft reportedly lost $20 per user per month on GitHub Copilot. The fastest-growing AI startups run at 25% gross margins or less. When the product is priced below cost, the difference gets paid somewhere. Usually in data. This is a venture-funded land grab. And like most subsidized land grabs, it prioritizes growth over everything else, including security.
But the subsidy can’t last. And when it compresses, the data becomes the monetization layer. This is the path from “AI-powered product” to “ad-supported AI product.” OpenAI started testing ads for free-tier users in early 2026. Waymo’s draft privacy policy already contemplates personalized in-vehicle advertising. Meta built its entire empire on this model. The trajectory is clear. Collect now, monetize later.
The deeper problem is that users have no structural leverage at any point in this cycle. The terms of service are written by the collector, interpreted by the collector, and enforced by the collector. The data protection lawyer quoted in the Meta investigation put it simply: “Once the material has been fed into the models, the user in practice loses control over how it is used.” That’s true at every phase. Your search history, your living room, your agent’s behavioral model of your entire life. None of it is meaningfully yours once it enters the pipeline.
And AI is simultaneously making these systems easier to break. Azdoufal used an AI coding assistant to reverse-engineer DJI’s protocols in a weekend. The tools to discover vulnerabilities are becoming accessible to hobbyists while the tools to prevent them remain expensive and slow. The asymmetry runs in the wrong direction.
This is where Venice.ai is worth examining. Not because it solves the hardware or agent problem directly, but because it shows what happens when the economic model is built around privacy instead of bolting it on later.
Venice runs open-source AI models where prompts are encrypted and never stored on servers. Conversation history lives locally on the user’s device. The more interesting part is the business model. Users stake tokens for inference capacity rather than paying per request. The platform doesn’t need to monetize user data because revenue comes from compute allocation, not data extraction. No ads. No training on user inputs. No incentive to collect.
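The structural point can be sketched in a few lines. This is a hypothetical local-first client pattern, not Venice's actual implementation, and the XOR cipher is a toy stand-in for real encryption: conversation history is written only to local disk, and only ciphertext ever leaves the device.

```python
# Sketch of the local-first pattern; hypothetical names, not Venice's code.
import json
import secrets
from pathlib import Path

HISTORY = Path("chat_history.json")  # lives on the user's device, not a server

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for real encryption; use a vetted library in practice."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def send_prompt(prompt: str, key: bytes) -> bytes:
    # Only this ciphertext would be transmitted; the server never sees plaintext.
    return xor_encrypt(prompt.encode(), key)

def save_locally(prompt: str, reply: str) -> None:
    # History accumulates on the user's own disk, not in a provider database.
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    history.append({"prompt": prompt, "reply": reply})
    HISTORY.write_text(json.dumps(history))

key = secrets.token_bytes(32)
wire = send_prompt("plan my week", key)
assert xor_encrypt(wire, key).decode() == "plan my week"  # round-trips locally
```

The design choice that matters is not the cipher but the data flow: there is no server-side store to subpoena, breach, or quietly repurpose for training.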
That’s a fundamentally different constraint than “we promise not to look.” When the economic model doesn’t depend on data extraction, the attack surface shrinks. When consent is structural rather than contractual, it’s harder to erode.
Venice is one piece of the puzzle. It already works as an inference backend for personal agents, but you can’t directly apply its model to smart glasses or autonomous vehicles. The principle still transfers. If you want users to actually own their data, the business model has to make data collection unnecessary, not just regulated.
The same logic applies to personal agents. OpenClaw and Hermes can both run on your own hardware, which is a genuine privacy advantage. But local execution alone isn’t enough if the agent can be compromised through the data it sees. The full answer requires local execution, encrypted data handling, sandboxed permissions, and an economic model that doesn’t create incentives to exploit what the agent knows about you.
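The "sandboxed permissions" piece can also be made concrete. A minimal sketch, assuming a hypothetical scope-based grant model rather than any real agent framework's API: a tool whose scope was never granted simply cannot run, no matter what the model decides to attempt.

```python
# Sketch of scope-gated agent tools; hypothetical names, not OpenClaw's or
# Hermes's actual API.

GRANTED = {"calendar:read", "email:read"}  # scopes the user explicitly approved

def requires(scope: str):
    """Decorator: refuse a tool call unless its scope is in the user's grant."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if scope not in GRANTED:
                raise PermissionError(f"agent lacks scope {scope!r}")
            return fn(*args, **kwargs)
        return inner
    return wrap

@requires("calendar:read")
def list_events():
    return ["dentist 9am"]

@requires("bank:transfer")  # never granted: the agent cannot move money
def transfer_funds(amount: int):
    return f"sent {amount}"

assert list_events() == ["dentist 9am"]
```

The enforcement lives outside the model. A prompt-injected agent can ask for the transfer all it likes; the sandbox, not the model's judgment, is what says no.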
The hardware layer matters too. Running your own inference on your own chips is possible but expensive and impractical for most users. The growing popularity of devices like the Mac Mini as local AI servers hints at where this is heading. Companies whose business model is selling hardware rather than extracting data, Apple being the obvious example, may be better positioned for the ownership era than the platforms currently doing the collecting. As AI commoditizes software creation, the hardware you run it on becomes more important, not less.
Architecture over policy. Constraints over promises.
The companies that get this right will treat data ownership as infrastructure. The ones that don’t will keep shipping cameras into your living room and hoping nobody picks up a PS5 controller.
Thanks for reading Mixed Realities by TJ Kawamura! Subscribe for free to receive new posts and support my work.
Welcome to Mixed Realities, a place where I share thoughts on the future of the physical and digital worlds and the interactions between the two. I explore how these overlapping realities shape the way we live, connect, and create. I’m TJ Kawamura, an entrepreneur and investor exploring the intersections of technology, culture, and community. My background spans building companies, advising in the crypto and gaming space, and writing about emerging technologies and the rituals that shape daily life. You can find more about my work at tjkawamura.com

