
Social media platforms are known for trying to keep users on for as long as possible; actively capping how much they can see is unheard of. Yet that is exactly what Elon Musk has done, imposing what amounts to "screen-time limits for minors" on every Twitter user. And all of it, supposedly, was forced by AI?
Today, how many tweets a Twitter user can view in a day no longer depends on how fast they scroll or how late they are willing to stay up. There is a hard number: 10,000 for verified accounts (that is, paying "Twitter Blue" subscribers), 1,000 for unverified accounts, and just 500 for newly registered unverified accounts.
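In effect, the policy is a per-account daily counter checked against a per-tier cap. The sketch below is purely illustrative — only the three thresholds come from the announcement; the `TieredRateLimiter` class and its methods are hypothetical, not Twitter's actual implementation:

```python
from collections import defaultdict

# Daily view caps per account tier, as reported; the limiter logic
# itself is a hypothetical sketch, not Twitter's real code.
DAILY_LIMITS = {
    "verified": 10_000,
    "unverified": 1_000,
    "new_unverified": 500,
}

class TieredRateLimiter:
    def __init__(self, limits=DAILY_LIMITS):
        self.limits = limits
        self.views = defaultdict(int)  # account id -> tweets viewed today

    def allow_view(self, account_id: str, tier: str) -> bool:
        """Count the view and return True if the account is under its cap."""
        if self.views[account_id] >= self.limits[tier]:
            return False  # surface the "rate limit exceeded" error
        self.views[account_id] += 1
        return True

    def reset_day(self):
        """Clear all counters at the start of a new day."""
        self.views.clear()
```

One design consequence visible even in this toy version: the check is per account, so the cap punishes any single identity that reads too much — whether that identity is a night-owl human or a scraper's bot account.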
These are the caps Musk has already raised twice in the face of angry users. The stated reason: "to address extreme levels of data scraping and system manipulation."
He was pointing at AI companies, which need vast amounts of data to feed their model training. Musk cut off OpenAI's access to Twitter data last December, and in April of this year accused Microsoft of illegally using Twitter data.
Just as Musk takes drastic steps against data scraping, OpenAI is facing a class-action lawsuit. The 16 plaintiffs are all individuals — in other words, ordinary Internet users. They accuse OpenAI of secretly "scraping 300 billion words from the Internet" and stealing "vast amounts of private information" from Internet users without permission in order to train ChatGPT.
With Internet users, and the platforms that have accumulated years of their UGC, on one side, and the emerging AIGC companies on the other, a war over data scraping and privacy has begun.
01
It was a Friday, with the weekend in easy reach — but Twitter users were dumbfounded when an error message appeared on their screens, telling them they had exceeded the "rate limit" and violated Twitter's rules by viewing too many tweets.
Nobody knew what this meant until Twitter boss Musk stepped forward to confirm that there was indeed a rate limit, announcing that, to address extreme levels of data scraping and system manipulation, the daily viewing caps for verified, unverified, and newly registered unverified accounts would be 6,000, 600, and 300 tweets respectively.

Joking aside, Musk gave a clear rationale for the "experiment": fighting data scraping. And users' dissatisfaction was aimed at the throttling itself and how it was rolled out, not at the campaign against scraping.
How serious is the problem of AI startups coming to Twitter to "scoop up" data? In a tweet, Musk said the surge in traffic forced Twitter to bring backup servers online: "It is rather galling to have to bring large numbers of servers online on an emergency basis just to facilitate some AI startup's outrageous valuation."
The day before the throttling fiasco, Epic Games CEO Tim Sweeney had also tweeted a complaint that Twitter was walling itself off. Musk replied: "Several hundred organizations (maybe more) were scraping Twitter data extremely aggressively, to the point that it was affecting the real user experience. What should we do? I'm open to all ideas."
Sweeney, who had been complaining a moment earlier, quickly turned to serious suggestions: write a ban on data scraping into Twitter's terms of service, harden the platform with security engineering, and take legal action against companies that abuse Twitter at scale.
Notably, Musk replied that legal action would "absolutely" be taken against those who steal data: "(Optimistically) 2 to 3 years from now, expect to see them in court."
Whether or not the "it's really about pushing paid subscriptions" theory wrongs Musk, he may well have his own agenda beyond waving the flag of user privacy: in April he was reported to be setting up a new artificial intelligence company, X.AI, to take on ChatGPT. And data, of course, is best kept for one's own use.
Whatever the motive, the company seems prepared to fight the AI startups to the end — even to the point of actively throttling its own platform.
02
While Musk throttles his entire platform, OpenAI — creator of ChatGPT and the company that set off the AIGC boom — finds itself in a class-action lawsuit.
The suit was filed in the U.S. District Court for the Northern District of California by 16 plaintiffs, all anonymous individuals. The complaint runs to 157 pages and opens with a quote from Stephen Hawking: "The rise of powerful artificial intelligence is either the best thing that has ever happened to mankind, or the worst." The defendants include not only OpenAI but also Microsoft, which has poured tens of billions of dollars into it.
The central allegation is that ChatGPT violated "the copyrights and privacy of countless people" when it used data collected from the Internet to "train its technology."
The suit alleges that OpenAI violated privacy laws by secretly scraping 300 billion words from the Internet, tapping "books, articles, websites and posts — including personal information obtained without consent." Among other things, OpenAI is said to have crawled large amounts of web data, including data from social media.
The plaintiffs also point out that OpenAI maintains a proprietary AI corpus that has accumulated large amounts of personal data, including Reddit posts and the websites they link to.
Those are the allegations on the training side. In addition, the plaintiffs claim that users' interactions with OpenAI's products, and private information inside those products, were also illegally accessed and misappropriated on a massive scale.
This is not the first U.S. class action OpenAI has faced. Last November, GitHub programmers filed a class action against GitHub, OpenAI and Microsoft, alleging that OpenAI violated open-source licenses by using their contributed code to train the proprietary AI tool GitHub Copilot.
ChatGPT had not even launched then; in hindsight, the AI training-data problem was already out in the open. The latest class action targets ChatGPT itself, whose user base — and the pool of people whose data may have been violated (essentially everyone) — is far larger. More importantly, amid the AIGC frenzy, any legal precedent could shape everything that follows.
In a statement, Clarkson, the public-interest law firm bringing the case, called the class action a "landmark" federal case that should serve as a warning to the AI field as a whole.
From this perspective, OpenAI does have a heavy burden on its shoulders.
OpenAI has already run into plenty of trouble over data scraping and privacy; platform lockdowns and user backlash are just the tip of the iceberg.
In Europe, OpenAI has been investigated by several countries; in April of this year, Italy went so far as to temporarily block ChatGPT over fears that it violated European data-protection law.
Regulation is advancing across the whole AI space. France launched an AI action plan in May; on the AIGC front, the French privacy regulator is particularly concerned about AI models that collect data from the Internet to build datasets for training large language models.
Weightiest of all is the EU's AI Act, now heading into its final stage. The bill may well become a template for AI governance worldwide.
03
Platforms, users and regulators — three forces have converged, determined to set rules for AIGC as soon as possible, and to start at the source: large-model training.
On the one hand, time is short: AIGC is growing too fast.
Which "AI startup with an outrageous valuation" Musk had in mind, we don't know. After all, wave after wave of financing is sweeping through the AIGC field, all of it hot money.
Among the startups, OpenAI is valued at nearly $30 billion with $11.3 billion in total financing, making it the richest in AIGC; Anthropic, the second richest, is valued at more than $4 billion. And Inflection, which stunned Silicon Valley just days ago with a $1.3 billion round, is already valued at $4 billion — barely a year after it was founded.
Inflection builds its own large language model. With the fresh $1.3 billion in hand, it has announced plans to deploy 22,000 Nvidia H100 chips and build the world's largest artificial-intelligence cluster. With computing power on that scale, the models and training datasets involved are bound to be staggering as well.
On the other hand, ChatGPT's problems are not so easy to "fix" once exposed. Across OpenAI's generations of large language models, GPT-2's dataset held 40GB of text, and GPT-3 (the model behind the initial ChatGPT release) was trained on 570GB of data. For GPT-4, released only this year, the dataset size was not disclosed at all.
This flood of data was never well documented to begin with. Former Google research scientist Nithya Sambasivan has said in interviews that tech companies do not document how they collect or annotate AI training data, and often do not even know what a dataset contains.
The opaque ChatGPT is like a black box — and a black box built in a back room at that. Genuine transparency and privacy protection are now hard to deliver: listing exactly what data was scraped, explaining how it is used, or deleting a given piece of data at a user's request.
There is another reason Internet users and regulators are biting down so hard on the OpenAIs of the world: in the years when social media was growing up, awareness of personal data protection was still in its infancy, and by the time the fight began, too much ground had already been lost.
When Zuckerberg sat before Congress for the first time in 2018, Facebook had already been live for 14 years. The platform was mired in the Cambridge Analytica scandal, which the company's chief technology officer said affected 87 million users — another disaster rooted in data scraping.
When Sam Altman sat before Congress this May, members repeatedly voiced regret over their inaction in the social-media era. The message was clear: this time, if they cannot get ahead of AIGC, they must at least keep pace with it.
One after another, the large models are still training. Data scraping is the loose thread — pull on it, and the AIGC tangle may start to unravel.