
Summary of views in this issue:
AI application service providers such as ChatGPT provide services directly to individuals and collect and process personal information; they can therefore be regarded as compliance subjects under personal information protection law, i.e., data controllers.
Compared with the typical scenario of mobile internet apps, the personal information processing activities of generative AI service providers have their own characteristics and a different data compliance focus.
Under the GDPR, EU data protection authorities (DPAs) are regulators, not market access gatekeepers; their role is mainly to guide and supervise companies in meeting data compliance requirements.
The real future challenge comes from AI-enabled application services of all kinds, and new thinking is needed to solve new data security problems.
Consumer-facing (C-end) AI application service providers are data controllers
Not all market players are obligated subjects under the data compliance framework; this must be determined according to technical principles, business scenarios, and legal norms. When a player's identities overlap, its compliance obligations must also be matched to the different business processes involved. Based on this analytical framework, we argued in detail in the previous article that large model developers may not be identified as the legal subjects of privacy data compliance (data controllers) during the model development stage.
Based on the same analytical framework, we argue that operators of generative AI services for C-end individual users may be identified as data controllers for privacy data compliance purposes. For example, when OpenAI released the ChatGPT application service to the public in November 2022 and surpassed 100 million users within two months, making it the fastest-growing consumer app in history, its status as a data controller was established.
Practice abroad bears this out. AI application service providers serving individuals have put full privacy policies and user agreements in place on the data compliance front, informing users of what types of data are collected and how the data are processed. OpenAI lists the types of data it collects in its privacy policy [1], including account information, communication content, and usage history; the purposes of data processing include, but are not limited to, providing and improving services, fraud prevention, network and information security, and fulfilling legal obligations. Similarly, Midjourney, a public-facing image generation AI provider, also provides a clear and unambiguous privacy policy [2]. Although no products have been officially launched in China, some vendors have already embedded privacy policies in their beta versions.
It is thus not hard to explain why a data protection authority (DPA) was the first regulator to step in. On March 31, the Italian data regulator Garante announced a temporary ban on ChatGPT and asked OpenAI to respond to the relevant issues within 20 days [3]. This is a normal reaction by a data regulator to an emerging application, but it has been misread to mean that a DPA can take permanent measures against a specific business. On the contrary, under the EU GDPR, DPAs, while wielding sky-high fining powers, are strictly limited to corrective powers, including recommendations, warnings, and bans that are temporary or of definite duration [4]. In other words, DPAs may not take market exclusion measures against service providers as long as those providers meet data compliance requirements. Following widespread criticism of its temporary ban, Garante signaled on April 12 that "we are ready to reopen ChatGPT on April 30 if OpenAI takes effective measures" [5].
The uniqueness of data compliance for generative AI service providers
Compared with mobile internet services, consumer-facing generative AI applications share many data compliance features: they formulate privacy policies and business agreements that clarify the legal basis for handling user data, and through privacy-protective design of their information systems they support users' rights over the personal information tied to their accounts and generated in the course of using the service, including query, access, correction, and deletion. Here, however, we are more concerned with what is unique about their personal information processing activities:
First, the types of personal information collected are relatively few. Typical mobile apps such as navigation, ride-hailing, and shopping software need to collect many types of personal information from users in real time to close the loop of personalized service. By contrast, current generative AI applications such as those of OpenAI and Midjourney are, by their underlying logic, more concerned with the quality of the generated content. The personal information collected in the application service phase mainly serves to establish user accounts and to accept and respond to user instructions (prompts), so relatively little is collected: account information (username, email), usage records (cookies, etc.), and payment information where transactions such as purchasing services are involved. Midjourney even provides a table that clearly lists the types of user information it does not collect, including sensitive user information, biometric information, and geolocation information; such information is indeed irrelevant to generative AI applications.
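To make the contrast concrete, the minimal data set described above can be sketched as a record type. This is an illustrative assumption, not any provider's actual schema; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserAccountRecord:
    """Hypothetical minimal data set for a generative AI service (illustrative only)."""
    username: str                         # account information
    email: str                            # account information
    cookie_id: str                        # usage record identifier
    payment_token: Optional[str] = None   # only when paid services are purchased
    # Deliberately absent, mirroring the "not collected" table Midjourney publishes:
    # biometric data, precise geolocation, and other sensitive categories.
```

The point of the sketch is what is *not* there: unlike a ride-hailing app's schema, nothing in it tracks real-time behavior or location.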
Second, personal information is de-identified and anonymized earlier and more extensively. In providing services, generative AI builds its data security protections mainly around the user account system and communication content. Take ChatGPT as an example: although the data sources collected in the model training phase contain little user personal information (mainly public information), in the application service phase the question-and-answer function produces more sensitive communication content, which the model further analyzes to generate responses based on the conversation context. To reduce the risk of leaking user communication content, generative AI services adopt security measures such as de-identifying and anonymizing user identity information at an earlier stage, separating user identity information from communication content, or promptly deleting communication content once the model has generated its reply. This follows from the logic that generative AI cares more about feedback content than user behavior, which distinguishes it sharply from mobile apps that are built on user behavior profiles and known for personalized recommendation.
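The "separate identity from content" measure mentioned above can be sketched with salted-hash pseudonymization: the content store is keyed only by a pseudonym, so a breach of it alone does not reveal who said what, and conversations can be purged independently of account data. This is a minimal sketch under assumed storage names, not any provider's actual architecture.

```python
import hashlib
import secrets

def pseudonymize(user_id: str, salt: bytes) -> str:
    """Replace a direct identifier with a salted hash (a pseudonym)."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()

# Identity data and communication content live in separate stores (hypothetical names).
SALT = secrets.token_bytes(16)
identity_store: dict = {}   # pseudonym -> account data (tightly access-controlled)
content_store: dict = {}    # pseudonym -> list of messages

def log_message(user_id: str, message: str) -> None:
    p = pseudonymize(user_id, SALT)
    identity_store.setdefault(p, {"user_id": user_id})
    content_store.setdefault(p, []).append(message)

def purge_conversations(user_id: str) -> None:
    """Timely deletion of communication content once replies have been generated."""
    content_store.pop(pseudonymize(user_id, SALT), None)
```

Deleting from `content_store` leaves the account record intact, which is exactly the decoupling the measure aims at.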
Third, shaped by the two points above, generative AI differs from mobile apps in where its data security risks lie. Mobile internet apps must directly collect large amounts of personal information, and the user database is an easy target for hacking and data leakage. Generative AI applications collect fewer types of user information directly; their risks concentrate instead on attacks against the model that could trace back to the underlying database, and on the potential leakage of user communication content. The Italian regulator's temporary ban on OpenAI followed an incident in which a service bug leaked user communication content. To mitigate this risk, OpenAI, which already enjoys a clear technological first-mover advantage, began exploring ways to let users delete their communication history. On April 23, OpenAI introduced new controls that allow ChatGPT users to turn off their chat history so that it is not used for model training [6].
Fourth, at the output stage, if user-guided questions involve personal information, the personal information in the output may be fabricated and false, a consequence of the large model's algorithmic logic of language prediction and generation; this may violate the information quality principle of personal information protection law, i.e., the requirement to maintain the accuracy of personal information. But behind such problems lies the general content governance challenge facing generative AI: the AI "hallucinates" and fabricates inaccurate or even false information.
OpenAI works to mitigate such problems during the development stage, including by introducing reinforcement learning from human feedback (RLHF) to guide the AI toward accurate output. Some generative AI systems also add a dual filtering mechanism on both input (prompt) and output to further avoid harmful or infringing content. Although the pace of progress of large language models is dizzying (GPT-4, released only four months after GPT-3.5, improved the factual accuracy of its output by 40% and reduced the likelihood of output violating content policies by 82% [7]), there is still no guarantee that its generated content is reliably accurate. Users should therefore remain vigilant and exercise judgment about ChatGPT's responses to avoid being misled.
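The dual input + output filtering mechanism can be sketched as two checks wrapped around the model call. Real systems use trained safety classifiers; the regex blocklist and function names below are simplifying assumptions for illustration only.

```python
import re
from typing import Callable, Optional

# Illustrative policy blocklist (hypothetical); production systems use ML classifiers.
BLOCKLIST = re.compile(r"\b(credit card number|home address|password)\b", re.IGNORECASE)

def filter_prompt(prompt: str) -> Optional[str]:
    """Input-side check: decline prompts that match the policy."""
    return None if BLOCKLIST.search(prompt) else prompt

def filter_output(text: str) -> str:
    """Output-side check: redact policy matches rather than refusing outright."""
    return BLOCKLIST.sub("[REDACTED]", text)

def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Wrap an arbitrary text generator with both filters."""
    checked = filter_prompt(prompt)
    if checked is None:
        return "Request declined by content policy."
    return filter_output(model(checked))
```

The two layers are deliberately asymmetric: the input filter refuses before the model spends compute, while the output filter acts as a last line of defense on whatever the model emits.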
In summary, assessing the data compliance of generative AI requires breaking away from the compliance inertia of mobile internet services and adopting compliance and security measures targeted at its distinct characteristics in privacy and data security.
Future-oriented challenges: unprecedented data aggregation
Generative AI based on large language models has captured the world's attention not merely for content generation but for its artificial general intelligence (AGI) potential, with the industry exclaiming that the AGI singularity moment is upon us. In the future, beyond content-generating AI applications for the general public, the industry broadly expects AI to rewrite the internet paradigm: existing business models will widely introduce AI models to dramatically improve the efficiency of user interaction. On March 17, 2023, Microsoft released Microsoft 365 Copilot, which combines large language model (LLM) capabilities with the Microsoft Office applications to help users unlock productivity [8].
Copilot will be built into the Office suite. In Word, Excel, and PowerPoint, the AI works alongside the user, drafting documents, building presentations, and visualizing data through natural language interaction; in Outlook, Teams, and Business Chat, the AI can help users reply to emails, manage mailboxes, complete meeting summaries and to-do lists in real time, and improve meeting efficiency.
This leap in office efficiency rests not only on powerful AI model capabilities but also on extensive data connectivity. Using Copilot means users authorize Microsoft to connect their personal data across its various business platforms. As Microsoft's privacy policy states, data collected in different business contexts (for example, in the course of using two or more Microsoft products) will be combined for purposes such as providing and improving services and product development [9].
This is only a prototype of the future super digital assistant; eventually each person may even have multiple digital avatars collaborating on tasks atop an intelligent infrastructure. It is foreseeable that behind such digital assistants, large language models will access and link the private data of individuals and commercial enterprises, and that this fused use of data will have to feel seamless. Carrying out such data access and processing in a secure, compliant, and privacy-protecting way places higher demands on security technology safeguards.
