
Content production, especially creative work, has long been considered the exclusive province of humans and the embodiment of intelligence. In the 2019 book "The Road to Conscious Machines: The Story of AI" by Michael Wooldridge, Head of the Department of Computer Science at the University of Oxford, "writing interesting stories" is listed as one of the tasks AI is still "far from achieving".
AIGC (AI-Generated Content) not only achieves "human-like" performance in many areas of writing, drawing, and composing, but also demonstrates extraordinary creative potential grounded in learning from big data. The official release of the GPT-4 model further improved the accuracy and compliance of generated content. A new paradigm of human-computer collaboration in digital content production is emerging, allowing professional creators and ordinary people alike to cross the limits of "technique" and "efficiency" and focus on the creativity of their content.
There are also concerns about whether AI will put creators out of work en masse, or even send "creation" itself into decline, just as artworks in the age of mechanical reproduction were said to lose their "aura". In other words, the popularity of AIGC gives us an opportunity to re-examine what "creation" is and whether it is uniquely human.
In this paper, we analyze the current state, key breakthroughs, and challenges of AIGC in transforming digital content creation, and attempt to explore the questions above.
AIGC is becoming the infrastructure of Internet content production

Digital content is entering an upgrade cycle of strong demand, video-centric formats, and creativity, and AIGC has arrived at just the right time. Online life has become the norm. On the one hand, user-generated content has greatly liberated productivity; short videos, for example, have turned what once required long production cycles and heavy, attention-intensive investment into "industrial products" and "fast-moving consumer goods" that can be produced continuously. On the other hand, creativity remains the scarce core resource, and new models are needed to help creators continuously generate, iterate, and verify ideas. All of these factors call for new tools and methods that are more cost-effective.
AIGC is increasingly involved in the creative generation of digital content, releasing value through human-machine collaboration and becoming the content production infrastructure of the future Internet.
In terms of scope, AIGC is becoming deeply integrated into the production of text, code, music, images, video, and 3D content across media forms. It can act as a writer of news, essays, and novels; a composer and arranger; a painter in diverse styles; an editor and post-production engineer for long and short videos; a 3D modeler; and other assistant roles, completing the creation of content on a specified theme, as well as editing and style-transfer work, under human guidance.
In terms of results, AIGC has achieved initial success in natural-language-based text, speech, and image generation. For knowledge-based short and medium-length texts, illustrations, and other highly stylized imagery in particular, its output can match that of creators with intermediate experience. In areas of high media complexity such as video and 3D, it is still in the exploratory stage. Although there is much room for improvement in AIGC's handling of edge cases, control of detail, and accuracy of the finished product, the potential it contains is promising.
In terms of approach, multimodal processing across text, image, video, and 3D is a hot topic in AIGC. Andrew Ng named multimodality the most important AI trend of 2021, and AI models have made significant advances in discovering relationships between text and images: OpenAI's CLIP matches images with text, and DALL-E generates images corresponding to input text, while DeepMind's Perceiver IO can classify text, images, video, and point clouds. Typical applications include Text-to-Speech (TTS) and Text-to-Image (TTI); in a broader sense, AI translation and image stylization can also be seen as mappings between two different "modalities".
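As an illustration of the image-text matching that CLIP performs, the sketch below scores a few candidate captions against one image using the Hugging Face transformers library. The checkpoint name, image file, and captions are arbitrary choices for this example, not anything prescribed by the text.

```python
# pip install transformers pillow torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (an assumption for this sketch).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local photo
captions = ["a photo of a cat", "a photo of a dog", "a poster for a musical"]

# Embed the image and the candidate captions in the same space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # similarity of the image to each caption
probs = logits.softmax(dim=1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```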

Key breakthrough: Natural language technology liberates creative power

AIGC's liberation of creators is reflected in the fact that "if you can talk, you can create": there is no need to understand the underlying principles, learn to code, or master professional tools such as Photoshop. The creator describes the elements, or even the ideas, in their mind to the AI in natural language (a "prompt"), and the AI generates the corresponding result. This is the next leap in human-computer interaction, from punched paper tape, to programming languages, to graphical interfaces.
Natural language is the root layer of information linking different types of digital content. The word "cat", for example, indexes pictures of Garfield, the musical Cats, and countless other pieces of content; these different content types are what we call "multimodal".
In this wave of AIGC, the biggest underlying evolution is the leap in AI's ability to "understand" and "use" natural language, which is inseparable from the Transformer architecture Google released in 2017, which opened the era of the Large Language Model (LLM). With this powerful feature extractor, subsequent language models such as GPT and BERT advanced by leaps and bounds, not only in quality and efficiency: the paradigm of big-data pretraining plus small-data fine-tuning also freed them from reliance on extensive manual parameter tuning, producing significant breakthroughs in handwriting, speech, and image recognition and in language understanding, with increasingly accurate and natural generated content.
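A minimal sketch of the "pretrain once, reuse everywhere" pattern this paragraph describes, using a small publicly available GPT-2 checkpoint through the Hugging Face pipeline API; the model and prompt are illustrative assumptions, not the specific systems discussed above.

```python
# pip install transformers torch
from transformers import pipeline

# A pretrained language model is used directly, with no task-specific training.
generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Once upon a time, an AI learned to write stories",
    max_new_tokens=40,   # length of the continuation
    do_sample=True,      # sample rather than greedy-decode
    temperature=0.9,
)
print(result[0]["generated_text"])
```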
But large models imply extremely high barriers to research and use: GPT-3, with 175 billion parameters, requires large computing clusters and is not available to general users. Midjourney, deployed on the Discord forum as a chatbot, became the first user-friendly AIGC application in 2022, setting off an AI painting boom; one designer even used it to generate an image that won a prize in an offline art competition.

The low threshold of communicating in plain text, with search-engine-like ease of use, immediately ignited ordinary users' enthusiasm for AI. A series of Text-to-Image products based on diffusion models followed, such as Stable Diffusion, which took AI painting from the design community to the masses. Open-source Stable Diffusion, which can run on a single computer, had been downloaded by more than 200,000 developers and had accumulated more than 10 million daily users as of October 2022, while the consumer-oriented DreamStudio had gained more than 1.5 million users and generated more than 170 million images. Its stunning artistic styles, as well as the copyright and legal issues surrounding the images, have also sparked much controversy.
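For a sense of how low the threshold is in practice, here is a hedged sketch of running an open-source Stable Diffusion checkpoint locally with the diffusers library; the checkpoint name and prompt are assumptions for illustration, not an endorsement of any particular release.

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Load an open Stable Diffusion checkpoint (assumed available on the Hugging Face Hub).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a single consumer GPU is enough, per the text

# "If you can talk, you can create": the prompt is the whole interface.
image = pipe("a watercolor painting of a cat reading a newspaper").images[0]
image.save("cat.png")
```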
Before the shock of diffusion models had worn off, ChatGPT came out and really "spoke" with humans, understanding a wide range of requests: writing answers, short essays, and poems, writing code, performing mathematical and logical calculations, and more. In addition, reinforcement learning from human feedback (RLHF) allows ChatGPT to keep learning from human suggestions and evaluations of its responses and to move in the right direction, achieving excellent results with less than 1% of the parameters of GPT-3. Although ChatGPT still has flaws, such as citing non-existent papers and books and giving poor answers to questions for which data is scarce, it is a milestone in the history of artificial intelligence: two months after launch it surpassed 100 million users, making it the fastest-growing consumer app in history.
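At the core of the RLHF recipe is a reward model trained on human preferences. Below is a minimal PyTorch sketch, under simplifying assumptions, of the pairwise loss used in InstructGPT-style reward-model training: the score of the human-preferred response is pushed above the score of the rejected one. The tensors here stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores of two candidate responses to the same
# prompt; in a real system these come from a learned scalar "reward head".
reward_chosen = torch.tensor([1.2, 0.3], requires_grad=True)    # human-preferred
reward_rejected = torch.tensor([0.7, 0.9], requires_grad=True)  # human-rejected

# Pairwise preference loss: maximize the margin by which the chosen
# response outscores the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # gradients would update the reward model's parameters
print(loss.item())
```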
The Next Challenge: Toward a 3D Internet of Presence

After text, images, and video, the important next direction of digital technology is to move from "online" to "present", and AIGC will be a cornerstone of the 3D Internet. People will build simulated worlds in virtual space and "superimpose" virtual enhancements on the real world to achieve a genuine sense of presence. With breakthroughs in XR, game engines, cloud gaming, and other interaction, simulation, and transmission technologies, information transmission is approaching lossless fidelity, digital simulation is becoming indistinguishable from reality, and human interaction and experience will reach a new stage.
At present, AIGC in the 3D domain is still at the exploration stage. One path is a two-step use of diffusion models: first generate images from text, then generate 3D data containing depth from those images. Google and NVIDIA lead in this field, and each has released its own text-to-3D AI model. In terms of output, however, there is still a gap between the generated results and the average quality of manually produced 3D content today, and generation speed is also unsatisfactory.
In October 2022, Google was first to release DreamFusion, but its shortcomings are also significant. First, its diffusion model only operates on 64x64 images, resulting in low-quality 3D output; second, the scene-rendering model not only requires massive sampling but is also computationally expensive, making generation slow. NVIDIA then released Magic3D, which took about 40 minutes to generate a textured 3D model from the prompt "a blue poison dart frog sitting on a water lily". Compared with Google's model, Magic3D generates faster and better results, can keep the same subject across successive generations, and can transfer a style onto the 3D model.
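The text-to-image-to-3D path hinges on using a frozen 2D diffusion model to supervise a differentiable 3D representation (DreamFusion calls this "score distillation"). The toy sketch below, built entirely from stand-in functions, shows only the shape of that optimization loop: render, add noise, ask the denoiser for its residual, and backpropagate that residual through the renderer. Every component here is a placeholder assumption, not actual DreamFusion code.

```python
import torch

# Hypothetical scene parameters (a real system would use a NeRF or mesh).
params = torch.randn(3, 64, 64, requires_grad=True)

def render(p):
    # Stand-in differentiable renderer: scene parameters -> image.
    return torch.tanh(p)

def predict_noise(x_noisy, t):
    # Stand-in for a frozen, text-conditioned 2D diffusion model.
    return 0.1 * x_noisy

opt = torch.optim.Adam([params], lr=1e-2)
for step in range(100):
    img = render(params)
    t = torch.rand(())             # random diffusion timestep
    noise = torch.randn_like(img)
    x_noisy = img + t * noise      # simplified forward noising
    with torch.no_grad():
        eps_hat = predict_noise(x_noisy, t)
    # Score distillation: treat (eps_hat - noise) as the gradient of the
    # image and push it back through the renderer into the 3D parameters.
    img.backward(gradient=eps_hat - noise)
    opt.step()
    opt.zero_grad()
```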

The second path is to use AI to "composite" photos of the same object taken from different angles and generate 3D directly. At NeurIPS in December 2022, NVIDIA demonstrated a generative AI model, GET3D (short for Generate Explicit Textured 3D), which can synthesize 3D models on the fly based on the categories of 2D images it was trained on, such as buildings, cars, and animals. GET3D was trained on NVIDIA A100 GPUs using about 1 million photos taken from different angles, and can generate about 20 objects per second. Combined with another of the team's technologies, the AI-generated models can distinguish between geometry, lighting information, and material information, significantly enhancing editability.
Possible path: Combining with procedural generation techniques from games

Nevertheless, AIGC's 3D capabilities are still a long way from creating the 3D Internet. The more mature procedural content generation (PCG) technology used in games may be a great help for AIGC in getting through these deep waters.
From a technical standpoint, AI-generated 3D cannot simply follow the old "brute force produces miracles" route of improving results just by feeding the AI more input. First, the amount of information differs: an image and a 3D model are a dimension apart, which shows up in storage as a different order of magnitude of data. Second, the principles of storage and display differ: if a 2D image is the straightforward display of a pixel array, 3D is real-time, rapid, massive matrix computation, like "photographing" the model dozens of times per second. To accurately compute each pixel and "render" it on screen, one must consider at least (1) the geometry of the model, usually represented by thousands of triangles; (2) the material, the model's own color and whether it is highly reflective metal or diffusely reflective fabric; and (3) the lighting, whether the source is point-like, and its color and intensity. Finally, relatively little data exists for native 3D models, with only a small accumulation in games, film and television, and digital twins, far less than for images, which have existed for thousands of years and can exist in non-digital form; ImageNet alone contains over 14 million images.
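To make the three rendering ingredients concrete, here is a tiny sketch of Lambertian diffuse shading, computing one pixel's color from a surface normal (geometry), an albedo (material), and a light direction and color (lighting). The specific vectors are arbitrary example values.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# (1) Geometry: the surface normal at the point being shaded.
normal = normalize(np.array([0.0, 1.0, 0.2]))
# (2) Material: diffuse albedo (base color) of the surface, as RGB in [0, 1].
albedo = np.array([0.8, 0.3, 0.3])
# (3) Lighting: direction toward the light, and the light's RGB intensity.
light_dir = normalize(np.array([0.5, 1.0, 0.5]))
light_color = np.array([1.0, 1.0, 0.9])

# Lambert's cosine law: brightness scales with the angle between
# the surface normal and the light direction (clamped at zero).
intensity = max(0.0, float(np.dot(normal, light_dir)))
pixel = albedo * light_color * intensity
print(pixel)  # the final RGB value for this pixel
```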
Using computers to help creators is something the gaming community has explored for over forty years. Algorithmically generated game content first appeared in Rogue (Toy and Wichman, 1981), whose map was randomly generated and different in every playthrough. In the 3D era, procedural generation techniques came to be heavily used in art production because of its enormous time and labor costs: Red Dead Redemption 2, released in 2018, involved more than six hundred artists and took eight years to complete a virtual world of about 60 square kilometers.
Procedural generation sits somewhere between purely manual production and AIGC in terms of effectiveness and controllability. For example, No Man's Sky, an indie game focused on cosmic exploration released in 2016, uses PCG to build a set of generation rules and parameters claimed to produce over 18 quintillion (2^64) different planets, each with its own environment and creatures.
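The trick that makes such scale possible is deterministic generation from a seed: planets are computed on demand from rules and parameters rather than stored. A minimal sketch of the idea follows, with invented attribute tables standing in for real generation rules.

```python
import random

def generate_planet(seed: int) -> dict:
    # The same seed always yields the same planet, so nothing is stored:
    # content is recomputed from the rules whenever the player arrives.
    rng = random.Random(seed)
    return {
        "radius_km": round(rng.uniform(2000, 8000)),
        "atmosphere": rng.choice(["none", "thin", "dense", "toxic"]),
        "biome": rng.choice(["desert", "ocean", "jungle", "frozen", "volcanic"]),
        "fauna_species": rng.randint(0, 40),
    }

# Two visits to planet 42 produce the identical world; planet 43 differs.
print(generate_planet(42))
print(generate_planet(42))
print(generate_planet(43))
```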

