Virtual humans struggle to get off the ground

At this past July's Global Artificial Intelligence Conference, virtual humans drew far less noise and attention than large models, but they were not absent.

The public's impression of virtual humans is still stuck on 3D character models that keep getting prettier and closer to the real thing. Enterprises, however, are beginning to figure out how to use virtual humans to save money.

"Last year, everyone's focus was on whether a supplier could help them 'build a person.' This year's demand is clearly more pragmatic: customers care whether a virtual human can be applied to business operations and genuinely reduce costs and increase efficiency," David, a product manager at a virtual human technology company, told newberrydaybreak.

Demand is running ahead of the technology. Just as automated assembly lines gradually replaced workshop operators, enterprises adopt virtual humans because they want labor that is cheaper, more efficient, more stable, and within easy reach.

Over the past few years, the rendering quality of avatars has continued to improve. The skin and pore texture of hyper-realistic virtual humans can even rival that of real people. Serving as the "flesh" of a large model, avatars can interact with real people in more ways than text alone.

The successive release of large models and the rapid advance of their capabilities have also raised expectations for avatars. Data from AiMedia Consulting shows that China's core avatar market reached 12.08 billion yuan in 2022, a figure expected to roughly quadruple to 48.06 billion yuan three years later.
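The growth implied by those two market figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the forecast represents exactly three years of compounding from the 2022 base; the function name is illustrative and not taken from the data provider.

```python
# Market sizes quoted in the text, in billions of yuan.
MARKET_2022 = 12.08
MARKET_FORECAST = 48.06
YEARS = 3  # assumption: "three years later" means three full years of compounding


def implied_cagr(start: float, end: float, years: int) -> float:
    """Return the compound annual growth rate that takes start to end."""
    return (end / start) ** (1 / years) - 1


multiple = MARKET_FORECAST / MARKET_2022
cagr = implied_cagr(MARKET_2022, MARKET_FORECAST, YEARS)

print(f"growth multiple: {multiple:.2f}x")
print(f"implied annual growth: {cagr:.1%}")
```

Under these assumptions the forecast amounts to just under a fourfold increase, which corresponds to annual growth of a little under 60%.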

The biggest crux for virtual humans early on was that production costs remained high, while the cost-effective options that could actually be deployed still looked rough however you viewed them.

The good news is that, with advances in AI, an avatar's movements, expressions, and speech can be almost 100% automatically generated through AIGC, and the required production time and cost have fallen dramatically.

The production side keeps cutting costs and raising efficiency, interaction on the application side is taking shape, and the tree has already borne green fruit.

Regrettably, at this stage humans cannot yet switch seamlessly between virtual and real space as in the movie "Ready Player One".

Between the birth of technology and its maturity, there is always a period of awkwardness that cannot be fast-forwarded.

Still, utility value wins

If we take a human-centered perspective and categorize avatars by the needs they serve, they fall into two types: functional and identity-based.

Functional avatars provide practical value: they help humans carry out specific tasks, such as intelligent customer service, copywriting, and working as avatar anchors.

Identity-based avatars provide emotional value. One can be a virtual girlfriend or virtual partner offering everyday companionship; it can also be the digital double of a historical figure or entertainment star, or a virtual IP born in anime (ACG) culture, giving you the pleasure of following a star up close.

Emotional needs are real: people need to be encouraged and understood. In today's increasingly atomized society, this need is still growing.

One Xiaohongshu user described chatting with AI this way: "Even though I know it is just a piece of code, my heart still races at those words. The AI may be an illusion, but the surprise of seeing those conversations is a real and genuine feeling."

The growth of the AI companion chatbot Character.ai is the best proof.

In this app, users can talk to famous characters such as Musk, Jobs, and Mario, or customize their own exclusive AI chat companion.

Character.ai was founded by two former Google employees less than a year ago. In March this year, the company closed a 150 million U.S. dollar financing round led by the well-known U.S. venture capital firm a16z (Andreessen Horowitz) at a valuation of 1 billion U.S. dollars, making it an absolute dark horse.

While ChatGPT's growth is stagnating, Character.ai's traffic continues to climb. Semrush data show that its visits rose nearly 90% year-on-year in April and 47% year-on-year in May.

The smooth experience of real people interacting with AI through text relies on the maturity of large language models. But a virtual human involves not only text but also movement, expression, and voice. There is still a long technical trek before interaction feels natural in every respect.

At this year's Hunan TV New Year's Eve gala, a virtual human performed a song-and-dance number called "Making Romance". Some netizens said that, with a child's unfiltered honesty, their 3-year-old's first reaction was "so fake and ugly".

The demand exists, but the technical realization falls short, which makes it hard for identity-based virtual humans to command a good price in the consumer (toC) market.

This is where practical avatars have an advantage. Take the Xiaoice AI clone, which offers both kinds of function: its practical value is priced five times higher than its emotional value.

The "Emotion Mode" is priced at 72 RMB/year and offers voice calls, friend-circle interaction, and other functions. The "Super Mode" is priced at 360 RMB/year and mainly serves office scenarios, assisting with meeting minutes, copywriting, and other work.

Most notably, the Xiaoice AI clone is sold purely as an interactive interface, without a specific virtual image.

David is not surprised. "In my experience, the first concern of enterprise customers is whether the ROI can turn positive, that is, whether the cost is lower than that of real employees. Second, a hot technology also has marketing value: an enterprise can buy an avatar, announce that it has adopted AIGC, and vigorously promote its image as a brand that embraces innovation."

He added that avatar technology providers must first meet the real needs of enterprises, because for both practical and marketing functions, enterprises are more willing to pay than individuals.

Production side: lower costs, higher efficiency

One piece of good news for the industry is that technological advances in AI are driving down the cost of producing avatars. This is good for both functional and identity-based avatars.

Creating an avatar involves three main stages: modeling, driving, and rendering. AI has greatly reduced the cost of the modeling and driving stages.

Modeling means creating the virtual human's appearance through hand-drawing, CG modeling, or AI methods. The traditional method requires a designer to sculpt the character by hand in 3D software.

In the past, product managers and art designers could communicate image requirements only through text and reference images found online, which inevitably distorted information. If the result was unsatisfactory, it had to be reworked several times.

Nowadays, tools such as Midjourney and Stable Diffusion have made low-cost 2D image generation possible.

AI generates an image from existing material and instructions, so each requirement can be specified more concretely. In other words, AI greatly reduces the cost of communication and trial and error in producing avatar images.

While 3D modeling can't be done entirely by AI, tools such as MetaHuman can build high-fidelity avatars by inputting photos or videos and apply them directly in Unreal Engine.

Driving is the process of making the avatar move. It can be driven by the "person in the middle" or by AI. The person in the middle is the real performer who provides the voice and movement beneath the avatar's exterior.

The former relies on capturing a real person, including motion capture, facial expression capture, and audio/video synthesis, and then binding the captured data to the virtual human. The latter is accomplished through deep learning, few-shot learning, natural language processing, neural rendering, and other techniques: given a passage of text or a voice clip as input, the AI model automatically outputs body movements, facial expressions, and voice.

David explains that his company has separate models for movement, expression, and voice. "Voice is relatively simple; TTS (text-to-speech) technology is very mature. Limb and lip movements are handled by some of the STA models; we collect a very large amount of motion capture data and then train generation models on it."

For example, if you want to use an avatar in a product explanation video, the system uses NLP to parse the script the user enters and feeds the text to the model as input, which can trigger certain key actions.
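To make the idea of a script triggering key actions concrete, here is a deliberately simplified sketch. It swaps the trained NLP and motion models David describes for plain keyword matching, and the cue words, action names, and the plan_actions function are all hypothetical illustrations rather than any vendor's actual system.

```python
import re
from dataclasses import dataclass

# Hypothetical cue table: a word found in the script maps to a named gesture.
# A production system would use a trained model instead of this lookup.
ACTION_CUES = {
    "welcome": "wave",
    "price": "point_at_screen",
    "thank": "bow",
}


@dataclass
class Segment:
    text: str           # one sentence of the script
    actions: list[str]  # gestures triggered by that sentence


def plan_actions(script: str) -> list[Segment]:
    """Split a script into sentences and attach any triggered gestures."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    segments = []
    for sentence in sentences:
        lowered = sentence.lower()
        actions = [action for cue, action in ACTION_CUES.items() if cue in lowered]
        segments.append(Segment(sentence, actions))
    return segments


demo = plan_actions(
    "Welcome to our store! The price is low today. Thank you for watching."
)
for segment in demo:
    print(segment.actions, "<-", segment.text)
```

Each sentence keeps its own list of gestures, so a downstream driver could play the speech and the matching motion together.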

If these concepts feel abstract, the money involved gives a more intuitive sense of the difference.

"With motion capture, the cost is 1,000 dollars a second, which means that a one-minute video costs about 60,000 dollars. Generating it with AI costs only 30 dollars for one minute," David said, noting that the cost gap between the two approaches is more than a thousandfold.

GF Securities pointed out that AI's impact on the virtual human industry is not limited to cost: it also opens up the possibility of "personification" and "specialization". Large language models, and fine-tuning a base model on specific datasets, can give avatars personalities and adapt them to more specialized scenarios.
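The per-minute figures David quotes for motion capture and AI generation can be checked with simple arithmetic. This sketch assumes both costs scale linearly with video length, which the article does not state; the constants come from the quote and the function names are illustrative.

```python
# Figures quoted by David, in the currency unit used in the quote.
MOCAP_COST_PER_SECOND = 1_000
AI_COST_PER_MINUTE = 30


def mocap_cost(seconds: int) -> int:
    """Motion-capture cost, assuming it scales linearly with duration."""
    return MOCAP_COST_PER_SECOND * seconds


def ai_cost(seconds: int) -> float:
    """AI-generation cost, assuming it scales linearly with duration."""
    return AI_COST_PER_MINUTE * seconds / 60


def cost_ratio(seconds: int) -> float:
    """How many times more expensive motion capture is than AI generation."""
    return mocap_cost(seconds) / ai_cost(seconds)


one_minute = 60
print(mocap_cost(one_minute))  # 60000
print(ai_cost(one_minute))     # 30.0
print(cost_ratio(one_minute))  # 2000.0
```

Taken literally, the quoted prices give a gap of about 2,000 times for a one-minute video, in line with David's description of a gap of more than a thousandfold.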

Lessons from avatar live streaming

A more intuitive application of functional avatars is in the live streaming scenario.

In May, Douyin took the lead in clarifying the "legal" status of avatars, allowing AI-assisted creation and not restricting avatar live broadcasts. In recent months, for newly registered guild accounts on Douyin, live broadcasts using avatars are no longer treated as recorded broadcasts.

Kuaishou has made no official statement, but it has put considerable effort into promoting its "Kuaishou Virtual Studio (KVS)" assistant. KVS is a tool for content producers that supports using avatars to assist with broadcasts, and also lets a host take on an avatar's appearance and step into a virtual scene.

Regardless of which side you stand on, avatars are in demand.

Brands have an incentive to replace some of their real anchors. Training a seasoned anchor takes at least about three months. And with high turnover in the industry, brands must continually find, train, and hone new anchors.

Setting aside worries about being replaced, anchors also want to train virtual humans to work for them. After all, live-stream selling is physically demanding: broadcasting continuously for 4-6 hours every day, keeping reversed day-night schedules, and signing off late at night are industry norms, and many people cannot take it.

In addition, the live-stream selling playbook is mature and product explanations are standardized, so a virtual human seems fully capable of the job.

However, the reality is not so rosy.

It is difficult for avatar anchors to earn real trust from the audience. When it comes to product reviews for common categories such as beauty and clothing, avatars seem out of their depth.

Previously, Ling Ling, a virtual idol with a solid fan base, was mercilessly criticized by netizens for a lipstick review that described the product as "moisturizing and not dry". When the presentation itself is entirely virtual, how can it give consumers a real and objective reference?

Clothing is even harder. Not only does the presentation lack credibility, but the clothing must also be modeled in advance for display, so the operating cost is not necessarily lower than that of a real anchor. Even so, netizens' comments run along the lines of "what can you even see from this" and "it looks like a virtual image straight out of a script".

At present, virtual anchors mostly handle basic product introductions, or serve as a "vase" beside a real anchor to arouse the audience's curiosity.

Although Douyin tacitly permits avatar live broadcasts, it also says that traffic distribution depends on "content quality", so avatars do not get an unconditional green light. This also means that during peak hours, avatars that "only read from scripts" are no match for real-life anchors.

The "part-time job" status of virtual humans in live streaming offers a glimpse of the bigger picture: as users, we can easily feel the gap between the sci-fi promise of the publicity and the reality of how the technology has landed.

But technological progress is always like this: usability does not improve overnight.

The development of AI has helped the avatar industry overcome the huge problem of mass production: it lets users generate avatars quickly and cheaply, produce content at high frequency, and reduce their dependence on real people.

For practitioners and enterprise customers, every inch of progress toward natural interaction between avatars and real people brings an inch of joy. A number of businesses already use avatars to host their live broadcasts during late-night hours, keeping the stream running 24 hours a day.

After all, being able to keep reading simple product information to viewers is better than nothing.