“Look forward, never look back… Let’s drive forward on the path of artificial intelligence (AI).” The speech, delivered by “digital Tang Xiao’ou” – an AI-generated avatar of SenseTime’s late founder – at the Chinese AI company’s online annual meeting on March 1, moved his colleagues to tears.
Tang passed away last year. In the 9-minute video, the “digital Tang,” shown in business attire, replicated his natural voice and gestures, and the avatar even imitated him taking a sip of water.
The “digital Tang” also largely mimics the personality and expressions of the real Tang, who was born in Northeast China – a region that has played an important role in Chinese comedy and gave him a distinctive, inborn sense of humor.
The “digital Tang” is so realistic that some of his colleagues thought the video had been recorded before Tang passed away. It was not until the digital avatar mentioned a Chinese movie that premiered in February this year that they realized they were communicating with their beloved friend through a “digital bridge.”
Behind the creation of “digital Tang,” as well as an array of other “digital humans,” is the fast-track development of AI technologies, especially large language models (LLMs), which have been on an unprecedented rise since the beginning of 2023.
Teaching digital avatars like ‘tutoring children’
With the approach of the Qingming Festival, a time for Chinese people to pay respects to their deceased family members, the “digital human” industry has been seeing rising demand as people crave the opportunity to “communicate” with their deceased loved ones. Beyond emotional companionship, industry insiders believe China could also lead the world in commercializing the technology in other application scenarios such as livestreaming and short videos.
According to SenseTime, many descendants of deceased celebrities have approached the company after watching the 9-minute footage of “digital Tang,” hoping to create digital avatars of their own deceased relatives.
“Digital avatar technology is a gem we all wish to claim in the field of AI research,” Luan Qing, general manager of digital entertainment and culture business at SenseTime’s digital world group, told the Global Times, stressing the technological difficulty involved. The 9-minute video embodies nearly a decade of research and accumulated technology, she said.
Regarding the standards for digital humans, Luan referred to the Turing Test, a method of determining whether a machine can demonstrate human-like intelligence. Achieving this level of realism in digital avatars is challenging, requiring precise reproduction of image, movement, expression and voice, as well as conveying the person’s thoughts.
The process of creating the digital human starts with image training, covering clothes, movements and facial expressions, followed by voice training, according to Luan. The first step was done using SenseTime’s self-developed AI model “SenseNova,” launched in April 2023, which includes the AI avatar video generation platform “SenseAvatar.”
“The second step, mimicking Tang’s language style, is more complicated. We selected about four to five voice clips featuring Tang’s different talking styles as prompts, each three to five seconds long. Although it took some time for us to select the voice samples, the training was completed quite fast thanks to our large voice models,” Luan said.
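Luan’s description suggests a few-shot setup: a handful of short reference clips, each capturing a different talking style, are curated before being handed to a large voice model. The Python sketch below is purely illustrative of that selection step; VoiceClip and select_voice_prompts are hypothetical names, and SenseTime’s actual SenseNova/SenseAvatar pipeline is not public.

```python
from dataclasses import dataclass

# Illustrative only: a toy version of the voice-prompt selection step described
# above. "VoiceClip" and "select_voice_prompts" are hypothetical names; the real
# SenseNova/SenseAvatar pipeline is not public.

@dataclass
class VoiceClip:
    path: str       # location of the audio file
    style: str      # e.g. "formal speech", "casual chat", "joking"
    seconds: float  # clip duration

def select_voice_prompts(clips, min_len=3.0, max_len=5.0, target_count=5):
    """Keep short clips of distinct talking styles, mirroring the four to five
    3-5 second prompt clips Luan describes."""
    chosen = {}
    for clip in clips:
        if min_len <= clip.seconds <= max_len and clip.style not in chosen:
            chosen[clip.style] = clip
        if len(chosen) == target_count:
            break
    return list(chosen.values())

# Made-up example data
candidates = [
    VoiceClip("speech_01.wav", "formal speech", 4.2),
    VoiceClip("chat_07.wav", "casual chat", 3.6),
    VoiceClip("joke_03.wav", "joking", 4.8),
    VoiceClip("qna_12.wav", "Q&A", 3.1),
    VoiceClip("toast_02.wav", "toast", 6.5),  # too long, filtered out
]
print([c.style for c in select_voice_prompts(candidates)])
```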
SenseTime made important breakthroughs in its large voice models in 2023, a banner year for generative AI. The company also plans to unveil larger voice models in the first half of this year.
Luan described the whole training process as akin to teaching a child, “feeding” various video clips to the AI models so that the “child” learns to mimic the movements. For example, the impressive moment of “digital Tang” taking a sip of water was generated after footage of Tang drinking water was fed into the large AI model, with a pre-scripted prompt specifying when the action should be performed.
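Read this way, the drinking gesture combines two ingredients: reference footage the model learns a movement from, and a timed cue saying when to perform it. The sketch below illustrates that pairing under stated assumptions; ActionCue and build_generation_prompt are hypothetical names, not SenseTime’s real interface.

```python
from dataclasses import dataclass

# Illustrative only: how a pre-scripted, timed action cue might accompany the
# reference footage "fed" to the model. "ActionCue" and "build_generation_prompt"
# are hypothetical; SenseTime's real prompt format has not been disclosed.

@dataclass
class ActionCue:
    action: str          # movement the model learned from reference footage
    reference_clip: str  # the clip used during training
    start_second: float  # when the avatar should perform the action in the output

script = [
    ActionCue("take a sip of water", "tang_drinking_water.mp4", start_second=312.0),
]

def build_generation_prompt(cues):
    """Flatten the timed cues into a text prompt for a (hypothetical) avatar
    video generator."""
    return "\n".join(
        f"At {cue.start_second:.0f}s, perform: {cue.action} "
        f"(movement learned from {cue.reference_clip})"
        for cue in cues
    )

print(build_generation_prompt(script))
```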
Making a ‘lifelike’ digital avatar
While the technology is not yet mature enough to create a “lifelike digital avatar” with complicated interactive features, demand has already seen “explosive growth,” some industry insiders told the Global Times.
According to a report from iMedia Research, the core market for virtual digital humans is estimated to reach 48.06 billion yuan ($6.65 billion) by 2025, expanding from 12 billion yuan in 2022.
Chinese tech company 360 last year launched a group of AI-powered digital humans built on LLM innovations, which the company said have a “soul” that differentiates them from the traditional “repeater” form of digital humans. The “soul” is developed through LLM training that equips the digital avatar with a personality and memory, allowing it to think like a human being.
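As a rough picture of how such a “soul” might be wired together, the sketch below combines a fixed persona description with a running conversation memory that is folded into each prompt sent to an underlying language model. It is a minimal illustration with hypothetical names, not 360’s actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative only: one way to picture the "personality plus memory" idea.
# The class below is hypothetical; it simply shows how a fixed persona and a
# running conversation memory could be folded into each prompt sent to an
# underlying language model (the model call itself is omitted).

@dataclass
class DigitalHuman:
    persona: str                                # fixed character traits
    memory: list = field(default_factory=list)  # record of past exchanges

    def build_prompt(self, user_message):
        history = "\n".join(self.memory[-10:])  # keep only recent turns
        return (f"Persona: {self.persona}\n"
                f"History:\n{history}\n"
                f"User: {user_message}\nReply:")

    def remember(self, user_message, reply):
        self.memory.append(f"User: {user_message}")
        self.memory.append(f"Avatar: {reply}")

avatar = DigitalHuman(persona="A warm, humorous company founder")
print(avatar.build_prompt("What should we focus on this year?"))
```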
In a statement sent to the Global Times on Monday, 360 listed a wide range of application scenarios for digital humans, including news broadcasting, knowledge sharing, product marketing and serving as digital company spokespersons.
“Compared with its international counterparts, China is ‘way ahead’ in the field of digital human technology application,” Luan said, adding that the rapid development of China’s livestreaming and short-video industries has driven progress in digital human technology, which in turn fueled earlier and more explosive growth of digital avatar applications than in overseas markets.
Tian Feng, dean of SenseTime Intelligent Industry Research Institute, told the Global Times that high-quality AI replication technology can be put into practical use in some specific scenarios. For example, if a scientist’s papers and speeches are integrated into an LLM, then relevant scientific research and science education can still continue after the scientist’s death.
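Tian’s example resembles what is often called retrieval-augmented generation: the scientist’s writings are indexed, relevant passages are pulled out for each question, and an LLM answers in that context. The toy sketch below illustrates the idea with naive keyword retrieval and hypothetical file names; it is not based on any disclosed SenseTime system.

```python
# Illustrative only: a toy version of Tian Feng's example, in which a scientist's
# papers and speeches are indexed so an LLM can keep answering questions in their
# place. Retrieval here is naive keyword overlap, and the "LLM call" is left as a
# plain prompt string; every file name and function below is hypothetical.

corpus = {
    "paper_2019.txt": "Computer vision models improve rapidly with larger training sets ...",
    "speech_2022.txt": "Young researchers should keep driving forward on the path of AI ...",
}

def retrieve(question, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_prompt(question):
    context = "\n".join(retrieve(question))
    return (f"Answer as the scientist would, using only this material:\n"
            f"{context}\n\nQuestion: {question}")

print(build_prompt("How should young researchers approach AI?"))
```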
Industry insiders also said that Chinese companies need to make more technological breakthroughs and increase research and development investment to achieve real-time interaction with AI-driven digital humans.
For example, while a person can ask a digital human to walk toward them with a verbal instruction, the digital human still cannot perform the small “unconscious” actions and expressions a person would, such as brushing hair back from the forehead along the way.