Smart House
Video Display
About
The purpose of this demo is to showcase potential application scenarios for future AR/smart glasses in indoor environments. Even as XR/AI technology continues to develop, hardware has remained a bottleneck for the industry. A lightweight pair of glasses that integrates AR, AI, and communication functions could be the key to making XR truly accessible in everyday life.
This demo was developed in UE5 and runs on the Oculus Quest 3. The core functionality lets users control various features in the virtual environment through voice commands, such as turning the TV on and off, playing music, shopping, and entertainment. There are many more possibilities for smart glasses in indoor settings, and I will continue to update the project with new features.
Speech Recognition and NLP
I use the Voice SDK plugin in UE5 and wire the triggered events up with Blueprints. My voice is sent to a cloud-based speech recognition server, which determines whether any keywords are triggered by what I say (the keywords need to be set up on wit.ai). The returned keywords then fire the pre-set events, such as the ladder ranking feature shown in the diagram. It is worth mentioning that the language model needs continuous optimization: although the recognized speech contained errors, it still matched the "showing" and "match" keywords, so the event was executed.
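To make this flow concrete, here is a minimal C++ sketch (hypothetical, not the actual Voice SDK or Blueprint code): it assumes the recognition service reports the matched keyword names, which are then dispatched to pre-bound events.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Maps a keyword name returned by the speech service to a pre-set event.
    class KeywordDispatcher {
    public:
        void Bind(const std::string& Keyword, std::function<void()> Event) {
            Events[Keyword] = std::move(Event);
        }

        // Called once per keyword the recognition server reports for an utterance.
        // Unknown keywords are ignored, which is why a noisy transcript can still
        // trigger the right event as long as the keyword itself was matched.
        void OnKeywordRecognized(const std::string& Keyword) {
            auto It = Events.find(Keyword);
            if (It != Events.end()) {
                It->second();
            }
        }

    private:
        std::unordered_map<std::string, std::function<void()>> Events;
    };

    int main() {
        KeywordDispatcher Dispatcher;
        Dispatcher.Bind("showing", [] { std::cout << "Show the ladder ranking UI\n"; });
        Dispatcher.Bind("match",   [] { std::cout << "Load the latest match data\n"; });

        // Pretend the cloud service matched these keywords in a noisy utterance.
        Dispatcher.OnKeywordRecognized("showing");
        Dispatcher.OnKeywordRecognized("match");
    }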
Facial Capture and MetaHuman
In UE5, MetaHuman can be used to create character models with highly realistic details. Combined with facial capture or full-body motion capture, it can quickly produce lifelike character animations, which are widely used in the gaming, film, entertainment, and education industries. Here, I used the Live Link Face iPhone app to connect to UE5 and record the MetaHuman's facial expressions and head movements in real time.
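Under the hood, Live Link Face streams ARKit-style blendshape curves (e.g. jawOpen, eyeBlinkLeft) every frame, and these drive the MetaHuman face rig. The sketch below only illustrates that idea, assuming a simplified frame structure rather than the real Live Link types.

    #include <iostream>
    #include <map>
    #include <string>

    // Simplified stand-in for one streamed frame: ARKit blendshape curve names
    // mapped to values in [0, 1]. The real Live Link frame type is richer.
    using FaceFrame = std::map<std::string, float>;

    // Forward a few selected curves to whatever drives the character's face rig.
    void ApplyFaceFrame(const FaceFrame& Frame) {
        for (const std::string CurveName : {"jawOpen", "eyeBlinkLeft", "eyeBlinkRight", "mouthSmileLeft"}) {
            auto It = Frame.find(CurveName);
            if (It != Frame.end()) {
                std::cout << CurveName << " -> " << It->second << "\n"; // drive the rig here
            }
        }
    }

    int main() {
        // One fake frame, standing in for a single packet from the iPhone.
        FaceFrame Frame = {{"jawOpen", 0.42f}, {"eyeBlinkLeft", 0.05f}, {"mouthSmileLeft", 0.8f}};
        ApplyFaceFrame(Frame);
    }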
GenAI
In this demo, I used GenAI tools to quickly create 3D models, AI TV videos, and AI audio. For the 3D models, I used Hyperhuman AI, which generates models from uploaded images or text prompts. The AI videos were created with Runway AI, which can likewise work from uploaded images or text prompts. The audio was generated by entering text and selecting an AI voice actor.
Reflection
This project explores how NLP technology and GenAI can enhance XR projects. Previously, the primary way to control XR content was through controller buttons. Now, in addition to gesture tracking, NLP is gradually becoming mainstream, since language is one of the primary means of human communication. Incorporating NLP can enhance immersion, interactivity, and enjoyment for XR users. Significant development potential exists in areas such as virtual assistants, companion virtual humans, psychological counseling, smart homes, training and education, and interactive gaming.
On the other hand, using motion capture to create virtual characters will also become one of the mainstream methods in the future. This approach makes character animation more natural while eliminating the cumbersome, time-consuming steps of traditional animation workflows. Moreover, there are already many mature facial animation solutions on the market, such as text-driven and voice-driven facial expressions. Motion capture was previously used mainly in the gaming and film industries, but it is now increasingly applied in sectors such as training, healthcare, education, and cultural tourism.
Of course, given this demo's short production cycle, there is still plenty of room for optimization:
1. Online voice recognition and keyword feedback may suffer from network latency, which affects the gameplay. To mitigate this, we could consider an offline speech recognition model (a rough sketch follows after this list) or mainstream speech services from Google or Microsoft.
2. For the MetaHuman character's facial animation, I used facial capture with an iPhone, and for body animation, I used models from Mixamo. In the future, we might consider using Rokoko's full-body motion capture solution instead, to make the MetaHuman's movements more natural.
3. We might add other feature demonstrations, such as AR companions, AR furniture shopping, and remote AR meetings.
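As a rough illustration of the offline option in point 1, the sketch below assumes the Vosk library with a locally downloaded model; the model path, audio file, and sample rate are placeholders. Running recognition on-device would remove the dependency on network latency for keyword matching.

    #include <cstdio>
    #include <string>
    #include <vector>
    #include <vosk_api.h>

    int main() {
        // Placeholder paths; a real build would bundle the model with the app.
        VoskModel* Model = vosk_model_new("model/vosk-model-small-en-us");
        VoskRecognizer* Recognizer = vosk_recognizer_new(Model, 16000.0f);

        std::FILE* Audio = std::fopen("command.wav", "rb");
        if (!Audio) return 1;
        std::fseek(Audio, 44, SEEK_SET); // skip the WAV header (16 kHz mono PCM assumed)

        std::vector<char> Buffer(4096);
        size_t Read = 0;
        while ((Read = std::fread(Buffer.data(), 1, Buffer.size(), Audio)) > 0) {
            vosk_recognizer_accept_waveform(Recognizer, Buffer.data(), static_cast<int>(Read));
        }

        // The final result is a JSON string; a simple substring check stands in
        // for the keyword matching that wit.ai performs in the online version.
        std::string Result = vosk_recognizer_final_result(Recognizer);
        if (Result.find("turn on the tv") != std::string::npos) {
            std::printf("Keyword matched: turn on the TV\n");
        }

        std::fclose(Audio);
        vosk_recognizer_free(Recognizer);
        vosk_model_free(Model);
    }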