Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

With the recent development of computer vision and natural language processing technologies, there has been growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data, such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes a novel deep neural network model with joint multimodal embedding and backtracking search (JMEBS) for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context.
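As a rough illustration only (not the paper's exact architecture), the sketch below shows how a transformer encoder might fuse instruction tokens with panoramic view features and carry a temporal context vector across navigation steps. All module names, dimensions, and the mean-pooling step are assumptions made for this example.

```python
import torch
import torch.nn as nn

class JointMultimodalEmbedding(nn.Module):
    """Hypothetical sketch: fuse instruction tokens and panoramic view
    features with a transformer encoder, then carry temporal context
    across steps with a GRU cell (names and sizes are assumptions)."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 vocab_size=1000, img_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.temporal = nn.GRUCell(d_model, d_model)  # temporal context update

    def forward(self, instr_tokens, pano_feats, prev_state):
        # instr_tokens: (B, L) word ids; pano_feats: (B, V, img_dim) view features
        lang = self.word_emb(instr_tokens)                   # (B, L, d_model)
        vis = self.img_proj(pano_feats)                      # (B, V, d_model)
        fused = self.fusion(torch.cat([lang, vis], dim=1))   # joint multimodal context
        step_summary = fused.mean(dim=1)                     # pool to a per-step vector
        state = self.temporal(step_summary, prev_state)      # carry context over time
        return fused, state

# Toy usage with made-up shapes (36 panoramic views, 20 instruction tokens)
model = JointMultimodalEmbedding()
tokens = torch.randint(0, 1000, (2, 20))
views = torch.randn(2, 36, 2048)
state = torch.zeros(2, 512)
fused, state = model(tokens, views, state)
```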

It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path based on local and global scores for candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched so far with multiple natural language instructions. The model's performance on various tasks was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset.
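To make the search idea concrete, here is a minimal, hypothetical sketch of a backtracking-enabled greedy search over candidate actions that combines a local action score with a global trajectory-instruction score. The function names, the additive score combination, and the frontier-based backtracking are assumptions for illustration, not the paper's exact procedure.

```python
import heapq
import itertools

def bgls(start, candidates_fn, local_score, global_score,
         is_goal, max_steps=20, budget=200):
    """Hypothetical BGLS sketch: greedily extend the best-scoring partial
    trajectory; alternatives not chosen stay on a frontier so the agent
    can backtrack to them when the current trajectory scores poorly."""
    tie = itertools.count()  # tie-breaker so the heap never compares trajectories
    frontier = [(-global_score([start]), next(tie), [start])]
    best = [start]
    for _ in range(budget):
        if not frontier:
            break
        _, _, traj = heapq.heappop(frontier)   # backtrack/resume at best partial trajectory
        if is_goal(traj[-1]):
            return traj
        if len(traj) >= max_steps:
            continue
        for cand in candidates_fn(traj[-1]):
            new_traj = traj + [cand]
            # local score: confidence in taking this action now;
            # global score: agreement of the partial trajectory with the instruction
            score = local_score(traj[-1], cand) + global_score(new_traj)
            heapq.heappush(frontier, (-score, next(tie), new_traj))
        best = traj
    return best
```

In this sketch the frontier keeps every unexpanded candidate, so "backtracking" simply means popping an older, higher-scoring partial trajectory instead of always extending the current one.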
