We aim to tackle the video navigation problem, whose goal is to train an in-context policy to find objects included in the context video in a new scene. After watching an 30-second egocentric video, ...