Vision-Language Joint Embedding Predictive Architecture (VL-JEPA) is an extension of JEPA that learns how images and language relate to each other by predicting shared meaning in an embedding space rather than exact pixels or words.
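
To make "predicting shared meaning rather than exact pixels or words" concrete, here is a toy sketch of a loss computed between embeddings instead of raw outputs. Everything in it is an assumption for illustration: the random linear maps `W_image`, `W_text`, and `W_pred` stand in for learned encoders and a predictor, the dimensions are arbitrary, and the cosine-distance loss is just one plausible choice. It is not the actual VL-JEPA model or training objective.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def embedding_prediction_loss(predicted, target):
    """Mean (1 - cosine similarity) between predicted and target embeddings."""
    p = l2_normalize(predicted)
    t = l2_normalize(target)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))


rng = np.random.default_rng(0)
batch, image_dim, text_dim, embed_dim = 4, 32, 16, 8

# Hypothetical stand-ins for learned networks: fixed random linear maps.
W_image = rng.normal(size=(image_dim, embed_dim))  # "image encoder"
W_text = rng.normal(size=(text_dim, embed_dim))    # "text encoder"
W_pred = rng.normal(size=(embed_dim, embed_dim))   # "predictor"

# Made-up input features; a real system would start from pixels and tokens.
image_features = rng.normal(size=(batch, image_dim))
text_features = rng.normal(size=(batch, text_dim))

image_embedding = image_features @ W_image
target_embedding = text_features @ W_text        # the shared "meaning" to predict
predicted_embedding = image_embedding @ W_pred   # prediction made in embedding space

# The loss compares embeddings only; no pixels or words are reconstructed.
loss = embedding_prediction_loss(predicted_embedding, target_embedding)
print(f"embedding-space loss: {loss:.4f}")
```

The point of the sketch is only where the comparison happens: the prediction and the target are both vectors in a shared space, so the loss is small when the predicted meaning matches the target, regardless of surface details.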