multimodal (5)
[Terminology] Dual Softmax Loss — 3-line summary: apply the softmax loss twice (hence "dual"), once along the columns and once along the rows of the similarity matrix. Swapping in only this loss improves the performance of most models. The paper's premise is as follows: the retrieval loss has traditionally been two terms computed separately and then summed. The first equation takes one video and finds the most similar among B texts; the second takes one text and finds the most similar among B videos; the third minimizes the sum of the two losses, pushing the model toward the correct match. The point to remember is that the first equation maximizes Video-to-Text and the second maximizes Text-to-Video. When ..
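The three equations the excerpt refers to are not shown in the preview; in the standard symmetric form used in video-text retrieval (batch size $B$, similarity $s$, temperature $\tau$), they plausibly read:

$$\mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(s(v_i, t_i)/\tau)}{\sum_{j=1}^{B} \exp(s(v_i, t_j)/\tau)}, \qquad \mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(s(v_i, t_i)/\tau)}{\sum_{j=1}^{B} \exp(s(v_j, t_i)/\tau)}, \qquad \mathcal{L} = \mathcal{L}_{v2t} + \mathcal{L}_{t2v}$$

Below is a minimal PyTorch sketch of both this baseline and a dual-softmax variant. It is not the paper's official code; the function names, temperature values, and exact placement of the prior softmax are illustrative assumptions.

```python
# Minimal sketch (assumptions, not official code): symmetric retrieval loss
# and a dual-softmax variant over a (B, B) video-text similarity matrix
# whose diagonal entries are the matching pairs.
import torch
import torch.nn.functional as F

def symmetric_retrieval_loss(sim: torch.Tensor, logit_scale: float = 100.0) -> torch.Tensor:
    # Row-wise cross-entropy = video-to-text; transposed = text-to-video.
    targets = torch.arange(sim.size(0), device=sim.device)
    logits = sim * logit_scale
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

def dual_softmax_loss(sim: torch.Tensor, logit_scale: float = 100.0,
                      prior_temp: float = 100.0) -> torch.Tensor:
    # Before the usual cross-entropy, each direction's logits are reweighted
    # by a softmax over the *other* axis (columns for v2t, rows for t2v), so
    # the softmax is applied twice -- once per axis ("dual").
    targets = torch.arange(sim.size(0), device=sim.device)
    v2t_logits = sim * F.softmax(sim * prior_temp, dim=0)  # column-wise prior
    t2v_logits = sim * F.softmax(sim * prior_temp, dim=1)  # row-wise prior
    return (F.cross_entropy(v2t_logits * logit_scale, targets)
            + F.cross_entropy(t2v_logits.t() * logit_scale, targets))

# Usage with random embeddings (B = 4 is an illustrative batch size):
if __name__ == "__main__":
    B, D = 4, 256
    video = F.normalize(torch.randn(B, D), dim=-1)
    text = F.normalize(torch.randn(B, D), dim=-1)
    sim = video @ text.t()  # cosine similarities
    print(symmetric_retrieval_loss(sim).item(), dual_softmax_loss(sim).item())
```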
[Paper Review] VLIS: Unimodal Language Models Guide Multimodal Language Generation — Link: https://arxiv.org/abs/2310.09767 — Abstract: "Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic under.."
[Paper Review] Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? — Link: https://arxiv.org/abs/2301.00184 — Abstract: "Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevan.."
[Paper Review] Unmasked Teacher: Towards Training-Efficient Video Foundation Models — Link: https://arxiv.org/abs/2303.16058v1 — Abstract: "Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the.."
[Paper Review] Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling — Link: https://arxiv.org/abs/2102.06183 — Abstract: "The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features fro.."