LLaVA (3)

[Paper Review] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Paper: ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Link: https://arxiv.org/abs/2311.12793
Preview: In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dat..

[Paper Review] VeCLIP: Improving CLIP Training via Visual-enriched Captions
Paper: VeCLIP: Improving CLIP Training via Visual-enriched Captions
Link: https://arxiv.org/abs/2310.07699
Preview: Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving preci..

[Paper Review] VLIS: Unimodal Language Models Guide Multimodal Language Generation
Paper: VLIS: Unimodal Language Models Guide Multimodal Language Generation
Link: https://arxiv.org/abs/2310.09767
Preview: Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models face challenges in tasks that require complex linguistic under..