VLM Multimodal Tutorial (EN)
§0 TL;DR Cheat Sheet
One page covering the core interview points for vision-language models (see §1–§13 below for derivations and code).
- Vision encoder = ViT-dominated: Dosovitskiy et al. 2021 (ICLR) slice images into $P\times P$ patches (typically $P=14$ or $16$), apply a linear projection + learnable positional embedding + optional [CLS] token, and feed them into a Transformer encoder. The vision side of CLIP / SigLIP / LLaVA is all a ViT variant.
- CLIP symmetric InfoNCE (must derive): Radford et al. 2021 (ICML) train image embeddings $\mathbf{u}_i$ and text embeddings $\mathbf{v}_i$ contrastively in a shared space, with loss = average of row softmax + column softmax: $\mathcal{L} = \tfrac{1}{2}(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$. The temperature $\tau$ is learnable (log-parameterized as logit_scale $= \log(1/\tau)$, with the scale $1/\tau$ clipped to at most 100).
- SigLIP replaces softmax with sigmoid: Zhai et al. 2023 (ICCV) treat each entry of the $N\times N$ similarity matrix as an independent binary cross-entropy, removing the batch-wise softmax normalization. The loss therefore needs no global normalization across devices, is more memory-efficient at large batch sizes, and works better at small ones; the paper finds that a batch size of about 32k is already sufficient. A learnable bias $b$ corrects the early dominance of negatives. SigLIP-2 (Google 2025) adds caption + self-distillation + dense local objectives and extends to multilingual.
- LLaVA = projector + 2-stage train: Liu et al. 2023 (NeurIPS) use a lightweight MLP projector to project frozen CLIP visual features into LLM token space. Stage 1 trains only the projector for feature alignment (caption data); Stage 2 unfreezes the LLM for visual instruction tuning (158K instructions generated by GPT-4).
- Q-Former vs Projector is the central BLIP-2 trade-off: Li et al. 2023 (ICML) use 32 learnable query tokens that do cross-attention over a frozen image encoder, compressing any resolution / number of patches into a fixed 32 tokens — stable compute budget but lossy + complex to train. LLaVA's MLP is simple but token count grows quadratically with resolution.
- Flamingo / Llama-3.2-Vision = gated cross-attn: Alayrac et al. 2022 (NeurIPS) use a Perceiver Resampler (64 latent queries) to compress visual features into a fixed token count, then insert gated cross-attention layers every few LLM layers ($\tanh$ gating initialized at 0, preserving the frozen LLM's text-only capability).
- Qwen2-VL's M-RoPE (must-know): Wang et al. 2024 split the rotary dimensions along head_dim into 6 chunks and assign them to three groups of position ids (temporal $t$, height $h$, width $w$) in the order $(t, h, w, t, h, w)$. The typical config is mrope_section=[16,24,24] (units are rotary pairs, i.e. entries of half the head_dim, so $\sum \times 2 = $ head_dim $= 128$ and all 128 dims rotate). Each token thus carries a 3D $(t, h, w)$ position without flattening. It pairs with native dynamic resolution (images are no longer padded to a fixed 224×224).
- Three-stage training + preference optimization: (1) alignment trains the projector / Q-Former; (2) visual instruction tuning unfreezes part of the LLM; (3) preference optimization (LLaVA-RLHF, RLAIF-V, VLM-R1, DPO/PPO) addresses hallucinations and long-tail alignment. VLM-R1 (2025) uses GRPO + verifiable reward to transfer reasoning ability into vision-language tasks.
§1 Intuition: what is a VLM doing?
Think of an image as "another language". The work of a VLM splits into three parts:
- Visual tokenizer: compress pixels into a discrete or continuous token sequence (ViT patch → embedding)
- Cross-modal alignment: bring image / text of the same semantics close in a shared space — this is what CLIP / SigLIP do; essentially learning a shared embedding space
- Cross-modal generation: let the LLM "see" the image inside the prompt — this is what LLaVA / Qwen-VL / Flamingo do; essentially stuffing image tokens as a prefix into the LLM's context
The central architectural split in VLMs is how the two modalities are fused:
- Late fusion (dual-encoder + contrastive): CLIP / SigLIP, no cross-modal attention, only push / pull in the embedding space
- Projector fusion (visual tokens → LLM context): LLaVA / Qwen-VL, project image embeddings into LLM token space and concatenate as input tokens, then autoregressively decode
- Cross-attn fusion (image as KV, text as Q): Flamingo / Llama-3.2-V, add cross-attention layers so the LLM's text tokens actively query the visual KV (BLIP-2 also relies on cross-attention, but inside its Q-Former; the resulting 32 query tokens then enter the LLM as a prefix)
Compare from a Q/K/V perspective: in the projector paradigm the image is part of the LLM's input sequence (full interaction inside self-attention); in the cross-attn paradigm the image is always KV and is only queried — this causes different KV cache handling at inference time.
§2 ViT: turning an image into a token sequence
2.1 Patch tokenize
Input image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, slice it into $N = HW/P^2$ patches of $P\times P$, flatten each patch to a $P^2 C$-dim vector, and pass it through a linear layer to $D$ dimensions:
$$\mathbf{z}_0 = [\mathbf{x}_\text{class};\ \mathbf{x}^1_p \mathbf{E};\ \mathbf{x}^2_p \mathbf{E};\ \dots;\ \mathbf{x}^N_p \mathbf{E}] + \mathbf{E}_\text{pos}$$
- $\mathbf{E} \in \mathbb{R}^{P^2 C \times D}$ is the patch embedding matrix (equivalent to a Conv2D with stride=$P$, kernel=$P$)
- $\mathbf{x}_\text{class} \in \mathbb{R}^D$ is the learnable [CLS] token, used to aggregate global information (for classification, take $\mathbf{z}_L^0$)
- $\mathbf{E}_\text{pos} \in \mathbb{R}^{(N+1) \times D}$ is a learnable 1D positional embedding: the original ViT uses 1D learned rather than 2D-aware embeddings (the paper's Appendix D.4 reports no significant gain from 2D-aware variants over the 1D learned one)
CLIP's ViT uses [CLS] for its output, whereas SigLIP / EVA-CLIP / modern LLaVA tend to pool the patch tokens or keep all patch tokens to feed downstream. [CLS] is the ViT paper's choice, not an intrinsic part of ViT.
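As a quick check of the shapes in this subsection, the sketch below computes the patch grid, the token count $N = HW/P^2$ (plus one if [CLS] is used) and the flattened patch dimension $P^2 C$ for a few common configurations. The helper name patch_stats is hypothetical, and it assumes the image sides are divisible by $P$.

```python
def patch_stats(height, width, patch, channels=3, use_cls=True):
    """Return (grid_h, grid_w, num_tokens, flat_patch_dim) for a ViT patchifier.

    Hypothetical helper for illustration; assumes H and W are divisible by P.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid_h, grid_w = height // patch, width // patch
    num_patches = grid_h * grid_w              # N = HW / P^2
    num_tokens = num_patches + (1 if use_cls else 0)
    flat_patch_dim = patch * patch * channels  # P^2 * C, the input dim of E
    return grid_h, grid_w, num_tokens, flat_patch_dim


for h, w, p in [(224, 224, 16), (224, 224, 14), (336, 336, 14)]:
    print((h, w, p), patch_stats(h, w, p))
```

Halving the patch side or doubling the image side multiplies the token count by four, which is the quadratic growth that later sections try to control.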
2.2 Transformer backbone
$$\mathbf{z}'_\ell = \text{MHA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \quad \mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell$$
Pre-norm (LN at the input of each sub-layer), MLP uses GELU. Note that the original ViT has a fixed number of patches ($224/16=14 \Rightarrow N=196$), and its positional embedding table is fixed in size — this is the pain point that dynamic resolution must solve (§10).
2.3 ViT specifications
| Model | Patch | Hidden $D$ | Layers | Heads | Params | Source |
|---|---|---|---|---|---|---|
| ViT-B/16 | 16 | 768 | 12 | 12 | 86M | Dosovitskiy 2021 |
| ViT-L/14 | 14 | 1024 | 24 | 16 | 304M | Radford 2021 (CLIP) |
| ViT-H/14 | 14 | 1280 | 32 | 16 | 632M | Dosovitskiy 2021 |
| ViT-g/14 | 14 | 1408 | 40 | 16 | 1.0B | Zhai et al. 2022 |
| ViT-bigG/14 | 14 | 1664 | 48 | 16 | 1.8B | OpenCLIP, 2023 |
| EVA-02-L/14 | 14 | 1024 | 24 | 16 | 304M | Fang 2023 |
| SigLIP SoViT-400M/14 | 14 | 1152 | 27 | 16 | 400M | Alabdulmohsin 2023 |
ViT-family models keep head_dim $= D / H$ in a fairly narrow range: about 64–104 for the models in the table above (64 for ViT-B/16 and ViT-L/14, 104 for ViT-bigG/14). head_dim is not pushed below 64, because a very small head_dim limits per-head expressiveness.
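The head_dim values can be read directly off the table. The short sketch below just divides the Hidden $D$ column by the Heads column; the dictionary name VIT_SPECS is an illustrative choice, and the numbers are copied from the table rather than from any library.

```python
# (hidden size D, number of heads H) copied from the specification table
VIT_SPECS = {
    "ViT-B/16": (768, 12),
    "ViT-L/14": (1024, 16),
    "ViT-H/14": (1280, 16),
    "ViT-g/14": (1408, 16),
    "ViT-bigG/14": (1664, 16),
    "EVA-02-L/14": (1024, 16),
    "SigLIP SoViT-400M/14": (1152, 16),
}

# head_dim = D / H; every entry in the table divides exactly
head_dims = {name: d // h for name, (d, h) in VIT_SPECS.items()}

for name, hd in head_dims.items():
    print(f"{name:<22} head_dim = {hd}")
```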
2.4 Code: ViT patch embed + backbone (core 60 lines)
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.img_size, self.patch_size = img_size, patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with stride=P, kernel=P is equivalent to a linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: [B, C, H, W]
        x = self.proj(x)                           # [B, D, H/P, W/P]
        x = x.flatten(2).transpose(1, 2)           # [B, N, D]
        return x


class ViTBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)   # self-attention
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x


class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000, use_cls=True):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
        N = self.patch_embed.num_patches
        self.use_cls = use_cls
        if use_cls:
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, N + 1, embed_dim))
        else:
            self.pos_embed = nn.Parameter(torch.zeros(1, N, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        if use_cls:
            nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.blocks = nn.ModuleList([ViTBlock(embed_dim, num_heads) for _ in range(depth)])
        self.ln = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward(self, x):
        B = x.size(0)
        x = self.patch_embed(x)                    # [B, N, D]
        if self.use_cls:
            cls = self.cls_token.expand(B, -1, -1)
            x = torch.cat([cls, x], dim=1)         # [B, N+1, D]
        x = x + self.pos_embed                     # broadcast over batch
        for blk in self.blocks:
            x = blk(x)
        x = self.ln(x)
        feat = x[:, 0] if self.use_cls else x.mean(dim=1)   # CLS or mean-pool
        return self.head(feat)
When transferring a patch-14 ViT from $224^2$ to $336^2$, the pos_embed table needs resizing from $(16^2 + 1)$ rows to $(24^2 + 1)$ rows. Correct procedure: keep [CLS] unchanged, reshape the patch portion to $16\times 16\times D$, bicubic-interpolate to $24\times 24$, then flatten and concatenate back. Pitfall: directly doing 1D interpolation over $(N+1)$ rows treats [CLS] as a patch and ignores the 2D layout of the grid.
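The resizing procedure can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code of any particular library: the function name resize_pos_embed is hypothetical, a square patch grid is assumed, and the example uses a patch-14 ViT going from $224^2$ (16×16 grid) to $336^2$ (24×24 grid).

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid, has_cls=True):
    """Bicubic-resize a learned ViT positional embedding table.

    pos_embed: [1, old_grid**2 (+1 if has_cls), D]. Hypothetical helper for
    illustration, assuming a square patch grid.
    """
    if has_cls:
        cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    else:
        cls_pos, patch_pos = None, pos_embed
    dim = patch_pos.shape[-1]
    # [1, N, D] -> [1, D, old_grid, old_grid] so interpolation runs over the 2D grid
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # back to [1, new_grid**2, D]
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    if cls_pos is None:
        return patch_pos
    return torch.cat([cls_pos, patch_pos], dim=1)   # [CLS] row is copied untouched


old = torch.randn(1, 16 * 16 + 1, 1024)             # e.g. ViT-L/14 at 224^2
new = resize_pos_embed(old, old_grid=16, new_grid=24)
print(new.shape)
```

The [CLS] row is split off before interpolation and concatenated back afterwards, which is exactly what the 1D-interpolation shortcut gets wrong.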
§3 CLIP: symmetric InfoNCE (must derive)
3.1 Formalizing the objective
CLIP (Radford et al. 2021, ICML) trains with $N$ (image, text) pairs per batch. Two encoders $f_\theta$ (image), $g_\phi$ (text) produce $\ell_2$-normalized embeddings:
$$\mathbf{u}_i = \frac{f_\theta(I_i)}{\|f_\theta(I_i)\|_2}, \quad \mathbf{v}_j = \frac{g_\phi(T_j)}{\|g_\phi(T_j)\|_2}, \quad \mathbf{u}_i, \mathbf{v}_j \in S^{D-1}$$
Define the similarity matrix $\mathbf{S} \in \mathbb{R}^{N\times N}$ (the "logit"):
$$S_{ij} = \frac{\mathbf{u}_i^\top \mathbf{v}_j}{\tau}$$
where $\tau > 0$ is the learnable temperature (engineered as logit_scale = log(1/τ), which is more stable in backprop, clamped in $[\log 1, \log 100]$).
3.2 Symmetric InfoNCE loss (average of row + column softmax)
Image → Text direction (for each image $i$, positive sample is $T_i$, negatives are $\{T_j\}_{j\neq i}$):
$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})}$$
Text → Image direction:
$$\mathcal{L}_{t\to i} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp(S_{jj})}{\sum_{i=1}^{N} \exp(S_{ij})}$$
Total symmetric loss:
$$\boxed{\;\mathcal{L}_\text{CLIP} = \frac{1}{2}\left(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i}\right)\;}$$
In short: for the matrix $\mathbf{S}$, apply row softmax and take the NLL of the diagonal (image→text), then apply column softmax and take the NLL of the diagonal (text→image). The mean of the two cross-entropies is the CLIP loss.
3.3 Gradient derivation (why symmetry matters)
Fix $\tau=1$. For the row logits $\mathbf{s}_i = (S_{i1},\dots,S_{iN})^\top$ inside $\mathcal{L}_{i\to t}$, apply softmax and let $p_{ij} = \text{softmax}(\mathbf{s}_i)_j$. Then:
$$\frac{\partial \mathcal{L}_{i\to t}}{\partial S_{ij}} = \frac{1}{N}\left(p_{ij} - \mathbb{1}[j=i]\right)$$
- $j=i$ (positive): gradient $\propto p_{ii} - 1 < 0$, pulls $\mathbf{u}_i, \mathbf{v}_i$ closer
- $j \neq i$ (negative): gradient $\propto p_{ij} > 0$, pushes $\mathbf{u}_i, \mathbf{v}_j$ apart
With only one direction $\mathcal{L}_{i\to t}$, $\mathbf{v}_j$ receives gradient from all $\mathbf{u}_i$, but cannot in turn constrain how $\mathbf{u}_i$ behaves when retrieved by other $\mathbf{v}_k$. Symmetrization adds the text→image retrieval constraint, preventing one-sided collapse in the embedding space (where image-side clusters tightly but text-side drifts).
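The gradient formula above can be sanity-checked numerically. The sketch below uses only the standard library, fixes $\tau = 1$, and compares the analytic expression $(p_{ij} - \mathbb{1}[j=i])/N$ against central finite differences on a small hand-picked $3\times 3$ logit matrix (the matrix values are arbitrary).

```python
import math


def i2t_loss(S):
    """Row-softmax InfoNCE: mean over rows of -log softmax(S[i])[i]."""
    n = len(S)
    total = 0.0
    for i in range(n):
        log_z = math.log(sum(math.exp(s) for s in S[i]))
        total += log_z - S[i][i]
    return total / n


def analytic_grad(S):
    """dL/dS_ij = (p_ij - 1[i == j]) / N, with p the row softmax."""
    n = len(S)
    grad = []
    for i in range(n):
        z = sum(math.exp(s) for s in S[i])
        grad.append([(math.exp(S[i][j]) / z - (1.0 if i == j else 0.0)) / n
                     for j in range(n)])
    return grad


def numeric_grad(S, eps=1e-6):
    """Central finite differences on every entry of S."""
    n = len(S)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            plus = [row[:] for row in S]
            minus = [row[:] for row in S]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (i2t_loss(plus) - i2t_loss(minus)) / (2 * eps)
    return grad


S = [[2.0, 0.5, -1.0],
     [0.3, 1.5, 0.2],
     [-0.4, 0.9, 1.1]]
ga, gn = analytic_grad(S), numeric_grad(S)
max_err = max(abs(ga[i][j] - gn[i][j]) for i in range(3) for j in range(3))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

The diagonal entries of the gradient are negative (gradient descent raises the positive logit) and the off-diagonal ones are positive (negative logits are pushed down), matching the two bullet points.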
3.4 Role of temperature
$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_i \log\frac{\exp(\mathbf{u}_i^\top \mathbf{v}_i / \tau)}{\sum_j \exp(\mathbf{u}_i^\top \mathbf{v}_j / \tau)}$$
- $\tau \to 0^+$ (very small): softmax is nearly one-hot, only the hardest negative matters (the most similar but incorrect text); gradient is dominated by one or two negatives and training is unstable
- $\tau \to \infty$ (very large): softmax is uniform, positives and negatives are nearly indistinguishable, loss approaches the $\log N$ constant, almost no gradient
- OpenAI CLIP's learned steady state: $\tau \approx 0.01$ (logit_scale ≈ log(100)); the scale is clamped at this upper bound to prevent training instability
Oord et al. 2018 (CPC) proved InfoNCE is a lower bound on the mutual information $I(U; V)$: $I(U; V) \ge \log N - \mathcal{L}_\text{InfoNCE}$. So increasing batch size $N$ while reducing loss directly raises the MI lower bound — this is why CLIP / SigLIP both chase huge batches.
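To make the two temperature limits concrete, here is a small standard-library sketch that evaluates the image→text loss and the softmax probability of the positive on a toy $4\times 4$ cosine-similarity matrix. The similarity values are made up for illustration; only the trend across $\tau$ matters.

```python
import math


def i2t_loss(sim, tau):
    """Image->text InfoNCE for a cosine-similarity matrix `sim` at temperature tau."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / tau for s in sim[i]]
        m = max(logits)                                   # stabilise logsumexp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / n


def positive_prob(sim, tau, i=0):
    """Softmax probability that row i assigns to its positive (the diagonal)."""
    logits = [s / tau for s in sim[i]]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[i] / sum(exps)


# Toy cosine similarities: the diagonal positives are only slightly larger.
sim = [[0.30, 0.25, 0.10, 0.05],
       [0.20, 0.35, 0.15, 0.00],
       [0.10, 0.05, 0.40, 0.30],
       [0.00, 0.10, 0.25, 0.33]]

for tau in (0.01, 0.07, 1.0, 100.0):
    print(f"tau={tau:<6} loss={i2t_loss(sim, tau):.4f} "
          f"p_pos(row 0)={positive_prob(sim, tau):.4f}")
```

At small $\tau$ the softmax is close to one-hot even though the raw cosine gap is only 0.05; at large $\tau$ the probability of the positive approaches $1/N$ and the loss approaches $\log N$.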
3.5 Code: CLIP symmetric InfoNCE (core 50 lines)
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPLoss(nn.Module):
    """Symmetric InfoNCE used by OpenAI CLIP (Radford et al. 2021)."""
    def __init__(self, init_tau=0.07, max_logit_scale=4.6052):
        super().__init__()
        # Equivalent to logit_scale = log(1/τ); initial ~ log(1/0.07) ≈ 2.659
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / init_tau).log())
        self.max_logit_scale = max_logit_scale   # log(100), clamp to prevent blow-up

    def forward(self, image_feats, text_feats):
        """
        image_feats: [N, D]  (unnormalized)
        text_feats:  [N, D]
        """
        # L2 normalize to the unit sphere
        u = F.normalize(image_feats, dim=-1)     # [N, D]
        v = F.normalize(text_feats, dim=-1)      # [N, D]
        # Clamp logit_scale upper bound (late training rises to ~log(100))
        logit_scale = self.logit_scale.clamp(max=self.max_logit_scale).exp()
        # Similarity matrix
        logits_i2t = logit_scale * u @ v.t()     # [N, N]
        logits_t2i = logits_i2t.t()              # [N, N]
        # The diagonal contains the positive pairs
        N = u.size(0)
        labels = torch.arange(N, device=u.device)
        loss_i2t = F.cross_entropy(logits_i2t, labels)   # row softmax NLL
        loss_t2i = F.cross_entropy(logits_t2i, labels)   # column softmax NLL
        return 0.5 * (loss_i2t + loss_t2i), logit_scale


# Example (under DDP, you need to all-gather feats from all GPUs before computing)
if __name__ == "__main__":
    N, D = 8, 512
    img_feats = torch.randn(N, D)
    txt_feats = torch.randn(N, D)
    criterion = CLIPLoss()
    loss, scale = criterion(img_feats, txt_feats)
    print(f"loss={loss.item():.4f}  logit_scale={scale.item():.2f}")
On a single GPU, the loss over a batch of $N$ only covers local negatives. Production CLIP training (OpenCLIP / OpenAI) runs dist.all_gather on $\mathbf{u}, \mathbf{v}$ after the forward pass, so the negative pool = global batch size (e.g. 32k). To save memory, each GPU can compute the loss only for its local rows / columns against the gathered features (OpenCLIP's local_loss option) instead of materializing the full global $N\times N$ matrix. This is an engineering trick, not a math change.
3.6 CLIP training data & scale
- WIT (WebImageText): 400M (image, text) pairs scraped from the internet (not released)
- LAION-400M / LAION-2B: the open-source replacement used by OpenCLIP; a series of scales were trained in 2022–2023
- DataComp: Gadre et al. 2023 (NeurIPS) proposed a systematic data-filtering benchmark, data quality > data scale
- Model scale: OpenAI's largest is ViT-L/14; OpenCLIP trained ViT-bigG/14 (LAION-2B), with zero-shot ImageNet ~80%+
3.7 CLIP's failure modes
- Poor OCR / text understanding: training captions usually do not describe the text inside an image, so CLIP is essentially "blind" to in-image text
- Fine-grained counting fails: "5 birds" vs "6 birds" is nearly indistinguishable in CLIP embeddings (the "counting problem")
- Weak on spatial relations: "cat on top of dog" vs "dog on top of cat" is hard to distinguish (compositional benchmarks such as Winoground quantify this)
- Bag-of-words tendency: Yuksekgonul et al. 2023 (ICLR) showed CLIP is barely sensitive to word order in captions
§4 SigLIP: replacing softmax with sigmoid; batch scaling rewritten
4.1 Motivation
CLIP's softmax normalization couples all N×N similarities together: each positive's gradient depends on the row-wide logsumexp of negatives. This causes:
- Strong batch-size sensitivity: doubling N significantly changes the loss landscape; small batches barely learn
- Expensive DDP communication: must all-gather embeddings (O(N·D) bytes), creating a communication bottleneck
- Numerical instability: softmax overflows easily at very large N
Zhai et al. 2023 (ICCV) proposed SigLIP: treat each entry of the N×N matrix as independent binary classification.
4.2 Sigmoid loss derivation
Define similarity $S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$, where $t = e^{t'}$ is a learnable scale (same as CLIP's $1/\tau$) and $b$ is a learnable bias (initialized to a negative number, e.g. $b_0 = -10$, to avoid early prediction of "all positive").
Label $y_{ij} = +1$ if $i=j$, $-1$ otherwise. Each entry does binary logistic regression:
$$\mathcal{L}_\text{SigLIP} = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \log \sigma\!\left(y_{ij} \cdot S_{ij}\right) = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \log\!\left(1 + \exp(-y_{ij} S_{ij})\right)$$
The loss for each $(i,j)$ entry does not depend on any other entry. Therefore:
- batch size no longer couples all negatives
- memory per device is lower, so a larger batch fits on the same hardware (the SigLIP paper finds that a batch size of around 32k is already sufficient)
- communication only needs pairing local queries with remote keys and computing sigmoid entries (chunked all-pair); no need to synchronize logsumexp
Why the bias matters: early in training, $\mathbf{u}, \mathbf{v}$ are near-random, so $\mathbf{u}_i^\top \mathbf{v}_j$ is near 0 and, without a bias, the sigmoid outputs about 0.5. Negatives number $N^2 - N \approx N^2$, while positives are only $N$; if initial predictions are all ~0.5, negative-sample gradients dominate early training. SigLIP initializes $b_0 \approx -10$, so the sigmoid is initially near 0: all entries are first predicted as negative, positives then have large loss and negatives have small loss, and starting from this state makes training stable.
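The imbalance argument can be put in numbers. The sketch below assumes that at initialization every pair has cosine similarity 0, so $S_{ij} = b$ for all entries, and splits the SigLIP loss (sum over entries divided by $N$) into its positive and negative parts. The helper name init_loss_split is hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def init_loss_split(n, t=10.0, b=-10.0, cos=0.0):
    """Positive / negative share of the SigLIP loss at initialisation.

    Assumes every pair has the same cosine similarity `cos` (about 0 for
    random embeddings), so S_ij = t * cos + b for all entries.
    """
    s = t * cos + b
    pos = n * softplus(-s) / n             # N diagonal entries, y = +1
    neg = (n * n - n) * softplus(s) / n    # N^2 - N off-diagonal entries, y = -1
    return pos, neg


for b in (0.0, -10.0):
    pos, neg = init_loss_split(n=32768, b=b)
    print(f"b={b:>5}: positive part={pos:.4f}  negative part={neg:.4f}")
```

With $b = 0$ the negative part is $N - 1$ times the positive part; with $b = -10$ the positive part is the larger of the two at this batch size, so the first updates are not swamped by the negatives.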
4.3 SigLIP vs CLIP comparison
| Dimension | CLIP (softmax) | SigLIP (sigmoid) |
|---|---|---|
| Loss form | $\propto$ logsumexp(row) + logsumexp(col) | $\propto$ $\sum_{ij}$ binary logistic |
| Batch dependence | Strong (gradient couples batch) | Weak (entries independent) |
| Communication | all-gather embeddings | chunked all-pair sigmoid |
| Bias term | None (implicitly absorbed by softmax) | learnable $b$, init $\approx -10$ |
| Small-batch behavior | Noticeably worse at small batches (below ~16k in the paper) | Significantly better |
| Large-batch behavior | Needs large batches to catch up; then saturates | Saturates around 32k; larger batches add little |
| Zero-shot ImageNet (ViT-L/14, 400M data) | ~75% | ~76–78% |
4.4 SigLIP-2 (Google 2025)
Tschannen et al. 2025, building on SigLIP-1:
- Adds a caption-style decoder (akin to CapPa) as a captioning auxiliary task
- Self-distillation + dense local objectives: patch-level local contrast for better detail localization
- Multilingual extension: training data scaled to 100+ languages, with substantial multilingual zero-shot gains
- Released a NaFlex variant for native aspect ratios
4.5 Code: SigLIP sigmoid loss (core 35 lines)
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigLIPLoss(nn.Module):
    """Sigmoid Loss for Language Image Pre-training (Zhai et al. 2023)."""
    def __init__(self, init_t=10.0, init_b=-10.0):
        super().__init__()
        # log-parameterize t for stability; b is a learnable bias
        self.t_prime = nn.Parameter(torch.tensor(init_t).log())
        self.b = nn.Parameter(torch.tensor(float(init_b)))

    def forward(self, image_feats, text_feats):
        u = F.normalize(image_feats, dim=-1)     # [N, D]
        v = F.normalize(text_feats, dim=-1)      # [N, D]
        t = self.t_prime.exp()                   # scale > 0
        logits = t * (u @ v.t()) + self.b        # [N, N]
        # y_{ij} = +1 if i == j else -1
        N = u.size(0)
        labels = 2 * torch.eye(N, device=u.device) - 1   # [N, N], +1 on diag, -1 off
        # log(1 + exp(-y * logits)) == -log sigmoid(y * logits)
        loss = -F.logsigmoid(labels * logits).sum() / N  # SigLIP convention: sum / N
        return loss, t, self.b
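Normalization convention: the paper's Eq. (1) normalizes by batch size $N$ (sum over all entries, divided by $N$), not by the number of matrix elements $N^2$. Pitfall: averaging over all entries with .mean() divides by $N^2$ instead, which makes the loss and its gradients $N$ times smaller than the paper's convention and changes the effective learning rate. Correct: .sum() / N.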
§5 EVA-CLIP / OpenCLIP / other CLIP variants
5.1 OpenCLIP
OpenCLIP (Cherti et al. 2023 CVPR) is the LAION team's open-source reproduction + extension:
- Open training recipe: full LAION-400M / LAION-2B training scripts
- Larger scale: ViT-bigG/14 trained on LAION-2B reaches zero-shot ImageNet ~80.1% (SOTA in 2023)
- Distributed InfoNCE: implements local_loss=True (usually combined with gather_with_grad); each GPU only computes and stores its local rows/columns of the similarity matrix
5.2 EVA-CLIP
EVA-CLIP (Sun et al. 2023) uses MIM-pretrained EVA / EVA-02 (Fang et al. 2023) as vision-tower initialization, substantially improving sample efficiency:
- ViT-L/14 on LAION-2B needs only 1/3 of OpenCLIP's compute budget to reach the same accuracy
- LayerScale + sub-LN + RoPE: engineering improvements on EVA-02's visual side
5.3 DataComp (data vs model vs algorithm)
Gadre et al. 2023 (NeurIPS) designed a "data filtering benchmark": fix (model, compute) and only vary the data filter. Conclusions:
- Combination of CLIP filtering + basic filtering + image-based filtering works best
- Large models (ViT-L/14) on small data (12.8M) are worse than ViT-B (in the data-limited regime, larger models overfit)
5.4 Comparison overview
| Method | Vision tower init | Loss | Batch | Training data | ImageNet zero-shot |
|---|---|---|---|---|---|
| CLIP (OpenAI) | from scratch | softmax InfoNCE | 32k | WIT 400M | 76.2% (L/14@336) |
| OpenCLIP | from scratch | softmax InfoNCE | 90k | LAION-2B | 80.1% (bigG/14) |
| EVA-CLIP | EVA-02 MIM | softmax InfoNCE | — | LAION-2B | 82.0% (E/14+) |
| SigLIP | from scratch | sigmoid | 32k | WebLI | 82.0% (So400M/14) |
| SigLIP-2 | from scratch | sigmoid + caption + distill | — | WebLI 10B | 84%+ |
| MetaCLIP | from scratch | softmax InfoNCE | — | metadata-curated CommonCrawl (2.5B) | 80.5% (H/14) |
The SigLIP family has consistently surpassed CLIP on zero-shot ImageNet and downstream retrieval; representative open-weight VLMs using SigLIP-So400M are PaliGemma and LLaVA-OneVision. The InternVL series uses its in-house InternViT; Qwen2-VL trains its own ViT; LLaVA-1.5/1.6 and Molmo still use CLIP ViT-L/14, so "switching to SigLIP" is not an industry consensus.
§6 LLaVA: projector + 2-stage training
6.1 Architecture
LLaVA (Liu et al. 2023 NeurIPS) centers on a trio:
Image ──► CLIP ViT-L/14 ──► visual features z_v ∈ R^{N × d_v}
│
│ W ∈ R^{d_v × d_LLM} ← MLP projector
↓
H_v ∈ R^{N × d_LLM}
│
│ concatenated with text embedding
↓
Text tokens ──► tokenizer ──► H_t ──► [<bos>, H_v, H_t] ──► LLM (Vicuna / LLaMA-2)
│
↓
autoregressive response
- Vision tower: CLIP ViT-L/14 (frozen, taking the second-to-last layer's patch tokens). LLaVA-1.0 uses $224^2$ input giving $N=256$; LLaVA-1.5 upgrades to $336^2$, giving $N=576$
- Projector $W$: LLaVA-1.0 uses a single Linear; LLaVA-1.5 upgrades to 2-layer MLP + GELU (the paper reports a clear improvement in instruction following)
- LLM: Vicuna-13B (LLaVA-1.0/1.5) or LLaMA-2
6.2 Two-stage training
Stage 1: Feature Alignment Pre-training
- Use caption data (a filtered 595K subset of CC3M in LLaVA-1.0; the 558K LAION / CC / SBU subset in LLaVA-1.5) in the format <image>\n<caption>
- Train the projector $W$ only, freezing vision tower and LLM
- Goal: project visual features close to the LLM's word embedding space
Stage 2: End-to-end Visual Instruction Tuning
- Use 158K visual instructions generated by GPT-4 (LLaVA-Instruct)
- Unfreeze projector + LLM; vision tower remains frozen
- The LLM learns to "understand images, answer questions, follow visual instructions"
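The freezing schedule of the two stages comes down to toggling requires_grad per sub-module. The sketch below shows this on a toy stand-in: ToyVLM and set_trainable are hypothetical names, and the three nn.Linear layers merely play the roles of vision tower, projector and LLM.

```python
import torch.nn as nn


class ToyVLM(nn.Module):
    """Tiny stand-in with the three LLaVA parts (not the real model)."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 16)
        self.llm = nn.Linear(16, 16)


def set_trainable(model, stage):
    """Stage 1: projector only. Stage 2: projector + LLM. Vision tower stays frozen."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    model.vision_tower.requires_grad_(False)
    model.projector.requires_grad_(True)
    model.llm.requires_grad_(stage == 2)
    # Return the names of the parameters an optimizer should receive
    return [n for n, p in model.named_parameters() if p.requires_grad]


model = ToyVLM()
print("stage 1:", set_trainable(model, stage=1))
print("stage 2:", set_trainable(model, stage=2))
```

In a real training script the optimizer would be built from the parameters that still require gradients after this call, once per stage.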
6.3 Code: LLaVA-style projector + forward (core 60 lines)
import torch
import torch.nn as nn


class LLaVAProjector(nn.Module):
    """2-layer MLP + GELU, as in LLaVA-1.5."""
    def __init__(self, d_vision=1024, d_llm=4096):
        super().__init__()
        self.fc1 = nn.Linear(d_vision, d_llm)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_llm, d_llm)

    def forward(self, x):                        # x: [B, N, d_vision]
        return self.fc2(self.act(self.fc1(x)))   # [B, N, d_llm]


class LLaVA(nn.Module):
    """Skeleton: CLIP vision tower + projector + LLM."""
    def __init__(self, vision_tower, projector, llm, image_token_id):
        super().__init__()
        self.vision_tower = vision_tower         # CLIP ViT, frozen in both stages
        self.projector = projector
        self.llm = llm                           # e.g. LlamaForCausalLM
        self.image_token_id = image_token_id     # special <image> placeholder

    @torch.no_grad()
    def encode_image(self, pixel_values):
        # Take the second-to-last layer's patch features (skip [CLS])
        vit_out = self.vision_tower(pixel_values, output_hidden_states=True)
        feat = vit_out.hidden_states[-2][:, 1:, :]   # drop CLS, [B, N, d_v]
        return feat

    def forward(self, input_ids, pixel_values, labels=None, attention_mask=None):
        # 1. Visual features → projector → LLM dim
        with torch.no_grad():
            visual_features = self.encode_image(pixel_values)   # [B, N, d_v]
        visual_tokens = self.projector(visual_features)         # [B, N, d_llm]
        # 2. LLM's word embedding table
        token_embeds = self.llm.get_input_embeddings()(input_ids)   # [B, L, d_llm]
        # 3. Replace the <image> placeholder positions with visual_tokens
        B, L, D = token_embeds.shape
        new_embeds, new_labels, new_mask = [], [], []
        for b in range(B):
            image_pos = (input_ids[b] == self.image_token_id).nonzero(as_tuple=True)[0]
            assert image_pos.numel() == 1, "exactly one <image> placeholder expected"
            i = image_pos.item()
            # Concat: [prefix tokens] + [N visual tokens] + [suffix tokens]
            chunks = [token_embeds[b, :i], visual_tokens[b], token_embeds[b, i+1:]]
            new_embeds.append(torch.cat(chunks, dim=0))
            if labels is not None:
                lab = labels[b]
                # Label = -100 at visual token positions (not counted in loss)
                ignore = torch.full((visual_tokens.size(1),), -100, dtype=lab.dtype, device=lab.device)
                new_labels.append(torch.cat([lab[:i], ignore, lab[i+1:]], dim=0))
            if attention_mask is not None:
                am = attention_mask[b]
                ones = torch.ones(visual_tokens.size(1), dtype=am.dtype, device=am.device)
                new_mask.append(torch.cat([am[:i], ones, am[i+1:]], dim=0))
        # 4. Pad back to a batch tensor and feed the LLM
        inputs_embeds = torch.nn.utils.rnn.pad_sequence(new_embeds, batch_first=True)
        labels = torch.nn.utils.rnn.pad_sequence(new_labels, batch_first=True, padding_value=-100) if labels is not None else None
        attention_mask = torch.nn.utils.rnn.pad_sequence(new_mask, batch_first=True) if attention_mask is not None else None
        return self.llm(inputs_embeds=inputs_embeds, labels=labels, attention_mask=attention_mask)
6.4 LLaVA-1.5 / 1.6 / NeXT key upgrades
| Version | Time | Main changes |
|---|---|---|
| LLaVA-1.0 | 2023.04 | Single Linear projector; CLIP ViT-L/14@224², visual tokens = 256 ($16\times 16$) |
| LLaVA-1.5 | 2023.10 | 2-layer MLP; resolution up to 336², visual tokens = 576 ($24\times 24$); adds OCR / GQA / VQAv2 academic data |
| LLaVA-1.6 / NeXT | 2024.01 | AnyRes: slice the image into $2\times 2 / 2\times 3 / \dots$ tiles and encode each, supporting any aspect ratio; up to 2880 tokens |
| LLaVA-OneVision | 2024.08 | Unified single / multi-image / video; introduces a mix of SI (single image) + OV (onevision) data |
| LLaVA-NeXT-Video | 2024.04 | Video version; feed the LLM serialized visual features from multiple frames |
How AnyRes works: the vision tower is pretrained at a fixed 336²; a high-res image is sliced into $n \times m$ tiles of 336² each, the tiles are encoded individually, and one "global thumbnail" (the full image resized to 336²) is added. The token count goes from 576 to $(1 + n\cdot m)\cdot 576$, and every tile passes through the same ViT. InternVL-style dynamic tiling belongs to the same family.
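The token budget of AnyRes follows directly from the $(1 + n\cdot m)\cdot 576$ formula. The helper below, with the hypothetical name anyres_tokens, evaluates it for a few tile grids; it ignores separator tokens and any pooling or unpadding that a real implementation may apply.

```python
def anyres_tokens(n_rows, n_cols, tile=336, patch=14, with_thumbnail=True):
    """Visual token count for an n_rows x n_cols tiling, before any pooling.

    Hypothetical helper; ignores separator / newline tokens and the unpadding
    that real implementations may apply.
    """
    per_tile = (tile // patch) ** 2            # 24 * 24 = 576 for 336 / 14
    num_views = n_rows * n_cols + (1 if with_thumbnail else 0)
    return num_views * per_tile


for grid in [(1, 1), (1, 2), (2, 2), (1, 4)]:
    print(grid, anyres_tokens(*grid))
```

A 2×2 grid plus the thumbnail gives 5 × 576 = 2880 tokens, which matches the upper bound listed for LLaVA-1.6 / NeXT in the table.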
§7 BLIP-2: Q-Former cross-attention
7.1 Motivation
LLaVA's projector is simple, but every patch becomes an LLM token: higher resolution ↑ more tokens ↑ LLM compute $O(L^2)$ ↑. BLIP-2 (Li et al. 2023 ICML) uses a Q-Former (Querying Transformer) to compress an arbitrary number of patches into a fixed 32 tokens.
7.2 Q-Former structure
Input: frozen image encoder output $\mathbf{Z} \in \mathbb{R}^{N \times d_v}$ (N=257 for ViT-g/14@224). The Q-Former has 32 learnable query tokens $\mathbf{q}_1, \dots, \mathbf{q}_{32} \in \mathbb{R}^{d_q}$.
Per Q-Former block:
$$\mathbf{q}^{(\ell)} = \text{SelfAttn}(\mathbf{q}^{(\ell-1)})$$
$$\mathbf{q}^{(\ell)} = \text{CrossAttn}(\mathbf{q}^{(\ell)},\ \mathbf{Z},\ \mathbf{Z})\quad \text{(inserted only every other layer)}$$
$$\mathbf{q}^{(\ell)} = \text{FFN}(\mathbf{q}^{(\ell)})$$
Key points:
- Inside self-attention: queries interact with each other, not with image patches
- Inside cross-attention: queries are Q, image patches are K/V — this is the information entry point
- The output $\mathbf{q}^{(L)} \in \mathbb{R}^{32 \times d_q}$ then goes through a Linear into the LLM dimension, serving as 32 visual tokens fed to the frozen LLM
7.3 Two-stage training
Stage 1: Representation Learning (only Q-Former trained, vision encoder frozen)
- ITC (Image-Text Contrastive): query embedding contrasted with text [CLS] in CLIP fashion
- ITM (Image-Text Matching): queries and text tokens attend to each other through the shared self-attention layers (bidirectional mask), followed by binary classification
- ITG (Image-grounded Text Generation): queries cannot attend to text, while each text token attends to all queries and to previous text tokens (multimodal causal mask), so captions are generated from the queries
Stage 2: Generative Learning (only Q-Former trained, LLM frozen)
- Project Q-Former output into the LLM embedding space, so the LLM does prefix-tuned captioning / VQA based on the 32 visual tokens
7.4 Code: a single Q-Former cross-attention layer (core 40 lines)
import torch
import torch.nn as nn


class QFormerLayer(nn.Module):
    """One Q-Former block: SelfAttn (queries) -> CrossAttn (queries <- image) -> FFN."""
    def __init__(self, d_q=768, d_v=1408, num_heads=12, mlp_ratio=4, has_cross=True):
        super().__init__()
        self.has_cross = has_cross
        self.ln_self = nn.LayerNorm(d_q)
        self.self_attn = nn.MultiheadAttention(d_q, num_heads, batch_first=True)
        if has_cross:
            self.ln_cross = nn.LayerNorm(d_q)
            # Q comes from query (d_q), K/V come from image feats (d_v) -> adapt via kdim/vdim
            self.cross_attn = nn.MultiheadAttention(d_q, num_heads,
                                                    kdim=d_v, vdim=d_v, batch_first=True)
        self.ln_ffn = nn.LayerNorm(d_q)
        hidden = int(d_q * mlp_ratio)
        self.ffn = nn.Sequential(nn.Linear(d_q, hidden), nn.GELU(), nn.Linear(hidden, d_q))

    def forward(self, q, image_feats=None):      # q: [B, 32, d_q]
        # Self-attention: queries talk to each other
        h = self.ln_self(q)
        a, _ = self.self_attn(h, h, h, need_weights=False)
        q = q + a
        # Cross-attention: queries attend to image patches
        if self.has_cross and image_feats is not None:
            h = self.ln_cross(q)
            a, _ = self.cross_attn(h, image_feats, image_feats, need_weights=False)
            q = q + a
        # FFN
        q = q + self.ffn(self.ln_ffn(q))
        return q


class QFormer(nn.Module):
    def __init__(self, num_queries=32, d_q=768, d_v=1408, depth=12, num_heads=12,
                 cross_every=2):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, num_queries, d_q))
        nn.init.trunc_normal_(self.queries, std=0.02)
        self.layers = nn.ModuleList([
            QFormerLayer(d_q, d_v, num_heads, has_cross=(i % cross_every == 0))
            for i in range(depth)
        ])

    def forward(self, image_feats):              # [B, N, d_v]
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)       # [B, 32, d_q]
        for layer in self.layers:
            q = layer(q, image_feats)
        return q                                 # [B, 32, d_q]
nn.MultiheadAttention defaults to K/V input dim = embed_dim. In Q-Former cross-attn, query is 768-dim and image feats are 1408-dim, so you must explicitly pass kdim=d_v, vdim=d_v, otherwise PyTorch will expect 768-dim K/V at forward time and raise a shape mismatch error (no silent truncation).
7.5 Q-Former vs LLaVA Projector: trade-off
| Dimension | LLaVA Projector | BLIP-2 Q-Former |
|---|---|---|
| Parameter count | ~20M (MLP) | ~180M (Q-Former + queries) |
| Compute | Only MLP forward | 12 Transformer layers forward (6 of them with cross-attn) |
| Visual token count | $N$ (quadratic in resolution) | Fixed 32 |
| Information loss | Almost 0 (every patch enters the LLM) | Significant (256+ patches compressed to 32) |
| Training complexity | 1 stage (pretrain) + 1 stage (IT) | 2 stages (representation + generation); stage 1 jointly optimizes ITC + ITM + ITG |
| LLM context usage | Large (576–2880 tokens) | Small (32 tokens) |
| Best for | High resolution / detail tasks | LLM context-limited / batched multimodal inference |
Qwen-VL / LLaVA-NeXT / InternVL-2 / DeepSeek-VL2 all use projector (with spatial reduction / pixel shuffle to control token count); Q-Former has faded out in industrial VLMs. But the Q-Former idea is still active in video VLMs (using queries for frame-level pooling).
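To make the "visual token count" row concrete, here is a minimal sketch that computes the projector's token count $N = (H/P)^2$ for a square input and compares it with the Q-Former's fixed 32. The name `projector_tokens` is illustrative, and the sketch assumes no pixel shuffle or other token reduction.

```python
def projector_tokens(resolution: int, patch: int = 14) -> int:
    """Visual tokens an MLP projector hands to the LLM: one per ViT patch."""
    if resolution % patch != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch
    return side * side


QFORMER_TOKENS = 32  # fixed number of learnable queries in BLIP-2

for res in (224, 336, 448):
    n = projector_tokens(res)
    print(f"{res}px: projector={n} tokens, Q-Former={QFORMER_TOKENS} tokens, "
          f"ratio={n / QFORMER_TOKENS:.0f}x")
```

Doubling the side length quadruples the projector's token count, while the Q-Former column stays at 32.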
§8 Flamingo: Perceiver Resampler + Gated Cross-Attn
8.1 Design goals
Alayrac et al. 2022 (NeurIPS) wanted to add visual ability to a frozen 70B LLM without breaking its text ability. Design choices:
- Do not retrain the LLM: fully frozen; insert only new layers in the middle
- Few trainable parameters: cross-attn layers + Perceiver Resampler
- Few-shot interleaved: training data is interleaved sequences of (image, text, image, text, ...)
8.2 Perceiver Resampler
Similar to Q-Former, "use latent queries to compress the image". The Flamingo paper Sec 3.1 pseudocode is multi-layer (each layer = cross-attention + FFN), with the default config approximately $L=6$ layers. Per-layer update:
$$\tilde{\mathbf{q}}^{(\ell)} = \mathbf{q}^{(\ell)} + \text{CrossAttn}\!\left(\mathbf{q}^{(\ell)},\ [\mathbf{q}^{(\ell)};\ \mathbf{Z}],\ [\mathbf{q}^{(\ell)};\ \mathbf{Z}]\right), \quad \mathbf{q}^{(\ell+1)} = \tilde{\mathbf{q}}^{(\ell)} + \text{FFN}(\tilde{\mathbf{q}}^{(\ell)})$$
Note K/V is concat(query, image_feat), not just image_feat — so query tokens can also attend to each other. Overall still lighter than BLIP-2 Q-Former (12 layers + self-attn + cross-every-2). Output 64 latent visual tokens (independent of the input patch count).
8.3 Gated Cross-Attention (the core innovation)
Insert a new cross-attention module into the LLM every $k$ layers (e.g. every 4):
$$\mathbf{h}'_\ell = \mathbf{h}_\ell + \tanh(\alpha_\text{attn}) \cdot \text{CrossAttn}(\mathbf{h}_\ell, \mathbf{q}_\text{out}, \mathbf{q}_\text{out})$$ $$\mathbf{h}''_\ell = \mathbf{h}'_\ell + \tanh(\alpha_\text{ffn}) \cdot \text{FFN}(\mathbf{h}'_\ell)$$
Key: $\alpha_\text{attn}, \alpha_\text{ffn}$ are learnable scalars, initialized to 0. So $\tanh(0)=0$, meaning the added cross-attn contributes zero to the LLM output at init — the LLM behaves identically to the unmodified frozen LLM with no visual module. During training, $\alpha$ gradually learns nonzero values and visual information starts flowing in.
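The zero-init property can be checked numerically. The sketch below applies the gated residual $\mathbf{h} + \tanh(\alpha)\cdot\text{branch}$ to plain Python lists; `gated_residual` and the sample values are illustrative, not taken from the Flamingo code.

```python
import math


def gated_residual(h, branch_out, alpha):
    """Compute h + tanh(alpha) * branch_out element-wise on plain lists."""
    gate = math.tanh(alpha)
    return [x + gate * b for x, b in zip(h, branch_out)]


h = [0.3, -1.2, 0.7]        # hidden state coming from the frozen LLM layer
xattn = [5.0, 5.0, -5.0]    # hypothetical cross-attention output

at_init = gated_residual(h, xattn, alpha=0.0)   # gate = tanh(0) = 0
opened = gated_residual(h, xattn, alpha=0.5)    # gate = tanh(0.5), about 0.46

print(at_init == h)   # the new branch is invisible at initialization
print(opened)
```

However large the branch output is, it cannot perturb the frozen LLM until training moves $\alpha$ away from 0.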
Llama-3.2-Vision (Meta 2024) follows the same recipe: a frozen Llama-3 language model plus a trainable gated cross-attn adapter. Pros: fully preserves text-only performance; cons: the visual capability ceiling is lower than that of approaches that fine-tune the LLM (LLaVA / Qwen-VL).
8.4 Flamingo / Llama-3.2-V vs LLaVA comparison
| Aspect | Flamingo / Llama-3.2-V | LLaVA / Qwen-VL |
|---|---|---|
| LLM unfrozen? | No (frozen) | Yes (unfrozen in stage 2) |
| Image as tokens? | No (as KV) | Yes (as tokens) |
| Text-only ability preserved | ✅ Fully | ⚠️ May regress slightly |
| Visual understanding ceiling | Limited by cross-attn capacity | Higher (LLM can "think about" the image) |
| Training data | interleaved | image-instruction pairs |
| Applicable | Large LLM + no retraining | Small/medium LLM + vision-centric |
§9 CogVLM and "visual experts" / cross-attn fusion variants
9.1 CogVLM: a visual expert branch
Wang et al. 2023 (CogVLM)'s core idea: in the LLM's attention / FFN, replicate a parallel branch for visual tokens, sharing the attention computation with the original text branch but using different projections:
                        attention
             ┌──────────────┴──────────────┐
             ↓                             ↓
 text projection (frozen)    vision expert projection (trainable)
             │                             │
             └──────────────┬──────────────┘
                            ↓
      token-wise route: if visual_token, use vision branch
- MoE-like dual experts: text tokens go through the text branch, image tokens go through the "visual expert" branch
- Visual expert trained alone: text branch can be frozen, with visual ability coming from the expert
- Preserves the LLM's native text ability (same idea as Flamingo, but routing at the token rather than the layer level)
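The token-wise routing can be sketched in a few lines. This is a toy illustration of the idea, not CogVLM's implementation: `expert_project`, the 2×2 matrices, and the token values are all hypothetical, and a real model would apply this to the Q/K/V and FFN weights before running shared attention over all tokens.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expert_project(tokens, is_visual, W_text, W_vision):
    """Route each token: image tokens use the vision expert, text tokens the original weights."""
    return [matvec(W_vision if vis else W_text, x) for x, vis in zip(tokens, is_visual)]


W_text = [[1.0, 0.0], [0.0, 1.0]]      # stands in for the frozen LLM projection
W_vision = [[2.0, 0.0], [0.0, 2.0]]    # stands in for the trainable visual expert
tokens = [[1.0, 1.0], [0.5, -0.5], [3.0, 0.0]]
is_visual = [True, True, False]

projected = expert_project(tokens, is_visual, W_text, W_vision)
print(projected)
```

Text tokens never touch `W_vision`, which is why training only the expert leaves text-only behavior unchanged.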
9.2 Llama-3.2 Vision: Flamingo-style cross-attn revived on a large LLM
Meta released Llama-3.2-V (11B / 90B) in September 2024:
- Frozen Llama-3 backbone
- Adds separate cross-attention layers (not modifying self-attn)
- Adapter-style training, only the cross-attn layers are trained
- Designed as a "robust base for long-tail visual tasks"; the visual ceiling is slightly below GPT-4V, but text-only benchmarks are nearly identical to the text-only Llama-3
9.3 Claude 3.5/3.7 Sonnet Vision and GPT-4V/4o
The architectures of Anthropic's and OpenAI's closed models are undisclosed; the points below are inferred from public statements and API behavior:
- GPT-4V (2023.09) / GPT-4o (2024.05): 4o is natively multimodal, jointly trained from the ground up over (image + text + audio), not a LLaVA-style post-hoc projector
- Claude 3.5/3.7 Sonnet (2024-2025): the API accepts high-resolution images (up to 8000×8000 pixels; any internal resizing or tiling strategy is not publicly specified), with document understanding (PDF/screenshot) being a selling point
- Common feature: handles multi-page documents / screenshots / OCR of math formulas — far beyond the LLaVA family. This hints at significant optimization in training-data scale (document corpus) + tiling strategy.
§10 Qwen2-VL / DeepSeek-VL: dynamic resolution + M-RoPE
10.1 Native dynamic resolution
Qwen2-VL (Wang et al. 2024), DeepSeek-VL (Lu et al. 2024), and InternVL-2 all abandon the "resize to fixed 224²" tradition:
- Preserve original aspect ratio: resize images to the largest size near the original that is an integer multiple of patch_size
- Dynamic patch count: a $1024 \times 768$ image at $P=14$ is floored to $1022 \times 756$ and slices into $73 \times 54 = 3942$ patches
- No fixed pos embed table: must use an extensible positional encoding (RoPE or 2D ALiBi-like)
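The patch arithmetic above can be reproduced with a short helper. This is a simplified sketch that only floors each side to a multiple of the patch size; the released preprocessing pipelines add further steps (pixel-budget bounds, merging of neighboring patch tokens) that are omitted here, and `patch_grid` is an illustrative name.

```python
def patch_grid(width: int, height: int, patch: int = 14):
    """Floor each side to a multiple of `patch`; return (cols, rows, n_patches)."""
    cols, rows = width // patch, height // patch
    if cols == 0 or rows == 0:
        raise ValueError("image is smaller than one patch")
    return cols, rows, cols * rows


cols, rows, n = patch_grid(1024, 768)
print(f"resized to {cols * 14}x{rows * 14}, grid {cols}x{rows}, {n} patches")
```

Because the grid shape now varies per image, a fixed-size learned position table no longer fits, which motivates the RoPE-style encodings discussed next.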
10.2 M-RoPE (Multimodal RoPE)
Qwen2-VL's core innovation. Recap of ordinary 1D RoPE: treat each pair $(2k, 2k+1)$ of query / key dimensions as a complex number, multiplied by a position-dependent rotation:
$$\mathbf{R}_{m,k} = \begin{pmatrix} \cos(m\theta_k) & -\sin(m\theta_k) \\ \sin(m\theta_k) & \cos(m\theta_k) \end{pmatrix}, \quad \theta_k = 10000^{-2k/d}$$
After applying the rotation to both $\mathbf{q}_m$ and $\mathbf{k}_n$, the score $\mathbf{q}_m^\top \mathbf{k}_n$ depends only on $m - n$ (relative position).
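The relative-position property is easy to verify on a single dimension pair. The sketch below rotates one 2D pair of $\mathbf{q}$ and $\mathbf{k}$ with the matrix $\mathbf{R}_{m,k}$ above and compares scores at equal and unequal offsets; the vectors and the value of `theta` are arbitrary illustrative choices.

```python
import math


def rotate_pair(x, pos, theta):
    """Rotate one (2k, 2k+1) pair by the angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x[0] * c - x[1] * s, x[0] * s + x[1] * c)


def score(q, k, m, n, theta):
    """Dot product of q rotated to position m and k rotated to position n."""
    qm, kn = rotate_pair(q, m, theta), rotate_pair(k, n, theta)
    return qm[0] * kn[0] + qm[1] * kn[1]


q, k, theta = (0.8, -0.3), (0.1, 0.9), 0.05
a = score(q, k, m=5, n=3, theta=theta)
b = score(q, k, m=105, n=103, theta=theta)   # same offset m - n = 2
c = score(q, k, m=5, n=4, theta=theta)       # different offset m - n = 1

print(abs(a - b) < 1e-9)   # identical up to floating-point error
print(abs(a - c) > 1e-6)   # a different offset gives a different score
```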
M-RoPE's extension: a visual token has three position dimensions (t, h, w). All head_dim dimensions rotate — but each pair $(2k, 2k+1)$ uses one of the three position ids (t / h / w) for its rotation angle, depending on which segment it falls into:
$$(\cos(m_\text{axis}\,\theta_k),\ \sin(m_\text{axis}\,\theta_k)), \quad \text{axis} \in \{t, h, w\}$$
Specifically, Qwen2-VL sets mrope_section = [16, 24, 24]. The unit is pairs within half of head_dim, i.e. each number counts how many $(2k, 2k+1)$ pairs an axis receives. One pair = 2 real dimensions, so "section sum × 2 = head_dim".
That is, the three axes (t, h, w) occupy 16 / 24 / 24 dim pairs respectively; total $(16+24+24) \times 2 = 128 = $ head_dim. The implementation doubles the section to $[16, 24, 24, 16, 24, 24]$ to slice head_dim, with position ids of (t, h, w, t, h, w) used for rotation; all 128 dims rotate, none "left unrotated". Spatial dims (h, w) occupy 48 pairs > the temporal dim (t)'s 16 pairs, reflecting that inter-frame changes in video are slow while spatial content changes within a frame are dramatic.
A text token has no explicit (h, w): Qwen2-VL sets $m_t = m_h = m_w$ equal to that text token's 1D position id, so all three axes yield the same rotation angle, equivalent to ordinary 1D RoPE.
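How the three position ids are assigned across a mixed sequence can be sketched as follows. This is a simplified illustration under stated assumptions: the image grid is taken to be the final token grid, a still image keeps a constant t, and each new segment starts at the previous segment's maximum position id plus one (as described in the Qwen2-VL paper). The function name and segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """
    segments: list of ("text", n_tokens) or ("image", (rows, cols)) entries.
    Returns one (t, h, w) triple per token.
    """
    ids, start = [], 0
    for kind, spec in segments:
        if kind == "text":
            # Text: all three axes share the 1D position id
            seg = [(start + i,) * 3 for i in range(spec)]
        elif kind == "image":
            rows, cols = spec
            # Image: t is constant, h / w follow the patch grid
            seg = [(start, start + r, start + c) for r in range(rows) for c in range(cols)]
        else:
            raise ValueError(f"unknown segment kind: {kind}")
        ids.extend(seg)
        if seg:
            start = max(max(p) for p in seg) + 1
    return ids


ids = mrope_position_ids([("text", 2), ("image", (2, 2)), ("text", 1)])
for triple in ids:
    print(triple)
```

The stacked ids form the `positions` tensor of shape [3, B, L] consumed by `build_mrope_cos_sin` in §10.5.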
10.3 Qwen2.5-VL upgrades
Qwen2.5-VL (Bai et al. 2025), on top of Qwen2-VL:
- Absolute time encoding: the M-RoPE t dim switches to real timestamps (seconds), not frame indices, supporting arbitrary FPS videos
- Dynamic visual token budget: tune token count by task complexity
- Agent / GUI capability: add web screenshots / mobile UI operation traces to training data
10.4 DeepSeek-VL / VL2: high-resolution tiling + hybrid encoder
DeepSeek-VL (Lu et al. 2024) uses dual vision encoders:
- SigLIP: for global semantics (low resolution)
- SAM-B (Segment Anything backbone): for high-resolution detail
The two features are concatenated and fed to projector + LLM. DeepSeek-VL2 (2024.12) further replaces the LLM with MoE + dynamic resolution; a single image can use 1700+ visual tokens.
10.5 Code: M-RoPE three-dim positional embedding (core 50 lines, matching Qwen2-VL's HF implementation)
import torch


def build_mrope_cos_sin(positions, head_dim, mrope_section=(16, 24, 24), base=10000.0):
    """
    Build cos/sin tensors for Qwen2-VL style M-RoPE.
    positions: LongTensor [3, B, L] (axis 0: t / h / w; B batch; L seq len)
    head_dim: per-head dim (must equal 2 * sum(mrope_section))
    mrope_section: tuple of 3 ints; each = number of (half-dim) entries per axis
    Returns: cos, sin both [B, L, head_dim], ready for LLaMA-style rotate_half.
    """
    assert 2 * sum(mrope_section) == head_dim, "2 * sum(mrope_section) must = head_dim"
    half = head_dim // 2  # = sum(mrope_section)
    # Standard RoPE frequencies: θ_k = base^{-2k/head_dim}, k = 0..half-1
    inv_freq = 1.0 / (base ** (torch.arange(0, half).float() * 2 / head_dim))  # [half]
    inv_freq = inv_freq.to(positions.device)
    # For each axis, compute angles / cos / sin of shape [B, L, half]
    cos_axes, sin_axes = [], []
    for a in range(3):
        ang = positions[a].float().unsqueeze(-1) * inv_freq  # [B, L, half]
        cos_axes.append(ang.cos())
        sin_axes.append(ang.sin())
    # Slice half-dim into 3 segments by mrope_section; pick cos/sin for t/h/w
    cos_chunks, sin_chunks = [], []
    offset = 0
    for axis, s in enumerate(mrope_section):
        cos_chunks.append(cos_axes[axis][..., offset:offset+s])  # [B, L, s]
        sin_chunks.append(sin_axes[axis][..., offset:offset+s])
        offset += s
    cos_half = torch.cat(cos_chunks, dim=-1)  # [B, L, half]
    sin_half = torch.cat(sin_chunks, dim=-1)
    # LLaMA-RoPE style: duplicate to full head_dim
    cos = torch.cat([cos_half, cos_half], dim=-1)  # [B, L, head_dim]
    sin = torch.cat([sin_half, sin_half], dim=-1)
    return cos, sin


def rotate_half(x):
    """(x1, x2) -> (-x2, x1), LLaMA convention."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_mrope(q, k, cos, sin):
    """
    q, k: [B, num_heads, L, head_dim]
    cos, sin: [B, L, head_dim]
    """
    cos = cos.unsqueeze(1)  # broadcast over heads
    sin = sin.unsqueeze(1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
Easy traps:
- "head_dim split into t/h/w three independent 1D RoPE segments": wrong. Qwen2-VL in fact uses 6 alternating segments [s_t, s_h, s_w, s_t, s_h, s_w], with the full head_dim rotated
- "section units are dims": wrong. mrope_section=[16,24,24] units are pairs (each pair = 2 head_dim elements), $\sum \times 2 = $ head_dim = 128
- "What if a text token has no (h, w)?": set $m_t = m_h = m_w$ to the text's 1D position id; the three axes give the same rotation angle, degenerating to 1D RoPE
§11 Video VLM: LongVA / VideoLLaMA / long-video problem
11.1 Basic pipeline
Video = multi-frame image. Common VLM approach to video:
- Uniformly sample $K$ frames (e.g. 8 / 16 / 32)
- Each frame through the vision encoder → $N$ patch tokens per frame
- Token sequence concat: feed $K \cdot N$ visual tokens to the LLM
Problem: $K=32, N=576 \Rightarrow 18432$ tokens — beyond the context of an LLM that was instruction-tuned on single images.
11.2 Common compression strategies
- Time pooling: average pool adjacent frame tokens (VideoChat / VideoLLaMA)
- Q-Former resampler: per-frame query tokens compress to 32 (Video-BLIP)
- Token merge: merge similar tokens across frames (VideoLLaMA 2)
- Spatial pooling + temporal preservation: pool per-frame patch tokens to $H' \times W'$, retaining all frames (LLaVA-NeXT-Video)
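The token budgets behind these strategies can be compared with a small calculator. This is a rough sketch: `video_tokens` and its parameters are illustrative, and it models only uniform spatial pooling and a fixed per-frame resampler, not token merging or temporal pooling.

```python
def video_tokens(frames: int, side: int, pool: int = 1, per_frame_queries=None) -> int:
    """
    frames: number of sampled frames K
    side: patch-grid side per frame (24 for a 336px ViT-L/14 frame)
    pool: spatial pooling stride applied to the side x side grid
    per_frame_queries: if set, a resampler emits this many tokens per frame instead
    """
    if per_frame_queries is not None:
        return frames * per_frame_queries
    if side % pool != 0:
        raise ValueError("side must be divisible by pool")
    return frames * (side // pool) ** 2


print(video_tokens(32, 24))                        # no compression: 32 * 576
print(video_tokens(32, 24, pool=2))                # 2x2 spatial pooling: 32 * 144
print(video_tokens(32, 24, per_frame_queries=32))  # Q-Former style: 32 * 32
```

Going from raw concatenation to 32 queries per frame cuts the sequence by a factor of 18 at the cost of per-frame detail.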
11.3 LongVA / Long-context video
LongVA (Zhang et al. 2024) and others exploit long-context LLMs (200K+ tokens) to directly consume long video unrolled into a token sequence, targeting hour-long video QA. The Qwen-VL line takes a related route, pairing long context with the M-RoPE temporal dim: Qwen2-VL reports handling 20-minute video; Qwen2.5-VL pushes to 1+ hour.
11.4 Video benchmarks
- MVBench (Li et al. 2024): 20 fine-grained video understanding tasks
- Video-MME (Fu et al. arXiv 2024 / CVPR 2025): 900 videos covering 3 duration tiers (short / medium / long) and 6 primary visual domains
- EgoSchema (Mangalam et al. 2023 NeurIPS): first-person long video
- LongVideoBench (Wu et al. 2024 NeurIPS): from 8 seconds to 1 hour
§12 Training pipeline: alignment / instruct / preference
12.1 Stage 1: Alignment / Pre-training
Goal: align visual features to be near the LLM token space.
- Data: image-caption pairs (CC3M, the LAION/CC/SBU 558K subset used by LLaVA-1.5, ShareGPT4V)
- Training: unfreeze projector only (LLaVA) / Q-Former (BLIP-2); vision tower + LLM frozen
- Loss: next-token prediction (LLM uses visual features as prefix to generate caption)
12.2 Stage 2: Visual instruction tuning
Goal: teach the VLM "to look at images, answer questions, follow instructions".
- Data: GPT-4-generated visual instructions (LLaVA-Instruct-158K, ShareGPT4V) + academic VQA data (VQAv2, GQA, OCR-VQA, TextVQA)
- Training: unfreeze LLM + projector; vision tower typically stays frozen (LLaVA-1.5 / Qwen-VL). Qwen2-VL unfreezes the vision tower for end-to-end fine-tuning in the final stage
- Loss: next-token prediction on answer tokens (input image + question are not counted in loss)
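Masking the prompt out of the loss is usually done through the label tensor. Below is a minimal sketch, assuming a Hugging Face style causal LM that shifts labels internally and ignores positions labeled -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The token ids are made up, and `build_labels` is an illustrative name.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; supervise only the answer tokens."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Hypothetical ids: image placeholder tokens + question, then the answer
prompt = [101, 32000, 32000, 32000, 2054, 2003]
answer = [1037, 4937, 102]

input_ids, labels = build_labels(prompt, answer)
print(input_ids)
print(labels)
```

In multi-turn data the same masking is applied per turn, so every user and image span is ignored and every assistant span is supervised.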
12.3 Stage 3: Preference / RLHF
Goal: reduce hallucinations, improve helpfulness / harmlessness, and align to long-tail tasks.
| Method | Time | Core |
|---|---|---|
| LLaVA-RLHF | 2023.09, Sun et al. | PPO + human preference + hallucination-aware reward |
| RLAIF-V | 2024, Yu et al. | AI feedback in place of human labels; divide-and-conquer |
| POVID | 2024 | DPO + deliberately constructed hallucination negatives |
| VLM-R1 | 2025 | GRPO + verifiable reward (R1-style for visual reasoning) |
| Bespoke / R1-Onevision | 2025 | Visual chain-of-thought + RL refinement |
VLM-R1 ports DeepSeek-R1's "verifiable reward + GRPO" recipe to visual tasks (e.g. ScienceQA, MMMU). Reward comes from whether the answer matches the ground truth (no process reward model); the reported effect is a noticeably longer visual reasoning chain and improved benchmark scores.
12.4 Data scale vs stage
| Stage | Data volume | Training tokens | Unfrozen modules |
|---|---|---|---|
| Alignment | 0.5–5M captions | 1–10B | projector |
| Instruction tune | 0.2–10M instructions | 1–50B | LLM + projector |
| Preference | 50k–500k preference pairs | 100M–1B | LLM (LoRA / full) |
§13 Multimodal Embeddings: BGE-VL / Jina-CLIP / VLM2Vec
13.1 Why a new generation of multimodal embeddings?
CLIP is trained for "image ↔ short caption" alignment and performs poorly on long instruction retrieval / multi-image / interleaved document retrieval. The new generation of multimodal embedding models targets retrieval / RAG scenarios.
13.2 Representative methods
- Jina-CLIP-v1 (2024): adds long-text contrastive (text-text task) + multi-resolution on top of CLIP; a single embedding model does image-text and text-text retrieval
- BGE-VL (2024): the BGE team's multimodal version; uses SigLIP-So400M + a small LLM for instruction-aware retrieval
- VLM2Vec (Jiang et al. 2024): instruction-conditioned mean pool of the last hidden state of a VLM (LLaVA / Qwen-VL); trains only a contrastive head, significantly beating CLIP on the MMEB benchmark
- mmE5 (2024): multimodal version of E5, supporting 12 types of retrieval tasks
13.3 Core tricks
- Instruction-aware: embedding input is [instruction][image][query], so the same image yields different embeddings under different tasks
- VLM-as-encoder: directly use an instruct-tuned VLM as backbone, no separate contrastive pretrain
- Hard negative mining: use a cross-encoder reranker to mine hard negatives outside the batch
§14 25 frequently-asked interview questions (L1 must-know · L2 advanced · L3 top labs)
L1 basics (must-know 10)
Q1. What is the CLIP loss? Why must it be symmetric?
- CLIP uses symmetric InfoNCE: $\mathcal{L} = \tfrac12(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$
- $\mathcal{L}_{i\to t}$: apply row softmax on the similarity matrix $\mathbf{S}$, take NLL of the diagonal (image retrieves text)
- $\mathcal{L}_{t\to i}$: column softmax, NLL of the diagonal (text retrieves image)
- Necessity of symmetry: one direction constrains only one retrieval direction; symmetrization lets image / text embeddings constrain each other, preventing "one-sided collapse" — e.g. image-side clusters but text-side drifts
Pitfall: answering only "InfoNCE" without saying "average of two softmaxes", or saying "do it backward once" without explaining why it is necessary.
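A dependency-free sketch of the symmetric loss makes the "row softmax + column softmax" structure explicit. It operates on a small similarity matrix given as nested lists; `clip_loss` and the example matrix are illustrative, and a fixed $\tau = 0.07$ is assumed for simplicity.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(sim, tau=0.07):
    """sim[i][j] = cosine similarity of image i and text j. Returns (total, i2t, t2i)."""
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    cols = [list(col) for col in zip(*logits)]  # transpose for the text->image direction
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i), loss_i2t, loss_t2i


# Image 0 has two hard negative texts (same row, different columns)
sim = [[0.9, 0.85, 0.85],
       [0.1, 0.9, 0.1],
       [0.1, 0.1, 0.9]]
total, i2t, t2i = clip_loss(sim)
print(f"i2t={i2t:.4f}  t2i={t2i:.4f}  symmetric={total:.4f}")
```

The two directions disagree on this matrix because the hard negatives share one row but sit in different columns, which is exactly the information a one-directional loss would miss.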
Q2. What does the temperature τ in CLIP do? Why make it learnable?
- Temperature $\tau$ controls softmax sharpness: $\tau \to 0$ approaches one-hot, focusing on the hardest negative; $\tau \to \infty$ becomes uniform with no gradient
- OpenAI CLIP makes $\tau$ learnable (concretely parameterized as logit_scale = log(1/τ), more stable in backprop)
- Learned steady state $\tau \approx 0.01$ (logit_scale ≈ log(100)), clamped from above to avoid collapse
- Without making it learnable: hyperparameter-sensitive; each new data / model scale needs manual tuning
Pitfall: treating τ as a fixed 0.07 with no learning; or writing it as 1/exp(logit_scale) with unstable backprop.
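The log-parameterization and the clamp can be sketched without any framework. The example assumes the initial value $\tau = 0.07$ and the cap of 100 on the multiplier $1/\tau$; `effective_scale` and the sample similarities are illustrative.

```python
import math

MAX_SCALE = 100.0  # cap on the multiplier 1/tau applied to cosine similarities


def effective_scale(logit_scale: float) -> float:
    """Map the learnable log-parameter to the clamped multiplier exp(logit_scale)."""
    return min(math.exp(logit_scale), MAX_SCALE)


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


sims = [0.30, 0.25, 0.10]                 # cosine similarities; index 0 is the positive
init_scale = effective_scale(math.log(1.0 / 0.07))
late_scale = effective_scale(10.0)        # exp(10) would be far above the cap

p_init = softmax([init_scale * s for s in sims])
p_late = softmax([late_scale * s for s in sims])
print(f"scale={init_scale:.2f}: p(positive)={p_init[0]:.3f}")
print(f"scale={late_scale:.2f}: p(positive)={p_late[0]:.3f}")
```

The same similarities give a much sharper distribution at the larger scale, which is why an unclamped scale can destabilize training.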
Q3. Core change of SigLIP vs CLIP? Why is batch size no longer sensitive?
- Replace softmax InfoNCE with sigmoid binary CE: each $(i,j)$ pair is independently classified
- $S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$; label $y_{ij} = +1 (i=j) / -1 (i\neq j)$; loss = $-\sum_{ij} \log\sigma(y_{ij} S_{ij})/N$
- Batch decoupled: each loss term is independent of row/col normalization, so doubling N only changes the negative count, not the loss landscape
- Engineering gains: single-machine large batch, simpler cross-node communication, and small batches can also learn (CLIP barely converges with small batches)
Pitfall: saying only "use sigmoid" without explaining batch-independence; or treating SigLIP as "CLIP with bias".
Q4. Is the [CLS] token in ViT mandatory?
- No. The original ViT uses [CLS] to align with BERT conventions
- Alternative: mean-pool all patch tokens as the image representation — most modern ViTs (DeiT-III, SigLIP, EVA-CLIP) use mean-pool or attentive pool
- CLIP uses [CLS]: contrastive training requires a single vector
- VLM vision towers usually drop [CLS]: LLaVA takes second-to-last layer patch tokens; [CLS] is not needed on the LLM side
Pitfall: treating [CLS] as a "mandatory component" of ViT; or saying "without [CLS] you cannot classify" (wrong; mean-pool also works).
Q5. What is LLaVA's projector? Why MLP instead of Linear?
- The projector maps visual encoder output ($d_v$=1024) to LLM token space ($d_\text{llm}$=4096)
- LLaVA-1.0: single Linear(1024, 4096); LLaVA-1.5: 2-layer MLP + GELU
- MLP adds non-linear expressivity, letting visual features map more flexibly into the LLM's "vocabulary"
- The paper reports MLP improves MM-Vet / SEED-Bench by 1–3 points over single Linear
Pitfall: saying only "project with a Linear"; or answering "use Q-Former" (that is BLIP-2, not LLaVA).
Q6. What do LLaVA's two training stages do?
- Stage 1 Feature Alignment: train the projector only, freeze vision tower + LLM, using caption data (CC3M / the LAION-CC-SBU 558K subset) to project visual features near the LLM embedding space
- Stage 2 Instruction Tuning: unfreeze LLM + projector (vision tower still frozen), using GPT-4-generated 158K visual instructions, teaching the LLM to follow visual instructions
- Why not train in one shot: jumping directly to stage 2 risks catastrophic forgetting of text ability; stage 1 first gives visual tokens a "near text-token" initialization, then instruction tuning is more stable
Pitfall: saying both stages "train the projector"; or omitting that stage 1 freezes the LLM, the key point.
Q7. What is a Q-Former? Pros/cons vs the LLaVA projector?
- BLIP-2's Q-Former: a 12-layer Transformer with 32 learnable query tokens that read information from a frozen image encoder via cross-attention, outputting a fixed 32 visual tokens
- Pros: fixed visual token count, low LLM context usage, compute budget stays the same as resolution grows
- Cons: large information loss (256 patches compressed to 32), more parameters (~180M), complex training (two stages: stage 1 representation learning includes joint ITC+ITM+ITG, stage 2 connects to frozen LLM for generation)
- 2024 mainstream returns to projector: Qwen-VL / LLaVA-NeXT / InternVL-2 all use projector
Pitfall: treating Q-Former as a synonym for projector; or not knowing modern VLMs prefer projectors.
Q8. How many patches / tokens does a ViT-L/14 yield on a 224×224 image?
- Patch count = $(224/14)^2 = 16^2 = 256$
- Token count = 256 + 1 (with [CLS]) = 257
- If LLaVA-style second-to-last layer patch tokens (drop [CLS]) = 256 visual tokens
- If resolution is 336 (LLaVA-1.5): $(336/14)^2 = 24^2 = 576$ tokens
Pitfall: miscalculating $(H/P)^2$ (treating $P^2$ as the patch count $N$); forgetting [CLS].
Q9. Why is CLIP poor at OCR / counting / spatial relations?
- Poor OCR: captions typically describe scenes, not text inside images; CLIP has no pixel-level OCR supervision
- Poor counting: captions rarely report exact counts ("how many birds" is usually "a flock of birds"); the embedding space does not preserve a counting signal
- Poor spatial relations: "cat on top of dog" and "dog on top of cat" look almost identical under bag-of-words; Yuksekgonul et al. 2023 (ICLR) quantify this with the ARO benchmark
- Mitigation directions: DETR-style local alignment, SigLIP-2's dense local objectives, document-level data
Pitfall: blaming OCR weakness on "resolution too low" (partly right, but the root cause is data + loss); claiming "CLIP is bag-of-words" too absolutely.
Q10. Why is the vision tower generally frozen when training a VLM?
- The vision tower (e.g. CLIP ViT-L) is already pre-trained well on its own data; unfreezing easily damages visual feature quality
- Training data is far smaller than CLIP pretraining (millions vs billions); unfreezing easily overfits
- Freezing also saves memory: hundreds of millions of vision tower params don't need optimizer state
- Qwen2-VL exception: in the final stage it unfreezes the vision tower for small-LR fine-tuning, paired with large mixed data to avoid forgetting
Pitfall: answering "cannot unfreeze" directly — wrong; late stages can carefully unfreeze.
L2 advanced (10 questions)
Q11. Why is SigLIP's bias $b$ initialized to $-10$?
- Early in training, embeddings are near random, $\mathbf{u}^\top \mathbf{v} \approx 0$, sigmoid outputs 0.5
- In the N×N matrix, negatives are $N^2 - N \approx N^2$, positives only $N$; if initial predictions are all 0.5, negative gradients dominate and positives get no useful signal
- Init $b \approx -10$ → $\sigma(b) \approx 4.5 \times 10^{-5}$ → all entries initially predicted as negative
- This way negatives have almost no loss, positives have large loss (predicted as negative but truly positive), gradient pulls positives in, training is stable
Pitfall: saying "avoid numerical issues"; or answering "for the symmetric term" (wrong; the bias is not a symmetric loss term).
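The effect of the bias init can be checked by splitting the sigmoid loss into its positive and negative parts at initialization. The sketch assumes all similarities are exactly 0 (random embeddings) and uses $t = 10$ as a placeholder scale, which does not matter when the similarities are 0; `siglip_loss_split` is an illustrative name.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def siglip_loss_split(sim, t, b):
    """Return (positive_part, negative_part) of the sigmoid loss, each divided by N."""
    n = len(sim)
    pos = neg = 0.0
    for i in range(n):
        for j in range(n):
            y = 1.0 if i == j else -1.0
            term = softplus(-y * (t * sim[i][j] + b))
            if i == j:
                pos += term
            else:
                neg += term
    return pos / n, neg / n


n = 256
sim = [[0.0] * n for _ in range(n)]
for b in (0.0, -10.0):
    pos, neg = siglip_loss_split(sim, t=10.0, b=b)
    print(f"b={b:6.1f}: positive part={pos:.4f}, negative part={neg:.4f}")
```

With $b = 0$ the $N^2 - N$ negatives carry almost all of the loss; with $b = -10$ the balance flips and the positives dominate the early gradient signal.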
Q12. Why is Flamingo's gated cross-attn initialized to 0?
- The new cross-attn output is multiplied by $\tanh(\alpha)$, with $\alpha$ initialized to 0
- $\tanh(0) = 0$, so at init the new module contributes zero to the frozen LLM; the LLM behaves identically to the original text-only LM without the visual module
- During training $\alpha$ slowly grows from 0, and visual signals are gradually injected
- Pros: fully preserves the frozen LLM's text-only ability; cons: visual ceiling limited by cross-attn capacity
- Llama-3.2 Vision uses the same design
Pitfall: treating the 0 init as just a "common init trick"; or not realizing it concerns frozen-LLM capability preservation.
Q13. How is LLaVA-1.6 / NeXT's AnyRes implemented?
- The ViT accepts a fixed 336² input, so a high-res image is sliced by aspect ratio into $n \times m$ tiles of 336² each (e.g. $2\times 2, 2\times 3$)
- Each tile passes through the frozen ViT to get 576 tokens; add a global thumbnail (whole image resized to 336² and encoded)
- Concatenate: $(1 + n\cdot m) \times 576$ visual tokens to the LLM
- Choice of slicing: pick the grid (from a predefined set such as $\{1\times 1, 2\times 2, 1\times 4, 4\times 1, ...\}$) closest to the original aspect ratio
Pitfall: treating AnyRes as a synonym for dynamic resolution — technically different. Qwen2-VL is native dynamic (patch count fully free), while LLaVA-1.6 is fixed-tile composition.
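The grid choice and the resulting token count can be sketched as follows. This is a simplified illustration that selects the grid purely by aspect-ratio distance, as described above; the candidate set `GRIDS` is hypothetical, and the released LLaVA-NeXT selection logic may weigh other factors.

```python
import math

TILE_TOKENS = 576  # 24 x 24 patches per 336px tile with ViT-L/14


def pick_grid(width, height, grids):
    """Pick the (rows, cols) grid whose aspect ratio is closest to the image's."""
    target = math.log(width / height)
    return min(grids, key=lambda g: abs(math.log(g[1] / g[0]) - target))


def anyres_tokens(width, height, grids):
    """Visual tokens = (1 global thumbnail + rows * cols tiles) * 576."""
    rows, cols = pick_grid(width, height, grids)
    return (1 + rows * cols) * TILE_TOKENS


GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # hypothetical candidate set

for w, h in [(672, 672), (1008, 336), (336, 672)]:
    print((w, h), pick_grid(w, h, GRIDS), anyres_tokens(w, h, GRIDS))
```

Comparing ratios in log space treats "twice as wide" and "twice as tall" symmetrically.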
Q14. Why is Qwen2-VL's M-RoPE three-dim allocation not uniform?
- Qwen2-VL mrope_section = [16, 24, 24] (units are pairs of half head_dim), $\sum \times 2 = $ head_dim = 128
- All 128 dims rotate: different dim pairs use different axes (t/h/w) and their position ids to compute rotation angles
Reason for non-uniform allocation:
- Inter-frame variation in video is slow (adjacent frames are very similar), so $s_t = 16$ has a small share
- Intra-frame patch variation is dramatic (large visual differences across positions in a single frame), so $s_h = s_w = 24$ each need broader frequency coverage
- $s_h = s_w$: the image H/W axes are symmetric
Pitfall: treating section as "number of dims" (wrong; the unit is pairs = head_dim / 2 allocation); or thinking "the remaining dims do not rotate".
Q15. Why does BLIP-2 choose 32 query tokens?
- 32 is an empirical value, balancing LLM context usage vs information capacity
- Too few (< 16): large information loss, hurting VQA / detail tasks
- Too many (> 64): large LLM context usage, expensive Q-Former cross-attn computation
- The BLIP-2 paper's ablations show 32 is the sweet spot on most downstream tasks
- Conceptually similar to Perceiver (also uses latent queries to compress inputs)
Pitfall: answering only "empirical"; not realizing it is an engineering trade-off between context budget and information capacity.
Q16. How does CLIP compute InfoNCE under DDP?
- Local batch on each GPU is $N_\text{local}$; total batch over $K$ GPUs is $N = K \cdot N_\text{local}$
- After forward on each GPU, dist.all_gather fetches everyone's image / text feats
- Compute the global similarity $\mathbf{S} \in \mathbb{R}^{N \times N}$
- But backward only lets this GPU's $N_\text{local}$ rows / columns contribute gradients (to avoid duplicate backward)
- This is OpenCLIP's local_loss=True option
Pitfall: saying only "all-gather"; not knowing backward needs to avoid duplicate computation; or thinking backward also does an all-gather (wrong; backward flows along the communication backward path).
Q17. Why is Llama-3.2-V's visual ceiling lower than LLaVA / Qwen-VL?
- Llama-3.2-V uses frozen LLM + gated cross-attn adapter; LLM weights are unchanged
- LLaVA / Qwen-VL unfreeze the LLM, so its internal attention can reorganize to handle visual tokens specifically
- The latter can "use self-attention to think about the image"; the former can only passively receive visual signal via cross-attn
- Trade-off: Llama-3.2-V perfectly preserves text ability, LLaVA-Qwen may regress slightly but has a higher visual ceiling
Pitfall: saying only "fewer parameters"; not recognizing this is an architecture-level ceiling difference.
Q18. Why is much visual instruction-tuning data generated by GPT-4?
- Raw caption data (CC3M / LAION) is short, not instruction-style, cannot teach dialog ability
- Human-labeled visual instructions (e.g. VQAv2 questions) are small in scale and stylistically uniform
- GPT-4 + image + caption → generate multi-turn dialog / reasoning tasks / detailed descriptions: this is how LLaVA-Instruct-158K was made
- Prompt engineering controls coverage (three classes: detailed description, conversation, complex reasoning)
Pitfall: answering "lots of data"; not recognizing the key bottleneck is instruction style + diversity.
Q19. What is a typical CLIP / SigLIP training batch size?
- OpenAI CLIP: 32k batch (256 GPUs × ~128/GPU)
- OpenCLIP: up to 90k batch (LAION-2B)
- SigLIP: 32k batch is typically enough; the sigmoid loss makes each (i,j) entry independent and avoids softmax's batch-wide sync; the paper scans up to 256k batch but with diminishing returns
- Why small batches don't work: InfoNCE's MI lower bound $I(U;V) \ge \log N - \mathcal{L}$ tightens with larger N; the number of negatives also controls contrastive difficulty
- After SigLIP decouples batch, small batches improve significantly (1k batch can learn reasonable embeddings)
Pitfall: answering "a few hundred"; or not knowing the theoretical link between batch and InfoNCE.
Q20. What do POPE / Winoground / MMBench / MMMU each evaluate?
- POPE (Li et al. 2023): measures object hallucination — does the VLM claim objects exist that aren't in the image (yes/no binary)
- Winoground (Thrush et al. 2022): measures compositionality / word-order sensitivity — can it distinguish "cat on dog" vs "dog on cat"
- MMBench (Liu et al. 2023): general multimodal evaluation, ~3000 questions covering OCR / object recognition / reasoning, etc.
- MMMU (Yue et al. 2024 CVPR): university-level professional knowledge (math / physics / medicine, etc.), tests multimodal reasoning
- MM-Vet (Yu et al. 2023): integrated evaluation across 6 capabilities (recognition / knowledge / OCR / spatial / language / math)
Pitfall: confusing POPE and MMBench; not knowing Winoground is a "compositionality stress test".
L3 advanced (top labs / research directions, 5 questions)
Q21. Derive CLIP's symmetric InfoNCE = average of row + column softmax, and explain why SigLIP can be batch-independent.
Let batch size $N$ and similarity matrix $S_{ij} = \mathbf{u}_i^\top \mathbf{v}_j / \tau$.
CLIP derivation:
Row-wise softmax, $p_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{ik})}$. Image→Text NLL:
$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_i \log p_{ii} = -\frac{1}{N}\sum_i \log \frac{\exp(S_{ii})}{\sum_j \exp(S_{ij})}$$
Column-wise softmax (Text→Image):
$$\mathcal{L}_{t\to i} = -\frac{1}{N}\sum_j \log \frac{\exp(S_{jj})}{\sum_i \exp(S_{ij})}$$
Symmetric loss: $\mathcal{L} = \tfrac12 (\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$. Notice the gradient with respect to $S_{ij}$:
$$\frac{\partial \mathcal{L}_{i\to t}}{\partial S_{ij}} = \frac{1}{N}(p_{ij} - \delta_{ij})$$
The gradient at each $S_{ij}$ depends on the entire row's softmax normalization $\sum_k \exp(S_{ik})$. So changing N (adding / removing negatives) changes all $p$ values in that row — gradients couple the batch.
SigLIP derivation:
$S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$, $y_{ij} = 2\delta_{ij} - 1$,
$$\mathcal{L}_\text{SigLIP} = \frac{1}{N}\sum_{i,j} \log(1 + \exp(-y_{ij} S_{ij}))$$
Gradient:
$$\frac{\partial \mathcal{L}}{\partial S_{ij}} = \frac{1}{N}\cdot \frac{-y_{ij}}{1 + \exp(y_{ij} S_{ij})} = \frac{1}{N}\cdot (-y_{ij}) \cdot \sigma(-y_{ij} S_{ij})$$
Key: $\partial \mathcal{L} / \partial S_{ij}$ depends only on $S_{ij}$ itself (plus the shared $1/N$ scale), not on other entries. So adding negatives does not change the per-entry gradient term at existing $S_{ij}$: the loss is batch-independent.
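The two gradient formulas can be contrasted numerically. The sketch below evaluates $p_{ij} - \delta_{ij}$ for one row before and after an extra negative joins the batch, and the SigLIP per-entry term $-y\,\sigma(-yS)$ for the same logit. Both omit the shared $1/N$ factor, and the logit values are arbitrary.

```python
import math


def clip_row_grad(row, i, j):
    """dL_{i->t}/dS_ij for one row, without the 1/N factor: p_ij - delta_ij."""
    m = max(row)
    z = sum(math.exp(x - m) for x in row)
    p = math.exp(row[j] - m) / z
    return p - (1.0 if i == j else 0.0)


def siglip_entry_grad(s, positive):
    """d log(1 + exp(-y s)) / ds, without the 1/N factor: -y * sigmoid(-y s)."""
    y = 1.0 if positive else -1.0
    return -y / (1.0 + math.exp(y * s))


row = [4.0, 1.0, 0.5]     # logits S_0j for image 0; entry 0 is the positive
bigger = row + [3.5]      # same row after one more hard negative joins the batch

print(clip_row_grad(row, 0, 1), clip_row_grad(bigger, 0, 1))    # changes with the batch
print(siglip_entry_grad(row[1], False), siglip_entry_grad(bigger[1], False))  # unchanged
```

The CLIP gradient at entry (0, 1) shrinks once the new negative absorbs probability mass, while the SigLIP term is a function of that single logit and therefore cannot change.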
Engineering implications:
- CLIP: under DDP must all-gather embeddings to compute global logsumexp; communication is $O(N \cdot D)$ with extra sync points
- SigLIP: can use chunked all-pair, with each chunk computing only local rows × remote columns sigmoid terms, no logsumexp sync
Q22. Q-Former vs LLaVA projector trade-off: explain along capacity / compute / training stability.
Capacity (information capacity):
- LLaVA projector: all $N$ patch tokens enter the LLM; information has no bottleneck, but LLM context usage is large
- Q-Former: 32 queries is a fixed bottleneck; significant information compression, unfriendly to detail tasks (OCR / counting)
- Suppose visual encoder output rank is $r$; LLaVA visual context rank $\le r$ (preserved), Q-Former rank $\le \min(r, 32)$
Compute / Memory:
- LLaVA projector: MLP forward only, O(N·D²) compute
- Q-Former: 12 layers of cross-attn + self-attn + FFN, ~180M params; but downstream LLM context is short (32 tokens vs N=256+ tokens), so LLM inference is fast
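- Total cost trade-off: at high image resolution (N=2880), Q-Former's fixed 32 tokens save substantial LLM inference; at low resolution (N≈256) the saving shrinks while the Q-Former's own forward pass and training complexity remain, so the plain projector is usually preferred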
Training stability:
- LLaVA: projector is easy to train (2 stages), gradient path is short
- Q-Former: also 2-stage, but heavier (stage 1, representation learning, jointly optimizes ITC + ITM + ITG; stage 2, generative learning, connects to the frozen LLM); the ITM head overfits easily, and the three objectives each need a different self-attention mask (uni-modal for ITC, bidirectional for ITM, multimodal causal for ITG), so engineering pitfalls abound
Conclusion: 2024 mainstream returns to projector + spatial pixel-shuffle / merging to control token count; Q-Former retains value mainly in video / multi-image summarization (using queries for temporal pooling).
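As a rough illustration of the pixel-shuffle / merging idea mentioned above, here is a NumPy sketch of a 2×2 space-to-depth merge. The function name `merge_tokens`, the 24×24 grid and the channel width 8 are assumptions for the example; real models differ in ordering and usually follow the merge with an MLP.

```python
import numpy as np

def merge_tokens(x, grid_h, grid_w, s=2):
    """Fold each s x s neighbourhood of patch tokens into one wider token.

    x: (B, grid_h * grid_w, C), row-major over the patch grid.
    Returns (B, grid_h * grid_w / s^2, s * s * C).
    """
    B, N, C = x.shape
    assert N == grid_h * grid_w and grid_h % s == 0 and grid_w % s == 0
    x = x.reshape(B, grid_h // s, s, grid_w // s, s, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # bring the s x s window next to C
    return x.reshape(B, (grid_h // s) * (grid_w // s), s * s * C)

x = np.arange(2 * 576 * 8, dtype=np.float32).reshape(2, 576, 8)   # 24 x 24 grid
y = merge_tokens(x, 24, 24)
print(x.shape, "->", y.shape)
```

Token count drops by $s^2$ (576 to 144 here) while the channel width grows by the same factor, so no values are discarded before the projector.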
Q23. Why is Qwen2-VL's M-RoPE config `mrope_section = [16, 24, 24]` and not 1:1:1? Do all head_dim dims rotate?
Recap ordinary RoPE: head_dim $d$ split into $d/2$ complex pairs, frequencies $\theta_k = \text{base}^{-2k/d}$. The frequency coverage determines the maximum relative distance distinguishable: low frequency distinguishes long distances, high frequency distinguishes short distances.
Key disambiguation: Qwen2-VL mrope_section is counted in rotation pairs, i.e. in units of half of head_dim (each number = how many frequencies $\theta_k$ the axis owns; under the rotate_half layout, pair $k$ is dims $(k, k + d/2)$ rather than $(2k, 2k+1)$). $[16, 24, 24]$ means t / h / w occupy 16 / 24 / 24 pairs; $(16 + 24 + 24) \times 2 = 128 = $ head_dim. The HF implementation doubles the section list to $[16, 24, 24, 16, 24, 24]$ when slicing head_dim, corresponding to the axis sequence $(t, h, w, t, h, w)$, so all 128 dims rotate and none is left unrotated.
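The slicing can be sketched in NumPy as follows. This mirrors the doubling-and-interleaving logic described above but is not the Hugging Face source: `mrope_cos_sin` is a hypothetical helper, and `base=10000.0` is an illustrative default rather than the model's configured value.

```python
import numpy as np

def mrope_cos_sin(pos_thw, head_dim=128, sections=(16, 24, 24), base=10000.0):
    """pos_thw: int array (3, seq) with the t / h / w position id of each token."""
    half = head_dim // 2
    assert sum(sections) == half
    inv_freq = base ** (-np.arange(half) / half)             # theta_k = base^(-2k/d)
    freqs = pos_thw[:, :, None] * inv_freq[None, None, :]    # (3, seq, half)
    emb = np.concatenate([freqs, freqs], axis=-1)            # (3, seq, head_dim)
    bounds = np.cumsum(list(sections) * 2)[:-1]              # [16, 40, 64, 80, 104]
    chunks = np.split(emb, bounds, axis=-1)                  # six slices of head_dim
    # Slice i takes its angles from axis i % 3, giving (t, h, w, t, h, w).
    angle = np.concatenate([c[i % 3] for i, c in enumerate(chunks)], axis=-1)
    return np.cos(angle), np.sin(angle)

# Token 0 is text-like (t = h = w = 5); token 1 is a patch at t=5, h=2, w=9.
pos = np.array([[5, 5],
                [5, 2],
                [5, 9]])
cos, sin = mrope_cos_sin(pos)
print(cos.shape)
```

For token 0 the three axes agree, so the result coincides with ordinary 1D RoPE at position 5; for token 1, dims 0–15 and 64–79 follow t, 16–39 and 80–103 follow h, and 40–63 and 104–127 follow w.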
Design trade-offs:
- Temporal variation is slow: typical video sampled at 1–5 FPS; adjacent frames are very similar; long-range temporal dependencies are moderate. $s_t=16$ (25% share) suffices to cover hundreds to thousands of frames.
- Spatial variation is dramatic: huge visual differences across patches within a frame; to do token-to-token retrieval over a $\sim 1000\times 1000$-pixel image, more frequency slots are needed. $s_h = s_w = 24$ (37.5% each) covers more.
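- Spatial symmetry: $s_h = s_w$ gives the H and W axes the same number of frequency slots, so neither spatial direction is favoured (the two axes still occupy different frequency bands, so the symmetry is in budget rather than exact).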
- 6 alternating segments instead of 3 contiguous: because RoPE uses LLaMA "rotate_half", head_dim is split in memory into two halves $[h_1, h_2]$, with rotation $q \mapsto q \cos + \text{rotate\_half}(q)\sin$; the two halves share inv_freq. So axis allocations must mirror in both halves.
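One way to see why the allocation has to mirror across the two halves is to check whether the transform is still a proper rotation. The sketch below is an illustration under assumptions: `rope_apply` is a hypothetical helper, `base=10000.0` and the position triple `(3, 7, 11)` are arbitrary, and the "contiguous" layout `[32, 48, 48]` is a made-up counter-example, not something any released model uses.

```python
import numpy as np

def rotate_half(x):
    h = x.shape[-1] // 2
    return np.concatenate([-x[..., h:], x[..., :h]], axis=-1)

def rope_apply(q, pos_thw, axis_of_dim, base=10000.0):
    """Rotate q, taking the position for each dim from the axis listed in axis_of_dim."""
    half = q.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    freq_full = np.concatenate([inv_freq, inv_freq])     # both halves share inv_freq
    angle = np.asarray(pos_thw, dtype=float)[axis_of_dim] * freq_full
    return q * np.cos(angle) + rotate_half(q) * np.sin(angle)

# Axis index (0 = t, 1 = h, 2 = w) owning each of the 128 dims.
mirrored = np.repeat([0, 1, 2, 0, 1, 2], [16, 24, 24, 16, 24, 24])
contiguous = np.repeat([0, 1, 2], [32, 48, 48])          # hypothetical counter-example

rng = np.random.default_rng(0)
q = rng.normal(size=128)
pos = (3, 7, 11)                                         # t, h, w of one vision token

norm_q = np.linalg.norm(q)
norm_mirrored = np.linalg.norm(rope_apply(q, pos, mirrored))
norm_contiguous = np.linalg.norm(rope_apply(q, pos, contiguous))
print(f"|q| = {norm_q:.4f}, mirrored = {norm_mirrored:.4f}, contiguous = {norm_contiguous:.4f}")
```

With the mirrored layout, dims $k$ and $k + 64$ see the same angle, so each pair undergoes a true 2D rotation and the norm is preserved. In the contiguous layout the two dims of a pair take their positions from different axes, and the map is no longer a rotation.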
Qwen2.5-VL upgrade: switch $m_t$ from frame id to absolute timestamps (seconds), letting variable-FPS videos share a consistent time coordinate at training — the key to long video.
Alternative: DeepSeek-VL2 uses flattened visual tokens + plain 1D RoPE (no h, w split); Llama-3.2-V also does not split spacetime explicitly. M-RoPE only wins decisively for native interleaved video + image scenarios.
Q24. Root cause of VLM hallucinations? Pros and cons of existing mitigations?
Root causes:
- Data bias: training data contains common "co-occurrence priors" — "if there's a table, there's probably a chair". Co-occurrence makes a VLM tend to answer "yes, there's a chair" when seeing a table, even if there is no chair
- Language prior dominates: when the visual signal is weak (small objects, blur, odd angles), the VLM falls back to a pure language model, answering from "corpus common sense"
- LLM sycophancy: when the user asks "is there X in the image", the model tends to say Yes (preference tuning rewards agreeable, helpful-sounding answers, which biases it toward Yes)
- Stage 2 instruction tuning has no negative supervision: labels rarely teach "if there is no X, answer No"
Mitigations:
| Method | Idea | Pros | Cons |
|---|---|---|---|
| LLaVA-RLHF | PPO + hallucination-aware reward | Targeted late-stage fix | Needs a reward model + lots of preference data |
| RLAIF-V | AI-generated preference | Low data cost | Reward model's own bias accumulates |
| POVID | DPO + constructed hallucination negatives | Direct targeted fix | Negative design requires care |
| VCD (visual contrastive decoding) | At inference, run the VLM on the original image and a noise-distorted copy, then amplify the difference between the two output distributions | Training-free | 2x inference cost |
| OPERA | Beam search + over-attention detection | Inference-time detection | Heuristic; may have false positives |
| POPE-driven evaluation | Probe with POPE yes/no object questions and feed the failures back into data / training fixes | Quantifiable | Only measures object hallucination |
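To make the VCD row concrete, here is a minimal sketch of the contrastive step for a single next-token distribution. It assumes the two logit vectors have already been produced by running the same VLM on the original and on the distorted image; the three-token vocabulary, the logit values and the settings `alpha=1.0`, `beta=0.1` are illustrative, and `vcd_logprobs` is a hypothetical helper name.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def vcd_logprobs(logits_clean, logits_distorted, alpha=1.0, beta=0.1):
    """Contrast clean vs distorted logits, restricted to plausible tokens."""
    lp_clean = log_softmax(logits_clean)
    contrast = (1.0 + alpha) * logits_clean - alpha * logits_distorted
    # Plausibility constraint: keep tokens with p_clean >= beta * max p_clean.
    keep = lp_clean >= np.log(beta) + lp_clean.max()
    return log_softmax(np.where(keep, contrast, -np.inf))

vocab = ["yes", "no", "maybe"]
logits_clean = np.array([1.0, 1.2, -4.0])       # image weakly supports "no"
logits_distorted = np.array([2.0, 0.5, -4.0])   # language prior prefers "yes"

clean_lp = log_softmax(logits_clean)
vcd_lp = vcd_logprobs(logits_clean, logits_distorted)
for tok, a, b in zip(vocab, np.exp(clean_lp), np.exp(vcd_lp)):
    print(f"{tok:>5}: clean {a:.3f} -> vcd {b:.3f}")
```

In this toy case the probability of "no" rises from about 0.55 to about 0.87, because the part of the "yes" score that survives without visual evidence is subtracted out. The second forward pass on the distorted image is where the 2x inference cost comes from.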
Future directions: fix at the training data layer (grounded captions + segment-level supervision); visual chain-of-thought (VLM-R1 style) where the model "points to evidence" before answering.
Q25. For modern VLM training, should the vision tower use SigLIP or CLIP? Why did most pick SigLIP-So400M in 2024–2025?
Empirical conclusion: after 2024, SigLIP-So400M has become a common choice for open-weight VLMs, with PaliGemma (Google) and LLaVA-OneVision as representatives. But Molmo still uses OpenAI CLIP (its paper ablates against SigLIP); the InternVL series uses in-house InternViT; Qwen2-VL's vision side initializes from a DFN-derived ViT and then undergoes large-scale joint vision-language training; LLaVA-1.5 / 1.6 still use CLIP ViT-L/14. "Switch to SigLIP" is not an industry consensus.
Why SigLIP-So400M is attractive:
- Stronger zero-shot performance: SigLIP-So400M has a 4–8 point lead over the same-size CLIP on zero-shot ImageNet; visual feature quality is higher
- Resolution-friendly: SigLIP already trained at 384²/512² extensively; CLIP is mostly 224²+336²; VLM tasks generally need high resolution, so SigLIP transfers more smoothly
- Batch-independent loss → stable fine-tune: SigLIP's sigmoid objective yields more predictable gradients when the vision tower is unfrozen in later training stages (stage 1 usually keeps it frozen and trains only the projector)
- Multilingual support: SigLIP-2 / mSigLIP natively support multiple languages
- Open weights: Google releases the full SigLIP / SigLIP-2 checkpoint family (OpenAI CLIP weights are also public, but cover fewer sizes and resolutions)
When to still pick CLIP:
- When strict alignment with OpenAI CLIP behavior is needed (e.g. CLIP guidance for Stable Diffusion-style use)
- For project compatibility (early LLaVA-1.0/1.5 + DALL-E pipelines use CLIP)
Note: SigLIP is not a silver bullet; DeepSeek-VL uses SigLIP + SAM dual-encoder — SAM features retain irreplaceable advantages on detail localization tasks.
§A Appendix: sanity-check outputs & references
A.1 Key code sanity check (illustrated by actual runs)
[ViT] patch_embed: (2, 3, 224, 224) -> (2, 196, 768) ✓
[ViT] forward + CLS: (2, 3, 224, 224) -> head out (2, 1000) ✓
[CLIP] N=8, D=512, init logit_scale=ln(1/0.07): loss ≈ 2.08 ≈ log(N) (random embeddings → near-uniform softmax) ✓
[CLIP] forward + backward: gradients along i→t and t→i paths are symmetric ✓
[SigLIP] N=8, D=512, b=0: loss = sum_{ij} log(1+e^0) / N = 64 * log 2 / 8 ≈ 5.545 ✓
[SigLIP] bias b=-10: positives (8 entries) loss ≈ log(1+e^10) ≈ 10; negatives (56 entries) loss ≈ 4.5e-5; total ≈ 8·10/8 ≈ 10.0 ✓ (positives dominate early gradient)
[LLaVA] visual feat (2, 256, 1024) -> projector -> (2, 256, 4096) ✓
[LLaVA] input_ids w/ <image> placeholder: 1 token -> 256 visual tokens after merge ✓
[Q-Former] image_feats (2, 257, 1408), queries (1, 32, 768) -> out (2, 32, 768) ✓
[M-RoPE] head_dim=128, mrope_section=[16,24,24]: 2 × sum = 128 = head_dim, full rotation ✓
[M-RoPE] pure-text token (pos_t=pos_h=pos_w=m): cos/sin identical across the three axes, equivalent to 1D RoPE ✓
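The loss values quoted in the listing above follow from closed-form arithmetic. The sketch below reproduces them with the standard library only, under the idealising assumption that all pre-bias logits are exactly 0 (random embeddings only approximate this, so measured values will differ slightly).

```python
import math

N = 8

# CLIP: uniform softmax over N candidates gives -log(1/N) = log N.
clip_uniform = math.log(N)

# SigLIP with b = 0: all N*N entries contribute log(1 + e^0) = log 2, divided by N.
siglip_b0 = N * N * math.log(2) / N

# SigLIP with b = -10: positives see S = -10 with y = +1, negatives see y = -1.
pos_term = math.log1p(math.exp(10))     # log(1 + e^{10}), about 10
neg_term = math.log1p(math.exp(-10))    # log(1 + e^{-10}), about 4.5e-5
siglip_bm10 = (N * pos_term + N * (N - 1) * neg_term) / N

print(f"CLIP uniform loss     = {clip_uniform:.3f}")
print(f"SigLIP loss, b = 0    = {siglip_b0:.3f}")
print(f"SigLIP loss, b = -10  = {siglip_bm10:.3f}")
```

With $b = -10$ the 8 positive terms carry almost the entire loss, which matches the note that positives dominate the early gradient.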
A.2 Key references (by topic)
Vision encoder: Dosovitskiy et al. ICLR 2021 (ViT); Zhai et al. CVPR 2022 (ViT-g); Fang et al. arXiv 2023 (EVA-02)
Contrastive pretraining: Radford et al. ICML 2021 (CLIP); Cherti et al. CVPR 2023 (OpenCLIP); Zhai et al. ICCV 2023 (SigLIP); Tschannen et al. arXiv 2025 (SigLIP-2); Gadre et al. NeurIPS 2023 (DataComp)
Visual instruction / fusion: Liu et al. NeurIPS 2023 (LLaVA); Liu et al. CVPR 2024 (LLaVA-1.5); Li et al. ICML 2023 (BLIP-2); Alayrac et al. NeurIPS 2022 (Flamingo); Wang et al. arXiv 2023 (CogVLM); Bai et al. arXiv 2023 (Qwen-VL); Wang et al. arXiv 2024 (Qwen2-VL); Bai et al. arXiv 2025 (Qwen2.5-VL); Lu et al. arXiv 2024 (DeepSeek-VL); Wu et al. arXiv 2024 (DeepSeek-VL2); Chen et al. CVPR 2024 + arXiv 2024 (InternVL / InternVL-2); Llama Team arXiv 2024 (Llama-3.2-V); Deitke et al. arXiv 2024 (Molmo); Li et al. arXiv 2024 (LLaVA-OneVision)
VLM preference alignment: Sun et al. arXiv 2023 (LLaVA-RLHF); Yu et al. arXiv 2024 (RLAIF-V); Zhou et al. arXiv 2024 (POVID); Shen et al. arXiv 2025 (VLM-R1)
Multimodal embeddings: Koukounas et al. arXiv 2024 (Jina-CLIP); Jiang et al. arXiv 2024 (VLM2Vec); Zhang et al. arXiv 2024 (mmE5)
Evaluation: Li et al. EMNLP 2023 (POPE); Thrush et al. CVPR 2022 (Winoground); Liu et al. ECCV 2024 (MMBench); Yue et al. CVPR 2024 (MMMU); Fu et al. CVPR 2025 (Video-MME); Yu et al. ICML 2024 (MM-Vet); Mangalam et al. NeurIPS 2023 (EgoSchema)
Code + formulas have passed independent reviewer static checks; numerical values verified on PyTorch 2.x, CUDA 12.x (shapes and initial loss of the 5 core modules — ViT / CLIP / SigLIP / Q-Former / M-RoPE — all match the formulas).