VLM Multimodal Tutorial (EN)

Source: docs/tutorials/vlm_multimodal_tutorial_en.md SHA256: de321e4d3dec Rendered: 2026-05-19 18:58 UTC

§0 TL;DR Cheat Sheet

VLM in 8 sentences

one page covering the core interview points for vision-language models (see §1–§13 below for derivations and code).

  1. Vision encoder = ViT-dominated: Dosovitskiy et al. 2021 (ICLR) slice images into $P\times P$ patches (typically $P=14$ or $16$), apply a linear projection + learnable positional embedding + optional [CLS] token, and feed them into a Transformer encoder. The vision side of CLIP / SigLIP / LLaVA is all a ViT variant.
  2. CLIP symmetric InfoNCE (must derive): Radford et al. 2021 (ICML) make image embeddings $\mathbf{u}_i$ and text embeddings $\mathbf{v}_i$ do contrastive learning in a shared space, with loss = average of row softmax + column softmax: $\mathcal{L} = \tfrac{1}{2}(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$. The temperature $\tau$ is learnable (log-parameterized as logit_scale $= \log(1/\tau)$, clamped to $[0, \log 100]$ so that the multiplier $1/\tau$ never exceeds 100).
  3. SigLIP replaces softmax with sigmoid: Zhai et al. 2023 (ICCV) treat each entry of the N×N similarity matrix as an independent binary CE, getting rid of batch-wise softmax normalization. The loss is therefore much less sensitive to batch size: it already works at small batches and scales to 32k+ batches without a global normalization step; a learnable bias $b$ corrects early negative dominance. SigLIP-2 (Google 2025) adds caption + self-distillation + dense local objectives and extends to multilingual.
  4. LLaVA = projector + 2-stage train: Liu et al. 2023 (NeurIPS) use a lightweight projector (a single linear layer in LLaVA-1.0, a 2-layer MLP from LLaVA-1.5 on) to map frozen CLIP visual features into LLM token space. Stage 1 trains only the projector for feature alignment (caption data); Stage 2 unfreezes the LLM for visual instruction tuning (158K instructions generated by GPT-4).
  5. Q-Former vs Projector is the central BLIP-2 trade-off: Li et al. 2023 (ICML) use 32 learnable query tokens that do cross-attention over a frozen image encoder, compressing any resolution / number of patches into a fixed 32 tokens — stable compute budget but lossy + complex to train. LLaVA's MLP is simple but token count grows quadratically with resolution.
  6. Flamingo / Llama-3.2-Vision = gated cross-attn: Alayrac et al. 2022 (NeurIPS) use a Perceiver Resampler (64 latent queries) to compress visual features into a fixed token count, then insert gated cross-attention layers every few LLM layers ($\tanh$ gating initialized at 0, preserving the frozen LLM's text-only capability).
  7. Qwen2-VL's M-RoPE (must-know): Wang et al. 2024 split RoPE along head_dim into 6 chunks whose rotation angles use the position ids $(t, h, w, t, h, w)$ in that order; the typical config is mrope_section=[16,24,24], counted in rotation pairs (half of head_dim), so $\sum \times 2 = $ head_dim=128 and all 128 dims rotate. This way each token carries a three-dim (t, h, w) position without flattening. It pairs with native dynamic resolution (no more resizing or padding to a fixed 224×224).
  8. Three-stage training + preference optimization: (1) alignment trains the projector / Q-Former; (2) visual instruction tune unfreezes part of the LLM; (3) preference (LLaVA-RLHF, RLAIF-V, VLM-R1, DPO/PPO) addresses hallucinations and long-tail alignment. VLM-R1 (2025) uses GRPO + verifiable reward to transfer reasoning ability into vision-language tasks.

§1 Intuition: what is a VLM doing?

Think of an image as "another language". The work of a VLM splits into three parts:

Three fusion paradigms

the central architectural split in VLMs.

Compare from a Q/K/V perspective: in the projector paradigm the image is part of the LLM's input sequence (full interaction inside self-attention); in the cross-attn paradigm the image is always KV and is only queried — this causes different KV cache handling at inference time.

§2 ViT: turning an image into a token sequence

2.1 Patch tokenize

Input image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, slice it into $N = HW/P^2$ patches of $P\times P$, flatten each patch to a $P^2 C$-dim vector, and pass it through a linear layer to $D$ dimensions:

$$\mathbf{z}_0 = [\mathbf{x}_\text{class};\ \mathbf{x}^1_p \mathbf{E};\ \mathbf{x}^2_p \mathbf{E};\ \dots;\ \mathbf{x}^N_p \mathbf{E}] + \mathbf{E}_\text{pos}$$

CLIP / SigLIP don't necessarily use [CLS]

CLIP ViT reads its output from the [CLS] token; SigLIP pools the patch tokens with an attention-pooling head instead, and LLaVA-style VLMs keep all patch tokens to feed downstream. [CLS] is the ViT paper's choice, not an intrinsic part of ViT.

2.2 Transformer backbone

$$\mathbf{z}'_\ell = \text{MHA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \quad \mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell$$

Pre-norm (LN at the input of each sub-layer), MLP uses GELU. Note that the original ViT has a fixed number of patches ($224/16=14 \Rightarrow N=196$), and its positional embedding table is fixed in size — this is the pain point that dynamic resolution must solve (§10).

2.3 ViT specifications

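| Model | Patch | Hidden $D$ | Layers | Heads | Params | Source |
|---|---|---|---|---|---|---|
| ViT-B/16 | 16 | 768 | 12 | 12 | 86M | Dosovitskiy 2021 |
| ViT-L/14 | 14 | 1024 | 24 | 16 | 304M | Dosovitskiy 2021 |
| ViT-H/14 | 14 | 1280 | 32 | 16 | 632M | Dosovitskiy 2021 |
| ViT-g/14 | 14 | 1408 | 40 | 16 | 1.0B | Zhai et al. 2022 |
| ViT-bigG/14 | 14 | 1664 | 48 | 16 | 1.8B | OpenCLIP, 2023 |
| EVA-02-L/14 | 14 | 1024 | 24 | 16 | 304M | Fang 2023 |
| SigLIP SoViT-400M/14 | 14 | 1152 | 27 | 16 | 400M | Alabdulmohsin 2023 |

head_dim typically stays between 64 and ~100

ViT-family models mostly keep head_dim $= D / H$ in the range 64–104. In the table above it is 64 for B/16 and L/14, 72 for SoViT-400M, 80 for H/14, 88 for g/14, and 104 for bigG/14. The usual rationale is that head_dim should not be too small, otherwise per-head expressiveness is limited.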

2.4 Code: ViT patch embed + backbone (core 60 lines)

import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.img_size, self.patch_size = img_size, patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with stride=P, kernel=P is equivalent to a linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                   # x: [B, C, H, W]
        x = self.proj(x)                                    # [B, D, H/P, W/P]
        x = x.flatten(2).transpose(1, 2)                    # [B, N, D]
        return x

class ViTBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)       # self-attention
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000, use_cls=True):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
        N = self.patch_embed.num_patches
        self.use_cls = use_cls
        if use_cls:
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, N + 1, embed_dim))
        else:
            self.pos_embed = nn.Parameter(torch.zeros(1, N, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        if use_cls:
            nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.blocks = nn.ModuleList([ViTBlock(embed_dim, num_heads) for _ in range(depth)])
        self.ln = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward(self, x):
        B = x.size(0)
        x = self.patch_embed(x)                             # [B, N, D]
        if self.use_cls:
            cls = self.cls_token.expand(B, -1, -1)
            x = torch.cat([cls, x], dim=1)                  # [B, N+1, D]
        x = x + self.pos_embed                              # broadcast over batch
        for blk in self.blocks:
            x = blk(x)
        x = self.ln(x)
        feat = x[:, 0] if self.use_cls else x.mean(dim=1)   # CLS or mean-pool
        return self.head(feat)

Common bug with interpolate_pos_embed

when transferring a ViT-L/14 from $224^2$ to $336^2$, the pos_embed table needs resizing from $(16^2 + 1)$ rows to $(24^2 + 1)$ rows (for a patch-16 model the grids would be $14\times 14 \to 21\times 21$). Correct procedure: keep [CLS] unchanged, reshape the patch portion to $16\times 16\times D$, bicubic-interpolate to $24\times 24$, then flatten and concatenate back. Pitfall: directly doing 1D interpolation over $(N+1)$ rows treats [CLS] as a patch and also ignores the 2D neighborhood structure of the grid.
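
The resizing procedure can be sketched as follows. This is a minimal illustration, not the code of any particular library: the function name `interpolate_pos_embed` and its arguments are chosen here for the example, and it assumes a square patch grid with an optional leading [CLS] row.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid, has_cls=True):
    """Resize a [1, (1+)old_grid**2, D] position table to new_grid**2 patches."""
    if has_cls:
        cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]   # keep [CLS] untouched
    else:
        cls_pos, patch_pos = None, pos_embed
    D = patch_pos.shape[-1]
    # [1, N, D] -> [1, D, old_grid, old_grid] so interpolation runs over the 2D grid
    grid = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    # back to a flat [1, new_grid**2, D] table
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return patch_pos if cls_pos is None else torch.cat([cls_pos, patch_pos], dim=1)

# ViT-L/14 at 224: 16x16 patches + [CLS]; target 336: 24x24 patches + [CLS]
old = torch.randn(1, 16 * 16 + 1, 1024)
new = interpolate_pos_embed(old, old_grid=16, new_grid=24)
print(new.shape)   # torch.Size([1, 577, 1024])
```

The [CLS] row is copied bit-for-bit, while the patch rows are resampled on the 2D grid, which is exactly what the 1D shortcut gets wrong.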

§3 CLIP: symmetric InfoNCE (must derive)

3.1 Formalizing the objective

CLIP (Radford et al. 2021, ICML) trains with $N$ (image, text) pairs per batch. Two encoders $f_\theta$ (image), $g_\phi$ (text) produce $\ell_2$-normalized embeddings:

$$\mathbf{u}_i = \frac{f_\theta(I_i)}{\|f_\theta(I_i)\|_2}, \quad \mathbf{v}_j = \frac{g_\phi(T_j)}{\|g_\phi(T_j)\|_2}, \quad \mathbf{u}_i, \mathbf{v}_j \in S^{D-1}$$

Define the similarity matrix $\mathbf{S} \in \mathbb{R}^{N\times N}$ (the "logit"):

$$S_{ij} = \frac{\mathbf{u}_i^\top \mathbf{v}_j}{\tau}$$

where $\tau > 0$ is the learnable temperature (engineered as logit_scale = log(1/τ), which is more stable in backprop, clamped in $[\log 1, \log 100]$).

3.2 Symmetric InfoNCE loss (average of row + column softmax)

Image → Text direction (for each image $i$, positive sample is $T_i$, negatives are $\{T_j\}_{j\neq i}$):

$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})}$$

Text → Image direction:

$$\mathcal{L}_{t\to i} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp(S_{jj})}{\sum_{i=1}^{N} \exp(S_{ij})}$$

Total symmetric loss:

$$\boxed{\;\mathcal{L}_\text{CLIP} = \frac{1}{2}\left(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i}\right)\;}$$

Equivalent formulation: average of row softmax + column softmax

for the matrix $\mathbf{S}$, apply row softmax and take the NLL of the diagonal (image→text), apply column softmax and take the NLL of the diagonal (text→image). The mean of the two cross-entropies is the CLIP loss.

3.3 Gradient derivation (why symmetry matters)

Fix $\tau=1$. For the row logits $\mathbf{s}_i = (S_{i1},\dots,S_{iN})^\top$ inside $\mathcal{L}_{i\to t}$, apply softmax and let $p_{ij} = \text{softmax}(\mathbf{s}_i)_j$. Then:

$$\frac{\partial \mathcal{L}_{i\to t}}{\partial S_{ij}} = \frac{1}{N}\left(p_{ij} - \mathbb{1}[j=i]\right)$$

With only one direction $\mathcal{L}_{i\to t}$, $\mathbf{v}_j$ receives gradient from all $\mathbf{u}_i$, but cannot in turn constrain how $\mathbf{u}_i$ behaves when retrieved by other $\mathbf{v}_k$. Symmetrization adds the text→image retrieval constraint, preventing one-sided collapse in the embedding space (where image-side clusters tightly but text-side drifts).
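
The gradient formula above can be checked numerically against autograd. The snippet is a small sanity check under the same assumption as the derivation ($\tau = 1$); the logits are random and the variable names are chosen for this example only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N = 5
S = torch.randn(N, N, requires_grad=True)     # similarity logits with tau = 1
labels = torch.arange(N)                      # positives sit on the diagonal

# L_{i->t}: mean over rows of -log softmax(S_i)[i]
loss_i2t = F.cross_entropy(S, labels)
loss_i2t.backward()

# Closed form: dL/dS_ij = (p_ij - 1[j == i]) / N, with p = row-wise softmax
p = torch.softmax(S.detach(), dim=1)
expected = (p - torch.eye(N)) / N
print(torch.allclose(S.grad, expected, atol=1e-6))
```

Each row of the gradient sums to zero: the positive entry is pushed up by exactly as much as the negatives in that row are pushed down.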

3.4 Role of temperature

$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_i \log\frac{\exp(\mathbf{u}_i^\top \mathbf{v}_i / \tau)}{\sum_j \exp(\mathbf{u}_i^\top \mathbf{v}_j / \tau)}$$

InfoNCE as a lower bound

van den Oord et al. 2018 (CPC) showed that InfoNCE gives a lower bound on the mutual information $I(U; V)$: $I(U; V) \ge \log N - \mathcal{L}_\text{InfoNCE}$. Because the loss is non-negative, the bound can never exceed $\log N$, so a larger batch size $N$ raises the ceiling and a lower loss moves the bound toward it. This is why CLIP / SigLIP both chase huge batches.

3.5 Code: CLIP symmetric InfoNCE (core 50 lines)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPLoss(nn.Module):
    """Symmetric InfoNCE used by OpenAI CLIP (Radford et al. 2021)."""
    def __init__(self, init_tau=0.07, max_logit_scale=4.6052):
        super().__init__()
        # Equivalent to logit_scale = log(1/τ); initial ~ log(1/0.07) ≈ 2.659
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / init_tau).log())
        self.max_logit_scale = max_logit_scale            # log(100), clamp to prevent blow-up

    def forward(self, image_feats, text_feats):
        """
        image_feats: [N, D]   (unnormalized)
        text_feats:  [N, D]
        """
        # L2 normalize to the unit sphere
        u = F.normalize(image_feats, dim=-1)              # [N, D]
        v = F.normalize(text_feats, dim=-1)               # [N, D]

        # Clamp logit_scale upper bound (late training rises to ~log(100))
        logit_scale = self.logit_scale.clamp(max=self.max_logit_scale).exp()

        # Similarity matrix
        logits_i2t = logit_scale * u @ v.t()              # [N, N]
        logits_t2i = logits_i2t.t()                       # [N, N]

        # The diagonal contains the positive pairs
        N = u.size(0)
        labels = torch.arange(N, device=u.device)

        loss_i2t = F.cross_entropy(logits_i2t, labels)    # row softmax NLL
        loss_t2i = F.cross_entropy(logits_t2i, labels)    # column softmax NLL

        return 0.5 * (loss_i2t + loss_t2i), logit_scale

# Example (under DDP, you need to all-gather feats from all GPUs before computing)
if __name__ == "__main__":
    N, D = 8, 512
    img_feats = torch.randn(N, D)
    txt_feats = torch.randn(N, D)
    criterion = CLIPLoss()
    loss, scale = criterion(img_feats, txt_feats)
    print(f"loss={loss.item():.4f}  logit_scale={scale.item():.2f}")

Under DDP you must all-gather to get true InfoNCE

on a single GPU, a loss over the local batch $N$ only sees local negatives. Production CLIP (OpenCLIP / OpenAI) runs dist.all_gather on $\mathbf{u}, \mathbf{v}$ after the forward pass, so the negative pool equals the global batch size (e.g. 32k). The tensors returned by dist.all_gather carry no gradient, so each GPU back-propagates only through its own rows / columns, typically by re-inserting its local embeddings into the gathered list or by using a differentiable all-gather. This is an engineering trick, not a math change.
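
A minimal sketch of the gather step, assuming every rank holds the same local batch size. The helper name `gather_with_local_grad` is hypothetical; it relies on the fact that tensors filled by `dist.all_gather` are not connected to the autograd graph, and re-inserts the local tensor so that gradients flow through this rank's slice only. Outside a distributed run it simply returns its input.

```python
import torch
import torch.distributed as dist

def gather_with_local_grad(x):
    """All-gather x across ranks; only this rank's slice keeps its autograd graph."""
    if not (dist.is_available() and dist.is_initialized()):
        return x                                  # single-process fallback
    world_size, rank = dist.get_world_size(), dist.get_rank()
    gathered = [torch.zeros_like(x) for _ in range(world_size)]
    dist.all_gather(gathered, x.contiguous())     # gathered copies carry no gradient
    gathered[rank] = x                            # re-insert the local tensor (with grad)
    return torch.cat(gathered, dim=0)             # [world_size * N_local, D]

# Single-process usage: behaves as the identity
u_local = torch.randn(4, 8, requires_grad=True)
u_all = gather_with_local_grad(u_local)
print(u_all.shape, u_all.requires_grad)
```

With this helper the loss is computed on the gathered `u_all` / `v_all`, so the positive for local row `i` sits at column `rank * N_local + i` rather than on the plain diagonal of a local matrix.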

3.6 CLIP training data & scale

3.7 CLIP's failure modes

§4 SigLIP: replacing softmax with sigmoid; batch scaling rewritten

4.1 Motivation

CLIP's softmax normalization couples all N×N similarities together: each positive's gradient depends on the row-wide logsumexp of negatives. This causes:

Zhai et al. 2023 (ICCV) proposed SigLIP: treat each entry of the N×N matrix as independent binary classification.

4.2 Sigmoid loss derivation

Define similarity $S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$, where $t = e^{t'}$ is a learnable scale (same as CLIP's $1/\tau$) and $b$ is a learnable bias (initialized to a negative number, e.g. $b_0 = -10$, to avoid early prediction of "all positive").

Label $y_{ij} = +1$ if $i=j$, $-1$ otherwise. Each entry does binary logistic regression:

$$\mathcal{L}_\text{SigLIP} = -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \log \sigma\!\left(y_{ij} \cdot S_{ij}\right) = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \log\!\left(1 + \exp(-y_{ij} S_{ij})\right)$$

Key property

the loss for each $(i,j)$ entry does not depend on any other entry. Therefore:

The bias $b$ is not decorative

early in training, $\mathbf{u}, \mathbf{v}$ are near-random, $S_{ij}$ is near 0, and sigmoid outputs 0.5. Negatives number $N^2 - N \approx N^2$, while positives are only $N$; if initial predictions are all ~0.5, negative-sample gradients dominate early training. SigLIP initializes $b_0 \approx -10$, so the sigmoid is initially near 0 — all entries are first predicted as negative, then positives have large loss and negatives have small loss; starting from this state makes training stable.
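
The effect of the bias initialization is easy to see on random features. The sketch below assumes unit-norm random embeddings as a stand-in for an untrained model and uses the sum / N convention; `siglip_loss` is a local helper written for this example.

```python
import torch
import torch.nn.functional as F

def siglip_loss(u, v, t, b):
    """Pairwise sigmoid loss with scale t and bias b, normalized by batch size."""
    logits = t * (u @ v.t()) + b
    labels = 2 * torch.eye(u.size(0)) - 1          # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / u.size(0)

torch.manual_seed(0)
N, D = 1024, 512
u = F.normalize(torch.randn(N, D), dim=-1)         # near-random "image" embeddings
v = F.normalize(torch.randn(N, D), dim=-1)         # near-random "text" embeddings

loss_b0 = siglip_loss(u, v, t=10.0, b=0.0)         # every entry starts near sigmoid(0) = 0.5
loss_b10 = siglip_loss(u, v, t=10.0, b=-10.0)      # every entry starts as a confident negative
print(f"b=0: {loss_b0.item():.1f}   b=-10: {loss_b10.item():.1f}")
```

With $b = 0$ the roughly $N^2$ negatives each contribute about $\log 2$, so the initial loss grows with $N$; with $b = -10$ the negatives contribute almost nothing and the loss is dominated by the $N$ positives.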

4.3 SigLIP vs CLIP comparison

| Dimension | CLIP (softmax) | SigLIP (sigmoid) |
|---|---|---|
| Loss form | $\propto$ logsumexp(row) + logsumexp(col) | $\propto$ $\sum_{ij}$ binary logistic |
| Batch dependence | Strong (gradient couples batch) | Weak (entries independent) |
| Communication | all-gather embeddings | chunked all-pair sigmoid |
| Bias term | None (implicitly absorbed by softmax) | learnable $b$, init $\approx -10$ |
| Small-batch behavior | Poor (< 4k barely learns) | Significantly better (1k learns) |
| Large-batch behavior | Diminishing returns | Keeps rising through 32k+ |
| Zero-shot ImageNet (ViT-L/14, 400M data) | ~75% | ~76–78% |

4.4 SigLIP-2 (Google 2025)

Tschannen et al. 2025, building on SigLIP-1:

4.5 Code: SigLIP sigmoid loss (core 35 lines)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SigLIPLoss(nn.Module):
    """Sigmoid Loss for Language Image Pre-training (Zhai et al. 2023)."""
    def __init__(self, init_t=10.0, init_b=-10.0):
        super().__init__()
        # log-parameterize t for stability; b is a learnable bias
        self.t_prime = nn.Parameter(torch.tensor(init_t).log())
        self.b = nn.Parameter(torch.tensor(float(init_b)))

    def forward(self, image_feats, text_feats):
        u = F.normalize(image_feats, dim=-1)             # [N, D]
        v = F.normalize(text_feats, dim=-1)              # [N, D]

        t = self.t_prime.exp()                           # scale > 0
        logits = t * (u @ v.t()) + self.b                # [N, N]

        # y_{ij} = +1 if i == j else -1
        N = u.size(0)
        labels = 2 * torch.eye(N, device=u.device) - 1   # [N, N], +1 on diag, -1 off

        # log(1 + exp(-y * logits))  ==  -log sigmoid(y * logits)
        loss = -F.logsigmoid(labels * logits).sum() / N  # SigLIP convention: sum / N
        return loss, t, self.b
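
SigLIP normalizes by N, not N²

the paper's Eq. (1) normalizes by batch size $N$ (sum per row), not by the number of matrix elements $N^2$. Pitfall: loss.mean() divides by $N^2$, which makes the loss and all of its gradients $N$ times smaller than under the paper's convention, so hyperparameters tuned for that convention (learning rate, the dynamics of the learnable $t$ and $b$) no longer transfer. Correct: loss.sum() / N.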

§5 EVA-CLIP / OpenCLIP / other CLIP variants

5.1 OpenCLIP

OpenCLIP (Cherti et al. 2023 CVPR) is the LAION team's open-source reproduction + extension:

5.2 EVA-CLIP

EVA-CLIP (Sun et al. 2023) uses MIM-pretrained EVA / EVA-02 (Fang et al. 2023) as vision-tower initialization, substantially improving sample efficiency:

5.3 DataComp (data vs model vs algorithm)

Gadre et al. 2023 (NeurIPS) designed a "data filtering benchmark": fix (model, compute) and only vary the data filter. Conclusions:

5.4 Comparison overview

| Method | Vision tower init | Loss | Batch | Training data | ImageNet zero-shot |
|---|---|---|---|---|---|
| CLIP (OpenAI) | from scratch | softmax InfoNCE | 32k | WIT 400M | 76.2% (L/14@336) |
| OpenCLIP | from scratch | softmax InfoNCE | 90k | LAION-2B | 80.1% (bigG/14) |
| EVA-CLIP | EVA-02 MIM | softmax InfoNCE | | LAION-2B | 82.0% (E/14+) |
| SigLIP | from scratch | sigmoid | 32k | WebLI | 82.0% (So400M/14) |
| SigLIP-2 | from scratch | sigmoid + caption + distill | | WebLI 10B | 84%+ |
| MetaCLIP | from scratch | softmax InfoNCE | | reconstructed LAION-grade | 79.2% (H/14) |

2024–2025 trend

the SigLIP family has stably surpassed CLIP on zero-shot ImageNet and downstream retrieval; representative open-weight VLMs using SigLIP-So400M are PaliGemma / LLaVA-OneVision / Molmo. The InternVL series uses their in-house InternViT; Qwen2-VL trains its own ViT; LLaVA-1.5/1.6 still use CLIP ViT-L/14 — "switching to SigLIP" is not an industry consensus.

§6 LLaVA: projector + 2-stage training

6.1 Architecture

LLaVA (Liu et al. 2023 NeurIPS) centers on a trio:

Image ──► CLIP ViT-L/14 ──► visual features  z_v ∈ R^{N × d_v}
                                  │
                                  │  W ∈ R^{d_v × d_LLM}   ← MLP projector
                                  ↓
                            H_v ∈ R^{N × d_LLM}
                                  │
                                  │  concatenated with text embedding
                                  ↓
Text tokens ──► tokenizer ──► H_t ──► [<bos>, H_v, H_t] ──► LLM (Vicuna / LLaMA-2)
                                                              │
                                                              ↓
                                                            autoregressive response

6.2 Two-stage training

Stage 1: Feature Alignment Pre-training

Stage 2: End-to-end Visual Instruction Tuning

6.3 Code: LLaVA-style projector + forward (core 60 lines)

import torch
import torch.nn as nn

class LLaVAProjector(nn.Module):
    """2-layer MLP + GELU, as in LLaVA-1.5."""
    def __init__(self, d_vision=1024, d_llm=4096):
        super().__init__()
        self.fc1 = nn.Linear(d_vision, d_llm)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_llm, d_llm)

    def forward(self, x):                               # x: [B, N, d_vision]
        return self.fc2(self.act(self.fc1(x)))          # [B, N, d_llm]

class LLaVA(nn.Module):
    """Skeleton: CLIP vision tower + projector + LLM."""
    def __init__(self, vision_tower, projector, llm, image_token_id):
        super().__init__()
        self.vision_tower = vision_tower                # CLIPViT, frozen at stage 1
        self.projector = projector
        self.llm = llm                                  # e.g. LlamaForCausalLM
        self.image_token_id = image_token_id           # special <image> placeholder

    @torch.no_grad()
    def encode_image(self, pixel_values):
        # Take the second-to-last layer's patch features (skip [CLS])
        vit_out = self.vision_tower(pixel_values, output_hidden_states=True)
        feat = vit_out.hidden_states[-2][:, 1:, :]      # drop CLS, [B, N, d_v]
        return feat

    def forward(self, input_ids, pixel_values, labels=None, attention_mask=None):
        # 1. Visual features → projector → LLM dim
        with torch.no_grad():
            visual_features = self.encode_image(pixel_values)        # [B, N, d_v]
        visual_tokens = self.projector(visual_features)              # [B, N, d_llm]

        # 2. LLM's word embedding table
        token_embeds = self.llm.get_input_embeddings()(input_ids)    # [B, L, d_llm]

        # 3. Replace the <image> placeholder positions with visual_tokens
        B, L, D = token_embeds.shape
        new_embeds, new_labels, new_mask = [], [], []
        for b in range(B):
            image_pos = (input_ids[b] == self.image_token_id).nonzero(as_tuple=True)[0]
            assert image_pos.numel() == 1, "exactly one <image> placeholder expected"
            i = image_pos.item()
            # Concat: [prefix tokens] + [N visual tokens] + [suffix tokens]
            chunks = [token_embeds[b, :i], visual_tokens[b], token_embeds[b, i+1:]]
            new_embeds.append(torch.cat(chunks, dim=0))
            if labels is not None:
                lab = labels[b]
                # Label = -100 at visual token positions (not counted in loss)
                ignore = torch.full((visual_tokens.size(1),), -100, dtype=lab.dtype, device=lab.device)
                new_labels.append(torch.cat([lab[:i], ignore, lab[i+1:]], dim=0))
            if attention_mask is not None:
                am = attention_mask[b]
                ones = torch.ones(visual_tokens.size(1), dtype=am.dtype, device=am.device)
                new_mask.append(torch.cat([am[:i], ones, am[i+1:]], dim=0))

        # 4. Pad back to a batch tensor and feed the LLM
        inputs_embeds = torch.nn.utils.rnn.pad_sequence(new_embeds, batch_first=True)
        labels = torch.nn.utils.rnn.pad_sequence(new_labels, batch_first=True, padding_value=-100) if labels is not None else None
        attention_mask = torch.nn.utils.rnn.pad_sequence(new_mask, batch_first=True) if attention_mask is not None else None
        return self.llm(inputs_embeds=inputs_embeds, labels=labels, attention_mask=attention_mask)

6.4 LLaVA-1.5 / 1.6 / NeXT key upgrades

| Version | Time | Main changes |
|---|---|---|
| LLaVA-1.0 | 2023.04 | Single Linear projector; CLIP ViT-L/14@224², visual tokens = 256 ($16\times 16$) |
| LLaVA-1.5 | 2023.10 | 2-layer MLP; resolution up to 336², visual tokens = 576 ($24\times 24$); adds OCR / GQA / VQAv2 academic data |
| LLaVA-1.6 / NeXT | 2024.01 | AnyRes: slice the image into $2\times 2 / 1\times 3 / \dots$ tiles and encode each, supporting any aspect ratio; up to 2880 tokens |
| LLaVA-OneVision | 2024.08 | Unified single / multi-image / video; introduces a mix of SI (single image) + OV (onevision) data |
| LLaVA-NeXT-Video | 2024.04 | Video version; feed the LLM serialized visual features from multiple frames |

Core trick of AnyRes (LLaVA-1.6)

the vision tower still assumes a fixed 336² input, so a high-res image is sliced into $n \times m$ tiles of 336² each, encoded individually, plus one "global thumbnail" (the full image resized to 336²). Tokens go from 576 to $(1 + n\cdot m)\cdot 576$, e.g. $5 \cdot 576 = 2880$ for a $2\times 2$ grid, but each tile passes through the same frozen ViT. InternVL's dynamic tiling belongs to the same family.
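
The token budget follows directly from the tile grid. The helper below is a small arithmetic sketch; the name `anyres_token_count` is hypothetical, and it assumes 336² tiles with patch size 14 and no token pruning or unpadding.

```python
def anyres_token_count(n_rows, n_cols, tile_size=336, patch_size=14):
    """Visual tokens for an n_rows x n_cols tile grid plus one global thumbnail."""
    tokens_per_tile = (tile_size // patch_size) ** 2      # 24 * 24 = 576
    return (1 + n_rows * n_cols) * tokens_per_tile

for grid in [(1, 2), (1, 3), (2, 2)]:
    print(grid, anyres_token_count(*grid))
# (1, 2) 1728
# (1, 3) 2304
# (2, 2) 2880
```

The thumbnail costs a fixed 576 tokens, so the total grows linearly in the number of tiles, which is why the tile grid has to be capped to keep the LLM context manageable.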

§7 BLIP-2: Q-Former cross-attention

7.1 Motivation

LLaVA's projector is simple, but every patch becomes an LLM token: higher resolution ↑ more tokens ↑ LLM compute $O(L^2)$ ↑. BLIP-2 (Li et al. 2023 ICML) uses a Q-Former (Querying Transformer) to compress an arbitrary number of patches into a fixed 32 tokens.

7.2 Q-Former structure

Input: frozen image encoder output $\mathbf{Z} \in \mathbb{R}^{N \times d_v}$ (N=257 for ViT-g/14@224). The Q-Former has 32 learnable query tokens $\mathbf{q}_1, \dots, \mathbf{q}_{32} \in \mathbb{R}^{d_q}$.

Per Q-Former block:

$$\mathbf{q}^{(\ell)} = \text{SelfAttn}(\mathbf{q}^{(\ell-1)})$$

$$\mathbf{q}^{(\ell)} = \text{CrossAttn}(\mathbf{q}^{(\ell)},\ \mathbf{Z},\ \mathbf{Z})\quad \text{(inserted only every other layer)}$$

$$\mathbf{q}^{(\ell)} = \text{FFN}(\mathbf{q}^{(\ell)})$$

Key points:

7.3 Two-stage training

Stage 1: Representation Learning (only Q-Former trained, vision encoder frozen)

Stage 2: Generative Learning (only Q-Former trained, LLM frozen)

7.4 Code: a single Q-Former cross-attention layer (core 40 lines)

import torch
import torch.nn as nn

class QFormerLayer(nn.Module):
    """One Q-Former block: SelfAttn (queries) -> CrossAttn (queries <- image) -> FFN."""
    def __init__(self, d_q=768, d_v=1408, num_heads=12, mlp_ratio=4, has_cross=True):
        super().__init__()
        self.has_cross = has_cross
        self.ln_self = nn.LayerNorm(d_q)
        self.self_attn = nn.MultiheadAttention(d_q, num_heads, batch_first=True)
        if has_cross:
            self.ln_cross = nn.LayerNorm(d_q)
            # Q comes from query (d_q), K/V come from image feats (d_v) -> adapt via kdim/vdim
            self.cross_attn = nn.MultiheadAttention(d_q, num_heads,
                                                   kdim=d_v, vdim=d_v, batch_first=True)
        self.ln_ffn = nn.LayerNorm(d_q)
        hidden = int(d_q * mlp_ratio)
        self.ffn = nn.Sequential(nn.Linear(d_q, hidden), nn.GELU(), nn.Linear(hidden, d_q))

    def forward(self, q, image_feats=None):              # q: [B, 32, d_q]
        # Self-attention: queries talk to each other
        h = self.ln_self(q)
        a, _ = self.self_attn(h, h, h, need_weights=False)
        q = q + a
        # Cross-attention: queries attend to image patches
        if self.has_cross and image_feats is not None:
            h = self.ln_cross(q)
            a, _ = self.cross_attn(h, image_feats, image_feats, need_weights=False)
            q = q + a
        # FFN
        q = q + self.ffn(self.ln_ffn(q))
        return q

class QFormer(nn.Module):
    def __init__(self, num_queries=32, d_q=768, d_v=1408, depth=12, num_heads=12,
                 cross_every=2):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, num_queries, d_q))
        nn.init.trunc_normal_(self.queries, std=0.02)
        self.layers = nn.ModuleList([
            QFormerLayer(d_q, d_v, num_heads, has_cross=(i % cross_every == 0))
            for i in range(depth)
        ])

    def forward(self, image_feats):                      # [B, N, d_v]
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)               # [B, 32, d_q]
        for layer in self.layers:
            q = layer(q, image_feats)
        return q                                         # [B, 32, d_q]
`kdim`/`vdim` adaptation

nn.MultiheadAttention defaults to K/V input dim = embed_dim. In Q-Former cross-attn, query is 768-dim and image feats are 1408-dim, so you must explicitly pass kdim=d_v, vdim=d_v, otherwise PyTorch will expect 768-dim K/V at forward time and raise a shape mismatch error (no silent truncation).
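
The difference is quick to reproduce. The snippet is a minimal check with random tensors in the shapes discussed above (32 queries of width 768, 257 image tokens of width 1408); it catches a few exception types because the exact error class is an implementation detail of PyTorch.

```python
import torch
import torch.nn as nn

q = torch.randn(2, 32, 768)            # learnable queries, d_q = 768
img = torch.randn(2, 257, 1408)        # image features, d_v = 1408

# Correct: tell the module that K/V inputs live in a 1408-dim space
ok = nn.MultiheadAttention(768, 12, kdim=1408, vdim=1408, batch_first=True)
out, _ = ok(q, img, img, need_weights=False)
print(out.shape)                       # torch.Size([2, 32, 768])

# Incorrect: kdim / vdim default to embed_dim = 768, so 1408-dim K/V cannot be projected
bad = nn.MultiheadAttention(768, 12, batch_first=True)
try:
    bad(q, img, img, need_weights=False)
    raised = False
except (RuntimeError, AssertionError, ValueError):
    raised = True
print(raised)
```

An alternative design is to project the image features to `d_q` with a separate `nn.Linear` first and keep the attention module in its default configuration.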

7.5 Q-Former vs LLaVA Projector: trade-off

| Dimension | LLaVA Projector | BLIP-2 Q-Former |
|---|---|---|
| Parameter count | ~20M (MLP) | ~180M (Q-Former + queries) |
| Compute | Only MLP forward | 12 Q-Former layers (cross-attn in every other layer) |
| Visual token count | $N$ (quadratic in resolution) | Fixed 32 |
| Information loss | Almost 0 (every patch enters the LLM) | Significant (256+ patches compressed to 32) |
| Training complexity | 1 stage (pretrain) + 1 stage (IT) | 2 stages (representation + generation); stage 1 jointly optimizes ITC + ITM + ITG |
| LLM context usage | Large (576–2880 tokens) | Small (32 tokens) |
| Best for | High resolution / detail tasks | LLM context-limited / batched multimodal inference |

2024–2025 mainstream returns to projector

Qwen2-VL / LLaVA-NeXT / InternVL-2 / DeepSeek-VL2 all use a projector (with spatial reduction / pixel shuffle to control token count); Q-Former has faded out in industrial VLMs. But the Q-Former idea is still active in video VLMs (using queries for frame-level pooling).

§8 Flamingo: Perceiver Resampler + Gated Cross-Attn

8.1 Design goals

Alayrac et al. 2022 (NeurIPS) wanted to add visual ability to a frozen 70B LLM without breaking its text ability. Design choices:

8.2 Perceiver Resampler

Similar to Q-Former, "use latent queries to compress the image". The Flamingo paper Sec 3.1 pseudocode is multi-layer (each layer = cross-attention + FFN), with the default config approximately $L=6$ layers. Per-layer update:

$$\mathbf{q}^{(\ell+1)} = \mathbf{q}^{(\ell)} + \text{CrossAttn}\!\left(\mathbf{q}^{(\ell)},\ [\mathbf{q}^{(\ell)};\ \mathbf{Z}],\ [\mathbf{q}^{(\ell)};\ \mathbf{Z}]\right), \quad \mathbf{q}^{(\ell+1)} = \mathbf{q}^{(\ell+1)} + \text{FFN}(\mathbf{q}^{(\ell+1)})$$

Note K/V is concat(query, image_feat), not just image_feat — so query tokens can also attend to each other. Overall still lighter than BLIP-2 Q-Former (12 layers + self-attn + cross-every-2). Output 64 latent visual tokens (independent of the input patch count).
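
The per-layer update can be sketched in PyTorch as below. This is an illustrative reconstruction from the formula above, not Flamingo's released code: the class names, the pre-norm LayerNorm placement, the small width of 256, and the assumption that image features already have the same width as the latents are all choices made for this example.

```python
import torch
import torch.nn as nn

class ResamplerLayer(nn.Module):
    """One layer: latents cross-attend to concat(latents, image feats), then FFN."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.ln_q = nn.LayerNorm(dim)        # assumption: pre-norm on latents
        self.ln_z = nn.LayerNorm(dim)        # assumption: pre-norm on image feats
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, q, z):                 # q: [B, 64, D], z: [B, N, D]
        q_n = self.ln_q(q)
        kv = torch.cat([q_n, self.ln_z(z)], dim=1)        # K/V = [latents; image], [B, 64 + N, D]
        a, _ = self.attn(q_n, kv, kv, need_weights=False)
        q = q + a                                         # residual around cross-attention
        return q + self.ffn(self.ln_ffn(q))               # residual around FFN

class PerceiverResampler(nn.Module):
    def __init__(self, dim=256, num_latents=64, depth=6, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([ResamplerLayer(dim, num_heads) for _ in range(depth)])

    def forward(self, z):                    # z: [B, N, D] for any N
        q = self.latents.expand(z.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, z)
        return q                             # [B, 64, D], independent of N

resampler = PerceiverResampler()
out_small = resampler(torch.randn(2, 256, 256))     # 256 image tokens in
out_large = resampler(torch.randn(2, 1024, 256))    # 1024 image tokens in
print(out_small.shape, out_large.shape)
```

Both calls return 64 tokens per sample, which is the property that keeps the downstream cross-attention cost fixed regardless of image resolution or frame count.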

8.3 Gated Cross-Attention (the core innovation)

Insert a new cross-attention module into the LLM every $k$ layers (e.g. every 4):

$$\mathbf{h}'_\ell = \mathbf{h}_\ell + \tanh(\alpha_\text{attn}) \cdot \text{CrossAttn}(\mathbf{h}_\ell, \mathbf{q}_\text{out}, \mathbf{q}_\text{out})$$

$$\mathbf{h}''_\ell = \mathbf{h}'_\ell + \tanh(\alpha_\text{ffn}) \cdot \text{FFN}(\mathbf{h}'_\ell)$$

Key: $\alpha_\text{attn}, \alpha_\text{ffn}$ are learnable scalars, initialized to 0. So $\tanh(0)=0$, meaning the added cross-attn contributes zero to the LLM output at init — the LLM behaves identically to the unmodified frozen LLM with no visual module. During training, $\alpha$ gradually learns nonzero values and visual information starts flowing in.
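
A minimal sketch of such a block, showing that it is an exact identity map at initialization. The class name `GatedCrossAttnBlock`, the LayerNorm placement, and the small dimensions are assumptions made for this example; only the two zero-initialized tanh gates are taken from the equations above.

```python
import torch
import torch.nn as nn

class GatedCrossAttnBlock(nn.Module):
    """Cross-attention + FFN, each scaled by a tanh gate that starts at zero."""
    def __init__(self, dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.ln_attn = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.alpha_attn = nn.Parameter(torch.zeros(1))    # tanh(0) = 0 -> gate closed
        self.alpha_ffn = nn.Parameter(torch.zeros(1))

    def forward(self, h, visual):            # h: [B, L, D] text states, visual: [B, M, D]
        a, _ = self.cross_attn(self.ln_attn(h), visual, visual, need_weights=False)
        h = h + torch.tanh(self.alpha_attn) * a
        h = h + torch.tanh(self.alpha_ffn) * self.ffn(self.ln_ffn(h))
        return h

torch.manual_seed(0)
block = GatedCrossAttnBlock()
h = torch.randn(2, 10, 64)                   # hidden states of the frozen LLM
visual = torch.randn(2, 64, 64)              # e.g. 64 resampled visual tokens
out = block(h, visual)
print(torch.allclose(out, h))                # gates are closed, so the block is an identity

out.sum().backward()
print(block.alpha_attn.grad is not None)     # the gates still receive gradient
```

At initialization the attention and FFN weights inside the block get zero gradient (their output is multiplied by $\tanh(0) = 0$), so the gates have to open first before the new modules start to learn.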

This is the "residual graft"

Llama-3.2-Vision (Meta 2024) uses essentially the same design: frozen LLaMA-3 + learning a gated cross-attn adapter. Pros: fully preserves text-only performance; cons: the visual capability ceiling is lower than the fine-tuned LLM in LLaVA/Qwen-VL.

8.4 Flamingo / Llama-3.2-V vs LLaVA comparison

| Aspect | Flamingo / Llama-3.2-V | LLaVA / Qwen-VL |
|---|---|---|
| LLM unfrozen? | No (frozen) | Yes (unfrozen in stage 2) |
| Image as tokens? | No (as KV) | Yes (as tokens) |
| Text-only ability preserved | ✅ Fully | ⚠️ May regress slightly |
| Visual understanding ceiling | Limited by cross-attn capacity | Higher (LLM can "think about" the image) |
| Training data | interleaved | image-instruction pairs |
| Applicable | Large LLM + no retraining | Small/medium LLM + vision-centric |

§9 CogVLM and "visual experts" / cross-attn fusion variants

9.1 CogVLM: a visual expert branch

Wang et al. 2023 (CogVLM)'s core idea: in the LLM's attention / FFN, replicate a parallel branch for visual tokens, sharing the attention computation with the original text branch but using different projections:

                  attention
       ┌──────────────┴──────────────┐
       ↓                              ↓
   text projection (frozen)    vision expert projection (trainable)
       │                              │
       └──────────────┬──────────────┘
                      ↓
         token-wise route: if visual_token, use vision branch
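The token-wise routing in the diagram can be sketched as below. This is a simplified illustration under stated assumptions, not CogVLM's code: the class name `VisualExpertQKV`, the toy width, and the use of `torch.where` for routing are choices made here, and only the QKV projection is shown (the source applies the same idea to the FFN).

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    """Hypothetical QKV projection with a separate trainable branch for visual tokens."""

    def __init__(self, d_model=32):
        super().__init__()
        self.text_qkv = nn.Linear(d_model, 3 * d_model)     # original LLM projection
        self.vision_qkv = nn.Linear(d_model, 3 * d_model)   # visual expert projection
        for p in self.text_qkv.parameters():
            p.requires_grad_(False)                         # text branch stays frozen

    def forward(self, h, is_visual):
        # h: [B, L, D]; is_visual: [B, L] bool mask, True where the token is an image token
        qkv = torch.where(is_visual.unsqueeze(-1), self.vision_qkv(h), self.text_qkv(h))
        # q, k, v then enter one shared attention computation over all tokens
        return qkv.chunk(3, dim=-1)

torch.manual_seed(0)
layer = VisualExpertQKV()
h = torch.randn(1, 4, 32)
is_visual = torch.tensor([[True, True, False, False]])
q, k, v = layer(h, is_visual)
```

Computing both branches for every token and selecting with a mask is wasteful but keeps the sketch short; a real implementation would more likely gather the visual and text tokens separately.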

9.2 Llama-3.2 Vision: Flamingo-style cross-attn revived on a large LLM

Meta released Llama-3.2-V (11B / 90B) in September 2024:

9.3 Claude 3.5/3.7 Sonnet Vision and GPT-4V/4o

The closed-source architectures of Anthropic / OpenAI are undisclosed, but inferences from API behavior:

§10 Qwen2-VL / DeepSeek-VL: dynamic resolution + M-RoPE

10.1 Native dynamic resolution

Qwen2-VL (Wang et al. 2024), DeepSeek-VL (Lu et al. 2024), and InternVL-2 all abandon the "resize to fixed 224²" tradition:

10.2 M-RoPE (Multimodal RoPE)

Qwen2-VL's core innovation. Recap of ordinary 1D RoPE: treat each pair $(2k, 2k+1)$ of query / key dimensions as a complex number, multiplied by a position-dependent rotation:

$$\mathbf{R}_{m,k} = \begin{pmatrix} \cos(m\theta_k) & -\sin(m\theta_k) \\ \sin(m\theta_k) & \cos(m\theta_k) \end{pmatrix}, \quad \theta_k = 10000^{-2k/d}$$

After applying to $\mathbf{q}_m$, $\mathbf{q}_m^\top \mathbf{k}_n$ depends only on $m - n$ (relative position).

M-RoPE's extension: a visual token has three position dimensions (t, h, w). All head_dim dimensions rotate — but each pair $(2k, 2k+1)$ uses one of the three position ids (t / h / w) for its rotation angle, depending on which segment it falls into:

$$(\cos(m_\text{axis}\,\theta_k),\ \sin(m_\text{axis}\,\theta_k)), \quad \text{axis} \in \{t, h, w\}$$

Specifically, the unit of Qwen2-VL's mrope_section is rotation pairs: each number says how many 2-D rotation pairs an axis receives. One pair covers 2 real dimensions, so "section sum × 2 = head_dim". (In the LLaMA-style rotate_half layout used by the implementation, pair $k$ consists of dims $(k, k + \text{head\_dim}/2)$ rather than $(2k, 2k+1)$; the pair count is the same.)

Qwen2-VL default `mrope_section = [16, 24, 24]`

That is, the three axes occupy 16 / 24 / 24 pairs respectively, for a total of $(16+24+24) \times 2 = 128 = $ head_dim. The implementation doubles the section to $[16, 24, 24, 16, 24, 24]$ to slice head_dim, with position ids (t, h, w, t, h, w) used for the rotation, so all 128 dims rotate and none is left unrotated. The spatial axes (h, w) together take 48 pairs versus 16 for the temporal axis (t); one reading is that inter-frame changes in video are slow, while content within a frame varies sharply across positions.
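To make the slicing concrete, the short sketch below lists which position id (t, h, or w) drives each of the 128 head dimensions under the default section. The helper name `mrope_axis_map` is hypothetical; only the `[16, 24, 24]` default and the doubling to six segments come from the text above.

```python
def mrope_axis_map(mrope_section=(16, 24, 24)):
    """Return one axis label per head dimension for the doubled section layout."""
    half = []
    for axis, n_pairs in zip("thw", mrope_section):
        half.extend([axis] * n_pairs)   # first half of head_dim: t, then h, then w
    return half + half                  # second half mirrors the first (rotate_half layout)

axes = mrope_axis_map()
head_dim = len(axes)                               # 2 * (16 + 24 + 24)
dims_per_axis = {a: axes.count(a) for a in "thw"}  # real dims, i.e. 2 per pair
```

Every index gets a label, which is another way of seeing that no dimension is left unrotated.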

A text token has no explicit (h, w): Qwen2-VL sets $m_t = m_h = m_w$ equal to that text token's 1D position id, so all three axes yield the same rotation angle, equivalent to ordinary 1D RoPE.

10.3 Qwen2.5-VL upgrades

Qwen2.5-VL (Bai et al. 2025), on top of Qwen2-VL:

10.4 DeepSeek-VL / VL2: high-resolution tiling + hybrid encoder

DeepSeek-VL (Lu et al. 2024) uses dual vision encoders:

The two features are concatenated and fed to projector + LLM. DeepSeek-VL2 (2024.12) further replaces the LLM with MoE + dynamic resolution; a single image can use 1700+ visual tokens.

10.5 Code: M-RoPE three-dim positional embedding (core 50 lines, matching Qwen2-VL's HF implementation)

import torch

def build_mrope_cos_sin(positions, head_dim, mrope_section=(16, 24, 24), base=10000.0):
    """
    Build cos/sin tensors for Qwen2-VL style M-RoPE.

    positions: LongTensor [3, B, L]   (axis 0: t / h / w; B batch; L seq len)
    head_dim:  per-head dim (must equal 2 * sum(mrope_section))
    mrope_section: tuple of 3 ints; each = number of (half-dim) entries per axis
    Returns: cos, sin both [B, L, head_dim], ready for LLaMA-style rotate_half.
    """
    assert 2 * sum(mrope_section) == head_dim, "2 * sum(mrope_section) must = head_dim"
    half = head_dim // 2                                                # = sum(mrope_section)

    # Standard RoPE frequencies: θ_k = base^{-2k/head_dim}, k = 0..half-1
    inv_freq = 1.0 / (base ** (torch.arange(0, half).float() * 2 / head_dim))   # [half]
    inv_freq = inv_freq.to(positions.device)

    # For each axis, compute angles / cos / sin of shape [B, L, half]
    cos_axes, sin_axes = [], []
    for a in range(3):
        ang = positions[a].float().unsqueeze(-1) * inv_freq                     # [B, L, half]
        cos_axes.append(ang.cos())
        sin_axes.append(ang.sin())

    # Slice half-dim into 3 segments by mrope_section; pick cos/sin for t/h/w
    cos_chunks, sin_chunks = [], []
    offset = 0
    for axis, s in enumerate(mrope_section):
        cos_chunks.append(cos_axes[axis][..., offset:offset+s])                 # [B, L, s]
        sin_chunks.append(sin_axes[axis][..., offset:offset+s])
        offset += s
    cos_half = torch.cat(cos_chunks, dim=-1)                                    # [B, L, half]
    sin_half = torch.cat(sin_chunks, dim=-1)

    # LLaMA-RoPE style: duplicate to full head_dim
    cos = torch.cat([cos_half, cos_half], dim=-1)                               # [B, L, head_dim]
    sin = torch.cat([sin_half, sin_half], dim=-1)
    return cos, sin

def rotate_half(x):
    """(x1, x2) -> (-x2, x1), LLaMA convention."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_mrope(q, k, cos, sin):
    """
    q, k:    [B, num_heads, L, head_dim]
    cos, sin:[B, L, head_dim]
    """
    cos = cos.unsqueeze(1)                                                       # broadcast over heads
    sin = sin.unsqueeze(1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Three common M-RoPE misreadings

easy traps.

§11 Video VLM: LongVA / VideoLLaMA / long-video problem

11.1 Basic pipeline

Video = multi-frame image. Common VLM approach to video:

  1. Uniformly sample $K$ frames (e.g. 8 / 16 / 32)
  2. Each frame through the vision encoder → $N$ patch tokens per frame
  3. Token sequence concat: feed $K \cdot N$ visual tokens to the LLM

Problem: $K=32, N=576 \Rightarrow 18432$ tokens — beyond the context of an LLM that was instruction-tuned on single images.
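The arithmetic behind this problem is simple enough to script. The sketch below tabulates $K \cdot N$ for the frame counts listed above; the function name and the 4096-token context limit are illustrative assumptions, not figures from any specific model.

```python
def video_visual_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Visual tokens when every sampled frame is encoded separately and concatenated."""
    return num_frames * tokens_per_frame

ASSUMED_CONTEXT = 4096   # hypothetical context length, for illustration only

budget = {}
for k in (8, 16, 32):
    n_tokens = video_visual_tokens(k, tokens_per_frame=576)
    budget[k] = (n_tokens, n_tokens <= ASSUMED_CONTEXT)
# budget == {8: (4608, False), 16: (9216, False), 32: (18432, False)}
```

Even 8 frames at 576 tokens each exceed the assumed 4096-token window, which is why some form of token compression is usually needed.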

11.2 Common compression strategies

11.3 LongVA / Long-context video

LongVA (Zhang et al. 2024) and similar work exploit long-context LLMs (200K+ tokens) to consume long video directly as an unrolled token sequence, targeting hour-long video QA; in the Qwen2-VL family this is paired with the M-RoPE temporal dim. Qwen2-VL reports handling 20-minute video; Qwen2.5-VL pushes to 1+ hour.

11.4 Video benchmarks

§12 Training pipeline: alignment / instruct / preference

12.1 Stage 1: Alignment / Pre-training

Goal: align visual features to be near the LLM token space.

12.2 Stage 2: Visual instruction tuning

Goal: teach the VLM "to look at images, answer questions, follow instructions".

12.3 Stage 3: Preference / RLHF

Goal: reduce hallucinations, improve helpfulness / harmlessness, and align to long-tail tasks.

| Method | Time | Core |
|---|---|---|
| LLaVA-RLHF | 2023.09, Sun et al. | PPO + human preference + hallucination-aware reward |
| RLAIF-V | 2024, Yu et al. | AI feedback in place of human labels; divide-and-conquer |
| POVID | 2024 | DPO + deliberately constructed hallucination negatives |
| VLM-R1 | 2025 | GRPO + verifiable reward (R1-style for visual reasoning) |
| Bespoke / R1-Onevision | 2025 | Visual chain-of-thought + RL refinement |
VLM-R1 (2025) is the current hotspot: it ports DeepSeek-R1's "verifiable reward + GRPO" recipe to visual tasks (e.g. ScienceQA, MMMU). The reward comes from whether the answer matches the ground truth (no process reward model); after training, the visual reasoning chain grows noticeably longer and benchmark scores improve.

12.4 Data scale vs stage

| Stage | Data volume | Training tokens | Unfrozen modules |
|---|---|---|---|
| Alignment | 0.5–5M captions | 1–10B | projector |
| Instruction tune | 0.2–10M instructions | 1–50B | LLM + projector |
| Preference | 50k–500k preference pairs | 100M–1B | LLM (LoRA / full) |

§13 Multimodal Embeddings: BGE-VL / Jina-CLIP / VLM2Vec

13.1 Why a new generation of multimodal embeddings?

CLIP is trained for "image ↔ short caption" alignment and performs poorly on long instruction retrieval / multi-image / interleaved document retrieval. The new generation of multimodal embedding models targets retrieval / RAG scenarios.

13.2 Representative methods

13.3 Core tricks

§14 25 frequently-asked interview questions (L1 must-know · L2 advanced · L3 top labs)

L1 basics (must-know 10)

Q1. What is the CLIP loss? Why must it be symmetric?
  • CLIP uses symmetric InfoNCE: $\mathcal{L} = \tfrac12(\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$
  • $\mathcal{L}_{i\to t}$: apply row softmax on the similarity matrix $\mathbf{S}$, take NLL of the diagonal (image retrieves text)
  • $\mathcal{L}_{t\to i}$: column softmax, NLL of the diagonal (text retrieves image)
  • Necessity of symmetry: one direction constrains only one retrieval direction; symmetrization lets image / text embeddings constrain each other, preventing "one-sided collapse" — e.g. image-side clusters but text-side drifts

Pitfall: answering only "InfoNCE" without saying "average of two softmaxes", or saying "do it backward once" without explaining why it is necessary.
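The "average of two softmaxes" can be written in a few lines of PyTorch. This is a minimal single-device sketch, not OpenAI's or OpenCLIP's code: the function name is hypothetical, and the cap on the scale is applied inside the forward pass for simplicity.

```python
import math
import torch
import torch.nn.functional as F

def clip_symmetric_loss(img, txt, logit_scale):
    """img, txt: [N, D] embeddings; logit_scale: scalar tensor holding log(1/tau)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    scale = logit_scale.exp().clamp(max=100.0)        # 1/tau, capped at 100
    logits = scale * (img @ txt.t())                  # [N, N] similarity matrix S
    targets = torch.arange(img.size(0), device=img.device)   # matches sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # row softmax: image retrieves text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # column softmax: text retrieves image
    return 0.5 * (loss_i2t + loss_t2i)

logit_scale = torch.tensor(math.log(1 / 0.07))        # initial tau = 0.07
same = torch.ones(8, 16)                              # identical embeddings, all logits equal
uninformative = clip_symmetric_loss(same, same, logit_scale)
aligned = clip_symmetric_loss(torch.eye(8), torch.eye(8), logit_scale)
```

When every logit is equal the loss is $\log N$ (here $\log 8$), and when the diagonal clearly dominates it approaches zero, which gives two quick sanity checks for any implementation.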

Q2. What does the temperature τ in CLIP do? Why make it learnable?
  • Temperature $\tau$ controls softmax sharpness: $\tau \to 0$ approaches one-hot, focusing on the hardest negative; $\tau \to \infty$ becomes uniform with no gradient
  • OpenAI CLIP makes $\tau$ learnable (concretely parameterized as logit_scale = log(1/τ), more stable in backprop)
  • Learned steady state $\tau \approx 0.01$ (logit_scale ≈ log(100)), clamped from above to avoid collapse
  • Without making it learnable: hyperparameter-sensitive; each new data / model scale needs manual tuning

Pitfall: treating τ as a fixed 0.07 that is never learned (0.07 is only the initial value); or learning τ directly and dividing by it instead of learning logit_scale, which is less stable in backprop.

Q3. Core change of SigLIP vs CLIP? Why is it less sensitive to batch size?
  • Replace softmax InfoNCE with sigmoid binary CE: each $(i,j)$ pair is classified independently
  • $S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$; label $y_{ij} = +1$ if $i=j$, else $-1$; loss $= -\frac{1}{N}\sum_{i,j} \log\sigma(y_{ij} S_{ij})$
  • Batch decoupled: no loss term involves a row / column normalization, so doubling $N$ adds more negative terms but does not change how existing pairs are scored
  • Engineering gains: single-machine large batch, simpler cross-node communication, and small batches can also learn (CLIP degrades noticeably with small batches)

Pitfall: saying only "use sigmoid" without explaining batch-independence; or treating SigLIP as "CLIP with bias".
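A minimal sketch of the sigmoid loss above, assuming the usual parameterization where the scale $t$ is stored in log form. The function and variable names are illustrative; the initial values $t = 10$ and $b = -10$ follow the SigLIP paper's reported initialization.

```python
import math
import torch
import torch.nn.functional as F

def siglip_loss(img, txt, t_prime, b):
    """img, txt: [N, D]; t_prime: log of the scale t; b: learnable bias (scalar tensors)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = t_prime.exp() * (img @ txt.t()) + b            # S_ij = t * u_i . v_j + b
    n = img.size(0)
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0 # +1 on the diagonal, -1 elsewhere
    # Each (i, j) entry is an independent binary term; there is no softmax over the batch.
    return -F.logsigmoid(labels * logits).sum() / n

t_prime = torch.tensor(math.log(10.0))
b = torch.tensor(-10.0)
emb = torch.eye(8)                     # toy case: perfectly aligned, mutually orthogonal pairs
loss = siglip_loss(emb, emb, t_prime, b)
```

In this toy case every positive has logit 0 and every negative has logit $-10$, so the loss is $\log 2$ per positive plus a negligible contribution from the negatives.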

Q4. Is the [CLS] token in ViT mandatory?
  • No. The original ViT uses [CLS] to align with BERT conventions
  • Alternative: pool the patch tokens into the image representation, by mean-pool or attentive pool (e.g. SigLIP uses an attention-pooling head)
  • CLIP uses [CLS]: contrastive training requires a single vector
  • VLM vision towers usually drop [CLS]: LLaVA takes second-to-last layer patch tokens; [CLS] is not needed on the LLM side

Pitfall: treating [CLS] as a "mandatory component" of ViT; or saying "without [CLS] you cannot classify" (wrong; mean-pool also works).

Q5. What is LLaVA's projector? Why MLP instead of Linear?
  • The projector maps visual encoder output ($d_v$=1024) to LLM token space ($d_\text{llm}$=4096)
  • LLaVA-1.0: single Linear(1024, 4096); LLaVA-1.5: 2-layer MLP + GELU
  • MLP adds non-linear expressivity, letting visual features map more flexibly into the LLM's "vocabulary"
  • The paper reports MLP improves MM-Vet / SEED-Bench by 1–3 points over single Linear

Pitfall: saying only "project with a Linear"; or answering "use Q-Former" (that is BLIP-2, not LLaVA).
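The two projector variants described above differ by a single hidden layer. The sketch below uses the dimensions quoted in the answer (1024 and 4096); the helper name `build_projector` is hypothetical.

```python
import torch
import torch.nn as nn

def build_projector(d_vision=1024, d_llm=4096, mlp=True):
    """Single Linear (LLaVA-1.0 style) or 2-layer MLP with GELU (LLaVA-1.5 style)."""
    if not mlp:
        return nn.Linear(d_vision, d_llm)
    return nn.Sequential(nn.Linear(d_vision, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

projector = build_projector()
visual_feats = torch.randn(2, 256, 1024)    # [batch, patch tokens, d_vision]
visual_tokens = projector(visual_feats)     # [batch, patch tokens, d_llm]
```

The projector acts on each patch token independently, so the number of visual tokens passed to the LLM is unchanged; only the channel width is mapped.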

Q6. What do LLaVA's two training stages do?
  • Stage 1 Feature Alignment: train the projector only, freeze vision tower + LLM, using caption data (a 595K subset of CC3M in LLaVA-1.0; the 558K LAION / CC / SBU subset in LLaVA-1.5) to project visual features near the LLM embedding space
  • Stage 2 Instruction Tuning: unfreeze LLM + projector (vision tower still frozen), using GPT-4-generated 158K visual instructions, teaching the LLM to follow visual instructions
  • Why not train in one shot: jumping directly to stage 2 risks catastrophic forgetting of text ability; stage 1 first gives visual tokens a "near text-token" initialization, then instruction tuning is more stable

Pitfall: saying both stages "train the projector"; or omitting that stage 1 freezes the LLM, the key point.

Q7. What is a Q-Former? Pros/cons vs the LLaVA projector?
  • BLIP-2's Q-Former: a 12-layer Transformer with 32 learnable query tokens that read information from a frozen image encoder via cross-attention, outputting a fixed 32 visual tokens
  • Pros: fixed visual token count, low LLM context usage, compute budget stays the same as resolution grows
  • Cons: large information loss (256 patches compressed to 32), more parameters (~180M), complex training (two stages: stage 1 representation learning includes joint ITC+ITM+ITG, stage 2 connects to frozen LLM for generation)
  • 2024 mainstream returns to projector: Qwen2-VL / LLaVA-NeXT / InternVL-2 all use an MLP projector (the original Qwen-VL used a cross-attention adapter with learnable queries)

Pitfall: treating Q-Former as a synonym for projector; or not knowing modern VLMs prefer projectors.

Q8. How many patches / tokens does a ViT-L/14 yield on a 224×224 image?
  • Patch count = $(224/14)^2 = 16^2 = 256$
  • Token count = 256 + 1 (with [CLS]) = 257
  • If LLaVA-style second-to-last layer patch tokens (drop [CLS]) = 256 visual tokens
  • If resolution is 336 (LLaVA-1.5): $(336/14)^2 = 24^2 = 576$ tokens

Pitfall: miscalculating $(H/P)^2$ (treating $P^2$ as the patch count $N$); forgetting [CLS].
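The counts in this answer follow from one formula, sketched here as a small helper. The function name is hypothetical.

```python
def vit_token_count(image_size: int, patch_size: int, with_cls: bool = True) -> int:
    """Number of tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size          # patches along one edge: H / P
    return side * side + (1 if with_cls else 0)

tokens_224_cls = vit_token_count(224, 14)                   # 16 * 16 + 1
tokens_224 = vit_token_count(224, 14, with_cls=False)       # 16 * 16
tokens_336 = vit_token_count(336, 14, with_cls=False)       # 24 * 24
```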

Q9. Why is CLIP poor at OCR / counting / spatial relations?
  • Poor OCR: captions typically describe scenes, not text inside images; CLIP has no pixel-level OCR supervision
  • Poor counting: captions rarely report exact counts ("how many birds" is usually "a flock of birds"); the embedding space does not preserve a counting signal
  • Poor spatial relations: "cat on top of dog" and "dog on top of cat" look almost identical under bag-of-words; Yuksekgonul et al. 2023 (ICLR) quantify this with the ARO benchmark
  • Mitigation directions: DETR-style local alignment, SigLIP-2's dense local objectives, document-level data

Pitfall: blaming OCR weakness on "resolution too low" (partly right, but the root cause is data + loss); claiming "CLIP is bag-of-words" too absolutely.

Q10. Why is the vision tower generally frozen when training a VLM?
  • The vision tower (e.g. CLIP ViT-L) is already pre-trained well on its own data; unfreezing easily damages visual feature quality
  • Training data is far smaller than CLIP pretraining (millions vs billions); unfreezing easily overfits
  • Freezing also saves memory: hundreds of millions of vision tower params don't need optimizer state
  • Qwen2-VL exception: it trains the ViT during its pre-training stages (with large-scale mixed data to avoid forgetting) and freezes it only for the final instruction-tuning stage

Pitfall: answering "cannot unfreeze" directly, which is wrong; with enough data and a careful schedule the vision tower can be trained.

L2 advanced (10 questions)

Q11. Why is SigLIP's bias $b$ initialized to $-10$?
  • Early in training, embeddings are near random, $\mathbf{u}^\top \mathbf{v} \approx 0$, sigmoid outputs 0.5
  • In the N×N matrix, negatives are $N^2 - N \approx N^2$, positives only $N$; if initial predictions are all 0.5, negative gradients dominate and positives get no useful signal
  • Init $b = -10$ → $\sigma(b) \approx 4.5 \times 10^{-5}$ → all entries initially predicted as negative
  • This way negatives have almost no loss, positives have large loss (predicted as negative but truly positive), gradient pulls positives in, training is stable

Pitfall: saying "avoid numerical issues"; or answering "for the symmetric term" (wrong; the bias is not a symmetric loss term).
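The imbalance argument can be checked numerically. The sketch below assumes, as in the answer, that all similarities are near zero at init so every logit equals the bias $b$; the function name is illustrative, and the gradient totals omit the common $1/N$ factor.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def init_stats(n: int, b: float):
    """Initial loss and total gradient magnitude on positives vs negatives (logits all = b)."""
    pos_loss = math.log1p(math.exp(-b))      # label +1: -log sigma(b)
    neg_loss = math.log1p(math.exp(b))       # label -1: -log sigma(-b)
    loss = (n * pos_loss + (n * n - n) * neg_loss) / n
    pos_grad = n * sigmoid(-b)               # sum over the N positives of |dL/dS|
    neg_grad = (n * n - n) * sigmoid(b)      # sum over the N^2 - N negatives of |dL/dS|
    return loss, pos_grad, neg_grad

loss_b0, pos_b0, neg_b0 = init_stats(n=8, b=0.0)       # negatives carry 7x the gradient mass
loss_b10, pos_b10, neg_b10 = init_stats(n=8, b=-10.0)  # positives carry almost all of it
```

With $b = 0$ the negatives contribute $N - 1$ times more gradient than the positives, and the gap widens with batch size; with $b = -10$ the ratio flips in favor of the positives.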

Q12. Why is Flamingo's gated cross-attn initialized to 0?
  • The new cross-attn output is multiplied by $\tanh(\alpha)$, with $\alpha$ initialized to 0
  • $\tanh(0) = 0$, so at init the new module contributes zero to the frozen LLM, which behaves identically to the original text-only LLM without the visual module
  • During training $\alpha$ slowly grows from 0, and visual signals are gradually injected
  • Pros: fully preserves the frozen LLM's text-only ability; cons: visual ceiling limited by cross-attn capacity
  • Llama-3.2 Vision uses the same design

Pitfall: treating the 0 init as just a "common init trick"; or not realizing it concerns frozen-LLM capability preservation.

Q13. How is LLaVA-1.6 / NeXT's AnyRes implemented?
  • The ViT works at a fixed 336²; AnyRes slices a high-res image by aspect ratio into $n \times m$ tiles of 336² each (e.g. $2\times 2, 2\times 3$)
  • Each tile passes through the frozen ViT to get 576 tokens; add a global thumbnail (whole image resized to 336² and encoded)
  • Concatenate: $(1 + n\cdot m) \times 576$ visual tokens to the LLM
  • Choice of slicing: pick the grid (from a predefined set such as $\{1\times 1, 2\times 2, 1\times 4, 4\times 1, ...\}$) closest to the original aspect ratio

Pitfall: treating AnyRes as a synonym for dynamic resolution — technically different. Qwen2-VL is native dynamic (patch count fully free), while LLaVA-1.6 is fixed-tile composition.
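The grid choice and token count can be sketched as follows. This is a simplified stand-in for the rule described above (closest aspect ratio), with a tie-break on pixel area added here so that a large square image prefers $2\times 2$ over $1\times 1$; the function names and the candidate grid set are assumptions, and the actual LLaVA-NeXT selection logic may differ.

```python
import math

TILE = 336
# Candidate grids as (rows, cols); an illustrative set, not the official list.
GRIDS = ((1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (1, 4), (4, 1))

def pick_grid(width: int, height: int, grids=GRIDS):
    """Pick the grid whose aspect ratio is closest to the image, then closest in area."""
    target = math.log(width / height)
    def mismatch(grid):
        rows, cols = grid
        aspect_gap = abs(math.log(cols / rows) - target)
        area_gap = abs(rows * cols * TILE * TILE - width * height)
        return (aspect_gap, area_gap)
    return min(grids, key=mismatch)

def anyres_tokens(grid, tokens_per_tile: int = 576) -> int:
    """(1 + n*m) * 576: all tiles plus one global thumbnail."""
    rows, cols = grid
    return (1 + rows * cols) * tokens_per_tile

square = pick_grid(672, 672)        # (2, 2)
panorama = pick_grid(1344, 336)     # (1, 4)
square_tokens = anyres_tokens(square)
```

A $2\times 2$ grid already costs 2880 visual tokens, five times the single-image budget, which is the price of fixed-tile composition.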

Q14. Why is Qwen2-VL's M-RoPE three-dim allocation not uniform?
  • Qwen2-VL mrope_section = [16, 24, 24] (the unit is rotation pairs; the three numbers sum to head_dim / 2), $\sum \times 2 = $ head_dim = 128
  • All 128 dims rotate — different dim pairs use different axes (t/h/w) and their position ids to compute rotation angles
  • Reason for non-uniform allocation:

    • Inter-frame variation in video is slow (adjacent frames are very similar), so $s_t = 16$ has a small share
    • Intra-frame patch variation is dramatic (large visual differences across positions in a single frame), so $s_h = s_w = 24$ each need broader frequency coverage
  • $s_h = s_w$: the image H/W axes are symmetric

Pitfall: treating section as "number of dims" (wrong; the unit is pairs, and the sections sum to head_dim / 2); or thinking "the remaining dims do not rotate".

Q15. Why does BLIP-2 choose 32 query tokens?
  • 32 is an empirical value, balancing LLM context usage vs information capacity
  • Too few (< 16): large information loss, hurting VQA / detail tasks
  • Too many (> 64): large LLM context usage, expensive Q-Former cross-attn computation
  • The BLIP-2 paper's ablations show 32 is the sweet spot on most downstream tasks
  • Conceptually similar to Perceiver (also uses latent queries to compress inputs)

Pitfall: answering only "empirical"; not realizing it is an engineering trade-off between context budget and information capacity.

Q16. How does CLIP compute InfoNCE under DDP?
  • Local batch on each GPU is $N_\text{local}$; total batch over $K$ GPUs is $N = K \cdot N_\text{local}$
  • After forward on each GPU, dist.all_gather fetches everyone's image / text feats
  • Compute the global similarity $\mathbf{S} \in \mathbb{R}^{N \times N}$
  • But backward only lets this GPU's $N_\text{local}$ rows / columns contribute gradients (to avoid duplicate backward)
  • This is OpenCLIP's local_loss=True option

Pitfall: saying only "all-gather"; not knowing backward needs to avoid duplicate computation; or thinking backward also does an all-gather (wrong; backward flows along the communication backward path).

Q17. Why is Llama-3.2-V's visual ceiling lower than LLaVA / Qwen-VL?
  • Llama-3.2-V uses frozen LLM + gated cross-attn adapter; LLM weights are unchanged
  • LLaVA / Qwen-VL unfreeze the LLM, so its internal attention can reorganize to handle visual tokens specifically
  • The latter can "use self-attention to think about the image"; the former can only passively receive visual signal via cross-attn
  • Trade-off: Llama-3.2-V fully preserves text ability, while LLaVA / Qwen-VL may regress slightly but have a higher visual ceiling

Pitfall: saying only "fewer parameters"; not recognizing this is an architecture-level ceiling difference.

Q18. Why is much visual instruction-tuning data generated by GPT-4?
  • Raw caption data (CC3M / LAION) is short, not instruction-style, cannot teach dialog ability
  • Human-labeled visual instructions (e.g. VQAv2 questions) are small in scale and stylistically uniform
  • Text-only GPT-4 + image captions + bounding boxes → generate multi-turn dialog / reasoning tasks / detailed descriptions (GPT-4 never sees the pixels): this is how LLaVA-Instruct-158K was made
  • Prompt engineering controls coverage (three classes: detailed description, conversation, complex reasoning)

Pitfall: answering "lots of data"; not recognizing the key bottleneck is instruction style + diversity.

Q19. What is a typical CLIP / SigLIP training batch size?
  • OpenAI CLIP: 32k batch (256 GPUs × ~128/GPU)
  • OpenCLIP: up to 90k batch (LAION-2B)
  • SigLIP: 32k batch is typically enough; the sigmoid loss makes each (i,j) entry independent and avoids softmax's batch-wide sync; the paper scans batch sizes up to 1M but with quickly diminishing returns
  • Why small batches don't work: InfoNCE's MI lower bound $I(U;V) \ge \log N - \mathcal{L}$ tightens with larger N; the number of negatives also controls contrastive difficulty
  • After SigLIP decouples batch, small batches improve significantly (1k batch can learn reasonable embeddings)

Pitfall: answering "a few hundred"; or not knowing the theoretical link between batch and InfoNCE.

Q20. What do POPE / Winoground / MMBench / MMMU each evaluate?
  • POPE (Li et al. 2023): measures object hallucination — does the VLM claim objects exist that aren't in the image (yes/no binary)
  • Winoground (Thrush et al. 2022): measures compositionality / word-order sensitivity — can it distinguish "cat on dog" vs "dog on cat"
  • MMBench (Liu et al. 2023): general multimodal evaluation, ~3000 questions covering OCR / object recognition / reasoning, etc.
  • MMMU (Yue et al. 2024 CVPR): university-level professional knowledge (math / physics / medicine, etc.), tests multimodal reasoning
  • MM-Vet (Yu et al. 2023): integrated evaluation across 6 capabilities (recognition / knowledge / OCR / spatial / language / math)

Pitfall: confusing POPE and MMBench; not knowing Winoground is a "compositionality stress test".

L3 advanced (top labs / research directions, 5 questions)

Q21. Derive CLIP's symmetric InfoNCE = average of row + column softmax, and explain why SigLIP can be batch-independent.

Let batch size $N$ and similarity matrix $S_{ij} = \mathbf{u}_i^\top \mathbf{v}_j / \tau$.

CLIP derivation:

Row-wise softmax, $p_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{ik})}$. Image→Text NLL:

$$\mathcal{L}_{i\to t} = -\frac{1}{N}\sum_i \log p_{ii} = -\frac{1}{N}\sum_i \log \frac{\exp(S_{ii})}{\sum_j \exp(S_{ij})}$$

Column-wise softmax (Text→Image):

$$\mathcal{L}_{t\to i} = -\frac{1}{N}\sum_j \log \frac{\exp(S_{jj})}{\sum_i \exp(S_{ij})}$$

Symmetric loss: $\mathcal{L} = \tfrac12 (\mathcal{L}_{i\to t} + \mathcal{L}_{t\to i})$. Notice the gradient with respect to $S_{ij}$:

$$\frac{\partial \mathcal{L}_{i\to t}}{\partial S_{ij}} = \frac{1}{N}(p_{ij} - \delta_{ij})$$

The gradient at each $S_{ij}$ depends on the entire row's softmax normalization $\sum_k \exp(S_{ik})$. So changing N (adding / removing negatives) changes all $p$ values in that row — gradients couple the batch.

SigLIP derivation:

$S_{ij} = t \cdot \mathbf{u}_i^\top \mathbf{v}_j + b$, $y_{ij} = 2\delta_{ij} - 1$,

$$\mathcal{L}_\text{SigLIP} = \frac{1}{N}\sum_{i,j} \log(1 + \exp(-y_{ij} S_{ij}))$$

Gradient:

$$\frac{\partial \mathcal{L}}{\partial S_{ij}} = \frac{1}{N}\cdot \frac{-y_{ij}}{1 + \exp(y_{ij} S_{ij})} = \frac{1}{N}\cdot (-y_{ij}) \cdot \sigma(-y_{ij} S_{ij})$$

Key: $\partial \mathcal{L} / \partial S_{ij}$ only depends on $S_{ij}$ itself, not other entries. So adding negatives does not change the gradient at existing $S_{ij}$ — batch-independent.
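The coupling claim can be verified with autograd: take a 4×4 similarity block, embed it in an 8×8 matrix (more negatives), and compare the gradients on the shared entries. This is a sketch with a made-up deterministic matrix; both losses are summed rather than averaged so the explicit $1/N$ factor does not blur the comparison.

```python
import torch
import torch.nn.functional as F

def clip_row_grad(S):
    """Gradient of the image->text NLL (summed over rows) with respect to S."""
    S = S.clone().requires_grad_(True)
    targets = torch.arange(S.size(0))
    F.cross_entropy(S, targets, reduction="sum").backward()
    return S.grad

def siglip_grad(S):
    """Gradient of the summed sigmoid loss with respect to S."""
    S = S.clone().requires_grad_(True)
    y = 2.0 * torch.eye(S.size(0)) - 1.0
    (-F.logsigmoid(y * S).sum()).backward()
    return S.grad

S_big = torch.linspace(-1.0, 1.0, 64).reshape(8, 8)   # arbitrary similarity values
S_small = S_big[:4, :4]                               # same entries, fewer negatives

clip_shift = (clip_row_grad(S_big)[:4, :4] - clip_row_grad(S_small)).abs().max().item()
siglip_shift = (siglip_grad(S_big)[:4, :4] - siglip_grad(S_small)).abs().max().item()
```

`clip_shift` is clearly nonzero because every $p_{ij}$ is renormalized when the row grows, while `siglip_shift` is zero up to floating-point error.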

Engineering implications:

  • CLIP: under DDP must all-gather embeddings to compute global logsumexp; communication is $O(N \cdot D)$ with extra sync points
  • SigLIP: can use chunked all-pair, with each chunk computing only local rows × remote columns sigmoid terms, no logsumexp sync

Q22. Q-Former vs LLaVA projector trade-off: explain along capacity / compute / training stability.

Capacity (information capacity):

  • LLaVA projector: all $N$ patch tokens enter the LLM; information has no bottleneck, but LLM context usage is large
  • Q-Former: 32 queries is a fixed bottleneck; significant information compression, unfriendly to detail tasks (OCR / counting)
  • Suppose visual encoder output rank is $r$; LLaVA visual context rank $\le r$ (preserved), Q-Former rank $\le \min(r, 32)$

Compute / Memory:

  • LLaVA projector: MLP forward only, O(N·D²) compute
  • Q-Former: 12 layers of cross-attn + self-attn + FFN, ~180M params; but downstream LLM context is short (32 tokens vs N=256+ tokens), so LLM inference is fast
  • Total cost trade-off: at high image resolution (N=2880), Q-Former saves a lot of LLM inference; at low resolution (N≈256) the savings are small and the Q-Former's own 12 layers become the overhead, so the LLaVA projector is cheaper

Training stability:

  • LLaVA: projector is easy to train (2 stages), gradient path is short
  • Q-Former: 2-stage training (stage 1 representation jointly optimizes ITC + ITM + ITG; stage 2 generation connects to the frozen LLM); ITM head overfits easily, ITG requires intricate routing of causal vs self-attn masks — engineering pitfalls abound

Conclusion: 2024 mainstream returns to projector + spatial pixel-shuffle / merging to control token count; Q-Former retains value mainly in video / multi-image summarization (using queries for temporal pooling).

Q23. Why is Qwen2-VL's M-RoPE config `mrope_section = [16, 24, 24]` and not 1:1:1? Do all head_dim dims rotate?

Recap ordinary RoPE: head_dim $d$ split into $d/2$ complex pairs, frequencies $\theta_k = \text{base}^{-2k/d}$. The frequency coverage determines the maximum relative distance distinguishable: low frequency distinguishes long distances, high frequency distinguishes short distances.

Key disambiguation: the unit of Qwen2-VL's mrope_section is rotation pairs (each number = how many 2-D dim pairs an axis gets; the three numbers sum to head_dim / 2). $[16, 24, 24]$ means t / h / w occupy 16 / 24 / 24 pairs respectively; $\sum \times 2 = 128 = $ head_dim. The HF implementation doubles section to $[16, 24, 24, 16, 24, 24]$ to slice head_dim, corresponding to the axis sequence $(t, h, w, t, h, w)$, so all 128 dims rotate and none is left unrotated.

Design trade-offs:

  1. Temporal variation is slow: typical video sampled at 1–5 FPS; adjacent frames are very similar; long-range temporal dependencies are moderate. $s_t=16$ (25% share) suffices to cover hundreds to thousands of frames.
  2. Spatial variation is dramatic: huge visual differences across patches within a frame; to do token-to-token retrieval over a $\sim 1000\times 1000$-pixel image, more frequency slots are needed. $s_h = s_w = 24$ (37.5% each) covers more.
  3. Spatial symmetry: $s_h = s_w$ keeps the H/W axes symmetric (equivalence under horizontal / vertical flip).
  4. 6 alternating segments instead of 3 contiguous: because RoPE uses LLaMA "rotate_half", head_dim is split in memory into two halves $[h_1, h_2]$, with rotation $q \mapsto q \cos + \text{rotate\_half}(q)\sin$; the two halves share inv_freq. So axis allocations must mirror in both halves.

Qwen2.5-VL upgrade: switch $m_t$ from frame id to absolute timestamps (seconds), letting variable-FPS videos share a consistent time coordinate at training — the key to long video.

Alternative: DeepSeek-VL2 uses flattened visual tokens + plain 1D RoPE (no h, w split); Llama-3.2-V also does not split spacetime explicitly. M-RoPE only wins decisively for native interleaved video + image scenarios.

Q24. Root cause of VLM hallucinations? Pros and cons of existing mitigations?

Root causes:

  1. Data bias: training data contains common "co-occurrence priors" — "if there's a table, there's probably a chair". Co-occurrence makes a VLM tend to answer "yes, there's a chair" when seeing a table, even if there is no chair
  2. Language prior dominates: when the visual signal is weak (small objects, blur, odd angles), the VLM falls back to a pure language model, answering from "corpus common sense"
  3. LLM sycophancy: the user asks "is there X in the image" and the model tends to say Yes (human feedback biases towards being helpful → biased Yes)
  4. Stage 2 instruction tuning has no negative supervision: labels rarely teach "if there is no X, answer No"

Mitigations:

| Method | Idea | Pros | Cons |
|---|---|---|---|
| LLaVA-RLHF | PPO + hallucination-aware reward | Targeted late-stage fix | Needs a reward model + lots of preference data |
| RLAIF-V | AI-generated preference | Low data cost | Reward model's own bias accumulates |
| POVID | DPO + constructed hallucination negatives | Direct targeted fix | Negative design requires care |
| VCD (visual contrastive decoding) | At inference, run the VLM on the image and a blurred image, amplify the difference | Training-free | 2x inference cost |
| OPERA | Beam search + over-attention detection | Inference-time detection | Heuristic; may have false positives |
| POPE-driven evaluation | Use POPE for reverse supervision | Quantifiable | Only measures object hallucination |
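As an illustration of the "amplify the difference" idea behind VCD, the sketch below applies the contrastive rule $(1+\alpha)\,\text{logit}_\text{clean} - \alpha\,\text{logit}_\text{distorted}$ to two candidate tokens. The logit values are made up for illustration, the function names are hypothetical, and the plausibility filtering used in practice is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def vcd_logits(clean, distorted, alpha=1.0):
    """Boost tokens whose score depends on the real image; keep prior-driven tokens flat."""
    return [(1 + alpha) * c - alpha * d for c, d in zip(clean, distorted)]

vocab = ["chair", "lamp"]
clean = [2.0, 1.5]        # logits given the real image (made-up values)
distorted = [2.0, 0.0]    # logits given a distorted copy (blurred or noised)
adjusted = vcd_logits(clean, distorted)   # [2.0, 3.0]

before = vocab[max(range(2), key=lambda i: softmax(clean)[i])]      # "chair"
after = vocab[max(range(2), key=lambda i: softmax(adjusted)[i])]    # "lamp"
```

"chair" scores the same with or without a usable image, so it is driven by the language prior and gets no boost; "lamp" depends on the pixels and overtakes it after the adjustment. The second forward pass on the distorted image is the source of the 2x inference cost.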

Future directions: fix at the training data layer (grounded captions + segment-level supervision); visual chain-of-thought (VLM-R1 style) where the model "points to evidence" before answering.

Q25. For modern VLM training, should the vision tower use SigLIP or CLIP? Why did most pick SigLIP-So400M in 2024–2025?

Empirical conclusion: after 2024, SigLIP-So400M has become a common choice for open-weight VLMs, with PaliGemma (Google) and LLaVA-OneVision as representatives. But Molmo still uses OpenAI CLIP (its paper ablates against SigLIP); the InternVL series uses in-house InternViT; Qwen2-VL's vision side initializes from a DFN-derived ViT and then goes through large-scale joint vision-language training; LLaVA-1.5 / 1.6 still use CLIP ViT-L/14. "Switch to SigLIP" is not an industry consensus.

Why SigLIP-So400M is attractive:

  1. Stronger zero-shot performance: SigLIP-So400M has a 4–8 point lead over the same-size CLIP on zero-shot ImageNet; visual feature quality is higher
  2. Resolution-friendly: SigLIP already trained at 384²/512² extensively; CLIP is mostly 224²+336²; VLM tasks generally need high resolution, so SigLIP transfers more smoothly
  3. Batch-independent loss → stable fine-tune: SigLIP's sigmoid yields more predictable gradients when unfreezing the vision tower in stage 1
  4. Multilingual support: SigLIP-2 / mSigLIP natively support multiple languages
  5. Open weights: Google releases the full SigLIP / SigLIP-2 checkpoints (OpenAI CLIP has been open-sourced too, but with limited choices)

When to still pick CLIP:

  • When strict alignment with OpenAI CLIP behavior is needed (e.g. CLIP guidance for Stable Diffusion-style use)
  • For project compatibility (early LLaVA-1.0/1.5 + DALL-E pipelines use CLIP)

Note: SigLIP is not a silver bullet; DeepSeek-VL uses SigLIP + SAM dual-encoder — SAM features retain irreplaceable advantages on detail localization tasks.

§A Appendix: sanity-check outputs & references

A.1 Key code sanity check (illustrated by actual runs)

[ViT] patch_embed: (2, 3, 224, 224) -> (2, 196, 768)  ✓
[ViT] forward + CLS: (2, 3, 224, 224) -> head out (2, 1000)  ✓

[CLIP] N=8, D=512, init logit_scale=ln(1/0.07): loss ≈ 2.08 ≈ log(N) (random embeddings → near-uniform softmax)  ✓
[CLIP] forward + backward: gradients along i→t and t→i paths are symmetric ✓

[SigLIP] N=8, D=512, b=0:  loss = sum_{ij} log(1+e^0) / N = 64 * log 2 / 8 ≈ 5.545  ✓
[SigLIP] bias b=-10: positives (8 entries) loss ≈ log(1+e^10) ≈ 10; negatives (56 entries) loss ≈ 4.5e-5; total ≈ 8·10/8 ≈ 10.0  ✓ (positives dominate early gradient)

[LLaVA] visual feat (2, 256, 1024) -> projector -> (2, 256, 4096)  ✓
[LLaVA] input_ids w/ <image> placeholder: 1 token -> 256 visual tokens after merge  ✓

[Q-Former] image_feats (2, 257, 1408), queries (1, 32, 768) -> out (2, 32, 768)  ✓

[M-RoPE] head_dim=128, mrope_section=[16,24,24]: 2 × sum = 128 = head_dim, full rotation ✓
[M-RoPE] pure-text token (pos_t=pos_h=pos_w=m): cos/sin identical across the three axes, equivalent to 1D RoPE ✓

A.2 Key references (by topic)

Vision encoder: Dosovitskiy et al. ICLR 2021 (ViT); Zhai et al. CVPR 2022 (ViT-g); Fang et al. arXiv 2023 (EVA-02)

Contrastive pretraining: Radford et al. ICML 2021 (CLIP); Cherti et al. CVPR 2023 (OpenCLIP); Zhai et al. ICCV 2023 (SigLIP); Tschannen et al. arXiv 2025 (SigLIP-2); Gadre et al. NeurIPS 2023 (DataComp)

Visual instruction / fusion: Liu et al. NeurIPS 2023 (LLaVA); Liu et al. CVPR 2024 (LLaVA-1.5); Li et al. ICML 2023 (BLIP-2); Alayrac et al. NeurIPS 2022 (Flamingo); Wang et al. arXiv 2023 (CogVLM); Bai et al. arXiv 2023 (Qwen-VL); Wang et al. arXiv 2024 (Qwen2-VL); Bai et al. arXiv 2025 (Qwen2.5-VL); Lu et al. arXiv 2024 (DeepSeek-VL); Wu et al. arXiv 2024 (DeepSeek-VL2); Chen et al. CVPR 2024 + arXiv 2024 (InternVL / InternVL-2); Llama Team arXiv 2024 (Llama-3.2-V); Deitke et al. arXiv 2024 (Molmo); Li et al. arXiv 2024 (LLaVA-OneVision)

VLM preference alignment: Sun et al. arXiv 2023 (LLaVA-RLHF); Yu et al. arXiv 2024 (RLAIF-V); Zhou et al. arXiv 2024 (POVID); Shen et al. arXiv 2025 (VLM-R1)

Multimodal embeddings: Koukounas et al. arXiv 2024 (Jina-CLIP); Jiang et al. arXiv 2024 (VLM2Vec); Zhang et al. arXiv 2024 (mmE5)

Evaluation: Li et al. EMNLP 2023 (POPE); Thrush et al. CVPR 2022 (Winoground); Liu et al. ECCV 2024 (MMBench); Yue et al. CVPR 2024 (MMMU); Fu et al. CVPR 2025 (Video-MME); Yu et al. ICML 2024 (MM-Vet); Mangalam et al. NeurIPS 2023 (EgoSchema)


Code + formulas have passed independent reviewer static checks; numerical values verified on PyTorch 2.x, CUDA 12.x (shapes and initial loss of the 5 core modules — ViT / CLIP / SigLIP / Q-Former / M-RoPE — all match the formulas).