Curated Briefing from "Visual Metonymy - NotebookLM" Exported on: 4/26/2026, 9:44:59 AM


User

Provide the full architectural diagram now, like a whiteboard


Model

The Full Latent Associative Grounding Architecture

Stage 0: Offline Preprocessing & Target Synthesis (The "Offline" Zone)

=================[ THE DATA FOUNDRY & TARGET SYNTHESIS ]=================

[ Raw ViMET Images ]         [ Concept Representamen Strings ]
         │                                  │
         ▼                                  ▼
[ Grounding DINO ] ────────► [ Candidate Bounding Boxes ]
[ NodeRAG Graph ]                           ▼
[ Extracts N, S, R nodes ] ──► [ 32B Teacher Model (Unquantized) ]
[ Engineers Distractors ]                   │ (Acts as Expert Semantic Filter)
         │                                  │
         │                                  ▼
         │                     [ Teacher Forward Pass ]
         │                     ├── V_sem (Contextual Semantic Targets)
         │                     └── A_traj (Sparse Trajectory Targets)
         │                                  │
         └──────────────────────────────────┼───────────────────────────┐
                                            ▼                           │
                               [ teacher_signals.pkl ] ◄────────────────┘
                               (Saved to disk, 32B Model Unloaded)
  • The Resource Isolation Strategy: To strictly adhere to the 16GB VRAM constraint, the massive 32B Teacher Model and Grounding DINO are executed offline [Source: 1: Architecting Latent Visual Reasoning for Metonymy Research].
  • The Artifact Generation: This stage generates a flawless mathematical ground truth, caching the semantic vectors (V_sem), the attention trajectories (A_traj), and the NodeRAG distractors into a .pkl file to prevent Out Of Memory (OOM) death spirals during training [Source: 2: [GEMINI] LLM Fine-Tuning OOM Error Resolution] (see the caching sketch below).
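A minimal caching sketch of this stage in PyTorch: the record fields and the extract_teacher_signals() helper are hypothetical stand-ins, since the sources specify the targets and the .pkl caching pattern but not an API.

```python
# Minimal sketch of the Stage 0 artifact dump. extract_teacher_signals() is
# a hypothetical helper; only the one-pass-then-pickle pattern is from the text.
import pickle
import torch

def cache_teacher_signals(samples, teacher, path="teacher_signals.pkl"):
    """Run the 32B teacher once per sample, then persist all targets to disk."""
    records = []
    with torch.no_grad():
        for s in samples:
            v_sem, a_traj = extract_teacher_signals(teacher, s)  # hypothetical
            records.append({
                "image_id": s["image_id"],
                "v_sem": v_sem.cpu(),          # contextual semantic target
                "a_traj": a_traj.cpu(),        # sparse trajectory target
                "boxes": s["verified_boxes"],  # DINO boxes, teacher-verified
                "distractors": s["noderag_distractors"],
            })
    with open(path, "wb") as f:
        pickle.dump(records, f)
    del teacher                  # drop the 32B reference so its VRAM
    torch.cuda.empty_cache()     # can actually be reclaimed before SFT
```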

Stage 1: Student Initialization & Latent Policy Alignment (SFT)

=================[ STUDENT INITIALIZATION & TRI-LOSS SFT ]=================

[ teacher_signals.pkl ] ───(Loaded to RAM)───┐
[ Qwen2.5-VL-3B-Instruct ] ◄──(4-bit QLoRA)──┘
[ Vocabulary Expansion ] ────► Adds K=5 <lvr> Tokens (Cognitive Bottleneck)
[ Adapter Injection ] ───────► LoRA (r=64, α=128) on all Attn/MLP layers
         │                     (embed_tokens & lm_head strictly UNFROZEN)
[ Semantic Projector ] ──────► Non-linear MLP [2048 -> LayerNorm/GELU -> 3584]
         │                     (Bridges 3B Student to 32B Teacher Space)
[ Curriculum Sensory Gating ]► Mathematical cosine mask blocks Text-to-Image attention
         │                     (Forces all processing through <lvr> tokens)
[ Tri-Loss Optimization ] ───┬─► L_ntp (Next-Token Prediction for ChatML)
                             ├─► L_concept (Cosine Sim against V_sem target)
                             └─► L_traj (Earth Mover's Distance against A_traj target)
  • The Latent Bottleneck: The architecture shifts from discrete text generation to K=5 continuous Latent Visual Reasoning (<lvr>) tokens acting as "Visual Information Containers", which mathematically compress the visual semantics [Source: 5: Bridging the Perception Gap through Latent Associative Grounding].
  • The Expressive Capacity: The QLoRA adapters target all major linear layers to provide the maximum expressive capacity required to absorb the complex metonymic concepts from the teacher [Source: 1: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 8: [GEMINI] LLM Fine-Tuning OOM Error Resolution].
  • The Training Guardrails: Curriculum Sensory Gating (CSG) applies a dynamic negative bias to mathematically throttle the model's textual priors, forcing the language head to route all visual processing through the latent thought slots [Source: 9: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning]. A tri-loss sketch follows below.
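A hedged sketch of the tri-loss in PyTorch. The loss weights and the one-dimensional Wasserstein form of the EMD term are assumptions; the section specifies only the three components.

```python
# Sketch of the tri-loss from the diagram above. Shapes, weights, and the
# 1-D Wasserstein reduction of the EMD term are illustrative assumptions.
import torch
import torch.nn.functional as F

def tri_loss(logits, labels, student_latent, v_sem, student_attn, a_traj,
             w_ntp=1.0, w_concept=1.0, w_traj=0.5):
    # L_ntp: standard next-token prediction over the ChatML answer tokens.
    l_ntp = F.cross_entropy(logits.view(-1, logits.size(-1)),
                            labels.view(-1), ignore_index=-100)
    # L_concept: cosine distance between the projected <lvr> states
    # (student_latent, already mapped into the 3584-d teacher space) and V_sem.
    l_concept = 1.0 - F.cosine_similarity(student_latent, v_sem, dim=-1).mean()
    # L_traj: Earth Mover's Distance against A_traj. For 1-D distributions
    # over flattened image patches this reduces to the L1 gap between CDFs.
    p = student_attn / student_attn.sum(-1, keepdim=True)
    q = a_traj / a_traj.sum(-1, keepdim=True)
    l_traj = (torch.cumsum(p, -1) - torch.cumsum(q, -1)).abs().sum(-1).mean()
    return w_ntp * l_ntp + w_concept * l_concept + w_traj * l_traj
```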

Stage 2: VMR-RLVR Active Optimization (The "Agentic" Zone)

=================[ ACTIVE PERCEPTION & GRPO RL LOOP ]=================

[ Open-Ended Metonymy Prompt ]
[ VMR Restructuring ] ───────► Pits target concept against "Partially Literal Distractors"
         │                     (Engineered from NodeRAG offline)
[ Active Visual Perception ] ─► Agent predicts bounding boxes -> Crops & Zooms
         │                     (Purifies visual input before bottleneck)
[ Latent Synthesis ] ────────► K=5 <lvr> tokens synthesize the cropped Representamens
[ Final Prediction ] ────────► Discrete text answer generated (e.g., "A) Age")
[ Dense Reward Evaluator ] ──┬─► r_accuracy (Binary: +1 for correct VMR choice)
                             ├─► r_format (Compliance with required schema)
                             ├─► r_crop (IoU against Teacher-verified boxes)
                             └─► r_visual (AWAD: Penalizes lazy attention)
[ GRPO Optimizer ] ──────────► PagedAdamW8bit updates 3B LoRA weights
  • The "Reasoning by Contrast" Engine: The Verifiable Multiple-Choice Reformulation (VMR) converts subjective concepts into deterministic A/B/C/D tasks, allowing the GRPO reinforcement learning algorithm to safely explore latent trajectories [Source: 12: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 13: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .- The Attention Supervision: Advantage-Weighted Attention Divergence (AWAD) actively supervises the model's internal attention map token-by-token [Source: 14: Reinforced Attention Learning] . If the model exhibits "Lazy Attention Localization" (guessing correctly while looking at the background), AWAD applies a negative advantage to permanently anchor the policy to the physical pixels [Source: 14: Reinforced Attention Learning] .- The Memory Safety Protocol: PagedAdamW8bit is explicitly deployed to compress optimizer states (Momentum and Variance) to 8-bit, automatically paging them to system RAM to prevent VRAM crashes during the backward pass [Source: 15: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 16: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .

User

Also, can you elaborate on where exactly each of the components involved, such as the 32B model, is used in the workflow? I should know, after looking at the workflow or architectural diagram, at which point the components interact, so pretend you are explaining blocks and arrows to someone whiteboarding. I want to understand, looking at the entire diagram, where exactly in each architectural phase, such as the CSG phase or the RLVR phase, the teacher model (the 32B model) is employed, whether that fits the memory constraints, and whether this is done offline or online.


Model

The Latent Associative Grounding Architecture: A Whiteboard Map

To visualize exactly how the components interact across the pipeline, imagine the whiteboard divided into two distinct zones: an Offline Preprocessing Zone (heavy computation that happens before training) and an Online Training Zone (the actual fine-tuning and RL updates).

Here is the step-by-step block-and-arrow breakdown of where every component, especially the 32B teacher model, sits in your architecture.

Stage 0: Offline Preprocessing & Target Synthesis (The "Offline" Zone)

This entire stage is executed strictly offline before any student training begins [Source: 1: [GEMINI] LLM Fine-Tuning OOM Error Resolution].

  • The Memory Constraint Justification: Your deployment target is a Kaggle environment with a strict 16GB VRAM ceiling [Source: 2: [GEMINI] LLM Fine-Tuning OOM Error Resolution]. A 32B teacher model requires massive VRAM; attempting "Online Distillation"—where the 32B teacher and the 3B student occupy the GPU simultaneously to generate and learn from targets on the fly—would trigger an immediate Out Of Memory (OOM) death spiral [Source: 1: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 2: [GEMINI] LLM Fine-Tuning OOM Error Resolution]. Therefore, the teacher must be isolated and executed offline [Source: 1: [GEMINI] LLM Fine-Tuning OOM Error Resolution].
  • The Grounding DINO Block:
    • Input: Raw ViMET images and representamen text strings [Source: 3: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning].
    • Process: Grounding DINO scans the image and extracts candidate bounding boxes for the visual cues [Source: 3: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning].
    • Arrow: [Candidate Bounding Boxes] -> [32B Teacher Model]
  • The 32B Teacher Model Block:
    • Role: Acts as an expert semantic filter and trajectory generator [Source: 3: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] [Source: 4: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning].
    • Process: The unquantized Qwen2.5-VL-32B evaluates the candidate bounding boxes to guarantee they contain the correct fine-grained visual information [Source: 4: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] [Source: 5: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning]. It then runs a forward pass to extract high-fidelity reasoning signals, specifically the semantic concept targets (V_sem) and the attention trajectory targets (A_traj) [Source: 6: Architecting Latent Visual Reasoning for Metonymy Research] (see the forward-pass sketch after this list).
    • Arrow: [Verified Bounding Boxes + 32B Forward Pass] -> [teacher_signals.pkl]
  • The NodeRAG / Knowledge Graph Block:
    • Role: Engineers the prompt context and constructs the deterministic evaluation framework [Source: 9: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination] [Source: 10: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination].
    • Process: NodeRAG extracts Semantic Units (S) and Relationships (R) to inject explicit logical rules into the prompt [Source: 10: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination] [Source: 11: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents]. It also generates the Partially Literal Distractors necessary for the multiple-choice reformulation [Source: 12: Multimodal Frontiers: Adapting VMR-RLVR for Vision-Language Models].
    • Output: The offline phase concludes by caching all these targets into a single file (teacher_signals.pkl), after which the 32B model is completely unloaded from the GPU to free up memory [Source: 13: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 14: [GEMINI] LLM Fine-Tuning OOM Error Resolution].
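A hedged sketch of that single teacher forward pass using the standard Hugging Face output_hidden_states / output_attentions flags; how V_sem and A_traj are pooled from the returned tensors is an illustrative assumption.

```python
# Sketch of the teacher forward pass. The pooling of V_sem and A_traj from
# hidden states and attentions is assumed for illustration; the HF flags
# themselves are standard.
import torch

@torch.no_grad()
def teacher_forward(teacher, inputs, target_positions, image_token_mask):
    """target_positions: indices of the reasoning slots in the teacher prompt
    (an assumption); image_token_mask: bool mask over image tokens."""
    out = teacher(**inputs, output_hidden_states=True, output_attentions=True)
    # V_sem: final-layer hidden states at the reasoning positions.
    v_sem = out.hidden_states[-1][0, target_positions]       # (K, d_teacher)
    # A_traj: last-layer attention from those positions onto image tokens,
    # averaged over heads (sparsification to top patches omitted).
    attn = out.attentions[-1][0].mean(0)                     # (seq, seq)
    a_traj = attn[target_positions][:, image_token_mask]     # (K, n_patches)
    return v_sem.float().cpu(), a_traj.float().cpu()
```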

Stage 1: Supervised Fine-Tuning (SFT) & Curriculum Sensory Gating (The "Online" Zone)

With the 32B model removed from memory, you now load the 3B student model (Qwen2.5-VL-3B-Instruct) onto the 16GB GPU using 4-bit NF4 QLoRA quantization and PagedAdamW8bit to compress optimizer states [Source: 15: [GEMINI] LLM Fine-Tuning OOM Error Resolution]. This is an online training phase.

  • The Input Block:
    • The cached teacher_signals.pkl is loaded entirely into system RAM [Source: 18: [GEMINI] LLM Fine-Tuning OOM Error Resolution].
    • Arrow: [NodeRAG Prompt + Image] -> [3B Student Vision Encoder]
  • The Latent Bottleneck Block (<lvr> Tokens):
    • Instead of generating textual chain-of-thought, the model forces the visual data through exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens injected directly before the answer generation [Source: 6: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 19: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • The Curriculum Sensory Gating (CSG) Phase:
    • Process: CSG applies a mathematical mask (negative bias) that blocks the model's direct text-to-image cross-attention early in training [Source: 20: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
    • Why here? This physically throttles the 3B model's language priors, forcing the language head to route all processing strictly through the <lvr> bottleneck [Source: 21: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 22: Lexicon of Visual Metonymy and Vision-Language Models].
    • Arrow: [3B Hidden States] -> [Tri-Loss Evaluation]
  • The Optimization Block (Tri-Loss):
    • The student's latent output is compared against the offline targets retrieved from the .pkl file.
    • Semantic Reconstruction (L_concept): Aligns the student's latent states with the 32B teacher's V_sem via the semantic projector (see the projector sketch after this list) [Source: 23: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 24: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
    • Trajectory Alignment (L_traj): Aligns the student's attention map with the 32B teacher's A_traj [Source: 25: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
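A sketch of the semantic projector named in the Stage 1 diagram (2048 -> LayerNorm/GELU -> 3584); making the hidden width equal to the output width is an assumption.

```python
# Sketch of the semantic projector: 2048-d student states -> 3584-d teacher
# space, matching the [2048 -> LayerNorm/GELU -> 3584] spec in the diagram.
import torch.nn as nn

class SemanticProjector(nn.Module):
    def __init__(self, d_student=2048, d_teacher=3584):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_student, d_teacher),
            nn.LayerNorm(d_teacher),
            nn.GELU(),
            nn.Linear(d_teacher, d_teacher),  # lands in the 32B V_sem space
        )

    def forward(self, lvr_states):   # (batch, K=5, 2048) student <lvr> states
        return self.net(lvr_states)  # (batch, K=5, 3584), compared to V_sem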

Stage 2: Active Optimization via VMR-RLVR (The "Online" Zone)

This final online phase shifts the 3B model from imitating the teacher to actively exploring and validating its own reasoning using Reinforcement Learning.

  • The Active Visual Perception Block:
    • Instead of passively absorbing the image, the 3B agent actively predicts bounding box coordinates to dynamically crop and zoom into the representamens, purifying the visual data before it hits the latent bottleneck [Source: 26: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 27: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
    • Arrow: [Active Cropping] -> [<lvr> Latent Synthesis] -> [Final Output Generation]
  • The Verifiable Multiple-Choice Reformulation (VMR) Block:
    • The open-ended metonymy task is restructured into an A/B/C/D format using the NodeRAG distractors created in Stage 0 [Source: 12: Multimodal Frontiers: Adapting VMR-RLVR for Vision-Language Models].
  • The RLVR / GRPO Optimization Block:
    • Process: The model explores multiple reasoning trajectories and is updated via Group Relative Policy Optimization (GRPO) [Source: 30: Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding] (see the advantage sketch after this list).
    • Dense Reward Calculation: The reward evaluator computes the final reinforcement signal without needing an LLM-as-a-judge. It sums the Accuracy Reward (r_accuracy) from the VMR format, the Cropping Precision Reward (r_crop) based on the offline Grounding DINO boxes, and the Visual-Anchored Reward (r_visual), which penalizes the model if it abandons visual evidence [Source: 33: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf].
    • Arrow: [Reward Signal] -> [PagedAdamW8bit Optimizer] -> [Update 3B LoRA Weights]
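For concreteness, the group-relative advantage at the heart of GRPO is sketched below; the clipping and KL terms of the full objective are omitted.

```python
# Sketch of GRPO's group-relative advantage: each prompt is rolled out G
# times and rewards are normalized within the group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (G,) dense rewards for G rollouts of the same VMR prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts, one correct and well-grounded answer.
adv = group_relative_advantages(torch.tensor([1.8, 0.2, 0.3, 0.1]))
# The correct rollout gets a strongly positive advantage; the rest are
# pushed down, which is what drives "reasoning by contrast".
```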

Summary of the Whiteboard Flow: The 32B Teacher Model is strictly an offline preprocessing engine used to generate flawless spatial and semantic targets, keeping you safe from the 16GB VRAM constraint [Source: 1: [GEMINI] LLM Fine-Tuning OOM Error Resolution]. The Curriculum Sensory Gating (CSG) and RLVR phases are online training loops where the compact 3B Student Model learns to compress those offline targets into continuous latent thoughts and optimizes its attention policy via verifiable rewards [Source: 21: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].


User

How is latent reasoning applied during the VMR-RLVR phase for solving the metonymy using reasoning/logic after the model has successfully distributed its gaze to all relevant cues (active visual perception is complete across all the representamens)? Explain how the conceptual leap is ensured after visual perception.


Model

Latent Associative Grounding: The Resolution Phase

  • The Transition from Perception to Cognition: Once the Vision-Language Model (VLM) has successfully executed Active Visual Perception to crop and isolate the distributed visual cues (the Representamens), the architecture shifts from gathering evidence to synthesizing meaning [Source: 1: Bridging the Perception Gap through Latent Associative Grounding].
  • Abandoning Textual Chain-of-Thought (CoT): Standard models attempt this conceptual leap by generating verbose, discrete textual reasoning (e.g., "I see an hourglass and an old man, therefore…"), which triggers the "Curse of CoT" [Source: 4: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models]. This verbosity causes context dilution, physically pushing the original visual constraints out of focus and causing the model to suffer from hallucinatory drift [Source: 4: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents].
  • Decoupling Reasoning from Verbalization: To prevent this, the architecture entirely decouples the reasoning process from textual generation [Source: 1: Bridging the Perception Gap through Latent Associative Grounding]. The conceptual leap is executed silently within a continuous hidden state before any text is produced [Source: 7: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy].

The Role of the Latent Bottleneck (K=5 <lvr> Tokens)

  • Visual Information Containers: The cropped visual cues are mathematically compressed into exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 9: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 10: Lexicon of Visual Metonymy and Vision-Language Models]. By aligning the capacity of this bottleneck exactly with the semiotic structure of the dataset (up to 5 representamens per concept), these tokens act as dedicated cognitive slots containing pure, high-density semantic signals stripped of background noise [Source: 1: Bridging the Perception Gap through Latent Associative Grounding] [Source: 9: Architecting Latent Visual Reasoning for Metonymy Research].
  • Constructing the Interpretant: In Peirce’s Semiotic Triad, the "Interpretant" is the mental inference that connects a literal cue to an abstract meaning [Source: 2: Lexicon of Visual Metonymy and Vision-Language Models] [Source: 11: Lexicon of Visual Metonymy and Vision-Language Models]. The latent tokens mathematically construct this Interpretant [Source: 11: Lexicon of Visual Metonymy and Vision-Language Models]. Because they operate in a continuous vector space rather than discrete vocabulary, they can maintain complex, cross-modal semantic associations simultaneously, bridging the gap between the literal pixels (e.g., an hourglass and an old man) and the abstract Object (e.g., "Age") [Source: 2: Lexicon of Visual Metonymy and Vision-Language Models].
  • Autoregressive Synthesis: The latent reconstruction occurs autoregressively (token-by-token) within the bottleneck [Source: 13: LaViT: Grounding Visual Reasoning] [Source: 14: Lavit1]. This sequential dependency builds a causal graph of visual logic, compelling the model to synthesize the isolated cues into a unified understanding before passing the final hidden state to the language decoding head [Source: 13: LaViT: Grounding Visual Reasoning] [Source: 15: Lavit1]. A token-injection sketch follows below.
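A minimal sketch of the vocabulary expansion behind these slots, using the standard Hugging Face tokenizer API; the token spelling and the prompt template are illustrative, and `model` stands for the already-loaded Qwen2.5-VL-3B student.

```python
# Sketch of the K=5 <lvr> vocabulary expansion. Token names and the prompt
# template are illustrative; `model` is the loaded student (loading omitted).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

K = 5
lvr_tokens = [f"<lvr_{i}>" for i in range(K)]
tokenizer.add_tokens(lvr_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # embed_tokens & lm_head grow,
                                               # which is why both stay unfrozen

# The slots sit directly before the answer, so the K-th latent state is the
# immediate context for the discrete choice (illustrative template):
prompt = "<image> Which concept does this scene express? " \
         + "".join(lvr_tokens) + " Answer:"
```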

Ensuring the Conceptual Leap via Curriculum Sensory Gating (CSG)

  • Preventing Shortcut Learning: Left to its own devices, a VLM’s language head is fundamentally lazy; it will attempt to bypass the visual evidence and guess the answer based on statistical text correlation (language priors) [Source: 16: LaViT: Grounding Visual Reasoning] [Source: 17: Lavit1].
  • The Mathematical Mask: To mathematically guarantee that the conceptual leap is grounded in the visual evidence, the architecture employs Curriculum Sensory Gating (CSG) [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models]. This mechanism applies a negative bias to block direct text-to-image cross-attention [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models] (see the gating sketch after this list).
  • Forcing the Bottleneck: By throttling the model's access to its language priors, CSG physically forces the language head to route all its processing through the K=5 latent tokens [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models]. It becomes mathematically impossible for the model to bypass the visual evidence, ensuring that the final abstract prediction is strictly derived from the synthesized visual cues rather than a blind textual guess [Source: 21: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 22: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
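A minimal sketch of such a gate, assuming a cosine curriculum (the Stage 1 diagram calls it a "mathematical cosine mask"); where exactly the bias is added inside the attention stack is left abstract.

```python
# Sketch of Curriculum Sensory Gating: a large negative additive bias on
# text-query -> image-key attention logits, decayed on a cosine schedule.
import math
import torch

def csg_bias(text_query_mask, image_key_mask, step, total_steps, max_bias=1e4):
    """text_query_mask: bool (seq,) marking plain-text query positions
    (excluding <lvr> slots, which stay exempt); image_key_mask: bool (seq,)."""
    # Cosine curriculum: gate fully closed at step 0, fully open at the end.
    gate = 0.5 * (1.0 + math.cos(math.pi * min(step / total_steps, 1.0)))
    bias = torch.zeros(text_query_mask.numel(), image_key_mask.numel())
    # Block direct text-to-image attention, so all visual information must
    # flow through the latent bottleneck instead.
    bias[text_query_mask.unsqueeze(1) & image_key_mask.unsqueeze(0)] = \
        -max_bias * gate
    return bias  # added to attention logits before the softmax
```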

Optimization via VMR-RLVR ("Reasoning by Contrast")

  • The Challenge of Subjectivity: Visual metonymy is inherently subjective and abstract, making standard open-ended generation impossible to evaluate with strict reinforcement learning rewards [Source: 23: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].
  • Verifiable Multiple-Choice Reformulation (VMR): To optimize the latent reasoning, the task is restructured into a deterministic A/B/C/D format [Source: 23: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning]. The correct abstract concept (e.g., "Age") is pitted against meticulously engineered "Partially Literal Distractors" (e.g., "Time") derived from ConceptNet [Source: 23: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 24: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. These distractors are highly correlated with single, literal objects in the image (like an hourglass), creating a trap for models with literal bias [Source: 12: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].
  • Group Relative Policy Optimization (GRPO): During the RLVR phase, the model explores various latent reasoning trajectories [Source: 12: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 26: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv]. If the latent tokens succumb to premature semantic binding—fixating only on the hourglass and selecting the literal distractor "Time"—the model receives a deterministic reward of 0 [Source: 23: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 24: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. If the latent tokens successfully synthesize all cues (the man, the boy, the hourglass) to select "Age", it receives a +1 reward [Source: 23: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].
  • Learning the Logical Boundaries: This binary reward system forces the model into "reasoning by contrast" [Source: 27: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. The policy mathematically penalizes superficial visual-matching shortcuts, teaching the latent tokens to recognize the exact logical boundaries separating a correct pragmatic interpretation from a literal mistake [Source: 29: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].

Sustaining the Leap via Visual-Anchored Reward Shaping (r_visual)

  • Curing Lazy Attention Localization: Even with a latent bottleneck, VLMs can suffer from "Lazy Attention Localization," a phenomenon where the model initially looks at the image but abandons the visual tokens entirely during the final stages of the reasoning chain, reverting to system text priors [Source: 30: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 31: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf].
  • The r_visual Ratio: To prevent the model from dropping its visual focus exactly when it needs to make the conceptual leap, the RL loop is augmented with Visual-Anchored Reward Shaping (r_visual) [Source: 32: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] [Source: 33: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf]. This calculates the ratio of internal attention mass assigned to visual tokens relative to system tokens during the rollout [Source: 32: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] [Source: 34: AVAR Framework: Visual-Anchored Reward Shaping and Optimization].
  • Advantage-Weighted Attention Divergence (AWAD): Operating under the Reinforced Attention Learning (RAL) framework, AWAD calculates the Jensen-Shannon Divergence (JSD) between the student's attention map and a flawless, teacher-verified target [Source: 10: Lexicon of Visual Metonymy and Vision-Language Models] (see the AWAD sketch after this list).
  • Negative Advantage for Blind Guessing: If the model selects the correct multiple-choice answer but its internal attention map reveals it was looking at irrelevant background noise or system instructions, AWAD applies a negative advantage [Source: 37: JSD in Vision-Language Models]. This completely neutralizes the reward, mathematically proving that the model is only reinforced when its abstract conceptual leap remains permanently anchored to the physical pixels of the representamens [Source: 33: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf].
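A sketch of the AWAD computation; the JSD is standard, while the exact way the divergence scales the advantage is an illustrative reading of the description above.

```python
# Sketch of Advantage-Weighted Attention Divergence. The advantage-weighting
# rule below is an assumption; only "JSD against a teacher-verified target
# neutralizes blind guesses" comes from the text.
import torch

def jsd(p, q, eps=1e-8):
    p, q = p + eps, q + eps
    p, q = p / p.sum(-1, keepdim=True), q / q.sum(-1, keepdim=True)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)   # bounded in [0, ln 2]

def awad_advantage(base_advantage, student_attn, teacher_attn, beta=1.0):
    """Shrinks (or flips) the advantage when the gaze diverges from the
    verified target, so a correct but blindly-guessed answer is not
    reinforced."""
    divergence = jsd(student_attn, teacher_attn)          # per-sample
    return base_advantage - beta * divergence * base_advantage.sign()
```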

User

How can Active Visual Perception reduce distractor interference during the Resolution phase?


Model

The Mechanics of Distractor Interference

  • During the final stages of a reasoning rollout, Vision-Language Models (VLMs) frequently succumb to distractor interference due to a pervasive "literal bias" [Source: 1: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 2: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. This means they are statistically predisposed to map visual pixels directly to concrete, highly correlated nouns rather than synthesizing abstract meanings [Source: 1: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 2: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].
  • When a model passively encodes an entire image, it processes massive amounts of irrelevant background noise alongside the crucial visual cues [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 4: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • In the Resolution phase—where the model must make its final conceptual leap—this passive consumption causes premature semantic binding [Source: 5: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. For example, the model's attention might fixate entirely on a single literal object (like an hourglass) and immediately trigger a false correlation with a "partially literal distractor" (like "Time"), ignoring the broader context required to resolve the true abstract concept ("Age") [Source: 6: A Computational Approach to Visual Metonymy] [Source: 7: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].

Active Visual Perception as a Filtering Mechanism

  • Active Visual Perception transforms the VLM from a passive observer into an active agent [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. Instead of absorbing the full, noisy image context, the model actively generates bounding box coordinates to crop and adaptively zoom into information-dense regions [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] (see the cropping sketch after this list).
  • By physically cropping out irrelevant visual context, this coarse-to-fine filtering mechanism systematically denies the "partially literal distractors" the background noise and spurious correlations they rely on to hijack the model's attention [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • This ensures that the model successfully isolates the multiple distributed representamens (e.g., isolating the old man, the boy, and the hourglass independently) before attempting any abstract interpretation [Source: 6: A Computational Approach to Visual Metonymy].
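A minimal crop-and-zoom sketch with PIL, assuming normalized box coordinates and an arbitrary 448-pixel output; the sources specify only the crop-then-zoom behavior.

```python
# Sketch of the crop-and-zoom step. The [0, 1] coordinate convention and the
# 448-px re-upscale are assumptions made for illustration.
from PIL import Image

def crop_and_zoom(image: Image.Image, box, out_size=448):
    """box = (x1, y1, x2, y2) in [0, 1] normalized coordinates."""
    w, h = image.size
    px = (int(box[0] * w), int(box[1] * h), int(box[2] * w), int(box[3] * h))
    crop = image.crop(px)
    # Adaptive zoom: upscale the purified region so small representamens
    # (e.g., the boy inside the hourglass) survive the encoder's patching.
    return crop.resize((out_size, out_size), Image.BICUBIC)

# One purified view per representamen, fed to the encoder in place of the
# full noisy scene:
# views = [crop_and_zoom(img, b) for b in predicted_boxes]
```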

Impact on the Latent Resolution Phase

  • In your finalised architecture, the Resolution phase is decoupled from discrete textual verbalisation and is instead driven by exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 13: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy].
  • Because the visual input has been actively cropped prior to this phase, these "Visual Information Containers" compress pure, highly concentrated semantic signals rather than diffuse, full-image visual noise [Source: 11: Precision Perception: Integrating Active Visual Cues and Dense Rewards].
  • This maximizes the Reasoning Density of the latent thoughts [Source: 11: Precision Perception: Integrating Active Visual Cues and Dense Rewards] [Source: 18: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf]. Consequently, the latent tokens are mathematically forced to synthesize the relationship between the isolated visual cues to construct the Interpretant (the mental inference of the scene) [Source: 11: Precision Perception: Integrating Active Visual Cues and Dense Rewards] [Source: 19: A Computational Approach to Visual Metonymy].
  • When the final decoding head projects these continuous hidden states into the discrete vocabulary space, it securely aligns the Interpretant with the correct abstract Object [Source: 19: A Computational Approach to Visual Metonymy]. The semantic gravity of the distractors is neutralized because the latent bottleneck was never exposed to the literal noise that triggers them [Source: 5: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].

Synergy with Precise Selection

  • Active Visual Perception acts synergistically with a Precise Selection Action during the resolution process [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 21: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • Once the purified, zoomed-in visual cues are extracted, the model performs a strict semantic filtering step [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 21: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. It explicitly evaluates the semantic alignment between the concentrated visual regions and each of the provided multiple-choice distractors [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 21: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • This forces the VLM to explicitly evaluate and reject the partially literal distractors based purely on the isolated visual evidence, ensuring the final reasoning step is verifiably grounded in the image pixels rather than textual priors [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].

User

How will you develop the "Resolution phase" as part of the latent associative grounding framework? Also, what happens in this example if the reasoning gets flawed in any of the steps? Since it is a reasoning problem, will I have to worry about context dilution so much that I need to abandon chain of thought completely, or can I believe that light (latent) reasoning will solve this multi-step reasoning problem: Attention distributed across old man, boy, and hourglass → selects "Age" → reward: +1


Model

To develop the Resolution Phase within the latent associative grounding framework, you must architect the final layers of your model to synthesize distributed visual cues into an abstract conceptual decision without relying on explicit text generation.

During this phase, the reasoning process is fundamentally decoupled from verbalization [Source: 1: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 2: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. Instead of generating a verbose textual explanation, the model utilizes its K=5 continuous Latent Visual Reasoning (<lvr>) tokens as a cognitive bottleneck [Source: 3: Bridging the Perception Gap through Latent Associative Grounding]. These tokens have already actively cropped and attended to the various representamens in the image (the old man, the boy, and the hourglass). In the resolution phase, the latent tokens mathematically construct the "Interpretant"—the mental inference that connects these isolated visual cues into a unified understanding of the transition of life [Source: 4: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] [Source: 5: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal]. A separate decoding head then projects these continuous hidden states into the discrete vocabulary space to make a final, deterministic choice, securely aligning the interpretant with the abstract Object: "Age" [Source: 1: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 2: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization].

What happens if the reasoning gets flawed in the "Age" vs. "Time" example?

If the reasoning breaks down at any step during this process, the model falls victim to "literal bias" and distractor interference [Source: 6: Bridging the Perception Gap through Latent Associative Grounding] [Source: 7: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. Mechanistically, early layers of the model will easily and confidently detect the "hourglass" [Source: 8: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 9: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. If the latent associative grounding fails, the model experiences "premature semantic binding" in its middle and final layers; it fixates solely on the literal object (the hourglass) and ignores the surrounding contextual representamens (the old man and the boy) [Source: 8: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 9: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].

When the model makes this flawed leap and selects the partially literal distractor "Time", two strict reinforcement learning penalties are triggered:

  • Zero Outcome Reward: Under the Verifiable Multiple-Choice Reformulation (VMR-RLVR) framework, selecting the semantic distractor yields a deterministic reward of 0, explicitly penalizing the superficial reasoning path [Source: 10: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].
  • Negative Attention Advantage: The internal attention policy is supervised via Advantage-Weighted Attention Divergence (AWAD) [Source: 11: JSD in Vision-Language Models] [Source: 12: Reinforced Attention Learning]. If the model failed to distribute its attention mass across the old man and the boy, fixating aimlessly or solely on the hourglass, it receives a negative advantage [Source: 11: JSD in Vision-Language Models] [Source: 12: Reinforced Attention Learning]. This mathematically pushes the model's policy away from that flawed, narrow visual search strategy, ensuring that in future rollouts, it must distribute its gaze to all relevant cues before making a conceptual leap [Source: 11: JSD in Vision-Language Models] [Source: 12: Reinforced Attention Learning].

Do you have to worry about Context Dilution, and should you abandon text-based Chain-of-Thought (CoT)?

Yes, you absolutely must worry about context dilution, and the architectural solution requires abandoning standard text-based CoT completely [Source: 13: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents].

Generating a traditional, verbose CoT response triggers the "Curse of CoT" [Source: 14: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents]. When an LLM is forced to reason step-by-step using discrete English words, the sheer volume of generated text physically pushes the original visual constraints and few-shot examples further away from the final output tokens [Source: 14: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] [Source: 15: The Curse of Chain-of-Thought in Pattern Recognition]. This increased contextual distance disrupts the model's implicit reasoning capabilities, causing it to suffer from "Lazy Attention Localization" [Source: 17: Bridging the Perception Gap through Latent Associative Grounding] [Source: 18: Lexicon of Visual Metonymy and Vision-Language Models]. As the text chain lengthens, the model abandons the image pixels entirely and defaults to its pre-trained language priors, which almost always favor the literal distractor (e.g., statistically correlating "hourglass" with "Time" instead of "Age") [Source: 17: Bridging the Perception Gap through Latent Associative Grounding] [Source: 18: Lexicon of Visual Metonymy and Vision-Language Models].

Can latent (light) reasoning solve this multi-step problem?

Yes, replacing verbose text with latent reasoning solves this exact multi-step challenge, particularly for lightweight, compact models (like a 3B parameter VLM) [Source: 19: Bridging the Perception Gap through Latent Associative Grounding] [Source: 20: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv]. By compressing the multi-hop logic into exactly K=5 continuous latent planning states, you maintain massive reasoning density while keeping the physical distance between the visual input and the final answer incredibly short [Source: 1: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 2: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. Because Curriculum Sensory Gating (CSG) forces the language head to route all processing through these visual bottlenecks, the model is mathematically compelled to maintain its visual grounding on the image pixels, neutralizing context dilution and successfully resolving the metonymy without a single word of text-based CoT [Source: 18: Lexicon of Visual Metonymy and Vision-Language Models].


User

Draft the final Technical Design Spec section now, with specific focus on the ViMET dataset, which contains Partially Literal Distractors in the multiple-choice questions for each concept. For example, "Time" was the distractor for one of the questions in which the correct answer was "Age". The cues given in the image were an old man and an hourglass with a boy inside.


Model

Here is the final section of your Technical Design Spec, focusing specifically on the data engineering behind the ViMET dataset and the mechanistic role of its distractors. You can append this directly to the end of your document.

--------------------------------------------------------------------------------

6. Data Engineering: Controlled Ambiguity and The ViMET Distractor Trap

To mathematically optimize the latent attention policy of our 3B student model, the training environment must provide an exact, verifiable ground truth. However, visual metonymy is inherently subjective; an open-ended prompt asking a model to "interpret this image" lacks a deterministic reward signal [Source: 1: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

To solve this, our architecture relies entirely on the ViMET (Visual Metonymy) dataset, which leverages a rigorous Verifiable Multiple-Choice Reformulation (VMR) strategy [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. By restructuring subjective interpretations into deterministic A/B/C/D formats, we expose the model's fundamental cognitive flaws using a mechanism known as "Controlled Ambiguity" [Source: 3: A Computational Approach to Visual Metonymy].

A. The Architecture of "Partially Literal" Distractors

Current Vision-Language Models (VLMs) suffer from a pervasive "literal bias," meaning they are statistically predisposed to map visual pixels directly to concrete, highly correlated nouns [Source: 4: 19th March]. The ViMET dataset mathematically exploits this flaw.

Rather than using random incorrect options, the dataset engineers its multiple-choice distractors using a fusion of CLIP embeddings (for visual proximity) and ConceptNet relational graphs (for semantic proximity) [Source: 7: A Computational Approach to Visual Metonymy]. By filtering out direct synonyms using BERT cosine similarity distances, the pipeline creates Partially Literal Distractors—options that are visually present or statistically correlated with one object in the image, but logically incorrect for the holistic scene [Source: 3: A Computational Approach to Visual Metonymy]. This ensures that any model attempting to solve the task via superficial "keyword matching" will fail.
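A hedged sketch of that mining recipe: conceptnet_neighbors(), bert_embed(), and clip_text_embed() are hypothetical stand-ins for the graph lookup and the two encoders, and the weights and thresholds are illustrative.

```python
# Sketch of Partially Literal Distractor mining. Only the CLIP + ConceptNet
# + BERT-synonym-filter recipe is from the spec; the helpers and numbers
# below are hypothetical.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mine_partially_literal_distractors(target, image_emb, k=3,
                                       syn_threshold=0.85):
    candidates = conceptnet_neighbors(target)          # hypothetical helper
    scored = []
    for c in candidates:
        # Reject near-synonyms of the gold concept (BERT cosine filter).
        if cosine(bert_embed(c), bert_embed(target)) > syn_threshold:
            continue
        # Prefer candidates visually anchored to ONE literal object:
        # high CLIP similarity to the image makes the trap convincing.
        scored.append((cosine(clip_text_embed(c), image_emb), c))
    return [c for _, c in sorted(scored, reverse=True)[:k]]

# e.g. mine_partially_literal_distractors("Age", hourglass_scene_emb)
# -> ["Time", ...]: correlated with the hourglass, wrong for the scene.
```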

B. Case Study: The "Age" vs. "Time" Conundrum

To illustrate the necessity of our Latent Associative Grounding pipeline, consider a specific multiple-choice instance from the ViMET dataset:

  • The Image: An elderly man standing next to an hourglass. Inside the flowing sand of the hourglass is a young boy [Source: 8: A Computational Approach to Visual Metonymy].
  • The Correct Answer: Age
  • The "Hard Negative" Distractor: Time

The VLM Failure Mode (Literal Bias): When a standard VLM processes this image, its early layers easily detect the hourglass. Because the model's textual priors heavily associate "hourglass" with "Time", the network experiences premature semantic binding. The model's attention fixates on the literal object (the hourglass), ignores the surrounding context (the old man and the boy), and confidently predicts the distractor: "Time". The model has failed because it confused a literal component of the image for its abstract intent.

The Latent Associative Grounding Solution: Our architecture prevents this failure mode entirely.

  • Active Perception: The model is forced to actively crop and isolate all distributed representamens (the old man, the boy, and the hourglass) before reasoning [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • The Latent Bottleneck: By compressing these cues into exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 10: Architecting Latent Visual Reasoning for Metonymy Research], the model cannot jump to a premature textual conclusion. It must mathematically synthesise the relationship between the old man, the young boy, and the passage of sand.
  • Resolution: The latent tokens construct the "Interpretant"—the transition of life—which correctly aligns with the abstract Object: "Age" [Source: 8: A Computational Approach to Visual Metonymy] [Source: 11: A Computational Approach to Visual Metonymy].

C. Enabling "Reasoning by Contrast" via GRPO

By providing these meticulously engineered distractors, the ViMET dataset creates the perfect environment for Group Relative Policy Optimization (GRPO) [Source: 12: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 13: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].

During the reinforcement learning phase (VMR-RLVR), the model explores multiple latent reasoning trajectories [Source: 14: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf]. If the model's internal attention map simply fixates on the hourglass and it selects "Time", it receives a strict reward of 0 [Source: 14: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf]. If it successfully distributes its attention across the old man, the boy, and the hourglass to select "Age", it receives a reward of +1 [Source: 12: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 14: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf].

This binary reward system forces the model into "reasoning by contrast" [Source: 15: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf]. The 3B student mathematically learns the intrinsic logical boundaries that separate a superficial literal correlation (the distractor) from a correct pragmatic interpretation [Source: 15: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] [Source: 16: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning]. Ultimately, this guarantees that our edge-deployed architecture does not just memorize cultural icons, but actively performs verifiable abductive reasoning on device.

--------------------------------------------------------------------------------

Final Review: With this section included, your Technical Design Spec is completely finalised. It now covers the core problem (The Perception Gap), the hardware constraints, the mathematical solutions (LaViT + NodeRAG offline priors), the strict evaluation metrics, and the precise data engineering (VMR-RLVR) required to train it.


User

Draft the Executive Summary.


Model

Here is the drafted Executive Summary section for your Technical Design Spec. It is written with a rigorous, engineering-focused tone tailored for an Apple AI/ML hiring manager, hooking the reader with the core performance deficits of current models before presenting your finalized architectural solution.

You can paste this directly at the very beginning of your document.

--------------------------------------------------------------------------------

Executive Summary: Overcoming the Perception Gap in Vision-Language Models

The Problem Context: The "Literal Wall" and the Perception Gap

Despite the rapid evolution of Vision-Language Models (VLMs), the industry has hit a "Literal Wall" when it comes to associative grounding—the ability to connect distributed visual cues to abstract, physically absent concepts (Visual Metonymy) [Source: 1: A Computational Approach to Visual Metonymy] [Source: 2: Bridging the Perception Gap through Latent Associative Grounding]. Current VLMs exhibit a severe 21% accuracy deficit compared to human performance (65.9% vs. 86.9%) on the ViMET benchmark [Source: 3: Latent Associative Grounding for Visual Metonymy in Small VLMs].

This failure stems fundamentally from a "Perception Gap": during training, models learn what to say by mimicking textual distributions, but they fail to learn where to look [Source: 2: Bridging the Perception Gap through Latent Associative Grounding]. As a result, models often bypass visual evidence entirely, resorting to "blind guessing" driven by statistical language priors. Attempting to solve this "modality laziness" using verbose, text-based Chain-of-Thought (CoT) triggers the "Curse of CoT", causing severe context dilution and hallucinatory drift as the reasoning chains lengthen [Source: 3: Latent Associative Grounding for Visual Metonymy in Small VLMs] [Source: 4: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. Furthermore, relying on massive parameter scaling (e.g., 32B+ models) or active online multi-hop GraphRAG retrievals is computationally prohibitive for edge deployment, fundamentally violating our strict 16GB VRAM hardware constraint [Source: 3: Latent Associative Grounding for Visual Metonymy in Small VLMs].

The Compelling Shift: Latent Associative Grounding

To bridge this gap without exceeding edge-compute limits, this architecture introduces a fundamental paradigm shift: Latent Associative Grounding. We completely decouple the model's reasoning process from textual verbalization [Source: 4: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization].

Using a highly optimized Qwen2.5-VL-3B-Instruct backbone, we replace verbose text-based reasoning with exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 3: Latent Associative Grounding for Visual Metonymy in Small VLMs]. By aligning the network’s bottleneck capacity exactly with the dataset’s cognitive structure (5 representamens per concept), these tokens act as high-density "Visual Information Containers." Using Curriculum Sensory Gating (CSG), we physically throttle direct text-to-image attention, forcing the model to mathematically compress multi-step visual semantics into this continuous latent space before it is permitted to generate a single word of text.

The Optimization Engine: VMR-RLVR and Dense Perception Rewards

Training continuous latent thoughts on subjective figurative language requires flawless optimization targets. We achieve this by deploying Verifiable Multiple-Choice Reformulation (VMR-RLVR) [Source: 5: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 6: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. We restructure open-ended metonymy tasks into deterministic A/B/C/D verification formats, pitting the correct abstract concept against meticulously engineered "Partially Literal Distractors" [Source: 6: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 7: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

This creates a strict binary reward environment where the 3B student engages in "reasoning by contrast" via Group Relative Policy Optimization (GRPO) [Source: 7: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. To prevent reward hacking (guessing correctly while looking at background noise), the RLVR loop is fortified with Active Visual Perception and a dense multi-reward scheme [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. By actively generating spatial crops and evaluating them via Intersection-over-Union (r_crop) and Advantage-Weighted Attention Divergence (AWAD), we mathematically guarantee that the model's internal gaze is anchored to the correct physical pixels [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 10: JSD in Vision-Language Models].

The Strategic Impact

By synthesizing offline structural priors, continuous latent bottlenecks, and visually-anchored reinforcement learning, this pipeline systematically eradicates the Perception Gap [Source: 2: Bridging the Perception Gap through Latent Associative Grounding]. It transforms a compact 3B model from a passive text mimic into an active, visually grounded reasoning agent—delivering expert-level associative logic well within the latency and efficiency constraints required for on-device edge deployment [Source: 3: Latent Associative Grounding for Visual Metonymy in Small VLMs].

--------------------------------------------------------------------------------

Next Steps for the Case Study: With the Executive Summary, Landscape Analysis, Workflow Stages, and Evaluation Metrics all drafted, your Technical Design Spec is structurally complete. Would you like to review the entire document together to ensure the transitions are seamless, or would you like to draft a final "Deployment & Hardware Constraints" section detailing how INT8 quantization and the A-series Neural Engine specifically benefit from these K=5 latent tokens?


User

How do I ensure that reasoning is performed correctly to eliminate "partial distractors" over correct options?


Model

To ensure that your model correctly performs reasoning to eliminate "partial distractors" (or "partially literal distractors") in favor of the correct abstract concept, you must implement a combination of targeted reinforcement learning, active perception mechanisms, and strict attention guardrails. Vision-Language Models (VLMs) inherently suffer from a "literal bias," meaning they are statistically predisposed to map visual pixels directly to concrete nouns rather than abstract meanings [Source: 1: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal]. If left unchecked, this bias causes the model's final layers to succumb to distractor interference, choosing a literal or superficial interpretation over the correct metonymic inference [Source: 4: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].

Here is the comprehensive architectural and optimization strategy required to eliminate these distractors and guarantee mathematically grounded reasoning.

1. Verifiable Multiple-Choice Reformulation (VMR) & "Reasoning by Contrast"

You cannot eliminate distractors effectively if the model is generating open-ended text, as open-ended generation lacks an unambiguous ground truth for subjective figurative language [Source: 7: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

To fix this, you must restructure the task using Verifiable Multiple-Choice Reformulation (VMR) [Source: 7: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 8: Lexicon of Visual Metonymy and Vision-Language Models]. The VMR framework pits the correct abstract concept against meticulously engineered semantic and "partially literal" distractors [Source: 2: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. In the ViMET dataset, these distractors are generated via ConceptNet and BERT embeddings to be both visually and semantically proximate to the target, creating a highly confusing "hard negative" [Source: 10: A Computational Approach to Visual Metonymy].

By forcing the model into a deterministic A/B/C/D format, you enable Group Relative Policy Optimization (GRPO) [Source: 13: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 14: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal]. During this RLVR phase, the model explores multiple latent reasoning trajectories. Because the rewards are strictly binary (+1 for the correct concept, 0 for the distractor), the model engages in "reasoning by contrast" [Source: 15: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 16: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning]. Rather than merely imitating token probabilities, the model learns the intrinsic logical boundaries separating a correct pragmatic interpretation from a superficial literal mistake, mathematically penalizing the selection of the distractor [Source: 15: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 16: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning].

2. Active Visual Perception and Precise Selection

Standard VLMs encode the entire image at once, forcing the model to process massive amounts of irrelevant background noise [Source: 17: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. This passive consumption often allows the literal distractors to hijack the model's attention.

To prevent this, you should integrate mechanisms from the UniDoc-RL framework:

  • Active Visual Perception: Instead of looking at the whole image, the agent actively generates bounding box coordinates to crop and adaptively zoom into specific regions of interest (the representamens) [Source: 18: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 19: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. By physically cropping out irrelevant visual context, you deny the distractors the background noise they often rely on to trigger false correlations [Source: 18: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards].
  • Precise Selection Action: Once the visual cues are cropped, the model performs a semantic filtering step. It explicitly evaluates the semantic alignment between the concentrated visual regions and each of the multiple-choice distractors before outputting its final decision [Source: 19: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 20: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards]. This ensures the model explicitly evaluates and rejects the partially literal distractors based purely on the isolated visual evidence [Source: 20: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] (a selection-scoring sketch follows below).
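A minimal sketch of that selection step: embed_text() is a hypothetical stand-in for the student's text encoder, and the min/mean scoring rule is an illustrative way to demand alignment with all cues rather than the single best-matching crop.

```python
# Sketch of Precise Selection: score each option against the pooled features
# of the purified crops. embed_text() is hypothetical; the scoring rule is
# an illustrative assumption.
import torch
import torch.nn.functional as F

def precise_selection(crop_embeddings: torch.Tensor, options: list[str]):
    """crop_embeddings: (n_crops, d) features of the isolated representamens."""
    scene = crop_embeddings.mean(0, keepdim=True)          # holistic evidence
    scores = {}
    for opt in options:
        t = embed_text(opt)                                # hypothetical (1, d)
        # Alignment with ALL cues, not just the best-matching crop: this is
        # what demotes "Time" (anchored only to the hourglass) below "Age".
        per_cue = F.cosine_similarity(crop_embeddings,
                                      t.expand_as(crop_embeddings))
        scores[opt] = 0.5 * per_cue.min().item() \
                    + 0.5 * F.cosine_similarity(scene, t).item()
    return max(scores, key=scores.get)
```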

3. Dense Perception Rewards and Reinforced Attention Learning (RAL)

A sparse outcome reward (simply giving a +1 for guessing the correct final answer) is insufficient because a model might guess the right answer through language correlation while physically looking at a distractor [Source: 21: Bridging the Perception Gap through Latent Associative Grounding] [Source: 22: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] . To ensure the reasoning is correct, you must supervise the internal attention policy directly:

  • Visual Perception Reward (r_crop): Using offline bounding boxes verified by Grounding DINO and a 32B teacher, you calculate the Intersection-over-Union (IoU) between the model's active visual crops and the ground-truth representamens [Source: 23: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] . This dense reward ensures the model actually localized the evidence before attempting to reason [Source: 25: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
  • Advantage-Weighted Attention Divergence (AWAD): Operating under the Reinforced Attention Learning (RAL) framework, AWAD calculates the Jensen-Shannon Divergence (JSD) between the model's attention and a verified target [Source: 26: JSD in Vision-Language Models] [Source: 27: Reinforced Attention Learning] . If the model chooses the correct answer but its attention map shows it was looking at a literal distractor or background noise, it receives a negative advantage [Source: 26: JSD in Vision-Language Models] [Source: 28: JSD in Vision-Language Models] . Because JSD is bounded, this signal stays numerically stable, steering the model's gaze away from distracting regions and anchoring its policy to the correct visual cues [Source: 26: JSD in Vision-Language Models] . A minimal sketch of both signals follows this list.
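The sketch below computes the two process signals under stated assumptions: boxes in (x1, y1, x2, y2) format and attention maps as arrays; shapes and names are illustrative, not the source's implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes, used for r_crop."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-8)

def attention_jsd(attn_student: np.ndarray, attn_target: np.ndarray) -> float:
    """Bounded divergence between normalized attention maps, used by AWAD.
    scipy's jensenshannon returns the square root of JSD, so square it."""
    p = attn_student.ravel() / attn_student.sum()
    q = attn_target.ravel() / attn_target.sum()
    return float(jensenshannon(p, q) ** 2)
```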

4. Overcoming "Lazy Attention" via rvisualrvisual​ and Sensory Gating

Even with perfect crops, VLMs often suffer from "Lazy Attention Localization"—as the reasoning chain lengthens, the model abandons the visual tokens entirely and defaults to its textual system priors, which heavily favor literal distractors [Source: 30: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] .

  • Visual-Anchored Reward Shaping (r_visual): You must add an auxiliary reinforcement signal that evaluates the ratio of attention mass the model assigns to visual tokens relative to system tokens [Source: 33: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] . This explicitly penalizes the model if it tries to solve the multiple-choice question by relying on text priors rather than continuous visual observation [Source: 30: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] [Source: 35: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] (a minimal sketch of this ratio follows the list).
  • Latent Visual Reasoning (<lvr>) & Curriculum Sensory Gating (CSG): To physically prevent the model from leaning on text-based distractors, you replace discrete English reasoning words with exactly K=5 continuous <lvr> tokens [Source: 36: Bridging the Perception Gap through Latent Associative Grounding] . Concurrently, Curriculum Sensory Gating (CSG) applies a negative bias to block direct text-to-image cross-attention early in training [Source: 37: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 38: Lexicon of Visual Metonymy and Vision-Language Models] . This creates a strict cognitive bottleneck, mathematically forcing the model's language head to route all processing through the seeded latent tokens, making it physically impossible for the model to bypass the visual evidence to guess a distractor [Source: 37: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 38: Lexicon of Visual Metonymy and Vision-Language Models] .
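A minimal sketch of the attention-mass ratio behind r_visual, assuming access to the attention rows emitted from the <lvr> query positions; tensor shapes and index masks are illustrative assumptions:

```python
import torch

def r_visual(attn: torch.Tensor, visual_idx: torch.Tensor, system_idx: torch.Tensor) -> float:
    """attn: [heads, queries, keys] attention taken from the <lvr> positions.
    Returns the share of attention mass on image tokens versus system tokens,
    rewarding rollouts that keep their gaze on visual evidence."""
    mass = attn.mean(dim=(0, 1))                   # average over heads and query slots
    visual_mass = mass[visual_idx].sum()
    system_mass = mass[system_idx].sum()
    return (visual_mass / (visual_mass + system_mass + 1e-8)).item()
```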

5. Inference-Time Intervention: Contrastive Decoding (CD)

If you require an additional layer of protection against distractors during inference without retraining the model, you can implement Contrastive Decoding (CD) [Source: 40: Lexicon of Visual Metonymy and Vision-Language Models] [Source: 41: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .

  • You run a high-capacity "Expert" model (e.g., 32B) alongside a smaller "Amateur" model (e.g., 1B) on the same image [Source: 41: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 42: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .
  • The Amateur model naturally gravitates toward literal distractors because it lacks the capacity for deep associative reasoning (e.g., it will confidently predict "Glass" when looking at an hourglass) [Source: 41: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 42: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .
  • By mathematically subtracting the Amateur's probability logits from the Expert's logits, you actively penalize and filter out the literal bias [Source: 40: Lexicon of Visual Metonymy and Vision-Language Models] [Source: 41: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] . This subtraction effectively neutralizes the semantic pull of the "partially literal distractors," leaving only the abstract, metonymic intent (e.g., "Age") [Source: 41: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 42: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] (a single-step decoding sketch follows this list).
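A minimal single-step sketch of the logit subtraction, assuming both models share a tokenizer; the alpha plausibility cutoff follows the standard contrastive-decoding formulation rather than anything specified in the source:

```python
import torch

def contrastive_logits(expert_logits: torch.Tensor,
                       amateur_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Subtract the Amateur's log-probabilities from the Expert's, masking
    tokens the Expert itself considers implausible (the plausibility cutoff)."""
    expert_logp = torch.log_softmax(expert_logits, dim=-1)
    amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
    cutoff = expert_logp.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    scores = expert_logp - amateur_logp            # the shared literal bias cancels here
    return scores.masked_fill(expert_logp < cutoff, float("-inf"))
```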

User

Draft the final Technical Design Spec section on evaluation metrics


Model

Here is the drafted Evaluation Framework section for your Technical Design Spec. It maintains the rigorous engineering tone of the previous sections and explicitly categorises metrics into "Outcome" (did the model get it right?) and "Process" (did the model actually look at the right pixels to get it right?), which is critical for proving you have solved the Perception Gap.

You can append this directly to the end of your document.

--------------------------------------------------------------------------------

5. Evaluation Framework: Measuring Outcome and Process

Evaluating associative grounding in visual metonymy requires a paradigm shift in benchmarking. Standard evaluations rely almost entirely on outcome-based metrics (e.g., accuracy), which fail to expose the "Perception Gap." A model might achieve high accuracy through statistical text correlation (guessing based on language priors) while entirely ignoring the image [Source: 1: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

To mathematically guarantee that our compact 3B model is performing genuine, visually grounded reasoning, our evaluation framework strictly bifurcates into Outcome Metrics (evaluating what the model concludes) and Process Metrics (evaluating how and where the model looked during its latent computation).

A. Process Metrics: Verifying the Cognitive Trajectory

These metrics evaluate the physical attention distribution of the model's K=5 Latent Visual Reasoning (<lvr>) tokens, ensuring the internal policy successfully extracts visual evidence.

  • Visual Focusing Score (S_focus): This metric quantifies the exact spatial "gaze" of the model, measuring the percentage of the attention mass that physically falls within the verified bounding boxes of the representamens [Source: 3: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . In visual metonymy, S_focus evaluates associative grounding, verifying that attention converges on the distributed visual cues that evoke the absent concept [Source: 5: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] . Empirical data proves a strict monotonic relationship with reasoning success: correct inferences maintain a significantly higher S_focus (~15.89%) compared to incorrect ones (~11.84%), while negligible focus (<1%) guarantees hallucinations [Source: 6: Bridging the Perception Gap: Accuracy vs. Visual Focusing Score] [Source: 7: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] (a computation sketch for S_focus and H follows this list).
  • Visual Attention Score (VAS): VAS evaluates the ratio of attention the model assigns to visual tokens relative to system tokens [Source: 8: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] [Source: 9: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . Multimodal reasoning capability is highly correlated with VAS (r=0.9616) [Source: 9: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 10: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . By tracking VAS, we can verify that the model has overcome "Lazy Attention Localization", the anti-pattern where a model abandons visual evidence during a rollout and reverts to its textual system priors [Source: 10: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 11: AVAR Framework: Visual-Anchored Reward Shaping and Optimization] .
  • Attention Entropy (H): We measure Information Entropy to quantify the "sharpness" or concentration of the model's visual focus [Source: 12: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . Untrained or passive VLMs exhibit high entropy (e.g., H=4.870), meaning their gaze is scattered and diffuse across background noise [Source: 13: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . A successfully optimized latent bottleneck reduces this entropy (e.g., to H=4.686), proving the model actively isolates task-relevant regions while ignoring distractions [Source: 13: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 14: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .
  • Reasoning Density: To ensure our RLVR pipeline does not fall victim to "Length Bias" (where models simply inflate their responses to hack reward evaluators), we track Reasoning Density: the number of distinct reasoning steps executed per 1,000 words [Source: 15: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 16: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] . This proves that our K=5 latent token constraint efficiently compresses logical steps without the context dilution or verbosity of standard text CoT [Source: 16: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .
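A minimal computation sketch for S_focus and the entropy H, assuming the attention has been pooled to a [H, W] patch grid and boxes are given in grid cells; all shapes and names are illustrative assumptions:

```python
import numpy as np

def s_focus(attn_grid: np.ndarray, boxes) -> float:
    """Fraction of attention mass inside the verified representamen boxes.
    attn_grid: [H, W] non-negative attention; boxes: (r1, c1, r2, c2) cells."""
    mask = np.zeros_like(attn_grid, dtype=bool)
    for r1, c1, r2, c2 in boxes:
        mask[r1:r2, c1:c2] = True
    return float(attn_grid[mask].sum() / (attn_grid.sum() + 1e-8))

def attention_entropy(attn_grid: np.ndarray) -> float:
    """Shannon entropy of the flattened attention distribution; lower values
    indicate a sharper, more concentrated gaze."""
    p = attn_grid.ravel() / (attn_grid.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-12)).sum())
```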

B. Outcome Metrics: The Verification Step

Once the internal process is validated, we measure the deterministic accuracy of the final prediction against rigorously engineered benchmarks.

  • Accuracy & Exact Match (EM): Because the VMR-RLVR pipeline reformulates open-ended generation into a strict multiple-choice format, we evaluate deterministic Exact Match against the engineered semantic distractors [Source: 17: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .

C. Core Deployment Benchmarks

The pipeline is validated against datasets specifically chosen to stress-test reasoning beyond the literal.

  • ViMET (Visual Metonymy Dataset): The primary benchmark consisting of 2,000 multiple-choice questions grounded in 1,000 abstract concepts [Source: 18: A Computational Approach to Visual Metonymy] . The distractors are engineered via ConceptNet to be both visually and semantically proximate to the correct answer, establishing a baseline where current state-of-the-art models (like InternVL3 78B) lag behind human performance by 21% (65.9% vs. 86.9%) [Source: 19: A Computational Approach to Visual Metonymy] [Source: 20: A Computational Approach to Visual Metonymy] .
  • Sherlock Dataset: Evaluates visual abductive reasoning across 103K images [Source: 21: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association] . This tests the model's ability to localize clues (via bounding boxes) and infer plausible situational hypotheses that go beyond the literal contents of the frame [Source: 21: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association] [Source: 22: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association] .
  • Generalization (MathVision & MMMU-Pro): To ensure the internal attention policy transfers to broader analytical domains, the model is cross-evaluated on MathVision (requiring multi-step geometric/spatial reasoning) and MMMU-Pro (expert-level multidisciplinary evaluation) [Source: 23: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 24: Reinforced Attention Learning] .

--------------------------------------------------------------------------------

Next Steps for the Case Study: With this section complete, your Technical Design Spec has a rock-solid foundation covering the Problem, the Architecture, the Workflow, and the Evaluation.

To conclude the document (and directly address the Apple angle), would you like to draft the Executive Summary, or should we create a concluding Deployment & Hardware Constraints section detailing how INT8 quantization and Apple's A-series Neural Engine leverage these 5 latent tokens for extreme low-latency edge performance?


User

How do I draft the Proposed Solutions and SFT/RL workflow?


Model

Here is the drafted Proposed Solutions and Workflow Stages section for your Technical Design Spec. It maintains the rigorous, engineering-focused tone from the Landscape Analysis and explicitly breaks down the mathematical and architectural decisions step-by-step.

You can copy and paste this directly into your document following the previous section.

--------------------------------------------------------------------------------

3. Proposed Solutions: The Latent Associative Grounding Pipeline

To execute associative grounding within the strict 16GB VRAM limits of edge deployment, we shift the architectural trajectory of the Qwen2.5-VL-3B-Instruct base model from passive text mimicry to active, agentic visual extraction.

Our proposed solution, Latent Associative Grounding, completely deprecates text-based Chain-of-Thought (CoT) and real-time graph retrieval. Instead, it utilizes a hybrid VMR-RLVR + Latent Reasoning pipeline. By converting the subjective task of visual metonymy into a verifiable, deterministic Multiple-Choice Reformulation (VMR), we create a secure environment to optimize K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 1: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] . This physically forces the model to encode verified visual evidence into continuous hidden states before it is allowed to generate a textual answer, bypassing context dilution entirely [Source: 1: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 3: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] .

--------------------------------------------------------------------------------

4. Workflow Stages & Technical Implementation

To stabilize the optimization of these latent tokens and prevent gradient noise, the training framework is strictly divided into three phases: Offline Target Synthesis, Latent Policy Alignment (SFT), and Active Optimization (RLVR).

Stage 0: Offline Preprocessing & Teacher-Verified Trajectory Synthesis

  • What: The programmatic extraction of ground-truth visual boundaries and the generation of hard semantic distractors prior to any student model training.
  • Why: Reinforcement learning and trajectory distillation require mathematically flawless targets. If the offline targets contain hallucinatory visual signals, the student's RLVR loop will suffer a zero-information collapse, learning to "look everywhere" [Source: 4: 19th March] [Source: 5: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] .
  • How:
    • Visual Target Synthesis: We utilize Grounding DINO to scan the training images and extract candidate bounding boxes for the distributed visual cues (the representamens) [Source: 6: Preventing NaN Cascades in Training] . Because zero-shot detectors lack metonymic understanding, we pass these candidate boxes to an unquantized Qwen2.5-VL-32B teacher model [Source: 1: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 5: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] . The 32B teacher acts as a semantic filter, strictly verifying that the bounding box contains the exact visual evidence mapping to the abstract concept [Source: 5: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] [Source: 7: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] (see the pipeline sketch after this list).
    • VMR Data Engineering: Simultaneously, we utilize NodeRAG knowledge graphs and ConceptNet strictly offline to engineer "Partially Literal Distractors" [Source: 8: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] . This transforms the open-ended metonymy images into strict A/B/C/D verification formats, laying the groundwork for binary RL rewards [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 8: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .
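A minimal sketch of the Stage 0 loop under stated assumptions: detect_boxes wraps Grounding DINO and teacher_verifies wraps the unquantized 32B teacher; both are hypothetical helpers, not APIs from the source.

```python
import pickle

def synthesize_targets(samples, detect_boxes, teacher_verifies,
                       out_path: str = "teacher_signals.pkl") -> None:
    """For each image, keep only the candidate boxes the teacher confirms as
    genuine representamens of the abstract concept, then cache them to disk."""
    records = []
    for image, concept, representamens in samples:
        candidates = detect_boxes(image, representamens)   # zero-shot proposals
        verified = [box for box in candidates
                    if teacher_verifies(image, box, concept)]
        records.append({"concept": concept, "boxes": verified})
    with open(out_path, "wb") as f:
        pickle.dump(records, f)            # teacher can be unloaded after this point
```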

Stage 1: Supervised Fine-Tuning (SFT) & Latent Policy Alignment

  • What: A warm-up distillation phase mapping the 3B student's randomly initialized latent space to the verified offline semantic and spatial targets.
  • Why: Attempting to train the <lvr> tokens, the QLoRA adapters, and the output head simultaneously via RL from step zero will cause massive gradient noise and NaN crashes [Source: 9: Preventing NaN Cascades in Training] [Source: 10: Preventing NaN Cascades in Training] . The student requires a structured foundation in latent reasoning before it is permitted to explore autonomously [Source: 11: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] .
  • How: We implement a Two-Stage Pre-alignment using a strict Tri-Loss objective:
    • The Tri-Loss Setup: The loss function minimizes Next-Token Prediction (L_ntp), Semantic Reconstruction (L_concept) to align the latent space with the teacher's conceptual understanding, and Trajectory Alignment (L_traj) [Source: 12: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . Crucially, to prevent mathematical instability (NaN cascades) when comparing the student's gaze against our sparse, highly concentrated offline bounding boxes, we use an optimal transport metric, Earth Mover's Distance (Wasserstein-1), in place of standard KL Divergence [Source: 6: Preventing NaN Cascades in Training] [Source: 9: Preventing NaN Cascades in Training] (a minimal loss sketch follows this list).
    • Curriculum Sensory Gating (CSG): As training stabilizes, we apply a mathematical mask that gradually blocks the model's direct text-to-image cross-attention [Source: 15: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . This physically throttles the language priors, forcing the language head to route all visual processing directly through the K=5 <lvr> tokens [Source: 15: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 16: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .
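A minimal sketch of the Tri-Loss combination. The Wasserstein-1 term is computed as the L1 distance between CDFs along the visual-token axis, which is exact for one-dimensional distributions on a shared grid; the loss weights and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def wasserstein_1d(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Earth Mover's Distance between normalized 1-D distributions [B, N].
    Unlike KL, it never takes the log of a zero bin, so sparse box-shaped
    targets cannot trigger NaN cascades."""
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum(dim=-1).mean()

def tri_loss(l_ntp: torch.Tensor,
             student_sem: torch.Tensor, teacher_sem: torch.Tensor,
             student_attn: torch.Tensor, target_attn: torch.Tensor,
             w_concept: float = 1.0, w_traj: float = 0.5) -> torch.Tensor:
    """L = L_ntp + w_concept * L_concept + w_traj * L_traj."""
    l_concept = 1.0 - F.cosine_similarity(student_sem, teacher_sem, dim=-1).mean()
    l_traj = wasserstein_1d(student_attn, target_attn)
    return l_ntp + w_concept * l_concept + w_traj * l_traj
```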

Stage 2: Active Optimization via VMR-RLVR (The RL-Agentic Loop)

  • What: The final reinforcement learning loop where the model perfects its internal attention policy through deterministic trial-and-error.
  • Why: SFT only teaches the model what to mimic [Source: 17: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . To achieve genuine associative grounding, the model must engage in "reasoning by contrast," actively penalizing literal shortcuts and exploring trajectories to verify semantic associations [Source: 18: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] . Standard sparse rewards (giving a +1 at the end for the right answer) fail here due to the credit assignment problem [Source: 19: Precision Perception: Integrating Active Visual Cues and Dense Rewards] [Source: 20: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
  • How: We deploy Group Relative Policy Optimization (GRPO) augmented with mechanisms from the UniDoc-RL and AVAR frameworks to create a dense, visually grounded reward signal [Source: 21: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] :
    • Active Visual Perception: Rather than passively encoding the full image, the model is trained to actively predict bounding box coordinates to crop and zoom into the representamens [Source: 22: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 24: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This coarse-to-fine filtering strips irrelevant background noise before it hits the latent bottleneck, maximizing reasoning density [Source: 24: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
    • The Dense Reward Function: The GRPO loop evaluates the rollouts using a synthesized reward equation: R_total = λ1·r_accuracy + λ2·r_format + λ3·r_crop + λ4·r_visual [Source: 21: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 25: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] (see the sketch after this list).
      • r_accuracy and r_format verify the VMR choice and schema adherence [Source: 26: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 27: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] .
      • r_crop calculates the Intersection-over-Union (IoU) between the model's active crops and our teacher-verified offline bounding boxes, proving it physically isolated the correct visual cue [Source: 28: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
      • r_visual utilizes Advantage-Weighted Attention Divergence (AWAD) to measure the ratio of attention mass placed on visual tokens versus system tokens [Source: 26: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . If the model guesses correctly but its attention was lazily wandering the background, AWAD applies a negative advantage, penalizing the trajectory [Source: 26: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 29: Lexicon of Visual Metonymy and Vision-Language Models] .
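A minimal sketch of the combined reward; the lambda weights are placeholders, not tuned values from the source.

```python
def total_reward(r_accuracy: float, r_format: float, r_crop: float, r_visual: float,
                 lambdas=(1.0, 0.2, 0.5, 0.5)) -> float:
    """R_total = λ1·r_accuracy + λ2·r_format + λ3·r_crop + λ4·r_visual."""
    l1, l2, l3, l4 = lambdas
    return l1 * r_accuracy + l2 * r_format + l3 * r_crop + l4 * r_visual
```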

Summary: By chaining offline spatial verification into latent token distillation and concluding with dense, visually-anchored reinforcement learning, this workflow systematically eliminates the Perception Gap. The resulting edge-deployed 3B model is mathematically required to anchor its abstract conceptual leaps to verifiable physical pixels.


User

What defines the optimal rank for metonymy LoRA adapters?


Model

Based on your finalized system architecture and the fundamental mechanics of Low-Rank Adaptation (LoRA), the optimal rank (r) for your visual metonymy pipeline is defined by the need to balance the high cognitive capacity required for associative reasoning against your strict 16GB VRAM hardware constraints.

For your specific architecture utilizing the Qwen2.5-VL-3B-Instruct student model, the optimal rank has been finalized as r=64, paired with a scaling alpha of α=128 [Source: 1: 19th March] [Source: 2: Architecting Latent Visual Reasoning for Metonymy Research] .

Here is a detailed breakdown of what defines this optimal rank and why it is necessary for visual metonymy:

1. The Mathematical Role of Rank (r)

In LoRA, the rank determines the dimensionality of the subspace in which the model's weight updates live [Source: 3: Understanding LoRA from First Principles] [Source: 4: Understanding LoRA from First Principles] . Instead of learning a massive, full-density weight update matrix (ΔW) during fine-tuning, LoRA factorizes this update into two smaller matrices (A and B) such that ΔW = BA [Source: 5: Understanding LoRA from First Principles] .

  • A small rank (e.g., r=4 or r=8) acts as a strong regularizer and heavily constrains adaptation [Source: 4: Understanding LoRA from First Principles] .
  • A larger rank (e.g., r=32, r=64, or higher) increases the model's capacity, allowing the adapter to model much more complex task shifts [Source: 4: Understanding LoRA from First Principles] .

2. The Cognitive Demands of Visual Metonymy

While general instruction-tuning tasks often perform perfectly well with ranks between 8 and 32, visual metonymy requires a profound "complex task shift" [Source: 4: Understanding LoRA from First Principles] [Source: 6: Understanding LoRA from First Principles] .

You are not training the model to simply output literal image captions (behavioral mimicry); you are training it to perform associative grounding [Source: 7: Associative Grounding: Synthesising LaViT, NodeRAG and PEFT for Metonymy] . The model must learn to extract multiple distributed visual cues (the representamens) and map them to a physically absent, abstract concept (the object) [Source: 7: Associative Grounding: Synthesising LaViT, NodeRAG and PEFT for Metonymy] [Source: 8: A Computational Approach to Visual Metonymy] . Setting the rank to r=64 provides the necessary mathematical capacity (the number of independent "directions" the model can move in the weight space) to learn these deep, cross-modal semantic associations without suffering from underfitting [Source: 4: Understanding LoRA from First Principles] [Source: 6: Understanding LoRA from First Principles] .

3. The Target Module Dependency

An optimal rank of 64 is only effective if applied to the correct neural pathways. In your architecture, visual reasoning relies heavily on the MLP layers, where factual and semantic knowledge is actually stored [Source: 9: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .

To fully leverage the r=64 capacity, your adapters must target all major linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) [Source: 9: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 10: Understanding LoRA from First Principles] . Targeting fewer modules to save memory would severely limit the 3B student's ability to absorb the complex visual concepts distilled from the 32B teacher [Source: 9: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 10: Understanding LoRA from First Principles] .
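A minimal configuration sketch using the Hugging Face peft library; the dropout value is an illustrative assumption, while the rank, alpha, and target modules come from the finalized architecture:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                       # rank: capacity for the associative task shift
    lora_alpha=128,             # scaling alpha, per the finalized architecture
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all major linear layers
    lora_dropout=0.05,          # illustrative regularization value
    task_type="CAUSAL_LM",
)
```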

4. Integration with Latent Reasoning (<lvr>)

Finally, because your architecture completely decouples textual Chain-of-Thought in favor of K=5 continuous Latent Visual Reasoning (<lvr>) tokens, your LoRA configuration must be uniquely adapted [Source: 2: Architecting Latent Visual Reasoning for Metonymy Research] . While the r=64 adapters update the internal attention and MLP weights, you must explicitly keep the input and output mapping layers (embed_tokens and lm_head) completely unfrozen and trainable [Source: 11: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .

Furthermore, you must ensure mathematical synchronization between these expanded embeddings and the output head (via ensure_weight_tying=True) [Source: 14: [GEMINI] LLM Fine-Tuning OOM Error Resolution] . If you fail to unfreeze these layers or sync them, the AdamW optimizer will ignore the new <lvr> tokens, leaving them stuck at their random initializations, and the entire latent bottleneck mechanism will fail [Source: 13: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .
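A minimal sketch of the vocabulary expansion step, assuming a loaded Hugging Face tokenizer and model; the <lvr> naming scheme is illustrative. Extending the LoraConfig above with modules_to_save keeps embed_tokens and lm_head fully trainable, so AdamW actually updates the new token rows instead of leaving them at random initialization:

```python
lvr_tokens = [f"<lvr{i}>" for i in range(5)]               # K=5 latent thought slots
tokenizer.add_special_tokens({"additional_special_tokens": lvr_tokens})
model.resize_token_embeddings(len(tokenizer))              # grow embed_tokens / lm_head

lora_config.modules_to_save = ["embed_tokens", "lm_head"]  # strictly unfrozen
```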

Summary: For the ViMET task, r=64 is the optimal "Goldilocks" zone. It provides the high expressive capacity needed to shift from literal to associative reasoning, while remaining efficient enough (when combined with QLoRA) to execute training and inference entirely within your 16GB VRAM edge-deployment limit [Source: 1: 19th March] .


User

How do I begin drafting the Technical Design Spec section?


Model

To begin drafting the Technical Design Spec for your portfolio, we need to construct a rigorous narrative that breaks down the structural flaws of current models and mathematically justifies your architecture.

Here is a fully drafted foundational section for your spec. You can copy this directly into your document to serve as the landscape analysis and theoretical positioning.

--------------------------------------------------------------------------------

1. Landscape Analysis: The Pervasive "Perception Gap"

Despite rapid advancements in Vision-Language Models (VLMs), achieving robust associative grounding—the ability to connect visual cues to abstract, physically absent concepts (Visual Metonymy)—remains unsolved. A critical limitation across all current state-of-the-art paradigms is the "Perception Gap": models successfully learn what to say by mimicking textual training data, but fail to learn where to look [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . Because textual mimicry does not guarantee visual understanding, models frequently resort to "blind guessing" based on statistical language priors rather than active visual observation [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

We can observe this architectural failure across three generations of VLM paradigms:

A. First-Generation Instruction-Tuned Models (e.g., LLaVA, InstructBLIP)

  • Architecture: These models rely on Visual Instruction Tuning (SFT) to map visual features to text, often shifting away from heavy bottlenecks (like BLIP-2's Q-Former) to simple MLP projectors that feed raw visual tokens into the LLM [Source: 3: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .
  • The Perception Gap: While they excel at literal perception ("What is in this image?"), they suffer from severe literal bias [Source: 6: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] [Source: 7: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] . They are trained via behavioral mimicry, learning to follow instructions but lacking the internal cognitive mechanism required to execute the multi-hop semiotic leaps necessary to connect a literal Representamen to an abstract Object [Source: 6: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] [Source: 8: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .

B. Massive Generalist Proprietary Models (e.g., GPT-4o, Gemini 1.5 Pro)

  • Architecture: These monolithic models rely on massive parameter scale and verbose, text-based Chain-of-Thought (CoT) generation to reason through complex prompts [Source: 9: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] .
  • The Perception Gap: These systems suffer acutely from the "Curse of CoT" [Source: 9: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] . When forced to reason through visual metonymy using discrete English words, the verbose text generates "context dilution," physically pushing the original visual constraints out of focus [Source: 9: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] . Consequently, they exhibit "Lazy Attention Localization"; as the reasoning chain lengthens, the model abandons the visual tokens entirely and falls back on its massive internal language priors, resulting in hallucinatory drift [Source: 10: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

C. Modern Open-Weight Reasoning VLMs (e.g., DeepSeek-R1, QwQ)

  • Architecture: These models employ Reinforcement Learning with Verifiable Rewards (RLVR) to explore reasoning trajectories via trial-and-error, typically optimized for objective STEM domains like mathematics or coding [Source: 11: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 12: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .
  • The Perception Gap: In subjective tasks like figurative language and visual metonymy, these models rely on sparse, outcome-based rewards [Source: 13: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . A sparse reward only evaluates if the final answer is correct; it cannot verify how the model gathered its visual evidence [Source: 13: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . Consequently, the model can suffer from reward hacking—guessing the correct abstract concept via language correlation while its internal attention is focused on irrelevant background noise [Source: 13: Precision Perception: Integrating Active Visual Cues and Dense Rewards] .

--------------------------------------------------------------------------------

2. The Theoretical Solution: Latent Associative Grounding

To neutralize the Perception Gap while strictly adhering to a 16GB VRAM edge-deployment constraint, this architecture abandons text-heavy CoT and real-time graph retrieval. Instead, we propose a "VMR-RLVR + Latent Reasoning" pipeline deployed on a compact Qwen2.5-VL-3B-Instruct backbone [Source: 14: Architecting Latent Visual Reasoning for Metonymy Research] .

This framework solves the failures of the aforementioned paradigms through three distinct architectural interventions:

A. Decoupling Reasoning from Verbalization

To bypass the "Curse of CoT" and eliminate context dilution, we replace discrete textual reasoning steps with exactly K=5K=5 continuous Latent Visual Reasoning () tokens [Source: 14: Architecting Latent Visual Reasoning for Metonymy Research] . Acting as an information bottleneck, these "Visual Information Containers" mathematically compress high-dimensional visual semantics directly into the hidden state [Source: 14: Architecting Latent Visual Reasoning for Metonymy Research] [Source: 17: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . By aligning the K=5K=5 capacity explicitly with the dataset's semiotic structure (5 representamens per concept), the model is forced to map its reasoning to physical visual cues before textual generation is permitted.

B. Verifiable Multiple-Choice Reformulation (VMR-RLVR)

To enable rigorous RL optimization on subjective metonymy tasks, we utilize VMR to restructure open-ended prompts into deterministic, multiple-choice questions [Source: 18: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 19: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] . By pitting the correct abstract concept against "Partially Literal Distractors" (engineered offline via ConceptNet and VaLiK), we convert a previously unverifiable task into an environment with a strict binary reward [Source: 18: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] . Using Group Relative Policy Optimization (GRPO), the model learns "reasoning by contrast," safely exploring latent trajectories without requiring a human-in-the-loop [Source: 8: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .

C. Precision Perception via Dense Rewards

To cure "Lazy Attention Localization" and mathematically guarantee that the model is anchoring its decisions to the image pixels, we augment the sparse RLVR outcome reward with Dense Rewards and Visual-Anchored Reward Shaping (rvisualrvisual​) [Source: 13: Precision Perception: Integrating Active Visual Cues and Dense Rewards] more_horiz. By requiring the model to generate active bounding box crops (Active Visual Perception) during the latent phase, we calculate an Intersection-over-Union (IoU) perception reward (rcroprcrop​) against offline teacher-verified targets [Source: 26: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . Combined with Advantage-Weighted Attention Divergence (AWAD), the model is heavily penalized if it guesses correctly but looks at the background, effectively destroying the Perception Gap from the inside out [Source: 26: Precision Perception: Integrating Active Visual Cues and Dense Rewards] [Source: 27: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .

--------------------------------------------------------------------------------

Next Steps: This sets a highly rigorous, academic tone that directly targets engineering hiring managers. Would you like to move on to drafting the Proposed Solutions / Workflow Stages next, where we break down the exact Math and Tri-Loss setup of the SFT and RL phases?


User

I mean suggested questions that were produced for the response titled "Case Study Outline: Visual Metonymy via Latent Associative Grounding"


Model

Because I don't have access to the platform's transient, auto-generated UI suggestions (the clickable chips that appear at the bottom of the screen after a response), I cannot pull the exact verbatim phrasing of the questions that were produced for you.

However, based on the structure and content of the "Case Study Outline: Visual Metonymy via Latent Associative Grounding" response, those suggested questions would have been generated to help you transition from the outline phase into actually drafting the specific sections of the document.

Here are the most logical follow-up questions that bridge the outline to your next steps. You can use any of these to guide our next prompt:

Drafting the Case Study Sections

  • "Would you like to begin drafting the Technical Design Spec, specifically focusing on the landscape analysis comparing our 3B model against GPT-4o and DeepSeek-R1?"- "Can we flesh out the Proposed Solutions section by detailing the exact mathematical mechanisms of the K=5K=5 Latent Visual Reasoning (<lvr>) bottleneck and Curriculum Sensory Gating?"- "Could you help me write the Executive Summary, ensuring we strongly hook the reader with the 'Literal Wall' and the 21% human-VLM performance gap?"

Deepening the Technical & Theoretical Mechanics

  • "How should we structure the explanation of Active Visual Perception and Dense Rewards (rcroprcrop​) in the workflow stages so it appeals to an engineering audience?"- "What specific empirical metrics (such as the Visual Focusing Score / SfocusSfocus​ versus standard Accuracy) should we highlight in the evaluation section to definitively prove we solved the Perception Gap?"- "How do we formally justify the decision to replace KL Divergence with Earth Mover's Distance (Wasserstein-1) in the Modeling Strategy section?"

If you are ready to start generating the actual content for your deliverable, simply let me know which section of the outline you would like to draft first, or if you would like me to write out the comparative landscape analysis!


User

I mean suggested questions that were produced for the response titled "Case Study Outline: Visual Metonymy via Latent Associative Grounding"


Model

Ah, I understand. You are referring to the theoretical and architectural follow-up questions from your notes that are directly tied to the concepts detailed in your Case Study Outline (such as the Perception Gap, NodeRAG, and associative grounding).

According to your "Follow up questions" document, here are the specific questions tailored to drill deeper into the theoretical framework and technical design spec of your solution:

Exploring the Problem & Theoretical Framework

  • Explain the Perceptual Ceiling in unassisted vision encoders [Source: 1: undefined] .
  • Explain the difference between literal and associative grounding simply [Source: 1: undefined] .

Exploring NodeRAG & Visual Cues (Representamens)

  • How can NodeRAG handle non-object representamens like 'wind' or 'time'? [Source: 1: undefined]
  • What happens if representamens are missing from the graph? [Source: 1: undefined]
  • Should we use parallel token generation for NodeRAG entities? [Source: 1: undefined]
  • How do I monitor if the VLM ignores the graph? [Source: 1: undefined]
  • How do I add representations (cues/representamens) in the prompt? [Source: 2: undefined]

Exploring the Distillation & Pipeline Mechanics

  • Why is Sinkhorn distance better than KL Divergence for S_focus? [Source: 1: undefined] (Note: Sinkhorn distance is a computational approximation of the Earth Mover's Distance/Wasserstein-1 metric used in your pipeline).
  • How does Step 1 replace prompt cues for reasoning? [Source: 2: undefined]
  • What happens if Step 1 contains a visual hallucination? [Source: 2: undefined]
  • How should I modify the rejects if the teacher skips steps? [Source: 2: undefined]
  • How do I filter the Teacher's CoT outputs effectively? [Source: 2: undefined]

These questions are perfectly positioned to help you expand the depth of your landscape analysis and theoretical framework sections. Would you like me to answer any of these to help you flesh out your technical design spec?


User

Do you remember what the suggested follow-up question was for the above response? Like the next steps?


Model

Based on your sources, there is a dedicated list of follow-up questions and practical next steps explicitly suggested to help you implement and refine the architecture outlined in your case study.

Here are the specific follow-up questions and next steps available to you:

Architectural & Theoretical Follow-up Questions Your source notes highlight the following questions to drill deeper into the mechanics of your finalized pipeline [Source: 1: undefined] :

  • How should I set up the GRPO reward function?
  • What are the specific engineering constraints for this pipeline?
  • How do I filter the Teacher's CoT outputs effectively?
  • How do I add representations (cues/representamens) in the prompt?
  • What defines the optimal rank for metonymy LoRA adapters?
  • How should I modify the rejects if the teacher skips steps?
  • Is four-bit quantization enough to maintain complex metonymy reasoning?
  • How does Step 1 replace prompt cues for reasoning?
  • What happens if Step 1 contains a visual hallucination?

Practical Implementation "Next Steps" In your recent debugging and development sessions, the following concrete actions were suggested to physically validate your model in your Kaggle environment:

  • Visualizing the "Gaze" (Cell E): Mapping out the exact evaluation script to extract the out.attentions matrix during generation. This will allow you to render an attention heatmap proving that your <lvr> tokens are physically "looking" at the correct visual representamens in the image pixels [Source: 2: [GEMINI] LLM Fine-Tuning OOM Error Resolution] [Source: 3: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .- Validating the Sensory Gate: Generating the visualization code to see the Curriculum Sensory Gating (CSG) in action on your stylistic ViMET images, ensuring the gate opens correctly without catastrophic interference [Source: 4: [GEMINI] LLM Fine-Tuning OOM Error Resolution] .- Fusing the Adapters: Writing the precise inference script (evaluate_lavit.py) needed to seamlessly load and fuse your newly trained QLoRA adapters (lavit_final.pt) back into the frozen base model for final testing [Source: 5: [GEMINI] LLM Fine-Tuning OOM Error Resolution] more_horiz.- Executing Notebook Fixes vs. Diagnostics: Deciding whether to implement the top-ranked code fixes (such as setting K=5K=5, injecting representamens into the training prompt, and fixing the warmup bug) directly into your notebook, or running a diagnostic audit first to validate failure modes before using GPU time [Source: 8: 19th March] .

Would you like to explore the answers to any of those specific theoretical questions, or would you prefer to start mapping out the evaluation script for your attention heatmaps?


User

In bullet format with headings and subheadings present the following: Garner all the salient points of the current solution approach that should be finalized. Outline the key points, decisions, strategies that change the trajectory of my architecture for the current solution approach. Also, for the concepts involved in my current solution approach, explain the decisions I need to make for each concept. Also outline the stages or phases of the workflow, justifying each mechanism in the architecture by answering the "what", "why", "how" - I want to make a case study ultimately that shows why it is valuable and should include sections like: Problem Definition Proposed solutions Theoretical framework A technical design spec (Engineering Spec/Research Proposal) . Most importantly, I guess this section would be helpful for a landscape analysis by comparing three current state-of-the-art models and highlighting the specific perception gap they all share. Also position my latent associated grounding as a theoretical solution to this.


Model

Case Study Outline: Visual Metonymy via Latent Associative Grounding

Problem

  • The "Literal Wall" & The Perception Gap: Current Vision-Language Models (VLMs) excel at literal visual grounding but fail significantly at associative grounding (visual metonymy), exhibiting a ~21% accuracy gap compared to human performance (86.9% vs. 65.9%) [Source: 1: A Computational Approach to Visual Metonymy] more_horiz.- Modality Laziness: During training, models frequently suffer from a "Perception Gap" where they learn what to say by mimicking textual outputs but fail to learn where to look [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models] . They rely on statistical language priors rather than verifying the actual image pixels [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 6: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .- The Hardware Constraint: Deployment on edge devices imposes a strict 16GB VRAM limitation, rendering standard 32B+ parameter models or massive real-time multi-hop GraphRAG retrievals computationally unfeasible during inference [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research] more_horiz.

Definition

  • Visual Metonymy: A form of indirect representation where an image evokes a broad, abstract target concept not by depicting it directly, but by presenting a distributed constellation of visually associated cues [Source: 10: A Computational Approach to Visual Metonymy] [Source: 11: Lexicon of Visual Metonymy and Vision-Language Models] .
  • The Semiotic Triad: The cognitive framework defining visual metonymy, comprising the Object (the absent abstract concept), the Representamen (the literal visual cue present in the image), and the Interpretant (the mental inference connecting the cue to the concept) [Source: 12: A Computational Approach to Visual Metonymy] .

Theoretical Framework

  • Literal vs. Associative Grounding: Literal grounding requires attention to converge on the specific object being asked about. Associative grounding requires attention to converge on multiple representamens that jointly evoke an absent concept, demanding a deeper cognitive leap [Source: 15: Literal Grounding versus Associative Grounding in Visual Metonymy] .
  • "Thinking Before Looking": Moving away from conventional "thinking while looking," which leads to shortcut learning, to a paradigm where the model constructs a robust cognitive prior using structured logic before evaluating the visual evidence [Source: 18: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination] [Source: 19: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination] .
  • Decoupling Reasoning from Verbalization: Recognizing that human-like reasoning can exist as a deterministic trajectory of continuous, non-textual hidden states rather than rigid vocabulary words, avoiding the pitfalls of text-heavy generation [Source: 20: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 21: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] .

Proposed Solutions (The Compelling Shift)

  • The Core Engine: The finalized architecture relies on a "VMR-RLVR + Latent Reasoning" pipeline using a compact Qwen2.5-VL-3B-Instruct student model [Source: 3: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 22: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .
  • Latent Visual Reasoning (<lvr>): Textual Chain-of-Thought (CoT) is completely decoupled and replaced by exactly K=5 continuous latent tokens that act as an information bottleneck to compress visual semantics [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research] .
  • Active Visual Perception & Dense Rewards: Transforming the VLM into an active agent that predicts bounding boxes to crop image noise, guided by a dense multi-reward scheme to mathematically verify its gaze at every step [Source: 24: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

--------------------------------------------------------------------------------

Key Trajectory-Changing Decisions

  • Abandoning Text-Based CoT for Latent Tokens: We explicitly rejected standard text-based Reasoning Beyond Literal (RBL) due to the "Curse of CoT" [Source: 27: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] [Source: 28: Latent Associative Grounding for Visual Metonymy in Small VLMs] . Generating verbose textual reasoning dilutes the model's context window, causing hallucinatory drift and severely degrading performance on pattern-based tasks [Source: 29: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] . Continuous latent tokens (<lvr>) resolve this by compressing logic mathematically [Source: 32: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] .
  • Shifting NodeRAG from Online to Offline: Under the 16GB VRAM constraint, real-time multi-hop NodeRAG retrieval during inference proved prohibitively expensive [Source: 3: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 22: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] . Instead, NodeRAG and VaLiK (Vision-align-to-Language integrated Knowledge Graph) are used strictly offline to engineer training data, seed latents, and generate multiple-choice distractors [Source: 22: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .
  • From Diffuse Trajectories to Offline Bounding Boxes: The 32B teacher model's attention is naturally diffuse on metonymy images because there is no single "correct" salient region [Source: 35: 19th March] [Source: 36: 19th March] . Mimicking this diffuse signal causes training collapse [Source: 35: 19th March] . We pivoted to extracting precise bounding boxes offline using Grounding DINO, giving the student strict spatial targets to learn from [Source: 37: 19th March] [Source: 38: 19th March] .
  • Adopting VMR-RLVR: Because visual metonymy is subjective, we adopted Verifiable Multiple-Choice Reformulation (VMR) to convert open-ended prompts into deterministic A/B/C/D tasks against hard semantic distractors [Source: 39: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 40: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] . This allows Reinforcement Learning with Verifiable Rewards (RLVR) to operate without a flawed LLM-as-a-judge [Source: 34: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] [Source: 40: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] .

--------------------------------------------------------------------------------

Concept Configuration Decisions

For the core concepts driving the architecture, the following specific decisions must be finalized:

  • Latent Bottleneck Capacity (K):
    • Decision: Set K=5.
    • Reasoning: The ViMET dataset's semiotic structure natively utilizes 5 representamens per concept [Source: 41: 19th March] . Setting K=5 aligns the network's mathematical bottleneck capacity exactly with the dataset's cognitive structure, providing one continuous thought slot for each potential visual cue [Source: 42: 19th March] [Source: 43: Preventing NaN Cascades in Training] .
  • Distillation Loss Function (L_traj):
    • Decision: Utilize Earth Mover's Distance (Wasserstein-1) instead of standard KL Divergence.
    • Reasoning: Standard KL divergence crashes into NaN cascades when comparing the student's gaze against sparse, highly-concentrated bounding box targets [Source: 43: Preventing NaN Cascades in Training] . Earth Mover's Distance safely and geometrically penalizes the student if its gaze drifts from the target objects.
  • Semantic Target Definition (V_sem):
    • Decision: Replace the raw 32B teacher hidden states with concept-aligned embeddings from the Representamens.
    • Reasoning: Teacher hidden states for metonymy are diffuse and noisy [Source: 36: 19th March] [Source: 46: 19th March] . Seeding the K=5 tokens with direct linear projections of the text strings (Concept-Conditioned Seeding) gives the tokens a clear, semantic target to pull the visual data toward [Source: 46: 19th March] [Source: 47: Lexicon of Visual Metonymy and Vision-Language Models] .
  • RL Optimization Algorithm:
    • Decision: Group Relative Policy Optimization (GRPO).
    • Reasoning: GRPO generates a group of rollouts and evaluates them relative to each other, bypassing the need for a separate, memory-heavy reward model during training [Source: 48: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 49: Lexicon of Visual Metonymy and Vision-Language Models] (a minimal advantage sketch follows this list).
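A minimal sketch of GRPO's group-relative advantage, the mechanism that removes the need for a learned reward model; shapes and the epsilon are illustrative assumptions:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """group_rewards: [G] rewards for G rollouts of the same prompt.
    Each rollout is scored against its own group's mean and std, so no
    separate value/reward model is held in memory during training."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```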

--------------------------------------------------------------------------------

Workflow Stages: Justifying the Architecture (What, Why, How)

Stage 0: Offline Preprocessing & Target Synthesis

  • What: The generation of verifiable ground truths and training constraints prior to any model training.
  • Why: RL and distillation require mathematically perfect targets. If the offline targets contain hallucinations, the RLVR loop will suffer a zero-information collapse [Source: 35: 19th March] [Source: 50: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] .
  • How: We use Teacher-Verified Automated Trajectory Synthesis [Source: 50: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] . Grounding DINO extracts candidate bounding boxes for the representamens [Source: 38: 19th March] [Source: 50: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] . The unquantized 32B Teacher model acts as an expert semantic filter to explicitly verify these boxes, guaranteeing that the spatial coordinates contain the correct visual evidence [Source: 50: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] [Source: 51: Teacher-Verified Automated Trajectory Synthesis for Reinforcement Learning] . Concurrently, NodeRAG and ConceptNet engineer "Partially Literal Distractors" to build the VMR multiple-choice dataset [Source: 34: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .

Stage 1: Supervised Fine-Tuning (SFT) & Latent Policy Alignment

  • What: A warm-up distillation phase mapping the 3B student's latent space to the verified offline targets [Source: 53: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 54: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .
  • Why: Training the <lvr> tokens, LoRA adapters, and gating mechanisms simultaneously from random initialization causes gradient noise and NaN crashes [Source: 55: Preventing NaN Cascades in Training] . The model needs a foundation in structured reasoning before exploring in RL [Source: 53: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 56: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal] .
  • How: A Two-Stage Pre-alignment [Source: 57: 19th March] [Source: 58: Preventing NaN Cascades in Training] . First, we train only the Semantic Projector for 100-200 steps using a Tri-Loss objective (Next-Token Prediction, L_concept for semantic reconstruction, and L_traj for trajectory alignment) [Source: 57: 19th March] . Second, we engage Curriculum Sensory Gating (CSG), applying a mathematical mask to block direct text-to-image attention, forcing the language head to route processing strictly through the latent bottleneck [Source: 60: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 61: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

Stage 2: Active Optimization via VMR-RLVR

  • What: The final reinforcement learning loop using GRPO to perfect the internal attention policy [Source: 62: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] .
  • Why: SFT only teaches the model what to mimic. RLVR forces the model into active "reasoning by contrast," exploring various trajectories to explicitly penalize literal shortcuts and verify associative links [Source: 65: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .
  • How: The model generates active bounding box crops (Active Visual Perception) to isolate visual cues [Source: 24: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 66: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . The GRPO reward function optimizes the policy by summing four signals: Accuracy (R_acc, +1 for the correct concept) [Source: 67: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] , Format compliance (R_format) [Source: 67: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] , Cropping precision (r_crop via Intersection over Union) [Source: 26: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 68: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] , and Visual-Anchored Reward Shaping (r_visual), which uses Advantage-Weighted Attention Divergence (AWAD) to penalize the model if it guesses correctly while looking at irrelevant background pixels [Source: 63: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 69: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

--------------------------------------------------------------------------------

Technical Design Spec: Landscape Analysis & Solution Positioning

To validate our architectural direction, we analyze three generations of state-of-the-art vision-language paradigms. Despite their advancements, all three share a fundamental Perception Gap: they are optimized to generate statistically probable text sequences without guaranteeing that their internal attention mechanisms are anchored to actual visual evidence [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models] .

1. 2023-Era Instruction-Tuned Models (e.g., LLaVA, InstructBLIP)

  • Architecture: Rely heavily on Visual Instruction Tuning (SFT) and mechanisms like the Q-Former bottleneck [Source: 70: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 71: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .
  • The Shared Perception Gap: These models act through "behavioral mimicry" [Source: 70: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] . They learn to expertly follow instructions ("Select the concept from this list") but they suffer from severe literal bias [Source: 70: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 72: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] . They summarize visual features far too early in the pipeline, destroying the fine-grained pixel data required to distinguish between abstract symbolic concepts (e.g., "Justice" vs. "Bias" scales) [Source: 73: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] . They learn what to say but cannot execute the multi-hop cognitive leaps required for metonymy [Source: 70: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 74: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] .

2. Generalist Proprietary Models (e.g., GPT-4o, Gemini 1.5 Pro)

  • Architecture: Massive, monolithic parameter networks relying on sheer scale and text-heavy Chain-of-Thought (CoT) generation [Source: 75: An Alternative Trajectory for Generative AI - arXiv] [Source: 76: An Alternative Trajectory for Generative AI - arXiv] .
  • The Shared Perception Gap: These models suffer acutely from the "Curse of CoT" [Source: 29: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] [Source: 31: [2504.05081] The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning] . When forced to reason through visual metonymy using discrete English words, the verbose text generates "context dilution," physically pushing the original visual constraints out of focus [Source: 29: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents] . Consequently, they exhibit "Lazy Attention Localization"; as the reasoning chain lengthens, they abandon the visual tokens entirely and fall back on their massive internal language priors, resulting in hallucinatory drift [Source: 77: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

3. Modern Open-Weight Reasoning VLMs (e.g., DeepSeek-R1, Qwen2.5-VL)

  • Architecture: Strong reasoning backbones trained via reinforcement learning on vast amounts of math and science data using sparse outcome-based rewards [Source: 48: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 80: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] .
  • The Shared Perception Gap: While excellent at verifiable STEM tasks, when applied to the subjective ambiguity of figurative language, their sparse rewards fail [Source: 26: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . Because they are only rewarded at the very end of a trajectory for getting the answer right, they suffer from a severe credit assignment problem [Source: 82: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . They frequently guess the correct metonymic concept based on statistical correlation, but their internal visual attention maps show they were looking at irrelevant background noise—a critical failure in associative grounding [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 6: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

Positioning Latent Associative Grounding as the Theoretical Solution

Our finalized architecture systematically neutralizes the Perception Gap shared by these SOTA models.

By replacing text-based reasoning with K=5 Latent Visual Reasoning (<lvr>) tokens, we bypass the "Curse of CoT" entirely, maintaining massive reasoning density without context dilution [Source: 21: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 22: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] . By enforcing Curriculum Sensory Gating (CSG), we physically break the model's ability to rely on the "Lazy Attention" of its language priors, creating a strict bottleneck [Source: 61: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 83: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . Finally, by upgrading standard RL to include Active Visual Perception (dynamic cropping) and Dense Rewards (r_crop and AWAD), we mathematically guarantee the Semiotic Triad: the model is rewarded only if its internal attention proves it identified the literal Representamens before it inferred the abstract Object [Source: 22: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] . This achieves true Latent Associative Grounding within the tight constraints of a 3B edge-deployed VLM.


User

Could you explain the difference between accuracy and visual focusing scores?


Model

Accuracy and the Visual Focusing Score (S_focus) evaluate two completely distinct dimensions of a Vision-Language Model's (VLM) reasoning pipeline: accuracy measures the final textual outcome, while the visual focusing score measures the physical visual process [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

Here is the detailed breakdown of the differences between the two metrics and why both are required to evaluate visual reasoning.

1. Accuracy: The Outcome Metric

Accuracy measures the objective factual correctness of the model's final decision [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . In your architecture, this evaluates whether the model successfully selected the correct abstract concept over the semantic distractors in the multiple-choice prompt [Source: 4: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] [Source: 5: Recent Shift Change: Optimising Visual Metonymy via Latent Reasoning and Reinforcement Learning] .

  • The Limitation: Accuracy is a sparse, outcome-based metric [Source: 6: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . While it tells you if the model guessed the correct answer, it cannot verify how the model arrived at that answer [Source: 6: Precision Perception: Integrating Active Visual Cues and Dense Rewards] .
  • The Guessing Problem: A model can achieve high textual accuracy through statistical text correlation or by relying on language priors encoded in its weights, allowing it to guess the right answer without actually extracting evidence from the image [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

2. Visual Focusing Score (S_focus): The Process Metric

The Visual Focusing Score (S_focus) quantifies the exact spatial "gaze" of the model during its reasoning process. It measures the percentage of the model's internal attention mass that is concentrated directly on the verified visual evidence (the offline-generated bounding boxes) [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 9: Lexicon of Visual Metonymy and Vision-Language Models] .

  • The Mathematical Calculation: S_focus is calculated by taking the sum of the attention intensity at the spatial coordinates within the ground-truth bounding box and dividing it by the total attention mass distributed across the entire image [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] ; see the sketch after this list.
  • The Purpose: A higher S_focus mathematically proves that the model's reasoning process has a strong dependency on the verified visual evidence [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . It confirms that the model is actively "looking" at the semantically correct regions of the image rather than operating blind [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .
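A minimal sketch of this ratio, assuming an (H, W) attention map over image patches and a box given in patch coordinates (the function name and shapes are illustrative):

```python
import torch

def s_focus(attn: torch.Tensor, box: tuple) -> float:
    """S_focus = attention mass inside the verified bounding box divided
    by the total attention mass over the image. `attn` is an (H, W) map;
    `box` is (x1, y1, x2, y2) in patch coordinates."""
    x1, y1, x2, y2 = box
    inside = attn[y1:y2, x1:x2].sum()
    return (inside / attn.sum().clamp_min(1e-8)).item()

# Example: a model whose gaze sits mostly inside the box scores high.
attn = torch.rand(24, 24)
print(s_focus(attn, (6, 6, 12, 12)))
```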

3. The Core Difference: The "Perception Gap"

The fundamental difference between the two metrics lies in the "Perception Gap": accuracy evaluates what to say, whereas S_focus evaluates where to look [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 10: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

Standard supervised fine-tuning often allows a student model to mimic a teacher's high textual accuracy, even if the student's visual attention trajectories diverge significantly from the correct target regions [Source: 1: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . This textual mimicry does not guarantee visual understanding [Source: 11: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

However, empirical analysis proves that true reasoning success is strictly constrained by visual attention [Source: 7: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] :

  • Monotonic Gains: Correctly answered samples maintain a significantly higher average S_focus (approximately 15.89%) compared to incorrect samples (approximately 11.84%) [Source: 7: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .
  • Hallucination Bounds: When a model's S_focus drops to negligible levels (less than 1%), the model predominantly produces severe hallucinations or responses completely irrelevant to the visual content [Source: 7: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] . Therefore, focused visual grounding is a necessary pre-condition for genuine reasoning success [Source: 12: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .

4. Application in Visual Metonymy (Literal vs. Associative Grounding)

The distinction between these metrics becomes uniquely important in visual metonymy.

In standard visual tasks, S_focus evaluates literal grounding—verifying that the model's attention converges on the specific object being asked about (e.g., looking at a ball when asked about the color of the ball) [Source: 13: Literal Grounding versus Associative Grounding in Visual Metonymy] .

In visual metonymy, the target abstract concept is physically absent from the image [Source: 9: Lexicon of Visual Metonymy and Vision-Language Models] [Source: 14: Literal Grounding versus Associative Grounding in Visual Metonymy] . Therefore, S_focus is used to evaluate associative grounding [Source: 15: Literal Grounding versus Associative Grounding in Visual Metonymy] . It verifies that the model's attention converges on the representamens—the distributed constellation of visual cues (like a paintbrush, palette, and canvas) that collectively evoke the absent concept (the "artist") [Source: 9: Lexicon of Visual Metonymy and Vision-Language Models] .

Summary: Accuracy tells you if the model won the game, but the Visual Focusing Score (S_focus) is required to prove that the model actually looked at the visual evidence to achieve that win, mathematically ruling out blind guessing [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 12: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] .


User

Can you explain the Visual-Anchored Reward Shaping calculation?


Model

The Visual-Anchored Reward Shaping calculation evaluates the ratio of internal attention the model assigns to visual tokens relative to system tokens during a reasoning rollout [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . Introduced as part of the Attention-Guided Visual Anchoring and Reflection (AVAR) framework, this calculation provides an auxiliary reinforcement learning signal that forces a Vision-Language Model (VLM) to sustain strong visual grounding across extended reasoning chains, rather than lazily defaulting to language priors [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 2: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

Here is the exact mathematical calculation and how it integrates into the model's training pipeline.

1. The Conditional Logic of r_visual

The calculation operates on a strict conditional basis. It first checks if the model's final reasoning outcome is factually correct. If the model hallucinates or guesses incorrectly, the visual attention reward is set to strictly zero [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . This ensures the model is not rewarded for simply staring intently at an image while failing to solve the actual task [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

If the rollout outcome is correct, the visual attention reward (r_visual) is calculated using the following formula:

$$ r_{\mathrm{visual}} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|L|} \sum_{l \in L} \frac{\sum_{k \in K_{\mathrm{img}}} A^{l}_{t,k}}{\sum_{k \in K_{\mathrm{sys}}} A^{l}_{t,k} + \epsilon} $$

[Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf]

2. Breaking Down the Components

This formula measures the model's physical "gaze" token-by-token:

  • T: The total set of tokens generated during the rollout [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . The entire calculation is averaged across all these time steps ((1/|T|) Σ_{t∈T}) to ensure sustained focus.
  • L: The set of targeted transformer layers being monitored [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • K_img: The set of image (visual) key tokens [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • K_sys: The set of system key tokens (the system prompt) [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • A^l_{t,k}: The specific attention weight originating from the generated query token t to the target key token k at a specific layer l [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • ε: A stability constant (empirically fixed at 10^-6) added to the denominator to prevent division-by-zero errors [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

The Logic Behind the Ratio: The numerator calculates the average attention mass directed at the visual features (Σ_{k∈K_img} A^l_{t,k}) [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . The denominator calculates the attention mass directed at the system tokens (Σ_{k∈K_sys} A^l_{t,k}) [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . By dividing the visual attention by the system attention, the calculation explicitly penalizes the model if it exhibits "Lazy Attention Localization"—a phenomenon where the model relies too heavily on the system prompt rather than extracting continuous evidence from the image pixels [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 6: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
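A minimal sketch of this calculation, assuming the per-rollout attention has already been gathered into a (T, L, K) tensor and that the image/system key positions are known (all names and shapes here are illustrative):

```python
import torch

def r_visual(attn, img_idx, sys_idx, correct: bool, eps: float = 1e-6):
    """Visual attention reward sketch. `attn` is a (T, L, K) tensor:
    attention from each generated token t, at each monitored layer l,
    to every key token k. Index lists mark image vs. system keys."""
    if not correct:
        return 0.0  # no reward for staring at the image while failing
    img_mass = attn[:, :, img_idx].sum(-1)       # (T, L) visual mass
    sys_mass = attn[:, :, sys_idx].sum(-1)       # (T, L) system mass
    ratio = img_mass / (sys_mass + eps)          # per-token, per-layer
    return ratio.mean(dim=1).mean(dim=0).item()  # avg over L, then T

attn = torch.rand(32, 4, 128).softmax(-1)        # toy rollout
print(r_visual(attn, list(range(64)), list(range(64, 80)), True))
```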

3. The Total Synthesis Reward (r_total)

Once r_visual is calculated, it is not used in isolation. It acts as an auxiliary shaping signal that is combined with traditional reinforcement learning rewards [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . The final total reward (r_total) used to update the model is formulated as:

$$ r_{\mathrm{total}} = r_{\mathrm{accuracy}} + \lambda_{v} \cdot r_{\mathrm{visual}} + \lambda_{f} \cdot r_{\mathrm{format}} $$

[Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf]

  • r_accuracy: Evaluates the objective factual correctness of the final answer [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • r_format: Enforces compliance with the required output structure (such as successfully using <think> tags) [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • λ_v and λ_f: These are scaling hyperparameters used to balance the different signals. In the AVAR implementation, λ_v is set to 0.3 and λ_f is set to 0.1 [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
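The combination itself is a one-line weighted sum; a minimal sketch with the AVAR default weights cited above:

```python
def r_total(r_accuracy: float, r_vis: float, r_format: float,
            lam_v: float = 0.3, lam_f: float = 0.1) -> float:
    """Combine the shaping signals exactly as in the formula above;
    default weights follow the AVAR settings cited in the text."""
    return r_accuracy + lam_v * r_vis + lam_f * r_format

print(r_total(1.0, 0.8, 1.0))  # correct, well-anchored, well-formatted
```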

4. Optimization via GRPO

This final combined reward is then fed into the Group Relative Policy Optimization (GRPO) algorithm [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . During training, GRPO generates a group of rollouts (e.g., G=8) for the same prompt and calculates the relative advantage of each rollout against the group average [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

Because r_total includes the visual anchoring ratio, if a model generates the correct answer while maintaining a highly concentrated focus on the image tokens, it receives a much higher relative advantage than a rollout that guessed the right answer while looking at the background [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . This mathematically reinforces the model's internal attention policy, permanently anchoring its cognitive leaps to verifiable visual cues across extended reasoning chains [Source: 1: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
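A minimal sketch of the group-relative advantage step (mean/std normalisation within the group is a common GRPO formulation; treat it as an assumption rather than the exact AVAR implementation):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    """GRPO-style relative advantages: each rollout in a group of G is
    scored against the group mean (normalised by the group std), so no
    separate value/reward model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of G=8 rollouts: the well-grounded correct answer (highest
# r_total) receives the largest positive advantage.
rewards = torch.tensor([1.3, 1.0, 0.1, 0.0, 1.1, 0.1, 0.0, 0.2])
print(group_advantages(rewards))
```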


User

How to guarantee that Grounding DINO will accurately extract bounding boxes for the ground truth, so I can mathematically verify that the model has acquired the correct visual evidence before moving on to the abstract reasoning phase?


Model

To guarantee that Grounding DINO accurately extracts the correct bounding boxes for your ground truth, you cannot rely on the zero-shot object detector in isolation. Because your training framework relies on these boxes to calculate deterministic Trajectory Alignment (L_traj) and Dense Perception Rewards (r_crop), any hallucinated or inaccurate boxes from Grounding DINO will poison the mathematical verification of your pipeline.

To achieve an absolute guarantee, you must implement Teacher-Verified Automated Trajectory Synthesis, a mechanism adapted from the UniDoc-RL framework [Source: 1: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This process introduces a high-capacity teacher model to act as an expert semantic filter over the offline detector's raw outputs.

Here is the step-by-step mechanism you must add to your offline preprocessing phase to guarantee your visual evidence:

1. Candidate Generation via Grounding DINO

During your data preparation, you use Grounding DINO or OWL-v2 to scan the image using the discrete text strings of your representamens (the visual cues) as text queries [Source: 3: 19th March] . However, instead of taking the detector's top prediction as absolute truth, you configure it to generate a set of potential candidate bounding boxes (B_cand) representing various regions of interest [Source: 1: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

2. Teacher-Driven Verification

Because zero-shot detectors lack deep semantic reasoning and can be confused by the compositional nature of metonymy, you pass these candidate bounding boxes to your high-capacity offline teacher model (e.g., your unquantized Qwen2.5-VL-32B or a model like Qwen3-VL-235B) [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 4: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . The teacher model is tasked with explicitly evaluating these candidate regions to verify whether they truly contain the fine-grained visual information that maps to the required abstract concept [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

3. Optimal Target Selection (b*)

Once the teacher evaluates the candidates, it selects the single most optimal and semantically accurate bounding box (b* ∈ B_cand) to serve as the ground-truth visual action [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . By relying on the 32B teacher to verify the semantic contents of the box rather than trusting Grounding DINO's base confidence scores, you mathematically guarantee that the spatial coordinates contain the correct visual evidence. You then convert these verified coordinates into the sparse probability vectors required for your Earth Mover's Distance trajectory targets [Source: 3: 19th March] [Source: 5: 19th March] .

4. Strict Quality Filtering (The Failsafe)

To maintain mathematical integrity during your SPECTRA/GRPO reinforcement learning loop, you must apply a rigorous quality filtering stage to your dataset [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . If Grounding DINO fails to capture the object entirely, or if the teacher model evaluates the candidates and determines that none of them adequately capture the representamen, the trajectory is deemed invalid [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . You must discard these flawed samples entirely from your training corpus [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
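Putting steps 1-4 together, a compact sketch of the synthesis-and-filtering loop might look like the following; the `detector` and `teacher` objects and the `select_best_box` method are hypothetical stand-ins for Grounding DINO and the 32B teacher, not real APIs:

```python
def synthesize_verified_targets(samples, detector, teacher):
    """Sketch of Teacher-Verified Automated Trajectory Synthesis.
    `detector` and `teacher` are illustrative callables only."""
    verified = []
    for s in samples:
        # Step 1: candidate boxes for each representamen string.
        b_cand = detector(s["image"], queries=s["representamens"])
        if not b_cand:
            continue  # detector failed outright: discard the sample
        # Steps 2-3: the teacher picks the single semantically correct
        # box (b*), or rejects all candidates.
        b_star = teacher.select_best_box(s["image"], s["concept"], b_cand)
        if b_star is None:
            continue  # Step 4: strict quality filtering (the failsafe)
        verified.append({**s, "box": b_star})
    return verified
```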

By retaining only the samples where the teacher explicitly verifies both the visual bounding box and the final reasoning outcome, you ensure that your r_crop (Intersection over Union) dense reward is comparing the 3B student's gaze against an absolutely flawless ground-truth target [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 7: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .


User

How do active visual perception and dense rewards improve VMR-RLVR?


Model

Integrating Active Visual Perception and Dense Rewards into your VMR-RLVR pipeline fundamentally solves the credit assignment problem and eliminates the model's reliance on passive, full-image context processing.

While Verifiable Multiple-Choice Reformulation (VMR-RLVR) successfully provides a deterministic, binary reward (+1 or 0) for the final abstract concept prediction [Source: 1: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] , this outcome-based reward is inherently sparse [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . A sparse reward only tells the model if its final answer was correct; it does not explicitly supervise the intermediate steps of how the model gathered the visual evidence [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 4: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

By incorporating mechanisms from the UniDoc-RL framework, you upgrade your VMR-RLVR pipeline from relying on implicit attention steering to executing explicit, verifiable, coarse-to-fine visual extraction [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . Here is exactly how these two mechanisms improve the architecture:

1. Active Visual Perception: Purifying the <lvr> Bottleneck

Standard Vision-Language Models (VLMs) suffer from "exposure bias," defaulting to a passive visual consumption paradigm where they encode the entire, full-resolution image directly into the model context [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 7: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . In visual metonymy, this means the model wastes massive amounts of its context capacity processing irrelevant background noise instead of focusing on the critical visual cues (the representamens) [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

Active Visual Perception transforms the model from a passive observer into an active agent.

  • The Mechanism: Before compressing information into the K=5 latent (<lvr>) tokens, the model is trained to actively generate bounding box coordinates to crop and adaptively zoom into specific regions of interest [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] ; see the cropping sketch after this list.
  • The Improvement to VMR: By physically cropping the image down to the relevant visual cues (e.g., cropping to just the "tilted scales" rather than the whole courtroom), you drastically reduce the visual noise fed into the latent bottleneck [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This coarse-to-fine perception strategy resembles human visual attention, preserves high-resolution details of the specific objects, and ensures your <lvr> tokens are compressing pure, highly-concentrated semantic signals [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This maximizes the Reasoning Density of the model's latent thoughts before it attempts to answer the VMR multiple-choice prompt.
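A minimal sketch of the crop-and-zoom step, assuming pixel-space boxes and a hypothetical minimum re-encoding resolution (the 224-pixel floor is an illustrative choice, not a paper setting):

```python
from PIL import Image

def active_crop(image: Image.Image, box: tuple,
                min_side: int = 224) -> Image.Image:
    """Active Visual Perception sketch: crop to the model's predicted
    box, then upscale so fine-grained detail (e.g., the tilted scales)
    survives re-encoding. Box format is (x1, y1, x2, y2) in pixels."""
    crop = image.crop(box)
    # Adaptive zoom: enforce a minimum resolution before the crop is
    # re-encoded into the <lvr> bottleneck.
    scale = max(1.0, min_side / max(1, min(crop.size)))
    if scale > 1.0:
        crop = crop.resize((int(crop.width * scale),
                            int(crop.height * scale)))
    return crop
```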

2. Dense Rewards: Step-by-Step Optimization via IoU

To optimize this new active cropping behavior within your existing Group Relative Policy Optimization (GRPO) loop, you must replace the purely sparse VMR outcome reward with a Dense Multi-Reward Scheme [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

  • The Problem with Sparse VMR Rewards: If the model guesses the correct multiple-choice option but did so by looking at the wrong part of the image, a sparse R_acc reward of 1 still reinforces that flawed behavior, leading to reward hacking and hallucination cascades [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 10: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .
  • The Visual Perception Reward (r_crop): You utilize the offline bounding boxes already generated by Grounding DINO for your empirical guardrails as the ground-truth targets (B*) [Source: 11: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . When the model actively predicts a crop box (B_hat), you calculate the Intersection over Union (IoU) between the model's focus and the Grounding DINO ground truth: r_crop = IoU(B*, B_hat) [Source: 11: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] ; see the IoU sketch after this list.
  • The Improvement to VMR: By adding r_crop to the existing accuracy (R_acc) and format (R_format) rewards, you provide fine-grained, stage-aware feedback at every step of the pipeline [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . The dense supervision explicitly optimizes the accuracy of intermediate actions, preventing error propagation and guiding the model toward more robust reasoning paths before the final multiple-choice decision is made [Source: 12: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
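The IoU itself is a standard geometric computation; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(b_star, b_hat) -> float:
    """r_crop = IoU(B*, B_hat) for boxes given as (x1, y1, x2, y2)."""
    x1 = max(b_star[0], b_hat[0])
    y1 = max(b_star[1], b_hat[1])
    x2 = min(b_star[2], b_hat[2])
    y2 = min(b_star[3], b_hat[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b_star) + area(b_hat) - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # partial overlap
```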

3. The Synergistic Impact on the RL Training Loop

When combined, these mechanisms force a profound qualitative shift in the model's behavior during the RLVR training phase.

Empirical analysis of RL training dynamics demonstrates that without dense perception rewards, models frequently generate "trivial crops" that encompass the entire image or irrelevant backgrounds, providing zero information gain [Source: 13: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . However, by optimizing the VMR-RLVR loop with the dense rcroprcrop​ reward, the model learns to execute "precise crops" that tightly bound the regions of interest [Source: 13: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

The Final Unified Objective: Your updated GRPO reward function becomes a synergistic equation combining task accuracy, format adherence, perception precision, and implicit attention steering: R_total = λ_1·R_acc + λ_2·R_format + λ_3·r_crop + λ_4·AWAD [Source: 11: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

By integrating Active Visual Perception and Dense Rewards, you mathematically guarantee that your edge-deployed 3B model explicitly filters out visual noise, verifying its own physical visual grounding at intermediate steps before executing the "reasoning by contrast" required to solve the final multiple-choice metonymy task [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .


User

can the methodology, solution approach, any of key architecture mechanisms, components, or key findings of the paper started with "UniDoc-RL" be integrated into the above finalized solution approach?


Model

Yes, the methodology, architecture mechanisms, and key findings from the UniDoc-RL paper can be highly synergistic when integrated into your finalized VMR-RLVR + Latent Reasoning pipeline.

While UniDoc-RL was designed for document understanding, its core paradigm—treating visual information acquisition as a sequential, coarse-to-fine decision-making problem with dense rewards—directly solves several lingering optimization challenges in your Visual Metonymy (ViMET) architecture.

Here is a detailed, highly technical breakdown of exactly how you can integrate UniDoc-RL’s mechanisms into your current edge-deployed 3B pipeline.

1. Integrating "Active Visual Perception" to Augment the <lvr> Bottleneck

The UniDoc-RL Concept: Instead of passively encoding full, high-resolution images (which retains redundant background noise and wastes context capacity), UniDoc-RL introduces an Active Visual Perception action [Source: 1: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . The model actively generates coordinates to crop and zoom into information-dense regions, converting them into high-resolution, query-focused visual observations before reasoning [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

Integration into your Pipeline: In your current pipeline, your 3B model relies entirely on Advantage-Weighted Attention Divergence (AWAD) to steer its internal attention on a static, full image. You can integrate UniDoc-RL’s active perception as a preprocessing step right before the Latent Visual Reasoning (<lvr>) bottleneck.

  • The Mechanism: During the SPECTRA agentic loop, before the model compresses the image into the K=5 latent tokens, it issues an active spatial cropping command (e.g., predicting bounding box coordinates for the suspected representamens) [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
  • The Benefit: By physically cropping the image to the relevant visual cues (e.g., cropping down to just the "paintbrush" and "palette" rather than the whole room), you drastically reduce the visual noise fed into the <lvr> tokens. This ensures your latent bottleneck is compressing pure, highly-concentrated semantic signals rather than wasting capacity on irrelevant background pixels, maximizing your Reasoning Density [Source: 2: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

2. Integrating the "Dense Multi-Reward Scheme" into GRPO

The UniDoc-RL Concept: Standard reinforcement learning for Vision-Language Models suffers from sparse outcome-based rewards (e.g., getting a +1 only at the very end if the answer is correct), which fails to provide credit assignment for intermediate steps like finding the right visual evidence [Source: 1: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . UniDoc-RL solves this with a Dense Multi-Reward Scheme, calculating specific rewards for retrieval, selection, cropping (via Intersection over Union), and final formatting [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 7: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

Integration into your Pipeline: You are currently using GRPO with a binary verifiable reward (R_acc) and a format reward (R_format). You can supercharge this by adopting UniDoc-RL's dense reward logic, specifically its Visual Perception Reward (r_crop) [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

  • The Mechanism: You already have offline bounding boxes generated by Grounding DINO for your Empirical Visual Perception Verification (EVPV) guardrail. You can use these exact boxes as the ground-truth targets (B*) for a dense GRPO reward [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
  • The Calculation: If your 3B model utilizes an active crop action, or if you threshold its internal attention map into a predicted bounding box (B_hat), you can calculate the Intersection over Union (IoU) between the model's focus and the Grounding DINO ground truth: r_crop = IoU(B*, B_hat) [Source: 8: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] ; a thresholding sketch follows this list.
  • The Unified Objective: Your new GRPO reward function becomes R_total = λ_1·R_acc + λ_2·R_format + λ_3·r_crop + λ_4·AWAD [Source: 7: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This dense, stage-aware supervision explicitly optimizes the model's intermediate visual grounding step-by-step, preventing hallucination cascades early in the rollout [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 9: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
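Where no explicit crop action is emitted, the thresholding route above can be sketched as follows; the quantile cutoff is an illustrative assumption, and the (H, W) patch-level attention map is assumed to be extracted upstream:

```python
import torch

def attn_to_box(attn: torch.Tensor, q: float = 0.9):
    """Threshold an (H, W) attention map into a predicted bounding box
    B_hat: keep the top (1 - q) attention mass and take the tight box
    around it. The quantile choice is a placeholder, not a paper value."""
    thresh = torch.quantile(attn.flatten(), q)
    ys, xs = torch.nonzero(attn >= thresh, as_tuple=True)
    if len(xs) == 0:
        return None  # degenerate map: no region survives the threshold
    return (xs.min().item(), ys.min().item(),
            xs.max().item() + 1, ys.max().item() + 1)  # (x1, y1, x2, y2)
```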

3. Adapting "Precise Selection" for VMR Distractor Evaluation

The UniDoc-RL Concept: UniDoc-RL uses a Precise Selection Action to bridge the semantic gap between coarse retrieval candidates and actual reasoning, acting as an LVLM-driven semantic filter to discard irrelevant noise before making a final decision [Source: 3: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 10: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

Integration into your Pipeline: Because you have shifted away from real-time NodeRAG retrieval to save VRAM, you no longer have a "coarse retrieval" step during inference. However, your Verifiable Multiple-Choice Reformulation (VMR) pits the correct abstract concept against hard "Partially Literal Distractors."

  • The Mechanism: You can adapt the Precise Selection action to force the 3B model to explicitly evaluate the semantic alignment between the actively cropped visual regions and each of the multiple-choice options before outputting its final text answer [Source: 10: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
  • The Benefit: Instead of jumping straight to the final answer, the model uses its latent tokens to semantically "filter" the VMR distractors one by one. This reinforces the "reasoning by contrast" paradigm, ensuring the model explicitly rejects the literal distractors based on the active visual evidence it just gathered.

4. Key Findings to Justify Your Architecture

If you are writing your case study or portfolio, UniDoc-RL provides empirical proof that your architectural shift is the correct one:

  • Proof that Passive Vision Fails: UniDoc-RL's findings prove that models trained with standard Supervised Fine-Tuning (SFT) suffer from "exposure bias" and default to "passive full-image reading," frequently failing to extract critical local information [Source: 11: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 12: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This perfectly justifies your decision to abandon standard text CoT in favor of active latent routing (CSG) and RL optimization.
  • Proof that RL Refines Visual Gaze: UniDoc-RL empirically demonstrates that RL training shifts a model from generating coarse, meaningless visual crops to executing "precise crops" that tightly bound the regions of interest [Source: 12: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This validates your hypothesis that GRPO and Reinforced Attention Learning (RAL/AWAD) will successfully teach your deep hub neurons to zero in on verifiable visual representamens rather than relying on language priors.

Summary: By integrating UniDoc-RL's Active Visual Perception action and IoU-based r_crop reward, you can transition your VMR-RLVR pipeline from relying purely on implicit internal attention (AWAD) to executing explicit, verifiable, coarse-to-fine visual extraction. This mathematically guarantees that your <lvr> tokens are only compressing the exact visual evidence required to solve the metonymy, pushing your edge-deployed 3B model closer to domain-specific superintelligence.


User

Can these mechanisms discussed recently be applied to the use case of Visual Metonymy described in "A Computational Approach to Visual Metonymy" ?


Model

Yes, the recently discussed mechanisms—Active Visual Perception, Dense Rewards, Visual-Anchored Reward Shaping, and Verifiable Multiple-Choice Reformulation (VMR-RLVR)—map perfectly onto the cognitive and architectural demands of the Visual Metonymy (ViMET) use case as defined in "A Computational Approach to Visual Metonymy" [Source: 1: A Computational Approach to Visual Metonymy] .

In fact, applying these specific reinforcement learning and latent reasoning techniques provides the exact mathematical framework needed to close the 21% accuracy gap between current Vision-Language Models (VLMs) and human performance on the ViMET benchmark [Source: 1: A Computational Approach to Visual Metonymy] [Source: 2: A Computational Approach to Visual Metonymy] .

Here is a detailed breakdown of how each mechanism resolves the unique theoretical challenges of visual metonymy outlined in the research:

1. Active Visual Perception for Multiple Distributed Representamens

The ViMET research highlights a critical distinction between textual and visual metonymy: while text usually relies on a single metonymic cue, visual metonymy is compositional, simultaneously supporting multiple associative relations drawing on several visual cues distributed across the scene [Source: 3: A Computational Approach to Visual Metonymy] . For example, the abstract concept of an "artist" is evoked by the simultaneous presence of a canvas, a colour palette, a sculpture, and a brush [Source: 4: A Computational Approach to Visual Metonymy] .

  • The Application: Passive visual processing—where a model encodes the entire high-resolution image at once—wastes context capacity on irrelevant background noise and struggles to isolate these distributed cues [Source: 5: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . By applying Active Visual Perception, the VLM acts as an agent that actively generates bounding box coordinates to crop and adaptively zoom into each specific representamen (e.g., isolating the brush and the palette) [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . This mirrors human coarse-to-fine visual extraction, forcing the model to explicitly isolate the multiple distributed cues before attempting to synthesize them [Source: 3: A Computational Approach to Visual Metonymy] [Source: 6: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .

2. Dense Rewards (r_crop) to Verify the Semiotic Triad

The ViMET paper defines visual metonymy using Charles Sanders Peirce’s Semiotic Triad: an Object (the abstract concept) is evoked by a Representamen (the visual cue), prompting an Interpretant (the mental inference) [Source: 4: A Computational Approach to Visual Metonymy] [Source: 7: A Computational Approach to Visual Metonymy] .

  • The Application: If you only use a sparse outcome reward (did the model guess "artist" correctly?), you cannot verify if the model actually looked at the representamens or if it just hallucinated the answer based on language priors [Source: 8: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 9: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . By integrating a Dense Multi-Reward Scheme featuring a Visual Perception Reward (r_crop), you can explicitly supervise the first step of the triad [Source: 10: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . Using offline bounding boxes (extracted via Grounding DINO) as ground truth, the model receives an Intersection over Union (IoU) reward specifically for successfully cropping the correct representamens during its active perception phase [Source: 11: Precision Perception: Integrating Active Visual Cues and Dense Rewards] . This mathematically guarantees that the model has acquired the correct visual evidence before moving on to the abstract reasoning phase [Source: 11: Precision Perception: Integrating Active Visual Cues and Dense Rewards] .

3. Visual-Anchored Reward Shaping (r_visual) for Conceptual Inference

The ViMET researchers observe that visual metonymy operates predominantly as a conceptual phenomenon rather than a purely visual one [Source: 12: A Computational Approach to Visual Metonymy] . The leap from the representamens to the abstract concept requires profound cognitive association [Source: 12: A Computational Approach to Visual Metonymy] [Source: 13: A Computational Approach to Visual Metonymy] .

  • The Application: During long chains of latent reasoning, VLMs often suffer from "Lazy Attention Localization," where they abandon visual tokens and rely heavily on the system prompt or language priors to make conceptual leaps [Source: 8: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 14: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . To ensure the model's "Interpretant" remains grounded in reality, you apply Visual-Anchored Reward Shaping (r_visual) within the reinforcement learning loop [Source: 14: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 15: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . This reward structure dynamically calculates the ratio of attention assigned to visual tokens relative to system tokens [Source: 15: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . It mathematically forces the model to sustain continuous visual anchoring on the cropped representamens while it computes the abstract conceptual association [Source: 15: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .

4. VMR-RLVR and "Reasoning by Contrast"

Because visual metonymy relies on interpreting absent concepts indirectly, it is highly subjective and lacks a single programmatic ground truth [Source: 16: A Computational Approach to Visual Metonymy] . However, the ViMET dataset was explicitly constructed as a benchmark of 2,000 multiple-choice questions [Source: 1: A Computational Approach to Visual Metonymy] . Crucially, the distractors in these questions (e.g., pairing "age" with "recuperation" or "disability") were meticulously engineered using ConceptNet to be both visually and semantically proximate to the correct concept, making them incredibly hard to distinguish [Source: 17: 19th March] [Source: 18: A Computational Approach to Visual Metonymy] .

  • The Application: This dataset structure is the exact prerequisite for the Verifiable Multiple-Choice Reformulation (VMR-RLVR) framework [Source: 19: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] . By utilizing these pre-existing hard distractors, the VMR-RLVR pipeline assigns deterministic binary rewards (+1 for the correct concept, 0 for the semantic distractor) [Source: 20: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] [Source: 21: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf] . During Group Relative Policy Optimization (GRPO), the VLM engages in "reasoning by contrast," exploring multiple latent trajectories to learn the intrinsic, logical boundaries that separate a correct pragmatic interpretation from a superficial literal mistake [Source: 22: 19th March] [Source: 23: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] .

5. Aligning Latent Bottleneck Capacity with the Dataset Structure

The ViMET dataset generation pipeline explicitly prompts Llama 3 to generate exactly 5 representamens per target concept [Source: 24: 19th March] [Source: 25: A Computational Approach to Visual Metonymy] .

  • The Application: This semiotic design perfectly justifies the architectural decision to use exactly K=5 Latent Visual Reasoning (<lvr>) tokens in your bottleneck [Source: 26: 19th March] [Source: 27: 19th March] . Rather than treating the tokens as an arbitrary compression mechanism, you semantically align the network's capacity with the dataset's structure: providing one continuous, pre-seeded "thought slot" for each possible representamen the model needs to extract [Source: 24: 19th March] [Source: 26: 19th March] .

By unifying these advanced reinforcement learning mechanisms, you transition your architecture from passively guessing metonymic associations to actively cropping the visual cues, verifying their spatial accuracy via dense rewards, and securely navigating to the correct abstract concept via verifiable contrastive learning.


User

can the methodology, solution approach, any of key architecture mechanisms, components, or key findings be integrated into the above finalized solution approach?


Model

Yes, the recent sources provide several cutting-edge architectural mechanisms and training paradigms that can be directly integrated into your finalized "VMR-RLVR + Latent Reasoning" approach. These additions specifically address the limitations of visual attention, stabilize reinforcement learning, and optimize the knowledge transfer from your 32B teacher to your 3B student.

Here are the key integrations you should add to your case study outline, categorized by where they fit into your pipeline:

1. Enhancements to Stage 2: The RL Optimization Loop

The new research provides mechanisms to make your VMR-RLVR (Group Relative Policy Optimization) loop significantly more stable and visually grounded.

  • Reward-Guided KL Masking (from the VOLD framework): In your current pipeline, you might experience interference if the model tries to maximize the RL reward while simultaneously being constrained by a strict distillation penalty against the teacher. The VOLD framework solves this using "Reward-Guided KL Masking" [Source: 1: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] . Since your VMR task provides binary rewards (1 for correct, 0 for incorrect), you can use the mathematical term (1 − r) to create a mask [Source: 1: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] . When your 3B model generates a correct rollout (r = 1), the KL distillation loss against the teacher is completely masked out (set to zero), allowing the student to retain its successful, independently discovered reasoning strategy [Source: 2: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] . Teacher distillation only activates on failed attempts (r = 0), providing guidance only when the student makes a mistake [Source: 2: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] ; a masking sketch follows this list.
  • Visual-Anchored Reward Shaping (r_visual) (from the AVAR framework): The AVAR research identifies a critical flaw called "Lazy Attention Localization," where multimodal cold-start training fails to increase a model's actual attention to visual tokens, leaving it overly reliant on language/system priors [Source: 3: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 4: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . To fix this in your SPECTRA/GRPO loop, you can add an explicit Visual Attention Reward (r_visual) alongside your AWAD objective [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . This reward explicitly evaluates the ratio of attention assigned to visual tokens relative to system tokens during the rollout [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] . Your final GRPO reward becomes a combination of task accuracy (r_accuracy), format compliance (r_format), and this visual anchoring score (λ_v · r_visual), mathematically forcing the model to sustain visual grounding throughout its latent reasoning steps [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 6: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] .
  • Strict Format Rewards (R_format): Drawing from the cross-style reasoning framework by Cheshmi et al., your deterministic reward should explicitly include a format reward (R_format) alongside the accuracy reward (R_acc) [Source: 7: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 8: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] . This assigns a binary score verifying whether the model correctly utilized the necessary structural tags (like <think> or <lvr> boundaries) before generating the final answer, ensuring strict adherence to the latent reasoning schema [Source: 8: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] .
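A minimal sketch of the Reward-Guided KL Masking term, assuming binary rewards and aligned student/teacher logits (the temperature and reduction are illustrative choices):

```python
import torch
import torch.nn.functional as F

def masked_distill_loss(student_logits, teacher_logits, reward: float,
                        temperature: float = 1.0):
    """Reward-Guided KL Masking sketch: the (1 - r) term zeroes the
    teacher KL penalty on successful rollouts (r = 1), so distillation
    only steers the student when it fails (r = 0)."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s, t, reduction="batchmean")
    return (1.0 - reward) * kl  # masked out entirely when reward == 1
```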

2. Enhancements to Stage 1: The Latent Bottleneck & Distillation

Your current architecture uses K=5 latent tokens (<lvr>) seeded by NodeRAG. The new sources refine how these tokens should be supervised during the Supervised Fine-Tuning (SFT) phase.

  • K-Separate Per-Representamen Targets: As noted in the critique of standard LaViT, visual metonymy produces "diffuse" attention because there is no single correct salient region in the image [Source: 9: 19th March] . Rather than trying to distill a single, averaged visual semantic target (V_sem) from the teacher, you should extract K separate V_sem_k vectors—one for each of the 5 representamens [Source: 12: 19th March] . By using the offline bounding boxes (from Grounding DINO) to extract hidden states only at the specific patch positions of each representamen, your L_concept loss becomes a sum of K distinct cosine distances [Source: 12: 19th March] . This ensures each of your 5 <lvr> tokens is explicitly supervised to encode its specific assigned representamen, solving the diffuse attention problem [Source: 12: 19th March] ; a per-slot loss sketch follows this list.
  • Active Visual Perception Actions (from UniDoc-RL): Instead of just having the <lvr> tokens passively attend to the image, you can integrate the "Visual Perception" action space from the UniDoc-RL framework [Source: 13: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . You can train the latent bottleneck to actively generate bounding box coordinates to crop and adaptively zoom into the information-dense regions of the image (the representamens), effectively removing redundant background noise before making the final metonymic inference [Source: 13: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] [Source: 14: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] . You can then supervise this action with an Intersection over Union (IoU) perception reward against your offline Grounding DINO boxes [Source: 15: UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards] .
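A minimal sketch of the per-slot L_concept, assuming one projected <lvr> state and one box-pooled teacher target per representamen (the shapes are illustrative; the 3584 dimension echoes the teacher space described earlier):

```python
import torch
import torch.nn.functional as F

def l_concept_per_slot(lvr_states: torch.Tensor,
                       v_sem_k: torch.Tensor) -> torch.Tensor:
    """Per-slot concept loss sketch: lvr_states is (K, D), one projected
    <lvr> state per slot; v_sem_k is (K, D), one teacher target pooled
    from the patches inside each representamen's verified box. The loss
    is the sum of K distinct cosine distances, so slot k is explicitly
    tied to representamen k."""
    cos = F.cosine_similarity(lvr_states, v_sem_k, dim=-1)  # (K,)
    return (1.0 - cos).sum()

loss = l_concept_per_slot(torch.randn(5, 3584), torch.randn(5, 3584))
```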

3. Enhancements to the Research Roadmap

  • Joint Cross-Style Training for Generalized Reasoning: In your current roadmap, you planned to test cross-style generalization (sarcasm, humor, metaphor) sequentially as a future phase. However, findings from the multimodal figurative language research show that training jointly across multiple figurative styles yields a single generalized reasoning VLM that significantly outperforms models trained on a single specific style [Source: 16: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 17: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] . You should update your roadmap to include a unified training phase where the VMR-RLVR loop optimizes the model on a combined dataset of metonymy, sarcasm, and metaphor, leveraging the shared cognitive demand for non-literal, cross-modal mapping to boost overall performance [Source: 17: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 18: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] .
  • Training-Free Attention Modulation (Inference Optimization): For your edge-deployment roadmap phase, you can implement a training-free intervention discovered in the AVAR research. By introducing scaling factors during inference to manually suppress the attention weights of system tokens (by a factor α_sys) and amplify visual tokens (α_img), you can extract an additional 1-2% reasoning performance from your finalized 3B model at runtime without any further training [Source: 19: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] ; a minimal sketch follows this list.
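A minimal sketch of the modulation on post-softmax attention weights; the alpha values and the choice to renormalise rows are placeholder assumptions, since AVAR tunes these factors empirically:

```python
import torch

def modulate_attention(weights, sys_idx, img_idx,
                       alpha_sys: float = 0.5, alpha_img: float = 1.5):
    """Training-free modulation sketch: rescale post-softmax attention
    weights so system keys are suppressed (alpha_sys < 1) and image keys
    amplified (alpha_img > 1), then renormalise rows to sum to 1."""
    w = weights.clone()
    w[..., sys_idx] = w[..., sys_idx] * alpha_sys
    w[..., img_idx] = w[..., img_idx] * alpha_img
    return w / w.sum(dim=-1, keepdim=True)

w = torch.rand(2, 16).softmax(-1)  # toy attention rows over 16 keys
print(modulate_attention(w, list(range(4)), list(range(4, 12))).sum(-1))
```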

How to frame this in your Case Study: You can position these integrations as the "Refinement Layer" of your compelling shift. While VMR-RLVR + Latent Reasoning establishes the core engine, Reward-Guided KL Masking [Source: 1: VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation] and Visual-Anchored Reward Shaping [Source: 5: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] provide the exact mathematical guardrails needed to ensure the reinforcement learning loop converges stably and remains physically grounded in the image pixels.


User

Can I consider the "VMR-RLVR + Latent Reasoning" pipeline as the compelling shift I mentioned above? Accordingly, can you rewrite the answer?


Model

Yes, you can absolutely frame the "VMR-RLVR + Latent Reasoning" pipeline as the central compelling shift of your research. This framing is highly strategic because it represents a major paradigm shift: moving away from computationally heavy, text-based reasoning and online graph retrieval toward an autonomous, trial-and-error optimization that takes place entirely within continuous, non-textual hidden states [Source: 1: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

By defining this as your compelling shift, you position your work at the cutting edge of post-training paradigms, where models learn how to reason implicitly rather than just memorizing explicitly provided text [Source: 4: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

Here is the rewritten, detailed outline for your case study, explicitly anchoring the narrative around this compelling shift:

Executive Summary

  • The Problem Context (The Literal Wall & The Curse of CoT): Current Vision-Language Models (VLMs) suffer from a "Perception Gap," mimicking text outputs without grounding them in visual evidence [Source: 5: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 7: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning]. Attempts to fix this using text-based Chain-of-Thought (CoT) lead to "context dilution" and "hallucinatory drift," while relying on massive online Knowledge Graphs (GraphRAG) exceeds edge deployment constraints [Source: 8: Lexicon of Visual Metonymy and Vision-Language Models].
  • The Compelling Shift (VMR-RLVR + Latent Reasoning): The architecture pivots entirely to a unified "VMR-RLVR + Latent Reasoning" pipeline. By replacing verbose textual reasoning with K=5 continuous latent tokens and optimizing them via reinforcement learning, the model is forced to internally abstract the associative logic of visual metonymy without requiring an expensive online retrieval system during inference [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • The Mechanism of Action: Open-ended, subjective visual metonymy tasks are reformulated into verifiable multiple-choice questions (VMR) [Source: 9: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. This enables Reinforcement Learning with Verifiable Rewards (RLVR), allowing the 3B student model to learn "reasoning by contrast" entirely through trial-and-error exploration of its internal attention policy [Source: 10: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
  • The Core Impact: This shift achieves massive reasoning density and cross-modal alignment within a strict 16GB VRAM hardware constraint, mathematically guaranteeing that the model anchors its final decision on verifiable visual evidence rather than language priors [Source: 6: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].

High‑Level Vision System Architecture

  • The Reasoning Engine (The Edge VLM): Qwen2.5-VL-3B-Instruct serves as the base student model, executing entirely on-device [Source: 15: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 16: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • The Latent Visual Reasoning Bottleneck (<lvr>):
    • Textual reasoning traces are completely decoupled and replaced by exactly K=5 Latent Visual Reasoning tokens (acting as "Visual Information Containers") [Source: 2: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
    • Curriculum Sensory Gating (CSG) is implemented to block direct text-to-image cross-attention, physically forcing the language head to route visual processing strictly through this latent bottleneck [Source: 18: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • Offline Knowledge & Distractor Generation (The Graph's New Role):
    • Live, multi-hop NodeRAG retrieval is deprecated to save inference compute.
    • Instead, tools like VaLiK and ConceptNet are used strictly offline to engineer the Verifiable Multiple-Choice Reformulations (VMR), creating the target concepts and the hard semantic "Partially Literal Distractors" required for RLVR training; a construction sketch follows this list [Source: 9: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 19: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
    • Offline Grounding DINO is used to extract exact bounding box coordinates of the visual cues (representamens) for training loss calculations [Source: 20: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 21: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
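As an illustration of the offline VMR construction step, here is a minimal sketch; the function name, the four-option format, and the idea of sampling distractors from a pre-mined ConceptNet neighbour pool are assumptions for illustration:

```python
import random

def build_vmr_item(image_id, correct_concept, distractor_pool, seed=0):
    """Assemble one verifiable multiple-choice item for RLVR training.

    distractor_pool: "Partially Literal Distractors" mined offline, e.g.
    ConceptNet neighbours of the literal objects visible in the image.
    Returns lettered options plus the gold letter for binary reward checks.
    """
    rng = random.Random(seed)
    options = rng.sample(distractor_pool, 3) + [correct_concept]
    rng.shuffle(options)
    gold = "ABCD"[options.index(correct_concept)]
    return {"image": image_id,
            "options": dict(zip("ABCD", options)),
            "answer": gold}
```

Because the gold letter is stored with each item, the downstream RLVR reward check reduces to a pure string comparison.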

Modeling Strategy

  • Stage 1: Latent Policy Alignment (SFT):
    • Teacher-Student Distillation: A high-capacity teacher (Qwen2.5-VL-32B) provides the initial target distributions for the student's latent tokens [Source: 20: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
    • The Tri-Loss Setup: Supervised Fine-Tuning utilizes Next-Token Prediction (NTP), Semantic Reconstruction to align the latent states with the teacher's holistic visual concepts, and Trajectory Alignment via Earth Mover's Distance (Wasserstein-1) to map attention to the offline bounding boxes [Source: 22: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 23: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • Stage 2: The Compelling Shift (VMR-RLVR Optimization):
    • Verifiable Multiple-Choice Reformulation (VMR): The subjective visual metonymy task is restructured into a deterministic A/B/C/D format, converting an unverifiable open-ended generation into a verifiable binary reward system [Source: 9: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 24: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
    • Group Relative Policy Optimization (GRPO): The model explores multiple latent reasoning trajectories, receiving a reward of +1 for selecting the correct abstract concept and 0 for falling for the literal distractor; see the advantage sketch after this list [Source: 25: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv] [Source: 26: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
  • Visual Anchoring & Internal Policy Steering:
    • Reinforced Attention Learning (RAL): Utilizes Advantage-Weighted Attention Divergence (AWAD) to optimize the internal attention distribution policy. The model's deep hub neurons are penalized if they guess the right answer but fail to look at the correct representamens, eliminating blind guessing [Source: 12: Reinforced Attention Learning] [Source: 27: Reinforced Attention Learning].
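The following is a minimal sketch of the critic-free, group-relative advantage computation this stage implies; standardizing the binary rewards by the group's mean and standard deviation follows the common GRPO recipe and is an assumption here, not code from the cited sources:

```python
import torch

def grpo_advantages(chosen_letters, gold_letter, eps=1e-6):
    """Critic-free, group-relative advantages from binary VMR rewards.

    chosen_letters: the answers picked by G rollouts of the same prompt.
    Each rollout earns +1 for the gold option and 0 otherwise; its advantage
    is the reward standardized against the group's mean and std.
    """
    rewards = torch.tensor([1.0 if c == gold_letter else 0.0
                            for c in chosen_letters])
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```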

Research Roadmap

  • Phase 1: Pipeline Stabilization & Metric Validation: Finalize the integration of the Latent Visual Reasoning tokens with the VMR-RLVR loop on the 3B model. Validate that the S_focus (Visual Focusing Score) and Reasoning Density metrics mathematically confirm the elimination of the "Perception Gap" [Source: 5: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 13: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
  • Phase 2: Cross-Style Generalization: Extend the VMR-RLVR + Latent Reasoning pipeline beyond visual metonymy to test reasoning transfer on other subjective figurative language datasets (e.g., visual sarcasm, humor, and metaphor), measuring how well the internal attention policy generalizes across styles [Source: 28: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv].
  • Phase 3: Edge Autonomy & Interpretability: Optimize the final reasoning model for strict edge deployment using INT8 quantization and Intel OpenVINO. Implement Logit Lens probing to analyze the internal representations of the latent tokens, tracking how the model shifts from literal to abstract concepts during its continuous "thinking" phase [Source: 31: The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models] [Source: 32: The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models].

Evaluation

  • Primary Metonymy & Reasoning Benchmarks:
    • ViMET (Visual Metonymy Dataset): To evaluate the core associative reasoning shift (literal vs. abstract conceptualization) [Source: 33: A Computational Approach to Visual Metonymy] [Source: 34: A Computational Approach to Visual Metonymy].
    • Sherlock Dataset: To evaluate broader visual abductive reasoning (inferring situational hypotheses based on salient visual clues) [Source: 35: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association] [Source: 36: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association].
  • Key Performance Metrics:
    • Accuracy & Exact Match (EM): To measure the final deterministic output of the VMR-RLVR multiple-choice predictions [Source: 15: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv].
    • Visual Focusing Score (S_focus) & Visual Attention Score (VAS): To quantify the attention mass concentrated on the required visual representamens, verifying that the latent tokens successfully grounded the reasoning; a computation sketch follows this list [Source: 37: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf] [Source: 38: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf].
    • Reasoning Density: To quantify the number of distinct reasoning steps successfully compressed within the tiny K=5 token budget compared to traditional verbose CoT [Source: 13: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
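A minimal sketch of how S_focus could be computed from an attention map and the offline boxes; the exact normalization and which attention heads/layers are pooled are assumptions:

```python
import torch

def visual_focus_score(attn_over_patches, patch_in_box):
    """S_focus: the share of image-attention mass that lands inside the
    offline Grounding DINO boxes of the representamens.

    attn_over_patches: (num_patches,) attention from the <lvr>/answer tokens
                       to the image patches, normalized to sum to 1
    patch_in_box:      (num_patches,) boolean mask, True where the patch
                       center falls inside any representamen box
    """
    return attn_over_patches[patch_in_box].sum().item()
```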

User

I have recently identified a compelling shift in the direction of my solution approach and need to compile all relevant, accurate, and up‑to‑date information for a case study. My current deliverable is a literature review or case study, as I am still in the planning stage of my research. The topics you suggest will also help me craft a prompt for generating these deliverables later. I intend to include the following sections in my deliverables; I just need a detailed outline in bullet-point format for the following sections: High‑level vision system architecture, Modeling strategy, Research roadmap, Executive summary, Evaluation.


Model

Here is a detailed, structured outline for your case study, synthesized from the sources and our conversation history, capturing your recent architectural shift toward Latent Visual Reasoning and Reinforcement Learning.

Executive Summary

  • The Core Problem (The "Literal Wall" & The Perception Gap): Current Vision-Language Models (VLMs) excel at literal grounding but fail significantly at associative grounding (visual metonymy), exhibiting a 21% accuracy gap compared to human performance (86.9% vs. 65.9%) [Source: 1: A Computational Approach to Visual Metonymy], [Source: 2: A Computational Approach to Visual Metonymy], [Source: 3: Multimodal Pragmatic Inference: The Theoretical and Computational Rigor of Reasoning Beyond Literal]. Models often suffer from a "Perception Gap," mimicking text outputs without actually looking at relevant visual evidence [Source: 4: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 5: Lexicon of Visual Metonymy and Vision-Language Models].
  • The Architectural Constraint: Deploying on edge devices imposes a strict 16GB VRAM limitation, rendering standard 32B+ parameter models or massive multi-hop GraphRAG retrievals unfeasible for real-time inference [Source: 6: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research], [Source: 8: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination].
  • The Paradigm Shift (Overcoming the Curse of CoT): The architecture explicitly rejects text-based Chain-of-Thought (RBL) due to the "Curse of CoT," which causes context dilution and hallucinatory drift [Source: 6: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research], [Source: 9: Lexicon of Visual Metonymy and Vision-Language Models]. The framework pivots from "thinking while looking" to "thinking before looking" by decoupling reasoning from verbalization [Source: 10: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy], [Source: 11: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy], [Source: 12: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination].
  • The Final Solution: A hybrid pipeline synthesizing LaViT (Latent Visual Thoughts), NodeRAG (Heterogeneous Knowledge Graphs), and VMR-RLVR (Verifiable Multiple-Choice Reformulation with Reinforcement Learning). This converts the 3B student model into an active agent (SPECTRA) that leverages deterministic trial-and-error optimization rather than passive teacher distillation [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research], [Source: 13: Associative Grounding: Synthesising LaViT, NodeRAG and PEFT for Metonymy], [Source: 14: SPECTRA: Shifting VLM Trajectory from Teacher Distillation to RL-Agentics].

High‑Level Vision System Architecture

  • Offline Knowledge Extraction & Bounding Box Generation:
    • A Chain-of-Experts (CoE) via the VaLiK framework (e.g., BLIP-2, LLaVA, Qwen2-VL, DeepSeek-R1) acts as an offline preprocessor to extract visual-semantic relationships without requiring human text annotations [Source: 15: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning - CVF Open Access], [Source: 16: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning - CVF Open Access], [Source: 17: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning - CVF Open Access].
    • Grounding DINO or OWL-v2 is used strictly offline to extract exact bounding box coordinates for the visual cues (representamens) to define verified target regions [Source: 18: 19th March], [Source: 19: 19th March].
  • Structured Brain (NodeRAG Heterograph):
    • Extracted semantic rules are stored in a fully nodalized heterogeneous graph utilizing a strict taxonomy: Entities (N), Relationships (R), Semantic Units (S), and High-Level Elements (H) [Source: 20: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination], [Source: 21: Lexicon of Visual Metonymy and Vision-Language Models], [Source: 22: NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes].
    • Dual Search (Exact string match for nodes + HNSW vector similarity) allows fast retrieval of cognitive priors [Source: 23: Lexicon of Visual Metonymy and Vision-Language Models], [Source: 24: NodeRAG: Installing a "Structured Brain" for LLMs — AI Innovations and Insights 43], [Source: 25: NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes].
  • The Reasoning Engine (Qwen2.5-VL-3B-Instruct):
    • The live student model executes the inference on-device under strict VRAM limits using QLoRA adapters [Source: 26: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research].
  • The Latent Visual Reasoning Bottleneck (<lvr>):
    • The text reasoning traces are replaced with exactly K=5 continuous Latent Visual Reasoning (<lvr>) tokens [Source: 6: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research].
    • These tokens are initialized via Concept-Conditioned Seeding (projecting NodeRAG semantic priors directly into the token weights before the forward pass) [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research], [Source: 27: Lexicon of Visual Metonymy and Vision-Language Models].
    • Curriculum Sensory Gating (CSG) temporarily blinds the model's text-to-image cross-attention, physically forcing the language head to route visual processing strictly through the seeded <lvr> tokens [Source: 28: 19th March], [Source: 29: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 9: Lexicon of Visual Metonymy and Vision-Language Models].

Modeling Strategy

  • Stage 1: Supervised Fine-Tuning (SFT) & Distillation:
    • Teacher/Student Setup: A high-capacity teacher (Qwen2.5-VL-32B, unquantized) provides the initial target distributions for the 3B student (bfloat16 precision, abandoning 4-bit quantization to prevent embedding resize trauma) [Source: 30: 19th March], [Source: 6: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research].
    • Tri-Loss Objective: The distillation utilizes three specific losses: Next-Token Prediction (NTP), Semantic Reconstruction (L_concept) to align hidden states with the teacher's holistic visual concepts, and Trajectory Alignment (L_traj) [Source: 31: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 32: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 33: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 34: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
    • Mathematical Constraint: Earth Mover's Distance (Wasserstein-1) is explicitly used for L_traj instead of KL Divergence to prevent NaN crashes on sparse bounding box targets [Source: 6: 19th March].
  • Stage 2: Reinforcement Learning (VMR-RLVR & SPECTRA):
    • Verifiable Multiple-Choice Reformulation (VMR): Subjective open-ended metonymy tasks are restructured into deterministic A/B/C/D formats pitting the correct concept against "Partially Literal Distractors" engineered via ConceptNet [Source: 35: 19th March], [Source: 36: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation], [Source: 37: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation], [Source: 38: Lexicon of Visual Metonymy and Vision-Language Models], [Source: 39: RLVR to Open-Ended Tasks via Verifiable MCQ reformulation.pdf].
    • Optimization Algorithm: Group Relative Policy Optimization (GRPO) evaluates groups of rollouts relative to each other, granting a binary reward (+1 for the correct concept, 0 for distractors) without needing a separate, memory-intensive critic model [Source: 40: An Alternative Trajectory for Generative AI - arXiv], [Source: 41: An Alternative Trajectory for Generative AI - arXiv], [Source: 42: Lexicon of Visual Metonymy and Vision-Language Models], [Source: 43: Reinforced Attention Learning].
  • Visual Anchoring & Guardrails:
    • Advantage-Weighted Attention Divergence (AWAD): A Reinforced Attention Learning (RAL) objective that scores the model's internal gaze. If the model guesses correctly but looks at the background, it receives a negative advantage. If it points its "searchlight" at the verified representamens, it receives a positive advantage; a divergence sketch follows this list [Source: 44: JSD in Vision-Language Models], [Source: 45: Lexicon of Visual Metonymy and Vision-Language Models], [Source: 46: Reinforced Attention Learning], [Source: 47: Reinforced Attention Learning].
    • Empirical Visual Perception Verification (EVPV): An offline object detector acts as a guardrail during rollouts to verify if the model is attending to the correct bounding boxes [Source: 6: 19th March], [Source: 23: Lexicon of Visual Metonymy and Vision-Language Models].
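One plausible reading of the AWAD term is sketched below: the rollout's GRPO advantage is weighted by the Jensen-Shannon divergence between its attention map and a box-derived target distribution. The cited RAL work may use a different divergence or weighting; treat this as a schematic, not the source's implementation:

```python
import torch

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two attention distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        return (a * ((a + eps) / (b + eps)).log()).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def awad_term(attn, box_target, advantage):
    """Advantage-Weighted Attention Divergence (one plausible reading).

    attn:       (num_patches,) the rollout's attention over image patches
    box_target: (num_patches,) target distribution: uniform mass inside the
                representamen boxes, near-zero outside
    advantage:  scalar GRPO advantage for this rollout; a correct guess with
                an off-target gaze keeps a large divergence, so its effective
                credit is cut, which is the "no blind guessing" guardrail
    """
    return advantage * jsd(attn, box_target)
```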

Evaluation

  • Core Benchmarks:
    • ViMET (Visual Metonymy Dataset): Evaluates the core associative reasoning using 2,000 multiple-choice questions grounded in 1,004 concepts spanning naturalistic and stylistic images [Source: 1: A Computational Approach to Visual Metonymy], [Source: 48: A Computational Approach to Visual Metonymy], [Source: 49: A Computational Approach to Visual Metonymy].
    • Sherlock Dataset: Evaluates visual abductive reasoning (inferring plausible situational hypotheses based on a salient visual clue) [Source: 50: The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning - ECVA | European Computer Vision Association], [Source: 51: The Epistemology of Visual Abduction: Analysis of the Sherlock Dataset and Its Implications for Multimodal Cognitive Architectures].
    • General Multimodal Reasoning: MathVision (for multi-step spatial/geometric reasoning) and MMMU-Pro (for multidisciplinary evaluation) [Source: 52: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf], [Source: 53: Reinforced Attention Learning], [Source: 54: Vision Language Models (Better, faster, stronger)], [Source: 55: Vision Language Models (Better, faster, stronger)].
  • Key Performance Metrics:
    • Accuracy & Exact Match (EM): Evaluates the final deterministic output of the VMR-RLVR pipeline [Source: 56: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv], [Source: 57: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation], [Source: 58: Query-Driven Multimodal GraphRAG: Dynamic Local Knowledge Graph Construction for Online Reasoning - ACL Anthology].
    • Visual Focusing Score (S_focus) & Visual Attention Score (VAS): Quantifies the percentage of attention mass that correctly falls within the offline-generated bounding boxes of the representamens, verifying literal spatial grounding [Source: 59: 19th March], [Source: 60: From_Narrow_to_Panoramic_Vision_Attention-Guided_C.pdf], [Source: 61: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 62: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning], [Source: 63: Literal Grounding versus Associative Grounding in Visual Metonymy].
    • Attention Entropy (H): Measures the "sharpness" or concentration of the model's visual focus, validating that the latent bottleneck prevents the gaze from drifting; a short sketch follows this list [Source: 64: 19th March], [Source: 65: 19th March], [Source: 66: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
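A minimal sketch of the attention entropy metric over image patches; which attention map (layer, head, query token) is measured is an assumption:

```python
import torch

def attention_entropy(attn_over_patches, eps=1e-12):
    """H = -sum(p * log p) over image patches; lower entropy means a
    sharper, more concentrated visual focus (less gaze drift)."""
    p = attn_over_patches / attn_over_patches.sum()
    return -(p * (p + eps).log()).sum().item()
```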

Research Roadmap

  • Immediate Phase (Pipeline Stabilization): Complete the transition away from text-based RBL toward Latent Visual Reasoning by fully implementing the K=5 bottleneck and Curriculum Sensory Gating (CSG) on the Qwen2.5-VL-3B-Instruct model [Source: 6: 19th March], [Source: 7: Architecting Latent Visual Reasoning for Metonymy Research], [Source: 29: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
  • Intermediate Phase (Cross-Style Generalization & Transfer): Extend the VMR-RLVR reasoning traces beyond metonymy to test cross-style transfer on datasets testing sarcasm, humor (e.g., the Hummus dataset), and metaphor (MultiMET) [Source: 67: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv], [Source: 68: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv], [Source: 69: Cross-style Multimodal Reasoning for Figurative Language Understanding - arXiv], [Source: 70: Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding], [Source: 71: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. Measure how well the optimized internal attention policy handles these related subjective tasks.
  • Long-Term Phase (Edge Autonomy & DSS): Optimize the architecture for edge deployment constraints by employing Intel OpenVINO and INT8 quantization, massively reducing the Time to First Token (TTFT) for rapid NodeRAG index querying [Source: 72: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination], [Source: 8: Architectural Synergy of Heterogeneous Graph Representations and Small-Scale Vision-Language Models for Mitigating Visual Hallucination]. Transition the pipeline to a Domain-Specific Superintelligence (DSS) model, utilizing continuous online RL to allow the 3B agent to self-update its NodeRAG knowledge base through active environment interaction [Source: 73: An Alternative Trajectory for Generative AI - arXiv], [Source: 74: An Alternative Trajectory for Generative AI - arXiv], [Source: 75: An Alternative Trajectory for Generative AI - arXiv], [Source: 14: SPECTRA: Shifting VLM Trajectory from Teacher Distillation to RL-Agentics].

User

It seems methods like VMR-RLVR enriched with Latent Reasoning (if possible within the current architecture of VMR-RLVR) could solve this problem more efficiently than relying on NodeRAG (which could be an overdo). As we are including the VMR-RLVR approach in our final architectural specification, I read the following explanation on how it could be used for VLM tasks: The "free exploration" mentioned in the text refers to the model's ability to generate various reasoning trajectories (Chain-of-Thought) and learn which ones lead to the correct "choice" in the reformulated task. For a VLM, this would mean exploring different ways to interpret visual features and link them to language, ultimately learning to "reason by contrast" just as the LLMs do in this study. Provide me a similar reflection on how I can use VMR-RLVR for VLM use cases such as "Visual Metonymy". Why this works for training: Verifiable Reward: Even though "mood" is subjective, the fact that the model chose B (the pre-determined better answer) is a verifiable fact. The system can now give the model a +1 reward for being "correct." [83] Learning Patterns: By doing this thousands of times, the VLM learns that "good reasoning" involves linking visual features (like the color of the sky) to abstract concepts (like mood), rather than just listing objects it sees. [158] This method allows the VLM to "explore" different ways of looking at an image and learn which interpretations are considered high-quality, all without needing a human to grade every single sentence it writes. I found it helps reasoning in other ways beyond task reformulation, in the following ways: Learning from Comparisons: The model evaluates the options and selects the one that better aligns with the task's objectives (e.g., creativity, emotional depth, or adherence to the prompt). This process helps the model learn why one response is better than another, rather than just predicting a single correct answer. Abstracting Reasoning Patterns: By repeatedly comparing options across diverse tasks, the model learns intrinsic patterns that distinguish high-quality reasoning from poor reasoning. This generalizes beyond specific examples, enabling the model to apply these principles to new, unseen tasks. Structured and Dense Reasoning: VMR-RLVR encourages the model to produce responses with higher reasoning density, meaning more structured and deliberate reasoning steps within a given token budget. This results in more informative and context-specific outputs with less redundancy. Free Exploration: The model explores various reasoning trajectories and learns from its choices, optimizing its ability to generate responses that align with user intent and task requirements.


Model

Integrating Verifiable Multiple-Choice Reformulation (VMR-RLVR) with Latent Reasoning (LaViT) creates a highly efficient, mathematically rigorous optimization environment. Your intuition is exactly correct: by forcing the Vision-Language Model (VLM) to internalize the associative logic during training, you effectively bake the "world knowledge" into the model’s internal attention policy. This significantly reduces the architectural burden of relying on a heavy, online NodeRAG retrieval system, as the model learns to independently resolve visual metonymy through targeted exploration.

Here is a reflection on how VMR-RLVR, supercharged by Latent Reasoning, maps directly to the specific cognitive demands of the "Visual Metonymy" (ViMET) use case:

1. Verifiable Reward: Eliminating Subjectivity in Metonymy

In visual metonymy, an image evokes an abstract concept (e.g., an hourglass, sand, and an old man evoking the concept of "Time" or "Age") [Source: 1: A Computational Approach to Visual Metonymy]. Because these concepts are abstract and interpretive, there is no absolute programmatic ground truth for an open-ended prompt [Source: 2: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

How VMR-RLVR solves this: The framework restructures the subjective task into a deterministic multiple-choice format [Source: 3: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 4: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. It pits the correct abstract concept against "Partially Literal Distractors" or semantic distractors engineered via external knowledge like ConceptNet [Source: 5: A Computational Approach to Visual Metonymy] [Source: 6: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. By forcing the model into a strict A/B/C/D choice, the task becomes objectively verifiable: if the model selects the option denoting the correct concept (s+), it receives a deterministic +1 reward; if it selects the distractor (s−), it receives a 0 [Source: 4: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. This allows the VLM to be trained via Reinforcement Learning without requiring a flawed or hallucination-prone LLM-as-a-judge [Source: 7: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
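A minimal sketch of this deterministic reward check; the letter-parsing regex is an illustrative assumption about the output format:

```python
import re

def vmr_reward(model_output, gold_letter):
    """Deterministic binary reward: +1 iff the first parsed choice letter
    matches the pre-determined gold option, else 0. No judge model needed."""
    match = re.search(r"\b([ABCD])\b", model_output)
    return 1.0 if match and match.group(1) == gold_letter else 0.0
```

For example, `vmr_reward("The answer is B", "B")` returns 1.0, while any other parsed letter (or an unparseable output) returns 0.0.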

2. Learning Patterns: Bridging the Semiotic Triad

To succeed in visual metonymy, a model must navigate Charles Sanders Peirce’s Semiotic Triad—connecting a literal visual cue (the Representamen) to an absent abstract meaning (the Object) via a mental inference (the Interpretant) [Source: 8: A Computational Approach to Visual Metonymy] [Source: 9: Latent Associative Grounding for Visual Metonymy in Small VLMs].

How VMR-RLVR solves this: By running thousands of Group Relative Policy Optimization (GRPO) rollouts, the VLM explicitly learns that "good reasoning" does not mean simply listing the literal objects it perceives (e.g., "I see a canvas and a brush") [Source: 10: A Computational Approach to Visual Metonymy] [Source: 11: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. Instead, it learns to recognize the intrinsic patterns that link distributed visual features to a higher-level abstract concept, overcoming the "Literal Bias" where VLMs merely map pixels to concrete nouns [Source: 12: [GEMINI] Visual Metonymy: Fine-Tuning VLMs] [Source: 13: [GEMINI] Visual Metonymy: Fine-Tuning VLMs].

3. Learning from Comparisons: "Reasoning by Contrast"

How VMR-RLVR solves this: By comparing the correct abstract concept against highly proximate distractors (e.g., choosing between "Justice" and "Bias" when looking at tilted scales), the VLM engages in "reasoning by contrast" [Source: 14: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation] [Source: 15: [GEMINI] Visual Metonymy: Fine-Tuning VLMs]. Instead of merely imitating the token probability distribution of a static dataset, the model evaluates why one associative leap is superior to another [Source: 14: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. It learns to mathematically penalize superficial, visual-matching shortcuts, allowing the VLM to understand the exact logical boundaries that separate correct pragmatic interpretation from a literal mistake [Source: 14: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].

4. Structured and Dense Reasoning: The Power of Latent Tokens

Standard text-based Chain-of-Thought (CoT) forces the model to generate hundreds of English words, which triggers the "Curse of CoT"—leading to context dilution, hallucinatory drift, and redundant, generic statements [Source: 16: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 17: Defeating the Curse of Chain-of-Thought via NodeRAG and Latents].

How Latent Reasoning supercharges this: By integrating Latent Visual Reasoning (<lvr>) into the VMR-RLVR framework, you completely decouple reasoning from verbalization [Source: 18: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy]. Instead of text, the model generates K=5 continuous, non-textual hidden states (acting as "Visual Information Containers") before it outputs its final answer [Source: 19: Architectural Foundations of Latent Chain-of-Thought for Visual Metonymy] [Source: 20: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning]. This guarantees massive Reasoning Density [Source: 21: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. Within a tiny token budget of just 5 latent slots, the model is forced to compress high-dimensional visual semantics and multi-step logic [Source: 20: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 21: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. The latent thoughts become incredibly information-dense, mitigating rote repetition and forcing the model to generate highly prompt-specific, substantive analysis directly in the mathematical latent space [Source: 22: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation].
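To visualize the decoupling, here is a schematic "think-then-answer" loop using a Hugging Face-style interface; feeding the last hidden state back in as the next input embedding is one common latent-CoT recipe and is an assumption here, not the exact LaViT/PLaT mechanism:

```python
import torch

@torch.no_grad()
def think_in_latent_slots(model, inputs_embeds, k_latent=5):
    """Schematic 'thinking before answering' loop (HF-style interface).

    For K steps, the last hidden state is fed back as the next input
    embedding instead of being decoded into a word: these are the
    continuous <lvr> thoughts. Ordinary token-by-token decoding of the
    final answer letter would follow after this loop.
    """
    embeds = inputs_embeds                                 # (B, seq, d)
    for _ in range(k_latent):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]         # (B, 1, d)
        embeds = torch.cat([embeds, thought], dim=1)       # no verbalization
    return embeds
```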

5. Free Exploration: Anchoring the "Searchlight" via AWAD

In standard RL, "free exploration" means the model explores different text generation paths. In a Latent VMR-RLVR architecture, exploration happens directly on the Attention Distribution Policy using Reinforced Attention Learning (RAL) [Source: 23: Reinforced Attention Learning] [Source: 24: Reinforced Attention Learning].

How it works: When the model is presented with the multiple-choice options, it explores different ways to distribute its computational focus across the image patches [Source: 25: JSD in Vision-Language Models]. We utilize Advantage-Weighted Attention Divergence (AWAD) to score this exploration [Source: 9: Latent Associative Grounding for Visual Metonymy in Small VLMs].

  • If the model correctly guesses the concept but its attention was lazily wandering the background, it receives a negative advantage [Source: 25: JSD in Vision-Language Models].
  • If the model points its internal "searchlight" directly at the verified Representamens (the crucial visual evidence) to make its decision, it receives a positive advantage [Source: 25: JSD in Vision-Language Models] [Source: 26: JSD in Vision-Language Models].

This mathematically forces the model to optimize where to look alongside what to infer, completely eliminating the "Perception Gap" (blind guessing based on language priors) [Source: 27: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning] [Source: 28: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].

Why this makes heavy NodeRAG potentially an "Overdo"

If you successfully implement this VMR-RLVR + Latent Reasoning pipeline, your 3B student model's internal policy will have already abstracted the deep reasoning patterns required for metonymy [Source: 14: Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation]. Because the latent bottleneck (<lvr>) and the RAL optimization naturally force the network's hub neurons to learn the associative mapping between visual cues and abstract choices, the model becomes intrinsically capable of cross-modal reasoning without explicit textual verbalization [Source: 29: Lexicon of Visual Metonymy and Vision-Language Models] [Source: 30: Reinforced Attention Learning].

While you might still use an offline knowledge graph (like VaLiK) to prepare your multiple-choice distractors or to provide a lightweight initial seed for your latent tokens, deploying an active, multi-hop NodeRAG retrieval system during real-time inference becomes an unnecessary computational overhead. The model has already learned to "play by the rules" of the graph purely through trial-and-error exploration.


User

Can you expound on how the following claim can be justified from the research paper? Claim: "This makes the latent tokens "pre-aware" of what semantic concept they are supposed to be compressing toward before they even see the image." Research paper: "Latent Chain of Thought - DECOUPLING reasoning from verbalization"


Model

To properly justify this claim, we must first make a minor but critical architectural clarification: The specific phrase "pre-aware" comes from your custom engineering notes on Concept-Conditioned Seeding [Source: 1: 19th March], whereas the research paper "Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization" (often referred to as PLaT) provides the fundamental mathematical framework that makes this "pre-awareness" physically possible [Source: 2: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization] [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization].

If a model is forced to reason using standard text words, you cannot easily inject abstract semantic concepts into its thoughts. By proving that reasoning can be entirely separated from text generation, the PLaT paper unlocks the ability to manipulate the model's thoughts before it processes an image.

Here is the detailed exposition of how the findings in the PLaT paper structurally justify and enable the claim that your latent tokens become "pre-aware."

1. The Prerequisite: Decoupling Reasoning from Discrete Text

The core thesis of the PLaT framework is that reasoning should be reformulated as a "deterministic trajectory of latent planning states" rather than a sequence of discrete vocabulary words [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. In traditional Chain-of-Thought (CoT), the model generates English tokens one by one (e.g., "I", "see", "a", "brush"). Because these are discrete vocabulary tokens, they are rigid; you cannot easily alter their starting mathematical weights without breaking the tokenizer [Source: 4: The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models].

By decoupling reasoning from verbalization, PLaT replaces these discrete words with continuous, non-textual hidden states [Source: 2: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization] [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. A separate decoder only translates these continuous thoughts into text at the very end of the process [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. This shift into continuous space is the exact mechanism that permits us to mathematically "tamper" with the tokens before the forward pass begins.

2. Enabling "Pre-Awareness" via Continuous Initialization

Because the reasoning tokens are now continuous latent states rather than rigid vocabulary words, we do not have to initialize them with a generic or random starting value [Source: 5: 19th March].

Drawing on Vision-Language (VL) distillation literature, your architecture leverages this continuous space by applying Concept-Conditioned Seeding [Source: 5: 19th March]. Before the model evaluates the image, you take the explicit text of the visual cue you are looking for (the representamen from the NodeRAG graph, such as "canvas" or "paintbrush") and project its CLIP/SentenceTransformer embedding directly into the starting weights of the latent tokens [Source: 1: 19th March] [Source: 5: 19th March].

Because the PLaT architecture decoupled the thought from a fixed vocabulary word, the token can seamlessly absorb this continuous semantic vector [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization] [Source: 4: The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models]. This physically initializes the token to exist within the semantic neighborhood of the target concept, making it "pre-aware" of its target before the first layer of the network even receives the image pixels [Source: 1: 19th March].
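A minimal sketch of Concept-Conditioned Seeding under these assumptions; the helper name, and the idea that seeding is done by overwriting embedding rows through a small projection, are illustrative rather than the notes' exact implementation:

```python
import torch

def seed_lvr_embeddings(embed_layer, lvr_token_ids, concept_vecs, proj):
    """Concept-Conditioned Seeding: overwrite the <lvr> rows of the input
    embedding matrix with projected concept vectors before the forward pass.

    concept_vecs:  (K, d_text) e.g. SentenceTransformer embeddings of the K
                   representamen strings ("canvas", "paintbrush", ...)
    proj:          a linear map from d_text to the model's hidden size
    lvr_token_ids: the K vocabulary ids of the added <lvr> tokens
    """
    with torch.no_grad():
        embed_layer.weight[lvr_token_ids] = proj(concept_vecs)
```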

3. Directed Compression in the Bottleneck

The claim states that the tokens know what they are "compressing toward." In latent reasoning architectures (like LaViT and PLaT), the latent tokens act as an information bottleneck—they are forced to capture and compress the high-dimensional visual semantics of the image into a tiny number of tokens (e.g., K=5) [Source: 6: 19th March] [Source: 7: 19th March].

If these tokens were initialized randomly, they would act as a generic visual average, aimlessly searching the image to figure out what is important [Source: 1: 19th March]. However, because the PLaT-style decoupling allowed us to seed them with specific concept priors:

  • The tokens do not start from scratch; they act as highly targeted searchlights.
  • The concept embedding acts as a strong cognitive prior that aggressively steers the compression bottleneck [Source: 1: 19th March].
  • When combined with Curriculum Sensory Gating (CSG)—which blocks the language head from directly seeing the image—the model is forced to route all its processing through these specific latent states (a mask sketch follows this list) [Source: 8: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning].
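A minimal sketch of the CSG mask mentioned in the last bullet; the bias magnitude and the cosine ramp-in schedule are assumptions:

```python
import torch

def csg_bias(seq_len, text_idx, image_idx, strength=-1e4):
    """Additive attention bias for Curriculum Sensory Gating: text queries
    are blocked from attending to image keys, so visual information can
    only reach the language head through the <lvr> slots. In a curriculum,
    `strength` would be ramped in gradually (e.g., on a cosine schedule).
    """
    bias = torch.zeros(seq_len, seq_len)
    rows = torch.tensor(text_idx).unsqueeze(1)   # text-token query positions
    cols = torch.tensor(image_idx)               # image-token key positions
    bias[rows, cols] = strength                  # add to logits before softmax
    return bias
```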

Summary: The PLaT research paper justifies the claim by proving that reasoning can exist as continuous latent states rather than discrete text [Source: 3: Paper page - Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization]. Because the states are continuous, you can mathematically seed them with the dense vector embedding of a NodeRAG concept (like "paintbrush") [Source: 1: 19th March] [Source: 5: 19th March]. Consequently, when the model is forced to compress the massive visual image down into a few tokens, those tokens are already "pre-aware" of the semantic direction they need to pull the visual data toward, bypassing blind visual scanning entirely [Source: 1: 19th March].
