VLM Usage

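Vision language models (VLMs) bridge the gap between images and text, enabling advanced tasks such as generating image captions, answering questions about visual content, and understanding the relationship between textual and visual data. Their architectures are designed to process both modalities seamlessly.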

Chat Format

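Many VLMs are structured to interact in a chatbot-like way, which makes them easier to use. The format consists of:

- A system message that sets the model's role or context, for example "You are an assistant that analyzes visual data."
- A user query that combines text input with the relevant images.
- An assistant response that provides the text output of the multimodal analysis.

This conversational structure is intuitive and matches user expectations, especially for interactive applications such as customer service or educational tools. An example of a formatted input: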

[
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a Vision Language Model specialized in interpreting visual data from chart images..."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<image_data>"},
            {"type": "text", "text": "What is the highest value in the bar chart?"}
        ]
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "42"}]
    }
]

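VLMs can also handle multiple images, or even video, by adapting the input structure to accommodate sequential or parallel visual inputs. For video, frames can be extracted and processed as individual images while preserving their temporal order.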
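As a minimal sketch of how the input structure grows with the number of visual inputs, the helper below builds a user turn with one image placeholder per image, followed by the question. `build_user_message` is a hypothetical name used only for illustration; it is not part of any library and relies solely on the message layout shown above.

```python
def build_user_message(question, num_images):
    """Build one user turn with `num_images` image placeholders (hypothetical helper)."""
    # One placeholder per image, in the order the images will be passed to the processor
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


# Three video frames, kept in temporal order, followed by the question
messages = [build_user_message("What happens in these frames?", num_images=3)]
print([item["type"] for item in messages[0]["content"]])
```

The placeholders carry no image data; the images themselves are supplied separately, in the same order.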

Practice

Loading the model

import torch
import PIL.Image  # import the submodule explicitly; PIL.Image is used later
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from transformers.image_utils import load_image

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

model_name = "HuggingFaceTB/SmolVLM-Instruct"
# Raw string so the backslashes in the Windows path are not treated as escape sequences
cache_dir = r"D:\study\smol-course\data\SmolVLM-Instruct"
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    cache_dir=cache_dir,
).to(device)

processor = AutoProcessor.from_pretrained(model_name, cache_dir=cache_dir)

print(processor.image_processor.size)

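AutoProcessor.from_pretrained loads the image preprocessor and text tokenizer that the model needs and wraps them in a single processor object, which can then be used to turn an input image plus a question into the format the model expects.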

Processing a single image

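# Load one image
# image_url1 is assumed to hold the URL or local path of an image; it is not defined in the original
image1 = load_image(image_url1)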

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image1], return_tensors="pt")

inputs = inputs.to(device)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts)

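Notes:

The image data is not placed in messages; it is passed in through processor(..., images=[image1]). messages only tells the model "an image belongs here", and the processor inserts the actual image tensors (visual tokens) and aligns them automatically.
processor.apply_chat_template essentially converts messages into a prompt string along the lines of: <|user|> <image> Can you describe the image? <|assistant|> (the exact special tokens depend on the model's chat template)
- <image> is a special placeholder token, not image data; it tells the model "the visual tokens of the image will be inserted here".
- Next, in processor(text=prompt, images=[image1], return_tensors="pt"):
  - images=[image1] is where the image data is actually passed in
  - the processor automatically takes image1 and:
    - resizes and normalizes it
    - converts it to a tensor
    - aligns it with the position of the <image> token in the prompt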
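Because placeholders and images are matched by position, a mismatch between the two counts is an easy mistake to make. The sketch below checks this before calling the processor. `count_image_placeholders` and `check_images_match` are hypothetical helpers, not library functions, and the string `"<image1>"` merely stands in for a real image object.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all turns."""
    return sum(
        1
        for message in messages
        for item in message["content"]
        if item.get("type") == "image"
    )


def check_images_match(messages, images):
    """Raise early if placeholders and images do not line up one-to-one."""
    expected = count_image_placeholders(messages)
    if expected != len(images):
        raise ValueError(
            f"{expected} image placeholder(s) in messages but {len(images)} image(s) passed"
        )


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"},
        ],
    },
]
check_images_match(messages, images=["<image1>"])  # one placeholder, one image: no error
```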

Processing multiple images

Change messages to:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What event do they both represent?"}
        ]
    },
]

This adds a second image placeholder. Correspondingly, two image objects must be passed when calling the processor:

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(device)

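Processing video

Sample frames from the video and pass them to the model as multiple images. First define the frame-extraction and resizing functions: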

from IPython.display import Video
import cv2
import numpy as np
import PIL.Image  # extract_frames and resize_and_crop use PIL.Image directly

def extract_frames(video_path, max_frames=50, target_size=None):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError(f"Could not open video: {video_path}")
    
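    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample at most max_frames evenly spaced frames; capping at total_frames
    # avoids duplicate indices on short videos
    num_frames = min(max_frames, total_frames)
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)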

    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if target_size:
                frames.append(resize_and_crop(frame, target_size))
            else:
                frames.append(frame)
    cap.release()
    return frames

def resize_and_crop(image, target_size):
    width, height = image.size
    scale = target_size / min(width, height)
    image = image.resize((int(width * scale), int(height * scale)), PIL.Image.Resampling.LANCZOS)
    left = (image.width - target_size) // 2
    top = (image.height - target_size) // 2
    return image.crop((left, top, left + target_size, top + target_size))

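extract_frames opens the video with OpenCV, reads up to max_frames evenly spaced frames, and converts each one from a BGR NumPy array to an RGB PIL image.
resize_and_crop scales a frame so that its shorter side equals target_size, then center-crops it to a target_size × target_size square.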
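The arithmetic behind both functions can be checked without a video file. The sketch below reproduces the index selection and the crop geometry in plain Python. `sample_frame_indices` and `center_crop_box` are hypothetical names, and the index formula uses integer division as a stand-in for np.linspace with dtype=int, so it is an approximation of the code above rather than a copy of it.

```python
def sample_frame_indices(total_frames, max_frames):
    """Evenly spaced frame indices from 0 to total_frames - 1, without duplicates."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    count = min(total_frames, max_frames)  # never ask for more frames than exist
    if count == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids float rounding
    return [(i * (total_frames - 1)) // (count - 1) for i in range(count)]


def center_crop_box(width, height, target_size):
    """Size after scaling the short side to target_size, plus the centered square crop box."""
    scale = target_size / min(width, height)
    new_width, new_height = int(width * scale), int(height * scale)
    left = (new_width - target_size) // 2
    top = (new_height - target_size) // 2
    return (new_width, new_height), (left, top, left + target_size, top + target_size)


print(sample_frame_indices(total_frames=100, max_frames=5))
print(sample_frame_indices(total_frames=3, max_frames=50))
print(center_crop_box(1280, 720, target_size=360))
```

For a 1280 × 720 frame and target_size=360, the frame is first scaled to 640 × 360 and then the middle 360 columns are kept.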

Next, define a generate_response function that produces the answer.

def generate_response(model, processor, frames, question):

    image_tokens = [{"type": "image"} for _ in frames]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Following are the frames of a video in temporal order."
                },
                *image_tokens,
                {
                    "type": "text",
                    "text": question
                }
            ]
        }
    ]
    inputs = processor(
        text=processor.apply_chat_template(messages, add_generation_prompt=True),
        images=frames,
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(
        **inputs, max_new_tokens=100, num_beams=5, temperature=0.7, do_sample=True, use_cache=True
    )
    return processor.decode(outputs[0], skip_special_tokens=True)

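Notes:

image_tokens in messages repeats the image placeholder len(frames) times, so the video frames are passed in as multiple images.
Parameters of generate:
- max_new_tokens=100: upper limit on the number of newly generated tokens
- num_beams=5: number of beams kept during beam search
- do_sample=True, temperature=0.7: sample from the token distribution instead of always taking the most likely token; a temperature below 1 sharpens the distribution
- use_cache=True: reuse cached key/value states to speed up decoding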
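To make the effect of temperature=0.7 in the generate call above concrete, here is a small sketch of temperature scaling: the scores are divided by the temperature before the softmax. The logits are made up for illustration and `softmax_with_temperature` is a hypothetical helper, not the library's internal implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.7)
neutral = softmax_with_temperature(logits, 1.0)

# With a temperature below 1 the top candidate receives a larger share
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in neutral])
```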

Processing video with QwenVL

Qwen supports several ways of passing in video:

# Messages containing an image list as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video url and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

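The core processing function is process_vision_info, which:

- downloads or reads local videos or image frames
- samples frames automatically (using the settings given in the message, if any, as in the second case above)
- normalizes the image size (using the settings given in the message, if any, as in the second case above)
- returns three kinds of information:
  - image_inputs: list of images (if any)
  - video_inputs: list of video frames (in tensor format)
  - video_kwargs: parameters related to video processing (such as frame timestamps and dimensions)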

from qwen_vl_utils import process_vision_info

image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
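As a rough illustration of what a setting like "fps": 1.0 asks for, the sketch below keeps about one frame per second of video by striding over the native frames. This is one plausible reading of the parameter, written under that assumption, and not a description of how process_vision_info samples internally. `indices_for_target_fps` and its arguments are hypothetical.

```python
def indices_for_target_fps(total_frames, native_fps, target_fps):
    """Pick frame indices so that roughly `target_fps` frames are kept per second."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Keep every `stride`-th frame; never skip below a stride of 1
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, stride))


# A 3-second clip recorded at 30 frames per second, sampled at 1 frame per second
print(indices_for_target_fps(total_frames=90, native_fps=30.0, target_fps=1.0))
```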

VLM Fine-Tuning

Efficient Fine-Tuning

Quantization

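Quantization reduces the precision of model weights and activations, which significantly lowers memory usage and speeds up computation. For example, switching from float32 to bfloat16 halves the memory needed per parameter (from 4 bytes to 2) while largely preserving performance. For more aggressive compression, 8-bit and 4-bit quantization reduce memory usage further, at the cost of some accuracy. These techniques can be applied to both the model and the optimizer settings, which enables efficient training on hardware with limited resources.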
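A back-of-the-envelope estimate shows how much the weight precision matters. The sketch below multiplies a parameter count by the bytes per parameter for each format. The 2-billion parameter count is a hypothetical round number, and the estimate covers the weights only, ignoring activations, optimizer state, and the small overhead that quantization metadata adds.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 2_000_000_000  # hypothetical model size, for illustration only
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```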

PEFT & LoRA

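LoRA (Low-Rank Adaptation) learns compact rank-decomposition matrices while keeping the original model weights frozen. This drastically reduces the number of trainable parameters and, with it, the resource requirements. Used through PEFT, LoRA makes it possible to fine-tune large models by adjusting only a small subset of trainable parameters. The approach is particularly effective for task-specific adaptation, cutting the trainable parameters from billions to millions while maintaining performance.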
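The saving follows directly from the shapes involved. For a weight matrix of size d_out × d_in, LoRA trains two factors, one of size r × d_in and one of size d_out × r, so the trainable count is r · (d_in + d_out) instead of d_out · d_in. The sketch below works this out for one matrix; the 2048 × 2048 size is a hypothetical example, not a figure taken from any particular model.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_out = d_in = 2048  # hypothetical projection size, for illustration only
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r=8)

print(f"full matrix: {full} parameters")
print(f"LoRA, r=8:   {lora} parameters ({100 * lora / full:.2f}% of the full matrix)")
```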

Batch Size Optimization

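To find a workable batch size for fine-tuning, start with a large value and reduce it whenever an out-of-memory (OOM) error occurs. Compensate by increasing gradient_accumulation_steps, which keeps the effective total batch size the same by accumulating gradients over several steps before each update. In addition, enable gradient_checkpointing: it recomputes intermediate states during the backward pass, trading extra compute time for lower activation memory. Together, these strategies maximize hardware utilization and help overcome memory limits.

In short, a batch size that is too large causes OOM (out-of-memory) errors, and the following strategies help:

- reduce per_device_train_batch_size
- increase gradient_accumulation_steps to keep the effective batch size unchanged
- enable gradient_checkpointing

The settings look like this: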

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_model",  # Directory for model checkpoints
    per_device_train_batch_size=4,   # Batch size per device (GPU/TPU)
    num_train_epochs=3,              # Total training epochs
    learning_rate=5e-5,              # Learning rate
    save_steps=1000,                 # Save checkpoint every 1000 steps
    bf16=True,                       # Use mixed precision for training
    gradient_checkpointing=True,     # Enable to reduce activation memory usage
    gradient_accumulation_steps=16,  # Accumulate gradients over 16 steps
    logging_steps=50                 # Log metrics every 50 steps
)
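The trade between batch size and accumulation steps is simple arithmetic: the effective batch size is the per-device batch size times the accumulation steps times the number of devices. The sketch below uses the values from the TrainingArguments above and assumes a single device; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Settings from the TrainingArguments above, on one device
print(effective_batch_size(4, 16))

# After an OOM error: halve the batch size and double the accumulation steps
print(effective_batch_size(2, 32))
```

Both configurations update the weights with the same number of samples per step; the second one simply needs less memory at any one time and takes more forward and backward passes per update.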

Supervised Fine-Tuning

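Supervised fine-tuning (SFT) adapts a pretrained vision language model (VLM) to a specific task using a labeled dataset of paired inputs, such as images and their corresponding text. It improves the model's ability to perform domain-specific or task-specific functions, such as visual question answering, image captioning, or chart interpretation.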

Preference Optimization

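Preference optimization, in particular Direct Preference Optimization (DPO), trains a vision language model (VLM) to align with human preferences. Rather than strictly following predefined instructions, the model learns to prioritize the outputs that humans subjectively prefer. This approach is especially useful for tasks that involve creative judgment, nuanced reasoning, or several acceptable answers.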
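For a single preference pair, the DPO objective can be written in a few lines. The sketch below assumes the log-probabilities of the chosen and rejected responses have already been computed under both the policy being trained and a frozen reference model. All numbers are made up, `dpo_loss` is a hypothetical helper, and beta=0.1 is merely an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss equals log(2)
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))

# Policy already prefers the chosen response: the loss is lower
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```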

Practice with SFT

Loading the model

# quantization_config is assumed to be a BitsAndBytesConfig created beforehand;
# its definition is not shown here
model_name = "HuggingFaceTB/SmolVLM-Instruct"
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
).to(device)
processor = AutoProcessor.from_pretrained(model_name)

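Set up LoRA. Note that this uses the get_peft_model method. The earlier code attached the PEFT parameters while loading the model, but the approach below is recommended: load the model first, then attach the trainable parameters to it.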

from peft import LoraConfig, get_peft_model

# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

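Define a collate_fn function to preprocess the data. Its purpose is to turn a batch of image-text conversation samples into tensor inputs that the model can use directly.

examples is the list of raw samples in one batch.
For each sample in the batch, build a messages structure.
Apply the chat template to the constructed messages and store the result in text_inputs.

Preprocess the current image. In this example the loaded dataset has the structure below, and the image field is already a PIL object. The processed image is stored in image_inputs.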

print(ds)
DatasetDict({
    train: Dataset({
        features: ['image', 'query', 'label', 'human_or_machine'],
        num_rows: 28299
    })
    val: Dataset({
        features: ['image', 'query', 'label', 'human_or_machine'],
        num_rows: 1920
    })
    test: Dataset({
        features: ['image', 'query', 'label', 'human_or_machine'],
        num_rows: 2500
    })
})

print(type(ds['train'][0]['image']))
<class 'PIL.PngImagePlugin.PngImageFile'>

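Feed the text prompts and image inputs to the processor. The returned batch is a dict-like object containing:
- input_ids (text token ids)
- pixel_values (image tensor)
- attention_mask (mask)
Build the labels field, setting the positions of padding tokens to -100 so that the loss ignores them.
Set the image token positions to -100 as well. Images are represented in the text by a special placeholder token (such as <image>), but the model is not supposed to predict this token, so it is also masked out of the loss with -100.
Finally, return the assembled batch:
- input_ids
- attention_mask
- pixel_values
- labels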

def collate_fn(examples):
    # System message template for the VLM
    system_message = """You are a Vision Language Model specialized in interpreting visual data from chart images.
    Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
    The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
    Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

    # Initialize lists for text and image inputs
    text_inputs = []
    image_inputs = []

    # Process all examples in one loop
    for example in examples:
        # Format the chat structure for the processor
        formatted_example = {
            "messages": [
                {
                    "role": "system",
                    "content": [{"type": "text", "text": system_message}],
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                        },
                        {
                            "type": "text",
                            "text": example["query"],
                        },
                    ],
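                },
                {
                    # Target answer for SFT. Assumption: example["label"] holds the answer,
                    # either as a string or as a list of strings (the first entry is used).
                    "role": "assistant",
                    "content": [
                        {
                            "type": "text",
                            "text": example["label"][0] if isinstance(example["label"], (list, tuple)) else str(example["label"]),
                        },
                    ],
                },
            ]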
        }
        # Apply chat template and strip extra spaces
        text_inputs.append(processor.apply_chat_template(formatted_example["messages"], tokenize=False).strip())
        
        # Ensure images are in RGB mode
        image = example["image"]
        if image.mode != 'RGB':
            image = image.convert('RGB')
        image_inputs.append( [image] )

    # Tokenize the texts and process the images
    batch = processor(
        text=text_inputs,
        images=image_inputs,
        return_tensors="pt",
        padding=True
    )

    # Clone input IDs for labels
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Mask padding tokens in labels

    # Look up the id of the image placeholder token (str() covers the case where it is an AddedToken).
    # Some processors return a list from processor.image_token, one entry per image.
    # TODO: does AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct") expose only one?
    image_token_id = processor.tokenizer.convert_tokens_to_ids(str(processor.image_token))

    # Mask image token IDs in the labels
    labels[labels == image_token_id] = -100

    # Add labels back to the batch
    batch["labels"] = labels


    return batch
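The masking step at the end of collate_fn can be tried on a toy batch without loading a model. The sketch below uses plain Python lists in place of tensors and made-up token ids (0 for padding, 9 for the image placeholder). `mask_labels` is a hypothetical helper that mirrors the two masking assignments above.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids into labels, hiding padding and image placeholder positions."""
    return [
        [IGNORE_INDEX if token in (pad_token_id, image_token_id) else token for token in row]
        for row in input_ids
    ]


# Made-up ids: 0 = padding, 9 = image placeholder, everything else = ordinary text
input_ids = [
    [9, 9, 5, 6, 7],
    [9, 9, 5, 8, 0],  # shorter sample, padded to the batch length
]
labels = mask_labels(input_ids, pad_token_id=0, image_token_id=9)
print(labels)
```

Only the ordinary text tokens keep their ids, so only they contribute to the loss.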

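Finally, configure the training arguments. Note that in SFTConfig:

dataset_text_field="": specifies the dataset field used as text input; it is left empty here (because a custom collate_fn is used)
skip_prepare_dataset=True: tells trl not to preprocess the dataset automatically, since we handle that manually
remove_unused_columns=False: keeps all fields; otherwise only the fields used by the tokenizer are kept, and the images and other information are lost

In SFTTrainer, data_collator=collate_fn sets the custom batching function for the mixed image-text data.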

from trl import SFTConfig, SFTTrainer

# Configure the Trainer
training_args = SFTConfig(
    output_dir="sft_output",  # Directory to save the model
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=1,          # batch size per device during training
    gradient_accumulation_steps=16,         # number of steps to accumulate gradients before an optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=5,                        # log every 5 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    gradient_checkpointing_kwargs = {"use_reentrant": False}, # use non-reentrant checkpointing
    # dataloader_num_workers=16, 
    dataset_text_field="", # need a dummy field for collator
    dataset_kwargs = {"skip_prepare_dataset": True}, # important for collator
    remove_unused_columns = False                    # necessary else features except label will be removed
)
# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=collate_fn,
    peft_config=peft_config,
    tokenizer=processor.tokenizer,  # newer TRL releases name this argument processing_class
)