Evaluation

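A comprehensive evaluation strategy examines several aspects of model performance. We evaluate task-specific capabilities, such as question answering and summarization, to understand how the model handles different kinds of problems. We measure output quality through factors such as coherence and factual accuracy. Safety evaluation helps identify potentially harmful outputs or biases. Finally, domain expertise testing verifies the model's specialized knowledge in the target domain.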

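Automatic Benchmarks: Learn to evaluate your model with standardized benchmarks and metrics. We explore common benchmarks such as MMLU and TruthfulQA, cover the key evaluation metrics and settings, and introduce best practices for reproducible evaluation.
Custom Domain Evaluation: Learn how to create an evaluation pipeline tailored to your specific use case. We walk through designing custom evaluation tasks, implementing specialized metrics, and building evaluation datasets that match your requirements.
Domain Evaluation Project: A complete example of building a domain-specific evaluation pipeline. You will learn how to generate an evaluation dataset, annotate data with Argilla, create a standardized dataset, and evaluate models with LightEval.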

Automatic Benchmarks

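Automatic benchmarks usually consist of carefully curated datasets with predefined tasks and evaluation metrics. They are designed to assess different aspects of model capability, from basic language understanding to complex reasoning. Their key advantage is standardization: every model is scored on the same data with the same metric, so comparisons across models are consistent and results are reproducible.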
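To illustrate why standardization makes scores comparable, here is a minimal, library-free sketch. The benchmark contents and the two "models" (`always_first`, `always_second`) are hypothetical stand-ins invented for illustration; they are not part of any real benchmark or of LightEval.

```python
from typing import Callable

# A tiny fixed benchmark: every model sees the same questions and gold answers.
BENCHMARK = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "gold_index": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo"], "gold_index": 0},
    {"question": "H2O is?", "choices": ["salt", "water", "iron"], "gold_index": 1},
]


def accuracy(predict: Callable[[dict], int], benchmark: list[dict]) -> float:
    """Fraction of samples where the predicted choice index equals the gold index."""
    correct = sum(predict(sample) == sample["gold_index"] for sample in benchmark)
    return correct / len(benchmark)


# Two hypothetical stand-in "models" (plain functions, not real LLMs).
def always_first(sample: dict) -> int:
    return 0


def always_second(sample: dict) -> int:
    return 1


# Same data, same metric: the two scores can be compared directly.
scores = {
    "always_first": accuracy(always_first, BENCHMARK),
    "always_second": accuracy(always_second, BENCHMARK),
}
```

Because the dataset and the metric are fixed, rerunning the evaluation gives the same numbers, which is the reproducibility property described above.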

Using LightEval for Benchmarking

LightEval tasks are defined using a specific format:

{suite}|{task}|{num_few_shot}|{auto_reduce}

suite: The benchmark suite (e.g., ‘mmlu’, ‘truthfulqa’)
task: Specific task within the suite (e.g., ‘abstract_algebra’)
num_few_shot: Number of examples to include in prompt (0 for zero-shot)
auto_reduce: Whether to automatically reduce few-shot examples if prompt is too long (0 or 1)
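The four fields can be made concrete with a small parser. This is only a sketch of the string format described above: `TaskSpec`, `parse_task_spec`, and `format_task_spec` are hypothetical helper names introduced for illustration and are not LightEval functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for the four fields of a task string."""
    suite: str
    task: str
    num_few_shot: int
    auto_reduce: bool


def parse_task_spec(spec: str) -> TaskSpec:
    """Split 'suite|task|num_few_shot|auto_reduce' into typed fields."""
    parts = spec.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {spec!r}")
    suite, task, num_few_shot, auto_reduce = parts
    if auto_reduce not in ("0", "1"):
        raise ValueError(f"auto_reduce must be 0 or 1, got {auto_reduce!r}")
    return TaskSpec(suite, task, int(num_few_shot), auto_reduce == "1")


def format_task_spec(spec: TaskSpec) -> str:
    """Inverse of parse_task_spec: build the string form again."""
    return f"{spec.suite}|{spec.task}|{spec.num_few_shot}|{int(spec.auto_reduce)}"


# Five-shot abstract algebra, with few-shot examples reduced if the prompt is too long.
example = parse_task_spec("mmlu|abstract_algebra|5|1")
```

Validating the string up front catches mistakes such as a missing field before a long evaluation run starts.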

Example

Define the evaluation tasks (MMLU subtasks)
Set the pipeline parameters
Create the evaluation tracker
Load the model and build the pipeline

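The Pipeline is LightEval's core component. It is responsible for:

Tying together the tasks, the model, and the evaluation tracker
Automatically loading datasets, running inference, and recording results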

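# Import paths follow recent LightEval releases; adjust them if your installed version differs.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.pipeline import Pipeline
from transformers import AutoModelForCausalLM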

# Define tasks to evaluate
domain_tasks = [
    "mmlu|anatomy|0|0",
    "mmlu|high_school_biology|0|0", 
    "mmlu|high_school_chemistry|0|0",
    "mmlu|professional_medicine|0|0"
]

# Configure pipeline parameters
pipeline_params = {
    "max_samples": 40,  # Evaluate at most 40 samples per task
    "batch_size": 1,    # Inference batch size (1 keeps GPU memory use low)
    "num_workers": 4    # Worker processes for data loading and task processing
}

# Create evaluation tracker
# save_generations=True also stores the answers the model generates during inference
evaluation_tracker = EvaluationTracker(
    output_path="./results",
    save_generations=True
)

# Load model and create pipeline
model = AutoModelForCausalLM.from_pretrained("your-model-name")
pipeline = Pipeline(
    tasks=domain_tasks,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model=model
)

# Run evaluation
pipeline.evaluate()

# Get and display results
results = pipeline.get_results()
pipeline.show_results()
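
To make the pipeline's role concrete, the following library-free sketch mimics the same flow: take tasks, run a model on each sample, record every generation, and return per-task scores. `ToyPipeline`, `ToyTracker`, and `echo_model` are hypothetical names invented for this illustration; they are not LightEval classes and do not reflect LightEval's internals.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyTracker:
    """Hypothetical stand-in for an evaluation tracker: stores per-sample records."""
    records: list[dict] = field(default_factory=list)

    def log(self, task: str, prompt: str, generation: str, correct: bool) -> None:
        self.records.append(
            {"task": task, "prompt": prompt, "generation": generation, "correct": correct}
        )


@dataclass
class ToyPipeline:
    """Hypothetical pipeline: ties tasks, a model, and a tracker together."""
    tasks: dict[str, list[dict]]       # task name -> list of samples
    model: Callable[[str], str]        # prompt -> generated text
    tracker: ToyTracker
    max_samples: int = 40              # cap on samples evaluated per task

    def evaluate(self) -> dict[str, float]:
        results: dict[str, float] = {}
        for name, samples in self.tasks.items():
            subset = samples[: self.max_samples]
            n_correct = 0
            for sample in subset:
                generation = self.model(sample["prompt"])
                correct = generation.strip() == sample["answer"]
                self.tracker.log(name, sample["prompt"], generation, correct)
                n_correct += int(correct)
            results[name] = n_correct / len(subset) if subset else 0.0
        return results


def echo_model(prompt: str) -> str:
    """Hypothetical model stub that only knows one answer."""
    return "4" if "2 + 2" in prompt else "unknown"


toy_tracker = ToyTracker()
toy_pipeline = ToyPipeline(
    tasks={"arithmetic": [
        {"prompt": "2 + 2 =", "answer": "4"},
        {"prompt": "3 + 3 =", "answer": "6"},
    ]},
    model=echo_model,
    tracker=toy_tracker,
)
toy_results = toy_pipeline.evaluate()
```

Keeping the tracker separate from the scoring loop means the raw generations stay available for inspection after the aggregate numbers are computed.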

Custom Domain Evaluation

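Standard benchmarks provide valuable insights, but many applications need specialized evaluation methods tailored to a particular domain or use case. This guide helps you create a custom evaluation pipeline that accurately measures your model's performance in the target domain.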

Implementation with LightEval

LightEval provides a flexible framework for implementing custom evaluations. Here is how to create a custom task:

from lighteval.tasks import Task, Doc
from lighteval.metrics import SampleLevelMetric, MetricCategory, MetricUseCase

class CustomEvalTask(Task):
    def __init__(self):
        super().__init__(
            name="custom_task",
            version="0.0.1",
            metrics=["accuracy", "f1"],  # Your chosen metrics
            description="Description of your custom evaluation task"
        )
    
    def get_prompt(self, sample):
        # Format your input into a prompt
        return f"Question: {sample['question']}\nAnswer:"
    
    def process_response(self, response, ref):
        # Process model output and compare to reference
        return response.strip() == ref.strip()

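Task: The base class for evaluation tasks in LightEval. Every evaluation task must inherit from it.
Doc: Typically represents the structure of a single evaluation sample (imported here but not used).
Prompt construction (get_prompt): The method receives a sample (a dictionary) and extracts fields such as "question" to build the prompt.
Response processing and comparison (process_response):
- response: the answer generated by the model (a string)
- ref: the reference answer
- returns True/False to indicate whether the sample was answered correctly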
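
The two methods can be exercised on their own, without the task class. The sketch below reuses the same prompt template and exact-match comparison as standalone functions; the sample and `fake_model` are hypothetical and exist only to show the data flow.

```python
def get_prompt(sample: dict) -> str:
    """Format a sample into a prompt, using the same template as the task above."""
    return f"Question: {sample['question']}\nAnswer:"


def process_response(response: str, ref: str) -> bool:
    """Exact-match comparison after trimming surrounding whitespace."""
    return response.strip() == ref.strip()


def fake_model(prompt: str) -> str:
    """Hypothetical model stub; a real model would generate text from the prompt."""
    return " Na\n"


# Hypothetical sample with the "question" field the prompt template expects.
sample = {"question": "What is the chemical symbol for sodium?", "answer": "Na"}

prompt = get_prompt(sample)
is_correct = process_response(fake_model(prompt), sample["answer"])
```

Note that exact match is strict: it ignores surrounding whitespace but remains case-sensitive, so "na" would not match "Na". Domains with free-form answers usually need a more tolerant comparison.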

Custom Metrics

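Domain-specific tasks often require specialized metrics. LightEval provides a flexible framework for creating custom metrics that capture the aspects of performance relevant to your domain: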

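from aenum import extend_enum
# Doc, MetricCategory, and MetricUseCase use the same import paths as the custom task
# example above; adjust them if your LightEval version exposes these names elsewhere.
from lighteval.metrics import Metrics, SampleLevelMetric, SampleLevelMetricGrouping, MetricCategory, MetricUseCase
from lighteval.tasks import Doc
import numpy as np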

# Define a sample-level metric function
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    """Example metric that returns multiple scores per sample"""
    response = predictions[0]
    return {
        "accuracy": response == formatted_doc.choices[formatted_doc.gold_index],
        "length_match": len(response) == len(formatted_doc.reference)
    }

# Create a metric that returns multiple values per sample
custom_metric_group = SampleLevelMetricGrouping(
    metric_name=["accuracy", "length_match"],  # Names of sub-metrics
    higher_is_better={  # Whether higher values are better for each metric
        "accuracy": True,
        "length_match": True
    },
    category=MetricCategory.CUSTOM,
    use_case=MetricUseCase.SCORING,
    sample_level_fn=custom_metric,
    corpus_level_fn={  # How to aggregate each metric
        "accuracy": np.mean,
        "length_match": np.mean
    }
)

# Register the metric with LightEval
extend_enum(Metrics, "custom_metric_name", custom_metric_group)

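The custom evaluation function custom_metric returns several metrics for each sample:
- predictions: the list of model outputs (usually a single prediction)
- formatted_doc: a Doc object that holds the reference information
- the return value is a dictionary containing several metrics (here accuracy and length_match)
The function is then wrapped in a SampleLevelMetricGrouping:
- metric_name: the names of all sub-metrics in the group
- higher_is_better: whether a higher value is better, per metric
- sample_level_fn: the sample-level evaluation function
- corpus_level_fn: how each metric is aggregated over the corpus (np.mean here)
Finally, the custom metric is registered in LightEval's internal enum:
- this adds a new member "custom_metric_name" to Metrics, which can then be bound to a task.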
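
The split between a sample-level function and corpus-level aggregation can be sketched without LightEval. In this illustration `ToyDoc` is a hypothetical stand-in for Doc that carries only `choices` and `gold_index`, `aggregate` is a hypothetical helper, and the length check compares against the gold choice; none of these names come from the library.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: answer choices plus the index of the gold one."""
    choices: list[str]
    gold_index: int


def sample_metric(predictions: list[str], doc: ToyDoc) -> dict[str, bool]:
    """Sample level: score one prediction on several sub-metrics at once."""
    response = predictions[0]
    gold = doc.choices[doc.gold_index]
    return {
        "accuracy": response == gold,
        "length_match": len(response) == len(gold),
    }


def aggregate(
    per_sample: list[dict[str, bool]],
    corpus_fns: dict[str, Callable[[list[float]], float]],
) -> dict[str, float]:
    """Corpus level: apply each sub-metric's aggregation function across samples."""
    return {
        name: fn([float(scores[name]) for scores in per_sample])
        for name, fn in corpus_fns.items()
    }


docs = [
    ToyDoc(["Paris", "Rome"], 0),
    ToyDoc(["cat", "dog"], 1),
    ToyDoc(["red", "blue"], 1),
]
predictions = [["Paris"], ["cow"], ["green"]]

per_sample = [sample_metric(pred, doc) for pred, doc in zip(predictions, docs)]
corpus_scores = aggregate(per_sample, {"accuracy": mean, "length_match": mean})
```

Each sub-metric gets its own aggregation function, so a grouping can mix, for example, a mean for accuracy with a maximum or median for another score.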

When there is only a single metric:

def simple_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    """Example metric that returns a single score per sample"""
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]

simple_metric_obj = SampleLevelMetric(
    metric_name="simple_accuracy",
    higher_is_better=True,
    category=MetricCategory.CUSTOM,
    use_case=MetricUseCase.SCORING,
    sample_level_fn=simple_metric,
    corpus_level_fn=np.mean  # How to aggregate across samples
)

extend_enum(Metrics, "simple_metric", simple_metric_obj)

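You can then use the custom metrics in an evaluation task by referencing them in the task configuration. The metrics are computed automatically over all samples and aggregated with the functions you specified.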