Inference

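Inference is the process of using a trained language model to generate predictions or responses. Although inference may look simple, deploying models efficiently at scale requires careful attention to performance, cost, and reliability. Large language models (LLMs) pose particular challenges because of their size and computational requirements.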

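LLM inference can be approached in two main ways:

Simple pipeline-based inference for development and testing.
Optimized serving solutions for production deployment.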

We cover both approaches, starting with the simple pipeline approach and then moving on to a production-ready solution.

Basic Inference with Transformers Pipeline

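The pipeline abstraction in 🤗 Transformers provides a simple way to run inference with any model from the Hugging Face Hub. It handles all preprocessing and postprocessing steps, so you can use a model without detailed knowledge of its architecture or input requirements.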

A pipeline runs three key stages:

Preprocessing
Model Inference
Postprocessing
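
The three stages above can be sketched by hand to show roughly what happens for a text-generation model. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: `generate_in_three_stages` and `main` are hypothetical names, and the model name is the one used in this section's basic usage example.

```python
def generate_in_three_stages(tokenizer, model, prompt, max_new_tokens=50):
    """Hypothetical helper: run the three stages by hand."""
    # 1. Preprocessing: turn the prompt string into token IDs (PyTorch tensors).
    inputs = tokenizer(prompt, return_tensors="pt")

    # 2. Model inference: generate new token IDs autoregressively.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # 3. Postprocessing: decode the token IDs back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def main():
    # Imported here so the helper above can be read and tested on its own.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    print(generate_in_three_stages(tokenizer, model, "Write a short poem about coding:"))
```

Calling `main()` downloads the model and prints the prompt followed by its continuation. The pipeline adds further conveniences, such as batching and device placement, that this sketch leaves out.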

Basic Usage

from transformers import pipeline

# Create a pipeline with a specific model
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
response = generator(
    "Write a short poem about coding:",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)
print(response[0]['generated_text'])

Detailed Parameters

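response = generator(
    "Translate this to French:",
    max_new_tokens=100,     # Maximum number of new tokens to generate
    do_sample=True,         # Decode by sampling instead of greedy search
    temperature=0.7,        # Controls randomness; higher values are more random
    top_k=50,               # When sampling, consider only the k most likely tokens
    top_p=0.95,             # When sampling, keep the top tokens whose cumulative probability reaches this threshold
    num_return_sequences=1  # Number of outputs to return per input
)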
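
To make the sampling parameters concrete, here is a toy sampler in plain Python that applies temperature, top-k, and top-p to a list of raw scores. It is an illustrative sketch only: `sample_next_token` is a hypothetical function, it assumes `temperature > 0`, and real libraries differ in details such as tie handling and filter order.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Toy sampler: pick one token index from a list of raw scores (logits)."""
    # Temperature: divide the logits before softmax; higher values flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top_k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose renormalized cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Sample among the surviving tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `top_k=1` the sampler degenerates to greedy decoding, since only the single most likely token survives the filter.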

Text Generation Inference (TGI)

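Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving large language models. It aims to provide high-performance text generation for popular open-source LLMs. TGI is used in production by Hugging Chat, an open-source interface for open-access models.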

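In practice, using TGI means calling its API. The example below points the OpenAI Python client at a TGI server: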

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)
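
The loop above prints each raw chunk object. If only the generated text is needed, the text deltas can be joined instead. The helper below is a sketch: `collect_stream_text` is a hypothetical name, and it assumes each chunk exposes `choices[0].delta.content` as in the OpenAI client's streamed chat completions.

```python
def collect_stream_text(stream):
    """Hypothetical helper: join the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        # Some chunks may carry no choices or no text (for example a role-only chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)
```

A stream can only be consumed once, so this helper should receive a fresh result of `client.chat.completions.create(..., stream=True)` rather than a stream that has already been iterated.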