Tutorial: Audio in DSPy Programs¶
This tutorial walks you through building audio-based pipelines with DSPy.
Install Dependencies¶
Make sure you are using the latest DSPy version:
pip install -U dspy
To handle audio data, install the following dependencies:
pip install soundfile torch==2.0.1+cu118 torchaudio==2.0.2+cu118
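Optionally, you can confirm that the audio stack imports correctly before continuing (a quick, purely illustrative check):
# Optional sanity check that the audio dependencies are importable
import soundfile, torch, torchaudio
print(soundfile.__version__, torch.__version__, torchaudio.__version__)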
Load the Spoken-SQuAD Dataset¶
We'll use the Spoken-SQuAD dataset (official version; a HuggingFace version is used for this tutorial's demonstration), which contains spoken audio passages for question answering.
import random
import dspy
from dspy.datasets import DataLoader
kwargs = dict(fields=("context", "instruction", "answer"), input_keys=("context", "instruction"))
spoken_squad = DataLoader().from_huggingface(dataset_name="AudioLLMs/spoken_squad_test", split="train", trust_remote_code=True, **kwargs)
random.Random(42).shuffle(spoken_squad)
spoken_squad = spoken_squad[:100]
split_idx = len(spoken_squad) // 2
trainset_raw, testset_raw = spoken_squad[:split_idx], spoken_squad[split_idx:]
Preprocess the Audio Data¶
The audio clips in the dataset need some preprocessing to be converted into byte arrays with their corresponding sampling rates.
def preprocess(x):
    audio = dspy.Audio.from_array(x.context["array"], x.context["sampling_rate"])
    return dspy.Example(
        passage_audio=audio,
        question=x.instruction,
        answer=x.answer
    ).with_inputs("passage_audio", "question")
trainset = [preprocess(x) for x in trainset_raw]
testset = [preprocess(x) for x in testset_raw]
len(trainset), len(testset)
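To sanity-check the preprocessing, you can peek at one example (purely illustrative; the exact values depend on the shuffled split):
# Inspect one preprocessed example (illustrative)
sample = trainset[0]
print(sample.question)
print(sample.answer)
print(type(sample.passage_audio))  # dspy.Audio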
DSPy Program for Spoken Question Answering¶
Let's define a simple DSPy program that answers questions directly from audio input. This is very similar to a basic question answering (BasicQA) task, with the only difference being that the passage context is provided as an audio file for the model to listen to before answering the question.
class SpokenQASignature(dspy.Signature):
    """Answer the question based on the audio clip."""
    passage_audio: dspy.Audio = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc='factoid answer between 1 and 5 words')

spoken_qa = dspy.ChainOfThought(SpokenQASignature)
Now let's configure an LLM that can process audio input.
dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini-audio-preview-2024-12-17'))
Note: Using dspy.Audio in the signature allows audio to be passed directly to the model.
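Before running a full evaluation, it can help to sanity-check the program on a single held-out example. This is a minimal sketch using only objects already defined above; the printed output will vary:
# Quick single-example check before the full evaluation (illustrative)
sample = testset[0]
prediction = spoken_qa(passage_audio=sample.passage_audio, question=sample.question)
print(prediction.answer, "| reference:", sample.answer)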
Define Evaluation Metric¶
We'll use the exact match metric (dspy.evaluate.answer_exact_match) to measure how accurately the predicted answers match the provided reference answers.
evaluate_program = dspy.Evaluate(devset=testset, metric=dspy.evaluate.answer_exact_match, display_progress=True, num_threads=10, display_table=True)
evaluate_program(spoken_qa)
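Any function with the signature (example, pred, trace=None) can serve as a DSPy metric, so you can swap in your own scoring if exact match proves too strict. The sketch below is a hypothetical lenient variant, not part of the original tutorial:
# Hypothetical lenient metric: case-insensitive, whitespace-normalized containment
def lenient_answer_match(example, pred, trace=None):
    gold = " ".join(example.answer.lower().split())
    predicted = " ".join(pred.answer.lower().split())
    return gold == predicted or gold in predicted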
Optimizing with DSPy¶
You can optimize this audio-based program with any DSPy optimizer, just as you would any other DSPy program.
Note: Audio tokens can be costly, so it is recommended to configure optimizers like dspy.BootstrapFewShotWithRandomSearch or dspy.MIPROv2 conservatively, with 0-2 few-shot examples and fewer candidates/trials than the optimizers' default parameters.
optimizer = dspy.BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, max_bootstrapped_demos=2, max_labeled_demos=2, num_candidate_programs=5)
optimized_program = optimizer.compile(spoken_qa, trainset=trainset)
evaluate_program(optimized_program)
prompt_lm = dspy.LM(model='gpt-4o-mini') #NOTE - this is the LLM guiding the MIPROv2 instruction candidate proposal
optimizer = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", prompt_model=prompt_lm)
#NOTE - MIPROv2's dataset summarizer cannot process the audio files in the dataset, so we turn off the data_aware_proposer
optimized_program = optimizer.compile(spoken_qa, trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2, data_aware_proposer=False)
evaluate_program(optimized_program)
On this small dataset, MIPROv2 improved performance by roughly 10% over the baseline.
Now that we've seen how to use audio-input-capable LLMs in DSPy, let's flip things around.
In the next task, we'll use a standard text LLM to generate prompts for a text-to-speech model, and then evaluate the quality of the generated speech on a downstream task. Compared with having an LLM like gpt-4o-mini-audio-preview-2024-12-17 generate audio directly, this approach is generally more cost-effective, while still letting us build a pipeline that can be optimized to produce higher-quality speech output.
Load the CREMA-D Dataset¶
We'll use the CREMA-D dataset (official version; a HuggingFace version is used for this tutorial's demonstration), which contains audio clips of selected actors speaking the same sentence in one of six target emotions: neutral, happy, sad, anger, fear, and disgust.
from collections import defaultdict
label_map = ['neutral', 'happy', 'sad', 'anger', 'fear', 'disgust']
kwargs = dict(fields=("sentence", "label", "audio"), input_keys=("sentence", "label"))
crema_d = DataLoader().from_huggingface(dataset_name="myleslinder/crema-d", split="train", trust_remote_code=True, **kwargs)
def preprocess(x):
    return dspy.Example(
        raw_line=x.sentence,
        target_style=label_map[x.label],
        reference_audio=dspy.Audio.from_array(x.audio["array"], x.audio["sampling_rate"])
    ).with_inputs("raw_line", "target_style")
random.Random(42).shuffle(crema_d)
crema_d = crema_d[:100]
random.seed(42)
label_to_indices = defaultdict(list)
for idx, x in enumerate(crema_d):
    label_to_indices[x.label].append(idx)

per_label = 100 // len(label_map)
train_indices, test_indices = [], []
for indices in label_to_indices.values():
    selected = random.sample(indices, min(per_label, len(indices)))
    split = len(selected) // 2
    train_indices.extend(selected[:split])
    test_indices.extend(selected[split:])

trainset = [preprocess(crema_d[idx]) for idx in train_indices]
testset = [preprocess(crema_d[idx]) for idx in test_indices]
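To confirm the stratified split looks balanced, you can count how many examples of each emotion landed in each split (a small illustrative check, not part of the original pipeline):
from collections import Counter

# Illustrative check of the per-emotion balance in each split
print(Counter(x.target_style for x in trainset))
print(Counter(x.target_style for x in testset))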
DSPy Pipeline for Generating TTS Instructions for Speech in a Specified Emotion¶
Now we'll build a pipeline that produces emotionally expressive speech by prompting a TTS model with both the text line and an instruction for how to say it. The goal of this task is to use DSPy to generate prompts that guide the TTS output to match the emotion and style of the reference audio in the dataset.
First, let's set up the TTS generator to produce speech in a specified emotion or style. We use gpt-4o-mini-tts since it supports prompting the model with both the raw input and a voice prompt, and it returns a .wav audio response file that we process into dspy.Audio. We also set up a cache for the TTS outputs.
import os
import base64
import hashlib
from openai import OpenAI
CACHE_DIR = ".audio_cache"
os.makedirs(CACHE_DIR, exist_ok=True)
def hash_key(raw_line: str, prompt: str) -> str:
    return hashlib.sha256(f"{raw_line}|||{prompt}".encode("utf-8")).hexdigest()
def generate_dspy_audio(raw_line: str, prompt: str) -> dspy.Audio:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    key = hash_key(raw_line, prompt)
    wav_path = os.path.join(CACHE_DIR, f"{key}.wav")
    if not os.path.exists(wav_path):
        # Only call the TTS API if this (line, instruction) pair hasn't been generated before
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="coral",  # NOTE - this can be configured to any of the 11 offered OpenAI TTS voices - https://platform.openai.com/docs/guides/text-to-speech#voice-options.
            input=raw_line,
            instructions=prompt,
            response_format="wav"
        )
        with open(wav_path, "wb") as f:
            f.write(response.content)
    with open(wav_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return dspy.Audio(data=encoded, format="wav")
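As a quick smoke test (assuming OPENAI_API_KEY is set in your environment; the line and instruction here are just for illustration), you can call the generator directly. A second call with the same arguments should be served from the cache without hitting the API:
# Illustrative direct call; repeated calls with the same inputs come from the cache
clip = generate_dspy_audio("It's eleven o'clock", "Say this line in a flat, neutral tone.")
print(type(clip))  # dspy.Audio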
Now let's define the DSPy program that generates the TTS instructions. For this program we can once again use a standard text LLM, since we are only generating instructions.
class EmotionStylePromptSignature(dspy.Signature):
    """Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style."""
    raw_line: str = dspy.InputField()
    target_style: str = dspy.InputField()
    openai_instruction: str = dspy.OutputField()

class EmotionStylePrompter(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prompter = dspy.ChainOfThought(EmotionStylePromptSignature)

    def forward(self, raw_line, target_style):
        out = self.prompter(raw_line=raw_line, target_style=target_style)
        audio = generate_dspy_audio(raw_line, out.openai_instruction)
        return dspy.Prediction(audio=audio)
dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini'))
Define Evaluation Metric¶
Comparing audio against a reference is generally not a simple task, since judging speech involves subjective differences, especially around emotional expression. For this tutorial we use an embedding-based similarity metric as an objective measure: we use Wav2Vec 2.0 to convert audio into embeddings and compute the cosine similarity between the reference audio and the generated audio. For a more accurate assessment of audio quality, human feedback or perceptual metrics would be more appropriate.
import torch
import torchaudio
import soundfile as sf
import io
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
def decode_dspy_audio(dspy_audio):
    # Decode the base64-encoded wav bytes back into a float32 waveform tensor
    audio_bytes = base64.b64decode(dspy_audio.data)
    array, _ = sf.read(io.BytesIO(audio_bytes), dtype="float32")
    return torch.tensor(array).unsqueeze(0)

def extract_embedding(audio_tensor):
    with torch.inference_mode():
        return model(audio_tensor)[0].mean(dim=1)

def cosine_similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b).item()

def audio_similarity_metric(example, pred, trace=None):
    ref_audio = decode_dspy_audio(example.reference_audio)
    gen_audio = decode_dspy_audio(pred.audio)
    ref_embed = extract_embedding(ref_audio)
    gen_embed = extract_embedding(gen_audio)
    score = cosine_similarity(ref_embed, gen_embed)
    if trace is not None:
        # During bootstrapping, DSPy expects a boolean pass/fail rather than a float score
        return score > 0.8
    return score
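As a sanity check on the metric itself, comparing a reference clip against itself should yield a cosine similarity very close to 1.0:
# Self-similarity sanity check: a clip compared with itself should score ~1.0
ref = decode_dspy_audio(testset[0].reference_audio)
emb = extract_embedding(ref)
print(cosine_similarity(emb, emb))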
evaluate_program = dspy.Evaluate(devset=testset, metric=audio_similarity_metric, display_progress=True, num_threads=10, display_table=True)
evaluate_program(EmotionStylePrompter())
Let's look at an example to see what instruction the DSPy program generated and how it scored.
program = EmotionStylePrompter()
pred = program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)
print(audio_similarity_metric(testset[1], pred)) #0.5725605487823486
dspy.inspect_history(n=1)
[2025-05-15T22:01:22.667596]

System message:

Your input fields are:
1. `raw_line` (str)
2. `target_style` (str)
Your output fields are:
1. `reasoning` (str)
2. `openai_instruction` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## raw_line ## ]]
{raw_line}

[[ ## target_style ## ]]
{target_style}

[[ ## reasoning ## ]]
{reasoning}

[[ ## openai_instruction ## ]]
{openai_instruction}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style.

User message:

[[ ## raw_line ## ]]
It's eleven o'clock

[[ ## target_style ## ]]
disgust

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
To generate the OpenAI TTS instruction, we need to specify the target emotion or style, which in this case is 'disgust'. We will use the OpenAI TTS instruction format, which includes the text to be spoken and the desired emotion or style.

[[ ## openai_instruction ## ]]
"Speak the following line with a tone of disgust: It's eleven o'clock"

[[ ## completed ## ]]
TTS instruction
Speak the following line with a tone of disgust: It's eleven o'clock
from IPython.display import Audio
audio_bytes = base64.b64decode(pred.audio.data)
array, rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
Audio(array, rate=rate)
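If you are not running in a notebook, you can instead write the decoded waveform to disk and listen with any audio player (a small sketch reusing the array and rate decoded above; the file name is arbitrary):
# Alternative outside a notebook: save the decoded audio to a .wav file
sf.write("generated_sample.wav", array, rate)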
The instruction specifies the target emotion but isn't very informative beyond that. We can also see that this sample's audio score is not high. Let's see whether we can do better by optimizing this pipeline.
Optimizing with DSPy¶
We can leverage dspy.MIPROv2 to optimize for the downstream task objective and generate higher-quality TTS instructions, leading to more accurate and expressive audio generation.
prompt_lm = dspy.LM(model='gpt-4o-mini')
teleprompter = dspy.MIPROv2(metric=audio_similarity_metric, auto="light", prompt_model=prompt_lm)
optimized_program = teleprompter.compile(EmotionStylePrompter(), trainset=trainset)
evaluate_program(optimized_program)
Let's see how the optimized program performs.
pred = optimized_program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)
print(audio_similarity_metric(testset[1], pred)) #0.6691027879714966
dspy.inspect_history(n=1)
[2025-05-15T22:09:40.088592]

System message:

Your input fields are:
1. `raw_line` (str)
2. `target_style` (str)
Your output fields are:
1. `reasoning` (str)
2. `openai_instruction` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## raw_line ## ]]
{raw_line}

[[ ## target_style ## ]]
{target_style}

[[ ## reasoning ## ]]
{reasoning}

[[ ## openai_instruction ## ]]
{openai_instruction}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. "irate customer", "angry boss", etc.]. The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.

User message:

[[ ## raw_line ## ]]
It's eleven o'clock

[[ ## target_style ## ]]
disgust

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Response:

[[ ## reasoning ## ]]
To convey disgust, the speaker's voice should be characterized by a high-pitched tone, a slightly nasal quality, and a sense of revulsion. The speaker's words should be delivered with a sense of distaste and aversion, as if the speaker is trying to convey their strong negative emotions.

[[ ## openai_instruction ## ]]
Generate a text-to-speech synthesis of the input text "It's eleven o'clock" with the following characteristics:
- Tone: Disgusted
- Pitch: High-pitched, slightly nasal
- Emphasis: Emphasize the words to convey a sense of distaste and aversion
- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions
- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.

[[ ## completed ## ]]
MIPROv2-optimized program instruction
Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. "irate customer", "angry boss", etc.]. The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.
TTS instruction
Generate a text-to-speech synthesis of the input text "It's eleven o'clock" with the following characteristics:
- Tone: Disgusted
- Pitch: High-pitched, slightly nasal
- Emphasis: Emphasize the words to convey a sense of distaste and aversion
- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions
- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.
from IPython.display import Audio
audio_bytes = base64.b64decode(pred.audio.data)
array, rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
Audio(array, rate=rate)
MIPROv2's instruction tuning added more detail to the overall task objective, providing more criteria for how the TTS instruction should be defined. In turn, the generated instruction is more specific about various aspects of the speech delivery and yields a higher similarity score.
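If you want to reuse the optimized prompter later without re-running optimization, DSPy modules can be saved and reloaded. A minimal sketch (the file name is arbitrary):
# Persist the optimized program and reload it into a fresh module instance
optimized_program.save("optimized_emotion_prompter.json")
reloaded = EmotionStylePrompter()
reloaded.load("optimized_emotion_prompter.json")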