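Embeddings

Text embeddings are numerical representations of text that make it possible to measure semantic similarity. This guide introduces embeddings, their applications, and how to use embedding models for tasks such as search, recommendations, and anomaly detection.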

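Before implementing embeddings

When selecting an embeddings provider, there are several factors to consider, depending on your needs and preferences:

Dataset size and domain specificity: the size of the model's training dataset and its relevance to the domain you want to embed. Larger or more domain-specific data generally produces better in-domain embeddings.
Inference performance: embedding lookup speed and end-to-end latency. This is a particularly important consideration for large-scale production deployments.
Customization: options for continued training on private data, or specialization of models for very specific domains. This can improve performance on unique vocabularies.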

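How to get embeddings with Anthropic

Anthropic does not offer its own embedding model. Voyage AI is an embeddings provider with a wide range of options and capabilities covering all of the considerations above.

Voyage AI makes state-of-the-art embedding models and offers customized models for specific industry domains such as finance and healthcare, as well as bespoke fine-tuned models for individual customers.

The rest of this guide is specific to Voyage AI, but we encourage you to evaluate a variety of embeddings vendors to find the best fit for your use case.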

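Available models

Voyage recommends the following text embedding models:

Model	Context length (tokens)	Embedding dimension	Description
`voyage-3-large`	32,000	1024 (default), 256, 512, 2048	The best general-purpose and multilingual retrieval quality. See the blog post for details.
`voyage-3.5`	32,000	1024 (default), 256, 512, 2048	Optimized for general-purpose and multilingual retrieval quality. See the blog post for details.
`voyage-3.5-lite`	32,000	1024 (default), 256, 512, 2048	Optimized for latency and cost. See the blog post for details.
`voyage-code-3`	32,000	1024 (default), 256, 512, 2048	Optimized for code retrieval. See the blog post for details.
`voyage-finance-2`	32,000	1024	Optimized for finance retrieval and RAG. See the blog post for details.
`voyage-law-2`	16,000	1024	Optimized for legal and long-context retrieval and RAG. Also improves performance across all domains. See the blog post for details.

Additionally, the following multimodal embedding model is recommended:

Model	Context length (tokens)	Embedding dimension	Description
`voyage-multimodal-3`	32,000	1024	Rich multimodal embedding model that can vectorize interleaved text and content-rich images, such as screenshots of PDFs, slides, tables, figures, and more. See the blog post for details.

Need help deciding which text embedding model to use? Check out the FAQ.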
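One practical consequence of the embedding dimension column is storage cost. The sketch below is a back-of-the-envelope estimate only: it assumes vectors are stored as uncompressed 32-bit floats and ignores any index overhead, and the corpus size is a hypothetical figure chosen for illustration.

```python
# Rough storage estimate for a vector store at different embedding dimensions.
# Assumption: uncompressed float32 values (4 bytes each), no index overhead.

BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dimension: int) -> int:
    """Return the raw size in bytes of `num_vectors` float32 vectors."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


if __name__ == "__main__":
    num_docs = 1_000_000  # hypothetical corpus size
    for dim in (256, 512, 1024, 2048):
        size_gib = index_size_bytes(num_docs, dim) / 2**30
        print(f"{dim:>4} dimensions: {size_gib:.2f} GiB")
```

Raw size scales linearly with the dimension, so halving the dimension halves the footprint; whether the retrieval quality at a smaller dimension is acceptable is something to measure on your own data.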

Getting started with Voyage AI

To access Voyage embeddings:

Sign up on Voyage AI's website
Obtain an API key
Set the API key as an environment variable for convenience:

export VOYAGE_API_KEY="<your secret key>"

You can obtain embeddings either with the official voyageai Python package or with HTTP requests, as described below.
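Before making any calls, it can help to fail early if the key is missing. The following is a minimal sketch using only the standard library; `get_voyage_api_key` is a hypothetical helper name, not part of the voyageai package.

```python
import os


def get_voyage_api_key(env_var: str = "VOYAGE_API_KEY") -> str:
    """Read the API key from the environment, raising a clear error if absent.

    Hypothetical helper for illustration; not part of the voyageai package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before creating a client."
        )
    return key
```

The returned value could then be passed explicitly, for example `voyageai.Client(api_key=get_voyage_api_key())`, which makes a missing key surface at startup rather than on the first request.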

Voyage Python library

The voyageai package can be installed with the following command:

pip install -U voyageai

Then you can create a client object and start using it to embed your texts:

import voyageai

vo = voyageai.Client()
# This will automatically use the environment variable VOYAGE_API_KEY.
# Alternatively, you can use vo = voyageai.Client(api_key="<your secret key>")

texts = ["Sample text 1", "Sample text 2"]

result = vo.embed(texts, model="voyage-3.5", input_type="document")
print(result.embeddings[0])
print(result.embeddings[1])

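result.embeddings will be a list of two embedding vectors, each containing 1024 floating-point numbers. After running the code above, the two embeddings are printed to the screen: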

[-0.013131560757756233, 0.019828535616397858, ...]   # embedding for "Sample text 1"
[-0.0069352793507277966, 0.020878976210951805, ...]  # embedding for "Sample text 2"

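When creating embeddings, you can specify a few other arguments to the embed() function.

For more information on the Voyage Python package, see the Voyage documentation.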
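For corpora larger than a handful of texts, it is common to send inputs in batches rather than in one call. The sketch below shows one way to do that while staying independent of any client: `embed_in_batches`, `batched`, and the default `batch_size` of 64 are all assumptions made for illustration, so check the provider's documented limits before choosing a real value.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 64,  # illustrative default, not a documented limit
) -> List[Vector]:
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors = embed_fn(list(batch))
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings
```

With the client from the example above, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-3.5", input_type="document").embeddings`. Passing the embedding call in as a function keeps the batching logic testable without network access.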

Voyage HTTP API

You can also get embeddings by calling the Voyage HTTP API. For example, you can send an HTTP request with the curl command in a terminal:

curl https://api.voyageai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VOYAGE_API_KEY" \
  -d '{
    "input": ["Sample text 1", "Sample text 2"],
    "model": "voyage-3.5"
  }'

The response is a JSON object containing the embeddings and the token usage:

{
  "object": "list",
  "data": [
    {
      "embedding": [-0.013131560757756233, 0.019828535616397858, ...],
      "index": 0
    },
    {
      "embedding": [-0.0069352793507277966, 0.020878976210951805, ...],
      "index": 1
    }
  ],
  "model": "voyage-3.5",
  "usage": {
    "total_tokens": 10
  }
}

For more information on the Voyage HTTP API, see the Voyage documentation.
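A response shaped like the one above can be unpacked with the standard library. This is a minimal sketch: `embeddings_from_response` is a hypothetical helper, and the sample body uses short made-up vectors rather than real model output. Sorting by `index` is a defensive choice so that the vectors line up with the input texts.

```python
import json
from typing import List


def embeddings_from_response(body: str) -> List[List[float]]:
    """Extract embedding vectors from a JSON response body, ordered by `index`."""
    payload = json.loads(body)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Illustrative body with the same structure as the response shown above.
# The two-element vectors are placeholders, not real embeddings.
sample_body = json.dumps(
    {
        "object": "list",
        "data": [
            {"embedding": [0.3, 0.4], "index": 1},
            {"embedding": [0.1, 0.2], "index": 0},
        ],
        "model": "voyage-3.5",
        "usage": {"total_tokens": 10},
    }
)

vectors = embeddings_from_response(sample_body)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```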

AWS Marketplace

Voyage embeddings are available on AWS Marketplace. Instructions for accessing Voyage on AWS can be found here.

Quickstart example

Now that we know how to get embeddings, let's look at a brief example.

Suppose we have a small corpus of six documents to retrieve from:

documents = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen.",
    "20th-century innovations, from radios to smartphones, centered on electronic advancements.",
    "Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.",
    "Apple's conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.",
    "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature."
]

We first use Voyage to convert each document into an embedding vector:

import voyageai

vo = voyageai.Client()

# Embed the documents
doc_embds = vo.embed(
    documents, model="voyage-3.5", input_type="document"
).embeddings

The embeddings allow us to do semantic search and retrieval in the vector space. Given an example query,

query = "When is Apple's conference call scheduled?"

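we convert it into an embedding and run a nearest-neighbor search to find the most relevant document based on distance in the embedding space.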

import numpy as np

# Embed the query
query_embd = vo.embed(
    [query], model="voyage-3.5", input_type="query"
).embeddings[0]

# Compute the similarity
# Voyage embeddings are normalized to length 1, therefore dot-product
# and cosine similarity are the same.
similarities = np.dot(doc_embds, query_embd)

retrieved_id = np.argmax(similarities)
print(documents[retrieved_id])

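Note that we use input_type="document" and input_type="query" to embed the documents and the query, respectively. More specifications can be found here.

The output is the fifth document, which is indeed the most relevant to the query: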

Apple's conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.
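The example above returns only the single best match. The sketch below extends the same idea to the top k matches and checks the claim in the code comment that dot product and cosine similarity agree for unit-length vectors. It uses plain Python and small made-up three-dimensional vectors so it runs without an API key; the helper names are hypothetical, and with real embeddings the same logic is usually written with NumPy.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale `v` to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest dot-product scores."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors standing in for real embeddings (illustrative values only).
doc_vecs = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query_vec = normalize([0.9, 0.1, 0.0])

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    # For unit-length vectors the dot product equals the cosine similarity.
    assert math.isclose(score, cosine(query_vec, doc_vecs[doc_id]))
    print(doc_id, round(score, 4))
```

A full sort is fine for a few documents; for large corpora a vector database or an approximate nearest-neighbor index is the more typical choice.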

If you are looking for a detailed set of tutorials on how to do RAG with embeddings, including vector databases, check out our RAG tutorials.

FAQ

Pricing

Visit Voyage's pricing page for the most up-to-date pricing details.
