Content moderation

Content moderation is a critical aspect of maintaining a safe, respectful, and productive environment in digital applications. This guide covers how to use Claude to moderate content within your digital application.

Visit the content moderation cookbook to see an example of a content moderation implementation using Claude.

This guide focuses on moderating user-generated content within your application. If you're looking for guidance on moderating interactions with Claude, see the guardrails guide.

Before building with Claude

Decide whether to use Claude for content moderation

Here are some key indicators that you should use an LLM like Claude instead of a traditional ML or rules-based approach for content moderation:

Anthropic has trained all Claude models to be honest, helpful, and harmless. This may result in Claude moderating content deemed particularly dangerous (in line with the Acceptable Use Policy), regardless of the prompt used. For example, an adult website that wants to allow users to post explicit sexual content may find that Claude still flags explicit content as requiring moderation, even if they specify in their prompt not to moderate explicit sexual content. We recommend reviewing the AUP in advance of building your moderation solution.

Generate examples of content to moderate

Before developing a content moderation solution, first create examples of content that should be flagged and content that should not. Be sure to include edge cases and challenging scenarios that may be difficult for a content moderation system to handle effectively. Afterwards, review your examples to create a well-defined list of moderation categories. For example, the examples generated by a social media platform might include the following:

allowed_user_comments = [
    "This movie was great, I really enjoyed it. The main actor really killed it!",
    "I hate Mondays.",
    "It is a great time to invest in gold!",
]

disallowed_user_comments = [
    "Delete this post now or you better hide. I am coming after you and your family.",
    "Stay away from the 5G cellphones!! They are using 5G to control you.",
    "Congratulations! You have won a $1,000 gift card. Click here to claim your prize!",
]

# Sample user comments to test the content moderation
user_comments = allowed_user_comments + disallowed_user_comments

# List of categories considered unsafe for content moderation
unsafe_categories = [
    "Child Exploitation",
    "Conspiracy Theories",
    "Hate",
    "Indiscriminate Weapons",
    "Intellectual Property",
    "Non-Violent Crimes",
    "Privacy",
    "Self-Harm",
    "Sex Crimes",
    "Sexual Content",
    "Specialized Advice",
    "Violent Crimes",
]

Effectively moderating these examples requires a nuanced understanding of language. In the comment This movie was great, I really enjoyed it. The main actor really killed it!, the content moderation system needs to recognize that "killed it" is a metaphor, not an indication of actual violence. Conversely, despite the lack of any explicit mention of violence, the comment Delete this post now or you better hide. I am coming after you and your family. should be flagged by the content moderation system.

The unsafe_categories list can be customized to fit your specific needs. For example, if you wish to prevent minors from creating content on your website, you could append "Underage Posting" to the list.
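As a minimal sketch of such a customization (the abridged list and the "Underage Posting" label are purely illustrative):

```python
# Extend the base category list with a site-specific category.
# "Underage Posting" is the hypothetical example mentioned above;
# the abridged list stands in for the full unsafe_categories.
unsafe_categories = ["Violent Crimes", "Hate", "Privacy"]
custom_categories = unsafe_categories + ["Underage Posting"]
print(custom_categories[-1])  # → Underage Posting
```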


How to moderate content using Claude

Select the right Claude model

When selecting a model, it's important to consider the size of your data. If costs are a concern, a smaller model like Claude Haiku 3 is an excellent choice due to its cost-effectiveness. Below is an estimate of the cost to moderate text for a social media platform that receives one billion posts per month:

  • Content size

    • Posts per month: 1 billion
    • Characters per post: 100
    • Total characters: 100 billion
  • Estimated tokens

    • Input tokens: 28.6 billion (assuming 1 token per 3.5 characters)
    • Percentage of messages flagged: 3%
    • Output tokens per flagged message: 50
    • Total output tokens: 1.5 billion
  • Claude Haiku 3 estimated cost

    • Input token cost: 28,600 MTok * $0.25/MTok = $7,150
    • Output token cost: 1,500 MTok * $1.25/MTok = $1,875
    • Monthly cost: $7,150 + $1,875 = $9,025
  • Claude Opus 4.7 estimated cost

    • Input token cost: 28,600 MTok * $5.00/MTok = $143,000
    • Output token cost: 1,500 MTok * $25.00/MTok = $37,500
    • Monthly cost: $143,000 + $37,500 = $180,500

Actual costs may differ from these estimates. These estimates are based on the prompt highlighted in the batch processing section. Output tokens can be reduced even further by removing the explanation field from the response.
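The arithmetic behind these estimates can be checked directly. The sketch below recomputes the Claude Haiku 3 figures from the stated assumptions; because the estimate rounds input tokens to 28.6 billion before pricing, the exact results differ by a few dollars:

```python
# Recompute the Claude Haiku 3 monthly cost estimate from first principles
posts_per_month = 1_000_000_000
chars_per_post = 100
chars_per_token = 3.5            # rough approximation for English text
flagged_fraction = 0.03
output_tokens_per_flagged = 50

input_price_per_mtok = 0.25      # USD per million input tokens (Haiku 3)
output_price_per_mtok = 1.25     # USD per million output tokens (Haiku 3)

input_tokens = posts_per_month * chars_per_post / chars_per_token
output_tokens = posts_per_month * flagged_fraction * output_tokens_per_flagged

input_cost = input_tokens / 1_000_000 * input_price_per_mtok
output_cost = output_tokens / 1_000_000 * output_price_per_mtok
print(f"${input_cost:,.0f} + ${output_cost:,.0f} = ${input_cost + output_cost:,.0f}")
# → $7,143 + $1,875 = $9,018
```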

Build a strong prompt

In order to use Claude for content moderation, Claude must understand the moderation requirements of your application. Let's start by writing a prompt that allows you to define your moderation needs:

Python
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()


def moderate_message(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = "\n".join(unsafe_categories)

    # Construct the prompt for Claude, including the message and unsafe categories
    assessment_prompt = f"""
    Determine whether the following message warrants moderation,
    based on the unsafe categories outlined below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Respond with ONLY a JSON object, using the format below:
    {{
    "violation": <Boolean field denoting whether the message should be moderated>,
    "categories": [Comma-separated list of violated categories],
    "explanation": [Optional. Only include if there is a violation.]
    }}"""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment["violation"]

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get("categories", []) if contains_violation else []
    explanation = assessment.get("explanation") if contains_violation else None

    return contains_violation, violated_categories, explanation


# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message(
        comment, unsafe_categories
    )

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")

In this example, the moderate_message function contains an assessment prompt that includes the unsafe content categories and the message we wish to evaluate. The prompt asks Claude to assess whether the message should be moderated, based on the unsafe categories defined above.

The model's assessment is then parsed to determine if there is a violation. If there is a violation, Claude also returns a list of violated categories, as well as an explanation of why the message is unsafe.

Evaluate your prompt

Content moderation is a classification problem. Thus, you can use the same techniques outlined in the classification cookbook to determine the accuracy of your content moderation system.
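As a sketch of such an evaluation, the allowed and disallowed example lists from earlier can serve as ground-truth labels for an accuracy check. The stub below stands in for the Claude-backed moderate_message call so the sketch runs offline; in practice you would call the real function:

```python
# Ground-truth labels: (comment, should_be_flagged)
labeled_examples = [
    ("This movie was great, I really enjoyed it.", False),
    ("I hate Mondays.", False),
    ("Delete this post now or you better hide.", True),
    ("Congratulations! You have won a $1,000 gift card.", True),
]

def moderate_message_stub(message):
    """Placeholder for the Claude-backed moderate_message call."""
    return "hide" in message or "won" in message

correct = sum(
    moderate_message_stub(comment) == expected
    for comment, expected in labeled_examples
)
accuracy = correct / len(labeled_examples)
print(f"Accuracy: {accuracy:.0%}")  # → Accuracy: 100%
```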

One additional consideration is that, instead of treating content moderation as a binary classification problem, you can create multiple categories to represent various risk levels. Creating multiple risk levels allows you to adjust the aggressiveness of your moderation. For example, you might want to automatically block user queries that are deemed high risk, while flagging users with many medium-risk queries for human review.

Python
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()


def assess_risk_level(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = "\n".join(unsafe_categories)

    # Construct the prompt for Claude, including the message, unsafe categories, and risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message,
    based on the unsafe categories listed below.

Message:
<message>{message}</message>

Unsafe Categories:
<categories>
{unsafe_category_str}
</categories>

Assign a risk level based on your confidence that the user's message should be moderated
based on the defined unsafe categories:

0 - No risk
1 - Low risk
2 - Medium risk
3 - High risk

Respond with ONLY a JSON object, using the format below:
{{
  "risk_level": <Numerical field denoting the risk level>,
  "categories": [Comma-separated list of violated categories],
  "explanation": <Optional. Only include if risk level is greater than 0>
}}"""

    # Send the request to Claude for risk assessment
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the risk level, violated categories, and explanation from the assessment
    risk_level = assessment["risk_level"]
    violated_categories = assessment["categories"]
    explanation = assessment.get("explanation")

    return risk_level, violated_categories, explanation


# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    risk_level, violated_categories, explanation = assess_risk_level(
        comment, unsafe_categories
    )

    print(f"Risk Level: {risk_level}")
    if violated_categories:
        print(f"Violated Categories: {', '.join(violated_categories)}")
    if explanation:
        print(f"Explanation: {explanation}")

This code implements an assess_risk_level function that uses Claude to evaluate the risk level of a message. The function accepts a message and a list of unsafe categories as inputs.

Within the function, a prompt is generated for Claude, including the message to be assessed, the unsafe categories, and specific instructions for evaluating the risk level. The prompt instructs Claude to respond with a JSON object that includes the risk level, the violated categories, and an optional explanation.

This approach enables flexible content moderation by assigning risk levels. It can be seamlessly integrated into a larger system to automate content filtering or flag comments for human review based on their assessed risk level. For example, when executing this code, the comment Delete this post now or you better hide. I am coming after you and your family. is identified as high risk due to its dangerous threat. Conversely, the comment Stay away from the 5G cellphones!! They are using 5G to control you. is categorized as medium risk.
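One way to act on these levels is a simple routing step. The thresholds and action names below are illustrative assumptions, not part of this guide's API:

```python
def route_by_risk(risk_level):
    """Map a numeric risk level (0-3) to a moderation action."""
    if risk_level >= 3:
        return "block"          # automatically block high-risk content
    if risk_level == 2:
        return "human_review"   # queue medium-risk content for a moderator
    if risk_level == 1:
        return "log"            # record low-risk content for trend analysis
    return "allow"

print(route_by_risk(3))  # → block
print(route_by_risk(2))  # → human_review
```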

Deploy your prompt

Once you are confident in the quality of your solution, it's time to deploy it to production. Here are some best practices to follow when using content moderation in production:

  1. Provide clear feedback to users: When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback to help users understand why their message was flagged and how they can rephrase it appropriately. In the coding examples above, this is done through the explanation field in Claude's response.

  2. Analyze moderated content: Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.

  3. Continuously evaluate and improve: Regularly assess the performance of your content moderation system using metrics such as precision and recall tracking. Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
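Precision and recall can be computed from the same kind of labeled data used for the accuracy evaluation. A minimal sketch, where True means a message was (or should have been) flagged:

```python
def precision_recall(predictions, labels):
    """Compute precision and recall for a binary moderation classifier.

    Both arguments are lists of booleans, where True means flagged.
    """
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 3 truly unsafe messages; the system flags 2 of them plus 1 safe one
preds = [True, True, False, True, False]
labels = [True, True, True, False, False]
p, r = precision_recall(preds, labels)
print(f"precision={p:.2f} recall={r:.2f}")  # → precision=0.67 recall=0.67
```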


Improve performance

In complex scenarios, it may be helpful to consider additional strategies to improve performance beyond standard prompt engineering techniques. Here are some advanced strategies:

Define topics and provide examples

In addition to listing the unsafe categories in the prompt, further improvements can be made by providing definitions and phrases related to each category.

Python
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

# Dictionary of categories considered unsafe for content moderation, with their definitions
unsafe_category_definitions = {
    "Child Exploitation": "Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.",
    "Conspiracy Theories": """Content that promotes or endorses unfounded, false, or misleading theories about events, situations, or individuals,
        which can incite harm or undermine public trust in institutions or facts.""",
    "Hate": """Content that is hateful toward people on the basis of their protected characteristics
        (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease),
        or content that perpetuates negative stereotypes.""",
    "Indiscriminate Weapons": "Content that enables the creation of weapons of mass destruction.",
    "Intellectual Property": "Content that violates the intellectual property rights of third parties.",
    "Non-Violent Crimes": "Content that enables, encourages, or excuses the commission of non-violent crimes.",
    "Privacy": "Content that contains sensitive, personal information about private individuals.",
    "Self-Harm": "Content that enables, encourages, or excuses acts of intentional self-harm.",
    "Sex Crimes": "Content that enables, encourages, or excuses the commission of sex crimes.",
    "Sexual Content": "Content that contains erotic descriptions or explicit depictions of sex acts.",
    "Specialized Advice": """Content that contains financial, medical, or legal advice.
        Financial advice includes guidance on investments, stocks, bonds, or any financial planning.""",
    "Violent Crimes": "Content that enables, encourages, or excuses the commission of violent crimes.",
}


def moderate_message_with_definitions(message, unsafe_category_definitions):
    # Format unsafe categories string, with each category and its definition on a new line
    unsafe_category_str = "\n".join(
        f"{category}: {definition}"
        for category, definition in unsafe_category_definitions.items()
    )

    # Construct the prompt for Claude, including the message and unsafe categories
    assessment_prompt = f"""Determine whether the following message warrants moderation, based on the unsafe categories outlined below.

Message:
<message>{message}</message>

Unsafe Categories and Their Definitions:
<categories>
{unsafe_category_str}
</categories>

It's important that you remember all unsafe categories and their definitions.

Respond with ONLY a JSON object, using the format below:
{{
  "violation": <Boolean field denoting whether the message should be moderated>,
  "categories": [Comma-separated list of violated categories],
  "explanation": [Optional. Only include if there is a violation.]
}}"""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment["violation"]

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get("categories", []) if contains_violation else []
    explanation = assessment.get("explanation") if contains_violation else None

    return contains_violation, violated_categories, explanation


# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message_with_definitions(
        comment, unsafe_category_definitions
    )

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")

The moderate_message_with_definitions function expands upon the earlier moderate_message function by allowing each unsafe category to be paired with a detailed definition. This occurs in the code by replacing the unsafe_categories list from the original function with an unsafe_category_definitions dictionary. This dictionary maps each unsafe category to its corresponding definition. Both the category names and their definitions are included in the prompt.

Notably, the definition of the Specialized Advice category now specifies the types of financial advice that should be prohibited. As a result, the comment It's a great time to invest in gold!, which previously passed the moderate_message assessment, now triggers a violation.

Consider batch processing

To reduce costs in situations where real-time moderation isn't necessary, consider moderating messages in batches. Include multiple messages within the prompt's context, and ask Claude to assess which messages should be moderated.

Python
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()


def batch_moderate_messages(messages, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = "\n".join(unsafe_categories)

    # Format messages string, with each message wrapped in XML-like tags and given an ID
    messages_str = "\n".join(
        [f"<message id={idx}>{msg}</message>" for idx, msg in enumerate(messages)]
    )

    # Construct the prompt for Claude, including the messages and unsafe categories
    assessment_prompt = f"""Determine the messages to moderate, based on the unsafe categories outlined below.

Messages:
<messages>
{messages_str}
</messages>

Unsafe categories and their definitions:
<categories>
{unsafe_category_str}
</categories>

Respond with ONLY a JSON object, using the format below:
{{
  "violations": [
    {{
      "id": <message id>,
      "categories": [list of violated categories],
      "explanation": <Explanation of why there's a violation>
    }},
    ...
  ]
}}

Important Notes:
- Remember to analyze every message for a violation.
- Select any number of violations that reasonably apply."""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=2048,  # Increased max token count to handle batches
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)
    return assessment


# Process the batch of comments and get the response
response_obj = batch_moderate_messages(user_comments, unsafe_categories)

# Print the results for each detected violation
for violation in response_obj["violations"]:
    print(f"""Comment: {user_comments[violation["id"]]}
Violated Categories: {", ".join(violation["categories"])}
Explanation: {violation["explanation"]}
""")

In this example, the batch_moderate_messages function handles the moderation of an entire batch of messages with a single Claude API call.

Within the function, a prompt is created that includes the list of messages to evaluate, the defined unsafe content categories, and their descriptions. The prompt directs Claude to return a JSON object listing all messages that contain violations. Each message in the response is identified by its id, which corresponds to the message's position in the input list.

Keep in mind that finding the optimal batch size for your specific needs may require some experimentation. While larger batch sizes can lower costs, they might also lead to a slight decrease in quality. Additionally, you may need to increase the max_tokens parameter in the Claude API call to accommodate longer responses. For details on the maximum number of tokens your chosen model can output, refer to the model comparison page.
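When experimenting with batch sizes, a small helper can split comments into fixed-size chunks before each call to batch_moderate_messages (the chunk size of 10 is arbitrary):

```python
def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: 25 comments split into batches of at most 10
comments = [f"comment {i}" for i in range(25)]
batches = list(chunked(comments, 10))
print([len(b) for b in batches])  # → [10, 10, 5]
```

Each chunk can then be passed to batch_moderate_messages in turn; since the id returned by Claude is the position within that batch, add the batch's offset when mapping results back to the full comment list.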

Content moderation cookbook

View a fully implemented code-based example of how to use Claude for content moderation.

  • Guardrails guide

    Explore the guardrails guide to learn techniques for moderating interactions with Claude.