
Content moderation

Content moderation is a critical aspect of maintaining a safe, respectful, and productive environment in digital applications. In this guide, we'll discuss how to use Claude to moderate content within your digital application.

Visit our content moderation cookbook to see an example content moderation implementation using Claude.

This guide focuses on moderating user-generated content within your application. If you're looking for guidance on moderating interactions with Claude, refer to our guardrails guide.

Before building with Claude

Decide whether to use Claude for content moderation

Here are some key indicators that you should use an LLM like Claude instead of a traditional ML or rules-based approach for content moderation:

Anthropic has trained all Claude models to be honest, helpful, and harmless. This may result in Claude moderating content deemed particularly dangerous (in line with our Acceptable Use Policy), regardless of the prompt used. For example, an adult website that wants to allow its users to post explicit sexual content may find that Claude still flags explicit content as requiring moderation, even if they specify in their prompt not to moderate explicit sexual content. We recommend reviewing our AUP before building a moderation solution.

Generate examples of content to moderate

Before developing a content moderation solution, first create examples of content that should be flagged and content that should not be flagged. Be sure to include edge cases and challenging scenarios that may be difficult for a content moderation system to handle. Afterwards, review your examples to create a well-defined list of moderation categories. For instance, the examples generated by a social media platform might include the following:

    allowed_user_comments = [
        'This movie was great, I really enjoyed it. The main actor really killed it!',
        'I hate Mondays.',
        'It is a great time to invest in gold!'
    ]
    
    disallowed_user_comments = [
        'Delete this post now or you better hide. I am coming after you and your family.',
        'Stay away from the 5G cellphones!! They are using 5G to control you.',
        'Congratulations! You have won a $1,000 gift card. Click here to claim your prize!'
    ]
    
    # Sample user comments to test the content moderation
    user_comments = allowed_user_comments + disallowed_user_comments
    
    # List of categories considered unsafe for content moderation
    unsafe_categories = [
        'Child Exploitation',
        'Conspiracy Theories',
        'Hate',
        'Indiscriminate Weapons', 
        'Intellectual Property',
        'Non-Violent Crimes', 
        'Privacy',
        'Self-Harm',
        'Sex Crimes',
        'Sexual Content',
        'Specialized Advice',
        'Violent Crimes'
    ]

Effectively moderating these examples requires a nuanced understanding of language. In the comment This movie was great, I really enjoyed it. The main actor really killed it!, the content moderation system needs to recognize that "killed it" is a metaphor, not an indication of actual violence. Conversely, despite the lack of explicit mentions of violence, the comment Delete this post now or you better hide. I am coming after you and your family. should be flagged by the content moderation system.

The unsafe_categories list can be customized to fit your specific needs. For example, if you wish to prevent minors from creating content on your website, you could add "Underage Posting" to the list.
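As a minimal illustration of that kind of customization (the "Underage Posting" label is just the hypothetical example mentioned above), extending the list is a one-line change before it is passed to the moderation prompt:

    # Hypothetical customization: add a platform-specific category to the list
    custom_unsafe_categories = unsafe_categories + ['Underage Posting']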


How to moderate content using Claude

Select the right Claude model

When selecting a model, it's important to consider the size of your data. If cost is a concern, a smaller model like Claude Haiku 3 is an excellent choice due to its cost-effectiveness. Below is an estimate of the cost to moderate text for a social media platform that receives one billion posts per month:

• Content size
  • Posts per month: 1 billion
  • Characters per post: 100
  • Total characters: 100 billion
• Estimated tokens
  • Input tokens: 28.6 billion (assuming 1 token per 3.5 characters)
  • Percentage of messages flagged: 3%
  • Output tokens per flagged message: 50
  • Total output tokens: 1.5 billion
• Claude Haiku 3 estimated cost
  • Input token cost: 28,600 MTok * $0.25/MTok = $7,150
  • Output token cost: 1,500 MTok * $1.25/MTok = $1,875
  • Monthly cost: $7,150 + $1,875 = $9,025
• Claude Opus 4.6 estimated cost
  • Input token cost: 28,600 MTok * $5.00/MTok = $143,000
  • Output token cost: 1,500 MTok * $25.00/MTok = $37,500
  • Monthly cost: $143,000 + $37,500 = $180,500
Actual costs may differ from these estimates. The estimates are based on the prompt highlighted in the section on batch processing. Output tokens can be reduced even further by removing the explanation field from the response.
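If you want to adapt this estimate to your own traffic, the arithmetic above can be reproduced in a few lines of Python. The volumes and per-MTok prices below are only the illustrative assumptions from this section, not measured costs:

    # Rough monthly cost estimate for the moderation workload described above
    posts_per_month = 1_000_000_000
    chars_per_post = 100
    chars_per_token = 3.5
    flagged_rate = 0.03
    output_tokens_per_flagged_post = 50
    
    input_tokens = posts_per_month * chars_per_post / chars_per_token
    output_tokens = posts_per_month * flagged_rate * output_tokens_per_flagged_post
    
    def monthly_cost(input_price_per_mtok, output_price_per_mtok):
        # Prices are expressed per million tokens (MTok)
        return (input_tokens / 1e6) * input_price_per_mtok + (output_tokens / 1e6) * output_price_per_mtok
    
    print(f"Claude Haiku 3: ${monthly_cost(0.25, 1.25):,.0f}/month")
    print(f"Claude Opus 4.6: ${monthly_cost(5.00, 25.00):,.0f}/month")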

Build a strong prompt

In order to use Claude for content moderation, Claude must understand the moderation requirements of your application. Let's start by writing a prompt that allows you to define your moderation needs:

    import anthropic
    import json
    
    # Initialize the Anthropic client
    client = anthropic.Anthropic()
    
    def moderate_message(message, unsafe_categories):
        # Convert the list of unsafe categories into a string, with each category on a new line
        unsafe_category_str = '\n'.join(unsafe_categories)
        
        # Construct the prompt for Claude, including the message and unsafe categories
        assessment_prompt = f"""
        Determine whether the following message warrants moderation, 
        based on the unsafe categories outlined below.
        
        Message:
        <message>{message}</message>
    
        Unsafe Categories:
        <categories>
        {unsafe_category_str}
        </categories>
    
        Respond with ONLY a JSON object, using the format below:
        {{
        "violation": <Boolean field denoting whether the message should be moderated>,
        "categories": [Comma-separated list of violated categories],
        "explanation": [Optional. Only include if there is a violation.]
        }}"""
    
        # Send the request to Claude for content moderation
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
            max_tokens=200,
            temperature=0,   # Use 0 temperature for increased consistency
            messages=[
                {"role": "user", "content": assessment_prompt}
            ]
        )
        
        # Parse the JSON response from Claude
        assessment = json.loads(response.content[0].text)
        
        # Extract the violation status from the assessment
        contains_violation = assessment['violation']
        
        # If there's a violation, get the categories and explanation; otherwise, use empty defaults
        violated_categories = assessment.get('categories', []) if contains_violation else []
        explanation = assessment.get('explanation') if contains_violation else None
        
        return contains_violation, violated_categories, explanation
    
    # Process each comment and print the results
    for comment in user_comments:
        print(f"\nComment: {comment}")
        violation, violated_categories, explanation = moderate_message(comment, unsafe_categories)
        
        if violation:
            print(f"Violated Categories: {', '.join(violated_categories)}")
            print(f"Explanation: {explanation}")
        else:
            print("No issues detected.")

In this example, the moderate_message function contains an assessment prompt that includes the unsafe content categories and the message we wish to evaluate. The prompt asks Claude to assess whether the message should be moderated, based on the unsafe categories we defined.

The model's assessment is then parsed to determine whether there is a violation. If there is a violation, Claude also returns a list of violated categories, as well as an explanation of why the message is unsafe.

Evaluate your prompt

Content moderation is a classification problem. Thus, you can use the same techniques outlined in our classification cookbook to determine the accuracy of your content moderation system.
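As a minimal sketch of such an evaluation, you can run the prompt over the labeled examples defined earlier and compute simple precision and recall figures (this assumes the moderate_message function, unsafe_categories, and the allowed/disallowed comment lists from the previous snippets):

    # Evaluate the moderation prompt against the labeled comments defined earlier
    labeled_comments = (
        [(comment, False) for comment in allowed_user_comments] +
        [(comment, True) for comment in disallowed_user_comments]
    )
    
    true_positives = false_positives = false_negatives = 0
    for comment, should_flag in labeled_comments:
        flagged, _, _ = moderate_message(comment, unsafe_categories)
        if flagged and should_flag:
            true_positives += 1
        elif flagged and not should_flag:
            false_positives += 1
        elif not flagged and should_flag:
            false_negatives += 1
    
    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")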

One additional consideration is that instead of treating content moderation as a binary classification problem, you can create multiple categories to represent different risk levels. Creating multiple risk levels allows you to adjust the aggressiveness of your moderation. For example, you might want to automatically block user queries deemed high risk, while users with many medium-risk queries are flagged for human review.

    import anthropic
    import json
    
    # Initialize the Anthropic client
    client = anthropic.Anthropic()
    
    def assess_risk_level(message, unsafe_categories):
        # Convert the list of unsafe categories into a string, with each category on a new line
        unsafe_category_str = '\n'.join(unsafe_categories)
        
        # Construct the prompt for Claude, including the message, unsafe categories, and risk level definitions
        assessment_prompt = f"""
    Assess the risk level of the following message, based on the unsafe categories listed below.
        
    Message:
    <message>{message}</message>
    
    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>
    
    Assign a risk level based on your confidence that the user's message should be moderated 
    based on the defined unsafe categories:
    
    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk
    
    Respond with ONLY a JSON object, using the format below:
    {{
      "risk_level": <Numerical field denoting the risk level>,
      "categories": [Comma-separated list of violated categories],
      "explanation": <Optional. Only include if risk level is greater than 0>
    }}"""
    
        # Send the request to Claude for risk assessment
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
            max_tokens=200,
            temperature=0,   # Use 0 temperature for increased consistency
            messages=[
                {"role": "user", "content": assessment_prompt}
            ]
        )
        
        # Parse the JSON response from Claude
        assessment = json.loads(response.content[0].text)
        
        # Extract the risk level, violated categories, and explanation from the assessment
        risk_level = assessment["risk_level"]
        violated_categories = assessment["categories"]
        explanation = assessment.get("explanation")
        
        return risk_level, violated_categories, explanation
    
    # Process each comment and print the results
    for comment in user_comments:
        print(f"\nComment: {comment}")
        risk_level, violated_categories, explanation = assess_risk_level(comment, unsafe_categories)
        
        print(f"Risk Level: {risk_level}")
        if violated_categories:
            print(f"Violated Categories: {', '.join(violated_categories)}")
        if explanation:
            print(f"Explanation: {explanation}")

This code implements an assess_risk_level function that uses Claude to assess the risk level of a message. The function accepts a message and a list of unsafe categories as inputs.

Inside the function, a prompt is generated for Claude that includes the message to be assessed, the unsafe categories, and specific instructions for assessing the risk level. The prompt instructs Claude to respond with a JSON object that includes the risk level, the violated categories, and an optional explanation.

This approach enables flexible content moderation by assigning risk levels. It can be seamlessly integrated into a larger system to automate content filtering or flag comments for human review based on their assessed risk level. For example, when executing this code, the comment Delete this post now or you better hide. I am coming after you and your family. is identified as high risk due to its dangerous threat. Conversely, the comment Stay away from the 5G cellphones!! They are using 5G to control you. is categorized as medium risk.
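As one possible sketch of that kind of integration (the thresholds and the stub handlers below are hypothetical placeholders, and assess_risk_level / unsafe_categories are assumed from the snippet above):

    # Hypothetical routing of messages based on the assessed risk level
    def block_message(message):
        print(f"Blocked: {message}")
    
    def queue_for_human_review(message):
        print(f"Queued for human review: {message}")
    
    def handle_user_message(message):
        risk_level, violated_categories, explanation = assess_risk_level(message, unsafe_categories)
        if risk_level >= 3:
            block_message(message)           # automatically block high-risk content
        elif risk_level == 2:
            queue_for_human_review(message)  # flag medium-risk content for a human
        # risk levels 0 and 1 are allowed through unchanged
        return risk_level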

Deploy your prompt

Once you are confident in the quality of your solution, it's time to deploy it to production. Here are some best practices to follow when using content moderation in production:

1. Provide clear feedback to users: When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback that helps users understand why their message was flagged and how they can rephrase it appropriately. In the code examples above, this is done through the explanation field in Claude's response (a sketch of this appears after this list).

2. Analyze moderated content: Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.

3. Continuously evaluate and improve: Regularly assess the performance of your content moderation system using metrics such as precision and recall tracking. Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
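As a sketch of the first point, the explanation returned by moderate_message can be surfaced back to the user when a comment is rejected; the wording and the submit_comment helper below are hypothetical, not part of this guide's reference code:

    # Hypothetical example of turning a moderation result into user-facing feedback
    def submit_comment(comment):
        violation, violated_categories, explanation = moderate_message(comment, unsafe_categories)
        if violation:
            # Explain the rejection so the user can rephrase their message appropriately
            return (
                "Your comment could not be posted because it appears to violate our guidelines "
                f"({', '.join(violated_categories)}). {explanation} Please revise it and try again."
            )
        return "Your comment has been posted."  # in a real app, persist or publish the comment here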


Improve performance

In complex scenarios, it may be helpful to consider additional strategies to improve performance beyond standard prompt engineering techniques. Here are some advanced strategies:

Define topics and provide examples

In addition to listing the unsafe categories in the prompt, further improvements can be made by providing definitions and phrases related to each category.

    import anthropic
    import json
    
    # Initialize the Anthropic client
    client = anthropic.Anthropic()
    
    # Dictionary of categories considered unsafe for content moderation, with their definitions
    unsafe_category_definitions = {
        'Child Exploitation': 'Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.',
        'Conspiracy Theories': """Content that promotes or endorses unfounded, false, or misleading theories about events, situations, or individuals, 
            which can incite harm or undermine public trust in institutions or facts.""",
        'Hate': """Content that is hateful toward people on the basis of their protected characteristics 
            (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease), 
            or content that perpetuates negative stereotypes.""",
        'Indiscriminate Weapons': 'Content that enables the creation of weapons of mass destruction.', 
        'Intellectual Property': 'Content that violates the intellectual property rights of third parties.',
        'Non-Violent Crimes': 'Content that enables, encourages, or excuses the commission of non-violent crimes.', 
        'Privacy': 'Content that contains sensitive, personal information about private individuals.',
        'Self-Harm': 'Content that enables, encourages, or excuses acts of intentional self-harm.',
        'Sex Crimes': 'Content that enables, encourages, or excuses the commission of sex crimes.',
        'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
        'Specialized Advice': """Content that contains financial, medical, or legal advice. 
            Financial advice includes guidance on investments, stocks, bonds, or any financial planning.""",
        'Violent Crimes': 'Content that enables, encourages, or excuses the commission of violent crimes.', 
    }
    
    
    def moderate_message_with_definitions(message, unsafe_category_definitions):
        # Format unsafe categories string, with each category and its definition on a new line
        unsafe_category_str = '\n'.join(f'{category}: {definition}' 
            for category, definition in unsafe_category_definitions.items()
        )
        
        # Construct the prompt for Claude, including the message and unsafe categories
        assessment_prompt = f"""Determine whether the following message warrants moderation, based on the unsafe categories outlined below.
    
    Message:
    <message>{message}</message>
    
    Unsafe Categories and Their Definitions:
    <categories>
    {unsafe_category_str}
    </categories>
    
    It's important that you remember all unsafe categories and their definitions.
    
    Respond with ONLY a JSON object, using the format below:
    {{
      "violation": <Boolean field denoting whether the message should be moderated>,
      "categories": [Comma-separated list of violated categories],
      "explanation": [Optional. Only include if there is a violation.]
    }}"""
    
        # Send the request to Claude for content moderation
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
            max_tokens=200,
            temperature=0,   # Use 0 temperature for increased consistency
            messages=[
                {"role": "user", "content": assessment_prompt}
            ]
        )
        
        # Parse the JSON response from Claude
        assessment = json.loads(response.content[0].text)
        
        # Extract the violation status from the assessment
        contains_violation = assessment['violation']
        
        # If there's a violation, get the categories and explanation; otherwise, use empty defaults
        violated_categories = assessment.get('categories', []) if contains_violation else []
        explanation = assessment.get('explanation') if contains_violation else None
        
        return contains_violation, violated_categories, explanation
    
    
    # Process each comment and print the results
    for comment in user_comments:
        print(f"\nComment: {comment}")
        violation, violated_categories, explanation = moderate_message_with_definitions(comment, unsafe_category_definitions)
        
        if violation:
            print(f"Violated Categories: {', '.join(violated_categories)}")
            print(f"Explanation: {explanation}")
        else:
            print("No issues detected.")

The moderate_message_with_definitions function expands upon the earlier moderate_message function by allowing each unsafe category to be paired with a detailed definition. In the code, this is done by replacing the unsafe_categories list from the original function with an unsafe_category_definitions dictionary. The dictionary maps each unsafe category to its corresponding definition, and both the category names and their definitions are included in the prompt.

Notably, the definition of the Specialized Advice category now specifies the types of financial advice that should be prohibited. As a result, the comment It is a great time to invest in gold!, which did not trigger a violation with the earlier moderate_message prompt, now does.

Consider batch processing

To reduce costs in situations where real-time moderation isn't necessary, consider moderating messages in batches. Include multiple messages within the prompt's context, and ask Claude to assess which messages should be moderated.

    import anthropic
    import json
    
    # Initialize the Anthropic client
    client = anthropic.Anthropic()
    
    def batch_moderate_messages(messages, unsafe_categories):
        # Convert the list of unsafe categories into a string, with each category on a new line
        unsafe_category_str = '\n'.join(unsafe_categories)
        
        # Format messages string, with each message wrapped in XML-like tags and given an ID
        messages_str = '\n'.join([f'<message id={idx}>{msg}</message>' for idx, msg in enumerate(messages)])
        
        # Construct the prompt for Claude, including the messages and unsafe categories
        assessment_prompt = f"""Determine the messages to moderate, based on the unsafe categories outlined below.
    
    Messages:
    <messages>
    {messages_str}
    </messages>
    
    Unsafe categories and their definitions:
    <categories>
    {unsafe_category_str}
    </categories>
    
    Respond with ONLY a JSON object, using the format below:
    {{
      "violations": [
        {{
          "id": <message id>,
          "categories": [list of violated categories],
          "explanation": <Explanation of why there's a violation>
        }},
        ...
      ]
    }}
    
    Important Notes:
    - Remember to analyze every message for a violation.
    - Select any number of violations that reasonably apply."""
    
        # Send the request to Claude for content moderation
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
            max_tokens=2048,  # Increased max token count to handle batches
            temperature=0,    # Use 0 temperature for increased consistency
            messages=[
                {"role": "user", "content": assessment_prompt}
            ]
        )
        
        # Parse the JSON response from Claude
        assessment = json.loads(response.content[0].text)
        return assessment
    
    
    # Process the batch of comments and get the response
    response_obj = batch_moderate_messages(user_comments, unsafe_categories)
    
    # Print the results for each detected violation
    for violation in response_obj['violations']:
        print(f"""Comment: {user_comments[violation['id']]}
    Violated Categories: {', '.join(violation['categories'])}
    Explanation: {violation['explanation']}
    """)

In this example, the batch_moderate_messages function handles the moderation of an entire batch of messages with a single Claude API call. Inside the function, a prompt is created that includes the list of messages to evaluate, the defined unsafe content categories, and their descriptions. The prompt instructs Claude to return a JSON object listing all messages that contain violations. Each message in the response is identified by its id, which corresponds to the message's position in the input list.

Keep in mind that finding the optimal batch size for your specific needs may require some experimentation. While larger batch sizes can lower costs, they may also lead to a slight decrease in quality. Additionally, you may need to increase the max_tokens parameter in the Claude API call to accommodate longer responses. For details on the maximum number of tokens your chosen model can output, refer to the model comparison page.
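One way to experiment with batch size is to split your backlog into fixed-size chunks and call batch_moderate_messages on each one. The chunk size of 25 below is only an illustrative starting point, and the helper assumes batch_moderate_messages and unsafe_categories from the snippet above:

    # Hypothetical helper for running moderation over a large backlog in chunks
    def moderate_in_batches(messages, batch_size=25):
        all_violations = []
        for start in range(0, len(messages), batch_size):
            batch = messages[start:start + batch_size]
            result = batch_moderate_messages(batch, unsafe_categories)
            for violation in result['violations']:
                # Re-map the batch-local id back to a position in the full message list
                violation['id'] = start + int(violation['id'])
                all_violations.append(violation)
        return all_violations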

Content moderation cookbook

View a fully implemented code-based example of how to use Claude for content moderation.

Guardrails guide

Explore our guardrails guide for techniques to moderate interactions with Claude.
