Visit our content moderation cookbook to see an example of a content moderation implementation using Claude.
Here are some key indicators that you should use an LLM like Claude for content moderation, rather than a traditional ML or rules-based approach:
Before developing a content moderation solution, first create examples of content that should be flagged and content that should not. Be sure to include edge cases and difficult scenarios that may challenge a content moderation system. Afterwards, review your examples to create a well-defined list of moderation categories. For example, the examples generated by a social media platform might include the following:
allowed_user_comments = [
'This movie was great, I really enjoyed it. The main actor really killed it!',
'I hate Mondays.',
'It is a great time to invest in gold!'
]
disallowed_user_comments = [
'Delete this post now or you better hide. I am coming after you and your family.',
'Stay away from the 5G cellphones!! They are using 5G to control you.',
'Congratulations! You have won a $1,000 gift card. Click here to claim your prize!'
]
# Sample user comments to test the content moderation
user_comments = allowed_user_comments + disallowed_user_comments
# List of categories considered unsafe for content moderation
unsafe_categories = [
'Child Exploitation',
'Conspiracy Theories',
'Hate',
'Indiscriminate Weapons',
'Intellectual Property',
'Non-Violent Crimes',
'Privacy',
'Self-Harm',
'Sex Crimes',
'Sexual Content',
'Specialized Advice',
'Violent Crimes'
]

Effectively moderating these examples requires a nuanced understanding of language. For the comment "This movie was great, I really enjoyed it. The main actor really killed it!", the content moderation system needs to recognize that "killed it" is a metaphor, not an indication of real violence. Conversely, despite the absence of any explicit mention of violence, the comment "Delete this post now or you better hide. I am coming after you and your family." should be flagged by the content moderation system.
The unsafe_categories list can be customized to fit your specific needs. For example, if you want to prevent minors from creating content on your website, you could append 'Underage Posting' to the list.
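As a minimal sketch of that customization (using the hypothetical 'Underage Posting' label from the example above):

unsafe_categories.append('Underage Posting')  # hypothetical category; name it to match your own policy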
When choosing a model, it is important to consider the size of your data. If cost is a concern, a smaller model like Claude Haiku 3 is an excellent choice because of its cost-effectiveness. Below is a cost estimate for moderating text for a social media platform that receives one billion posts per month (a rough calculation sketch follows the table):
Content volume | Estimated tokens | Claude Haiku 3 estimated cost | Claude Opus 4.6 estimated cost
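As a rough sketch of how such an estimate can be computed, the back-of-the-envelope arithmetic looks like this; the per-post token counts and per-million-token prices below are placeholder assumptions rather than published pricing, so substitute your own measurements and the current rates for the model you choose:

# Back-of-the-envelope cost estimate for moderating one billion posts per month.
# All numbers below are placeholder assumptions; replace them with your own
# measurements and the current pricing of the model you select.
POSTS_PER_MONTH = 1_000_000_000
AVG_INPUT_TOKENS_PER_POST = 100    # assumed: prompt template + categories + post text
AVG_OUTPUT_TOKENS_PER_POST = 50    # assumed: JSON verdict (smaller without an explanation field)
INPUT_PRICE_PER_MTOK = 0.25        # assumed USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 1.25       # assumed USD per million output tokens

input_cost = POSTS_PER_MONTH * AVG_INPUT_TOKENS_PER_POST / 1_000_000 * INPUT_PRICE_PER_MTOK
output_cost = POSTS_PER_MONTH * AVG_OUTPUT_TOKENS_PER_POST / 1_000_000 * OUTPUT_PRICE_PER_MTOK
print(f"Estimated monthly moderation cost: ${input_cost + output_cost:,.0f}")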
You can further reduce output tokens by asking Claude to omit the explanation field from its response. To use Claude for content moderation, Claude must understand your application's moderation requirements. Let's start by writing a prompt that allows you to define those requirements:
import anthropic
import json
# Initialize the Anthropic client
client = anthropic.Anthropic()
def moderate_message(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message and unsafe categories
    assessment_prompt = f"""
    Determine whether the following message warrants moderation,
    based on the unsafe categories outlined below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Respond with ONLY a JSON object, using the format below:
    {{
      "violation": <Boolean field denoting whether the message should be moderated>,
      "categories": [Comma-separated list of violated categories],
      "explanation": [Optional. Only include if there is a violation.]
    }}"""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message(comment, unsafe_categories)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
print("No issues detected.")在此範例中,moderate_message 函式包含一個評估提示詞,其中包括不安全內容類別和我們希望評估的訊息。提示詞要求 Claude 根據我們定義的不安全類別評估訊息是否應該被審核。
然後解析模型的評估結果以確定是否存在違規。如果存在違規,Claude 還會返回一個違規類別清單,以及訊息為何不安全的解釋。
Content moderation is a classification problem. You can therefore use the same techniques outlined in our classification cookbook to determine the accuracy of your content moderation system.
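For instance, here is a minimal sketch of such an accuracy check, reusing the labeled example comments defined earlier; the expected labels are simply the lists the comments came from, and a real evaluation would use a larger held-out test set and also track precision and recall:

# Minimal accuracy check against the hand-labeled example comments defined earlier.
labeled_comments = (
    [(comment, False) for comment in allowed_user_comments] +
    [(comment, True) for comment in disallowed_user_comments]
)

correct = 0
for comment, expected_violation in labeled_comments:
    violation, _, _ = moderate_message(comment, unsafe_categories)
    correct += (violation == expected_violation)

print(f"Accuracy: {correct / len(labeled_comments):.0%}")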
An additional consideration is that, instead of treating content moderation as a binary classification problem, you can create multiple categories representing different risk levels. Creating multiple risk levels allows you to tune how aggressive your moderation is. For example, you might automatically block user queries deemed high risk, while users with many medium-risk queries are flagged for human review.
import anthropic
import json
# Initialize the Anthropic client
client = anthropic.Anthropic()
def assess_risk_level(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message, unsafe categories, and risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message,
    based on the unsafe categories listed below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Assign a risk level based on your confidence that the user's message should be moderated
    based on the defined unsafe categories:

    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk

    Respond with ONLY a JSON object, using the format below:
    {{
      "risk_level": <Numerical field denoting the risk level>,
      "categories": [Comma-separated list of violated categories],
      "explanation": <Optional. Only include if risk level is greater than 0>
    }}"""

    # Send the request to Claude for risk assessment
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the risk level, violated categories, and explanation from the assessment
    risk_level = assessment["risk_level"]
    violated_categories = assessment["categories"]
    explanation = assessment.get("explanation")

    return risk_level, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    risk_level, violated_categories, explanation = assess_risk_level(comment, unsafe_categories)

    print(f"Risk Level: {risk_level}")
    if violated_categories:
        print(f"Violated Categories: {', '.join(violated_categories)}")
    if explanation:
print(f"Explanation: {explanation}")此程式碼實作了一個 assess_risk_level 函式,使用 Claude 來評估訊息的風險等級。該函式接受一則訊息和一個不安全類別清單作為輸入。
在函式內部,為 Claude 生成一個提示詞,包括要評估的訊息、不安全類別以及評估風險等級的具體指示。提示詞指示 Claude 以 JSON 物件回應,其中包括風險等級、違規類別和可選的解釋。
這種方法透過分配風險等級實現靈活的內容審核。它可以無縫整合到更大的系統中,根據評估的風險等級自動過濾內容或標記評論進行人工審查。例如,執行此程式碼時,評論 Delete this post now or you better hide. I am coming after you and your family. 因其危險威脅而被識別為高風險。相反地,評論 Stay away from the 5G cellphones!! They are using 5G to control you. 被歸類為中等風險。
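As a minimal sketch of that kind of routing (the route_comment helper and the thresholds below are assumptions, not part of the examples above), you might dispatch on the returned risk level like this:

# Hypothetical routing layer on top of assess_risk_level.
# The thresholds and downstream actions are assumptions; tune them to your policy.
def route_comment(comment, unsafe_categories):
    risk_level, categories, explanation = assess_risk_level(comment, unsafe_categories)

    if risk_level >= 3:
        # High risk: block automatically and surface the explanation to the user
        return {"action": "block", "categories": categories, "explanation": explanation}
    elif risk_level == 2:
        # Medium risk: allow for now, but queue the comment for human review
        return {"action": "queue_for_review", "categories": categories}
    else:
        # Low or no risk: allow through
        return {"action": "allow"}

print(route_comment("I hate Mondays.", unsafe_categories))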
Once you are confident in the quality of your solution, it is time to deploy it to production. Here are some best practices to follow when using content moderation in production:
Provide clear feedback to users: When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback to help users understand why their message was flagged and how to rephrase it appropriately. In the code examples above, this is done through the explanation field in Claude's response.
Analyze moderated content: Keep track of the types of content your moderation system flags in order to identify trends and potential areas for improvement; a small tracking sketch follows these practices.
Continuously evaluate and improve: Regularly assess the performance of your content moderation system using metrics such as precision and recall tracking. Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
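For example, here is a minimal sketch of that kind of trend tracking, tallying how often each category is flagged across a batch of comments; the in-memory Counter is a stand-in for whatever analytics store you actually use:

from collections import Counter

# Tally how often each unsafe category is flagged across a set of comments.
# In production, log these events to your analytics pipeline instead of
# keeping an in-memory counter.
category_counts = Counter()

for comment in user_comments:
    violation, violated_categories, _ = moderate_message(comment, unsafe_categories)
    if violation:
        category_counts.update(violated_categories)

for category, count in category_counts.most_common():
    print(f"{category}: {count} flagged comment(s)")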
In complex scenarios, it may help to consider additional strategies beyond standard prompt engineering techniques to improve performance. Here are some advanced strategies:
In addition to listing the unsafe categories in the prompt, you can further improve moderation by providing definitions and related phrases for each category.
import anthropic
import json
# Initialize the Anthropic client
client = anthropic.Anthropic()
# Dictionary of categories considered unsafe for content moderation, with their definitions
unsafe_category_definitions = {
'Child Exploitation': 'Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.',
'Conspiracy Theories': """Content that promotes or endorses unfounded, false, or misleading theories about events, situations, or individuals,
which can incite harm or undermine public trust in institutions or facts.""",
'Hate': """Content that is hateful toward people on the basis of their protected characteristics
(race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease),
or content that perpetuates negative stereotypes.""",
'Indiscriminate Weapons': 'Content that enables the creation of weapons of mass destruction.',
'Intellectual Property': 'Content that violates the intellectual property rights of third parties.',
'Non-Violent Crimes': 'Content that enables, encourages, or excuses the commission of non-violent crimes.',
'Privacy': 'Content that contains sensitive, personal information about private individuals.',
'Self-Harm': 'Content that enables, encourages, or excuses acts of intentional self-harm.',
'Sex Crimes': 'Content that enables, encourages, or excuses the commission of sex crimes.',
'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
'Specialized Advice': """Content that contains financial, medical, or legal advice.
Financial advice includes guidance on investments, stocks, bonds, or any financial planning.""",
'Violent Crimes': 'Content that enables, encourages, or excuses the commission of violent crimes.',
}
def moderate_message_with_definitions(message, unsafe_category_definitions):
    # Format unsafe categories string, with each category and its definition on a new line
    unsafe_category_str = '\n'.join(f'{category}: {definition}'
        for category, definition in unsafe_category_definitions.items()
    )

    # Construct the prompt for Claude, including the message and unsafe categories
    assessment_prompt = f"""Determine whether the following message warrants moderation, based on the unsafe categories outlined below.

    Message:
    <message>{message}</message>

    Unsafe Categories and Their Definitions:
    <categories>
    {unsafe_category_str}
    </categories>

    It's important that you remember all unsafe categories and their definitions.

    Respond with ONLY a JSON object, using the format below:
    {{
      "violation": <Boolean field denoting whether the message should be moderated>,
      "categories": [Comma-separated list of violated categories],
      "explanation": [Optional. Only include if there is a violation.]
    }}"""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message_with_definitions(comment, unsafe_category_definitions)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
print("No issues detected.")moderate_message_with_definitions 函式擴展了先前的 moderate_message 函式,允許每個不安全類別與詳細定義配對。這在程式碼中透過將原始函式中的 unsafe_categories 清單替換為 unsafe_category_definitions 字典來實現。此字典將每個不安全類別映射到其對應的定義。類別名稱及其定義都包含在提示詞中。
值得注意的是,Specialized Advice 類別的定義現在指定了應禁止的財務建議類型。因此,先前通過 moderate_message 評估的評論 It's a great time to invest in gold! 現在會觸發違規。
為了在不需要即時審核的情況下降低成本,請考慮批次審核訊息。在提示詞的上下文中包含多則訊息,並要求 Claude 評估哪些訊息應該被審核。
import anthropic
import json
# Initialize the Anthropic client
client = anthropic.Anthropic()
def batch_moderate_messages(messages, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Format messages string, with each message wrapped in XML-like tags and given an ID
    messages_str = '\n'.join([f'<message id={idx}>{msg}</message>' for idx, msg in enumerate(messages)])

    # Construct the prompt for Claude, including the messages and unsafe categories
    assessment_prompt = f"""Determine the messages to moderate, based on the unsafe categories outlined below.

    Messages:
    <messages>
    {messages_str}
    </messages>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Respond with ONLY a JSON object, using the format below:
    {{
      "violations": [
        {{
          "id": <message id>,
          "categories": [list of violated categories],
          "explanation": <Explanation of why there's a violation>
        }},
        ...
      ]
    }}

    Important Notes:
    - Remember to analyze every message for a violation.
    - Select any number of violations that reasonably apply."""

    # Send the request to Claude for content moderation
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=2048,  # Increased max token count to handle batches
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)
    return assessment

# Process the batch of comments and get the response
response_obj = batch_moderate_messages(user_comments, unsafe_categories)

# Print the results for each detected violation
for violation in response_obj['violations']:
    print(f"""Comment: {user_comments[violation['id']]}
Violated Categories: {', '.join(violation['categories'])}
Explanation: {violation['explanation']}
""")在此範例中,batch_moderate_messages 函式透過單次 Claude API 呼叫處理整批訊息的審核。
在函式內部,建立了一個提示詞,其中包括要評估的訊息清單、定義的不安全內容類別及其描述。提示詞指示 Claude 返回一個 JSON 物件,列出所有包含違規的訊息。回應中的每則訊息都透過其 id 識別,該 id 對應於訊息在輸入清單中的位置。
請記住,找到適合您特定需求的最佳批次大小可能需要一些實驗。雖然較大的批次大小可以降低成本,但也可能導致品質略有下降。此外,您可能需要增加 Claude API 呼叫中的 max_tokens 參數以容納更長的回應。有關您所選模型可以輸出的最大 token 數的詳細資訊,請參閱模型比較頁面。
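As a minimal sketch of experimenting with batch size (the moderate_in_batches helper and the BATCH_SIZE value below are assumptions, not part of the example above), you could split a larger comment stream into fixed-size batches and moderate each batch with one call:

# Hypothetical helper for chunking a comment stream into fixed-size batches.
# BATCH_SIZE is a tuning knob: larger batches cost less per comment but may
# reduce quality and require a larger max_tokens value.
BATCH_SIZE = 25

def moderate_in_batches(comments, unsafe_categories, batch_size=BATCH_SIZE):
    all_violations = []
    for start in range(0, len(comments), batch_size):
        batch = comments[start:start + batch_size]
        result = batch_moderate_messages(batch, unsafe_categories)
        # Re-map batch-local ids back to positions in the full comment list
        for violation in result['violations']:
            violation['id'] = start + violation['id']
            all_violations.append(violation)
    return all_violations

violations = moderate_in_batches(user_comments, unsafe_categories)
print(f"{len(violations)} of {len(user_comments)} comments flagged")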