
Configure Evaluator

Select model


Define Evaluator Prompt


Define the prompt for your LLM evaluator. This prompt instructs a 'judge' LLM on how to score your AI's output, letting you measure nuanced criteria like brand voice or creativity. Your prompt should clearly define the assessment criteria and provide a rubric of categories for the judge to select from.

Your instructions must use {{output}} for the AI's generated response and can optionally use {{input}} for the original user prompt, {{history}} for chat history, {{expected_output}} for the ground truth, and {{metadata.key}} for variables defined in metadata. These variables will be replaced with actual data from your project during evaluation.
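As an illustration, here is a minimal Python sketch of such a prompt and of the variable substitution described above. The rubric categories (Excellent, Fair, Poor), the render helper, and the sample data are hypothetical; only the {{input}} and {{output}} variable names come from this page.

```python
# A minimal sketch of an evaluator prompt using the documented template
# variables. Category names (Excellent/Fair/Poor) are hypothetical; only
# the {{...}} variable names come from the documentation above.
EVALUATOR_PROMPT = """\
You are grading an AI assistant's reply for brand voice.

User prompt: {{input}}
AI response: {{output}}

Rate the response as exactly one of: Excellent, Fair, Poor.
Respond with the category name only."""

def render(template: str, values: dict[str, str]) -> str:
    """Substitute {{variable}} placeholders — a simplified stand-in for
    the evaluation-time replacement described above, not the real engine."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

print(render(EVALUATOR_PROMPT, {
    "input": "Write a tagline for our coffee brand.",
    "output": "Bold mornings start here.",
}))
```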

Map Rubric Categories to Scores

Define how the rubric categories from the evaluator prompt map to numeric scores and display colors, so results show up in Projects and Analytics.

Each mapping row has three fields:

Rubric category: the category name exactly as it is defined in the evaluator prompt.
Score mapping: the numeric score to record when the judge selects that category.
Score color: the display color for the category, one of Green, Orange, or Red.
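As a sketch of how such a mapping could behave downstream, the following Python snippet pairs each rubric category from the earlier prompt sketch with a score and one of the three colors above. The category names, score values, and lookup function are illustrative assumptions, not the product's actual data model.

```python
# Hypothetical rubric-to-score mapping, mirroring the three fields above.
# Category names and score values are illustrative assumptions.
RUBRIC_MAPPING = {
    "Excellent": {"score": 1.0, "color": "Green"},
    "Fair": {"score": 0.5, "color": "Orange"},
    "Poor": {"score": 0.0, "color": "Red"},
}

def score_verdict(category: str) -> tuple[float, str]:
    """Return the (score, color) recorded when the judge picks a category."""
    entry = RUBRIC_MAPPING[category.strip()]
    return entry["score"], entry["color"]

print(score_verdict("Fair"))  # (0.5, 'Orange')
```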