Illustration of a robot being evaluated

Generative AI Tool Pre-Release Evaluation Guide

Co-authored by: Hailey Stevens-Macfarlane, Varun Shourie, Ishrat Ahmed, and Paul Alvarado 


LLM-powered chatbot solutions offer Arizona State University the potential to streamline communications, enhance user experiences, and provide instant support to students, faculty, and staff. However, launching these tools without proper testing poses serious risks—ranging from misinformation and biased or offensive interactions to reputational damage and compliance violations (especially regarding FERPA, HIPAA, and university data policies).

This quick guide outlines the critical steps and metrics to consider before deploying an LLM-powered chatbot at ASU, whether it is used internally or public facing.

Within Enterprise Technology, teams planning to deploy AI solutions should consult the AI Acceleration - Data Science team for evaluation frameworks, tools, and guidance. However, each owning team remains responsible for successful, compliant deployment. By following these best practices, ASU’s AI solutions remain ethical, inclusive, and reliable.

For questions, please reach out to the AI Acceleration – Data Science team via the #et-ai-tool-evaluation Slack channel.


Potential Risks of Deploying Without Evaluation

Robots proving their accuracy
Image generated using Google Labs ImageFX: https://labs.google/fx/tools/image-fx/6i5a6r0790000 
  • Misinformation and Inaccuracy 
    Example: Incorrect details about financial aid or grading policies can cause confusion and stress for students and impact their success.
  • Bias and Discrimination 
    Example: Subtle or overt biases against certain user groups can violate ASU’s commitment to access.
  • Offensive or Toxic Responses 
    Example: Unfiltered toxic language reflects poorly on the university’s values and damages institutional credibility.
  • Compliance Violations 
    Example: Failure to protect student privacy (FERPA) or medical data (HIPAA) can result in significant legal and financial consequences.
  • Erosion of Trust and Reputation 
    Example: Public backlash if the chatbot provides discriminatory or misleading content, hurting ASU’s reputation.
  • Data Security & Privacy Breaches 
    Example: Chatbot inadvertently sharing confidential information or storing sensitive data without proper safeguards.

     


Key Dimensions to Consider

Bots demonstrating an ability to breach security
Image generated using Google Labs ImageFX: https://labs.google/fx/tools/image-fx/0kflf88o30000 

Accuracy

  • Definition: Are the chatbot’s responses factually correct, relevant, and verifiable?
  • Why It Matters: Misleading info about grading criteria, financial aid, or scholarships undermines user trust and can negatively affect student outcomes.

Bias & Fairness

  • Definition: Does the chatbot treat all users equally, without favoring or discriminating against certain groups?
  • Why It Matters: Biased responses (e.g., suggesting some demographics are less capable) contradict ASU’s commitment to access and inclusion.

Toxicity

  • Definition: Does the chatbot use harmful, offensive, or profane language?
  • Why It Matters: Public-facing bots must not produce toxic or derogatory content that could tarnish ASU’s image.

Compliance & Privacy

  • Definition: Does the chatbot adhere to relevant regulations (e.g., FERPA, HIPAA) and handle sensitive data responsibly?
  • Why It Matters: Non-compliance can incur legal risks and compromise student or patient privacy at ASU.

Security & Safety

  • Definition: Does the chatbot protect against cyber threats, unauthorized access, and harmful misuse?
  • Why It Matters: Chatbots must be resilient against attacks such as prompt injection, jailbreaking, data exfiltration, or misuse that could expose sensitive information or allow manipulation of responses. Ensuring secure access, monitoring, and mitigation strategies is critical to maintaining a safe AI deployment.

Obedience

  • Definition: Does the chatbot follow explicit guardrails and/or protocols specified in the system prompt, such as not revealing confidential information or providing disallowed services?
  • Why It Matters: Prevents unauthorized disclosures or inappropriate behavior that could violate university policies and jeopardize public trust.

Throughput

  • Definition: Can the chatbot handle high volumes of user requests efficiently?
  • Why It Matters: Slow or unresponsive bots frustrate users and hamper ASU’s service quality.

Common Scenarios & Priority Focus

A student and a robot studying together
Image generated using Google Labs ImageFX: https://labs.google/fx/tools/image-fx/4iovgb2kn0000 

In real-world projects, chatbot implementations vary widely based on use cases, target audiences, and functional requirements. The examples below illustrate how priorities can shift depending on whether the chatbot is internal or public-facing, handles sensitive data, or addresses unique student needs. Developers must use their own judgment and adapt these guidelines to their specific use cases.

Scenario 1: Internal AI Assistant

Example: Experience Center Concierge Bot – Used by Staff to Assist Students, Faculty, and Employees.

Priorities:

  • Accuracy – Must provide correct, up-to-date information for internal teams to ensure reliable assistance.
  • Obedience – Must follow predefined protocols and avoid unauthorized disclosures of internal data.
  • Bias & Fairness – Should ensure neutral, non-discriminatory language, though risk is lower due to internal usage.
  • Throughput – Should handle large query volumes efficiently without degrading response quality.
  • Security & Safety – Prevent unauthorized API access or data leakage from internal systems.
  • Compliance & Privacy – Generally lower risk since it deals with internal queries, but must still follow ASU data handling policies.
  • Toxicity – Should maintain professional, respectful language but faces minimal risk of inappropriate interactions.

Scenario 2: Student-Facing Chatbot

Student-facing chatbots serve a diverse range of student needs, from answering factual questions to assisting with creative projects and academic advising. Given their broad impact, it is critical to develop escalation paths for cases where chatbot interactions require human intervention. This includes mental health risk detection, where chatbots should recognize distress signals and direct students to appropriate university support services rather than attempting to handle sensitive situations autonomously. Escalation paths also ensure that students receive expert guidance when chatbots reach the limits of their capabilities.

Below are three primary student-facing chatbot use cases, each with its priority considerations.

A. Fact-Based Q&A Chatbot

Example: Student Services Chatbot for Financial Aid, Academic Policies, University Services, etc.

Priorities

  • Bias & Fairness – Must not discriminate or show favoritism.
  • Toxicity – Should never generate offensive, harmful, or inappropriate responses.
  • Compliance & Privacy – Must comply with FERPA/HIPAA if handling sensitive info.
  • Accuracy – Should provide correct details on policies, deadlines, and resources.
  • Obedience – Must not provide restricted info or commit the university to unverified promises.
  • Security & Safety – Must resist unauthorized queries and vulnerabilities.
  • Throughput – Should serve high traffic effectively, especially during peak times.

Escalation Paths & Mental Health Risk Detection:

  • Share mental health resources (e.g., ASU Counseling, crisis hotlines).
  • Escalate critical cases to human staff (helpdesk tickets, live chat).
  • Do not engage in sensitive discussions; redirect to trained professionals.

B. Student-Facing Creativity & Ideation Chatbot

Example: AI Chatbot for Brainstorming, Essay Writing, Research Exploration, or Creative Projects.

Priorities:

  • Bias & Fairness – Must avoid reinforcing harmful stereotypes.
  • Toxicity – Must never produce offensive or harmful content.
  • Security & Safety – Must prevent misuse (hate speech, dangerous content).
  • Obedience – Adhere to academic integrity guidelines (no plagiarized essays).
  • Compliance & Privacy – Respect FERPA if used in class settings.
  • Throughput – Handle high interaction volumes efficiently.
  • Accuracy (Conditional) – If referencing facts or academic concepts, ensure correctness.

Escalation Paths & Academic Integrity Considerations:

  • Detect and discourage academic dishonesty (e.g., requests for full essays).
  • Provide brainstorming resources but avoid giving direct answers to assignments.
  • Log and restrict access if users prompt harmful or unethical content.

C. Student-Facing Advising & Enrollment Chatbot

Example: Automated Academic Advising Assistant – Course Selection, Major Guidance, Degree Progression Support.

Priorities:

  • Accuracy – Must provide correct course requirements, advising info.
  • Compliance & Privacy – Follow FERPA for student records and academic plans.
  • Bias & Fairness – Must not discourage any demographic from pursuing certain majors.
  • Obedience – Adhere to advising protocols; should not override official guidance.
  • Security & Safety – Prevent unauthorized data access or sharing.
  • Toxicity – Maintain supportive, respectful language.
  • Throughput – Handle peak advising periods smoothly.

Escalation Paths & Mental Health Risk Detection:

  • Share academic support resources (tutoring, workshops).
  • Refer students to wellness services or faculty office hours for stress or mental health concerns.
  • Escalate complex issues to human advisors for personalized support.

Scenario 3: Public-Facing Chatbot

Example: ASU Facts Bot - Used by the general public to learn high-level facts about ASU. 

Priorities:

  • Bias & Fairness – Avoid discriminatory or unfair treatment of any user group.
  • Toxicity – Prevent offensive, inappropriate, or harmful language.
  • Compliance & Privacy – Adhere to legal and institutional regulations (e.g., FERPA, HIPAA).
  • Accuracy – Provide correct and up-to-date information.
  • Obedience – Follow internal guidelines and avoid unauthorized disclosures or promises.
  • Security & Safety – Protect against malicious exploitation and data breaches.
  • Throughput – Handle high volumes of concurrent requests without delays.

Evaluation Methodologies

generate an image in a flat illustration style of a droid helping a female student learn. In the background, add two scientists observing behind a window. Use mostly white and black (on a very light tan background), with some gold accents and minimal maroon accents
Image generated using Google Labs ImageFX: https://labs.google/fx/tools/image-fx/7rik00vjv0000 

Methodology 1: Automated Testing & Evaluation

Example: Ethical AI Engine

Use Case: Rapidly assess model performance (accuracy, bias, toxicity) using prepared test data and a stable API.

Goal: Quickly identify performance issues and harmful behaviors during the development phase.

Prerequisites: Teams interested in conducting automated testing are encouraged to consult the AI Acceleration – Data Science team for guidance. To successfully run automated tests, you will also need:

  • Ensure the GenAI solution has a stable API or testing interface allowing programmatic access.
  • Prepare relevant test datasets reflecting actual user queries and edge cases.
  • Define your success metrics (e.g. accuracy, toxicity, bias, etc.).

Methodology 2: Manual Testing

Example: Red Teaming

Use Case: Intentionally stress-test the bot with adversarial prompts—offensive, manipulative, or privacy-infringing.

Goal: Identify vulnerabilities in the bot’s guardrails and ensure it won’t provide harmful or restricted content.

Prerequisites: 

  • Clear objectives: Define what specific risks or failure modes you want to test.
  • A list of red teaming questions or prompts detailing how to test these scenarios. Example scenarios include: Safety & Abuse, Model Jailbreaking, Prompt-Injections, Bias & Fairness.

Methodology 3: Pilot Run / Field Experiment

Example: A/B Testing

Use Case: Launch a limited version to a small user group or specific department.

Goal: Gather real-world user feedback, uncover unanticipated issues, and validate performance before a wider release. Think of the pilot run or field experiment as the ultimate validation for generative AI solutions—especially high-stakes projects. We strongly recommend that all critical GenAI initiatives conduct pilot experiments before full rollout.

Prerequisites:

  • Stakeholder Alignment: Agreement from departments or teams involved, ensuring they understand the experimental rollout.
  • Metrics & Data Collection: Infrastructure to track engagement, user actions, errors, and performance metrics for both test and control groups.
  • Feedback Loop: Tools and processes to collect user feedback (e.g., surveys, comments, or complaint forms) and automatically log interactions.

Evaluation Matrix

Use the following matrix to understand the minimum level of evaluation you should perform for each common scenario.

Scenario

Automated Testing & Evaluation

Manual Testing

Pilot Run / Field Experiment

Internal AI Assistant

Required

Suggested

Suggested

Student-Facing Chatbot

Required

Required

Suggested

Public-Facing Chatbot

Required

Required

Required


Conclusion

By following the steps in this guide, ASU Enterprise Technology teams can ensure their generative AI solutions are safe, compliant, and aligned with university values. Keep in mind that adequate planning is critical:

  • Automated Testing: Plan for at least 4 weeks of model and data evaluations.
  • Red Teaming: Allocate 1–2 weeks for adversarial stress tests.
  • Field Experiment / Pilot Run: Reserve 2 months to gather real-world feedback and validate performance.

Making time for thorough evaluation early in the development cycle helps prevent costly issues later and contributes to reliable, responsible AI deployments at ASU.


Keep Reading

Generative AI Tool Pre-Release Evaluation Guide

Stella Wenxing Liu

Arizona State University remains dedicated to responsible, principled innovation when deploying generative AI solutions, including chatbots. This guide ensures each project aligns with ASU’s values by mitigating potential risks—such as misinformation, bias, toxicity, and compliance lapses—using rigorous methods like automated testing, red teaming, and pilot experiments. In doing so, we uphold accuracy, fairness, and user trust while enhancing digital experiences across the university.

Agents in Generative AI

Zohair Zaidi

Agents in generative AI are semi-autonomous entities that collaborate and interact dynamically, allowing them to solve complex problems and combine specialized capabilities for greater efficiency and adaptability.

How to Use Knowledge Base (RAG)

Jinjing Zhao

Explore an overview of the Knowledge Base and Retrieval Augmented Generation (RAG) methods. Learn about the different types of Knowledge Base retrieval and understand the distinctions between the Knowledge Base and system prompts.

CreateAI Platform Available LLM Models

Faith Timoh Abang

We are proud to offer 40+ models including multi-modal (voice, image, text) for the ASU community to access securely on the CreateAI Platform. Users can find the following models available for experimentation and use in Model Comparison, ASU GPT, and MyAI Builder (access request required). Updated: March 25, 2025.