
Generative AI Tool Pre-Release Evaluation Guide
Co-authored by: Hailey Stevens-Macfarlane, Varun Shourie, Ishrat Ahmed, and Paul Alvarado
LLM-powered chatbot solutions offer Arizona State University the potential to streamline communications, enhance user experiences, and provide instant support to students, faculty, and staff. However, launching these tools without proper testing poses serious risks—ranging from misinformation and biased or offensive interactions to reputational damage and compliance violations (especially regarding FERPA, HIPAA, and university data policies).
This quick guide outlines the critical steps and metrics to consider before deploying an LLM-powered chatbot at ASU, whether it is used internally or public facing.
Within Enterprise Technology, teams planning to deploy AI solutions should consult the AI Acceleration - Data Science team for evaluation frameworks, tools, and guidance. However, each owning team remains responsible for successful, compliant deployment. By following these best practices, ASU’s AI solutions remain ethical, inclusive, and reliable.
For questions, please reach out to the AI Acceleration – Data Science team via the #et-ai-tool-evaluation Slack channel.
Potential Risks of Deploying Without Evaluation

- Misinformation and Inaccuracy
Example: Incorrect details about financial aid or grading policies can cause confusion and stress for students and impact their success. - Bias and Discrimination
Example: Subtle or overt biases against certain user groups can violate ASU’s commitment to access. - Offensive or Toxic Responses
Example: Unfiltered toxic language reflects poorly on the university’s values and damages institutional credibility. - Compliance Violations
Example: Failure to protect student privacy (FERPA) or medical data (HIPAA) can result in significant legal and financial consequences. - Erosion of Trust and Reputation
Example: Public backlash if the chatbot provides discriminatory or misleading content, hurting ASU’s reputation. Data Security & Privacy Breaches
Example: Chatbot inadvertently sharing confidential information or storing sensitive data without proper safeguards.
Key Dimensions to Consider

Accuracy
- Definition: Are the chatbot’s responses factually correct, relevant, and verifiable?
- Why It Matters: Misleading info about grading criteria, financial aid, or scholarships undermines user trust and can negatively affect student outcomes.
Bias & Fairness
- Definition: Does the chatbot treat all users equally, without favoring or discriminating against certain groups?
- Why It Matters: Biased responses (e.g., suggesting some demographics are less capable) contradict ASU’s commitment to access and inclusion.
Toxicity
- Definition: Does the chatbot use harmful, offensive, or profane language?
- Why It Matters: Public-facing bots must not produce toxic or derogatory content that could tarnish ASU’s image.
Compliance & Privacy
- Definition: Does the chatbot adhere to relevant regulations (e.g., FERPA, HIPAA) and handle sensitive data responsibly?
- Why It Matters: Non-compliance can incur legal risks and compromise student or patient privacy at ASU.
Security & Safety
- Definition: Does the chatbot protect against cyber threats, unauthorized access, and harmful misuse?
- Why It Matters: Chatbots must be resilient against attacks such as prompt injection, jailbreaking, data exfiltration, or misuse that could expose sensitive information or allow manipulation of responses. Ensuring secure access, monitoring, and mitigation strategies is critical to maintaining a safe AI deployment.
Obedience
- Definition: Does the chatbot follow explicit guardrails and/or protocols specified in the system prompt, such as not revealing confidential information or providing disallowed services?
- Why It Matters: Prevents unauthorized disclosures or inappropriate behavior that could violate university policies and jeopardize public trust.
Throughput
- Definition: Can the chatbot handle high volumes of user requests efficiently?
- Why It Matters: Slow or unresponsive bots frustrate users and hamper ASU’s service quality.
Common Scenarios & Priority Focus

In real-world projects, chatbot implementations vary widely based on use cases, target audiences, and functional requirements. The examples below illustrate how priorities can shift depending on whether the chatbot is internal or public-facing, handles sensitive data, or addresses unique student needs. Developers must use their own judgment and adapt these guidelines to their specific use cases.
Scenario 1: Internal AI Assistant
Example: Experience Center Concierge Bot – Used by Staff to Assist Students, Faculty, and Employees.
Priorities:
- Accuracy – Must provide correct, up-to-date information for internal teams to ensure reliable assistance.
- Obedience – Must follow predefined protocols and avoid unauthorized disclosures of internal data.
- Bias & Fairness – Should ensure neutral, non-discriminatory language, though risk is lower due to internal usage.
- Throughput – Should handle large query volumes efficiently without degrading response quality.
- Security & Safety – Prevent unauthorized API access or data leakage from internal systems.
- Compliance & Privacy – Generally lower risk since it deals with internal queries, but must still follow ASU data handling policies.
- Toxicity – Should maintain professional, respectful language but faces minimal risk of inappropriate interactions.
Scenario 2: Student-Facing Chatbot
Student-facing chatbots serve a diverse range of student needs, from answering factual questions to assisting with creative projects and academic advising. Given their broad impact, it is critical to develop escalation paths for cases where chatbot interactions require human intervention. This includes mental health risk detection, where chatbots should recognize distress signals and direct students to appropriate university support services rather than attempting to handle sensitive situations autonomously. Escalation paths also ensure that students receive expert guidance when chatbots reach the limits of their capabilities.
Below are three primary student-facing chatbot use cases, each with its priority considerations.
A. Fact-Based Q&A Chatbot
Example: Student Services Chatbot for Financial Aid, Academic Policies, University Services, etc.
Priorities
- Bias & Fairness – Must not discriminate or show favoritism.
- Toxicity – Should never generate offensive, harmful, or inappropriate responses.
- Compliance & Privacy – Must comply with FERPA/HIPAA if handling sensitive info.
- Accuracy – Should provide correct details on policies, deadlines, and resources.
- Obedience – Must not provide restricted info or commit the university to unverified promises.
- Security & Safety – Must resist unauthorized queries and vulnerabilities.
- Throughput – Should serve high traffic effectively, especially during peak times.
Escalation Paths & Mental Health Risk Detection:
- Share mental health resources (e.g., ASU Counseling, crisis hotlines).
- Escalate critical cases to human staff (helpdesk tickets, live chat).
- Do not engage in sensitive discussions; redirect to trained professionals.
B. Student-Facing Creativity & Ideation Chatbot
Example: AI Chatbot for Brainstorming, Essay Writing, Research Exploration, or Creative Projects.
Priorities:
- Bias & Fairness – Must avoid reinforcing harmful stereotypes.
- Toxicity – Must never produce offensive or harmful content.
- Security & Safety – Must prevent misuse (hate speech, dangerous content).
- Obedience – Adhere to academic integrity guidelines (no plagiarized essays).
- Compliance & Privacy – Respect FERPA if used in class settings.
- Throughput – Handle high interaction volumes efficiently.
- Accuracy (Conditional) – If referencing facts or academic concepts, ensure correctness.
Escalation Paths & Academic Integrity Considerations:
- Detect and discourage academic dishonesty (e.g., requests for full essays).
- Provide brainstorming resources but avoid giving direct answers to assignments.
- Log and restrict access if users prompt harmful or unethical content.
C. Student-Facing Advising & Enrollment Chatbot
Example: Automated Academic Advising Assistant – Course Selection, Major Guidance, Degree Progression Support.
Priorities:
- Accuracy – Must provide correct course requirements, advising info.
- Compliance & Privacy – Follow FERPA for student records and academic plans.
- Bias & Fairness – Must not discourage any demographic from pursuing certain majors.
- Obedience – Adhere to advising protocols; should not override official guidance.
- Security & Safety – Prevent unauthorized data access or sharing.
- Toxicity – Maintain supportive, respectful language.
- Throughput – Handle peak advising periods smoothly.
Escalation Paths & Mental Health Risk Detection:
- Share academic support resources (tutoring, workshops).
- Refer students to wellness services or faculty office hours for stress or mental health concerns.
- Escalate complex issues to human advisors for personalized support.
Scenario 3: Public-Facing Chatbot
Example: ASU Facts Bot - Used by the general public to learn high-level facts about ASU.
Priorities:
- Bias & Fairness – Avoid discriminatory or unfair treatment of any user group.
- Toxicity – Prevent offensive, inappropriate, or harmful language.
- Compliance & Privacy – Adhere to legal and institutional regulations (e.g., FERPA, HIPAA).
- Accuracy – Provide correct and up-to-date information.
- Obedience – Follow internal guidelines and avoid unauthorized disclosures or promises.
- Security & Safety – Protect against malicious exploitation and data breaches.
- Throughput – Handle high volumes of concurrent requests without delays.
Evaluation Methodologies

Methodology 1: Automated Testing & Evaluation
Example: Ethical AI Engine
Use Case: Rapidly assess model performance (accuracy, bias, toxicity) using prepared test data and a stable API.
Goal: Quickly identify performance issues and harmful behaviors during the development phase.
Prerequisites: Teams interested in conducting automated testing are encouraged to consult the AI Acceleration – Data Science team for guidance. To successfully run automated tests, you will also need:
- Ensure the GenAI solution has a stable API or testing interface allowing programmatic access.
- Prepare relevant test datasets reflecting actual user queries and edge cases.
- Define your success metrics (e.g. accuracy, toxicity, bias, etc.).
Methodology 2: Manual Testing
Example: Red Teaming
Use Case: Intentionally stress-test the bot with adversarial prompts—offensive, manipulative, or privacy-infringing.
Goal: Identify vulnerabilities in the bot’s guardrails and ensure it won’t provide harmful or restricted content.
Prerequisites:
- Clear objectives: Define what specific risks or failure modes you want to test.
- A list of red teaming questions or prompts detailing how to test these scenarios. Example scenarios include: Safety & Abuse, Model Jailbreaking, Prompt-Injections, Bias & Fairness.
Methodology 3: Pilot Run / Field Experiment
Example: A/B Testing
Use Case: Launch a limited version to a small user group or specific department.
Goal: Gather real-world user feedback, uncover unanticipated issues, and validate performance before a wider release. Think of the pilot run or field experiment as the ultimate validation for generative AI solutions—especially high-stakes projects. We strongly recommend that all critical GenAI initiatives conduct pilot experiments before full rollout.
Prerequisites:
- Stakeholder Alignment: Agreement from departments or teams involved, ensuring they understand the experimental rollout.
- Metrics & Data Collection: Infrastructure to track engagement, user actions, errors, and performance metrics for both test and control groups.
- Feedback Loop: Tools and processes to collect user feedback (e.g., surveys, comments, or complaint forms) and automatically log interactions.
Evaluation Matrix
Use the following matrix to understand the minimum level of evaluation you should perform for each common scenario.
Scenario | Automated Testing & Evaluation | Manual Testing | Pilot Run / Field Experiment |
Internal AI Assistant | Required | Suggested | Suggested |
Student-Facing Chatbot | Required | Required | Suggested |
Public-Facing Chatbot | Required | Required | Required |
Conclusion
By following the steps in this guide, ASU Enterprise Technology teams can ensure their generative AI solutions are safe, compliant, and aligned with university values. Keep in mind that adequate planning is critical:
- Automated Testing: Plan for at least 4 weeks of model and data evaluations.
- Red Teaming: Allocate 1–2 weeks for adversarial stress tests.
- Field Experiment / Pilot Run: Reserve 2 months to gather real-world feedback and validate performance.
Making time for thorough evaluation early in the development cycle helps prevent costly issues later and contributes to reliable, responsible AI deployments at ASU.
Keep Reading
Generative AI Tool Pre-Release Evaluation Guide
Arizona State University remains dedicated to responsible, principled innovation when deploying generative AI solutions, including chatbots. This guide ensures each project aligns with ASU’s values by mitigating potential risks—such as misinformation, bias, toxicity, and compliance lapses—using rigorous methods like automated testing, red teaming, and pilot experiments. In doing so, we uphold accuracy, fairness, and user trust while enhancing digital experiences across the university.
Agents in Generative AI
Agents in generative AI are semi-autonomous entities that collaborate and interact dynamically, allowing them to solve complex problems and combine specialized capabilities for greater efficiency and adaptability.
How to Use Knowledge Base (RAG)
Explore an overview of the Knowledge Base and Retrieval Augmented Generation (RAG) methods. Learn about the different types of Knowledge Base retrieval and understand the distinctions between the Knowledge Base and system prompts.
CreateAI Platform Available LLM Models
We are proud to offer 40+ models including multi-modal (voice, image, text) for the ASU community to access securely on the CreateAI Platform. Users can find the following models available for experimentation and use in Model Comparison, ASU GPT, and MyAI Builder (access request required). Updated: March 25, 2025.