Reasoning Reliability Audit of an Enterprise AI Model

This project tested how an enterprise AI model performs on advanced mathematical problems, nontrivial quantitative tasks, and multistep long-horizon tasks.

The verifiers that check the model's answers had to be deterministic: rule-based, with no use of AI.
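
To illustrate what a deterministic verifier can look like, here is a minimal sketch for a task with a single numeric answer. It is not the project's actual code: the function name, its parameters, and the choice of exact rational arithmetic are assumptions made for this example.

```python
from fractions import Fraction


def verify_numeric(answer: str, expected: str, tolerance: str = "0") -> bool:
    """Hypothetical rule-based check for a numeric answer.

    Both values are parsed as exact rationals, so the same inputs always
    give the same verdict, with no floating-point drift and no AI involved.
    """
    try:
        got = Fraction(answer.strip())
    except (ValueError, ZeroDivisionError):
        # Anything that does not parse as a number fails the check.
        return False
    return abs(got - Fraction(expected)) <= Fraction(tolerance)
```

With this sketch, `verify_numeric("0.5", "1/2")` passes, while a free-text answer such as `"about half"` fails because it cannot be parsed as a number.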

The deliverables were:

a suite of tasks in each category with their verifiers,
Python infrastructure that runs the verification, with a dashboard that summarizes the results, and
a report with recommendations.
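
The kind of summary a results dashboard might show can be sketched as a per-category pass rate. This is an assumption about the aggregation, not a description of the delivered dashboard; the function name and the `(category, passed)` input shape are hypothetical.

```python
from collections import defaultdict


def pass_rates(results):
    """Hypothetical summary: map each task category to its fraction of passes.

    `results` is an iterable of (category, passed) pairs, one per verified task.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, attempted]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}
```

For example, two math tasks with one pass and one quantitative task that passes would give a rate of 0.5 for the first category and 1.0 for the second.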

Skills and deliverables: Artificial Intelligence, Mathematics, Statistics, Data Analysis, Python, Technical Writing