Senior Data Scientist

Microsoft
United States, Washington, Redmond
Nov 02, 2025
Overview

M365 Copilot Cadets (Customer & Analytics-Driven Eval Team) turns real-world customer feedback into evaluation datasets, rubrics, and insights that measurably improve Microsoft 365 Copilot quality. We connect customer scenarios, analytics, and rigorous evaluation frameworks to power a continuous feedback flywheel across Microsoft 365 Copilot and accelerate product improvements.

As a Senior Data Scientist on Cadets, you will own evaluation analytics end-to-end: curate datasets from customer and production signals; author binary-first rubrics; build LLM (Large Language Model)-as-judge graders; and work on high-quality synthetic data generation to scale evaluations, drawing on experience with human-match rates. You'll partner with PM/Eng/Design and VIP customers to ship quality gains and AI features with confidence.

You'll Thrive Here If You Have:

Evaluation proficiency for LLM/agent systems: dataset curation, rubric design, human-in-the-loop grading, judge prompts with quantitative agreement goals.
Analytics and experimentation skills (statistical inference, A/B testing), plus Python/SQL for large-scale trace analysis.
LLM fundamentals: prompt engineering, few-shot design, retrieval metrics, multi-turn/agent trace evaluation.
Data quality mindset: trace hygiene, metadata design, policy/PII awareness, and principled guardrails.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Evaluation & Feedback Analysis
Convert multi-source feedback (dogfood, VIP customers, production traces) into a prioritized dataset of 10-100 tasks per scenario, each with prompts and golden outputs; maintain a living failure taxonomy prioritized by volume × impact × fixability.

Rubrics & LLM-as-Judge
Author crisp, binary-first rubrics across 7-30 dimensions (e.g., correctness/completeness, refusal calibration, tool-use quality, formatting/contract, persona/tone, trace hygiene).
Build grader prompts (with few-shots and counterexamples) that achieve 80% human-match rate, track TPR/TNR on held-out sets, and prevent reward hacking.

Synthetic & Human-Labeled Data
Design structured tuples to scale high-signal synthetic data; orchestrate vendor/partner annotation sprints and live calibrations to align shared judgment.
Ensure datasets are reproducible with linked artifacts and robust metadata/trace hygiene.

Customer-Grounded Scenarios
Partner with PMs/solution architects to co-develop evals with VIP customers so tasks reflect real outcomes and workflows; quantify lift from fixes and inform the next hill-climb.

Team Leadership & Ways of Working
Co-own the Cadets "feedback flywheel" with PM/Eng (instrumentation, taxonomy, guardrails vs. evaluators) and help operationalize weekly checklists, change logs, and judge refresh cadence.