How to run a performance review cycle with AI

People & HR · 3 AI tools · 7 steps · 6 friction points

A performance review cycle is the periodic process of evaluating employee work — collecting self-assessments, gathering peer or manager feedback, calibrating ratings across teams, and documenting outcomes in a way that informs compensation, promotions, and development plans. For most operators running small teams, it lands on the calendar every six or twelve months and immediately expands to fill more time than expected. Writing thoughtful feedback, keeping track of who submitted what, and synthesizing responses into fair ratings is grinding, detail-heavy work.

The reason people reach for AI here is obvious: performance reviews are high-stakes writing tasks wrapped around a data coordination problem. Drafting feedback for ten people requires consistency in tone and specificity in observation — exactly the kind of structured language work where LLMs visibly reduce effort. AI can also help managers who stare at a blank text box and don't know how to start, or who write the same vague paragraph for every direct report without noticing.

ChatGPT, Claude, and Gemini can meaningfully accelerate several parts of this workflow today. They're genuinely useful for drafting feedback narratives, generating review question sets, summarizing themes from raw peer feedback, and helping calibrate rating language against a rubric. You're still the one gathering the inputs, running the process, and deciding on ratings — but the writing and synthesis work gets faster.

AI walkthrough

How to do it with AI today

A practical walkthrough using ChatGPT, Claude, and other off-the-shelf LLMs — what they're good at, what you'll have to do by hand.

Tools that work for this
ChatGPT · Claude · Gemini
Step-by-step
1 Design your review structure: paste your existing rubric or job levels into Claude and ask it to generate a self-assessment template and a peer feedback form with five to eight specific, behavior-focused questions per competency.
2 Distribute the forms manually — Google Forms, Notion, or whatever your team uses — and collect all responses. This part is purely on you; the LLM can't reach your tools.
3 For self-assessments: copy each employee's raw self-assessment into ChatGPT and prompt it to identify the strongest concrete examples, flag vague or unsupported claims, and produce a structured one-page summary.
4 For peer feedback: copy all peer responses for one employee into Claude at once and ask it to synthesize recurring themes, surface contradictions, and draft a thematic summary paragraph suitable for a manager review.
5 Draft manager ratings: paste your rubric definitions and the synthesized feedback into the same session and prompt the LLM to suggest a rating with a one-sentence rationale — you review and override as needed.
6 Write the final review narrative: give the LLM the synthesized feedback, your rating decision, and any context about the employee's quarter, then prompt it to draft a two to three paragraph performance summary in your preferred tone.
7 Prepare for calibration: paste all draft ratings and rationale sentences across your team into a single session and ask ChatGPT to flag distribution anomalies, identify where language is inconsistent across similar ratings, and surface anyone who looks over- or under-rated relative to the rest.
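If you'd rather sanity-check the rating distribution outside the chat window before calibration, a short script can do the mechanical part. This is a minimal sketch, not part of the workflow above; it assumes a hypothetical `ratings.csv` with `employee` and `rating` columns, and a simple concentration threshold you'd tune to your team size.

```python
from collections import Counter
import csv

# Rating scale from the calibration prompt above
SCALE = ["Exceeds", "Meets", "Developing", "Below Expectations"]

def rating_distribution(path):
    # Count ratings per level from a CSV with columns: employee, rating
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    counts = Counter(row["rating"] for row in rows)
    return {level: counts.get(level, 0) for level in SCALE}, len(rows)

def flag_anomalies(dist, total, max_share=0.6):
    # Flag any level that absorbs an outsized share of all ratings --
    # a rough proxy for the "everyone gets Meets" clustering problem
    flags = []
    for level, n in dist.items():
        share = n / total if total else 0
        if share > max_share:
            flags.append(f"{level}: {n}/{total} ({share:.0%}) looks over-concentrated")
    return flags
```

The threshold is a judgment call: on a team of ten, 60% of ratings landing on a single level is worth a second look, but the LLM-drafted rationales are still what you'd review to decide whether the clustering is justified.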
Prompts you can copy
Here is our engineering competency rubric and three peer feedback responses for Maya. Synthesize the key themes into a two-paragraph summary a manager could use to write a formal review. Flag any contradictions between reviewers.
I need to write a performance review for a direct report who had a strong Q1 but struggled with cross-functional communication in Q2. Draft a 250-word narrative that acknowledges both, using specific behavior-focused language. Tone: direct, constructive, not punitive.
Below are the self-assessment question responses from six employees on my team. For each person, identify the one strongest concrete example they cited and the one area where their self-assessment seems unsupported by specifics.
Our rating scale is Exceeds, Meets, Developing, Below Expectations. Here are ten draft manager ratings and one-line rationales. Flag anyone who looks inconsistently rated compared to the rest of the group, and note where the language used for 'Meets' varies significantly across managers.
Generate a 10-question peer feedback form for a senior account executive role. Focus on observable behaviors, not personality traits. Each question should be answerable with a specific example. Avoid leading questions.
Reality check

Where this gets hard

The walkthrough above works — until your numbers change, the LLM hallucinates, or you have to re-paste everything next month.

No connection to your actual data — you manually copy self-assessments, peer responses, and rubric docs into the chat window for every single employee, every single cycle.
Context limits create real problems at scale — a team of fifteen with four peer reviewers each generates more text than fits cleanly in one session, forcing you to chunk and re-prompt.
Nothing carries forward between cycles — the prompt structure, tone calibration, and rating language you dialed in last review period lives in no one's memory except yours.
Output consistency drifts — the same prompt phrased slightly differently returns different structures, so the review for employee A looks nothing like the review for employee B unless you manually re-standardize.
Action items from the review disappear — the LLM helps you write the review, but there's no connection to your task tracker, calendar, or any follow-up system; development goals evaporate after the document is closed.
Calibration is still a manual spreadsheet problem — even with LLM-drafted rationales, you're exporting everything to a Google Sheet to compare ratings side by side; the LLM can't see the live view.
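The context-limit problem above is usually handled by batching: grouping one employee's peer responses into chunks that each fit comfortably in a single message, then summarizing chunk by chunk. A minimal sketch of that batching step, using a character budget as a stand-in for the model's real token limit (an assumption you'd adjust per model):

```python
def chunk_responses(responses, max_chars=8000):
    # Group peer-feedback strings into batches under a character budget,
    # so each batch can be pasted into one chat message and summarized,
    # with the per-batch summaries merged in a final pass.
    batches, current, size = [], [], 0
    for text in responses:
        if current and size + len(text) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches
```

Each batch gets its own synthesis prompt, and the per-batch summaries go into one final "merge these summaries" prompt; the trade-off is that cross-batch contradictions are harder for the model to spot.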

Tired of the friction?

Starch runs the whole workflow on live data — no copy-paste, no hallucinated numbers, no re-prompting next month.

See the Starch version →
Starch alternative

The same workflow on Starch

Starch is an agentic operating system — it builds and runs persistent apps and automations on your live business data. For performance reviews, that means the agent builds the workflow infrastructure once, connects it to the tools your team already uses, and keeps it running across cycles without starting from scratch each time.

Connect Notion from Starch's integration catalog and Slack via scheduled sync — the agent can pull existing documentation, prior review notes, and team communication context into each review draft automatically, rather than requiring manual copy-paste.
Use the Knowledge Management starter app to store your competency rubrics, rating scale definitions, and prior cycle summaries in one searchable place — so every future review draft references the same grounding documents, not whatever you happened to paste that day.
Use the Meeting Notes starter app to capture 1:1 and performance conversation transcripts throughout the year — the agent extracts action items and decisions, building an evidence base you can reference when drafting reviews instead of reconstructing from memory.
Describe the review tracker you need in plain English — 'build me an app that tracks review status by employee, shows who has submitted self-assessments, flags overdue peer reviewers, and lets me record final ratings with notes' — and Starch builds it without a spreadsheet or form tool.
Use the Project Management starter app to turn development goals from completed reviews into tracked tasks assigned to the right people, with due dates — so outcomes from the review cycle don't disappear after the document is filed.
Build an automation once — 'every review cycle, generate a calibration summary across all employees showing rating distribution and flag outliers' — and Starch runs it against your live data each time, not a manually assembled export.
Get closed-beta access →
Toolkit

Starch apps for this workflow


Run a performance review cycle on Starch
