12 Questions to Ask Before Starting an AI Pilot

Most AI pilots are scoped on the wrong axis. The conversation starts with capabilities (what the model can do) when it should start with conditions (what has to be true around the model for the pilot to ship). The result is a procurement process that selects for the most impressive demo and a deployment process that discovers, six weeks in, that the pilot was scoped against assumptions nobody verified.

The questions below are the ones we ask before any Acme Consulting pilot, and the ones we tell clients to ask before signing with anyone. They cluster into four themes: Scope, Data, Quality, and Off-ramp. Each question has a 'good answer' and a 'bad answer.' If a vendor's answer to any of them is the bad answer, you have not yet found a problem with the model. You have found a problem with the engagement.

Theme 1: Scope (does this pilot have a real shape?)

Q1. What is the single workflow we are replacing or augmenting, end-to-end?

Why it matters: diffuse scope is the single biggest predictor of pilot purgatory. A workflow you can describe in a sentence is a workflow you can build for. A workflow that needs three sentences and a flowchart is a workflow that has not been clarified yet.

Good answer: 'Intake-to-first-draft for vendor NDAs, roughly 120 per month, owned by the contracts team. The output is a redlined draft ready for partner review.'
Bad answer: 'We will explore legal AI use cases.'

Q2. What are the written acceptance criteria, with thresholds and a decision date?

Why it matters: without numeric pass/fail criteria, demos always 'look promising,' and the pilot's success becomes a matter of mood. The acceptance criteria are the only mechanism that converts 'we tried AI' into 'AI worked or did not.'

Good answer: 'At least 92% extraction accuracy on a 200-document holdout, p95 latency 30 seconds or less, signed off by 15 September.'
Bad answer: 'We will know it when we see it.'
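
One way to make criteria like these mechanical is to encode them as a small check that runs against the holdout results. The sketch below is illustrative only: the thresholds are copied from the example good answer, while the names (AcceptanceCriteria, evaluate_pilot) and the nearest-rank percentile definition are our assumptions, not a standard. The decision date is left to the contract rather than modelled in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    # Thresholds copied from the example good answer; adjust per pilot.
    min_accuracy: float = 0.92
    max_p95_latency_s: float = 30.0
    min_holdout_size: int = 200


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_pilot(correct_flags, latencies_s, criteria=AcceptanceCriteria()):
    """Return metrics and a pass/fail verdict for one holdout run."""
    n = len(correct_flags)
    accuracy = sum(correct_flags) / n if n else 0.0
    latency_p95 = p95(latencies_s) if latencies_s else float("inf")
    checks = {
        "holdout_size": n >= criteria.min_holdout_size,
        "accuracy": accuracy >= criteria.min_accuracy,
        "p95_latency": latency_p95 <= criteria.max_p95_latency_s,
    }
    return {
        "accuracy": accuracy,
        "p95_latency_s": latency_p95,
        "checks": checks,
        "passed": all(checks.values()),
    }


# Hypothetical run: 186 of 200 documents correct, 10 slow responses.
result = evaluate_pilot([True] * 186 + [False] * 14, [12.0] * 190 + [41.0] * 10)
print(result["passed"], result["checks"])
```

The point of the per-check breakdown is that a failed pilot tells you which threshold it missed, which is the conversation you want to have on the decision date.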

Q3. Who is the named business owner, and what KPI of theirs is moving?

Why it matters: IT-owned pilots without a P&L sponsor rarely survive renewal. If the system saves time but no one's KPI improves, the system gets de-prioritised quietly.

Good answer: 'COO is sponsor. KPI: cycle time on customer exceptions from 9 days to 3.'
Bad answer: 'The innovation team is leading it.'

Theme 2: Data (can the model actually do the work?)

Q4. What data does the model see, where does it live, and who signs the BAA or DPA?

Why it matters: HIPAA covered entities cannot share protected health information with a vendor without a signed Business Associate Agreement (BAA), and BAA scope is product-specific (the consumer ChatGPT product is not covered; the OpenAI API under a signed BAA is). Legal and financial firms have analogous confidentiality exposure.

Good answer: BAA executed, region-locked tenancy, DPA references sub-processors, training-data exclusion confirmed in writing.
Bad answer: 'We use the OpenAI API directly,' with no BAA, DPA, or enterprise tenancy behind it.

Q5. Is our data used to train the vendor's or any upstream model?

Why it matters: a vendor privacy policy reserving rights to share user data has, in recent US case law, been held to destroy attorney-client privilege over AI-assisted legal research. The cost of getting this wrong is not theoretical.

Good answer: contractual no-train clause covering provider and sub-processors, retention 30 days or less, zero-data-retention enabled where available.
Bad answer: 'Training is opt-out via a setting.'

Q6. What is the integration surface: read-only, write-back, or agentic?

Why it matters: write-back and agentic scope multiply both the blast radius of a failure and the cost of auditing it. A read-only system that proposes actions for human approval is a fundamentally different governance problem from one that takes actions autonomously.

Good answer: 'Phase 1 read-only with proposed-action queue. Write enabled only after 60-day audit and approval-tier review.'
Bad answer: 'It connects to everything via MCP from day one.'
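
To make the phased answer concrete, here is a minimal sketch of a proposed-action queue. It assumes a two-mode gateway of our own invention (Mode, ProposedAction and ActionGateway are hypothetical names, not any vendor's API): the model may always propose, but nothing executes unless write-back has been enabled and a named reviewer has approved the specific action.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Mode(Enum):
    READ_ONLY = "read_only"    # Phase 1: propose only
    WRITE_BACK = "write_back"  # enabled after audit and approval-tier review


@dataclass
class ProposedAction:
    description: str
    approved_by: Optional[str] = None
    executed: bool = False


@dataclass
class ActionGateway:
    mode: Mode = Mode.READ_ONLY
    queue: List[ProposedAction] = field(default_factory=list)

    def propose(self, description: str) -> ProposedAction:
        """The model can always add to the queue, in any mode."""
        action = ProposedAction(description)
        self.queue.append(action)
        return action

    def approve(self, action: ProposedAction, reviewer: str) -> None:
        action.approved_by = reviewer

    def execute(self, action: ProposedAction, perform: Callable[[str], None]) -> None:
        """Run the side effect only if the mode and the approval both allow it."""
        if self.mode is Mode.READ_ONLY:
            raise PermissionError("gateway is read-only; action stays in the queue")
        if action.approved_by is None:
            raise PermissionError("action has no named approver")
        perform(action.description)
        action.executed = True
```

Switching `mode` is then a deliberate, auditable change rather than a default, which is the difference between the good and bad answers above.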

Theme 3: Quality (will it hold up in production?)

Q7. Where is the human review gate, and what gets logged?

Why it matters: Article 14 of the EU AI Act (which covers high-risk systems) and the NIST AI RMF Generative AI Profile (NIST AI 600-1, July 2024) both call for documented human oversight. 'Users can flag bad outputs' is not human oversight in the regulatory sense.

Good answer: reviewer queue with rationale capture, immutable audit log, four-eyes for high-risk classes.
Bad answer: 'Users can flag bad outputs.'
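
As an illustration of what 'rationale capture' plus an 'immutable audit log' can look like, the sketch below chains each review entry to the previous one with a SHA-256 hash. This is a simplification under our own assumptions: the AuditLog class and its field names are hypothetical, and hash chaining makes tampering detectable rather than impossible, so a production system would still need write-once storage behind it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    """Stable SHA-256 over the entry body (keys sorted for determinism)."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditLog:
    def __init__(self):
        self._entries = []

    def append(self, reviewer: str, decision: str, rationale: str) -> dict:
        """Record a review decision, linked to the previous entry's hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "reviewer": reviewer,
            "decision": decision,
            "rationale": rationale,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Asking a vendor how they would implement `verify` is a quick way to find out whether 'immutable' in their answer means anything.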

Q8. How are prompts, models, and tools versioned, and how do you roll back?

Why it matters: silent provider model swaps are the most common cause of week-2 regression. A prompt that worked on Tuesday against the 'GPT-4' alias may quietly fail on Thursday against the same alias, because the provider has pointed it at a different snapshot.

Good answer: pinned model snapshot (e.g. gpt-4.1-2025-04-14), prompt registry, canary deployment with one-click rollback, change notifications.
Bad answer: 'We always use the latest model.'
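
A minimal sketch of what pinning and rollback can mean in practice follows. It assumes that a pinned snapshot is recognisable by a dated suffix, as in the gpt-4.1-2025-04-14 example; other providers name snapshots differently, so treat the pattern as an assumption. Release and ReleaseRegistry are hypothetical names, and a real deployment would add the canary stage and change notifications.

```python
import hashlib
import re
from dataclasses import dataclass

# Assumption: pinned snapshots end in a YYYY-MM-DD date; floating aliases do not.
PINNED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


@dataclass(frozen=True)
class Release:
    model: str
    prompt: str

    @property
    def prompt_sha(self) -> str:
        """Short content hash so each prompt version is identifiable in logs."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


class ReleaseRegistry:
    def __init__(self):
        self._history = []

    def deploy(self, release: Release) -> None:
        """Refuse floating aliases so the model cannot change underneath us."""
        if not PINNED_SUFFIX.search(release.model):
            raise ValueError(f"model {release.model!r} is not a pinned snapshot")
        self._history.append(release)

    @property
    def current(self) -> Release:
        return self._history[-1]

    def rollback(self) -> Release:
        """Drop the latest release and return to the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        return self.current
```

Because the model identifier and the prompt travel together in one release record, a rollback restores the pair that was actually evaluated, not just one half of it.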

Q9. What is the eval set and the regression cadence?

Why it matters: no eval means no way to know quality changed. 'We spot-check' is the AI equivalent of shipping production code without unit tests.

Good answer: 200 to 500 graded examples, weekly regression run, drift alerts on accuracy and refusal rate.
Bad answer: 'We spot-check.'
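
The weekly regression run in the good answer can be as simple as the sketch below. The record shape (a 'correct' and a 'refused' flag per graded example), the function name and the two-point tolerance are all illustrative assumptions; the real tolerance should come from the acceptance criteria agreed under Q2.

```python
def run_regression(graded, baseline_accuracy, baseline_refusal_rate, tolerance=0.02):
    """Compare this week's graded eval run against the agreed baseline.

    graded: list of dicts with boolean 'correct' and 'refused' keys.
    Returns the metrics plus the names of any metrics that drifted.
    """
    if not graded:
        raise ValueError("eval set is empty; nothing to compare")

    n = len(graded)
    accuracy = sum(1 for g in graded if g["correct"]) / n
    refusal_rate = sum(1 for g in graded if g["refused"]) / n

    alerts = []
    # Accuracy only alerts on a drop; an improvement is not a regression.
    if accuracy < baseline_accuracy - tolerance:
        alerts.append("accuracy")
    # Refusal rate alerts in both directions: a sudden fall can signal a
    # loosened safety posture just as a rise signals over-refusal.
    if abs(refusal_rate - baseline_refusal_rate) > tolerance:
        alerts.append("refusal_rate")

    return {"accuracy": accuracy, "refusal_rate": refusal_rate, "alerts": alerts}
```

Run on a fixed schedule against a fixed eval set, this is what lets you say a quality change happened in a particular week rather than discovering it from user complaints.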

Theme 4: Off-ramp (can we leave?)

Q10. Who owns the prompts, fine-tunes, embeddings, and logs at termination?

Why it matters: lock-in usually hides in derived artefacts, not the model. The vendor's prompts can be classified as their IP, the embeddings can be in their proprietary format, and the audit logs can be inside their platform with no export path.

Good answer: customer owns all derived assets, export in open formats within 30 days of termination, deletion certificate.
Bad answer: 'Prompts are part of our IP.'

Q11. What is the indemnity for AI errors, IP infringement, and regulatory penalties, and what is the cap?

Why it matters: generic SaaS master agreements do not cover training rights, hallucination harm, or HIPAA penalty pass-through. The standard 'twelve-month-fees' liability cap is meaningless if the regulator's penalty is multiples of the contract value.

Good answer: AI addendum with uncapped IP indemnity, regulatory-penalty pass-through, and named-peril coverage for in-scope hallucination harm.
Bad answer: 'Standard 12-month-fees cap, mutual.'

Q12. What is our exit plan if the vendor is acquired, repriced, or deprecates the model?

Why it matters: model deprecation cycles are now under 18 months. M&A in the AI vendor stack is constant. The plan you do not write is the plan that does not happen.

Good answer: source escrow or model-portability clause, 90-day price-lock on renewal, documented swap path to a second provider.
Bad answer: 'We will cross that bridge later.'

A short anonymised case

A 60-attorney mid-market firm signed a 12-month pilot for an AI contract-review tool. Nobody asked Q5 (training rights) or Q10 (prompt ownership). Six months in, the firm tried to migrate workflows to a second vendor for redundancy. The original vendor's terms classified the firm's curated playbook prompts as 'Service Improvements,' and the vendor refused to export them. Separately, a partner realised the platform's standard ToS reserved rights to use inputs for 'service quality.' The firm issued a privilege-preservation memo to clients and paused use on three matters.

Net cost of the two missing questions: roughly $180,000 in legal review and a 4-month delay. The technical pilot worked fine. The procurement design did not. This pattern repeats often enough that we treat the off-ramp questions as load-bearing rather than defensive.

How to use these questions

Send them to your shortlist before the next demo. Ask for written answers, not verbal ones. The answers are themselves a document: you will be able to point at them in three months when the pilot is going sideways and ask 'is this what we agreed?'

If a vendor cannot answer six or more of the twelve in writing, the engagement is unfit to start. If they answer all twelve well, the pilot is the easiest conversation you will have with them, and you can negotiate scope and price with confidence that the foundation is in place. In our experience, the pilot success rate tracks a vendor's score on this list to within roughly 10%.
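
For teams comparing several vendors, the rule of thumb above can be tallied with a few lines of code. This is a sketch under our own assumptions: each written answer is hand-graded as 'good', 'bad' or 'missing', the function name is hypothetical, and only the six-unanswered cut-off comes from the text.

```python
QUESTIONS = [f"Q{i}" for i in range(1, 13)]  # Q1 .. Q12


def score_vendor(answers):
    """Tally hand-graded written answers for one vendor.

    answers: dict mapping 'Q1'..'Q12' to 'good', 'bad' or 'missing'.
    Questions absent from the dict count as missing.
    """
    missing = [q for q in QUESTIONS if answers.get(q, "missing") == "missing"]
    bad = [q for q in QUESTIONS if answers.get(q) == "bad"]
    good = [q for q in QUESTIONS if answers.get(q) == "good"]
    return {
        "good": len(good),
        "bad": bad,          # each one is an engagement problem to resolve
        "missing": missing,
        "unfit_to_start": len(missing) >= 6,  # cut-off taken from the text
    }
```

The list of bad answers matters as much as the total: any single one marks a problem with the engagement that needs resolving before signature.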