Signal // May 19, 2026

The benchmark is lying to you. So is the vendor.

May 24, 2026

AI agents defer to authority — and comply with unsafe requests when told to

LLMs assigned high-status personas mirror human power dynamics, including deference to authority and compliance with harmful instructions. If your health comms team uses AI in advisory board simulations or medical review workflows where authority gradients exist, you may have built in a compliance failure. Any agentic setup where a "Chief Medical Officer" persona can override safety guardrails carries this risk.

RAG hallucinations are driven by evidence integration — not retrieval

A controlled study on RAG-based systems using medical QA datasets found the failure point isn't what gets retrieved — it's how the model uses it. Correct evidence is retrieved and then overridden by prior training data. For health comms teams using RAG-based tools for literature synthesis or content generation: better retrieval won't fix this. The integration layer needs auditing.

Parallel LLM framework cuts omission errors 84% in long-document analysis

Sequential LLM processing buries early-document content by the time the model reaches the conclusion. A parallel evidence-anchoring framework cuts that omission rate by 84% on long-form tasks. That is directly relevant to any team using AI to analyse clinical study reports, systematic literature reviews (SLRs), or lengthy regulatory documents, where early-context bias silently distorts outputs.
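
For readers who want to see the general idea, here is a minimal sketch of parallel section-by-section processing. It is not the framework from the study. The function names are hypothetical, and `extract_evidence` is a stand-in that simply returns the first sentence of a section where a real pipeline would call a model. The point it illustrates is that each section is handled independently and keeps its position label, so early content is not crowded out by later content.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into sections of roughly max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sections: list[str] = []
    current = ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            sections.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        sections.append(current)
    return sections


def extract_evidence(section: str) -> str:
    """Hypothetical placeholder for a model call: returns the first sentence."""
    return section.split(". ")[0].rstrip(".") + "."


def analyse_in_parallel(text: str, max_chars: int = 2000) -> list[tuple[int, str]]:
    """Process every section independently, then merge in document order."""
    sections = split_into_sections(text, max_chars)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input sections
        findings = list(pool.map(extract_evidence, sections))
    # Anchor each finding to its section number so nothing loses its source
    return [(i + 1, finding) for i, finding in enumerate(findings)]


sample = "Early finding. More detail.\n\nLate finding. More."
anchored = analyse_in_parallel(sample, max_chars=30)
for number, finding in anchored:
    print(f"Section {number}: {finding}")
```

In this toy version the first section gets exactly the same treatment as the last one, which is the property a sequential pass over a long document tends to lose.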

AI agents score 40 points higher on benchmarks when coached on what to look for

Every model tested collapsed to near-identical accuracy floors once the prompt scaffolding was removed, which means vendor capability claims built on leaderboard scores may be systematically overstated. Before buying an AI agent for medical writing or regulatory workflows on the basis of published accuracy scores, ask to see the results without the coaching prompt.
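
If a vendor does share both sets of results, the comparison is simple arithmetic. The sketch below assumes you have the number of correct answers and the number of test items for each run. The class, the function, and the figures are all hypothetical placeholders for illustration, not results from the study or from any vendor.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    label: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        """Accuracy as a percentage of test items answered correctly."""
        return 100.0 * self.correct / self.total


def scaffolding_gap(with_prompt: EvalRun, without_prompt: EvalRun) -> float:
    """Percentage-point difference attributable to the coaching prompt."""
    return with_prompt.accuracy - without_prompt.accuracy


# Made-up numbers, for illustration only
coached = EvalRun("with coaching prompt", correct=85, total=100)
bare = EvalRun("without coaching prompt", correct=45, total=100)

gap = scaffolding_gap(coached, bare)
print(f"{coached.label}: {coached.accuracy:.1f}%")
print(f"{bare.label}: {bare.accuracy:.1f}%")
print(f"Gap: {gap:.1f} percentage points")
```

A large gap suggests the headline score reflects the prompt as much as the model, so the uncoached figure is the one to weigh when comparing products.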

AI generates structured biomedical abstracts — but more prompting means less accuracy

Adding entity-level prompting to an LLM pipeline generating structured biomedical abstracts degraded factual accuracy compared to simpler approaches. For health comms teams building LLM pipelines for scientific summaries: more elaborate prompting doesn't automatically mean better outputs. The counterintuitive finding has practical implications for anyone already running iterative prompt engineering on medical content.

That's it for this edition. Back next week.

— Ned

The Irreplaceables
