Prompt Engineering: What Actually Moves the Needle

Preamble (Soap-box) I have noticed a major increase in users complaining about usage limits, and bad output. This, in my opinion, is the result of new users entering this world, bad marketing, hype guides, and “experts” flooding the info marketing world with slop for the low low price of $N. If anyone tells you that you can be proficient with AI and prompting if you take their short course and 20hrs of practice. They are lying to you, and you are likely wasting your money.

This isn’t true for all courses and a lot of very well known, and reputable leaders in this space have courses. But. They are very clear on what they know, don’t know, still learning, etc… And none of them will hustle you for $49. A lot of their courses are hundreds or thousands of dollars. Why, because it represents years of knowledge and work.

Ok, here’s the actual article

Most prompt engineering advice reads like marketing copy. “Just be clear!” Great. Thanks.

What follows is what I’ve actually found moves output quality, tested across Anthropic, OpenAI, and open-weight models, in production workflows, not toy demos. If something here contradicts the conventional wisdom, good. The conventional wisdom is often vibes dressed up as methodology.

Use the API or Playground

Consumer-facing chat wrappers inject their own system prompts, apply content filters you can’t see, and silently modify your parameters. You’re not driving; you’re in the back seat making suggestions.

The playground or workbench environments (Anthropic Console, OpenAI Playground) give you direct control over temperature, top-p, system prompts, stop sequences, and output formatting. If you’re doing anything beyond casual Q&A, this is where you should be working.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    temperature=0.2,
    system="You are a penetration testing report writer. "
           "Output only valid JSON. No commentary.",
    messages=[
        {
            "role": "user",
            "content": "Summarize this finding: SQL injection in "
                       "/api/v2/users?search= parameter, PostgreSQL "
                       "backend, authenticated endpoint."
        }
    ]
)

Temperature at 0.2 for factual extraction. Temperature at 0.8+ for creative generation. Most people leave it at the default and wonder why their outputs feel inconsistent. The default isn’t wrong; it’s just not your decision, and it should be.

Understand the Prompt Layers

Every API call has up to three layers, and they’re not equal.

System prompt: sets the model’s identity, constraints, and behavioral frame. This is the most powerful lever you have. It runs before anything else and shapes how the model interprets everything that follows.
User prompt: your specific task, data, and instructions.
Assistant prompt (prefill): you can pre-populate the start of the model’s response to steer format and tone. This is wildly underused.

System:
You are a security analyst reviewing application logs.
Respond only in structured JSON.
Never speculate beyond what the log data supports.

User:
Analyze these authentication logs and identify anomalies:
<logs>
2025-08-01T03:14:22Z user=admin action=login status=fail ip=198.51.100.44
2025-08-01T03:14:23Z user=admin action=login status=fail ip=198.51.100.44
2025-08-01T03:14:24Z user=admin action=login status=fail ip=198.51.100.44
2025-08-01T03:14:25Z user=admin action=login status=success ip=198.51.100.44
2025-08-01T03:14:26Z user=admin action=change_password ip=198.51.100.44
</logs>

Assistant (prefill):
{"anomalies": [

That prefill forces JSON output from the first token. No preamble, no “Sure, I’d be happy to help.” The model continues from where you left off. This technique alone eliminates half the output-parsing headaches I see in production pipelines.

Keep Prompts Dense, Not Long

There’s a persistent myth that longer prompts are better because you’re “giving the model more to work with.” In practice, the signal-to-noise ratio matters more than length. A 200-token prompt with clear intent outperforms a 2,000-token prompt padded with hedging and redundancy.

This doesn’t mean stripping out necessary context. It means compressing. Every sentence should earn its place.

# Bloated
You are an expert in cybersecurity. I would really appreciate it if you
could take a look at this code and tell me if there are any potential
security vulnerabilities. Please be thorough and consider all possible
attack vectors. It would be great if you could explain your reasoning.

# Dense
You are a security code reviewer. Identify vulnerabilities in this code.
For each finding: state the vulnerability class (OWASP Top 10), the
affected line(s), severity (Critical/High/Medium/Low), and a one-line
fix. No preamble.

Same ask. Half the tokens. Twice the precision on output format. The bloated version invites the model to ramble because you rambled.

Always Provide Examples

Few-shot prompting is the single most reliable technique for controlling output quality. Not because it “teaches” the model, it already knows how to do most things you’re asking. It constrains the output space. You’re showing the model the exact shape of what “correct” looks like.

Classify these support tickets by urgency. Use this format exactly:

Ticket: "Can't log in, getting 500 error on /auth"
Classification: HIGH: authentication failure, service-impacting
Action: Route to on-call engineering

Ticket: "Logo looks pixelated on the about page"
Classification: LOW: cosmetic, non-blocking
Action: Add to design backlog

Now classify:
Ticket: "Customer data export returning empty CSV for accounts created after July 1"

One example is good. Two or three is better. Beyond five, you’re burning tokens for diminishing returns, unless your classification task is genuinely ambiguous, in which case more examples help disambiguate the edge cases.

The key insight: examples define boundaries. The model infers what you don’t want from what you do show.

Ground the Model With Real Data

LLMs are pattern-completion engines. They generate plausible text. That’s not the same as generating accurate text. The moment you need factual precision (customer data, system logs, vulnerability databases) you need to bring the data to the model, not ask the model to remember it.

This is the core idea behind RAG (Retrieval-Augmented Generation): retrieve real data first, then pass it into the prompt as context.

# Simplified RAG pattern
def analyze_with_context(query: str, documents: list[str]) -> str:
    context = "\n---\n".join(documents)

    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=2048,
        system=(
            "Answer questions using ONLY the provided documents. "
            "If the documents don't contain the answer, say so. "
            "Never fabricate information."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Documents:\n{context}\n\n"
                           f"Question: {query}"
            }
        ]
    )
    return response.content[0].text

The “never fabricate” instruction matters, but the real safeguard is the architecture. You’re not trusting the model’s parametric knowledge. You’re making it a reasoning layer over data you control.

Tool use and function calling push this further: the model can decide when it needs external data and request it mid-conversation. That’s where agentic patterns start.

Be Specific, Not Polite

You don’t need to say “please” or “I would appreciate it if.” Politeness is for humans. Precision is for models.

This isn’t about being rude; it’s about removing ambiguity. Every filler word is a token the model has to process and potentially mirror in style.

# Vague
Could you take a look at this code and let me know
if there's anything wrong with it?

# Specific
Review this Python function for:
1. SQL injection vulnerabilities
2. Missing input validation
3. Improper error handling that leaks stack traces

For each issue found, output:
- Line number
- Vulnerability type
- Severity (Critical/High/Medium/Low)
- One-line remediation

The specific version constrains the search space. The vague version is an open invitation. The model might focus on style, performance, naming conventions, or a dozen other things you didn’t care about.

Define Output Format Explicitly

If you need structured output, specify the structure. If you need machine-readable output, say so and show the schema. Don’t hope the model guesses right.

Analyze the following network scan results and output a JSON object
with this exact schema:

{
    "host": "string: target IP or hostname",
    "open_ports": [
        {
            "port": "integer",
            "service": "string: identified service name",
            "version": "string or null",
            "risk": "Critical | High | Medium | Low | Info"
        }
    ],
    "summary": "string: 2-3 sentence assessment",
    "recommended_actions": ["string"]
}

Do not wrap the JSON in markdown code fences.
Do not include any text outside the JSON object.

Two things happening here. The schema shows the exact structure, including types and enums. The negative instructions (“Do not wrap…”) prevent common failure modes. Models love wrapping JSON in triple backticks. If you’re parsing the output programmatically, that extra formatting breaks your pipeline.

For Anthropic’s API specifically, structured output via tool definitions gives you even stronger guarantees: the model is constrained to return valid JSON matching your schema.

Remove Conflicting Instructions

“Write a detailed summary.” That’s a contradiction. A summary is compressed. Detail is expanded. The model will pick one or try to split the difference; either way, you’ve introduced ambiguity in the one place you can’t afford it.

Watch for these in your prompts:

“Be concise but thorough”
“Creative but accurate”
“Simple but comprehensive”

Each of these asks the model to optimize for two opposing goals simultaneously. It can’t. Pick the priority, or sequence them as separate steps.

# Conflicting: model has to guess your priority
Provide a concise but comprehensive security assessment
of this application.

# Sequential: each step has one clear objective
Step 1: List every security finding. Be exhaustive. Include
low-severity issues.

Step 2: For the top 5 findings by severity, write a one-paragraph
assessment with remediation guidance.

Step 3: Write a 3-sentence executive summary suitable for
a non-technical stakeholder.

The sequential version produces better output at every step because each step has exactly one job.

Structure Your Input Data

When you’re feeding data into a prompt, format matters. Models parse structured data more reliably than freeform text. XML tags, JSON objects, and delimiters all help the model distinguish between your instructions and your data.

Analyze the following vulnerability scan results and prioritize
remediation.

<scan_results>
[
    {
        "id": "VULN-001",
        "type": "SQL Injection",
        "location": "/api/v2/search",
        "severity": "Critical",
        "cvss": 9.8
    },
    {
        "id": "VULN-002",
        "type": "Missing CSRF Token",
        "location": "/account/settings",
        "severity": "Medium",
        "cvss": 5.4
    },
    {
        "id": "VULN-003",
        "type": "Outdated TLS Configuration",
        "location": "*.example.com",
        "severity": "High",
        "cvss": 7.5
    }
]
</scan_results>

Prioritize by: exploitability first, then CVSS score.
Output as a numbered remediation plan.

XML tags (<scan_results>) are particularly effective for separating data from instructions. The model treats tagged content as a distinct input block. This matters when your data might contain text that looks like instructions. Without clear boundaries, the model can confuse data content with task directives.

One caution on CSVs: models lose positional accuracy in large tabule inputs. Beyond a few hundred rows, chunk the data or switch to JSON. I’ve seen models silently drop or transpose columns past row ~200 in CSV format.

Chain of Thought and Step-by-Step Reasoning

For complex tasks: multi-step analysis, code debugging, logic problems… explicitly ask the model to reason before answering. This isn’t a gimmick. The model’s reasoning tokens directly improve output quality for tasks that require intermediate steps.

A web application uses the following authentication flow:
1. User submits credentials to /api/login
2. Server validates against LDAP, returns a JWT
3. JWT is stored in localStorage
4. JWT is sent via Authorization header on subsequent requests
5. JWT expires after 24 hours, no refresh mechanism

Analyze this authentication flow for security weaknesses.
Think through each step before identifying vulnerabilities.
Consider the entire attack surface, not just the obvious issues.

The instruction to “think through each step” isn’t filler. It causes the model to allocate reasoning tokens to each component of the flow before synthesizing findings. Without it, the model tends to jump to the most obvious issue (localStorage XSS exposure) and underweight subtler problems (no refresh token means 24-hour session hijack windows, LDAP injection surface, no rate limiting mentioned).

Some models support extended thinking or reasoning modes natively. When available, use them for security analysis, code review, and architectural decisions. The quality difference is measurable.

The Prompt Structure That Works

After testing hundreds of variations, this is the skeleton I keep coming back to:

[System]
You are a {specific role} with expertise in {specific domain}.
{Behavioral constraints: what to never do.}
{Output format constraints.}

[User]
Context: {What the model needs to know about the situation.}

Task: {Exactly what you want done. One clear objective.}

Input data:
<data>
{Structured data, clearly delimited.}
</data>

Output requirements:
- {Format specification}
- {Content constraints}
- {What to include/exclude}

Examples:
<example>
Input: {sample input}
Output: {exact desired output}
</example>

Not every prompt needs every section. A quick question doesn’t need examples and delimited data. But when output quality matters (when you’re building a pipeline, generating reports, or doing security analysis) this structure consistently produces better results than freeform instructions.

Use the Right Model for the Job

Not every task needs the most capable model. A classification task that routes support tickets doesn’t need the same model you’d use for nuanced code review. Running every request through your most expensive model is like using a forensic lab to check if the door is locked.

Think about it in terms of the task:

Extraction and classification: smaller, faster models handle these well. Low temperature, structured output.
Analysis and reasoning: mid-tier models with chain-of-thought prompting. This is the sweet spot for most production workloads.
Complex generation, research, and multi-step reasoning: the largest models justify their cost here. Architecture decisions, novel code generation, security analysis with ambiguous inputs.

Token costs compound fast in production. A 10x cost difference between model tiers matters when you’re processing thousands of requests daily.

Iterate Like It’s Code

Prompt engineering is empirical. You don’t write a function once and ship it; you test it, find edge cases, and refine. Prompts deserve the same discipline.

# Basic prompt evaluation loop
test_cases = [
    {
        "input": "Failed login from 3 IPs in 2 minutes",
        "expected_severity": "HIGH"
    },
    {
        "input": "User changed display name",
        "expected_severity": "LOW"
    },
    {
        "input": "New API key generated for service account",
        "expected_severity": "MEDIUM"
    },
]

def evaluate_prompt(system_prompt: str, test_cases: list) -> float:
    correct = 0
    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-6-20250514",
            max_tokens=50,
            system=system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": f"Classify severity: "
                               f"{case['input']}"
                }
            ]
        )
        result = response.content[0].text.strip()
        if case["expected_severity"] in result:
            correct += 1
    return correct / len(test_cases)

Run the same prompt against the same inputs ten times. If you’re not getting consistent results, your prompt or weights are underspecified. Increase constraint, add examples, lower temperature, in that order. Don’t just start changing temperature when you really just needed to add a couple examples. Don’t add 10 examples but have temperature set to 0.9… etc.

Keep a prompt library, a journal, log, notes.

Version control it. When you find a prompt that works reliably, that’s an asset. Treat it like one.

What Doesn’t Work

A few things I see repeated in prompt engineering guides that don’t hold up in practice:

“Role-play prompts are always better.” “You are the world’s greatest security expert” doesn’t make the model more capable. A specific role with specific constraints does. “You are a security analyst reviewing AWS IAM policies for least-privilege violations.” That’s useful. The superlative adds nothing.
“Longer context is always better.” Retrieval relevance matters more than volume. Dumping your entire codebase into the context window doesn’t help if the relevant function is buried in 50K tokens of unrelated code.
“Prompt injection is solved.” It isn’t. Why do you think all reputable frontier models still have responsible reporting, and bounty programs? If your application passes untrusted user input into a prompt alongside system instructions, you have a prompt injection surface. Structural separation (system vs. user prompts), input sanitization, and output validation are all necessary. None of them are sufficient alone.

Prompt engineering is a skill, not a secret. The techniques here should be iterative, they are for me, I am updating this article for the third time in 18 months.

The actual edge comes from understanding why they work, which means understanding how these models process and attend to different parts of your input. That understanding compounds. The tricks don’t.