Advanced Prompt Engineering: From Basic to Production-Grade

Master the techniques that separate amateur prompts from production systems—chain-of-thought, structured outputs, model-specific optimization, and prompt architecture.


Beyond "Please Help Me With..."

Prompt engineering has evolved from a niche skill to a professional discipline. The difference between a mediocre prompt and an expert one can mean 18% vs 79% accuracy on complex tasks. This post covers the techniques that production systems use to get reliable, high-quality outputs from LLMs.

From research: "Encouraging the model to break down tasks step by step can significantly improve accuracy. Instructing it to 'Let's think step by step' has been shown to increase accuracy from 18% to 79% on tasks like MultiArith."

Chain-of-Thought Prompting

The Core Technique

Chain-of-thought (CoT) prompting enables complex reasoning by generating intermediate steps before the final answer.

Why CoT works: LLMs are fundamentally next-token predictors. When asked to produce an answer directly, the model must compute the solution "in its head" (within the forward pass) and output only the conclusion. This is like asking a human to solve "347 × 89" and say only the final answer—possible but error-prone. CoT gives the model "scratch space": by generating intermediate steps as tokens, each step becomes part of the context for the next, reducing the cognitive load per step.

When CoT helps most: CoT provides the largest gains on multi-step problems—arithmetic, logical reasoning, code generation—where each step depends on previous ones. It helps less for knowledge recall ("What year did X happen?") or pattern matching tasks where the answer is direct. In some cases, CoT can actually hurt: for very simple questions, forcing step-by-step reasoning adds tokens without improving accuracy.

The computational tradeoff: CoT generates more tokens, which costs more and takes longer. For a simple question, this overhead may not be worthwhile. For complex reasoning, the accuracy improvement vastly outweighs the cost. Production systems often route queries: simple questions get direct answers, complex ones get CoT.
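
One way to implement that routing, sketched below with a hypothetical classify_complexity heuristic and an assumed llm client (in production the classifier might itself be a small, cheap model):

Python
def classify_complexity(question: str) -> str:
    # Hypothetical heuristic: questions that look multi-step get CoT.
    multi_step_markers = ["how many", "calculate", "step", "why", "compare"]
    if any(marker in question.lower() for marker in multi_step_markers):
        return "complex"
    return "simple"

def answer(question: str) -> str:
    if classify_complexity(question) == "complex":
        prompt = f"{question}\nLet's solve this step by step."
    else:
        prompt = question
    return llm.generate(prompt)  # llm is an assumed model client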

Without CoT:

Code
User: If a store has 23 apples and sells 17, then receives
      a shipment of 12, how many apples does it have?
Model: 18 [sometimes wrong, no reasoning visible]

With CoT:

Code
User: If a store has 23 apples and sells 17, then receives
      a shipment of 12, how many apples does it have?
      Let's solve this step by step.
Model: Let me work through this:
       1. Starting apples: 23
       2. After selling 17: 23 - 17 = 6
       3. After shipment of 12: 6 + 12 = 18
       The store has 18 apples.

Zero-Shot vs Few-Shot CoT

Zero-Shot CoT: Just add "Let's think step by step" or similar instruction.

Code
Solve this problem. Think through it step by step before
giving your final answer.

[Problem]

Few-Shot CoT: Provide examples of the reasoning process you want.

Code
Problem: Sarah has 5 marbles. She gives 2 to Tom and
receives 3 from Jane. How many does she have?

Reasoning:
- Start: 5 marbles
- Give 2 to Tom: 5 - 2 = 3
- Receive 3 from Jane: 3 + 3 = 6
Answer: 6 marbles

Problem: [Your actual problem]

From the research: "Best practices for CoT prompting include providing clear logical steps in the prompt as well as a few examples to guide the model. Combining CoT with few-shot prompting can be particularly effective for complex tasks."

Self-Consistency

Generate multiple reasoning chains and vote on the answer.

The intuition: If a problem has one correct answer, different valid reasoning paths should converge on it. If you ask the same question multiple times with temperature > 0, you'll get different reasoning chains—but if most chains arrive at "42", that's probably right. Chains that arrive at different answers likely contain errors. Self-consistency exploits this: sample multiple times, take the majority vote.

Why temperature matters: Temperature = 0 gives deterministic output; you'd get the same answer every time, so voting is useless. Temperature = 0.7-1.0 introduces diversity—different phrasings, different reasoning approaches—while staying coherent enough to produce valid answers. Too high (1.5+) and answers become nonsensical; too low and you lose the diversity that makes voting powerful.

Python
from collections import Counter

def self_consistent_answer(prompt, n_samples=5):
    """Sample several reasoning chains and return the majority answer."""
    # llm.generate and extract_final_answer are placeholders for your
    # model client and answer-parsing helper
    answers = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=0.7)
        answer = extract_final_answer(response)
        answers.append(answer)

    # Return the most common answer across samples
    return Counter(answers).most_common(1)[0][0]

The cost tradeoff: Self-consistency requires N model calls instead of 1. For important decisions where accuracy matters more than cost, this is worthwhile—5 calls giving 95% accuracy beats 1 call giving 70% accuracy. For high-volume, low-stakes queries, the overhead may not justify the gain.

From research: "Self-consistency prompting generates multiple reasoning paths and then selects the most consistent answer. This is particularly effective for tasks involving arithmetic or common sense."

Tree of Thoughts

For complex problems, explore multiple reasoning branches:

Code
Problem: [Complex problem]

Generate 3 different approaches to solving this:

Approach 1: [Reasoning path A]
Evaluation: How promising is this approach? (1-10)

Approach 2: [Reasoning path B]
Evaluation: How promising is this approach? (1-10)

Approach 3: [Reasoning path C]
Evaluation: How promising is this approach? (1-10)

Select the most promising approach and continue reasoning...

From research: "Tree-of-Thoughts is an advanced strategy that generalizes chain-of-thought. Instead of a single linear chain of reasoning, it encourages the model to explore multiple branches of reasoning at each step—treating the problem like a decision tree."

Structured Outputs

Why Structure Matters

Structured outputs are consistent, parseable, and ensure all required information is included. For production systems, this is non-negotiable.

The production problem: Free-form LLM outputs are unpredictable. Ask for "an analysis" and you might get a paragraph, bullet points, or a numbered list—depending on the model's mood. Your downstream code needs to parse this, and parsing free text is fragile. Structured outputs guarantee a specific format: always JSON, always matching your schema, always parseable. This transforms LLM calls from "hopefully this works" to "reliably returns the data I need."

Structured outputs vs prompt engineering for JSON: You could ask the model to "respond in JSON format with fields X, Y, Z." This mostly works, but sometimes the model forgets a field, adds extra ones, or produces syntactically invalid JSON. Structured Outputs (OpenAI) and .with_structured_output() (LangChain) use constrained decoding—the model literally cannot produce tokens that would violate your schema. This is enforcement at the generation level, not a polite request.
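
The LangChain path looks roughly like this (a sketch; it assumes the langchain-openai package, and the Sentiment model is illustrative):

Python
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Sentiment(BaseModel):
    label: str        # positive, negative, neutral
    confidence: float

llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(Sentiment)

result = structured_llm.invoke("The battery died after two days.")
print(result.label, result.confidence)  # a Sentiment instance, not raw text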

When to use structured outputs: Always use them when you need to programmatically process the response. Use free-form when you're displaying directly to users and want natural prose. For analysis tasks that extract specific fields, structured outputs are essential.

JSON Mode vs Structured Outputs

From OpenAI: "Structured Outputs is the evolution of JSON mode. While both ensure valid JSON is produced, only Structured Outputs ensure schema adherence."

JSON Mode (older approach):

Python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"}
)
# May produce valid JSON but not guaranteed to match your schema

Structured Outputs (recommended):

Python
from pydantic import BaseModel

class ProductReview(BaseModel):
    sentiment: str  # positive, negative, neutral
    score: float    # 0.0 to 1.0
    key_points: list[str]
    recommendation: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": review_text}],
    response_format=ProductReview
)

From OpenAI: "On complex JSON schema following evals, the new model with Structured Outputs scores a perfect 100%, compared to less than 40% for gpt-4-0613."

Schema Design Best Practices

Good schema:

Python
from typing import Literal

from pydantic import BaseModel, Field

class AnalysisResult(BaseModel):
    summary: str = Field(description="2-3 sentence summary")
    entities: list[str] = Field(description="Named entities found")
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)

Tips:

  1. Use Literal for enumerated values
  2. Add Field(description=...) to guide the model
  3. Use constraints (ge, le, min_length)
  4. Keep schemas focused—don't ask for everything in one call

XML Structure for Claude

From research: "Claude's documentation recommends using XML tags to separate parts of the prompt template or large prompts. This method has proven effective in practice."

Code
<task>
Analyze the following customer feedback and extract insights.
</task>

<feedback>
{{customer_feedback}}
</feedback>

<output_format>
Respond with:
- sentiment: positive/negative/neutral
- main_issues: list of issues mentioned
- suggested_actions: list of recommended responses
</output_format>

Few-Shot Prompting

When to Use Few-Shot

Few-shot prompting provides examples that demonstrate the desired behavior:

Code
Classify the sentiment of these reviews:

Review: "Amazing product, exceeded expectations!"
Sentiment: positive

Review: "Arrived broken, waste of money"
Sentiment: negative

Review: "It works as described, nothing special"
Sentiment: neutral

Review: "{{new_review}}"
Sentiment:

From research: "Examples communicate nuances that instructions alone can't capture. The AI learns from patterns in your examples."

Few-Shot Best Practices

1. Choose representative examples: Cover the range of cases you expect, including edge cases.

2. Order matters: Place similar examples near the query. Some models have recency bias.

3. Balance classes: If classifying, include equal examples of each class.

4. Show the format exactly: If you want JSON output, show JSON in examples.

5. Keep examples concise: Longer isn't better—focus on clarity.

Dynamic Few-Shot Selection

For production systems, select examples based on similarity:

Python
def get_dynamic_examples(query, example_store, k=3):
    # embed() and cosine_similarity() are placeholders for your embedding
    # model and similarity function
    query_embedding = embed(query)

    # Score every stored example against the query
    similarities = [
        cosine_similarity(query_embedding, ex['embedding'])
        for ex in example_store
    ]

    # Keep the k most similar examples (sort on score only, to avoid comparing dicts)
    top_k = sorted(zip(similarities, example_store),
                   key=lambda pair: pair[0], reverse=True)[:k]
    return [ex for _, ex in top_k]

# Build prompt with relevant examples
examples = get_dynamic_examples(user_query, example_db)
prompt = format_few_shot_prompt(examples, user_query)

Role and Persona Prompting

The Research Reality

Research on persona prompting is mixed:

From PromptHub: "The research is torn with regard to how effective role prompting is. Papers like 'When A Helpful Assistant Is Not Really Helpful' make a strong case against role prompting, saying it can even lead to a degradation in performance."

However: "On the other side are papers like 'ExpertPrompting' and 'Better Zero-Shot Reasoning with Role-Play Prompting' which show increased performance with specific types of role prompting."

When Personas Help

Good for open-ended tasks:

Code
You are a senior software architect reviewing code for
security vulnerabilities. Examine this code with particular
attention to:
- Input validation
- SQL injection
- Authentication bypass

[code]

Less useful for factual tasks:

Code
# This probably doesn't help:
You are an expert mathematician. What is 17 × 24?

# The model already knows math—the persona adds nothing

From research: "Persona prompting is effective for open-ended tasks (e.g., creative writing). It's generally not beneficial for accuracy-based tasks (e.g., classification), especially with newer models."

Effective Role Prompting

Be specific about expertise:

Code
# Vague (less effective):
You are an expert programmer.

# Specific (more effective):
You are a security engineer specializing in web application
penetration testing with 10 years of experience in OWASP
vulnerabilities.

Define the task context:

Code
You are reviewing this code as part of a pre-deployment
security audit. The application handles financial transactions
and must comply with PCI-DSS requirements. Flag any issues
that would fail compliance.

Modern Guidance

From Claude's documentation: "XML tags and heavy role prompting are less necessary with modern models. Start with explicit, clear instructions."

The recommendation: "The best prompt isn't the longest or most complex. It's the one that achieves your goals reliably with the minimum necessary structure."

ReAct: Reasoning + Acting

The ReAct Framework

ReAct combines reasoning with tool use for agentic applications:

From the research: "ReAct prompts LLMs to generate verbal reasoning traces and actions for a task. Reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources."

ReAct Prompt Structure

Code
You have access to the following tools:
- search(query): Search the web for information
- calculate(expression): Evaluate mathematical expressions
- lookup(term): Look up a term in the knowledge base

Answer the user's question using this format:

Thought: [What do I need to figure out?]
Action: [tool_name(parameters)]
Observation: [Result from the tool]
... (repeat Thought/Action/Observation as needed)
Thought: I now have enough information to answer
Answer: [Final answer]

Question: {{user_question}}

Example Execution

Code
Question: What is the GDP per capita of France in 2024?

Thought: I need to find France's GDP and population for 2024.
Action: search("France GDP 2024")
Observation: France's GDP in 2024 is approximately $3.05 trillion.

Thought: Now I need the population.
Action: search("France population 2024")
Observation: France's population in 2024 is approximately 68 million.

Thought: I can calculate GDP per capita now.
Action: calculate("3050000000000 / 68000000")
Observation: 44852.94

Thought: I now have enough information to answer.
Answer: France's GDP per capita in 2024 is approximately $44,853.

ReAct vs Native Function Calling

From research: "Initially, ReACT showed improved performance over other prompting techniques, particularly in complex, multi-step tasks, but was largely superseded in late 2023 by native function calling techniques supported by OpenAI, Anthropic, Mistral, and Google models."

When to use ReAct prompting:

  • Models without native function calling
  • Need visible reasoning traces
  • Educational/debugging purposes

When to use native function calling:

  • Production systems
  • Better reliability
  • Cleaner integration
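
For reference, native function calling with the OpenAI SDK looks roughly like this (a sketch; the get_weather tool is illustrative):

Python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)

# Instead of parsing Thought/Action text, the tool call arrives as structured data
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)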

Model-Specific Optimization

Claude vs GPT Differences

Context windows:

  • Claude 3.5 Sonnet: 200,000 tokens
  • GPT-4o: 128,000 tokens

Prompt structure: From research: "Claude's documentation recommends using XML tags to separate parts of the prompt. In contrast, separating system and user inputs for GPT often results in diminished performance. GPT tends to handle tasks more efficiently when all instructions are provided in a single system message."

Meta-prompting: "For Claude LLM, meta-prompting can significantly improve prompt effectiveness. In contrast, meta-prompting for GPT has not shown substantial improvements."
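
Meta-prompting here means using the model to critique and rewrite your prompts. A simple illustration (the wording is just one way to phrase it):

Code
Here is a prompt I currently use for sentiment classification:

<prompt>
{{current_prompt}}
</prompt>

Rewrite this prompt to be clearer and more reliable, and explain
what you changed and why.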

Claude-Specific Tips

Code
<instructions>
Your task is to analyze the document and extract key findings.
</instructions>

<document>
{{document_content}}
</document>

<output_requirements>
- Format as bullet points
- Include page references where applicable
- Highlight any numerical data
</output_requirements>

GPT-Specific Tips

Code
[System message - put everything here]
You are analyzing documents for key findings.

Instructions:
1. Read the entire document
2. Extract key findings as bullet points
3. Include page references
4. Highlight numerical data

Format your response as:
## Key Findings
- Finding 1 (p. X)
- Finding 2 (p. Y)

## Numerical Data
- Statistic 1: value

Writing Style Differences

From research: "Claude sounds more human right out of the box. OpenAI's models still overuse certain phrases like 'in today's ever-changing landscape' and 'let's dive in' that have become dead giveaways of AI-generated content."

To reduce AI-isms in GPT:

Code
Write in a direct, conversational tone. Avoid:
- Corporate buzzwords
- Phrases like "dive in", "leverage", "game-changer"
- Excessive bullet points
- Starting sentences with "I"

Prompt Caching and Optimization

Understanding Prompt Caching

From research: "Model providers like OpenAI and Anthropic use Prompt Caching to reduce costs by up to 90% and latency by 80%, primarily over large-context prompts."

How it works: "Processing a prompt involves converting text into numerical representations and computing relationships between every token. When a substantial portion remains static across requests, prompt caching stores the intermediate mathematical states (Key-Value pairs) so they don't need to be recomputed."
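
With Anthropic this is explicit: you mark large static blocks with cache_control, while OpenAI caches repeated prompt prefixes automatically. A rough sketch (LONG_STATIC_INSTRUCTIONS and user_query are placeholders):

Python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,        # large, rarely-changing block
        "cache_control": {"type": "ephemeral"}   # ask the API to cache this prefix
    }],
    messages=[{"role": "user", "content": user_query}]  # dynamic part
)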

Optimizing for Cache Hits

Structure prompts with static content first:

Python
# Good - static content first, dynamic content last
prompt = f"""
[System instructions - 2000 tokens, always the same]
{static_instructions}

[Context documents - often repeated]
{context}

[User query - changes each time]
{user_query}
"""

# Bad - dynamic content interspersed
prompt = f"""
User asks: {user_query}

Instructions: {static_instructions}

Context: {context}
"""

From research: "Structure prompts so that static or repeated content is at the beginning and dynamic content is at the end. For example, if you're doing RAG, place retrieved context at the end of the prompt and instructions above."

Long Prompt Challenges

Attention limitations: "Large language models use attention mechanisms that scale quadratically with input length. Doubling your prompt tokens can more than double processing time."

Recency bias: "Transformers naturally weight recent tokens more heavily, meaning critical information from early in a long prompt gets undervalued or ignored entirely. A 10,000-token prompt might effectively operate on just the last 2,000 tokens."

Mitigation strategies:

  1. Repeat critical information:
Code
[Important instruction at start]

[Long context...]

[Reminder: Remember to {important instruction}]

  2. Use clear section markers:
Code
=== CRITICAL REQUIREMENTS ===
[Most important instructions]

=== CONTEXT ===
[Background information]

=== TASK ===
[What to do]

  3. Summarize long contexts: Instead of including 50 pages, summarize to key points.

Production Prompt Architecture

Prompt Templates

Production systems can't use hardcoded prompt strings scattered across the codebase. The PromptBuilder class implements a builder pattern for composable, maintainable prompts.

Why a builder pattern?

  • Modularity: Each component (system, context, examples, task) is set independently
  • Reusability: Build common bases and customize for specific use cases
  • Testability: Test prompt components in isolation
  • Fluent API: Chain methods for readable construction

The builder outputs XML-tagged prompts (Claude's preferred format), but you can modify build() to output markdown sections for GPT or any other format. The key is separating prompt construction from prompt content.

Python
class PromptBuilder:
    def __init__(self):
        self.system = ""
        self.context = ""
        self.examples = []
        self.task = ""
        self.output_format = ""

    def set_system(self, system: str):
        self.system = system
        return self

    def add_context(self, context: str):
        self.context += f"\n{context}"
        return self

    def add_example(self, input: str, output: str):
        self.examples.append({"input": input, "output": output})
        return self

    def set_task(self, task: str):
        self.task = task
        return self

    def set_output_format(self, format: str):
        self.output_format = format
        return self

    def build(self) -> str:
        parts = []

        if self.system:
            parts.append(f"<system>\n{self.system}\n</system>")

        if self.context:
            parts.append(f"<context>\n{self.context}\n</context>")

        if self.examples:
            examples_text = "\n\n".join([
                f"Input: {ex['input']}\nOutput: {ex['output']}"
                for ex in self.examples
            ])
            parts.append(f"<examples>\n{examples_text}\n</examples>")

        if self.task:
            parts.append(f"<task>\n{self.task}\n</task>")

        if self.output_format:
            parts.append(f"<output_format>\n{self.output_format}\n</output_format>")

        return "\n\n".join(parts)

Version Control for Prompts

Prompts are code—they should be versioned, reviewed, and deployed with the same rigor. Store prompts as YAML files with metadata (version, model, description) alongside the template itself.

Benefits of YAML prompt files:

  • Git history: Track who changed what and why
  • Code review: Require approval for prompt changes
  • Rollback: Revert to previous versions if quality drops
  • Documentation: Metadata explains purpose without reading the prompt
  • Environment parity: Same prompt file across dev/staging/prod

The example below shows a complete prompt specification. The {{feedback}} placeholder gets filled at runtime. Examples are stored inline for self-contained testing.

YAML
# prompts/classification_v2.yaml
name: sentiment_classification
version: "2.1"
model: gpt-4o
temperature: 0
description: "Classify customer feedback sentiment"

system: |
  You are a sentiment analysis system. Classify text into
  positive, negative, or neutral categories.

template: |
  Classify the sentiment of this customer feedback:

  Feedback: {{feedback}}

  Respond with only: positive, negative, or neutral

examples:
  - input: "Great product, love it!"
    output: "positive"
  - input: "Terrible experience, never again"
    output: "negative"

A/B Testing Prompts

How do you know if a prompt change is actually better? Intuition is unreliable—a prompt that feels better might actually perform worse on edge cases. A/B testing provides data-driven answers.

The PromptExperiment class routes requests randomly to prompt variants and tracks success metrics. After enough samples to reach statistical significance, you can confidently pick the winner or discover that the change made no difference.

Key metrics to track:

  • Task success rate: Did users get correct/useful answers?
  • Latency: Longer prompts mean slower responses
  • Token usage: Cost implications
  • User satisfaction: Thumbs up/down, follow-up questions

The get_variant method implements weighted random selection. Start with 50/50 splits, then shift traffic to the winner as confidence builds.

Python
import random

class PromptExperiment:
    def __init__(self, variants: dict[str, str]):
        self.variants = variants
        self.results = {k: [] for k in variants}

    def get_variant(self) -> tuple[str, str]:
        # Random assignment across variants (uniform split)
        variant_name = random.choice(list(self.variants.keys()))
        return variant_name, self.variants[variant_name]

    def record_result(self, variant: str, success: bool, latency: float):
        self.results[variant].append({
            "success": success,
            "latency": latency
        })

    def analyze(self):
        for variant, results in self.results.items():
            if not results:
                continue  # no traffic recorded for this variant yet
            success_rate = sum(r["success"] for r in results) / len(results)
            avg_latency = sum(r["latency"] for r in results) / len(results)
            print(f"{variant}: {success_rate:.1%} success, {avg_latency:.2f}s avg")

Common Anti-Patterns

What Not to Do

1. Over-prompting:

Code
# Bad - unnecessary verbosity
Please kindly and helpfully analyze the following text
and provide a comprehensive, detailed, thorough analysis
of the sentiment expressed therein...

# Good - direct and clear
Analyze this text's sentiment. Return: positive, negative,
or neutral.

2. Contradictory instructions:

Code
# Bad
Be concise but comprehensive. Provide brief responses
with extensive detail.

# Good
Provide a 2-3 sentence summary followed by 3 key details.

3. Assuming model knowledge:

Code
# Bad - assumes model knows your schema
Parse this using our standard format.

# Good - explicit
Parse this into JSON with fields: name (string),
date (YYYY-MM-DD), amount (float).

4. Negative instructions only:

Code
# Less effective
Don't be verbose. Don't use jargon. Don't be formal.

# More effective
Write concisely using everyday language in a casual tone.

From research: "Prompt engineering best practices focus on being specific, providing clear context, examples, and data, defining the desired output, and giving instructions on what to do rather than what to avoid."

Evaluation and Iteration

Measuring Prompt Quality

Metrics to track:

  • Accuracy on test cases
  • Format compliance rate
  • Latency
  • Token usage / cost
  • User satisfaction (for production)

Build evaluation sets:

Python
eval_set = [
    {
        "input": "Test case 1",
        "expected": "Expected output 1",
        "tags": ["edge_case", "numeric"]
    },
    # ... more cases
]

def evaluate_prompt(prompt_template, eval_set):
    # llm.generate and evaluate_match are placeholders for your model
    # client and scoring function (exact match, regex, LLM-as-judge, etc.)
    results = []
    for case in eval_set:
        prompt = prompt_template.format(input=case["input"])
        output = llm.generate(prompt)

        results.append({
            "input": case["input"],
            "expected": case["expected"],
            "actual": output,
            "correct": evaluate_match(output, case["expected"]),
            "tags": case["tags"]
        })

    return results
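
From those per-case results you can roll up the metrics above, for example overall accuracy and a per-tag breakdown (a small helper sketch):

Python
from collections import defaultdict

def summarize(results):
    accuracy = sum(r["correct"] for r in results) / len(results)

    by_tag = defaultdict(list)
    for r in results:
        for tag in r["tags"]:
            by_tag[tag].append(r["correct"])

    print(f"Overall accuracy: {accuracy:.1%}")
    for tag, scores in by_tag.items():
        print(f"  {tag}: {sum(scores) / len(scores):.1%} ({len(scores)} cases)")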

Iterative Improvement

  1. Start simple: Basic prompt without techniques
  2. Identify failures: Run eval set, find patterns
  3. Add techniques: CoT for reasoning, examples for format
  4. Test changes: Compare against baseline
  5. Iterate: One change at a time

Conclusion

Advanced prompt engineering combines multiple techniques:

  1. Chain-of-thought for reasoning tasks
  2. Structured outputs for reliable parsing
  3. Few-shot examples for format and style
  4. Model-specific optimization for best results
  5. Caching-aware structure for efficiency

The key insight: prompting is an empirical discipline. What works depends on your specific task, model, and data. Build evaluation sets, measure rigorously, and iterate based on results.


Enrico Piovano, PhD

Co-founder & CTO at Goji AI. Former Applied Scientist at Amazon (Alexa & AGI), focused on Agentic AI and LLMs. PhD in Electrical Engineering from Imperial College London. Gold Medalist at the National Mathematical Olympiad.
