Advanced Prompt Patterns: Tree-of-Thought, ReAct, and Self-Consistency

Hero image: A branching tree of glowing thought paths, some paths lit green (successful reasoning) and others red (dead ends), converging on a golden answer node

Chain-of-Thought prompting was a breakthrough — but it has a fundamental limitation. It follows a single reasoning path. If that path starts with a wrong assumption, every subsequent step is built on a faulty foundation. There's no backtracking, no exploration of alternatives, no way to course-correct.

The advanced prompting patterns we'll cover in this post address exactly this limitation. They were born from a simple question: what if the model could explore multiple reasoning paths, use external tools to verify its assumptions, and check its own work against alternative approaches?

These techniques — Tree-of-Thought, ReAct, Self-Consistency, meta-prompting, and more — represent the current frontier of prompt engineering. They're what separates a clever chatbot from a reliable AI system that can handle complex, multi-step tasks in production.

This is Part 5 of our Prompt Engineering Deep-Dive series. If you haven't read Parts 1-4, the techniques here build directly on system prompts, Chain-of-Thought, few-shot prompting, and structured output.

Tree-of-Thought (ToT): Exploring Multiple Paths

Chain-of-Thought follows one path: Step 1 → Step 2 → Step 3 → Answer. Tree-of-Thought explores a branching tree of possibilities, evaluates each branch, and prunes dead ends before committing to an answer.

How It Works

1. Generate multiple candidate next-steps at each reasoning point

2. Evaluate each candidate (is this step promising or a dead end?)

3. Select the most promising branches to continue

4. Backtrack from dead ends and explore alternatives

Architecture diagram comparing linear CoT (single path) to Tree-of-Thought (branching paths with evaluation and pruning)

flowchart TB
    START["Problem"]
    subgraph COT ["Chain-of-Thought (Linear)"]
        direction LR
        C1["Step 1"] --> C2["Step 2"] --> C3["Step 3"] --> CA["Answer"]
    end
    subgraph TOT ["Tree-of-Thought (Branching)"]
        direction TB
        T1["Step 1a"]
        T2["Step 1b"]
        T3["Step 1c"]
        T1 --> T4["Step 2a"]
        T1 --> T5["Step 2b"]
        T2 --> T6["Step 2c ✗"]
        T3 --> T7["Step 2d"]
        T4 --> T8["Answer ✓"]
        T5 --> T9["Dead end ✗"]
        T7 --> T10["Answer ✓✓"]
    end
    START --> COT
    START --> TOT
    style C1 fill:#3498db,stroke:#2980b9,color:#fff
    style C2 fill:#3498db,stroke:#2980b9,color:#fff
    style C3 fill:#3498db,stroke:#2980b9,color:#fff
    style CA fill:#3498db,stroke:#2980b9,color:#fff
    style T1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style T2 fill:#f39c12,stroke:#e67e22,color:#fff
    style T3 fill:#2ecc71,stroke:#27ae60,color:#fff
    style T4 fill:#2ecc71,stroke:#27ae60,color:#fff
    style T5 fill:#e74c3c,stroke:#c0392b,color:#fff
    style T6 fill:#e74c3c,stroke:#c0392b,color:#fff
    style T7 fill:#2ecc71,stroke:#27ae60,color:#fff
    style T8 fill:#2ecc71,stroke:#27ae60,color:#fff
    style T9 fill:#e74c3c,stroke:#c0392b,color:#fff
    style T10 fill:#6C63FF,stroke:#8B83FF,color:#fff
    style START fill:#6C63FF,stroke:#8B83FF,color:#fff
    style COT fill:#1a1a2e,stroke:#3498db,color:#fff
    style TOT fill:#1a1a2e,stroke:#2ecc71,color:#fff

Implementation

def tree_of_thought(problem: str, breadth: int = 3, depth: int = 3) -> str:
    """Explore multiple reasoning paths and select the best."""
    
    def generate_steps(context: str, n: int) -> list[str]:
        prompt = f"""Given this problem and progress so far:
{context}

Generate {n} different possible next steps. 
For each step, explain your reasoning.
Return as a numbered list."""
        return parse_steps(call_llm(prompt))
    
    def evaluate_step(context: str, step: str) -> float:
        prompt = f"""Evaluate this reasoning step:
Context: {context}
Step: {step}

Rate from 0.0 to 1.0:
- Is this step logically sound? 
- Does it make progress toward the solution?
- Does it avoid assumptions that could be wrong?

Return ONLY a number between 0.0 and 1.0."""
        try:
            return float(call_llm(prompt).strip())
        except ValueError:
            return 0.0  # Treat unparseable ratings as dead ends
    
    # BFS through reasoning tree
    candidates = [{"context": problem, "steps": [], "score": 1.0}]
    
    for level in range(depth):
        next_candidates = []
        
        for candidate in candidates:
            steps = generate_steps(candidate["context"], breadth)
            
            for step in steps:
                score = evaluate_step(candidate["context"], step)
                new_context = candidate["context"] + f"\nStep {level+1}: {step}"
                next_candidates.append({
                    "context": new_context,
                    "steps": candidate["steps"] + [step],
                    "score": candidate["score"] * score
                })
        
        # Keep top candidates (beam search)
        candidates = sorted(
            next_candidates, 
            key=lambda x: x["score"], 
            reverse=True
        )[:breadth]
    
    # Return the highest-scoring path
    best = candidates[0]
    return synthesize_answer(problem, best["steps"])
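
The implementation above leans on a few assumed helpers (`call_llm`, `parse_steps`, `synthesize_answer`). As one illustration, `parse_steps` might be sketched like this, assuming the model actually returns the numbered list the prompt asks for:

```python
import re

def parse_steps(text: str) -> list[str]:
    """Split a model response containing a numbered list
    ("1. ...", "2) ...") into individual step strings."""
    steps = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[\.\)]\s+(.*)", line)
        if match:
            steps.append(match.group(1).strip())
    return steps
```

Lines that don't start with a number are simply ignored, which makes the parser tolerant of preamble or commentary the model adds around the list.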

When to Use ToT

| Use Case | CoT Sufficient? | ToT Needed? |
|----------|-----------------|-------------|
| Simple math | Yes | No |
| Code debugging | Usually | For complex multi-file bugs |
| Architecture design | No | Yes — multiple valid approaches |
| Strategic planning | No | Yes — tradeoffs require exploration |
| Game solving (chess, puzzles) | No | Yes — search required |
| Creative writing | No | Yes — exploring different directions |

Cost consideration: ToT makes roughly 10-50x more API calls than a single CoT pass, since every candidate step costs one generation call plus one evaluation call. Use it only when the accuracy improvement justifies the cost.
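
To estimate the multiplier for your own settings, you can count the calls the implementation above makes. This back-of-the-envelope helper mirrors its loop structure (one generation call plus `breadth` evaluation calls per surviving candidate per level, plus one final synthesis call):

```python
def tot_call_count(breadth: int, depth: int) -> int:
    """Count API calls made by the tree_of_thought sketch above."""
    calls = 0
    candidates = 1  # we start with just the bare problem
    for _ in range(depth):
        # Each candidate: 1 generate_steps call + `breadth` evaluate_step calls
        calls += candidates * (1 + breadth)
        # Beam pruning keeps at most `breadth` candidates for the next level
        candidates = min(candidates * breadth, breadth)
    return calls + 1  # final synthesize_answer call
```

With breadth=3 and depth=3, that comes to 29 calls for a single problem, versus 1 for plain CoT.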

ReAct: Reasoning + Acting

ReAct (Reasoning + Acting) combines Chain-of-Thought reasoning with tool use. Instead of reasoning in isolation, the model thinks about what information it needs, uses tools to get it, observes the results, and continues reasoning.

The ReAct Loop

Thought: I need to check if the database table exists
Action: query_database("SHOW TABLES LIKE 'users'")
Observation: Table 'users' exists with columns: id, name, email, created_at
Thought: The table exists. Now I need to check if there's an index on email
Action: query_database("SHOW INDEX FROM users WHERE Column_name = 'email'")
Observation: No index found on email column
Thought: Missing email index explains the slow login query. I should recommend adding it.
Answer: Add an index on users.email — this will fix the O(n) scan on every login.

Implementation

import re

def react_agent(
    question: str,
    tools: dict[str, callable],
    max_steps: int = 10
) -> str:
    """ReAct agent: interleave reasoning and tool use."""
    
    tool_descriptions = "\n".join(
        f"- {name}: {func.__doc__}" for name, func in tools.items()
    )
    
    system = f"""You are a reasoning agent. For each step:
1. Thought: Reason about what you know and what you need
2. Action: Call a tool if needed (format: tool_name(args))
3. Observation: [Tool result will be inserted here]

Repeat until you have enough information to answer.
When ready, respond with: Answer: [your final answer]

Available tools:
{tool_descriptions}"""
    
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question}
    ]
    
    for step in range(max_steps):
        response = call_llm(messages)
        messages.append({"role": "assistant", "content": response})
        
        # Check if we have a final answer
        if "Answer:" in response:
            return response.split("Answer:")[-1].strip()
        
        # Parse and execute action
        action_match = re.search(r'Action:\s*(\w+)\((.*?)\)', response)
        if action_match:
            tool_name = action_match.group(1)
            tool_args = action_match.group(2)
            
            if tool_name in tools:
                result = tools[tool_name](tool_args)
                observation = f"Observation: {result}"
            else:
                observation = f"Observation: Error — tool '{tool_name}' not found"
            
            messages.append({"role": "user", "content": observation})
    
    return "Max steps reached without conclusion."

ReAct vs. Plain Tool Use

The key difference is that ReAct makes the reasoning explicit. In plain tool use, the model calls tools but doesn't show its reasoning about why it chose that tool or what it expects to find. ReAct forces the model to articulate its hypothesis before acting, which:

1. Improves tool selection — Thinking first reduces irrelevant tool calls

2. Enables debugging — You can read the thought trace to understand failures

3. Supports learning — The reasoning chain becomes training data for improvement
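
To make the `tools` argument of `react_agent` concrete: it maps tool names to callables, and the agent builds its tool list straight from each function's docstring. The `query_database` and `search_docs` tools below are stubbed illustrations, not real integrations:

```python
def query_database(sql: str) -> str:
    """Run a read-only SQL query and return the result as text."""
    # Stubbed for illustration; a real version would hit the database.
    return f"(results for: {sql})"

def search_docs(query: str) -> str:
    """Search internal documentation and return the top snippet."""
    return f"(top snippet for: {query})"

tools = {"query_database": query_database, "search_docs": search_docs}

# This is exactly how react_agent renders the tool list into the system prompt:
tool_descriptions = "\n".join(
    f"- {name}: {func.__doc__}" for name, func in tools.items()
)
```

Because the docstrings become the model's only description of each tool, writing them as clear one-line contracts directly improves tool selection.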

Self-Consistency: Majority Vote on Reasoning

Self-Consistency generates multiple independent reasoning chains for the same problem and takes the majority vote. It's based on the insight that correct reasoning paths are more likely to converge on the same answer.

Implementation

import collections

def self_consistent_answer(
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7
) -> dict:
    """Generate multiple reasoning paths and vote on the answer."""
    
    answers = []
    reasoning_chains = []
    
    for _ in range(n_samples):
        response = call_llm(
            prompt + "\nThink step by step, then give your final answer on the last line starting with 'ANSWER:'",
            temperature=temperature  # Higher temp for diversity
        )
        
        # Extract final answer
        lines = response.strip().split('\n')
        answer_line = [l for l in lines if l.startswith('ANSWER:')]
        if answer_line:
            answer = answer_line[-1].replace('ANSWER:', '').strip()
            answers.append(answer)
            reasoning_chains.append(response)
    
    if not answers:
        raise ValueError("No sample produced a parseable 'ANSWER:' line")

    # Majority vote
    counter = collections.Counter(answers)
    best_answer, vote_count = counter.most_common(1)[0]
    
    return {
        "answer": best_answer,
        "confidence": vote_count / len(answers),
        "total_votes": len(answers),
        "vote_distribution": dict(counter),
        "reasoning_chains": reasoning_chains
    }
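
The voting step in isolation, using the same 3-of-5 split shown in the diagram below (the sampled answers here are made up for illustration):

```python
import collections

# Suppose five sampled reasoning chains ended with these extracted answers:
answers = ["42", "41", "42", "42", "40"]

counter = collections.Counter(answers)
best_answer, vote_count = counter.most_common(1)[0]
confidence = vote_count / len(answers)
# best_answer is "42" with 3 of 5 votes, i.e. 60% confidence
```

The vote distribution itself is worth logging: a 3/1/1 split is a very different signal from 5/0, even though both return the same answer.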

When Self-Consistency Shines

Self-Consistency is most valuable when:

  • The problem has a single correct answer (math, classification, yes/no)
  • Individual CoT accuracy is in the 60-85% range (high enough to converge, low enough to benefit)
  • You can afford N times the API cost (typically N=5 to N=11)

| Single CoT Accuracy | Self-Consistency (N=5) | Improvement |
|---------------------|------------------------|-------------|
| 60% | ~78% | +18% |
| 70% | ~87% | +17% |
| 80% | ~94% | +14% |
| 90% | ~98% | +8% |

Diminishing returns above N=11. Research shows that going from 5 to 11 samples provides meaningful improvement, but 11 to 21 provides very little additional benefit.

flowchart TB
    PROMPT["Same Problem"]
    PROMPT --> R1["Chain 1<br/>T=0.7"]
    PROMPT --> R2["Chain 2<br/>T=0.7"]
    PROMPT --> R3["Chain 3<br/>T=0.7"]
    PROMPT --> R4["Chain 4<br/>T=0.7"]
    PROMPT --> R5["Chain 5<br/>T=0.7"]
    R1 --> A1["Answer: A"]
    R2 --> A2["Answer: B"]
    R3 --> A3["Answer: A"]
    R4 --> A4["Answer: A"]
    R5 --> A5["Answer: C"]
    A1 --> VOTE["Majority Vote"]
    A2 --> VOTE
    A3 --> VOTE
    A4 --> VOTE
    A5 --> VOTE
    VOTE --> FINAL["Final: A<br/>3/5 = 60% confidence"]
    style PROMPT fill:#6C63FF,stroke:#8B83FF,color:#fff
    style R1 fill:#3498db,stroke:#2980b9,color:#fff
    style R2 fill:#3498db,stroke:#2980b9,color:#fff
    style R3 fill:#3498db,stroke:#2980b9,color:#fff
    style R4 fill:#3498db,stroke:#2980b9,color:#fff
    style R5 fill:#3498db,stroke:#2980b9,color:#fff
    style A1 fill:#2ecc71,stroke:#27ae60,color:#fff
    style A2 fill:#f39c12,stroke:#e67e22,color:#fff
    style A3 fill:#2ecc71,stroke:#27ae60,color:#fff
    style A4 fill:#2ecc71,stroke:#27ae60,color:#fff
    style A5 fill:#e74c3c,stroke:#c0392b,color:#fff
    style VOTE fill:#9b59b6,stroke:#8e44ad,color:#fff
    style FINAL fill:#2ecc71,stroke:#27ae60,color:#fff

Meta-Prompting: Prompts That Write Prompts

Meta-prompting uses the LLM itself to generate, refine, and optimize prompts. Instead of manually iterating on prompt wording, you ask the model to help.

Pattern: Automatic Prompt Optimization

def optimize_prompt(
    initial_prompt: str,
    test_cases: list[dict],
    n_iterations: int = 5
) -> str:
    """Use the LLM to iteratively improve a prompt."""
    
    current_prompt = initial_prompt
    best_score = evaluate_prompt(current_prompt, test_cases)
    best_prompt = current_prompt
    
    for iteration in range(n_iterations):
        # Ask the model to analyze failures
        failures = get_failures(current_prompt, test_cases)
        
        improvement_request = f"""Current prompt:
{current_prompt}

This prompt fails on these cases:
{failures}

Analyze why it fails and suggest an improved version of the prompt 
that would handle these cases correctly while maintaining accuracy 
on the cases it already handles well.

Return ONLY the improved prompt, nothing else."""
        
        new_prompt = call_llm(improvement_request)
        new_score = evaluate_prompt(new_prompt, test_cases)
        
        if new_score > best_score:
            best_score = new_score
            best_prompt = new_prompt
            current_prompt = new_prompt
    
    return best_prompt
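
`optimize_prompt` assumes two helpers, `evaluate_prompt` and `get_failures`. One minimal sketch, with `call_llm` stubbed purely so the example is self-contained (a real evaluator would call your model and likely use fuzzier matching than substring checks):

```python
def call_llm(prompt: str) -> str:
    """Stub for illustration; swap in a real model call."""
    return "POSITIVE"

def evaluate_prompt(prompt: str, test_cases: list[dict]) -> float:
    """Score a prompt as the fraction of test cases whose expected
    string appears in the model's response."""
    passed = sum(
        1 for case in test_cases
        if case["expected"] in call_llm(prompt + "\n\n" + case["input"])
    )
    return passed / len(test_cases)

def get_failures(prompt: str, test_cases: list[dict]) -> str:
    """Format the failing cases for the improvement request."""
    failures = []
    for case in test_cases:
        response = call_llm(prompt + "\n\n" + case["input"])
        if case["expected"] not in response:
            failures.append(
                f"Input: {case['input']}\n"
                f"Expected: {case['expected']}\n"
                f"Got: {response}"
            )
    return "\n\n".join(failures)
```

Note that both helpers re-run the model on the test set, so each optimization iteration costs roughly two passes over your cases; caching responses per (prompt, input) pair halves that.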

Pattern: Task Decomposition Prompting

Ask the model to break down a complex task into sub-prompts:

I need to analyze customer support tickets and produce a weekly report.

Break this task into a sequence of focused sub-tasks, where each 
sub-task has:
1. A clear input
2. A specific prompt optimized for that sub-task
3. A defined output format
4. Dependencies on previous sub-tasks

Design the prompts so each one is simple enough to be highly reliable.
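
Once the model has produced a decomposition, you still need something to execute it. A hypothetical runner might represent each sub-task as a dict with a `name`, a `prompt` template, and its `depends_on` list, then run them in order (`call_llm` is stubbed here so the sketch is self-contained):

```python
def call_llm(prompt: str) -> str:
    """Stub for illustration; replace with a real model call."""
    return f"[output for: {prompt.splitlines()[0]}]"

def run_pipeline(subtasks: list[dict], initial_input: str) -> dict[str, str]:
    """Run decomposed sub-tasks in order, feeding each prompt the
    outputs of the sub-tasks it depends on."""
    outputs: dict[str, str] = {}
    for task in subtasks:
        deps = "\n".join(outputs[d] for d in task.get("depends_on", []))
        prompt = task["prompt"].format(input=initial_input, deps=deps)
        outputs[task["name"]] = call_llm(prompt)
    return outputs

# Hypothetical decomposition of the ticket-report task above:
subtasks = [
    {"name": "categorize", "prompt": "Categorize each ticket:\n{input}"},
    {"name": "summarize", "prompt": "Summarize per category:\n{deps}",
     "depends_on": ["categorize"]},
    {"name": "report", "prompt": "Write the weekly report from:\n{deps}",
     "depends_on": ["summarize"]},
]
results = run_pipeline(subtasks, "ticket 1: login broken")
```

Keeping each sub-prompt narrow is the whole point: a failure in one stage is visible in `outputs` and can be retried without re-running the rest.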

Reflexion: Learning from Mistakes

Reflexion extends ReAct by adding a self-reflection step. After completing a task, the model evaluates its own performance and generates feedback that improves future attempts.

def reflexion_agent(
    task: str,
    evaluator: callable,
    max_attempts: int = 3
) -> str:
    """Agent that learns from its own mistakes."""
    
    reflections = []
    
    for attempt in range(max_attempts):
        # Include past reflections in the prompt
        reflection_context = ""
        if reflections:
            reflection_context = "\n\nPrevious attempts and reflections:\n"
            for r in reflections:
                reflection_context += f"- Attempt: {r['summary']}\n"
                reflection_context += f"  Reflection: {r['reflection']}\n"
                reflection_context += f"  What to do differently: {r['improvement']}\n"
        
        prompt = f"""{task}
{reflection_context}
Think step by step. If you've seen reflections above, 
use them to avoid repeating the same mistakes."""
        
        response = call_llm(prompt)
        score, feedback = evaluator(response)
        
        if score >= 0.9:  # Good enough
            return response
        
        # Self-reflect on the failure
        reflection_prompt = f"""You attempted this task:
{task}

Your response:
{response}

Evaluation feedback:
{feedback}

Reflect on what went wrong and what you should do differently 
next time. Be specific and actionable."""
        
        reflection = call_llm(reflection_prompt)
        reflections.append({
            "summary": response[:200],
            "reflection": reflection,
            "improvement": reflection  # Could parse for action items
        })
    
    return response  # Return the final attempt if none scored high enough
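
The `evaluator` callable is where Reflexion gets its signal. It can be anything that returns a `(score, feedback)` pair: a unit-test runner, an LLM judge, or, as in this toy sketch, a keyword check (the required keywords here are hypothetical, chosen to match the database example earlier):

```python
def keyword_evaluator(response: str) -> tuple[float, str]:
    """Toy evaluator: score by presence of required keywords.
    A real evaluator might run unit tests or use an LLM judge."""
    required = ["index", "users.email"]
    missing = [kw for kw in required if kw not in response]
    score = 1.0 - len(missing) / len(required)
    feedback = "Looks good." if not missing else f"Missing: {', '.join(missing)}"
    return score, feedback
```

The quality of the feedback string matters more than the score: it is what the reflection prompt turns into "what to do differently" on the next attempt.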

Combining Patterns: The Full Stack

In production, these patterns are often combined:

class ProductionReasoningPipeline:
    """Combines multiple advanced patterns for maximum reliability."""
    
    def __init__(self, tools: dict, schemas: dict):
        self.tools = tools
        self.schemas = schemas
    
    def solve(self, problem: str, complexity: str = "auto") -> dict:
        if complexity == "auto":
            complexity = self._assess_complexity(problem)
        
        if complexity == "simple":
            # Direct CoT — cheapest
            return self._solve_cot(problem)
        
        elif complexity == "medium":
            # Self-Consistency — better accuracy
            return self._solve_self_consistent(problem)
        
        elif complexity == "complex":
            # ReAct with tools — can gather information
            return self._solve_react(problem)
        
        elif complexity == "hard":
            # Tree-of-Thought with Reflexion — maximum accuracy
            return self._solve_tot_reflexion(problem)
        
        # Fallback if the complexity assessment returns an unexpected word
        return self._solve_self_consistent(problem)
    
    def _assess_complexity(self, problem: str) -> str:
        prompt = f"""Rate this problem's complexity: simple, medium, complex, or hard.
Problem: {problem}
Consider: number of steps, need for external info, ambiguity, number of valid approaches.
Return ONLY one word."""
        return call_llm(prompt).strip().lower()

Comparison visual: A decision matrix showing when to use each advanced pattern based on accuracy needs, cost budget, and task type

Performance and Cost Comparison

| Pattern | API Calls | Accuracy Gain | Best For |
|---------|-----------|---------------|----------|
| Single CoT | 1x | Baseline | Simple reasoning |
| Self-Consistency (N=5) | 5x | +10-18% | Classification, math |
| ReAct | 3-10x | +15-25% | Tasks needing external data |
| Tree-of-Thought | 10-50x | +20-35% | Complex planning, design |
| Reflexion | 3-9x | +10-20% | Iterative improvement |
| Combined (adaptive) | 1-50x | Optimal per task | Production systems |

The key insight: match the technique to the task complexity. Using Tree-of-Thought for "What's 2+2?" wastes money. Using single CoT for "Design a distributed database migration strategy" wastes accuracy.

Conclusion

Advanced prompt patterns extend the capabilities of LLMs from simple question-answering to complex reasoning, planning, and problem-solving. The key takeaways:

1. Tree-of-Thought explores multiple paths — use for design, planning, and ambiguous problems

2. ReAct combines reasoning with tools — use when the model needs external information

3. Self-Consistency uses majority voting — use for deterministic problems where individual accuracy is 60-85%

4. Meta-prompting automates prompt optimization — use to iterate faster on prompt quality

5. Reflexion learns from mistakes — use for tasks where iterative improvement is possible

6. Combine adaptively — route tasks to the cheapest technique that achieves acceptable accuracy

In the final post of this series, we'll bring everything together with Production Prompt Engineering — testing, versioning, A/B testing, and optimization at scale.

*This is Part 5 of the Prompt Engineering Deep-Dive series. Previous: [Structured Output](#). Next: [Production Prompt Engineering — Testing and Optimization at Scale](#).*


Enjoyed this post? Follow AmtocSoft for AI tutorials from beginner to professional.

