Advanced Prompt Patterns: Tree-of-Thought, ReAct, and Self-Consistency

Chain-of-Thought prompting was a breakthrough — but it has a fundamental limitation. It follows a single reasoning path. If that path starts with a wrong assumption, every subsequent step is built on a faulty foundation. There's no backtracking, no exploration of alternatives, no way to course-correct.
The advanced prompting patterns we'll cover in this post address exactly this limitation. They were born from a simple question: what if the model could explore multiple reasoning paths, use external tools to verify its assumptions, and check its own work against alternative approaches?
These techniques — Tree-of-Thought, ReAct, Self-Consistency, meta-prompting, and more — represent the current frontier of prompt engineering. They're what separates a clever chatbot from a reliable AI system that can handle complex, multi-step tasks in production.
This is Part 5 of our Prompt Engineering Deep-Dive series. If you haven't read Parts 1-4, start there — the techniques here build directly on system prompts, Chain-of-Thought, few-shot prompting, and structured output.
Tree-of-Thought (ToT): Exploring Multiple Paths
Chain-of-Thought follows one path: Step 1 → Step 2 → Step 3 → Answer. Tree-of-Thought explores a branching tree of possibilities, evaluates each branch, and prunes dead ends before committing to an answer.
How It Works
1. Generate multiple candidate next-steps at each reasoning point
2. Evaluate each candidate (is this step promising or a dead end?)
3. Select the most promising branches to continue
4. Backtrack from dead ends and explore alternatives

Implementation
```python
def tree_of_thought(problem: str, breadth: int = 3, depth: int = 3) -> str:
    """Explore multiple reasoning paths and select the best."""

    def generate_steps(context: str, n: int) -> list[str]:
        prompt = f"""Given this problem and progress so far:
{context}

Generate {n} different possible next steps.
For each step, explain your reasoning.
Return as a numbered list."""
        return parse_steps(call_llm(prompt))

    def evaluate_step(context: str, step: str) -> float:
        prompt = f"""Evaluate this reasoning step:
Context: {context}
Step: {step}

Rate from 0.0 to 1.0:
- Is this step logically sound?
- Does it make progress toward the solution?
- Does it avoid assumptions that could be wrong?

Return ONLY a number between 0.0 and 1.0."""
        return float(call_llm(prompt).strip())

    # BFS through the reasoning tree
    candidates = [{"context": problem, "steps": [], "score": 1.0}]
    for level in range(depth):
        next_candidates = []
        for candidate in candidates:
            steps = generate_steps(candidate["context"], breadth)
            for step in steps:
                score = evaluate_step(candidate["context"], step)
                new_context = candidate["context"] + f"\nStep {level+1}: {step}"
                next_candidates.append({
                    "context": new_context,
                    "steps": candidate["steps"] + [step],
                    "score": candidate["score"] * score,
                })
        # Keep only the top candidates (beam search)
        candidates = sorted(
            next_candidates,
            key=lambda x: x["score"],
            reverse=True,
        )[:breadth]

    # Return the highest-scoring path
    best = candidates[0]
    return synthesize_answer(problem, best["steps"])
```
When to Use ToT
| Use Case | CoT Sufficient? | ToT Needed? |
|----------|----------------|-------------|
| Simple math | Yes | No |
| Code debugging | Usually | For complex multi-file bugs |
| Architecture design | No | Yes — multiple valid approaches |
| Strategic planning | No | Yes — tradeoffs require exploration |
| Game solving (chess, puzzles) | No | Yes — search required |
| Creative writing | No | Yes — exploring different directions |
Cost consideration: ToT uses 5-20x more API calls than single CoT. Use it only when the accuracy improvement justifies the cost.
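To see where that multiplier comes from, here is a rough call count for the beam-search sketch above, assuming one generation call per surviving candidate plus one evaluation call per generated step (an estimate, not a benchmark):

```python
def tot_call_count(breadth: int, depth: int) -> int:
    """Estimate LLM calls for beam-search ToT: each surviving candidate
    costs 1 generation call plus `breadth` evaluation calls per level."""
    calls = 0
    beam = 1  # the root problem is the only candidate at level 0
    for _ in range(depth):
        calls += beam * (1 + breadth)  # generate once, evaluate each step
        beam = breadth                 # beam is pruned back to `breadth`
    return calls

print(tot_call_count(3, 3))  # 28 calls, vs. 1 for a single CoT pass
```

With the default `breadth=3, depth=3`, that is roughly a 28x cost over a single call, which is why routing simple problems away from ToT matters.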
ReAct: Reasoning + Acting
ReAct (Reasoning + Acting) combines Chain-of-Thought reasoning with tool use. Instead of reasoning in isolation, the model thinks about what information it needs, uses tools to get it, observes the results, and continues reasoning.
The ReAct Loop
```
Thought: I need to check if the database table exists
Action: query_database("SHOW TABLES LIKE 'users'")
Observation: Table 'users' exists with columns: id, name, email, created_at
Thought: The table exists. Now I need to check if there's an index on email
Action: query_database("SHOW INDEX FROM users WHERE Column_name = 'email'")
Observation: No index found on email column
Thought: Missing email index explains the slow login query. I should recommend adding it.
Answer: Add an index on users.email — this will fix the O(n) scan on every login.
```
Implementation
```python
import re

def react_agent(
    question: str,
    tools: dict[str, callable],
    max_steps: int = 10,
) -> str:
    """ReAct agent: interleave reasoning and tool use."""
    tool_descriptions = "\n".join(
        f"- {name}: {func.__doc__}" for name, func in tools.items()
    )
    system = f"""You are a reasoning agent. For each step:
1. Thought: Reason about what you know and what you need
2. Action: Call a tool if needed (format: tool_name(args))
3. Observation: [Tool result will be inserted here]

Repeat until you have enough information to answer.
When ready, respond with: Answer: [your final answer]

Available tools:
{tool_descriptions}"""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    for step in range(max_steps):
        response = call_llm(messages)
        messages.append({"role": "assistant", "content": response})
        # Check if we have a final answer
        if "Answer:" in response:
            return response.split("Answer:")[-1].strip()
        # Parse and execute the requested action
        action_match = re.search(r'Action:\s*(\w+)\((.*?)\)', response)
        if action_match:
            tool_name = action_match.group(1)
            tool_args = action_match.group(2)
            if tool_name in tools:
                result = tools[tool_name](tool_args)
                observation = f"Observation: {result}"
            else:
                observation = f"Observation: Error — tool '{tool_name}' not found"
            messages.append({"role": "user", "content": observation})
    return "Max steps reached without conclusion."
```
ReAct vs. Plain Tool Use
The key difference is that ReAct makes the reasoning explicit. In plain tool use, the model calls tools but doesn't show its reasoning about why it chose that tool or what it expects to find. ReAct forces the model to articulate its hypothesis before acting, which:
1. Improves tool selection — Thinking first reduces irrelevant tool calls
2. Enables debugging — You can read the thought trace to understand failures
3. Supports learning — The reasoning chain becomes training data for improvement
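As a usage sketch, the `tools` argument is a name-to-function mapping, and the agent builds its tool list from each function's docstring. The tool names and stubbed bodies here are illustrative:

```python
def query_database(sql: str) -> str:
    """Run a read-only SQL query and return the result rows."""
    return "Table 'users' exists"  # stubbed for illustration

def fetch_url(url: str) -> str:
    """Fetch a URL and return its text content."""
    return "<html>...</html>"  # stubbed for illustration

tools = {"query_database": query_database, "fetch_url": fetch_url}

# This is exactly how react_agent renders the tool list for its system prompt:
tool_descriptions = "\n".join(
    f"- {name}: {func.__doc__}" for name, func in tools.items()
)
print(tool_descriptions)
```

Because the descriptions come straight from docstrings, keeping them precise is part of the prompt engineering: a vague docstring produces vague tool selection.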
Self-Consistency: Majority Vote on Reasoning
Self-Consistency generates multiple independent reasoning chains for the same problem and takes the majority vote. It's based on the insight that correct reasoning paths are more likely to converge on the same answer.
Implementation
```python
import collections

def self_consistent_answer(
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> dict:
    """Generate multiple reasoning paths and vote on the answer."""
    answers = []
    reasoning_chains = []
    for _ in range(n_samples):
        response = call_llm(
            prompt + "\nThink step by step, then give your final answer "
                     "on the last line starting with 'ANSWER:'",
            temperature=temperature,  # higher temp for diversity
        )
        # Extract the final answer
        lines = response.strip().split('\n')
        answer_line = [l for l in lines if l.startswith('ANSWER:')]
        if answer_line:
            answer = answer_line[-1].replace('ANSWER:', '').strip()
            answers.append(answer)
            reasoning_chains.append(response)
    # Majority vote
    counter = collections.Counter(answers)
    best_answer, vote_count = counter.most_common(1)[0]
    return {
        "answer": best_answer,
        "confidence": vote_count / len(answers),
        "total_votes": len(answers),
        "vote_distribution": dict(counter),
        "reasoning_chains": reasoning_chains,
    }
```
When Self-Consistency Shines
Self-Consistency is most valuable when:
- The problem has a single correct answer (math, classification, yes/no)
- Individual CoT accuracy is in the 60-85% range (high enough to converge, low enough to benefit)
- You can afford N times the API cost (typically N=5 to N=11)
| Single CoT Accuracy | Self-Consistency (N=5) | Improvement |
|---------------------|----------------------|-------------|
| 60% | ~78% | +18% |
| 70% | ~87% | +17% |
| 80% | ~94% | +14% |
| 90% | ~98% | +8% |
Diminishing returns above N=11. Research shows that going from 5 to 11 samples provides meaningful improvement, but 11 to 21 provides very little additional benefit.
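The accuracy figures in the table are illustrative. For a binary-outcome problem you can approximate the effect with a simple binomial model; this simplification assumes independent samples and ignores ties and multi-way answer distributions, so it will not match empirical benchmarks exactly, but it shows why majority voting amplifies above-chance accuracy:

```python
from math import comb

def majority_vote_accuracy(n: int, p: float) -> float:
    """P(majority of n independent samples is correct), given each
    sample is correct with probability p. Assumes odd n (no ties)."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n // 2 + 1, n + 1)
    )

print(round(majority_vote_accuracy(5, 0.7), 3))  # 0.837 — up from 0.7
```

The same formula also shows the failure mode: when per-sample accuracy drops below 0.5, voting makes the final answer *worse*, which is why Self-Consistency needs a reasonably capable baseline.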
(Diagram: the prompt fans out to five reasoning chains, each sampled at T=0.7; they return answers A, B, A, A, and C, and a majority vote selects A with 3/5 = 60% confidence.)
Meta-Prompting: Prompts That Write Prompts
Meta-prompting uses the LLM itself to generate, refine, and optimize prompts. Instead of manually iterating on prompt wording, you ask the model to help.
Pattern: Automatic Prompt Optimization
```python
def optimize_prompt(
    initial_prompt: str,
    test_cases: list[dict],
    n_iterations: int = 5,
) -> str:
    """Use the LLM to iteratively improve a prompt."""
    current_prompt = initial_prompt
    best_score = evaluate_prompt(current_prompt, test_cases)
    best_prompt = current_prompt
    for iteration in range(n_iterations):
        # Ask the model to analyze failures
        failures = get_failures(current_prompt, test_cases)
        improvement_request = f"""Current prompt:
{current_prompt}

This prompt fails on these cases:
{failures}

Analyze why it fails and suggest an improved version of the prompt
that would handle these cases correctly while maintaining accuracy
on the cases it already handles well.

Return ONLY the improved prompt, nothing else."""
        new_prompt = call_llm(improvement_request)
        new_score = evaluate_prompt(new_prompt, test_cases)
        if new_score > best_score:
            best_score = new_score
            best_prompt = new_prompt
        current_prompt = new_prompt
    return best_prompt
```
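The optimizer leans on two helpers the snippet leaves undefined. Here is a minimal sketch of them, assuming each test case is a dict with `input` and `expected` keys and exact-match scoring; the extra `llm` parameter stands in for `call_llm` so the helpers are self-contained (both the signature and the scoring rule are illustrative choices, not part of the pattern):

```python
def evaluate_prompt(prompt: str, test_cases: list[dict], llm) -> float:
    """Score a prompt as the fraction of test cases it answers correctly."""
    correct = sum(
        1 for case in test_cases
        if llm(f"{prompt}\n\nInput: {case['input']}").strip() == case["expected"]
    )
    return correct / len(test_cases)

def get_failures(prompt: str, test_cases: list[dict], llm) -> str:
    """Format the failing cases for the improvement request."""
    lines = []
    for case in test_cases:
        got = llm(f"{prompt}\n\nInput: {case['input']}").strip()
        if got != case["expected"]:
            lines.append(
                f"Input: {case['input']} | Expected: {case['expected']} | Got: {got}"
            )
    return "\n".join(lines)
```

In practice you would swap exact match for a task-appropriate metric (fuzzy match, schema validation, an LLM judge), but the loop structure stays the same.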
Pattern: Task Decomposition Prompting
Ask the model to break down a complex task into sub-prompts:
```
I need to analyze customer support tickets and produce a weekly report.

Break this task into a sequence of focused sub-tasks, where each sub-task has:
1. A clear input
2. A specific prompt optimized for that sub-task
3. A defined output format
4. Dependencies on previous sub-tasks

Design the prompts so each one is simple enough to be highly reliable.
```
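Once the model has produced a decomposition, the sub-tasks can be executed as a simple sequential pipeline, with each step's output feeding the next. A minimal sketch, where the step definitions and the `llm` parameter are illustrative stand-ins:

```python
def run_pipeline(steps: list[dict], initial_input: str, llm) -> str:
    """Run sub-task prompts in sequence, piping each output forward."""
    data = initial_input
    for step in steps:
        data = llm(step["prompt"].format(input=data))
    return data

# Hypothetical sub-tasks for the ticket-report example:
steps = [
    {"name": "extract_topics",
     "prompt": "List the topics in these tickets:\n{input}"},
    {"name": "summarize",
     "prompt": "Write a weekly summary of these topics:\n{input}"},
]
```

A linear chain like this only handles sequential dependencies; if the decomposition produces a DAG of sub-tasks, you would topologically sort it first.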
Reflexion: Learning from Mistakes
Reflexion extends ReAct by adding a self-reflection step. After completing a task, the model evaluates its own performance and generates feedback that improves future attempts.
```python
def reflexion_agent(
    task: str,
    evaluator: callable,
    max_attempts: int = 3,
) -> str:
    """Agent that learns from its own mistakes."""
    reflections = []
    for attempt in range(max_attempts):
        # Include past reflections in the prompt
        reflection_context = ""
        if reflections:
            reflection_context = "\n\nPrevious attempts and reflections:\n"
            for r in reflections:
                reflection_context += f"- Attempt: {r['summary']}\n"
                reflection_context += f"  Reflection: {r['reflection']}\n"
                reflection_context += f"  What to do differently: {r['improvement']}\n"
        prompt = f"""{task}
{reflection_context}
Think step by step. If you've seen reflections above,
use them to avoid repeating the same mistakes."""
        response = call_llm(prompt)
        score, feedback = evaluator(response)
        if score >= 0.9:  # good enough
            return response
        # Self-reflect on the failure
        reflection_prompt = f"""You attempted this task:
{task}

Your response:
{response}

Evaluation feedback:
{feedback}

Reflect on what went wrong and what you should do differently
next time. Be specific and actionable."""
        reflection = call_llm(reflection_prompt)
        reflections.append({
            "summary": response[:200],
            "reflection": reflection,
            "improvement": reflection,  # could parse for action items
        })
    return response  # return the last attempt
```
Combining Patterns: The Full Stack
In production, these patterns are often combined:
```python
class ProductionReasoningPipeline:
    """Combines multiple advanced patterns for maximum reliability."""

    def __init__(self, tools: dict, schemas: dict):
        self.tools = tools
        self.schemas = schemas

    def solve(self, problem: str, complexity: str = "auto") -> dict:
        if complexity == "auto":
            complexity = self._assess_complexity(problem)
        if complexity == "simple":
            # Direct CoT — cheapest
            return self._solve_cot(problem)
        elif complexity == "medium":
            # Self-Consistency — better accuracy
            return self._solve_self_consistent(problem)
        elif complexity == "complex":
            # ReAct with tools — can gather information
            return self._solve_react(problem)
        else:  # "hard" (or an unexpected rating — default to the strongest option)
            # Tree-of-Thought with Reflexion — maximum accuracy
            return self._solve_tot_reflexion(problem)

    def _assess_complexity(self, problem: str) -> str:
        prompt = f"""Rate this problem's complexity: simple, medium, complex, or hard.
Problem: {problem}
Consider: number of steps, need for external info, ambiguity, number of valid approaches.
Return ONLY one word."""
        return call_llm(prompt).strip().lower()
```

Performance and Cost Comparison
| Pattern | API Calls | Accuracy Gain | Best For |
|---------|-----------|---------------|----------|
| Single CoT | 1x | Baseline | Simple reasoning |
| Self-Consistency (N=5) | 5x | +10-18% | Classification, math |
| ReAct | 3-10x | +15-25% | Tasks needing external data |
| Tree-of-Thought | 10-50x | +20-35% | Complex planning, design |
| Reflexion | 3-9x | +10-20% | Iterative improvement |
| Combined (adaptive) | 1-50x | Optimal per task | Production systems |
The key insight: match the technique to the task complexity. Using Tree-of-Thought for "What's 2+2?" wastes money. Using single CoT for "Design a distributed database migration strategy" wastes accuracy.
Conclusion
Advanced prompt patterns extend the capabilities of LLMs from simple question-answering to complex reasoning, planning, and problem-solving. The key takeaways:
1. Tree-of-Thought explores multiple paths — use for design, planning, and ambiguous problems
2. ReAct combines reasoning with tools — use when the model needs external information
3. Self-Consistency uses majority voting — use for deterministic problems where individual accuracy is 60-85%
4. Meta-prompting automates prompt optimization — use to iterate faster on prompt quality
5. Reflexion learns from mistakes — use for tasks where iterative improvement is possible
6. Combine adaptively — route tasks to the cheapest technique that achieves acceptable accuracy
In the final post of this series, we'll bring everything together with Production Prompt Engineering — testing, versioning, A/B testing, and optimization at scale.
*This is Part 5 of the Prompt Engineering Deep-Dive series. Previous: [Structured Output](#). Next: [Production Prompt Engineering — Testing and Optimization at Scale](#).*
Enjoyed this post? Follow AmtocSoft for AI tutorials from beginner to professional.