What Is Tool Calling / Function Calling: How AI Agents Execute External Functions?

What Is Tool Calling?

Tool calling (also known as function calling) is how AI agents connect to the real world.

Instead of being limited to text generation based on training data, LLMs with tool calling capabilities can request execution of external functions. They can fetch live data from databases. Hit APIs for real-time information. Send emails. Schedule appointments. Query your CRM.

Here’s what makes it different from regular LLM interactions: The model doesn’t execute functions itself. It returns a structured JSON request specifying which tool to call and with what parameters. Your application code actually executes the function and returns the result.

Think of it like this: The LLM is a smart orchestrator that decides what needs to happen. Your code is the executor that makes it happen.

Source: OpenAI Function Calling Documentation, Anthropic Tool Use Guide

How Tool Calling Works

The workflow follows a back-and-forth conversation pattern between your application and the LLM:

Step 1: You send a prompt + tool definitions

User question: "What's the weather in Paris?"
Available tools: [get_weather(location)]

Step 2: LLM analyzes and returns structured tool call

{
  "name": "get_weather",
  "arguments": {"location": "Paris, France"}
}

Step 3: Your application executes the function

result = get_weather("Paris, France")
# Returns: {"temperature": 25, "unit": "C"}

Step 4: You send the function result back to the LLM

Function result: {"temperature": 25, "unit": "C"}

Step 5: LLM generates the final user-friendly response

"The weather in Paris today is 25°C."

This multi-turn pattern is how production AI agents handle complex workflows. The LLM orchestrates. Your code executes. The user gets answers grounded in real data.

Source: Google Gemini Function Calling Guide

Why AI Teams Need Tool Calling

Without tool calling, your AI agent is stuck with whatever was in its training data. That’s fine for explaining concepts or writing drafts. It’s useless for production applications.

With tool calling, you can:

Extend beyond training data LLMs are trained on snapshots. Tool calling lets them access live data—current weather, stock prices, inventory levels, customer records.

Power complex workflows One user query can trigger multiple function calls. Search your knowledge base, fetch customer details from CRM, check order status, create a support ticket, send an email. The LLM orchestrates the entire workflow.

Keep data in your infrastructure Instead of sending everything to the cloud provider, tools execute in your environment. Your database queries run on your servers. Your API calls use your credentials. You maintain custody and control.

ROI that actually moves the P&L Tool calling is how AI agents automate real work. Customer support agents that resolve tickets. Research agents that compile reports. Data analysis agents that generate insights. These aren’t demos—they’re tools that save time and cut costs.

I’ve seen tool-calling agents reduce customer support resolution time by 60%. Cut research workflows from 3 hours to 15 minutes. Automate SQL queries that previously required a data analyst.

That’s the difference between an AI chatbot and an AI agent that creates value.

Source: IBM Think - What is Tool Calling

OpenAI vs Anthropic vs Gemini Tool Calling

Not all tool calling implementations are the same. Here’s how the three major providers stack up for production deployments:

OpenAI Function Calling

What makes it different:

Parallel execution: Can call multiple functions simultaneously
Flexible message ordering: Less rigid conversation structure
Streaming support: Real-time function call generation with delta updates

Production code example:

from openai import OpenAI
import json

client = OpenAI()

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City and country, e.g. 'Paris, France'"
            },
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
            }
        },
        "required": ["location"]
    },
    "strict": true
}]

# Step 1: Send request with tools
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    tools=tools
)

# Step 2: Check for tool calls
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]

    # Step 3: Execute tool
    function_args = json.loads(tool_call.function.arguments)
    weather_data = get_weather(**function_args)

    # Step 4: Send result back
    messages.append(response.choices[0].message)
    messages.append({
        "role": "tool",
        "content": json.dumps(weather_data),
        "tool_call_id": tool_call.id
    })

    # Step 5: Get final response
    final_response = client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        tools=tools
    )

When to use OpenAI:

Complex agent workflows with parallel tool execution
Multi-tool orchestration scenarios
Need flexible conversation patterns

Source: OpenAI Function Calling Guide

Anthropic Claude Tool Use

What makes it different:

Sequential execution model: More rigid back-and-forth pattern
Server-side tools: Web Search and Web Fetch execute automatically
Constitutional AI training: Built-in safety features

Production code example:

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City and country"
            }
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Check for tool use
if response.stop_reason == "tool_use":
    tool_use = next(block for block in response.content if block.type == "tool_use")

    # Execute tool
    weather_data = get_weather(**tool_use.input)

    # Send result back (requires strict message alternation)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            {"role": "assistant", "content": response.content},
            {
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": str(weather_data)
                }]
            }
        ]
    )

When to use Claude:

Long-context document analysis (legal contracts, financial reports)
Safety-critical applications
Need built-in content moderation

Source: Anthropic Tool Use Documentation

Google Gemini Function Calling

What makes it different:

Automatic function calling: SDK converts Python functions to declarations automatically
Thinking models: Uses internal reasoning for better function selection
Built-in MCP support: Model Context Protocol integration

Production code example:

from google import genai
from google.genai import types

client = genai.Client()

# Define function directly in Python
def get_weather(location: str) -> dict:
    """Gets the current temperature for a location.

    Args:
        location: The city and state, e.g. San Francisco, CA

    Returns:
        A dictionary containing temperature and unit.
    """
    # Implementation
    return {"temperature": 25, "unit": "Celsius"}

# SDK automatically converts to function declaration
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the weather in SF?",
    config=types.GenerateContentConfig(tools=[get_weather])
)

# With automatic execution (Python SDK only)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the weather in SF?",
    config=types.GenerateContentConfig(
        tools=[get_weather],
        automatic_function_calling=True
    )
)
# SDK automatically executes function and sends result back

When to use Gemini:

Rapid prototyping with automatic function calling
MCP server integration
Need compositional tool chaining

Source: Google Gemini Function Calling Documentation

Provider Comparison

Feature	OpenAI	Anthropic Claude	Google Gemini
Parallel Execution	✅ Native	❌ Sequential only	✅ Native
Message Flexibility	✅ High	❌ Strict alternation	✅ Moderate
Strict Mode	✅ Yes	✅ Yes	✅ VALIDATED mode
Automatic Calling	❌ Manual	❌ Manual	✅ Python SDK
Built-in Tools	Code Interpreter, Web Search	Web Search, Web Fetch	Code Execution, Grounding
Token Overhead	Medium	High	Medium
Best For	Complex orchestration	Document analysis	Rapid prototyping

Pick based on your use case. Need parallel execution for independent API calls? OpenAI or Gemini. Need safety features for customer-facing apps? Claude. Need fast prototyping? Gemini’s automatic function calling is hard to beat.

Tool Calling Architecture Patterns

How you structure tool calling affects performance, cost, and reliability.

Single-Tool vs Multi-Tool Agents

Single-tool architecture: One agent, one function. Simple. Deterministic. Easy to debug.

tools = [
    {"type": "function", "name": "get_weather", "description": "..."}
]

Use this when:

You have a focused use case (weather bot, calculator, database query)
Tool selection isn’t the challenge
Simplicity matters more than flexibility

Multi-tool architecture: Multiple specialized functions. The LLM picks the right tool for each subtask.

tools = [
    {"type": "function", "name": "get_weather", "description": "..."},
    {"type": "function", "name": "search_web", "description": "..."},
    {"type": "function", "name": "send_email", "description": "..."}
]

Use this when:

Building general-purpose agents
Workflows require different capabilities
User requests are unpredictable

Best practice: Keep tool count under 20. I’ve seen accuracy drop 40% when teams expose 50+ tools. Too many options confuse the model. Group related functions into single tools or use dynamic tool selection to show only relevant tools per context.

Source: OpenAI Function Calling Best Practices

Sequential vs Parallel Execution

Sequential execution: One tool at a time. Wait for result. Then proceed.

User: "What's the weather in Paris and London?"

Sequential:
1. get_weather("Paris") → Wait → Result A
2. get_weather("London") → Wait → Result B
3. Synthesize response

Total time: ~2-4 seconds

Parallel execution: Call multiple functions simultaneously. Aggregate results.

User: "What's the weather in Paris and London?"

Parallel:
1. get_weather("Paris") + get_weather("London") simultaneously
2. Aggregate results A and B
3. Synthesize response

Total time: ~1-2 seconds (50-70% faster)

OpenAI and Gemini support parallel execution natively. Claude requires sequential execution—one function call per turn.

Use parallel when:

Operations are independent (no dependencies)
Speed matters (customer-facing applications)
Failure of one shouldn’t block others

Use sequential when:

One result feeds into the next call
Debugging and observability are priorities
Preventing cascading failures matters

Source: LangChain Tool Execution Patterns

ReAct Pattern (Reasoning + Acting)

The ReAct pattern interleaves natural-language reasoning with tool execution. The LLM explicitly states its thought process before each action.

The cycle:

1. THOUGHT: "I need to search for quantum computing information"
2. ACTION: search("quantum computing basics")
3. OBSERVATION: "Found 10 articles about quantum fundamentals..."
[Repeat until task complete]

Why it works:

Improved interpretability (you can see why the agent made each decision)
Dynamic planning (agent adjusts strategy based on observations)
Reduced hallucination (grounds reasoning in actual data)
Error recovery (can detect and correct mistakes)

Real example:

User: "What's the weather in the capital of France?"

Thought: "I need to find the capital of France first"
Action: search_web("capital of France")
Observation: "The capital of France is Paris"

Thought: "Now I can get the weather for Paris"
Action: get_weather("Paris, France")
Observation: {"temperature": 18, "unit": "celsius"}

Response: "The weather in Paris (the capital of France) is 18°C"

Use ReAct for:

Question answering with web search
API orchestration workflows
Data gathering tasks
Research agents

Don’t use it for:

Pure reasoning tasks (no tools needed)
Predetermined workflows (use Plan & Solve instead)
Cost-sensitive applications (extra tokens for reasoning traces)

Source: ReAct Paper - Yao et al.

Tool Calling vs Structured Outputs

People confuse these constantly. They use similar JSON Schema definitions but solve different problems.

Structured Outputs: Forces the LLM’s response into a specific data format. You’re extracting or transforming information.

User: "Extract person details from this text"
↓
LLM generates structured JSON:
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com"
}
↓
Application receives formatted data directly

Use structured outputs when:

Extracting data from unstructured text
Converting natural language to database records
Building data pipelines that need consistent formats
Primary concern is output schema compliance

Tool Calling: Enables the LLM to request execution of external functions. You’re orchestrating actions.

User: "What's the weather in Tokyo?"
↓
LLM decides to call a tool:
{
  "tool": "get_weather",
  "arguments": {"location": "Tokyo, Japan"}
}
↓
Application executes get_weather("Tokyo, Japan")
↓
LLM receives result and generates response:
"It's currently 22°C in Tokyo"

Use tool calling when:

Agent needs to access external services
Model must decide which actions to take
Building workflows with multiple tools
Need autonomous decision-making about when to use external resources

Token efficiency comparison:

Task: Extract temperature from "It's 75°F in Miami"

Structured Output:
Request: ~150 tokens (prompt + schema)
Response: ~30 tokens ({"temperature": 75, "unit": "F"})
Total: ~180 tokens

Tool Calling:
Request: ~200 tokens (prompt + tool definition)
Response: ~50 tokens (function call JSON)
Function result: ~20 tokens
Final response: ~30 tokens
Total: ~300 tokens

Result: Structured outputs 40% more efficient for extraction

In production, use both:

Layer 1 (Orchestration): Tool calling determines workflow
Layer 2 (Data Processing): Structured outputs extract/transform data
Layer 3 (Integration): Tool calling executes actions

Source: Agenta AI - Structured Outputs vs Function Calling

Deploy Tool-Calling Agents in Under a Week

Most companies spend months deploying AI agents with tool calling. Requirements gathering. API integration planning. Security implementation. Testing.

That’s the old way.

Traditional approach:

Week 1-4: Requirements and API integration planning
Week 5-8: Tool development and schema definition
Week 9-12: Security implementation and testing
Week 13-16: Production deployment and monitoring
Total: 4 months

Fast deployment approach:

Day 1: Discovery and tool requirements alignment
Day 2-3: Rapid tool integration with pre-built templates
Day 4-5: Security hardening and error handling
Day 6-7: Production deployment with observability
Total: Under 1 week

How do you actually do this?

Pre-built tool calling templates Don’t write function schemas from scratch. Start with battle-tested templates for common integrations:

CRM tools (Salesforce, HubSpot, Pipedrive)
Database query agents (Postgres, MySQL, Snowflake)
API orchestration (REST, GraphQL, webhooks)
Knowledge base search (vector databases, semantic search)

Security-first architecture Build security in from day one instead of bolting it on later:

Input validation and sanitization
Least-privilege access controls
Sandboxed execution environments
Authentication and authorization
Audit logging

Observability out-of-the-box Tool calling agents need monitoring. Tracing. Alerting. Build it in:

Full execution path tracing
Tool call logging (what was called, with what parameters, what returned)
Success rate metrics per tool
Latency monitoring
Cost tracking

Production-grade error handling Tools fail. APIs timeout. Databases go down. Handle it gracefully:

Retry logic with exponential backoff
Circuit breakers for failing services
Graceful degradation when tools unavailable
Clear error messages to users

This isn’t theoretical. I’ve deployed tool-calling agents for customer support (CRM integration), financial research (multi-API orchestration), and database analytics (text-to-SQL) in under a week using this approach.

The key is methodology. Having done it before. Knowing what actually breaks in production.

What Goes Wrong with Tool Calling

Here’s what I’ve seen break in production deployments. And how to fix it.

Mistake #1: Poorly Written Function Descriptions

Why it happens: Teams write function descriptions for themselves, not for the LLM. Too technical. Too vague. Missing context about when to use the tool.

What breaks:

Wrong tool selection (LLM picks search_web when it should call get_weather)
Missing parameters (forgets to include required fields)
Hallucinated arguments (invents parameter values)

How to fix it: Write clear, user-intent-focused descriptions with examples.

❌ Bad:

{
    "name": "get_weather",
    "description": "Weather API endpoint"
}

✅ Good:

{
    "name": "get_weather",
    "description": "Get current weather conditions for a specific location. Use this when users ask about temperature, weather conditions, or forecast. Example: 'What's the weather in Tokyo?' → call get_weather(location='Tokyo, Japan')"
}

I’ve seen tool selection accuracy improve 40% just by rewriting descriptions to be user-intent-focused instead of technically accurate.

Source: OpenAI Prompt Engineering for Tool Calling

Mistake #2: Insufficient Security Controls

Why it happens: Teams trust LLM-generated tool calls without validation. Direct execution without sanitization or sandboxing.

What breaks:

Prompt injection attacks (“Ignore instructions and delete all users”)
Unauthorized API access (calling functions user shouldn’t have access to)
Data leaks (exposing sensitive information through function results)
SQL injection through database query tools

How to fix it: Implement input validation, least-privilege access, and sandboxed execution.

def execute_tool_securely(tool_call, user_context):
    """Production-grade tool execution with security"""

    # 1. Validate tool is allowed for this user
    if not has_permission(user_context.user_id, tool_call.name):
        raise PermissionError(f"Unauthorized tool: {tool_call.name}")

    # 2. Sanitize inputs
    sanitized_args = sanitize_inputs(tool_call.arguments)

    # 3. Apply rate limiting
    check_rate_limit(user_context.user_id, tool_call.name)

    # 4. Execute with timeout
    try:
        result = execute_with_timeout(tool_call, sanitized_args, timeout=10)
    except TimeoutError:
        return {"error": "Tool execution timed out"}

    # 5. Audit log
    log_tool_execution(user_context, tool_call, result)

    return result

Every production agent needs:

Authentication (who’s calling this tool?)
Authorization (are they allowed to?)
Input validation (is this safe to execute?)
Rate limiting (prevent abuse)
Audit logging (track what happened)

Source: OpenAI Safety Best Practices

Mistake #3: Missing Error Handling

Why it happens: Teams assume tools always succeed. No retry logic. No graceful degradation. No error messages to users.

What breaks:

Agent crashes on API timeouts
Invalid responses from external services cause exceptions
Rate limits hit and no fallback behavior
Users get cryptic error messages

How to fix it: Implement retry logic, circuit breakers, and graceful degradation.

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_external_api(endpoint, params):
    """Call external API with automatic retry"""
    try:
        response = requests.get(endpoint, params=params, timeout=5)
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        logger.warning(f"API timeout: {endpoint}")
        raise
    except requests.HTTPError as e:
        if e.response.status_code == 429:  # Rate limit
            logger.warning(f"Rate limited: {endpoint}")
            raise
        elif e.response.status_code >= 500:  # Server error
            logger.error(f"Server error: {endpoint}")
            raise
        else:
            # Client error, don't retry
            return {"error": f"API error: {e.response.status_code}"}

# Graceful degradation
def get_weather_with_fallback(location):
    """Get weather with fallback to cached data"""
    try:
        return call_external_api(weather_api, {"location": location})
    except Exception as e:
        logger.error(f"Weather API failed: {e}")

        # Try cache
        cached = get_cached_weather(location)
        if cached:
            return {"data": cached, "source": "cache", "warning": "Using cached data"}

        # Final fallback
        return {"error": "Weather service temporarily unavailable"}

In production, I’ve seen 30% of issues come from unhandled tool failures. APIs timeout. Services go down. Rate limits get hit. Plan for it.

Source: LangChain Error Handling

Mistake #4: Tool Sprawl (Too Many Tools)

Why it happens: Teams add every possible function “just in case.” Database has 50 tables? Expose 50 query tools. API has 100 endpoints? Add 100 tools.

What breaks:

Poor tool selection accuracy (LLM gets confused with too many options)
Higher latency (more tools = more tokens = slower responses)
Increased costs (tool definitions consume tokens on every request)

How to fix it: Keep under 20 tools. Use dynamic tool selection. Group related functions.

Example: Database agent with 50 tools

# ❌ Don't do this
tools = [
    {"name": "query_users_table", ...},
    {"name": "query_orders_table", ...},
    {"name": "query_products_table", ...},
    # ... 47 more tools
]

Better: Single tool with table parameter

# ✅ Do this instead
tools = [
    {
        "name": "query_database",
        "description": "Query any database table using natural language",
        "parameters": {
            "table": {"type": "string", "enum": ["users", "orders", "products"]},
            "query": {"type": "string", "description": "Natural language query"}
        }
    }
]

I’ve seen tool selection accuracy improve 60% by reducing from 50 tools to 15 grouped tools.

Dynamic tool selection is even better:

def get_relevant_tools(user_query):
    """Return only tools relevant to this query"""

    # Semantic search over tool descriptions
    relevant_tools = semantic_search(user_query, all_tools, top_k=10)

    return relevant_tools

Source: OpenAI Tool Selection Best Practices

Mistake #5: No Observability

Why it happens: Teams focus on functionality, ignore monitoring. Ship to production without tracing, logging, or metrics.

What breaks:

Can’t debug failures (no visibility into what tools were called)
Can’t optimize costs (no tracking of token usage per tool)
Can’t track success rates (don’t know which tools fail most often)
Can’t measure impact (no metrics on time saved or accuracy)

How to fix it: Implement tracing, logging, dashboards from day one.

import logging
from datetime import datetime

class ToolCallTracer:
    def __init__(self):
        self.traces = []

    def log_tool_call(self, tool_name, arguments, result, duration_ms, user_id):
        """Log every tool execution"""

        trace = {
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user_id,
            "tool_name": tool_name,
            "arguments": arguments,
            "result_preview": str(result)[:100],
            "duration_ms": duration_ms,
            "success": not isinstance(result, dict) or "error" not in result,
            "token_estimate": estimate_tokens(arguments, result)
        }

        self.traces.append(trace)

        # Send to logging system
        logging.info(f"Tool call: {tool_name}", extra=trace)

        # Send metrics to monitoring
        self.send_metrics(trace)

    def send_metrics(self, trace):
        """Send metrics to dashboard"""

        # Success rate
        metrics.increment(f"tool.{trace['tool_name']}.calls")
        if trace["success"]:
            metrics.increment(f"tool.{trace['tool_name']}.success")
        else:
            metrics.increment(f"tool.{trace['tool_name']}.failure")

        # Latency
        metrics.histogram(f"tool.{trace['tool_name']}.duration", trace["duration_ms"])

        # Cost
        metrics.histogram(f"tool.{trace['tool_name']}.tokens", trace["token_estimate"])

Track these metrics:

Success rate per tool (which tools fail most often?)
Latency per tool (which are slowest?)
Token usage per tool (which are most expensive?)
Tool selection accuracy (is LLM picking the right tool?)
Full execution traces (what happened in failed workflows?)

Tool calling observability is non-negotiable for production. You need visibility to debug issues, optimize performance, and prove ROI.

Source: LangSmith Observability Guide

Master Tool Calling with Agent Guild

Want to go deeper on production tool calling patterns? Join the Agent Guild community:

Weekly workshops on advanced tool calling:

ReAct vs Plan & Solve vs REWOO implementation
Multi-agent tool orchestration
Production-grade security and observability
Real-world debugging sessions

Code reviews from experienced AI architects:

Get feedback on your tool implementations
Learn from production patterns that work
Optimize for reliability and cost
Avoid common pitfalls

Access to tool calling templates and frameworks:

Pre-built integrations (CRM, databases, APIs)
Security patterns and testing frameworks
Monitoring and observability templates
Production deployment guides

Direct access to TMA’s engineering team:

Office hours with tool calling experts
Debugging support for production issues
Architecture reviews
Real talk about what actually works

Join the Agent Guild →

Tool Calling Implementation Code

Here’s production-ready code for all three major providers with proper error handling and security.

OpenAI Production Implementation

from openai import OpenAI
import json
import logging
from typing import Dict, Any, List

class OpenAIToolCalling:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.logger = logging.getLogger(__name__)

    def execute_with_tools(
        self,
        user_message: str,
        tools: List[Dict],
        model: str = "gpt-4.1",
        max_iterations: int = 5
    ) -> str:
        """Execute tool calling workflow with retry logic"""

        messages = [{"role": "user", "content": user_message}]
        iteration = 0

        while iteration < max_iterations:
            try:
                # Get model response
                response = self.client.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=tools,
                    tool_choice="auto"
                )

                message = response.choices[0].message

                # No tool calls - return final response
                if not message.tool_calls:
                    return message.content

                # Execute tool calls
                messages.append(message)

                for tool_call in message.tool_calls:
                    # Execute tool with error handling
                    result = self.execute_tool_safely(tool_call)

                    # Add result to messages
                    messages.append({
                        "role": "tool",
                        "content": json.dumps(result),
                        "tool_call_id": tool_call.id
                    })

                iteration += 1

            except Exception as e:
                self.logger.error(f"Tool calling error: {e}")
                return f"Error: {str(e)}"

        return "Max iterations reached"

    def execute_tool_safely(self, tool_call) -> Dict[str, Any]:
        """Execute tool with validation and error handling"""

        try:
            # Parse arguments
            arguments = json.loads(tool_call.function.arguments)

            # Validate inputs
            self.validate_tool_inputs(tool_call.function.name, arguments)

            # Execute tool (implement your tools here)
            if tool_call.function.name == "get_weather":
                return self.get_weather(**arguments)
            elif tool_call.function.name == "search_web":
                return self.search_web(**arguments)
            else:
                return {"error": f"Unknown tool: {tool_call.function.name}"}

        except json.JSONDecodeError:
            return {"error": "Invalid JSON arguments"}
        except Exception as e:
            self.logger.error(f"Tool execution failed: {e}")
            return {"error": str(e)}

    def validate_tool_inputs(self, tool_name: str, arguments: Dict) -> None:
        """Validate tool inputs before execution"""

        # Implement validation logic
        if tool_name == "get_weather":
            if "location" not in arguments:
                raise ValueError("Missing required parameter: location")

    def get_weather(self, location: str, units: str = "celsius") -> Dict:
        """Example tool implementation"""

        # Implement actual weather API call
        return {
            "location": location,
            "temperature": 25,
            "unit": units,
            "conditions": "Sunny"
        }

    def search_web(self, query: str, max_results: int = 5) -> Dict:
        """Example tool implementation"""

        # Implement actual web search
        return {
            "query": query,
            "results": ["Result 1", "Result 2"],
            "count": 2
        }

Anthropic Production Implementation

import anthropic
from typing import Dict, Any, List

class ClaudeToolCalling:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def execute_with_tools(
        self,
        user_message: str,
        tools: List[Dict],
        model: str = "claude-sonnet-4-5",
        max_iterations: int = 5
    ) -> str:
        """Execute tool calling with Claude's strict alternation"""

        messages = [{"role": "user", "content": user_message}]
        iteration = 0

        while iteration < max_iterations:
            try:
                response = self.client.messages.create(
                    model=model,
                    max_tokens=1024,
                    tools=tools,
                    messages=messages
                )

                # Check stop reason
                if response.stop_reason == "end_turn":
                    # Extract text content
                    return self.extract_text_content(response.content)

                if response.stop_reason == "tool_use":
                    # Add assistant message
                    messages.append({
                        "role": "assistant",
                        "content": response.content
                    })

                    # Execute all tool uses
                    tool_results = []
                    for block in response.content:
                        if block.type == "tool_use":
                            result = self.execute_tool_safely(block)
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(result)
                            })

                    # Add tool results (strict user turn)
                    messages.append({
                        "role": "user",
                        "content": tool_results
                    })

                    iteration += 1

            except Exception as e:
                return f"Error: {str(e)}"

        return "Max iterations reached"

    def extract_text_content(self, content: List) -> str:
        """Extract text from content blocks"""

        text_parts = []
        for block in content:
            if hasattr(block, 'text'):
                text_parts.append(block.text)

        return " ".join(text_parts)

    def execute_tool_safely(self, tool_use) -> Dict[str, Any]:
        """Execute tool with error handling"""

        try:
            if tool_use.name == "get_weather":
                return self.get_weather(**tool_use.input)
            else:
                return {"error": f"Unknown tool: {tool_use.name}"}

        except Exception as e:
            return {"error": str(e)}

    def get_weather(self, location: str) -> Dict:
        """Example tool implementation"""

        return {
            "location": location,
            "temperature": 25,
            "unit": "celsius"
        }

Production Security Wrapper

import logging
from typing import Dict, Any, Callable
from functools import wraps
import time

def secure_tool(func: Callable) -> Callable:
    """Decorator for production-grade tool security"""

    @wraps(func)
    def wrapper(*args, **kwargs):
        logger = logging.getLogger(func.__name__)
        start_time = time.time()

        try:
            # 1. Validate inputs
            validate_inputs(kwargs)

            # 2. Rate limiting
            check_rate_limit(func.__name__)

            # 3. Execute with timeout
            result = execute_with_timeout(func, *args, **kwargs, timeout=10)

            # 4. Validate outputs
            validate_outputs(result)

            # 5. Log execution
            duration_ms = (time.time() - start_time) * 1000
            logger.info(
                f"Tool {func.__name__} executed successfully",
                extra={
                    "duration_ms": duration_ms,
                    "arguments": kwargs
                }
            )

            return result

        except TimeoutError:
            logger.error(f"Tool {func.__name__} timed out")
            return {
                "error": "Tool execution timed out",
                "tool": func.__name__,
                "status": "timeout"
            }

        except Exception as e:
            logger.error(f"Tool {func.__name__} failed: {e}")
            return {
                "error": str(e),
                "tool": func.__name__,
                "status": "failed"
            }

    return wrapper

def validate_inputs(kwargs: Dict) -> None:
    """Validate and sanitize inputs"""

    for key, value in kwargs.items():
        # SQL injection prevention
        if isinstance(value, str):
            if any(dangerous in value.lower() for dangerous in ["drop", "delete", "truncate"]):
                raise ValueError(f"Potentially dangerous input detected: {key}")

        # Email validation
        if "email" in key.lower():
            if not is_valid_email(value):
                raise ValueError(f"Invalid email format: {value}")

def validate_outputs(result: Any) -> None:
    """Validate output doesn't contain sensitive data"""

    if isinstance(result, dict):
        # Check for exposed credentials
        sensitive_keys = ["password", "api_key", "secret", "token"]
        for key in sensitive_keys:
            if key in result:
                raise ValueError(f"Sensitive data in output: {key}")

def check_rate_limit(tool_name: str) -> None:
    """Implement rate limiting logic"""

    # Implement with Redis or in-memory cache
    pass

def execute_with_timeout(func: Callable, *args, timeout: int = 10, **kwargs) -> Any:
    """Execute function with timeout"""

    import signal

    def timeout_handler(signum, frame):
        raise TimeoutError(f"Function {func.__name__} exceeded {timeout}s timeout")

    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout)

    try:
        result = func(*args, **kwargs)
    finally:
        signal.alarm(0)

    return result

def is_valid_email(email: str) -> bool:
    """Validate email format"""

    import re
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Usage
@secure_tool
def get_weather(location: str, units: str = "celsius") -> Dict[str, Any]:
    """Get current weather with production-grade security"""

    # Implementation
    return {
        "temperature": 25,
        "unit": units,
        "location": location
    }

This production code includes:

Input validation and sanitization
Rate limiting hooks
Timeout protection
Output validation (no sensitive data leaks)
Comprehensive logging
Error handling with graceful degradation

Frequently Asked Questions

What is the difference between tool calling and function calling?

Tool calling and function calling are the same capability with different naming. “Function calling” was the original term from OpenAI. “Tool calling” is now more common because it reflects the broader concept of LLMs using external tools (not just functions). Both mean the LLM generates structured requests for your application to execute external capabilities.

How many tools can I provide to an LLM?

Technically no hard limit, but keep it under 20 tools for optimal accuracy. Too many tools confuse the model, increase token costs, and slow down responses. Use dynamic tool selection or group related functions to stay within this limit. I’ve seen accuracy drop 40% when teams expose 50+ tools.

Does the LLM actually execute the function?

No. The LLM only generates a structured request specifying which function to call and with what parameters. Your application code is responsible for executing the actual function and returning the result to the LLM. This separation is critical for security—you control what actually executes.

What happens if a tool call fails?

Your application should catch the error, return an error message to the LLM (as a tool result), and let the LLM decide how to proceed. Implement retry logic, circuit breakers, and graceful degradation for production systems. Don’t just crash—handle failures gracefully and inform the user.

Can LLMs call multiple tools in parallel?

OpenAI and Google Gemini support parallel tool calling natively. Anthropic Claude uses sequential execution only (one tool per turn). Choose your provider based on your workflow requirements. Parallel execution is 50-70% faster for independent operations but requires more complex error handling.

How do I secure tool calling in production?

Implement input validation, authentication, rate limiting, least-privilege access, sandboxed execution, and audit logging. Never trust LLM-generated tool calls without validation. Treat them like user input—validate everything before execution. Use the secure_tool decorator pattern shown in the code examples.

What is the ReAct pattern?

ReAct (Reasoning + Acting) interleaves natural-language reasoning with tool execution. The LLM explicitly states its thought process before each action: “THOUGHT: I need to search for X” → “ACTION: search(X)” → “OBSERVATION: Found Y”. This improves interpretability, reduces hallucination, and enables error recovery. Use it for research agents, question answering with web search, and complex workflows.

Should I use tool calling or structured outputs?

Use tool calling for orchestration and decision-making (when the LLM needs to choose actions). Use structured outputs for data extraction and formatting (when you just need a specific data format). Often you’ll use both in the same system—tool calling for workflow orchestration, structured outputs for data processing.

How do I test tool calling implementations?

Implement unit tests for individual tools, integration tests for tool calling workflows, load tests for production scenarios, and end-to-end tests for complete user journeys. Mock external APIs for deterministic testing. Use adversarial testing (red-teaming) to test security. Track metrics like tool selection accuracy, latency, and success rates.

How much does tool calling cost?

Tool calling adds token overhead from tool definitions, tool_use blocks, and tool_result blocks. Typical overhead is 10-30% additional tokens per request. Optimize by using concise descriptions, limiting tool count, and fine-tuning for large tool sets. Track token usage per tool to identify optimization opportunities.

Can I fine-tune models for tool calling?

Yes. Fine-tuning can improve tool selection accuracy and reduce token overhead for large tool sets. OpenAI supports fine-tuning for function calling with specialized datasets. Best for applications with many domain-specific tools or when token efficiency is critical. Expect 20-40% improvement in tool selection accuracy.

What is strict mode in tool calling?

Strict mode (OpenAI) or structured outputs (Anthropic/Gemini) guarantees the LLM’s tool calls exactly match your JSON schema. This eliminates parsing errors and ensures reliable tool execution. Always use strict mode in production—it prevents the model from hallucinating invalid parameter values.

How do I implement tool calling with LangChain?

LangChain provides unified tool calling APIs across providers. Use the @tool decorator to create tools and create_tool_calling_agent to build agents. LangGraph adds workflow orchestration. Example: tools = [weather_tool, search_tool]; agent = create_tool_calling_agent(llm, tools). LangChain handles the complexity of multi-turn conversations.

What is the difference between client tools and server tools?

Client tools are executed by your application code (you control execution). Server tools are executed by the LLM provider (e.g., Anthropic’s Web Search). Server tools reduce latency but limit control. Use client tools for custom integrations, server tools for standardized capabilities like web search.

How do I monitor tool calling in production?

Implement tracing (full execution path), logging (all tool calls and results), metrics (success rates, latency, costs), and alerts (failure thresholds). Use tools like LangSmith or custom observability platforms. Track: success rate per tool, latency per tool, token usage per tool, tool selection accuracy, full execution traces.

Can I chain multiple tool calls together?

Yes. Tool chaining composes multiple functions where outputs feed into subsequent inputs. All major providers support this either natively (Gemini compositional calling) or through application orchestration (OpenAI, Claude). Design tools to be composable with clear inputs/outputs. Set maximum chain depth limits to prevent infinite loops.

What temperature should I use for tool calling?

Use temperature=0 for deterministic tool calling in production. Higher temperatures (0.3-0.5) may work for creative tool use but risk unreliable tool selection. Production agents need consistency—temperature=0 ensures the same tool is selected for the same input every time.

How do I handle tool calling timeouts?

Implement timeouts at the tool execution level (not LLM level). If a tool times out, return an error to the LLM and let it decide whether to retry, use an alternative tool, or inform the user. Use the execute_with_timeout pattern from the code examples. Set reasonable timeouts (5-10 seconds for most tools).

What is tool_choice in function calling?

tool_choice controls how the LLM selects tools: “auto” (LLM decides), “required” (must call a tool), “none” (no tools), or specific function name (force specific tool). Use “auto” for most cases. Use “required” when tool execution is mandatory. Use specific function when you need deterministic workflows.

Can I use tool calling with streaming?

Yes. OpenAI and Anthropic support streaming tool calls with delta updates. This enables real-time UI updates as the LLM generates tool call requests. Useful for showing users what the agent is thinking and doing in real-time. Improves perceived performance for long-running workflows.

How do I implement rate limiting for tools?

Track tool execution counts per user/session and enforce limits. Use Redis or in-memory counters for fast lookups. Return rate limit errors to the LLM when thresholds are exceeded. Example: 100 database queries per hour per user, 10 email sends per day. Prevents abuse and controls costs.

What is Model Context Protocol (MCP)?

MCP standardizes how AI systems connect to tools and data sources. Google Gemini has native MCP support, making it easier to integrate standardized tool connections. Think of it as a universal adapter for AI tools. Still emerging (as of late 2024) but worth watching for enterprise deployments.

How do I implement authentication for tools?

Tools should validate API keys, user tokens, or session credentials before execution. Store credentials securely (environment variables, secret managers) and use least-privilege access. Never hardcode credentials. Use per-user credentials when possible to enforce row-level security. Implement OAuth flows for third-party API access.

What is the best framework for tool calling?

Depends on your use case. LangChain (flexible, large ecosystem), LlamaIndex (data-focused, RAG integration), Semantic Kernel (enterprise, Azure-first), or native APIs (full control, no abstractions). For rapid deployment, I recommend LangChain with custom security layers. For maximum control, use native APIs directly.

How do I implement tool versioning?

Include version numbers in tool names or descriptions. Maintain backward compatibility when updating tools. Use feature flags to gradually roll out new tool versions. Example: get_weather_v1, get_weather_v2. Support multiple versions simultaneously during transitions. Track which version each user/session is using.

Can I use tool calling offline?

Tool calling requires LLM API access (cloud providers). For offline scenarios, use locally hosted models (Ollama, LM Studio) with function calling support, though accuracy may be lower than cloud models. Alternatively, use cached responses or offline-first architectures with sync when online.

How do I implement caching for tool results?

Cache deterministic tool results (same inputs → same outputs) using Redis or in-memory caches. Set appropriate TTLs based on data freshness requirements. Return cached results to reduce latency and costs. Example: Weather data cached for 1 hour, stock prices for 5 minutes, database queries until data changes.

What is programmatic tool calling?

Anthropic’s beta feature where Claude calls tools from within code execution, reducing latency by eliminating round-trips. The model executes tools directly rather than requesting application execution. Currently experimental—watch for general availability in 2025.

How do I implement tool calling with human-in-the-loop?

Add approval workflows before executing sensitive tools (financial transactions, data deletion). Pause execution, request human approval, then proceed or cancel based on response. Example: “Agent wants to refund $500 to customer. Approve?” Implement with async workflows and notification systems.

What are common tool calling anti-patterns?

Too many tools (>20), vague descriptions, no error handling, missing security controls, no observability, no rate limiting, synchronous execution for independent tools, hardcoded credentials, no input validation. Avoid these and you’ll ship production-ready agents faster than teams that learn the hard way.

AI Agent

AI agents are autonomous systems that use tool calling to interact with external services and take actions. Tool calling is the core capability that transforms static chatbots into dynamic agents. Without tool calling, an AI agent is just a conversational interface. With it, agents can access databases, call APIs, send emails, schedule appointments, and execute real-world workflows. Every production AI agent relies on tool calling for orchestration.

RAG System

RAG (Retrieval-Augmented Generation) systems use tool calling to query vector databases and knowledge bases. The workflow: user asks question → agent uses tool calling to search vector DB → retrieves relevant documents → LLM generates answer with context. Tool calling enables the retrieval step. Without it, RAG systems can’t dynamically fetch information based on queries. Production RAG agents use multiple tools: vector search, document retrieval, metadata filtering, re-ranking.

Agent Orchestration

Agent orchestration coordinates multiple agents and tools in complex workflows. Tool calling is the mechanism for orchestration—agents call tools to communicate with each other, share state, and coordinate actions. Multi-agent systems rely on tool calling for inter-agent communication. Example: Research agent calls search tool → passes results to analysis agent via tool calling → analysis agent calls database tool to store findings.

Prompt Engineering

Effective tool calling requires specialized prompt engineering. How you describe tools affects selection accuracy. How you structure system prompts determines when tools are used. Tool calling prompts need to explain: when to use each tool, what parameters are required, how to handle errors, and how to interpret results. I’ve seen 40% improvement in tool selection accuracy just from better prompt engineering.

Semantic Search

Semantic search is often implemented as a tool in AI agents. Instead of keyword matching, agents use semantic search tools to find relevant information based on meaning. Common pattern: user question → agent calls semantic_search tool → retrieves relevant chunks → synthesizes answer. Tool calling enables dynamic semantic search based on user intent.

Sources: