# Execution engine configuration

> Configure the DataStore execution engine: `auto`, `chdb`, or `pandas`

DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.

<h2 id="engines">
  Available Engines
</h2>

| Engine   | Description                                     | Best For                                        |
| -------- | ----------------------------------------------- | ----------------------------------------------- |
| `auto`   | Automatically selects best engine per operation | General use (default)                           |
| `chdb`   | Forces all operations through ClickHouse SQL    | Large datasets, aggregations                    |
| `pandas` | Forces all operations through pandas            | Compatibility testing, pandas-specific features |

<h2 id="setting">
  Setting the Engine
</h2>

<h3 id="global">
  Global Configuration
</h3>

```python theme={null}
from chdb.datastore.config import config

# Option 1: Using set method
config.set_execution_engine('auto')    # Default
config.set_execution_engine('chdb')    # Force ClickHouse
config.set_execution_engine('pandas')  # Force pandas

# Option 2: Using shortcuts
config.use_auto()     # Auto-select
config.use_chdb()     # Force ClickHouse
config.use_pandas()   # Force pandas
```

<h3 id="checking">
  Checking Current Engine
</h3>

```python theme={null}
print(config.execution_engine)  # 'auto', 'chdb', or 'pandas'
```

***

<h2 id="auto-mode">
  Auto Mode
</h2>

In `auto` mode (default), DataStore selects the optimal engine for each operation:

<h3 id="auto-chdb">
  Operations Executed in chDB
</h3>

* SQL-compatible filtering (`filter()`, `where()`)
* Column selection (`select()`)
* Sorting (`sort()`, `orderby()`)
* Grouping and aggregation (`groupby().agg()`)
* Joins (`join()`, `merge()`)
* Distinct (`distinct()`, `drop_duplicates()`)
* Limiting (`limit()`, `head()`, `tail()`)

<h3 id="auto-pandas">
  Operations Executed in pandas
</h3>

* Custom apply functions (`apply(custom_func)`)
* Complex pivot tables with custom aggregations
* Operations not expressible in SQL
* When input is already a pandas DataFrame

<h3 id="auto-example">
  Example
</h3>

```python theme={null}
from chdb import datastore as pd
from chdb.datastore.config import config

config.use_auto()  # Default

ds = pd.read_csv("data.csv")

# This uses chDB (SQL)
result = (ds
    .filter(ds['amount'] > 100)   # SQL: WHERE
    .groupby('region')            # SQL: GROUP BY
    .agg({'amount': 'sum'})       # SQL: SUM()
)

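# This uses pandas (custom Python function, not expressible in SQL).
# complex_calculation is a placeholder for your own row-wise function.
result = ds.apply(lambda row: complex_calculation(row), axis=1)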
```

***

<h2 id="chdb-mode">
  chDB Mode
</h2>

Force all operations through ClickHouse SQL:

```python theme={null}
config.use_chdb()
```

<h3 id="chdb-when">
  When to Use
</h3>

* Processing large datasets (millions of rows)
* Heavy aggregation workloads
* When you want maximum SQL optimization
* Consistent behavior across all operations

<h3 id="chdb-performance">
  Performance Characteristics
</h3>

| Operation Type        | Performance                                                            |
| --------------------- | ---------------------------------------------------------------------- |
| GroupBy/Aggregation   | Excellent (up to 20x faster than pandas)                               |
| Complex Filtering     | Excellent                                                              |
| Sorting               | Very Good                                                              |
| Simple Single Filters | Good (some overhead; about 2x slower than pandas in the 10M-row test)  |

<h3 id="chdb-limitations">
  Limitations
</h3>

* Custom Python functions may not be supported
* Some pandas-specific features require conversion

***

<h2 id="pandas-mode">
  pandas Mode
</h2>

Force all operations through pandas:

```python theme={null}
config.use_pandas()
```

<h3 id="pandas-when">
  When to Use
</h3>

* Compatibility testing with pandas
* Using pandas-specific features
* Debugging pandas-related issues
* When data is already in pandas format

<h3 id="pandas-performance">
  Performance Characteristics
</h3>

| Operation Type           | Performance      |
| ------------------------ | ---------------- |
| Simple Single Operations | Good             |
| Custom Functions         | Excellent        |
| Complex Aggregations     | Slower than chDB |
| Large Datasets           | Memory intensive |

***

<h2 id="cross-datastore">
  Cross-DataStore Engine
</h2>

Configure the engine for operations that combine columns from different DataStores:

```python theme={null}
# Set cross-DataStore engine
config.set_cross_datastore_engine('auto')
config.set_cross_datastore_engine('chdb')
config.set_cross_datastore_engine('pandas')
```

<h3 id="cross-example">
  Example
</h3>

```python theme={null}
from chdb import datastore as pd

ds1 = pd.read_csv("sales.csv")
ds2 = pd.read_csv("inventory.csv")

# This operation involves two DataStores,
# so it uses the cross_datastore_engine setting
result = ds1.join(ds2, on='product_id')
```

***

<h2 id="selection-logic">
  Engine Selection Logic
</h2>

<h3 id="decision-tree">
  Auto Mode Decision Tree
</h3>

```text theme={null}
Operation requested
    │
    ├─ Can be expressed in SQL?
    │      │
    │      ├─ Yes → Use chDB
    │      │
    │      └─ No → Use pandas
    │
    └─ Cross-DataStore operation?
           │
           └─ Use cross_datastore_engine setting
```
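The tree above can be read as a small routing function. This is an illustrative sketch only, not DataStore's actual implementation: the `choose_engine` name and its parameters are hypothetical, and two details are assumptions because the tree does not state them. The sketch checks the cross-DataStore case first, and it treats a cross-DataStore setting of `auto` as "fall back to the SQL check".

```python
VALID_ENGINES = ("auto", "chdb", "pandas")


def choose_engine(sql_expressible, cross_datastore=False, cross_engine="auto"):
    """Return 'chdb' or 'pandas' for one operation (hypothetical helper)."""
    if cross_engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {cross_engine!r}")

    # Assumption: an explicit cross-DataStore setting takes precedence
    if cross_datastore and cross_engine != "auto":
        return cross_engine

    # Otherwise: SQL-expressible operations go to chDB, the rest to pandas
    return "chdb" if sql_expressible else "pandas"


filter_choice = choose_engine(sql_expressible=True)
apply_choice = choose_engine(sql_expressible=False)
join_choice = choose_engine(True, cross_datastore=True, cross_engine="pandas")
```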

<h3 id="function-override">
  Function-Level Override
</h3>

Some functions can have their engine explicitly configured:

```python theme={null}
from chdb.datastore.config import function_config

# Force specific functions to use specific engine
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper', 'lower')
```

See [Function Config](/products/chdb/configuration/function-config) for details.

***

<h2 id="performance-comparison">
  Performance Comparison
</h2>

Benchmark results on 10M rows:

| Operation        | pandas (ms) | chdb (ms) | Speedup |
| ---------------- | ----------- | --------- | ------- |
| GroupBy count    | 347         | 17        | 19.93x  |
| Combined ops     | 1,535       | 234       | 6.56x   |
| Complex pipeline | 2,047       | 380       | 5.39x   |
| Filter+Sort+Head | 1,537       | 350       | 4.40x   |
| GroupBy agg      | 406         | 141       | 2.88x   |
| Single filter    | 276         | 526       | 0.52x   |

**Key insights:**

* chDB excels at aggregations and complex pipelines
* pandas is faster for simple single operations (about 1.9x for the single filter above)
* Use `auto` mode to get the best of both
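
The Speedup column is the pandas time divided by the chDB time, so values below 1 mean pandas was faster. The sketch below recomputes the ratios from the table's millisecond values; the `benchmarks` dictionary and `speedup` function are names introduced here for illustration. The published timings appear to be rounded, so a recomputed ratio can differ slightly from the table (GroupBy count is one such case).

```python
# (pandas_ms, chdb_ms) pairs copied from the table above
benchmarks = {
    "GroupBy count": (347, 17),
    "Combined ops": (1535, 234),
    "Complex pipeline": (2047, 380),
    "Filter+Sort+Head": (1537, 350),
    "GroupBy agg": (406, 141),
    "Single filter": (276, 526),
}


def speedup(pandas_ms, chdb_ms):
    """Ratio of pandas time to chDB time; above 1 means chDB was faster."""
    return pandas_ms / chdb_ms


speedups = {name: round(speedup(p, c), 2) for name, (p, c) in benchmarks.items()}

# Operations where pandas beat chDB in this benchmark
faster_in_pandas = [name for name, s in speedups.items() if s < 1]

for name, s in speedups.items():
    print(f"{name}: {s}x")
```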

***

<h2 id="best-practices">
  Best Practices
</h2>

<h3 id="start-with-auto-mode">
  1. Start with Auto Mode
</h3>

```python theme={null}
config.use_auto()  # Let DataStore decide
```

<h3 id="profile-before-forcing">
  2. Profile Before Forcing
</h3>

```python theme={null}
config.enable_profiling()
# Run your workload
# Check profiler report to see where time is spent
```

<h3 id="force-engine-for-specific-workloads">
  3. Force Engine for Specific Workloads
</h3>

```python theme={null}
# For heavy aggregation workloads
config.use_chdb()

# For pandas compatibility testing
config.use_pandas()
```

<h3 id="use-explain-to-understand-execution">
  4. Use explain() to Understand Execution
</h3>

```python theme={null}
from chdb import datastore as pd

ds = pd.read_csv("data.csv")
query = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'sum'})

# See what SQL will be generated
query.explain()
```

***

<h2 id="troubleshooting">
  Troubleshooting
</h2>

<h3 id="issue-operation-slower">
  Issue: Operation slower than expected
</h3>

```python theme={null}
# Check current engine
print(config.execution_engine)

# Enable debug to see what's happening
config.enable_debug()

# Try forcing specific engine
config.use_chdb()  # or config.use_pandas()
```

<h3 id="issue-unsupported-operation">
  Issue: Unsupported operation in chdb mode
</h3>

```python theme={null}
# Some pandas operations aren't supported in SQL
# Solution: use auto mode
config.use_auto()

# Or explicitly convert to pandas first
df = ds.to_df()
result = df.some_pandas_specific_operation()
```

<h3 id="issue-memory-issues">
  Issue: Memory issues with large data
</h3>

```python theme={null}
# Use chdb engine to avoid loading all data into memory
config.use_chdb()

# Filter early to reduce data size
result = ds.filter(ds['date'] >= '2024-01-01').to_df()

# For maximum throughput on large datasets, use performance mode
# which enables parallel Parquet reading and single-SQL aggregation
config.use_performance_mode()
```

<Tip>
  **Performance Mode**

  If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using [Performance Mode](/products/chdb/configuration/performance-mode). It automatically sets the engine to `chdb` and removes all pandas compatibility overhead.
</Tip>
