# ClickHouse for Solana Data

Why ClickHouse is the fastest and most cost-effective database for large-scale Solana data indexing.

## Why ClickHouse?
When building a Solana data indexer that must handle billions of transactions with millisecond query times, database choice is the most critical architectural decision. ClickHouse, an open-source column-oriented OLAP (Online Analytical Processing) database originally developed at Yandex, has emerged as the clear winner for this use case, and it is not close.
ClickHouse was designed from the ground up for analytical workloads: aggregations over large datasets, time-series analysis, and high-throughput writes. These are precisely the operations that Solana data indexing requires. The result is a database that can ingest millions of rows per second while simultaneously serving complex analytical queries in milliseconds — performance that row-oriented databases like PostgreSQL cannot match at this scale.
## ClickHouse vs. Alternatives for Solana Indexing
| Database | Write Speed | Query Speed (1B rows) | Storage Efficiency | Solana Fit |
|---|---|---|---|---|
| ClickHouse | 1M+ rows/sec | Milliseconds | Excellent (10:1 compression) | ⭐⭐⭐⭐⭐ |
| PostgreSQL | ~50K rows/sec | Minutes | Good (3:1 compression) | ⭐⭐ |
| TimescaleDB | ~200K rows/sec | Seconds–Minutes | Good | ⭐⭐⭐ |
| BigQuery | Managed | Seconds | Excellent | ⭐⭐⭐⭐ |
| Apache Parquet+Spark | Batch only | Minutes | Excellent | ⭐⭐⭐ |
| MongoDB | ~100K rows/sec | Minutes | Poor | ⭐⭐ |
## How ClickHouse Achieves This Performance

### Column-Oriented Storage
Traditional row-oriented databases store all columns of a row together. When you query only 3 columns from a 50-column table, the database still reads all 50 columns from disk. ClickHouse stores each column separately. A query reading 3 columns reads only 6% of the data, dramatically reducing I/O.
For Solana transaction data, this is transformative. A typical transaction record might have 30+ fields, but most analytical queries only need 3–5 of them. ClickHouse reads only the relevant columns, making queries 5–10x faster than row-oriented alternatives.
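To make this concrete, a query like the following reads only the column files it names. Against the `solana.transactions` schema shown later in this article, the heavyweight `log_messages` and `instructions` columns are never touched on disk (a sketch, not benchmarked here):

```sql
-- Only the signature, fee, and block_time column files are read;
-- all other columns of the table stay on disk untouched.
SELECT signature, fee
FROM solana.transactions
WHERE block_time > now() - INTERVAL 1 HOUR;
```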
### Vectorized Query Execution
ClickHouse processes data in batches of 65,536 rows (vectors) rather than row-by-row. This enables SIMD (Single Instruction, Multiple Data) CPU instructions, which process multiple values simultaneously. Combined with automatic parallelization across all CPU cores, ClickHouse can aggregate billions of rows in seconds.
### MergeTree Engine and Compression
ClickHouse's MergeTree storage engine stores data sorted by a primary key, enabling efficient range scans. Data is stored in compressed chunks using LZ4 or ZSTD compression. Solana transaction data typically achieves 8:1 to 12:1 compression ratios in ClickHouse, reducing storage costs dramatically compared to uncompressed alternatives.
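You can check the ratio your own data achieves by querying ClickHouse's built-in `system.columns` table, which tracks compressed and uncompressed sizes per column:

```sql
-- Per-column compressed vs. uncompressed size for the transactions table
SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.columns
WHERE database = 'solana' AND table = 'transactions'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC;
```

Columns with repetitive values (program IDs, mints) typically compress far better than high-entropy ones like signatures.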
### Materialized Views
ClickHouse materialized views automatically maintain pre-aggregated data as new rows are inserted. For Solana analytics, this means you can maintain real-time aggregates — hourly trading volume, daily active wallets, per-program transaction counts — without running expensive queries on the full dataset.
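For example, hourly volume per program can be kept current with a `SummingMergeTree` target fed by a materialized view over the `dex_swaps` table defined later in this article (a sketch; the `hourly_volume` table name is illustrative):

```sql
-- Pre-aggregated hourly volume, updated on every insert into dex_swaps
CREATE TABLE solana.hourly_volume
(
    hour DateTime,
    program_id String,
    volume UInt64,
    swaps UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (hour, program_id);

CREATE MATERIALIZED VIEW solana.hourly_volume_mv
TO solana.hourly_volume
AS SELECT
    toStartOfHour(block_time) AS hour,
    program_id,
    sum(amount_in) AS volume,
    count() AS swaps
FROM solana.dex_swaps
GROUP BY hour, program_id;
```

SummingMergeTree collapses rows with the same sort key at merge time, so the partial sums produced on each insert converge to the full aggregate without any query-time cost.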
## Solana Data Schema Design
Designing the right schema is as important as choosing ClickHouse. The key principle is to optimize for your most common queries through sort key selection and partitioning.
```sql
-- Core transactions table
CREATE TABLE solana.transactions
(
    signature String,
    slot UInt64,
    block_time DateTime,
    success Bool,
    fee UInt64,
    compute_units UInt32,
    -- Account keys (up to 64 accounts per transaction)
    account_keys Array(String),
    -- Program IDs involved
    program_ids Array(String),
    -- Pre/post token balances for DEX analysis
    pre_token_balances Array(Tuple(
        account_index UInt8,
        mint String,
        amount UInt64,
        decimals UInt8
    )),
    post_token_balances Array(Tuple(
        account_index UInt8,
        mint String,
        amount UInt64,
        decimals UInt8
    )),
    -- Log messages for event parsing
    log_messages Array(String),
    -- Instruction data (base58 encoded)
    instructions Array(Tuple(
        program_id String,
        accounts Array(UInt8),
        data String
    ))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_time)
ORDER BY (block_time, slot, signature)
SETTINGS index_granularity = 8192;
```

```sql
-- Parsed DEX swap events (materialized from transactions)
CREATE TABLE solana.dex_swaps
(
    signature String,
    block_time DateTime,
    slot UInt64,
    -- DEX identification
    program_id String,
    pool_address String,
    -- Swap details
    trader String,
    token_in_mint String,
    token_out_mint String,
    amount_in UInt64,
    amount_out UInt64,
    -- Derived metrics
    price_impact Float64,
    fee_amount UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_time)
ORDER BY (block_time, program_id, pool_address)
SETTINGS index_granularity = 8192;
```
```sql
-- Materialized view to auto-populate from transactions
CREATE MATERIALIZED VIEW solana.dex_swaps_mv
TO solana.dex_swaps
AS SELECT
    signature,
    block_time,
    slot,
    -- Parse swap data from transactions
    -- (parsing logic depends on the specific DEX program)
    program_ids[1] AS program_id
FROM solana.transactions
WHERE has(program_ids, '675kPX9MHTjS2zt1qfr1NYHuzeLXfQM9H24wFSUt1Mp8'); -- Raydium
```

## Query Examples
### Trading Volume by DEX
```sql
-- 24-hour trading volume by DEX program
SELECT
    program_id,
    count() AS swap_count,
    sum(amount_in) / 1e9 AS volume_sol,
    uniqExact(trader) AS unique_traders
FROM solana.dex_swaps
WHERE block_time > now() - INTERVAL 24 HOUR
GROUP BY program_id
ORDER BY volume_sol DESC;
```

### Wallet Activity Analysis
```sql
-- Top wallets by transaction count in last 7 days
SELECT
    arrayJoin(account_keys) AS wallet,
    count() AS tx_count,
    sum(fee) / 1e9 AS total_fees_sol
FROM solana.transactions
WHERE
    block_time > now() - INTERVAL 7 DAY
    AND success = true
GROUP BY wallet
HAVING tx_count > 100
ORDER BY tx_count DESC
LIMIT 100;
```

### Real-Time Price Aggregation
```sql
-- 1-minute OHLCV for a token pair
SELECT
    toStartOfMinute(block_time) AS minute,
    argMin(price, block_time) AS open,
    max(price) AS high,
    min(price) AS low,
    argMax(price, block_time) AS close,
    sum(amount_in) AS volume
FROM (
    SELECT
        block_time,
        toFloat64(amount_out) / toFloat64(amount_in) AS price,
        amount_in
    FROM solana.dex_swaps
    WHERE
        token_in_mint = 'So11111111111111111111111111111111111111112' -- SOL
        AND token_out_mint = 'YourTokenMint'
        AND block_time > now() - INTERVAL 24 HOUR
)
GROUP BY minute
ORDER BY minute;
```

## Ingestion Pipeline
The ingestion pipeline connects your data source (gRPC stream or RPC) to ClickHouse. For real-time ingestion, the recommended approach is to use ClickHouse's asynchronous insert mode, which batches small inserts automatically:
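Before inserting, each decoded transaction must be flattened into the table's column layout. A minimal sketch of such a transform, with assumed field names (the real `ParsedTransaction` shape depends on your decoder):

```typescript
// Hypothetical input shape — field names are assumptions, not a real
// library type; adapt to whatever your gRPC/RPC decoder produces.
interface ParsedTransaction {
  signature: string;
  slot: number;
  blockTime: number; // unix seconds
  success: boolean;
  fee: number;
  accountKeys: string[];
  programIds: string[];
}

// Row shape matching the solana.transactions columns used here.
interface TransactionRecord {
  signature: string;
  slot: number;
  block_time: string; // 'YYYY-MM-DD hh:mm:ss' for ClickHouse DateTime
  success: boolean;
  fee: number;
  account_keys: string[];
  program_ids: string[];
}

function transformTransaction(tx: ParsedTransaction): TransactionRecord {
  return {
    signature: tx.signature,
    slot: tx.slot,
    // ClickHouse parses 'YYYY-MM-DD hh:mm:ss' strings into DateTime
    block_time: new Date(tx.blockTime * 1000)
      .toISOString()
      .replace('T', ' ')
      .slice(0, 19),
    success: tx.success,
    fee: tx.fee,
    account_keys: tx.accountKeys,
    program_ids: tx.programIds,
  };
}
```

The snake_case keys must match the table's column names exactly, since `JSONEachRow` inserts map JSON keys to columns by name.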
```typescript
import { createClient } from '@clickhouse/client';

const client = createClient({
  url: 'http://localhost:8123',
  database: 'solana',
  // Enable async inserts for high-throughput ingestion
  clickhouse_settings: {
    async_insert: 1,
    wait_for_async_insert: 0,
    async_insert_max_data_size: '10000000', // 10MB batches
    async_insert_busy_timeout_ms: '1000',   // Flush every 1 second
  },
});

// ParsedTransaction, TransactionRecord, and transformTransaction are
// application-specific: the transform flattens a decoded transaction
// into the column layout of solana.transactions.
const buffer: TransactionRecord[] = [];
const BATCH_SIZE = 1000;

async function ingestTransaction(tx: ParsedTransaction) {
  buffer.push(transformTransaction(tx));
  if (buffer.length >= BATCH_SIZE) {
    await flushBuffer();
  }
}

async function flushBuffer() {
  if (buffer.length === 0) return;
  // splice empties the shared buffer so new transactions accumulate
  // while the insert is in flight
  const batch = buffer.splice(0, buffer.length);
  await client.insert({
    table: 'transactions',
    values: batch,
    format: 'JSONEachRow',
  });
  console.log(`Inserted ${batch.length} transactions`);
}
```

## Managed ClickHouse for Solana
Running ClickHouse in production requires significant operational expertise. For teams that want the performance benefits without the operational overhead, managed solutions are available. Supanode offers direct ClickHouse SQL access to their pre-built Solana transaction index, covering data from January 2024 to real-time. This eliminates the need to build and maintain your own ingestion pipeline while providing the full flexibility of SQL queries.