
ClickHouse for Solana Data

Why ClickHouse is the fastest and most cost-effective database for large-scale Solana data indexing.

Updated March 2025 · 22 min read

Why ClickHouse?

When building a Solana data indexer that needs to handle billions of transactions with millisecond query times, database choice is the most critical architectural decision. ClickHouse — an open-source column-oriented OLAP (Online Analytical Processing) database originally developed at Yandex — has emerged as the clear winner for this use case, and it is not close.

ClickHouse was designed from the ground up for analytical workloads: aggregations over large datasets, time-series analysis, and high-throughput writes. These are precisely the operations that Solana data indexing requires. The result is a database that can ingest millions of rows per second while simultaneously serving complex analytical queries in milliseconds — performance that row-oriented databases like PostgreSQL cannot match at this scale.

ClickHouse vs. Alternatives for Solana Indexing

| Database | Write Speed | Query Speed (1B rows) | Storage Efficiency | Solana Fit |
| --- | --- | --- | --- | --- |
| ClickHouse | 1M+ rows/sec | Milliseconds | Excellent (10:1 compression) | ⭐⭐⭐⭐⭐ |
| PostgreSQL | ~50K rows/sec | Minutes | Good (3:1 compression) | ⭐⭐ |
| TimescaleDB | ~200K rows/sec | Seconds–Minutes | Good | ⭐⭐⭐ |
| BigQuery | Managed | Seconds | Excellent | ⭐⭐⭐⭐ |
| Apache Parquet + Spark | Batch only | Minutes | Excellent | ⭐⭐⭐ |
| MongoDB | ~100K rows/sec | Minutes | Poor | ⭐⭐ |

How ClickHouse Achieves This Performance

Column-Oriented Storage

Traditional row-oriented databases store all columns of a row together. When you query only 3 columns from a 50-column table, the database still reads all 50 columns from disk. ClickHouse stores each column separately. A query reading 3 columns reads only 6% of the data, dramatically reducing I/O.

For Solana transaction data, this is transformative. A typical transaction record might have 30+ fields, but most analytical queries only need 3–5 of them. ClickHouse reads only the relevant columns, making queries 5–10x faster than row-oriented alternatives.
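As a sketch, a query like the following touches only the column files it actually references — here, three of the thirty-plus columns in the transactions table defined later in this article:

```sql
-- Only the slot, fee, and block_time column files are read from disk;
-- the remaining ~30 columns of the table are never touched.
SELECT slot, fee
FROM solana.transactions
WHERE block_time > now() - INTERVAL 1 DAY;
```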

Vectorized Query Execution

ClickHouse processes data in batches of 65,536 rows (vectors) rather than row-by-row. This enables SIMD (Single Instruction, Multiple Data) CPU instructions, which process multiple values simultaneously. Combined with automatic parallelization across all CPU cores, ClickHouse can aggregate billions of rows in seconds.

MergeTree Engine and Compression

ClickHouse's MergeTree storage engine stores data sorted by a primary key, enabling efficient range scans. Data is stored in compressed chunks using LZ4 or ZSTD compression. Solana transaction data typically achieves 8:1 to 12:1 compression ratios in ClickHouse, reducing storage costs dramatically compared to uncompressed alternatives.
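You can verify the compression ratio on your own data: ClickHouse exposes per-part sizes in the `system.parts` system table, so a query along these lines reports the effective ratio for the transactions table:

```sql
-- Effective compression ratio for the transactions table
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active AND database = 'solana' AND table = 'transactions'
GROUP BY table;
```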

Materialized Views

ClickHouse materialized views automatically maintain pre-aggregated data as new rows are inserted. For Solana analytics, this means you can maintain real-time aggregates — hourly trading volume, daily active wallets, per-program transaction counts — without running expensive queries on the full dataset.
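As a sketch of the pattern (using the `dex_swaps` table defined in the schema section below), an incrementally maintained hourly volume rollup might look like this — table and view names here are illustrative:

```sql
-- Hourly per-program swap volume, maintained incrementally on insert.
-- SummingMergeTree collapses rows with the same sort key by summing the
-- numeric columns during background merges; query with sum()/GROUP BY
-- to also account for parts that have not merged yet.
CREATE TABLE solana.hourly_volume
(
    hour        DateTime,
    program_id  String,
    swap_count  UInt64,
    volume_in   UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (hour, program_id);

CREATE MATERIALIZED VIEW solana.hourly_volume_mv
TO solana.hourly_volume
AS SELECT
    toStartOfHour(block_time) AS hour,
    program_id,
    1 AS swap_count,
    amount_in AS volume_in
FROM solana.dex_swaps;
```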

Solana Data Schema Design

Designing the right schema is as important as choosing ClickHouse. The key principle is to optimize for your most common queries through sort key selection and partitioning.

solana_transactions.sql
-- Core transactions table
CREATE TABLE solana.transactions
(
    signature         String,
    slot              UInt64,
    block_time        DateTime,
    success           Bool,
    fee               UInt64,
    compute_units     UInt32,
    
    -- Account keys (up to 64 accounts per transaction)
    account_keys      Array(String),
    
    -- Program IDs involved
    program_ids       Array(String),
    
    -- Pre/post token balances for DEX analysis
    pre_token_balances  Array(Tuple(
        account_index UInt8,
        mint          String,
        amount        UInt64,
        decimals      UInt8
    )),
    post_token_balances Array(Tuple(
        account_index UInt8,
        mint          String,
        amount        UInt64,
        decimals      UInt8
    )),
    
    -- Log messages for event parsing
    log_messages      Array(String),
    
    -- Instruction data (base58 encoded)
    instructions      Array(Tuple(
        program_id    String,
        accounts      Array(UInt8),
        data          String
    ))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_time)
ORDER BY (block_time, slot, signature)
SETTINGS index_granularity = 8192;
dex_swaps.sql
-- Parsed DEX swap events (materialized from transactions)
CREATE TABLE solana.dex_swaps
(
    signature         String,
    block_time        DateTime,
    slot              UInt64,
    
    -- DEX identification
    program_id        String,
    pool_address      String,
    
    -- Swap details
    trader            String,
    token_in_mint     String,
    token_out_mint    String,
    amount_in         UInt64,
    amount_out        UInt64,
    
    -- Derived metrics
    price_impact      Float64,
    fee_amount        UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_time)
ORDER BY (block_time, program_id, pool_address)
SETTINGS index_granularity = 8192;

-- Materialized view to auto-populate from transactions.
-- Columns not selected here fall back to their defaults in dex_swaps;
-- the real parsing logic (trader, mints, amounts) depends on the
-- specific DEX program's instruction layout.
CREATE MATERIALIZED VIEW solana.dex_swaps_mv
TO solana.dex_swaps
AS SELECT
    signature,
    block_time,
    slot,
    program_ids[1] AS program_id
FROM solana.transactions
WHERE has(program_ids, '675kPX9MHTjS2zt1qfr1NYHuzeLXfQM9H24wFSUt1Mp8'); -- Raydium

Query Examples

Trading Volume by DEX

-- 24-hour trading volume by DEX program
SELECT 
    program_id,
    count() as swap_count,
    sum(amount_in) / 1e9 as volume_sol,  -- assumes the input token is SOL (amounts in lamports)
    uniqExact(trader) as unique_traders
FROM solana.dex_swaps
WHERE block_time > now() - INTERVAL 24 HOUR
GROUP BY program_id
ORDER BY volume_sol DESC;

Wallet Activity Analysis

-- Top wallets by transaction count in last 7 days
SELECT 
    arrayJoin(account_keys) as wallet,
    count() as tx_count,
    sum(fee) / 1e9 as total_fees_sol
FROM solana.transactions
WHERE 
    block_time > now() - INTERVAL 7 DAY
    AND success = true
GROUP BY wallet
HAVING tx_count > 100
ORDER BY tx_count DESC
LIMIT 100;

Real-Time Price Aggregation

-- 1-minute OHLCV for a token pair
SELECT 
    toStartOfMinute(block_time) as minute,
    argMin(price, block_time) as open,
    max(price) as high,
    min(price) as low,
    argMax(price, block_time) as close,
    sum(amount_in) as volume
FROM (
    SELECT 
        block_time,
        -- Raw ratio; scale by the two tokens' decimals for a true price
        toFloat64(amount_out) / toFloat64(amount_in) as price,
        amount_in
    FROM solana.dex_swaps
    WHERE 
        token_in_mint = 'So11111111111111111111111111111111111111112'  -- SOL
        AND token_out_mint = 'YourTokenMint'
        AND block_time > now() - INTERVAL 24 HOUR
)
GROUP BY minute
ORDER BY minute;

Ingestion Pipeline

The ingestion pipeline connects your data source (gRPC stream or RPC) to ClickHouse. For real-time ingestion, the recommended approach is to use ClickHouse's asynchronous insert mode, which batches small inserts automatically:

clickhouse-ingest.ts
import { createClient } from '@clickhouse/client';

const client = createClient({
  url: 'http://localhost:8123',
  database: 'solana',
  // Enable async inserts for high-throughput ingestion
  clickhouse_settings: {
    async_insert: 1,
    wait_for_async_insert: 0,
    async_insert_max_data_size: '10000000', // 10MB batches
    async_insert_busy_timeout_ms: '1000',   // Flush every 1 second
  },
});

// Buffer transactions and insert in batches.
// `ParsedTransaction` is your decoded stream payload; `transformTransaction`
// maps it onto a `TransactionRecord` row matching the solana.transactions schema.
const buffer: TransactionRecord[] = [];
const BATCH_SIZE = 1000;
const FLUSH_INTERVAL_MS = 1000;

// Time-based flush so rows never sit in the buffer during quiet periods
setInterval(() => {
  flushBuffer().catch((err) => console.error('Flush failed:', err));
}, FLUSH_INTERVAL_MS);

async function ingestTransaction(tx: ParsedTransaction) {
  buffer.push(transformTransaction(tx));

  if (buffer.length >= BATCH_SIZE) {
    await flushBuffer();
  }
}

async function flushBuffer() {
  if (buffer.length === 0) return;

  // splice drains the shared buffer in one step, so overlapping calls
  // never insert the same rows twice
  const batch = buffer.splice(0, buffer.length);
  await client.insert({
    table: 'transactions',
    values: batch,
    format: 'JSONEachRow',
  });

  console.log(`Inserted ${batch.length} transactions`);
}

Managed ClickHouse for Solana

Running ClickHouse in production requires significant operational expertise. For teams that want the performance benefits without the operational overhead, managed solutions are available. Supanode offers direct ClickHouse SQL access to their pre-built Solana transaction index, covering data from January 2024 to real-time. This eliminates the need to build and maintain your own ingestion pipeline while providing the full flexibility of SQL queries.

🚀
ClickHouse is the recommended storage layer for any Solana indexing project processing more than 10 million transactions. For smaller datasets or simpler query patterns, PostgreSQL with TimescaleDB may be sufficient and easier to operate.