---
description: 'Documentation for external disks for storing data'
sidebar_label: 'External disks for storing data'
sidebar_position: 68
slug: /operations/storing-data
title: 'External disks for storing data'
doc_type: 'guide'
---

Data processed in ClickHouse is usually stored in the local file system of the machine on which ClickHouse server is running. That requires large-capacity disks, which can be expensive. To avoid storing data locally, various storage options are supported:

  1. Amazon S3 object storage.
  2. Azure Blob Storage.
  3. Unsupported: The Hadoop Distributed File System (HDFS)

:::note ClickHouse also has support for external table engines, which are different from the external storage option described on this page, as they allow reading data stored in some general file format (like Parquet). On this page we describe storage configuration for ClickHouse MergeTree family or Log family tables.

  1. To work with data stored on Amazon S3 disks, use the S3 table engine.
  2. To work with data stored in Azure Blob Storage, use the AzureBlobStorage table engine.
  3. To work with data in the Hadoop Distributed File System (unsupported), use the HDFS table engine.
:::

Configure external storage {#configuring-external-storage}

MergeTree and Log family table engines can store data in S3, Azure Blob Storage, or HDFS (unsupported) using a disk with type s3, azure_blob_storage, or hdfs (unsupported), respectively.

Disk configuration requires:

  1. A type section, equal to one of s3, azure_blob_storage, hdfs (unsupported), local_blob_storage, web.
  2. Configuration of a specific external storage type.

Starting from ClickHouse version 24.1, it is possible to use a new configuration option. It requires specifying:

  1. A type equal to object_storage
  2. object_storage_type, equal to one of s3, azure_blob_storage (or just azure from 24.3), hdfs (unsupported), local_blob_storage (or just local from 24.3), web.

Optionally, metadata_type can be specified (it is equal to local by default), but it can also be set to plain, web and, starting from 24.4, plain_rewritable. Usage of the plain metadata type is described in the plain storage section. The web metadata type can be used only with the web object storage type. The local metadata type stores metadata files locally (each metadata file contains a mapping to files in object storage plus some additional metadata about them).

For example:

<s3>
    <type>s3</type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3>

is equal to the following configuration (from version 24.1):

<s3>
    <type>object_storage</type>
    <object_storage_type>s3</object_storage_type>
    <metadata_type>local</metadata_type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3>

The following configuration:

<s3_plain>
    <type>s3_plain</type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain>

is equal to:

<s3_plain>
    <type>object_storage</type>
    <object_storage_type>s3</object_storage_type>
    <metadata_type>plain</metadata_type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain>

An example of a full storage configuration looks like this:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>
            </s3>
        </disks>
        <policies>
            <s3>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3>
        </policies>
    </storage_configuration>
</clickhouse>

Starting with version 24.1, it can also look like:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>object_storage</type>
                <object_storage_type>s3</object_storage_type>
                <metadata_type>local</metadata_type>
                <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
                <use_environment_credentials>1</use_environment_credentials>
            </s3>
        </disks>
        <policies>
            <s3>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3>
        </policies>
    </storage_configuration>
</clickhouse>

To make a specific kind of storage a default option for all MergeTree tables, add the following section to the configuration file:

<clickhouse>
    <merge_tree>
        <storage_policy>s3</storage_policy>
    </merge_tree>
</clickhouse>

If you want to configure a specific storage policy for a specific table, you can define it in settings while creating the table:

CREATE TABLE test (a Int32, b String)
ENGINE = MergeTree() ORDER BY a
SETTINGS storage_policy = 's3';

You can also use disk instead of storage_policy. In this case it is not necessary to have the storage_policy section in the configuration file, and a disk section is enough.

CREATE TABLE test (a Int32, b String)
ENGINE = MergeTree() ORDER BY a
SETTINGS disk = 's3';

refresh_parts_interval and table_disk {#refresh-parts-interval-and-table-disk}

This setting is intended for non-Replicated MergeTree tables where parts may be written externally and metadata discovery must be refreshed from storage.

The MergeTree setting refresh_parts_interval enables periodic refresh of the list of data parts from the underlying storage (e.g. to pick up parts written externally). The important distinction is shared metadata across replicas vs replica-local metadata (e.g. S3 with local metadata per replica): only when metadata is shared will new parts be visible to all replicas. Using object storage alone does not imply shared metadata.

  • Object storage (e.g. disk = 's3') does not imply shared metadata. When metadata is stored locally per replica (the default), each replica independently manages its pointers to blobs in object storage. Changes made on one replica are not visible to others. In that case, refresh_parts_interval does not make new parts visible across replicas, because the metadata each replica reads is replica-local.

  • Automatic part refreshing requires that the filesystem metadata be shared (or that the table use table-owned, readonly metadata so that refresh is applicable). Setting table_disk = true together with a table-local disk (e.g. SETTINGS disk = disk(type=object_storage, ...), table_disk = true) is one way to get the correct semantics: the table owns the metadata life cycle and the storage is treated as readonly, so refresh_parts_interval runs and externally added parts can be discovered.

  • With a globally defined disk (e.g. disk = 's3' in storage_configuration) and default local metadata, each replica has its own metadata state. Even though blobs may be in S3, the storage is not considered shared for the purpose of refresh_parts_interval, and new parts created outside ClickHouse or on another replica will not be detected.

For automatic part refreshing, ensure the metadata is shared or use a table-level disk with table_disk = true as above. Relying only on refresh_parts_interval with replica-local metadata will not refresh parts as expected.
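
Below is a minimal hedged sketch of that pattern; the table name, column list, refresh interval value, and the choice of the plain metadata type are illustrative assumptions rather than the only valid combination, and the endpoint reuses the example bucket from this page:

-- Hypothetical table: the disk is table-local, the table owns the metadata and
-- treats the storage as readonly, so refresh_parts_interval can discover parts
-- added to the bucket externally (the interval value below is illustrative).
ATTACH TABLE events (id UInt64, value String)
ENGINE = MergeTree
ORDER BY id
SETTINGS
    disk = disk(
        type = object_storage,
        object_storage_type = s3,
        metadata_type = plain,
        endpoint = 'https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/',
        use_environment_credentials = 1),
    table_disk = true,
    refresh_parts_interval = 60;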

:::note refresh_parts_interval is not used for ReplicatedMergeTree tables. Replicated tables already synchronize parts through the replication mechanism. This setting is only applicable to non-replicated MergeTree tables where parts are written externally and metadata refresh is required. :::

Dynamic Configuration {#dynamic-configuration}

It is also possible to specify a storage configuration without a predefined disk in the configuration file; instead, the disk can be configured in the CREATE/ATTACH query settings.

The following example query uses dynamic disk configuration to attach a table whose data is stored at a URL; the example after it shows how to use a local disk to cache that data.

ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
    is_new UInt8,
    duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
  -- highlight-start
  SETTINGS disk = disk(
    type=web,
    endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
  );
  -- highlight-end

The example below adds a cache to external storage.

ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
    is_new UInt8,
    duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
-- highlight-start
  SETTINGS disk = disk(
    type=cache,
    max_size='1Gi',
    path='/var/lib/clickhouse/custom_disk_cache/',
    disk=disk(
      type=web,
      endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
      )
  );
-- highlight-end

In the highlighted settings, notice that the disk of type=web is nested within the disk of type=cache.

:::note The example uses type=web, but any disk type can be configured as dynamic, including local disk. Local disks require a path argument to be inside the server config parameter custom_local_disks_base_directory, which has no default, so set that also when using local disk. :::
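
As a hedged sketch of the local-disk case (the table name and path are illustrative, and custom_local_disks_base_directory is assumed to already be set in the server configuration):

CREATE TABLE local_dynamic (a Int32, b String)
ENGINE = MergeTree
ORDER BY a
SETTINGS disk = disk(
    type = local_blob_storage,
    -- assumed to resolve to a directory inside custom_local_disks_base_directory
    path = 'local_dynamic/');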

A combination of configuration-file-based and SQL-defined configuration is also possible:

ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
    is_new UInt8,
    duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
  -- highlight-start
  SETTINGS disk = disk(
    type=cache,
    max_size='1Gi',
    path='/var/lib/clickhouse/custom_disk_cache/',
    disk=disk(
      type=web,
      endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
      )
  );
  -- highlight-end

where web is from the server configuration file:

<storage_configuration>
    <disks>
        <web>
            <type>web</type>
            <endpoint>https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/</endpoint>
        </web>
    </disks>
</storage_configuration>

Using S3 Storage {#s3-storage}

Required parameters {#required-parameters-s3}

| Parameter | Description |
|---|---|
| endpoint | S3 endpoint URL in path or virtual hosted styles. Should include the bucket and root path for data storage. |
| access_key_id | S3 access key ID used for authentication. |
| secret_access_key | S3 secret access key used for authentication. |

Optional parameters {#optional-parameters-s3}

| Parameter | Description | Default Value |
|---|---|---|
| region | S3 region name. | - |
| support_batch_delete | Controls whether to check for batch delete support. Set to false when using Google Cloud Storage (GCS), as GCS doesn't support batch deletes. | true |
| use_environment_credentials | Reads AWS credentials from the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN if they exist. Note: environment credentials are shared across all S3 disks. To use different credentials for different disks, specify explicit access_key_id and secret_access_key per disk instead. | false |
| use_insecure_imds_request | If true, uses an insecure IMDS request when obtaining credentials from Amazon EC2 metadata. | false |
| expiration_window_seconds | Grace period (in seconds) for checking if expiration-based credentials have expired. | 120 |
| proxy | Proxy configuration for the S3 endpoint. Each uri element inside the proxy block should contain a proxy URL. | - |
| connect_timeout_ms | Socket connect timeout in milliseconds. | 10000 (10 seconds) |
| request_timeout_ms | Request timeout in milliseconds. | 5000 (5 seconds) |
| retry_attempts | Number of retry attempts for failed requests. | 10 |
| single_read_retries | Number of retry attempts for connection drops during read. | 4 |
| min_bytes_for_seek | Minimum number of bytes to use a seek operation instead of sequential read. | 1 MB |
| metadata_path | Local filesystem path to store S3 metadata files. | /var/lib/clickhouse/disks/<disk_name>/ |
| skip_access_check | If true, skips disk access checks during startup. | false |
| header | Adds the specified HTTP header to requests. Can be specified multiple times. | - |
| server_side_encryption_customer_key_base64 | Required headers for accessing S3 objects with SSE-C encryption. | - |
| server_side_encryption_kms_key_id | Required headers for accessing S3 objects with SSE-KMS encryption. An empty string uses the AWS managed S3 key. | - |
| server_side_encryption_kms_encryption_context | Encryption context header for SSE-KMS (used with server_side_encryption_kms_key_id). | - |
| server_side_encryption_kms_bucket_key_enabled | Enables S3 bucket keys for SSE-KMS (used with server_side_encryption_kms_key_id). | Matches bucket-level setting |
| s3_max_put_rps | Maximum PUT requests per second before throttling. | 0 (unlimited) |
| s3_max_put_burst | Maximum concurrent PUT requests before hitting the RPS limit. | Same as s3_max_put_rps |
| s3_max_get_rps | Maximum GET requests per second before throttling. | 0 (unlimited) |
| s3_max_get_burst | Maximum concurrent GET requests before hitting the RPS limit. | Same as s3_max_get_rps |
| read_resource | Resource name for scheduling read requests. | Empty string (disabled) |
| write_resource | Resource name for scheduling write requests. | Empty string (disabled) |
| key_template | Defines the object key generation format using re2 syntax. Requires the storage_metadata_write_full_object_key flag. Incompatible with a root path in endpoint. Requires key_compatibility_prefix. | - |
| key_compatibility_prefix | Required with key_template. Specifies the previous root path from endpoint for reading older metadata versions. | - |
| read_only | Allows only reading from the disk. | - |
:::note
Google Cloud Storage (GCS) is also supported using the type s3. See GCS backed MergeTree.
:::
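
As an illustrative sketch only (the endpoint is the example bucket used throughout this page, and the tuning values are assumptions, not recommendations), several of the optional parameters above can be combined, shown here with the dynamic disk syntax described earlier:

CREATE TABLE s3_tuned (a Int32, b String)
ENGINE = MergeTree
ORDER BY a
SETTINGS disk = disk(
    type = s3,
    endpoint = 'https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/',
    use_environment_credentials = 1,
    -- illustrative values for parameters from the table above
    region = 'eu-west-1',
    request_timeout_ms = 30000,
    retry_attempts = 10);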

Using Plain Storage {#plain-storage}

In 22.10 a new disk type s3_plain was introduced, which provides write-once storage. Its configuration parameters are the same as for the s3 disk type. Unlike the s3 disk type, it stores data as is: instead of randomly generated blob names, it uses normal file names (the same way ClickHouse stores files on local disk) and does not store any metadata locally; instead, the metadata is derived from the data on S3.

This disk type allows keeping a static version of the table, as it does not allow executing merges on the existing data and does not allow inserting of new data. A use case for this disk type is to create backups on it, which can be done via BACKUP TABLE data TO Disk('plain_disk_name', 'backup_name'). Afterward, you can do RESTORE TABLE data AS data_restored FROM Disk('plain_disk_name', 'backup_name') or use ATTACH TABLE data (...) ENGINE = MergeTree() SETTINGS disk = 'plain_disk_name'.

Configuration:

<s3_plain>
    <type>s3_plain</type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain>
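
With a disk like this defined, a hedged usage sketch of the backup workflow described above (the table and backup names are illustrative):

BACKUP TABLE data TO Disk('s3_plain', 'backup_name');
RESTORE TABLE data AS data_restored FROM Disk('s3_plain', 'backup_name');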

Starting from 24.1 it is possible to configure any object storage disk (s3, azure, hdfs (unsupported), local) using the plain metadata type.

Configuration:

<s3_plain>
    <type>object_storage</type>
    <object_storage_type>azure</object_storage_type>
    <metadata_type>plain</metadata_type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain>

Using S3 Plain Rewritable Storage {#s3-plain-rewritable-storage}

A new disk type s3_plain_rewritable was introduced in 24.4. Similar to the s3_plain disk type, it does not require additional storage for metadata files. Instead, metadata is stored in S3. Unlike the s3_plain disk type, s3_plain_rewritable allows executing merges and supports INSERT operations. Mutations and replication of tables are not supported.

A use case for this disk type is for non-replicated MergeTree tables. Although the s3 disk type is suitable for non-replicated MergeTree tables, you may opt for the s3_plain_rewritable disk type if you do not require local metadata for the table and are willing to accept a limited set of operations. This could be useful, for example, for system tables.

Configuration:

<s3_plain_rewritable>
    <type>s3_plain_rewritable</type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain_rewritable>

is equal to

<s3_plain_rewritable>
    <type>object_storage</type>
    <object_storage_type>s3</object_storage_type>
    <metadata_type>plain_rewritable</metadata_type>
    <endpoint>https://s3.eu-west-1.amazonaws.com/clickhouse-eu-west-1.clickhouse.com/data/</endpoint>
    <use_environment_credentials>1</use_environment_credentials>
</s3_plain_rewritable>

Starting from 24.5 it is possible to configure any object storage disk (s3, azure, local) using the plain_rewritable metadata type.
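
For example, a hedged sketch of a non-replicated table placed on the s3_plain_rewritable disk defined above (the table name is illustrative; mutations and replication remain unsupported):

CREATE TABLE events_plain_rw (id UInt64, payload String)
ENGINE = MergeTree
ORDER BY id
SETTINGS disk = 's3_plain_rewritable';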

Using Azure Blob Storage {#azure-blob-storage}

MergeTree family table engines can store data to Azure Blob Storage using a disk with type azure_blob_storage.

Configuration markup:

<storage_configuration>
    ...
    <disks>
        <blob_storage_disk>
            <type>azure_blob_storage</type>
            <storage_account_url>http://account.blob.core.windows.net</storage_account_url>
            <container_name>container</container_name>
            <account_name>account</account_name>
            <account_key>pass123</account_key>
            <metadata_path>/var/lib/clickhouse/disks/blob_storage_disk/</metadata_path>
            <cache_path>/var/lib/clickhouse/disks/blob_storage_disk/cache/</cache_path>
            <skip_access_check>false</skip_access_check>
        </blob_storage_disk>
    </disks>
    ...
</storage_configuration>

Connection parameters {#azure-blob-storage-connection-parameters}

| Parameter | Description | Default Value |
|---|---|---|
| storage_account_url | (Required) Azure Blob Storage account URL. Examples: http://account.blob.core.windows.net or http://azurite1:10000/devstoreaccount1. | - |
| container_name | Target container name. | default-container |
| container_already_exists | Controls container creation behavior:<br/>- false: Creates a new container<br/>- true: Connects directly to the existing container<br/>- Unset: Checks if the container exists, creates it if needed | - |

Authentication parameters (the disk will try all available methods and Managed Identity Credential):

| Parameter | Description |
|---|---|
| connection_string | For authentication using a connection string. |
| account_name | For authentication using Shared Key (used with account_key). |
| account_key | For authentication using Shared Key (used with account_name). |

Limit parameters {#azure-blob-storage-limit-parameters}

| Parameter | Description |
|---|---|
| s3_max_single_part_upload_size | Maximum size of a single block upload to Blob Storage. |
| min_bytes_for_seek | Minimum size of a seekable region. |
| max_single_read_retries | Maximum number of attempts to read a chunk of data from Blob Storage. |
| max_single_download_retries | Maximum number of attempts to download a readable buffer from Blob Storage. |
| thread_pool_size | Maximum number of threads for IDiskRemote instantiation. |
| s3_max_inflight_parts_for_one_file | Maximum number of concurrent put requests for a single object. |

Other parameters {#azure-blob-storage-other-parameters}

| Parameter | Description | Default Value |
|---|---|---|
| metadata_path | Local filesystem path to store metadata files for Blob Storage. | /var/lib/clickhouse/disks/<disk_name>/ |
| skip_access_check | If true, skips disk access checks during startup. | false |
| read_resource | Resource name for scheduling read requests. | Empty string (disabled) |
| write_resource | Resource name for scheduling write requests. | Empty string (disabled) |
| metadata_keep_free_space_bytes | Amount of free metadata disk space to reserve. | - |

Examples of working configurations can be found in integration tests directory (see e.g. test_merge_tree_azure_blob_storage or test_azure_blob_storage_zero_copy_replication).
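
As a hedged sketch (the table name is illustrative), a table can be placed on the blob_storage_disk defined above by referencing it directly in the disk setting, just as with the S3 examples earlier:

CREATE TABLE azure_table (a Int32, b String)
ENGINE = MergeTree
ORDER BY a
SETTINGS disk = 'blob_storage_disk';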

:::note Zero-copy replication is not ready for production Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher. This feature is not recommended for production use. :::

Using HDFS storage (Unsupported) {#using-hdfs-storage-unsupported}

In this sample configuration:

  • the disk is of type hdfs (unsupported)
  • the data is hosted at hdfs://hdfs1:9000/clickhouse/

By the way, HDFS is unsupported and therefore there might be issues when using it. Feel free to make a pull request with the fix if any issue arises.

<clickhouse>
    <storage_configuration>
        <disks>
            <hdfs>
                <type>hdfs</type>
                <endpoint>hdfs://hdfs1:9000/clickhouse/</endpoint>
                <skip_access_check>true</skip_access_check>
            </hdfs>
            <hdd>
                <type>local</type>
                <path>/</path>
            </hdd>
        </disks>
        <policies>
            <hdfs>
                <volumes>
                    <main>
                        <disk>hdfs</disk>
                    </main>
                    <external>
                        <disk>hdd</disk>
                    </external>
                </volumes>
            </hdfs>
        </policies>
    </storage_configuration>
</clickhouse>

Keep in mind that HDFS may not work in corner cases.

Using Data Encryption {#encrypted-virtual-file-system}

You can encrypt the data stored on S3 or HDFS (unsupported) external disks, or on a local disk. To turn on encryption mode, you must define a disk with the type encrypted in the configuration file and choose a disk on which the data will be saved. An encrypted disk ciphers all written files on the fly, and when you read files from an encrypted disk it deciphers them automatically, so you can work with an encrypted disk as with a normal one.

Example of disk configuration:

<disks>
  <disk1>
    <type>local</type>
    <path>/path1/</path>
  </disk1>
  <disk2>
    <type>encrypted</type>
    <disk>disk1</disk>
    <path>path2/</path>
    <key>_16_ascii_chars_</key>
  </disk2>
</disks>

For example, when ClickHouse writes a file store/all_1_1_0/data.bin from some table to disk1, this file is in fact written to the physical disk at the path /path1/store/all_1_1_0/data.bin.

When the same file is written to disk2, it is actually written to the physical disk at the path /path1/path2/store/all_1_1_0/data.bin in encrypted mode.

Required Parameters {#required-parameters-encrypted-disk}

| Parameter | Type | Description |
|---|---|---|
| type | String | Must be set to encrypted to create an encrypted disk. |
| disk | String | Name of the disk to use for underlying storage. |
| key | String | Key for encryption and decryption. Can be specified in hexadecimal using key_hex. Multiple keys can be specified using the id attribute. |

Optional Parameters {#optional-parameters-encrypted-disk}

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | String | Root directory | Location on the disk where data will be saved. |
| current_key_id | String | - | The key ID used for encryption. All specified keys can be used for decryption. |
| algorithm | Enum | AES_128_CTR | Encryption algorithm. Options:<br/>- AES_128_CTR (16-byte key)<br/>- AES_192_CTR (24-byte key)<br/>- AES_256_CTR (32-byte key) |

Example of disk configuration:

<clickhouse>
    <storage_configuration>
        <disks>
            <disk_s3>
                <type>s3</type>
                <endpoint>...
            </disk_s3>
            <disk_s3_encrypted>
                <type>encrypted</type>
                <disk>disk_s3</disk>
                <algorithm>AES_128_CTR</algorithm>
                <key_hex id="0">00112233445566778899aabbccddeeff</key_hex>
                <key_hex id="1">ffeeddccbbaa99887766554433221100</key_hex>
                <current_key_id>1</current_key_id>
            </disk_s3_encrypted>
        </disks>
    </storage_configuration>
</clickhouse>
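
As a hedged usage sketch (the table name is illustrative), a table placed on disk_s3_encrypted has its files encrypted transparently before they reach S3 and decrypted transparently on read:

CREATE TABLE sensitive_data (id UInt64, value String)
ENGINE = MergeTree
ORDER BY id
SETTINGS disk = 'disk_s3_encrypted';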

Using local cache {#using-local-cache}

It is possible to configure a local cache over disks in the storage configuration starting from version 22.3. For versions 22.3 - 22.7, cache is supported only for the s3 disk type. For versions >= 22.8, cache is supported for any disk type: S3, Azure, Local, Encrypted, etc. For versions >= 23.5, cache is supported only for remote disk types: S3, Azure, HDFS (unsupported). The cache uses the LRU cache policy.

Example of configuration for versions later or equal to 22.8:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>...</endpoint>
                ... s3 configuration ...
            </s3>
            <cache>
                <type>cache</type>
                <disk>s3</disk>
                <path>/s3_cache/</path>
                <max_size>10Gi</max_size>
            </cache>
        </disks>
        <policies>
            <s3_cache>
                <volumes>
                    <main>
                        <disk>cache</disk>
                    </main>
                </volumes>
            </s3_cache>
        </policies>
    </storage_configuration>
</clickhouse>

Example of configuration for versions earlier than 22.8:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>...</endpoint>
                ... s3 configuration ...
                <data_cache_enabled>1</data_cache_enabled>
                <data_cache_max_size>10737418240</data_cache_max_size>
            </s3>
        </disks>
        <policies>
            <s3_cache>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_cache>
        </policies>
    </storage_configuration>
</clickhouse>

File Cache disk configuration settings:

These settings should be defined in the disk configuration section.

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | String | - | Required. Path to the directory where the cache will be stored. |
| max_size | Size | - | Required. Maximum cache size in bytes or in readable format (e.g., 10Gi). Files are evicted using the LRU policy when the limit is reached. Supports ki, Mi, Gi formats (since v22.10). |
| cache_on_write_operations | Boolean | false | Enables write-through cache for INSERT queries and background merges. Can be overridden per query with enable_filesystem_cache_on_write_operations. |
| enable_filesystem_query_cache_limit | Boolean | false | Enables per-query cache size limits based on max_query_cache_size. |
| enable_cache_hits_threshold | Boolean | false | When enabled, data is cached only after being read multiple times. |
| cache_hits_threshold | Integer | 0 | Number of reads required before data is cached (requires enable_cache_hits_threshold). |
| enable_bypass_cache_with_threshold | Boolean | false | Skips the cache for large read ranges. |
| bypass_cache_threshold | Size | 256Mi | Read range size that triggers cache bypass (requires enable_bypass_cache_with_threshold). |
| max_file_segment_size | Size | 8Mi | Maximum size of a single cache file in bytes or readable format. |
| max_elements | Integer | 10000000 | Maximum number of cache files. |
| load_metadata_threads | Integer | 16 | Number of threads for loading cache metadata at startup. |
| use_split_cache | Boolean | false | Use separation of files to system/data. |
| split_cache_ratio | Double | 0.1 | Ratio of the system segment to the total size of the cache for split_cache. |

Note: Size values support units like ki, Mi, Gi, etc. (e.g., 10Gi).

File Cache Query/Profile Settings {#file-cache-query-profile-settings}

| Setting | Type | Default | Description |
|---|---|---|---|
| enable_filesystem_cache | Boolean | true | Enables/disables cache usage per query, even when using a cache disk type. |
| read_from_filesystem_cache_if_exists_otherwise_bypass_cache | Boolean | false | When enabled, uses the cache only if data exists; new data won't be cached. |
| enable_filesystem_cache_on_write_operations | Boolean | false (Cloud: true) | Enables write-through cache. Requires cache_on_write_operations in the cache config. |
| enable_filesystem_cache_log | Boolean | false | Enables detailed cache usage logging to system.filesystem_cache_log. |
| filesystem_cache_allow_background_download | Boolean | true | Allows partially downloaded segments to be finished in the background. Disable to keep downloads in the foreground for the current query/session. |
| max_query_cache_size | Size | false | Maximum cache size per query. Requires enable_filesystem_query_cache_limit in the cache config. |
| filesystem_cache_skip_download_if_exceeds_per_query_cache_write_limit | Boolean | true | Controls behavior when max_query_cache_size is reached:<br/>- true: Stops downloading new data<br/>- false: Evicts old data to make space for new data |
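
As a hedged per-query sketch (the table is the uk_price_paid example used elsewhere on this page), the cache can be exercised and logged like this:

SELECT count()
FROM uk_price_paid
SETTINGS enable_filesystem_cache = 1, enable_filesystem_cache_log = 1;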

:::warning Cache configuration settings and cache query settings correspond to the latest ClickHouse version, for earlier versions something might not be supported. :::

Cache system tables {#cache-system-tables-file-cache}

| Table Name | Description | Requirements |
|---|---|---|
| system.filesystem_cache | Displays the current state of the filesystem cache. | None |
| system.filesystem_cache_log | Provides detailed cache usage statistics per query. | Requires enable_filesystem_cache_log = true |
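
For example, a quick hedged way to inspect the current cache contents (the available columns depend on the ClickHouse version):

SELECT * FROM system.filesystem_cache LIMIT 5;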

Cache commands {#cache-commands-file-cache}

SYSTEM CLEAR|DROP FILESYSTEM CACHE (<cache_name>) (ON CLUSTER) {#system-clear-filesystem-cache-on-cluster}

ON CLUSTER is only supported when no <cache_name> is provided.

SHOW FILESYSTEM CACHES {#show-filesystem-caches}

Show a list of filesystem caches which were configured on the server. (For versions less than or equal to 22.8 the command is named SHOW CACHES)

SHOW FILESYSTEM CACHES
┌─Caches────┐
│ s3_cache  │
└───────────┘

DESCRIBE FILESYSTEM CACHE '<cache_name>' {#describe-filesystem-cache}

Show cache configuration and some general statistics for a specific cache. Cache name can be taken from SHOW FILESYSTEM CACHES command. (For versions less than or equal to 22.8 the command is named DESCRIBE CACHE)

DESCRIBE FILESYSTEM CACHE 's3_cache'
┌────max_size─┬─max_elements─┬─max_file_segment_size─┬─boundary_alignment─┬─cache_on_write_operations─┬─cache_hits_threshold─┬─current_size─┬─current_elements─┬─path───────┬─background_download_threads─┬─enable_bypass_cache_with_threshold─┐
│ 10000000000 │      1048576 │             104857600 │            4194304 │                         1 │                    0 │         3276 │               54 │ /s3_cache/ │                           2 │                                  0 │
└─────────────┴──────────────┴───────────────────────┴────────────────────┴───────────────────────────┴──────────────────────┴──────────────┴──────────────────┴────────────┴─────────────────────────────┴────────────────────────────────────┘

| Cache current metrics | Cache asynchronous metrics | Cache profile events |
|---|---|---|
| FilesystemCacheSize | FilesystemCacheBytes | CachedReadBufferReadFromSourceBytes, CachedReadBufferReadFromCacheBytes |
| FilesystemCacheElements | FilesystemCacheFiles | CachedReadBufferReadFromSourceMicroseconds, CachedReadBufferReadFromCacheMicroseconds |
| | | CachedReadBufferCacheWriteBytes, CachedReadBufferCacheWriteMicroseconds |
| | | CachedWriteBufferCacheWriteBytes, CachedWriteBufferCacheWriteMicroseconds |

Using static Web storage (read-only) {#web-storage}

This is a read-only disk. Its data is only read and never modified. A new table is loaded onto this disk via an ATTACH TABLE query (see the example below). A local disk is not actually used: each SELECT query results in an HTTP request to fetch the required data. Any modification of the table data results in an exception, i.e. the following types of queries are not allowed: CREATE TABLE, ALTER TABLE, RENAME TABLE, DETACH TABLE and TRUNCATE TABLE. Web storage can be used for read-only purposes. An example use is for hosting sample data, or for migrating data. There is a tool clickhouse-static-files-uploader, which prepares a data directory for a given table (SELECT data_paths FROM system.tables WHERE name = 'table_name'). For each table you need, you get a directory of files. These files can be uploaded to, for example, a web server with static files. After this preparation, you can load this table into any ClickHouse server via DiskWeb.

In this sample configuration:

  • the disk is of type web
  • the data is hosted at http://nginx:80/test1/
  • a cache on local storage is used
<clickhouse>
    <storage_configuration>
        <disks>
            <web>
                <type>web</type>
                <endpoint>http://nginx:80/test1/</endpoint>
            </web>
            <cached_web>
                <type>cache</type>
                <disk>web</disk>
                <path>cached_web_cache/</path>
                <max_size>100000000</max_size>
            </cached_web>
        </disks>
        <policies>
            <web>
                <volumes>
                    <main>
                        <disk>web</disk>
                    </main>
                </volumes>
            </web>
            <cached_web>
                <volumes>
                    <main>
                        <disk>cached_web</disk>
                    </main>
                </volumes>
            </cached_web>
        </policies>
    </storage_configuration>
</clickhouse>

:::tip Storage can also be configured temporarily within a query, if a web dataset is not expected to be used routinely, see dynamic configuration and skip editing the configuration file.

A demo dataset is hosted in GitHub. To prepare your own tables for web storage see the tool clickhouse-static-files-uploader :::

In this ATTACH TABLE query the UUID provided matches the directory name of the data, and the endpoint is the URL for the raw GitHub content.

-- highlight-next-line
ATTACH TABLE uk_price_paid UUID 'cf712b4f-2ca8-435c-ac23-c4393efe52f7'
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
    is_new UInt8,
    duration Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
  -- highlight-start
  SETTINGS disk = disk(
      type=web,
      endpoint='https://raw.githubusercontent.com/ClickHouse/web-tables-demo/main/web/'
      );
  -- highlight-end

Here is a ready test case. You need to add this configuration to the config:

<clickhouse>
    <storage_configuration>
        <disks>
            <web>
                <type>web</type>
                <endpoint>https://clickhouse-datasets.s3.yandex.net/disk-with-static-files-tests/test-hits/</endpoint>
            </web>
        </disks>
        <policies>
            <web>
                <volumes>
                    <main>
                        <disk>web</disk>
                    </main>
                </volumes>
            </web>
        </policies>
    </storage_configuration>
</clickhouse>

And then execute this query:

ATTACH TABLE test_hits UUID '1ae36516-d62d-4218-9ae3-6516d62da218'
(
    WatchID UInt64,
    JavaEnable UInt8,
    Title String,
    GoodEvent Int16,
    EventTime DateTime,
    EventDate Date,
    CounterID UInt32,
    ClientIP UInt32,
    ClientIP6 FixedString(16),
    RegionID UInt32,
    UserID UInt64,
    CounterClass Int8,
    OS UInt8,
    UserAgent UInt8,
    URL String,
    Referer String,
    URLDomain String,
    RefererDomain String,
    Refresh UInt8,
    IsRobot UInt8,
    RefererCategories Array(UInt16),
    URLCategories Array(UInt16),
    URLRegions Array(UInt32),
    RefererRegions Array(UInt32),
    ResolutionWidth UInt16,
    ResolutionHeight UInt16,
    ResolutionDepth UInt8,
    FlashMajor UInt8,
    FlashMinor UInt8,
    FlashMinor2 String,
    NetMajor UInt8,
    NetMinor UInt8,
    UserAgentMajor UInt16,
    UserAgentMinor FixedString(2),
    CookieEnable UInt8,
    JavascriptEnable UInt8,
    IsMobile UInt8,
    MobilePhone UInt8,
    MobilePhoneModel String,
    Params String,
    IPNetworkID UInt32,
    TraficSourceID Int8,
    SearchEngineID UInt16,
    SearchPhrase String,
    AdvEngineID UInt8,
    IsArtifical UInt8,
    WindowClientWidth UInt16,
    WindowClientHeight UInt16,
    ClientTimeZone Int16,
    ClientEventTime DateTime,
    SilverlightVersion1 UInt8,
    SilverlightVersion2 UInt8,
    SilverlightVersion3 UInt32,
    SilverlightVersion4 UInt16,
    PageCharset String,
    CodeVersion UInt32,
    IsLink UInt8,
    IsDownload UInt8,
    IsNotBounce UInt8,
    FUniqID UInt64,
    HID UInt32,
    IsOldCounter UInt8,
    IsEvent UInt8,
    IsParameter UInt8,
    DontCountHits UInt8,
    WithHash UInt8,
    HitColor FixedString(1),
    UTCEventTime DateTime,
    Age UInt8,
    Sex UInt8,
    Income UInt8,
    Interests UInt16,
    Robotness UInt8,
    GeneralInterests Array(UInt16),
    RemoteIP UInt32,
    RemoteIP6 FixedString(16),
    WindowName Int32,
    OpenerName Int32,
    HistoryLength Int16,
    BrowserLanguage FixedString(2),
    BrowserCountry FixedString(2),
    SocialNetwork String,
    SocialAction String,
    HTTPError UInt16,
    SendTiming Int32,
    DNSTiming Int32,
    ConnectTiming Int32,
    ResponseStartTiming Int32,
    ResponseEndTiming Int32,
    FetchTiming Int32,
    RedirectTiming Int32,
    DOMInteractiveTiming Int32,
    DOMContentLoadedTiming Int32,
    DOMCompleteTiming Int32,
    LoadEventStartTiming Int32,
    LoadEventEndTiming Int32,
    NSToDOMContentLoadedTiming Int32,
    FirstPaintTiming Int32,
    RedirectCount Int8,
    SocialSourceNetworkID UInt8,
    SocialSourcePage String,
    ParamPrice Int64,
    ParamOrderID String,
    ParamCurrency FixedString(3),
    ParamCurrencyID UInt16,
    GoalsReached Array(UInt32),
    OpenstatServiceName String,
    OpenstatCampaignID String,
    OpenstatAdID String,
    OpenstatSourceID String,
    UTMSource String,
    UTMMedium String,
    UTMCampaign String,
    UTMContent String,
    UTMTerm String,
    FromTag String,
    HasGCLID UInt8,
    RefererHash UInt64,
    URLHash UInt64,
    CLID UInt32,
    YCLID UInt64,
    ShareService String,
    ShareURL String,
    ShareTitle String,
    ParsedParams Nested(
        Key1 String,
        Key2 String,
        Key3 String,
        Key4 String,
        Key5 String,
        ValueDouble Float64),
    IslandID FixedString(16),
    RequestNum UInt32,
    RequestTry UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS storage_policy='web';

Required parameters {#static-web-storage-required-parameters}

| Parameter | Description |
|---|---|
| type | web. Otherwise the disk is not created. |
| endpoint | The endpoint URL in path format. The endpoint URL must contain a root path to store data, where the data was uploaded. |

Optional parameters {#optional-parameters-web}

| Parameter | Description | Default Value |
|---|---|---|
| min_bytes_for_seek | The minimal number of bytes to use a seek operation instead of sequential read | 1 MB |
| remote_fs_read_backoff_threashold | The maximum wait time when trying to read data for a remote disk | 10000 seconds |
| remote_fs_read_backoff_max_tries | The maximum number of attempts to read with backoff | 5 |

If a query fails with an exception DB::Exception: Unreachable URL, then you can try to adjust the settings: http_connection_timeout, http_receive_timeout, keep_alive_timeout.

To get files for upload run: clickhouse static-files-disk-uploader --metadata-path <path> --output-dir <dir> (--metadata-path can be found in query SELECT data_paths FROM system.tables WHERE name = 'table_name').

When loading files by endpoint, they must be loaded into the <endpoint>/store/ path, but the config must contain only the endpoint.

If the URL is not reachable when the server is loading tables at startup, all errors are caught. If there were errors in this case, tables can be reloaded (become visible) via DETACH TABLE table_name -> ATTACH TABLE table_name. If the metadata was successfully loaded at server startup, then tables are available straight away.

Use http_max_single_read_retries setting to limit the maximum number of retries during a single HTTP read.
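
For example, a hedged per-query sketch (the values are illustrative, not recommendations):

SELECT count()
FROM uk_price_paid
SETTINGS http_max_single_read_retries = 10, http_receive_timeout = 1800;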

Zero-copy Replication (not ready for production) {#zero-copy}

Zero-copy replication is possible, but not recommended, with S3 and HDFS (unsupported) disks. Zero-copy replication means that if the data is stored remotely on several machines and needs to be synchronized, then only the metadata is replicated (paths to the data parts), but not the data itself.

:::note Zero-copy replication is not ready for production Zero-copy replication is disabled by default in ClickHouse version 22.8 and higher. This feature is not recommended for production use. :::