Preparing Your Data Files
This topic provides best practices, general guidelines, and important
considerations for preparing your data files for loading.
In this Topic:
File Sizing Best Practices and Limitations
o General File Sizing Recommendations
o Semi-structured Data Size Limitations
o Parquet Data Size Limitations
Continuous Data Loads (i.e. Snowpipe) and File Sizing
Preparing Delimited Text Files
Semi-structured Data Files and Columnarization
Numeric Data Guidelines
Date and Timestamp Data Guidelines
File Sizing Best Practices and Limitations
For best load performance and to avoid size limitations, consider the following
data file sizing guidelines. Note that these recommendations apply to bulk
data loads as well as continuous loading using Snowpipe.
General File Sizing Recommendations
The number of load operations that run in parallel cannot exceed the number
of data files to be loaded. To optimize the number of parallel operations for a
load, we recommend aiming to produce data files roughly 10 MB to 100 MB in
size compressed. Aggregate smaller files to minimize the processing
overhead for each file. Split larger files into a greater number of smaller files to
distribute the load among the servers in an active warehouse. The number of
data files that are processed in parallel is determined by the number and
capacity of servers in a warehouse. We recommend splitting large files by line
to avoid records that span chunks.
If your source database does not allow you to export data files in smaller
chunks, you can use a third-party utility to split large CSV files.
Linux or macOS
The split utility enables you to split a CSV file into multiple smaller files.
Syntax:
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]
For more information, type man split in a terminal window.
Example:
split -l 100000 pagecounts-20151201.csv pages
This example splits a file named pagecounts-20151201.csv into smaller files of 100,000 lines each.
Suppose the original file is 8 GB in size and contains 10 million lines. Splitting at 100,000 lines per
file produces 100 smaller files (10 million / 100,000 = 100), each roughly 80 MB in size. The split
files are named pages<suffix>.
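Once the split files are staged, they can be loaded together in a single COPY statement. The following is a
minimal sketch; the table name, stage name, and file format are hypothetical, and PATTERN is a regular
expression matched against the staged file paths:
-- Hypothetical bulk load of the split files produced above.
copy into my_table
from @my_stage
pattern = '.*pages.*'
file_format = (type = 'CSV');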
Windows
Windows does not include a native file split utility; however, Windows supports
many third-party tools and scripts that can split large data files.
Semi-structured Data Size Limitations
The VARIANT data type imposes a 16 MB (compressed) size limit on
individual rows.
In general, JSON and Avro data sets are a simple concatenation of multiple
documents. The JSON or Avro output from some software is composed of a
single huge array containing multiple records. There is no need to separate
the documents with line breaks or commas, though both are supported.
Instead, we recommend enabling the STRIP_OUTER_ARRAY file format
option for the COPY INTO <table> command to remove the outer array
structure and load the records into separate table rows:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
Parquet Data Size Limitations
Currently, data loads of large Parquet files (e.g. greater than 3 GB) could time
out. Split large files into files 1 GB in size (or smaller) for loading.
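As an illustration, a minimal sketch of loading one of the smaller Parquet files by casting selected
elements of the Parquet row to target column types; the table, stage, file, and field names here are
hypothetical:
-- Hypothetical Parquet load: $1 refers to the Parquet row; each field is cast
-- to the target column type.
copy into my_table
from (
  select
    $1:id::number,
    $1:name::varchar
  from @my_stage/part_0001.parquet
)
file_format = (type = 'PARQUET');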
Continuous Data Loads (i.e. Snowpipe) and File Sizing
Snowpipe is designed to load new data typically within a minute after a file
notification is sent; however, loading can take significantly longer for really
large files or in cases where an unusual amount of compute resources is
necessary to decompress, decrypt, and transform the new data.
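For reference, a Snowpipe load is defined as a pipe object that wraps a COPY statement. A minimal
sketch follows; the pipe, table, and stage names are hypothetical, and AUTO_INGEST = TRUE assumes that
event notifications from your cloud storage location are configured:
-- Hypothetical pipe: loads new JSON files from @my_stage into my_table
-- as cloud storage event notifications arrive.
create pipe my_pipe
  auto_ingest = true
  as
  copy into my_table
  from @my_stage
  file_format = (type = 'JSON');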
In addition to resource consumption, an overhead to manage files in the
internal load queue is included in the utilization costs charged for Snowpipe.
This overhead increases in relation to the number of files queued for loading.
Snowpipe charges 0.06 credits per 1000 files queued.
For the most efficient and cost-effective load experience with Snowpipe, we
recommend following the file sizing recommendations in File Sizing Best
Practices and Limitations (in this topic). If it takes longer than one minute to
accumulate MBs of data in your source application, consider creating a new
(potentially smaller) data file once per minute. This approach typically leads to
a good balance between cost (i.e. resources spent on Snowpipe queue
management and the actual load) and performance (i.e. load latency).
Creating smaller data files and staging them in cloud storage more often than once per minute has the
following disadvantages:
o A reduction in latency between staging and loading the data cannot be guaranteed.
o An overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading.
Various tools can aggregate and batch data files. One convenient option is Amazon Kinesis Firehose.
Firehose allows defining both the desired file size, called the buffer size, and the wait interval after
which a new file is sent (to cloud storage in this case), called the buffer interval. For more
information, see the Kinesis Firehose documentation.
If your source application typically accumulates enough data within a minute to populate files larger
than the recommended maximum for optimal parallel processing, you could decrease the buffer size to
trigger delivery of smaller files. Keeping the buffer interval setting at 60 seconds (the minimum value)
helps avoid creating too many files or increasing latency.
Preparing Delimited Text Files
Consider the following guidelines when preparing your delimited text (CSV)
files for loading:
o UTF-8 is the default character set; however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files. For more information, see CREATE FILE FORMAT. A file format example follows this list.
o Fields that contain delimiter characters should be enclosed in quotes (single or double). If the data contains single or double quotes, then those quotes must be escaped.
o Carriage returns are commonly introduced on Windows systems in conjunction with a line feed character to mark the end of a line (\r \n). Fields that contain carriage returns should also be enclosed in quotes (single or double).
o The number of columns in each row should be consistent.
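For illustration, a minimal sketch of a CSV file format reflecting these guidelines; the format name and
the specific option values (the ISO-8859-1 encoding and double-quote enclosure) are assumptions, not
requirements:
-- Hypothetical CSV file format illustrating the guidelines above.
create or replace file format my_csv_format
  type = 'CSV'
  encoding = 'ISO-8859-1'                  -- override the UTF-8 default (assumed source encoding)
  field_optionally_enclosed_by = '"'       -- quote fields that contain delimiters or \r\n
  error_on_column_count_mismatch = true;   -- require a consistent number of columns per row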
Semi-structured Data Files and Columnarization
When semi-structured data is inserted into a VARIANT column, Snowflake
extracts as much of the data as possible to a columnar form, based on certain
rules. The rest is stored as a single column in a parsed semi-structured
structure. Currently, elements that have the following characteristics
are not extracted into a column:
o Elements that contain even a single “null” value are not extracted into a column. Note that this applies to elements with “null” values and not to elements with missing values, which are represented in columnar form.
This rule ensures that information is not lost, i.e., the difference between VARIANT “null” values and SQL NULL values is not obfuscated.
o Elements that contain multiple data types. For example, the foo element in one row contains a number:
{"foo":1}
The same element in another row contains a string:
{"foo":"1"}
When a semi-structured element is queried:
o If the element was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
o If the element was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance.
To avoid this performance impact:
o Extract semi-structured data elements containing “null” values into relational columns before loading them. Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes object elements or array elements containing “null” values (see the example after this list).
o Ensure each unique element stores values of a single native data type (string or number).
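For example, a minimal sketch of loading JSON files with STRIP_NULL_VALUES enabled; the table name and
stage path are hypothetical:
-- Hypothetical load: remove object and array elements containing "null" values
-- so the remaining elements can be extracted into columnar form.
copy into my_table
from @my_stage/data/
file_format = (type = 'JSON' strip_null_values = true);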
Numeric Data Guidelines
Related Topics
Numeric Data Types
o Avoid embedded characters, such as commas (e.g., 123,456).
o If a number includes a fractional component, it should be separated from the whole number portion by a decimal point (e.g., 123456.789).
o Oracle only. The Oracle NUMBER or NUMERIC types allow for arbitrary scale, meaning they accept values with decimal components even if the data type was not defined with a precision or scale. In Snowflake, by contrast, columns designed for values with decimal components must be defined with a scale to preserve the decimal portion (see the example after this list).
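To illustrate, a minimal sketch of a target column definition that preserves two decimal places; the
table name, column name, and precision/scale are assumptions about your data:
-- Hypothetical target table: NUMBER(38,2) preserves two decimal places,
-- whereas a column defined without a scale (e.g. NUMBER(38,0)) would not.
create or replace table my_amounts (
  amount number(38,2)
);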
Date and Timestamp Data Guidelines
Related Topics
Date & Time Data Types
o Date, time, and timestamp data should be formatted based on the following components (see the load example after this list):

Format                   Description
YYYY                     Four-digit year.
YY                       Two-digit year, controlled by the TWO_DIGIT_CENTURY_START session parameter, e.g. when set to 1980, values of 79 and 80 parsed as 2079 and 1980 respectively.
MM                       Two-digit month (01=January, etc.).
MON                      Full or abbreviated month name.
DD                       Two-digit day of month (01 through 31).
DY                       Abbreviated day of week.
HH24                     Two digits for hour (00 through 23); am/pm not allowed.
HH12                     Two digits for hour (01 through 12); am/pm allowed.
AM , PM                  Ante meridiem (am) / post meridiem (pm); for use with HH12.
MI                       Two digits for minute (00 through 59).
SS                       Two digits for second (00 through 59).
FF                       Fractional seconds with precision 0 (seconds) to 9 (nanoseconds), e.g. FF, FF0, FF3, FF9. Specifying FF is equivalent to FF6 (microseconds).
TZH:TZM , TZHTZM , TZH   Time zone hour and minute, offset from UTC. Can be prefixed by +/- for sign.
o Oracle only. The Oracle DATE data type can contain date or timestamp information. If your Oracle database includes DATE columns that also store time-related information, map these columns to a TIMESTAMP data type in Snowflake rather than DATE.
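As an example, a minimal sketch of a COPY statement that declares the date and timestamp formats used in
the staged files; the table, stage, and format strings are assumptions about the incoming data:
-- Hypothetical load: the staged CSV files are assumed to contain dates such as
-- 2015-12-01 and timestamps such as 2015-12-01 14:30:00.123.
copy into my_table
from @my_stage
file_format = (
  type = 'CSV'
  date_format = 'YYYY-MM-DD'
  timestamp_format = 'YYYY-MM-DD HH24:MI:SS.FF3'
);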
Note
Snowflake checks temporal data values at load time. Invalid date, time, and
timestamp values (e.g., 0000-00-00) produce an error.