Preparing Your Data Files
This topic provides best practices, general guidelines, and important
considerations for preparing your data files for loading.
In this Topic:
File Sizing Best Practices and Limitations
o General File Sizing Recommendations
o Semi-structured Data Size Limitations
o Parquet Data Size Limitations
Continuous Data Loads (i.e. Snowpipe) and File Sizing
Preparing Delimited Text Files
Semi-structured Data Files and Columnarization
Numeric Data Guidelines
Date and Timestamp Data Guidelines
File Sizing Best Practices and Limitations
For best load performance and to avoid size limitations, consider the following
data file sizing guidelines. Note that these recommendations apply to bulk
data loads as well as continuous loading using Snowpipe.
General File Sizing Recommendations
The number of load operations that run in parallel cannot exceed the number
of data files to be loaded. To optimize the number of parallel operations for a
load, we recommend aiming to produce data files roughly 10 MB to 100 MB in
size compressed. Aggregate smaller files to minimize the processing
overhead for each file. Split larger files into a greater number of smaller files to
distribute the load among the servers in an active warehouse. The number of
data files that are processed in parallel is determined by the number and
capacity of servers in a warehouse. We recommend splitting large files by line
to avoid records that span chunks.
If your source database does not allow you to export data files in smaller
chunks, you can use a third-party utility to split large CSV files.
Linux or macOS
The split utility enables you to split a CSV file into multiple smaller files.
Syntax:
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]
For more information, type man split in a terminal window.
Example:
split -l 100000 pagecounts-20151201.csv pages
This example splits a file named pagecounts-20151201.csv into smaller files of 100,000 lines each.
Suppose the original file is 8 GB in size and contains 10 million lines. Splitting at 100,000 lines per
file produces 100 smaller files (10 million / 100,000 = 100), each roughly 80 MB in size. The split
files are named pages<suffix>.
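Once the split files are staged, they can be loaded together in a single COPY statement. The following is a
minimal sketch; the table name, stage name, and file format are hypothetical, and PATTERN is a regular
expression matched against the staged file paths:
-- Hypothetical bulk load of the split files produced above.
copy into my_table
from @my_stage
pattern = '.*pages.*'
file_format = (type = 'CSV');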
Windows
Windows does not include a native file split utility; however, Windows supports
many third-party tools and scripts that can split large data files.
Semi-structured Data Size Limitations
The VARIANT data type imposes a 16 MB (compressed) size limit on
individual rows.
In general, JSON and Avro data sets are a simple concatenation of multiple
documents. The JSON or Avro output from some software is composed of a
single huge array containing multiple records. There is no need to separate
the documents with line breaks or commas, though both are supported.
Instead, we recommend enabling the STRIP_OUTER_ARRAY file format
option for the COPY INTO <table> command to remove the outer array
structure and load the records into separate table rows:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
Parquet Data Size Limitations
Currently, data loads of large Parquet files (e.g. greater than 3 GB) could time
out. Split large files into files 1 GB in size (or smaller) for loading.
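As an illustration, a minimal sketch of loading one of the smaller Parquet files by casting selected
elements of the Parquet row to target column types; the table, stage, file, and field names here are
hypothetical:
-- Hypothetical Parquet load: $1 refers to the Parquet row; each field is cast
-- to the target column type.
copy into my_table
from (
  select
    $1:id::number,
    $1:name::varchar
  from @my_stage/part_0001.parquet
)
file_format = (type = 'PARQUET');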
Continuous Data Loads (i.e. Snowpipe) and File Sizing
Snowpipe is designed to load new data typically within a minute after a file
notification is sent; however, loading can take significantly longer for really
large files or in cases where an unusual amount of compute resources is
necessary to decompress, decrypt, and transform the new data.
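For reference, a Snowpipe load is defined as a pipe object that wraps a COPY statement. A minimal
sketch follows; the pipe, table, and stage names are hypothetical, and AUTO_INGEST = TRUE assumes that
event notifications from your cloud storage location are configured:
-- Hypothetical pipe: loads new JSON files from @my_stage into my_table
-- as cloud storage event notifications arrive.
create pipe my_pipe
  auto_ingest = true
  as
  copy into my_table
  from @my_stage
  file_format = (type = 'JSON');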
In addition to resource consumption, an overhead to manage files in the
internal load queue is included in the utilization costs charged for Snowpipe.
This overhead increases in relation to the number of files queued for loading.
Snowpipe charges 0.06 credits per 1000 files queued.
For the most efficient and cost-effective load experience with Snowpipe, we
recommend following the file sizing recommendations in File Sizing Best
Practices and Limitations (in this topic). If it takes longer than one minute to
accumulate MBs of data in your source application, consider creating a new
(potentially smaller) data file once per minute. This approach typically leads to
a good balance between cost (i.e. resources spent on Snowpipe queue
management and the actual load) and performance (i.e. load latency).
Creating smaller data files and staging them in cloud storage more often than once per minute has the
following disadvantages:
o A reduction in latency between staging and loading the data cannot be guaranteed.
o An overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading.
Various tools can aggregate and batch data files. One convenient option is Amazon Kinesis Firehose.
Firehose allows defining both the desired file size, called the buffer size, and the wait interval after
which a new file is sent (to cloud storage in this case), called the buffer interval. For more
information, see the Kinesis Firehose documentation.
If your source application typically accumulates enough data within a minute to populate files larger
than the recommended maximum for optimal parallel processing, you could decrease the buffer size to
trigger delivery of smaller files. Keeping the buffer interval setting at 60 seconds (the minimum value)
helps avoid creating too many files or increasing latency.
Preparing Delimited Text Files
Consider the following guidelines when preparing your delimited text (CSV)
files for loading:
o UTF-8 is the default character set; however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files. For more information, see CREATE FILE FORMAT. A file format example follows this list.
o Fields that contain delimiter characters should be enclosed in quotes (single or double). If the data contains single or double quotes, then those quotes must be escaped.
o Carriage returns are commonly introduced on Windows systems in conjunction with a line feed character to mark the end of a line (\r \n). Fields that contain carriage returns should also be enclosed in quotes (single or double).
o The number of columns in each row should be consistent.
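For illustration, a minimal sketch of a CSV file format reflecting these guidelines; the format name and
the specific option values (the ISO-8859-1 encoding and double-quote enclosure) are assumptions, not
requirements:
-- Hypothetical CSV file format illustrating the guidelines above.
create or replace file format my_csv_format
  type = 'CSV'
  encoding = 'ISO-8859-1'                  -- override the UTF-8 default (assumed source encoding)
  field_optionally_enclosed_by = '"'       -- quote fields that contain delimiters or \r\n
  error_on_column_count_mismatch = true;   -- require a consistent number of columns per row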
Semi-structured Data Files and Columnarization
When semi-structured data is inserted into a VARIANT column, Snowflake
extracts as much of the data as possible to a columnar form, based on certain
rules. The rest is stored as a single column in a parsed semi-structured
structure. Currently, elements that have the following characteristics
are not extracted into a column:
o Elements that contain even a single “null” value are not extracted into a column. Note that this applies to elements with “null” values and not to elements with missing values, which are represented in columnar form.
This rule ensures that information is not lost, i.e., the difference between VARIANT “null” values and SQL NULL values is not obfuscated.
o Elements that contain multiple data types. For example, the foo element in one row contains a number:
{"foo":1}
The same element in another row contains a string:
{"foo":"1"}
When a semi-structured element is queried:
o If the element was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
o If the element was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance.
To avoid this performance impact:
o Extract semi-structured data elements containing “null” values into relational columns before loading them. Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes object elements or array elements containing “null” values (see the example after this list).
o Ensure each unique element stores values of a single native data type (string or number).
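For example, a minimal sketch of loading JSON files with STRIP_NULL_VALUES enabled; the table name and
stage path are hypothetical:
-- Hypothetical load: remove object and array elements containing "null" values
-- so the remaining elements can be extracted into columnar form.
copy into my_table
from @my_stage/data/
file_format = (type = 'JSON' strip_null_values = true);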
Numeric Data Guidelines
Related Topics
Numeric Data Types
o Avoid embedded characters, such as commas (e.g., 123,456).
o If a number includes a fractional component, it should be separated from the whole number portion by a decimal point (e.g., 123456.789).
o Oracle only. The Oracle NUMBER or NUMERIC types allow for arbitrary scale, meaning they accept values with decimal components even if the data type was not defined with a precision or scale. In Snowflake, by contrast, columns designed for values with decimal components must be defined with a scale to preserve the decimal portion (see the example after this list).
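To illustrate, a minimal sketch of a target column definition that preserves two decimal places; the
table name, column name, and precision/scale are assumptions about your data:
-- Hypothetical target table: NUMBER(38,2) preserves two decimal places,
-- whereas a column defined without a scale (e.g. NUMBER(38,0)) would not.
create or replace table my_amounts (
  amount number(38,2)
);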
Date and Timestamp Data Guidelines
Related Topics
Date & Time Data Types
o Date, time, and timestamp data should be formatted based on the following components (see the load example after this list):

Format                   Description
YYYY                     Four-digit year.
YY                       Two-digit year, controlled by the TWO_DIGIT_CENTURY_START session parameter, e.g. when set to 1980, values of 79 and 80 parsed as 2079 and 1980 respectively.
MM                       Two-digit month (01=January, etc.).
MON                      Full or abbreviated month name.
DD                       Two-digit day of month (01 through 31).
DY                       Abbreviated day of week.
HH24                     Two digits for hour (00 through 23); am/pm not allowed.
HH12                     Two digits for hour (01 through 12); am/pm allowed.
AM , PM                  Ante meridiem (am) / post meridiem (pm); for use with HH12.
MI                       Two digits for minute (00 through 59).
SS                       Two digits for second (00 through 59).
FF                       Fractional seconds with precision 0 (seconds) to 9 (nanoseconds), e.g. FF, FF0, FF3, FF9. Specifying FF is equivalent to FF6 (microseconds).
TZH:TZM , TZHTZM , TZH   Time zone hour and minute, offset from UTC. Can be prefixed by +/- for sign.
o Oracle only. The Oracle DATE data type can contain date or timestamp information. If your Oracle database includes DATE columns that also store time-related information, map these columns to a TIMESTAMP data type in Snowflake rather than DATE.
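As an example, a minimal sketch of a COPY statement that declares the date and timestamp formats used in
the staged files; the table, stage, and format strings are assumptions about the incoming data:
-- Hypothetical load: the staged CSV files are assumed to contain dates such as
-- 2015-12-01 and timestamps such as 2015-12-01 14:30:00.123.
copy into my_table
from @my_stage
file_format = (
  type = 'CSV'
  date_format = 'YYYY-MM-DD'
  timestamp_format = 'YYYY-MM-DD HH24:MI:SS.FF3'
);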
Note
Snowflake checks temporal data values at load time. Invalid date, time, and
timestamp values (e.g., 0000-00-00) produce an error.