Fine Grain Access Control for Big Data:

ORC Column Encryption


Owen O’Malley
[email protected]
@owen_omalley
March 2019
Who Am I?
• Worked on Hadoop since Jan 2006
• Worked at Yahoo on Web Search team
• First committer added to Hadoop project
• MapReduce, Security, Hive, and ORC
• Worked on many different file formats
• Sequence File, Avro, RC File and ORC File
• Spun Hortonworks out of Yahoo in 2011
2 © Hortonworks Inc. 2011 – 2019. All Rights Reserved
What is the Problem?



Controlling Sensitive Data
• Some data is very sensitive
• Personally Identifiable Information
• Credit card
• Medical information
• Companies run on data
• Need controls on data
• GDPR is a BIG deal
What is the Problem?
• Related data, different security requirements
• Authorization – who can see it
• Audit – track who read it
• Encrypt on disk – regulatory
• File-level (or blob) granularity isn’t enough
• File systems don’t understand columns



Requirements
• Readers should transparently decrypt data
• If and only if the user has access to the key
• The data must be decrypted locally
• Columns are only decrypted as necessary
• Master keys must be managed securely
• Support for Key Management Server & hardware
• Support for key rolling
Partial Solutions



Partial Solution – HDFS Encryption
• Transparent HDFS Encryption
• Encryption zones
• HDFS directory trees
• Unique master key for each zone
• Client decrypts data
• Key Management via KeyProvider API



HDFS Encryption Limitations
• Very coarse protection
• Only entire directory subtrees
• No ability to protect columns
• A lot of users need access to keys
• Moving data between zones is painful
• When writing with Hive, data is moved multiple times per query
Partial Solution – Hive Server 2
• All queries sent to Hive Server 2
• Only ‘hive’ user has access to data in HDFS
• Supports LLAP
• Integrates with Apache Ranger
• Control access to rows & columns
• Dynamically mask data



Hive Architecture with Hive Server 2



Hive Server 2 Limitations
• Limits access to Hive SQL
• Breaks Hadoop’s multi-paradigm data access
• Many customers use both Hive & Spark
• JDBC is not distributed
• Throughput is limited to 1 machine
• New Spark to LLAP connector addresses this



Partial Solution – Separate tables
• Split private information out of tables
• Separate directories in HDFS
• HDFS and/or HS2 authorization
• Enables HDFS encryption
• Limitations
• Need to join with other tables
• Higher operational overhead
Partial Solution – Encryption UDF
• Hive has user defined functions
• aes_encrypt and aes_decrypt
• Limitations
• Key management is problematic
• Encryption is not seeded
• Size of value leaks information



Solution



Columnar Encryption
• Columnar file formats (e.g., ORC)
• Write data in columns
• Column projection
• Better compression
• Encryption works really well
• Only encrypt bytes for column
• Can store multiple variants of data
ORC File Format

[Figure: ORC file layout. The file is a sequence of ~200 MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, followed by the File Footer and the Postscript. Within a stripe, data is laid out by column (Column 1 through Column 8), and a column's data is divided into streams (labelled Stream 2.1 through Stream 2.4 in the figure).]



User Experience
• Set table properties for encryption
• orc.encrypt.pii = "ssn,email"
• orc.encrypt.credit = "card_info"
• Define where to get the encryption keys
• Configuration defines the key provider via URI
• Can use the Hadoop or Ranger KMS
• Compatible with public cloud KMS
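
As a rough illustration of how a writer might interpret these table properties, the sketch below turns them into a map from master key name to protected columns. It assumes the property layout shown on this slide (one "orc.encrypt.<key name>" entry per master key, with a comma-separated column list as the value). The function name columns_by_key is hypothetical and is not part of ORC or Hive.

```python
def columns_by_key(table_properties):
    """Map each master key name to the list of columns it protects.

    Assumption: properties follow the slide's "orc.encrypt.<key name>"
    pattern, with a comma-separated list of column names as the value.
    """
    prefix = "orc.encrypt."
    result = {}
    for name, value in table_properties.items():
        if name.startswith(prefix):
            key_name = name[len(prefix):]
            result[key_name] = [c.strip() for c in value.split(",") if c.strip()]
    return result


props = {
    "orc.encrypt.pii": "ssn,email",
    "orc.encrypt.credit": "card_info",
    "orc.compress": "zlib",  # unrelated property, ignored by the function
}
mapping = columns_by_key(props)
print(mapping)
```

Columns that appear under no key (everything else in the table) would be written unencrypted.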



Key Management
• Uses Hadoop or Ranger KMS
• Create a master key for each use
• “pii”, “pci”, or “hipaa”
• Each column in each file uses unique local key
• Policies limit access to master keys
• User never gets master keys



KeyProvider API
• Provides limited access to encryption keys
• Encrypts or decrypts local keys
• Key versions and key rolling
• Allows 3rd party plugins
• Supports Hadoop or Ranger KMS
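
The idea that the provider encrypts and decrypts local keys while keeping the master keys to itself can be sketched as follows. This is a toy, in-memory model only: ToyKeyProvider and its methods are hypothetical names, not the Hadoop KeyProvider API, and the cipher is a SHA-256 based XOR keystream standing in for a real cipher so that the example runs with the standard library alone.

```python
import hashlib
import os


def _keystream_xor(key, nonce, data):
    """Stand-in cipher: XOR data with SHA-256(key || nonce || block index).

    NOT AES and not for production use; it only keeps the sketch runnable.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + (offset // 32).to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(a ^ b for a, b in zip(chunk, block))
    return bytes(out)


class ToyKeyProvider:
    """Hypothetical in-memory key provider; master keys never leave it."""

    def __init__(self):
        self._versions = {}  # key name -> list of master keys, one per version
        self._allowed = {}   # key name -> set of users allowed to decrypt

    def create_key(self, name, users):
        self._versions[name] = [os.urandom(32)]
        self._allowed[name] = set(users)

    def roll_key(self, name):
        """Add a new master key version; older versions stay usable."""
        self._versions[name].append(os.urandom(32))

    def generate_local_key(self, name):
        """Return (version, nonce, encrypted local key, plaintext local key)."""
        version = len(self._versions[name]) - 1
        local_key = os.urandom(16)
        nonce = os.urandom(16)
        wrapped = _keystream_xor(self._versions[name][version], nonce, local_key)
        return version, nonce, wrapped, local_key

    def decrypt_local_key(self, user, name, version, nonce, wrapped):
        if user not in self._allowed[name]:
            raise PermissionError(f"{user} may not use key {name}")
        return _keystream_xor(self._versions[name][version], nonce, wrapped)


kms = ToyKeyProvider()
kms.create_key("pii", users={"alice"})
version, nonce, wrapped, local_key = kms.generate_local_key("pii")
kms.roll_key("pii")  # files written earlier still record the old version
recovered = kms.decrypt_local_key("alice", "pii", version, nonce, wrapped)
```

The writer would store only the version, nonce, and wrapped key alongside the data. A reader without permission on "pii" gets a PermissionError instead of the local key.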



Encryption Data Flow



Encryption Flow
• Local key
• Random for each encrypted column in file
• Encrypted w/ master key by KMS
• Encrypted local key is stored in file metadata
• IV is generated to be unique
• Column, kind, stripe, & counter
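
The IV matters because counter-mode encryption must never reuse the same key and IV pair for two different pieces of data; doing so exposes the XOR of the two plaintexts. A minimal sketch of building a unique IV from the four fields listed above follows. The field widths (3 + 2 + 3 + 8 bytes, filling a 16-byte AES block) and the numeric ids are assumptions made for illustration; the slide only names the fields.

```python
def build_iv(column_id, stream_kind, stripe_id, block_counter=0):
    """Pack the four fields into a 16-byte IV.

    Assumed layout: 3 bytes column id, 2 bytes stream kind,
    3 bytes stripe id, 8 bytes block counter (big-endian).
    """
    return (
        column_id.to_bytes(3, "big")
        + stream_kind.to_bytes(2, "big")
        + stripe_id.to_bytes(3, "big")
        + block_counter.to_bytes(8, "big")
    )


# Same column and stream kind, different stripes: the IVs differ.
iv_a = build_iv(column_id=2, stream_kind=1, stripe_id=0)
iv_b = build_iv(column_id=2, stream_kind=1, stripe_id=1)
print(iv_a.hex())
print(iv_b.hex())
```

Because every stream in the file has a distinct (column, kind, stripe) triple, no two streams encrypted with the same local key would start from the same IV.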



Data Masking
• What happens without key access?
• Define static masks
• Nullify – all values become null
• Redact – mask values ‘Xxxxx Xxxxx!’
• Can define ranges to unmask
• SHA256 – replace with SHA256
• Custom - user defined
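
The three built-in masks can be sketched in a few lines. The replacement characters used here (X for uppercase, x for lowercase, 9 for digits, everything else kept) are an assumption chosen to reproduce the ‘Xxxxx Xxxxx!’ example; the function names and the unmask parameter are hypothetical.

```python
import hashlib


def nullify(value):
    """Nullify: every value becomes null."""
    return None


def redact(value, unmask=None):
    """Redact: replace letters and digits, keep other characters.

    unmask is an optional (start, end) range of character positions,
    end exclusive, that is left readable.
    """
    out = []
    for i, ch in enumerate(value):
        if unmask and unmask[0] <= i < unmask[1]:
            out.append(ch)
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def sha256_mask(value):
    """SHA256: replace the value with the hex digest of its UTF-8 bytes."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(redact("Hello World!"))
print(redact("4111-1111-1111-1234", unmask=(15, 19)))
```

The second call shows an unmasked range: the last four digits of a card number stay visible while the rest is redacted.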
Data Masking
• Anonymization is hard!
• AOL search logs
• Netflix prize datasets
• NYC taxi dataset
• Always evaluate security tradeoffs
• Tokenization is a useful technique
• Assign arbitrary replacements
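
Tokenization differs from hashing in that the replacement has no mathematical relationship to the original value, so it cannot be recovered by hashing candidate values and comparing. A minimal sketch, assuming an in-memory lookup table (the TokenVault class is hypothetical; a real deployment would need a protected, persistent store):

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault: value -> random token, and back."""

    def __init__(self):
        self._token_for = {}
        self._value_for = {}

    def tokenize(self, value):
        """Return the existing token for value, or assign a new random one."""
        if value not in self._token_for:
            token = secrets.token_hex(8)
            while token in self._value_for:  # retry on the unlikely collision
                token = secrets.token_hex(8)
            self._token_for[value] = token
            self._value_for[token] = value
        return self._token_for[value]

    def detokenize(self, token):
        """Look up the original value; only possible with access to the vault."""
        return self._value_for[token]


vault = TokenVault()
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("987-65-4321")
t1_again = vault.tokenize("123-45-6789")
```

Equal values map to the same token, so joins and group-bys on the tokenized column still work, while the token itself reveals nothing about the value.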
Key Disposal
• Often need to keep data for 90 days
• Currently the data is written twice
• With column encryption:
• Roll keys daily
• Delete master key after 90 days
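
The retention scheme can be modelled as a registry of daily key versions with a purge step. The reasoning, as read from the bullets above, is that once a master key version is deleted, the local keys it encrypted can no longer be recovered, so the data written under it becomes unreadable without rewriting any files. The RollingKeys class and its method names are hypothetical, and the key material is placeholder bytes.

```python
from datetime import date, timedelta


class RollingKeys:
    """Hypothetical registry of daily master key versions."""

    def __init__(self, retention_days=90):
        self.retention = timedelta(days=retention_days)
        self.versions = {}  # creation date -> key material (placeholder bytes)

    def roll(self, today, key_material):
        """Register the key version created today."""
        self.versions[today] = key_material

    def purge(self, today):
        """Delete versions older than the retention window; return their dates."""
        expired = [d for d in self.versions if today - d > self.retention]
        for d in expired:
            del self.versions[d]
        return expired

    def can_decrypt(self, written_on):
        """Data is readable only while the key version of its day still exists."""
        return written_on in self.versions


keys = RollingKeys(retention_days=90)
start = date(2019, 1, 1)
for i in range(100):  # simulate 100 days of daily rolling and purging
    day = start + timedelta(days=i)
    keys.roll(day, bytes([i]))
    keys.purge(day)
```

After the loop, data written on the first day can no longer be decrypted, while data from the most recent 90 days remains readable.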



ORC Encryption Design
• Write both variants of streams
• Masked unencrypted
• Unmasked encrypted
• Encrypt both data and statistics
• Maintain compatibility for old readers
• Read unencrypted variant
• Preserve ability to seek in file
ORC Write Pipeline
• Streams go through pipeline
• Run length encoding
• Compression (zlib, snappy, or lzo)
• Encryption
• Encryption is AES/CTR
• Allows seek
• No padding
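
The order of the pipeline and the two properties of counter mode (seekable, no padding) can be shown with a small sketch. Two simplifications are assumed here: the run-length encoding is a bare (value, count) scheme rather than ORC's actual encodings, and the cipher is a SHA-256 based counter-mode keystream standing in for AES/CTR so that the example needs only the standard library. All function names are hypothetical.

```python
import hashlib
import zlib


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs (simplified RLE)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def serialize_runs(runs):
    """One byte per value and per count; enough for this small example."""
    return bytes(b for value, count in runs for b in (value, count))


def ctr_xor(key, iv, data, offset=0):
    """Counter-mode stand-in: keystream block i = SHA-256(key || iv || i).

    AES/CTR derives its keystream from AES instead; this only mimics the
    structure. offset is the position of data[0] within the whole stream.
    """
    out = bytearray()
    for i, byte in enumerate(data, start=offset):
        block = hashlib.sha256(key + iv + (i // 32).to_bytes(8, "big")).digest()
        out.append(byte ^ block[i % 32])
    return bytes(out)


values = [7, 7, 7, 7, 3, 3, 9]
key, iv = b"k" * 16, b"\x00" * 16

# Pipeline order from the slide: encode, then compress, then encrypt.
compressed = zlib.compress(serialize_runs(run_length_encode(values)))
encrypted = ctr_xor(key, iv, compressed)

# Seek: decrypt only the tail of the stream without touching earlier bytes.
tail = ctr_xor(key, iv, encrypted[5:], offset=5)
```

The encrypted stream has exactly the same length as the compressed one, and any byte range can be decrypted on its own because the keystream at a position depends only on the key, the IV, and that position. Encrypting last also matters: encrypted bytes look random and would not compress.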
Conclusions



Conclusions
• ORC column encryption provides
• Transparent encryption
• Multi-paradigm column security
• Audit logging (via KMS logging)
• Static masking
• Supports file merging
• Different stripes with different local key
Integration with Other Tools
• Apache Ranger
• Provides security from a single control panel
• Provides Attribute Based Access Control (ABAC)
• Manages encryption settings based on policies
• Controls access to decryption keys
• Apache Atlas
• Metadata driven governance for enterprises
• Provides ability to tag tables or columns
Integration with Other Tools
• Hive & Spark
• No change other than defining table properties
• Apache Hive’s LLAP
• Cache and fast processing of SQL queries
• Column encryption changes internal interfaces
• Cache both encrypted and unencrypted variants
• Ensure audit log reflects end-user and what they accessed



Limitations
• Need encryption policy for write
• Current Atlas & Ranger tags lag data
• Auto-discovery requires pre-access
• Changes to masking policy
• Need to re-write files
• Need additional data masks
• Credit card, addresses, etc.
• Decrypted local keys could be saved
Thank you!
Twitter: @owen_omalley
Email: [email protected]

