DataStax Enterprise 6.0
Administrator Guide
Latest 6.0 patch: 6.0.13
Updated: 2020-09-18-07:00
© 2020 DataStax, Inc. All rights reserved.
DataStax, Titan, and TitanDB are registered trademarks of DataStax,
Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks
of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.
Contents
Chapter 1. Getting started................................................................................................................................... 16
New features.....................................................................................................................................................19
Chapter 4. Configuration....................................................................................................................................101
cassandra.yaml......................................................................................................................................... 109
dse.yaml.................................................................................................................................................... 141
remote.yaml...............................................................................................................................................180
cassandra-rackdc.properties..................................................................................................................... 184
cassandra-topology.properties.................................................................................................................. 184
Cassandra................................................................................................................................................. 189
JMX........................................................................................................................................................... 191
TPC........................................................................................................................................................... 192
LDAP......................................................................................................................................................... 193
Kerberos....................................................................................................................................................194
NodeSync..................................................................................................................................................194
Logging configuration......................................................................................................................................205
DSE Graph......................................................................................................................................................362
Architecture............................................................................................................................................... 433
Terminology...............................................................................................................................................440
Keyspaces.................................................................................................................................................450
Data types.................................................................................................................................................451
Operations.................................................................................................................................................451
CQL queries..............................................................................................................................................467
Metrics.......................................................................................................................................................468
nodetool...........................................................................................................................................................523
abortrebuild............................................................................................................................................... 523
assassinate............................................................................................................................................... 524
bootstrap................................................................................................................................................... 526
cfhistograms.............................................................................................................................................. 527
cfstats........................................................................................................................................................ 527
cleanup......................................................................................................................................................527
clearsnapshot............................................................................................................................................ 529
compact.....................................................................................................................................................530
compactionhistory..................................................................................................................................... 532
compactionstats........................................................................................................................................ 537
decommission........................................................................................................................................... 538
describecluster.......................................................................................................................................... 539
describering...............................................................................................................................................541
disableautocompaction..............................................................................................................................543
disablebackup........................................................................................................................................... 544
disablebinary............................................................................................................................................. 545
disablegossip.............................................................................................................................................547
disablehandoff........................................................................................................................................... 548
disablehintsfordc....................................................................................................................................... 549
drain.......................................................................................................................................................... 551
enableautocompaction.............................................................................................................................. 552
enablebackup............................................................................................................................................ 553
enablebinary..............................................................................................................................................555
enablegossip............................................................................................................................................. 556
enablehandoff............................................................................................................................................557
enablehintsfordc........................................................................................................................................ 558
failuredetector............................................................................................................................................560
flush...........................................................................................................................................................561
garbagecollect........................................................................................................................................... 562
gcstats....................................................................................................................................................... 564
getbatchlogreplaythrottle........................................................................................................................... 566
getcachecapacity.......................................................................................................................................567
getcachekeystosave..................................................................................................................................568
getcompactionthreshold............................................................................................................................ 570
getcompactionthroughput..........................................................................................................................571
getconcurrentcompactors.......................................................................................................................... 572
getconcurrentviewbuilders.........................................................................................................................574
getendpoints..............................................................................................................................................575
gethintedhandoffthrottlekb.........................................................................................................................578
getinterdcstreamthroughput...................................................................................................................... 579
getlogginglevels.........................................................................................................................................580
getmaxhintwindow.....................................................................................................................................582
getseeds....................................................................................................................................................583
getsstables................................................................................................................................................ 585
getstreamthroughput................................................................................................................................. 587
gettimeout..................................................................................................................................................589
gettraceprobability..................................................................................................................................... 590
gossipinfo.................................................................................................................................................. 592
handoffwindow.......................................................................................................................................... 593
help............................................................................................................................................................595
info.............................................................................................................................................................599
inmemorystatus......................................................................................................................................... 600
invalidatecountercache..............................................................................................................................602
invalidatekeycache.................................................................................................................................... 603
invalidaterowcache....................................................................................................................................605
join.............................................................................................................................................................606
listendpointspendinghints.......................................................................................................................... 607
leaksdetection........................................................................................................................................... 609
listsnapshots..............................................................................................................................................611
mark_unrepaired....................................................................................................................................... 613
move..........................................................................................................................................................614
netstats......................................................................................................................................................616
nodesyncservice........................................................................................................................................618
pausehandoff.............................................................................................................................................629
proxyhistograms........................................................................................................................................ 631
rangekeysample........................................................................................................................................ 633
rebuild........................................................................................................................................................634
rebuild_index............................................................................................................................................. 637
rebuild_view.............................................................................................................................................. 638
refresh....................................................................................................................................................... 640
refreshsizeestimates................................................................................................................................. 641
reloadseeds...............................................................................................................................................643
reloadtriggers............................................................................................................................................ 644
relocatesstables........................................................................................................................................ 645
removenode.............................................................................................................................................. 647
repair......................................................................................................................................................... 649
replaybatchlog........................................................................................................................................... 652
resetlocalschema...................................................................................................................................... 654
resume...................................................................................................................................................... 655
resumehandoff.......................................................................................................................................... 656
ring............................................................................................................................................................ 657
scrub..........................................................................................................................................................659
sequence...................................................................................................................................................660
setbatchlogreplaythrottle........................................................................................................................... 663
setcachecapacity.......................................................................................................................................665
setcachekeystosave.................................................................................................................................. 666
setcompactionthreshold............................................................................................................................ 668
setcompactionthroughput.......................................................................................................................... 669
setconcurrentcompactors.......................................................................................................................... 670
setconcurrentviewbuilders......................................................................................................................... 671
sethintedhandoffthrottlekb......................................................................................................................... 672
setinterdcstreamthroughput.......................................................................................................................674
setlogginglevel...........................................................................................................................................675
setmaxhintwindow..................................................................................................................................... 677
setstreamthroughput................................................................................................................................. 679
settimeout..................................................................................................................................................680
settraceprobability..................................................................................................................................... 682
sjk.............................................................................................................................................................. 684
snapshot....................................................................................................................................................686
status.........................................................................................................................................................689
statusbackup............................................................................................................................................. 691
statusbinary............................................................................................................................................... 693
statusgossip.............................................................................................................................................. 694
statushandoff.............................................................................................................................................695
stop............................................................................................................................................................697
stopdaemon...............................................................................................................................................698
tablehistograms......................................................................................................................................... 700
tablestats................................................................................................................................................... 701
toppartitions...............................................................................................................................................706
tpstats........................................................................................................................................................709
truncatehints..............................................................................................................................................715
upgradesstables........................................................................................................................................ 716
verify..........................................................................................................................................................718
version.......................................................................................................................................................720
viewbuildstatus.......................................................................................................................................... 721
dse commands................................................................................................................................................722
About dse commands............................................................................................................................... 722
add-node................................................................................................................................................... 724
advrep....................................................................................................................................................... 727
beeline.......................................................................................................................................................760
cassandra..................................................................................................................................................761
cassandra-stop..........................................................................................................................................763
exec...........................................................................................................................................................764
fs................................................................................................................................................................765
gremlin-console......................................................................................................................................... 766
list-nodes................................................................................................................................................... 767
pyspark......................................................................................................................................................768
remove-node............................................................................................................................................. 769
spark..........................................................................................................................................................771
spark-class................................................................................................................................................ 773
spark-jobserver..........................................................................................................................................774
spark-history-server...................................................................................................................................776
spark-sql....................................................................................................................................................777
spark-sql-thriftserver..................................................................................................................................778
spark-submit..............................................................................................................................................779
SparkR...................................................................................................................................................... 782
-v............................................................................................................................................................... 783
dse client-tool..................................................................................................................................................783
cassandra..................................................................................................................................................786
configuration export.................................................................................................................................. 788
configuration byos-export..........................................................................................................................789
spark..........................................................................................................................................................792
alwayson-sql..............................................................................................................................................794
nodesync......................................................................................................................................................... 796
disable....................................................................................................................................................... 798
enable........................................................................................................................................................801
help............................................................................................................................................................804
tracing........................................................................................................................................................807
validation................................................................................................................................................... 817
append...................................................................................................................................................... 819
cat..............................................................................................................................................................820
cd...............................................................................................................................................................822
chgrp......................................................................................................................................................... 824
chmod........................................................................................................................................................825
chown........................................................................................................................................................ 827
cp...............................................................................................................................................................828
df............................................................................................................................................................... 830
du.............................................................................................................................................................. 831
echo...........................................................................................................................................................833
exit.............................................................................................................................................................834
fsck............................................................................................................................................................ 835
get............................................................................................................................................................. 836
ls................................................................................................................................................................837
mkdir..........................................................................................................................................................839
mv..............................................................................................................................................................841
put............................................................................................................................................................. 843
pwd............................................................................................................................................................845
realpath..................................................................................................................................................... 846
rename...................................................................................................................................................... 847
rm.............................................................................................................................................................. 848
rmdir.......................................................................................................................................................... 849
stat.............................................................................................................................................................851
truncate..................................................................................................................................................... 852
umount...................................................................................................................................................... 853
dsetool.............................................................................................................................................................854
core_indexing_status................................................................................................................................ 857
create_core............................................................................................................................................... 859
createsystemkey....................................................................................................................................... 862
encryptconfigvalue.................................................................................................................................... 864
get_core_config.........................................................................................................................................864
get_core_schema......................................................................................................................................865
help............................................................................................................................................................867
index_checks.............................................................................................................................................868
infer_solr_schema..................................................................................................................................... 870
inmemorystatus......................................................................................................................................... 871
insights_config...........................................................................................................................................872
insights_filters............................................................................................................................................875
list_index_files........................................................................................................................................... 877
list_core_properties................................................................................................................................... 879
list_subranges........................................................................................................................................... 880
listjt............................................................................................................................................................ 881
managekmip revoke..................................................................................................................................884
managekmip destroy.................................................................................................................................885
node_health...............................................................................................................................................886
partitioner.................................................................................................................................................. 887
perf............................................................................................................................................................ 888
read_resource........................................................................................................................................... 891
rebuild_indexes......................................................................................................................................... 892
reload_core............................................................................................................................................... 894
ring............................................................................................................................................................ 896
set_core_property..................................................................................................................................... 897
sparkmaster cleanup.................................................................................................................................899
stop_core_reindex.....................................................................................................................................902
tieredtablestats.......................................................................................................................................... 903
tsreload......................................................................................................................................................905
unload_core...............................................................................................................................................906
upgrade_index_files.................................................................................................................................. 907
write_resource...........................................................................................................................................908
fs-stress tool..............................................................................................................................................920
sstabledowngrade..................................................................................................................................... 922
sstabledump.............................................................................................................................................. 924
sstableexpiredblockers.............................................................................................................................. 930
sstablelevelreset........................................................................................................................................931
sstableloader............................................................................................................................................. 933
sstablemetadata........................................................................................................................................ 935
sstableofflinerelevel...................................................................................................................................939
sstablepartitions........................................................................................................................................ 941
sstablerepairedset..................................................................................................................................... 944
sstablescrub.............................................................................................................................................. 946
sstablesplit.................................................................................................................................................948
sstableupgrade..........................................................................................................................................950
sstableutil.................................................................................................................................................. 951
sstableverify.............................................................................................................................................. 953
DataStax tools.................................................................................................................................................954
Starting as a service.................................................................................................................................957
Repairing nodes..............................................................................................................................................995
Compression........................................................................................................................................... 1010
• DataStax Enterprise-based applications and clusters differ significantly from relational databases: they use a data model based on the types of queries you run, not on modeling entities and relationships. Architecture in brief contains key concepts and terminology for understanding the database.
• You can use DSE OpsCenter and Lifecycle Manager for most administrative tasks.
• Save yourself some time and frustration by spending a few moments looking at DataStax Doc and Search tips.
These short topics talk about navigation and bookmarking aids that will make your journey through the docs
more efficient and productive.
The following are not administrator-specific but are presented to give you a fuller picture of the database:
• Cassandra Query Language (CQL) is the query language for DataStax Enterprise.
• DataStax provides drivers in several programming languages for connecting client applications to the
database.
• APIs are available for OpsCenter, DseGraphFrame, DataStax Spark Cassandra Connector, and the drivers.
Plan
The Planning and testing guide contains guidelines for capacity planning and hardware selection in production
environments. Key topics include:
• Estimating RAM
• CPU recommendations
Install
DataStax offers a variety of ways to set up a cluster:
Cloud
On premises
• Docker images
• Binary tarball
For help with choosing an install type, see Which install method should I use?
Secure
DSE Advanced Security provides fine-grained user and access controls to keep application data protected and compliant with regulatory standards like PCI, SOX, HIPAA, and the European Union's General Data Protection Regulation (GDPR). Key topics include:
The DSE database includes the default role cassandra with password cassandra. This superuser login has full access to the database. DataStax recommends using the cassandra role only once, during initial Role Based Access Control (RBAC) setup, to establish your own root account, and then disabling the cassandra role. See Adding a superuser login.
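For example, that recommendation reduces to two CQL statements; a minimal sketch in which the role name dba and the password are placeholders rather than values from this guide:

    -- connected as the default cassandra superuser, create your own root account
    CREATE ROLE dba WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'choose-a-strong-password';
    -- reconnect as dba, then lock down the default role
    ALTER ROLE cassandra WITH SUPERUSER = false AND LOGIN = false;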
Tune
Important topics for optimizing the performance of the database include:
Operations
The most commonly used operations include:
• Tools
Load
The primary tools for getting data into and out of the database are:
• DSE OpsCenter
Troubleshooting
• Troubleshooting guide
Upgrading
Key topics in the Upgrade Guide include:
Advanced Functionality
See Advanced functionality in DataStax Enterprise 6.0.
DSE Management Services automatically handle administration and maintenance tasks and assist with
overall database cluster management.
NodeSync service
Continuous background repair that virtually eliminates manual efforts to run repair operations in a
DataStax cluster.
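NodeSync is enabled per table through a CQL table option; a minimal sketch, using a hypothetical keyspace and table name:

    ALTER TABLE cycling.comments WITH nodesync = {'enabled': 'true'};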
Advanced Replication
Advanced Replication allows a single cluster to act as a primary hub with multiple spokes, enabling configurable, bi-directional distributed data replication to and from source and destination clusters.
DSE In-Memory
Store and access data exclusively from memory.
DSE Multi-Instance
Run multiple DataStax Enterprise nodes on a single host machine.
DSE Tiered Storage
Automate data movement across different types of storage media.
NodeSync
DSE NodeSync removes the need for manual repair operations in DSE's distribution of Cassandra and eliminates cluster outages attributed to manual repair failures. This equates to operational cost savings, reduced support cycles, and reduced application management pain. NodeSync also makes applications run more predictably, making capacity planning easier. NodeSync's advantages for operational simplicity extend across the whole data layer, including database, search, and analytics.
Be sure to read the DSE NodeSync: Operational Simplicity at its Best blog.
DSE Advanced Performance
DSE Advanced Performance delivers numerous performance advantages over open-source Apache Cassandra, including:
• Thread per core (TPC) and asynchronous architecture: a coordination-free design, DSE's thread-per-core architecture provides up to 2x more throughput for read and write operations.
• Storage engine optimizations that provide up to half the latency of open-source Cassandra and include optimized compaction.
• DataStax Bulk Loader: up to 4x faster loads and unloads of data than current data loading utilities. Be sure to read the Introducing DataStax Bulk Loader blog.
• Continuous paging improves DSE Analytics read performance by up to 3x over open-source Apache Cassandra and Apache Spark (see the sketch after this list).
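For example, continuous paging can be toggled per application when submitting a Spark job. A hedged sketch: the spark.dse.continuousPagingEnabled property name reflects the DSE Analytics setting, while the JAR name is a placeholder:

    dse spark-submit --conf spark.dse.continuousPagingEnabled=true my-analytics-job.jar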
DSE TrafficControl
DSE TrafficControl provides a backpressure mechanism to avoid overloading DSE nodes with client or replica requests that could make DSE nodes unresponsive or lead to long garbage collections and out-of-memory errors. DSE TrafficControl is enabled by default and comes pre-tuned to accommodate very different workloads, from simple reads and writes to the most extreme workloads. It requires no configuration.
Automated Upgrades for patch releases
Part of OpsCenter Lifecycle Manager, the Upgrade Service handles patch upgrades of DSE clusters at the data center, rack, or node level with up to 60% less manual involvement. The Upgrade Service allows you to easily clone your existing configuration profile to ensure compatibility with DSE upgrades. Be sure to read the Taking the Pain Out of Database Upgrades blog.
DSE Analytics
• AlwaysOn SQL, with advanced security, ensures around-the-clock uptime for analytics queries with fresh, secure insights. It is interoperable with existing business intelligence tools that use ODBC/JDBC and other Spark-based tools (see the connection sketch after this list). Be sure to read the Introducing AlwaysOn SQL for DSE Analytics blog.
• Structured Streaming provides simple, efficient, and robust streaming of data from Apache Kafka, file systems, or other sources.
• Enhanced Spark SQL support allows you to execute Spark queries using a variation of the SQL language. Spark SQL includes APIs for returning Spark Datasets in Scala and Java, for using an SQL shell interactively, and for working visually through DataStax Studio notebooks.
Be sure to read the What’s New for DataStax Enterprise Analytics 6 blog.
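As a sketch of the AlwaysOn SQL connectivity described above, any JDBC client, including the bundled Beeline shell, can connect to the service. The host name is a placeholder, and port 10000 is an assumption (the conventional Thrift/JDBC port) to verify against your alwayson_sql_options configuration:

    dse beeline
    beeline> !connect jdbc:hive2://dse-node1:10000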
DSE Graph
• Better throughput for DSE Graph due to Advanced Performance improvements, resulting in DSE Graph handling more requests per node.
• Smart Analytics Query Routing: the DSE Graph engine automatically routes a Gremlin OLAP traversal to the correct implementation (DSE Graph Frames or Gremlin OLAP) for the fastest and best execution.
• Advanced Schema Management provides the ability to remove any graph schema element, not just vertex labels or properties.
• Batches in the DSE Graph Fluent API add the ability to execute DSE Graph statements in batches to speed up writes to DSE Graph.
• TinkerPop 3.3.0. DataStax has added many enhancements to the Apache TinkerPop™ tool suite that provide faster, more robust graph querying and a better developer experience.
DSE Advanced Security
• Private Schemas: Control who can see what parts of a table definition, critical for security compliance best practices.
• Separation of Duties: Create administrator roles that can carry out everyday administrative tasks without having unnecessary access to data.
• Auditing by Role: Focus your audits on the users you need to scrutinize. You can now elect to audit activity by user type and increase the signal-to-noise ratio by removing application-tier system accounts from the audit trail.
• Unified Authorization for DSE Analytics: Additional protection for data used in analytics operations.
Be sure to read the Safe data? Check. DataStax Enterprise Advanced Security blog.
DSE Search
Built with a production-certified version of Apache Solr™ 6, DSE Search requires less configuration and provides improved search data consistency and a more synchronous write path for indexing data, with fewer moving pieces to tune and monitor. DSE 5.1 introduced index management CQL and cqlsh commands to streamline operations and development, and DSE 6.0 adds a wider array of CQL query functionality and indexing support.
Be sure to read the What’s New for Search in DSE 6 blog.
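For instance, the CQL index management introduced in DSE 5.1 and extended in DSE 6.0 lets you create and query a search index without leaving cqlsh; a minimal sketch with a hypothetical keyspace, table, and column:

    CREATE SEARCH INDEX IF NOT EXISTS ON cycling.comments;
    SELECT * FROM cycling.comments WHERE solr_query = 'comment:race';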
DataStax drivers
• Batches in the DSE Graph Fluent API add the ability to execute DSE Graph statements in batches to speed up writes to DSE Graph.
• The C# and Node.js DataStax drivers now include Batches in the DSE Graph Fluent API, as do the Java and Python drivers.
Be sure to read the What’s New With Drivers for DSE 6 blog.
DataStax Studio
Improvements to DataStax Studio that further ease DSE development include:
• Notebook Sharing: Easily collaborate with your colleagues to develop DSE applications using the new import and
export capabilities.
• Spark SQL support: Query and analyze data with Spark SQL using DataStax Studio's visual and intelligent
notebooks, which provide syntax highlighting, auto-code completion and correction, and more.
• Interactive Graphs: explore and configure DSE Graph schemas with a whiteboard-like view that allows you to drag
your vertices and edges.
• Notebook History: provides a dated historical record with descriptions and change events that makes it easy to track and roll back changes.
Chapter 2. DataStax Enterprise release notes
Release notes for DataStax Enterprise 6.0.
Upgrades to DSE 6.0 are supported from:
• DSE 5.1
• DSE 5.0
Before you upgrade to a later major version, upgrade to the latest patch release (6.0.13) on your current version. Be sure to read the relevant upgrade documentation.
DSE 6.0 product compatibility:
• OpsCenter 6.5
• Studio 6.0
Check the compatibility page for your products.
DataStax drivers: you may need to recompile your client application code. See Upgrading DataStax drivers.
DataStax Bulk Loader loads data into DSE 5.0 or later and unloads data from any Apache Cassandra™ 2.1 or later data source. Use it for loading and unloading data.
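Before choosing an upgrade path, confirm what each node is currently running. A short sketch using commands documented later in this guide:

    dse -v            # prints the DSE version, for example 6.0.13
    nodetool version  # prints the release version of the underlying database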
6.0.13 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.13 are listed. Components that are updated for DSE 6.0.13 are indicated with an asterisk (*).
• Netty 4.1.25.7.dse
DSE 6.0.13 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements, if any. DataStax recommends upgrading all DSE Search nodes to DSE 6.0.13 or later.
• Fixed StackOverflowError thrown during read repairs (only large clusters or clusters with vnodes enabled are affected). (DB-4350)
• Increased the default direct_reads_size_in_mb value. Previously it was 2 MB per core plus 2 MB shared; it is now 4 MB per core plus 4 MB shared. (DB-4348)
• Fixed slow indexing at bootstrap time caused by early TPC boundaries computation when a node is replaced by a node with the same IP address. (DB-4049)
• Fixed a problem with the treatment of zeros in the decimal type that could cause assertion errors, failures to find rows whose key is 0 written using different precisions, or both. (DB-4472)
• Fixed the NullPointerException issue described in CASSANDRA-14200: NPE when dumping an SSTable
with null value for timestamp column. (DB-4512)
• Fixed an issue that was causing excessive contention during encryption/decryption operations; the fix results in an encryption/decryption performance improvement. (DB-4419)
• Fixed an issue to prevent an unbounded number of flushing tasks for memtables that are almost empty.
(DB-4376)
• Global BloomFilterFalseRatio is now calculated in the same way as the table-level BloomFilterFalseRatio. Both types of metrics now include true negatives; the formula is ratio = falsePositiveCount / (truePositiveCount + falsePositiveCount + trueNegativeCount). (DB-4439)
• Fixed a bug whereby, after a node replacement procedure, bootstrap indexing in DSE Search happened on only one TPC core. (DB-4049)
• Systemd units are included for DSE packages for CentOS and compatible OSes. (DSP-7603)
• The server_host option in dse.yaml now handles multiple, comma-separated LDAP server addresses (see the configuration sketch after this list). (DSP-20833)
• Cassandra tools now work on encrypted SSTables when security is configured. (DSP-20940)
• Recording a slow CQL query to the log will no longer block the thread. (DSP-20894)
• The frequency of range queries performed by the lease manager is now configurable via the dse.lease.refresh.interval.seconds system property, in addition to JMX and the dsetool command (see the configuration sketch after this list). (DSP-20696)
• Security updates:
# The jackson-databind library has been upgraded to 2.9.10.4 to address a Jackson databind vulnerability (CVE-2020-8840). (DSP-20981)
# Fixed some security vulnerabilities in the Solr HTTP REST API when authorization is enabled. Users without the appropriate permissions can no longer perform search operations, and resources can be deleted when authorization is enabled only with the correct permissions. (DSP-20749)
# Fixed an issue where the audit logging did not capture search queries. (DSP-21058)
# While there is no change in default behavior, there is a new render_cql_literals option in dse.yaml under the audit logging section, which is false by default. When enabled, bound variables for logged statements are rendered as CQL literals, which means there is additional quoting and escaping, and values of all complex types (collections, tuples, UDTs) appear in human-readable format (see the configuration sketch after this list). (DSP-17032)
# Fixed LDAP settings to properly handle nested groups so that LDAP enumerates all ancestors of a user's distinguishedName. Inherited groups are now retrieved with both the directory_search and members_search types, and fetching the parent groups of a role that is mapped to an LDAP group is fixed. See the new dse.yaml options, all_groups_xxx in ldap_options, to configure optimized retrieval of parent groups, including inherited ones, in a single round trip. (DSP-20107)
# When DSE tries one authentication scheme and finds that the password is invalid, DSE now tries
another scheme, but only if the user has a scheme permission for that other scheme. (DSP-20903)
# Raised the upper bound on DSE LDAP caches. The upper limit for ldap_options.credentials_validity_in_ms has been increased to 864,000,000 ms (10 days), and the upper limit for ldap_options.search_validity_in_seconds has been increased to 864,000 seconds (10 days). (DSP-21072)
# Fixed an error condition when DSE failed to get the LDAP roles while refreshing a database schema.
(DSP-21075)
6.0.13 DSEFS
• To minimize fsck impact on overloaded clusters, throttling is possible via -p or --parallelism arguments.
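For example, a sketch of a throttled check run through the DSEFS shell (the parallelism value is illustrative):
$ dse fs 'fsck --parallelism 2'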
• Fixed an issue where an excessive number of connections are created to port 5599 when using DSEFS.
(DSP-21021)
• Search-related latency metrics now decay over time like other metrics. Named queries (using the query.name
parameter) now have separate latency metrics. New MBean attributes are available for search
latency metrics: TotalLatency (us), Min, Max, Mean, StdDev, DurationUnit, MeanRate, OneMinuteRate,
FiveMinuteRate, FifteenMinuteRate, RateUnit, 98th, 999th. (DSP-19612)
• Fixed some security vulnerabilities for the Solr HTTP REST API when authorization is enabled. Now, users
without appropriate permissions can no longer perform search operations. Resources can be deleted when
authorization is enabled, given the correct permissions. (DSP-20749)
• Fixed a bug where a decryption block cache occasionally was not operational (SOLR-14498). (DSP-20987)
• Fixed an issue where the audit logging did not capture search queries. (DSP-21058)
• Fixed a bug where, after several months of uptime, an encrypted index would not accept more writes unless
the core was reloaded. (DSP-21234)
• 6.0.12 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.12 are listed. Components that are updated for DSE 6.0.12 are indicated with an
asterisk (*).
• Apache Solr™ 6.0.1.1.2716
• Apache Spark™ 2.2.3.13
• Netty 4.1.25.6.dse
DataStax recommends upgrading all DSE Search nodes to DSE 6.0.12 or later.
• The frequency of range queries performed by the lease manager is now configurable via JMX and the dsetool
command. (DSP-20696)
• Added the dse.ldap.retry_interval.ms system property, which sets the time between subsequent retries
when retrying authentication against the LDAP server. (DSP-20298)
• Removed Jodd Core dependency that created vulnerability to Arbitrary File Writes. (DSP-19206)
• Added a new JMX attribute, ConnectionSearchPassword, for the LdapAuthenticator bean, which updates
the LDAP search password without the need to restart DSE. (DSP-18928)
• dsetool ring shows in-progress search index building during bootstrap. (DSP-15281)
• Made the search reference visible in the error message for LDAP connections. (DSP-20578)
• DecayingEstimatedHistogram now decays even when there are no updates so invalid metric values do not
linger. (DSP-20674)
• Nodesync can now be enabled on all system distributed and protected tables. (DB-3241)
• Improved the estimated values of histogram percentiles reported via JMX. In some cases, the percentiles
may go slightly up. (DB-4275)
• Added a --disable-history option to cqlsh that disables saving history to disk for the current execution. Added
a history section to cqlshrc with a boolean disabled parameter that is set to False by default.
(DB-3843)
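For example, disable history for a single run from the command line:
$ cqlsh --disable-history
Or persistently, in cqlshrc (section and parameter names taken from this note):
[history]
disabled = true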
• Improved error messaging for enabled internode SSL encryption in Cassandra Tools test suite. (DB-3957)
• Security updates:
Resolved issues:
• Bug that prevented LIST ROLES and LIST USERS from working with system-keyspace-filtering enabled.
(DB-4221)
• Continuous paging sessions could leak if the continuous result sets on the driver side were not exhausted or
cancelled. (DB-4313)
• Error that caused nodetool viewbuildstatus to return an incorrect error message. (DB-2397)
Resolved issues:
• Internal continuous paging sessions were not closed when a LIMIT clause was added in a SQL query, which
caused a session leak and an inability to close the Spark application gracefully because the Java driver waited
indefinitely for orphaned sessions to finish. (DSP-19804)
• Removed Jodd Core dependency that created vulnerability to Arbitrary File Writes. (DSP-19206)
• Security updates:
6.0.12 DSEFS
• The DSEFS local file system implementation returns alphabetically sorted directories and files when using
wildcards and the listing command. (DSP-20057)
• When creating a file through WebHDFS API, DSEFS does not verify WX permissions of parent's parent
when the parent exists. (DSP-20355)
Resolved issues:
• DSEFS could not use mixed-case keyspaces, a regression introduced by DSP-16825. (DSP-20354)
• Changed the classic Graph query so vertices are read from _p tables in Cassandra using a SELECT ... WHERE
<vertex primary key columns> statement. The search predicate is applied in memory. (DSP-20230)
• Error messages related to Solr errors contain a better description of the root cause. (DSP-13792)
• The dsetool stop_core_reindex command now mentions the node in the output message. (DSP-17090)
• Improved warnings for search index creation via dsetool or CQL. (DSP-17994)
• Improved guidance with warnings when index rebuild is required for ALTER SEARCH INDEX, RELOAD SEARCH
INDEX, and dsetool reload_core commands. (DSP-19347)
• The suggest request handler now requires SELECT permission. Previously, the suggest request handler
returned a forbidden response when authorization was on, regardless of the user permissions. (DSP-20697)
• Security update:
• 6.0.11 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.11 are listed. Components that are updated for DSE 6.0.11 are indicated with an
asterisk (*).
• Netty 4.1.25.6.dse
DataStax recommends upgrading all DSE Search nodes to DSE 6.0.11 or later.
• Fixed a bug to avoid multiple disposals of Solr filter cache DocSet objects. (DSP-15765)
• Improved performance and logging, and added options for using the Solr timeAllowed parameter in all queries.
The Solr timeAllowed option in queries is now enforced by default to prevent long-running shard queries.
(DSP-19781, DSP-19790)
• Added support for the nodesync command to specify different IP addresses for JMX and CQL. (DB-2969)
• Prevent accepting streamed SSTables or loading SSTables when the clustering order does not match.
(DB-3530)
• Dropping and re-adding the same column with incompatible types is not supported. This change prevents
unreadable SSTables. (DB-3586)
Resolved issues:
• Reads against ma and mc SSTables hit more SSTables than necessary due to the bug fixed by
CASSANDRA-14861. (DB-3691)
• Error retrieving expired columns with secondary index on key components. (DB-3764)
• The diff logic used by the secondary index does not always pick the latest schema and results in ERROR
[CoreThread-8] errors on batch writes. (DB-3838)
• Fixed the concurrency factor calculation for distributed range reads, with a maximum of 10 times
the number of cores. The maximum concurrency factor is configurable with the new JVM argument
-Ddse.max_concurrent_range_requests. (DB-3859)
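For example, a jvm.options sketch capping the concurrency factor (the value is illustrative):
-Ddse.max_concurrent_range_requests=32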
• AIO and DSE Metrics Collector are not available on RHEL/CentOS 6.x because GLIBC_2.14 is not present.
(DSP-18603)
• Using SELECT JSON for empty BLOB values incorrectly returns an empty string instead of the expected 0x.
(DSP-20022)
• RoleManager cache keeps invalid values if the LDAP connectivity is down. (DSP-20098)
• LDAP user login fails due to parsing failure on user DN with parentheses. (DSP-20106)
• New du dsefs shell command lists sizes of the files and directories in a specific directory. (DSP-19572)
• Improve configuration of available system resources for Spark Workers. You can now set the total memory
and total cores with new environment variables that take precedence over the resource_manager_options
defined in dse.yaml. (DSP-19673)
dse.yaml resource_manager_options option -> Environment variable
memory_total -> SPARK_WORKER_TOTAL_MEMORY
cores_total -> SPARK_WORKER_TOTAL_CORES
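For example, a sketch of setting the new variables in spark-env.sh (file location and values are illustrative
assumptions):
export SPARK_WORKER_TOTAL_MEMORY=64g
export SPARK_WORKER_TOTAL_CORES=16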
• Support for multiple contact points is added for DSEFS implementation of the Hadoop FileSystem.
(DSP-19704)
Provide the FileSystem URI in the form:
dsefs://host0[:port][,host1[:port]]/
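For example, with two hypothetical hosts on the default public port 5598:
dsefs://node1:5598,node2:5598/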
Enhancements:
• The Solr timeAllowed option in queries is now enforced by default to prevent long-running shard queries.
This change prevents complex facets and boolean queries from using system resources after the DSE
Search coordinator considers the queries to have timed out. For all queries, the default timeAllowed
value uses the value of the client_request_timeout_seconds setting in dse.yaml. (DSP-19781, DSP-19790)
While using Solr timeAllowed in queries improves performance for long zombie queries, it can cause
increased per-request latency in mixed workloads. If the per-request latency cost is too high, use the
-Ddse.timeAllowed.enabled.default search system property to disable timeAllowed in your queries.
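For example, a jvm.options sketch opting out of the default enforcement (assuming a boolean value where
false disables it):
-Ddse.timeAllowed.enabled.default=false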
Resolved issues:
• Apply filter cache optimization to remote shard requests when RF=N. (DSP-19800)
• Filter cache warming doesn't warm parent-only filter correctly when RF=N. (DSP-19802)
• Handle paging states serialized with a different version than the session version. (CASSANDRA-15176)
• 6.0.10 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.10 are listed. Components that are updated for DSE 6.0.10 are indicated with an
asterisk (*).
• Netty 4.1.13.13.dse
DSE 6.0.10 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.
• Fixed incorrect handling of frozen type issues to accept all valid CQL statements and reject all invalid CQL
statements. (DB-3084)
• Standalone cqlsh client tool provides an interface for developers to interact with the database and issue
CQL commands without having to install the database software. From DataStax Labs, download the version
of CQLSH that corresponds to your DataStax database version. (DSP-18694)
• New options to select cipher suite and protocol to configure KMIP encryption when connecting to a KMIP
server. (DSP-17294)
• Storing and revoking permissions for the application owner is removed. The application owner is explicitly
assumed to have these permissions. (DSP-19393)
• Fixed an issue where T values are hidden by property keys of the same name in valueMap(). (DSP-19261)
# facet.limit < 0 is no longer supported. Override the default facet.limit of 20000 with the
-Dsolr.max.facet.limit.size system property.
# This change adds guardrails that can cause misconfigured faceting queries to fail. Before upgrading, set
an explicit facet.limit.
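For example, a jvm.options sketch raising the cap (the value is illustrative):
-Dsolr.max.facet.limit.size=50000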
• Improved troubleshooting. A log entry is now created when autocompaction is disabled or enabled for a
table. (DB-1635)
• Enhanced DroppedMessages logging output adds the size percentiles of the dropped messages, their most
common destinations, and the most common tables targeted for read requests or mutations. (DB-1250)
• Reformatted StatusLogger output to reduce details in the INFO level system.log. The detailed output is still
present in the debug.log. (DB-2552)
• For nodetool tpstats -F json and nodetool tpstats -F yaml, wait latencies (in ms) appear in the
output. Although not labeled, the wait latencies are included in the following order: 50%, 75%, 95%, 98%,
99%, Min, and Max. (DB-3401)
• New resources improve debugging leaked chunks before the cache evicts them and provide more
meaningful call stack and stack trace. (DB-3504)
# RandomAccessReader/RandomAccessReader
# AsyncPartitionReader/FlowSource
# AsyncSSTableScanner/FlowSource
• New options to select cipher suite and protocol to configure KMIP encryption when connecting to a KMIP
server. (DSP-17294)
• Standalone cqlsh client tool provides an interface for developers to interact with the database and issue
CQL commands without having to install the database software. From DataStax Labs, download the version
of CQLSH that corresponds to your DataStax database version. (DSP-18694)
• Upgraded Apache MINA Core library to 2.0.21 to prevent a security issue where Apache MINA Core was
vulnerable to information disclosure. (DSP-19213)
• Update Jackson Databind to 2.9.9.1 for all components except DataStax Bulk Loader. (DSP-19441)
Resolved issues:
• Tarball installs that create two instances on the same physical server with remote JMX access, binding
separate IPs to port 7199, cause the JMX error Address already in use (Bind failed) because
com.sun.management.jmxremote.host is ignored. (DB-2483)
• Incorrect handling of frozen type issues: valid CQL statements are not accepted and invalid CQL statements
are not properly rejected. (DB-3084)
• DSE fails to start with ERROR Attempted serializing to buffer exceeded maximum of 65535 bytes. Improved
error to identify a workaround for commitlog corruption. (DB-3162)
• sstabledowngrade needs write access to the snapshot folder for a different output location. (DB-3231)
• The number of pending compactions reported by nodetool compactionstats was incorrect (off by one) for
Time Window Compaction Strategy (TWCS). (DB-3284)
• When unable to send mutations to replicas due to overloading, hints are mistakenly created against the local
node. (DB-3421)
• When a non-frozen UDT column is dropped and the table is later re-created from the schema that was
created as part of a snapshot, the dropped column record is invalid and may lead to failure loading some
SSTables. (DB-3434)
• An error in a custom provider prevents DSE node startup. With this fix, the node starts up but insights
is not active. See the DataStax Support Knowledge Base for steps to resolve existing missing or incorrect
keyspace replication problems. (DSP-19521)
Known issues:
• On Oracle Linux 7.x, StorageService.java:4970 exception occurs with DSE package installation.
(DSP-19625)
Workaround: On Oracle Linux 7.x operating systems, install DSE using the binary tarball.
• Storing and revoking permissions for the application owner is removed. Instead of explicitly storing
the application owner's permissions to manage and view Spark applications, the application owner is
assumed to have these permissions. (DSP-19393)
Resolved issues:
• Spark applications incorrectly reported that joins were broken because the DirectJoin output check was too
strict. (DSP-19063)
• Submitting many Spark applications reaches the default tombstone_failure_threshold before the default
90-day gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)
Workaround with this fix:
1. Manually grant permissions to the user before the user starts Spark jobs:
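A hedged CQL sketch of the kind of grant involved, using a hypothetical role jobs_user and assuming
DSE's workpool and submission resource syntax; verify the exact statements against the DSE Unified
Authentication documentation:
-- Assumed resource names; adjust to your datacenter and role
GRANT CREATE ON ANY WORKPOOL TO jobs_user;
GRANT DESCRIBE, MODIFY ON ANY SUBMISSION IN ANY WORKPOOL TO jobs_user;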
• Credentials are not masked in the debug level logs for Spark Jobserver and Spark submitted jobs.
(DSP-19490)
• New graph truncate command to remove all data from graph. (DSP-17609)
Resolved issues:
• T values get hidden by property keys of the same name in valueMap(). (DSP-19261)
Enhancements:
• DSE 6.0 search query latency is on par with DSE 5.1. (DSP-18677)
• For token ranges dictated by distribution, filter cache warming occurs when a node is restarted, when a search
index is rebuilt, or when the node health score reaches 0.9. New per-core metrics for metric type WarmupMetrics
and other improvements. (DSP-8621)
# facet.limit < 0 is no longer supported. Override the default facet.limit of 20000 with the
-Dsolr.max.facet.limit.size system property.
# This change adds guardrails that can cause misconfigured faceting queries to fail. Before upgrading, set
an explicit facet.limit.
Resolved issues:
• A Solr CQL count query incorrectly returned the total data count; it should return the total data count minus the
start offset. (DSP-16153)
• A validation error was not returned when docValues were applied to types that do not allow docValues.
(DSP-16884)
With this fix, the following exception behavior is applied:
# Throw an exception when docValues:true is specified for a column and the column type does not support
docValues.
# Do not throw an exception, and ignore docValues:true, for columns with types that do not support docValues
if docValues:true is set for *.
• When using live indexing, also known as Real Time (RT) indexing, stale Solr documents contain data that is
updated in the database. This issue happens when a facet query is run against a search index (core) while
inserting or loading data, and the search core is shut down. (DSP-18786)
• When the driver uses paging, a CQL query fails when using a Solr index to query with a sort on a field that
contains the primary key name: InvalidRequest: Error from server: code=2200 [Invalid
query] message="Cursor functionality requires a sort containing a uniqueKey field tie
breaker". (DSP-19210)
Known issues:
• The count() query with Solr enabled can be inaccurate or inconsistent. (DSP-19401)
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.10
DataStax Enterprise (DSE) 6.0.10 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions.
DSE 6.0.9 release notes
9 July 2019
In this section:
• 6.0.9 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.9 Components
• Netty 4.1.13.13.dse
DSE 6.0.9 is compatible with Apache Cassandra™ 3.11 and includes all DataStax enhancements from earlier
versions.
• Fixed possible data loss when using DSE Tiered Storage. (DB-3404)
If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.
• 6.0.8 Components
• 6.0.8 DSEFS
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.8 Components
All components from DSE 6.0.8 are listed. Components that are updated for DSE 6.0.8 are indicated with an
asterisk (*).
• Netty 4.1.13.13.dse *
DSE 6.0.8 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.
• Significant fixes and improvements for native memory, the chunk cache, and async read timeouts.
DSEFS highlights
• Fix handling of path alternatives in DSEFS shell to provide wildcard support for mkdir and ls commands.
(DSP-17768)
• You can now dynamically pass cluster and connection configuration for different graph objects. Fixes the
issue where DseGraphFrame cannot directly copy graph from one cluster to another. (DSP-18605)
• New configurable memory leak tracking: new nodetool leaksdetection command and Memory leak detection
settings options in cassandra.yaml. (DB-3123)
• Changes to correct uneven distribution of shard requests with the STATIC set cover finder. (DSP-18197)
• New recommended method for case-insensitive text search, faceting, grouping, and sorting with new
LowerCaseStrField Solr field type. This type sets field values as lowercase and stores them as lowercase in
docValues. (DSP-18763)
• The queryExecutorThreads and timeAllowed Solr parameters can be used together. (DSP-18717)
• Avoid interrupting request threads when an internode handshake fails so that the Lucene file channel lock
cannot be interrupted. Fixes LUCENE-8262. (DSP-18211)
# Improved lightweight transactions (LWT) performance. New cassandra.yaml LWT configuration options.
(DB-3018)
# Optimized memory usage for direct reads pool when using a high number of LWTs. (DB-3124)
When not set in cassandra.yaml, the default calculated size of direct_reads_size_in_mb changed from
128 MB to 2 MB per TPC core thread, plus 2 MB shared by non-TPC threads, with a maximum value of
128 MB.
• Improved logging identifies which client, keyspace, table, and partition key is rejected when mutation
exceeds size threshold. (DB-1051)
• Enable upgrading and downgrading SSTables using a CQL file that contains DDL statements to recreate the
schema. (DB-2951)
Resolved issues:
• Possible direct memory leak when part of bulk allocation fails. (DB-3125)
• Counters in memtable allocators and buffer pool metrics can be incorrect when out of memory (OOM)
failures occur. (DB-3126)
• Memory leak occurs when a read from disk times out. (DB-3127)
• Bootstrap should fail when the node can't fetch the schema from other nodes in the cluster. (DB-3186)
• Deadlock when replaying schema mutations from commit log during DSE startup. (DB-3190)
• Make the remote host visible in the error message for failed magic number verification. (DSP-18645)
Known issue:
• A warning message is displayed when DSE authentication is enabled, but Spark security is not enabled.
(DSP-17273)
• Spark Cassandra Connector: To improve connection for streaming applications with shorter batch times, the
default value for Keep Alive is increased to 1 hour. (DSP-17393)
Resolved issues:
• Reduce probability of hitting max_concurrent_sessions limit for OLAP workloads with BYOS (Bring Your
Own Spark). (DSP-18280)
For OLAP workloads with BYOS, DataStax recommends increasing the max_concurrent_sessions using
this formula as a guideline:
• Accessing files from Spark through WebHDFS interface fails with message: java.io.IOException:
Content-Length is missing. (DSP-18559)
• Submitting many Spark applications reaches the default tombstone_failure_threshold before the default
90-day gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)
6.0.8 DSEFS
Resolved issues:
• Fix handling of path alternatives in DSEFS shell to provide wildcard support for mkdir and ls commands.
(DSP-17768)
For example, to make several subdirectories with a single command:
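A hedged sketch, assuming brace-style path alternatives, hypothetical directory names, and the dse fs entry
point to the DSEFS shell:
$ dse fs 'mkdir -p /data/{2018,2019,2020}'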
• The graph configuration and gremlin_server sections in DSE Graph system-level options are now correctly
commented out at the top level. (DSP-18477)
Resolved issues:
• Time, date, inet, and duration data types are not supported in graph search indexes. (DSP-17694)
• Should prevent sharing Gremlin Groovy closures between scripts that are submitted through session-less
connections, like DSE drivers. (DSP-18146)
• Operations through gremlin-console run with system permissions, but should run with anonymous
permissions. (DSP-18471)
• DseGraphFrame cannot directly copy graph from one cluster to another. You can now dynamically pass
cluster and connection configuration for different graph objects. (DSP-18605)
Workaround for earlier versions:
g.V.write.format("csv").save("dsefs://cluster1/tmp/vertices")
g.E.write.format("csv").save("dsefs://cluster1/tmp/edges")
g.updateVertices(spark.read.format("csv").load("dsefs://cluster1/tmp/vertices"))
g.updateEdges(spark.read.format("csv").load("dsefs://cluster1/tmp/edges"))
• Issue querying a search index when the vertex label is set to cache properties. (DSP-18898)
• UnsatisfiedLinkError when inserting multi edges with DseGraphFrame in BYOS (Bring Your Own Spark).
(DSP-18916)
• DSE Graph does not use primary key predicate in Search/.has() predicate. (DSP-18993)
• Reject requests from the TPC backpressure queue when requests are on the queue for too long.
(DSP-15875)
• Changes to correct uneven distribution of shard requests with the STATIC set cover finder. (DSP-18197)
A new inertia parameter for dsetool set_core_property supports fine-tuning. The default value of 1 can be
adjusted for environments with more than 10 vnodes.
• New recommended method for case-insensitive text search, faceting, grouping, and sorting with new
LowerCaseStrField custom Solr field type. This type sets field values as lowercase and stores them as
lowercase in docValues. (DSP-18763)
DataStax does not support using the TextField Solr field type with solr.KeywordTokenizer and
solr.LowerCaseFilterFactory to achieve single-token, case-insensitive indexing on a CQL text field.
Resolved issues:
• SASI queries don't work on tables with row level access control (RLAC). (DB-3082)
• Documents might not be removed from the index when a key element has value equal to a Solr reserved
word. (DSP-17419)
• Avoid interrupting request threads when an internode handshake fails so that the Lucene file channel lock
cannot be interrupted. Fixes LUCENE-8262. (DSP-18211)
Workaround for earlier versions: Reload the search core without restarting or reindexing.
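For example, a sketch reloading a hypothetical keyspace.table core without reindexing (the reindex option is
assumed from dsetool reload_core conventions):
$ dsetool reload_core ks.tbl reindex=false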
• Search should error out, rather than timeout, on Solr query with non-existing field list (fl) fields. (DSP-18218)
• Fixed PartitionStrategy when setting vertex label and having includeMetaProperties configured to
true.
• Fixed bug with EventStrategy in relation to addE() where detachment was not happening properly.
• Fixed bug in detachment of Path where embedded collection objects would prevent that process.
• Quieted "host unavailable" warnings for both the driver and Gremlin Console.
• Implemented EdgeLabelVerificationStrategy.
• Fixed behavior of P for within() and without() in Gremlin Language Variants (GLV) to be consistent with
Java when using variable arguments (varargs).
• Refactored use of commons-lang to use commons-lang3 only. Dependencies may still use commons-lang.
• Added GraphSON serialization support for Duration, Char, ByteBuffer, Byte, BigInteger, and BigDecimal in
gremlin-python.
• Added ProfilingAware interface to allow steps to be notified that profile() was being called.
• Fixed bug where profile() could produce negative timings when group() contained a reducing barrier.
• Improved logic determining the dead or alive state of a Java driver connection.
• Fixed a bug in PartitionStrategy where addE() as a start step was not applying the partition.
• Added a Symbol.asyncIterator member to the Traversal class to provide support for for await ... of
loops (async iterables).
Bug fixes:
• TINKERPOP-2094 Gremlin Driver Cluster Builder serializer method does not use mimeType as suggested.
• TINKERPOP-2105 Gremlin-Python connection not returned back to the pool on exception from the Gremlin
Server.
Improvements:
• TINKERPOP-1889 JavaScript Gremlin Language Variants (GLV): Use heartbeat to prevent connection
timeout.
• TINKERPOP-2071 gremlin-python: the graphson deserializer for g:Set should return a python set.
• TINKERPOP-2074 Ensure that only NuGet packages for the current version are pushed.
• TINKERPOP-2078 Hide use of EmptyGraph or RemoteGraph behind a more unified method for
TraversalSource construction.
• TINKERPOP-2084 For remote requests in console, display the remote stack trace.
• 6.0.7 Components
6.0.7 DSEFS
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.7 Components
All components from DSE 6.0.7 are listed. Components that are updated for DSE 6.0.7 are indicated with an
asterisk (*).
• Netty 4.1.13.13.dse *
DSE 6.0.7 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.
• Improved user tools for SSTable upgrades (sstableupgrade) and downgrades (sstabledowngrade).
(DB-2950)
• New cassandra.yaml direct_reads_size_in_mb option sets the size of the new buffer pool for direct
transient reads. (DB-2958)
• Remedy deadlock during node startup when calculating disk boundaries. (DB-3028)
• The frame decoding off-heap queue size is configurable and smaller by default. (DB-3047)
• Improved updateEdges and updateVertices usability for single label update. (DSP-18404)
• Operations through gremlin-console run with anonymous instead of system permissions. (DSP-18471)
• The sstableloader downgrade from DSE to OSS Apache Cassandra is supported with new
sstabledowngrade tool. (DB-2756)
The sstabledowngrade command cannot be used to downgrade system tables or downgrade DSE
versions.
• TupleType values with null fields caused an NPE when being made byte-comparable. (DB-2872)
• Support for using sstableloader to stream OSS Cassandra 3.x and DSE 5.x data to DSE 6.0 and later.
(DB-2909)
# The buffer pool, and metrics for the buffer pool, are now split into two pools. In cassandra.yaml, the
file_cache_size_in_mb option sets the file cache (or chunk cache), and the new direct_reads_size_in_mb
option sets the pool for all other short-lived read operations. (DB-2958) The buffer pool metrics for both pools
are exposed over JMX under the org.apache.cassandra.metrics domain.
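An illustrative cassandra.yaml sketch of the two options (values are examples, not recommendations):
file_cache_size_in_mb: 2048
direct_reads_size_in_mb: 64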
# cassandra-env.sh respects heap and direct memory values set in jvm.options or as environment
variables. (DB-2973)
The precedence for heap and direct memory is:
# Environment variables
# jvm.options
# calculations in cassandra-env.sh
# AIO is automatically disabled if the chunk cache size is small enough: less than or equal to system RAM / 8.
(DB-2997)
# Limit off-heap frame queues by configurable number of frames and total number of bytes. (DB-3047)
Resolved issues:
• The sstableloader downgrade from DSE to OSS Apache Cassandra is not supported. New
sstabledowngrade tool is required. (DB-2756)
• nodesync fails when validating MV row with empty partition key. (DB-2823)
• TupleType values with null fields cause an NPE when being made byte-comparable. (DB-2872)
• The memory in use in the buffer pool is not identical to the memory allocated. (DB-2904)
• Offline sstable tools fail with Out of Direct Memory error. (DB-2955)
• DIRECT_MEMORY is being calculated using 25% of total system memory if -Xmx is set in jvm.options.
(DB-2973)
• Netty direct buffers can potentially double the -XX:MaxDirectMemorySize limit. (DB-2993)
• Increased NIO direct memory because the buffers are not cleaned until GC is run. (DB-2996)
• The check of two versions of metadata for a column fails on upgrade from DSE 5.0.x when the type is not of
the same class. Loosened the check from CASSANDRA-13776 to prevent the Trying to compare 2 different
types ERROR on upgrades. (DB-3021)
• Deserialization of dropped UDT columns in SSTables is broken after upgrading from DSE 5.0. (DB-3031)
• Kerberos protocol and QoP parameters are not correctly propagated. (DSP-15455)
• RpcExecutionException does not print the user who is not authorized to perform a certain action.
(DSP-15895)
Known issue:
Resolved issues:
• After client-to-node SSL is enabled, all Spark nodes must also listen on port 7480. (DSP-15744)
• dse client-tool configuration byos-export does not export required Spark properties. (DSP-15938)
• Downloaded Spark JAR files are executable for all users. (DSP-17692)
• Issue with viewing information for completed jobs when authentication is enabled. (DSP-17854)
• Spark Cassandra Connector does not properly cache manually prepared RegularStatements; see
SPARKC-558. (DSP-18075)
• Invalid options show for dse spark-submit command line help. (DSP-18293)
• Spark SQL function concat_ws results in a compilation error when an array column is included in the column
list and when the number of columns to be concatenated exceeds 8. (DSP-18383)
• Improved error messaging for AlwaysOn SQL (AOSS) client tool. (DSP-18409)
• CQL syntax error when a single quote is not correctly escaped before being included in a save cache query to
the AOSS cache table. (DSP-18418)
Known issue:
• DSE 6.0.7 is not compatible with Zeppelin in SparkR and PySpark 0.8.1. (DSP-18777)
The Apache Spark™ 2.2.3.4 that is included with DSE 6.0.7 contains the patched protocol, and all versions
of DSE are compatible with the Scala interpreter.
However, SparkR and PySpark use a separate channel for communication with Zeppelin. This protocol
was vulnerable to attack from other users on the system and was secured in CVE-2018-11760. Zeppelin
in SparkR and PySpark 0.8.1 fails because it does not recognize that Spark 2.2.2 and later contain this
patched protocol, and it attempts to use the old protocol. The Zeppelin patch to recognize this protocol is not
available in a released Zeppelin build.
Solution: Do not upgrade to DSE 6.0.7 if you use SparkR or PySpark. Wait for the Zeppelin release later
than 0.8.1 that will recognize that DSE-packaged Spark can use the secured protocol.
• Submitting many Spark applications reaches the default tombstone_failure_threshold before the default
90-day gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)
Workaround for use cases where a large number of Spark jobs are submitted:
1. Before the user starts the Spark jobs, manually grant permissions to the user:
3. After this user completes all the Spark jobs, revoke permissions for the user:
6.0.7 DSEFS
Resolved issues:
• Change dsefs:// default port when the DSEFS setting public_port is changed in dse.yaml. (DSP-17962)
The shortcut dsefs:/// now automatically resolves to broadcastaddress:dsefs.public_port, instead of
incorrectly using broadcastaddress:5598 regardless of the configured port.
• DSEFS WebHDFS API GETFILESTATUS op returns AccessDeniedException for the file even when user
has correct permission. (DSP-18044)
• Problem with change group ownership of files using the fileSystem.setOwner method. (DSP-18052)
• Vertex and especially edge loading is simplified; the idColumn function is no longer required. (DSP-18404)
Resolved issues:
• OLAP traversal duplicates the partition key properties: OLAP g.V().properties() prints 'first' vertex n times
with custom ids. (DSP-15688)
• Edges are inserted with tombstone values set when inserting a recursive edge with multiple cardinality.
(DSP-17377)
Resolved issues:
• Solr HTTP request for CSV output is blank. The CSVResponseWriter returns only stored fields if a field list is
not provided in the URL. (DSP-18029)
To work around this issue, specify a field list in the URL:
/select?q=*%3A*&sort=lst_updt_gdttm+desc&rows=10&fl=field1,field2&wt=csv&indent=true
• Disabled the ScriptEngine global function cache, which can hold on to references to "g", along with some
other minor bug fixes and enhancements.
27 February 2019
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.6 are listed. Components that are updated for DSE 6.0.6 are indicated with an
asterisk (*).
• Netty 4.1.13.12.dse
DSE 6.0.6 is compatible with Apache Cassandra™ 3.11 and includes all production-certified changes from
earlier versions.
DSE 6.0.6 Important bug fix
• DSE 5.0 SSTables with UDTs are corrupted in DSE 5.1, DSE 6.0, and DSE 6.7. (DB-2954,
CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), the SSTable serialization headers are fixed
when DSE is started with DSE 6.0.6 or later.
6.0.5 DSEFS
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
All components from DSE 6.0.5 are listed. Components that are updated for DSE 6.0.5 are indicated with an
asterisk (*).
• Netty 4.1.13.12.dse *
DSE 6.0.5 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.
• DSE Metrics Collector aggregates DSE metrics and integrates with existing monitoring solutions to facilitate
problem resolution and remediation. (DSP-17319)
See:
• Fixed resource leak related to streaming operations that affects tiered storage users. Excessive number of
TieredRowWriter threads causing java.lang.OutOfMemoryError. (DB-2463)
• An exception now occurs when a user with no permissions queries a restricted table; previously, no rows were returned. (DB-2668)
• Upgraded nodes that still have big-format SSTables from DSE 5.x caused errors during read. (DB-2801)
• Fixed an issue where heap memory usage seems higher with default file cache settings. (DB-2865)
• Fixed prepared statement cache issues when using row-level access control (RLAC) permissions. Existing
prepared statements were not correctly invalidated. (DB-2867)
• You use scripts that invoke DSEFS commands and need to handle failures properly.
• You use dse spark-sql-metastore-migrate with DSE Unified Authentication and internal authentication.
(DSP-17632)
• You have DSE 5.0.x with DSEFS client connected to DSE 5.1.x and later DSEFS server. (DSP-17600)
• You get errors for OLAP traversals after dropping schema elements. (DSP-15884)
• You want server side error messages for remote exceptions reported in Gremlin console. (DSP-16375)
• Use graph OLAP and want secret tokens redacted in log files. (DSP-18074)
• You want to build fuzzy-text search indexes on string properties that form part of a vertex label ID.
(DSP-17386)
# Upgrade Apache Commons Compress to prevent Denial Of Service (DoS) vulnerability present in
Commons Compress 1.16.1, CVE-2018-11771. (DSP-17019)
# Critical memory leak and corruption fixes for encrypted indexes. (DSP-17111)
• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.
# nodetool listendpointspendinghints command prints hint information about the endpoints this node has
hints for. (DB-1674)
# nodetool rebuild_view rebuilds materialized views for local data. Existing view data is not cleared.
(DB-2451)
# Improved messages for nodetool nodesyncservice ratesimulator command include explanation for
single node clusters and when no tables have NodeSync enabled. (DB-2468)
• Direct Memory field output of nodetool gcstats includes all allocated off-heap memory. Metrics for native
memory are added in org.apache.cassandra.metrics.NativeMemoryMetrics.java. (DB-2796)
• Batch replay is interrupted and good batches are skipped when a mutation of an unknown table is found.
(DB-2855)
• The new environment variable MAX_DIRECT_MEMORY overrides the cassandra.yaml value for how much direct
memory (NIO direct buffers) the JVM can use. (DB-2919)
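For example, a sketch of overriding the limit before starting DSE (the value is illustrative):
$ export MAX_DIRECT_MEMORY=8G
$ dse cassandra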
Resolved issues:
• Running the nodetool nodesyncservice enable command reports the error NodeSyncRecord
constructor assertion failed. (DB-2280)
Workaround: Before DSE 6.0.5, a restart of DSE resolves the issue so that you can execute the command
and enable NodeSync without error.
• Rebuild should not fail when a keyspace is not replicated to other datacenters. (DB-2301)
• Repair may skip some ranges due to received range cache. (DB-2432)
• Read and compaction errors with leveled compaction strategy (LCS). (DB-2446)
• Chunk cache can retain data from a previous version of a file, causing restore failures. (DB-2489)
• LineNumberInference is not failure-safe; failing to find the source information can break the request. (DB-2568)
• Improved error message when Netty Epoll library cannot be loaded. (DB-2579)
• The nodetool gcstats command output incorrectly reports the GC reclaimed metric in bytes, instead of the
expected MB. (DB-2598)
• Incorrect order of application of nodetool garbagecollect leaves tombstones that should be deleted.
(DB-2658)
• An exception should occur, instead of returning no rows, when a user with no permissions queries a restricted table. (DB-2668)
• DSE does not start with Unable to gossip with any peers error if cross_node_timeout is true.
(DB-2670)
• Heap memory usage is higher with default file cache settings. (DB-2865)
• Prepared statement cache issues when using row-level access control (RLAC) permissions. Existing
prepared statements are not correctly invalidated. (DB-2867)
• User-defined aggregates (UDAs) that instantiate user-defined types (UDTs) break after restart. (DB-2771)
• Upgraded nodes that still have big-format SSTables from DSE 5.x can cause errors during read. (DB-2801)
Workaround for upgrades from DSE 5.x to DSE versions before 6.0.5 and DSE 6.7.0: Run offline
sstableupgrade before starting the upgraded node.
• Late continuous paging errors can leave unreleased buffers behind. (DB-2862)
• Improve config encryption error reporting for missing system key and unencrypted passwords. (DSP-17480)
• Fix sstableloader error when internode encryption, client_encryption, and config encryption are enabled.
(DSP-17536)
• Improved error handling: only submission-related error exceptions from Spark submitted applications are
wrapped in a Dse Spark Submit Bootstrapper Failed to Submit error. (DSP-16359)
• Improved error message for dse client-tool when DSE Analytics is not correctly configured. (DSP-17322)
# Provide a way for clients to determine if AlwaysOn SQL (AOSS) is enabled in DSE. (DSP-17180)
# Improved logging messages with recommended resolutions for AlwaysOn SQL (AOSS). (DSP-17326,
DSP-17533)
# Improved error message for AlwaysOn SQL (AOSS) when the role specified by auth_user does not
exist. (DSP-17358)
# Set default for spark.sql.thriftServer.incrementalCollect to true for AlwaysOn SQL (AOSS). (DSP-17428)
# Structured Streaming support for BYOS (Bring Your Own Spark) Spark 2.3. (DSP-17593)
Resolved issues:
• Race condition allows Spark Executor working directories to be removed before stopping those executors.
(DSP-15769)
• Restore DseGraphFrame support in BYOS and spark-dependencies artifacts. Include graph frames python
library in graphframe.jar. (DSP-16383)
• Search optimizations for search analytics Spark SQL queries are applied to a datacenter that no longer has
search enabled. Queries launched from a search-enabled datacenter cause search optimizations even when
the target datacenter does not have search enabled. (DSP-16465)
• Unable to get available memory before Spark Workers are registered. (DSP-16790)
• Spark shell error Cannot proxy as a super user occurs when AlwaysOn Spark SQL (AOSS) is running
with authentication. (DSP-17200)
• Spark Connector has hard dependencies on dse-core when running Spark Application tests with dse-
connector. (DSP-17232)
• AlwaysOn SQL (AOSS) should attempt to auto start again on datacenter restart, regardless of the previous
status. (DSP-17359)
• AlwaysOn SQL (AOSS) restart hangs for at least 15 minutes if it cannot start, should fail with meaningful
error message. (DSP-17264)
• Submission in client mode does not support specifying remote jars (DSEFS) for main application resource
(main jar) and jars specified with --jars / spark.jars. (DSP-17382)
• Incorrect conversions in DirectJoin Spark SQL operations for timestamps, UDTs, and collections.
(DSP-17444)
• DSE 5.0.x DSEFS client is not able to list files when connected to 5.1.x (and up) DSEFS server.
(DSP-17600)
• dse spark-sql-metastore-migrate does not work with DSE Unified Authentication and internal
authentication. (DSP-17632)
6.0.5 DSEFS
• Add the ability to disable and configure DSEFS internode (node-to-node) authentication. (DSP-17721)
Resolved issues:
• DSEFS throws exceptions and cannot initialize when listen_address is left blank. (DSP-16296)
• Moving a directory under itself causes data loss and orphan data structures. (DSP-17347)
• New tool fixes inconsistencies in graph data that are caused by schema changes, like label delete, or
improper data loading. (DSP-15884)
# Spark: spark.dseGraph("name").cleanUp()
JMX operations are not cluster-aware. Invoke on each node as appropriate to your environment.
Resolved issues:
• DseGraphFrame (DSEGF) label drop hangs with a lot of edges where both ends have the same label. (DSP-17096)
• A Gremlin query with search predicate containing \u2028 or \u2029 characters fails. (DSP-17227)
• Geo.inside predicate with Polygon no longer works on secondary index if JTS is not installed. (DSP-17284)
• Search indexes on key fields work only with non-tokenized queries. (DSP-17386)
• DseGraphFrame fails to read properties with symbols, like period (.), in names. (DSP-17818)
• Large queries with oversize frames no longer cause buffer corruption on the receiver. (DSP-15664)
• If a client executes a query that results in a shard attempting to send an internode frame larger than the size
specified in frame_length_in_mb, the client receives an error message like this:
Attempted to write a frame of <n> bytes with a maximum frame size of <n> bytes
In earlier versions, the query timed out with no message; information was provided only as an error in the logs.
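The limit is configured in dse.yaml (a sketch with an illustrative value; the internode_messaging_options
section name is assumed from the standard dse.yaml layout):
internode_messaging_options:
    frame_length_in_mb: 256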
• In earlier releases, CQL search queries failed with UTFDataFormatException on very large SELECT clauses
and when tables have a very large number of columns. (DSP-17220)
With this fix, CQL search queries fail with UTFDataFormatException only when the SELECT clause constitutes
a string larger than 64 KB of UTF-8 encoded bytes.
• Upgrade Apache Commons Compress to prevent Denial Of Service (DoS) vulnerability present in Commons
Compress 1.16.1, CVE-2018-11771. (DSP-17019)
• Requesting a core reindex with dsetool reload_core or REBUILD SEARCH INDEX no longer builds up a
queue of reindexing tasks on a node. Instead, a single starting reindexing task handles all reindex requests
that are already submitted to that node. (DSP-17045, DSP-13030)
• The calculated value for maxMergeCount is changed to improve indexing performance. (DSP-17597)
where num_tokens is the number of token ranges to assign to the virtual node (vnode) as configured in
cassandra.yaml.
Resolved issues:
• Race condition occurs on bootstrap completion and Solr core fails to initialize during node bootstrap.
(DB-1383, DSP-14823)
Workaround: Restart the node that failed to initialize.
• Internode protocol can send oversize frames causing buffer corruption on the receiver. (DSP-15664)
• CQL search queries fail with UTFDataFormatException on very large SELECT clauses. (DSP-17220)
With this fix, CQL search queries fail with UTFDataFormatException only when the SELECT clause constitutes
a string larger than 64 KB of UTF-8 encoded bytes.
• Unexpected search index errors occur when non-ASCII characters, like the U+3000 (ideographic space)
character, are in indexed columns. (DSP-17816, DSP-17961)
• TextField type in search index schema should be case-sensitive if created when using copyField.
(DSP-17817)
• gf.V().id().next() causes data to get mismatched with properties in legacy DseGraphFrame. (DSP-17979)
• Avoid calling iter.next() in a loop when notifying indexers about range tombstones (CASSANDRA-14794)
• DESC order reads can fail to return the last Unfiltered in the partition (CASSANDRA-14766)
• 6.0.4 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.4 Components
All components from DSE 6.0.4 are listed. No components were updated from the previous DSE version.
• Netty 4.1.13.11.dse
DSE 6.0.4 is compatible with Apache Cassandra™ 3.11 and includes all production-certified enhancements from
earlier DSE versions.
General upgrade advice for DSE 6.0.4
DataStax Enterprise 6.0.4 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
DSE 6.0.3 release notes
20 September 2018
DataStax recommends installing the latest patch release. Due to DB-2477, DataStax does not recommend
using DSE 6.0.3 for production.
• 6.0.3 Components
6.0.3 DSEFS
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.3 Components
All components from DSE 6.0.3 are listed. Components that are updated for DSE 6.0.3 are indicated with an
asterisk (*).
• Netty 4.1.13.11.dse
DataStax Enterprise 6.0.3 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from earlier DSE versions.
• Deleting a static column and adding it back as a non-static column introduces corruption. (DB-1630)
• NodeSync command line tool only connects over JMX to a single node. (DB-1693)
• Unexpected behavior change when using row-level permissions with modification conditions like IF EXISTS.
(DB-2429)
• Jetty 9.4.11 upgrade addresses security vulnerabilities in Spark dependencies packaged with DSE.
(DSP-16893)
• dse spark-submit kill and status commands optionally support an explicit Spark Master IP address.
(DSP-16910, DSP-16991)
• Fixed problems with temporary and data directories for Spark applications. (DSP-15476, DSP-15880)
• Spark Cassandra Connector method saveToCassandra should not require solr_query column when search
is enabled. (DSP-16427)
• Fully qualified paths with resource URL are correctly resolved in Spark structured streaming checkpointing.
Backport SPARK-20894. (DSP-16972)
DSEFS highlights
Important bug fixes:
• Only superusers are allowed to remove corrupted non-empty directories when authentication is enabled for
DSEFS. Improved error message when performing an operation on a corrupted path. (DSP-16340)
• DSEFS Hadoop layer doesn't properly translate DSEFS exceptions to Hadoop exceptions in some methods.
(DSP-16933)
• Closing DSEFS client before all issued requests are completed causes unexpected message type:
DefaultLastHttpContent error. (DSP-16953)
• Under high loads, DSEFS reports temporary incorrect state for various files/directories. (DSP-17178)
• Aligned query behavior using geo.inside() predicate for polygon search with and without search indexes.
(DSP-16108)
• Fixed bug where deleting a search index that was defined inside a graph fails. (DSP-16765)
• Changed default write consistency level (CL) for Graph to LOCAL_QUORUM. (DSP-17140)
In earlier DSE versions, the default QUORUM write consistency level (CL) was not appropriate for multi-
datacenter production environments.
• Reduce the number of token filters for distributed searches with vnodes. (DSP-14189)
• Avoid unnecessary exception and error creation in the Solr query parser. (DSP-17147)
• Avoid accumulating redundant router state updates during schema disagreement. (DSP-15615)
• A search enabled node could return different exceptions than a non-search enabled node when a keyspace
or table did not exist. (DSP-16834)
• DSE does not start without appropriate Tomcat JAR scanning exclusions. (DSP-16841)
• CQL single-pass queries have incorrect results when query is run with primary key and search index
schema does not contain all columns in selection. (DSP-16895)
• Node health score of 1 is not obtainable. Search node gets stuck at 0.00 node health score after replacing a
node in a cluster. (DSP-17107)
If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.
• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.
• Due to Thread Per Core (TPC) asynchronous request processing architecture, the
index_summary_capacity_in_mb and index_summary_resize_interval_in_minutes settings in
cassandra.yaml are removed. (DB-2390)
• NodeSync waits to start until all nodes in the cluster are upgraded. (DB-2385)
• Improved error handling and logging for TDE encryption key management. (DP-15314)
• DataStax does more extensive testing on OpenJDK 8 due to the end of public updates for Oracle JRE/JDK
8. (DSP-16179)
Resolved issues:
• NodeSync command line tool only connects over JMX to a single node. (DB-1693)
• Move TWCS message "No compaction necessary for bucket size" to Trace level or NoSpam. (DB-2022)
• sstableloader options assume the RPC/native (client) interface is the same as the internode (node-to-node)
interface. (DB-2184)
• NodeSync fails on upgraded nodes while a cluster is in a partially upgraded state. (DB-2385)
• Compaction strategy instantiation errors don't generate meaningful error messages, instead return only
InvocationTargetException. (DB-2404)
• Unexpected behavior change when using row-level permissions with modification conditions like IF EXISTS.
(DB-2429)
• Authentication cache loading can exhaust native threads, and the Spark master node cannot be elected.
(DB-2248)
• Audit events for CREATE ROLE and ALTER ROLE with incorrect spacing expose PASSWORD in plain
text. (DB-2285)
• Timestamps inserted with ISO 8601 format are saved with wrong millisecond value. (DB-2312)
• Error out if not all permissions for GRANT/REVOKE/RESTRICT/UNRESTRICT are applicable for a
resource. (DB-2373)
• BulkLoader class exits without printing the stack trace for throwable error. (DB-2377)
• Unexpected behavior change when using row-level permissions with modification conditions like IF EXISTS.
(DB-2429)
• Using geo types does not work when memtable allocation type is set to offheap_objects. (DSP-16302)
• The -graph option for the cassandra-stress tool failed when generating the target output HTML in the JAR file.
(DSP-17046)
Known issue:
• Upgraded nodes that still have big-format SSTables from DSE 5.x can cause errors during read. (DB-2801)
Workaround for upgrades from DSE 5.x to DSE versions before 6.0.5 and DSE 6.7.0: Run offline
sstableupgrade before starting the upgraded node.
• DSE pyspark libraries are added to PYTHONPATH for dse exec command. Add support for Jupyter
integration. (DSP-16797)
• dse spark-submit kill and status commands optionally support an explicit master address. (DSP-16910,
DSP-16991)
• Jetty 9.4.11 upgrade addresses security vulnerabilities in Spark dependencies packaged with DSE.
(DSP-16893)
Resolved issues:
• Problems with temporary and data directories for Spark applications. (DSP-15476, DSP-15880)
# DSE client applications, like Spark, will not start if user HOME environment variable is not defined, user
home directory does not exist, or the current user does not have write permissions.
# Temporary data directory for AOSS is /var/log/spark/rdd, the same as the server-side temporary
data location for Spark. Configurable with SPARK_EXECUTOR_DIRS environment variable in spark-
env.sh.
# If TMPDIR environment variable is missing, /tmp is set for all DSE apps. If /tmp directory does not
exist, it is created with 1777 permissions. If directory creation fails, perform a hard stop.
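For example, a minimal spark-env.sh fragment overriding the AOSS temporary data location (the path shown is illustrative, not a documented default):
# spark-env.sh: redirect temporary data for Spark executors
export SPARK_EXECUTOR_DIRS="/data/spark/rdd"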
• Improved security isolates Spark applications; prevents run_as runner for Spark from running a malicious
program. (DSP-16093)
• Spark Cassandra Connector method saveToCassandra should not require solr_query column when search
is enabled. (DSP-16427)
• DSE Spark logging does not match OSS Spark logging levels. (DSP-16726)
• Metastore can't handle table with 100+ columns with auto Spark SQL table creation. (DSP-16742)
• DseDirectJoin and reading from Hive Tables does not work in Spark Structured Streaming. (DSP-16856)
• Fully qualified paths with resource URL are resolved in Spark structured streaming checkpointing. Backport
SPARK-20894. (DSP-16972)
• AlwaysOn SQL (AOSS) dsefs directory creation does not wait for all operations to finish before closing
DSEFS client. (DSP-16997)
6.0.3 DSEFS
• Only superusers are able to remove corrupted non-empty directories when authentication is enabled for
DSEFS. (DSP-16340)
Resolved issues:
• In DSEFS shell, listing too many local file system directories in a single session causes a file descriptor leak.
(DSP-16657)
• DSEFS fails to start when there is a table with the duration type or another type DSEFS can't understand.
(DSP-16825)
• DSEFS Hadoop layer doesn't properly translate DSEFS exceptions to Hadoop exceptions in some methods.
(DSP-16933)
• Closing DSEFS client before all issued requests are completed causes unexpected message type:
DefaultLastHttpContent error. (DSP-16953)
• Under high loads, DSEFS reports temporary incorrect state for various files/directories. (DSP-17178)
schema.config().option('graph.traversal_sources.g.evaluation_timeout').set(Duration.ofDays(1094))
Resolved issues:
• Align query behavior using geo.inside() predicate for polygon search with and without search indexes.
(DSP-16108)
• Avoid looping indefinitely when a thread making internode requests is interrupted while trying to acquire a
connection. (DSP-16544)
• Deleting a search index that was defined inside a graph fails. (DSP-16765)
• Reduce the number of unique token selections for distributed searches with vnodes. (DSP-14189)
Search load balancing strategies are per search index (per core) and are set with dsetool set_core_property.
• Avoid unnecessary exception and error creation in the Solr query parser. (DSP-17147)
Resolved issues:
• Avoid accumulating redundant router state updates during schema disagreement. (DSP-15615)
• NRT codec is not registered at startup for Solr cores that have switched to RT. (DSP-16663)
• Dropping search index when index build is in progress can interrupt Solr core closure. (DSP-16774)
• Exceptions thrown when search is enabled and table is not found in existing keyspace. (DSP-16834)
• DSE should not start without appropriate Tomcat JAR scanning exclusions. (DSP-16841)
• CQL single-pass queries have incorrect results when query is run with primary key and search index
schema does not contain all columns in selection. (DSP-16895)
Best practice: For optimal single-pass queries, including queries where solr_query is used with a partition
restriction, and queries with partition restrictions and a search predicate, ensure that the columns to
SELECT are not indexed in the search index schema.
Workaround: Since auto-generation indexes all columns by default, you can ensure that the field is not
indexed but still returned in a single-pass query. For example, this statement indexes everything except
for column c3, and informs the search index schema about column c3 for efficient and correct single-pass
queries.
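A hedged sketch of such a statement, assuming a table ks.tbl and the { indexed : false } column option (the option name is an assumption; verify against the CREATE SEARCH INDEX reference for your DSE version):
CREATE SEARCH INDEX ON ks.tbl WITH COLUMNS *, c3 { indexed : false };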
• Node health score of 1 is not obtainable. Search node gets stuck at 0.00 node health score after replacing a
node in a cluster. (DSP-17107)
• 6.0.2 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.2 Components
All components from DSE 6.0.2 are listed. Components that are updated for DSE 6.0.2 are indicated with an
asterisk (*).
• Netty 4.1.13.11.dse
DataStax Enterprise 6.0.2 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from earlier DSE versions.
• Fixed issue where CassandraConnectionConf creates excessive database connections and reports too
many HashedWheelTimer instances. (DSP-16365)
DSE Graph
DSE Search
• Schemas with stored=true work because stored=true is ignored. The workaround for 6.0.x upgrades with
schema.xml fields with “indexed=false, stored=true, docValues=true” is no longer required. (DSP-16392)
• Minor bug fixes and error handling improvements. (DSP-16435, DSP-16061, DSP-16078)
• -d option to create local encryption keys without configuring the directory in dse.yaml. (DSP-15380)
Resolved issues:
• Use more precise grep patterns to prevent accidental matches in cassandra-env.sh. (DB-2114)
• For tables using DSE Tiered Storage, nodetool cleanup places cleaned SSTables in the wrong tier.
(DB-2173)
• Support creating system keys before the output directory is configured in dse.yaml. (DSP-15380)
• Improved compatibility with external tables stored in the DSE Metastore in remote systems. (DSP-16561)
• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.
• Apache Hadoop Azure libraries for Hadoop 2.7.1 have been added to the Spark classpath to simplify
integration with Microsoft Azure and Microsoft Azure Blob Storage. (DSP-15943)
# AlwaysOn SQL (AOSS) support for enabling Kerberos and SSL at the same time. (DSP-16087)
# Add 120 seconds wait time so that Spark Master recovery process completes before status check of
AlwaysOn SQL (AOSS) app. (DSP-16249)
# AlwaysOn SQL (AOSS) driver continually runs on a node even when DSE is down. (DSP-16297)
# Improved defaults and errors for AlwaysOn SQL (AOSS) workpool. (DSP-16343)
Resolved issues:
• Need to disable cluster object JMX metrics report to prevent count exceptions spam in Spark driver log.
(DSP-16442)
6.0.2 DSEFS
• DSEFS operations: chown, chgrp, and chmod support recursive (-R) and verbose (-v) flag. (DSP-14238)
# Idle DSEFS internode connections are closed after 120 seconds. Configurable with new dse.yaml
option internode_idle_connection_timeout_ms.
• DSEFS clients close idle connections after 60 seconds, configurable in dse.yaml. (DSP-14284)
# If the second read is issued after a failed read, it is not blocked forever. The stream is automatically
closed on errors, and subsequent reads will fail with IllegalStateException.
# The timeout message includes information about the underlying DataSource object.
# No more reads are issued to the underlying DataSource after it reports hasMoreData = false.
# The read loop has been simplified to properly move to the next buffer if the requested number of bytes
hasn't been delivered yet.
# Empty buffer returned from the DataSource when hasMoreData = true is not treated as an EOF. The
read method validates offset and length arguments.
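A hypothetical dse.yaml fragment for the two idle-connection timeouts described above; the nesting under dsefs_options and the client option name are assumptions, and values are in milliseconds:
dsefs_options:
  # close idle DSEFS internode connections after 120 seconds
  internode_idle_connection_timeout_ms: 120000
  # close idle DSEFS client connections after 60 seconds (assumed option name)
  idle_connection_timeout_ms: 60000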
• Security improvement: DSEFS uses an isolated native memory pool for file data and metadata sent between
nodes. This isolation makes it harder to exploit potential memory management bugs. (DSP-16492)
Resolved issues:
• DSEFS silently fails when TCP port 5599 is not open between nodes. (DSP-16101)
• Vertices and vertex properties created or modified with graphframes respect TTL as defined in the schema.
In earlier versions, vertices and vertex properties had no TTL. Edges created or modified with graphframes
continue to have no TTL. (DSP-15555)
Resolved issues:
• DGF interceptor does not take into account GraphStep parameters with g.V(id) queries. (DSP-16172)
• The LIMIT clause does not work in a graph traversal with the search predicate TOKEN, returning only a subset
of expected results. (DSP-16292)
• The node health option uptime_ramp_up_period_seconds default value in dse.yaml is reduced to 3 hours
(10800 seconds). (DSP-15752)
• Use monotonically increasing time source for search query execution latency calculation. (DSP-16435)
Resolved issues:
• DataStax Bulk Loader (dsbulk) version 1.1.0 is automatically installed with DataStax Enterprise 6.0.2, and
can also be installed as a standalone tool. See DataStax Bulk Loader 1.1.0 release notes. (DSP-16484)
• Fixed regression issue where the HTTPChannelizer doesn’t instantiate the specified
AuthenticationHandler.
• 6.0.1 Components
• 6.0.1 Highlights
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
6.0.1 Components
All components from DSE 6.0.1 are listed. Components that are updated for DSE 6.0.1 are indicated with an
asterisk (*).
• Netty 4.1.13.11.dse
DSE 6.0.1 is compatible with Apache Cassandra™ 3.11 and adds additional production-certified enhancements.
• Fixed issue where multiple Spark Masters can be started on the same machine. (DSP-15636)
• Improved AlwaysOn SQL (AOSS) startup reliability. (DSP-15871, DSP-15468, DSP-15695, DSP-15839)
• Resolved the missing /tmp directory in DSEFS after fresh cluster installation. (DSP-16058)
• Fixed the HashedWheelTimer leak in Spark Connector that affected BYOS. (DSP-15569)
DSE Search
• Fix for the known issue that prevented using TTL (time-to-live) with DSE Search live indexing (RT indexing).
(DSP-16038, DSP-14216)
• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.
• LDAP tuning parameters allow all LDAP connection pool options to be set. (DSP-15948)
Resolved issues:
• Use the indexed item type as backing table key validator of 2i on collections. (DB-1121)
• Add getConcurrentCompactors to JMX in order to avoid loading DatabaseDescriptor to check its value in
nodetool. (DB-1730)
• Send a final error message when a continuous paging session is cancelled. (DB-1798)
• Apply view batchlog mutation parallel with local view mutations. (DB-1900)
• Use same IO queue depth as Linux scheduler and advise against overriding it. (DB-1909)
• Fix startup error message rejecting COMPACT STORAGE after upgrade. (DB-1916)
• Improve user warnings on startup when libaio package is not installed. (DB-1917)
• Prevent OOM due to OutboundTcpConnection backlog by dropping request messages after the queue
becomes too large. (DB-2001)
• sstableloader does not decrypt passwords using config encryption in DSE. (DSP-13492)
• The Spark Jobserver demo has an incorrect version for the Spark Jobserver API. (DSP-15832)
Workaround: In the demo's gradle.properties file, change the version from 0.6.2 to 0.6.2.238.
• Decreased the number of exceptions logged during master move from node to node. (DSP-14405)
• When querying remote cluster from Spark job, connector does not route requests to data replicas.
(DSP-15202)
• AlwaysOn SQL dependency on JPS is removed. The jps_directory entry in dse.yaml is removed.
(DSP-15468)
• Improved security for Spark JobServer. All uploaded JARs, temporary files, and logs are created under the
current user's home directory: ~/.spark-jobserver. (DSP-15832)
• During misconfigured cluster bootstrap, the AlwaysOn SqlServer does not start due to missing /tmp/hive
directory in DSEFS. (DSP-16058)
Resolved issues:
• A shard request timeout caused an assertion error from Lucene getNumericDocValues in the log.
(DSP-14216)
• In some situations, AlwaysOn SQL cannot start unless DSE node is restarted. (DSP-15871)
• Java driver in Spark Connector uses daemon threads to prevent shutdown hooks from being blocked by
driver thread pools. (DSP-16051)
• dse client-tool spark sql-schema --all exports definitions for solr_admin keyspace. (DSP-16073).
6.0.1 DSEFS
Resolved issues:
• DseGraphFrame performance improvement reduces number of joins for count() and other id only queries.
(DSP-15554)
• Performance improvements for traversal execution with Fluent API and script-based executions.
(DSP-15686)
Resolved issues:
• When using graph frames, cannot upload edges when ids for vertices are complex non-text ids.
(DSP-15614)
• CassandraHiveMetastore is prevented from adding multiple partitions for file-based data sources. Fixes
MSCK REPAIR TABLE command. (DSP-16067)
• Output Solr foreign filter cache warning only on classes other than DSE classes. (DSP-15625)
# Xerces2-j: CVE-2013-4002
# uimaj-core: CVE-2017-15691
Resolved issues:
• Offline sstable tools fail if a DSE Search index is present on a table. (DSP-15628)
• HTTP read on solr_stress doesn't inject random data into placeholders. (DSP-15727)
• Search index TTL Expiration thread loops without effect with live indexing (RT indexing). (DSP-16038)
• Search incorrectly assumes only single-row ORDER BY clauses on first clustering key. (DSP-16064)
DataStax recommends using the latest DataStax Bulk Loader 1.2.0. For details, see DataStax Bulk Loader.
Cassandra enhancements for DSE 6.0.1
DataStax Enterprise 6.0.1 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:
• cassandra-stress throws NPE if insert section isn't specified in user profile (CASSANDRA-14426)
• Don't use guava collections in the non-system keyspace jmx attributes (CASSANDRA-12271)
• Serialize empty buffer as empty string for json output format (CASSANDRA-14245)
• Cassandra not starting when using enhanced startup scripts in windows (CASSANDRA-14418)
• Delay hints store excise by write timeout to avoid race with decommission (CASSANDRA-13740)
• Avoid deadlock when running nodetool refresh before node is fully up (CASSANDRA-14310)
• CqlRecordReader no longer quotes the keyspace when connecting, as the java driver will
(CASSANDRA-10751)
• Bump to Groovy 2.4.15 - resolves a Groovy bug preventing Lambda creation in GLVs in some cases.
(TINKERPOP-1953)
• 6.0.0 Components
DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:
• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.
• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.
• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.
DSE 6.0.0: Do not use TTL (time-to-live) with DSE Search live indexing (RT indexing). To use these features
together, upgrade to DSE 6.0.1. (DSP-16038)
6.0.0 Components
• Netty 4.1.13.11.dse
DSE 6.0 is compatible with Apache Cassandra™ 3.11 and adds additional production-certified enhancements.
Experimental features. These features are experimental and are not supported for production:
• SASI indexes.
Known issues:
Workaround: create a directory that matches the keyspace name, and then create a symbolic link into that
directory from the snapshot directory, named for the destination table. For example:
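A hedged illustration with hypothetical names (keyspace ks1, table tbl1, snapshot snap1):
$ mkdir ks1
$ ln -s /var/lib/cassandra/data/ks1/tbl1-<table_id>/snapshots/snap1 ks1/tbl1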
• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.
• DSE 6.0 will not start with OpsCenter 6.1 installed. OpsCenter 6.5 is required for managing DSE 6.0
clusters. See DataStax OpsCenter compatibility with DSE. (DSP-15996)
Support for Thrift-compatible tables (COMPACT STORAGE) is dropped. Before upgrading to DSE 6.0, you
must migrate all tables that have COMPACT STORAGE to CQL table format.
Upgrades from DSE 5.0.x or DSE 5.1.x with Thrift-compatible tables require DSE 5.1.6 or later or DSE 5.0.12
or later.
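On those versions, the migration is a single CQL statement per table; a sketch with a hypothetical table ks1.legacy:
ALTER TABLE ks1.legacy DROP COMPACT STORAGE;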
• Allow user-defined functions (UDFs), including non-deterministic UDFs, within the GROUP BY clause. New
CQL keywords DETERMINISTIC and MONOTONIC are added. The default behavior of the cassandra.yaml
enable_user_defined_functions_threads option remains true; set it to false to use UDFs in GROUP BY
clauses. (DB-672)
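A minimal sketch of the new keyword in a UDF definition (the function is hypothetical, and placing DETERMINISTIC after the RETURNS clause is an assumption about the grammar):
CREATE OR REPLACE FUNCTION ks1.plus_one (input int)
  CALLED ON NULL INPUT
  RETURNS int
  DETERMINISTIC
  LANGUAGE java
  AS 'return input + 1;';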
• Improved architecture with Thread Per Core (TPC) asynchronous read and write paths. (DB-707)
New DSE start-up parameters:
# -Ddse.io.aio.enable
# -Ddse.io.aio.force
# aggregated_request_timeout_in_ms
# streaming_connections_per_host
# key_cache_* settings are no longer used in new SSTable format, but retained to support existing
SSTable format
# Deprecated options:
Deprecated options Replaced with
rpc_address native_transport_address
rpc_interface native_transport_interface
rpc_interface_prefer_ipv6 native_transport_interface_prefer_ipv6
rpc_port native_transport_port
broadcast_rpc_address native_transport_broadcast_address
rpc_keepalive native_transport_keepalive
# batch_size_warn_threshold_in_kb: 64
# column_index_size_in_kb: 16
# memtable_flush_writers: 4
• Authentication and authorization improvements. RLAC (setting row-level permissions) speed is improved.
(DB-909)
• JMX exposed metrics for external dropped messages include COUNTER_MUTATION, MUTATION,
VIEW_MUTATION, RANGE_SLICE, READ, READ_REPAIR, LWT, HINTS, TRUNCATE, SNAPSHOT,
SCHEMA, REPAIR, OTHER. (DB-1127)
• After upgrade is complete and all nodes are on DSE 6.0 and the required schema change occurs,
authorization (CassandraAuthorizer) and audit logging (CassandraAuditWriter) enable the use of new
columns. (DB-1597)
• The DataStax Installer is no longer supported. To upgrade from earlier versions that used the DataStax
Installer, see Upgrading to DSE 6.0 from DataStax Installer installations. For new installations, use a
supported installation method. (DSP-13640)
# Database administrators can manage role permissions without having access to the data. (DB-757)
# Filter rows from system keyspaces and system_schema tables based on user permissions. New
system_keyspaces_filtering option in cassandra.yaml returns information based on user access to
keyspaces. (DB-404)
# New metric for replayed batchlogs and trace-level logging include the age of the replayed batchlog.
(DB-1314)
# Decimals with a scale > 100 are no longer converted to a plain string to prevent
DecimalSerializer.toString() being used as an attack vector. (DB-1848)
# Auditing by role: new dse.yaml audit options included_roles and excluded_roles. (DSP-15733)
• libaio package dependency for DataStax Enterprise 6.0 installations on RHEL-based systems using Yum
and on Debian-based systems using APT install. For optimal performance in tarball installations, DataStax
recommends installing the libaio package. (DSP-14228)
• The default number of threads used by performance objects increased from 1 to 4. Upgrade restrictions
apply. (DSP-14515)
• Support for Thrift-compatible tables (COMPACT STORAGE) is dropped. Before upgrading, migrate all
tables that have COMPACT STORAGE to CQL table format. DSE 6.0 will not start if COMPACT STORAGE
tables are present. See Upgrading from DSE 5.1.x or Upgrading from DSE 5.0.x. (DSP-14839)
• The minimum supported version of Oracle Java SE Runtime Environment 8 (JDK) is 1.8u151. (DSP-14818)
• sstabledump supports the -l option to output each partition as its own JSON object. (DSP-15079)
• Upgrades to OpsCenter 6.5 or later are required before starting DSE 6.0. DataStax recommends upgrading
to the latest OpsCenter version that supports your DSE version. Check the compatibility page for your
products. (DSP-15996)
Resolved issues:
• Add result set metadata to prepared statement MD5 hash calculation. (DB-608)
system.peers:
dse_version text,
graph boolean,
server_id text,
workload text,
workloads frozen<set<text>>
system.local:
dse_version text,
graph boolean,
server_id text,
workload text,
workloads frozen<set<text>>
• Create administrator roles who can carry out everyday administrative tasks without having unnecessary
access to data. (DB-757)
• When repairing Paxos commits, only block on nodes that are being repaired. (DB-761)
• Error in counting iterated SSTables when choosing whether to defrag in timestamp ordered path. (DB-1018)
• Expose ports (storage, native protocol, JMX) in system local and peers tables. (DB-1040)
• Load mapped buffer into physical memory after mlocking it for MemoryOnlyStrategy. (DB-1052)
• Forbid advancing KeyScanningIterator before exhausting or closing the current iterator. (DB-1199)
• New nodetool abortrebuild command stops a currently running rebuild operation. (DB-1234)
• Drop response on view lock acquisition timeout and add ViewLockAcquisitionTimeouts metric. (DB-1522)
• dsetool ring prints ERROR when data_file_directories is removed from cassandra.yaml. (DSP-13547)
• Support for DSE Advanced Replication V1 is removed. For V1 installations, you must first upgrade to DSE
5.1.x and migrate your DSE Advanced Replication to V2, and then upgrade to DSE 6.0. (DSP-13376)
• Enhanced CLI security prevents injection attacks and sanitizes and validates the command line inputs.
(DSP-13682)
Resolved issues:
• Improve logging on unsupported operation failure and remove the failed mutation from replog. (DSP-15043)
• Channel creation fails with NPE when using mixed case destination name. (DSP-15538)
Experimental features. These features are experimental and are not supported for production:
Known issues:
• DSE Analytics: Additional configuration is required when enabling context-per-jvm in the Spark Jobserver.
(DSP-15163)
• Previously deprecated environment variables, including SPARK_CLASSPATH, are removed in Spark 2.2.0.
(DSP-8379)
• AlwaysOn SQL service, a HA (highly available) Spark SQL Thrift server. (DSP-10996)
# The spark_config_settings and hive_config_settings are removed from dse.yaml. The configuration is
provided in the spark-alwayson-sql.conf file in DSEHOME/resources/spark/conf with the same default
contents as DSEHOME/resources/spark/conf/spark-defaults.conf. (DSP-15837)
• Cassandra File System (CFS) is removed. Use DSEFS instead. Before upgrading to DSE 6.0, remove CFS
keyspaces. See the From CFS to DSEFS dev blog post. (DSP-12470)
• Authenticate JDBC users to Spark SQL Thrift Server. Queries that are executed during JDBC session are
run as the user who authenticated through JDBC. (DSP-13395)
• Encryption for data stored on the server and encryption of Spark spill files is supported. (DSP-13841)
• Spark local applications no longer use /var/lib/spark/rdd; instead, they configure and use the .spark directory for
processes started by the user. (DSP-14380)
• Input metrics are not thread-safe and are not used properly in CassandraJoinRDD and
CassandraLeftJoinRDD. (DSP-14569)
• AlwaysOn SQL workpool option adds high availability (HA) for the JDBC or ODBC connections for analytics
node. (DSP-14719)
• CFS is removed. Before upgrade, move HiveMetaStore from CFS to DSEFS and update URL references.
(DSP-14831)
• Include SPARK-21494 to use correct app id when authenticating to external service. (DSP-14140)
• Upgrade to DSE 6.0 must be complete on all nodes in the cluster before Spark Worker and Spark Master
will start. (DSP-14735)
# All Spark-related parameters are now camelCase and are case-sensitive. The snake_case versions
are automatically translated to the camelCase versions except when the parameters are used as table
options. In SparkSQL and with spark.read.options(...), the parameters are case-insensitive because of the
internal SQL implementation.
• Use NodeSync (continuous repair) and LOCAL_QUORUM for reading from Spark recovery storage.
(DSP-15219)
Supporting changes:
# Spark Master will not start until LOCAL_QUORUM is achieved for dse_analytics keyspace.
# Spark Master recovery data is first updated with LOCAL_QUORUM and, if that fails, with
LOCAL_ONE. Recovery data is always queried with LOCAL_QUORUM (unlike previous versions of
DSE, which used LOCAL_ONE).
DataStax strongly recommends enabling NodeSync for continuous repair on all tables in the
dse_analytics keyspace. NodeSync is required on the rm_shared_data keyspace that stores Spark
recovery information.
Resolved issues:
• DSE does not work with Spark Crypto based encryption. (DSP-14140)
6.0.0 DSEFS
• Improved authorization security sets the default permission to 755 for directories and 644 for files. New
DSEFS clusters create the root directory / with 755 permission to prevent non-super users from modifying
root content; for example, by using mkdir or put commands. (DSP-13609)
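For example, to inspect and tighten these defaults from the DSEFS shell (the path and mode are illustrative):
$ dse fs "ls -l /"
$ dse fs "chmod -R 750 /mydata"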
• New tool to move hive metastore from CFS to DSEFS and update references.
Known issues:
• Dropping a property of vertex label with materialized view (MV) indices breaks graph. To drop a property
key for a vertex label that has a materialized view index, additional steps are required to prevent data loss or
cluster errors. See Dropping graph schema. (DSP-15532)
• Secondary indexes used for DSE Graph queries have higher latency in DSE 6.0 than in the previous
version. (DB-1928)
• Backup snapshots taken with OpsCenter 6.1 will not load to DSE 6.0. Use the backup service in OpsCenter
6.5 or later. (DSP-15922)
# Standard vertex IDs are deprecated. Use custom vertex IDs instead. (DSP-13485)
• Schema API changes: all .remove() methods are renamed to .drop() and schema.clear() is renamed to
schema.drop(). Schema API supports removing vertex/edge labels and property keys. Unify use of drop |
remove | clear in the Schema API and use .drop() everywhere. (DSP-8385, DSP-14150)
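For example, with a hypothetical vertex label:
// old
schema.vertexLabel('person').remove()
schema.clear()
// new
schema.vertexLabel('person').drop()
schema.drop()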
• Include materialized view (MV) indexes in query optimizer only if the MV was fully built. (DSP-10219)
• Improve Graph OLAP performance by smart routing query to DseGraphFrame engine with
DseGraphFrameInterceptorStrategy. (DSP-13489)
• Graph online analytical processing (OLAP) supports drop() with DseGraphFrame interceptor. Simple queries
can be used in drop operations. (DSP-13998)
• DSE Graph vertex and edge tables are accessible from SparkSQL and are automatically exposed in the
dse_graph SparkSQL database. (DSP-12046)
• More Gremlin APIs are supported in DSEGraphFrames: dedup, sort, limit, filter, as()/select(), or().
(DSP-13649)
• Some graph and gremlin_server properties in earlier versions of DSE are no longer required for DSE 6.0.
The default settings from the earlier versions of dse.yaml are preserved. These settings were removed from
dse.yaml.
# adjacency_cache_clean_rate
# adjacency_cache_max_entry_size_in_mb
# adjacency_cache_size_in_mb
# gremlin_server_enabled
# index_cache_clean_rate
# index_cache_max_entry_size_in_mb
# window_size
If these properties exist in the dse.yaml file after upgrading to DSE 6.0, logs display warnings. You can
ignore these warnings or modify dse.yaml so that only the required graph system level and gremlin_server
properties are present. (DSP-14308)
• Spark Jobserver is the DSE custom version 0.8.0.44. Applications must use the compatible Spark Jobserver
API in DataStax repository. (DSP-14152)
• Edge label names and property key names allow only [a-zA-Z0-9], underscore, hyphen, and period. The
string formatting for vertices with text custom IDs has changed. (DSP-14710)
Supporting changes (DSP-15167):
# In-place upgrades allow existing schemas with invalid edge label names and property key names.
• Invoking toString on a custom vertex ID containing a text property, or on an edge ID that is incident upon a
vertex with a custom vertex ID, now returns a value that encloses the text property value in double quotation
marks and escapes the value's internal double-quotes. This change protects older formats from irresolvable
parsing ambiguity. For example:
// old
{~label=v, x=foo}
{~label=w, x=a"b}
// new
{~label=v, x="foo"}
{~label=w, x="a""b"}
• Support for math()-step (math) to enable scientific calculator functionality within Gremlin. (DSP-14786)
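For example, Gremlin traversals using math(), where _ refers to the incoming numeric traverser (the graph content is hypothetical):
g.V().hasLabel('person').values('age').math('_ + 1')
g.V().as('a').out('knows').as('b').math('a + b').by('age')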
• The GraphQueryThreads JMX attribute has been removed. Thread selection occurs with Thread Per Core
(TPC) asynchronous request processing architecture. (DSP-15222)
Resolved issues:
• Intermittent KryoException: Buffer underflow error when running order by query in OLTP mode.
(DSP-12694)
• The DseGraphFrames properties().count() step returns the vertex count instead of the multi-property count.
(DSP-15049)
• GraphSON parsing error prevents proper type detection under certain conditions. (DSP-14066)
Experimental features. These features are experimental and are not supported for production:
Known issues:
• Search index TTL Expiration thread loops without effect with live indexing (RT indexing). (DSP-16038)
• DSE Search is very IO intensive. Performance is impacted by the Thread Per Core (TPC) asynchronous
read and write paths architecture. (DB-707)
Before using DSE Search in DSE 6.0 and later, review and follow the DataStax recommendations:
# On search nodes, change the tpc_cores value from its default to the number of physical CPUs. Refer
to Tuning TPC cores.
# Disable AIO and set the file_cache_size_in_mb value to 512. Refer to Disabling AIO.
# Locate DSE Cassandra transactional data and Solr-based DSE Search data on separate Solid State
Drives (SSDs). Refer to Set the location of search indexes.
# Plan for sufficient memory resources and disk space to meet operational requirements. Refer to
Capacity planning for DSE Search.
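A sketch of those settings on a hypothetical search node with 16 physical CPUs, using the start-up parameter listed earlier in these notes:
# cassandra.yaml
tpc_cores: 16
file_cache_size_in_mb: 512
# jvm.options: disable AIO
-Ddse.io.aio.enable=false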
• Writes are flushed to disk in segments that use a new Lucene codec that does not exist in earlier versions.
Unique key values are no longer stored as both docValues and Lucene stored fields. The unique key values
are now stored only as docValues in a new codec to store managed fields like Lucene. Downgrades to
versions earlier than DSE 6.0 are not supported. (DSP-8465)
• Document inserts and updates using HTTP are removed. Before upgrading, ensure you are using CQL for
all inserts and updates. (DSP-9725).
• The <dataDir> parameter in the solrconfig.xml file is not supported. Instead, follow the steps in Set the
location of search indexes. (DSP-13199)
• Improved performance by early termination of sorting. Ideal for queries that need only a few results returned,
from a large number of total matches. (DSP-13253)
# The default for CQL text type changed from solr.TextField to solr.StrField.
• Delete by id is removed. Delete by query no longer accepts wildcard queries, including queries that match
all documents (for example, <delete><query>*:*</query></delete>). Instead, use CQL to DELETE by
Primary Key or the TRUNCATE command. (DSP-13436)
# RAM buffer size settings are no longer required in search index config. Global RAM buffer usage in
Lucene is governed by the memtable size limits in cassandra.yaml. RAM buffers are counted toward the
memtable_heap_space_in_mb.
• The HTTP API for Solr core management is removed. Instead, use CQL commands for search index
management or dsetool search index commands. (DSP-13530)
• The Tika functionality bundled with Apache Solr is removed. Instead, use the stand-alone Apache Tika
project. (DSP-13892)
# The solrvalidation.log is removed. You can safely remove appender SolrValidationErrorAppender and
the logger SolrValidationErrorLogger from logback.xml. Indexing errors manifest as:
# failures at the coordinator if they represent failures that might succeed at some later point in time
using the hint replay mechanism
# as messages in the system.log if the failures are due to non-recoverable indexing validation errors
(for data that is written to the database, but not indexed properly)
• The DSE custom update request processor (URP) implementation is deprecated. Use the field input/output
(FIT) transformer API instead. (DSP-14360)
• The stored flag in search index schemas is deprecated and is no longer added to auto-generated schemas.
If the flag exists in custom schemas, it is ignored. (DSP-14425)
# Indexing is no longer asynchronous. Document updates are written to the Lucene RAM buffer
synchronously with the mutation backing table.
# enable_back_pressure_adaptive_nrt_commit
# max_solr_concurrency_per_core
# solr_indexing_error_log_options
• StallMetrics MBean is removed. Before upgrading to DSE 6.0, change operators that use the MBean.
(DSP-14860)
• Optimize Paging when limit is smaller than the page size. (DSP-15207)
Resolved issues include all bug fixes up to DSE 5.1.8. Additional 6.0.0 fixes:
• For use with DSE 6.0.x, DataStax Studio 6.0.0 is installed as a standalone tool. (DSP-13999, DSP-15623)
• DataStax Bulk Loader (dsbulk) version 1.0.1 is automatically installed with DataStax Enterprise 6.0.0, and
can also be installed as a standalone tool. (DSP-13999, DSP-15623)
• Fix updating base table rows with TTL not removing view entries (CASSANDRA-14071)
• RPM package spec: fix permissions for installed jars and config files (CASSANDRA-14181)
• Gossip thread slows down when using batch commit log (CASSANDRA-12966)
• Avoid reading static row twice from old format sstables (CASSANDRA-13236)
• Upgrade netty version to fix memory leak with client encryption (CASSANDRA-13114)
• Add result set metadata to prepared statement MD5 hash calculation (CASSANDRA-10786)
• Add incremental repair support for --hosts, --force, and subrange repair (CASSANDRA-13818)
• Add additional unit tests for batch behavior, TTLs, Timestamps (CASSANDRA-13846)
• Emit metrics whenever we hit tombstone failures and warn thresholds (CASSANDRA-13771)
• Allow changing log levels via nodetool for related classes (CASSANDRA-12696)
• Reduce memory copies and object creations when acting on ByteBufs (CASSANDRA-13789)
• Don't delete incremental repair sessions if they still have sstables (CASSANDRA-13758)
• Support for migrating legacy users to roles has been dropped (CASSANDRA-13371)
• Don't add localhost to the graph when calculating where to stream from (CASSANDRA-13583)
• Change the accessibility of RowCacheSerializer for third party row cache plugins (CASSANDRA-13579)
• Fix incorrect cqlsh results when selecting same columns multiple times (CASSANDRA-13262)
• Change protocol to allow sending key space independent of query string (CASSANDRA-10145)
• Take number of files in L0 in account when estimating remaining compaction tasks (CASSANDRA-13354)
• Skip building views during base table streams on range movements (CASSANDRA-13065)
• Improve error messages for +/- operations on maps and tuples (CASSANDRA-13197)
• Make it possible to monitor an ideal consistency level separate from actual consistency level
(CASSANDRA-13289)
• Use new token allocation for non bootstrap case as well (CASSANDRA-13080)
• Require forceful decommission if number of nodes is less than replication factor (CASSANDRA-12510)
• Nodetool repair can hang forever if we lose the notification for the repair completing/failing
(CASSANDRA-13480)
• Fixed a bug in NumberHelper that led to wrong min/max results if numbers exceeded the Integer limits.
(TINKERPOP-1873)
• Improved error messaging for failed serialization and deserialization of request/response messages.
• Fixed bug in handling of Direction.BOTH in Messenger implementations to pass the message to the
opposite side of the StarGraph in VertexPrograms for OLAP traversals. (TINKERPOP-1862)
• Fixed a bug in Gremlin Console which prevented handling of gremlin.sh flags that had an equal sign (=)
between the flag and its arguments. (TINKERPOP-1879)
• Fixed a bug where SparkMessenger was not applying the edgeFunction from MessageScope in
VertexPrograms for OLAP-based traversals. (TINKERPOP-1872)
• TinkerPop drivers prior to 3.2.4 won't authenticate with Kerberos anymore. A long-deprecated option on the
Gremlin Server protocol was removed.
• Can unload data from any Cassandra 2.1 or later data source
Chapter 3. Installing DataStax Enterprise 6.0
Installation information is located in the Installation Guide.
Chapter 4. Configuration
Depending on your environment, some of the following settings might not persist after reboot. Check with your
system administrator to ensure these settings are viable for your environment.
Use the Preflight check tool to run a collection of tests on a DSE node to detect and fix node configurations. The
tool can detect and optionally fix many invalid or suboptimal configuration settings, such as user resource limits,
swap, and disk settings.
Configure the chunk cache
Beginning in DataStax Enterprise (DSE) 6.0, the amount of native memory used by the DSE process has
increased significantly.
The main reason for this increase is the chunk cache (or file cache), which is like an OS page cache. The
following sections provide additional information:
• See Chunk cache history for a historical description of the chunk cache, and how it is calculated in DSE 6.0
and later.
• See Chunk cache differences from OS page cache to understand key differences between the chunk cache
and the OS page cache.
Consider the following recommendations depending on workload type for your cluster.
DSE recommendations
Regarding DSE, consider the following recommendations when choosing the max direct memory and file cache
size:
• Adequate memory for native raw memory (such as bloom filters and off-heap memtables)
For 64 GB servers, the default settings are typically adequate. For larger servers, increase the max direct
memory (-XX:MaxDirectMemorySize), but leave approximately 15-20% of memory for the OS and other in-
memory structures. The file cache size will be set automatically to half of that. This setting is acceptable, but the
size could be increased gradually if the cache hit rate is too low and there is still available memory on the server.
Disabling asynchronous I/O (AIO) and explicitly setting the chunk cache size (file_cache_size_in_mb) improves
performance for most DSE Search workloads. When enforced, SSTables and Lucene segments, as well as other
minor off-heap elements, will reside in the OS page cache and be managed by the kernel.
A potentially negative impact of disabling AIO might be measurably higher read latency when DSE goes to disk,
in cases where the dataset is larger than available memory.
To disable AIO and set the chunk cache size, see Disable AIO.
DSE Analytics relies heavily on memory for performance. Because Apache Spark™ effectively manages its own
memory through the Apache Spark application settings, you must determine how much memory the Apache
Spark application receives. Therefore, you must think about how much memory to allocate to the chunk cache
versus how much memory to allocate for Apache Spark applications. Similar to DSE Search, you can disable
AIO and lower the chunk cache size to provide Apache Spark with more memory.
Because DSE Graph heavily relies on several different workloads, it’s important to follow the previous
recommendations for the specific workload. If you use DSE Search or DSE Analytics with DSE Graph, lower the
chunk cache and disable AIO for the best performance. If you use DSE Graph only on top of Apache Cassandra,
increase the chunk cache gradually, leaving 15-20% of memory available for other processes.
There are several differences between the chunk cache and the OS page cache, and a full description is outside
the scope of this information. However, the following differences are relevant to DSE:
• Because the OS page cache is sized dynamically by the operating system, it can grow and shrink depending
on the available server memory. The chunk cache must be sized statically.
If the chunk cache is too small, the available server memory will be unused. For servers with large amounts
of memory (50 GB or more), the memory is wasted. If the chunk cache is too large, the available memory on
the server can reduce enough that the OS will kill the DSE process to avoid an out of memory issue.
At the time of writing, the size of the chunk cache cannot be changed dynamically; to change it, the DSE
process must be restarted.
• Restarting the DSE process will destroy the chunk cache, so each time the process is restarted, the chunk
cache will be cold. The OS page cache only becomes cold after a server restart.
• The memory used by the file cache is part of the DSE process memory, and is therefore seen by the OS as
user memory. However, the OS page cache memory is seen as buffer memory.
• The chunk cache uses mostly NIO direct memory, storing file chunks into NIO byte buffers. However, NIO
does have an on-heap footprint, which DataStax is working to reduce.
Chunk cache history
The chunk cache is not new to Apache Cassandra, and was originally intended to cache small parts (chunks) of SSTable files to make read operations faster. However, until DSE 5.1 the default file access mode was memory mapped, so the chunk cache had a secondary role and its size was limited to 512 MB.
The default setting of 512 MB was configured by the file_cache_size_in_mb parameter in cassandra.yaml.
In DSE 6.0 and later, the chunk cache has increased relevance, not just because it replaces the OS page cache
for database read operations, but because it is a central component of the asynchronous thread-per-core (TPC)
architecture.
By default, the chunk cache is configured to use one-half (½) of the max direct memory for the DSE process.
The max direct memory is calculated as one-half (½) of the system memory minus the JVM heap size:
max direct memory = (system memory - JVM heap size) / 2
You can explicitly configure the max direct memory by setting the JVM MaxDirectMemorySize (-
XX:MaxDirectMemorySize) parameter. See increasing the max direct memory. Alternatively, you can override
the max direct memory setting by explicitly configuring the file_cache_size_in_mb parameter in cassandra.yaml.
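As a worked example, on a 64 GB server with a 24 GB JVM heap:
max direct memory = (64 GB - 24 GB) / 2 = 20 GB
file cache (chunk cache) = 20 GB / 2 = 10 GB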
Install the latest Java Virtual Machine
Configure your operating system to use the latest build of a Technology Compatibility Kit (TCK) Certified
OpenJDK version 8. For example, OpenJDK 8 (1.8.0_151 minimum). Java 9 is not supported.
Although Oracle JRE/JDK 8 is supported, DataStax does more extensive testing on OpenJDK 8. This change
is due to the end of public updates for Oracle JRE/JDK 8.
Synchronize clocks
Use Network Time Protocol (NTP) to synchronize the clocks on all nodes and application servers.
Synchronizing clocks is required because DataStax Enterprise (DSE) overwrites a column only if there is another version whose timestamp is more recent; when clocks differ between machines in different locations, writes can be applied in the wrong order.
DSE timestamps are encoded as microseconds since the UNIX Epoch, which does not include timezone information. The timestamp for all writes in DSE is Coordinated Universal Time (UTC). DataStax recommends converting to local time only when generating output to be read by humans.
To install NTP on RHEL-based systems:
$ sudo yum install ntpdate
On RHEL 7 and later, chrony is the default network time protocol daemon. The configuration file for chrony is located in /etc/chrony.conf on these systems.
To verify that the clock is synchronized, run:
$ ntpstat
Run the following command to view all current Linux kernel settings:
$ sudo sysctl -a
TCP settings
During low traffic intervals, a firewall configured with an idle connection timeout can close connections to local nodes and nodes in other data centers. To prevent connections between nodes from timing out, complete the following steps.
1. Set the following network kernel settings:
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
These values set the TCP keepalive timeout to 60 seconds with 3 probes, 10 seconds gap between each. The settings detect dead TCP connections after 90 seconds (60 + 10 + 10 + 10). The additional traffic is negligible, and leaving these settings in place permanently is not an issue. See Firewall idle connection timeout causes nodes to lose communication during low traffic times on Linux.
2. Change the following settings to handle thousands of concurrent connections used by the database:
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
Instead of changing the system TCP settings, you can prevent reset connections during streaming by tuning the streaming_keep_alive_period_in_secs setting in cassandra.yaml.
Set user resource limits
1. Edit the /etc/pam.d/su file and uncomment the following line to enable the pam_limits.so module:
session    required   pam_limits.so
This change to the PAM configuration file ensures that the system reads the files in the /etc/security/limits.d directory.
2. If you run DSE as root, some Linux distributions (such as Ubuntu) require setting the limits for the root user explicitly instead of using cassandra_user.
3. Configure the following settings for the <cassandra_user> in the configuration file:
4. On all systems, set the maximum number of memory map areas in /etc/sysctl.conf:
vm.max_map_count = 1048575
5. Reboot the server or run the following command to make all changes take effect:
$ sudo sysctl -p
6. To confirm the user limits are applied to the DSE process, run the following command, where pid is the process ID of the currently running DSE process:
$ cat /proc/pid/limits
Do not use governors that lower the CPU frequency. To ensure optimal performance, reconfigure all CPUs to use the performance governor, which locks the frequency at maximum. The performance governor does not switch frequencies, which means that power savings are bypassed and the CPU always runs at maximum throughput. On most systems, the governor is set by writing performance to each CPU's cpufreq control file in sysfs; if this directory does not exist on your system, refer to the documentation for your operating system.
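A commonly used form of this command, assuming the standard sysfs cpufreq layout, is:
$ for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done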
For more information, see High server load and latency when CPU frequency scaling is enabled in the DataStax
Help Center.
Disable zone_reclaim_mode on NUMA systems
The Linux kernel can be inconsistent in enabling/disabling zone_reclaim_mode, which can result in odd
performance problems.
To ensure that zone_reclaim_mode is disabled, set it to zero.
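For example, using the standard kernel interface (add vm.zone_reclaim_mode = 0 to /etc/sysctl.conf to persist the setting across reboots):
$ echo 0 > /proc/sys/vm/zone_reclaim_mode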
For more information, see Peculiar Linux kernel performance problem on NUMA systems.
Disable swap
Failure to disable swap entirely can severely lower performance. Because the database has multiple replicas
and transparent failover, it is preferable for a replica to be killed immediately when memory is low rather than
go into swap. This allows traffic to be immediately redirected to a functioning replica instead of continuing to
hit the replica that has high latency due to swapping. If your system has a lot of DRAM, swapping still lowers
performance significantly because the OS swaps out executable code so that more DRAM is available for
caching disks.
If you insist on using swap, you can set vm.swappiness=1. This allows the kernel to swap out only the absolutely least-used parts of memory.
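To turn swap off immediately on a running node, use the standard Linux command:
$ sudo swapoff --all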
To make this change permanent, remove all swap file entries from /etc/fstab.
For more information, see Nodes seem to freeze after some period of time.
Complete the optimization settings for either SSDs or spinning disks. Do not complete both procedures for either storage type.
Optimize SSDs
Complete the following steps to ensure the best settings for SSDs.
1. Ensure that the SysFS rotational flag is set to false (zero):
$ echo 0 > /sys/class/block/sda/queue/rotational
2. Apply the same rotational flag setting for any block devices created from SSD storage, such as mdarrays.
3. Determine your devices:
$ lsblk
4. Set the IO scheduler to either deadline or noop for each of the listed devices, where device_name is the name of the device you want to apply settings for. For example:
$ echo deadline > /sys/block/device_name/queue/scheduler
• The deadline scheduler optimizes requests to minimize IO latency. If in doubt, use the deadline scheduler.
• The noop scheduler is the right choice when the target block device is an array of SSDs behind a high-end IO controller that performs IO optimization.
5. Set the nr_requests value to indicate the maximum number of read and write requests that can be queued:
Machine size | Value
6. Set the readahead value for the block device to 8 KB:
$ echo 8 > /sys/class/block/sda/queue/read_ahead_kb
The recommended readahead setting for RAID on SSDs is the same as that for SSDs that are not being used in a RAID installation.
To make these settings persistent across reboots, add the commands to an init script such as /etc/rc.local:
touch /var/lock/subsys/local
echo 0 > /sys/class/block/sda/queue/rotational
echo 8 > /sys/class/block/sda/queue/read_ahead_kb
Heap size is usually between ¼ and ½ of system memory. Do not devote all memory to the heap, because memory is also needed for the off-heap cache and the file system cache.
See Tuning Java Virtual Machine for more information on tuning the Java Virtual Machine (JVM).
If you want to use Concurrent-Mark-Sweep (CMS) garbage collection, contact the DataStax Services team for
configuration help. Tuning Java resources provides details on circumstances where CMS is recommended,
though using CMS requires time, expertise, and repeated testing to achieve optimal results.
The easiest way to determine the optimum heap size for your environment is:
1. Set the MAX_HEAP_SIZE in the jvm.options file to a high arbitrary value on a single node.
2. Run a load that is representative of your production workload and observe the node's actual heap usage.
3. Use the observed value for setting the heap size in the cluster.
This method decreases performance for the test node, but generally does not significantly reduce cluster performance.
If you don't see improved performance, contact the DataStax Services team for additional help in tuning the JVM.
Check Java Hugepages settings
Many modern Linux distributions ship with the Transparent Hugepages feature enabled by default. When Linux uses Transparent Hugepages, the kernel tries to allocate memory in large chunks (usually 2 MB), rather than in 4 KB pages. This allocation can improve performance by reducing the number of pages the CPU must track. However, some
applications still allocate memory based on 4 KB pages, which can cause noticeable performance problems when Linux tries to defragment 2 MB pages.
For more information, see the Cassandra Java Huge Pages blog and this RedHat bug report.
To solve this problem, disable defrag for Transparent Hugepages.
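On most Linux distributions, this is done by writing never to the kernel's THP defrag control file (the exact path can vary by kernel version):
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag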
For more information, including a temporary fix, see No DSE processing but high CPU usage.
cassandra.yaml
After changing properties in the cassandra.yaml file, you must restart the node for the changes to take effect.
Syntax
For the properties in each section, the parent setting has zero spaces. Each child entry requires at least two
spaces. Adhere to the YAML syntax and retain the spacing.
• Default values that are not defined are shown as Default: none.
Organization
The configuration properties are grouped into the following sections:
• Quick start
The minimal properties needed for configuring a cluster.
• Default directories
If you have changed any of the default directories during installation, set these properties to the new
locations. Make sure you have root access.
• Commonly used
Properties most frequently used when configuring DataStax Enterprise.
• Performance tuning
Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O,
CPU, reads, and writes.
• Advanced
Properties for advanced users or properties that are less commonly used.
• Security
• Continuous paging options
Properties that configure memory, threads, and duration when pushing pages continuously to the client.
Quick start properties
cluster_name
The name of the cluster. This setting prevents nodes in one logical cluster from joining another. All
nodes in a cluster must have the same value.
Default: 'Test Cluster'
listen_address
The IP address or hostname that the database binds to for connecting this node to other nodes.
Default: localhost
listen_interface
The interface that the database binds to for connecting to other nodes. Interfaces must correspond to a
single address. IP aliasing is not supported.
Set listen_address or listen_interface, not both.
Default: commented out (wlan0)
listen_interface_prefer_ipv6
Use IPv4 or IPv6 when interface is specified by name.
When only a single address is used, that address is selected without regard to this setting.
Default: commented out (false)
Default directories
data_file_directories:
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
cdc_raw_directory: /var/lib/cassandra/cdc_raw
hints_directory: /var/lib/cassandra/hints
saved_caches_directory: /var/lib/cassandra/saved_caches
If you have changed any of the default directories during installation, set these properties to the new locations.
Make sure you have root access.
data_file_directories
The directory where table data is stored on disk. The database distributes data evenly across the
location, subject to the granularity of the configured compaction strategy. If not set, the directory is
$DSE_HOME/data/data.
For production, DataStax recommends RAID 0 and SSDs.
Default: - /var/lib/cassandra/data
commitlog_directory
The directory where the commit log is stored. If not set, the directory is $DSE_HOME/data/commitlog.
For optimal write performance, place the commit log on a separate disk partition, or ideally on a
separate physical device from the data file directories. Because the commit log is append only, a hard
disk drive (HDD) is acceptable.
DataStax recommends explicitly setting the location of the DSE Metrics Collector data directory.
When the DSE Metrics Collector is enabled and when the insights_options data dir is not explicitly
set in dse.yaml, the default location of the DSE Metrics Collector data directory is the same directory
as the commitlog directory.
Default: /var/lib/cassandra/commitlog
cdc_raw_directory
The directory where the change data capture (CDC) commit log segments are stored on flush. DataStax
recommends a physical device that is separate from the data directories. If not set, the directory is
$DSE_HOME/data/cdc_raw. See Change Data Capture (CDC) logging.
Default: /var/lib/cassandra/cdc_raw
hints_directory
The directory in which hints are stored. If not set, the directory is $CASSANDRA_HOME/data/hints.
Default: /var/lib/cassandra/hints
saved_caches_directory
The directory location where table key and row caches are stored. If not set, the directory is
$DSE_HOME/data/saved_caches.
Default: /var/lib/cassandra/saved_caches
Commonly used properties
Properties most frequently used when configuring DataStax Enterprise.
Before starting a node for the first time, DataStax recommends that you carefully evaluate your requirements.
commit_failure_policy: stop
prepared_statements_cache_size_mb:
# disk_optimization_strategy: ssd
disk_failure_policy: stop
endpoint_snitch: com.datastax.bdp.snitch.DseSimpleSnitch
seed_provider:
- org.apache.cassandra.locator.SimpleSeedProvider
- seeds: "127.0.0.1"
enable_user_defined_functions: false
enable_scripted_user_defined_functions: false
enable_user_defined_functions_threads: true
commit_failure_policy
Policy for commit disk failures:
• die - Shut down the node and kill the JVM, so the node can be replaced.
• stop - Shut down the node, leaving the node effectively dead, available for inspection using JMX.
• stop_commit - Shut down the commit log, letting writes collect but continuing to service reads.
Default: stop
prepared_statements_cache_size_mb
Maximum size of the native protocol prepared statement cache. Change this value only if there are
more prepared statements than fit in the cache.
Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.
Specifying a value that is too large results in long running GCs and possibly out-of-memory errors.
Keep the value at a small fraction of the heap.
Constantly re-preparing statements is a performance penalty. When not set, the default is automatically calculated as heap / 256 or 10 MB, whichever is greater.
Default: calculated
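As a worked example, with an 8 GB (8192 MB) heap, the calculated default is max(8192 / 256, 10) = 32 MB.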
disk_optimization_strategy
The strategy for optimizing disk reads:
• ssd - Optimize for solid-state disks.
• spinning - Optimize for rotational disks.
Default: commented out (ssd)
disk_failure_policy
Sets how the database responds to disk failure:
• die - Shut down gossip and client transports, and kill the JVM for any file system errors or single
SSTable errors, so the node can be replaced.
• stop_paranoid - Shut down the node, even for single SSTable errors.
• stop - Shut down the node, leaving the node effectively dead, but available for inspection using
JMX.
• best_effort - Stop using the failed disk and respond to requests based on the remaining available
SSTables. This setting allows obsolete data at consistency level of ONE.
• ignore - Ignore fatal errors and lets the requests fail; all file system errors are logged but otherwise
ignored.
Default: stop
endpoint_snitch
The snitch to use for locating nodes and routing requests:
• DseSimpleSnitch
Appropriate only for development deployments. Proximity is determined by DSE workload, which
places transactional, analytics, and search nodes into their separate datacenters. Does not
recognize datacenter or rack information.
• GossipingPropertyFileSnitch
Recommended for production. Reads rack and datacenter for the local node in cassandra-
rackdc.properties file and propagates these values to other nodes via gossip. For migration from
the PropertyFileSnitch, uses the cassandra-topology.properties file if it is present.
• PropertyFileSnitch
Determines proximity by rack and datacenter that are explicitly configured in cassandra-
topology.properties file.
• Ec2Snitch
For EC2 deployments in a single region. Loads region and availability zone information from the
Amazon EC2 API. The region is treated as the datacenter, the availability zone is treated as the
rack, and uses only private IP addresses. For this reason, Ec2Snitch does not work across multiple
regions.
• Ec2MultiRegionSnitch
Uses the public IP as the broadcast_address to allow cross-region connectivity. This means you
must also set seed addresses to the public IP and open the storage_port or ssl_storage_port
on the public IP firewall. For intra-region traffic, the database switches to the private IP after
establishing a connection.
• RackInferringSnitch
Proximity is determined by rack and datacenter, which are assumed to correspond to the 3rd and
2nd octet of each node's IP address, respectively. Best used as an example for writing a custom
snitch class (unless this happens to match your deployment conventions).
• GoogleCloudSnitch
Use for deployments on Google Cloud Platform across one or more regions. The region is
treated as a datacenter and the availability zones are treated as racks within the datacenter. All
communication occurs over private IP addresses within the same logical network.
• CloudstackSnitch
Use the CloudstackSnitch for Apache Cloudstack environments.
See Snitches.
Default: com.datastax.bdp.snitch.DseSimpleSnitch
seed_provider
The addresses of hosts that are designated as contact points in the cluster. A joining node contacts one
of the nodes in the -seeds list to learn the topology of the ring.
Use only seed provider implementations bundled with DSE.
• class_name - The class that handles the seed logic. It can be customized, but this is typically not
required.
Default: org.apache.cassandra.locator.SimpleSeedProvider
• - seeds - A comma delimited list of addresses that are used by gossip for bootstrapping new nodes
joining a cluster. If your cluster includes multiple nodes, you must change the list from the default
value to the IP address of one of the nodes.
Default: "127.0.0.1"
Making every node a seed node is not recommended because of increased maintenance and
reduced gossip performance. Gossip optimization is not critical, but it is recommended to use a
small seed list (approximately three nodes per datacenter).
See Initializing a single datacenter per workload type and Initializing multiple datacenters per
workload type.
Default: org.apache.cassandra.locator.SimpleSeedProvider
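For example, a cluster with three designated seed nodes might use the following (the addresses are illustrative placeholders):
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.1.1,10.10.1.2,10.10.1.3"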
enable_user_defined_functions
Enables user defined functions (UDFs). UDFs present a security risk, since they are executed on the
server side. UDFs are executed in a sandbox to contain the execution of malicious code.
• true - Enabled. Supports Java as the code language. Detect endless loops and unintended memory
leaks.
• false - Disabled.
Default: false
enable_user_defined_functions_threads
Enables asynchronous execution of UDFs in separate threads:
• true - Enabled. Only one instance of a function can run at one time. Asynchronous execution
prevents UDFs from running too long or forever and destabilizing the cluster.
• false - Disabled. Allows multiple instances of the same function to run simultaneously. Required to
use UDFs within GROUP BY clauses.
Disabling asynchronous UDF execution implicitly disables the security manager. You must
monitor the read timeouts for UDFs that run too long or forever, which can cause the cluster to
destabilize.
Default: true
Common compaction settings
compaction_throughput_mb_per_sec: 16
compaction_large_partition_warning_threshold_mb: 100
compaction_throughput_mb_per_sec
The MB per second to throttle compaction for the entire system. The faster the database inserts data, the faster the system must compact in order to keep the SSTable count down.
Default: 16
compaction_large_partition_warning_threshold_mb
The database logs a warning when compacting partitions larger than the set value.
Default: 100
Common memtable settings
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
memtable_heap_space_in_mb
The amount of on-heap memory allocated for memtables. The database uses the total of this amount
and the value of memtable_offheap_space_in_mb to set a threshold for automatic memtable flush.
See memtable_cleanup_threshold and Tuning the Java heap.
Default: calculated 1/4 of heap size (2048)
memtable_offheap_space_in_mb
The amount of off-heap memory allocated for memtables. The database uses the total of this amount
and the value of memtable_heap_space_in_mb to set a threshold for automatic memtable flush.
See memtable_cleanup_threshold and Tuning the Java heap.
Default: calculated 1/4 of heap size (2048)
Common automatic backup settings
incremental_backups: false
snapshot_before_compaction: false
incremental_backups
Enables incremental backups.
• true - Enable incremental backups to create a hard link to each SSTable flushed or streamed
locally in a backups subdirectory of the keyspace data. Incremental backups enable storing
backups off site without transferring entire snapshots.
The database does not automatically clear incremental backup files. DataStax recommends
setting up a process to clear incremental backup hard links each time a new snapshot is
created.
• false - Disabled.
Default: false
Commit log settings
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# commitlog_sync_group_window_in_ms: 1000
# commitlog_sync_batch_window_in_ms: 2 //deprecated
commitlog_segment_size_in_mb: 32
# commitlog_total_space_in_mb: 8192
# commitlog_compression:
# - class_name: LZ4Compressor
# parameters:
# -
commitlog_sync
Commit log synchronization method:
• periodic - Send ACK signal for writes immediately. Commit log is synced every
commitlog_sync_period_in_ms.
• group - Send ACK signal for writes after the commit log has been flushed to disk. Wait up to
commitlog_sync_group_window_in_ms between flushes.
• batch - Send ACK signal for writes after the commit log has been flushed to disk. Each incoming
write triggers the flush task.
Default: periodic
commitlog_sync_period_in_ms
Use with commitlog_sync: periodic. Time interval between syncing the commit log to disk. Periodic
syncs are acknowledged immediately.
Default: 10000
commitlog_sync_group_window_in_ms
Use with commitlog_sync: group. The time that the database waits between flushing the commit log
to disk. DataStax recommends using group instead of batch.
Default: commented out (1000)
commitlog_sync_batch_window_in_ms
Deprecated. Use with commitlog_sync: batch. The maximum length of time that queries may be
batched together.
Default: commented out (2)
commitlog_segment_size_in_mb
The size of an individual commitlog file segment. A commitlog segment may be archived, deleted, or
recycled after all its data has been flushed to SSTables. This data can potentially include commitlog
segments from every table in the system. The default size is usually suitable, but for commitlog
archiving you might want a finer granularity; 8 or 16 MB is reasonable.
If you set max_mutation_size_in_kb explicitly, then you must set commitlog_segment_size_in_mb to:
2 * max_mutation_size_in_kb / 1024
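As a worked example, if max_mutation_size_in_kb is set to 32768 (32 MB), then commitlog_segment_size_in_mb must be set to 2 * 32768 / 1024 = 64.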
Default: 32
max_mutation_size_in_kb
The maximum size of a mutation before the mutation is rejected. Before increasing the commitlog segment size, investigate why the mutations are larger than expected. Look for underlying issues with access patterns and data model, because increasing the commitlog segment size is a limited fix. When not set, the default is calculated as (commitlog_segment_size_in_mb * 1024) / 2.
Default: calculated
commitlog_total_space_in_mb
Disk usage threshold for commit logs before triggering the database flushing memtables to disk. If the
total space used by all commit logs exceeds this threshold, the database flushes memtables to disk for
the oldest commitlog segments to reclaim disk space by removing those log segments from the commit
log. This flushing reduces the amount of data to replay on start-up, and prevents infrequently updated
tables from keeping commitlog segments indefinitely. If the commitlog_total_space_in_mb is small,
the result is more flush activity on less-active tables.
See Configuring memtable thresholds.
Default for 64-bit JVMs: calculated (8192, or 25% of the total space of the commit log volume, whichever is smaller)
Default for 32-bit JVMs: calculated (32, or 25% of the total space of the commit log volume, whichever is smaller)
commitlog_compression
The compressor to use if commit log is compressed. To make changes, uncomment the
commitlog_compression section and these options:
# commitlog_compression:
# - class_name: LZ4Compressor
# parameters:
# -
When not set, the default compression for the commit log is uncompressed.
Default: commented out
Lightweight transactions (LWT) settings
concurrent_lw_transactions
Maximum number of permitted concurrent lightweight transactions (LWT).
• A higher number might improve throughput if non-contending LWTs are in heavy use, but will use
more memory and might be less successful with contention.
• When not set, the default value is 8x the number of TPC cores. This default value is appropriate for
most environments.
Change Data Capture (CDC) settings
cdc_enabled: false
cdc_total_space_in_mb: 4096
cdc_free_space_check_interval_ms: 250
cdc_enabled
Enables change data capture (CDC):
• true - Use CDC functionality to reject mutations that contain a CDC-enabled table when the space limit threshold in cdc_raw_directory (cdc_total_space_in_mb) is reached.
Default: false
cdc_total_space_in_mb
Total space to use for change-data-capture (CDC) logs on disk. If space allocated for CDC exceeds
this value, the database throws WriteTimeoutException on mutations, including CDC-enabled tables.
A CDCCompactor (a consumer) is responsible for parsing the raw CDC logs and deleting them when
parsing is completed.
Default: calculated (4096 or 1/8th of the total space of the drive where the cdc_raw_directory resides)
cdc_free_space_check_interval_ms
Interval between checks for new available space for CDC-tracked tables when the
cdc_total_space_in_mb threshold is reached and the CDCCompactor is running behind or experiencing
back pressure. When not set, the default is 250.
Default: commented out (250)
Compaction settings
#concurrent_compactors: 1
# concurrent_validations: 0
concurrent_materialized_view_builders: 2
sstable_preemptive_open_interval_in_mb: 50
# pick_level_on_streaming: false
See also compaction_throughput_mb_per_sec in the common compaction settings section and Configuring
compaction.
concurrent_compactors
The number of concurrent compaction processes allowed to run simultaneously on a node, not
including validation compactions for anti-entropy repair. Simultaneous compactions help preserve
read performance in a mixed read-write workload by limiting the number of small SSTables that
accumulate during a single long-running compaction. If your data directories are backed by SSDs, increase this value to the number of cores. If compaction runs too slowly or too quickly, adjust compaction_throughput_mb_per_sec first.
Increasing concurrent compactors leads to more use of available disk space for compaction,
because concurrent compactions happen in parallel, especially for STCS. Ensure that adequate disk
space is available before increasing this configuration.
Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.
Default: calculated (the smaller of the number of disks or the number of cores, with a minimum of 2 and a maximum of 8)
concurrent_validations
Number of simultaneous repair validations to allow. When not set, the default is unbounded. Values less
than one are interpreted as unbounded.
Default: commented out (0) unbounded
concurrent_materialized_view_builders
Number of simultaneous materialized view builder tasks allowed to run concurrently. When a view
is created, the node ranges are split into (num_processors * 4) builder tasks and submitted to this
executor.
Default: 2
sstable_preemptive_open_interval_in_mb
The size of the SSTables to trigger preemptive opens. The compaction process opens SSTables before
they are completely written and uses them in place of the prior SSTables for any range previously
written. This process helps to smoothly transfer reads between the SSTables by reducing cache churn
and keeps hot rows hot.
A low value has a negative performance impact and will eventually cause heap pressure and GC
activity. The optimal value depends on hardware and workload.
Default: 50
pick_level_on_streaming
The compaction level for streamed-in SSTables.
• true - streamed-in SSTables of tables using LeveledCompactionStrategy (LCS) are placed on the
same level as the source node. For operational tasks like nodetool refresh or replacing a node, true
improves performance for compaction work.
Default: commented out (false)
Memtable settings
memtable_allocation_type: heap_buffers
# memtable_cleanup_threshold: 0.34
memtable_flush_writers: 4
memtable_allocation_type
The method the database uses to allocate and manage memtable memory.
Default: heap_buffers
memtable_cleanup_threshold
Ratio used for automatic memtable flush.
Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.
When not set, the calculated default is 1/(memtable_flush_writers + 1)
Default: commented out (0.34)
memtable_flush_writers
The number of memtable flush writer threads per disk, which together determine the total number of memtables that can be flushed concurrently; flushing is generally a combination of compute- and I/O-bound work. Memtable flushing is more CPU efficient than memtable ingest, and a single thread can keep up with the ingest rate of a server on a single fast disk until the server temporarily becomes I/O bound under contention, typically with compaction. Generally, the default value is appropriate and does not need adjusting for SSDs. For HDDs, the recommended value is 2.
Default for SSDs: 4
Cache and index settings
column_index_size_in_kb: 16
# file_cache_size_in_mb: 4096
# direct_reads_size_in_mb: 128
column_index_size_in_kb
Granularity of the index of rows within a partition. For huge rows, decrease this setting to improve seek
time. Lower density nodes might benefit from decreasing this value to 4, 2, or 1.
Default: 16
file_cache_size_in_mb
DSE 6.0.0-6.0.6: Maximum memory for buffer pooling and the SSTable chunk cache. 32 MB is reserved for pooling buffers; the remaining memory is the cache for holding recent or frequently used index pages and uncompressed SSTable chunks. This pool is allocated off-heap, in addition to the memory allocated for the heap. Memory is allocated only when needed.
DSE 6.0.7 and later: The buffer pool is split into two pools; this setting defines the maximum memory for file buffers that are stored in the file cache, also known as the chunk cache. Memory is allocated only when needed but is not released. The other buffer pool is direct_reads_size_in_mb.
See Tuning Java Virtual Machine.
Default: calculated (0.5 of -XX:MaxDirectMemorySize)
direct_reads_size_in_mb
DSE 6.0.7 and later: The buffer pool is split into two pools; this setting defines the buffer pool for transient read operations. A buffer is typically used by a read operation and then returned to this pool when the operation finishes, so that other operations can reuse it. The other buffer pool is file_cache_size_in_mb. When not set, the default is calculated as 2 MB per TPC core thread, plus 2 MB shared by non-TPC threads, with a maximum value of 128 MB.
Default: calculated
Streaming settings
# stream_throughput_outbound_megabits_per_sec: 200
# inter_dc_stream_throughput_outbound_megabits_per_sec: 200
# streaming_keep_alive_period_in_secs: 300
# streaming_connections_per_host: 1
stream_throughput_outbound_megabits_per_sec
Throttle for the throughput of all outbound streaming file transfers on a node. The database does
mostly sequential I/O when streaming data during bootstrap or repair which can saturate the network
connection and degrade client (RPC) performance. When not set, the value is 200 Mbps.
Default: commented out (200)
inter_dc_stream_throughput_outbound_megabits_per_sec
Throttle for all streaming file transfers between datacenters, and for network stream traffic as configured
with stream_throughput_outbound_megabits_per_sec. When not set, the value is 200 Mbps.
Should be set to a value less than or equal to stream_throughput_outbound_megabits_per_sec
since it is a subset of total throughput.
Default: commented out (200)
streaming_keep_alive_period_in_secs
Interval to send keep-alive messages to prevent reset connections during streaming. The stream
session fails when a keep-alive message is not received for 2 keep-alive cycles. When not set, the
default is 300 seconds (5 minutes) so that a stalled stream times out in 10 minutes.
Default: commented out (300)
streaming_connections_per_host
Maximum number of connections per host for streaming. Increase this value when you notice that joins
are CPU-bound, rather than network-bound. For example, a few nodes with large files. When not set,
the default is 1.
Default: commented out (1)
Fsync settings
trickle_fsync: true
trickle_fsync_interval_in_kb: 10240
trickle_fsync
When set to true, causes fsync to force the operating system to flush the dirty buffers at the set
interval trickle_fsync_interval_in_kb. Enable this parameter to prevent sudden dirty buffer flushing from
impacting read latencies. Recommended for use with SSDs, but not with HDDs.
Default: false
trickle_fsync_interval_in_kb
The size of the fsync in kilobytes.
Default: 10240
max_value_size_in_mb
The maximum size of any value in SSTables. SSTables are marked as corrupted when the threshold is
exceeded.
Default: 256
Thread Per Core (TPC) parameters
#tpc_cores:
# tpc_io_cores:
io_global_queue_depth: 128
tpc_cores
The number of concurrent CoreThreads. The CoreThreads are the main workers in a DSE 6.x node,
and process various asynchronous tasks from their queue. If not set, the default is the number of cores
(processors on the machine) minus one. Note that configuring tpc_cores affects the default value for
tpc_io_cores.
To achieve optimal throughput and latency, for a given workload, set tpc_cores to half the number
of CPUs (minimum) to double the number of CPUs (maximum). In cases where there are a large
number of incoming client connections, increasing tpc_cores to more than the default usually results in
CoreThreads receiving more CPU time.
DSE Search workloads only: set tpc_cores to the number of physical CPUs. See Tuning search
for maximum indexing throughput.
Default: commented out; defaults to the number of cores minus one.
tpc_io_cores
The subset of tpc_cores that process asynchronous IO tasks (that is, disk reads). Must be less than or equal to tpc_cores. Lower this value to decrease parallel disk IO requests.
Default: commented out; by default, calculated as min(io_global_queue_depth/4, tpc_cores)
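As a worked example of these defaults, a 16-core machine gets tpc_cores = 15 and tpc_io_cores = min(128 / 4, 15) = 15; lowering io_global_queue_depth to 32 would instead yield min(32 / 4, 15) = 8 IO cores.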
io_global_queue_depth
Global IO queue depth used for reads when AIO is enabled, which is the default for SSDs. Set this to the optimal queue depth for your disk setup, as found with the fio tool.
Default: 128
NodeSync parameters
nodesync:
rate_in_kb: 1024
rate_in_kb
The maximum kilobytes per second for data validation on the local node. The optimum validation rate
for each node may vary.
Default: 1024
Advanced properties
Properties for advanced users or properties that are less commonly used.
Advanced initialization properties
batch_size_warn_threshold_in_kb: 64
batch_size_fail_threshold_in_kb: 640
unlogged_batch_across_partitions_warn_threshold: 10
# broadcast_address: 1.2.3.4
# listen_on_broadcast_address: false
# initial_token:
# num_tokens: 128
# allocate_tokens_for_local_replication_factor: 3
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
tracetype_query_ttl: 86400
tracetype_repair_ttl: 604800
auto_bootstrap
This setting has been removed from default configuration.
• true - Causes new (non-seed) nodes to migrate the right data to themselves automatically.
broadcast_address
The IP address that this node broadcasts to other nodes:
• If this node uses multiple physical network interfaces, set a unique IP address for broadcast_address.
• If this node is on a network that automatically routes between public and private networks, like Amazon EC2 does, set broadcast_address to the node's public IP address.
listen_on_broadcast_address
Enables the node to listen on both listen_address and broadcast_address:
See listen_address.
Default: false
initial_token
The token to start the contiguous range. Set this property for single-node-per-token architecture, in
which a node owns exactly one contiguous range in the ring space. Setting this property overrides
num_tokens.
If your installation is not using vnodes, or this node's num_tokens is set to 1 or is commented out, always set an initial_token value when setting up a production cluster for the first time and when adding capacity. See Generating tokens.
Use this parameter only with num_tokens (vnodes ) in special cases such as Restoring from a
snapshot.
Default: 1 (disabled)
num_tokens
Define virtual node (vnode) token architecture.
All other nodes in the datacenter must have the same token architecture.
• a number between 2 and 128 - the number of token ranges to assign to this virtual node (vnode). A
higher value increases the probability that the data and workload are evenly distributed.
DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured for
your environment.
Using vnodes can impact performance for your cluster. DataStax recommends testing the
configuration before enabling vnodes in production environments.
When the token number varies between nodes in a datacenter, the vnode logic assigns a
proportional number of ranges relative to other nodes in the datacenter. In general, if all nodes
have equal hardware capability, each node should have the same num_tokens value.
Default: 1 (disabled)
To migrate an existing cluster from single node per token range to vnodes, see Enabling virtual nodes
on an existing production cluster.
allocate_tokens_for_local_replication_factor
• RF of keyspaces in datacenter - triggers the recommended algorithmic allocation for the RF and
num_tokens for this node.
The allocation algorithm optimizes the workload balance using the target keyspace replication
factor. DataStax recommends setting the number of tokens to 8 to distribute the workload with
~10% variance between nodes. The allocation algorithm attempts to choose tokens in a way that
optimizes replicated load over the nodes in the datacenter for the specified RF. The load assigned
to each node is close to proportional to the number of vnodes.
The allocation algorithm is supported only for the Murmur3Partitioner and RandomPartitioner
partitioners. The Murmur3Partitioner is the default partitioning strategy for new clusters and the
right choice for new clusters in almost all cases.
• commented out - uses the random selection algorithm to assign token ranges randomly.
Over time, loads in a datacenter using the random selection algorithm become unevenly
distributed. DataStax recommends using only the allocation algorithm.
partitioner
The class that distributes rows (by partition key) across all nodes in the cluster. Any IPartitioner
may be used, including your own as long as it is in the class path. For new clusters use the default
partitioner.
DataStax Enterprise provides the following partitioners for backward compatibility:
• RandomPartitioner
• ByteOrderedPartitioner (deprecated)
• OrderPreservingPartitioner (deprecated)
See Partitioners.
Default: org.apache.cassandra.dht.Murmur3Partitioner
tracetype_query_ttl
TTL for different trace types used during logging of the query process.
Default: 86400
tracetype_repair_ttl
TTL for different trace types used during logging of the repair process.
Default: 604800
Advanced automatic backup setting
auto_snapshot: true
auto_snapshot
Enables snapshots of the data before truncating a keyspace or dropping a table. To prevent data loss,
DataStax strongly advises using the default setting. If you set auto_snapshot to false, you lose data on
truncation or drop.
Default: true
Global row properties
column_index_cache_size_in_kb: 2
# row_cache_class_name: org.apache.cassandra.cache.OHCProvider
row_cache_size_in_mb: 0
row_cache_save_period: 0
# row_cache_keys_to_save: 100
When creating or modifying tables, you can enable or disable the row cache for that table by setting the caching
parameter. Other row cache tuning and configuration options are set at the global (node) level. The database
uses these settings to automatically distribute memory for each table on the node based on the overall workload
and specific table usage. You can also configure the save periods for these caches globally.
column_index_cache_size_in_kb
(Only applies to BIG format SSTables) Threshold for the total size of all index entries for a partition that
the database stores in the partition key cache. If the total size of all index entries for a partition exceeds
this amount, the database stops putting entries for this partition into the partition key cache.
Default: 2
row_cache_class_name
The classname of the row cache provider to use. Valid values:
• org.apache.cassandra.cache.OHCProvider - Fully off-heap row cache.
• org.apache.cassandra.cache.SerializingCacheProvider - Partially off-heap row cache.
Default: commented out (org.apache.cassandra.cache.OHCProvider)
row_cache_size_in_mb
Maximum size of the row cache in memory. To disable the row cache, keep the default of 0.
Default: 0 (disabled)
row_cache_save_period
The number of seconds that rows are kept in cache. Caches are saved to saved_caches_directory. This
setting has limited use as described in row_cache_size_in_mb.
Default: 0 (disabled)
row_cache_keys_to_save
The number of keys from the row cache to save. When not set, all keys are saved.
Default: commented out (100)
Counter caches properties
counter_cache_size_in_mb:
counter_cache_save_period: 7200
# counter_cache_keys_to_save: 100
Counter cache helps to reduce counter locks' contention for hot counter cells. In case of RF = 1 a counter cache
hit causes the database to skip the read before write entirely. With RF > 1 a counter cache hit still helps to
reduce the duration of the lock hold, helping with hot counter cell updates, but does not allow skipping the read
entirely. Only the local (clock, count) tuple of a counter cell is kept in memory, not the whole counter, so it is
relatively cheap.
If you reduce the counter cache size, the database may load only the hottest keys at start-up.
counter_cache_size_in_mb
When no value is set, the database uses the smaller of 2.5% of the heap or 50 megabytes (MB). If your system performs counter deletes and relies on low gc_grace_seconds, disable the counter cache by setting this value to 0.
Default: calculated
counter_cache_save_period
The time, in seconds, after which the database saves the counter cache (keys only). The database
saves caches to saved_caches_directory.
Default: 7200 (2 hours)
counter_cache_keys_to_save
Number of keys from the counter cache to save. When not set, the database saves all keys.
Default: commented out (disabled, saves all keys)
Tombstone settings
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
When executing a scan, within or across a partition, the database must keep the tombstones encountered in memory so that it can return them to the coordinator. The coordinator uses tombstones to ensure that other replicas know about the deleted rows. Workloads that generate numerous tombstones may cause performance problems and exhaust the server heap. Adjust these thresholds only if you understand the impact and want to scan more tombstones.
You can adjust these thresholds at runtime using the StorageServiceMBean.
See the DataStax Developer Blog post Cassandra anti-patterns: Queues and queue-like datasets.
tombstone_warn_threshold
The database issues a warning if a query scans more than this number of tombstones.
Default: 1000
tombstone_failure_threshold
The database aborts a query if it scans more than this number of tombstones.
Default: 100000
Network timeout settings
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
aggregated_request_timeout_in_ms: 120000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
# cross_dc_rtt_in_ms: 0
read_request_timeout_in_ms
How long the coordinator waits for read operations to complete before timing them out.
Default: 5000
range_request_timeout_in_ms
How long the coordinator waits for sequential or index scans to complete before timing them out.
Default: 10000
aggregated_request_timeout_in_ms
How long the coordinator waits for aggregated queries, such as SELECT COUNT(*) or MIN(x), to complete before timing them out. Lowest acceptable value is 10 ms.
Default: 120000 (2 minutes)
write_request_timeout_in_ms
How long the coordinator waits for write requests to complete with at least one node in the local
datacenter. Lowest acceptable value is 10 ms.
See Hinted handoff: repair during write path.
Default: 2000 (2 seconds)
counter_write_request_timeout_in_ms
How long the coordinator waits for counter writes to complete before timing it out.
Default: 5000 (5 seconds)
cas_contention_timeout_in_ms
How long the coordinator continues to retry a CAS (compare and set) operation that contends with other
proposals for the same row. If the coordinator cannot complete the operation within this timespan, it
aborts the operation.
Default: 1000 (1 second)
truncate_request_timeout_in_ms
How long the coordinator waits for a truncate (the removal of all data from a table) to complete before
timing it out. The long default value allows the database to take a snapshot before removing the data. If
auto_snapshot is disabled (not recommended), you can reduce this time.
Default: 60000 (1 minute)
request_timeout_in_ms
The default timeout value for other miscellaneous operations. Lowest acceptable value is 10 ms.
See Hinted handoff: repair during write path.
Default: 10000
cross_dc_rtt_in_ms
How much to increase the cross-datacenter timeout (write_request_timeout_in_ms +
cross_dc_rtt_in_ms) for requests that involve only nodes in a remote datacenter. This setting is
intended to reduce hint pressure.
DataStax recommends using LOCAL_* consistency levels (CL) for read and write requests in multi-
datacenter deployments to avoid timeouts that may occur when remote nodes are chosen to satisfy
the CL, such as QUORUM.
Default: commented out (0)
slow_query_log_timeout_in_ms
How long before a node logs slow queries. SELECT queries that exceed this value generate an aggregated log message to identify slow queries. To disable, set to 0.
Default: 500
Inter-node settings
storage_port: 7000
cross_node_timeout: false
# internode_send_buff_size_in_bytes:
# internode_recv_buff_size_in_bytes:
internode_compression: dc
inter_dc_tcp_nodelay: false
storage_port
The port for inter-node communication. Follow security best practices: do not expose this port to the internet, and apply firewall rules.
See Securing DataStax Enterprise ports.
Default: 7000
cross_node_timeout
Enables operation timeout information exchange between nodes to accurately measure request
timeouts. If this property is disabled, the replica assumes any requests are forwarded to it instantly by
the coordinator. During overload conditions this means extra time is required for processing already-
timed-out requests.
Before enabling this property make sure NTP (network time protocol) is installed and the times are
synchronized among the nodes.
Default: false
internode_send_buff_size_in_bytes
The sending socket buffer size, in bytes, for inter-node calls.
See TCP settings.
• /proc/sys/net/core/wmem_max
• /proc/sys/net/core/rmem_max
• /proc/sys/net/ipv4/tcp_wmem
• /proc/sys/net/ipv4/tcp_rmem
internode_recv_buff_size_in_bytes
The receiving socket buffer size, in bytes, for inter-node calls.
internode_compression
Controls whether traffic between nodes is compressed:
• all - Compress all traffic.
• dc - Compress traffic between datacenters only.
• none - No compression.
Default: dc
inter_dc_tcp_nodelay
Enables tcp_nodelay for inter-datacenter communication. When disabled, the network sends larger,
but fewer, network packets. This reduces overhead from the TCP protocol itself. However, disabling
inter_dc_tcp_nodelay may increase latency by blocking cross datacenter responses.
Default: false
Native transport (CQL Binary Protocol)
start_native_transport: true
native_transport_port: 9042
# native_transport_port_ssl: 9142
# native_transport_max_frame_size_in_mb: 256
# native_transport_max_concurrent_connections: -1
# native_transport_max_concurrent_connections_per_ip: -1
native_transport_address: localhost
# native_transport_interface: eth0
# native_transport_interface_prefer_ipv6: false
# native_transport_broadcast_address: 1.2.3.4
native_transport_keepalive: true
start_native_transport
Enables or disables the native transport server.
Default: true
native_transport_port
The port where the CQL native transport listens for clients. For security reasons, do not expose this port
to the internet. Firewall it if needed.
Default: 9042
native_transport_max_frame_size_in_mb
The maximum allowed size of a frame. Frame (requests) larger than this are rejected as invalid.
Default: 256
native_transport_max_concurrent_connections
The maximum number of concurrent client connections.
Default: -1 (unlimited)
native_transport_max_concurrent_connections_per_ip
The maximum number of concurrent client connections per source IP address.
Default: -1 (unlimited)
native_transport_address
When left blank, uses the configured hostname of the node. Unlike the listen_address, this value
can be set to 0.0.0.0, but you must set the native_transport_broadcast_address to a value other than
0.0.0.0.
Set native_transport_address OR native_transport_interface, not both.
Default: localhost
native_transport_interface
The interface that the native transport binds to. Interfaces must correspond to a single address; IP aliasing is not supported.
Set native_transport_address OR native_transport_interface, not both.
Default: eth0
native_transport_interface_prefer_ipv6
Use IPv4 or IPv6 when interface is specified by name.
When only a single address is used, that address is selected without regard to this setting.
Default: commented out (false)
native_transport_broadcast_address
Native transport address to broadcast to drivers and other DSE nodes. This cannot be set to 0.0.0.0.
Default: commented out (1.2.3.4)
native_transport_keepalive
Enables TCP keepalive on native transport connections.
Default: true
# gc_log_threshold_in_ms: 200
# gc_warn_threshold_in_ms: 1000
# otc_coalescing_strategy: DISABLED
# otc_coalescing_window_us: 200
# otc_coalescing_enough_coalesced_messages: 8
gc_log_threshold_in_ms
Threshold for GC pause logging: any GC pause longer than this interval is logged at the INFO level. Adjust to minimize logging.
Default: commented out (200)
gc_warn_threshold_in_ms
Threshold for GC pause. Any GC pause longer than this interval is logged at the WARN level. By default, the database logs any GC pause greater than 200 ms at the INFO level.
Default: commented out (1000)
otc_coalescing_strategy
The strategy to use for coalescing outbound network messages:
• FIXED
• MOVINGAVERAGE
• TIMEHORIZON
• DISABLED
Default: commented out (DISABLED)
otc_coalescing_window_us
The coalescing window size, in microseconds:
• For the FIXED strategy - the amount of time after the first message is received before it is sent with any
accompanying messages.
• For the MOVINGAVERAGE strategy - the maximum wait time and the interval that messages must arrive on
average to enable coalescing.
Default: commented out (200)
seed_gossip_probability
The percentage of time that gossip messages are sent to a seed node during each round of gossip.
Decreases the time to propagate gossip changes across the cluster.
Default: 1.0 (100%)
Backpressure settings
back_pressure_enabled: false
back_pressure_strategy:
    - class_name: org.apache.cassandra.net.RateBasedBackPressure
      parameters:
          - high_ratio: 0.90
            factor: 5
            flow: FAST
back_pressure_enabled
Enables the coordinator to apply the specified back pressure strategy to each mutation that is sent to
replicas.
Default: false
back_pressure_strategy
To add new strategies, implement org.apache.cassandra.net.BackpressureStrategy and provide a
public constructor that accepts a Map<String, Object>.
Use only strategy implementations bundled with DSE.
class_name
The default class_name uses the ratio between incoming mutation responses and outgoing mutation
requests.
Default: org.apache.cassandra.net.RateBasedBackPressure
high_ratio
When outgoing mutations are below this value, they are rate limited according to the incoming rate
decreased by the factor (described below). When above this value, the rate limiting is increased by the
factor.
Default: 0.90
factor
A number between 1 and 10. When backpressure is below high ratio, outgoing mutations are rate
limited according to the incoming rate decreased by the given factor; if above high ratio, the rate limiting
is increased by the given factor.
Default: 5
flow
The flow speed to apply rate limiting:
• FAST - Limit to the rate of the fastest replica.
• SLOW - Limit to the rate of the slowest replica.
Default: FAST
dynamic_snitch_badness_threshold
The performance threshold for dynamically routing client requests away from a poorly performing
node. Specifically, it controls how much worse a poorly performing node has to be before the dynamic
snitch prefers other replicas. A value of 0.2 means the database continues to prefer the static snitch
values until the node response time is 20% worse than the best performing node. Until the threshold is
reached, incoming requests are statically routed to the closest replica as determined by the snitch.
Default: 0.1
dynamic_snitch_reset_interval_in_ms
Time interval after which the database resets all node scores. This allows a bad node to recover.
Default: 600000
dynamic_snitch_update_interval_in_ms
The time interval, in milliseconds, between the calculation of node scores. Because score calculation is
CPU intensive, be careful when reducing this interval.
Default: 100
Hinted handoff settings
hinted_handoff_enabled: true
# hinted_handoff_disabled_datacenters:
# - DC1
# - DC2
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
hints_directory: /var/lib/cassandra/hints
hints_flush_period_in_ms: 10000
max_hints_file_size_in_mb: 128
#hints_compression:
# - class_name: LZ4Compressor
# parameters:
# -
batchlog_replay_throttle_in_kb: 1024
# batchlog_endpoint_strategy: random_remote
hinted_handoff_enabled
Enables or disables hinted handoff. A hint indicates that the write needs to be replayed to an
unavailable node. The database writes the hint to a hints file on the coordinator node.
• true - globally enable hinted handoff, except for datacenters specified in hinted_handoff_disabled_datacenters
• false - globally disable hinted handoff
Default: true
hinted_handoff_disabled_datacenters
A blacklist of datacenters that will not perform hinted handoffs. To disable hinted handoff on a certain
datacenter, add its name to this list.
Default: commented out
max_hint_window_in_ms
Maximum amount of time during which the database generates hints for an unresponsive node.
After this interval, the database does not generate any new hints for the node until it is back up and
responsive. If the node goes down again, the database starts a new interval. This setting can prevent a
sudden demand for resources when a node is brought back online and the rest of the cluster attempts
to replay a large volume of hinted writes.
See About failure detection and recovery.
Default: 10800000 (3 hours)
hinted_handoff_throttle_in_kb
Maximum amount of traffic per delivery thread in kilobytes per second. This rate reduces proportionally to the number of nodes in the cluster. For example, if there are two nodes in the cluster, each delivery thread uses the maximum rate. If there are three, each node throttles to half of the maximum, since two nodes are expected to deliver hints simultaneously.
When applying this limit, the calculated hint transmission rate is based on the uncompressed hint
size, even if internode_compression or hints_compression is enabled.
Default: 1024
hints_flush_period_in_ms
The time, in milliseconds, to wait before flushing hints from internal buffers to disk.
Default: 10000
max_hints_delivery_threads
Number of threads the database uses to deliver hints. In multiple datacenter deployments, consider
increasing this number because cross datacenter handoff is generally slower.
Default: 2
max_hints_file_size_in_mb
The maximum size, in megabytes, of a single hints file.
Default: 128
batchlog_endpoint_strategy
The strategy used to select the batchlog storage endpoints:
• random_remote - Default, purely random. Prevents the local rack, if possible. Same behavior as earlier releases.
• dynamic_remote - Same availability guarantees as random_remote, but uses the dynamic snitch to prefer the fastest endpoints, excluding the local rack if possible. Falls back to random_remote if the dynamic snitch is not enabled.
• dynamic - Mostly the same as dynamic_remote, except that the local rack is not excluded, which offers a lower availability guarantee than random_remote or dynamic_remote. Note: this strategy falls back to random_remote if the dynamic snitch is not enabled.
Default: random_remote
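For example, to prefer the fastest remote endpoints for batchlog storage, uncomment the setting in cassandra.yaml; this sketch assumes the dynamic snitch is enabled:
batchlog_endpoint_strategy: dynamic_remote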
Security properties
DSE Advanced Security fortifies DataStax Enterprise (DSE) databases against potential harm due to deliberate
attack or user error. Configuration properties include authentication and authorization, permissions, roles,
encryption of data in-flight and at-rest, and data auditing. DSE Unified Authentication provides authentication,
authorization, and role management. Enabling DSE Unified Authentication requires additional configuration in
dse.yaml, see Configuring DSE Unified Authentication.
authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
# internode_authenticator: org.apache.cassandra.auth.AllowAllInternodeAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager
system_keyspaces_filtering: false
roles_validity_in_ms: 120000
# roles_update_interval_in_ms: 120000
permissions_validity_in_ms: 120000
# permissions_update_interval_in_ms: 120000
authenticator
The authentication backend. The only supported authenticator is DseAuthenticator for external
authentication with multiple authentication schemes such as Kerberos, LDAP, and internal
authentication. Authenticators other than DseAuthenticator are deprecated and not supported. Some
security features might not work correctly if other authenticators are used. See authentication_options in
dse.yaml.
Use only authentication implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseAuthenticator
internode_authenticator
Internode authentication backend to enable secure connections from peer nodes.
Use only authentication implementations bundled with DSE.
Default: org.apache.cassandra.auth.AllowAllInternodeAuthenticator
authorizer
The authorization backend. Authorizers other than DseAuthorizer are not supported. DseAuthorizer
supports enhanced permission management of DSE-specific resources. Authorizers other than
DseAuthorizer are deprecated and not supported. Some security features might not work correctly if
other authorizers are used. See Authorization options in dse.yaml.
Use only authorization implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseAuthorizer
system_keyspaces_filtering
Enables system keyspace filtering so that users can access and view only schema information
for rows in the system and system_schema keyspaces to which they have access. When
system_keyspaces_filtering is set to true:
• Data in the following tables of the system keyspace is filtered based on the role's DESCRIBE privileges for keyspaces; only rows for appropriate keyspaces are displayed in:
  - size_estimates
  - sstable_activity
  - built_indexes
  - built_views
  - available_ranges
  - view_builds_in_progress
• Data in all tables in the system_schema keyspace is filtered based on a role's DESCRIBE privileges for keyspaces stored in the system_schema tables.
• Read operations against other tables in the system keyspace are denied.
Security requirements and user permissions apply. Enable this feature only after appropriate user permissions are granted. You must grant the DESCRIBE permission to a role on any keyspaces stored in the system keyspaces. If you do not grant the permission, an error states that the keyspace is not found.
See Controlling access to keyspaces and tables and Configuring the security keyspaces replication
factors.
Default: false
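A minimal sketch of enabling the filter in cassandra.yaml, assuming the required DESCRIBE permissions have already been granted:
system_keyspaces_filtering: true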
role_manager
The DSE Role Manager supports LDAP roles and internal roles supported by the
CassandraRoleManager. Role options are stored in the dse_security keyspace. When using the DSE
Role Manager, increase the replication factor of the dse_security keyspace. Role managers other than
DseRoleManager are deprecated and not supported. Some security features might not work correctly if
other role managers are used.
Use only role manager implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseRoleManager
roles_validity_in_ms
Validity period for roles cache in milliseconds. Determines how long to cache the list of roles assigned
to the user; users may have several roles, either through direct assignment or inheritance (a role that
has been granted to another role). Adjust this setting based on the complexity of your role hierarchy,
tolerance for role changes, the number of nodes in your environment, and activity level of the cluster.
Fetching permissions can be an expensive operation, so this setting allows flexibility. Granted roles
are cached for authenticated sessions in AuthenticatedUser. After the specified time elapses, role
validity is rechecked. Automatically disabled when internal authentication is not in use with the DseAuthenticator.
• milliseconds - how long to cache the list of roles assigned to the user
REVOKE does not automatically invalidate cached permissions. Permissions are invalidated the next
time they are refreshed.
Default: 120000 (2 minutes)
permissions_update_interval_in_ms
Sets refresh interval for the standard authentication cache and the row-level access control
(RLAC) cache. After this interval, cache entries become eligible for refresh. On next access,
the database schedules an async reload and returns the old value until the reload completes. If
permissions_validity_in_ms is non-zero, this property must also be non-zero. When not set, the default is the same value as permissions_validity_in_ms.
Default: commented out (2000)
permissions_cache_max_entries
The maximum number of entries that are held by the standard authentication cache and row-level
access control (RLAC) cache. With the default value of 1000, the RLAC permissions cache can have
up to 1000 entries in it, and the standard authentication cache can have up to 1000 entries. This single
option applies to both caches. To size the permissions cache for use with Row Level Access Control, see the formula in Setting up Row Level Access Control (RLAC).
If this option is not present in cassandra.yaml, manually add it to use a value other than 1000. See Enabling DSE Unified Authentication.
Default: not set (1000)
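A sketch of the cache-related settings in cassandra.yaml; the values restate the defaults described above, with the update intervals uncommented:
roles_validity_in_ms: 120000
roles_update_interval_in_ms: 120000
permissions_validity_in_ms: 120000
permissions_update_interval_in_ms: 120000
permissions_cache_max_entries: 1000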
Inter-node encryption options
Node-to-node (internode) encryption protects data that is transferred between nodes in a cluster using SSL.
server_encryption_options:
internode_encryption: none
keystore: resources/dse/conf/.keystore
keystore_password: cassandra
truststore: resources/dse/conf/.truststore
truststore_password: cassandra
# More advanced defaults below:
# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites:
[TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
# require_client_auth: false
# require_endpoint_verification: false
server_encryption_options
Inter-node encryption options. If enabled, you must also generate keys and provide the appropriate key
and truststore locations and passwords. No custom encryption options are supported.
The passwords used in these options must match the passwords used when generating the keystore
and truststore. For instructions on generating these files, see Creating a Keystore to Use with JSSE.
internode_encryption
The type of internode traffic to encrypt:
• all - Encrypt all internode communications.
• none - No encryption.
• dc - Encrypt the traffic between the datacenters.
• rack - Encrypt the traffic between the racks.
Default: none
keystore
Relative path from DSE installation directory or absolute path to the Java keystore (JKS) suitable for
use with Java Secure Socket Extension (JSSE), which is the Java version of the Secure Sockets Layer
(SSL), and Transport Layer Security (TLS) protocols. The keystore contains the private key used to
encrypt outgoing messages.
Default: resources/dse/conf/.keystore
keystore_password
Password for the keystore. This must match the password used when generating the keystore and
truststore.
Default: cassandra
truststore
Relative path from DSE installation directory or absolute path to truststore containing the trusted
certificate for authenticating remote servers.
Default: resources/dse/conf/.truststore
truststore_password
Password for the truststore.
Default: cassandra
protocol
Default: commented out (TLS)
algorithm
Default: commented out (SunX509)
store_type
Valid types are JKS, JCEKS, and PKCS12.
PKCS11 is not supported.
Default: commented out (JKS)
truststore_type
Valid types are JKS, JCEKS, and PKCS12.
PKCS11 is not supported. Also, due to an OpenSSL issue, you cannot use a PKCS12 truststore that was generated directly with OpenSSL. However, truststores generated with Java's keytool and then converted to PKCS12 work with DSE.
Default: commented out (JKS)
cipher_suites
A comma-separated list of cipher suites to enable. When not specified, the defaults are:
• TLS_RSA_WITH_AES_128_CBC_SHA
• TLS_RSA_WITH_AES_256_CBC_SHA
• TLS_DHE_RSA_WITH_AES_128_CBC_SHA
• TLS_DHE_RSA_WITH_AES_256_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
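A sketch of enabling datacenter-scoped internode encryption; the keystore paths and passwords are placeholders for values from your own keystore generation, not defaults:
server_encryption_options:
    internode_encryption: dc
    keystore: /etc/dse/keystores/server.keystore
    keystore_password: myKeystorePassword
    truststore: /etc/dse/keystores/server.truststore
    truststore_password: myTruststorePassword
    # require_client_auth: true enables two-way certificate validation
    require_client_auth: true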
client_encryption_options:
enabled: false
# If enabled and optional is set to true, encrypted and unencrypted
# connections over native transport are handled.
optional: false
keystore: resources/dse/conf/.keystore
keystore_password: cassandra
# require_client_auth: false
# Set truststore and truststore_password if require_client_auth is true
# truststore: resources/dse/conf/.truststore
# truststore_password: cassandra
# More advanced defaults below:
# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites:
[TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
client_encryption_options
Whether to enable client-to-node encryption. You must also generate keys and provide the appropriate
key and truststore locations and passwords. There are no custom encryption options enabled for
DataStax Enterprise.
Advanced settings:
enabled
Whether to enable client-to-node encryption.
Default: false
optional
When optional is set to true, both encrypted and unencrypted connections over native transport are allowed. This is a necessary transitional state when enabling client-to-node encryption on live clusters without inducing an outage for existing unencrypted clients. Typically, once existing clients are migrated to encrypted connections, optional is set back to false to enforce native transport encryption.
Default: false
keystore
Relative path from DSE installation directory or absolute path to the Java keystore (JKS) suitable for
use with Java Secure Socket Extension (JSSE), which is the Java version of the Secure Sockets Layer
(SSL), and Transport Layer Security (TLS) protocols. The keystore contains the private key used to
encrypt outgoing messages.
Default: resources/dse/conf/.keystore
keystore_password
Password for the keystore.
Default: cassandra
require_client_auth
Whether to enable certificate authentication for client-to-node encryption. When not set, the default is
false.
When set to true, client certificates must be present on all nodes in the cluster.
Default: commented out (false)
truststore
Relative path from DSE installation directory or absolute path to truststore containing the trusted
certificate for authenticating remote servers.
Default: resources/dse/conf/.truststore
truststore_password
Password for the truststore. This must match the password used when generating the keystore and
truststore.
The truststore password and path are required only when require_client_auth is set to true.
Default: cassandra
protocol
Default: commented out (TLS)
algorithm
Default: commented out (SunX509)
store_type
Valid types are JKS, JCEKS and PKCS12. For file-based keystores, use PKCS12.
PKCS11 is not supported.
Default: commented out (JKS)
truststore_type
Valid types are JKS, JCEKS, and PKCS12.
PKCS11 is not supported. Also, due to an OpenSSL issue, you cannot use a PKCS12 truststore that was generated directly with OpenSSL. However, truststores generated with Java's keytool and then converted to PKCS12 work with DSE.
Default: commented out (JKS)
cipher_suites
A comma-separated list of cipher suites to enable. When not specified, the defaults are:
• TLS_RSA_WITH_AES_128_CBC_SHA
• TLS_RSA_WITH_AES_256_CBC_SHA
• TLS_DHE_RSA_WITH_AES_128_CBC_SHA
• TLS_DHE_RSA_WITH_AES_256_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
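A sketch of a transitional client encryption rollout; optional: true is the temporary state described above, and the password is a placeholder:
client_encryption_options:
    enabled: true
    # transition state: accept both encrypted and unencrypted clients
    optional: true
    keystore: resources/dse/conf/.keystore
    keystore_password: myKeystorePassword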
transparent_data_encryption_options:
enabled: false
chunk_length_kb: 64
cipher: AES/CBC/PKCS5Padding
key_alias: testing:1
# CBC IV length for AES must be 16 bytes, the default size
# iv_length: 16
key_provider:
- class_name: org.apache.cassandra.security.JKSKeyProvider
parameters:
- keystore: conf/.keystore
keystore_password: cassandra
store_type: JCEKS
key_password: cassandra
transparent_data_encryption_options
DataStax Enterprise supports this option only for backward compatibility. When using DSE, configure
data encryption options in the dse.yaml; see Transparent data encryption.
TDE properties:
• cipher - The cipher algorithm, mode, and padding. Options:
  - AES
  - CBC
  - PKCS5Padding
• key_alias: testing:1
• iv_length: 16
iv_length is commented out in the default cassandra.yaml file. Uncomment only if cipher is set to AES. The value must be 16 (bytes).
• key_provider:
  - class_name: org.apache.cassandra.security.JKSKeyProvider
    parameters:
    - keystore: conf/.keystore
      keystore_password: cassandra
      store_type: JCEKS
      key_password: cassandra
SSL Ports
ssl_storage_port: 7001
native_transport_port_ssl: 9142
ssl_storage_port
The SSL port for encrypted communication. Unused unless enabled in encryption_options. Follow security best practices: do not expose this port to the internet, and apply firewall rules.
Default: 7001
native_transport_port_ssl
Dedicated SSL port where the CQL native transport listens for clients with encrypted communication.
For security reasons, do not expose this port to the internet. Firewall it if needed.
Default: 9142
Continuous paging options
continuous_paging:
max_concurrent_sessions: 60
max_session_pages: 4
max_page_size_mb: 8
max_local_query_time_ms: 5000
client_timeout_sec: 600
cancel_timeout_sec: 5
paused_check_interval_ms: 1
continuous_paging
Options to tune continuous paging that pushes pages, when requested, continuously to the client:
Guidance
• Because memtables and SSTables are used by the continuous paging query, you can define the
maximum period of time during which memtables cannot be flushed and compacted SSTables
cannot be deleted.
• If fewer threads exist than sessions, a session cannot execute until another one is swapped out.
• Distributed queries (CL > ONE or non-local data) are swapped out after every page, while local
queries at CL = ONE are swapped out after max_local_query_time_ms.
max_concurrent_sessions
The maximum number of concurrent sessions. Additional sessions are rejected with an unavailable
error.
Default: 60
max_session_pages
The maximum number of pages that can be buffered for each session. If the client is not reading from
the socket, the producer thread is blocked after it has prepared max_session_pages.
Default: 4
max_page_size_mb
The maximum size of a page, in MB. If an individual CQL row is larger than this value, the page can be
larger than this value.
Default: 8
max_local_query_time_ms
The maximum time for a local continuous query to run. When this threshold is exceeded, the
session is swapped out and rescheduled. Swapping and rescheduling ensures the release of
resources that prevent the memtables from flushing and ensures fairness when max_threads <
max_concurrent_sessions. Adjust when high write workloads exist on tables that have continuous
paging requests.
Default: 5000
client_timeout_sec
How long the server will wait, in seconds, for clients to request more pages if the client is not reading
and the server queue is full.
Default: 600
cancel_timeout_sec
How long to wait before checking whether a paused session can be resumed. Continuous paging sessions are paused because of backpressure or when the client has not requested more pages with backpressure updates.
Default: 5
paused_check_interval_ms
How long to wait, in milliseconds, before checking whether a continuous paging session can be resumed, when that session is paused because of backpressure.
Default: 1
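An illustrative tuning sketch; the session and page values are examples for clusters with many concurrent paging clients, not recommendations:
continuous_paging:
    max_concurrent_sessions: 120
    max_session_pages: 8
    max_page_size_mb: 8
    max_local_query_time_ms: 5000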
# phi_convict_threshold: 8
phi_convict_threshold
The sensitivity of the failure detector on an exponential scale. Generally, this setting does not need
adjusting.
See About failure detection and recovery.
When not set, the internal value is 8.
Default: commented out (8)
Memory leak detection settings
#leaks_detection_params:
# sampling_probability: 0
# max_stacks_cache_size_mb: 32
# num_access_records: 0
# max_stack_depth: 30
sampling_probability
The probability with which accesses to a tracked resource are sampled. For the resources tracked, see nodetool leaksdetection.
• A number between 0 and 1 - the fraction of accesses to randomly track. For example, 0.5 tracks resources 50% of the time.
Tracking incurs a significant stack trace collection cost for every access and consumes heap space.
Enable tracking only when directed by DataStax Support.
Default: commented out (0)
max_stacks_cache_size_mb
The size, in MB, of the cache for call stack traces. Stack traces are used to debug leaked resources and consume heap memory; this setting caps the heap memory dedicated to stack traces for each resource.
Default: commented out (32)
num_access_records
Set the average number of stack traces kept when a resource is accessed. Currently only supported for
chunks in the cache.
Default: commented out (0)
max_stack_depth
The depth of the stack traces collected. Changes only the depth of stack traces collected from the time the parameter is set. Deeper stacks are more unique, so increasing the depth may require increasing max_stacks_cache_size_mb.
Default: commented out (30)
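A sketch of enabling light sampling, for use only when directed by DataStax Support; the probability shown is illustrative:
leaks_detection_params:
    sampling_probability: 0.01
    max_stacks_cache_size_mb: 32
    num_access_records: 0
    max_stack_depth: 30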
dse.yaml configuration file
The dse.yaml file is the primary configuration file for security, DSE Search, DSE Graph, and DSE Analytics.
After changing properties in the dse.yaml file, you must restart the node for the changes to take effect.
The cassandra.yaml file is the primary configuration file for the DataStax Enterprise database.
Syntax
For the properties in each section, the parent setting has zero spaces. Each child entry requires at least
two spaces. Adhere to the YAML syntax and retain the spacing. For example, no spaces before the parent
node_health_options entry, and at least two spaces before the child settings:
node_health_options:
refresh_rate_ms: 50000
uptime_ramp_up_period_seconds: 10800
dropped_mutation_window_minutes: 30
Organization
The DataStax Enterprise configuration properties are grouped into the following sections:
• DSE In-Memory
• Node health
• Health-based routing
• Lease metrics
• Audit logging
• audit_logging_options
• Inter-node messaging
• DSE Multi-Instance
• Authentication options
• Authorization options
• Kerberos options
• LDAP options
Authentication options
Authentication options for the DSE Authenticator, which allows you to use multiple schemes for authentication in a DataStax Enterprise cluster. Additional authenticator configuration is required in cassandra.yaml.
Internal and LDAP schemes can also be used for role management; see role_management_options.
# authentication_options:
# enabled: false
# default_scheme: internal
# other_schemes:
# - ldap
# - kerberos
# scheme_permissions: false
# transitional_mode: disabled
# allow_digest_with_kerberos: true
# plain_text_without_ssl: warn
authentication_options
Options for the DseAuthenticator to authenticate users when the authenticator option in
cassandra.yaml is set to com.datastax.bdp.cassandra.auth.DseAuthenticator. Authenticators other than
DseAuthenticator are not supported.
enabled
Enables user authentication.
• true - The DseAuthenticator authenticates users.
• false - The DseAuthenticator does not authenticate users and allows all connections.
Default: commented out (false)
default_scheme
The first scheme to validate a user against.
Default: commented out (internal)
other_schemes
Additional schemes to validate users against, tried in order until one succeeds.
Default: commented out (ldap, kerberos)
scheme_permissions
Whether roles must have permission to use an authentication scheme:
• true - Use multiple schemes for authentication. Every role requires permissions to a scheme in order to be assigned.
• false - Do not use multiple schemes for authentication. Prevents unintentional role assignment that might occur if user or group names overlap in the authentication service.
Default: commented out (false)
allow_digest_with_kerberos
Controls whether DIGEST-MD5 authentication is also allowed with Kerberos. The DIGEST-MD5
mechanism is not directly associated with an authentication scheme, but is used by Kerberos to pass
credentials between nodes and jobs.
• true - DIGEST-MD5 authentication is also allowed with Kerberos. In analytics clusters, set to true to use Hadoop internode authentication with Hadoop and Spark jobs.
• false - DIGEST-MD5 authentication is not allowed with Kerberos.
When not set, the default is true.
Default: commented out (true)
plain_text_without_ssl
Controls how the DseAuthenticator responds to plain text authentication requests over unencrypted client connections:
• block - Block the request with an authentication error.
• warn - Log a warning but allow the request.
• allow - Allow the request without any warning.
Default: commented out (warn)
transitional_mode
Sets transitional mode for temporary use while enabling authentication. Set to one of the following values:
• disabled - Transitional mode is disabled. All connections must provide valid credentials and map to a login-enabled role.
• permissive - Only super users are authenticated and logged in. All other authentication attempts are logged in as the anonymous user.
• normal - Allow all connections that provide credentials. Maps all authenticated users to their role AND maps all other connections to anonymous.
• strict - Allow only authenticated connections that map to a login-enabled role OR connections that provide a blank username and password as anonymous.
Default: commented out (disabled)
Credentials are required for all connections after authentication is enabled; use a blank username
and password to login with anonymous role in transitional mode.
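A sketch of enabling authentication with internal as the default scheme and LDAP as an additional scheme; adjust the schemes to your environment:
authentication_options:
    enabled: true
    default_scheme: internal
    other_schemes:
        - ldap
    scheme_permissions: false
    transitional_mode: disabled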
#role_management_options:
# mode: internal
# stats: false
role_management_options
Options for the DSE Role Manager. To enable the role manager, set role_manager to com.datastax.bdp.cassandra.auth.DseRoleManager in cassandra.yaml and uncomment this section in dse.yaml.
When scheme_permissions is enabled, all roles must have permission to execute on the authentication
scheme, see Binding a role to an authentication scheme.
mode
Set to one of the following values:
• internal - Scheme that manages roles per individual user in the internal database. Allows nesting
roles for permission management.
• ldap - Scheme that assigns roles by looking up the user name in LDAP and mapping the group
attribute (ldap_options) to an internal role name. To configure an LDAP scheme, complete the
steps in Defining an LDAP scheme.
Internal role management allows nesting roles for permission management; when using LDAP mode, role nesting is disabled, and using GRANT role_name TO role_name results in an error.
Default: commented out (internal)
stats
Set to true to enable logging of DSE role creation and modification events in the dse_security.role_stats system table. All nodes must have the stats option enabled and must be restarted for the functionality to take effect.
To query role events, select from the dse_security.role_stats table.
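A sketch of enabling LDAP-backed role management with role statistics:
role_management_options:
    mode: ldap
    stats: true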
#authorization_options:
# enabled: false
# transitional_mode: disabled
# allow_row_level_security: false
authorization_options
Options for the DSE Authorizer.
enabled
Whether to use the DSE Authorizer for role-based access control (RBAC).
• true - use the DSE Authorizer for RBAC.
• false - do not use the DSE Authorizer.
Default: commented out (false)
transitional_mode
Sets transitional mode for temporary use while enabling authorization:
• disabled - Transitional mode is disabled.
• normal - Permissions can be passed to resources, but are not enforced.
• strict - Permissions can be passed to resources, and are enforced on authenticated users. Permissions are not enforced against anonymous users.
Default: commented out (disabled)
allow_row_level_security
Whether to enable row-level access control (RLAC). Use the same setting on all nodes in the cluster.
Default: commented out (false)
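A sketch of enabling authorization with row-level security; enable only after the required roles and permissions are in place:
authorization_options:
    enabled: true
    transitional_mode: disabled
    allow_row_level_security: true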
kerberos_options:
keytab: resources/dse/conf/dse.keytab
service_principal: dse/_HOST@REALM
http_principal: HTTP/_HOST@REALM
qop: auth
kerberos_options
Options to configure security for a DataStax Enterprise cluster using Kerberos.
keytab
The file path of dse.keytab.
service_principal
The service_principal that the DataStax Enterprise process runs under must use the form dse_user/
_HOST@REALM, where:
• dse_user is the name of the user that starts the DataStax Enterprise process.
• REALM is the name of your Kerberos realm. In the Kerberos principal, REALM must be uppercase.
http_principal
The http_principal is used by the Tomcat application container to run DSE Search. The Tomcat
web server uses the GSSAPI mechanism (SPNEGO) to negotiate the GSSAPI security mechanism
(Kerberos). Set REALM to the name of your Kerberos realm. In the Kerberos principal, REALM must be
uppercase.
qop
A comma-delimited list of Quality of Protection (QOP) values that clients and servers can use for each
connection. The client can have multiple QOP values, while the server can have only a single QOP
value. The valid values are:
• auth - Authentication only.
• auth-int - Authentication plus integrity protection of all transmitted data.
• auth-conf - Authentication plus integrity protection and encryption of all transmitted data. Encryption using auth-conf is separate and independent of whether encryption is done using SSL. If both auth-conf and SSL are enabled, the transmitted data is encrypted twice. DataStax recommends choosing only one method and using it for both encryption and authentication.
Default: auth
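A sketch with a concrete realm substituted; EXAMPLE.COM is a placeholder for your Kerberos realm:
kerberos_options:
    keytab: resources/dse/conf/dse.keytab
    service_principal: dse/_HOST@EXAMPLE.COM
    http_principal: HTTP/_HOST@EXAMPLE.COM
    qop: auth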
LDAP options
Define LDAP options to authenticate users against an external LDAP service and/or for Role Management using
LDAP group look up.
# ldap_options:
# server_host:
# server_port: 389
# hostname_verification: false
# search_dn:
# search_password:
# use_ssl: false
# use_tls: false
# truststore_path:
# truststore_password:
# truststore_type: jks
# user_search_base:
# user_search_filter: (uid={0})
# user_memberof_attribute: memberof
# group_search_type: directory_search
# group_search_base:
# group_search_filter: (uniquemember={0})
# group_name_attribute: cn
# credentials_validity_in_ms: 0
# search_validity_in_seconds: 0
# connection_pool:
# max_active: 8
# max_idle: 8
Microsoft Active Directory (AD) example, for both authentication and role management:
ldap_options:
server_host: win2012ad_server.mycompany.lan
server_port: 389
search_dn: cn=lookup_user,cn=users,dc=win2012domain,dc=mycompany,dc=lan
search_password: lookup_user_password
use_ssl: false
use_tls: false
truststore_path:
truststore_password:
truststore_type: jks
#group_search_type: directory_search
group_search_type: memberof_search
#group_search_base:
#group_search_filter:
group_name_attribute: cn
user_search_base: cn=users,dc=win2012domain,dc=mycompany,dc=lan
user_search_filter: (sAMAccountName={0})
user_memberof_attribute: memberOf
connection_pool:
max_active: 8
max_idle: 8
ldap_options
Options to configure LDAP security. When not set, LDAP authentication is not used.
Default: commented out
server_host
A comma-separated list of LDAP server hosts.
Do not use LDAP on the same host (localhost) in production environments. Using LDAP on the same
host (localhost) is appropriate only in single node test or development environments.
Default: none
server_port
The port on which the LDAP server listens.
• 389 - the default port for unencrypted connections
• 636 - typically used for encrypted connections; the default SSL port for LDAP is 636
Default: 389
hostname_verification
Whether to verify the hostname of the LDAP server against its certificate. When set to true:
• A valid truststore with the correct path specified in truststore_path must exist. The truststore must have a certificate entry, trustedCertEntry, including a SAN DNSName entry that matches the hostname of the LDAP server.
Default: false
search_dn
Distinguished name (DN) of an account with read access to the user_search_base and
group_search_base. For example:
• OpenLDAP: uid=lookup,ou=users,dc=springsource,dc=com
Do not create or use an LDAP account or group called cassandra. The DSE database comes with a default login role, cassandra, that has access to all database objects and uses the consistency level QUORUM.
When not set, an anonymous bind is used for the search on the LDAP server.
Default: commented out
search_password
The password of the search_dn account.
Default: commented out
use_ssl
Whether to use an SSL-encrypted connection.
• true - use an SSL-encrypted connection; set server_port to the LDAP SSL port for the server (typically port 636)
Default: false
use_tls
Whether to enable TLS connections to the LDAP server.
• true - enable TLS connections to the LDAP server; set server_port to the TLS port of the LDAP server.
Default: false
user_search_base
Distinguished name (DN) of the object to start the recursive search for user entries for authentication and role management memberof searches. For example, to search all users in example.com, use ou=users,dc=example,dc=com.
• For your LDAP domain, set the ou and dc elements. Typically set to ou=users,dc=domain,dc=top_level_domain. For example, ou=users,dc=example,dc=com.
group_search_type
Defines how group membership is determined for a user:
• directory_search - Search for group entries using the group_search_base and group_search_filter.
• memberof_search - Recursively search for user entries using the user_search_base and user_search_filter. Get groups from the user attribute defined in user_memberof_attribute. The directory server must have memberof support.
Default: directory_search
credentials_validity_in_ms
• 0 - disable credentials caching
• duration period in milliseconds - enable a search cache and improve performance by reducing the number of requests that are sent to the internal or LDAP server. See Defining an LDAP scheme.
Default: 0
search_validity_in_seconds
• 0 - disable search caching
• duration period in seconds - enables a search cache and improves performance by reducing the number of requests that are sent to the internal or LDAP server
Default: 0
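An illustrative sketch of enabling both LDAP caches; the durations are examples only:
ldap_options:
    credentials_validity_in_ms: 60000
    search_validity_in_seconds: 600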
System information encryption settings
system_info_encryption:
enabled: false
cipher_algorithm: AES
secret_key_strength: 128
chunk_length_kb: 64
key_provider: KmipKeyProviderFactory
kmip_host: kmip_host_name
DataStax recommends using a remote encryption key from a KMIP provider when using Transparent Data
Encryption (TDE) features. Use a local encryption key only if a KMIP server is not available.
system_info_encryption
Options to set encryption settings for system resources that might contain sensitive information,
including the system.batchlog and system.paxos tables, hint files, and the database commit log.
enabled
Whether to enable encryption of system resources. See Encrypting system resources.
The system_trace keyspace is NOT encrypted by enabling the system_info_encryption section. In environments that also have tracing enabled, manually configure encryption with compression on the system_trace keyspace. See Transparent data encryption.
Default: false
cipher_algorithm
The name of the JCE cipher algorithm used to encrypt system resources.
Table 11: Supported cipher algorithm names
cipher_algorithm    secret_key_strength
AES                 128, 192, or 256
DES                 56
DESede              112 or 168
Blowfish            32-448
RC2                 40-128
Default: AES
secret_key_strength
Length of key to use for the system resources. See Supported cipher algorithms names.
DSE uses a matching local key or requests the key type from the KMIP server. For KMIP, if an
existing key does not match, the KMIP server automatically generates a new key.
Default: 128
chunk_length_kb
Optional. Size of SSTable chunks when data from the system.batchlog or system.paxos are written to
disk.
To encrypt existing data, run nodetool upgradesstables -a system batchlog paxos on all
nodes in the cluster.
Default: 64
key_provider
KMIP key provider to enable encrypting sensitive system data with a KMIP key. Comment out if using a
local encryption key.
Default: commented out (KmipKeyProviderFactory)
kmip_host
The KMIP key server host. Set to the kmip_group_name that defines the KMIP host in kmip_hosts
section. DSE requests a key from the KMIP host and uses the key generated by the KMIP provider.
Default: commented out
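A sketch of enabling system resource encryption with a local key; key_provider and kmip_host are omitted because they apply only to KMIP:
system_info_encryption:
    enabled: true
    cipher_algorithm: AES
    secret_key_strength: 128
    chunk_length_kb: 64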
Encrypted configuration properties settings
Settings for using encrypted passwords in sensitive configuration file properties.
system_key_directory: /etc/dse/conf
config_encryption_active: false
config_encryption_key_name: (key_filename | KMIP_key_URL )
system_key_directory
Path to the directory where local encryption/decryption key files are stored, also called system keys.
Distribute the system keys to all nodes in the cluster. Ensure that the DSE account is the folder owner
and has read/write/execute (700) permissions.
See Setting up local encryption keys.
This directory is not used for KMIP keys.
Default: /etc/dse/conf
config_encryption_active
Whether to enable encryption on sensitive data stored in tables and in configuration files.
Default: false
config_encryption_key_name
Set to the local encryption key filename or KMIP key URL to use for configuration file property value
decryption.
Use the dsetool encryptconfigvalue command to generate encrypted values for the configuration file properties.
Default: system_key. The default name is not configurable.
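A sketch of activating configuration encryption with the default key name; the key file must already exist in system_key_directory:
system_key_directory: /etc/dse/conf
config_encryption_active: true
config_encryption_key_name: system_key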
kmip_hosts:
your_kmip_groupname:
hosts: kmip1.yourdomain.com, kmip2.yourdomain.com
keystore_path: pathto/kmip/keystore.jks
keystore_type: jks
keystore_password: password
truststore_path: pathto/kmip/truststore.jks
truststore_type: jks
truststore_password: password
key_cache_millis: 300000
timeout: 1000
protocol: protocol
cipher_suites: supported_cipher
kmip_hosts
Connection settings for key servers that support the KMIP protocol.
kmip_groupname
A user-defined name for a group of options to configure a KMIP server or servers, key settings, and
certificates. Configure options for a kmip_groupname section for each KMIP key server or group of
KMIP key servers. Using separate key server configuration settings allows use of different key servers
to encrypt table data, and eliminates the need to enter key server configuration information in DDL
statements and other configurations. Multiple KMIP hosts are supported.
Default: commented out
hosts
A comma-separated list of KMIP hosts (host[:port]) using the FQDN (Fully Qualified Domain Name). DSE queries the hosts in the listed order, so add KMIP hosts in the intended failover sequence.
For example, if the host list contains kmip1.yourdomain.com, kmip2.yourdomain.com, DSE tries
kmip1.yourdomain.com and then kmip2.yourdomain.com.
keystore_path
The path to a Java keystore created from the KMIP agent PEM files.
Default: commented out (/etc/dse/conf/KMIP_keystore.jks)
keystore_type
The type of keystore.
Default: commented out (jks)
keystore_password
The password to access the keystore.
Default: commented out (password)
truststore_path
The path to a Java truststore that was created using the KMIP root certificate.
Default: commented out (/etc/dse/conf/KMIP_truststore.jks)
truststore_type
The type of truststore.
Default: commented out (jks)
truststore_password
The password to access the truststore.
Default: commented out (password)
key_cache_millis
Milliseconds to locally cache the encryption keys that are read from the KMIP hosts. The longer the
encryption keys are cached, the fewer requests are made to the KMIP key server, but the longer it takes
for changes, like revocation, to propagate to the DataStax Enterprise node. DataStax Enterprise uses
concurrent encryption, so multiple threads fetch the secret key from the KMIP key server at the same
time. DataStax recommends using the default value.
Default: commented out (300000)
timeout
Socket timeout in milliseconds.
Default: commented out (1000)
protocol
The TLS protocol version to use. When not specified, the JVM default is used. Example: TLSv1.2
cipher_suites
When not specified, JVM default is used. Examples:
• TLS_RSA_WITH_AES_128_CBC_SHA
• TLS_RSA_WITH_AES_256_CBC_SHA
• TLS_DHE_RSA_WITH_AES_128_CBC_SHA
• TLS_DHE_RSA_WITH_AES_256_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
See cipher_algorithm.
DSE Search index encryption settings
# solr_encryption_options:
# decryption_cache_offheap_allocation: true
# decryption_cache_size_in_mb: 256
solr_encryption_options
Settings to tune encryption of search indexes.
decryption_cache_offheap_allocation
Whether to allocate shared DSE Search decryption cache off JVM heap.
• true - allocate shared DSE Search decryption cache off JVM heap
• false - do not allocate shared DSE Search decryption cache off JVM heap
DSE In-Memory options
# max_memory_to_lock_fraction: 0.20
# max_memory_to_lock_mb: 10240
max_memory_to_lock_fraction
A fraction of the system memory. The default value of 0.20 specifies to use up to 20% of system memory. This max_memory_to_lock_fraction value is ignored if max_memory_to_lock_mb is set to a non-zero value. To specify a fraction, use this setting instead of max_memory_to_lock_mb.
Default: commented out (0.20)
max_memory_to_lock_mb
A maximum amount of memory in megabytes (MB). To specify an absolute amount, use this setting instead of max_memory_to_lock_fraction.
Default: commented out (10240)
Node health options
node_health_options:
refresh_rate_ms: 50000
uptime_ramp_up_period_seconds: 10800
dropped_mutation_window_minutes: 30
node_health_options
Node health options are always enabled.
refresh_rate_ms
Default: 60000
uptime_ramp_up_period_seconds
The amount of continuous uptime required for the node's uptime score to advance the node health
score from 0 to 1 (full health), assuming there are no recent dropped mutations. The health score is a
composite score based on dropped mutations and uptime.
If a node is repairing after a period of downtime, you might want to increase the uptime period to the
expected repair time.
Default: commented out (10800, or 3 hours)
dropped_mutation_window_minutes
The historic time window over which the rate of dropped mutations affect the node health score.
Default: 30
Health-based routing
enable_health_based_routing: true
enable_health_based_routing
Whether to consider node health for replication selection for distributed DSE Search queries. Health-
based routing enables a trade-off between index consistency and query throughput.
• true - consider node health when multiple candidates exist for a particular token range.
• false - ignore node health for replication selection. When the primary concern is performance, do
not enable health-based routing.
Default: true
Lease metrics
lease_metrics_options:
enabled: false
ttl_seconds: 604800
lease_metrics_options
Lease holder statistics help monitor the lease subsystem for automatic management of Job Tracker and
Spark Master nodes.
enabled
Enables (true) or disables (false) log entries related to lease holders. Most of the time you do not want
to enable logging.
Default: false
ttl_seconds
Defines the time, in seconds, to persist the log of lease holder changes. Logging of lease holder changes is always on and has a very low overhead.
Default: 604800
TTL index rebuild options
ttl_index_rebuild_options:
fixed_rate_period: 300
initial_delay: 20
max_docs_per_batch: 4096
thread_pool_size: 1
ttl_index_rebuild_options
Section of options to control the schedulers in charge of querying for and removing expired records, and
the execution of the checks.
fixed_rate_period
The time interval, in seconds, between checks for expired data.
Default: 300
initial_delay
The number of seconds to delay the first TTL check to speed up start-up time.
Default: 20
max_docs_per_batch
The maximum number of documents to check and delete per batch by the TTL rebuild thread. All documents determined to be expired are deleted from the index during each check; to avoid memory pressure, their unique keys are retrieved and deletes are issued in batches.
Default: 4096
thread_pool_size
The maximum number of cores that can execute TTL cleanup concurrently. Set the thread_pool_size
to manage system resource consumption and prevent many search cores from executing simultaneous
TTL deletes.
Default: 1
Reindexing of bootstrapped data
async_bootstrap_reindex: false
async_bootstrap_reindex
For DSE Search, configure whether to asynchronously reindex bootstrapped data.
• If enabled, the node joins the ring immediately after bootstrap and reindexing occurs asynchronously. The node does not wait for post-bootstrap reindexing, so it is not marked down. Use the dsetool ring command to check the status of the reindexing.
• If disabled, the node joins the ring after reindexing the bootstrapped data.
Default: false
cql_solr_query_paging: off
cql_solr_query_paging
• driver - Respects driver paging settings. Specifies to use Solr pagination (cursors) only when the
driver uses pagination. Enabled automatically for DSE SearchAnalytics workloads.
• off - Paging is off. Ignore driver paging settings for CQL queries and use normal Solr paging.
Default: off
cql_solr_query_row_timeout: 10000
cql_solr_query_row_timeout
The maximum time in milliseconds to wait for each row to be read from the database during CQL Solr
queries.
Default: commented out (10000, or 10 seconds)
DSE Search resource upload limit
solr_resource_upload_limit_mb: 10
solr_resource_upload_limit_mb
Option to disable or configure the maximum file size of the search index config or schema. Resource
files can be uploaded, but the search index config and schema are stored internally in the database
after upload.
• upload size - The maximum upload size limit in megabytes (MB) for a DSE Search resource file
(search index config or schema).
Default: 10
Shard transport options
shard_transport_options:
netty_client_request_timeout: 60000
shard_transport_options
Fault tolerance option for inter-node communication between DSE Search nodes.
netty_client_request_timeout
The internal timeout for all search queries, to prevent long-running queries. The client request timeout is the maximum cumulative time, in milliseconds, that a distributed search request will wait idly for shard responses.
Default: 60000
# back_pressure_threshold_per_core: 1024
# flush_max_time_per_core: 5
# load_max_time_per_core: 5
# enable_index_disk_failure_policy: false
# solr_data_dir: /MyDir
# solr_field_cache_enabled: false
# ram_buffer_heap_space_in_mb: 1024
# ram_buffer_offheap_space_in_mb: 1024
back_pressure_threshold_per_core
The maximum number of queued partitions during search index rebuilding and reindexing. This
maximum number safeguards against excessive heap use by the indexing queue. If set lower than the
number of threads per core (TPC), not all TPC threads can be actively indexing.
Default: commented out (1024)
flush_max_time_per_core
The maximum time, in minutes, to wait for the flushing of asynchronous index updates that occurs at
DSE Search commit time or at flush time. Expert level knowledge is required to change this value.
Always set the value reasonably high to ensure flushing completes successfully to fully sync DSE
Search indexes with the database data. If the configured value is exceeded, index updates are only
partially committed and the commit log is not truncated which can undermine data durability.
When a timeout occurs, it usually means this node is being overloaded and cannot flush in a timely
manner. Live indexing increases the time to flush asynchronous index updates.
Default: commented out (5)
load_max_time_per_core
The maximum time, in minutes, to wait for each DSE Search index to load on startup or create/reload
operations. This advanced option should be changed only if exceptions happen during search index
loading. When not set, the default is 5 minutes.
Default: commented out (5)
enable_index_disk_failure_policy
Whether to apply the configured disk failure policy if IOExceptions occur during index update
operations.
• true - apply the configured Cassandra disk failure policy to index write failures
• false - log index update failures but do not apply the disk failure policy
Default: commented out (false)
ram_buffer_heap_space_in_mb
Global Lucene RAM buffer usage threshold for heap to force segment flush. Setting too low might induce a state of constant flushing during periods of ongoing write activity. For NRT, forced segment flushes also de-schedule pending auto-soft commits to avoid potentially flushing too many small segments. When not set, the default is 1024.
Default: commented out (1024)
ram_buffer_offheap_space_in_mb
Global Lucene RAM buffer usage threshold for offheap to force segment flush. Setting too low might
induce a state of constant flushing during periods of ongoing write activity. For NRT, forced segment
flushes also de-schedule pending auto-soft commits to avoid potentially flushing too many small
segments. When not set, the default is 1024.
Default: commented out (1024)
Performance Service options
# performance_core_threads: 4
# performance_max_threads: 32
# performance_queue_capacity: 32000
performance_core_threads
Number of background threads used by the performance service under normal conditions.
Default: 4
performance_max_threads
Maximum number of background threads used by the performance service.
Default: 32
performance_queue_capacity
The number of queued tasks in the backlog when the number of performance_max_threads are busy.
Default: 32000
Performance Service options
These settings are used by the Performance Service to configure collection of performance metrics on
transactional nodes. Performance metrics are stored in the dse_perf keyspace and can be queried with CQL
using any CQL-based utility, such as cqlsh or any application using a CQL driver. To temporarily make changes
for diagnostics and testing, use the dsetool perf subcommands.
graph_events
Graph event information.
graph_events:
ttl_seconds: 600
ttl_seconds
The TTL, in seconds.
Default: 600
cql_slow_log_options
Options to configure reporting of CQL queries that take longer than a specified threshold.
# cql_slow_log_options:
# enabled: true
# threshold: 200.0
# minimum_samples: 100
# ttl_seconds: 259200
# skip_writing_to_db: true
# num_slowest_queries: 5
threshold
The threshold for reporting slow queries:
• A value greater than 1 is expressed in time and will log queries that take longer than the specified number of milliseconds.
• A value of 0 to 1 is expressed as a percentile and will log queries that exceed this percentile.
Default: commented out (200.0)
skip_writing_to_db
Whether to keep slow queries only in memory:
• true - keep slow queries only in memory.
• false - write slow queries to the database; the threshold must be >= 2000 ms to prevent a high load on the database.
Default: commented out (true)
cql_system_info_options:
enabled: false
refresh_rate_ms: 10000
enabled
Whether to collect system-wide performance information about a cluster.
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
resource_level_latency_tracking_options
resource_level_latency_tracking_options:
enabled: false
refresh_rate_ms: 10000
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
db_summary_stats_options
Options to configure collection of summary statistics at the database level.
db_summary_stats_options:
enabled: false
refresh_rate_ms: 10000
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
cluster_summary_stats_options
Options to configure collection of statistics at a cluster-wide level.
cluster_summary_stats_options:
enabled: false
refresh_rate_ms: 10000
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
spark_cluster_info_options
Options to configure collection of data associated with Spark cluster and Spark applications.
spark_cluster_info_options:
enabled: false
refresh_rate_ms: 10000
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
histogram_data_options
Histogram data for the dropped mutation metrics are stored in the dropped_messages table in the
dse_perf keyspace.
histogram_data_options:
enabled: false
refresh_rate_ms: 10000
retention_count: 3
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
retention_count
Default: 3
user_level_latency_tracking_options
User-resource latency tracking settings.
user_level_latency_tracking_options:
enabled: false
refresh_rate_ms: 10000
top_stats_limit: 100
quantiles: false
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
top_stats_limit
Limit the number of individual metrics.
Default: 100
quantiles
Default: false
DSE Search Performance Service options
These settings are used by the DataStax Enterprise Performance Service.
solr_slow_sub_query_log_options:
enabled: false
ttl_seconds: 604800
threshold_ms: 3000
async_writers: 1
solr_update_handler_metrics_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000
solr_request_handler_metrics_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000
solr_index_stats_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000
solr_cache_stats_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000
solr_latency_snapshot_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000
solr_slow_sub_query_log_options
See Collecting slow search queries.
enabled
Default: false
ttl_seconds
The time, in seconds, to retain the collected data.
Default: 604800 (7 days)
async_writers
The number of server threads dedicated to writing in the log. More than one server thread might
degrade performance.
Default: 1
threshold_ms
Default: 3000
solr_update_handler_metrics_options
Options to collect search index direct update handler statistics over time.
See Collecting handler statistics.
enabled
Default: false
ttl_seconds
The time, in seconds, to retain the collected data.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_request_handler_metrics_options
Options to collect search index request handler statistics over time. See Collecting handler statistics.
solr_index_stats_options
Options to record search index statistics over time.
See Collecting index statistics.
enabled
Default: false
ttl_seconds
The time, in seconds, to retain the collected data.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_cache_stats_options
See Collecting cache statistics.
enabled
Default: false
ttl_seconds
The time, in seconds, to retain the collected data.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_latency_snapshot_options
See Collecting Apache Solr performance statistics.
enabled
Default: false
ttl_seconds
The time, in seconds, to retain the collected data.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
Spark Performance Service options
See Monitoring Spark application information.
spark_application_info_options:
enabled: false
refresh_rate_ms: 10000
driver:
sink: false
connectorSource: false
jvmSource: false
stateSource: false
executor:
sink: false
connectorSource: false
jvmSource: false
spark_application_info_options
Statistics options.
enabled
Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
driver
Options to configure collection of metrics at the Spark Driver.
connectorSource
Whether to collect Spark Cassandra Connector metrics at the Spark Driver.
Default: false
jvmSource
Whether to collect JVM heap and garbage collection (GC) metrics from the Spark Driver.
Default: false
stateSource
Whether to collect application state metrics at the Spark Driver.
Default: false
executor
Options to configure collection of metrics at Spark executors.
sink
Default: false
connectorSource
Whether to collect Spark Cassandra Connector metrics at Spark executors.
Default: false
jvmSource
Whether to collect JVM heap and GC metrics at Spark executors.
Default: false
DSE Analytics options
• Spark
spark_shared_secret_bit_length: 256
spark_security_enabled: false
spark_security_encryption_enabled: false
spark_daemon_readiness_assertion_interval: 1000
resource_manager_options:
worker_options:
cores_total: 0.7
memory_total: 0.6
workpools:
- name: alwayson_sql
cores: 0.25
memory: 0.25
spark_ui_options:
encryption: inherit
encryption_options:
enabled: false
keystore: .keystore
keystore_password: cassandra
require_client_auth: false
truststore: .truststore
truststore_password: cassandra
# Advanced settings
# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites:
[TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
spark_shared_secret_bit_length
The length of a shared secret used to authenticate Spark components and encrypt the connections between them. This value is not the strength of the cipher for encrypting connections.
Default: 256
spark_security_enabled
In DSE 6.0.8 and later, when DSE authentication is enabled with authentication_options, Spark security
is enabled regardless of this setting.
Enables Spark security based on shared secret infrastructure. Enables mutual authentication and
optional encryption between DSE Spark Master and Workers, and of communication channels, except
the web UI.
Default: false
spark_security_encryption_enabled
In DSE 6.0.8 and later, when DSE authentication is enabled with authentication_options, Spark security
is enabled regardless of this setting.
Enables encryption between DSE Spark Master and Workers, and of communication channels,
except the web UI. Uses DIGEST-MD5 SASL-based encryption mechanism. Requires
spark_security_enabled: true.
Configure encryption between the Spark processes and DSE with client-to-node encryption in
cassandra.yaml.
spark_daemon_readiness_assertion_interval
Time interval, in milliseconds, between subsequent retries by the Spark plugin for Spark Master and
Worker readiness to start. Default: 1000
resource_manager_options
DataStax Enterprise can control the memory and cores offered by particular Spark Workers in semi-
automatic fashion. You can define the total amount of physical resources available to Spark Workers,
and optionally add named work pools with specific resources dedicated to them.
worker_options
The amount of system resources that are made available to the Spark Worker.
cores_total
The number of total system cores available to Spark. If the option is not specified, the default value 0.7
is used.
For DSE 6.0.11 and later, the SPARK_WORKER_TOTAL_CORES environment variable takes precedence
over this setting.
This setting can be the exact number of cores or a decimal fraction of the total system cores. When the
value is expressed as a decimal, the cores reserved for Spark are calculated as that fraction of the total
system cores (see the sketch after this entry).
The lowest value that you can assign to Spark Worker cores is 1 core. If the results are lower, no
exception is thrown and the values are automatically limited.
Setting cores_total or a workpool's cores to 1.0 is a decimal value, meaning 100% of the available
cores will be reserved. Setting cores_total or cores to 1 (no decimal point) is an explicit value, and
one core will be reserved.
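As an illustration, a hedged worker_options sketch; the 16-core machine is hypothetical, and the rounding shown is an assumption based on the fraction rule above:
resource_manager_options:
    worker_options:
        # On a hypothetical 16-core machine, 0.7 reserves about 0.7 * 16 = 11 cores
        # for the Spark Worker (never fewer than 1 core).
        cores_total: 0.7
        # An integer with no decimal point reserves exactly that many cores:
        # cores_total: 4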
memory_total
The amount of total system memory available to Spark. This setting can be the exact amount of
memory or a decimal of the total system memory. When the value is an absolute value, you can use
standard suffixes like M for megabyte and G for gigabyte.
When the value is expressed as a decimal, the memory made available to Spark is calculated as that
fraction of the total system memory (see the sketch after this entry).
The lowest values that you can assign to Spark Worker memory is 64 MB. If the results are lower, no
exception is thrown and the values are automatically limited.
If the option is not specified, the default value 0.6 is used.
For DSE 6.0.11 and later, the SPARK_WORKER_TOTAL_MEMORY environment variable takes
precedence over this setting.
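A matching memory_total sketch; the 64 GB machine is hypothetical:
resource_manager_options:
    worker_options:
        # On a hypothetical machine with 64 GB of system memory, 0.6 makes about
        # 0.6 * 64 GB = 38 GB available to the Spark Worker (never less than 64 MB).
        memory_total: 0.6
        # Or an absolute amount with a standard suffix:
        # memory_total: 24G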
workpools
Named work pools that can use a portion of the total resources defined under worker_options. A
default work pool named default is used if no work pools are defined in this section. If work pools are
defined, the resources allocated to the work pools are taken from the total amount, with the remaining
resources available to the default work pool. The total amount of resources defined in the workpools
section must not exceed the resources available to Spark in worker_options.
A work pool named alwayson_sql is created by default for AlwaysOn SQL. By default, it is configured
to use 25% of the resources available to Spark.
name
The name of the work pool.
cores
The number of system cores to use in this work pool expressed as either an absolute value or a decimal
value. This option follows the same rules as cores_total.
memory
The amount of memory to use in this work pool expressed as either an absolute value or a decimal
value. This option follows the same rules as memory_total.
spark_ui_options
Specify the source for SSL settings for Spark Master and Spark Worker UIs. The spark_ui_options
apply only to Spark daemon UIs, and do not apply to user applications even when the user applications
are run in cluster mode.
encryption
• inherit - Inherit the SSL settings from the client encryption options.
• custom - Use the encryption_options defined in this section.
Default: inherit
encryption_options
Set encryption options for HTTPS of Spark Master and Worker UI. The spark_encryption_options are
not valid for DSE 5.1 and later.
enabled
Whether to enable Spark encryption for Spark client-to-Spark cluster and Spark internode
communication.
Default: false
keystore
The keystore for Spark encryption keys.
The relative file path is the base Spark configuration directory that is defined by the SPARK_CONF_DIR
environment variable. The default Spark configuration directory is resources/spark/conf.
Default: resources/dse/conf/.ui-keystore
keystore_password
The password to access the key store.
Default: cassandra
require_client_auth
Whether to require truststore for client authentication. When not set, the default is false.
Default: commented out (false)
truststore
The truststore for Spark encryption keys.
Default: .truststore
cipher_suites
The supported cipher suites:
• TLS_RSA_WITH_AES_128_CBC_SHA
• TLS_RSA_WITH_AES_256_CBC_SHA
• TLS_DHE_RSA_WITH_AES_128_CBC_SHA
• TLS_DHE_RSA_WITH_AES_256_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
spark_process_runner:
runner_type: default
run_as_runner_options:
user_slots:
- slot1
- slot2
spark_process_runner
Options to configure how Spark driver and executor processes are created and managed.
runner_type
• default - Run Spark application processes as the DSE service user.
• run_as - Use the run_as_runner_options options. See Running Spark processes as separate users.
Default: default
run_as_runner_options
The slot users for separating Spark processes users from the DSE service user. See Running Spark
processes as separate users.
Default: slot1, slot2
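For example, a dse.yaml sketch that switches to the run_as runner with the default slots (the operating system users behind the slots must already exist; see Running Spark processes as separate users):
spark_process_runner:
    runner_type: run_as
    run_as_runner_options:
        user_slots:
            - slot1
            - slot2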
AlwaysOn SQL options
Properties to enable and configure AlwaysOn SQL.
# alwayson_sql_options:
# enabled: false
# thrift_port: 10000
# web_ui_port: 9077
# reserve_port_wait_time_ms: 100
# alwayson_sql_status_check_wait_time_ms: 500
# workpool: alwayson_sql
# log_dsefs_dir: /spark/log/alwayson_sql
# auth_user: alwayson_sql
# runner_max_errors: 10
alwayson_sql_options
The AlwaysOn SQL options enable and configure the server on this node.
enabled
Whether to enable AlwaysOn SQL for this node. The node must be an analytics node. When not set,
the default is false.
Default: commented out (false)
thrift_port
The Thrift port on which AlwaysOn SQL listens.
Default: commented out (10000)
web_ui_port
The port on which the AlwaysOn SQL web UI is available.
Default: commented out (9077)
reserve_port_wait_time_ms
The wait time in milliseconds to reserve the thrift_port if it is not available.
Default: commented out (100)
alwayson_sql_status_check_wait_time_ms
The time in milliseconds to wait for a health check status of the AlwaysOn SQL server.
Default: commented out (500)
workpool
The work pool name used by AlwaysOn SQL.
Default: commented out (alwayson_sql)
log_dsefs_dir
Location in DSEFS of the AlwaysOn SQL log files.
Default: commented out (/spark/log/alwayson_sql)
auth_user
The role to use for internal communication by AlwaysOn SQL if authentication is enabled. Custom roles
must be created with login=true.
Default: commented out (alwayson_sql)
runner_max_errors
The maximum number of errors that can occur during AlwaysOn SQL service runner thread runs before
stopping the service. A service stop requires a manual restart.
Default: commented out (10)
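Putting the options together, a sketch that enables AlwaysOn SQL on an analytics node by uncommenting the defaults shown above:
alwayson_sql_options:
    enabled: true
    thrift_port: 10000
    web_ui_port: 9077
    workpool: alwayson_sql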
DSE File System (DSEFS) options
Properties to enable and configure the DSE File System (DSEFS).
DSEFS replaced the Cassandra File System (CFS). DSE versions 6.0 and later do not support CFS.
dsefs_options:
enabled:
keyspace_name: dsefs
work_dir: /var/lib/dsefs
public_port: 5598
private_port: 5599
data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
min_free_space: 5368709120
# service_startup_timeout_ms: 30000
# service_close_timeout_ms: 600000
# server_close_timeout_ms: 2147483647 # Integer.MAX_VALUE
# compression_frame_max_size: 1048576
# query_cache_size: 2048
# query_cache_expire_after_ms: 2000
# gossip_options:
# round_delay_ms: 2000
# startup_delay_ms: 5000
# shutdown_delay_ms: 10000
# rest_options:
# request_timeout_ms: 330000
# connection_open_timeout_ms: 55000
# client_close_timeout_ms: 60000
# server_request_timeout_ms: 300000
# idle_connection_timeout_ms: 60000
# internode_idle_connection_timeout_ms: 120000
# core_max_concurrent_connections_per_host: 8
# transaction_options:
# transaction_timeout_ms: 3000
# conflict_retry_delay_ms: 200
# conflict_retry_count: 40
# execution_retry_delay_ms: 1000
# execution_retry_count: 3
# block_allocator_options:
# overflow_margin_mb: 1024
# overflow_factor: 1.05
dsefs_options
Enable and configure options for DSEFS.
enabled
Whether to enable DSEFS.
• blank or commented out (#) - DSEFS starts only if the node is configured to run analytics
workloads.
• true - DSEFS starts regardless of the node's workload.
• false - DSEFS does not start.
- dir
Mandatory attribute to identify the set of directories. DataStax recommends segregating these data
directories on physical devices that are different from the devices that are used for DataStax Enterprise.
Using multiple directories on JBOD improves performance and capacity.
Default: commented out (/var/lib/dsefs/data)
storage_weight
The weighting factor for this location specifies how much data to place in this directory, relative to other
directories in the cluster. This soft constraint determines how DSEFS distributes the data. For example,
a directory with a value of 3.0 receives about three times more data than a directory with a value of 1.0.
Default: commented out (1.0)
min_free_space
The reserved space, in bytes, to not use for storing file data blocks. You can use a unit of measure
suffix to specify other size units. For example: terabyte (1 TB), gigabyte (10 GB), and megabyte (5000
MB).
Default: commented out (5368709120)
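For example, a hedged sketch of a two-disk JBOD layout (the mount points are hypothetical); the 3.0 weighting sends about three times more data to the first directory, and min_free_space accepts either raw bytes or a unit suffix as described above:
dsefs_options:
    enabled: true
    data_directories:
        - dir: /mnt/disk1/dsefs/data
          storage_weight: 3.0
          min_free_space: 10 GB
        - dir: /mnt/disk2/dsefs/data
          storage_weight: 1.0
          min_free_space: 5368709120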
Advanced properties for DSEFS
service_startup_timeout_ms
Wait time, in milliseconds, before the DSEFS server times out while waiting for services to bootstrap.
Default: commented out (30000)
service_close_timeout_ms
Wait time, in milliseconds, before the DSEFS server times out while waiting for services to close.
Default: commented out (600000)
server_close_timeout_ms
Wait time, in milliseconds, that the DSEFS server waits during shutdown before closing all pending
connections.
Default: commented out (2147483647)
compression_frame_max_size
The maximum accepted size of a compression frame defined during file upload.
Default: commented out (1048576)
query_cache_size
Maximum number of elements in a single DSEFS Server query cache.
Default: commented out (2048)
query_cache_expire_after_ms
The time to retain the DSEFS Server query cache element in cache. The cache element expires when
this time is exceeded.
Default: commented out (2000)
gossip options
Options to configure DSEFS gossip rounds.
round_delay_ms
The delay, in milliseconds, between gossip rounds.
Default: commented out (2000)
startup_delay_ms
The delay time, in milliseconds, between registering the location and reading back all other locations
from the database.
Default: commented out (5000)
shutdown_delay_ms
The delay time, in milliseconds, between announcing shutdown and shutting down the node.
Default: commented out (10000)
rest_options
Options to configure DSEFS rest times.
request_timeout_ms
The time, in milliseconds, that the client waits for a response that corresponds to a given request.
Default: commented out (330000)
connection_open_timeout_ms
The time, in milliseconds, that the client waits to establish a new connection.
Default: commented out (55000)
client_close_timeout_ms
The time, in milliseconds, that the client waits for pending transfer to complete before closing a
connection.
Default: commented out (60000)
server_request_timeout_ms
The time, in milliseconds, to wait for the server rest call to complete.
Default: commented out (300000)
idle_connection_timeout_ms
The time, in milliseconds, for RestClient to wait before closing an idle connection. If RestClient does not
close the connection after the timeout, the connection is closed after 2*idle_connection_timeout_ms.
Default: commented out (60000)
overflow_margin_mb
Default: commented out (1024)
overflow_factor
Default: commented out (1.05)
# insights_options:
# data_dir: /var/lib/cassandra/insights_data
# log_dir: /var/log/cassandra/
insights_options
Options for DSE Metrics Collector.
data_dir
Directory to store collected metrics. When not set, the default directory is /var/lib/cassandra/
insights_data.
When data_dir is not set, the default location of the /insights_data directory is the same location
as the /commitlog directory, as defined with the commitlog_directory property in cassandra.yaml.
log_dir
Directory to store logs for collected metrics. The log file is dse-collectd.log. The file with the collectd
PID is dse-collectd.pid. When not set, the default directory is /var/log/cassandra/.
Audit database activities
Track database activity using the audit log feature. To get the maximum information from data auditing, turn on
data auditing on every node.
See Setting up database auditing.
audit_logging_options
Options to enable and configure database activity logging.
enabled
Whether to enable database activity auditing.
Default: false
logger
The logger to use for recording events:
• SLF4JAuditWriter - Logs audit events to the SLF4J logger.
• CassandraAuditWriter - Logs audit events to the dse_audit.audit_log table.
Configure logging level, sensitive data masking, and log file name/location in the logback.xml file.
Default: SLF4JAuditWriter
included_categories
Comma separated list of event categories that are captured, where the category names are:
• UNKNOWN - Events where the category and type are both UNKNOWN.
retention_time: 0
cassandra_audit_writer_options:
mode: sync
batch_size: 50
flush_time: 250
queue_size: 30000
write_consistency: QUORUM
# dropped_event_log: /var/log/cassandra/dropped_audit_events.log
# day_partition_millis: 3600000
retention_time
The amount of time, in hours, that audit events are retained by supporting loggers. Only the
CassandraAuditWriter supports retention time.
Default: 0 (events are retained forever)
mode
The mode the audit writer runs in:
• sync - A query is not executed until the audit event is successfully written.
• async - Audit events are queued for writing to the audit table, but are not necessarily logged before
the query executes. A pool of writer threads consumes the audit events from the queue, and writes
them to the audit table in batch queries.
While async substantially improves performance under load, if there is a failure between when
a query is executed, and its audit event is written to the table, the audit table might be missing
entries for queries that were executed.
Default: sync
batch_size
Available only when mode: async. Must be greater than 0.
The maximum number of events the writer dequeues before writing them out to the table. If
warnings in the logs reveal that batches are too large, decrease this value or increase the value of
batch_size_warn_threshold_in_kb in cassandra.yaml.
Default: 50
flush_time
Available only when mode: async.
The maximum amount of time, in milliseconds, that an event waits in the queue before a writer removes
it and writes it out. This flush time prevents events from waiting too long before being written to
the table when there are not a lot of queries happening.
Default: 500
queue_size
The size of the queue feeding the asynchronous audit log writer threads. When there are more events
being produced than the writers can write out, the queue fills up, and newer queries are blocked until
there is space on the queue. If a value of 0 is used, the queue size is unbounded, which can lead to
resource exhaustion under heavy query load.
Default: 30000
write_consistency
The consistency level that is used to write audit events.
Default: QUORUM
dropped_event_log
The path of the log file that reports dropped events. When not set, the default is /var/log/
cassandra/dropped_audit_events.log.
Default: commented out (/var/log/cassandra/dropped_audit_events.log)
day_partition_millis
The interval, in milliseconds, between changing nodes to spread audit log information across multiple
nodes. For example, to change the target node every 12 hours, specify 43200000 milliseconds. When
not set, the default is 3600000 (1 hour).
Default: commented out (3600000; 1 hour)
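For example, a sketch that enables table-based auditing with the asynchronous writer; the retention_time of 720 hours (30 days) is an illustrative value, not a default:
audit_logging_options:
    enabled: true
    logger: CassandraAuditWriter
    retention_time: 720
    cassandra_audit_writer_options:
        mode: async
        batch_size: 50
        flush_time: 250
        queue_size: 30000
        write_consistency: QUORUM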
# tiered_storage_options:
# strategy1:
# tiers:
# - paths:
# - /mnt1
# - /mnt2
# - paths: [ /mnt3, /mnt4 ]
# - paths: [ /mnt5, /mnt6 ]
#
# local_options:
# k1: v1
# k2: v2
#
# 'another strategy':
# tiers: [ paths: [ /mnt1 ] ]
tiered_storage_options
Options to configure the smart movement of data across different types of storage media, so that data
is matched to the most suitable drive type according to the performance and cost characteristics it
requires.
strategy1
The first disk configuration strategy. Create a strategy2, strategy3, and so on. In this example, strategy1
is the configurable name of the tiered storage configuration strategy.
tiers
Each unnamed entry in this section defines a storage tier, with paths that define its data directories;
tiers are listed in priority order.
local_options
Local configuration options overwrite the tiered storage settings for the table schema in the local
dse.yaml file. See Testing DSE Tiered Storage configurations.
- paths
The section of file paths that define the data directories for this tier of the disk configuration. Typically
list the fastest storage media first. These paths are used only to store data that is configured to use
tiered storage. These paths are independent of any settings in the cassandra.yaml file.
- /filepath
The file paths that define the data directories for this tier of the disk configuration.
DSE Advanced Replication configuration settings
DSE Advanced Replication configuration options to replicate data from remote clusters to central data hubs.
# advanced_replication_options:
# enabled: false
# conf_driver_password_encryption_enabled: false
# advanced_replication_directory: /var/lib/cassandra/advrep
# security_base_path: /base/path/to/advrep/security/files/
advanced_replication_options
Options to enable and configure DSE Advanced Replication.
enabled
Whether to enable an edge node to collect data in the replication log.
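A minimal sketch that uncomments the options shown above to enable collection on an edge node (the directory paths are the defaults from the example):
advanced_replication_options:
    enabled: true
    advanced_replication_directory: /var/lib/cassandra/advrep
    security_base_path: /base/path/to/advrep/security/files/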
internode_messaging_options:
port: 8609
# frame_length_in_mb: 256
# server_acceptor_threads: 8
# server_worker_threads: 16
# client_max_connections: 100
# client_worker_threads: 16
# handshake_timeout_seconds: 10
# client_request_timeout_seconds: 60
internode_messaging_options
Configuration options for inter-node messaging.
port
The mandatory port for the inter-node messaging service.
Default: 8609
frame_length_in_mb
Maximum message frame length. When not set, the default is 256.
Default: commented out (256)
server_acceptor_threads
The number of server acceptor threads. When not set, the default is the number of available
processors.
Default: commented out
server_worker_threads
The number of server worker threads. When not set, the default is the number of available processors *
8.
Default: commented out
client_max_connections
The maximum number of client connections. When not set, the default is 100.
Default: commented out (100)
client_worker_threads
The number of client worker threads. When not set, the default is the number of available processors *
8.
Default: commented out
handshake_timeout_seconds
Timeout for communication handshake process. When not set, the default is 10.
Default: commented out (10)
client_request_timeout_seconds
Timeout for non-query search requests like core creation and distributed deletes. When not set, the
default is 60.
Default: commented out (60)
DSE Multi-Instance server_id
server_id
In DSE Multi-Instance /etc/dse-nodeId/dse.yaml files, the server_id option is generated to uniquely
identify the physical server on which multiple instances are running. The server_id default value is the
media access control address (MAC address) of the physical server. You can change server_id when
the MAC address is not unique, such as a virtualized server where the host’s physical MAC is cloned.
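For example, a hedged dse.yaml entry for an instance on a virtualized server whose cloned MAC address is not unique; the node name and identifier are hypothetical, and any value unique to the physical server works:
# /etc/dse-node1/dse.yaml
server_id: 52:54:00:8a:2b:01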
DSE Graph options
# graph:
# analytic_evaluation_timeout_in_minutes: 10080
# realtime_evaluation_timeout_in_seconds: 30
# schema_agreement_timeout_in_ms: 10000
# system_evaluation_timeout_in_seconds: 180
# index_cache_size_in_mb: 128
# max_query_queue: 10000
# max_query_threads (no explicit default)
# max_query_params: 16
graph
These graph options are system-level configuration options and options that are shared between graph
instances.
Option names and values expressed in ISO 8601 format, as used in earlier DSE 5.0 releases, are still
valid, but the ISO 8601 format is deprecated.
analytic_evaluation_timeout_in_minutes
Maximum time to wait for an OLAP analytic (Spark) traversal to evaluate. When not set, the default is
10080 (168 hours).
Default: commented out (10080)
realtime_evaluation_timeout_in_seconds
Maximum time to wait for an OLTP real-time traversal to evaluate. When not set, the default is 30
seconds.
Default: commented out (30)
schema_agreement_timeout_in_ms
Maximum time to wait for the database to agree on schema versions before timing out. When not set,
the default is 10000 (10 seconds).
Default: commented out (10000)
system_evaluation_timeout_in_seconds
Maximum time to wait for a graph system-based request to execute, like creating a new graph. When
not set, the default is 180 (3 minutes).
Default: commented out (180)
schema_mode
Controls the way that the schemas are handled.
• Production = Schema must be created before data insertion. Schema cannot be changed after
data is inserted. Full graph scans are disallowed unless the option graph.allow_scan is changed to
TRUE.
• Development = No schema is required to write data to a graph. Schema can be changed after data
is inserted. Full graph scans are allowed unless the option graph.allow_scan is changed to FALSE.
When not set, the default is Production. To use Development, manually add this option and set it to
Development.
Default: not present
index_cache_size_in_mb
The amount of RAM, in MB, to allocate to the index cache. When not set, the default is 128.
Default: commented out (128)
max_query_queue
The maximum number of CQL queries that can be queued as a result of Gremlin requests. Incoming
queries are rejected if the queue size exceeds this setting. When not set, the default is 10000.
Default: commented out (10000)
max_query_threads
The maximum number of threads to use for queries to the database. When this option is not set, the
default is calculated; see gremlinPool.
Default: calculated
max_query_params
The maximum number of parameters that can be passed on a graph query request for TinkerPop
drivers and drivers using the Cassandra native protocol. Passing very large numbers of parameters
on requests is an anti-pattern, because the script evaluation time increases proportionally. DataStax
recommends reducing the number of parameters to speed up script compilation times. Before you
increase this value, consider alternate methods for parameterizing scripts, like passing a single map. If
the graph query request requires many arguments, pass a list.
Default: commented out (16)
DSE Graph Gremlin Server options
The Gremlin Server is configured using Apache TinkerPop specifications.
# gremlin_server:
# port: 8182
# threadPoolWorker: 2
# gremlinPool: 0
# scriptEngines:
# gremlin-groovy:
# config:
# sandbox_enabled: false
# sandbox_rules:
# whitelist_packages:
# - package.name
# whitelist_types:
# - fully.qualified.type.name
# whitelist_supers:
# - fully.qualified.class.name
# blacklist_packages:
# - package.name
# blacklist_supers:
# - fully.qualified.class.name
gremlin_server
The top-level configurations in Gremlin Server.
port
The available communications port for Gremlin Server. When not set, the default is 8182.
node_health_options:
    refresh_rate_ms: 50000
    uptime_ramp_up_period_seconds: 10800
    dropped_mutation_window_minutes: 30
node_health_options
Node health options are always enabled.
remote.yaml file
Configuration for the connection between the Gremlin console and the Gremlin Server.
hosts: [localhost]
port: 8182
serializer: { className:
org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0,
config: { ioRegistries:
[org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerIoRegistryV3d0] }}
hosts
Identifies a host or hosts running a DSE node that is running Gremlin Server. You may need to use the
native_transport_address value set in cassandra.yaml.
Default: [localhost]
You can also connect to the Spark Master node for the datacenter by either running the console from
the Spark Master or specifying the Spark Master in the hosts field in the remote.yaml file.
port
Identifies a port on a DSE node running Gremlin Server. The port value needs to match the port value
specified for gremlin_server: in the dse.yaml file.
Default: 8182
serializer
Specifies the class and configuration for the serializer used to pass information between the Gremlin
console and the Gremlin Server.
Default: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0,
config: { ioRegistries:
[org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerIoRegistryV3d0] }}
DSE Graph Gremlin connectionPool options
The connectionPool settings specify a number of options that will be passed between the Gremlin console and
the Gremlin Server.
connectionPool: {
enableSsl: false,
maxContentLength: 65536000,
maxInProcessPerConnection: 4,
maxSimultaneousUsagePerConnection: 16,
maxSize: 8,
maxWaitForConnection: 3000,
maxWaitForSessionClose: 3000,
minInProcessPerConnection: 1,
minSimultaneousUsagePerConnection: 8,
minSize: 2,
reconnectInterval: 1000,
resultIterationBatchSize: 64,
# trustCertChainFile: /etc/dse/graph/gremlin-console/conf/mycert.pem
# Note: trustCertChainFile deprecated as of TinkerPop 3.2.10; instead use trustStore.
trustStore: /full/path/to/jsse/truststore/file
}
enableSsl
Determines if SSL should be enabled. If enabled on the server, SSL must be enabled on the client.
To configure the Gremlin console to use SSL, when SSL is enabled on the Gremlin Server, edit the
connectionPool section of remote.yaml:
Specify the Java Secure Socket Extension (JSSE) truststore file via the trustStore parameter.
Example:
hosts: [localhost]
username: Cassandra_username
password: Cassandra_password
port: 8182
...
connectionPool: {
enableSsl: true,
trustStore: /full/path/to/JSSE/truststore/file,
...
}
resultIterationBatchSize
The override value for the size of the result batches to be returned from the server.
Default: 64
trustCertChainFile
The location of the public certificate from the DSE truststore file, in PEM format. Also set enableSsl:
true.
If you are using the deprecated trustCertChainFile in your version of remote.yaml, here are
the details. Depending on how you created the DSE truststore file, you may already have the
PEM format certificate file from the root Certificate Authority. If so, specify the PEM file with this
trustCertChainFile option. If not, export the public certificate from the DSE truststore (CER format)
and convert it to PEM format. Then specify the PEM file with this option. Example:
$ pwd
/etc/dse/graph/gremlin-console/conf
In this example, the connectionPool section of remote.yaml should then include the following options
(assuming you are aware that trustCertChainFile is deprecated, as noted above).
connectionPool: {
enableSsl: true,
trustCertChainFile: /etc/dse/graph/gremlin-console/conf/mycert.pem,
...
}
Default: Unspecified
trustStore
The location of the Java Secure Socket Extension (JSSE) truststore file. Trusted certificates for verifying
the remote client's certificate. Similar to setting the JSSE property javax.net.ssl.trustStore. If
this value is not provided in remote.yaml and SSL is enabled (via enableSsl: true), the default
TrustManager is used.
Default: Unspecified
DSE Graph Gremlin AuthProperties options
Security considerations for authentication between the Gremlin console and the Gremlin server require additional
options in the remote.yaml file.
# jaasEntry:
# protocol:
# username: xxx
# password: xxx
jaasEntry
Sets the AuthProperties.Property.JAAS_ENTRY properties for authentication to Gremlin Server.
Default: commented out (no value)
protocol
Sets the AuthProperties.Property.PROTOCOL properties for authentication to Gremlin Server.
Default: commented out (no value)
username
The username to submit on requests that require authentication.
Default: commented out (xxx)
password
The password to submit on requests that require authentication.
Default: commented out (xxx)
cassandra-rackdc.properties file
The GossipingPropertyFileSnitch, Ec2Snitch, and Ec2MultiRegionSnitch use the cassandra-rackdc.properties
configuration file to determine which datacenters and racks nodes belong to. They inform the database about the
network topology to route requests efficiently and distribute replicas evenly. Settings for this file depend on the
type of snitch:
• GossipingPropertyFileSnitch
• Ec2Snitch
• Ec2MultiRegionSnitch
This page also includes instructions for migrating from the PropertyFileSnitch to the GossipingPropertyFileSnitch.
GossipingPropertyFileSnitch
This snitch is recommended for production. It uses rack and datacenter information for the local node defined in
the cassandra-rackdc.properties file and propagates this information to other nodes via gossip.
To configure a node to use GossipingPropertyFileSnitch, edit the cassandra-rackdc.properties file as follows:
• Define the datacenter and rack that include this node. The default settings:
dc=DC1
rack=RAC1
datacenter and rack names are case-sensitive. For examples, see Initializing a single datacenter per
workload type and Initializing multiple datacenters per workload type.
• To save bandwidth, add the prefer_local=true option. This option tells DataStax Enterprise to use the
local IP address when communication is not across different datacenters.
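Putting these settings together, a cassandra-rackdc.properties sketch for a node in DC1, rack RAC1, that prefers its local interface:
dc=DC1
rack=RAC1
prefer_local=true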
cassandra-topology.properties file
The PropertyFileSnitch uses the cassandra-topology.properties file to define datacenter and rack names and to
determine the network topology, so that requests are routed efficiently and the database distributes replicas
evenly.
The GossipingPropertyFileSnitch snitch is recommended for production. See Migrating from the
PropertyFileSnitch to the GossipingPropertyFileSnitch.
PropertyFileSnitch
This snitch determines proximity by rack and datacenter, using the network details located in the
cassandra-topology.properties file. When using this snitch, you can define your datacenter names to be whatever
you want. Make sure that the datacenter names correlate to the name of your datacenters in the keyspace
definition. Every node in the cluster should be described in the cassandra-topology.properties file, and this
file should be exactly the same on every node in the cluster.
Setting datacenters and rack names
If you had non-uniform IPs and two physical datacenters with two racks in each, and a third logical datacenter for
replicating analytics data, the cassandra-topology.properties file might look like this:
# datacenter One
175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1
120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2
# datacenter Two
110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1
50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2
# datacenter Three (analytics)
172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1
For example, to define workload-specific datacenters for a cluster in the us-east EC2 region, set a
dc_suffix for each node in the cassandra-rackdc.properties file:
• node0
dc_suffix=_1_cassandra
• node1
dc_suffix=_1_cassandra
• node2
dc_suffix=_1_cassandra
• node3
dc_suffix=_1_cassandra
• node4
dc_suffix=_1_analytics
• node5
dc_suffix=_1_search
This results in three us-east datacenters:
us-east_1_cassandra
us-east_1_analytics
us-east_1_search
The datacenter naming convention in this example is based on the workload. You can use other conventions,
such as DC1, DC2 or 100, 200.
1. In the cassandra.yaml, set the listen_address to the private IP address of the node, and the
broadcast_address to the public IP address of the node.
This allows DataStax Enterprise nodes in one EC2 region to bind to nodes in another region, thus enabling
multiple datacenter support. For intra-region traffic, DataStax Enterprise switches to the private IP after
establishing a connection.
2. Set the addresses of the seed nodes in the cassandra.yaml file to the public IP addresses. Private IPs are
not routable between networks.
To find the public IP address, run the following command from each of the seed nodes in EC2:
$ curl http://instance-data/latest/meta-data/public-ipv4
Region us-east:
• node0: dc_suffix=_1_transactional
• node1: dc_suffix=_1_transactional
• node2: dc_suffix=_2_transactional
• node3: dc_suffix=_2_transactional
• node4: dc_suffix=_1_analytics
• node5: dc_suffix=_1_search
This results in four us-east datacenters:
us-east_1_transactional
us-east_2_transactional
us-east_1_analytics
us-east_1_search
Region us-west:
• node0: dc_suffix=_1_transactional
• node1: dc_suffix=_1_transactional
• node2: dc_suffix=_2_transactional
• node3: dc_suffix=_2_transactional
• node4: dc_suffix=_1_analytics
• node5: dc_suffix=_1_search
This results in four us-west datacenters:
us-west_1_transactional
us-west_2_transactional
us-west_1_analytics
us-west_1_search
Node dc_suffix
node0 dc_suffix=_a_transactional
node1 dc_suffix=_a_transactional
node2 dc_suffix=_a_transactional
node3 dc_suffix=_a_transactional
node4 dc_suffix=_a_analytics
node5 dc_suffix=_a_search
Synopsis
Change the start-up parameters using the following syntax:
• Command line:
dse cassandra -Dparameter_name=value
• jvm.options file:
-Dparameter_name=value
• cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Dparameter_name=value"
Only pass the parameter to the start-up operation once. If the same switch is passed to the start operation
multiple times, for example from both the jvm.options file and on the command line, DSE may fail to start or
may use the wrong parameter.
Startup examples
Starting a node without joining the ring:
• Command line:
dse cassandra -Dcassandra.join_ring=false
• jvm.options:
-Dcassandra.join_ring=false
Replacing a dead node:
• Command line:
dse cassandra -Dcassandra.replace_address=10.91.176.160
• jvm.options:
-Dcassandra.replace_address=10.91.176.160
Setting the LDAP retry interval:
• Command line:
dse -Ddse.ldap.retry_interval.ms=20
• jvm.options:
-Ddse.ldap.retry_interval.ms=20
-Ddse.consistent_replace
The value for consistent replace should match the value for application read consistency.
Default: ONE
-Ddse.consistent_replace.parallelism
Specify how many ranges will be repaired simultaneously during a consistent replace. The higher
the parallelism, the more resources are consumed cluster-wide, which may affect overall cluster
performance. Used only in conjunction with -Ddse.consistent_replace.
Default: 2
-Ddse.consistent_replace.retries
Specify how many times a failed repair will be retried during a replace. If all retries fail, the replace fails.
Used only in conjunction with -Ddse.consistent_replace.
Default: 3
-Ddse.consistent_replace.whitelist
Specify keyspaces and tables on which to perform a consistent replace. The keyspaces and tables
can be specified as: “ks1, ks2.cf1”. The default is blank, in which case all keyspaces and tables are
replaced. Used only in conjunction with -Ddse.consistent_replace.
Default: blank (not set)
-Dcassandra.disable_auth_caches_remote_configuration
Set to true to disable remote (JMX) configuration of the authentication caches, for example the caches
used for credentials, permissions, and roles. Those cache options can then only be set (persistently) in
cassandra.yaml and require a restart for new values to take effect.
Default: false.
-Dcassandra.expiration_date_overflow_policy
Set the policy (REJECT or CAP) for any TTL (time to live) timestamps that exceeds the maximum value
supported by the storage engine, 2038-01-19T03:14:06+00:00. The database storage engine can
only encode TTL timestamps through January 19 2038 03:14:07 UTC due to the Year 2038 problem.
• REJECT: Reject any request with an expiration timestamp later than 2038-01-19T03:14:06+00:00.
• CAP: Allow requests and insert expiration timestamps later than 2038-01-19T03:14:06+00:00 as
2038-01-19T03:14:06+00:00.
Default: REJECT.
-Dcassandra.force_default_indexing_page_size
Set to true to disable dynamic calculation of the page size used when indexing an entire partition
during initial index build or a rebuild. Fixes the page size to the default of 10000 rows per page.
Default: false.
-Dcassandra.ignore_dc
Set to true to ignore the datacenter name change on startup. Applies only when using
DseSimpleSnitch.
Default: false.
-Dcassandra.initial_token
Use when DSE is not using virtual nodes (vnodes). Set to the initial partitioner token for the node on the
first start up.
Default: blank (not set).
Vnodes automatically select tokens.
-Dcassandra.join_ring
Set to false to prevent the node from joining a ring on startup.
Add the node to the ring afterwards using nodetool join and a JMX call.
Default: true.
-Dcassandra.load_ring_state
Set to false to clear all gossip state for the node on restart.
Default: true.
-Dcassandra.metricsReporterConfigFile
Enables pluggable metrics reporter and configures it from the specified file.
Default: blank (not set).
-Dcassandra.native_transport_port
Set to the port number that CQL native transport listens for clients.
Default: 9042.
-Dcassandra.native_transport_startup_delay_seconds
Set to the number of seconds to delay the native transport server start up.
Default: 0 (no delay).
-Dcassandra.partitioner
Set to the partitioner name.
Default: org.apache.cassandra.dht.Murmur3Partitioner.
-Dcassandra.partition_sstables_by_token_range
Set to false to disable JBOD SSTable partitioning by token range to multiple data_file_directories.
Advanced setting that should only be used with guidance from DataStax Support.
Default: true.
-Dcassandra.printHeapHistogramOnOutOfMemoryError
Set to true to enable a heap histogram dump on an OutOfMemoryError.
Default: false.
-Dcassandra.replace_address
Set to the listen_address or the broadcast_address when replacing a dead node with a new node. The
new node must be in the same state as before bootstrapping, without any data in its data directory.
The broadcast_address defaults to the listen_address, except when the ring uses the Amazon EC2
multi-region snitch (see Configuring Amazon EC2 multi-region snitch).
-Dcassandra.replace_address_first_boot
Same as -Dcassandra.replace_address but only runs the first time the Cassandra node boots.
This property is preferred over -Dcassandra.replace_address since it has no effect on subsequent
boots if it is not removed from jvm.options or cassandra-env.sh.
-Dcassandra.replayList
Allows restoring specific tables from an archived commit log.
-Dcassandra.ring_delay_ms
Set to the number of milliseconds the node waits to hear from other nodes before formally joining the
ring.
Default: 30000.
-Dcassandra.ssl_storage_port
Sets the SSL port for encrypted communication.
Default: 7001.
-Dcassandra.start_native_transport
Enables or disables the native transport server. See start_native_transport in cassandra.yaml.
Default: true.
-Dcassandra.storage_port
Sets the port for inter-node communication.
Default: 7000.
-Dcassandra.write_survey
Set to true to enable a tool for testing new compaction and compression strategies. write_survey
allows you to experiment with different strategies and benchmark write performance differences without
affecting the production workload. See Testing compaction and compression.
Default: false.
Java Management Extension system properties
DataStax Enterprise exposes metrics and management operations via Java Management Extensions (JMX).
JConsole and the nodetool utility are JMX-compliant management tools.
-Dcom.sun.management.jmxremote.port
Sets the port number on which the database listens for JMX connections.
By default, you can interact with DataStax Enterprise using JMX on port 7199 without authentication.
Default: 7199
-Dcom.sun.management.jmxremote.ssl
Change to true to enable SSL for JMX.
Default: false
-Dcom.sun.management.jmxremote.authenticate
Set to true to enable remote authentication for JMX.
Default: false
-Djava.rmi.server.hostname
Sets the interface hostname or IP that JMX should use to connect. Uncomment and set if you are
having trouble connecting.
Search system properties
DataStax Enterprise (DSE) Search system properties.
-Ddse.search.client.timeout.secs
Set the timeout in seconds for native driver search core management calls using the dsetool
search-specific commands.
Default: 600 (10 minutes).
-Ddse.search.query.threads
Sets the number of Search queries that can execute in parallel. Consider increasing this value or
reducing client/driver requests per connection if EnqueuedRequestCount does not stabilize near zero.
Default: The default is two times the number of CPUs (including hyperthreading).
-Ddse.timeAllowed.enabled.default
The Solr timeAllowed option is enforced by default to prevent long-running shard queries (such as
complex facets and Boolean queries) from using system resources after they have timed out from the
DSE Search coordinator.
DSE Search checks the timeout per segment instead of during document or terms iteration. The
system property solr.timeAllowed.docsPerSample has been removed.
By default for all queries, the timeAllowed value is the same as the
internode_messaging_options.client_request_timeout_seconds setting in dse.yaml. For more
details, see Limiting queries by time.
Using the Solr timeAllowed parameter may cause a latency cost. If you find the cost for queries is
too high in your environment, consider setting the -Ddse.timeAllowed.enabled.default property
to false at DSE startup time. Or set timeAllowed.enable to false in the query.
Default: true.
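For example, a jvm.options entry, following the -Dparameter_name=value syntax described under Synopsis, that disables the enforcement at startup:
-Ddse.timeAllowed.enabled.default=false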
-Ddse.solr.data.dir
Set the path to store DSE Search data. See Set the location of search indexes.
-Dsolr.offheap.enable
The DSE Search per-segment filter cache is moved off-heap by using native memory to reduce on-
heap memory consumption and garbage collection overhead. The off-heap filter cache is enabled by
default. To disable, set this property to false at startup time. When not set,
the default is true.
Default: true
Threads per core system properties
Tune TPC using the Netty system parameters.
-Ddse.io.aio.enable
Set to false to have all read operations use the AsynchronousFileChannel regardless of the
operating system or disk type.
The default setting true allows dynamic switching of libraries for read operations as follows:
• AsynchronousFileChannel for read operations on hard disk drives and all non-Linux operating
systems
• LibAIO for read operations on solid-state drives (SSDs) on Linux
Use this advanced setting only with guidance from DataStax Support.
Default: true
-Ddse.io.aio.force
Set to true to force all read operations to use LibAIO regardless of the disk type or operating system.
Use this advanced setting only with guidance from DataStax Support.
Default: false
-Dnetty.eventloop.busy_extra_spins=N
Set to the number of iterations in the epoll event loops performed when queues are empty before
moving on to the next backoff stage. Increasing the value reduces latency while increasing CPU usage
when the loops are idle.
Default: 10
-Dnetty.epoll_check_interval_nanos
Sets the granularity, in nanoseconds, for calling an epoll select, which is a system call. Setting the value
too low impacts performance by making too many system calls. Setting the value too high impacts
performance by delaying the discovery of new events.
Default: 2000
-Dnetty.schedule_check_interval_nanos
Set the granularity for checking if scheduled events are ready to execute in nanoseconds. Specifying a
value below 1 nanosecond is not productive. Too high a value delays scheduled tasks.
Default: 1000
LDAP system properties for DataStax Enterprise Authentication
-Ddse.ldap.connection.timeout.ms
The number of milliseconds before the connection times out.
Default:
-Ddse.ldap.retry_interval.ms
Allows you to set the time in milliseconds between subsequent retries when authenticating via an LDAP
server.
Default: 10
-Ddse.ldap.pool.min.idle
Finer control over the connection pool for DataStax Enterprise LDAP authentication connector. The
min idle setting determines the minimum number of connections allowed in the pool before the evictor
thread creates new connections. This setting has no effect if the evictor thread is not configured to run.
Default:
-Ddse.ldap.pool.exhausted.action
Determines what the pool does when it is full. It can be one of: fail, block, or grow.
Default: block
-Ddse.ldap.pool.max.wait
When the dse.ldap.pool.exhausted.action is block, sets the number of milliseconds to block the
pool before throwing an exception.
Default:
-Ddse.ldap.pool.test.borrow
Tests a connection when it is borrowed from the pool.
Default:
-Ddse.ldap.pool.test.return
Tests a connection returned to the pool.
Default:
-Ddse.ldap.pool.test.idle
Tests any connections in the eviction loop that are not being evicted. Only works if the time between
eviction runs is greater than 0ms.
Default:
-Ddse.ldap.pool.time.between.evictions
Determines the time in ms (milliseconds) between eviction runs. When run with the
dse.ldap.pool.test.idle this becomes a basic keep alive for connections.
Default:
-Ddse.ldap.pool.num.tests.per.eviction
Number of connections in the pool that are tested each eviction run. If this is set to the same value as
max active (the pool size), then all connections are tested each eviction run.
Default:
-Ddse.ldap.pool.min.evictable.idle.time.ms
Determines the minimum time in ms (milliseconds) that a connection can sit in the pool before it
becomes available for eviction.
Default:
-Ddse.ldap.pool.soft.min.evictable.idle.time.ms
Determines the minimum time in ms (milliseconds) that a connection can sit in the pool before it
becomes available for eviction, with the proviso that the number of idle connections does not fall below
dse.ldap.pool.min.idle.
Default:
Kerberos system properties
-Ddse.sasl.protocol
Kerberos principal name, in the form user@REALM.
-Djava.security.auth.login.config
The path to the JAAS configuration file for DseClient.
NodeSync system parameters
-Ddse.nodesync.controller_update_interval_sec
Set the frequency to execute NodeSync auto-tuning process in seconds.
Default: 300 (5 minutes).
-Ddse.nodesync.log_reporter_interval_sec
Set the frequency of short INFO progress report in seconds.
Default: 600 (10 minutes).
-Ddse.nodesync.min_validation_interval_sec
Set to the minimum number of seconds between validations of the same segment, mostly to avoid busy
spinning on new/empty clusters.
Default: 300 (5 minutes).
-Ddse.nodesync.min_warn_interval_sec
Set to the minimum number of seconds between logging warnings.
Avoid logging warnings too often.
Default: 36000 (10 hours).
-Ddse.nodesync.rate_checker_interval_sec
Set the frequency in seconds of comparing the current configured rate to tables and their deadline. Log
a warning if rate considered too low.
Default: 1800 (30 minutes).
-Ddse.nodesync.segment_lock_timeout_sec
Set the Time-to-live (TTL) on locks inserted in the status table in seconds.
Default: 600 (10 minutes).
-Ddse.nodesync.segment_size_target_bytes
Set to the targeted maximum size for segments in bytes.
Default: 209715200 (200 MB).
-Ddse.nodesync.size_checker_interval_sec
Set the frequency to check if the depth used for a table should be updated due to data size changes in
seconds.
Default: 7200 (2 hours).
2. Answer the questions below to determine the appropriate compaction strategy for each table.
Does your table process time series data?
If the answer is yes, use TWCS (TimeWindowCompactionStrategy). If the answer is no, read the
following questions.
following questions.
Does your table handle more reads than writes, or more writes than reads?
LCS (LeveledCompactionStrategy) is appropriate if there are twice or more reads than writes, especially
randomized reads. If the reads and writes are approximately equal, the performance penalty from LCS
may not be worth the benefit. Be aware that LCS can be overwhelmed by a high number of writes. One
advantage of LCS is that it keeps related data in a small set of SSTables.
Does the data in your table change often?
If your data is immutable or there are few upserts, use STCS (SizeTieredCompactionStrategy), which
does not have the write performance penalty of LCS.
Do you require predictable levels of read and write activity?
LCS keeps the SSTables within predictable sizes and numbers. For example, if your table's read and
write ratio is small, and the read activity is expected to conform to a Service Level Agreement (SLA), it
may be worth the LCS write performance penalty to keep read rates and latency at predictable levels.
And, you may be able to overcome the LCS write penalty by adding more nodes.
Will your table be populated by a batch process?
For batched reads and writes, STCS performs better than LCS. The batch process causes little or no
fragmentation, so the benefits of LCS are not realized; batch processes can overwhelm tables that use
LCS.
Does your system have limited disk space?
LCS handles disk space more efficiently than STCS: LCS requires about 10% headroom in addition to
the space occupied by the data. In some cases, STCS and DTCS (DateTieredCompactionStrategy) require
as much as 50% more headroom than the data space. (DTCS is deprecated.)
Is your system reaching its limits for input and output?
LCS is significantly more input and output intensive than DTCS or STCS. Switching to LCS may
introduce extra input and output load that offsets the advantages.
Configuring and running compaction
Set the table compaction strategy in the CREATE TABLE or ALTER TABLE statement parameters. See
table_options.
You can start compaction manually using the nodetool compact command.
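For example, a hedged CQL sketch that switches a hypothetical time series table to TWCS (the keyspace and table names are placeholders; see table_options for the full set of compaction parameters):
ALTER TABLE cycling.events
    WITH compaction = { 'class': 'TimeWindowCompactionStrategy' };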
Testing compaction strategies
To test the compaction strategy:
• Create a three-node cluster using one of the compaction strategies, then stress test the cluster using
the cassandra-stress utility and measure the results.
• Set up a node on your existing cluster and enable the write survey mode option on the node to analyze live
data.
NodeSync service
About NodeSync
NodeSync is an easy-to-use continuous background repair that has low overhead, provides consistent
performance, and virtually eliminates the manual effort of running repair operations in a DataStax cluster.
For write-heavy workloads, where more than 20% of the operations are writes, you may notice CPU
consumption overhead associated with NodeSync. If that's the case for your environment, DataStax
recommends using nodetool repair instead of enabling NodeSync. See nodetool repair.
NodeSync service
By default, each node runs the NodeSync service. The service is idle unless it has something to validate.
NodeSync is enabled on a per table basis. The service continuously validates local data ranges for NodeSync-
enabled tables and repairs any inconsistency found. The local data ranges are split into small segments, which
act as validation save points. Segments are prioritized in order to try to meet the per-table deadline target.
Segments
A segment is a small local token range of a table. NodeSync recursively splits local ranges in half a certain
number of times (depth) to create segments. The depth is calculated using the total table size, assuming equal
distribution of data. Typically segments cover no more than 200 MB. The token ranges can be no smaller than a
single partition, so very large partitions can result in segments larger than the configured size.
Validation process and status
After a segment is selected for validation, NodeSync reads the entirety of the data it covers from all replicas
(using paging), checks for inconsistencies, and repairs if needed. When a node validates a segment, it “locks”
it in a system table to avoid work duplication by other nodes. The lock is not race-free; occasional duplicated
work is possible, a trade-off that avoids the complexity and cost of true distributed locking.
Segment validation is saved on completion in the system_distributed.nodesync_status table, which is used
internally for resuming on failure, prioritization, segment locking, and by tools. It is not meant to be read directly.
• successful: All replicas responded and all inconsistencies (if any) were properly repaired.
• unsuccessful: Either some replicas did not respond or repairs on inconsistent replicas failed.
• partial_in_sync: Not all replicas responded, but all that responded were in sync.
• partial_repaired: Not all replicas responded, and some that responded were repaired.
Limitations
• For debugging/tuning, understanding of traditional repair will be mostly unhelpful, since NodeSync depends
on the read repair path
• No special optimizations for remote DC - may perform poorly on particularly bad WAN links
• NodeSync only makes internal adjustments to try to hit the configured rate - operators must ensure this
configured throughput is sufficient to meet the gc_grace_seconds commitment and can be achieved by the
hardware
Tables with NodeSync enabled will be skipped for repair operations run against all or specific keyspaces. For
individual tables, running the repair command will be rejected when NodeSync is enabled.
On the next restart of DataStax Enterprise (DSE), the NodeSync service will start up.
Data only needs to be validated if the table is in more than one datacenter or is in a datacenter where the
keyspace has a replication factor of 2 or more.
To enable NodeSync on a table, set the per-table nodesync property. For example:
ALTER TABLE keyspace_name.table_name WITH nodesync = {'enabled': 'true'};
NodeSync records warnings to the system.log, if it detects any of the following conditions:
• rate_in_kb is too low to validate all tables within their deadline, even under ideal circumstances.
• rate_in_kb cannot be sustained by the node (too high for the node load/hardware).
1. Check the rate_in_kb setting within the nodesync section in the cassandra.yaml file.
The configured rate is different from the effective rate, which can be found in the NodeSync Service
metrics.
• Failures - When a node fails, it does not participate in NodeSync validation while it is offline.
• Temporary overloads - During periods of overload, such as unexpected events, nodes cannot achieve
the configured rate.
• Data size variation - The rate required to repair all tables within a fixed amount of time directly depends on
the size of the data to validate, which is typically a moving target.
All these factors can impact the overall NodeSync rate. Therefore build safety margins within the configured
rate. The NodeSyncServiceRate simulator helps to set the rate.
Setting the NodeSync deadline
Each table with NodeSync enabled has a deadline_target_sec property: the target for the maximum time
between two validations of the same data. As long as the deadline is met, all parts of the ring (for the table)
are validated at least that often.
The deadline (deadline_target_sec) relates to grace period (gc_grace_seconds). The deadline should
always be less than or equal to the grace period. As long as the deadline is met, no data is resurrected due to
tombstone purging.
The deadline defaults to whichever is longer: the grace period or four days. This is typically an acceptable default,
unless the table has a grace period of zero. For testing, the deadline value can be set lower than the grace period;
verify for a few weeks that a lower gc_grace value is realistic, without taking risk, before changing it.
NodeSync prioritizes segments in order to try to meet the deadline. The next segment to validate at any given
time is the one closest to missing its deadline. For example, if table 1 has half the deadline of table 2, table
1 validates approximately twice as often as table 2.
Use OpsCenter to get a graphical representation of the NodeSync validation status. See Viewing NodeSync
Status.
The syntax to change the per-table nodesync property:
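A sketch, assuming a table cycling.comments and a one-day deadline (names and values are illustrative):

ALTER TABLE cycling.comments
  WITH nodesync = {'enabled': 'true', 'deadline_target_sec': '86400'};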
This is an advanced tool. Usually, it is better to let NodeSync prioritize segments on its own.
• cassandra-topology.properties (PropertyFileSnitch)
1. In the cassandra.yaml file, set the listen_address to the private IP address of the node, and the
broadcast_address to the public IP address of the node.
This allows nodes to communicate with nodes in another network or region, thus enabling multiple datacenter support.
For intra-network or intra-region traffic, DSE switches to the private IP after establishing a connection.
2. Set the addresses of the seed nodes in the cassandra.yaml file to the public IPs. Private IPs are not
routable between networks. For example:
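A sketch with illustrative public IPs:

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "203.0.113.5,203.0.113.6"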
Be sure to enable encryption and authentication when using public IPs. See Configuring SSL for node-to-node
connections. Another option is to use a custom VPN to have local, inter-region/datacenter IPs.
listen_on_broadcast_address: true
In non-EC2 environments, the public address to private address routing is not automatically enabled. Enabling
listen_on_broadcast_address allows DSE to listen on both listen_address and broadcast_address with
two network interfaces.
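Taken together, a sketch of the relevant cassandra.yaml settings (addresses are illustrative):

listen_address: 10.1.0.5
broadcast_address: 203.0.113.5
listen_on_broadcast_address: true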
Configuring the snitch for multiple networks
External communication between the datacenters can only happen when using the broadcast_address (public IP).
The GossipingPropertyFileSnitch is recommended for production. The cassandra-rackdc.properties file defines
the datacenters used by this snitch. Enable the option prefer_local to ensure that traffic to broadcast_address
will re-route to listen_address.
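A minimal cassandra-rackdc.properties sketch with prefer_local enabled (values are illustrative):

dc=DC_A_transactional
rack=RAC1
prefer_local=true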
For each node in the network, specify its datacenter in cassandra-rackdc.properties file.
In the example below, there are two datacenters, and each datacenter is named for its workload. You can use
other naming conventions, such as DC1 and DC2 or 100 and 200. (Datacenter names are case-sensitive.)
Network A and Network B use the same assignments for each node:
• node0
dc=DC_A_transactional
rack=RAC1
• node1
dc=DC_A_transactional
rack=RAC1
• node2
dc=DC_B_transactional
rack=RAC1
• node3
dc=DC_B_transactional
rack=RAC1
• node4
dc=DC_A_analytics
rack=RAC1
• node5
dc=DC_A_search
rack=RAC1
In cloud deployments, the region name is treated as the datacenter name and availability zones are treated as
racks within a datacenter. For example, if a node is in the us-east-1 region, us-east is the datacenter name and 1
is the rack location. (Racks are important for distributing replicas, but not for datacenter naming.)
In the example below, there are two DataStax Enterprise datacenters, and each datacenter is named for its
workload. You can use other naming conventions, such as DC1 and DC2 or 100 and 200. (Datacenter names are
case-sensitive.)
For each node, specify its datacenter in the cassandra-rackdc.properties. The dc_suffix option defines the
datacenters used by the snitch. Any other lines are ignored.
Each node in both regions uses the same dc_suffix value:
• node0: dc_suffix=_1_transactional
• node1: dc_suffix=_1_transactional
• node2: dc_suffix=_2_transactional
• node3: dc_suffix=_2_transactional
• node4: dc_suffix=_1_analytics
• node5: dc_suffix=_1_search
This results in four us-east datacenters (us-east_1_transactional, us-east_2_transactional,
us-east_1_analytics, us-east_1_search) and four us-west datacenters (us-west_1_transactional,
us-west_2_transactional, us-west_1_analytics, us-west_1_search).
Property Description
cluster_name Name of the cluster that this node is joining. Must be the same for every node in the
cluster.
listen_address The IP address or hostname that the database binds to for connecting this node to other
nodes.
listen_interface Use this option instead of listen_address to specify the network interface by name rather
than by address/hostname.
broadcast_address (Optional) The public IP address this node uses to broadcast to other nodes outside the network
or across regions in multiple-region EC2 deployments. If this property is commented
out, the node uses the same IP address or hostname as listen_address. A node
does not need a separate broadcast_address in a single-node or single-datacenter
installation, or in an EC2-based network that supports automatic switching between
private and public communication. It is necessary to set a separate listen_address and
broadcast_address on a node with multiple physical network interfaces or other topologies
where not all nodes have access to other nodes by their private IP addresses. For specific
configurations, see the instructions for listen_address. The default is the listen_address.
seed_provider The -seeds list is a comma-delimited list of hosts (IP addresses) that gossip uses to learn the
topology of the ring. Every node should have the same list of seeds.
Making every node a seed node is not recommended because of increased
maintenance and reduced gossip performance. Gossip optimization is not critical, but
it is recommended to use a small seed list (approximately three nodes per datacenter).
storage_port The inter-node communication port (default is 7000). Must be the same for every node in
the cluster.
initial_token For legacy clusters. Set this property for single-node-per-token architecture, in which a
node owns exactly one contiguous range in the ring space.
num_tokens For new clusters. The number of tokens randomly assigned to this node in a cluster that
uses virtual nodes (vnodes).
Base the size of the directory on the value of the Java -Xmx option.
3. On the line after the comment, set CASSANDRA_HEAPDUMP_DIR to the desired path:
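A sketch of the resulting line (the path is illustrative):

export CASSANDRA_HEAPDUMP_DIR="/var/lib/cassandra/heapdump"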
DataStax Enterprise requires the same token architecture on all nodes in a datacenter: the nodes must all be
vnode-enabled or all use single-token architecture. Across the entire cluster, datacenter architecture can vary. For
example, a single cluster with:
Using 8 vnodes distributes the workload between systems with a ~10% variance and has minimal impact on
performance.
The allocation algorithm distributes the token ranges proportionately using the num_tokens setting.
All systems in the datacenter should have the same num_tokens setting unless performance
varies between systems. To distribute more of the workload to higher-performance
hardware, increase the number of tokens for those systems.
The allocation algorithm efficiently balances the workload using fewer tokens; when systems are added
to a datacenter, the algorithm maintains the balance. Using a higher number of tokens more evenly
distributes the workload, but also significantly increases token management overhead.
Set the number of vnode tokens based on the workload distribution requirements of the datacenter:
Table 12: Allocation algorithm workload distribution variance
Replication factor | 4 vnodes (tokens) | 8 vnodes (tokens) | 64 vnodes (tokens) | 128 vnodes (tokens)
Enabling vnodes
In the cassandra.yaml file:
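For example, to enable vnodes with the recommended token count (a minimal sketch):

num_tokens: 8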
To upgrade existing clusters to vnodes, see Enabling virtual nodes on an existing production cluster.
Disabling vnodes
If you do not use vnodes, you must ensure that each node is responsible for roughly an equal amount of
data. To do so, assign each node an initial_token value and calculate the tokens for each datacenter as
described in Generating tokens.
b. Uncomment the initial_token and set it to 1 or to the value of a generated token for a multi-node cluster.
DataStax recommends not using vnodes with DSE Search. However, if you decide to use vnodes with DSE
Search, do not use more than 8 vnodes and ensure that allocate_tokens_for_local_replication_factor option in
cassandra.yaml is correctly configured for your environment.
2. Once the new datacenter with vnodes enabled is up, switch your clients to use the new datacenter.
Logging configuration
Changing logging locations
Logging locations are set at installation. Generally, the default logs location is /var/log. For example, /var/
log/cassandra and /var/log/tomcat.
For details, see Default file locations for package installations and Default file locations for tarball installations.
You can also change logging locations with OpsCenter Configuration Profiles.
• To generate all logs in the same location, add CASSANDRA_LOG_DIR to the dse-env.sh file:
export CASSANDRA_LOG_DIR="/your/log/location"
• For finer-grained control, edit the logback.xml file and replace ${cassandra.logdir} with the path.
2. To change the Tomcat server log locations for DSE Search, edit one of these files:
export TOMCAT_LOGS="/your/log/location"
Configuring logging
Logging functionality uses Simple Logging Facade for Java (SLF4J) with a logback backend. Logs are written
to the system.log and debug.log in the logging directory. You can configure logging programmatically or
manually. Manual ways to configure logging are:
Logback looks for the logback-test.xml file first, and then for the logback.xml file.
The following example details the XML configuration of the logback.xml file:
<configuration scan="true">
<jmxConfigurator />
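<!-- SYSTEMLOG appender opening: a typical DSE default is shown here; verify against your logback.xml. -->
<appender name="SYSTEMLOG" class="ch.qos.logback.core.rolling.RollingFileAppender">
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<file>${cassandra.logdir}/system.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/system.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>20</maxIndex>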
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>20MB</maxFileSize>
</triggeringPolicy>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
</encoder>
</appender>
<if condition='isDefined("dse.console.useColors")'>
<then>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<withJansi>true</withJansi>
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<encoder>
<pattern>%highlight(%-5level) [%thread] %green(%date{ISO8601}) %yellow(%X{service}) %F:%L - %msg%n</pattern>
</encoder>
</appender>
</then>
</if>
<if condition='isNull("dse.console.useColors")'>
<then>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
</encoder>
</appender>
</then>
</if>
<include file="${SPARK_SERVER_LOGBACK_CONF_FILE}"/>
<include file="${GREMLIN_SERVER_LOGBACK_CONF_FILE}"/>
<!-- Uncomment the LogbackMetrics appender and the corresponding appender-ref in the
root to activate
<appender name="LogbackMetrics"
class="com.codahale.metrics.logback.InstrumentedAppender" />
-->
<root level="${logback.root.level:-INFO}">
<appender-ref ref="SYSTEMLOG" />
<appender-ref ref="STDOUT" />
<!-- Comment out the ASYNCDEBUGLOG appender to disable debug.log -->
<appender-ref ref="ASYNCDEBUGLOG" />
<!-- Uncomment LogbackMetrics and its associated appender to enable metric collecting for
logs. -->
<!-- <appender-ref ref="LogbackMetrics" /> -->
<appender-ref ref="SparkMasterFileAppender" />
<appender-ref ref="SparkWorkerFileAppender" />
<!--audit log-->
<appender name="SLF4JAuditWriterAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${cassandra.logdir}/audit/audit.log</file>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
<immediateFlush>true</immediateFlush>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/audit/audit.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>5</maxIndex>
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>200MB</maxFileSize>
</triggeringPolicy>
</appender>
<appender name="DroppedAuditEventAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender" prudent=$
<file>${cassandra.logdir}/audit/dropped-events.log</file>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
<immediateFlush>true</immediateFlush>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/audit/dropped-events.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>5</maxIndex>
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>200MB</maxFileSize>
</triggeringPolicy>
</appender>
</configuration>
The appender configurations specify where to print the log and its configuration. Each appender is defined as
<appender name="appender_name">; the appenders are described as follows.
SYSTEMLOG
Directs logs and ensures that WARN and ERROR messages are written synchronously to the /var/
log/cassandra/system.log file.
DEBUGLOG | ASYNCDEBUGLOG
Generates the /var/log/cassandra/debug.log file, which contains an asynchronous log of events
written to the system.log file, plus production logging information useful for debugging issues.
STDOUT
Directs logs to the console in a human-readable format.
LogbackMetrics
Records the rate of logged events by their logging level.
SLF4JAuditWriterAppender | DroppedAuditEventAppender
Used by the audit logging functionality. See Setting up database auditing for more information.
The following logging functionality is configurable:
• Rolling policy
Log levels
The valid values for setting the log level include ALL for logging information at all levels, TRACE through
ERROR, and OFF for no logging. TRACE creates the most verbose log, and ERROR, the least.
• ALL
• TRACE
• DEBUG
• INFO (Default)
• WARN
• ERROR
• OFF
When the level is set to TRACE or DEBUG, the output appears only in debug.log. When set to INFO, debug.log is
disabled.
Increasing logging levels can generate heavy logging output on a moderately trafficked cluster.
Use the nodetool getlogginglevels command to see the current logging configuration.
bin/nodetool getlogginglevels
Logger Name Log Level
ROOT INFO
Page
DSE 6.0 Administrator Guide Earlier DSE version Latest 6.0 patch: 6.0.13
209
Configuration
com.thinkaurelius.thrift ERROR
To add debug logging to a class permanently using the logback framework, use nodetool setlogginglevel to
confirm the component or class name before setting it in the logback.xml file in installation_location/conf. Modify
the file to include the following line, or similar, at the end:
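For example, to set DEBUG level for a single class (the logger name shown is illustrative):

<logger name="org.apache.cassandra.gms.Gossiper" level="DEBUG"/>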
The commitlog_archiving.properties file defines the following commands and parameters:
• archive_command=
• restore_command=
%from - the fully qualified path of an archived commitlog segment in the restore_directories.
• restore_directories=restore_directory_location
• restore_point_in_time=
Restore stops when the first client-supplied timestamp is greater than the restore point timestamp.
Because the order in which the database receives mutations does not strictly follow the timestamp order,
this can leave some mutations unrecovered.
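A sketch of a populated commitlog_archiving.properties, assuming a backup directory of /backup/commitlog
(paths and timestamp are illustrative; %path, %name, %from, and %to are substitution parameters):

archive_command=/bin/cp %path /backup/commitlog/%name
restore_command=/bin/cp -f %from %to
restore_directories=/backup/commitlog
restore_point_in_time=2020:01:15 20:00:00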
1. Enable CDC logging and configure CDC directories and space in cassandra.yaml.
For example, to enable CDC logging with default values:
cdc_enabled: true
cdc_total_space_in_mb: 4096
cdc_free_space_check_interval_ms: 250
cdc_raw_directory: /var/lib/cassandra/cdc_raw
2. To enable CDC logging for a database table, create or alter the table with the table property.
For example, to enable CDC logging on the cycling table:
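A sketch of the statement, assuming a table cycling.comments (the name is illustrative):

ALTER TABLE cycling.comments WITH cdc=true;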
Chapter 5. Initializing a DataStax Enterprise cluster
Complete the following tasks before initializing a DSE cluster.
• Establish a firm understanding of how the database works. Be sure to read at least Understanding the
database architecture and Data replication.
• Ensure the environment is suitable for the use case and workload.
• Determine the snitch and replication strategy. The GossipingPropertyFileSnitch and NetworkTopologyStrategy
are recommended for production environments.
• Determine which nodes are seed nodes. Do not make all nodes seed nodes.
Seed nodes are not required for DSE Search datacenters; see Internode communications (gossip).
• Review and make appropriate changes to other property files, such as cassandra-rackdc.properties.
• Set virtual nodes correctly for the type of datacenter. DataStax recommends using 8 vnodes (tokens). See
Virtual nodes for more information.
Initializing datacenters
In most circumstances, each workload type, such as search, analytics, and transactional, should be organized
into separate virtual datacenters. Workload segregation avoids contention for resources. However, workloads can
be combined in SearchAnalytics nodes when there is not a large demand for analytics, or when analytics queries
must use a DSE Search index. Generally, combining transactional (OLTP) and analytics (OLAP) workloads
results in decreased performance.
When creating a keyspace using CQL, DataStax Enterprise creates a virtual datacenter for a cluster, even a one-
node cluster, automatically. You assign nodes that run the same type of workload to the same datacenter. The
separate, virtual datacenters for different types of nodes segregate workloads that run DSE Search from those
nodes that run other workload types.
Single datacenters per workload type
Single datacenter deployments per workload type are useful when using a single physical datacenter.
Multiple datacenters per workload type
Multiple datacenter deployments per workload type are useful when using multiple physical datacenters.
The following scenarios describe some benefits of using multiple, physical datacenters:
• Isolating replicas from external infrastructure failures, such as networking between datacenters and power
outages.
• Diversifying assets between public cloud providers and on-premise managed datacenters.
• Preventing the slow down of a real-time analytics cluster by a development cluster running analytics jobs on
live data.
• Using virtual datacenters within a physical datacenter to ensure that reads from a specific datacenter are local
to the requests, especially when using a consistency level greater than ONE. This strategy ensures lower latency
because it avoids reads from one node in New York and another read from a node in Los Angeles.
In contrast, a multiple datacenter cluster has more than one datacenter for each type of workload.
The eight-node cluster spans two racks across three datacenters. Applications in each datacenter will use a
default consistency level of LOCAL_QUORUM. One node per rack will serve as a seed node.
Prerequisites:
To prepare the environment, complete the prerequisite tasks outlined in Initializing a DataStax Enterprise
cluster.
If the new datacenter uses existing nodes from another datacenter or cluster, complete the following steps to
ensure that old data will not interfere with the new cluster:
1. If the nodes are behind a firewall, open the required ports for internal/external communication.
3. Clear the data from DataStax Enterprise (DSE) to completely remove application directories.
1. Complete the following steps to prevent client applications from prematurely connecting to the new
datacenter, and to ensure that the consistency level for reads or writes does not query the new datacenter:
If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.
b. Direct clients to an existing datacenter. Otherwise, clients might try to access the new datacenter,
which might not have any data.
2. Configure every keyspace using SimpleStrategy to use the NetworkTopologyStrategy replication strategy,
including (but not restricted to) the following keyspaces.
If SimpleStrategy was used previously, this step is required to configure NetworkTopologyStrategy.
a. Use ALTER KEYSPACE to change the keyspace replication strategy to NetworkTopologyStrategy for
the following keyspaces (see the sketch after this list).
b. Use DESCRIBE SCHEMA to check the replication strategy of keyspaces in the cluster. Ensure that any
existing keyspaces use the NetworkTopologyStrategy replication strategy.
DESCRIBE SCHEMA;
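A sketch of the ALTER KEYSPACE statement from step a, assuming an existing datacenter named DC1 and a
replication factor of 3 (names and values are illustrative):

ALTER KEYSPACE keyspace_name
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};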
3. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.
4. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.
Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.
• auto_bootstrap: true
This setting has been removed from the default configuration, but, if present, should be set
to true.
• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.
• endpoint_snitch: snitch
See endpoint_snitch and snitches.
Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.
• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.
b. Configure node architecture (all nodes in the datacenter must use the same type):
Virtual node (vnode) allocation algorithm settings
DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured
for your environment.
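A sketch of the relevant cassandra.yaml settings for vnodes with the allocation algorithm, assuming a local
replication factor of 3 (values are illustrative):

num_tokens: 8
allocate_tokens_for_local_replication_factor: 3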
• Generate the initial token for each node and set this value for the initial_token property.
See Adding or replacing single-token nodes for more information.
After making any changes in the configuration files, you must restart the node for the changes to
take effect.
a. On nodes in the existing datacenters, update the -seeds property in cassandra.yaml to include the
seed nodes in the new datacenter.
b. Add the new datacenter definition to the cassandra.yaml properties file for the type of snitch used in
the cluster. If changing snitches, see Switching snitches.
7. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a
time, and then start the rest of the nodes:
8. Rotate starting DSE through the racks until all the nodes are up.
9. After all nodes are running in the cluster and the client applications are datacenter aware, use cqlsh to alter
the keyspaces to add the desired replication in the new datacenter.
If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.
10. Run nodetool rebuild on each node in the new datacenter, specifying the datacenter to rebuild from. This
step replicates the data to the new datacenter in the cluster.
You must specify an existing datacenter in the command line, or the new nodes will appear to rebuild
successfully, but might not contain all anticipated data.
Requests to the new datacenter with LOCAL_ONE or ONE consistency levels can fail if the existing
datacenters are not completely in-sync.
a. Run nodetool rebuild on one node at a time to reduce the impact on the existing cluster (see the
example below).
b. Alternatively, run the command on multiple nodes simultaneously when the cluster can handle the
extra I/O and network pressure.
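For example, assuming the data is rebuilt from an existing datacenter named DC1 (the name is illustrative):

$ nodetool rebuild -- DC1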
$ dsetool status
If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.
12. Complete steps 3 through 11 to add the third datacenter (DC3) to the cluster.
The datacenters in the cluster are now replicating with each other.
DC: Analytics
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.2 28.44 KB 13.0% e2451cdf-f070- ... -922337.... RAC1
UN 110.82.155.2 44.47 KB 16.7% f9fa427c-a2c5- ... 30745512... RAC2
UN 110.82.155.3 54.33 KB 23.6% b9fc31c7-3bc0- ... 45674488... RAC1
DC: Solr
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.3 15.44 KB 50.2% e2451cdf-f070- ... 9243578.... RAC1
The ten-node cluster spans two racks across five datacenters. Applications in each datacenter will use a default
consistency level of LOCAL_QUORUM. One node per rack will serve as a seed node.
Prerequisites:
Complete the prerequisite tasks outlined in Initializing a DataStax Enterprise cluster to prepare the
environment.
If the new datacenter uses existing nodes from another datacenter or cluster, complete the following steps to
ensure that old data will not interfere with the new cluster:
1. If the nodes are behind a firewall, open the required ports for internal/external communication.
3. Clear the data from DataStax Enterprise (DSE) to completely remove application directories.
1. Complete the following steps to prevent client applications from prematurely connecting to the new
datacenter, and to ensure that the consistency level for reads or writes does not query the new datacenter:
If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.
b. Direct clients to an existing datacenter. Otherwise, clients might try to access the new datacenter,
which might not have any data.
2. Configure every keyspace using SimpleStrategy to use the NetworkTopologyStrategy replication strategy,
including (but not restricted to) the following keyspaces.
If SimpleStrategy was used previously, this step is required to configure NetworkTopologyStrategy.
a. Use ALTER KEYSPACE to change the keyspace replication strategy to NetworkTopologyStrategy for
the following keyspaces.
b. Use DESCRIBE SCHEMA to check the replication strategy of keyspaces in the cluster. Ensure that any
existing keyspaces use the NetworkTopologyStrategy replication strategy.
DESCRIBE SCHEMA;
3. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.
4. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.
Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.
• auto_bootstrap: true
This setting has been removed from the default configuration, but, if present, should be set
to true.
• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.
• endpoint_snitch: snitch
See endpoint_snitch and snitches.
Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.
• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.
b. Configure node architecture (all nodes in the datacenter must use the same type):
Virtual node (vnode) allocation algorithm settings
DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured
for your environment.
For more information, refer to Virtual node (vnode) configuration.
Single-token architecture settings
• Generate the initial token for each node and set this value for the initial_token property.
See Adding or replacing single-token nodes for more information.
After making any changes in the configuration files, you must restart the node for the changes to
take effect.
a. On nodes in the existing datacenters, update the -seeds property in cassandra.yaml to include the
seed nodes in the new datacenter.
b. Add the new datacenter definition to the cassandra.yaml properties file for the type of snitch used in
the cluster. If changing snitches, see Switching snitches.
7. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a
time, and then start the rest of the nodes:
8. Rotate starting DSE through the racks until all the nodes are up.
9. After all nodes are running in the cluster and the client applications are datacenter aware, use cqlsh to alter
the keyspaces to add the desired replication in the new datacenter.
If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.
10. Run nodetool rebuild on each node in the new datacenter, specifying the datacenter to rebuild from. This
step replicates the data to the new datacenter in the cluster.
You must specify an existing datacenter in the command line, or the new nodes will appear to rebuild
successfully, but might not contain all anticipated data.
Requests to the new datacenter with LOCAL_ONE or ONE consistency levels can fail if the existing
datacenters are not completely in-sync.
a. Run nodetool rebuild on one node at a time to reduce the impact on the existing cluster.
b. Alternatively, run the command on multiple nodes simultaneously when the cluster can handle the
extra I/O and network pressure.
$ dsetool status
If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.
The datacenters in the cluster are now replicating with each other.
DC: Analytics
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.2 28.44 KB 50.2% e2451cdf-f070- ... -922337.... RAC1
UN 110.82.155.2 44.47 KB 49.8% f9fa427c-a2c5- ... 30745512... RAC2
DC: Solr
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.3 15.44 KB 50.2% e2451cdf-f070- ... 9243578.... RAC1
UN 110.82.155.4 18.78 KB 49.8% e2451cdf-f070- ... 10000 RAC2
DC: Analytics2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.82.155.3 54.33 KB 50.2% b9fc31c7-3bc0- ... 45674488... RAC1
UN 110.55.120.2 54.33 KB 49.8% b8gd45e4-3bc0- ... 45674488... RAC2
What's next:
• A seed node is used to bootstrap the gossip process for new nodes joining a cluster.
• To learn the topology of the ring, a joining node contacts one of the nodes in the -seeds list in
cassandra.yaml.
• The first time you bring up a node in a new cluster, only one node is the seed node.
• The seeds list is a comma-delimited list of addresses. Since this example cluster includes 5 nodes, you must
change the list from the default value "127.0.0.1" to the IP address of one of the nodes.
• After all nodes are added, all nodes in the datacenter must be configured to use the same seed nodes.
Making every node a seed node is not recommended because of increased maintenance and reduced gossip
performance. Gossip optimization is not critical, but it is recommended to use a small seed list (approximately
three nodes per datacenter).
This single datacenter example has 5 nodes, where nodeA, nodeB, and nodeC are seed nodes.
nodeA 110.82.155.0 # seed
nodeB 110.82.155.1 # seed
nodeC 110.54.125.1 # seed
nodeD 110.54.125.2
nodeE 110.54.155.2
1. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.
2. For nodeA, nodeB, and nodeC, configure only nodeA as seed node:
a. In cassandra.yaml:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.0"
3. Start the seed nodes one at a time: nodeA, nodeB, and then nodeC.
4. For nodeA, nodeB, and nodeC, change cassandra.yaml to configure nodeA, nodeB, and nodeC as seed
nodes:
a. In cassandra.yaml:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.0,110.82.155.1,110.54.125.1"
You do not need to restart nodeA, nodeB, or nodeC after changing the seed node entry in
cassandra.yaml; the nodes will reread the seed nodes.
5. For nodeD and nodeE, change cassandra.yaml to configure nodeA, nodeB, and nodeC as seed nodes:
a. In cassandra.yaml:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.0,110.82.155.1,110.54.125.1"
Comment out the listen_address property. If the node is properly configured (host name, name
resolution, and so on), the database uses InetAddress.getLocalHost() to get the local address from
the system.
• Node in a multi-node installation: set the listen_address property to the node's IP address or hostname,
or set listen_interface.
• Node with two physical network interfaces in a multi-datacenter installation or cluster deployed
across multiple Amazon EC2 regions using the Ec2MultiRegionSnitch:
1. Set listen_address to this node's private IP or hostname, or set listen_interface (for communication
within the local datacenter).
4. If this node is a seed node, add the node's public IP address or hostname to the seeds list.
These steps provide information about setting up a cluster having one or more datacenters.
• node1 10.176.43.66
• node2 10.168.247.41
• node4 10.169.61.170
• node5 10.169.30.138
2. Calculate the token assignments as described in Calculating tokens for single-token architecture nodes.
The following tables list tokens for a 6 node cluster with a single datacenter or two datacenters.
Single datacenter:
node0 0
node1 21267647932558653966460912964485513216
node2 42535295865117307932921825928971026432
node3 63802943797675961899382738893456539648
node4 85070591730234615865843651857942052864
node5 106338239662793269832304564822427566080
Two datacenters:
node0 0 NA DC1
3. If the nodes are behind a firewall, open the required ports for internal/external communication.
4. If DataStax Enterprise is running, stop the node and clear the data:
• Tarball installations:
From the installation location, stop the database:
$ bin/dse cassandra-stop
5. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.
Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.
• initial_token: token_value_from_calculation
• num_tokens: 1
• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.
• auto_bootstrap: false
Add the bootstrap setting only when initializing a new cluster with no data.
• endpoint_snitch: snitch
See endpoint_snitch and snitches.
Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.
• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.
6. Set the properties in the dse.yaml file as required by your use case.
110.82.155.4=DC_Search:RAC2
After making any changes in the configuration files, you must restart the node for the changes to take
effect.
8. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a time,
and then start the rest of the nodes:
$ dsetool status
If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.
Datacenter: Cassandra
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 110.82.155.0 21.33 KB 256 33.3% a9fa31c7-f3c0-... RAC1
UN 110.82.155.1 21.33 KB 256 33.3% f5bb416c-db51-... RAC1
UN 110.82.155.2 21.33 KB 256 16.7% b836748f-c94f-... RAC1
Usage:
• Package installations:
• Tarball installations:
$ installation_location/resources/cassandra/tools/bin/token-generator num_of_nodes_in_dc
... [options]
Option Description
-h, --help Displays help.
--ringoffset offset Offsets token values. Use when adding or replacing dead nodes or datacenters.
--test Displays various ring arrangements and generates an HTML file showing these arrangements.
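For example, a plausible invocation that generates tokens for one datacenter with 3 nodes and one with 2 nodes,
matching the results shown later (arguments are node counts per datacenter):

$ installation_location/resources/cassandra/tools/bin/token-generator 3 2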
2. Assign the tokens to nodes on alternating racks in the cassandra-rackdc.properties or the cassandra-
topology.properties file.
2. After calculating the tokens, assign the tokens so that the nodes in each datacenter are evenly dispersed
around the ring.
[Figure: ring diagram showing token positions for Datacenter 1 and Datacenter 2]
The results show the generated token values for the Murmur3Partitioner for one datacenter with 3 nodes
and one datacenter with 2 nodes with an offset:
DC #1:
Node #1: 6148914691236517105
Node #2: 12297829382473034310
Node #3: 18446744073709551516
DC #2:
Node #1: 9144875253562394637
Node #2: 18368247290417170445
The offset value applies to the first node; tokens for all other nodes are calculated from the offset for even
distribution.
The tokens without the offset are:
2. After calculating the tokens, assign the tokens so that the nodes in each datacenter are evenly dispersed
around the ring and alternate the rack assignments.
Chapter 6. Security
For securing DataStax Enterprise 6.0, see the DataStax Security Guide.
Chapter 7. Using DataStax Enterprise advanced
functionality
Information on using DSE Analytics, DSEFS, DSE Search, DSE Graph, DSE Advanced Replication, DSE In-
Memory, DSE Multi-Instance, DSE Tiered Storage, and DSE Performance services.
DSE Analytics
DataStax Enterprise (DSE) integrates real-time and batch operational analytics capabilities with an enhanced
version of Apache Spark™. With DSE Analytics you can easily generate ad-hoc reports, target customers with
personalization, and process real-time streams of data. The analytics toolset lets you write code once and then
use it for both real-time and batch workloads.
About DSE Analytics
DSE Analytics jobs can use the DataStax Enterprise File System (DSEFS) to handle the large data sets typical
of analytic processing. DSEFS replaces CFS (Cassandra File System).
DSE Analytics features
No single point of failure
DSE Analytics supports a peer-to-peer, distributed cluster for running Spark jobs. Being peers, any
node in the cluster can load data files, and any analytics node can assume the responsibilities of Spark
Master.
Spark Master management
DSE Analytics provides automatic Spark Master management.
Analytics without ETL
Using DSE Analytics, you run Spark jobs directly against data in the database. You can perform real-
time and analytics workloads at the same time without one workload affecting the performance of the
other. Starting some cluster nodes as Analytics nodes and others as pure transactional real-time nodes
automatically replicates data between nodes.
DataStax Enterprise file system (DSEFS)
DSEFS (DataStax Enterprise file system) is a fault-tolerant, general-purpose, distributed file system
within DataStax Enterprise. It is designed for use cases that need to leverage a distributed file system
for data ingestion, data staging, and state management for Spark Streaming applications (such
as checkpointing or write-ahead logging). DSEFS is similar to HDFS, but avoids the deployment
complexity and single point of failure typical of HDFS. DSEFS is HDFS-compatible and is designed to
work in place of HDFS in Spark and other systems.
DSE Analytics Solo
DSE Analytics Solo datacenters are devoted entirely to DSE Analytics processing, for deployments that
require separation of analytics jobs from transactional data.
Integrated security
DSE Analytics uses the advanced security features of DSE, simplifying configuration and deployment.
AlwaysOn SQL
AlwaysOn SQL is a highly-available service that provides JDBC and ODBC interfaces to applications
accessing DSE Analytics data.
Enabling DSE Analytics
To enable DSE Analytics, follow the architecture guidelines for choosing a workload type for the datacenters in the
cluster.
• dse_analytics
• dse_leases
• dsefs
• "HiveMetaStore"
All analytics keyspaces are initially created with the SimpleStrategy replication strategy and a replication
factor (RF) of 1. Each of these must be updated in production environments to avoid data loss. After starting
the cluster, alter each keyspace to use the NetworkTopologyStrategy replication strategy with appropriate
settings for the replication factor and datacenters. For most environments using DSE Analytics, a suitable
replication factor is either 3 or the cluster size, whichever is smaller.
For example, use a CQL statement to configure the dse_leases keyspace for a replication factor of 3 in both
DC1 and DC2 datacenters using NetworkTopologyStrategy:
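The statement described above takes the following form (datacenter names as given in the example):

ALTER KEYSPACE dse_leases
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};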
Only replicate DSE Analytics keyspaces to other DSE Analytics datacenters. DSEFS does not support
replication to other datacenters, and the dsefs keyspace only contains metadata, not the data stored in
DSEFS. Each DSE Analytics datacenter should have its own DSEFS instance.
The datacenter name used is case-sensitive. If needed, use the dsetool status command to confirm the exact
datacenter spelling.
After adjusting the replication factor, run nodetool repair on each node in the affected datacenters.
For example, to repair the altered keyspace dse_leases:
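$ nodetool repair dse_leases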
Repeat the above steps for each of the analytics keyspaces listed above. For more information see Changing
keyspace replication strategy.
DSE Analytics and Search integration
An integrated DSE SearchAnalytics cluster allows analytics jobs to be performed using CQL queries. This
integration allows finer-grained control over the types of queries that are used in analytics workloads, and
improves performance by reducing the amount of data that is processed. However, a DSE SearchAnalytics
cluster does not provide workload isolation and there are no detailed guidelines for provisioning and performance
in production environments.
Nodes that are started in SearchAnalytics mode allow you to create analytics queries that use DSE Search
indexes. These queries return RDDs that are used by Spark jobs to analyze the returned data.
The following code shows how to use a DSE Search query from the DSE Spark console.
// keyspace, table, and query values are illustrative
val data = sc.cassandraTable("wiki", "solr")
  .select("id")
  .where("solr_query='title:natural'")
  .take(10)
For a detailed example, see Running the Wikipedia demo with SearchAnalytics.
Configuring a DSE SearchAnalytics cluster
1. Create DSE SearchAnalytics nodes in a mixed-workload cluster, as described in Initializing a single
datacenter per workload type.
The name of the datacenter is set to SearchAnalytics when using the DseSimpleSnitch. Do not modify
existing search or analytics nodes that use DseSimpleSnitch to be SearchAnalytics nodes. If you use
another snitch like GossipingPropertyFileSnitch you can have a mixed workload within a datacenter.
2. Perform load testing to ensure your hardware has enough CPU and memory for the additional resource
overhead that is required by Spark and Solr.
SearchAnalytics nodes always use driver paging settings. See Using pagination (cursors) with CQL Solr
queries.
SearchAnalytics nodes might consume more resources than search or analytics nodes. Resource
requirements of the nodes greatly depend on the type of query patterns you are using.
When in auto mode the predicate push down will do a COUNT operation against the Search indices both with
and without the predicate filters applied. If the number of records with the predicate filter is less than the result
of the following formula:
To create a temporary table in Spark SQL with Solr predicate push down enabled:
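A sketch of such a statement, assuming a keyspace ks and table tbl; the solr_optimization option name
should be verified against your DSE version:

CREATE TEMPORARY TABLE temp
  USING org.apache.spark.sql.cassandra
  OPTIONS (table "tbl", keyspace "ks", solr_optimization "auto");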
Traditional DSE Analytics deployments have both the DataStax database process and the Spark process
running on the same machine. This allows for simple deployment of analytic processing when the analysis is not
as intensive, or the database is not as heavily used.
DSE Analytics Solo allows customers to deploy DSE Analytics processing on segregated hardware in a
different datacenter from the transactional DSE nodes. This ensures consistent behavior of both engines in a
configuration that does not compete for compute resources. This configuration is good for
processing-intensive analytic workloads.
DSE Analytics Solo allows the flexibility to have more nodes dedicated to data processing than are used for
database transactions. This is particularly good for situations where the processing needs far exceed the
transactional resource needs. For example, suppose you have a Spark Streaming job that will analyze and
filter 99.9% of the incoming data, storing only a few records after analysis. The resources required by the
transactional datacenter are much smaller than the resources required to analyze the data.
DSE Analytics Solo is more elastic in terms of scaling up, or down, the analytic processing in the cluster. This is
particularly useful when you need extra analytics processing, such as end of the day or end of the quarter surges
in analytics jobs. Since a DSE Analytics Solo node does not store database data, when new nodes are added to
a cluster there is very little data moved across the network to the new nodes. In an analytics and transactional
collocated environment, adding a node means moving transactional data between the existing nodes and the
new nodes.
For information on creating a DSE Analytics Solo datacenter, see Creating a DSE Analytics Solo datacenter.
Analyzing data using Spark
Spark is the default mode when you start an analytics node in a packaged installation.
About Spark
Apache Spark is a framework for analyzing large data sets across a cluster, and is enabled when you start an
Analytics node. Spark runs locally on each node and executes in memory when possible. Spark uses multiple
threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of
several JVMs.
Apache Spark integration with DataStax Enterprise includes:
• AlwaysOn SQL
• Spark streaming
• SparkR integration
Spark architecture
The software components for a single DataStax Enterprise analytics node are:
• Spark Worker
• The database
A Spark Master acts purely as a resource manager for Spark applications. Spark Workers launch executors that
are responsible for executing part of the job that is submitted to the Spark Master. Each application has its own
set of executors. Spark architecture is described in the Apache documentation.
DSE Spark nodes use a different resource manager than standalone Spark nodes. The DSE Resource
Manager simplifies integration between Spark and DSE. In a DSE Spark cluster, client applications use the
CQL protocol to connect to any DSE node, and that node redirects the request to the Spark Master.
The communication between the Spark client application (or driver) and the Spark Master is secured the same
way as connections to DSE, which means that plain password authentication as well as Kerberos authentication
is supported, with or without SSL encryption. Encryption and authentication can be configured per application,
rather than per cluster. Authentication and encryption between the Spark Master and Worker nodes can be
enabled or disabled regardless of the application settings.
Spark supports multiple applications. A single application can spawn multiple jobs and the jobs run in parallel.
An application reserves some resources on every node and these resources are not freed until the application
finishes. For example, every session of Spark shell is an application that reserves resources. By default, the
scheduler tries to allocate the application across the highest number of different nodes. For example, if the application
declares that it needs four cores and there are ten servers, each offering two cores, the application most likely
gets four executors, each on a different node, each consuming a single core. However, the application can
also get two executors on two different nodes, each consuming two cores. You can configure the application
scheduler. Spark Workers and Spark Master are part of the main DSE process. Workers spawn executor JVM
processes which do the actual work for a Spark application (or driver). Spark executors use native integration to
access data in local transactional nodes through the Open Source Spark-Cassandra Connector. The memory
settings for the executor JVMs are set by the user submitting the driver to DSE.
In a deployment, one node in each Analytics datacenter runs the Spark Master, and Spark Workers run on each
of the nodes. The Spark Master comes with automatic high availability.
As you run Spark, you can access data in the Hadoop Distributed File System (HDFS), or the DataStax
Enterprise File System (DSEFS) by using the URL for the respective file system.
Highly available Spark Master
The Spark Master High Availability mechanism uses a special table in the dse_analytics keyspace to
store information required to recover Spark workers and the application. Reads to the recovery data in
dse_analytics are always performed using the LOCAL_QUORUM consistency level. Writes are attempted
first using LOCAL_QUORUM, and if that fails, the write is retried using LOCAL_ONE. Unlike the high availability
mechanism mentioned in Spark documentation, DataStax Enterprise does not use ZooKeeper.
If the original Spark Master fails, a reserve Spark Master automatically takes over. To find the current Spark Master,
run:
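For example, using the dse client-tool:

$ dse client-tool spark master-address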
The Spark Master will not start until LOCAL_QUORUM is attainable for the dse_analytics keyspace.
Unsupported features
The following Spark features and APIs are not supported:
By default, DSEFS is required to execute Spark applications; do not disable DSEFS when Spark is
enabled on a DSE node. If there is a strong reason not to use DSEFS as the default file system, reconfigure
Spark to use a different file system. For example, to use a local file system, set the following properties in
spark-daemon-defaults.conf:
spark.hadoop.fs.defaultFS=file:///
spark.hadoop.hive.metastore.warehouse.dir=file:///tmp/warehouse
How you start Spark depends on the installation and if you want to run in Spark mode or SearchAnalytics
mode:
Package installations:
To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set
SPARK_ENABLED to 1.
When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can
enable additional components.
Tarball installations:
To start the Spark trackers on a cluster of analytics nodes, use the -k option:
$ installation_location/bin/dse cassandra -k
Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not
configure a datacenter in the snitch property file.
You can enable additional components with startup options. For example, to start a node in SearchAnalytics mode,
use the -k and -s options:
$ installation_location/bin/dse cassandra -k -s
Starting the node with the Spark option starts a node that is designated as the master, as shown by the
Analytics(SM) workload in the output of the dsetool ring command:
$ dsetool ring
10.200.175.149  Analytics  rack1  Analytics(SM)  no  Up  Normal  185 KiB    ?  -9223372036854775808  0.90
10.200.175.148  Analytics  rack1  Analytics(SW)  no  Up  Normal  194.5 KiB  ?  0                     0.90
Note: you must specify a keyspace to get ownership information.
Launching Spark
After starting a Spark node, use dse commands to launch Spark.
Usage:
Package installations: dse spark
Tarball installations: installation_location/bin/dse spark
You can use Cassandra-specific properties to start Spark. Spark binds to the listen_address that is specified
in cassandra.yaml.
DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:
dse spark
Enters the interactive Spark shell, which offers basic auto-completion.
Package installations: dse spark
Tarball installations: installation_location/bin/dse spark
dse spark-submit
Launches applications on a cluster like spark-submit. Using this interface you can use Spark cluster
managers without the need for separate configurations for each application. The syntax for package
installations is:
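A general sketch of the syntax (class_name, jar_file, and the other options are placeholders):
$ dse spark-submit --class class_name jar_file other_options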
For example, if you write a class that defines an option named d, enter the command as follows:
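A hedged sketch, assuming a class com.datastax.HttpSparkStream packaged in target/HttpSparkStream.jar:
$ dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES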
The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide
authentication credentials as described in DSEFS authentication.
The dse spark-submit command supports the same options as Apache Spark's spark-submit. For
example, to submit an application using cluster mode using the supervise option to restart in case of
failure:
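A sketch using standard spark-submit options (the class and JAR names are placeholders):
$ dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar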
The directory in which you run the dse Spark commands must be writable by the current user.
export DSE_USERNAME=user
export DSE_PASSWORD=secret
These environment variables are supported for all Spark and dse client-tool commands.
DataStax recommends using the environment variables instead of passing user credentials on the
command line.
You can provide authentication credentials in several ways, see Credentials for authentication.
Specifying Spark URLs
You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any
Spark node in a datacenter, DSE will automatically discover the Master address and connect the client to the
Master.
Specify the URL for any Spark node using the following format:
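dse://[Spark node address[:port number]]?[parameter name=parameter value;]...
The address, port, and parameters are all optional.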
By default the URL is dse://?, which is equivalent to dse://localhost:9042. Any parameters you set in the
URL will override the configuration read from DSE's Spark configuration settings.
You can specify the work pool in which the application will be run by adding the workpool=work pool name as
a URL parameter. For example, dse://1.1.1.1:123?workpool=workpool2.
Valid parameters are CassandraConnectorConf settings with the spark.cassandra. prefix stripped. For
example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?
connection.local_dc=dc2.
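For example, a URL that combines several parameters might look like the following sketch (the address and values are hypothetical):
dse://10.10.0.2:9042?connection.local_dc=dc2;workpool=workpool2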
spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005
For a full list of ports used by DSE, see Securing DataStax Enterprise ports.
1. Export the DataStax Enterprise client configuration from the remote node to the client node:
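A sketch of the export command (the archive name is a placeholder):
$ dse client-tool configuration export dse-config.jar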
To set the driver host to a publicly accessible IP address, pass in the spark.driver.host option.
Kerberos authentication is not supported in the Spark web UI. If authentication is enabled and neither LDAP
nor internal authentication is available, the Spark web UI will not be accessible. If this occurs, disable
authentication for the Spark web UI only by removing the spark.ui.filters setting in
spark-daemon-defaults.conf, located in the Spark configuration directory.
DSE SSL encryption and authentication only apply to the Spark Master and Worker UIs, not the Spark Driver
UI. To use encryption and authentication with the Driver UI, refer to the Spark security documentation.
The UI includes information on the number of cores and amount of memory available to Spark in total and in
each work pool, and similar information for each Spark worker. The applications list the associated work pool.
See the Spark documentation for information on using the Spark web UI.
Authorization in the Spark web UI
When authorization is enabled and an authenticated user accesses the web UI, what they can see and do
is controlled by their permissions. This allows administrators to control who has permission to view specific
application logs, view the executors for the application, kill the application, and list all applications. Viewing and
modifying applications can be configured per datacenter, work pool, or application.
See Using authorization with Spark for details on granting permissions.
Displaying fully qualified domain names in the web UI
To display fully qualified domain names (FQDNs) in the Spark web UI, set the SPARK_PUBLIC_DNS variable in
spark-env.sh on each Analytics node.
Set SPARK_PUBLIC_DNS to the FQDN of the node if you have SSL enabled for the web UI.
Redirecting to the fully qualified domain name of the master
Set the SPARK_LOCAL_IP or SPARK_LOCAL_HOSTNAME in the spark-env.sh file on each node to the fully qualified
domain name (FQDN) of the node to force any redirects to the web UI using the FQDN of the Spark master.
This is useful when enabling SSL in the web UI.
If the tool is run on a server that is not part of the DSE cluster, see Running Spark commands against a
remote cluster.
Jupyter integration
Download and install Jupyter notebook on a DSE node.
To launch Jupyter notebook:
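A sketch, assuming the dse exec wrapper is used so that the notebook starts with the DSE Spark environment:
$ dse exec jupyter notebook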
A Jupyter notebook starts with the correct Python path. You must create a context to work with DSE. In
contrast to Livy and Zeppelin integrations, the Jupyter integration does not start an interpreter that creates a
context.
Livy integration
Download and install Livy on a DSE node. By default Livy runs Spark in local mode. Before starting Livy, create
a configuration file by copying conf/livy.conf.template to conf/livy.conf, then uncomment or add
the following two properties:
livy.spark.master = dse:///
livy.repl.enable-hive-context = true
To launch Livy:
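A sketch, again assuming the dse exec wrapper and Livy's standard launcher script:
$ dse exec livy-server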
RStudio integration
Download and install R on all DSE Analytics nodes, install RStudio desktop on one of the nodes, then run
RStudio:
These instructions are for RStudio desktop, not RStudio Server. In multiuser environments, DataStax recommends
using AlwaysOn SQL and JDBC connections rather than SparkR.
Zeppelin integration
Download and install Zeppelin on a DSE node. To launch Zeppelin server:
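A sketch, assuming the dse exec wrapper and Zeppelin's standard launcher script:
$ dse exec zeppelin.sh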
By default Zeppelin runs Spark in local mode. Update the master property to dse:/// in the Spark session in
the Interpreters configuration page. No configuration file changes are required to run Zeppelin.
Configuring Spark
Configuring Spark for DataStax Enterprise includes:
Configuring Spark nodes
Modify the settings for Spark node security, performance, and logging.
To manage Spark performance and operations:
The temporary directory for shuffle data, RDDs, and other ephemeral Spark data can be configured for both
the locally running driver and for the Spark server processes managed by DSE (Spark Master, Workers,
shuffle service, executor and driver running in cluster mode).
For the locally running Spark driver, the SPARK_LOCAL_DIRS environment variable can be customized in the
user environment or in spark-env.sh. By default, it is set to the system temporary directory. For example,
on Ubuntu it is /tmp/. If there's no system temporary directory, then SPARK_LOCAL_DIRS is set to a .spark
directory in the user's home directory.
For all other Spark server processes, the SPARK_EXECUTOR_DIRS environment variable can be customized in
the user environment or in spark-env.sh. By default it is set to /var/lib/spark/rdd.
The default SPARK_LOCAL_DIRS and SPARK_EXECUTOR_DIRS environment variable values differ from non-
DSE Spark.
To configure worker cleanup, modify the SPARK_WORKER_OPTS environment variable and add the cleanup
properties. The SPARK_WORKER_OPTS environment variable can be set in the user environment or in spark-
env.sh. For example, the following enables worker cleanup, sets the cleanup interval to 30 minutes (1800
seconds), and sets the length of time application worker directories are retained to 7 days (604800 seconds):
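A sketch of the spark-env.sh entry, using Spark's standard worker cleanup properties with the values described above:
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS \
  -Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"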
In multiple datacenter clusters, use a virtual datacenter to isolate Spark jobs. Running Spark jobs consumes
resources that can affect latency and throughput.
DataStax Enterprise supports the use of virtual nodes (vnodes) with Spark.
Secure Spark nodes
Client-to-node SSL
Ensure that the truststore entries in cassandra.yaml are present as described in Client-to-node
encryption, even when client authentication is not enabled.
Enabling security and authentication
Security is enabled using the spark_security_enabled option in dse.yaml. Enabling it turns on
authentication between the Spark Master and Worker nodes, and allows you to
enable encryption. To encrypt Spark connections for all components except the web UI, enable
spark_security_encryption_enabled. The length of the shared secret used to secure Spark
components is set using the spark_shared_secret_bit_length option, with a default value of 256
bits. These options are described in DSE Analytics options. For production clusters, enable both
authentication and encryption. Doing so does not significantly affect performance.
Authentication and Spark applications
If authentication is enabled, users need to be authenticated in order to submit an application.
Authorization and Spark applications
If DSE authorization is enabled, users need permission to submit an application. Additionally, the
user submitting the application automatically receives permission to manage the application, which
can optionally be extended to other users.
Database credentials for the Spark SQL Thrift server
In the hive-site.xml file, configure authentication credentials for the Spark SQL Thrift server. Ensure
that you use the hive-site.xml file in the Spark directory:
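The standard DSE locations are:
Package installations: /etc/dse/spark/hive-site.xml
Tarball installations: installation_location/resources/spark/conf/hive-site.xml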
Delegation tokens must be renewed periodically. For security reasons, the user who is authenticated with the
token should not be able to renew it. Therefore, delegation tokens have two associated users: the token owner
and the token renewer. The token renewer is set to none so that only a DSE internal process can renew it.
When the application is submitted, DSE automatically renews the delegation tokens that are associated with
the Spark application. When the application is unregistered (finished), delegation token renewal stops and the
token is cancelled.
To set Kerberos options, see Defining a Kerberos scheme.
Configure Spark memory and cores
Spark memory options affect different components of the Spark ecosystem:
Spark History server and the Spark Thrift server memory
The SPARK_DAEMON_MEMORY option configures the memory that is used by the Spark SQL
Thrift server and history-server. Add or change this setting in the spark-env.sh file on nodes that run
these server applications.
Spark Worker memory
The memory_total option in the resource_manager_options.worker_options section of dse.yaml
configures the total system memory that you can assign to all executors that are run by the work
pools on the particular node. The default work pool will use all of this memory if no other work pools
are defined. If you define additional work pools, you can set the total amount of memory by setting the
memory option in the work pool definition.
Application executor memory
You can configure the amount of memory that each executor can consume for the application. Spark
uses a 512MB default. Use either the spark.executor.memory option, described in "Spark Available
Properties", or the --executor-memory mem argument to the dse spark command.
Application memory
You can configure additional Java options that are applied by the worker when spawning an executor for
the application. Use the spark.executor.extraJavaOptions property, described in Spark 1.6.2 Available
Properties. For example: spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
-Dnumbers="one two three"
Core management
You can manage the number of cores by configuring these options.
• Application cores
In the Spark configuration object of your application, you configure the number of application cores that
the application requests from the cluster using either the spark.cores.max configuration property or the
--total-executor-cores cores argument to the dse spark command.
See the Spark documentation for details about memory and core allocation.
DataStax Enterprise can control the memory and cores offered by particular Spark Workers in a semi-automatic
fashion. The resource_manager_options.worker_options section in the dse.yaml file has options to
configure the proportion of system resources that are made available to Spark Workers and any defined
work pools, or explicit resource settings. When specifying decimal values of system resources the available
resources are calculated in the following way:
• Spark Worker memory = memory_total * (total system memory - memory assigned to DSE)
This calculation is used for any decimal values. If the setting is not specified, the default value 0.7 is used. If
the value does not contain a decimal place, the setting is the explicit number of cores or amount of memory
reserved by DSE for Spark.
Setting cores_total or a workpool's cores to 1.0 is a decimal value, meaning 100% of the available cores
will be reserved. Setting cores_total or cores to 1 (no decimal point) is an explicit value, and one core will
be reserved.
The lowest values you can assign to a named work pool's memory and cores are 64 MB and 1 core,
respectively. If the results are lower, no exception is thrown and the values are automatically limited.
The following example shows a work pool named workpool1 with 1 core and 512 MB of RAM assigned to it.
The remaining resources calculated from the values in worker_options are assigned to the default work
pool.
resource_manager_options:
worker_options:
cores_total: 0.7
memory_total: 0.7
workpools:
- name: workpool1
cores: 1
memory: 512M
# Uncomment the following line to make this snitch prefer the internal ip when possible, as the Ec2MultiRegionSnitch does
prefer_local=true
This tells the cluster to communicate only on private IP addresses within the datacenter rather than the public
routable IP addresses.
Configuring the number of retries to retrieve Spark configuration
When Spark fetches configuration settings from DSE, it will not fail immediately if it cannot retrieve the
configuration data, but will retry 5 times by default, with increasing delay between retries. The number of
retries can be set in the Spark configuration, by modifying the spark.dse.configuration.fetch.retries
configuration property when calling the dse spark command, or in spark-defaults.conf.
Disabling continuous paging
Continuous paging streams bulk amounts of records from DSE to the DataStax Java Driver
used by DSE Spark. By default, continuous paging in queries is enabled. To disable it, set the
spark.dse.continuous_paging_enabled setting to false when starting the Spark SQL shell or in spark-
defaults.conf. For example:
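A sketch of disabling continuous paging when starting the Spark SQL shell:
$ dse spark-sql --conf spark.dse.continuous_paging_enabled=false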
Using continuous paging can improve performance by up to a factor of three, though the improvement
will depend on the data and the queries. Some factors that impact the performance improvement are the
number of executor JVMs per node and the number of columns included in the query. Greater performance
gains were observed with fewer executor JVMs per node and more columns selected.
2. Set the SPARK_MASTER_WEBUI_PORT variable to the new port number. For example, to set it to port 7082:
export SPARK_MASTER_WEBUI_PORT=7082
To add the Graphite JARs to Spark in a package installation, copy them to the Spark lib directory:
spark.network.crypto.enabled true
spark.dseShuffle.noSasl.port
Default = 7437. The port number on which a shuffle service for unsecured applications is started. Bound
to the listen_address in cassandra.yaml.
By default, Spark executor logs, which contain the majority of your Spark application output, are
redirected to standard output. The output is managed by Spark Workers. Configure rolling logging by adding
spark.executor.logs.rolling.* properties to the spark-daemon-defaults.conf file.
spark.executor.logs.rolling.maxRetainedFiles 3
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 50000
Additional Spark properties that affect the master and driver can be added to spark-daemon-defaults.conf.
For example, to enable Spark's commons-crypto encryption library:
spark.network.crypto.enabled true
dse://10.200.181.62:9042?connection.local_dc=Analytics;connection.host=10.200.181.63,10.200.181.62
$ dsetool ring
• Query the dse_leases.leases table to list all the masters from each data center with Analytics nodes:
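For example, in cqlsh:
SELECT * FROM dse_leases.leases;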
Ensure that the replication factor is configured correctly for the dse_leases keyspace
If the dse_leases keyspace is not properly replicated, the Spark Master might not be elected.
Every time you add a new datacenter, you must manually increase the replication factor of the dse_leases
keyspace for the new DSE Analytics datacenter. If DataStax Enterprise or Spark security options are
enabled on the cluster, you must also increase the replication factor for the dse_security keyspace across
all logical datacenters.
The initial node in a cluster creates the dse_leases keyspace with a replication factor of 1 for its own
datacenter. However, any datacenters that you add later have a replication factor of 0 and require configuration
before you start DSE Analytics nodes. You must change the replication factor of the dse_leases keyspace for
multiple analytics datacenters. See Setting the replication factor for analytics keyspaces.
Monitoring the lease subsystem
All changes to lease holders are recorded in the dse_leases.logs table. Most of the time, you do not want to
enable logging.
1. To turn on logging, ensure that lease_metrics_options is enabled in the dse.yaml file:
lease_metrics_options:
    enabled: true
    ttl_seconds: 604800
 name              | dc  | monitor       | at                              | new_holder    | old_holder
-------------------+-----+---------------+---------------------------------+---------------+------------
 Leader/master/6.0 | dc1 | 10.200.180.44 | 2018-05-17 00:45:02.971000+0000 | 10.200.180.44 |
 Leader/master/6.0 | dc1 | 10.200.180.49 | 2018-05-17 02:37:07.381000+0000 | 10.200.180.49 |
3. When lease_metrics_options is enabled, you can examine the acquire, renew, resolve, and disable
operations. Most of the time, these operations should complete in 100 ms or less:
4. If the log warnings and errors do not contain relevant information, edit the logback.xml file and add:
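A hedged sketch; the logger name for the lease subsystem is an assumption:
<logger name="com.datastax.bdp.leasemanager" level="DEBUG"/>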
Troubleshooting
Perform these lease holder troubleshooting steps before you contact DataStax Support.
Verify the workload status
Run the dsetool ring command:
$ dsetool ring
If the replication factor is inadequate or if the replicas are down, the output of the dsetool ring
command contains a warning:
10.200.178.232  SearchGraphAnalytics  rack1  SearchAnalytics      yes  Up  Normal  153.04 KiB  ?  -9223372036854775808  0.00
10.200.178.230  SearchGraphAnalytics  rack1  SearchAnalytics(SM)  yes  Up  Normal  92.98 KiB   ?  0                     0.000
If the automatic Job Tracker or Spark Master election fails, verify that an appropriate replication factor
is set for the dse_leases keyspace.
Use cqlsh commands to verify the replication factor of the analytics keyspaces
• SPARK_WORKER_DIR/worker-n/application_id/executor_id/stderr
• SPARK_WORKER_DIR/worker-n/application_id/executor_id/stdout
2. If you want to enable rolling logging for Spark executors, add the following options to spark-daemon-
defaults.conf.
Enable rolling logging with 3 log files retained before deletion. The log files are broken up by size with a
maximum size of 50,000 bytes.
spark.executor.logs.rolling.maxRetainedFiles 3
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 50000
The default location of the Spark configuration files depends on the type of installation:
When user credentials are specified in plain text on the dse command line, like dse -u username
-p password, the credentials are present in the logs of Spark workers when the driver is run in
cluster mode.
The Spark Master, Spark Worker, executor, and driver logs might include sensitive information.
Sensitive information includes passwords and digest authentication tokens for Kerberos authentication
mode that are passed in the command line or Spark configuration. DataStax recommends using
only safe communication channels like VPN and SSH to access the Spark user interface.
You can provide authentication credentials in several ways, see Credentials for authentication.
• All simultaneously running applications deployed by a single DSE service user will be run as a single OS
user.
• Applications deployed by different DSE service users will be run by different OS users.
• All applications will be run as a different OS user than the DSE service user.
This allows you to prevent an application from accessing DSE server private files, and prevent one application
from accessing the private files of another application.
How the run_as process runner works
DSE uses sudo to run Spark application components (drivers and executors) as specific OS users. DSE
doesn't link a DSE service user with a particular OS user. Instead, a configurable number of spare user
accounts or slots are used. When a request to run an executor or a driver is received, DSE finds an unused
slot, and locks it for that application. Until the application is finished, all of that application's processes run as
that slot user. When the application completes, the slot user will be released and will be available to other
applications.
Since the number of slots is limited, a single slot is shared among all the simultaneously running applications
run by the same DSE service user. Such a slot is released once all the applications of that user are removed.
When there are not enough slots to run an application, an error is logged and DSE will try to run the executor or
driver on a different node. DSE does not limit the number of slots you can configure. If you need to run more
applications simultaneously, create more slot users.
Slots assignment is done on a per node basis. Executors of a single application may run as different slot users
on different DSE nodes. When multiple DSE instances run within the same OS on a single large (fat) node, each
instance should be configured with a different set of slot users. If they use the same slot users, a single OS user
may run the applications of two different DSE service users.
When a slot is released, all directories which are normally managed by Spark for the application are removed.
If the application doesn't finish, but all executors are done on a node, and a slot user is about to be released,
all the application files are modified so that their ownership is changed to the DSE service user with owner-
only permission. When a new executor for this application is run on this node, the application files are
reassigned back to the slot user assigned to that application.
Configuring the run_as process runner
The administrator needs to prepare slot users in the OS before configuring DSE. The run_as process runner
requires:
• Each slot user has its own primary group, whose name is the same as the name of the slot user. This is
typically the default behavior of the OS. For example, the slot1 user's primary group is slot1.
• The DSE service user is a member of each slot's primary group. For example, if the DSE service user is
cassandra, the cassandra user is a member of the slot1 group.
• The DSE service user is a member of a group with the same name as the service user. For example, if
the DSE service user is cassandra, the cassandra user is a member of the cassandra group.
• sudo is configured so that the DSE service user can execute any command as any slot user without
providing a password.
Override the umask setting to 007 for slot users so that files created by sub-processes will not be accessible by
anyone else by default, and DSE configuration files are not visible to slot users.
You may further secure the DSE server environment by modifying the OS's limits.conf file to set exact disk
space quotas for each slot user.
After adding the slot users and groups and configuring the OS, modify the dse.yaml file. In the
spark_process_runner section enable the run_as process runner and set the list of slot users on each node.
spark_process_runner:
# Allowed options are: default, run_as
runner_type: run_as
run_as_runner_options:
user_slots:
- slot1
- slot2
3. Make sure the DSE service user is a member of a group with the same name as the service user. For
example, if the DSE service user is cassandra:
$ groups cassandra
cassandra : cassandra
4. Log out and back in again to make the group changes take effect.
6. Modify dse.yaml to enable the run_as process runner and add the new runners.
# Configure how the driver and executor processes are created and managed.
spark_process_runner:
    # Allowed options are: default, run_as
    runner_type: run_as
    # The run_as runner uses sudo to start Spark drivers and executors. A set of
    # predefined fake users, called slots, is used for this purpose. All drivers and
    # executors owned by some DSE user are run as some slot user x. At the same time,
    # drivers and executors of any other DSE user use different slots.
    run_as_runner_options:
        user_slots:
            - slot1
            - slot2
2. On each node in the cluster, edit the spark-defaults.conf file to enable event logging and specify the
directory for event logs:
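A sketch of the spark-defaults.conf entries, using Spark's standard event log properties and the DSEFS location referenced below:
spark.eventLog.enabled true
spark.eventLog.dir dsefs:///spark/events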
3. Start the Spark history server on one of the nodes in the cluster:
The Spark history server is a front-end application that displays logging data from all nodes in the
Spark cluster. It can be started from any node in the cluster.
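A sketch of the start command:
$ dse spark-history-server start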
If you've enabled authentication, set the authentication method and credentials in a properties file and
pass it to the dse command. For example, for basic authentication:
spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.username=role name
spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.password=password
If you set the event log location in spark-defaults.conf, set the spark.history.fs.logDirectory
property in your properties file.
spark.history.fs.logDirectory=dsefs:///spark/events
If you specify a properties file, none of the configuration in spark-defaults.conf is used. The
properties file should contain all the required configuration properties.
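For example, to start the history server with a properties file (the file name is a placeholder):
$ dse spark-history-server start --properties-file history.properties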
The history server is started and can be viewed by opening a browser to http://node
hostname:18080.
The Spark Master web UI does not show the historical logs. To work around this known issue,
access the history from port 18080.
4. When event logging is enabled, the default behavior is for all logs to be saved, which causes the storage
to grow over time. To enable automated cleanup, edit spark-defaults.conf and set the following
options:
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
For these settings, automated cleanup is enabled, the cleanup is performed daily, and logs older than
seven days are deleted.
You pass settings for Spark, Spark Shell, and other DataStax Enterprise Spark built-in applications using the
intermediate application spark-submit, described in Spark documentation.
Configuring the Spark shell
Pass Spark configuration arguments using the following syntax:
[--help] [--verbose]
[--conf name=spark.value|sparkproperties.conf]
[--executor-memory memory]
[--jars additional-jars]
[--master dse://?appReconnectionTimeoutSeconds=secs]
[--properties-file path_to_properties_file]
[--total-executor-cores cores]
--conf name=spark.value|sparkproperties.conf
An arbitrary Spark option to add to the Spark configuration, prefixed by spark.
• name=spark.value - a single property and its value
• sparkproperties.conf - a configuration file containing multiple properties
--executor-memory mem
The amount of memory that each executor can consume for the application. Spark uses a 512 MB
default. Specify the memory argument in JVM format using the k, m, or g suffix.
--help
Shows a help message that displays all options except DataStax Enterprise Spark shell options.
--jars path_to_additional_jars
A comma-separated list of paths to additional JAR files.
--properties-file path_to_properties_file
The location of the properties file that has the configuration settings. By default, Spark loads the
settings from spark-defaults.conf.
--total-executor-cores cores
The total number of cores the application uses.
--verbose
Displays which arguments are recognized as Spark configuration options and which arguments are
forwarded to the Spark shell.
Spark shell application arguments:
-i app_script_file
Spark shell application argument that runs a script from the specified file.
Configuring Spark applications
You pass the Spark submission arguments using the following syntax:
--files files
A comma-separated list of files that are distributed among the executors and available for the
application.
In general, Spark submission arguments are translated into system properties -Dname=value and other VM
parameters like classpath. The application arguments are passed directly to the application.
Property list
When you run dse spark-submit on a node in your Analytics cluster, all the following properties are set
automatically, and the Spark Master is automatically detected. Only set the following properties if you need to
override the automatically managed properties.
spark.cassandra.connection.native.port
Default = 9042. Port for native client protocol connections.
spark.cassandra.connection.rpc.port
Default = 9160. Port for thrift connections.
spark.cassandra.connection.host
The host name or IP address to which the Thrift RPC service and native transport are bound.
The native_transport_address property in the cassandra.yaml, which is localhost by default,
determines the default value of this property.
You can explicitly set the Spark Master address using the --master master address parameter to dse spark-
submit.
Read properties
spark.cassandra.input.split.size
Default = 100000. Approximate number of rows in a single Spark partition. The higher the value, the
fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
spark.cassandra.input.fetch.size_in_rows
Default = 1000. Number of rows being fetched per round-trip to the database. Increasing this value
increases memory consumption. Decreasing the value increases the number of round-trips. In earlier
releases, this property was spark.cassandra.input.page.row.size.
spark.cassandra.input.consistency.level
Default = LOCAL_ONE. Consistency level to use when reading.
Write properties
You can set the following properties in SparkConf to fine tune the saving process.
spark.cassandra.output.batch.size.bytes
Default = 1024. Maximum total size of a single batch in bytes.
spark.cassandra.output.consistency.level
Default = LOCAL_QUORUM. Consistency level to use when writing.
spark.cassandra.output.concurrent.writes
• Make sure all keyspaces in the DC1 datacenter use NetworkTopologyStrategy. If necessary, alter the
keyspace.
• Add nodes to a new datacenter named DC2, then enable Analytics on those nodes.
• Configure the dse_leases and dse_analytics keyspaces to replicate to both DC1 and DC2. For example:
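A sketch, assuming datacenters named DC1 and DC2 with a replication factor of 3 in each:
ALTER KEYSPACE dse_leases WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
ALTER KEYSPACE dse_analytics WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};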
• When submitting Spark applications specify the --master URL with the name or IP address of a node in
the DC2 datacenter, and set the spark.cassandra.connection.local_dc configuration option to DC1.
Accessing an external DSE transactional cluster from a DSE Analytics Solo cluster
To access an external DSE transactional cluster, explicitly set the connection to the transactional cluster when
creating RDDs or Datasets within the application.
In the following examples, the external DSE transactional cluster has a node running on 10.10.0.2.
To create an RDD from the transactional cluster's data:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext

// Define the connector to the external transactional cluster (node at 10.10.0.2)
val connectorToTransactionalCluster = CassandraConnector(
  sc.getConf.set("spark.cassandra.connection.host", "10.10.0.2"))

val rddFromTransactionalCluster = {
  // Sets connectorToTransactionalCluster as the default connection for everything in this code block
  implicit val c = connectorToTransactionalCluster
  // get the data from the test.words table
  sc.cassandraTable("test", "words")
}
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf

// Point a named cluster configuration at the external transactional cluster.
// The alias "TransactionalCluster" is an arbitrary example name.
spark.setCassandraConf("TransactionalCluster",
  CassandraConnectorConf.ConnectionHostParam.option("10.10.0.2"))

val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test", "cluster" -> "TransactionalCluster"))
  .load()
When you submit the application to the DSE Analytics Solo cluster, it will retrieve the data from the external
DSE transactional cluster.
Spark JVMs and memory management
Spark jobs running on DataStax Enterprise are divided among several different JVM processes, each with
different memory requirements.
DataStax Enterprise and Spark Master JVMs
The Spark Master runs in the same process as DataStax Enterprise, but its memory usage is negligible. The
only way Spark could cause an OutOfMemoryError in DataStax Enterprise is indirectly by executing queries
that fill the client request queue. For example, if it ran a query with a high limit and paging was disabled or it
used a very large batch to update or insert data in a table. This is controlled by MAX_HEAP_SIZE in cassandra-
env.sh. If you see an OutOfMemoryError in system.log, you should treat it as a standard OutOfMemoryError
and follow the usual troubleshooting steps.
Spark executor JVMs
The Spark executor is where Spark performs transformations and actions on the RDDs and is usually
where a Spark-related OutOfMemoryError would occur. An OutOfMemoryError in an executor will show
up in the stderr log for the currently executing application (usually in /var/lib/spark). There are several
configuration settings that control executor memory and they interact in complicated ways.
• spark.executor.memory is a system property that controls how much executor memory a specific
application gets. It must be less than or equal to the calculated value of memory_total. It can be specified
in the constructor for the SparkContext in the driver application, or via --conf spark.executor.memory
or --executor-memory command line options when submitting the job using spark-submit.
• SPARK_DRIVER_MEMORY in spark-env.sh
Spark Streaming applications require synchronized clocks to operate correctly. See Synchronize clocks.
The following Scala example demonstrates how to connect to a text input stream at a particular IP address
and port, count the words in the stream, and save the results to the database.
import org.apache.spark.streaming._
2. Create a new StreamingContext object based on an existing SparkConf configuration object, specifying
the interval in which streaming data will be divided into batches by passing in a batch duration.
Spark allows you to specify the batch duration in milliseconds, seconds, and minutes.
3. Import the database-specific functions for StreamingContext, DStream, and RDD objects.
import com.datastax.spark.connector.streaming._
4. Create the DStream object that will connect to the IP and port of the service providing the data stream.
5. Count the words in each batch and save the data to the table.
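Steps 2 through 5 combined might look like the following sketch; the source address, the batch duration, and the streaming_test.words table are assumptions consistent with the verification query shown below:
import com.datastax.spark.connector._            // for SomeColumns
val ssc = new StreamingContext(sc, Seconds(1))   // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)  // the text input stream
val wordCounts = lines.flatMap(_.split(" "))     // split each batch into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)                            // count each word per batch
wordCounts.saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))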
ssc.start()
ssc.awaitTermination()
In the following example, you start a service using the nc utility that repeats strings, then consume the
output of that service using Spark Streaming.
Using cqlsh, start by creating a target keyspace and table for streaming to write into.
$ nc -lk 9999
one two two three three three four four four four someword
$ dse spark
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
Using cqlsh, connect to the streaming_test keyspace and run a query to show the results.
$ cqlsh -k streaming_test
 word     | count
----------+-------
    three |     3
      one |     1
      two |     2
     four |     4
 someword |     1
What's next:
Run the http_receiver demo. See the Spark Streaming Programming Guide for more information, API
documentation, and examples on Spark Streaming.
Creating a Spark Structured Streaming sink using DSE
Spark Structured Streaming is a high-level API for streaming applications. DSE supports Structured
Streaming for storing data into DSE.
The following Scala example shows how to store data from a streaming source to DSE using the
cassandraFormat method.
This example sets the OutputMode to Update, described in the Spark API documentation.
The cassandraFormat method is equivalent to calling the format method with
org.apache.spark.sql.cassandra as the format.
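A minimal sketch, assuming a streaming DataFrame named streamingCountsDF, a DSEFS checkpoint location, and a target table streaming_test.words:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.streaming.OutputMode

val query = streamingCountsDF.writeStream
  .option("checkpointLocation", "dsefs:///checkpoint")
  .outputMode(OutputMode.Update)
  .cassandraFormat("words", "streaming_test")
  .start()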
Any tables you create or destroy, and any table data you delete, in a Spark SQL session will not be
reflected in the underlying DSE database, but only in that session's metastore.
$ dse spark-sql
The Spark SQL shell in DSE automatically creates a Spark session and connects to the Spark SQL Thrift
server to handle the underlying JDBC connections.
If the schema changes in the underlying database table during a Spark SQL session (for example, a column
was added using CQL), drop the table and then refresh the metastore to continue querying the table with the
correct schema.
Queries to a table whose schema has been modified cause a runtime exception.
Spark SQL limitations
• You cannot load data from one file system to a table in a different file system.
CREATE TABLE IF NOT EXISTS test (id INT, color STRING) PARTITIONED BY (ds STRING);
LOAD DATA INPATH 'hdfs2://localhost/colors.txt' OVERWRITE INTO TABLE test PARTITION
(ds ='2008-08-15');
The first line creates a table on the default file system. The second line attempts to load data into that
table from a path on a different file system, and will fail.
$ dse spark
2. Use the sql method to pass in the query, storing the result in a variable.
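A sketch; the keyspace and table names are placeholders consistent with the output below:
val results = spark.sql("SELECT * FROM ks.things")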
results.show()
+--------------------+-----------+
| id|description|
+--------------------+-----------+
|de2d0de1-4d70-11e...| thing|
|db7e4191-4d70-11e...| another|
|d576ad50-4d70-11e...|yet another|
+--------------------+-----------+
After the Spark session instance is created, you can use it to create a DataFrame instance from the query.
Queries are executed by calling the SparkSession.sql method.
employees.collect();
If you have properties that are spelled the same but with different capitalizations (for example, id and Id),
start Spark SQL with the --conf spark.sql.caseSensitive=true option.
Prerequisites:
Start your cluster with both Graph and Spark enabled.
$ dse spark-sql
USE dse_graph;
SELECT * FROM gods_vertices where name = 'Zeus';
Vertices are identified by id columns. Edge tables have src and dst columns that identify the from
and to vertices, respectively. A join can be used to traverse the graph. For example, to find all vertex
IDs that are reached by the out edges:
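A hedged sketch; the gods_edges table name is an assumption following the naming of gods_vertices:
SELECT gods_edges.dst FROM gods_vertices JOIN gods_edges ON gods_vertices.id = gods_edges.src;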
What's next: The same steps work from the Spark shell using spark.sql() to run the query statements, or
using the JDBC/ODBC driver and the Spark SQL Thrift Server.
Using Spark predicate push down in Spark SQL queries
Pushing Spark predicates down to the database allows for better-optimized Spark queries. A predicate is a condition
on a query that returns true or false, typically located in the WHERE clause. A predicate push down filters
the data in the database query, reducing the number of entries retrieved from the database and improving
query performance. By default the Spark Dataset API will automatically push down valid WHERE clauses to the
database.
You can also use predicate push down on DSE Search indices within SearchAnalytics data centers.
Restrictions on column filters
Partition key columns can be pushed down as long as:
Clustering key columns can be pushed down with the following rules:
• Only the last predicate in the filter can be a non equivalence predicate.
• If there is more than one predicate for a column, the predicates cannot be equivalence predicates.
INSERT INTO words (user, word, count ) VALUES ( 'Zebra', 'zed', 100 );
Then create a Spark Dataset in the Spark console using that table and look for PushedFilters in the output
after issuing the EXPLAIN command:
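A sketch, assuming the words table lives in the test keyspace:
import org.apache.spark.sql.cassandra._
val dataset = spark.read.cassandraFormat("words", "test").load()
dataset.explain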
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user#0,word#1,count#2]
ReadSchema: struct<user:string,word:string,count:int>
Because this query doesn't filter on columns capable of being pushed down, there are no PushedFilters in
the physical plan.
Adding a filter, however, does change the physical plan to include PushedFilters:
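For example, continuing the sketch above:
dataset.filter("word > 'ham'").explain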
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[user#0,word#1,count#2] PushedFilters: [*GreaterThan(word,ham)], ReadSchema:
struct<user:string,word:string,count:int>
The PushedFilters section of the physical plan includes the GreaterThan push down filter. The asterisk
indicates that the push down filter will be handled only at the datasource level.
Troubleshooting predicate push down
When creating Spark SQL queries that use comparison operators, making sure the predicates are pushed
down to the database correctly is critical to retrieving the correct data with the best performance.
For example, given a CQL table with the following schema:
Suppose you want to write a query that selects all entries where the birthday is earlier than a given date:
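A sketch of such a query in the Spark SQL shell (the table name is a placeholder):
SELECT * FROM users WHERE birthday < '2001-1-1';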
== Physical Plan ==
*Filter (cast(birthday#1 as string) < 2001-1-1)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[year#0,birthday#1,userid#2,likes#3,name#4] ReadSchema:
struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.72 seconds, Fetched 1 row(s)
Note that the Filter directive is treating the birthday column, a CQL TIMESTAMP, as a string. The query
optimizer looks at this comparison and needs to make the types match before generating a predicate. In
this case the optimizer decides to cast the birthday column as a string to match the string '2001-1-1',
but cast functions cannot be pushed down. The predicate isn't pushed down, and it doesn't appear in
PushedFilters. A full table scan will be performed at the database layer, with the results returned to Spark
for further processing.
To push down the correct predicate for this query, use the cast method to specify that the predicate is
comparing the birthday column to a TIMESTAMP, so the types match and the optimizer can generate the
correct predicate.
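A sketch of the corrected query, using the same placeholder table:
SELECT * FROM users WHERE birthday < cast('2001-1-1' as TIMESTAMP);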
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[year#0,birthday#1,userid#2,likes#3,name#4]
PushedFilters: [*LessThan(birthday,2001-01-01 00:00:00.0)],
ReadSchema: struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.034 seconds, Fetched 1 row(s)
Note the PushedFilters indicating that the LessThan predicate will be pushed down for the column data in
birthday. This should speed up the query as a full table scan will be avoided.
SELECT statement
FROM statement
[JOIN | INNER JOIN | LEFT JOIN | LEFT SEMI JOIN | LEFT OUTER JOIN | RIGHT JOIN | RIGHT
OUTER JOIN | FULL JOIN | FULL OUTER JOIN]
ON join condition
SELECT statement 1
[UNION | UNION ALL | UNION DISTINCT | INTERSECT | EXCEPT]
SELECT statement 2
Select queries run on new columns return '', or empty results, instead of None.
You can remove a table from the cache using an UNCACHE TABLE query.
UPPER
LOWER
REGEXP
ORDER
OUTER
RIGHT
SELECT
SEMI
STRING
SUM
TABLE
TIMESTAMP
TRUE
UNCACHE
UNION
WHERE
INTERSECT
EXCEPT
SUBSTR
SUBSTRING
SQRT
ABS
Inserting data into tables with static columns using Spark SQL
Static columns are mapped to different columns in Spark SQL and require special handling. Spark SQL Thrift
servers use Hive. When you run an insert query, you must pass data to those columns.
To work around the different columns, set cql3.output.query in the insertion Hive table properties to
limit the columns that are being inserted. In Spark SQL, alter the external table to configure the prepared
statement as the value of the Hive CQL output query. For example, this prepared statement takes values that
are inserted into columns a and b in mytable and maps these values to columns b and a, respectively, for
insertion into the new row.
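A hedged sketch; the external table name and keyspace are placeholders:
ALTER TABLE mytable SET TBLPROPERTIES ('cql3.output.query' = 'INSERT INTO ks.mytable (b, a) VALUES (?, ?)');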
$ bin/dse spark
2. Use the provided HiveContext instance sqlContext to create a new query in HiveQL by calling the sql
method on the sqlContext object.
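For example (the keyspace and table are placeholders):
val results = sqlContext.sql("SELECT * FROM ks.mytable")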
$ dse pyspark
table1 = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="ks") \
    .load()
table1.write.format("org.apache.spark.sql.cassandra") \
    .options(table="othertable", keyspace="ks") \
    .save(mode="append")
Using the DSE Spark console, the following Scala example shows how to create a DataFrame object from
one table and save it to another.
$ dse spark
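A sketch using the cassandraFormat helper; the table names follow the pyspark example above and the keyspace matches the equivalent command below:
import org.apache.spark.sql.cassandra._
val table1 = spark.read.cassandraFormat("kv", "test").load()
table1.write.cassandraFormat("othertable", "test").save()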
The write operation uses one of the helper methods, cassandraFormat, included in the Spark Cassandra
Connector. This is a simplified way of setting the format and options for a standard DataFrame operation. The
following command is equivalent to the write operation using cassandraFormat:
table1.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "othertable", "keyspace" -> "test"))
.save()
another node. Both AlwaysOn SQL and the Spark SQL Thriftserver provide JDBC and ODBC interfaces to
DSE, and share many configuration settings.
1. If you are using Kerberos authentication, in the hive-site.xml file, configure your authentication
credentials for the Spark SQL Thrift server.
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>thriftserver/[email protected]</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/etc/dse/dse.keytab</value>
</property>
Ensure that you use the hive-site.xml file in the Spark directory:
3. Start the server by entering the dse spark-sql-thriftserver start command as a user with
permissions to write to the Spark directories.
To override the default settings for the server, pass in the configuration property using the --hiveconf
option. See the HiveServer2 documentation for a complete list of configuration properties.
By default, the server listens on port 10000 on the localhost interface on the node from which it was
started. You can specify the server to start on a specific port. For example, to start the server on port
10001, use the --hiveconf hive.server2.thrift.port=10001 option.
You can configure the port and bind address permanently in resources/spark/conf/spark-env.sh:
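A sketch using HiveServer2's standard environment variables (the values are examples):
export HIVE_SERVER2_THRIFT_PORT=10001
export HIVE_SERVER2_THRIFT_BIND_HOST=10.10.0.1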
You can specify general Spark configuration settings by using the --conf option.
4. Use DataFrames to read and write large volumes of data. For example, to create the table_a_cass_df
table that uses a DataFrame while referencing table_a:
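A sketch, assuming table_a lives in the keyspace ks:
CREATE TABLE table_a_cass_df USING org.apache.spark.sql.cassandra OPTIONS (table "table_a", keyspace "ks");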
With DataFrames, compatibility issues exist with UUID and Inet types when inserting data with the
JDBC driver.
5. Use the Spark Cassandra Connector tuning parameters to optimize reads and writes.
What's next:
You can now connect your application to the server at the URI jdbc:hive2://hostname:port number using
the Simba JDBC driver, the Simba ODBC driver, or dse beeline.
Starting SparkR
Start the SparkR shell using the dse command to automatically set the Spark session within R.
$ dse sparkR
Lifecycle Manager allows you to enable and configure AlwaysOn SQL in managed clusters.
When AlwaysOn SQL is enabled within an Analytics datacenter, all nodes within the datacenter must have
AlwaysOn SQL enabled. Use dsetool ring to find which nodes in the datacenter are Analytics nodes.
AlwaysOn SQL is not supported when using DSE Multi-Instance or other deployments with multiple DSE
instances on the same server.
The dse client-tool alwayson-sql command controls the server. The command works on the local
datacenter unless you specify the datacenter with the --dc option:
• reserve_port_wait_time_ms
• alwayson_sql_status_check_wait_time_ms
• log_dsefs_dir
• runner_max_errors
Changing any other option requires a restart of AlwaysOn SQL, except for the enabled option. Enabling or
disabling AlwaysOn SQL requires restarting DSE.
The spark-alwayson-sql.conf file contains Spark and Hive settings as properties. When AlwaysOn SQL is
started, spark-alwayson-sql.conf is scanned for Spark properties, similar to other Spark applications started
with dse spark-submit. Properties that begin with spark.hive are submitted as properties using --hiveconf,
removing the spark. prefix.
For example, if spark-alwayson-sql.conf has the following setting:
spark.hive.server2.table.type.mapping CLASSIC
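AlwaysOn SQL would then be started with the equivalent of:
--hiveconf hive.server2.table.type.mapping=CLASSIC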
Under spark.master, set the Spark URI to connect to the DSE Analytics Solo datacenter.
spark.master=dse://?connection.local_dc=dc1
spark.cassandra.connection.local_dc=dc0
To start the server on a specific datacenter, specify the datacenter name with the --dc option:
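For example (the datacenter name is a placeholder):
$ dse client-tool alwayson-sql --dc dc2 start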
You can also view the status in a web browser by going to http://node name or IP address:AlwaysOn SQL
web UI port. By default, the port is 9077. For example, if 10.10.10.1 is the IP address of an Analytics node with
AlwaysOn SQL enabled, navigate to http://10.10.10.1:9077.
The returned status is one of:
• STOPPED_AUTO_RESTART: the server is being started but is not yet ready to accept client requests.
• STOPPED_MANUAL_RESTART: the server was stopped with either a stop or restart command. If the server
was issued a restart command, the status will be changed to STOPPED_AUTO_RESTART as the server
starts again.
• STARTING: the server is actively starting up but is not yet ready to accept client requests.
The temporary cache table is only valid for the session in which it was created, and will not be recreated on
server restart.
Create a permanent cache table using the CREATE CACHE TABLE directive and a SELECT query:
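A sketch; the cache table and source names are placeholders:
CREATE CACHE TABLE words_cache AS SELECT * FROM ks.words;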
The table cache can be destroyed using the UNCACHE TABLE and CLEAR CACHE directives.
CLEAR CACHE;
Issuing DROP TABLE will remove all metadata including the table cache.
Enabling SSL for AlwaysOn SQL
Communication between the driver and AlwaysOn SQL can be encrypted using SSL.
The following instructions give an example of how to set up SSL with a self-signed keystore and truststore.
2. If the SSL keystore and truststore used for AlwaysOn SQL differ from the keystore and truststore
configured in cassandra.yaml, add the required settings to enable SSL to the hive-site.xml configuration
file.
By default the SSL settings in cassandra.yaml will be used with AlwaysOn SQL.
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hostname</value>
</property>
<property>
<name>hive.server2.use.SSL</name>
<value>true</value>
</property>
<property>
<name>hive.server2.keystore.path</name>
<value>path to keystore/keystore.jks</value>
</property>
<property>
<name>hive.server2.keystore.password</name>
<value>keystore password</value>
</property>
Changes to the hive-site.xml configuration file require a restart of the AlwaysOn SQL service only,
not DSE.
$ dse beeline
jdbc:spark://hostname:10000/default;SSL=1;SSLTrustStore=path to truststore/truststore.jks;SSLTrustStorePwd=truststore password
DSE supports multiple authentication mechanisms, but AlwaysOn SQL only supports one mechanism per
datacenter.
AlwaysOn SQL supports DSE proxy authentication. The user who executes the queries is the user who
authenticated using JDBC. If AlwaysOn SQL was started by user Amy, and then Bob begins a JDBC session,
the queries are executed by Amy on behalf of Bob. Amy must have permissions to execute these queries on
behalf of Bob.
To enable authentication in the AlwaysOn SQL alwayson_sql_options section, follow these steps.
1. Create the auth_user role specified in AlwaysOn SQL options and grant the following permissions to the
role.
If you use Kerberos, set up a role that matches the full Kerberos principal name for each user.
4. Allow the AlwaysOn SQL role (auth_user) to execute commands with the user role.
For internal roles:
GRANT PROXY.EXECUTE
ON ROLE 'user_name'
TO alwayson_sql;
For Kerberos roles, grant on the full principal name:
GRANT PROXY.EXECUTE
ON ROLE 'user_name/host_name@REALM'
TO alwayson_sql;
• If Kerberos authentication is used, Kerberos does not need to be enabled in DSE, but AlwaysOn
SQL must have its own service principal and keytab.
• The user must have login permissions in DSE in order to log in through JDBC to AlwaysOn SQL.
This example shows how to enable Kerberos authentication. Modify the Kerberos domain and path to the
keytab file.
<!-- Start of: configuration for authenticating JDBC users with Kerberos -->
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hiveserver2/_HOST@KERBEROS DOMAIN</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>path to hiveserver2.keytab</value>
</property>
<!-- End of: configuration for authenticating JDBC users with Kerberos -->
7. Modify the owner of the /spark and /tmp/hive directories in DSEFS so the new role can write to the log
and temp files.
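A hedged sketch, assuming the role is named alwayson_sql and using the DSEFS shell's chown command:
$ dse fs 'chown alwayson_sql /spark' 'chown alwayson_sql /tmp/hive'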
$ dse beeline
3. Connect to the server using the JDBC URI for your server.
This generates the byos.properties file in your home directory. See dse client-tool for more
information on its options.
What's next:
The byos.properties file can be copied to a node in the external Spark cluster and used with the Spark shell,
as described in Connecting to DataStax Enterprise using the Spark shell on an external Spark cluster.
Connecting to DataStax Enterprise using the Spark shell on an external Spark cluster
Use the generated byos.properties configuration file and the byos-version.jar from a DataStax Enterprise
node to connect to the DataStax Enterprise cluster from the Spark shell on an external Spark cluster.
Prerequisites:
You must generate the byos.properties file on a node in your DataStax Enterprise cluster.
1. Copy the byos.properties file you previously generated from the DataStax Enterprise node to the local
Spark node.
$ scp user@dse_node:~/byos.properties .
If you are using Kerberos authentication, specify the --generate-token and --token-renewer
<username> options when generating byos.properties, as described in dse client-tool configuration
byos-export.
2. Copy the byos-version.jar file from the clients directory from a node in your DataStax Enterprise cluster
to the local Spark node.
The byos-version.jar file location depends on the type of installation.
$ scp user@dse_node:/usr/share/dse/clients/dse-byos_2.11-6.0.2.jar byos-6.0.jar
4. If you are using Kerberos authentication, set up a CRON job or other task scheduler to periodically call
dse client-tool cassandra renew-token <token> where <token> is the encoded token string in
byos.properties.
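A hedged sketch of such a crontab entry, renewing the token hourly; the <token> placeholder stands for the string from byos.properties:
0 * * * * dse client-tool cassandra renew-token <token>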
5. Start the Spark shell using the byos.properties and byos-version.jar file.
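For example, a sketch using standard spark-shell options and the file names from the earlier steps:
$ spark-shell --properties-file byos.properties --jars byos-6.0.jar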
2. Log in as the hive user on the Spark SQL Thrift Server host.
4. Merge the existing Spark SQL Thrift Server configuration properties with the generated BYOS
configuration file into a new file.
$ cat /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf \
  byos.properties > custom-sparkconf.conf
5. Start Spark SQL Thrift Server with the custom configuration file and byos-version.jar.
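A hedged sketch; start-thriftserver.sh accepts standard spark-submit options, and its path depends on your Spark distribution:
$ ./sbin/start-thriftserver.sh --properties-file custom-sparkconf.conf --jars byos-6.0.jar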
Connect to the server with beeline:
$ beeline -u 'jdbc:hive2://hostname:port/default;principal=hive/_HOST@REALM'
What's next:
Generated SQL schema files can be passed to beeline with the -f option to generate a mapping for DSE
tables, so that both Hadoop and DataStax Enterprise tables are available through the service for queries.
Using the Spark Jobserver
DataStax Enterprise includes a bundled copy of the open-source Spark Jobserver, an optional component
for submitting and managing Spark jobs, Spark contexts, and JARs on DSE Analytics clusters. Refer to the
Components in the release notes to find the version of the Spark Jobserver included in this version of DSE.
Valid spark-submit options are supported and can be applied to the Spark Jobserver. To use the Jobserver:
The default location of the Spark Jobserver depends on the type of installation:
All uploaded JARs, temporary files, and log files are created in the user's $HOME/.spark-jobserver
directory, which is first created when the Spark Jobserver starts.
Beneficial use cases for the Spark Jobserver include sharing cached data, repeated queries of cached data,
and faster job starts.
Running multiple SparkContext instances in a single JVM is not recommended, so avoid creating a new
SparkContext for each submitted job within a single Spark Jobserver instance. DataStax recommends one
of the two following Spark Jobserver usage patterns.
• Context per JVM: each job has its own SparkContext in a separate JVM.
By default, the H2 database is used for storing Spark Jobserver-related metadata. In this setup, using
Context per JVM requires additional configuration; see the Spark Jobserver docs for details.
In Context per JVM mode, job results must not contain instances of classes that are not present in the
Spark Jobserver classpath. Problems with returning types unknown to the server can be recognized by
the following log line:
For an example of how to create and submit an application through the Spark Jobserver, see the spark-
jobserver demo included with DSE.
The default location of the demos directory depends on the type of installation:
spray.can.server {
ssl-encryption = on
keystore = "path to keystore"
keystorePW = "keystore password"
}
The default location of the Spark Jobserver depends on the type of installation:
• File data blocks are stored locally on each node and are replicated onto multiple nodes.
The redundancy factor is set at the DSEFS directory or file level, which is more granular than the
replication factor that is set at the keyspace level in the database.
For performance on production clusters, store the DSEFS data on physical devices that are separate from
the database. For development and testing you may store DSEFS data on the same physical device as the
database.
Deployment overview
• The DSEFS server runs in the same JVM as DataStax Enterprise. Similar to the database, there is no
master node. All nodes running DSEFS are equal.
• A single DSEFS cannot span multiple datacenters. To deploy DSEFS in multiple datacenters, you can
create a separate instance of DSEFS for each datacenter.
• You can use different keyspaces to configure multiple DSEFS file systems in a single datacenter.
• For optimal performance, locate the local DSEFS data on a different physical drive than the database.
• Encryption is not supported. Use operating system access controls to protect the local DSEFS data
directories. Other limitations apply.
• DSEFS uses the LOCAL_QUORUM consistency level to store file metadata. DSEFS always tries to write
each data block to replicated node locations; if a write fails, it retries on another node before
acknowledging the write. DSEFS writes are very similar to the ALL consistency level, but with additional
failover to provide high availability. DSEFS reads are similar to the ONE consistency level.
Enabling DSEFS
DSEFS is automatically enabled on analytics nodes, and disabled on non-analytics nodes. You can enable the
DSEFS service on any node in a DataStax Enterprise cluster. Nodes within the same datacenter with DSEFS
enabled will join together to behave as a DSEFS cluster.
On each node:
1. In the dse.yaml file, set the properties for the DSE File System options:
dsefs_options:
enabled:
keyspace_name: dsefs
work_dir: /var/lib/dsefs
public_port: 5598
private_port: 5599
data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
min_free_space: 5368709120
a. Enable DSEFS:
enabled: true
If enabled is blank or commented out, DSEFS starts only if the node is configured to run analytics
workloads.
b. Set the name of the keyspace used to store the DSEFS metadata:
keyspace_name: dsefs
You can optionally configure multiple DSEFS file systems in a single datacenter.
c. Define the work directory for storing the DSEFS metadata for the local node. The work directory
should not be shared with other DSEFS nodes:
work_dir: /var/lib/dsefs
d. Set the public port on which DSEFS listens to clients:
public_port: 5598
DataStax recommends that all nodes in the cluster have the same value. Firewalls must open
this port to trusted clients. The service on this port is bound to the native_transport_address.
e. Set the private port for DSEFS inter-node communication:
private_port: 5599
Do not open this port to firewalls; this private port must not be visible from outside of the
cluster.
f. Set the data directories where the file data blocks are stored locally on each node.
data_directories:
- dir: /var/lib/dsefs/data
If you use the default /var/lib/dsefs/data data directory, verify that the directory exists and
that you have root access. Otherwise, you can define your own directory location, change the
ownership of the directory, or both:
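For example, a hedged sketch assuming DSE runs as the cassandra user:
$ sudo mkdir -p /var/lib/dsefs/data
$ sudo chown -R cassandra:cassandra /var/lib/dsefs/data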
Ensure that the data directory is writeable by the DataStax Enterprise user. Put the data
directories on different physical devices than the database. Using multiple data directories on
JBOD improves performance and capacity.
g. For each data directory, set the weighting factor to specify how much data to place in this directory,
relative to other directories in the cluster. This soft constraint determines how DSEFS distributes
the data. For example, a directory with a value of 3.0 receives about three times more data than a
directory with a value of 1.0.
data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
h. For each data directory, define the reserved space, in bytes, to not use for storing file data blocks.
See min_free_space.
data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
min_free_space: 5368709120
4. With guidance from DataStax Support, you can tune advanced DSEFS properties:
# service_startup_timeout_ms: 30000
# service_close_timeout_ms: 600000
# server_close_timeout_ms: 2147483647 # Integer.MAX_VALUE
# compression_frame_max_size: 1048576
# query_cache_size: 2048
# query_cache_expire_after_ms: 2000
# gossip_options:
# round_delay_ms: 2000
# startup_delay_ms: 5000
# shutdown_delay_ms: 10000
# rest_options:
# request_timeout_ms: 330000
# connection_open_timeout_ms: 55000
# client_close_timeout_ms: 60000
# server_request_timeout_ms: 300000
# idle_connection_timeout_ms: 60000
# internode_idle_connection_timeout_ms: 120000
# core_max_concurrent_connections_per_host: 8
# transaction_options:
# transaction_timeout_ms: 3000
# conflict_retry_delay_ms: 200
# conflict_retry_count: 40
# execution_retry_delay_ms: 1000
# execution_retry_count: 3
# block_allocator_options:
# overflow_margin_mb: 1024
# overflow_factor: 1.05
Disabling DSEFS
To disable DSEFS and remove metadata and data:
1. Remove all directories and files from the DSEFS file system:
$ dse fs rm -r filepath
3. Verify that all DSEFS data directories where the file data blocks are stored locally on each node are empty.
These data directories are configured in dse.yaml. Your directories are probably different from this
default data_directories value:
data_directories:
- dir: /var/lib/dsefs/data
Do not delete the data_directories before removing the dsefs keyspace tables, or removing the
node from the cluster.
Configuring DSEFS
You must configure data replication. You can optionally configure multiple DSEFS file systems in a datacenter,
and perform other functions, including setting the Kafka log retention.
DSEFS does not span datacenters. Create a separate DSEFS instance in each datacenter, as described in the
steps below.
DSEFS limitations
Know these limitations when you configure and tune DSEFS. The following functionality and features are not
supported:
• Encryption.
Use operating system access controls to protect the local DSEFS data directories.
• File system consistency checks (fsck) and file repair have only limited support. Running fsck re-
replicates blocks that were under-replicated because a node was taken out of the cluster.
• Checksum.
• Automatic backups.
• Multi-datacenter replication.
• Snapshots.
a. Globally: set replication for the metadata in the dsefs keyspace that is stored in the database.
For example, use a CQL statement to configure a replication factor of 3 on the Analytics
datacenter using NetworkTopologyStrategy:
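A sketch of such a statement, assuming the datacenter is named Analytics:
ALTER KEYSPACE dsefs WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': 3};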
Datacenter names are case-sensitive. Verify the case of the datacenter name using a utility such as
dsetool status.
c. Locally: set the redundancy factor on a specific DSEFS file or directory where the data blocks are
stored.
When a redundancy factor is not specified, it is inherited from the parent directory. The default
redundancy factor is 3.
2. If you have multiple Analytics datacenters, you must configure each DSEFS file system to replicate within
its own datacenter:
a. In the dse.yaml file, specify a separate DSEFS keyspace for each logical datacenter.
For example, on a cluster with logical datacenters DC1 and DC2.
On each node in DC1:
dsefs_options:
...
keyspace_name: dsefs1
On each node in DC2:
dsefs_options:
...
keyspace_name: dsefs2
For example, in a cluster with multiple datacenters, the keyspace names dsefs1 and dsefs2 define
separate file systems in each datacenter.
3. When bouncing a streaming application, verify the Kafka log configuration (especially the
log.retention.check.interval.ms and log.retention.bytes policies). Ensure the Kafka log
retention policy is robust enough to handle the length of time expected to bring the application and
consumers back up.
For example, if the log retention policy is too conservative and deletes or rolls the logs very
frequently to save disk space, users are likely to encounter issues when attempting to recover from
a checkpoint that references offsets that are no longer retained by the Kafka logs.
$ dse fs
For example, to list the file system status and disk space usage in human-readable format:
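A hedged sketch, assuming df accepts an -h (human-readable) flag as the phrasing suggests:
dsefs / > df -h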
Optional command arguments are enclosed in square brackets, for example [dse_auth_credentials] and [-R].
Variable values are italicized, for example directory and [subcommand].
Working with the local file system in the DSEFS shell
You can refer to files in the local file system by prefixing paths with file:. For example, the following command
lists files in the system root directory:
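A plausible form of that command (hedged sketch):
dsefs / > ls file:/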
If you need to perform many subsequent operations on the local file system, first change the current working
directory to file: or any local file system path:
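For example (sketch):
dsefs / > cd file:/home/user1
file:/home/user1 >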
The DSEFS shell remembers the last working directory of each file system separately. To go back to the previous
DSEFS directory, enter:
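A hedged sketch, assuming the bare dsefs: prefix refers back to the last DSEFS working directory:
file:/home/user1 > cd dsefs: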
To refer to a path relative to the last working directory of the file system, prefix a relative path with either dsefs:
or file:. The following session will create a directory new_directory in the directory /home/user1:
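A hedged sketch of such a session, using the directory names from the sentence above:
dsefs / > cd /home/user1
dsefs /home/user1 > cd file:/tmp
file:/tmp > mkdir dsefs:new_directory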
To copy a file between two different file systems, you can also use the cp command with explicit file system
prefixes in the paths:
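A sketch with hypothetical paths:
file:/tmp > cp file:/tmp/report.txt dsefs:/reports/report.txt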
Authentication
For dse_auth_credentials, you can provide user credentials in several ways; see Providing credentials from
DSE tools. For authentication with DSEFS, see DSEFS authentication.
Wildcard support
Some DSEFS commands support wildcard pattern expansion in the path argument. Path arguments containing
wildcards are expanded before method invocation into a set of paths matching the wildcard pattern, then the
given method is invoked for each expanded path.
For example in the following directory tree:
dirA
|--dirB
|--file1
|--file2
Giving the stat dirA/* command would be transparently translated into three invocations: stat dirA/dirB,
stat dirA/file1, and stat dirA/file2.
• * matches any file system entry (file or directory) name, as in the example of stat dirA/*.
• ? matches any single character in the file system entry name. For example stat dirA/dir? matches
dirA/dirB.
• [] matches any characters enclosed within the brackets. For example stat dirA/file[0123] matches
dirA/file1 and dirA/file2.
• {} matches any sequence of characters enclosed within the brackets and separated with ,. For example
stat dirA/{dirB,file2} matches dirA/dirB and dirA/file2.
Forcing synchronization
Before confirming a file write, DSEFS by default forces all blocks of the file to be written to the storage
devices. This behavior can be controlled with the --no-force-sync and --force-sync flags when creating files
or directories in the DSEFS shell with the mkdir, put, and cp commands. If not specified, the force/no-force
behavior is inherited from the parent directory. For example, if a directory is created with --no-force-sync, then all
files in it are created with --no-force-sync unless --force-sync is explicitly set during file creation.
Turning off forced synchronization improves latency and performance at a cost of durability. For example,
if a power loss occurs before writing the data to the storage device, you may lose data. Turn off forced
synchronization only if you have a reliable backup power supply in your datacenter and failure of all replicas is
unlikely, or if you can afford losing file data.
The Hadoop SYNC_BLOCK flag has the same effect as --force-sync in DSEFS. The Hadoop LAZY_PERSIST
flag has the same effect as --no-force-sync in DSEFS.
Removing a DSEFS node
When removing a node running DSEFS from a DSE cluster, additional steps are needed to ensure proper
correctness within the DSEFS data set.
Make sure the replication factor for the cluster is greater than ONE before continuing.
1. From a node in the same datacenter as the node to be removed, start the DSEFS shell.
$ dse fs
dsefs > df
3. Find the node to be removed in the list and note the UUID value for it under the Location column.
4. If the node is up, unmount it from DSEFS with the command umount UUID.
5. If the node is not up (for example, after a hardware failure), force unmount it from DSEFS with the
command umount -f UUID.
6. Run a file system check with the fsck command to make sure all blocks are replicated.
If data was written to a DSEFS node, more nodes were added to the cluster, and the original node was
removed without running fsck, the data in the original node may be permanently lost.
$ dse fs
dsefs > df
3. Find the directory to be removed in the list and note the UUID value for it under the Location column.
5. Run a file system check with the fsck command to make sure all blocks are replicated.
If the file system check results in an IOException, make sure all the nodes in the cluster are running.
Examples
Using the DSEFS shell, these commands put the local bluefile to the remote DSEFS greenfile:
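A hedged sketch of those commands:
dsefs / > put file:/tmp/bluefile greenfile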
dsefs / > ls -l
Type Permission Owner Group Length Modified Name
Using the dse command, these commands create the test2 directory and upload the local README.md file to the
new DSEFS directory.
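A hedged sketch, running each DSEFS command through dse fs from the operating system shell:
$ dse fs 'mkdir /test2'
$ dse fs 'put README.md /test2/README.md'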
You can use two or more commands in a single command line. This is faster because the JVM is launched,
and the connection to DSEFS opened and closed, only once. For example:
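A sketch combining the two commands above in one invocation:
$ dse fs 'mkdir /test2' 'put README.md /test2/README.md'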
The following example shows how to use the --no-force-sync flag on a directory, and how to check the state
of the --force-sync flag using stat. These commands are run from within the DSEFS shell.
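A hedged sketch with a hypothetical directory name:
dsefs / > mkdir --no-force-sync /scratch
dsefs / > stat /scratch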
DSEFS compression
DSEFS is able to compress files to save storage space and bandwidth. Compression is performed by DSE
during upload upon a user’s explicit request. Decompression is transparent. Data is always uncompressed by
the server before it is returned to the client.
Compression is performed within block boundaries. The unit of compression—the chunk of data that gets
compressed individually—is called a frame and its size can be specified during file upload.
Encoders
DSEFS is shipped with the lz4 encoder which works out of the box.
Compression
To compress files, use the -c or --compression-encoder parameter of the put or cp command. The parameter
specifies the compression encoder to use for the file that is about to be uploaded.
The frame size can optionally be set with the -f, --compression-frame-size option.
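A hedged sketch, uploading a hypothetical file with the lz4 encoder and an explicit frame size:
dsefs / > put -c lz4 -f 1048576 file:/tmp/data.csv /data.csv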
The maximum frame size in bytes is set in the compression_frame_max_size option in dse.yaml. If a user
sets the frame size to a value greater than compression_frame_max_size when using put -f, an error is
thrown and the command fails. Modify the compression_frame_max_size setting based on the available
memory of the node.
Compressed files can be appended to in the same way as uncompressed files. If the file is compressed,
the appended data is transparently compressed with the encoder specified for the initial put operation.
Directories can have a default compression encoder, specified during directory creation with the mkdir
command. Files newly added with the put command inherit the default compression encoder from the
containing directory. You can override the default compression encoder with the -c parameter during put
operations.
Decompression
Decompression is performed automatically for all commands that transport data to the client. There is no need
for additional configuration to retrieve the original, decompressed file content.
Storage space
Enabling compression creates a distinction between the logical and physical file size.
The logical size is the size of a file before uploading it to DSEFS, where it is then compressed. The logical size
is shown by the stat command under Size.
The physical size is the actual size of the data stored on the storage device. The physical size is shown by the df
command and by the stat -v command for each block separately, under the Compressed length column.
Limitations
Truncating compressed files is not possible.
DSEFS authentication
DSEFS works with secured DataStax Enterprise clusters.
For related SSL details, see Enabling SSL encryption for DSEFS.
• Set with the dse spark-submit command using one of the credential options described in Providing
credentials on command line.
• Programmatically set the user credentials in the Spark configuration object before the SparkContext is
created:
conf.set("spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.username",
<user>)
conf.set("spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.password",
<pass>)
If a Kerberos authentication token is in use, you do not need to set any properties in the context object. If
you need to explicitly set the token, set the spark.hadoop.cassandra.auth.token property.
• When running the Spark Shell, where the SparkContext is created at startup, set the properties in the
Hadoop configuration object:
sc.hadoopConfiguration.set("com.datastax.bdp.fs.client.authentication.basic.username", <user>)
sc.hadoopConfiguration.set("com.datastax.bdp.fs.client.authentication.basic.password", <pass>)
• When running a Spark application or the Spark Shell, provide the properties in the Hadoop core-default.xml
configuration file:
<property>
<name>com.datastax.bdp.fs.client.authentication.basic.username</name>
<value>username</value>
</property>
<property>
<name>com.datastax.bdp.fs.client.authentication.basic.password</name>
<value>password</value>
</property>
Optional: If you want to use this method but do not have privileges to write to core-default.xml, copy
the file to another location and set the environment variable to point to that location:
export HADOOP2_CONF_DIR=path
DSEFS shell
Providing authentication credentials while using the DSEFS shell works the same way as in other DSE tools. The DSEFS
shell supports the authentication methods listed below, in priority order. When more than one method
can be used, the one with the highest priority is used.