
DSE 6.0 Administrator Guide
Latest 6.0 patch: 6.0.13

Updated: 2020-09-18 (UTC-07:00)
© 2020 DataStax, Inc. All rights reserved.
DataStax, Titan, and TitanDB are registered trademarks of DataStax, Inc. and its subsidiaries in the United States and/or other countries.

Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks
of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.

Kubernetes is the registered trademark of the Linux Foundation.



Contents
Chapter 1. Getting started................................................................................................................................... 16

About advanced functionality............................................................................................................................18

New features.....................................................................................................................................................19

Chapter 2. Release notes.....................................................................................................................................22

DSE release notes............................................................................................................................................22

DSE 6.0.13 release notes.......................................................................................................................... 22

DSE 6.0.12 release notes.......................................................................................................................... 26


DSE 6.0.11 release notes.......................................................................................................................... 29

DSE 6.0.10 release notes.......................................................................................................................... 33

DSE 6.0.9 release notes............................................................................................................................ 38

DSE 6.0.8 release notes............................................................................................................................ 39

DSE 6.0.7 release notes............................................................................................................................ 46

DSE 6.0.6 release notes............................................................................................................................ 53

DSE 6.0.5 release notes............................................................................................................................ 54

DSE 6.0.4 release notes............................................................................................................................ 62

DSE 6.0.3 release notes............................................................................................................................ 64

DSE 6.0.2 release notes............................................................................................................................ 71

DSE 6.0.1 release notes............................................................................................................................ 75

DSE 6.0.0 release notes............................................................................................................................ 81

Bulk loader release notes.................................................................................................................................99

Studio release notes......................................................................................................................................... 99

Chapter 3. Installing........................................................................................................................................... 100

Chapter 4. Configuration....................................................................................................................................101

Recommended production settings................................................................................................................ 101

Configure the chunk cache.......................................................................................................................101

Install the latest Java Virtual Machine......................................................................................................103

Synchronize clocks................................................................................................................................... 103

Set kernel parameters.............................................................................................................................. 103

Disable settings that impact performance................................................................................................ 105

Optimize disk settings...............................................................................................................................107

Set the heap size for Java garbage collection......................................................................................... 108

Check Java Hugepages settings.............................................................................................................. 108



YAML and configuration properties................................................................................................................ 109

cassandra.yaml......................................................................................................................................... 109

dse.yaml.................................................................................................................................................... 141
remote.yaml...............................................................................................................................................180

cassandra-rackdc.properties..................................................................................................................... 184

cassandra-topology.properties.................................................................................................................. 184

Cloud provider snitches.................................................................................................................................. 185

Amazon EC2 single-region snitch............................................................................................................ 185

Amazon EC2 multi-region snitch.............................................................................................................. 186

Google Cloud Platform............................................................................................................................. 187

Apache CloudStack snitch........................................................................................................................188


JVM system properties................................................................................................................................... 188

Cassandra................................................................................................................................................. 189

JMX........................................................................................................................................................... 191

DSE Search.............................................................................................................................................. 191

TPC........................................................................................................................................................... 192

LDAP......................................................................................................................................................... 193

Kerberos....................................................................................................................................................194

NodeSync..................................................................................................................................................194

Choosing a compaction strategy.................................................................................................................... 194

NodeSync service........................................................................................................................................... 195

About NodeSync....................................................................................................................................... 195

Starting and stopping the NodeSync service........................................................................................... 197

Enabling NodeSync validation.................................................................................................................. 197

Tuning NodeSync validations................................................................................................................... 198

Manually starting NodeSync validation.....................................................................................................199

Using multiple network interfaces...................................................................................................................199

Configuring gossip settings.............................................................................................................................202

Configuring the heap dump directory............................................................................................................. 203

Configuring Virtual Nodes...............................................................................................................................203

Virtual node (vnode) configuration............................................................................................................203

Enabling virtual nodes on an existing production cluster......................................................................... 205

Logging configuration......................................................................................................................................205

Changing logging locations.......................................................................................................................205

Configuring logging................................................................................................................................... 206



Commit log archive configuration............................................................................................................. 210

Change Data Capture (CDC) logging...................................................................................................... 211

Chapter 5. Initializing a cluster......................................................................................................................... 212


Initializing datacenters.................................................................................................................................... 212

Initializing a single datacenter per workload type.................................................................................... 213

Initializing multiple datacenters per workload type................................................................................... 218

Setting seed nodes for a single datacenter................................................................................................... 223

Use cases for listen address.......................................................................................................................... 224

Initializing single-token architecture datacenters............................................................................................ 225

Calculating tokens for single-token architecture nodes............................................................................228

Chapter 6. Security............................................................................................................................................. 234


Chapter 7. DSE advanced functionality.......................................................................................................... 235

DSE Analytics................................................................................................................................................. 235

About DSE Analytics.................................................................................................................................235

Setting the replication factor for analytics keyspaces.............................................................................. 236

DSE Analytics and Search integration..................................................................................................... 236

About DSE Analytics Solo........................................................................................................................ 238

Analyzing data using Spark......................................................................................................................239

DSEFS (DataStax Enterprise file system)................................................................................................288

DSE Search.................................................................................................................................................... 308

About DSE Search................................................................................................................................... 308

Configuring DSE Search...........................................................................................................................315

Search performance tuning and monitoring............................................................................................. 342

DSE Search operations............................................................................................................................ 347

Solr interfaces........................................................................................................................................... 352

HTTP API SolrJ and other Solr clients.....................................................................................................362

DSE Graph......................................................................................................................................................362

About DataStax Enterprise Graph............................................................................................................ 362

DSE Graph Terminology...........................................................................................................................364

DSE Graph Operations.............................................................................................................................365

DSE Graph Tools..................................................................................................................................... 372

Starting the Gremlin console.................................................................................................................... 373

DSE Graph Reference..............................................................................................................................376

DSE Management Services............................................................................................................................383

Performance Service................................................................................................................................ 383



Best Practice Service................................................................................................................................432

Capacity Service....................................................................................................................................... 432

Repair Service.......................................................................................................................................... 433


DSE Advanced Replication.............................................................................................................................433

About DSE Advanced Replication............................................................................................................ 433

Architecture............................................................................................................................................... 433

Traffic between the clusters..................................................................................................................... 439

Terminology...............................................................................................................................................440

Getting started.......................................................................................................................................... 440

Keyspaces.................................................................................................................................................450

Data types.................................................................................................................................................451
Operations.................................................................................................................................................451

CQL queries..............................................................................................................................................467

Metrics.......................................................................................................................................................468

Managing invalid messages..................................................................................................................... 474

Managing audit logs................................................................................................................................. 475

dse advrep commands............................................................................................................................. 476

DSE In-Memory.............................................................................................................................................. 509

Creating or altering tables to use DSE In-Memory.................................................................................. 509

Verifying table properties.......................................................................................................................... 511

Managing memory.................................................................................................................................... 511

Backing up and restoring data................................................................................................................. 512

DSE Multi-Instance......................................................................................................................................... 512

About DSE Multi-Instance.........................................................................................................................512

DSE Multi-Instance architecture............................................................................................................... 512

Adding nodes to DSE Multi-Instance....................................................................................................... 514

DSE Multi-Instance commands................................................................................................................ 517

DSE Tiered Storage....................................................................................................................................... 518

About DSE Tiered Storage.......................................................................................................................518

Configuring DSE Tiered Storage.............................................................................................................. 519

Testing configurations............................................................................................................................... 521

Chapter 8. Tools................................................................................................................................................. 523

DSE Metrics Collector.....................................................................................................................................523

nodetool...........................................................................................................................................................523

About the nodetool utility.......................................................................................................................... 523



abortrebuild............................................................................................................................................... 523

assassinate............................................................................................................................................... 524

bootstrap................................................................................................................................................... 526
cfhistograms.............................................................................................................................................. 527

cfstats........................................................................................................................................................ 527

cleanup......................................................................................................................................................527

clearsnapshot............................................................................................................................................ 529

compact.....................................................................................................................................................530

compactionhistory..................................................................................................................................... 532

compactionstats........................................................................................................................................ 537

decommission........................................................................................................................................... 538
describecluster.......................................................................................................................................... 539

describering...............................................................................................................................................541

disableautocompaction..............................................................................................................................543

disablebackup........................................................................................................................................... 544

disablebinary............................................................................................................................................. 545

disablegossip.............................................................................................................................................547

disablehandoff........................................................................................................................................... 548

disablehintsfordc....................................................................................................................................... 549

drain.......................................................................................................................................................... 551

enableautocompaction.............................................................................................................................. 552

enablebackup............................................................................................................................................ 553

enablebinary..............................................................................................................................................555

enablegossip............................................................................................................................................. 556

enablehandoff............................................................................................................................................557

enablehintsfordc........................................................................................................................................ 558

failuredetector............................................................................................................................................560

flush...........................................................................................................................................................561

garbagecollect........................................................................................................................................... 562

gcstats....................................................................................................................................................... 564

getbatchlogreplaythrottle........................................................................................................................... 566

getcachecapacity.......................................................................................................................................567

getcachekeystosave..................................................................................................................................568

getcompactionthreshold............................................................................................................................ 570

getcompactionthroughput..........................................................................................................................571

getconcurrentcompactors.......................................................................................................................... 572

getconcurrentviewbuilders.........................................................................................................................574

getendpoints..............................................................................................................................................575
gethintedhandoffthrottlekb.........................................................................................................................578

getinterdcstreamthroughput...................................................................................................................... 579

getlogginglevels.........................................................................................................................................580

getmaxhintwindow.....................................................................................................................................582

getseeds....................................................................................................................................................583

getsstables................................................................................................................................................ 585

getstreamthroughput................................................................................................................................. 587

gettimeout..................................................................................................................................................589
gettraceprobability..................................................................................................................................... 590

gossipinfo.................................................................................................................................................. 592

handoffwindow.......................................................................................................................................... 593

help............................................................................................................................................................595

info.............................................................................................................................................................599

inmemorystatus......................................................................................................................................... 600

invalidatecountercache..............................................................................................................................602

invalidatekeycache.................................................................................................................................... 603

invalidaterowcache....................................................................................................................................605

join.............................................................................................................................................................606

listendpointspendinghints.......................................................................................................................... 607

leaksdetection........................................................................................................................................... 609

listsnapshots..............................................................................................................................................611

mark_unrepaired....................................................................................................................................... 613

move..........................................................................................................................................................614

netstats......................................................................................................................................................616

nodesyncservice........................................................................................................................................618

pausehandoff.............................................................................................................................................629

proxyhistograms........................................................................................................................................ 631

rangekeysample........................................................................................................................................ 633

rebuild........................................................................................................................................................634

rebuild_index............................................................................................................................................. 637

rebuild_view.............................................................................................................................................. 638

refresh....................................................................................................................................................... 640

refreshsizeestimates................................................................................................................................. 641

reloadseeds...............................................................................................................................................643

reloadtriggers............................................................................................................................................ 644
relocatesstables........................................................................................................................................ 645

removenode.............................................................................................................................................. 647

repair......................................................................................................................................................... 649

replaybatchlog........................................................................................................................................... 652

resetlocalschema...................................................................................................................................... 654

resume...................................................................................................................................................... 655

resumehandoff.......................................................................................................................................... 656

ring............................................................................................................................................................ 657
scrub..........................................................................................................................................................659

sequence...................................................................................................................................................660

setbatchlogreplaythrottle........................................................................................................................... 663

setcachecapacity.......................................................................................................................................665

setcachekeystosave.................................................................................................................................. 666

setcompactionthreshold............................................................................................................................ 668

setcompactionthroughput.......................................................................................................................... 669

setconcurrentcompactors.......................................................................................................................... 670

setconcurrentviewbuilders......................................................................................................................... 671

sethintedhandoffthrottlekb......................................................................................................................... 672

setinterdcstreamthroughput.......................................................................................................................674

setlogginglevel...........................................................................................................................................675

setmaxhintwindow..................................................................................................................................... 677

setstreamthroughput................................................................................................................................. 679

settimeout..................................................................................................................................................680

settraceprobability..................................................................................................................................... 682

sjk.............................................................................................................................................................. 684

snapshot....................................................................................................................................................686

status.........................................................................................................................................................689

statusbackup............................................................................................................................................. 691

statusbinary............................................................................................................................................... 693

statusgossip.............................................................................................................................................. 694

statushandoff.............................................................................................................................................695

stop............................................................................................................................................................697

stopdaemon...............................................................................................................................................698

tablehistograms......................................................................................................................................... 700

tablestats................................................................................................................................................... 701
toppartitions...............................................................................................................................................706

tpstats........................................................................................................................................................709

truncatehints..............................................................................................................................................715

upgradesstables........................................................................................................................................ 716

verify..........................................................................................................................................................718

version.......................................................................................................................................................720

viewbuildstatus.......................................................................................................................................... 721

dse commands................................................................................................................................................722
About dse commands............................................................................................................................... 722

dse connection options............................................................................................................................. 723

add-node................................................................................................................................................... 724

advrep....................................................................................................................................................... 727

beeline.......................................................................................................................................................760

cassandra..................................................................................................................................................761

cassandra-stop..........................................................................................................................................763

exec...........................................................................................................................................................764

fs................................................................................................................................................................765

gremlin-console......................................................................................................................................... 766

hadoop fs.................................................................................................................................................. 767

list-nodes................................................................................................................................................... 767

pyspark......................................................................................................................................................768

remove-node............................................................................................................................................. 769

spark..........................................................................................................................................................771

spark-class................................................................................................................................................ 773

spark-jobserver..........................................................................................................................................774

spark-history-server...................................................................................................................................776

spark-sql....................................................................................................................................................777

spark-sql-thriftserver..................................................................................................................................778

spark-submit..............................................................................................................................................779

SparkR...................................................................................................................................................... 782

-v............................................................................................................................................................... 783

dse client-tool..................................................................................................................................................783

About dse client-tool................................................................................................................................. 783

client-tool connection options................................................................................................................... 784

cassandra..................................................................................................................................................786
configuration export.................................................................................................................................. 788

configuration byos-export..........................................................................................................................789

configuration import.................................................................................................................................. 791

spark..........................................................................................................................................................792

alwayson-sql..............................................................................................................................................794

nodesync......................................................................................................................................................... 796

disable....................................................................................................................................................... 798

enable........................................................................................................................................................801
help............................................................................................................................................................804

tracing........................................................................................................................................................807

validation................................................................................................................................................... 817

dsefs shell commands.................................................................................................................................... 819

append...................................................................................................................................................... 819

cat..............................................................................................................................................................820

cd...............................................................................................................................................................822

chgrp......................................................................................................................................................... 824

chmod........................................................................................................................................................825

chown........................................................................................................................................................ 827

cp...............................................................................................................................................................828

df............................................................................................................................................................... 830

du.............................................................................................................................................................. 831

echo...........................................................................................................................................................833

exit.............................................................................................................................................................834

fsck............................................................................................................................................................ 835

get............................................................................................................................................................. 836

ls................................................................................................................................................................837

mkdir..........................................................................................................................................................839

mv..............................................................................................................................................................841

put............................................................................................................................................................. 843

pwd............................................................................................................................................................845

realpath..................................................................................................................................................... 846

rename...................................................................................................................................................... 847

rm.............................................................................................................................................................. 848

rmdir.......................................................................................................................................................... 849

stat.............................................................................................................................................................851
truncate..................................................................................................................................................... 852

umount...................................................................................................................................................... 853

dsetool.............................................................................................................................................................854

About dsetool............................................................................................................................................ 854

Connection options................................................................................................................................... 855

core_indexing_status................................................................................................................................ 857

create_core............................................................................................................................................... 859

createsystemkey....................................................................................................................................... 862
encryptconfigvalue.................................................................................................................................... 864

get_core_config.........................................................................................................................................864

get_core_schema......................................................................................................................................865

help............................................................................................................................................................867

index_checks.............................................................................................................................................868

infer_solr_schema..................................................................................................................................... 870

inmemorystatus......................................................................................................................................... 871

insights_config...........................................................................................................................................872

insights_filters............................................................................................................................................875

list_index_files........................................................................................................................................... 877

list_core_properties................................................................................................................................... 879

list_subranges........................................................................................................................................... 880

listjt............................................................................................................................................................ 881

managekmip list........................................................................................................................................ 882

managekmip expirekey............................................................................................................................. 883

managekmip revoke..................................................................................................................................884

managekmip destroy.................................................................................................................................885

node_health...............................................................................................................................................886

partitioner.................................................................................................................................................. 887

perf............................................................................................................................................................ 888

read_resource........................................................................................................................................... 891

rebuild_indexes......................................................................................................................................... 892

reload_core............................................................................................................................................... 894

ring............................................................................................................................................................ 896

set_core_property..................................................................................................................................... 897

sparkmaster cleanup.................................................................................................................................899

sparkworker restart................................................................................................................................... 900


status.........................................................................................................................................................901

stop_core_reindex.....................................................................................................................................902

tieredtablestats.......................................................................................................................................... 903

tsreload......................................................................................................................................................905

unload_core...............................................................................................................................................906

upgrade_index_files.................................................................................................................................. 907

write_resource...........................................................................................................................................908

Stress tools..................................................................................................................................................... 909


cassandra-stress tool................................................................................................................................ 909

Interpreting the output of cassandra-stress..............................................................................................919

fs-stress tool..............................................................................................................................................920

SSTable utilities.............................................................................................................................................. 921

About SSTable tools................................................................................................................................. 921

sstabledowngrade..................................................................................................................................... 922

sstabledump.............................................................................................................................................. 924

sstableexpiredblockers.............................................................................................................................. 930

sstablelevelreset........................................................................................................................................931

sstableloader............................................................................................................................................. 933

sstablemetadata........................................................................................................................................ 935

sstableofflinerelevel...................................................................................................................................939

sstablepartitions........................................................................................................................................ 941

sstablerepairedset..................................................................................................................................... 944

sstablescrub.............................................................................................................................................. 946

sstablesplit.................................................................................................................................................948

sstableupgrade..........................................................................................................................................950

sstableutil.................................................................................................................................................. 951

sstableverify.............................................................................................................................................. 953

DataStax tools.................................................................................................................................................954

Preflight check tool......................................................................................................................................... 955

cluster_check and yaml_diff tools.................................................................................................................. 956

Chapter 9. Operations........................................................................................................................................ 957

Starting and stopping DSE............................................................................................................................. 957



Starting as a service.................................................................................................................................957

Starting as a stand-alone process............................................................................................................959

Stopping a node....................................................................................................................................... 961


Adding or removing nodes, datacenters, or clusters......................................................................................962

Adding nodes to vnode-enabled cluster................................................................................................... 962

Adding a datacenter to a cluster.............................................................................................................. 963

Adding a datacenter to a cluster using a designated datacenter as a data source..................................968

Replacing a dead node or dead seed node.............................................................................................972

Replacing a running node........................................................................................................................ 975

Moving a node from one rack to another.................................................................................................976

Decommissioning a datacenter................................................................................................................ 977


Removing a node..................................................................................................................................... 979

Changing the IP address of a node......................................................................................................... 980

Switching snitches.................................................................................................................................... 981

Changing keyspace replication strategy...................................................................................................982

Migrating or renaming a cluster................................................................................................................983

Adding single-token nodes to a cluster.................................................................................................... 984

Adding a datacenter to a single-token architecture cluster...................................................................... 985

Replacing a dead node in a single-token architecture cluster................................................................. 986

Backing up and restoring data....................................................................................................................... 989

About snapshots....................................................................................................................................... 989

Taking a snapshot.................................................................................................................................... 989

Deleting snapshot files..............................................................................................................................990

Enabling incremental backups..................................................................................................................991

Restoring from a snapshot....................................................................................................................... 991

Restoring a snapshot into a new cluster..................................................................................................992

Recovering from a single disk failure using JBOD...................................................................................993

Repairing nodes..............................................................................................................................................995

Manual repair: Anti-entropy repair............................................................................................................ 995

When to run anti-entropy repair............................................................................................................... 998

Changing repair strategies........................................................................................................................999

Monitoring a DSE cluster..............................................................................................................................1001

Tuning the database..................................................................................................................................... 1001

Tuning Java Virtual Machine.................................................................................................................. 1001

Tuning Bloom filters................................................................................................................................ 1006



Configuring memtable thresholds........................................................................................................... 1007

Data caching................................................................................................................................................. 1007

Configuring data caches......................................................................................................................... 1007


Monitoring and adjusting caching........................................................................................................... 1009

Compacting and compressing...................................................................................................................... 1009

Configuring compaction.......................................................................................................................... 1009

Compression........................................................................................................................................... 1010

Testing compaction and compression.................................................................................................... 1011

Migrating data to DSE.................................................................................................................................. 1012

Collecting node health and indexing scores.................................................................................................1012

Clearing data from DSE............................................................................................................................... 1014


Chapter 10. Planning........................................................................................................................................ 1015
Chapter 1. Getting started with DataStax
Enterprise 6.0
Information about using DataStax Enterprise for Administrators.
This topic provides basic information and a roadmap to documentation for System Administrators new to DataStax
Enterprise.
Which product?
DataStax Offerings provides basic information to help you choose which product best fits your requirements.
Learn
Before diving into administration tasks, you can save a lot of time when setting up and operating DataStax
Enterprise (DSE) in a production environment by learning a few basics first:

• DataStax Enterprise-based applications and clusters differ significantly from relational databases and use
a data model based on the types of queries, not on modeling entities and relationships. Architecture in brief
contains key concepts and terminology for understanding the database.

• You can use DSE OpsCenter and Lifecycle Manager for most administrative tasks.

• Save yourself some time and frustration by spending a few moments looking at DataStax Doc and Search tips.
These short topics talk about navigation and bookmarking aids that will make your journey through the docs
more efficient and productive.

The following are not administrator specific but are presented to give you a fuller picture of the database:

• Cassandra Query Language (CQL) is the query language for DataStax Enterprise.

• DataStax provides drivers in several programming languages for connecting client applications to the
database.

• APIs are available for OpsCenter, DseGraphFrame, DataStax Spark Cassandra Connector, and the drivers.

Plan
The Planning and testing guide contains guidelines for capacity planning and hardware selection in production
environments. Key topics include:

• Estimating disk capacity

• Estimating RAM

• CPU recommendations

Install
DataStax offers a variety of ways to set up a cluster:
Cloud

• Google Cloud Platform (GCP) Marketplace | Google Deployment Guide

• Microsoft Azure Marketplace | Azure Deployment Guide

• AWS Quick Start | Amazon Deployment Guide

On premises

• Installing and deploying DSE using Lifecycle Manager


• Packages for Yum- and Debian-based platforms

• Docker images

• Binary tarball

• Deployment per workload type

For help with choosing an install type, see Which install method should I use?
Secure
DSE Advanced Security provides fine-grained user and access controls to keep applications data protected and
compliant with regulatory standards like PCI, SOX, HIPAA, and the European Union’s General Data Protection
Regulation (GDPR). Key topics include:

• Create database users and roles

• Set up and configure LDAP access

• Configure database permissions

The DSE database includes the default role cassandra with password cassandra. This superuser login has
full access to the database. DataStax recommends using the cassandra role only once, during initial Role Based
Access Control (RBAC) setup, to establish your own root account, and then disabling the cassandra role. See
Adding a superuser login.
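For example, a minimal sketch of that initial setup in cqlsh might look like the following; the role name and
password shown are placeholders, not values from this guide:

CREATE ROLE dba_root WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'choose-a-strong-password';
-- Log in again as the new role, then lock out the default role:
ALTER ROLE cassandra WITH SUPERUSER = false AND LOGIN = false;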

Tune
Important topics for optimizing the performance of the database include:

• Recommended production settings

• Tuning the Java Virtual Machine

• Enable the Nodesync service (continuous background repair)

• Load test your cluster before deployment

Operations
The most commonly used operations include:

• Starting and stopping DataStax Enterprise per workload type.

• Backup and recovery

• Adding or removing nodes, datacenters, or clusters

• Moving a node from one rack to another

• Tools

Load
The primary tools for getting data into and out of the database are:

• DataStax Bulk Loader

• DataStax Apache Kafka Connector

• DSE Graph Loader

For other methods, see Migrating data to DataStax Enterprise.


Monitor
DataStax provides the following tools to monitor clusters and view metrics:


• DSE OpsCenter

• DSE Metrics Collector

Troubleshooting

• Support Knowledge Base

• Troubleshooting guide

• Submit a support ticket (registered customers)

Upgrading
Key topics in the Upgrade Guide include:

• Upgrading from earlier DSE releases

• Patch release upgrades

• Upgrading from Apache Cassandra to DataStax Enterprise

Advanced Functionality
See Advanced functionality in DataStax Enterprise 6.0.

Advanced functionality in DataStax Enterprise 6.0


Brief descriptions of the advanced functionality in DataStax Enterprise 6.0.
DataStax Enterprise (DSE) version 6.0 is the industry's best distributed cloud database designed for hybrid cloud.
Easily deploy the only active-everywhere database platform that runs wherever needed: on-premises, across
regions or clouds. Benefit from all the capabilities of the best distribution of Apache Cassandra™ with enterprise
tooling and expert support required for production cloud applications.
DSE Analytics
Built on a production-certified version of Apache Spark™, with enhanced capabilities like AlwaysOn SQL
for processing streaming and historical data at cloud scale.
DSE Search
Provides powerful search and indexing capabilities, including support for full-text, relevancy, sub-string,
and fuzzy queries over large data sets, aggregation, and geospatial matchups.
DSE Graph
DSE Graph is optimized for storing billions of items and their relationships to enable you to identify and
analyze hidden relationships between connected data and build powerful modern applications for real-
time use cases: fraud detection, customer 360, social networks, IoT, and recommendation systems. The
DSE Graph Quick Start is a great place to get started.
DSE OpsCenter
Provides visual management and monitoring for DataStax Enterprise, including automatic backups,
reduced manual operations, automatic failover, patch release upgrades, and secure management of
DSE clusters on-premises, in the cloud, or in hybrid environments that span multiple data centers.
Lifecycle Manager
A visual provisioning and monitoring tool for DataStax Enterprise clusters. LCM allows you to define
the cluster configuration including datacenter, node topology, and security. LCM monitoring helps you
troubleshoot installation, configuration, and upgrade jobs.
DSE Advanced Security
Provides fine-grained user and access controls to keep application data protected and compliant
with regulatory standards like PCI, SOX, HIPAA, and the European Union’s General Data Protection
Regulation (GDPR).
DSE Metrics Collector
Aggregates DSE metrics and integrates with existing monitoring solutions to facilitate problem resolution
and remediation.
DSE Management Services


DSE Management Services automatically handle administration and maintenance tasks and assist with
overall database cluster management.
NodeSync service
Continuous background repair that virtually eliminates manual efforts to run repair operations in a
DataStax cluster.
Advanced Replication
Advanced Replication allows a single cluster to have a primary hub with multiple spokes. This allows
configurable, bi-directional distributed data replication to and from source and destination clusters.
DSE In-Memory
Store and access data exclusively from memory.
DSE Multi-Instance
Run multiple DataStax Enterprise nodes on a single host machine.
DSE Tiered Storage
Automate data movement across different types of storage media.

DataStax Enterprise 6.0 new features


DataStax Enterprise, built on Apache Cassandra™, powers the Right-Now Enterprise with an always-on,
distributed cloud database designed for hybrid cloud. DataStax Enterprise (DSE) 6.0 dramatically increases
performance and eases operational management with new features and enhancements.
Be sure to read the DataStax Enterprise 6.0 release notes.

Feature Description

NodeSync DSE NodeSync removes the need for manual repair operations in DSE's distribution of Cassandra and eliminates
cluster outages that are attributed to manual repair failures. This equates to operational cost savings, reduced support
cycles, and reduced application management pain. NodeSync also makes applications run more predictably, making
capacity planning easier. NodeSync’s advantages for operational simplicity extend across the whole data layer
including database, search, and analytics.
Be sure to read the DSE NodeSync: Operational Simplicity at its Best blog.

Advanced Performance DSE Advanced Performance delivers numerous performance advantages over open-source Apache Cassandra
including:

• Thread per core (TPC) and asynchronous architecture: A coordination-free design, DSE’s thread-per-core
architecture provides up to 2x more throughput for read and write operations.

• Storage engine optimizations that provide up to half the latency of open source Cassandra and include optimized
compaction.

• DataStax Bulk Loader: Up to 4x faster loads and unloads of data than current data loading utilities. Be sure
to read the Introducing DataStax Bulk Loader blog.

• Continuous paging improves DSE Analytics read performance by up to 3x over open source Apache Cassandra
and Apache Spark.

Be sure to read the DSE Advanced Performance blog.

DSE TrafficControl DSE TrafficControl provides a backpressure mechanism to avoid overloading DSE nodes with client or replica
requests that could make DSE nodes unresponsive or lead to long garbage collections and out of memory errors. DSE
TrafficControl is enabled by default and comes pre-tuned to accommodate very different workloads, from simple reads
and writes to the most extreme workloads. It requires no configuration.

Automated Upgrades for patch releases: Part of OpsCenter LifeCycle Manager, the Upgrade Service handles patch upgrades of DSE clusters
at the data center, rack, or node level with up to 60% less manual involvement. The Upgrade Service allows you to easily clone your
existing configuration profile to ensure compatibility with DSE upgrades. Be sure to read the Taking the Pain Out of
Database Upgrades blog.


DSE Analytics New features in DSE Analytics include:

• AlwaysOn SQL, with advanced security, ensures around-the-clock uptime for analytics queries with the freshest,
secure insight. It is interoperable with existing business intelligence tools that utilize ODBC/JDBC and other Spark-
based tools. Be sure to read the Introducing AlwaysOn SQL for DSE Analytics blog.

• Structured Streaming: simple, efficient, and robust streaming of data from Apache Kafka, file systems, or other
sources.

• Enhanced Spark SQL support allows you to execute Spark queries using a variation of the SQL language. Spark
SQL includes APIs for returning Spark Datasets in Scala and Java, and interactively using an SQL shell or visually
through DataStax Studio notebooks.

Be sure to read the What’s New for DataStax Enterprise Analytics 6 blog.

DSE Graph New features in DSE Graph include:

• Better throughput for DSE Graph due to Advanced Performance improvements, resulting in DSE Graph handling
more requests per node.

• Smart Analytics Query Routing: the DSE Graph engine automatically routes a Gremlin OLAP traversal to the
correct implementation (DSE Graph Frames or Gremlin OLAP) for the fastest and best execution.

• Advanced Schema Management provides the ability to remove any graph schema element, not just vertex labels
or properties.

• The Batches in DSE Graph Fluent API adds the ability to execute DSE Graph statements in batches to speed up
writes to DSE Graph.

• TinkerPop 3.3.0. DataStax has added many enhancements to the Apache TinkerPop™ tool suite, providing
faster, more robust graph querying and a better developer experience.

Be sure to read the What’s New in DSE Graph 6 blog.

DSE Security New security features include:

• Private Schemas: Control who can see what parts of a table definition, critical for security compliance best
practices.

• Separation of Duties: Create administrator roles who can carry out everyday administrative tasks without having
unnecessary access to data.

• Auditing by Role: Focus your audits on the users you need to scrutinize. You can now elect to audit activity by user
type and increase the signal to noise ratio by removing application tier system accounts from the audit trail.

• Unified Authorization for DSE Analytics: Additional protection for data used for analytics operations.

Be sure to read the Safe data? Check. DataStax Enterprise Advanced Security blog.

DSE Search Built with a production-certified version of Apache Solr™ 6, DSE Search requires less configuration, provides improved
search data consistency, and uses a more synchronous write path for indexing data, with fewer moving pieces to tune and monitor.
DSE 5.1 introduced index management CQL and cqlsh commands to streamline operations and development. DSE 6.0
adds a wider array of CQL query functionality and indexing support.
Be sure to read the What’s New for Search in DSE 6 blog.

Drivers DataStax drivers are updated for DSE 6.0, including:

• The Batches in DSE Graph Fluent API adds the ability to execute DSE Graph statements in batches to speed up
writes to DSE Graph.

• The C# and Node.js DataStax drivers now include the Batches in DSE Graph Fluent API, as do the Java and Python
drivers.

Be sure to read the What’s New With Drivers for DSE 6 blog.


DataStax Studio Improvements to DataStax Studio that further ease DSE development include:

• Notebook Sharing: Easily collaborate with your colleagues to develop DSE applications using the new import and
export capabilities.

• Spark SQL support: Query and analyze data with Spark SQL using DataStax Studio's visual and intelligent
notebooks, which provide syntax highlighting, auto-code completion and correction, and more.

• Interactive Graphs: explore and configure DSE Graph schemas with a whiteboard-like view that allows you to drag
your vertices and edges.

• Notebook History: provides a historical dated record with descriptions and change events that makes it easy to
track and rollback changes.

Be sure to read the Announcing DataStax Studio 6 blog.

Chapter 2. DataStax Enterprise release notes
Release notes for DataStax Enterprise 6.0.

DataStax Enterprise 6.0 release notes


DataStax Enterprise release notes cover cluster requirements, upgrade guidance, components, security updates,
changes and enhancements, issues, and resolved issues for DataStax Enterprise (DSE) 6.0.x.
Each point release includes a highlights and executive summary section to provide guidance and add visibility
to important improvements.

Requirement for Uniform Licensing


All nodes in each DataStax licensed cluster must be uniformly licensed to use the same subscription. For
example, if a cluster contains five nodes, all five nodes within that cluster must be DSE. Mixing different
subscriptions within a cluster is not permitted. The DataStax Advanced Workloads Pack may be added to any
DSE cluster in an incremental fashion. For example, a 10-node DSE cluster may be extended to include three
nodes of the Advanced Workloads Pack. “Cluster” means a collection of nodes running the software which
communicate with one another using gossip. See Enterprise Terms.
Before you upgrade

Upgrade advice: Before you upgrade to a later major version, upgrade to the latest patch release (6.0.13) on your current version. Be sure to read the relevant upgrade documentation.
Compatibility: Upgrades to DSE 6.0 are supported from:

• DSE 5.1

• DSE 5.0

Upgrade advice: Check the compatibility page for your products.
Compatibility: DSE 6.0 product compatibility:

• OpsCenter 6.5

• Studio 6.0

Upgrade advice: See Upgrading DataStax drivers.
Compatibility: DataStax Drivers: You may need to recompile your client application code.

Upgrade advice: Use DataStax Bulk Loader for loading and unloading data.
Compatibility: Loads data into DSE 5.0 or later and unloads data from any Apache Cassandra™ 2.1 or later data source.

DSE 6.0.13 release notes


26 August 2020
In this section:

• 6.0.13 Components

• Cassandra enhancements for DSE 6.0.13

• General upgrade advice for DSE 6.0.13

• TinkerPop changes for DSE 6.0.13

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.


In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.13 Components

All components from DSE 6.0.13 are listed. Components that are updated for DSE 6.0.13 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2793

• Apache Spark™ 2.2.3.13

• Apache TinkerPop™ 3.3.7-20190521-f71ce0d7

• Apache Tomcat® 8.0.53

• DSE Java Driver 1.6.10

• Netty 4.1.25.7.dse

• Spark JobServer 0.8.0.45.2

• TinkerPop 3.3.7 with production-certified changes

DSE 6.0.13 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DataStax recommends upgrading all DSE Search nodes to DSE 6.0.13 or later.

6.0.13 DSE core

Changes and enhancements:

• Fixed StackOverflowError thrown during read repairs (only large clusters or clusters with vnodes enabled
are affected). (DB-4350)

• Increased default direct_reads_size_in_mb value. Previously it was 2M per core + 2M shared. It is now
4M per core + 4M shared. (DB-4348)

• Fixed slow indexing at bootstrap time caused by early TPC boundary computation when a node is replaced
by a node with the same IP. (DB-4049)

• Fixed a problem with the handling of zeroes in the decimal type that could cause assertion errors, failures to
find rows whose key is 0 written with different precisions, or both. (DB-4472)

• CQLSH can be run with Python 2 or 3. (DB-4151)

• Fixed the NullPointerException issue described in CASSANDRA-14200: NPE when dumping an SSTable
with null value for timestamp column. (DB-4512)


• Fixed an issue that caused excessive contention during encryption/decryption operations. The fix results
in an encryption/decryption performance improvement. (DB-4419)

• Added a new configuration option to cassandra.yaml: snapshot_before_dropping_column, which is
false by default. When enabled, a snapshot is created on each node in the cluster before the schema change
is applied whenever a user drops one or more columns from a table.

• Fixed an issue to prevent an unbounded number of flushing tasks for memtables that are almost empty.
(DB-4376)

• Global BloomFilterFalseRatio is now calculated in the same way as table BloomFilterFalseRatio. Both
types of metrics now include true negatives; the formula is ratio = falsePositiveCount / (truePositiveCount +
falsePositiveCount + trueNegativeCount). (DB-4439)

• Fixed a bug whereby, after a node replacement procedure, the bootstrap indexing in DSE Search happened
on only one TPC core. (DB-4049)

• DNS Service Discovery is now a part of the DSE/LDAP integration. (DSP-11450)

• Systemd units are included for DSE packages for CentOS and compatible OSes. (DSP-7603)

• The server_host option in dse.yaml now handles multiple, comma-separated LDAP server addresses.
(DSP-20833)

• Cassandra tools now work on encrypted SSTables when security is configured. (DSP-20940)

• Workaround for LOGBACK-1194 - explicit scanPeriod added to logback.xml. (DSP-17911)

• Recording a slow CQL query to the log will no longer block the thread. (DSP-20894)

• Added entries to jvm.options to assist with capturing thread dumps. (DSP-20778)

• The frequency of range queries performed by lease manager is now configurable via
dse.lease.refresh.interval.seconds system property (an addition to JMX and dsetool command)
(DSP-20696)

• Security updates:

# Fixed a CVE-2019-20444 issue in which HttpObjectDecoder.java in Netty, before 4.1.44, allowed an


HTTP header that lacked a colon. (DB-4068)

# The jackson-databind library has been upgraded to 2.9.10.4 to address a Jackson databind vulnerability
(CVE-2020-8840) (DSP-20981)

# DNS Service Discovery is now a part of the DSE/LDAP integration. (DSP-11450)

# Fixed some security vulnerabilities for Solr HTTP REST API when authorization is enabled. Now, users
with no appropriate permissions cannot perform search operations. Resources can be deleted when
authorization is enabled, given the correct permissions. (DSP-20749)

# Fixed an issue where the audit logging did not capture search queries. (DSP-21058)

# There are two new LDAP options in dse.yaml, extra_user_search_bases and extra_group_search_bases,
where you can define additional search bases for users and groups respectively. For users, if the user is not
found in one search base, all other bases are searched. For groups, groups found in all defined search bases
are merged. A configuration sketch follows this list. (DSP-12612)

# While there is no change in default behavior, there is a new render_cql_literals option in dse.yaml
under the audit logging section, which is false by default. When enabled, bound variables for logged
statements are rendered as CQL literals, which means there is additional quotation and escaping, and
values of all complex types (collections, tuples, UDTs) appear in human-readable format. (DSP-17032)

# Fixed LDAP settings to properly handle nested groups so that LDAP enumerates all ancestors of a
user's distinguishedName. Inherited groups are now retrieved with both the directory_search and members_search types.


Fixed fetching parent groups of a role that's mapped to an LDAP group. See new dse.yaml options,
all_groups_xxx in ldap_options, to configure optimized retrieval of parent groups, including inherited
ones, in a single roundtrip. (DSP-20107)

# When DSE tries one authentication scheme and finds that the password is invalid, DSE now tries
another scheme, but only if the user has a scheme permission for that other scheme. (DSP-20903)

# Raised the upper bound limit on DSE LDAP caches. The upper limit for
ldap_options.credentials_validity_in_ms has been increased to 864,000,000 ms, which is
10 days. The upper limit for ldap_options.search_validity_in_seconds has been increased to
864,000 seconds, which is 10 days. (DSP-21072)

# Fixed an error condition when DSE failed to get the LDAP roles while refreshing a database schema.
(DSP-21075)
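The following dse.yaml fragment is a hedged sketch of several of the options described above; the server names
and search bases are placeholders, and the exact nesting and value formats should be confirmed against the
dse.yaml shipped with this release:

ldap_options:
    # Multiple comma-separated LDAP servers (DSP-20833)
    server_host: ldap1.example.com, ldap2.example.com
    # Additional search bases for users and groups (DSP-12612)
    extra_user_search_bases:
        - ou=users,dc=example,dc=com
    extra_group_search_bases:
        - ou=groups,dc=example,dc=com

audit_logging_options:
    enabled: true
    # Render bound variables for logged statements as CQL literals (DSP-17032)
    render_cql_literals: true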

6.0.13 DSE Advanced Replication

Changes and enhancements:

• Fixed Advanced Replication OutOfMemoryErrors caused by Roaring bitmap deserialization. (DSP-15675)

6.0.13 DSEFS

Changes and enhancements:

• To minimize fsck impact on overloaded clusters, throttling is possible via the -p or --parallelism arguments; see the example after this list.

• Backported DSP-15762: optimized the remove-recursive implementation, lowering the tombstone impact on
Spark jobs. (DSP-20750)

• The byos-export command exports dsefs configuration for AbstractFileSystem (DSP-20906)

• Fixed an issue where an excessive number of connections are created to port 5599 when using DSEFS.
(DSP-21021)

• Fixed excessive allocation when running fsck on DSEFS volumes. (DSP-21246)
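A hedged example of a throttled consistency check from the DSEFS shell follows; the parallelism value is
arbitrary and the exact invocation may vary by install type:

$ dse fs "fsck --parallelism 2"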

6.0.13 DSE Search

Changes and enhancements:

• Search-related latency metrics will now decay over time like other metrics. Named queries (using the query.name
parameter) will now have separate latency metrics. New MBean attributes are available for search
latency metrics: TotalLatency (us), Min, Max, Mean, StdDev, DurationUnit, MeanRate, OneMinuteRate,
FiveMinuteRate, FifteenMinuteRate, RateUnit, 98th, 999th. (DSP-19612)

• Significantly reduced the time to (re)load encrypted search cores. (DSP-20692)

• Fixed some security vulnerabilities for Solr HTTP REST API when authorization is enabled. Now, users with
no appropriate permissions cannot perform search operations. Resources can be deleted when authorization is
enabled, given the correct permissions. (DSP-20749)

• Fixed a bug where a decryption block cache occasionally was not operational (SOLR-14498). (DSP-20987)

• Fixed an issue where the audit logging did not capture search queries. (DSP-21058)

• Fixed a bug where, after several months of uptime, an encrypted index would not accept more writes unless
the core was reloaded. (DSP-21234)


Cassandra enhancements for DSE 6.0.13


DataStax Enterprise 6.0.13 is compatible with Apache Cassandra™ 3.11 and includes all DataStax enhancements
from earlier releases.
General upgrade advice for DSE 6.0.13
DataStax Enterprise 6.0.13 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.13
DataStax Enterprise (DSE) 6.0.13 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions. See the TinkerPop upgrade documentation.
DSE 6.0.12 release notes
4 May 2020
In this section:

• 6.0.12 Components

• Cassandra enhancements for DSE 6.0.12

• General upgrade advice for DSE 6.0.12

• TinkerPop changes for DSE 6.0.12

Table 1: DSE functionality

• 6.0.12 DSE core

• 6.0.12 DSE Advanced Replication

• 6.0.12 DSE Analytics

• 6.0.12 DSEFS

• 6.0.12 DSE Graph

• 6.0.12 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.12 Components

All components from DSE 6.0.12 are listed. Components that are updated for DSE 6.0.12 are indicated with an
asterisk (*).


• Apache Solr™ 6.0.1.1.2716

• Apache Spark™ 2.2.3.13

• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.1

• DSE Java Driver 1.6.10

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.25.6.dse

• Spark Jobserver 0.8.0.45.2 DSE custom version

• TinkerPop 3.3.7 with production-certified changes

For a full list, see DataStax Enterprise 6.0.12 third-party software.


DSE 6.0.12 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DataStax recommends upgrading all DSE Search nodes to DSE 6.0.12 or later.

6.0.12 DSE core

Changes and enhancements:

• Added hostname_verification to ldap_options in dse.yaml. (DSP-20302)

• The frequency of range queries performed by lease manager is now configurable via JMX and dsetool
command. (DSP-20696)

• Added dse.ldap.retry_interval.ms system property, which sets the time between subsequent retries
when trying authentication using LDAP server. (DSP-20298)

• Removed Jodd Core dependency that created vulnerability to Arbitrary File Writes. (DSP-19206)

• Added a new JMX attribute, ConnectionSearchPassword, for the LdapAuthenticator bean, which updates
the LDAP search password without the need to restart DSE. (DSP-18928)

• dsetool ring shows in-progress search index building during bootstrap. (DSP-15281)

• Made the search reference visible in the error message for LDAP connections. (DSP-20578)

• DecayingEstimatedHistogram now decays even when there are no updates so invalid metric values do not
linger. (DSP-20674)

• Added functionality to query role_stats when stats is enabled under role_management_options in


dse.yaml. (DB-4283)

• The replica side filtering dtests test_update_on_wide_table and


test_complementary_update_with_limit_on_static_column_with_not_empty_partitions are more
reliable. (DB-4043)

• Nodesync can now be enabled on all system distributed and protected tables. (DB-3241)

• Improved the estimated values of histogram percentiles reported via JMX. In some cases, the percentiles
may go slightly up. (DB-4275)

• Added anticompaction to nodetool stop command help menu. (DB-3821)

• Added a --disable-history option to cqlsh that disables saving history to disk for the current execution. Added
a history section to cqlshrc with a boolean parameter disabled, which is set to False by default; see the sketch after this list.
(DB-3843)


• Improved error messaging for enabled internode SSL encryption in Cassandra Tools test suite. (DB-3957)

• Removed the serialization header partition/clustering key validation (DB-4111)

• Security updates:

# Upgraded Jackson Core and Jackson Mapper to address CVE-2019-10172. (DSP-20073)
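For example, a minimal sketch of the new history controls; the cqlshrc path shown assumes the default
per-user location:

$ cqlsh --disable-history     # do not save history to disk for this session

# ~/.cassandra/cqlshrc
[history]
disabled = true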

Resolved issues:

• LDAP cursors leak. (DSP-20623)

• Bug that prevented LIST ROLES and LIST USERS from working with system keyspace filtering enabled.
(DB-4221)

• Continuous paging sessions could leak if the continuous result sets on the driver side were not exhausted or
cancelled. (DB-4313)

• Potentially incorrect dropped messages in case of time drifts on a machine. (DB-3891)

• Read inconsistencies. (CASSANDRA-12126) (DB-3873)

• Error that caused nodetool viewbuildstatus to return an incorrect error message. (DB-2397)

6.0.12 DSE Advanced Replication

Resolved issues:

• Advanced Replication's OutOfMemoryErrors caused by Roaring bitmap deserialization. (DSP-15675)

6.0.12 DSE Analytics

Changes and enhancements:

• Fixed an issue where internal continuous paging sessions were not closed when a LIMIT clause was added to a
SQL query, which caused a session leak and prevented the Spark application from closing gracefully because the
Java driver waited indefinitely for orphaned sessions to finish. (DSP-19804)

• Removed Jodd Core dependency that created vulnerability to Arbitrary File Writes. (DSP-19206)

• Added the spark.cassandra.session.consistency.level parameter to the Spark Connector. Set the
HiveMetaStore default consistency level to LOCAL_QUORUM instead of ONE; see the example after this list. (DSP-19982)

• Fixed an issue where, during Spark application startup, Exception: java.lang.ExceptionInInitializerError thrown from
the UncaughtExceptionHandler in thread "main" was sometimes logged instead of a meaningful
error. (DSP-20474)

• Security updates:

# Patched hive with HIVE-13390 to fix CVE-2016-3083. (DSP-20612)
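A hedged example of setting the new connector parameter when launching a Spark shell; the consistency level
shown is only illustrative:

$ dse spark --conf spark.cassandra.session.consistency.level=LOCAL_QUORUM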

6.0.12 DSEFS

Changes and enhancements:

• DSEFS local file system implementation returns alphabetically sorted directories and files when using
wildcards and listing command. (DSP-20057)

• When creating a file through WebHDFS API, DSEFS does not verify WX permissions of parent's parent
when the parent exists. (DSP-20355)


• DSEFS internode encrypted communication doesn't fail when


server_encryption_options.require_endpoint_verification is enabled. (DSP-20689)

Resolved issues:

• DSEFS could not use mixed-case keyspaces, a regression introduced by DSP-16825. (DSP-20354)

6.0.12 DSE Graph

Changes and enhancements:

• Exposed configuration and metrics for Gremlin query cache. (DSP-20240)

• Changed classic Graph query so vertices are read from _p tables in Cassandra using SELECT ... WHERE
<vertex primary key columns> statement. The search predicate is applied in memory. (DSP-20230)

6.0.12 DSE Search

Changes and enhancements:

• Error messages related to Solr errors contain a better description of the root cause. (DSP-13792)

• The dsetool stop_core_reindex command now mentions the node in the output message. (DSP-17090)

• Added indexing reason to output of dsetool core_indexing_status command. (DSP-17672)

• Improved warnings for search index creation via dsetool or CQL. (DSP-17994)

• Improved guidance with warnings when index rebuild is required for ALTER SEARCH INDEX, RELOAD SEARCH
INDEX, and dsetool reload_core commands. (DSP-19347)

• Improved real-time search to fix a docValues bug. (DSP-20300)

• The suggest request handler now requires SELECT permission. Previously, the suggest request handler returned a
forbidden response when authorization was on, regardless of user permissions. (DSP-20697)

• Security update:

# Upgraded Apache Solr to address CVE-2018-8026. (DSP-16653)

Cassandra enhancements for DSE 6.0.12


DataStax Enterprise 6.0.12 is compatible with Apache Cassandra™ 3.11 and includes all DataStax enhancements
from earlier releases.
General upgrade advice for DSE 6.0.12
DataStax Enterprise 6.0.12 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.12
DataStax Enterprise (DSE) 6.0.12 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions. See the TinkerPop upgrade documentation.
DSE 6.0.11 release notes
10 December 2019
In this section:

• 6.0.11 Components

• DSE 6.0.11 Highlights


• Cassandra enhancements for DSE 6.0.11

• General upgrade advice for DSE 6.0.11

• TinkerPop changes for DSE 6.0.11

Table 2: DSE functionality

• 6.0.11 DSE core

• 6.0.11 DSE Analytics

• 6.0.11 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.11 Components

All components from DSE 6.0.11 are listed. Components that are updated for DSE 6.0.11 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2642 *

• Apache Spark™ 2.2.3.9 *

• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.1

• DSE Java Driver 1.6.10

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.25.6.dse

• Spark Jobserver 0.8.0.45.2 * DSE custom version

• TinkerPop 3.3.7 with production-certified changes

For a full list, see DataStax Enterprise 6.0.11 third-party software.


DSE 6.0.11 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DSE 6.0.11 Highlights


High-value benefits of upgrading to DSE 6.0.11:


DSE Search highlights

DataStax recommends upgrading all DSE Search nodes to DSE 6.0.11 or later.

• Fixed a bug to avoid multiple disposals of Solr filter cache DocSet objects. (DSP-15765)

• Improved performance and logging, and added options for using the Solr timeAllowed parameter in all queries.
The Solr timeAllowed option in queries is now enforced by default to prevent long-running shard queries.
(DSP-19781, DSP-19790)

6.0.11 DSE core

Changes and enhancements:

• Add support for nodesync command to specify different IP addresses for JMX and CQL. (DB-2969)

• Enhancements to the offline sstablescrub utility. (DB-3510, DB-3511)

# Specify which SSTables to scrub.

# Scrub multiple tables of the same keyspace.

# Specify the number of threads to simultaneously scrub SSTables within a table.

• Prevent accepting streamed SSTables or loading SSTables when the clustering order does not match.
(DB-3530)

• Dropping and re-adding the same column with incompatible types is not supported. This change prevents
unreadable SSTables. (DB-3586)

Resolved issues:

• Background compactions block SSTable operations too long. (DB-3682)

• Post-bootstrap indexing is executed by only a single CPU core. (DB-3692)

• Reads against ma and mc SSTables hit more SSTables than necessary due to the bug fixed by
CASSANDRA-14861. (DB-3691)

• Error retrieving expired columns with secondary index on key components. (DB-3764)

• The diff logic used by the secondary index does not always pick the latest schema and results in ERROR
[CoreThread-8] errors on batch writes. (DB-3838)

• Unexpected CoreThread error thrown by LWT.PROPOSE. (DB-3858)

• Fixed concurrency factor calculation for distributed range reads, with a maximum of 10 times
the number of cores. The maximum concurrency factor is configurable with the new JVM argument
-Ddse.max_concurrent_range_requests. (DB-3859)

• Prevent continuous triggers with read defragmentation. (DB-3866)

• Cached serialized mutations can cause G1 GC humongous objects. (DB-3867)

• AIO and DSE Metrics Collector are not available on RHEL/CentOS 6.x because GLIBC_2.14 is not present.
(DSP-18603)

• Upgrade Jackson Databind to address CVE-2019-14540 and CVE-2019-16942. (DSP-19764, DSP-19896)

• Using SELECT JSON for empty BLOB values incorrectly returns an empty string instead of the expected 0x.
(DSP-20022)

• RoleManager cache keeps invalid values if the LDAP connectivity is down. (DSP-20098)


• LDAP user login fails due to parsing failure on user DN with parentheses. (DSP-20106)

6.0.11 DSE Analytics

Changes and enhancements:

• Add new DSE class com/datastax/bdp/spark/* for dse-spark-dependencies. (DSP-16070)

• New du dsefs shell command lists sizes of the files and directories in a specific directory. (DSP-19572)

• Improved configuration of available system resources for Spark Workers. You can now set the total memory
and total cores with new environment variables that take precedence over the resource_manager_options
defined in dse.yaml; see the sketch at the end of this list. (DSP-19673)
dse.yaml resource_manager_options Environment variable

memory_total SPARK_WORKER_TOTAL_MEMORY

cores_total SPARK_WORKER_TOTAL_CORES

• Support for multiple contact points is added for the DSEFS implementation of the Hadoop FileSystem.
(DSP-19704)
Provide the FileSystem URI in the form:

dsefs://host0[:port][,host1[:port]]/
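A hedged sketch of overriding Spark Worker resources with the new environment variables; the file path
assumes a package install, and the values are placeholders (check the dse.yaml comments for accepted formats):

# /etc/dse/spark/spark-env.sh
export SPARK_WORKER_TOTAL_MEMORY=64g
export SPARK_WORKER_TOTAL_CORES=16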

6.0.11 DSE Search

Enhancements:

• The Solr timeAllowed option in queries is now enforced by default to prevent long-running shard queries.
This change prevents complex facets and boolean queries from using system resources after the DSE
Search coordinator considers the queries to have timed out. For all queries, the default timeAllowed
value uses the value of the client_request_timeout_seconds setting in dse.yaml. (DSP-19781, DSP-19790)
While using Solr timeAllowed in queries improves performance for long zombie queries, it can cause
increased per-request latency cost in mixed workloads. If the per-request latency cost is too high, use the
-Ddse.timeAllowed.enabled.default search system property to disable timeAllowed in your queries; see the sketch after this list.

• Upgraded spray-json to prevent Denial of Service (DoS) vulnerabilities CVE-2018-18854 and
CVE-2018-18853. (DSP-19208)
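A hedged sketch of disabling the timeAllowed default on a search node; the boolean value is an assumption
based on the property name, and the property can be set in jvm.options or passed as a JVM argument:

-Ddse.timeAllowed.enabled.default=false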

Resolved issues:

• Error on disposals of Solr filter cache DocSet objects. (DSP-15765)

• Apply filter cache optimization to remote shard requests when RF=N. (DSP-19800)

• Filter cache warming doesn't warm parent-only filter correctly when RF=N. (DSP-19802)

• Memory allocation issue causes performance degradation at query time. (DSP-19805)

Cassandra enhancements for DSE 6.0.11


DataStax Enterprise 6.0.11 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:

• Handle paging states serialized with a different version than the session version (CASSANDRA-15176)

• Toughen up column drop/recreate type validations (CASSANDRA-15204)

• SSTable min/max metadata can cause data loss (CASSANDRA-14861)


• Use Bounds instead of Range for sstables in anticompaction (CASSANDRA-14411)

General upgrade advice for DSE 6.0.11


DataStax Enterprise 6.0.11 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.11
DataStax Enterprise (DSE) 6.0.11 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions. See the TinkerPop upgrade documentation.
DSE 6.0.10 release notes
19 September 2019
In this section:

• 6.0.10 Components

• DSE 6.0.10 Highlights

• Cassandra enhancements for DSE 6.0.10

• General upgrade advice for DSE 6.0.10

• TinkerPop changes for DSE 6.0.10

Table 3: DSE functionality

• 6.0.10 DSE core

• 6.0.10 DSE Analytics

• 6.0.10 DSE Graph

• 6.0.10 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.10 Components

All components from DSE 6.0.10 are listed. Components that are updated for DSE 6.0.10 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2507 *

• Apache Spark™ 2.2.3.5 *


• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.1

• DSE Java Driver 1.6.10 *

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.13.dse

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.7 with additional production-certified changes

DSE 6.0.10 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DSE 6.0.10 Highlights

High-value benefits of upgrading to DSE 6.0.10 include these highlights:


DSE Database (DSE core) highlights

• Fixed incorrect handling of frozen type issues to accept all valid CQL statements and reject all invalid CQL
statements. (DB-3084)

• Standalone cqlsh client tool provides an interface for developers to interact with the database and issue
CQL commands without having to install the database software. From DataStax Labs, download the version
of CQLSH that corresponds to your DataStax database version. (DSP-18694)

• New options to select cipher suite and protocol to configure KMIP encryption when connecting to a KMIP
server. (DSP-17294)

DSE Analytics highlights

• Storing and revoking permissions for the application owner is removed. The application owner is explicitly
assumed to have these permissions. (DSP-19393)

DSE Graph highlights

• Fixed an issue where T values are hidden by property keys of the same name in valueMap(). (DSP-19261)

DSE Search highlights

• Improved search query latency. (DSP-18677)

• Unbounded facet searches are no longer allowed. (DSP-18693)

# facet.limit < 0 is no longer supported. Override the default facet.limit of 20000 with the -Dsolr.max.facet.limit.size system property.

# This change adds guardrails that can cause misconfigured faceting queries to fail. Before upgrading, set an explicit facet.limit.

6.0.10 DSE core

Changes and enhancements:

• DSE version now appears as a comment in all configuration files. (DB-1022)

• Improved troubleshooting. A log entry is now created when autocompaction is disabled or enabled for a
table. (DB-1635)


• Enhanced DroppedMessages logging output adds the size percentiles of the dropped messages, their most
common destinations, and the most common tables targeted for read requests or mutations. (DB-1250)

• Reformatted StatusLogger output to reduce details in the INFO level system.log. The detailed output is still
present in the debug.log. (DB-2552)

• For nodetool tpstats -F json and nodetool tpstats -F yaml, wait latencies (in ms) appear in the
output. Although not labeled, the wait latencies are included in the following order: 50%, 75%, 95%, 98%,
99%, Min, and Max. (DB-3401)

• New resources improve debugging leaked chunks before the cache evicts them and provide more
meaningful call stack and stack trace. (DB-3504)

# RandomAccessReader/RandomAccessReader

# AsyncPartitionReader/FlowSource

# AsyncSSTableScanner/FlowSource

• Allocate large buffers directly in the chunk cache. (DB-3506)

• Buffers should return to the pool if a chunk is leaked. (DB-3512)

• New nodetool commands to get current values: getcachecapacity, getcachekeystosave, and gethintedhandoffthrottlekb. (DB-3618)
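For example, the new getters can be run directly from the command line (invocation only; output formatting may differ between DSE versions):

$ nodetool getcachecapacity
$ nodetool getcachekeystosave
$ nodetool gethintedhandoffthrottlekb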

• New options to select cipher suite and protocol to configure KMIP encryption when connecting to a KMIP
server. (DSP-17294)

• Standalone cqlsh client tool provides an interface for developers to interact with the database and issue
CQL commands without having to install the database software. From DataStax Labs, download the version
of CQLSH that corresponds to your DataStax database version. (DSP-18694)

• Upgraded Apache MINA Core library to 2.0.21 to prevent a security issue where Apache MINA Core was
vulnerable to information disclosure. (DSP-19213)

• Update Jackson Databind to 2.9.9.1 for all components except DataStax Bulk Loader. (DSP-19441)

Resolved issues:

• Fix to prevent NPE during repair in mixed-version clusters. (DB-1985)

• In tarball installations that create two instances on the same physical server with remote JMX access, binding separate IPs to port 7199 causes the JMX error Address already in use (Bind failed) because com.sun.management.jmxremote.host is ignored. (DB-2483)

• Prevent changing the replication strategy of system keyspaces. (DB-2960)

• Upgrade Jackson Databind to address CVE-2018-11307 and CVE-2018-19361. (DB-2911, DSP-18099, DSP-19319)

• Slow startup or node hangs when encryption is used. (DB-3050)

• Incorrect handling of frozen type issues: valid CQL statements are not accepted and invalid CQL statements are not properly rejected. (DB-3084)

• DSE fails to start with ERROR Attempted serializing to buffer exceeded maximum of 65535 bytes. Improved
error to identify a workaround for commitlog corruption. (DB-3162)

• sstabledowngrade needs write access to the snapshot folder for a different output location. (DB-3231)

• The number of pending compactions reported by nodetool compactionstats was incorrect (off by one) for
Time Window Compaction Strategy (TWCS). (DB-3284)

• Invalid JSON output for nodetool tpstats -F json. (DB-3401)


• When unable to send mutations to replicas due to overloading, hints are mistakenly created against the local
node. (DB-3421)

• When a non-frozen UDT column is dropped and the table is later re-created from the schema that was
created as part of a snapshot, the dropped column record is invalid and may lead to failure loading some
SSTables. (DB-3434)

• sstablepartitions incorrectly handles -k and -x options. (DB-3442)


Workaround: To specify multiple keys, repeat the -k or -x option for each key.
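A minimal sketch of that workaround, using hypothetical keys key1 and key2 and a placeholder SSTable path:

$ sstablepartitions -k key1 -k key2 /path/to/sstable-Data.db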

• Memory leaks when updating tables with materialized views. (DB-3504)

• Error in custom provider prevents DSE node startup. With this fix, the node will start up but insights
is not active. See the DataStax Support Knowledge Base for steps to resolve existing missing or incorrect
keyspace replication problems. (DSP-19521)

Known issues:

• On Oracle Linux 7.x, StorageService.java:4970 exception occurs with DSE package installation.
(DSP-19625)
Workaround: On Oracle Linux 7.x operating systems, install DSE using the binary tarball.

6.0.10 DSE Analytics

Changes and enhancements:

• Storing and revoking permissions for the application owner is removed. Instead of explicitly storing
permission of the application owner to manage and view Spark applications, the application owner is
explicitly assumed to have these permissions. (DSP-19393)

Resolved issues:

• Spark applications incorrectly reported that joins were broken. DirectJoin output check too strict.
(DSP-19063)

• Submitting many Spark apps will reach the default tombstone_failure_threshold before the default 90 days
gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)
Workaround with this fix:

1. Manually grant permissions to the user before the user starts Spark jobs:

GRANT AUTHORIZE, DESCRIBE, MODIFY ON ANY SUBMISSION IN WORKPOOL 'datacenter_name.workpool' TO role_name;

2. Start Spark jobs for this user.


3. After all Spark jobs are complete for this user, revoke the permissions for this user.

REVOKE AUTHORIZE, DESCRIBE, MODIFY ON ANY SUBMISSION IN WORKPOOL 'datacenter_name.workpool' FROM role_name;

• Credentials are not masked in the debug level logs for Spark Jobserver and Spark submitted jobs.
(DSP-19490)

6.0.10 DSE Graph

Changes and enhancements:

• New graph truncate command to remove all data from graph. (DSP-17609)


• Support for ifExists() before truncate(), like system.graph("foo").ifExists().truncate(), in the DSE Graph (classic graph) API. (DSP-19357)

Resolved issues:

• gremlin-console startup time is improved. (DSP-11550)

• T values get hidden by property keys of the same name in valueMap(). (DSP-19261)

6.0.10 DSE Search

Enhancements:

• DSE 6.0 search query latency is on parity with DSE 5.1. (DSP-18677)

• For token ranges dictated by distribution, filter cache warming occurs when a node is restarted, a search
index is rebuilt, or when node health score is up to 0.9. New per-core metrics for metric type WarmupMetrics
and other improvements. (DSP-8621)

• Unbounded facet searches are no longer allowed. (DSP-18693)

# facet.limit < 0 is no longer supported. Override the default facet.limit of 20000 with the -Dsolr.max.facet.limit.size system property.

# This change adds guardrails that can cause misconfigured faceting queries to fail. Before upgrading, set an explicit facet.limit, as in the example below.
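As an illustrative sketch (the limit values and field name are placeholders), raise the cap with the JVM system property and set an explicit limit on facet queries:

-Dsolr.max.facet.limit.size=50000     (for example, added to jvm.options)
/select?q=*%3A*&facet=true&facet.field=field1&facet.limit=1000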

Resolved issues:

• A Solr CQL count query incorrectly returns the total data count; it should return the total data count minus the start offset. (DSP-16153)

• Validation error does not get returned when docValues are applied when types do not allow docValues.
(DSP-16884)
With this fix, the following exception behavior is applied:

# Throw exception when docValues:true is specified for a column and column type does not support
docValues.

# Do not throw exception and ignore docValues:true for columns with types that do not support docValues
if docValues:true is set for *.

• When using live indexing, also known as Real Time (RT) indexing, stale Solr documents contain data that is
updated in the database. This issue happens when a facet query is run against a search index (core) while
inserting or loading data, and the search core is shut down. (DSP-18786)

• When driver uses paging, CQL query fails when using a Solr index to query with a sort on a field that
contains the primary key name in the field: InvalidRequest: Error from server: code=2200 [Invalid
query] message="Cursor functionality requires a sort containing a uniqueKey field tie
breaker". (DSP-19210)

Known issues:

• The count() query with Solr enabled can be inaccurate or inconsistent. (DSP-19401)

Cassandra enhancements for DSE 6.0.10


DataStax Enterprise 6.0.10 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:
General upgrade advice for DSE 6.0.10
DataStax Enterprise 6.0.10 is compatible with Apache Cassandra™ 3.11.


All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.10
DataStax Enterprise (DSE) 6.0.10 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions.
DSE 6.0.9 release notes
9 July 2019
In this section:

• 6.0.9 Components

• 6.0.9 Important bug fix

• Cassandra enhancements for DSE 6.0.9

• General upgrade advice for DSE 6.0.9

• TinkerPop changes for DSE 6.0.9

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.9 Components

All components from DSE 6.0.9 are listed.

• Apache Solr™ 6.0.1.1.2460

• Apache Spark™ 2.2.3.4

• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.13.dse

• Spark Jobserver 0.8.0.45 DSE custom version


• TinkerPop 3.3.7 with additional production-certified changes

DSE 6.0.9 is compatible with Apache Cassandra™ 3.11 and includes all DataStax enhancements from earlier
versions.

DSE 6.0.9 Important bug fix

• Fixed possible data loss when using DSE Tiered Storage. (DB-3404)
If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

Cassandra enhancements for DSE 6.0.9


DataStax Enterprise 6.0.9 is compatible with Apache Cassandra™ 3.11 and includes all DataStax
enhancements from earlier releases.
General upgrade advice for DSE 6.0.9
DataStax Enterprise 6.0.9 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.9
DataStax Enterprise (DSE) 6.0.9 includes TinkerPop 3.3.7 with all DataStax enhancements from earlier
versions.
DSE 6.0.8 release notes
11 June 2019
In this section:

• 6.0.8 Components

• DSE 6.0.8 Highlights

• Cassandra enhancements for DSE 6.0.8

• General upgrade advice for DSE 6.0.8

• TinkerPop changes for DSE 6.0.8

Table 4: DSE functionality

• 6.0.8 DSE core

• 6.0.8 DSE Analytics

• 6.0.8 DSEFS

• 6.0.8 DSE Graph

• 6.0.8 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.


• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.8 Components

All components from DSE 6.0.8 are listed. Components that are updated for DSE 6.0.8 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2460 *

• Apache Spark™ 2.2.3.4 *

• Apache Tomcat® 8.0.53 *

• DataStax Bulk Loader 1.2.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.13.dse *

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.7 with additional production-certified changes

DSE 6.0.8 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DSE 6.0.8 Highlights

High-value benefits of upgrading to DSE 6.0.8 include these highlights:


DSE Database (DSE core) highlights

• Significant fixes and improvements for native memory, the chunk cache, and async read timeouts.

• New configurable memory leak tracking. (DB-3123)

• Improved lightweight transactions (LWT) handling. (DB-3018, DB-3124)

DSE Analytics highlights

• When DSE authentication is enabled, Spark security is forced to be enabled. (DSP-17274)

• Spark security is turned on in dse.yaml configuration file. (DSP-17271)

DSEFS highlights

• Fix handling of path alternatives in DSEFS shell to provide wildcard support for mkdir and ls commands.
(DSP-17768)

DSE Graph highlights

• Operations through gremlin-console run with anonymous permissions. (DSP-18471)

• You can now dynamically pass cluster and connection configuration for different graph objects. Fixes the
issue where DseGraphFrame cannot directly copy graph from one cluster to another. (DSP-18605)


DSE Search highlights


Changes and improvements:

• Performance improvements and overload protection for search queries. (DSP-15875)

• New configurable memory leak tracking: new nodetool leaksdetection command and Memory leak detection
settings options in cassandra.yaml. (DB-3123)

• Performance improvements to Solr deletes that correspond to Cassandra rows. (DSP-17419)

• Changes to correct uneven distribution of shard requests with the STATIC set cover finder. (DSP-18197)

• New recommended method for case-insensitive text search, faceting, grouping, and sorting with new
LowerCaseStrField Solr field type. This type sets field values as lowercase and stores them as lowercase in
docValues. (DSP-18763)

Important bug fixes:

• The queryExecutorThreads and timeAllowed Solr parameters can be used together. (DSP-18717)

• Avoid interrupting request threads when an internode handshake fails so that the Lucene file channel lock
cannot be interrupted. Fixes LUCENE-8262. (DSP-18211)

6.0.8 DSE core

Changes and enhancements:

• Improved lightweight transactions (LWT) handling:

# Improved lightweight transactions (LWT) performance. New cassandra.yaml LWT configuration options.
(DB-3018)

# Optimized memory usage for direct reads pool when using a high number of LWTs. (DB-3124)
When not set in cassandra.yaml, the default calculated size of direct_reads_size_in_mb changed from
128 MB to 2 MB per TPC core thread, plus 2 MB shared by non-TPC threads, with a maximum value of
128 MB.
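For example, under the new calculation a node running 8 TPC core threads gets a default direct reads pool of 8 x 2 MB + 2 MB = 18 MB, and the calculated value is capped at 128 MB; an explicit direct_reads_size_in_mb value in cassandra.yaml still takes precedence.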

• Improved logging identifies which client, keyspace, table, and partition key is rejected when mutation
exceeds size threshold. (DB-1051)

• Improve status reporting for nodesync validation list. (DB-2707)

• Enable upgrading and downgrading SSTables using a CQL file that contains DDL statements to recreate the
schema. (DB-2951)

• Configurable memory leak tracking. (DB-3123)

# New nodetool leaksdetection command

Resolved issues:

• Nodes in a cluster continue trying to connect to a decommissioned node. (DB-2886)

• 32-bit integer overflow in StreamingTombstoneHistogramBuilder during compaction. (DB-3108)

• Possible direct memory leak when part of bulk allocation fails. (DB-3125)

• Counters in memtable allocators and buffer pool metrics can be incorrect when out of memory (OOM)
failures occur. (DB-3126)

• Memory leak occurs when a read from disk times out. (DB-3127)

• AssertionError in temporary buffer pool causes CorruptSSTableException. (DB-3172)


• Memory leak on errors when reading. (DB-3175)

• Bootstrap should fail when the node can't fetch the schema from other nodes in the cluster. (DB-3186)

• Increment pending echos when sending gossip echo requests. (DB-3187)

• Deadlock when replaying schema mutations from commit log during DSE startup. (DB-3190)

• Make the remote host visible in the error message for failed magic number verification. (DSP-18645)

Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

6.0.8 DSE Analytics

Changes and enhancements:

• A warning message is displayed when DSE authentication is enabled, but Spark security is not enabled.
(DSP-17273)

• When DSE authentication is enabled, Spark security is forced to be enabled. (DSP-17274)


dse.yaml option                        Spark security enforcement

authentication_options                 Spark security is enforced when enabled: true.

spark_security_enabled                 This setting is ignored.

spark_security_encryption_enabled      This setting is ignored.

• Spark Cassandra Connector: To improve connection for streaming applications with shorter batch times, the
default value for Keep Alive is increased to 1 hour. (DSP-17393)

Resolved issues:

• Cassandra Spark Connector rejects nested UDT when null. (DSP-17965)

• CassandraHiveMetastore does not unquote predicates for server-side filtering. (DSP-18017)

• Reduce probability of hitting max_concurrent_sessions limit for OLAP workloads with BYOS (Bring Your
Own Spark). (DSP-18280)
For OLAP workloads with BYOS, DataStax recommends increasing the max_concurrent_sessions using this formula as a guideline:

max_concurrent_sessions = spark_executors_threads_per_node x reliability_coefficient

where reliability_coefficient must be greater than 1, with a minimum reliability_coefficient value between 2 and replication factor (RF) x 2.
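For example (illustrative numbers only): with spark_executors_threads_per_node = 16 and reliability_coefficient = 2, set max_concurrent_sessions to at least 16 x 2 = 32; with RF = 3, a reliability_coefficient of RF x 2 = 6 gives 16 x 6 = 96.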

• dse spark-submit --status driver_ID command fails. (DSP-18616)

• BYOS DSEFS access fails with AuthenticationException with dseauth_internal_no_otherschemes. (DSP-18822)

• Accessing files from Spark through WebHDFS interface fails with message: java.io.IOException:
Content-Length is missing. (DSP-18559)


• Submitting many Spark applications will reach the default tombstone_failure_threshold before the default 90
days gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)

6.0.8 DSEFS

Resolved issues:

• Fix handling of path alternatives in DSEFS shell to provide wildcard support for mkdir and ls commands.
(DSP-17768)
For example, to make several subdirectories with a single command:

$ dse fs mkdir -p /datastax/demos/weather_sensors/{byos-daily,byos-monthly,byos-station}

$ dse fs mkdir -p {path1,path2}/dir

6.0.8 DSE Graph

Changes and enhancements:

• The graph configuration and gremlin_server sections in DSE Graph system-level options are now correctly
commented out at the top level. (DSP-18477)

Resolved issues:

• NPE when dropping a graph with an alias in gremlin console. (DSP-13387)

• Time, date, inet, and duration data types are not supported in graph search indexes. (DSP-17694)

• Should prevent sharing Gremlin Groovy closures between scripts that are submitted through session-less
connections, like DSE drivers. (DSP-18146)

• Operations through gremlin-console run with system permissions, but should run with anonymous
permissions. (DSP-18471)

• DseGraphFrame cannot directly copy graph from one cluster to another. You can now dynamically pass
cluster and connection configuration for different graph objects. (DSP-18605)
Workaround for earlier versions:

1. Export graph to DSEFS:

$ g.V.write.format("csv").save("dsefs://cluster1/tmp/vertices") &&
g.E.write.format("csv").save("dsefs://cluster1/tmp/edges")

2. Import graph to the other cluster:

$ g.updateVertices(spark.read.format("csv").load("dsefs://cluster1/tmp/vertices")) &&
g.updateEdges(spark.read.format("csv").load("dsefs://cluster1/tmp/edges"))

• Issue querying a search index when the vertex label is set to cache properties. (DSP-18898)

• UnsatisfiedLinkError when insert multi edge with DseGraphFrame in BYOS (Bring Your Own Spark).
(DSP-18916)

• DSE Graph does not use primary key predicate in Search/.has() predicate. (DSP-18993)

6.0.8 DSE Search


Changes and enhancements:

• Reject requests from the TPC backpressure queue when requests are on the queue for too long.
(DSP-15875)

• Changes to correct uneven distribution of shard requests with the STATIC set cover finder. (DSP-18197)
A new inertia parameter for dsetool set_core_property supports fine tuning. The default value of 1 can be
adjusted for environments with vnodes and more than 10 vnodes.

• New recommended method for case-insensitive text search, faceting, grouping, and sorting with new
LowerCaseStrField custom Solr field type. This type sets field values as lowercase and stores them as
lowercase in docValues. (DSP-18763)
DataStax does not support using the TextField Solr field type with solr.KeywordTokenizer and
solr.LowerCaseFilterFactory to achieve single-token, case-insensitive indexing on a CQL text field.

Resolved issues:

• SASI queries don't work on tables with row level access control (RLAC). (DB-3082)

• Documents might not be removed from the index when a key element has value equal to a Solr reserved
word. (DSP-17419)

• FQ broken with queryExecutorThreads and timeAllowed set. (DSP-18717)

• Avoid interrupting request threads when an internode handshake fails so that the Lucene file channel lock
cannot be interrupted. Fixes LUCENE-8262. (DSP-18211)
Workaround for earlier versions: Reload the search core without restarting or reindexing.

• Search should error out, rather than timeout, on Solr query with non-existing field list (fl) fields. (DSP-18218)

Cassandra enhancements for DSE 6.0.8


DataStax Enterprise 6.0.8 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from earlier releases.
General upgrade advice for DSE 6.0.8
DataStax Enterprise 6.0.8 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.8
DataStax Enterprise (DSE) 6.0.8 includes these production-certified enhancements to TinkerPop 3.3.7:

• Developed DSL pattern for gremlin-javascript.

• Generated uberjar artifact for Gremlin Console.

• Improved folding of property() step into related mutating steps.

• Added inject() to steps generated on the DSL TraversalSource.

• Removed gperfutils dependencies from Gremlin Console.

• Fixed PartitionStrategy when setting vertex label and having includeMetaProperties configured to
true.

• Ensure gremlin.sh works when directories contain spaces.

• Prevented client-side hangs if metadata generation fails on the server.

• Fixed bug with EventStrategy in relation to addE() where detachment was not happening properly.

• Ensured that gremlin.sh works when directories contain spaces.


• Fixed bug in detachment of Path where embedded collection objects would prevent that process.

• Enabled ctrl+c to interrupt long running processes in Gremlin Console.

• Quieted "host unavailable" warnings for both the driver and Gremlin Console.

• Fixed construction of g:List from arrays in gremlin-javascript.

• Fixed bug in GremlinGroovyScriptEngine interpreter mode around class definitions.

• Implemented EdgeLabelVerificationStrategy.

• Fixed behavior of P for within() and without() in Gremlin Language Variants (GLV) to be consistent with
Java when using variable arguments (varargs).

• Cleared the input buffer after exceptions in Gremlin Console.

• Added parameter to configure the processor in the gremlin-javascript client constructor.

• Docker images now use gremlin user instead of root user.

• Refactored use of commons-lang to use commons-lang3 only. Dependencies may still use commons-lang.

• Bumped commons-lang3 to 3.8.1.

• Added GraphSON serialization support for Duration, Char, ByteBuffer, Byte, BigInteger, and BigDecimal in
gremlin-python.

• Added ProfilingAware interface to allow steps to be notified that profile() was being called.

• Fixed bug where profile() could produce negative timings when group() contained a reducing barrier.

• Improved logic determining the dead or alive state of a Java driver connection.

• Improved handling of dead connections and the availability of hosts.

• Bumped httpclient to 4.5.7.

• Bumped slf4j to 1.7.25.

• Bumped commons-codec to 1.12.

• Fixed partial response failures when using authentication in gremlin-python.

• Fixed a bug in PartitionStrategy where addE() as a start step was not applying the partition.

• Improved performance of JavaTranslator by reducing calls to Method.getParameters().

• Implemented EarlyLimitStrategy which is supposed to significantly reduce backend operations for queries that use range().

• Reduced chance of hash collisions in Bytecode and its inner classes.

• Added Symbol.asyncIterator member to the Traversal class to provide support for await ... of
loops (async iterables).

Bug fixes:

• TINKERPOP-2081 PersistedOutputRDD materialises rdd lazily with Spark 2.x.

• TINKERPOP-2091 Wrong/missing feature requirements in StructureStandardTestSuite.

• TINKERPOP-2094 Gremlin Driver Cluster Builder serializer method does not use mimeType as suggested.

• TINKERPOP-2095 GroupStep looks for irrelevant barrier steps.

• TINKERPOP-2096 gremlinpython: AttributeError when connection is closed before result is received.

• TINKERPOP-2100 coalesce() creating unexpected results when used with order().


• TINKERPOP-2105 Gremlin-Python connection not returned back to the pool on exception from the Gremlin
Server.

• TINKERPOP-2113 P.Within() doesn't work when given a List argument.

Improvements:

• TINKERPOP-1889 JavaScript Gremlin Language Variants (GLV): Use heartbeat to prevent connection
timeout.

• TINKERPOP-2010 Generate jsdoc for gremlin-javascript.

• TINKERPOP-2013 Process tests that are auto-ignored stink.

• TINKERPOP-2018 Generate API docs for Gremlin.Net.

• TINKERPOP-2038 Make groovy script cache size configurable.

• TINKERPOP-2050 Add a :bytecode command to Gremlin Console.

• TINKERPOP-2062 Add Traversal class to CoreImports.

• TINKERPOP-2065 Optimize iterate() for remote traversals.

• TINKERPOP-2067 Allow getting raw data from Gremlin.Net.Driver.IGremlinClient.

• TINKERPOP-2068 Bump Jackson Databind 2.9.7.

• TINKERPOP-2069 Document configuration of Gremlin.Net.

• TINKERPOP-2070 gremlin-javascript: Introduce Connection representation.

• TINKERPOP-2071 gremlin-python: the graphson deserializer for g:Set should return a python set.

• TINKERPOP-2073 Generate tabs for static code blocks.

• TINKERPOP-2074 Ensure that only NuGet packages for the current version are pushed.

• TINKERPOP-2077 VertexProgram.Builder should have a default create() method with no Graph.

• TINKERPOP-2078 Hide use of EmptyGraph or RemoteGraph behind a more unified method for
TraversalSource construction.

• TINKERPOP-2084 For remote requests in console, display the remote stack trace.

• TINKERPOP-2092 Deprecate default GraphSON serializer fields.

• TINKERPOP-2097 Create a DriverRemoteConnection with an initialized Client.

• TINKERPOP-2102 Deprecate static fields on TraversalSource related to remoting.

• TINKERPOP-2106 When gremlin executes timeout, throw TimeoutException instead of TraversalInterruptedException/InterruptedIOException.

• TINKERPOP-2110 Allow connection on different path (from /gremlin).

• TINKERPOP-2114 Document common Gremlin anti-patterns.

• TINKERPOP-2118 Bump to Groovy 2.4.16.

• TINKERPOP-2121 Bump Jackson Databind 2.9.8.

DSE 6.0.7 release notes


1 April 2019
In this section:

• 6.0.7 Components


• DSE 6.0.7 Highlights

• Cassandra enhancements for DSE 6.0.7

• General upgrade advice for DSE 6.0.7

• TinkerPop changes for DSE 6.0.7

Table 5: DSE functionality

• 6.0.7 DSE core

• 6.0.7 DSE Analytics

• 6.0.7 DSEFS

• 6.0.7 DSE Graph

• 6.0.7 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.7 Components

All components from DSE 6.0.7 are listed. Components that are updated for DSE 6.0.7 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2407 *

• Apache Spark™ 2.2.3.4 *

• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.13.dse *

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.6 with additional production-certified changes

DSE 6.0.7 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.


DSE 6.0.7 Highlights

High-value benefits of upgrading to DSE 6.0.7 include these highlights:


DSE Database (DSE core) highlights

• Compaction performance improvement with new cassandra.yaml pick_level_on_streaming option. (DB-1658)

• Improved user tools for SSTable upgrades (sstableupgrade) and downgrades (sstabledowngrade).
(DB-2950)

• New cassandra.yaml direct_reads_size_in_mb option sets the size of the new buffer pool for direct
transient reads. (DB-2958)

• Reduction of LWT contention by improved handling of IO threads. (DB-2965)

• Remedy deadlock during node startup when calculating disk boundaries. (DB-3028)

• Correct handling of dropped UDT columns in SSTables. (DB-3031)


Workaround: If issues with UDTs in SSTables exist after upgrade from DSE 5.0.x, run sstablescrub -e
fix-only offline on the SSTables that have or had UDTs that were created in DSE 5.0.x.
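A minimal sketch of that workaround, assuming a hypothetical keyspace ks1 and table tbl1, run while DSE is stopped on the node:

$ sstablescrub -e fix-only ks1 tbl1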

• The frame decoding off-heap queue size is configurable and smaller by default. (DB-3047)

DSE Analytics highlights

• Authorization to AlwaysOn SQL web UI is supported. (DSP-18236)

• Handle quote in cache query of AlwaysOn SQL (AOSS). (DSP-18418)

• Fix leakage in BulkTableWriter. (DSP-18513)

DSE Graph highlights

• Some minor DSE GraphFrame code fixes. (DSP-18215)

• Improved updateEdges and updateVertices usability for single label update. (DSP-18404)

• Operations through gremlin-console run with anonymous instead of system permissions. (DSP-18471)

• Gremlin (groovy) scripts compile faster. (DSP-18025)

• Data caching improvements during DSE GraphFrame operations. (DSP-17870)

DSE Search highlights

• Fixed facets and stats queries when using queryExecutorThreads. (DSP-18237)

• Fixed timestamp PK routing with solr_query. (DSP-18223)

• Search/Solr HTTP request for CSV output is fixed. (DSP-18029)

6.0.7 DSE core

Changes and enhancements:

• Compaction performance improvement with new cassandra.yaml pick_level_on_streaming option. (DB-1658)
Streamed-in SSTables of tables using LCS (leveled compaction strategy) are placed in the same level as the source node, with possible up-leveling. Set pick_level_on_streaming to true to save compaction work for operations like nodetool refresh and replacing a node.
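A minimal cassandra.yaml sketch for enabling this behavior (assuming the option defaults to false when not set):

pick_level_on_streaming: true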


• The sstableloader downgrade from DSE to OSS Apache Cassandra is supported with new
sstabledowngrade tool. (DB-2756)
The sstabledowngrade command cannot be used to downgrade system tables or downgrade DSE
versions.

• TupleType values with null fields NPE when being made byte-comparable. (DB-2872)

• Support for using sstableloader to stream OSS Cassandra 3.x and DSE 5.x data to DSE 6.0 and later.
(DB-2909)

• Memory improvements with these supported changes:

# Configurable memory is supported for offline sstable tools. (DB-2955)
You can use these environment variables with the tools:

# MAX_HEAP_SIZE - defaults to 256 MB

# MAX_DIRECT_MEMORY - defaults to ((system_memory - heap_size) / 4) with a minimum of 1 GB and a maximum of 8 GB.

To specify memory on the command line:

$ MAX_HEAP_SIZE=2g MAX_DIRECT_MEMORY=10g sstabledowngrade keyspace table

The sstabledowngrade command cannot be used to downgrade system tables or downgrade DSE
versions.

# Buffer pool, and metrics for the buffer pool, are now in two pools. In cassandra.yaml, the file_cache_size_in_mb option sets the file cache (or chunk cache), and the new direct_reads_size_in_mb option sets the pool for all other short-lived read operations. (DB-2958)
To retrieve the buffer pool metrics:

$ nodetool sjk mxdump -q "org.apache.cassandra.metrics:type=CachedReadsBufferPool,name=*"

$ nodetool sjk mxdump -q "org.apache.cassandra.metrics:type=DirectReadsBufferPool,name=*"

For legacy compatibility, org.apache.cassandra.metrics:type=BufferPool still exists and is the same as org.apache.cassandra.metrics:type=CachedReadsBufferPool.

# cassandra-env.sh respect heap and direct memory values set in jvm.options or as environment
variables. (DB-2973)
The precedence for heap and direct memory is:

# Environment variables

# jvm.options

# calculations in cassandra-env.sh

# AIO is automatically disabled if the chunk cache size is small enough: less than or equal to system RAM / 8. (DB-2997)

# Limit off-heap frame queues by configurable number of frames and total number of bytes. (DB-3047)

Resolved issues:

• Native server Message.Dispatcher.Flusher task stalls under heavy load. (DB-1814)


• Race in CommitLog can cause failed force-flush-all. (DB-2542)

• Unclosed range tombstones in read response. (DB-2601)

• The sstableloader downgrade from DSE to OSS Apache Cassandra is not supported. New
sstabledowngrade tool is required. (DB-2756)

• Unused memory in buffer pool. (DB-2788)

• nodesync fails when validating MV row with empty partition key. (DB-2823)

• TupleType values with null fields NPE when being made byte-comparable. (DB-2872)

• The memory in use in the buffer pool is not identical to the memory allocated. (DB-2904)

• Reference leak in SSTableRewriter in sstableupgrade when keepOriginals is true. (DB-2944)

• Hint-dispatcher file-channel not closed, if open() fails with OOM. (DB-2947)

• Offline sstable tools fail with Out of Direct Memory error. (DB-2955)

• Hints and metadata should not use buffer pool. (DB-2958)

• Lightweight transactions contention may cause IO thread exhaustion. (DB-2965)

• DIRECT_MEMORY is being calculated using 25% of total system memory if -Xmx is set in jvm.options.
(DB-2973)

• Netty direct buffers can potentially double the -XX:MaxDirectMemorySize limit. (DB-2993)

• Increased NIO direct memory because the buffers are not cleaned until GC is run. (DB-2996)

• nodesync cannot be enabled on materialized views (MV). (DB-3008)

• Mishandling of frozen in complex nested types. (DB-3081)

• Check of two versions of metadata for a column fails on upgrade from DSE 5.0.x when type is not of same
class. Loosen the check from CASSANDRA-13776 to prevent Trying to compare 2 different types
ERROR on upgrades. (DB-3021)

• Deadlock during node startup when calculating disk boundaries. (DB-3028)

• cqlsh EXECUTE AS command does not work. (DB-3098)

• Dropped UDT columns in SSTables deserialization are broken after upgrading from DSE 5.0. (DB-3031)

• Kerberos protocol and QoP parameters are not correctly propagated. (DSP-15455)

• RpcExecutionException does not print the user who is not authorized to perform a certain action.
(DSP-15895)

• Leak in BulkTableWriter. (DSP-18513)

Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

6.0.7 DSE Analytics

Changes and enhancements:


• Support configuration to connect to multiple hosts from BYOS connector. (DSP-18231)

Resolved issues:

• After client-to-node SSL is enabled, all Spark nodes must also listen on port 7480. (DSP-15744)

• dse client-tool configuration byos-export does not export required Spark properties. (DSP-15938)

• Downloaded Spark JAR files are executable for all users. (DSP-17692)

• Issue with viewing information for completed jobs when authentication is enabled. (DSP-17854)

• Spark Cassandra Connector does not properly cache manually prepared RegularStatements, see SPARKC-558. (DSP-18075)

• Unexpected gossip failure. java.lang.NullPointerException: null. (DSP-18194)

• Apache Spark local privilege escalation vulnerability: CVE-2018-11760. (DB-18225)

• Invalid options show for dse spark-submit command line help. (DSP-18293)

• Can't access AlwaysOn SQL (AOSS) UI when authorization is enabled. (DSP-18236)

• Spark SQL function concat_ws results in a compilation error when an array column is included in the column
list and when the number of columns to be concatenated exceeds 8. (DSP-18383)

• Improved error messaging for AlwaysOn SQL (AOSS) client tool. (DSP-18409)

• CQL syntax error when single quote is not correctly escaped before including in save cache query to AOSS
cache table. (DSP-18418)

• Remove class DGFCleanerInterceptor from byos.jar. (DSP-18445)

• GBTClassifier in Spark ML fails when periodic checkpointing is on. (DSP-18450)

Known issue:

• DSE 6.0.7 is not compatible with the SparkR and PySpark interpreters in Zeppelin 0.8.1. (DSP-18777)
The Apache Spark™ 2.2.3.4 that is included with DSE 6.0.7 contains the patched protocol, and all versions of DSE are compatible with the Scala interpreter.
However, SparkR and PySpark use a separate channel for communication with Zeppelin. This protocol was vulnerable to attack from other users on the system and was secured in CVE-2018-11760. The SparkR and PySpark interpreters in Zeppelin 0.8.1 fail because Zeppelin does not recognize that Spark 2.2.2 and later contain this patched protocol and attempts to use the old protocol. The Zeppelin patch to recognize this protocol is not available in a released Zeppelin build.
Solution: Do not upgrade to DSE 6.0.7 if you use SparkR or PySpark. Wait for a Zeppelin release later than 0.8.1 that recognizes that DSE-packaged Spark can use the secured protocol.

• Submitting many Spark apps will reach the default tombstone_failure_threshold before the default 90 days
gc_grace_seconds defined for the system_auth.role_permissions table. (DSP-19098)
Workaround for use cases where a large number of Spark jobs are submitted:

1. Before the user starts the Spark jobs, manually grant permissions to the user:

GRANT AUTHORIZE, DESCRIBE, MODIFY ON ANY SUBMISSION IN WORKPOOL 'datacenter_name.workpool' TO role_name;

2. Start Spark jobs for this user.


3. After this user completes all the Spark jobs, revoke permissions for the user:

REVOKE AUTHORIZE, DESCRIBE, MODIFY ON ANY SUBMISSION IN WORKPOOL 'datacenter_name.workpool' FROM role_name;

6.0.7 DSEFS

Resolved issues:

• Change dsefs:// default port when the DSEFS setting public_port is changed in dse.yaml. (DSP-17962)
The shortcut dsefs:/// now automatically resolves to broadcastaddress:dsefs.public_port, instead of
incorrectly using broadcastaddress:5598 regardless of the configured port.
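As an illustrative sketch (assuming the port is set under dsefs_options in dse.yaml), changing the public port now changes what dsefs:/// resolves to:

dsefs_options:
    public_port: 5599

With this setting, dsefs:///tmp/file resolves to dsefs://broadcast_address:5599/tmp/file.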

• DSEFS WebHDFS API GETFILESTATUS op returns AccessDeniedException for the file even when user
has correct permission. (DSP-18044)

• Problem with change group ownership of files using the fileSystem.setOwner method. (DSP-18052)

6.0.7 DSE Graph

Changes and enhancements:

• Vertex and especially edge loading is simplified. idColumn function is no longer required. (DSP-18404)

Resolved issues:

• OLAP traversal duplicates the partition key properties: OLAP g.V().properties() prints 'first' vertex n times
with custom ids. (DSP-15688)

• Edges are inserted with tombstone values set when inserting a recursive edge with multiple cardinality.
(DSP-17377)

• AND operator is ignored in combination with OR operator in graph searches. (DSP-18061)

6.0.7 DSE Search

Resolved issues:

• SASI should discard stale static row. (DB-2956)

• Anti-compaction transaction causes temporary data loss. (DB-3016)

• Solr HTTP request for CSV output is blank. The CSVResponseWriter returns only stored fields if a field list is
not provided in the URL. (DSP-18029)
To work around this issue, specify a field list in the URL:

/select?q=*%3A*&sort=lst_updt_gdttm+desc&rows=10&fl=field1,field2&wt=csv&indent=true

• Timestamp PK routing on solr_query fails. (DSP-18223)

• Facets and stats queries broken when using queryExecutorThreads. (DSP-18237)

Cassandra enhancements for DSE 6.0.7


DataStax Enterprise 6.0.7 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:

• Always close RT markers returned by ReadCommand#executeLocally(). (CASSANDRA-14515)


• Severe concurrency issues in STCS, DTCS, TWCS, TMD.Topology, TypeParser. (CASSANDRA-14781)

General upgrade advice for DSE 6.0.7


DataStax Enterprise 6.0.7 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.7
DataStax Enterprise (DSE) 6.0.7 includes production-certified enhancements to TinkerPop 3.3.6. See
TinkerPop upgrade documentation for all changes.

• Disables the ScriptEngine global function cache which can hold on to references to "g" along with some
other minor bug fixes/enhancements.

DSE 6.0.6 release notes


DataStax recommends the latest patch release for most environments.

27 February 2019

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.6 Components

All components from DSE 6.0.6 are listed. Components that are updated for DSE 6.0.6 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2380

• Apache Spark™ 2.2.2.8

• Apache Tomcat® 8.0.53

• DataStax Bulk Loader 1.2.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.12.dse


• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.5 with additional production-certified changes *

DSE 6.0.6 is compatible with Apache Cassandra™ 3.11 and includes all production-certified changes from
earlier versions.
DSE 6.0.6 Important bug fix

• DSE 5.0 SSTables with UDTs are corrupted in DSE 5.1, DSE 6.0, and DSE 6.7. (DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), the SSTable serialization headers are fixed
when DSE is started with DSE 6.0.6 or later.

DSE 6.0.6 Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

Cassandra enhancements for DSE 6.0.6


DataStax Enterprise 6.0.6 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from previous releases.
General upgrade advice for DSE 6.0.6
DataStax Enterprise 6.0.6 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.6
DataStax Enterprise (DSE) 6.0.6 includes all enhancements from previous DSE releases that are in addition to
TinkerPop 3.3.5. See TinkerPop upgrade documentation for all changes.
DSE 6.0.5 release notes
7 February 2019
In this section:

• DSE 6.0.5 Components

• DSE 6.0.5 Highlights

• DSE 6.0.5 Known issues

• Cassandra enhancements for DSE 6.0.5

• General upgrade advice for DSE 6.0.5

• TinkerPop changes for DSE 6.0.5

Table 6: DSE functionality

• 6.0.5 DSE core

• 6.0.5 DSE Analytics

• 6.0.5 DSEFS

• 6.0.5 DSE Graph

• 6.0.5 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.


The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.5 Components

All components from DSE 6.0.5 are listed. Components that are updated for DSE 6.0.5 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2380 *

• Apache Spark™ 2.2.2.8 *

• Apache Tomcat® 8.0.53 *

• DataStax Bulk Loader 1.2.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.12.dse *

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.5 with additional production-certified changes *

DSE 6.0.5 is compatible with Apache Cassandra™ 3.11 and adds production-certified enhancements.

DSE 6.0.5 Highlights

High-value benefits of upgrading to DSE 6.0.5 include these highlights:


DSE Database (DSE core) highlights
Improvements:

• DSE Metrics Collector aggregates DSE metrics and integrates with existing monitoring solutions to facilitate
problem resolution and remediation. (DSP-17319)
See:

# Enable DSE Metrics Collector

# Configuring data and log directories for DSE Metrics Collector

Important bug fixes:


• Fixed resource leak related to streaming operations that affects tiered storage users. Excessive number of
TieredRowWriter threads causing java.lang.OutOfMemoryError. (DB-2463)

• Exception now occurs when user with no permissions returns no rows on restricted table. (DB-2668)

• Upgraded nodes that still have big-format SSTables from DSE 5.x caused errors during read. (DB-2801)

• Fixed an issue where heap memory usage seems higher with default file cache settings. (DB-2865)

• Fixed prepared statement cache issues when using row-level access control (RLAC) permissions. Existing
prepared statements were not correctly invalidated. (DB-2867)

DSE Analytics highlights


Upgrade if:

• DSEFS or AOSS fail to start.

• You use BYOS with Spark 2.3 or 2.4.

• You are getting OOM or authentication errors.

• You use scripts that invoke DSEFS commands and need to handle failures properly.

• You use dse spark-sql-metastore-migrate with DSE Unified Authentication and internal authentication.
(DSP-17632)

• You want to run the DSEFS auth demo. (DSP-17700)

• You have DSE 5.0.x with DSEFS client connected to DSE 5.1.x and later DSEFS server. (DSP-17600)

• You experienced a memory leak in Spark Thrift Server. (DSP-17433)

• You use DSEFS with listen_on_broadcast_address set to true in cassandra.yaml. (DSP-17363)

• You use DSEFS and listen_address is blank in cassandra.yaml. (DSP-16296)

• You are moving directories in DSEFS. (DSP-17347)

• Improve memory handling in AlwaysOn SQL (AOSS) by enabling spark.sql.thriftServer.incrementalCollect to prevent OOM on large result sets. (DSP-17428)

DSE Graph highlights


Upgrade if:

• You want new JMX operations for graph MBeans. (DSP-15928)

• You get errors for OLAP traversals after dropping schema elements. (DSP-15884)

• You have slow gremlin script compilation times. (DSP-14132)

• You want server side error messages for remote exceptions reported in Gremlin console. (DSP-16375)

• You occasionally get inconsistent query results. (DSP-18005)

• You use graph OLAP and want secret tokens redacted in log files. (DSP-18074)

• You want to build fuzzy-text search indexes on string properties that form part of a vertex label ID.
(DSP-17386)

DSE Search highlights


Upgrade if:

• You want security improvements:

# Upgrade Apache Commons Compress to prevent Denial Of Service (DoS) vulnerability present in
Commons Compress 1.16.1, CVE-2018-11771. (DSP-17019)


# Critical memory leak and corruption fixes for encrypted indexes. (DSP-17111)

# Upgrade Apache Tomcat to prevent Denial Of Service (DoS), CVE-2018-1336. (DSP-17303)

• You index timestamp partition keys. (DSP-17761)

• You do a lot of reindexing. (DSP-17975)

DSE 6.0.5 Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.

DSE 6.0.5 core

Changes and enhancements:

• nodetool command changes:

# New tool sstablepartitions identifies large partitions. (DB-803)

# nodetool listendpointspendinghints command prints hint information about the endpoints this node has
hints for. (DB-1674)

# nodetool rebuild_view rebuilds materialized views for local data. Existing view data is not cleared.
(DB-2451)

# Improved messages for nodetool nodesyncservice ratesimulator command include explanation for
single node clusters and when no tables have NodeSync enabled. (DB-2468)

• Taking a snapshot causes FSError serialization error. (DB-2581)

• Direct Memory field output of nodetool gcstats includes all allocated off-heap memory. Metrics for native
memory are added in org.apache.cassandra.metrics.NativeMemoryMetrics.java. (DB-2796)

• Batch replay is interrupted and good batches are skipped when a mutation of an unknown table is found.
(DB-2855)

• New environment variable MAX_DIRECT_MEMORY overrides the cassandra.yaml value for how much direct memory (NIO direct buffers) the JVM can use; see the sketch after this list. (DB-2919)

• Improved encryption key error reporting. (DSP-17723)
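A minimal sketch of overriding the direct memory limit with the new environment variable before starting DSE. The 8G value, the size-suffix format, and the start commands are assumptions for illustration only; verify the accepted format for your installation:

export MAX_DIRECT_MEMORY=8G      # assumption: JVM-style size suffix
sudo service dse start           # package installations; use bin/dse cassandra for tarball installations

Afterward, nodetool gcstats can confirm the effective off-heap usage, since its Direct Memory field includes all allocated off-heap memory.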

Resolved issues:

• Race condition occurs on bootstrap completion. (DB-1383)

• Running the nodetool nodesyncservice enable command reports the error NodeSyncRecord
constructor assertion failed. (DB-2280)
Workaround: Before DSE 6.0.5, a restart of DSE resolves the issue so that you can execute the command
and enable NodeSync without error.

• Rebuild should not fail when a keyspace is not replicated to other datacenters. (DB-2301)

• Repair may skip some ranges due to received range cache. (DB-2432)

• Read and compaction errors with levelled compaction strategy (LCS). (DB-2446)

• Excessive number of TieredRowWriter threads causing java.lang.OutOfMemoryError (DB-2463)

• The nodetool nodesyncservice ratesimulator -deadline-overrides option is not supported. (DB-2468)

• NullPointerException during compaction on table with TimeWindowCompactionStrategy (TWCS). (DB-2472)

• Chunk cache can retain data from a previous version of a file, causing restore failures. (DB-2489)

• LineNumberInference is not failure-safe; failing to find the source information can break the request. (DB-2568)

• Improved error message when Netty Epoll library cannot be loaded. (DB-2579)

• Prevent potential SSTable corruption with nodetool refresh. (DB-2594)

• The nodetool gcstats command output incorrectly reports the GC reclaimed metric in bytes, instead of the
expected MB. (DB-2598)

• TypeParser is not thread safe. (DB-2602)

• STCS, DTCS, TWCS, TMD aren't thread-safe. (DB-2609)

• Possible corruption in compressed files with uncompressed chunks. (DB-2634)

• Incorrect order of application of nodetool garbagecollect leaves tombstones that should be deleted.
(DB-2658)

• An exception should occur when a user with no permissions queries a restricted table, instead of silently returning no rows. (DB-2668)

• DSE does not start and reports an Unable to gossip with any peers error when cross_node_timeout is true. (DB-2670)

• Memory leak on unfetched continuous paging requests. (DB-2851)

• Heap memory usage is higher with default file cache settings. (DB-2865)

• Prepared statement cache issues when using row-level access control (RLAC) permissions. Existing
prepared statements are not correctly invalidated. (DB-2867)

• User-defined aggregates (UDAs) that instantiate user-defined types (UDTs) break after restart. (DB-2771)

• Upgraded nodes that still have big-format SSTables from DSE 5.x can cause errors during read. (DB-2801)
Workaround for upgrades from DSE 5.x to DSE versions before 6.0.5 and DSE 6.7.0: Run offline
sstableupgrade before starting the upgraded node.

• Late continuous paging errors can leave unreleased buffers behind. (DB-2862)

• Security: java-xmlbuilder is vulnerable to XML external entities (XXE). (DSP-13962)

• dsetool does not work when native_transport_interface is set in cassandra.yaml. (DSP-16796)
Workaround for earlier versions: use native_transport_interface_prefer_ipv6 instead.

• Improve config encryption error reporting for missing system key and unencrypted passwords. (DSP-17480)

• Fix sstableloader error when internode encryption, client_encryption, and config encryption are enabled.
(DSP-17536)

• sstableloader throws an error if system_info_encryption is enabled in dse.yaml and a table is encrypted. (DSP-17826)

6.0.5 DSE Analytics

Changes and enhancements:

• Improved error handling: only submission-related exceptions from submitted Spark applications are wrapped in a DSE Spark Submit Bootstrapper failed-to-submit error. (DSP-16359)

• Improved error message for dse client-tool when DSE Analytics is not correctly configured. (DSP-17322)

• AlwaysOn SQL (AOSS) improvements:

# Provide a way for clients to determine if AlwaysOn SQL (AOSS) is enabled in DSE. (DSP-17180)

# Improved logging messages with recommended resolutions for AlwaysOn SQL (AOSS). (DSP-17326,
DSP-17533)

# Improved error message for AlwaysOn SQL (AOSS) when the role specified by auth_user does not
exist. (DSP-17358)

# Set the default for spark.sql.thriftServer.incrementalCollect to true for AlwaysOn SQL (AOSS); see the sketch after this list. (DSP-17428)

# Structured Streaming support for (Bring Your Own Spark) BYOS Spark 2.3. (DSP-17593)
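A minimal sketch of setting the AlwaysOn SQL (AOSS) option referenced above explicitly. The file location is an assumption; check dse.yaml (alwayson_sql_options) and the Spark configuration files shipped with your version for the authoritative place to set AOSS Spark properties:

# spark-daemon-defaults.conf (location is an assumption)
spark.sql.thriftServer.incrementalCollect true

With the option enabled, large result sets are streamed to clients incrementally instead of being collected in driver memory.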

Resolved issues:

• Memory leak in Spark Thrift Server. (DSP-17433)

• Race condition allows Spark Executor working directories to be removed before stopping those executors.
(DSP-15769)

• Restore DseGraphFrame support in BYOS and spark-dependencies artifacts. Include graph frames python
library in graphframe.jar. (DSP-16383)

• Search optimizations for search analytics Spark SQL queries are applied to a datacenter that no longer has
search enabled. Queries launched from a search-enabled datacenter cause search optimizations even when
the target datacenter does not have search enabled. (DSP-16465)

• Unable to get available memory before Spark Workers are registered. (DSP-16790)

• DirectJoin and Spark Extensions don't work with Pyspark. (DSP-16904)

• Spark shell error Cannot proxy as a super user occurs when AlwaysOn Spark SQL (AOSS) is running
with authentication. (DSP-17200)

• Spark Connector has hard dependencies on dse-core when running Spark Application tests with dse-
connector. (DSP-17232)

• AlwaysOn SQL (AOSS) should attempt to auto start again on datacenter restart, regardless of the previous
status. (DSP-17359)

• AlwaysOn SQL (AOSS) restart hangs for at least 15 minutes if it cannot start; it should instead fail with a meaningful error message. (DSP-17264)

• Submission in client mode does not support specifying remote jars (DSEFS) for main application resource
(main jar) and jars specified with --jars / spark.jars. (DSP-17382)

• Incorrect conversions in DirectJoin Spark SQL operations for timestamps, UDTs, and collections.
(DSP-17444)

• DSE 5.0.x DSEFS client is not able to list files when connected to 5.1.x (and up) DSEFS server.
(DSP-17600)

• dse spark-sql-metastore-migrate does not work with DSE Unified Authentication and internal
authentication. (DSP-17632)

• SparkContext closing is faulty with significantly increased shutdown time. (DSP-17699)

• Spark Web UI redirection drops path component. (DSP-17877)

6.0.5 DSEFS

Changes and enhancements:

• Improved error message when no available chunks are found. (DSP-16623)

• Add the ability to disable and configure DSEFS internode (node-to-node) authentication. (DSP-17721)

Resolved issues:

• DSEFS throws exceptions and cannot initialize when listen_address is left blank. (DSP-16296)

• Timeout issues in DSEFS startup. (DSP-16875)
Initialization would fail with error messages similar to:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)

• DSEFS exit code not set in some cases. (DSP-17266)

• Moving a directory under itself causes data loss and orphan data structures. (DSP-17347)

• DSEFS does not support listen_on_broadcast_address as configured in cassandra.yaml. (DSP-17363)

• DSEFS retries resolving corrupted paths. (DSP-17379)

• DSEFS auth demo does not work. (DSP-17700)

6.0.5 DSE Graph

Changes and enhancements:

• New tool fixes inconsistencies in graph data that are caused by schema changes, like label delete, or
improper data loading. (DSP-15884)

# DSE Graph Gremlin console: graph.cleanUp()

# Spark: spark.dseGraph("name").cleanUp()

• New JMX operations for graph MBeans. (DSP-15928)

# adjacency-cache.size - adjacency cache size attribute

# adjacency-cache.clear - operation to clean adjacency cache

# index-cache.size - vertex cache size attribute

# index-cache.clear - operation to clean vertex cache

JMX operations are not cluster-aware. Invoke on each node as appropriate to your environment.

Resolved issues:

• Properties unattached to vertex show up with null values. (DSP-12300)

• DseGraphFrame (DSEGF) label drop hangs when there are many edges whose both ends have the same label. (DSP-17096)

• Graph/Search escaping fixes. (DSP-17216, DSP-17277, DSP-17816)

• A Gremlin query with search predicate containing \u2028 or \u2029 characters fails. (DSP-17227)

• Geo.inside predicate with Polygon no longer works on secondary index if JTS is not installed. (DSP-17284)

• Search indexes on key fields work only with non-tokenized queries. (DSP-17386)

• g.V().repeat(...).until(...).path() returns incomplete path without edges. (DSP-17933)

• Graph OLTP: Potential ThreadLocal resource leak. (DSP-17808)

• Graph OLTP: Slow gremlin script compilation times. (DSP-14132)

• DseGraphFrame fails to read properties with symbols, like a period (.), in their names. (DSP-17818)

• DSE GraphFrame operations cache but do not explicitly uncache. (DSP-17870)

• Inconsistent results when using gremlin on static data. (DSP-18005)

• Graph OLAP: secret tokens are unmasked in log files. (DSP-18074)

6.0.5 DSE Search

Changes and enhancements:

• Large queries with oversize frames no longer cause buffer corruption on the receiver. (DSP-15664)

• If a client executes a query that results in a shard attempting to send an internode frame larger than the size specified in frame_length_in_mb, the client receives an error message like this:

Attempted to write a frame of <n> bytes with a maximum frame size of <n> bytes

In earlier versions, the query timed out with no message, and the information was provided only as an error in the logs.

• In earlier releases, CQL search queries failed with UTFDataFormatException on very large SELECT clauses and when tables have a very large number of columns. (DSP-17220)
With this fix, CQL search queries fail with UTFDataFormatException only when the SELECT clause constitutes a string larger than 64 KB of UTF-8 encoded bytes.

• New DSE start-up parameter -Ddse.consistent_replace improves LOCAL_QUORUM and QUORUM consistency on a new node after node replacement. (DB-1577)

• Upgrade Apache Commons Compress to prevent Denial Of Service (DoS) vulnerability present in Commons
Compress 1.16.1, CVE-2018-11771. (DSP-17019)

• Requesting a core reindex with dsetool reload_core or REBUILD SEARCH INDEX no longer builds up a
queue of reindexing tasks on a node. Instead, a single starting reindexing task handles all reindex requests
that are already submitted to that node. (DSP-17045, DSP-13030)

• Upgrade Apache Tomcat to prevent Denial Of Service (DoS), CVE-2018-1336. (DSP-17303)

• The calculated value for maxMergeCount is changed to improve indexing performance; see the worked example after this list. (DSP-17597)

max(max(<maxThreadCount * 2>, <num_tokens * 8>), <maxThreadCount + 5>)

where num_tokens is the number of token ranges to assign to the virtual node (vnode) as configured in cassandra.yaml.

• A CQL timestamp field can be part of a Solr unique key; see the example after this list. (DSP-17761)
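Two illustrative sketches for the items above. First, a worked maxMergeCount calculation using hypothetical values maxThreadCount = 8 and num_tokens = 8:

max(max(8 * 2, 8 * 8), 8 + 5) = max(max(16, 64), 13) = 64

Second, a minimal CQL sketch of a timestamp column participating in the generated Solr unique key. The keyspace, table, and column names are hypothetical, and the search index is created with defaults:

CREATE TABLE sensors.readings (
  sensor_id uuid,
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, reading_time))
);
CREATE SEARCH INDEX ON sensors.readings;

Because reading_time is part of the partition key, it becomes part of the Solr unique key, which earlier versions did not support.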

Resolved issues:

• Race condition occurs on bootstrap completion and Solr core fails to initialize during node bootstrap.
(DB-1383, DSP-14823)
Workaround: Restart the node that failed to initialize.

• Internode protocol can send oversize frames causing buffer corruption on the receiver. (DSP-15664)

• CQL search queries fail with UTFDataFormatException on very large SELECT clauses. (DSP-17220)
With this fix, CQL search queries fail with UTFDataFormatException only when the SELECT clause constitutes a string larger than 64 KB of UTF-8 encoded bytes.

• java.lang.AssertionError: rtDocValues.maxDoc=5230 maxDoc=4488 error is thrown in the system.log during indexing and reindexing. (DSP-17529)

• Histogram for snapshot is unsynchronized. (DSP-17308)

• Unexpected search index errors occur when non-ASCII characters, like the U+3000 (ideographic space)
character, are in indexed columns. (DSP-17816, DSP-17961)

• TextField type in search index schema should be case-sensitive if created when using copyField.
(DSP-17817)

• gf.V().id().next() causes data to become mismatched with properties in the legacy DseGraphFrame. (DSP-17979)

• Loading frozen map columns fails during search read-before-write. (DSP-18073)

Cassandra enhancements for DSE 6.0.5


DataStax Enterprise 6.0.5 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:

• Pad uncompressed chunks when they would be interpreted as compressed (CASSANDRA-14892)

• Correct SSTable sorting for garbagecollect and levelled compaction (CASSANDRA-14870)

• Avoid calling iter.next() in a loop when notifying indexers about range tombstones (CASSANDRA-14794)

• Fix purging semi-expired RT boundaries in reversed iterators (CASSANDRA-14672)

• DESC order reads can fail to return the last Unfiltered in the partition (CASSANDRA-14766)

• Fix corrupted collection deletions for dropped columns in messages (CASSANDRA-14568)

• Fix corrupted static collection deletions in messages (CASSANDRA-14568)

• Handle failures in parallelAllSSTableOperation (cleanup/upgradesstables/etc) (CASSANDRA-14657)

• Improve TokenMetaData cache populating performance avoid long locking (CASSANDRA-14660)

• Fix static column order for SELECT * wildcard queries (CASSANDRA-14638)

• sstableloader should use discovered broadcast address to connect intra-cluster (CASSANDRA-14522)

• Fix reading columns with non-UTF names from schema (CASSANDRA-14468)

General upgrade advice for DSE 6.0.5


DataStax Enterprise 6.0.5 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.5
DataStax Enterprise (DSE) 6.0.5 includes production-certified enhancements to TinkerPop 3.3.6.
Resolved issues:

• Masked sensitive configuration options in the KryoShimServiceLoader logs.

• Fixed a concurrency issue in TraverserSet.

DSE 6.0.4 release notes


8 October 2018
DataStax recommends the latest patch release for most environments.

• Important bug fix

• 6.0.4 Components

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

DSE 6.0.4 Important bug fix

• Fix wrong offset in size calculation in trie builder. (DB-2477)

DSE 6.0.4 Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

6.0.4 Components

All components from DSE 6.0.4 are listed. No components were updated from the previous DSE version.

• Apache Solr™ 6.0.1.1.2338

• Apache Spark™ 2.2.2.5

• Apache Tomcat® 8.0.47

• DataStax Bulk Loader 1.1.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.11.dse

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.3 with additional production-certified changes

DSE 6.0.4 is compatible with Apache Cassandra™ 3.11 and includes all production-certified enhancements from
earlier DSE versions.
General upgrade advice for DSE 6.0.4
DataStax Enterprise 6.0.4 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
DSE 6.0.3 release notes
20 September 2018

DataStax recommends installing the latest patch release. Due to DB-2477, DataStax does not recommend
using DSE 6.0.3 for production.

• 6.0.3 Components

• DSE 6.0.3 Highlights

• DSE 6.0.3 Known issues

• General upgrade advice for DSE 6.0.3

• TinkerPop changes for DSE 6.0.3

Table 7: DSE functionality

• 6.0.3 DSE core
• 6.0.3 DSE Analytics
• 6.0.3 DSEFS
• 6.0.3 DSE Graph
• 6.0.3 DSE Search

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.3 Components

All components from DSE 6.0.3 are listed. Components that are updated for DSE 6.0.3 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2338 *

• Apache Spark™ 2.2.2.5 *

• Apache Tomcat® 8.0.47

• DataStax Bulk Loader 1.1.0

• DSE Java Driver 1.6.9

• Key Management Interoperability Protocol (KMIP) 1.7.1e

• Netty 4.1.13.11.dse

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.3 with additional production-certified changes *

DataStax Enterprise 6.0.3 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from earlier DSE versions.

DSE 6.0.3 Highlights

High-value benefits of upgrading to DSE 6.0.3 include these highlights:


DSE Database (DSE core) highlights
Improvements:

• Deleting a static column and adding it back as a non-static column introduces corruption. (DB-1630)

• NodeSync command line tool only connects over JMX to a single node. (DB-1693)

• Create a log message when DDL statements are executed. (DB-2383)

Important bug fixes:

• Authentication cache loading can exhaust native threads. (DB-2248)

• The nodesync tasks fail with assertion error. (DB-2323)

• Unexpected behavior change when using row-level permissions with modification conditions like IF EXISTS.
(DB-2429)

• Non-internal users are unable to use permissions granted on CREATE. (DSP-16824)

DSE Analytics highlights


Improvements:

• Improved security isolates Spark applications. (DSP-16093)

• Upgrade to Spark 2.2.2. (DSP-16761)

• Jetty 9.4.11 upgrade addresses security vulnerabilities in Spark dependencies packaged with DSE. (DSP-16893)

• dse spark-submit kill and status commands support optionally explicit Spark Master IP address.
(DSP-16910, DSP-16991)

Important bug fixes:

• Fixed problems with temporary and data directories for Spark applications. (DSP-15476, DSP-15880)

• Spark Cassandra Connector method saveToCassandra should not require solr_query column when search
is enabled. (DSP-16427)

• Cassandra streaming sink doesn't work with some sources. (DSP-16635)

• Metastore can't handle table with 100+ columns. (DSP-16742)

• Fully qualified paths with resource URL are correctly resolved in Spark structured streaming checkpointing.
Backport SPARK-20894. (DSP-16972)

DSEFS highlights
Important bug fixes:

• Only superusers are allowed to remove corrupted non-empty directories when authentication is enabled for
DSEFS. Improved error message when performing an operation on a corrupted path. (DSP-16340)

• The cassandra non-superuser role gets a DSEFS AccessDeniedException due to insufficient permissions. (DSP-16713)

• DSEFS Hadoop layer doesn't properly translate DSEFS exceptions to Hadoop exceptions in some methods.
(DSP-16933)

• Closing DSEFS client before all issued requests are completed causes unexpected message type:
DefaultLastHttpContent error. (DSP-16953)

• Under high loads, DSEFS reports temporary incorrect state for various files/directories. (DSP-17178)

DSE Graph highlights

• Aligned query behavior using geo.inside() predicate for polygon search with and without search indexes.
(DSP-16108)

• Added convenience methods for reading graph configuration: getEffectiveAllowScan and getEffectiveSchemaMode. (DSP-16650)

• Fixed bug where deleting a search index that was defined inside a graph fails. (DSP-16765)

• Changed default write consistency level (CL) for Graph to LOCAL_QUORUM. (DSP-17140)
In earlier DSE versions, the default QUORUM write consistency level (CL) was not appropriate for multi-
datacenter production environments.

DSE Search highlights


Improvements:

• Reduce the number of token filters for distributed searches with vnodes. (DSP-14189)

• Avoid unnecessary exception and error creation in the Solr query parser. (DSP-17147)

Important bug fixes:

• Avoid accumulating redundant router state updates during schema disagreement. (DSP-15615)

• A search enabled node could return different exceptions than a non-search enabled node when a keyspace
or table did not exist. (DSP-16834)

• DSE does not start without appropriate Tomcat JAR scanning exclusions. (DSP-16841)

• CQL single-pass queries have incorrect results when query is run with primary key and search index
schema does not contain all columns in selection. (DSP-16895)

• Node health score of 1 is not obtainable. Search node gets stuck at 0.00 node health score after replacing a
node in a cluster. (DSP-17107)

DSE 6.0.3 Known issues:

• Wrong offset in size calculation in trie builder. (DB-2477)

• Possible data loss when using DSE Tiered Storage. (DB-3404)

If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.

6.0.3 DSE core

Changes and enhancements:

• Create a log message when DDL statements are executed. (DB-2383)

• Due to Thread Per Core (TPC) asynchronous request processing architecture, the
index_summary_capacity_in_mb and index_summary_resize_interval_in_minutes settings in
cassandra.yaml are removed. (DB-2390)

• Connections on non-serialization errors are not dropped. (DB-2233)

• NetworkTopologyStrategy warning about unrecognized option at startup. (DB-2235)

• NodeSync waits to start until all nodes in the cluster are upgraded. (DB-2385)

• Improved error handling and logging for TDE encryption key management. (DP-15314)

• DataStax does more extensive testing on OpenJDK 8 due to the end of public updates for Oracle JRE/JDK
8. (DSP-16179)

• Non-internal users are unable to use permissions granted on CREATE. (DSP-16824)

Resolved issues:

• NodeSync command line tool only connects over JMX to a single node. (DB-1693)

• TotalBlockedTasksGauge metric value is computed incorrectly. (DB-2002)

• Move TWCS message "No compaction necessary for bucket size" to Trace level or NoSpam. (DB-2022)

• Non-portable syntax (MX4J bash-isms) in cassandra-env.sh broke service scripts. (DB-2123)

• sstableloader options assume the RPC/native (client) interface is the same as the internode (node-to-node)
interface. (DB-2184)

• The nodesync tasks fail with assertion error. (DB-2323)

• NodeSync fails on upgraded nodes while a cluster is in a partially upgraded state. (DB-2385)

• StackOverflowError around IncrementalTrieWriterPageAware#writeRecursive() during compaction. (DB-2364)

• Compaction strategy instantiation errors don't generate meaningful error messages, instead return only
InvocationTargetException. (DB-2404)

• Unexpected behavior change when using row-level permissions with modification conditions like IF EXISTS.
(DB-2429)

• Authentication cache loading can exhaust native threads. The Spark master node is not able to be elected.
(DB-2248)

• Audit events for CREATE ROLE and ALTER ROLE with incorrect spacing exposes PASSWORD in plain
text. (DB-2285)

• Client warnings are not always propagated via LocalSessionWrapper. (DB-2304)

• Timestamps inserted with ISO 8601 format are saved with wrong millisecond value. (DB-2312)

• Compaction fails with IllegalArgumentException: null. (DB-2329)

• Error out if not all permissions for GRANT/REVOKE/RESTRICT/UNRESTRICT are applicable for a
resource. (DB-2373)

• BulkLoader class exits without printing the stack trace for throwable error. (DB-2377)

• nodetool describecluster incorrectly shows DseDelegateSnitch instead of the snitch configured in cassandra.yaml. (DSP-16158)

• Using geo types does not work when memtable allocation type is set to offheap_objects. (DSP-16302)

• Heap-size calculation is incorrect for RpcCallStatement + SearchIndexStatement. (DSP-16731)

• The -graph option for the cassandra-stress tool failed to generate the target output HTML from the JAR file. (DSP-17046)

Known issue:

• Upgraded nodes that still have big-format SSTables from DSE 5.x can cause errors during read. (DB-2801)
Workaround for upgrades from DSE 5.x to DSE versions before 6.0.5 and DSE 6.7.0: Run offline
sstableupgrade before starting the upgraded node.

6.0.3 DSE Analytics

Changes and enhancements:

• DSE PySpark libraries are added to PYTHONPATH for the dse exec command, adding support for Jupyter integration; see the sketch after this list. (DSP-16797)

• DSE custom strategies allowed in Spark Structured Streaming. (DSP-16856)

• dse spark-submit kill and status commands support an optional explicit master address; see the sketch after this list. (DSP-16910, DSP-16991)

• Upgrade Jetty to 9.4.11 to address security vulnerabilities in Spark dependencies packaged with DSE. (DSP-16893)

# Jetty Http Utility CVE-2017-7656

# Jetty Http Utility CVE-2017-7657

# Jetty Http Utility CVE-2017-7658

# Jetty Server Core CVE-2018-12538

# Jetty Utilities CVE-2018-12536
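A minimal sketch of the dse exec and dse spark-submit items referenced above. The notebook command assumes Jupyter is installed on the node; the submission ID and master URL are placeholders, and the flag spellings follow upstream spark-submit (verify with dse spark-submit --help on your version):

dse exec jupyter notebook
dse spark-submit --status <submission-id> --master spark://10.10.1.1:7077
dse spark-submit --kill <submission-id> --master spark://10.10.1.1:7077

Per the notes above, dse exec places the DSE PySpark libraries on PYTHONPATH, and the explicit master address is only needed when the default Spark Master discovery should be bypassed.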

Resolved issues:

• A Spark application can be registered twice in rare instances. (DSP-15247)

• Problems with temporary and data directories for Spark applications. (DSP-15476, DSP-15880)

# DSE client applications, like Spark, will not start if user HOME environment variable is not defined, user
home directory does not exist, or the current user does not have write permissions.

# Temporary data directory for AOSS is /var/log/spark/rdd, the same as the server-side temporary
data location for Spark. Configurable with SPARK_EXECUTOR_DIRS environment variable in spark-
env.sh.

# If TMPDIR environment variable is missing, /tmp is set for all DSE apps. If /tmp directory does not
exist, it is created with 1777 permissions. If directory creation fails, perform a hard stop.

• Improved security isolates Spark applications; prevents run_as runner for Spark from running a malicious
program. (DSP-16093)

• Spark Cassandra Connector method saveToCassandra should not require solr_query column when search
is enabled. (DSP-16427)

• Cassandra streaming sink doesn't work with some sources. (DSP-16635)

• The cassandra non-superuser role gets a DSEFS AccessDeniedException due to insufficient permissions. (DSP-16713)

• DSE Spark logging does not match OSS Spark logging levels. (DSP-16726)

• Metastore can't handle table with 100+ columns with auto Spark SQL table creation. (DSP-16742)

• DseDirectJoin and reading from Hive Tables does not work in Spark Structured Streaming. (DSP-16856)

• Fully qualified paths with resource URL are resolved in Spark structured streaming checkpointing. Backport
SPARK-20894. (DSP-16972)

• AlwaysOn SQL (AOSS) dsefs directory creation does not wait for all operations to finish before closing
DSEFS client. (DSP-16997)

6.0.3 DSEFS

Changes and enhancements:

• Improved error message when performing an operation on a corrupted path. (DSP-16340)

• Only superusers are able to remove corrupted non-empty directories when authentication is enabled for
DSEFS. (DSP-16340)

Resolved issues:

• 8 ms timeout failure when a data directory is removed. (DSP-16645)

• In DSEFS shell, listing too many local file system directories in a single session causes a file descriptor leak.
(DSP-16657)

• DSEFS fails to start when there is a table with duration type or other type DSEFS can't understand.
(DSP-16825)

• DSEFS Hadoop layer doesn't properly translate DSEFS exceptions to Hadoop exceptions in some methods.
(DSP-16933)

• Closing DSEFS client before all issued requests are completed causes unexpected message type:
DefaultLastHttpContent error. (DSP-16953)

• Under high loads, DSEFS reports temporary incorrect state for various files/directories. (DSP-17178)

6.0.3 DSE Graph

Changes and enhancements:

• Maximum evaluation timeout limit is 1094 days. (DSP-16709)

# Gremlin evaluation_timeout parameter:

schema.config().option('graph.traversal_sources.g.evaluation_timeout').set(Duration.ofDays(1094))

# dse.yaml options: analytic_evaluation_timeout, realtime_evaluation_timeout

• Default write consistency level (CL) for Graph is LOCAL_QUORUM. (DSP-17140)
In earlier DSE versions, the default QUORUM write consistency level (CL) was not appropriate for multi-datacenter production environments.

Known issue:

• Point-in-polygon queries no longer work without JTS. (DSP-17284)
Although point-in-polygon queries previously worked without JTS, the queries used a Cartesian coordinate system implementation that did not understand the dateline. For best results, install JTS. See Spatial queries with polygons require JTS.

Resolved issues:

• Align query behavior using geo.inside() predicate for polygon search with and without search indexes.
(DSP-16108)

• Avoid looping indefinitely when a thread making internode requests is interrupted while trying to acquire a
connection. (DSP-16544)

• Setting graph.traversal_sources.g.evaluation_timeout breaks graph. (DSP-16709)

• Deleting a search index that was defined inside a graph fails. (DSP-16765)

6.0.3 DSE Search

Changes and enhancements:

• Reduce the number of unique token selections for distributed searches with vnodes. (DSP-14189)
Search load balancing strategies are per search index (per core) and are set with dsetool set_core_property.

• Log fewer messages at INFO level in TTLIndexRebuildTask. (DSP-15600)

• Avoid unnecessary exception and error creation in the Solr query parser. (DSP-17147)

Resolved issues:

• Avoid accumulating redundant router state updates during schema disagreement. (DSP-15615)

• Should not allow search index rebuild during drain. (DSP-16504)

• NRT codec is not registered at startup for Solr cores that have switched to RT. (DSP-16663)

• Dropping search index when index build is in progress can interrupt Solr core closure. (DSP-16774)

• Exceptions thrown when search is enabled and table is not found in existing keyspace. (DSP-16834)

• DSE should not start without appropriate Tomcat JAR scanning exclusions. (DSP-16841)

• CQL single-pass queries have incorrect results when query is run with primary key and search index
schema does not contain all columns in selection. (DSP-16895)
Best practice: For optimal single-pass queries, including queries where solr_query is used with a partition
restriction, and queries with partition restrictions and a search predicate, ensure that the columns to
SELECT are not indexed in the search index schema.

Workaround: Since auto-generation indexes all columns by default, you can ensure that the field is not
indexed but still returned in a single-pass query. For example, this statement indexes everything except
for column c3, and informs the search index schema about column c3 for efficient and correct single-pass
queries.

CREATE SEARCH INDEX ON test_search.abc WITH COLUMNS * { indexed : true }, c3 { indexed : false };

• Node health score of 1 is not obtainable. Search node gets stuck at 0.00 node health score after replacing a
node in a cluster. (DSP-17107)

General upgrade advice for DSE 6.0.3


DataStax Enterprise 6.0.3 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.3
DataStax Enterprise (DSE) 6.0.3 includes TinkerPop 3.3.3 and all enhancements from earlier DSE versions.

DSE 6.0.2 release notes


19 July 2018

• 6.0.2 Components

• DSE 6.0.2 Highlights

• DSE 6.0.2 Known issues

• General upgrade advice for DSE 6.0.2

• TinkerPop changes for DSE 6.0.2

Table 8: DSE functionality

• 6.0.2 DSE core
• 6.0.2 DSE Analytics
• 6.0.2 DSEFS
• 6.0.2 DSE Graph
• 6.0.2 DSE Search
• DataStax Bulk Loader 1.1.0

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.2 Components

All components from DSE 6.0.2 are listed. Components that are updated for DSE 6.0.2 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2321 *

• Apache Spark™ 2.2.1.2

• Apache Tomcat® 8.0.47

• DataStax Bulk Loader 1.1.0 *

• DSE Java Driver 1.6.5

• Netty 4.1.13.11.dse

• Spark Jobserver 0.8.0.45 DSE custom version

• TinkerPop 3.3.3 with additional production-certified changes *

DataStax Enterprise 6.0.2 is compatible with Apache Cassandra™ 3.11 and includes all production-certified
enhancements from earlier DSE versions.

DSE 6.0.2 Highlights

High-value benefits of upgrading to DSE 6.0.2 include these highlights:


DSE Analytics and DSEFS

• Fixed issue where CassandraConnectionConf creates excessive database connections and reports too
many HashedWheelTimer instances. (DSP-16365)

DSE Graph

• Fixed several edge cases of using search indexes. (DSP-14802, DSP-16292)

DSE Search

• Search index permissions can be applied at the keyspace level. (DSP-15385)

• Schemas with stored=true work because stored=true is ignored. The workaround for 6.0.x upgrades with
schema.xml fields with “indexed=false, stored=true, docValues=true” is no longer required. (DSP-16392)

• Minor bug fixes and error handling improvements. (DSP-16435, DSP-16061, DSP-16078)

6.0.2 DSE core

Changes and enhancements:

• sstableloader supports custom config file locations. (DSP-16092)

• -d option to create local encryption keys without configuring the directory in dse.yaml. (DSP-15380)

Resolved issues:

• Show delegated snitch in nodetool describecluster. (DB-2057)

• Use more precise grep patterns to prevent accidental matches in cassandra-env.sh. (DB-2114)

• Add missing equality sign to SASI schema snapshot. (DB-2129)

• For tables using DSE Tiered Storage, nodetool cleanup places cleaned SSTables in the wrong tier.
(DB-2173)

• Support creating system keys before the output directory is configured in dse.yaml. (DSP-15380)

• Client prepared statements are not populated in system.prepared_statements table. (DSP-15900)

• Improved compatibility with external tables stored in the DSE Metastore in remote systems. (DSP-16561)

DSE 6.0.2 Known issue:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.

6.0.2 DSE Analytics

Changes and enhancements:

• Apache Hadoop Azure libraries for Hadoop 2.7.1 have been added to the Spark classpath to simplify
integration with Microsoft Azure and Microsoft Azure Blob Storage. (DSP-15943)

• AlwaysOn SQL (AOSS) improvements:

# AlwaysOn SQL (AOSS) support for enabling Kerberos and SSL at the same time. (DSP-16087)

# Add 120 seconds wait time so that Spark Master recovery process completes before status check of
AlwaysOn SQL (AOSS) app. (DSP-16249)

# AlwaysOn SQL (AOSS) driver continually runs on a node even when DSE is down. (DSP-16297)

# AlwaysOn SQL (AOSS) binds to native_transport_address. (DSP-16469)

# Improved defaults and errors for AlwaysOn SQL (AOSS) workpool. (DSP-16343)

Resolved issues:

• CassandraConnectionConf creates excessive database connections and reports too many HashedWheelTimer instances. (DSP-16365)

• Need to disable cluster object JMX metrics report to prevent count exceptions spam in Spark driver log.
(DSP-16442)

• Fixed Spark-Connector dependencies and published SparkBuildExamples. (DSP-16699)

6.0.2 DSEFS

Changes and enhancements:

• DSEFS operations: chown, chgrp, and chmod support the recursive (-R) and verbose (-v) flags; see the sketch after this list. (DSP-14238)

• Client and internode connection improvements. (DSP-14284, DSP-16065)

# DSEFS clients close idle connections after 60 seconds, configurable in dse.yaml.

# Idle DSEFS internode connections are closed after 120 seconds. Configurable with the new dse.yaml option internode_idle_connection_timeout_ms; see the sketch after this list.

# Configurable connection pool with core_max_concurrent_connections_per_host.

• Improvements to DataSourceInputStream remove possible lockup. (DSP-16409)

# If the second read is issued after a failed read, it is not blocked forever. The stream is automatically
closed on errors, and subsequent reads will fail with IllegalStateException.

# The timeout message includes information about the underlying DataSource object.

# No more reads are issued to the underlying DataSource after it reports hasMoreData = false.

# The read loop has been simplified to properly move to the next buffer if the requested number of bytes
hasn't been delivered yet.

# Empty buffer returned from the DataSource when hasMoreData = true is not treated as an EOF. The
read method validates offset and length arguments.

• Security improvement: DSEFS uses an isolated native memory pool for file data and metadata sent between
nodes. This isolation makes it harder to exploit potential memory management bugs. (DSP-16492)
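A minimal sketch of the DSEFS items referenced above. The nesting of these keys under dsefs_options and the connection-pool value are assumptions; the path is hypothetical, and the one-shot dse fs invocation can be replaced by running chmod from the interactive DSEFS shell:

dsefs_options:
  internode_idle_connection_timeout_ms: 120000   # close idle internode connections after 120 seconds
  core_max_concurrent_connections_per_host: 8    # connection pool size; value shown is illustrative

dse fs 'chmod -R -v 755 /data'   # recursive, verbose permission change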

Resolved issues:

• DSEFS silently fails when TCP port 5599 is not open between nodes. (DSP-16101)

6.0.2 DSE Graph

Changes and enhancements:

• Vertices and vertex properties created or modified with graphframes respect TTL as defined in the schema.
In earlier versions, vertices and vertex properties had no TTL. Edges created or modified with graphframes
continue to have no TTL. (DSP-15555)

• Improved Gremlin console authentication configuration. (DSP-9905)

Resolved issues:

• 0 (zero) is not treated as unlimited for the maximum number of errors before abort. (DGL-307)

• Search indexes are broken for multi cardinality properties. (DSP-14802)

• DGF interceptor does not take into account GraphStep parameters with g.V(id) queries. (DSP-16172)

• The clause LIMIT does not work in a graph traversal with search predicate TOKEN, returning only a subset
of expected results. (DSP-16292)

6.0.2 DSE Search

Changes and enhancements:

• The node health option uptime_ramp_up_period_seconds default value in dse.yaml is reduced to 3 hours (10800 seconds); see the sketch after this list. (DSP-15752)

• CQL solr_query supports Solr facet heatmaps. (DSP-16404)

• Improved handling of asynchronous I/O timeouts during search read-before-write. (DSP-16061)

• Schemas with stored=true work because stored=true is ignored. (DSP-16392)

• Use monotonically increasing time source for search query execution latency calculation. (DSP-16435)
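A minimal dse.yaml sketch of the new node health default referenced above. The surrounding node_health_options block and the refresh setting are assumptions shown for context; check the commented dse.yaml shipped with your version:

node_health_options:
  refresh_rate_ms: 60000                  # assumption: existing refresh setting, shown for context
  uptime_ramp_up_period_seconds: 10800    # new default: 3 hours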

Resolved issues:

• Search index permissions can be applied at keyspace level. (DSP-15835)

• The encryptors thread cache in ThreadLocalIndexEncryptionConfiguration leaves entries in the cache. (DSP-16078)

• Classpath conflict between Lucene and SASI versions of Snowball. (DSP-16116)

• Indexing fails if fields have indexed=false, stored=true, and docValues=true. (DSP-16392)

DataStax Bulk Loader 1.1.0

Changes and enhancements:

• DataStax Bulk Loader (dsbulk) version 1.1.0 is automatically installed with DataStax Enterprise 6.0.2, and can also be installed as a standalone tool; see the example below. For details, see the DataStax Bulk Loader 1.1.0 release notes. (DSP-16484)
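A minimal dsbulk invocation sketch; the keyspace, table, and CSV file names are hypothetical:

dsbulk load -k inventory -t products -url products.csv -header true

The -header true shortcut tells the CSV connector that the first line of the file names the columns.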

General upgrade advice for DSE 6.0.2


DataStax Enterprise 6.0.2 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.2
DataStax Enterprise (DSE) 6.0.2 includes these production-certified enhancements to TinkerPop 3.3.3:

• Implemented TraversalSelectStep, which allows select() of runtime-generated keys.

• Coerced BulkSet to g:List in GraphSON 3.0.

• Deprecated CredentialsGraph DSL in favor of CredentialsTraversalDsl, which uses the recommended method for Gremlin DSL development.

• Allowed iterate() to be called after profile().

• Fixed regression issue where the HTTPChannelizer doesn’t instantiate the specified
AuthenticationHandler.

• Defaulted GLV tests for gremlin-python to run for GraphSON 3.0.

• Fixed a bug with Tree serialization in GraphSON 3.0.

• In gremlin-python, the GraphSON 3.0 g:Set type is now deserialized to List.

DSE 6.0.1 release notes


5 June 2018

• 6.0.1 Components

• 6.0.1 Highlights

• DSE 6.0.1 Known issues

• Cassandra enhancements for DSE 6.0.1

• General upgrade advice DSE 6.0.1

• TinkerPop changes for 6.0.1

Table 9: DSE functionality

• 6.0.1 DSE core
• 6.0.1 DSE Analytics
• 6.0.1 DSEFS
• 6.0.1 DSE Graph
• 6.0.1 DSE Search
• DataStax Bulk Loader 1.0.2

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.
The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

6.0.1 Components

All components from DSE 6.0.1 are listed. Components that are updated for DSE 6.0.1 are indicated with an
asterisk (*).

• Apache Solr™ 6.0.1.1.2295 *

• Apache Spark™ 2.2.1.2 *

• Apache Tomcat® 8.0.47

• DataStax Bulk Loader 1.0.2 *

• DSE Java Driver 1.6.5

• Netty 4.1.13.11.dse

• Spark Jobserver 0.8.0.45 DSE custom version *

• TinkerPop 3.3.3 with additional production-certified changes *

DSE 6.0.1 is compatible with Apache Cassandra™ 3.11 and adds additional production-certified enhancements.

DSE 6.0.1 Highlights

High-value benefits of upgrading to DSE 6.0.1 include these highlights:


DataStax Enterprise core

• Fix binding JMX to any address. (DB-2081)

• DataStax Bulk Loader 1.0.2 is bundled with DSE 6.0.1. (DSP-16206)

DSE Analytics and DSEFS

• Upgrade to Spark 2.2.1 for bug fixes.

• Fixed issue where multiple Spark Masters can be started on the same machine. (DSP-15636)

• Improved Spark Master discovery and reliability. (DSP-15801, DSP-14405)

• Improved AlwaysOn SQL (AOSS) startup reliability. (DSP-15871, DSP-15468, DSP-15695, DSP-15839)

• Resolved the missing /tmp directory in DSEFS after fresh cluster installation. (DSP-16058)

• Fixed handling of Parquet files with partitions. (DSP-16067)

• Fixed the HashedWheelTimer leak in Spark Connector that affected BYOS. (DSP-15569)

DSE Search

• Fix for the known issue that prevented using TTL (time-to-live) with DSE Search live indexing (RT indexing).
(DSP-16038, DSP-14216)

• Addresses security vulnerabilities in libraries packaged with DSE. (DSP-15978)

• Fix for using faceting with non-zero offsets. (DSP-15946)

• Fix for ORDER BY clauses in native CQL syntax. (DSP-16064)

DSE 6.0.1 Known issues:

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.

• dsetool does not work when native_transport_interface is set in cassandra.yaml. (DSP-16796)
Workaround: use native_transport_interface_prefer_ipv6 instead.

6.0.1 DSE core

Changes and enhancements:

• Improved NodeSync usability with secure environments. (DB-2034)

• sstableloader supports custom config file locations. (DSP-16092)

• LDAP tuning parameters allow all LDAP connection pool options to be set. (DSP-15948)

Resolved issues:

• Use the indexed item type as backing table key validator of 2i on collections. (DB-1121)

• Add getConcurrentCompactors to JMX in order to avoid loading DatabaseDescriptor to check its value in
nodetool. (DB-1730)

• Send a final error message when a continuous paging session is cancelled. (DB-1798)

• Ignore empty counter cells on digest calculation. (DB-1881)

• Apply view batchlog mutation parallel with local view mutations. (DB-1900)

• Use same IO queue depth as Linux scheduler and advise against overriding it. (DB-1909)

• Fix startup error message rejecting COMPACT STORAGE after upgrade. (DB-1916)

• Improve user warnings on startup when libaio package is not installed. (DB-1917)

• Avoid copy-on-heap when flushing. (DB-1916)

• Set MX4J_ADDRESS to 127.0.0.1 if not explicitly set. (DB-1950)

• Prevent OOM due to OutboundTcpConnection backlog by dropping request messages after the queue
becomes too large. (DB-2001)

• Fix exception in trace log messages of non-frozen user types. (DB-2005)

• Limit max cached direct buffer on NIO to 1 MB. (DB-2028)

• Reusing table ID with CREATE TABLE causes failure on restart. (DB-2032)

• BulkLoader class exits without printing the stack trace. (DB-2033)

• Fix binding JMX to any address. (DB-2081)

• sstableloader does not decrypt passwords using config encryption in DSE. (DSP-13492)

• dse client-tool help doesn't work if ~/.dserc file exists. (DSP-15869)

6.0.1 DSE Analytics

• The Spark Jobserver demo has an incorrect version for the Spark Jobserver API. (DSP-15832)
Workaround: In the demo's gradle.properties file, change the version from 0.6.2 to 0.6.2.238, as in the snippet below.
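A minimal sketch of the corrected value. The property name is hypothetical; edit whichever entry in the demo's gradle.properties carries the Spark Jobserver API version:

# gradle.properties (Spark Jobserver demo)
jobServerVersion=0.6.2.238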

Changes and enhancements:

• Decreased the number of exceptions logged during master move from node to node. (DSP-14405)

• When querying remote cluster from Spark job, connector does not route requests to data replicas.
(DSP-15202)

• Long CassandraRDD.where clauses throw StackOverflow exceptions. (DSP-15438)

• AlwaysOn SQL dependency on JPS is removed. The jps_directory entry in dse.yaml is removed.
(DSP-15468)

• Improved AlwaysOn SQL configuration. (DSP-15734)

• Improved security for Spark JobServer. All uploaded JARs, temporary files, and logs are created under the
current user's home directory: ~/.spark-jobserver. (DSP-15832)

• Improved process scanning for AlwaysOn SQL driver. (DSP-15839)

• In Portfolio demo, pricer is no longer required to be run with sudo. (DSP-15970)

• Scala 2.10 in BYOS is no longer supported. (DSP-15999)

• Improved validation for authentication configuration for AlwaysOn SQL. (DSP-16018)

• Optimize memoizing converters for UDTs. (DSP-16121)

• During misconfigured cluster bootstrap, the AlwaysOn SqlServer does not start due to missing /tmp/hive
directory in DSEFS. (DSP-16058)

Resolved issues:

• A shard request timeout caused an assertion error from Lucene getNumericDocValues in the log.
(DSP-14216)

• Multiple Spark Masters can be started on the same machine. (DSP-15636)

• Do not start AlwaysOn SQL until Spark Master is ready. (DSP-15695)

• DSE client tool returns wrong Spark Master address. (DSP-15801)

• In some situations, AlwaysOn SQL cannot start unless DSE node is restarted. (DSP-15871)

• Portfolio demo does not work on package installs. (DSP-15970)

• Java driver in Spark Connector uses daemon threads to prevent shutdown hooks from being blocked by
driver thread pools. (DSP-16051)

• dse client-tool spark sql-schema --all exports definitions for solr_admin keyspace. (DSP-16073).

• HashedWheelTimer leak in Spark Connector, affecting BYOS. (DSP-15569)

6.0.1 DSEFS

Resolved issues:

• Can't quote file patterns in DSEFS shell. (DSP-15550)

6.0.1 DSE Graph

Changes and enhancements:

• DseGraphFrame performance improvement reduces number of joins for count() and other id only queries.
(DSP-15554)

• Performance improvements for traversal execution with Fluent API and script-based executions.
(DSP-15686)

Resolved issues:

• edge_threads and vertex_threads can end up being 0. (DGL-305)

• When using graph frames, cannot upload edges when ids for vertices are complex non-text ids.
(DSP-15614)

• CassandraHiveMetastore is prevented from adding multiple partitions for file-based data sources. Fixes
MSCK REPAIR TABLE command. (DSP-16067)

6.0.1 DSE Search

Changes and enhancements:

• Output Solr foreign filter cache warning only on classes other than DSE classes. (DSP-15625)

• Solr security upgrades bundle. (DSP-15978)

# Apache Directory API All: CVE-2015-3250

# Apache Hadoop Common: CVE-2016-5393, CVE-2017-15713, CVE-2016-3086

# Apache Tika parsers: CVE-2018-1339

# Bouncy Castle Provider: CVE-2018-5382

# Data Mapper for Jackson: CVE-2018-5968, CVE-2017-17485, CVE-2017-15095, CVE-2018-7489, CVE-2017-7525

# Guava: Google Core Libraries for Java: CVE-2018-10237

# Simple XML: CVE-2017-1000190

# Xerces2-j: CVE-2013-4002

# uimaj-core: CVE-2017-15691

Resolved issues:

• Offline sstable tools fail if a DSE Search index is present on a table. (DSP-15628)

• HTTP read on solr_stress doesn't inject random data into placeholders. (DSP-15727)

• Servlet container shutdown (Tomcat) prematurely stops logback context. (DSP-15807)

• ERROR 500 on distributed http json.facet with non-zero offset. (DSP-15946)

• Search index TTL Expiration thread loops without effect with live indexing (RT indexing). (DSP-16038)

• Search incorrectly assumes only single-row ORDER BY clauses on first clustering key. (DSP-16064)

DataStax Bulk Loader 1.0.2

• DataStax Bulk Loader 1.0.2 is bundled with DSE 6.0.1. (DSP-16206)

DataStax recommends using the latest DataStax Bulk Loader 1.2.0. For details, see DataStax Bulk Loader.
Cassandra enhancements for DSE 6.0.1
DataStax Enterprise 6.0.1 is compatible with Apache Cassandra™ 3.11, includes all DataStax enhancements
from earlier releases, and adds these production-certified changes:

• cassandra-stress throws NPE if insert section isn't specified in user profile (CASSANDRA-14426)

• nodetool listsnapshots is missing local system keyspace snapshots (CASSANDRA-14381)

• Remove string formatting lines from BufferPool hot path (CASSANDRA-14416)

• Detect OpenJDK jvm type and architecture (CASSANDRA-12793)

• Don't use guava collections in the non-system keyspace jmx attributes (CASSANDRA-12271)

• Allow existing nodes to use all peers in shadow round (CASSANDRA-13851)

• Fix cqlsh to read connection.ssl cqlshrc option again (CASSANDRA-14299)

• Downgrade log level to trace for CommitLogSegmentManager (CASSANDRA-14370)

• CQL fromJson(null) throws NullPointerException (CASSANDRA-13891)

• Serialize empty buffer as empty string for json output format (CASSANDRA-14245)

• Cassandra not starting when using enhanced startup scripts in windows (CASSANDRA-14418)

• Fix progress stats and units in compactionstats (CASSANDRA-12244)

• Better handle missing partition columns in system_schema.columns (CASSANDRA-14379)

• Deprecate background repair and probabilistic read_repair_chance table options (CASSANDRA-13910)


• Delay hints store excise by write timeout to avoid race with decommission (CASSANDRA-13740)

• Add missed CQL keywords to documentation (CASSANDRA-14359)

• Avoid deadlock when running nodetool refresh before node is fully up (CASSANDRA-14310)

• Handle all exceptions when opening sstables (CASSANDRA-14202)

• Handle incompletely written hint descriptors during startup (CASSANDRA-14080)

• Handle repeat open bound from SRP in read repair (CASSANDRA-14330)

• CqlRecordReader no longer quotes the keyspace when connecting, as the java driver will
(CASSANDRA-10751)

• Fix compaction failure caused by reading un-flushed data (CASSANDRA-12743)

• Fix JSON queries with IN restrictions and ORDER BY clause (CASSANDRA-14286)

• CQL fromJson(null) throws NullPointerException (CASSANDRA-13891)

• Check checksum before decompressing data (CASSANDRA-14284)

General upgrade advice DSE 6.0.1


DataStax Enterprise 6.0.1 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for 6.0.1
DataStax Enterprise (DSE) 6.0.1 includes these production-certified enhancements to TinkerPop 3.3.3:

• Performance enhancement to Bytecode deserialization. (TINKERPOP-1936)

• Path history isn't preserved for keys in mutations. (TINKERPOP-1947)

• Traversal construction performance enhancements (TINKERPOP-1950)

• Bump to Groovy 2.4.15 - resolves a Groovy bug preventing Lambda creation in GLVs in some cases.
(TINKERPOP-1953)

DSE 6.0.0 release notes


17 April 2018

• 6.0.0 Components

• 6.0 New features

• Cassandra enhancements for DSE 6.0

• General upgrade advice for DSE 6.0.0

• TinkerPop changes for DSE 6.0.0

Table 10: DSE functionality

• 6.0.0 DSE core

• 6.0.0 DSE Advanced Replication

• 6.0.0 DSE Analytics

• 6.0.0 DSEFS

• 6.0.0 DSE Graph

• 6.0.0 DSE Search

• 6.0.0 DataStax Studio

• DataStax Bulk Loader 1.0.1

DSE Search and DSE Graph performance variability can result after upgrades from DSE 5.1 to DSE 6.0 and
DSE 6.7.


The DSE Advanced Performance feature introduced in DSE 6.0 included a fundamental architecture change.
Performance is highly dependent on data access patterns and varies from customer to customer. This
upgrade impact affects only DataStax customers using DSE Search and/or DSE Graph.
In response to this scenario:

• DataStax has extended DSE 5.1 end of life (EOL) support to April 18, 2024.

• DataStax is offering a free half-day Upgrade Assessment. This assessment is a DataStax Services
engagement designed to assess the upgrade compatibility of your DSE 5.1 deployment. If you are using
DSE 5.1 and plan to upgrade to DSE 6.0 or DSE 6.7 or DSE 6.8, contact DataStax to schedule your
complimentary assessment.

• DataStax continues to investigate performance differences related to DSE Search and DSE Graph that
occur after some upgrades to DSE 6.0 and DSE 6.7. Additional details have been and will continue to be
included in DSE release notes.

• DataStax recommends 16 or more logical cores for Advanced Performance nodes.

In DSE 6.0.0, do not use TTL (time-to-live) with DSE Search live indexing (RT indexing). To use these features
together, upgrade to DSE 6.0.1. (DSP-16038)

6.0.0 Components

• Apache Solr™ 6.0.1.1.2234

• Apache Spark™ 2.2.0.14

• Apache Tomcat® 8.0.47

• DataStax Bulk Loader 1.0.1

• DSE Java Driver 1.6.5

• Netty 4.1.13.11.dse

• Spark Jobserver 0.8.0.44 (DSE custom version)

• TinkerPop 3.3.2 with additional production-certified changes

DSE 6.0 is compatible with Apache Cassandra™ 3.11 and adds additional production-certified enhancements.

6.0 New features

See DataStax Enterprise 6.0 new features.

6.0.0 DSE core

Experimental features. These features are experimental and are not supported for production:

• SASI indexes.

• DSE OpsCenter Labs features in OpsCenter.

Known issues:

• sstableloader incorrectly detects keyspace when working with snapshots. (DB-2649)


Workaround: create a directory that matches the keyspace name, and then create a symbolic link inside that
directory, named for the destination table, that points to the snapshot directory. For example:

$ mkdir -p /var/tmp/keyspace1
$ ln -s <path>/cassandra/data/keyspace1/standard1-0e65b961deb311e88daf5581c30c2cd4/snapshots/data-load /var/tmp/keyspace1/standard1

• Possible data loss when using DSE Tiered Storage. (DB-3404)


If using DSE Tiered Storage, you must immediately upgrade to at least DSE 5.1.16, DSE 6.0.9, or DSE
6.7.4. Be sure to follow the upgrade instructions.

• DSE 5.0 SSTables with UDTs will be corrupted after migrating to DSE 5.1, DSE 6.0, and DSE 6.7.
(DB-2954, CASSANDRA-15035)
If the DSE 5.0.x schema contains user-defined types (UDTs), upgrade to at least DSE 5.1.13, DSE
6.0.6, or DSE 6.7.2. The SSTable serialization headers are fixed when DSE is started with the upgraded
versions.

• DSE 6.0 will not start with OpsCenter 6.1 installed. OpsCenter 6.5 is required for managing DSE 6.0
clusters. See DataStax OpsCenter compatibility with DSE. (DSP-15996)

Changes and enhancements:

Support for Thrift-compatible tables (COMPACT STORAGE) is dropped. Before upgrading to DSE 6.0, you
must migrate all tables that have COMPACT STORAGE to CQL table format.
Upgrades from DSE 5.0.x or DSE 5.1.x with Thrift-compatible tables require DSE 5.1.6 or later or DSE 5.0.12
or later.
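
For example, one way to migrate such a table on a cluster already running DSE 5.1.6/5.0.12 or later is the
ALTER TABLE ... DROP COMPACT STORAGE statement; the keyspace and table names below are illustrative:

ALTER TABLE cycling.legacy_events DROP COMPACT STORAGE;

Run the statement for every remaining COMPACT STORAGE table before starting the DSE 6.0 upgrade.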

• For TWCS, flush to separate SSTables based on write time. (DB-42)

• Allow aggregation by time intervals and allow aggregates in GROUP BY results. (DB-75)

• Allow user-defined functions (UDFs), including non-deterministic UDFs, within the GROUP BY clause. New CQL keywords (DETERMINISTIC and MONOTONIC). The enable_user_defined_functions_threads option in cassandra.yaml keeps its default of true; set it to false to use UDFs in GROUP BY clauses. (DB-672)

• Improved architecture with Thread Per Core (TPC) asynchronous read and write paths. (DB-707)
New DSE start-up parameters:

# -Ddse.io.aio.enable

# -Ddse.io.aio.force

Observable metrics with nodetool tpstats.

• New options in cassandra.yaml. (DB-111, DB-707, DB-945, DB-1381, DB-1656)

# aggregated_request_timeout_in_ms

# batchlog_endpoint_strategy to improve batchlog endpoint selection. (DB-1367)

# client_timeout_sec, cancel_timeout_sec, file_cache_size_in_mb, tpc_cores, tpc_io_cores, io_global_queue_depth

# The rpc_* properties are deprecated and renamed to native_transport_*. (DB-1130)

# streaming_connections_per_host

# key_cache_* settings are no longer used in new SSTable format, but retained to support existing
SSTable format


• Removed options in cassandra.yaml:

# buffer_pool_use_heap_if_exhausted, concurrent_counter_writes, concurrent_materialized_view_writes, concurrent_reads, concurrent_writes, credentials_update_interval_in_ms, credentials_validity_in_ms, max_client_wait_time_ms, max_threads, native_transport_max_threads, otc_backlog_expiration_interval_ms, request_scheduler.

# Deprecated options and their replacements:

rpc_address replaced by native_transport_address
rpc_interface replaced by native_transport_interface
rpc_interface_prefer_ipv6 replaced by native_transport_interface_prefer_ipv6
rpc_port replaced by native_transport_port
broadcast_rpc_address replaced by native_transport_broadcast_address
rpc_keepalive replaced by native_transport_keepalive

• Default value changes in cassandra.yaml:

# batch_size_warn_threshold_in_kb: 64

# column_index_size_in_kb: 16

# memtable_flush_writers: 4

# roles_validity_in_ms: 120000 (2 minutes)

# permissions_validity_in_ms: 120000 (2 minutes)

• Legacy auth tables no longer supported. (DB-897)

• Authentication and authorization improvements. RLAC (setting row-level permissions) speed is improved.
(DB-909)

• Incremental repair is opt-in. (DB-1126)

• JMX exposed metrics for external dropped messages include COUNTER_MUTATION, MUTATION,
VIEW_MUTATION, RANGE_SLICE, READ, READ_REPAIR, LWT, HINTS, TRUNCATE, SNAPSHOT,
SCHEMA, REPAIR, OTHER. (DB-1127)

• By default, enable heap histogram logging on OutOfMemoryError. To disable, set the cassandra.printHeapHistogramOnOutOfMemoryError system property to false. (DB-1498)

• After upgrade is complete and all nodes are on DSE 6.0 and the required schema change occurs,
authorization (CassandraAuthorizer) and audit logging (CassandraAuditWriter) enable the use of new
columns. (DB-1597)

• Automatic fallback of GossipingPropertyFileSnitch to PropertyFileSnitch (cassandra-topology.properties) is disabled by default and can be enabled by using the -Dcassandra.gpfs.enable_pfs_compatibility_mode=true startup flag. (DB-1663)

• Improved messages when mixing mutually exclusive YAML properties. (DB-1719)

• Background read-repair. (DB-1771)

• Authentication filters used in DSE Search moved to DSE core. (DSP-12531)

• The DataStax Installer is no longer supported. To upgrade from earlier versions that used the DataStax
Installer, see Upgrading to DSE 6.0 from DataStax Installer installations. For new installations, use a
supported installation method. (DSP-13640)


• Improved authentication and security. (DSP-14173)


Supporting changes:

# Allow to grant/revoke multiple permissions in one statement. (DB-792)

# Database administrators can manage role permissions without having access to the data. (DB-757)

# Filter rows from system keyspaces and system_schema tables based on user permissions. New
system_keyspaces_filtering option in cassandra.yaml returns information based on user access to
keyspaces. (DB-404)

# Removed cassandra.yaml options credentials_validity_in_ms and credentials_update_interval_in_ms. For upgrade impact, see Upgrading from DataStax Enterprise 5.1 to 6.0. (DB-909)

# Warn when the cassandra superuser logs in. (DB-104)

# New metric for replayed batchlogs and trace-level logging include the age of the replayed batchlog.
(DB-1314)

# Decimals with a scale > 100 are no longer converted to a plain string to prevent
DecimalSerializer.toString() being used as an attack vector. (DB-1848)

# Auditing by role: new dse.yaml audit options included_roles and excluded_roles. (DSP-15733)

• The libaio package is a dependency for DataStax Enterprise 6.0 installations on RHEL-based systems using Yum and on Debian-based systems using APT. For optimal performance in tarball installations, DataStax recommends installing the libaio package. (DSP-14228)

• DSE performance objects metrics changes in tables dse_perf.node_snapshot, dse_perf.cluster_snapshot, and dse_perf.dc_snapshot. (DSP-14413)

# Metrics are populated in two new columns: background_io_pending and hints_pending.

# Metrics are not populated; -1 is written for columns: read_requests_pending, write_requests_pending, completed_mutations, and replicate_on_write_tasks_pending.

• The default number of threads used by performance objects increased from 1 to 4. Upgrade restrictions
apply. (DSP-14515)

• All tables are created without COMPACT STORAGE. (DSP-14735)

• Support for Thrift-compatible tables (COMPACT STORAGE) is dropped. Before upgrading, migrate all
tables that have COMPACT STORAGE to CQL table format. DSE 6.0 will not start if COMPACT STORAGE
tables are present. See Upgrading from DSE 5.1.x or Upgrading from DSE 5.0.x. (DSP-14839)

• The minimum supported version of Oracle Java SE Runtime Environment 8 (JDK) is 1.8u151. (DSP-14818)

• sstabledump supports the -l option to output each partition as its own JSON object. (DSP-15079)

• Audit improvements, new and changed filtering event categories. (DSP-15724)

• Upgrades to OpsCenter 6.5 or later are required before starting DSE 6.0. DataStax recommends upgrading
to the latest OpsCenter version that supports your DSE version. Check the compatibility page for your
products. (DSP-15996)

Resolved issues:

• Warn when the cassandra superuser logs in. (DB-104)

• Prevent multiple serializations of mutation. (DB-370)

• Internal implementation of paging by bytes. (DB-414)

• Connection refused should be logged less frequently. (DB-455)

• Refactor messaging service code. (DB-497)


• Change protocol to allow sending keyspace independent of query string. (DB-600)

• Add result set metadata to prepared statement MD5 hash calculation. (DB-608)

• Add DSE columns to system tables. (DB-716)

system.peers:
dse_version text,
graph boolean,
server_id text,
workload text,
workloads frozen<set<text>>

system.local:
dse_version text,
graph boolean,
server_id text,
workload text,
workloads frozen<set<text>>

• Fix LWT asserts for immutable TableMetadata. (DB-728)

• MigrationManager should use toDebugString() when logging TableMetadata. (DB-739)

• Create administrator roles who can carry out everyday administrative tasks without having unnecessary
access to data. (DB-757)

• When repairing Paxos commits, only block on nodes that are being repaired. (DB-761)

• Allow to grant/revoke multiple permissions in one statement (DB-792)

• SystemKeyspace.snapshotOnVersionChange() never called in production code. (DB-797)

• Error in counting iterated SSTables when choosing whether to defrag in timestamp ordered path. (DB-1018)

• Check for mismatched versions when answering schema pulls. (DB-1026)

• Expose ports (storage, native protocol, JMX) in system local and peers tables. (DB-1040)

• Rename ITrigger interface method from augment to augmentNonBlocking. (DB-1046)

• Load mapped buffer into physical memory after mlocking it for MemoryOnlyStrategy. (DB-1052)

• New STARTUP message parameters identify clients. (DB-1054)

• Emit client warning when a GRANT/REVOKE/RESTRICT/UNRESTRICT command has no effect. (DB-1083)

• Update bundled Python driver to 2.2.0.post0-d075d57. (DB-1152)

• Forbid advancing KeyScanningIterator before exhausting or closing the current iterator. (DB-1199)

• Ensure that empty clusterings with kind==CLUSTERING are Clustering.EMPTY. (DB-1248)

• New nodetool abortrebuild command stops a currently running rebuild operation. (DB-1234)

• Batchlog replays do not leverage remote coordinators. (DB-1337)

• Avoid copying EMPTY_STATIC_ROW to heap again with offheap memtable. (DB-1375)

• Allow DiskBoundaryManager to cache different directories. (DB-1454)

• Abort repair when there is only one node. (DB-1511)

• OutOfMemory during view update. (DB-1493)

• Drop response on view lock acquisition timeout and add ViewLockAcquisitionTimeouts metric. (DB-1522)

• Handle race condition on dropping keyspace and opening keyspace. (DB-1570)


• The JVM version check in conf/cassandra-env.sh does not work. (DB-1882)

• dsetool ring prints ERROR when data_file_directories is removed from cassandra.yaml. (DSP-13547)

• Driver: Jackson-databind is vulnerable to remote code execution (RCE) attacks. (DSP-13498)

• LDAP library issue. (DSP-15927)

6.0.0 DSE Advanced Replication

Changes and enhancements:

• Support for DSE Advanced Replication V1 is removed. For V1 installations, you must first upgrade to DSE
5.1.x and migrate your DSE Advanced Replication to V2, and then upgrade to DSE 6.0. (DSP-13376)

• Enhanced CLI security prevents injection attacks and sanitizes and validates the command line inputs.
(DSP-13682)

Resolved issues:

• Improve logging on unsupported operation failure and remove the failed mutation from replog. (DSP-15043)

• Channel creation fails with NPE when using mixed case destination name. (DSP-15538)

6.0.0 DSE Analytics

Experimental features. These features are experimental and are not supported for production:

• Importing graphs using DseGraphFrame.

Known issues:

• DSE Analytics: Additional configuration is required when enabling context-per-jvm in the Spark Jobserver.
(DSP-15163)

Changes and enhancements:

• Previously deprecated environment variables, including SPARK_CLASSPATH, are removed in Spark 2.2.0.
(DSP-8379)

• AlwaysOn SQL service, a HA (highly available) Spark SQL Thrift server. (DSP-10996)

# JPS is required for nodes with AlwaysOn SQL enabled.

# The spark_config_settings and hive_config_settings are removed from dse.yaml. The configuration is
provided in the spark-alwayson-sql.conf file in DSEHOME/resources/spark/conf with the same default
contents as DSEHOME/resources/spark/conf/spark-defaults.conf. (DSP-15837)

• Cassandra File System (CFS) is removed. Use DSEFS instead. Before upgrading to DSE 6.0, remove CFS
keyspaces. See the From CFS to DSEFS dev blog post. (DSP-12470)

• Optimization for SearchAnalytics with SELECT COUNT(*) and no predicates. (DSP-12669)

• Authenticate JDBC users to Spark SQL Thrift Server. Queries that are executed during JDBC session are
run as the user who authenticated through JDBC. (DSP-13395)

• Solr optimization is automatic; spark.sql.dse.solr.enabled is deprecated, use spark.sql.dse.search.enableOptimization instead. (DSP-13398)

• Optimization for SearchAnalytics with SELECT COUNT(*) and no predicates. (DSP-13398)

• dse spark-beeline command is removed, use dse beeline instead. (DSP-13468)

• cfs-stress tool is replaced by fs-stress tool. (DSP-13549)


• Encryption for data stored on the server and encryption of Spark spill files is supported. (DSP-13841)

• Improved security with Spark. (DSP-13991)

• Spark local applications no longer use /var/lib/spark/rdd; instead, configure and use the .spark directory for processes started by the user. (DSP-14380)

• Input metrics are not thread-safe and are not used properly in CassandraJoinRDD and
CassandraLeftJoinRDD. (DSP-14569)

• AlwaysOn SQL workpool option adds high availability (HA) for the JDBC or ODBC connections for analytics
node. (DSP-14719)

• CFS is removed. Before upgrade, move HiveMetaStore from CFS to DSEFS and update URL references.
(DSP-14831)

• Include SPARK-21494 to use correct app id when authenticating to external service. (DSP-14140)

• Upgrade to DSE 6.0 must be complete on all nodes in the cluster before Spark Worker and Spark Master
will start. (DSP-14735)

• Spark Cassandra Connector in DSE 6.0.0, has the following changes:

# Changes to default values: spark.output.concurrent.writes: 100, spark.task.maxFailures: 10. (DSP-15164)

# spark.cassandra.connection.connections_per_executor_max is removed; use the new properties spark.cassandra.connection.local_connections_per_executor, spark.cassandra.connection.remote_connections_per_executor_min, and spark.cassandra.connection.remote_connections_per_executor_max. (DSP-15193)

# All Spark-related parameters are now camelCase. Parameters are case-sensitive. The snake_case versions are automatically translated to the camelCase versions except when the parameters are used as table options. In SparkSQL and with spark.read.options(...), the parameters are case-insensitive because of the internal SQL implementation.

# The DSE artifact is com.datastax.dse:spark-connector:6.0.0.

# The DseSparkDependencies JAR is still required. (DSP-15694)

• Use NodeSync (continuous repair) and LOCAL_QUORUM for reading from Spark recovery storage.
(DSP-15219)
Supporting changes:

# Spark Master will not start until LOCAL_QUORUM is achieved for dse_analytics keyspace.

# Spark Master recovery data is first written with LOCAL_QUORUM and, if that fails, with LOCAL_ONE. Recovery data is always queried with LOCAL_QUORUM (unlike previous versions of DSE, which used LOCAL_ONE).

# DSE Analytics internal data moved from spark_system to dse_analytics keyspace.

DataStax strongly recommends enabling NodeSync for continuous repair on all tables in the
dse_analytics keyspace. NodeSync is required on the rm_shared_data keyspace that stores Spark
recovery information.
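
As a sketch, NodeSync is enabled per table through a table option; the table name below is illustrative:

ALTER TABLE dse_analytics.example_table WITH nodesync = {'enabled': 'true'};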

Resolved issues:

• DSE does not work with Spark Crypto based encryption. (DSP-14140)

6.0.0 DSEFS

Changes and enhancements:


• Wildcard characters are supported in DSEFS shell commands. (DSP-10583)

• DSEFS should support all DSE authentication schemes. (DSP-12956)

• Improved authorization security sets the default permission to 755 for directories and 644 for files. New
DSEFS clusters create the root directory / with 755 permission to prevent non-super users from modifying
root content; for example, by using mkdir or put commands. (DSP-13609)

• Enable SSL for DSEFS encryption. (DSP-13771)

• HTTP communication logging level changed from DEBUG to TRACE. (DSP-14400)

• DSEFS shell history has been moved to ~/.dse/dsefs_history. (DSP-15070)

• New tool to move hive metastore from CFS to DSEFS and update references.

• Add echo command to DSEFS shell. (DSP-15446)

• Changes in dse.yaml for advanced DSEFS settings.

• Alternatives wildcards are Hadoop compatible. (DSP-15249)

6.0.0 DSE Graph

Known issues:

• Dropping a property of vertex label with materialized view (MV) indices breaks graph. To drop a property
key for a vertex label that has a materialized view index, additional steps are required to prevent data loss or
cluster errors. See Dropping graph schema. (DSP-15532)

• Secondary indexes used for DSE Graph queries have higher latency in DSE 6.0 than in the previous
version. (DB-1928)

• Backup snapshots taken with OpsCenter 6.1 will not load to DSE 6.0. Use the backup service in OpsCenter
6.5 or later. (DSP-15922)

Changes and enhancements:

• Improved and simplified data batch loading of pre-formatted data. (DGL-235)


Supporting changes:

# Schema discovery and schema generation are deprecated. (DGL-246)

# Standard vertex IDs are deprecated. Use custom vertex IDs instead. (DSP-13485)

# Standard IDs are deprecated. (DGL-247)

# Transformations are deprecated. (DGL-248)

• Schema API changes: all .remove() methods are renamed to .drop() and schema.clear() is renamed to
schema.drop(). Schema API supports removing vertex/edge labels and property keys. Unify use of drop |
remove | clear in the Schema API and use .drop() everywhere. (DSP-8385, DSP-14150)

• Include materialized view (MV) indexes in query optimizer only if the MV was fully built. (DSP-10219)

• DSE profiling of graph statements from the gremlin shell. (DSP-13484)

• Improve Graph OLAP performance by smart routing query to DseGraphFrame engine with
DseGraphFrameInterceptorStrategy. (DSP-13489)

• OSS TinkerPop 3.3 supports Spark 2.2. (DSP-13632)

• Partitioned vertex tables (PVT) are removed. (DSP-13676)


• Graph online analytical processing (OLAP) supports drop() with DseGraphFrame interceptor. Simple queries
can be used in drop operations. (DSP-13998)

• DSE Graph vertex and edge tables are accessible from SparkSQL and are automatically exposed in the dse_graph SparkSQL database. (DSP-12046)

• More Gremlin APIs are supported in DSEGraphFrames: dedup, sort, limit, filter, as()/select(), or().
(DSP-13649)

• Some graph and gremlin_server properties in earlier versions of DSE are no longer required for DSE 6.0.
The default settings from the earlier versions of dse.yaml are preserved. These settings were removed from
dse.yaml.

# adjacency_cache_clean_rate

# adjacency_cache_max_entry_size_in_mb

# adjacency_cache_size_in_mb

# gremlin_server_enabled

# index_cache_clean_rate

# index_cache_max_entry_size_in_mb

# schema_mode - default schema_mode is production

# window_size

# ids (all vertex ID assignment and partitioning strategy options)

# various gremlin_server settings

If these properties exist in the dse.yaml file after upgrading to DSE 6.0, logs display warnings. You can
ignore these warnings or modify dse.yaml so that only the required graph system level and gremlin_server
properties are present. (DSP-14308)

• Spark Jobserver is the DSE custom version 0.8.0.44. Applications must use the compatible Spark Jobserver
API in DataStax repository. (DSP-14152)

• Edge label names and property key names allow only [a-zA-Z0-9], underscore, hyphen, and period. The
string formatting for vertices with text custom IDs has changed. (DSP-14710)
Supporting changes (DSP-15167):

# schema.describe() displays the entire schema, even if it contains illegal names.

# In-place upgrades allow existing schemas with invalid edge label names and property key names.

# Schema elements with illegal names cannot be uploaded or added.

• Invoking toString on a custom vertex ID containing a text property, or on an edge ID that is incident upon a
vertex with a custom vertex ID, now returns a value that encloses the text property value in double quotation
marks and escapes the value's internal double-quotes. This change protects older formats from irresolvable
parsing ambiguity. For example:

// old
{~label=v, x=foo}
{~label=w, x=a"b}
// new
{~label=v, x="foo"}
{~label=w, x="a""b"}

• Support for math()-step (math) to enable scientific calculator functionality within Gremlin. (DSP-14786)

• The GraphQueryThreads JMX attribute has been removed. Thread selection occurs with Thread Per Core
(TPC) asynchronous request processing architecture. (DSP-15222)


Resolved issues:

• spark.sql.hive.metastore.barrierPrefixes is set to org.apache.spark.sql.cassandra to properly use CassandraConnector in the DSE HiveMetastore implementation. (DSP-14120)

• Intermittent KryoException: Buffer underflow error when running order by query in OLTP mode.
(DSP-12694)

• DseGraphFrame does not support infix and() and or(). (DSP-12013)

• Graph could be left in an inconsistent state if a schema migration fails. (DSP-15532)

• DseGraphFrames properties().count() step returns vertex count instead of multi-property count.
(DSP-15049)

• GraphSON parsing error prevents proper type detection under certain conditions. (DSP-14066)

6.0.0 DSE Search

Experimental features. These features are experimental and are not supported for production:

• The dsetool index_checks use an Apache Lucene® experimental feature.

Known issues:

• Search index TTL Expiration thread loops without effect with live indexing (RT indexing). (DSP-16038)

Changes and enhancements:

• DSE Search is very IO intensive. Performance is impacted by the Thread Per Core (TPC) asynchronous
read and write paths architecture. (DB-707)
Before using DSE Search in DSE 6.0 and later, review and follow the DataStax recommendations:

# On search nodes, change the tpc_cores value from its default to the number of physical CPUs. Refer
to Tuning TPC cores.

# Disable AIO and set the file_cache_size_in_mb value to 512. Refer to Disabling AIO.

# Locate DSE Cassandra transactional data and Solr-based DSE Search data on separate Solid State
Drives (SSDs). Refer to Set the location of search indexes.

# Plan for sufficient memory resources and disk space to meet operational requirements. Refer to
Capacity planning for DSE Search.

• Writes are flushed to disk in segments that use a new Lucene codec that does not exist in earlier versions.
Unique key values are no longer stored as both docValues and Lucene stored fields. The unique key values
are now stored only as docValues in a new codec to store managed fields like Lucene. Downgrades to
versions earlier than DSE 6.0 are not supported. (DSP-8465)

• Document inserts and updates using HTTP are removed. Before upgrading, ensure you are using CQL for
all inserts and updates. (DSP-9725).

• DSENRTCachingDirectoryFactory is removed. Before upgrading, change your search index config. (DSP-10126)

• The <dataDir> parameter in the solrconfig.xml file is not supported. Instead, follow the steps in Set the
location of search indexes. (DSP-13199)

• Improved performance by early termination of sorting. Ideal for queries that need only a few results returned from a large number of total matches. (DSP-13253)

• Native CQL syntax for search queries. (DSP-13411)


Supporting changes:


# The default for CQL text type changed from solr.TextField to solr.StrField.

# Updated wikipedia demo syntax.

# enable_tokenized_text_copy_fields replaces enable_string_copy_fields in spaceSaving profiles.

# The spaceSavingNoTextfield resource generation profile is removed.

• Delete by id is removed. Delete by query no longer accepts wildcard queries, including queries that match
all documents (for example, <delete><query>*:*</query></delete>). Instead, use CQL to DELETE by
Primary Key or the TRUNCATE command. (DSP-13436)

• Search index config changes. (DSP-14137)

# mergePolicy, maxMergeDocs, and mergeFactor are no longer supported.

# RAM buffer size settings are no longer required in search index config. Global RAM buffer usage in
Lucene is governed by the memtable size limits in cassandra.yaml. RAM buffers are counted toward the
memtable_heap_space_in_mb.

• dsetool core_indexing_status --progress option is always true. (DSP-13465)

• The HTTP API for Solr core management is removed. Instead, use CQL commands for search index
management or dsetool search index commands. (DSP-13530)

• The Tika functionality bundled with Apache Solr is removed. Instead, use the stand-alone Apache Tika
project. (DSP-13892)

• Logging configuration improvements. (DSP-14137)

# The solrvalidation.log is removed. You can safely remove appender SolrValidationErrorAppender and
the logger SolrValidationErrorLogger from logback.xml. Indexing errors manifest as:

# failures at the coordinator if they represent failures that might succeed at some later point in time
using the hint replay mechanism

# as messages in the system.log if the failures are due to non-recoverable indexing validation errors
(for data that is written to the database, but not indexed properly)

• The DSE custom update request processor (URP) implementation is deprecated. Use the field input/output
(FIT) transformer API instead. (DSP-14360)

• The stored flag in search index schemas is deprecated and is no longer added to auto-generated schemas.
If the flag exists in custom schemas, it is ignored. (DSP-14425)

• Tuning improvements for indexing. (DSP-14785, DSP-14978)

# Indexing is no longer asynchronous. Document updates are written to the Lucene RAM buffer synchronously with the mutation to the backing table.

# See Tuning search for maximum indexing throughput.

# The back_pressure_threshold_per_core in dse.yaml affects only index rebuilding/reindexing. DataStax recommends not changing the default value of 1024.

# These options in dse.yaml are removed:

# enable_back_pressure_adaptive_nrt_commit

# max_solr_concurrency_per_core

# solr_indexing_error_log_options

DSE 6.0 will not start with these options present.


• StallMetrics MBean is removed. Before upgrading to DSE 6.0, change operators that use the MBean.
(DSP-14860)

• Optimize Paging when limit is smaller than the page size. (DSP-15207)

Resolved issues include all bug fixes up to DSE 5.1.8. Additional 6.0.0 fixes:

• Isolate Solr Resource Loading at startup to the Local DC. (DSP-10911)

DataStax Studio 6.0.0

• For use with DSE 6.0.x, DataStax Studio 6.0.0 is installed as a standalone tool. (DSP-13999, DSP-15623)

For details, see DataStax Studio 6.0.0 release notes.

DataStax Bulk Loader 1.0.1

• DataStax Bulk Loader (dsbulk) version 1.0.1 is automatically installed with DataStax Enterprise 6.0.0, and
can also be installed as a standalone tool. (DSP-13999, DSP-15623)

For details, see DataStax Bulk Loader 1.0.1 release notes.


Cassandra enhancements for DSE 6.0
DataStax Enterprise 6.0.0 is compatible with Apache Cassandra™ 3.11 and adds these production-certified
enhancements:

• Add DEFAULT, UNSET, MBEAN and MBEANS to `ReservedKeywords`. (CASSANDRA-14205)

• Add Unittest for schema migration fix (CASSANDRA-14140)

• Print correct snitch info from nodetool describecluster (CASSANDRA-13528)

• Close socket on error during connect on OutboundTcpConnection (CASSANDRA-9630)

• Enable CDC unittest (CASSANDRA-14141)

• Split CommitLogStressTest to avoid timeout (CASSANDRA-14143)

• Improve commit log chain marker updating (CASSANDRA-14108)

• Fix updating base table rows with TTL not removing view entries (CASSANDRA-14071)

• Reduce garbage created by DynamicSnitch (CASSANDRA-14091)

• More frequent commitlog chained markers (CASSANDRA-13987)

• RPM package spec: fix permissions for installed jars and config files (CASSANDRA-14181)

• More PEP8 compliance for cqlsh (CASSANDRA-14021)

• Fix support for SuperColumn tables (CASSANDRA-12373)

• Fix missing original update in TriggerExecutor (CASSANDRA-13894)

• Improve short read protection performance (CASSANDRA-13794)

• Fix counter application order in short read protection (CASSANDRA-12872)

• Fix MV timestamp issues (CASSANDRA-11500)

• Fix AssertionError in short read protection (CASSANDRA-13747)

• Gossip thread slows down when using batch commit log (CASSANDRA-12966)


• Allow native function calls in CQLSSTableWriter (CASSANDRA-12606)

• Copy session properties on cqlsh.py do_login (CASSANDRA-13847)

• Fix load over calculated issue in IndexSummaryRedistribution (CASSANDRA-13738)

• Obfuscate password in stress-graphs (CASSANDRA-12233)

• ReverseIndexedReader may drop rows during 2.1 to 3.0 upgrade (CASSANDRA-13525)

• Avoid reading static row twice from old format sstables (CASSANDRA-13236)

• Fix possible NPE on upgrade to 3.0/3.X in case of IO errors (CASSANDRA-13389)

• Add duration data type (CASSANDRA-11873)

• Properly report LWT contention (CASSANDRA-12626)

• Stress daemon help is incorrect (CASSANDRA-12563)

• Remove ALTER TYPE support (CASSANDRA-12443)

• Fix assertion for certain legacy range tombstone pattern (CASSANDRA-12203)

• Remove support for non-JavaScript UDFs (CASSANDRA-12883)

• Better handle invalid system roles table (CASSANDRA-12700)

• Upgrade netty version to fix memory leak with client encryption (CASSANDRA-13114)

• Fix trivial log format error (CASSANDRA-14015)

• Allow SSTabledump to do a JSON object per partition (CASSANDRA-13848)

• Remove unused and deprecated methods from AbstractCompactionStrategy (CASSANDRA-14081)

• Fix Distribution.average in cassandra-stress (CASSANDRA-14090)

• Presize collections (CASSANDRA-13760)

• Add GroupCommitLogService (CASSANDRA-13530)

• Parallelize initial materialized view build (CASSANDRA-12245)

• Fix flaky SecondaryIndexManagerTest.assert[Not]MarkedAsBuilt (CASSANDRA-13965)

• Make LWTs send resultset metadata on every request (CASSANDRA-13992)

• Fix flaky indexWithFailedInitializationIsNotQueryableAfterPartialRebuild (CASSANDRA-13963)

• Introduce leaf-only iterator (CASSANDRA-9988)

• Allow only one concurrent call to StatusLogger (CASSANDRA-12182)

• Refactoring to specialised functional interfaces (CASSANDRA-13982)

• Speculative retry should allow more friendly parameters (CASSANDRA-13876)

• Throw exception if we send/receive repair messages to incompatible nodes (CASSANDRA-13944)

• Replace usages of MessageDigest with Guava's Hasher (CASSANDRA-13291)

• Add nodetool command to print hinted handoff window (CASSANDRA-13728)

• Fix some alerts raised by static analysis (CASSANDRA-13799)

• Checksum SSTable metadata (CASSANDRA-13321, CASSANDRA-13593)

• Add result set metadata to prepared statement MD5 hash calculation (CASSANDRA-10786)


• Add incremental repair support for --hosts, --force, and subrange repair (CASSANDRA-13818)

• Refactor GcCompactionTest to avoid boxing (CASSANDRA-13941)

• Expose recent histograms in JmxHistograms (CASSANDRA-13642)

• Add SERIAL and LOCAL_SERIAL support for cassandra-stress (CASSANDRA-13925)

• LCS needlessly checks for L0 STCS candidates multiple times (CASSANDRA-12961)

• Correctly close netty channels when a stream session ends (CASSANDRA-13905)

• Update lz4 to 1.4.0 (CASSANDRA-13741)

• Throttle base partitions during MV repair streaming to prevent OOM (CASSANDRA-13299)

• Improve short read protection performance (CASSANDRA-13794)

• Fix AssertionError in short read protection (CASSANDRA-13747)

• Use compaction threshold for STCS in L0 (CASSANDRA-13861)

• Fix problem with min_compress_ratio: 1 and disallow ratio < 1 (CASSANDRA-13703)

• Add extra information to SASI timeout exception (CASSANDRA-13677)

• Rework CompactionStrategyManager.getScanners synchronization (CASSANDRA-13786)

• Add additional unit tests for batch behavior, TTLs, Timestamps (CASSANDRA-13846)

• Add keyspace and table name in schema validation exception (CASSANDRA-13845)

• Emit metrics whenever we hit tombstone failures and warn thresholds (CASSANDRA-13771)

• Allow changing log levels via nodetool for related classes (CASSANDRA-12696)

• Add stress profile yaml with LWT (CASSANDRA-7960)

• Reduce memory copies and object creations when acting on ByteBufs (CASSANDRA-13789)

• Simplify mx4j configuration (CASSANDRA-13578)

• Fix trigger example on 4.0 (CASSANDRA-13796)

• Force minimum timeout value (CASSANDRA-9375)

• Add bytes repaired/unrepaired to nodetool tablestats (CASSANDRA-13774)

• Don't delete incremental repair sessions if they still have sstables (CASSANDRA-13758)

• Fix pending repair manager index out of bounds check (CASSANDRA-13769)

• Don't use RangeFetchMapCalculator when RF=1 (CASSANDRA-13576)

• Don't optimise trivial ranges in RangeFetchMapCalculator (CASSANDRA-13664)

• Use an ExecutorService for repair commands instead of new Thread(..).start() (CASSANDRA-13594)

• Fix race / ref leak in anticompaction (CASSANDRA-13688)

• Fix race / ref leak in PendingRepairManager (CASSANDRA-13751)

• Enable ppc64le runtime as unsupported architecture (CASSANDRA-13615)

• Improve sstablemetadata output (CASSANDRA-11483)

• Support for migrating legacy users to roles has been dropped (CASSANDRA-13371)

• Introduce error metrics for repair (CASSANDRA-13387)


• Refactoring to primitive functional interfaces in AuthCache (CASSANDRA-13732)

• Update metrics to 3.1.5 (CASSANDRA-13648)

• batch_size_warn_threshold_in_kb can now be set at runtime (CASSANDRA-13699)

• Avoid always rebuilding secondary indexes at startup (CASSANDRA-13725)

• Upgrade JMH from 1.13 to 1.19 (CASSANDRA-13727)

• Upgrade SLF4J from 1.7.7 to 1.7.25 (CASSANDRA-12996)

• Default for start_native_transport now true if not set in config (CASSANDRA-13656)

• Don't add localhost to the graph when calculating where to stream from (CASSANDRA-13583)

• Allow skipping equality-restricted clustering columns in ORDER BY clause (CASSANDRA-10271)

• Use common nowInSec for validation compactions (CASSANDRA-13671)

• Improve handling of IR prepare failures (CASSANDRA-13672)

• Send IR coordinator messages synchronously (CASSANDRA-13673)

• Flush system.repair table before IR finalize promise (CASSANDRA-13660)

• Fix column filter creation for wildcard queries (CASSANDRA-13650)

• Add 'nodetool getbatchlogreplaythrottle' and 'nodetool setbatchlogreplaythrottle' (CASSANDRA-13614)

• fix race condition in PendingRepairManager (CASSANDRA-13659)

• Allow noop incremental repair state transitions (CASSANDRA-13658)

• Run repair with down replicas (CASSANDRA-10446)

• Added started & completed repair metrics (CASSANDRA-13598)

• Added started & completed repair metrics (CASSANDRA-13598)

• Improve secondary index (re)build failure and concurrency handling (CASSANDRA-10130)

• Improve calculation of available disk space for compaction (CASSANDRA-13068)

• Change the accessibility of RowCacheSerializer for third party row cache plugins (CASSANDRA-13579)

• Allow sub-range repairs for a preview of repaired data (CASSANDRA-13570)

• NPE in IR cleanup when columnfamily has no sstables (CASSANDRA-13585)

• Fix Randomness of stress values (CASSANDRA-12744)

• Allow selecting Map values and Set elements (CASSANDRA-7396)

• Fast and garbage-free Streaming Histogram (CASSANDRA-13444)

• Update repairTime for keyspaces on completion (CASSANDRA-13539)

• Add configurable upper bound for validation executor threads (CASSANDRA-13521)

• Bring back maxHintTTL property (CASSANDRA-12982)

• Add testing guidelines (CASSANDRA-13497)

• Add more repair metrics (CASSANDRA-13531)

• RangeStreamer should be smarter when picking endpoints for streaming (CASSANDRA-4650)

• Avoid rewrapping an exception thrown for cache load functions (CASSANDRA-13367)


• Log time elapsed for each incremental repair phase (CASSANDRA-13498)

• Add multiple table operation support to cassandra-stress (CASSANDRA-8780)

• Fix incorrect cqlsh results when selecting same columns multiple times (CASSANDRA-13262)

• Fix WriteResponseHandlerTest is sensitive to test execution order (CASSANDRA-13421)

• Improve incremental repair logging (CASSANDRA-13468)

• Start compaction when incremental repair finishes (CASSANDRA-13454)

• Add repair streaming preview (CASSANDRA-13257)

• Cleanup isIncremental/repairedAt usage (CASSANDRA-13430)

• Change protocol to allow sending key space independent of query string (CASSANDRA-10145)

• Make gc_log and gc_warn settable at runtime (CASSANDRA-12661)

• Take number of files in L0 in account when estimating remaining compaction tasks (CASSANDRA-13354)

• Skip building views during base table streams on range movements (CASSANDRA-13065)

• Improve error messages for +/- operations on maps and tuples (CASSANDRA-13197)

• Remove deprecated repair JMX APIs (CASSANDRA-11530)

• Fix version check to enable streaming keep-alive (CASSANDRA-12929)

• Make it possible to monitor an ideal consistency level separate from actual consistency level
(CASSANDRA-13289)

• Outbound TCP connections ignore internode authenticator (CASSANDRA-13324)

• Upgrade junit from 4.6 to 4.12 (CASSANDRA-13360)

• Cleanup ParentRepairSession after repairs (CASSANDRA-13359)

• Upgrade snappy-java to 1.1.2.6 (CASSANDRA-13336)

• Incremental repair not streaming correct sstables (CASSANDRA-13328)

• Upgrade the JNA version to 4.3.0 (CASSANDRA-13300)

• Add the currentTimestamp, currentDate, currentTime and currentTimeUUID functions (CASSANDRA-13132)

• Remove config option index_interval (CASSANDRA-10671)

• Reduce lock contention for collection types and serializers (CASSANDRA-13271)

• Make it possible to override MessagingService.Verb ids (CASSANDRA-13283)

• Avoid synchronized on prepareForRepair in ActiveRepairService (CASSANDRA-9292)

• Adds the ability to use uncompressed chunks in compressed files (CASSANDRA-10520)

• Don't flush sstables when streaming for incremental repair (CASSANDRA-13226)

• Remove unused method (CASSANDRA-13227)

• Fix minor bugs related to #9143 (CASSANDRA-13217)

• Output warning if user increases RF (CASSANDRA-13079)

• Remove pre-3.0 streaming compatibility code for 4.0 (CASSANDRA-13081)

• Add support for + and - operations on dates (CASSANDRA-11936)


• Fix consistency of incrementally repaired data (CASSANDRA-9143)

• Increase commitlog version (CASSANDRA-13161)

• Make TableMetadata immutable, optimize Schema (CASSANDRA-9425)

• Refactor ColumnCondition (CASSANDRA-12981)

• Parallelize streaming of different keyspaces (CASSANDRA-4663)

• Improved compactions metrics (CASSANDRA-13015)

• Speed-up start-up sequence by avoiding un-needed flushes (CASSANDRA-13031)

• Use Caffeine (W-TinyLFU) for on-heap caches (CASSANDRA-10855)

• Thrift removal (CASSANDRA-11115)

• Remove pre-3.0 compatibility code for 4.0 (CASSANDRA-12716)

• Add column definition kind to dropped columns in schema (CASSANDRA-12705)

• Add (automate) Nodetool Documentation (CASSANDRA-12672)

• Update bundled cqlsh python driver to 3.7.0 (CASSANDRA-12736)

• Reject invalid replication settings when creating or altering a keyspace (CASSANDRA-12681)

• Clean up the SSTableReader#getScanner API wrt removal of RateLimiter (CASSANDRA-12422)

• Use new token allocation for non bootstrap case as well (CASSANDRA-13080)

• Avoid byte-array copy when key cache is disabled (CASSANDRA-13084)

• Require forceful decommission if number of nodes is less than replication factor (CASSANDRA-12510)

• Allow IN restrictions on column families with collections (CASSANDRA-12654)

• Log message size in trace message in OutboundTcpConnection (CASSANDRA-13028)

• Add timeUnit Days for cassandra-stress (CASSANDRA-13029)

• Add mutation size and batch metrics (CASSANDRA-12649)

• Add method to get size of endpoints to TokenMetadata (CASSANDRA-12999)

• Expose time spent waiting in thread pool queue (CASSANDRA-8398)

• Conditionally update index built status to avoid unnecessary flushes (CASSANDRA-12969)

• cqlsh auto completion: refactor definition of compaction strategy options (CASSANDRA-12946)

• Add support for arithmetic operators (CASSANDRA-11935)

• Add histogram for delay to deliver hints (CASSANDRA-13234)

• Fix cqlsh automatic protocol downgrade regression (CASSANDRA-13307)

• Changing `max_hint_window_in_ms` at runtime (CASSANDRA-11720)

• Nodetool repair can hang forever if we lose the notification for the repair completing/failing
(CASSANDRA-13480)

• Anticompaction can cause noisy log messages (CASSANDRA-13684)

• Switch to client init for sstabledump (CASSANDRA-13683)

• CQLSH: Don't pause when capturing data (CASSANDRA-13743)


General upgrade advice for DSE 6.0.0


DataStax Enterprise 6.0.0 is compatible with Apache Cassandra™ 3.11.
All upgrade advice from previous versions applies. Carefully review the DataStax Enterprise upgrade planning
and upgrade instructions to ensure a smooth upgrade and avoid pitfalls and frustrations.
TinkerPop changes for DSE 6.0.0
DataStax Enterprise (DSE) 6.0.0 includes all changes from previous releases plus these production-certified
changes that are in addition to TinkerPop 3.3.2. See TinkerPop upgrade documentation for all changes.

• Made iterate() a first class step. (TINKERPOP-1834)

• Fixed a bug in NumberHelper that led to wrong min/max results if numbers exceeded the Integer limits.
(TINKERPOP-1873)

• Improved error messaging for failed serialization and deserialization of request/response messages.

• Fixed bug in handling of Direction.BOTH in Messenger implementations to pass the message to the
opposite side of the `StarGraph` in VertexPrograms for OLAP traversals. (TINKERPOP-1862)

• Fixed a bug in Gremlin Console which prevented handling of gremlin.sh flags that had an equal sign (=)
between the flag and its arguments. (TINKERPOP-1879)

• Fixed a bug where SparkMessenger was not applying the `edgeFunction` from `MessageScope` in VertexPrograms for OLAP-based traversals. (TINKERPOP-1872)

• TinkerPop drivers prior to 3.2.4 won't authenticate with Kerberos anymore. A long-deprecated option on the
Gremlin Server protocol was removed.

DataStax Bulk Loader release notes


Release notes for DataStax Bulk Loader 1.1.x and 1.0.x.
DataStax Bulk Loader 1.1.x and 1.0.x can migrate data in CSV or JSON format into DSE from another DSE or
Apache Cassandra™ cluster.

• Can unload data from any Cassandra 2.1 or later data source

• Can load data to DSE 5.0 or later
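
For example, minimal load and unload sketches (the file path, keyspace, and table names are illustrative; see the
DataStax Bulk Loader documentation for the full option set):

$ dsbulk load -url export.csv -k ks1 -t table1
$ dsbulk unload -url /tmp/unload -k ks1 -t table1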

DataStax Studio release notes


Release notes for DataStax Studio 6.0.x.
See the DataStax Studio 6.0 release notes in the DataStax Studio guide.

Chapter 3. Installing DataStax Enterprise 6.0
Installation information is located in the Installation Guide.

Chapter 4. Configuration

Recommended production settings


DataStax recommends the following settings for using DataStax Enterprise (DSE) in production environments.

Depending on your environment, some of the following settings might not persist after reboot. Check with your
system administrator to ensure these settings are viable for your environment.

Use the Preflight check tool to run a collection of tests on a DSE node to detect and fix node configurations. The
tool can detect and optionally fix many invalid or suboptimal configuration settings, such as user resource limits,
swap, and disk settings.
Configure the chunk cache
Beginning in DataStax Enterprise (DSE) 6.0, the amount of native memory used by the DSE process has
increased significantly.
The main reason for this increase is the chunk cache (or file cache), which is like an OS page cache. The
following sections provide additional information:

• See Chunk cache history for a historical description of the chunk cache, and how it is calculated in DSE 6.0
and later.

• See Chunk cache differences from OS page cache to understand key differences between the chunk cache
and the OS page cache.

Consider the following recommendations depending on workload type for your cluster.

DSE recommendations

Regarding DSE, consider the following recommendations when choosing the max direct memory and file cache
size:

• Total server memory size

• Adequate memory for the OS and other applications

• Adequate memory for the Java heap size

• Adequate memory for native raw memory (such as bloom filters and off-heap memtables)

For 64 GB servers, the default settings are typically adequate. For larger servers, increase the max direct
memory (-XX:MaxDirectMemorySize), but leave approximately 15-20% of memory for the OS and other
in-memory structures. The file cache size will be set automatically to half of that. This setting is acceptable, but the
size could be increased gradually if the cache hit rate is too low and there is still available memory on the server.
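
As an illustrative sizing sketch only (the server size, heap size, and resulting value below are assumptions, not a
recommendation for every workload): on a 128 GB server, reserving roughly 20% (about 25 GB) for the OS and
other applications and 31 GB for the JVM heap leaves about 72 GB, part of which can be given to direct memory,
for example by setting the following JVM option for the DSE process:

-XX:MaxDirectMemorySize=48G

With such a setting, the file cache defaults to half of the max direct memory, about 24 GB, and can be raised later
if the cache hit rate stays low.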

DSE Search recommendations

Disabling asynchronous I/O (AIO) and explicitly setting the chunk cache size (file_cache_size_in_mb) improves
performance for most DSE Search workloads. When enforced, SSTables and Lucene segments, as well as other
minor off-heap elements, will reside in the OS page cache and be managed by the kernel.
A potentially negative impact of disabling AIO might be measurably higher read latency when DSE goes to disk,
in cases where the dataset is larger than available memory.


To disable AIO and set the chunk cache size, see Disable AIO.
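
A minimal cassandra.yaml sketch for a search node following this recommendation (the value shown is the one
stated above; adjust only after measuring your workload):

file_cache_size_in_mb: 512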

DSE Analytics recommendations

DSE Analytics relies heavily on memory for performance. Because Apache Spark™ effectively manages its own
memory through the Apache Spark application settings, you must determine how much memory the Apache
Spark application receives. Therefore, you must think about how much memory to allocate to the chunk cache
versus how much memory to allocate for Apache Spark applications. Similar to DSE Search, you can disable
AIO and lower the chunk cache size to provide Apache Spark with more memory.

DSE Graph recommendations

Because DSE Graph heavily relies on several different workloads, it’s important to follow the previous
recommendations for the specific workload. If you use DSE Search or DSE Analytics with DSE Graph, lower the
chunk cache and disable AIO for the best performance. If you use DSE Graph only on top of Apache Cassandra,
increase the chunk cache gradually, leaving 15-20% of memory available for other processes.

Chunk cache differences from OS page cache

There are several differences between the chunk cache and the OS page cache, and a full description is outside
the scope of this information. However, the following differences are relevant to DSE:

• Because the OS page cache is sized dynamically by the operating system, it can grow and shrink depending
on the available server memory. The chunk cache must be sized statically.
If the chunk cache is too small, available server memory goes unused; for servers with large amounts of memory
(50 GB or more), that waste is significant. If the chunk cache is too large, the available memory on the server can
drop low enough that the OS kills the DSE process to avoid an out-of-memory condition.

At the time of writing, the size of the chunk cache cannot be changed dynamically; to change it, the DSE
process must be restarted.

• Restarting the DSE process will destroy the chunk cache, so each time the process is restarted, the chunk
cache will be cold. The OS page cache only becomes cold after a server restart.

• The memory used by the file cache is part of the DSE process memory, and is therefore seen by the OS as
user memory. However, the OS page cache memory is seen as buffer memory.

• The chunk cache uses mostly NIO direct memory, storing file chunks into NIO byte buffers. However, NIO
does have an on-heap footprint, which DataStax is working to reduce.

Chunk cache history

The chunk cache is not new to Apache Cassandra, and was originally intended to cache small parts (chunks) of
SSTable files to make read operations faster. However, the default file access mode was memory mapped until
DSE 5.1, so the chunk cache had a secondary role and its size was limited to 512 MB.
The default setting of 512 MB was configured by the file_cache_size_in_mb parameter in cassandra.yaml.

In DSE 6.0 and later, the chunk cache has increased relevance, not just because it replaces the OS page cache
for database read operations, but because it is a central component of the asynchronous thread-per-core (TPC)
architecture.
By default, the chunk cache is configured to use the following portion of the max direct memory:

• One-half (½) of the max direct memory for the DSE process

• One-fourth (¼) of the max direct memory for tools


The max direct memory is calculated as one-half (½) of the system memory minus the JVM heap size:

Max direct memory = (system memory - JVM heap size) / 2

You can explicitly configure the max direct memory by setting the JVM MaxDirectMemorySize
(-XX:MaxDirectMemorySize) parameter. See increasing the max direct memory. Alternatively, you can override
the derived file cache size by explicitly configuring the file_cache_size_in_mb parameter in cassandra.yaml.
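
For example, applying the formula with illustrative numbers for a 64 GB server running a 24 GB JVM heap:

Max direct memory = (64 GB - 24 GB) / 2 = 20 GB
Default file cache for the DSE process = 20 GB / 2 = 10 GB
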
Install the latest Java Virtual Machine
Configure your operating system to use the latest build of a Technology Compatibility Kit (TCK) Certified
OpenJDK version 8, for example OpenJDK 8 (1.8.0_151 minimum). Java 9 is not supported. A quick way to
verify the installed version appears after the installation links below.

Although Oracle JRE/JDK 8 is supported, DataStax does more extensive testing on OpenJDK 8. This change
is due to the end of public updates for Oracle JRE/JDK 8.

See the installation instructions for your operating system:

• Installing Open JDK 8 on Debian or Ubuntu Systems

• Installing OpenJDK 8 on RHEL-based Systems
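After installing, you can confirm which Java runtime is active. The version string below is only an example of
possible output; any TCK-certified OpenJDK 8 build at or above 1.8.0_151 is acceptable:

$ java -version
openjdk version "1.8.0_151"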

Synchronize clocks
Use Network Time Protocol (NTP) to synchronize the clocks on all nodes and application servers.
Synchronizing clocks is required because DataStax Enterprise (DSE) overwrites a column only if there is another
version whose timestamp is more recent, which can happen when machines are in different locations.
DSE timestamps are encoded as microseconds since the UNIX epoch, which does not include timezone
information. The timestamp for all writes in DSE is Coordinated Universal Time (UTC). DataStax recommends
converting to local time only when generating output to be read by humans.

1. Install NTP for your operating system:


Operating system        Command
Debian-based system     $ sudo apt-get install ntpdate
RHEL-based system (1)   $ sudo yum install ntpdate

(1) On RHEL 7 and later, chrony is the default network time protocol daemon. The configuration file for chrony
is located in /etc/chrony.conf on these systems.

2. Start the NTP service on all nodes:

$ sudo service ntp start -x

3. Run the ntpdate command to synchronize clocks:

$ sudo ntpdate 1.ro.pool.ntp.org

4. Verify that your NTP configuration is working:

$ ntpstat
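On RHEL 7 and later systems that run chrony instead of ntpd (see the note in step 1), a rough equivalent of
the last two steps is shown below. This is only a hedged example, not a complete chrony setup:

$ sudo chronyc makestep
$ chronyc tracking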

Set kernel parameters


Configure the following kernel parameters for optimal traffic and user limits.


Run the following command to view all current Linux kernel settings:

$ sudo sysctl -a

TCP settings
During low traffic intervals, a firewall configured with an idle connection timeout can close connections to local
nodes and nodes in other data centers. To prevent connections between nodes from timing out, set the following
network kernel settings:

1. Set the following TCP keepalive timeout values:

$ sudo sysctl -w \
net.ipv4.tcp_keepalive_time=60 \
net.ipv4.tcp_keepalive_probes=3 \
net.ipv4.tcp_keepalive_intvl=10

These values set the TCP keepalive timeout to 60 seconds with 3 probes, 10 seconds gap between each.
The settings detect dead TCP connections after 90 seconds (60 + 10 + 10 + 10). The additional traffic is
negligible, and permanently leaving these settings is not an issue. See Firewall idle connection timeout
causes nodes to lose communication during low traffic times on Linux.

2. Change the following settings to handle thousands of concurrent connections used by the database:

$ sudo sysctl -w \
net.core.rmem_max=16777216 \
net.core.wmem_max=16777216 \
net.core.rmem_default=16777216 \
net.core.wmem_default=16777216 \
net.core.optmem_max=40960 \
net.ipv4.tcp_rmem='4096 87380 16777216' \
net.ipv4.tcp_wmem='4096 65536 16777216'

Instead of changing the system TCP settings, you can prevent reset connections during streaming by tuning
the streaming_keep_alive_period_in_secs setting in cassandra.yaml.
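To confirm the keepalive settings above are in effect, you can read the values back; running sysctl without
the -w flag prints the current values:

$ sudo sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl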

Set user resource limits


Use the ulimit -a command to view the current limits. Although limits can also be temporarily set using this
command, DataStax recommends making the changes permanent.
For more information, see Recommended production settings.
Debian-based systems

1. Edit the /etc/pam.d/su file and uncomment the following line to enable the pam_limits.so module:

session required pam_limits.so

This change to the PAM configuration file ensures that the system reads the files in the /etc/security/
limits.d directory.

2. If you run DSE as root, some Linux distributions (such as Ubuntu), require setting the limits for the root user
explicitly instead of using cassandra_user:

root - memlock unlimited


root - nofile 1048576
root - nproc 32768
root - as unlimited

RHEL-based systems


1. Set the nproc limits to 32768 in the /etc/security/limits.d/90-nproc.conf configuration file:

cassandra_user - nproc 32768

All systems

1. Add the following line to /etc/sysctl.conf:

vm.max_map_count = 1048575

2. Open the configuration file for your installation type:


Installation type Configuration file

Tarball installation /etc/security/limits.conf

Package installation /etc/security/limits.d/cassandra.conf

3. Configure the following settings for the <cassandra_user> in the configuration file:

<cassandra_user> - memlock unlimited


<cassandra_user> - nofile 1048576
<cassandra_user> - nproc 32768
<cassandra_user> - as unlimited

4. Reboot the server or run the following command to make all changes take effect:

$ sudo sysctl -p

Persist updated settings

1. Add the following values to the /etc/sysctl.conf file:

net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216

2. Load the settings using one of the following commands:

$ sudo sysctl -p /etc/sysctl.conf

$ sudo sysctl -p /etc/sysctl.d/*.conf

3. To confirm the user limits are applied to the DSE process, run the following command where pid is the
process ID of the currently running DSE process:

$ cat /proc/pid/limits
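If you prefer not to look up the process ID manually, the following is a hedged example that assumes the DSE
JVM command line contains the string cassandra (true for typical installs):

$ cat /proc/$(pgrep -f cassandra | head -n 1)/limits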

Disable settings that impact performance


Disable the following settings, which can cause issues with performance.


Disable CPU frequency scaling


Recent Linux systems include a feature called CPU frequency scaling or CPU speed scaling. This feature allows
a server's clock speed to be dynamically adjusted so that the server can run at lower clock speeds when the
demand or load is low. This change reduces the server's power consumption and heat output, which significantly
impacts cooling costs. Unfortunately, this behavior has a detrimental effect on servers running DSE, because
throughput can be capped at a lower rate.
On most Linux systems, a CPUfreq governor manages the scaling of frequencies based on defined rules. The
default ondemand governor switches the clock frequency to maximum when demand is high, and switches to the
lowest frequency when the system is idle.

Do not use governors that lower the CPU frequency. To ensure optimal performance, reconfigure all CPUs to
use the performance governor, which locks the frequency at maximum.

The performance governor will not switch frequencies, which means that power savings will be bypassed to
always run at maximum throughput. On most systems, run the following command to set the governor:

for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    [ -f $CPUFREQ ] || continue
    echo -n performance > $CPUFREQ
done

If this directory does not exist on your system, refer to one of the following pages based on your operating
system:

• Debian-based systems: cpufreq-set command on Debian systems

• RHEL-based systems: CPUfreq setup on RHEL systems

For more information, see High server load and latency when CPU frequency scaling is enabled in the DataStax
Help Center.
Disable zone_reclaim_mode on NUMA systems
The Linux kernel can be inconsistent in enabling/disabling zone_reclaim_mode, which can result in odd
performance problems.
To ensure that zone_reclaim_mode is disabled:

$ echo 0 > /proc/sys/vm/zone_reclaim_mode
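To confirm the setting, read the value back; it should print 0:

$ cat /proc/sys/vm/zone_reclaim_mode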

For more information, see Peculiar Linux kernel performance problem on NUMA systems.
Disable swap
Failure to disable swap entirely can severely lower performance. Because the database has multiple replicas
and transparent failover, it is preferable for a replica to be killed immediately when memory is low rather than
go into swap. This allows traffic to be immediately redirected to a functioning replica instead of continuing to
hit the replica that has high latency due to swapping. If your system has a lot of DRAM, swapping still lowers
performance significantly because the OS swaps out executable code so that more DRAM is available for
caching disks.
If you insist on using swap, you can set vm.swappiness=1. This allows the kernel to swap out only the least-used
parts.

$ sudo swapoff --all

To make this change permanent, remove all swap file entries from /etc/fstab.
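If you do keep swap enabled, the following is a sketch of applying and persisting vm.swappiness=1; appending
to /etc/sysctl.conf is one option, and a drop-in file under /etc/sysctl.d/ works equally well:

$ sudo sysctl -w vm.swappiness=1
$ echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf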
For more information, see Nodes seem to freeze after some period of time.


Optimize disk settings


The default disk configurations on most Linux distributions are not optimal. Follow these steps to optimize
settings for your Solid State Drives (SSDs) or spinning disks.

Complete the optimization settings for either SSDs or spinning disks. Do not complete both procedures for
either storage type.

Optimize SSDs
Complete the following steps to ensure the best settings for SSDs.

1. Ensure that the SysFS rotational flag is set to false (zero).


This overrides any detection by the operating system to ensure the drive is considered an SSD.

2. Apply the same rotational flag setting for any block devices created from SSD storage, such as mdarrays.

3. Determine your devices by running lsblk:

$ lsblk

NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda    253:0    0  32G  0 disk
|-sda1 253:1    0   8M  0 part
|-sda2 253:2    0  32G  0 part /

In this example, the current devices are sda1 and sda2.

4. Set the IO scheduler to either deadline or noop for each of the listed devices:
For example:

$ echo deadline > /sys/block/device_name/queue/scheduler

where device_name is the name of the device you want to apply settings for.

• The deadline scheduler optimizes requests to minimize IO latency. If in doubt, use the deadline
scheduler.

$ echo deadline > /sys/block/device_name/queue/scheduler

• The noop scheduler is the right choice when the target block device is an array of SSDs behind a high-
end IO controller that performs IO optimization.

$ echo noop > /sys/block/device_name/queue/scheduler

5. Set the nr_requests value to indicate the maximum number of read and write requests that can be queued:

Machine size    Command
Large machines  $ echo 128 > /sys/block/device_name/queue/nr_requests
Small machines  $ echo 32 > /sys/block/device_name/queue/nr_requests

6. Set the readahead value for the block device to 8 KB.


This setting tells the operating system not to read extra bytes, which can increase IO time and pollute the
cache with bytes that weren’t requested by the user.


The recommended readahead setting for RAID on SSDs is the same as that for SSDs that are not being
used in a RAID installation.

a. Open /etc/rc.local for editing.

b. Add the following lines to set the readahead on startup:

touch /var/lock/subsys/local
echo 0 > /sys/class/block/sda/queue/rotational
echo 8 > /sys/class/block/sda/queue/read_ahead_kb

c. Save and close /etc/rc.local.
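As a recap, the following is a hedged example of applying the settings above to a single SSD-backed device at
runtime. The device name sda is only an example, and the scheduler and nr_requests values should be chosen
per the guidance in the steps above:

$ echo 0 | sudo tee /sys/block/sda/queue/rotational
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler
$ echo 128 | sudo tee /sys/block/sda/queue/nr_requests
$ echo 8 | sudo tee /sys/block/sda/queue/read_ahead_kb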

Optimize spinning disks


1. Check to ensure read-ahead value is not set to 65536:

$ sudo blockdev --report /dev/spinning_disk

2. Set the readahead to 128, which is the recommended value:

$ sudo blockdev --setra 128 /dev/spinning_disk
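To verify the change took effect, you can read the value back:

$ sudo blockdev --getra /dev/spinning_disk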

Set the heap size for Java garbage collection


The default JVM garbage collection (GC) is G1 for DSE 5.1 and later.
DataStax does not recommend using G1 when using Java 7. This is due to a problem with class unloading in
G1. In Java 7, PermGen fills up indefinitely until a full GC is performed.

Heap size is usually between ¼ and ½ of system memory. Do not devote all memory to heap because it is also
used for offheap cache and file system cache.
See Tuning Java Virtual Machine for more information on tuning the Java Virtual Machine (JVM).

If you want to use Concurrent-Mark-Sweep (CMS) garbage collection, contact the DataStax Services team for
configuration help. Tuning Java resources provides details on circumstances where CMS is recommended,
though using CMS requires time, expertise, and repeated testing to achieve optimal results.

The easiest way to determine the optimum heap size for your environment is:

1. Set the MAX_HEAP_SIZE in the jvm.options file to a high arbitrary value on a single node.

2. View the heap used by that node:

• Enable GC logging and check the logs to see trends.

• Use List view in OpsCenter.

3. Use the value for setting the heap size in the cluster.

This method decreases performance for the test node, but generally does not significantly reduce cluster
performance.
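For step 2 above, one lightweight way to observe heap usage on the test node, in addition to GC logs or
OpsCenter, is nodetool. This is shown only as an example check; the reported values vary with workload:

$ nodetool info | grep -i 'heap memory'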

If you don't see improved performance, contact the DataStax Services team for additional help in tuning the JVM.
Check Java Hugepages settings
Many modern Linux distributions ship with the Transparent Hugepages feature enabled by default. When Linux
uses Transparent Hugepages, the kernel tries to allocate memory in large chunks (usually 2MB), rather than 4K.
This allocation can improve performance by reducing the number of pages the CPU must track. However, some


applications still allocate memory based on 4K pages, which can cause noticeable performance problems when
Linux tries to defragment 2MB pages.
For more information, see the Cassandra Java Huge Pages blog and this RedHat bug report.
To solve this problem, disable defrag for Transparent Hugepages:

$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
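To confirm the change, read the setting back; the bracketed entry indicates the active mode and should show
never. The exact list of available modes varies by kernel version, so the output below is only an example:

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always madvise [never]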

For more information, including a temporary fix, see No DSE processing but high CPU usage.

YAML and configuration properties


cassandra.yaml configuration file
The cassandra.yaml file is the main configuration file for DataStax Enterprise. The dse.yaml file is the primary
configuration file for security, DSE Search, DSE Graph, and DSE Analytics.

After changing properties in the cassandra.yaml file, you must restart the node for the changes to take effect.

Syntax
For the properties in each section, the parent setting has zero spaces. Each child entry requires at least two
spaces. Adhere to the YAML syntax and retain the spacing.

• Literal default values are shown as literal.

• Calculated values are shown as calculated.

• Default values that are not defined are shown as Default: none.

• Internally defined default values are described.


Default values can be defined internally, commented out, or have implementation dependencies on other
properties in the cassandra.yaml file. Additionally, some commented-out values may not match the
actual default values. The commented out values are recommended alternatives to the default values.

Organization
The configuration properties are grouped into the following sections:

• Quick start
The minimal properties needed for configuring a cluster.

• Default directories
If you have changed any of the default directories during installation, set these properties to the new
locations. Make sure you have root access.

• Commonly used
Properties most frequently used when configuring DataStax Enterprise.

• Performance tuning
Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O,
CPU, reads, and writes.

• Advanced
Properties for advanced users or properties that are less commonly used.

• Security


DSE Unified Authentication provides authentication, authorization, and role management.

• Continuous paging options
Properties that configure memory, threads, and duration when pushing pages continuously to the client.

• Memory leak detection settings
Properties that configure memory leak detection.

Quick start properties


The minimal properties needed for configuring a cluster.

cluster_name: 'Test Cluster'


listen_address: localhost
# listen_interface: wlan0
# listen_interface_prefer_ipv6: false

See Initializing a DataStax Enterprise cluster.

cluster_name
The name of the cluster. This setting prevents nodes in one logical cluster from joining another. All
nodes in a cluster must have the same value.
Default: 'Test Cluster'
listen_address
The IP address or hostname that the database binds to for connecting this node to other nodes.

• Never set listen_address to 0.0.0.0.

• Set listen_address or listen_interface, do not set both.

Default: localhost
listen_interface
The interface that the database binds to for connecting to other nodes. Interfaces must correspond to a
single address. IP aliasing is not supported.
Set listen_address or listen_interface, not both.
Default: commented out (wlan0)
listen_interface_prefer_ipv6
Use IPv4 or IPv6 when interface is specified by name.

• false - use first IPv4 address.

• true - use first IPv6 address.

When only a single address is used, that address is selected without regard to this setting.
Default: commented out (false)
Default directories

data_file_directories:
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
cdc_raw_directory: /var/lib/cassandra/cdc_raw
hints_directory: /var/lib/cassandra/hints
saved_caches_directory: /var/lib/cassandra/saved_caches

If you have changed any of the default directories during installation, set these properties to the new locations.
Make sure you have root access.
data_file_directories


The directory where table data is stored on disk. The database distributes data evenly across the
location, subject to the granularity of the configured compaction strategy. If not set, the directory is
$DSE_HOME/data/data.
For production, DataStax recommends RAID 0 and SSDs.
Default: - /var/lib/cassandra/data
commitlog_directory
The directory where the commit log is stored. If not set, the directory is $DSE_HOME/data/commitlog.
For optimal write performance, place the commit log on a separate disk partition, or ideally on a
separate physical device from the data file directories. Because the commit log is append only, a hard
disk drive (HDD) is acceptable.

DataStax recommends explicitly setting the location of the DSE Metrics Collector data directory.
When the DSE Metrics Collector is enabled and when the insights_options data dir is not explicitly
set in dse.yaml, the default location of the DSE Metrics Collector data directory is the same directory
as the commitlog directory.

Default: /var/lib/cassandra/commitlog
cdc_raw_directory
The directory where the change data capture (CDC) commit log segments are stored on flush. DataStax
recommends a physical device that is separate from the data directories. If not set, the directory is
$DSE_HOME/data/cdc_raw. See Change Data Capture (CDC) logging.
Default: /var/lib/cassandra/cdc_raw
hints_directory
The directory in which hints are stored. If not set, the directory is $CASSANDRA_HOME/data/hints.
Default: /var/lib/cassandra/hints
saved_caches_directory
The directory location where table key and row caches are stored. If not set, the directory is
$DSE_HOME/data/saved_caches.
Default: /var/lib/cassandra/saved_caches
Commonly used properties
Properties most frequently used when configuring DataStax Enterprise.
Before starting a node for the first time, DataStax recommends that you carefully evaluate your requirements.

• Common initialization properties

• Common compaction settings

• Common memtable settings

• Common automatic backup settings

Common initialization properties

commit_failure_policy: stop
prepared_statements_cache_size_mb:
# disk_optimization_strategy: ssd
disk_failure_policy: stop
endpoint_snitch: com.datastax.bdp.snitch.DseSimpleSnitch
seed_provider:
- org.apache.cassandra.locator.SimpleSeedProvider
- seeds: "127.0.0.1"
enable_user_defined_functions: false
enable_scripted_user_defined_functions: false
enable_user_defined_functions_threads: true

Be sure to set the properties in the Quick start section as well.

commit_failure_policy


Policy for commit disk failures:

• die - Shut down the node and kill the JVM, so the node can be replaced.

• stop - Shut down the node, leaving the node effectively dead, available for inspection using JMX.

• stop_commit - Shut down the commit log, letting writes collect but continuing to service reads.

• ignore - Ignore fatal errors and let the batches fail.

Default: stop
prepared_statements_cache_size_mb
Maximum size of the native protocol prepared statement cache. Change this value only if there are
more prepared statements than fit in the cache.
Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.

Specifying a value that is too large results in long running GCs and possibly out-of-memory errors.
Keep the value at a small fraction of the heap.
Constantly re-preparing statements is a performance penalty. When not set, the default is automatically
calculated to heap / 256 or 10 MB, whichever is greater.
Default: calculated
disk_optimization_strategy
The strategy for optimizing disk reads.

• ssd - solid state disks

• spinning - spinning disks

When commented out, the default is ssd.


Default: commented out (ssd)
disk_failure_policy
Sets how the database responds to disk failure. Recommend settings: stop or best_effort. Valid values:

• die - Shut down gossip and client transports, and kill the JVM for any file system errors or single
SSTable errors, so the node can be replaced.

• stop_paranoid - Shut down the node, even for single SSTable errors.

• stop - Shut down the node, leaving the node effectively dead, but available for inspection using
JMX.

• best_effort - Stop using the failed disk and respond to requests based on the remaining available
SSTables. This setting allows obsolete data at consistency level of ONE.

• ignore - Ignore fatal errors and lets the requests fail; all file system errors are logged but otherwise
ignored.

See Recovering from a single disk failure using JBOD.


Default: stop
endpoint_snitch
A class that implements the IEndpointSnitch interface. The database uses the snitch to locate nodes
and route requests.
Use only snitch implementations bundled with DSE.

• DseSimpleSnitch
Appropriate only for development deployments. Proximity is determined by DSE workload, which
places transactional, analytics, and search nodes into their separate datacenters. Does not
recognize datacenter or rack information.

• GossipingPropertyFileSnitch


Recommended for production. Reads rack and datacenter for the local node in cassandra-
rackdc.properties file and propagates these values to other nodes via gossip. For migration from
the PropertyFileSnitch, uses the cassandra-topology.properties file if it is present.

• PropertyFileSnitch
Determines proximity by rack and datacenter that are explicitly configured in cassandra-
topology.properties file.

• Ec2Snitch
For EC2 deployments in a single region. Loads region and availability zone information from the
Amazon EC2 API. The region is treated as the datacenter, the availability zone is treated as the
rack, and uses only private IP addresses. For this reason, Ec2Snitch does not work across multiple
regions.

• Ec2MultiRegionSnitch
Uses the public IP as the broadcast_address to allow cross-region connectivity. This means you
must also set seed addresses to the public IP and open the storage_port or ssl_storage_port
on the public IP firewall. For intra-region traffic, the database switches to the private IP after
establishing a connection.

• RackInferringSnitch
Proximity is determined by rack and datacenter, which are assumed to correspond to the 3rd and
2nd octet of each node's IP address, respectively. Best used as an example for writing a custom
snitch class (unless this happens to match your deployment conventions).

• GoogleCloudSnitch
Use for deployments on Google Cloud Platform across one or more regions. The region is
treated as a datacenter and the availability zones are treated as racks within the datacenter. All
communication occurs over private IP addresses within the same logical network.

• CloudstackSnitch
Use the CloudstackSnitch for Apache Cloudstack environments.

See Snitches.
Default: com.datastax.bdp.snitch.DseSimpleSnitch
seed_provider
The addresses of hosts that are designated as contact points in the cluster. A joining node contacts one
of the nodes in the -seeds list to learn the topology of the ring.
Use only seed provider implementations bundled with DSE.

• class_name - The class that handles the seed logic. It can be customized, but this is typically not
required.
Default: org.apache.cassandra.locator.SimpleSeedProvider

• - seeds - A comma delimited list of addresses that are used by gossip for bootstrapping new nodes
joining a cluster. If your cluster includes multiple nodes, you must change the list from the default
value to the IP address of one of the nodes.
Default: "127.0.0.1"

Making every node a seed node is not recommended because of increased maintenance and
reduced gossip performance. Gossip optimization is not critical, but it is recommended to use a
small seed list (approximately three nodes per datacenter).

See Initializing a single datacenter per workload type and Initializing multiple datacenters per
workload type.
Default: org.apache.cassandra.locator.SimpleSeedProvider


enable_user_defined_functions
Enables user defined functions (UDFs). UDFs present a security risk, since they are executed on the
server side. UDFs are executed in a sandbox to contain the execution of malicious code.

• true - Enabled. Supports Java as the code language. Detect endless loops and unintended memory
leaks.

• false - Disabled.

Default: false (disabled)


enable_scripted_user_defined_functions
Enables the use of JavaScript language in UDFs.

• true - Enabled. Allow JavaScript in addition to Java as a code language.

• false - Disabled. Only allow Java as a code language.

If enable_user_defined_functions is false, this setting has no impact.


Default: false
enable_user_defined_functions_threads
Enables asynchronous UDF execution which requires a function to complete before being executed
again.

• true - Enabled. Only one instance of a function can run at one time. Asynchronous execution
prevents UDFs from running too long or forever and destabilizing the cluster.

• false - Disabled. Allows multiple instances of the same function to run simultaneously. Required to
use UDFs within GROUP BY clauses.
Disabling asynchronous UDF execution implicitly disables the security manager. You must
monitor the read timeouts for UDFs that run too long or forever, which can cause the cluster to
destabilize.

Default: true
Common compaction settings

compaction_throughput_mb_per_sec: 16
compaction_large_partition_warning_threshold_mb: 100

compaction_throughput_mb_per_sec
The MB per second to throttle compaction for the entire system. The faster the database inserts data,
the faster the system must compact in order to keep the SSTable count down.

• 16 to 32 x rate of write throughput in MB/second, recommended value.

• 0 - disable compaction throttling

See Configuring compaction.


Default: 16
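As a hedged example, the compaction throughput can also be adjusted on a running node without editing
cassandra.yaml; the value 32 below is illustrative, and the change does not persist across a restart:

$ nodetool setcompactionthroughput 32
$ nodetool getcompactionthroughput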
compaction_large_partition_warning_threshold_mb
The partition size threshold before logging a warning.
Default: 100
Common memtable settings

memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048

memtable_heap_space_in_mb


The amount of on-heap memory allocated for memtables. The database uses the total of this amount
and the value of memtable_offheap_space_in_mb to set a threshold for automatic memtable flush.
See memtable_cleanup_threshold and Tuning the Java heap.
Default: calculated 1/4 of heap size (2048)
memtable_offheap_space_in_mb
The amount of off-heap memory allocated for memtables. The database uses the total of this amount
and the value of memtable_heap_space_in_mb to set a threshold for automatic memtable flush.
See memtable_cleanup_threshold and Tuning the Java heap.
Default: calculated 1/4 of heap size (2048)
Common automatic backup settings

incremental_backups: false
snapshot_before_compaction: false

incremental_backups
Enables incremental backups.

• true - Enable incremental backups to create a hard link to each SSTable flushed or streamed
locally in a backups subdirectory of the keyspace data. Incremental backups enable storing
backups off site without transferring entire snapshots.
The database does not automatically clear incremental backup files. DataStax recommends
setting up a process to clear incremental backup hard links each time a new snapshot is
created.

• false - Do not enable incremental backups.

See Enabling incremental backups.


Default: false
snapshot_before_compaction
Whether to take a snapshot before each compaction. A snapshot is useful to back up data when there
is a data format change.
Be careful using this option, the database does not clean up older snapshots automatically.

See Configuring compaction.


Default: false
snapshot_before_dropping_column
When enabled, every time the user drops a column/columns from a table, a snapshot is created on
each node in the cluster before the change in schema is applied. Those snapshots have the same
name on each node. For example: auto-snapshot_drop-column-columnname_20200515143511000.
The name includes the name of the dropped column and the timestamp (UTC) when the column was
dropped.
The database does not automatically clear these snapshots. DataStax recommends setting up a process to
remove old drop-column snapshots periodically.
Default: false
Performance tuning properties
Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O, CPU,
reads, and writes.
Performing tuning properties include:

• Commit log settings

• Lightweight transactions (LWT) settings


• Change-data-capture (CDC) space settings

• Common compaction settings

• Common memtable settings

• Cache and index settings

• Streaming settings

Commit log settings

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# commitlog_sync_group_window_in_ms: 1000
# commitlog_sync_batch_window_in_ms: 2 //deprecated
commitlog_segment_size_in_mb: 32
# commitlog_total_space_in_mb: 8192
# commitlog_compression:
# - class_name: LZ4Compressor
# parameters:
# -

commitlog_sync
Commit log synchronization method:

• periodic - Send ACK signal for writes immediately. Commit log is synced every
commitlog_sync_period_in_ms.

• group - Send ACK signal for writes after the commit log has been flushed to disk. Wait up to
commitlog_sync_group_window_in_ms between flushes.

• batch - Send ACK signal for writes after the commit log has been flushed to disk. Each incoming
write triggers the flush task.

Default: periodic
commitlog_sync_period_in_ms
Use with commitlog_sync: periodic. Time interval between syncing the commit log to disk. Periodic
syncs are acknowledged immediately.
Default: 10000
commitlog_sync_group_window_in_ms
Use with commitlog_sync: group. The time that the database waits between flushing the commit log
to disk. DataStax recommends using group instead of batch.
Default: commented out (1000)
commitlog_sync_batch_window_in_ms
Deprecated. Use with commitlog_sync: batch. The maximum length of time that queries may be
batched together.
Default: commented out (2)
commitlog_segment_size_in_mb
The size of an individual commitlog file segment. A commitlog segment may be archived, deleted, or
recycled after all its data has been flushed to SSTables. This data can potentially include commitlog
segments from every table in the system. The default size is usually suitable, but for commitlog
archiving you might want a finer granularity; 8 or 16 MB is reasonable.

If you set max_mutation_size_in_kb explicitly, then you must set commitlog_segment_size_in_mb to:

2 * max_mutation_size_in_kb / 1024

The value must be positive and less than 2048.

See Commit log archive configuration.


Default: 32
max_mutation_size_in_kb
The maximum size of a mutation before the mutation is rejected. Before increasing the commitlog
segment size of the commitlog segments, investigate why the mutations are larger than expected. Look
for underlying issues with access patterns and data model, because increasing the commitlog segment
size is a limited fix. When not set, the default is calculated as (commitlog_segment_size_in_mb *
1024) / 2.
Default: calculated
commitlog_total_space_in_mb
Disk usage threshold for commit logs before triggering the database flushing memtables to disk. If the
total space used by all commit logs exceeds this threshold, the database flushes memtables to disk for
the oldest commitlog segments to reclaim disk space by removing those log segments from the commit
log. This flushing reduces the amount of data to replay on start-up, and prevents infrequently updated
tables from keeping commitlog segments indefinitely. If the commitlog_total_space_in_mb is small,
the result is more flush activity on less-active tables.
See Configuring memtable thresholds.
Default for 64-bit JVMs: calculated (8192 or 25% of the total space of the commit log
volume, whichever is smaller)
Default for 32-bit JVMs: calculated (32 or 25% of the total space of the commit log volume,
whichever is smaller)
commitlog_compression
The compressor to use if commit log is compressed. To make changes, uncomment the
commitlog_compression section and these options:

# commitlog_compression:
# - class_name: LZ4Compressor
# parameters:
# -

• class_name: LZ4Compressor, Snappy, or Deflate

• parameters: optional parameters for the compressor

When not set, the default compression for the commit log is uncompressed.
Default: commented out
Lightweight transactions (LWT) settings

# concurrent_lw_transactions: 128
# max_pending_lw_transactions: 10000

concurrent_lw_transactions
Maximum number of permitted concurrent lightweight transactions (LWT).

• A higher number might improve throughput if non-contending LWTs are in heavy use, but will use
more memory and might be less successful with contention.

• When not set, the default value is 8x the number of TPC cores. This default value is appropriate for
most environments.

Default: calculated 8x the number of TPC cores


max_pending_lw_transactions
Maximum number of lightweight transactions (LWT) in the queue before node reports
OverloadedException for LWTs.
Default: 10000
Change-data-capture (CDC) space settings

cdc_enabled: false
cdc_total_space_in_mb: 4096


cdc_free_space_check_interval_ms: 250

See also cdc_raw_directory.


cdc_enabled
Enables change data capture (CDC) functionality on a per-node basis. This modifies the logic used for
write path allocation rejection.

• true - use CDC functionality to reject mutations that contain a CDC-enabled table if at space limit
threshold in cdc_raw_directory.

• false - standard behavior, never reject.

Default: false
cdc_total_space_in_mb
Total space to use for change-data-capture (CDC) logs on disk. If space allocated for CDC exceeds
this value, the database throws WriteTimeoutException on mutations, including CDC-enabled tables.
A CDCCompactor (a consumer) is responsible for parsing the raw CDC logs and deleting them when
parsing is completed.
Default: calculated (4096 or 1/8th of the total space of the drive where the cdc_raw_directory resides)
cdc_free_space_check_interval_ms
Interval between checks for new available space for CDC-tracked tables when the
cdc_total_space_in_mb threshold is reached and the CDCCompactor is running behind or experiencing
back pressure. When not set, the default is 250.
Default: commented out (250)
Compaction settings

#concurrent_compactors: 1
# concurrent_validations: 0
concurrent_materialized_view_builders: 2
sstable_preemptive_open_interval_in_mb: 50
# pick_level_on_streaming: false

See also compaction_throughput_mb_per_sec in the common compaction settings section and Configuring
compaction.

concurrent_compactors
The number of concurrent compaction processes allowed to run simultaneously on a node, not
including validation compactions for anti-entropy repair. Simultaneous compactions help preserve
read performance in a mixed read-write workload by limiting the number of small SSTables that
accumulate during a single long-running compaction. If your data directories are backed by SSDs,
increase this value to the number of cores. If compaction is running too slowly or too fast, adjust
compaction_throughput_mb_per_sec first.
Increasing concurrent compactors leads to more use of available disk space for compaction,
because concurrent compactions happen in parallel, especially for STCS. Ensure that adequate disk
space is available before increasing this configuration.

Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.
Default: calculated The fewest number of disks or number of cores, with a minimum of 2 and a
maximum of 8 per CPU core.
concurrent_validations
Number of simultaneous repair validations to allow. When not set, the default is unbounded. Values less
than one are interpreted as unbounded.
Default: commented out (0) unbounded
concurrent_materialized_view_builders


Number of simultaneous materialized view builder tasks allowed to run concurrently. When a view
is created, the node ranges are split into (num_processors * 4) builder tasks and submitted to this
executor.
Default: 2
sstable_preemptive_open_interval_in_mb
The size of the SSTables to trigger preemptive opens. The compaction process opens SSTables before
they are completely written and uses them in place of the prior SSTables for any range previously
written. This process helps to smoothly transfer reads between the SSTables by reducing cache churn
and keeps hot rows hot.
A low value has a negative performance impact and will eventually cause heap pressure and GC
activity. The optimal value depends on hardware and workload.
Default: 50
pick_level_on_streaming
The compaction level for streamed-in SSTables.

• true - streamed-in SSTables of tables using LeveledCompactionStrategy (LCS) are placed on the
same level as the source node. For operational tasks like nodetool refresh or replacing a node, true
improves performance for compaction work.

• false - streamed-in SSTables are placed in level 0.

When not set, the default is false.


Default: commented out (false)
Memtable settings

memtable_allocation_type: heap_buffers
# memtable_cleanup_threshold: 0.34
memtable_flush_writers: 4

memtable_allocation_type
The method the database uses to allocate and manage memtable memory.

• heap_buffers - On heap NIO (non-blocking I/O) buffers.

• offheap_buffers - Off heap (direct) NIO buffers.

• offheap_objects - Native memory, eliminating NIO buffer heap overhead.

Default: heap_buffers
memtable_cleanup_threshold
Ratio used for automatic memtable flush.
Generally, the calculated default value is appropriate and does not need adjusting. DataStax
recommends contacting the DataStax Services team before changing this value.
When not set, the calculated default is 1/(memtable_flush_writers + 1)
Default: commented out (0.34)
memtable_flush_writers
The number of memtable flush writer threads per disk and the total number of memtables that can
be flushed concurrently, generally a combination of compute that is I/O bound. Memtable flushing
is more CPU efficient than memtable ingest. A single thread can keep up with the ingest rate of a
server on a single fast disk, until the server temporarily becomes I/O bound under contention, typically
with compaction. Generally, the default value is appropriate and does not need adjusting for SSDs.
However, the recommended default for HDDs: 2.
Default for SSDs: 4
Cache and index settings

column_index_size_in_kb: 16
# file_cache_size_in_mb: 4096


# direct_reads_size_in_mb: 128

column_index_size_in_kb
Granularity of the index of rows within a partition. For huge rows, decrease this setting to improve seek
time. Lower density nodes might benefit from decreasing this value to 4, 2, or 1.
Default: 16
file_cache_size_in_mb
DSE 6.0.0-6.0.6: Maximum memory for buffer pooling and SSTable chunk cache. 32 MB is reserved
for pooling buffers, the remaining memory is the cache for holding recent or frequently used index
pages and uncompressed SSTable chunks. This pool is allocated off heap and is in addition to the
memory allocated for heap. Memory is allocated only when needed.
DSE 6.0.7 and later: Buffer pool is split into two pools, this setting defines the maximum memory to
use file buffers that are stored in the file cache, also known as chunk cache. Memory is allocated only
when needed but is not released. The other buffer pool is direct_reads_size_in_mb.
See Tuning Java Virtual Machine.
Default: calculated (0.5 of -XX:MaxDirectMemorySize)
direct_reads_size_in_mb
DSE 6.0.7 and later: Buffer pool is split into two pools, this setting defines the buffer pool for
transient read operations. A buffer is typically used by a read operation and then returned to this pool
when the operation is finished so that it can be reused by other operations. The other buffer pool is
file_cache_size_in_mb. When not set, the default calculated as 2 MB per TPC core thread, plus 2 MB
shared by non-TPC threads, with a maximum value of 128 MB.
Default: calculated
Streaming settings

# stream_throughput_outbound_megabits_per_sec: 200
# inter_dc_stream_throughput_outbound_megabits_per_sec: 200
# streaming_keep_alive_period_in_secs: 300
# streaming_connections_per_host: 1

stream_throughput_outbound_megabits_per_sec
Throttle for the throughput of all outbound streaming file transfers on a node. The database does
mostly sequential I/O when streaming data during bootstrap or repair which can saturate the network
connection and degrade client (RPC) performance. When not set, the value is 200 Mbps.
Default: commented out (200)
inter_dc_stream_throughput_outbound_megabits_per_sec
Throttle for all streaming file transfers between datacenters, and for network stream traffic as configured
with stream_throughput_outbound_megabits_per_sec. When not set, the value is 200 Mbps.
Should be set to a value less than or equal to stream_throughput_outbound_megabits_per_sec
since it is a subset of total throughput.
Default: commented out (200)
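Both throttles can also be changed on a live node. The following nodetool commands are shown as a hedged
example; values are in megabits per second and are not persisted across restarts:

$ nodetool setstreamthroughput 200
$ nodetool setinterdcstreamthroughput 200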
streaming_keep_alive_period_in_secs
Interval to send keep-alive messages to prevent reset connections during streaming. The stream
session fails when a keep-alive message is not received for 2 keep-alive cycles. When not set, the
default is 300 seconds (5 minutes) so that a stalled stream times out in 10 minutes.
Default: commented out (300)
streaming_connections_per_host
Maximum number of connections per host for streaming. Increase this value when you notice that joins
are CPU-bound, rather than network-bound. For example, a few nodes with large files. When not set,
the default is 1.
Default: commented out (1)
Fsync settings

trickle_fsync: true


trickle_fsync_interval_in_kb: 10240

trickle_fsync
When set to true, causes fsync to force the operating system to flush the dirty buffers at the set
interval trickle_fsync_interval_in_kb. Enable this parameter to prevent sudden dirty buffer flushing from
impacting read latencies. Recommended for use with SSDs, but not with HDDs.
Default: false
trickle_fsync_interval_in_kb
The size of the fsync in kilobytes.
Default: 10240
max_value_size_in_mb
The maximum size of any value in SSTables. SSTables are marked as corrupted when the threshold is
exceeded.
Default: 256
Thread Per Core (TPC) parameters

#tpc_cores:
# tpc_io_cores:
io_global_queue_depth: 128

tpc_cores
The number of concurrent CoreThreads. The CoreThreads are the main workers in a DSE 6.x node,
and process various asynchronous tasks from their queue. If not set, the default is the number of cores
(processors on the machine) minus one. Note that configuring tpc_cores affects the default value for
tpc_io_cores.
To achieve optimal throughput and latency, for a given workload, set tpc_cores to half the number
of CPUs (minimum) to double the number of CPUs (maximum). In cases where there are a large
number of incoming client connections, increasing tpc_cores to more than the default usually results in
CoreThreads receiving more CPU time.

DSE Search workloads only: set tpc_cores to the number of physical CPUs. See Tuning search
for maximum indexing throughput.
Default: commented out; defaults to the number of cores minus one.
tpc_io_cores
The subset of tpc_cores that process asynchronous IO tasks. (That is, disk reads.) Must be smaller or
equal to tpc_cores. Lower this value to decrease parallel disk IO requests.
Default: commented out; by default, calculated as min(io_global_queue_depth/4, tpc_cores)
io_global_queue_depth
Global IO queue depth used for reads when AIO is enabled, which is the default for SSDs. The optimal
queue depth as found with the fio tool for a given disk setup.
Default: 128
NodeSync parameters

nodesync:
rate_in_kb: 1024

By default, the NodeSync service runs on every node.


Manage the NodeSync service using the nodetool nodesyncservice command.

See Setting the NodeSync rate.
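For example, the validation rate can be inspected and changed on a live node with the nodesyncservice
subcommands shown below; 2048 is an arbitrary illustrative value, not a recommendation:

$ nodetool nodesyncservice getrate
$ nodetool nodesyncservice setrate 2048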

rate_in_kb
The maximum kilobytes per second for data validation on the local node. The optimum validation rate
for each node may vary.
Default: 1024


Advanced properties
Properties for advanced users or properties that are less commonly used.
Advanced initialization properties

batch_size_warn_threshold_in_kb: 64
batch_size_fail_threshold_in_kb: 640
unlogged_batch_across_partitions_warn_threshold: 10
# broadcast_address: 1.2.3.4
# listen_on_broadcast_address: false
# initial_token:
# num_tokens: 128
# allocate_tokens_for_local_replication_factor: 3
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
tracetype_query_ttl: 86400
tracetype_repair_ttl: 604800

auto_bootstrap
This setting has been removed from default configuration.

• true - causes new (non-seed) nodes migrate the right data to themselves automatically

• false - When initializing a fresh cluster without data

See Initializing a DataStax Enterprise cluster.


When not set, the internal default is true.
Default: not present
batch_size_warn_threshold_in_kb
Threshold to log a warning message when any multiple-partition batch size exceeds this value in
kilobytes.
Increasing this threshold can lead to node instability.
Default: 64
batch_size_fail_threshold_in_kb
Threshold to fail and log WARN on any multiple-partition batch whose size exceeds this value. The
default value is 10X the value of batch_size_warn_threshold_in_kb.
Default: 640
unlogged_batch_across_partitions_warn_threshold
Threshold to log a WARN message on any batches not of type LOGGED that span across more
partitions than this limit.
Default: 10
broadcast_address
The public IP address this node uses to broadcast to other nodes outside the network or across regions
in multiple-region EC2 deployments. If this property is commented out, the node uses the same IP
address or hostname as listen_address. A node does not need a separate broadcast_address in a
single-node or single-datacenter installation, or in an EC2-based network that supports automatic
switching between private and public communication. It is necessary to set a separate listen_address
and broadcast_address on a node with multiple physical network interfaces or other topologies where
not all nodes have access to other nodes by their private IP addresses. For specific configurations, see
the instructions for listen_address.
Default: listen_address
listen_on_broadcast_address
Enables the node to communicate on both interfaces.

• true - If this node uses multiple physical network interfaces, set a unique IP address for
broadcast_address

• false - if this node is on a network that automatically routes between public and private networks,
like Amazon EC2 does


See listen_address.
Default: false
initial_token
The token to start the contiguous range. Set this property for single-node-per-token architecture, in
which a node owns exactly one contiguous range in the ring space. Setting this property overrides
num_tokens.
If your installation is not using vnodes, or this node's num_tokens is set to 1 or is commented out, you
should always set an initial_token value when setting up a production cluster for the first time, and
when adding capacity. See Generating tokens.
Use this parameter only with num_tokens (vnodes ) in special cases such as Restoring from a
snapshot.
Default: disabled (not set)
num_tokens
Define virtual node (vnode) token architecture.
All other nodes in the datacenter must have the same token architecture.

• 1 - disable vnodes and use 1 token for legacy compatibility.

• a number between 2 and 128 - the number of token ranges to assign to this virtual node (vnode). A
higher value increases the probability that the data and workload are evenly distributed.
DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured for
your environment.

Using vnodes can impact performance for your cluster. DataStax recommends testing the
configuration before enabling vnodes in production environments.

When the token number varies between nodes in a datacenter, the vnode logic assigns a
proportional number of ranges relative to other nodes in the datacenter. In general, if all nodes
have equal hardware capability, each node should have the same num_tokens value.

Default: 1 (disabled)
To migrate an existing cluster from single node per token range to vnodes, see Enabling virtual nodes
on an existing production cluster.
allocate_tokens_for_local_replication_factor

• RF of keyspaces in datacenter - triggers the recommended algorithmic allocation for the RF and
num_tokens for this node.
The allocation algorithm optimizes the workload balance using the target keyspace replication
factor. DataStax recommends setting the number of tokens to 8 to distribute the workload with
~10% variance between nodes. The allocation algorithm attempts to choose tokens in a way that
optimizes replicated load over the nodes in the datacenter for the specified RF. The load assigned
to each node is close to proportional to the number of vnodes.

The allocation algorithm is supported only for the Murmur3Partitioner and RandomPartitioner
partitioners. The Murmur3Partitioner is the default partitioning strategy for new clusters and the
right choice for new clusters in almost all cases.

• commented out - uses the random selection algorithm to assign token ranges randomly.
Over time, loads in a datacenter using the random selection algorithm become unevenly
distributed. DataStax recommends using only the allocation algorithm.

Default: commented out (use random selection algorithm)


See Virtual node (vnode) configuration, and for set up instructions see Adding nodes to vnode-enabled
cluster or Adding a datacenter to a cluster.
partitioner


The class that distributes rows (by partition key) across all nodes in the cluster. Any IPartitioner
may be used, including your own as long as it is in the class path. For new clusters use the default
partitioner.
DataStax Enterprise provides the following partitioners for backward compatibility:

• RandomPartitioner

• ByteOrderedPartitioner (deprecated)

• OrderPreservingPartitioner (deprecated)

Use only partitioner implementations bundled with DSE.

See Partitioners.
Default: org.apache.cassandra.dht.Murmur3Partitioner
tracetype_query_ttl
TTL for different trace types used during logging of the query process.
Default: 86400
tracetype_repair_ttl
TTL for different trace types used during logging of the repair process.
Default: 604800
Advanced automatic backup setting

auto_snapshot: true

auto_snapshot
Enables snapshots of the data before truncating a keyspace or dropping a table. To prevent data loss,
DataStax strongly advises using the default setting. If you set auto_snapshot to false, you lose data on
truncation or drop.
Default: true
Global row properties

column_index_cache_size_in_kb: 2
# row_cache_class_name: org.apache.cassandra.cache.OHCProvider
row_cache_size_in_mb: 0
row_cache_save_period: 0
# row_cache_keys_to_save: 100

When creating or modifying tables, you can enable or disable the row cache for that table by setting the caching
parameter. Other row cache tuning and configuration options are set at the global (node) level. The database
uses these settings to automatically distribute memory for each table on the node based on the overall workload
and specific table usage. You can also configure the save periods for these caches globally.

See Configuring caches.

column_index_cache_size_in_kb
(Only applies to BIG format SSTables) Threshold for the total size of all index entries for a partition that
the database stores in the partition key cache. If the total size of all index entries for a partition exceeds
this amount, the database stops putting entries for this partition into the partition key cache.
Default: 2
row_cache_class_name
The classname of the row cache provider to use. Valid values:

• org.apache.cassandra.cache.OHCProvider - fully off-heap

• org.apache.cassandra.cache.SerializingCacheProvider - partially off-heap, available in earlier


releases


Use only row cache provider implementations bundled with DSE.


When not set, the default is org.apache.cassandra.cache.OHCProvider (fully off-heap)
Default: commented out (org.apache.cassandra.cache.OHCProvider)
row_cache_size_in_mb
Maximum size of the row cache in memory. The row cache can save time, but it is space-intensive
because it contains the entire row. Use the row cache only for hot rows or static rows. If you reduce the
size, you may not get the hottest keys loaded on start up.

• 0 - disable row caching

• MB - Maximum size of the row cache in memory

Default: 0 (disabled)
row_cache_save_period
The number of seconds that rows are kept in cache. Caches are saved to saved_caches_directory. This
setting has limited use as described in row_cache_size_in_mb.
Default: 0 (disabled)
row_cache_keys_to_save
The number of keys from the row cache to save. When not set, all keys are saved.
Default: commented out (100)
Counter caches properties

counter_cache_size_in_mb:
counter_cache_save_period: 7200
# counter_cache_keys_to_save: 100

Counter cache helps to reduce counter locks' contention for hot counter cells. In case of RF = 1 a counter cache
hit causes the database to skip the read before write entirely. With RF > 1 a counter cache hit still helps to
reduce the duration of the lock hold, helping with hot counter cell updates, but does not allow skipping the read
entirely. Only the local (clock, count) tuple of a counter cell is kept in memory, not the whole counter, so it is
relatively cheap.

If you reduce the counter cache size, the hottest keys may not be loaded on start-up.

counter_cache_size_in_mb
When no value is set, the database uses the smaller of 2.5% of the heap or 50 megabytes
(MB). If your system performs counter deletes and relies on low gc_grace_seconds, you should disable
the counter cache. To disable, set to 0.
Default: calculated
counter_cache_save_period
The time, in seconds, after which the database saves the counter cache (keys only). The database
saves caches to saved_caches_directory.
Default: 7200 (2 hours)
counter_cache_keys_to_save
Number of keys from the counter cache to save. When not set, the database saves all keys.
Default: commented out (disabled, saves all keys)
Tombstone settings

tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000

When executing a scan, within or across a partition, the database must keep tombstones in memory to allow
them to return to the coordinator. The coordinator uses tombstones to ensure that other replicas know about the
deleted rows. Workloads that generate numerous tombstones may cause performance problems and exhaust
the server heap. Adjust these thresholds only if you understand the impact and want to scan more tombstones.
You can adjust these thresholds at runtime using the StorageServiceMBean.


See the DataStax Developer Blog post Cassandra anti-patterns: Queues and queue-like datasets.

tombstone_warn_threshold
The database issues a warning if a query scans more than this number of tombstones.
Default: 1000
tombstone_failure_threshold
The database aborts a query if it scans more than this number of tombstones.
Default: 100000
Network timeout settings

read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
aggregated_request_timeout_in_ms: 120000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
# cross_dc_rtt_in_ms: 0

read_request_timeout_in_ms
How long the coordinator waits for read operations to complete before timing them out.
Default: 5000 (5 seconds)
range_request_timeout_in_ms
How long the coordinator waits for sequential or index scans to complete before timing them out.
Default: 10000 (10 seconds)
aggregated_request_timeout_in_ms
How long the coordinator waits for aggregated queries, such as SELECT COUNT(*), MIN(x), and so on, to complete before timing them out. Lowest acceptable value is 10 ms.
Default: 120000 (2 minutes)
write_request_timeout_in_ms
How long the coordinator waits for write requests to complete with at least one node in the local
datacenter. Lowest acceptable value is 10 ms.
See Hinted handoff: repair during write path.
Default: 2000 (2 seconds)
counter_write_request_timeout_in_ms
How long the coordinator waits for counter writes to complete before timing it out.
Default: 5000 (5 seconds)
cas_contention_timeout_in_ms
How long the coordinator continues to retry a CAS (compare and set) operation that contends with other
proposals for the same row. If the coordinator cannot complete the operation within this timespan, it
aborts the operation.
Default: 1000 (1 second)
truncate_request_timeout_in_ms
How long the coordinator waits for a truncate (the removal of all data from a table) to complete before
timing it out. The long default value allows the database to take a snapshot before removing the data. If
auto_snapshot is disabled (not recommended), you can reduce this time.
Default: 60000 (1 minute)
request_timeout_in_ms
The default timeout value for other miscellaneous operations. Lowest acceptable value is 10 ms.
See Hinted handoff: repair during write path.
Default: 10000
cross_dc_rtt_in_ms
How much to increase the cross-datacenter timeout (write_request_timeout_in_ms +
cross_dc_rtt_in_ms) for requests that involve only nodes in a remote datacenter. This setting is
intended to reduce hint pressure.


DataStax recommends using LOCAL_* consistency levels (CL) for read and write requests in multi-
datacenter deployments to avoid timeouts that may occur when remote nodes are chosen to satisfy
the CL, such as QUORUM.
Default: commented out (0)
slow_query_log_timeout_in_ms
How long, in milliseconds, before a node logs slow queries. SELECT queries that exceed this value generate an aggregated log message to identify slow queries. To disable, set to 0.
Default: 500
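
As a worked example of the cross-datacenter allowance described above (the value for cross_dc_rtt_in_ms is illustrative), a write that involves only nodes in a remote datacenter times out after write_request_timeout_in_ms + cross_dc_rtt_in_ms = 2000 + 100 = 2100 ms, while writes involving local nodes still time out after 2000 ms:

write_request_timeout_in_ms: 2000
cross_dc_rtt_in_ms: 100    # illustrative extra allowance for remote-datacenter-only requests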
Inter-node settings

storage_port: 7000
cross_node_timeout: false
# internode_send_buff_size_in_bytes:
# internode_recv_buff_size_in_bytes:
internode_compression: dc
inter_dc_tcp_nodelay: false

storage_port
The port for inter-node communication. Follow security best practices: do not expose this port to the internet, and apply firewall rules.
See Securing DataStax Enterprise ports.
Default: 7000
cross_node_timeout
Enables operation timeout information exchange between nodes to accurately measure request
timeouts. If this property is disabled, the replica assumes any requests are forwarded to it instantly by
the coordinator. During overload conditions this means extra time is required for processing already-
timed-out requests.
Before enabling this property make sure NTP (network time protocol) is installed and the times are
synchronized among the nodes.
Default: false
internode_send_buff_size_in_bytes
The sending socket buffer size, in bytes, for inter-node calls.
See TCP settings.

The sending socket buffer size is limited by net.core.wmem_max (the receiving buffer size, internode_recv_buff_size_in_bytes, is limited by net.core.rmem_max). If this property is not set, net.ipv4.tcp_wmem determines the buffer size. For more details run man tcp and refer to:

• /proc/sys/net/core/wmem_max

• /proc/sys/net/core/rmem_max

• /proc/sys/net/ipv4/tcp_wmem

• /proc/sys/net/ipv4/tcp_rmem

Default: not set


internode_recv_buff_size_in_bytes
The receiving socket buffer size in bytes for inter-node calls.
Default: not set
internode_compression
Controls whether traffic between nodes is compressed. Valid values:

• all - Compresses all traffic

• dc - Compresses traffic between datacenters only

• none - No compression.


Default: dc
inter_dc_tcp_nodelay
Enables tcp_nodelay for inter-datacenter communication. When disabled, the network sends larger,
but fewer, network packets. This reduces overhead from the TCP protocol itself. However, disabling
inter_dc_tcp_nodelay may increase latency by blocking cross datacenter responses.
Default: false
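
A hedged sketch of inter-node tuning for a constrained multi-datacenter link; the 4 MB buffer sizes are illustrative and remain subject to the kernel limits (net.core.wmem_max and net.core.rmem_max) noted above:

internode_compression: all                     # compress all inter-node traffic, not only cross-datacenter traffic
inter_dc_tcp_nodelay: false                    # send fewer, larger packets between datacenters
internode_send_buff_size_in_bytes: 4194304     # illustrative 4 MB send buffer
internode_recv_buff_size_in_bytes: 4194304     # illustrative 4 MB receive buffer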
Native transport (CQL Binary Protocol)

start_native_transport: true
native_transport_port: 9042
# native_transport_port_ssl: 9142
# native_transport_max_frame_size_in_mb: 256
# native_transport_max_concurrent_connections: -1
# native_transport_max_concurrent_connections_per_ip: -1
native_transport_address: localhost
# native_transport_interface: eth0
# native_transport_interface_prefer_ipv6: false
# native_transport_broadcast_address: 1.2.3.4
native_transport_keepalive: true

See also native_transport_port_ssl in SSL Ports.

start_native_transport
Enables or disables the native transport server.
Default: true
native_transport_port
The port where the CQL native transport listens for clients. For security reasons, do not expose this port
to the internet. Firewall it if needed.
Default: 9042
native_transport_max_frame_size_in_mb
The maximum allowed size of a frame. Frames (requests) larger than this are rejected as invalid.
Default: 256
native_transport_max_concurrent_connections
The maximum number of concurrent client connections.
Default: -1 (unlimited)
native_transport_max_concurrent_connections_per_ip
The maximum number of concurrent client connections per source IP address.
Default: -1 (unlimited)
native_transport_address
When left blank, uses the configured hostname of the node. Unlike the listen_address, this value
can be set to 0.0.0.0, but you must set the native_transport_broadcast_address to a value other than
0.0.0.0.
Set native_transport_address OR native_transport_interface, not both.
Default: localhost
native_transport_interface
The network interface to bind the native transport server to, when set instead of native_transport_address. IP aliasing is not supported.
Set native_transport_address OR native_transport_interface, not both.
Default: eth0
native_transport_interface_prefer_ipv6
Use IPv4 or IPv6 when interface is specified by name.

• false - use first IPv4 address.

• true - use first IPv6 address.

When only a single address is used, that address is selected without regard to this setting.
Default: commented out (false)
native_transport_broadcast_address


Native transport address to broadcast to drivers and other DSE nodes. This cannot be set to 0.0.0.0.

• blank - will be set to the value of native_transport_address

• IP_address - when native_transport_address is set to 0.0.0.0

Default: commented out (1.2.3.4)


native_transport_keepalive
Enables keepalive on native connections.
Default: true
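
For example, a node that listens for clients on all interfaces while advertising one routable address to drivers might be configured as in the following sketch; the address 10.10.1.12 is a placeholder, not a value from this guide:

start_native_transport: true
native_transport_port: 9042
native_transport_address: 0.0.0.0               # listen on all interfaces
native_transport_broadcast_address: 10.10.1.12  # placeholder routable address; must not be 0.0.0.0
native_transport_keepalive: true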
Advanced fault detection settings
Settings to handle poorly performing or failing components.

# gc_log_threshold_in_ms: 200
# gc_warn_threshold_in_ms: 1000
# otc_coalescing_strategy: DISABLED
# otc_coalescing_window_us: 200
# otc_coalescing_enough_coalesced_messages: 8

gc_log_threshold_in_ms
GC pauses longer than this threshold, in milliseconds, are logged at the INFO level. Adjust to minimize logging.
Default: commented out (200)
gc_warn_threshold_in_ms
Threshold for GC pause. Any GC pause longer than this interval is logged at the WARN level. By
default, the database logs any GC pause greater than 200 ms at the INFO level.

See Configuring logging.

Default: commented out (1000)


otc_coalescing_strategy
Strategy to combine multiple network messages into a single packet for outbound TCP connections
to nodes in the same data center. See the DataStax Developer Blog post Performance doubling with
message coalescing.
Use only strategy implementations bundled with DSE.
Supported strategies are:

• FIXED

• MOVINGAVERAGE

• TIMEHORIZON

• DISABLED

Default: commented out (DISABLED)


otc_coalescing_window_us
How many microseconds to wait for coalescing messages to nodes in the same datacenter.

• For FIXED strategy - the amount of time after the first message is received before it is sent with any
accompanying messages.

• For MOVINGAVERAGE strategy - the maximum wait time and the interval that messages must arrive on average to enable coalescing.

Default: commented out (200)


otc_coalescing_enough_coalesced_messages
The threshold for the number of messages to nodes in the same data center. Do not coalesce
messages when this value is exceeded. Should be more than 2 and less than 128.
Default: commented out (8)
seed_gossip_probability


The percentage of time that gossip messages are sent to a seed node during each round of gossip. Gossiping to seed nodes decreases the time needed to propagate gossip changes across the cluster.
Default: 1.0 (100%)
Backpressure settings

back_pressure_enabled: false
back_pressure_strategy:
    - class_name: org.apache.cassandra.net.RateBasedBackPressure
      parameters:
        - high_ratio: 0.90
          factor: 5
          flow: FAST

back_pressure_enabled
Enables the coordinator to apply the specified back pressure strategy to each mutation that is sent to
replicas.
Default: false
back_pressure_strategy
To add new strategies, implement org.apache.cassandra.net.BackpressureStrategy and provide a
public constructor that accepts a Map<String, Object>.
Use only strategy implementations bundled with DSE.
class_name
The default class_name uses the ratio between incoming mutation responses and outgoing mutation
requests.
Default: org.apache.cassandra.net.RateBasedBackPressure
high_ratio
When outgoing mutations are below this value, they are rate limited according to the incoming rate
decreased by the factor (described below). When above this value, the rate limiting is increased by the
factor.
Default: 0.90
factor
A number between 1 and 10. When backpressure is below high ratio, outgoing mutations are rate
limited according to the incoming rate decreased by the given factor; if above high ratio, the rate limiting
is increased by the given factor.
Default: 5
flow
The flow speed to apply rate limiting:

• FAST - rate limited to the speed of the fastest replica

• SLOW - rate limit to the speed of the slowest replica

Default: FAST
dynamic_snitch_badness_threshold
The performance threshold for dynamically routing client requests away from a poorly performing
node. Specifically, it controls how much worse a poorly performing node has to be before the dynamic
snitch prefers other replicas. A value of 0.2 means the database continues to prefer the static snitch
values until the node response time is 20% worse than the best performing node. Until the threshold is
reached, incoming requests are statically routed to the closest replica as determined by the snitch.
Default: 0.1
dynamic_snitch_reset_interval_in_ms
Time interval after which the database resets all node scores. This allows a bad node to recover.
Default: 600000
dynamic_snitch_update_interval_in_ms
The time interval, in milliseconds, between the calculation of node scores. Because score calculation is
CPU intensive, be careful when reducing this interval.
Default: 100


Hinted handoff options

hinted_handoff_enabled: true
# hinted_handoff_disabled_datacenters:
# - DC1
# - DC2
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
hints_directory: /var/lib/cassandra/hints
hints_flush_period_in_ms: 10000
max_hints_file_size_in_mb: 128
#hints_compression:
# - class_name: LZ4Compressor
# parameters:
# -
batchlog_replay_throttle_in_kb: 1024
# batchlog_endpoint_strategy: random_remote

See Hinted handoff: repair during write path.

hinted_handoff_enabled
Enables or disables hinted handoff. A hint indicates that the write needs to be replayed to an
unavailable node. The database writes the hint to a hints file on the coordinator node.

• false - do not enable hinted handoff

• true - globally enable hinted handoff, except for datacenters specified for
hinted_handoff_disabled_datacenters

Default: true
hinted_handoff_disabled_datacenters
A blacklist of datacenters that will not perform hinted handoffs. To disable hinted handoff on a certain
datacenter, add its name to this list.
Default: commented out
max_hint_window_in_ms
Maximum amount of time during which the database generates hints for an unresponsive node.
After this interval, the database does not generate any new hints for the node until it is back up and
responsive. If the node goes down again, the database starts a new interval. This setting can prevent a
sudden demand for resources when a node is brought back online and the rest of the cluster attempts
to replay a large volume of hinted writes.
See About failure detection and recovery.
Default: 10800000 (3 hours)
hinted_handoff_throttle_in_kb
Maximum amount of traffic per delivery thread in kilobytes per second. This rate reduces proportionally to the number of nodes in the cluster. For example, if there are two nodes in the cluster, each delivery thread uses the maximum rate. If there are three, each node throttles to half of the maximum, since two nodes are expected to deliver hints simultaneously.
When applying this limit, the calculated hint transmission rate is based on the uncompressed hint
size, even if internode_compression or hints_compression is enabled.
Default: 1024
hints_flush_period_in_ms
The time, in milliseconds, to wait before flushing hints from internal buffers to disk.
Default: 10000
max_hints_delivery_threads
Number of threads the database uses to deliver hints. In multiple datacenter deployments, consider
increasing this number because cross datacenter handoff is generally slower.
Default: 2
max_hints_file_size_in_mb


The maximum size for a single hints file, in megabytes.


Default: 128
hints_compression
The compressor for hint files. Supported compressors: LZ4, Snappy, and Deflate. When not set, the database does not compress hints files.
Default: commented out (not compressed)
batchlog_replay_throttle_in_kb
Total maximum throttle, in KB per second, for batchlog replay. Throttling is reduced proportionally to the number of nodes in the cluster.
Default: 1024
batchlog_endpoint_strategy
Strategy to choose the batchlog storage endpoints.

• random_remote - Default, purely random. Prevents the local rack, if possible. Same behavior as earlier releases.

• dynamic_remote - Uses DynamicEndpointSnitch to select batchlog storage endpoints. Prevents the local rack, if possible. This strategy offers the same availability guarantees as random_remote, but selects the fastest endpoints according to the DynamicEndpointSnitch. DynamicEndpointSnitch tracks reads but not writes. Write-only, or mostly-write, workloads might not benefit from this strategy. Note: this strategy will fall back to random_remote if dynamic_snitch is not enabled.

• dynamic - Mostly the same as dynamic_remote, except that the local rack is not excluded, which offers a lower availability guarantee than random_remote or dynamic_remote. Note: this strategy will fall back to random_remote if dynamic_snitch is not enabled.

Default: random_remote
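
Putting several of these options together, a hedged example that keeps hinted handoff enabled globally, disables it for one datacenter, widens the hint window, and compresses hint files; the datacenter name DC2 and the 6-hour window are illustrative:

hinted_handoff_enabled: true
hinted_handoff_disabled_datacenters:
    - DC2                           # illustrative datacenter name
max_hint_window_in_ms: 21600000     # illustrative 6-hour window
hints_compression:
    - class_name: LZ4Compressor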
Security properties
DSE Advanced Security fortifies DataStax Enterprise (DSE) databases against potential harm due to deliberate
attack or user error. Configuration properties include authentication and authorization, permissions, roles,
encryption of data in-flight and at-rest, and data auditing. DSE Unified Authentication provides authentication,
authorization, and role management. Enabling DSE Unified Authentication requires additional configuration in
dse.yaml, see Configuring DSE Unified Authentication.

authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
# internode_authenticator: org.apache.cassandra.auth.AllowAllInternodeAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager
system_keyspaces_filtering: false
roles_validity_in_ms: 120000
# roles_update_interval_in_ms: 120000
permissions_validity_in_ms: 120000
# permissions_update_interval_in_ms: 120000

authenticator
The authentication backend. The only supported authenticator is DseAuthenticator for external
authentication with multiple authentication schemes such as Kerberos, LDAP, and internal
authentication. Authenticators other than DseAuthenticator are deprecated and not supported. Some
security features might not work correctly if other authenticators are used. See authentication_options in
dse.yaml.
Use only authentication implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseAuthenticator
internode_authenticator
Internode authentication backend to enable secure connections from peer nodes.
Use only authentication implementations bundled with DSE.
Default: org.apache.cassandra.auth.AllowAllInternodeAuthenticator
authorizer


The authorization backend. Authorizers other than DseAuthorizer are not supported. DseAuthorizer
supports enhanced permission management of DSE-specific resources. Authorizers other than
DseAuthorizer are deprecated and not supported. Some security features might not work correctly if
other authorizers are used. See Authorization options in dse.yaml.
Use only authorization implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseAuthorizer
system_keyspaces_filtering
Enables system keyspace filtering so that users can access and view only schema information
for rows in the system and system_schema keyspaces to which they have access. When
system_keyspaces_filtering is set to true:

• Data in the system.local and system.peers tables are visible

• Data in the following tables of the system keyspace are filtered based on the role's DESCRIBE
privileges for keyspaces; only rows for appropriate keyspaces will be displayed in:

# size_estimates

# sstable_activity

# built_indexes

# built_views

# available_ranges

# view_builds_in_progress

• Data in all tables in the system_schema keyspace are filtered based on a role's DESCRIBE privileges
for keyspaces stored in the system_schema tables.

• Read operations against other tables in the system keyspace are denied

Security requirements and user permissions apply. Enable this feature only after appropriate user permissions are granted. You must grant the DESCRIBE permission to the role on any keyspaces whose information is stored in the system keyspaces. If you do not grant the permission, you will see an error that states the keyspace is not found.

GRANT DESCRIBE ON KEYSPACE keyspace_name TO ROLE role_name;

See Controlling access to keyspaces and tables and Configuring the security keyspaces replication
factors.
Default: false
role_manager
The DSE Role Manager supports LDAP roles and internal roles supported by the
CassandraRoleManager. Role options are stored in the dse_security keyspace. When using the DSE
Role Manager, increase the replication factor of the dse_security keyspace. Role managers other than
DseRoleManager are deprecated and not supported. Some security features might not work correctly if
other role managers are used.
Use only role manager implementations bundled with DSE.
Default: com.datastax.bdp.cassandra.auth.DseRoleManager
roles_validity_in_ms
Validity period for roles cache in milliseconds. Determines how long to cache the list of roles assigned
to the user; users may have several roles, either through direct assignment or inheritance (a role that
has been granted to another role). Adjust this setting based on the complexity of your role hierarchy,
tolerance for role changes, the number of nodes in your environment, and activity level of the cluster.
Fetching permissions can be an expensive operation, so this setting allows flexibility. Granted roles
are cached for authenticated sessions in AuthenticatedUser. After the specified time elapses, role

Page
DSE 6.0 Administrator Guide Earlier DSE version Latest 6.0 patch: 6.0.13
133
Configuration

validity is rechecked. This cache is automatically disabled when the DseAuthenticator is used without internal authentication enabled.

• 0 - disable role caching

• milliseconds - how long to cache the list of roles assigned to the user

Default: 120000 (2 minutes)


roles_update_interval_in_ms
Refresh interval for roles cache. After this interval, cache entries become eligible for refresh. On next
access, the database schedules an async reload, and returns the old value until the reload completes. If
roles_validity_in_ms is non-zero, then this value must also be non-zero. When not set, the default is
the same value as roles_validity_in_ms.
Default: commented out (120000)
permissions_validity_in_ms
How long permissions in cache remain valid to manage performance impact of permissions queries.
Fetching permissions can be resource intensive. Set the cache validity period to your security
tolerances. The cache is used for the standard authentication and the row-level access control (RLAC)
cache. The cache is quite effective at small durations.

• 0 - disable permissions cache

• milliseconds - time, in milliseconds

REVOKE does not automatically invalidate cached permissions. Permissions are invalidated the next
time they are refreshed.
Default: 120000 (2 minutes)
permissions_update_interval_in_ms
Sets refresh interval for the standard authentication cache and the row-level access control
(RLAC) cache. After this interval, cache entries become eligible for refresh. On next access,
the database schedules an async reload and returns the old value until the reload completes. If permissions_validity_in_ms is non-zero, this value must also be non-zero. When not set, the default is the same value as permissions_validity_in_ms.
Default: commented out (120000)
permissions_cache_max_entries
The maximum number of entries that are held by the standard authentication cache and row-level
access control (RLAC) cache. With the default value of 1000, the RLAC permissions cache can have
up to 1000 entries in it, and the standard authentication cache can have up to 1000 entries. This single
option applies to both caches. To size the permissions cache for use with Setting up Row Level Access
Control (RLAC), use this formula:

numRlacUsers * numRlacTables + 100

If this option is not present in cassandra.yaml, manually enter it to use a value other than 1000. See
Enabling DSE Unified Authentication.
Default: not set (1000)
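
Applying the sizing formula above to a hypothetical deployment with 50 RLAC-restricted users and 20 RLAC-protected tables gives 50 * 20 + 100 = 1100 entries, which could be set explicitly as follows:

permissions_cache_max_entries: 1100    # hypothetical sizing: 50 RLAC users * 20 RLAC tables + 100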
Inter-node encryption options
Node-to-node (internode) encryption protects data that is transferred between nodes in a cluster using SSL.

server_encryption_options:
internode_encryption: none
keystore: resources/dse/conf/.keystore
keystore_password: cassandra
truststore: resources/dse/conf/.truststore
truststore_password: cassandra
# More advanced defaults below:
# protocol: TLS
# algorithm: SunX509
# store_type: JKS


# cipher_suites:
[TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_
# require_client_auth: false
# require_endpoint_verification: false

server_encryption_options
Inter-node encryption options. If enabled, you must also generate keys and provide the appropriate key
and truststore locations and passwords. No custom encryption options are supported.
The passwords used in these options must match the passwords used when generating the keystore
and truststore. For instructions on generating these files, see Creating a Keystore to Use with JSSE.

See Configuring SSL for node-to-node connections.


internode_encryption
Encryption options for inter-node communication using the TLS_RSA_WITH_AES_128_CBC_SHA cipher suite for authentication, key exchange, and encryption of data transfers. Use the DHE/ECDHE ciphers, such as TLS_DHE_RSA_WITH_AES_128_CBC_SHA, if running in FIPS 140 (Federal Information Processing Standard) compliant mode.

• all - Encrypt all inter-node communications

• none - No encryption

• dc - Encrypt the traffic between the datacenters

• rack - Encrypt the traffic between the racks

Default: none
keystore
Relative path from DSE installation directory or absolute path to the Java keystore (JKS) suitable for
use with Java Secure Socket Extension (JSSE), which is the Java version of the Secure Sockets Layer
(SSL), and Transport Layer Security (TLS) protocols. The keystore contains the private key used to
encrypt outgoing messages.
Default: resources/dse/conf/.keystore
keystore_password
Password for the keystore. This must match the password used when generating the keystore and
truststore.
Default: cassandra
truststore
Relative path from DSE installation directory or absolute path to truststore containing the trusted
certificate for authenticating remote servers.
Default: resources/dse/conf/.truststore
truststore_password
Password for the truststore.
Default: cassandra
protocol
Default: commented out (TLS)
algorithm
Default: commented out (SunX509)
store_type
Valid types are JKS, JCEKS, and PKCS12.
PKCS11 is not supported.
Default: commented out (JKS)
truststore_type
Valid types are JKS, JCEKS, and PKCS12.


PKCS11 is not supported. Also, due to an OpenSSL issue, you cannot use a PKCS12 truststore that
was generated via OpenSSL. For example, a truststore generated via the following command will not
work with DSE:

$ openssl pkcs12 -export -nokeys -out truststore.pfx -in intermediate.chain.pem

However, truststores generated via Java's keytool and then converted to PKCS12 work with DSE.
Example:

$ keytool -importcert -alias rootca -file rootca.pem -keystore truststore.jks

$ keytool -importcert -alias intermediate -file intermediate.pem -keystore truststore.jks

$ keytool -importkeystore -srckeystore truststore.jks -destkeystore truststore.pfx -deststoretype pkcs12

Default: commented out (JKS)


cipher_suites
Supported ciphers:

• TLS_RSA_WITH_AES_128_CBC_SHA

• TLS_RSA_WITH_AES_256_CBC_SHA

• TLS_DHE_RSA_WITH_AES_128_CBC_SHA

• TLS_DHE_RSA_WITH_AES_256_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

Default: commented out


require_client_auth
Whether to enable certificate authentication for node-to-node (internode) encryption. When not set, the
default is false.
Default: commented out (false)
require_endpoint_verification
Whether to verify the connected host and the host name in the certificate match. When not set, the
default is false.
Default: commented out (false)
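
A hedged example of datacenter-level inter-node encryption with certificate authentication and hostname verification enabled; the passwords are placeholders and must match the keystore and truststore you actually generate:

server_encryption_options:
    internode_encryption: dc                   # encrypt traffic between datacenters
    keystore: resources/dse/conf/.keystore
    keystore_password: keystore_password       # placeholder; use your keystore password
    truststore: resources/dse/conf/.truststore
    truststore_password: truststore_password   # placeholder; use your truststore password
    require_client_auth: true
    require_endpoint_verification: true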
Client-to-node encryption options
Client-to-node encryption protects in-flight data from client machines to a database cluster using SSL (Secure
Sockets Layer) and establishes a secure channel between the client and the coordinator node.

client_encryption_options:
enabled: false
# If enabled and optional is set to true, encrypted and unencrypted connections over native transport are handled.
optional: false
keystore: resources/dse/conf/.keystore
keystore_password: cassandra
# require_client_auth: false
# Set trustore and truststore_password if require_client_auth is true
# truststore: resources/dse/conf/.truststore
# truststore_password: cassandra
# More advanced defaults below:


# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites:
[TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_

See Configuring SSL for client-to-node connections.

client_encryption_options
Whether to enable client-to-node encryption. You must also generate keys and provide the appropriate
key and truststore locations and passwords. There are no custom encryption options enabled for
DataStax Enterprise.
Advanced settings:
enabled
Whether to enable client-to-node encryption.
Default: false
optional
When optional is set to true, both encrypted and unencrypted connections over native transport are allowed. This is a necessary transition state that facilitates enabling client-to-node encryption on live clusters without causing an outage for existing unencrypted clients. Typically, once existing clients are migrated to encrypted connections, set optional to false to enforce native transport encryption.
Default: false
keystore
Relative path from DSE installation directory or absolute path to the Java keystore (JKS) suitable for
use with Java Secure Socket Extension (JSSE), which is the Java version of the Secure Sockets Layer
(SSL), and Transport Layer Security (TLS) protocols. The keystore contains the private key used to
encrypt outgoing messages.
Default: resources/dse/conf/.keystore
keystore_password
Password for the keystore.
Default: cassandra
require_client_auth
Whether to enable certificate authentication for client-to-node encryption. When not set, the default is
false.
When set to true, client certificates must be present on all nodes in the cluster.
Default: commented out (false)
truststore
Relative path from DSE installation directory or absolute path to truststore containing the trusted
certificate for authenticating remote servers.
Default: resources/dse/conf/.truststore
truststore_password
Password for the truststore. This must match the password used when generating the keystore and
truststore.
The truststore password and path are only required when require_client_auth is set to true.
Default: cassandra
protocol
Default: commented out (TLS)
algorithm
Default: commented out (SunX509)
store_type
Valid types are JKS, JCEKS and PKCS12. For file-based keystores, use PKCS12.
PKCS11 is not supported.
Default: commented out (JKS)
truststore_type


Valid types are JKS, JCEKS, and PKCS12.


PKCS11 is not supported. Also, due to an OpenSSL issue, you cannot use a PKCS12 truststore that
was generated via OpenSSL. For example, a truststore generated via the following command will not
work with DSE:

$ openssl pkcs12 -export -nokeys -out truststore.pfx -in intermediate.chain.pem

However, truststores generated via Java's keytool and then converted to PKCS12 work with DSE.
Example:

$ keytool -importcert -alias rootca -file rootca.pem -keystore truststore.jks

$ keytool -importcert -alias intermediate -file intermediate.pem -keystore truststore.jks

$ keytool -importkeystore -srckeystore truststore.jks -destkeystore truststore.pfx -deststoretype pkcs12

Default: commented out (JKS)


cipher_suites
Supported ciphers:

• TLS_RSA_WITH_AES_128_CBC_SHA

• TLS_RSA_WITH_AES_256_CBC_SHA

• TLS_DHE_RSA_WITH_AES_128_CBC_SHA

• TLS_DHE_RSA_WITH_AES_256_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

Default: commented out
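
As a sketch of the transition state described under optional, the following accepts both encrypted and unencrypted native transport connections while clients are migrated; the keystore password is a placeholder:

client_encryption_options:
    enabled: true
    optional: true                          # transition state: unencrypted clients are still accepted
    keystore: resources/dse/conf/.keystore
    keystore_password: keystore_password    # placeholder; use your keystore password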


Transparent data encryption options

transparent_data_encryption_options:
enabled: false
chunk_length_kb: 64
cipher: AES/CBC/PKCS5Padding
key_alias: testing:1
# CBC IV length for AES must be 16 bytes, the default size
# iv_length: 16
key_provider:
- class_name: org.apache.cassandra.security.JKSKeyProvider
parameters:
- keystore: conf/.keystore
keystore_password: cassandra
store_type: JCEKS
key_password: cassandra

transparent_data_encryption_options
DataStax Enterprise supports this option only for backward compatibility. When using DSE, configure
data encryption options in the dse.yaml; see Transparent data encryption.
TDE properties:


• enabled: (Default: false)

• chunk_length_kb: (Default: 64)

• cipher: options:

# AES

# CBC

# PKCS5Padding

• key_alias: testing:1

• iv_length: 16
iv_length is commented out in the default cassandra.yaml file. Uncomment only if cipher is set
to AES. The value must be 16 (bytes).

• key_provider:

# class_name: org.apache.cassandra.security.JKSKeyProvider
parameters:

# keystore: conf/.keystore

# keystore_password: cassandra

# store_type: JCEKS

# key_password: cassandra

SSL Ports

ssl_storage_port: 7001
native_transport_port_ssl: 9142

See Securing DataStax Enterprise ports.

ssl_storage_port
The SSL port for encrypted communication. Unused unless enabled in encryption_options. Follow security best practices: do not expose this port to the internet, and apply firewall rules.
Default: 7001
native_transport_port_ssl
Dedicated SSL port where the CQL native transport listens for clients with encrypted communication.
For security reasons, do not expose this port to the internet. Firewall it if needed.

• commented out (disabled) - the native_transport_port will encrypt all traffic

• port number different than native_transport_port - use encryption for native_transport_port_ssl, keep native_transport_port unencrypted to use both unencrypted and encrypted traffic

Default: 9142
Continuous paging options

continuous_paging:
max_concurrent_sessions: 60
max_session_pages: 4
max_page_size_mb: 8
max_local_query_time_ms: 5000
client_timeout_sec: 600
cancel_timeout_sec: 5


paused_check_interval_ms: 1

continuous_paging
Options to tune continuous paging that pushes pages, when requested, continuously to the client:

• Maximum memory used: max_concurrent_sessions * max_session_pages * max_page_size_mb

Default: calculated (60 * 4 * 8 = 1920 MB)

Guidance

• Because memtables and SSTables are used by the continuous paging query, you can define the
maximum period of time during which memtables cannot be flushed and compacted SSTables
cannot be deleted.

• If fewer threads exist than sessions, a session cannot execute until another one is swapped out.

• Distributed queries (CL > ONE or non-local data) are swapped out after every page, while local
queries at CL = ONE are swapped out after max_local_query_time_ms.

max_concurrent_sessions
The maximum number of concurrent sessions. Additional sessions are rejected with an unavailable
error.
Default: 60
max_session_pages
The maximum number of pages that can be buffered for each session. If the client is not reading from
the socket, the producer thread is blocked after it has prepared max_session_pages.
Default: 4
max_page_size_mb
The maximum size of a page, in MB. If an individual CQL row is larger than this value, the page can be
larger than this value.
Default: 8
max_local_query_time_ms
The maximum time for a local continuous query to run. When this threshold is exceeded, the
session is swapped out and rescheduled. Swapping and rescheduling ensures the release of
resources that prevent the memtables from flushing and ensures fairness when max_threads <
max_concurrent_sessions. Adjust when high write workloads exist on tables that have continuous
paging requests.
Default: 5000
client_timeout_sec
How long the server will wait, in seconds, for clients to request more pages if the client is not reading
and the server queue is full.
Default: 600
cancel_timeout_sec
How long to wait, in seconds, before checking if a paused session can be resumed. Continuous paging sessions are paused because of backpressure or when the client has not requested more pages with backpressure updates.
Default: 5
paused_check_interval_ms
How long to wait, in milliseconds, before checking if a continuous paging session can be resumed, when that session is paused because of backpressure.
Default: 1
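
As an illustration of the memory formula above, raising max_concurrent_sessions to 120 while keeping the default page settings gives a ceiling of 120 * 4 * 8 = 3840 MB; the session count below is illustrative, not a recommendation:

continuous_paging:
    max_concurrent_sessions: 120    # illustrative; 120 * 4 * 8 = 3840 MB maximum buffered page memory
    max_session_pages: 4
    max_page_size_mb: 8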


Fault detection setting

# phi_convict_threshold: 8

phi_convict_threshold
The sensitivity of the failure detector on an exponential scale. Generally, this setting does not need
adjusting.
See About failure detection and recovery.
When not set, the internal value is 8.
Default: commented out (8)
Memory leak detection settings

#leaks_detection_params:
# sampling_probability: 0
# max_stacks_cache_size_mb: 32
# num_access_records: 0
# max_stack_depth: 30

sampling_probability
The sampling probability to track for the specified resource. For resources tracked, see nodetool
leaksdetection.

• 0 - disable tracking. Default.

• 1 - enable tracking all the time

• A number between 0 and 1 - the percentage of time to randomly track a resource. For example,
0.5 will track resources 50% of the time.

Tracking incurs a significant stack trace collection cost for every access and consumes heap space.
Enable tracking only when directed by DataStax Support.
Default: commented out (0)
max_stacks_cache_size_mb
Set the size of the cache for call stack traces. Stack traces are used to debug leaked resources, and
use heap memory. Set the amount of heap memory dedicated to each resource by setting the max
stacks cache size in MB.
Default: commented out (32)
num_access_records
Set the average number of stack traces kept when a resource is accessed. Currently only supported for
chunks in the cache.
Default: commented out (0)
max_stack_depth
Set the depth of the stack traces collected. Changes only the depth of the stack traces that will be
collected from the time the parameter is set. Deeper stacks are more unique, so increasing the depth may require increasing max_stacks_cache_size_mb.
Default: commented out (30)
dse.yaml configuration file
The dse.yaml file is the primary configuration file for security, DSE Search, DSE Graph, and DSE Analytics.
After changing properties in the dse.yaml file, you must restart the node for the changes to take effect.

Package installations: /etc/dse/dse.yaml

Tarball installations: installation_location/resources/dse/conf/dse.yaml

The cassandra.yaml file is the primary configuration file for the DataStax Enterprise database.


Syntax
For the properties in each section, the parent setting has zero spaces. Each child entry requires at least
two spaces. Adhere to the YAML syntax and retain the spacing. For example, no spaces before the parent
node_health_options entry, and at least two spaces before the child settings:

node_health_options:
    refresh_rate_ms: 50000
    uptime_ramp_up_period_seconds: 10800
    dropped_mutation_window_minutes: 30

Organization
The DataStax Enterprise configuration properties are grouped into the following sections:

• Security and authentication options

• DSE In-Memory

• Node health

• Health-based routing

• Lease metrics

• DSE Search options

• DSE Analytics options

• Performance Service options

• DSE Metrics Collector options

• Audit logging

• audit_logging_options

• DSE Tiered Storage

• DSE Advanced Replication

• Inter-node messaging

• DSE Multi-Instance

• DSE Graph options

Security and authentication options

• Authentication options

• Role management options

• Authorization options

• Kerberos options

• LDAP options

• Encrypt sensitive system resources

• Encrypted configuration properties settings

• KMIP encryption options

• DSE Search index encryption settings


Authentication options
Authentication options for the DSE Authenticator that allows you to use multiple schemes for authentication in a
DataStax Enterprise cluster. Additional authenticator configuration is required in cassandra.yaml.
Internal and LDAP schemes can also be used for role management; see role_management_options.

See Enabling DSE Unified Authentication.

# authentication_options:
# enabled: false
# default_scheme: internal
# other_schemes:
# - ldap
# - kerberos
# scheme_permissions: false
# transitional_mode: disabled
# allow_digest_with_kerberos: true
# plain_text_without_ssl: warn

authentication_options
Options for the DseAuthenticator to authenticate users when the authenticator option in
cassandra.yaml is set to com.datastax.bdp.cassandra.auth.DseAuthenticator. Authenticators other than
DseAuthenticator are not supported.
enabled
Enables user authentication.

• true - The DseAuthenticator authenticates users.

• false - The DseAuthenticator does not authenticate users and allows all connections.

When not set, the default is false.


Default: commented out (false)
default_scheme
Sets the first scheme to validate a user against when the driver does not request a specific scheme.

• internal - Plain text authentication using the internal password authentication.

• ldap - Plain text authentication using pass-through LDAP authentication.

• kerberos - GSSAPI authentication using the Kerberos authenticator.

Default: commented out (internal)


other_schemes
List of schemes that are also checked if validation against the first scheme fails and no scheme was
specified by the driver. Same scheme names as default_scheme.
scheme_permissions
Whether roles need to have permission granted to them in order to use specific authentication
schemes. These permissions can be granted only when the DseAuthorizer is used. Set to one of the
following values:

• true - Use multiple schemes for authentication. Every role requires permissions to a scheme in
order to be assigned.

• false - Do not use multiple schemes for authentication. Prevents unintentional role assignment that
might occur if user or group names overlap in the authentication service.

See Binding a role to an authentication scheme.


When not set, the default is false.
Default: commented out (false)
allow_digest_with_kerberos


Controls whether DIGEST-MD5 authentication is also allowed with Kerberos. The DIGEST-MD5
mechanism is not directly associated with an authentication scheme, but is used by Kerberos to pass
credentials between nodes and jobs.

• true - DIGEST-MD5 authentication is also allowed with Kerberos. In analytics clusters, set to true to
use Hadoop inter-node authentication with Hadoop and Spark jobs.

• false - DIGEST-MD5 authentication is not used with Kerberos.

Analytics nodes require true to use internode authentication with Hadoop and Spark jobs. When not set,
the default is true.
Default: commented out (true)
plain_text_without_ssl
Controls how the DseAuthenticator responds to plain text authentication requests over unencrypted
client connections. Set to one of the following values:

• block - Block the request with an authentication error.

• warn - Log a warning about the request but allow it to continue.

• allow - Allow the request without any warning.

Default: commented out (warn)


transitional_mode
Whether to enable transitional mode for temporary use during authentication setup in an already
established environment.
Transitional mode allows access to the database using the anonymous role, which has all permissions
except AUTHORIZE.

• disabled - Transitional mode is disabled. All connections must provide valid credentials and map to
a login-enabled role.

• permissive - Only super users are authenticated and logged in. All other authentication attempts
are logged in as the anonymous user.

• normal - Allow all connections that provide credentials. Maps all authenticated users to their role
AND maps all other connections to anonymous.

• strict - Allow only authenticated connections that map to a login-enabled role OR connections that
provide a blank username and password as anonymous.

Credentials are required for all connections after authentication is enabled; use a blank username
and password to login with anonymous role in transitional mode.

Default: commented out (disabled)
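
A hedged example in dse.yaml that enables DSE Unified Authentication with internal authentication as the default scheme and LDAP as a fallback; LDAP is only meaningful if ldap_options is also configured later in this file:

authentication_options:
    enabled: true
    default_scheme: internal
    other_schemes:
        - ldap
    scheme_permissions: false
    transitional_mode: disabled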


Role management options

#role_management_options:
# mode: internal
# stats: false

See Enabling DSE Unified Authentication.

role_management_options
Options for the DSE Role Manager. To enable role manager, set:

• authorization_options enabled to true

• role_manager in cassandra.yaml to com.datastax.bdp.cassandra.auth.DseRoleManager

See Setting up logins and users.


When scheme_permissions is enabled, all roles must have permission to execute on the authentication
scheme, see Binding a role to an authentication scheme.
mode
Set to one of the following values:

• internal - Scheme that manages roles per individual user in the internal database. Allows nesting
roles for permission management.

• ldap - Scheme that assigns roles by looking up the user name in LDAP and mapping the group
attribute (ldap_options) to an internal role name. To configure an LDAP scheme, complete the
steps in Defining an LDAP scheme.

Internal role management allows nesting roles for permission management; when using LDAP mode, role nesting is disabled. Using GRANT role_name TO role_name results in an error.
Default: commented out (internal)
stats
Set to true to enable logging of DSE role creation and modification events in the dse_security.role_stats system table. All nodes must have the stats option enabled, and must be restarted, for the functionality to take effect.
To query role events:

SELECT * FROM dse_security.role_stats;

 role  | created                         | password_changed
-------+---------------------------------+---------------------------------
 user1 | 2020-04-13 00:44:09.221000+0000 | null
 user2 | 2020-04-12 23:49:21.457000+0000 | 2020-04-12 23:49:21.457000+0000

(2 rows)

Default: commented out (false)


Authorization options

#authorization_options:
# enabled: false
# transitional_mode: disabled
# allow_row_level_security: false

See Enabling DSE Unified Authentication.

authorization_options
Options for the DSE Authorizer.
enabled
Whether to use the DSE Authorizer for role-based access control (RBAC).

• true - use the DSE Authorizer for role-based access control (RBAC)

• false - do not use the DSE Authorizer

When not set, the default is false.


Default: commented out (false)
transitional_mode
Allows the DSE Authorizer to operate in a temporary transitional mode during setup of authorization in a
cluster. Set to one of the following values:

• disabled - Transitional mode is disabled.

• normal - Permissions can be passed to resources, but are not enforced.


• strict - Permissions can be passed to resources, and are enforced on authenticated users.
Permissions are not enforced against anonymous users.

Default: commented out (disabled)


allow_row_level_security
Whether to enable row-level access control (RLAC) permissions; use the same setting on all nodes.

• true - use row-level security

• false - do not use row-level security

When not set, the default is false.


Default: commented out (false)
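
For example, to enforce role-based access control and allow row-level access control (the same setting must be used on all nodes), a minimal sketch is:

authorization_options:
    enabled: true
    transitional_mode: disabled
    allow_row_level_security: true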
Kerberos options

kerberos_options:
keytab: resources/dse/conf/dse.keytab
service_principal: dse/_HOST@REALM
http_principal: HTTP/_HOST@REALM
qop: auth

See Defining a Kerberos scheme.

kerberos_options
Options to configure security for a DataStax Enterprise cluster using Kerberos.
keytab
The file path of dse.keytab.
service_principal
The service_principal that the DataStax Enterprise process runs under must use the form dse_user/
_HOST@REALM, where:

• dse_user is the name of the user that starts the DataStax Enterprise process.

• _HOST is converted to a reverse DNS lookup of the broadcast address.

• REALM is the name of your Kerberos realm. In the Kerberos principal, REALM must be uppercase.

http_principal
The http_principal is used by the Tomcat application container to run DSE Search. The Tomcat
web server uses the GSSAPI mechanism (SPNEGO) to negotiate the GSSAPI security mechanism
(Kerberos). Set REALM to the name of your Kerberos realm. In the Kerberos principal, REALM must be
uppercase.
qop
A comma-delimited list of Quality of Protection (QOP) values that clients and servers can use for each
connection. The client can have multiple QOP values, while the server can have only a single QOP
value. The valid values are:

• auth - Authentication only.

• auth-int - Authentication plus integrity protection for all transmitted data.

• auth-conf - Authentication plus integrity protection and encryption of all transmitted data.
Encryption using auth-conf is separate and independent of whether encryption is done using
SSL. If both auth-conf and SSL are enabled, the transmitted data is encrypted twice. DataStax
recommends choosing only one method and using it for both encryption and authentication.
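
A filled-in sketch of kerberos_options; EXAMPLE.COM stands in for your Kerberos realm (it must be uppercase), and dse is assumed to be the user that starts the DataStax Enterprise process:

kerberos_options:
    keytab: resources/dse/conf/dse.keytab
    service_principal: dse/_HOST@EXAMPLE.COM   # placeholder realm; replace with your realm
    http_principal: HTTP/_HOST@EXAMPLE.COM
    qop: auth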

LDAP options
Define LDAP options to authenticate users against an external LDAP service and/or for Role Management using
LDAP group look up.


See Enabling DSE Unified Authentication.

# ldap_options:
# server_host:
# server_port: 389
# hostname_verification: false
# search_dn:
# search_password:
# use_ssl: false
# use_tls: false
# truststore_path:
# truststore_password:
# truststore_type: jks
# user_search_base:
# user_search_filter: (uid={0})
# user_memberof_attribute: memberof
# group_search_type: directory_search
# group_search_base:
# group_search_filter: (uniquemember={0})
# group_name_attribute: cn
# credentials_validity_in_ms: 0
# search_validity_in_seconds: 0
# connection_pool:
# max_active: 8
# max_idle: 8

Microsoft Active Directory (AD) example, for both authentication and role management:

ldap_options:
server_host: win2012ad_server.mycompany.lan
server_port: 389
search_dn: cn=lookup_user,cn=users,dc=win2012domain,dc=mycompany,dc=lan
search_password: lookup_user_password
use_ssl: false
use_tls: false
truststore_path:
truststore_password:
truststore_type: jks
#group_search_type: directory_search
group_search_type: memberof_search
#group_search_base:
#group_search_filter:
group_name_attribute: cn
user_search_base: cn=users,dc=win2012domain,dc=mycompany,dc=lan
user_search_filter: (sAMAccountName={0})
user_memberof_attribute: memberOf
connection_pool:
max_active: 8
max_idle: 8

See Defining an LDAP scheme.

ldap_options
Options to configure LDAP security. When not set, LDAP authentication is not used.
Default: commented out
server_host
A comma separated list of LDAP server hosts.
Do not use LDAP on the same host (localhost) in production environments. Using LDAP on the same
host (localhost) is appropriate only in single node test or development environments.
Default: none


server_port
The port on which the LDAP server listens.

• 389 - the default port for unencrypted connections

• 636 - typically used for encrypted connections; the default SSL port for LDAP is 636

Default: commented out (389)


hostname_verification
Enable hostname verification. The following conditions must be met:

• Either use_ssl or use_tls must be set to true.

• A valid truststore with the correct path specified in truststore_path must exist. The truststore
must have a certificate entry, trustedCertEntry, including a SAN DNSName entry that matches the
hostname of the LDAP server.

Default: false
search_dn
Distinguished name (DN) of an account with read access to the user_search_base and
group_search_base. For example:

• OpenLDAP: uid=lookup,ou=users,dc=springsource,dc=com

• Microsoft Active Directory (AD): cn=lookup, cn=users, dc=springsource, dc=com

Do not create/use an LDAP account or group called cassandra. The DSE database comes with a default login role, cassandra, that has access to all database objects and uses the consistency level QUORUM.
When not set, an anonymous bind is used for the search on the LDAP server.
Default: commented out
search_password
The password of the search_dn account.
Default: commented out
use_ssl
Whether to use an SSL-encrypted connection.

• true - use an SSL-encrypted connection, set server_port to the LDAP port for the server (typically
port 636)

• false - do not enable SSL connections to the LDAP server

Default: commented out (false)


use_tls
Whether to enable TLS connections to the LDAP server.

• true - enable TLS connections to the LDAP server, set server_port to the TLS port of the LDAP
server.

• false - do not enable TLS connections to the LDAP server

Default: commented out (false)


truststore_path
The path to the truststore for SSL certificates.
Default: commented out
truststore_password
The password to access the trust store.
Default: commented out
truststore_type
The type of truststore.
Default: commented out (jks)
user_search_base


Distinguished name (DN) of the object to start the recursive search for user entries for authentication
and role management memberof searches. For example to search all users in example.com,
ou=users,dc=example,dc=com.

• For your LDAP domain, set the ou and dc elements. Typically set to
ou=users,dc=domain,dc=top_level_domain. For example, ou=users,dc=example,dc=com.

• Active Directory uses a different search base, typically CN=search,CN=Users,DC=ActDir_domname,DC=internal. For example, CN=search,CN=Users,DC=example-sales,DC=internal.

Default: commented out


user_search_filter
Attribute that identifies the user that the search filter uses for looking up user names.

• uid={0} - when using LDAP

• sAMAccountName={0} - when using AD (Microsoft Active Directory). For example, (sAMAccountName={0})

Default: commented out (uid={0})


user_memberof_attribute
Attribute that contains a list of group names; role manager assigns DSE roles that exactly match any
group name in the list. Required when managing roles using group_search_type: memberof_search
with LDAP (role_manager.mode:ldap). The directory server must have memberof support, which is a
default user attribute in Microsoft Active Directory (AD).
Default: commented out (memberof)
group_search_type
Required when managing roles with LDAP (role_manager.mode: ldap). Define how group membership
is determined for a user. Choose from one of the following values:

• directory_search - Filters the results by doing a subtree search of group_search_base to find groups that contain the user name in the attribute defined in the group_search_filter. (Default)

• memberof_search - Recursively search for user entries using the user_search_base and
user_search_filter. Get groups from the user attribute defined in user_memberof_attribute.
The directory server must have memberof support.

Default: commented out (directory_search)


group_search_base
The unique distinguished name (DN) of the group record from which to start the group membership search.
Default: commented out
group_search_filter
Set to any valid LDAP filter.
Default: commented out (uniquemember={0})
group_name_attribute
The attribute in the group record that contains the LDAP group name. Role names are case-sensitive
and must match exactly on DSE for assignment. Unmatched groups are ignored.
Default: commented out (cn)
credentials_validity_in_ms
The duration period of the credentials cache.

• 0 - disable credentials cache

• duration period in milliseconds - enables a credentials cache and improves performance by reducing
the number of requests that are sent to the internal or LDAP server. See Defining an LDAP scheme.

When not set, the default is 0 (disabled).


Default: commented out (0)
search_validity_in_seconds


The duration period for the search cache.

• 0 - disable search credentials cache

• duration period in seconds - enables a search cache and improves performance by reducing the
number of requests that are sent to the internal or LDAP server

Default: commented out (0, disabled)


connection_pool
The configuration settings for the connection pool for making LDAP requests.
max_active
The maximum number of active connections to the LDAP server.
Default: commented out (8)
max_idle
The maximum number of idle connections in the pool awaiting requests.
Default: commented out (8)
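As an illustrative sketch (not defaults), the options described above combine in an ldap_options section of dse.yaml roughly as follows for a hypothetical Active Directory domain. The host name, ports, search account, password, and DN values are placeholders:

ldap_options:
  server_host: ad.example.com
  server_port: 636
  search_dn: CN=lookup,CN=Users,DC=example,DC=com
  search_password: lookup_password
  use_ssl: true
  truststore_path: /etc/dse/conf/ldap_truststore.jks
  truststore_password: truststore_password
  truststore_type: jks
  user_search_base: CN=Users,DC=example,DC=com
  user_search_filter: (sAMAccountName={0})
  user_memberof_attribute: memberof
  group_search_type: memberof_search
  group_name_attribute: cn
  credentials_validity_in_ms: 30000
  search_validity_in_seconds: 30
  connection_pool:
    max_active: 8
    max_idle: 8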
Encrypt sensitive system resources
Options to encrypt sensitive system resources using a local encryption key or a remote KMIP key.

system_info_encryption:
enabled: false
cipher_algorithm: AES
secret_key_strength: 128
chunk_length_kb: 64
key_provider: KmipKeyProviderFactory
kmip_host: kmip_host_name

DataStax recommends using a remote encryption key from a KMIP provider when using Transparent Data
Encryption (TDE) features. Use a local encryption key only if a KMIP server is not available.

system_info_encryption
Options to set encryption settings for system resources that might contain sensitive information,
including the system.batchlog and system.paxos tables, hint files, and the database commit log.
enabled
Whether to enable encryption of system resources. See Encrypting system resources.
The system_trace keyspace is NOT encrypted by enabling the system_info_encryption
section. In environments that also have tracing enabled, manually configure encryption with
compression on the system_trace keyspace. See Transparent data encryption.
Default: false
cipher_algorithm
The name of the JCE cipher algorithm used to encrypt system resources.
Table 11: Supported cipher algorithm names

cipher_algorithm    secret_key_strength
AES                 128, 192, or 256
DES                 56
DESede              112 or 168
Blowfish            32-448
RC2                 40-128

Default: AES
secret_key_strength
Length of key to use for the system resources. See Supported cipher algorithms names.


DSE uses a matching local key or requests the key type from the KMIP server. For KMIP, if an
existing key does not match, the KMIP server automatically generates a new key.
Default: 128
chunk_length_kb
Optional. Size of SSTable chunks when data from the system.batchlog or system.paxos are written to
disk.
To encrypt existing data, run nodetool upgradesstables -a system batchlog paxos on all
nodes in the cluster.
Default: 64
key_provider
KMIP key provider to enable encrypting sensitive system data with a KMIP key. Comment out if using a
local encryption key.
Default: commented out (KmipKeyProviderFactory)
kmip_host
The KMIP key server host. Set to the kmip_group_name that defines the KMIP host in kmip_hosts
section. DSE requests a key from the KMIP host and uses the key generated by the KMIP provider.
Default: commented out
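A minimal sketch of the same section when no KMIP server is available and a local encryption key is used instead; the key_provider and kmip_host lines stay commented out so that the node uses a local encryption key (see Setting up local encryption keys). The values shown are illustrative:

system_info_encryption:
  enabled: true
  cipher_algorithm: AES
  secret_key_strength: 128
  chunk_length_kb: 64
  # key_provider: KmipKeyProviderFactory
  # kmip_host: kmip_host_name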
Encrypted configuration properties settings
Settings for using encrypted passwords in sensitive configuration file properties.

system_key_directory: /etc/dse/conf
config_encryption_active: false
config_encryption_key_name: (key_filename | KMIP_key_URL )

system_key_directory
Path to the directory where local encryption/decryption key files are stored, also called system keys.
Distribute the system keys to all nodes in the cluster. Ensure that the DSE account is the folder owner
and has read/write/execute (700) permissions.
See Setting up local encryption keys.
This directory is not used for KMIP keys.

Default: /etc/dse/conf
config_encryption_active
Whether to enable encryption on sensitive data stored in tables and in configuration files.

• true - enable encryption of configuration property values using the specified


config_encryption_key_name. When set to true, the configuration values must be encrypted or
commented out. See Encrypting configuration file properties.
Lifecycle Manager (LCM) is not compatible when config_encryption_active is true in DSE
and OpsCenter. For LCM limitations, see Encrypted DSE configuration values.

• false - Do not enable encryption of configuration property values.

Default: false
config_encryption_key_name
Set to the local encryption key filename or KMIP key URL to use for configuration file property value
decryption.
Use dsetool encryptconfigvalue to generate encrypted values for the configuration file
properties.
Default: system_key. The default name is not configurable.


KMIP encryption options


Options for KMIP encryption keys and communication between the DataStax Enterprise node and the KMIP key
server or key servers. Enables DataStax Enterprise encryption features to use encryption keys that are stored on a
server that is not running DataStax Enterprise.

kmip_hosts:
your_kmip_groupname:
hosts: kmip1.yourdomain.com, kmip2.yourdomain.com
keystore_path: pathto/kmip/keystore.jks
keystore_type: jks
keystore_password: password
truststore_path: pathto/kmip/truststore.jks
truststore_type: jks
truststore_password: password
key_cache_millis: 300000
timeout: 1000
protocol: protocol
cipher_suites: supported_cipher

kmip_hosts
Connection settings for key servers that support the KMIP protocol.
kmip_groupname
A user-defined name for a group of options to configure a KMIP server or servers, key settings, and
certificates. Configure options for a kmip_groupname section for each KMIP key server or group of
KMIP key servers. Using separate key server configuration settings allows use of different key servers
to encrypt table data, and eliminates the need to enter key server configuration information in DDL
statements and other configurations. Multiple KMIP hosts are supported.
Default: commented out
hosts
A comma-separated list of KMIP hosts (host[:port]) using the FQDN (Fully Qualified Domain Name). DSE
queries the host in the listed order, so add KMIP hosts in the intended failover sequence.
For example, if the host list contains kmip1.yourdomain.com, kmip2.yourdomain.com, DSE tries
kmip1.yourdomain.com and then kmip2.yourdomain.com.
keystore_path
The path to a Java keystore created from the KMIP agent PEM files.
Default: commented out (/etc/dse/conf/KMIP_keystore.jks)
keystore_type
The type of keystore.
Default: commented out (jks)
keystore_password
The password to access the keystore.
Default: commented out (password)
truststore_path
The path to a Java truststore that was created using the KMIP root certificate.
Default: commented out (/etc/dse/conf/KMIP_truststore.jks)
truststore_type
The type of truststore.
Default: commented out (jks)
truststore_password
The password to access the truststore.
Default: commented out (password)
key_cache_millis
Milliseconds to locally cache the encryption keys that are read from the KMIP hosts. The longer the
encryption keys are cached, the fewer requests are made to the KMIP key server, but the longer it takes
for changes, like revocation, to propagate to the DataStax Enterprise node. DataStax Enterprise uses
concurrent encryption, so multiple threads fetch the secret key from the KMIP key server at the same
time. DataStax recommends using the default value.
Default: commented out (300000)


timeout
Socket timeout in milliseconds.
Default: commented out (1000)
protocol
The protocol for SSL/TLS connections to the KMIP key server. When not specified, the JVM default is used. Example: TLSv1.2
cipher_suites
When not specified, JVM default is used. Examples:

• TLS_RSA_WITH_AES_128_CBC_SHA

• TLS_RSA_WITH_AES_256_CBC_SHA

• TLS_DHE_RSA_WITH_AES_128_CBC_SHA

• TLS_DHE_RSA_WITH_AES_256_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

See cipher_algorithm.
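As an illustrative sketch, a cluster that uses two sets of key servers might define two groups; other settings, such as kmip_host under system_info_encryption or table-level encryption options, then reference a group by its name (for example, primary_kmip). Host names, paths, and passwords are placeholders:

kmip_hosts:
  primary_kmip:
    hosts: kmip1.example.com, kmip2.example.com
    keystore_path: /etc/dse/conf/KMIP_keystore.jks
    keystore_type: jks
    keystore_password: keystore_password
    truststore_path: /etc/dse/conf/KMIP_truststore.jks
    truststore_type: jks
    truststore_password: truststore_password
  dr_site_kmip:
    hosts: kmip3.dr.example.com
    keystore_path: /etc/dse/conf/KMIP_keystore.jks
    keystore_type: jks
    keystore_password: keystore_password
    truststore_path: /etc/dse/conf/KMIP_truststore.jks
    truststore_type: jks
    truststore_password: truststore_password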
DSE Search index encryption settings

# solr_encryption_options:
# decryption_cache_offheap_allocation: true
# decryption_cache_size_in_mb: 256

solr_encryption_options
Settings to tune encryption of search indexes.
decryption_cache_offheap_allocation
Whether to allocate shared DSE Search decryption cache off JVM heap.

• true - allocate shared DSE Search decryption cache off JVM heap

• false - do not allocate shared DSE Search decryption cache off JVM heap

When not set, the default is true.


Default: commented out (true)
decryption_cache_size_in_mb
The maximum size of shared DSE Search decryption cache in megabytes (MB).
Default: commented out (256)
DSE In-Memory options
To use the DSE In-Memory, choose one of these options to specify how much system memory to use for all in-
memory tables: fraction or size.

# max_memory_to_lock_fraction: 0.20
# max_memory_to_lock_mb: 10240

max_memory_to_lock_fraction
A fraction of the system memory. The default value of 0.20 specifies to use up to 20% of system
memory. This max_memory_to_lock_fraction value is ignored if max_memory_to_lock_mb is set to a
non-zero value. To specify a fraction, use this option instead of max_memory_to_lock_mb.
Default: commented out (0.20)
max_memory_to_lock_mb
A maximum amount of memory in megabytes (MB).

• not set - use the fraction specified with max_memory_to_lock_fraction

• number greater than 0 - maximum amount of memory in megabytes (MB)


Default: commented out (10240)
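For example (an illustrative value, not a recommendation), to cap DSE In-Memory tables at a fixed 4 GB rather than a fraction of system memory, set only the absolute option; the fraction is then ignored:

# max_memory_to_lock_fraction: 0.20
max_memory_to_lock_mb: 4096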


Node health options

node_health_options:
refresh_rate_ms: 50000
uptime_ramp_up_period_seconds: 10800
dropped_mutation_window_minutes: 30

node_health_options
Node health options are always enabled.
refresh_rate_ms
How frequently, in milliseconds, the node health score is recalculated.
Default: 60000
uptime_ramp_up_period_seconds
The amount of continuous uptime required for the node's uptime score to advance the node health
score from 0 to 1 (full health), assuming there are no recent dropped mutations. The health score is a
composite score based on dropped mutations and uptime.
If a node is repairing after a period of downtime, you might want to increase the uptime period to the
expected repair time.
Default: commented out (10800, 3 hours)
dropped_mutation_window_minutes
The historic time window over which the rate of dropped mutations affect the node health score.
Default: 30
Health-based routing

enable_health_based_routing: true

enable_health_based_routing
Whether to consider node health for replication selection for distributed DSE Search queries. Health-
based routing enables a trade-off between index consistency and query throughput.

• true - consider node health when multiple candidates exist for a particular token range.

• false - ignore node health for replication selection. When the primary concern is performance, do
not enable health-based routing.

Default: true
Lease metrics

lease_metrics_options:
enabled: false
ttl_seconds: 604800

lease_metrics_options
Lease holder statistics help monitor the lease subsystem for automatic management of Job Tracker and
Spark Master nodes.
enabled
Enables (true) or disables (false) log entries related to lease holders. Most of the time you do not want
to enable logging.
Default: false
ttl_seconds
Defines the time, in seconds, to persist the log of lease holder changes. Logging of lease holder
changes is always on, and has a very low overhead.
Default: 604800


DSE Search options

• Scheduler settings for DSE Search indexes

• async_bootstrap_reindex

• CQL Solr paging

• Solr CQL query option

• DSE Search resource upload limit

• Shard transport options

• DSE Search indexing settings

Scheduler settings for DSE Search indexes


To ensure that records with TTLs are purged from search indexes when they expire, the search indexes are
periodically checked for expired documents.

ttl_index_rebuild_options:
fixed_rate_period: 300
initial_delay: 20
max_docs_per_batch: 4096
thread_pool_size: 1

ttl_index_rebuild_options
Section of options to control the schedulers in charge of querying for and removing expired records, and
the execution of the checks.
fixed_rate_period
Time interval to check for expired data in seconds.
Default: 300
initial_delay
The number of seconds to delay the first TTL check to speed up start-up time.
Default: 20
max_docs_per_batch
The maximum number of documents that the TTL rebuild thread checks and deletes per batch. All
documents determined to be expired are deleted from the index during each check; to avoid memory
pressure, their unique keys are retrieved and the deletes are issued in batches.
Default: 4096
thread_pool_size
The maximum number of cores that can execute TTL cleanup concurrently. Set the thread_pool_size
to manage system resource consumption and prevent many search cores from executing simultaneous
TTL deletes.
Default: 1
Reindexing of bootstrapped data

async_bootstrap_reindex: false

async_bootstrap_reindex
For DSE Search, configure whether to asynchronously reindex bootstrapped data. Default: false

• If enabled, the node joins the ring immediately after bootstrap and reindexing occurs
asynchronously. The node does not wait for post-bootstrap reindexing to complete, so it is not
marked down during reindexing. Use the dsetool ring command to check the status of the reindexing.

• If disabled, the node joins the ring after reindexing the bootstrapped data.


CQL Solr paging


Options to specify the paging behavior.

cql_solr_query_paging: off

cql_solr_query_paging

• driver - Respects driver paging settings. Specifies to use Solr pagination (cursors) only when the
driver uses pagination. Enabled automatically for DSE SearchAnalytics workloads.

• off - Paging is off. Ignore driver paging settings for CQL queries and use normal Solr paging unless:

  ◦ The current workload is an analytics workload, including SearchAnalytics. SearchAnalytics nodes always use driver paging settings.

  ◦ The cqlsh query parameter paging is set to driver.

Even when cql_solr_query_paging: off, paging is dynamically enabled with the
"paging":"driver" parameter in JSON queries.

When not set, the default is off.


Default: commented out (off)
Solr CQL query option
Available option for CQL Solr queries.

cql_solr_query_row_timeout: 10000

cql_solr_query_row_timeout
The maximum time in milliseconds to wait for each row to be read from the database during CQL Solr
queries.
Default: commented out (10000, 10 seconds)
DSE Search resource upload limit

solr_resource_upload_limit_mb: 10

solr_resource_upload_limit_mb
Option to disable or configure the maximum file size of the search index config or schema. Resource
files can be uploaded, but the search index config and schema are stored internally in the database
after upload.

• 0 - disable resource uploading

• upload size - The maximum upload size limit in megabytes (MB) for a DSE Search resource file
(search index config or schema).

Default: 10
Shard transport options

shard_transport_options:
netty_client_request_timeout: 60000

shard_transport_options
Fault tolerance option for inter-node communication between DSE Search nodes.
netty_client_request_timeout
The internal timeout for all distributed search queries, to prevent long-running queries. The client
request timeout is the maximum cumulative time, in milliseconds, that a distributed search request
waits idly for shard responses.


Default: 60000 (1 minute)


DSE Search indexing settings

# back_pressure_threshold_per_core: 1024
# flush_max_time_per_core: 5
# load_max_time_per_core: 5
# enable_index_disk_failure_policy: false
# solr_data_dir: /MyDir
# solr_field_cache_enabled: false
# ram_buffer_heap_space_in_mb: 1024
# ram_buffer_offheap_space_in_mb: 1024

See Tuning search for maximum indexing throughput.

back_pressure_threshold_per_core
The maximum number of queued partitions during search index rebuilding and reindexing. This
maximum number safeguards against excessive heap use by the indexing queue. If set lower than the
number of threads per core (TPC), not all TPC threads can be actively indexing.
Default: commented out (1024)
flush_max_time_per_core
The maximum time, in minutes, to wait for the flushing of asynchronous index updates that occurs at
DSE Search commit time or at flush time. Expert level knowledge is required to change this value.
Always set the value reasonably high to ensure flushing completes successfully to fully sync DSE
Search indexes with the database data. If the configured value is exceeded, index updates are only
partially committed and the commit log is not truncated which can undermine data durability.
When a timeout occurs, it usually means this node is being overloaded and cannot flush in a timely
manner. Live indexing increases the time to flush asynchronous index updates.
Default: commented out (5)
load_max_time_per_core
The maximum time, in minutes, to wait for each DSE Search index to load on startup or create/reload
operations. This advanced option should be changed only if exceptions happen during search index
loading. When not set, the default is 5 minutes.
Default: commented out (5)
enable_index_disk_failure_policy
Whether to apply the configured disk failure policy if IOExceptions occur during index update
operations.

• true - apply the configured Cassandra disk failure policy to index write failures

• false - do not apply the disk failure policy

When not set, the default is false.


Default: commented out (false)
solr_data_dir
The directory to store index data. For example:
solr_data_dir: /var/lib/cassandra/solr.data
See Managing the location of DSE Search data. By default, each DSE Search index is saved in
solr_data_dir/keyspace_name.table_name, or as specified by the dse.solr.data.dir system
property.
Default: commented out
solr_field_cache_enabled
The Apache Lucene® field cache is deprecated. Instead, for fields that are sorted, faceted, or grouped
by, set docValues="true" on the field in the search index schema. Then reload the search index and
reindex. When not set, the default is false.
Default: commented out (false)
ram_buffer_heap_space_in_mb
Global Lucene RAM buffer usage threshold for heap to force segment flush. Setting too low might
induce a state of constant flushing during periods of ongoing write activity. For NRT, forced segment


flushes also de-schedule pending auto-soft commits to avoid potentially flushing too many small
segments. When not set, the default is 1024.
Default: commented out (1024)
ram_buffer_offheap_space_in_mb
Global Lucene RAM buffer usage threshold for offheap to force segment flush. Setting too low might
induce a state of constant flushing during periods of ongoing write activity. For NRT, forced segment
flushes also de-schedule pending auto-soft commits to avoid potentially flushing too many small
segments. When not set, the default is 1024.
Default: commented out (1024)
Performance Service options

• Global Performance Service options

• Performance Service options

• DSE Search Performance Service options

• Spark Performance Service options

Global Performance Service options


Available options to configure the thread pool that is used by most plug-ins. A dropped task warning
is issued when the performance service requests more tasks than performance_max_threads +
performance_queue_capacity. When a task is dropped, collected statistics might not be current.

# performance_core_threads: 4
# performance_max_threads: 32
# performance_queue_capacity: 32000

performance_core_threads
Number of background threads used by the performance service under normal conditions. Default: 4
performance_max_threads
Maximum number of background threads used by the performance service. Default: 32
performance_queue_capacity
The number of queued tasks in the backlog when the number of performance_max_threads are busy.
Default: 32000
Performance Service options
These settings are used by the Performance Service to configure collection of performance metrics on
transactional nodes. Performance metrics are stored in the dse_perf keyspace and can be queried with CQL
using any CQL-based utility, such as cqlsh or any application using a CQL driver. To temporarily make changes
for diagnostics and testing, use the dsetool perf subcommands.

See Collecting system level diagnostics.

graph_events
Graph event information.

graph_events:
ttl_seconds: 600

ttl_seconds
The time-to-live (TTL), in seconds, for graph event information.
Default: 600
cql_slow_log_options
Options to configure reporting of CQL queries that take longer than a specified period of time to
execute.

# cql_slow_log_options:


# enabled: true
# threshold: 200.0
# minimum_samples: 100
# ttl_seconds: 259200
# skip_writing_to_db: true
# num_slowest_queries: 5

See Collecting slow queries.


enabled
Enables (true) or disables (false) log entries for slow queries. When not set, the default is true.
Default: commented out (true)
threshold
The threshold in milliseconds or as a percentile.

• A value greater than 1 is expressed in time and will log queries that take longer than the specified
number of milliseconds.

• A value of 0 to 1 is expressed as a percentile and will log queries that exceed this percentile.

Default: commented out (200.0, 0.2 seconds)


minimum_samples
The initial number of queries before activating the percentile filter.
Default: commented out (100)
ttl_seconds
Time, in seconds, to keep the slow query log entries.
Default: commented out (259200, 3 days)
skip_writing_to_db
Whether to keep slow queries in-memory only and not write data to database.

• false - write slow queries to the database; the threshold must be >= 2000 ms to prevent a high load
on the database

• true - skip writing to database, keep slow queries only in memory

Default: commented out (true)


num_slowest_queries
The number of slow queries to keep in-memory.
Default: commented out (5)
cql_system_info_options
Options to configure collection of system-wide performance information about a cluster.

cql_system_info_options:
enabled: false
refresh_rate_ms: 10000

enabled
Whether to collect system-wide performance information about a cluster.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
resource_level_latency_tracking_options


Options to configure collection of object I/O performance statistics.

resource_level_latency_tracking_options:
enabled: false
refresh_rate_ms: 10000

See Collecting system level diagnostics.


enabled
Whether to collect object I/O performance statistics.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
db_summary_stats_options
Options to configure collection of summary statistics at the database level.

db_summary_stats_options:
enabled: false
refresh_rate_ms: 10000

See Collecting database summary diagnostics.


enabled
Whether to collect database summary performance information.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
cluster_summary_stats_options
Options to configure collection of statistics at a cluster-wide level.

cluster_summary_stats_options:
enabled: false
refresh_rate_ms: 10000

See Collecting cluster summary diagnostics.


enabled
Whether to collect statistics at a cluster-wide level.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
spark_cluster_info_options


Options to configure collection of data associated with Spark cluster and Spark applications.

spark_cluster_info_options:
enabled: false
refresh_rate_ms: 10000

See Monitoring Spark with Spark Performance Objects.


enabled
Whether to collect Spark performance statistics.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
histogram_data_options
Histogram data for the dropped mutation metrics are stored in the dropped_messages table in the
dse_perf keyspace.

histogram_data_options:
enabled: false
refresh_rate_ms: 10000
retention_count: 3

See Collecting histogram diagnostics.


enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
retention_count
Default: 3
user_level_latency_tracking_options
User-resource latency tracking settings.

user_level_latency_tracking_options:
enabled: false
refresh_rate_ms: 10000
top_stats_limit: 100
quantiles: false

See Collecting user activity diagnostics.


enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms


The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
top_stats_limit
Limit the number of individual metrics.
Default: 100
quantiles
Default: false
DSE Search Performance Service options
These settings are used by the DataStax Enterprise Performance Service.

solr_slow_sub_query_log_options:
enabled: false
ttl_seconds: 604800
threshold_ms: 3000
async_writers: 1

solr_update_handler_metrics_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000

solr_request_handler_metrics_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000

solr_index_stats_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000

solr_cache_stats_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000

solr_latency_snapshot_options:
enabled: false
ttl_seconds: 604800
refresh_rate_ms: 60000

solr_slow_sub_query_log_options
Options to configure reporting of distributed sub-queries for search (query executions on individual
shards) that take longer than a specified period of time.
See Collecting slow search queries.
enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
ttl_seconds
The time, in seconds, that the collected data is retained.
Default: 604800 (7 days)
async_writers


The number of server threads dedicated to writing in the log. More than one server thread might
degrade performance.
Default: 1
threshold_ms
The time, in milliseconds, above which a distributed sub-query is reported as slow.
Default: 3000
solr_update_handler_metrics_options
Options to collect search index direct update handler statistics over time.
See Collecting handler statistics.
enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
ttl_seconds
The time, in seconds, that the collected statistics are retained.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_request_handler_metrics_options
Options to collect search index request handler statistics over time. The enabled, ttl_seconds, and
refresh_rate_ms sub-options behave the same as for solr_update_handler_metrics_options.
See Collecting handler statistics.
solr_index_stats_options
Options to record search index statistics over time.
See Collecting index statistics.
enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
ttl_seconds
The time, in seconds, that the collected statistics are retained.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_cache_stats_options
See Collecting cache statistics.
enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
ttl_seconds
The time, in seconds, that the collected statistics are retained.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
solr_latency_snapshot_options
See Collecting Apache Solr performance statistics.
enabled

• false - do not collect metrics

• true - enable collection of metrics


Default: false
ttl_seconds
The time, in seconds, that the collected statistics are retained.
Default: 604800 (7 days)
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 60000 (1 minute)
Spark Performance Service options
See Monitoring Spark application information.

spark_application_info_options:
  enabled: false
  refresh_rate_ms: 10000
  driver:
    sink: false
    connectorSource: false
    jvmSource: false
    stateSource: false
  executor:
    sink: false
    connectorSource: false
    jvmSource: false

spark_application_info_options
Statistics options.
enabled

• false - do not collect metrics

• true - enable collection of metrics

Default: false
refresh_rate_ms
The length of the sampling period in milliseconds; the frequency to update the statistics.
Default: 10000 (10 seconds)
driver
Options to configure collection of metrics at the Spark Driver.
connectorSource
Whether to collect Spark Cassandra Connector metrics at the Spark Driver.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
jvmSource
Whether to collect JVM heap and garbage collection (GC) metrics from the Spark Driver.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
stateSource
Whether to collect application state metrics at the Spark Driver.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
executor


Options to configure collection of metrics at Spark executors.


sink
Whether to write metrics collected at Spark executors.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
connectorSource
Whether to collect Spark Cassandra Connector metrics at Spark executors.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
jvmSource
Whether to collect JVM heap and GC metrics at Spark executors.

• false - do not collect metrics

• true - enable collection of metrics

Default: false
DSE Analytics options

• Spark

• Starting Spark drivers and executors

• DSE File System (DSEFS) options

• Spark Performance Service

Spark resource and encryption options

spark_shared_secret_bit_length: 256
spark_security_enabled: false
spark_security_encryption_enabled: false

spark_daemon_readiness_assertion_interval: 1000

resource_manager_options:
  worker_options:
    cores_total: 0.7
    memory_total: 0.6

    workpools:
      - name: alwayson_sql
        cores: 0.25
        memory: 0.25

spark_ui_options:
  encryption: inherit
  encryption_options:
    enabled: false
    keystore: .keystore
    keystore_password: cassandra
    require_client_auth: false
    truststore: .truststore
    truststore_password: cassandra
    # Advanced settings
    # protocol: TLS
    # algorithm: SunX509


    # store_type: JKS
    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]

spark_shared_secret_bit_length
The length of a shared secret used to authenticate Spark components and encrypt the connections
between them. This value is not the strength of the cipher for encrypting connections. Default: 256
spark_security_enabled
In DSE 6.0.8 and later, when DSE authentication is enabled with authentication_options, Spark security
is enabled regardless of this setting.
Enables Spark security based on shared secret infrastructure. Enables mutual authentication and
optional encryption between DSE Spark Master and Workers, and of communication channels, except
the web UI.
Default: false
spark_security_encryption_enabled
In DSE 6.0.8 and later, when DSE authentication is enabled with authentication_options, Spark security
is enabled regardless of this setting.
Enables encryption between DSE Spark Master and Workers, and of communication channels,
except the web UI. Uses DIGEST-MD5 SASL-based encryption mechanism. Requires
spark_security_enabled: true.
Configure encryption between the Spark processes and DSE with client-to-node encryption in
cassandra.yaml.
spark_daemon_readiness_assertion_interval
Time interval, in milliseconds, between subsequent retries by the Spark plugin for Spark Master and
Worker readiness to start. Default: 1000
resource_manager_options
DataStax Enterprise can control the memory and cores offered by particular Spark Workers in semi-
automatic fashion. You can define the total amount of physical resources available to Spark Workers,
and optionally add named work pools with specific resources dedicated to them.
worker_options
The amount of system resources that are made available to the Spark Worker. If an option in this
section is not specified, the default values noted below are used.
cores_total
The number of total system cores available to Spark. If the option is not specified, the default value 0.7
is used.
For DSE 6.0.11 and later, the SPARK_WORKER_TOTAL_CORES environment variable takes precedence
over this setting.

This setting can be the exact number of cores or a decimal of the total system cores. When the value is
expressed as a decimal, the available resources are calculated in the following way:

Spark Worker cores = cores_total * total system cores

The lowest value that you can assign to Spark Worker cores is 1 core. If the results are lower, no
exception is thrown and the values are automatically limited.

Setting cores_total or a workpool's cores to 1.0 is a decimal value, meaning 100% of the available
cores will be reserved. Setting cores_total or cores to 1 (no decimal point) is an explicit value, and
one core will be reserved.
memory_total
The amount of total system memory available to Spark. This setting can be the exact amount of
memory or a decimal of the total system memory. When the value is an absolute value, you can use
standard suffixes like M for megabyte and G for gigabyte.


When the value is expressed as a decimal, the available resources are calculated in the following way:

Spark Worker memory = memory_total * (total system memory - memory assigned to


DataStax Enterprise)

The lowest values that you can assign to Spark Worker memory is 64 MB. If the results are lower, no
exception is thrown and the values are automatically limited.
If the option is not specified, the default value 0.6 is used.
For DSE 6.0.11 and later, the SPARK_WORKER_TOTAL_MEMORY environment variable takes
precedence over this setting.

workpools
Named work pools that can use a portion of the total resources defined under worker_options. A
default work pool named default is used if no work pools are defined in this section. If work pools are
defined, the resources allocated to the work pools are taken from the total amount, with the remaining
resources available to the default work pool. The total amount of resources defined in the workpools
section must not exceed the resources available to Spark in worker_options.
A work pool named alwayson_sql is created by default for AlwaysOn SQL. By default, it is configured
to use 25% of the resources available to Spark.
name
The name of the work pool.
cores
The number of system cores to use in this work pool expressed as either an absolute value or a decimal
value. This option follows the same rules as cores_total.
memory
The amount of memory to use in this work pool expressed as either an absolute value or a decimal
value. This option follows the same rules as memory_total.
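To illustrate the decimal versus absolute rules described above, the following sketch (names and values are illustrative) gives the Spark Worker 70% of the system cores and a fixed 10 GB of memory, and reserves exactly 2 cores and 25% of the Worker memory for a hypothetical work pool named batch, leaving the remainder for the default work pool:

resource_manager_options:
  worker_options:
    cores_total: 0.7
    memory_total: 10G
    workpools:
      - name: batch
        cores: 2
        memory: 0.25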
spark_ui_options
Specify the source for SSL settings for Spark Master and Spark Worker UIs. The spark_ui_options
apply only to Spark daemon UIs, and do not apply to user applications even when the user applications
are run in cluster mode.
encryption

• inherit - inherit the SSL settings from the client encryption options.

• custom - use the following encryption_options from dse.yaml.

Default: inherit
encryption_options
Set encryption options for HTTPS of Spark Master and Worker UI. The spark_encryption_options are
not valid for DSE 5.1 and later.
enabled
Whether to enable Spark encryption for Spark client-to-Spark cluster and Spark internode
communication.
Default: false
keystore
The keystore for Spark encryption keys.
The relative file path is the base Spark configuration directory that is defined by the SPARK_CONF_DIR
environment variable. The default Spark configuration directory is resources/spark/conf.
Default: resources/dse/conf/.ui-keystore
keystore_password
The password to access the key store.
Default: cassandra
require_client_auth
Whether to require truststore for client authentication. When not set, the default is false.
Default: commented out (false)
truststore


The truststore for Spark encryption keys.


The relative file path is the base Spark configuration directory that is defined by the SPARK_CONF_DIR
environment variable. The default Spark configuration directory is resources/spark/conf.
Default: commented out (resources/dse/conf/.ui-truststore)
truststore_password
The password to access the truststore.
Default: commented out (cassandra)
protocol
Defines the encryption protocol. The TLS protocol must be supported by JVM and Spark.
Default: commented out (TLS)
algorithm
Defines the key manager algorithm.
Default: commented out (SunX509)
store_type
Defines the keystore type.
Default: commented out (JKS)
cipher_suites
Defines the cipher suites for Spark encryption:

• TLS_RSA_WITH_AES_128_CBC_SHA

• TLS_RSA_WITH_AES_256_CBC_SHA

• TLS_DHE_RSA_WITH_AES_128_CBC_SHA

• TLS_DHE_RSA_WITH_AES_256_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA

• TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA

Default: commented out


Starting Spark drivers and executors

spark_process_runner:
runner_type: default
run_as_runner_options:
user_slots:
- slot1
- slot2

spark_process_runner:
Options to configure how Spark driver and executor processes are created and managed.
runner_type

• default - Use the default runner type.

• run_as - Use the run_as_runner_options options. See Running Spark processes as separate
users.

run_as_runner_options
The slot users used to separate Spark process owners from the DSE service user. See Running Spark
processes as separate users.
Default: slot1, slot2
AlwaysOn SQL options
Properties to enable and configure AlwaysOn SQL.

# AlwaysOn SQL options


# alwayson_sql_options:
# enabled: false


# thrift_port: 10000
# web_ui_port: 9077
# reserve_port_wait_time_ms: 100
# alwayson_sql_status_check_wait_time_ms: 500
# workpool: alwayson_sql
# log_dsefs_dir: /spark/log/alwayson_sql
# auth_user: alwayson_sql
# runner_max_errors: 10

alwayson_sql_options
The AlwaysOn SQL options enable and configure the server on this node.
enabled
Whether to enable AlwaysOn SQL for this node. The node must be an analytics node. When not set,
the default is false.
Default: commented out (false)
thrift_port
The Thrift port on which AlwaysOn SQL listens.
Default: commented out (10000)
web_ui_port
The port on which the AlwaysOn SQL web UI is available.
Default: commented out (9077)
reserve_port_wait_time_ms
The wait time in milliseconds to reserve the thrift_port if it is not available.
Default: commented out (100)
alwayson_sql_status_check_wait_time_ms
The time in milliseconds to wait for a health check status of the AlwaysOn SQL server.
Default: commented out (500)
workpool
The work pool name used by AlwaysOn SQL.
Default: commented out (alwayson_sql)
log_dsefs_dir
Location in DSEFS of the AlwaysOn SQL log files.
Default: commented out (/spark/log/alwayson_sql)
auth_user
The role to use for internal communication by AlwaysOn SQL if authentication is enabled. Custom roles
must be created with login=true.
Default: commented out (alwayson_sql)
runner_max_errors
The maximum number of errors that can occur during AlwaysOn SQL service runner thread runs before
stopping the service. A service stop requires a manual restart.
Default: commented out (10)
DSE File System (DSEFS) options
Properties to enable and configure the DSE File System (DSEFS).
DSEFS replaced the Cassandra File System (CFS). DSE version 6.0 and later do not support CFS.

dsefs_options:
  enabled:
  keyspace_name: dsefs
  work_dir: /var/lib/dsefs
  public_port: 5598
  private_port: 5599
  data_directories:
    - dir: /var/lib/dsefs/data
      storage_weight: 1.0
      min_free_space: 5368709120

# service_startup_timeout_ms: 30000


# service_close_timeout_ms: 600000
# server_close_timeout_ms: 2147483647 # Integer.MAX_VALUE
# compression_frame_max_size: 1048576
# query_cache_size: 2048
# query_cache_expire_after_ms: 2000
# gossip_options:
# round_delay_ms: 2000
# startup_delay_ms: 5000
# shutdown_delay_ms: 10000
# rest_options:
# request_timeout_ms: 330000
# connection_open_timeout_ms: 55000
# client_close_timeout_ms: 60000
# server_request_timeout_ms: 300000
# idle_connection_timeout_ms: 60000
# internode_idle_connection_timeout_ms: 120000
# core_max_concurrent_connections_per_host: 8
# transaction_options:
# transaction_timeout_ms: 3000
# conflict_retry_delay_ms: 200
# conflict_retry_count: 40
# execution_retry_delay_ms: 1000
# execution_retry_count: 3
# block_allocator_options:
# overflow_margin_mb: 1024
# overflow_factor: 1.05

dsefs_options
Enable and configure options for DSEFS.
enabled
Whether to enable DSEFS.

• true - enables DSEFS on this node, regardless of the workload.

• false - disables DSEFS on this node, regardless of the workload.

• blank or commented out (#) - DSEFS will start only if the node is configured to run analytics
workloads.

Default: commented out (blank)


keyspace_name
The keyspace where the DSEFS metadata is stored. You can optionally configure multiple DSEFS file
systems within a single datacenter by specifying different keyspace names for each cluster.
Default: commented out (dsefs)
work_dir
The local directory for storing the local node metadata, including the node identifier. The volume of data
stored in this directory is nominal and does not require configuration for throughput, latency, or capacity.
This directory must not be shared by DSEFS nodes.
Default: commented out (/var/lib/dsefs)
public_port
The public port on which DSEFS listens for clients.
DataStax recommends that all nodes in the cluster have the same value. Firewalls must open this
port to trusted clients. The service on this port is bound to the native_transport_address.
Default: commented out (5598)
private_port
The private port for DSEFS inter-node communication.
Do not open this port to firewalls; this private port must be not visible from outside of the cluster.
Default: commented out (5599)
data_directories
One or more data locations where the DSEFS data is stored.


- dir
Mandatory attribute to identify the set of directories. DataStax recommends segregating these data
directories on physical devices that are different from the devices that are used for DataStax Enterprise.
Using multiple directories on JBOD improves performance and capacity.
Default: commented out (/var/lib/dsefs/data)
storage_weight
The weighting factor for this location specifies how much data to place in this directory, relative to other
directories in the cluster. This soft constraint determines how DSEFS distributes the data. For example,
a directory with a value of 3.0 receives about three times more data than a directory with a value of 1.0.
Default: commented out (1.0)
min_free_space
The reserved space, in bytes, to not use for storing file data blocks. You can use a unit of measure
suffix to specify other size units. For example: terabyte (1 TB), gigabyte (10 GB), and megabyte (5000
MB).
Default: commented out (5368709120)
Advanced properties for DSEFS
service_startup_timeout_ms
Wait time, in milliseconds, before the DSEFS server times out while waiting for services to bootstrap.
Default: commented out (30000)
service_close_timeout_ms
Wait time, in milliseconds, before the DSEFS server times out while waiting for services to close.
Default: commented out (600000)
server_close_timeout_ms
Wait time, in milliseconds, that the DSEFS server waits during shutdown before closing all pending
connections.
Default: commented out (2147483647)
compression_frame_max_size
The maximum accepted size of a compression frame defined during file upload.
Default: commented out (1048576)
query_cache_size
Maximum number of elements in a single DSEFS Server query cache.
Default: commented out (2048)
query_cache_expire_after_ms
The time to retain the DSEFS Server query cache element in cache. The cache element expires when
this time is exceeded.
Default: commented out (2000)
gossip options
Options to configure DSEFS gossip rounds.
round_delay_ms
The delay, in milliseconds, between gossip rounds.
Default: commented out (2000)
startup_delay_ms
The delay time, in milliseconds, between registering the location and reading back all other locations
from the database.
Default: commented out (5000)
shutdown_delay_ms
The delay time, in milliseconds, between announcing shutdown and shutting down the node.
Default: commented out (30000)
rest_options
Options to configure DSEFS rest times.
request_timeout_ms
The time, in milliseconds, that the client waits for a response that corresponds to a given request.
Default: commented out (330000)
connection_open_timeout_ms
The time, in milliseconds, that the client waits to establish a new connection.
Default: commented out (55000)
client_close_timeout_ms


The time, in milliseconds, that the client waits for pending transfer to complete before closing a
connection.
Default: commented out (60000)
server_request_timeout_ms
The time, in milliseconds, to wait for the server rest call to complete.
Default: commented out (300000)
idle_connection_timeout_ms
The time, in milliseconds, for RestClient to wait before closing an idle connection. If RestClient does not
close the connection after the timeout, the connection is closed after 2 * idle_connection_timeout_ms.

• time - wait time to close idle connection

• 0 - disable closing idle connections

Default: commented out (60000)


internode_idle_connection_timeout_ms
Wait time, in milliseconds, before closing idle internode connection. The internode connections are
primarily used to exchange data during replication. Do not set lower than the default value for heavily
utilized DSEFS clusters.
Default: commented out (0, disabled)
core_max_concurrent_connections_per_host
Maximum number of connections to a given host per single CPU core. DSEFS keeps a connection pool
for each CPU core.
Default: 8
transaction_options
Options to configure DSEFS transaction times.
transaction_timeout_ms
Transaction run time, in milliseconds, before the transaction is considered for timeout and rollback.
Default: 3000
conflict_retry_delay_ms
Wait time, in milliseconds, before retrying a transaction that was ended due to a conflict. Default: 200
conflict_retry_count
The number of times to retry a transaction before giving up. Default: 40
execution_retry_delay_ms
Wait time, in milliseconds, before retrying a failed transaction payload execution. Default: 1000
execution_retry_count
The number of payload execution retries before signaling the error to the application. Default: 3
block_allocator_options
Controls how much additional data can be placed on the local coordinator before the local node
overflows to the other nodes. The trade-off is between data locality of writes and balancing the cluster.
A local node is preferred for a new block allocation if (a worked example follows the option descriptions below):

used_size_on_the_local_node < average_used_size_per_node * overflow_factor + overflow_margin

overflow_margin_mb

• margin_size - overflow margin size in megabytes

• 0 - disable block allocation overflow

Default: commented out (1024)


overflow_factor

• factor - overflow factor on an exponential scale

• 1.0 - disable block allocation overflow

Default: commented out (1.05)
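For example, with the default overflow_factor of 1.05 and overflow_margin_mb of 1024, a local coordinator that already stores 500 GB of DSEFS data is still preferred for new blocks as long as 500 GB < average_used_size_per_node * 1.05 + 1 GB, that is, whenever the per-node average usage exceeds roughly 475 GB.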


DSE Metrics Collector options


When data_dir is commented out, the default location of the DSE Metrics Collector data directory is the
same directory as the commitlog directory, as defined in cassandra.yaml.

Uncomment these options only to change the default directories:

# insights_options:
# data_dir: /var/lib/cassandra/insights_data
# log_dir: /var/log/cassandra/

insights_options
Options for DSE Metrics Collector.
data_dir
Directory to store collected metrics. When not set, the default directory is /var/lib/cassandra/
insights_data.
When data_dir is not set, the default location of the /insights_data directory is the same location
as the /commitlog directory, as defined with the commitlog_directory property in cassandra.yaml.
log_dir
Directory to store logs for collected metrics. The log file is dse-collectd.log. The file with the collectd
PID is dse-collectd.pid. When not set, the default directory is /var/log/cassandra/.
Audit database activities
Track database activity using the audit log feature. To get the maximum information from data auditing, turn on
data auditing on every node.
See Setting up database auditing.

audit_logging_options
Options to enable and configure database activity logging.
enabled
Whether to enable database activity auditing.

• true - enables database activity auditing

• false - disables database activity auditing

Default: false
logger
The logger to use for recording events:

• SLF4JAuditWriter - Capture events in a log file.

• CassandraAuditWriter - Capture events in a table, dse_audit.audit_log.

Configure logging level, sensitive data masking, and log file name/location in the logback.xml file.
Default: SLF4JAuditWriter
included_categories
Comma separated list of event categories that are captured, where the category names are:

• QUERY - Data retrieval events.

• DML - (Data manipulation language) Data change events.

• DDL - (Data definition language) Database schema change events.

• DCL - (Data control language) Role and permission management events.

• AUTH - (Authentication) Login and authorization related events.

• ERROR - Failed requests.


• UNKNOWN - Events where the category and type are both UNKNOWN.

Event categories that are not listed are not captured.


Use either included_categories or excluded_categories but not both. When specifying
included_categories, leave excluded_categories blank or commented out.
Default: none (include all categories)
excluded_categories
Comma separated list of categories to ignore, where the categories are:

• QUERY - Data retrieval events.

• DML - (Data manipulation language) Data change events.

• DDL - (Data definition language) Database schema change events.

• DCL - (Data control language) Role and permission management events.

• AUTH - (Authentication) Login and authorization related events.

• ERROR - Failed requests.

• UNKNOWN - Events where the category and type are both UNKNOWN.

Events in all other categories are logged.


Use either included_categories or excluded_categories but not both. When specifying
excluded_categories, leave included_categories blank or commented out.
Default: none (exclude no categories )
included_keyspaces
The keyspaces for which events are logged. Specify keyspace names in a comma separated list or use
a regular expression to filter on keyspace name.
DSE supports using either included_keyspaces or excluded_keyspaces but not both. When
specifying included_keyspaces, leave excluded_keyspaces blank or comment it out.
Default: none (include all keyspaces)
excluded_keyspaces
Log events for all keyspaces that are not listed. Specify a comma-separated list of keyspace names or
use a regular expression to filter on keyspace name. Only use this option if included_keyspaces is
blank or commented out.
Default: none (exclude no keyspaces)
included_roles
The roles for which events are logged. Specify roles in a comma-separated list.
DSE supports using either included_roles or excluded_roles but not both. When specifying
included_roles, leave excluded_roles blank or comment it out.
Default: none (include all roles)
excluded_roles
The roles for which events are not logged. Specify a comma-separated list of role names. Only use this
option if included_roles is blank or commented out.
Default: none (exclude no roles)
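A sketch of an audit_logging_options section that records events to the dse_audit.audit_log table and captures only login, permission-management, and schema-change activity (the category selection is illustrative):

audit_logging_options:
  enabled: true
  logger: CassandraAuditWriter
  included_categories: AUTH, DCL, DDL
  # excluded_categories:
  # included_keyspaces:
  # excluded_keyspaces:
  # included_roles:
  # excluded_roles: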
Cassandra audit writer options

retention_time: 0
cassandra_audit_writer_options:
mode: sync
batch_size: 50
flush_time: 250
queue_size: 30000
write_consistency: QUORUM
# dropped_event_log: /var/log/cassandra/dropped_audit_events.log


# day_partition_millis: 3600000

retention_time
The amount of time, in hours, audit events are retained by supporting loggers. Only the
CassandraAuditWriter supports retention time.

• 0 - retain events forever

• hours - the number of hours to retain audit events

Default: 0 (retain events forever)


cassandra_audit_writer_options
Audit writer options.
mode
The mode the writer runs in.

• sync - A query is not executed until the audit event is successfully written.

• async - Audit events are queued for writing to the audit table, but are not necessarily logged before
the query executes. A pool of writer threads consumes the audit events from the queue, and writes
them to the audit table in batch queries.
While async substantially improves performance under load, if there is a failure between when
a query is executed, and its audit event is written to the table, the audit table might be missing
entries for queries that were executed.

Default: sync
batch_size
Available only when mode: async. Must be greater than 0.
The maximum number of events the writer dequeues before writing them out to the table. If
warnings in the logs reveal that batches are too large, decrease this value or increase the value of
batch_size_warn_threshold_in_kb in cassandra.yaml.
Default: 50
flush_time
Available only when mode: async.
The maximum amount of time, in milliseconds, that an event waits in the queue before a writer flushes
it to the audit table. This flush time prevents events from waiting too long to be written when query
volume is low.
Default: 500
queue_size
The size of the queue feeding the asynchronous audit log writer threads. When there are more events
being produced than the writers can write out, the queue fills up, and newer queries are blocked until
there is space on the queue. If a value of 0 is used, the queue size is unbounded, which can lead to
resource exhaustion under heavy query load.
Default: 30000
write_consistency
The consistency level that is used to write audit events.
Default: QUORUM
dropped_event_log
The directory to store the log file that reports dropped events. When not set, the default is /var/log/
cassandra/dropped_audit_events.log.
Default: commented out (/var/log/cassandra/dropped_audit_events.log)
day_partition_millis
The interval, in milliseconds, between changing nodes to spread audit log information across multiple
nodes. For example, to change the target node every 12 hours, specify 43200000 milliseconds. When
not set, the default is 3600000 (1 hour).
Default: commented out (3600000) (1 hour)
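For example, a dse.yaml sketch that switches the writer to asynchronous mode and rotates the target node every 12 hours might look like the following. The values shown are illustrative, not recommendations; all option names are the ones documented above.

cassandra_audit_writer_options:
    mode: async
    batch_size: 50
    flush_time: 250
    queue_size: 30000
    write_consistency: LOCAL_QUORUM
    dropped_event_log: /var/log/cassandra/dropped_audit_events.log
    day_partition_millis: 43200000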


DSE Tiered Storage options


Options to define one or more disk configurations for DSE Tiered Storage. Each disk configuration consists of
unnamed tiers, each a collection of paths listed in priority order with the fastest storage media in the top tier.
With heterogeneous storage configurations across the cluster, specify each disk configuration
with config_name:config_settings, and then reference the configuration in CREATE TABLE or ALTER TABLE
statements.
DSE Tiered Storage does not change compaction strategies. To manage compression and compaction
options, use the compaction option. See Modifying compression and compaction.

# tiered_storage_options:
# strategy1:
# tiers:
# - paths:
# - /mnt1
# - /mnt2
# - paths: [ /mnt3, /mnt4 ]
# - paths: [ /mnt5, /mnt6 ]
#
# local_options:
# k1: v1
# k2: v2
#
# 'another strategy':
# tiers: [ paths: [ /mnt1 ] ]

tiered_storage_options
Options to configure the smart movement of data across different types of storage media so that data
is matched to the most suitable drive type, according to the performance and cost characteristics it
requires.
strategy1
The first disk configuration strategy. Create a strategy2, strategy3, and so on. In this example, strategy1
is the configurable name of the tiered storage configuration strategy.
tiers
Each unnamed tier in this section defines a storage tier with the file paths that determine its
priority order.
local_options
Local configuration options overwrite the tiered storage settings for the table schema in the local
dse.yaml file. See Testing DSE Tiered Storage configurations.
- paths
The section of file paths that define the data directories for this tier of the disk configuration. Typically
list the fastest storage media first. These paths are used only to store data that is configured to use
tiered storage. These paths are independent of any settings in the cassandra.yaml file.
- /filepath
The file paths that define the data directories for this tier of the disk configuration.
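The following CQL is only a sketch of how a named configuration such as strategy1 might be referenced from a table definition. The keyspace, table name, and option values are placeholders, and the exact compaction class and option names should be confirmed against the DSE Tiered Storage documentation for your version.

-- Sketch: point an existing table at the strategy1 disk configuration defined in dse.yaml
ALTER TABLE cycling.comments
  WITH COMPACTION = {
    'class': 'org.apache.cassandra.db.compaction.TieredCompactionStrategy',
    'tiering_strategy': 'TimeWindowStorageStrategy',
    'config': 'strategy1',
    'max_tier_ages': '3600,7200' };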
DSE Advanced Replication configuration settings
DSE Advanced Replication configuration options to replicate data from remote clusters to central data hubs.

# advanced_replication_options:
# enabled: false
# conf_driver_password_encryption_enabled: false
# advanced_replication_directory: /var/lib/cassandra/advrep
# security_base_path: /base/path/to/advrep/security/files/

advanced_replication_options
Options to enable and configure DSE Advanced Replication.
enabled
Whether to enable an edge node to collect data in the replication log.


Default: commented out (false)


conf_driver_password_encryption_enabled
Whether to enable encryption of driver passwords. When enabled, the stored driver password is
expected to be encrypted. See Encrypting configuration file properties.
Default: commented out (false)
advanced_replication_directory
The directory for storing advanced replication CDC logs. A directory replication_logs will be created
in the specified directory.
Default: commented out (/var/lib/cassandra/advrep)
security_base_path
The base path to prepend to paths in the Advanced Replication configuration locations, including
locations to SSL keystore, SSL truststore, and so on.
Default: commented out (/base/path/to/advrep/security/files/)
Inter-node messaging options
Configuration options for the internal messaging service used by several components of DataStax Enterprise. All
internode messaging requests use this service.

internode_messaging_options:
port: 8609
# frame_length_in_mb: 256
# server_acceptor_threads: 8
# server_worker_threads: 16
# client_max_connections: 100
# client_worker_threads: 16
# handshake_timeout_seconds: 10
# client_request_timeout_seconds: 60

internode_messaging_options
Configuration options for inter-node messaging.
port
The mandatory port for the inter-node messaging service.
Default: 8609
frame_length_in_mb
Maximum message frame length. When not set, the default is 256.
Default: commented out (256)
server_acceptor_threads
The number of server acceptor threads. When not set, the default is the number of available
processors.
Default: commented out
server_worker_threads
The number of server worker threads. When not set, the default is the number of available processors *
8.
Default: commented out
client_max_connections
The maximum number of client connections. When not set, the default is 100.
Default: commented out (100)
client_worker_threads
The number of client worker threads. When not set, the default is the number of available processors *
8.
Default: commented out
handshake_timeout_seconds
Timeout for communication handshake process. When not set, the default is 10.
Default: commented out (10)
client_request_timeout_seconds


Timeout for non-query search requests like core creation and distributed deletes. When not set, the
default is 60.
Default: commented out (60)
DSE Multi-Instance server_id
server_id
In DSE Multi-Instance /etc/dse-nodeId/dse.yaml files, the server_id option is generated to uniquely
identify the physical server on which multiple instances are running. The server_id default value is the
media access control address (MAC address) of the physical server. You can change server_id when
the MAC address is not unique, such as a virtualized server where the host’s physical MAC is cloned.
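For example, to override the generated value on one instance, set a unique identifier in that instance's /etc/dse-nodeId/dse.yaml. The value shown is a placeholder; any string that is unique per physical server works.

server_id: 00:0a:95:9d:68:16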
DSE Graph options

• DSE Graph system-level options

• DSE Graph Gremlin Server options

DSE Graph system-level options


These graph options are system-level configuration options and options that are shared between graph
instances. Add an option if it is not present in the provided dse.yaml file.

# graph:
# analytic_evaluation_timeout_in_minutes: 10080
# realtime_evaluation_timeout_in_seconds: 30
# schema_agreement_timeout_in_ms: 10000
# system_evaluation_timeout_in_seconds: 180
# index_cache_size_in_mb: 128
# max_query_queue: 10000
# max_query_threads (no explicit default)
# max_query_params: 16

graph
These graph options are system-level configuration options and options that are shared between graph
instances.
Option names and values expressed in ISO 8601 format used in earlier DSE 5.0 releases are still valid.
The ISO 8601 format is deprecated.
analytic_evaluation_timeout_in_minutes
Maximum time to wait for an OLAP analytic (Spark) traversal to evaluate. When not set, the default is
10080 (168 hours).
Default: commented out (10080)
realtime_evaluation_timeout_in_seconds
Maximum time to wait for an OLTP real-time traversal to evaluate. When not set, the default is 30
seconds.
Default: commented out (30)
schema_agreement_timeout_in_ms
Maximum time to wait for the database to agree on schema versions before timing out. When not set,
the default is 10000 (10 seconds).
Default: commented out (10000)
system_evaluation_timeout_in_seconds
Maximum time to wait for a graph system-based request to execute, like creating a new graph. When
not set, the default is 180 (3 minutes).
Default: commented out (180)
schema_mode
Controls the way that the schemas are handled.

• Production = Schema must be created before data insertion. Schema cannot be changed after
data is inserted. Full graph scans are disallowed unless the option graph.allow_scan is changed to
TRUE.


• Development = No schema is required to write data to a graph. Schema can be changed after data
is inserted. Full graph scans are allowed unless the option graph.allow_scan is changed to FALSE.

When not set, the default is Production. If this option is not present, manually enter it to use
Development.
Default: not present
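For example, to run a node in Development mode, add the option under the graph section of dse.yaml, as sketched below. Remember to use Production mode for production clusters.

graph:
    schema_mode: Development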
index_cache_size_in_mb
The amount of RAM, in MB, to allocate to the index cache. When not set, the default is 128.
Default: commented out (128)
max_query_queue
The maximum number of CQL queries that can be queued as a result of Gremlin requests. Incoming
queries are rejected if the queue size exceeds this setting. When not set, the default is 10000.
Default: commented out (10000)
max_query_threads
The maximum number of threads to use for queries to the database. When this option is not set, the
default is calculated:

• If gremlinPool is present and nonzero:


10 * the gremlinPool setting

• If gremlinPool is not present in this file or set to zero:


The number of available CPU cores

See gremlinPool.
Default: calculated
max_query_params
The maximum number of parameters that can be passed on a graph query request for TinkerPop
drivers and drivers using the Cassandra native protocol. Passing very large numbers of parameters
on requests is an anti-pattern, because the script evaluation time increases proportionally. DataStax
recommends reducing the number of parameters to speed up script compilation times. Before you
increase this value, consider alternate methods for parameterizing scripts, like passing a single map. If
the graph query request requires many arguments, pass a list.
Default: commented out (16)
DSE Graph Gremlin Server options
The Gremlin Server is configured using Apache TinkerPop specifications.

# gremlin_server:
# port: 8182
# threadPoolWorker: 2
# gremlinPool: 0
# scriptEngines:
# gremlin-groovy:
# config:
# sandbox_enabled: false
# sandbox_rules:
# whitelist_packages:
# - package.name
# whitelist_types:
# - fully.qualified.type.name
# whitelist_supers:
# - fully.qualified.class.name
# blacklist_packages:
# - package.name
# blacklist_supers:
# - fully.qualified.class.name

gremlin_server
The top-level configurations in Gremlin Server.
port
The available communications port for Gremlin Server. When not set, the default is 8182.


Default: commented out (8182)


threadPoolWorker
The number of worker threads that handle non-blocking read and write (requests and responses) on the
Gremlin Server channel, including routing requests to the right server operations, handling scheduled
jobs on the server, and writing serialized responses back to the client. When not set, the default is 2.
Default: commented out (2)
gremlinPool
The number of Gremlin threads available to execute actual scripts in a ScriptEngine. This pool
represents the workers available to handle blocking operations in Gremlin Server.

• 0 - the value of the JVM property cassandra.available_processors, if that property is set

• When not set - the value of Runtime.getRuntime().availableProcessors()

Default: commented out (0)


scriptEngines
Section to configure gremlin server scripts.
gremlin-groovy
Section for gremlin-groovy scripts.
sandbox_enabled
Sandbox is enabled by default. To disable the gremlin groovy sandbox entirely, set to false.
sandbox_rules
Section for sandbox rules.
whitelist_packages
List of packages, one package per line, to whitelist.
-package.name
Retain the hyphen before the fully qualified package name.
whitelist_types
List of types, one type per line, to whitelist.
-fully.qualified.type.name
Retain the hyphen before the fully qualified type name.
whitelist_supers
List of super classes, one class per line, to whitelist. Retain the hyphen before the fully qualified class
name.
-fully.qualified.class.name
Retain the hyphen before the fully qualified class name.
blacklist_packages
List of packages, one package per line, to blacklist.
-package.name
Retain the hyphen before the fully qualified package name.
blacklist_supers
List of super classes, one class per line, to blacklist. Retain the hyphen before the fully qualified class
name.
-fully.qualified.class.name
Retain the hyphen before the fully qualified class name.
See also the remote.yaml file for Gremlin console configuration.
remote.yaml configuration file
The remote.yaml file is the primary configuration file for DSE Graph Gremlin console connection to the Gremlin
Server.
The dse.yaml file is the primary configuration file for the DataStax Enterprise Graph configuration, and includes
the setting for the Gremlin Server options.
Synopsis
For the properties in each section, the parent setting has zero spaces. Each child entry requires at least
two spaces. Adhere to the YAML syntax and retain the spacing. For example, no spaces before the parent
node_health_options entry, and at least two spaces before the child settings:

node_health_options:
    refresh_rate_ms: 50000
    uptime_ramp_up_period_seconds: 10800
    dropped_mutation_window_minutes: 30

DSE Graph Gremlin basic options


An Apache TinkerPop YAML file, remote.yaml, is configured with Gremlin Server connection information. The
Gremlin Server itself is configured using Apache TinkerPop specifications.

hosts: [localhost]
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0,
              config: { ioRegistries: [org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerIoRegistryV3d0] }}

hosts
Identifies a host or hosts running a DSE node that is running Gremlin Server. You may need to use the
native_transport_address value set in cassandra.yaml.
Default: [localhost]
You can also connect to the Spark Master node for the datacenter by either running the console from
the Spark Master or specifying the Spark Master in the hosts field in the remote.yaml file.
port
Identifies a port on a DSE node running Gremlin Server. The port value needs to match the port value
specified for gremlin_server: in the dse.yaml file.
Default: 8182
serializer
Specifies the class and configuration for the serializer used to pass information between the Gremlin
console and the Gremlin Server.
Default: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0,
config: { ioRegistries: [org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerIoRegistryV3d0] }}
DSE Graph Gremlin connectionPool options
The connectionPool settings specify a number of options that will be passed between the Gremlin console and
the Gremlin Server.

connectionPool: {
enableSsl: false,
maxContentLength: 65536000,
maxInProcessPerConnection: 4,
maxSimultaneousUsagePerConnection: 16,
maxSize: 8,
maxWaitForConnection: 3000,
maxWaitForSessionClose: 3000,
minInProcessPerConnection: 1,
minSimultaneousUsagePerConnection: 8,
minSize: 2,
reconnectInterval: 1000,
resultIterationBatchSize: 64,
# trustCertChainFile: /etc/dse/graph/gremlin-console/conf/mycert.pem
# Note: trustCertChainFile deprecated as of TinkerPop 3.2.10; instead use trustStore.
trustStore: /full/path/to/jsse/truststore/file
}

enableSsl
Determines if SSL should be enabled. If enabled on the server, SSL must be enabled on the client.
To configure the Gremlin console to use SSL, when SSL is enabled on the Gremlin Server, edit the
connectionPool section of remote.yaml:


• Set enableSsl to true.

• Specify the path to the:

# Java Secure Socket Extension (JSSE) truststore file via the trustStore parameter

# Or the PEM-based trustCertChainFile


trustCertChainFile is deprecated as of TinkerPop 3.2.10. If SSL is enabled, switch to
specifying the JSSE truststore file via the trustStore parameter in remote.yaml when
you can.

Example:

hosts: [localhost]
username: Cassandra_username
password: Cassandra_password
port: 8182
...
connectionPool: {
enableSsl: true,
trustStore: /full/path/to/JSSE/truststore/file,
...
...

For related information, refer to the TinkerPop security documentation.


Default: false
maxContentLength
The maximum length, in bytes, of a message that can be sent to the server. This number can be no greater
than the setting of the same name in the server configuration.
Default: 65536000
maxInProcessPerConnection
The maximum number of in-flight requests that can occur on a connection.
Default: 4
maxSimultaneousUsagePerConnection
The maximum number of times that a connection can be borrowed from the pool simultaneously.
Default: 16
maxSize
The maximum size of a connection pool for a host.
Default: 8
maxWaitForConnection
The amount of time in milliseconds to wait for a new connection before timing out.
Default: 3000
maxWaitForSessionClose
The amount of time in milliseconds to wait for a session to close before timing out (does not apply to
sessionless connections).
Default: 3000
minInProcessPerConnection
The minimum number of in-flight requests that can occur on a connection.
Default: 1
minSimultaneousUsagePerConnection
The minimum number of times that a connection can be borrowed from the pool simultaneously.
Default: 8
minSize
The minimum size of a connection pool for a host.
Default: 2
reconnectInterval
The amount of time in milliseconds to wait before trying to reconnect to a dead host.
Default: 1000
resultIterationBatchSize

The override value for the size of the result batches to be returned from the server.
Default: 64
trustCertChainFile
The location of the public certificate from the DSE truststore file, in PEM format. Also set enableSsl:
true.

Deprecated as of TinkerPop 3.2.10. Instead use trustStore.

If you are using the deprecated trustCertChainFile in your version of remote.yaml, here are
the details. Depending on how you created the DSE truststore file, you may already have the
PEM format certificate file from the root Certificate Authority. If so, specify the PEM file with this
trustCertChainFile option. If not, export the public certificate from the DSE truststore (CER format)
and convert it to PEM format. Then specify the PEM file with this option. Example:

$ pwd
/etc/dse/graph/gremlin-console/conf

$ keytool -export -keystore /etc/dse/keystores/client.truststore -alias clusterca -file mycert.cer

$ openssl x509 -inform der -in mycert.cer -out mycert.pem

In this example, the connectionPool section of remote.yaml should then include the following options
(assuming you are aware that trustCertChainFile is deprecated, as noted above).

connectionPool: {
enableSsl: true,
trustCertChainFile: /etc/dse/graph/gremlin-console/conf/mycert.pem,
...
}

Default: Unspecified
trustStore
The location of the Java Secure Socket Extension (JSSE) truststore file. Trusted certificates for verifying
the remote client's certificate. Similar to setting the JSSE property javax.net.ssl.trustStore. If
this value is not provided in remote.yaml and if SSL is enabled (via enableSSL: true), the default
TrustManager is used.
Default: Unspecified
DSE Graph Gremlin AuthProperties options
Security considerations for authentication between the Gremlin console and the Gremlin server require additional
options in the remote.yaml file.

# jaasEntry:
# protocol:
# username: xxx
# password: xxx

jaasEntry
Sets the AuthProperties.Property.JAAS_ENTRY properties for authentication to Gremlin Server.
Default: commented out (no value)
protocol
Sets the AuthProperties.Property.PROTOCOL properties for authentication to Gremlin Server.
Default: commented out (no value)


username
The username to submit on requests that require authentication.
Default: commented out (xxx)
password
The password to submit on requests that require authentication.
Default: commented out (xxx)
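For example, a remote.yaml sketch that supplies internal authentication credentials to the Gremlin Server. The username and password shown are placeholders.

username: gremlin_user
password: gremlin_password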
cassandra-rackdc.properties file
The GossipingPropertyFileSnitch, Ec2Snitch, and Ec2MultiRegionSnitch use the cassandra-rackdc.properties
configuration file to determine which datacenters and racks nodes belong to. They inform the database about the
network topology to route requests efficiently and distribute replicas evenly. Settings for this file depend on the
type of snitch:

• GossipingPropertyFileSnitch

• Configuring the Amazon EC2 single-region snitch

• Configuring Amazon EC2 multi-region snitch

This page also includes instructions for migrating from the PropertyFileSnitch to the GossipingPropertyFileSnitch.
GossipingPropertyFileSnitch
This snitch is recommended for production. It uses rack and datacenter information for the local node defined in
the cassandra-rackdc.properties file and propagates this information to other nodes via gossip.
To configure a node to use GossipingPropertyFileSnitch, edit the cassandra-rackdc.properties file as follows:

• Define the datacenter and rack that include this node. The default settings:

dc=DC1
rack=RAC1

datacenter and rack names are case-sensitive. For examples, see Initializing a single datacenter per
workload type and Initializing multiple datacenters per workload type.

• To save bandwidth, add the prefer_local=true option. This option tells DataStax Enterprise to use the
local IP address when communication is not across different datacenters.
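For example, a complete cassandra-rackdc.properties for a node in the second rack of datacenter DC1, with local-IP preference enabled, might look like the following sketch. The datacenter and rack names are placeholders.

dc=DC1
rack=RAC2
prefer_local=true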

Migrating from the PropertyFileSnitch to the GossipingPropertyFileSnitch


To allow migration from the PropertyFileSnitch, the GossipingPropertyFileSnitch uses the cassandra-
topology.properties file when present. Delete the file after the migration is complete. For more information
about migration, see Switching snitches.

The GossipingPropertyFileSnitch always loads cassandra-topology.properties when that file is present.
Remove the file from each node on any new cluster or any cluster migrated from the PropertyFileSnitch.

cassandra-topology.properties file
The PropertyFileSnitch uses the cassandra-topology.properties file for datacenter and rack names and to
determine network topology, so that requests are routed efficiently and the database can distribute replicas
evenly.
The GossipingPropertyFileSnitch snitch is recommended for production. See Migrating from the
PropertyFileSnitch to the GossipingPropertyFileSnitch.

PropertyFileSnitch
This snitch determines proximity by rack and datacenter. It uses the network details located in the
cassandra-topology.properties file. When using this snitch, you can define your datacenter names to be whatever
you want. Make sure that the datacenter names correlate to the names of your datacenters in the keyspace
definition. Every node in the cluster should be described in the cassandra-topology.properties file, and this
file should be exactly the same on every node in the cluster.
Setting datacenters and rack names
If you had non-uniform IPs and two physical datacenters with two racks in each, and a third logical datacenter for
replicating analytics data, the cassandra-topology.properties file might look like this:

Datacenter and rack names are case-sensitive.

# datacenter One

175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1

120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2

# datacenter Two

110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1

50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2

# Analytics Replication Group

172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1

# default for unknown nodes


default=DC3:RAC1

Configuring snitches for cloud providers


Configure a cloud provider snitch that corresponds to the provider.
Configuring the Amazon EC2 single-region snitch
Use the Ec2Snitch for simple cluster deployments on Amazon EC2 where all nodes in the cluster are within a
single region. Because private IPs are used, this snitch does not work across multiple regions.
In EC2 deployments, the region name is treated as the datacenter name and availability zones are treated as
racks within a datacenter. For example, if a node is in the us-east-1 region, us-east is the datacenter name and
1 is the rack location. (Racks are important for distributing replicas, but not for datacenter naming.)
If you are using only a single datacenter, you do not need to specify any properties.
If you need multiple datacenters, set the dc_suffix options in the cassandra-rackdc.properties file. Any other
lines are ignored.
For example, for each node within the us-east region, specify the datacenter in its cassandra-
rackdc.properties file:
datacenter names are case-sensitive.

• node0
dc_suffix=_1_cassandra


• node1
dc_suffix=_1_cassandra

• node2
dc_suffix=_1_cassandra

• node3
dc_suffix=_1_cassandra

• node4
dc_suffix=_1_analytics

• node5
dc_suffix=_1_search

This results in three datacenters for the region:

us-east_1_cassandra
us-east_1_analytics
us-east_1_search

The datacenter naming convention in this example is based on the workload. You can use other conventions,
such as DC1, DC2 or 100, 200.

Keyspace strategy options


When defining your keyspace strategy options, use the EC2 region name, such as us-east, as your
datacenter name.
Configuring Amazon EC2 multi-region snitch
Use the Ec2MultiRegionSnitch for deployments on Amazon EC2 where the cluster spans multiple regions.
You must configure settings in both the cassandra.yaml file and the property file (cassandra-
rackdc.properties) used by the Ec2MultiRegionSnitch.

Configuring cassandra.yaml for cross-region communication


The Ec2MultiRegionSnitch uses the public IP address designated in broadcast_address to allow cross-region
connectivity. Configure each node as follows:

1. In the cassandra.yaml, set the listen_address to the private IP address of the node, and the
broadcast_address to the public IP address of the node.
This allows DataStax Enterprise nodes in one EC2 region to bind to nodes in another region, thus enabling
multiple datacenter support. For intra-region traffic, DataStax Enterprise switches to the private IP after
establishing a connection.

2. Set the addresses of the seed nodes in the cassandra.yaml file to their public IP addresses. Private IP addresses are not
routable between networks. For example:

seeds: 50.34.16.33, 60.247.70.52

To find the public IP address, from each of the seed nodes in EC2:

$ curl http://instance-data/latest/meta-data/public-ipv4

Do not make all nodes seeds, see Internode communications (gossip).


3. Be sure that the storage_port or ssl_storage_port is open on the public IP firewall.

Configuring the snitch for cross-region communication


In EC2 deployments, the region name is treated as the datacenter name and availability zones are treated as
racks within a datacenter. For example, if a node is in the us-east-1 region, us-east is the datacenter name and
1 is the rack location. (Racks are important for distributing replicas, but not for datacenter naming.)
For each node, specify its datacenter in the cassandra-rackdc.properties. The dc_suffix option defines the
datacenters used by the snitch. Any other lines are ignored.
In the example below, there are two DataStax Enterprise datacenters and each datacenter is named for its
workload. The datacenter naming convention in this example is based on the workload. You can use other
conventions, such as DC1, DC2 or 100, 200. (datacenter names are case-sensitive.)

Region: us-east

Node and datacenter:

• node0
dc_suffix=_1_transactional

• node1
dc_suffix=_1_transactional

• node2
dc_suffix=_2_transactional

• node3
dc_suffix=_2_transactional

• node4
dc_suffix=_1_analytics

• node5
dc_suffix=_1_search

This results in four us-east datacenters:

us-east_1_transactional
us-east_2_transactional
us-east_1_analytics
us-east_1_search

Region: us-west

Node and datacenter:

• node0
dc_suffix=_1_transactional

• node1
dc_suffix=_1_transactional

• node2
dc_suffix=_2_transactional

• node3
dc_suffix=_2_transactional

• node4
dc_suffix=_1_analytics

• node5
dc_suffix=_1_search

This results in four us-west datacenters:

us-west_1_transactional
us-west_2_transactional
us-west_1_analytics
us-west_1_search

Keyspace strategy options


When defining your keyspace strategy options, use the EC2 region name, such as us-east, as your
datacenter name.
Configuring the Google Cloud Platform snitch
Use the GoogleCloudSnitch for DataStax Enterprise deployments on Google Cloud Platform across one or
more regions. The region is treated as a datacenter and the availability zones are treated as racks within the
datacenter. All communication occurs over private IP addresses within the same logical network.
The region name is treated as the datacenter name and zones are treated as racks within a datacenter. For
example, if a node is in the us-central1-a zone, us-central1 is the datacenter name and a is the rack location.
(Racks are important for distributing replicas, but not for datacenter naming.) This snitch can work across
multiple regions without additional configuration.
If you are using only a single datacenter, you do not need to specify any properties.
If you need multiple datacenters, set the dc_suffix options in the cassandra-rackdc.properties file. Any other
lines are ignored.
For example, for each node within the us-central1 region, specify the datacenter in its cassandra-
rackdc.properties file:
Datacenter names are case-sensitive.


Node dc_suffix

node0 dc_suffix=_a_transactional

node1 dc_suffix=_a_transactional

node2 dc_suffix=_a_transactional

node3 dc_suffix=_a_transactional

node4 dc_suffix=_a_analytics

node5 dc_suffix=_a_search

Configuring the Apache CloudStack snitch


Use the CloudstackSnitch for Apache CloudStack environments. Because zone naming is free-form in Apache
CloudStack, this snitch uses the widely-used <country> <az> notation.

Setting system properties during startup


Use the system property (-D) switch to modify the DataStax Enterprise (DSE) settings during start up.
To automatically pass the settings each time DSE starts, uncomment or add the switch to the jvm.options file.

Synopsis
Change the start up parameters using the following syntax:

• Command line:

dse cassandra -Dparameter_name=value

• jvm.options file:

-Dparameter_name=value

• cassandra-env.sh file:

JVM_OPTS="$JVM_OPTS -Dparameter_name=value"

Only pass the parameter to the start-up operation once. If the same switch is passed to the start operation
multiple times, for example from both the jvm.options file and on the command line, DSE may fail to start or
may use the wrong parameter.

Startup examples
Starting a node without joining the ring:

• Command line:

dse cassandra -Dcassandra.join_ring=false

• jvm.options:

-Dcassandra.join_ring=false

Replacing a dead node:


• Command line:

dse cassandra -Dcassandra.replace_address=10.91.176.160

• jvm.options:

-Dcassandra.replace_address=10.91.176.160

Changing LDAP authentication retry interval from its default of 10 ms:

• Command line:

dse -Ddse.ldap.retry_interval.ms=20

• jvm.options:

-Ddse.ldap.retry_interval.ms=20

Cassandra system properties


Cassandra native Java Virtual Machine (JVM) system parameters.
-Dcassandra.auto_bootstrap
Set auto_bootstrap to false during the initial set up of a node to override the default setting in the
cassandra.yaml file.
Default: true.
-Dcassandra.available_processors
Number of processors available to DSE. In a multi-instance deployment, each instance independently
assumes that all CPU processors are available to it. Use this setting to specify a smaller set of
processors.
Default: all_processors.
-Dcassandra.config
Set to the directory location of the cassandra.yaml file.
Default: depends on the type of installation.
-Dcassandra.consistent.rangemovement
Set to true to make bootstrapping use consistent range movement.
Default: false.
-Ddse.consistent_replace
Specify the level of consistency required during a node replacement (ONE, QUORUM, or LOCAL_QUORUM).
The default value, ONE, may result in stale data but uses fewer system resources. If set to
QUORUM or LOCAL_QUORUM, the replacement node coordinates repair among a (local) quorum of replicas
concurrently with replacement streaming. Repair transfers the differences to the replacement node,
ensuring it is consistent with other replicas when the replacement process is finished, assuming data is
inserted using either QUORUM or LOCAL_QUORUM consistency levels.

The value for consistent replace should match the value for application read consistency.
Default: ONE
-Ddse.consistent_replace.parallelism
Specify how many ranges will be repaired simultaneously during a consistent replace. The higher
the parallelism, the more resources are consumed cluster-wide, which may affect overall cluster
performance. Used only in conjunction with -Ddse.consistent_replace.
Default: 2
-Ddse.consistent_replace.retries
Specify how many times a failed repair will be retried during a replace. If all retries fail, the replace fails.
Used only in conjunction with -Ddse.consistent_replace.
Default: 3
-Ddse.consistent_replace.whitelist


Specify keyspaces and tables on which to perform a consistent replace. The keyspaces and tables
can be specified as: “ks1, ks2.cf1”. The default is blank, in which case all keyspaces and tables are
replaced. Used only in conjunction with -Ddse.consistent_replace.
Default: blank (not set)
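As an illustrative sketch that combines the properties above, a dead node could be replaced with consistent streaming limited to two keyspaces. The address and keyspace names are placeholders.

dse cassandra -Dcassandra.replace_address=10.91.176.160 -Ddse.consistent_replace=LOCAL_QUORUM -Ddse.consistent_replace.whitelist="ks1, ks2.cf1"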
-Dcassandra.disable_auth_caches_remote_configuration
Set to true to disable remote configuration (via JMX) of the authentication caches, for example the caches
used for credentials, permissions, and roles. Those cache options can then only be set (persistently) in
cassandra.yaml and require a restart for new values to take effect.
Default: false.
-Dcassandra.expiration_date_overflow_policy
Set the policy (REJECT or CAP) for any TTL (time to live) timestamps that exceeds the maximum value
supported by the storage engine, 2038-01-19T03:14:06+00:00. The database storage engine can
only encode TTL timestamps through January 19 2038 03:14:07 UTC due to the Year 2038 problem.

• REJECT: Reject requests that contain an expiration timestamp later than


2038-01-19T03:14:06+00:00.

• CAP: Allow requests and insert expiration timestamps later than 2038-01-19T03:14:06+00:00 as
2038-01-19T03:14:06+00:00.

• CAP-NOWARN: Allow requests and insert expiration timestamps later than


2038-01-19T03:14:06+00:00 as 2038-01-19T03:14:06+00:00, but do not emit a warning.

Default: REJECT.
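For example, to cap rather than reject out-of-range TTLs, this switch could be added to jvm.options (a sketch):

-Dcassandra.expiration_date_overflow_policy=CAP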
-Dcassandra.force_default_indexing_page_size
Set to true to disable dynamic calculation of the page size used when indexing an entire partition
during initial index build or a rebuild. Fixes the page size to the default of 10000 rows per page.
Default: false.
-Dcassandra.ignore_dc
Set to true to ignore the datacenter name change on startup. Applies only when using
DseSimpleSnitch.
Default: false.
-Dcassandra.initial_token
Use when DSE is not using virtual nodes (vnodes). Set to the initial partitioner token for the node on the
first start up.
Default: blank (not set).
Vnodes automatically select tokens.
-Dcassandra.join_ring
Set to false to prevent the node from joining a ring on startup.
Add the node to the ring afterwards using nodetool join and a JMX call.
Default: true.
-Dcassandra.load_ring_state
Set to false to clear all gossip state for the node on restart.
Default: true.
-Dcassandra.metricsReporterConfigFile
Enables pluggable metrics reporter and configures it from the specified file.
Default: blank (not set).
-Dcassandra.native_transport_port
Set to the port number that CQL native transport listens for clients.
Default: 9042.
-Dcassandra.native_transport_startup_delay_seconds
Set to the number of seconds to delay the native transport server start up.
Default: 0 (no delay).
-Dcassandra.partitioner
Set to the partitioner name.
Default: org.apache.cassandra.dht.Murmur3Partitioner.
-Dcassandra.partition_sstables_by_token_range
Set to false to disable JBOD SSTable partitioning by token range to multiple data_file_directories.


Advanced setting that should only be used with guidance from DataStax Support.
Default: true.
-Dcassandra.printHeapHistogramOnOutOfMemoryError
Set to false to disable a heap histogram dump on an OutOfMemoryError.
Default: false.
-Dcassandra.replace_address
Set to the listen_address or the broadcast_address when replacing a dead node with a new node. The
new node must be in the same state as before bootstrapping, without any data in its data directory.
The broadcast_address defaults to the listen_address except when the ring uses the
Ec2MultiRegionSnitch (see Configuring Amazon EC2 multi-region snitch).
-Dcassandra.replace_address_first_boot
Same as -Dcassandra.replace_address but only runs the first time the Cassandra node boots.
This property is preferred over -Dcassandra.replace_address since it has no effect on subsequent
boots if it is not removed from jvm.options or cassandra-env.sh.
-Dcassandra.replayList
Allows restoring specific tables from an archived commit log.
-Dcassandra.ring_delay_ms
Set to the number of milliseconds the node waits to hear from other nodes before formally joining the
ring.
Default: 30000.
-Dcassandra.ssl_storage_port
Sets the SSL port for encrypted communication.
Default: 7001.
-Dcassandra.start_native_transport
Enables or disables the native transport server. See start_native_transport in cassandra.yaml.
Default: true.
-Dcassandra.storage_port
Sets the port for inter-node communication.
Default: 7000.
-Dcassandra.write_survey
Set to true to enable a tool for testing new compaction and compression strategies. write_survey
allows you to experiment with different strategies and benchmark write performance differences without
affecting the production workload. See Testing compaction and compression.
Default: false.
Java Management Extension system properties
DataStax Enterprise exposes metrics and management operations via Java Management Extensions (JMX).
JConsole and the nodetool utility are JMX-compliant management tools.
-Dcom.sun.management.jmxremote.port
Sets the port number on which the database listens for JMX connections.
By default, you can interact with DataStax Enterprise using JMX on port 7199 without authentication.
Default: 7199
-Dcom.sun.management.jmxremote.ssl
Change to true to enable SSL for JMX.
Default: false
-Dcom.sun.management.jmxremote.authenticate
True enables remote authentication for JMX.
Default: false
-Djava.rmi.server.hostname
Sets the interface hostname or IP that JMX should use to connect. Uncomment and set if you are
having trouble connecting.
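For example, a cassandra-env.sh sketch that enables authenticated, SSL-protected remote JMX bound to a specific interface. The hostname is a placeholder, and additional authentication setup (such as password files) may be required.

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=7199"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=true"
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=10.10.1.5"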
Search system properties
DataStax Enterprise (DSE) Search system properties.
-Ddse.search.client.timeout.secs


Set the timeout in seconds for native driver search core management calls using the dsetool search-
specific commands.
Default: 600 (10 minutes).
-Ddse.search.query.threads
Sets the number of Search queries that can execute in parallel. Consider increasing this value or
reducing client/driver requests per connection if EnqueuedRequestCount does not stabilize near zero.
Default: The default is two times the number of CPUs (including hyperthreading).
-Ddse.timeAllowed.enabled.default
The Solr timeAllowed option is enforced by default to prevent long-running shard queries (such as
complex facets and Boolean queries) from using system resources after they have timed out from the
DSE Search coordinator.
DSE Search checks the timeout per segment instead of during document or terms iteration. The
system property solr.timeAllowed.docsPerSample has been removed.
By default for all queries, the timeAllowed value is the same as the
internode_messaging_options.client_request_timeout_seconds setting in dse.yaml. For more
details, see Limiting queries by time.
Using the Solr timeAllowed parameter may cause a latency cost. If you find the cost for queries is
too high in your environment, consider setting the -Ddse.timeAllowed.enabled.default property
to false at DSE startup time. Or set timeAllowed.enable to false in the query.
Default: true.
-Ddse.solr.data.dir
Set the path to store DSE Search data. See Set the location of search indexes.
-Dsolr.offheap.enable
The DSE Search per-segment filter cache is moved off-heap by using native memory to reduce on-
heap memory consumption and garbage collection overhead. The off-heap filter cache is enabled by
default. To disable, set this property to false at startup time. When not set,
the default is true.
Default: true
Threads per core system properties
Tune TPC using the Netty system parameters.
-Ddse.io.aio.enable
Set to false to have all read operations use the AsynchronousFileChannel regardless of the
operating system or disk type.
The default setting true allows dynamic switching of libraries for read operations as follows:

• LibAIO on solid state drives (SSD) and EXT4/XFS

• AsynchronousFileChannel for read operations on hard disk drives and all non-Linux operating
systems

Use this advanced setting only with guidance from DataStax Support.
Default: true
-Ddse.io.aio.force
Set to true to force all read operations to use LibAIO regardless of the disk type or operating system.
Use this advanced setting only with guidance from DataStax Support.
Default: false
-Dnetty.eventloop.busy_extra_spins=N
Set to the number of iterations in the epoll event loops performed when queues are empty before
moving on to the next backoff stage. Increasing the value reduces latency while increasing CPU usage
when the loops are idle.
Default: 10
-Dnetty.epoll_check_interval_nanos


Sets the granularity, in nanoseconds, for calling an epoll select, which is a system call. Setting the value
too low impacts performance by making too many system calls. Setting the value too high
impacts performance by delaying the discovery of new events.
Default: 2000
-Dnetty.schedule_check_interval_nanos
Sets the granularity, in nanoseconds, for checking whether scheduled events are ready to execute. Specifying a
value below 1 nanosecond is not productive. Too high a value delays scheduled tasks.
Default: 1000
LDAP system properties for DataStax Enterprise Authentication
-Ddse.ldap.connection.timeout.ms
The number of milliseconds before the connection times out.
Default:
-Ddse.ldap.retry_interval.ms
Allows you to set the time in milliseconds between subsequent retries when authenticating via an LDAP
server.
Default: 10
-Ddse.ldap.pool.min.idle
Provides finer control over the connection pool for the DataStax Enterprise LDAP authentication connector. The
min idle setting determines the minimum number of connections allowed in the pool before the evictor
thread creates new connections. This setting has no effect if the evictor thread is not configured to run.
Default:
-Ddse.ldap.pool.exhausted.action
Determines what the pool does when it is full. It can be one of:

• fail - the pool will throw an exception

• block - the pool will block for max wait ms (default)

• grow - the pool will just keep growing (not recommended)

Default: block
-Ddse.ldap.pool.max.wait
When the dse.ldap.pool.exhausted.action is block, sets the number of milliseconds to block the
pool before throwing an exception.
Default:
-Ddse.ldap.pool.test.borrow
Tests a connection when it is borrowed from the pool.
Default:
-Ddse.ldap.pool.test.return
Tests a connection returned to the pool.
Default:
-Ddse.ldap.pool.test.idle
Tests any connections in the eviction loop that are not being evicted. Only works if the time between
eviction runs is greater than 0ms.
Default:
-Ddse.ldap.pool.time.between.evictions
Determines the time in ms (milliseconds) between eviction runs. When used with
dse.ldap.pool.test.idle, this becomes a basic keepalive for connections.
Default:
-Ddse.ldap.pool.num.tests.per.eviction
The number of connections in the pool that are tested during each eviction run. If this is set to the same value as max
active (the pool size), then all connections are tested on each eviction run.
Default:
-Ddse.ldap.pool.min.evictable.idle.time.ms
Determines the minimum time in ms (milliseconds) that a connection can sit in the pool before it
becomes available for eviction.
Default:
-Ddse.ldap.pool.soft.min.evictable.idle.time.ms


Determines the minimum time in ms (milliseconds) that a connection can sit in the pool before it
becomes available for eviction, with the proviso that the number of connections does not fall below
dse.ldap.pool.min.idle.
Default:
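For example, a jvm.options sketch that tunes LDAP connection pool behavior. The numeric values and the boolean for idle testing are illustrative assumptions, not recommendations.

-Ddse.ldap.pool.exhausted.action=block
-Ddse.ldap.pool.max.wait=5000
-Ddse.ldap.pool.time.between.evictions=30000
-Ddse.ldap.pool.test.idle=true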
Kerberos system properties
-Ddse.sasl.protocol
Kerberos principal name, in the form user@realm.
-Djava.security.auth.login.config
The path to the JAAS configuration file for DseClient.
NodeSync system parameters
-Ddse.nodesync.controller_update_interval_sec
Set the frequency to execute NodeSync auto-tuning process in seconds.
Default: 300 (5 minutes).
-Ddse.nodesync.log_reporter_interval_sec
Set the frequency of short INFO progress report in seconds.
Default: 600 (10 minutes).
-Ddse.nodesync.min_validation_interval_sec
Set to the minimum number of seconds between validations of the same segment, mostly to avoid busy
spinning on new/empty clusters.
Default: 300 (5 minutes).
-Ddse.nodesync.min_warn_interval_sec
Set to the minimum number of seconds between logging warnings.
Avoid logging warnings too often.
Default: 36000 (10 hours).
-Ddse.nodesync.rate_checker_interval_sec
Set the frequency, in seconds, at which the currently configured rate is compared to the tables and their
deadlines. A warning is logged if the rate is considered too low.
Default: 1800 (30 minutes).
-Ddse.nodesync.segment_lock_timeout_sec
Set the Time-to-live (TTL) on locks inserted in the status table in seconds.
Default: 600 (10 minutes).
-Ddse.nodesync.segment_size_target_bytes
Set to the targeted maximum size for segments in bytes.
Default: 26214400 (200 MB).
-Ddse.nodesync.size_checker_interval_sec
Set the frequency, in seconds, at which to check whether the depth used for a table should be updated due to
data size changes.
Default: 7200 (2 hours).
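For example, a jvm.options sketch that reports NodeSync progress every five minutes and raises the segment size target. The values are illustrative only.

-Ddse.nodesync.log_reporter_interval_sec=300
-Ddse.nodesync.segment_size_target_bytes=209715200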

Choosing a compaction strategy


To implement a compaction strategy, follow these steps:

1. Read how data is maintained to understand the compaction strategies.

2. Answer the questions below to determine the appropriate compaction strategy for each table.

3. Configure each table to use the appropriate compaction strategy.

4. Test the compaction strategy with your data.

Which compaction strategy is best?


The following questions are based on developer and user experience with the compaction strategies.
Does your table process time series data?


If the answer is yes, use TWCS (TimeWindowCompactionStrategy). If the answer is no, read the
following questions.
Does your table handle more reads than writes, or more writes than reads?
LCS (LeveledCompactionStrategy) is appropriate if there are twice or more reads than writes, especially
randomized reads. If the reads and writes are approximately equal, the performance penalty from LCS
may not be worth the benefit. Be aware that LCS can be overwhelmed by a high number of writes. One
advantage of LCS is that it keeps related data in a small set of SSTables.
Does the data in your table change often?
If your data is immutable or there are few upserts, use STCS (SizeTieredCompactionStrategy), which
does not have the write performance penalty of LCS.
Do you require predictable levels of read and write activity?
LCS keeps the SSTables within predictable sizes and numbers. For example, if your table's read and
write ratio is small, and the read activity is expected to conform to a Service Level Agreement (SLA), it
may be worth the LCS write performance penalty to keep read rates and latency at predictable levels.
And, you may be able to overcome the LCS write penalty by adding more nodes.
Will your table be populated by a batch process?
For batched reads and writes, STCS performs better than LCS. The batch process causes little or no
fragmentation, so the benefits of LCS are not realized; batch processes can overwhelm tables that use
LCS.
Does your system have limited disk space?
LCS handles disk space more efficiently than STCS: LCS requires about 10% headroom in addition to
the space occupied by the data. In some cases, STCS and DTCS (DateTieredCompactionStrategy) require
as much as 50% more headroom than the data space. (DTCS is deprecated.)
Is your system reaching its limits for input and output?
LCS is significantly more input and output intensive than DTCS or STCS. Switching to LCS may
introduce extra input and output load that offsets the advantages.
Configuring and running compaction
Set the table compaction strategy in the CREATE TABLE or ALTER TABLE statement parameters. See
table_options.
You can start compaction manually using the nodetool compact command.
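For example, a CQL sketch that switches a hypothetical time series table to TWCS. The keyspace, table name, and window settings are placeholders.

ALTER TABLE sensordata.readings
  WITH compaction = { 'class': 'TimeWindowCompactionStrategy',
                      'compaction_window_unit': 'DAYS',
                      'compaction_window_size': 1 };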
Testing compaction strategies
To test the compaction strategy:

• Create a three-node cluster using one of the compaction strategies, then stress test the cluster using
the cassandra-stress utility and measure the results.

• Set up a node on your existing cluster and enable the write survey mode option on the node to analyze live
data.

NodeSync service
About NodeSync
NodeSync is an easy-to-use continuous background repair that has low overhead, provides consistent
performance, and virtually eliminates the manual effort of running repair operations in a DataStax cluster.

• Continuously validates that data is in sync on all replicas.

• Always running, but with low impact on cluster performance.

• Fully automatic; no manual intervention needed.

• Completely replaces anti-entropy repairs.


For write-heavy workloads, where more than 20% of the operations are writes, you may notice CPU
consumption overhead associated with NodeSync. If that's the case for your environment, DataStax
recommends using nodetool repair instead of enabling NodeSync. See nodetool repair.

NodeSync service
By default, each node runs the NodeSync service. The service is idle unless it has something to validate.
NodeSync is enabled on a per-table basis. The service continuously validates local data ranges for NodeSync-
enabled tables and repairs any inconsistency found. The local data ranges are split into small segments, which
act as validation save points. Segments are prioritized in order to try to meet the per-table deadline target.
Segments
A segment is a small local token range of a table. NodeSync recursively splits local ranges in half a certain
number of times (depth) to create segments. The depth is calculated using the total table size, assuming equal
distribution of data. Typically segments cover no more than 200 MB. The token ranges can be no smaller than a
single partition, so very large partitions can result in segments larger than the configured size.
Validation process and status
After a segment is selected for validation, NodeSync reads the entirety of the data it covers from all replicas
(using paging), checks for inconsistencies, and repairs them if needed. When a node validates a segment, it “locks”
it in a system table to avoid work duplication by other nodes. It is not a race-free lock; some work may be
duplicated, which avoids the complexity and cost of true distributed locking.
Segment validation is saved on completion in the system_distributed.nodesync_status table, which is used
internally for resuming on failure, prioritization, segment locking, and by tools. It is not meant to be read directly.

• Validation status is:

# successful: All replicas responded and all inconsistencies (if any) were properly repaired.

# full_in_sync: All replicas were already in sync.

# full_repaired: Some replicas were repaired.

# unsuccessful: Either some replicas did not respond or repairs on inconsistent replicas failed.

# partial_in_sync: Not all replicas responded, but all respondents were in sync.

# partial_repaired: Not all replicas responded; some that responded were repaired.

# uncompleted: At most one node was available or responded; no validation happened.

# failed: Some unexpected errors occurred. (Check the node logs.)


If validation of a large segment is interrupted, the amount of redundant work that must be redone increases.

Limitations

• For debugging/tuning, understanding of traditional repair will be mostly unhelpful, since NodeSync depends
on the read repair path

• No special optimizations for remote DC - may perform poorly on particularly bad WAN links

• In aggregate, CPU consumption of NodeSync might exceed that of traditional repair

• NodeSync only makes internal adjustments to try to hit the configured rate - operators must ensure this
configured throughput is sufficient to meet the gc_grace_seconds commitment and can be achieved by the
hardware

Tables with NodeSync enabled will be skipped for repair operations run against all or specific keyspaces. For
individual tables, running the repair command will be rejected when NodeSync is enabled.


Starting and stopping the NodeSync service


The NodeSync service automatically starts with the dse cassandra command. You can manually start and stop
the service on each node.

1. Verify the status of the NodeSync service:

$ nodetool nodesyncservice status

The output should indicate running.

The NodeSync service is running

2. Disable the NodeSync service:

$ nodetool nodesyncservice disable

Disabling is not persistent: on the next restart of DataStax Enterprise (DSE), the NodeSync service starts up again.

3. Verify the status of the NodeSync service:

$ nodetool nodesyncservice status

The output should indicate not running.

The NodeSync service is not running

Enabling NodeSync validation


By default, NodeSync is disabled when a table is created. It is also disabled on tables that were migrated from
earlier versions. To continuously verify data consistency in the background without the need for anti-entropy
repairs, enable NodeSync on one or more tables.

Data only needs to be validated if the table is in more than one datacenter or is in a datacenter where the
keyspace has a replication factor of 2 or more.

• Enable on an existing table:

# Change the NodeSync setting on a single table using CQL syntax:

ALTER TABLE table_name WITH nodesync = {'enabled': 'true'};

# All tables in a keyspace using nodesync enable:

$ nodesync enable -v -k keyspace_name "*"

# A list of tables using nodesync enable:

$ nodesync enable keyspace_name.table_name keyspace_name.table_name

• Create a table with nodesync enabled:

CREATE TABLE table_name ( column_list ) WITH nodesync = {'enabled': 'true'};

Tuning NodeSync validations


NodeSync tries to validate all tables within their respective deadlines, while respecting the configured rate limit.
For example, if a table is 10 GB, has deadline_target_sec=10, and rate_in_kb is set to 1 MB/sec, validation
cannot happen quickly enough. Configure the rate and deadlines realistically: take data sizes into account and adapt
as data grows.
For write-heavy workloads, where more than 20% of the operations are writes, you may notice CPU
consumption overhead associated with NodeSync. If that's the case for your environment, DataStax
recommends using nodetool repair instead of enabling NodeSync. See nodetool repair.

NodeSync records warnings to the system.log, if it detects any of the following conditions:

• rate_in_kb is too low to validate all tables within their deadline, even under ideal circumstances.

• rate_in_kb cannot be sustained by the node (too high for the node load/hardware).

Setting the NodeSync rate


Estimating rate setting impact
The rate_in_kb setting controls the per-node rate of the local NodeSync service: the maximum number of bytes
per second used to validate data. There is a fundamental tradeoff between how fast NodeSync validates data
and how many resources it consumes. The rate is both a limit on the amount of resources used and a target that
NodeSync tries to achieve by auto-tuning its internals. The configured rate might not be achieved in practice,
because validation can complete at a slower rate on a new or small cluster, or the node might temporarily or
permanently lack the resources to sustain it.
Initial rate setting
There is no strong requirement to keep all nodes validating at the same rate. Some nodes will simply validate
more data than others. When setting the rate, use the simplest method first by using the defaults.

1. Check the rate_in_kb setting within the nodesync section in the cassandra.yaml file.

2. Try increasing or decreasing the value at run time:

$ nodetool nodesyncservice setrate value_in_kb_sec

3. Check the configured rate.

$ nodetool nodesyncservice getrate

The configured rate is different from the effective rate, which can be found in the NodeSync Service
metrics.
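
For example, to raise the rate on a node to 2048 KB per second and then confirm the configured value (the number is illustrative):

$ nodetool nodesyncservice setrate 2048
$ nodetool nodesyncservice getrate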

Simulating NodeSync rates


When adjusting rates, use the NodeSync rate simulator to help determine the configuration settings by
computing the rate necessary to validate all tables within their allowed deadlines.
Unfortunately, no perfect value exists because NodeSync also deals with many unknown or difficult to predict
factors, such as:

• Failures - When a node fails, it does not participate in NodeSync validation while it is offline.

• Temporary overloads - During periods of overload, such as unexpected events, nodes cannot achieve
the configured rate.

• Data size variation - The rate required to repair all tables within a fixed amount of time directly depends on
the size of the data to validate, which is typically a moving target.


All these factors can impact the overall NodeSync rate, so build safety margins into the configured
rate. The NodeSyncServiceRate simulator helps to set the rate.
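As a rough, illustrative sizing check: validating 200 GB of data per node within a 10-day deadline requires a sustained rate of at least 200 GB / (10 × 86,400 s) ≈ 243 KB/s; with a 2x to 3x safety margin to absorb failures, overloads, and data growth, a configured rate of roughly 500 to 750 KB/s would be a reasonable starting point. These numbers are examples only; substitute your own data size and deadline.
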
Setting the NodeSync deadline
Each table with NodeSync enabled has a deadline_target_sec property, which is the target for the maximum time
between two validations of the same data. As long as the deadline is met, all parts of the ring (for the table) are
validated at least that often.
The deadline (deadline_target_sec) relates to the grace period (gc_grace_seconds): the deadline should
always be less than or equal to the grace period. As long as the deadline is met, no data is resurrected due to
tombstone purging.
The deadline defaults to whichever is longer: the grace period or four days. This is typically an acceptable default,
unless the table has a grace period of zero. For testing, the deadline can be set lower than the grace period;
verify for a few weeks that a lower gc_grace value is realistic, without taking any risk, before actually changing it.
NodeSync prioritizes segments in order to try to meet the deadline: the next segment to validate at any given
time is the one closest to missing its deadline. For example, if table 1 has half the deadline of table 2, table
1 is validated approximately twice as often as table 2.
Use OpsCenter to get a graphical representation of the NodeSync validation status. See Viewing NodeSync
Status.
The syntax to change the per-table nodesync property:

ALTER TABLE table_name
WITH nodesync = { 'enabled': 'true',
                  'deadline_target_sec': value };

Manually starting NodeSync validation


Force NodeSync to repair specific segments. After a user validation is submitted, it takes precedence over
normal NodeSync work. Normal work resumes automatically after the validation finishes.
For write-heavy workloads, where more than 20% of the operations are writes, you may notice CPU
consumption overhead associated with NodeSync. If that's the case for your environment, DataStax
recommends using nodetool repair instead of enabling NodeSync. See nodetool repair.

This is an advanced tool. Usually, it is better to let NodeSync prioritize segments on its own.

• Submitting user validations:

$ nodesync validation submit keyspace_name.table_name

• Listing user validations:

$ nodesync validation list

• Canceling user validations:

$ nodesync validation cancel validation_id

See nodesync validation.

Using multiple network interfaces


Steps for configuring DataStax Enterprise for multiple network interfaces or when using different regions in cloud
implementations.
You must configure settings in both the cassandra.yaml file and the relevant property file:


• cassandra-rackdc.properties (GossipingPropertyFileSnitch, Ec2Snitch, or Ec2MultiRegionSnitch)

• cassandra-topology.properties (PropertyFileSnitch)

Configuring cassandra.yaml for multiple networks or across regions in cloud implementations

In multiple networks or cross-region cloud scenarios, communication between datacenters can only take
place using an external IP address. The external IP address is defined in the cassandra.yaml file using the
broadcast_address setting. Configure each node as follows:

1. In the cassandra.yaml file , set the listen_address to the private IP address of the node, and the
broadcast_address to the public address of the node.
This allows nodes to communicate with nodes in another network or region, thus enabling multiple datacenter support.
For intra-network or intra-region traffic, DSE switches to the private IP after establishing a connection.
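
A minimal sketch of the per-node settings in cassandra.yaml (the addresses are illustrative):

listen_address: 10.10.1.5         # private IP of this node
broadcast_address: 50.34.16.33    # public IP of this node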

2. Set the addresses of the seed nodes in the cassandra.yaml file to their public IPs. Private IP addresses are not
routable between networks. For example:

seeds: 50.34.16.33, 60.247.70.52

Do not make all nodes seeds, see Internode communications (gossip).

3. Be sure that the storage_port or ssl_storage_port is open on the public IP firewall.

Be sure to enable encryption and authentication when using public IPs. See Configuring SSL for node-to-node
connections. Another option is to use a custom VPN to have local, inter-region/datacenter IPs.

Additional cassandra.yaml configuration for non-EC2 implementations


If multiple network interfaces are used in a non-EC2 implementation, enable the listen_on_broadcast_address
option.

listen_on_broadcast_address: true

In non-EC2 environments, the public address to private address routing is not automatically enabled. Enabling
listen_on_broadcast_address allows DSE to listen on both listen_address and broadcast_address with
two network interfaces.
Configuring the snitch for multiple networks
External communication between the datacenters can only happen when using the broadcast_address (public IP).
The GossipingPropertyFileSnitch is recommended for production. The cassandra-rackdc.properties file defines
the datacenters used by this snitch. Enable the option prefer_local to ensure that traffic to broadcast_address
will re-route to listen_address.
For each node in the network, specify its datacenter in cassandra-rackdc.properties file.
In the example below, there are two datacenters and each datacenter is named for its workload. The datacenter
naming convention in this example is based on the workload. You can use other conventions, such as DC1, DC2
or 100, 200. (datacenter names are case-sensitive.)


Network A and Network B

Node and datacenter (each node uses the same datacenter and rack assignment in both networks):

• node0
  dc=DC_A_transactional
  rack=RAC1

• node1
  dc=DC_A_transactional
  rack=RAC1

• node2
  dc=DC_B_transactional
  rack=RAC1

• node3
  dc=DC_B_transactional
  rack=RAC1

• node4
  dc=DC_A_analytics
  rack=RAC1

• node5
  dc=DC_A_search
  rack=RAC1
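
As a concrete sketch, node0's cassandra-rackdc.properties could contain the following entries (prefer_local is optional and shown here as discussed above):

dc=DC_A_transactional
rack=RAC1
prefer_local=true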

Configuring the snitch for cross-region communication in cloud implementations


Be sure to use the appropriate snitch for your implementation. If deploying on Amazon EC2, see the
instructions in Ec2MultiRegionSnitch.

In cloud deployments, the region name is treated as the datacenter name and availability zones are treated as
racks within a datacenter. For example, if a node is in the us-east-1 region, us-east is the datacenter name and 1
is the rack location. (Racks are important for distributing replicas, but not for datacenter naming.)
In the example below, there are two DataStax Enterprise datacenters and each datacenter is named for its
workload. The datacenter naming convention in this example is based on the workload. You can use other
conventions, such as DC1, DC2 or 100, 200. (Datacenter names are case-sensitive.)
For each node, specify its datacenter in the cassandra-rackdc.properties. The dc_suffix option defines the
datacenters used by the snitch. Any other lines are ignored.


Region: us-east and Region: us-west

Node and datacenter (the dc_suffix value is the same in both regions; the region name supplies the datacenter prefix):

• node0
  dc_suffix=_1_transactional

• node1
  dc_suffix=_1_transactional

• node2
  dc_suffix=_2_transactional

• node3
  dc_suffix=_2_transactional

• node4
  dc_suffix=_1_analytics

• node5
  dc_suffix=_1_search

This results in four datacenters in each region:

us-east: us-east_1_transactional, us-east_2_transactional, us-east_1_analytics, us-east_1_search
us-west: us-west_1_transactional, us-west_2_transactional, us-west_1_analytics, us-west_1_search
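
For example, node0's cassandra-rackdc.properties in either region could contain only the suffix line; the snitch derives the region and availability zone automatically (the value is illustrative):

dc_suffix=_1_transactional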

Configuring gossip settings


When a node first starts up, it looks at its cassandra.yaml configuration file to determine the name of the cluster it
belongs to; which nodes (called seeds) to contact to obtain information about the other nodes in the cluster; and
other parameters for determining port and range information.

1. In the cassandra.yaml file, set the following parameters:

Property Description

cluster_name Name of the cluster that this node is joining. Must be the same for every node in the
cluster.

listen_address The IP address or hostname that the database binds to for connecting this node to other
nodes.

listen_interface Use this option instead of listen_address to specify the network interface by name, rather
than address/hostname

(Optional) broadcast_address The public IP address this node uses to broadcast to other nodes outside the network
or across regions in multiple-region EC2 deployments. If this property is commented
out, the node uses the same IP address or hostname as listen_address. A node
does not need a separate broadcast_address in a single-node or single-datacenter
installation, or in an EC2-based network that supports automatic switching between
private and public communication. It is necessary to set a separate listen_address and
broadcast_address on a node with multiple physical network interfaces or other topologies
where not all nodes have access to other nodes by their private IP addresses. For specific
configurations, see the instructions for listen_address. The default is the listen_address.

seed_provider The -seeds list is a comma-delimited list of hosts (IP addresses) that gossip uses to learn the
topology of the ring. Every node should have the same list of seeds.
Making every node a seed node is not recommended because of increased
maintenance and reduced gossip performance. Gossip optimization is not critical, but
it is recommended to use a small seed list (approximately three nodes per datacenter).

storage_port The inter-node communication port (default is 7000). Must be the same for every node in
the cluster.


initial_token For legacy clusters. Set this property for single-node-per-token architecture, in which a
node owns exactly one contiguous range in the ring space.

num_tokens For new clusters. The number of tokens randomly assigned to this node in a cluster that
uses virtual nodes (vnodes).
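
A minimal sketch of these settings in cassandra.yaml, assuming illustrative addresses and a vnode-enabled datacenter:

cluster_name: 'MyCluster'
listen_address: 10.10.1.5
# broadcast_address: 50.34.16.33   # only needed when nodes must be reached on a public IP
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.10.1.5,10.10.2.7"
storage_port: 7000
num_tokens: 8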

Configuring the heap dump directory


Analyzing the heap dump file can help troubleshoot memory problems. Java starts with the
-XX:+HeapDumpOnOutOfMemoryError option, which triggers a heap dump in the event of an out-of-memory
condition. The heap dump file consists of references to objects that cause the heap to overflow. By default, the
database puts the file in a subdirectory of the working, root directory when running as a service. If the database
does not have write permission to the root directory, the heap dump fails. If the root directory is too small to
accommodate the heap dump, the server crashes.
The DataStax Help Center also provides troubleshooting information.
To ensure that a heap dump succeeds and to prevent crashes, configure a heap dump directory that is:

• Accessible to the database for writing

• Large enough to accommodate a heap dump

Base the size of the directory on the value of the Java maximum heap size option (-Xmx).

Set the location of the heap dump in the cassandra-env.sh file.

1. Open the cassandra-env.sh file for editing.

2. Scroll down to the comment about the heap dump path:

# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR


if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR/cassandra-`date +%s`-pid$
$.hprof"
fi

3. On the line after the comment, set the CASSANDRA_HEAPDUMP_DIR to the desired path:

# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR


export CASSANDRA_HEAPDUMP_DIR=path
if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR/cassandra-`date +%s`-pid$
$.hprof"
fi

4. Save the cassandra-env.sh file and restart.
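
For example, assuming a dedicated directory with enough free space (the path is illustrative):

export CASSANDRA_HEAPDUMP_DIR=/var/lib/cassandra/heapdumps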

Configuring Virtual Nodes


Virtual node (vnode) configuration
Virtual nodes simplify many tasks in DataStax Enterprise, such as eliminating the need to determine the partition
range (calculate and assign tokens), rebalancing the cluster when adding or removing nodes, and replacing dead
nodes. For a complete description of virtual nodes and how they work, see Virtual nodes.


DataStax Enterprise requires the same token architecture on all nodes in a datacenter: the nodes must either all
use vnodes or all use single-token architecture. Across the entire cluster, datacenter architecture can vary. For
example, a single cluster with:

• A transaction-only datacenter running OLTP.

• A single-token architecture search datacenter (no vnodes).

• An analytics datacenter with vnodes.

Guidelines for using virtual nodes


• DSE requires the same token architecture on all nodes in a datacenter.
The nodes must all be vnode-enabled or single-token architecture. Across the entire cluster, datacenter
architecture can vary.
For example, a single cluster with:

# A transaction-only datacenter running OLTP.

# A single-token architecture search datacenter (no vnodes).

# An analytics datacenter with vnodes.

• DataStax recommends using 8 vnodes (tokens).


DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured for your
environment.

Using 8 vnodes distributes the workload between systems with a ~10% variance and has minimal impact on
performance.

• Ensure correct vnode configuration with cassandra.yaml settings:

# When adding a vnode to an existing cluster or setting up nodes in a new datacenter,


set the target replication factor (RF) of keyspaces in the datacenter with the
allocate_tokens_for_local_replication_factor option.

# The allocation algorithm distributes the token ranges proportionately using the num_tokens setting.
All systems in the datacenter should have the same num_tokens setting unless performance varies
between systems. To distribute more of the workload to higher-performance hardware, increase the
number of tokens for those systems.
The allocation algorithm efficiently balances the workload using fewer tokens; when systems are added
to a datacenter, the algorithm maintains the balance. Using a higher number of tokens distributes the
workload more evenly, but also significantly increases token management overhead.
Set the number of vnode tokens based on the workload distribution requirements of the datacenter:
Table 12: Allocation algorithm workload distribution variance
Replication factor    4 vnodes (tokens)    8 vnodes (tokens)    64 vnodes (tokens)    128 vnodes (tokens)

2                     ~17.5%               ~12.5%               ~3%                   ~1%

3                     ~14%                 ~10%                 ~2%                   ~1%

5                     ~11%                 ~7%                  ~1%                   ~1%

• Add nodes to the cluster one at a time.


When adding multiple nodes to the cluster using the allocation algorithm, ensure that nodes are added
one at a time. If nodes are added concurrently, the algorithm assigns the same tokens to different nodes.


Enabling vnodes
In the cassandra.yaml file:

1. Uncomment num_tokens and set the required number of tokens.

2. (Recommended) To use the allocation algorithm uncomment allocate_tokens_for_local_replication_factor


and set it to the target replication factor for the keyspaces in the datacenter. If the replication varies,
alternate between the replication factor (RF) settings.

3. Comment out the initial_token or leave unset.
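
A minimal sketch of the resulting cassandra.yaml settings, assuming a target replication factor of 3 (values are illustrative):

num_tokens: 8
allocate_tokens_for_local_replication_factor: 3
# initial_token: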

To upgrade existing clusters to vnodes, see Enabling virtual nodes on an existing production cluster.
Disabling vnodes
If you do not use vnodes, you must make sure that each node is responsible for roughly an equal amount of
data. To ensure that each node is responsible for an equal amount of data, assign each node an initial-token
value and calculate the tokens for each datacenter as described in Generating tokens.

1. In the cassandra.yaml file:

a. Comment out the num_tokens and allocate_tokens_for_local_replication_factor.

b. Uncomment the initial_token and set it to 1 or to the value of a generated token for a multi-node cluster.
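
A minimal sketch of the corresponding cassandra.yaml settings for single-token architecture (the token value is illustrative; generate one for each node):

# num_tokens: 8
# allocate_tokens_for_local_replication_factor: 3
initial_token: -9223372036854775808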

Enabling virtual nodes on an existing production cluster


You cannot directly convert single-token nodes to vnodes. However, you can add another datacenter that is
configured with vnodes already enabled, and allow the existing data to be distributed automatically onto the new
nodes. This method has the least impact on performance.

DataStax recommends not using vnodes with DSE Search. However, if you decide to use vnodes with DSE
Search, do not use more than 8 vnodes and ensure that allocate_tokens_for_local_replication_factor option in
cassandra.yaml is correctly configured for your environment.

1. Add a new datacenter to the cluster.

2. Once the new datacenter with vnodes enabled is up, switch your clients to use the new datacenter.

3. Run a full repair with nodetool repair.


This step ensures that after you move the client to the new datacenter that any previous writes are
added to the new datacenter and that nothing else, such as hints, is dropped when you remove the old
datacenter.

4. Update your schema to no longer reference the old datacenter.

5. Remove the old datacenter from the cluster.


See Decommissioning a datacenter.

Logging configuration
Changing logging locations
Logging locations are set at installation. Generally, the default logs location is /var/log. For example, /var/
log/cassandra and /var/log/tomcat.
For details, see Default file locations for package installations and Default file locations for tarball installations.
You can also change logging locations with OpsCenter Configuration Profiles.


1. To change logging locations after installation:

• To generate all logs in the same location, add CASSANDRA_LOG_DIR to the dse-env.sh file:

export CASSANDRA_LOG_DIR="/your/log/location"

• For finer-grained control, edit the logback.xml file and replace ${cassandra.logdir} with the path.

2. To change the Tomcat server log locations for DSE Search, edit one of these files:

• Set TOMCAT_LOGS in the cassandra-env.sh file:

export TOMCAT_LOGS="/your/log/location"

• Set the locations in resources/tomcat/conf/logging.properties.

3. After you change logging locations, restart DataStax Enterprise.

Configuring logging
Logging functionality uses Simple Logging Facade for Java (SLF4J) with a logback backend. Logs are written
to the system.log and debug.log in the logging directory. You can configure logging programmatically or
manually. Manual ways to configure logging are:

• Run the nodetool setlogginglevel command.

• Configure the logback-test.xml or logback.xml file installed with DataStax Enterprise.

• Use the JConsole tool to configure logging through JMX.

Logback looks for the logback-test.xml file first, and then for the logback.xml file.
The following example details the XML configuration of the logback.xml file:

<configuration scan="true">
<jmxConfigurator />

<!-- SYSTEMLOG rolling file appender to system.log (INFO level) -->

<appender name="SYSTEMLOG" class="ch.qos.logback.core.rolling.RollingFileAppender">


<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<file>${cassandra.logdir}/system.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/system.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>20</maxIndex>
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>20MB</maxFileSize>
</triggeringPolicy>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
</encoder>
</appender>

<!-- DEBUGLOG rolling file appender to debug.log (all levels) -->

<appender name="DEBUGLOG" class="ch.qos.logback.core.rolling.RollingFileAppender">


<file>${cassandra.logdir}/debug.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/debug.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>20</maxIndex>


</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>20MB</maxFileSize>
</triggeringPolicy>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
</encoder>
</appender>

<!-- ASYNCLOG assynchronous appender to debug.log (all levels) -->

<appender name="ASYNCDEBUGLOG" class="ch.qos.logback.classic.AsyncAppender">


<queueSize>1024</queueSize>
<discardingThreshold>0</discardingThreshold>
<includeCallerData>true</includeCallerData>
<appender-ref ref="DEBUGLOG" />
</appender>

<!-- STDOUT console appender to stdout (INFO level) -->

<if condition='isDefined("dse.console.useColors")'>
<then>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<withJansi>true</withJansi>
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<encoder>
<pattern>%highlight(%-5level) [%thread] %green(%date{ISO8601})
%yellow(%X{service}) %F:%L - %msg%n</pattern>
</encoder>
</appender>
</then>
</if>
<if condition='isNull("dse.console.useColors")'>
<then>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>INFO</level>
</filter>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
</encoder>
</appender>
</then>
</if>

<include file="${SPARK_SERVER_LOGBACK_CONF_FILE}"/>
<include file="${GREMLIN_SERVER_LOGBACK_CONF_FILE}"/>

<!-- Uncomment the LogbackMetrics appender and the corresponding appender-ref in the
root to activate
<appender name="LogbackMetrics"
class="com.codahale.metrics.logback.InstrumentedAppender" />
-->

<root level="${logback.root.level:-INFO}">
<appender-ref ref="SYSTEMLOG" />
<appender-ref ref="STDOUT" />
<!-- Comment out the ASYNCDEBUGLOG appender to disable debug.log -->
<appender-ref ref="ASYNCDEBUGLOG" />
<!-- Uncomment LogbackMetrics and its associated appender to enable metric collecting for
logs. -->
<!-- <appender-ref ref="LogbackMetrics" /> -->
<appender-ref ref="SparkMasterFileAppender" />
<appender-ref ref="SparkWorkerFileAppender" />


<appender-ref ref="GremlinServerFileAppender" />


</root>

<!--audit log-->
<appender name="SLF4JAuditWriterAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${cassandra.logdir}/audit/audit.log</file>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
<immediateFlush>true</immediateFlush>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/audit/audit.log.%i.zip</fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>5</maxIndex>
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>200MB</maxFileSize>
</triggeringPolicy>
</appender>

<logger name="SLF4JAuditWriter" level="INFO" additivity="false">


<appender-ref ref="SLF4JAuditWriterAppender"/>
</logger>

<appender name="DroppedAuditEventAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender" prudent=$
<file>${cassandra.logdir}/audit/dropped-events.log</file>
<encoder>
<pattern>%-5level [%thread] %date{ISO8601} %X{service} %F:%L - %msg%n</pattern>
<immediateFlush>true</immediateFlush>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
<fileNamePattern>${cassandra.logdir}/audit/dropped-events.log.%i.zip</
fileNamePattern>
<minIndex>1</minIndex>
<maxIndex>5</maxIndex>
</rollingPolicy>
<triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
<maxFileSize>200MB</maxFileSize>
</triggeringPolicy>
</appender>

<logger name="DroppedAuditEventLogger" level="INFO" additivity="false">


<appender-ref ref="DroppedAuditEventAppender"/>
</logger>

<logger name="org.apache.cassandra" level="DEBUG"/>


<logger name="com.datastax.bdp.db" level="DEBUG"/>
<logger name="com.datastax.driver.core.NettyUtil" level="ERROR"/>
<logger name="com.datastax.bdp.search.solr.metrics.SolrMetricsEventListener"
level="DEBUG"/>
<logger name="org.apache.solr.core.CassandraSolrConfig" level="WARN"/>
<logger name="org.apache.solr.core.SolrCore" level="WARN"/>
<logger name="org.apache.solr.core.RequestHandlers" level="WARN"/>
<logger name="org.apache.solr.handler.component" level="WARN"/>
<logger name="org.apache.solr.search.SolrIndexSearcher" level="WARN"/>
<logger name="org.apache.solr.update" level="WARN"/>
<logger name="org.apache.lucene.index" level="INFO"/>
<logger name="com.cryptsoft" level="OFF"/>
<logger name="org.apache.spark.rpc" level="ERROR"/>


</configuration>

The appender configurations specify where logs are written and how. Each appender is defined with
<appender name="appender_name">; the appenders are described as follows.
SYSTEMLOG
Directs logs and ensures that WARN and ERROR messages are written synchronously to the /var/
log/cassandra/system.log file.
DEBUGLOG | ASYNCDEBUGLOG
Generates the /var/log/cassandra/debug.log file, which contains an asynchronous log of events
written to the system.log file, plus production logging information useful for debugging issues.
STDOUT
Directs logs to the console in a human-readable format.
LogbackMetrics
Records the rate of logged events by their logging level.
SLF4JAuditWriterAppender | DroppedAuditEventAppender
Used by the audit logging functionality. See Setting up database auditing for more information.
The following logging functionality is configurable:

• Rolling policy

# The policy for rolling logs over to an archive

# Location and name of the log file

# Location and name of the archive

# Minimum and maximum file size to trigger rolling

• Format of the message

• The log level

Log levels
The valid values for setting the log level include ALL for logging information at all levels, TRACE through
ERROR, and OFF for no logging. TRACE creates the most verbose log, and ERROR, the least.

• ALL

• TRACE

• DEBUG

• INFO (Default)

• WARN

• ERROR

• OFF

When the level is set to TRACE or DEBUG, output appears only in the debug.log. When set to INFO, the debug.log is
disabled.

Increasing logging levels can generate heavy logging output on a moderately trafficked cluster.

Use the nodetool getlogginglevels command to see the current logging configuration.

$ nodetool getlogginglevels
Logger Name Log Level
ROOT INFO


com.thinkaurelius.thrift ERROR

To add debug logging to a class permanently using the logback framework, use nodetool setlogginglevel to
confirm the component or class name before setting it in the logback.xml file in installation_location/conf. Modify
the file to include the following line, or similar, at the end of the file:

<logger name="org.apache.cassandra.gms.FailureDetector" level="DEBUG"/>

Restart the node to invoke the change.
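
To change a logging level at runtime instead, without editing logback.xml or restarting the node (the change reverts at the next restart), you can use, for example:

$ nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector DEBUG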


Migrating to logback from log4j
If you upgrade from an earlier version that used log4j, you can convert log4j.properties files to logback.xml
using the logback PropertiesTranslator web-application.
Using log file rotation
The default policy rolls the system.log file after the size exceeds 20MB. Archives are compressed in zip format.
Logback names the log files system.log.1.zip, system.log.2.zip, and so on. For more information, see
logback documentation.
Enabling extended compaction logging
To configure collection of in-depth information about compaction activity on a node, and write it to a dedicated
log file, see the log_all property for compaction.
Commit log archive configuration
DataStax Enterprise provides commit log archiving and point-in-time recovery. The commit log is archived at
node startup and when a commit log is written to disk, or at a specified point-in-time. You configure this feature in
the commitlog_archiving.properties configuration file.
The archive_command and restore_command commands expect only a single command with arguments, and the
parameters must be entered verbatim. STDOUT and STDIN cannot be used, and multiple commands cannot be
executed directly. As a workaround, you can place multiple commands in a script and point the command at that
script. To disable a command, leave it blank.
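
For example, to run more than one command during archiving, point archive_command at a wrapper script; the script name below is hypothetical:

archive_command=/usr/local/bin/archive_commitlog.sh %path %name

where the script might, for instance, copy the segment to a backup location and then compress it.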

• Archive a commit log segment:

Command archive_command=

Parameters %path Fully qualified path of the segment to archive.

%name Name of the commit log.

Example archive_command=/bin/ln %path /backup/%name

• Restore an archived commit log:

Command restore_command=

Parameters %from Fully qualified path of the archived commitlog segment from the restore_directories.

%to Name of live commit log directory.

Example restore_command=cp -f %from %to

• Set the restore directory location:

Command restore_directories=

Format restore_directories=restore_directory_location

• Restore mutations created up to and including the specified timestamp:


Command restore_point_in_time=

Format <timestamp> (YYYY:MM:DD HH:MM:SS)

Example restore_point_in_time=2013:12:11 17:00:00

Restore stops when the first client-supplied timestamp is greater than the restore point timestamp.
Because the order in which the database receives mutations does not strictly follow the timestamp order,
this can leave some mutations unrecovered.

Change Data Capture (CDC) logging


Change Data Capture (CDC) logging captures and tracks data that has changed. CDC logging is configured
per table, with limits on the amount of disk space to consume for storing the CDC logs. CDC logs use the same
binary format as the commit log.
Upon flushing the memtable to disk, CommitLogSegments that contain data for CDC-enabled tables are moved
to the configured cdc_raw directory. After the disk space limit is reached, CDC-enabled tables reject writes until
space is freed.
Prerequisites: Before enabling CDC logging, define a plan for moving and consuming the CDC log information.
DataStax recommends a physical device for the CDC log that is separate from the data directories.

1. Enable CDC logging and configure CDC directories and space in cassandra.yaml.
For example, to enable CDC logging with default values:

cdc_enabled: true
cdc_total_space_in_mb: 4096
cdc_free_space_check_interval_ms: 250
cdc_raw_directory: /var/lib/cassandra/cdc_raw

2. To enable CDC logging for a database table, create or alter the table with the cdc table property.
For example, to enable CDC logging on the cycling table:

ALTER TABLE cycling WITH cdc=true;
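
CDC can also be enabled when a table is first created. A minimal sketch (the cycling.comments table and its columns are illustrative):

CREATE TABLE cycling.comments (
  id uuid PRIMARY KEY,
  comment text
) WITH cdc = true;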

Chapter 5. Initializing a DataStax Enterprise cluster
Complete the following tasks before initializing a DSE cluster.

• Establish a firm understanding of how the database works. Be sure to read at least Understanding the
database architecture and Data replication.

• Ensure the environment is suitable for the use case and workload.

• Review recommended production settings.

• Choose a name for the cluster.

• For a mixed-workload cluster, determine the purpose of each node.

• Determine the snitch and replication strategy. The GossipingPropertyFileSnitch and NetworkTopologyStrategy
are recommended for production environments.

• Obtain the IP address of each node.

• Ensure that DataStax Enterprise is installed on each node.

• Determine which nodes are seed nodes. Do not make all nodes seed nodes.
Seed nodes are not required for DSE Search datacenters, see Internode communications (gossip).

• Review and make appropriate changes to other property files, such as cassandra-rackdc.properties.

• Set virtual nodes correctly for the type of datacenter. DataStax recommends using 8 vnodes (tokens). See
Virtual nodes for more information.

Initializing datacenters
In most circumstances, each workload type, such as search, analytics, and transactional, should be organized
into separate virtual datacenters. Workload segregation avoids contention for resources. However, workloads can
be combined in SearchAnalytics nodes when there is not a large demand for analytics, or when analytics queries
must use a DSE Search index. Generally, combining transactional (OLTP) and analytics (OLAP) workloads
results in decreased performance.
When creating a keyspace using CQL, DataStax Enterprise creates a virtual datacenter for a cluster, even a one-
node cluster, automatically. You assign nodes that run the same type of workload to the same datacenter. The
separate, virtual datacenters for different types of nodes segregate workloads that run DSE Search from those
nodes that run other workload types.
Single datacenter per workload type
Single datacenter deployments (one datacenter for each workload type) are useful when the cluster uses a single physical datacenter.
Multiple datacenters per workload type
If using multiple physical datacenters, consider multiple datacenter deployments.
The following scenarios describe some benefits of using multiple, physical datacenters:

• Isolating replicas from external infrastructure failures, such as networking between datacenters and power
outages.

• Distributing data replication across multiple, geographically-dispersed nodes.

• Adding separation between different physical racks in a physical datacenter.

• Diversifying assets between public cloud providers and on-premise managed datacenters.


• Preventing the slow down of a real-time analytics cluster by a development cluster running analytics jobs on
live data.

• Using virtual datacenters in the physical datacenter to ensure that reads from a specific datacenter stay local to the
requests, especially when using a consistency level greater than ONE. This strategy ensures lower latency
because it avoids, for example, one read from a node in New York and another read from a node in Los Angeles.

Initializing a single datacenter per workload type


In this scenario, a mixed workload cluster has only one datacenter for each type of workload. For example, an
eight-node cluster with the following nodes would use three datacenters, one for each workload type:

• DC1 = 3 DSE Analytics nodes

• DC2 = 3 Transactional nodes

• DC3 = 2 DSE Search nodes

In contrast, a multiple datacenter cluster has more than one datacenter for each type of workload.
The eight-node cluster spans two racks across three datacenters. Applications in each datacenter will use a
default consistency level of LOCAL_QUORUM. One node per rack will serve as a seed node.

Node IP address Type Seed Rack

node0 110.82.155.0 Transactional # RAC1

node1 110.82.155.1 Transactional RAC1

node2 110.54.125.1 Transactional RAC2

node3 110.54.125.2 Analytics RAC1

node4 110.54.155.2 Analytics # RAC2

node5 110.82.155.3 Analytics RAC1

node6 110.54.125.3 Search RAC1

node7 110.82.155.4 Search RAC2

Prerequisites:
To prepare the environment, complete the prerequisite tasks outlined in Initializing a DataStax Enterprise
cluster.

If the new datacenter uses existing nodes from another datacenter or cluster, complete the following steps to
ensure that old data will not interfere with the new cluster:

1. If the nodes are behind a firewall, open the required ports for internal/external communication.

2. Decommission each node that will be added to the new datacenter.

3. Clear the data from DataStax Enterprise (DSE) to completely remove application directories.

4. Install DSE on each node.

1. Complete the following steps to prevent client applications from prematurely connecting to the new
datacenter, and to ensure that the consistency level for reads or writes does not query the new datacenter:

If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.

a. Configure client applications to use the DCAwareRoundRobinPolicy.


b. Direct clients to an existing datacenter. Otherwise, clients might try to access the new datacenter,
which might not have any data.

c. If using the QUORUM consistency level, change to LOCAL_QUORUM.

d. If using the ONE consistency level, set to LOCAL_ONE.

See the programming instructions for your driver.

2. Configure every keyspace using SimpleStrategy to use the NetworkTopologyStrategy replication strategy,
including (but not restricted to) the following keyspaces.
If SimpleStrategy was used previously, this step is required to configure NetworkTopologyStrategy.

a. Use ALTER KEYSPACE to change the keyspace replication strategy to NetworkTopologyStrategy for
the following keyspaces.

ALTER KEYSPACE keyspace_name WITH REPLICATION =
    {'class' : 'NetworkTopologyStrategy', 'ExistingDC1' : 3};

• DSE security: system_auth, dse_security

• DSE performance: dse_perf

• DSE analytics: dse_leases, dsefs

• System resources: system_traces, system_distributed

• OpsCenter (if installed)

• All keyspaces created by users

b. Use DESCRIBE SCHEMA to check the replication strategy of keyspaces in the cluster. Ensure that any
existing keyspaces use the NetworkTopologyStrategy replication strategy.

DESCRIBE SCHEMA ;

CREATE KEYSPACE dse_perf WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dse_leases WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dsefs WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dse_security WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;

3. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.

Use the same version of DSE on all nodes in the cluster.

4. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.


Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.

a. Configure node properties:

• -seeds: internal_IP_address of each seed node


Include at least one seed node from each datacenter. DataStax recommends more than
one seed node per datacenter, in more than one rack. Do not make all nodes seed
nodes.

• auto_bootstrap: true
This setting has been removed from the default configuration, but, if present, should be set
to true.

• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.

• endpoint_snitch: snitch
See endpoint_snitch and snitches.

Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.

Snitch Configuration file

GossipingPropertyFileSnitch cassandra-rackdc.properties file

Amazon EC2 single-region snitch

Amazon EC2 multi-region snitch

Google Cloud Platform snitch

PropertyFileSnitch cassandra-topology.properties file

• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.

b. Configure node architecture (all nodes in the datacenter must use the same type):
Virtual node (vnode) allocation algorithm settings

• Set num_tokens to 8 (recommended).

• Set allocate_tokens_for_local_replication_factor to the target replication factor for keyspaces


in the new datacenter. If the keyspace RF varies, alternate the settings to use all the
replication factors.

• Comment out the initial_token property.

DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured
for your environment.


For more information, refer to Virtual node (vnode) configuration.


Single-token architecture settings

• Generate the initial token for each node and set this value for the initial_token property.
See Adding or replacing single-token nodes for more information.

• Comment out both num_tokens and allocate_tokens_for_local_replication_factor.

5. In the cassandra-rackdc.properties (GossipingPropertyFileSnitch) or cassandra-topology.properties


(PropertyFileSnitch) file, assign datacenter and rack names to the IP addresses of each node, and assign a
default datacenter name and rack name for unknown nodes.

Migration information: The GossipingPropertyFileSnitch always loads cassandra-


topology.properties when the file is present. Remove the file from each node on any new cluster,
or any cluster migrated from the PropertyFileSnitch.

# Transactional Node IP=Datacenter:Rack


110.82.155.0=DC_Transactional:RAC1
110.82.155.1=DC_Transactional:RAC1
110.54.125.1=DC_Transactional:RAC2
110.54.125.2=DC_Analytics:RAC1
110.54.155.2=DC_Analytics:RAC2
110.82.155.3=DC_Analytics:RAC1
110.54.125.3=DC_Search:RAC1
110.82.155.4=DC_Search:RAC2

# default for unknown nodes


default=DC1:RAC1

After making any changes in the configuration files, you must restart the node for the changes to
take effect.

6. Make the following changes in the existing datacenters.

a. On nodes in the existing datacenters, update the -seeds property in cassandra.yaml to include the
seed nodes in the new datacenter.

b. Add the new datacenter definition to the properties file for the type of snitch used in the cluster (for
example, cassandra-rackdc.properties or cassandra-topology.properties). If changing snitches, see Switching snitches.

7. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a
time, and then start the rest of the nodes:

• Package installations: Starting DataStax Enterprise as a service

• Tarball installations: Starting DataStax Enterprise as a stand-alone process

8. Continue starting DSE, rotating through the racks, until all the nodes are up.

9. After all nodes are running in the cluster and the client applications are datacenter aware, use cqlsh to alter
the keyspaces to add the desired replication in the new datacenter.

ALTER KEYSPACE keyspace_name WITH REPLICATION =


{'class' : 'NetworkTopologyStrategy', 'ExistingDC1' : 3, 'NewDC2' : 2};

If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.

10. Run nodetool rebuild on each node in the new datacenter, specifying the datacenter to rebuild from. This
step replicates the data to the new datacenter in the cluster.

$ nodetool rebuild -- datacenter_name

You must specify an existing datacenter in the command line, or the new nodes will appear to rebuild
successfully, but might not contain all anticipated data.
Requests to the new datacenter with LOCAL_ONE or ONE consistency levels can fail if the existing
datacenters are not completely in-sync.

a. Use nodetool rebuild on one or more nodes at the same time. Run on one node at a time to
reduce the impact on the existing cluster.

b. Alternatively, run the command on multiple nodes simultaneously when the cluster can handle the
extra I/O and network pressure.

11. Check that your cluster is up and running:

$ dsetool status

If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.

12. Complete steps 3 through 11 to add the third datacenter (DC3) to the cluster.

The datacenters in the cluster are now replicating with each other.

DC: Cassandra Workload: Cassandra Graph: no


==============================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 110.82.155.0 21.33 KB 256 33.3% a9fa31c7-f3c0-... RAC1
UN 110.82.155.1 21.33 KB 256 33.3% f5bb416c-db51-... RAC1
UN 110.54.125.1 21.33 KB 256 16.7% b836748f-c94f-... RAC2

DC: Analytics
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.2 28.44 KB 13.0.% e2451cdf-f070- ... -922337.... RAC1
UN 110.82.155.2 44.47 KB 16.7% f9fa427c-a2c5- ... 30745512... RAC2
UN 110.82.155.3 54.33 KB 23.6% b9fc31c7-3bc0- ..- 45674488... RAC1

DC: Solr
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.3 15.44 KB 50.2.% e2451cdf-f070- ... 9243578.... RAC1


UN 110.82.155.4 18.78 KB 49.8.% e2451cdf-f070- ... 10000 RAC2

Initializing multiple datacenters per workload type


In this scenario, a mixed workload cluster has more than one datacenter for each type of workload. For example,
the following ten-node cluster spans five datacenters, whereas a single datacenter cluster has only one
datacenter for each node type.

• DC1 = 2 DSE Analytics nodes

• DC2 = 2 Transactional nodes

• DC3 = 2 DSE Search nodes

• DC4 = 2 DSE Analytics nodes

• DC5 = 2 Transactional nodes

The ten-node cluster spans two racks across five datacenters. Applications in each datacenter will use a default
consistency level of LOCAL_QUORUM. One node per rack will serve as a seed node.

Node IP address Type Seed Rack

node0 110.82.155.0 Transactional # RAC1

node1 110.82.155.1 Transactional RAC1

node2 110.54.125.1 Transactional RAC2

node3 110.55.120.1 Transactional RAC1

node4 110.54.125.2 Analytics RAC1

node5 110.54.155.2 Analytics # RAC2

node6 110.82.155.3 Analytics RAC1

node7 110.55.120.2 Analytics RAC1

node8 110.54.125.3 Search RAC1

node9 110.82.155.4 Search RAC2

Prerequisites:
Complete the prerequisite tasks outlined in Initializing a DataStax Enterprise cluster to prepare the
environment.

If the new datacenter uses existing nodes from another datacenter or cluster, complete the following steps to
ensure that old data will not interfere with the new cluster:

1. If the nodes are behind a firewall, open the required ports for internal/external communication.

2. Decommission each node that will be added to the new datacenter.

3. Clear the data from DataStax Enterprise (DSE) to completely remove application directories.

4. Install DSE on each node.

1. Complete the following steps to prevent client applications from prematurely connecting to the new
datacenter, and to ensure that the consistency level for reads or writes does not query the new datacenter:


If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.

a. Configure client applications to use the DCAwareRoundRobinPolicy.

b. Direct clients to an existing datacenter. Otherwise, clients might try to access the new datacenter,
which might not have any data.

c. If using the QUORUM consistency level, change to LOCAL_QUORUM.

d. If using the ONE consistency level, set to LOCAL_ONE.

See the programming instructions for your driver.

2. Configure every keyspace using SimpleStrategy to use the NetworkTopologyStrategy replication strategy,
including (but not restricted to) the following keyspaces.
If SimpleStrategy was used previously, this step is required to configure NetworkTopologyStrategy.

a. Use ALTER KEYSPACE to change the keyspace replication strategy to NetworkTopologyStrategy for
the following keyspaces.

ALTER KEYSPACE keyspace_name WITH REPLICATION =
    {'class' : 'NetworkTopologyStrategy', 'ExistingDC1' : 3};

• DSE security: system_auth, dse_security

• DSE performance: dse_perf

• DSE analytics: dse_leases, dsefs

• System resources: system_traces, system_distributed

• OpsCenter (if installed)

• All keyspaces created by users

b. Use DESCRIBE SCHEMA to check the replication strategy of keyspaces in the cluster. Ensure that any
existing keyspaces use the NetworkTopologyStrategy replication strategy.

DESCRIBE SCHEMA ;

CREATE KEYSPACE dse_perf WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dse_leases WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dsefs WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
...

CREATE KEYSPACE dse_security WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;

3. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.


Use the same version of DSE on all nodes in the cluster.

4. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.

Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.

a. Configure node properties:

• -seeds: internal_IP_address of each seed node


Include at least one seed node from each datacenter. DataStax recommends more than
one seed node per datacenter, in more than one rack. Do not make all nodes seed
nodes.

• auto_bootstrap: true
This setting has been removed from the default configuration, but, if present, should be set
to true.

• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.

• endpoint_snitch: snitch
See endpoint_snitch and snitches.

Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.

Snitch Configuration file

GossipingPropertyFileSnitch cassandra-rackdc.properties file

Amazon EC2 single-region snitch

Amazon EC2 multi-region snitch

Google Cloud Platform snitch

PropertyFileSnitch cassandra-topology.properties file

• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.

b. Configure node architecture (all nodes in the datacenter must use the same type):
Virtual node (vnode) allocation algorithm settings

• Set num_tokens to 8 (recommended).

• Set allocate_tokens_for_local_replication_factor to the target replication factor for keyspaces


in the new datacenter. If the keyspace RF varies, alternate the settings to use all the
replication factors.

• Comment out the initial_token property.


DataStax recommends not using vnodes with DSE Search. However, if you decide
to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that
allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured
for your environment.
For more information, refer to Virtual node (vnode) configuration.
Single-token architecture settings

• Generate the initial token for each node and set this value for the initial_token property.
See Adding or replacing single-token nodes for more information.

• Comment out both num_tokens and allocate_tokens_for_local_replication_factor.

5. In the cassandra-rackdc.properties (GossipingPropertyFileSnitch) or cassandra-topology.properties


(PropertyFileSnitch) file, assign datacenter and rack names to the IP addresses of each node, and assign a
default datacenter name and rack name for unknown nodes.

Migration information: The GossipingPropertyFileSnitch always loads cassandra-


topology.properties when the file is present. Remove the file from each node on any new cluster,
or any cluster migrated from the PropertyFileSnitch.

# Transactional Node IP=Datacenter:Rack


110.82.155.0=DC_Transactional:RAC1
110.82.155.1=DC_Transactional:RAC1
110.54.125.1=DC_Transactional:RAC2
110.54.125.2=DC_Analytics:RAC1
110.54.155.2=DC_Analytics:RAC2
110.82.155.3=DC_Analytics:RAC1
110.54.125.3=DC_Search:RAC1
110.82.155.4=DC_Search:RAC2

# default for unknown nodes


default=DC1:RAC1

After making any changes in the configuration files, you must restart the node for the changes to
take effect.

6. Make the following changes in the existing datacenters.

a. On nodes in the existing datacenters, update the -seeds property in cassandra.yaml to include the
seed nodes in the new datacenter.

b. Add the new datacenter definition to the properties file for the type of snitch used in the cluster (for
example, cassandra-rackdc.properties or cassandra-topology.properties). If changing snitches, see Switching snitches.

7. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a
time, and then start the rest of the nodes:

• Package installations: Starting DataStax Enterprise as a service

• Tarball installations: Starting DataStax Enterprise as a stand-alone process

8. Continue starting DSE, rotating through the racks, until all the nodes are up.

9. After all nodes are running in the cluster and the client applications are datacenter aware, use cqlsh to alter
the keyspaces to add the desired replication in the new datacenter.

ALTER KEYSPACE keyspace_name WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'ExistingDC1' : 3, 'NewDC2' : 2};

If client applications, including DSE Search and DSE Analytics, are not properly configured, they
might connect to the new datacenter before it is online. Incorrect configuration results in connection
exceptions, timeouts, and/or inconsistent data.

10. Run nodetool rebuild on each node in the new datacenter, specifying the datacenter to rebuild from. This
step replicates the data to the new datacenter in the cluster.

$ nodetool rebuild -- datacenter_name

You must specify an existing datacenter in the command line, or the new nodes will appear to rebuild
successfully, but might not contain all anticipated data.
Requests to the new datacenter with LOCAL_ONE or ONE consistency levels can fail if the existing
datacenters are not completely in-sync.

a. Use nodetool rebuild on one or more nodes at the same time. Run on one node at a time to
reduce the impact on the existing cluster.

b. Alternatively, run the command on multiple nodes simultaneously when the cluster can handle the
extra I/O and network pressure.

11. Check that your cluster is up and running:

$ dsetool status

If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.

12. Complete steps 3 through 11 to add the remaining datacenters to the cluster.

The datacenters in the cluster are now replicating with each other.

DC: Cassandra Workload: Cassandra Graph: no


==============================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 110.82.155.0 21.33 KB 256 50.2% a9fa31c7-f3c0-... RAC1
UN 110.82.155.1 21.33 KB 256 49.8% f5bb416c-db51-... RAC1

DC: Analytics
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.2 28.44 KB 50.2% e2451cdf-f070- ... -922337.... RAC1
UN 110.82.155.2 44.47 KB 49.8% f9fa427c-a2c5- ... 30745512... RAC2

DC: Solr
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.54.125.3 15.44 KB 50.2% e2451cdf-f070- ... 9243578.... RAC1
UN 110.82.155.4 18.78 KB 49.8% e2451cdf-f070- ... 10000 RAC2


DC: Cassandra2 Workload: Cassandra Graph: no


==============================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 110.54.125.1 21.33 KB 256 16.7% b836748f-c94f-... RAC2
UN 110.55.120.1 21.33 KB 256 16.7% b354798g-c94f-... RAC2

DC: Analytics2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Tokens Rack
UN 110.82.155.3 54.33 KB 50.2% b9fc31c7-3bc0- ..- 45674488... RAC1
UN 110.55.120.2 54.33 KB 49.8% b8gd45e4-3bc0- ..- 45674488... RAC2

What's next:

• Initializing single-token architecture datacenters

• Configuring the security keyspaces replication factors

Setting seed nodes for a single datacenter


This overview is a simple example of setting seed nodes for a new datacenter with 5 nodes.
About seed nodes:

• A seed node is used to bootstrap the gossip process for new nodes joining a cluster.

• To learn the topology of the ring, a joining node contacts one of the nodes in the -seeds list in
cassandra.yaml.

• The first time you bring up a node in a new cluster, only one node is the seed node.

• The seeds list is a comma delimited list of addresses. Since this example cluster includes 5 nodes, you must
change the list from the default value "127.0.0.1" to the IP address of one of the nodes.

• After all nodes are added, all nodes in the datacenter must be configured to use the same seed nodes.

Preventing problems in gossip communications


To prevent problems in gossip communications, be sure to use the same list of seed nodes for all nodes in a
cluster. This is most critical the first time a node starts up. By default, a node remembers other nodes it has
gossiped with between subsequent restarts. The seed node designation has no purpose other than bootstrapping
the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they
have any other special purpose in cluster operations beyond the bootstrapping of nodes.

Making every node a seed node is not recommended because of increased maintenance and reduced gossip
performance. Gossip optimization is not critical, but it is recommended to use a small seed list (approximately
three nodes per datacenter).

This single datacenter example has 5 nodes, where nodeA, nodeB, and nodeC are seed nodes.

Node    IP address     Seed

nodeA   110.82.155.0   yes
nodeB   110.82.155.1   yes
nodeC   110.54.125.1   yes
nodeD   110.54.125.2   no
nodeE   110.54.155.2   no

1. In the new datacenter, install DSE on each new node. Do not start the service or restart the node.

Use the same version of DSE on all nodes in the cluster.

2. For nodeA, nodeB, and nodeC, configure only nodeA as seed node:

a. In cassandra.yaml:

seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 110.82.155.0

3. Start the seed nodes one at a time nodeA, nodeB, and then nodeC.

4. For nodeA, nodeB, and nodeC, change cassandra.yaml to configure nodeA, nodeB, and nodeC as seed
nodes:

a. In cassandra.yaml:

seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 110.82.155.0, 110.82.155.1, 110.54.125.1

You do not need to restart nodeA, nodeB, or nodeC after changing the seed node entry in
cassandra.yaml; the nodes will reread the seed nodes.

5. For nodeD and nodeE, change cassandra.yaml to configure nodeA, nodeB, and nodeC as seed nodes:

a. In cassandra.yaml:

seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 110.82.155.0, 110.82.155.1, 110.54.125.1

6. Start nodeD and nodeE.


Result: All nodes in the datacenter have the same seed nodes: nodeA, nodeB, and nodeC.

Use cases for listen address


Correct cassandra.yaml listen_address settings for various use cases.

• Never set listen_address to 0.0.0.0.

• Set listen_address or listen_interface, do not set both.

• Single-node installations: do one of the following:

# Comment out the listen_address property. If the node is properly configured (host name, name
resolution, and so on), the database uses InetAddress.getLocalHost() to get the local address from
the system.

# Leave the default setting, localhost.


• Node in a multi-node installation: set the listen_address property to the node's IP address or hostname,
or set listen_interface.

• Node in a multi-network or multi-datacenter installation, within an EC2 environment that supports


automatic switching between public and private interfaces: set listen_address to the node's IP address
or hostname, or set listen_interface.

• Node with two physical network interfaces in a multi-datacenter installation or cluster deployed
across multiple Amazon EC2 regions using the Ec2MultiRegionSnitch (see the example excerpt after this list):

1. Set listen_address to this node's private IP or hostname, or set listen_interface (for communication
within the local datacenter).

2. Set broadcast_address to the second IP or hostname (for communication between datacenters).

3. Set listen_on_broadcast_address to true.

4. If this node is a seed node, add the node's public IP address or hostname to the seeds list.

• Open the storage_port or ssl_storage_port on the public IP firewall.
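A minimal cassandra.yaml sketch for the two-interface case above (both addresses are placeholders):

listen_address: 10.1.1.5               # private IP, used for traffic within the local datacenter
broadcast_address: 203.0.113.5         # public IP, used for traffic between datacenters
listen_on_broadcast_address: true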

Initializing single-token architecture datacenters


Follow these steps only when not using virtual nodes (vnodes).
In most circumstances, each workload type, such as search, analytics, and transactional, should be organized
into separate virtual datacenters. Workload segregation avoids contention for resources. However, workloads can
be combined in SearchAnalytics nodes when there is not a large demand for analytics, or when analytics queries
must use a DSE Search index. Generally, combining transactional (OLTP) and analytics (OLAP) workloads
results in decreased performance.
When creating a keyspace using CQL, DataStax Enterprise creates a virtual datacenter for a cluster, even a one-
node cluster, automatically. You assign nodes that run the same type of workload to the same datacenter. The
separate, virtual datacenters for different types of nodes segregate workloads that run DSE Search from those
nodes that run other workload types.
Prerequisites:
Complete the tasks outlined in Initializing a DataStax Enterprise cluster to prepare the environment.

These steps provide information about setting up a cluster having one or more datacenters.

1. Suppose you install DataStax Enterprise on these nodes:

• node0 10.168.66.41 (seed1)

• node1 10.176.43.66

• node2 10.168.247.41

• node3 10.176.170.59 (seed2)

• node4 10.169.61.170

• node5 10.169.30.138

2. Calculate the token assignments as described in Calculating tokens for single-token architecture nodes.
The following tables list tokens for a 6 node cluster with a single datacenter or two datacenters.


Table 13: Single Datacenter


Node Token

node0 0

node1 21267647932558653966460912964485513216

node2 42535295865117307932921825928971026432

node3 63802943797675961899382738893456539648

node4 85070591730234615865843651857942052864

node5 106338239662793269832304564822427566080

Table 14: Multiple Datacenters


Node Token Offset Datacenter

node0 0 NA DC1

node1 56713727820156410577229101238628035242 NA DC1

node2 113427455640312821154458202477256070485 NA DC1

node3 100 100 DC2

node4 56713727820156410577229101238628035342 100 DC2

node5 113427455640312821154458202477256070585 100 DC2

3. If the nodes are behind a firewall, open the required ports for internal/external communication.

4. If DataStax Enterprise is running, stop the node and clear the data:

• Package installations: To stop DSE:

$ sudo service dse stop

To remove data from the default directories:

$ sudo rm -rf /var/lib/cassandra/*

• Tarball installations:
From the installation location, stop the database:

$ bin/dse cassandra-stop

Remove all data:

$ cd /var/lib/cassandra && sudo rm -rf data/* commitlog/* saved_caches/* hints/*

5. Configure properties in cassandra.yaml on each new node, following the configuration of the other nodes in
the cluster.

Use the yaml_diff tool to review and make appropriate changes to the cassandra.yaml and dse.yaml
configuration files.


a. Configure node properties; a combined example excerpt follows this list.

• initial_token: token_value_from_calculation

• num_tokens: 1

• -seeds: internal_IP_address of each seed node


Include at least one seed node from each datacenter. DataStax recommends more than
one seed node per datacenter. Do not make all nodes seed nodes.

• listen_address: empty
If not set, DSE asks the system for the local address, which is associated with its host name.
In some cases, DSE does not produce the correct address, which requires specifying the
listen_address.

• auto_bootstrap: false
Add the bootstrap setting only when initializing a new cluster with no data.

• endpoint_snitch: snitch
See endpoint_snitch and snitches.

Do not use the DseSimpleSnitch. The DseSimpleSnitch (default) is used only for single-
datacenter deployments (or single-zone deployments in public clouds), and does not
recognize datacenter or rack information.

Snitch                            Configuration file

GossipingPropertyFileSnitch       cassandra-rackdc.properties file
Amazon EC2 single-region snitch   see Configuring the Amazon EC2 single-region snitch
Amazon EC2 multi-region snitch    see Configuring Amazon EC2 multi-region snitch
Google Cloud Platform snitch      see Configuring the Google Cloud Platform snitch
PropertyFileSnitch                cassandra-topology.properties file

• If using a cassandra.yaml or dse.yaml file from a previous version, check the Upgrade
Guide for removed settings.
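Pulling these properties together, a minimal cassandra.yaml excerpt for node1 in this example might look like the following sketch (GossipingPropertyFileSnitch is used purely for illustration; substitute your own snitch, token value, and seed addresses):

num_tokens: 1
initial_token: 21267647932558653966460912964485513216   # node1 value from Table 13
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 10.168.66.41, 10.176.170.59
auto_bootstrap: false                                    # only when initializing a new cluster with no data
endpoint_snitch: GossipingPropertyFileSnitch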

6. Set the properties in the dse.yaml file as required by your use case.

7. In the cassandra-rackdc.properties (GossipingPropertyFileSnitch) or cassandra-topology.properties


(PropertyFileSnitch) file, assign datacenter and rack names to the IP addresses of each node, and assign a
default datacenter name and rack name for unknown nodes.

Migration information: The GossipingPropertyFileSnitch always loads cassandra-


topology.properties when the file is present. Remove the file from each node on any new cluster, or
any cluster migrated from the PropertyFileSnitch.

# Transactional Node IP=Datacenter:Rack


110.82.155.0=DC_Transactional:RAC1
110.82.155.1=DC_Transactional:RAC1
110.54.125.1=DC_Transactional:RAC2
110.54.125.2=DC_Analytics:RAC1
110.54.155.2=DC_Analytics:RAC2
110.82.155.3=DC_Analytics:RAC1
110.54.125.3=DC_Search:RAC1
110.82.155.4=DC_Search:RAC2

# default for unknown nodes


default=DC1:RAC1

After making any changes in the configuration files, you must restart the node for the changes to take
effect.

8. After you have installed and configured DataStax Enterprise on all nodes, start the seed nodes one at a time,
and then start the rest of the nodes:

• Package installations: Starting DataStax Enterprise as a service

• Tarball installations: Starting DataStax Enterprise as a stand-alone process

9. Check that your cluster is up and running:

$ dsetool status

If DSE has problems starting, look for starting DSE troubleshooting and other articles in the Support
Knowledge Center.

Datacenter: Cassandra
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 110.82.155.0 21.33 KB 256 33.3% a9fa31c7-f3c0-... RAC1
UN 110.82.155.1 21.33 KB 256 33.3% f5bb416c-db51-... RAC1
UN 110.82.155.2 21.33 KB 256 16.7% b836748f-c94f-... RAC1

Calculating tokens for single-token architecture nodes


This page contains information on manually calculating tokens.
DataStax recommends using Lifecycle Manager in DSE OpsCenter instead.
About single-token architecture
Use single-token architecture when not using virtual nodes (vnodes). See Guidelines for using virtual nodes. You
do not need to calculate tokens when using vnodes.
When you start a DataStax Enterprise cluster without vnodes, you must ensure that the data is evenly divided
across the nodes in the cluster using token assignments and that no two nodes share the same token even if
they are in different datacenters. Tokens are hash values that partitioners use to determine where to store rows
on each node. This value determines the node's position in the ring and what data the node is responsible for.
Each node is responsible for the region of the cluster between itself (inclusive) and its predecessor (exclusive).
As a simple example, if the range of possible tokens is 0 to 100 and there are four nodes, the tokens for the
nodes are: 0, 25, 50, 75. This division ensures that each node is responsible for an equal range of data. For
more information, see Data distribution overview.
Before starting each node in the cluster for the first time, comment out the num_tokens property and assign an
initial_token value in the cassandra.yaml configuration file.
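If the token-generator tool described in the next section is not available, evenly spaced Murmur3Partitioner tokens can also be computed with a quick one-liner (this assumes Python is installed on the node; the output matches the six-node, single-datacenter example shown below):

$ python -c 'num=6; print("\n".join(str(i * (2**64 // num) - 2**63) for i in range(num)))'
-9223372036854775808
-6148914691236517206
-3074457345618258604
-2
3074457345618258600
6148914691236517202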
Using the Token generator
Use the token-generator tool for:

• Calculating tokens for a single datacenter with one rack


• Calculating tokens for a single datacenter with multiple racks

• Calculating tokens for a multiple datacenter cluster

• Calculating tokens when adding or replacing nodes/datacenters

Usage:

• Package installations:

$ token-generator num_of_nodes_in_dc ... [options]

• Tarball installations:

$ installation_location/resources/cassandra/tools/bin/token-generator num_of_nodes_in_dc
... [options]

If no options are entered, Token Generator Interactive Mode is invoked.

Option                              Description

-h, --help                          Show help.

--murmur3 | --random                Specify the partitioner:
                                    --murmur3: Murmur3Partitioner uses a maximum possible range of hash values
                                    from -2^63 to +2^63-1. Default partitioner if not specified.
                                    --random: Random partitioner uses a range from 0 to 2^127-1. Default
                                    partitioner before DataStax Enterprise 3.1/Apache Cassandra™ 1.2.

--ringoffset offset                 Offset token values. Use when adding or replacing dead nodes or datacenters.

--ringrange range_start range_end   Specify token values within a specified range.

--test                              Displays various ring arrangements and generates an HTML file showing these
                                    arrangements.

Calculating tokens for a single datacenter with one rack


Example:

$ token-generator 6

DC #1:
  Node #1: -9223372036854775808
  Node #2: -6148914691236517206
  Node #3: -3074457345618258604
  Node #4: -2
  Node #5: 3074457345618258600
  Node #6: 6148914691236517202

Calculating tokens for a single datacenter with multiple racks


DataStax recommends that each rack have the same number of nodes so you can alternate the rack
assignments.


1. Calculate the tokens:

$ token-generator 8

DC #1:
  Node #1: -9223372036854775808
  Node #2: -6917529027641081856
  Node #3: -4611686018427387904
  Node #4: -2305843009213693952
  Node #5: 0
  Node #6: 2305843009213693952
  Node #7: 4611686018427387904
  Node #8: 6917529027641081856

2. Assign the tokens to nodes on alternating racks in the cassandra-rackdc.properties or the cassandra-
topology.properties file.

Figure 1: Alternating rack assignments

Calculating tokens for a multiple datacenter cluster


Do not use SimpleStrategy for this type of cluster. You must use the NetworkTopologyStrategy. This strategy
determines replica placement independently within each datacenter.
Example:


1. Calculate the tokens:

$ token-generator 4 4

DC #1:
  Node #1: -9223372036854775808
  Node #2: -4611686018427387904
  Node #3: 0
  Node #4: 4611686018427387904
DC #2:
  Node #1: -4690182801719768975
  Node #2: -78496783292381071
  Node #3: 4533189235135006833
  Node #4: 9144875253562394737

2. After calculating the tokens, assign the tokens so that the nodes in each datacenter are evenly dispersed
around the ring.


Figure 2: Token position and datacenter assignments


3. Alternate the rack assignments as described above.

Calculating tokens when adding or replacing nodes/datacenters


To avoid token collisions, use the --ringoffset option.

1. Calculate the tokens with the offset:

$ token-generator 3 2 --ringoffset 100

The results show the generated token values for the Murmur3Partitioner for one datacenter with 3 nodes
and one datacenter with 2 nodes with an offset:

DC #1:
Node #1: 6148914691236517105
Node #2: 12297829382473034310
Node #3: 18446744073709551516
DC #2:
Node #1: 9144875253562394637
Node #2: 18368247290417170445

The value of the offset is for the first node and all other nodes are calculated for even distribution from the
offset.
The tokens without the offset are:

$ token-generator 3 2

DC #1:
  Node #1: -9223372036854775808
  Node #2: -3074457345618258603
  Node #3: 3074457345618258602
DC #2:
  Node #1: -78496783292381071
  Node #2: 9144875253562394737

2. After calculating the tokens, assign the tokens so that the nodes in each datacenter are evenly dispersed
around the ring and alternate the rack assignments.

Chapter 6. Security
For securing DataStax Enterprise 6.0, see the DataStax Security Guide.

Chapter 7. Using DataStax Enterprise advanced
functionality
Information on using DSE Analytics, DSEFS, DSE Search, DSE Graph, DSE Advanced Replication, DSE In-
Memory, DSE Multi-Instance, DSE Tiered Storage and DSE Performance services.

DSE Analytics
DataStax Enterprise (DSE) integrates real-time and batch operational analytics capabilities with an enhanced
version of Apache Spark™. With DSE Analytics you can easily generate ad-hoc reports, target customers with
personalization, and process real-time streams of data. The analytics toolset lets you write code once and then
use it for both real-time and batch workloads.
About DSE Analytics
DataStax Enterprise (DSE) integrates real-time and batch operational analytics capabilities with an enhanced
version of Apache Spark™. With DSE Analytics you can easily generate ad-hoc reports, target customers with
personalization, and process real-time streams of data. The analytics toolset lets you write code once and then
use it for both real-time and batch workloads.
DSE Analytics jobs can use the DataStax Enterprise File System (DSEFS) to handle the large data sets typical
of analytic processing. DSEFS replaces CFS (Cassandra File System).
DSE Analytics features
No single point of failure
DSE Analytics supports a peer-to-peer, distributed cluster for running Spark jobs. Being peers, any
node in the cluster can load data files, and any analytics node can assume the responsibilities of Spark
Master.
Spark Master management
DSE Analytics provides automatic Spark Master management.
Analytics without ETL
Using DSE Analytics, you run Spark jobs directly against data in the database. You can perform real-
time and analytics workloads at the same time without one workload affecting the performance of the
other. Starting some cluster nodes as Analytics nodes and others as pure transactional real-time nodes
automatically replicates data between nodes.
DataStax Enterprise file system (DSEFS)
DSEFS (DataStax Enterprise file system) is a fault-tolerant, general-purpose, distributed file system
within DataStax Enterprise. It is designed for use cases that need to leverage a distributed file system
for data ingestion, data staging, and state management for Spark Streaming applications (such
as checkpointing or write-ahead logging). DSEFS is similar to HDFS, but avoids the deployment
complexity and single point of failure typical of HDFS. DSEFS is HDFS-compatible and is designed to
work in place of HDFS in Spark and other systems.
DSE Analytics Solo
DSE Analytics Solo datacenters are devoted entirely to DSE Analytics processing, for deployments that
require separation of analytics jobs from transactional data.
Integrated security
DSE Analytics uses the advanced security features of DSE, simplifying configuration and deployment.
AlwaysOn SQL
AlwaysOn SQL is a highly-available service that provides JDBC and ODBC interfaces to applications
accessing DSE Analytics data.
Enabling DSE Analytics
To enable Analytics, follow the architecture guidelines for choosing a workload type for the datacenters in the
cluster.


Setting the replication factor for analytics keyspaces


Keyspaces and tables are automatically created when DSE Analytics nodes are started for the first time. The
replication factor must be adjusted for these keyspaces in order for the analytics features to work properly and to
avoid data loss.
The keyspaces used by DSE Analytics are the following:

• dse_analytics

• dse_leases

• dsefs

• "HiveMetaStore"

All analytics keyspaces are initially created with the SimpleStrategy replication strategy and a replication
factor (RF) of 1. Each of these must be updated in production environments to avoid data loss. After starting
the cluster, alter the keyspace to use the NetworkTopologyStrategy replication strategy with appropriate
settings for the replication factor and datacenters. For most environments using DSE Analytics, a suitable
replication factor will be either 3 or the cluster size, whichever is smaller.
For example, use a CQL statement to configure the dse_leases keyspace for a replication factor of 3 in both
DC1 and DC2 datacenters using NetworkTopologyStrategy:

ALTER KEYSPACE dse_leases


WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'DC1': '3',
'DC2': '3'
};

Only replicate DSE Analytics keyspaces to other DSE Analytics datacenters. DSEFS does not support
replication to other datacenters, and the dsefs keyspace only contains metadata, not the data stored in
DSEFS. Each DSE Analytics datacenter should have its own DSEFS instance.

The datacenter name used is case-sensitive. If needed, use the dsetool status command to confirm the exact
datacenter spelling.
After adjusting the replication factor, nodetool repair must be run on each node in the affected datacenters.
For example to repair the altered keyspace dse_leases:

$ nodetool repair -full dse_leases

Repeat the above steps for each of the analytics keyspaces listed above. For more information see Changing
keyspace replication strategy.
DSE Analytics and Search integration
An integrated DSE SearchAnalytics cluster allows analytics jobs to be performed using CQL queries. This
integration allows finer-grained control over the types of queries that are used in analytics workloads, and
improves performance by reducing the amount of data that is processed. However, a DSE SearchAnalytics
cluster does not provide workload isolation and there are no detailed guidelines for provisioning and performance
in production environments.
Nodes that are started in SearchAnalytics mode allow you to create analytics queries that use DSE Search
indexes. These queries return RDDs that are used by Spark jobs to analyze the returned data.
The following code shows how to use a DSE Search query from the DSE Spark console.

val table = sc.cassandraTable("music","solr")
val result = table.select("id","artist_name")
  .where("solr_query='artist_name:Miles*'")
  .take(10)

You can use Spark Datasets/DataFrames instead of RDDs.

val table = spark.read.format("org.apache.spark.sql.cassandra")


.options(Map("keyspace"->"music", "table" -> "solr"))
.load()
val result =
table.select("id","artist_name").where("solr_query='artist_name:Miles*'")
.show(10)

You may alternately use a Spark SQL query.

val result = spark.sql("SELECT id, artist_name FROM music.solr WHERE solr_query =


'artist_name:Miles*' LIMIT 10")

For a detailed example, see Running the Wikipedia demo with SearchAnalytics.
Configuring a DSE SearchAnalytics cluster
1. Create DSE SearchAnalytics nodes in a mixed-workload cluster, as described in Initializing a single
datacenter per workload type.
The name of the datacenter is set to SearchAnalytics when using the DseSimpleSnitch. Do not modify
existing search or analytics nodes that use DseSimpleSnitch to be SearchAnalytics nodes. If you use
another snitch like GossipingPropertyFileSnitch you can have a mixed workload within a datacenter.

2. Perform load testing to ensure your hardware has enough CPU and memory for the additional resource
overhead that is required by Spark and Solr.
SearchAnalytics nodes always use driver paging settings (the corresponding dse.yaml setting is shown
after this list). See Using pagination (cursors) with CQL Solr queries.

SearchAnalytics nodes might consume more resources than search or analytics nodes. Resource
requirements of the nodes greatly depend on the type of query patterns you are using.
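The driver paging requirement mentioned in the note above corresponds to a single setting in dse.yaml:

# dse.yaml (SearchAnalytics nodes)
cql_solr_query_paging: driver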

Considerations for DSE SearchAnalytics clusters


Care should be taken when enabling both Search and Analytics on a DSE node. Since both workloads will be
enabled, ensure proper resources are provisioned for these simultaneous workloads. This includes sufficient
memory and compute resources to accommodate the specific indexing, query, and processing appropriate to the
use case.
SearchAnalytics clusters are appropriate for production environments, provided these environments provide
sufficient resources for the specific workload, as is the case for all DSE clusters.
All of the fields that are queried on DSE SearchAnalytics clusters must be defined in the search index schema
definition. Fields that are not defined in the search index schema columns are excluded from the results returned
from Spark queries.
Using predicate push down on search indexes in Spark SQL
Search predicate push down allows queries in SearchAnalytics datacenters to use Solr-
indexed columns in Spark SQL queries. To enable Search predicate push down, set
the spark.sql.dse.search.enableOptimization property to on or auto. By default,
spark.sql.dse.search.enableOptimization is set to auto.


When in auto mode the predicate push down will do a COUNT operation against the Search indices both with
and without the predicate filters applied. If the number of records with the predicate filter is less than the result
of the following formula:

spark.sql.dse.search.autoRatio * the total number of records

the optimization occurs automatically.


The property spark.sql.dse.search.autoRatio is user configurable. The default value is 0.03.
The performance of DSE Search is directly related to the number of records returned in a query. Requests
which require a large portion of the dataset are likely better served by a full table scan without using predicate
push downs.
To enable Solr predicate push down on a Scala dataset:

val solrEnabledDataSet = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> "ks",
    "table" -> "tab",
    "spark.sql.dse.search.enableOptimization" -> "on"))
  .load()

To create a temporary table in Spark SQL with Solr predicate push down enabled:

CREATE TEMPORARY TABLE temp USING org.apache.spark.sql.cassandra OPTIONS (


table "tab",
keyspace "ks",
spark.sql.dse.search.enableOptimization "on");

Set the spark.sql.dse.search.enableOptimization property globally by adding it to the server configuration


file.
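For example, assuming spark-defaults.conf is the configuration file in question, an excerpt that sets the optimization cluster-wide might look like this (auto and 0.03 are the documented defaults):

spark.sql.dse.search.enableOptimization  auto
spark.sql.dse.search.autoRatio           0.03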
The optimizer works on the push down level so only predicates which are being pushed to the source
can be optimized. Use the explain command to see exactly what predicates are being pushed to the
CassandraSourceRelation.

val query = spark.sql("query")


query.explain

Logging optimization plans


The optimization plans for a query using predicate push downs are logged by setting the
org.apache.spark.sql.SolrPredicateRules logger to DEBUG in the Spark logging configuration files.

<logger name="org.apache.spark.sql.SolrPredicateRules" level="DEBUG"/>

About DSE Analytics Solo


DSE Analytics Solo datacenters provide analytics processing with Spark and distributed storage using DSEFS
without storing transactional database data.
DataStax Enterprise is flexible when deploying analytic processing in concert with transactional workloads. There
are two main ways to deploy DSE Analytics: collocated with the database processing nodes, and on segregated
machines in their own datacenter.


Figure 3: Traditional and DSE Analytics Solo deployments

Traditional DSE Analytics deployments have both the DataStax database process and the Spark process
running on the same machine. This allows for simple deployment of analytic processing when the analysis is not
as intensive, or the database is not as heavily used.
DSE Analytics Solo allows customers to deploy DSE Analytics processing on segregated hardware
configurations in a different datacenter from the transactional DSE nodes. This ensures consistent behavior of
both engines in a configuration that does not compete for computer resources. This configuration is good for
processing-intensive analytic workloads.
DSE Analytics Solo allows the flexibility to have more nodes dedicated to data processing than are used for
database transactions. This is particularly good for situations where the processing needs far exceed the
transactional resource needs. For example, suppose you have a Spark Streaming job that will analyze and
filter 99.9% of the incoming data, storing only a few records after analysis. The resources required by the
transactional datacenter are much smaller than the resources required to analyze the data.
DSE Analytics Solo is more elastic in terms of scaling up, or down, the analytic processing in the cluster. This is
particularly useful when you need extra analytics processing, such as end of the day or end of the quarter surges
in analytics jobs. Since a DSE Analytics Solo node does not store database data, when new nodes are added to
a cluster there is very little data moved across the network to the new nodes. In an analytics and transactional
collocated environment, adding a node means moving transactional data between the existing nodes and the
new nodes.
For information on creating a DSE Analytics Solo datacenter, see Creating a DSE Analytics Solo datacenter.
Analyzing data using Spark
Spark is the default mode when you start an analytics node in a packaged installation.


About Spark
Apache Spark is a framework for analyzing large data sets across a cluster, and is enabled when you start an
Analytics node. Spark runs locally on each node and executes in memory when possible. Spark uses multiple
threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of
several JVMs.
Apache Spark integration with DataStax Enterprise includes:

• Spark Cassandra Connector for accessing data stores in DSE

• DSE Resource Manager for managing Spark components in a DSE cluster

• Spark Job Server

• Spark SQL support

• AlwaysOn SQL

• Spark SQL Thrift Server

• Spark streaming

• DataFrames API to manipulate data within Spark

• SparkR integration

Spark architecture
The software components for a single DataStax Enterprise analytics node are:

• Spark Worker

• DataStax Enterprise File System (DSEFS)

• The database

A Spark Master acts purely as a resource manager for Spark applications. Spark Workers launch executors that
are responsible for executing part of the job that is submitted to the Spark Master. Each application has its own
set of executors. Spark architecture is described in the Apache documentation.
DSE Spark nodes use a different resource manager than standalone Spark nodes. The DSE Resource
Manager simplifies integration between Spark and DSE. In a DSE Spark cluster, client applications use the
CQL protocol to connect to any DSE node, and that node redirects the request to the Spark Master.
The communication between the Spark client application (or driver) and the Spark Master is secured the same
way as connections to DSE, which means that plain password authentication as well as Kerberos authentication
is supported, with or without SSL encryption. Encryption and authentication can be configured per application,
rather than per cluster. Authentication and encryption between the Spark Master and Worker nodes can be
enabled or disabled regardless of the application settings.
Spark supports multiple applications. A single application can spawn multiple jobs and the jobs run in parallel.
An application reserves some resources on every node and these resources are not freed until the application
finishes. For example, every session of Spark shell is an application that reserves resources. By default, the
scheduler tries to allocate the application to the highest number of different nodes. For example, if the application
declares that it needs four cores and there are ten servers, each offering two cores, the application most likely
gets four executors, each on a different node, each consuming a single core. However, the application can
also get two executors on two different nodes, each consuming two cores. You can configure the application
scheduler. Spark Workers and Spark Master are part of the main DSE process. Workers spawn executor JVM
processes which do the actual work for a Spark application (or driver). Spark executors use native integration to
access data in local transactional nodes through the Open Source Spark-Cassandra Connector. The memory
settings for the executor JVMs are set by the user submitting the driver to DSE.
In each Analytics datacenter, one node runs the Spark Master, and Spark Workers run on each
of the nodes. The Spark Master comes with automatic high availability.


Figure 4: Spark integration with DataStax Enterprise

As you run Spark, you can access data in the Hadoop Distributed File System (HDFS), or the DataStax
Enterprise File System (DSEFS) by using the URL for the respective file system.
Highly available Spark Master
The Spark Master High Availability mechanism uses a special table in the dse_analytics keyspace to
store information required to recover Spark workers and the application. Reads to the recovery data in
dse_analytics are always performed using the LOCAL_QUORUM consistency level. Writes are attempted
first using LOCAL_QUORUM, and if that fails, the write is retried using LOCAL_ONE. Unlike the high availability
mechanism mentioned in Spark documentation, DataStax Enterprise does not use ZooKeeper.
If the original Spark Master fails, the reserved one automatically takes over. To find the current Spark Master,
run:

$ dse client-tool spark leader-address

DataStax Enterprise provides Automatic Spark Master management.

The Spark Master will not start until LOCAL_QUORUM is attainable for the dse_analytics keyspace.

Unsupported features
The following Spark features and APIs are not supported:

• Writing to blob columns from Spark


Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays
before serializing.

Using Spark with DataStax Enterprise


DataStax Enterprise integrates with Apache Spark to allow distributed analytic applications to run using
database data.
Starting Spark
Before you start Spark, configure Authorizing remote procedure calls (RPC) for the DseClientTool object.
RPC permission for the DseClientTool object is required to run Spark because the DseClientTool object
is called implicitly by the Spark launcher.

By default DSEFS is required to execute Spark applications. DSEFS should not be disabled when Spark is
enabled on a DSE node. If there is a strong reason not to use DSEFS as the default file system, reconfigure
Spark to use a different file system. For example to use a local file system set the following properties in
spark-daemon-defaults.conf:

spark.hadoop.fs.defaultFS=file:///
spark.hadoop.hive.metastore.warehouse.dir=file:///tmp/warehouse

How you start Spark depends on the installation and if you want to run in Spark mode or SearchAnalytics
mode:
Package installations:
To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set
SPARK_ENABLED to 1.
When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can
enable additional components.

Mode                  Options in /etc/default/dse      Description

Spark                 SPARK_ENABLED=1                  Start the node in Spark mode.

SearchAnalytics mode  SPARK_ENABLED=1                  SearchAnalytics mode requires testing in your environment
                      SEARCH_ENABLED=1                 before it is used in production clusters. In dse.yaml,
                                                       cql_solr_query_paging: driver is required.
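As an illustration, the corresponding lines in /etc/default/dse would look roughly like this:

SPARK_ENABLED=1      # start the node in Spark mode
SEARCH_ENABLED=1     # add this line as well for SearchAnalytics mode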

Tarball installations:


To start the Spark trackers on a cluster of analytics nodes, use the -k option:

$ installation_location/bin/dse cassandra -k

Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not
configure a datacenter in the snitch property file.
You can enable additional components:
Mode Option Description

Spark -k Start the node in Spark mode.

SearchAnalytics mode -k -s In dse.yaml, cql_solr_query_paging: driver is required.

For example:
To start a node in SearchAnalytics mode, use the -k and -s options.

$ installation_location/bin/dse cassandra -k -s

Starting the node with the Spark option starts a node that is designated as the master, as shown by the
Analytics(SM) workload in the output of the dsetool ring command:

$ dsetool ring

Address         DC         Rack   Workload       Graph  Status  State   Load       Owns  Token                 Health [0,1]
10.200.175.149  Analytics  rack1  Analytics(SM)  no     Up      Normal  185 KiB    ?     -9223372036854775808  0.90
10.200.175.148  Analytics  rack1  Analytics(SW)  no     Up      Normal  194.5 KiB  ?     0                     0.90
Note: you must specify a keyspace to get ownership information.

Launching Spark
After starting a Spark node, use dse commands to launch Spark.
Usage:
Package installations: dse spark
Tarball installations: installation_location/bin/dse spark
You can use Cassandra specific properties to start Spark. Spark binds to the listen_address that is specified
in cassandra.yaml.
DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:
dse spark
Enters interactive Spark shell, offers basic auto-completion.
Package installations: dse spark
Tarball installations: installation_location/bin/dse spark
dse spark-submit


Launches applications on a cluster like spark-submit. Using this interface you can use Spark cluster
managers without the need for separate configurations for each application. The syntax for package
installations is:

$ dse spark-submit --class class_name jar_file other_options

For example, if you write a class that defines an option named d, enter the command as follows:

$ dse spark-submit --class com.datastax.HttpSparkStream target/


HttpSparkStream.jar -d $NUM_SPARK_NODES

The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide
authentication credentials as described in DSEFS authentication.

The dse spark-submit command supports the same options as Apache Spark's spark-submit. For
example, to submit an application using cluster mode using the supervise option to restart in case of
failure:

$ dse spark-submit --deploy-mode cluster --supervise --class


com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.


Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and prevent the
user name and passwords from appearing in the Spark log files or in the process list on the Spark Web UI. To
specify a user name and password using environment variables, add the following to your Bash .profile or
.bash_profile:

export DSE_USERNAME=user
export DSE_PASSWORD=secret

These environment variables are supported for all Spark and dse client-tool commands.

DataStax recommends using the environment variables instead of passing user credentials on the
command line.

You can provide authentication credentials in several ways, see Credentials for authentication.
Specifying Spark URLs
You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any
Spark node in a datacenter, DSE will automatically discover the Master address and connect the client to the
Master.
Specify the URL for any Spark node using the following format:

dse://[Spark node address[:port number]]?[parameter name=parameter value;]...

By default the URL is dse://?, which is equivalent to dse://localhost:9042. Any parameters you set in the
URL will override the configuration read from DSE's Spark configuration settings.
You can specify the work pool in which the application will be run by adding the workpool=work pool name as
a URL parameter. For example, dse://1.1.1.1:123?workpool=workpool2.
Valid parameters are CassandraConnectorConf settings with the spark.cassandra. prefix stripped. For
example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?
connection.local_dc=dc2.


Or to specify multiple spark.cassandra.connection.host addresses for high-availability if the specified


connection point is down: dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.
If the connection.host parameter is specified, the host provided in the standard URL is prepended to the list
of hosts set in connection.host. If the port is specified in the standard URL, it overrides the port number set
in the connection.port parameter.
Connection options when using dse spark-submit are retrieved in the following order: from the Master URL,
then the Spark Cassandra Connector options, then the DSE configuration files.
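Putting these pieces together, a submit command whose master URL combines several parameters might look like the following (the host, work pool, and datacenter names are placeholders):

$ dse spark-submit --master "dse://10.10.1.1:9042?workpool=batch_pool;connection.local_dc=DC2" myApplication.jar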
Detecting Spark application failures
DSE has a failure detector for Spark applications, which detects whether a running Spark application is dead
or alive. If the application has failed, the application will be removed from the DSE Spark Resource Manager.
The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in
the application. No data is exchanged, but regular TCP connection keep-alive control messages are sent and
received. When the connection is interrupted, the failure detector will attempt to reacquire the connection every
1 second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default). If it
fails to reacquire the connection during that time, the application is removed.
A custom timeout value is specified by adding appReconnectionTimeoutSeconds=value in the master URI
when submitting the application. For example to set the timeout value to 10 seconds:

$ dse spark --master dse://?appReconnectionTimeoutSeconds=10

Running Spark commands against a remote cluster


To run Spark commands against a remote cluster, you must export the DSE configuration from one of the
remote nodes to the local client machine.
To run a driver application remotely, there must be full public network communication between the remote
nodes and the client machine.
Prerequisites:
The local client requires Spark driver ports on the client to be accessible by the remote DSE cluster nodes.
This might require configuring the firewall on the client machine and the remote DSE cluster nodes to allow
communication between the machines.
Spark dynamically selects ports for internal communication unless the ports are manually set. To use
dynamically chosen ports, the firewall needs to allow all port access from the remote cluster.
To set the ports manually, set the ports in the respective properties in spark-defaults.conf as shown in this
example:

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

For a full list of ports used by DSE, see Securing DataStax Enterprise ports.

1. Export the DataStax Enterprise client configuration from the remote node to the client node:


a. On the remote node:

$ dse client-tool configuration export dse-config.jar

b. Copy the exported JAR to the client nodes.

$ scp dse-config.jar user@client_node:

c. On the client node:

$ dse client-tool configuration import dse-config.jar

2. Run the Spark command against the remote node.

$ dse spark-submit submit options myApplication.jar

To set the driver host to a publicly accessible IP address, pass in the spark.driver.host option.

$ dse spark-submit --conf spark.driver.host=IP address myApplication.jar

Monitoring Spark with the web interface


A web interface, bundled with DataStax Enterprise, facilitates monitoring, debugging, and managing Spark.
Using the Spark web interface
To use the Spark web interface enter the listen IP address of any Spark node in a browser followed by port
number 7080 (configured in the spark-env.sh configuration file). Starting in DSE 5.1, all Spark nodes within
an Analytics datacenter will redirect to the current Spark Master.
If the Spark Master is not available, the UI will keep polling for the Spark Master every 10 seconds until the
Master is available.
The Spark web interface can be secured using SSL. SSL encryption of the web interface is enabled by default
when client encryption is enabled.
If authentication is enabled, and plain authentication is available, you will be prompted for authentication
credentials when accessing the web UI. We recommend using SSL with authentication.

Kerberos authentication is not supported in the Spark web UI. If authentication is enabled and either LDAP
or Internal authentication is not available, the Spark web UI will not be accessible. If this occurs, disable
authentication for the Spark web UI only by removing the spark.ui.filters setting in spark-daemon-
defaults.conf located in the Spark configuration directory.

DSE SSL encryption and authentication only apply to the Spark Master and Worker UIs, not the Spark Driver
UI. To use encryption and authentication with the Driver UI, refer to the Spark security documentation.


The UI includes information on the number of cores and amount of memory available to Spark in total and in
each work pool, and similar information for each Spark worker. The applications list the associated work pool.
See the Spark documentation for information on using the Spark web UI.
Authorization in the Spark web UI
When authorization is enabled and an authenticated user accesses the web UI, what they can see and do
is controlled by their permissions. This allows administrators to control who has permission to view specific
application logs, view the executors for the application, kill the application, and list all applications. Viewing and
modifying applications can be configured per datacenter, work pool, or application.
See Using authorization with Spark for details on granting permissions.
Displaying fully qualified domain names in the web UI
To display fully qualified domain names (FQDNs) in the Spark web UI, set the SPARK_PUBLIC_DNS variable in
spark-env.sh on each Analytics node.
Set SPARK_PUBLIC_DNS to the FQDN of the node if you have SSL enabled for the web UI.
Redirecting to the fully qualified domain name of the master
Set the SPARK_LOCAL_IP or SPARK_LOCAL_HOSTNAME in the spark-env.sh file on each node to the fully qualified
domain name (FQDN) of the node to force any redirects to the web UI using the FQDN of the Spark master.
This is useful when enabling SSL in the web UI.

export SPARK_LOCAL_HOSTNAME=FQDN of the node

Filtering properties in the Spark Driver UI


The Spark Driver UI has an Environment tab that lists the Spark configuration and system properties used
by Spark. This can include sensitive information like passwords and security tokens. DSE Spark filters these
properties and masks their values with sequences of asterisks. The spark.redaction.regex filter is configured
as a regular expression that by default includes all properties that contain the string "secret", "token", or
"password" as well as all system properties. To modify the filter, edit the spark.redaction.regex property in
spark-defaults.conf in the Spark configuration directory.
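For example, to also mask any property whose name contains "key", the filter could be extended in spark-defaults.conf (the pattern shown is illustrative, not the shipped default):

spark.redaction.regex  (?i)secret|token|password|key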


Using DSE Spark with third party tools and integrations


The dse exec command sets the environment variables required to run third-party tools that integrate
with Spark.

$ dse exec command

If the tool is run on a server that is not part of the DSE cluster, see Running Spark commands against a
remote cluster.
Jupyter integration
Download and install Jupyter notebook on a DSE node.
To launch Jupyter notebook:

$ dse exec jupyter notebook

A Jupyter notebook starts with the correct Python path. You must create a context to work with DSE. In
contrast to Livy and Zeppelin integrations, the Jupyter integration does not start an interpreter that creates a
context.
Livy integration
Download and install Livy on a DSE node. By default Livy runs Spark in local mode. Before starting Livy create
a configuration file by copying the conf/livy.conf.template to conf/livy.conf, then uncomment or add
the following two properties:

livy.spark.master = dse:///
livy.repl.enable-hive-context = true

To launch Livy:

$ dse exec livy-server

RStudio integration
Download and install R on all DSE Analytics nodes, install RStudio desktop on one of the nodes, then run
RStudio:

$ dse exec rstudio

In the RStudio session start a Spark session:

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))


sparkR.session()

These instructions are for RStudio desktop, not RStudio Server. In multiuser environments, we recommend
using AlwaysOn SQL and JDBC connections rather than SparkR.

Zeppelin integration
Download and install Zeppelin on a DSE node. To launch Zeppelin server:

$ dse exec zeppelin.sh

By default Zeppelin runs Spark in local mode. Update the master property to dse:/// in the Spark session in
the Interpreters configuration page. No configuration file changes are required to run Zeppelin.


Configuring Spark
Configuring Spark for DataStax Enterprise includes:
Configuring Spark nodes
Modify the settings for Spark nodes security, performance, and logging.
To manage Spark performance and operations:

• Set the replication factor for DSE Analytics keyspaces

• Set environment variables

• Protect Spark directories

• Grant access to default Spark directories

• Secure Spark nodes

• Configure Spark memory and cores

• Configure Spark logging options

Set environment variables


DataStax recommends using the default values of Spark environment variables unless you need to increase
the memory settings due to an OutOfMemoryError condition or garbage collection taking too long. Use the
Spark memory configuration options in the dse.yaml and spark-env.sh files.
You can set a user-specific SPARK_HOME directory if you also set ALLOW_SPARK_HOME=true in your environment
before starting DSE.
For example, on Debian or Ubuntu using a package installation:

$ export SPARK_HOME=$HOME/spark && export ALLOW_SPARK_HOME=true && sudo service dse


start

The temporary directory for shuffle data, RDDs, and other ephemeral Spark data can be configured for both
the locally running driver and for the Spark server processes managed by DSE (Spark Master, Workers,
shuffle service, executor and driver running in cluster mode).
For the locally running Spark driver, the SPARK_LOCAL_DIRS environment variable can be customized in the
user environment or in spark-env.sh. By default, it is set to the system temporary directory. For example,
on Ubuntu it is /tmp/. If there's no system temporary directory, then SPARK_LOCAL_DIRS is set to a .spark
directory in the user's home directory.
For all other Spark server processes, the SPARK_EXECUTOR_DIRS environment variable can be customized in
the user environment or in spark-env.sh. By default it is set to /var/lib/spark/rdd.

The default SPARK_LOCAL_DIRS and SPARK_EXECUTOR_DIRS environment variable values differ from non-
DSE Spark.
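For example, to place the locally running driver's temporary data on a dedicated disk, you might add a line like the following to spark-env.sh (the path shown is only illustrative):

export SPARK_LOCAL_DIRS=/mnt/spark-local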

To configure worker cleanup, modify the SPARK_WORKER_OPTS environment variable and add the cleanup
properties. The SPARK_WORKER_OPTS environment variable can be set in the user environment or in spark-
env.sh. For example, the following enables worker cleanup, sets the cleanup interval to 30 minutes (i.e. 1800
seconds), and sets the length of time application worker directories will be retained to 7 days (i.e. 604800
seconds).

$ export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS \
  -Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"

Protect Spark directories


After you start up a Spark cluster, DataStax Enterprise creates a Spark work directory for each
Spark Worker on worker nodes. A worker node can have more than one worker, configured by the


SPARK_WORKER_INSTANCES option in spark-env.sh. If SPARK_WORKER_INSTANCES is undefined, a


single worker is started. The work directory contains the standard output and standard error of executors and
other application specific data stored by Spark Worker and executors; the directory is writable only by the DSE
user.
By default, the Spark parent work directory is located in /var/lib/spark/work, with each worker in a
subdirectory named worker-number, where the number starts at 0. To change the parent worker directory,
configure SPARK_WORKER_DIR in the spark-env.sh file.
The Spark RDD directory is the directory where RDDs are placed when executors decide to spill them to
disk. This directory might contain the data from the database or the results of running Spark applications.
If the data in the directory is confidential, prevent access by unauthorized users. The RDD directory might
contain a significant amount of data, so configure its location on a fast disk. The directory is writable only by
the cassandra user. The default location of the Spark RDD directory is /var/lib/spark/rdd. To change the
RDD directory, configure SPARK_EXECUTOR_DIRS in the spark-env.sh file.

Grant access to default Spark directories


Before starting up nodes on a tarball installation, you need permission to access the default Spark directory
locations: /var/lib/spark and /var/log/spark. Change ownership of these directories as follows:

sudo mkdir -p /var/lib/spark/rdd; sudo chmod a+w /var/lib/spark/rdd; sudo chown -R $USER:$GROUP /var/lib/spark/rdd &&
sudo mkdir -p /var/log/spark; sudo chown -R $USER:$GROUP /var/log/spark

In multiple datacenter clusters, use a virtual datacenter to isolate Spark jobs. Running Spark jobs consumes
resources that can affect latency and throughput.
DataStax Enterprise supports the use of virtual nodes (vnodes) with Spark.
Secure Spark nodes
Client-to-node SSL
Ensure that the truststore entries in cassandra.yaml are present as described in Client-to-node
encryption, even when client authentication is not enabled.
Enabling security and authentication
Security is enabled using the spark_security_enabled option in dse.yaml. Setting it to
true turns on authentication between the Spark Master and Worker nodes, and allows you to
enable encryption. To encrypt Spark connections for all components except the web UI, enable
spark_security_encryption_enabled. The length of the shared secret used to secure Spark
components is set using the spark_shared_secret_bit_length option, with a default value of 256
bits. These options are described in DSE Analytics options. For production clusters, enable
authentication and encryption; doing so does not significantly affect performance.
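As a minimal sketch, the corresponding dse.yaml entries with authentication and encryption turned on might look like the following (the bit length shown is the default mentioned above):

spark_security_enabled: true
spark_security_encryption_enabled: true
spark_shared_secret_bit_length: 256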
Authentication and Spark applications
If authentication is enabled, users need to be authenticated in order to submit an application.
Authorization and Spark applications
If DSE authorization is enabled, users need permission to submit an application. Additionally, the
user submitting the application automatically receives permission to manage the application, which
can optionally be extended to other users.
Database credentials for the Spark SQL Thrift server
In the hive-site.xml file, configure authentication credentials for the Spark SQL Thrift server. Ensure
that you use the hive-site.xml file in the Spark directory:

• Package installations: /etc/dse/spark/hive-site.xml

• Tarball installations: installation_location/resources/spark/conf/hive-site.xml

Kerberos with Spark


With Kerberos authentication, the Spark launcher connects to DSE with Kerberos credentials and
requests DSE to generate a delegation token. The Spark driver and executors use the delegation
token to connect to the cluster. For valid authentication, the delegation token must be renewed


periodically. For security reasons, the user who is authenticated with the token should not be able to
renew it. Therefore, delegation tokens have two associated users: token owner and token renewer.
The token renewer is none so that only a DSE internal process can renew it. When the application is
submitted, DSE automatically renews delegation tokens that are associated with the Spark application.
When the application is unregistered (finished), the delegation token renewal is stopped and the
token is cancelled.
To set Kerberos options, see Defining a Kerberos scheme.
Configure Spark memory and cores
Spark memory options affect different components of the Spark ecosystem:
Spark History server and the Spark Thrift server memory
The SPARK_DAEMON_MEMORY option configures the memory that is used by the Spark SQL
Thrift server and history-server. Add or change this setting in the spark-env.sh file on nodes that run
these server applications.
Spark Worker memory
The memory_total option in the resource_manager_options.worker_options section of dse.yaml
configures the total system memory that you can assign to all executors that are run by the work
pools on the particular node. The default work pool will use all of this memory if no other work pools
are defined. If you define additional work pools, you can set the total amount of memory by setting the
memory option in the work pool definition.
Application executor memory
You can configure the amount of memory that each executor can consume for the application. Spark
uses a 512MB default. Use either the spark.executor.memory option, described in "Spark Available
Properties", or the --executor-memory mem argument to the dse spark command.
Application memory
You can configure additional Java options that are applied by the worker when spawning an executor for
the application. Use the spark.executor.extraJavaOptions property, described in Spark 1.6.2 Available
Properties. For example: spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
-Dnumbers="one two three"

Core management
You can manage the number of cores by configuring these options.

• Spark Worker cores


The cores_total option in the resource_manager_options.worker_options section of dse.yaml
configures the total number of system cores available to Spark Workers for executors. If no work pools are
defined in the resource_manager_options.workpools section of dse.yaml the default work pool will
use all the cores defined by cores_total. If additional work pools are defined, the default work pool will
use the cores available after allocating the cores defined by the work pools.
A single executor can borrow more than one core from the worker. The number of cores used by the
executor relates to the number of parallel tasks the executor might perform. The number of cores offered
by the cluster is the sum of cores offered by all the workers in the cluster.

• Application cores
In the Spark configuration object of your application, you configure the number of application cores that
the application requests from the cluster using either the spark.cores.max configuration property or the
--total-executor-cores cores argument to the dse spark command.

See the Spark documentation for details about memory and core allocation.
DataStax Enterprise can control the memory and cores offered by particular Spark Workers in semi-automatic
fashion. The resource_manager_options.worker_options section in the dse.yaml file has options to
configure the proportion of system resources that are made available to Spark Workers and any defined
work pools, or explicit resource settings. When specifying decimal values of system resources the available
resources are calculated in the following way:

• Spark Worker memory = memory_total * (total system memory - memory assigned to DSE)


• Spark Worker cores = cores_total * total system cores

This calculation is used for any decimal values. If the setting is not specified, the default value 0.7 is used. If
the value does not contain a decimal place, the setting is the explicit number of cores or amount of memory
reserved by DSE for Spark.

Setting cores_total or a workpool's cores to 1.0 is a decimal value, meaning 100% of the available cores
will be reserved. Setting cores_total or cores to 1 (no decimal point) is an explicit value, and one core will
be reserved.
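As a worked example (the hardware numbers are hypothetical), consider a node with 64 GB of RAM and 16 cores where 8 GB is assigned to DSE, using the default decimal settings:

resource_manager_options:
  worker_options:
    cores_total: 0.7     # 0.7 * 16 cores = roughly 11 cores available to executors
    memory_total: 0.7    # 0.7 * (64 GB - 8 GB) = roughly 39 GB available to executors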

The lowest values you can assign to a named work pool's memory and cores are 64 MB and 1 core,
respectively. If the results are lower, no exception is thrown and the values are automatically limited.
The following example shows a work pool named workpool1 with 1 core and 512 MB of RAM assigned to it.
The remaining resources calculated from the values in worker_options are assigned to the default work
pool.

resource_manager_options:
worker_options:
cores_total: 0.7
memory_total: 0.7

workpools:
- name: workpool1
cores: 1
memory: 512M

Running Spark clusters in cloud environments


If you are using a cloud infrastructure provider like Amazon EC2, you must explicitly open the ports for publicly
routable IP addresses in your cluster. If you do not, the Spark workers will not be able to find the Spark Master.
One work-around is to set the prefer_local setting in your cassandra-rackdc.properties snitch setup file to
true:

# Uncomment the following line to make this snitch prefer the internal ip when possible,
as the Ec2MultiRegionSnitch does.
prefer_local=true

This tells the cluster to communicate only on private IP addresses within the datacenter rather than the public
routable IP addresses.
Configuring the number of retries to retrieve Spark configuration
When Spark fetches configuration settings from DSE, it will not fail immediately if it cannot retrieve the
configuration data, but will retry 5 times by default, with increasing delay between retries. The number of
retries can be set in the Spark configuration, by modifying the spark.dse.configuration.fetch.retries
configuration property when calling the dse spark command, or in spark-defaults.conf.
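For example, to allow more retries when launching the Spark shell (the value of 10 is only illustrative):

$ dse spark --conf spark.dse.configuration.fetch.retries=10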
Disabling continuous paging
Continuous paging streams bulk amounts of records from DSE to the DataStax Java Driver
used by DSE Spark. By default, continuous paging in queries is enabled. To disable it, set the
spark.dse.continuous_paging_enabled setting to false when starting the Spark SQL shell or in spark-
defaults.conf. For example:

$ dse spark-sql --conf spark.dse.continuous_paging_enabled=false

Using continuous paging can potentially improve performance up to 3 times, though the improvement
will depend on the data and the queries. Some factors that impact the performance improvement are the


number of executor JVMs per node and the number of columns included in the query. Greater performance
gains were observed with fewer executor JVMs per node and more columns selected.

Configuring the Spark web interface ports


By default the Spark web UI runs on port 7080. To change the port number, do the following:

1. Open the spark-env.sh file in a text editor.

2. Set the SPARK_MASTER_WEBUI_PORT variable to the new port number. For example, to set it to port 7082:

export SPARK_MASTER_WEBUI_PORT=7082

3. Repeat these steps for each Analytics node in your cluster.

4. Restart the nodes in the cluster.

Enabling Graphite Metrics in DSE Spark


Users can add third-party JARs to Spark nodes by adding them to the Spark lib directory on each node and
restarting the cluster. Add the Graphite Metrics JARs to this directory to enable metrics in DSE Spark.
The default location of the Spark lib directory depends on the type of installation:

• Package installations: /usr/share/dse/spark/lib

• Tarball installations: /var/lib/spark

To add the Graphite JARs to Spark in a package installation, copy them to the Spark lib directory:

$ cp metrics-graphite-3.1.2.jar /usr/share/dse/spark/lib/ && cp metrics-json-3.1.2.jar /usr/share/dse/spark/lib/

Setting Spark properties for the driver and executor


Additional Spark properties for the Spark driver and executors are set in spark-defaults.conf. For example, to
enable Spark's commons-crypto encryption library:

spark.network.crypto.enabled true

Using authorization with Spark


See Analytic applications and Setting up DSE Spark application permissions.
Spark server configuration
The spark-daemon-defaults.conf file configures DSE Spark Masters and Workers.

Table 15: Spark server configuration properties


Option                          Default value  Description

dse.spark.application.timeout   30             The duration in seconds after which the application will be
                                               considered dead if no heartbeat is received.

spark.dseShuffle.sasl.port      7447           The port number on which a shuffle service for SASL
                                               secured applications is started. Bound to the
                                               listen_address in cassandra.yaml.

spark.dseShuffle.noSasl.port    7437           The port number on which a shuffle service for unsecured
                                               applications is started. Bound to the listen_address in
                                               cassandra.yaml.


By default Spark executor logs, which log the majority of your Spark Application output, are
redirected to standard output. The output is managed by Spark Workers. Configure logging by adding
spark.executor.logs.rolling.* properties to the spark-daemon-defaults.conf file.

spark.executor.logs.rolling.maxRetainedFiles 3
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 50000

Additional Spark properties that affect the master and driver can be added to spark-daemon-defaults.conf.
For example, to enable Spark's commons-crypto encryption library:

spark.network.crypto.enabled true

Automatic Spark Master election


Spark Master elections are automatically managed, and do not require any manual configuration.
DSE Analytics datacenters communicate with each other to elect one of the nodes as the Spark Master and
another as the reserve Master. The Master keeps track of each Spark Worker and application, storing the
information in a system table. If the Spark Master node fails, the reserve Master takes over and a new reserve
Master is elected from the remaining Analytics nodes.
Each Analytics datacenter elects its own master.
For dsetool commands and options, see dsetool.
Determining the Spark Master address
You do not need to specify the Master address when configuring or using Spark with DSE Analytics.
Configuring applications with a valid URL is sufficient for DSE to connect to the Master node and run the
application. The following commands give information about the Spark configuration of DSE:

• To view the URL used to configure Spark applications:

$ dse client-tool spark master-address

dse://10.200.181.62:9042?
connection.local_dc=Analytics;connection.host=10.200.181.63;

• To view the current address of the Spark Master in this datacenter:

$ dse client-tool spark leader-address

10.200.181.62

• Workloads for Spark Master are flagged as Workload: Analytics(SM).

$ dsetool ring

Address        DC         Rack   Workload       Graph  Status  State   Load        Owns  Token                 Health [0,1]

10.200.181.62  Analytics  rack1  Analytics(SM)  no     Up      Normal  111.91 KiB  ?     -9223372036854775808  0.10

• Query the dse_leases.leases table to list all the masters from each data center with Analytics nodes:

select * from dse_leases.leases ;

 name              | dc                   | duration_ms | epoch   | holder
-------------------+----------------------+-------------+---------+---------------
 Leader/master/6.0 | Analytics            |       30000 |  805254 | 10.200.176.42
 Leader/master/6.0 | SearchGraphAnalytics |       30000 | 1300800 | 10.200.176.45
 Leader/master/6.0 | SearchAnalytics      |       30000 |       7 | 10.200.176.44

Ensure that the replication factor is configured correctly for the dse_leases keyspace
If the dse_leases keyspace is not properly replicated, the Spark Master might not be elected.
Every time you add a new datacenter, you must manually increase the replication factor of the dse_leases
keyspace for the new DSE Analytics datacenter. If DataStax Enterprise or Spark security options are
enabled on the cluster, you must also increase the replication factor for the dse_security keyspace across
all logical datacenters.
The initial node in a multi datacenter has a replication factor of 1 for the dse_leases keyspace. For new
datacenters, the first node is created with the dse_leases keyspace with a replication factor of 1 for that
datacenter. However, any datacenters that you add have a replication factor of 0 and require configuration
before you start DSE Analytics nodes. You must change the replication factor of the dse_leases keyspace for
multiple analytics datacenters. See Setting the replication factor for analytics keyspaces.
Monitoring the lease subsystem
All changes to lease holders are recorded in the dse_leases.logs table. Most of the time, you do not want to
enable logging.

1. To turn on logging, ensure that the lease_metrics_options is enabled in the dse.yaml file:

lease_metrics_options:
enabled: true
ttl_seconds: 604800

2. Look at the dse_leases.logs table:

select * from dse_leases.logs ;

 name              | dc  | monitor       | at                              | new_holder    | old_holder
-------------------+-----+---------------+---------------------------------+---------------+------------
 Leader/master/6.0 | dc1 | 10.200.180.44 | 2018-05-17 00:45:02.971000+0000 | 10.200.180.44 |
 Leader/master/6.0 | dc1 | 10.200.180.49 | 2018-05-17 02:37:07.381000+0000 | 10.200.180.49 |

3. When lease_metrics_options is enabled, you can examine the acquire, renew, resolve, and disable
operations. Most of the time, these operations should complete in 100 ms or less:

select * from dse_perf.leases ;


@ Row 1
 name                        | Leader/master/6.0
 dc                          | dc1
 monitor                     | 10.200.180.44
 acquire_average_latency_ms  | 0
 acquire_latency99ms         | 0
 acquire_max_latency_ms      | 0
 acquire_rate15              | 0
 disable_average_latency_ms  | 0
 disable_latency99ms         | 0
 disable_max_latency_ms      | 0
 disable_rate15              | 0
 renew_average_latency_ms    | 24
 renew_latency99ms           | 100
 renew_max_latency_ms        | 100
 renew_rate15                | 0
 resolve_average_latency_ms  | 8
 resolve_latency99ms         | 26
 resolve_max_latency_ms      | 26
 resolve_rate15              | 0
 up                          | True
 up_or_down_since            | 2018-05-03 19:30:38.395000+0000

@ Row 2
 name                        | Leader/master/6.0
 dc                          | dc1
 monitor                     | 10.200.180.49
 acquire_average_latency_ms  | 0
 acquire_latency99ms         | 0
 acquire_max_latency_ms      | 0
 acquire_rate15              | 0
 disable_average_latency_ms  | 0
 disable_latency99ms         | 0
 disable_max_latency_ms      | 0
 disable_rate15              | 0
 renew_average_latency_ms    | 0
 renew_latency99ms           | 0
 renew_max_latency_ms        | 0
 renew_rate15                | 0
 resolve_average_latency_ms  | 10
 resolve_latency99ms         | 32
 resolve_max_latency_ms      | 32
 resolve_rate15              | 0
 up                          | True
 up_or_down_since            | 2018-05-03 19:30:55.656000+0000

4. If the log warnings and errors do not contain relevant information, edit the logback.xml file and add:

<logger name="com.datastax.bdp.leasemanager" level="DEBUG"/>

5. Restart the node for the debugging settings to take effect.

Troubleshooting
Perform these various lease holder troubleshooting activities before you contact DataStax Support.
Verify the workload status
Run the dsetool ring command:

$ dsetool ring

If the replication factor is inadequate or if the replicas are down, the output of the dsetool ring
command contains a warning:

Address         DC                    Rack   Workload             Graph  Status  State   Load        Owns  Token                 Health [0,1]
10.200.178.232  SearchGraphAnalytics  rack1  SearchAnalytics      yes    Up      Normal  153.04 KiB  ?     -9223372036854775808  0.00
10.200.178.230  SearchGraphAnalytics  rack1  SearchAnalytics(SM)  yes    Up      Normal  92.98 KiB   ?     0                     0.000

If the automatic Job Tracker or Spark Master election fails, verify that an appropriate replication factor
is set for the dse_leases keyspace.
Use cqlsh commands to verify the replication factor of the analytics keyspaces


1. Describe the dse_leases keyspace:

DESCRIBE KEYSPACE dse_leases;

CREATE KEYSPACE dse_leases WITH replication =
  {'class': 'NetworkTopologyStrategy', 'Analytics1': '1'}
  AND durable_writes = true;

2. Increase the replication factor of the dse_leases keyspace:

ALTER KEYSPACE dse_leases WITH replication =
  {'class': 'NetworkTopologyStrategy', 'Analytics1': '3', 'Analytics2': '3'};

3. Run nodetool repair.
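For example, a minimal repair of just the lease data:

$ nodetool repair dse_leases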

Configuring Spark logging options


You can configure Spark logging options for the Spark logs.
Log directories
The Spark logging directory is the directory where the Spark components store individual log files. DataStax
Enterprise places logs in the following locations:
Executor logs

• SPARK_WORKER_DIR/worker-n/application_id/executor_id/stderr

• SPARK_WORKER_DIR/worker-n/application_id/executor_id/stdout

Spark Master/Worker logs


Spark Master: the global system.log
Spark Worker: SPARK_WORKER_LOG_DIR/worker-n/worker.log
The default SPARK_WORKER_LOG_DIR location is /var/log/spark/worker.
Default log directory for Spark SQL Thrift server
The default log directory for starting the Spark SQL Thrift server is $HOME/spark-thrift-server.
Spark Shell and application logs
Spark Shell and application logs are output to the console.
SparkR shell log
The default location for the SparkR shell is $HOME/.sparkR.log
Log configuration file
Log configuration files are located in the same directory as spark-env.sh.

To configure Spark logging options:

1. Configure logging options, such as log levels, in the following files:


Executors                                       logback-spark-executor.xml
Spark Master                                    logback.xml
Spark Worker                                    logback-spark-server.xml
Spark Driver (Spark Shell, Spark applications)  logback-spark.xml
SparkR                                          logback-sparkR.xml

2. If you want to enable rolling logging for Spark executors, add the following options to spark-daemon-
defaults.conf.


Enable rolling logging with 3 log files retained before deletion. The log files are broken up by size with a
maximum size of 50,000 bytes.

spark.executor.logs.rolling.maxRetainedFiles 3
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 50000

The default location of the Spark configuration files depends on the type of installation:

• Package installations: /etc/dse/spark/

• Tarball installations: installation_location/resources/spark/conf

3. Configure a safe communication channel to access the Spark user interface.

When user credentials are specified in plain text on the dse command line, like dse -u username
-p password, the credentials are present in the logs of Spark workers when the driver is run in
cluster mode.
The Spark Master, Spark Worker, executor, and driver logs might include sensitive information.
Sensitive information includes passwords and digest authentication tokens for Kerberos
that are passed on the command line or in the Spark configuration. DataStax recommends using
only safe communication channels like VPN and SSH to access the Spark user interface.

You can provide authentication credentials in several ways, see Credentials for authentication.

Running Spark processes as separate users


Spark processes can be configured to run as separate operating system users.
By default, processes started by DSE are run as the same OS user who started the DSE server process. This
is called the DSE service user. One consequence of this is that all applications that are run on the cluster can
access DSE data and configuration files, and access files of other applications.
You can delegate running Spark applications to runner processes and users by changing options in dse.yaml.
Overview of the run_as process runner
The run_as process runner allows you to run Spark applications as a different OS user than the DSE service
user. When this feature is enabled and configured:

• All simultaneously running applications deployed by a single DSE service user will be run as a single OS
user.

• Applications deployed by different DSE service users will be run by different OS users.

• All applications will be run as a different OS user than the DSE service user.

This allows you to prevent an application from accessing DSE server private files, and prevent one application
from accessing the private files of another application.
How the run_as process runner works
DSE uses sudo to run Spark applications components (drivers and executors) as specific OS users. DSE
doesn't link a DSE service user with a particular OS user. Instead, a configurable number of spare user
accounts or slots are used. When a request to run an executor or a driver is received, DSE finds an unused
slot, and locks it for that application. Until the application is finished, all of that application's processes run as
that slot user. When the application completes, the slot user will be released and will be available to other
applications.
Since the number of slots is limited, a single slot is shared among all the simultaneously running applications
run by the same DSE service user. Such a slot is released once all the applications of that user are removed.
When there are not enough slots to run an application, an error is logged and DSE will try to run the executor or


driver on a different node. DSE does not limit the number of slots you can configure. If you need to run more
applications simultaneously, create more slot users.
Slot assignment is done on a per-node basis. Executors of a single application may run as different slot users
on different DSE nodes. When DSE is run on a fat node, different DSE instances running within the same OS
should be configured with different sets of slot users. If they use the same slot users, a single OS user may run
the applications of two different DSE service users.
When a slot is released, all directories which are normally managed by Spark for the application are removed.
If the application doesn't finish, but all executors are done on a node, and a slot user is about to be released,
all the application files are modified so that their ownership is changed to the DSE service user with owner-
only permission. When a new executor for this application is run on this node, the application files are
reassigned back to the slot user assigned to that application.
Configuring the run_as process runner
The administrator needs to prepare slot users in the OS before configuring DSE. The run_as process runner
requires:

• Each slot user has its own primary group, whose name is the same as the name of the slot user. This is
typically the default behavior of the OS. For example, the slot1 user's primary group is slot1.

• The DSE service user is a member of each slot's primary group. For example, if the DSE service user is
cassandra, the cassandra user is a member of the slot1 group.

• The DSE service user is a member of a group with the same name as the service user. For example, if
the DSE service user is cassandra, the cassandra user is a member of the cassandra group.

• sudo is configured so that the DSE service user can execute any command as any slot user without
providing a password.

Override the umask setting to 007 for slot users so that files created by sub-processes will not be accessible by
anyone else by default, and DSE configuration files are not visible to slot users.
You may further secure the DSE server environment by modifying the OS's limits.conf file to set exact disk
space quotas for each slot user.
After adding the slot users and groups and configuring the OS, modify the dse.yaml file. In the
spark_process_runner section enable the run_as process runner and set the list of slot users on each node.

spark_process_runner:
# Allowed options are: default, run_as
runner_type: run_as

run_as_runner_options:
user_slots:
- slot1
- slot2

Example configuration for run_as process runner


In this example, two slot users, slot1 and slot2 will be created and configured with DSE. The default DSE
service user of cassandra is used.


1. Create the slot users.

$ sudo useradd -r -s /bin/false slot1 && sudo useradd -r -s /bin/false slot2

2. Add the slot users to the DSE service user's group.

$ sudo usermod -a -G slot1,slot2 cassandra

3. Make sure the DSE service user is a member of a group with the same name as the service user. For
example, if the DSE service user is cassandra:

$ groups cassandra

cassandra : cassandra

4. Log out and back in again to make the group changes take effect.

5. Modify the sudoers file with the slot users.

Runas_Alias SLOTS = slot1, slot2


Defaults>SLOTS umask=007
Defaults>SLOTS umask_override
cassandra ALL=(SLOTS) NOPASSWD: ALL

6. Modify dse.yaml to enable the run_as process runner and add the new runners.

# Configure the way how the driver and executor processes are created and managed.
spark_process_runner:
# Allowed options are: default, run_as
runner_type: run_as

  # RunAs runner uses sudo to start Spark drivers and executors. A set of predefined
  # fake users, called slots, is used for this purpose. All drivers and executors owned
  # by some DSE user are run as some slot user x. At the same time drivers and
  # executors of any other DSE user use different slots.
run_as_runner_options:
user_slots:
- slot1
- slot2

Configuring the Spark history server


The Spark history server provides a way to load the event logs from Spark jobs that were run with event
logging enabled. The Spark history server works only when files were not flushed before the Spark Master
attempted to build a history user interface.

To enable the Spark history server:

1. Create a directory for event logs in the DSEFS file system:

$ dse fs 'mkdir -p /spark/events'

2. On each node in the cluster, edit the spark-defaults.conf file to enable event logging and specify the
directory for event logs:

#Turns on logging for applications submitted from this machine


spark.eventLog.dir dsefs:///spark/events
spark.eventLog.enabled true


#Sets the logging directory for the history server


spark.history.fs.logDirectory dsefs:///spark/events
# Optional property that changes permissions set to event log files
# spark.eventLog.permissions=777

3. Start the Spark history server on one of the nodes in the cluster:
The Spark history server is a front-end application that displays logging data from all nodes in the
Spark cluster. It can be started from any node in the cluster.
If you've enabled authentication, set the authentication method and credentials in a properties file and
pass it to the dse command. For example, for basic authentication:

spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.username=role name
spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.password=password

If you set the event log location in spark-defaults.conf, set the spark.history.fs.logDirectory
property in your properties file.

spark.history.fs.logDirectory=dsefs:///spark/events

$ dse spark-history-server start

With a properties file:

dse spark-history-server start --properties-file properties file

If you specify a properties file, none of the configuration in spark-defaults.conf is used. The
properties file should contain all the required configuration properties.

The history server is started and can be viewed by opening a browser to http://node
hostname:18080.

The Spark Master web UI does not show the historical logs. To work around this known issue,
access the history from port 18080.

4. When event logging is enabled, the default behavior is for all logs to be saved, which causes the storage
to grow over time. To enable automated cleanup edit spark-defaults.conf and edit the following
options:

spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d

For these settings, automated cleanup is enabled, the cleanup is performed daily, and logs older than
seven days are deleted.

Setting Spark Cassandra Connector-specific properties


Spark integration uses the Spark Cassandra Connector under the hood. You can use the configuration options
defined in that project to configure DataStax Enterprise Spark. Spark recognizes system properties that have
the spark. prefix and adds the properties to the configuration object implicitly upon creation. You can avoid
adding system properties to the configuration object by passing false for the loadDefaults parameter in the
SparkConf constructor.
The full list of parameters is included in the Spark Cassandra Connector documentation.


You pass settings for Spark, Spark Shell, and other DataStax Enterprise Spark built-in applications using the
intermediate application spark-submit, described in Spark documentation.
Configuring the Spark shell
Pass Spark configuration arguments using the following syntax:

$ dse spark [submission_arguments] [application_arguments]

where submission_arguments are:

[--help] [--verbose]
[--conf name=spark.value|sparkproperties.conf]
[--executor-memory memory]
[--jars additional-jars]
[--master dse://?appReconnectionTimeoutSeconds=secs]
[--properties-file path_to_properties_file]
[--total-executor-cores cores]

--conf name=spark.value|sparkproperties.conf
An arbitrary Spark option to the Spark configuration prefixed by spark.

• name=spark.value

• sparkproperties.conf - a configuration file

--executor-memory mem
The amount of memory that each executor can consume for the application. Spark uses a 512 MB
default. Specify the memory argument in JVM format using the k, m, or g suffix.
--help
Shows a help message that displays all options except DataStax Enterprise Spark shell options.
--jars path_to_additional_jars
A comma-separated list of paths to additional JAR files.
--properties-file path_to_properties_file
The location of the properties file that has the configuration settings. By default, Spark loads the
settings from spark-defaults.conf.
--total-executor-cores cores
The total number of cores the application uses.
--verbose
Displays which arguments are recognized as Spark configuration options and which arguments are
forwarded to the Spark shell.
Spark shell application arguments:
-i app_script_file
Spark shell application argument that runs a script from the specified file.
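For example, a hypothetical invocation that combines several of these arguments (the memory size, core count, and script path are only illustrative):

$ dse spark --executor-memory 2g --total-executor-cores 4 -i /path/to/setup.scala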
Configuring Spark applications
You pass the Spark submission arguments using the following syntax:

$ dse spark-submit [submission_arguments] application_file [application_arguments]

All submission_arguments and these additional spark-submit submission_arguments:


--class class_name
The full name of the application main class.
--name appname
The application name as displayed in the Spark web application.
--py-files files
A comma-separated list of the .zip, .egg, or .py files that are set on PYTHONPATH for Python
applications.


--files files
A comma-separated list of files that are distributed among the executors and available for the
application.
In general, Spark submission arguments are translated into system properties -Dname=value and other VM
parameters like classpath. The application arguments are passed directly to the application.
Property list
When you run dse spark-submit on a node in your Analytics cluster, all the following properties are set
automatically, and the Spark Master is automatically detected. Only set the following properties if you need to
override the automatically managed properties.
spark.cassandra.connection.native.port
Default = 9042. Port for native client protocol connections.
spark.cassandra.connection.rpc.port
Default = 9160. Port for thrift connections.
spark.cassandra.connection.host
The host name or IP address to which the Thrift RPC service and native transport is bound.
The native_transport_address property in the cassandra.yaml, which is localhost by default,
determines the default value of this property.
You can explicitly set the Spark Master address using the --master master address parameter to dse spark-
submit.

$ dse spark-submit --master master address application JAR file

For example, if the Spark node is at 10.0.0.2:

$ dse spark-submit --master dse://10.0.0.2? myApplication.jar

The following properties can be overridden for performance or availability:


Connection properties
spark.cassandra.session.consistency.level
Default = LOCAL_ONE. The default consistency level for sessions which are accessed from the
CassandraConnector object as in CassandraConnector.withSessionDo.
This property does not affect the consistency level of DataFrame and RDD read and write
operations. Use spark.cassandra.input.consistency.level for read operations and
spark.cassandra.output.consistency.level for write operations.

Read properties
spark.cassandra.input.split.size
Default = 100000. Approximate number of rows in a single Spark partition. The higher the value, the
fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
spark.cassandra.input.fetch.size_in_rows
Default = 1000. Number of rows being fetched per round-trip to the database. Increasing this value
increases memory consumption. Decreasing the value increases the number of round-trips. In earlier
releases, this property was spark.cassandra.input.page.row.size.
spark.cassandra.input.consistency.level
Default = LOCAL_ONE. Consistency level to use when reading.
Write properties
You can set the following properties in SparkConf to fine tune the saving process.
spark.cassandra.output.batch.size.bytes
Default = 1024. Maximum total size of a single batch in bytes.
spark.cassandra.output.consistency.level
Default = LOCAL_QUORUM. Consistency level to use when writing.
spark.cassandra.output.concurrent.writes


Default = 100. Maximum number of batches executed in parallel by a single Spark task.
spark.cassandra.output.batch.size.rows
Default = None. Number of rows per single batch. The default is unset, which means the connector
will adjust the number of rows based on the amount of data in each row.
See the Spark Cassandra Connector documentation for details on additional, low-level properties.
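For example, a sketch of overriding one read property and one write property when starting the Spark shell (the values shown are only illustrative):

$ dse spark --conf spark.cassandra.input.fetch.size_in_rows=5000 \
            --conf spark.cassandra.output.consistency.level=LOCAL_ONE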
Creating a DSE Analytics Solo datacenter
DSE Analytics Solo datacenters do not store any database or search data, but are strictly used for analytics
processing. They are used in conjunction with one or more datacenters that contain database data.
Creating a DSE Analytics Solo datacenter within an existing DSE cluster
In this example scenario, there is an existing datacenter, DC1 which has existing database data. Create a new
DSE Analytics Solo datacenter, DC2, which does not store any data but will perform analytics jobs using the
database data from DC1.

• Make sure all keyspaces in the DC1 datacenter use NetworkTopologyStrategy. If necessary, alter the
keyspace.

ALTER KEYSPACE mykeyspace
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };

• Add nodes to a new datacenter named DC2, then enable Analytics on those nodes.

• Configure the dse_leases and dse_analytics keyspaces to replicate to both DC1 and DC2. For example:

ALTER KEYSPACE dse_leases
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };

• When submitting Spark applications specify the --master URL with the name or IP address of a node in
the DC2 datacenter, and set the spark.cassandra.connection.local_dc configuration option to DC1.

dse spark-submit --master "dse://?connection.local_dc=DC2" \
  --class com.datastax.dse.demo.loss.Spark10DayLoss \
  --conf "spark.cassandra.connection.local_dc=DC1" portfolio.jar

The Spark workers read the data from DC1.

Accessing an external DSE transactional cluster from a DSE Analytics Solo cluster
To access an external DSE transactional cluster, explicitly set the connection to the transactional cluster when
creating RDDs or Datasets within the application.
In the following examples, the external DSE transactional cluster has a node running on 10.10.0.2.
To create an RDD from the transactional cluster's data:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext

def analyticsSoloExternalDataExample(sc: SparkContext) = {
  val connectorToTransactionalCluster =
    CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "10.10.0.2"))

  val rddFromTransactionalCluster = {
    // Sets connectorToTransactionalCluster as the default connection for everything in this code block
    implicit val c = connectorToTransactionalCluster
    // Get the data from the test.words table
    sc.cassandraTable("test", "words")
  }
}


Creating a Dataset from the transactional cluster's data:

import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf

// set params for the particular cluster
spark.setCassandraConf("TransactionalCluster",
  CassandraConnectorConf.ConnectionHostParam.option("10.10.0.2"))

val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()

When you submit the application to the DSE Analytics Solo cluster, it will retrieve the data from the external
DSE transactional cluster.
Spark JVMs and memory management
Spark jobs running on DataStax Enterprise are divided among several different JVM processes, each with
different memory requirements.
DataStax Enterprise and Spark Master JVMs
The Spark Master runs in the same process as DataStax Enterprise, but its memory usage is negligible. The
only way Spark could cause an OutOfMemoryError in DataStax Enterprise is indirectly by executing queries
that fill the client request queue. For example, if it ran a query with a high limit and paging was disabled or it
used a very large batch to update or insert data in a table. This is controlled by MAX_HEAP_SIZE in cassandra-
env.sh. If you see an OutOfMemoryError in system.log, you should treat it as a standard OutOfMemoryError
and follow the usual troubleshooting steps.
Spark executor JVMs
The Spark executor is where Spark performs transformations and actions on the RDDs and is usually
where a Spark-related OutOfMemoryError would occur. An OutOfMemoryError in an executor will show
up in the stderr log for the currently executing application (usually in /var/lib/spark). There are several
configuration settings that control executor memory and they interact in complicated ways.

• The memory_total option in the resource_manager_options.worker_options section of dse.yaml


defines the maximum fraction of system memory to give all executors for all applications running on a
particular node. It uses the following formula:
memory_total * (total system memory - memory assigned to DataStax Enterprise)

• spark.executor.memory is a system property that controls how much executor memory a specific
application gets. It must be less than or equal to the calculated value of memory_total. It can be specified
in the constructor for the SparkContext in the driver application, or via --conf spark.executor.memory
or --executor-memory command line options when submitting the job using spark-submit.
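For example, both forms of setting executor memory when submitting a hypothetical application JAR (the size and file name are placeholders):

$ dse spark-submit --executor-memory 4g myApplication.jar
$ dse spark-submit --conf spark.executor.memory=4g myApplication.jar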

The client driver JVM


The driver is the client program for the Spark job. Normally it shouldn't need very large amounts of memory
because most of the data should be processed within the executor. If it does need more than a few gigabytes,
your application may be using an anti-pattern like pulling all of the data in an RDD into a local data structure by
using collect or take. Generally you should never use collect in production code and if you use take, you
should be only taking a few records. If the driver runs out of memory, you will see the OutOfMemoryError in
the driver stderr or wherever it's been configured to log. This is controlled in one of two places:

• SPARK_DRIVER_MEMORY in spark-env.sh


• spark.driver.memory system property which can be specified via --conf spark.driver.memory or


--driver-memory command line options when submitting the job using spark-submit. This cannot be
specified in the SparkContext constructor because by that point, the driver has already started.

Spark worker JVMs


The worker is a watchdog process that spawns the executor, and should never need its heap size increased.
The worker's heap size is controlled by SPARK_DAEMON_MEMORY in spark-env.sh. SPARK_DAEMON_MEMORY also
affects the heap size of the Spark SQL thrift server.
Using Spark modules with DataStax Enterprise

Getting started with Spark Streaming


Spark Streaming allows you to consume live data streams from sources, including Akka, Kafka, and Twitter.
This data can then be analyzed by Spark applications, and the data can be stored in the database.
You use Spark Streaming by creating an org.apache.spark.streaming.StreamingContext instance based
on your Spark configuration. You then create a DStream instance, or discretized stream, an object that
represents an input stream. DStream objects are created by calling one of the methods of StreamingContext,
or using a utility class from external libraries to connect to other sources like Twitter.
The data you consume and analyze is saved to the database by calling one of the saveToCassandra methods
on the stream object, passing in the keyspace name, the table name, and optionally the column names and
batch size.

Spark Streaming applications require synchronized clocks to operate correctly. See Synchronize clocks.

The following Scala example demonstrates how to connect to a text input stream at a particular IP address
and port, count the words in the stream, and save the results to the database.

1. Import the streaming context objects.

import org.apache.spark.streaming._

2. Create a new StreamingContext object based on an existing SparkConf configuration object, specifying
the interval in which streaming data will be divided into batches by passing in a batch duration.

val sparkConf = ....

val ssc = new StreamingContext(sc, Seconds(1)) // Uses the context automatically created by the Spark shell

Spark allows you to specify the batch duration in milliseconds, seconds, and minutes.

3. Import the database-specific functions for StreamingContext, DStream, and RDD objects.

import com.datastax.spark.connector.streaming._

4. Create the DStream object that will connect to the IP and port of the service providing the data stream.

val lines = ssc.socketTextStream(server IP address, server port number)

5. Count the words in each batch and save the data to the table.

val words = lines.flatMap(_.split(" "))


val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)


wordCounts.saveToCassandra("streaming_test", "words_table", SomeColumns("word", "count"))

6. Start the computation.

ssc.start()
ssc.awaitTermination()

In the following example, you start a service using the nc utility that repeats strings, then consume the
output of that service using Spark Streaming.
Using cqlsh, start by creating a target keyspace and table for streaming to write into.

CREATE KEYSPACE IF NOT EXISTS streaming_test
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };

CREATE TABLE IF NOT EXISTS streaming_test.words_table
  (word TEXT PRIMARY KEY, count COUNTER);

In a terminal window, enter the following command to start the service:

$ nc -lk 9999 one two two three three three four four four four someword

In a different terminal start a Spark shell.

$ dse spark

In the Spark shell enter the following:

import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.saveToCassandra("streaming_test", "words_table", SomeColumns("word", "count"))
wordCounts.print()
ssc.start()
ssc.awaitTermination()
exit()

Using cqlsh connect to the streaming_test keyspace and run a query to show the results.

$ cqlsh -k streaming_test

select * from words_table;

word | count
---------+-------
three | 3
one | 1


two | 2
four | 4
someword | 1

What's next:
Run the http_receiver demo. See the Spark Streaming Programming Guide for more information, API
documentation, and examples on Spark Streaming.
Creating a Spark Structured Streaming sink using DSE
Spark Structured Streaming is a high-level API for streaming applications. DSE supports Structured
Streaming for storing data into DSE.
The following Scala example shows how to store data from a streaming source to DSE using the
cassandraFormat method.

val query = source.writeStream
  .option("checkpointLocation", checkpointDir.toString)
  .cassandraFormat("table name", "keyspace name")
  .outputMode(OutputMode.Update)
  .start()

This example sets the OutputMode to Update, described in the Spark API documentation.
The cassandraFormat method is equivalent to calling the format method with
org.apache.spark.sql.cassandra and setting the keyspace and table options:

val query = source.writeStream
  .option("checkpointLocation", checkpointDir.toString)
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", ks)
  .option("table", "kv")
  .outputMode(OutputMode.Update)
  .start()

Using Spark SQL to query data


Spark SQL allows you to execute Spark queries using a variation of the SQL language. Spark SQL includes
APIs for returning Spark Datasets in Scala and Java, and interactively using a SQL shell.
Spark SQL basics
In DSE, Spark SQL allows you to perform relational queries over data stored in DSE clusters, and executed
using Spark. Spark SQL is a unified relational query language for traversing over distributed collections of
data, and supports a variation of the SQL language used in relational databases. Spark SQL is intended as a
replacement for Shark and Hive, including the ability to run SQL queries over Spark data sets. You can use
traditional Spark applications in conjunction with Spark SQL queries to analyze large data sets.
The SparkSession class and its subclasses are the entry point for running relational queries in Spark.
DataFrames are Spark Datasets organized into named columns, and are similar to tables in a traditional
relational database. You can create DataFrame instances from any Spark data source, like CSV files, Spark
RDDs, or, for DSE, tables in the database. In DSE, when you access a Spark SQL table from the data in DSE
transactional cluster, it registers that table to the Hive metastore so SQL queries can be run against it.

Any tables you create or destroy, and any table data you delete, in a Spark SQL session will not be
reflected in the underlying DSE database, but only in that session's metastore.


Starting the Spark SQL shell


The Spark SQL shell allows you to interactively perform Spark SQL queries. To start the shell, run dse spark-
sql:

$ dse spark-sql

The Spark SQL shell in DSE automatically creates a Spark session and connects to the Spark SQL Thrift
server to handle the underlying JDBC connections.
If the schema changes in the underlying database table during a Spark SQL session (for example, a column
was added using CQL), drop the table and then refresh the metastore to continue querying the table with the
correct schema.

DROP TABLE tablename;
SHOW TABLES;

Queries to a table whose schema has been modified cause a runtime exception.
Spark SQL limitations
• You cannot load data from one file system to a table in a different file system.

CREATE TABLE IF NOT EXISTS test (id INT, color STRING) PARTITIONED BY (ds STRING);
LOAD DATA INPATH 'hdfs2://localhost/colors.txt' OVERWRITE INTO TABLE test PARTITION
(ds ='2008-08-15');

The first line creates a table on the default file system. The second line attempts to load data into that
table from a path on a different file system, and will fail.

Querying database data using Spark SQL in Scala


When you start Spark, DataStax Enterprise creates a Spark session instance to allow you to run
Spark SQL queries against database tables. The session object is named spark and is an instance of
org.apache.spark.sql.SparkSession. Use the sql method to execute the query.

1. Start the Spark shell.

$ dse spark

2. Use the sql method to pass in the query, storing the result in a variable.

val results = spark.sql("SELECT * from my_keyspace_name.my_table")

3. Use the returned data.

results.show()

+--------------------+-----------+
|                  id|description|
+--------------------+-----------+
|de2d0de1-4d70-11e...|      thing|
|db7e4191-4d70-11e...|    another|
|d576ad50-4d70-11e...|yet another|


+--------------------+-----------+

Querying database data using Spark SQL in Java


Java applications that query table data using Spark SQL first need an instance of
org.apache.spark.sql.SparkSession.
The Spark session object is used to connect to DataStax Enterprise.
Create the Spark session instance using the builder interface:

SparkSession spark = SparkSession
  .builder()
  .appName("My application name")
  .config("option name", "option value")
  .master("dse://1.1.1.1?connection.host=1.1.2.2,1.1.3.3")
  .getOrCreate();

After the Spark session instance is created, you can use it to create a DataFrame instance from the query.
Queries are executed by calling the SparkSession.sql method.

DataFrame employees = spark.sql("SELECT * FROM company.employees");


employees.registerTempTable("employees");
DataFrame managers = spark.sql("SELECT name FROM employees WHERE role = 'Manager' ");

The returned DataFrame object supports the standard Spark operations.

employees.collect();

Querying DSE Graph vertices and edges with Spark SQL


Spark SQL can query DSE Graph vertex and edge tables. The dse_graph database holds the vertex
and edge tables for each graph. The naming format for the tables is graph_name_vertices and
graph_name_edges. For example, if you have a graph named gods, the vertices and edges are accessible in Spark
SQL in the dse_graph.gods_vertices and dse_graph.gods_edges tables.

select * from dse_graph.gods_vertices;

If you have properties that are spelled the same but with different capitalizations (for example, id and Id),
start Spark SQL with the --conf spark.sql.caseSensitive=true option.
Prerequisites:
Start your cluster with both Graph and Spark enabled.

1. Start the Spark SQL shell.

$ dse spark-sql

2. Query the vertices and edges using SELECT statements.

USE dse_graph;
SELECT * FROM gods_vertices where name = 'Zeus';

3. Join the vertices and edges in a query.


Vertices are identified by id columns. Edge tables have src and dst columns that identify the from
and to vertices, respectively. A join can be used to traverse the graph. For example to find all vertex
ids that are reached by the out edges:

SELECT gods_edges.dst FROM gods_vertices JOIN gods_edges
  ON gods_vertices.id = gods_edges.src;

What's next: The same steps work from the Spark shell using spark.sql() to run the query statements, or
using the JDBC/ODBC driver and the Spark SQL Thrift Server.
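For example, a minimal sketch of the same join run from the Spark shell with spark.sql (assuming the gods graph used above):

val reached = spark.sql(
  "SELECT gods_edges.dst FROM dse_graph.gods_vertices " +
  "JOIN dse_graph.gods_edges ON gods_vertices.id = gods_edges.src")
reached.show()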
Using Spark predicate push down in Spark SQL queries
Spark predicate push down to database allows for better optimized Spark queries. A predicate is a condition
on a query that returns true or false, typically located in the WHERE clause. A predicate push down filters
the data in the database query, reducing the number of entries retrieved from the database and improving
query performance. By default the Spark Dataset API will automatically push down valid WHERE clauses to the
database.
You can also use predicate push down on DSE Search indices within SearchAnalytics data centers.
Restrictions on column filters
Partition key columns can be pushed down as long as:

• All partition key columns are included in the filter.

• No more than one equivalence predicate per column.

Use an IN clause to specify multiple restrictions for a particular column:

val primaryColors = List("red", "yellow", "blue")

val df = spark.read.cassandraFormat("cars", "inventory").load
df.filter(df("car_color").isin(primaryColors: _*))

Clustering key columns can be pushed down with the following rules:

• Only the last predicate in the filter can be a non-equivalence predicate.

• If there is more than one predicate for a column, the predicates cannot be equivalence predicates.
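For example, a minimal sketch of an eligible filter, assuming a table test.words with partition key user and clustering key word (the same layout as the example table created later in this section):

val df = spark.read.cassandraFormat("words", "test").load
// user is restricted with an equivalence predicate and the last (clustering-column)
// predicate is a non-equivalence predicate, so both can be pushed down
val filtered = df.filter(df("user") === "Russ" && df("word") > "fad")
filtered.explain()  // look for PushedFilters in the physical plan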

When predicate push down occurs


When a Dataset has no push down filters, all requests on the Dataset do a full unfiltered table scan. Adding
predicate filters on the Dataset for eligible database columns modifies the underlying query to narrow its
scope.
Determining if predicate push down is being used in queries
By using the explain method on a Dataset (or EXPLAIN in Spark SQL), queries can be analyzed to see if the
predicates need to be cast to the correct data type. For example, create the following CQL table:

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
  'replication_factor': 1 };
USE test;
CREATE TABLE words (
  user TEXT,
  word TEXT,
  count INT,
  PRIMARY KEY (user, word));

INSERT INTO words (user, word, count ) VALUES ( 'Russ', 'dino', 10 );
INSERT INTO words (user, word, count ) VALUES ( 'Russ', 'fad', 5 );
INSERT INTO words (user, word, count ) VALUES ( 'Sam', 'alpha', 3 );
INSERT INTO words (user, word, count ) VALUES ( 'Zebra', 'zed', 100 );

Then create a Spark Dataset in the Spark console using that table and look for PushedFilters in the output
after issuing the EXPLAIN command:

val df = spark.read.cassandraFormat("words", "test").load
df.explain

== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user#0,word#1,count#2]
ReadSchema: struct<user:string,word:string,count:int>

Because this query doesn't filter on columns capable of being pushed down, there are no PushedFilters in
the physical plan.
Adding a filter, however, does change the physical plan to include PushedFilters:

val dfWithPushdown = df.filter(df("word") > "ham")
dfWithPushdown.explain

== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[user#0,word#1,count#2] PushedFilters: [*GreaterThan(word,ham)], ReadSchema:
struct<user:string,word:string,count:int>

The PushedFilters section of the physical plan includes the GreaterThan push down filter. The asterisk
indicates that the push down filter will be handled only at the datasource level.
Troubleshooting predicate push down
When creating Spark SQL queries that use comparison operators, making sure the predicates are pushed
down to the database correctly is critical to retrieving the correct data with the best performance.
For example, given a CQL table with the following schema:

CREATE TABLE test.common (
  year int,
  birthday timestamp,
  userid uuid,
  likes text,
  name text,
  PRIMARY KEY (year, birthday, userid)
)

Suppose you want to write a query that selects all entries where the birthday is earlier than a given date:

SELECT * FROM test.common WHERE birthday < '2001-1-1';

Use the EXPLAIN command to see the query plan:

EXPLAIN SELECT * FROM test.common WHERE birthday < '2001-1-1';

== Physical Plan ==
*Filter (cast(birthday#1 as string) < 2001-1-1)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[year#0,birthday#1,userid#2,likes#3,name#4] ReadSchema:
struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.72 seconds, Fetched 1 row(s)

Note that the Filter directive is treating the birthday column, a CQL TIMESTAMP, as a string. The query
optimizer looks at this comparison and needs to make the types match before generating a predicate. In
this case the optimizer decides to cast the birthday column as a string to match the string '2001-1-1',
but cast functions cannot be pushed down. The predicate isn't pushed down, and it doesn't appear in
PushedFilters. A full table scan will be performed at the database layer, with the results returned to Spark
for further processing.
To push down the correct predicate for this query, use the cast method to specify that the predicate is
comparing the birthday column to a TIMESTAMP, so the types match and the optimizer can generate the
correct predicate.

EXPLAIN SELECT * FROM test.common WHERE birthday < cast('2001-1-1' as TIMESTAMP);

== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation
[year#0,birthday#1,userid#2,likes#3,name#4]
PushedFilters: [*LessThan(birthday,2001-01-01 00:00:00.0)],
ReadSchema: struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.034 seconds, Fetched 1 row(s)

Note the PushedFilters indicating that the LessThan predicate will be pushed down for the column data in
birthday. This should speed up the query as a full table scan will be avoided.
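The same fix can be sketched from the Dataset API by comparing the column against a timestamp value instead of a string, so the types already match and no cast of the column is needed (a sketch, assuming the test.common table above):

import java.sql.Timestamp
val df = spark.read.cassandraFormat("common", "test").load
val filtered = df.filter(df("birthday") < Timestamp.valueOf("2001-01-01 00:00:00"))
filtered.explain()  // PushedFilters should now include LessThan(birthday, ...)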

Supported syntax of Spark SQL

The following syntax defines a SELECT query.

SELECT [DISTINCT] [column names]|[wildcard]
FROM [keyspace name.]table name
[JOIN clause table name ON join condition]
[WHERE condition]
[GROUP BY column name]
[HAVING conditions]
[ORDER BY column names [ASC | DESC]]

A SELECT query using joins has the following syntax.

SELECT statement
FROM statement
[JOIN | INNER JOIN | LEFT JOIN | LEFT SEMI JOIN | LEFT OUTER JOIN | RIGHT JOIN | RIGHT
OUTER JOIN | FULL JOIN | FULL OUTER JOIN]
ON join condition

Several select clauses can be combined in a UNION, INTERSECT, or EXCEPT query.

SELECT statement 1
[UNION | UNION ALL | UNION DISTINCT | INTERSECT | EXCEPT]
SELECT statement 2

Select queries run on new columns return '', or empty results, instead of None.
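As an illustration, a minimal sketch run from the Spark shell with spark.sql (assuming the test.words table created earlier in this section) that combines several of these clauses:

val topUsers = spark.sql(
  "SELECT user, SUM(`count`) AS total FROM test.words " +
  "WHERE `count` > 1 GROUP BY user ORDER BY total DESC")
topUsers.show()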


The following syntax defines an INSERT query.

INSERT [OVERWRITE] INTO [keyspace name.]table name
  VALUES values

The following syntax defines a CACHE TABLE query.

CACHE TABLE table name [AS table alias]

You can remove a table from the cache using an UNCACHE TABLE query.

UNCACHE TABLE table name

Keywords in Spark SQL


The following keywords are reserved in Spark SQL.
ALL, AND, AS, ASC, APPROXIMATE, AVG, BETWEEN, BY, CACHE, CAST, COUNT, DESC,
DISTINCT, FALSE, FIRST, LAST, FROM, FULL, GROUP, HAVING, IF, IN, INNER, INSERT,
INTO, IS, JOIN, LEFT, LIMIT, MAX, MIN, NOT, NULL, ON, OR, OVERWRITE, LIKE, RLIKE,
UPPER, LOWER, REGEXP, ORDER, OUTER, RIGHT, SELECT, SEMI, STRING, SUM, TABLE,
TIMESTAMP, TRUE, UNCACHE, UNION, WHERE, INTERSECT, EXCEPT, SUBSTR, SUBSTRING,
SQRT, ABS
Inserting data into tables with static columns using Spark SQL
Static columns are mapped to different columns in Spark SQL and require special handling. Spark SQL Thrift
servers use Hive. When you run an insert query, you must pass data to those columns.
To work around the different columns, set cql3.output.query in the insertion Hive table properties to
limit the columns that are being inserted. In Spark SQL, alter the external table to configure the prepared
statement as the value of the Hive CQL output query. For example, this prepared statement takes values that
are inserted into columns a and b in mytable and maps these values to columns b and a, respectively, for
insertion into the new row.

spark-sql> ALTER TABLE mytable SET TBLPROPERTIES ('cql3.output.query' =
             'update mykeyspace.mytable set b = ? where a = ?');
spark-sql> ALTER TABLE mytable SET SERDEPROPERTIES ('cql3.update.columns' = 'b,a');

Running HiveQL queries using Spark SQL


Spark SQL supports queries written using HiveQL, a SQL-like language that produces queries that are
converted to Spark jobs. HiveQL is more mature and supports more complex queries than Spark SQL. To
construct a HiveQL query, first create a new HiveContext instance, and then submit the queries by calling
the sql method on the HiveContext instance.
See the Hive Language Manual for the full syntax of HiveQL.

Creating indexes with DEFERRED REBUILD is not supported in Spark SQL.


1. Start the Spark shell.

$ bin/dse spark

2. Use the provided HiveContext instance sqlContext to create a new query in HiveQL by calling the sql
method on the sqlContext object.

val results = sqlContext.sql("SELECT * FROM my_keyspace.my_table")

Using the DataFrames API


The Spark DataFrames API encapsulates data sources, including DataStax Enterprise data, organized into
named columns.
The Spark Cassandra Connector provides an integrated DataSource to simplify creating DataFrames. For
more technical details, see the Spark Cassandra Connector documentation that is maintained by DataStax
and the Cassandra and PySpark DataFrames post.
Examples of using the DataFrames API
This Python example shows using the DataFrames API to read from the table ks.kv and insert into a different
table ks.othertable.

$ dse pyspark

table1 = spark.read.format("org.apache.spark.sql.cassandra")
.options(table="kv", keyspace="ks")
.load()
table1.write.format("org.apache.spark.sql.cassandra")
.options(table="othertable", keyspace = "ks")
.save(mode ="append")

Using the DSE Spark console, the following Scala example shows how to create a DataFrame object from
one table and save it to another.

$ dse spark

val table1 = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
table1.createCassandraTable("test", "otherwords",
  partitionKeyColumns = Some(Seq("word")),
  clusteringKeyColumns = Some(Seq("count")))
table1.write.cassandraFormat("otherwords", "test").save()

The write operation uses one of the helper methods, cassandraFormat, included in the Spark Cassandra
Connector. This is a simplified way of setting the format and options for a standard DataFrame operation. The
following command is equivalent to the write operation using cassandraFormat:

table1.write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "otherwords", "keyspace" -> "test"))
  .save()
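To verify the copy, the new table can be read back with the same helper; a minimal sketch under the same assumptions as the example above:

// Read back the copy created above and confirm the row count matches the source
val otherwords = spark.read.cassandraFormat("otherwords", "test").load
println(otherwords.count() == table1.count())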

Using the Spark SQL Thriftserver


The Spark SQL Thriftserver uses JDBC and ODBC interfaces for client connections to the database.
The AlwaysOn SQL service is a high-availability service built on top of the Spark SQL Thriftserver. The Spark
SQL Thriftserver is started manually on a single node in an Analytics datacenter, and will not failover to
another node. Both AlwaysOn SQL and the Spark SQL Thriftserver provide JDBC and ODBC interfaces to
DSE, and share many configuration settings.

1. If you are using Kerberos authentication, in the hive-site.xml file, configure your authentication
credentials for the Spark SQL Thrift server.

<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>thriftserver/_HOST@KERBEROS DOMAIN</value>
</property>

<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/etc/dse/dse.keytab</value>
</property>

Ensure that you use the hive-site.xml file in the Spark directory:

• Package installations: /etc/dse/spark/hive-site.xml

• Tarball installations: installation_location/resources/spark/conf/hive-site.xml

2. Start DataStax Enterprise with Spark enabled as a service or in a standalone installation.

3. Start the server by entering the dse spark-sql-thriftserver start command as a user with
permissions to write to the Spark directories.
To override the default settings for the server, pass in the configuration property using the --hiveconf
option. See the HiveServer2 documentation for a complete list of configuration properties.

$ dse spark-sql-thriftserver start

By default, the server listens on port 10000 on the localhost interface on the node from which it was
started. You can specify the server to start on a specific port. For example, to start the server on port
10001, use the --hiveconf hive.server2.thrift.port=10001 option.

$ dse spark-sql-thriftserver start --hiveconf hive.server2.thrift.port=10001

You can configure the port and bind address permanently in resources/spark/conf/spark-env.sh:

export HIVE_SERVER2_THRIFT_PORT=10001
export HIVE_SERVER2_THRIFT_BIND_HOST=1.1.1.1

You can specify general Spark configuration settings by using the --conf option.

$ dse spark-sql-thriftserver start --conf spark.cores.max=4

4. Use DataFrames to read and write large volumes of data. For example, to create the table_a_cass_df
table that uses a DataFrame while referencing table_a:

CREATE TABLE table_a_cass_df USING org.apache.spark.sql.cassandra
  OPTIONS (table "table_a", keyspace "ks")

With DataFrames, compatibility issues exist with UUID and Inet types when inserting data with the
JDBC driver.


5. Use the Spark Cassandra Connector tuning parameters to optimize reads and writes.

6. To stop the server, enter the dse spark-sql-thriftserver stop command.

$ dse spark-sql-thriftserver stop

What's next:
You can now connect your application to the server using the Simba JDBC driver at the URI
jdbc:hive2://hostname:port number, using the Simba ODBC driver, or using dse beeline.

Using SparkR with DataStax Enterprise


Apache SparkR is a front-end for the R programming language for creating analytics applications. DataStax
Enterprise integrates SparkR to support creating data frames from DSE data.
SparkR support in DSE requires you to first install R on the client machines on which you will be using SparkR.
To use R user-defined functions and distributed functions, the same version of R should be installed on all the
nodes in the Analytics cluster. DSE SparkR is built against R version 3.1.1. Many Linux distributions by default
install older versions of R.
For example, on Debian and Ubuntu clients:

$ sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list' \
  && gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
  && gpg -a --export E084DAB9 | sudo apt-key add - \
  && sudo apt-get update \
  && sudo apt-get install r-base

On RedHat and CentOS clients:

$ sudo yum install R

Starting SparkR
Start the SparkR shell using the dse command to automatically set the Spark session within R.

1. Start the R shell using the dse command.

$ dse sparkR

Using AlwaysOn SQL service


AlwaysOn SQL is a high availability service that responds to SQL queries from JDBC and ODBC applications.
By default, AlwaysOn SQL is disabled. It is built on top of the Spark SQL Thriftserver, but provides failover and
caching between instances so there is no single point of failure. AlwaysOn SQL provides enhanced security,
leveraging the same user management as the rest of DSE, executing queries to the underlying database as the
user authenticated to AlwaysOn SQL.
In order to run AlwaysOn SQL, you must have:

• A running datacenter with DSE Analytics nodes enabled.

• Enabled AlwaysOn SQL on every Analytics node in the datacenter.

• Modified the replication factor for all Analytics nodes, if necessary.

• Set the native_transport_address in cassandra.yaml to an IP address accessible by the AlwaysOn SQL
clients. This address depends on your network topology and deployment scenario.

• Configured AlwaysOn SQL for security, if authentication is enabled.

Lifecycle Manager allows you to enable and configure AlwaysOn SQL in managed clusters.


When AlwaysOn SQL is enabled within an Analytics datacenter, all nodes within the datacenter must have
AlwaysOn SQL enabled. Use dsetool ring to find which nodes in the datacenter are Analytics nodes.

AlwaysOn SQL is not supported when using DSE Multi-Instance or other deployments with multiple DSE
instances on the same server.

The dse client-tool alwayson-sql command controls the server. The command works on the local
datacenter unless you specify the datacenter with the --dc option:

$ dse client-tool alwayson-sql --dc datacenter_name command

Enabling AlwaysOn SQL


Set enabled to true and uncomment the AlwaysOn SQL options in dse.yaml.
Configuring AlwaysOn SQL
The alwayson_sql_options section in dse.yaml, described in detail at AlwaysOn SQL options, has options
for setting the ports, timeout values, log location, and other Spark or Hive configuration settings. Additional
configuration options are located in spark-alwayson-sql.conf.
AlwaysOn SQL binds to the native_transport_address in cassandra.yaml.
If you have changed some configuration settings in dse.yaml while AlwaysOn SQL is running, you can have the
server pick up the new configuration by entering:

$ dse client-tool alwayson-sql reconfig

The following settings can be changed using reconfig:

• reserve_port_wait_time_ms

• alwayson_sql_status_check_wait_time_ms

• log_dsefs_dir

• runner_max_errors

Changing other options requires a restart, except for the enabled option. Enabling or disabling AlwaysOn
SQL requires restarting DSE.
The spark-alwayson-sql.conf file contains Spark and Hive settings as properties. When AlwaysOn SQL is
started, spark-alwayson-sql.conf is scanned for Spark properties, similar to other Spark applications started
with dse spark-submit. Properties that begin with spark.hive are submitted as properties using --hiveconf,
removing the spark. prefix.
For example, if spark-alwayson-sql.conf has the following setting:

spark.hive.server2.table.type.mapping CLASSIC

That setting will be converted to --hiveconf hive.server2.table.type.mapping=CLASSIC when AlwaysOn SQL is started.
Configuring AlwaysOnSQL in a DSE Analytics Solo datacenter
If AlwaysOn SQL is used in a DSE Analytics Solo datacenter, modify spark-alwayson-sql.conf to configure
Spark with the DSE Analytics Solo datacenters. In the following example, the transactional datacenter name is
dc0 and the DSE Analytics Solo datacenter is dc1.


Under spark.master, set the Spark master URI to connect to the DSE Analytics Solo datacenter.

spark.master=dse://?connection.local_dc=dc1

Add the spark.cassandra.connection.local_dc property to spark-alwayson-sql.conf and set it to the name of the transactional datacenter.

spark.cassandra.connection.local_dc=dc0

Starting and stopping AlwaysOn SQL


If you have enabled AlwaysOn SQL, it will start when the cluster is started. If AlwaysOn SQL is enabled and
DSE is restarted, AlwaysOn SQL will be started regardless of the previous state of AlwaysOn SQL. You only
need to explicitly start the server if it has been stopped, for example for a configuration change.
To start AlwaysOn SQL service:

$ dse client-tool alwayson-sql start

To start the server on a specific datacenter, specify the datacenter name with the --dc option:

$ dse client-tool alwayson-sql --dc dc-west start

To completely stop AlwaysOn SQL service:

$ dse client-tool alwayson-sql stop

The server must be manually started after issuing a stop command.


To restart a running server:

$ dse client-tool alwayson-sql restart

Checking the status of AlwaysOn SQL


To find the status of AlwaysOn SQL, issue a status command using dse client-tool.

$ dse client-tool alwayson-sql status

You can also view the status in a web browser by going to http://node name or IP address:AlwaysOn SQL
web UI port. By default, the port is 9077. For example, if 10.10.10.1 is the IP address of an Analytics node with
AlwaysOn SQL enabled, navigate to http://10.10.10.1:9077.
The returned status is one of:

• RUNNING: the server is running and ready to accept client requests.

• STOPPED_AUTO_RESTART: the server is being started but is not yet ready to accept client requests.

• STOPPED_MANUAL_RESTART: the server was stopped with either a stop or restart command. If the server
was issued a restart command, the status will be changed to STOPPED_AUTO_RESTART as the server
starts again.

• STARTING: the server is actively starting up but is not yet ready to accept client requests.


Caching tables within Spark SQL queries


To increase performance, you can specify tables to be cached into RAM using the CACHE TABLE directive.
Permanent cached tables will be recached on server restart.
You can cache an existing table by issuing a CACHE TABLE Spark SQL command through a client:

CACHE TABLE keyspace_name.table_name;

CACHE TABLE keyspace_name.table_name AS select statement;

The temporary cache table is only valid for the session in which it was created, and will not be recreated on
server restart.
Create a permanent cache table using the CREATE CACHE TABLE directive and a SELECT query:

CREATE CACHE TABLE keyspace_name.table_name AS select_statement;

The table cache can be destroyed using the UNCACHE TABLE and CLEAR CACHE directives.

UNCACHE TABLE keyspace_name.table_name;

The CLEAR CACHE directive removes the table cache.

CLEAR CACHE;

Issuing DROP TABLE will remove all metadata including the table cache.
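The same directives can also be issued through spark.sql from the Spark shell; a minimal sketch (assuming the test.words table used earlier in this guide):

spark.sql("CACHE TABLE test.words")              // pin the table in memory
spark.sql("SELECT * FROM test.words").show()     // served from the cache
spark.sql("UNCACHE TABLE test.words")            // release the cached data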
Enabling SSL for AlwaysOn SQL
Communication between the driver and AlwaysOn SQL can be encrypted using SSL.
The following instructions give an example of how to set up SSL with a self-signed keystore and truststore.

1. Ensure client-to-node encryption is enabled and configured correctly.

2. If the SSL keystore and truststore used for AlwaysOn SQL differ from the keystore and truststore
configured in cassandra.yaml, add the required settings to enable SSL to the hive-site.xml configuration
file.

By default the SSL settings in cassandra.yaml will be used with AlwaysOn SQL.

<property>
<name>hive.server2.thrift.bind.host</name>
<value>hostname</value>
</property>
<property>
<name>hive.server2.use.SSL</name>
<value>true</value>
</property>
<property>
<name>hive.server2.keystore.path</name>
<value>path to keystore/keystore.jks</value>
</property>
<property>
<name>hive.server2.keystore.password</name>
<value>keystore password</value>


</property>

3. Start or restart the AlwaysOn SQL service.

Changes in the hive-site.xml configuration file only require a restart of AlwaysOn SQL service,
not DSE.

$ dse client-tool alwayson-sql start

4. Test the connection with Beeline.

$ dse beeline

beeline> !connect jdbc:hive2://hostname:10000/default;ssl=true;sslTrustStore=path to truststore/truststore.jks;trustStorePassword=truststore password

The JDBC URL for the Simba JDBC Driver is:

jdbc:spark://hostname:10000/default;SSL=1;SSLTrustStore=path to truststore/
truststore.jks;SSLTrustStorePwd=truststore password

Using authentication with AlwaysOn SQL


AlwaysOn SQL can be configured to use DSE authentication.
When DSE authentication is enabled, modify the hive-site.xml configuration file to enable JDBC authentication.
DSE supports configurations for password authentication and Kerberos authentication. The hive-site.xml
file has sections with preconfigured settings to use no authentication (the default), password authentication, or
Kerberos authentication. Uncomment the preferred authentication mechanism, then restart AlwaysOn SQL.

DSE supports multiple authentication mechanisms, but AlwaysOn SQL only supports one mechanism per
datacenter.

AlwaysOn SQL supports DSE proxy authentication. The user who executes the queries is the user who
authenticated using JDBC. If AlwaysOn SQL was started by user Amy, and then Bob begins a JDBC session,
the queries are executed by Amy on behalf of Bob. Amy must have permissions to execute these queries on
behalf of Bob.
To enable authentication in AlwaysOn SQL alwayson_sql_options, follow these steps.

1. Create the auth_user role specified in AlwaysOn SQL options and grant the following permissions to the
role.

CREATE ROLE alwayson_sql WITH LOGIN=true; // role name matches auth_user

// Required if scheme_permissions true
GRANT EXECUTE ON ALL AUTHENTICATION SCHEMES TO alwayson_sql;

// Spark RPC settings
GRANT ALL PERMISSIONS ON REMOTE OBJECT DseResourceManager TO alwayson_sql;
GRANT ALL PERMISSIONS ON REMOTE OBJECT DseClientTool TO alwayson_sql;
GRANT ALL PERMISSIONS ON REMOTE OBJECT AlwaysOnSqlRoutingRPC to alwayson_sql;
GRANT ALL PERMISSIONS ON REMOTE OBJECT AlwaysOnSqlNonRoutingRPC to alwayson_sql;

// Spark and DSE required table access
GRANT SELECT ON system.size_estimates TO alwayson_sql;
GRANT SELECT, MODIFY ON "HiveMetaStore".sparkmetastore TO alwayson_sql;
GRANT SELECT, MODIFY ON dse_analytics.alwayson_sql_cache_table TO alwayson_sql;


GRANT SELECT, MODIFY ON dse_analytics.alwayson_sql_info TO alwayson_sql;

// Permissions to create and change applications
GRANT CREATE, DESCRIBE ON ANY WORKPOOL TO alwayson_sql;
GRANT MODIFY, DESCRIBE ON ANY SUBMISSION TO alwayson_sql;

See Setting up DSE Spark application permissions for more details.

2. Create the user role.


For internal authentication:

CREATE ROLE 'user_name'
  WITH LOGIN = true;

If you use Kerberos, set up a role that matches the full Kerberos principal name for each user.

CREATE ROLE 'user_name/hostname@REALM'
  WITH LOGIN = true;

3. Grant permissions to access keyspaces and tables to the user role.


For internal roles:

GRANT SELECT ON KEYSPACE keyspace_name
  TO 'user_name';

For Kerberos roles:

GRANT SELECT ON KEYSPACE keyspace_name
  TO 'user_name/hostname@REALM';

4. Allow the AlwaysOn SQL role (auth_user) to execute commands with the user role.
For internal roles:

GRANT PROXY.EXECUTE
ON ROLE 'user_name'
TO alwayson_sql;

For Kerberos roles:

GRANT PROXY.EXECUTE
ON ROLE 'user_name/hostname@REALM'
TO alwayson_sql;

5. Open the hive-site.xml configuration file in an editor.

6. Uncomment and modify the authentication mechanism used in hive-site.xml.

• If password authentication is used, enable password authentication in DSE.

• If Kerberos authentication is to be used, Kerberos does not need to be enabled in DSE. AlwaysOn
SQL must have its own service principal and keytab.

• The user must have login permissions in DSE in order to login through JDBC to AlwaysOn SQL.


This example shows how to enable Kerberos authentication. Modify the Kerberos domain and path to the
keytab file.

<!-- Start of: configuration for authenticating JDBC users with Kerberos -->
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>

<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>

<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hiveserver2/_HOST@KERBEROS DOMAIN</value>
</property>

<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>path to hiveserver2.keytab</value>
</property>
<!-- End of: configuration for authenticating JDBC users with Kerberos -->

7. Modify the owner of the /spark and /tmp/hive directories in DSEFS so the new role can write to the log
and temp files.

$ dse fs 'chown -R -u alwayson_sql -g alwayson_sql /spark'

$ dse fs 'chown -R -u alwayson_sql -g alwayson_sql /tmp/hive'

8. Restart AlwaysOn SQL.

$ dse client-tool alwayson-sql restart

Simba JDBC Driver for Apache Spark


The Simba JDBC Driver for Spark provides a standard JDBC interface to the information stored in DataStax
Enterprise with AlwaysOn SQL running.
See Installing Simba JDBC Driver for Apache Spark.
Simba ODBC Driver for Apache Spark
The Simba ODBC Driver for Spark provides users access to DataStax Enterprise (DSE) clusters with a
running AlwaysOn SQL. The driver is compliant with the latest ODBC 3.52 specification and automatically
translates any SQL-92 query into Spark SQL.
See Installing Simba ODBC Driver for Apache Spark (https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/installSimbaODBCdriver.html).
Connecting to AlwaysOn SQL server using Beeline
You can use the Beeline shell to test AlwaysOn SQL.

1. Start AlwaysOn SQL.


2. Start the Beeline shell.

$ dse beeline

3. Connect to the server using the JDBC URI for your server.

beeline> !connect jdbc:hive2://localhost:10000

4. Connect to a keyspace and run a query from the Beeline shell.

0: jdbc:hive2://localhost:10000> use test;


0: jdbc:hive2://localhost:10000> select * from test;

Accessing DataStax Enterprise data from external Spark clusters


DataStax Enterprise works with external Spark clusters in a bring-your-own-Spark (BYOS) model.
Overview of BYOS support in DataStax Enterprise
BYOS support in DataStax Enterprise consists of a JAR file and a generated configuration file that provides
all the necessary classes and configuration settings for connecting to a particular DataStax Enterprise cluster
from an external Spark cluster. To specify a different classpath to accommodate applications originally written
for open source Apache Spark, specify the -framework option with dse spark commands.
All DSE resources, including DSEFS file locations, can be accessed from the external Spark cluster.
BYOS is tested against the version of Spark integrated into DSE (described in the DataStax Enterprise 6.0
release notes) and the following Spark distributions:

• Hortonworks Data Platform (HDP) 2.5

• Cloudera CDH 5.10

Generating the BYOS configuration file


The byos.properties file is used to connect to a DataStax Enterprise cluster from a Spark cluster. The
configuration file contains connection information about the DataStax Enterprise cluster. This file must
be generated on a node in the DataStax Enterprise cluster. You can specify an arbitrary name for the
generated configuration file. The byos.properties name is used throughout the documentation to refer to this
configuration file.
Prerequisites:
If you are using Graph OLAP queries with BYOS, increase the max_concurrent_sessions setting in your
cluster to 120.

1. Connect to a node in your DataStax Enterprise cluster.

2. Generate the byos.properties file using the dse client-tool command.

$ dse client-tool configuration byos-export ~/byos.properties

This will generate the byos.properties file in your home directory. See dse client-tool for more
information on the options for dse client-tool.

What's next:
The byos.properties file can be copied to a node in the external Spark cluster and used with the Spark shell,
as described in Connecting to DataStax Enterprise using the Spark shell on an external Spark cluster.
Connecting to DataStax Enterprise using the Spark shell on an external Spark cluster
Use the generated byos.properties configuration file and the byos-version.jar from a DataStax Enterprise
node to connect to the DataStax Enterprise cluster from the Spark shell on an external Spark cluster.


Prerequisites:
You must generate the byos.properties on a node in your DataStax Enterprise cluster.

1. Copy the byos.properties file you previously generated from the DataStax Enterprise node to the local
Spark node.

$ scp user@dse_host:~/byos.properties .

If you are using Kerberos authentication, specify the --generate-token and --token-renewer
<username> options when generating byos.properties, as described in dse client-tool configuration
byos-export.

2. Copy the byos-version.jar file from the clients directory from a node in your DataStax Enterprise cluster
to the local Spark node.
The byos-version.jar file location depends on the type of installation.

$ scp user@dse_host:/usr/share/dse/clients/dse-byos_2.11-6.0.2.jar byos-6.0.jar

3. Merge external Spark properties into byos.properties.

$ cat ${SPARK_HOME}/conf/spark-defaults.conf >> byos.properties

4. If you are using Kerberos authentication, set up a CRON job or other task scheduler to periodically call
dse client-tool cassandra renew-token <token> where <token> is the encoded token string in
byos.properties.

5. Start the Spark shell using the byos.properties and byos-version.jar file.

$ spark-shell --jars byos-6.0.jar --properties-file byos.properties
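Once the shell is up, DSE tables can be read through the Spark Cassandra data source as in any other Spark application; a minimal sketch (the keyspace and table names are illustrative):

val words = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
words.show()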

Generating Spark SQL schema files


Spark SQL can import schema files generated by DataStax Enterprise.

1. Export the schema file using dse client-tool.

$ dse client-tool --use-server-config spark sql-schema --all > output.sql

2. Copy the schema to an external Spark node.

$ scp output.sql user@spark_host:

3. On a Spark node, import the schema using Spark.

$ spark-sql --jars byos-6.0.jar --properties-file byos.properties -f output.sql

Starting Spark SQL Thrift Server with Kerberos


Spark SQL Thrift Server is a long running service and must be configured to start with a keytab file if Kerberos
is enabled. The user principal must be added to DSE, and Spark SQL Thrift Server restarted with the
generated BYOS configuration file and byos-version.jar.
Prerequisites:
These instructions are for the Spark SQL Thrift Server included in HortonWorks 2.4. The Hadoop Spark SQL
Thrift Server principal is hive/_HOST@REALM.


1. Create the principal on the DSE node using cqlsh.

create user hive/spark_sql_thrift_server_host@REALM;

2. Login as the hive user on the Spark SQL Thrift Server host.

3. Create a ~/.java.login.config file with a JAAS Kerberos configuration.

4. Merge the existing Spark SQL Thrift Server configuration properties with the generated BYOS
configuration file into a new file.

$ cat /usr/hdp/current/spark-thriftserver/conf/spark-thrift-sparkconf.conf
byos.properties > custom-sparkconf.conf

5. Start Spark SQL Thrift Server with the custom configuration file and byos-version.jar.

$ /usr/hdp/2.4.2.0-258/spark/sbin/start-thriftserver.sh --jars byos-version.jar \
  --properties-file custom-sparkconf.conf

6. Connect using the Beeline client.

$ beeline -u 'jdbc:hive2://hostname:port/default;principal=hive/_HOST@REALM'

What's next:
Generated SQL schema files can be passed to beeline with the -f option to generate a mapping for DSE
tables so both Hadoop and DataStax Enterprise tables will be available through the service for queries.
Using the Spark Jobserver
DataStax Enterprise includes a bundled copy of the open-source Spark Jobserver, an optional component
for submitting and managing Spark jobs, Spark contexts, and JARs on DSE Analytics clusters. Refer to the
Components in the release notes to find the version of the Spark Jobserver included in this version of DSE.
Valid spark-submit options are supported and can be applied to the Spark Jobserver. To use the Jobserver:

• Start the job server:

$ dse spark-jobserver start [any_spark_submit_options]

• Stop the job server:

$ dse spark-jobserver stop

The default location of the Spark Jobserver depends on the type of installation:

• Package installations: /usr/share/dse/spark/spark-jobserver

• Tarball installations: installation_location/resources/spark/spark-jobserver

All the uploaded JARs, temporary files, and log files are created in the user's $HOME/.spark-jobserver
directory, first created when starting Spark Jobserver.
Beneficial use cases for the Spark Jobserver include sharing cached data, repeated queries of cached data,
and faster job starts.

Running multiple SparkContext instances in a single JVM is not recommended. Therefore it is not
recommended to create a new SparkContext for each submitted job in a single Spark Jobserver instance.
We recommend one of the two following Spark Jobserver usages.

• Persistent Context Mode: a single pre-created SparkContext shared by all jobs.

• Context per JVM: each job has its own SparkContext in a separate JVM.


By default, the H2 database is used for storing Spark Jobserver related metadata. In this setup, using
Context per JVM requires additional configuration. See the Spark Jobserver docs for details.

In Context per JVM mode, job results must not contain instances of classes that are not present in the
Spark Jobserver classpath. Problems with returning unknown (to server) types can be recognized by
the following log line:

Association with remote system [akka.tcp://actor_system_name@host:45153]
has failed, address is now gated for [5000] ms.
Reason: [<unknown type name is placed here>]

Please consult Spark Jobserver docs to see configuration details.

For an example of how to create and submit an application through the Spark Jobserver, see the spark-
jobserver demo included with DSE.
The default location of the demos directory depends on the type of installation:

• Package installations: /usr/share/dse/demos

• Tarball installations: installation_location/demos

Enabling SSL communication with Jobserver


To enable SSL encryption when connecting to the Jobserver, you must have a server certificate and a keystore
containing the certificate. Add the following configuration section to the dse.conf file in the Spark Jobserver
directory.

spray.can.server {
ssl-encryption = on
keystore = "path to keystore"
keystorePW = "keystore password"
}

The default location of the Spark Jobserver depends on the type of installation:

• Package installations: /usr/share/dse/spark/spark-jobserver

• Tarball installations: installation_location/resources/spark/spark-jobserver

Restart the Jobserver after saving the configuration changes.


DSEFS (DataStax Enterprise file system)
DSEFS is the default distributed file system on DSE Analytics nodes.
About DSEFS
DSEFS (DataStax Enterprise file system) is a fault-tolerant, general-purpose, distributed file system within
DataStax Enterprise. It is designed for use cases that need to leverage a distributed file system for data
ingestion, data staging, and state management for Spark Streaming applications (such as checkpointing or
write-ahead logging). DSEFS is similar to HDFS, but avoids the deployment complexity and single point of
failure typical of HDFS. DSEFS is HDFS-compatible and is designed to work in place of HDFS in Spark and
other systems.
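For example, a minimal sketch from the DSE Spark shell that writes to and reads from a DSEFS path (the path is illustrative and must not already exist):

val numbers = sc.parallelize(1 to 100)
numbers.saveAsTextFile("dsefs:///tmp/numbers")        // write a text file to DSEFS
println(sc.textFile("dsefs:///tmp/numbers").count())  // read it back: 100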
DSEFS is the default distributed file system in DataStax Enterprise, and is automatically enabled on all analytics
nodes.
DSEFS stores file metadata (such as file path, ownership, permissions) and file contents separately:

• Metadata is stored in the database.

• File data blocks are stored locally on each node and are replicated onto multiple nodes.


The redundancy factor is set at the DSEFS directory or file level, which is more granular than the
replication factor that is set at the keyspace level in the database.

For performance on production clusters, store the DSEFS data on physical devices that are separate from
the database. For development and testing you may store DSEFS data on the same physical device as the
database.
Deployment overview

• The DSEFS server runs in the same JVM as DataStax Enterprise. Similar to the database, there is no
master node. All nodes running DSEFS are equal.

• A single DSEFS cannot span multiple datacenters. To deploy DSEFS in multiple datacenters, you can
create a separate instance of DSEFS for each datacenter.

• You can use different keyspaces to configure multiple DSEFS file systems in a single datacenter.

• For optimal performance, locate the local DSEFS data on a different physical drive than the database.

• Encryption is not supported. Use operating system access controls to protect the local DSEFS data
directories. Other limitations apply.

• DSEFS uses the LOCAL_QUORUM consistency level to store file metadata. DSEFS will always try to write
each data block to replicated node locations, and even if a write fails, it will retry to another node before
acknowledging the write. DSEFS writes are very similar to the ALL consistency level, but with additional
failover to provide high-availability. DSEFS reads are similar to the ONE consistency level.

Enabling DSEFS
DSEFS is automatically enabled on analytics nodes, and disabled on non-analytics nodes. You can enable the
DSEFS service on any node in a DataStax Enterprise cluster. Nodes within the same datacenter with DSEFS
enabled will join together to behave as a DSEFS cluster.

On each node:

1. In the dse.yaml file, set the properties for the DSE File System options:

dsefs_options:
enabled:
keyspace_name: dsefs
work_dir: /var/lib/dsefs
public_port: 5598
private_port: 5599
data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
min_free_space: 5368709120

a. Enable DSEFS:

enabled: true

If enabled is blank or commented out, DSEFS starts only if the node is configured to run analytics
workloads.

b. Define the keyspace for storing the DSEFS metadata:

keyspace_name: dsefs

You can optionally configure multiple DSEFS file systems in a single datacenter.


c. Define the work directory for storing the DSEFS metadata for the local node. The work directory
should not be shared with other DSEFS nodes:

work_dir: /var/lib/dsefs

d. Define the public port on which DSEFS listens for clients:

public_port: 5598

DataStax recommends that all nodes in the cluster have the same value. Firewalls must open
this port to trusted clients. The service on this port is bound to the native_transport_address.

e. Define the private port for DSEFS inter-node communication:

private_port: 5599

Do not open this port to firewalls; this private port must be not visible from outside of the
cluster.

f. Set the data directories where the file data blocks are stored locally on each node.

data_directories:
- dir: /var/lib/dsefs/data

If you use the default /var/lib/dsefs/data data directory, verify that the directory exists and
that you have root access. Otherwise, you can define your own directory location, change the
ownership of the directory, or both:

$ sudo mkdir -p /var/lib/dsefs/data; sudo chown -R $USER:$GROUP /var/lib/dsefs/data

Ensure that the data directory is writeable by the DataStax Enterprise user. Put the data
directories on different physical devices than the database. Using multiple data directories on
JBOD improves performance and capacity.

g. For each data directory, set the weighting factor to specify how much data to place in this directory,
relative to other directories in the cluster. This soft constraint determines how DSEFS distributes
the data. For example, a directory with a value of 3.0 receives about three times more data than a
directory with a value of 1.0.

data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0

h. For each data directory, define the reserved space, in bytes, to not use for storing file data blocks.
See min_free_space.

data_directories:
- dir: /var/lib/dsefs/data
storage_weight: 1.0
min_free_space: 5368709120

2. Restart the node.


3. Repeat steps for the remaining nodes.

4. With guidance from DataStax Support, you can tune advanced DSEFS properties:

# service_startup_timeout_ms: 30000
# service_close_timeout_ms: 600000
# server_close_timeout_ms: 2147483647 # Integer.MAX_VALUE
# compression_frame_max_size: 1048576
# query_cache_size: 2048
# query_cache_expire_after_ms: 2000
# gossip_options:
# round_delay_ms: 2000
# startup_delay_ms: 5000
# shutdown_delay_ms: 10000
# rest_options:
# request_timeout_ms: 330000
# connection_open_timeout_ms: 55000
# client_close_timeout_ms: 60000
# server_request_timeout_ms: 300000
# idle_connection_timeout_ms: 60000
# internode_idle_connection_timeout_ms: 120000
# core_max_concurrent_connections_per_host: 8
# transaction_options:
# transaction_timeout_ms: 3000
# conflict_retry_delay_ms: 200
# conflict_retry_count: 40
# execution_retry_delay_ms: 1000
# execution_retry_count: 3
# block_allocator_options:
# overflow_margin_mb: 1024
# overflow_factor: 1.05

5. Continue with using DSEFS.

Disabling DSEFS
To disable DSEFS and remove metadata and data:

1. Remove all directories and files from the DSEFS file system:

$ dse fs rm -r filepath

2. Wait a while for all nodes to perform the delete operations.

3. Verify that all DSEFS data directories where the file data blocks are stored locally on each node are empty.
These data directories are configured in dse.yaml. Your directories are probably different from this
default data_directories value:

data_directories:
- dir: /var/lib/dsefs/data

4. Disable the DSEFS entries in all dse.yaml files on all nodes.

5. Restart DataStax Enterprise.

6. Truncate all of the tables in the dsefs keyspace.


Do not remove the dsefs keyspace. If you inadvertently removed the dsefs keyspace, you must
specify a different keyspace name in dse.yaml or create an empty dsefs keyspace (this empty dsefs
keyspace will be populated with tables during DSEFS start up).


Do not delete the data_directories before removing the dsefs keyspace tables, or removing the
node from the cluster.

Configuring DSEFS
You must configure data replication. You can optionally configure multiple DSEFS file systems in a datacenter,
and perform other functions, including setting the Kafka log retention.
DSEFS does not span datacenters. Create a separate DSEFS instance in each datacenter, as described in the
steps below.
DSEFS limitations
Know these limitations when you configure and tune DSEFS. The following functionality and features are not
supported:

• Encryption.
Use operating system access controls to protect the local DSEFS data directories.

• File system consistency checks (fsck) and file repair have only limited support. Running fsck will re-
replicate blocks that were under-replicated because a node was taken out of a cluster.

• File repair.

• Forced rebalancing, although the cluster will eventually reach balance.

• Checksum.

• Automatic backups.

• Multi-datacenter replication.

• Symbolic links (soft links, symlinks) and hardlinks.

• Snapshots.

1. Configure replication for the metadata and the data blocks.


You must set the replication factor appropriately to prevent data loss in the case of node failure.
Replication factors must be set for both the metadata and the data blocks. The replication factor of 3 for
data blocks is suitable for most use-cases.

a. Globally: set replication for the metadata in the dsefs keyspace that is stored in the database.
For example, use a CQL statement to configure a replication factor of 3 on the Analytics
datacenter using NetworkTopologyStrategy:

ALTER KEYSPACE dsefs
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'Analytics': '3'};

Datacenter names are case-sensitive. Verify the case of the datacenter name using a utility such as dsetool
status.

b. Run nodetool repair on the DSEFS keyspace.

$ nodetool repair dsefs

c. Locally: set the redundancy factor on a specific DSEFS file or directory where the data blocks are
stored.


For example, use the command line:

$ dse fs mkdir -n 4 newdirectory

When a redundancy factor is not specified, it is inherited from the parent directory. The default
redundancy factor is 3.

2. If you have multiple Analytics datacenters, you must configure each DSEFS file system to replicate within
its own datacenter:

a. In the dse.yaml file, specify a separate DSEFS keyspace for each logical datacenter.
For example, on a cluster with logical datacenters DC1 and DC2.
On each node in DC1:

dsefs_options:
...
keyspace_name: dsefs1

On each node in DC2:

dsefs_options:
...
keyspace_name: dsefs2

b. Restart the nodes.

c. Alter the keyspace replication to exist only on the specific datacenters.


On DC1:

ALTER KEYSPACE dsefs1
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'DC1': '3'};

On DC2:

ALTER KEYSPACE dsefs2
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'DC2': '3'};

d. Run nodetool repair on the DSEFS keyspace.

$ nodetool repair dsefs

For example, in a cluster with multiple datacenters, the keyspace names dsefs1 and dsefs2 define
separate file systems in each datacenter.

3. When bouncing a streaming application, verify the Kafka log configuration (especially
log.retention.check.interval.ms and log.retention.bytes). Ensure the Kafka log
retention policy is robust enough to handle the length of time expected to bring the application and
consumers back up.
For example, if the log retention policy is too conservative and deletes or rolls the logs very
frequently to save disk space, users are likely to encounter issues when attempting to recover from
a checkpoint that references offsets that are no longer maintained by the Kafka logs.


DSEFS command line tool


The DSEFS functionality supports operations including uploading, downloading, moving, and deleting files,
creating directories, and verifying the DSEFS status.
DSEFS commands are available only in the logical datacenter. DSEFS works with secured and unsecured
clusters, see DSEFS authentication.
You can interact with the DSEFS file system in several modes:

• Interactive command line shell.

To start DSEFS and launch the DSE FS shell:

$ dse fs

• As part of dse commands.

• With a REST API.

Configuring DSEFS shell logging


The default location of the DSEFS shell log file .dsefs-shell.log is the user home directory. The default
log level is INFO. To configure DSEFS shell logging, edit the installation_location/resources/dse/conf/
logback-dsefs-shell.xml file.

Using with the dse command line


Precede the DSEFS command with dse fs:

$ dse [dse_auth_credentials] fs dsefs_command [options]

For example, to list the file system status and disk space usage in human-readable format:

$ dse -u user1 -p mypassword fs "df -h"

Optional command arguments are enclosed in square brackets. For example, [dse_auth_credentials] and [-R].
Variable values are italicized. For example, directory and [subcommand].
Working with the local file system in the DSEFS shell
You can refer to files in the local file system by prefixing paths with file:. For example the following command
will list files in the system root directory:

dsefs dsefs://127.0.0.1:5598/ > ls file:/
bin  boot  cdrom  data  dev  etc  home  initrd.img  initrd.img.old  lib  lib32  lib64
lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var  vmlinuz  vmlinuz.old

If you need to perform many subsequent operations on the local file system, first change the current working
directory to file: or any local file system path:

dsefs dsefs://127.0.0.1:5598/ > cd file:
dsefs file:/home/user1/path/to/local/files > ls
conf src target build.sbt
dsefs file:/home/user1/path/to/local/files > cd ..


dsefs file:/home/user1/path/to/local >

DSEFS shell remembers the last working directory of each file system separately. To go back to the previous
DSEFS directory, enter:

dsefs file:/home/user1/path/to/local/files > cd dsefs:
dsefs dsefs://127.0.0.1:5598/ >

To go back again to the previous local directory:

dsefs dsefs://127.0.0.1:5598/ > cd file:
dsefs file:/home/user1/path/to/local/files >

To refer to a path relative to the last working directory of the file system, prefix a relative path with either dsefs:
or file:. The following session will create a directory new_directory in the directory /home/user1:

dsefs dsefs://127.0.0.1:5598/ > cd file:/home/user1
dsefs file:/home/user1 > cd dsefs:
dsefs dsefs://127.0.0.1:5598/ > mkdir file:new_directory
dsefs dsefs://127.0.0.1:5598/ > realpath file:new_directory
file:/home/user1/new_directory
dsefs dsefs://127.0.0.1:5598/ > stat file:new_directory
DIRECTORY file:/home/user1/new_directory:
Owner user1
Group user1
Permission rwxr-xr-x
Created 2017-01-15 13:10:06+0200
Modified 2017-01-15 13:10:06+0200
Accessed 2017-01-15 13:10:06+0200
Size 4096

To copy a file between two different file systems, you can also use the cp command with explicit file system
prefixes in the paths:

dsefs file:/home/user1/test > cp dsefs:archive.tgz another-archive-copy.tgz
dsefs file:/home/user1/test > ls
another-archive-copy.tgz archive-copy.tgz archive.tgz

Authentication
For dse dse_auth_credentials you can provide user credentials in several ways, see Providing credentials from
DSE tools. For authentication with DSEFS, see DSEFS authentication.
Wildcard support
Some DSEFS commands support wildcard pattern expansion in the path argument. Path arguments containing
wildcards are expanded before method invocation into a set of paths matching the wildcard pattern, then the
given method is invoked for each expanded path.
For example in the following directory tree:

dirA
|--dirB
|--file1
|--file2

Running the stat dirA/* command is transparently translated into three invocations: stat dirA/dirB,
stat dirA/file1, and stat dirA/file2.


DSEFS supports the following wildcard patterns:

• * matches any file system entry (file or directory) name, as in the example of stat dirA/*.

• ? matches any single character in the file system entry name. For example, stat dirA/dir? matches
dirA/dirB.

• [] matches any single character from the set enclosed within the brackets. For example, stat dirA/file[0123]
matches dirA/file1 and dirA/file2.

• {} matches any one of the comma-separated alternatives enclosed within the braces. For example,
stat dirA/{dirB,file2} matches dirA/dirB and dirA/file2.

There are no limitations on the number of wildcard patterns in a single path.
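Different pattern types can also be combined in a single path. Using the directory tree above, the following
command (a sketch) expands to stat dirA/file1 and stat dirA/file2:

dsefs dsefs://127.0.0.1:5598/ > stat dirA/???e[12]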


Executing multiple commands
DSEFS can execute multiple commands on one line. Use quotes around the commands and arguments. Each
command will be executed separately by DSEFS.

$ dse fs 'cat file1 file2 file3 file4' 'ls dir1'

Forcing synchronization
Before confirming that a file has been written, DSEFS by default forces all blocks of the file to be written to the
storage devices. This behavior can be controlled with the --no-force-sync and --force-sync flags when creating
files or directories in the DSEFS shell with the mkdir, put, and cp commands. If not specified, the force/no-force
behavior is inherited from the parent directory. For example, if a directory is created with --no-force-sync, then
all files in it are created with --no-force-sync unless --force-sync is explicitly set during file creation.
Turning off forced synchronization improves latency and performance at the cost of durability. For example,
if a power loss occurs before the data is written to the storage device, you may lose data. Turn off forced
synchronization only if you have a reliable backup power supply in your datacenter and the failure of all replicas
is unlikely, or if you can afford to lose file data.
The Hadoop SYNC_BLOCK flag has the same effect as --force-sync in DSEFS. The Hadoop LAZY_PERSIST
flag has the same effect as --no-force-sync in DSEFS.
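A minimal sketch of how these flags can be requested programmatically through the Hadoop FileSystem API
follows; the dsefs:// URI, file path, buffer, replication, and block size values are assumptions, and the DSEFS
Hadoop client libraries must be on the classpath:

import java.net.URI
import java.util.EnumSet

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{CreateFlag, FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

object ForceSyncExample extends App {
  // Connect to DSEFS through the Hadoop FileSystem API (the URI is an assumption).
  val fs = FileSystem.get(new URI("dsefs://127.0.0.1:5598/"), new Configuration())

  // CreateFlag.SYNC_BLOCK corresponds to DSEFS --force-sync;
  // CreateFlag.LAZY_PERSIST corresponds to DSEFS --no-force-sync.
  val out = fs.create(
    new Path("/tmp/example.dat"),                          // illustrative path
    FsPermission.getFileDefault,
    EnumSet.of(CreateFlag.CREATE, CreateFlag.SYNC_BLOCK),
    4096,                                                  // buffer size in bytes
    3.toShort,                                             // replication (redundancy)
    64L * 1024 * 1024,                                     // block size in bytes
    null)                                                  // no progress reporter

  out.write("forced-sync example".getBytes("UTF-8"))
  out.close()
  fs.close()
}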
Removing a DSEFS node
When removing a node running DSEFS from a DSE cluster, additional steps are needed to ensure the integrity
of the DSEFS data set.

Make sure the replication factor for the cluster is greater than one before continuing.

1. From a node in the same datacenter as the node to be removed, start the DSEFS shell.

$ dse fs

2. Show the current DSEFS nodes with the df command.

dsefs > df

Location                             Status DC             Rack  Host              Address       Port Directory           Used Free        Reserved
144e587c-11b1-4d74-80f7-dc5e0c744aca up     GraphAnalytics rack1 node1.example.com 10.200.179.38 5598 /var/lib/dsefs/data 0    29289783296 5368709120
98ca0435-fb36-4344-b5b1-8d776d35c7d6 up     GraphAnalytics rack1 node2.example.com 10.200.179.39 5598 /var/lib/dsefs/data 0    29302099968 5368709120

3. Find the node to be removed in the list and note the UUID value for it under the Location column.

4. If the node is up, unmount it from DSEFS with the command umount UUID.

dsefs > umount 98ca0435-fb36-4344-b5b1-8d776d35c7d6

5. If the node is not up (for example, after a hardware failure), force unmount it from DSEFS with the
command umount -f UUID.

dsefs > umount -f 98ca0435-fb36-4344-b5b1-8d776d35c7d6

6. Run a file system check with the fsck command to make sure all blocks are replicated.

dsefs > fsck

7. Continue with the normal steps for removing a node, as illustrated after the note below.

If data was written to a DSEFS node, more nodes were added to the cluster, and the original node was
removed without running fsck, the data in the original node may be permanently lost.
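As a hedged illustration of step 7, if the removed node is healthy and is being permanently decommissioned
rather than replaced, the standard removal procedure is typically started on that node with:

$ nodetool decommission

Follow the node removal instructions for your deployment for the complete procedure.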

Removing old DSEFS directories


If you have changed the DSEFS data directory and the old directory is still visible, remove it using the umount
command.

1. Start the DSEFS shell as a role with superuser privileges.

$ dse fs

2. Show the current DSEFS nodes with the df command.

dsefs > df

3. Find the directory to be removed in the list and note the UUID value for it under the Location column.

4. Unmount it from DSEFS with the command umount UUID.

dsefs > umount 98ca0435-fb36-4344-b5b1-8d776d35c7d6

5. Run a file system check with the fsck command to make sure all blocks are replicated.

dsefs > fsck

If the file system check results in an IOException, make sure all the nodes in the cluster are running.
Examples
Using the DSEFS shell, these commands put the local bluefile to the remote DSEFS greenfile:

dsefs / > ls -l


dsefs / > put file:/bluefile greenfile

To view the new file in the DSEFS directory:

dsefs / > ls -l
Type Permission Owner Group Length Modified Name

file rwxrwxrwx none none 17 2016-05-11 09:34:26+0000 greenfile

Using the dse command, these commands create the test2 directory and upload the local README.md file to the
new DSEFS directory.

$ dse fs "mkdir /test2" && dse fs "put README.md /test2/README.md"

To view the new directory listing:

$ dse fs "ls -l /test2"

Type Permission Owner Group Length Modified Name


file rwxrwxrwx none none 3382 2016-03-07 23:20:34+0000 README.md

You can pass two or more DSEFS commands in a single dse command line. This is faster because the JVM is
launched and the connection to DSEFS is opened and closed only once. For example:

$ dse fs "mkdir /test2" "put README.md /test2/README.md"

The following example shows how to use the --no-force-sync flag on a directory, and how to check the state
of the --force-sync flag using stat. These commands are run from within the DSEFS shell.

dsefs> mkdir --no-force-sync /tmp


dsefs> put file:some-file.dat /tmp/file.tmp
dsefs> stat /tmp/file.tmp
FILE dsefs://127.0.0.1:5598/tmp/file.tmp
Owner none
Group none
Permission rwxrwxrwx
Created 2017-03-06 17:54:35+0100
Modified 2017-03-06 17:54:35+0100
Accessed 2017-03-06 17:54:35+0100
Size 1674
Block size 67108864
Redundancy 3
Compressed false
Encrypted false
Forces sync false
Comment

DSEFS compression
DSEFS is able to compress files to save storage space and bandwidth. Compression is performed by DSE
during upload upon a user’s explicit request. Decompression is transparent. Data is always uncompressed by
the server before it is returned to the client.
Compression is performed within block boundaries. The unit of compression—the chunk of data that gets
compressed individually—is called a frame and its size can be specified during file upload.


Encoders
DSEFS ships with the lz4 encoder, which works out of the box.
Compression
To compress files, use the -c or --compression-encoder parameter with the put or cp command. The parameter
specifies the compression encoder to use for the file that is about to be uploaded.

dsefs / > put -c lz4 file /path/to/file

The frame size can optionally be set with the -f, --compression-frame-size option.
The maximum frame size in bytes is set in the compression_frame_max_size option in dse.yaml. If a user
sets the frame size to a value greater than compression_frame_max_size when using put -f, an error is
thrown and the command fails. Modify the compression_frame_max_size setting based on the available
memory of the node.
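For example, the following DSEFS shell command (a sketch; the paths are placeholders and the frame size value
is an assumption expressed in bytes) uploads a file with lz4 compression and an explicit frame size:

dsefs / > put -c lz4 -f 1048576 file:/path/to/local/file /path/to/file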
Compressed files can be appended to in the same way as uncompressed files. If the file is compressed,
the appended data is transparently compressed with the encoder specified for the initial put operation.
Directories can have a default compression encoder specified during directory creation with the mkdir
command. Files added with the put command inherit the default compression encoder from the containing
directory. You can override the default compression encoder with the -c parameter during put operations.

dsefs / > mkdir -c lz4 /some/path

Decompression
Decompression is performed automatically for all commands that transport data to the client. There is no need
for additional configuration to retrieve the original, decompressed file content.
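For example, reading a compressed file with the cat command returns the original, decompressed content
without any extra options (the path below is illustrative):

dsefs / > cat /some/dir/report.csv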
Storage space
Enabling compression creates a distinction between the logical and physical file size.
The logical size is the size of a file before uploading it to DSEFS, where it is then compressed. The logical size
is shown by the stat command under Size.

dsefs dsefs://10.0.0.1:5598/ > stat /tmp/wikipedia-sample.bz2


FILE dsefs://10.0.0.1:5598/tmp/wikipedia-sample.bz2:
Owner none
Group none
Permission rwxrwxrwx
Created 2017-04-06 20:06:21+0000
Modified 2017-04-06 20:06:21+0000
Accessed 2017-04-06 20:06:21+0000
Size 7723180
Block size 67108864
Redundancy 3
Compressed true
Encrypted false
Comment

The physical size is the actual size of the data stored on the storage device. The physical size is shown by the df
command and by the stat -v command for each block separately, under the Compressed length column.
Limitations
Truncating compressed files is not possible.
DSEFS authentication
DSEFS works with secured DataStax Enterprise clusters.


For related SSL details, see Enabling SSL encryption for DSEFS.

DSEFS authentication with secured clusters


Authentication is required only when it is enabled in the cluster. DSEFS on secured clusters requires the
DseAuthenticator; see Configuring DSE Unified Authentication. Authentication is off by default.
DSEFS supports authentication using DSE Unified Authentication, and supports all authentication schemes
supported by the DSE Authenticator, including Kerberos.
DSEFS authentication can secure client-to-server communication.
Spark applications
For Spark applications, provide authentication credentials in one of these ways:

• Set with the dse spark-submit command using one of the credential options described in Providing
credentials on command line.

• Programmatically set the user credentials in the Spark configuration object before the SparkContext is
created:

conf.set("spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.username",
<user>)
conf.set("spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.password",
<pass>)

If a Kerberos authentication token is in use, you do not need to set any properties in the context object. If
you need to explicitly set the token, set the spark.hadoop.cassandra.auth.token property.

• When running the Spark Shell, where the SparkContext is created at startup, set the properties in the
Hadoop configuration object:

sc.hadoopConfiguration.set("com.datastax.bdp.fs.client.authentication.basic.username",
<user>)
sc.hadoopConfiguration.set("com.datastax.bdp.fs.client.authentication.basic.password",
<pass>)

Note the absence of the spark.hadoop prefix.

• When running a Spark application or the Spark Shell, provide properties in the core-default.xml Hadoop
configuration file:

<property>
<name>com.datastax.bdp.fs.client.authentication.basic.username</name>
<value>username</value>
</property>
<property>
<name>com.datastax.bdp.fs.client.authentication.basic.password</name>
<value>password</value>
</property>

Optional: If you want to use this method, but do not have privileges to write to core-default.xml, copy
this file to another directory and set the environment variable to point to that directory:

export HADOOP2_CONF_DIR=path

DSEFS shell
Providing authentication credentials while using the DSEFS shell is as easy as in other DSE tools. The DSEFS
shell supports different authentication methods listed below in priority order. When more than one method
can be used, the one with hi