I'm running the BigQuery connector on Spark 2.4 in cluster mode.
The following executes without any issues:
val df = spark
.read
.format("bigquery")
.option("table", "xxx")
.option("dataset", "xxx")
.option("parentProject", "xxx")
.option("credentials", "base 64 encoded json file")
.load()
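For reference, the credentials value is produced by base64-encoding the raw credentials JSON file, roughly like this (the path is a placeholder):

import java.nio.file.{Files, Paths}
import java.util.Base64

// The connector's "credentials" option expects the credentials JSON
// itself as a base64-encoded string, not a path to the file.
val credentialsB64 = Base64.getEncoder.encodeToString(
  Files.readAllBytes(Paths.get("/path/to/credentials.json")))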
The following also works fine, yielding the expected results:
df.schema
df.count()
df.filter(...).count()
However, df.show() throws the exception below, and it's very difficult to understand what fails upstream.
Here's the full stack trace:
java.lang.RuntimeException: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:111)
at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.hashCode(BigQueryClientFactory.java:88)
at java.util.HashMap.hash(HashMap.java:340)
at java.util.HashMap.containsKey(HashMap.java:597)
at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.getBigQueryReadClient(BigQueryClientFactory.java:53)
at com.google.cloud.bigquery.connector.common.ReadSessionCreator.create(ReadSessionCreator.java:79)
at com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext.planBatchInputPartitionContexts(BigQueryDataSourceReaderContext.java:197)
at com.google.cloud.spark.bigquery.v2.BigQueryDataSourceReader.planBatchInputPartitions(BigQueryDataSourceReader.java:66)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions$lzycompute(DataSourceV2ScanExec.scala:84)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions(DataSourceV2ScanExec.scala:80)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:60)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:149)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:148)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:291)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:291)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:339)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:197)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:337)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:304)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:37)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:108)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3442)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2627)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:752)
at org.apache.spark.sql.Dataset.show(Dataset.scala:711)
at org.apache.spark.sql.Dataset.show(Dataset.scala:720)
... 49 elided
Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:108)
... 89 more
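If I read the trace correctly, the failure happens on the driver at planning time, before any rows are fetched: BigQueryClientFactory.hashCode calls BigQueryUtil.getCredentialsByteArray, which Java-serializes the credentials object, and the repackaged AwsCredentials$AwsCredentialSource is not Serializable. That class name suggests the credentials JSON is an external account (workload identity federation, AWS-sourced) file rather than a plain service account key. Assuming that's right, the same exception should be reproducible outside Spark with google-auth-library alone (path is a placeholder):

import java.io.{ByteArrayOutputStream, FileInputStream, ObjectOutputStream}
import com.google.auth.oauth2.GoogleCredentials

// Parse the same credentials JSON passed to the "credentials" option;
// for an external account file this yields AwsCredentials.
val creds = GoogleCredentials.fromStream(new FileInputStream("/path/to/credentials.json"))

// Java-serialize it, mirroring what BigQueryUtil.getCredentialsByteArray does;
// presumably this throws the same java.io.NotSerializableException.
new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(creds)

Presumably df.queryExecution.executedPlan alone would also trigger the exception without fetching any rows, which would match the QueryExecution.executedPlan frames in the trace.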