
java.io.NotSerializableException when running df.show() #845

@couchotato

Description

Running the BigQuery connector on Spark 2.4 in cluster mode.

The following executes without any issues:

val df = spark
        .read
        .format("bigquery")
        .option("table", "xxx")
        .option("dataset", "xxx")
        .option("parentProject", "xxx")
        .option("credentials", "base 64 encoded json file")
        .load()

The following also work fine, yielding the expected results:

df.schema
df.count()
df.filter(...).count()

However, df.show() throws the exception below, and it's very difficult to understand what fails upstream.
Here is the full stack trace:

java.lang.RuntimeException: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:111)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.hashCode(BigQueryClientFactory.java:88)
  at java.util.HashMap.hash(HashMap.java:340)
  at java.util.HashMap.containsKey(HashMap.java:597)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.getBigQueryReadClient(BigQueryClientFactory.java:53)
  at com.google.cloud.bigquery.connector.common.ReadSessionCreator.create(ReadSessionCreator.java:79)
  at com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext.planBatchInputPartitionContexts(BigQueryDataSourceReaderContext.java:197)
  at com.google.cloud.spark.bigquery.v2.BigQueryDataSourceReader.planBatchInputPartitions(BigQueryDataSourceReader.java:66)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions$lzycompute(DataSourceV2ScanExec.scala:84)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions(DataSourceV2ScanExec.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:60)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:149)
  at scala.collection.immutable.List.map(List.scala:293)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:148)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:339)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:197)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:337)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:304)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:37)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:108)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:108)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:92)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3442)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2627)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:752)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:711)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:720)
  ... 49 elided
Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:108)
  ... 89 more
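Not part of the original report, but the `Caused by` line points at the mechanism: `BigQueryUtil.getCredentialsByteArray` serializes the credentials with `ObjectOutputStream`, and `AwsCredentials$AwsCredentialSource` does not implement `java.io.Serializable`. That failure mode can be reproduced in isolation; `Credentials` and its `source` field below are hypothetical stand-ins, not the connector's actual types:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the connector's credentials object: the outer
// class is Serializable, but it holds a field whose type is not.
class Credentials implements Serializable {
    // java.lang.Object does not implement Serializable, mirroring the role
    // of AwsCredentials$AwsCredentialSource in the reported stack trace.
    final Object source = new Object();
}

public class SerializationDemo {
    // Returns true when Java serialization rejects the object graph.
    static boolean failsToSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return false;
        } catch (NotSerializableException e) {
            // writeObject0 walks the object graph and throws on the first
            // non-Serializable field, exactly as in the reported trace.
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fails: " + failsToSerialize(new Credentials()));
    }
}
```

If this diagnosis is right, it would also explain why `df.count()` succeeds while `df.show()` fails: `show()` reaches `planBatchInputPartitions`, which hashes the `BigQueryClientFactory` and thereby triggers the credential serialization. A possible workaround (untested here) is to supply credentials in a form that avoids the AWS workload-identity credential type, e.g. a service-account key via the connector's `credentialsFile` option or `GOOGLE_APPLICATION_CREDENTIALS`.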
