Description
SynapseML version
1.0.9
System information
- Language version (e.g. python 3.8, scala 2.12): Python 3.10.12, Scala 2.12.17
- Spark Version (e.g. 3.2.3): 3.4.3
- Spark Platform (e.g. Synapse, Databricks): Microsoft Fabric
Describe the problem
The writeToAzureSearch method has inconsistent handling of DateType/TimestampType fields between the index creation and data writing phases:
Current Behavior:
- Index Creation: correctly maps Spark DateType/TimestampType fields to Azure Search Edm.DateTimeOffset.
- Data Writing: fails with the error "field date requires type StringType your dataframe column is of type DateType", requiring manual conversion of DateType/TimestampType columns into ISO 8601 formatted strings before data can be written.
Issue Details:
- The sparkTypeToEdmType mapping is correct (DateType to Edm.DateTimeOffset).
- checkSchemaParity, however, rejects DateType/TimestampType columns at write time and requires StringType (ISO 8601 strings), conflicting with the mapping used at index creation.
- This mismatch results in successful schema creation but failed data ingestion.
Additional Context:
If users preemptively cast DateType columns to StringType before the initial write, the operation succeeds but Azure Search incorrectly infers the field as Edm.String instead of Edm.DateTimeOffset. This removes the semantic date functionality (date-range queries, sorting, etc.) and requires index recreation to fix.
Expected Behavior:
The library should automatically convert DateType/TimestampType values to ISO8601 formatted strings during the data writing phase when the target Azure Search field is Edm.DateTimeOffset, similar to how vector type conversions are handled automatically.
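For clarity, the serialization the writer would need to perform can be sketched in plain Python. This is a hypothetical helper, not part of SynapseML; it only illustrates the ISO 8601 UTC shape that Edm.DateTimeOffset accepts, mirroring the date_format pattern used in the reproduction below.

```python
from datetime import date, datetime, timezone

def to_edm_datetimeoffset(value):
    """Hypothetical helper (not in SynapseML): render a date/datetime as an
    ISO 8601 UTC string in the shape Edm.DateTimeOffset accepts."""
    if isinstance(value, datetime):
        # Assume naive datetimes are UTC; normalize aware ones to UTC.
        dt = value.replace(tzinfo=timezone.utc) if value.tzinfo is None \
            else value.astimezone(timezone.utc)
    elif isinstance(value, date):
        # A bare date becomes midnight UTC.
        dt = datetime(value.year, value.month, value.day, tzinfo=timezone.utc)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_edm_datetimeoffset(date(2025, 5, 15)))  # 2025-05-15T00:00:00Z
```

In the library itself, the equivalent Spark-side conversion would presumably be applied per column (as date_format does in the workaround) whenever the target field is Edm.DateTimeOffset.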
Code to reproduce issue
```python
from synapse.ml.services import *
from pyspark.sql import functions as F
from datetime import date

AZURE_SEARCH_SUBSCRIPTION_KEY = "<your-subscription-key>"
AZURE_SEARCH_SERVICE_NAME = "<your-service-name>"
AZURE_SEARCH_INDEX_NAME = "test-date-handling"

# Create test DataFrame with DateType column
test_df = spark.createDataFrame([
    ("TEST01", date(2025, 5, 15)),
    ("TEST02", date(2025, 5, 20))
], ["id", "date"])
test_df = test_df.withColumn("SearchAction", F.lit("upload"))

# STEP 1: This creates the index successfully with Edm.DateTimeOffset
# BUT fails to write any data
try:
    test_df.writeToAzureSearch(
        subscriptionKey=AZURE_SEARCH_SUBSCRIPTION_KEY,
        serviceName=AZURE_SEARCH_SERVICE_NAME,
        indexName=AZURE_SEARCH_INDEX_NAME,
        keyCol="id",
        actionCol="SearchAction"
    )
except Exception as e:
    print(f"Error writing data: {e}")
    print("But check Azure - the index WAS created with correct schema!")

# STEP 2: Now convert dates to strings and write again
converted_df = test_df.withColumn(
    "date",
    F.date_format(F.col("date"), "yyyy-MM-dd'T'HH:mm:ss'Z'")
)

# This succeeds - data is written to the existing index
converted_df.writeToAzureSearch(
    subscriptionKey=AZURE_SEARCH_SUBSCRIPTION_KEY,
    serviceName=AZURE_SEARCH_SERVICE_NAME,
    indexName=AZURE_SEARCH_INDEX_NAME,
    keyCol="id",
    actionCol="SearchAction"
)
# The index field remains Edm.DateTimeOffset, not Edm.String
```
Other info / logs
No response
What component(s) does this bug affect?
- area/cognitive: Cognitive project
- area/core: Core project
- area/deep-learning: DeepLearning project
- area/lightgbm: Lightgbm project
- area/opencv: Opencv project
- area/vw: VW project
- area/website: Website
- area/build: Project build system
- area/notebooks: Samples under notebooks folder
- area/docker: Docker usage
- area/models: models related issue
What language(s) does this bug affect?
- language/scala: Scala source code
- language/python: Pyspark APIs
- language/r: R APIs
- language/csharp: .NET APIs
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/synapse: Azure Synapse integrations
- integrations/azureml: Azure ML integrations
- integrations/databricks: Databricks integrations