[BUG] Azure Search: DateType and TimestampType fields not properly converted to ISO8601 format #2380

@DennizSvens

SynapseML version

1.0.9

System information

  • Language version (e.g. python 3.8, scala 2.12): Python 3.10.12, Scala 2.12.17
  • Spark Version (e.g. 3.2.3): 3.4.3
  • Spark Platform (e.g. Synapse, Databricks): Microsoft Fabric

Describe the problem

The writeToAzureSearch method has inconsistent handling of DateType/TimestampType fields between the index creation and data writing phases:

Current Behavior:

  • Index Creation:
    • Correctly maps Spark DateType/TimestampType fields to Azure Search Edm.DateTimeOffset.
  • Data Writing:
    • Fails with the error:
      "field date requires type StringType your dataframe column is of type DateType"
    • Requires manual conversion of DateType/TimestampType columns into ISO8601-formatted strings before data can be written successfully.

Issue Details:

  • sparkTypeToEdmType mapping is correct (DateType to Edm.DateTimeOffset).
  • checkSchemaParity rejects the DateType column: as the error message shows, it requires StringType for the Edm.DateTimeOffset field (the Azure Search API accepts dates as ISO8601 strings), but nothing converts the column before the check.
  • This mismatch results in successful schema creation but failed data ingestion.
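The mismatch described above can be illustrated with a toy model in plain Python. This is only a sketch: the two dictionaries and the function names `infer_index_schema` and `parity_errors` are hypothetical stand-ins inferred from the behavior and error message reported in this issue, not the actual SynapseML source.

```python
# Toy model of the reported behavior. The mappings are assumptions based on
# the issue text, not copied from the SynapseML implementation.
SPARK_TO_EDM = {
    "StringType": "Edm.String",
    "DateType": "Edm.DateTimeOffset",
    "TimestampType": "Edm.DateTimeOffset",
}

# What the parity check appears to require for each index field type,
# judging by the error message quoted in this issue.
EDM_TO_REQUIRED_SPARK = {
    "Edm.String": "StringType",
    "Edm.DateTimeOffset": "StringType",
}


def infer_index_schema(df_schema):
    """Index creation phase: map each column type to an Edm type."""
    return {name: SPARK_TO_EDM[spark_type] for name, spark_type in df_schema.items()}


def parity_errors(df_schema, index_schema):
    """Data writing phase: compare column types with what the index requires."""
    errors = []
    for name, edm_type in index_schema.items():
        required = EDM_TO_REQUIRED_SPARK[edm_type]
        actual = df_schema[name]
        if actual != required:
            errors.append(
                f"field {name} requires type {required} "
                f"your dataframe column is of type {actual}"
            )
    return errors


df_schema = {"id": "StringType", "date": "DateType"}
index_schema = infer_index_schema(df_schema)   # index is created with Edm.DateTimeOffset
errors = parity_errors(df_schema, index_schema)  # the same DataFrame then fails the check
print(errors)
```

In this model the forward mapping and the parity requirement do not round-trip for date columns, which is why index creation succeeds and the write then fails for the same DataFrame.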

Additional Context:
If users preemptively cast DateType columns to StringType before the initial write, the operation succeeds, but the index field is created as Edm.String instead of Edm.DateTimeOffset. This removes date semantics (date-range queries, sorting, etc.) and can only be fixed by recreating the index.

Expected Behavior:
The library should automatically convert DateType/TimestampType values to ISO8601 formatted strings during the data writing phase when the target Azure Search field is Edm.DateTimeOffset, similar to how vector type conversions are handled automatically.
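As a rough illustration of the per-value conversion being requested, the following standard-library sketch formats a Python `date` or `datetime` as an ISO8601 UTC string. The function name `to_edm_datetime_offset` is hypothetical, and treating naive datetimes as UTC is an assumption made for this example; the library's eventual behavior may differ.

```python
from datetime import date, datetime, time, timedelta, timezone


def to_edm_datetime_offset(value):
    """Format a date or datetime as an ISO8601 UTC string (sketch only)."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are already expressed in UTC.
            value = value.replace(tzinfo=timezone.utc)
        else:
            value = value.astimezone(timezone.utc)
    elif isinstance(value, date):
        # A bare date is taken to mean midnight UTC on that day.
        value = datetime.combine(value, time.min, tzinfo=timezone.utc)
    else:
        raise TypeError(f"expected date or datetime, got {type(value).__name__}")
    return value.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_edm_datetime_offset(date(2025, 5, 15)))
print(to_edm_datetime_offset(
    datetime(2025, 5, 15, 12, 30, tzinfo=timezone(timedelta(hours=2)))
))
```

Normalizing to UTC before appending `Z` keeps the suffix truthful; appending a literal `Z` to a value in another time zone would silently shift the stored instant.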

Code to reproduce issue

from synapse.ml.services import *
from pyspark.sql import functions as F
from datetime import date

AZURE_SEARCH_SUBSCRIPTION_KEY = "<your-subscription-key>"
AZURE_SEARCH_SERVICE_NAME = "<your-service-name>"
AZURE_SEARCH_INDEX_NAME = "test-date-handling"

# Create test DataFrame with DateType column
test_df = spark.createDataFrame([
    ("TEST01", date(2025, 5, 15)),
    ("TEST02", date(2025, 5, 20))
], ["id", "date"])
test_df = test_df.withColumn("SearchAction", F.lit("upload"))

# STEP 1: This creates the index successfully with Edm.DateTimeOffset
# BUT fails to write any data
try:
    test_df.writeToAzureSearch(
        subscriptionKey = AZURE_SEARCH_SUBSCRIPTION_KEY,
        serviceName = AZURE_SEARCH_SERVICE_NAME,
        indexName = AZURE_SEARCH_INDEX_NAME,
        keyCol = "id",
        actionCol = "SearchAction"
    )
except Exception as e:
    print(f"Error writing data: {e}")
    print("But check Azure - the index WAS created with correct schema!")

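# STEP 2: Now convert dates to ISO8601 strings and write again
# Note: 'Z' is a literal suffix in this pattern, so the result is only a
# correct UTC value if the Spark session time zone is UTC.
converted_df = test_df.withColumn(
    "date",
    F.date_format(F.col("date"), "yyyy-MM-dd'T'HH:mm:ss'Z'")
)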

# This succeeds - data is written to the existing index
converted_df.writeToAzureSearch(
    subscriptionKey = AZURE_SEARCH_SUBSCRIPTION_KEY,
    serviceName = AZURE_SEARCH_SERVICE_NAME,
    indexName = AZURE_SEARCH_INDEX_NAME,
    keyCol = "id",
    actionCol = "SearchAction"
)

# The index field remains Edm.DateTimeOffset, not string

Other info / logs

No response

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
