SPARK-51162

SPIP: Add the TIME data type



    Description

      Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

      Add a new data type TIME to Spark SQL which represents a time value with fields hour, minute, and second, up to microsecond precision. All operations over the type are performed without taking any time zone into account. The new data type should conform to the type TIME(n) WITHOUT TIME ZONE defined by the SQL standard, where 0 <= n <= 6.
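
      Here the precision n is the number of fractional-second digits kept in a value. The sketch below is purely illustrative (plain Scala, no Spark APIs; the helper name is made up) and shows what TIME(0), TIME(3), and TIME(6) would preserve:

      import java.time.LocalTime

      object TimePrecisionSketch {
        // Truncate a time of day to n fractional-second digits (0 <= n <= 6),
        // mirroring the SQL-standard meaning of TIME(n).
        def truncateToPrecision(t: LocalTime, n: Int): LocalTime = {
          require(n >= 0 && n <= 6, "precision must be in [0, 6]")
          val unitNanos = math.pow(10, 9 - n).toLong
          LocalTime.ofNanoOfDay(t.toNanoOfDay / unitNanos * unitNanos)
        }

        def main(args: Array[String]): Unit = {
          val t = LocalTime.parse("23:59:59.999999")
          println(truncateToPrecision(t, 0)) // 23:59:59
          println(truncateToPrecision(t, 3)) // 23:59:59.999
          println(truncateToPrecision(t, 6)) // 23:59:59.999999
        }
      }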

      Q2. What problem is this proposal NOT designed to solve?

      The proposal does not cover the TIME type with time zone defined by the SQL standard: TIME(n) WITH TIME ZONE.
      It also does not cover TIME with local time zone.

      Q3. How is it done today, and what are the limits of current practice?

      The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the date part to some constant value like 1970-01-01, 0001-01-01, or 0000-00-00 (though the last one is outside the supported range of dates).

      Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot recognize it in data sources and, for instance, cannot load TIME values from Parquet files.
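
      A minimal sketch of this workaround, assuming Spark 3.4+ where TIMESTAMP_NTZ is available (the column name and app name are made up for illustration):

      import org.apache.spark.sql.SparkSession

      object TimeEmulationSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("time-emulation").getOrCreate()
          import spark.implicits._

          // Emulate TIME by pinning the date part to a constant (1970-01-01)
          // and storing the result as TIMESTAMP_NTZ.
          val df = Seq("12:34:56.123456", "23:59:59.999999")
            .toDF("t")
            .selectExpr("CAST(concat('1970-01-01 ', t) AS TIMESTAMP_NTZ) AS emulated_time")

          df.printSchema() // reported as timestamp_ntz, not as a time type
          df.show(truncate = false)

          spark.stop()
        }
      }

      The schema still reports TIMESTAMP_NTZ, so the time-only semantics are lost, and TIME values written by other systems (e.g. to Parquet) cannot be loaded as such.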

      Q4. What is new in your approach and why do you think it will be successful?

      The approach is not new, and we have a clear picture of how to split the work into sub-tasks based on our experience of adding the new types ANSI intervals and TIMESTAMP_NTZ.

      Q5. Who cares? If you are successful, what difference will it make?

      The new type simplifies migrations to Spark SQL from other DBMSs like PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Such users don't have to rewrite their SQL code to emulate the TIME type. The new functionality also benefits existing Spark SQL users who need to load data with TIME values that were stored by other systems.

      Q6. What are the risks?

      Additional handling of the new type in operators, expressions, and data sources can cause performance regressions. This risk can be mitigated by developing time benchmarks in parallel with supporting the new type in different places in Spark SQL.
       
      Q7. How long will it take?

      In total, it might take around 9 months. The estimate is based on similar tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). We can split the work into functional blocks:

      1. Base functionality - 3 weeks
        Add the new type TimeType, forming/parsing of time literals, a type constructor, and external types (see the sketch after this list).
      2. Persistence - 3.5 months
        Ability to create tables of the type TIME, read/write from/to Parquet and other built-in data sources, partitioning, stats, predicate pushdown.
      3. Time operators - 2 months
        Arithmetic ops, field extraction, sorting, and aggregations.
      4. Client support - 1 month
        JDBC, Hive, Thrift Server, Spark Connect
      5. PySpark integration - 1 month
        DataFrame support, pandas API, Python UDFs, Arrow column vectors
      6. Docs + testing/benchmarking - 1 month
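
      A hedged sketch of the literal handling in block 1, assuming (by analogy with TIMESTAMP_NTZ) that a TIME value is physically a Long counting microseconds since midnight; none of the names below are the proposed API:

      import java.time.LocalTime
      import java.time.format.DateTimeFormatter

      object TimeLiteralSketch {
        // Convert an external java.time.LocalTime into the assumed internal
        // representation: microseconds since midnight, stored in a Long.
        def localTimeToMicros(t: LocalTime): Long = t.toNanoOfDay / 1000L

        // Parse the body of a hypothetical TIME literal such as '12:34:56.123456'.
        def parseTimeLiteral(s: String): Long =
          localTimeToMicros(LocalTime.parse(s, DateTimeFormatter.ISO_LOCAL_TIME))

        def main(args: Array[String]): Unit = {
          println(parseTimeLiteral("12:34:56.123456")) // 45296123456
        }
      }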

      Q8. What are the mid-term and final “exams” to check for success?

      The mid-term exam is in 4 months: basic functionality, reading/writing the new type from/to built-in data sources, and basic time operations such as arithmetic ops and casting.
      The final "exam" is to support the same functionality as the other datetime types: TIMESTAMP_NTZ, DATE, and TIMESTAMP.

      Appendix A. Proposed API Changes.

      Add a new case class TimeType to org.apache.spark.sql.types:

      /**
       * The time type represents a time value with fields hour, minute, second, up to microseconds.
       * The range of times supported is 00:00:00.000000 to 23:59:59.999999.
       *
       * Please use the singleton `DataTypes.TimeType` to refer to the type.
       */
      case class TimeType(precisionField: Byte) extends DatetimeType {

        /**
         * The default size of a value of the TimeType is 8 bytes.
         */
        override def defaultSize: Int = 8

        private[spark] override def asNullable: TimeType = this
      }
      

      Appendix B: As the external types for the new TIME type, we propose:
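
      As an illustration only (an assumption on our part, not necessarily the proposed mapping), a natural JVM-side candidate is java.time.LocalTime, since it models a time of day without a time zone; the round trip below assumes the microseconds-since-midnight internal form used in the earlier sketch:

      import java.time.LocalTime

      object ExternalTypeSketch {
        // Assumed internal form: microseconds since midnight in a Long.
        def toMicros(t: LocalTime): Long = t.toNanoOfDay / 1000L
        def fromMicros(micros: Long): LocalTime = LocalTime.ofNanoOfDay(micros * 1000L)

        def main(args: Array[String]): Unit = {
          val t = LocalTime.parse("23:59:59.999999")
          assert(fromMicros(toMicros(t)) == t) // lossless round trip at microsecond precision
          println(toMicros(t)) // 86399999999
        }
      }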
