spooq.transformer.mapper_transformations.to_timestamp

to_timestamp(source_column=None, name=None, **kwargs: Any) → partial

More robust conversion to TimestampType (or to a formatted string). This method supports the following input types:

  • Unix timestamps in seconds

  • Unix timestamps in milliseconds

  • Timestamps in any format supported by Spark

  • Timestamps in any custom format (via input_format)

  • Preceding and/or trailing whitespace

Parameters
  • source_column (str or Column) – Input column. Can be a name, a PySpark Column, or a PySpark function

  • name (str, default -> derived from input column) – Name of the output column. (.alias(name))

Keyword Arguments
  • max_timestamp_sec (int, default -> 4102358400 (=> 2099-12-31 01:00:00)) – Defines the range within which unix timestamps are still interpreted as seconds (as opposed to milliseconds)

  • input_format (str or bool, default -> False) – If a pattern is provided, Spooq tries to parse the input string with it (via F.unix_timestamp())

  • output_format (str or bool, default -> False) – If a pattern is provided, the output is formatted accordingly (via F.date_format())

  • min_timestamp_ms (int, default -> -62135514321000 (=> Year 1)) – Defines the overall allowed range to keep the timestamps within Python’s datetime library limits

  • max_timestamp_ms (int, default -> 253402210800000 (=> Year 9999)) – Defines the overall allowed range to keep the timestamps within Python’s datetime library limits

  • alt_src_cols (str, default -> no coalescing, only source_column) – Coalesces source_column with the column(s) provided here as fallback values.

  • cast (T.DataType(), default -> T.TimestampType()) – Applies provided datatype on output column (.cast(cast))
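
For illustration, a hedged sketch of how these keyword arguments could be combined; the column names created_at_str and meta_ts are hypothetical and only serve this example:

>>> # created_at_str holds strings like "12.08.2020 12:43";
>>> # meta_ts is a hypothetical fallback column for coalescing.
>>> df.select(
...     spq.to_timestamp(
...         "created_at_str",
...         name="created_at",
...         input_format="dd.MM.yyyy HH:mm",  # parsed via F.unix_timestamp()
...         output_format="yyyy-MM-dd",       # rendered via F.date_format()
...         alt_src_cols="meta_ts",           # coalesced with source_column
...     )
... )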

Warning

  • Timestamps in the ranges (-inf, -max_timestamp_sec) and (max_timestamp_sec, inf) are treated as milliseconds

  • There is a time interval (1970-01-01 +- ~2.5 months) in which seconds and milliseconds cannot be distinguished reliably. For example, 3974400000 is treated as seconds (2095-12-11T00:00:00) because the value is smaller than MAX_TIMESTAMP_S, but it could also be a valid date in milliseconds (1970-02-16T00:00:00).
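
To make the cutoff concrete, here is a standalone PySpark sketch of the seconds-versus-milliseconds heuristic described above. It illustrates the threshold logic only and is not Spooq's actual implementation:

>>> from pyspark.sql import functions as F
>>>
>>> MAX_TIMESTAMP_SEC = 4102358400  # default max_timestamp_sec
>>>
>>> # Values within +-MAX_TIMESTAMP_SEC are interpreted as seconds,
>>> # everything outside that band as milliseconds.
>>> ts_col = F.when(
...     F.abs(F.col("input_key").cast("long")) <= MAX_TIMESTAMP_SEC,
...     F.col("input_key").cast("long")          # seconds
... ).otherwise(
...     F.col("input_key").cast("long") / 1000   # milliseconds -> seconds
... ).cast("timestamp")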

Examples

>>> input_df = spark.createDataFrame(
...     [
...         Row(input_key="2020-08-12T12:43:14+0000"),
...         Row(input_key="1597069446"),
...         Row(input_key="1597069446000"),
...         Row(input_key="2020-08-12"),
...     ], schema="input_key string"
... )
>>>
>>> input_df.select(spq.to_timestamp("input_key")).show(truncate=False)
+-------------------+
|input_key          |
+-------------------+
|2020-08-12 14:43:14|
|2020-08-10 16:24:06|
|2020-08-10 16:24:06|
|2020-08-12 00:00:00|
+-------------------+
>>>
>>> mapping = [
...     ("original_value",    "input_key", spq.as_is),
...     ("transformed_value", "input_key", spq.to_timestamp)
... ]
>>> output_df = Mapper(mapping).transform(input_df)
>>> output_df.show(truncate=False)
+------------------------+-------------------+
|original_value          |transformed_value  |
+------------------------+-------------------+
|2020-08-12T12:43:14+0000|2020-08-12 14:43:14|
|1597069446              |2020-08-10 16:24:06|
|1597069446000           |2020-08-10 16:24:06|
|2020-08-12              |2020-08-12 00:00:00|
+------------------------+-------------------+
Returns

This method returns a suitable type depending on how it was called. This ensures compatibility with Spooq's Mapper transformer - with or without explicit parameters - as well as direct calls via select, withColumn, where, …

Return type

partial or Column
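
As a sketch of the two call styles this refers to (using input_df from the examples above): passing the bare function inside a mapping yields a partial that the Mapper resolves, while calling it with an explicit source column yields a ready-to-use Column:

>>> # Bare function inside a mapping -> partial, resolved by the Mapper:
>>> mapping = [("ts", "input_key", spq.to_timestamp)]
>>>
>>> # Explicit call -> Column, usable in select / withColumn / where:
>>> input_df.withColumn("ts", spq.to_timestamp("input_key"))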