Summarizers are not working #3

@5mdd

Thanks @kevrasm for solving the clock issue.
I tried the new jar, but I am now hitting another issue on Databricks 5.2 ML.
After successfully creating a clock, I tried to use a summarizer with the function summarizeIntervals, and it failed with the following error:

/local_disk0/spark-34261885-5939-47e4-b37c-fc95545a6b47/userFiles-25527d91-086d-4a90-839f-09b97f09c196/addedFile5376141714691461041dbfs__FileStore_jars_785cdf36_8307_41eb_9f3d_a9d1a89ab416_flint_0_6_0_databricks-7358e.jar/ts/flint/dataframe.py in summarizeIntervals(self, clock, summarizer, key, inclusion, rounding)
1071 else:
1072 with traceback_utils.SCCallSiteSync(self._sc) as css:
-> 1073 return self._summarizeIntervals_builtin(clock, summarizer, key, inclusion, rounding)
1074
1075 def _summarizeIntervals_udf(self, clock, columns,

/local_disk0/spark-34261885-5939-47e4-b37c-fc95545a6b47/userFiles-25527d91-086d-4a90-839f-09b97f09c196/addedFile5376141714691461041dbfs__FileStore_jars_785cdf36_8307_41eb_9f3d_a9d1a89ab416_flint_0_6_0_databricks-7358e.jar/ts/flint/dataframe.py in _summarizeIntervals_builtin(self, clock, summarizer, key, inclusion, rounding)
1093 scala_key,
1094 inclusion,
-> 1095 rounding)
1096
1097 return TimeSeriesDataFrame._from_tsrdd(tsrdd, self.sql_ctx)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o557.summarizeIntervals.
: java.lang.NoClassDefFoundError: Could not initialize class com.twosigma.flint.rdd.function.group.Intervalize$
at com.twosigma.flint.rdd.OrderedRDD.intervalize(OrderedRDD.scala:560)
at com.twosigma.flint.timeseries.TimeSeriesRDDImpl.summarizeIntervals(TimeSeriesRDD.scala:1605)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

The same error occurs with the function groupByInterval.
I also tried to run the following example, without success: https://github.com/twosigma/flint/tree/master/example. It failed at the summarizer step:
sp500_decayed_return = sp500_joined_return.summarizeWindows(
    window=windows.past_absolute_time('7day'),
    summarizer=summarizers.ewma('previous_day_return', alpha=0.5)
)

What is so special about Databricks that makes the Two Sigma version incompatible?
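
A note on reading the error above: when the JVM reports `NoClassDefFoundError: Could not initialize class X`, it usually means class X was found on the classpath but its static initializer failed on an earlier attempt (the first failure is normally reported as an `ExceptionInInitializerError`, which here would be in the driver logs rather than in this traceback). A plain `NoClassDefFoundError: some/pkg/X` typically means the class is actually missing. The sketch below is a small, hypothetical helper (`classify_no_class_def` is not part of Flint, Py4J, or Spark) that distinguishes the two shapes from the message text alone; it does not diagnose the root cause.

```python
import re

# Two common shapes of a JVM NoClassDefFoundError message.
# "Could not initialize class X": X was found, but its static initializer failed earlier.
_INIT_FAILED = re.compile(r"NoClassDefFoundError: Could not initialize class (\S+)")
# "NoClassDefFoundError: some/pkg/X": X could not be found at all.
_MISSING = re.compile(r"NoClassDefFoundError: (\S+)")


def classify_no_class_def(message):
    """Return (kind, class_name) for a NoClassDefFoundError message, or None.

    Hypothetical helper for illustration only.
    """
    m = _INIT_FAILED.search(message)
    if m:
        return ("static-init-failed", m.group(1))
    m = _MISSING.search(message)
    if m:
        # The JVM prints missing classes with '/' separators.
        return ("class-missing", m.group(1).replace("/", "."))
    return None


# The line reported in this issue.
line = (": java.lang.NoClassDefFoundError: Could not initialize class "
        "com.twosigma.flint.rdd.function.group.Intervalize$")
print(classify_no_class_def(line))
```

For the message in this issue the helper reports the "static-init-failed" case, which suggests looking for the first failure of the `Intervalize$` initializer in the driver logs instead of checking whether the jar is attached.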
