Job fails with Java IndexOutOfBoundsException error

When groupBy() is used along with applyInPandas, it generates an exception due to Arrow buffer limits.

Written by rakesh.parija

Last published: December 21, 2022

Problem

Your job fails with a Java IndexOutOfBoundsException error message:

java.lang.IndexOutOfBoundsException: index: 0, length: <number> (expected: range(0, 0))

When you review the stack trace, you see something similar to this:

Py4JJavaError: An error occurred while calling o617.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
	at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
	at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
	at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
	at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
	at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
	at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
	at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
	at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
	at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
	at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Cause

This error occurs because of a size limit on Arrow buffers. When groupBy() is used together with applyInPandas, the results for each group are serialized through Arrow buffers, which can exceed this limit and trigger the error.
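For context, the failing pattern looks like the following sketch. The column names, schema, and the `normalize` helper are illustrative, not taken from the original report; the step where each group's result is serialized back to the JVM through Arrow buffers is where the exception above surfaces.

```python
import pandas as pd

# Illustrative per-group function for applyInPandas (hypothetical names).
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives in Python as a pandas DataFrame; the returned
    # DataFrame is shipped back to the JVM through Arrow buffers, the
    # step where the IndexOutOfBoundsException is raised for large groups.
    pdf = pdf.copy()
    pdf["v"] = pdf["v"] - pdf["v"].mean()
    return pdf

# On a live cluster this would be applied to each group, for example:
# df.groupBy("id").applyInPandas(normalize, schema="id long, v double")
```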

Solution

You can resolve this issue by setting the following value in the cluster's Spark config (AWS | Azure | GCP):

spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled = true

This setting allows groupBy() to function correctly with Pandas operations.
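If you prefer not to edit the cluster-level Spark config, the same property can typically be set for the current session instead, assuming an active `spark` SparkSession (for example, at the top of a Databricks notebook):

```python
# Session-level alternative to the cluster Spark config shown above.
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "true",
)
```

Note that a session-level setting only affects the current session; the cluster Spark config applies it to all workloads on the cluster.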
