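Conflicting directory structures error

Use distinct paths for your storage locations; otherwise, conflicting directory structures can cause an error.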

Written by Ashish

Last published at: May 19th, 2022

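Problem

You have an Apache Spark job that is failing with a Java assertion error: java.lang.AssertionError: assertion failed: Conflicting directory structures detected.

Example stack trace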

Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
=== Streaming Query ===
Identifier: [id = aabc5549-cb4b-4e4e-9403-4e793f4824a0, runId = 4e743dda-909f-4932-9489-3dd0b364d811]
Current Committed Offsets: {}
Current Available Offsets: {CloudFilesSource[://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]: {'seqNum':423,'sourceVersion':1}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
CloudFilesSource[://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]
  at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:385)
  at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:268)
Caused by: java.lang.RuntimeException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
  at com.databricks.sql.fileNotification.autoIngest.CloudFilesErrors$.partitionInferenceError(CloudFilesErrors.scala:115)
  at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.liftedTree1$1(CloudFilesSourceFileIndex.scala:65)
  at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.partitionSpec(CloudFilesSourceFileIndex.scala:63)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
  at com.databricks.sql.fileNotification.autoIngest.CloudFilesSource.getBatch(CloudFilesSource.scala:361)
  ... 1 more
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
  ://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
  ://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt
If provided paths are partition directories, please set 'basePath' in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:204)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parseP
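The error message in the trace asks for the partition columns to be provided explicitly through the `cloudFiles.partitionColumns` option. A minimal sketch of building that option in Python follows. The helper name `autoloader_options`, the `parquet` format, and the `evt` column are assumptions made for illustration; whether this option alone clears the error depends on your directory layout.

```python
def autoloader_options(file_format, partition_columns):
    """Build Auto Loader reader options with explicit partition columns.

    Hypothetical helper: it only assembles the option dictionary.
    """
    if not partition_columns:
        raise ValueError("at least one partition column is required")
    return {
        "cloudFiles.format": file_format,
        # Comma-separated list, as the error message asks for.
        "cloudFiles.partitionColumns": ",".join(partition_columns),
    }


# Assumed values: a Parquet source partitioned by a column named evt.
options = autoloader_options("parquet", ["evt"])
print(options)

# On a Databricks cluster with an active `spark` session, the options could
# then be passed to the stream reader (not executed here; a schema or a
# cloudFiles.schemaLocation is typically needed as well):
# df = spark.readStream.format("cloudFiles").options(**options).load(table_root)
```

Keeping the options in a plain dictionary makes them easy to inspect before the stream is started.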

Cause

You have conflicting directory paths in the storage location.

In the example stack trace, we see two conflicting directory paths.

  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt

Because these directories are nested in the same hierarchy, updating the root or one of the branch levels can result in a conflict.
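Before starting a job, you can check a list of source paths for this kind of nesting. The sketch below is a simple string-based check written for illustration, not the check Spark performs internally; the function name `find_nested_paths` is hypothetical, and the `<file-system>` scheme is the placeholder used in this article.

```python
def find_nested_paths(paths):
    """Return (parent, child) pairs where one path lies inside another."""
    normalized = [p.rstrip("/") for p in paths]
    nested = []
    for parent in normalized:
        for child in normalized:
            # Appending "/" avoids matching siblings that merely share a
            # name prefix, such as .../evt1 and .../evt10.
            if child != parent and child.startswith(parent + "/"):
                nested.append((parent, child))
    return nested


# Placeholder scheme and host, copied from the article's examples.
ROOT = "<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt"

suspicious = [ROOT, ROOT + "/clfy_x_clfy_evt"]
for parent, child in find_nested_paths(suspicious):
    print(f"{child} is nested inside {parent}")
```

An empty result means that no path in the list contains another one.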

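Solution

You cannot have multiple concurrent updates within a hierarchical directory structure, or multiple updates happening in the same partition.

Once a conflict is detected, write your updates to multiple distinct paths that are not nested inside one another. Alternatively, you can add more partitions.

These example directories do not conflict.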

  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt1
  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt2
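One way to see why these sibling directories are safe is to strip the trailing column=value segments from each path and compare what remains. The sketch below is a simplified illustration of that idea, not Spark's actual partition discovery code; the names `split_partition_path` and `base_paths` are hypothetical.

```python
def split_partition_path(path):
    """Split a path into (base_path, partition_values).

    Trailing segments of the form column=value are treated as partitions.
    """
    segments = path.rstrip("/").split("/")
    partitions = []
    # Walk up from the leaf while the segment looks like column=value.
    while segments and "=" in segments[-1]:
        column, _, value = segments.pop().partition("=")
        partitions.append((column, value))
    # Segments were collected leaf-first; restore root-to-leaf order.
    return "/".join(segments), dict(reversed(partitions))


def base_paths(paths):
    """Return the set of distinct base paths implied by the given paths."""
    return {split_partition_path(p)[0] for p in paths}


# Placeholder scheme and host, copied from the article's examples.
ROOT = "<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt"

partitioned = [ROOT + "/evt=clfy_x_clfy_evt1", ROOT + "/evt=clfy_x_clfy_evt2"]
nested = [ROOT, ROOT + "/clfy_x_clfy_evt"]

print(len(base_paths(partitioned)))  # sibling partitions share one base path
print(len(base_paths(nested)))       # nested directories yield two base paths
```

In this simplified model, a layout is consistent when every path resolves to the same base path, which holds for the partitioned directories and fails for the nested pair from the stack trace.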