Issue loading data from Amazon S3 origin to Delta Lake (on S3) destination using Transformer

asked 2020-02-17 17:59:50 -0500 by Sadhu Rajasekhar

updated 2020-02-18 10:55:07 -0500 by metadaddy

Hi, I am trying to load data from an Amazon S3 origin to a Delta Lake (on S3) destination using a Transformer pipeline. I am getting an error while creating the table in Delta Lake. I'd appreciate help resolving it.

Pipeline Status: RUN_ERROR: org.apache.spark.SparkException: Job aborted due to stage failure: Task 50 in stage 7.0 failed 4 times, most recent failure: Lost task 50.3 in stage 7.0 (TID 113, 54.40.19.38, executor 5): java.io.FileNotFoundException: File file:/opt/streamsets/streamsets-transformer-dirs/data/runInfo/SRDeltaTest48e28c49-fde6-4b74-8856-837e60b9d3fc/run1581938280588/iics-atscalepoc/Usercsv/_delta_log/00000000000000000000.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
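The error message suggests refreshing the table or recreating the Dataset/DataFrame. As far as I understand, in plain Spark that would look like the sketch below (the table name is taken from my destination configuration shared in the comments; the s3a:// path is a placeholder):

    // A minimal sketch of the two workarounds the error message names,
    // assuming a Spark shell with the Delta Lake package on the classpath.
    // Usercsv_delta is my table name; the bucket name is a placeholder.
    spark.sql("REFRESH TABLE Usercsv_delta")

    // ...or recreate the Dataset/DataFrame instead of reusing a cached one:
    val fresh = spark.read.format("delta")
      .load("s3a://my-bucket/Deltalake/Usercsv_delta")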

Comments

Have you already tried the suggestion to explicitly invalidate the cache? Also share your Delta Lake destination configuration as well as the cluster type.

iamontheinet ( 2020-02-17 18:16:20 -0500 )

Thanks for the response. Yes, I have unchecked the cache option and it is still failing. Delta Lake destination configuration:
Stage Library: Delta Lake Transformer-provided libraries
Table Directory Path: /Deltalake
Write Mode: Append
Storage: Amazon S3
Create Managed Table: Yes
Table Name: Usercsv_delta
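One thing I noticed: the missing _delta_log file in the error is under a local file: path on the Transformer machine, not under S3. For comparison, a plain Spark write of this table to an explicit s3a:// location would look like the sketch below (a minimal sketch; the bucket name is a placeholder, and the CSV prefix is taken from the error's path):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming Spark with the Delta Lake and hadoop-aws
    // packages on the classpath and S3 credentials already configured.
    val spark = SparkSession.builder()
      .appName("delta-s3-sketch")
      .getOrCreate()

    // Read the source CSV from S3 (bucket name is a placeholder).
    val df = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/iics-atscalepoc/Usercsv/")

    // Write Delta to an explicit S3 URI so the _delta_log directory is
    // created on S3 rather than on local disk.
    df.write
      .format("delta")
      .mode("append")
      .save("s3a://my-bucket/Deltalake/Usercsv_delta")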

Sadhu Rajasekhar ( 2020-02-17 20:42:32 -0500 )