Avro to Parquet conversion

asked 2020-09-10 11:37:16 -0500 by Ankit8743

I am converting Avro data to Parquet. As part of the conversion I am also archiving the Avro files from the S3 bucket (these are the same files that are being converted to Parquet). I perform this archiving after reading the Avro file data via the S3 component, and the archiving code is written in a PySpark component. But when I delete the files in the Spark component, the job fails with an error saying the files are not found in S3. Why is this happening? I have already enabled the cache option in the S3 component, so the data read from the Avro files should be available to the components used afterwards.
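For illustration, here is a minimal standalone PySpark sketch of the flow I am describing (the bucket/prefix names are hypothetical, and boto3 stands in for the archiving component of the actual job). In this sketch the source objects are only deleted after the cached DataFrame has been materialized and written out:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Hypothetical locations; the real job reads via an S3 origin component.
SRC_BUCKET = "my-bucket"
SRC_PREFIX = "incoming/avro/"
src = f"s3a://{SRC_BUCKET}/{SRC_PREFIX}"
dst = "s3a://my-bucket/output/parquet/"

# Requires the spark-avro package on the classpath.
df = spark.read.format("avro").load(src)

# cache() is lazy: it only marks the DataFrame, nothing is read yet.
# Even after an action, evicted blocks are recomputed from the source
# files -- the cache is not a durable copy of the data.
df = df.cache()
df.count()  # action that forces the Avro data to be read and cached

# Write the Parquet output while the source files still exist...
df.write.mode("overwrite").parquet(dst)

# ...and only then archive/delete the source objects (boto3 used here as
# a stand-in for the PySpark archiving component).
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX)
for obj in resp.get("Contents", []):
    s3.delete_object(Bucket=SRC_BUCKET, Key=obj["Key"])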

Below is the screenshot of the job - https://drive.google.com/file/d/1dKk5...

And below is the stack trace of the error:

Job aborted due to stage failure: Task 61 in stage 4.0 failed 4 times, most recent failure: Lost task 61.3 in stage 4.0 (TID 373, ip-54-41-160-59.merck.com, executor 74): java.io.FileNotFoundException: No such file or directory: s3a://merck-mrl-datahub-clin-dev/base/work/edx/edx3_v_gcto_ppv_view/sdc-1599751013282-246
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:160)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:211)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
    at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.next(InMemoryRelation.scala:115)
    at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1164)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    ...


Comments

Are you a customer? If so, please reach out to our support team. They're well equipped to help with these types of issues -- especially if it's time sensitive.

iamontheinet (2020-09-10 13:57:43 -0500)