
Problem using the Python pandas and xlrd packages in a PySpark script on StreamSets Transformer: it fails with "ImportError: Missing optional dependency 'xlrd'"

asked 2020-09-10 06:58:31 -0500

satender

updated 2020-09-10 13:28:08 -0500

We have origin data on AWS S3 in a multi-tab Excel file that we need to parse. As far as we know, StreamSets has no built-in feature to read/parse a multi-tab Excel file, so we wrote a PySpark script that uses the pandas package, which in turn relies on the xlrd module. The script works fine when we run it directly on the EMR server, but on StreamSets it fails with "ImportError: Missing optional dependency 'xlrd'". We are using the PySpark stage to run the script. In terms of location, we installed the pandas package in the same place as our other PySpark packages.

Kindly help us find the root cause, and let us know if we are missing a step when using a package that is not part of the StreamSets library. Any documentation around this would be helpful.

Code snippet:

import sys
# If the packages live outside the default search path, it may need to be
# added explicitly (left commented out, as in the original run):
#sys.path.append('/usr/local/lib64/python3.7/site-packages')
import pandas as pd
import boto3
import xlrd  # required by pandas to read Excel files
from datetime import datetime
from pyspark.sql.functions import lit

# Placeholders; filled in with real values at runtime.
bucket_accessKey = '<access-key>'
bucket_secretKey = '<secret-key>'
bucket_nm = '<bucket-name>'
file_path = 'base/work/cn_etmf_sdv_rqrd'
data_src = 'cn_etmf_sdv_rqrd'
file_nm = 'Technical_Design_Document_Specification_ILR'
s3_src_bucket = bucket_nm + '/base/work'
s3_tgt_bucket = bucket_nm + '/base'
s3_archive_bucket = bucket_nm + '/base/sys-history'

def tostring(df):
    # Cast every pandas column to string so Spark can infer a uniform schema.
    for col in df.columns:
        df[col] = df[col].astype(str, copy=False)
    return df

session = boto3.session.Session(aws_access_key_id=bucket_accessKey,
                                aws_secret_access_key=bucket_secretKey)
src_s3_bucket = session.resource('s3').Bucket(bucket_nm)
counter = 0
for my_bucket_object in src_s3_bucket.objects.filter(Prefix=file_path):
    print('my_bucket_object.key--', my_bucket_object.key)
    file_nm = my_bucket_object.key.rsplit('/', 1)[1]
    study_nm = file_nm.strip().split(' ')[0]
    if len(study_nm) > 0:
        study_file_nm = 's3://' + bucket_nm + '/' + my_bucket_object.key
        # pandas opens the workbook via xlrd, which is where the ImportError
        # is raised; opening an s3:// path directly also requires s3fs.
        xls = pd.ExcelFile(study_file_nm)
        if 'Lists for Customization-1' in xls.sheet_names:
            df = pd.read_excel(xls, 'Lists for Customization-1', skiprows=11, skipfooter=8)
        else:
            df = pd.read_excel(xls, 'SDV006 Supporting Document')
        df = tostring(df)
        # `spark` is the SparkSession provided by the Transformer PySpark stage.
        sdf = spark.createDataFrame(df)
        sdf = sdf.withColumn('STUDY_NM', lit(study_nm)).withColumn('STUDY_FILE_NM', lit(study_file_nm))
        counter = counter + 1
        if counter == 1:
            final_df = sdf
        else:
            final_df = final_df.union(sdf)

# Assumes at least one matching file exists; otherwise final_df is undefined.
load_dt = datetime.today().strftime("%Y%m%d")
extract_dt = datetime.today().strftime("%Y%m%d")
final_df = final_df.withColumn('LOAD_DT', lit(load_dt)).withColumn('EXTRACT_DT', lit(extract_dt))
final_df.write.mode('overwrite').parquet('s3://' + s3_tgt_bucket + '/' + data_src + '/' + file_nm)
output = inputs[0]  # `inputs` is provided by the Transformer PySpark stage

Error:

An exception was raised by the Python Proxy. Return Message:

Traceback (most recent call last):
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1599633098996_0048/container_1599633098996_0048_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(params)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1599633098996_0048/container_1599633098996_0048_01_000001/tmp/1599740843830-0/python_code_runner.py", line 55, in run
    exec(code, context)
  File "<string>", line 44, in <module>
  File "/usr/local/lib64/python3.7/site-packages/pandas/io/excel/_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "/usr/local/lib64/python3.7/site-packages/pandas/io/excel/_xlrd.py", line ... (more)
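One way to narrow this down (an illustrative diagnostic, not part of the original script) is to log which interpreter and search path the PySpark stage actually uses, since the packages may be installed under a different Python than the one Transformer runs:

# Illustrative diagnostic: print the interpreter and module search path the
# PySpark stage sees, and check whether xlrd is resolvable from there.
import sys
import importlib.util

print('python executable:', sys.executable)
print('sys.path:', sys.path)
print('xlrd found at:', importlib.util.find_spec('xlrd'))

If find_spec returns None here while the same check succeeds in a shell on the EMR master node, the stage is running against a different interpreter or search path than the one the packages were installed into. Note also that pandas raises the same "Missing optional dependency" error when xlrd is importable but older than the minimum version pandas requires, so the installed version is worth checking too.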


Comments

Since you're referring to PySpark, is it safe to assume you're using Transformer?

iamontheinet (2020-09-10 07:05:32 -0500)

Yes, we are using it in Transformer.

satender (2020-09-10 07:10:21 -0500)

Ok, update your question and include the entire stack trace; it's not clear which module it's complaining about. Maybe also include your code snippet if it's not too long :)

iamontheinet (2020-09-10 07:20:20 -0500)

I have added the required information and corrected the error description to match the actual issue.

satender (2020-09-10 08:00:43 -0500)

Can you confirm that the dependency lib `xlrd` is installed on all the nodes of your EMR cluster?

iamontheinet (2020-09-10 08:03:01 -0500)
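A quick way to confirm that (a sketch that assumes the SparkSession named spark provided by the PySpark stage, not code from the original thread) is to attempt the import on the executors rather than on the driver:

# Sketch: attempt the import on executors and report per-host results.
# Assumes `spark` is the SparkSession provided by the PySpark stage.
import socket

def probe_xlrd(_):
    try:
        import xlrd  # noqa: F401
        return (socket.gethostname(), 'xlrd importable')
    except ImportError as exc:
        return (socket.gethostname(), 'import failed: ' + str(exc))

# Spread the probe over enough tasks to reach every executor.
results = set(spark.sparkContext.parallelize(range(64), 64).map(probe_xlrd).collect())
for host, status in sorted(results):
    print(host, status)

Any host that reports "import failed" is missing the library.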

1 Answer


answered 2020-09-23 08:55:40 -0500

Ankit8743

Add the spark.submit.pyFiles property under the cluster's Extra Spark Configuration. Set its value to the path of an archive of the libraries that the PySpark code is unable to import; in this case, xlrd and pandas.
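For illustration, here is a sketch of one way to build such an archive (the archive name and bucket below are assumptions, not from the original answer; the site-packages path is taken from the stack trace in the question):

# Sketch: zip the installed libraries so Spark can ship them to executors.
# Adjust the path to wherever pandas and xlrd are actually installed.
import shutil

site_packages = '/usr/local/lib64/python3.7/site-packages'
shutil.make_archive('/tmp/pydeps', 'zip', site_packages)  # creates /tmp/pydeps.zip

Upload the zip somewhere the cluster can read (for example s3://<bucket>/deps/pydeps.zip) and point the property at it:

spark.submit.pyFiles = s3://<bucket>/deps/pydeps.zip

One caveat: spark.submit.pyFiles reliably handles only pure-Python packages such as xlrd; pandas and its numpy dependency contain compiled extension modules that cannot be imported from a zip archive. If the import still fails after this change, installing the packages with pip on every cluster node (as suggested in the comments above) is the more robust approach.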


Stats

2 followers

Asked: 2020-09-10 06:49:29 -0500

Seen: 281 times

Last updated: Sep 23