# StreamSets Transformer: can't read Hive External Table mapped to HBase Table (on remote cluster)

Hello, I'm having trouble with the situation described in the subject. Here is a detailed description of the issue:

• StreamSets Transformer installed on "cluster 1" (Hortonworks 2.6.5)
• HBase table on "cluster 2" (Cloudera 5.13)
• Hive external table on "cluster 2" pointing to the HBase table above

If I manually run the Spark CLI from "cluster 1" as follows:

export SPARK_MAJOR_VERSION=2

pyspark --driver-memory 1G \
--executor-memory 1G \
--total-executor-cores 4 \
--num-executors 4 \
--master yarn \
--jars "/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar,\
/usr/hdp/current/hbase-client/lib/hbase-client.jar,\
/usr/hdp/current/hbase-client/lib/hbase-common.jar,\
/usr/hdp/current/hbase-client/lib/hbase-server.jar,\
/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,\
/usr/hdp/current/hbase-client/lib/hbase-protocol.jar"


And then, in Spark, I set up the Context:

>>> sqlContext.setConf("fs.defaultFS","hdfs://cluster-2-ns/")
>>> sqlContext.setConf("hive.metastore.uris","thrift://cluster-2-01:9083,thrift://cluster-2-02:9083,thrift://cluster-2-03:9083")
>>> sqlContext.setConf("hive.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("hbase.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("zookeeper.znode.parent","/hbase")
>>> sqlContext.setConf("spark.hbase.host","cluster-2-05")
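For what it's worth, I keep the remote-cluster settings in one dictionary so the manual test and the Transformer configuration stay in sync. A small sketch (the `apply_conf` helper is just my own naming, not a Spark or StreamSets API):

```python
# All remote-cluster settings in one place (same values as the setConf calls above).
REMOTE_CLUSTER_CONF = {
    "fs.defaultFS": "hdfs://cluster-2-ns/",
    "hive.metastore.uris": ("thrift://cluster-2-01:9083,"
                            "thrift://cluster-2-02:9083,"
                            "thrift://cluster-2-03:9083"),
    "hive.zookeeper.quorum": "cluster-2-01,cluster-2-02,cluster-2-03",
    "hbase.zookeeper.quorum": "cluster-2-01,cluster-2-02,cluster-2-03",
    "zookeeper.znode.parent": "/hbase",
    "spark.hbase.host": "cluster-2-05",
}

def apply_conf(sql_context, conf=REMOTE_CLUSTER_CONF):
    """Call setConf(key, value) on the given SQLContext for every entry."""
    for key, value in conf.items():
        sql_context.setConf(key, value)
```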


I can successfully read the remote HBase table by querying the Hive external table that points to it:

>>> df = sqlContext.table("rob.books_ext")
>>>
>>> df.printSchema()
root
|-- title: string (nullable = true)
|-- author: string (nullable = true)
|-- year: integer (nullable = true)
|-- views: double (nullable = true)

>>> df.show(20)
+--------------------+------------------+----+------+
|               title|            author|year| views|
+--------------------+------------------+----+------+
|Ci dispiace, ques...|              null|1922|  null|
| Godel, Escher, Bach|Douglas Hofstadter|1979| 821.0|
|In Search of Lost...|     Marcel Proust|1922|3298.0|
+--------------------+------------------+----+------+


Now, the next step was to configure a simple pipeline on StreamSets Transformer, passing the same additional jars as in the manual test. I can't upload pictures, but I have set up the jars correctly, uploading them through the "External Libraries" functionality.

I have also set up all the pointers to the remote "cluster 2" via the following "extra spark configuration" parameters:

fs.defaultFS
hbase.zookeeper.quorum
zookeeper.znode.parent
spark.hbase.host
spark.shuffle.service.enabled
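
For completeness, the values mirror the manual test above (I am assuming true for the shuffle-service flag, since that is the usual setting):

```properties
fs.defaultFS=hdfs://cluster-2-ns/
hbase.zookeeper.quorum=cluster-2-01,cluster-2-02,cluster-2-03
zookeeper.znode.parent=/hbase
spark.hbase.host=cluster-2-05
spark.shuffle.service.enabled=true
```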


But when I finally start the pipeline (it does run, and it even completes), the driver log shows the following. Apparently there is some sort of version mismatch in the additional HBase libraries, or at least that is my impression:

20/01/31 16:43:14 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
20/01/31 16:43:26 ERROR runner.DataTransformerRunner:
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276 ...


It does look like a version mismatch between the libraries you added and some jar already on the classpath. Can you try re-running without adding those? Cloudera adds several jars to the application classpath without actually telling you.
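One way to check for a mismatch is to look at which HBase jars actually end up on the driver's classpath. A rough sketch, runnable from inside a pyspark shell (the `hbase_jars` filter is just for illustration):

```python
# Sketch: filter HBase-related entries out of a JVM classpath string,
# so duplicate or conflicting jar versions become visible.
def hbase_jars(classpath, sep=":"):
    """Return the classpath entries that look like HBase jars."""
    return [entry for entry in classpath.split(sep) if "hbase" in entry.lower()]

# Inside pyspark you could feed it the real driver classpath, e.g.:
#   cp = spark._jvm.java.lang.System.getProperty("java.class.path")
#   for jar in hbase_jars(cp):
#       print(jar)
```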


Hi, thanks for your answer, and sorry for the delay in replying; I wanted to try several different options before commenting.

Specifically, I also tried adding different versions of the HBase jars as "external libraries", taking the jars from the remote Cloudera cluster (even though this doesn't make much sense, as the Spark jobs run on the "local" Hortonworks cluster). The exception I get in this case is exactly the one in my original post.

Finally, I tried what you suggested (running the pipeline again without adding any external HBase library), but with no luck. The exception I get in this case is different, though; it is as follows:

20/02/05 10:38:16 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:360)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:357)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:357)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:355)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:274)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:211)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:257)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:355)
at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:83)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getRawTable$1.apply(HiveExternalCatalog.scala:118)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getRawTable$1.apply(HiveExternalCatalog.scala:118)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:117)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:684)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:684)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:683)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:672)
at org.apache.spark.sql.catalyst.analysis ...