StreamSets Transformer: can't read Hive External Table mapped to HBase Table (on remote cluster)

asked 2020-01-31 11:22:15 -0500 by Rob_69

updated 2020-01-31 11:26:04 -0500 by metadaddy

Hello, I'm having trouble in the situation described in the subject line. Here is a detailed description of the issue:

  • StreamSets Transformer installed on "cluster 1" (Hortonworks 2.6.5)
  • HBase table on "cluster 2" (Cloudera 5.13)
  • Hive external table on "cluster 2" pointing to the HBase table above

If I manually run the PySpark shell from "cluster 1" as follows:

export SPARK_MAJOR_VERSION=2

pyspark --driver-memory 1G \
--executor-memory 1G \
--total-executor-cores 4 \
--num-executors 4 \
--master yarn \
--jars "/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar,\
/usr/hdp/current/hbase-client/lib/hbase-client.jar,\
/usr/hdp/current/hbase-client/lib/hbase-common.jar,\
/usr/hdp/current/hbase-client/lib/hbase-server.jar,\
/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,\
/usr/hdp/current/hbase-client/lib/hbase-protocol.jar"

And then, in the PySpark shell, I set up the context to point at "cluster 2":

>>> sqlContext.setConf("fs.defaultFS","hdfs://cluster-2-ns/")
>>> sqlContext.setConf("hive.metastore.uris","thrift://cluster-2-01:9083,thrift://cluster-2-02:9083,thrift://cluster-2-03:9083")
>>> sqlContext.setConf("hive.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("hbase.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("zookeeper.znode.parent","/hbase")
>>> sqlContext.setConf("spark.hbase.host","cluster-2-05")

I can successfully read the remote HBase table by querying the Hive external table that points to it:

>>> df = sqlContext.table("rob.books_ext")
>>>
>>> df.printSchema()
root
 |-- title: string (nullable = true)
 |-- author: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- views: double (nullable = true)

>>> df.show(20)
+--------------------+------------------+----+------+
|               title|            author|year| views|
+--------------------+------------------+----+------+
|Ci dispiace, ques...|              null|1922|  null|
| Godel, Escher, Bach|Douglas Hofstadter|1979| 821.0|
|In Search of Lost...|     Marcel Proust|1922|3298.0|
+--------------------+------------------+----+------+

Now, the next step was to configure a simple pipeline in StreamSets Transformer, passing the same additional jars as in the manual test above. I can't upload pictures, but I have correctly set up the jars by uploading them via the "External Libraries" functionality.

Also, I have correctly set up all the pointers to the external "cluster 2" via the following "Extra Spark Configuration" parameters (a sketch of the key/value pairs follows the list):

fs.defaultFS
hbase.zookeeper.quorum
zookeeper.znode.parent
spark.hbase.host
spark.hadoop.yarn.resourcemanager.address
spark.hadoop.yarn.resourcemanager.scheduler.address
spark.hadoop.yarn.resourcemanager.resource-tracker.address
spark.shuffle.service.enabled

But when I finally start the pipeline (it does run, and it even runs to completion!), this is what I get in the driver log. Apparently there is some sort of version mismatch in the additional HBase libraries, or at least that's my impression:

20/01/31 16:43:14 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
20/01/31 16:43:26 ERROR runner.DataTransformerRunner: 
java.lang.ExceptionInInitializerError
    at org.apache.hadoop.hive.hbase.HBaseSerDe.parseColumnsMapping(HBaseSerDe.java:181)
    at org.apache.hadoop.hive.hbase.HBaseSerDeParameters.<init>(HBaseSerDeParameters.java:73)
    at org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:117)
    at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
    at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276 ...

2 Answers


answered 2020-02-02 15:36:42 -0500 by hshreedharan

It does look like a version mismatch between the libraries you added and some jar already on the classpath. Can you try re-running without adding those? Cloudera adds several jars to the application classpath without actually telling you.
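
One way to confirm a conflict (a minimal sketch, assuming a PySpark shell with the usual spark session available) is to ask the JVM, via py4j, which jar a suspect class is actually loaded from:

>>> # Find the jar that actually provides HBaseSerDe, to spot duplicate versions
>>> jvm = spark.sparkContext._jvm
>>> cls = jvm.java.lang.Class.forName("org.apache.hadoop.hive.hbase.HBaseSerDe")
>>> print(cls.getProtectionDomain().getCodeSource().getLocation())

Repeat this for the classes named in the stack trace; if the reported location is not one of the jars you added, something else on the classpath is shadowing them.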


answered 2020-02-05 07:30:15 -0500 by Rob_69

Hi, thanks for your answer, and sorry for my delay in replying; I wanted to try several different options before commenting.

Specifically, I also tried adding different versions of the HBase jars as "external libraries", taking the jars from the remote Cloudera cluster (even though this doesn't make much sense, since the Spark jobs run on the "local" Hortonworks cluster). The exception I get in that case is exactly the one in my original post.

Finally, I tried what you suggested (running the pipeline again without adding any external HBase library), but with no luck. The exception I get in this case is different, though:

20/02/05 10:38:16 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
20/02/05 10:38:32 ERROR hive.log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2214)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:360)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$7.apply(HiveClientImpl.scala:357)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:357)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:355)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:274)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:212)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:211)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:257)
    at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:355)
    at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:81)
    at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:83)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getRawTable$1.apply(HiveExternalCatalog.scala:118)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getRawTable$1.apply(HiveExternalCatalog.scala:118)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:117)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:684)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:684)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:683)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:672)
    at org.apache.spark.sql.catalyst.analysis ...
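
So without the external libraries the driver simply can't see the HBase SerDe at all. For the record, my understanding (just a sketch using the HDP paths from my manual test above; I haven't verified this is the right mechanism in Transformer) is that the jar providing org.apache.hadoop.hive.hbase.HBaseSerDe has to be shipped explicitly, e.g. via spark.jars in the extra Spark configuration:

spark.jars=/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar,/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar

This is the same set of jars I passed to pyspark on the command line; presumably it would just reproduce the version-mismatch error above, so the real question remains which jar on the classpath is conflicting.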