StreamSets Transformer: can't read Hive External Table mapped to HBase Table (on remote cluster)
Hello, I'm having trouble with the situation described in the subject. Here is a detailed description of the issue:
- StreamSets Transformer installed on "cluster 1" (Hortonworks HDP 2.6.5)
- HBase Table on "cluster 2" (Cloudera CDH 5.13)
- Hive External Table on "cluster 2" pointing to the HBase Table above (a sketch of its definition follows)
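For context, the Hive External Table on "cluster 2" was created with a statement of roughly this shape; the HBase table name, column family and qualifiers below are placeholders, not the real mapping:
CREATE EXTERNAL TABLE rob.books_ext (
  title  string,
  author string,
  year   int,
  views  double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:author,cf:year,cf:views')
TBLPROPERTIES ('hbase.table.name' = 'books');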
If I manually run the Spark CLI (pyspark) from "cluster 1" as follows:
export SPARK_MAJOR_VERSION=2
pyspark --driver-memory 1G \
--executor-memory 1G \
--total-executor-cores 4 \
--num-executors 4 \
--master yarn \
--jars "/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar,\
/usr/hdp/current/hbase-client/lib/hbase-client.jar,\
/usr/hdp/current/hbase-client/lib/hbase-common.jar,\
/usr/hdp/current/hbase-client/lib/hbase-server.jar,\
/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,\
/usr/hdp/current/hbase-client/lib/hbase-protocol.jar"
Then, inside the pyspark shell, I set up the SQLContext:
>>> sqlContext.setConf("fs.defaultFS","hdfs://cluster-2-ns/")
>>> sqlContext.setConf("hive.metastore.uris","thrift://cluster-2-01:9083,thrift://cluster-2-02:9083,thrift://cluster-2-03:9083")
>>> sqlContext.setConf("hive.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("hbase.zookeeper.quorum","cluster-2-01,cluster-2-02,cluster-2-03")
>>> sqlContext.setConf("zookeeper.znode.parent","/hbase")
>>> sqlContext.setConf("spark.hbase.host","cluster-2-05")
I can successfully read the remote HBase Table by querying the Hive External Table that points to it:
>>> df = sqlContext.table("rob.books_ext")
>>>
>>> df.printSchema()
root
|-- title: string (nullable = true)
|-- author: string (nullable = true)
|-- year: integer (nullable = true)
|-- views: double (nullable = true)
>>> df.show(20)
+--------------------+------------------+----+------+
| title| author|year| views|
+--------------------+------------------+----+------+
|Ci dispiace, ques...| null|1922| null|
| Godel, Escher, Bach|Douglas Hofstadter|1979| 821.0|
|In Search of Lost...| Marcel Proust|1922|3298.0|
+--------------------+------------------+----+------+
Now, the next step was to configure a simple pipeline in StreamSets Transformer, passing the same additional JARs as in the "manual test" above. I can't upload pictures, but I have set up the JARs correctly by uploading them through the "External Libraries" functionality.
I have also set up all the pointers to the external cluster "cluster 2" via the following "Extra Spark Configuration" parameters (example values are sketched after the list):
fs.defaultFS
hbase.zookeeper.quorum
zookeeper.znode.parent
spark.hbase.host
spark.hadoop.yarn.resourcemanager.address
spark.hadoop.yarn.resourcemanager.scheduler.address
spark.hadoop.yarn.resourcemanager.resource-tracker.address
spark.shuffle.service.enabled
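Concretely, the values are along these lines (the first four are the same ones used in the manual pyspark test above; the ResourceManager host, the ports and the shuffle-service flag are shown here with example values only):
fs.defaultFS=hdfs://cluster-2-ns/
hbase.zookeeper.quorum=cluster-2-01,cluster-2-02,cluster-2-03
zookeeper.znode.parent=/hbase
spark.hbase.host=cluster-2-05
spark.hadoop.yarn.resourcemanager.address=cluster-2-01:8032
spark.hadoop.yarn.resourcemanager.scheduler.address=cluster-2-01:8030
spark.hadoop.yarn.resourcemanager.resource-tracker.address=cluster-2-01:8031
spark.shuffle.service.enabled=true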
But when I finally start the pipeline (and it does run, and even reaches completion), the Driver log shows the following. Apparently there is some sort of mismatch in the additional HBase libraries, or at least that is my impression (the check I would use to verify this is sketched after the trace):
20/01/31 16:43:14 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
20/01/31 16:43:26 ERROR runner.DataTransformerRunner:
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.hbase.HBaseSerDe.parseColumnsMapping(HBaseSerDe.java:181)
at org.apache.hadoop.hive.hbase.HBaseSerDeParameters.<init>(HBaseSerDeParameters.java:73)
at org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:117)
at org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276 ...
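To try to back up the "mismatch" impression, this is the kind of check I would run in the manual pyspark session to see which JAR a given class is actually loaded from (just a sketch, the class name is only an example, and this is not output from Transformer):
>>> # ask the JVM where a class on the driver classpath comes from
>>> klass = sc._jvm.java.lang.Class.forName("org.apache.hadoop.hive.hbase.HBaseSerDe")
>>> print(klass.getProtectionDomain().getCodeSource().getLocation().toString())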