minReplication error writing to Hadoop

asked 2017-05-09 22:08:10 -0600

metadaddy

updated 2017-05-09 22:09:01 -0600

I'm running the taxi tutorial against a single-node CDH running in VirtualBox. When the pipeline tries to write to Hadoop FS, I get this error:

error: HADOOPFS_13 - Error while writing to HDFS: com.streamsets.pipeline.api.StageException: HADOOPFS_58 - Flush failed on file: '/sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0' due to 'org.apache.hadoop.ipc.RemoteException( File /sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

1 Answer


answered 2017-05-09 22:10:47 -0600

metadaddy

The clue is in the message: "There are 1 datanode(s) running and 1 node(s) are excluded in this operation." This is a single-node CDH, so we would expect only one data node to be running, but why is it excluded? By default, the CDH Quickstart VM runs in NAT mode: it has its own private IP address, with port forwarding configured so that Hadoop FS, Hive, etc. are available from the host machine. I have quickstart.cloudera mapped to localhost in my /etc/hosts so I can use that hostname to access services in the VM.

What's happening is that SDC's request to the Hadoop name node is forwarded correctly, and the name node returns the data node's location. By default, the Hadoop client library then tries to connect to the data node's IP address on port 50075, but this fails, since that address is inside the VM's NAT and is not accessible from the host. The client then excludes the data node it could not reach, and with no other data node available, the write fails.
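If you want to confirm this diagnosis from the host, a plain TCP connection test is usually enough. The following is a minimal sketch, not part of SDC or Hadoop: `can_connect` is a hypothetical helper name, and the host and port you pass in are whatever your own setup uses (the port 50075 and the hostname quickstart.cloudera come from the description above; the VM's private IP is a placeholder you would fill in yourself).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Resolution failures, refused connections and timeouts all raise
    OSError subclasses, so they are all reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage on the host machine (values are assumptions for this setup):
#   can_connect("quickstart.cloudera", 50075)  # forwarded hostname
#   can_connect("<vm-private-ip>", 50075)      # address inside the VM's NAT
```

If the hostname check succeeds and the private IP check fails, that matches the situation described here: the forwarded ports work, but the address the client is handed does not.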

How do we resolve the problem? One way would be to reconfigure the VM to use bridged networking, so the VM's IP address is directly accessible from the host, but I chose a quicker, easier fix. Setting dfs.client.use.datanode.hostname to true in the Hadoop FS destination configuration tells the Hadoop client to connect using the hostname, quickstart.cloudera, rather than the IP address.
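In the destination, this is just a name/value pair. If you would rather keep the setting in a client-side hdfs-site.xml that the destination is pointed at (an assumption about your setup, not something required by the fix), the entry uses the standard Hadoop `<property>` shape. The sketch below only renders that fragment; `hadoop_property` is a hypothetical helper, not a Hadoop or SDC API.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The property name and value are the ones given in the answer above.
snippet = hadoop_property("dfs.client.use.datanode.hostname", "true")
print(snippet)
```

The printed element belongs inside the `<configuration>` root of the file.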

Now requests to the data node are forwarded to the VM as well, and my pipeline runs.

