How to check the integrity of the files between the source and destination

My source is an SFTP server and my destination is Hadoop (HDFS). I am using StreamSets to ingest the data from the source to the destination. The files are moved successfully, but I would like a way to check their integrity, to confirm that the file that arrived at the destination is identical to the one at the source.

On the SFTP side I have the `!cksum` command, and on the Hadoop side I have `hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum "location of the file"`.

In SFTP, this is the command I am using:

sftp> !cksum filename

(666126820,14)

The output is the checksum value along with the size of the file.

In Hadoop

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum filename

MD5-of-0MD5-of-512CRC32C 000002000000000000000000b377153cc30d105a8fa55ca462836dea

This is the output I am getting.

I am running this pipeline on CDH 5.13. As requested, I have updated the details; please let me know your comments or suggestions.

But I am getting two different results. How do I fix this? Please point me in the right direction.

Thanks in advance.

Please paste the full command you are using to calculate the checksum on the SFTP side. Also edit your question to include information about your Hadoop cluster environment. Also, does a simple `diff` command show a difference between these files?

My money is on the HDFS checksum using a different algorithm from the SFTP one.

1 Answer

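The `cksum` command and Hadoop's `-checksum` use different algorithms. `cksum` prints a POSIX CRC-32 together with the byte count, while the HDFS value in your question is an MD5 of MD5s of per-chunk CRC32C values, as the `MD5-of-0MD5-of-512CRC32C` label indicates. These two values will not match even for identical files. You must use the same algorithm on both sides when comparing checksums.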

This page gives one method for comparing checksums of files in HDFS against local files, using the crc32 command: Comparing checksums in HDFS.
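As a separate sketch (not the method from the linked page), you can also sidestep the algorithm mismatch by streaming the HDFS file to the client and hashing both copies with the same tool. The example below uses `sha256sum`; the function names `digest_of` and `compare_local_to_hdfs` are made up for illustration, and it assumes the source file is readable on the machine where the script runs and that `hdfs` is on the `PATH`.

```shell
#!/usr/bin/env bash
# Sketch: compare a local file with its HDFS copy by hashing both with the
# same algorithm (SHA-256 here). Function names are hypothetical.

# Read bytes on stdin and print only the hex digest.
digest_of() {
    sha256sum | awk '{print $1}'
}

# Return 0 if both copies hash to the same value, 1 if they differ,
# 2 if either file could not be read. The body runs in a subshell so that
# pipefail and the variables do not leak into the calling shell.
compare_local_to_hdfs() (
    set -o pipefail
    local_path=$1
    hdfs_path=$2
    src=$(digest_of < "$local_path") || return 2
    # Stream the HDFS file to stdout and hash it on the client side.
    dst=$(hdfs dfs -cat "$hdfs_path" | digest_of) || return 2
    if [ "$src" = "$dst" ]; then
        echo "MATCH $src"
    else
        echo "MISMATCH local=$src hdfs=$dst"
        return 1
    fi
)
```

You would call it as `compare_local_to_hdfs /local/path/to/file /hdfs/path/to/file` (both paths are placeholders). Note that this reads the whole HDFS file over the network, so it may be slow for very large files.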

