What is the significance of using the --split-by clause for running parallel import tasks in Apache Sqoop in Big Data Hadoop?

The --split-by clause specifies the column of the table that Apache Sqoop uses to generate splits when importing data into the Hadoop cluster. It takes a single column, not a list of columns. Sqoop finds the minimum and maximum values of that column and divides the range evenly among the map tasks (four by default, configurable with --num-mappers or -m), so that each task imports one slice of the table in parallel. This parallelism is what improves import performance. If --split-by is not specified, Sqoop uses the primary key of the table to create the splits. At times the values of the primary key are not evenly distributed between its minimum and maximum, which leaves some tasks with far more rows to import than others. In such cases, --split-by can name another column whose values are evenly distributed, so that the work is balanced across the tasks and the import is efficient.
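The effect of an unevenly distributed split column can be illustrated with a small Python sketch. This is a simplified model, not Sqoop's actual code: it assumes the range between the column's minimum and maximum values is cut into equal-width slices, one per map task, and counts how many rows land in each slice. The function name rows_per_split and both sample columns are hypothetical and exist only for illustration.

```python
def rows_per_split(values, num_splits=4):
    """Count the rows each parallel task would read if the range
    [min, max] of the split column is cut into equal-width slices.
    Simplified model for illustration, not Sqoop's implementation."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Every row has the same value, so there is nothing to split on.
        return [len(values)]
    width = (hi - lo) / num_splits
    counts = [0] * num_splits
    for v in values:
        # Clamp so the maximum value falls into the last slice.
        idx = min(int((v - lo) / width), num_splits - 1)
        counts[idx] += 1
    return counts

# Hypothetical primary key: ids 1..1000 plus a single outlier.
skewed_ids = list(range(1, 1001)) + [1_000_000]
# Hypothetical alternative column with evenly spread values.
even_col = list(range(1, 1002))

print(rows_per_split(skewed_ids))  # [1000, 0, 0, 1]
print(rows_per_split(even_col))    # [250, 250, 250, 251]
```

With the skewed key, one task reads almost the whole table while the others sit idle. With the evenly distributed column, the four tasks share the work. On the command line, the equivalent choice would look something like sqoop import --connect <jdbc-url> --table orders --split-by order_date_id --num-mappers 4, where the table and column names are made up for this example.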

Copyright ©2022 coderraj.com. All Rights Reserved.