
Hive DISTRIBUTE BY vs CLUSTER BY in Big Data Hadoop

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By column values go to the same reducer. This ensures each of the N reducers gets a non-overlapping set of column values (rows are assigned by hashing the columns, so the sets are not necessarily contiguous ranges), but it does not sort the output of each reducer. You end up with N or more unsorted files whose key values do not overlap.
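
As a rough illustration of the routing described above, the following Python sketch mimics Distribute By with a simple modulo on an integer key. The `distribute_by` helper, the sample rows, the `sales` table name, and the modulo rule are all assumptions made for illustration; Hive's real partitioner hashes the column values, but the effect is similar: equal keys land on the same reducer and nothing is sorted.

```python
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Toy model of DISTRIBUTE BY: route each row to a reducer by its key.

    Assumption: integer keys and a plain modulo stand in for Hive's hash
    partitioning. This is not Hive's actual code.
    """
    reducers = defaultdict(list)
    for row in rows:
        # Rows with the same key value always map to the same reducer.
        reducers[row[key] % num_reducers].append(row)
    return dict(reducers)


# Hypothetical rows, roughly what this query would shuffle:
#   SELECT * FROM sales DISTRIBUTE BY customer_id;
rows = [
    {"customer_id": 7, "amount": 10},
    {"customer_id": 2, "amount": 55},
    {"customer_id": 9, "amount": 31},
    {"customer_id": 4, "amount": 5},
    {"customer_id": 7, "amount": 99},
    {"customer_id": 3, "amount": 20},
]

for reducer, out in sorted(distribute_by(rows, "customer_id", 3).items()):
    # Each reducer keeps its rows in arrival order; no sorting happens.
    print(reducer, [r["customer_id"] for r in out])
```

Both rows for customer 7 end up on the same reducer, and no customer appears on two reducers, yet the rows inside a reducer stay in arrival order.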

Cluster By is a short-cut for both Distribute By and Sort By.

CLUSTER BY x ensures each of the N reducers gets a non-overlapping set of x values, then sorts the rows by x within each reducer.

Ordering : Sorted within each reducer only. There is no global ordering across reducers; use ORDER BY when a total order is required.

Outcome : N or more files, each sorted internally, with non-overlapping sets of values.
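
To make the Cluster By behavior concrete, here is a small Python sketch that distributes rows by key and then sorts each reducer's rows. The `cluster_by` helper, the sample keys, the `sales` table name, and the modulo rule are assumptions for illustration only, standing in for Hive's hash partitioning. It also checks whether concatenating the reducer outputs yields a totally ordered result.

```python
from collections import defaultdict


def cluster_by(rows, key, num_reducers):
    """Toy model of CLUSTER BY: distribute by key, then sort per reducer.

    Assumption: integer keys and a plain modulo stand in for Hive's hash
    partitioning. This is not Hive's actual code.
    """
    reducers = defaultdict(list)
    for row in rows:
        reducers[row[key] % num_reducers].append(row)
    # Sort each reducer's rows independently, as SORT BY would.
    return {
        reducer: sorted(out, key=lambda row: row[key])
        for reducer, out in sorted(reducers.items())
    }


# Hypothetical rows, roughly what either of these queries would produce:
#   SELECT * FROM sales CLUSTER BY customer_id;
#   SELECT * FROM sales DISTRIBUTE BY customer_id SORT BY customer_id;
rows = [{"customer_id": k} for k in [7, 2, 9, 4, 7, 3]]
files = cluster_by(rows, "customer_id", 3)

for reducer, out in files.items():
    print(reducer, [r["customer_id"] for r in out])

# Reading the reducer outputs one after another is not a total order.
combined = [r["customer_id"] for out in files.values() for r in out]
print(combined == sorted(combined))
```

In this sketch every reducer's output is sorted, but the concatenation of the outputs is not, which is why a query that needs a single fully sorted result would use ORDER BY instead.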

Copyright ©2022 coderraj.com. All Rights Reserved.