Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By column values go to the same reducer. This ensures each of N reducers gets a non-overlapping set of values for those columns (rows are assigned by hashing the column values by default, so these sets are not necessarily contiguous ranges), but it doesn’t sort the output of each reducer. You end up with N or more unsorted files with non-overlapping sets of values.
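The routing described above can be imitated outside Hive with a small Python sketch. This is only an illustration: the function name `distribute_by`, the sample rows, and the use of Python's built-in `hash` as a stand-in for Hive's own hash function are all assumptions, not Hive APIs.

```python
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer by hashing its key value; rows are not sorted."""
    reducers = defaultdict(list)
    for row in rows:
        # Python's hash() stands in for Hive's hash function (assumption).
        reducers[hash(row[key]) % num_reducers].append(row)
    return dict(reducers)

# Hypothetical sample data with a single integer column x.
rows = [{"x": v} for v in [5, 3, 5, 8, 3, 1, 8, 4]]
parts = distribute_by(rows, "x", num_reducers=3)

for reducer_id, part in sorted(parts.items()):
    # Each reducer holds every row for its x values, in arrival order.
    print(reducer_id, [r["x"] for r in part])
```

Each value of `x` lands in exactly one reducer, but the rows inside a reducer stay in arrival order, which mirrors the "unsorted files" outcome.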
Cluster By is a short-cut for both Distribute By and Sort By.
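CLUSTER BY x is equivalent to DISTRIBUTE BY x SORT BY x: it ensures each of N reducers gets a non-overlapping set of x values, then sorts the rows by x within each reducer.
Ordering: Rows are sorted within each reducer. There is no global ordering across reducers; use ORDER BY when a total order is required.
Outcome: N or more sorted files with non-overlapping sets of values.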
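As a rough illustration of the Cluster By behavior, the Python sketch below distributes rows by a hash of the key and then sorts each reducer's rows. The name `cluster_by`, the sample data, and Python's `hash` as a stand-in for Hive's hash function are assumptions made for this example only.

```python
from collections import defaultdict

def cluster_by(rows, key, num_reducers):
    """Distribute rows by hashed key, then sort each reducer's rows by that key."""
    reducers = defaultdict(list)
    for row in rows:
        # Python's hash() stands in for Hive's hash function (assumption).
        reducers[hash(row[key]) % num_reducers].append(row)
    # Sorting happens independently inside each reducer.
    return {rid: sorted(part, key=lambda r: r[key])
            for rid, part in reducers.items()}

# Hypothetical sample data with a single integer column x.
rows = [{"x": v} for v in [5, 3, 5, 8, 3, 1, 8, 4]]
clustered = cluster_by(rows, "x", num_reducers=3)

# Reading the reducer outputs one after another, in reducer order.
concatenated = [r["x"] for rid in sorted(clustered) for r in clustered[rid]]
print(clustered)
print(concatenated)
```

In this sketch every reducer's output is sorted, yet the concatenated outputs are not in sorted order, because keys are assigned to reducers by hash rather than by range.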