Tuesday, February 5, 2013

Difference between Hash and Sort grouping methods in the Aggregator stage


Grouping Methods
Hash (default)
1) Calculations are made for all groups at once, and the intermediate results for every group are held in memory
2) Results are written out only after all input has been processed, so memory requirements grow with the number of groups and can be large when the input volume is high
3) Input does not need to be sorted
4) Useful when the number of unique groups is small
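
The hash behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Aggregator stage is actually implemented; the function name hash_aggregate and the (key, amount) row layout are assumptions made for the example.

```python
def hash_aggregate(rows):
    """Sum `amount` per `key`. Input order does not matter."""
    totals = {}  # one entry per distinct group, all held in memory at once
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    # Nothing can be written out until every input row has been read.
    return list(totals.items())


# Unsorted input is fine: rows for group "b" arrive in two separate places.
result = hash_aggregate([("b", 2), ("a", 1), ("b", 3)])
```

The dictionary grows by one entry for every new group, which is why this approach suits a small number of unique groups.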
Sort
1) Requires the input data to be sorted by the grouping keys
2) Only a single aggregation group is kept in memory at a time, so less memory is required
3) When a new group is seen, the current group is written out
4) Can handle an unlimited number of groups
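
The sort behaviour can be sketched the same way. Again, this is a simplified illustration under assumptions, not the stage's real implementation; the function name sort_aggregate and the (key, amount) row layout are made up for the example.

```python
def sort_aggregate(sorted_rows):
    """Sum `amount` per `key`. Input must already be sorted by `key`."""
    current_key = None
    running = 0
    have_group = False
    for key, amount in sorted_rows:
        if have_group and key != current_key:
            # A new group has started, so the current one is written out.
            yield current_key, running
            running = 0
        current_key = key
        running += amount
        have_group = True
    if have_group:
        yield current_key, running  # write out the last group


# Sorted input: each group is emitted as soon as the next one begins.
ok = list(sort_aggregate([("a", 1), ("a", 2), ("b", 3)]))

# Unsorted input: group "a" is wrongly emitted twice, as two partial sums.
bad = list(sort_aggregate([("a", 1), ("b", 2), ("a", 3)]))
```

Only one key and one running total are kept at any moment, so memory use stays constant however many groups there are. The second call shows why the sort requirement matters: without it, a group is split into several partial results.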

Conclusion: When the volume of input is high and not predictable, it is better to use the Sort method, provided the input is sorted by the grouping keys.
