https://issues.apache.org/jira/browse/HIVE-2340
select userid,count(*) from u_data group by userid order by userid will produce an MRR plan.
When the result of userid,count(*) is small (one reducer can process it), can this query plan be optimized to MR?
To prevent bad reducer merging, the reducer merging only kicks in when the
optimizer thinks it gets a perf boost.
MR -> MRR is not a big win when it comes to Tez, due to container-reuse -
going wide on the large cardinality in case of missing map-side
aggregation will be safer.
If hive.map.aggr=true and the userid set fits within memory, then smushing
the reducers would be nicer.
To reset the wide-narrow checks, do
set hive.optimize.reducededuplication.min.reducer=1;
But be aware that it will fail (I've seen full disks) as you scale upwards
to the 10+ TB cases.
Cheers,
Gopal
hive.optimize.reducededuplication.min.reducer
Default Value: 4
Added In: Hive 0.11.0 with HIVE-2340
Reduce deduplication merges two RSs (reduce sink operators) by moving the key/parts/reducer-num of the child RS to the parent RS. That means if the reducer-num of the child RS is fixed (order by or forced bucketing) and small, the merge can produce a very slow, single MR job. The optimization will be disabled if the number of reducers is less than the specified value.
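One way to check whether the merge actually happens for the query in the question is to compare the EXPLAIN output before and after lowering the threshold. This is only a sketch: it assumes the u_data table from the question exists, that hive.map.aggr and hive.optimize.reducededuplication are left enabled, and that the result set is small enough for one reducer. The shape of the plan will vary with the Hive version and execution engine.

```sql
-- Sketch only: assumes the u_data table from the question and a small result set.
set hive.map.aggr=true;                               -- map-side aggregation, as suggested in the reply
set hive.optimize.reducededuplication=true;           -- the reduce deduplication optimization itself

-- Plan with the default threshold (4): the group by and order by are expected to stay separate.
set hive.optimize.reducededuplication.min.reducer=4;
explain select userid, count(*) from u_data group by userid order by userid;

-- Plan with the wide-narrow check relaxed: the two reduce sinks may be merged.
set hive.optimize.reducededuplication.min.reducer=1;
explain select userid, count(*) from u_data group by userid order by userid;

-- Restore the default afterwards so larger queries do not end up on a single reducer.
set hive.optimize.reducededuplication.min.reducer=4;
```

Comparing the number of reduce stages in the two plans shows whether the optimizer merged them. Keeping the override scoped to the session, and restoring the default afterwards, limits the risk described in the reply for large inputs.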