scala - Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

深蓝 · Answer 1 · 2021-10-23T21:38:10+0000

Two classes can be used to achieve aggregate-like behavior in the Dataset API:

UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.

The initial value is defined by the initialize method, seqOp by the update method, and combOp by the merge method.

Example implementation: How to define a custom aggregation function to sum a column of Vectors?

Aggregator, which uses standard Scala types with Encoders and takes records as input.

The initial value is defined by the zero method, seqOp by the reduce method, and combOp by the merge method.

Example implementation: How to find mean of grouped Vector columns in Spark SQL?
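
To make the mapping from aggregate(zero)(seqOp, combOp) concrete, here is a minimal sketch of an Aggregator that computes a mean. The names Reading, SumCount, MeanAggregator and AggregatorExample are hypothetical and chosen only for illustration; the sketch assumes a local Spark session with spark-sql on the classpath.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical input record and buffer types, for illustration only.
case class Reading(key: String, value: Double)
case class SumCount(sum: Double, count: Long)

// Aggregator[IN, BUF, OUT]: records in, (sum, count) buffer, mean out.
object MeanAggregator extends Aggregator[Reading, SumCount, Double] {
  // zero value of RDD.aggregate
  def zero: SumCount = SumCount(0.0, 0L)

  // seqOp: fold one record into the buffer
  def reduce(b: SumCount, r: Reading): SumCount =
    SumCount(b.sum + r.value, b.count + 1)

  // combOp: combine two partial buffers
  def merge(a: SumCount, b: SumCount): SumCount =
    SumCount(a.sum + b.sum, a.count + b.count)

  // finalization step, which RDD.aggregate does not have
  def finish(b: SumCount): Double =
    if (b.count == 0) Double.NaN else b.sum / b.count

  def bufferEncoder: Encoder[SumCount] = Encoders.product[SumCount]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 10.0)).toDS()

    // Global aggregation over the whole Dataset
    val globalMean: Double = ds.select(MeanAggregator.toColumn).head()
    println(globalMean)

    // By-key aggregation: one mean per key
    ds.groupByKey(_.key).agg(MeanAggregator.toColumn).show()

    spark.stop()
  }
}
```

Because zero, reduce, merge and finish are plain Scala functions over plain Scala types, they can be exercised without a Spark session, which tends to make this variant easier to unit-test.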

Both provide an additional finalization method (evaluate and finish, respectively) that produces the final result, and both can be used for global as well as by-key aggregations.
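
As an illustration of the UserDefinedAggregateFunction variant in both global and by-key form, the following is a minimal sketch that computes a mean over a Double column. MeanUdaf, UdafExample and the column names are hypothetical. Note that UserDefinedAggregateFunction is marked deprecated as of Spark 3.0, where the suggested route is to register an Aggregator through functions.udaf, so treat this as a sketch for older code bases rather than a recommendation.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical UDAF: the buffer holds (sum, count), the result is the mean.
class MeanUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  // zero value
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }

  // seqOp: fold one input row into the buffer, skipping nulls
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }

  // combOp: fold the second partial buffer into the first
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  // finalization: null when no non-null rows were seen
  def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null
    else buffer.getDouble(0) / buffer.getLong(1)
}

object UdafExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udaf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")
    val mean = new MeanUdaf

    // Global aggregation
    df.agg(mean(col("value")).as("mean")).show()

    // By-key aggregation
    df.groupBy("key").agg(mean(col("value")).as("mean")).show()

    spark.stop()
  }
}
```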
