
sk-dist: Distributed scikit-learn meta-estimators in PySpark


Open-source software name:

sk-dist

Open-source software address:

https://gitee.com/mirrors/sk-dist

Open-source software description:

sk-dist

sk-dist: Distributed scikit-learn meta-estimators in PySpark


What is it?

sk-dist is a Python package for machine learning built on top of scikit-learn and is distributed under the Apache 2.0 software license. The sk-dist module can be thought of as "distributed scikit-learn", as its core functionality is to extend the scikit-learn built-in joblib parallelization of meta-estimator training to spark. A popular use case is the parallelization of grid search.
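Grid search parallelizes well because every combination of parameter set and cross-validation fold is an independent fit. The sketch below illustrates that idea with the standard library only. It is not sk-dist's implementation: the thread pool stands in for spark executors, and the toy classifier, the helper names (`expand_grid`, `make_folds`, `fit_and_score`, `parallel_grid_search`) and the `offset` hyperparameter are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a dict."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]


def make_folds(n_samples, n_folds):
    """Build (train, test) index lists; fold i tests every n_folds-th sample."""
    indices = list(range(n_samples))
    folds = []
    for i in range(n_folds):
        test = indices[i::n_folds]
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        folds.append((train, test))
    return folds


def fit_and_score(task):
    """One independent unit of work: fit on one fold with one parameter set."""
    params, train, test, X, y = task
    # Toy 1-D classifier (hypothetical): the threshold is the midpoint of the
    # two class means learned on the training fold, shifted by "offset".
    mean0 = mean(X[i] for i in train if y[i] == 0)
    mean1 = mean(X[i] for i in train if y[i] == 1)
    threshold = (mean0 + mean1) / 2 + params["offset"]
    correct = sum((X[i] > threshold) == (y[i] == 1) for i in test)
    return tuple(sorted(params.items())), correct / len(test)


def parallel_grid_search(X, y, param_grid, n_folds=2, max_workers=4):
    """Score every (parameter set, fold) pair in parallel; return the best set."""
    tasks = [(params, train, test, X, y)
             for params in expand_grid(param_grid)
             for train, test in make_folds(len(X), n_folds)]
    # The pool plays the role of the cluster: each task runs independently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fit_and_score, tasks))
    per_params = {}
    for key, score in results:
        per_params.setdefault(key, []).append(score)
    mean_scores = {key: mean(scores) for key, scores in per_params.items()}
    best_key = max(mean_scores, key=mean_scores.get)
    return dict(best_key), mean_scores


X = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best_params, mean_scores = parallel_grid_search(
    X, y, {"offset": [-6.0, 0.0, 6.0]}, n_folds=2
)
```

With three offsets and two folds, six fits are scheduled. The number of tasks grows multiplicatively with the grid, which is why spreading them over a cluster pays off for large searches.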

Check out the blog post for more information on the motivation and use cases of sk-dist.

Main Features

  • Distributed Training - sk-dist parallelizes the training of scikit-learn meta-estimators with PySpark. This allows distributed training of these estimators without any constraint on the physical resources of any one machine. In all cases, spark artifacts are automatically stripped from the fitted estimator. These estimators can then be pickled and un-pickled for prediction tasks, operating identically at predict time to their scikit-learn counterparts. Supported tasks are:
  • Distributed Prediction - sk-dist provides a prediction module which builds vectorized UDFs for PySpark DataFrames using fitted scikit-learn estimators. This distributes the predict and predict_proba methods of scikit-learn estimators, enabling large scale prediction with scikit-learn.
  • Feature Encoding - sk-dist provides a flexible feature encoding utility called Encoderizer which encodes mix-typed feature spaces using either default behavior or user defined customizable settings. It is particularly aimed at text features, but it additionally handles numeric and dictionary type feature spaces.
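The Distributed Prediction feature rests on one idea: a vectorized UDF calls `predict` once per batch of rows rather than once per row, and the batches can be processed in parallel. The following is a minimal standard-library sketch of that pattern, not the sk-dist prediction module itself. `ThresholdModel`, `chunk` and `batched_predict` are hypothetical names, and a thread pool stands in for spark executors.

```python
from concurrent.futures import ThreadPoolExecutor


class ThresholdModel:
    """Hypothetical fitted estimator exposing a scikit-learn style predict."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # Works on a whole batch at once, like the body of a vectorized UDF.
        return [int(x > self.threshold) for x in rows]


def chunk(rows, batch_size):
    """Split rows into consecutive batches of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def batched_predict(model, rows, batch_size=3, max_workers=2):
    """Apply model.predict to each batch in parallel, preserving row order."""
    batches = chunk(rows, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so rows stay aligned.
        per_batch = list(pool.map(model.predict, batches))
    return [label for batch in per_batch for label in batch]


rows = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.6]
model = ThresholdModel(threshold=0.5)
predictions = batched_predict(model, rows)
```

The batched result matches a single `model.predict(rows)` call, which is the property that lets a fitted estimator behave the same locally and on a cluster.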

Installation

Dependencies

sk-dist requires:

Dependency Notes

  • versions of numpy, scipy and joblib that are compatible with any supported version of scikit-learn should be sufficient for sk-dist
  • sk-dist is not supported with Python 2

Spark Dependencies

Most sk-dist functionality requires a spark installation as well as PySpark. Some functionality can run without spark, so spark related dependencies are not required. The connection between sk-dist and spark relies solely on a sparkContext as an argument to various sk-dist classes upon instantiation.

A variety of spark configurations and setups will work. It is left up to the user to configure their own spark setup. The testing suite runs spark 2.4 and spark 3.0, though any spark 2.0+ versions are expected to work.

An additional spark related dependency is pyarrow, which is used only for skdist.predict functions. These use vectorized pandas UDFs, which require pyarrow>=0.8.0, tested with pyarrow==0.16.0. Depending on the spark version, it may be necessary to set spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the spark configuration.
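One reason the setting depends on the spark version is that Spark 3.0 renamed the Arrow toggle to `spark.sql.execution.arrow.pyspark.enabled` and deprecated the older name. A small helper along the following lines can pick the key from a version string. `arrow_conf_key` is a hypothetical name, not part of sk-dist or PySpark, and the sketch assumes a version string that starts with the major version number.

```python
def arrow_conf_key(spark_version):
    """Return the Arrow toggle name for a version string such as "2.4.5"."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        # Spark 3.0 and later use the pyspark-scoped name.
        return "spark.sql.execution.arrow.pyspark.enabled"
    # Spark 2.x uses the name quoted in the paragraph above.
    return "spark.sql.execution.arrow.enabled"


# Usage with an existing SparkSession named `spark` (assumed, not created here):
#     spark.conf.set(arrow_conf_key(spark.version), "true")
```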

User Installation

The easiest way to install sk-dist is with pip:

pip install --upgrade sk-dist

You can also download the source code:

git clone https://github.com/Ibotta/sk-dist.git

Testing

With pytest installed, you can run tests locally:

pytest sk-dist

Examples

The package contains numerous examples on how to use sk-dist in practice. Examples of note are:

Gradient Boosting

sk-dist has been tested with a number of popular gradient boosting packages that conform to the scikit-learn API. This includes xgboost and catboost. These will need to be installed in addition to sk-dist on all nodes of the spark cluster via a node bootstrap script. Version compatibility is left up to the user.

Support for lightgbm is not guaranteed, as it requires additional installations on all nodes of the spark cluster. This may work given proper installation but has not been tested with sk-dist.
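As a rough illustration of what "conforming to the scikit-learn API" means for a predictor, the sketch below checks for the methods that meta-estimators typically rely on: `fit` and `predict` for training and inference, and `get_params` and `set_params` for cloning and parameter search. This is a quick duck-typing heuristic under that assumption, not a check that sk-dist performs; `follows_estimator_api` and the two example classes are hypothetical.

```python
REQUIRED_METHODS = ("fit", "predict", "get_params", "set_params")


def follows_estimator_api(obj):
    """Return True if obj exposes the core scikit-learn predictor methods."""
    return all(callable(getattr(obj, name, None)) for name in REQUIRED_METHODS)


class ConformingModel:
    """Hypothetical model with the full method set."""

    def __init__(self, depth=3):
        self.depth = depth

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]

    def get_params(self, deep=True):
        return {"depth": self.depth}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class FitOnlyModel:
    """Hypothetical model that cannot be cloned or tuned by a meta-estimator."""

    def fit(self, X, y):
        return self
```

Passing this check does not guarantee compatibility. scikit-learn ships a far more thorough `check_estimator` utility in `sklearn.utils.estimator_checks` for that purpose.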

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2019.

It is currently maintained by the machine learning team at Ibotta. Special thanks to those who contributed to sk-dist while it was initially in development at Ibotta:

Thanks to James Foley for logo artwork.
