tabml: TabML aims to create a general machine learning framework for working with tabular data

Open-source software name:

tabml

Open-source software URL:

https://gitee.com/mirrors/tabml

Open-source software description:

TabML: a Machine Learning pipeline for tabular data

Introduction

This is an active project that aims to create a general machine learning framework for working with tabular data.

Key features:

One of the most important tasks when working with tabular data is feature extraction. TabML allows users to define each feature in isolation, without worrying about other features. This reduces code conflicts when several team members develop different features at the same time. In addition, when one feature needs to be updated, unrelated features can be left untouched, so the computation cost is relatively small compared with running a pipeline that regenerates every feature.
Parameters are specified in a config file. This config file is automatically saved into an experiment folder after each training run for reproducibility.
TabML is integrated with MLflow, which allows users to keep track of all model parameters and metrics.
TabML supports multiple ML packages for tabular data:
- LightGBM
- XGBoost
- CatBoost
- Scikit-learn
- Keras
- PyTorch
- TabNet
- ...
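The config-saving feature described above can be pictured with a small sketch. This is not TabML's implementation: the helper name `snapshot_config`, the timestamped folder layout, and the file names are all assumptions made for illustration, and only the Python standard library is used.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def snapshot_config(config_path: Path, experiments_root: Path) -> Path:
    """Copy a config file into a new, timestamped experiment folder.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    run_name = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    experiment_dir = experiments_root / run_name
    experiment_dir.mkdir(parents=True, exist_ok=False)
    saved_path = experiment_dir / config_path.name
    shutil.copy2(config_path, saved_path)  # keep an exact copy of the config
    return saved_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        config = root / "pipeline_config.yaml"
        config.write_text("model:\n  learning_rate: 0.1\n")
        saved = snapshot_config(config, root / "experiments")
        print(saved.read_text())
```

Keeping a verbatim copy next to the training outputs means a run can be repeated later with exactly the parameters that produced it, even after the working config has changed.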

Installation

pip install tabml

Main components

In the TRAINING step:

The FeatureManager class is responsible for loading raw data and engineering it into relevant features for model training and analysis. If a feature requires a fit step, e.g. imputation, the fitted parameters are stored for later use in the transform step. One such use is the serving step, where only the transform step runs. Each project has one feature_manager.py file that specifies how each feature is computed (example). The computation order and the feature dependencies are specified in a yaml config file (example).
The DataLoader loads training and validation data for model training and analysis. In a typical project, tabml already takes care of this class; users only need to specify the configuration in the pipeline config file (example). That file must specify the features and the label used for training. In addition, a set of boolean features serves as conditions for selecting training and validation data: only rows that meet all training (or validation) conditions are selected.
The ModelWrapper class defines the model, how to train it, and the methods for loading the model and making predictions.
The ModelAnalysis analyzes the model on different metrics along user-defined dimensions. Analyzing metrics on different slices of the data can reveal whether the trained model is biased toward some feature value, or whether there is a slice of data on which model performance could be improved.
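Two of the ideas above, selecting rows by boolean condition features and analyzing a metric per slice, can be sketched in plain Python. This does not use TabML's DataLoader or ModelAnalysis; the function names, the column names, and the toy rows are hypothetical and exist only to make the logic concrete.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def select_rows(rows: Sequence[dict], conditions: Sequence[str]) -> List[dict]:
    """Keep only the rows for which every boolean condition feature is True."""
    return [row for row in rows if all(row[name] for name in conditions)]


def accuracy_by_slice(
    rows: Sequence[dict], label: str, prediction: str, dimension: str
) -> Dict[object, float]:
    """Compute accuracy separately for each value of `dimension`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[dimension]
        total[key] += 1
        correct[key] += int(row[label] == row[prediction])
    return {key: correct[key] / total[key] for key in total}


# Toy data: column names are made up for this example.
rows = [
    {"is_validation": True, "has_label": True, "sex": "female", "survived": 1, "predicted": 1},
    {"is_validation": True, "has_label": True, "sex": "female", "survived": 0, "predicted": 0},
    {"is_validation": True, "has_label": True, "sex": "male", "survived": 1, "predicted": 0},
    {"is_validation": True, "has_label": True, "sex": "male", "survived": 0, "predicted": 0},
    {"is_validation": False, "has_label": True, "sex": "male", "survived": 1, "predicted": 1},
    {"is_validation": True, "has_label": False, "sex": "female", "survived": 0, "predicted": 1},
]

# A row is used only if it satisfies all listed conditions.
validation_rows = select_rows(rows, ["is_validation", "has_label"])
slice_accuracy = accuracy_by_slice(validation_rows, "survived", "predicted", "sex")
print(slice_accuracy)  # {'female': 1.0, 'male': 0.5}
```

In this toy case the per-slice view shows that the predictions are worse for one value of `sex`, which an overall accuracy of 0.75 would hide.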

In the SERVING step, raw data is fed into the fitted FeatureManager to get the transformed features that the trained model can use. The model then makes predictions on the transformed features.
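The split between a fit step at training time and a transform-only step at serving time can be illustrated with a minimal sketch. The class below is hypothetical and is not TabML's feature API; it only shows how a fitted parameter (here, a mean used for imputation) is stored once and reused later.

```python
from statistics import mean
from typing import List, Optional, Sequence


class MeanImputedFeature:
    """Hypothetical feature that fills missing values with the training mean."""

    def __init__(self) -> None:
        self.fill_value: Optional[float] = None  # fitted parameter, set by fit()

    def fit(self, values: Sequence[Optional[float]]) -> "MeanImputedFeature":
        """Learn the fill value from the observed (non-missing) training values."""
        observed = [v for v in values if v is not None]
        self.fill_value = mean(observed)
        return self

    def transform(self, values: Sequence[Optional[float]]) -> List[float]:
        """Replace missing values with the stored fill value."""
        if self.fill_value is None:
            raise RuntimeError("fit() must be called before transform()")
        return [self.fill_value if v is None else v for v in values]


# Training: fit on the training data, then transform it.
age = MeanImputedFeature().fit([20.0, None, 40.0])
train_age = age.transform([20.0, None, 40.0])  # [20.0, 30.0, 40.0]

# Serving: transform only, reusing the value fitted during training.
serving_age = age.transform([None, 55.0])  # [30.0, 55.0]
```

Reusing the stored value at serving time keeps the features seen by the model consistent with those it was trained on.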

Examples

Please check the examples folder for several example projects. For each project:

python feature_manager.py  # to generate features
python pipelines.py  # to train the model

You can change some parameters in the config file, run python pipelines.py again, and then run mlflow ui to see information about each run.

In most projects, users only need to focus their efforts on designing features. The feature dependency is defined in a yaml config file and the feature implementation is stored in feature_manager.py.
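To see why declaring feature dependencies makes it possible to update one feature without touching unrelated ones, consider the sketch below. It is an illustration under assumptions, not TabML's code: in TabML the order and dependencies come from the yaml config, whereas here a hypothetical dependency map is written as a Python dict and an order is derived with the standard library's `graphlib` (Python 3.9+). All feature names are made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical dependency map: feature -> features it is computed from.
dependencies: Dict[str, Set[str]] = {
    "age": set(),
    "fare": set(),
    "imputed_age": {"age"},
    "bucketized_age": {"imputed_age"},
    "log_fare": {"fare"},
}


def computation_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every feature comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def affected_by(feature: str, deps: Dict[str, Set[str]]) -> Set[str]:
    """Return `feature` plus every feature that depends on it, directly or not."""
    affected = {feature}
    changed = True
    while changed:  # repeat until no new dependent feature is found
        changed = False
        for name, parents in deps.items():
            if name not in affected and parents & affected:
                affected.add(name)
                changed = True
    return affected


order = computation_order(dependencies)
affected = affected_by("imputed_age", dependencies)
to_recompute = [name for name in order if name in affected]
print(to_recompute)  # ['imputed_age', 'bucketized_age']
```

Here a change to `imputed_age` triggers recomputation of only that feature and `bucketized_age`; `fare` and `log_fare` are left as they are.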

Setup for development

Add path to this repo

Add the following lines to your shell config file (~/.bashrc, ~/.zshrc or any shell config file of your choice):

export TABML=<local_path_to_this_git_repo>
alias 2tabml='cd $TABML; source bashrc; source tabml_env/bin/activate; python3 setup.py install'

Create the environment

cd $TABML
python3 -m venv tabml_env
source tabml_env/bin/activate
pip3 install -r requirements.txt

Set up pre-commit to auto-format code when creating a git commit:

pre-commit install

Check that everything is working

by running the tests:

2tabml
python3 -m pytest ./tests ./examples

Common errors

SHAP

SHAP might not work on macOS if the Xcode version is lower than 13; try upgrading to Xcode 13. Related issue.

LightGBM

pip install lightgbm might not work on macOS; try following the official installation guide for Mac.

If you find a bug or want to request a feature, feel free to create an issue. Any Pull Request would be much appreciated.
