Open-source software name: tabml
Open-source software URL: https://gitee.com/mirrors/tabml
Open-source software description:
TabML: a Machine Learning pipeline for tabular data
Introduction
This is an active project that aims to create a general machine learning framework for working with tabular data.

Key features:
- One of the most important tasks in working with tabular data is handling feature extraction. TabML allows users to define each feature in isolation, without worrying about other features. This helps reduce code conflicts when multiple team members develop different features simultaneously. In addition, if one feature needs to be updated, unrelated features can be left untouched, so the computation cost is relatively small compared with running a pipeline to regenerate all other features.
- Parameters are specified in a config file. This config file is automatically saved into an experiment folder after each training run for reproducibility.
- TabML is integrated with MLflow, which allows users to keep track of all model parameters and metrics.
- Supports multiple ML packages for tabular data.
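The feature-isolation idea described in the key features above can be illustrated with a small sketch. Everything here is hypothetical and is not TabML's API: the `FEATURES` registry, the `feature` decorator, and the column names are invented for illustration. The sketch assumes features are registered in dependency order (a feature is registered after the features it depends on), and shows how updating one feature recomputes only that feature and its dependents.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry (NOT TabML's API): feature name -> (dependencies, compute function).
FEATURES: Dict[str, Tuple[List[str], Callable]] = {}


def feature(name: str, deps: List[str]):
    """Register a feature together with the columns it depends on."""
    def register(fn: Callable) -> Callable:
        FEATURES[name] = (deps, fn)
        return fn
    return register


@feature("age_filled", deps=["age"])
def age_filled(cols):
    # Placeholder imputation: replace missing ages with 0.0.
    return [0.0 if a is None else a for a in cols["age"]]


@feature("is_adult", deps=["age_filled"])
def is_adult(cols):
    return [a >= 18 for a in cols["age_filled"]]


@feature("fare_per_person", deps=["fare", "family_size"])
def fare_per_person(cols):
    return [f / n for f, n in zip(cols["fare"], cols["family_size"])]


def dependents_of(name: str) -> List[str]:
    """Return `name` plus every feature that depends on it, in registration order."""
    affected = {name}
    changed = True
    while changed:
        changed = False
        for feat, (deps, _) in FEATURES.items():
            if feat not in affected and affected.intersection(deps):
                affected.add(feat)
                changed = True
    return [f for f in FEATURES if f in affected]


def recompute(name: str, columns: dict) -> dict:
    """Recompute only `name` and its dependents; unrelated features are untouched."""
    for feat in dependents_of(name):
        deps, fn = FEATURES[feat]
        columns[feat] = fn({d: columns[d] for d in deps})
    return columns


columns = recompute("age_filled", {"age": [20, None, 5]})
print(dependents_of("age_filled"))
print(columns["is_adult"])
```

Because `fare_per_person` does not depend on `age_filled`, it is never touched by this update, which is the cost saving the key features describe.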
Installation

Main components
In the TRAINING step:
- The FeatureManager class is responsible for loading raw data and engineering it into relevant features for model training and analysis. If a feature requires a fit step (e.g. imputation), the fitted parameters are stored for later use in the transform step. One such use is the serving step, where only the transform step is run. Each project has one feature_manager.py file that specifies how each feature is computed (example). The computation order and the feature dependencies are specified in a yaml config file (example).
- The DataLoader loads training and validation data for model training and analysis. In a typical project, tabml already takes care of this class; users only need to specify the configuration in the pipeline config file (example). That file must specify the features and the label used for training. In addition, a set of boolean features is used as conditions for selecting training and validation data: only rows in the dataset that meet all training/validation conditions are selected.
- The ModelWrapper class defines the model, how to train it, and methods for loading the model and making predictions.
- The ModelAnalysis analyzes the model on different metrics along user-defined dimensions. Analyzing metrics on different slices of the data can reveal whether the trained model is biased toward some feature value, or whether there is a slice of data on which model performance could be improved.
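The fit/transform split, the condition-based row selection, and the per-slice analysis described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not TabML's implementation: `MeanImputer`, `select_rows`, `accuracy_by_slice`, and all column names are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean


class MeanImputer:
    """Hypothetical transformer (not TabML's API): fills missing values with the training mean."""

    def __init__(self, column, fill_value=None):
        self.column = column
        self.fill_value = fill_value

    def fit(self, rows):
        # Learn the fill value from the non-missing training rows only.
        self.fill_value = mean(r[self.column] for r in rows if r[self.column] is not None)
        return self

    def transform(self, rows):
        if self.fill_value is None:
            raise RuntimeError("call fit() or load fitted parameters first")
        return [
            {**r, self.column: self.fill_value if r[self.column] is None else r[self.column]}
            for r in rows
        ]

    def dumps(self):
        # Persist the fitted parameters so that serving can skip the fit step.
        return json.dumps({"column": self.column, "fill_value": self.fill_value})

    @classmethod
    def loads(cls, text):
        return cls(**json.loads(text))


def select_rows(rows, conditions):
    """Keep only rows for which every named boolean feature is true."""
    return [r for r in rows if all(r[c] for c in conditions)]


def accuracy_by_slice(rows, slice_key, label_key, pred_key):
    """Accuracy computed separately for each value of `slice_key`."""
    hits = defaultdict(list)
    for r in rows:
        hits[r[slice_key]].append(r[label_key] == r[pred_key])
    return {value: sum(flags) / len(flags) for value, flags in hits.items()}


# Training: fit on rows that satisfy the (hypothetical) training condition.
data = [
    {"age": 20.0, "is_train": True},
    {"age": None, "is_train": True},
    {"age": 40.0, "is_train": True},
    {"age": 99.0, "is_train": False},
]
saved = MeanImputer("age").fit(select_rows(data, ["is_train"])).dumps()

# Serving: restore the fitted parameters and run transform only.
serving_imputer = MeanImputer.loads(saved)
print(serving_imputer.transform([{"age": None, "is_train": False}]))
```

Storing the fitted value rather than refitting at serving time keeps the serving-time features consistent with what the model saw during training.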
In the SERVING step, raw data is fed into the fitted FeatureManager to get the transformed features that the trained model can use. The model then makes predictions on the transformed features.

Examples
Please check the examples folder for several example projects. For each project:
    python feature_manager.py  # to generate features
    python pipelines.py  # to train the model
You can change some parameters in the config file, run python pipelines.py again, then run mlflow ui to see information about each run.

In most projects, users only need to focus their efforts on designing features. The feature dependencies are defined in a yaml config file and the feature implementations are stored in feature_manager.py.

Setup for development
Add path to this repo
Add the following lines to your shell config file (~/.bashrc, ~/.zshrc or any shell config file of your choice):
    export TABML=<local_path_to_this_git_repo>
    alias 2tabml='cd $TABML; source bashrc; source tabml_env/bin/activate; python3 setup.py install'

Create the environment
    cd $TABML
    python3 -m venv tabml_env
    source tabml_env/bin/activate
    pip3 install -r requirements.txt

Set up pre-commit to auto-format code when creating a git commit.

Check that everything is working by running the tests:
    2tabml
    python3 -m pytest ./tests ./examples

Common errors
- SHAP
SHAP might not work on macOS if the Xcode version is below 13; try upgrading to Xcode 13. Related issue.
- LightGBM
pip install lightgbm might not work on macOS; try following the official installation guide for Mac.
If you find a bug or want to request a feature, feel free to create an issue. Any pull request would be much appreciated.