edl: Elastic Deep Learning using PaddlePaddle and Kubernetes


Open-source project name:

edl

Open-source project URL:

https://gitee.com/paddlepaddle/edl

Open-source project introduction:





Motivation

Computing resources on clouds such as Amazon AWS and Baidu Cloud are multi-tenant. Training and inference of deep learning models with elastic resources will become common on the cloud. We propose Elastic Deep Learning (EDL), which makes training and inference of deep learning models on the cloud easier and more efficient.

Now EDL is an incubation-stage project of the LF AI Foundation.

Installation

The EDL package supports Python 2.7/3.6/3.7. You can install it with pip install paddle_edl, but we highly recommend using it inside our Docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash
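As a quick sanity check after installation (our own suggestion, not an official step), the package should be importable:

```python
# minimal check that the paddle_edl package was installed correctly
import paddle_edl
print(paddle_edl.__file__)
```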

Latest Release (0.3.1)

  • Support elastic training with inference-type services during training, such as knowledge distillation
  • Inference-type services are automatically registered through service discovery in EDL (a conceptual sketch follows this list)
  • Knowledge distillation examples in computer vision and natural language processing
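To illustrate what service discovery means here, below is a small, purely hypothetical Python sketch of inference instances registering themselves and training workers discovering live endpoints. The registry class, its methods, and the endpoint values are illustrative assumptions, not EDL's actual API.

```python
import time

class ServiceRegistry:
    """A toy in-memory registry; EDL has its own service-discovery modules."""

    def __init__(self):
        # service name -> {endpoint: last heartbeat timestamp}
        self._services = {}

    def register(self, name, endpoint):
        """Called by an inference instance when it comes online (or as a heartbeat)."""
        self._services.setdefault(name, {})[endpoint] = time.time()

    def discover(self, name, timeout=10.0):
        """Return endpoints whose last heartbeat is recent enough."""
        now = time.time()
        endpoints = self._services.get(name, {})
        return [ep for ep, ts in endpoints.items() if now - ts < timeout]


registry = ServiceRegistry()
# an elastic teacher instance registers itself when it starts serving
registry.register("resnext101_teacher", "127.0.0.1:9898")
# training workers look up the currently alive teacher endpoints
print(registry.discover("resnext101_teacher"))
```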

Quick Start Demo

  • Install Paddle Serving
pip install paddle-serving-server-gpu
  • The teacher model (ResNeXt101_32x16d_wsl): download it and deploy it as an inference service on GPU 1.
cd example/distill/resnet
wget --no-check-certificate https://paddle-edl.bj.bcebos.com/distill_teacher_model/ResNeXt101_32x16d_wsl_model.tar.gz
tar -zxf ResNeXt101_32x16d_wsl_model.tar.gz
python -m paddle_serving_server_gpu.serve \
    --model ResNeXt101_32x16d_wsl_model \
    --mem_optim \
    --port 9898 \
    --gpu_ids 1
  • The student model: ResNet50_vd (that is, ResNet-D in the paper). Train the student on GPU 0.
python -m paddle.distributed.launch --selected_gpus 0 \
    ./train_with_fleet.py \
    --model=ResNet50_vd \
    --data_dir=./ImageNet \
    --use_distill_service=True \
    --distill_teachers=127.0.0.1:9898
| mode                                 | teacher resource | student resource | total batch size | acc1 | acc5 | speed (img/s) |
| ------------------------------------ | ---------------- | ---------------- | ---------------- | ---- | ---- | ------------- |
| pure train                           | None             | 8 * v100         | 256              | 77.1 | 93.5 | 1828          |
| teacher and student on the same GPUs | 8 * v100         | 8 * v100         | 256              | 79.0 | 94.3 | 656           |
| EDL service distill                  | 40 * P4          | 8 * v100         | 256              | 79.0 | 94.5 | 1514          |

About Knowledge Distillation in EDL

  • Theory: Distilling the Knowledge in a Neural Network
    • Knowledge distillation generally consists of two parts: strong teachers and weak students.
    • The student model learns from the feed-forward results of a teacher model or a mixture of teachers to achieve better accuracy (see the sketch after this list).
  • Application scenarios of EDL knowledge distillation
    • Teacher models and student models run on the same GPU devices, so training throughput is not maximized.
    • The offline GPU cluster has limited resources, but some online GPU resources can be used during training.
    • Heterogeneous teacher models can improve the student model's performance but are hard to deploy on a single GPU card due to memory limitations.
    • The computational load of teacher models and student models is hard to balance in a way that maximizes training throughput.
  • Solution:
    • Deploy teacher models as online inference services through Paddle Serving.
    • Online inference services are elastic and are registered to EDL's service-management modules.
    • The number of online teacher-model instances is adapted dynamically to maximize the students' training throughput and resource utilization.
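To make the theory above concrete, here is a minimal NumPy sketch of a soft-label distillation loss (KL divergence between temperature-scaled teacher and student softmax outputs). It illustrates the general technique only and is not EDL code; the temperature value and the toy logits are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between teacher soft labels and student predictions.

    The teacher's feed-forward outputs (served remotely in EDL) act as
    soft targets for the student.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    return kl.mean()

# toy batch of 2 samples with 5 classes
teacher = np.array([[2.0, 1.0, 0.1, -1.0, 0.0],
                    [0.2, 3.0, 0.5, -0.5, 1.0]])
student = np.random.randn(2, 5)
print("distillation loss:", distill_loss(student, teacher))
```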

Release 0.2.0

Checkpoint-based elastic training on multiple GPUs

  • Several training nodes run, one on each GPU.
  • A master node is responsible for checkpoint saving, and all the other nodes are elastic nodes.
  • When elastic nodes join or leave the current training job, training hyper-parameters are adjusted automatically.
  • Newly joined training nodes load the checkpoint from the remote FS automatically.
  • A model checkpoint is saved every several steps, with the interval given by the user (a simplified sketch follows this list).
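The checkpoint flow described above can be sketched roughly as follows. This is a simplified, assumed illustration; the file path, the step interval, and the helper functions are hypothetical and do not reflect EDL's actual implementation, which saves to a remote FS.

```python
import os
import pickle

CKPT_PATH = "./checkpoint/ckpt.pkl"   # hypothetical local path; EDL saves to a remote FS
SAVE_EVERY = 100                      # user-given step interval, assumed here

def save_checkpoint(step, state):
    """The master node persists the training state periodically."""
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint():
    """A newly joined node loads the latest checkpoint before training."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": {}}

ckpt = load_checkpoint()
start_step = ckpt["step"]

for step in range(start_step, start_step + 1000):
    state = {"weights": "..."}        # placeholder for model/optimizer state
    if step % SAVE_EVERY == 0:        # save every SAVE_EVERY steps
        save_checkpoint(step, state)
```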

ResNet50 experiments on a single machine in Docker

  • Start a JobServer on one node; it generates the scripts that change the set of training nodes over time.
cd example/demo/collective
node_ips="127.0.0.1"
python -u paddle_edl.demo.collective.job_server_demo \
    --node_ips ${node_ips} \
    --pod_num_of_node 8 \
    --time_interval_to_change 900 \
    --gpu_num_of_node 8
  • Start a JobClient, which controls the worker process.
# set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>
mkdir -p resnet50_pod
unset http_proxy https_proxy
# running under edl
export PADDLE_RUNING_ENV=PADDLE_EDL
export PADDLE_JOB_ID="test_job_id_1234"
export PADDLE_POD_ID="not set"
python -u paddle_edl.demo.collective.job_client_demo \
    --log_level 20 \
    --package_sh ./resnet50/package.sh \
    --pod_path ./resnet50_pod \
    ./train_pretrain.sh
  • Experiment results on a 2-node cluster
| model    | dataset  | gpu cards | total batch size | acc1 | acc5 |
| -------- | -------- | --------- | ---------------- | ---- | ---- |
| ResNet50 | ImageNet | 16 * v100 | 1024             | 75.5 | 92.8 |

The whole example is here

Community

FAQ

License

Contribution

