Project: google-pegasus
Repository: https://gitee.com/mirrors/google-pegasus

# PEGASUS library

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization

PEGASUS is a sequence-to-sequence model that uses the self-supervised Gap Sentences Generation (GSG) objective to pretrain a transformer encoder-decoder. The paper can be found on arXiv and was accepted at ICML 2020.

If you use this code or these models, please cite the following paper:

```
@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

# Results update

We train a PEGASUS model with sampled gap sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated results are reported in this table.
The "Mixed & Stochastic" model has the following changes:
(*) the numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data:

- the wikihow dataset contains newline characters, which are useful for paragraph segmentation; the C4 and HugeNews models' sentencepiece tokenizer does not encode newlines and loses this information.
- we updated the big_patent dataset to preserve casing; some format cleanings were also changed in TFDS.
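As a rough illustration of the GSG objective described above (not the library's parser, which lives in pegasus/ops/pretrain_parsing_ops.cc), a minimal sketch might look like the following; the word-overlap score and the `[MASK1]` marker are simplifications of the ROUGE-based selection and reserved mask token used in the paper:

```python
import re

def gsg_example(document: str, gap_ratio: float = 0.3):
    """Build an (inputs, targets) pretraining pair by masking 'important' sentences."""
    # Naive sentence split on end punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    # Crude importance score: word overlap between a sentence and the rest of
    # the document (a stand-in for the ROUGE-based selection in the paper).
    def score(i):
        words = set(sentences[i].lower().split())
        rest = set(" ".join(sentences[:i] + sentences[i + 1:]).lower().split())
        return len(words & rest) / max(len(words), 1)

    n_gap = max(1, int(len(sentences) * gap_ratio))
    gap = set(sorted(range(len(sentences)), key=score, reverse=True)[:n_gap])

    # Inputs: the document with gap sentences replaced by a mask marker.
    # Targets: the gap sentences themselves, in document order.
    inputs = " ".join("[MASK1]" if i in gap else s for i, s in enumerate(sentences))
    targets = " ".join(sentences[i] for i in sorted(gap))
    return inputs, targets

doc = ("PEGASUS is pretrained on large web corpora. The most informative "
       "sentences are removed from the input. The model learns to generate "
       "them from the remaining sentences.")
print(gsg_example(doc))
```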
# Setup

## create an instance on google cloud with GPU (optional)

Please create a project first, then create an instance:

```
gcloud compute instances create \
  ${VM_NAME} \
  --zone=${ZONE} \
  --machine-type=n1-highmem-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --boot-disk-size=500GB \
  --image-project=ml-images \
  --image-family=tf-1-15 \
  --maintenance-policy TERMINATE --restart-on-failure
```

## install library and dependencies

Clone the library from GitHub and install the requirements:

```
git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt
```

Download the vocab and the pretrained and fine-tuned checkpoints of all experiments from Google Cloud. Alternatively, in a terminal, follow the instructions to install gsutil, then:

```
mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/
```

# Finetuning on downstream datasets

## on existing dataset

Finetune on an existing dataset, such as aeslc:

```
python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc
```

If you would like to finetune on a subset of a dataset, please refer to the example of input pattern.

Evaluate on the finetuned dataset:

```
python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc
```

Note that the above example uses a single GPU, so the batch_size is much smaller than in the results reported in the paper.

## add new finetuning dataset

Two types of dataset format are supported: TensorFlow Datasets (TFDS) or TFRecords.

This tutorial shows how to add a new dataset in TFDS. (The fine-tuning dataset is expected to be supervised; please provide supervised_keys in the dataset info.)

The TFRecords format requires each record to be a tf example with the features {"inputs": tf.string, "targets": tf.string}.

For example, if you registered a TFDS dataset called new_tfds_dataset with train and validation splits, and have test data in TFRecord files matching new_dataset_files.tfrecord*, the new params can be registered as:

```
@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)
```
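For the TFRecord route, a minimal sketch of producing records in the required {"inputs", "targets"} format might look like the following; the example pairs and the output file name (matching the hypothetical new_dataset_files.tfrecord* test pattern above) are placeholders:

```python
import tensorflow as tf

# Placeholder (document, summary) pairs standing in for a real dataset.
examples = [
    ("some long input document text ...", "its reference summary"),
    ("another long input document ...", "another reference summary"),
]

def make_example(inputs: str, targets: str) -> tf.train.Example:
    # Each record is a tf example with "inputs" and "targets" string features,
    # as expected by the tfrecord input pattern.
    def bytes_feature(s: str) -> tf.train.Feature:
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[s.encode("utf-8")]))
    return tf.train.Example(features=tf.train.Features(feature={
        "inputs": bytes_feature(inputs),
        "targets": bytes_feature(targets),
    }))

with tf.io.TFRecordWriter("new_dataset_files.tfrecord-00000") as writer:
    for doc, summary in examples:
        writer.write(make_example(doc, summary).SerializeToString())
```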
## Evaluation metrics

Evaluation results can be found in `model_dir`. Summarization metrics, such as ROUGE, are automatically calculated for each evaluation point.

Several types of output files can also be found in `model_dir`, including the calculated metrics in text form and raw text files of the model inputs, targets, and predictions for each evaluated checkpoint.
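The library computes these metrics itself; purely as an illustration, summary quality can also be recomputed offline from the targets and predictions text files using the standalone rouge-score package. The file names below, keyed by a checkpoint step, are hypothetical:

```python
from rouge_score import rouge_scorer

# Hypothetical evaluation outputs for one checkpoint: one summary per line.
with open("targets-10000.txt") as f:
    targets = f.read().splitlines()
with open("predictions-10000.txt") as f:
    predictions = f.read().splitlines()

# Standard ROUGE variants with stemming enabled.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = [scorer.score(t, p) for t, p in zip(targets, predictions)]

# Average F-measure per ROUGE variant across the evaluation set.
for key in ["rouge1", "rouge2", "rougeL"]:
    mean_f = sum(s[key].fmeasure for s in scores) / len(scores)
    print(key, round(mean_f, 4))
```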
# Pre-training

Pretraining (on C4 or any other corpus) requires a custom-built TensorFlow that includes ops for on-the-fly parsing, which processes raw text documents into model input and target ids. Please refer to pegasus/ops/pretrain_parsing_ops.cc and pegasus/data/parsers.py for details.
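The parsing ops themselves are C++, but the final text-to-ids step they perform can be illustrated with plain sentencepiece, assuming the vocab downloaded in the Setup section; the masked input and target strings here are hypothetical:

```python
import sentencepiece as spm

# Vocab model downloaded in the Setup step above.
sp = spm.SentencePieceProcessor()
sp.Load("ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model")

# After GSG masking, both sides of a pretraining pair are plain text;
# the parsing ops turn them into id sequences roughly like this.
inputs_text = "First unmasked sentence. [MASK1] Third sentence."
targets_text = "The masked gap sentence."

input_ids = sp.EncodeAsIds(inputs_text)
target_ids = sp.EncodeAsIds(targets_text)
print(input_ids, target_ids)
```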
# Acknowledgements

Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich [email protected].