
Original title: Text Summarization (News Headline) Generation with EasyNLP
Abstract: This article provides a technical explanation of PEGASUS and shows how to use the PEGASUS-related text summarization (news headline) generation models in the EasyNLP framework.
Authors: Jun Li, Qianwen Yu
Overview
Text generation is an important research direction in natural language processing, with rich practical application scenarios and research value. Abstractive text summarization, an important subtask of text generation, appears in practice in forms such as news headline generation, summary generation, and keyword generation. Pretrained language models such as BERT, MASS, and UniLM achieve impressive performance in NLU scenarios, but their word- and subword-level masked language modeling objectives are not well suited to text generation, and especially not to abstractive summarization. The reason is that abstractive summarization requires coarser-grained semantic understanding, at the sentence and paragraph level, in order to generate a summary. To address this problem, the PEGASUS model (PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization) introduces an unsupervised pretraining task designed for text summarization, Gap Sentence Generation (GSG): several complete sentences in the text are randomly masked, and the model must generate the masked sentences. This pretraining task matches the downstream summarization task well, so the pretrained model reaches good summarization quality after only light fine-tuning. We have therefore integrated the PEGASUS algorithm and models into the EasyNLP framework, so that users can conveniently train and evaluate models for text summarization tasks.
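The GSG objective just described can be sketched in a few lines. This is a simplified illustration using the paper's Random selection strategy; the helper names are ours, not the actual PEGASUS preprocessing code:

```python
# Sketch of PEGASUS's Gap Sentence Generation (GSG) objective: whole sentences
# are masked out of the input, and the decoder must generate them.
import random

MASK1 = "[MASK1]"  # sentence-level mask token

def make_gsg_example(sentences, gap_ratio=0.3, seed=0):
    """Mask roughly gap_ratio of the sentences; return (masked_input, target)."""
    rng = random.Random(seed)
    m = max(1, int(len(sentences) * gap_ratio))
    picked = set(rng.sample(range(len(sentences)), m))
    masked_input = " ".join(
        MASK1 if i in picked else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in picked)
    return masked_input, target

doc = ["PEGASUS is pre-trained on large text corpora.",
       "It masks whole sentences instead of subwords.",
       "The decoder then generates the missing sentences."]
masked_input, target = make_gsg_example(doc)
```

Training the encoder-decoder to reconstruct `target` from `masked_input` is what aligns pretraining with downstream summarization.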
EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and comprehensive Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch. It supports commonly used Chinese pretrained models and large-model deployment techniques, and provides a one-stop NLP development experience from training to deployment. EasyNLP offers a concise interface for building NLP models, including the AppZoo of NLP applications and the pretrained ModelZoo, along with tooling to help users efficiently bring very large pretrained models into production. Text generation, a major subtask of natural language processing, has numerous practical applications, including headline generation, text summarization, question answering, and dialogue systems. EasyNLP is therefore steadily expanding its support for text generation subtasks, hoping to serve more NLP and NLG algorithm developers and researchers, and to advance and productionize NLG technology together with the community.
This article provides a technical explanation of PEGASUS and shows how to use the PEGASUS-related text summarization (news headline) generation models in the EasyNLP framework.
A Brief Introduction to the PEGASUS Model
Earlier text generation pretrained models such as T5 and BART achieved significant performance gains on many text generation tasks, but for text summarization their pretraining objectives still differ considerably from the summarization objective. As a result, when such pretrained models are transferred to summarization tasks in different domains, they still require a large amount of training data for fine-tuning before reaching good quality. To alleviate this problem, PEGASUS adds, on top of the original subword masking loss, a complete-sentence masking loss: several randomly chosen complete sentences in the input text are masked, and the model must reconstruct them.
Specifically, as shown in the figure above, PEGASUS adopts an encoder-decoder architecture (the standard Transformer architecture). The model applies two kinds of masking to the input. The first is BERT-style subword masking, denoted [MASK2], where the encoder must recover the masked subwords (ablation experiments showed this loss brings no performance gain on downstream tasks, so it was dropped from the final PEGASUS model). The second is GSG, denoted [MASK1]: the decoder must generate the randomly selected complete sentences masked out of the input. For this loss, the authors propose three candidate selection schemes: Random (randomly select m sentences), Lead (select the first m sentences), and Ind-Orig (select m sentences by importance score). The importance score of a sentence is computed as the ROUGE score between that sentence and the set of all other sentences in the document; this strategy can be viewed as masking the sentences that best represent the rest of the document. The figure below shows an example of the three selection schemes, with the chosen sentences marked in green, red-brown, and blue respectively. Experiments show that the third selection strategy yields the best performance.
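The Ind-Orig scoring described above can be sketched with a simplified stand-in for ROUGE. Here we use a unigram-F1 proxy (the paper uses proper ROUGE-1 F1), and the helper names are our own:

```python
# Score each sentence by its overlap with the rest of the document, then pick
# the top-m sentences to mask. unigram_f1 is a simplified stand-in for ROUGE-1.
from collections import Counter

def unigram_f1(candidate, reference):
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, m=1):
    """Return the indices of the m sentences most similar to the rest."""
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((unigram_f1(s, rest), i))
    return sorted(i for _, i in sorted(scores, reverse=True)[:m])

sents = ["the cat sat on the mat",
         "dogs bark loudly",
         "the cat sat near the mat"]
picked = select_gap_sentences(sents, m=2)
```

The two mutually similar sentences score highest and would be masked, while the outlier ("dogs bark loudly") is kept as context.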
Tutorial: Using the Text Summarization Models
Below we briefly describe how to use PEGASUS and other text summarization models in the EasyNLP framework.
Installing EasyNLP
Users can install the EasyNLP framework by following the instructions on GitHub (https://github.com/alibaba/EasyNLP).
Data Preparation
For a concrete text summarization scenario, users provide training and validation data for the downstream task as .tsv files. For text summarization, each file contains two columns separated by a tab (\t): the first column is the summary and the second is the source article. A sample row (in Chinese):
湖北:四上企业复工率已达93.8% 央视网消息:4月1日,记者从湖北省新冠肺炎疫情防控工作新闻会上获悉,在各方面共同努力下,湖北省复工复产工作获得了阶段性成效。截至3月31日,湖北省四上企业包括规模以上工业、规模以上服务业法人单位等的复工率已达93.8%,复岗率69.3%。武汉市的复工率、复岗率也分别达到了85.4%、40.4%。责任编辑:王诗尧
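Pairs like the sample above can be serialized to the expected two-column TSV with the standard csv module. This is a minimal sketch with illustrative English placeholder data; real data would be your own summary/article pairs:

```python
# Build the two-column TSV EasyNLP expects: summary first, article second,
# separated by a tab. The data here is illustrative only.
import csv
import io

pairs = [
    ("Resumption rate of large enterprises in Hubei reaches 93.8%",
     "On April 1, officials reported that work resumption had made staged progress."),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(pairs)
tsv_text = buf.getvalue()  # write this string to e.g. cn_train.tsv / cn_dev.tsv
```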
The following archive contains preprocessed training and validation data for news headline generation and can be used for testing:
https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/title_gen.zip
Chinese News Headline Generation
Since the models released with the PEGASUS paper support only English, to serve Chinese-speaking users we pretrained a model for Chinese news headline summarization based on the mT5 architecture and integrated it into EasyNLP's model zoo. We also integrated Randeng, a Chinese text summarization model pretrained by IDEA (it can be viewed as a Chinese counterpart of PEGASUS), so that users can explore the performance of different models. The table below lists the models available in EasyNLP and compares their performance on real-world datasets. We recommend the first two models for text summarization and the last three for news headline generation.
| Model | News headline generation (Rouge-1/2/L) | Paper title summarization (Rouge-1/2/L) |
|---|---|---|
| hfl/randeng-238M-Summary-Chinese | 59.66/46.26/55.95 | 54.55/39.37/50.69 |
| hfl/randeng-523M-Summary-Chinese | 62.86/49.67/58.89 | 53.83/39.17/49.92 |
| alibaba-pai/mt5-title-generation-zh-275m | 62.35/48.63/58.96 | 54.28/40.26/50.55 |
| alibaba-pai/randeng-238M-Summary-Chinese-tuned | 64.31/51.80/60.97 | 58.83/45.28/55.72 |
| alibaba-pai/randeng-523M-Summary-Chinese-tuned | 64.76/51.65/61.06 | 59.27/45.58/55.92 |
For the news headline generation task, we train the model with the command below. The hyperparameter 'save_checkpoint_steps' determines how often the model is saved; at each checkpoint the framework evaluates the model and decides, based on its performance on the validation set, whether to update the saved model parameters. The main.py script lives under EasyNLP/examples/appzoo_tutorials/sequence_generation, and the training and validation data should be placed in that directory. Any model from the table above can be selected via 'pretrain_model_name_or_path' inside 'user_defined_parameters'.
python main.py \
--mode train \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./cn_train.tsv,./cn_dev.tsv \
--input_schema=title_tokens:str:1,content_tokens:str:1 \
--first_sequence=content_tokens \
--second_sequence=title_tokens \
--label_name=title_tokens \
--checkpoint_dir=./finetuned_zh_model \
--micro_batch_size=8 \
--sequence_length=512 \
--epoch_num=1 \
--save_checkpoint_steps=150 \
--export_tf_checkpoint_type none \
--user_defined_parameters pretrain_model_name_or_path=alibaba-pai/mt5-title-generation-zh language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5
In addition, the following command runs summary generation with the trained model, whose path is given by 'checkpoint_dir'. 'append_cols' specifies input columns to append to the output file; set it to none to append nothing.
python main.py \
--mode=predict \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./cn_dev.tsv \
--outputs=./cn.preds.txt \
--input_schema=title:str:1,content:str:1,title_tokens:str:1,content_tokens:str:1,tag:str:1 \
--output_schema=predictions,beams \
--append_cols=content,title,tag \
--first_sequence=content_tokens \
--checkpoint_dir=./finetuned_zh_model \
--micro_batch_size=32 \
--sequence_length=512 \
--user_defined_parameters language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5
Below are several sample predictions from the model on recent news. Each sample contains five columns separated by tabs (\t): the predicted summary (headline), the five beam-search candidates (separated by ||), the input article, the input title, and the input news tag; the last three columns are copied directly from the input data. Because the news articles are long, only the first four columns of each sample are shown below.
**费德勒告别信:未来我还会打更多的网球** 费德勒告别信:未来我还会打更多的网球||费德勒告别信:未来我还会打更多网球||费德勒告别信:未来我还会打更多网球但不是在大满贯或巡回赛||费德勒告别信:未来我还会打更多的网球||详讯:费德勒宣布退役,并告别信 **一代传奇落幕!网球天王费德勒宣布退役** 央视网消息:北京时间9月15日晚,网球天王罗杰-费德勒在个人社媒上宣布退役。41岁的费德勒是男子网坛历史最伟大球员之一,曾103次斩获单打冠军,大满贯单打夺冠20次(澳网6冠、法网1冠、温网8冠、美网5冠),共计310周位于男单世界第一。附费德勒告别信:在这些年网球给我的所有礼物中,最棒的毫无疑问是我一路上所遇到的人:我的朋友、我的竞争对手、和最关键的球迷,是他们给予了这项运动生命。今天,我想和大家分享一些消息。正如你们中的许多人所知道的,过去三年中,我遇到了受伤和手术的挑战。......
**台风梅花将在大连沿海登陆将逐步变性为温带气旋** 台风梅花将在大连沿海登陆将逐步变性为温带气旋||台风梅花将在大连沿海登陆后逐渐变性为温带气旋||台风梅花将在大连沿海登陆将逐渐变性为温带气旋||台风梅花将在大连沿海登陆后变性为温带气旋||台风梅花将在大连沿海登陆后逐渐变性 **台风梅花将于16日傍晚前后在辽宁大连沿海登陆** 记者9月16日从辽宁省大连市气象部门获悉,今年第12号台风梅花将于16日傍晚前后在大连市旅顺口区至庄河市一带沿海登陆,之后逐渐变性为温带气旋。 受台风梅花影响,14日8时至16日10时,大连全市平均降雨量为132毫米,最大降雨量出现在金普新区大李家街道正明寺村,为283.6毫米;一小时最大降雨量出现在长海县广鹿岛镇,为49.4毫米......
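The prediction file layout described above (tab-separated columns, with the beam column holding num_return_sequences candidates joined by ||) can be parsed with a short sketch. The helper below is our own, not part of EasyNLP:

```python
# Split one output line into the prediction, its beam candidates, and the
# columns copied through via --append_cols.
def parse_prediction_line(line):
    cols = line.rstrip("\n").split("\t")
    prediction = cols[0]
    beams = cols[1].split("||")
    appended = cols[2:]
    return prediction, beams, appended

line = "headline A\theadline A||headline B||headline C\tarticle text\ttitle\ttag"
prediction, beams, appended = parse_prediction_line(line)
```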
English Text Summarization
EasyNLP's model zoo also integrates English text summarization models, including PEGASUS and BRIO. The table below shows the performance of the two models on English summarization data. The same commands as above can be used for training and prediction. Note that EasyNLP defaults to Chinese processing, so when handling English text you must set language to en in 'user_defined_parameters'; if omitted, it defaults to Chinese (zh).
| Model | Text summarization (Rouge-1/2/L) |
|---|---|
| alibaba-pai/pegasus-summary-generation-en | 37.79/18.69/35.44 |
| hfl/brio-cnndm-uncased | 41.46/23.34/38.91 |
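In the commands below, 'user_defined_parameters' is passed as a space-separated list of key=value pairs; conceptually it resolves to a flat mapping, as in this sketch (our own illustrative parser, not EasyNLP's actual implementation):

```python
# Turn a space-separated "k=v k=v ..." string into a dict of string values.
def parse_user_defined_parameters(s):
    return dict(pair.split("=", 1) for pair in s.split())

params = parse_user_defined_parameters(
    "language=en copy=false max_encoder_length=512 num_beams=5")
```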
The training procedure is as follows:
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_train.tsv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_dev.tsv
python main.py \
--mode train \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./en_train.tsv,./en_dev.tsv \
--input_schema=title:str:1,content:str:1 \
--first_sequence=content \
--second_sequence=title \
--label_name=title \
--checkpoint_dir=./finetuned_en_model \
--micro_batch_size=1 \
--sequence_length=512 \
--epoch_num 1 \
--save_checkpoint_steps=500 \
--export_tf_checkpoint_type none \
--user_defined_parameters language=en pretrain_model_name_or_path=alibaba-pai/pegasus-summary-generation-en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5
The prediction procedure is as follows:
python main.py \
--mode=predict \
--app_name=sequence_generation \
--worker_gpu=1 \
--tables=./en_dev.tsv \
--outputs=./en.preds.txt \
--input_schema=title:str:1,content:str:1 \
--output_schema=predictions,beams \
--append_cols=title,content \
--first_sequence=content \
--checkpoint_dir=./finetuned_en_model \
--micro_batch_size 32 \
--sequence_length 512 \
--user_defined_parameters language=en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5
The following shows the models' summary predictions for a trending technology news article:
With the image generator Stable Diffusion, you can conjure within seconds a portrait of Beyoncé as if painted by Vincent van Gogh, a cyberpunk cityscape in the style of 18th century Japanese artist Hokusai and a complex alien world straight out of science fiction. Released to the public just two weeks ago, it’s become one of several popular AI-powered text-to-image generators, including DALL-E 2, that have taken the internet by storm. Now, the company behind Stable Diffusion is in discussions to raise $100 million from investors, according to three people with knowledge of the matter. Investment firm Coatue expressed initial interest in a deal that would value the London-based startup Stability AI at $500 million, according to two of the people. Lightspeed Venture Partners then entered talks — which are still underway — to invest at a valuation up to $1 billion, two sources said. Stability AI, Coatue and Lightspeed declined requests for comment. The London-based startup previously raised at least $10 million in SAFE notes (a form of convertible security popular among early-stage startups) at a valuation of up to $100 million, according to one of the sources. An additional fourth source with direct knowledge confirmed Stability AI’s previous round. Much of the company’s funds came directly from founder and CEO Emad Mostaque, a former hedge fund manager. News of the prior financing was previously unreported. By nature of being open source, Stability AI’s underlying technology is free to use. So far, the company does not have a clear business model in place, according to three of the sources. However, Mostaque said in an interview last month with Yannic Kilcher, a machine learning engineer and YouTube personality, that he has already penned partnerships with governments and leading institutions to sell the technology. “We’ve negotiated massive deals so we’d be profitable at the door versus most money-losing big corporations,” he claims.
The first version of Stable Diffusion itself cost just $600,000 to train, he wrote on Twitter — a fraction of the company’s total funding. Mostaque, 39, hails from Bangladesh and grew up in England. He received a master’s degree in mathematics and computer science from Oxford University in 2005 and spent 13 years working at U.K. hedge funds. In 2019, he launched Symmitree, a startup that aimed to reduce the cost of technology for people in poverty; it shuttered after one year, according to his LinkedIn profile. He then founded Stability AI in late 2020 with the mission of building open-source AI projects. According to its website, text-to-image generation is only one component of a broader apparatus of AI-powered offerings that the company is helping to build. Other open-source research groups it backs are developing tools for language, audio and biology. Stable Diffusion — created in collaboration with RunwayML, a video editing startup also backed by Coatue, and researchers at the Ludwig Maximilian University of Munich — has generated by far the most buzz among the company’s projects. It comes as AI image generators entered the zeitgeist this year, with the release of OpenAI’s DALL-E 2 in April and independent research lab Midjourney’s eponymous product in July. Google also revealed a text-to-image system, Imagen, in May, though it is not available to the public. Mostaque and his peers have said that the existing technology only represents the tip of the iceberg of what AI art is capable of creating: Future use cases could include drastically improved photorealism, video and animation. These image generators are already facing controversy: Many of them have been trained by processing billions of images on the internet without the consent of the copyright holder, prompting debate over ethics and legality. Last week, a testy debate broke out online after a Colorado fine arts competition awarded a top prize to an AI-generated work of art. 
Moreover, unlike DALL-E and Midjourney, which have restrictions in place to prevent the generation of gory or pornographic images, Stable Diffusion’s open source nature allows users to bypass such a block. On 4chan, numerous threads have appeared with AI-generated deepfakes of celebrity nudes, while Reddit has banned at least four communities that were dedicated to posting not safe for work AI imagery made using Stable Diffusion. It’s a double-edged sword for Stability AI, which has accumulated community goodwill precisely due to its open source approach that gives its users full access to its code. The company’s website states that the company is building open AI tools, a mission that mirrors the initial intent of OpenAI to democratize access to artificial intelligence. OpenAI was launched as a nonprofit research organization by prominent technologists including Sam Altman and Elon Musk, but upon accepting a $1 billion investment from Microsoft in 2019, it became a for-profit business. The move led it to focus on commercializing its technology rather than making it more widely available, drawing criticism from the AI community — and Musk himself. Stability AI has been a for-profit corporation from its inception, which Mostaque has said is meant to allow the open source research to reach more people. In an interview with TechCrunch last month, he said that the company was fully independent. “Nobody has any voting rights except our 75 employees — no billionaires, big funds, governments or anyone else with control of the company or the communities we support,” he said. At a $1 billion valuation, Mostaque would be ceding up to 10% of the company to the new financiers. Venture capital investors who take significant stakes in startups typically ask for board positions so they can influence the decisions the company is making using their money.
Lightspeed, which manages $10 billion of assets, and Coatue, which is in charge of $73 billion, both have a track record of taking board seats, though it’s unclear if that will be the case with Stability AI. Follow me on Twitter. Send me a secure tip.
The text above is taken from https://www.forbes.com/sites/kenrickcai/2022/09/07/stability-ai-funding-round-1-billion-valuation-stable-diffusion-text-to-image/?sh=33ecbe8724d6
For the news article above, here are the summaries generated by the two models:
stable Diffusion is in discussions to raise $100 million from investors, three people say. The image generator is one of several popular AI-powered text-to-image generators.
company behind the popular image generator Stable Diffusion is in talks to raise $100 million from investors, according to sources
This concludes the walkthrough of training and prediction with EasyNLP's text summarization models. For a more detailed tutorial, you can join the course: Headline Writing Crash Course: Chinese News Headline Generation with Machine Learning PAI EasyNLP.
Outlook
Going forward, we plan to integrate knowledge-informed Chinese pretrained models into the EasyNLP framework, covering common Chinese NLU and NLG domains; stay tuned. We will also integrate more SOTA models (especially Chinese models) into EasyNLP to support a wide range of NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team continues its in-house work on Chinese text generation and Chinese multimodal models. We welcome you to follow us, and to join our open-source community to build Chinese NLP and multimodal algorithm libraries together!
GitHub: https://github.com/alibaba/EasyNLP
References
- Wang, Chengyu, et al. "EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing." arXiv preprint.
- Zhang, Jingqing, et al. "Pegasus: Pre-training with extracted gap-sentences for abstractive summarization." International Conference on Machine Learning. PMLR, 2020.
- Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
- Lewis, Mike, et al. "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
- Song, Kaitao, et al. "Mass: Masked sequence to sequence pre-training for language generation." arXiv preprint arXiv:1905.02450 (2019).
- Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in Neural Information Processing Systems 32 (2019).
- Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing Order to Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903, Dublin, Ireland. Association for Computational Linguistics.
Alibaba Lingjie Recap
- Alibaba Lingjie: Alibaba Cloud Machine Learning PAI open-sources EasyNLP, a Chinese NLP algorithm framework, to help bring large NLP models to production
- Alibaba Lingjie: First place in the pretrained knowledge probing competition! Alibaba Cloud PAI releases a knowledge pretraining toolkit
- Alibaba Lingjie: Enjoy CLIP-based text-image retrieval with EasyNLP
- Alibaba Lingjie: EasyNLP's Chinese text-to-image generation model turns you into an artist in seconds
- Alibaba Lingjie: EasyNLP integrates the K-BERT algorithm, using knowledge graphs for better fine-tuning
Original article: https://click.aliyun.com/m/1000360659/
This article is original Alibaba Cloud content and may not be reproduced without permission.