Progress and Challenges in Open-Domain Long-Form Question Answering

Tensorflowers | Source: Google Research | 2021-05-31 10:02

Posted by Aurko Roy, Researcher, Google Research

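Open-domain long-form question answering (LFQA) is a fundamental challenge in natural language processing (NLP). It involves retrieving documents relevant to a given question and using them to generate an elaborate, paragraph-length answer. There has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is enough to answer a question, but much less work has been done on long-form question answering. LFQA is nevertheless an important task, especially because it provides a testbed for measuring the factuality of generative text models. But are current benchmarks and evaluation metrics really suitable for making progress on LFQA?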

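In "Hurdles to Progress in Long-form Question Answering" (to appear at NAACL 2021), we present a new system for open-domain long-form question answering that leverages two recent advances in NLP:

1. State-of-the-art sparse attention models, such as Routing Transformer (RT), which allow attention-based models to scale to long sequences;

2. Retrieval-based models, such as REALM, which facilitate retrieval of Wikipedia articles related to a given query.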

Routing Transformer

https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00353

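To encourage more factual grounding, our system combines information from several retrieved Wikipedia articles related to the given question before generating an answer. The system achieves a new state of the art on ELI5, the only large-scale publicly available dataset for long-form question answering.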

ELI5

https://ai.facebook.com/blog/longform-qa/

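However, while our system tops the public leaderboard, we discovered several troubling trends in the ELI5 dataset and its associated evaluation metrics. In particular, we find 1) little evidence that models actually use the retrievals on which they are conditioned; 2) that trivial baselines (e.g., input copying) beat modern systems such as RAG and BART + DPR; and 3) that there is significant train/validation overlap in the dataset. Our paper suggests mitigation strategies for each of these issues.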

Input copying

https://eval.ai/web/challenges/challenge-page/689/leaderboard/1908#leaderboardrank-6

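Text Generation

The main workhorse of NLP models is the Transformer architecture, in which each token in a sequence attends to every other token in the sequence, so the cost of the model grows quadratically with sequence length. The RT model introduces a dynamic, content-based sparse attention mechanism that reduces the complexity of attention in the Transformer from n^2 to n^1.5, where n is the sequence length, which enables it to scale to long sequences. This allows each word to attend to other relevant words anywhere in the entire text, unlike methods such as Transformer XL, where a word can only attend to words in its immediate vicinity.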
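To make the n^2 versus n^1.5 scaling concrete, here is a small arithmetic sketch. It only counts query-key pairs and ignores constants and the model dimension; the function names and the sequence lengths are illustrative choices, not values from the paper.

```python
# Rough count of attention score computations per layer (constants ignored).

def full_attention_cost(n: int) -> int:
    """Dense attention: every token scores every other token, n * n pairs."""
    return n * n


def routing_attention_cost(n: int) -> float:
    """The n^1.5 scaling quoted in the text for content-based sparse attention."""
    return n ** 1.5


# Illustrative sequence lengths (assumed, not taken from the article).
for n in (512, 4096, 8192):
    full = full_attention_cost(n)
    sparse = routing_attention_cost(n)
    print(f"n={n:>5}  full={full:.3e}  sparse={sparse:.3e}  ratio={full / sparse:.1f}x")
```

The ratio between the two costs is sqrt(n), so the saving grows as the sequences get longer.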

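The key insight behind RT is that having each token attend to every other token is often redundant, and can be approximated by a combination of local and global attention. Local attention lets each token build up a local representation over several layers of the model, with each token attending to a local neighborhood, which promotes local consistency and fluency. Complementing local attention, the RT model also uses mini-batch k-means clustering so that each token attends only to a set of the most relevant tokens.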
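The combination of local and clustered attention can be sketched as follows. This is a simplified illustration, not the Routing Transformer implementation: it uses plain batch k-means with Euclidean distance on made-up token vectors, whereas the model clusters learned representations online during training. The helper names, the window size, and the random vectors are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def kmeans_assign(vectors, k, steps=10, seed=0):
    """Plain k-means; returns one cluster id per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(steps):
        assign = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(v, centroids) for v in vectors]


def allowed_positions(assign, window):
    """Causal attention pattern: position i may attend to the last `window`
    positions (local attention) plus earlier positions in its own cluster."""
    pattern = []
    for i in range(len(assign)):
        local = set(range(max(0, i - window + 1), i + 1))
        routed = {j for j in range(i + 1) if assign[j] == assign[i]}
        pattern.append(sorted(local | routed))
    return pattern


# Hypothetical token vectors: 16 tokens, 4 dimensions, 4 clusters.
rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
clusters = kmeans_assign(tokens, k=4)
for i, positions in enumerate(allowed_positions(clusters, window=2)):
    print(i, positions)
```

With roughly sqrt(n) clusters of roughly sqrt(n) tokens each, every token scores about sqrt(n) others, which is where an n^1.5 total would come from under these assumptions.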

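We pre-train an RT model on the Project Gutenberg (PG-19) dataset with a language modeling objective: given all the previous words, the model learns to predict the next word, so that it can generate fluent, paragraph-long text.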

Project Gutenberg (PG-19)

https://deepmind.com/blog/article/A_new_model_and_dataset_for_long-range_memory
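The language modeling objective mentioned above can be written as an average negative log-likelihood of each next token. The sketch below assumes the model has already produced one vector of logits per position; the function names and the toy inputs are hypothetical, and a real training loop would compute this on batches inside a deep learning framework.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_nll(step_logits, token_ids):
    """Mean negative log-likelihood of token_ids[t + 1] under step_logits[t],
    where step_logits[t] is what the model outputs after reading token_ids[: t + 1]."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        total -= log_softmax(step_logits[t])[token_ids[t + 1]]
    return total / (len(token_ids) - 1)


# Toy example: vocabulary of 4 tokens and a model that is completely unsure.
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(next_token_nll(uniform_logits, [1, 2, 3]))  # equals log(4) for a uniform model
```

Training lowers this number by putting more probability on the token that actually comes next.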

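Information Retrieval

To demonstrate the effectiveness of the RT model on the LFQA task, we combine it with retrievals from REALM. The REALM model (Guu et al., 2020) is a retrieval-based model that uses maximum inner product search to retrieve Wikipedia articles relevant to a particular query or question. The model was fine-tuned for factoid question answering on the Natural Questions dataset. REALM uses the BERT model to learn good representations of a question and uses ScaNN to retrieve Wikipedia articles that have high topical similarity with the question representation. The whole system is then trained end-to-end to maximize the log-likelihood on the QA task.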
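Maximum inner product search can be illustrated with an exhaustive scan. This sketch is not how REALM or ScaNN are implemented: ScaNN performs approximate search over a very large index, while the code below scores every document. The article titles and embedding values are invented for the example.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query_vec, doc_vecs, k):
    """Exhaustive maximum inner product search: score every document against
    the query and return the ids of the k highest-scoring ones."""
    scored = sorted(
        ((inner_product(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 3-dimensional embeddings for three articles and one question.
doc_vecs = {
    "Contrail": [0.9, 0.1, 0.0],
    "Cloud": [0.7, 0.3, 0.1],
    "Shampoo": [0.0, 0.2, 0.9],
}
question_vec = [1.0, 0.2, 0.0]
print(top_k_by_inner_product(question_vec, doc_vecs, k=2))
```

In a trained retriever the question and document vectors come from learned encoders, so a higher inner product is meant to indicate a more relevant article.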

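We further improve the quality of REALM retrievals by using a contrastive loss. The idea is to encourage the representation of a question to move close to its ground-truth answer and away from the other answers in its mini-batch. This ensures that when the system retrieves relevant items using this question representation, it returns articles that are "similar" to ground-truth answers. We call this retriever contrastive-REALM, or c-REALM.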

Contrastive loss

https://towardsdatascience.com/contrastive-loss-explaned-159f2d4a87ec
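One common way to express a contrastive objective with in-batch negatives is a softmax over question-answer inner products, where each question's own answer is the positive and the other answers in the mini-batch are negatives. The sketch below follows that general recipe; it is an assumption that this matches the exact loss used for c-REALM, and the vectors are made up.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(question_vecs, answer_vecs):
    """Mean over the batch of -log softmax(q_i . a_i) taken across all answers
    in the batch, so answers of other questions act as negatives."""
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, a) for a in answer_vecs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Hypothetical 2-dimensional representations for a batch of two questions.
questions = [[1.0, 0.0], [0.0, 1.0]]
matched_answers = [[1.0, 0.0], [0.0, 1.0]]
swapped_answers = [[0.0, 1.0], [1.0, 0.0]]
print(in_batch_contrastive_loss(questions, matched_answers))  # lower loss
print(in_batch_contrastive_loss(questions, swapped_answers))  # higher loss
```

The loss is lower when each question is closest to its own answer, which is the behavior the paragraph above describes.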

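Evaluation

We test the model on long-form question answering using the ELI5 dataset, which is part of the KILT benchmark and is the only publicly available large-scale LFQA dataset. The KILT benchmark measures the quality of text retrieval using R-precision (R-Prec) and the quality of text generation using ROUGE-L. The two scores are combined into a KILT R-L score, which determines a model's ranking on the leaderboard. We fine-tune the pre-trained RT model together with retrievals from c-REALM on the ELI5 dataset from KILT.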
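As a reference point for the retrieval metric, R-precision is usually defined as the fraction of the R gold pages that appear among the top R retrieved pages. The sketch below implements that standard definition for a single question; it is not the KILT evaluation script, and the page titles are placeholders.

```python
def r_precision(ranked_page_ids, gold_page_ids):
    """R-precision for one question: with R gold pages, the fraction of the
    top-R retrieved pages that are gold."""
    gold = set(gold_page_ids)
    if not gold:
        return 0.0
    top_r = ranked_page_ids[: len(gold)]
    return sum(1 for page in top_r if page in gold) / len(gold)


# Placeholder page titles: two gold pages, one of them ranked in the top 2.
print(r_precision(["Contrail", "Jet engine", "Cloud"], ["Contrail", "Cloud"]))  # 0.5
```

Averaging this value over all questions gives a dataset-level retrieval score.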

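Our submission tops the KILT leaderboard for long-form question answering on ELI5 with a combined KILT R-L score of 2.36. This improves on the previous leaderboard entry, BART + DPR (KILT R-L score of 1.9), while having a similar number of parameters to the other models on the leaderboard. In terms of text generation quality, we see improvements of +4.11, +5.78 and +9.14 Rouge-L over T5, BART + DPR and RAG, respectively.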

Leaderboard

https://eval.ai/web/challenges/challenge-page/689/leaderboard/1908

Example generations from the RT model

Q1

Why are almost all boats white?

Boats are generally painted white for obvious reasons, as they are inexpensive and easy to clean. Ships are also often painted to provide a visible visual contrast to the sea. Boats also often have high visibility ports to allow for easy international trade.

In addition, many boats have an air-breathing function that makes them easier to spot and stay afloat, which is a good thing if you're a pirate or trying to get into some country's waters. Lastly, many ships are built in the last few decades, so they are not as exotic looking as they were thirty years ago.

And lastly, many ships are made out of materials that are easy to work with, and so many of them are painted white as a practical matter. There are many, many more reasons for why ships are white, but those are the ones I know of. I'm sure there are many other reasons, but those are the ones I know of.

Edit: I forgot to mention that the other answers about visibility and easy to see ports are also true. Ships are painted to appear to be larger, and more expensive. That makes them appear more impressive, and so they are more likely to be bought by people who want to show off their wealth and status.

Q2

What causes the trail behind jets at high altitude?

The same thing that causes the clouds you see at the start of a storm to form in the first place. The air pressure changes as the jet moves through the air, causing it to cool down and condense into water droplets. These droplets are in fact a lot bigger than the water droplets in the clouds, so when the plane passes through them, they can cause the air pressure to change, causing the water droplets to condense into clouds.

This is what you see when you look at a cloud formation from the ground. In addition to this, when the plane passes through the clouds, it also moves through the air, causing it to change its temperature, causing the water droplets to condense and form clouds.

These clouds are often quite large, and can block out the sun. This is what you see when you look at a contrail. Finally, the water vapor that forms at high altitudes is cooled by the air above, and as it cools, it condenses into clouds. This is what you see when you look at a cumulonimbus cloud. These clouds are even larger than the water droplets in the clouds, and they block out the sun.

Q3

How do shampoo and conditioner work?

The key ingredients in shampoo and conditioner are usually surfactants, they tend to attract oil from your skin. Shampoo will usually contain an anti-dandruff agent (dandruff is caused by fungi in your skin which cause dandruff). They also contain some moisturizing agents (moisturizers are things that stay in your skin longer and protect your scalp from cleaning),

And a lot of anti-dandruff (dandruff is a side effect of high doses of both surfactants and moisturizers) Conditioner will contain moisturizing agents, like a skin moisturizer. They tend to cause oils to be separated from your hair (e.g. wash your hair with conditioner instead of shampoo). They both have a foaming agent that makes them foam up (usually Hydrogen Peroxide). They also have a foaming agent that stops dirt from attaching to your hair. Some conditioner will also suppress the action of the shampoo (to help prevent dirt from attaching to your hair).

Q4

When glancing at a clock, why does the first second after glancing at it sometimes feel longer than the rest?

This is known as chronostasis. Your brain gets used to seeing the second hand move after the first, so it ignores the first one. If you were to look at the clock a lot, you'd see the second hand move a lot more than the first, and you'd be able to tell the difference between the second and the first.

The second hand will appear to move much slower than the first hand because your brain is used to seeing the second hand move a lot more than the first. If you were to go on a long trip, you'd get used to the second hand moving a lot more than the first, and you'd have no problem telling the difference between the second and the first hand.

It's kind of like how you don't notice your nose when you're driving, but if you look at it for a while, you start to notice it. Your brain is just getting used to seeing the nose move a lot more than the other, and it can't tell the difference between the two. It's kind of like that with your eyes.

Q5

Why do Britain and other English empire countries still bow to monarchs? What real purpose does the queen serve?

Depends what you define as "purpose". There is no specific purpose for the British monarch - rather, they are figureheads that serve a role. The role is to have a high degree of influence over a country, though the rules of the monarchy can somewhat vary from country to country, and the ceremony of "kissing the royal corpse" does not always involve kissing the monarch. Whether that's the only reason for the custom, or if it was the only reason,

I can't say, but that's the reasoning. When the U.K. was in full power, the monarch was a puppet of Parliament, and the powers of the monarch were transferred to the Prime Minister and thus to the Prime Minister's deputy, who then became the Prime Minister. Since then, Parliament has been able to vote on legislation that goes through the monarch,

although they may still act as the monarch's representative in negotiating treaties, which can have very very deep consequences. The Queen's role, as a representative of Britain, doesn't necessarily involve her formal approval of any of the laws or legislation that goes through Parliament, though.

Q6

What exactly is fire, in detail? How can light and heat come from something we can't really touch?

Fire is the chemical reaction of fuel (oxygen) with (what we call) impurities (ash, soot, oil etc). These impurities are created due to a number of things. If you were to study the chemical composition of a log (or a candle) and analyse it you would discover the chemical formula (A, C, and G) which is a simplified version of what actually happens. There is always fuel, and there is always the ash/soot. Wood is made up of mostly carbon and hydrogen and these are the fuels that create the fire. Hydrogen and oxygen are reactive, and they form water (H2O) and CO2 (CO2). If we split a log into smaller pieces, we start to see the process of combustion as it occurs. The carbon is burned, and with the explosion of CO, H2O and H2CO2, we get water and carbon dioxide (mostly). If we smash a log together, we start to see what happens when we also hit another log with another log...imagine smashing two sand castles together. We see how the sand castles and brick castles form and collapse at different rates...then it starts to burn. The smoke, ash and flames are the actual combustion of fuel.

Q7

What goes on in those tall tower buildings owned by major banks?

The actual buildings are not there for profit but for show. There are many real world historical buildings that have little to no effect other than being a symbol of an economic entity or symbol of an era or location, and nothing more. For example look at Sears, Sears didn't care what went on inside, it was all about the _appearance_ of its location, the prestige of the location, the facilities and so on. It didn't care about how long it took it to operate, it was about how much people would pay to go see it. Sears was a landmark as a cultural movement and other big companies followed suit, so if you want to see a building you've never seen before, you have to go see Sears, just like you have to see a Toyota Camry for Toyota Camry. They used to be all about building new factories, some of them if I recall, but now that they're bigger, that means that more factory jobs are coming to them. You've probably seen them in stores as stores where people buy and sell stuff, so there aren't that many places for them to come from. Instead, it's just for show, a symbol of rich people.

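Hurdles Towards Progress in LFQA

However, while the RT system described here tops the public leaderboard, a detailed analysis of the model and the ELI5 dataset reveals some concerning trends.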

Train/Valid Overlap

Many held-out questions are paraphrased in the training set. Best answer to similar train questions gets 27.4 ROUGE-L.

Lack of Grounding

Conditioning answer generation on random documents instead of relevant ones does not measurably impact its factual correctness. Longer outputs get higher ROUGE-L.

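We find little to no evidence that the model actually grounds its text generation in the retrieved documents. An RT model fine-tuned with random retrievals from Wikipedia (i.e., random retrieval + RT) performs nearly as well as the c-REALM + RT model (24.2 vs. 24.4 ROUGE-L). We also find significant overlap between the training, validation and test sets of ELI5 (several questions are paraphrases of each other), which may remove the need for retrieval. The KILT benchmark measures the quality of retrieval and generation separately, without checking whether the text generation actually uses the retrievals.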

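Trivial baselines get higher Rouge-L scores than RAG and BART + DPR

Moreover, we find issues with the Rouge-L metric used to evaluate the quality of text generation: trivial, nonsensical baselines, such as a random training set answer and input copying, achieve relatively high Rouge-L scores (even beating BART + DPR and RAG).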
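To see how a trivial baseline can earn a non-trivial score, the sketch below computes a simplified ROUGE-L F-measure (longest common subsequence over lowercased, whitespace-split tokens, with no stemming) and applies it to an input-copying baseline. The number of repetitions and the reference answer are assumptions made for the example, and official ROUGE implementations differ in tokenization details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def input_copy_baseline(question, times=5):
    """Trivial 'answer': the question repeated several times (count is assumed)."""
    return " ".join([question] * times)


question = "Why are almost all boats white?"
# Hypothetical reference answer, written only for this illustration.
reference = "Most boats are white because white paint is cheap and reflects heat."
print(rouge_l_f1(input_copy_baseline(question), reference))
```

Because the question shares words with almost any reasonable answer, copying it yields a non-zero score without answering anything, which is the weakness of the metric the paragraph above points out.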

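Conclusion

We proposed a system for long-form question answering based on Routing Transformers and REALM, which tops the KILT leaderboard on ELI5. However, a detailed analysis reveals several issues with the benchmark that prevent it from being used to demonstrate meaningful modeling advances. We hope the community will work together to solve these issues so that researchers can climb the right hills and make meaningful progress on this challenging but important task.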

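Acknowledgments

The Routing Transformer work has been a team effort involving Aurko Roy, Mohammad Saffar, Ashish Vaswani and David Grangier. The follow-up work on open-domain long-form question answering has been a collaboration involving Kalpesh Krishna, Aurko Roy and Mohit Iyyer. We wish to thank Vidhisha Balachandran, Niki Parmar and Ashish Vaswani for several helpful discussions, and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions, which helped us improve our experiments.

We are grateful to Tu Vu for help with the QQP classifiers used to detect paraphrases in the ELI5 train and test sets. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally, we thank Shufan Wang, Andrew Drozdov, Nader Akoury and the rest of the UMass NLP group for helpful discussions and suggestions at various stages of the project.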

Editor: jq


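Original title: Progress and Challenges in Open-Domain Long-Form Question Answering

Source: Tensorflowers WeChat official account (WeChat ID: tensorflowers). Please credit the source when reposting.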
