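Handling regressions

"We don't cause regressions": this document describes what this "first rule of Linux kernel development" means in practice for developers. It complements "Reporting regressions", which covers the topic from a user's point of view; if you have never read that text, at least skim it before continuing here.

The important bits (aka "The TL;DR")

  1. Ensure subscribers of the regression mailing list (regressions@lists.linux.dev) quickly become aware of any new regression report:

    • When receiving a mailed report that did not CC the list, bring it into the loop by immediately sending at least a brief "Reply-all" with the list CCed.

    • Forward or bounce any reports submitted in bug trackers to the list.

  2. Make the Linux kernel regression tracking bot "regzbot" track the issue (this is optional, but recommended):

    • For mailed reports, check if the reporter included a line like #regzbot introduced: v5.13..v5.14-rc1. If not, send a reply (with the regressions list in CC) containing a paragraph like the following, which tells regzbot when the issue started to happen: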

      #regzbot ^introduced: 1f2e3d4c5b6a
      
    • When forwarding reports from a bug tracker to the regressions list (see above), include a paragraph like the following:

      #regzbot introduced: v5.13..v5.14-rc1
      #regzbot from: Some N. Ice Human <[email protected]>
      #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
      
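  3. When submitting fixes for regressions, add "Closes:" tags to the patch description pointing to all places where the issue was reported, as mandated by "Submitting patches: the essential guide to getting your code into the kernel" and Documentation/process/5.Posting.rst. If you are only fixing part of the issue that caused the regression, you may use "Link:" tags instead. regzbot currently makes no distinction between the two.

  4. Try to fix regressions quickly once the culprit has been identified; fixes for most regressions should be merged within two weeks, but some need to be resolved within two or three days.

All the details on Linux kernel regressions relevant for developers

The important basics in more detail

What to do when receiving regression reports

Ensure the Linux kernel's regression tracker and the other subscribers of the regression mailing list (regressions@lists.linux.dev) become aware of any newly reported regression:

  • When you receive a mailed report that did not CC the list, bring it into the loop by immediately sending at least a brief "Reply-all" with the list CCed; if you later reply to a reply that omitted the list, make sure the list gets CCed again.

  • If a report submitted in a bug tracker hits your inbox, forward or bounce it to the list. Consider checking the list archives beforehand, in case the reporter already forwarded it as instructed by "Reporting issues".

When doing either, consider making the Linux kernel regression tracking bot "regzbot" start tracking the issue immediately:

  • For mailed reports, check if the reporter included a "regzbot command" like #regzbot introduced: 1f2e3d4c5b6a. If not, send a reply (with the regressions list in CC) containing a paragraph like the following: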

    #regzbot ^introduced: v5.13..v5.14-rc1
    

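    This tells regzbot the version range in which the issue started to happen; you can also specify a range using commit-ids, or state a single commit-id if the reporter bisected the culprit.

    Note the caret (^) before "introduced": it tells regzbot to treat the parent mail (the one you reply to) as the initial report for the regression you want to see tracked. That matters, because regzbot will later look out for patches with "Closes:" tags pointing to the report in the archives on lore.kernel.org.

  • When forwarding a regression that was reported to a bug tracker, include a paragraph with these regzbot commands: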

    #regzbot introduced: 1f2e3d4c5b6a
    #regzbot from: Some N. Ice Human <[email protected]>
    #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
    

    Regzbot will then automatically associate the report with any patches that contain "Closes:" tags pointing to your mail or to the mentioned ticket.
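
The regzbot paragraphs described above are plain text, so they are easy to generate. The following is a minimal sketch, not part of regzbot itself; the function name `regzbot_paragraph` and its parameters are hypothetical and only mirror the commands shown above.

```python
from typing import Optional


def regzbot_paragraph(introduced: str,
                      parent_is_report: bool = False,
                      reporter: Optional[str] = None,
                      monitor: Optional[str] = None) -> str:
    """Build a regzbot command paragraph (hypothetical helper).

    introduced       -- version range or commit-id, e.g. "v5.13..v5.14-rc1"
    parent_is_report -- True when replying to the report: emits "^introduced"
    reporter         -- "Name <address>" of the original reporter (forwarding case)
    monitor          -- URL of the bug tracker ticket (forwarding case)
    """
    for value in (introduced, reporter, monitor):
        if value is not None and (not value.strip() or "\n" in value):
            raise ValueError("values must be non-empty single lines")

    caret = "^" if parent_is_report else ""
    lines = [f"#regzbot {caret}introduced: {introduced}"]
    if reporter is not None:
        lines.append(f"#regzbot from: {reporter}")
    if monitor is not None:
        lines.append(f"#regzbot monitor: {monitor}")

    # Surrounding blank lines keep the commands in a paragraph of their own.
    return "\n" + "\n".join(lines) + "\n"
```

Calling it with `parent_is_report=True` yields the reply variant with the caret; passing `reporter` and `monitor` yields the three-line variant for forwarded bug tracker reports.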

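What's important when fixing regressions

You don't need to do anything special when submitting fixes for regressions; just remember to do what "Submitting patches: the essential guide to getting your code into the kernel", Documentation/process/5.Posting.rst, and "Everything you ever wanted to know about Linux -stable releases" already explain in more detail:

  • Point to all places where the issue was reported using "Closes:" tags: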

    Closes: https://lore.kernel.org/r/[email protected]/
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
    

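    If you are only fixing part of the issue, you may use "Link:" instead, as described in the first document mentioned above. regzbot currently treats both equivalently and considers the linked reports resolved.

  • Add a "Fixes:" tag to specify the commit causing the regression.

  • If the culprit was merged in an earlier development cycle, explicitly mark the fix for backporting using the Cc: stable@vger.kernel.org tag.

All this is expected from you and important when it comes to regressions, as these tags are of great value for everyone (you included) who might look into the issue weeks, months, or years later. These tags are also crucial for tools and scripts used by other kernel developers or Linux distributions; one of these tools is regzbot, which heavily relies on the "Closes:" tags to associate regression reports with the changes resolving them.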
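
As an illustration of how the tags above fit together, here is a minimal sketch that assembles the trailer lines for a regression fix. The helper `regression_fix_trailers` is hypothetical. The "Fixes:" layout (commit-id abbreviated to 12 characters, followed by the quoted subject) follows the convention from the patch submission guide, and the order of the tags is an arbitrary choice made for this example.

```python
from typing import List


def regression_fix_trailers(report_urls: List[str],
                            culprit_sha: str,
                            culprit_subject: str,
                            mark_for_backport: bool) -> List[str]:
    """Return trailer lines for a regression fix (hypothetical helper)."""
    if not report_urls:
        raise ValueError("at least one report URL is needed for 'Closes:'")
    if len(culprit_sha) < 12:
        raise ValueError("commit-id must have at least 12 characters")

    # Commit that introduced the regression.
    trailers = [f'Fixes: {culprit_sha[:12]} ("{culprit_subject}")']
    # One 'Closes:' per place where the issue was reported.
    trailers.extend(f"Closes: {url}" for url in report_urls)
    # Explicit backport request, e.g. for culprits from an earlier cycle.
    if mark_for_backport:
        trailers.append("Cc: stable@vger.kernel.org")
    return trailers
```

The returned lines would go at the end of the patch description, next to the usual "Signed-off-by:" line.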

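Expectations and best practices for fixing regressions

As a Linux kernel developer, you are expected to do your best to prevent situations where a regression caused by a recent change of yours leaves users only these options:

  • Run a kernel with a regression that impacts usage.

  • Switch to an older or newer kernel series.

  • Continue running an outdated and thus potentially insecure kernel for more than three weeks after the regression's culprit was identified. Ideally it should be less than two weeks. And it ought to be just a few days if the issue is severe or affects many users, either in general or in prevalent environments.

How to achieve that in practice depends on various factors. Use the following rules of thumb as a guide.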

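In general:

  • Prioritize work on regressions over all other Linux kernel work, unless the latter concerns a severe issue (e.g. an acute security vulnerability, data loss, bricked hardware, ...).

  • Expedite fixing mainline regressions that recently made it into a proper mainline, stable, or longterm release (either directly or via backport).

  • Do not treat regressions from the current cycle as something that can wait until the end of the cycle, as the issue might discourage or prevent users and CI systems from testing mainline, now or in general.

  • Work with the required care to avoid additional or bigger damage, even if resolving an issue then takes longer than outlined below.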

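On timing once the culprit of a regression is known:

  • Aim to mainline a fix within two or three days if the issue is severe or bothers many users, either in general or in prevalent conditions such as a particular hardware environment, distribution, or stable/longterm series.

  • Aim to mainline a fix by the Sunday after next if the culprit made it into a recent mainline, stable, or longterm release (either directly or via backport); if the culprit became known early in the week and is simple to resolve, try to mainline the fix within the same week.

  • For other regressions, aim to mainline fixes before the hindmost Sunday within the next three weeks. One or two Sundays later is acceptable if the regression is something people can easily live with for a while, like a mild performance regression.

  • Delaying the mainlining of regression fixes until the next merge window is strongly discouraged, except when the fix is extraordinarily risky or the culprit was mainlined more than a year ago.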
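
The timing rules above boil down to simple date arithmetic. The sketch below shows one possible reading; it is an illustration under stated assumptions, not an official calculator. The category names are hypothetical, "two or three days" is taken as three days, "Sunday after next" is read as the second Sunday following the day the culprit became known, and "hindmost Sunday within the next three weeks" as the last Sunday within 21 days.

```python
from datetime import date, timedelta


def next_sunday(day: date) -> date:
    """First Sunday strictly after `day` (date.weekday(): Monday=0 ... Sunday=6)."""
    days_ahead = (6 - day.weekday()) % 7
    return day + timedelta(days=days_ahead or 7)


def last_sunday_on_or_before(day: date) -> date:
    """Latest Sunday that is not after `day`."""
    return day - timedelta(days=(day.weekday() - 6) % 7)


def target_merge_date(culprit_known: date, category: str) -> date:
    """Latest date by which a fix should be mainlined (assumed reading of the rules)."""
    if category == "severe":
        # Severe or bothering many users: two or three days.
        return culprit_known + timedelta(days=3)
    if category == "recent_release":
        # Culprit is in a recent mainline/stable/longterm release.
        return next_sunday(next_sunday(culprit_known))
    if category == "other":
        # Hindmost Sunday within the next three weeks.
        return last_sunday_on_or_before(culprit_known + timedelta(days=21))
    raise ValueError(f"unknown category: {category!r}")
```

For a culprit identified on Wednesday 2024-01-03, this gives 2024-01-06 for "severe", 2024-01-14 for "recent_release", and 2024-01-21 for "other".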

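On procedure:

  • Always consider reverting the culprit, as that is often the quickest and least dangerous way to fix a regression. Don't worry about mainlining a fixed variant later: that should be straightforward, as most of the code already went through review once.

  • Try to resolve any regressions introduced in mainline during the past twelve months before the current development cycle ends: Linus wants such regressions to be handled like those from the current cycle, unless fixing them bears unusual risks.

  • Consider CCing Linus on discussions or patch reviews if a regression seems tangly. Do the same in precarious or urgent cases, especially if the subsystem maintainer might be unavailable. Also CC the stable team when you know such a regression made it into a mainline, stable, or longterm release.

  • For urgent regressions, consider asking Linus to pick up the fix straight from the mailing list: he is totally fine with that for uncontroversial fixes. Ideally, such requests should be made in agreement with the subsystem maintainers or come directly from them.

  • If you are unsure whether a fix is worth the risk of applying it just days before a new mainline release, send Linus a mail with the usual lists and people in CC; in it, summarize the situation and ask him to consider picking up the fix straight from the list. He can then make the call himself and, when needed, even postpone the release. Again, such requests should ideally be made in agreement with the subsystem maintainers or come directly from them.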

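On stable and longterm kernels:

  • You are free to leave regressions to the stable team if they never occurred in mainline or were already fixed there.

  • If a regression made it into a proper mainline release during the past twelve months, make sure to tag the fix with "Cc: stable@vger.kernel.org", as a "Fixes:" tag alone does not guarantee a backport. Add the same tag if you know the culprit was backported to stable or longterm kernels.

  • When receiving reports about regressions in recent stable or longterm kernel series, evaluate at least briefly whether the issue might happen in current mainline as well; if that seems likely, take hold of the report. If in doubt, ask the reporter to check mainline.

  • Whenever you want to swiftly resolve a regression that recently also made it into a proper mainline, stable, or longterm release, fix it quickly in mainline; when appropriate, involve Linus to fast-track the fix (see above). That's because the stable team normally neither reverts nor fixes any changes that cause the same problems in mainline.

  • For urgent regression fixes, you might want to ensure prompt backporting by dropping the stable team a note once the fix was mainlined; this is especially advisable during merge windows and shortly thereafter, as the fix might otherwise land at the end of a huge patch queue.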
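
The rule about when the "Cc: stable@vger.kernel.org" tag is required can be written down as a small predicate. This is only a sketch: the function name and parameters are hypothetical, and "twelve months" is approximated as 365 days. A result of False merely means the rule above does not demand the tag; it does not mean the tag must be left out.

```python
from datetime import date, timedelta
from typing import Optional


def stable_tag_required(first_mainline_release: Optional[date],
                        today: date,
                        culprit_backported: bool) -> bool:
    """Decide whether the fix must carry 'Cc: stable@vger.kernel.org'.

    first_mainline_release -- date of the first proper mainline release that
                              contained the culprit, or None if there is none yet
    culprit_backported     -- True if the culprit is known to be in a stable or
                              longterm kernel
    """
    if culprit_backported:
        return True
    if first_mainline_release is None:
        return False
    age = today - first_mainline_release
    # Assumption: "during the past twelve months" ~ at most 365 days ago.
    return timedelta(0) <= age <= timedelta(days=365)
```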

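On patch flow:

  • Developers, when trying to meet the time periods mentioned above, remember to account for the time it takes to get fixes tested, reviewed, and merged by Linus, ideally with them being in linux-next at least briefly. Hence, if a fix is urgent, make that obvious so that others handle it appropriately.

  • Reviewers, you are kindly asked to help developers meet the time periods mentioned above by reviewing regression fixes in a timely manner.

  • Subsystem maintainers, you are likewise encouraged to expedite the handling of regression fixes. Thus evaluate whether skipping linux-next is an option for the particular fix. Also consider sending git pull requests more often than usual when needed. And try to avoid holding onto regression fixes over weekends, especially when the fix is marked for backporting.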

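More aspects regarding regressions developers should be aware of

How to deal with changes where a risk of regression is known

Evaluate how big the risk of regressions is, for example by performing a code search in Linux distributions and Git forges. Also consider asking other developers or projects likely to be affected to evaluate or even test the proposed change; if problems surface, maybe a solution acceptable to all can be found.

If the risk of regressions in the end seems relatively small, go ahead with the change, but let all involved parties know about the risk. Hence, make sure your patch description makes this aspect obvious. Once the change is merged, tell the Linux kernel's regression tracker and the regressions mailing list about the risk, so everyone has the change on the radar in case reports trickle in. Depending on the risk, you might also want to ask the subsystem maintainer to mention the issue in their mainline pull request.

What else is there to know about regressions?

Check out "Reporting regressions"; it covers a lot of other aspects you might want to be aware of:

  • the purpose of the "no regressions" rule

  • what issues actually qualify as regressions

  • who is in charge of finding the root cause of a regression

  • how to handle tricky situations, e.g. when a regression is caused by a security fix or when fixing a regression might cause another one

Whom to ask for advice when it comes to regressions

Send a mail to the regressions mailing list (regressions@lists.linux.dev) while CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the issue might better be dealt with in private, feel free to omit the list.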

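More about regression tracking and regzbot

Why does the Linux kernel have a regression tracker, and why is regzbot used?

Rules like "no regressions" need someone to ensure they are followed; otherwise they are broken either accidentally or on purpose. History has shown this to be true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to keep an eye on things as the Linux kernel's regression tracker; he is occasionally helped by other people. None of them are paid to do this, which is why regression tracking is done on a best-effort basis.

Earlier attempts to track regressions manually have shown it to be exhausting and frustrating work, which is why they were abandoned after a while. To prevent this from happening again, Thorsten developed regzbot to facilitate the work, with the long-term goal of automating regression tracking as much as possible for everyone involved.

How does regression tracking work with regzbot?

The bot watches for replies to reports of tracked regressions. Additionally, it looks out for posted or committed patches referencing such reports with "Closes:" tags; replies to such patch postings are tracked as well. Combined, this data provides good insight into the current state of the fixing process.

Regzbot tries to do its job with as little overhead as possible for both reporters and developers. In fact, only reporters are burdened with an extra duty: they need to tell regzbot about the regression report using the #regzbot introduced command outlined above; if they don't do that, someone else can take care of it using #regzbot ^introduced.

For developers there is normally no extra work involved; they just need to make sure to do something that was expected long before regzbot came to light: add links to the patch description pointing to all reports about the issue fixed.

Do I have to use regzbot?

It's in the interest of everyone if you do, as kernel maintainers like Linus Torvalds partly rely on regzbot's tracking in their work, for example when deciding to release a new version or extend the development phase. For this they need to be aware of all unfixed regressions; to do that, Linus is known to look into the weekly reports sent by regzbot.

Do I have to tell regzbot about every regression I stumble upon?

Ideally yes: we are all humans and easily forget problems when something more important unexpectedly comes up, for example a bigger problem in the Linux kernel or something in real life that keeps us away from keyboards for a while. Hence, it's best to tell regzbot about every regression, except when you immediately write a fix and commit it to a tree regularly merged into the affected kernel series.

How to see which regressions regzbot currently tracks?

Check regzbot's web interface for the latest info; alternatively, search for the latest regression report, which regzbot normally sends out once a week on Sunday evening (UTC), a few hours before Linus usually publishes new (pre-)releases.

What places is regzbot monitoring?

Regzbot is watching the most important Linux mailing lists as well as the git repositories of linux-next, mainline, and stable/longterm.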

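What kind of issues are supposed to be tracked by regzbot?

The bot is meant to track regressions, so please don't involve regzbot for regular issues. But it's okay for the Linux kernel's regression tracker if you use regzbot to track severe issues, like reports about hangs, corrupted data, or internal errors (Panic, Oops, BUG(), warning, ...).

Can I add regressions found by CI systems to regzbot's tracking?

Feel free to do so if the particular regression likely has an impact on practical use cases and thus might be noticed by users; conversely, please don't involve regzbot for theoretical regressions unlikely to show themselves in real-world usage.

How to interact with regzbot?

By using a "regzbot command" in a direct or indirect reply to the mail with the regression report. These commands need to be in their own paragraph (in other words: they need to be separated from the rest of the mail by blank lines).

One such command is #regzbot introduced: <version or commit>, which makes regzbot treat your mail as a regression report to be added to the tracking, as already described above; #regzbot ^introduced: <version or commit> is another such command, which makes regzbot treat the parent mail as a report for a regression it starts to track.

Once one of those two commands has been used, other regzbot commands can be used in direct or indirect replies to the report. You can write them below one of the introduced commands, in replies to the mail that used one of them, or in replies to a mail that itself is a reply to that mail:

  • Set or update the title: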

    #regzbot title: foo
    
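  • Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of the issue or a fix are discussed, for example the posting of a patch fixing the regression: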

    #regzbot monitor: https://lore.kernel.org/all/[email protected]/
    

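    Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot will consider all messages in that thread or ticket as related to the fixing process.

  • Point to a place with further details of interest, like a mailing list post or a ticket in a bug tracker that is slightly related but about a different topic: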

    #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
    
  • Mark a regression as fixed by a commit that is heading upstream or has already landed:

    #regzbot fix: 1f2e3d4c5d
    
  • Mark a regression as a duplicate of another one already tracked by regzbot:

    #regzbot dup-of: https://lore.kernel.org/all/[email protected]/
    
  • Mark a regression as invalid:

    #regzbot invalid: wasn't a regression, problem has always existed
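
To make the "own paragraph" requirement concrete, here is a minimal sketch of how commands could be picked out of a mail body. It is not regzbot's actual parser; the function name and the exact matching rules are assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# "#regzbot <command>: <value>", optionally with a caret as in "^introduced".
COMMAND_RE = re.compile(r"^#regzbot\s+(\^?[a-z-]+):\s*(.*)$")


def extract_regzbot_commands(mail_body: str) -> List[Tuple[str, str]]:
    """Return (command, value) pairs found in command-only paragraphs."""
    commands = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        matches = [COMMAND_RE.match(line) for line in lines]
        # Accept a paragraph only if every line in it is a regzbot command;
        # commands mixed into ordinary text or quoted with ">" are ignored.
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands
```

A mail whose body contains a blank-line-separated block with `#regzbot ^introduced: v5.13..v5.14-rc1` and `#regzbot title: foo` would yield those two pairs, while the same lines embedded in a regular text paragraph would yield nothing.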
    

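Is there more to tell about regzbot and its commands?

More detailed and up-to-date information about the Linux kernel's regression tracking bot can be found on its project page, which among other things contains a getting started guide and reference documentation; both cover more details than the section above.

Quotes from Linus about regressions

Find below a few real-life examples of how Linus Torvalds expects regressions to be handled:

  • From 2017-10-26 (1/2)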

    If you break existing user space setups THAT IS A REGRESSION.
    
    It's not ok to say "but we'll fix the user space setup".
    
    Really. NOT OK.
    
    [...]
    
    The first rule is:
    
     - we don't cause regressions
    
    and the corollary is that when regressions *do* occur, we admit to
    them and fix them, instead of blaming user space.
    
    The fact that you have apparently been denying the regression now for
    three weeks means that I will revert, and I will stop pulling apparmor
    requests until the people involved understand how kernel development
    is done.
    
  • From 2017-10-26 (2/2)

    People should basically always feel like they can update their kernel
    and simply not have to worry about it.
    
    I refuse to introduce "you can only update the kernel if you also
    update that other program" kind of limitations. If the kernel used to
    work for you, the rule is that it continues to work for you.
    
    There have been exceptions, but they are few and far between, and they
    generally have some major and fundamental reasons for having happened,
    that were basically entirely unavoidable, and people _tried_hard_ to
    avoid them. Maybe we can't practically support the hardware any more
    after it is decades old and nobody uses it with modern kernels any
    more. Maybe there's a serious security issue with how we did things,
    and people actually depended on that fundamentally broken model. Maybe
    there was some fundamental other breakage that just _had_ to have a
    flag day for very core and fundamental reasons.
    
    And notice that this is very much about *breaking* peoples environments.
    
    Behavioral changes happen, and maybe we don't even support some
    feature any more. There's a number of fields in /proc/<pid>/stat that
    are printed out as zeroes, simply because they don't even *exist* in
    the kernel any more, or because showing them was a mistake (typically
    an information leak). But the numbers got replaced by zeroes, so that
    the code that used to parse the fields still works. The user might not
    see everything they used to see, and so behavior is clearly different,
    but things still _work_, even if they might no longer show sensitive
    (or no longer relevant) information.
    
    But if something actually breaks, then the change must get fixed or
    reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
    your user space then". It was a kernel change that exposed the
    problem, it needs to be the kernel that corrects for it, because we
    have a "upgrade in place" model. We don't have a "upgrade with new
    user space".
    
    And I seriously will refuse to take code from people who do not
    understand and honor this very simple rule.
    
    This rule is also not going to change.
    
    And yes, I realize that the kernel is "special" in this respect. I'm
    proud of it.
    
    I have seen, and can point to, lots of projects that go "We need to
    break that use case in order to make progress" or "you relied on
    undocumented behavior, it sucks to be you" or "there's a better way to
    do what you want to do, and you have to change to that new better
    way", and I simply don't think that's acceptable outside of very early
    alpha releases that have experimental users that know what they signed
    up for. The kernel hasn't been in that situation for the last two
    decades.
    
    We do API breakage _inside_ the kernel all the time. We will fix
    internal problems by saying "you now need to do XYZ", but then it's
    about internal kernel API's, and the people who do that then also
    obviously have to fix up all the in-kernel users of that API. Nobody
    can say "I now broke the API you used, and now _you_ need to fix it
    up". Whoever broke something gets to fix it too.
    
    And we simply do not break user space.
    
  • From 2020-05-21

    The rules about regressions have never been about any kind of
    documented behavior, or where the code lives.
    
    The rules about regressions are always about "breaks user workflow".
    
    Users are literally the _only_ thing that matters.
    
    No amount of "you shouldn't have used this" or "that behavior was
    undefined, it's your own fault your app broke" or "that used to work
    simply because of a kernel bug" is at all relevant.
    
    Now, reality is never entirely black-and-white. So we've had things
    like "serious security issue" etc that just forces us to make changes
    that may break user space. But even then the rule is that we don't
    really have other options that would allow things to continue.
    
    And obviously, if users take years to even notice that something
    broke, or if we have sane ways to work around the breakage that
    doesn't make for too much trouble for users (ie "ok, there are a
    handful of users, and they can use a kernel command line to work
    around it" kind of things) we've also been a bit less strict.
    
    But no, "that was documented to be broken" (whether it's because the
    code was in staging or because the man-page said something else) is
    irrelevant. If staging code is so useful that people end up using it,
    that means that it's basically regular kernel code with a flag saying
    "please clean this up".
    
    The other side of the coin is that people who talk about "API
    stability" are entirely wrong. API's don't matter either. You can make
    any changes to an API you like - as long as nobody notices.
    
    Again, the regression rule is not about documentation, not about
    API's, and not about the phase of the moon.
    
    It's entirely about "we caused problems for user space that used to work".
    
  • From 2017-11-05

    And our regression rule has never been "behavior doesn't change".
    That would mean that we could never make any changes at all.
    
    For example, we do things like add new error handling etc all the
    time, which we then sometimes even add tests for in our kselftest
    directory.
    
    So clearly behavior changes all the time and we don't consider that a
    regression per se.
    
    The rule for a regression for the kernel is that some real user
    workflow breaks. Not some test. Not a "look, I used to be able to do
    X, now I can't".
    
  • From 2018-08-03

    YOU ARE MISSING THE #1 KERNEL RULE.
    
    We do not regress, and we do not regress exactly because your are 100% wrong.
    
    And the reason you state for your opinion is in fact exactly *WHY* you
    are wrong.
    
    Your "good reasons" are pure and utter garbage.
    
    The whole point of "we do not regress" is so that people can upgrade
    the kernel and never have to worry about it.
    
    > Kernel had a bug which has been fixed
    
    That is *ENTIRELY* immaterial.
    
    Guys, whether something was buggy or not DOES NOT MATTER.
    
    Why?
    
    Bugs happen. That's a fact of life. Arguing that "we had to break
    something because we were fixing a bug" is completely insane. We fix
    tens of bugs every single day, thinking that "fixing a bug" means that
    we can break something is simply NOT TRUE.
    
    So bugs simply aren't even relevant to the discussion. They happen,
    they get found, they get fixed, and it has nothing to do with "we
    break users".
    
    Because the only thing that matters IS THE USER.
    
    How hard is that to understand?
    
    Anybody who uses "but it was buggy" as an argument is entirely missing
    the point. As far as the USER was concerned, it wasn't buggy - it
    worked for him/her.
    
    Maybe it worked *because* the user had taken the bug into account,
    maybe it worked because the user didn't notice - again, it doesn't
    matter. It worked for the user.
    
    Breaking a user workflow for a "bug" is absolutely the WORST reason
    for breakage you can imagine.
    
    It's basically saying "I took something that worked, and I broke it,
    but now it's better". Do you not see how f*cking insane that statement
    is?
    
    And without users, your program is not a program, it's a pointless
    piece of code that you might as well throw away.
    
    Seriously. This is *why* the #1 rule for kernel development is "we
    don't break users". Because "I fixed a bug" is absolutely NOT AN
    ARGUMENT if that bug fix broke a user setup. You actually introduced a
    MUCH BIGGER bug by "fixing" something that the user clearly didn't
    even care about.
    
    And dammit, we upgrade the kernel ALL THE TIME without upgrading any
    other programs at all. It is absolutely required, because flag-days
    and dependencies are horribly bad.
    
    And it is also required simply because I as a kernel developer do not
    upgrade random other tools that I don't even care about as I develop
    the kernel, and I want any of my users to feel safe doing the same
    thing.
    
    So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
    without upgrading some other random binary, then we have a problem.
    
  • From 2021-06-05

    THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
    
    Honestly, security people need to understand that "not working" is not
    a success case of security. It's a failure case.
    
    Yes, "not working" may be secure. But security in that case is *pointless*.
    
  • From 2011-05-06 (1/3)

    Binary compatibility is more important.
    
    And if binaries don't use the interface to parse the format (or just
    parse it wrongly - see the fairly recent example of adding uuid's to
    /proc/self/mountinfo), then it's a regression.
    
    And regressions get reverted, unless there are security issues or
    similar that makes us go "Oh Gods, we really have to break things".
    
    I don't understand why this simple logic is so hard for some kernel
    developers to understand. Reality matters. Your personal wishes matter
    NOT AT ALL.
    
    If you made an interface that can be used without parsing the
    interface description, then we're stuck with the interface. Theory
    simply doesn't matter.
    
    You could help fix the tools, and try to avoid the compatibility
    issues that way. There aren't that many of them.
    

  • From 2011-05-06 (2/3)

    it's clearly NOT an internal tracepoint. By definition. It's being
    used by powertop.
    

  • From 2011-05-06 (3/3)

    We have programs that use that ABI and thus it's a regression if they break.
    
  • From 2012-07-06

    > Now this got me wondering if Debian _unstable_ actually qualifies as a
    > standard distro userspace.
    
    Oh, if the kernel breaks some standard user space, that counts. Tons
    of people run Debian unstable
    
  • From 2019-09-15

    One _particularly_ last-minute revert is the top-most commit (ignoring
    the version change itself) done just before the release, and while
    it's very annoying, it's perhaps also instructive.
    
    What's instructive about it is that I reverted a commit that wasn't
    actually buggy. In fact, it was doing exactly what it set out to do,
    and did it very well. In fact it did it _so_ well that the much
    improved IO patterns it caused then ended up revealing a user-visible
    regression due to a real bug in a completely unrelated area.
    
    The actual details of that regression are not the reason I point that
    revert out as instructive, though. It's more that it's an instructive
    example of what counts as a regression, and what the whole "no
    regressions" kernel rule means. The reverted commit didn't change any
    API's, and it didn't introduce any new bugs. But it ended up exposing
    another problem, and as such caused a kernel upgrade to fail for a
    user. So it got reverted.
    
    The point here being that we revert based on user-reported _behavior_,
    not based on some "it changes the ABI" or "it caused a bug" concept.
    The problem was really pre-existing, and it just didn't happen to
    trigger before. The better IO patterns introduced by the change just
    happened to expose an old bug, and people had grown to depend on the
    previously benign behavior of that old issue.
    
    And never fear, we'll re-introduce the fix that improved on the IO
    patterns once we've decided just how to handle the fact that we had a
    bad interaction with an interface that people had then just happened
    to rely on incidental behavior for before. It's just that we'll have
    to hash through how to do that (there are no less than three different
    patches by three different developers being discussed, and there might
    be more coming...). In the meantime, I reverted the thing that exposed
    the problem to users for this release, even if I hope it will be
    re-introduced (perhaps even backported as a stable patch) once we have
    consensus about the issue it exposed.
    
    Take-away from the whole thing: it's not about whether you change the
    kernel-userspace ABI, or fix a bug, or about whether the old code
    "should never have worked in the first place". It's about whether
    something breaks existing users' workflow.
    
    Anyway, that was my little aside on the whole regression thing.  Since
    it's that "first rule of kernel programming", I felt it is perhaps
    worth just bringing it up every once in a while