ᕕ( ᐛ )ᕗ Jimyag's Blog

Students' Guide to Raft (Translation)

Original: https://thesquareplanet.com/blog/students-guide-to-raft/

For the past few months, I have been a Teaching Assistant for MIT’s 6.824 Distributed Systems class. The class has traditionally had a number of labs building on the Paxos consensus algorithm, but this year, we decided to make the move to Raft. Raft was “designed to be easy to understand”, and our hope was that the change might make the students’ lives easier.

This post, and the accompanying Instructors’ Guide to Raft post, chronicles our journey with Raft, and will hopefully be useful to implementers of the Raft protocol and students trying to get a better understanding of Raft’s internals. If you are looking for a Paxos vs Raft comparison, or for a more pedagogical analysis of Raft, you should go read the Instructors’ Guide. The bottom of this post contains a list of questions commonly asked by 6.824 students, as well as answers to those questions. If you run into an issue that is not listed in the main content of this post, check out the Q&A. The post is quite long, but all the points it makes are real problems that a lot of 6.824 students (and TAs) ran into. It is a worthwhile read.

### Background

Before we dive into Raft, some context may be useful. 6.824 used to have a set of Paxos-based labs that were built in Go; Go was chosen both because it is easy to learn for students, and because is pretty well-suited for writing concurrent, distributed applications (goroutines come in particularly handy). Over the course of four labs, students build a fault-tolerant, sharded key-value store. The first lab had them build a consensus-based log library, the second added a key value store on top of that, and the third sharded the key space among multiple fault-tolerant clusters, with a fault-tolerant shard master handling configuration changes. We also had a fourth lab in which the students had to handle the failure and recovery of machines, both with and without their disks intact. This lab was available as a default final project for students.

This year, we decided to rewrite all these labs using Raft. The first three labs were all the same, but the fourth lab was dropped as persistence and failure recovery is already built into Raft. This article will mainly discuss our experiences with the first lab, as it is the one most directly related to Raft, though I will also touch on building applications on top of Raft (as in the second lab).

Raft, for those of you who are just getting to know it, is best described by the text on the protocol’s web site:

Raft is a consensus algorithm that is designed to be easy to understand. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

Visualizations like this one give a good overview of the principal components of the protocol, and the paper gives good intuition for why the various pieces are needed. If you haven’t already read the extended Raft paper, you should go read that before continuing this article, as I will assume a decent familiarity with Raft.

As with all distributed consensus protocols, the devil is very much in the details. In the steady state where there are no failures, Raft’s behavior is easy to understand, and can be explained in an intuitive manner. For example, it is simple to see from the visualizations that, assuming no failures, a leader will eventually be elected, and eventually all operations sent to the leader will be applied by the followers in the right order. However, when delayed messages, network partitions, and failed servers are introduced, each and every if, but, and and becomes crucial. In particular, there are a number of bugs that we see repeated over and over again, simply due to misunderstandings or oversights when reading the paper. This problem is not unique to Raft, and is one that comes up in all complex distributed systems that provide correctness.

### Implementing Raft

The ultimate guide to Raft is in Figure 2 of the Raft paper. This figure specifies the behavior of every RPC exchanged between Raft servers, gives various invariants that servers must maintain, and specifies when certain actions should occur. We will be talking about Figure 2 a lot in the rest of this article. It needs to be followed to the letter.

Figure 2 defines what every server should do, in every state, for every incoming RPC, as well as when certain other things should happen (such as when it is safe to apply an entry in the log). At first, you might be tempted to treat Figure 2 as sort of an informal guide; you read it once, and then start coding up an implementation that follows roughly what it says to do. Doing this, you will quickly get up and running with a mostly working Raft implementation. And then the problems start.

In fact, Figure 2 is extremely precise, and every single statement it makes should be treated, in specification terms, as MUST, not as SHOULD. For example, you might reasonably reset a peer’s election timer whenever you receive an AppendEntries or RequestVote RPC, as both indicate that some other peer either thinks it’s the leader, or is trying to become the leader. Intuitively, this means that we shouldn’t be interfering. However, if you read Figure 2 carefully, it says:

If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate.

The distinction turns out to matter a lot, as the former implementation can result in significantly reduced liveness in certain situations.

#### The importance of details

To make the discussion more concrete, let us consider an example that tripped up a number of 6.824 students. The Raft paper mentions heartbeat RPCs in a number of places. Specifically, a leader will occasionally (at least once per heartbeat interval) send out an AppendEntries RPC to all peers to prevent them from starting a new election. If the leader has no new entries to send to a particular peer, the AppendEntries RPC contains no entries, and is considered a heartbeat.

Many of our students assumed that heartbeats were somehow “special”; that when a peer receives a heartbeat, it should treat it differently from a non-heartbeat AppendEntries RPC. In particular, many would simply reset their election timer when they received a heartbeat, and then return success, without performing any of the checks specified in Figure 2. This is extremely dangerous. By accepting the RPC, the follower is implicitly telling the leader that their log matches the leader’s log up to and including the prevLogIndex included in the AppendEntries arguments. Upon receiving the reply, the leader might then decide (incorrectly) that some entry has been replicated to a majority of servers, and start committing it.
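In other words, the consistency check from Figure 2 must run on every AppendEntries, entries or not. A minimal sketch of that check (0-based indices for brevity; the paper uses 1-based, and the struct names here are illustrative):

```go
package main

import "fmt"

type Entry struct{ Term int }

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []Entry // empty for a heartbeat
}

// consistencyCheck returns false unless the follower's log contains an entry
// at PrevLogIndex whose term matches PrevLogTerm. Skipping this for
// empty-Entries heartbeats would let the leader wrongly advance commitIndex.
func consistencyCheck(log []Entry, args *AppendEntriesArgs) bool {
	if args.PrevLogIndex >= len(log) {
		return false // our log is too short to contain prevLogIndex
	}
	if args.PrevLogIndex >= 0 && log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return false // term mismatch at prevLogIndex
	}
	return true
}

func main() {
	lg := []Entry{{Term: 1}, {Term: 1}, {Term: 2}}
	// A heartbeat must still fail if prevLogIndex is beyond our log:
	hb := &AppendEntriesArgs{Term: 2, PrevLogIndex: 5, PrevLogTerm: 2}
	fmt.Println(consistencyCheck(lg, hb)) // false
	ok := &AppendEntriesArgs{Term: 2, PrevLogIndex: 2, PrevLogTerm: 2}
	fmt.Println(consistencyCheck(lg, ok)) // true
}
```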

Another issue many had (often immediately after fixing the issue above), was that, upon receiving a heartbeat, they would truncate the follower’s log following prevLogIndex, and then append any entries included in the AppendEntries arguments. This is also not correct. We can once again turn to Figure 2:

If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it.

The if here is crucial. If the follower has all the entries the leader sent, the follower MUST NOT truncate its log. Any elements following the entries sent by the leader MUST be kept. This is because we could be receiving an outdated AppendEntries RPC from the leader, and truncating the log would mean “taking back” entries that we may have already told the leader that we have in our log.
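A sketch of the conflict-only truncation that rule implies (again 0-based indices for brevity, and the function name is my own): walk the entries from the leader, skip ones already present, and truncate only at the first real conflict.

```go
package main

import "fmt"

type Entry struct{ Term int }

// appendEntries applies Figure 2's steps 3-4: truncate only at the first
// conflicting index, and never touch entries past the ones the leader sent.
func appendEntries(log []Entry, prevLogIndex int, entries []Entry) []Entry {
	for i, e := range entries {
		idx := prevLogIndex + 1 + i
		if idx < len(log) {
			if log[idx].Term == e.Term {
				continue // already have this entry; do NOT truncate
			}
			log = log[:idx] // conflict: delete it and everything after
		}
		log = append(log, e)
	}
	return log
}

func main() {
	// The follower already has entries beyond what this (possibly stale) RPC carries.
	lg := []Entry{{1}, {1}, {2}, {3}}
	// The leader re-sends the entry at index 1; no conflict, so nothing is lost.
	out := appendEntries(lg, 0, []Entry{{1}})
	fmt.Println(len(out)) // 4 -- the trailing entries are kept
}
```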

### Debugging Raft

Inevitably, the first iteration of your Raft implementation will be buggy. So will the second. And third. And fourth. In general, each one will be less buggy than the previous one, and, from experience, most of your bugs will be a result of not faithfully following Figure 2.

When debugging Raft, there are generally four main sources of bugs: livelocks, incorrect or incomplete RPC handlers, failure to follow The Rules, and term confusion. Deadlocks are also a common problem, but they can generally be debugged by logging all your locks and unlocks, and figuring out which locks you are taking, but not releasing. Let us consider each of these in turn:

#### Livelocks

When your system livelocks, every node in your system is doing something, but collectively your nodes are in such a state that no progress is being made. This can happen fairly easily in Raft, especially if you do not follow Figure 2 religiously. One livelock scenario comes up especially often; no leader is being elected, or once a leader is elected, some other node starts an election, forcing the recently elected leader to abdicate immediately.

There are many reasons why this scenario may come up, but there is a handful of mistakes that we have seen numerous students make:

If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)
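This rule from Figure 2 applies to every request *and* every reply. A common pattern is a small helper called at the top of every RPC handler and on every reply received; a sketch follows (the type and method names are illustrative):

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Server struct {
	currentTerm int
	state       State
	votedFor    int
}

// maybeStepDown implements the "All Servers" rule: on seeing a newer term in
// any RPC request or reply, adopt it and revert to follower.
func (s *Server) maybeStepDown(term int) {
	if term > s.currentTerm {
		s.currentTerm = term
		s.state = Follower
		s.votedFor = -1 // new term: we have not voted in it yet
	}
}

func main() {
	s := &Server{currentTerm: 3, state: Leader, votedFor: 0}
	s.maybeStepDown(5) // a reply carried term 5 > 3: the leader must abdicate
	fmt.Println(s.currentTerm, s.state == Follower) // 5 true
}
```

Note that resetting `votedFor` here matters: forgetting it is one way two candidates can end up unable to gather votes in the new term.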

#### Incorrect RPC handlers

Even though Figure 2 spells out exactly what each RPC handler should do, some subtleties are still easy to miss. Here are a handful that we kept seeing over and over again, and that you should keep an eye out for in your implementation:

#### Failure to follow The Rules

While the Raft paper is very explicit about how to implement each RPC handler, it also leaves the implementation of a number of rules and invariants unspecified. These are listed in the “Rules for Servers” block on the right hand side of Figure 2. While some of them are fairly self-explanatory, there are also some that require designing your application very carefully so that it does not violate The Rules:

One common source of confusion is the difference between nextIndex and matchIndex. In particular, you may observe that matchIndex = nextIndex - 1, and simply not implement matchIndex. This is not safe. While nextIndex and matchIndex are generally updated at the same time to a similar value (specifically, nextIndex = matchIndex + 1), the two serve quite different purposes. nextIndex is a guess as to what prefix the leader shares with a given follower. It is generally quite optimistic (we share everything), and is moved backwards only on negative responses. For example, when a leader has just been elected, nextIndex is set to be the index at the end of the log. In a way, nextIndex is used for performance – you only need to send these things to this peer.

matchIndex is used for safety. It is a conservative measurement of what prefix of the log the leader shares with a given follower. matchIndex cannot ever be set to a value that is too high, as this may cause the commitIndex to be moved too far forward. This is why matchIndex is initialized to -1 (i.e., we agree on no prefix), and only updated when a follower positively acknowledges an AppendEntries RPC.
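The asymmetry can be summarized in a few lines: nextIndex starts high and only ever moves down on rejections, while matchIndex starts at -1 and only ever moves up on positive acknowledgements. A sketch (struct layout and function names are illustrative):

```go
package main

import "fmt"

// Leader holds per-follower replication state, as in Figure 2.
type Leader struct {
	nextIndex  []int // optimistic guess: performance
	matchIndex []int // conservative fact: safety
}

func newLeader(peers, logLen int) *Leader {
	l := &Leader{nextIndex: make([]int, peers), matchIndex: make([]int, peers)}
	for i := range l.nextIndex {
		l.nextIndex[i] = logLen // assume we share everything
		l.matchIndex[i] = -1    // assume we agree on no prefix
	}
	return l
}

// onSuccess advances both indices from the *arguments we sent* in the RPC,
// never from the leader's current state (which may have changed since).
func (l *Leader) onSuccess(peer, prevLogIndex, numEntries int) {
	l.matchIndex[peer] = prevLogIndex + numEntries
	l.nextIndex[peer] = l.matchIndex[peer] + 1
}

// onReject moves only nextIndex backwards; matchIndex never regresses.
func (l *Leader) onReject(peer int) {
	if l.nextIndex[peer] > 0 {
		l.nextIndex[peer]--
	}
}

func main() {
	l := newLeader(3, 10)
	l.onReject(1)        // follower 1 rejected: back off the guess
	l.onSuccess(1, 3, 2) // we had sent prevLogIndex=3 with 2 entries
	fmt.Println(l.nextIndex[1], l.matchIndex[1]) // 6 5
}
```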

#### Term confusion

Term confusion refers to servers getting confused by RPCs that come from old terms. In general, this is not a problem when receiving an RPC, since the rules in Figure 2 say exactly what you should do when you see an old term. However, Figure 2 generally doesn’t discuss what you should do when you get old RPC replies. From experience, we have found that by far the simplest thing to do is to first record the term in the reply (it may be higher than your current term), and then to compare the current term with the term you sent in your original RPC. If the two are different, drop the reply and return. Only if the two terms are the same should you continue processing the reply. There may be further optimizations you can do here with some clever protocol reasoning, but this approach seems to work well. And not doing it leads down a long, winding path of blood, sweat, tears and despair.

A related, but not identical problem is that of assuming that your state has not changed between when you sent the RPC, and when you received the reply. A good example of this is setting matchIndex = nextIndex - 1, or matchIndex = len(log) when you receive a response to an RPC. This is not safe, because both of those values could have been updated since when you sent the RPC. Instead, the correct thing to do is update matchIndex to be prevLogIndex + len(entries[]) from the arguments you sent in the RPC originally.
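Both rules above reduce to one check at the top of every reply handler: compare the term you *sent* against your term *now*, after first absorbing any newer term carried by the reply. A sketch (names are illustrative):

```go
package main

import "fmt"

type Raft struct{ currentTerm int }

// shouldProcessReply implements the stale-reply discipline: record a newer
// term from the reply first, then process the reply only if the term we sent
// the RPC in still equals our current term.
func (r *Raft) shouldProcessReply(sentTerm, replyTerm int) bool {
	if replyTerm > r.currentTerm {
		r.currentTerm = replyTerm // (step-down logic would also run here)
		return false
	}
	if sentTerm != r.currentTerm {
		return false // stale reply from an old term: drop it
	}
	return true // safe to act on, e.g., update matchIndex
}

func main() {
	r := &Raft{currentTerm: 4}
	fmt.Println(r.shouldProcessReply(3, 3)) // false: we sent this RPC back in term 3
	fmt.Println(r.shouldProcessReply(4, 4)) // true
	fmt.Println(r.shouldProcessReply(4, 6)) // false, and currentTerm is now 6
}
```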

### An aside on optimizations

The Raft paper includes a couple of optional features of interest. In 6.824, we require the students to implement two of them: log compaction (section 7) and accelerated log backtracking (top left hand side of page 8). The former is necessary to avoid the log growing without bound, and the latter is useful for bringing stale followers up to date quickly.

These features are not a part of “core Raft”, and so do not receive as much attention in the paper as the main consensus protocol. Log compaction is covered fairly thoroughly (in Figure 13), but leaves out some design details that you might miss if you read it too casually:

The accelerated log backtracking optimization is very underspecified, probably because the authors do not see it as being necessary for most deployments. It is not clear from the text exactly how the conflicting index and term sent back from the client should be used by the leader to determine what nextIndex to use. We believe the protocol the authors probably want you to follow is:

A half-way solution is to just use conflictIndex (and ignore conflictTerm), which simplifies the implementation, but then the leader will sometimes end up sending more log entries to the follower than is strictly necessary to bring them up to date.
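To make the full variant concrete, here is a sketch of the leader-side bookkeeping, assuming the follower's rejection carries `conflictIndex` and `conflictTerm` fields: if the leader has entries from conflictTerm, it jumps nextIndex to just past its last entry of that term; otherwise it falls back to conflictIndex. This is our reading of what the authors intend, not a specification from the paper (0-based indices for brevity).

```go
package main

import "fmt"

type Entry struct{ Term int }

// nextIndexAfterConflict computes the new nextIndex for a follower that
// rejected an AppendEntries with (conflictIndex, conflictTerm).
func nextIndexAfterConflict(log []Entry, conflictIndex, conflictTerm int) int {
	last := -1
	for i, e := range log {
		if e.Term == conflictTerm {
			last = i // remember the leader's last entry of conflictTerm
		}
	}
	if last >= 0 {
		return last + 1 // leader has the term: one past its last such entry
	}
	return conflictIndex // leader lacks the term: jump to conflictIndex
}

func main() {
	lg := []Entry{{1}, {2}, {2}, {4}}
	fmt.Println(nextIndexAfterConflict(lg, 1, 2)) // 3: leader's term-2 entries end at index 2
	fmt.Println(nextIndexAfterConflict(lg, 1, 3)) // 1: leader has no term-3 entries
}
```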

### Applications on top of Raft

When building a service on top of Raft (such as the key/value store in the second 6.824 Raft lab), the interaction between the service and the Raft log can be tricky to get right. This section details some aspects of the development process that you may find useful when building your application.

#### Applying client operations

You may be confused about how you would even implement an application in terms of a replicated log. You might start off by having your service, whenever it receives a client request, send that request to the leader, wait for Raft to apply something, do the operation the client asked for, and then return to the client. While this would be fine in a single-client system, it does not work for concurrent clients.

Instead, the service should be constructed as a state machine where client operations transition the machine from one state to another. You should have a loop somewhere that takes one client operation at the time (in the same order on all servers – this is where Raft comes in), and applies each one to the state machine in order. This loop should be the only part of your code that touches the application state (the key/value mapping in 6.824). This means that your client-facing RPC methods should simply submit the client’s operation to Raft, and then wait for that operation to be applied by this “applier loop”. Only when the client’s command comes up should it be executed, and any return values read out. Note that this includes read requests!

This brings up another question: how do you know when a client operation has completed? In the case of no failures, this is simple – you just wait for the thing you put into the log to come back out (i.e., be passed to apply()). When that happens, you return the result to the client. However, what happens if there are failures? For example, you may have been the leader when the client initially contacted you, but someone else has since been elected, and the client request you put in the log has been discarded. Clearly you need to have the client try again, but how do you know when to tell them about the error?

One simple way to solve this problem is to record where in the Raft log the client’s operation appears when you insert it. Once the operation at that index is sent to apply(), you can tell whether or not the client’s operation succeeded based on whether the operation that came up for that index is in fact the one you put there. If it isn’t, a failure has happened and an error can be returned to the client.
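The "record where it landed" idea can be sketched as a small table from log index to the operation we expect there; when the applier reaches that index, we check whether our command actually won the slot. All names here (`Op`, `pending`) are illustrative, not from the lab skeleton:

```go
package main

import "fmt"

// Op identifies a client operation; a real service would carry the command too.
type Op struct {
	ClientID int64
	Seq      int64
}

type Server struct {
	pending map[int]Op // log index -> op we submitted at that index
}

// submitted records the index Start() returned for our operation.
func (s *Server) submitted(index int, op Op) { s.pending[index] = op }

// applied is called by the applier loop for each committed entry, in order.
// It returns (waiting, match): waiting is true if some RPC handler was
// watching this index, and match is true only if OUR command won the slot.
func (s *Server) applied(index int, op Op) (waiting, match bool) {
	want, ok := s.pending[index]
	if !ok {
		return false, false
	}
	delete(s.pending, index)
	return true, want == op
}

func main() {
	s := &Server{pending: make(map[int]Op)}
	s.submitted(2, Op{ClientID: 7, Seq: 1})
	// A leader change put someone else's command at index 2:
	_, match := s.applied(2, Op{ClientID: 9, Seq: 5})
	fmt.Println(match) // false -> return an error so the client retries
}
```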

#### Duplicate detection

As soon as you have clients retry operations in the face of errors, you need some kind of duplicate detection scheme – if a client sends an APPEND to your server, doesn’t hear back, and re-sends it to the next server, your apply() function needs to ensure that the APPEND isn’t executed twice. To do so, you need some kind of unique identifier for each client request, so that you can recognize if you have seen, and more importantly, applied, a particular operation in the past. Furthermore, this state needs to be a part of your state machine so that all your Raft servers eliminate the same duplicates.

There are many ways of assigning such identifiers. One simple and fairly efficient one is to give each client a unique identifier, and then have them tag each request with a monotonically increasing sequence number. If a client re-sends a request, it re-uses the same sequence number. Your server keeps track of the latest sequence number it has seen for each client, and simply ignores any operation that it has already seen.
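That scheme fits in a few lines once the last-applied sequence number per client lives inside the replicated state machine itself (so every server filters the same duplicates). A sketch, with illustrative names:

```go
package main

import "fmt"

// StateMachine is the application state plus the duplicate-detection table.
// lastSeq MUST be part of the replicated state, not per-server scratch data.
type StateMachine struct {
	lastSeq map[int64]int64   // clientID -> highest sequence number applied
	kv      map[string]string // the actual application state
}

// apply executes an APPEND unless this (clientID, seq) was already applied.
func (sm *StateMachine) apply(clientID, seq int64, key, val string) bool {
	if seq <= sm.lastSeq[clientID] {
		return false // duplicate of an already-applied request: ignore
	}
	sm.lastSeq[clientID] = seq
	sm.kv[key] += val
	return true
}

func main() {
	sm := &StateMachine{lastSeq: map[int64]int64{}, kv: map[string]string{}}
	sm.apply(1, 1, "k", "x")
	sm.apply(1, 1, "k", "x") // client retried: must be a no-op
	fmt.Println(sm.kv["k"])  // "x", not "xx"
}
```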

#### Hairy corner-cases

If your implementation follows the general outline given above, there are at least two subtle issues you are likely to run into that may be hard to identify without some serious debugging. To save you some time, here they are:

Re-appearing indices: Say that your Raft library has some method Start() that takes a command, and returns the index at which that command was placed in the log (so that you know when to return to the client, as discussed above). You might assume that you will never see Start() return the same index twice, or at the very least, that if you see the same index again, the command that first returned that index must have failed. It turns out that neither of these things are true, even if no servers crash.

Consider the following scenario with five servers, S1 through S5. Initially, S1 is the leader, and its log is empty.

  1. Two client operations (C1 and C2) arrive on S1.
  2. Start() returns 1 for C1, and 2 for C2.
  3. S1 sends out an AppendEntries to S2 containing C1 and C2, but all its other messages are lost.
  4. S3 steps forward as a candidate.
  5. S1 and S2 won’t vote for S3, but S3, S4, and S5 all will, so S3 becomes the leader.
  6. Another client request, C3, comes in to S3.
  7. S3 calls Start() (which returns 1).
  8. S3 sends an AppendEntries to S1, who discards C1 and C2 from its log, and adds C3.
  9. S3 fails before sending AppendEntries to any other servers.
  10. S1 steps forward, and because its log is up-to-date, it is elected leader.
  11. Another client request, C4, arrives at S1.
  12. S1 calls Start(), which returns 2 (which was also returned for Start(C2)).
  13. All of S1’s AppendEntries are dropped, and S2 steps forward.
  14. S1 and S3 won’t vote for S2, but S2, S4, and S5 all will, so S2 becomes leader.
  15. A client request C5 comes in to S2.
  16. S2 calls Start(), which returns 3.
  17. S2 successfully sends AppendEntries to all the servers, which S2 reports back to the servers by including an updated leaderCommit = 3 in the next heartbeat.
Since S2’s log is [C1 C2 C5], this means that the entry that committed (and was applied at all servers, including S1) at index 2 is C2. This despite the fact that C4 was the last client operation to have returned index 2 at S1.

The four-way deadlock: All credit for finding this goes to Steven Allen, another 6.824 TA. He found the following nasty four-way deadlock that you can easily get into when building applications on top of Raft.

Your Raft code, however it is structured, likely has a Start()-like function that allows the application to add new commands to the Raft log. It also likely has a loop that, when commitIndex is updated, calls apply() on the application for every element in the log between lastApplied and commitIndex. These routines probably both take some lock a. In your Raft-based application, you probably call Raft’s Start() function somewhere in your RPC handlers, and you have some code somewhere else that is informed whenever Raft applies a new log entry. Since these two need to communicate (i.e., the RPC method needs to know when the operation it put into the log completes), they both probably take some lock b.

In Go, these four code segments probably look something like this:

```go
func (a *App) RPC(args interface{}, reply interface{}) {
    // ...
    a.mutex.Lock()
    i := a.raft.Start(args)
    // update some data structure so that apply knows to poke us later
    a.mutex.Unlock()
    // wait for apply to poke us
    return
}

func (r *Raft) Start(cmd interface{}) int {
    r.mutex.Lock()
    // do things to start agreement on this new command
    // store index in the log where cmd was placed
    r.mutex.Unlock()
    return index
}

func (a *App) apply(index int, cmd interface{}) {
    a.mutex.Lock()
    switch cmd := cmd.(type) {
    case GetArgs:
        // do the get
        // see who was listening for this index
        // poke them all with the result of the operation
        // ...
    }
    a.mutex.Unlock()
}

func (r *Raft) AppendEntries(...) {
    // ...
    r.mutex.Lock()
    // ...
    for r.lastApplied < r.commitIndex {
        r.lastApplied++
        r.app.apply(r.lastApplied, r.log[r.lastApplied])
    }
    // ...
    r.mutex.Unlock()
}
```

Consider now if the system is in the following state: