Students' Guide to Raft (Translation)
Original: https://thesquareplanet.com/blog/students-guide-to-raft/
For the past few months, I have been a Teaching Assistant for MIT's 6.824 Distributed Systems class. The class has traditionally had a number of labs building on the Paxos consensus algorithm, but this year, we decided to make the move to Raft. Raft was "designed to be easy to understand", and our hope was that the change might make the students' lives easier.
This post, and the accompanying Instructors’ Guide to Raft post, chronicles our journey with Raft, and will hopefully be useful to implementers of the Raft protocol and students trying to get a better understanding of Raft’s internals. If you are looking for a Paxos vs Raft comparison, or for a more pedagogical analysis of Raft, you should go read the Instructors’ Guide. The bottom of this post contains a list of questions commonly asked by 6.824 students, as well as answers to those questions. If you run into an issue that is not listed in the main content of this post, check out the Q&A. The post is quite long, but all the points it makes are real problems that a lot of 6.824 students (and TAs) ran into. It is a worthwhile read.
### Background
Before we dive into Raft, some context may be useful. 6.824 used to have a set of Paxos-based labs that were built in Go; Go was chosen both because it is easy to learn for students, and because it is pretty well-suited for writing concurrent, distributed applications (goroutines come in particularly handy). Over the course of four labs, students build a fault-tolerant, sharded key-value store. The first lab had them build a consensus-based log library, the second added a key-value store on top of that, and the third sharded the key space among multiple fault-tolerant clusters, with a fault-tolerant shard master handling configuration changes. We also had a fourth lab in which the students had to handle the failure and recovery of machines, both with and without their disks intact. This lab was available as a default final project for students.
This year, we decided to rewrite all these labs using Raft. The first three labs were all the same, but the fourth lab was dropped as persistence and failure recovery is already built into Raft. This article will mainly discuss our experiences with the first lab, as it is the one most directly related to Raft, though I will also touch on building applications on top of Raft (as in the second lab).
Raft, for those of you who are just getting to know it, is best described by the text on the protocol's web site:
Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.
Visualizations like this one give a good overview of the principal components of the protocol, and the paper gives good intuition for why the various pieces are needed. If you haven't already read the extended Raft paper, you should go read that before continuing this article, as I will assume a decent familiarity with Raft.
As with all distributed consensus protocols, the devil is very much in the details. In the steady state where there are no failures, Raft's behavior is easy to understand, and can be explained in an intuitive manner. For example, it is simple to see from the visualizations that, assuming no failures, a leader will eventually be elected, and eventually all operations sent to the leader will be applied by the followers in the right order. However, when delayed messages, network partitions, and failed servers are introduced, each and every if, but, and and becomes crucial. In particular, there are a number of bugs that we see repeated over and over again, simply due to misunderstandings or oversights when reading the paper. This problem is not unique to Raft, and is one that comes up in all complex distributed systems that provide correctness.
### Implementing Raft
The ultimate guide to Raft is in Figure 2 of the Raft paper. This figure specifies the behavior of every RPC exchanged between Raft servers, gives various invariants that servers must maintain, and specifies when certain actions should occur. We will be talking about Figure 2 a lot in the rest of this article. It needs to be followed to the letter.
Figure 2 defines what every server should do, in every state, for every incoming RPC, as well as when certain other things should happen (such as when it is safe to apply an entry in the log). At first, you might be tempted to treat Figure 2 as sort of an informal guide; you read it once, and then start coding up an implementation that follows roughly what it says to do. Doing this, you will quickly get up and running with a mostly working Raft implementation. And then the problems start.
In fact, Figure 2 is extremely precise, and every single statement it makes should be treated, in specification terms, as MUST, not as SHOULD. For example, you might reasonably reset a peer's election timer whenever you receive an `AppendEntries` or `RequestVote` RPC, as both indicate that some other peer either thinks it's the leader, or is trying to become the leader. Intuitively, this means that we shouldn't be interfering. However, if you read Figure 2 carefully, it says:
If election timeout elapses without receiving `AppendEntries` RPC from current leader or granting vote to candidate: convert to candidate.
The distinction turns out to matter a lot, as the former implementation can result in significantly reduced liveness in certain situations.
#### The importance of details
To make the discussion more concrete, let us consider an example that tripped up a number of 6.824 students. The Raft paper mentions heartbeat RPCs in a number of places. Specifically, a leader will occasionally (at least once per heartbeat interval) send out an `AppendEntries` RPC to all peers to prevent them from starting a new election. If the leader has no new entries to send to a particular peer, the `AppendEntries` RPC contains no entries, and is considered a heartbeat.
Many of our students assumed that heartbeats were somehow "special"; that when a peer receives a heartbeat, it should treat it differently from a non-heartbeat `AppendEntries` RPC. In particular, many would simply reset their election timer when they received a heartbeat, and then return success, without performing any of the checks specified in Figure 2. This is extremely dangerous. By accepting the RPC, the follower is implicitly telling the leader that their log matches the leader's log up to and including the `prevLogIndex` included in the `AppendEntries` arguments. Upon receiving the reply, the leader might then decide (incorrectly) that some entry has been replicated to a majority of servers, and start committing it.
- Note: even when receiving a heartbeat, you must still check the index (`prevLogIndex`).
Another issue many had (often immediately after fixing the issue above) was that, upon receiving a heartbeat, they would truncate the follower's log following `prevLogIndex`, and then append any entries included in the `AppendEntries` arguments. This is also not correct. We can once again turn to Figure 2:
If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it.
The if here is crucial. If the follower has all the entries the leader sent, the follower MUST NOT truncate its log. Any elements following the entries sent by the leader MUST be kept. This is because we could be receiving an outdated `AppendEntries` RPC from the leader, and truncating the log would mean "taking back" entries that we may have already told the leader that we have in our log.
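To make these two rules concrete, here is a minimal sketch of what an `AppendEntries` handler might look like in Go. The type and field names (`Entry`, `AppendEntriesArgs`, a dummy entry at `log[0]`, and so on) are assumptions for illustration rather than the lab's required interface; the points to take away are that the `prevLogIndex` check runs even for an empty heartbeat, and that the log is only truncated at an actual conflict.

```go
package raft

import "sync"

// Hypothetical types; only the fields used below are shown.
type Entry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []Entry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	mu          sync.Mutex
	currentTerm int
	log         []Entry // log[0] is a dummy entry, so real indices start at 1
	commitIndex int
}

func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	reply.Term = rf.currentTerm
	if args.Term < rf.currentTerm {
		reply.Success = false // stale leader
		return
	}
	// (converting to follower on a newer term, resetting the election
	// timer, persistence, etc. are omitted from this sketch)

	// Consistency check: runs even when len(args.Entries) == 0 (a heartbeat).
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}

	// Append/overwrite entries, but only truncate on an actual conflict.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(rf.log) && rf.log[idx].Term == e.Term {
			continue // already have this entry; do NOT touch what follows
		}
		rf.log = append(rf.log[:idx], args.Entries[i:]...)
		break
	}

	// Rule #5: advance commitIndex using the index of the last *new* entry.
	if args.LeaderCommit > rf.commitIndex {
		lastNew := args.PrevLogIndex + len(args.Entries)
		rf.commitIndex = min(args.LeaderCommit, lastNew)
	}
	reply.Success = true
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```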
### Debugging Raft
Inevitably, the first iteration of your Raft implementation will be buggy. So will the second. And third. And fourth. In general, each one will be less buggy than the previous one, and, from experience, most of your bugs will be a result of not faithfully following Figure 2.
When debugging Raft, there are generally four main sources of bugs: livelocks, incorrect or incomplete RPC handlers, failure to follow The Rules, and term confusion. Deadlocks are also a common problem, but they can generally be debugged by logging all your locks and unlocks, and figuring out which locks you are taking, but not releasing. Let us consider each of these in turn:
#### Livelocks
When your system livelocks, every node in your system is doing something, but collectively your nodes are in such a state that no progress is being made. This can happen fairly easily in Raft, especially if you do not follow Figure 2 religiously. One livelock scenario comes up especially often; no leader is being elected, or once a leader is elected, some other node starts an election, forcing the recently elected leader to abdicate immediately.
There are many reasons why this scenario may come up, but there is a handful of mistakes that we have seen numerous students make:
- Make sure you reset your election timer exactly when Figure 2 says you should. Specifically, you should only restart your election timer if a) you get an `AppendEntries` RPC from the current leader (i.e., if the term in the `AppendEntries` arguments is outdated, you should not reset your timer); b) you are starting an election; or c) you grant a vote to another peer (a sketch of this, together with the rule in the last bullet, follows the list).

  This last case is especially important in unreliable networks where it is likely that followers have different logs; in those situations, you will often end up with only a small number of servers that a majority of servers are willing to vote for. If you reset the election timer whenever someone asks you to vote for them, this makes it equally likely for a server with an outdated log to step forward as for a server with a longer log.

  In fact, because there are so few servers with sufficiently up-to-date logs, those servers are quite unlikely to be able to hold an election in sufficient peace to be elected. If you follow the rule from Figure 2, the servers with the more up-to-date logs won't be interrupted by outdated servers' elections, and so are more likely to complete the election and become the leader.
- Follow Figure 2's directions as to when you should start an election. In particular, note that if you are a candidate (i.e., you are currently running an election), but the election timer fires, you should start another election. This is important to avoid the system stalling due to delayed or dropped RPCs.
- Ensure that you follow the second rule in "Rules for Servers" before handling an incoming RPC. The second rule states:

  If RPC request or response contains term `T > currentTerm`: set `currentTerm = T`, convert to follower (§5.1)
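As a rough illustration of the first and last points, here is one possible shape for the term rule and the timer-reset cases. The names `observeTerm`, `electionResetEvent`, `state`, `votedFor`, and `Follower` are made up for this sketch (imports and argument types follow the Figure 2 field names), and `candidateLogUpToDate` is sketched in a later section.

```go
// Rules for Servers, rule 2: must be consulted for every RPC request and reply.
func (rf *Raft) observeTerm(term int) {
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.state = Follower
		rf.votedFor = -1
	}
}

func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.observeTerm(args.Term)
	reply.Term = rf.currentTerm

	if args.Term < rf.currentTerm {
		return // stale candidate; VoteGranted stays false
	}
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.candidateLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
		// Case (c): the vote was granted, so the election timer is reset.
		rf.electionResetEvent = time.Now()
	}
	// If the vote was NOT granted, the timer is deliberately left alone.
}

// In AppendEntries, case (a): reset the timer only when the RPC really is
// from the current leader, i.e. its term is not outdated:
//
//	if args.Term == rf.currentTerm {
//		rf.electionResetEvent = time.Now()
//	}
//
// Case (b) is the moment this peer becomes a candidate and starts an election.
```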
#### Incorrect RPC handlers
Even though Figure 2 spells out exactly what each RPC handler should do, some subtleties are still easy to miss. Here are a handful that we kept seeing over and over again, and that you should keep an eye out for in your implementation:
- If a step says "reply false", this means you should reply immediately, and not perform any of the subsequent steps.
- If you get an `AppendEntries` RPC with a `prevLogIndex` that points beyond the end of your log, you should handle it the same as if you did have that entry but the term did not match (i.e., reply false).
- Check 2 for the `AppendEntries` RPC handler should be executed even if the leader didn't send any entries.
- The `min` in the final step (#5) of `AppendEntries` is necessary, and it needs to be computed with the index of the last new entry. It is not sufficient to simply have the function that applies things from your log between `lastApplied` and `commitIndex` stop when it reaches the end of your log. This is because you may have entries in your log that differ from the leader's log after the entries that the leader sent you (which all match the ones in your log). Because #3 dictates that you only truncate your log if you have conflicting entries, those won't be removed, and if `leaderCommit` is beyond the entries the leader sent you, you may apply incorrect entries.
- It is important to implement the "up-to-date log" check exactly as described in section 5.4. No cheating and just checking the length! (A sketch of this check follows the list.)
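For reference, here is a sketch of the section 5.4.1 comparison mentioned in the last bullet, assuming hypothetical `lastLogIndex()`/`lastLogTerm()` helpers on the Raft struct:

```go
// candidateLogUpToDate reports whether a candidate's log is at least as
// up-to-date as ours, per section 5.4.1. It is a sketch, not the required API.
func (rf *Raft) candidateLogUpToDate(candLastIndex, candLastTerm int) bool {
	myLastIndex, myLastTerm := rf.lastLogIndex(), rf.lastLogTerm()
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // the higher last term wins
	}
	return candLastIndex >= myLastIndex // same term: the longer (or equal) log wins
}
```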
#### Failure to follow The Rules
While the Raft paper is very explicit about how to implement each RPC handler, it also leaves the implementation of a number of rules and invariants unspecified. These are listed in the "Rules for Servers" block on the right hand side of Figure 2. While some of them are fairly self-explanatory, there are also some that require designing your application very carefully so that it does not violate The Rules:
- If `commitIndex > lastApplied` at any point during execution, you should apply a particular log entry. It is not crucial that you do it straight away (for example, in the `AppendEntries` RPC handler), but it is important that you ensure that this application is only done by one entity. Specifically, you will need to either have a dedicated "applier", or to lock around these applies, so that some other routine doesn't also detect that entries need to be applied and also tries to apply (a sketch of such a dedicated applier follows this list).
- Make sure that you check for `commitIndex > lastApplied` either periodically, or after `commitIndex` is updated (i.e., after `matchIndex` is updated). For example, if you check `commitIndex` at the same time as sending out `AppendEntries` to peers, you may have to wait until the next entry is appended to the log before applying the entry you just sent out and got acknowledged.
- If a leader sends out an `AppendEntries` RPC, and it is rejected, but not because of log inconsistency (this can only happen if our term has passed), then you should immediately step down, and not update `nextIndex`. If you do, you could race with the resetting of `nextIndex` if you are re-elected immediately.
- A leader is not allowed to update `commitIndex` to somewhere in a previous term (or, for that matter, a future term). Thus, as the rule says, you specifically need to check that `log[N].term == currentTerm`. This is because Raft leaders cannot be sure an entry is actually committed (and will not ever be changed in the future) if it's not from their current term. This is illustrated by Figure 8 in the paper.
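One way to satisfy the first two rules is a single, dedicated applier goroutine. The sketch below assumes a lab-style `ApplyMsg` (with `CommandValid`, `Command`, `CommandIndex`) plus hypothetical `applyCh`, `lastApplied`, `killed()`, and `applyCond` (a `sync.Cond` sharing the Raft mutex) names; treat it as one possible shape, not the required one.

```go
// applier is the only goroutine that ever touches lastApplied or sends on applyCh.
func (rf *Raft) applier() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	for !rf.killed() {
		if rf.commitIndex > rf.lastApplied {
			rf.lastApplied++
			msg := ApplyMsg{
				CommandValid: true,
				Command:      rf.log[rf.lastApplied].Command,
				CommandIndex: rf.lastApplied,
			}
			rf.mu.Unlock()
			rf.applyCh <- msg // never send on the channel while holding the lock
			rf.mu.Lock()
		} else {
			rf.applyCond.Wait() // woken whenever commitIndex advances
		}
	}
}

// Wherever commitIndex is advanced (leader commit rule, or in AppendEntries):
//
//	rf.commitIndex = newCommitIndex
//	rf.applyCond.Broadcast()
```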
One common source of confusion is the difference between `nextIndex` and `matchIndex`. In particular, you may observe that `matchIndex = nextIndex - 1`, and simply not implement `matchIndex`. This is not safe. While `nextIndex` and `matchIndex` are generally updated at the same time to a similar value (specifically, `nextIndex = matchIndex + 1`), the two serve quite different purposes. `nextIndex` is a guess as to what prefix the leader shares with a given follower. It is generally quite optimistic (we share everything), and is moved backwards only on negative responses. For example, when a leader has just been elected, `nextIndex` is set to be the index at the end of the log. In a way, `nextIndex` is used for performance – you only need to send these things to this peer.
`matchIndex` is used for safety. It is a conservative measurement of what prefix of the log the leader shares with a given follower. `matchIndex` cannot ever be set to a value that is too high, as this may cause the `commitIndex` to be moved too far forward. This is why `matchIndex` is initialized to -1 (i.e., we agree on no prefix), and only updated when a follower positively acknowledges an `AppendEntries` RPC.
#### Term confusion
Term confusion refers to servers getting confused by RPCs that come from old terms. In general, this is not a problem when receiving an RPC, since the rules in Figure 2 say exactly what you should do when you see an old term. However, Figure 2 generally doesn't discuss what you should do when you get old RPC replies. From experience, we have found that by far the simplest thing to do is to first record the term in the reply (it may be higher than your current term), and then to compare the current term with the term you sent in your original RPC. If the two are different, drop the reply and return. Only if the two terms are the same should you continue processing the reply. There may be further optimizations you can do here with some clever protocol reasoning, but this approach seems to work well. And not doing it leads down a long, winding path of blood, sweat, tears and despair.
A related, but not identical problem is that of assuming that your state has not changed between when you sent the RPC, and when you received the reply. A good example of this is setting `matchIndex = nextIndex - 1`, or `matchIndex = len(log)`, when you receive a response to an RPC. This is not safe, because both of those values could have been updated since when you sent the RPC. Instead, the correct thing to do is update `matchIndex` to be `prevLogIndex + len(entries[])` from the arguments you sent in the RPC originally.
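Putting the two previous points together, a leader's reply handling might look roughly like the sketch below. `sendAppendEntries`, `state`, `Leader`, and `observeTerm` (from the earlier sketch) are assumed names; the important lines are recording the term the RPC was sent in, dropping replies from other terms, and computing `matchIndex` from the arguments that were actually sent.

```go
func (rf *Raft) sendEntriesTo(peer int) {
	rf.mu.Lock()
	if rf.state != Leader {
		rf.mu.Unlock()
		return
	}
	termWhenSent := rf.currentTerm
	prev := rf.nextIndex[peer] - 1
	args := &AppendEntriesArgs{
		Term:         termWhenSent,
		PrevLogIndex: prev,
		PrevLogTerm:  rf.log[prev].Term,
		Entries:      append([]Entry(nil), rf.log[prev+1:]...), // copy under the lock
		LeaderCommit: rf.commitIndex,
	}
	rf.mu.Unlock()

	var reply AppendEntriesReply
	if !rf.sendAppendEntries(peer, args, &reply) {
		return // RPC failed; just retry later
	}

	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.observeTerm(reply.Term) // step down if the reply carries a newer term
	if rf.currentTerm != termWhenSent {
		return // term confusion: this reply belongs to an old term; drop it
	}
	if reply.Success {
		// NOT nextIndex-1 or len(log): both may have changed since we sent the RPC.
		rf.matchIndex[peer] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[peer] = rf.matchIndex[peer] + 1
	} else {
		rf.nextIndex[peer]-- // or use the accelerated backoff sketched later
	}
}
```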
### An aside on optimizations
The Raft paper includes a couple of optional features of interest. In 6.824, we require the students to implement two of them: log compaction (section 7) and accelerated log backtracking (top left hand side of page 8). The former is necessary to avoid the log growing without bound, and the latter is useful for bringing stale followers up to date quickly.
These features are not a part of "core Raft", and so do not receive as much attention in the paper as the main consensus protocol. Log compaction is covered fairly thoroughly (in Figure 13), but leaves out some design details that you might miss if you read it too casually:
- When snapshotting application state, you need to make sure that the application state corresponds to the state following some known index in the Raft log. This means that the application either needs to communicate to Raft what index the snapshot corresponds to, or that Raft needs to delay applying additional log entries until the snapshot has been completed.
- The text does not discuss the recovery protocol for when a server crashes and comes back up now that snapshots are involved. In particular, if Raft state and snapshots are committed separately, a server could crash between persisting a snapshot and persisting the updated Raft state. This is a problem, because step 7 in Figure 13 dictates that the Raft log covered by the snapshot must be discarded.

  If, when the server comes back up, it reads the updated snapshot, but the outdated log, it may end up applying some log entries that are already contained within the snapshot. This happens since the `commitIndex` and `lastApplied` are not persisted, and so Raft doesn't know that those log entries have already been applied. The fix for this is to introduce a piece of persistent state to Raft that records what "real" index the first entry in Raft's persisted log corresponds to. This can then be compared to the loaded snapshot's `lastIncludedIndex` to determine what elements at the head of the log to discard (a sketch of this index bookkeeping follows the list).
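One possible shape for that bookkeeping, assuming a hypothetical persisted `lastIncludedIndex` field and the dummy entry at slice position 0 from the earlier sketch:

```go
// lastIncludedIndex is the "real" log index that rf.log[0] stands in for,
// i.e. the index the latest snapshot ends at. It is persisted with the log.

// sliceIndex translates a real log index into a position in rf.log.
func (rf *Raft) sliceIndex(realIndex int) int {
	return realIndex - rf.lastIncludedIndex
}

func (rf *Raft) entryAt(realIndex int) Entry {
	return rf.log[rf.sliceIndex(realIndex)]
}

// On recovery, after reading both the persisted log and the snapshot,
// discard any head of the log that the snapshot already covers.
func (rf *Raft) reconcileWithSnapshot(snapIndex, snapTerm int) {
	if snapIndex <= rf.lastIncludedIndex {
		return // the log already starts at or after the snapshot
	}
	cut := rf.sliceIndex(snapIndex)
	if cut < len(rf.log) && rf.log[cut].Term == snapTerm {
		rf.log = append([]Entry{{Term: snapTerm}}, rf.log[cut+1:]...) // keep the tail
	} else {
		rf.log = []Entry{{Term: snapTerm}} // snapshot covers everything we have
	}
	rf.lastIncludedIndex = snapIndex
}
```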
The accelerated log backtracking optimization is very underspecified, probably because the authors do not see it as being necessary for most deployments. It is not clear from the text exactly how the conflicting index and term sent back from the follower should be used by the leader to determine what `nextIndex` to use. We believe the protocol the authors probably want you to follow is:
- If a follower does not have `prevLogIndex` in its log, it should return with `conflictIndex = len(log)` and `conflictTerm = None`.
- If a follower does have `prevLogIndex` in its log, but the term does not match, it should return `conflictTerm = log[prevLogIndex].Term`, and then search its log for the first index whose entry has term equal to `conflictTerm`.
- Upon receiving a conflict response, the leader should first search its log for `conflictTerm`. If it finds an entry in its log with that term, it should set `nextIndex` to be the one beyond the index of the last entry in that term in its log.
- If it does not find an entry with that term, it should set `nextIndex = conflictIndex`.

A half-way solution is to just use `conflictIndex` (and ignore `conflictTerm`), which simplifies the implementation, but then the leader will sometimes end up sending more log entries to the follower than is strictly necessary to bring them up to date.
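Here is a sketch of that protocol, with `ConflictIndex`/`ConflictTerm` as reply fields you would add yourself (they are not part of Figure 2), and snapshot offsets ignored for simplicity:

```go
// Follower side, invoked when the prevLogIndex/prevLogTerm check fails.
func (rf *Raft) fillConflict(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if args.PrevLogIndex >= len(rf.log) {
		reply.ConflictIndex = len(rf.log) // case 1: we don't have prevLogIndex at all
		reply.ConflictTerm = -1           // "None"
		return
	}
	reply.ConflictTerm = rf.log[args.PrevLogIndex].Term // case 2: term mismatch
	i := args.PrevLogIndex
	for i > 0 && rf.log[i-1].Term == reply.ConflictTerm {
		i-- // walk back to the first entry of that term
	}
	reply.ConflictIndex = i
}

// Leader side, on a rejection caused by log inconsistency.
func (rf *Raft) backOff(peer int, reply *AppendEntriesReply) {
	rf.nextIndex[peer] = reply.ConflictIndex
	if reply.ConflictTerm != -1 {
		for i := len(rf.log) - 1; i > 0; i-- {
			if rf.log[i].Term == reply.ConflictTerm {
				rf.nextIndex[peer] = i + 1 // one beyond our last entry in that term
				break
			}
		}
	}
}
```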
### Applications on top of Raft
When building a service on top of Raft (such as the key/value store in the second 6.824 Raft lab), the interaction between the service and the Raft log can be tricky to get right. This section details some aspects of the development process that you may find useful when building your application.
#### Applying client operations
You may be confused about how you would even implement an application in terms of a replicated log. You might start off by having your service, whenever it receives a client request, send that request to the leader, wait for Raft to apply something, do the operation the client asked for, and then return to the client. While this would be fine in a single-client system, it does not work for concurrent clients.
Instead, the service should be constructed as a state machine where client operations transition the machine from one state to another. You should have a loop somewhere that takes one client operation at the time (in the same order on all servers – this is where Raft comes in), and applies each one to the state machine in order. This loop should be the only part of your code that touches the application state (the key/value mapping in 6.824). This means that your client-facing RPC methods should simply submit the client’s operation to Raft, and then wait for that operation to be applied by this “applier loop”. Only when the client’s command comes up should it be executed, and any return values read out. Note that this includes read requests!
This brings up another question: how do you know when a client operation has completed? In the case of no failures, this is simple – you just wait for the thing you put into the log to come back out (i.e., be passed to `apply()`). When that happens, you return the result to the client. However, what happens if there are failures? For example, you may have been the leader when the client initially contacted you, but someone else has since been elected, and the client request you put in the log has been discarded. Clearly you need to have the client try again, but how do you know when to tell them about the error?
One simple way to solve this problem is to record where in the Raft log the client's operation appears when you insert it. Once the operation at that index is sent to `apply()`, you can tell whether or not the client's operation succeeded based on whether the operation that came up for that index is in fact the one you put there. If it isn't, a failure has happened and an error can be returned to the client.
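A sketch of that bookkeeping, assuming a hypothetical key/value server with a `pending` map from log index to a channel, and an `Op` carrying the client identifier and sequence number discussed in the next section (imports such as `time` omitted):

```go
func (kv *KVServer) waitForIndex(op Op) (Op, bool) {
	kv.mu.Lock()
	// Holding kv.mu across Start is fine here because this Raft delivers
	// applied entries on a channel instead of calling back into the server.
	index, _, isLeader := kv.rf.Start(op)
	if !isLeader {
		kv.mu.Unlock()
		return Op{}, false
	}
	ch := make(chan Op, 1)
	kv.pending[index] = ch
	kv.mu.Unlock()

	select {
	case applied := <-ch:
		// Success only if the op that came up at this index is really ours.
		return applied, applied.ClientId == op.ClientId && applied.Seq == op.Seq
	case <-time.After(500 * time.Millisecond):
		return Op{}, false // lost leadership or the entry was discarded; the client retries
	}
}

// In the loop that reads applied entries, after applying the op at index i:
//
//	if ch, ok := kv.pending[i]; ok {
//		ch <- appliedOp
//		delete(kv.pending, i)
//	}
```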
#### Duplicate detection
As soon as you have clients retry operations in the face of errors, you need some kind of duplicate detection scheme – if a client sends an `APPEND` to your server, doesn't hear back, and re-sends it to the next server, your `apply()` function needs to ensure that the `APPEND` isn't executed twice. To do so, you need some kind of unique identifier for each client request, so that you can recognize if you have seen, and more importantly, applied, a particular operation in the past. Furthermore, this state needs to be a part of your state machine so that all your Raft servers eliminate the same duplicates.
There are many ways of assigning such identifiers. One simple and fairly efficient one is to give each client a unique identifier, and then have them tag each request with a monotonically increasing sequence number. If a client re-sends a request, it re-uses the same sequence number. Your server keeps track of the latest sequence number it has seen for each client, and simply ignores any operation that it has already seen.
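A minimal sketch of such a scheme, with made-up `Op` fields and a `lastSeq` map that lives inside the replicated state machine (so every replica filters the same duplicates):

```go
// ClientId and Seq are attached by the client to every request; a retry
// re-uses the same Seq, so the state machine can recognize it.
type Op struct {
	ClientId int64
	Seq      int64
	Kind     string // "Get", "Put", or "Append"
	Key      string
	Value    string
}

func (kv *KVServer) applyOp(op Op) {
	if op.Kind != "Get" && op.Seq <= kv.lastSeq[op.ClientId] {
		return // duplicate of an operation we have already applied
	}
	kv.lastSeq[op.ClientId] = op.Seq
	switch op.Kind {
	case "Put":
		kv.data[op.Key] = op.Value
	case "Append":
		kv.data[op.Key] += op.Value
	}
}
```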
#### Hairy corner-cases
If your implementation follows the general outline given above, there are at least two subtle issues you are likely to run into that may be hard to identify without some serious debugging. To save you some time, here they are:
**Re-appearing indices**: Say that your Raft library has some method `Start()` that takes a command, and returns the index at which that command was placed in the log (so that you know when to return to the client, as discussed above). You might assume that you will never see `Start()` return the same index twice, or at the very least, that if you see the same index again, the command that first returned that index must have failed. It turns out that neither of these things are true, even if no servers crash.
Consider the following scenario with five servers, S1 through S5. Initially, S1 is the leader, and its log is empty.
- Two client operations (C1 and C2) arrive on S1.
- `Start()` returns 1 for C1, and 2 for C2.
- S1 sends out an `AppendEntries` to S2 containing C1 and C2, but all its other messages are lost.
- S3 steps forward as a candidate.
- S1 and S2 won't vote for S3, but S3, S4, and S5 all will, so S3 becomes the leader.
- Another client request, C3, comes in to S3.
- S3 calls `Start()` (which returns 1).
- S3 sends an `AppendEntries` to S1, who discards C1 and C2 from its log, and adds C3.
- S3 fails before sending `AppendEntries` to any other servers.
- S1 steps forward, and because its log is up-to-date, it is elected leader.
- Another client request, C4, arrives at S1.
- S1 calls `Start()`, which returns 2 (which was also returned for `Start(C2)`).
- All of S1's `AppendEntries` are dropped, and S2 steps forward.
- S1 and S3 won't vote for S2, but S2, S4, and S5 all will, so S2 becomes leader.
- A client request C5 comes in to S2.
- S2 calls `Start()`, which returns 3.
- S2 successfully sends `AppendEntries` to all the servers, which S2 reports back to the servers by including an updated `leaderCommit = 3` in the next heartbeat.

Since S2's log is `[C1 C2 C5]`, this means that the entry that committed (and was applied at all servers, including S1) at index 2 is C2. This despite the fact that C4 was the last client operation to have returned index 2 at S1.

**The four-way deadlock**: All credit for finding this goes to Steven Allen, another 6.824 TA. He found the following nasty four-way deadlock that you can easily get into when building applications on top of Raft.
Your Raft code, however it is structured, likely has a `Start()`-like function that allows the application to add new commands to the Raft log. It also likely has a loop that, when `commitIndex` is updated, calls `apply()` on the application for every element in the log between `lastApplied` and `commitIndex`. These routines probably both take some lock `a`. In your Raft-based application, you probably call Raft's `Start()` function somewhere in your RPC handlers, and you have some code somewhere else that is informed whenever Raft applies a new log entry. Since these two need to communicate (i.e., the RPC method needs to know when the operation it put into the log completes), they both probably take some lock `b`.
In Go, these four code segments probably look something like this:
```go
func (a *App) RPC(args interface{}, reply interface{}) {
	// ...
	a.mutex.Lock()
	i := a.raft.Start(args)
	// update some data structure so that apply knows to poke us later
	a.mutex.Unlock()
	// wait for apply to poke us
	return
}

func (r *Raft) Start(cmd interface{}) int {
	r.mutex.Lock()
	// do things to start agreement on this new command
	// store index in the log where cmd was placed
	r.mutex.Unlock()
	return index
}

func (a *App) apply(index int, cmd interface{}) {
	a.mutex.Lock()
	switch cmd := cmd.(type) {
	case GetArgs:
		// do the get
		// see who was listening for this index
		// poke them all with the result of the operation
		// ...
	}
	a.mutex.Unlock()
}

func (r *Raft) AppendEntries(...) {
	// ...
	r.mutex.Lock()
	// ...
	for r.lastApplied < r.commitIndex {
		r.lastApplied++
		r.app.apply(r.lastApplied, r.log[r.lastApplied])
	}
	// ...
	r.mutex.Unlock()
}
```
Consider now if the system is in the following state:
- `App.RPC` has just taken `a.mutex` and called `Raft.Start`
- `Raft.Start` is waiting for `r.mutex`
- `Raft.AppendEntries` is holding `r.mutex`, and has just called `App.apply`

We now have a deadlock, because:

- `Raft.AppendEntries` won't release the lock until `App.apply` returns.
- `App.apply` can't return until it gets `a.mutex`.
- `a.mutex` won't be released until `App.RPC` returns.
- `App.RPC` won't return until `Raft.Start` returns.
- `Raft.Start` can't return until it gets `r.mutex`.
- `Raft.Start` has to wait for `Raft.AppendEntries`.
There are a couple of ways you can get around this problem. The easiest one is to take `a.mutex` after calling `a.raft.Start` in `App.RPC`. However, this means that `App.apply` may be called for the operation that `App.RPC` just called `Raft.Start` on *before* `App.RPC` has a chance to record the fact that it wishes to be notified. Another scheme that may yield a neater design is to have a single, dedicated thread calling `r.app.apply` from `Raft`. This thread could be notified every time `commitIndex` is updated, and would then not need to hold a lock in order to apply, breaking the deadlock.