Raft Lock 翻译

23 Jan, 2023

If you are wondering how to use locks in the 6.824 Raft labs, here are some rules and ways of thinking that might be helpful. 如果您想知道如何在 6.824 Raft 实验室中使用锁，这里是一些规则和思维方式可能会有所帮助。

Rule 1: Whenever you have data that more than one goroutine uses, and at least one goroutine might modify the data, the goroutines should use locks to prevent simultaneous use of the data. The Go race detector is pretty good at detecting violations of this rule (though it won’t help with any of the rules below). 规则 1：当您拥有多个 goroutine 使用的数据时，以及至少有一个 GoRoutine 可能会修改数据，GoRoutines 应该使用锁来防止同时使用数据。围棋比赛检测器非常擅长检测违反此规则的行为（尽管它对以下任何规则都没有帮助）。

Rule 2: Whenever code makes a sequence of modifications to shared data, and other goroutines might malfunction if they looked at the data midway through the sequence, you should use a lock around the whole sequence. 规则 2：每当代码对共享进行一系列修改时数据和其他 goroutines 如果查看数据在序列的中途，您应该在整个序列。 An example:

  rf.mu.Lock()
  rf.currentTerm += 1
  rf.state = Candidate
  rf.mu.Unlock()

It would be a mistake for another goroutine to see either of these updates alone (i.e. the old state with the new term, or the new term with the old state). So we need to hold the lock continuously over the whole sequence of updates. All other code that uses rf.currentTerm or rf.state must also hold the lock, in order to ensure exclusive access for all uses. 对于另一个goroutine来说，看到其中任何一个都是错误的单独更新（即旧状态与新术语或新术语与旧状态）。所以我们需要在整个更新序列。使用 rf.currentTerm 或 rf.state还必须持有锁，以确保独占访问适用于所有用途。 The code between Lock() and Unlock() is often called a “critical section.” The locking rules a programmer chooses (e.g. “a goroutine must hold rf.mu when using rf.currentTerm or rf.state”) are often called a “locking protocol”. Lock（）和 Unlock（）之间的代码通常被称为“关键部分。程序员选择的锁定规则（例如“goroutine” 必须保持 rf.mu 使用 rf.currentTerm 或 rf.state 时“）通常是称为“锁定协议”。 Rule 3: Whenever code does a sequence of reads of shared data (or reads and writes), and would malfunction if another goroutine modified the data midway through the sequence, you should use a lock around the whole sequence. 规则 3：每当代码对共享数据执行一系列读取时（或读取和写入），并且如果另一个 goroutine 修改，则会出现故障数据在序列的中间，您应该在整个序列。 An example that could occur in a Raft RPC handler:

 rf.mu.Lock()
  if args.Term > rf.currentTerm {
   rf.currentTerm = args.Term
  }
  rf.mu.Unlock()

This code needs to hold the lock continuously for the whole sequence. Raft requires that currentTerm only increases, and never decreases. Another RPC handler could be executing in a separate goroutine; if it were allowed to modify rf.currentTerm between the if statement and the update to rf.currentTerm, this code might end up decreasing rf.currentTerm. Hence the lock must be held continuously over the whole sequence. In addition, every other use of currentTerm must hold the lock, to ensure that no other goroutine modifies currentTerm during our critical section. 此代码需要在整个序列中连续保持锁。 Raft 要求当前期限只增加，永不减少。另一个 RPC 处理程序可以在单独的 goroutine中执行;如果它允许在 if 语句和更新到 rf.currentTerm，此代码最终可能会减少因此，锁必须持续保持在整个序列。此外，当前期限的所有其他使用都必须成立锁，以确保没有其他 goroutine 修改当前术语在我们的关键部分。 Real Raft code would need to use longer critical sections than these examples; for example, a Raft RPC handler should probably hold the lock for the entire handler. 真正的 Raft 代码需要使用比这些更长的关键部分例子;例如，Raft RPC 处理程序可能应该持有锁定整个处理程序。 Rule 4: It’s usually a bad idea to hold a lock while doing anything that might wait: reading a Go channel, sending on a channel, waiting for a timer, calling time.Sleep(), or sending an RPC (and waiting for the reply). One reason is that you probably want other goroutines to make progress during the wait. Another reason is deadlock avoidance. Imagine two peers sending each other RPCs while holding locks; both RPC handlers need the receiving peer’s lock; neither RPC handler can ever complete because it needs the lock held by the waiting RPC call. 规则4：在做任何事情时握住锁通常是一个坏主意可能会等待：读取 Go 通道、发送通道、等待对于计时器，调用时间。Sleep（），或发送 RPC（并等待回复）。一个原因是你可能希望其他goroutines制作等待期间的进度。另一个原因是避免死锁。想象两个对等体在保持锁的同时相互发送 RPC;两个 RPC 处理程序需要接收对等方的锁;RPC 处理程序都不能完成，因为它需要等待的 RPC 调用所持有的锁。 Code that waits should first release locks. If that’s not convenient, sometimes it’s useful to create a separate goroutine to do the wait. 等待的代码应首先释放锁。如果不方便，有时，创建一个单独的 goroutine 来执行等待是很有用的。 Rule 5: Be careful about assumptions across a drop and re-acquire of a lock. One place this can arise is when avoiding waiting with locks held. For example, this code to send vote RPCs is incorrect: 规则 5：小心跌落和重新获取的假设锁。可能出现的一个地方是避免用锁等待举行。例如，发送投票 RPC 的以下代码不正确：

  rf.mu.Lock()
  rf.currentTerm += 1
  rf.state = Candidate
  for <each peer> {
    go func() {
      rf.mu.Lock()
      args.Term = rf.currentTerm
      rf.mu.Unlock()
      Call("Raft.RequestVote", &args, ...)
      // handle the reply...
    } ()
  }
  rf.mu.Unlock()

The code sends each RPC in a separate goroutine. It’s incorrect because args.Term may not be the same as the rf.currentTerm at which the surrounding code decided to become a Candidate. Lots of time may pass between when the surrounding code creates the goroutine and when the goroutine reads rf.currentTerm; for example, multiple terms may come and go, and the peer may no longer be a candidate. One way to fix this is for the created goroutine to use a copy of rf.currentTerm made while the outer code holds the lock. Similarly, reply-handling code after the Call() must re-check all relevant assumptions after re-acquiring the lock; for example, it should check that rf.currentTerm hasn’t changed since the decision to become a candidate. 该代码在单独的 goroutines 中发送每个 RPC。这是不正确的因为参数。术语可能与 rf.currentTerm 不同，其中周围的代码决定成为候选人。很多时间可能会在周围的代码创建 goroutine 的时间和时间之间传递 goroutine读取rf.currentTerm;例如，多个术语可能来来去去，同伴可能不再是候选人。一种解决方法这是为了让创建的 goroutine 使用一个 rf.currentterm 的副本。而外部代码持有锁。同样，回复处理代码之后 Call（）必须重新检查所有相关假设之后重新获取锁;例如，它应该检查自从决定成为候选人。 It can be difficult to interpret and apply these rules. Perhaps most puzzling is the notion in Rules 2 and 3 of code sequences that shouldn’t be interleaved with other goroutines’ reads or writes. How can one recognize such sequences? How should one decide where a sequence ought to start and end? 解释和应用这些规则可能很困难。也许大多数令人费解的是代码序列规则 2 和 3 中的概念不应与其他 goroutines 的读取或写入交错。如何人们能识别出这样的序列吗？应该如何决定在哪里序列应该开始和结束吗？ One approach is to start with code that has no locks, and think carefully about where one needs to add locks to attain correctness. This approach can be difficult since it requires reasoning about the correctness of concurrent code. 一种方法是从没有锁的代码开始，然后思考仔细考虑需要在哪里添加锁以获得正确性。这种方法可能很困难，因为它需要推理并发代码的正确性。 A more pragmatic approach starts with the observation that if there were no concurrency (no simultaneously executing goroutines), you would not need locks at all. But you have concurrency forced on you when the RPC system creates goroutines to execute RPC handlers, and because you need to send RPCs in separate goroutines to avoid waiting. You can effectively eliminate this concurrency by identifying all places where goroutines start (RPC handlers, background goroutines you create in Make(), &c), acquiring the lock at the very start of each goroutine, and only releasing the lock when that goroutine has completely finished and returns. This locking protocol ensures that nothing significant ever executes in parallel; the locks ensure that each goroutine executes to completion before any other goroutine is allowed to start. With no parallel execution, it’s hard to violate Rules 1, 2, 3, or 5. If each goroutine’s code is correct in isolation (when executed alone, with no concurrent goroutines), it’s likely to still be correct when you use locks to suppress concurrency. So you can avoid explicit reasoning about correctness, or explicitly identifying critical sections. 一个更务实的方法是从观察开始，如果有没有并发（没有同时执行的 goroutines），你根本不需要锁。但是你有并发强迫你当 RPC 系统创建 goroutine 来执行 RPC 处理程序时，以及因为您需要在单独的 goroutine 中发送 RPC 以避免等待。您可以通过识别所有 goroutines 开始的地方（RPC 处理程序，后台 goroutines 你在 Make(), &c) 中创建，在每个函数的最开始获取锁 goroutine，并且只在该 goroutine 有时释放锁完全完成并返回。该锁定协议确保没有什么重要的事情是并行执行的；锁确保每个 goroutine 在任何其他 goroutine 之前执行完成允许开始。没有并行执行，很难违反规则 1、2、3 或 5。如果每个 goroutine 的代码在隔离时都是正确的（当单独执行时，没有并发的 goroutines），它很可能当你使用锁来抑制并发时仍然是正确的。那么你可以避免关于正确性的明确推理，或者明确识别关键部分。 However, Rule 4 is likely to be a problem. So the next step is to find places where the code waits, and to add lock releases and re-acquires (and/or goroutine creation) as needed, being careful to re-establish assumptions after each re-acquire. You may find this process easier to get right than directly identifying sequences that must be locked for correctness. 但是，规则4可能是一个问题。所以下一步是找到代码等待的地方，并添加锁定释放和重新获取（和/或goroutine创建）根据需要，小心重新建立每次重新获取后的假设。您可能会发现此过程更容易比直接识别必须锁定的序列更正确正确性。 (As an aside, what this approach sacrifices is any opportunity for better performance via parallel execution on multiple cores: your code is likely to hold locks when it doesn’t need to, and may thus unnecessarily prohibit parallel execution of goroutines. On the other hand, there is not much opportunity for CPU parallelism within a single Raft peer.) （顺便说一句，这种方法牺牲的是任何机会通过在多个内核上并行执行提高性能：您的代码可能会在不需要时保持锁，因此可能会因此不必要地禁止并行执行 goroutine。另一方面手，在 a 中没有太多 CPU 并行的机会单个筏对等方。