KUDU-2335. Work around rare consensus health bug for 1.7 release
author Mike Percy <mpercy@apache.org>
Tue, 13 Mar 2018 01:04:15 +0000 (18:04 -0700)
committer Mike Percy <mpercy@apache.org>
Tue, 13 Mar 2018 05:42:45 +0000 (05:42 +0000)
In very rare circumstances we have hit a DCHECK in quorum_util.cc in
pre-commit builds stating that the leader should always have a HEALTHY
health status. We have traced this to points in the replica lifecycle
when the health status could be UNKNOWN.

Since we want to release 1.7.0 soon, let's work around this issue for
now. We'll follow up with a "real" fix and a decent test later.

Change-Id: Iad67c7943a5b619ef2fa3a67c92cc033e207e197
Reviewed-on: http://gerrit.cloudera.org:8080/9597
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Tested-by: Mike Percy <mpercy@apache.org>
src/kudu/consensus/quorum_util.cc

index 2697911..97c006b 100644
@@ -27,6 +27,7 @@
 
 #include "kudu/common/common.pb.h"
 #include "kudu/gutil/map-util.h"
+#include "kudu/gutil/port.h"
 #include "kudu/gutil/strings/join.h"
 #include "kudu/gutil/strings/substitute.h"
 #include "kudu/util/pb_util.h"
@@ -506,9 +507,18 @@ bool ShouldEvictReplica(const RaftConfigPB& config,
     switch (peer.member_type()) {
       case RaftPeerPB::VOTER:
         // A leader should always report itself as being healthy.
-        DCHECK(peer_uuid != leader_uuid || healthy) << Substitute(
-            "$0: leader reported as not healthy; config: $1",
-            peer_uuid, SecureShortDebugString(config));
+        if (PREDICT_FALSE(peer_uuid == leader_uuid && !healthy)) {
+          LOG(WARNING) << Substitute("leader peer $0 reported health as $1; config: $2",
+                                     peer_uuid,
+                                     HealthReportPB_HealthStatus_Name(
+                                        peer.health_report().overall_health()),
+                                     SecureShortDebugString(config));
+          DCHECK(false) << "Found non-HEALTHY LEADER"; // Crash in DEBUG builds.
+          // TODO(KUDU-2335): We have seen this assertion in rare circumstances
+          // in pre-commit builds, so until we fix this lifecycle issue we
+          // simply do not evict any nodes when the leader is not HEALTHY.
+          return false;
+        }
 
         ++num_voters_total;
         if (healthy) {
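
For readers who want to try the workaround pattern outside the Kudu tree, here is a minimal, self-contained sketch. It is not the code from quorum_util.cc: `Health`, `Voter`, and `ShouldEvictVoterSketch` are hypothetical stand-ins for the protobuf types and for `ShouldEvictReplica`, the warning goes to `std::cerr` instead of glog, and the eviction rule (evict the first FAILED voter when a strict majority is HEALTHY) is a simplified assumption used only to make the example complete. The `DCHECK(false)` from the patch is left out so the sketch behaves the same in debug and release builds.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the protobuf types used in quorum_util.cc.
enum class Health { UNKNOWN, HEALTHY, FAILED };

struct Voter {
  std::string uuid;
  Health health;
};

// Returns true and sets *to_evict if some voter should be evicted.
// Assumed policy (not Kudu's): evict the first FAILED voter, but only
// when a strict majority of the voters is HEALTHY.
bool ShouldEvictVoterSketch(const std::vector<Voter>& voters,
                            const std::string& leader_uuid,
                            std::string* to_evict) {
  int num_healthy = 0;
  const Voter* candidate = nullptr;
  for (const Voter& v : voters) {
    const bool healthy = (v.health == Health::HEALTHY);
    // The workaround: a non-HEALTHY leader is unexpected, so log it and
    // decline to evict anything rather than act on a suspect report.
    if (v.uuid == leader_uuid && !healthy) {
      std::cerr << "WARNING: leader peer " << v.uuid
                << " reported as not healthy; skipping eviction\n";
      return false;
    }
    if (healthy) {
      ++num_healthy;
    } else if (v.health == Health::FAILED && candidate == nullptr) {
      candidate = &v;
    }
  }
  const int majority = static_cast<int>(voters.size()) / 2 + 1;
  if (candidate != nullptr && num_healthy >= majority) {
    *to_evict = candidate->uuid;
    return true;
  }
  return false;
}
```

With a HEALTHY leader and one FAILED follower out of three voters, the sketch picks the failed follower. If the leader itself reports UNKNOWN, it returns false and leaves `*to_evict` untouched, which mirrors the early `return false` added by the patch.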