I have a pair of DRBD servers that have stopped syncing, and nothing I do gets them to sync up again. The sync runs over a dedicated crossover cable (1 Gbps copper) between the two servers.
Here is what I see in the logs for r02:
Aug 9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug 9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug 9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug 9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug 9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug 9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug 9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug 9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug 9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug 9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug 9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug 9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug 9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug 9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug 9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug 9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected )
Aug 9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated
And for r01:
Aug 9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug 9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug 9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug 9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug 9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug 9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug 9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Aug 9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource )
Aug 9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug 9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug 9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError )
Aug 9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug 9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug 9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug 9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected )
Aug 9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated
This just repeats over and over.
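As a sanity check on the numbers in the logs above, here is a minimal Python sketch of the arithmetic. It assumes that the "s" suffix means 512-byte sectors and that each bitmap bit covers 4 KiB; the second assumption is at least consistent with the "will sync 128250804 KB [32062701 bits set]" line. The function names are only illustrative.

```python
SECTOR_BYTES = 512   # assumption: the "s" suffix in the log means 512-byte sectors
KB_PER_BIT = 4       # assumption: one DRBD bitmap bit tracks 4 KiB of data


def sectors_to_gib(sectors):
    """Convert a sector count from the log into GiB."""
    return sectors * SECTOR_BYTES / 2**30


def bits_to_kb(bits):
    """Convert a count of out-of-sync bitmap bits into KB, as the log reports it."""
    return bits * KB_PER_BIT


r02_sectors = 256503768    # lower level device on r02 (from the log)
r01_sectors = 1344982880   # lower level device on r01 (from the log)
bits_set = 32062701        # "bits:32062701" in drbd_sync_handshake

resync_kb = bits_to_kb(bits_set)
print("r02 backing device: %.1f GiB" % sectors_to_gib(r02_sectors))
print("r01 backing device: %.1f GiB" % sectors_to_gib(r01_sectors))
print("resync size: %d KB (%.1f GiB)" % (resync_kb, resync_kb / 2**20))
```

Under those assumptions the resync size matches the log exactly and comes to about 122 GiB, which is essentially the whole of the smaller (r02) device. In other words, every attempt starts a full resync from scratch.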
The config is the same on both servers, as it should be (a checksum dry run with rsync lists no changes):
r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/
sent 11 bytes received 51 bytes 124.00 bytes/sec
total size is 615 speedup is 9.92 (DRY RUN)
This is what the config looks like:
r01:~$ cat /etc/drbd.conf
global {
    usage-count no;
}
resource drbd0 {
    protocol C;
    handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
    startup {
        degr-wfc-timeout 60; # 1 minute.
        wfc-timeout 55;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate 100M;
        al-extents 257;
    }
    on r01.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.0.255.253:7788;
        meta-disk internal;
    }
    on r02.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p6;
        address 10.0.255.254:7788;
        meta-disk internal;
    }
}
Here is what the network config looks like on both sides:
r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:26:55:d6:f8:fc
inet addr:10.0.255.253 Bcast:10.0.255.255 Mask:255.255.255.0
inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5512604514975 (5.0 TiB) TX bytes:5820995499388 (5.2 TiB)
Interrupt:24 Memory:fbe80000-fbea0000
r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:1b:78:5c:a8:fd
inet addr:10.0.255.254 Bcast:10.0.255.255 Mask:255.255.255.252
inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:332813827055 (309.9 GiB) TX bytes:328142295363 (305.6 GiB)
Interrupt:17 Memory:fdfa0000-fdfc0000
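The two interfaces report different netmasks (255.255.255.0 on 10.0.255.253 and 255.255.255.252 on 10.0.255.254). The following is a small sketch, using Python's standard `ipaddress` module, to check whether each address still falls inside the other side's subnet. The helper name `mutually_reachable` is made up for illustration, and this only checks addressing; it says nothing about whether the mismatch is related to the DRBD errors.

```python
import ipaddress

# Addresses and netmasks copied from the ifconfig output above.
side_253 = ipaddress.ip_interface("10.0.255.253/255.255.255.0")
side_254 = ipaddress.ip_interface("10.0.255.254/255.255.255.252")


def mutually_reachable(a, b):
    """True if each interface's address lies inside the other's subnet."""
    return b.ip in a.network and a.ip in b.network


print("subnet of .253:", side_253.network)
print("subnet of .254:", side_254.network)
print("same broadcast:",
      side_253.network.broadcast_address == side_254.network.broadcast_address)
print("mutually reachable:", mutually_reachable(side_253, side_254))
```

Both addresses sit inside both subnets and the broadcast address is the same on each side, so the mismatch looks cosmetic as far as direct traffic over the crossover link is concerned.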
Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). Then I rebuilt r02 with Debian Wheezy (drbd 8.3.13). Things ran smoothly for a few days, and then, after a restart of drbd, this problem started. I have several other drbd clusters that I've been upgrading in the same way. Some of them are fully upgraded to Wheezy; others are still half Squeeze, half Wheezy and are fine.
So far, here is what I've tried to resolve this issue:
- wipe the drbd volume on r02 and try to resync
- wipe, reinstall, and reconfig r02.
- replace r02 with different hardware, and rebuild from scratch.
- replace the crossover cable (twice)
Over the next several days I will be replacing r01 with 100% different hardware. But even if that works, I will still be at a loss. I really want to understand what caused this issue and the proper way to resolve it.