I have a pair of DRBD servers that have stopped syncing, and nothing I've tried gets them to sync again. Replication runs over a dedicated crossover cable (1 Gbps copper) between the two servers.

Here is what I see in the logs for r02:

Aug  9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug  9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug  9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug  9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug  9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug  9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug  9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug  9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug  9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug  9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug  9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug  9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug  9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug  9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug  9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug  9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug  9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug  9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug  9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected ) 
Aug  9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated

And for r01:

Aug  9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug  9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug  9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug  9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug  9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug  9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug  9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource ) 
Aug  9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug  9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) 
Aug  9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug  9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug  9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug  9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected ) 
Aug  9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated

This just repeats over and over.
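
As a sanity check on the numbers in the logs above, here is a minimal sketch of the arithmetic. It assumes the `s` suffix means 512-byte sectors and that each DRBD bitmap bit covers 4 KiB; the helper names are only illustrative.

```shell
# Convert a DRBD bitmap bit count to KB (assumes one bit covers 4 KiB).
bits_to_kb() { echo $(( $1 * 4 )); }

# Convert a count of 512-byte sectors to KB.
sectors_to_kb() { echo $(( $1 / 2 )); }

bits_to_kb 32062701        # 128250804, matches "will sync 128250804 KB"
sectors_to_kb 256503768    # 128251884, the smaller backing device (r02)
sectors_to_kb 1344982880   # 672491440, the larger backing device (r01)
```

Under those assumptions the resync size corresponds to the smaller of the two backing devices, so every block of the volume is marked out-of-sync.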

The config is the same on both servers as it should be:

r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/

sent 11 bytes  received 51 bytes  124.00 bytes/sec
total size is 615  speedup is 9.92 (DRY RUN)

This is what the config looks like:

r01:~$ cat /etc/drbd.conf
global {
   usage-count no;
}

resource drbd0 {
  protocol C;
  handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
  startup {
    degr-wfc-timeout 60;    # 1 minute.
    wfc-timeout 55;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on r01.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p3;
    address    10.0.255.253:7788;
    meta-disk  internal;
  }

  on r02.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p6;
    address    10.0.255.254:7788;
    meta-disk  internal;
  }
}

Here is what the network config looks like on both sides:

r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:26:55:d6:f8:fc  
          inet addr:10.0.255.253  Bcast:10.0.255.255  Mask:255.255.255.0
          inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5512604514975 (5.0 TiB)  TX bytes:5820995499388 (5.2 TiB)
          Interrupt:24 Memory:fbe80000-fbea0000 

r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:1b:78:5c:a8:fd  
          inet addr:10.0.255.254  Bcast:10.0.255.255  Mask:255.255.255.252
          inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
          TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:332813827055 (309.9 GiB)  TX bytes:328142295363 (305.6 GiB)
          Interrupt:17 Memory:fdfa0000-fdfc0000 

Originally, both r01 and r02 were running Debian Squeeze (DRBD 8.3.7). Then I rebuilt r02 with Debian Wheezy (DRBD 8.3.13). Things ran smoothly for a few days, and then, after a restart of DRBD, this problem started. I have several other DRBD clusters that I've been upgrading in the same way. Some of them are fully upgraded to Wheezy; others are still half Squeeze, half Wheezy and are fine.

Here is what I've tried so far to resolve this issue:

  • wipe the drbd volume on r02 and try to resync
  • wipe, reinstall, and reconfig r02.
  • replace r02 with different hardware, and rebuild from scratch.
  • replace the crossover cable (twice)

Over the next several days I will be replacing r01 with completely different hardware. But even if that works, I am still at a loss. I really want to understand what caused this issue and the proper way to resolve it.

Are the DRBD versions the same on both nodes? The kernel module version (and git hash) should be displayed in /proc/drbd. – Dok
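
A minimal sketch of that check, to be run on each node. It assumes the first line of /proc/drbd begins with `version:` followed by the version number; the function name `drbd_version` is just illustrative.

```shell
# Print the DRBD version number from /proc/drbd-style text on stdin.
drbd_version() {
    awk '/^version:/ { print $2; exit }'
}

# Only readable when the drbd kernel module is loaded.
if [ -r /proc/drbd ]; then
    drbd_version < /proc/drbd
fi
```

Comparing the output from both nodes shows which kernel module each side is actually running, which can differ from the version of the installed userland tools.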

1 Answer

A lot of things changed in DRBD between 8.3.7 and 8.3.13, including major changes to how resyncs work: https://blogs.linbit.com/p/128/drbd-sync-rate-controller/

You could try removing any non-required settings from your resource configuration (that is, the syncer {} section) and then have DRBD apply the change:

# drbdadm adjust all
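
A hedged sketch of how that could be done a little more cautiously, on both nodes, after editing /etc/drbd.conf. The wrapper function `apply_drbd_config` is hypothetical; `dump`, `adjust`, and the `-d` (dry-run) flag are standard drbdadm usage. Run it as root.

```shell
# Validate the edited config, preview the changes, then apply them.
apply_drbd_config() {
    if ! command -v drbdadm >/dev/null 2>&1; then
        echo "drbdadm not found" >&2
        return 1
    fi
    # Parse the config; a syntax error makes this fail before anything changes.
    drbdadm dump all >/dev/null || return 1
    # Dry run: print the commands adjust would execute, without running them.
    drbdadm -d adjust all
    # Apply the changed configuration to the running resources.
    drbdadm adjust all
}
```

After calling `apply_drbd_config`, `cat /proc/drbd` shows whether the connection state has moved past the repeating reconnect loop.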

If it still doesn't connect, you might have to upgrade the older node to get them syncing: http://www.drbd.org/download/drbd/8.3/drbd-8.3.13.tar.gz