
I have a server/client system that runs well on my machines, but it core dumps on one user's machine (OS: CentOS 5). Since I don't have access to the user's machine, I built a debug-mode binary and asked the user to try it. The crash happened again after around 2 days of running, and he sent me the core dump file. Loading the core dump into gdb shows the crash location, but I don't understand the reason (sorry, my previous experience is mostly with Windows; I don't have much experience with Linux/gdb). I would like to have your input. Thanks!

1. /var/log/messages on the user's machine shows the segfault:

Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at 0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4

This message indicates a segfault at instruction pointer 80e6433 with stack pointer f2afd4e0. It looks like the program tried to read or write at address 0.

2. Loading the core dump file into gdb shows the crash location:
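For reference, the `error 4` field in the kernel line can be decoded bit by bit. The sketch below assumes the standard x86 page-fault error code layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* x86 page-fault error code bits, as printed after "error" in the log line. */
#define PF_PROT  0x1u /* set: protection violation; clear: page not present */
#define PF_WRITE 0x2u /* set: write access; clear: read access */
#define PF_USER  0x4u /* set: fault raised in user mode; clear: kernel mode */

/* Hypothetical helpers: each returns 1 if the bit is set, 0 otherwise. */
int fault_is_protection(unsigned code) { return (code & PF_PROT) != 0; }
int fault_is_write(unsigned code)      { return (code & PF_WRITE) != 0; }
int fault_is_user(unsigned code)       { return (code & PF_USER) != 0; }

/* Writes a one-line description of `code` into `buf`. */
void describe_fault(unsigned code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s-mode %s, %s",
             fault_is_user(code) ? "user" : "kernel",
             fault_is_write(code) ? "write" : "read",
             fault_is_protection(code) ? "protection violation" : "page not present");
}
```

With the logged value, `describe_fault(4, buf, sizeof buf)` yields "user-mode read, page not present", which would make the access at address 0 a read rather than a write.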

$gdb LSystem core.19009

GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)

... (many lines of outputs from gdb omitted)

Core was generated by `./LSystem'.

Program terminated with signal 11, Segmentation fault.

'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214

214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);

gdb says the crash occurs at Line 214?

3. Frame information. (at Frame #0)

(gdb) info frame

Stack level 0, frame at 0xf2afd7e0:

eip = 0x80e6433 in CLClient::connectToServer (liccomm/LClient.cpp:214); saved eip 0x80e6701

called by frame at 0xf2afd820

source language c++.

Arglist at 0xf2afd7d8, args: this=0xf2afd898, conn=11

Locals at 0xf2afd7d8, Previous frame's sp is 0xf2afd7e0

Saved registers:

ebx at 0xf2afd7cc, ebp at 0xf2afd7d8, esi at 0xf2afd7d0, edi at 0xf2afd7d4, eip at 0xf2afd7dc

The frame is at f2afd7e0. Why is that different from the rsp in Part 1, which is f2afd4e0? I guess the user may have sent me a mismatched pair: a core dump file (whose pid is 19009) and a /var/log/messages file (which indicates pid 4688).
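As a quick check on the two addresses, the arithmetic can be spelled out. This is only a sketch; it assumes that gdb's "frame at" value is the caller-side frame address (the same value gdb prints as "Previous frame's sp"), so the stack pointer at the faulting instruction would sit below it by the size of the function's frame. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: distance in bytes between the frame address reported
   by gdb and the stack pointer logged by the kernel. The stack grows
   downward, so the frame address is expected to be the larger value. */
uint32_t frame_gap(uint32_t frame_addr, uint32_t stack_ptr)
{
    assert(frame_addr >= stack_ptr);
    return frame_addr - stack_ptr;
}
```

`frame_gap(0xf2afd7e0, 0xf2afd4e0)` is 0x300 (768 bytes), a plausible size for one function's locals. The instruction pointer is the same in both sources (80e6433 in the log, 0x080e6433 in gdb), so the differing stack values alone do not show that the two files are mismatched.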

4. The source

(gdb) list +

209
210         //pHost is declared as struct hostent* and 'pHost = gethostbyname(serverAddress);'
211         memset( &a4, 0, sizeof(a4) );
212         a4.sin_family = AF_INET;
213         a4.sin_port = htons( nPort );
214         memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
215
216         aalen = sizeof(a4);
217         aa = (struct sockaddr *)&a4;

I could not see anything wrong with Line 214. And this part of the code must have run many times during the 2 days of runtime.
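One thing the listing does not show is what happens when the lookup fails. `gethostbyname` returns NULL on failure, and the `hostent` it returns may live in static storage that a later call overwrites, so a defensive variant of lines 211-214 could look like the sketch below. The helper name `fill_ipv4_sockaddr` and its return convention are assumptions, not part of the original code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fills *a4 from a gethostbyname() result.
   Returns 0 on success, -1 if the result is unusable. */
int fill_ipv4_sockaddr(const struct hostent *pHost, unsigned short nPort,
                       struct sockaddr_in *a4)
{
    assert(a4 != NULL);

    /* Reject a failed lookup or an empty address list before dereferencing. */
    if (pHost == NULL || pHost->h_addr_list == NULL ||
        pHost->h_addr_list[0] == NULL)
        return -1;

    /* Only accept IPv4 results whose length matches sin_addr exactly. */
    if (pHost->h_addrtype != AF_INET ||
        pHost->h_length != (int)sizeof(a4->sin_addr))
        return -1;

    memset(a4, 0, sizeof(*a4));
    a4->sin_family = AF_INET;
    a4->sin_port = htons(nPort);
    memcpy(&a4->sin_addr, pHost->h_addr_list[0], sizeof(a4->sin_addr));
    return 0;
}
```

If several threads resolve names at the same time (an assumption; the question does not say), a reentrant lookup such as `getaddrinfo` would avoid sharing that static buffer.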

5. The variables

Since gdb indicated that Line 214 was the culprit, I printed everything on it.

memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);

(gdb) print a4.sin_addr

$1 = {s_addr = 0}

(gdb) print &(a4.sin_addr)

$2 = (in_addr *) 0xf2afd794

(gdb) print pHost->h_addr_list[0]

$3 = 0xa24af30 "\202}\204\250"

(gdb) print pHost->h_length

$4 = 4

(gdb) print memcpy

$5 = {} 0x2fcf90

So I basically printed everything on Line 214 ('pHost->h_addr_list[0]' is 'pHost->h_addr' because of '#define h_addr h_addr_list[0]').

I was not able to catch anything wrong. Did you catch anything fishy? Is it possible that memory was corrupted somewhere else? I appreciate your help!
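The bytes gdb printed for `$3` can be read as an IPv4 address. The sketch below (helper name hypothetical) formats `h_length` raw bytes with `inet_ntop`, assuming they are an IPv4 address in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: formats a raw 4-byte IPv4 address (network byte
   order) as dotted-decimal text. Returns 0 on success, -1 on failure. */
int format_ipv4(const char *raw, int length, char *out, socklen_t out_len)
{
    struct in_addr addr;

    assert(raw != NULL && out != NULL);

    /* h_length must match the size of an IPv4 address. */
    if (length != (int)sizeof(addr))
        return -1;

    memcpy(&addr, raw, sizeof(addr));
    return inet_ntop(AF_INET, &addr, out, out_len) != NULL ? 0 : -1;
}
```

Applied to the four bytes shown above, `\202}\204\250` (octal 202, '}', octal 204, octal 250), this gives 130.125.132.168, so the source buffer held a well-formed address at the time the core was written.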

[edited] 6. back trace

(gdb) bt

'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214

'#1' 0x080e6701 in CLClient::connectToLMServer (this=0xf2afd898) at liccomm/LClient.cpp:121

... (Frames 2~7 omitted, not relevant)

'#8' 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166

'#9' 0xf7f5fb41 in ?? ()

'#10' 0xf3563f98 in ?? ()

'#11' 0xf2aff31c in ?? ()

'#12' 0x00000000 in ?? ()

I followed the nested calls. They are correct.

The first thing to do with a core dump file is 'bt full', which gives you the complete call stack at the time the segfault happened. Most of the time the problem lies not at the line where the fault happens but some time before, producing malformed function calls that eventually cause the segfault. - Jan Henke
Thanks Jan! I did use "bt" and "bt full". The nested calls are valid. I will edit the question and post the output from bt. - Wayne Lawson
(gdb) bt #0 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214 ... (Frames 1~7 omitted, not relevant) #8 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166 #9 0xf7f5fb41 in ?? () #10 0xf3563f98 in ?? () #11 0xf2aff31c in ?? () #12 0x00000000 in ?? () - Wayne Lawson
Can you add the output of disass 0x080e6433? - Mark Plotnick
Thanks Mark! The output is long, so I put it here: link - Wayne Lawson

1 Answer


The problem with the memcpy is that the source location is not of the same type as the destination.

You should use inet_addr to convert addresses from string to binary:

a4.sin_addr = inet_addr(pHost->h_addr);

The previous code may not work as written, depending on the implementation (some may return struct in_addr, others will return unsigned long), but the principle is the same.