0
votes

Our C++ application, which embeds Python, is crashing with a segmentation fault on Red Hat Linux.

Our constraints are as follows:

  1. I don't have access to the production machine where the application crashes. The client sends us core dump files when it crashes.
  2. The problem is not reproducible on our test machine, which has exactly the same configuration as the production machine.
  3. Sometimes the application crashes after 1 hour, sometimes after 4 hours, 1 day, or 1 week. We have not found any time frame or specific pattern in the crashes.
  4. The application is complex, and embedded Python code is called from many places within it. We have done extensive code reviews but could not find the cause that way.
  5. According to the stack trace in the core dump, it crashes around a multiplication operation. We reviewed our code for such an operation and found none. Such operations might be performed by Python scripts executed through the embedded interpreter, which we don't control and can't review.
  6. We can't use any profiling tool such as Valgrind in the production environment.
  7. We use gdb on our local machine to analyze the core dumps. We can't run gdb on the production machine.

Here is what we have tried so far:

  1. We analyzed the logs and continuously replayed the requests that reach our application against our test environment to reproduce the problem.
  2. The logs do not show the crash point; they differ every time. I think this is because memory is corrupted somewhere else and the application crashes some time later.
  3. We checked the load on our application at every point, and it never exceeded the application's limit.
  4. Memory utilization of our application is also normal.
  5. We profiled our application with Valgrind on our test machine and fixed the errors it reported, but the application still crashes.

I would appreciate any guidance on how to proceed.

Version details:

Red Hat Enterprise Linux Server 5.6 (Tikanga), Python 2.6.2, GCC 4.1

The following stack trace comes from the core dump files the client shared, analyzed with gdb on my machine. For reference, we don't have access to the production machine to run gdb on the core dumps there.

0  0x00000033c6678630 in ?? ()
1  0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2  0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3  0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4  0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5  0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6  PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7  0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8  PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9  0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()
Is the trace always the same? That error message appears pretty specific. Somewhere you are trying to do something like (1, 2, 3) * 2.5. Isn't that in itself a bug, which should be easy to track down since that's an unusual kind of expression? - Potatoswatter
What version of g++ are you using? - Alex Chamberlain
Yes, whenever we get a stack trace it looks the same. We reviewed the embedded Python code used in our C++ application but did not find such an operation. We also tried running a Python script containing an invalid multiplication through our embedded interpreter; the script simply failed to execute, and it did not crash our application. Thank you for your comment. - Jack
STOP! That was released on February 13, 2007. gcc.gnu.org/releases.html - Alex Chamberlain
Using GCC 4.1 is really crazy. It does not conform well to the C++ standard, optimizes poorly, and ships a buggy, non-conforming libstdc++; GCC has progressed a lot since. You could also compile with a recent Clang (from LLVM 3.2) in addition to a recent GCC (4.7) to get more warnings, etc. - Basile Starynkevitch

3 Answers

3
votes

You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.

  • Do not assume that the stack trace is relevant. It might be, but pointer misuse often leads to a crash some time later, far from the faulty code.
  • Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
  • Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and index using at().
  • Investigate your pointers. Replace them with std::unique_ptr (C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
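
To illustrate the array point, here is a minimal sketch of how at() turns an out-of-range index into an exception at the faulty access instead of a silent overwrite that crashes later. The helper name copy_range and its parameters are hypothetical, not taken from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy `count` values starting at `first` out of `source`.
// vector::at() throws std::out_of_range on a bad index; operator[] on the same
// index would be undefined behaviour and could corrupt memory silently.
std::vector<int> copy_range(const std::vector<int>& source,
                            std::size_t first, std::size_t count) {
    std::vector<int> result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        result.push_back(source.at(first + i));  // bounds-checked access
    }
    assert(result.size() == count);  // postcondition, active in debug builds
    return result;
}
```

Catching std::out_of_range near the top of a request handler and logging it would point at the faulty call directly, which the logs currently cannot do.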

Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
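
As a rough sketch of what that looks like, assuming a C++11 compiler (so newer than the GCC 4.1 in the question), ownership can be expressed in the function signatures. The Request type and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical request type, used only for illustration.
struct Request {
    explicit Request(const std::string& p) : payload(p) {}
    std::string payload;
};

// The return type documents that the caller becomes the sole owner.
std::unique_ptr<Request> make_request(const std::string& payload) {
    return std::unique_ptr<Request>(new Request(payload));
}

// Taking the unique_ptr by value documents that this function consumes the
// request; it is destroyed exactly once when `request` goes out of scope.
std::size_t consume(std::unique_ptr<Request> request) {
    assert(request);  // a null request is a caller bug
    return request->payload.size();
}
```

A caller writes consume(std::move(r)), after which r is empty. A later accidental use of r then typically fails at the point of use rather than reading freed memory and crashing somewhere unrelated.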

1
votes

First things first, compile both your binary and libpython with debug symbols and deploy them. The stack trace will be much easier to follow.

The relevant argument to g++ is -g.

1
votes

Suggestions:

  • As already suggested, provide a complete debug build.
  • Provide a memory test tool and a CPU torture test.
  • Load the debug symbols of the Python library when analyzing the core dump.
  • The stack trace shows something related to eval(), so I guess you generate and evaluate or execute code dynamically. If so, the actual error might be in that code or in the arguments passed to it. Assertions at every interface to that code, and dumps of the code itself, may help.
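
Since core dumps are the only thing you get back from production, one possible way to get such code dumps is to keep the most recent script sources in static memory, where they end up in the core file. This is only a sketch: the names g_recent_scripts and record_script and the buffer sizes are my own assumptions, and it is not thread-safe as written, so a multithreaded application would need a lock around it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical in-memory log of the most recent script sources handed to the
// embedded interpreter. It lives in static storage, so it is part of any core
// dump and can be printed from gdb (e.g. `print g_recent_scripts`).
const std::size_t kSlots = 8;
const std::size_t kSlotSize = 256;

struct RecentScripts {
    char text[kSlots][kSlotSize];
    std::size_t next;  // total number of scripts recorded so far
};

RecentScripts g_recent_scripts = {};

// Call this immediately before executing `source`. Returns the slot used.
std::size_t record_script(const char* source) {
    assert(source != NULL);  // interface check: never pass a null script
    const std::size_t slot = g_recent_scripts.next % kSlots;
    std::strncpy(g_recent_scripts.text[slot], source, kSlotSize - 1);
    g_recent_scripts.text[slot][kSlotSize - 1] = '\0';  // always terminated
    ++g_recent_scripts.next;
    return slot;
}
```

The slot written last is (next - 1) % kSlots, so the script that was running at the time of the crash can be read straight out of the core dump, truncated to the slot size.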