C - Error with read() of a file, storage in an array, and printing output properly

Question

I am new to C, so I am not exactly sure where my error is. However, I do know that the great portion of the issue lies either in how I am storing the doubles in the d_buffer (double) array or the way I am printing it.

Specifically, my output keeps printing extremely large numbers (with around 10-12 digits before the decimal point and a trail of zeros after it. Additionally, this is an adaptation of an older program to allow for double inputs, so I only really added the two if statements (in the "read" for loop and the "printf" for loop) and the d_buffer declaration.

I would appreciate any input whatsoever as I have spent several hours on this error.

#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>


struct DataDescription
{
   char fieldname[30];
   char fieldtype;
   int  fieldsize;
};




/* -----------------------------------------------
   eof(fd): returns 1 if file `fd' is out of data
   ----------------------------------------------- */
int eof(int fd)
{
   char c;

   if ( read(fd, &c, 1) != 1 )
      return(1);
   else
    { lseek(fd, -1, SEEK_CUR);
      return(0);
    }
}


void main()
{
   FILE *fp;    /* Used to access meta data */
   int fd;  /* Used to access user data */

   /* ----------------------------------------------------------------
      Variables to hold the description of the data  - max 10 fields
      ---------------------------------------------------------------- */
   struct DataDescription DataDes[10];  /* Holds data descriptions
                                           for upto 10 fields */
   int  n_fields;                       /* Actual # fields */

   /* ------------------------------------------------------
      Variables to hold the data  - max 10 fields....
      ------------------------------------------------------ */
   char c_buffer[10][100];  /* For character data */
   int  i_buffer[10];       /* For integer data */
   double d_buffer[10];

   int i, j;
   int found;

   printf("Program for searching a mini database:\n");

   /* =============================
      Read in meta information
      ============================= */
   fp = fopen("db-description", "r");
   n_fields = 0;
   while ( fscanf(fp, "%s %c %d", DataDes[n_fields].fieldname, 
          &DataDes[n_fields].fieldtype,
          &DataDes[n_fields].fieldsize) > 0 )
      n_fields++;

   /* ---
      Prints meta information
      --- */
   printf("\nThe database consists of these fields:\n");
   for (i = 0; i < n_fields; i++)
      printf("Index %d: Fieldname `%s',\ttype = %c,\tsize = %d\n", 
        i, DataDes[i].fieldname, DataDes[i].fieldtype,
        DataDes[i].fieldsize);
   printf("\n\n");

   /* ---
      Open database file
      --- */
   fd = open("db-data", O_RDONLY);

   /* ---
      Print content of the database file
      --- */
   printf("\nThe database content is:\n");
   while ( ! eof(fd) )
   {  /* ------------------
         Read next record
         ------------------ */
      for (j = 0; j < n_fields; j++)
      {  
     if ( DataDes[j].fieldtype == 'I' )
        read(fd, &i_buffer[j], DataDes[j].fieldsize);
     if ( DataDes[j].fieldtype == 'F' )
        read(fd, &d_buffer[j], DataDes[j].fieldsize);
     if ( DataDes[j].fieldtype == 'C' )
        read(fd, &c_buffer[j], DataDes[j].fieldsize);       
      }

      double d;
      /* ------------------
         Print it...
         ------------------ */
      for (j = 0; j < n_fields; j++)
      {   
     if ( DataDes[j].fieldtype == 'I' )
        printf("%d ", i_buffer[j]);
     if ( DataDes[j].fieldtype == 'F' )
        d = d_buffer[j];
        printf("%lf ", d);
     if ( DataDes[j].fieldtype == 'C' )
        printf("%s ", c_buffer[j]);
      }
      printf("\n");
   }
   printf("\n");
   printf("\n");


}

Expected Output: 3 rows of data ending with the number "e = 2.18281828"

To reproduce the problem, the following two files need to be in the same directory as the lookup-data.c file:
- [db-data][1]
- [db-description][2]

How does one populate the database and index to reproduce the problem? — sarnold
We can't reproduce your problem because we don't have the input files. If you can't share those files, could you please share the output you are getting? I'm interested to know what fieldsize is for the float values. — steveha
Please learn about the switch statement, and use it. And you aren't checking the return value of read ... doing that would be a lot better than your funky eof function. — Jim Balter
Your new printf() for the float value is using format code "%d"; it should be "%f". — steveha
I'm going to try to remember this forever: a big-endian 1 read on a little-endian machine prints as 16777216. That's 2 raised to the 24th power. It works the other way, too: a little-endian 1 read on a big-endian machine will also print as 16777216. If I had memorized that, I could have figured this out much faster! — steveha

steveha steveha · Accepted Answer · 2012-07-04T01:33:06

EDIT: My earlier speculations were all wrong. The problem was that the database file had big-endian numbers and the data were being read on a little-endian computer. Skip down to the section marked "START READING HERE" to avoid wasting your time with the early speculations (left here for their very limited historical value).

I suspect that your problem is related to the printf() format specification %lf used to print values, together with the fact that you declared d_buffer to be of type int. Likely sizeof(double) > sizeof(int) is true, so you are interpreting extra bytes of data to be part of the float values.

I don't know this for sure since I can't see the data your program is using to run, but if your fieldsize for float data is sizeof(float) rather than sizeof(double) then perhaps you are storing the float values into d_buffer just fine but messing them up when printing. Alternatively, if fieldsize is equal to sizeof(double), and sizeof(double) is greater than sizeof(int), then you are writing off the end of d_buffer and then something is corrupting your data.

I suggest you change the declaration to double d_buffer[10]; and see if your program works better. Also, see if fieldsize is set to sizeof(float) or sizeof(double); if it is sizeof(float) then declare d_buffer as type float and use this code:

  if ( DataDes[j].fieldtype == 'F' )
  {
    double d = d_buffer[j];
    printf("%lf ", d);
  }

EDIT: Also, I recommend that you switch to using fopen() and fread() for all your I/O. The basic open() and read() can return EINTR which means you need to retry the operation, so these are only properly used inside a loop that will retry the call if it returns EINTR. The fopen() and fread() calls insulate you from that sort of detail, and have some buffering that can make your programs run more smoothly. (I'm pretty sure your current project is only reading and writing small amounts of data, so the performance difference likely doesn't matter much to you at this time.)

Also, I suggest you get rid of your eof() function; it is very unusual and likely a bit slow to read a character and then use fseek() to put it back. The C library has the function fgetc() which gets one character from the input stream, and you can test that character to see if it is the EOF value to detect end-of-file. And the C library also has the function ungetc() that can push back one character; this will not involve actually seeking the disk, but rather will just put back the character into some buffer somewhere. But you don't even need to use fgetc() or ungetc() for your code; just check the return value from fread() and if you get a zero-length read, you know you hit end-of-file. In production code, you will need to check the return value of each function call anyway; you can't get away with just hoping that every read succeeds.

EDIT: One more thing you could try: change the format code from "%lf" to just "%f" and see what happens. I'm not exactly sure what that l will do and you shouldn't need it. Plain old "%f" should format a double. But it may not change anything: according to this web page I found, in printf() there is no difference between "%lf" and "%f".

http://www.dgp.toronto.edu/~ajr/209/notes/printf.html

* START READING HERE *

EDIT: Okay, one thing I have figured out for sure. The database format is an index value (an integer value), then a float value, then a string value. You will need to read each one to advance the current position within the file. So your current code that is checking format codes and deciding which thing to read? Not correct; you need to read an integer, a float, and a string for each record.

EDIT: Okay, here is a correct Python program that can read the database. It doesn't bother to read the metadata file; I just hard-coded in the constants (it's okay since it is just a throw-away program).

import struct

_format_rec = ">id20s"  # big-endian: int, double, 20-char string
_cb_rec = struct.calcsize(_format_rec) # count of bytes in this format

def read_records(fname):
    with open(fname) as in_f:
        try:
            while True:
                idx, f, s = struct.unpack(_format_rec, in_f.read(_cb_rec))
                # Python doesn't chop at NUL byte by default so do it now.
                s, _, _ = s.partition('\0')
                yield (idx, f, s)
        except struct.error:
            pass

if __name__ == "__main__":
    for i, (idx, f, s) in enumerate(read_records("db-data")):
        print "%d) index: %d\tfloat: %f \ttext: \"%s\"" % (i, idx, f, s)

So the index values are 32-bit integers, big-endian; the floats are 64-bit floats, big-endian; and the text fields are 20 characters fixed (so a string of 0 to 19 characters plus a terminating NUL byte).

Here's the output of the above program:

0) index: 1     float: 3.141593         text: "Pi"
1) index: 2     float: 12.345000        text: "Secret Key"
2) index: 3     float: 2.718282         text: "The number E"

Now, when I try to compile your C code, I get garbage values out because my computer is little-endian. Are you trying to run your C code on a little-endian computer?

EDIT: To answer the most recent comment, which is a question: For each input record, you must call read() three times. The first time you read the index, which is a 4-byte integer (big-endian). The second read is an 8-byte float value, also big-endian. The third read is 20 bytes as a string. Each read moves the current position in the file; the three reads together read a single record from the file. After you have read and printed three records, you are done.

Since my computer is little-endian, it's tricky to get the values out correctly in C, but I did it. I made a union that lets me read out an 8-byte value as an integer or as a float, and I used that as the buffer for the call to read(); then I called __builtin_bswap64() (a feature in GCC) to swap the big-endian value to little-endian, stored the result as a 64-bit integer, and read it out as a float. I also used __builtin_bswap32() to swap the integer index. My C program now prints:

The database content is:
1 3.141593 Pi
2 12.345000 Secret Key
3 2.718282 The number E

So, read each record, make sure you are correct with respect to endian-ness of the data, and you will have a working program.

EDIT: Here are code fragments that show how I fixed the endianness problem.

typedef union
{
    unsigned char buf[8];
    double d;
    int64_t i64;
    int32_t i32;
} U;

// then, inside of main():

   printf("\nThe database content is:\n");
   {  /* ------------------
         Read next record
         ------------------ */
      for (j = 0; j < n_fields; j++)
      {
        U u;
        read(fd, u.buf, 4);
        u.i32 = __builtin_bswap32(u.i32);
        i_buffer[j] = u.i32;
        read(fd, u.buf, 8);
        u.i64 = __builtin_bswap64(u.i64);
        d_buffer[j] = u.d;
        read(fd, c_buffer[j], 20);
      }

I'm sort of surprised the database was in big-endian format. Any computer using an x86 family processor is little-endian.

You should definitely not hard-code those numbers (4, 8, 20) like I did; you should use the metadata you received. I'll leave that for you.

EDIT: Also you shouldn't call __builtin_bswap32() or __builtin_bswap64(). You should call ntohl() and... I'm not sure what the 64-bit one is. But ntohl() is portable; it will skip the swap if you are compiling on a big-endian computer, and do the swap if you are compiling on a little-endian computer.

EDIT: I found a solution to a 64-bit equivalent to ntohl(), right here on StackOverflow.

https://stackoverflow.com/a/4410728/166949

It's simple if you only care about Linux: instead of using #include <arpa/inet.h> you can use #include <endian.h> and then use be32toh() and be64toh().

Once you have this you can read the database file like so:

u.i64 = be64toh(u.i64);

If you compile the above code on a big-endian machine, it will compile to a do-nothing and read the big-endian value. If you compile the above code on a little-endian machine, it will compile to the equivalent of __builtin_bswap64() and swap the bytes so the 64-bit value is read correctly.

EDIT: I said in a couple of places that you need to do three separate reads: one to get the index, one to get the float, and one to get the string. Actually, you can declare a struct and issue a single read that would pull in all the data in one read. The tricky part, though, is that the C compiler might insert "padding" bytes inside your struct that would cause the byte structure of the struct to not exactly match the byte structure of the record read from the file. The C compiler should provide a way to control the alignment bytes (a #pragma statement) but I didn't want to get into the details. For this simple program, simply reading three times solves the problem.

C - Error with read() of a file, storage in an array, and printing output properly

3 Answers