In the original posted code, result
undergoes an initial allocation via calloc
, which zero-initializes the content therein, and being pointers, null-initializes. Later on, when expanding the sequence via realloc, no such affordance is taken. In effect if the original array looked like this:
[ NULL, NULL ]
and after adding two elements, looks like this:
[ addr1, addr2 ]
the realloc kicks in and gives you this :
[ addr1, addr2, ????, ???? ]
Adding salt to the wound, getline
also requires the length argument being reflective of the allocation size present in the line. But you're carrying over the length from the prior loop iteration, so not only is the pointer wrong after the first expansion, the length is never correct after the first invocation of getline
(leading to your actual crash; the rest of the problems are just not something you saw yet).
Solving all of this
- Use a separate pointer and length for each iteration,
- Ensure they're properly initialized to null,0 before the
getline
call
- If you read the line successfully, then expand the line pointer buffer.
- Store the pointer, discard the length, and reset both to null,0 before the next iteration.
In practice, it looks like this:
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
char **readFile(const char *fileName, size_t *lines)
{
FILE *fp = fopen(fileName, "r");
if (fp == NULL)
exit(EXIT_FAILURE);
// initially empty, no size or capacity
char **result = NULL;
size_t size = 0;
size_t capacity = 0;
size_t len = 0;
char *line = NULL;
while (getline(&line, &len, fp) != -1)
{
if (size == capacity)
{
size_t new_capacity = (capacity ? 2 * capacity : 1);
void *tmp = realloc(result, new_capacity * sizeof *result);
if (tmp == NULL)
{
perror("Failed to expand lines buffer");
exit(EXIT_FAILURE);
}
// recoup the expanded buffer and capacity
result = tmp;
capacity = new_capacity;
}
result[size++] = line;
// reset these to NULL,0. they trigger the fresh allocation
// and size storage on the next iteration.
line = NULL;
len = 0;
}
// if getline allocated a buffer on the failure case
// get rid of it (didn't see that coming).
if (line)
free(line);
fclose(fp);
*lines = size;
return result;
}
int main()
{
size_t count = 0;
char **lines = readFile("/usr/share/dict/words", &count);
if (lines)
{
for (size_t i = 0; i < count; ++i)
{
fputs(lines[i], stdout);
free(lines[i]);
}
free(lines);
}
return 0;
}
On a stock Linux/Mac system /usr/share/dict/words contains about a quarter-million words in the English language. On my stock Mac, its 235886 (yours will vary). The callers gets the line pointer and the count, and is responsible for freeing the content therein.
Output
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
Aaronic
Aaronical
Aaronite
Aaronitic
Aaru
.... a ton of lines omitted ....
zymotically
zymotize
zymotoxic
zymurgy
Zyrenian
Zyrian
Zyryan
zythem
Zythia
zythum
Zyzomys
Zyzzogeton
Valgrind Summary
==17709==
==17709== HEAP SUMMARY:
==17709== in use at exit: 0 bytes in 0 blocks
==17709== total heap usage: 235,909 allocs, 235,909 frees, 32,506,328 bytes allocated
==17709==
==17709== All heap blocks were freed -- no leaks are possible
==17709==
Alternative: Let getline
reuse its buffer
There is no guarantee the buffer getline
allocates matches the line length efficiently. In fact, the only guarantee is, on successful execution, the function returns the number of chars including the delimiter (but not the terminator), and the memory holds that data. The actual allocation size could be considerably more than that, and that space is effectively wasted.
To demonstrate this, consider the following. The same code, but we do NOT reset the buffer on each loop, and rather than store its pointer directly, we store a strdup
of the line. Note this only works if the line does not contain embedded null chars. This allows getline
to reuse its buffer, and only expand if needed, for each read. We're responsible for making the actual copy of the line data (and we do using POSIX strdup
). When executed there are still no leaks, but note the valgrind summary, specifically the number of bytes allocated in comparison to the number of bytes from the previous version above.
char **readFile(const char *fileName, size_t *lines)
{
FILE *fp = fopen(fileName, "r");
if (fp == NULL)
exit(EXIT_FAILURE);
// initially empty, no size or capacity
char **result = NULL;
size_t size = 0;
size_t capacity = 0;
size_t len = 0;
char *line = NULL;
while (getline(&line, &len, fp) != -1)
{
if (size == capacity)
{
size_t new_capacity = (capacity ? 2 * capacity : 1);
void *tmp = realloc(result, new_capacity * sizeof *result);
if (tmp == NULL)
{
perror("Failed to expand lines buffer");
exit(EXIT_FAILURE);
}
// recoup the expanded buffer and capacity
result = tmp;
capacity = new_capacity;
}
// make copy here. let getline reuse 'line'
result[size++] = strdup(line);
}
// free whatever was left
if (line)
free(line);
fclose(fp);
*lines = size;
return result;
}
Valgrind Summary
==17898==
==17898== HEAP SUMMARY:
==17898== in use at exit: 0 bytes in 0 blocks
==17898== total heap usage: 235,909 allocs, 235,909 frees, 6,929,003 bytes allocated
==17898==
==17898== All heap blocks were freed -- no leaks are possible
==17898==
The number of allocations is the same (which tells us getline
allocated a large enough buffer up front to never need expansion), but the actual total allocated space is considerably more efficient, as now we are storing strings in buffers allocated to match their length; not whatever getline
stood up as a read buffer.
&(*(result + i))
==>result + i
. The address-of-dereference is completely pointless. Furthermore, the expansion region of the pointers referenced by the pointer-to-pointerresult
are indeterminate in the expanded region after therealloc
. Only the original sequence from thecalloc
will be null-initialized (e.g will still be even after therealloc
, and ultimately populated withgetline
acting on null population semantics). – WhozCraiggetline
requires the first arg to be either a pointer to NULL or a pointer to a valid memory block. Sincerealloc
does not init the exapnded part of the memory it contains uninitalised values and hence the error complaining thegetline
is using such values. – kaylum