
I have two .csv files:

  • Table A with 32,075,892 rows, which takes 2,023,365 KB
  • Table B with 21,383,928 rows, which only takes 1,051,836 KB

Both tables have the same number of columns with about the same content (an Id, an integer, a short string of fixed size, a numeric, another string). The only difference is that in table A the string values of the last column are slightly longer: 26.83 characters on average, compared to 9.

I read and wrote both .csv files with fread and fwrite from the data.table package in R.

Table A has 50% more rows than B, but its file is nearly twice the size. What is the reason for the large difference in file size?

Hi Ruben, have you checked the encoding of the files? – Yuriy Barvinchenko
No, but I wrote both files using the same R function, so I would guess both files are encoded the same way. – Ruben
UTF-8 uses 2 or more bytes for non-ASCII characters, compared to 1 byte per character in ASCII. If you have some non-ASCII symbols, that may be the cause. You can open these files with something like Notepad and see the encoding. – Yuriy Barvinchenko
The files are too large to be opened in Notepad or Notepad++. I am looking for a way to check the encoding. – Ruben
Right, the files are too big. Maybe you can check with smaller files (some subset of the data)? – Yuriy Barvinchenko
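The comments above ask how to check a large file's encoding without opening the whole thing in an editor. One option is to read only the first few kilobytes and look for a BOM or non-ASCII bytes. A minimal sketch in Python (the function name and sample size are my own, not from the thread):

```python
def sniff_encoding(path, sample_size=4096):
    """Guess a text file's encoding from its first few bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # A UTF-8 byte-order mark at the start is a strong signal.
    if sample.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        sample.decode("ascii")
        return "ascii"      # pure ASCII: exactly 1 byte per character
    except UnicodeDecodeError:
        pass
    try:
        sample.decode("utf-8")
        return "utf-8"      # valid UTF-8 with some multi-byte characters
    except UnicodeDecodeError:
        return "unknown"
```

This only samples the start of the file, so it can miss non-ASCII characters further in; for a definitive answer you would scan the whole file the same way in chunks.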

1 Answer


You can calculate the average record length in bytes for the two files:

    long recordLengthFileA = (2023365L * 1024) / 32075892;  // ≈ 64
    long recordLengthFileB = (1051836L * 1024) / 21383928;  // ≈ 50

This gives average record lengths of about 64 and 50 bytes, a difference of 14 bytes, which is close to the difference between the average lengths of the last field: 26.83 - 9 = 17.83 characters. In an ASCII (or ASCII-only UTF-8) file, each extra character in a field costs exactly one extra byte per row, so the longer last column accounts for most of the size difference.
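The arithmetic above can be checked directly. A quick Python sketch, using the row counts and file sizes from the question:

```python
# Row counts and on-disk sizes (KB) as reported in the question.
rows_a, size_a_kb = 32_075_892, 2_023_365
rows_b, size_b_kb = 21_383_928, 1_051_836

# Average bytes per record = total bytes / number of rows.
avg_len_a = size_a_kb * 1024 / rows_a   # ≈ 64.6 bytes per row
avg_len_b = size_b_kb * 1024 / rows_b   # ≈ 50.4 bytes per row

print(round(avg_len_a, 1), round(avg_len_b, 1),
      round(avg_len_a - avg_len_b, 1))
# → 64.6 50.4 14.2
```

The ~14-byte gap per row is in the same ballpark as the 17.83-character difference in the last column, leaving a few bytes unexplained by other minor differences (e.g. digit counts in the numeric columns).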