
I have two .csv files:

  • Table A with 32,075,892 rows, which takes 2,023,365 KB
  • Table B with 21,383,928 rows, which only takes 1,051,836 KB

Both tables have the same number of columns with about the same content (an Id, an integer, a short string of fixed size, a numeric, another string). The only difference is that in table A the string values of the last column are slightly longer: 26.83 characters on average, compared to 9.

I read and wrote both .csv files with fread and fwrite from the data.table package in R.

Table A has 50% more rows than B, but its file is nearly twice the size. What is the reason for the large difference in file size?

Hi Ruben, have you checked the encoding of the files? – Yuriy Barvinchenko
No, but I wrote both files using the same R function, so I would guess both files are encoded the same way. – Ruben
UTF-8 uses 2 or more bytes for non-ASCII characters, compared to 1 byte per character in ASCII. If you have some non-ASCII symbols, that may be the cause. You can open these files with something like Notepad and see the encoding. – Yuriy Barvinchenko
The files are too large to be opened in Notepad or Notepad++. I am looking for a way to check the encoding. – Ruben
Right, the files are too big. Maybe you can check with smaller files (some subset of the data)? – Yuriy Barvinchenko
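The comments above ask how to check a large file's encoding without opening the whole thing in an editor. One option is to read only the first few kilobytes and look for a BOM or non-ASCII bytes. A minimal sketch in Python (the function name and sample size are my own, not from the thread):

```python
def sniff_encoding(path, sample_size=4096):
    """Guess a text file's encoding from its first few bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # A UTF-8 byte-order mark at the start is a strong signal.
    if sample.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        sample.decode("ascii")
        return "ascii"      # pure ASCII: exactly 1 byte per character
    except UnicodeDecodeError:
        pass
    try:
        sample.decode("utf-8")
        return "utf-8"      # valid UTF-8 with some multi-byte characters
    except UnicodeDecodeError:
        return "unknown"
```

This only samples the start of the file, so it can miss non-ASCII characters further in; for a definitive answer you would scan the whole file the same way in chunks.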

1 Answer


You can calculate the average record length in bytes for the two files:

    long recordLengthFileA = (2023365L * 1024) / 32075892;  // ≈ 64
    long recordLengthFileB = (1051836L * 1024) / 21383928;  // ≈ 50

This gives average record lengths of about 64 and 50 bytes, a difference of 14 bytes, which is close to the difference between the average lengths of the last field: 26.83 - 9 = 17.83 characters. In an ASCII (or ASCII-only UTF-8) file, each extra character in a field costs exactly one extra byte per row, so the longer last column accounts for most of the size difference.
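The arithmetic above can be checked directly. A quick Python sketch, using the row counts and file sizes from the question:

```python
# Row counts and on-disk sizes (KB) as reported in the question.
rows_a, size_a_kb = 32_075_892, 2_023_365
rows_b, size_b_kb = 21_383_928, 1_051_836

# Average bytes per record = total bytes / number of rows.
avg_len_a = size_a_kb * 1024 / rows_a   # ≈ 64.6 bytes per row
avg_len_b = size_b_kb * 1024 / rows_b   # ≈ 50.4 bytes per row

print(round(avg_len_a, 1), round(avg_len_b, 1),
      round(avg_len_a - avg_len_b, 1))
# → 64.6 50.4 14.2
```

The ~14-byte gap per row is in the same ballpark as the 17.83-character difference in the last column, leaving a few bytes unexplained by other minor differences (e.g. digit counts in the numeric columns).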