What is the average size ratio between a data file and ETS table?

Question

I am evaluating Erlang ETS as a store for a large in-memory data set. My test data source is a CSV file that takes up only 350 MB on disk.

My parser reads the file row by row and splits each row into a list, then creates a tuple and stores it in ETS, using a "bag" configuration.

After loading all the data into ETS, I noticed that my computer's 8 GB of RAM was all gone and the OS had fallen back to virtual memory, with usage somewhere near 16 GB. The Erlang BEAM process seems to consume about 10 times more memory than the size of the data on disk.

Here is the test code:

-module(load_test_data).
-author("gextra").

%% API
-export([test/0]).

init_ets() ->
  ets:new(memdatabase, [bag, named_table]).

parse(File) ->
  {ok, F} = file:open(File, [read, raw]),
  parse(F, file:read_line(F), []).

parse(F, eof, Done) ->
  file:close(F),
  lists:reverse(Done);

parse(F, Line, Done) ->
  parse(F, file:read_line(F), [ parse_row_commodity_data(Line) | Done ]).

parse_row_commodity_data(Line) ->
  {ok, Data} = Line,
  %%io:fwrite(Data),
  LineList          = re:split(Data,"\,",[{return,list}]),
  ReportingCountry  = lists:nth(1, LineList),
  YearPeriod        = lists:nth(2, LineList),
  Year              = lists:nth(3, LineList),
  Period            = lists:nth(4, LineList),
  TradeFlow         = lists:nth(5, LineList), 
  Commodity         = lists:nth(6, LineList),
  PartnerCountry    = lists:nth(7, LineList),
  NetWeight         = lists:nth(8, LineList),
  Value             = lists:nth(9, LineList),
  IsReported        = lists:nth(10, LineList),
  ets:insert(memdatabase, {YearPeriod ++ ReportingCountry ++ Commodity , { ReportingCountry, Year, Period, TradeFlow, Commodity, PartnerCountry, NetWeight, Value, IsReported } }).


test() ->
  init_ets(),
  parse("/data/000-2010-1.csv").

An example of the tuples you store in the ETS table would help. It very much depends on what type of data you have and how you represent and store it. Showing how you read in the data would also help. — rvirding

Hynek -Pichi- Vychodil · Accepted Answer · 2014-02-22T21:24:04

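It strongly depends on what you mean by splitting a row into a list and then creating a tuple. Splitting into lists in particular can take a lot of memory: on a 64-bit VM each list element takes two 8-byte words, so one byte of input can occupy 16 bytes once it becomes a list of characters. For a 350 MB file, that easily adds up to 5.6 GB.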
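To check the 16-bytes-per-character figure on your own machine, here is a minimal sketch. The module name `mem_estimate` and its function names are made up for illustration, and `erts_debug:flat_size/1` is an undocumented debugging helper that reports a term's size in machine words, so treat the numbers as an estimate rather than an exact accounting.

```erlang
-module(mem_estimate).
-export([string_bytes/1, binary_bytes/1, blowup/1]).

%% Approximate heap bytes used by a string stored as a list of characters.
%% flat_size/1 returns words; each list element costs two words.
string_bytes(Str) when is_list(Str) ->
    erts_debug:flat_size(Str) * erlang:system_info(wordsize).

%% Payload bytes of the same data held as a binary (header overhead ignored).
binary_bytes(Bin) when is_binary(Bin) ->
    byte_size(Bin).

%% Ratio between the list representation and the raw bytes of a non-empty string.
blowup(Str) when is_list(Str), Str =/= [] ->
    string_bytes(Str) / binary_bytes(list_to_binary(Str)).
```

On a 64-bit VM, `mem_estimate:blowup("2010,US,0101")` should come out at 16.0, which is where the 350 MB to 5.6 GB estimate comes from.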

EDIT:

Try this:

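parse(File) ->
  %% Open in binary mode so each line is a compact binary, not a list of characters.
  {ok, F} = file:open(File, [read, raw, binary]),
  %% Compile the separator pattern once and reuse it for every row.
  ok = parse(F, binary:compile_pattern([<<$,>>, <<$\n>>])),
  ok = file:close(F).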

parse(F, CP) ->
  case file:read_line(F) of
    {ok, Line} ->
      parse_row_commodity_data(Line, CP),
      parse(F, CP);
    eof -> ok
  end.

parse_row_commodity_data(Line, CP) ->
  %% Split on commas and the line's newline; trim drops the empty trailing part.
  [ ReportingCountry, YearPeriod, Year, Period, TradeFlow, Commodity,
    PartnerCountry, NetWeight, Value, IsReported]
      = binary:split(Line, CP, [global, trim]),
  %% Use a tuple of binaries as the key instead of a concatenated string.
  true = ets:insert(memdatabase, {
         {YearPeriod, ReportingCountry, Commodity},
         { ReportingCountry, Year, Period, TradeFlow, Commodity,
           PartnerCountry, NetWeight, Value, IsReported}
       }).
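To get an actual ratio for your own data instead of a general average, one option is to ask ETS how much memory the table uses after loading. The sketch below is only an illustration: the module name `ets_ratio` and its functions are hypothetical. It relies on `ets:info(Tab, memory)`, which reports the number of words allocated to the table, and on `erlang:system_info(wordsize)` to convert words to bytes.

```erlang
-module(ets_ratio).
-export([table_bytes/1, ratio/2]).

%% Bytes allocated to an ETS table.
%% ets:info(Tab, memory) returns a size in machine words.
table_bytes(Tab) ->
    ets:info(Tab, memory) * erlang:system_info(wordsize).

%% Ratio of table size to the size of the source file in bytes.
%% FileBytes is passed in by the caller so the function stays free of file I/O.
ratio(Tab, FileBytes) when is_integer(FileBytes), FileBytes > 0 ->
    table_bytes(Tab) / FileBytes.
```

After loading, something like `ets_ratio:ratio(memdatabase, filelib:file_size("/data/000-2010-1.csv"))` would give the ratio for this particular file. Running it once with the list-based loader and once with the binary-based one should make the difference visible.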
