1 vote

I'm playing with ets tweaks, specifically with read_concurrency. I've written a simple test to measure how this tweak impacts read performance. The test implementations are here and there.

Briefly, this test sequentially creates three [public, set] ets tables with different read_concurrency options (no tweak at all, {read_concurrency, true}, and {read_concurrency, false}). After each table is created, the test runs N readers (N is a power of 2, from 4 to 1024). The readers then perform random reads for 10 seconds and report how many read operations they have performed.
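
For reference, the three variants boil down to table creation calls like the following (a rough sketch; the table names here are made up, the actual code is in the linked implementations):

T0 = ets:new(no_tweak, [public, set]),                            % default options
T1 = ets:new(rc_true,  [public, set, {read_concurrency, true}]),  % optimized for concurrent reads
T2 = ets:new(rc_false, [public, set, {read_concurrency, false}]). % explicitly disabled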

The result is quite surprising to me: there is absolutely no difference between these three tests. Here are the test results.

Non-tweaked table
   4 workers: 26610428 read operations
   8 workers: 26349134 read operations
  16 workers: 26682405 read operations
  32 workers: 26574700 read operations
  64 workers: 26722352 read operations
 128 workers: 26636100 read operations
 256 workers: 26714087 read operations
 512 workers: 27110860 read operations
1024 workers: 27545576 read operations

Read concurrency true
   4 workers: 30257820 read operations
   8 workers: 29991281 read operations
  16 workers: 30280695 read operations
  32 workers: 30066830 read operations
  64 workers: 30149273 read operations
 128 workers: 28409907 read operations
 256 workers: 28381452 read operations
 512 workers: 29253088 read operations
1024 workers: 30955192 read operations

Read concurrency false
   4 workers: 30774412 read operations
   8 workers: 29596126 read operations
  16 workers: 24963845 read operations
  32 workers: 29144684 read operations
  64 workers: 29862287 read operations
 128 workers: 25618461 read operations
 256 workers: 27457268 read operations
 512 workers: 28751960 read operations
1024 workers: 28790131 read operations

So I'm wondering: how should I implement my test to see any difference, and what is the real use case for this optimization?

I have run this test on the following installations:

  1. 2-core, 1 physical CPU, Erlang/OTP 17 [erts-6.1] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false] (example test output is from this run)
  2. 2-core, 1 physical CPU, Erlang/OTP 17 [erts-6.1] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:true]
  3. 8-core, 1 physical CPU, Erlang/OTP 17 [erts-6.4] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]
  4. 8-core, 1 physical CPU, Erlang/OTP 17 [erts-6.4] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:true]
  5. 64-core, 4 physical CPUs, Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:64:64] [async-threads:10] [hipe] [kernel-poll:false]
  6. 64-core, 4 physical CPUs, Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:64:64] [async-threads:10] [hipe] [kernel-poll:true]

The results are all the same (except for the absolute measurement values, of course). So could anybody tell me WHY? And what should I do to see any difference?

UPD According to Fred's answer, I've updated my test so that workers avoid mailbox thrashing. Unfortunately, there was no significant change in the results.

UPD Yet another implementation, following @Pascal's advice. Now all workers properly seed their random generators. Again, the same results.

One remark about the usage of random:uniform/1: as you spawn individual processes without initializing the random seed (a value stored silently in the process dictionary), all your processes will execute the same sequence, but maybe that is on purpose :o) – Pascal
Funny try =) I've tried to seed the random generator for each worker with erlang:now/0, but again no changes. – Viacheslav Kovalev
If you call random:seed(Seed) in each process it works. erlang:now() is guaranteed to return different values at each call; beware not to call it before spawning. – Pascal
Not sure I got you. Of course I've added random:seed(erlang:now()) to each worker process, and there is no doubt that all workers now run their own random sequences, but when I said "no changes again" I meant no changes in the performance sense. One idea I want to try is to make the ets keys more expensive to match. That's my last hope for this tweak. – Viacheslav Kovalev
Oooops, it was just a remark about the usage of the random lib, I didn't expect a change in the performance measurements. In the meantime I modified your example to run it outside of eunit: no influence. The only noticeable difference is between a single reader and multiple ones: with 1 reader it takes roughly twice the time. – Pascal
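
A minimal sketch of the per-worker seeding discussed in these comments (the function name and the Work fun are illustrative, not part of the linked test code):

spawn_seeded_readers(N, Work) ->
    [spawn(fun() ->
               random:seed(erlang:now()),  % each erlang:now() call returns a distinct value,
                                           % so every worker gets its own random sequence
               Work()
           end) || _ <- lists:seq(1, N)].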

2 Answers

2 votes

It's possible the brunt of your work tests the scheduling abilities of the node -- almost half the work done in your benchmark is polling your mailbox to know if you should exit. This usually requires the VM to switch each process, put it on the queue, run the others, check their mailboxes, etc. It's cheap to do, but so is reading ETS. It's quite possible you're creating a lot of noise.
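
For reference, the kind of per-iteration mailbox check being described looks roughly like this (an assumption about the linked benchmark, not a quote from it):

reader_loop(Table, Count) ->
    receive
        stop ->
            Count                                  % report how many reads were done
    after 0 ->                                     % the mailbox is polled on every iteration
        ets:lookup(Table, random:uniform(100)),    % 100 = assumed number of keys in the table
        reader_loop(Table, Count + 1)
    end.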

An alternative approach to try is to ask all workers to read up to N million times in the table, and count how long it takes until they are all done. This would reduce the amount of non-ETS work done on your node and instead focus the measurement on reading from the table only.

I don't have any guarantees, but I'd bet the tables with more concurrency would get faster runs.
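
A minimal sketch of that alternative (the names and the table size of 100 are illustrative assumptions, not taken from the linked code): each worker performs a fixed number of lookups, and the parent simply times how long it takes for all of them to terminate.

read_n(_Table, 0) ->
    ok;
read_n(Table, N) ->
    ets:lookup(Table, random:uniform(100)),   % no mailbox polling in the hot loop
    read_n(Table, N - 1).

time_workers(Table, Workers, ReadsPerWorker) ->
    {Micros, ok} = timer:tc(fun() ->
        [spawn_monitor(fun() ->
                           random:seed(erlang:now()),   % per-worker seed, as discussed above
                           read_n(Table, ReadsPerWorker)
                       end) || _ <- lists:seq(1, Workers)],
        wait_down(Workers)
    end),
    Micros.   % total wall-clock time in microseconds

wait_down(0) -> ok;
wait_down(N) -> receive {'DOWN', _, process, _, _} -> wait_down(N - 1) end.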

1 vote

I made a new version of your code; in this one I added a boolean parameter to perform or skip the ets access. No doubt, most of the time is spent on things other than the ets read:

[edit]

After @Viacheslav's remark, I now initialize the table... almost no effect.

the code:

-module(perf).

-export([tests/0]).

-define(TABLE_SIZE, 100).
-define(READS_COUNT, 5000000).

read_test(Doit, WkCount, NbRead, TableOpt) ->
    Table = ets:new(?MODULE, TableOpt),
    %% fill the table so that lookups actually find something
    [ets:insert(Table, {I, something}) || I <- lists:seq(1, ?TABLE_SIZE)],
    %% one distinct seed per worker, taken before spawning
    L = [erlang:now() || _ <- lists:seq(1, WkCount)],
    F = fun() -> spawn_readers(Doit, WkCount, NbRead, Table, L) end,
    {T, _} = timer:tc(F),
    ets:delete(Table),
    T.

table_types() ->
    [[public, set, {read_concurrency, false}],
     [public, set, {read_concurrency, true}],
     [public, set]].

spawn_readers(Doit,WkCount, NbRead, Table, L_init) ->
    [spawn_monitor( fun() -> reader(Doit,NbRead, Table, X) end) || X <- L_init],
    reap_workers(WkCount).

reader(Doit, NbRead, Table, Seed) ->
    random:seed(Seed),   %% give this worker its own random sequence
    reader_loop(Doit, NbRead, Table).

reader_loop(_, 0, _Table) ->
    ok;
%% Doit = true: random key generation plus the actual ets lookup
reader_loop(true, ToRead, Table) ->
    Key = random:uniform(?TABLE_SIZE),
    ets:lookup(Table, Key),
    reader_loop(true, ToRead - 1, Table);
%% Doit = false: same loop, but the ets access is skipped to measure the overhead alone
reader_loop(false, ToRead, Table) ->
    _Key = random:uniform(?TABLE_SIZE),
    reader_loop(false, ToRead - 1, Table).

reap_workers(0) ->
    ok;
reap_workers(Count) ->
    receive
        {'DOWN', _, process, _, _} ->
            reap_workers(Count-1)
    end.

tests() ->
    %% Result tuples: {TableOpts, number_proc, WorkerCount, TimeWithEtsReads, TimeWithoutEtsReads},
    %% times in microseconds as returned by timer:tc/1.
    [[{X, number_proc, Y, read_test(true, Y, ?READS_COUNT div Y, X), read_test(false, Y, ?READS_COUNT div Y, X)}
      || X <- table_types()]
     || Y <- [1, 10, 100, 1000, 10000]].

and the results:

8> perf:tests().
[[{[public,set,{read_concurrency,false}],
   number_proc,1,2166000,1456000},
  {[public,set,{read_concurrency,true}],
   number_proc,1,2452000,1609000},
  {[public,set],number_proc,1,2513000,1538000}],
 [{[public,set,{read_concurrency,false}],
   number_proc,10,1153000,767000},
  {[public,set,{read_concurrency,true}],
   number_proc,10,1180000,768000},
  {[public,set],number_proc,10,1181000,784000}],
 [{[public,set,{read_concurrency,false}],
   number_proc,100,1149000,755000},
  {[public,set,{read_concurrency,true}],
   number_proc,100,1157000,747000},
  {[public,set],number_proc,100,1130000,749000}],
 [{[public,set,{read_concurrency,false}],
   number_proc,1000,1141000,756000},
  {[public,set,{read_concurrency,true}],
   number_proc,1000,1169000,748000},
  {[public,set],number_proc,1000,1146000,769000}],
 [{[public,set,{read_concurrency,false}],
   number_proc,10000,1224000,832000},
  {[public,set,{read_concurrency,true}],
   number_proc,10000,1274000,855000},
  {[public,set],number_proc,10000,1162000,826000}]]