9
votes

I am struggling a little coming to grips with the OTP development model as I convert some code into an OTP app.

I am essentially making a web crawler and I just don't quite know where to put the code that does the actual work.

I have a supervisor which starts my worker:

-behaviour(supervisor).
-define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).

init(_Args) ->          
  Children = [
    ?CHILD(crawler, worker)
  ],  
  RestartStrategy = {one_for_one, 0, 1},
  {ok, {RestartStrategy, Children}}.

In this design, the Crawler Worker is then responsible for doing the actual work:

-behaviour(gen_server).

start_link() ->
  gen_server:start_link(?MODULE, [], []).

init([]) ->
  inets:start(),        
  httpc:set_options([{verbose_mode,true}]), 
  % gen_server:cast(?MODULE, crawl),
  % ok = do_crawl(),
  {ok, #state{}}.

do_crawl() ->
  % crawl!
  ok.

handle_cast(crawl, State) -> 
  ok = do_crawl(),
  {noreply, State};

do_crawl spawns a fairly large number of processes and requests that handle the work of crawling via http.

The question, ultimately, is: where should the actual crawl happen? As can be seen above, I have been experimenting with different ways of triggering the actual work, but I am still missing some concept essential for grokking how things fit together.

Note: some of the OTP plumbing is left out for brevity. The plumbing is all there and the system hangs together.


3 Answers

12
votes

I apologize if I got your question wrong.

A couple of suggestions to guide you in the right direction (or what I consider to be the right direction :)

1 (Rather minor, but still important) I suggest moving the inets startup code out of that worker and into the application startup code (appname_app.erl). As far as I can tell you're using rebar templates, so you should have that module.
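
To make point 1 concrete, here is a minimal sketch of what the application module could look like. The names appname_app and appname_sup are placeholders (assumptions, not taken from your code), and the httpc options from your worker's init/1 would move here in the same way.

```erlang
-module(appname_app).
-behaviour(application).

-export([start/2, stop/1]).

%% Called once when the application boots, before any worker exists.
start(_StartType, _StartArgs) ->
    %% inets may already be running (for example when it is listed under
    %% `applications` in the .app file), so accept both outcomes.
    case inets:start() of
        ok -> ok;
        {error, {already_started, inets}} -> ok
    end,
    %% appname_sup is a hypothetical top-level supervisor module.
    appname_sup:start_link().

stop(_State) ->
    ok.
```

With this in place the worker's init/1 only has to build its own state.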

2 Now, onto the essential parts. To make full use of OTP's supervisor model, assuming that you want to spawn a large number of crawlers, it would make a lot of sense to use a simple_one_for_one supervisor instead of one_for_one (read http://www.erlang.org/doc/man/supervisor.html for more details, but the essential part is: simple_one_for_one - a simplified one_for_one supervisor, where all child processes are dynamically added instances of the same process type, i.e. running the same code.). So instead of launching just one process to supervise, you specify a "template" of sorts that describes how to start the worker processes that do the real job. Every worker of that kind is started using supervisor:start_child/2 (http://erldocs.com/R14B01/stdlib/supervisor.html?i=1&search=start_chi#start_child/2). None of those workers will start until you explicitly start them.

2.1 Depending on the nature of your crawlers, you need to decide which restart type suits your workers. Right now your template has it set to permanent (though for a different kind of supervised child). Here are your options:

 Restart defines when a terminated child process should be restarted. A permanent child process should always be restarted, 
 a temporary child process should never be restarted and a transient child process should be restarted only if it terminates 
 abnormally, i.e. with another exit reason than normal.

So, you might want to have something like:

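 -behaviour(supervisor).
 -define(CHILD(I, Type, Restart), {I, {I, start_link, []}, Restart, 5000, Type, [I]}).

 init(_Args) ->
     %% A single child template; instances are added later at runtime.
     Children = [
         ?CHILD(crawler, worker, transient)
     ],
     %% {Strategy, MaxRestarts, MaxSeconds}: with MaxRestarts = 0 the
     %% supervisor itself terminates on the first child restart, so raise
     %% it if crawlers should really be restarted.
     RestartStrategy = {simple_one_for_one, 0, 1},
     {ok, {RestartStrategy, Children}}.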

I took the liberty of suggesting transient restarts for these children, as it makes sense for this kind of worker (restart if it failed to do the job, and don't if it completed normally).

2.2 Once you take care of the above items, your supervisor will be handling any number of dynamically added worker processes; and it will be monitoring and restarting (if necessary) each of them, which adds a great deal to your system stability and manageability.
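
As a rough sketch of how workers get added at runtime, assuming a hypothetical crawler:start_link/1 that takes the URL to crawl (your current start_link/0 takes no arguments), the supervisor module could expose a small helper around supervisor:start_child/2. The module name crawler_sup and the restart intensity values are illustrative only.

```erlang
-module(crawler_sup).
-behaviour(supervisor).

-export([start_link/0, start_crawler/1]).
-export([init/1]).

start_link() ->
    %% Register locally so start_crawler/1 can find the supervisor by name.
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start one supervised crawler for Url. The [Url] list is appended to the
%% (empty) argument list in the child template, so this ends up calling
%% crawler:start_link(Url). That function is an assumption of this sketch.
start_crawler(Url) ->
    supervisor:start_child(?MODULE, [Url]).

init([]) ->
    %% Template only: no crawler runs until start_crawler/1 is called.
    Template = {crawler, {crawler, start_link, []},
                transient, 5000, worker, [crawler]},
    %% Allow up to 5 restarts within 10 seconds (illustrative values).
    {ok, {{simple_one_for_one, 5, 10}, [Template]}}.
```

Each call such as crawler_sup:start_crawler("http://example.com/") then adds one more supervised worker.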

3 Now, the worker process. I would assume that each crawler has some particular states which it might be in at any given moment. For that reason, I would suggest using gen_fsm (finite state machine; more about them at http://learnyousomeerlang.com/finite-state-machines). This way, each gen_fsm instance you dynamically add to your supervisor should send an event to itself in init/1 (using gen_fsm:send_event/2, http://erldocs.com/R14B01/stdlib/gen_fsm.html?i=0&search=send_even#send_event/2).

Something along the lines of:

   init([Arg1]) ->
       %% Queue an event for ourselves; it is handled once init/1 returns.
       gen_fsm:send_event(self(), start),
       {ok, initialized, #state{ arg1 = Arg1 }}.

   initialized(start, State) ->
       %% do your work
       %% and then either switch to next state {next_state, ...
       %% or stop the thing:
       {stop, normal, State}.

Note that doing your work could be either contained within this gen_fsm process or you might consider spawning a separate process for it, depending on your particular needs.

You might want to have multiple state names for different phases of your crawling if that turns out to be necessary.

Either way, I hope this helps you design your application in a somewhat OTP-ish way. Please let me know if you have any questions, I'll be happy to add something if necessary.

4
votes

Are you actually keeping track of any state in your gen_server?

If the answer is yes, it looks like you are doing things the right way. Note that since messages are serialized, with the above implementation you could not have two crawls going at the same time. If you need concurrent crawls, see the answer to my question here.
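
One possible way to get concurrent crawls while keeping the gen_server, sketched under my own assumptions (the module name crawl_server, the {crawl, Url} message and the active counter are all made up for illustration), is to have handle_cast/2 hand each crawl to a monitored process instead of doing the work inline:

```erlang
-module(crawl_server).
-behaviour(gen_server).

-export([start_link/0, crawl/1, active/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% The only state kept here is the number of crawls in flight.
-record(state, {active = 0}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

crawl(Url) ->
    gen_server:cast(?MODULE, {crawl, Url}).

active() ->
    gen_server:call(?MODULE, active).

init([]) ->
    {ok, #state{}}.

handle_call(active, _From, State) ->
    {reply, State#state.active, State}.

handle_cast({crawl, Url}, State = #state{active = N}) ->
    %% Hand the slow work to a monitored process so the server returns to
    %% its mailbox immediately and can accept the next crawl request.
    {_Pid, _Ref} = spawn_monitor(fun() -> do_crawl(Url) end),
    {noreply, State#state{active = N + 1}}.

handle_info({'DOWN', _Ref, process, _Pid, _Reason},
            State = #state{active = N}) ->
    %% A crawl finished (or crashed); either way it is no longer active.
    {noreply, State#state{active = N - 1}};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) -> ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Placeholder for the real HTTP work.
do_crawl(_Url) ->
    ok.
```

The server itself still handles one message at a time, but each message is cheap, so several crawls can be in progress at once.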

If the answer is no then you can possibly ditch the server and the supervisor and just use the application module for any initialization code as seen here.

Finally, lhttpc and ibrowse are considered better alternatives to inets. I use lhttpc in production on my ad servers and it works great.

3
votes

My solution to this problem would be to look into the Erlang Solutions "jobs" application, which can be used to schedule jobs (i.e., requesting pages) and let a separate system handle each job, bound the concurrency and so on.

You can then feed new URLs into a process, crawl_sched_mgr, which filters the URLs and then spawns new jobs. You could also let the requestors do this themselves.

If you don't want to use jobs, Yurii's suggestion is the way to go.