0
votes

I am still pretty new to PIG but I understand the basic idea of map/reduce jobs. I am trying to figure out some statistics for a user based on some simple logs. We have a utility that parses out fields from the log and I am using DataFu to figure out the variance and quartiles.

My script is as follows:

log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '*.mis01*', dev OTHERWISE;
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;

--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;

user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);

some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max, MIN(aliased_user_times.miliseconds_into_day) AS min;

times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;

times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';

--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;

STORE averages INTO '/data/averages';

I've seen that other people have had problems with DataFu calculating multiple quantiles at once so I am only trying to calculate one at a time. The custom loader loads one line at a time, passes it through a utility which converts it into a map and there is a small UDF that checks to see if a date is a weekday or a weekend (originally we wanted to get statistics based on day of week, but loading enough data to get interesting quartiles was killing the map/reduce tasks.

Using Pig 0.11

1

1 Answers

0
votes

It looks like my specific problem was due to trying to calculate the min and the max in one PigLatin line. Splitting the work into two different commands and then joining them seems to have fixed my memory problem