1
votes

I am using the yelp dataset to get the hours multiple businesses are open. The time range is put in a list full of dictionaries for each company as shown below.

{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'}
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'}
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'}
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'}
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'}
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'}
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'}
None
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}

This goes on for 150,000 elements.

As you can see, one element does not have any information and says "None" instead. I used the dropna() command to clear those. However, it leaves a gap in the list and disrupts a for loop I have that calculates the hours.

Here is an small example to explain what I mean.

The table starts as this

index,0
0,0.0
1,1.0
2,3.0
3,NaN
4,4.0
5,5.0.

and changes to

index,0
0,0.0
1,1.0
2,3.0
4,4.0
5,5.0

after using dropna()

As you can see, the table skips from 2 to 4.

In my for loop I made it calculate the total time for each week in a range(1-150000), but since there is no row 8 and other rows, it throws an error and stops the loop.

So... My question is, how can I run my for loop such that is skips over these non existent rows.

Also, here is my code:

df_new = df_hours.dropna()

for i in range(1,150000):

  dc = df_new[i]
  print(dc)
  sum_elapsed = 0
  for _, v in dc.items():
      start, end = v.split('-')
      hhs, mms = (int(v) for v in start.split(':'))
      hhe, mme = (int(v) for v in end.split(':'))
      elapsed = (hhe * 60 + mme) - (hhs * 60 + mms) 
      sum_elapsed += elapsed

  print(sum_elapsed)
2

2 Answers

2
votes

Given a list of dict:

data = [{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'},
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'},
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'},
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'},
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'},
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'},
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'},
None,
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}]

Doing:

df = pd.json_normalize(data)
print(df)

Output:

     Monday    Tuesday  Wednesday   Thursday     Friday   Saturday     Sunday
0   0:0-0:0  8:0-18:30  8:0-18:30  8:0-18:30  8:0-18:30   8:0-14:0        NaN
1  8:0-22:0   8:0-22:0   8:0-22:0   8:0-22:0   8:0-23:0   8:0-23:0   8:0-22:0
2  7:0-20:0   7:0-20:0   7:0-20:0   7:0-20:0   7:0-21:0   7:0-21:0   7:0-21:0
3       NaN        NaN  14:0-22:0  16:0-22:0  12:0-22:0  12:0-22:0  12:0-18:0
4   0:0-0:0   6:0-22:0   6:0-22:0   6:0-22:0    9:0-0:0   9:0-22:0   8:0-22:0
5   0:0-0:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  12:0-18:0
6  9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0        NaN        NaN
7       NaN        NaN        NaN        NaN        NaN        NaN        NaN
8   0:0-0:0   6:0-21:0   6:0-21:0   6:0-16:0   6:0-16:0   6:0-17:0   6:0-21:0

Removing NaN rows, and resetting the index:

df = df.dropna(how='all').reset_index(drop=True)
print(df)

Output:

     Monday    Tuesday  Wednesday   Thursday     Friday   Saturday     Sunday
0   0:0-0:0  8:0-18:30  8:0-18:30  8:0-18:30  8:0-18:30   8:0-14:0        NaN
1  8:0-22:0   8:0-22:0   8:0-22:0   8:0-22:0   8:0-23:0   8:0-23:0   8:0-22:0
2  7:0-20:0   7:0-20:0   7:0-20:0   7:0-20:0   7:0-21:0   7:0-21:0   7:0-21:0
3       NaN        NaN  14:0-22:0  16:0-22:0  12:0-22:0  12:0-22:0  12:0-18:0
4   0:0-0:0   6:0-22:0   6:0-22:0   6:0-22:0    9:0-0:0   9:0-22:0   8:0-22:0
5   0:0-0:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  12:0-18:0
6  9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0        NaN        NaN
7   0:0-0:0   6:0-21:0   6:0-21:0   6:0-16:0   6:0-16:0   6:0-17:0   6:0-21:0

Doing your function correctly:

def myFunc(v):
    if isinstance(v, str):
        start, end = v.split('-')
        hhs, mms = (int(v) for v in start.split(':'))
        hhe, mme = (int(v) for v in end.split(':'))
        return (hhe * 60 + mme) - (hhs * 60 + mms)
    else:
        return v 

df = df.applymap(myFunc)     

Output:

   Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
0     0.0    630.0      630.0     630.0   630.0     360.0     NaN
1   840.0    840.0      840.0     840.0   900.0     900.0   840.0
2   780.0    780.0      780.0     780.0   840.0     840.0   840.0
3     NaN      NaN      480.0     360.0   600.0     600.0   360.0
4     0.0    960.0      960.0     960.0  -540.0     780.0   840.0
5     0.0    480.0      480.0     480.0   480.0     480.0   360.0
6   480.0    480.0      480.0     480.0   480.0       NaN     NaN
7     NaN      NaN        NaN       NaN     NaN       NaN     NaN
8     0.0    900.0      900.0     600.0   600.0     660.0   900.0

Total sum:

>>> df.sum().sum()
30600.0
1
votes

First of all, df_new[i] indexes df_new along the columns, not the index (as it seems you want). Instead, use df_new.loc[i], which does index along the columns. Additionally, you can use a try/except statement to catch any raised KeyError exceptions.

df_new = df_hours.dropna()

for i in range(1,150000):
    try:
        dc = df_new.loc[i]
    except KeyError:
        continue
    ...

However, you should just iterate over the index of the DataFrame, which avoids any KeyError exceptions:

df_new = df_hours.dropna()

for i in df_new.index:
    ...