0
votes

I am working with pandas and I wish to sample 2 stocks from each trade date and store as part of the dataset the average "Stock_Change" and the average "Vol_Change" for the given day in question based on the sample taken (in this case, 2 stocks per day). The actual data is much larger spanning years and hundreds of names. My sample will be of 100 names, I just use 2 for the purposes of this question.

Sample data set:

In [3]:

df
​
Out[3]:
        Date Symbol  Stock_Change  Vol_Change
0   1/1/2008      A         -0.05        0.07
1   1/1/2008      B         -0.06        0.17
2   1/1/2008      C         -0.05        0.07
3   1/1/2008      D          0.05        0.13
4   1/1/2008      E         -0.03       -0.10
5   1/2/2008      A          0.03       -0.17
6   1/2/2008      B          0.08        0.34
7   1/2/2008      C          0.03        0.17
8   1/2/2008      D          0.06        0.24
9   1/2/2008      E          0.02        0.16
10  1/3/2008      A          0.02        0.05
11  1/3/2008      B          0.01        0.39
12  1/3/2008      C          0.05       -0.17
13  1/3/2008      D         -0.01        0.37
14  1/3/2008      E         -0.06        0.23
15  1/4/2008      A          0.03        0.31
16  1/4/2008      B         -0.07        0.16
17  1/4/2008      C         -0.06        0.29
18  1/4/2008      D          0.00        0.09
19  1/4/2008      E          0.00       -0.02
20  1/5/2008      A          0.04       -0.04
21  1/5/2008      B         -0.06        0.16
22  1/5/2008      C         -0.08        0.07
23  1/5/2008      D          0.09        0.16
24  1/5/2008      E          0.06        0.18
25  1/6/2008      A          0.00        0.22
26  1/6/2008      B          0.08       -0.13
27  1/6/2008      C          0.07        0.18
28  1/6/2008      D          0.03        0.32
29  1/6/2008      E          0.01        0.29
30  1/7/2008      A         -0.08       -0.10
31  1/7/2008      B         -0.09        0.23
32  1/7/2008      C         -0.09        0.26
33  1/7/2008      D          0.02       -0.01
34  1/7/2008      E         -0.05        0.11
35  1/8/2008      A         -0.02        0.36
36  1/8/2008      B          0.03        0.17
37  1/8/2008      C          0.00       -0.05
38  1/8/2008      D          0.08       -0.13
39  1/8/2008      E          0.07        0.18

One other point, the samples can not contain the same security more than once (sample without replacement). My guess is that this a good R question but I don't know the last thing about R . .

I have no idea of even how to start this question.

thanks in advance for any help.

Edit by OP

I tried this but don't seem to be able to get it to work on a the group-by dataframe (grouped by Symbol and Date):

In [35]:

import numpy as np
import pandas as pd
from random import sample
​
# create random index
​
rindex =  np.array(sample(range(len(df)), 10))
​
# get 10 random rows from df
dfr = df.ix[rindex]
In [36]:

dfr
Out[36]:
        Date Symbol Stock_Change Vol_Change
6   1/2/2008      B           8%        34%
1   1/2/2008      B          -6%        17%
37  1/3/2008      C           0%        -5%
25  1/1/2008      A           0%        22%
3   1/4/2008      D           5%        13%
12  1/3/2008      C           5%       -17%
10  1/1/2008      A           2%         5%
2   1/3/2008      C          -5%         7%
26  1/2/2008      B           8%       -13%
17  1/3/2008      C          -6%        29%

OP Edit #2

As I read the question I realize that I may not have been very clear. What I want to do is sample the data many times (call it X) for each day and in essence end up with X times "# of dates" as my new dataset. This may not look like it makes sense with the data i am showing but my actual data has 500 names and 2 years (2x365 = 730) of dates and I wish to sample 50 random names for each day for a total of 50 x 730 = 36500 data points.

first attempt gave this:
In [10]:

# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, sample_size=3):
    symbols = group.Symbol.values
    symbols_selected = np.random.choice(symbols, size=sample_size, replace=False)
    return group.loc[group.Symbol.isin(symbols_selected)]
​
df.groupby(['Date']).apply(get_subsample).reset_index(drop=True)
​
Out[10]:
        Date Symbol Stock_Change Vol_Change
0   1/1/2008      A          -5%         7%
1   1/1/2008      A           3%       -17%
2   1/1/2008      A           2%         5%
3   1/1/2008      A           3%        31%
4   1/1/2008      A           4%        -4%
5   1/1/2008      A           0%        22%
6   1/1/2008      A          -8%       -10%
7   1/1/2008      A          -2%        36%
8   1/2/2008      B          -6%        17%
9   1/2/2008      B           8%        34%
10  1/2/2008      B           1%        39%
11  1/2/2008      B          -7%        16%
12  1/2/2008      B          -6%        16%
13  1/2/2008      B           8%       -13%
14  1/2/2008      B          -9%        23%
15  1/2/2008      B           3%        17%
16  1/3/2008      C          -5%         7%
17  1/3/2008      C           3%        17%
18  1/3/2008      C           5%       -17%
19  1/3/2008      C          -6%        29%
20  1/3/2008      C          -8%         7%
21  1/3/2008      C           7%        18%
22  1/3/2008      C          -9%        26%
23  1/3/2008      C           0%        -5%
24  1/4/2008      D           5%        13%
25  1/4/2008      D           6%        24%
26  1/4/2008      D          -1%        37%
27  1/4/2008      D           0%         9%
28  1/4/2008      D           9%        16%
29  1/4/2008      D           3%        32%
30  1/4/2008      D           2%        -1%
31  1/4/2008      D           8%       -13%
32  1/5/2008      E          -3%       -10%
33  1/5/2008      E           2%        16%
34  1/5/2008      E          -6%        23%
35  1/5/2008      E           0%        -2%
36  1/5/2008      E           6%        18%
37  1/5/2008      E           1%        29%
38  1/5/2008      E          -5%        11%
39  1/5/2008      E           7%        18%
1
Have you tried sample? - EdChum

1 Answers

1
votes
import pandas as pd
import numpy as np

# replicate your data structure
# ==============================
np.random.seed(0)
dates = pd.date_range('2008-01-01', periods=100, freq='B')
symbols = 'A B C D E'.split()
multi_index = pd.MultiIndex.from_product([dates, symbols], names=['Date', 'Symbol'])
stock_change = np.random.randn(500)
vol_change = np.random.randn(500)
df = pd.DataFrame({'Stock_Change': stock_change, 'Vol_Change': vol_change}, index=multi_index).reset_index()



# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, X=100, sample_size=3):
    frame = pd.DataFrame(columns=['sample_{}'.format(x) for x in range(1,X+1)])
    for col in frame.columns.values:
        frame[col] = group.loc[group.Symbol.isin(np.random.choice(symbols, size=sample_size, replace=False)), ['Stock_Change', 'Vol_Change']].mean()
    return frame.mean(axis=1)


result = df.groupby(['Date']).apply(get_subsample)

    Out[169]: 
            Stock_Change  Vol_Change
Date                                
2008-01-01        1.3937      0.2005
2008-01-02        0.0406     -0.7280
2008-01-03        0.6073     -0.2699
2008-01-04        0.2310      0.7415
2008-01-07        0.0718     -0.7269
2008-01-08        0.3808     -0.0584
2008-01-09       -0.5595     -0.2968
2008-01-10        0.3919     -0.2741
2008-01-11       -0.4856      0.0386
2008-01-14       -0.4700     -0.4090
...                  ...         ...
2008-05-06        0.1510      0.1628
2008-05-07       -0.1452      0.2824
2008-05-08       -0.4626      0.2173
2008-05-09       -0.2984      0.6324
2008-05-12       -0.3817      0.7698
2008-05-13        0.5796     -0.4318
2008-05-14        0.2875      0.0067
2008-05-15        0.0269      0.3559
2008-05-16        0.7374      0.1065
2008-05-19       -0.4428     -0.2014

[100 rows x 2 columns]