1
votes

Suppose that each product has different versions that change over time, and I have a data set of observations over time with the product id, version id and other data.

I am interested in the Cartesian product of the indices of successive versions, i.e. the Cartesian products of the indices of version_1 and version_2, version_2 and version_3, and version_3 and version_4.

For example, the Cartesian product of version_1 and version_2 is (0,3), (1,3), (2,3), (0,4), (1,4), (2,4); that of version_2 and version_3 is (3,5), (3,6), (3,7), (4,5), (4,6), (4,7); and so on. Ideally I would like two arrays: one of the left indices and one of the right indices.
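To make the target concrete, here is what the two arrays would look like for the version_1 and version_2 pair only. This is just an illustration of the desired output, written out by hand from the indices above; the names `left` and `right` are placeholders.

```python
import numpy as np

# Desired output for version_1 (rows 0, 1, 2) x version_2 (rows 3, 4),
# written out by hand in the same order as the pairs listed above
left = np.array([0, 1, 2, 0, 1, 2])
right = np.array([3, 3, 3, 4, 4, 4])

# Zipping the two arrays recovers the pairs
pairs = list(zip(left.tolist(), right.tolist()))
print(pairs)
```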

Any hints on how this can be done efficiently using NumPy rather than by manually looping, which is very slow?

2
Here are questions similar to yours, along with their answers. (r-beginners)
Thanks, but I am interested in the Cartesian product of successive versions only, rather than the Cartesian product of all versions. (Giacomo)

2 Answers

1
votes

You can try this:

import pandas as pd
import itertools

df = pd.DataFrame({'version': ['version_1', 'version_1', 'version_1', 'version_2', 'version_2', 'version_3', 'version_3', 'version_3', 'version_4']})

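# Keep only the numeric suffix of the label, e.g. 'version_1' -> 1
df.version = df.version.apply(lambda x: int(x.split('_')[-1]))
# Collect the row indices that belong to each version
df = df.reset_index().groupby('version')['index'].apply(list).rename('versions').reset_index()
# Put the indices of the next version alongside those of the current one
df['versions_shift'] = df['versions'].shift(-1, fill_value=[[]])
# Cartesian product of the two index lists, row by row
df['cartesian'] = df.apply(lambda x: itertools.product(x['versions'], x['versions_shift']), axis=1)
# Transpose the pairs into one tuple of left indices and one of right indices
df['cartesian'] = df['cartesian'].apply(lambda x: list(zip(*x)))
df.drop(['version', 'versions', 'versions_shift'], axis=1, inplace=True)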

print(df)

Output:

                                  cartesian
0  [(0, 0, 1, 1, 2, 2), (3, 4, 3, 4, 3, 4)]
1  [(3, 3, 3, 4, 4, 4), (5, 6, 7, 5, 6, 7)]
2                    [(5, 6, 7), (8, 8, 8)]
3                                        []
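Since the question asks for two flat arrays, the per-row tuples could be concatenated afterwards. A minimal sketch, assuming the rows are exactly those printed above (copied here by hand so the snippet runs on its own; with the DataFrame at hand, `rows` would be `df['cartesian'].tolist()`). The names `rows`, `indices_left` and `indices_right` are illustrative.

```python
import numpy as np

# Rows of the 'cartesian' column as printed above, copied by hand
rows = [
    [(0, 0, 1, 1, 2, 2), (3, 4, 3, 4, 3, 4)],
    [(3, 3, 3, 4, 4, 4), (5, 6, 7, 5, 6, 7)],
    [(5, 6, 7), (8, 8, 8)],
    [],  # the last version has no successor
]

# Skip empty rows, then join the left tuples and the right tuples separately
non_empty = [row for row in rows if row]
indices_left = np.concatenate([np.asarray(row[0]) for row in non_empty])
indices_right = np.concatenate([np.asarray(row[1]) for row in non_empty])

print(indices_left)
print(indices_right)
```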
0
votes

The best way I found is to get the version order for each product manually, loop through the successive versions, and then compute the indices of the Cartesian product for each pair.

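import numpy as np

def cartesian_product(x: np.ndarray, y: np.ndarray):
    # Pair every element of x with every element of y (x varies fastest)
    return np.tile(x, len(y)), np.repeat(y, len(x))

# product_ids, product_versions and countries are assumed to be 1-D arrays
# of equal length, with one entry per observation
unique_product_ids = np.unique(product_ids)
unique_countries = np.unique(countries)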

indices_left_list = []
indices_right_list = []

for product_id in unique_product_ids:
    current_product_versions = product_versions[product_ids == product_id]

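    # np.unique returns sorted values, so use the first-occurrence indices
    # to recover the versions in their order of appearance
    _, indexes = np.unique(current_product_versions, return_index=True)
    unique_versions_in_order = [current_product_versions[index] for index in sorted(indexes)]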

    for country in unique_countries:
        for version_left, version_right in zip(unique_versions_in_order, unique_versions_in_order[1:]):
            indices_left, indices_right = cartesian_product(
                np.flatnonzero((countries == country) & (product_ids == product_id) & (product_versions == version_left)),
                np.flatnonzero((countries == country) & (product_ids == product_id) & (product_versions == version_right))
            )
            indices_left_list.append(indices_left)
            indices_right_list.append(indices_right)


indices_left = np.concatenate(indices_left_list)
indices_right = np.concatenate(indices_right_list)
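
The boolean masks above rescan the full arrays for every pair of versions. If the rows for a single product and country are already in time order, so that equal versions form contiguous runs (an assumption the loop above does not need), the pairs can probably be built without any Python loop. The following is only a sketch under that assumption; `successive_products` is a hypothetical helper name, not part of NumPy.

```python
import numpy as np

def successive_products(versions):
    """Index pairs between each run of equal versions and the next run.

    Assumes `versions` is 1-D, in time order, with equal versions contiguous.
    """
    versions = np.asarray(versions)
    # Start index and length of each run of equal versions
    starts = np.flatnonzero(np.r_[True, versions[1:] != versions[:-1]])
    sizes = np.diff(np.r_[starts, len(versions)])
    n_left, n_right = sizes[:-1], sizes[1:]
    n_pairs = n_left * n_right
    # For every output pair: which (run, next run) block it belongs to,
    # and its position inside that block
    block = np.repeat(np.arange(len(n_pairs)), n_pairs)
    block_start = np.repeat(np.cumsum(n_pairs) - n_pairs, n_pairs)
    offset = np.arange(n_pairs.sum()) - block_start
    # Left index varies fastest, matching np.tile / np.repeat in cartesian_product
    left = starts[:-1][block] + offset % n_left[block]
    right = starts[1:][block] + offset // n_left[block]
    return left, right

# Example with the layout from the question: 3, 2, 3 and 1 rows per version
left, right = successive_products([1, 1, 1, 2, 2, 3, 3, 3, 4])
print(left)
print(right)
```

To cover several products and countries, one option would be to sort the rows by product, country and time first and apply the same idea per group, at the cost of mapping the sorted positions back to the original row indices.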