
I tried with both pyspark.DataFrame and pandas.DataFrame to perform a Cartesian product of a dataframe (df) with itself. Because both sides are the same dataframe, the result is symmetric: every pair appears twice, once in each order.

Below with PySpark:

>>> a = spark.createDataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], schema=['letter', 'number', 'ln'])
>>> b = spark.createDataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], schema=['letter', 'number', 'ln'])
>>> a.join(b).show()
+------+------+---+------+------+---+
|letter|number| ln|letter|number| ln|
+------+------+---+------+------+---+
|     1|     a| 1a|     1|     a| 1a|
|     1|     a| 1a|     2|     b| 2b|
|     1|     a| 1a|     3|     c| 3c|
|     2|     b| 2b|     1|     a| 1a|
|     2|     b| 2b|     2|     b| 2b|
|     2|     b| 2b|     3|     c| 3c|
|     3|     c| 3c|     1|     a| 1a|
|     3|     c| 3c|     2|     b| 2b|
|     3|     c| 3c|     3|     c| 3c|
+------+------+---+------+------+---+

I cheated a little here to make the explanation clearer: there are indeed two dataframes, a and b, but my actual question is about joining one dataframe a with itself.

Here I would like to remove two kinds of results:

  1. rows with the same ln (used here as an id)
  2. duplicate pairs, e.g. (1,a,1a,2,b,2b) is the same as (2,b,2b,1,a,1a)

The first rule can be applied as follows:

>>> a.join(b, a.ln != b.ln).show(truncate=False)
+------+------+---+------+------+---+
|letter|number|ln |letter|number|ln |
+------+------+---+------+------+---+
|1     |a     |1a |2     |b     |2b |
|1     |a     |1a |3     |c     |3c |
|2     |b     |2b |1     |a     |1a |
|2     |b     |2b |3     |c     |3c |
|3     |c     |3c |1     |a     |1a |
|3     |c     |3c |2     |b     |2b |
+------+------+---+------+------+---+

The same with a cross join:

>>> from pyspark.sql.functions import col
>>> a.join(a.alias('a1'), a.ln != col('a1.ln'), how='cross').show()
+------+------+---+------+------+---+
|letter|number| ln|letter|number| ln|
+------+------+---+------+------+---+
|     1|     a| 1a|     2|     b| 2b|
|     1|     a| 1a|     3|     c| 3c|
|     2|     b| 2b|     1|     a| 1a|
|     2|     b| 2b|     3|     c| 3c|
|     3|     c| 3c|     1|     a| 1a|
|     3|     c| 3c|     2|     b| 2b|
+------+------+---+------+------+---+

The second rule should be achievable with a group-by clause, but is there an efficient way to do it?
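Something like the sketch below is what I have in mind for the second rule (a rough sketch on the small a/b example above; pair_key is just an illustrative helper column), but I doubt it scales to 300 000 rows:

import pyspark.sql.functions as F

pairs = a.join(b, a.ln != b.ln)
deduped = (
    pairs
    # sort_array gives the same key for (x, y) and (y, x), so dropping
    # duplicates on that key keeps only one ordering of each pair
    .withColumn('pair_key', F.sort_array(F.array(a.ln, b.ln)))
    .dropDuplicates(['pair_key'])
    .drop('pair_key')
)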

I tried itertools.product, but it is slow because my dataframe has 300 000 rows.
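Roughly, the itertools.product attempt looks like this (a simplified sketch on the toy data; the real list has 300 000 tuples and the exact filtering I used may differ):

import itertools

rows = [(1, 'a', '1a'), (2, 'b', '2b'), (3, 'c', '3c')]  # stand-in for the real rows
pairs = [
    (r1, r2)
    for r1, r2 in itertools.product(rows, repeat=2)
    if r1[2] < r2[2]  # strict ordering on ln covers both rules at once
]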

Below with pandas:

With pandas and NumPy I tried to create a multi-index based on a cross join, but that fails with not enough memory:

>>> df1 = pd.DataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], columns=['letter', 'number', 'ln'])
>>> df = pd.DataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], columns=['letter', 'number', 'ln'])
>>> i1 = df.index
>>> combine_sample = np.transpose([np.tile(i1, len(i1)), np.repeat(i1, len(i1))])
Traceback (most recent call last):
...
numpy.core._exceptions.MemoryError: Unable to allocate 721. GiB for an array with shape (311119, 311119) and data type int64
>>> index= pd.MultiIndex.from_product([df.index,df.index])
Traceback (most recent call last):
...
numpy.core._exceptions.MemoryError: Unable to allocate 361. GiB for an array with shape (96795032161,) and data type int32
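Generating only the upper-triangle index pairs would skip the diagonal and the mirrored pairs up front, but the count still grows quadratically, so I do not expect it to fit in memory either (a sketch, reusing i1 from above):

import numpy as np

# np.triu_indices with k=1 excludes the diagonal (rule 1) and yields each
# unordered pair only once (rule 2); the count is still n * (n - 1) / 2
left, right = np.triu_indices(len(i1), k=1)
combine_sample = np.transpose([i1[left], i1[right]])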

Expected Result

It is a DataFrame like this:

+------+------+---+------+------+---+
|letter|number| ln|letter|number| ln|
+------+------+---+------+------+---+
|     1|     a| 1a|     2|     b| 2b|
|     1|     a| 1a|     3|     c| 3c|
|     2|     b| 2b|     3|     c| 3c|
+------+------+---+------+------+---+
@blackbishop thanks for your comment. I have updated the question to add the expected result at the end. - bioinfornatics

1 Answer


You can enforce an ordering between the two sides of the join by comparing the rows as arrays: keeping only the pairs where the left array is strictly less than the right one removes both the self-matches and the mirrored duplicates in a single condition.

import pyspark.sql.functions as F

result = a.join(b,
    F.array(*[a[i] for i in a.columns]) < F.array(*[b[i] for i in b.columns])
)

result.show()
+------+------+---+------+------+---+
|letter|number| ln|letter|number| ln|
+------+------+---+------+------+---+
|     1|     a| 1a|     2|     b| 2b|
|     1|     a| 1a|     3|     c| 3c|
|     2|     b| 2b|     3|     c| 3c|
+------+------+---+------+------+---+
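If ln really is unique per row (it is used as an id in the question), a simpler condition on that single column should give the same result; this is only a sketch under that uniqueness assumption:

# ordering on the unique ln column alone removes self-matches and
# mirrored duplicates in one condition
result = a.join(b, a.ln < b.ln)
result.show()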