I tried with both a pyspark DataFrame and a pandas DataFrame to perform a cartesian product of a dataframe (df) with itself. Since both operands are the same dataframe, the product is symmetric: every pair of rows appears in both orders.
Below with PySpark:
>>> a = spark.createDataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], schema=['number', 'letter', 'nl'])
>>> b = spark.createDataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], schema=['number', 'letter', 'nl'])
>>> a.join(b).show()
+------+------+---+------+------+---+
|number|letter| nl|number|letter| nl|
+------+------+---+------+------+---+
| 1| a| 1a| 1| a| 1a|
| 1| a| 1a| 2| b| 2b|
| 1| a| 1a| 3| c| 3c|
| 2| b| 2b| 1| a| 1a|
| 2| b| 2b| 2| b| 2b|
| 2| b| 2b| 3| c| 3c|
| 3| c| 3c| 1| a| 1a|
| 3| c| 3c| 2| b| 2b|
| 3| c| 3c| 3| c| 3c|
+------+------+---+------+------+---+
I cheated a little here to make the explanation clearer: there are two dataframes, a and b, but my question really concerns joining one dataframe a with itself.
Here I would like to remove 2 kinds of results:
- rows paired with themselves, i.e. both sides having the same nl (used here as an id)
- duplicate rows, e.g. (1,a,1a,2,b,2b) is the same pair as (2,b,2b,1,a,1a)
The first rule can be done as follows:
>>> a.join(b, a.nl != b.nl).show(truncate=False)
+------+------+---+------+------+---+
|number|letter|nl |number|letter|nl |
+------+------+---+------+------+---+
|1 |a |1a |2 |b |2b |
|1 |a |1a |3 |c |3c |
|2 |b |2b |1 |a |1a |
|2 |b |2b |3 |c |3c |
|3 |c |3c |1 |a |1a |
|3 |c |3c |2 |b |2b |
+------+------+---+------+------+---+
The same result with a cross join (col is imported from pyspark.sql.functions):
>>> from pyspark.sql.functions import col
>>> a.join(a.alias('a1'), a.nl != col('a1.nl'), how='cross').show()
+------+------+---+------+------+---+
|number|letter| nl|number|letter| nl|
+------+------+---+------+------+---+
| 1| a| 1a| 2| b| 2b|
| 1| a| 1a| 3| c| 3c|
| 2| b| 2b| 1| a| 1a|
| 2| b| 2b| 3| c| 3c|
| 3| c| 3c| 1| a| 1a|
| 3| c| 3c| 2| b| 2b|
+------+------+---+------+------+---+
The second rule should be achievable with a group-by clause, but is that an efficient way to do it?
I tried with itertools.product, but it is slow since my dataframe has 300,000 rows.
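If the goal is every unordered pair of distinct rows, itertools.combinations generates exactly that, with no filtering step: unlike itertools.product, it never pairs a row with itself and yields each pair in only one order. It still enumerates roughly n²/2 pairs, so for 300,000 rows the output itself is huge; a small sketch on the sample rows:

```python
from itertools import combinations

rows = [(1, 'a', '1a'), (2, 'b', '2b'), (3, 'c', '3c')]

# combinations() emits each unordered pair exactly once and never
# pairs a row with itself, so both filtering rules hold by construction
pairs = [left + right for left, right in combinations(rows, 2)]
for p in pairs:
    print(p)
# (1, 'a', '1a', 2, 'b', '2b')
# (1, 'a', '1a', 3, 'c', '3c')
# (2, 'b', '2b', 3, 'c', '3c')
```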
Below with pandas:
With pandas and NumPy I tried to create a multi-index based on a cross join, but that fails with a MemoryError (not enough memory):
>>> df1 = pd.DataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], columns=['number', 'letter', 'nl'])
>>> df = pd.DataFrame(data=[(1,'a', '1a'), (2,'b', '2b'),(3,'c', '3c')], columns=['number', 'letter', 'nl'])
>>> i1 = df.index
combine_sample = np.transpose([np.tile(i1, len(i1)), np.repeat(i1, len(i1))])
Traceback (most recent call last):
...
numpy.core._exceptions.MemoryError: Unable to allocate 721. GiB for an array with shape (311119, 311119) and data type int64
>>> index= pd.MultiIndex.from_product([df.index,df.index])
Traceback (most recent call last):
...
numpy.core._exceptions.MemoryError: Unable to allocate 361. GiB for an array with shape (96795032161,) and data type int32
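The full 311,119 × 311,119 index product cannot be materialized at once, but the product can be built and filtered chunk by chunk, so that at most chunk_size * len(df) intermediate rows exist at a time. A sketch assuming pandas >= 1.2 (for how='cross'), applying the strict nl ordering to each chunk; note that the filtered result itself may still be very large:

```python
import pandas as pd

df = pd.DataFrame(
    data=[(1, 'a', '1a'), (2, 'b', '2b'), (3, 'c', '3c')],
    columns=['number', 'letter', 'nl'],
)

chunk_size = 2  # tune so chunk_size * len(df) rows fit in memory
parts = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # cross merge (pandas >= 1.2) of one chunk against the full frame
    prod = chunk.merge(df, how='cross', suffixes=('_l', '_r'))
    # strict ordering on nl drops self-pairs and mirrored duplicates
    parts.append(prod[prod['nl_l'] < prod['nl_r']])

result = pd.concat(parts, ignore_index=True)
print(result)
```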
Expected Result
A DataFrame like this:
+------+------+---+------+------+---+
|number|letter| nl|number|letter| nl|
+------+------+---+------+------+---+
| 1| a| 1a| 2| b| 2b|
| 1| a| 1a| 3| c| 3c|
| 2| b| 2b| 3| c| 3c|
+------+------+---+------+------+---+