
I have two JSON files in a relative folder named 'source_data':

"source_data/data1.json"

{ 
    "name": "John Doe", 
    "age": 32, 
    "address": "ZYZ - Heaven"
} 

"source_data/data2.json"

{
    "userName": "jdoe", 
    "password": "password", 
    "salary": "123456789"
}

Using the following PySpark code, I have created a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.json("source_data")

print(df.head())

Output of df.head(10):
[Row(name='John Doe', age=32, address='ZYZ - Heaven', userName=None, password=None, salary=None), Row(name=None, age=None, address=None, userName='jdoe', password='password', salary='123456789')]

Now I want to create a variable number of DataFrames by dropping the columns whose values are None, like this:

df1.head()
[Row(name='John Doe', age=32, address='ZYZ - Heaven')]

and,

df2.head()
[Row(userName='jdoe', password='password', salary='123456789')]

I am only finding solutions for dropping an entire row based on all or any column(s).

Is there any way to achieve what I am looking for?

TIA

The JSON files reside in the same folder; this is just the example I am posting. The actual folder has millions of files. – Papun Mohanty
You could just do a select on the initial DataFrame to get the required columns and then filter to remove the rows which have null or None values. – Nikunj Kakadiya
@NikunjKakadiya, yes, your solution worked. Many thanks for guiding me :) – Papun Mohanty

1 Answer


You can select the columns that you require into a separate DataFrame, and then filter that DataFrame on the condition that at least one of the selected columns is non-null. (The snippet below is Scala; a PySpark equivalent follows after it.)

// source data
val df = spark.read.json("path")

// the $ column syntax requires the session implicits in scope
import spark.implicits._

// select each schema's columns and keep rows where at least one is non-null
val df1 = df.select("address", "age", "name")
  .filter($"address".isNotNull || $"age".isNotNull || $"name".isNotNull)
val df2 = df.select("password", "salary", "userName")
  .filter($"password".isNotNull || $"salary".isNotNull || $"userName".isNotNull)

// see the output as a DataFrame or using head, as you want
println(df1.head)
println(df2.head)
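
Since the question uses PySpark, here is a minimal equivalent sketch in Python. It assumes the same 'source_data' folder and column names as in the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("split example").getOrCreate()

# Spark merges the two JSON schemas and fills missing columns with null
df = spark.read.json("source_data")

# columns from data1.json: keep rows where at least one of them is non-null
df1 = df.select("name", "age", "address").filter(
    col("name").isNotNull() | col("age").isNotNull() | col("address").isNotNull()
)

# columns from data2.json
df2 = df.select("userName", "password", "salary").filter(
    col("userName").isNotNull() | col("password").isNotNull() | col("salary").isNotNull()
)

print(df1.head())
print(df2.head())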

Output for both head commands (shown as screenshots in the original answer):

df1: [screenshot of df1.head output]
df2: [screenshot of df2.head output]
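
If the number of schemas is not fixed (the comments mention millions of files), the same select-and-filter idea can be applied in a loop. This is only a sketch: it assumes you already know the column groups, that df is the combined DataFrame from the question, and column_groups is a hypothetical name:

from functools import reduce
from pyspark.sql.functions import col

# hypothetical grouping: one list of columns per source schema (adjust to your data)
column_groups = [
    ["name", "age", "address"],
    ["userName", "password", "salary"],
]

# build one DataFrame per group: select its columns and keep rows
# where at least one of those columns is non-null
frames = [
    df.select(*group).filter(
        reduce(lambda a, b: a | b, [col(c).isNotNull() for c in group])
    )
    for group in column_groups
]

df1, df2 = frames  # with the two example groups above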