
I am building a multinomial regression model in PySpark, and when I fit my linear regression model it gives me this error: "IllegalArgumentException: u'requirement failed: Column label must be of type NumericType but was actually of type StringType.'"

Please help me here; I have spent a lot of time on this but couldn't resolve it.

    lr_data = loan_data.select('int_rate','loan_amnt','term','grade','sub_grade','emp_length','verification_status','home_ownership','annual_inc','purpose','addr_state','open_acc')
    lr_data.printSchema()

    root
    |-- int_rate: string (nullable = true)
    |-- loan_amnt: integer (nullable = true)
    |-- term: string (nullable = true)
    |-- grade: string (nullable = true)
    |-- sub_grade: string (nullable = true)
    |-- emp_length: string (nullable = true)
    |-- verification_status: string (nullable = true)
    |-- home_ownership: string (nullable = true)
    |-- annual_inc: double (nullable = true)
    |-- purpose: string (nullable = true)
    |-- addr_state: string (nullable = true)
    |-- open_acc: string (nullable = true)

Here, in the multinomial regression model, my target variable should be int_rate (which is of string type; that is probably why I am getting this error when fitting).

Initially, though, I tried using only two columns in the regression model: int_rate and loan_amnt.

Here is the code:

    from pyspark.ml.linalg import DenseVector

    input_data = lr_data.rdd.map(lambda x: (x[0], DenseVector(x[1:2])))
    data3 = spark.createDataFrame(input_data, ["label", "features"])
    data3.printSchema()

    root
    |-- label: string (nullable = true)
    |-- features: vector (nullable = true)
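
The label clearly comes out as a string here. A rough sketch of the kind of cast I have been trying (assuming int_rate holds plain numeric strings such as "13.56" with no extra symbols):

    # Rough sketch: cast the string label column to a numeric type.
    # Assumes int_rate contains plain numeric strings (no "%" or other text).
    from pyspark.sql.functions import col

    data3_num = data3.withColumn("label", col("label").cast("double"))
    data3_num.printSchema()   # label should now show up as double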

IMPORTANT NOTE: I tried putting other variables into the DenseVector, but it threw a long error along the lines of "invalid literal for float(): 36 months" (traceback below; see the indexing sketch after it).

    /usr/local/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    580 
    581         if isinstance(data, RDD):
    582             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    583         else:
    584             rdd, schema = self._createFromLocal(map(prepare, data), schema)
                if schema is None or isinstance(schema, (list, tuple)):
    380             struct = self._inferSchema(rdd, samplingRatio)
    381             converter = _create_converter(struct)
    382             rdd = rdd.map(converter)

    /usr/local/spark/python/pyspark/sql/session.pyc in _inferSchema(self, rdd, samplingRatio)
    349         :return: :class:`pyspark.sql.types.StructType`
    350         """
    351         first = rdd.first()
    352         if not first:
    353             raise ValueError("The first row in RDD is empty, "
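
(The "36 months" value comes from the term column, which is categorical text rather than a number, so it cannot go straight into a DenseVector. One way to turn such a column into a numeric index first is a StringIndexer; a sketch is below, with the output column name term_idx chosen only as an example.)

    # Sketch: index a categorical string column so it can be used as a numeric feature.
    # Column names follow the schema above; "term_idx" is only an example output name.
    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="term", outputCol="term_idx")
    lr_data_indexed = indexer.fit(lr_data).transform(lr_data)
    lr_data_indexed.select("term", "term_idx").show(5)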

Please also tell me how to include more than two variables in this regression model. I guess I have to typecast every variable in my data set.

    # split into two partitions
    from pyspark.ml.regression import LinearRegression

    train_data, test_data = data3.randomSplit([.7, .3], seed=1)
    lr = LinearRegression(labelCol="label", maxIter=100, regParam=0.3, elasticNetParam=0.8)
    linearModel = lr.fit(train_data)

Now, when I run lr.fit(train_data), I get the error below.

    IllegalArgumentException                  Traceback (most recent call last)
    <ipython-input-20-5f84d575334f> in <module>()
    ----> 1 linearModel = lr.fit(train_data)

    /usr/local/spark/python/pyspark/ml/base.pyc in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
     64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

    /usr/local/spark/python/pyspark/ml/wrapper.pyc in _fit(self, dataset)
    263 
    264     def _fit(self, dataset):
    265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
    267 

    /usr/local/spark/python/pyspark/ml/wrapper.pyc in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
    262         return self._java_obj.fit(dataset._jdf)
    263 
    264     def _fit(self, dataset):

    /usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
    1131         answer = self.gateway_client.send_command(command)
    1132         return_value = get_return_value(
    1133             answer, self.gateway_client, self.target_id, self.name)
    1134 
    1135         for temp_arg in temp_args:

    /usr/local/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
    ---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

    IllegalArgumentException: u'requirement failed: Column label must be of type NumericType but was actually of type StringType.'

Please help me. I have tried every method of casting the string value to numeric, but it doesn't make any difference. My target variable int_rate is of string type by default even though it holds numeric values. One more thing: I need to use the whole lr_data set in my regression model. How can I do this? Thanks in advance :)


1 Answer


Try this:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql.types import *
    import pyspark.sql.functions as F

    cols = lr_data.columns
    input_data = lr_data.rdd.map(lambda x: (x['int_rate'], Vectors.dense([x[col] for col in cols if col != 'int_rate'])))\
                            .toDF(["label", "features"])\
                            .select([F.col('label').cast(FloatType()).alias('label'), 'features'])

    train_data, test_data = input_data.randomSplit([.7, .3], seed=1)

    lr = LinearRegression(labelCol="label", maxIter=100, regParam=0.3, elasticNetParam=0.8)
    linearModel = lr.fit(train_data)
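
Once the model has been fitted, you can sanity-check it on the held-out split, for example (a quick sketch using the standard pyspark.ml summary attributes):

    # Evaluate the fitted model on the test split (sketch).
    test_summary = linearModel.evaluate(test_data)
    print(test_summary.rootMeanSquaredError)
    print(test_summary.r2)

    # Or just inspect the predictions directly.
    linearModel.transform(test_data).select("label", "prediction").show(5)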

All of this works provided all of your columns are, or can be cast to, numeric types.
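
If some of the columns (term, grade, home_ownership, and so on) stay as text, a common approach is to index them first and then assemble everything into one feature vector with a VectorAssembler. A rough sketch of that, with the column split and output names chosen only as an example for this schema:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql.types import FloatType
    import pyspark.sql.functions as F

    # Example split of the columns; adjust to your data.
    categorical_cols = ['term', 'grade', 'sub_grade', 'emp_length',
                        'verification_status', 'home_ownership', 'purpose', 'addr_state']
    numeric_cols = ['loan_amnt', 'annual_inc', 'open_acc']

    # Cast the label and the numeric-looking string columns.
    # Assumes int_rate and open_acc hold plain numeric strings.
    df = lr_data.withColumn('label', F.col('int_rate').cast(FloatType()))
    df = df.withColumn('open_acc', F.col('open_acc').cast(FloatType()))

    # Index each categorical column, then assemble all features into one vector.
    indexers = [StringIndexer(inputCol=c, outputCol=c + '_idx') for c in categorical_cols]
    assembler = VectorAssembler(
        inputCols=numeric_cols + [c + '_idx' for c in categorical_cols],
        outputCol='features')

    pipeline = Pipeline(stages=indexers + [assembler])
    model_input = pipeline.fit(df).transform(df).select('label', 'features')

    train_data, test_data = model_input.randomSplit([.7, .3], seed=1)
    linearModel = LinearRegression(maxIter=100, regParam=0.3, elasticNetParam=0.8).fit(train_data)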