I am building a multinomial regression model in PySpark, and when I fit my linear regression model I get this error: "IllegalArgumentException: u'requirement failed: Column label must be of type NumericType but was actually of type StringType.'"
Please help; I have spent a lot of time on this but have not been able to solve it.
lr_data= loan_data.select('int_rate','loan_amnt','term','grade','sub_grade','emp_length','verification_status','home_ownership','annual_inc','purpose','addr_state','open_acc')
lr_data.printSchema()
root
|-- int_rate: string (nullable = true)
|-- loan_amnt: integer (nullable = true)
|-- term: string (nullable = true)
|-- grade: string (nullable = true)
|-- sub_grade: string (nullable = true)
|-- emp_length: string (nullable = true)
|-- verification_status: string (nullable = true)
|-- home_ownership: string (nullable = true)
|-- annual_inc: double (nullable = true)
|-- purpose: string (nullable = true)
|-- addr_state: string (nullable = true)
|-- open_acc: string (nullable = true)
In the multinomial regression model my target variable should be int_rate (which is a string type; that is probably why I get this error when fitting the model).
Initially, though, I tried using only two columns in the regression model: int_rate and loan_amnt.
Here is the code:
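from pyspark.ml.linalg import DenseVector
# label = int_rate (x[0]); features = loan_amnt only, since x[1:2] is a one-element slice
input_data = lr_data.rdd.map(lambda x: (x[0], DenseVector(x[1:2])))
data3 = spark.createDataFrame(input_data, ["label", "features"])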
data3.printSchema()
root
|-- label: string (nullable = true)
|-- features: vector (nullable = true)
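To make explicit what the slice in the lambda above selects, here is a minimal plain-Python sketch. The row is made up (its values are hypothetical, not taken from loan_data) and only mirrors the column order of lr_data. It suggests that x[1:2] keeps just the second field, so the features vector holds loan_amnt alone, while a wider slice would start pulling in string columns such as term.

```python
# Hypothetical row in the same column order as lr_data (values are invented).
row = ("10.65", 5000, "36 months", "B", "B2")

label = row[0]        # int_rate, still a string here
features = row[1:2]   # one-element slice: loan_amnt only
wider = row[1:3]      # a wider slice also includes the string column term

print(label, features, wider)
```

So the two-column version only works because the single feature it selects is already an integer.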
Important note: I also tried including other variables in the DenseVector, but that threw a long error, something like "invalid literal for float(): 36 months":
/usr/local/spark/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio, verifySchema)
580
581 if isinstance(data, RDD):
582 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
583 else:
584 rdd, schema = self._createFromLocal(map(prepare, data), schema)
if schema is None or isinstance(schema, (list, tuple)):
380 struct = self._inferSchema(rdd, samplingRatio)
381 converter = _create_converter(struct)
382 rdd = rdd.map(converter)
/usr/local/spark/python/pyspark/sql/session.pyc in _inferSchema(self, rdd, samplingRatio)
349 :return: :class:`pyspark.sql.types.StructType`
350 """
351 first = rdd.first()
352 if not first:
353 raise ValueError("The first row in RDD is empty, "
Please also tell me how to use more than two variables in this regression model. I guess I have to typecast every variable in my data set.
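The "invalid literal for float()" message is consistent with a text value such as "36 months" being passed into a DenseVector, which needs floats. Below is a minimal plain-Python sketch of that failure and of one possible way to pull the number out first. The helper name months_to_float is hypothetical, and it assumes the term value always starts with the number of months.

```python
def months_to_float(term):
    """Hypothetical helper: turn a string like '36 months' into 36.0.

    Assumes the first whitespace-separated token is the number.
    """
    return float(term.strip().split()[0])


try:
    float("36 months")  # converting the raw text directly
    raw_conversion_failed = False
except ValueError:
    raw_conversion_failed = True

print(raw_conversion_failed)
print(months_to_float(" 36 months"))
```

A purely categorical column such as grade or purpose has no number to extract, so this kind of parsing would not apply there.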
from pyspark.ml.regression import LinearRegression
# split into training and test partitions
train_data, test_data = data3.randomSplit([0.7, 0.3], seed=1)
lr = LinearRegression(labelCol="label", maxIter=100, regParam=0.3, elasticNetParam=0.8)
linearModel = lr.fit(train_data)
When I run lr.fit(train_data), I get the error below.
IllegalArgumentException Traceback (most recent call last)
<ipython-input-20-5f84d575334f> in <module>()
----> 1 linearModel = lr.fit(train_data)
/usr/local/spark/python/pyspark/ml/base.pyc in fit(self,dataset,params)
62 return self.copy(params)._fit(dataset)
63 else:
64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/usr/local/spark/python/pyspark/ml/wrapper.pyc in _fit(self, dataset)
263
264 def _fit(self, dataset):
265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
/usr/local/spark/python/pyspark/ml/wrapper.pyc in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u'requirement failed: Column label must be of type NumericType but was actually of type StringType.'
Please help. I have tried every method I could find for casting the string values to numeric, but it makes no difference. My target variable int_rate is a string type by default, even though its values are numeric. I also need to use all the columns of lr_data in my regression model. How can I do this? Thanks in advance :)