I am performing linear regression using the Weka Java API. The data set consists of a user ID, the URL visited by the user, and the time spent on the page. Because the URL is a String attribute, I am having trouble running linear regression on this data set. Is there a ready-to-use method in Weka that converts a String to an equivalent integer value? I have seen similar functionality in Mahout but could not find it in Weka. I could easily write a function that outputs an integer for a string by summing the ASCII values of its characters, but I want a more reliable and already tested solution.
2 Answers
You are correct that linear regression only operates on numeric values. However, it is not at all true that just any old conversion from categorical values to numbers will be fine. For example, hashing a string gives a number, but would give completely meaningless results as a feature for linear regression.
Numeric values are expected to have an ordering and a meaningful magnitude. What would it mean that "foo.com" is 135092 and "bar.com" is 985882? Linear regression would try to interpret "bar.com" as "something like 7 times larger than foo.com", which is nonsense.
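To make the point concrete, here is a minimal sketch comparing two arbitrary string-to-integer conversions: the character sum proposed in the question and Java's built-in `String.hashCode()`. The class `ArbitraryCodes` and its method `charSum` are hypothetical names used only for illustration; nothing here is part of Weka.

```java
// Hypothetical illustration, not part of Weka.
final class ArbitraryCodes {

    // Sum of the character codes, as proposed in the question.
    static int charSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] urls = {"foo.com", "bar.com", "ab.com", "ba.com"};
        for (String url : urls) {
            // Print both codes side by side; neither ordering says anything about the URLs.
            System.out.println(url + "  charSum=" + charSum(url)
                    + "  hashCode=" + url.hashCode());
        }
    }
}
```

Note that "ab.com" and "ba.com" get the same character sum, because addition ignores the order of the characters. The two conversions also assign unrelated magnitudes to the same URL, so any slope a regression fits to either number reflects the encoding rather than the data.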
You may be thinking of 1-of-n encoding, where you create a new 0/1 feature for every possible value (here, every URL). That won't be feasible for full URLs, since every distinct URL would need its own feature. For domains it might be.
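As a rough sketch of what 1-of-n encoding on domains could look like, the plain-Java example below reduces each URL to its host and then emits one 0/1 column per distinct domain. The class `DomainOneHot` and its methods are hypothetical helpers written for this illustration, not part of Weka, and the sketch assumes the URLs are well-formed enough for `java.net.URI` to parse.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Weka: 1-of-n encoding of URL domains.
final class DomainOneHot {

    // Extracts the host part of a URL, e.g. "http://foo.com/a" -> "foo.com".
    // Falls back to the raw string when no host can be parsed.
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? url : host;
    }

    // Builds the sorted list of distinct domains; each one becomes a 0/1 column.
    static List<String> vocabulary(List<String> urls) {
        TreeSet<String> domains = new TreeSet<>();
        for (String url : urls) {
            domains.add(domainOf(url));
        }
        return new ArrayList<>(domains);
    }

    // Returns a vector with a single 1.0 at the position of the URL's domain.
    // A domain missing from the vocabulary yields an all-zero vector.
    static double[] encode(String url, List<String> vocabulary) {
        double[] row = new double[vocabulary.size()];
        int index = vocabulary.indexOf(domainOf(url));
        if (index >= 0) {
            row[index] = 1.0;
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://foo.com/home", "http://bar.com/cart", "http://foo.com/about");
        List<String> vocab = vocabulary(urls);
        for (String url : urls) {
            System.out.println(url + " -> " + Arrays.toString(encode(url, vocab)));
        }
    }
}
```

Each column now answers a yes/no question ("is this visit on foo.com?"), so the regression learns a separate coefficient per domain instead of reading meaning into an arbitrary number.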
Although Weka's LinearRegression does not work with the String data type, you can try converting the website attribute to nominal (for example with Weka's StringToNominal filter), which the algorithm can handle. I am not sure exactly how the algorithm treats nominal values internally, but in my experience it improved results in some cases (nominal values can be treated as repeating categories, with a distance of roughly 1 between two values that are not equal, and so on).