1
votes

My question is how to read input from a file and convert the data lines of the file to a List[Map[Int,String]] using Scala. At the moment the dataset is given directly in the code. My code is:

  def id3(attrs: Attributes,
      examples: List[Example],
      label: Symbol
       ) : Node = {
level = level+1


  // if all the examples have the same label, return a new node with that label

  if(examples.forall( x => x(label) == examples(0)(label))){
  new Leaf(examples(0)(label))
  } else {
  for(a <- attrs.keySet-label){          //except label, take all attrs
    // print the gain; without println the formatted string is built and then discarded
    println("Information gain for %s is %f".format(a,
      informationGain(a,attrs,examples,label)))
  }


  // find the best splitting attribute - this is an argmax on a function over the list

  var bestAttr:Symbol = argmax(attrs.keySet-label, (x:Symbol) =>
    informationGain(x,attrs,examples,label))




  // now we produce a new branch, which splits on that node, and recurse down the nodes.

  var branch = new Branch(bestAttr)

  for(v <- attrs(bestAttr)){


    val subset = examples.filter(x=> x(bestAttr)==v)



    if(subset.size == 0){
      // println(levstr+"Tiny subset!")
      // zero subset, we replace with a leaf labelled with the most common label in
      // the examples
      val m = examples.map(_(label))
      val mostCommonLabel = m.toSet.map((x:Symbol) => (x,m.count(_==x))).maxBy(_._2)._1
      branch.add(v,new Leaf(mostCommonLabel))

    }
    else {
      // println(levstr+"Branch on %s=%s!".format(bestAttr,v))

      branch.add(v,id3(attrs,subset,label))
    }
   }
  level = level-1
  branch
  }
  }
  }
object samplet {
def main(args: Array[String]){

var attrs: sample.Attributes = Map()
attrs += ('0 -> Set('abc,'nbv,'zxc))
attrs += ('1 -> Set('def,'ftr,'tyh))
attrs += ('2 -> Set('ghi,'azxc))
attrs += ('3 -> Set('jkl,'fds))
attrs += ('4 -> Set('mno,'nbh))



val examples: List[sample.Example] = List(
  Map(
    '0 -> 'abc,
    '1 -> 'def,
    '2 -> 'ghi,
    '3 -> 'jkl,
    '4 -> 'mno
  ),
  ........................
  )


// obviously we can't use the label as an attribute, that would be silly!
val label = 'play

// call id3 as defined above; `try` is a reserved word and cannot be a method name
println(sample.id3(attrs,examples,label).getStr(0))

}
}

But how do I change this code to accept input from a .csv file?

2 Answers

4
votes

I suggest you use Java's io / nio standard library to read your CSV file. I think there is no relevant drawback in doing so.

But the first question we need to answer is where in the code the file should be read. The parsed input seems to replace the value of examples. This also tells us what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class:

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}

Note that the Charset is only needed if we must distinguish between differently encoded CSV-files.

Okay, so how do we implement the method? It should do the following:

  1. Create an appropriate input reader
  2. Read all lines
  3. Split each line at the comma-separator
  4. Transform each substring into the symbol it represents
  5. Build a map from the list of symbols, using the attributes as keys
  6. Create and return the list of maps

Or expressed in code:

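import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
// Note: JavaConversions is deprecated since Scala 2.12 and removed in 2.13
import scala.collection.JavaConversions

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {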
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file and discard the first line */
    inputWithHeader(reader).tail
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do this and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
    // Wrap the Java iterator of lines as a Scala iterator and parse each line in file order
    JavaConversions.asScalaIterator(reader.lines().iterator()).map(parseLine).toList
  }

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice versa, zip creates a list of the size of the shorter list. */
  private def parseLine(line: String): Map[Symbol, Symbol] = (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}

Caveat: Expecting only valid input, we are certain that the individual symbol representations do not contain the comma-separation character. If this cannot be assumed, then the code as is would fail to split certain valid input strings.
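To illustrate the caveat, here is a minimal sketch of a splitter that ignores separators inside double quotes. The names CsvSplit and splitLine are made up for this example, and the sketch assumes that quoted fields never contain escaped quotes or line breaks; for real-world CSV a dedicated parser library is probably the safer choice.

```scala
object CsvSplit {
  /** Splits one CSV line on the separator, except where it appears inside double quotes.
    * Hypothetical helper: does not handle escaped quotes ("") or multi-line fields. */
  def splitLine(line: String, separator: Char = ','): List[String] = {
    val fields = scala.collection.mutable.ListBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    for (c <- line) {
      if (c == '"') {
        inQuotes = !inQuotes            // toggle quoted state and drop the quote itself
      } else if (c == separator && !inQuotes) {
        fields += current.toString      // separator outside quotes ends the field
        current.clear()
      } else {
        current.append(c)
      }
    }
    fields += current.toString          // the last field has no trailing separator
    fields.toList
  }
}
```

With this, CsvSplit.splitLine("sunny,\"hot, dry\",high") yields the three fields sunny, hot, dry and high, whereas a plain split on "," would cut the quoted field in two.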

To use this new code, we could change the main-method as follows:

def main(args: Array[String]){
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code

Here, if no input argument is specified, examples falls back to exampleInput, which stands for the current hardcoded value of examples.

Important: In the code, all error handling has been omitted for convenience. Errors can occur whenever you read from files, and user input must always be treated as potentially invalid, so sadly, error handling at the boundaries of your program is usually not optional.
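As a rough idea of what error handling at that boundary could look like, here is a sketch that wraps the file access in scala.util.Try and always closes the reader. SafeCsvReading, readLines and describe are names invented for this example, and UTF-8 as the default charset is an assumption.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Path}
import scala.util.{Failure, Success, Try}

object SafeCsvReading {
  /** Reads all lines of a file. Any exception (missing file, bad encoding, ...)
    * is captured in the returned Try instead of being thrown. */
  def readLines(file: Path, charset: Charset = StandardCharsets.UTF_8): Try[List[String]] = Try {
    val reader = Files.newBufferedReader(file, charset)
    try {
      // readLine() returns null at end of input; that is the Java API's convention
      Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    } finally {
      reader.close() // runs whether reading succeeded or failed
    }
  }

  /** Example of reacting to both outcomes at the program boundary. */
  def describe(file: Path): String = readLines(file) match {
    case Success(lines) => s"read ${lines.size} lines from $file"
    case Failure(e)     => s"could not read $file: ${e.getClass.getSimpleName}"
  }
}
```

The caller then decides what a failure means, for example falling back to the hardcoded examples or exiting with a message.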

Side-notes:

  • Try not to use null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type-system.
  • The return-keyword is not required in Scala, because the value of the last expression in a method is returned automatically. You can still use the keyword if you find the code more readable or if you want to break in the middle of your method (which is usually a bad idea).
  • Prefer val over var, because immutable values are much easier to understand than mutable values.
  • The code will fail with the provided CSV string, because it contains the symbols TRUE and FALSE, which are not legal according to your program's logic (they should be true and false instead).
  • Add all information to your error-messages. Your error message only tells me that a value for the attribute 'wind is bad, but it does not tell me what the actual value is.
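
The last two notes could be addressed along these lines. This is only a sketch: SymbolParsing, parseValue and the set of legal values are hypothetical, and lowercasing the input is one possible way to accept TRUE and FALSE, not necessarily what your program should do.

```scala
object SymbolParsing {
  // Hypothetical set of legal values for one attribute, for illustration only
  val legalWindValues: Set[Symbol] = Set(Symbol("true"), Symbol("false"))

  /** Returns Right(symbol) for a legal value, or Left(message) where the message
    * names both the attribute and the offending raw value. */
  def parseValue(attribute: Symbol, raw: String, legal: Set[Symbol]): Either[String, Symbol] = {
    val candidate = Symbol(raw.trim.toLowerCase) // assumption: values are case-insensitive
    if (legal.contains(candidate)) Right(candidate)
    else {
      val expected = legal.map(_.name).toList.sorted.mkString(", ")
      Left(s"Bad value '$raw' for attribute ${attribute.name}; expected one of: $expected")
    }
  }
}
```

Returning Either instead of null or an exception keeps the failure visible in the type, and the message carries the actual value that was rejected.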
1
votes

First, read the CSV file:

import scala.io.Source

val datalines = Source.fromFile(filepath).getLines()

Here datalines is an Iterator[String] over all the lines of the CSV file. Note that an iterator can be traversed only once.

Next, convert each line into a Map[Int,String]

val datamap = datalines.map{ line =>
    line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
    }

Here, we split each line on ",". Then we construct a map whose keys are the column numbers (starting at 0) and whose values are the words obtained from the split.

If we want a List[Map[Int,String]] instead, we call .toList on the result:

val datamap = datalines.map{ line =>
    line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
    }.toList
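
Put together, the steps above might look like the following sketch. CsvToMaps and its method names are invented for this example. The toSymbolMaps helper rests on an assumption about the question, namely that Example is a Map[Symbol, Symbol] keyed by '0, '1, and so on.

```scala
import scala.io.Source

object CsvToMaps {
  /** Converts CSV lines into one Map per line, keyed by zero-based column index. */
  def toMaps(lines: Iterator[String]): List[Map[Int, String]] =
    lines.map { line =>
      line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
    }.toList

  /** Reads the file at filepath and closes it afterwards.
    * toMaps builds the whole list before the source is closed. */
  def fromFile(filepath: String): List[Map[Int, String]] = {
    val source = Source.fromFile(filepath)
    try toMaps(source.getLines()) finally source.close()
  }

  /** Assumption: the question's examples are Map[Symbol, Symbol] with keys '0, '1, ... */
  def toSymbolMaps(rows: List[Map[Int, String]]): List[Map[Symbol, Symbol]] =
    rows.map(_.map { case (idx, word) => Symbol(idx.toString) -> Symbol(word) })
}
```

For example, CsvToMaps.toMaps(Iterator("abc,def,ghi")) gives List(Map(0 -> "abc", 1 -> "def", 2 -> "ghi")).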