0
votes

I'm trying to build a POS tagger in Clojure. I need to iterate over a file and build out feature vectors. The input is (text pos chunk) triples from a file like the following:

input from the file:  
        I PP B-NP
        am VBP B-VB
        groot NN B-NP

I've written functions to input the file, transform each line into a map, and then slide over a variable amount of the data.

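(require 'clojure.java.io 'clojure.string)

(defn lazy-file-lines
  "Open a file and return its lines as a lazy sequence."
  [filename]
  ;; The reader is closed only once the sequence has been fully consumed.
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader filename))))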

(defn to-map
  "Take a sequence of lines from a file and turn each line into a map."
  [lines]
  ;; Each line is already a string, so it can be split directly.
  (map #(zipmap [:text :pos :chunk] (clojure.string/split % #" "))
       lines))

(defn window
  "create windows around the target word."
  [size filelines]
  (partition size 1 [] filelines))

I plan to use the above functions in the following way:

(take 2 (window 3 (to-map (lazy-file-lines "/path/to/train.txt"))))

which gives the following output for the first two entries in the sequence:

(({:chunk B-NP, :pos NN, :text Confidence} {:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the}) ({:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the} {:chunk I-NP, :pos NN, :text pound}))   

Given each sequence of maps within the sequence, I want to extract :pos and :text for each map and put them in one vector. Like so:

[Confidence in the NN IN DT]
[in the pound IN DT NN]

I haven't been able to work out how to do this in Clojure. My partial attempt is below:

(defn create-features
  "creates the features and tags from the datafile."
  [filename windowsize  & features]
 (map  #(apply select-keys % [:text :pos])
   (->>
    (lazy-file-lines filename)
    (window windowsize))))   

I think one of the issues is that apply is referencing a sequence itself, so select-keys isn't operating on a map. I'm not sure how to nest another apply function into this, though.

Any thoughts on this code would be great. Thanks.

2
If your question is really just about how to flatten a sequence of sequences of maps, then the first two code blocks and the description of the purpose of the maps, etc., just clutter up the question. Irrelevant information makes it less likely that you'll get an answer quickly. In this particular question, it would be helpful to give an example of the kind of sequence of sequences of maps you're trying to process as input, and an illustration of what you want as output. (If you're not sure whether the extra material is relevant, then explain why--in that case, it's part of the question.) - Mars
I think that what you actually want is not just the flattening operation, but a select by key and then flatten. - Mars
to-map is never used........? How should one understand what is asked here? What is the purpose of windowsize? How does the input relate to the "super basic output"? What problem are you trying to solve? - Leon Grapenthin
oops. I clarified the question to include what I'm actually doing. - chebyshev

2 Answers

1
votes

I'm not entirely sure what you want as input and output, and to be honest, I don't want to work through all of the code that you've provided to figure that out, since I don't think that all of the code is essential to the question. Someone else may give you an answer that's narrowly tailored to your code, but I think the real question is more general.

I'm guessing that the general idea of what you want to implement is that:

Given a sequence of sequences of maps, select the map entries that have particular keys, and return a sequence of vectors representing those entries. Even if that's not exactly what you want, I think that the following will probably give you an idea about how to proceed.

This method is not the most efficient or concise, but it breaks the problem down into a series of steps that are easy to understand:

(defn selkeys-or-not
  "Like select-keys, but returns nil rather than {} if no keys match."
  [keys map]
  (not-empty (select-keys map keys)))

(defn seq-seqs-maps-to-seq-vecs
  "Given a sequence of keys, and a sequence of sequences of maps,
  returns a sequence of vectors, where each vector contains key-val
  pairs from the maps for matching keys."
  [keys seq-seqs-maps]
  (let [maps (flatten seq-seqs-maps)]
    (map vec
         (apply concat
                (filter identity
                        (map (partial selkeys-or-not keys) maps))))))

What's happening in the second function:

First, we flatten the outer sequence, since the fact that the maps are within inner sequences is irrelevant to our goals. This gives us a single sequence of maps.

Then we map a helper function selkeys-or-not over the sequence of maps, passing our keys to the helper function. select-keys returns {} when it finds nothing, but {} is truthy, and we want a falsey value in this case for the next step. selkeys-or-not returns a falsey value (nil) instead of {}.

Now we can filter out the nils using filter identity. filter returns a sequence containing every value for which its first argument (the predicate) returns a truthy value.

At this point we have a sequence of maps, but we want a sequence of vectors instead. Applying concat turns the sequence of maps into a sequence of map entries, and mapping vec over them turns the map entries into vectors.
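
The steps above can be tried on a small hand-written input. This is only a sketch: sample-windows is hypothetical data shaped like the windows in the question, with string values assumed, and the two functions are repeated so the snippet runs on its own.

```clojure
;; Same two functions as above, repeated so this snippet is self-contained.
(defn selkeys-or-not
  [keys map]
  (not-empty (select-keys map keys)))

(defn seq-seqs-maps-to-seq-vecs
  [keys seq-seqs-maps]
  (let [maps (flatten seq-seqs-maps)]
    (map vec
         (apply concat
                (filter identity
                        (map (partial selkeys-or-not keys) maps))))))

;; Hypothetical sample data: one window of two maps,
;; plus a window whose only map has no matching keys.
(def sample-windows
  [[{:text "Confidence" :pos "NN" :chunk "B-NP"}
    {:text "in" :pos "IN" :chunk "B-PP"}]
   [{:other "ignored"}]])

(seq-seqs-maps-to-seq-vecs [:text :pos] sample-windows)
;; => ([:text "Confidence"] [:pos "NN"] [:text "in"] [:pos "IN"])
```

Note that the result is a single flat sequence of [key value] pairs rather than one vector per window. With overlapping windows, flattening also repeats each map once for every window it appears in.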

0
votes
(defn extract-line-seq
  [ls]
  (concat (map :text ls)
          (map :pos ls)))

(extract-line-seq '({:chunk B-NP, :pos NN, :text Confidence} {:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the}))

;-> (Confidence in the NN IN DT)

If you want a vector, wrap the call in vec outside the function. That way laziness remains an option for the caller.
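
To get one vector per window, as in the question, the function can be mapped over the windows. A minimal sketch, assuming the tokens have already been parsed into maps with string values; sample-tokens and window-features are illustrative names that do not appear in the original code, and the data is built in memory rather than read from a file.

```clojure
(defn extract-line-seq
  [ls]
  (concat (map :text ls)
          (map :pos ls)))

;; Hypothetical in-memory stand-in for the parsed file.
(def sample-tokens
  [{:text "Confidence" :pos "NN" :chunk "B-NP"}
   {:text "in"         :pos "IN" :chunk "B-PP"}
   {:text "the"        :pos "DT" :chunk "B-NP"}
   {:text "pound"      :pos "NN" :chunk "I-NP"}])

(defn window-features
  "Returns a lazy sequence with one feature vector per full window."
  [size tokens]
  (map (comp vec extract-line-seq)
       (partition size 1 tokens)))

(window-features 3 sample-tokens)
;; => (["Confidence" "in" "the" "NN" "IN" "DT"] ["in" "the" "pound" "IN" "DT" "NN"])
```

This sketch calls partition without a pad collection, so incomplete windows at the end are dropped. The window function in the question passes [] as padding, which keeps one shorter window at the end; which behaviour is preferable depends on how the tagger should treat the edges of the data.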