I'm trying to build a POS tagger in Clojure. I need to iterate over a file and build feature vectors. The input is (text pos chunk) triples, one per line, like the following:
groot NN B-NP
I've written functions to input the file, transform each line into a map, and then slide over a variable amount of the data.
(defn lazy-file-lines
  "Open a file and return its lines as a lazy sequence."
  [filename]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                ;; close the reader once the file is exhausted
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader filename))))
(defn to-map
  "Take lines from a file and turn each one into a map."
  [lines]
  (map #(zipmap [:text :pos :chunk] (clojure.string/split (apply str %) #" ")) lines))
(defn window
  "Create windows around the target word."
  [size filelines]
  (partition size 1 [] filelines))
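As a quick illustration of what window produces, here is a small sketch on a toy vector of keywords (the toy input and the name toy-windows are made up for illustration, not taken from my file). Because the pad collection is empty, the last window comes out shorter than size:

```clojure
;; Same definition as above, repeated so the snippet runs on its own.
(defn window
  [size coll]
  (partition size 1 [] coll))

;; Toy input (hypothetical, not from the training file).
(def toy-windows (window 3 [:a :b :c :d]))
;; toy-windows => ((:a :b :c) (:b :c :d) (:c :d))
```

So any feature extraction over the windows probably has to cope with a short final window.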
I plan to use the above functions in the following way:
(take 2 (window 3 (to-map (lazy-file-lines "/path/to/train.txt"))))
which gives the following output for the first two entries in the sequence:
(({:chunk B-NP, :pos NN, :text Confidence} {:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the}) ({:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the} {:chunk I-NP, :pos NN, :text pound}))
Given each sequence of maps within the sequence, I want to extract :pos and :text for each map and put them in one vector. Like so:
[Confidence in the NN IN DT]
[in the pound IN DT NN]
I've not been able to conceptualize how to handle this in Clojure. My partial attempted solution is below:
(defn create-features
"creates the features and tags from the datafile."
[filename windowsize & features]
(map #(apply select-keys % [:text :pos])
(lazy-file-lines filename)
(window windowsize))))
I think one of the issues is that apply is referencing a sequence itself, so select-keys isn't operating on a map. I'm not sure how to nest another apply function into this, though.
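To make the goal concrete, here is a minimal sketch of the per-window step I have in mind, assuming each window is a sequence of maps shaped like the output above. The names window->features and sample-windows are hypothetical, and the sample data is typed in by hand from that output. It relies on keywords acting as functions of maps, so it avoids apply altogether. It also drops select-keys, which returns a map rather than the bare values:

```clojure
(defn window->features
  "Hypothetical helper: turn one window (a seq of token maps) into a flat
  vector of all :text values followed by all :pos values."
  [tokens]
  (into (mapv :text tokens) (map :pos tokens)))

;; Hand-typed sample mirroring the first two windows shown earlier.
(def sample-windows
  [[{:chunk "B-NP" :pos "NN" :text "Confidence"}
    {:chunk "B-PP" :pos "IN" :text "in"}
    {:chunk "B-NP" :pos "DT" :text "the"}]
   [{:chunk "B-PP" :pos "IN" :text "in"}
    {:chunk "B-NP" :pos "DT" :text "the"}
    {:chunk "I-NP" :pos "NN" :text "pound"}]])

(def sample-features (map window->features sample-windows))
;; sample-features
;; => (["Confidence" "in" "the" "NN" "IN" "DT"]
;;     ["in" "the" "pound" "IN" "DT" "NN"])
```

I'm not sure whether this is the idiomatic way to do it, or how best to wire it into create-features.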
Any thoughts on this code would be great. Thanks.
is never used........? How should one understand what is asked here? What is the purpose of windowsize? How does the input relate to the "super basic output"? What problem are you trying to solve? – Leon Grapenthin