I'm trying to build a POS tagger in Clojure. I need to iterate over a file and build out feature vectors. The input is a file of (text pos chunk) triples, one per line, like the following:
I PP B-NP
am VBP B-VB
groot NN B-NP
I've written functions to read the file, transform each line into a map, and then slide a variable-size window over the data.
(defn lazy-file-lines
  "open a file and make it a lazy sequence."
  [filename]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader filename))))
(defn to-map
  "take lines from a file and turn each one into a map."
  [lines]
  (map #(zipmap [:text :pos :chunk]
                (clojure.string/split (apply str %) #" "))
       lines))
(defn window
  "create windows around the target word."
  [size filelines]
  (partition size 1 [] filelines))
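To make the behavior of these helpers concrete, here is a small sketch that runs them on the three sample lines from the top of the question instead of a file. The definitions are condensed copies of `to-map` and `window`, and `sample-lines` is a hypothetical stand-in for the output of `lazy-file-lines`.

```clojure
(require '[clojure.string :as str])

;; Condensed copies of to-map and window so the snippet runs on its own.
(defn to-map [lines]
  (map #(zipmap [:text :pos :chunk] (str/split % #" ")) lines))

(defn window [size filelines]
  (partition size 1 [] filelines))

;; Hypothetical stand-in for (lazy-file-lines "...").
(def sample-lines ["I PP B-NP" "am VBP B-VB" "groot NN B-NP"])

(first (to-map sample-lines))
;; => {:text "I", :pos "PP", :chunk "B-NP"}

;; Step 1 with an empty pad: full windows, then one shorter trailing window.
(map #(map :text %) (window 2 (to-map sample-lines)))
;; => (("I" "am") ("am" "groot") ("groot"))
```

Note that with an empty pad collection, `partition` emits one final window that is shorter than `size`, so downstream code should not assume every window has exactly `size` entries.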
I plan to use the above functions in the following way:
(take 2 (window 3 (to-map (lazy-file-lines "/path/to/train.txt"))))
which gives the following output for the first two entries in the sequence:
(({:chunk B-NP, :pos NN, :text Confidence} {:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the}) ({:chunk B-PP, :pos IN, :text in} {:chunk B-NP, :pos DT, :text the} {:chunk I-NP, :pos NN, :text pound}))
Given each sequence of maps within the sequence, I want to extract :pos and :text from each map and put them in one vector, like so:
[Confidence in the NN IN DT]
[in the pound IN DT NN]
I've not been able to work out how to handle this in Clojure. My partial attempted solution is below:
(defn create-features
  "creates the features and tags from the datafile."
  [filename windowsize & features]
  (map #(apply select-keys % [:text :pos])
       (->> (lazy-file-lines filename)
            (window windowsize))))
I think one of the issues is that apply is referencing a sequence itself, so select-keys isn't operating on a map. I'm not sure how to nest another apply function into this, though.
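A minimal sketch of that diagnosis, assuming a window shaped like the output shown earlier. `apply` spreads its last argument, so `(apply select-keys w [:text :pos])` turns into `(select-keys w :text :pos)`, which is the wrong number of arguments for `select-keys` and is also handed the whole window rather than a single map. Mapping over the maps inside each window sidesteps both problems. `sample-window` and `window->features` are hypothetical names used only for illustration.

```clojure
;; Hypothetical window, shaped like the first entry of the output above.
(def sample-window
  [{:text "Confidence" :pos "NN" :chunk "B-NP"}
   {:text "in"         :pos "IN" :chunk "B-PP"}
   {:text "the"        :pos "DT" :chunk "B-NP"}])

;; select-keys takes one map and a key sequence, so map it over the window.
(map #(select-keys % [:text :pos]) sample-window)

;; Hypothetical helper: every :text value, then every :pos value, in one vector.
(defn window->features
  [w]
  (into (mapv :text w) (map :pos w)))

(window->features sample-window)
;; => ["Confidence" "in" "the" "NN" "IN" "DT"]
```

Something like `(map window->features windows)` would then produce one vector per window, provided the windows contain maps. In `create-features` as written they would still be raw strings, since `to-map` is not called between `lazy-file-lines` and `window`.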
Any thoughts on this code would be great. Thanks.
to-map is never used...? How should one understand what is asked here? What is the purpose of windowsize? How does the input relate to the "super basic output"? What problem are you trying to solve? – Leon Grapenthin