Can I use lex/yacc to parse IMDB data, or is neither feasible because of the structure of the data?

Question

I am not an expert, but I have worked with both tools before and more or less got something working in another project. I am actually doing this in Java with jflex/byaccJ. I downloaded ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/movies.list.gz.

If you look at the movies.list file, it looks well structured at first: the title in double quotes, followed by a year in (), followed by another year after some tabs. (I am not sure yet about the semantics of the two years.)

"What It Is" (2004)                 2004

If the entry is an episode, then the title is the title of the series and there is more data in curly brackets:

"Breaking Bad" (2008) {Cornered (#4.6)}         2011

The year could be: 2014, ????, 2012-2014, 2014-????, 2014/II … I can handle this.

There are more optional things: (V), (TV), (VG), {{SUSPENDED}}. I would call these TAGs.

The bad part: later in the file the titles are no longer enclosed in double quotes. Brackets are also used elsewhere, so I cannot rely on them alone to figure out the structure.

A través de A(lan) Glass (2006)             2006
Michi o tsugu mono (zempen) (1994) (V)          1994
"The Gayle King Show" (1997) {(1997-11-07)}     1997

My main question is whether it is possible to use jflex/byaccJ on the given data, or whether the data is too unstructured to have a feasible grammar.

jflex: my 1st approach was to make one rule/token for WORDs and one for YEARs, but since the characters "()1-9" are valid in WORDs too, I cannot distinguish the two.

2nd approach: make a rule for a string in brackets and, when one is found, check explicitly whether it is a YEAR, a TAG (e.g. (V), (VG)) or a WORD.

3rd approach: could I make use of states? In another project I used them to catch strings enclosed in "". I am not sure whether they would come in handy here.

Writing this, I think I will try the 2nd approach. I am concerned about putting too much logic in the lexer, but if this is the only possible way then I should try it anyway.
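
For illustration, the explicit check from the 2nd approach could be a small Java helper that a lexer action calls with the matched bracketed text. This is only a rough sketch: the class name `BracketClassifier`, the `Kind` enum and the exact patterns are made up here, and the year and tag forms are limited to the ones listed above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: decides what a "(...)" chunk is once the lexer has matched it.
final class BracketClassifier {
    enum Kind { YEAR, TAG, WORD }

    // (2014), (????), (2014/II): year forms assumed from the examples in this question.
    private static final Pattern YEAR =
        Pattern.compile("\\((?:\\d{4}|\\?{4})(?:/[IVXLC]+)?\\)");

    // (V), (TV), (VG): the tags seen so far; the real list may be longer.
    private static final Pattern TAG =
        Pattern.compile("\\((?:V|TV|VG)\\)");

    static Kind classify(String bracketed) {
        if (YEAR.matcher(bracketed).matches()) return Kind.YEAR;
        if (TAG.matcher(bracketed).matches()) return Kind.TAG;
        return Kind.WORD; // anything else is treated as part of the title
    }

    public static void main(String[] args) {
        String[] samples = {"(2006)", "(????)", "(2014/II)", "(V)", "(zempen)", "(lan)"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A chunk like (1997-11-07) falls through to WORD here, which seems acceptable as long as it only occurs inside the curly brackets of an episode.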

Thanks for reading and rubberducking. I would still be interested to hear if you think it cannot be done with lex/yacc.

Dean Taylor · Accepted Answer · 2014-03-03T01:58:40

You will find more information on the actual data format of the file inside this tool: ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/tools/unix/moviedb-3.24.tar.gz

Look at the docs/ADDS-GUIDE file.

The format looks simpler if you ignore the title and work from the right-hand side, using it as an anchor / starting point.

To me it looks like a single regex will do the job; I'll leave the actual work to you.
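
As a rough starting point, here is a minimal Java sketch of that idea. It assumes the field order seen in the samples from the question (title, year in brackets, optional tags, optional episode in curly brackets, optional {{SUSPENDED}}, whitespace, trailing year), not the definitive format from ADDS-GUIDE. The class name `MovieLine` and the group names are invented for the example. The title is matched lazily, so the fixed-format fields on the right decide where it ends.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-regex parser for one line of movies.list.
final class MovieLine {
    private static final String YEAR = "(?:\\d{4}|\\?{4})";

    private static final Pattern LINE = Pattern.compile(
        "(?<title>.+?)"                                         // lazy: pinned by what follows
      + " \\((?<year>" + YEAR + ")(?:/(?<dup>[IVXLC]+))?\\)"    // (2004), (????), (2014/II)
      + "(?<tags>(?: \\((?:V|TV|VG)\\))*)"                      // optional (V), (TV), (VG)
      + "(?: \\{(?<episode>[^{}]*)\\})?"                        // optional {Cornered (#4.6)}
      + "(?<suspended> \\{\\{SUSPENDED\\}\\})?"                 // optional {{SUSPENDED}}
      + "\\s+(?<span>" + YEAR + "(?:-" + YEAR + ")?)");         // trailing 2011, 2012-2014, 2014-????

    // Returns the matcher with its groups filled in, or empty if the line does not fit.
    static Optional<Matcher> parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? Optional.of(m) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"Breaking Bad\" (2008) {Cornered (#4.6)}\t\t2011",
            "Michi o tsugu mono (zempen) (1994) (V)\t\t1994",
            "\"The Gayle King Show\" (1997) {(1997-11-07)}\t1997"
        };
        for (String s : samples) {
            parse(s).ifPresent(m -> System.out.println(
                m.group("title") + " | " + m.group("year") + " | "
                + m.group("tags").trim() + " | " + m.group("episode") + " | " + m.group("span")));
        }
    }
}
```

Because `matches()` has to consume the whole line, an early bracketed word such as (zempen) cannot be mistaken for the year: the lazy title simply grows until everything to its right fits. Lines that do not match should be logged and checked against ADDS-GUIDE rather than silently dropped.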

Consider looking for an existing library that already does the job; a simple search found these:
