Short answer to my first question: AWS does not do automatic LZO indexing. I've confirmed this with my own job, and Andrew@AWS said the same on their forum.
Here's how you can do the indexing:
To index the LZO files, you'll need your own Jar built from the Twitter hadoop-lzo project. Build the Jar somewhere, then upload it to Amazon S3 if you want to index directly with EMR.
On a side note, Cloudera has good instructions covering all the steps for setting this up on your own cluster. I did this on my local cluster, which let me build the Jar and upload it to S3. You can probably find a pre-built Jar online if you don't want to build it yourself.
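For reference, a rough sketch of the build-and-upload steps. The repository URL, build command, and bucket name are placeholders/assumptions: depending on the hadoop-lzo version the build may be Maven- or Ant-based, and you'll need the native LZO development libraries installed first.

```shell
# Grab the Twitter hadoop-lzo source (URL is illustrative).
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo

# Build the Jar; newer versions use Maven, older ones use Ant.
# Requires the LZO headers (e.g. liblzo2-dev) for the native bindings.
mvn clean package -Dmaven.test.skip=true

# Upload the resulting Jar to S3 so EMR can reach it
# (bucket name is a placeholder; s3cmd or the AWS CLI both work).
s3cmd put target/hadoop-lzo-*.jar s3://<yourBucketName>/
```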
When outputting your data from your Hadoop job, make sure you use LzopCodec and not LzoCodec; otherwise the files are not indexable (at least in my experience). Example Java code (the same idea carries over to the Streaming API):
import com.hadoop.compression.lzo.LzopCodec;
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
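With the Streaming API, the same thing is done via configuration properties on the command line. A sketch, assuming the older mapred.* property names from that Hadoop generation and an illustrative streaming Jar path; the cat mapper/reducer is just for demonstration:

```shell
# Emit LZOP-compressed output from a streaming job
# (property names are the pre-Hadoop-2 mapred.* variants).
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -input myInput \
  -output myLzoOutput \
  -mapper /bin/cat \
  -reducer /bin/cat
```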
Once your hadoop-lzo Jar is on S3 and your Hadoop job has output .lzo files, run the indexer on the output directory (the command below assumes you already have an EMR job/cluster running):
elastic-mapreduce -j <existingJobId> \
--jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--args com.hadoop.compression.lzo.DistributedLzoIndexer \
--args s3://<yourBucketName>/output/myLzoJobResults \
--step-name "Lzo file indexer Jar"
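If you'd rather run the indexer from the master node (or a local cluster) instead of adding an EMR step, the same class can be invoked directly with hadoop jar. The Jar path here is a placeholder:

```shell
# Index the job output in place (paths are illustrative).
hadoop jar /path/to/hadoop-lzo-0.4.17-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  s3://<yourBucketName>/output/myLzoJobResults
```

The hadoop-lzo project also ships a single-process LzoIndexer, which can be simpler for small data sets.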
Then when you're using the data in a future job, be sure to specify that the input is in LZO format, otherwise the splitting won't occur. Example Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;
job.setInputFormatClass(LzoTextInputFormat.class);