
There is a large dataset on a public server (~0.5 TB, split into multiple parts) that I would like to copy into my own S3 bucket. It seems that aws s3 cp only works with local files or files already stored in S3 buckets?

How can I copy that file (either as a single file or in multiple parts) into S3? Can I use the AWS CLI, or do I need to use something else?

So... you're saying the dataset in question is not on S3, but you want to download it and store it on S3... right? – Michael - sqlbot
@Michael-sqlbot Exactly – Victor.dMdB
But I'm running this from an EC2 instance, so I don't want to download the whole thing to the EC2 instance and then upload it to S3; loading it directly into S3 is what I'm looking for. – Victor.dMdB

1 Answer


There's no way to upload it to S3 directly from the remote location, but you can stream the contents of the remote files through your machine and on to S3. This means you will still have transferred the entire 0.5 TB of data, but your machine will only ever hold a tiny fraction of it in memory at a time (it is not persisted to disk either). Here is a simple implementation in JavaScript:

const request = require('request')
const async = require('async')
const AWS = require('aws-sdk')
const s3 = new AWS.S3()
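// Note: 'nyu_depth_v2' just mirrors the dataset's name; your own bucket
// name has to follow S3 naming rules (no underscores, for instance).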
const Bucket = 'nyu_depth_v2'
const baseUrl = 'http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/'
const parallelLimit = 5
const parts = [
  'basements.zip',
  'bathrooms_part1.zip',
  'bathrooms_part2.zip',
  'bathrooms_part3.zip',
  'bathrooms_part4.zip',
  'bedrooms_part1.zip',
  'bedrooms_part2.zip',
  'bedrooms_part3.zip',
  'bedrooms_part4.zip',
  'bedrooms_part5.zip',
  'bedrooms_part6.zip',
  'bedrooms_part7.zip',
  'bookstore_part1.zip',
  'bookstore_part2.zip',
  'bookstore_part3.zip',
  'cafe.zip',
  'classrooms.zip',
  'dining_rooms_part1.zip',
  'dining_rooms_part2.zip',
  'furniture_stores.zip',
  'home_offices.zip',
  'kitchens_part1.zip',
  'kitchens_part2.zip',
  'kitchens_part3.zip',
  'libraries.zip',
  'living_rooms_part1.zip',
  'living_rooms_part2.zip',
  'living_rooms_part3.zip',
  'living_rooms_part4.zip',
  'misc_part1.zip',
  'misc_part2.zip',
  'office_kitchens.zip',
  'offices_part1.zip',
  'offices_part2.zip',
  'playrooms.zip',
  'reception_rooms.zip',
  'studies.zip',
  'study_rooms.zip'
]

// Transfer up to `parallelLimit` files at a time. request() returns a
// readable stream, and s3.upload() accepts a stream as its Body,
// managing the multipart upload itself, so nothing is buffered to disk.
async.eachLimit(parts, parallelLimit, (Key, cb) => {
  s3.upload({
    Key,
    Bucket,
    Body: request(baseUrl + Key)
  }, cb)
}, (err) => {
  if (err) console.error(err)
  else console.log('Done')
})
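
Since the question also asks about the AWS CLI: aws s3 cp can read from standard input if you give - as the source, so for a single file you could likewise pipe the download straight through, along the lines of curl <url> | aws s3 cp - s3://your-bucket/<key> (pass --expected-size for very large files so the CLI can size the multipart upload correctly); your-bucket is a placeholder here.

As a side note, the request package has since been deprecated, but any readable stream works as the Body, e.g. the response from Node's built-in http module. A minimal single-file sketch, using the same bucket and base URL as above:

const http = require('http')
const AWS = require('aws-sdk')
const s3 = new AWS.S3()

http.get('http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/cafe.zip', (res) => {
  s3.upload({
    Bucket: 'nyu_depth_v2',
    Key: 'cafe.zip',
    Body: res // the HTTP response is itself a readable stream
  }, (err, data) => {
    if (err) console.error(err)
    else console.log('Uploaded to', data.Location)
  })
})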