2 votes

Can I move some columns (cc_payment, keyid), as shown below, from a table in the Cassandra keyspace billing to another keyspace, payments? payment_info is going to be a new table.

Is there any way I can move them directly? Or do I need to export to a CSV file with COPY TO and import it with COPY FROM? Since the data is huge, I am looking for options to move it directly from one keyspace to the other. We are using DataStax Cassandra.

Appreciate your help.

    FROM
========

keyspace:  billing
create table if not exists billing_info (
      user_id text,
      billing_id timeuuid,
      cc_payment frozen<cc_payment>,
      keyid text,
      PRIMARY KEY((user_id), billing_id)
) WITH CLUSTERING ORDER BY (billing_id DESC);

    TO
======
keyspace:  payments
create table if not exists payment_info (
      user_id text,
      payment_id timeuuid,
      cc_payment frozen<cc_payment>,
      keyid text,
      PRIMARY KEY((user_id), payment_id)
) WITH CLUSTERING ORDER BY (payment_id DESC);
COPY will not work if there are more than 2 million records. Also, the primary keys are different in the two tables, so you will not be able to do this directly without any manual work. I think you might have to write code to add the new columns to existing rows. – Anil Kapoor
@AnilKapoor I've used COPY to export more than 300 million rows before. It's all about controlling the page and batch sizes. – Aaron
@Aaron can you please provide some detail on this? I am interested to see this. – Anil Kapoor
@AnilKapoor I describe the relevant properties in this answer: stackoverflow.com/questions/41448374/… – Aaron

2 Answers

2 votes

There are multiple ways to do this:

Copy files directly and then change table structure

Because the tables differ only in one column name, it could be much faster to copy the SSTable files directly, as follows:

  • Create a table payments.payment_info with exactly the same structure as billing.billing_info (see the CQL sketch after this list)
  • Stop writing to billing.billing_info
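
A minimal CQL sketch of the first step, keeping the original billing_id column name so the copied files match the schema exactly (the column is renamed only at the very end). Note that the cc_payment user-defined type must also exist in the payments keyspace, since UDTs are scoped per keyspace:

create type if not exists payments.cc_payment ( ... );   -- same fields as in billing (not shown in the question)

create table if not exists payments.payment_info (
      user_id text,
      billing_id timeuuid,
      cc_payment frozen<cc_payment>,
      keyid text,
      PRIMARY KEY((user_id), billing_id)
) WITH CLUSTERING ORDER BY (billing_id DESC);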

Then, on every node of the cluster, do the following (a consolidated shell sketch follows the list):

  • flush the table: nodetool flush billing billing_info
  • change to Cassandra's data directory
  • under the same user that is running Cassandra, copy the files from billing/billing_info-<ID_of_the_table>/ into payments/payment_info-<ID_of_the_table>/
  • execute nodetool refresh payments payment_info
  • check in cqlsh that the data is available
  • rename the column with: ALTER TABLE payments.payment_info RENAME billing_id TO payment_id;
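
Put together, the per-node steps might look like the following shell sketch, assuming the default data directory /var/lib/cassandra/data (adjust for your data_file_directories setting; the <ID> suffixes differ between the two tables, so substitute the real directory names):

nodetool flush billing billing_info
cd /var/lib/cassandra/data
# copy as the user running Cassandra; each directory name carries that table's own ID
cp billing/billing_info-<ID>/* payments/payment_info-<ID>/
nodetool refresh payments payment_info

The final ALTER TABLE ... RENAME is executed once from cqlsh rather than per node; it works here because billing_id is a clustering column, and Cassandra only allows renaming columns that are part of the primary key.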

Migrate data by copying, using, for example, DSBulk or Spark.

If you're using DSE, then you can use DSBulk (it's better to take the latest version) to unload data from one table and load it into another. This can work without creating an intermediate copy, by writing the data to standard output and reading it from standard input via a Unix pipe, although in that case it will be slower because it can't achieve the necessary parallelism.

In the simplest case it is invoked as follows, providing a mapping between the changed field names (see the documentation for details):

dsbulk unload -k ks1 -t table1 -c json | dsbulk load -k ks2 -t table2 -c json -m "mapping_to_accommodate_changes_in_field_names"
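
For the tables in this question, the invocation might look like this (a sketch; -m is the shorthand for --schema.mapping, and each entry maps an unloaded field name to a target column name):

dsbulk unload -k billing -t billing_info -c json | dsbulk load -k payments -t payment_info -c json -m "user_id = user_id, billing_id = payment_id, cc_payment = cc_payment, keyid = keyid"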

But the task will be more complex if you need to copy not only the data but also metadata such as TTL and writetime. In that case you'll need to export them explicitly, and then load the data in several batches, one for each column separately.
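For example, the writetime and TTL of a regular column can be exported with a custom query (a sketch; CQL's writetime() and ttl() functions accept one non-key column at a time, so repeat per column):

dsbulk unload -query "SELECT user_id, billing_id, keyid, writetime(keyid) AS keyid_writetime, ttl(keyid) AS keyid_ttl FROM billing.billing_info"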

1 vote

With Spark you can use this little snippet. You can do whatever transformation you need in updateColumns:

import com.datastax.spark.connector._

val myKeyspace = "oldkeyspace"
val myTable = "oldtable"
val newKeyspace = "newkeyspace"
val newTable = "newtable"

// transform each row on its way into the new table
def updateColumns(row: CassandraRow): CassandraRow = {
    val inputMap = row.toMap
    val newData = Map("newColumn" -> "somevalue")
    CassandraRow.fromMap(inputMap ++ newData)
}

sc.cassandraTable(myKeyspace, myTable)
  .map(updateColumns)
  .saveToCassandra(newKeyspace, newTable)
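
Adapted to this question's schema, updateColumns could instead rename the clustering column (a sketch, assuming the billing_id value should simply be carried over as payment_id):

def renameBillingId(row: CassandraRow): CassandraRow = {
    val m = row.toMap
    // drop the old key and re-add its value under the new column name
    CassandraRow.fromMap(m - "billing_id" + ("payment_id" -> m("billing_id")))
}

sc.cassandraTable("billing", "billing_info")
  .map(renameBillingId)
  .saveToCassandra("payments", "payment_info")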