
How can I transfer a huge DataFrame to a Hive table from Python?

I want to transfer a DataFrame that is already in Python's memory directly into a Hive table. I create the connection with SQLAlchemy's create_engine, and I can write the DataFrame to the table. Below is how I applied this method.

from sqlalchemy import create_engine
from multiprocessing import Lock, Process, Manager

engine = create_engine(...)  # some configuration parameters omitted here
df.to_sql(name='TABLE_1', con=engine, if_exists='append', index=False, method='multi', chunksize=10000)

Using the above code consumes all my memory and takes a very long time. After that, I tried to parallelize it with multiprocessing:

def func_write(ns, l):
    df1 = ns.df
    engine = generate_new_engine()  # generate a new engine for each worker process
    l.acquire()
    df1.to_sql(name="TABLE_1", con=engine, if_exists="append", index=False, method="multi", chunksize=10000)
    ns.df = df1
    l.release()

It writes, but it is totally inefficient. Here is the driver code:

if __name__ == "__main__":
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = df  # share the dataframe with the worker process
    l = Lock()
    p = Process(target=func_write, args=(ns, l))
    p.start()
    p.join()
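
To be clear about the direction I am trying to go: roughly, split the DataFrame into slices and let each worker process write its own slice with its own engine. The sketch below is only an illustration; the connection string, the number of workers and the sample data are placeholders, and I have not verified that this is actually faster against Hive.

import numpy as np
import pandas as pd
from multiprocessing import Pool
from sqlalchemy import create_engine

def write_slice(df_slice):
    # each process builds its own engine; engines cannot be shared between processes
    engine = create_engine("hive://host:10000/default")  # placeholder, same parameters as my real engine
    df_slice.to_sql(name="TABLE_1", con=engine, if_exists="append",
                    index=False, method="multi", chunksize=10000)

if __name__ == "__main__":
    df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})  # stand-in for my real dataframe
    slices = np.array_split(df, 8)  # one slice per worker
    with Pool(processes=8) as pool:
        pool.map(write_slice, slices)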

My question is: how can I transfer data from a Jupyter notebook to a Hive table in an efficient manner? If it is possible to write multiple CSV files to a path that a Hive table is defined on, using threading or multiprocessing, that would also solve my problem.
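
To make the CSV idea concrete, something like the sketch below is what I have in mind: each worker writes one part file into the directory that a Hive external table is defined on. The output path, the number of parts and the external-table setup are assumptions on my side; I do not know whether this is the right pattern.

import os
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

def write_csv_part(part, out_dir):
    idx, df_part = part
    # each worker writes one part file into the directory behind the Hive external table
    df_part.to_csv(os.path.join(out_dir, f"part_{idx:05d}.csv"), index=False, header=False)

if __name__ == "__main__":
    df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})  # stand-in for my real dataframe
    out_dir = "/mnt/hive/warehouse/table_1"  # placeholder path that the Hive table points at
    os.makedirs(out_dir, exist_ok=True)
    parts = list(enumerate(np.array_split(df, 8)))
    with Pool(processes=8) as pool:
        pool.map(partial(write_csv_part, out_dir=out_dir), parts)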

  1. How can I transfer a huge DataFrame to a Hive table?
  2. Do you have any idea how to write multiple CSV files from Python to a path that a Hive table is defined on?
  3. Should I use multithreading or multiprocessing to write a DataFrame into a Hive table or file?