How can I transfer a huge DataFrame to a Hive table from Python?
I want to transfer a DataFrame that is in Python memory directly to a Hive table. I create the connection with SQLAlchemy's create_engine, and I can write the DataFrame to the table. Below is how I applied this method.
from sqlalchemy import create_engine
from multiprocessing import Lock, Process, Manager
engine = create_engine(some configuration parameters)
df.to_sql(name='TABLE_1', con=engine, if_exists='append', index=False, method='multi', chunksize=10000)
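A variant I am also considering is to split the frame myself and write it slice by slice, so that only one slice at a time has to be turned into INSERT statements (a rough sketch; the number of slices is arbitrary and the engine is the same one as above):
import numpy as np

# split the frame into row-wise slices and write each slice separately,
# so pandas only has to build the INSERT parameters for one slice at a time
for chunk in np.array_split(df, 50):
    chunk.to_sql(name='TABLE_1', con=engine, if_exists='append',
                 index=False, method='multi', chunksize=10000)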
This to_sql approach consumes all my memory and takes very long. After that, I tried multiprocessing:
def func_write(ns, l):
    df1 = ns.df
    engine = generate_new_engine()  # generate a new engine for each process
    l.acquire()
    df1.to_sql(name="TABLE_1", con=engine, if_exists='append', index=False, method='multi', chunksize=10000)
    ns.df = df1
    l.release()
It can write, but it is totally inefficient. Here is my driver code:
if __name__ == '__main__':
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = df
    l = Lock()
    p = Process(target=func_write, args=(ns, l))
    p.start()
    p.join()
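What I think I actually want is something closer to this: split the DataFrame and let each process write only its own slice with its own engine (a rough sketch; generate_new_engine is the same placeholder as above, the number of slices and workers is arbitrary, and I am not sure the pickling of the slices to the workers avoids the memory problem):
import numpy as np
from multiprocessing import Pool

def write_chunk(chunk):
    # each worker builds its own engine and writes only its slice
    engine = generate_new_engine()
    chunk.to_sql(name='TABLE_1', con=engine, if_exists='append',
                 index=False, method='multi', chunksize=10000)

if __name__ == '__main__':
    chunks = np.array_split(df, 8)   # 8 slices / workers, arbitrary
    with Pool(processes=8) as pool:
        pool.map(write_chunk, chunks)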
My question is: how can I transfer data from a Jupyter notebook to a Hive table efficiently? If it is possible to write multiple CSV files to a path that a Hive table is defined on, by using threading or multiprocessing, that would also solve my problem (a rough sketch of what I have in mind is at the end of this post).
- How can I transfer a huge DataFrame to a Hive table?
- Do you have any idea how to write multiple CSV files to a path that a Hive table is defined on, from Python?
- Should I use multithreading or multiprocessing to write a DataFrame into a Hive table or file?
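For the CSV idea, something along these lines is what I have in mind (a rough sketch; /path/to/hive/table_1 is a placeholder for the directory the Hive table is defined on, and I do not know whether writing the parts in parallel like this is the right approach):
import os
import numpy as np
from multiprocessing import Pool

HIVE_TABLE_DIR = '/path/to/hive/table_1'  # placeholder: location the Hive table is defined on

def write_part(args):
    i, chunk = args
    # write one CSV part into the table's directory; Hive should read all files under that path
    chunk.to_csv(os.path.join(HIVE_TABLE_DIR, f'part_{i:05d}.csv'),
                 index=False, header=False)

if __name__ == '__main__':
    parts = list(enumerate(np.array_split(df, 8)))  # 8 parts, arbitrary
    with Pool(processes=8) as pool:
        pool.map(write_part, parts)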