How data is loaded and managed by a MarkLogic cluster

Question

I would like to ask how data is loaded into a cluster. Do I load data separately into each node manually, or is MarkLogic able to manage and transfer data within the cluster itself, so that all I need to do is load data into a single node?

Is there a particular requirement for MarkLogic to recognize and balance data between certain forests/databases, such as the forest and database sharing the same name, or the XDBC servers sharing the same port number? Finally, is there a way to increase data ingestion throughput? I attempted this by pumping data into all 3 nodes at once, but that resulted in an error on the other two nodes. So I went back to pumping data in through a single node, which is currently running at 100% CPU.

Question pulled from comments here: Clustering of nodes (MarkLogic)

Please comment if you are downvoting. The question references details that are in the linked question. — Michael Gardner

Michael Gardner · Accepted Answer · 2019-04-26T18:59:31

Databases store their data in forests, so the data is distributed wherever the forests are. If you have a database with forests on both hosts, it will automatically balance the data between the two hosts. You can change how the database decides which forest receives which document through the assignment policy, which defaults to bucket.
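To make the idea of an assignment policy concrete, here is a toy Python sketch of hash-based placement: each document URI is hashed to a bucket, and buckets are dealt out to forests. This is not MarkLogic's actual algorithm. The hash function, the bucket count, and the forest names are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 16384  # illustrative bucket count (assumption, not read from a server)


def assign_forest(uri, forests):
    """Map a document URI to a forest via a fixed hash bucket (toy model)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    # Buckets are dealt out to forests in round-robin order.
    return forests[bucket % len(forests)]


forests = ["forest-host1", "forest-host2", "forest-host3"]  # hypothetical names
counts = Counter(assign_forest(f"/docs/{i}.xml", forests) for i in range(9000))
print(counts)
```

The point of the sketch is that placement depends only on the URI and the list of forests, not on which host received the document, so loading through one node can still spread documents across every forest.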

There is no special requirement for rebalancing. App servers and forests are assigned to specific databases, so they are already linked.

So data ingested through an app server is written to that app server's assigned database, and the database then determines which forests to place the data on. This may sometimes result in forests on one cluster host growing larger than forests on another, at which point the database will redistribute some of the data to other forests assigned to the same database, which may or may not be on the same host.

There are many ways to improve ingest throughput, but here are the most common:

Increase the constrained resource on the host. If you are CPU constrained, add cores; if you are memory constrained, add memory; and so on.
Increase the number of hosts involved, either through load balancing or through multiple ingestion pipelines.
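As a rough illustration of the second option, the sketch below splits a list of document URIs into batches and assigns the batches to hosts in round-robin order, which is one simple way to feed several ingestion pipelines. The host names, the batch size, and the `plan_batches` helper are hypothetical; the actual sending of each batch is left out.

```python
from itertools import cycle


def plan_batches(uris, hosts, batch_size):
    """Split URIs into batches and assign each batch to a host round-robin."""
    plan = {host: [] for host in hosts}
    host_cycle = cycle(hosts)
    for start in range(0, len(uris), batch_size):
        plan[next(host_cycle)].append(uris[start:start + batch_size])
    return plan


hosts = ["ml-node1", "ml-node2", "ml-node3"]  # hypothetical host names
uris = [f"/docs/{i}.xml" for i in range(10)]
plan = plan_batches(uris, hosts, batch_size=2)
for host, batches in plan.items():
    print(host, batches)
```

A load balancer in front of the cluster achieves a similar spread without the client having to know the host names.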

Since you are using MLCP, note that it retrieves the list of forest hosts in the cluster and distributes work across the cluster by default. There are some options that affect this; see here.
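For reference, a minimal sketch of how an MLCP import command might be assembled from Python. The host names, port, credentials, and input path are placeholders, and the comma-separated `-host` value assumes an MLCP version that accepts a host list; check the option reference for your version before relying on it. The command is only built and printed, not executed.

```python
import shlex


def build_mlcp_import(hosts, port, user, password, input_path, threads=8):
    """Return an argv list for an `mlcp.sh import` run (not executed here)."""
    return [
        "mlcp.sh", "import",
        "-host", ",".join(hosts),  # assumption: this MLCP version accepts a host list
        "-port", str(port),
        "-username", user,
        "-password", password,
        "-input_file_path", input_path,
        "-thread_count", str(threads),
    ]


argv = build_mlcp_import(
    ["ml-node1", "ml-node2", "ml-node3"],  # hypothetical host names
    8010,                                  # hypothetical XDBC app server port
    "ingest-user", "change-me",            # placeholder credentials
    "/data/to-load",                       # placeholder input directory
)
print(shlex.join(argv))
```

Raising `-thread_count` only helps while the cluster still has spare CPU; on a node already at 100% CPU, more threads will mostly add contention.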

To see whether the work is being distributed, check the Admin UI: Configure -> Groups -> Default -> App Servers -> [Your Ingest App Server], then click the Status tab and the Show More button. It should list all of your hosts and the number of requests being serviced by each host in the cluster. If one host's numbers are significantly higher than the others', the work may not be getting distributed properly.
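If you copy the per-host request counts out of that status page, a small helper like the one below can flag a lopsided distribution. The counts, the host names, and the "twice the median" threshold are all made up for illustration; pick whatever threshold makes sense for your workload.

```python
import statistics


def find_hot_hosts(requests_by_host, ratio=2.0):
    """Return hosts whose request count exceeds `ratio` times the median."""
    median = statistics.median(requests_by_host.values())
    return sorted(
        host for host, count in requests_by_host.items()
        if count > ratio * median
    )


# Hypothetical numbers read off the app server status page.
skewed = {"ml-node1": 940, "ml-node2": 35, "ml-node3": 25}
balanced = {"ml-node1": 300, "ml-node2": 310, "ml-node3": 290}

print(find_hot_hosts(skewed))
print(find_hot_hosts(balanced))
```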

Once the data is ingested, it will get balanced across the forests. It won't be exactly the same number of documents, or same space used. The server will decide when a forest is too small or too large, and move documents accordingly. Rebalancing can be resource intensive, so the server tries to weigh the cost of leaving the data in place vs moving it to another forest.

If you ingest primarily into a single node, you may also see larger forests on that node, for the reason stated above, that the server weighs the cost of moving the data vs leaving the data in place.

Indexes also affect the size on disk. When there is a wide variety of document sizes, some forests may end up with larger indexes than others because of the types of documents they hold.

A number of other things can also affect the space used by each node. One is the number of deleted fragments: fragments that have been marked for deletion but have not yet been merged out of the forest. If a forest is seeing a lot of ingest activity, or the merge priority is reduced, some forests can be quite a bit larger than others until they have a chance to merge out the deleted fragments.
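To get a feel for how much unmerged deletes can skew a comparison between forests, this small sketch computes the share of fragments that are deleted but still on disk. The forest names and fragment counts are invented for the example.

```python
def deleted_fraction(active_fragments, deleted_fragments):
    """Fraction of a forest's fragments that are deleted but not yet merged out."""
    total = active_fragments + deleted_fragments
    return deleted_fragments / total if total else 0.0


# Hypothetical (active, deleted) fragment counts per forest.
forest_counts = {
    "forest-host1": (1_000_000, 400_000),
    "forest-host2": (1_000_000, 50_000),
}
for name, (active, deleted) in forest_counts.items():
    print(f"{name}: {deleted_fraction(active, deleted):.1%} deleted")
```

Two forests holding the same number of active fragments can therefore differ noticeably in size until a merge runs.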

You mentioned you tried to ingest into all three nodes, and it did not work. Without knowing how you are ingesting data, and the exact error you encountered, it's difficult to say why it didn't work for you, but that is typically how MarkLogic is used.

MarkLogic offers a number of free courses, both on demand and instructor-led. I suggest taking a few hours to go through MarkLogic Fundamentals. Check out mlu.marklogic.com for a list of other courses as well. You can also read the MarkLogic Concepts Guide, which gives a good overview of how MarkLogic works.
