Databases store their data in forests, and the data is distributed wherever the forests are. So if you have a database with forests on both hosts, it will automatically balance the data between the two hosts. You can change how the database decides which data to place on which forest with the assignment policy, which defaults to bucket.
There is no special requirement for rebalancing. App servers and forests are each assigned to a specific database, so they are already linked.
So data ingested through an app server will be written to the assigned database. That database then determines which forests to place the data on. This may sometimes result in forests on one cluster host growing larger than forests on another cluster host, at which point the database will decide to redistribute some of the data to other forests assigned to the same database, which may or may not be on the same host.
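To make the idea of a bucket-style assignment concrete, here is a toy sketch in Python. It is not MarkLogic's internal algorithm: the hash function, the bucket count, and the names `bucket_for`, `forest_for`, and `distribution` are all assumptions made up for illustration. It only shows the general pattern of hashing a document URI to a bucket and mapping ranges of buckets to forests.

```python
import hashlib

# Illustrative bucket count; an assumption for this sketch.
NUM_BUCKETS = 16384


def bucket_for(uri: str) -> int:
    """Hash a document URI to a bucket number (hypothetical hash choice)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def forest_for(uri: str, forests: list) -> str:
    """Map the URI's bucket to a forest; each forest owns a contiguous bucket range."""
    return forests[bucket_for(uri) * len(forests) // NUM_BUCKETS]


def distribution(uris, forests):
    """Count how many URIs land on each forest."""
    counts = {forest: 0 for forest in forests}
    for uri in uris:
        counts[forest_for(uri, forests)] += 1
    return counts


if __name__ == "__main__":
    forests = ["forest-host1-a", "forest-host1-b", "forest-host2-a", "forest-host2-b"]
    uris = [f"/doc/{i}.json" for i in range(10000)]
    for forest, count in distribution(uris, forests).items():
        print(forest, count)
```

Because the placement depends only on the URI and the forest list, the documents spread roughly evenly no matter which host received the ingest request.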
There are many ways to improve ingest throughput, but here are the most common:
- Increase the constrained resources for the host. This means if you are CPU constrained, add cores; if you are Memory constrained, add memory; etc.
- Increase the number of hosts involved in ingestion, either through load balancing or by running multiple ingestion pipelines.
Since you are using MLCP, it will retrieve the list of forest hosts in the cluster and distribute work across the cluster by default. There are some options that affect this, see here.
To see if the work is being distributed, you can check in the Admin UI: Configure -> Groups -> Default -> App Servers -> [Your Ingest App Server], then click the Status tab and the Show More button. It should list all of your hosts and the number of requests being serviced by each host in the cluster. If one host's numbers are significantly higher than the others, then the work may not be getting distributed properly.
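If you want a quick way to judge "significantly higher", something like the following Python sketch could help. It assumes you have copied the per-host request counts from the status page by hand into a dict; the function name `flag_hot_hosts` and the default ratio of 2.0 are arbitrary choices for illustration, not anything MarkLogic defines.

```python
from statistics import mean


def flag_hot_hosts(requests_by_host, ratio=2.0):
    """Return hosts whose request count exceeds `ratio` times the mean of the other hosts.

    `requests_by_host` maps host name -> request count, entered manually.
    The ratio threshold is an arbitrary assumption; tune it to taste.
    """
    hot = []
    for host, count in requests_by_host.items():
        others = [c for h, c in requests_by_host.items() if h != host]
        if others and count > ratio * mean(others):
            hot.append(host)
    return sorted(hot)


if __name__ == "__main__":
    # Hypothetical numbers read off the App Server status page.
    observed = {"host1": 900, "host2": 110, "host3": 95}
    print(flag_hot_hosts(observed))
```

A non-empty result suggests most requests are landing on one host, which is worth checking against how your MLCP job or load balancer is configured.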
Once the data is ingested, it will get balanced across the forests. It won't be exactly the same number of documents, or same space used. The server will decide when a forest is too small or too large, and move documents accordingly. Rebalancing can be resource intensive, so the server tries to weigh the cost of leaving the data in place vs moving it to another forest.
If you ingest primarily into a single node, you may also see larger forests on that node, for the reason stated above, that the server weighs the cost of moving the data vs leaving the data in place.
The indexes also affect the size on disk. Particularly when there is a wide variety of document sizes, some forests may end up with larger indexes than others because of the types of documents they hold.
There are also a number of other things that can affect the space used by each node. One is the number of deleted fragments; these are fragments that have been marked for deletion but have not yet been merged out of the forest. If a forest is seeing a lot of ingest activity, or the merge priority is reduced, some forests can be quite a bit larger than others until they have a chance to merge out the deleted fragments.
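As a rough way to tell whether deleted fragments explain a size difference, you can compare the share of deleted fragments per forest. The Python sketch below assumes you have gathered active and deleted fragment counts for each forest yourself, from wherever you view forest status; the names `deleted_fraction` and `forests_with_many_deletes` and the 25% threshold are made up for illustration.

```python
def deleted_fraction(active, deleted):
    """Share of a forest's fragments that are marked deleted but not yet merged out."""
    total = active + deleted
    return deleted / total if total else 0.0


def forests_with_many_deletes(stats, threshold=0.25):
    """Return forests whose deleted share exceeds `threshold`.

    `stats` maps forest name -> (active_fragments, deleted_fragments).
    The threshold is an arbitrary assumption, not a MarkLogic setting.
    """
    return sorted(
        name
        for name, (active, deleted) in stats.items()
        if deleted_fraction(active, deleted) > threshold
    )


if __name__ == "__main__":
    # Hypothetical counts entered by hand.
    stats = {
        "forest-1": (1000, 100),
        "forest-2": (1000, 800),
    }
    print(forests_with_many_deletes(stats))
```

Forests that show up here are likely to shrink on their own once a merge has had a chance to run.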
You mentioned you tried to ingest into all three nodes, and it did not work. Without knowing how you are ingesting data, and the exact error you encountered, it's difficult to say why it didn't work for you, but that is typically how MarkLogic is used.
MarkLogic offers a number of free courses, both on demand and instructor led. I suggest taking a few hours to take MarkLogic Fundamentals. Check out mlu.marklogic.com for a list of other courses as well. You can also check out the MarkLogic Concepts Guide, which gives a good overview of how MarkLogic works.