Keywords: Elasticsearch - Microsoft Azure - Technical issue - Other
Hi, please find below the details of the problem I'm facing.
I'm running a 4-node ES cluster on Azure. Each machine is an Azure Standard L16s (16 vCPUs, 128 GB RAM). I'm ingesting data from a powerful Spark cluster through a Scala library. The data is huge (4 billion documents), ~2 TB uncompressed.
In 1 hour, only 1% of the total data gets ingested. The same Spark cluster can push 100% of the data to other datastores (like Azure DataWarehouse etc.) in under 30 minutes, so the Spark cluster is not the bottleneck here. Also, it seems the dataframe.write() API takes care of batching and pushes out data on the order of 280Mb per task without any configuration/optimization on the Spark/client side.
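To put rough numbers on this, here is a back-of-the-envelope sketch in Scala. It assumes the ~2 TB figure is in decimal units (1 TB = 1e12 bytes) and that ingestion continues at a constant rate; the object and value names are only illustrative.

```scala
object IngestRate {
  // Figures from the description above; decimal units are an assumption.
  val totalDocs: Double = 4e9          // 4 billion documents
  val totalBytes: Double = 2e12        // ~2 TB uncompressed
  val fractionDone: Double = 0.01      // 1% ingested ...
  val elapsedSeconds: Double = 3600.0  // ... in 1 hour

  // Observed throughput into Elasticsearch.
  val docsPerSecond: Double = totalDocs * fractionDone / elapsedSeconds
  val mbPerSecond: Double = totalBytes * fractionDone / elapsedSeconds / 1e6

  // Time for the full load if the rate stays constant.
  val hoursForFullLoad: Double = elapsedSeconds / fractionDone / 3600.0

  // Rate needed to match the 30-minute loads into the other datastores.
  val targetDocsPerSecond: Double = totalDocs / (30.0 * 60.0)
  val requiredSpeedup: Double = targetDocsPerSecond / docsPerSecond

  def main(args: Array[String]): Unit = {
    println(f"observed: $docsPerSecond%.0f docs/s, $mbPerSecond%.2f MB/s")
    println(f"full load at this rate: $hoursForFullLoad%.0f hours")
    println(f"target: $targetDocsPerSecond%.0f docs/s ($requiredSpeedup%.0fx faster)")
  }
}
```

Under those assumptions the cluster is currently taking in roughly 11,000 documents per second (about 5.6 MB/s), the full load would take about 100 hours, and matching the 30-minute figure would need roughly a 200x speedup.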
Question - Is there any way I can increase the speed of data ingestion?