Different integrations call for different implementation styles. Business functionality varies from application to application, so each demands its own strategy. One common scenario is moving large volumes of data from one system to another. Designing solutions for such systems requires weighing considerations like throughput, performance, and failure handling. There are many ways to do this, but it is important to choose the design that best satisfies the business requirements and KPIs.
Use Case/Experience Summary
Incepta had an opportunity to work with a client whose legacy process transferred large datasets from multiple tables in a JD Edwards system to a SQL data warehouse. The routine business process required manually triggering multiple disparate jobs in sequence because of their dependencies on each other, and it required intervention to handle errors and to start jobs after dependent tasks completed. The legacy process took about 3 hours for all the jobs together and had no intelligence to handle or report critical failures in filtering data and processing records. Although this is a typical extraction, transformation, and load (ETL) use case, the customer wanted to move away from their existing process and bring in automation and reusability of data by leveraging the MuleSoft platform.
The solution built by the Incepta team on MuleSoft reduced the total processing time to just under 1 hour, a significant improvement, and also brought in automation, intelligent handling of various scenarios, flow reusability, reliability, and other value-adds such as an alerting and notification system. The design ensured that new changes could be managed easily, reducing time to market. It was a great achievement and was highly appreciated by the client and associated teams.
How did we do it?
The MuleSoft solution handles massive amounts of data more efficiently. Mule 4 eliminated the need for manual thread pool configuration: the Mule runtime does this automatically, optimizing the execution of a flow to avoid unnecessary thread switches.
The three centralized pools, CPU_INTENSIVE, CPU_LITE, and BLOCKING_IO, are managed by the Mule runtime and shared across all applications deployed to that runtime. A running Mule application pulls threads from each of those pools as events pass through its processors. The consequence is that a single flow may run in multiple threads.
One way to move millions of records from one system to another is the Mule Batch Scope, which handles large data by streaming it from the source in smaller blocks of records and processing them asynchronously and reliably. The Batch Scope provides several useful features, such as:
- Block Sizes – defines the number of records processed in a step
- Batch Step – allows a block of records to be processed sequentially
- Accept expression configuration – allows you to filter records to be processed in a step by providing conditions
- Accept policy configuration – controls which records a step processes: all records, only records that have not failed in earlier steps (NO_FAILURES, the default), or only failed records (ONLY_FAILURES), which is useful for error handling
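As a rough sketch, the features above map onto the batch job configuration as follows. This is an illustrative Mule 4 fragment, not the client's actual implementation; the job and step names are hypothetical.

```xml
<!-- Illustrative sketch of a Mule 4 batch job; names are hypothetical -->
<batch:job jobName="syncJob" blockSize="200">
    <batch:process-records>
        <!-- Accept expression: only records matching the condition enter this step -->
        <batch:step name="activeRecordsStep"
                    acceptExpression="#[payload.status == 'ACTIVE']"
                    acceptPolicy="NO_FAILURES">
            <!-- per-record processors go here -->
        </batch:step>
        <!-- Accept policy ONLY_FAILURES: this step sees only failed records -->
        <batch:step name="failedRecordsStep" acceptPolicy="ONLY_FAILURES">
            <logger level="ERROR" message="#['Record failed: ' ++ write(payload)]"/>
        </batch:step>
    </batch:process-records>
</batch:job>
```

Splitting the error handling into its own ONLY_FAILURES step keeps the happy path free of defensive logic while still giving failed records a dedicated place to be logged or reprocessed.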
Consider the example of a nightly job where data from a data warehouse needs to be synchronized with a CRM system. One way to implement it is to add a Scheduler as the flow source, triggering the flow on specific days through a cron expression or a fixed-frequency strategy. A Database connector's Select operation fetches the records from the source database system. Since millions of records need to be moved and reliability is an important KPI, we keep the default streaming strategy, repeatable file store iterable. The retrieved payload is passed to the batch job, which splits it into smaller blocks of records as defined by the Batch Block Size (the default is 100), and these blocks are processed by batch steps.
In this example, the batch step has a Salesforce Create bulk operation that creates new records in Salesforce in groups of 100. In this way, all the records are processed reliably. Additionally, for better error handling, we used a Try scope inside the batch aggregator, which allows actions to be defined for any failures in a batch of records. Once all the records are processed, the On Complete section can define further steps to execute after the batch steps finish. Here we used several components, including a Logger, an Execute script component, and a Choice router with logic to prepare and send a mail notification before the event is passed to the next referenced flow.
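The nightly flow described above can be sketched in Mule 4 configuration roughly as follows. This is a simplified illustration under assumed names (`nightlySyncFlow`, `Database_Config`, `Salesforce_Config`, the `Account` object, and the SQL query are all placeholders), not the actual client solution.

```xml
<!-- Illustrative sketch of the nightly sync flow; configs, names, and
     the query are placeholders, not the client's implementation -->
<flow name="nightlySyncFlow">
    <!-- Trigger at 1 AM daily via a Quartz cron expression -->
    <scheduler>
        <scheduling-strategy>
            <cron expression="0 0 1 * * ?"/>
        </scheduling-strategy>
    </scheduler>
    <!-- Default streaming strategy (repeatable file store iterable) applies -->
    <db:select config-ref="Database_Config">
        <db:sql>SELECT * FROM source_table</db:sql>
    </db:select>
    <batch:job jobName="crmSyncJob" blockSize="100">
        <batch:process-records>
            <batch:step name="createInCrmStep">
                <!-- Aggregate 100 records, then create them in one bulk call -->
                <batch:aggregator size="100">
                    <salesforce:create type="Account"
                                       config-ref="Salesforce_Config"/>
                </batch:aggregator>
            </batch:step>
        </batch:process-records>
        <batch:on-complete>
            <!-- On Complete receives a batch job result with run statistics -->
            <logger level="INFO"
                    message="#['Processed: ' ++ payload.processedRecords]"/>
        </batch:on-complete>
    </batch:job>
</flow>
```

Aggregating records before the Salesforce call is the key design choice here: it turns 100 per-record API calls into a single bulk request, which is what makes the step efficient at this volume.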
Key Pointers for using Batch efficiently in MuleSoft:
- Transformation complexity – Transform before the batch step and avoid DataWeave inside batch steps, as it would process one record at a time, which is inefficient and doesn't justify the use of batch processing
- No. of Batch Steps – Dividing the process into steps makes it easier to isolate a failed batch and have it reprocessed separately
- Block Sizes – Running comparative tests with different values and testing performance helps you find an optimum block size before moving this change into production. Modifying this value is optional. If no changes are applied, the default value is 100 records per block.
- Scheduling Strategy – Controls how instances of a given batch job are executed. The default is ORDERED_SEQUENTIAL: if several job instances are in an executable state at the same time, they execute one at a time based on their creation timestamp. The other setting, ROUND_ROBIN, attempts to execute all available instances of the batch job using a round-robin algorithm to assign the available resources.
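For reference, the scheduling strategy is set as an attribute on the batch job itself. A minimal sketch (the job name is illustrative):

```xml
<!-- Sketch: execute available instances of this job round-robin
     instead of the default one-at-a-time ORDERED_SEQUENTIAL order -->
<batch:job jobName="syncJob" schedulingStrategy="ROUND_ROBIN">
    <batch:process-records>
        <!-- batch steps as usual -->
    </batch:process-records>
</batch:job>
```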
Incepta is a certified MuleSoft partner with an experienced team of experts in MuleSoft development and consulting. Visit our website at www.inceptasolutions.com or email us at email@example.com.