In previous releases of WCS, the massload utility was the primary out-of-the-box tool for loading data into the WCS database. With the beta release of WCS 7, IBM introduced a new option known as the Data Load utility. It is important to understand these terms: BODL is an asset developed by IBM Software Services for WebSphere to address some of the shortcomings of ID resolver/massload, the Data Load utility follows the BODL architecture, and technically BODL is not very different from the Data Load utility.
Officially, IBM still recommends the massload-based approach for less commonly used data and the Data Load utility for efficiently loading product, price, and inventory data.
Overview of Dataload Utility
1. The DataReader is a Java component that reads the source file. In the case of the CSV reader, you define the source data structure in a configuration file and the DataReader uses this information to load the data.
The DataReader component implements a next() method that returns one chunk of data read from the data source. This does not scale well for high data volumes, because every row of the source becomes a new object on your JVM heap.
From my experience, parsing a source file with a Java component is always a poor choice for high-volume data loads. Compare this with SQL*Loader, a high-speed data loading utility that loads data from external files into tables in an Oracle database.
2. Mediators: Mediators are available out of the box. The CatalogMediator, for instance, populates the physical objects for the CATALOG table from the catalog logical object when you load catalog data. A simplified sketch of the reader and mediator roles follows this list.
3. In my opinion the Data Load utility is still not a serious contender for data loading: it has very limited support in terms of readers, supports very few business objects, and I could not figure out how to add custom logic, as it appears to do a one-to-one mapping of source fields to destination fields.
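To make the reader/mediator split concrete, here is a minimal sketch of the two roles. Every class and method name in it is invented for illustration; the real Data Load utility components are wired together through XML configuration rather than instantiated by hand. The sketch also shows where the heap pressure comes from: every call to next() materialises one source row as a new object.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- hypothetical names, not the real Data Load utility API.
public class DataLoadSketch {

    // Reader role: every call to next() materialises one source record as a new heap object.
    static class SimpleCsvReader implements AutoCloseable {
        private final BufferedReader in;
        private final String[] columns; // column names, as a configuration file would define them

        SimpleCsvReader(String file, String[] columns) throws IOException {
            this.in = new BufferedReader(new FileReader(file));
            this.columns = columns;
        }

        // Returns the next record as a column-name -> value map, or null at end of file.
        Map<String, String> next() throws IOException {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            String[] values = line.split(",", -1);
            Map<String, String> record = new HashMap<>(); // one new object per row
            for (int i = 0; i < columns.length && i < values.length; i++) {
                record.put(columns[i], values[i].trim());
            }
            return record;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    // Mediator role: map the logical record onto the physical CATALOG table.
    static class CatalogMediatorSketch {
        String toInsert(Map<String, String> logical) {
            // CATALOG_ID, MEMBER_ID and IDENTIFIER are real CATALOG columns; the mapping is simplified.
            return "INSERT INTO CATALOG (CATALOG_ID, MEMBER_ID, IDENTIFIER) VALUES ("
                    + logical.get("catalogId") + ", "
                    + logical.get("memberId") + ", '"
                    + logical.get("identifier") + "')";
        }
    }

    public static void main(String[] args) throws IOException {
        String[] columns = { "catalogId", "memberId", "identifier" };
        CatalogMediatorSketch mediator = new CatalogMediatorSketch();
        try (SimpleCsvReader reader = new SimpleCsvReader("catalog.csv", columns)) {
            Map<String, String> record;
            while ((record = reader.next()) != null) {
                System.out.println(mediator.toInsert(record));
            }
        }
    }
}
```

As I understand it, the real mediators populate physical data objects that the utility then persists in batches rather than building SQL strings, but the per-record flow is the same idea.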
Overview of BODL
- BODL is mostly an asset of the IBM services group; it is freely available to WCS customers only.
- BODL is a set of Java files that works very much like the Data Load utility; it has more readers, can be customized, and surprisingly handles more components than the Data Load utility (I would imagine IBM will release the BODL features into the Data Load utility in future versions).
- IBM does not publicly provide any documentation or sample code for BODL; it is available only on request for WCS customers.
Overview of Massload
- Source data must first be converted to XML format, based on the DTD generated using the DTDgen utility of WCS (an illustrative fragment follows this list).
- The generated XML must then be ID-resolved using the idresgen utility.
- Finally, the ID-resolved XML file is loaded using the massload utility.
- It can have serious performance issues when processing large record sets; this is not the most efficient data load option.
- Debugging errors is very difficult with this approach.
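For reference, the intermediate XML that idresgen and massload consume follows a simple convention: one element per table, one attribute per column, with @-prefixed aliases standing in for primary keys that idresgen resolves to real generated keys. The fragment below is only illustrative; the actual element and attribute names must match the DTD you generate for your tables, the root element and DOCTYPE are omitted, and the values are made up.

```xml
<!-- Illustrative only: element/attribute names come from your generated DTD,
     and @catalog_id_1 is an alias that idresgen replaces with a real key. -->
<catalog
    catalog_id="@catalog_id_1"
    member_id="7000000000000000101"
    identifier="MyStoreCatalog" />
<catalogdsc
    catalog_id="@catalog_id_1"
    language_id="-1"
    name="My Store Catalog" />
```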
Custom Dataload Options
- Some commercial ETL tools may not be a good fit, because WCS primary key generation and translation logic can get very messy.
- From my experience, Java is not my preferred language for high-volume data processing and translation. If you understand the WCS data model well, you can write your own custom data processing tools using SQL*Loader, PL/SQL, or Python scripts; whichever technology you use, at the end of the day you are inserting records into WCS tables using SQL, as the sketch below shows.
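To illustrate the key-generation point: whatever tool produces the inserts, new rows need primary keys allocated the way WCS expects, which normally means going through the KEYS counter table. The JDBC sketch below shows the shape of that logic; the connection details, the MEMBER_ID value and the trimmed CATENTRY column list are placeholders, and a PL/SQL block or Python script would issue exactly the same SQL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative only: shows why custom loads must respect WCS key generation.
public class CustomCatentryLoad {

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:oracle:thin:@//dbhost:1521/WCSDB"; // placeholder connection string
        try (Connection con = DriverManager.getConnection(url, "wcsuser", "secret")) {
            con.setAutoCommit(false);

            // 1. Reserve a block of primary keys from the WCS KEYS counter table
            //    (check how TABLENAME values are cased in your schema).
            long nextId;
            final int blockSize = 1000;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT COUNTER FROM KEYS WHERE TABLENAME = 'catentry' FOR UPDATE")) {
                if (!rs.next()) {
                    throw new IllegalStateException("No KEYS row found for catentry");
                }
                nextId = rs.getLong(1) + 1;
            }
            try (PreparedStatement upd = con.prepareStatement(
                    "UPDATE KEYS SET COUNTER = COUNTER + ? WHERE TABLENAME = 'catentry'")) {
                upd.setInt(1, blockSize);
                upd.executeUpdate();
            }

            // 2. Insert the source records using the reserved keys
            //    (column list trimmed for illustration; CATENTRY has more mandatory columns).
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO CATENTRY (CATENTRY_ID, MEMBER_ID, CATENTTYPE_ID, PARTNUMBER, MARKFORDELETE) "
                            + "VALUES (?, ?, 'ProductBean', ?, 0)")) {
                String[] partNumbers = { "SKU-0001", "SKU-0002" }; // stand-in for the real source feed
                long memberId = 7000000000000000101L;              // owning organization, placeholder value
                for (String part : partNumbers) {
                    ins.setLong(1, nextId++);
                    ins.setLong(2, memberId);
                    ins.setString(3, part);
                    ins.addBatch();
                }
                ins.executeBatch();
            }
            con.commit();
        }
    }
}
```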