Posted by Raqsoft
on August 29, 2013 at 8:06 PM PDT
This article is about three methods of data processing.
In Java, using SQL is a well-established practice for database computation. However, structured data is stored not only in databases but also in text, Excel, and XML files. How, then, should structured data from non-database files be computed? This article offers three solutions for your reference: implement the computation via the Java API, convert it to database computation, or adopt a common data computation layer.
Implement via Java API. This is the most straightforward method. With the Java API, programmers can control every computational step precisely, inspect the result of each step directly, and debug conveniently. An additional advantage is that Java programmers have nothing new to learn.
Thanks to the mature APIs for reading data from and writing data back to text, Excel, and XML files, Java has enough technical strength to fully support such computation, particularly for simple computational goals.
However, this method requires a great deal of work and is quite inconvenient.
For example, since common data algorithms are not readily available in Java, programmers have to spend a great deal of time and effort manually implementing every detail of aggregating, filtering, grouping, sorting, and other common operations.
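To give a feel for this manual work, here is a minimal sketch of filtering, grouping, summing, and sorting rows read from a text file, written with plain Java collections. The class name `ManualAggregation`, the two-column layout (`region,amount`), and the sample rows are all assumptions made for illustration; they do not come from any real data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example: class name, column layout, and sample rows are
// assumptions for illustration only.
final class ManualAggregation {

    // Keeps rows whose amount is at least minAmount, groups them by region,
    // and sums the amounts per region. Each line is expected to look like
    // "region,amount".
    static Map<String, Double> totalByRegion(List<String> lines, double minAmount) {
        // A TreeMap keeps the groups ordered by region name (the sorting step).
        Map<String, Double> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 2) {
                continue; // skip malformed rows
            }
            String region = fields[0].trim();
            double amount = Double.parseDouble(fields[1].trim());
            if (amount < minAmount) {
                continue; // the filtering step
            }
            // The grouping and aggregating steps: add to the region's running total.
            totals.merge(region, amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // In practice these lines would be read from a file; they are inlined
        // here so the sketch runs on its own.
        List<String> lines = Arrays.asList(
                "East,100.0", "West,40.0", "East,25.5", "West,200.0", "North,75.0");
        // prints {East=100.0, North=75.0, West=200.0}
        System.out.println(totalByRegion(lines, 50.0));
    }
}
```

Even this small case needs explicit parsing, error handling, and bookkeeping, and every additional grouping key or aggregate would add more hand-written code.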
As another example, to store data and retrieve detail data through the Java API, programmers have to assemble every record and 2D table from List, Map, and other objects, and then compute in multi-level nested loops. Moreover, such computation usually involves set operations and relational computations on massive data, as well as computations between objects and object properties. Implementing the underlying logic takes great effort, and handling complex ordered computation takes even more.
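The List/Map approach described above can be sketched as follows: two small "tables" are held as lists of maps and related through a nested loop. The class name `NestedLoopJoin`, the helper `row`, and the column names (`orderId`, `customerId`, `id`, `name`) are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical example: class, helper, and column names are assumptions.
final class NestedLoopJoin {

    // Builds one table row from alternating column names and values.
    static Map<String, Object> row(Object... namesAndValues) {
        Map<String, Object> r = new HashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            r.put((String) namesAndValues[i], namesAndValues[i + 1]);
        }
        return r;
    }

    // Inner join of orders and customers on orders.customerId = customers.id.
    // Every order is compared with every customer, so the cost grows with the
    // product of the two table sizes.
    static List<Map<String, Object>> join(List<Map<String, Object>> orders,
                                          List<Map<String, Object>> customers) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> order : orders) {
            for (Map<String, Object> customer : customers) {
                if (Objects.equals(order.get("customerId"), customer.get("id"))) {
                    Map<String, Object> joined = new HashMap<>(order);
                    joined.put("customerName", customer.get("name"));
                    result.add(joined);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> customers = Arrays.asList(
                row("id", 1, "name", "Ann"),
                row("id", 2, "name", "Bob"));
        List<Map<String, Object>> orders = Arrays.asList(
                row("orderId", 10, "customerId", 2),
                row("orderId", 11, "customerId", 1),
                row("orderId", 12, "customerId", 3)); // no matching customer
        // prints "10 -> Bob" and then "11 -> Ann"
        for (Map<String, Object> r : join(orders, customers)) {
            System.out.println(r.get("orderId") + " -> " + r.get("customerName"));
        }
    }
}
```

A hash-based lookup would reduce the cost of the join, but that is yet another piece of logic the programmer has to write and maintain by hand.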
To reduce the programming workload, programmers generally prefer leveraging existing algorithms over implementing every detail themselves. In view of this, the second option below may be a better choice:
Convert to database computation. This is the most conservative method. Concretely, it means importing the non-database data into a database via common ETL tools such as DataStage, DTS, Informatica, and Kettle. The advantages of this practice include high computational efficiency, stable running, and less workload for Java programmers. It suits scenarios with large data volumes, high performance demands, and medium computational complexity. These advantages are particularly evident for mixed computation across the database and non-database files.
The main drawbacks of this method are the heavy ETL workload in the early stage and the difficulty of maintenance. First, since the non-database data cannot be used directly without field splitting, merging, and conditional checks, programmers have to write a great many Perl/JS scripts to clean and reorganize the data. Second, the data is usually updated over time, so the scripts must handle incremental updates. Data from various sources can hardly conform to a single normal form, so the data is often unusable until it has gone through a second-level or even third-level ETL process. Third, scheduling is also a problem when there are lots of tables