
Key: MULE-755
Type: New Feature
Status: Closed
Resolution: Won't Fix or Usage Issue
Priority: Minor
Assignee: Dirk Olmes
Reporter: kenneth westelinck
Votes: 1
Watchers: 1
Mule

Transformers for CSV/fixed length record input/output

Created: 07/Apr/06 07:07 AM   Updated: 04/Mar/08 08:14 PM
Component/s: Core: Transformers
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Block
 
Related

Labels:


Description
We need a transformer to transform CSV files to XML and vice versa.

Comments
kenneth westelinck added a comment - 07/Apr/06 07:10 AM
Attached you can find working source, based on CSVReader and CSVWriter from opencsv (http://opencsv.sourceforge.net/).

Travis Carlson added a comment - 07/Apr/06 01:51 PM
If a CSV file can be read into a JDBC RecordSet, the rest could be handled by MULE-756.

kenneth westelinck added a comment - 10/Apr/06 07:47 AM
Yes, but then you will need a RecordSet to CSV converter. I think my approach is easier, since it just transforms a CSV file to a list of String[] or String[][] and then I simply let the XStream serialize the result and vice versa.
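The approach described above (CSV to a list of String[], then serialize) can be sketched roughly as follows. This is a minimal illustration, not the attached source: it uses only the JDK, with a naive comma split standing in for opencsv's CSVReader and a StAX writer standing in for XStream. The class name, method names, and XML element names are all assumptions made for the example.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch; not the transformer attached to this issue.
class CsvToXmlSketch {

    /** Splits each non-empty line on commas. No quoting support; a real CSV reader handles that. */
    static List<String[]> parse(String csv) {
        List<String[]> rows = new ArrayList<>();
        for (String line : csv.split("\r?\n")) {
            if (!line.isEmpty()) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty columns
            }
        }
        return rows;
    }

    /** Writes the rows as rows/row/col elements; the element names are made up for this sketch. */
    static String toXml(List<String[]> rows) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("rows");
            for (String[] row : rows) {
                xml.writeStartElement("row");
                for (String cell : row) {
                    xml.writeStartElement("col");
                    xml.writeCharacters(cell); // escapes &, < and >
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(parse("name,qty\nbolt,12\n")));
    }
}
```

The reverse direction (XML back to rows, then rows to CSV) would follow the same two-step shape.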

Holger Hoffstaette added a comment - 25/Jun/06 08:50 AM
I've taken a good look at some of the available CSV parsers and opencsv does look like a good choice.

Andrew Perepelytsya added a comment - 19/Jul/06 10:45 AM
Holger, have you checked http://servingxml.sourceforge.net/ during your research? It looks like it offers quite complex transformation functions (check the examples link of theirs). Though it may not be that critical in the context of Mule (transformations can easily be applied later).

Holger Hoffstaette added a comment - 19/Jul/06 11:12 AM
Andrew, thanks for the link to ServingXML - looks very interesting but also way more complicated than a "simple" CSV parser. The things I looked out for were mostly:
  • Excel compatibility (they of course had to make things different in some weird way)
  • stream-capable - some people have insanely large files that must be split up or processed and we must have a way to handle them, even if most of the use cases will be possible in-memory (i.e. all at once).
  • usable license
  • easy API
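
To illustrate the stream-capable requirement in the list above: the idea is that rows are handed to a consumer one at a time, so memory use does not grow with file size. The sketch below is an assumption-laden illustration using only the JDK (naive comma split, no quoting); the class and method names are hypothetical and it uses newer Java than the 1.4 baseline of the time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical sketch of row-at-a-time processing.
class StreamingCsvSketch {

    /** Reads one line at a time, passes each row to the handler, and returns the row count. */
    static long forEachRow(Reader source, Consumer<String[]> handler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                handler.accept(line.split(",", -1));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        long rows = forEachRow(new StringReader("a,b\n1,2\n"),
                row -> System.out.println(row.length + " columns"));
        System.out.println(rows + " rows");
    }
}
```

Only the current line is held in memory, so the same loop works for in-memory strings and for very large files opened with a FileReader.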

While ServingXML could probably be used as a substitute for any other CSV reader/writer it looks like a lot more work to integrate, and its heavy bias towards XML is not always a good thing. Sometimes a List of Strings really works just fine, especially if you want to populate a regular object.


Andrew Perepelytsya added a comment - 19/Jul/06 11:18 AM
Yes, it looks powerful when used on its own, but may be too much for a fine-grained service; I just wanted to run it by you. We will definitely have a plain Csv2Xml converter without all the bells and whistles, and keep ServingXML in mind (in case a need for something like that arises).

Holger Hoffstaette added a comment - 13/Aug/06 11:32 AM
Looks like things are coming together: http://jakarta.apache.org/commons/sandbox/csv/

Aims to unify all the different implementations that are out there, based on OpenCSV. About time..


Holger Hoffstaette added a comment - 13/Sep/06 08:09 AM
Still waiting for the contributor's agreement. Ross?

Holger Hoffstaette added a comment - 14/Sep/06 07:19 AM
CLA received so this can go into sandbox for further development. thanks

Holger Hoffstaette added a comment - 14/Sep/06 08:30 AM
Initial commit to the sandbox in r3061, with created POM, 1.4-backported and cleaned up code. Still needs some more love and especially proper testing.

kenneth westelinck added a comment - 14/Sep/06 08:48 AM
Can I help on this? If so, please give me some hints.

Holger Hoffstaette added a comment - 14/Sep/06 09:21 AM
A couple of things that I found during an initial read of the code:
  • there are a few instance variables that are never read or unused (private and initialized but never used again)
  • the standalone test should be done as AbstractTransformerTestCase and live in the test dir; also load any test resource stored in there etc.

However none of this is really critical since it's not going to be in rc5 - maybe rc6 if we do one, more likely 1.3 final. If you find the time for some patches or more tests it would be appreciated. You can attach anything to this issue or start a csv-test project in the new contributor repo, see http://xircles.codehaus.org/projects/mule/repo/contrib.

One test I'd like to see is stability, memory and performance behaviour with a big CSV (or XML) file: >10 megabytes at least, more if you find the time. These test resources should probably be generated dynamically or be stored in compressed form and unzipped on the fly. It would be nice if the transformer could handle an arbitrarily big XML file and directly pipe it into a CSV file on disk without using loads of memory. However we don't have stream support for transformers in the official API yet, so this may or may not be possible for now; let us know what you find out.
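
One way the large-file test resource described above might be produced is sketched below: rows are generated dynamically, stored gzip-compressed, and decompressed on the fly while reading. This is only an illustration using the JDK's java.util.zip classes; the class name, method names, and cell contents are assumptions, and a real test would pipe the decompressed stream into the transformer rather than just counting lines.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical test-fixture sketch.
class BigCsvFixtureSketch {

    /** Writes the requested number of generated rows as gzip-compressed CSV, then closes the target. */
    static void generate(OutputStream target, int rows, int cols) {
        try (Writer w = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(target), StandardCharsets.UTF_8))) {
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    if (c > 0) {
                        w.write(',');
                    }
                    w.write("r" + r + "c" + c);
                }
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decompresses on the fly and counts lines without holding the whole file in memory. */
    static long countLines(InputStream compressed) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            long n = 0;
            while (reader.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file above 10 megabytes, `generate` would be pointed at a FileOutputStream in a temporary directory with a suitably large row count.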

Thanks


Ross Mason added a comment - 09/Oct/06 06:39 PM
Hi Kenneth,

Thanks for the submission. This is now available at http://sven.codehaus.org/mule-contrib/modules/csv

You should have commit access to this. To build the module, use Maven 2.0.4:

mvn install


Holger Hoffstaette added a comment - 05/Jan/07 03:27 PM
Just for the record:
  • the code in svn is updated to use the latest OpenCSV and builds out of the box, so do NOT use the attachment any more
  • the code has not yet seen heavy testing, particularly for performance or memory use with really large CSV files

Andrew Perepelytsya added a comment - 05/Jan/07 04:27 PM
I've deleted the obsolete attachment to avoid confusion.

Lajos Moczar added a comment - 05/Jan/07 06:28 PM
Since I find myself doing this, I'll take it.

However, I have a transformer I've used in the past to do simple CSV/XML transformations. Will post it for review when I locate it.


Lajos Moczar added a comment - 09/Jan/07 08:58 AM
This has worked out well so far. I've done a bit of rework in some areas, and added a few options. After proper testing, will add it to trunk

Holger Hoffstaette added a comment - 09/Jan/07 09:09 AM
Very good news! Please make sure the existing test (the little main driver in the sandbox) is added as proper AbstractTransformerTestCase, though if proper round-tripping is not possible for some reason, a "normal" test is of course fine too.
Btw, OpenCSV can also deal with JDBC RecordSets - maybe this helps with the items mentioned at the top of this issue?
Thanks

Lajos Moczar added a comment - 19/Jan/07 11:43 AM
I have completed my refactoring of this transformer. It is now three transformers: CSVToList, ListToCSV and MapToCSV. Having it serve xstream directly was too restrictive. Returning a List (of Maps) allows us to split with the array splitter, which fits at least two user scenarios I'm aware of. You can always convert to XML later.

I'm also putting this in the file transport. I know it doesn't quite fit, but it certainly doesn't fit with the xml module now. In most cases, people will be dealing with CSV files anyhow.
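
As an illustration of the "List (of Maps)" result described above, the sketch below treats the first line as labels and turns every following row into a label-to-value Map, so each element can be split off and routed on its own. It is not the committed CSVToList code: the class name is hypothetical, the parsing is a naive comma split without quoting, and the choice of LinkedHashMap is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the transformer committed to Mule.
class CsvToMapsSketch {

    /** Uses the first non-empty line as labels and maps each later row as label -> value. */
    static List<Map<String, String>> parse(String csv) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] labels = null;
        for (String line : csv.split("\r?\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] cells = line.split(",", -1);
            if (labels == null) {
                labels = cells; // header row
                continue;
            }
            Map<String, String> row = new LinkedHashMap<>(); // keeps label order
            for (int i = 0; i < labels.length; i++) {
                // Short rows get empty strings for the missing columns.
                row.put(labels[i], i < cells.length ? cells[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(parse("id,name\n1,bolt\n2,nut\n"));
    }
}
```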


Lajos Moczar added a comment - 21/Jan/07 12:58 PM
Closed in r4810. Holger has convinced me that it is logical to put these three CSV transformers in their own module. We had a discussion about it on Friday. If you disagree, talk to the man! This commit has a test case. I also updated the site docs.

A note on speed: I tested with an 80-column, 7000-line CSV file. My test read the file in, converted it with CSVToMapList, split it with the FilteringListMessageSplitter, sent each Map to a JMS queue, read each Map back out, converted it back to CSV with MapToCSV, and stored it in a separate file (this was a client scenario, which is why I followed this process). The total time to do this, and write the 7000 files, was under 1 minute. So OpenCSV and these transformers seem to function quite nicely.


Holger Hoffstaette added a comment - 21/Jan/07 01:35 PM
reopen for fix-version

Holger Hoffstaette added a comment - 21/Jan/07 06:34 PM
Reopening because it has some problems. The idea of using Maps with the column number as key will not work as implemented, since the values of a Map can (and will!) be iterated in undefined order, depending on JDK version. Theoretically you could use TreeMaps or sort things by key, but all that is way too complicated. IMHO it would be easier to just use Lists (i.e. a List of List elements), which have implicit ordering.
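
The two orderings mentioned above can be compared in a small sketch. A List keeps the column order implicitly, while a Map keyed by column number only iterates in column order if it is a sorted map such as TreeMap (a plain HashMap gives no such guarantee). The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch comparing row representations.
class RowOrderSketch {

    /** A List preserves the column order of the source row by construction. */
    static List<String> asList(String[] cells) {
        return new ArrayList<>(Arrays.asList(cells));
    }

    /** A TreeMap keyed by column index iterates in ascending index order. */
    static Map<Integer, String> asSortedMap(String[] cells) {
        Map<Integer, String> row = new TreeMap<>();
        for (int i = 0; i < cells.length; i++) {
            row.put(i, cells[i]);
        }
        return row;
    }

    /** Joins values in iteration order, which is where an unordered Map would go wrong. */
    static String join(Iterable<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String[] cells = {"a", "b", "c"};
        System.out.println(join(asList(cells)));
        System.out.println(join(asSortedMap(cells).values()));
    }
}
```

Both calls produce the columns in source order; the List version simply needs no extra key handling to get there.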

Lajos Moczar added a comment - 06/Feb/07 08:01 PM
r5027.

Much of this comes out of Holger's & my conversation. Renamed transformers to try to minimize confusion. Better handling of bad input data. Ordering of output data determined by labels, whether provided or extracted. Added tests to make sure a sloppy CSV file is correctly transformed.
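
A rough sketch of what "ordering of output data determined by labels" can look like: the label list, not the Map's iteration order, decides the column order, and a missing value becomes an empty column. This is an illustration only, not the r5027 code; the class and method names are hypothetical, and quoting or escaping of values is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the committed MapToCSV transformer.
class LabelOrderedCsvSketch {

    /** Emits one CSV line whose column order follows the labels, whatever the Map's own order is. */
    static String toCsvLine(List<String> labels, Map<String, String> row) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < labels.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            String value = row.get(labels.get(i));
            line.append(value == null ? "" : value); // missing label -> empty column
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("city", "Ghent");
        row.put("id", "1");
        System.out.println(toCsvLine(Arrays.asList("id", "name", "city"), row));
    }
}
```

Because the lookup is by label, a HashMap is safe here: its iteration order is never consulted.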

Holger, please open another JIRA if you want to do your MapToXML transformer in the xml module. I don't think we need a special CSVToXml transformer, since we can just chain them.


Holger Hoffstaette added a comment - 07/Feb/07 05:23 AM
Actually the current implementation is still 80% not what I suggested since it still uses !"§$% HashMaps as general data structure

Holger Hoffstaette added a comment - 07/Feb/07 12:50 PM
Reopening since it is still incomplete and has the wrong title.

Lajos Moczar added a comment - 07/Feb/07 02:49 PM
Don't forget there are use cases for sending the Maps!!! We have to have a way to tie labels to values, especially when the output is split up. You can add a CSVToList, but don't remove CSVToMaps.

Holger Hoffstaette added a comment - 07/Feb/07 05:15 PM
Don't worry, there is nothing wrong with having CSVToMaps; the abstractions by CSVInput/OutputParser just make it very difficult to extend. Also using a Writer for all output is too restrictive.

Holger Hoffstaette added a comment - 09/Apr/07 09:45 AM
assigning to Dirk

Holger Hoffstaette added a comment - 09/Apr/07 11:40 AM
After talking to several people at MuleCon I have come to the conclusion that rewriting our own flat file support is too much work (even when using OpenCSV) and likely not good enough for many people's needs. Despite being on holiday I have read through the documentation & APIs of PZFileReader (http://pzfilereader.sourceforge.net/) and it looks like a much better base. It includes a proper class-based model (instead of crude approximations like "list of arrays") and not only reads CSV but also fixed-width delimited files while still being sufficiently lightweight compared to ServingXML (see linked issue). Its other big bonus is its built-in quasi-streaming/incremental support for very large files - an obvious fit for any future streaming transformers. The only things that are missing are writing of the internal data representation to XML which I have working as a simple StAX writer (will send to Dirk) and reading XML into the internal format, though that might already exist in some way (need to re-check APIs).
The new module could e.g. be named module-flatfile or something similar.
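
For readers unfamiliar with fixed-width records, the sketch below shows the basic idea in plain Java: each column occupies a fixed number of characters and the padding is trimmed away. It deliberately does not use or imitate the PZFileReader API; the class name, method name, and column widths are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fixed-width parsing; unrelated to PZFileReader's actual API.
class FixedWidthSketch {

    /** Cuts the line into columns of the given widths and trims the padding from each. */
    static List<String> parse(String line, int[] widths) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int width : widths) {
            int end = Math.min(pos + width, line.length());
            // A line shorter than the layout yields empty trailing fields.
            String raw = pos < line.length() ? line.substring(pos, end) : "";
            fields.add(raw.trim());
            pos += width;
        }
        return fields;
    }

    public static void main(String[] args) {
        // Layout assumed for the example: name (6 chars), quantity (4), price (4).
        System.out.println(parse("bolt  0012 9.0", new int[] {6, 4, 4}));
    }
}
```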

Holger Hoffstaette added a comment - 09/Apr/07 11:43 AM
Dirk, for further development I suggest you remove module-csv from trunk and into the sandbox where you can go wild.

kenneth westelinck added a comment - 09/Apr/07 12:57 PM
I don't want to be a nag, but does PZFileReader support the following output file:
"C1","C2", "C3"
"val1",12,9.0
"val2", 5,3.5

I have to interface with an application that only reads the above format. So columns containing text should be placed between "-characters (or whatever character is used as text delimiter). Columns containing numeric values should not contain quotes. OpenCSV does not support this, so I've used ServingXML for this.
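
The quoting rule described above (text columns quoted, numeric columns bare) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not ServingXML or OpenCSV behaviour: a value counts as numeric if BigDecimal can parse it, embedded quotes are doubled, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;

// Hypothetical sketch of selective quoting.
class SelectiveQuoteSketch {

    /** Assumption: anything BigDecimal accepts is numeric (this includes forms like 1e5). */
    static boolean isNumeric(String value) {
        try {
            new BigDecimal(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Numeric values are written bare; everything else is quoted with inner quotes doubled. */
    static String formatField(String value) {
        if (isNumeric(value)) {
            return value.trim();
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Joins the formatted fields with commas. */
    static String formatRow(String... values) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(formatField(values[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRow("C1", "C2", "C3"));
        System.out.println(formatRow("val1", "12", "9.0"));
    }
}
```

A text column that happens to contain only digits would be written unquoted under this rule, so a real implementation would more likely decide per column from metadata than per value.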

Also, from what I can read, PZFileReader only supports parsing CSV files (hence its name, I guess). Shouldn't we be looking at a more general solution that reads/writes flat files at least in CSV and fixed width formats?

I realize ServingXML is less lightweight, but it supports almost anything, even reading/writing EDI files (see MULE-758).


Holger Hoffstaette added a comment - 09/Apr/07 01:07 PM
PZFileReader also supports flat files with fixed length or delimited records, just check the docs. This is exactly why I got interested in it in the first place.

Holger Hoffstaette added a comment - 09/Apr/07 01:09 PM
Oh and btw there will still be a ServingXML transformer (thanks for starting on that!) but we need to start somewhere and got a lot of feedback at MuleCon. Both are hammers but just like with every good toolbox there's one for nails and one for walls.

kenneth westelinck added a comment - 09/Apr/07 01:56 PM
True, ServingXML is a sledge hammer
I am still wondering, however, how we will write CSV/delimited files, since PZFileReader is only for reading/parsing files. The idea was to have CsvToXml and vice versa.

Holger Hoffstaette added a comment - 09/Apr/07 02:22 PM
Output is indeed a bit limited and will probably have to be written as extension to the PZ* classes and contributed back to PZFileReader; however it should be easy to write since all the necessary information is contained in the meta-data and output is essentially always the same. The samples contain a simple CSV writer and any other delimiter can be substituted easily.

Holger Hoffstaette added a comment - 10/Apr/07 07:30 AM
setting fix-version to 1.4.1

Andrew Perepelytsya added a comment - 24/Apr/07 12:42 PM
Descoping the 1.4.1, unset Fix Version for some issues.

Dirk Olmes added a comment - 04/Mar/08 08:14 PM
This is not going to be implemented as part of the regular Mule codebase. There's the flatfiles project on MuleForge that will host further development.