Comma Separated Values (CSV)

A Perl Module Based Upon A Formal CSV Definition

This Perl module defines some functions for reading comma separated value (CSV) files (e.g., Remedy-ARS generated files). It requires at least Perl 5.000 and is installed through the usual “perl Makefile.PL; make test; make install” mechanism.

Some years ago I needed to parse some comma separated value (CSV) files—specifically, address book files exported from my Newton MessagePad. I looked at all the available Perl modules that claimed to do CSV parsing and found them lacking. So, I worked with Mark Mielke, a much better Perl programmer than I, to create this module. Recently, Mark and I upgraded the module to allow the delimiter to be arbitrarily defined: it can be any character or set of characters.

Getting More Info.

Multiline Records

The one part of the CSV problem that the CSV.pm specification describes but does not solve is correctly reading CSV records that span more than one line (i.e., CSV fields containing a newline as part of the field's value). One common source of CSV files containing multiline records is MS Excel. Two contributors have provided me with work-arounds to the multiline record problem, and both of their contributions are provided here:

multiline-csv.pl --- submission by David Crooke that works by monitoring quotation marks and figuring out which fields contain multiline data
multiline-snippet.pl --- submission by Dave Stafford that works by counting the number of fields in a record; where a record is missing fields the code assumes a multiline field is present
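
To illustrate the quotation-mark idea in general terms, here is a minimal sketch; it is not David Crooke's script and does not use any CSV.pm function. The helper name read_logical_record is hypothetical. It assumes embedded quotes are escaped by doubling them (so a complete record always holds an even number of quote characters) and that Perl 5.8 or later is available for the in-memory filehandle used in the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CSV.pm): return the next complete logical
# record from a filehandle, joining physical lines while a quoted field is
# still open.
sub read_logical_record {
    my ($fh) = @_;
    my $record = scalar(<$fh>);
    return unless defined $record;

    # tr/"// counts double quotes without changing the string. An odd count
    # means a quoted field is still open, so the record continues on the
    # next physical line.
    while (($record =~ tr/"//) % 2 != 0) {
        my $next = scalar(<$fh>);
        last unless defined $next;    # unterminated quote at end of input
        $record .= $next;
    }
    return $record;
}

# Demonstration data: the first record spans two physical lines.
my $data = qq{1,"first line\nsecond line",x\n2,"say ""hi""",y\n};
open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";

my @records;
while (defined(my $rec = read_logical_record($in))) {
    push @records, $rec;
}
close($in);

print scalar(@records), " logical records read\n";
```

Each element of @records is then one complete record, embedded newline included, ready to be handed to whatever routine splits a record into fields.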

RPM Details

In August 2003, Riku Meskanen kindly took the CSV tarball and packaged it using the RedHat Package Manager into an RPM file. The RPM files are provided here for easy access; however, their canonical location is his website.

More recently, in November 2004, Jonas Anden made improvements to the .spec file for perl-CSV. The .spec file was tested on RedHat 7.3 (perl-5.6.1) and Fedora Core 2 (perl-5.8.3). His changes are as follows:

added 'export PERL_INSTALL_ROOT' before launching Install.pm to install the module. This is necessary on at least Fedora Core 2; otherwise Perl will attempt to install in my regular Perl directories and the filelist compilation at the end will fail.
removed the PREFIX parameter to the make call. On old Perl releases, this (in combination with the above fix) causes dual installation prefixes.
changed the release number to include the Perl version used for compilation. This helps separate different builds for different platforms (e.g., RedHat 7.3 uses Perl 5.6.1, Fedora Core 2 uses Perl 5.8.3, and the packages need to be built specifically for each box).
changed the 'requires' statement to include the Perl version used to compile. This helps avoid silent installation problems which will cause programs using the CSV module to fail ("Can't locate CSV.pm in @INC").

Updated Specfile and RPMs:

RPM for RedHat 7.3—perl-CSV-2.0-1.5.6.1.i386.rpm.bz2
RPM for Fedora Core 2—perl-CSV-2.0-1.5.8.3.i386.rpm.bz2 (Keith Owens has emailed me pointing out that this RPM installs the module incorrectly, under /usr/usr instead of /usr)
Updated RPM source file—perl-CSV-2.0-1.5.6.1.src.rpm.bz2
Updated Specfile—perl-CSV-2-1.spec.txt

Other CSV References

In late 2003, I had some correspondence with Dave Hawkey. He too has written a formal CSV spec. and created his own Perl CSV module. There are two primary differences between our specifications:

His spec. requires CR/LF pairs to terminate records; and
In some circumstances, his module treats multiple spaces as a single space.

See http://www.visi.com/~hawkeyd/csvutils.html for his module.

Dave also pointed me to a couple of other CSV specs:

http://www.hmce.gov.uk/business/importing/intrastat/csvfilespec.htm
http://www.uktradeinfo.com/intrastat/elect-csv.htm (WayBack link, since the original page is no longer posted)

CSV.pm & Function Prototypes

The CSV.pm module does not declare function prototypes for any of the functions it provides. This was a conscious implementation decision and Mark has strong feelings on the subject that he is happy to share with those who disagree. The net result of this decision is that you must use the functions as the sample code demonstrates (Matt Crawford has rightly pointed out that this fact is worth noting on this web page). For example, when reading input from a file, an explicit scalar() call must be used in your code, as in:

    my($firstLine) = scalar(<INPUT>);

See the sample application included with the module for further examples.
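
The reason scalar() matters is ordinary Perl context behaviour: without a prototype, a function's arguments are evaluated in list context, and a filehandle read in list context returns every remaining line. The sketch below shows this with a hypothetical stand-in function, count_args, which is not part of CSV.pm; the in-memory filehandle assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an unprototyped function; it only reports how
# many arguments it received.
sub count_args { return scalar(@_); }

my $data = "a,b,c\n1,2,3\n4,5,6\n";

# Without scalar(): the read happens in list context, so the whole file is
# slurped and passed as three separate arguments.
open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";
my $without_scalar = count_args(<$in>);
close($in);

# With scalar(): exactly one line is read and passed; the rest of the file
# is still available for later reads.
open($in, '<', \$data) or die "Cannot open in-memory file: $!";
my $with_scalar = count_args(scalar(<$in>));
my $next_line   = scalar(<$in>);
close($in);

print "without scalar(): $without_scalar argument(s)\n";
print "with scalar():    $with_scalar argument(s)\n";
```

The same applies to my($firstLine) = <INPUT>; written without scalar(): the parentheses make it a list assignment, so the entire file is read and all but the first line is discarded.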

Self Promotion

This CSV module is better than its competitors (in my opinion) because it implements a properly specified CSV format definition---at the time the module was written, no other CSV module was written to a spec. The other pretenders I’ve looked at don’t display any rigour in their implementations (Dave Hawkey's module being the lone exception). I would greatly appreciate constructive and intelligent discussion about the module, and especially about the CSV specification itself; attempts to generate discussion in comp.lang.perl on this module while I was writing the specification failed to elicit anything except criticism.

Authors

Christopher Rath wrote the CSV specification and everything in the module except the essential snippet of code that actually does the work :).

Mark Mielke (mark@mielke.cc) took the specification and wrote the essential piece of code that actually breaks each CSV record into its constituent fields. He also took the initial .pl version and .pm’ed it, as well as extending the functionality to allow the field delimiter to be arbitrarily chosen (i.e., something other than a comma). Mark’s website can be found at http://mark.mielke.cc/.