Making IBM mainframe data sets usable on Linux via COBOL

[Image: IBM z10 mainframe]

A couple of weeks back, a co-worker mentioned his frustrations with programming in COBOL. He had data that originated on a mainframe and needed to reformat it so that it could be used on a Linux system. After we briefly discussed the situation, and he learned that I had done some COBOL programming on a mainframe through IBM's Master the Mainframe contest, he invited me to take on this portion of his project. My COBOL was a bit rusty, but we got the data conversion done with the help of a number of websites and a good bit of trial and error. This seemed like another good opportunity to write up the process I went through so that it's kept in a single place.

I'll expand on the problem a little first. The data originated on an IBM mainframe, so the first issue is that it was in EBCDIC character encoding rather than the ASCII used by PCs. The organization that provided the data included a nice document describing its format. The data definition was provided as a segment of COBOL code, and the records contained a mix of regular numerals and text as well as numbers in packed-decimal format (COMP-3 variables), which does not display nicely on Linux. The combination of these issues made the task of getting the data usable on a PC Linux system non-trivial and, for me, fun.
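To see why, it helps to know what packed decimal looks like on disk. COMP-3 stores two digits per byte, with the sign tucked into the final half-byte (C for positive, D for negative, F for unsigned). As a hypothetical example, a PIC S9(7)V99 COMP-3 field holding +1234567.89 occupies just five bytes:

12 34 56 78 9C

None of those bytes map to printable characters, which is why these fields show up as line noise in a text editor, and why a blind EBCDIC-to-ASCII translation destroys them.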

Let's get the first major hiccup, and the last one we discovered, out of the way first. Values stored as COMP-3 can be used by COBOL programs on a Linux machine straight from the file as provided by the mainframe. Values that are not COMP-3, the regular text and numbers, must be converted from EBCDIC to ASCII. Yes, you read that right: if a file contains both COMP-3 and non-COMP-3 values, some of the information must be converted and some of it must not be. There are a number of possible ways to accomplish this little feat; I'll leave that decision to you to figure out.

For the data that does need to be converted, Linux comes with a command to handle it for you. The 'dd' command performs a number of disk and file operations, including conversions between character encodings and disk formats. To convert a file from EBCDIC to ASCII you can use the following command:

dd conv=ascii if=FILE.ebc of=FILE.asc
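If you want to spot-check the result (assuming hexdump is available, as it is on most distributions), dump the first few lines and confirm the text fields are now readable:

hexdump -C FILE.asc | head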

Simple enough. Now you can work with the FILE.asc file without character-encoding problems for your non-COMP-3 values. To handle the packed decimals, though, we'll still need to work with the original EBCDIC bytes, since (per the hiccup above) the COMP-3 fields must reach the program unconverted. We still need those COMP-3 values in a form that can be displayed on a Linux system. Since the data came with a COBOL data definition, it seemed natural to write up a quick bit of COBOL code to convert the packed decimals into "normal" numbers. You can do COBOL development on Linux with OpenCOBOL, a COBOL compiler that produces executables that run natively on Linux. Here's a walk-through of some sample code that finishes the conversion:

*-------------------------
IDENTIFICATION DIVISION.
*-------------------------
PROGRAM-ID. CONVERT-DATA.

*---------------------
ENVIRONMENT DIVISION.
*---------------------
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT DataFile ASSIGN TO "FILE.asc"
        ORGANIZATION IS SEQUENTIAL.

This sets up the program so that COBOL will open a file, FILE.asc. I've also included the IDENTIFICATION DIVISION that every COBOL program needs; the program name is arbitrary. One caveat from earlier: whatever file you name here must still contain the intact COMP-3 bytes, so in practice you'd point this at a file whose packed fields were never run through the EBCDIC-to-ASCII conversion. The ORGANIZATION clause is important, as it tells the program how to read the file. In our case the data file was "record sequential," meaning each data record is immediately followed by the next with no separator; record sequential is the default sequential organization. Some data files are instead "line sequential," meaning each data record is on its own line of the file.
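For comparison, had the data been line sequential, only the SELECT entry would change; a hypothetical variant:

SELECT DataFile ASSIGN TO "FILE.asc"
    ORGANIZATION IS LINE SEQUENTIAL.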

*---------------
DATA DIVISION.
*---------------
FILE SECTION.
FD DataFile.
01 ACCT-REC.
    88 EndOfFile VALUE HIGH-VALUES.
    05 ACCT-NO PIC 9(6) COMP-3.
    05 ACCT-LIMIT PIC S9(7)V99 COMP-3.
    05 NAME PIC X(20).

This segment of code is pretty much a copy-paste of the data definition from the organization that provided the data set. The one addition is the line immediately after "01 ACCT-REC.": it defines a condition flag we'll use later to detect when we've finished reading the data file (setting it to TRUE fills the record with HIGH-VALUES, a common end-of-file idiom). Once we've got the file's data definition entered, we move on to how the data will be defined for our output:

*------------------------
WORKING-STORAGE SECTION.
*------------------------
01 ACCT-REC-O.
    05 ACCT-NO-O PIC S9(6) SIGN LEADING SEPARATE.
    05 DELIMIT PIC X VALUE ','.
    05 ACCT-LIMIT-O PIC -9(7).99.
    05 DELIMIT PIC X VALUE ','.
    05 NAME-O PIC X(20).

At a high level, we're duplicating ACCT-REC for output without the COMPUTATIONAL-3 (COMP-3) packed decimals. For integer-style data types (PIC 9), you can swap out the COMP-3 specification for SIGN LEADING SEPARATE; note that the SIGN clause is only legal on a signed picture, so ACCT-NO-O gets an S. For decimal data types, I've replaced the V, which marks the implied decimal point, with an actual decimal point; and since an edited picture like this one can't contain an S, the sign becomes a leading '-' edit character, which prints as a space when the value is positive. For character data types (PIC X), no change is needed. The movement of data from the file record to this record happens transparently within COBOL, as we'll see in a bit. I've also added two DELIMIT variables, which I'll explain later.

First we need to start the actual program: open the data file and start reading from it.

*-------------------
PROCEDURE DIVISION.
*-------------------
    OPEN INPUT DataFile.
    PERFORM UNTIL EndOfFile
        READ DataFile
            AT END SET EndOfFile TO TRUE
            NOT AT END PERFORM READ-RECORD
        END-READ
    END-PERFORM.
    CLOSE DataFile.
    STOP RUN.

This code should be self-explanatory to programmers. We open the input data file, then loop until EndOfFile is true. On each iteration we read from the data file; if the read reaches the end of the file, we set our EndOfFile flag to true, otherwise we perform the READ-RECORD paragraph. I've added the CLOSE and STOP RUN after the loop so the program ends cleanly rather than falling through into the paragraph that follows.

READ-RECORD.
    MOVE ACCT-NO TO ACCT-NO-O
    MOVE ACCT-LIMIT TO ACCT-LIMIT-O
    MOVE NAME TO NAME-O
    DISPLAY ACCT-REC-O.

We can simply move each value from the file record to the corresponding output variable. COBOL knows how to handle the packed-decimal values and converts them from one representation to the other without much effort on our part. Finally, we display the 'parent' record; when this line executes, it prints every value of the record on a single line. Since we've defined the record with commas as delimiters, we get a comma-delimited line with the values in between.
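As a hypothetical example, a record holding account 123456, a limit of 1234567.89, and the name JANE DOE would display as:

+123456, 1234567.89,JANE DOE

The leading '+' comes from SIGN LEADING SEPARATE, the space before the limit is the '-' edit character printing blank for a positive value, and the name is padded with trailing spaces to its full 20 characters.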

When the program does a full run, it prints out comma-delimited values for the original data set. Piping that output to a file then gives you a format Linux tools can easily use. In the end, this provides a method for importing data from an IBM mainframe into spreadsheets or a MySQL database.
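For completeness, here's a hypothetical compile-and-run, assuming the code above is saved as convert.cbl (cobc is OpenCOBOL's compiler; -x builds an executable and -free accepts the free-format layout shown above):

cobc -x -free convert.cbl
./convert > FILE.csv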

This post was originally published on May 27, 2013 at a former blog of mine. I recovered it when a friend recently asked about this project.
