Taken from the Sunbelt Technical Support web board:
Subject: Partial read of a file
#19765
4/23/2012
I have a tab delimited text file that a client downloads from another vendor. The latest file reads about 450 records out of a few thousand then gets an OVER condition.
I can read the entire file in NOTEPAD and in EXCEL without a problem. The "TYPE" command line utility stops at the same place that PL/B does.
My main program reads using *EDION. I've written a tiny test program that reads as just a flat file. Both stop at the same place.
Using a homegrown PL/B utility that reads a byte at a time with ABSON, I can go right through the stopper. I see nothing unusual about the data at that point.
I read the file in NOTEPAD and saved to another file. PL/B still stops at the same point, although NOTEPAD shows the entire thing.
Any ideas?
Stephen Kent
re: No end-of-file character (0x1A) hiding at that point of the file?
I suspect the same thing or some other non-ascii character.
I just ran into a similar issue where one of our production text files did
not display correctly in my View program. I suspected a non-ASCII
character was the cause, and the PLB Utility Suite showed there
were extra Carriage Returns in the file. (Thanks Robert and Lee!
http://visualplb.com/plbutilitysuite/download/)
I then opened the file in our text editor (SlickEdit) and did a search on
the Carriage Return character.
In your case, I would be curious whether PLB Utility Suite would stop at the
same spot your program stops or if it will read all the way through.
Gerhard Weiss
Stephen sent us the file and we analyzed it. There was a 0x1A character in the 452nd record. We rewrote the test program to show how to use ABSON to read the entire file in one read, then REPLACE to get rid of the 0x1A and 0x0D characters, then process the file using the EXPLODE statement with 0x0A as the delimiter. It processes all records now.
Steve White, Sunbelt
Thanks for sharing the solution. I always like these clever ideas on
how to fix a file. I could also see how the SQUEEZE could be used to
get rid of the 0x1A and 0x0D characters.
Gerhard Weiss
Thanks to Steve White's quick response we know the problem: a hex 1A in a person's address in the FTP file we received. Who knows how it got there. Now that we can identify the account, the vendor on the other end can clean it up. It must have snuck in recently because previous downloads didn't have it but it continued to come to us when we asked for new downloads.
I didn't notice the 1A in my hex dump since I wasn't looking for it. In my old "green card" and other hex charts the description is blank or "SUB". The EOF meaning must be a Microsoft convention.
Sidetrack: Did you ever notice how hex tables have disappeared? I got on a ladder to look at my old DOS books, which have a thick layer of dust on them. Only a few had any hex/ASCII tables and none had the 1A, other than as SUB. Fortunately, an internet search uncovered a number of 1A references.
Stephen Kent
This doesn't explain the 0x1A as end-of-file either, but I use this
ASCII table.
There is also the "Translation Table" found in the PL/B Runtime Reference.
Stuart Elliott
1. The 0x1A character as an EOF is a holdover from the DOS operating systems.
2. Stephen was seeing 'SUB' associated with the 0x1A. Here is a link where you can start to get more information about the 0x1A as an EOF.
www.wikipedia.org/wiki/Substitute_character
Ed Boedecker, Sunbelt
This technique is not just for fixing a file. Think of the simple task that we all do in many programs of reading a file sequentially from start to finish. PL/B has a default buffer for sequential files of 256 bytes. If you have a 100k file, then it takes 400 physical reads to read the entire file. With this technique you can read in the entire file in ONE physical read and then all processing is done internally and the program never has to hit the disk again for that file. Just think of the speed improvements that could be made to many programs. Here is a sample template for such a process:
Steve White, Sunbelt
.--------------------------------------------------------
.
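File       FILE
FileName   DIM       50
Size       FORM      10
FileData   DIM       ^
. Record buffer used by EXPLODE below. The size here is an assumption;
. set it to the longest record you expect.
FileRecord DIM       1000
Zero       FORM      "0"
CR_Blank   INIT      0x0D," ",0x1A," "
LF         INIT      0x0A
EOF        FORM      1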
.--------------------------------------------------------
. Open file, get file size and create data buffer.
.
           OPEN      File,FileName
           POSITEOF  File
           FPOSIT    File,Size
           SMAKE     FileData,Size
.--------------------------------------------------------
. Read entire file into working buffer. Start reading at ZERO
.
           READ      File,ZERO;*ABSON,FileData,*ABSOFF
           CLOSE     File
.--------------------------------------------------------
. Replace Carriage Return and EOF marks with blanks
.
           REPLACE   CR_Blank,FileData
           LOOP
.--------------------------------------------------------
. Move all data up to next Line Feed to record buffer. If End of
. String was encountered while moving data, the ZERO flag is set
.
. If set, then set a flag that EOF was found since we still have a
. good record to handle.
.
           EXPLODE   FileData,LF,FileRecord
           IF        ZERO
           MOVE      "1",EOF
           ENDIF
.--------------------------------------------------------
. Process the data here. Do whatever needs to be done.
.
.....
.--------------------------------------------------------
. Continue with loop until the EOF flag is set.
.
           REPEAT    UNTIL ( EOF = "1" )
Most of our file access is done using ISAM files, but there is one spot
where I can use this. It is a fixed-length file, so I will not need to do a
REPLACE CR_Blank,FileData
I will probably also replace the SMAKE with DMAKE/DFREE logic.
SMAKE allows for 32 MB DIM variables, where DMAKE allows for 2 GB.
Of course, I am sure if I read a 2 GB file into memory it will kill the
performance of my system.
The other thing is that I am using SUNSORT to create this file, so there is a
good chance it is already disk cached.
Gerhard Weiss
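The SMAKE-to-DMAKE swap described above might look something like the sketch below. It has not been run. It assumes DMAKE takes the same variable,size operands as the SMAKE call in Steve's template and that DFREE takes only the variable; neither form is shown in this thread, so check the PL/B reference before relying on them. All other statements are copied from the template.

```plb
File       FILE
FileName   DIM       50
Size       FORM      10
FileData   DIM       ^
Zero       FORM      "0"
.
           OPEN      File,FileName
           POSITEOF  File
           FPOSIT    File,Size
. Assumption: DMAKE uses the same operand order as SMAKE.
           DMAKE     FileData,Size
           READ      File,ZERO;*ABSON,FileData,*ABSOFF
           CLOSE     File
.
. ... EXPLODE loop goes here, as in the template ...
.
. Assumption: DFREE releases the buffer that DMAKE allocated.
           DFREE     FileData
```

Freeing the buffer once the loop is done matters more here than with SMAKE, since the buffer can be far larger.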
If you're going to use EXPLODE to read through the "file" then you'll need a single delimiter and, in Steve's example, LF is it. So he got rid of the CR's because they became superfluous.
If you're going to handle the Form Pointer yourself and not use EXPLODE then you can do whatever you want.
But I think EXPLODE in the LOOP/REPEAT is one of the best features of this technique. The "bestest" feature, of course, is doing it all in RAM.
I don't remember using the POSITEOF/FPOSIT technique; I use FINDFILE to get the size of the file to DMAKE the variable. I suppose one is more efficient than the other.
--Stuart Elliott
I was thinking of a fixed-length file where the Carriage Return would not
be read into the variable because the variable is smaller.
In the example below, lowercase 'c' stands for CR and lowercase 'l' for LF.
The record is 10 bytes long, so the variable being read into is a DIM 10.
The EXPLODE will transfer 11 bytes, including the CR, but only the first 10 are
placed in the variable.
EXP1REC    DIM       10
EXP1DATA   INIT      "1234567890cl":
                     "ABCDEFGHIJcl":
                     "1234567890cl":
                     "ABCDEFGHIJcl"
.
EOFSW      FORM      1
.
           LOOP
           EXPLODE   EXP1DATA,"l",EXP1REC
           IF        ZERO
           MOVE      "1",EOFSW
           ENDIF
           DISPLAY   EXP1REC,"<"
           REPEAT    UNTIL (EOFSW=1)
Steve was working with a variable-length record that had tab-separated
fields. This would need two EXPLODEs: one for the record
and one for the fields in the record. Even there, instead of replacing the
CR with a space, you could use the CR as a delimiter in the second
EXPLODE. Doing this would not add a space to the last field.
Here is an example of a variable-length record with comma-separated
fields. Notice the second EXPLODE has a delimiter of comma and 'c'.
EXP2REC    DIM       30
EXP2DATA   INIT      "123,ABC,456,DEFcl":
                     "ABC,1234,DEF,5678cl":
                     "123,ABC,456,DEFGHcl":
                     "AB,123,D,45678cl"
.
EXP2VL     LIST
EXP2FLD1   DIM       4
EXP2FLD2   DIM       4
EXP2FLD3   DIM       4
EXP2FLD4   DIM       4
           LISTEND
.
           MOVE      "0",EOFSW
           LOOP
           EXPLODE   EXP2DATA,"l",EXP2REC
           IF        ZERO
           MOVE      "1",EOFSW
           ENDIF
           EXPLODE   EXP2REC,",c",EXP2VL
           DISPLAY   "FLD1=",EXP2FLD1," FLD2=",EXP2FLD2," FLD3=",EXP2FLD3," FLD4=",EXP2FLD4
           REPEAT    UNTIL (EOFSW=1)
Gerhard Weiss
Thanks for the example, Steve. I modified an old program that reads a 5 MB weekly client-supplied file one byte at a time, looking for the ~ they use for record termination (no CR/LF in the file). It was a batch process that ran in the middle of the night, so taking about an hour was not a big deal.
With your 'Big Gulp' technique it now takes about 6 seconds to process the file.
My server's HD thanks you.
-Mike Maynard
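A case like Mike's, where records end with ~ and there is no CR/LF, could be handled by the same template with only the EXPLODE delimiter changed. The sketch below is untested and is not Mike's actual program; the Tilde and FileRecord names and the 1000-byte record size are assumptions made for illustration. Every statement form is taken from Steve's template above.

```plb
File       FILE
FileName   DIM       50
Size       FORM      10
FileData   DIM       ^
. Assumed record buffer size; set it to the longest expected record.
FileRecord DIM       1000
Zero       FORM      "0"
. Record terminator used by the client file (hypothetical name).
Tilde      INIT      "~"
EOF        FORM      1
.
. Read the whole file into memory in one physical read.
           OPEN      File,FileName
           POSITEOF  File
           FPOSIT    File,Size
           SMAKE     FileData,Size
           READ      File,ZERO;*ABSON,FileData,*ABSOFF
           CLOSE     File
.
. Split on the tilde instead of Line Feed.
           LOOP
           EXPLODE   FileData,Tilde,FileRecord
           IF        ZERO
           MOVE      "1",EOF
           ENDIF
. Process FileRecord here.
           REPEAT    UNTIL ( EOF = "1" )
```

No REPLACE step is needed in this variant because there are no CR characters to strip, although a stray 0x1A inside the data would still come through as an ordinary byte.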
Also think how much easier it is to test changes made to the program.
You did bring up an interesting condition where the EOR marker is not
the standard one supported by PL/B. I checked our system by
searching for *ABSON and did not find any programs doing that.
Darn! I was hoping to fix it.
I bet your network hubs thank Steve too!
Gerhard Weiss