Data Layout

Arranging Data in a File

the program determines the order data must occur in a data file.
if users enter data in the wrong order, you can do little or nothing to adjust for it.
there are three common ways to order data in a file: sequentially, in groups (often called blocks or records), and with labels
sequential data basically says that the first item of data is first in the file, followed by the second, then the third, etc. of course, you as the programmer have specified the order ahead of time. the user is expected to adhere to your prescribed order exactly.
a block data format says that all information about a specific real-world item is grouped together in a block within the file. this layout maps beautifully to classes and objects. simply place all the real-world item's data in a class and then read a single block of data into each object in the program.
labeled data simply means that each piece of data is not merely present in the file, but is preceded by its designation/name — it is labeled. this is a common syntax in Windows ini files and Unix rc files. (it is also a growing trend — note the popularity of such verbose labeled data environments as XML.)

Sequential Data Layout

in a sequential data layout you have one value after another. the values can be of different kinds and they can be of different sizes, but there is no apparent order since there is just raw data. for instance, here is a data file containing x and y coordinates for an object in the 2D plane:
```
    42 69
    13 43
    -3 5
    -23 62
```
this file could be read with a simple loop:
```
    file >> ws;
    while (!file.eof())
    {
        file >> x >> y;
        // do something with x and y just read
        file >> ws;
    }
```
if we know ahead of time that the file holds x and y pairs, we can immediately tell what is going on. just looking at the file contents, however, it is less obvious, especially if the data has been slightly altered:
```
    42 69 13
    43 -3 5
    -23 62
```
note how the data is exactly the same and can be read with the exact same code, but its meaning is no longer obvious even when we know what is supposed to be in the file.

the reason this still works is that the extraction operator (>>) only needs whitespace between the values to be read, not space characters specifically. a newline is just as good as a tab which is just as good as a space. there can even be multiple spaces. looking at both the previous files in buffer form we see the similarity:

                  +-+-+-+-+-+--+-+-+-+-+-+--+-+-+-+-+--+-+-+-+-+-+-+
    first file:   |4|2| |6|9|\n|1|3| |4|3|\n|-|3| |5|\n|-|2|3| |6|2|
                  +-+-+-+-+-+--+-+-+-+-+-+--+-+-+-+-+--+-+-+-+-+-+-+

                  +-+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+
   second file:   |4|2| |6|9| |1|3|\n|4|3| |-|3| |5|\n|-|2|3| |6|2|
                  +-+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+

again, same data, different spacing.

other formats are also possible and all will work as desired:

    42 69 13 43 -3 5 -23 62

or even:

    42 69                13
                43
    -3
     5
                    -23               62
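
as a minimal sketch of why every one of these layouts reads the same, the loop from above can be wrapped in a function and fed each layout as a string. the names Point, read_points, and read_points_from are invented for this illustration, and a stringstream stands in for the file:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Point   // hypothetical holder for one x/y pair
{
    long x, y;
};

// read x y pairs until the stream runs out of data
std::vector<Point> read_points(std::istream & file)
{
    std::vector<Point> points;
    Point p;
    file >> std::ws;                    // prime: skip any leading whitespace
    while (!file.eof())
    {
        if (!(file >> p.x >> p.y))      // stop on malformed or incomplete data
        {
            break;
        }
        points.push_back(p);
        file >> std::ws;                // skip whitespace so eof shows up now
    }
    return points;
}

// convenience wrapper so a layout can be tried as a plain string
std::vector<Point> read_points_from(const std::string & text)
{
    std::istringstream file(text);
    return read_points(file);
}
```

calling read_points_from on any of the layouts shown above should give back the same four points in the same order, since only the whitespace between the values differs.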

Block Data Layout

grouping all data about a single item together makes for more easily identified files. for instance, here is a file of student data that is not grouped by student (all the names come first, then all the class years, and so on):

    Joe Sally Hanesh Vong Ella
    Soph Fresh Fresh Soph Jr
    4 11 9 10 10
    Valley Lane
    Mountain View
    Mountain View
    Valley Lane
    River Trail

here's the same data in a block format:

    Joe
    Soph
    4
    Valley Lane
    Sally
    Fresh
    11
    Mountain View
    Hanesh
    Fresh
    9
    Mountain View
    Vong
    Soph
    10
    Valley Lane
    Ella
    Jr
    10
    River Trail

now it is easier to identify what data goes with which student. even clearer would be:

    Joe
    Soph
    4
    Valley Lane

    Sally
    Fresh
    11
    Mountain View

    Hanesh
    Fresh
    9
    Mountain View

    Vong
    Soph
    10
    Valley Lane

    Ella
    Jr
    10
    River Trail

but, for some reason, blank space like that is rare in data files. I would encourage you to do such things, though!

note how, now that all of a single student's information is grouped together, you could use a class to easily read it in:

    file.peek();
    while (!file.eof())
    {
        object.read(file);
        // do something with object just read
        file.peek();
    }
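
here is one way such a read function might look, as a sketch only. the Student class, its member names, and read_students are assumptions made for this example (the notes don't define them), and the priming uses file >> ws instead of peek so that blank lines between blocks are skipped:

```cpp
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// hypothetical class holding one block: name, class year, gpa, branch
class Student
{
    std::string name, year, branch;
    double gpa = 0.0;

public:
    // read one four-line block; returns false if the block is incomplete
    bool read(std::istream & file)
    {
        file >> std::ws;                              // skip blank lines before the block
        if (!std::getline(file, name)) return false;
        if (!std::getline(file, year)) return false;
        if (!(file >> gpa)) return false;
        // throw away the rest of the gpa line so getline starts on the branch
        file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        return static_cast<bool>(std::getline(file, branch));
    }

    const std::string & get_name() const { return name; }
    const std::string & get_year() const { return year; }
    const std::string & get_branch() const { return branch; }
    double get_gpa() const { return gpa; }
};

// read blocks until the data runs out or a block is incomplete
std::vector<Student> read_students(std::istream & file)
{
    std::vector<Student> all;
    Student object;
    file >> std::ws;
    while (!file.eof() && object.read(file))
    {
        all.push_back(object);
        file >> std::ws;
    }
    return all;
}
```

because each block is read by the object itself, the loop in read_students never needs to know how many lines a student takes up.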

Labeled Data Layout

placing a label next to each item of data improves our raw file readability even more:

    name = Joe
    class = Soph
    gpa = 4
    branch = Valley Lane
    name = Sally
    class = Fresh
    gpa = 11
    branch = Mountain View

now it is obvious what each piece of data represents within the block. (not to say that sequential data can't be labeled, it is just more common to find block data labeled.)
but doesn't this complicate the reading process somewhat? well, yes. but it is well worth it!
a first approach is to do something simple like this:
```
    file >> setw(MAX_LABEL) >> label >> sep
         >> setw(MAX_NAME) >> name;
```
but this doesn't account for several things. first off, the user might re-order information within the block:
```
    name = Joe
    gpa = 4
    class = Soph
    branch = Valley Lane
    class = Fresh
    gpa = 11
    name = Sally
    branch = Mountain View
```
note that the same data is present, the blocks have just been internally shuffled. the assumption here is that your program should be able to place the correct data in the correct variables by the context of the labels. that is a major assumption, are we ready for it? SURE!

after reading the label, we simply look for that label string in a list of known label strings:

    const char known_labels[MAX_KNOWN_LABELS][MAX_LABEL] = { "name",
                                                             "gpa",
                                                             "class",
                                                             "branch" };
    file >> setw(MAX_LABEL) >> label >> sep;
    L = 0;
    while (L < MAX_KNOWN_LABELS && strcmp(label,known_labels[L]) != 0)
    {
        L++;
    }
    // L is either the index of the correct label or MAX_KNOWN_LABELS

it's just a simple linear search through an array of strings! now we can switch on the label index to an appropriate action:

    switch (L)
    {
        case 0: file >> setw(MAX_NAME) >> name; break;
        case 1: file >> gpa; break;
        case 2: file >> setw(MAX_CLASS) >> year; break;
        case 3: file >> ws;   // skip the space after the separator
                file.getline(branch, MAX_BRANCH); break;
        default: break;  // nothing -- ignore unknown labels
    }

easy as pie! (although it is rather a pain to make a pie, this is an old expression that means it really is easy...*shrug*)

speaking of hidden pain, don't forget to watch out for the unknown label condition — when the loop above ends at MAX_KNOWN_LABELS!

but, if we are labeling data so that it may be easily read, it might also be edited/changed outside our program. users don't type as carefully as our program reads. the user might end up with something like:
```
    name=Joe
    class=Soph
    gpa=4
    branch= Valley Lane
    name =Sally
     class = Fresh
    GPA =  11
    branch  =Mountain View
```
which our code can no longer read correctly. the capitalization problems can be side-stepped by using a case-insensitive string comparison. perhaps you could do something about that...
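
one possible case-insensitive comparison, sketched with the C-style strings used in these notes. the name same_ignoring_case is made up for this example; it could stand in for the strcmp test in the label search:

```cpp
#include <cctype>

// compare two C-strings ignoring case; returns true when they match
bool same_ignoring_case(const char * a, const char * b)
{
    while (*a != '\0' && *b != '\0')
    {
        // cast to unsigned char: tolower is undefined for negative char values
        if (std::tolower(static_cast<unsigned char>(*a)) !=
            std::tolower(static_cast<unsigned char>(*b)))
        {
            return false;
        }
        ++a;
        ++b;
    }
    return *a == *b;    // true only if both strings ended together
}
```

with this in hand, "GPA" and "gpa" land on the same entry of the known label list.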
the spacing problems are much harder to deal with. since we don't know whether they'll have or not have space preceding the separator character (an '=' above), following the separator, or even before the label itself, things are a bit more messy.
however, since all labeled data has but one item per file line, we can read in the whole line and then do string processing (which we are familiar with from previous studies) to break up the pieces within our program:
```
    file.getline(labeled_line, MAX_LINE);
    sep_at = search(labeled_line, '=');   // index of the separator
    lcap = min(sep_at, MAX_LABEL-1);      // label characters that will fit
    strncpy(label, labeled_line, lcap);
    label[lcap] = '\0';
    // copy rest into value string:
    value_index = 0;
    sep_at++;                             // step past the separator itself
    while (value_index != MAX_VALUE_LEN-1 &&
           labeled_line[sep_at] != '\0')
    {
        value[value_index++] = labeled_line[sep_at++];
    }
    value[value_index] = '\0';            // always terminate the value
```

did I mention that using pointers would have made this a little more interesting?

    char * sep_at;  // make sep_at a pointer rather than an index
    file.getline(labeled_line, MAX_LINE);
    sep_at = strchr(labeled_line, '=');  // use cstring library function to search
    if (sep_at != nullptr)               // strchr gives a null pointer when there is no '='
    {
        *sep_at = '\0';                  // split string logically in two
        strncpy(label, labeled_line, MAX_LABEL-1);
        label[MAX_LABEL-1] = '\0';
        strncpy(value, sep_at+1, MAX_VALUE_LEN-1);   // use pointer to second half
        value[MAX_VALUE_LEN-1] = '\0';
    }

then just strip off any leading or trailing spaces from each string (we leave internal ones just in case the value — or even the label itself — has spaces inside it):

    // count number of leading spaces
    lead_space = 0;
    while (isspace(str[lead_space]))
    {
        lead_space++;
    }
    // shift data over
    moving = 0;
    while (str[moving+lead_space] != '\0')
    {
        str[moving] = str[moving+lead_space];
        moving++;
    }
    str[moving] = '\0';   // needed: the loop above stops before copying the terminator
    // remove trailing spaces
    while (moving != 0 && isspace(str[moving-1]))
    {
        moving--;
        str[moving] = '\0';
    }

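the shifting loop could be done with a call to memmove, of course, if you are willing to suffer the pointers (strcpy is not safe here because its behavior is undefined when the source and destination overlap); I was just being obsessive...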

(of course repeat this process for both the label and value strings — perhaps calling a function...?)
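
a sketch of what such a function might look like. the name trim is an assumption, and the shifting loop has been swapped for memmove, which is defined for overlapping source and destination:

```cpp
#include <cctype>
#include <cstring>

// strip leading and trailing whitespace from a C-string in place
void trim(char * str)
{
    // count number of leading spaces
    std::size_t lead_space = 0;
    while (std::isspace(static_cast<unsigned char>(str[lead_space])))
    {
        lead_space++;
    }
    // shift data over; the +1 moves the '\0' along with the text
    std::size_t length = std::strlen(str + lead_space);
    std::memmove(str, str + lead_space, length + 1);
    // remove trailing spaces
    while (length != 0 && std::isspace(static_cast<unsigned char>(str[length - 1])))
    {
        length--;
        str[length] = '\0';
    }
}
```

calling trim(label) and then trim(value) after the split leaves internal spaces (as in Valley Lane) alone.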

with all this in place, your class' reading function probably looks something like this:

    void Class::read(istream & strm)
    {
        // known label array
        strm.peek();
        while (!strm.eof() && !end_of_block)
        {
            // read line
            // split line at separator
            // search for label -- case-insensitively
            // switch to translate and store value based on label
            strm.peek();
        }
        return;
    }

but what about detecting the end of a re-arranged data block?
use an array of bool values, one for each valid label you have. initialize this array to all false and then, as you process each label, set the corresponding value in the array to true.
furthermore, if the label you just read has already been seen (it has a true in its array spot), you know you've reached the end of your logical block and the start of the next one! simply seek back to the beginning of that line (you did a tellg before you input the line, didn't you?) and 'return' all your collected data to the caller.
in addition, after each labeled line you process, you can accumulate whether you've seen any labels at all (to help alleviate false eof reports) or if you've seen all the labels you needed to see — which would indicate you can stop processing and 'return' the data now.
the only problem left is how to translate that value string into a numeric value when the variable is of type double or long or such. (perhaps a lab you once did..?)
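
putting the pieces together, here is a rough sketch of a labeled block reader. it uses std::string rather than the character arrays above to stay short, and the names StudentRecord, tidy, to_double, and read_block are all assumptions made for this example. strtod is just one way to do the numeric translation:

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>

struct StudentRecord   // hypothetical record for this sketch
{
    std::string name, year, branch;
    double gpa = 0.0;
};

const std::size_t LABEL_COUNT = 4;
const char * const KNOWN_LABELS[LABEL_COUNT] = { "name", "gpa", "class", "branch" };

// trim whitespace from both ends; lower-case the result when asked
std::string tidy(const std::string & text, bool lower)
{
    std::size_t first = 0, last = text.size();
    while (first < last && std::isspace(static_cast<unsigned char>(text[first]))) first++;
    while (last > first && std::isspace(static_cast<unsigned char>(text[last - 1]))) last--;
    std::string result = text.substr(first, last - first);
    if (lower)
    {
        for (char & c : result)
        {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
    }
    return result;
}

// translate text to a double; false when the text is not entirely a number
bool to_double(const std::string & text, double & result)
{
    const char * start = text.c_str();
    char * stop = nullptr;
    double parsed = std::strtod(start, &stop);
    if (stop == start || *stop != '\0') return false;
    result = parsed;
    return true;
}

// read one labeled block; returns true if at least one known label was seen
bool read_block(std::istream & strm, StudentRecord & rec)
{
    bool seen[LABEL_COUNT] = { false, false, false, false };
    std::size_t seen_count = 0;
    std::string line;
    while (seen_count < LABEL_COUNT)                 // stop once every label is in
    {
        std::istream::pos_type line_start = strm.tellg();   // where this line begins
        if (!std::getline(strm, line)) break;        // out of data
        std::size_t sep_at = line.find('=');
        if (sep_at == std::string::npos) continue;   // no separator: skip the line
        std::string label = tidy(line.substr(0, sep_at), true);
        std::string value = tidy(line.substr(sep_at + 1), false);
        std::size_t L = 0;
        while (L < LABEL_COUNT && label != KNOWN_LABELS[L]) L++;
        if (L == LABEL_COUNT) continue;              // unknown label: ignore it
        if (seen[L])                                 // repeat: the next block has begun
        {
            strm.seekg(line_start);                  // un-read the line for the next call
            break;
        }
        seen[L] = true;
        seen_count++;
        switch (L)
        {
            case 0: rec.name = value; break;
            case 1: to_double(value, rec.gpa); break;   // gpa is left alone if not numeric
            case 2: rec.year = value; break;
            case 3: rec.branch = value; break;
        }
    }
    return seen_count != 0;
}
```

the return value doubles as the 'seen any labels at all' check, so a caller can loop on read_block until it reports false.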

Mixed Data Layouts

it is perfectly acceptable to have these different layout styles mixed together: a sequence of blocks, a labeled sequence, a labeled block, etc.
you can even nest them. having one item of a block be a sequence is the most common. this can work in a couple of different ways. first is to simply arrange your data carefully:
```
    name
    other data here
    numeric sequence
```
now when the sequence of numbers hits either the end of the file or the next name, you'll know it is over:
```
    peek
    while (!eof)
    {
        read name
        read other data
        read first of sequence
        while (good)   // file will either eof or fail at end of sequence
        {
            process sequence item
            read next in sequence
        }
        clear
        peek
    }
```
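
the pseudocode above might translate to something like the following sketch. Entry and read_entries are invented names, the 'other data' step is left as a comment, and the sequence is read as long values; it relies on a failed extraction leaving the next name unread, which is the usual behavior when the name doesn't look like the start of a number:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Entry   // hypothetical: one name followed by a sequence of scores
{
    std::string name;
    std::vector<long> scores;
};

std::vector<Entry> read_entries(std::istream & file)
{
    std::vector<Entry> entries;
    file >> std::ws;                       // prime: skip leading whitespace
    while (!file.eof())
    {
        Entry e;
        if (!std::getline(file, e.name))   // read name
        {
            break;
        }
        // (other per-entry data would be read here)
        long score;
        while (file >> score)              // fails at the next name or at eof
        {
            e.scores.push_back(score);
        }
        entries.push_back(e);
        if (!file.eof())
        {
            file.clear();                  // the failure was just the next name
        }
        file >> std::ws;
    }
    return entries;
}
```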
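we might even want to convert this to the 'hacked' form used in the sequential section, priming with file >> ws rather than peek so that stray whitespace at the end of the file doesn't hide the eof.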
the second way would be to have the data items labeled, then you don't even have to show care with placing a non-numeric item of data first in the data block. (unless you have numeric looking labels...*odd*)
another approach is to use a continuation mark to say that a current line of data is not over; we're just gonna continue on the next physical line of the file. you'll have to decide if the leading whitespace on that next line (aren't they going to indent past the separator mark?!) is significant. but at least it'll be just a special circumstance scenario.
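
as an illustration of the continuation idea, here is a small sketch that glues physical lines into one logical line. the backslash mark, the CONTINUE_MARK constant, and the read_logical_line name are all assumptions, and this version decides that leading whitespace on a continued line is not significant:

```cpp
#include <istream>
#include <sstream>
#include <string>

const char CONTINUE_MARK = '\\';   // assumed mark; pick whatever suits your data

// read one logical line, joining physical lines that end in the mark;
// returns false only when there was nothing left to read
bool read_logical_line(std::istream & file, std::string & logical)
{
    logical.clear();
    std::string physical;
    bool continuing = false;
    bool got_any = false;
    while (std::getline(file, physical))
    {
        got_any = true;
        if (continuing)
        {
            // drop the indentation of a continued line
            physical.erase(0, physical.find_first_not_of(" \t"));
        }
        continuing = !physical.empty() && physical.back() == CONTINUE_MARK;
        if (continuing)
        {
            physical.pop_back();           // remove the mark itself
        }
        logical += physical;
        if (!continuing)
        {
            break;                         // the logical line is complete
        }
    }
    return got_any;
}
```

the rest of the reading code can then treat the logical line exactly like any other labeled line.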
also be careful when you have a sequence of non-numeric data inside a block! (this is fine when using a labeled format — just check for a label as a flag value, but it isn't so easy when a new block begins with numeric data — numbers fit validly into string objects)
to help in this situation, you can use a termination mark. with such a mark, you'll never be confused about the end of a sequence (or labeled item) again! but it requires that the user always remember the mark at the end of their data values.
you can always use a pre-count or a trailing flag value if those seem easier or more appropriate for the data (rather than depending on the fail/eof thing).
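
both alternatives are short to write. in this sketch the function names read_counted and read_until_flag are made up, and the values are assumed to be whole numbers:

```cpp
#include <istream>
#include <sstream>
#include <vector>

// pre-count: the data says how many values follow, e.g. "3 90 85 77"
std::vector<long> read_counted(std::istream & file)
{
    std::vector<long> values;
    long count = 0;
    if (file >> count)
    {
        long value;
        while (count > 0 && file >> value)
        {
            values.push_back(value);
            count--;
        }
    }
    return values;
}

// trailing flag: the values end at an agreed sentinel, e.g. "90 85 77 -1"
std::vector<long> read_until_flag(std::istream & file, long flag)
{
    std::vector<long> values;
    long value;
    while (file >> value && value != flag)   // the flag is consumed but not stored
    {
        values.push_back(value);
    }
    return values;
}
```

either way the stream is left sitting just past the sequence, so whatever follows can be numeric without any confusion. the flag value must of course be one that can never occur as real data.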

Data File Comments

data files are often commented with what is known as meta-data. this is typically information that isn't about the file contents so much as it is about those who have processed the file: who edited it, when, why, what did s/he change, etc. such information can be important when the department's sales figures are all wrong (then we'll know who to blame). *grin*
your program doesn't need to read and understand these comments — just like the compiler doesn't read/understand your comments in the program. however, you will have to process past them to get to the data (just like when we did nice console formatting: 4/12/82, (-4, 6.8), 3d6, etc.).
probably one of the most common symbols used to denote comments is the pound sign ('#'). there are many others: '*' in FORTRAN, ';' in ini files, '!' in Access, etc. and many programs even use a sequence of symbols to represent comments: "rem" in batch files or BASIC, "//" in C++ or Java, etc. we'll just pick a single comment character to represent a user's comments (this will simplify things for now).
there are also multiple styles of comments: whole line, end-of-line, and block. since whole line comments are really a subset of end-of-line comments, I'll let you decide which to support. we won't need to deal with block comments as these are tedious and get icky — at least when nested comments are allowed.
the basic idea of the end-of-line comment is that when your program sees the comment character, everything from there until the end of the line is to be ignored as a comment. (whole line comments are a subset of this because they say that nothing precedes the comment character — except possibly whitespace.)
the only thing left is to decide if comments must all occur at the top of your data file or if the user can place comments anywhere they want within the file. it is common for companies to keep all/most of their meta-data comments at the top of a file. however, it is not hard to strip out whole line or eol comments from a line of data as you read it.
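
stripping an end-of-line comment from a line that has already been read can be as small as this sketch. the strip_comment name and the default '#' are assumptions, and it presumes the comment character never appears inside a real data value:

```cpp
#include <string>

// remove an end-of-line comment, and the whitespace before it, from one line
std::string strip_comment(const std::string & line, char comment_char = '#')
{
    // find returns npos when there is no comment, which keeps the whole line
    std::string data = line.substr(0, line.find(comment_char));
    std::size_t last = data.find_last_not_of(" \t\r");
    data.erase(last == std::string::npos ? 0 : last + 1);
    return data;
}
```

a whole line comment comes back as an empty string, so the caller can simply skip empty results.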
one thing to remember is that if you are expecting a group of comments at the top of a file, you may want to remember the file position after you reach the data so that you can avoid re-processing those comments if the user decides to re-process the file.
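
one way to remember that position, sketched under the assumption that the leading comments are whole line comments starting with '#'. the names is_comment_or_blank and skip_header_comments are invented for this example:

```cpp
#include <istream>
#include <sstream>
#include <string>

// true when a line holds nothing but whitespace or a whole line comment
bool is_comment_or_blank(const std::string & line, char comment_char)
{
    std::size_t first = line.find_first_not_of(" \t\r");
    return first == std::string::npos || line[first] == comment_char;
}

// skip the comments at the top of the file; returns the position of the
// first data line and leaves the stream sitting there
std::istream::pos_type skip_header_comments(std::istream & file, char comment_char = '#')
{
    std::istream::pos_type data_start = file.tellg();
    std::string line;
    while (std::getline(file, line) && is_comment_or_blank(line, comment_char))
    {
        if (file.eof())
        {
            file.clear();               // last line had no newline; let tellg work
        }
        data_start = file.tellg();      // data can start no earlier than here
    }
    file.clear();                       // in case the loop ran into the end of the file
    file.seekg(data_start);             // go back to the start of the data
    return data_start;
}
```

to re-process the file later, clear the stream and seekg to the saved position rather than to the beginning.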